
Artificial Intelligence in Ophthalmology
Editor
Andrzej Grzybowski
Department of Ophthalmology
University of Warmia and Mazury
Olsztyn, Poland

Institute for Research in Ophthalmology


Foundation for Ophthalmology Development
Poznan, Poland

ISBN 978-3-030-78600-7    ISBN 978-3-030-78601-4 (eBook)


https://doi.org/10.1007/978-3-030-78601-4

© Springer Nature Switzerland AG 2021


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in
this book are believed to be true and accurate at the date of publication. Neither the publisher nor
the authors or the editors give a warranty, expressed or implied, with respect to the material
contained herein or for any errors or omissions that may have been made. The publisher remains
neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Contents

1 Artificial Intelligence in Ophthalmology: Promises, Hazards and Challenges
  Andrzej Grzybowski
2 Basics of Artificial Intelligence for Ophthalmologists
  Ikram Issarti and Jos J. Rozema
3 Overview of Artificial Intelligence Systems in Ophthalmology
  Paisan Ruamviboonsuk, Natsuda Kaothanthong, Thanaruk Theeramunkong, and Varis Ruamviboonsuk
4 Autonomous Artificial Intelligence Safety and Trust
  Michael D. Abramoff
5 Technical Aspects of Deep Learning in Ophthalmology
  Zhiqi Chen and Hiroshi Ishikawa
6 Selected Image Analysis Methods for Ophthalmology
  Tomasz Krzywicki
7 Experimental Artificial Intelligence Systems in Ophthalmology: An Overview
  Joelle A. Hallak, Kathleen Emily Romond, and Dimitri T. Azar
8 Artificial Intelligence in Age-Related Macular Degeneration (AMD)
  Yifan Peng, Qingyu Chen, Tiarnan D. L. Keenan, Emily Y. Chew, and Zhiyong Lu
9 AI and Glaucoma
  Zhiqi Chen, Gadi Wollstein, Joel S. Schuman, and Hiroshi Ishikawa
10 Artificial Intelligence in Retinopathy of Prematurity
  Brittni A. Scruggs, J. Peter Campbell, and Michael F. Chiang
11 Artificial Intelligence in Diabetic Retinopathy
  Andrzej Grzybowski and Piotr Brona
12 Google and DeepMind: Deep Learning Systems in Ophthalmology
  Xinle Liu, Akinori Mitani, Terry Spitz, Derek J. Wu, and Joseph R. Ledsam
13 Singapore Eye Lesions Analyzer (SELENA): The Deep Learning System for Retinal Diseases
  David Chuen Soong Wong, Grace Kiew, Sohee Jeon, and Daniel Ting
14 Automatic Retinal Imaging and Analysis: Age-Related Macular Degeneration (AMD) within Age-Related Eye Disease Studies (AREDS)
  T. Y. Alvin Liu and Neil M. Bressler
15 Artificial Intelligence for Keratoconus Detection and Refractive Surgery Screening
  José Luis Reyes Luis and Roberto Pineda
16 Artificial Intelligence for Cataract Management
  Haotian Lin, Lixue Liu, and Xiaohang Wu
17 Artificial Intelligence in Refractive Surgery
  Yan Wang, Mohammad Alzogool, and Haohan Zou
18 Artificial Intelligence in Cataract Surgery Training
  Nouf Alnafisee, Sidra Zafar, Kristen Park, Satyanarayana Swaroop Vedula, and Shameema Sikder
19 Artificial Intelligence in Ophthalmology Triaging
  Yiran Tan, Stephen Bacchi, and Weng Onn Chan
20 Deep Learning Applications in Ocular Oncology
  T. Y. Alvin Liu and Zelia M. Correa
21 Artificial Intelligence in Neuro-ophthalmology
  Dan Milea and Raymond Najjar
22 Artificial Intelligence Using the Eye as a Biomarker of Systemic Risk
  Rachel Marjorie Wei Wen Tseng, Tyler Hyungtaek Rim, Carol Y. Cheung, and Tien Yin Wong
23 Artificial Intelligence in Calculating the IOL Power
  John G. Ladas and Shawn R. Lin
24 Practical Considerations for AI Implementation in IOL Calculation Formulas
  Guillaume Debellemanière, Alain Saad, and Damien Gatinel
Index
1 Artificial Intelligence in Ophthalmology: Promises, Hazards and Challenges

Andrzej Grzybowski

"If you do not get feedback, your confidence grows much faster than your accuracy."
Tetlock P, Gardner D. Superforecasting: The Art and Science of Prediction. Crown Publishing, 2016.

The Promise of Artificial Intelligence

The term "artificial intelligence" (AI) was coined on August 31, 1955, when John McCarthy, Marvin L. Minsky, Nathaniel Rochester, and Claude E. Shannon submitted "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence" [1, 2]. It was, however, Alan Turing who during a public lecture in London in 1947 mentioned computer intelligence, and in 1948 he introduced many of the central concepts of AI in a report entitled "Intelligent Machinery" [3]. Moreover, Turing proposed in 1950 the test, originally called the imitation game and later known as the Turing test, as a way to confirm that the intelligent behavior of a machine was equivalent to that of a human. A human evaluator is asked to determine the nature of a partner (human or machine) based on a text-only conversation [1–3].

After decades of slow progress since the Turing test was proposed, AI has finally blossomed. Many new technologies and applications are available, and there is great enthusiasm about the promise of AI in health care. It holds the potential to improve patient and practitioner outcomes, reduce costs by preventing errors and unnecessary procedures, and provide population-wide health improvements. We have entered the fourth stage of the Industrial Revolution that began in the eighteenth century, and its defining feature may well be the use of AI technologies (Fig. 1.1).

The results of an annual competition known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) provide interesting insights into recent developments in AI technology (Fig. 1.2). Over the years 2010–2016 there was a steady decrease in the error rates of the algorithms presented, and in 2017, 29 of the 38 competing teams had error rates lower than 5% (considered to be the human threshold). Thus in 10 years AI algorithms exceeded human performance in image recognition.

There are many promising applications for AI in health care, addressing a variety of aims and taking many different approaches (Table 1.1).

Fig. 1.1  The four main stages of the Industrial Revolution that began in the eighteenth century

Fig. 1.2  Error-rate history on the ImageNet competition, 2011–2016 (error rate on the y-axis, from 0.0 to 0.5; year on the x-axis)

Table 1.1  Some ambitious expectations for AI in health care. Adapted from Topol E. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. Basic Books, New York 2019
•  outperform doctors
•  help to diagnose what is presently undiagnosable
•  help to treat what is presently untreatable
•  recognize on images what is presently unrecognizable
•  predict the unpredictable
•  classify the unclassifiable
•  decrease workflow inefficiencies
•  decrease hospital admissions and readmissions
•  increase medication adherence
•  decrease patient harm
•  decrease or eliminate misdiagnosis

For example, misdiagnoses constitute a huge, although poorly recognized, medical problem. A study published in 2014 estimated that diagnostic errors affect at least 5% of US adults (12 million people) per year [4]. More recently, a systematic review and meta-analysis reported that the rate of diagnostic errors causing adverse events among hospitalized patients was 0.7% [5]. Furthermore, diagnostic error is the most important reason for malpractice litigation in the United States, accounting for 31% of malpractice lawsuits in 2017 [2]. The creation of AI programs to identify and analyze diagnostic errors could be an important step in addressing this problem [6].

Eric Topol has proposed that AI could help shift into "deep medicine," by allowing physicians to devote more time to crucial relationships with their patients—an aspect of medicine that cannot be replaced by any AI technology [2]. It is also interesting to consider whether AI might enrich the doctor-patient relationship, enabling a shift from the present "shallow medicine" into "deep medicine," based on deep empathy and connection [2]. Success in building such relationships is very much related to the amount of time doctors can spare for patients and the extent of the personal contact they have with their patients. The average time of a clinic visit in the United States is 7 min for an established patient and 12 min for a new patient. In many Asian countries, clinic visits last as little as 2 min per patient [2]. Making this situation even worse, part of this time must be devoted to completing electronic health records, further limiting personal contact. A study published in 2017 that asked patients to describe how they perceive their physician found that the most common negative responses were "rushed," "busy," and "hurried" [7]. These reactions are manifestations of "shallow medicine."

One of the arguments supporting the use of AI in medicine is that human cognitive capacity to effectively manage information is often exceeded by the quantity of data generated. Each year the world produces zettabytes of data (roughly, enough to fill a trillion smartphones) [2]. Moreover, unlike humans, who have bad days and emotions, and who get tired, with subsequent decreases in performance and accuracy, AI works 24/7 without vacations or complaints [2].

AI-based technologies employing deep-learning (DL) approaches have proven effective in supporting decisions in many medical specialties, including radiology, cardiology, oncology, dermatology, ophthalmology, and others. For example, AI/DL algorithms (also referred to as AI/DL models in the following text) have been shown to reduce waiting times, improve medication adherence, customize insulin dosages, and help interpret magnetic resonance images. The number of AI life-science papers listed in PubMed increased from 596 in 2010 to 12,422 in 2019 [8]. The number of papers on the use of AI in the field of ophthalmology has also increased dramatically (Figs. 1.3 and 1.4).

AI/DL algorithms have been used to detect diseases based on image analysis, with fundus photos and optical coherence tomography (OCT) scans analyzed for retinal diseases, chest radiographs assessed for lung diseases, and skin photos analyzed for skin disorders. Retinal photos have also been used to identify risk factors related to cardiovascular disorders, including blood pressure, smoking, and body mass index [9]. Using DL models trained on data from over 280,000 patients and validated on two independent data sets, Poplin et al. predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70) (Fig. 1.5) [9].
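For readers unfamiliar with these metrics, the short sketch below shows how the two quantities quoted above, the area under the receiver operating characteristic curve (AUC) and the mean absolute error (MAE), are computed in practice. It uses scikit-learn and invented toy numbers, not data from the Poplin et al. study.

```python
# Computing AUC and MAE with scikit-learn; all values are illustrative.
from sklearn.metrics import roc_auc_score, mean_absolute_error

# Binary outcome (e.g., smoking status: smoker = 1) with predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
print("AUC:", roc_auc_score(y_true, y_prob))  # 0.5 = chance, 1.0 = perfect ranking

# Continuous outcome (e.g., age in years) with model estimates.
age_true = [57.6, 63.0, 45.2]
age_pred = [59.1, 60.5, 47.0]
print("MAE:", mean_absolute_error(age_true, age_pred))  # mean absolute error, in years
```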

Fig. 1.3  The number of PubMed articles on AI and the eye published between 1974 and 2020

Fig. 1.4  The number of articles relating to AI and eye diseases published in 2020
The COVID-19 pandemic has raised expectations for the use of AI in data analysis. So far it has been used in epidemic modeling, detection of misinformation, diagnostics, vaccine and drug development, triage and patient outcomes, and identification of regions of greatest need [10].

Regulating AI-Based Medical Devices: Demonstrating Benefit and Safety

One of many challenges in the field of AI is determining what constitutes evidence of impact and benefit for AI medical devices, and who should assess that evidence [2]. The majority of AI studies are conducted in experimental conditions and based on preselected data. They might provide inadequate insight into the use of AI applications in heterogeneous, real-world care settings.

Lee et al. tested seven algorithms being used clinically around the world, including one with US Food and Drug Administration (FDA) approval and four whose developers have submitted applications for FDA approval. They found that most of these algorithms performed worse in real-world, compared with experimental, situations, with only three of seven and one of seven having comparable sensitivity and specificity to the human graders, respectively. Only one algorithm performed as well as human graders [11]. Another of the algorithms tested performed

Fig. 1.5  Attention maps for a single retinal fundus image, with actual and predicted values for age (57.6 vs 59.1 years), gender (female vs female), smoking status (non-smoker vs non-smoker), HbA1c (non-diabetic vs 6.7%), BMI (26.3 vs 24.1 kg m−2), systolic blood pressure (148.5 vs 148.0 mmHg), and diastolic blood pressure (78.5 vs 86.6 mmHg). The top left image is a sample retinal image in color from the UK Biobank data set. The remaining images show the same retinal image in black-and-white. The soft attention heat map for each prediction is overlaid in green, indicating the areas of the heat map that the neural-network model is using to make the prediction for the image. Source: Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster DR. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018 Mar;2(3):158–164

significantly worse than human graders at all levels of DR severity—it missed 25.58% of cases of advanced retinopathy, which could have serious consequences. One of the potential hazards of the clinical use of algorithms identified in this study was the risk of applying an algorithm trained on a particular demographic group to a population that differs in factors such as ethnicity, age, and sex. Moreover, many studies of algorithms developed with AI have excluded low-quality images, treating them as ungradable, as well as patients with comorbid eye diseases, making them less reflective of real-world conditions.

The study by Lee et al. shows the importance and limitations of the registration process for AI-based medical devices. FDA registration is based on a centralized system, which does not have a specific, easily accessible regulatory pathway for AI-based medical devices. The FDA clears medical devices through three pathways: the premarket approval pathway, the de-novo premarket review, and the 510(k) pathway [12, 13]. The leading AI disciplines in medicine are radiology, cardiology, internal medicine/endocrinology, neurology, ophthalmology, emergency medicine, and oncology. FDA approvals of AI-based medical devices have increased steadily in recent years: there were 9 in 2015, 13 in 2016, 32 in 2017, 67 in 2018, and 77 in 2019, with the majority of devices designed for use in radiology, cardiology, and neurology [12]. Interestingly, 85% of FDA-approved medical devices in the years 2015–2019 were intended for use by health-care professionals, and only 15% for use by patients. The best-known FDA-approved AI-based medical devices in the field of ophthalmology are IDx-DR (2018), the first software to provide screening decisions that do not have to be interpreted by a clinician, and Eyenuk (2020), which, like IDx-DR, screens for diabetic retinopathy.

In the European Economic Area, which includes the European Union (EU) countries and the European Free Trade Association (EFTA) members (Iceland, Liechtenstein, Norway, and Switzerland), medical devices are approved in a decentralized manner. Conformité Européenne (CE) marking indicates conformity with EU health, safety, and environmental-protection standards. For the lowest-risk medical devices (CE class I), the manufacturer ensures that the product complies with regulations and an approval procedure is not required. The registration procedure for higher-risk medical devices (CE classes IIa, IIb, and III) is handled by private entities, called notified bodies, that have been accredited to assess the devices and issue a CE mark.

Thirteen CE-marked AI-based medical devices were approved in 2015, 27 in 2016, 26 in 2017, 55 in 2018, and 100 in 2019. The majority were designed for use in radiology, general hospital care, cardiology, neurology, ophthalmology (12 devices), and pathology, and most were class IIa (40%), class I (35%), or class IIb (12%) devices [12]. Of the AI-based devices that were CE-marked between 2015 and 2019, 124 (52%) were also FDA approved, making up 56% of the AI-based tools that the FDA approved in those years. Bigger companies were more likely to get both approvals, whereas smaller companies were more likely to obtain only a CE mark. The authors of this study suggested that the European approval system was less rigorous than the US one. This conclusion is supported by an FDA report on 12 devices that received CE approval only and were later found to be unsafe or ineffective [13, 14]. A major problem in studying CE-marked devices in the European Economic Area is the lack of a publicly available register of approved devices comparable to the FDA register. Moreover, the information submitted to the notified bodies is confidential. In 2022, a new European database on medical devices (Eudamed), providing a live picture of the lifecycle of medical devices, will become operational. It will be composed of six modules: actor registration, unique device identification (UDI), device registration, notified bodies and certificates, clinical investigations and performance studies, and vigilance and market surveillance [15].

Access to Reliable Data

DL algorithm training requires large data sets with thousands or even hundreds of thousands of diverse, well-balanced, and accurately labeled images [16]. The resources required for an AI study are presented in Fig. 1.6.

Fig. 1.6  Schematic presentation of the resources required for an AI study: training data, test data (with performance claims made on data not included in training), validation data, setting and expert profiles, a gold standard or benchmark, and performance evidence showing that the AI/ML model achieves or exceeds expert performance on the test data

The enormous numbers of required images can rarely be obtained from individual centers; thus they are secured from data repositories or from centers that agree to share data. There is a growing need for consensus on standardized definitions of medical entities; conventions for data formatting; identification of units of measure; protocols for data cleaning, harmonization, and validation; standards for sharing and reusing data and for sharing code implementing AI models; and the adoption of open application program interfaces to AI models [17]. This is required for data sharing and open communication in AI, which is critical for conducting the reproducible research that is necessary before AI technology can be adopted in health care.

Kermany et al. used a DL analysis of a data set of optical coherence tomography images for triage and diagnosis of choroidal neovascularization, diabetic macular edema, and drusen. They demonstrated performance comparable to that of human experts and provided a more transparent and interpretable diagnosis by highlighting the regions recognized by the neural network. Further, they showed that a transfer-learning approach produced only modestly worse results (a twofold increase of error, compared with the full data set) while using approximately 20 times fewer images. They also demonstrated the wider utility of this approach by applying it to the identification of pediatric pneumonia using chest X-ray images. They provided their data and code in a publicly available database to facilitate their use by other biomedical researchers in order to improve the performance of future models [18].

Transfer learning (Figs. 1.7 and 1.8) has been used in recent years to build classification models for medical images because the number of images that can be used for training is relatively small compared to the number of images available to train general models [19] (Fig. 1.9).
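To make the idea concrete, the following is a minimal transfer-learning sketch in PyTorch. The backbone, data, and hyperparameters are illustrative assumptions, not the pipeline of any system cited in this chapter.

```python
# Minimal transfer-learning sketch (illustrative only): a network pretrained
# on ImageNet's 1000 categories is adapted to grade diabetic retinopathy.
import torch
import torch.nn as nn
from torchvision import models

# 1. Take a model trained on a different source task (ImageNet classification).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# 2. Freeze the pretrained weights so only the new head is learned at first.
for param in backbone.parameters():
    param.requires_grad = False

# 3. Replace the 1000-class output layer with a new head for the target task,
#    e.g., four retinopathy grades (0-3) as in Fig. 1.9.
backbone.fc = nn.Linear(backbone.fc.in_features, 4)

# 4. Train only the new head on the (small) fundus-image data set.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def training_step(images, labels):
    """One optimization step on a batch of fundus images and grade labels."""
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-in data (batch of 2 RGB images, grades 0-3).
print(training_step(torch.randn(2, 3, 224, 224), torch.tensor([0, 3])))
```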

Fig. 1.7  The idea of transfer learning: instead of training a new model from scratch for a new (target) task, take a model trained on a large dataset for a different source task and adapt the learned knowledge to the new domain and the (small) target dataset. Variants include the same domain with various tasks, or various domains with the same task

Fig. 1.8  Schematic of a convolutional neural network and transfer learning: traditional machine learning (single-task learning without accumulating knowledge, where model training uses the entire labeled data set) versus transfer learning (learning a model for a new task based on knowledge from other learned tasks, which can make the learning process faster)

Another approach to meeting the need for large, annotated training data sets might be the use of low-shot DL algorithms. Low-shot learning (LSL), also known as few-shot learning, is a type of machine learning (ML) problem where the training dataset contains limited information. It is well known that many real-life situations, including rare diseases (e.g., serpiginous choroidopathy or angioid streaks in pseudoxanthoma elasticum) and non-typical presentations or subtypes of common disorders, are prone to AI bias due to the paucity or imbalance of data. These deficiencies may also result in less accurate future models. When addressing this sort of bias, dividing data according to some patient features (e.g., age, sex, and race/ethnicity) may result in smaller data sets that may be insufficient for training models for these particular groups. The study by Burlina et al. showed that the performance of widely used DL methods degraded substantially when used with limited data sets, but LSL methods performed better and might be applied in retinal diagnostics when a limited number of retina images are available for training [20].
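One common low-shot strategy, nearest-class-prototype classification, can be sketched in a few lines. This is a generic illustration of the family of methods, not the approach of the study cited above.

```python
# Minimal low-shot sketch: classify a new image by the nearest class
# "prototype" (mean embedding of the few labeled examples per class).
import numpy as np

def prototypes(support_embeddings, support_labels):
    """Mean embedding per class from a handful of labeled examples."""
    classes = sorted(set(support_labels))
    protos = np.stack([
        np.mean([e for e, l in zip(support_embeddings, support_labels) if l == c],
                axis=0)
        for c in classes
    ])
    return classes, protos

def classify(query_embedding, classes, protos):
    """Assign the query to the class with the closest prototype."""
    dists = np.linalg.norm(protos - query_embedding, axis=1)
    return classes[int(np.argmin(dists))]

# Toy usage: 2 classes, 3 labeled examples each, 4-dimensional embeddings.
rng = np.random.default_rng(1)
support = rng.normal(size=(6, 4))
labels = [0, 0, 0, 1, 1, 1]
cls, protos = prototypes(support, labels)
print(classify(support[0], cls, protos))
```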
Another approach that has been suggested by several authors to address the problem of limited data sets is the use of generative adversarial networks (GANs) to synthesize new images from a training data set of real images. GANs are ML models that can generate new data with the same statistics as the training set (Fig. 1.10). For example, a GAN trained on photographs can generate photographs of non-existing persons that look as authentic as real humans (Fig. 1.11).

Fig. 1.9  The schematic diagram of transfer learning: a network pretrained on the 1000 ImageNet categories keeps its pretrained weights, while newly initialized weights in the output layer are learned for retinopathy grades 0–3. Source: Lingling Li et al. Diabetic retinopathy identification system based on transfer learning. 2020. J. Phys.: Conf. Ser. 1544 012133. https://doi.org/10.1088/1742-6596/1544/1/012133. Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence
Artificial photos can be found at https://thispersondoesnotexist.com. Many applications of GANs have been proposed, including art, fashion, advertising, science, and video games; however, concerns about malicious uses have also been raised, e.g., the production of fake, possibly incriminating, photographs and videos.

Burlina et al. used the Age-Related Eye Disease Study data set of over 130,000 fundus images to generate a similar number of synthetic images to train DL models. The performance of DL models trained with the synthetic images was nearly as good as the performance of models trained on real images [21]. Liu et al. have shown that 92% of synthetic OCT images had sufficient quality for further clinical interpretation. Only about 26–30% of synthetic post-therapeutic images could be accurately identified as synthetic images (Fig. 1.8) [22]. The accuracy of models trained on synthetic images to predict wet or dry macular status was 0.85 (95% CI 0.74–0.95) [22]. In a study by Zheng et al., the image quality of real versus synthetic OCT images was similar as assessed by two retinal specialists.

Fig. 1.10  The schematic presentation of a generative adversarial network (GAN): a generator (deconvolutional network) turns random noise into generated images, while a discriminator (deep convolutional network) classifies real versus fake images, with the outcome used to fine-tune the training of both networks
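The adversarial training loop of Fig. 1.10 can be written compactly. The sketch below uses toy fully connected networks and a placeholder image size, rather than the convolutional architectures used in practice.

```python
# Minimal GAN training step (illustrative; sizes and depths are assumptions).
import torch
import torch.nn as nn

LATENT = 100       # random-noise dimension
IMG = 64 * 64      # flattened toy image size

generator = nn.Sequential(           # maps noise -> fake image
    nn.Linear(LATENT, 256), nn.ReLU(),
    nn.Linear(256, IMG), nn.Tanh())

discriminator = nn.Sequential(       # maps image -> P(real)
    nn.Linear(IMG, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def gan_step(real_images):
    """One adversarial update: the discriminator learns real vs fake,
    then the generator learns to fool the discriminator."""
    n = real_images.size(0)
    fake = generator(torch.randn(n, LATENT))

    # Discriminator step: label real images 1, generated images 0.
    opt_d.zero_grad()
    loss_d = bce(discriminator(real_images), torch.ones(n, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(n, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator output 1 on fakes.
    opt_g.zero_grad()
    loss_g = bce(discriminator(fake), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()

# Example: one step on a batch of 8 flattened toy "images" in [-1, 1].
gan_step(torch.rand(8, IMG) * 2 - 1)
```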

Fig. 1.11  The image of a young woman generated by StyleGAN, a generative adversarial network (GAN). The person in this photo does not exist but was generated by an artificial intelligence based on an analysis of portraits. Source: https://commons.wikimedia.org/wiki/File:Woman_1.jpg. This file is in the public domain because, as the work of a computer algorithm or artificial intelligence, it has no human author in whom copyright is vested

The accuracy of discrimination of real versus synthetic OCT images was 59.50% for retinal specialist 1 and 53.67% for retinal specialist 2. For the local data set, the DL model trained on real and synthetic OCT images had an area under the curve of 0.99 and 0.98, respectively. For the clinical data set, the area under the curve was 0.94 for the real model and 0.90 for the synthetic one [23]. These studies suggest that GAN synthetic images can be used by clinicians for educational purposes and for developing DL algorithms [24].

An important and interesting issue is the clinical application of continual ML, i.e., continuous learning and development from new data while retaining previously learned knowledge [25]. However, there are technical challenges to the implementation of this promising concept, including the need to prevent interference between new and old data, and between old and new knowledge. In the catastrophic interference phenomenon, the acquisition of new data can lead to an abrupt decrease in the performance of an algorithm. Practical applications of AI tools in health care must be introduced cautiously because of such risks. FDA regulations require that FDA-approved autonomous algorithms be locked for safety to prevent unpredictable future changes. This requirement, however, is designed to ensure the safety of the model rather than to improve its performance. Continual learning could refine the performance of machine-learning algorithms by the gradual correction and elimination of mistakes. It will be necessary to consider how this technology can be introduced safely to health care.
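Catastrophic interference is easy to reproduce on a toy problem: a model fitted to one task and then naively refitted to a second loses its performance on the first. The sketch below uses an invented linear-regression example, purely to illustrate the effect.

```python
# Toy illustration of catastrophic interference with two synthetic tasks.
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(w, X, y, steps=200, lr=0.05):
    """Plain gradient descent on squared error, starting from weights w."""
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Two tasks with different underlying weight vectors.
X_a = rng.normal(size=(100, 5)); y_a = X_a @ np.array([1., 2., 0., 0., -1.])
X_b = rng.normal(size=(100, 5)); y_b = X_b @ np.array([-2., 0., 1., 3., 0.])

w = fit_linear(np.zeros(5), X_a, y_a)
print("task A error after learning A:", mse(w, X_a, y_a))   # low

w = fit_linear(w, X_b, y_b)                                 # continue on B only
print("task A error after learning B:", mse(w, X_a, y_a))   # jumps back up
```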

Hazards and Challenges of AI in Ophthalmology

The future development of ophthalmology depends on better and possibly unlimited access to the medical data stored within electronic health records. However, this access cannot be allowed to compromise the privacy of these very sensitive data. There is a need for effective regulations that will set a balance between individual protection and the common good. One approach to protecting privacy and increasing sample size is to share DL algorithms with local institutions for retraining purposes, but without sharing the private data used to build the algorithms. This model-to-data approach, also known as federated learning, was tested in ophthalmology and was shown to work effectively [26].
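A minimal sketch of this model-to-data idea, using plain federated averaging on a toy linear model (an illustration of the concept, not the method of the cited study):

```python
# Federated averaging sketch: only model weights travel between sites;
# each institution's raw data never leaves it. Toy linear model in NumPy.
import numpy as np

def local_update(w, X, y, lr=0.1):
    """One institution's local training pass: a single gradient-descent
    step on a linear model, using only that institution's data."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def federated_round(w_global, sites):
    """One round: broadcast the global weights, let each site update them
    locally, then average the returned weight vectors."""
    return np.mean([local_update(w_global.copy(), X, y) for X, y in sites],
                   axis=0)

# Toy usage: three "institutions", each holding its own private (X, y) data.
rng = np.random.default_rng(0)
sites = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(3)
for _ in range(50):
    w = federated_round(w, sites)
```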
According to the US National Institute of Standards and Technology, biometric data, including retina images, are personally identifiable information and should be protected from inappropriate access. Although AI models have been shown to diagnose and stage some ocular diseases from fundus photographs, OCT scans, and visual-field images, most AI algorithms were tested on data sets that did not correspond well to real-world conditions. Patient populations were usually homogenous, and poor-quality images and patients with multiple pathologies were excluded. Future studies are needed to validate algorithms on ocular images from heterogeneous populations, including both good- and poor-quality images. Otherwise, we may face the situation of "good AI gone bad." The tendency to cherry-pick the best results might make the situation even worse. AI algorithms can behave unpredictably when applied in real life. Algorithm performance can degrade after deployment due to changes between the training and testing conditions (dataset shift), caused, for example, by images generated by a different device than the one used for the training set, or by images collected in a different clinical environment [27–30]. Moreover, algorithms may return different outputs at different times when presented with similar inputs [31, 32]—they can be affected by minor changes in image quality or by extraneous data on an image [32–35]. All these problems might lead to misdiagnosis and erroneous treatment suggestions, breaching trust in AI technologies. An error in an AI system could harm hundreds or even thousands of patients.

A recent report from the National Academy of Medicine [36] highlights some important challenges in the further development of AI applications in health care (Table 1.2). Its authors advocate the use of openly accessible, standardized, population-representative data; addressing explicit and implicit biases related to AI; developing and deploying appropriate training and educational programs for health workers to support health-care AI; and balancing innovation and safety through the use of regulation and legislation to promote trust.

Table 1.2  Practical challenges to the advancement and application of AI tools in clinical settings
Workflow integration: Understand the technical, cognitive, social, and political factors in play and the incentives impacting the integration of AI into health care workflows.
Enhanced explainability and interpretability: To promote integration of AI into health care workflows, consider what needs to be explained and approaches for ensuring understanding by all members of the health care team.
Workforce education: Promote educational programs to inform clinicians about AI/machine learning approaches and to develop an adequate workforce.
Oversight and regulation: Consider the appropriate regulatory mechanism for AI/machine learning and approaches for evaluating algorithms and their impact.
Problem identification and prioritization: Catalog the different areas of health care and public health where AI/machine learning could make a difference, focusing on intervention-driven AI.
Clinician and patient engagement: Understand the appropriate approaches for involving consumers and clinicians in AI/machine learning prioritization, development, and integration, and the potential impact of AI/machine learning algorithms on the patient-provider relationship.
Data quality and access: Promoting data quality, access, and sharing, as well as the use of both structured and unstructured data and the integration of non-clinical data, is critical to developing effective AI tools.
Source: Matheny ME, Thadaney Israni S, Ahmed M, Whicher D. AI in Health Care: The Hope, the Hype, the Promise, the Peril. Washington, DC: National Academy of Medicine; 2019. https://nam.edu/artificial-intelligence-special-publication

To understand the limitations of AI-based models in health care and the responsibilities of manufacturers and users of AI software as a medical device (SaMD), the MI-CLAIM checklist was proposed for use in AI software development [37]. Its purpose is to enable a direct assessment of clinical impact, including considerations of fairness and bias, and to allow rapid replication of the technical design by any legitimate clinical AI study. The MI-CLAIM checklist has six parts (Table 1.3): (1) study design; (2) separation of data into partitions for model training and model testing; (3) optimization and final model selection; (4) performance evaluation; (5) model examination; and (6) reproducible pipeline.
Table 1.3  The MI-CLAIM checklist [Source: Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, Arnaout R, Kohane IS, Saria S, Topol E, Obermeyer Z, Yu B, Butte AJ. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020 Sep;26(9):1320–1324]

Before paper submission. For each item, record "Completed: page number" or "Notes if not completed".

Study design (Part 1)
•  The clinical problem in which the model will be employed is clearly detailed in the paper.
•  The research question is clearly stated.
•  The characteristics of the cohorts (training and test sets) are detailed in the text.
•  The cohorts (training and test sets) are shown to be representative of real-world clinical settings.
•  The state-of-the-art solution used as a baseline for comparison has been identified and detailed.

Data and optimization (Parts 2, 3)
•  The origin of the data is described and the original format is detailed in the paper.
•  Transformations of the data before it is applied to the proposed model are described.
•  The independence between training and test sets has been proven in the paper.
•  Details on the models that were evaluated and the code developed to select the best model are provided.
•  Is the input data type structured or unstructured?

Model performance (Part 4)
•  The primary metric selected to evaluate algorithm performance (e.g., AUC, F-score, etc.), including the justification for selection, has been clearly stated.
•  The primary metric selected to evaluate the clinical utility of the model (e.g., PPV, NNT, etc.), including the justification for selection, has been clearly stated.
•  The performance comparison between baseline and proposed model is presented with the appropriate statistical significance.

Model examination (Part 5)
•  Examination technique 1 (a)
•  Examination technique 2 (a)
•  A discussion of the relevance of the examination results with respect to model/algorithm performance is presented.
•  A discussion of the feasibility and significance of model interpretability at the case level if examination methods are uninterpretable is presented.
•  A discussion of the reliability and robustness of the model as the underlying data distribution shifts is included.

Reproducibility (Part 6): choose the appropriate tier of transparency
•  Tier 1: complete sharing of the code
•  Tier 2: allow a third party to evaluate the code for accuracy/fairness; share the results of this evaluation
•  Tier 3: release of a virtual machine (binary) for running the code on new data without sharing its details
•  Tier 4: no sharing

PPV positive predictive value, NNT numbers needed to treat
(a) Common examination approaches based on study type: for studies involving exclusively structured data, coefficients and sensitivity analysis are often appropriate; for studies involving unstructured data in the domains of image analysis or natural language processing, saliency maps (or equivalents) and sensitivity analyses are often appropriate

The CONSORT-AI and SPIRIT-AI working groups have proposed reporting guidelines for clinical trials of interventions involving AI. A summary of these guidelines is presented in Table 1.4.

Table 1.4  Major topics of the CONSORT-AI extension
1. State the inclusion and exclusion criteria at the level of participants
2. State the inclusion and exclusion criteria at the level of the input data
3. Describe how the AI intervention was integrated into the trial setting, including any onsite or offsite requirements
4. State which version of the AI algorithm was used
5. Describe how the input data were acquired and selected for the AI intervention
6. Describe how poor-quality or unavailable input data were assessed and handled
7. Specify whether there was human–AI interaction in the handling of the input data, and what level of expertise was required of users
8. Specify the output of the AI intervention
9. Explain how the AI intervention's outputs contributed to decision-making or other elements of clinical practice
10. Describe the results of any analysis of performance errors and how errors were identified, where available; if no such analysis was planned or done, explain why not
Source: Adapted from Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit Health. 2020 Oct;2(10):e537–e548

Inherent conflicts of interest should be acknowledged. Manufacturers who develop and market SaMD have a strong financial interest in presenting their products positively. Thus, conflicts of interest exist if they fund, conduct, and publish results of studies, including those that might report deficiencies in their products. Many of the published papers in the field of AI-based diabetic retinopathy screening, particularly those using CE-marked and FDA-approved algorithms, were conducted by manufacturers or patent owners.

It should also be remembered that AI algorithms can be designed to perform in unethical ways. For example, Uber's Greyball software allowed the company to identify and circumvent local regulations, and Volkswagen's algorithm allowed vehicles to pass emission tests by reducing their emissions of nitrogen oxide during testing. AI algorithms could be tuned to generate increased profits for their owners by recommending particular drugs, tests, or the like without clinical users' awareness. AI systems are also vulnerable to cybersecurity attacks that could cause their algorithms to misclassify medical information [31].

Seven essential factors to design AI for social good were proposed by Floridi et al. (Table 1.5) [38]. The authors propose falsifiability as an essential factor to improve the trustworthiness of a technological application, i.e., for an SaMD to be trustworthy, its safety should be falsifiable. Critical requirements for a device to be fully functional must be specified and must be testable. If falsifiability is not possible, then the critical requirements cannot be checked, and the system should not be deemed trustworthy [38].

Cost-Effectiveness of AI-Based Devices

One of the arguments for AI-based medical devices is that they can reduce medical costs and eliminate unnecessary procedures. A study from Singapore found that a semiautomated model that combined a DL system with human assessment achieved the best economic returns, leading to savings of 19.5% in screening for diabetic retinopathy. An earlier study from the UK reported cost-savings of 12.8–21.0%; however, a simple comparison between them is not possible due to the different models of DR screening in the two countries (two-stage screening in Singapore and three-stage screening in the UK) and their different DR classification systems. The authors of both studies argued that a semiautomated system produces more savings than a fully automated system due to the lower rate of false positives and unnecessary specialist visits [39, 40].

Table 1.5  Essential factors to design AI for social good, with corresponding best practices and ethical principles
Falsifiability and incremental deployment: Identify falsifiable requirements and test them in incremental steps from the lab to the "outside world" (ethical principle: nonmaleficence)
Safeguards against the manipulation of predictors: Adopt safeguards which (i) ensure that non-causal indicators do not inappropriately skew interventions, and (ii) limit, when appropriate, knowledge of how inputs affect outputs from AI4SG systems, to prevent manipulation (nonmaleficence)
Receiver-contextualised intervention: Build decision-making systems in consultation with users interacting with and impacted by these systems; with understanding of users' characteristics, the methods of coordination, and the purposes and effects of an intervention; and with respect for users' right to ignore or modify interventions (autonomy)
Receiver-contextualised explanation and transparent purposes: Choose a Level of Abstraction for AI explanation that fulfils the desired explanatory purpose and is appropriate to the system and the receivers; then deploy arguments that are rationally and suitably persuasive for the receiver to deliver the explanation; and ensure that the goal (the system's purpose) for which an AI4SG system is developed and deployed is knowable to receivers of its outputs by default (explicability)
Privacy protection and data subject consent: Respect the threshold of consent established for the processing of datasets of personal data (nonmaleficence; autonomy)
Situational fairness: Remove from relevant datasets variables and proxies that are irrelevant to an outcome, except when their inclusion supports inclusivity, safety, or other ethical imperatives (justice)
Human-friendly semanticisation: Do not hinder the ability for people to semanticise (that is, to give meaning to, and make sense of) something (autonomy)
Source: Floridi L, Cowls J, King TC, Taddeo M. How to Design AI for Social Good: Seven Essential Factors. Sci Eng Ethics. 2020 Jun;26(3):1771–1796. Springer

This book aims to provide ophthalmologists, other vision professionals, and researchers with an overview of current research into the use of AI in ophthalmology. Together with a team of international experts from Europe, North America, and Asia, we present an overview of the most important documented research in ophthalmology on ML and AI technologies and their benefits. We discuss the use of AI in the diagnosis of some retinal and corneal disorders, the diagnosis of congenital cataract, neuro-ophthalmology, glaucoma, intraocular lens calculation methods, ocular oncology, ophthalmology triaging, cataract-surgery training, refractive surgery, and the assessment and prediction of systemic diseases through the use of the eye. Chapters on digital-image analysis, AI basics, and technical aspects of AI provide the reader with knowledge not commonly possessed by ophthalmologists, but required to understand the topic in both its field-specific and broader contexts. The very important chapter on AI safety and efficacy outlines the challenges ophthalmology will face with the introduction and widespread dissemination of this technology. Although we have covered all of the major areas of AI/ML technology in ophthalmology, research in this field is progressing so quickly that some new concepts that emerged at the end of 2020 and in early 2021 do not appear on these pages. However, evidence-based medicine often demands that we await more evidence to verify early reports and assess the real value of new medical technologies or applications. I would like to thank all the contributors for sharing their knowledge in this new and fascinating discipline, which has great potential to change ophthalmology.

Acknowledgements  I would like to thank Aleksandra Lemanik, Foundation for Ophthalmology Development, Poznan, Poland, and Tomasz Krzywicki, Faculty of Mathematics and Computer Science, University of Warmia and Mazury, Olsztyn, Poland, for their help in preparing illustrations, and Szymon Wilk, Faculty of Computing and Telecommunications, Poznan University of Technology, Poznan, Poland, for his valuable discussion on this chapter.
References

1. Mitchell M. Artificial intelligence: a guide for thinking humans. Penguin UK; 2019.
2. Topol E. Deep medicine: how artificial intelligence can make healthcare human again. New York: Basic Books; 2019.
3. Copeland BJ. Artificial intelligence. Encyclopedia Britannica, 11 August 2020. https://www.britannica.com/technology/artificial-intelligence. Accessed 18 Mar 2021.
4. Singh H, Meyer AN, Thomas EJ. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations. BMJ Qual Saf. 2014;23(9):727–31.
5. Gunderson CG, Bilan VP, Holleck JL, et al. Prevalence of harmful diagnostic errors in hospitalised adults: a systematic review and meta-analysis. BMJ Qual Saf. 2020;29:1008–18.
6. Zwaan L, Singh H. Diagnostic error in hospitals: finding forests not just the big trees. BMJ Qual Saf. 2020;29(12):961–4.
7. Singletary B, Patel N, Heslin M. Patient perceptions about their physician in 2 words: the good, the bad, and the ugly. JAMA Surg. 2017;152(12):1169–70.
8. Benjamens S, Dhunnoo P, Meskó B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit Med. 2020;3:118.
9. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster DR. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2(3):158–64.
10. Chen J, See KC. Artificial intelligence for COVID-19: rapid review. J Med Internet Res. 2020;22(10):e21476.
11. Lee AY, Yanagihara RT, Lee CS, Blazes M, Jung HC, Chee YE, Gencarella MD, Gee H, Maa AY, Cockerham GC, Lynch M, Boyko EJ. Multicenter, head-to-head, real-world validation study of seven automated artificial intelligence diabetic retinopathy screening systems. Diabetes Care. 2021;dc201877. https://doi.org/10.2337/dc20-1877.
12. Muehlematter UJ, Daniore P, Vokinger KN. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015-20): a comparative analysis. Lancet Digit Health. 2021;3(3):e195–203.
13. Hwang TJ, Kesselheim AS, Vokinger KN. Lifecycle regulation of artificial intelligence- and machine learning-based software devices in medicine. JAMA. 2019;322(23):2285–6.
14. Hwang TJ, Sokolov E, Franklin JM, Kesselheim AS. Comparison of rates of safety issues and reporting of trial outcomes for medical devices approved in the European Union and United States: cohort study. BMJ. 2016;353:i3323.
15. European Commission. Medical devices—EUDAMED. 17 June 2020. https://ec.europa.eu/growth/sectors/medical-devices/new-regulations/eudamed_en. Accessed 15 Jan 2021.
16. Ting DSW, Liu Y, Burlina P, Xu X, Bressler NM, Wong TY. AI for medical imaging goes deep. Nat Med. 2018;24(5):539–40.
17. Wang SY, Pershing S, Lee AY, AAO Taskforce on AI and AAO Medical Information Technology Committee. Big data requirements for artificial intelligence. Curr Opin Ophthalmol. 2020;31(5):318–23.
18. Kermany DS, Goldbaum M, Cai W, Valentim CCS, Liang H, Baxter SL, McKeown A, Yang G, Wu X, Yan F, Dong J, Prasadha MK, Pei J, Ting MYL, Zhu J, Li C, Hewett S, Dong J, Ziyar I, Shi A, Zhang R, Zheng L, Hou R, Shi W, Fu X, Duan Y, Huu VAN, Wen C, Zhang ED, Zhang CL, Li O, Wang X, Singer MA, Sun X, Xu J, Tafreshi A, Lewis MA, Xia H, Zhang K. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–1131.e9.
19. Rampasek L, Goldenberg A. Learning from everyday images enables expert-like diagnosis of retinal diseases. Cell. 2018;172(5):893–5.
20. Burlina P, Paul W, Mathew P, Joshi N, Pacheco KD, Bressler NM. Low-shot deep learning of diabetic retinopathy with potential applications to address artificial intelligence bias in retinal diagnostics and rare ophthalmic diseases. JAMA Ophthalmol. 2020;138(10):1070–7.
21. Burlina PM, Joshi N, Pacheco KD, Liu TYA, Bressler NM. Assessment of deep generative models for high-resolution synthetic retinal image generation of age-related macular degeneration. JAMA Ophthalmol. 2019;137:258–64.
22. Liu Y, Yang J, Zhou Y, Wang W, Zhao J, Yu W, Zhang D, Ding D, Li X, Chen Y. Prediction of OCT images of short-term response to anti-VEGF treatment for neovascular age-related macular degeneration using generative adversarial network. Br J Ophthalmol. 2020;104(12):1735–40.
23. Zheng C, Xie X, Zhou K, Chen B, Chen J, Ye H, Li W, Qiao T, Gao S, Yang J, Liu J. Assessment of generative adversarial networks model for synthetic optical coherence tomography images of retinal disorders. Transl Vis Sci Technol. 2020;9(2):29.
24. Liu TYA, Farsiu S, Ting DS. Generative adversarial networks to predict treatment response for neovascular age-related macular degeneration: interesting, but is it useful? Br J Ophthalmol. 2020;104(12):1629–30.
25. Lee CS, Lee AY. Clinical applications of continual learning machine learning. Lancet Digit Health. 2020;2(6):e279–81.
26. Mehta N, Lee CS, Mendonça LSM, Raza K, Braun PX, Duker JS, Waheed NK, Lee AY. Model-to-data approach for deep learning in optical coherence tomography intraretinal fluid segmentation. JAMA Ophthalmol. 2020;138(10):1017–24.
27. Larson DB, Harvey H, Rubin DL, Irani N, Tse JR, Langlotz CP. Regulatory frameworks for development and evaluation of artificial intelligence-based diagnostic imaging algorithms: summary and recommendations. J Am Coll Radiol. 2021;18(3 Pt A):413–24.
28. Wang X, Liang G, Zhang Y, Blanton H, Bessinger Z, Jacobs N. Inconsistent performance of deep learning models on mammogram classification. J Am Coll Radiol. 2020;17:796–803.
29. Subbaswamy A, Schulam P, Saria S. Preventing failures due to dataset shift: learning predictive models that transport. Proc Mach Learn Res. 2019;89:3118–27.
30. Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics. 2020;21:345–52.
31. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17:195.
32. Winkler JK, Fink C, Toberer F. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 2019;155:1135–41.
33. Finlayson SG, Bowers JD, Ito J. Adversarial attacks on medical machine learning. Science. 2019;363:1287–9.
34. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 2018;15:e1002683.
35. Antun V, Renna F, Poon C, Adcock B, Hansen AC. On instabilities of deep learning in image reconstruction and the potential costs of AI. Proc Natl Acad Sci U S A. 2020; pii: 201907377. https://doi.org/10.1073/pnas.1907377117.
36. Matheny ME, Thadaney Israni S, Ahmed M, Whicher D. AI in health care: the hope, the hype, the promise, the peril. Washington, DC: National Academy of Medicine; 2019. https://nam.edu/artificial-intelligence-special-publication
37. Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, Arnaout R, Kohane IS, Saria S, Topol E, Obermeyer Z, Yu B, Butte AJ. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26(9):1320–4.
38. Floridi L, Cowls J, King TC, Taddeo M. How to design AI for social good: seven essential factors. Sci Eng Ethics. 2020;26(3):1771–96.
39. Xie Y, Nguyen QD, Hamzah H, Lim G, Bellemo V, Gunasekeran DV, Yip MYT, Qi Lee X, Hsu W, Li Lee M, Tan CS, Tym Wong H, Lamoureux EL, Tan GSW, Wong TY, Finkelstein EA, Ting DSW. Artificial intelligence for teleophthalmology-based diabetic retinopathy screening in a national programme: an economic analysis modelling study. Lancet Digit Health. 2020;2(5):e240–9.
40. Tufail A, Rudisill C, Egan C, Kapetanakis VV, Salas-Vega S, Owen CG, Lee A, Louw V, Anderson J, Liew G, Bolter L, Srinivas S, Nittala M, Sadda S, Taylor P, Rudnicka AR. Automated diabetic retinopathy image assessment software: diagnostic accuracy and cost-effectiveness compared with human graders. Ophthalmology. 2017;124(3):343–51.
2 Basics of Artificial Intelligence for Ophthalmologists

Ikram Issarti and Jos J. Rozema
Introduction

The past decade has seen a steep rise in the number of applications of Artificial Intelligence (AI), especially for repetitive or complex tasks where humans may quickly suffer from either a drifting attention span or subtle inconsistencies. Such systems are often more cost efficient, thus accelerating their adoption and acceptance, and consequently increasing people's reliance on AI. But an understanding of their inner workings is often lacking; many tend to approach AI as a 'black box', at the risk of uncritically accepting whatever output it produces. Although by its very nature AI is opaque about how it reaches a result, there are statistical methods to objectively assess the quality of its output. As AI becomes a popular subject within the scientific community and health care practice, this chapter explains the basic principles of AI in a comprehensive step-by-step manner, along with examples of ophthalmological applications. Special attention will be paid to the differences between AI, machine learning (ML), and deep learning (DL), highly interconnected techniques that are often confused for one another.

Short History

After being considered science fiction for a long time, the first scientific step towards intelligent machines was taken by Alan Turing, who in 1950 developed the famous Turing test [1]. This involves an interview with open-ended questions to determine whether the intelligence of the interviewee is human or artificial. If this distinction can no longer be made, within certain predefined margins, true machine intelligence has been accomplished. The concept suggests that a machine could, in principle, think and simulate human intelligence through behaviour such as learning, interpreting and communicating. This concept is referred to as artificial intelligence.

The period between 1956 and 1974 is known as the Golden Age of AI. This time saw a massive growth in computing power, allowing researchers to test the idea of McCulloch and Pitts [2] that the brain's neurons may be described by simple logical operators (AND, OR and NOT), and leading to the first AI algorithms, called neural networks. This illustrates how from the very beginning onwards AI has been inspired by biological phenomena to mimic human abilities and behaviour, such as the ability to learn and adapt to real-life scenarios. These ideas were expanded upon in the decades that followed with the introduction of new techniques, until in the 1990s the first ophthalmological applications started to emerge for the screening of glaucoma [3], diabetic retinopathy [4] and keratoconus [5]. A more detailed overview of ophthalmological applications can be found in a recent review paper [6] or the other chapters.

© Springer Nature Switzerland AG 2021 17


A. Grzybowski (ed.), Artificial Intelligence in Ophthalmology,
https://doi.org/10.1007/978-3-030-78601-4_2
18 I. Issarti and J. J. Rozema

Overview

Artificial Intelligence is a very broad field of study encompassing a wide range of techniques that allow machines to display ever more intelligent behaviour (Fig. 2.1). Machine learning is one of the most important subfields of AI. Although ML and AI are often confused, AI also includes other approaches not included in machine learning, such as expert systems: knowledge- or rule-based systems that emulate human cognitive and reasoning abilities by following certain guidelines to perform a decision-making process [7]. Meanwhile, ML refers to a group of mathematical algorithms that learn from experience (data) by mimicking human learning behaviour to perform new tasks. ML is able to fit complex data sets, extract new knowledge, imitate complex behaviour, and predict and classify based on prior data. Another well-known group of algorithms is Deep learning (DL), which is a subset of machine learning based on artificial neural networks. DL is able to simultaneously analyse multiple layers of data. These layers consist of data processing units, called neurons, that allow them to analyse large amounts of data at once while preserving the data's spatial distribution. DL systems have seen significant successes in applications such as pattern recognition, image processing, and speech recognition.

Fig. 2.1  Artificial intelligence techniques (Deep Learning as a subset of Machine Learning, which is itself a subset of Artificial Intelligence)

The training process of ML and DL is very similar to that found in schools, with a professor teaching his students. From a large amount of given data, the algorithm learns how to describe a specific topic into a model (knowledge acquisition), which will subsequently be validated using unseen data to evaluate its generalizability. Finally, the performance of the algorithm is evaluated based on several guidelines given in section "Performance Evaluation".

Data Basis

Data is the fuel of AI and can come from many different sources, such as the web, videos, audio, text, etc. It is comprised of massive amounts of bits, binary values of zeros and ones, that can be reorganized to form structured data that are usually easier to process by AI algorithms, such as a relational database or a spreadsheet. It is also possible to work with unstructured data without predefined formatting (e.g. audio, video, text, etc.), or a hybrid form of structured and unstructured data called semi-structured data. Finally, one can consider time series data, consisting of structured or unstructured data in sequential time steps [8]. A good understanding of data structures allows a proper AI implementation. Some highlights are given in section "Conducting a Machine Learning Analysis", but more details are available in the data mining literature and data pre-processing textbooks [9].

Common Tasks
In medicine, Machine Learning is mostly used to assist physicians with diagnosis, monitoring, and decision making by providing insight into the structure and patterns within large datasets. The most typical tasks for ML are classification, clustering and prediction.

• Classification involves sorting new cases into two or more groups (Fig. 2.2a). In healthcare, classification could be used for diagnosis (healthy or abnormal) or the identification of biological markers.
• Clustering: in clustering the algorithm divides a dataset into several, previously unknown clusters (groups) with certain properties in common (Fig. 2.2b). Clustering can be used to e.g. distinguish the different stages of a disease.
• Prediction consists of building a model based on historical data to forecast unknown parameter values in the future, to e.g. predict the outcome of a surgical procedure or treatment (Fig. 2.2c).
• Regression: while classification problems sort data into different sets of classes or categories, regression problems predict the values of a continuous variable rather than a categorical variable. This task is also referred to as prediction.

Fig. 2.2  Examples of (a) classification, (b) clustering and (c) prediction

Learning Models

AI algorithms can be trained in any one of four methods:
• Supervised Learning ('with professor') teaches a ML algorithm the desired output (answer) given an input with labelled categories. Based on this the algorithm learns the characteristics of each category, so when it is presented with an unseen input, it will be able to assign it to the right output class (category). Supervised algorithms are mostly used for classification problems (Fig. 2.2a), where points can be assigned to three pre-defined classes (e.g. healthy, pathological and suspect), or for prediction problems (Fig. 2.2c), such as predicting the future evolution of a tumour.
• Unsupervised Learning ('without professor') algorithms assign data to multiple subgroups (clusters) with similar properties within the input data without being given desired answers or outputs. Unsupervised learning can be applied for classification problems with unknown outputs, as is illustrated in Fig. 2.2b, where the algorithm identified three clusters based on the available data.
• Semi-supervised learning combines supervised and unsupervised learning by giving the desired output for only a small number of inputs. After training based on the labelled data, the algorithm uses unsupervised learning for the unlabelled data to create new clusters. Ultimately these clusters are themselves labelled and added to the previous outputs. This method is used when not all outputs are available.
• Reinforcement learning is a training method in which an algorithm must define its own response based on trial and error, much as in human learning. This can be applied when there is a continuous change in the situation to which the machine needs to adapt and respond. Although quite advanced, its use remains limited within the field of medicine to e.g. systems that learn from the successes and failures of clinical trials in the literature to suggest new approaches for testing.

Machine Learning Algorithms

There are dozens of machine learning algorithms described in the literature. For reasons of conciseness, only the most common ones will be listed below.

(Non)-linear Regression

Regression analysis is a well-known statistical method that builds a mathematical model from prior observations to make a prediction, which constitutes the basis of machine learning. If the relation between the input and output is linear, the model is called linear regression (Fig. 2.3). For example, one can score the progress of a disease based on several observed variables (x₁, x₂,…, xₙ) by assigning a weight (w₁, w₂,…, wₙ) to each variable indicating their relative importance.


The overall score is then defined as a function of the weighted variables as follows:

$\mathrm{Score} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$

with weights that are estimated through an overall fit of the observed dataset. Similarly, for more complex relationships non-linear regression may be used, using higher orders of the variables (x₁, x₂,…, xₙ). A second-order non-linear regression for n parameters would look like:

$\mathrm{Score} = w_1 x_1 + \cdots + w_n x_n + w_{11} x_1^2 + w_{12} x_1 x_2 + \cdots + w_{nn} x_n^2$

Fig. 2.3  Examples of (a) linear and (b) non-linear fitting
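A minimal sketch of fitting such a weighted score with scikit-learn is shown below; the data, weights and variables are synthetic placeholders rather than real clinical values.

```python
# Minimal sketch: fitting a linear 'Score' model with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 patients, 3 observed variables x1..x3
true_w = np.array([0.8, -0.5, 1.2])      # hypothetical 'true' weights
y = X @ true_w + rng.normal(scale=0.1, size=100)  # noisy scores

model = LinearRegression().fit(X, y)
print(model.coef_)            # estimated weights w1..w3
print(model.predict(X[:2]))   # predicted scores for the first two patients

# For the second-order (non-linear) variant, expand the inputs first:
X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model2 = LinearRegression().fit(X2, y)
```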

Logistic Regression

This is an easy to implement but powerful classification algorithm that gives binary outputs (e.g. diseased or healthy) [8, 10].

Naïve Bayes Algorithm

Naïve Bayes is one of the simplest supervised classification algorithms available, based only on Bayes' theorem. This theorem relates the probability P(c|x) that a certain event c occurs under circumstances x with the probability P(x|c) that circumstances x are present once event c has already happened, and the ratio of the probabilities P(c) and P(x) of event c and circumstances x individually. Formally, this is written as:

$P(c \mid x) = P(x \mid c)\, P(c) / P(x)$

Although this may seem complicated, think for example of the situation where a patient must be classified as normal (NL) or keratoconus (KC) using only minimum pachymetry Pmin, given as 'Thin' where Pmin < 500 μm and 'Thick' where Pmin ≥ 500 μm. Suppose a new patient appears with Pmin < 500 μm, then the probability that this patient really has keratoconus is given by:

$P(\mathrm{KC} \mid \mathrm{Thin}) = \dfrac{P(\mathrm{Thin} \mid \mathrm{KC})\, P(\mathrm{KC})}{P(\mathrm{KC})\, P(\mathrm{Thin} \mid \mathrm{KC}) + P(\mathrm{NL})\, P(\mathrm{Thin} \mid \mathrm{NL})}$

where all the terms on the right-hand side can easily be estimated beforehand from a large data set. In practice the classification is based on many variables, however, increasing the chance of interdependence between parameters. Naïve Bayes chooses to ignore this interdependence, which, despite being a severe oversimplification, has demonstrated good results in classification, especially in text recognition, spam detection, and medical diagnosis.
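For concreteness, the snippet below evaluates this formula numerically; the prevalence and conditional probabilities are invented for illustration only and would in practice be estimated from a large data set.

```python
# Bayes' rule for the keratoconus example, with made-up probabilities.
p_kc = 0.05              # assumed prevalence of keratoconus, P(KC)
p_nl = 1 - p_kc          # P(NL)
p_thin_given_kc = 0.90   # assumed P(Thin | KC)
p_thin_given_nl = 0.10   # assumed P(Thin | NL)

posterior = (p_thin_given_kc * p_kc) / (
    p_kc * p_thin_given_kc + p_nl * p_thin_given_nl)
print(f"P(KC | Thin) = {posterior:.2f}")  # ~0.32 with these numbers
```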
Support Vector Machine (SVM)

SVM is a more recently developed form of supervised machine learning used for both regression and classification. These algorithms are preferred by experts for their accurate and reproducible results with less computation power, while being robust enough to handle small samples. In short, SVM is a binary classifier that searches for a dividing plane to separate distinct data. The orientation and the position of this dividing plane are determined by the closest points, called support vectors, in an attempt to maximise the margins between distinct groups [8, 10].
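A minimal sketch of such a classifier, using the SVC implementation in scikit-learn on synthetic two-dimensional data, could look as follows (all values are placeholders):

```python
# Minimal SVM sketch with scikit-learn (synthetic two-class data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
healthy = rng.normal(loc=[1, 1], scale=0.3, size=(50, 2))
diseased = rng.normal(loc=[3, 3], scale=0.3, size=(50, 2))
X = np.vstack([healthy, diseased])
y = np.array([0] * 50 + [1] * 50)      # 0 = healthy, 1 = diseased

clf = SVC(kernel="linear").fit(X, y)   # finds the maximum-margin hyperplane
print(clf.support_vectors_[:3])        # closest points defining the margin
print(clf.predict([[2.8, 3.1], [0.9, 1.2]]))
```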
K-Means

As a form of unsupervised clustering, this method groups unlabelled input data into a predefined number of K clusters. As such, it can operate on a large dataset. For example, given a dataset of normal and diseased subjects, but without a clear classification of the disease's features, K-means can be used to define K distinct clusters with statistical characteristics that may be used clinically. It does this by first randomly choosing centroid points (Fig. 2.4) and seeing whether the neighbouring points can efficiently be divided into the requested number of clusters by assessing their distance to the centroids. If this is not the case, the centroid points are iteratively shifted until a certain minimum distance is achieved.

Fig. 2.4  (a) Input data; (b) K-means clustering, starting from randomly placed centroids (open squares) that are gradually adjusted until an equilibrium has been reached (open circles)
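A hedged sketch of this procedure with scikit-learn's KMeans on synthetic data is given below; the cluster locations are arbitrary examples.

```python
# K-means sketch: grouping unlabelled data into K=2 clusters (scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([2, 2], 0.4, (60, 2)),
               rng.normal([6, 5], 0.4, (60, 2))])   # unlabelled input data

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroid positions
print(km.labels_[:10])       # cluster assigned to each point
```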
K-Nearest Neighbor (KNN)

KNN is a supervised machine learning method to address regression and classification problems. It is considered a 'lazy' algorithm, as it does not require training, relying instead on the assumption that similar inputs remain close to one another. The algorithm computes the distance between a test data point and its K nearest neighbours. Whenever a new data point is presented for classification, KNN will look in the database at the K points nearest to the new point to determine to what group this point should belong.
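A minimal KNN sketch with scikit-learn, again on synthetic data, might look like this:

```python
# KNN sketch: classify a new point by majority vote of its K nearest neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # 'lazy': just stores the data
print(knn.predict([[2.7, 3.2]]))         # class of the 5 nearest stored points
print(knn.predict_proba([[2.7, 3.2]]))   # fraction of neighbour votes per class
```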
nected through a large number of axons to
exchange signals. When a neuron receives a spe-
Decision Trees cific input signal through its connections, its cel-
lular body will generate a new signal through the
A decision tree is a directed data structure that axons and transmit to other dendritic cells
uses a series of yes/no questions for classifica- (Fig.  2.5) [11]. This biological architecture is
tion. The start of the decision tree is the root node emulated in artificial neurons, where dendritic
with a yes/no input question. From this starts a signals represent the neural inputs X = (xj)j that
series of decision paths (branch) where the algo- are assigned synaptic weights θij. The cellular
rithm makes a decision through a computed prob- body is represented by a nonlinear activation
ability that ultimately leads to the leaf of the tree, function that operates on the input signal to cre-
corresponding with the outcome [10]. Decision ate an output signal y that is passed on to the next
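The sketch below contrasts a single decision tree with a random forest using scikit-learn; the data, tree depth and number of trees are arbitrary illustrative choices.

```python
# Decision-tree and random-forest sketch with scikit-learn (synthetic data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy yes/no outcome

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)         # single tree
forest = RandomForestClassifier(n_estimators=100).fit(X, y)  # averaged trees
print(tree.predict(X[:3]), forest.predict(X[:3]))
print(forest.score(X, y))   # training accuracy (beware of overfitting!)
```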
Artificial Neural Networks (ANN)

Artificial neural networks are a family of algorithms that, inspired by the human brain, form interconnected structures of artificial neurons. These structures interact with each other to mimic the complex behaviour found in real neurons, such as self-adapting, self-organizing, and real-time learning from examples. The human brain consists of many neurons that are interconnected through a large number of axons to exchange signals. When a neuron receives a specific input signal through its connections, its cellular body will generate a new signal through the axons and transmit it to other dendritic cells (Fig. 2.5) [11]. This biological architecture is emulated in artificial neurons, where dendritic signals represent the neural inputs X = (x_j)_j that are assigned synaptic weights θ_ij. The cellular body is represented by a nonlinear activation function σ(wx + b) that operates on the input signal to create an output signal y that is passed on to the next layer of neurons. The strength of ANNs lies in their many interconnected artificial neurons, exchanging signals in a forward and backward direction. Dozens of ANN algorithms have been described in the literature, some supervised, such as the multilayer perceptron (MLP) and the feedforward neural network (FFN), and some unsupervised, such as self-organization maps (SOM). Artificial neural networks can be used for regression, classification and clustering problems, and have been applied to highly complicated tasks like forecasting and system control.

Fig. 2.5  Comparison of (a) biological and (b) artificial neurons
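To make the artificial neuron of Fig. 2.5b concrete, the sketch below computes the output σ(w·x + b) of a single neuron in plain NumPy; the input values, weights and bias are arbitrary.

```python
# A single artificial neuron: weighted inputs through a sigmoid activation.
import numpy as np

def sigmoid(z):
    """Nonlinear activation representing the cellular body."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # dendritic input signals x1..x3
w = np.array([0.4, 0.1, -0.6])   # synaptic weights w1..w3
b = 0.2                          # bias term

y = sigmoid(w @ x + b)           # output passed to the next layer of neurons
print(y)
```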
Deep Neural Networks (DNN)

These are neural networks with many layers and a large number of neurons. Although this design comes at a high computational cost, DNNs show very accurate results in medical imaging, vision recognition, etc. Some classic examples of deep learning algorithms are the Convolutional Neural Network, the Recurrent Neural Network, and Self-Organization Maps. Deep neural network is an equivalent term for deep learning.

Convolutional Neural Network (CNN)
This is an advanced neural network algorithm able to simultaneously analyse multiple input layers while preserving their spatial connection. In image analysis, for example, classical neural networks would translate each pixel to a vector, thus losing the spatial connection and correlations between them [7]. Meanwhile, the multilayer architecture of CNN overcomes this limitation and preserves the relationship between the pixels.

Recurrent Neural Network (RNN)
RNNs are neural network models designed for processing sequential data, such as a time series or a sequence of data with repetitive properties (e.g. DNA, which has long sequences of base pairs).

Self-Organization Maps (SOM)
These unsupervised neural network algorithms are inspired by the brain's visual cortex. Their neurons are organised in close spatial proximity to one another to process the input with the shortest possible transmission of signals during processing. SOM is mostly used for pattern recognition, to extract features, and to compress highly dimensional data [12, 13].

Reinforcement Learning (RL)
This method is an idealized, goal-directed computational approach that closely imitates human learning behaviour, consisting of interacting with an environment and assessing its response as a source of knowledge. Reinforcement learning algorithms are not told what to do, but instead figure out their best course of action through trials, mapping situations and maximizing a reward function [14]. In theory, this method could be applied to e.g. minimize the postoperative complications of cataract surgery by choosing the best surgical strategy that minimizes postoperative refractive errors.
Performance Evaluation

Metrics

Performance evaluation assesses how an algorithm handles unseen data that is representative for a general population. The common practice, called holdout validation, is to randomly split the available data into a training set (60%), a test set (15%) and a validation set (25%), each with a different purpose. The training set is fed into the machine learning algorithm, the test set is used to perform an internal validation at each iteration of the training, while the validation set helps to assess the algorithm's performance after it stabilized.

ML models use metrics that compare the real, measured values with the models' predictions to assess performance in each iteration using the training and test sets. This procedure is done using the training set to evaluate the learning performance, but also the validation set to evaluate the model's generalisability. The most common metrics include accuracy, sensitivity, specificity, and precision, defined in Table 2.1.

Table 2.1  Performance metrics

True positive (TP): abnormal cases identified correctly
True negative (TN): normal cases identified correctly
False positive (FP): normal cases classified as abnormal (Type I error)
False negative (FN): abnormal cases classified as normal (Type II error)
Accuracy = (TP + TN) / (TP + FP + FN + TN): percentage of times the algorithm is correct
Sensitivity = TP / (TP + FN): percentage of true positives correctly identified
Specificity = TN / (TN + FP): percentage of true negatives correctly identified
Precision = TP / (TP + FP): ratio of correctly classified positives; high precision relates to a low false positive rate
Cut-off: value or point designated as the limit of a group

During the training process it is especially important to be mindful of overfitting and underfitting of the model. Overfitting occurs when the model has become overly detailed to the point where it begins to fit random statistical variations (noise). This can be noticed by a continued improvement of the metrics of the training set, but a stabilization or worsening of the test set's metrics (Fig. 2.6). Underfitting, on the other hand, is the opposite, where the model cannot account for various relations due to a lack of well-discriminating parameters. Consequently, it is good practice to rule out overfitting and underfitting before computing the performance metrics on the validation set. Meanwhile, in a good fit the performance of the training and test sets should be very similar.

Fig. 2.6  Under- and overfitting (training and test error as a function of model complexity)

Confusion Matrix

One popular way to represent performance is using a confusion matrix that compares the algorithm's classifications to the actual classification using the metrics of Table 2.1, as shown in Table 2.2. Ideally, the non-diagonal values should remain 0. Usually confusion matrices are binary, but they may be expanded to include more than two classes.

Table 2.2  Confusion matrix

                     Actually positive   Actually negative
Predicted positive   True positive       False positive
Predicted negative   False negative      True negative
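These definitions translate directly into code; in the sketch below the four confusion-matrix counts are arbitrary example values.

```python
# Computing the Table 2.1 metrics from example confusion-matrix counts.
TP, TN, FP, FN = 45, 40, 10, 5   # arbitrary example counts

accuracy = (TP + TN) / (TP + FP + FN + TN)
sensitivity = TP / (TP + FN)     # true positive rate
specificity = TN / (TN + FP)     # true negative rate
precision = TP / (TP + FP)

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}, precision={precision:.2f}")
```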
Receiver Operating Characteristic Curve (ROC)

The ROC curve is a plot of the true positive rate as a function of the false positive rate (i.e. 100 − specificity) for different cut-offs. This aims to find the optimal cut-off that maximizes the balance between specificity and sensitivity. Typically, curves close to the top-left corner of the plot represent "ideal" models with a false positive rate of zero and a true positive rate of one, while curves close to the diagonal approximate random noise (Fig. 2.7). ROC curves can also be represented by the Area Under the Curve (AUC), which ranges between 0.5 for random noise and 1 for perfect models.

Fig. 2.7  Examples of ROC curves (a perfect classifier passes through the top-left corner, while a random classifier follows the diagonal)
model performs well based on the metrics, ROC
K-Fold Cross Validation Testing curves and confusion matrices based on the test
sets, it is time to assess the model performance
This validation technique splits the training data using the unseen validation data. Ideally, this
into K folds, selects one for validation and builds should be completely independent data from a
the model based on the remaining (K−1) folds. It different centre, if possible. If this validation is
can be considered as a K-repetitive hold out vali- also satisfactory, the model development is com-
dation, where the test set and the validation set plete. A full overview of these steps is given in
are independent for each iteration or run. There Fig. 2.8.
are several variations of the technique, such as
Stratified K-Fold Cross Validation, Leave-P-Out
Cross Validation, etc.  oftware for Machine Learning
S
Implementation

 onducting a Machine Learning


C There are many software packages available that
Analysis allow quick implementation of machine learning.
One of the most popular is Python. This open
To start a machine learning or deep learning anal- source environment has a user-friendly syntax, is
ysis, you first need to have a clear understanding easy to learn and allows for rapid prototyping,
of the problem you are trying to address. First, which is often used for e.g. web development,
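Under the assumption of a simple tabular data set and a scikit-learn environment, a compressed sketch of this workflow might look as follows; every step would be far more elaborate in a real project.

```python
# End-to-end sketch of the workflow of Fig. 2.8 (synthetic data, scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 5))              # pre-processed [input] data
y = (X[:, 0] - X[:, 2] > 0).astype(int)    # [output] labels

# Hold-out split roughly following the 60/15/25 scheme described above.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.60)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, train_size=0.375)

model = RandomForestClassifier().fit(X_train, y_train)  # model (re)training
print("test:", accuracy_score(y_test, model.predict(X_test)))       # metrics ok?
print("validation:", accuracy_score(y_val, model.predict(X_val)))   # final check
```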
Software for Machine Learning Implementation

There are many software packages available that allow quick implementation of machine learning. One of the most popular is Python. This open source environment has a user-friendly syntax, is easy to learn and allows for rapid prototyping, and is often used for e.g. web development, mathematics or system scripting. Although implementing an AI algorithm can be time consuming, Python offers many handy libraries dedicated to AI, such as Keras, TensorFlow, and SciKit-learn. Other than a Python library, TensorFlow is also a leading stand-alone open source software package for machine learning and deep learning, used by ML beginners and practitioners alike. It has an open source toolkit for building machine learning pipelines, as well as support for applications such as computer vision, natural language processing, speech recognition, and general predictive analysis [15]. Meanwhile, Keras is an open-source neural network library that is popular for its ease of use for fast experimentation and prototyping. It supports implementing deep learning algorithms such as convolutional neural networks and recurrent neural networks.

Weka is a Java-based open source package with a graphical user interface that can be operated with just a few mouse clicks or some minimal programming. This makes it ideal for beginners in machine learning. It implements several machine learning algorithms for classification, clustering and data pre-processing [16], as well as data mining.

Another widely used programming language is called R, which is optimized for statistical data analysis. It uses mathematical data structures such as vectors, lists and data frames that are easy to manipulate by ML. R supports various machine learning algorithms in the form of community-authored packages such as K-nearest neighbours, naive Bayes, decision trees, regression methods, neural networks and support vector machines [17].

Finally, MATLAB is a high-end, high-performance commercial package for scientific computing that integrates programming, computation and visualisation into a matrix-based interactive system. It has many extensions, called toolboxes, that include machine learning and deep learning. These allow implementing, training and validating various algorithms with applications in classification, clustering, and prediction. One major advantage of MATLAB is its user-friendly visualization and data pre-processing facilities, along with its ability to interface with other ML tools such as TensorFlow and Keras.
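As an impression of how compact such code can be, the sketch below defines a small convolutional network in Keras (assuming a working TensorFlow installation); the input size and layer choices are arbitrary and would need to be matched to the actual imaging data.

```python
# Minimal Keras sketch: a small CNN for binary image classification.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),          # e.g. small RGB fundus crops
    layers.Conv2D(16, 3, activation="relu"),   # feature extraction
    layers.MaxPooling2D(),                     # dimensionality reduction
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),     # binary output (e.g. disease yes/no)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# Training would then be a single call: model.fit(train_images, train_labels, ...)
```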
Applications in Ophthalmology

Current medical diagnostic technologies generate massive amounts of data that are difficult to analyse and hamper practitioners by their lack of interpretability. This prompts clinical practitioners to focus on a limited number of diagnostic criteria alongside their subjective expertise, creating the risk that certain indicators for pathology may escape their attention. With its ability to extract meaningful knowledge in an objective and automatic way, AI has become an important tool for medical diagnosis and decision making. The following, along with the subsequent chapters, highlights some examples of AI applications in ophthalmology, with the focus on the technique and the achieved results.

Glaucoma

The earliest application of ML in ophthalmology [18] demonstrated that neural networks could be used to screen visual field data for signs of glaucoma with a performance similar to that of expert observers (ML: sensitivity = 65%, specificity = 71%; experts: sensitivity = 59%, specificity = 74%). Later studies improved upon these results and confirmed that ML can outperform clinical practitioners in the detection of glaucoma [19]. Current advanced ML algorithms, such as self-organisation maps with decision trees [20], are now able to detect glaucoma with the very high accuracy of 0.98 [21].

Retinal Disease

Deep learning is often used to accurately diagnose retinal conditions such as Diabetic Retinopathy (DR), Age-related Macular Degeneration (AMD), and Retinopathy of Prematurity (ROP). For example, DL was used to assess the quality of fundus images for ROP, achieving an AUC of 0.95 [22], while a CNN for the diagnosis of ROP achieved AUCs of 0.94 and 0.99 for normal and diseased cases, respectively [23]. These values are similar to those of a human expert. A similar DL algorithm [24] was also applied to retinal images for the detection of referable DR, resulting in a sensitivity of 97.5% and a specificity of 93.4%. Meanwhile, predictive models identified the retinal biomarkers for drusen regression in intermediate AMD with an AUC of 0.75 [25]. Similarly, a DL system by Arcadu et al. [26] predicted two-step worsening of diabetic retinopathy over a period of 12 months with an AUC of 0.79. These examples demonstrate the potential for using ML to assess the progression risk of retinal diseases.

Keratoconus

Although the earliest attempts with AI targeted advanced stages of the disease, they showed potential for early detection and improved interpretability. Nowadays, AI is used to identify the earliest stages of the disease and score its severity. This early detection is especially important since it allows therapeutic action to halt progression, or to screen these patients out from refractive surgery, thus avoiding postoperative complications. For example, Souza et al. [27] evaluated three classifiers (SVM, MLP, Radial Basis Functions), reporting a sensitivity of 98–99%; Arbelaez et al. [28] applied SVM to tomographic and pachymetry data, yielding an accuracy of 98.2% for keratoconus and 97.3% for early keratoconus; and other studies reported even better results for the earliest cases [28, 38]. Lopes et al. [29] examined tomographic data using Random Forests, with a sensitivity of 85.2% and a specificity of 96.6% for the earliest cases. Finally, a hybrid ML algorithm combining supervised and unsupervised learning based on FFNN reported a sensitivity of 97.8% and specificity of 95.6% for suspect keratoconus detection [30].

Refractive Surgery

Recently, a Decision Forest classifier was developed for the individual, postoperative risk assessment after LASIK refractive surgery for follow-up periods of 12 years [31]. Meanwhile, Yoo et al. [32] developed a decision support system to screen out unsuitable candidates for refractive surgery. This study combined data from both medical devices and experts to achieve an AUC of 0.983, highlighting the benefit of human-AI collaboration. Finally, an artificial multilayer perceptron was used to predict the improvement in visual quality for KC patients implanted with intracorneal ring segments based on corneal curvature and astigmatism, yielding a best error of 0.97 D for corneal power and 0.93 D for astigmatism [33].
Other Applications

One of the striking demonstrations of the abilities of deep learning is that it was able to extract new knowledge from retinal fundus images to predict cardiovascular risk factors [34] or to assess the association between retinal findings and Alzheimer disease [35]. Although ophthalmologists have been looking at fundus images for many years, this knowledge was impossible to extract before because of the large variability in retinal features, patterns, colours, etc.

Common Misconceptions

Since there are many misconceptions about AI, it is important to address these to inspire ophthalmologist users to have realistic expectations.

Technical Aspects

First of all, AI is not a magical tool that can solve any problem. Instead, AI systems are knowledge-driven algorithms that analyse massive amounts of data and extract patterns much faster than human biological synapses. But current AI systems require clearly defined problems to perform their tasks and are unable to go beyond focused, specific tasks. Therefore, if the user does not provide such a well-delineated problem or high-quality data to work with, the outcome will likely be underwhelming.

Contrary to what many users think, larger datasets will not always yield better results. The training data can be increased when it is not descriptive enough, or if there are patterns that cannot be detected by the algorithm. But when a trained AI algorithm has an acceptable validation error, adding more data may lead to overtraining the algorithm, divergence issues and an increased computational cost.

Similarly, your choice of AI algorithm does not make a big difference, since in essence all algorithms are a similar combination of (non-)linear functions. Changing the AI algorithm may therefore slightly affect accuracy but will never result in a major difference. Instead, to get a remarkable improvement, you can revise the design strategy, add more descriptive features, remove correlations, increase the data set size, implement a hybrid machine learning system (e.g. supervised combined with unsupervised), etc. There are, however, exceptions: some applications are more suited for implementation with Deep Learning or Reinforcement Learning. For example, in image analysis DL is highly recommended, for time series such as weather forecasting ML may be the most suitable, and in game automation RL would be the best choice.

Social Aspects

There is a widespread fear that in many professions AI will eventually make human workers obsolete. While this may be true for boring, repetitive tasks that can be well described, new job opportunities will arise in the training, supervision and maintenance of AI systems. For now, the list of jobs that can be replaced by AI is limited, given the technical limitations listed above, along with the fact that computers are still unable to emulate important human characteristics, such as teamwork, interaction, creativity, adaptability, empathy, etc., that are essential in a medical environment [36]. Although one could argue that AI could emulate some of these important human characteristics, it has yet to show true creativity in open-ended problems, awareness or empathy. As such, AI lacks the all-important human aspect of the doctor-patient relationship that enhances the patient's understanding of his situation, his treatment compliance and the therapeutic effectiveness [37], as well as his physical, emotional, and social needs. Some aspects, such as the initial anamnesis or questionnaires, could perhaps be handled by a virtual assistant similar to Siri or Alexa, provided a physician will later go over the responses with the patient to ensure nothing important was missed. The same is true for the screening and forecasting examples given in section "Applications in Ophthalmology", which should serve as a Decision Support System to help the physician reach a conclusion, rather than a Diagnostic System to replace the physician's expertise altogether.
Finally, AI may help with screening, patient follow-up and scheduling, filling out patient files, letters and administration, provided it is supervised and corrected afterwards by a physician. Current AI systems would therefore have to be embedded in a human context. Provided such AI systems are developed with respect for human interaction, empathy and privacy, this could optimize time use and reduce the waiting times in hospitals.

Conclusion

For very well-delineated tasks in ophthalmology, AI can reach exceptional levels of performance that supersede human ophthalmologists. The technology suffers from a number of limitations, however, that make it unwise to rely solely on its output. Instead, AI systems are ideal to work in partnership with ophthalmologists, for example for disease detection or as a decision support system.

References

1. Turing AM. I.—Computing machinery and intelligence. Mind. 1950;LIX:433–60.
2. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. 1943;5:115–33.
3. Goldbaum MH, et al. Interpretation of automated perimetry for glaucoma by neural network. Invest Ophthalmol Vis Sci. 1994;35:3362–73.
4. Gardner GG, Keating D, Williamson TH, Elliott AT. Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool. Br J Ophthalmol. 1996;80:940–4.
5. Maeda N, Klyce SD, Smolek MK. Neural network classification of corneal topography. Preliminary demonstration. Invest Ophthalmol Vis Sci. 1995;36:1327–35.
6. Consejo A, Melcer T, Rozema JJ. Introduction to Machine Learning for ophthalmologists. Semin Ophthalmol. 2019;34:19–41.
7. Choi RY, Coyner AS, Kalpathy-Cramer J, Chiang MF, Campbell JP. Introduction to Machine Learning, Neural Networks, and Deep Learning. Transl Vis Sci Technol. 2020;9:14.
8. Taulli T. Artificial intelligence basics: a non-technical introduction. Apress; 2019. https://doi.org/10.1007/978-1-4842-5028-0.
9. Aggarwal CC. Data mining: the textbook. Springer; 2015.
10. Rebala G, Ravi A, Churiwala S. An introduction to Machine Learning. Springer; 2019.
11. Lo JT-H. Functional model of biological neural networks. Cogn Neurodyn. 2010;4:295–313.
12. Gupta N, Trindade BL, Hooshmand J, Chan E. Variation in the best fit sphere radius of curvature as a test to detect keratoconus progression on a Scheimpflug-based corneal tomographer. J Refract Surg. 2018;34:260–3.
13. Kohonen T. Self-organization of very large document collections: state of the art. In: Niklasson L, Bodén M, Ziemke T, editors. ICANN 98. Springer; 1998. p. 65–74. https://doi.org/10.1007/978-1-4471-1599-1_6.
14. Sutton RS, Barto AG. Reinforcement learning: an introduction. MIT Press.
15. Hope T, Resheff YS, Lieder I. Learning TensorFlow. O'Reilly Media; 2017.
16. Witten I, Cunningham SJ, Frank E. Weka: practical machine learning tools and techniques with Java implementations.
17. Lantz B. Machine learning with R: learn how to use R to apply powerful machine learning methods and gain an insight into real-world applications. Packt Publishing; 2013.
18. Goldbaum MH, et al. Interpretation of automated perimetry for glaucoma by neural network. Invest Ophthalmol Vis Sci. 1994;35:3362–73.
19. Goldbaum MH, et al. Comparing machine learning classifiers for diagnosing glaucoma from standard automated perimetry. Invest Ophthalmol Vis Sci. 2002;43:162–9.
20. Huang M-L, Chen H-Y, Lin J-C. Rule extraction for glaucoma detection with summary data from StratusOCT. Invest Ophthalmol Vis Sci. 2007;48:244–50.
21. Kim SJ, Cho KJ, Oh S. Development of machine learning models for diagnosis of glaucoma. PLoS One. 2017;12:e0177726.
22. Coyner AS, et al. Automated fundus image quality assessment in retinopathy of prematurity using deep convolutional neural networks. Ophthalmol Retina. 2019;3:444–50.
23. Brown JM, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136:803–10.
24. Gulshan V, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316:2402–10.
25. Bogunovic H, et al. Machine learning of the progression of intermediate age-related macular degeneration based on OCT imaging. Invest Ophthalmol Vis Sci. 2017;58:BIO141–50.
26. Arcadu F, et al. Deep learning algorithm predicts diabetic retinopathy progression in individual patients. NPJ Digit Med. 2019;2:1–9.
27. Souza MB, Medeiros FW, Souza DB, Garcia R, Alves MR. Evaluation of machine learning classifiers in keratoconus detection from Orbscan II examinations. Clinics. 2010;65:1223–8.
28. Arbelaez MC, Versaci F, Vestri G, Barboni P, Savini G. Use of a support vector machine for keratoconus and subclinical keratoconus detection by topographic and tomographic data. Ophthalmology. 2012;119:2231–8.
29. Lopes BT, et al. Enhanced tomographic assessment to detect corneal ectasia based on artificial intelligence. Am J Ophthalmol. 2018;195:223–32.
30. Issarti I, et al. Computer aided diagnosis for suspect keratoconus detection. Comput Biol Med. 2019; https://doi.org/10.1016/j.compbiomed.2019.04.024.
31. Achiron A, et al. Predicting refractive surgery outcome: machine learning approach with big data. J Refract Surg. 2017;33:592–7.
32. Yoo TK, et al. Adopting machine learning to automatically identify candidate patients for corneal refractive surgery. NPJ Digit Med. 2019;2:1–9.
33. Valdés-Mas MA, et al. A new approach based on Machine Learning for predicting corneal curvature (K1) and astigmatism in patients with keratoconus after intracorneal ring implantation. Comput Methods Prog Biomed. 2014;116:39–47.
34. Poplin R, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2:158–64.
35. Schrijvers EMC, et al. Retinopathy and risk of dementia: the Rotterdam Study. Neurology. 2012;79:365–70.
36. Korot E, et al. Will AI replace ophthalmologists? Transl Vis Sci Technol. 2020;9:2.
37. Blasi ZD, Harkness E, Ernst E, Georgiou A, Kleijnen J. Influence of context effects on health outcomes: a systematic review. Lancet. 2001;357:757–62.
38. Smadja D, et al. Detection of subclinical keratoconus using an automated decision tree classification. Am J Ophthalmol. 2013;156:237–246.e1.
3  Overview of Artificial Intelligence Systems in Ophthalmology

Paisan Ruamviboonsuk, Natsuda Kaothanthong, Thanaruk Theeramunkong, and Varis Ruamviboonsuk

One of the first successful systems of artificial intelligence (AI) in health care can be traced back to a study in the late 1970s. In this study by Yu et al. [1], a computer was able to recommend choices of antibiotics for treatment of meningitis with an acceptability rate of 65%. This rate may not be very high, but the corresponding acceptability rates of faculty specialists who performed the same task were only 42.5% to 62.5%. It is obvious that this early AI system had a better performance than the specialists.

About two decades later, in the late 1990s, a system of AI in ophthalmology by Sinthanayothin et al. [2] was able to recognize the optic disc from retinal images with both sensitivity and specificity as high as 99.1%, and to recognize the fovea from the same images with a sensitivity and specificity of 80.4% and 99.1%, respectively. These results were far more promising than the results of the earlier AI system for choosing antibiotics stated previously.

It took another period of about two decades before another system of AI in ophthalmology, the iDx-DR [3], became the first system of AI in health care approved by the United States Food and Drug Administration (U.S. FDA) for automated detection of diabetic retinopathy (DR) for referrals to ophthalmologists in primary care settings.

The approval of the iDx-DR has placed AI in ophthalmology at the forefront of AI in health care, even though the studies of AI in other fields, such as pathology and radiology, may outnumber the studies of AI in ophthalmology. While the majority of studies of AI in health care focus on the objective of screening or early detection of diseases, such as screening for DR, AI is also useful for other tasks in ophthalmology [4]. AI has been studied for automated segmentation of retinal layers in macular edema due to DR, age-related macular degeneration (AMD), and retinal vein occlusion (RVO); for automated segmentation of the optic nerve head (ONH) in glaucoma; and for automated extraction of features, such as the nucleus and capsule in cataract. In addition, AI has been studied for therapeutic and prognostic predictions, such as prediction of the requirement for anti-vascular endothelial growth factor (anti-VEGF) injections and prediction of visual outcome after treatment of AMD.

AI, Machine Learning, and Deep Learning

In overview, deep learning (DL), the widely used system of AI today, is a subset of machine learning (ML), whereas ML is a subset of AI. ML was commonly used before the current era of DL. The mention of ML in today's perspective usually means this conventional ML, not including DL. Though the current use of AI has generally shifted from conventional ML to DL, it is misleading to understand that ML is rarely useful today; it depends on the task. ML can still achieve robust performances in many applications in health care, particularly when combined with DL.

ML is an application in AI that enables a computer system to learn from input data. The learning process can also be called training. Knowledge obtained in this training phase is used by the system to infer the output of new data; this process is also called inferencing, see Fig. 3.1. There are two common learning approaches, supervised and unsupervised learning. The difference between the two is the input knowledge which is provided by humans. Supervised learning requires labelled inputs, such as images labelled with the outcomes disease or no disease, whereas unsupervised learning does not. It learns by finding homogeneous groups of the input data using the similarity of their features [5, 6].

Fig. 3.1  Manual feature extraction in Machine Learning (training: raw images and their labels pass through feature extraction into a machine learning algorithm; prediction: features of a new image are fed to the trained predictive model to obtain a predicted label)

From a technical perspective, as shown in Fig. 3.1, most conventional ML algorithms, either supervised or unsupervised approaches, require the process of feature extraction, which usually needs a domain expert. The application of conventional ML requires an expert to guide the feature extraction process on how to transform the raw data, such as the pixel values of an image, into a feature vector for a model to learn for pattern recognition. To obtain a good model or algorithm, therefore, we need a suitable set of good features for training.

In the past, a natural solution towards a good model was to apply several feature extraction methods on the same image to obtain a set of different feature representations. However, as a trade-off, this plentiful number of generated features could in turn lead to sparseness of the features. This is called the curse of dimensionality [7]. As a result, similar patterns of the same class of features could hardly be detected [8].

DL, on the other hand, allows the system to automatically extract features without assistance from a domain expert.


This is why it is known as a "black box". The architecture of DL is a multilayered stack of simple modules, see Fig. 3.2. It is capable of discovering the feature representations from a set of raw input data for classification [9]. Each module transforms the input from the previous levels into a representation at a higher, slightly more abstract level. With multiple levels of transformation, very complex features, which are much too high dimensional to be accessible for human interpretation, are extracted and inferences can be performed [10].

Fig. 3.2  Automatic feature extraction using the connectionist approach: Deep Learning (raw images and their labels are fed directly into a network of input, hidden and output layers, yielding a predictive model that assigns a label to a new image)

Overview of Conventional ML Algorithms

Supervised Learning Approach

Given input-output pairs assigned by humans, supervised learning finds a pattern of input features that discriminates between the different desired outputs. This pattern is considered as knowledge that is used to predict the output of a new input feature. There are many methods for supervised learning.

Naïve Bayes is a probabilistic method based on Bayes' theorem. It finds the probability that the desired output, denoted by A, will occur when the input feature, denoted by B, is presented, see the equation below:

$P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$

For example, in an image segmentation application, A is defined as a class, such as disease or no disease, and B is a feature extracted from a pixel in an image. The pixel is classified as either disease or no disease using a ratio of the likelihood of the feature B occurring in the area of class A.

Naïve Bayes can also be applied for image classification, for example, the probability that the features of an input image are DME or non-DME. There are many Naïve Bayes theorem-based methods, such as multinomial Naïve Bayes for predicting classes and Gaussian Naïve Bayes for predicting continuous values [11].
Support Vector Machine (SVM) finds a line segment or a hyperplane that optimally discriminates the features of two classes. The data on one side of the hyperplane should contain the data of the same class as much as possible. Figure 3.3 shows an example of an image classification application using SVM, where each point represents an image feature. New data, when given to SVM, are classified according to their positions on the plane [12]. SVM can be applied for both image classification and image segmentation; the latter by assessing features of each pixel and classifying them as either background (no disease) or foreground (disease) for segmentation (Fig. 3.11).

Fig. 3.3  An illustration of a classification using a support vector machine (SVM)

Decision Tree is a binary tree-based method that recursively divides the features of the input in the training dataset into two parts until the optimal split between each class of the output has been reached. For each separation, the value of a feature is used to optimally divide the features into two subgroups, such as disease and no disease. Each subgroup is divided further using the same or a different value of a feature. Figure 3.4 shows an example of a Decision Tree for classifying a disease. The features in the training set are separated into two subgroups using "Feature1". Each subgroup is divided further using "Feature2", where the training data of "Feature2 < YY" are mostly in Class 1. The subdivision is continued until the optimal class separation is found, as shown in the gray-shaded labels.

Fig. 3.4  Example of a Decision Tree for classifying a disease

Instead of relying on only one tree, Random Forest applies multiple trees to learn the input features [12]. Criteria for separating the input features into homogeneous groups are defined differently for each tree. To predict the output of a new image, features are extracted and used by the decision trees in the forest. The classification outputs that are decided by each tree are voted on, and the output that achieves the highest number of votes becomes the prediction result.

Artificial Neural Network (ANN) extracts relevant features from the input data by learning from examples, without explicitly stating the rules to perform classification tasks [13]. It applies the concept of a connected neural network where each neuron adjusts the weights (the optimal parameters) from the preceding neurons for the learning process. ANN has been applied to many tasks and is also the foundation of deep learning.

Unsupervised Learning Approach

3  Overview of Artificial Intelligence Systems in Ophthalmology 35

Unsupervised learning approach requires only algorithms that are used in ophthalmology, par-
the input features to separate them homoge- ticularly for the prediction task.
neously. The aim of the unsupervised learning
approach is to discover a structure or distribution
in the input data in order to learn more about Overview of DL Algorithms
each separated group.
The unsupervised learning is used when the Methods of DL that are commonly used in oph-
input-output pair is not provided. It has widely thalmology may be classified into Convolutional
been used in an image segmentation task to Neural Networks (CNNs), Pre-trained
separate the set of pixels into a group of back- Unsupervised Networks (PUNs) [17], and
ground and foreground or a region of an interest Recurrent/Recursive Neural Networks (RNNs)
object (Fig. 3.11). In addition, it is also applied [18]. Among these three different categories of
for studying objects in each homogeneous DL networks, CNN has been used more exten-
group. sively in medical image recognition [19, 20]
K-Nearest Neighbor (KNN) finds sets of including in ophthalmology.
objects whose features are similar to the input
features. Distance among the input features is
used as a similarity measure [14]. Given a feature Convolutional Neural Network (CNN)
of new data, the classification result is achieved
by voting the number of the k closest objects to The CNN is designed to automatically extract
the new data. features in two-dimensional data, such as images,
while merging semantically similar features into
one in order to reduce sparseness. The features
Boosting Algorithms extracted using CNN can preserve important
information for obtaining a good prediction. In
Boosting is a generic algorithm that aims to addition, one or multiple images can be used as
improve the accuracy of the prediction result. input, and a single diagnostic feature is designed
Instead of relying on the prediction outcome of a as the output, such as disease presence or absence.
single model, boosting algorithms apply multiple The architecture of CNN comprises three lay-
weak classifiers trained with new data to achieve ers: Input Layer, Convolution Layer, and Pooling
a good classifier [15]. Outcome of the precedent Layer (see Fig. 3.5). An input image is placed in
weak model is connected to a new model together the first Image layer, as shown in Fig. 3.5a. The
with the new data to train and improve the predic- image is a two-dimensional array, where each
tion outcome. There are many boosting algo- cell in the array is in a three-color-channel, red,
rithms, where each applies different measures to green, and blue. Each channel is considered as a
improve the prediction accuracy. Adaboost [16] matrix and applied for a feature extraction. To
and Gredient Boosting are examples of these obtain a rich representation, the input image is
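A hedged sketch of one such method, AdaBoost, is given below using scikit-learn and synthetic data; by default the implementation boosts depth-1 decision trees ('stumps') as the weak classifiers.

```python
# AdaBoost sketch: sequentially combining weak classifiers (scikit-learn).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # toy target

# Default weak learner: a depth-1 decision tree, re-weighted at each round.
boost = AdaBoostClassifier(n_estimators=100).fit(X, y)
print(boost.score(X, y))   # accuracy of the combined (boosted) model
```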

Overview of DL Algorithms

Methods of DL that are commonly used in ophthalmology may be classified into Convolutional Neural Networks (CNNs), Pre-trained Unsupervised Networks (PUNs) [17], and Recurrent/Recursive Neural Networks (RNNs) [18]. Among these three categories of DL networks, CNN has been used most extensively in medical image recognition [19, 20], including in ophthalmology.

Convolutional Neural Network (CNN)

The CNN is designed to automatically extract features in two-dimensional data, such as images, while merging semantically similar features into one in order to reduce sparseness. The features extracted using a CNN can preserve important information for obtaining a good prediction. In addition, one or multiple images can be used as input, and a single diagnostic feature is designed as the output, such as disease presence or absence.

The architecture of a CNN comprises three layers: the Input Layer, the Convolution Layer, and the Pooling Layer (see Fig. 3.5). An input image is placed in the first Input Layer, as shown in Fig. 3.5a. The image is a two-dimensional array, where each cell in the array has three colour channels: red, green, and blue. Each channel is considered as a matrix and used for feature extraction. To obtain a rich representation, the input image is divided into smaller subimages. Each subimage is used in the subsequent layers.

Fig. 3.5  Architecture of Convolutional Neural Network. (a) Input layer. (b) Convoluted layer. (c) Max pooling layer
The Convolution Layer in Fig. 3.5b then applies a filter to extract a feature from the input matrix. The objective of this layer is feature extraction using different filters. The output of the Convolution Layer varies according to the filters, such as edge or texture [21]. Figure 3.6 illustrates two examples of the convolution result using two different gradient filters [22]. Blood vessels can be clearly represented using Filter A, while the optic disc can be visualized using Filter B. Since the performance of the prediction model also depends on the weights (optimal parameters) [23], many filters are therefore applied to extract features from an image. The problem of using too many filters to obtain a rich representation of an image is the sparseness of the features, which results in a low accuracy. To cope with this limitation, the CNN utilizes the Pooling Layer for dimensionality reduction.

Fig. 3.6  Output of an input image using a filter (Filter A is the vertical 3×3 gradient filter [[1, 0, −1], [1, 0, −1], [1, 0, −1]]; Filter B is the horizontal 3×3 gradient filter [[1, 1, 1], [0, 0, 0], [−1, −1, −1]])

The Pooling Layer in Fig. 3.5c applies a filter to preserve the important information of the features extracted in the previous layer and down-samples them to a smaller size. The filter can be of any size, such as the 3×3 filter shown in Fig. 3.5c. The value of the extracted features is summarized using one of three mappings: Max Pooling, Average Pooling, or Sum Pooling. In addition to the dimension reduction, the Pooling Layer is useful for extracting dominant features to achieve rotational and positional invariance. In other words, it can distinguish between non-disease and disease locations in an image.

Fig. 3.6 Output of an input image using two different gradient filters. Filter A ([1 0 −1] in each of its three rows) responds to vertical edges; Filter B (rows [1 1 1], [0 0 0], [−1 −1 −1]) responds to horizontal edges
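The effect of the two gradient filters of Fig. 3.6, followed by a max pooling step, can be reproduced by hand with NumPy and SciPy; the random image below is a stand-in for a fundus photograph:

```python
# Applying the two gradient filters of Fig. 3.6 and a 2x2 max pooling
# step by hand. The random image is a stand-in for a fundus photograph.
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(8, 8)  # one gray-scale channel

filter_a = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])    # responds to vertical edges (e.g. blood vessels)
filter_b = np.array([[1, 1, 1],
                     [0, 0, 0],
                     [-1, -1, -1]])  # responds to horizontal edges

feat_a = convolve2d(image, filter_a, mode="valid")
feat_b = convolve2d(image, filter_b, mode="valid")

def max_pool(feature, size=2):
    """Down-sample by taking the maximum in each size x size block."""
    h, w = feature.shape[0] // size, feature.shape[1] // size
    return feature[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

print(max_pool(feat_a).shape, max_pool(feat_b).shape)
```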

Fig. 3.7 An overview of the deep learning neural network using CNN. Each unit is a CNN unit (input layer, convolution layer, and pooling layer); each level in the network may have a different width

Fig. 3.8 Examples of the deep neural network using CNN. (a) A plain network (Image, CNN stages, Pooling, Fully Connected, Output). (b) A network with a short-cut connection to skip some layers

These architectures employ the same CNN structure for feature extraction; however, the number of layers and feature mappings (filters) vary, as do their efficiency, such as accuracy and training time.

AlexNet
The difference between the original CNN and AlexNet is the number of convolution layers. AlexNet comprises eight layers, of which five are convolution layers and the rest are fully-connected layers [17]. It is the first architecture that applied a multiple-layered CNN with a graphics processing unit (GPU) to accelerate the computational time of DL.

VGG
Since increasing the depth of the CNN showed more accurate classification results [24], the architecture of VGG applies a small-sized filter of 3×3 and deeper weight layers to the CNN. The architectures of VGG16 and VGG19 are similar, differing only in the depth of the weight layers. With a smaller filter, compared to the original CNN, VGG shows a significant improvement in the result.

Inception
The drawbacks of increasing the depth and width of deep neural networks are overfitting of the prediction model when the training size is limited. Overfitting means the model performs much more poorly on a new dataset than on the training dataset. Also, a uniformly increased network size means a dramatically increased use of computational resources. The aim of the Inception architecture is to increase the depth and the width of the network to achieve a higher accuracy while keeping the computational budget constant [25].

ResNet
Although increasing the depth of a network improves its performance, it can lead to degradation. The problem of degradation causes saturation of the network's performance. Instead of passing through every layer in the network, ResNet applies a "shortcut" connection to skip some CNN layers, see Fig. 3.8b. The shortcut is used when the output feature of a layer is the same as the one before; this particular layer can then be skipped, and the degradation problem can be resolved [26]. There are many configurations of the ResNet architecture: ResNet34, ResNet50, etc. The number refers to the depth of the network.

EfficientNet
Other than scaling up the depth of deep neural networks, as in VGG [24], Inception [25], and ResNet [26], scaling up the width [28] and increasing the image resolution [29] are other means to improve network performance. EfficientNet [27] presents a compound scaling method which achieves maximal accuracy by uniformly scaling the network width, depth, and resolution of the input. However, it requires a specialized pipeline parallelism library to train.
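Looking back at the ResNet design above, its "shortcut" connection can be sketched as a simplified residual block in Keras; the layer sizes are illustrative, and this is not the exact ResNet implementation:

```python
# A simplified residual block sketching ResNet's "shortcut" connection
# (Fig. 3.8b): the input skips over the convolution layers and is added
# back to their output. Illustrative only, not the exact ResNet code.
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                   # the skip connection
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.Add()([shortcut, y])                # add the input back
    return layers.Activation("relu")(y)

inputs = keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
model = keras.Model(inputs, outputs)
model.summary()
```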

Pre-trained Unsupervised Network (PUN)

PUN is a family of DL models that uses unsupervised learning to train each of the hidden layers in a neural network to achieve a more accurate fitting of the dataset. Types of PUNs include Autoencoders, Deep Belief Networks (DBN), and Generative Adversarial Networks (GAN). In image processing tasks, a PUN is trained using a very large number of images without considering the labels of the data [30]. The weights obtained from this "pre-training" can be used as initial values when refining weights for a different target domain, such as detecting glaucomatous optic neuropathy (GON) in fundus images [31]. Utilizing pre-training data in another, different task is called "transfer learning". Since the sample and dataset sizes of medical images are usually small compared to non-medical images, transfer learning has become an important and popular technique.
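A minimal Keras sketch of transfer learning in this spirit, assuming ImageNet pre-trained VGG16 weights and a hypothetical two-class fundus task:

```python
# A minimal transfer-learning sketch: weights pre-trained on ImageNet
# are reused as initial values, and only a new classification head is
# trained for the target task (a hypothetical two-class fundus problem).
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # keep the pre-trained weights fixed at first

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # new head for the target domain
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```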
Recurrent Neural Network (RNN)

The Recurrent Neural Network (RNN) is a neural network designed to process an input sequence one element at a time [18]. The characteristic of the RNN is the prediction of the next occurrence of an element based on the previously learned data. In addition, it accepts a series of inputs with no pre-determined limit on size. It has been found to be very good at predicting the next character in text. The Long Short-Term Memory (LSTM) is a kind of artificial recurrent neural network (RNN) architecture. Davidson et al. used an RNN for localization of cone photoreceptor cells in healthy patients and patients with Stargardt disease [32].
Training a DL Model

From a bioengineering perspective, the objective of training a DL model is to find the optimal parameters, also called weights, that yield the best model performance. In DL, a cost, denoted by J(w), is the penalty on any error that the model makes with a specific set of weights, denoted by w. At the beginning of the training process, the initial weights are a set of small random values. For each training iteration, a hyperparameter called the "learning rate" is used to scale the magnitude of the weight update for the next iteration, see Fig. 3.9. To achieve the best performance, the model is trained iteratively to obtain the weights that minimize the cost.

Fig. 3.9 The relationship of the weight values, the learning rate, and the cost value

The duration of the training process and the optimal weight values depend on the learning rate. A smaller learning rate results in a more reliable model, but the duration of training will be longer than with a larger learning rate. However, using a large learning rate may result in skipping the optimal weight values. Figure 3.10 shows the comparison between using a small learning rate and a large learning rate. It can be seen in Fig. 3.10b that the update of the weight using the large learning rate may skip the optimal weight value. On the other hand, using the small learning rate as in Fig. 3.10a requires many iterations to reach the optimal point. Assigning an appropriate learning rate, therefore, is essential in training the model.

Fig. 3.10 A comparison of a small value (a) and a large value (b) of the learning rate

Epochs (the number of rounds that the entire dataset is passed through a DL network [33]) is another hyperparameter that limits the number of training iterations. Although a smaller learning rate provides a more reliable result, the training iterations may stop before the optimal weight value is obtained because of the epochs hyperparameter. Assigning appropriate initial weight values can

reduce the number of training iterations without the limitation due to epochs. With transfer learning, these initial weight values can be assigned from an already available model, also called a "pre-trained" model. There are a number of pre-trained models for CNN-based DL [34].
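A toy gradient-descent loop makes the interplay of the initial weight, the learning rate, and the epochs hyperparameter concrete; the one-parameter cost function below is a deliberately simple stand-in for a real DL cost:

```python
# A toy gradient-descent loop. The cost J(w) = (w - 3)^2 is a stand-in
# with its best weight at w = 3; real DL costs have millions of weights.
import random

def cost(w):          # J(w): penalty for a given weight
    return (w - 3.0) ** 2

def gradient(w):      # dJ/dw
    return 2.0 * (w - 3.0)

learning_rate = 0.1   # too large may skip the optimum; too small is slow
epochs = 50           # caps the number of update rounds

w = random.uniform(-1, 1)                # small random initial weight
for epoch in range(epochs):
    w -= learning_rate * gradient(w)     # update scaled by the learning rate

print(f"best w found: {w:.4f}, cost: {cost(w):.6f}")
```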

Testing a DL Model

There are many terms in the evaluation of an AI model which are not yet settled [35]. "Validation" means evaluation of the performance of an AI model. When "Internal Validation" is used, it means evaluation of the model's performance on the same dataset as training. Some studies call this "testing". On the other hand, if the evaluation is conducted on a new dataset which is different from the training dataset, this is called "External Validation". Generally, data in the original dataset for developing an AI model will be divided into a training set and a testing (internal validation) set in an 80:20 proportion [35]. There are many public datasets of retinal fundus images available for the development of AI systems. The top three most commonly used are Kaggle, Messidor-2, and EyePACS [36].
A confusion matrix is used to show the performance of a supervised learning model. It visualizes the predicted results versus the actual labels assigned by humans. Table 3.1 shows a confusion matrix of 'Disease' and 'No Disease' outcomes. The 'True Positive' and the 'True Negative' show the numbers of data that are correctly predicted. The 'False Negative' and the 'False Positive' show the numbers of wrong predictions. To measure the performance of unsupervised learning, log likelihood or a distance measure is preferable [37].

Table 3.1 A confusion matrix for predicting two outcomes

                          Actual labels
  Predicted results       Disease           No disease
  Disease                 True Positive     False Positive
  No disease              False Negative    True Negative

The confusion matrix can also be used to compute sensitivity and specificity values. The sensitivity value shows the model performance on predicting the 'disease' (positive) class, while the specificity refers to prediction of the 'no disease' (negative) class. Either the Area Under the Curve of the Receiver Operating Characteristics (AUC), which is the area under the plot between sensitivity and 1-specificity, or accuracy, which is the proportion of the total True Positives and True Negatives, can also be used for judging the performance of an AI model.
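These quantities can be computed with scikit-learn as in the following sketch, where the labels and model outputs are hypothetical:

```python
# A sketch of the evaluation quantities described above, computed with
# scikit-learn on hypothetical 'disease'/'no disease' predictions.
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual labels (1 = disease)
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]   # model outputs
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # performance on the 'disease' (positive) class
specificity = tn / (tn + fp)   # performance on the 'no disease' (negative) class

print("sensitivity:", sensitivity)
print("specificity:", specificity)
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("AUC:        ", roc_auc_score(y_true, y_score))
```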

Overview of Systems for Screening and Classification

The principles and practices of screening for a disease were laid out by Wilson and Jungner on behalf of the World Health Organization (WHO) in 1968 [38]. It took 50 years until Dobrow et al. [39] conducted a systematic review and modified Delphi consensus process on the principles and practices of screening for diseases that included this classic work. The newly consolidated principles from this review focused on screening programs and system principles rather than on disease, test, and treatment as in the original principles. The eye disease that fits these principles and has already been screened worldwide is diabetic retinopathy (DR). DR is not only the eye disease with the greatest number of studies of AI to date but also the one with the greatest number of studies of AI for screening.

Screening of DR

Among the many available systems of AI for DR screening, the only autonomous system (no requirement of additional intervention by clinicians) approved by the U.S. FDA to date is still the IDx-DR system [3]. The AI model in this system has evolved from integrating only conventional ML algorithms to combining them with DL algorithms. Abramoff et al. were able to show improvement of the AUC of the model for detecting referable DR from 0.937 to 0.987 when DL (AlexNet and VGG) was added [40]. This hybrid system requires two fields of color retinal images, one macula-centered and another disc-centered, from Topcon NW400 non-mydriatic retinal cameras as inputs, to provide the output results as either "more than mild retinopathy: refer to ophthalmologists" or "negative for more than mild retinopathy: rescreen in 12 months" [41].
The U.S. FDA cleared the performance of IDx-DR with a clinical trial [41] that used this system to prospectively screen 900 patients with diabetes in 10 primary care units in the U.S. The standardization of this trial was four wide-field stereoscopic retinal images, representing the 7-field Early Treatment Diabetic Retinopathy Study (ETDRS) stereoscopic standard photographs, interpreted by certified graders from the Wisconsin Reading Center. This trial found IDx-DR provided 96.1% interpretability (able to analyze 819 out of 852 patients), a sensitivity of 87.2%, and a specificity of 90.7%. These confusion metrics may be slightly lower than those obtained by most other systems using DL for DR screening, which generally provide sensitivity and specificity around 95%.
The IDx-DR system, however, was validated in another cohort of 1410 patients in a Dutch diabetic eye care system [42]. This system was able to grade 66.3% of these patients, whereas three independent human graders in this study were able to grade 80.4%. When applied with two different grading systems, the EURODIAB grading and the International Clinical Classification of DR, after adjustment, the system could provide 96% and 86% sensitivity and specificity respectively.
Other available systems for DR screening include RetmarkerDR and EyeArt, which have been available since before the current era of DL. Both AI systems adopt the feature-based extraction of conventional ML and can also detect turnover of microaneurysms. For DR classification, in a study of both systems in the UK national DR screening programs [43], using arbitrated results of human grading as comparators, RetmarkerDR provided a sensitivity of 73% for any retinopathy, 85% for referable DR, and 97.9% for proliferative DR (PDR), whereas those of EyeArt were 94.7%, 93.8%, and 99.6% respectively. However, the false positive rate of both systems was relatively high, at 50%.
EyeArt, in addition, was validated as an application on a smartphone device in a study of 296 patients with DR [44]; it could achieve sensitivities of 95.8%, 99.3%, and 99.1% for any DR, referable DR, and sight-threatening DR (STDR), whereas specificities of 80.2%, 68.8%, and 80.4% for the corresponding levels were found, respectively.
One of the first AI systems that used a CNN (Inception V3) for DR screening is from Google Research, by Gulshan et al. [45].

The algorithm in this study was developed from more than 100,000 retinal images and was validated in two other independent datasets of more than 10,000 images. This study was the first to show the achievement of both sensitivity and specificity at 95% and an AUC of 99% for detecting referable DR when validated in independent datasets.
This system was further validated in another retrospective dataset of more than 20,000 images of 7000 patients in a nationwide screening program of DR, to detect moderate NPDR or worse [46]. This AI model achieved 97% sensitivity, compared to 74% for human graders in this screening program, whereas a slightly lower specificity of 96%, compared to 98% for humans, was found. When validated prospectively in two private hospitals in India with more than 3000 patients to detect moderate NPDR or worse, this model still achieved about 90% sensitivity with slightly higher than 90% specificity, which was better than manual grading [47].
Another large-scale study of a DL system for DR screening is from Singapore, in which more than 70,000 images were used for development of the algorithm. The highlight of this study by Ting et al. [48] was the largest population used for validation to date, with more than 100,000 images of independent datasets of various races. This AI-based software, now called SELENA (VGG-19), was able to detect STDR with 100% sensitivity, 91% specificity, and 0.96 AUC.
SELENA was also explored for DR screening in Zambia, a country in Africa where resources are scant (for example, the number of ophthalmologists is less than three per million Zambian population), in a study by Bellemo et al. [49]. The performances of SELENA in this study were found to be on par with those in Ting et al. [48] mentioned previously. SELENA was also found to be able to estimate the prevalence and systemic risk factors of DR similarly to human assessors; hence, this showed the potential roles of DL in epidemiology studies [50].
There was another system of DL for DR screening developed in China, by Li et al. [51]. Developed from more than 70,000 images and validated in more than 30,000 images of three independent multiethnic datasets, this algorithm achieved performance on par with other systems for detecting STDR. The authors highlighted that 77% of false negatives were undetected intraretinal microvascular abnormalities.
All of the aforementioned systems of DL for DR screening were designed to be applied to color fundus photography (CFP). Apart from detecting referable DR, their performances on detecting diabetic macular edema (DME) were similar to the detection of STDR [46] (DME is almost always counted as part of STDR). The identification of DME from CFP, however, could be problematic since in real clinical practice DME is identified using images from optical coherence tomography (OCT), which is three-dimensional. To overcome the limitation of the two-dimensional images of CFP, the presence of hard exudates in the macular area was used as a proxy for detecting DME when grading CFP. There are always cases for which the identification of DME from CFP and from OCT is not in concordance [52]. In addition, the prevalence of DME based on each modality is significantly different [53]. An interesting study by Varadarajan et al. used paired data of CFP and OCT to train a DL algorithm to learn to detect OCT-derived DME from grading on CFP only [54].
Developed from more than 6000 of the paired images, this algorithm (CNN: Inception V3) was able to detect center-involved DME (CI-DME) from CFP in the testing set of 1000 CFP images with 85% sensitivity and 80% specificity, whereas three retinal specialists who graded the same CFP images using hard exudates as a proxy for CI-DME had, on average, similar sensitivity but about half the specificity, at 45%. In validation on another independent dataset of 990 CFP images, the sensitivity and specificity of this algorithm dropped to 57% and 91% respectively, whereas the sensitivity and specificity of graders who graded the same images were even lower than the algorithm, at 55% and 79% respectively. It was noted that data in the development dataset in this study were from tertiary care settings while those in the independent set were from primary care settings.

However, this study showed the potential of AI to make predictions across two imaging modalities or across two kinds of labelled data (other examples: predicting gender from CFP, blood pressure from CFP, etc. [55]), when trained with pairs of both imaging modalities or trained with pairs of both data labellings (label gender data with CFP, label blood pressure data with CFP, etc.). This concept is sometimes called "label transfer".

Classification of AMD

The aim for AI to screen AMD has been widely assessed recently. Many attempts had been made before, using other means such as the Amsler grid [56] or a hyperacuity device [57], for screening AMD with fair success. A recent study in South Korea found systematic, population-wide retinal photography of people more than 40 years old by non-specialists for screening AMD to be cost-effective [58]. Another study found screening for AMD within concurrent programs for screening of DR was also cost-effective [59]. It is still not known whether screening for AMD with the non-specialists replaced by AI is also cost-effective.
Most of the AI systems for screening and classification of AMD are developed using CNNs and use CFP as inputs. Fewer studies used OCT images as inputs. SELENA, one of the first systems for screening AMD, was initially applied in patients with diabetes. Although used for screening AMD, the algorithm in SELENA was developed from a training dataset of more than 72,000 images of patients with diabetes in Singapore and Malaysia and a testing dataset of almost 36,000 images of patients in the same population. The output in this study was defined as referable AMD [48].
There are other studies on AI for screening AMD that were developed from CFP of the Age-Related Eye Disease Study (AREDS) [60], a large randomized controlled trial comparing vitamin supplements and placebo for AMD development and progression. Since the AREDS collected CFP as films, they were digitized for applying AI. A study by Burlina et al. [61] (CNN: AlexNet) used training and testing datasets of almost 54,000 and 14,000 images from AREDS, while a study by Grassmann et al. [62] (using various CNNs) used approximately 87,000 and 34,000 images. The former study used existing grades from AREDS for training while the latter required a trained ophthalmologist to label data for the training; both studies provided outputs as grades according to the AREDS Classification. Burlina et al. classified the outputs into two classes, referable and non-referable, whereas Grassmann et al. classified the outputs into the nine steps of AREDS and three late AMD stages. Both studies achieved sensitivity and specificity of around 90% for the testing dataset. The study by Grassmann et al. conducted validation in an external dataset of more than 5000 images and achieved a sensitivity and specificity of 82.2% and 97.1% for detecting intermediate AMD, and a sensitivity and specificity of 100% and 96.5% for late AMD. The system by Burlina et al. could later classify the 9-step AREDS scale and predict 5-year risk of progression to advanced AMD with acceptable error [63].
There are other AI systems for classification of AMD from spectral-domain OCT (SD-OCT) images. Some systems use DL for classifying AMD directly from OCT images, whereas some systems apply conventional ML for automated segmentation of fluid or biomarkers on OCT images as the first step and then use DL classifiers for classification later. Studies by Kermany et al. [64] and Treder et al. [65] are examples of the former. Both used "transfer learning" from an existing, open-sourced, pre-trained ImageNet deep neural network (DNN) with 1000 output categories to train on OCT images for AMD.
Kermany et al. trained the ImageNet network on four categories: choroidal neovascularization (CNV), DME, drusen, and normal. With a training dataset of more than 100,000 images (37,000 CNV, 11,000 DME, 8600 drusen, and 51,000 normal) and 1000 images for validation with equal distribution of the four categories, the system achieved an AUC of 98% with accuracy, sensitivity, and specificity around 95%. The authors also performed the occlusion test to uncover the potential "black box" created by the model.

Treder et al., on the other hand, trained and tested their system using over 1000 images (90% for training, which contained 70% AMD and 30% controls, and 10% for testing, which contained 50% each of AMD and controls).
Lee et al. [66] linked data from the electronic medical record (EMR) with OCT images to develop a CNN system (VGG16) to classify AMD. Approximately 100,000 OCT images with linked EMR data points were used, half normal and the other half AMD; the system achieved AUC and accuracy around 90%.
Studies by Prahs et al. [67] and Hwang et al. [68] are examples of applying AI for classification of OCT images for decision-making. Prahs et al. trained a CNN (GoogLeNet or Inception) with inputs of AMD, DME, RVO, and CSC, and outputs of "requiring" anti-VEGF treatment or "not requiring" anti-VEGF treatment, labeled by treating clinicians. This study conducted validation on an external dataset of more than 5500 images and achieved a sensitivity of 90% and a specificity of 96%. Hwang et al. not only used three different DL systems, VGG16, InceptionV3, and ResNet50, to train on OCT images of normal, dry AMD, active wet AMD, and inactive wet AMD, they also studied the DL as a cloud-based platform. The authors found that the three CNN systems performed similarly for classification of the four categories of AMD, with slightly lower performance on dry AMD. They also found a potential for prediction of longitudinal changes after treatment of wet AMD with 90% accuracy.
The major AI system that was designed to perform both automated segmentation of OCT images and subsequent classification tasks for AMD and other retinal diseases is by De Fauw et al. [69]. For the segmentation, the authors applied a three-dimensional U-Net architecture as a deep segmentation network to delineate OCT scanned images, using more than 1000 manually segmented training images to form tissue segmentation maps of the OCT scans. Another classification network, a customized 29-layer CNN with 5 pooling layers, developed from 14,884 training tissue maps with confirmed diagnoses and referral decisions, was applied to the segmented OCT maps.
This system then makes classifications into different retinal diagnoses, for example, normal, CNV, macular hole, central serous chorioretinopathy, vitreomacular traction, etc., and also referral suggestions: urgent, semi-urgent, routine, and observation. The authors found that, on testing the performance of the model on an independent test set of 997 patients (252 urgent, 230 semi-urgent, 266 routine, 249 observation), an AUC of 99.9 was achieved for urgent referral, whereas the error rate of 3.4% was on par with those of retinal specialists and better than optometrists.

Classification in Glaucoma

The diagnosis of glaucoma may require identification of many co-existing parameters, such as increased optic nerve head cupping, characteristic loss of the retinal nerve fiber layer, or characteristic defects of the visual field. These may make diagnosis of glaucoma by AI more complex, compared with retinal disease.
The SELENA system by Ting et al. [48] can also detect a glaucomatous optic nerve head (GONH); this part of the algorithm was developed from CFP of more than 120,000 patients with diabetes. Li et al. [70] also developed another DL system (VGG) for detecting GONH, from more than 50,000 CFP graded by more than 20 ophthalmologists. The identification of referable glaucoma in these studies by Ting et al. and Li et al. may have a limitation from relying on GONH only, since even ophthalmologists might not have high agreement among themselves on grading GONH [71]. There are other retinal imaging technologies deployed for detecting glaucoma, such as OCT, confocal scanning laser ophthalmoscopy (CSLO), and scanning laser polarimetry (SLP), on which AI can be applied.
Before this era of DL, many models of conventional ML were applied for detecting glaucoma from both time- and spectral-domain OCT images of the optic nerve head (ONH), with acceptable performance [72]. Muhammad et al. showed that a hybrid DL model (AlexNet and a Random Forest Classifier) was able to analyze single scans of SS-OCT images to classify between normal and glaucoma suspects with 93% accuracy [73].

Christopher et al. applied a Principal Component Analysis (PCA) approach of unsupervised ML to analyze retinal nerve fiber layer (RNFL) thickness maps from SS-OCT and showed that this approach could achieve the highest AUC, of 0.95, compared to SD-OCT-based circumferential RNFL thickness measurements and visual field global indices for detection of glaucoma. Using stereoscopic CFP as the standard for defining glaucoma, this model could also detect glaucoma progression with the highest AUC compared to the other means [74].
Visual field (VF) progression, another important indicator of worsening glaucoma, had been detected using a back-propagation neural network since the early 2000s in the Advanced Glaucoma Intervention Study (AGIS), with an AUC of 0.92 [75]. In another study, Yousefi et al. introduced a new glaucoma VF index calculated by an unsupervised ML method, a Gaussian Mixture Model with Expectation Maximization (GEM), to detect VF progression. This model was trained on more than 2000 VFs and tested on a longitudinal cohort of 270 eyes followed every 6 months. This new AI-based index outperformed existing indices, such as the Global or Region-wise indices, by finding the time to progression of 25% of the eyes in the longitudinal cohort at 3.5 years, compared with 4.5 years for the Region-wise and 5.2 years for the Global index [76].
In a recent study, Li et al. [77] compared (1) a DL-CNN (VGG architecture), (2) conventional ML models (SVM, RF, k-NN), (3) rule-based algorithms (AGIS and the enhanced glaucoma staging system [GSS2]), and (4) human experts for grading 300 VFs to differentiate between glaucoma and non-glaucoma patients. The CNN, developed from the same data set of 4000 VF images, achieved an accuracy of 0.876, with specificity and sensitivity of 0.826 and 0.932 respectively, whereas the accuracy of the three ML models was around 0.65, that of the human experts around 0.6, and that of AGIS and GSS2 around 0.5.
A study by Bowd et al. [78] combined structural data (OCT) and functional data (VF) to train conventional ML models (Bayesian Classifiers) and found their performances improved with the combined data for classification between glaucoma and non-glaucoma patients.
Medeiros et al. [79], on the other hand, applied the concept of "label transfer", stated previously, in glaucoma. They trained a CNN (ResNet34) with more than 30,000 paired data of both CFP images and RNFL thickness to predict the RNFL thickness from analyzing only the CFP. In the test set of around 6200 CFP images, the model could predict the RNFL thickness with a strong correlation between predicted and observed RNFL thickness values (Pearson r = 0.832; R² = 69.3%; P < 0.001), with a mean absolute error of the predictions of 7.39 μm. The AUC of the classification of glaucoma and normal was 0.94 both for the prediction made by the CNN on grading only CFP images and for the actual RNFL measurement.
In the next study, by Jammal et al., the same team of authors validated their AI model with another set of 490 CFP images of 490 eyes of 370 subjects, graded by two glaucoma specialists for the probability of glaucomatous optic neuropathy (GON) and estimates of cup-to-disc ratios (C/D). The AUC for classifying GON from CFP was higher for their AI model compared with the glaucoma specialists, 0.529 vs 0.411 [80]. This concept of "label transfer" may be applied more in AI in ophthalmology in the future.

Classification of Cataract

One of the first AI systems for grading cataract was by Gao et al. [81]. This system (Recursive Neural Networks combined with Support Vector Regression) was trained to grade the severity of cataract from slit lamp photography with a decimal score, from 0.1 to 5.0, based on the Wisconsin Grading System. The testing was performed on 5378 images; this system achieved a 70.7% exact agreement ratio (R0), an 88.4% rate of decimal grading error less than 0.5 (Re0.5), and a 99.0% rate of decimal grading error less than 1.0 (Re1.0) when compared against clinical integral grading.
A much larger-scale AI system for classification of cataract was by Wu et al. [82] in China.

The authors trained a DL-CNN (ResNet) to classify photographs of the anterior segment in three steps. The first was to classify according to the mode of illumination: slit or diffuse, and the method of capture: mydriatic or non-mydriatic; the second was according to diagnosis: normal, cataract, or postoperative cataract; the third was according to severity: mild or severe nuclear sclerosis, and visual axis involvement or not. The development dataset was from 37,638 slit lamp photographs of 16,611 patients, with 80% for training and 20% for testing. Validation of the models in the testing set found relatively good performances, with confusion metrics of more than 90% across the board. The performances on images from slit or diffuse illumination, mydriatic or non-mydriatic, were similar.
However, when the models were validated in four separate community hospitals in real-world situations, some of the confusion metrics of the AI dropped. The metrics found to be lower than 90% in the real-world testing were: sensitivity for classifying normal at 71.3%; specificity for classifying cataract at 83.9%; and sensitivity/specificity for classifying severe nuclear sclerosis at 73%/86% and mild nuclear sclerosis at 86%/73% respectively. The authors claimed, however, that using this AI-based assistance of the referral system, an ophthalmologist may serve up to more than 40,000 persons a year, compared to 4000 persons without AI. This kind of AI model, using anterior segment photography for determination of referable cataract, should be subject to further assessment, since many patients in the real world may live with relatively dense nuclear sclerosis without visual impairment or interference with daily life. The indication for cataract surgery, in addition, may still vary in different eye care services.
AI, in theory, would also be ideal as a tool for intraocular lens (IOL) power calculation. Sramka et al. [83] demonstrated that a conventional ML model, Support Vector Machine Regression, and a multilayered neural network ensemble achieved significantly better performance for IOL power calculation compared to conventional methods of calculation without AI. Another study, by Koprowski et al. [84], deployed an ANN to evaluate corneal power after refractive surgery.
Interestingly, other than patient care, AI has also been used for ophthalmic training in cataract surgery. The aim is to use DL to identify manually pre-segmented phases of cataract surgery procedures, to assist in developing efficient and effective skill training tools. In a study by Yu et al. [85], from a dataset of videos of 100 cataract surgery procedures, the authors applied different AI algorithms, SVM, CNN, and CNN-RNN, to identify phases in the cataract surgery videos. Modelling time series of labels of the instruments in use provided the highest accuracy for the identification. In another study, by Morita et al. [86], a DL-CNN model, Inception V3, was also trained to extract important phases of cataract surgery videos, capsulorrhexis and nuclear extraction, with a high correct response rate and low errors.

Screening for Retinopathy of Prematurity and Pediatric Cataract

Retinopathy of prematurity (ROP) shares some similarities with DR: early detection by retinal examination is essential to reduce the risk of visual loss. CFP of both diseases can be captured using commercially available cameras. Digital CFP allows less burden to both examiners and examinees for examination of ROP [86] and also allows the possibility of screening by neonatologists [87]. The main difference between the two diseases is the urgency of treatment in ROP, which is within 72 h for Plus ROP [88]. The term "referable DR" in screening settings may have different definitions depending on the different resources available in screening programs [89], while "referable ROP" is universally agreed upon as the definition of Plus disease [90]. Similar to DR screening, telemedicine has been deployed successfully for screening of ROP [91]. This technology allows addressing two important barriers to ROP screening: interexaminer variability due to subjectivity, and an inadequate number of qualified trained examiners [92]. Further alleviation of these barriers is expected with the application of AI.

There were a number of studies on conventional ML for determination of the tortuosity and width of retinal vessels in Plus ROP in the era prior to DL [93–97]. All of these models required manual annotations in implementation. Fully automated software for detection of ROP, without additional manual annotations, came with the application of CNNs. DeepROP by Wang et al. [98] is one of the first CNN systems for ROP detection. The system (modified Inception-BN nets pre-trained on ImageNet) was developed in China with the largest dataset of ROP to date (more than 20,000 images).
Another system, i-ROP-DL (the earlier version of i-ROP had not yet adopted DL [97]), developed in the U.S. with more than 5000 images, was demonstrated in a study by Redd et al. [99] to successfully classify the severity of ROP into type 1 and type 2. Another study of i-ROP-DL, by Brown et al. [100], compared the DL model (Inception V1 and U-Net) with eight international ROP experts on the output of Plus disease. Using the consensus of image-based diagnosis combined with ophthalmoscopy as the gold standard, the authors found the DL system agreed with this standard more than six of the eight experts did. This i-ROP-DL system, in addition, was able to convert the probability of its prediction via a linear formula into ROP severity scores.
In the subsequent study of i-ROP-DL, Taylor et al. [101] showed that these ROP severity scores could potentially be applied to monitor ROP disease progression. In monitoring more than 870 infants with ROP over time, the median scores of those who progressed to require treatment were significantly higher than of those who did not progress, at each of the postmenstrual age time points in the study.
There was another CNN system for ROP developed from a smaller dataset of 1500 images from Canada and England. This system could also detect Plus ROP with an accuracy comparable with other systems [102].
Apart from ROP, another major AI system was developed for detection of congenital cataract in China [103–105]. This system, called Congenital Cataract-Cruiser, or CC-Cruiser, is a cloud-based AI applied to slit-lamp images and comprises three CNNs (AlexNet) for screening, assessment of severity, and treatment recommendation. It was compared with experienced pediatric ophthalmologists in five clinics in China in a prospective randomized controlled trial evaluating 350 patients younger than 14 years old. Applying the grading of images by a panel of three experienced pediatric ophthalmologists as the gold standard, the accuracy of this AI system was significantly lower than the experts for detection of cataract (87% vs 99%) and recommendation of treatment (71% vs 97%). The duration for reaching results, however, was significantly shorter for the AI system (2.8 vs 8.5 min).
Other pediatric ophthalmology conditions for which AI is studied include detection of strabismus and refractive error, and prediction of future high myopia and reading disability [106].

Overview of Systems for Automated Segmentation

The difference between AI-based classification and AI-based automated segmentation may be clarified from the outputs. The classification systems generally provide outputs as stages of a disease classification, or binary outputs such as referable or non-referable. Most of the automated segmentation systems, on the other hand, provide outputs as biomarkers or pathological characteristics, such as subretinal fluid, retinal pigment epithelial detachment, or the optic nerve head. The objectives of automated segmentation are assisting specialists in busy clinics or assisting researchers, for time and cost savings. The objectives of the classification systems are generally for the benefit of patients, such as early detection or disease screening.
From a bioengineering perspective, the objective of an image segmentation task is to find the area of interest, such as an area with characteristics of disease in images, then separate the image pixels in that area into background and foreground. Figure 3.11 shows segmentation results of the macular fluid in an OCT image. In this example, the pixels of the fluid area are separated as foreground from

the rest, which are background. Each of the image pixels is then classified into either a background or a foreground class. To train the model, a set of training images with the fluid area labelled by experts, also called the ground truth, is provided (Fig. 3.11). The ML model then learns to find the patterns of pixels that separate them into foreground and background classes. To validate the model, the ML model then uses these patterns to categorize the image pixels in new images as either background or foreground.

Fig. 3.11 Example of the input images (OCT slices), their ground truths, and the segmentation results

A conventional ML method, such as Principal Component Analysis (PCA), is generally applied to reduce the sparseness of the extracted features. Another method, such as SVM, is applied to classify each pixel into the two classes [107]. DL can also be applied for the classification [108]. A common statistic that is used for assessment of the accuracy of automated segmentation is the Dice coefficient [109]. The segmented area hand-drawn by human experts is used as the ground truth.
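A minimal computation of the Dice coefficient for a predicted binary mask against its expert-drawn ground truth might look as follows; the rectangular masks are hypothetical:

```python
# A minimal Dice coefficient computation for a binary segmentation mask
# against its expert-drawn ground truth (both as NumPy boolean arrays).
import numpy as np

def dice(pred, truth):
    """Dice = 2|A intersect B| / (|A| + |B|): 0 = no overlap, 1 = perfect."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

pred = np.zeros((64, 64), dtype=bool);  pred[10:30, 10:30] = True
truth = np.zeros((64, 64), dtype=bool); truth[12:32, 12:32] = True
print(f"Dice coefficient: {dice(pred, truth):.3f}")
```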
Retinal OCT images are excellent inputs for automated segmentation, and not only in DR: Fang et al. [110] were able to segment the nine retinal layers in OCT images of patients with dry AMD. While macular fluid was generally the most common output for segmentation of retinal OCT images, photoreceptor cell loss was the output for segmentation of retinal OCT in patients with retinitis pigmentosa and choroideremia [111]. Segmentation of choroidal thickness in patients with AMD from enhanced-depth OCT (EDI-OCT) images was also performed successfully in another study [112]. Adaptive Optics Scanning Light Ophthalmoscopy images from patients with Stargardt disease were also segmented by a Multi-Dimensional RNN for localization of cone cells [32].
While the automated segmentation of OCT images of the macular region is important in retinal disease, the region of interest (ROI) for automated segmentation in glaucoma is at the ONH. Many studies used different segmentation algorithms for ONH segmentation from CFP. Almost all of the studies reported higher than 90% accuracy [72]. Hagiwara et al. [113] proposed a common workflow for conventional computer-aided diagnosis of glaucoma using CFP images, as follows: (1) Image Input, (2) Pre-Processing, (3) Segmentation, (4) Feature Extraction, (5) Feature Selection and Ranking, and (6) Classification.

Some studies performed segmentation of the peripapillary area, which included the RNFL, neural retina, retinal pigment epithelium, choroid, peripapillary sclera, and lamina cribrosa, from OCT images [114, 115].

Overview of Systems for Prediction of Natural History or Treatment Outcomes

Prediction aims to foresee the outcome of the disease, either with or without treatment, when a set of features is input, such as best corrected visual acuity (BCVA) or recurrence of macular edema at a certain time period, which are related to but not necessarily within the images themselves. Longitudinal data or images are usually required for development of AI systems for prediction.
There has been an attempt to predict progression of DR at 6 months, 12 months, and 2 years for individual patients who did not have any ocular intervention during that period [116]. Using CFP of patients with DME who were randomized to the sham arm in the clinical trials of RISE and RIDE, a system of deep CNNs (Inception V3) was developed to predict progression of DR [117]. The CFPs in this AI model were based on the standard 7-field ETDRS photography; each field had its own CNN for prediction of 2-step or more worsening of DR according to the Diabetic Retinopathy Severity Scale (DRSS), and the results of each CNN were then combined using a Random Forest. The prediction at 12 months by the system of combined models achieved the best performance, with an AUC of 0.75, whereas the prediction of this system based on some peripheral fields of ETDRS surprisingly outperformed the prediction based on the posterior fields of ETDRS. However, the generalizability of this system might still be in doubt due to (1) the limited availability of ETDRS 7-field photography, (2) the lack of validation in an independent dataset, and (3) the applicability of the models in patients without DME, since this study included only CFP of patients with DME.
A study by Rohm et al. [118] reflected an attempt to develop AI systems to improve patient care and facilitate research from the database of their own institute. Based on electronic health records collected in the form of a Data Warehouse [119], the authors developed and compared different conventional ML models (AdaBoost.R2, Gradient Boosting, Random Forests, Extremely Randomized Trees, and Lasso) that predicted VA at 3 months and 12 months after 3 monthly injections of anti-VEGF agents for neovascular AMD. The prediction by the models at 3 months had a mean absolute error between 5.5 and 9 letters, and a root mean square error between 7 and 10 letters. The prediction at 12 months was slightly worse.
Other studies on AI-based prediction of treatment outcomes were also centered on data from major clinical trials. For example, data from Protocol T [120] of the DRCR.net trials were used for prediction of BCVA at 12 months after treatment with anti-VEGF injections for patients with DME [121]. Data from the HARBOR Study [122] were used for prediction of (1) BCVA at 12 months after treatment [123], (2) the requirement of a low or high number of anti-VEGF injections [124], and (3) advanced AMD conversion for patients with AMD [125]. For retinal vein occlusion, data from the CRYSTAL Study [126] were used for prediction of BCVA and recurrence of macular edema at 12 months [127].

The Future of AI in Ophthalmology

We are in a transition phase of AI in ophthalmology. There will be an abundance of AI systems coming for more specific tasks in ophthalmology, at the present time and in the future. No more doubt will be cast on AI's performance in research, but many questions remain for deployment. The robust performance of many systems may not be carried over into the real world. Since comparison between different AI systems is difficult, choosing an AI system to be used in eye care will not be easy either. While researchers will be unravelling the "black box", patient acceptability, data privacy, data protection, and regulations, including medico-legal aspects [128], will be issues that every AI system will face in the future.

Acknowledgement This work is partially supported under the Thailand Research Fund grant number RTA6280015 and the Ratchadapisek Sompoch Endowment Fund under the Telehealth Cluster, Chulalongkorn University.

References

1. Yu VL, Fagan LM, Wraith SM, Clancey WJ, Scott AC, Hannigan J, Blum RL, Buchanan BG, Cohen SN. Antimicrobial selection by a computer: a blinded evaluation by infectious diseases experts. JAMA – J Am Med Assoc. 1979;242:1279–82.
2. Sinthanayothin C, Boyce JF, Cook HL, Williamson TH. Automated localisation of the optic disc, fovea, and retinal blood vessels from digital colour fundus images. Br J Ophthalmol. 1999;83:902–10.
3. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med. 2019;25:30–6.
4. Schmidt-Erfurth U, Sadeghipour A, Gerendas BS, Waldstein SM, Bogunović H. Artificial intelligence in retina. Prog Retin Eye Res. 2018;67:1–29.
5. Han J, Kamber M, Pei J. Data mining: concepts and techniques. 2012. https://doi.org/10.1016/C2009-0-61819-5.
6. Alpaydin E. Introduction to machine learning. 4th ed. MIT Press; 2020.
7. Indyk P, Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC '98: Proc. 30th Annu. ACM Symp. Theory Comput.; 1998. p. 604–13.
8. Bengio Y, LeCun Y. Scaling learning algorithms towards AI. 2007.
9. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
10. Hosseini M-P, Lu S, Kamaraj K, Slowikowski A, Venkatesh HC. Deep learning architecture. In: Pedrycz W, Chen S-M, editors. Deep learning: concepts and architectures. Cham: Springer International; 2020. p. 1–24.
11. Lewis DD. Naive (Bayes) at forty: the independence assumption in information retrieval. In: Lect. Notes Comput. Sci. Springer; 1998. p. 4–15.
12. Alpaydin E. Introduction to machine learning. 3rd ed. https://doi.org/10.1007/978-1-62703-748-8_7.
13. Graupe D. Principles of artificial neural networks. 2013. https://doi.org/10.1142/8868.
14. Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res. 2009;10:207–44.
15. Hastie T, Rosset S, Zhu J, Zou H. Multi-class AdaBoost. Stat Interface. 2009;2:349–60.
16. Schapire RE. Explaining AdaBoost. In: Empirical inference: Festschrift in honor of Vladimir N. Vapnik. Berlin: Springer; 2013. p. 37–52.
17. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. 2012.
18. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–80.
19. Lu W, Tong Y, Yu Y, Xing Y, Chen C, Shen Y. Applications of artificial intelligence in ophthalmology: general overview. J Ophthalmol. 2018. https://doi.org/10.1155/2018/5278196.
20. Shen D, Wu G, Suk H-I. Deep learning in medical image analysis. Annu Rev Biomed Eng. 2017;19:221–48.
21. Andrearczyk V, Whelan PF. Using filter banks in convolutional neural networks for texture classification. Pattern Recognit Lett. 2016;84:63–9.
22. Robinson R. Convolutional neural networks – basics. 2017. https://mlnotebook.github.io/post/CNN1/. Accessed 20 Mar 2020.
23. Saxe AM, Koh PW, Chen Z, Bhand M, Suresh B, Ng AY. On random weights and unsupervised feature learning. In: Proc. 28th Int. Conf. Mach. Learn. (ICML 2011); 2011. p. 1089–96.
24. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: 3rd Int. Conf. Learn. Represent. (ICLR 2015); 2015.
25. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit.; 2015. p. 1–9.
26. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit.; 2016. p. 770–8.
27. Tan M, Le QV. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proc. 36th Int. Conf. Mach. Learn. (ICML 2019); 2019. p. 10691–700.
28. Zagoruyko S, Komodakis N. Wide residual networks. In: Br. Mach. Vis. Conf. (BMVC 2016); 2016. p. 87.1–87.12.
29. Huang Y, Cheng Y, Bapna A, et al. GPipe: efficient training of giant neural networks using pipeline parallelism. In: Adv. Neural Inf. Process. Syst. 32 (NeurIPS 2019); 2019. p. 103–12.
30. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35:1798–828.
31. Christopher M, Belghith A, Bowd C, Proudfoot JA, Goldbaum MH, Weinreb RN, Girkin CA, Liebmann JM, Zangwill LM. Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs. Sci Rep. 2018;8:1–13.
32. Davidson B, Kalitzeos A, Carroll J, Dubra A, Ourselin S, Michaelides M, Bergeles C. Automatic cone photoreceptor localisation in healthy and Stargardt afflicted retinas using deep learning. Sci Rep. 2018;8:1–13.

33. Diaz GI, Fokoue-Nkoutche A, Nannicini G, Samulowitz H. An effective algorithm for hyperparameter optimization of neural networks. IBM J Res Dev. 2017. https://doi.org/10.1147/JRD.2017.2709578.
34. Models for image classification with weights trained on ImageNet. https://keras.io/applications/#models-for-image-classification-with-weights-trained-on-imagenet. Accessed 27 Mar 2020.
35. Ting DSW, Lee AY, Wong TY. An ophthalmologist's guide to deciphering studies in artificial intelligence. Ophthalmology. 2019;126:1475–9.
36. Raman R, Srinivasan S, Virmani S, Sivaprasad S, Rao C, Rajalakshmi R. Fundus photograph-based deep learning algorithms in detecting diabetic retinopathy. Eye. 2019;33:97–109.
37. Dy JG, Brodley CE. Feature selection for unsupervised learning. 2004.
38. Wilson JMG, Jungner G, World Health Organization. Principles and practice of screening for disease. 1968.
39. Dobrow MJ, Hagens V, Chafe R, Sullivan T, Rabeneck L. Consolidated principles for screening based on a systematic review and consensus process. CMAJ. 2018;190:E422–9.
40. Abràmoff MD, Lou Y, Erginay A, Clarida W, Amelon R, Folk JC, Niemeijer M. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Investig Ophthalmol Vis Sci. 2016;57:5200–6.
41. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digit Med. 2018;1:1–8.
42. van der Heijden AA, Abramoff MD, Verbraak F, van Hecke MV, Liem A, Nijpels G. Validation of automated screening for referable diabetic retinopathy with the IDx-DR device in the Hoorn Diabetes Care System. Acta Ophthalmol. 2018;96:63–8.
43. Tufail A, Rudisill C, Egan C, et al. Automated diabetic retinopathy image assessment software: diagnostic accuracy and cost-effectiveness compared with human graders. Ophthalmology. 2017;124:343–51.
44. Rajalakshmi R, Subashini R, Anjana RM, Mohan V. Automated diabetic retinopathy detection in smartphone-based fundus photography using artificial intelligence. Eye. 2018;32:1138–44.
45. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA – J Am Med Assoc. 2016;316:2402–10.
46. Ruamviboonsuk P, Krause J, Chotcomwongse P, et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. npj Digit Med. 2019;2:1–9.
47. Gulshan V, Rajan RP, Widner K, et al. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. 2019;137:987–93.
48. Ting DSW, Cheung CYL, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA – J Am Med Assoc. 2017;318:2211–23.
49. Bellemo V, Lim G, Rim TH, et al. Artificial intelligence screening for diabetic retinopathy: the real-world emerging application. Curr Diab Rep. 2019. https://doi.org/10.1007/s11892-019-1189-3.
50. Ting DSW, Cheung CY, Nguyen Q, et al. Deep learning in estimating prevalence and systemic risk factors for diabetic retinopathy: a multi-ethnic study. npj Digit Med. 2019;2:1–8.
51. Li Z, Keel S, Liu C, et al. An automated grading system for detection of vision-threatening referable diabetic retinopathy on the basis of color fundus photographs. Diabetes Care. 2018;41:2509–16.
52. Mackenzie S, Schmermer C, Charnley A, Sim D, Tah V, Dumskyj M, Nussey S, Egan C. SDOCT imaging to identify macular pathology in patients diagnosed with diabetic maculopathy by a digital photographic retinal screening programme. PLoS One. 2011. https://doi.org/10.1371/journal.pone.0014811.
53. Wang YT, Tadarati M, Wolfson Y, Bressler SB, Bressler NM. Comparison of prevalence of diabetic macular edema based on monocular fundus photography vs optical coherence tomography. JAMA Ophthalmol. 2016;134:222–8.
54. Varadarajan AV, Bavishi P, Ruamviboonsuk P, et al. Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning. Nat Commun. 2020. https://doi.org/10.1038/s41467-019-13922-8.
55. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster DR. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2:158–64.
56. Faes L, Bodmer NS, Bachmann LM, Thiel MA, Schmid MK. Diagnostic accuracy of the Amsler grid and the preferential hyperacuity perimetry in the screening of patients with age-related macular degeneration: systematic review and meta-analysis. Eye. 2014;28:788–96.
57. AREDS2-HOME Study Research Group, Chew EY, Clemons TE, Bressler SB, Elman MJ, Danis RP, Domalpally A, Heier JS, Kim JE, Garfinkel R. Randomized trial of a home monitoring system for early detection of choroidal neovascularization: home monitoring of the eye (HOME) study. Ophthalmology. 2014;121:535–44.
58. Ho R, Song LD, Choi JA, Jee D. The cost-effectiveness of systematic screening for age-related macular degeneration in South Korea. PLoS One. 2018. https://doi.org/10.1371/journal.pone.0206690.
59. Chew EY, Schachat AP. Should we add screening of age-related macular degeneration to current screening programs for diabetic retinopathy? Ophthalmology. 2015;122:2155–6.
60. Chew EY, Clemons TE, SanGiovanni JP, et al. Lutein + zeaxanthin and omega-3 fatty acids for age-related macular degeneration: the Age-Related Eye Disease Study 2 (AREDS2) randomized clinical trial. JAMA – J Am Med Assoc. 2013;309:2005–15.
61. Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 2017;135:1170–6.
62. Grassmann F, Mengelkamp J, Brandl C, Harsch S, Zimmermann ME, Linkohr B, Peters A, Heid IM, Palm C, Weber BHF. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology. 2018;125:1410–20.
63. Burlina PM, Joshi N, Pacheco KD, Freund DE, Kong J, Bressler NM. Use of deep learning for detailed
72. Zheng C, Johnson TV, Garg A, Boland MV. Artificial intelligence in glaucoma. Curr Opin Ophthalmol. 2019;30:97–103.
73. Muhammad H, Fuchs TJ, De Cuir N, De Moraes CG, Blumberg DM, Liebmann JM, Ritch R, Hood DC. Hybrid deep learning on single wide-field optical coherence tomography scans accurately classifies glaucoma suspects. J Glaucoma. 2017;26:1086–94.
74. Christopher M, Belghith A, Weinreb RN, Bowd C, Goldbaum MH, Saunders LJ, Medeiros FA, Zangwill LM. Retinal nerve fiber layer features identified by unsupervised machine learning on optical coherence tomography scans predict glaucoma progression. Investig Ophthalmol Vis Sci. 2018;59:2748–56.
75. Lin A, Hoffman D, Gaasterland DE, Caprioli J. Neural networks to identify glaucomatous visual field progression. Am J Ophthalmol. 2003;135:49–54.
76. Yousefi S, Kiwaki T, Zheng Y, Sugiura H, Asaoka R, Murata H, Lemij H, Yamanishi K. Detection of longitudinal visual field progression in glaucoma using machine learning. Am J Ophthalmol.
severity characterization and estimation of 5-year 2018;193:71–9.
risk among patients with age-related macular degen- 77. Li F, Wang Z, Qu G, et al. Automatic differentiation
eration. JAMA Ophthalmol. 2018;136:1359–66. of Glaucoma visual field from non-glaucoma visual
64. Kermany DS, Goldbaum M, Cai W, et al. Identifying filed using deep convolutional neural network. BMC
medical diagnoses and treatable diseases by image-­ Med Imaging. 2018;18:35.
based Deep Learning. Cell. 2018;172:1122–1131. 78. Bowd C, Hao J, Tavares IM, Medeiros FA, Zangwill
e9. LM, Lee TW, Sample PA, Weinreb RN, Goldbaum
65. Treder M, Lauermann JL, Eter N. Automated detec- MH. Bayesian machine learning classifiers for com-
tion of exudative age-related macular degeneration in bining structural and functional measurements to
spectral domain optical coherence tomography using classify healthy and glaucomatous eyes. Investig
deep learning. Graefe’s Arch Clin Exp Ophthalmol. Ophthalmol Vis Sci. 2008;49:945–53.
2018;256:259–65. 79. Medeiros FA, Jammal AA, Thompson AC.  From
66. Lee CS, Baughman DM, Lee AY. Deep Learning is machine to machine: an OCT-trained deep learning
effective for classifying normal versus age-related algorithm for objective quantification of glaucoma-
macular degeneration OCT images. Kidney Int Rep. tous damage in fundus photographs. Ophthalmology.
2017;1:322–7. 2019;126:513–21.
67. Prahs P, Radeck V, Mayer C, Cvetkov Y, Cvetkova 80. Jammal AA, Thompson AC, Mariottoni EB,
N, Helbig H, Märker D.  OCT-based deep learning Berchuck SI, Urata CN, Estrela T, Wakil SM, Costa
algorithm for the evaluation of treatment indica- VP, Medeiros FA. Human versus machine: compar-
tion with anti-vascular endothelial growth factor ing a Deep Learning algorithm to human gradings
medications. Graefe’s Arch Clin Exp Ophthalmol. for detecting glaucoma on fundus photographs. Am
2018;256:91–8. J Ophthalmol. 2020;211:123–31.
68. Hwang DK, Hsu CC, Chang KJ, et  al. Artificial 81. Gao X, Lin S, Wong TY. Automatic feature learning
intelligence-based decision-making for age-related to grade nuclear cataracts based on deep learning.
macular degeneration. Theranostics. 2019;9:232–45. IEEE Trans Biomed Eng. 2015;62:2693–701.
69. De Fauw J, Ledsam JR, Romera-Paredes B, et  al. 82. Wu X, Huang Y, Liu Z, et  al. Universal artificial
Clinically applicable deep learning for diag- intelligence platform for collaborative management
nosis and referral in retinal disease. Nat Med. of cataracts. Br J Ophthalmol. 2019;103:1553–60.
2018;24:1342–50. 83. Sramka M, Slovak M, Tuckova J, Stodulka
70. Li Z, He Y, Keel S, Meng W, Chang RT, He P.  Improving clinical refractive results of cataract
M.  Efficacy of a Deep Learning system for surgery by machine learning. PeerJ. 2019; https://
detecting glaucomatous optic neuropathy based doi.org/10.7717/peerj.7202.
on color fundus photographs. Ophthalmology. 84. Koprowski R, Lanza M, Irregolare C. Corneal power
2018;125:1199–206. evaluation after myopic corneal refractive surgery
71. Abrams LS, Scott IU, Spaeth GL, Quigley HA, using artificial neural networks. Biomed Eng Online.
Varma R. Agreement among optometrists, ophthal- 2016;15:121.
mologists, and residents in evaluating the optic disc 85. Yu F, Silva Croso G, Kim TS, Song Z, Parker F,
for glaucoma. Ophthalmology. 1994;101:1662–7. Hager GD, Reiter A, Vedula SS, Ali H, Sikder
52 P. Ruamviboonsuk et al.

S. Assessment of automated identification of phases 99. Redd TK, Campbell JP, Brown JM, et al. Evaluation
in videos of cataract surgery using machine learning of a deep learning image assessment system for
and deep learning techniques. JAMA Netw Open. detecting severe retinopathy of prematurity. Br J
2019;2:e191860. Ophthalmol. 2019;103:580–4.
86. Morita S, Tabuchi H, Masumoto H, Yamauchi T, 100. Brown JM, Campbell JP, Beers A, et al. Automated
Kamiura N. Real-time extraction of important surgi- diagnosis of plus disease in retinopathy of prema-
cal phases in cataract surgery videos. Sci Rep. 2019; turity using deep convolutional neural networks.
https://doi.org/10.1038/s41598-­019-­53091-­8. JAMA Ophthalmol Am Med Assoc. 2018:803–10.
87. Gilbert C, Wormald R, Fielder A, Deorari A, Zepeda-­ 101. Taylor S, Brown JM, Gupta K, et al. Monitoring dis-
Romero LC, Quinn G, Vinekar A, Zin A, Darlow ease progression with a quantitative severity scale
B. Potential for a paradigm change in the detection for retinopathy of prematurity using Deep Learning.
of retinopathy of prematurity requiring treatment. JAMA Ophthalmol. 2019;137:1022–8.
Arch Dis Child Fetal Neonatal Ed. 2016;101:F6–7. 102. Worrall DE, Wilson CM, Brostow GJ.  Automated
88. Salvin JH, Lehman SS, Jin J, Hendricks DH. Update retinopathy of prematurity case detection with
on retinopathy of prematurity: treatment options and convolutional neural networks. https://doi.
outcomes. Curr Opin Ophthalmol. 2010;21:329–34. org/10.1007/978-­3-­319-­46976-­8.
89. Wong TY, Sun J, Kawasaki R, et  al. Guidelines 103. Long E, Lin H, Liu Z, et al. An artificial intelligence
on diabetic eye care: the International Council of platform for the multihospital collaborative man-
Ophthalmology recommendations for screening, agement of congenital cataracts. Nat Biomed Eng.
follow-up, referral, and treatment based on resource 2017;1:1–8.
settings. Ophthalmology. 2018;125:1608–22. 104. Lin H, Li R, Liu Z, et  al. Diagnostic efficacy and
90. Davitt BV, Wallace DK.  Plus disease. Surv therapeutic decision-making capacity of an artificial
Ophthalmol. 2009;54:663–70. intelligence platform for childhood cataracts in eye
91. Daniel E, Quinn GE, Hildebrand PL, et al. Validated clinics: a multicentre randomized controlled trial.
system for centralized grading of retinopathy of EClinicalMedicine. 2019;9:52–9.
prematurity: telemedicine approaches to evaluating 105. Liu X, Jiang J, Zhang K, et al. Localization and diag-
acute-phase Retinopathy of Prematurity (e-ROP) nosis framework for pediatric cataracts based on slit-­
Study. JAMA Ophthalmol. 2015;133:675–82. lamp images using deep features of a convolutional
92. Ting DSW, Pasquale LR, Peng L, Campbell JP, Lee neural network. PLoS One. 2017;12:e0168606.
AY, Raman R, Tan GSW, Schmetterer L, Keane PA, 106. Reid JE, Eaton E.  Artificial intelligence for pedi-
Wong TY. Artificial intelligence and deep learning in atric ophthalmology. Curr Opin Ophthalmol.
ophthalmology. Br J Ophthalmol. 2019;103:167–75. 2019;30:337–46.
93. Capowski JJ, Kylstra JA, Freedman SF. A numeric 107. Alsaih K, Lemaitre G, Rastgoo M, Massich J, Sidibé
index based on spatial frequency for the tortuos- D, Meriaudeau F.  Machine learning techniques for
ity of retinal vessels and its application to plus diabetic macular edema (DME) classification on
disease in retinopathy of prematurity. Retina. SD-OCT images. Biomed Eng Online. 2017;16:68.
1995;15:490–500. 108. Schlegl T, Waldstein SM, Bogunovic H, Endstraßer
94. Heneghan C, Flynn J, O’Keefe M, Cahill F, Sadeghipour A, Philip AM, Podkowinski D,
M. Characterization of changes in blood vessel width Gerendas BS, Langs G, Schmidt-Erfurth U.  Fully
and tortuosity in retinopathy of prematurity using automated detection and quantification of macular
image analysis. Med Image Anal. 2002;6:407–29. fluid in OCT using Deep Learning. Ophthalmology.
95. Swanson C, Cocker KD, Parker KH, Moseley MJ, 2018;125:549–58.
Fielder AR.  Semiautomated computer analysis of 109. Zou KH, Warfield SK, Bharatha A, Tempany CMC,
vessel growth in preterm infants without and with Kaus MR, Haker SJ, Wells WM, Jolesz FA, Kikinis
ROP. Br J Ophthalmol. 2003;87:1474–7. R. Statistical validation of image segmentation qual-
96. Gelman R, Martinez-Perez ME, Vanderveen DK, ity based on a Spatial Overlap Index. Acad Radiol.
Moskowitz A, Fulton AB. Diagnosis of plus disease 2004;11:178–89.
in retinopathy of prematurity using retinal image 110. Fang L, Cunefare D, Wang C, Guymer RH, Li S,
multiScale analysis. Investig Ophthalmol Vis Sci. Farsiu S.  Automatic segmentation of nine retinal
2005;46:4734–8. layer boundaries in OCT images of non-exudative
97. Ataer-Cansizoglu E, Bolon-Canedo V, Campbell AMD patients using deep learning and graph search.
JP, et  al. Computer-based image analysis for plus Biomed Opt Express. 2017;8:2732.
disease diagnosis in retinopathy of prematurity: 111. Camino A, Wang Z, Wang J, Pennesi ME, Yang P,
performance of the “i-ROP” system and image fea- Huang D, Li D, Jia Y. Deep learning for the segmen-
tures associated with expert diagnosis. Transl Vis Sci tation of preserved photoreceptors on en face optical
Technol. 2015;4:5. coherence tomography in two inherited retinal dis-
98. Wang J, Ju R, Chen Y, Zhang L, Hu J, Wu Y, Dong eases. Biomed Opt Express. 2018;9:3092.
W, Zhong J, Yi Z.  Automated retinopathy of pre- 112. Chen M, Wang J, Oguz I, VanderBeek BL, Gee
maturity screening using deep neural networks. JC.  Automated segmentation of the choroid in
EBioMedicine. 2018;35:361–8. EDI-­ OCT images with retinal pathology using
3  Overview of Artificial Intelligence Systems in Ophthalmology 53

convolution neural networks. In: Lect. Notes 121. Gerendas BS, Bogunovic H, Sadeghipour A,
Comput. Sci. (including Subser. Lect. Notes Artif. Schlegl T, Langs G, Waldstein SM, Schmidt-Erfurth
Intell. Lect. Notes Bioinformatics). Springer; U. Computational image analysis for prognosis deter-
2017. p. 177–184. mination in DME. Vision Res. 2017;139:204–10.
113. Hagiwara Y, Koh JEW, Tan JH, Bhandary SV, Laude 122. Suner IJ, Yau L, Lai P.  HARBOR Study: one-year
A, Ciaccio EJ, Tong L, Acharya UR.  Computer-­ results of efficacy and safety of 2.0 mg versus 0.5 mg
aided diagnosis of glaucoma using fundus images: ranibizumab in patients with subfoveal choroidal
a review. Comput Methods Programs Biomed. neovascularization secondary to age-related macu-
2018;165:1–12. lar degeneration | IOVS | ARVO Journals. Invest
114. Devalla SK, Chin KS, Mari JM, Tun TA, Strouthidis Ophthalmol Vis Sci. 2012;53.
N, Aung T, Thiéry AH, Girard MJA. A deep learn- 123. Schmidt-Erfurth U, Bogunovic H, Sadeghipour
ing approach to digitally stain optical coherence A, Schlegl T, Langs G, Gerendas BS, Osborne A,
tomography images of the optic nerve head. Investig Waldstein SM.  Machine learning to analyze the
Ophthalmol Vis Sci. 2018;59:63–74. prognostic value of current imaging biomarkers
115. Devalla SK, Renukanand PK, Sreedhar BK, et  al. in neovascular age-related macular degeneration.
DRUNET: a dilated-residual U-Net deep learn- Ophthalmol Retin. 2018;2:24–30.
ing network to segment optic nerve head tissues in 124. Bogunovic H, Waldstein SM, Schlegl T, Langs G,
optical coherence tomography images. Biomed Opt Sadeghipour A, Liu X, Gerendas BS, Osborne A,
Express. 2018;9:3244. Schmidt-Erfurth U.  Prediction of anti-VEGF treat-
116. Arcadu F, Benmansour F, Maunz A, Willis J, ment requirements in neovascular AMD using a
Haskova Z, Prunotto M.  Deep learning algorithm machine learning approach. Invest Ophthalmol Vis
predicts diabetic retinopathy progression in indi- Sci. 2017;58:3240–8.
vidual patients. npj Digit Med. 2019. https://doi. 125. Schmidt-Erfurth U, Waldstein SM, Klimscha S,
org/10.1038/s41746-­019-­0172-­3. Sadeghipour A, Hu X, Gerendas BS, Osborne A,
117. Nguyen QD, Brown DM, Marcus DM, et  al. Bogunović H. Prediction of individual disease con-
Ranibizumab for diabetic macular edema: results version in early AMD using artificial intelligence.
from 2 phase iii randomized trials: RISE and Investig Ophthalmol Vis Sci. 2018;59:3199–208.
RIDE. Ophthalmology. 2012;119:789–801. 126. Larsen M, Waldstein SM, Boscia F, et  al.
118. Rohm M, Tresp V, Müller M, Kern C, Manakov I, Individsualized ranibizumab regimen driven by sta-
Weiss M, Sim DA, Priglinger S, Keane PA, Kortuem bilization criteria for central retinal vein occlusion:
K.  Predicting visual acuity by using machine twelve-month results of the CRYSTAL Study. In:
learning in patients treated for neovascular age-­ Ophthalmology. Elsevier; 2016. p. 1101–1111.
related macular degeneration. Ophthalmology. 127. Vogl WD, Waldstein SM, Gerendas BS, Schlegl T,
2018;125:1028–36. Langs G, Schmidt-Erfurth U. Analyzing and predict-
119. Kortüm KU, Müller M, Kern C, Babenko A, Mayer ing visual acuity outcomes of anti-VEGF therapy
WJ, Kampik A, Kreutzer TC, Priglinger S, Hirneiss by a longitudinal mixed effects model of imag-
C. Using electronic health records to build an oph- ing and clinical data. Investig Ophthalmol Vis Sci.
thalmologic data warehouse and visualize patients’ 2017;58:4173–81.
data. Am J Ophthalmol. 2017;178:84–93. 128. Grzybowski A, Brona P, Lim G, Ruamviboonsuk P,
120. Wells JA, Glassman AR, Ayala AR, et al. Aflibercept, Tan GSW, Abramoff M, Ting DSW. Artificial intel-
bevacizumab, or ranibizumab for diabetic macular ligence for diabetic retinopathy screening: a review.
edema. N Engl J Med. 2015;372:1193–203. Eye. 2019;34:451–60.
4
Autonomous Artificial Intelligence Safety and Trust

Michael D. Abramoff

M. D. Abramoff (*)
The Robert C Watzke Professor of Ophthalmology and Visual Sciences, University of Iowa, Iowa City, IA, USA
Retina Service, University of Iowa, Iowa City, IA, USA
Digital Diagnostics, Coralville, IA, USA
e-mail: michael-abramoff@uiowa.edu
Introduction

Artificial Intelligence, or Augmented Intelligence (AI), is the term for systems capable of making decisions of high cognitive complexity. In healthcare, Autonomous AI systems are those AI systems that make clinical decisions without human oversight, and where the autonomous AI creator assumes medical liability [1]. Autonomous AI systems are thus different from Assistive AI systems, which help a clinician make better diagnostic or management decisions, and where the liability for the medical decision remains with the clinician [2].

As an example, diagnostic autonomous AI systems for the point of care diagnosis of diabetic retinopathy and diabetic macular edema provide a direct diagnostic recommendation. In doing so, they perform a cognitively highly complex task that was previously only performed by ophthalmologists and optometrists (representing 0.02% of all Americans) after extensive, specialized training.

Rigorously validated medical diagnostic autonomous AI systems hold great promise for improving patient access, as well as increasing accuracy and lowering cost, while enabling specialist physicians to provide the greatest value by managing and treating those patients whose outcomes can be improved [3, 4]. There are significant ethical and legal concerns around the introduction of autonomous AI into healthcare [5, 6]. Ensuring the public's and healthcare systems' trust that autonomous AI provides these benefits requires creators to negotiate multiple ethical and practical challenges.

In 2018, after extensive work on the ethics and accountability implications of a computer making a medical diagnosis, culminating in a clinical trial comparing autonomous AI to clinical outcome, the first autonomous point of care diabetic retinopathy exam was de novo authorized by the US FDA [7]. This milestone marked that a pathway exists for other autonomous AI systems to be introduced safely. In 2020, this same autonomous AI system became part of the American Diabetes Association's Standards of Medical Care in Diabetes [8]. In the same year, the US National Committee for Quality Assurance updated its quality measurements to support the use of autonomous AI to perform the diabetic eye exam, at an equivalent level as an ophthalmologist or optometrist [9]. Also in 2020, US Medicare national insurance formally adopted autonomous AI for the diabetic eye exam in its reimbursement system, thus offering a pathway for other autonomous AI to be reimbursed.
No prior processes that might serve as guidance existed for determining the safety, efficacy and equity of such autonomous AI systems. This chapter goes through the why and how of ethical and accountability principles for autonomous AI, explains how they were practically addressed, and how they form the foundation for a series of requirements for autonomous AI that were used in design, in the de novo FDA clearance process, and in ongoing implementation. It leans heavily on the concepts addressed in "Lessons Learned About Autonomous AI: Finding a Safe, Efficacious, and Ethical Path Through the Development Process" [10].

Ultimately, successful introduction of autonomous AI into the healthcare system is dependent on trust in this technology, and it is thus vital that everyone involved with autonomous AI helps build trust [11].

Autonomous AI for the Diabetic Retinopathy Exam

The patient and societal benefits of early detection of diabetic retinopathy are well established [12–15]. This diabetic eye exam is typically performed as an indirect dilated retinal examination, using slitlamp biomicroscopy and indirect binocular ophthalmoscopy, as well as macular optical coherence tomography, by an ophthalmologist, retina specialist, or optometrist. Thus, many practice guidelines referred to this method as the standard of care [16]. In the 1990s, evidence showed that telemedicine for diabetic retinopathy, where retinal images are taken and then evaluated remotely, has at least a similar patient safety profile as the dilated retinal examination [17, 18]. While improving access, traditional telemedicine does not allow instantaneous, point of care diagnosis, nor has the safety, efficacy and equity of its process been established in a scientifically valid, hypothesis testing manner. Thus, trust has generally been limited. There is a delay of typically days between when the patient is imaged and the availability of an ophthalmologist to read images and diagnose the patient. In many cases, low image quality requires calling the patient back for another examination. Thus, safer, more trustworthy solutions were desirable.

Patient adherence to any form of regular diabetic eye examination is generally low around the world, a state of affairs primarily caused by lack of access and convenience. Recent research shows that in the US, adherence to a regular documented diabetic eye exam remains as low as 15.3%, even though all practice guidelines and standards recommend regular diabetic eye exams [16, 19, 20]. Compared to testing at a remote laboratory, point of care diagnosis increases access, as has been shown for point of care A1C testing [21, 22], and has even been shown to improve clinical outcome [23].

Historically, clinicians are trained during medical school, residency and possible fellowship, and then their medical competency continues to be overseen by Medical Boards. However, in practice, clinicians are rarely validated on their safety, efficacy and equity for a specific diagnostic process, against a valid standard. Furthermore, consistency of clinicians performing the diabetic eye exam is limited, and the only scientifically valid studies comparing clinicians to the most rigorous patient outcome-based prognostic reference standard show that clinician accuracy, expressed as sensitivity, does not exceed 50% [24, 25]. Additionally, documentation of a diabetic eye exam in the chart, once performed, is a human driven process with many points of potential failure, so that too often, the fact that a diabetic eye exam was performed cannot be verified from the patient's chart. Operational coding systems for physician activities are typically not fine grained enough to establish that a diabetic eye exam was performed [26–28]. Instead, typically only the fact that the physician spent time interacting with the patient is documented, if it is documented at all.

Together, these are major challenges with process integrity for the diabetic eye exam (both traditional, in the chair, as well as through telemedicine). In other words, the traceability of what exactly happens to the patient when the need for a diabetic eye exam is established is limited, because the accuracy of the diagnostic process is unknown, and because the documentary evidence that a diabetic eye exam happened is completely determined by fallible clinician documentation [29]. To increase trust, program integrity for the diabetic eye exam needs to be as high as possible.
The use of artificial intelligence for medical diagnostic purposes started in the 1960s with Mycin, an AI that helped physicians prescribe antibiotics [30]. This continued through the 1980s with algorithms such as the perceptron [31] and multilayer neural network learning using backpropagation [32]. Safe performance of these AI systems was limited, mainly because of a lack of high quality, maximally objective input data. Input consisted of physicians interpreting the patient's symptoms and signs and then typing them in, a process with an inherently low signal to noise ratio. Instead, much of the foundational methodology for modern autonomous AI has been developed over the past decades. The recent introduction of affordable digital retinal cameras with high-quality complementary metal-oxide semiconductor (CMOS) image sensors was a pivotal moment. These CMOS image sensors make it possible to acquire images of high fidelity and consistency, such as retinal images of people with diabetes, and thus provide highly objective input data for AI algorithms. The resulting higher performance, and potentially higher safety, is necessary for increasing trust in autonomous AI.

In summary, diagnostic autonomous AI systems, such as IDx-DR, provide a direct diagnostic recommendation for the point of care diagnosis of diabetic retinopathy and diabetic macular edema. Autonomous AI allows the diagnosis of diabetic retinopathy and diabetic macular edema to be performed in real time, with the goals of higher adherence, improved access, higher cost-effectiveness [33], increased accuracy and diagnosability [34], and increased program integrity.

Can We Trust It? Concerns About Autonomous AI

The idea of "a computer making a diagnosis" generates concerns in both physicians and patients, as is to be anticipated with any new technology. We will address the most common ones in this section: undesired racial, sex or ethnic bias; validation; what type of trial and what to compare the AI against; what safety threshold to use; data usage; and liability [10].

A recent study of an AI showed that using medical cost as a proxy for patients' overall health needs led to inappropriate racial bias in allocating healthcare resources, as black patients were incorrectly deemed to have lower risk compared to white patients because their incurred costs were lower for a given health risk status [35]. Another study showed a consistent decrease in performance for underrepresented sex categories in the AI's machine learning training data when a minimum balance was not fulfilled [36].

To demonstrate AI systems' safety, scientifically valid studies that are replicable are essential. The design of these studies is a concern. Some, like healthcare pundit Eric Topol, have proposed that Randomized Clinical Trials (RCT) are the gold standard for diagnostic AI [37], though there is clear evidence that other study designs can work as well or better [38]. For example, RCTs for diagnostic AI require an arm where patient outcome, including need for intervention, is decided based only on the AI output, without possible override by a clinician. If the AI is not perfect, this requires potentially withholding effective treatment for a treatable condition where we know how to improve outcome. Most Institutional Review Boards see this as unethical, as the benefit to other patients and society may not outweigh the harm to the patient who has been diagnosed by AI [39, 40]. As diagnostic AIs are typically designed for conditions where effective interventions are available, RCTs are ethically unsound. In other words, while a null hypothesis of "no effect" works well in most interventional trials, a null hypothesis of "not informative" is not appropriate for validation of diagnostic AI [41].

Replicability is another big problem affecting trust in safety, and without preregistration, AI performance tends to be overestimated and successful study replication becomes less likely. In fact, when comparing trials with and without preregistration, trial effect sizes are larger when they lack preregistration [34, 42].
Often, to demonstrate AI systems' safety, their output is compared against the diagnosis by clinicians or groups of clinicians, called a "reference standard" [43]. This approach assumes that the AI system safety is highest when it most closely matches clinicians' diagnosis. There are several problems with this approach: (a) showing that an AI system is safer than clinicians is impossible, as a discrepancy between clinician and AI system will by definition be attributed to an error by the AI rather than the clinician; (b) clinicians differ greatly on typical diagnostic tasks, disagreeing in many cases in 30% of cases or more [44], and it is thus impossible to determine which clinician is right and which is wrong; (c) if compared to clinical outcome, such as with a prognostic standard, the ultimate determinant of clinical relevance, clinicians perform poorly, for example achieving only 33% and 34% in the only two studies comparing clinicians diagnosing diabetic retinopathy as determined by the Wisconsin Reading Center ETDRS standard [24, 25]. Thus, estimates of safety are greatly affected by the choice of reference standard.

In 2007, Fenton and colleagues first demonstrated the importance of rigorous validation of AI in the actual workflow setting, rather than in a modeled laboratory setting [45]. In this pivotal study, the outcomes of women undergoing breast cancer screening by a radiologist assisted by a previously FDA approved assistive AI system were compared to women who underwent breast cancer screening by a radiologist without such an assistive AI. The assistive AI had, in 2000, been approved by FDA on the basis of a study that showed that, when used in isolation, the assistive AI had high diagnostic accuracy compared to radiologists. When this assistive AI system was tested in a study design that reflected actual usage, assisting a radiologist who makes the final clinical decision, the study showed worse outcomes for the women who underwent breast cancer screening with AI assistance.

For continued trust and acceptance, the autonomous AI's impact on clinic workflow needs to be optimized, and optimal clinic workflow of the patient around the autonomous AI is an important aspect. One metric is so-called population diagnosability. Population diagnosability is defined as

$$PS = \frac{n_n + n_p}{n_n + n_p + n_x + n_i}$$

where
• n_p = number of subjects that received a positive diagnostic result
• n_n = number of subjects that received a negative diagnostic result
• n_x = number of subjects that were excluded for any reason from study completion
• n_i = number of subjects that received an insufficient input quality result

For example, if in a completed validation study the total number of subjects recruited for the study, n, is 1000, n_x = 200 subjects were excluded from analysis for various reasons, and the autonomous AI gave an invalid input quality result, n_i, for 100 subjects, then PS = 0.7. Obviously, a higher population diagnosability improves efficiency and especially workflow, as there is less risk of an individual patient not being diagnosed by the autonomous AI and needing to fall back on the in person exam.
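To make the arithmetic concrete, here is a minimal sketch of this metric in Python; the function name is ours, and since the text only fixes the sum of diagnosed subjects at 700, the 600/100 negative/positive split in the example is an arbitrary assumption.

```python
def population_diagnosability(n_neg: int, n_pos: int,
                              n_excluded: int, n_insufficient: int) -> float:
    """PS = (n_n + n_p) / (n_n + n_p + n_x + n_i): the fraction of enrolled
    subjects who actually received a valid diagnostic output."""
    diagnosed = n_neg + n_pos
    return diagnosed / (diagnosed + n_excluded + n_insufficient)

# Worked example from the text: 1000 subjects recruited, 200 excluded,
# 100 with insufficient input quality, so 700 received a diagnosis.
print(population_diagnosability(n_neg=600, n_pos=100,
                                n_excluded=200, n_insufficient=100))  # 0.7
```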
Safety, for a diagnostic process, including for an autonomous AI system, is typically measured as sensitivity, as appropriate for binary outcomes, more so than metrics such as Receiver Operator Characteristics (ROC) analysis [46]. In many cases, the purpose of the diagnostic process is to find (true) cases in a population. While a high sensitivity maximizes the efficiency of such a process, a population level analysis will show that adherence with the diagnostic process plays as important a role. For example, if a diagnostic process has a high sensitivity of 90%, but only 10% of a population undergoes the diagnostic process, the cases in the remaining 90% of the population will not be found, lowering the "population achieved sensitivity". In order to account for this diagnostic access bias [47], the population achieved sensitivity (PAS), or "compliance corrected sensitivity", can be calculated as follows:

$$PAS = \frac{s_c \cdot c \cdot p_c}{c \cdot p_c + (1 - c) \cdot p_{nc}}$$

where
• s_c = sensitivity (as determined in the compliant population)
• c = compliance
• p_c = measured prevalence in the compliant population
• p_nc = estimated prevalence in the noncompliant population

If we assume that p_c ≅ p_nc, i.e. the prevalence of the disease is the same in the non-compliant as in the compliant part of the population, we can use the simplified estimate

$$PAS \cong s_c \cdot c$$

This estimated PAS would then form an upper bound, as in most cases, prevalence in the non-compliant subpopulation is higher than in the compliant part.

For example, if compliance c with the diabetic eye exam is 15% [19], and the minimal acceptable sensitivity is 85% [34], the population achieved sensitivity PAS = 0.13. In other words, only 13% of all cases in the population will be identified correctly with this diagnostic system.
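A short sketch of this calculation follows; it is an illustration under our own naming, not a published implementation, and it reproduces both the exact formula and the simplified upper-bound estimate from the worked example.

```python
def population_achieved_sensitivity(s_c: float, c: float,
                                    p_c: float, p_nc: float) -> float:
    """PAS = (s_c * c * p_c) / (c * p_c + (1 - c) * p_nc)."""
    return (s_c * c * p_c) / (c * p_c + (1 - c) * p_nc)

# Worked example from the text: sensitivity 85%, compliance 15%.
# The 5% prevalence below is an assumed placeholder; when both
# subpopulations share it, any common value cancels out of the formula.
p = 0.05
print(population_achieved_sensitivity(s_c=0.85, c=0.15, p_c=p, p_nc=p))  # 0.1275
print(0.85 * 0.15)  # simplified upper-bound estimate, ~0.13 as in the text
```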
These metrics show the importance of including workflow, and patient experience in general, in the study design, as a focus on sensitivity in the compliant population may not give the optimal population benefit, and increasing compliance with the autonomous AI may have more effect on its safety than increasing sensitivity.

The development of any AI requires vast amounts of clinical data. There are many statutes and regulations covering patient derived data, such as HIPAA and HITECH [48]. Ultimately, whether patient derived data belongs to the patient, the physician, the hospital system, or even whoever paid for acquisition has not been fully litigated, and as such can easily lead to concerns and controversy. For example, in one case patient data for training AI was obtained through an agreement with a health system [49]. While agreements were in place, patients and physicians were not aware of this data usage, leading to confusion, so that the Department of Health and Human Services became involved. In another example, a class action lawsuit alleging failure to adequately deidentify patient data for AI training was initiated against an academic health system.

Autonomous AI is typically designed and validated for a narrow diagnostic task, and so-called incidental findings, which potentially could have been discovered by a general exam, are not flagged. For our example, which diagnoses diabetic retinopathy and diabetic macular edema, other diagnoses such as glaucoma or macular degeneration will not be made, as the autonomous AI is neither designed nor validated for anything but DR. While there is widespread evidence for the effectiveness and cost-effectiveness of early detection of diabetic retinopathy [15], this is currently not yet the case for glaucoma [50], macular degeneration [51] and many other eye diseases. Therefore, clinical trials for a specific AI will typically not be designed or powered to analyze diagnostic accuracy on other retinal abnormalities or other eye abnormalities in people with diabetes. However, it is important to note that little evidence exists on how accurately clinicians can diagnose these incidental findings: their performance has not been evaluated in formal studies. In fact, such studies may be logistically impossible and challenging to power, and therefore are unlikely to be performed because of the enormous number of subjects required. For example, at a prevalence of 5 per million, evaluating the diagnostic performance of either ophthalmologists or autonomous AI for choroidal melanoma diagnosis would require a clinical trial with approximately 40 million subjects [52].

In the past, liability for medical errors contributed to by AI has generally been attributed to the physician using it [53]. While this is acceptable practice for assistive AI, as ultimately the medical decision is made by the physician using the AI, this may not be the case for autonomous AI, where, after all, the medical decision is made by the AI, without input by that physician. Nevertheless, many AI creators have publicly refused to take liability for their AI products, as shown by the ongoing liability debate concerning autonomous driving [54].

Trust in Autonomous AI is essential, and definitely, lack of trust has negatively affected other medical innovations. In the early 2000s, gene therapy effectively went through a moratorium on research funding, including closure of research institutions, after several young people died in poorly planned and executed gene therapy studies [55]. Only in 2017, almost two decades later, did FDA approve the first ever gene therapy, for the RPE65 variant of Leber's Congenital Amaurosis [56]. In short, trust in autonomous AI needs to be earned.

Building Trust: An Ethical Foundation for Autonomous AI Requirements

Successful introduction of autonomous AI into the healthcare system is dependent on trust, and it is essential that everyone involved with autonomous AI helps build such trust [6, 11]. This is best done through an ethical foundation. Previously, Char, Abramoff et al. derived a foundation for autonomous AI in healthcare from bioethical principles as well as accountability principles, implemented in Abramoff et al. [10]. We ensured alignment with classical bioethics principles, such as Beauchamp and Childress [57]. The following table is modified from our paper [10]; copyright retained by the authors (Table 4.1).

Table 4.1  Deriving requirements from bioethical principles

1. Requirement: Improve patient outcome as shown either by direct evidence or linked clinical literature, and aligned with evidence based clinical standards of care/practice patterns from quality of care organizations, professional medical societies and patient organizations, while accounting for safety, efficacy and equity. Relevant bioethical principles [57]: Nonmaleficence, Beneficence, and Equity.
2. Requirement: Design so its operations are maximally reducible to characteristics aligned with scientific knowledge of human clinician cognition. Principles: Beneficence and Nonmaleficence.
3. Requirement: Maximize traceability of patient derived data, and commensurate data stewardship, accountability, and authorization, including by adherence to accepted standards. Principles: Accountability and Respect for Autonomy.
4. Requirement: Validate rigorously for safety, efficacy and equity, using preregistered clinical studies, by comparing against clinical outcome, or outcome surrogates in the case of chronic diseases, in the intended clinical workflow and usage, as shown by either direct or linked evidence. Principles: Nonmaleficence and Equity.
5. Requirement: Assume liability commensurate with indications for use and autonomy. Principles: Accountability and Equity.

The following sections will illustrate these various requirements for autonomous AI as they are relevant. Addressing such requirements during the design, validation and implementation is expected to increase trust from all healthcare stakeholders, including patients, physicians, regulators, and payers.

Autonomous AI System Design Requirements

Considerations for Autonomous AI design can have unexpected and profound ethical implications. When autonomous AI is designed so its operations are maximally reducible to characteristics aligned with scientific knowledge of human clinician cognition, it ethically aligns with non-maleficence. The use of patho-physiologically sound priors, such as biomarkers, and exploiting higher order coherence in the input data, not only helps gain trust from regulators, physicians and patients, but also improves safety and equity, thus aligning with non-maleficence and justice.
Machine learning algorithms that mimic closely how clinicians diagnose have been shown to be more robust to small perturbations in the input, to show less catastrophic failure, and to be less likely to exhibit inappropriate racial and other bias [58, 59]. Black box or gray box algorithm designs make such bias harder to mitigate and detect, while the speed and scalability can multiply the effect of inappropriate bias faster than traditional enforcement efforts can react. Take as an example the design of autonomous AI for diabetic retinopathy: for over 150 years, clinicians have evaluated a patient's retina for the different indicators of diabetic retinopathy such as hemorrhages, microaneurysms, and neovascularization [60, 61]. These are all indicators or biomarkers that are invariant with regards to race, ethnicity, sex, and age. Using multiple, statistically dependent detectors for such biomarkers [62, 63], each optimized using machine learning algorithms, mitigates the problems sketched.

Designing the autonomous AI so that it improves patient outcome, as shown either by direct evidence or linked clinical literature, ensures patient benefit and ethically aligns with non-maleficence and justice. Similarly, alignment with evidence based clinical standards of care/practice patterns from quality of care organizations, professional medical societies and patient organizations, while accounting for safety, efficacy and equity, again aligns ethically with non-maleficence and justice. "Glamour AI", or AI which is technologically of high interest but does not improve outcome, is thus avoided by this requirement.

For our example, when the goal is to create a real time point of care autonomous retinal exam for diabetic retinopathy, non-maleficence was maximized by ensuring that diabetic macular edema was diagnosed accurately by the autonomous AI system, as is standard of care [8, 16]. Somewhat surprisingly, this design requirement is different from that of most AI designs, which focus only on the classic ischemic variant of diabetic retinopathy and ignore macular edema. At most, they only look for exudates, which in many cases are not proxies for the presence or absence of diabetic macular edema [34, 64].

Evidence of the relationship to outcome is essential to confirm that the autonomous AI improves outcome. The level of disease diagnosed by an autonomous AI can align directly with the risk of poor outcome: an IDx-DR negative output (less than ETDRS level 35 and no macular edema) implies a risk of 1.7% or less of proliferative retinopathy in 3 years, and a risk of 2.4% or less of DME in 1 year, while an IDx-DR positive output (ETDRS level 35 or higher, or center-involved or clinically significant macular edema) confers a risk of at least 18% of proliferative retinopathy in 3 years, as well as a risk of 17.7% of developing DME in 1 year, if left untreated [34, 65]. This ties the autonomous AI output directly to patient relevant clinical outcome.

Autonomous AI System Validation Requirements

Just as is the case for its design, the validation of an autonomous AI can also have unexpected and profound ethical implications. To maximize trust, autonomous AI should be validated rigorously for safety, efficacy and equity, ideally using preregistered clinical studies, by comparing the AI against clinical outcome, or prognostic standards in the case of chronic diseases, within the intended clinical workflow and usage, as shown by either direct or linked evidence; in this way, non-maleficence and equity are maximized. The meaning of each of these terms is explained here.

In accordance with the bioethical principle of non-maleficence [6, 10], AI validation studies should test hypotheses of safety, efficacy and equity. Scientific validity of such studies is higher the more replicable they are, and common reporting standards [66], CONSORT-AI [67], preregistration of study and analysis protocols [42, 68], and a validated relationship to patient outcome [10] are important factors to enhance replicability, consistent with US Federal regulations.
While standards have been set for preregistration, and especially Good Clinical Practice (GCP) [7], depending on the risk of harm to the patient, these can be burdensome and can require substantial resources from AI creators, and thus may not always be achievable for AI validation studies, in accordance with risk of patient harm. Broadly, preregistration includes public registration of the in- and exclusion criteria, the entire protocol, and the statistical analysis plan on a site such as clinicaltrials.gov; a hypothesis-testing design with predefined endpoints; a predefined method for statistical analysis; predefined inclusion and exclusion criteria; a predefined sampling protocol; a plan for handling of the trial data by an independent Contract Research Organization or third party; and prohibition of access by the researchers to the subject level results before finalizing the statistical analysis.

Also in accordance with non-maleficence is to validate the autonomous AI against what matters to the patient. This is most likely clinical outcome: the clinical event relevant to the patient, or an event of which the patient is aware and wants to avoid, including death, loss of vision, visual field loss, the need for ventilatory support, or other events causing a reduction in quality of the patient's life [69].

For acute diseases or interventions, clinical outcome may be immediate and easy to measure, such as visual acuity in the case of myopia or central retinal artery occlusion. However, for the many chronic diseases that autonomous AI has particular potential for, such as diabetic retinopathy, glaucoma or macular degeneration, clinical outcome may take years to manifest. As a consequence, surrogate endpoints have been developed [70] to reduce the cost and shorten the duration of trials, especially in the drug approval process. One type of surrogate endpoint is a phenotype: a laboratory measurement, or a physical sign, used as a substitute for a clinical outcome that measures directly how a patient feels, functions or survives. Another type is a prognostic standard: (combinations of) biomarkers that are associated with prognosis. Changes induced by a therapy on a prognostic standard, or other surrogate endpoints, should reflect changes in clinical outcome. Examples of surrogate endpoints are suppression of ventricular arrhythmias, or reduction in cholesterol level, in cardiovascular trials. Prognostic standards include positive pathology in a biopsy, or progression on mammograms, in breast cancer, or evidence of mitosis in skin cancer. Within diabetic retinopathy, the Early Treatment Diabetic Retinopathy Study (ETDRS) scale and the DRCR.net macular edema scale are prognostic standards [65, 71].

Obviously, while requiring less time and fewer resources than true outcome in the case of chronic disease, prognostic standard based reference standards still require considerable effort. This is an important reason why clinician derived reference standards, rather than outcome or surrogate based reference standards, are so widely used in AI. This practice is so widespread that it is almost a standard. A widely cited meta-analysis of the quality of evidence of AI accuracy, while mentioning the potential of AI to improve outcome, takes as a given the comparison to clinician derived ground truth. Validation against (surrogate) clinical outcome is not even considered [72].

In addition to unknown validity against outcome, reproducibility (different clinicians evaluating the same patient differently in 30–50% of cases), repeatability (the same clinician evaluating the same patient differently in 20–30% of cases), and temporal drift (clinicians systematically evaluating the same hypothetical patient differently over generations of clinicians) are other major issues to be addressed [24, 25, 73]. As the evidence for a given treatment based on a given evaluation may have been derived decades ago, the latter, temporal drift, is a form of bias that is especially pernicious and difficult to correct for. When prognostic standards or outcome are unfeasible, optimal correction for reproducibility and repeatability requires strict evaluation protocols and independent verification where possible.

Summarizing the above, clinical outcome or prognostic standards are preferred for primary endpoints, provided their validity against true clinical outcome has already been rigorously established. Only if these are not available can non outcome associated endpoints be used, but that should then be clear in the autonomous AI labelling [69] (Table 4.2).

Table 4.2  Reference standards © 2020 Abramoff
• Level I Reference Standard: A reference standard that either is a clinical outcome, a prognostic standard or other surrogate outcome. If the surrogate outcome is derived from an independent reading center, validation against outcome is required, as is published evidence of temporal drift, reproducibility, and repeatability metrics.
• Level II Reference Standard: A reference standard established by an independent reading center with published temporal drift, reproducibility, and repeatability metrics. A Level II reference standard has not been validated to correlate with a clinical outcome.
• Level III Reference Standard: A reference standard created from the same modality as used by the AI, by adjudicating or voting of multiple independent expert readers, documented to be masked, with published reproducibility and repeatability metrics. A Level III reference standard has not been established by an independent reading center, and has not been validated to correlate with a clinical outcome.
• Level IV Reference Standard: All other reference standards, created by single readers or non-expert readers, without an established protocol. A Level IV reference standard has not been derived from an independent reading center, has not been validated to correlate with a clinical outcome, and there are no published reproducibility and repeatability metrics.
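One way to make this taxonomy concrete is as a small classification helper. The sketch below is our own illustrative simplification of Table 4.2, with attribute names we chose; it is not an established grading tool, and it collapses the conditional clauses of Level I into a single flag.

```python
def reference_standard_level(outcome_or_surrogate: bool,
                             independent_reading_center: bool,
                             published_variability_metrics: bool) -> int:
    """Simplified mapping of a reference standard onto the Table 4.2 levels.

    outcome_or_surrogate: the standard is, or is validated against, a clinical
        outcome or surrogate outcome (Level I).
    independent_reading_center: established by an independent reading center.
    published_variability_metrics: temporal drift / reproducibility /
        repeatability metrics are published.
    """
    if outcome_or_surrogate:
        return 1
    if independent_reading_center and published_variability_metrics:
        return 2
    if published_variability_metrics:  # masked multi-reader adjudication
        return 3
    return 4  # single or non-expert readers, no established protocol

print(reference_standard_level(False, True, True))    # 2
print(reference_standard_level(False, False, False))  # 4
```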
Validation in the envisioned context, environment and workflow, in "locked" form, so that its performance is known and persists in real world clinical practice, is desirable for aligning with non-maleficence. For example, the negative effect of AI on outcome that was shown in the Fenton study could thus have been prevented [45].

Updates to the autonomous AI should be evaluated on their potential effect on patient risk of harm. Changing the font size of the autonomous AI user interface will have substantially lower risk than adding training data for potential performance improvement. A standardized process that links the level of evidence to potential risk of patient harm for each type of system update allows analysis of such "continuous learning". At the highest risk of patient harm, a fully locked autonomous AI, once validated, cannot have its training data automatically updated based on new inputs, as then the safety, efficacy and equity are not known. In a narrow sense, "continuous learning" AI systems, where learning is used to describe incorporating new training data as the system processes new inputs during deployment, will require new validation at the same level as the original, in chronic disease.

Ensuring that the validation is applicable for the clinical use case requires workflow analysis and, where possible, mimicking workflow during the trial. In our example, this required the trial to be performed in primary care clinics, in the standard diabetes management workflow, without modifications to the clinic environment, and with operators recruited from existing staff without prior experience or training.

Autonomous AI System Implementation Requirements

For enhancing trust, ethical alignment with patient autonomy from a patient-derived perspective is important, as we have derived previously [10]. This means focusing on maximizing traceability of patient derived data, including commensurate data stewardship, accountability, and authorization, as well as adherence to accepted standards. Obviously, this applies to data usage during the design phase, but is of primary importance during deployment. Alignment with patient autonomy may make so-called "data-plays", where the purpose of the autonomous AI is maximizing the value of resellable patient derived data rather than providing a patient diagnosis, ethically problematic.

Operationally, autonomous AI creators have an obligation to lawfully collect data, in the US requiring compliance with HIPAA/HITECH, as well as other applicable statutory and regulatory rules, in a manner that is transparent about the purpose and scope for which the data will be used [48]. Data used by the autonomous AI creator should be traceable to an authorization to use such data. Transparency on the part of autonomous AI creators, through written agreements, is essential to assess whether patients have adequately authorized use of data. Physicians and AI creators together are accountable directly to patients and each must take full responsibility for protecting patient rights as stewards of patient derived data.
64 M. D. Abramoff

scope for which it was authorized and to protect However, medical decisions by autonomous
the data from unauthorized use or access. AI on individual patients typically cannot be
It is critical that autonomous AI system valida- unequivocally labeled as correct or incorrect,
tion requirements also include ongoing monitor- especially in chronic diseases where outcomes
ing of real-world performance after deployment. may emerge years later. On populations of
Typically, this is achieved by instituting a compre- patients however, the medical decisions can be
hensive Quality Management System (QMS), compared statistically to the desired decisions,
such as that under 21 CFR 820, that accommo- for example to claimed correctness, and it is thus
dates user feedback, complaints, reportable where the liability will be focused. Another issue
events, and ongoing product monitoring. is that, while autonomous AI is preferably com-
Performance data monitored under the QMS pared to patient outcome, or surrogate outcome,
should include a predefined protocol for deter- this requires enormous resources that will be not
mining whether the autonomous AI system results available for the individual patient where liability
remain within the specified performance range is at stake. Then, the autonomous AI decision
that aligns with safe, effective, and equitable use will be compared to an individual physician or
of the AI system. In addition, ongoing monitoring group of physicians, lacking validation and thus,
of real-world performance includes all other qual- with unknown correspondence to outcome or
ity responsibilities that remain within developers’ surrogate outcome. This is obviously an issue for
control such as usability, user experience, product so-called continuous learning AI systems.
performance (which include uptime, bugs, and These distinctions will need to be resolved as
issues), and necessary safety controls (which various AI applications move forward.
include a comprehensive framework for cyberse-
curity, data protection, and data privacy).
Program integrity is maximized through the inherent validation of autonomous AI safety and equity, as well as through design controls for the integration of the AI with the wider clinical EHR and charting system, ensuring automated population of the EHR with the diagnostic outputs and management.

Creators of healthcare autonomous AI should assume liability for harm related to the inadequate performance of the device when used properly and on-label. This is essential for adoption: it is inappropriate for a clinician, using an autonomous AI to make a diagnosis they are not comfortable making themselves, to have full medical liability for harm caused by that AI. This position has been confirmed by the American Medical Association in its 2019 AI Policy [1]. Just as a physician grading an examination would be held responsible for their diagnosis, creators of autonomous AI products have obtained medical malpractice insurance. This paradigm for responsibility shifts medical liability for a medical diagnostic from the provider managing the patient's diabetes, who orders the autonomous point-of-care retinal examination, to the autonomous AI creator.

However, medical decisions by autonomous AI on individual patients typically cannot be unequivocally labeled as correct or incorrect, especially in chronic diseases where outcomes may emerge years later. On populations of patients, however, the medical decisions can be compared statistically to the desired decisions, for example to claimed correctness, and it is there that liability will be focused. Another issue is that, while autonomous AI is preferably compared to patient outcome, or a surrogate outcome, this requires enormous resources that will not be available for the individual patient where liability is at stake. In that case, the autonomous AI decision will be compared to an individual physician or a group of physicians, lacking validation and thus with unknown correspondence to outcome or surrogate outcome. This is obviously an issue for so-called continuous learning AI systems. These distinctions will need to be resolved as various AI applications move forward.

Summary

Successful introduction of autonomous AI into the healthcare system can be achieved. For example, at the University of Iowa, introduction of autonomous AI for the diabetic eye exam has led to greatly increased compliance of people with diabetes with their annual eye exams. This was especially important during the recent COVID-19 pandemic, when patients, already at higher risk of morbidity and mortality from the virus, were reluctant to make the extra visit to an ophthalmologist that would otherwise have been required. Even after all clinics were shut down for a few weeks, the care gaps resulting from patients not getting their diabetic eye exam were closed within weeks, so that there is now almost complete compliance in this diabetes population. Using the autonomous AI system, patients who would otherwise not have been able to get in for an exam for at least 6 months were able to get in and complete the exam within minutes. Because of the program integrity enabled by autonomous AI, patients and payers were assured that the diabetic eye exam was fully documented, billed and coded.
Trust in autonomous AI is essential, and we need to do this "the right way" [11]. It is thus vital that everyone involved with autonomous AI helps build trust. Scientifically valid considerations, and best practices such as those currently being developed by the "Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation" workgroup, allow any autonomous AI creator to contribute to, rather than detract from, building trust in autonomous AI. Ultimately, the benefits of autonomous AI to patients, physicians and the public, in terms of better quality and consistency of care, better access and lower cost, will thus be realized [11].

References

1. American Medical Association (AMA) Board of Trustees Policy Summary. Augmented intelligence in healthcare. 2019. https://www.ama-assn.org/system/files/2019-08/ai-2018-board-policy-summary.pdf.
2. Horton MB, Brady CJ, Cavallerano J, Abramoff M, Barker G, Chiang MF, et al. Practice guidelines for ocular telehealth-diabetic retinopathy, 3rd edition. Telemed J E Health. 2020;26(4):495–543. https://www.ncbi.nlm.nih.gov/pubmed/32209018.
3. Helmchen LA, Lehmann HP, Abramoff MD. Automated detection of retinal disease. Am J Manag Care. 2014;11(17).
4. Centers for Medicare and Medicaid Services. Artificial Intelligence (AI) health outcomes challenge. 2019.
5. Char DS, Shah NH, Magnus D. Implementing machine learning in health care – addressing ethical challenges. N Engl J Med. 2018;378(11):981–3. https://www.ncbi.nlm.nih.gov/pubmed/29539284.
6. Char DS, Abramoff MD, Feudtner C. Identifying potential ethical concerns in the conceptualization, development, implementation, and evaluation of machine learning healthcare applications. Am J Bioethics. 2020. [in press].
7. US Food and Drug Administration (FDA). E6(R2) Good clinical practice: integrated addendum to ICH E6(R1). 2018.
8. American Diabetes Association. 11. Microvascular complications and foot care: standards of medical care in diabetes-2020. Diabetes Care. 2020;43(Suppl 1):S135–S51. https://www.ncbi.nlm.nih.gov/pubmed/31862754.
9. National Committee for Quality Assurance (NCQA). HEDIS Measurement Year 2020 and Measurement Year 2021. Volume 2: Technical specifications for health plans. Washington, DC: National Committee for Quality Assurance (NCQA); 2020.
10. Abramoff MD, Tobey D, Char DS. Lessons learnt about autonomous AI: finding a safe, efficacious and ethical path through the development process. Am J Ophthalmol. 2020. https://www.ncbi.nlm.nih.gov/pubmed/32171769.
11. Robeznieks A (American Medical Association). This ophthalmologist is doing health care AI the right way. AMA website. 2019. https://www.ama-assn.org/practice-management/digital/ophthalmologist-doing-health-care-ai-right-way.
12. Bragge P, Gruen RL, Chau M, Forbes A, Taylor HR. Screening for presence or absence of diabetic retinopathy: a meta-analysis. Arch Ophthalmol. 2011;129(4):435–44.
13. Rein DB, Zhang P, Wirth KE, Lee PP, Hoerger TJ, McCall N, et al. The economic burden of major adult visual disorders in the United States. Arch Ophthalmol. 2006;124(12):1754–60. https://www.ncbi.nlm.nih.gov/pubmed/17159036.
14. Fong DS, Aiello L, Gardner TW, King GL, Blankenship G, Cavallerano JD, et al. Retinopathy in diabetes. Diabetes Care. 2004;27(Suppl 1):S84–S7.
15. Klonoff DC, Schwartz DM. An economic analysis of interventions for diabetes. Diabetes Care. 2000;23(3):390–404.
16. American Academy of Ophthalmology Retina/Vitreous Panel, Hoskins Center for Quality Eye Care. Preferred practice patterns: diabetic retinopathy. In: American Academy of Ophthalmology Retina Panel, editor. Updated 2016 ed. San Francisco, CA: American Academy of Ophthalmology; 2016.
17. Ahmed J, Ward TP, Bursell SE, Aiello LM, Cavallerano JD, Vigersky RA. The sensitivity and specificity of nonmydriatic digital stereoscopic retinal imaging in detecting diabetic retinopathy. Diabetes Care. 2006;29(10):2205–9. http://www.ncbi.nlm.nih.gov/pubmed/17003294.
18. Aiello LM, Bursell SE, Cavallerano J, Gardner WK, Strong J. Joslin vision network validation study: pilot image stabilization phase. J Am Optom Assoc. 1998;69(11):699–710.
19. Benoit SR, Swenor B, Geiss LS, Gregg EW, Saaddine JB. Eye care utilization among insured people with diabetes in the U.S., 2010-2014. Diabetes Care. 2019;42(3):427–33. https://www.ncbi.nlm.nih.gov/pubmed/30679304.
20. Solomon SD, Chew E, Duh EJ, Sobrin L, Sun JK, VanderBeek BL, et al. Diabetic retinopathy: a position statement by the American Diabetes Association. Diabetes Care. 2017;40(3):412–8. https://www.ncbi.nlm.nih.gov/pubmed/28223445.
21. Cagliero E, Levina EV, Nathan DM. Immediate feedback of HbA1c levels improves glycemic control in type 1 and insulin-treated type 2 diabetic patients. Diabetes Care. 1999;22(11):1785–9. https://www.ncbi.nlm.nih.gov/pubmed/10546008.
22. Lian J, Liang Y. Diabetes management in the real world and the impact of adherence to guideline recommendations. Curr Med Res Opin. 2014;30(11):2233–40. https://www.ncbi.nlm.nih.gov/pubmed/25105305.
23. Egbunike V, Gerard S. The impact of point-of-care A1C testing on provider compliance and A1C levels in a primary setting. Diabetes Educ. 2013;39(1):66–73.
24. Pugh JA, Jacobson JM, Van Heuven WA, Watters JA, Tuley MR, Lairson DR, et al. Screening for diabetic retinopathy. The wide-angle retinal camera. Diabetes Care. 1993;16(6):889–95. http://www.ncbi.nlm.nih.gov/pubmed/8100761.
25. Lin DY, Blumenkranz MS, Brothers RJ, Grosvenor DM. The sensitivity and specificity of single-field nonmydriatic monochromatic digital fundus photography with remote image interpretation for diabetic retinopathy screening: a comparison with ophthalmoscopy and standardized mydriatic color photography. Am J Ophthalmol. 2002;134(2):204–13.
26. Thorwarth WT Jr. From concept to CPT code to compensation: how the payment system works. J Am Coll Radiol. 2004;1(1):48–53. https://www.ncbi.nlm.nih.gov/pubmed/17411519.
27. Chiang MF, Casper DS, Cimino JJ, Starren J. Representation of ophthalmology concepts by electronic systems: adequacy of controlled medical terminologies. Ophthalmology. 2005;112(2):175–83. https://www.ncbi.nlm.nih.gov/pubmed/15691548.
28. Steindel SJ. A comparison between a SNOMED CT problem list and the ICD-10-CM/PCS HIPAA code sets. Perspect Health Inf Manag. 2012;9:1b. https://www.ncbi.nlm.nih.gov/pubmed/22548020.
29. Linder JA, Kaleba EO, Kmetik KS. Using electronic health records to measure physician performance for acute conditions in primary care: empirical evaluation of the community-acquired pneumonia clinical quality measure set. Med Care. 2009;47(2):208–16. https://www.ncbi.nlm.nih.gov/pubmed/19169122.
30. Shortliffe EH, Davis R, Axline SG, Buchanan BG, Green CC, Cohen SN. Computer-based consultations in clinical therapeutics: explanation and rule acquisition capabilities of the MYCIN system. Comput Biomed Res. 1975;8(4):303–20. http://www.ncbi.nlm.nih.gov/pubmed/1157471.
31. Fukushima K. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern. 1980;36(4):193–202. http://www.ncbi.nlm.nih.gov/pubmed/7370364.
32. Rumelhart DE, McClelland JL, University of California San Diego PDP Research Group. Parallel distributed processing: explorations in the microstructure of cognition. Cambridge, MA: MIT Press; 1986.
33. Wolf RM, Channa R, Abramoff MD, Lehmann HP. Cost-effectiveness of autonomous point-of-care diabetic retinopathy screening for pediatric patients with diabetes. JAMA Ophthalmol. 2020. https://www.ncbi.nlm.nih.gov/pubmed/32880616.
34. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. Nat Digit Med. 2018;1(1):39. https://doi.org/10.1038/s41746-018-0040-6.
35. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. https://www.ncbi.nlm.nih.gov/pubmed/31649194.
36. Larrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci U S A. 2020;117(23):12592–4. https://www.ncbi.nlm.nih.gov/pubmed/32457147.
37. Angus DC. Randomized clinical trials of artificial intelligence. JAMA. 2020. https://www.ncbi.nlm.nih.gov/pubmed/32065828.
38. Pearl J, Mackenzie D. The book of why: the new science of cause and effect. New York: Basic Books; 2018.
39. Bossuyt PM, Lijmer JG, Mol BW. Randomised comparisons of medical tests: sometimes invalid, not always efficient. Lancet. 2000;356(9244):1844–7. https://www.ncbi.nlm.nih.gov/pubmed/11117930.
40. Korevaar DA, Gopalakrishna G, Cohen JF, Bossuyt PM. Targeted test evaluation: a framework for designing diagnostic accuracy studies with clear study hypotheses. Diagn Progn Res. 2019;3:22. https://www.ncbi.nlm.nih.gov/pubmed/31890896.
41. Lu B, Gatsonis C. Efficiency of study designs in diagnostic randomized clinical trials. Stat Med. 2013;32(9):1451–66. https://www.ncbi.nlm.nih.gov/pubmed/23071073.
42. Kaplan RM, Irvin VL. Likelihood of null effects of large NHLBI clinical trials has increased over time. PLoS One. 2015;10(8):e0132382. https://www.ncbi.nlm.nih.gov/pubmed/26244868.
43. Ting DSW, Peng L, Varadarajan AV, Keane PA, Burlina PM, Chiang MF, et al. Deep learning in ophthalmology: the technical and clinical considerations. Prog Retin Eye Res. 2019;72:100759. https://www.ncbi.nlm.nih.gov/pubmed/31048019.
44. Van Dijk HW, Verbraak FD, Kok PHB, Oberstein SYL, Schlingemann RO, Russell SR, et al. Variability in photocoagulation treatment of diabetic macular oedema. Acta Ophthalmol. 2013;91(8):722–7. https://doi.org/10.1111/j.1755-3768.2012.02524.x.
45. Fenton JJ, Taplin SH, Carney PA, Abraham L, Sickles EA, D'Orsi C, et al. Influence of computer-aided detection on performance of screening mammography. N Engl J Med. 2007;356(14):1399–409.
46. Sonka M, Fitzpatrick JM. Handbook of medical imaging – volume 2, medical image processing and analysis. Bellingham, WA: The International Society for Optical Engineering Press; 2000.
47. Sackett DL. Bias in analytic research. J Chronic Dis. 1979;32(1–2):51–63. https://www.ncbi.nlm.nih.gov/pubmed/447779.
48. Blumenthal D. Launching HITECH. N Engl J Med. 2010;362(5):382–5. http://www.ncbi.nlm.nih.gov/pubmed/20042745.
49. Copeland R, Needleman S. Google's 'Project Nightingale' triggers federal inquiry. WSJ. 2019. https://www.wsj.com/articles/behind-googles-project-nightingale-a-health-data-gold-mine-of-50-million-patients-11573571867.
50. Moyer VA, US Preventive Services Task Force. Screening for glaucoma: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med. 2013;159(7):484–9. https://www.ncbi.nlm.nih.gov/pubmed/24325017.
51. Chou R, Dana T, Bougatsos C, Grusing S, Blazina I. Screening for impaired visual acuity in older adults: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA. 2016;315(9):915–33. https://www.ncbi.nlm.nih.gov/pubmed/26934261.
52. McLaughlin CC, Wu XC, Jemal A, Martin HJ, Roche LM, Chen VW. Incidence of noncutaneous melanomas in the U.S. Cancer. 2005;103(5):1000–7. https://www.ncbi.nlm.nih.gov/pubmed/15651058.
53. Sullivan HR, Schweikart SJ. Are current tort liability doctrines adequate for addressing injury caused by AI? AMA J Ethics. 2019;21(2):E160–6. https://www.ncbi.nlm.nih.gov/pubmed/30794126.
54. Maier S. Elon take the wheel. Minnesota Law Rev. 2017. https://minnesotalawreview.org/2017/01/24/elon-take-the-wheel/.
55. Chandler RJ, Venditti CP. Gene therapy for metabolic diseases. Transl Sci Rare Dis. 2016;1(1):73–89. https://www.ncbi.nlm.nih.gov/pubmed/27853673.
56. Russell S, Bennett J, Wellman JA, Chung DC, Yu ZF, Tillman A, et al. Efficacy and safety of voretigene neparvovec (AAV2-hRPE65v2) in patients with RPE65-mediated inherited retinal dystrophy: a randomised, controlled, open-label, phase 3 trial. Lancet. 2017;390(10097):849–60. https://www.ncbi.nlm.nih.gov/pubmed/28712537.
57. Beauchamp TL, Childress JF. Principles of biomedical ethics. 8th ed. New York: Oxford University Press; 2019.
58. Shah A, Lynch S, Niemeijer M, Amelon R, Clarida W, Folk J, et al. Susceptibility to misdiagnosis of adversarial images by deep learning based retinal image analysis algorithms. In: Proceedings – International Symposium on Biomedical Imaging; 2018.
59. Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS. Adversarial attacks on medical machine learning. Science. 2019;363(6433):1287–9. https://www.ncbi.nlm.nih.gov/pubmed/30898923.
60. Friedenwald J, Day R. The vascular lesions of diabetic retinopathy. Bull Johns Hopkins Hosp. 1950;86(4):253–4. http://www.ncbi.nlm.nih.gov/pubmed/15411556.
61. MacKenzie S. A case of glycosuric retinitis, with comments. (Microscopical examination of the eyes by Mr. Nettleship). Roy London Ophthal Hosp Rep. 1879;9(134).
62. Hubel DH, Wiesel TN. Receptive fields of single neurones in the cat's striate cortex. J Physiol. 1959;148:574–91.
63. Ts'o DY, Frostig RD, Lieke EE, Grinvald A. Functional organization of primate visual cortex revealed by high resolution optical imaging. Science. 1990;249(4967):417–20.
64. Wang YT, Tadarati M, Wolfson Y, Bressler SB, Bressler NM. Comparison of prevalence of diabetic macular edema based on monocular fundus photography vs optical coherence tomography. JAMA Ophthalmol. 2016;134(2):222–8. http://www.ncbi.nlm.nih.gov/pubmed/26719967.
65. Early Treatment Diabetic Retinopathy Study Research Group. Fundus photographic risk factors for progression of diabetic retinopathy. ETDRS report number 12. Ophthalmology. 1991;98(5 Suppl):823–33.
66. Cohen JF, Korevaar DA, Altman DG, Bruns DE, Gatsonis CA, Hooft L, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016;6(11):e012799. https://www.ncbi.nlm.nih.gov/pubmed/28137831.
67. Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, Chan A-W, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26(9):1364–74. https://doi.org/10.1038/s41591-020-1034-x.
68. US Food and Drug Administration (FDA). FDA permits marketing of artificial intelligence-based device to detect certain diabetes-related eye problems. Washington, DC; 2018. https://www.fda.gov/newsevents/newsroom/pressannouncements/ucm604357.htm.
69. Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Ann Intern Med. 1996;125(7):605–13. https://www.ncbi.nlm.nih.gov/pubmed/8815760.
70. Temple R. A regulatory authority's opinion about surrogate endpoints. In: Nimmo W, Tucker G, editors. Clinical measurement in drug evaluation. New York: Wiley; 1995.
71. Browning DJ, Glassman AR, Aiello LP, Bressler NM, Bressler SB, Danis RP, et al. Optical coherence tomography measurements and analysis methods in optical coherence tomography studies of diabetic macular edema. Ophthalmology. 2008;115(8):1366–71, 71 e1. http://www.ncbi.nlm.nih.gov/pubmed/18675696.
72. Nagendran M, Chen Y, Lovejoy CA, Gordon AC, Komorowski M, Harvey H, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689. https://www.ncbi.nlm.nih.gov/pubmed/32213531.
73. Lin AP, Katz LJ, Spaeth GL, Moster MR, Henderer JD, Schmidt CM Jr, et al. Agreement of visual field interpretation among glaucoma specialists and comprehensive ophthalmologists: comparison of time and methods. Br J Ophthalmol. 2011;95(6):828–31. http://www.ncbi.nlm.nih.gov/pubmed/20956271.
Technical Aspects of Deep Learning in Ophthalmology

5

Zhiqi Chen and Hiroshi Ishikawa

Z. Chen
Department of Electrical and Computer Engineering, New York University, Brooklyn, NY, USA
Department of Ophthalmology, NYU Langone Health, New York, NY, USA
e-mail: zc1337@nyu.edu

H. Ishikawa (*)
Department of Ophthalmology, NYU Langone Health, New York, NY, USA
Department of Biomedical Engineering, New York University, Brooklyn, NY, USA
e-mail: Hiroshi.Ishikawa@nyulangone.org

Deep learning (DL) is a specific category within machine learning. The prototype of DL dates back to the 1940s, when Walter Pitts and Warren McCulloch designed a computational model to mimic the neural networks of the human brain [1]. Initially, neural networks were clumsy and inefficient, and they did not become useful until 1985, when the concept of back propagation was applied to them [2]. In 1989, LeCun first demonstrated the feasibility of convolutional neural networks with back propagation for recognizing handwritten zip codes [3]. As computational speeds increased exponentially with the development of graphics processing units (GPUs), neural networks gained more layers and began to compete with support vector machines. Moreover, neural networks are scalable and able to continue improving as more parameters and training data are added with increased computational power. In 2009, ImageNet, which collected over 14 million labeled images from 1000 categories, was launched [4]. Soon after, in 2012, DL models won the ImageNet recognition challenge decisively for the first time, decreasing the top-5 error rate from 26.1% to 15.3% [5]. DL models have dominated these challenges since then. In some applications, such as traffic sign recognition, diabetic retinopathy classification and Go (a traditional Chinese strategy game like chess), DL has even exceeded human performance [6–8]. DL provides a powerful framework for learning: the deep structure enables the algorithm to represent complex functions. Consequently, given models and datasets that are big enough, deep learning can be used to learn the mapping from input data to output data for complex real-life tasks.

Clinical diagnosis of many eye diseases relies on characteristic patterns in the visualization of the eye and its surrounding structures. This high dependence on imaging makes DL a natural fit for the field of ophthalmology. In recent years, research involving DL in ophthalmology has risen exponentially [7, 9–13]. DL-based AI technology is likely to aid the clinical decision-making process in the near future, improving overall medical service quality.

In this chapter, we provide a formal introduction to and definition of deep learning concepts, techniques and architectures. We begin with the main forms of learning. Next, we review the basics of the simplest DL architecture, deep feedforward neural networks (DFNs).
Then, we present important further extensions of DFNs: convolutional neural networks (CNNs), which are powerful in processing array data; recurrent neural networks (RNNs) for sequence modelling; and generative adversarial networks (GANs), a recent innovation in DL.

Supervised Learning and Unsupervised Learning

In machine learning applications, there are two main types of tasks: supervised learning and unsupervised learning. Supervised learning is the most common form. In supervised learning applications, each input of the training data comes along with its corresponding target (ground truth data). Cases such as image recognition, in which the goal is to assign each image to a discrete category like horse, car or people, are called classification. Other cases, like predicting vision outcomes after surgery, in which the desired output is continuous, are called regression. In unsupervised learning applications, we do not have prior knowledge of the output values of the input samples. The aim of such tasks is to infer the underlying structure within a set of data, for example by grouping similar examples within the data (known as clustering) or by projecting data from higher dimensions to lower dimensions for visualization.

Other forms, like semi-supervised learning, have become more and more popular recently, since preparing a ground truth dataset is difficult in medical applications (e.g., number of samples, manual labeling labor, etc.). In semi-supervised learning, the algorithm uses a small amount of labeled data in conjunction with a large amount of unlabeled data to achieve a considerable improvement in learning accuracy compared to supervised learning on the labeled data alone.

Deep Feedforward Networks (DFNs)

The classic model of deep learning is the DFN, also called the multilayer perceptron (MLP). A DFN aims to approximate some function f* in a real problem. A DFN defines a mapping y = f(x; w), where w denotes the weights of the model, which can be learnt from data so that f best approximates f*. Figure 5.1 shows the structure of DFNs.

Fig. 5.1  Structure of a 3-layer DFN. The layers are made of nodes where computation happens. A node combines the input vector with a set of weights to get the weighted sum of the input. Then the sum passes through an activation function to get the output of the node. The input signals go through nodes of several layers to produce the final output

A DFN is a network because it is a combination of several functions. Take a three-layer DFN for example: three functions f(1), f(2) and f(3), corresponding to the first, second and third layer of the model respectively, are connected in a chain to form f(x) = f(3)(f(2)(f(1)(x))). The three layers are called the input layer, hidden layer and output layer, respectively. The overall length of the chain is called the depth of the network, and the number of neurons within the hidden layers is the width of the network.

DFNs are feedforward because the information flow along the chain is one-directional, with no loops or cycles. When feedback connections are included in a DFN, it is called a recurrent neural network, which will be presented in section "Recurrent Neural Networks (RNNs)".

At the most basic level, DFNs have two basic features: (1) layers and nodes; (2) activation. Layers and nodes are the basic building blocks of DFNs. Nodes make up layers, and layers connect into a dense network where each node, also called a neuron, in a layer is connected to all nodes in the next layer, and connections between layers are all one-directional.
The output of a previous layer is the input of the next layer. Another key component of DFNs is the activation. Inspired by biological neurons, which are activated when inputs accumulate beyond a certain threshold, activation functions take the weighted sum of inputs to a layer as input and transform it non-linearly. For example, the most common activation function, the rectified linear unit (ReLU), transforms an input signal to 0 (not activated) or passes it through unchanged (activated) if the input signal is big enough. The sigmoid and tanh functions are also commonly used as activation functions. Figure 5.2 shows the sigmoid function.

Fig. 5.2  Sigmoid activation function. The sigmoid activation function is a nonlinear function which maps the input signal to a value between 0 and 1. Such nonlinear activation functions increase the ability of a DFN to represent complex functions

Back-propagation allows the error between output values and ground-truth values to be fed back through the network and enables the learning of deep neural networks. Through back propagation, the algorithm searches for the weight of each cross-layer connection between neurons in order to minimize the error function.

In short, these general concepts of DFNs form the basis for further developments of DL such as CNNs and RNNs.
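To make the composition f(x) = f(3)(f(2)(f(1)(x))) concrete, the following is a minimal NumPy sketch of a forward pass through a three-layer DFN with ReLU hidden activations and a sigmoid output; the layer sizes and random weights are illustrative only, and training via back propagation is omitted.

    import numpy as np

    def relu(z):
        # ReLU: outputs 0 when not activated, the input itself when activated
        return np.maximum(0, z)

    def sigmoid(z):
        # Sigmoid: maps any input into the range (0, 1), as in Fig. 5.2
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    # Illustrative 3-layer DFN: 4 inputs -> 8 hidden -> 8 hidden -> 1 output
    W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
    W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
    W3, b3 = rng.normal(size=(1, 8)), np.zeros(1)

    def forward(x):
        h1 = relu(W1 @ x + b1)        # f(1): first layer
        h2 = relu(W2 @ h1 + b2)       # f(2): second layer
        return sigmoid(W3 @ h2 + b3)  # f(3): output layer

    print(forward(np.array([0.5, -1.2, 3.0, 0.1])))  # a probability-like output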

Convolutional Neural Networks (CNNs)

CNNs are a special form of DFNs, specifically designed to process array data such as one-dimensional time series and two-dimensional images. In ophthalmology, the most robust DL systems have been in image-centric applications, such as glaucoma detection based on Optical Coherence Tomography (OCT) scans, and have used CNNs as the learning algorithm [13].

The architecture of a typical CNN is structured as a series of subnetworks composed of three stacked layers: a convolutional layer, a non-linear activation layer and a pooling layer. The activation layer is the same as the activation in DFNs, while the convolution operation in convolutional layers and the pooling operation in pooling layers distinguish CNNs from DFNs. CNNs resemble the biological processing of the animal visual cortex [14], where simple cells respond to lines and complex cells respond to certain location-invariant patterns. The convolutional and pooling layers directly mimic the behavior of the simple and complex cells.
Fig. 5.3  Example of the convolution operation with a 3×3 kernel. Each element in the output array is the sum of the point-wise multiplication of a 3×3 convolution kernel and the corresponding 3×3 patch in the input array

Convolution

CNNs use discrete convolution, a special form of linear operation, to replace the usual matrix multiplication in DFNs. As shown in Fig. 5.3, convolution is a weighted average of neighboring values in the input matrix. The weights used to average a neighborhood are called the convolution kernel, and the output we get after convolution is called the feature map. There are two key ideas behind using convolution in neural networks: sparse connectivity and shared weights [15].

Traditional DFNs use matrix multiplication, in which connections are made between every output unit and every input unit, while for convolution in CNNs connections only exist between every output unit and its neighboring input units. Therefore, CNNs have the characteristic of sparse connectivity, obtained by setting the convolution kernel size far smaller than the input size. For example, we can use kernels with only hundreds of pixels to detect small but meaningful features such as edges when dealing with an image, which usually consists of thousands if not millions of pixels. Moreover, the same set of kernel weights is applied to all units across a feature map, referred to as shared weights. Local groups of values are often highly correlated in array data, and local statistics of array data are usually invariant to location. For example, a motif can appear in any part of an image. Hence, sharing weights enables detection of the same local pattern in different parts of an array. Thus, using convolution leads to less computation to produce outputs, less memory required for storing the model, and higher statistical efficiency. Despite the benefits of weight sharing and location invariance, in many cases, such as detecting certain kinds of tumor, location information really matters. Also, other techniques are needed to deal with some transformations, such as scaling and rotation, to which convolution is not invariant.
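The following is a minimal NumPy sketch of the valid (no-padding) discrete convolution of Fig. 5.3, written as the cross-correlation commonly used in CNN implementations; kernel flipping, stride and padding options are omitted for clarity.

    import numpy as np

    def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
        """Slide the kernel over the image; each output element is the
        point-wise product of the kernel and the patch it covers, summed."""
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for r in range(oh):
            for c in range(ow):
                out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
        return out

    image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 input array
    kernel = np.ones((3, 3)) / 9.0                    # 3x3 averaging kernel
    print(conv2d_valid(image, kernel))                # 2x2 feature map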
Pooling

The role of a pooling layer is to merge semantically similar features according to neighboring statistical characteristics. Pooling units take input from patches that are shifted by one or more rows or columns. For example, max pooling produces the maximum value in a square patch and represents the patch with that maximum value, as shown in Fig. 5.4. Hence, the pooling operation reduces the feature dimension and creates invariance to small shifts and distortions. Other commonly used pooling functions include average pooling, which represents a local patch with its mean value, and L2 pooling, which calculates the Euclidean norm of a patch.
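As a companion to the convolution sketch above, this is a minimal NumPy sketch of 2×2 max pooling with a stride of 2, matching the operation illustrated in Fig. 5.4; the input sizes are assumed to be divisible by the pool size.

    import numpy as np

    def max_pool_2x2(x: np.ndarray) -> np.ndarray:
        """Max pooling with a 2x2 window and stride 2: each output element is
        the maximum of the corresponding 2x2 region of the input."""
        h, w = x.shape
        # Reshape into 2x2 blocks, then take the maximum over each block.
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    x = np.array([[0, 2, 1, 5],
                  [8, 3, 9, 8],
                  [4, 4, 7, 10],
                  [6, 8, 1, 3]], dtype=float)
    print(max_pool_2x2(x))  # [[8., 9.], [8., 10.]]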
Fig. 5.4  Example of the max pooling operation with a stride of 2. Each element in the output array is the maximum value in the corresponding 2×2 region of the input array

Fig. 5.5  Structure of an RNN. A is a chunk of neural network which takes an input x at time step t and outputs a hidden state value h at time step t. The loop passes the information from one time step of the network to the next step. The output at the next time step is calculated based on the previous hidden state as well as the current input

Recurrent Neural Networks (RNNs)

Many tasks in ophthalmology involve sequence data. Recurrent neural networks are a class of neural networks designed for tasks involving sequential data. An RNN is recurrent in that it performs the same operation on the input at every time step and produces the current output based on the previous output. A hidden unit is maintained to store past history. In this way, RNNs map an input sequence to an output sequence. Similar to CNNs, which share weights across all local patches of an array, RNNs, once unfolded in time, share the same weights at every time step of the sequence, as shown in Fig. 5.5.

Although hidden units are designed to learn to store past information, the stored information is lossy and thus does not provide long-term dependencies. To overcome this drawback, a class of more complicated neural networks, long short-term memory (LSTM) networks, was proposed [16]. As shown in Fig. 5.6, LSTMs augment RNNs with an explicit memory controlled by three gates, an input gate, a forget gate and an output gate, which make it easier to remember the input for a long time.
Fig. 5.6  Structure of an LSTM. The LSTM is made of three gates: the input, forget and output gates. Each gate acts as a filter to control the information flow explicitly. Thus, an LSTM has longer-term dependency compared to an RNN, which has only one gate to filter the input information

In standard RNNs, the shared module has only a single neural layer, while that of LSTMs has four neural layers, within which the three gates interact in a special way. The first layer, called the "forget gate layer", looks at the previous hidden state and the current input and outputs a number between 0 and 1 for each neuron to decide what information is going to be thrown away from memory. The second layer, called the "input gate layer", generates the scalar for each neuron that decides what information is going to be remembered in memory. Then, in the third layer, the old memory, after forgetting, is combined with the new information that the input gate decided to remember. Finally, the output gate decides what is going to be output based on the updated memory. Theoretical and empirical evidence shows that LSTMs have longer-term dependencies compared to standard RNNs [16]. Therefore, they are well suited for modeling longitudinal disease progression and changes.
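A minimal NumPy sketch of one vanilla RNN step, h_t = tanh(Wx x_t + Wh h_(t-1) + b), unrolled over a short sequence; the sizes and weights are illustrative, and the LSTM gating described above is omitted for brevity.

    import numpy as np

    rng = np.random.default_rng(0)
    input_size, hidden_size = 3, 5
    Wx = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
    Wh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights
    b = np.zeros(hidden_size)

    def rnn_step(x_t, h_prev):
        # The same weights are shared at every time step of the sequence.
        return np.tanh(Wx @ x_t + Wh @ h_prev + b)

    h = np.zeros(hidden_size)                    # initial hidden state
    sequence = rng.normal(size=(4, input_size))  # toy sequence of 4 time steps
    for x_t in sequence:
        h = rnn_step(x_t, h)                     # hidden state carries past information
    print(h)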
Fig. 5.7  Structure of a GAN. A GAN is composed of a generator and a discriminator. The generator is trained to generate plausible samples from noise in order to fool the discriminator, while the discriminator is trained to discriminate between the fake and true samples

Generative Adversarial Networks (GANs)

GANs, which have the capacity to generate data without explicitly modelling the underlying distribution, are one of the most interesting recent innovations in DL [17]. GANs are a special form of neural networks in which two networks, a generator and a discriminator, are trained alternately. The generator is trained to produce realistic data samples from some random noise, while the discriminator is trained to discriminate between the fake samples generated by the generator and the real samples. Figure 5.7 shows the structure of GANs. CNNs and RNNs can easily be incorporated into GANs to deal with array data and sequence data.
There are two potential applications of GANs in ophthalmology. The first focuses on the generator, which is able to extract the underlying structure of the data and learn to generate new data samples. Plausible OCT scans generated from random noise [18], as well as denoised scans generated from real scans [19], can be achieved by the generator of a GAN. The second focuses on the discriminator, using it as a learned prior to detect abnormal samples. For example, Zhou et al. trained a GAN with only healthy data and used the trained discriminator to detect anomalies in OCT [20].
criminator to detect anomalies from OCT [20]. 2017;318(22):2211–23.
10. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D,
Narayanaswamy A, et al. Development and validation
of a deep learning algorithm for detection of diabetic
Conclusions retinopathy in retinal fundus photographs. JAMA.
2016;316(22):2402–10.
11. Abràmoff MD, Lou Y, Erginay A, Clarida W, Amelon
DL is an emerging set of tools that provides vari- R, Folk JC, Niemeijer M. Improved automated detec-
ous potential solutions for applications in oph- tion of diabetic retinopathy on a publicly available
thalmology. The range of applications is very dataset through integration of deep learning. Invest
wide, from diagnosis and strategizing the treat- Ophthalmol Vis Sci. 2016;57(13):5200–6.
12. Gargeya R, Leng T. Automated identification of dia-
ment, to understanding pathogenesis and fore- betic retinopathy using deep learning. Ophthalmology.
casting disease prognosis. Understanding various 2017;124(7):962–9.
DL techniques will help clinicians/researchers in 13. Maetschke S, Antony B, Ishikawa H, Wollstein G,
leveraging the potential of DL applications in Schuman J, Garnavi R. A feature agnostic approach
for glaucoma detection in OCT volumes. PLoS One.
ophthalmology leading to improve the quality 2019;14(7):e0219126.
and delivery of ophthalmic care. 14. Hubel DH, Wiesel TN.  Receptive fields, binocular
interaction, and functional architecture in the cat’s
visual cortex. J Physiol. 1962;160:106–54.
15. Goodfellow I, Bengio Y, Courville A. Deep learning.
References MIT Press; 2016.
16. Sepp Hochreiter, Jürgen Schmidhuber; Long Short-
1. McCulloch WS, Pitts W.  A logical calculus of the Term Memory. Neural Comput 1997; 9 (8): 1735–
ideas immanent in nervous activity. Bull Mathematical 1780. https://doi.org/10.1162/neco.1997.9.8.1735.
Biophys. 1943;5(4):115–33. 17. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi

2. Lecun Y.  Une procedure d’apprentissage pour Mirza, Bing Xu, David Warde-Farley, Sherjil
reseau a seuil asymmetrique (A learning scheme for Ozair, Aaron Courville, and Yoshua Bengio. 2014.
asymmetric threshold networks). In: Proceedings Generative adversarial nets. In Proceedings of the
of Cognitiva 85, Paris, France. 1985. p. 599–604. 27th International Conference on Neural Information
3. LeCun Y, Boser B, Denker JS, Henderson D, Howard Processing Systems - Volume 2 (NIPS’14). MIT
RE, Hubbard W, Jackel LD. Backpropagation applied Press, Cambridge, MA, USA, 2672–2680.
to handwritten zip code recognition. Neural Comput. 18. Zheng C, Xie X, Zhou K, Chen B, Chen J, Ye H,
1989;1(4):541–51. et al. Assessment of generative adversarial networks
4. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei model for synthetic optical coherence tomography
L.  Imagenet: a large-scale hierarchical image data- images of retinal disorders. Transl Vis Sci Technol.
base. In: 2009 IEEE conference on computer vision 2020;9(2):29.
and pattern recognition. IEEE; 2009. p. 248–55. 19. Halupka KJ, Antony BJ, Lee MH, Lucy KA, Rai RS,
5. Krizhevsky A, Sutskever I, Hinton GE. Imagenet clas- Ishikawa H, et al. Retinal optical coherence tomogra-
sification with deep convolutional neural networks. phy image enhancement via deep learning. Biomed
In: Advances in neural information processing sys- Optics Express. 2018;9(12):6205–21.
tems. 2012. p. 1097–105. 20. Zhou K, Gao S, Cheng J, Gu Z, Fu H, Tu Z, ... Liu
6. CireşAn D, Meier U, Masci J, Schmidhuber J. Multi-­ J.  Sparse-GAN: sparsity-constrained generative
column deep neural network for traffic sign classifica- adversarial network for anomaly detection in reti-
tion. Neural Netw. 2012;32:333–8. nal OCT image. In: 2020 IEEE 17th International
7. Ruamviboonsuk P, et al. Deep learning versus human Symposium on Biomedical Imaging (ISBI). IEEE;
graders for classifying diabetic retinopathy severity 2020. p. 1227–31.
Selected Image Analysis Methods for Ophthalmology

6

Tomasz Krzywicki

T. Krzywicki (*)
Faculty of Mathematics and Computer Science, University of Warmia and Mazury, Olsztyn, Poland
e-mail: tomasz.krzywicki@matman.uwm.edu.pl

Introduction

The analysis of the retinal image helps in the diagnosis of many diseases, not only ophthalmic ones, but also systemic chronic diseases. Examples of such diseases are hypertension and diabetes, which affect small vessels and microcirculation and which can be diagnosed non-invasively by analyzing the image of the retina [1, 2].

Fundus images can be obtained with an ophthalmoscope. The first attempts to analyze fundus images involved detection of vessels with fluorescein in an analog image [3]. This method employed a fluorescent agent to improve the appearance of vessels in the image, which made them easier to detect by a medical professional or a computer. Unfortunately, this procedure was invasive and time-consuming.

Digital image analysis is a markedly less time-consuming and non-invasive method, and it is therefore widely used to analyze retinal fundus images during screening and diagnosis. Among the popular methods of fundus image analysis there are algorithms for preprocessing images (for example, feature detection or image registration) and for creating decision models, such as neural networks. These methods will be discussed in this chapter along with their applications. A general diagram of the image analysis process can be seen in Fig. 6.1.

Recognition of ophthalmic and systemic chronic diseases in the retinal image can rely on the detection of anatomical features and the measurement of their properties, such as lesion and vessel diameters. Usually, achieving these tasks requires a preprocessing stage tuned according to the measured features. The preprocessing stage typically involves normalization of intensities after image formation, as well as enhancement of the image through contrast enhancement.

One of the important steps in the analysis of the retinal image is its registration, which consists in building a transform that can be used in many ways, for example to superimpose images. It also enables much wider areas of the retina to be obtained. Registration of images from different examinations may enable assessment of treatment through the speed of symptom reversal. Among the popular methods of image registration are the ones that rely on keypoints detected in the combined images.

Images after the preprocessing and registration steps can be used in the inference process: classification, regression or clustering. The most commonly used tools for this purpose are Convolutional Neural Networks (CNNs), which allow all these operations to be performed [4–6]. One popular use of CNNs is to classify fundus retinal images according to the severity of Diabetic Retinopathy (DR).

Fig. 6.1  General diagram of the image analysis process: a retinal fundus image from a digital device passes through the image preprocessing stage (intensity normalization, contrast enhancement), retinal image registration, the inference stage (classification, regression, clustering), and the prediction stage (decision class, forecast, group index)

The structure of this chapter is as follows. First, we introduce the basics and structure of digital images. Then, the preprocessing stage of images is described. Next, we discuss the image registration stage, which prepares the retinal images for analysis in the inference stage. The last part of the chapter presents artificial neural networks with convolutional layers as a tool for the classification of retinal images, including the detection of diseases.

Digital Images

Images are representations of real-world objects, which result from their influence on a sensitive surface, for example the photographic film of a camera or the CCD matrix of a digital camera. In this section we focus on two-dimensional (2D) images. In formal terms, the image is a function f(r, c) represented as a matrix whose elements are signal values at specific pairs of coordinates r, c (corresponding to pixels). Typically the signal is intensity, but it may also be, for example, temperature. Digital images, unlike analog images, take finite and discrete values for a given pair of coordinates. Each digital image is composed of a finite number of pixels, each of which has a specific location in space and an intensity value. The function f(r, c) can be defined as follows:

f(r,c) = \begin{bmatrix} f(1,1) & f(1,2) & \cdots & f(1,n) \\ f(2,1) & f(2,2) & \cdots & f(2,n) \\ \vdots & \vdots & \ddots & \vdots \\ f(m,1) & f(m,2) & \cdots & f(m,n) \end{bmatrix} \quad (6.1)
where:

• m is the height of the image: r = 1, . . . , m
• n is the width of the image: c = 1, . . . , n
• v_{r,c} = f(r, c) is the pixel value for the coordinates r, c

In practice, we process color images. Such images are represented by a number of matrices corresponding to the individual components of the adopted color space. The color space is a mathematical system for describing color in a specific coordinate system. Color images consist of multiple layers of the color space; each component layer is saved in a separate matrix, as in the case of a monochrome image (formula 6.1).

RGB is the most popular representation of color images, not only in the case of fundus images; it consists of three layers representing specific colors: R—red, G—green, B—blue. This means that the color of each pixel is created by mixing the three component colors. A visualization of the RGB color space can be seen in Fig. 6.2. The RGB color space can be represented in the form of a cube in which the axes correspond to the component colors. Each point (voxel) in such a cube corresponds to a certain color.

Fig. 6.2  RGB color space

The most commonly used method for representing RGB images is the 24-bit representation, which means that 8 bits are used to encode each component color (range from 0 to 255). In the RGB color space, a pixel with all component colors equal to 0 will be black, and a pixel with all component colors equal to 255 will be white.

Another popular color space is HSV (H—Hue, S—Saturation, V—Value or intensity), where each pixel is characterized by these three component values. A visualization of the HSV color space can be seen in Fig. 6.3. The HSV color space can be represented as a cone with a base angle H, with values from 0° to 360°, a base radius S, with values from 0 to 100, and a height V, with values from 0 to 100. Each point in the cone corresponds to a specific color. The HSV color model may be used for segmentation of color medical images, which are transformed from the RGB to the HSV color space [7]. Suzuki used the HSV color space for differentiation of small hemorrhages, hard exudates, and photocoagulation marks from dust artifacts based on fundus retinal images [8]. Semary used the HSV color space for pseudo-coloring monochrome medical images to improve discrimination of the region of interest from the background parts [9].

Fig. 6.3  HSV color space
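As a small illustration of moving between these color spaces, the Python standard library's colorsys module converts a single RGB triple (scaled to the 0–1 range) to HSV; note that it reports H, S and V as fractions of 1 rather than as degrees or percentages, so the pixel values below are rescaled afterwards.

    import colorsys

    # A reddish pixel typical of a fundus image, in 0-255 RGB.
    r, g, b = 180, 60, 40
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)

    # Rescale to the conventions used above: H in degrees, S and V in percent.
    print(f"H = {h * 360:.1f} deg, S = {s * 100:.1f}, V = {v * 100:.1f}")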
Intensity Normalization

In general, intensity refers to the individual components of a color space. Intensity normalization is a process used to compensate for artifacts due to uneven illumination of the retinal tissue by the imaging modality. Global linear intensity normalization is the simplest normalization operation; it is applied to each layer (component of the color model) and can be expressed as follows:

g(r,c) = \frac{\big(f(r,c) - \min\big)\,(newMax - newMin)}{\max - \min} + newMin \quad (6.2)

where:

• f is the original image
• g is the new image created by the intensity normalization operation
• min and max are the minimum and maximum pixel values in the considered layer of the original image
• newMin and newMax are the new minimum and maximum pixel values in the considered layer of the new image
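A minimal NumPy sketch of formula 6.2 applied to one layer; the guard against a constant layer (max = min) is an added assumption to avoid division by zero and is not part of formula 6.2 itself.

    import numpy as np

    def normalize_intensity(layer: np.ndarray,
                            new_min: float = 0.0,
                            new_max: float = 255.0) -> np.ndarray:
        """Global linear intensity normalization of one color-space layer (formula 6.2)."""
        lo, hi = float(layer.min()), float(layer.max())
        if hi == lo:  # constant layer: nothing to stretch
            return np.full_like(layer, new_min, dtype=float)
        return (layer - lo) * (new_max - new_min) / (hi - lo) + new_min

    layer = np.array([[50, 80], [120, 200]], dtype=float)
    print(normalize_intensity(layer))  # stretched to the full 0-255 range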
In addition to global intensity normalization methods, there are also local methods that take into account the characteristics of the neighbourhood. Salem et al. [10] present several methods for preprocessing retinal fundus images. Generic local approaches are among the popular image intensity normalization methods for these images. Dedicated methods have also been proposed; one of them employs color constancy [11], which detects vessels in retinal images by segmenting retinal vasculatures. Other dedicated methods estimate the illumination model and correct the intensity according to the expected luminance [12–15]. This model is a mask image in which the pixel values are estimates of the reflectance of the tissue. Since the illumination source is usually unknown, it is obtained by assuming that the local illumination variation is lower than the global variation for the entire image.

Contrast Enhancement

Contrast enhancement is one of the typical steps in the image preprocessing stage, and it is employed in multiple domains of image analysis, including ophthalmology. Its purpose is to improve the details and fidelity of the image by emphasizing its structure. Generic methods are among the frequently used approaches; however, the most commonly used are approaches targeted at the subsequent analysis stages, which facilitate the subsequent registration of images and the inference on their basis.

The sharpening filter is one of the simplest generic methods to improve the contrast of an image. It creates a new image that is less blurred than the original, so more details are visible. However, although the resulting image is crisper than the original, it may also contain some extra noise, resulting in additional distortion.

Contrast enhancement is applied iteratively to each layer (component of the color model). The operation can be done by applying a convolution filter to the original image, which can be expressed by the following formula:

g(r,c) = (w \ast f)(r,c) = \sum_{dr=-a}^{a} \sum_{dc=-b}^{b} w(dr,dc)\, f(r+dr,\, c+dc) \quad (6.3)

where:

• f is the original image
• w is the convolution filter (kernel) of size 2a + 1, 2b + 1
• g is the new image created in the processing operation

A sample 3 × 3 sharpening filter (where a = 1 and b = 1) to be used in the convolution operation can be represented as:

w = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix} \quad (6.4)
ing that the local illumination variation is lower
than the global variation for the entire image.
Fig. 6.4  Image sharpening filter operation. (a) Original image; (b) Sharpened image

The effect of the filter is shown in Fig. 6.4. More advanced methods that may be applied to retinal images include generic approaches operating on the local contrast [16, 17]. However, these methods can introduce noise into images. Single-scale [18] and multi-scale [19] linear filters have also been considered, but unfortunately they filter out relevant image details. Other clinical image processing methods are described in [20] in the context of developing a system to identify retinal disease in retinal fundus images.

Retinal Image Registration

The image registration process requires test and reference images to obtain an estimate of the aligning transformation. This transformation warps the test image so that characteristic retinal points in the test image occur at the same locations as in the reference image. The fundus retinal image registration process may be difficult to complete due to optical differences across modalities or devices, optical distortions, anatomical changes due to lesions or disease, and viewpoint differences due to projective distortion of the curved surface of the eye.

The applications of the retinal image registration process depend on when the test and reference images are taken. Fundus retinal images obtained during the same examination can be combined to obtain a higher-resolution image [21, 22], as they are devoid of anatomical changes. This allows for more accurate measurements of the eye parameters; however, the overlap of the fundus retinal images must be significant. Fundus retinal images with little overlap are used to create mosaics that expand the sensor's field of view [23, 24]—an example is given in Fig. 6.5.

Longitudinal studies of the retina [25, 26] can be done by the registration of images acquired at different examinations. Accurate registration of fundus retinal images of the same region can be useful in detecting small but significant changes (i.e., small in size in the image, but relevant to the patient's condition) such as local hemorrhages or differences in vasculature width.

The first retinal image registration methods were global and consisted in matching global similarities of the entire test and reference images, transformed into an appropriate representation: spatial [27] or frequency [28], assuming that the intensities in both images are consistent. However, when using global methods, problems may arise due to uneven illumination and anatomical changes of the eye as captured by the test and reference images.
Fig. 6.5  Registration of retinal images into a mosaic. (a) First original image; (b) Second original image; (c) Registration result

Popular retinal image registration methods include local methods that rely on well-localized features or keypoints [22, 24]. A keypoint in the image can be understood as a characteristic place, for example a corner—examples can be seen in Fig. 6.6. For more efficient processing, keypoints are represented as numerical vectors called descriptors. The SIFT [29] and SURF algorithms are popular generic methods of searching for keypoints in an image [30, 31]. They can be used to detect vessels, bifurcations and crossovers—the detected keypoints may be included as features of the retina [24]. A good alternative to the patented SURF and SIFT general methods can be the Oriented FAST and Rotated BRIEF (ORB) [32] algorithm.

Local methods in image registration are used more frequently than global methods. They are especially useful for images with limited overlap, due to the increased specificity that point matches provide. Local methods are also much better suited to the registration of images with anatomical changes, due to their robustness to differences between images.
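A minimal OpenCV sketch of keypoint-based registration with ORB: detect keypoints and descriptors in both images, match them, estimate a homography with RANSAC, and warp the test image onto the reference image. The file names are placeholders, and real fundus pipelines typically add masking of the circular field of view and more careful match filtering.

    import cv2
    import numpy as np

    ref = cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE)  # placeholder inputs
    test = cv2.imread("test.png", cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=2000)
    kp_ref, des_ref = orb.detectAndCompute(ref, None)
    kp_test, des_test = orb.detectAndCompute(test, None)

    # ORB descriptors are binary, so match them with Hamming distance.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_test, des_ref), key=lambda m: m.distance)

    src = np.float32([kp_test[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_ref[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC rejects mismatched keypoints while estimating the transform.
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    warped = cv2.warpPerspective(test, H, (ref.shape[1], ref.shape[0]))
    cv2.imwrite("registered.png", warped)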
Among the methods of image registration, it is worth mentioning multimodal ophthalmic image registration, which is a process of integrating information stored in two or more images captured using multiple imaging modalities. It is challenging because geometric deformations are an inseparable part of multimodal ophthalmic imaging. These include deformations resulting from heterogeneity in the optical specifications of the imaging devices and from patient-dependent factors. In [33], the authors proposed a method using Laplacian features, a Hessian affine feature space and phase correlation to register blue autofluorescence, near-infrared reflectance and color fundus photographs of the ocular posterior pole.

Fig. 6.6  Keypoints of the image marked in green

Inference and Prediction Using CNNs

This section presents the last two stages of image processing—inference and prediction—focused on classification. Classification is often used in assessing a patient's condition, when a patient is assigned to a diagnostic category or a risk class. Classification is also used in retinal image quality assessment, to indicate whether a given image can be reliably used to establish a diagnosis.

A CNN is one of the Deep Learning (DL) models. More specifically, it is a kind of feedforward Artificial Neural Network (ANN) based on a stacked structure of specialized layers of neurons, with each layer specialized in recognizing specific patterns in images. Figure 6.7 shows a general diagram of a CNN.

A CNN usually consists of more than one convolutional layer. These layers are then augmented by densely connected classical ANN layers, as in the case of a multi-layer neural network. The CNN schema (construction of features and their use) has been designed to best exploit the structure of the two-dimensional layers of the input images. Convolutional layers of a CNN also have fewer parameters than dense layers, which may reduce the risk of overtraining the model.

Fig. 6.7  Diagram of a CNN: the input image passes through a convolutional layer producing feature maps and a max-pooling layer, followed by flattening (feature extraction) and dense layers that produce the output prediction (classification)
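A minimal PyTorch sketch of the architecture in Fig. 6.7—convolution, max pooling, flattening and dense layers—for a five-class, DR-grading-style classifier. PyTorch, the 224×224 input size and all layer sizes are illustrative assumptions, and training code is omitted.

    import torch
    import torch.nn as nn

    class SmallFundusCNN(nn.Module):
        def __init__(self, num_classes: int = 5):  # e.g., five DR severity stages
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),  # RGB in, 16 feature maps out
                nn.ReLU(),
                nn.MaxPool2d(2),                             # halve spatial resolution
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),                                # flattening step of Fig. 6.7
                nn.Linear(32 * 56 * 56, 128),                # assumes 224x224 input images
                nn.ReLU(),
                nn.Linear(128, num_classes),                 # prediction (decision class)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x))

    model = SmallFundusCNN()
    logits = model(torch.randn(1, 3, 224, 224))  # one dummy fundus image
    print(logits.shape)  # torch.Size([1, 5])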


84 T. Krzywicki

Retinal Image Quality Assessment

The quality of an image is an important issue affecting image analysis, registration and segmentation. The reliability of these methods depends largely on the quality of the images on which they operate. Ophthalmic images may lose quality due to the behavior of the device operator at the time of image capture, resulting in differences in camera exposure and focal plane errors.

Zago et al. [34] present an approach to retinal image quality assessment at the moment of acquisition, aiming at assisting health care professionals during fundus examination. They suggest using a CNN pretrained on non-medical images for extracting general image features. The efficiency of the proposed method was tested on two publicly available databases (i.e., DRIMDB and ELSA-Brasil) and the best decision model obtained an accuracy of 98.6%, a sensitivity of 97.1% and a specificity of 100%.

Mahapatra et al. [35] propose a method for retinal fundus image assessment that is inspired by the human visual system. A saliency map is used to identify differences of particular regions from their neighbors with respect to image features. The method obtained a sensitivity of 98.2% and a specificity of 97.8%.
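The pretrained-feature idea in [34] can be sketched as a transfer-learning classifier: a backbone trained on non-medical (ImageNet) images is frozen and only a small quality-assessment head is trained. The backbone choice and head below are assumptions for illustration; [34] does not prescribe this exact architecture:

```python
# Sketch of reusing a CNN pretrained on non-medical images as a
# fixed feature extractor for quality assessment (details assumed).
import tensorflow as tf
from tensorflow.keras import layers

backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
backbone.trainable = False  # keep the general-purpose features frozen

model = tf.keras.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # good vs. poor quality
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```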
Diagnostic Support and Prediction

DR is a serious and prevalent disease, and therefore a lot of research has been done on supporting its automatic diagnosis based on fundus retinal images. Many studies used a CNN as a classifier, and the images were preprocessed using the image preprocessing techniques described in this chapter; selected relevant works are summarized below.

Sahlsten et al. [36] propose a system for diagnosing DR, where binary (non-referable/referable) and multi-class (five DR stages) classification are considered. The decision model in the binary classification task obtained an accuracy of 94%, a sensitivity of 89.6% and a specificity of 97.4%. In the multiclass classification task the decision model obtained an accuracy of 86.9%. It can be seen that with the multi-class classification of the patient's condition, the accuracy decreased compared to the binary classification.

Arcadu et al. [37] propose an approach to predict progression of DR. More specifically, the study predicts future DR in terms of a two-step worsening on the Early Treatment DR Severity Scale. The authors estimate the severity of DR assessed at 6, 12, and 24 months. The decision models for the considered time periods obtained an area under the curve (AUC) of 0.68 (sensitivity, 66%; specificity, 77%), 0.79 (sensitivity, 91%; specificity, 65%), and 0.77 (sensitivity, 79%; specificity, 72%), respectively.

Lam et al. [38] propose an approach to DR grading on fundus retinal images cropped using Otsu's method [39] to isolate the circular image of the retina. The study used popular CNN architectures such as GoogLeNet and AlexNet. The most accurate results in the binary classification task, for both sensitivity and specificity, were achieved with the GoogLeNet architecture, with an accuracy of 97%, a sensitivity of 95% and a specificity of 96%.
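The Otsu-based cropping step can be sketched with standard OpenCV calls: threshold the grayscale fundus photo so the circular retina separates from the black background, then crop to the bounding box of the foreground. The helper name and the bounding-box logic are illustrative assumptions; the thresholding call is the standard Otsu variant from [39]:

```python
# Sketch of Otsu-based retina cropping as described for [38, 39].
import cv2
import numpy as np

def crop_retina(image_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Otsu picks the threshold that best separates the two intensity classes.
    _, mask = cv2.threshold(gray, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:   # no foreground found; return the image unchanged
        return image_bgr
    return image_bgr[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```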
Gulshan et al. [40] present a method to automatically detect DR and diabetic macular oedema in retinal fundus images. The method was also evaluated on identifying gradable (high quality) and ungradable (poor quality) images. It obtained a sensitivity of 90.3% and a specificity of 98.1% in detecting DR, and a sensitivity of 87.0% and a specificity of 98.5% in detecting diabetic macular oedema.

Conclusions

This chapter presents the process of retinal fundus image analysis, including pre-processing, registration, inference and prediction. Pre-processing focuses on intensity normalization and contrast enhancement methods. The methods of inference are presented using the example of classification, where decision classes correspond to image quality or diagnostic categories.
The preprocessing and image registration steps are required in a broad variety of use cases. The simplest one is image inspection by a medical professional, where preprocessing and registration are required to improve the fidelity of the acquired image while clarifying or accenting its structure and anatomical features. Retinal image registration is also able to help monitor and evaluate the effectiveness of treatment by combining multiple retinal images into improved or larger images, or by comparing these images, which is essential for monitoring a disease and assessing its treatment efficacy.

Preprocessing and registration steps are also utilized as initial steps in automatic diagnosis. In practice, the most popular inference method based on images employs CNNs, due to their very good results.
14. Narasimha-Iyer H, Can A, Roysam B, Stewart V,

References

1. Grosso A. Hypertensive retinopathy revisited: some answers, more questions. Br J Ophthalmol. 2005;89:1646–54. https://doi.org/10.1136/bjo.2005.072546.
2. Danis RP, Davis MD. Proliferative diabetic retinopathy. In: Diabetic retinopathy. Totowa, NJ: Humana Press; 2008. p. 29–65.
3. Matsui M, Tashiro T, Matsumoto K, Yamamoto S. A study on automatic and quantitative diagnosis of fundus photographs. I. Detection of contour line of retinal blood vessel images on color fundus photographs. Nippon Ganka Gakkai Zasshi. 1973;77(8):907–18.
4. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JAWM, van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88. https://doi.org/10.1016/j.media.2017.07.005.
5. Lathuilière S, Mesejo P, Alameda-Pineda X, Horaud R. A comprehensive analysis of deep regression. IEEE Trans Pattern Anal Mach Intell. 2020;42(9):2065–81. https://doi.org/10.1109/TPAMI.2019.2910523.
6. Min E, Guo X, Liu Q, Zhang G, Cui J, Long J. A survey of clustering with deep learning: from the perspective of network architecture. IEEE Access. 2018;6:39501–14. https://doi.org/10.1109/ACCESS.2018.2855437.
7. Stanescu L, Burdescu DD, Stoica C. Color image segmentation applied to medical domain. In: Yin H, Tino P, Corchado E, Byrne W, Yao X, editors. Intelligent Data Engineering and Automated Learning – IDEAL 2007. Lecture Notes in Computer Science, vol. 4881. Berlin: Springer; 2007. https://doi.org/10.1007/978-3-540-77226-2_47.
8. Suzuki N. Distinction between manifestations of diabetic retinopathy and dust artifacts using three-dimensional HSV color space (Version 10005546). 2016. https://doi.org/10.5281/zenodo.1126874.
9. Semary N. A proposed HSV-based pseudo coloring scheme for enhancing medical image. 2018:81–92. https://doi.org/10.5121/csit.2018.80407.
10. Salem NM, Nandi AK. Novel and adaptive contribution of the red channel in pre-processing of colour fundus images. J Franklin Inst. 2007;344(3–4):243–56. https://doi.org/10.1016/j.jfranklin.2006.09.001.
11. Zhao Y, Liu Y, Wu X, Harding S, Zheng Y. Retinal vessel segmentation: an efficient graph cut approach with Retinex and local phase. PLoS One. 2015;10(4):1–22.
12. Kolar R, Odstrcilik J, Jan J, Harabis V. Illumination correction and contrast equalization in colour fundus images. European Signal Processing Conference; 2011. p. 298–302.
13. Foracchia M, Grisan E, Ruggeri A. Luminosity and contrast normalization in retinal images. Med Image Anal. 2005;9(3):179–90.
14. Narasimha-Iyer H, Can A, Roysam B, Stewart V, Tanenbaum HL, Majerovics A, Singh H. Robust detection and classification of longitudinal changes in color retinal fundus images for monitoring diabetic retinopathy. IEEE Trans Biomed Eng. 2006;53(6):1084–98.
15. Grisan E, Giani A, Ceseracciu E, Ruggeri A. Model-based illumination correction in retinal images. IEEE International Symposium on Biomedical Imaging: Nano to Macro; 2006. p. 984–7.
16. Walter T, Massin P, Erginay A, et al. Automatic detection of microaneurysms in color fundus images. Med Image Anal. 2007;11:555–66. https://doi.org/10.1016/j.media.2007.05.001.
17. Fleming A, Philip S, Goatman K, Olson J, Sharp P. Automated microaneurysm detection using local contrast normalization and local vessel detection. IEEE Trans Med Imaging. 2006;25(9):1223–32.
18. Qidwai U, Qidwai U. Blind deconvolution for retinal image enhancement. IEEE EMBS Conference on Biomedical Engineering and Sciences; 2010. p. 20–25.
19. Sivaswamy J, Agarwal A, Chawla M, Rani A, Das T. Extraction of capillary non-perfusion from fundus fluorescein angiogram. In: Fred A, Filipe J, Gamboa H, editors. Biomedical engineering systems and technologies. Berlin: Springer; 2009. p. 176–88.
20. Rajan K, Sreejith C. Retinal image processing and classification using convolutional neural networks. In: Pandian D, Fernando X, Baig Z, Shi F, editors. Proceedings of the International Conference on ISMAC in Computational Vision and Bio-Engineering 2018 (ISMAC-CVB). Lecture Notes in Computational Vision and Biomechanics, vol. 30. Cham: Springer; 2019. https://doi.org/10.1007/978-3-030-00665-5_120.
21. Meitav N, Ribak EN. Improving retinal image resolution with iterative weighted shift-and-add. J Opt Soc Am A. 2011;28(7):1395–402. https://doi.org/10.1364/JOSAA.28.001395.

22. Hernandez-Matas C, Zabulis X. Super resolution for fundoscopy based on 3D image registration. 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society; 2014. p. 6332–8. https://doi.org/10.1109/EMBC.2014.6945077.
23. Can A, Stewart CV, Roysam B, Tanenbaum HL. A feature-based technique for joint, linear estimation of high-order image-to-mosaic transformations: mosaicing the curved human retina. IEEE Trans Pattern Anal Mach Intell. 2002;24(3):412–9. https://doi.org/10.1109/34.990145.
24. Ryan N, Heneghan C, de Chazal P. Registration of digital retinal images using landmark correspondence by expectation maximization. Image Vis Comput. 2004;22(11):883–98. https://doi.org/10.1016/j.imavis.2004.04.004.
25. Narasimha-Iyer H, Can A, Roysam B, Tanenbaum HL, Majerovics A. Integrated analysis of vascular and nonvascular changes from color retinal fundus image sequences. IEEE Trans Biomed Eng. 2007;54(8):1436–45. https://doi.org/10.1109/TBME.2007.900807.
26. Troglio G, Benediktsson JA, Moser G, Serpico SB, Stefansson E. Unsupervised change detection in multitemporal images of the human retina. In: Multi modality state-of-the-art medical image segmentation and registration methodologies, vol. 1. Boston, MA: Springer US; 2011. p. 309–37.
27. Reel PS, Dooley LS, Wong KCP, Börner A. Robust retinal image registration using expectation maximisation with mutual information. IEEE International Conference on Acoustics, Speech and Signal Processing; 2013. p. 1118–22. https://doi.org/10.1109/ICASSP.2013.6637824.
28. Cideciyan AV, Jacobson SG, Kemp CM, Knighton RW, Nagel JH. Registration of high resolution images of the retina. SPIE Med Imaging. 1992;1652:310–22. https://doi.org/10.1117/12.59439.
29. Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vis. 2004;60(2):91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94.
30. Lin Y, Medioni G. Retinal image registration from 2D to 3D. IEEE Conference on Computer Vision and Pattern Recognition; 2008. p. 1–8. https://doi.org/10.1109/CVPR.2008.4587705.
31. Hernandez-Matas C, Zabulis X, Argyros AA. Retinal image registration through simultaneous camera pose and eye shape estimation. 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2016. p. 3247–51. https://doi.org/10.1109/EMBC.2016.7591421.
32. Rublee E, Rabaud V, Konolige K, Bradski G. ORB: an efficient alternative to SIFT or SURF. Proceedings of the IEEE International Conference on Computer Vision; 2011. p. 2564–71. https://doi.org/10.1109/ICCV.2011.6126544.
33. Suthaharan S, Rossi EA, Snyder V, Chhablani J, Lejoyeux R, Sahel J-A, Dansingani K. Laplacian feature detection and feature alignment for multi-modal ophthalmic image registration using phase correlation and Hessian affine feature space. Sig Process. 2020. https://doi.org/10.1016/j.sigpro.2020.107733.
34. Zago GT, Andreão RV, Dorizzi B, Salles EOT. Retinal image quality assessment using deep learning. Comp Biol Med. 2018;103:64–70. https://doi.org/10.1016/j.compbiomed.2018.10.004.
35. Mahapatra D, Roy PK, Sedai S, Garnavi R. Retinal image quality classification using saliency maps and CNNs. In: International Workshop on Machine Learning in Medical Imaging. Springer; 2016. p. 172–9.
36. Sahlsten J, Jaskari J, Kivinen J, Turunen L, Jaanio E, Hietala K, Kaski K. Deep learning fundus image analysis for diabetic retinopathy and macular edema grading. 2019. arXiv preprint arXiv:1904.08764.
37. Arcadu F, Benmansour F, Maunz A, Willis J, Haskova Z, Prunotto M. Deep learning algorithm predicts diabetic retinopathy progression in individual patients. NPJ Digit Med. 2019;2(1):1–9.
38. Lam C, Yi D, Guo M, Lindsey T. Automated detection of diabetic retinopathy using deep learning. AMIA Summits Transl Sci Proc. 2018;2018:147.
39. Otsu N. A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybernetics. 1979;9(1):62–6. https://doi.org/10.1109/TSMC.1979.4310076.
40. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
7  Experimental Artificial Intelligence Systems in Ophthalmology: An Overview

Joelle A. Hallak, Kathleen Emily Romond, and Dimitri T. Azar

J. A. Hallak · K. E. Romond
Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL, USA
e-mail: joelle@uic.edu; kromon2@uic.edu

D. T. Azar (*)
Department of Ophthalmology and Visual Sciences, University of Illinois at Chicago, Chicago, IL, USA
Twenty/Twenty Therapeutics, San Francisco, CA, USA
e-mail: dazar@uic.edu

Introduction

Artificial intelligence (AI) research and applications in ophthalmology have been expanding in recent years [1, 2]. With advanced computing, we are able to derive novel insights from multisource data. The majority of AI research includes utilizing deep learning techniques to develop automated methods for disease classification and to help clinicians with identifying progression early. Elsewhere in this book, the various forms of clinically relevant AI techniques have been described. Neural networks consist of digitized inputs, text or images, which proceed through connected layers that progressively detect features, with an ending predictive output (Fig. 7.1). Deep learning (DL) consists of a large-scale network with multiple hidden layers (Fig. 7.2) [3]. Researchers are encouraged by performance results from applying convolutional neural nets (CNNs), which analyze pixel-level information for classifying diseases and their severity [4–6]. Natural language processing techniques are also being used to harness data from electronic health records. It is expected that AI will assist ophthalmologists in diagnosing cases at a quicker pace, help with cases that are difficult to diagnose, and predict progression for early interventions.

AI tasks in ophthalmology include interpretation of imaging techniques (visual fields [7], optical coherence tomography [8], and fundus photos [9]), molecular and genomic data [10], and data management from electronic health records (IRIS Registry) [11]. A primary benefit is in screening, such as for diabetic retinopathy and retinopathy of prematurity. Applications have not yet been translated to therapeutic interventions in ophthalmic clinics. One of the challenges in ophthalmology is that diagnosis relies on information from multiple imaging and clinical data sources. Glaucoma specialists, for example, rely on visual fields for functional changes, fundus photos and OCT for structural changes, and intraocular pressure measurements for clinical changes. Multi-view representation learning is required for ophthalmic diseases that combine information from fundus, optical coherence tomography (OCT), and text/patient metadata. Additionally, validation on real-world data from multiple clinical settings is a prerequisite prior to translating applications to patient care.

Fig. 7.1  Simple neural network and its components (input layer, hidden layer and output layer; weights and bias nodes; forward propagation producing the output and back propagation of the total error). Reconstructed from Taylor M. Neural Networks Math: Visual
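A numeric sketch of the small network in Fig. 7.1 may help fix the idea: forward propagation computes the prediction from the inputs, and back propagation pushes the error gradient back through the weights. All values, the loss and the layer sizes below are illustrative assumptions:

```python
# Forward and back propagation for a 2-input, 2-hidden, 1-output network
# like Fig. 7.1 (values and loss are assumptions for illustration).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8])                       # inputs a1, a2
W1, b1 = np.random.randn(2, 2), np.zeros(2)    # weights w1..w4 + bias
W2, b2 = np.random.randn(2), 0.0               # weights w5, w6 + bias
target, lr = 1.0, 0.1

for _ in range(100):
    # forward propagation
    h = sigmoid(W1 @ x + b1)                   # hidden activations
    y = sigmoid(W2 @ h + b2)                   # output c1
    # back propagation of the squared error E = (y - target)^2 / 2
    dy = (y - target) * y * (1 - y)
    dh = dy * W2 * h * (1 - h)
    W2 -= lr * dy * h
    b2 -= lr * dy
    W1 -= lr * np.outer(dh, x)
    b1 -= lr * dh
```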

Fig. 7.2  Example of a large-scale network with a variety of ophthalmic data types as input (imaging, clinical and genetic) to determine a certain output. Each data type learns a useful featurization in its lower-level towers. The data from each tower is then merged and flows through higher levels, allowing the deep neural network to perform inference across data types [3]
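The multi-tower design of Fig. 7.2 can be sketched with the Keras functional API; the input shapes, tower sizes and the single sigmoid output are assumptions chosen only to illustrate the merge-then-infer pattern:

```python
# Sketch of the per-data-type towers merged into shared layers (Fig. 7.2).
import tensorflow as tf
from tensorflow.keras import layers

img_in = tf.keras.Input(shape=(224, 224, 3), name="imaging")
img = layers.Conv2D(16, 3, activation="relu")(img_in)
img = layers.GlobalAveragePooling2D()(img)            # imaging tower

clin_in = tf.keras.Input(shape=(10,), name="clinical")
clin = layers.Dense(16, activation="relu")(clin_in)   # clinical tower

gen_in = tf.keras.Input(shape=(100,), name="genetic")
gen = layers.Dense(16, activation="relu")(gen_in)     # genetic tower

merged = layers.Concatenate()([img, clin, gen])       # merge the towers
shared = layers.Dense(32, activation="relu")(merged)  # cross-type inference
out = layers.Dense(1, activation="sigmoid")(shared)

model = tf.keras.Model([img_in, clin_in, gen_in], out)
```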

This chapter provides an overview of experimental AI systems in ophthalmology. Specifically, it discusses potential applications in keratoconus, refractive surgery, cataracts, strabismus, retinopathy of prematurity (ROP), and neuro-ophthalmology. Reinforcement learning (RL) and inverse reinforcement learning (IRL), and their potential applications for surgical simulation, are also discussed. Additionally, in light of the COVID-19 pandemic, we conclude by highlighting the impact and potential role of AI systems in ophthalmology in the post COVID-19 era.

AI Applications in Corneal Diagnosis

AI applications have a decades-long history in corneal topography interpretation, and their value has been demonstrated for enhancing clinical decisions for patients with corneal diseases. With advancements in technology, we are able to evaluate corneal curvatures and elevation maps, tissue anatomy, and histology and biomechanical properties. Machine learning (ML) and AI techniques are useful tools to help in corneal disease classification (normal, early suspect keratoconus and keratoconus), allowing early interventions, like collagen cross-linking, to prevent progression and vision loss. Additionally, ML and AI techniques provide useful tools for refractive surgery screening, in vivo corneal morphology exams, and corneal surgeries.

Keratoconus Screening and Classification

ML models for classifying keratoconus have been implemented on parameters from one device and from a combination of devices. Deep learning applications have also been used for the detection of keratoconus.

Maeda et al. were one of the earliest groups to apply learning techniques on imaging data [12]. They used a classification tree combined with a linear discriminant function to distinguish between a keratoconus and a non-keratoconic pattern. Smolek and Klyce were the first to use a classification neural network for keratoconus screening to detect the presence of keratoconus or keratoconus suspects [13]. They reported that the neural networks completely distinguished keratoconus from keratoconus suspects and from topographies that resembled keratoconus. Additionally, their network approach equaled the sensitivity of currently used tests for keratoconus detection and outperformed them in terms of accuracy and specificity [13].

Viera and Barbosa [14] showed that Zernike polynomials are reliable parameters as inputs to a feed-forward artificial neural network and a discriminant analysis technique, with reported precisions of 94% and 84.8%, whereas Kovacs et al. [15] used tomographic data, topographic data, and keratoconus indices from the Scheimpflug camera. They reported that classifiers trained on bilateral data were better than a unilateral single parameter in discriminating fellow eyes of patients with keratoconus from controls (ROC 0.96 vs. 0.88). However, this result may be due to including one eye in the training set and the second from the same patient in the test set.

Several studies combined parameters from more than one device. Hwang et al. [16] extracted variables from slit-scan tomography and spectral-domain OCT to differentiate between normal controls and the clinically normal fellow eyes of highly asymmetric keratoconus subjects. The best discrimination was found when using a combination of variables from both instruments, with spectral-domain OCT corneal thickness measures, followed by anterior corneal measures from tomography, being most important. Ambrosio et al. [17] used a tomographic and biomechanical index (TBI), which combined Scheimpflug-based corneal tomography and biomechanics. Their multinational retrospective study employed several AI techniques, including logistic regression, a support vector machine (SVM), and random forest (RF), all verified by leave-one-out cross validation (LOOCV), to distinguish normal versus ectatic cases. The RF/LOOCV performed the best of all trained and tested methods, and was successful in detecting subclinical (forme fruste) ectasia, with an area under the receiver operating characteristic curve (AUROC) for all ectasia cases of 0.996. Furthermore, the TBI cutoff value of 0.79 showed 100% sensitivity and specificity for detecting clinical ectasia [17].

More recently, some studies have used deep learning methods for the detection of keratoconus. Kamiya et al. [18] used a deep learning model on colour-coded maps obtained by anterior segment optical coherence tomography (AS-OCT) to discriminate keratoconus from normal corneas, while also classifying the grade of the disease.

Corneal Refractive Surgery

Ocular imaging technology has evolved in recent years to address candidacy issues in corneal refractive surgery. A preoperative refractive surgery exam results in imaging, text, and numeric data. These data are a perfect example for multi-view and multi-task learning algorithms, with data integration, feature selection and modeling applications (Fig. 7.3).

Fig. 7.3  Example of a data integration technique used for multi-view data in refractive surgery for modeling predictions

Neural networks in refractive surgery applications have been designed to predict outcomes and for candidate screening [19, 20]. Balidis et al. developed an algorithm to predict the need for retreatment in patients undergoing refractive surgery for myopia [19]. They used a computerized query to select patients who underwent PRK, LASEK, Epi-LASIK, or LASIK, and investigated 13 factors related to refractive surgery (such as age, room temperature and humidity, astigmatism axis, stromal thickness, laser ablation method, keratometry values, and laser characteristics). After data preprocessing, a learning vector quantizer (LVQ) neural network was employed for classification. LVQ has a non-linear classification property, which is preferred for this task. They reported a sensitivity of 0.88 and a specificity of 0.93 [19].

Two studies have developed learning algorithms for refractive surgery screening. Using Pentacam images as input, Xie et al. developed a deep learning model for screening candidates for refractive surgery [21]. A total of 6465 corneal tomographic images of 1385 patients from Zhongshan Ophthalmic Center, Guangzhou, China were used to develop the AI model. Their model, the Pentacam InceptionResNetV2 Screening System (PIRSS), achieved an overall detection accuracy of 0.947 on the validation data set; on the independent test data set it achieved an overall detection accuracy of 0.95, which was comparable with that of senior ophthalmologists who are refractive surgeons (0.928) [21]. In another recent study, Yoo et al. developed a machine learning architecture integrating multi-source data from patient demographics, Pentacam imaging, and ophthalmic examinations to identify candidates for refractive surgery [20]. Five algorithms were used for prediction (SVM, artificial neural networks, random forests, LASSO, and AdaBoost) to classify normal controls versus ectasia-risk patients, followed by an ensemble classifier to improve performance.
Training and internal validation were conducted using subjects who had visited between 2016 and 2017, and external validation was performed using subjects who had visited in 2018. With the ensemble classifier, the reported AUCs were 0.983 and 0.972 in the internal and external validation sets, respectively [20].
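The several-models-plus-ensemble design described for [20] can be sketched with scikit-learn stand-ins; the exact models, their settings, and the use of soft voting below are assumptions for illustration rather than the published pipeline:

```python
# Sketch of five base classifiers combined into a voting ensemble.
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),
        ("ann", MLPClassifier(max_iter=1000)),
        ("rf", RandomForestClassifier()),
        # L1-penalized logistic regression as a LASSO-style classifier
        ("lasso", LogisticRegression(penalty="l1", solver="liblinear")),
        ("ada", AdaBoostClassifier()),
    ],
    voting="soft",  # average predicted probabilities across models
)
# ensemble.fit(X_train, y_train); ensemble.predict_proba(X_test)
```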
Recent research efforts involve modifying neural networks to increase AI interpretability. Yoo et al. developed a multiclass machine learning model whose output selects the type of refractive surgery procedure a patient may be eligible for [22]. They constructed a multiclass XGBoost model to classify patients into four categories: laser epithelial keratomileusis, laser in situ keratomileusis, small incision lenticule extraction, and a contraindication group. The model was trained based on clinical decisions of experts and ophthalmic measurements. The SHapley Additive exPlanations (SHAP) technique was adopted to explain the output of the XGBoost model. They reported accuracies of 0.81 and 0.789 when testing on the internal and external validation datasets, respectively. The SHapley Additive exPlanations for the results were consistent with prior knowledge from ophthalmologists [22].
a mean absolute error of 0.304 [26].
Studies which examine the use of DL algo-
Cataract Diagnosis and Grading rithms as a means of accurate diagnosis and mon-
itoring of cataract among vulnerable populations
The use of DL algorithms for automatic grading have shown similarly promising results, includ-
of cataracts has been explored using photographs ing Liu et  al.’s presentation of a convolutional
from slit-lamp microscopes. An early study digi- neural network to grade slit-lamp images among
tized images before extracting gray-level statis- pediatric patients [27]. Their algorithm offered
tics from within the relevant circular regions of excellent mean accuracy, sensitivity and specific-
the nucleus [23]. These extracted features were ity for classification (97.07%, 97.28%, and
fed into a neural network to produce classifica- 96.83%). Zhang et al. [28] discussed the need of
tions. Other approaches include Li et al.’s use of an automatic diagnosis aid for rural populations
support vector machine regression to predict in China that lack access to expert care in their
grade of cataract [24]. The authors introduced the study which used fundus images, a more acces-
first automatic system for nucleus region detec- sible imaging modality than slit-lamp, as input to
tion in slit lamp images. Local features were their six-level cataract grading system and
extracted on the basis of anatomical landmarks achieved an average accuracy of 92.66%.

AI Applications in Strabismus and ROP

The most significant advances in AI applications for pediatric ophthalmology involve the automated detection of retinopathy of prematurity. Reid et al. summarize AI applications in pediatric ophthalmology [29]. In addition to ROP, machine learning has also been applied to the classification of pediatric cataracts, prediction of postoperative complications following cataract surgery, detection of strabismus and refractive error, prediction of future high myopia, and diagnosis of reading disability. In addition, techniques have been used for the study of visual development, vessel segmentation in pediatric fundus images, and ophthalmic image synthesis.

Detection of Strabismus

Strabismus is a common affliction, affecting approximately 4% of the population [30]. Often diagnosed in infancy or childhood, the disease presents as misalignment of the eyes, so that each eye looks in a different direction. Causes include problems with the eye muscles, the nerves of the eye in charge of transmitting information, the area of the brain which controls eye movements, or other eye injuries or disease. Untreated, it may lead to amblyopia, impairment of depth perception and 3D vision, or permanent vision loss, as the brain suppresses the image contributed by the weaker eye. Accurate and timely diagnosis is a topic of interest among researchers of late, as earlier diagnosis will allow for earlier treatment, improving quality of life for patients not only in terms of visual outcomes, but also through improved self-image and self-confidence. Traditional diagnosis may require several tests performed and interpreted by a clinician, which may include the Maddox Rod test, the corneal light reflex test, and the commonly used, gold-standard prism and alternate cover test (PCT). The level of expertise of the examiners has been shown to affect diagnostic decisions [31, 32], bringing to light the need for a more objective and accurate approach through automatic methods. Techniques explored thus far have included image-based or video-based methods and eye-tracking methods.

Image and video-based approaches to automatic diagnostic aids have included Yang et al.'s [33] prospective observational pilot study on 30 intermittent exotropia patients, 30 esotropia patients and 30 orthotropic patients who were able to cooperate with a PCT. Two ophthalmologists independently performed the PCT for each subject to examine the angle of deviation. Using an infrared camera, full-face photos were taken with a selective wavelength filter placed over either eye. The resulting images were then fed through a 3D strabismus photo analyzer, and the angles of deviation obtained by the automatic and manual methods showed an excellent positive correlation (R = 0.900, P < 0.001) [33]. Another study selected features and classified images using a supervised support vector machine learning algorithm to mimic the diagnostic process of the Hirschberg test, which calculates the magnitude of yaw in the non-fixating eye by measuring the luminous reflection displacement of the cornea [34]. Their fully automated system was able to find the region of the eyes in images, locate the limbus and the area of brightness, and classify the images among previously diagnosed cases and controls, achieving 94.14% sensitivity, 95.38% specificity, 98.78% PPV (positive predictive value) and 83.07% NPV (negative predictive value). Chandna et al. [35] used measures obtained from prisms and cover tests as input to their backpropagation neural networks to produce a differential diagnosis. Despite an average reported accuracy of 100%, their system was limited to vertical strabismus.

Several other studies have presented algorithms with their own limitations [36–38]. These studies, along with the studies described above, used equipment that would be difficult to obtain or prohibitively expensive at some clinics, only diagnosed one direction of strabismus, or were not sensitive enough to detect unapparent strabismus. Valente et al. [39] attempted to address these limitations in their 2017 study, which used digital video to detect strabismus through the cover test. The only materials needed were a digital camera and a regular computer for image and video processing, and their methods achieved 87% accuracy while acknowledging measures lower than 1Δ.

Several eye-tracking techniques have also been proposed. Corneal light reflection was used in one early animal study to characterize binocular misalignment in macaque monkeys with strabismus, through measuring eye alignment errors to fixation targets which were presented throughout the subject's field of gaze at any distance [40]. The method allowed for horizontal and vertical error measurements and showed similar accuracy when compared to standard prism and cover assessments. Another eye-tracking system automated a Hirschberg test for infant subjects using a two-camera gaze-estimation system [41]. The Hirschberg ratio and angle kappa (the angle between the optical and visual axis) were determined through measurements of optical axis direction, corneal reflexes and the coordinates of the center of the entrance pupil while the infants looked at images presented on a computer monitor. This method was only tested on five infants and needed further verification [41]. Chen et al. [42] used an eye-tracking system, the Tobii X2-60, to collect gaze data. Subjects were asked to look at nine points on a screen while the tracker detected the gaze points and eye movements. Gaze deviation maps were created and combined to form a gaze deviation image, which was fed into a convolutional neural network for categorization. The best performing model of the six attempted achieved 96% sensitivity and 94.1% specificity [42]. Lu et al.'s deep neural network for telemedicine applications was proposed as an automated solution to provide diagnosis for patients living in remote communities and to reduce the burden on on-site specialists [43]. Their algorithm showed excellent results, with a reported detection performance of 0.933 sensitivity, 0.96 specificity and 0.9865 AUC.

Retinopathy of Prematurity

Retinopathy of Prematurity (ROP) accounts for 6–18% of childhood blindness worldwide [44]. Early identification of the disease and treatment has shown better visual outcomes, and therefore accurate and timely diagnosis is critical [45]. There is variability in the diagnostic process for ROP, and inconsistent classification agreement (plus vs. pre-plus vs. normal) among clinicians has been observed [46, 47]. Challenges which may precipitate variation in clinical diagnoses include geographic differences in training, vagueness of the definition of ROP, differing cut points on the continuous spectrum of vascular abnormality as judged by clinicians, and a currently used standard published photograph for plus disease from the 1980s showing a smaller and more magnified view compared to currently available images [48]. As discussed elsewhere in this book, AI and ML techniques have been explored as assistive tools to aid with these challenges and improve diagnostic accuracy.

First attempts at automatic and semi-automatic imaging analysis used wide-angle RetCam images and focused on quantification of dilation and vascular tortuosity for plus disease diagnosis. The diagnostic decisions of these systems were compared to expert diagnosis, and clinical applications were not realized because of either limitations in usability or lack of agreement with the experts [49–51]. In particular, an ML system proposed by Ataer-Cansizoglu et al. [52] (Imaging & Informatics in ROP, i-ROP) showed promising classification capabilities by comparing different cropped shapes and sizes of images, as well as extracted tortuosity and dilation features. Their algorithm showed high diagnostic accuracy (95%) using a large circular six disc-diameter crop of wide-angle RetCam images and a metric which combined arterial and venous tortuosity, which was shown to be comparable to 3 experts (96%, 94%, 92%) and higher than the mean accuracy of 31 non-experts (81%). Despite the algorithm performing well, manual segmentation of the images limited usability in a real-world setting [52].

Brown et al. [5] presented their fully automated i-ROP DL system in 2018, which was developed to provide a three-level plus disease diagnosis in ROP patients and performed well on both internal and external validation sets.

RetCam photographs were collected over a 95.1% and 97.8% for diagnosis of plus disease
5-year period and U-Net vessel segmentation and and 92.4% and 97.4% for pre-plus or worse [54].
a pretrained Inception-V1 technical network
were used on the images. The algorithm was
compared to gold standard diagnosis performed Neuro-ophthalmology Applications
by an ophthalmic examination by one expert and
image analysis by three experts. A fivefold cross-­ Identifying abnormalities of the optic nerve can
validation showed AUCs of 0.94 and 0.98 for help to reveal vision-related neurological condi-
normal vs. pre-plus and plus, and plus vs. normal tions. Detection of papilledema, for example, can
and pre-plus classifications, respectively. alert clinicians to possible elevated intracranial
External validation on an independent dataset of pressure from life-threatening brain tumors or
100 wide-angle RetCam images continued to blood clots in the brain. Fundus photographs can
show excellent specificity and sensitivity with capture changes to the optic nerve head, allowing
plus classification exhibiting 93% sensitivity and for examination without the use of an ophthalmo-
94% sensitivity and preplus or worse exhibiting scope, a tool which non-ophthalmologists may
100% sensitivity and 94% specificity [5]. find difficult to use. AI systems have been used
Building on Brown et al.’s work, Redd et al. on fundus photo data to detect abnormalities in
[53] used the same DL system to classify images the optic nerve head and have been proposed as a
and produce a probability based nine-level dis- diagnostic aid for patients with neuro-ophthalmic
ease severity scale. The reference standard diag- disease who may be presenting at non-neuro or
nosis integrated image-based and ophthalmic non-ophthalmic health care clinics.
diagnoses from expert clinicians, 4861 examina-
tions from 870 infants were included in the anal-
ysis and showed an AUC of 0.960 and 94% Optic Disc Abnormalities
sensitivity for detecting type-1 ROP and an AUC
of 0.910 for clinically significant ROP, showing Changes to the optic nerve head due to neuro-­
the system’s ability to recognize broad as well as ophthalmic causes are rare, compared to optic
specific categories of the disease. Furthermore, disc abnormalities caused by glaucoma, and the
the i-ROP DL severity score correlated with dis- ability of deep learning to aid in the detection of
ease severity decided by expert graders, lending non-glaucomatous vs. glaucomatous optic neu-
evidence that i) ROP phenotypes present on a ropathy has been explored. Yang et al.’s [55] con-
continuum of severity from mild to severe and ii) volutional neural network to perform such a task
this severity can be measured automatically and showed a diagnostic accuracy of 93.4% sensitiv-
accurately with their proposed system [53]. ity, 81.8% specificity and 0.874 AUC, with false
More recently, Mao et al. [54] presented a deep convolutional neural network (DCNN) which provided a diagnostic decision and analyzed pathological features of ROP to generate quantitative metrics for tortuosity, width, fractal dimension and vessel density. Three DCNNs were used in the network architecture: a modified U-Net segmented the blood vessels, another segmented the optic disc, and three-level classification was performed by a DenseNet. Whole images were used during training of the system, and data augmentation was performed by flipping the images vertically and horizontally as well as rotating in three-degree increments. The authors reported a sensitivity and specificity of 95.1% and 97.8% for the diagnosis of plus disease, and 92.4% and 97.4% for pre-plus or worse [54].
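The flip-and-rotate augmentation scheme described for [54] can be sketched as follows; the helper name, the rotation range and the use of Pillow are assumptions for illustration, while the three-degree increments mirror the description above:

```python
# Sketch of flips plus rotations in three-degree increments.
import random
from PIL import Image

def augment(img: Image.Image) -> Image.Image:
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)   # horizontal flip
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_TOP_BOTTOM)   # vertical flip
    angle = 3 * random.randint(-10, 10)              # ..., -6, -3, 0, 3, 6, ...
    return img.rotate(angle)
```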

Neuro-ophthalmology Applications

Identifying abnormalities of the optic nerve can help to reveal vision-related neurological conditions. Detection of papilledema, for example, can alert clinicians to possible elevated intracranial pressure from life-threatening brain tumors or blood clots in the brain. Fundus photographs can capture changes to the optic nerve head, allowing for examination without the use of an ophthalmoscope, a tool which non-ophthalmologists may find difficult to use. AI systems have been used on fundus photo data to detect abnormalities in the optic nerve head, and have been proposed as a diagnostic aid for patients with neuro-ophthalmic disease who may be presenting at non-neurological or non-ophthalmic health care clinics.

Optic Disc Abnormalities

Changes to the optic nerve head due to neuro-ophthalmic causes are rare compared to optic disc abnormalities caused by glaucoma, and the ability of deep learning to aid in the detection of non-glaucomatous vs. glaucomatous optic neuropathy has been explored. Yang et al.'s [55] convolutional neural network for this task showed a diagnostic performance of 93.4% sensitivity, 81.8% specificity and 0.874 AUC, with false positive cases exhibiting optic discs that were tilted or showed extensive peripapillary atrophy. The rarity of neuro-ophthalmic changes in optic disc appearance, compared with glaucoma-caused changes, may contribute to neuro-ophthalmic misdiagnosis. Non-ophthalmic physicians may not be confident in identifying markers of neuro-ophthalmic disease, including those presenting on images of optic discs, and there is a resulting unmet need for an objective approach to abnormal optic disc diagnosis. Indeed, a 2019 review found high rates of misdiagnosis in neuro-ophthalmology in the literature, with patients undergoing unnecessary, costly treatments and tests before being seen by a neuro-ophthalmologist [56].

A few studies have attempted automatic detection of non-glaucomatous optic disc abnormalities, using ML techniques to diagnose and stage neuro-ophthalmic conditions by detecting optic disc irregularities, such as papilledema and optic disc atrophy, in fundus photographs. One early study graded papilledema severity using ML and decision tree forest classification to extract optic disc margin features, vascular features and peripapillary texture features from fundus photographs [57]. The system achieved substantial agreement with a severity grading ground truth provided by an expert neuro-ophthalmologist. In a subsequent study, a support vector machine (SVM) was used, alongside a hybrid feature extraction method, to achieve excellent accuracy (92.86%) of papilledema detection on a dataset of 160 fundus images, and even higher accuracy (97.85%) for grading papilledema images as 'mild' or 'severe' [58]. Fatima et al. [59] also presented a computer-aided system to detect papilledema. After preprocessing of fundus photos to detect the optic disc and perform vessel segmentation, 26 features were extracted which represented optic disc change. The best features were selected within four categories: color, texture, vascular, and disc margin obstruction. Again, an SVM system was tested on 160 fundus images from two publicly available data sets, STARE and AFIO, achieving a 95.6% accuracy for the STARE dataset, 87.4% for the AFIO dataset, and 85.9% for the combined data [59].

A deep learning system has been introduced to detect papilledema vs. other abnormalities vs. normal status, using fundus photographs from a multi-site, multiethnic dataset of over 15,000 photos [60]. 14,341 images from 19 sites in 11 countries were used for training and internal testing, and 1505 images from five other sites were used for external validation. The internal testing showed a 0.99 AUC both for papilledema detection vs. normal and other abnormalities, and for normal status detection vs. papilledema and other abnormalities. External validation resulted in an AUC of 0.96 for the detection of papilledema, with sensitivity and specificity of 96.4% and 84.7%, respectively [60].

Other studies attempting multi-level classification of optic disc images have included Ahn et al.'s proposed ML techniques for distinguishing between optic neuropathies and pseudopapilledema, and Yang et al.'s identification of optic disc pallor [61, 62]. A deep learning system with the ability to determine laterality (right vs. left eye) from fundus photo features was proposed recently. This algorithm could help with the laborious and time-consuming task of sorting and labeling images, and was developed so that determination could be achieved for photos displaying normal or abnormal optic discs [63].

Reinforcement and Inverse Reinforcement Learning for Surgical Applications

Reinforcement learning (RL) and inverse reinforcement learning (IRL), or apprenticeship learning, are AI algorithms that are being explored for healthcare applications. Both methods may have potential applications to surgery, surgical robotics, and surgical training and assessment. In RL the goal is finding the optimal policy. Once an optimal value function is learned, it is possible to generate the optimal policy (surgical skill) for a given task from the value function [64]. RL algorithms learn by trial and error, taking as input sequences of interactions (histories) between the decision maker, in this case the surgeon, and their environment [65]. At every decision point, the RL algorithm chooses an action according to its policy and receives new observations and immediate outcomes (rewards). In IRL, the optimal policy is given by an expert or another agent (mentor), and we recover the reward function that explains it. RL and IRL may have potential in healthcare when learning requires physician demonstration, for example in learning to suture wounds for robotic-assisted surgery [66]. As such, there is potential for applications in robotic-assisted surgery (RAS) in ophthalmology and training.
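The act-observe-update loop described above can be made concrete with a minimal tabular Q-learning sketch; the toy task, its dynamics and all parameters below are assumptions for illustration only:

```python
# Minimal tabular Q-learning: act, receive reward and next state, update
# value estimates; read the policy off the learned value function.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def env_step(s, a):
    # toy dynamics: action 1 moves toward the goal state, which pays +1
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, 1.0 if s2 == n_states - 1 else 0.0

for _ in range(2000):
    s = int(rng.integers(n_states))
    for _ in range(20):
        a = int(rng.integers(n_actions)) if rng.random() < eps \
            else int(np.argmax(Q[s]))
        s2, r = env_step(s, a)
        # trial-and-error update toward reward plus discounted future value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2

policy = Q.argmax(axis=1)  # optimal action per state, from the value function
```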
RL can learn from a surgeon's motions, enhancing RAS. Additionally, segmentation techniques can reconstruct open wounds from imaging, and a suturing method can be generated by finding an optimal trajectory that takes into consideration external factors, such as joint motions and obstacles. Image-trained RNNs can also be developed to tie knots autonomously by learning sequences of events, such as the surgeon's hand movements.

and obstacles. Image-trained RNNs can also be embedded applications, and their connectivity to
developed to tie knots autonomously by learning ophthalmic care providers will allow for immedi-
sequences of events, such as the surgeons’ hand ate intervention. The need to overcome the hurdle
movement. of affordability and data privacy will be a priority
IRL applications may have potential applica- for many of these applications. Additionally,
tions for surgical training. These applications challenges in translating deep learning applica-
assume that the mentor has the same reward func- tions due to limited explainability and interpret-
tion as the observer/trainee and chooses from the ability need to be addressed to ensure targeted
same set of actions. The idea is then to infer the representation and to identify potential biases. AI
reward function of the mentor so as to produce the researchers and clinicians should work together
observed behavior. IRL assumes that the reward to increase interpretability and integrate AI sys-
function is expressed as a linear function of known tems in the clinical decision-making process in a
features. A prime example is teaching a person responsible manner.
how to drive, rather than tell a young unseasoned
driver what the reward function is, it is more natu-
ral to demonstrate to them how to drive [67].
Supervised methods have been developed to
References
mimic the mentor. However, there are instances 1. Ting DSW, Pasquale LR, Peng L, et  al. Artificial
where blindly following the mentor’s trajectory intelligence and deep learning in ophthalmology.
may not work because the environment encoun- Br J Ophthalmol. 2019;103(2):167–75. https://doi.
tered is different (consider the example of high- org/10.1136/bjophthalmol-­2018-­313173.
2. Hogarty DT, Mackey DA, Hewitt AW.  Current
way driving and varying traffic patterns). Abeel state and future prospects of artificial intelligence
and Ng were the first to apply IRL for apprentice- in ophthalmology: a review. Clin Exp Ophthalmol.
ship learning simulation studies and reported that 2019;47(1):128–39. https://doi.org/10.1111/
the policy found will have performance compara- ceo.13381.
3. Esteva A, Robicquet A, Ramsundar B, et al. A guide
ble to or better than that of an expert, on the to deep learning in healthcare. Nat Med. 2019;25:24–
expert’s unknown reward function [67]. 9. https://doi.org/10.1038/s41591-­018-­0316-­z.
Challenges in applying learning algorithms 4. Treder M, Lauermann JL, Eter N.  Deep learning-­
for surgical applications will be more difficult to based detection and classification of geographic
atrophy using a deep convolutional neural network
overcome and include correctly localizing an classifier. Graefes Arch Clin Exp Ophthalmol.
instrument’s position and orientation in surgical 2018;256(11):2053–60. https://doi.org/10.1007/
scenes, data collection, adapting to completely s00417-­018-­4098-­2.
unknown and unique situations, which may cause 5. Brown JM, Campbell JP, Beers A, et  al. Automated
diagnosis of plus disease in retinopathy of prema-
serious surgical errors [64]. Limiting these appli- turity using deep convolutional neural networks.
cations to simulation studies and training may be JAMA Ophthalmol. 2018;136(7):803–10. https://doi.
the safest approach. org/10.1001/jamaophthalmol.2018.1934.
6. Gulshan V, Peng L, Coram M, et  al. Development
and validation of a deep learning algorithm for
detection of diabetic retinopathy in retinal fundus
AI in the Post COVID-19 Era ­photographs. JAMA. 2016;316(22):2402–10. https://
doi.org/10.1001/jama.2016.17216.
COVID-19 has transformed all aspects of medi- 7. Wang M, Pasquale LR, Shen LQ, et  al. Reversal of
glaucoma hemifield test results and visual field fea-
cine and ophthalmology, from clinical care and tures in glaucoma. Ophthalmology. 2018;125(3):352–
research to educational and public health activi- 60. https://doi.org/10.1016/j.ophtha.2017.09.021.
ties. We predict that in the post-covid19 era, AI 8. Lee CS, Baughman DM, Lee AY.  Deep learning is
applications will further expand and play a cen- effective for classifying normal versus age-related
macular degeneration OCT images. Ophthalmol
tral role in ophthalmology. Telemedicine will be Retina. 2017;1(4):322–7.
embraced more as a tool for healthcare delivery. 9. Gargeya R, Leng T. Automated identification of dia-
Home monitors and smart devices, with AI betic retinopathy using deep learning. Ophthalmology.
9. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017;124(7):962–9. https://doi.org/10.1016/j.ophtha.2017.02.008.
10. Schmidt-Erfurth U, Waldstein SM, Klimscha S, et al. Prediction of individual disease conversion in early AMD using artificial intelligence. Invest Ophthalmol Vis Sci. 2018;59(8):3199–208. https://doi.org/10.1167/iovs.18-24106.
11. Rich WL III, Chiang MF, Lum F, Hancock R, Parke DW II. Performance rates measured in the American Academy of Ophthalmology IRIS© Registry (Intelligent Research in Sight). Ophthalmology. 2018;125(5):782–4. https://doi.org/10.1016/j.ophtha.2017.11.033.
12. Maeda N, Klyce SD, Smolek MK, Thompson HW. Automated keratoconus screening with corneal topography analysis. Invest Ophthalmol Vis Sci. 1994;35(6):2749–57.
13. Smolek MK, Klyce SD. Current keratoconus detection methods compared with a neural network approach. Invest Ophthalmol Vis Sci. 1997;38(11):2290–9.
14. de Carvalho LAV, Barbosa MS. Neural networks and statistical analysis for classification of corneal videokeratography maps based on Zernike coefficients: a quantitative comparison. Arquivos Brasileiros de Oftalmologia. 2008;71:337–41. https://doi.org/10.1590/S0004-27492008000300006.
15. Kovács I, Miháltz K, Kránitz K, et al. Accuracy of machine learning classifiers using bilateral data from a Scheimpflug camera for identifying eyes with preclinical signs of keratoconus. J Cataract Refract Surg. 2016;42(2):275–83. https://doi.org/10.1016/j.jcrs.2015.09.020.
16. Hwang ES, Perez-Straziota CE, Kim SW, Santhiago MR, Randleman JB. Distinguishing highly asymmetric keratoconus eyes using combined Scheimpflug and spectral-domain OCT analysis. Ophthalmology. 2018;125(12):1862–71. https://doi.org/10.1016/j.ophtha.2018.06.020.
17. Ambrósio R Jr, Lopes BT, Faria-Correia F, Salomão MQ, Bühren J, Roberts CJ, Elsheikh A, Vinciguerra R, Vinciguerra P. Integration of Scheimpflug-based corneal tomography and biomechanical assessments for enhancing ectasia detection. J Refract Surg. 2017;33(7):434–43. https://doi.org/10.3928/1081597X-20170426-02.
18. Kamiya K, Ayatsuka Y, Kato Y, et al. Keratoconus detection using deep learning of colour-coded maps with anterior segment optical coherence tomography: a diagnostic accuracy study. BMJ Open. 2019;9(9):e031313. https://doi.org/10.1136/bmjopen-2019-031313.
19. Balidis M, Papadopoulou I, Malandris D, Zachariadis Z, Sakellaris D, Vakalis T, et al. Using neural networks to predict the outcome of refractive surgery for myopia. 4Open. 2019;2:29.
20. Yoo TK, Ryu IH, Lee G, Kim Y, Kim JK, Lee IS, et al. Adopting machine learning to automatically identify candidate patients for corneal refractive surgery. NPJ Digit Med. 2019;2(1):59. https://doi.org/10.1038/s41746-019-0135-8.
21. Xie Y, Zhao L, Yang X, et al. Screening candidates for refractive surgery with corneal tomographic-based deep learning. JAMA Ophthalmol. 2020;2020:e200507. https://doi.org/10.1001/jamaophthalmol.2020.0507. [published online ahead of print, 2020 Mar 26].
22. Yoo TK, Ryu IH, Choi H, et al. Explainable machine learning approach as a tool to understand factors used to select the refractive surgery technique on the expert level. Transl Vis Sci Technol. 2020;9(2):8. https://doi.org/10.1167/tvst.9.2.8.
23. Duncan DD, Shukla OB, West SK, et al. New objective classification system for nuclear opacification. J Opt Soc Am A Opt Image Sci Vis. 1997;14:1197–204.
24. Li H, Lim JH, Liu J, et al. An automatic diagnosis system of nuclear cataract using slit-lamp images. IEEE Trans Biomed Eng. 2010;57:1690–8.
25. Srivastava R, Gao X, Yin F, et al. Automatic nuclear cataract grading using image gradients. J Med Imaging (Bellingham). 2014;1:014502.
26. Gao X, Lin S, Wong TY. Automatic feature learning to grade nuclear cataracts based on deep learning. IEEE Trans Biomed Eng. 2015;62:2693–701.
27. Liu X, Jiang J, Zhang K, Long E, Cui J, et al. Localization and diagnosis framework for pediatric cataracts based on slit-lamp images using deep features of a convolutional neural network. PLoS One. 2017;12(3):e0168606.
28. Zhang H, Niu K, Xiong Y, et al. Automatic cataract grading methods based on deep learning. Comput Methods Programs Biomed. 2019;182:104978.
29. Reid JE, Eaton E. Artificial intelligence for pediatric ophthalmology. Curr Opin Ophthalmol. 2019;30(5):337–46. https://doi.org/10.1097/ICU.0000000000000593.
30. Hertle RW. Clinical characteristics of surgically treated adult strabismus. J Pediatr Ophthalmol Strabismus. 1998;35(3):138–68.
31. Anderson HA, Manny RE, Cotter SA, et al. Effect of examiner experience and technique on the alternate cover test. Optom Vis Sci. 2010;87:168–75.
32. Hrynchak PK, Herriot C, Irving EL. Comparison of alternate cover test reliability at near in non-strabismus between experienced and novice examiners. Ophthalmic Physiol Opt. 2010;30:304–9.
33. Yang HK, Seo JM, Hwang JM, Kim KG. Automated analysis of binocular alignment using an infrared camera and selective wavelength filter. Investig Ophthalmol Vis Sci. 2013;54:2733–7.
34. De Almeid JDS, Silva AC, De Paiva AC, et al. Computational methodology for automatic detection of strabismus in digital images through Hirschberg test. Comput Biol Med. 2012;42:135–46.
35. Chandna A, Fisher A, Cunninghan I, et al. Pattern recognition of vertical strabismus using an artificial neural network (strabnet). Strabismus. 2009;17(4):131–8.
36. Kim TY, Seo SS, Kim YJ, et al. A new software for quantitative measurement of strabismus based on digital image. J Korea Multimedia Soc. 2012;15(5):595–605.
37. Seo MW, Yang HK, Hwang JM, Seo JM. The automated diagnosis of strabismus using an infrared camera. 6th Eur Conf Int Fed Med Biol Eng. 2015;45:142–5.
38. Khumdat N, Phukpattaranont P, Tengtrisorn S. Development of a computer system for strabismus screening. In: 6th Biomedical Engineering International Conference. IEEE; 2013. p. 1–5.
39. Valente TL, de Almeida JD, Silva AC, et al. Automatic diagnosis of strabismus in digital videos through cover test. Comput Methods Prog Biomed. 2017;140:295–305.
40. Quick MW, Boothe RG. A photographic technique for measuring horizontal and vertical eye alignment throughout the field of gaze. Investig Ophthalmol Vis Sci. 1992;33:234–46.
41. Model D, Eizenman M. An automated Hirschberg test for infants. IEEE Trans Biomed Eng. 2011;58:103–9.
42. Chen Z, Fu H, Lo WL, Chi Z. Strabismus recognition using eye-tracking data and convolutional neural networks. J Healthc Eng. 2018;2018:1–9.
43. Lu J, Feng J, Fan Z, et al. Automated strabismus detection based on deep neural networks for telemedicine applications. 2018. https://deepai.org/publication/automated-strabismus-detection-based-on-deep-neural-networks-for-telemedicine-applications. Accessed 31 Jul 2020.
44. Fleck BW, Dangata Y. Causes of visual handicap in the Royal Blind School, Edinburgh, 1991–2. Br J Ophthalmol. 1994;78(5):421.
45. Early Treatment for Retinopathy of Prematurity Cooperative Group, Good WV, Hardy RJ, Dobson V, et al. Final visual acuity results in the early treatment for retinopathy of prematurity study. Archiv Ophthalmol. 2010;128(6):663.
46. Chan RP, Williams SL, Yonekawa Y, et al. Accuracy of retinopathy of prematurity diagnosis by retinal fellows. Retina (Philadelphia, PA). 2010;30(6):958.
47. Myung JS, Chan RV, Espiritu MJ, et al. Accuracy of retinopathy of prematurity image-based diagnosis by pediatric ophthalmology fellows: implications for training. J Am Assoc Pediatric Ophthalmol Strabismus. 2011;15(6):573–8.
48. Ting DS, Peng L, Varadarajan AV, et al. Deep learning in ophthalmology: the technical and clinical considerations. Progr Retinal Eye Res. 2019;72:100759.
49. Koreen S, Gelman R, Martinez-Perez ME, et al. Evaluation of a computer-based system for plus disease diagnosis in retinopathy of prematurity. Ophthalmology. 2007;114(12):e59–67.
50. Wilson CM, Wong K, Ng J, Cocker KD, et al. Digital image analysis in retinopathy of prematurity: a comparison of vessel selection methods. J Am Assoc Pediatric Ophthalmol Strabismus. 2012;16(3):223–8.
51. Abbey AM, Besirli CG, Musch DC, et al. Evaluation of screening for retinopathy of prematurity by ROP tool or a lay reader. Ophthalmology. 2016;123(2):385–90.
52. Ataer-Cansizoglu E, Bolon-Canedo V, Campbell JP. Computer-based image analysis for plus disease diagnosis in retinopathy of prematurity: performance of the "i-ROP" system and image features associated with expert diagnosis. Transl Vis Sci Technol. 2015;4(6):5.
53. Redd TK, Campbell JP, Brown JM, et al. Evaluation of a deep learning image assessment system for detecting severe retinopathy of prematurity. Br J Ophthalmol. 2019;103(5):580–4.
54. Mao J, Luo Y, Liu L, et al. Automated diagnosis and quantitative analysis of plus disease in retinopathy of prematurity based on deep convolutional neural networks. Acta Ophthalmol. 2020;98(3):e339–45.
55. Yang HK, Kim YJ, Sung JY, et al. Efficacy for differentiating nonglaucomatous versus glaucomatous optic neuropathy using deep learning systems. Am J Ophthalmol. 2020;2.
56. Stunkel L, Newman NJ, Biousse V. Diagnostic error and neuro-ophthalmology. Curr Opin Neurol. 2019;32(1):62–7.
57. Echegaray S, Zamora G, Yu H, et al. Automated analysis of optic nerve images for detection and staging of papilledema. Invest Ophthalmol Vis Sci. 2011;52:7470–8.
58. Akbar S, Akram MU, Sharif M, et al. Decision support system for detection of papilledema through fundus retinal images. J Med Syst. 2017;41:66.
59. Fatima KN, Hassan T, Akram MU, et al. Fully automated diagnosis of papilledema through robust extraction of vascular patterns and ocular pathology from fundus photographs. Biomed Opt Express. 2017;8:1005–24.
60. Milea D, Najjar RP, Zhubo J, Ting D, Vasseneix C, Xu X, et al. Artificial intelligence to detect papilledema from ocular fundus photographs. N Engl J Med. 2020;382(18):1687–95.
61. Ahn JM, Kim S, Ahn KS, et al. Accuracy of machine learning for differentiation between optic neuropathies and pseudopapilledema. BMC Ophthalmol. 2019;19:178.
62. Yang HK, Oh JE, Han SB, et al. Automatic computer-aided analysis of optic disc pallor in fundus photographs. Acta Ophthalmol (Copenh). 2019;97:e519–25.
63. Liu TYA, Ting DSW, Yi PH, Wei J, Zhu H, Subramanian PS, et al. Deep learning and transfer learning for optic disc laterality detection: implications for machine learning in neuro-ophthalmology. J Neuroophthalmol. 2020;40(2):178–84.
64. Kassahun Y, Yu B, Tibebu AT, Stoyanov D, Giannarou S, Metzen JH, et al. Surgical robotics beyond enhanced dexterity instrumentation: a survey of machine learning techniques and their role in intelligent and autonomous surgical actions. Int J CARS. 2016;11(4):553–68. https://doi.org/10.1007/s11548-015-1305-z.
65. Gottesman O, Johansson F, Komorowski M, Faisal A, Sontag D, Doshi-Velez F, et al. Guidelines for reinforcement learning in healthcare. Nat Med. 2019;25(1):16–8. https://doi.org/10.1038/s41591-018-0310-5.
7  Experimental Artificial Intelligence Systems in Ophthalmology: An Overview 99

in healthcare. Nat Med. 2019;25(1):24–9. https://doi. the twenty-first international conference on Machine
org/10.1038/s41591-­018-­0316-­z. learning (ICML ‘04). Banff, AB, Canada: Association
67. Abbeel P, Ng AY.  Apprenticeship learning via
for Computing Machinery; 2004. p.  1. https://doi.
inverse reinforcement learning. In: Proceedings of org/10.1145/1015330.1015430.
8 Artificial Intelligence in Age-Related Macular Degeneration (AMD)

Yifan Peng, Qingyu Chen, Tiarnan D. L. Keenan, Emily Y. Chew, and Zhiyong Lu

Y. Peng
National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
e-mail: yip4002@med.cornell.edu

Q. Chen · Z. Lu
National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
e-mail: qingyu.chen@nih.gov; luzh@ncbi.nlm.nih.gov

T. D. L. Keenan · E. Y. Chew (*)
Department of Epidemiology and Clinical Applications, National Eye Institute (NEI), National Institutes of Health (NIH), Bethesda, MD, USA
e-mail: tiarnan.keenan@nih.gov; echew@nei.nih.gov

Abbreviations

AMD    Age-related macular degeneration
AREDS  Age-Related Eye Disease Study
CGA    Central geographic atrophy
DHA    Docosahexaenoic acid
EPA    Eicosapentaenoic acid
FAF    Fundus autofluorescence
GA     Geographic atrophy
RPE    Retinal pigment epithelium
RPD    Reticular pseudodrusen

Introduction

Age-related macular degeneration (AMD) accounts for approximately 9% of global blindness and is the leading cause of visual loss in developed countries [1, 2]. The number of people with AMD worldwide is projected to be 196 million in 2020, rising substantially to 288 million in 2040 [3]. The prevalence of AMD increases exponentially with age: late AMD in white populations has been estimated by meta-analysis at 6% at 80 years and 20% at 90 years [4]. Over time, increased disease prevalence through changing population demographics may place significant burdens on eye services, especially where retinal specialists are not available in sufficient numbers to perform individual examinations on all patients. It is conceivable that deep learning and/or telemedicine approaches might support future eye services. However, this might only apply when evidence-based systems have undergone extensive validation and demonstrated performance metrics that are at least non-inferior to those of clinical ophthalmologists in routine practice.

AMD arises from a complex interplay between aging, genetics, and environmental risk factors [5, 6]. It is regarded as a progressive, step-wise disease. It is classified by clinical features (based on clinical examination or color fundus photography) into early, intermediate, and late stages [7]. The hallmarks of the intermediate disease are the presence of large drusen and/or pigmentary abnormalities at the macula (Fig. 8.1). There are two forms of late AMD: neovascular AMD and atrophic AMD, defined by geographic atrophy (GA).

Fig. 8.1 Demonstration of the AMD pathologies, according to the Age-Related Eye Disease Study (AREDS)

Automated image analysis tools have demonstrated promising results in biology and medicine [8–13]. Early retinal image classification systems of color fundus photographs (CFP) adopted traditional machine learning with human-engineered features [14]. Recently, deep learning, a subfield of machine learning, has generated substantial interest in the field of ophthalmology [8, 15–20]. Compared to the traditional imaging workflow, which relies heavily on human grading and is highly time-consuming, deep learning enables more accurate and efficient approaches.

Due to the importance of deep learning in the imaging-based analysis of AMD, this chapter aims to discuss the role of medical imaging, empowered by AI, in detecting AMD, which will encourage future practical applications and methodological research.

In the following, we first introduce common data modalities and publicly available datasets, and then summarize popular machine learning methods in AMD detection, classification, and prognosis. Finally, we discuss several open problems and challenges. We hope to provide guidance for researchers and ophthalmologists.

Materials

In this section, we briefly introduce common data modalities and publicly available data sets that have been widely used in AI-powered AMD research.

Common Data Modalities

Three imaging modalities are commonly used in AI-powered AMD research.

Color Fundus Photograph (CFP) uses a fundus camera to record color images of the condition of the interior surface of the eye [21]. A fundus camera or retinal camera is a specialized low-power microscope with an attached camera designed to photograph the interior surface of the eye, including the retina, retinal vasculature, optic disc, macula, and posterior pole (i.e., the fundus). The retina is imaged to document conditions such as diabetic retinopathy, age-related macular degeneration, macular edema, and retinal detachment.

Fundus autofluorescence (FAF) is a non-invasive retinal imaging modality used in clinical practice to provide a density map of lipofuscin, the predominant ocular fluorophore, in the retinal pigment epithelium [22]. Classically, FAF utilizes blue-light excitation, then collects emissions within preset spectra to form a brightness map reflecting the distribution of lipofuscin, a dominant fluorophore located in the retinal pigment epithelium (RPE). Thus, it provides detailed insight into the health of the RPE.

Optical coherence tomography (OCT) is another non-invasive imaging test, which uses light waves to take cross-section pictures of the retina [23]. More recently, advances in OCT technology have made it possible to investigate the feasibility of using OCT for the diagnosis and monitoring of retinal diseases such as glaucoma and AMD. Existing studies have used single or multiple modalities for AMD-related disease diagnosis and prognosis, such as using CFP and FAF to detect the presence of RPD [24] and using FAF and OCT to detect the presence of GA [25].

AREDS and AREDS2

The Age-Related Eye Disease Study (AREDS), sponsored by the National Eye Institute (National Institutes of Health), was a 12-year multi-center prospective cohort study of the clinical course, prognosis, and risk factors of AMD, as well as a phase III randomized clinical trial to assess the effects of nutritional supplements on AMD progression [26]. In short, 4757 participants aged 55–80 years were recruited between 1992 and 1998 at 11 retinal specialty clinics in the United States. The inclusion criteria were wide, from no AMD in either eye to late AMD in one eye. The participants were randomly assigned to placebo, antioxidants, zinc, or the combination of antioxidants and zinc. The AREDS dataset is publicly accessible to researchers by request at dbGAP (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000001.v3.p1).

Similarly, the AREDS2 was a multi-center phase III randomized clinical trial that analyzed the effects of different nutritional supplements on the course of AMD [27]. 4203 participants aged 50–85 years were recruited between 2006 and 2008 at 82 retinal specialty clinics in the United States. The inclusion criteria were the presence of either bilateral large drusen or late AMD in one eye and large drusen in the fellow eye. The participants were randomly assigned to placebo, lutein/zeaxanthin, docosahexaenoic acid (DHA) plus eicosapentaenoic acid (EPA), or the combination of lutein/zeaxanthin and DHA plus EPA. AREDS supplements were also administered to all AREDS2 participants because they were by then considered the standard of care [28].

AMD Scales

Fig. 8.2 Different AMD scales. (a) AREDS Simplified Severity Scale scoring schematic for participants with and without late age-related macular degeneration (AMD). Pigmentary abnormalities: 0 = no, 1 = yes; drusen size: 0 = small or none, 1 = medium, 2 = large; late AMD: 0 = no, 1 = yes. (b) AREDS Severity Scale scores 1–9, defined by graders from four categories: geographic atrophy (0/1, i.e., absent/present), increased pigment (0/1, i.e., absent/present), depigmentation (graded 0–3), and drusen area (graded 0–5). The final AREDS Severity Scale score (steps 1–9, shown shaded in different colors) is defined by the combination of findings from these four categories

AREDS Simplified Severity Scale and 4-step Scale  Longitudinal analysis of the AREDS cohort led to the development of the patient-based AREDS Simplified Severity Scale for AMD, based on CFP [29] (Fig. 8.2a). This simplified scale provides convenient risk factors for assessing the risk of progression to late AMD that can be determined by clinical examination or by less demanding photographic procedures than used in the AREDS. The scale combines risk factors from both eyes to generate an overall score for the individual, based on the presence of one or more large drusen (diameter >125 μm) and/or AMD pigmentary abnormalities at the macula of each eye [29]. The Simplified Severity Scale is clinically useful in that it allows ophthalmologists to predict an individual's 5-year risk of developing late AMD. This 5-step scale (from score 0 to 4) estimates the 5-year risk of the development of late AMD in at least one eye as 0.4%, 3.1%, 11.8%, 25.9%, and 47.3%, respectively [29].
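Because the Simplified Severity Scale is essentially a small lookup, it is easy to illustrate in code. The Python sketch below implements only the basic person-level scoring described above (one point per eye for large drusen and one per eye for pigmentary abnormalities) together with the published risk table; it deliberately omits the additional rules of AREDS Report No. 18 (e.g., for bilateral intermediate drusen or late AMD in one eye) and is for illustration only.

```python
# Minimal sketch of the AREDS Simplified Severity Scale (illustration only).
# Basic rule: each eye contributes one point for large drusen (>125 um) and
# one point for pigmentary abnormalities, giving a person-level score of 0-4.

# Published 5-year risks of late AMD in at least one eye, by score [29].
FIVE_YEAR_RISK = {0: 0.004, 1: 0.031, 2: 0.118, 3: 0.259, 4: 0.473}


def simplified_severity_score(eyes):
    """eyes: two (large_drusen, pigment_abnormality) boolean pairs."""
    return sum(int(drusen) + int(pigment) for drusen, pigment in eyes)


if __name__ == "__main__":
    # Example: large drusen in both eyes, pigmentary changes in one eye.
    score = simplified_severity_score([(True, False), (True, True)])
    print(score, FIVE_YEAR_RISK[score])  # -> 3 0.259
```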

9-step AREDS Severity Scale  The AREDS 9-step Severity Scale was designed to allow reading center graders to perform comprehensive classification of AMD-related features on fundus photographs, as a research tool [30] (Fig. 8.2b). The detailed nature of the system means that ophthalmologists in clinical practice very rarely use it. In short, the 9-step severity scale combines a 6-step drusen area scale with a 5-step pigmentary abnormality scale. The 5-year risk of advanced AMD increases progressively from less than 1% in step 1 to about 50% in step 9.

Methods

Automated image analysis tools have demonstrated promising results in biology and medicine [8–13]. In particular, deep learning, a subfield of machine learning, has recently generated substantial interest in the field of ophthalmology [8, 15–20]. In general, deep learning is the process of training algorithmic models with labeled data (e.g., CFP categorized manually as containing pigmentary abnormalities or not), where these models can then be used to assign labels automatically to new data. Deep learning differs from traditional machine learning methods in that specific image features do not need to be pre-specified by experts in that field. Instead, the image features are learned directly from the images themselves. Past studies have utilized deep learning systems for the identification of various retinal diseases, including diabetic retinopathy [31–36], glaucoma [36–39], and retinopathy of prematurity [40]. In recent years, we have seen the successful use of deep learning on AMD (Fig. 8.3).

Fig. 8.3 The timeline of selected articles on deep learning in AMD:
2017 – classification of referable and non-referable AMD using CFP (Burlina et al., 2017a; Lee et al., 2017; Ting et al., 2017); classification of 4-class AMD severity using CFP and universal features (Burlina et al., 2017b); dry-AMD detection using OCT images (Karri et al., 2017); AMD detection using OCT images (Lee et al., 2017)
2018 – 9-step AREDS severity scale using CFP and an ensemble model (Grassmann et al., 2018); wet-AMD detection using ultra-wide-field fundus images (Matsuba et al., 2018); GA detection in fundus autofluorescence (Treder et al., 2018b); late AMD prognosis at 5 years using CFP (Burlina et al., 2018)
2019 – classification of the 9-step AREDS severity scale using CFP and a multi-task strategy (Chen et al., 2019); classification of the AMD simplified severity score using CFP of both eyes (Peng et al., 2019); multiple clinical referral suggestions on OCT images (De Fauw et al., 2019); GA detection in color fundus photographs (Keenan et al., 2019)
2020 – AMD prognosis over a wide time interval using CFP of both eyes and demographic information (Peng et al., 2020); AMD prognosis beyond the inquired year using CFP and genotypes (Yan et al., 2020); RPD detection in intermediate to late AMD using FAF and CFP (Keenan et al., 2020)

Deep Learning in AMD

Classification of 2-class and 4-class AMD severity  Recently, several deep learning systems have been developed for the classification of CFP into AMD severity scales, at the level of the individual eye. They applied deep neural networks to a 2-class classification problem in which the task was to distinguish referable AMD (intermediate or advanced disease) from non-referable AMD (no or early disease) [16, 20, 36, 41, 42].
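As a rough illustration of how such a 2-class CFP classifier is typically set up, the sketch below fine-tunes an ImageNet-pretrained ResNet-18 with PyTorch; the dataset directory is a hypothetical placeholder, and the cited studies each used their own architectures and training protocols.

```python
# Sketch: transfer learning for referable vs. non-referable AMD from CFP.
# "cfp_train/" is a hypothetical folder with one subdirectory per class.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("cfp_train/", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(pretrained=True)          # ImageNet backbone
model.fc = nn.Linear(model.fc.in_features, 2)     # referable / non-referable

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:                     # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```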

Burlina et al. used transfer learning and universal features to address a 4-class AMD severity classification problem (no, early, intermediate, and advanced disease) and obtained 79.4% vs. 75.8% accuracy for machine and physician grading, respectively [16].

Classification of 9-step AREDS severity scale  In a more complicated scenario, in which the aim was to classify AMD according to the 9-step AREDS Severity Scale, Grassmann et al. trained an ensemble of six different neural net architectures with a random forest approach to directly predict the 9-step AREDS severity scale in an AREDS test set with an overall accuracy of 63.3%, outperforming human graders [17] (Fig. 8.2b). However, 1-step classification of images according to the 9-step AREDS Severity Scale does not reflect regular human grading practice. In reading centers, graders instead first calculate individual scores for four different AMD characteristics (drusen area, geographic atrophy, increased pigmentation, and depigmentation), then combine the scores for these four characteristics into the 9-step non-advanced AREDS severity scale [30]. Hence, a deep learning approach that predicts the overall AREDS severity score directly, without these intermediate steps, may have lower transparency and decreased information content for research and clinical purposes [43, 44]. To address these potential criticisms, Chen et al. designed four deep learning models, each of which is responsible for classifying an individual characteristic, and trained them separately using a multi-task strategy [45]. Evaluation on both the AREDS and AREDS2 datasets showed that the accuracy of the model exceeded the current state-of-the-art model by >10%.

Classification of AREDS Simplified Severity Scale  Besides AMD classification at the level of the individual eye, it is also helpful to obtain one overall score for the individual from both eyes (Fig. 8.2a). This is particularly relevant because estimates of rates of progression to late AMD are highly influenced by the status of fellow eyes, as the behavior of the two eyes is highly correlated [29]. To this end, Peng et al. proposed a deep learning framework to automatically identify AMD severity from CFP of both eyes [46]. It mimicked the human grading process by first detecting individual risk factors (drusen and pigmentary abnormalities) in each eye and then combining values from both eyes to assign an AMD score to each patient. Thus, the model closely matches the clinical decision-making process, which allows an ophthalmologist to inspect and visualize an interpretable result.

Prediction of risk of progression to late AMD  Besides AMD classification, making accurate time-based predictions of progression to late AMD is also clinically critical. This would enable improved decision-making regarding: (i) medical treatments, especially oral supplements known to decrease progression risk, (ii) lifestyle interventions, mainly smoking cessation and dietary changes, and (iii) intensity of patient monitoring, e.g., frequent reimaging in the clinic or tailored home monitoring programs [47–51]. It would also aid the design of future clinical trials, which could be enriched for participants with a high risk of progression events [52].

Currently, three methods are available clinically for using CFP to predict the risk of progression. The most commonly used is the AREDS Simplified Severity Scale, as described above [29]. The second method is an online risk calculator [53]. Like the Simplified Severity Scale, its inputs include the presence of macular drusen and pigmentary abnormalities; however, it can also receive the individual's age, smoking status, and basic genotype information consisting of two SNPs (when available). The third method is a deep learning-based architecture that predicts progression with improved accuracy and transparency in two steps: image classification followed by survival analysis [54]. The model was developed and clinically validated on two datasets, from AREDS and AREDS2.
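To make the two-step idea concrete, a minimal sketch might pair image-derived risk factors with a Cox proportional hazards model (here via the lifelines library); the feature values below are invented stand-ins for CNN outputs, and the published model [54] is substantially more elaborate.

```python
# Sketch: image-derived risk factors followed by survival analysis.
# Feature values are invented stand-ins for the outputs of a CNN grader.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "drusen_score":   [0, 1, 2, 2, 1, 1],     # e.g., predicted from CFP
    "pigment_abn":    [0, 1, 0, 1, 0, 1],     # e.g., predicted from CFP
    "years_followed": [10.0, 8.5, 3.2, 1.9, 7.1, 5.5],
    "late_amd":       [0, 0, 1, 1, 0, 1],     # progression event observed?
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years_followed", event_col="late_amd")
cph.print_summary()   # hazard ratios for the image-derived risk factors
```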
Classification of AMD on Optical Coherence Tomography  Besides CFP, OCT also plays a major role in the detection of AMD [55]. Several recent studies have reported robust performance in the automated classification of AMD from OCT scans. Karri et al. fine-tuned a CNN model to classify OCT images of dry AMD [56]. Lee et al. developed an algorithm to categorize OCT images as either normal or AMD [20]. The images were linked to clinical data points from the electronic health record, and gold-standard labels were extracted using the ICD-9 diagnosis codes. At a patient level, the model achieved an area under the ROC curve of 97.45% with an accuracy of 93.45%. De Fauw et al. further demonstrated expert performance on multiple clinical referral suggestions for two independent test datasets [57].

Deep Learning in GA

Geographic atrophy (GA) is the defining lesion of the atrophic form of late AMD. GA in AMD has been estimated to affect over 5 million people worldwide [3, 4]. Unlike neovascular AMD, no drug therapies are available to prevent GA, slow its enlargement, or restore lost vision; this makes it an important research priority [58, 59]. Rapid and accurate identification of eyes with GA could improve the recruitment of eligible patients for future clinical trials and eventually lead to early identification of appropriate patients for proven treatments.

Since the original description of GA by Gass [60], clinical definitions have varied between research groups [61]. In the AREDS, GA was defined as a sharply demarcated, usually circular zone of partial or complete depigmentation of the RPE, typically with exposure of underlying large choroidal blood vessels, at least as large as grading circle I-1 (1/8 disc diameter in diameter) [62]. The sensitivity of the retina to light stimuli is markedly decreased (i.e., dense scotomata) in areas affected by GA. The natural history of the disease involves progressive enlargement of GA lesions over time, with visual acuity decreasing markedly as the central macula becomes involved [58].

The identification of GA by ophthalmologists conducting dilated fundus examinations is sometimes challenging. This may be particularly true for cases with early GA, comprising smaller lesions with less extensive RPE atrophy. In addition, the increasing prevalence of GA (through aging populations in many countries) will translate to greater demand for retinal services. As such, deep learning approaches involving retinal images, obtained perhaps using telemedicine-based devices, might support GA detection and diagnosis. However, these approaches would require establishing evidence-based and 'explainable' systems that have undergone extensive validation and demonstrated performance metrics that are at least non-inferior to those of clinical ophthalmologists in routine practice.

In contrast to studies of AMD classification, few studies have explicitly focused on GA. Treder et al. detected and classified GA in FAF images using a deep learning algorithm [63]. Two classifiers were built, one to classify healthy patients and patients with GA, and the other to classify patients with GA and other retinal diseases. Both achieved high accuracy. Keenan et al. conducted an empirical study to investigate the performance of deep learning models on CFP [64]. The first model predicted GA presence in a population of eyes ranging from no AMD to advanced AMD using CFP; the second model predicted central GA (CGA) presence in the same population; and the third model predicted CGA presence in the subset of eyes with GA. Experiments demonstrated that deep learning could achieve high accuracy for the detection of GA, and compared favorably to human retinal specialists for the detection of CGA.
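Comparisons of this kind are usually framed in terms of the area under the ROC curve for the model and a sensitivity/specificity operating point for each human grader. The scikit-learn sketch below shows the arithmetic on made-up labels and scores; it is not study data.

```python
# Sketch: evaluating a GA-presence classifier against a human grader.
# All labels and scores below are made up for illustration.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # gold standard (GA present?)
y_score = [0.1, 0.3, 0.8, 0.6, 0.9, 0.2, 0.7, 0.4]   # model probabilities
print("model AUC:", roc_auc_score(y_true, y_score))

grader = [0, 1, 1, 1, 1, 0, 0, 0]                    # hypothetical grader calls
tn, fp, fn, tp = confusion_matrix(y_true, grader).ravel()
print("grader sensitivity:", tp / (tp + fn))
print("grader specificity:", tn / (tn + fp))
```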
Deep Learning in RPD

Reticular pseudodrusen (RPD), also known as subretinal drusenoid deposits, have been identified as another disease feature independently associated with an increased risk of progression to late AMD [65]. Unlike soft drusen, which are located in the sub-retinal pigment epithelial (RPE) space, RPD are thought to represent aggregations of material in the subretinal space between the RPE and photoreceptors [65, 66]. Compositional differences have also been found between soft drusen and RPD [66].

The detection of eyes with RPD is important for multiple reasons. Not only is their presence associated with increased risk of late AMD, but the increased risk is weighted towards particular forms of late AMD, including the recently recognized phenotype of outer retinal atrophy [65, 67, 68]. In recent analyses of AREDS2 data, the risk of progression to GA was significantly higher with RPD presence; in contrast, the risk of neovascular AMD was not [51]. Hence, RPD presence may be a powerfully discriminating feature that could be very useful in risk prediction algorithms for the detailed prognosis of AMD progression. The presence of RPD has also been associated with increased speed of GA enlargement [69], which is a critical endpoint in ongoing clinical trials. Finally, in eyes with intermediate AMD, the presence of RPD may be a critical determinant of the reduced efficacy of subthreshold nanosecond laser to slow progression to late AMD [50], as observed in an unplanned subgroup analysis of the clinical trial cohort stratified by the presence of RPD.

However, owing to the poor visibility of RPD on clinical examination and CFP [61, 70, 71], they were not incorporated into the traditional AMD classification and risk stratification systems, such as the Beckman clinical classification scale [7] or the AREDS scales [29, 30]. With the advent of more recent imaging modalities, including FAF, near-infrared reflectance, and OCT [66, 67], the presence of RPD may be ascertained more accurately with careful grading at the reading center level [61, 72, 73]. However, their detection by ophthalmologists (including retinal specialists) in a clinical setting may still be challenging.

Keenan et al. studied a deep learning model to detect RPD in eyes with intermediate to late AMD, using FAF images (FAF model) and CFP images (CFP model), respectively [74]. The gold standard labels were annotated based on the FAF images. Model performance was compared with that of four ophthalmologists on a random subset of the entire test set. Both models achieved a high level of accuracy, equal or superior to the four ophthalmologists on the random subset.

Discussion and Future Directions

The application of deep learning to AMD research using fundus photographs is just the beginning. As introduced above, attempts have been made to apply deep learning to AMD detection and to disease features associated with increased risk of progression to late AMD. However, there is still much work to be conducted in the future.

One limitation arises from the imbalance of cases used for deep learning training, particularly the relatively low proportion of participants with outcomes in the clinical trials. This is likely to have contributed to the relatively lower accuracy of the model. However, this limitation may potentially be addressed by further training using image datasets with a higher proportion of positive cases.

The second limitation of this dataset is the sole use of CFP, OCT, or FAF; multi-modal imaging would be desirable. In AMD, some disease features are visualized more clearly to human experts on one modality than another [75]. For example, macular drusen are typically observed well on CFP but poorly on FAF, while the opposite is true for RPD [61, 70, 71, 75]. Other AMD features are observed in both modalities. For example, pigmentary abnormalities are seen on both though typically classified on CFP [7, 75], while geographic atrophy is seen on both (but typically identified and measured on FAF) [75, 76]. Hence, any techniques that enable the accurate ascertainment of the full spectrum of AMD features would be important for improved disease classification and risk prediction.

Another potential limitation lies in the need for high image quality for accurate classification. Despite high theoretical accuracy, deep learning models might be impractical in real-world practice. In a recent study, Beede et al. showed that the accuracy of a DL model varied widely across the different clinical settings and locations where eye screenings took place [77]. Further steps need to be taken for these methods to be applied in the clinic.
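One straightforward way to act on the multi-modal point above is late fusion, i.e., combining the outputs of separate per-modality models. The sketch below simply averages per-modality probabilities; it illustrates the idea only and is not the fusion strategy of any particular study cited here.

```python
# Sketch: late fusion of per-modality AMD feature detectors.
# Inputs are placeholders for the outputs of modality-specific models.

def late_fusion(prob_cfp: float, prob_faf: float, prob_oct: float) -> float:
    """Average per-modality probabilities that a feature (e.g., RPD) is present."""
    return (prob_cfp + prob_faf + prob_oct) / 3.0

# Example: RPD is conspicuous on FAF but subtle on CFP, so fusion can help.
print(late_fusion(prob_cfp=0.35, prob_faf=0.90, prob_oct=0.75))  # ~0.667
```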

References

1. Congdon N, O'Colmain B, Klaver CCW, et al. Causes and prevalence of visual impairment among adults in the United States. Arch Ophthalmol Chic Ill 1960. 2004;122(4):477–85. https://doi.org/10.1001/archopht.122.4.477.
2. Quartilho A, Simkiss P, Zekite A, Xing W, Wormald R, Bunce C. Leading causes of certifiable visual loss in England and Wales during the year ending 31 March 2013. Eye Lond Engl. 2016;30(4):602–7. https://doi.org/10.1038/eye.2015.288.
3. Wong WL, Su X, Li X, et al. Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. Lancet Glob Health. 2014;2(2):e106–16. https://doi.org/10.1016/S2214-109X(13)70145-1.
4. Rudnicka AR, Jarrar Z, Wormald R, Cook DG, Fletcher A, Owen CG. Age and gender variations in age-related macular degeneration prevalence in populations of European ancestry: a meta-analysis. Ophthalmology. 2012;119(3):571–80. https://doi.org/10.1016/j.ophtha.2011.09.027.
5. Fritsche LG, Fariss RN, Stambolian D, Abecasis GR, Curcio CA, Swaroop A. Age-related macular degeneration: genetics and biology coming together. Annu Rev Genomics Hum Genet. 2014;15:151–71. https://doi.org/10.1146/annurev-genom-090413-025610.
6. Ratnapriya R, Chew EY. Age-related macular degeneration-clinical review and genetics update. Clin Genet. 2013;84(2):160–6. https://doi.org/10.1111/cge.12206.
7. Ferris FL, Wilkinson CP, Bird A, et al. Clinical classification of age-related macular degeneration. Ophthalmology. 2013;120(4):844–51. https://doi.org/10.1016/j.ophtha.2012.10.036.
8. Ching T, Himmelstein DS, Beaulieu-Jones BK, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141). https://doi.org/10.1098/rsif.2017.0387.
9. Ehteshami Bejnordi B, Veta M, Johannes van Diest P, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318(22):2199–210. https://doi.org/10.1001/jama.2017.14585.
10. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8. https://doi.org/10.1038/nature21056.
11. Lehman CD, Wellman RD, Buist DSM, et al. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern Med. 2015;175(11):1828–37. https://doi.org/10.1001/jamainternmed.2015.5231.
12. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2017. p. 3462–71. https://doi.org/10.1109/CVPR.2017.369.
13. Wang X, Peng Y, Lu L, Lu Z, Summers RM. TieNet: text-image embedding network for common thorax disease classification and reporting in chest x-rays. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2018. p. 9049–58. https://doi.org/10.1109/cvpr.2018.00943.
14. Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2005. p. 886–93. https://doi.org/10.1109/CVPR.2005.177.
15. Burlina P, Freund DE, Joshi N, Wolfson Y, Bressler NM. Detection of age-related macular degeneration via deep learning. In: IEEE International Symposium on Biomedical Imaging (ISBI). IEEE; 2016. https://doi.org/10.1109/isbi.2016.7493240.
16. Burlina P, Pacheco KD, Joshi N, Freund DE, Bressler NM. Comparing humans and deep learning performance for grading AMD: a study in using universal deep features and transfer learning for automated AMD analysis. Comput Biol Med. 2017;82:80–6. https://doi.org/10.1016/j.compbiomed.2017.01.018.
17. Grassmann F, Mengelkamp J, Brandl C, et al. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology. 2018;125(9):1410–20. https://doi.org/10.1016/j.ophtha.2018.02.037.
18. Kermany DS, Goldbaum M, Cai W, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–31.e9. https://doi.org/10.1016/j.cell.2018.02.010.
19. Lam C, Yu C, Huang L, Rubin D. Retinal lesion detection with deep learning using image patches. Invest Ophthalmol Vis Sci. 2018;59(1):590–6. https://doi.org/10.1167/iovs.17-22721.
20. Lee CS, Baughman DM, Lee AY. Deep learning is effective for the classification of OCT images of normal versus age-related macular degeneration. Ophthalmol Retina. 2017;1(4):322–7. https://doi.org/10.1016/j.oret.2016.12.009.
21. Graham KW, Chakravarthy U, Hogg RE, Muldrew KA, Young IS, Kee F. Identifying features of early and late age-related macular degeneration: a comparison of multicolor versus traditional color fundus photography. Retina Phila Pa. 2018;38(9):1751–8. https://doi.org/10.1097/IAE.0000000000001777.
22. Holz FG, Bindewald-Wittich A, Fleckenstein M, et al. Progression of geographic atrophy and impact of fundus autofluorescence patterns in age-related macular degeneration. Am J Ophthalmol. 2007;143(3):463–72. https://doi.org/10.1016/j.ajo.2006.11.041.
23. Fujimoto JG, Pitris C, Boppart SA, Brezinski ME. Optical coherence tomography: an emerging technology for biomedical imaging and optical biopsy. Neoplasia N Y N. 2000;2(1–2):9–25. https://doi.org/10.1038/sj.neo.7900071.
24. Chen Q, Keenan TDL, Allot A, Peng Y, Agrón E, Domalpally A, Klaver CCW, Luttikhuizen DT, Colyer MH, Cukras CA, Wiley HE, Teresa Magone M, Cousineau-Krieger C, Wong WT, Zhu Y, Chew EY, Lu Z; AREDS2 Deep Learning Research Group. Multimodal, multitask, multiattention (M3) deep learning detection of reticular pseudodrusen: toward automated and accessible classification of age-related macular degeneration. J Am Med Inform Assoc. 2021;28(6):1135–48. https://doi.org/10.1093/jamia/ocaa302.
25. Arslan J, Samarasinghe G, Benke KK, et al. Artificial intelligence algorithms for analysis of geographic atrophy: a review and evaluation. Transl Vis Sci Technol. 2020;9(2):57. https://doi.org/10.1167/tvst.9.2.57.
26. Age-Related Eye Disease Study Research Group. The age-related eye disease study (AREDS): design implications. AREDS report no. 1. Control Clin Trials. 1999;20(6):573–600. https://doi.org/10.1016/s0197-2456(99)00031-8.
27. AREDS2 Research Group, Chew EY, Clemons T, et al. The Age-Related Eye Disease Study 2 (AREDS2): study design and baseline characteristics (AREDS2 report number 1). Ophthalmology. 2012;119(11):2282–9. https://doi.org/10.1016/j.ophtha.2012.05.027.
28. American Academy of Ophthalmology Retina/Vitreous Panel. Preferred Practice Pattern® Guidelines. Age-related macular degeneration. Am Acad Ophthalmol; 2015.
29. Ferris FL, Davis MD, Clemons TE, et al. A simplified severity scale for age-related macular degeneration: AREDS Report No. 18. Arch Ophthalmol Chic Ill 1960. 2005;123(11):1570–4. https://doi.org/10.1001/archopht.123.11.1570.
30. Davis MD, Gangnon RE, Lee L-Y, et al. The age-related eye disease study severity scale for age-related macular degeneration: AREDS report no. 17. Arch Ophthalmol Chic Ill 1960. 2005;123(11):1484–98. https://doi.org/10.1001/archopht.123.11.1484.
31. Choi JY, Yoo TK, Seo JG, Kwak J, Um TT, Rim TH. Multi-categorical deep learning neural network to classify retinal images: a pilot study employing small database. PLoS One. 2017;12(11):e0187336. https://doi.org/10.1371/journal.pone.0187336.
32. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017;124(7):962–9. https://doi.org/10.1016/j.ophtha.2017.02.008.
33. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10. https://doi.org/10.1001/jama.2016.17216.
34. Raju M, Pagidimarri V, Barreto R, Kadam A, Kasivajjala V, Aswath A. Development of a deep learning algorithm for automatic diagnosis of diabetic retinopathy. Stud Health Technol Inform. 2017;245:559–63.
35. Takahashi H, Tampo H, Arai Y, Inoue Y, Kawashima H. Applying artificial intelligence to disease staging: deep learning for improved staging of diabetic retinopathy. PLoS One. 2017;12(6):e0179790. https://doi.org/10.1371/journal.pone.0179790.
36. Ting DSW, Cheung CY-L, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211–23. https://doi.org/10.1001/jama.2017.18152.
37. Asaoka R, Murata H, Iwase A, Araie M. Detecting preperimetric glaucoma with standard automated perimetry using a deep learning classifier. Ophthalmology. 2016;123(9):1974–80. https://doi.org/10.1016/j.ophtha.2016.05.029.
38. Cerentini A, Welfer D, Cordeiro d'Ornellas M, Pereira Haygert CJ, Dotto GN. Automatic identification of glaucoma using deep learning methods. Stud Health Technol Inform. 2017;245:318–21.
39. Muhammad H, Fuchs TJ, De Cuir N, et al. Hybrid deep learning on single wide-field optical coherence tomography scans accurately classifies glaucoma suspects. J Glaucoma. 2017;26(12):1086–94. https://doi.org/10.1097/IJG.0000000000000765.
40. Brown JM, Campbell JP, Beers A, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):803–10. https://doi.org/10.1001/jamaophthalmol.2018.1934.
41. Matsuba S, Tabuchi H, Ohsugi H, et al. Accuracy of ultra-wide-field fundus ophthalmoscopy-assisted deep learning, a machine-learning technology, for detecting age-related macular degeneration. Int Ophthalmol. Published online May 2018. https://doi.org/10.1007/s10792-018-0940-0.
42. Treder M, Lauermann JL, Eter N. Automated detection of exudative age-related macular degeneration in spectral domain optical coherence tomography using deep learning. Graefes Arch Clin Exp Ophthalmol Albrecht Von Graefes Arch Klin Exp Ophthalmol. 2018;256(2):259–65. https://doi.org/10.1007/s00417-017-3850-3.
43. Doshi-Velez F, Kim B. Towards a rigorous science of interpretable machine learning. ArXiv preprint. Published online 2017. https://arxiv.org/abs/1702.08608.
44. Madumal P, Miller T, Vetere F, Sonenberg L. Towards a grounded dialog model for explainable artificial intelligence. ArXiv preprint. Published online 2018. https://arxiv.org/abs/1806.08055.
45. Chen Q, Peng Y, Keenan T, et al. A multi-task deep learning model for the classification of age-related macular degeneration. Proc AMIA Jt Summits Transl Sci. 2019;2019:505–14. https://pubmed.ncbi.nlm.nih.gov/31259005.
46. Peng Y, Dharssi S, Chen Q, et al. DeepSeeNet: a deep learning model for automated classification of patient-based age-related macular degeneration severity from color fundus photographs. Ophthalmology. 2018;126(4):565–75. https://doi.org/10.1016/j.ophtha.2018.11.015.
47. Age-Related Eye Disease Study 2 Research Group. Lutein + zeaxanthin and omega-3 fatty acids for age-related macular degeneration: the Age-Related Eye Disease Study 2 (AREDS2) randomized clinical trial. JAMA. 2013;309(19):2005–15. https://doi.org/10.1001/jama.2013.4997.
48. Age-Related Eye Disease Study Research Group. A randomized, placebo-controlled, clinical trial of high-dose supplementation with vitamins C and E, beta carotene, and zinc for age-related macular degeneration and vision loss: AREDS report no. 8. Arch Ophthalmol. 2001;119(10):1417–36. https://doi.org/10.1001/archopht.119.10.1417.
49. AREDS Home Study Research Group, Chew EY, Clemons TE, et al. Randomized trial of a home monitoring system for early detection of choroidal neovascularization home monitoring of the Eye (HOME) study. Ophthalmology. 2014;121(2):535–44. https://doi.org/10.1016/j.ophtha.2013.10.027.
50. Guymer RH, Wu Z, Hodgson LAB, et al. Subthreshold nanosecond laser intervention in age-related macular degeneration: the LEAD randomized controlled clinical trial. Ophthalmology. 2019;126(6):829–38. https://doi.org/10.1016/j.ophtha.2018.09.015.
51. Domalpally A, Clemons TE, Bressler SB, et al. Imaging characteristics of choroidal neovascular lesions in the AREDS2-HOME study: report number 4. Ophthalmol Retina. 2019;3(4):326–35. https://doi.org/10.1016/j.oret.2019.01.004.
52. Calaprice-Whitty D, Galil K, Salloum W, Zariv A, Jimenez B. Improving clinical trial participant prescreening with artificial intelligence (AI): a comparison of the results of AI-assisted vs standard methods in 3 oncology trials. Ther Innov Regul Sci. 2020;54(1):69–74. https://doi.org/10.1007/s43441-019-00030-4.
53. Klein R, Klein BEK, Myers CE. Risk assessment models for late age-related macular degeneration. Arch Ophthalmol Chic Ill 1960. 2011;129(12):1605–6. https://doi.org/10.1001/archophthalmol.2011.372.
54. Peng Y, Keenan TD, Chen Q, et al. Predicting risk of late age-related macular degeneration using deep learning. NPJ Digit Med. 2020;3:111. https://doi.org/10.1038/s41746-020-00317-z.
55. Ting DSW, Cheung CY, Nguyen Q, et al. Deep learning in estimating prevalence and systemic risk factors for diabetic retinopathy: a multi-ethnic study. NPJ Digit Med. 2019;2:24. https://doi.org/10.1038/s41746-019-0097-x.
56. Karri SPK, Chakraborty D, Chatterjee J. Transfer learning based classification of optical coherence tomography images with diabetic macular edema and dry age-related macular degeneration. Biomed Opt Express. 2017;8(2):579–92. https://doi.org/10.1364/BOE.8.000579.
57. De Fauw J, Ledsam JR, Romera-Paredes B, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–50. https://doi.org/10.1038/s41591-018-0107-6.
58. Keenan TD, Agrón E, Domalpally A, et al. Progression of geographic atrophy in age-related macular degeneration: AREDS2 report number 16. Ophthalmology. 2018;125(12):1913–28. https://doi.org/10.1016/j.ophtha.2018.05.028.
59. Rosenfeld PJ. Preventing the growth of geographic atrophy: an important therapeutic target in age-related macular degeneration. Ophthalmology. 2018;125(6):794–5. https://doi.org/10.1016/j.ophtha.2018.02.027.
60. Gass JD. Drusen and disciform macular detachment and degeneration. Arch Ophthalmol Chic Ill 1960. 1973;90(3):206–17.
61. Schmitz-Valckenberg S, Sadda S, Staurenghi G, et al. Geographic atrophy: semantic considerations and literature review. Retina Phila Pa. 2016;36(12):2250–64. https://doi.org/10.1097/IAE.0000000000001258.
62. Age-Related Eye Disease Study Research Group. The Age-Related Eye Disease Study system for classifying age-related macular degeneration from stereoscopic color fundus photographs: the Age-Related Eye Disease Study Report Number 6. Am J Ophthalmol. 2001;132(5):668–81. https://doi.org/10.1016/s0002-9394(01)01218-1.
63. Treder M, Lauermann JL, Eter N. Deep learning-based detection and classification of geographic atrophy using a deep convolutional neural network classifier. Graefes Arch Clin Exp Ophthalmol. 2018;256(11):2053–60. https://doi.org/10.1007/s00417-018-4098-2.
64. Keenan TD, Dharssi S, Peng Y, et al. A deep learning approach for automated detection of geographic atrophy from color fundus photographs. Ophthalmology. Published online June 2019. https://doi.org/10.1016/j.ophtha.2019.06.005.
65. Spaide RF, Ooto S, Curcio CA. Subretinal drusenoid deposits AKA pseudodrusen. Surv Ophthalmol. 2018;63(6):782–815. https://doi.org/10.1016/j.survophthal.2018.05.005.
66. Wightman AJ, Guymer RH. Reticular pseudodrusen: current understanding. Clin Exp Optom. 2019;102(5):455–62. https://doi.org/10.1111/cxo.12842.
67. Sadda SR, Guymer R, Holz FG, et al. Consensus definition for atrophy associated with age-related macular degeneration on OCT: classification of atrophy report 3. Ophthalmology. 2018;125(4):537–48. https://doi.org/10.1016/j.ophtha.2017.09.028.
68. Spaide RF. Outer retinal atrophy after regression of subretinal drusenoid deposits as a newly recognized form of late age-related macular degeneration. Retina Phila Pa. 2013;33(9):1800–8. https://doi.org/10.1097/IAE.0b013e31829c3765.
69. Fleckenstein M, Mitchell P, Freund KB, et al. The progression of geographic atrophy secondary to age-related macular degeneration. Ophthalmology. 2018;125(3):369–90. https://doi.org/10.1016/j.ophtha.2017.08.038.
70. Domalpally A, Agrón E, Pak JW, et al. Prevalence, risk, and genetic association of reticular pseudodrusen in age-related macular degeneration: Age-Related Eye Disease Study 2 Report 21. Ophthalmology. 2019;126(12):1659–66. https://doi.org/10.1016/j.ophtha.2019.07.022.
71. Alten F, Clemens CR, Heiduschka P, Eter N. Characterisation of reticular pseudodrusen and their central target aspect in multi-spectral, confocal scanning laser ophthalmoscopy. Graefes Arch Clin Exp Ophthalmol Albrecht Von Graefes Arch Klin Exp Ophthalmol. 2014;252(5):715–21. https://doi.org/10.1007/s00417-013-2525-y.
72. Ueda-Arakawa N, Ooto S, Tsujikawa A, Yamashiro K, Oishi A, Yoshimura N. Sensitivity and specificity of detecting reticular pseudodrusen in multimodal imaging in Japanese patients. Retina Phila Pa. 2013;33(3):490–7. https://doi.org/10.1097/IAE.0b013e318276e0ae.
73. van Grinsven MJJP, Buitendijk GHS, Brussee C, et al. Automatic identification of reticular pseudodrusen using multimodal retinal image analysis. Invest Ophthalmol Vis Sci. 2015;56(1):633–9. https://doi.org/10.1167/iovs.14-15019.
74. Keenan TDL, Chen Q, Peng Y, et al. Deep learning automated detection of reticular pseudodrusen from fundus autofluorescence images or color fundus photographs in AREDS2. Ophthalmology. Published online May 21, 2020. https://doi.org/10.1016/j.ophtha.2020.05.036.
75. Garrity ST, Sarraf D, Freund KB, Sadda SR. Multimodal imaging of nonneovascular age-related macular degeneration. Invest Ophthalmol Vis Sci. 2018;59(4):AMD48–64. https://doi.org/10.1167/iovs.18-24158.
76. Holz FG, Sadda SR, Staurenghi G, et al. Imaging protocols in clinical studies in advanced age-related macular degeneration: recommendations from classification of atrophy consensus meetings. Ophthalmology. 2017;124(4):464–78. https://doi.org/10.1016/j.ophtha.2016.12.002.
77. Beede E, Baylor E, Hersch F, Iurchenko A, Wilcox L, Ruamviboonsuk P, Vardoulakis LM. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI '20). 2020. p. 1–12. https://doi.org/10.1145/3313831.3376718.
9 AI and Glaucoma

Zhiqi Chen, Gadi Wollstein, Joel S. Schuman, and Hiroshi Ishikawa

Z. Chen
Department of Electrical and Computer Engineering, New York University, Brooklyn, NY, USA
Department of Ophthalmology, NYU Langone Health, New York, NY, USA
e-mail: zc1337@nyu.edu

G. Wollstein · H. Ishikawa (*)
Department of Ophthalmology, NYU Langone Health, New York, NY, USA
Department of Biomedical Engineering, New York University, Brooklyn, NY, USA
e-mail: Gadi.Wollstein@nyulangone.org; Hiroshi.Ishikawa@nyulangone.org

J. S. Schuman
Department of Electrical and Computer Engineering, New York University, Brooklyn, NY, USA
Department of Ophthalmology, NYU Langone Health, New York, NY, USA
Department of Biomedical Engineering, New York University, Brooklyn, NY, USA
e-mail: Joel.Schuman@nyulangone.org

Glaucoma is characterized by progressive loss of retinal ganglion cells (RGC) and their axons, which may result in optic nerve head (ONH) and retinal nerve fiber layer (RNFL) changes and eventually lead to vision loss and irreversible blindness [1–3]. Since glaucoma is a slowly progressing disease with irreversible neural damage, early diagnosis and sensitive progression monitoring are essential to glaucoma management. For clinical assessment, structural (e.g. fundus photography (Fig. 9.1), optical coherence tomography (OCT, Fig. 9.2)) and functional (e.g. visual field (VF, Fig. 9.3)) measurements are commonly assessed in addition to the conventional observations (e.g. optic disc assessment, intraocular pressure (IOP)). Various longitudinal studies on glaucoma progression have reported contradictory non-linear relationships between structural and functional measurements [4–10]. There are complex, non-linear, asynchronous interactions between them, which have not been fully understood yet.

Recently, artificial intelligence (AI) has started to make an impact in ophthalmology [11–15]. Deep learning (DL) is a class of state-of-the-art machine learning (ML) algorithms that are especially tailored to extract meaningful features from complex and high-dimensional data. Consequently, AI algorithms, especially DL, have the potential to revolutionize the diagnosis and management of glaucoma based on the interpretation of functional and/or structural information, and even to improve the understanding of glaucoma by defining the structural features responsible for certain functional damage and by identifying phenotypes that follow similar progression patterns. Table 9.1 summarizes current DL applications in glaucoma.

In this chapter, we provide an overview of current AI applications and challenges in glaucoma. Section "Glaucoma Diagnosis" introduces AI utilization in detecting glaucoma; section "Longitudinal Analysis" focuses on the role of AI in longitudinal projection; section "Structural-Functional Correlation" summarizes developments of AI in finding the structural-functional relationship; finally, section "Other AI Applications in Glaucoma" presents some other applications of AI in glaucoma.

Fig. 9.1 Fundus photography of a left eye with glaucoma. Large cupping and peripapillary atrophy are shown in the image

Fig. 9.2 Examples of Cirrus OCT report from a healthy case (a) and a glaucomatous case (b). The image is color-coded (red, orange, and yellow represent thicker areas while green and blue represent thinner areas)

Glaucoma Diagnosis

Diagnosis of glaucoma can be modeled as a classification problem which typically has one or multiple features (clinical parameters or images) as input and a single diagnostic variable as output (e.g., presence or severity of glaucoma). It is one of the first areas in which AI has been extensively explored.

In 1994, ML classifiers were first used to discriminate normal and glaucomatous eyes based on visual fields [1]. Subsequent studies explored the classification problem with more ML methods and data modalities and demonstrated the effectiveness of ML models. The community initially focused on taking clinical parameters as input and using classical ML classifiers, such as random forest (RF) and support vector machine (SVM), with manually designed features for the task, which are problem-dependent and require domain knowledge.
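As a toy version of that classical pipeline, the scikit-learn sketch below trains an SVM and a random forest on a few hand-picked clinical parameters; the feature choices and values are invented for illustration and are far simpler than the engineered features used in the studies discussed in this chapter.

```python
# Sketch: classical ML classifiers over hand-designed clinical parameters.
# Each row is invented: [mean RNFL thickness (um), cup-to-disc ratio, IOP (mmHg)].
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X = [[95, 0.3, 14], [100, 0.4, 16], [70, 0.8, 24],
     [65, 0.9, 22], [90, 0.5, 15], [72, 0.7, 26]]
y = [0, 0, 1, 1, 0, 1]  # 0 = healthy, 1 = glaucoma

svm = SVC(probability=True).fit(X, y)
rf = RandomForestClassifier(n_estimators=100).fit(X, y)

new_eye = [[78, 0.65, 21]]
print("SVM P(glaucoma):", svm.predict_proba(new_eye)[0][1])
print("RF  P(glaucoma):", rf.predict_proba(new_eye)[0][1])
```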
Fig. 9.3 Examples of Humphrey 24-2 VF report from a healthy eye (a) and a glaucomatous eye (b). Figure (b) shows advanced visual field damage with a superior nasal deficit, a large inferior nasal step and superior temporal step, and a paracentral scotoma

Table 9.1 Summary of DL applications in glaucoma

Application | Subtasks | Data type | Models | References
Glaucoma diagnosis | Segmentation, object detection, classification | Fundus photography, OCT, VF, demographic features, IOP, et al. | FNN, CNN, ResNet, Faster RCNN, U-Net | [15–33]
Longitudinal analysis | VF forecasting, structural loss forecasting | VF, RNFL thickness, GCIPL thickness | RNN, CNN, LSTM | [34–36]
Structural-functional relationship discovery | Mapping between structural features and functional measurements (VF thresholds, VFI, VF MD, et al.) | OCT, retinal thickness map | CNN, ResNet50 | [37–41]
Knowledge discovery | Age, gender, or race prediction from OCT images | OCT | CNN | [42]
Image enhancement | Speckle noise reduction on OCT images | OCT | GAN | [43]

Since 2013, the development of DL, especially convolutional neural networks (CNNs), has enabled automatic learning of discriminative representations of data that optimally solve the problem [44–47]. DL models utilize multiple processing layers to obtain scalability and learn hierarchical feature representations of data with multiple levels of abstraction which are suitable for classification. Therefore, DL models have been studied to improve the accuracy of automated glaucoma diagnosis, as summarized in Table 9.2.

Functional Defects as Input

In clinical practice, VF testing is widely used as the gold standard for disease diagnosis and glaucomatous damage assessment. Several classical ML classifiers (i.e., multilayer perceptron (MLP), SVM, linear (LDA) and quadratic discriminant analysis (QDA), Parzen window, and mixture of Gaussians (MOG)) have been proposed to automatically discriminate between normal eyes and eyes with pre-perimetric glaucoma based on visual fields, and have shown promising performance [48–50]. With the development of computational capacity, it has become possible to implement deeper models. Asaoka et al. [26] proposed a multi-layer feed-forward neural network (FNN) with a stacked denoising autoencoder to classify pre-perimetric glaucoma VFs and healthy VFs, and achieved better performance than shallower ML models.
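A minimal feed-forward VF classifier of the kind described above can be sketched as follows (PyTorch). We assume 52 input values, corresponding to the non-blind-spot test points of a 24-2 field, and omit the denoising-autoencoder pre-training used by Asaoka et al.; the weights and data here are random placeholders.

```python
# Sketch: feed-forward network over 24-2 VF threshold values.
# Assumes 52 test points (blind-spot points excluded); pre-training omitted.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(52, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 2),               # healthy vs. pre-perimetric glaucoma
)

vf = torch.rand(8, 52) * 35.0       # a batch of fake threshold values (dB)
probs = torch.softmax(model(vf), dim=1)
print(probs.shape)                  # torch.Size([8, 2])
```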
Table 9.2 Summary of recent DL work on glaucoma diagnosis

Input type | Reference | Models | Subtasks | Input data | Output classes
Functional | [26] | FNN | Classification | Individual VF thresholds | Glaucoma/non-glaucoma
Functional | [28] | CNN | Classification | VF map | Early glaucoma/non-glaucoma
Functional | [29] | CNN | Classification | Probability map of VF PD | Glaucoma/non-glaucoma
Structural | [21] | CNN | Classification | Color fundus image | Glaucoma/non-glaucoma
Structural | [15] | CNN | Classification | Color fundus image | Glaucoma/non-glaucoma
Structural | [22] | Hierarchical ResNet, U-Net | Classification | Color fundus image | Glaucoma/non-glaucoma
Structural | [23] | Inception-v3 | Classification | Color fundus image | Glaucoma/non-glaucoma
Structural | [24] | CNN | Classification | Color fundus image | Glaucoma/non-glaucoma
Structural | [30] | CNN, random forest | Feature extraction, classification | RNFL thickness map, GCIPL thickness map, RNFL probability map, GCIPL probability map, and en face projection image | Glaucoma/non-glaucoma
Structural | [31] | Multi-Context Deep Network | Classification | 2D anterior segment OCT image | Open-angle/angle-closure glaucoma
Structural | [32] | 3D CNN | Classification | 3D OCT volume | Glaucoma/non-glaucoma
Structural | [33] | CNN | Classification | RNFL probability map | Glaucoma/non-glaucoma
Mixed | [25] | CNN, Faster-RCNN, FCN | Feature extraction, OD region detection, OC segmentation, classification | Color fundus image, age, IOP, eyesight, and symptoms | Glaucoma/non-glaucoma

Previous work showed promising performance in the classification of VFs. Yet these methods treated each VF point as an individual feature and failed to leverage spatial information within VFs. Spatial information is useful for discovering VF defect patterns and therefore helps glaucoma diagnosis [27]. Thus, incorporating spatial information into ML classifiers may boost their discrimination ability. The CNN is an evolution of the FNN that replaces matrix multiplication with convolution to process spatial information, so researchers started to implement CNN models to discriminate VFs.

Kucur et al. [28] converted VFs to images using a Voronoi parcellation [51]. A seven-layer CNN, which explicitly took spatial information into account through spatial convolutions, was used to discriminate between healthy and early glaucomatous VFs with those converted images as input. Results demonstrated the superiority of the CNN over an NN that did not consider spatial information (average precision score: 0.874 ± 0.095 vs. 0.843 ± 0.089). By computing the gradient, saliency maps can be obtained to visualize the important pixels that contribute most to the outputs of the CNN. The saliency maps suggested that CNNs were capable of detecting patterns of localized VF defects.
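The gradient-based saliency map mentioned above can be sketched as follows; this is a generic recipe rather than the published code, and the model and input shapes are assumptions.

```python
# Saliency: gradient of the top-class score with respect to input pixels.
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """image: (1, C, H, W) tensor; returns an (H, W) map of gradient magnitudes."""
    model.eval()
    image = image.clone().requires_grad_(True)
    top_score = model(image).max(dim=1).values.sum()      # score of predicted class
    top_score.backward()
    return image.grad.abs().max(dim=1).values.squeeze(0)  # max over color channels
```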

Li et al. [29] took the probability map of pattern deviation (PD) from the VF reports as the input to a CNN. Results showed that the CNN achieved higher accuracy than ophthalmologists, a rule-based method (Advanced Glaucoma Intervention Study (AGIS) criteria and Glaucoma Staging System (GSS) criteria), and traditional machine learning algorithms (SVM, RF, k-nearest neighbor).

Structural Damages as Input

Assessment of structural damage has become a practical standard for glaucoma diagnosis. Early studies focused on structural measurements obtained from imaging techniques such as confocal scanning laser ophthalmoscopy (CSLO) and scanning laser polarimetry (SLP) [47, 52–58]. Promising performance of ML classifiers on structural parameters, such as optic disc parameters measured by CSLO and RNFL measurements from SLP, was reported. However, owing to the popularity of the technologies, recent AI-based studies on structural glaucomatous damage have focused on fundus photography and OCT.

Fundus Photography

Fundus photography is a well-established and cost-effective imaging technique to identify features of the fundus region including the fovea, macula, optic disc (OD), and optic cup (OC). Glaucoma can be identified by optic nerve cupping. Thus, the cup-to-disc ratio (CDR), which measures the vertical diameter of the optic cup relative to that of the disc, is one of the most important biomarkers for glaucoma diagnosis; a higher CDR value indicates a higher probability of glaucoma. Therefore, many AI-based studies focused on automatic segmentation of the OD and OC using deep learning [16–21]. Segmentation-based methods, however, lack sufficiently discriminative representations and are easily affected by noise and low image quality. Moreover, predefined clinical parameters lack the complex morphological information that might be useful in diagnosis. Therefore, more recent methods learn discriminative representations that optimize classification results directly from fundus images.
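Once OD and OC masks are available, the vertical CDR described above follows directly. The sketch below is an illustrative computation, not taken from any of the cited systems, and assumes non-empty binary masks.

```python
# Illustrative vertical cup-to-disc ratio from binary optic disc / cup masks.
import numpy as np

def vertical_cdr(disc_mask: np.ndarray, cup_mask: np.ndarray) -> float:
    """Masks are 2D boolean arrays; returns vertical cup/disc diameter ratio."""
    disc_rows = np.where(disc_mask.any(axis=1))[0]   # rows containing disc pixels
    cup_rows = np.where(cup_mask.any(axis=1))[0]
    disc_height = disc_rows.max() - disc_rows.min() + 1
    cup_height = cup_rows.max() - cup_rows.min() + 1 if cup_rows.size else 0
    return cup_height / disc_height
```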
In 2015, Chen et al. [15] proposed a six-layer CNN to classify glaucomatous and non-glaucomatous eyes directly from fundus images from the publicly available ORIGA [59] and SCES [60] datasets. Experimental results showed AUCs of 0.831 and 0.887 on ORIGA and SCES, respectively. In a later work, Chen et al. [61] designed a novel CNN that embedded a multilayer perceptron to discriminate glaucomatous and non-glaucomatous patterns in fundus images.

In 2018, Fu et al. [22] proposed a Disc-aware Ensemble Network (DENet), which consisted of four streams to integrate the hierarchical context of global fundus images and local OD regions. The first stream used a Residual Network (ResNet) [62] to learn the global representation on the whole fundus image directly and produce a glaucoma classification probability. The second stream adapted the U-shape CNN (U-Net) [63], an efficient DL model for medical image segmentation, to produce the disc probability map and a glaucoma classification probability. The third stream took the cropped OD region image as input and output a classification probability through a ResNet. The fourth stream applied a pixel-wise polar transformation to transfer the cropped original image to the polar coordinate system in order to enlarge the cup region and augment the data; a ResNet was then trained to output a classification probability. The model was trained on the ORIGA dataset and yielded testing accuracies of 0.832 on SCES and 0.666 on SINDI.
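The pixel-wise polar transformation used by DENet's fourth stream can be sketched with plain NumPy as below; the output size and nearest-neighbor sampling are illustrative assumptions, not the published parameters.

```python
# Sketch of a polar remapping of a disc-centered fundus crop.
import numpy as np

def to_polar(image: np.ndarray, n_radii: int = 224, n_angles: int = 224) -> np.ndarray:
    """image: (H, W, C) crop centered on the optic disc -> (n_radii, n_angles, C)."""
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radii = np.linspace(0, min(cx, cy), n_radii)
    angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    ys = np.clip(np.round(cy + rr * np.sin(aa)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + rr * np.cos(aa)).astype(int), 0, w - 1)
    return image[ys, xs]  # rows = radius, columns = angle; cup region is enlarged
```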
Later, Li et al. [23] applied Inception-v3 [64] to a private dataset to detect referable glaucomatous optic neuropathy (GON) and achieved an AUC of 0.986 with a sensitivity of 0.956 and a specificity of 0.92. Results also showed that other eye conditions can greatly affect detection accuracy: high or pathological myopia contributed most to false-negative results, while physiologic cupping and pathological myopia were the most common reasons for false-positive results.

Though the previous methods demonstrated the efficiency of DL in glaucoma diagnosis, DL methods suffer from overfitting due to the relatively small datasets available and the large number of parameters that need training. In 2018, Chakravarty et al. [24] presented a multi-task CNN that segmented the OD and OC on fundus images and jointly classified the image as glaucomatous or non-glaucomatous. The proposed method was evaluated on the REFUGE dataset and achieved an average Dice score (which measures the overlap between segmentations and ground truths) of 0.92 for OD segmentation and 0.84 for OC segmentation, and an AUC of 0.95 for classification. The cross-task design reduced the number of parameters and ensured good generalization of the model on a small dataset.
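For reference, the Dice score mentioned above reduces to a few lines for binary masks; this is a generic sketch, with the edge case of two empty masks handled by convention.

```python
# Dice score: twice the overlap divided by the total mask areas.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0
```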
In another work, Chai et al. [25] designed a multi-branch neural network (MB-NN) model to leverage domain knowledge, including important measures (e.g., CDR), for glaucoma diagnosis. The first branch extracted hidden features directly from the fundus image through a CNN. The second branch used Faster-RCNN [65], a deep learning framework for object detection, to obtain the optic disc region; another CNN was then used to extract local hidden features. The third branch used a fully convolutional network (FCN) [66] to segment the OD, OC, and peripapillary atrophy (PPA), and then calculated measures related to the disc, cup, and PPA. RNFL defects (roughly wedge-shaped regions starting from the OD) detected by another CNN, together with non-image features (e.g., age, IOP, eyesight, and symptoms) from case reports, were also inputs to the third branch. The proposed framework was verified on a private dataset and achieved an accuracy of 0.915, a sensitivity of 0.9233, and a specificity of 0.909.

OCT

OCT, a non-invasive imaging technique that provides micrometer-resolution cross-sectional and volumetric images of the retina, has emerged as the de facto standard for objective quantification of structural damage in glaucoma. Similarly, early studies focused on comparisons of various classical ML classifiers using parameters measured by OCT [67–69]. Though classical ML classifiers classified glaucoma with satisfying accuracy, their limitation is the reliance on segmentation of the retinal layers, which uses handcrafted features and is prone to errors. Therefore, deeper and segmentation-free methods were desired to avoid this problem.
In 2017, Muhammad et al. [30] used a pre-trained CNN model for feature extraction and a random forest model for classification. Though the proposed model is deep, the images fed into the model were still generated by conventional segmentation methods: (1) the retinal ganglion cell plus inner plexiform layer (GCIPL) thickness map; (2) the RNFL thickness map; (3) the GCIPL probability map; (4) the RNFL probability map; and (5) the en face projection. The results showed that the proposed method with the RNFL probability map as input outperformed standard OCT and VF clinical metrics but fell short of an experienced human expert.

In 2018, Fu et al. [31] presented a Multi-Context Deep Network (MCDN) to classify angle-closure and open-angle glaucoma based on anterior segment optical coherence tomography (AS-OCT). The anterior chamber angle (ACA) region was first localized by a data-driven AS-OCT structure segmentation method [22] to compute the clinical parameters (e.g., anterior chamber width, lens vault, chamber height, iris curvature, and anterior chamber area). A linear SVM was employed to predict an angle-closure probability based on these clinical parameters. The localized ACA region and the original scan were then fed into two parallel CNNs to jointly learn local and global discriminative representations, respectively, and output an angle-closure probability. Finally, the probabilities from the clinical parameters and the CNNs were averaged to produce the final result. Experimental results showed that the proposed method is effective for angle-closure glaucoma screening. Detailed analysis of the three input streams showed that the DL-based global discriminative features did not work as well as the handcrafted visual features (AUC 0.894 vs. 0.924), while the DL-based local discriminative features achieved on-par performance with the handcrafted features (AUC 0.920 vs. 0.924).

In 2019, Maetschke et al. [32] proposed a 3D CNN trained directly on raw spectral-domain optical coherence tomography (SD-OCT) volumes of the ONH to classify healthy and glaucomatous eyes.

Class activation map (CAM) analysis found that the neuroretinal rim, optic disc cupping, and the lamina cribrosa and its surrounding area were significantly associated with the classification results, which aligned with commonly used clinical markers for glaucoma diagnosis such as neuroretinal thinning at the superior and inferior segments and increased cup volume.

In the same year, RNFL probability maps, which are generated based on swept-source optical coherence tomography (SS-OCT) to superimpose structural changes on VF locations, were also used to train CNNs to discriminate between glaucomatous and healthy eyes [33]. CAM analysis suggested that anatomical variation in blood vessel or RNFL location caused ambiguity in false positives and false negatives. This discovery might be useful for future improvement of DL systems by supplying information about blood vessels.
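For a network that ends in global average pooling followed by a linear classifier, the CAM used in these analyses can be sketched as below; this is a generic recipe with illustrative shapes, not the authors' code.

```python
# Class activation map from the last convolutional features and classifier weights.
import torch

def class_activation_map(features: torch.Tensor, fc_weight: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    """features: (C, H, W) last conv features; fc_weight: (n_classes, C)."""
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)  # normalized heatmap in [0, 1]
```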
Combining Structure and Function

Many studies have also developed effective ML classifiers combining structural and functional data. In an early study [70], global VF indices (mean defect, corrected loss variance, and short-term fluctuation) in combination with structural data (CDR, rim area, cup volume, and nerve fiber layer height) analyzed by an ANN correctly identified glaucomatous eyes with an accuracy of 88%. This figure was higher than that of the same ANN trained with only structural or functional data. The development of computational ability accommodated larger models and larger inputs. Bowd et al. [71] took complete VF maps and OCT RNFL thickness measurements of 32 sectors to train multiple ML classifiers. In a later study, Silva et al. [24] tested several classifiers, including bagging (BAG), naïve Bayes (NB), MLP, radial basis function (RBF), RF, ensemble selection (ENS), classification tree (CTREE), AdaBoost M1 (ADA), and SVM, using 17 RNFL thickness parameters (average thickness, 4 quadrant, and 12 clock-hour measurements) together with mean deviation (MD), pattern standard deviation (PSD), and the glaucoma hemifield test (GHT). RF achieved the best AUC of 0.946.

Generally speaking, DL models are capable of learning discriminative representations and identifying glaucoma patients. However, comparing these methods remains challenging because of the variety of training and testing datasets and validation methods. The possibility of extracting knowledge that might not have been discovered before, such as unknown glaucoma-related structures/features that are highly associated with glaucomatous damage, and glaucoma phenotyping, is the most exciting part of DL. Therefore, increasing the interpretability of DL to visualize learned knowledge will be critical to the future development of DL in glaucoma diagnosis.

Longitudinal Analysis

Accelerated retinal ganglion cell loss, together with functional damage, is a characteristic feature of glaucoma progression. Therefore, identifying progression and estimating the rate of loss, either structurally or functionally, are crucial to glaucoma management.

The current clinical gold standard for progression analysis is the Guided Progression Analysis (GPA) provided by the commercial software developed by Carl Zeiss [72, 73]. The software allows clinicians to evaluate the patient's functional or structural loss over time compared to his or her own baseline, which is a composite of two initial examinations. Event-based and trend-based analyses are two approaches to determine whether progression exists. Event-based analysis evaluates changes from baseline compared to the expected variability, which is determined by the 95% confidence intervals of the magnitude of fluctuation of stable glaucoma patients from empirical datasets; progression is defined as a change that exceeds this expected variability. Trend-based analysis estimates the rate of change over time using linear regression. While GPA is useful to define and quantify glaucoma progression, it does not forecast future progression, which could augment clinical decision making.
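Trend-based analysis is, at its core, a slope estimate. A toy example with hypothetical visit times and MD values is shown below.

```python
# Illustrative trend-based analysis: ordinary least squares of a summary
# measure (e.g., VF mean deviation in dB) against follow-up time.
import numpy as np

years = np.array([0.0, 0.5, 1.1, 1.6, 2.2, 2.8])        # hypothetical visit times
md_db = np.array([-2.1, -2.4, -2.3, -2.9, -3.2, -3.6])  # hypothetical MD values
slope, intercept = np.polyfit(years, md_db, deg=1)
print(f"estimated rate of change: {slope:.2f} dB/year")  # roughly -0.5 dB/year here
```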

For VF forecasting, Caprioli et al. [74] projected individual VFs through an exponential model, which characterized fast or slow progression rates in VF loss better than linear models. However, both the linear and exponential models assume a constant rate of VF loss, which usually decays over time [75]. To better depict glaucomatous damage, Chen et al. [76] compared pointwise linear, exponential, and logistic functions, as well as combinations of functions, and showed that a combination of exponential and logistic functions predicted future progression better. These methods treated test points as individual points and did not incorporate spatial correlations between VF test points at a given time point. Several statistical methods have since been proposed to incorporate spatio-temporal correlations in VFs [77–80].
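The contrast between a constant-rate and a decaying-rate model can be sketched with a simple curve fit; the data and initial guesses below are illustrative assumptions.

```python
# Linear vs. exponential fits to a pointwise VF sensitivity series.
import numpy as np
from scipy.optimize import curve_fit

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])        # years of follow-up
s = np.array([30.0, 27.5, 24.0, 20.0, 15.5, 11.0])  # hypothetical dB sensitivities

lin = np.polyfit(t, s, 1)                            # assumes a constant loss rate
(exp_a, exp_b), _ = curve_fit(lambda t, a, b: a * np.exp(b * t), t, s,
                              p0=(30.0, -0.1))       # loss rate changes over time
print(np.polyval(lin, 6.0), exp_a * np.exp(exp_b * 6.0))  # 6-year forecasts
```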
The application of DL in this field of predictive medicine is particularly interesting for the management of glaucoma, since many factors contributing to the rate or severity of glaucoma progression remain unknown. But unlike the more definitive task of glaucoma diagnosis, there has been limited investigation into the potential of DL for predicting future findings. Park et al. [34] developed a recurrent neural network (RNN) to predict the sixth visual field test. The performance of the RNN was compared with that of a pointwise linear regression. Results showed that VFs predicted by the RNN were more accurate than those predicted by linear regression (root mean square error (RMSE): 4.31 ± 2.54 dB vs. 4.96 ± 2.76 dB, p < 0.001) and that the RNN was more robust (smaller and more slowly increasing RMSE as the false negative rate increases). However, the proposed method required a large number of VF tests over a long period of time, and many years of VF testing would be needed to accurately predict future VFs. To overcome this problem, Wen et al. [81] trained a deep learning model on the temporal history of a large group of patients to accurately predict future VFs up to 5.5 years ahead given only a single VF test, with a correlation of 0.92 between the MD of predicted VFs and the MD of actual future VFs.

For structural progression forecasting, Song et al. [82] proposed a 2D continuous-time hidden Markov model to predict average circumpapillary RNFL thickness and VFI. Sedai et al. [35] developed an ML regressor to forecast circumpapillary RNFL thickness at the next visit from multimodal data including clinical (age and IOP), structural (circumpapillary RNFL thickness derived from OCT scans and DL-extracted OCT features), and functional (VF parameters) data of three prior visits and the inter-visit intervals. Chen et al. [36] also investigated predictive DL for structural loss: a time-aware long short-term memory network was designed to predict the fifth-visit GCIPL thickness map from the four prior maps, taking the uneven intervals between every two visits into account.
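As a schematic of such sequence forecasting (not the published time-aware architecture), a plain LSTM mapping prior visits to a next-visit estimate might look like the following; the feature count and hidden size are assumptions.

```python
# Minimal LSTM forecaster: measurements at prior visits -> next-visit estimate.
import torch
import torch.nn as nn

class VisitForecaster(nn.Module):
    def __init__(self, n_features: int = 54, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, visits: torch.Tensor) -> torch.Tensor:
        """visits: (batch, n_prior_visits, n_features) -> next-visit estimate."""
        out, _ = self.lstm(visits)
        return self.head(out[:, -1])        # decode from the last hidden state

pred = VisitForecaster()(torch.rand(2, 4, 54))  # 4 prior visits -> 5th-visit guess
```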

Structural-Functional Correlation

The relationship between structural loss and functional loss has been a controversial topic on which there is still no general consensus. Early work investigated classical ML models such as LR [83], a Bayesian framework with a radial basis function [84], Bayesian LR [85], and logarithmic regression [86] that map function from structure. However, model performance has been limited and highly dependent on assumptions of a linear relationship or a Gaussian distribution of the variability in VF measurements, which is not optimal given that this distribution is usually heavy-tailed. Given the success of DL in identifying and forecasting glaucoma, DL may help to improve the understanding of the structural-functional relationship in glaucoma. In addition, VF tests are subjective, time-consuming, and very noisy; thus, accurately estimating the VF from OCT may help to reduce unnecessary VF testing in eyes that are estimated to be stable.

In 2017, Uesaka et al. [37] proposed two methods to estimate full-resolution 10-2 mode VF maps from retinal thickness (RT) data including GCIPL thickness maps, RNFL thickness maps, and RCL thickness maps. The two methods were affine structured non-negative matrix factorization (ASNMF) and a CNN. Results showed that ASNMF worked better for small data sizes while the CNN was powerful for large data sizes. An average root mean squared error (RMSE) of 7.27 dB was achieved by ASNMF and 6.79 dB by the CNN.

Later, in 2018, Sugiura et al. [38] reduced the overfitting effect of CNNs with pattern-based regularization (PBR), which utilized characteristic patterns obtained from a large amount of non-paired VF-RT data. Characteristic VF patterns were extracted with an unsupervised learning method; the model was then regularized by adding a regularization term to the loss function, which penalizes the model if the estimation is far from the manifold formed by the extracted VF patterns. Moreover, the location-wise estimation at the last layer of the CNN was replaced by group-wise estimation to reduce network parameters: VF locations were first categorized into several groups depending on functional similarity, and an estimation model was then shared within each group. An RMSE of 6.16 dB was achieved by the model.
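The idea behind such pattern-based regularization can be sketched as below; the distance-to-patterns term here is a deliberate simplification of the published method, and all shapes and the weighting are assumptions.

```python
# Hedged sketch: regression loss plus a penalty for estimates far from a
# set of characteristic VF patterns (a simplification of PBR).
import torch

def regularized_loss(pred: torch.Tensor, target: torch.Tensor,
                     patterns: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """pred/target: (batch, n_points); patterns: (n_patterns, n_points)."""
    mse = torch.mean((pred - target) ** 2)
    dist_to_patterns = torch.cdist(pred, patterns).min(dim=1).values
    return mse + lam * dist_to_patterns.mean()
```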
In 2019, Christopher et al. [39] applied ResNet50 to detect eyes with glaucomatous visual field damage (GVFD) and to predict VF MD, PSD, and mean VF sectoral PD from the RNFL thickness map, RNFL en face image, and CSLO images. Model parameters were initialized by transfer learning: the model was pretrained on ImageNet, a large image recognition dataset, and fine-tuned on a private training dataset in order to reduce overfitting.

Previous work relied on segmentation-based features, which are prone to errors, especially with advanced glaucoma and other co-existing ocular pathologies. Segmentation-free DL methods have also been explored. In 2019, Maetschke et al. [40] inferred VFI and MD directly from OCT volumes of the ONH or the macula to eliminate the need for layer segmentation. The proposed 3D CNN was compared with several classical ML methods using segmentation-based OCT features and proved to outperform those ML methods. In 2020, Christopher et al. [41] used U-Net to predict full-resolution 24-2 and 10-2 mode VF maps from unsegmented SD-OCT circle scans. The R2 of the predicted results ranged from 0.07 to 0.71 for 24-2 mode and from 0.01 to 0.85 for 10-2 mode.

Other AI Applications in Glaucoma

One application of AI is to discover new knowledge in glaucoma. Mendoza et al. [42] developed a DL method to predict age, sex, and race based on Spectralis OCT RNFL circle scans from healthy individuals, glaucoma suspects, and glaucoma patients. For predicting age, an MAE (95% CI) of 4.5 years (3.9, 5.2) and a strong association (R2 (95% CI)) of 0.73 between predicted and actual age were achieved. The AUCs (95% CI) for predicting race and sex were 0.96 (0.86, 0.99) and 0.70 (0.57, 0.80), respectively. These results suggest that DL can learn demographic features including age, race, and sex that are not apparent to human observers, and imply that there is still undiscovered knowledge in retinal OCT scans.

Another application of AI is to enhance OCT scans. Halupka et al. [43] presented a CNN trained with either a mean squared error loss or a generative adversarial network (GAN) with Wasserstein distance and perceptual similarity to reduce speckle noise in OCT images from both healthy and glaucomatous eyes. The results demonstrated the effectiveness of CNNs in denoising OCT B-scans while preserving the structural features of the retinal layers. Such denoising methods could be extremely useful in the analysis pipeline and could ensure the reliability of subsequent disease assessment.
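A minimal sketch of the mean-squared-error variant of such a denoiser is shown below; the GAN variant with Wasserstein and perceptual losses is omitted, and the architecture, image size, and targets (e.g., averaged, registered scans) are assumptions.

```python
# Toy denoising CNN trained with mean squared error.
import torch
import torch.nn as nn

denoiser = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
noisy = torch.rand(4, 1, 128, 128)   # hypothetical speckled B-scans
clean = torch.rand(4, 1, 128, 128)   # hypothetical low-noise reference targets
loss = nn.MSELoss()(denoiser(noisy), clean)
loss.backward()                      # an optimizer step would follow
```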

Conclusion

In this chapter, we discussed the role of AI in glaucoma. Accurate automated diagnosis and prognosis of glaucoma may assist clinicians to increase efficiency, minimize diagnostic errors, and improve the overall quality of glaucoma treatment. With its ability to extract meaningful information from high-dimensional and complex multi-modal data, AI may help to discover new biomarkers, patterns, or knowledge to improve the current understanding of glaucoma, which could be useful for promoting research and development into new treatments.

There are still several challenges for clinical applications of AI in glaucoma. First, the datasets used in many studies are small and collected from homogeneous populations, while modern AI systems require very large training datasets and are often subject to numerous sources of variability; tremendous effort would be required to collect a large and general dataset for glaucoma research. Second, the definition of glaucoma is not clear: disagreements in the definition of the disease phenotypes often occur between experienced ophthalmologists, so it is hard to obtain high-quality ground-truth labels. Third, despite many efforts to increase the interpretability of AI models, they are still considered "black boxes", which limits their clinical adoption; thus, it is crucial to develop more visualization tools for AI algorithms. Despite these challenges ahead, AI will likely have a positive impact on research and clinical practice in glaucoma.

References

1. Tan O, Chopra V, Lu AT, et al. Detection of macular ganglion cell loss in glaucoma by Fourier-domain optical coherence tomography. Ophthalmology. 2009;116(12):2305–2314.e1–e2.
2. Quigley HA, Broman AT. The number of people with glaucoma worldwide in 2010 and 2020. Br J Ophthalmol. 2006;90(3):262–7.
3. Ramulu P. Glaucoma and disability: which tasks are affected, and at what stage of disease? Curr Opin Ophthalmol. 2009;20:92.
4. Hood DC, Tsamis E, Bommakanti NK, Joiner DB, Al-Aswad LA, Blumberg DM, et al. Structure-function agreement is better than commonly thought in eyes with early glaucoma. Invest Ophthalmol Vis Sci. 2019;60(13):4241–8.
5. Rao HL, Zangwill LM, Weinreb RN, Leite MT, Sample PA, Medeiros FA. Structure-function relationship in glaucoma using spectral-domain optical coherence tomography. Arch Ophthalmol. 2011;129(7):864–71.
6. Leite MT, Zangwill LM, Weinreb RN, Rao HL, Alencar LM, Medeiros FA. Structure-function relationships using the Cirrus spectral domain optical coherence tomograph and standard automated perimetry. J Glaucoma. 2012;21(1):49.
7. Wollstein G, Kagemann L, Bilonick RA, Ishikawa H, Folio LS, Gabriele ML, et al. Retinal nerve fibre layer and visual function loss in glaucoma: the tipping point. Br J Ophthalmol. 2012;96(1):47–52.
8. Malik R, Swanson WH, Garway-Heath DF. Structure–function relationship in glaucoma: past thinking and current concepts. Clin Exp Ophthalmol. 2012;40(4):369–80.
9. Harwerth RS, Wheat JL, Fredette MJ, Anderson DR. Linking structure and function in glaucoma. Prog Retin Eye Res. 2010;29(4):249–71.
10. Garg A, Hood DC, Pensec N, Liebmann JM, Blumberg DM. Macular damage, as determined by structure-function staging, is associated with worse vision-related quality of life in early glaucoma. Am J Ophthalmol. 2018;194:88–94.
11. Taylor P, Kalpathy-Cramer J. Machine learning has arrived! Aaron Lee, MD, MSCI-Seattle, Washington.
12. Rahimy E. Deep learning applications in ophthalmology. Curr Opin Ophthalmol. 2018;29(3):254–60.
13. Ting DSW, Pasquale LR, Peng L, Campbell JP, Lee AY, Raman R, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019;103(2):167–75.
14. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
15. Chen X, Xu Y, Wong DWK, Wong TY, Liu J. Glaucoma detection based on deep convolutional neural network. In: 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2015. p. 715–8.
16. Thakur N, Juneja M. Survey on segmentation and classification approaches of optic cup and optic disc for diagnosis of glaucoma. Biomed Signal Process Control. 2018;42:162–89.
17. Shankaranarayana M, Ram SM, Mitra K, Sivaprakasam K. Joint optic disc and cup segmentation using fully convolutional and adversarial networks. In: Fetal, infant and ophthalmic medical image analysis, vol. 10554. Cham: Springer; 2017. p. 168–76.
18. Zilly J, Buhmann JM, Mahapatra D. Glaucoma detection using entropy sampling and ensemble learning for automatic optic cup and disc segmentation. Comput Med Imaging Graphics. 2017;55:28–41.
19. Sevastopolsky A. Optic disc and cup segmentation methods for glaucoma detection with modification of U-Net convolutional neural network. Pattern Recognit Image Anal. 2017;27(3):618–24.
20. Fu H, Cheng J, Xu Y, Wong DWK, Liu J, Cao X. Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. IEEE Trans Med Imaging. 2018:1–9.
21. Al-Bander B, Zheng Y. Dense fully convolutional segmentation of the optic disc and cup in colour fundus for glaucoma diagnosis. Symmetry. 2018;10(4):87.
22. Fu H, Cheng J, Xu Y, Zhang C, Wong DWK, Liu J, Cao X. Disc-aware ensemble network for glaucoma screening from fundus image. IEEE Trans Med Imaging. 2018;37(11):2493–501.
23. Zhixi L, He Y, Keel S, Meng W, Chang R, He M. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology. 2018;125(8):1199–206.
24. Chakravarty A, Sivswamy J. A deep learning based joint segmentation and classification framework for glaucoma assessment in retinal color fundus images. arXiv preprint arXiv:1808.01355.
25. Chai Y, Liu H, Xu J. Glaucoma diagnosis based on both hidden features and domain knowledge through deep learning models. Knowl-Based Syst. 2018;161:147–56.
26. Asaoka R, Murata H, Iwase A, Araie M. Detecting preperimetric glaucoma with standard automated perimetry using a deep learning classifier. Ophthalmology. 2016;123(9):1974–80.
27. Sample PA, Chan K, Boden C, Lee TW, Blumenthal EZ, Weinreb RN, et al. Using unsupervised learning with variational bayesian mixture of factor analysis to identify patterns of glaucomatous visual field defects. Invest Ophthalmol Vis Sci. 2004;45(8):2596–605.
28. Kucur ŞS, Holló G, Sznitman R. A deep learning approach to automatic detection of early glaucoma from visual fields. PLoS One. 2018;13(11):e0206081.
29. Li F, Wang Z, Qu G, Song D, Yuan Y, Xu Y, et al. Automatic differentiation of Glaucoma visual field from non-glaucoma visual filed using deep convolutional neural network. BMC Med Imaging. 2018;18(1):35.
30. Muhammad H, Fuchs T, De Cuir N, De Moraes C, Blumberg D, Liebmann J, Ritch R, Hood D. Hybrid deep learning on single wide-field optical coherence tomography scans accurately classifies glaucoma suspects. J Glaucoma. 2017;26(12):1086–94.
31. Fu H, Xu Y, Lin S, Wong D, Mani B, Mahesh M, Aung T, Liu J. Multi-context deep network for angle-closure glaucoma screening in anterior segment OCT. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2018. p. 356–63.
32. Maetschke S, Antony B, Ishikawa H, Wollstein G, Schuman J, Garnavi R. A feature agnostic approach for glaucoma detection in OCT volumes. PLoS One. 2019;14(7):e0219126.
33. Thakoor KA, Li X, Tsamis E, Sajda P, Hood DC. Enhancing the accuracy of glaucoma detection from OCT probability maps using convolutional neural networks. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2019. p. 2036–40.
34. Park K, Kim J, Lee J. Visual field prediction using recurrent neural network. Sci Rep. 2019;9(1):1–12.
35. Sedai S, Antony B, Ishikawa H, Wollstein G, Schuman JS, Garnavi R. Forecasting retinal nerve fiber layer thickness from multimodal temporal data incorporating OCT volumes. Ophthalmol Glaucoma. 2020;3(1):14–24.
36. Chen Z, Wang Y, Wollstein G, de los Angeles Ramos-Cadena M, Schuman J, Ishikawa H. Macular GCIPL thickness map prediction via time-aware convolutional LSTM. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE; 2020. p. 1–5.
37. Uesaka T, Morino K, Sugiura H, Kiwaki T, Murata H, Asaoka R, Yamanishi K. Multi-view learning over retinal thickness and visual sensitivity on glaucomatous eyes. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. p. 2041–50.
38. Sugiura H, Kiwaki T, Yousefi S, Murata H, Asaoka R, Yamanishi K. Estimating glaucomatous visual sensitivity from retinal thickness with pattern-based regularization and visualization. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018. p. 783–92.
39. Christopher M, Bowd C, Belghith A, Goldbaum MH, Weinreb RN, Fazio MA, et al. Deep learning approaches predict glaucomatous visual field damage from OCT optic nerve head en face images and retinal nerve fiber layer thickness maps. Ophthalmology. 2020;127(3):346–56.
40. Maetschke S, Antony B, Ishikawa H, Wollstein G, Schuman J, Garnavi R. Inference of visual field test performance from OCT volumes using deep learning. arXiv preprint arXiv:1908.01428. 2019.
41. Christopher M, Proudfoot JA, Bowd C, Belghith A, Goldbaum MH, Rezapour J, et al. Deep learning models based on unsegmented OCT RNFL circle scans provide accurate detection of glaucoma and high resolution prediction of visual field damage. Invest Ophthalmol Vis Sci. 2020;61(7):1439.
42. Mendoza L, Christopher M, Belghith A, Bowd C, Rezapour J, Fazio MA, et al. Deep learning models predict age, sex and race from OCT optic nerve head circle scans. Invest Ophthalmol Vis Sci. 2020;61(7):2012.
43. Halupka KJ, Antony BJ, Lee MH, Lucy KA, Rai RS, Ishikawa H, et al. Retinal optical coherence tomography image enhancement via deep learning. Biomed Optics Express. 2018;9(12):6205–21.
44. Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European Conference on Computer Vision. Cham: Springer; 2014. p. 818–33.
45. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. 2013.
46. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2921–9.
47. Bowd C, Chan K, Zangwill LM, et al. Comparing neural networks and linear discriminant functions for glaucoma detection using confocal scanning laser ophthalmoscopy of the optic disc. Invest Ophthalmol Vis Sci. 2002;43:3444–54.
48. Goldbaum MH, Sample PA, White H, Colt B, Raphaelian P, Fechtner RD, Weinreb RN. Interpretation of automated perimetry for glaucoma by neural network. Invest Ophthalmol Vis Sci. 1994;35(9):3362–73.
49. Chan K, Lee TW, Sample PA, Goldbaum MH, Weinreb RN, Sejnowski TJ. Comparison of machine learning and traditional classifiers in glaucoma diagnosis. IEEE Trans Biomed Eng. 2002;49(9):963–74.
50. Goldbaum MH, Sample PA, Chan K, Williams J, Lee TW, Blumenthal E, et al. Comparing machine learning classifiers for diagnosing glaucoma from standard automated perimetry. Invest Ophthalmol Vis Sci. 2002;43(1):162–9.
51. Aurenhammer F. Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR). 1991;23(3):345–405.
52. Townsend KA, Wollstein G, Danks D, et al. Heidelberg retina tomograph 3 machine learning classifiers for glaucoma detection. Br J Ophthalmol. 2008;92:814–8. https://doi.org/10.1136/bjo.2007.133074.
53. Zangwill LM, Chan K, Bowd C, et al. Heidelberg retina tomograph measurements of the optic disc and parapapillary retina for detecting glaucoma analyzed by machine learning classifiers. Invest Ophthalmol Vis Sci. 2004;45:3144–51. https://doi.org/10.1167/iovs.04-0202.
54. Uchida H, Brigatti L, Caprioli J. Detection of structural damage from glaucoma with confocal laser image analysis. Invest Ophthalmol Vis Sci. 1996;37:2393–401.
55. Adler W, Peters A, Lausen B. Comparison of classifiers applied to confocal scanning laser ophthalmoscopy data. Methods Inf Med. 2008;47:38–46. https://doi.org/10.3414/ME0348.
56. Bowd C, Zangwill LM, Medeiros FA, et al. Confocal scanning laser ophthalmoscopy classifiers and stereophotograph evaluation for prediction of visual field abnormalities in glaucoma-suspect eyes. Invest Ophthalmol Vis Sci. 2004;45:2255–62.
57. Weinreb RN, Zangwill L, Berry CC, et al. Detection of glaucoma with scanning laser polarimetry. Arch Ophthalmol. 1998;116:1583–9. https://doi.org/10.1001/archopht.116.12.1583.
58. Bowd C, Medeiros FA, Zhang Z, et al. Relevance vector machine and support vector machine classifier analysis of scanning laser polarimetry retinal nerve fiber layer measurements. Invest Ophthalmol Vis Sci. 2005;46:1322–9. https://doi.org/10.1167/iovs.04-1122.
59. Zhang Z, Yin FS, Liu J, Wong WK, Tan NM, Lee BH, et al. Origa-light: an online retinal fundus image database for glaucoma analysis and research. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology. IEEE; 2010. p. 3065–8.
60. Sng CC, Foo LL, Cheng CY, Allen JC, He M, Krishnaswamy G, Nongpiur ME, Friedman DS, Wong TY, Aung T. Determinants of anterior chamber depth: the Singapore Chinese Eye Study. Ophthalmology. 2012;119(6):1143–50.
61. Chen X, Xu Y, Yan S, Wong DWK, Wong TY, Liu J. Automatic feature learning for glaucoma detection based on deep learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer; 2015. p. 669–77.
62. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 770–8.
63. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer; 2015. p. 234–41.
64. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 2818–26.
65. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. 2015. p. 91–9.
66. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. p. 3431–40.
67. Bizios D, Heijl A, Hougaard JL, Bengtsson B. Machine learning classifiers for glaucoma diagnosis based on classification of retinal nerve fibre layer thickness parameters measured by Stratus OCT. Acta Ophthalmol. 2010;88(1):44–52.
68. Barella KA, Costa VP, Gonçalves Vidotti V, Silva FR, Dias M, Gomi ES. Glaucoma diagnostic accuracy of machine learning classifiers using retinal nerve fiber layer and optic nerve data from SD-OCT. J Ophthalmol. 2013.
69. Kotsiantis SB, Zaharakis I, Pintelas P. Supervised machine learning: a review of classification techniques. In: Maglogiannis I, et al., editors. Emerging Artificial Intelligence Applications in Computer Engineering. IOS Press; 2007. p. 3–24.
70. Brigatti L, Hoffman D, Caprioli J. Neural networks to identify glaucoma with structural and functional measurements. Am J Ophthalmol. 1996;121:511–21.
71. Bowd C, Hao J, Tavares IM, et al. Bayesian machine learning classifiers for combining structural and functional measurements to classify healthy and glaucomatous eyes. Invest Ophthalmol Vis Sci. 2008;49:945–53.
72. Leung CKS, Cheung CYL, Weinreb RN, Qiu K, Liu S, Li H, et al. Evaluation of retinal nerve fiber layer progression in glaucoma: a study on optical coherence tomography guided progression analysis. Invest Ophthalmol Vis Sci. 2010;51(1):217–22.
73. Na JH, Sung KR, Baek S, Lee JY, Kim S. Progression of retinal nerve fiber layer thinning in glaucoma assessed by cirrus optical coherence tomography-guided progression analysis. Curr Eye Res. 2013;38(3):386–95.
74. Caprioli J, Mock D, Bitrian E, Afifi AA, Yu F, Nouri-Mahdavi K, Coleman AL. A method to measure and predict rates of regional visual field decay in glaucoma. Invest Ophthalmol Vis Sci. 2011;52(7):4765–73.
75. Otarola F, Chen A, Morales E, Yu F, Afifi A, Caprioli J. Course of glaucomatous visual field loss across the entire perimetric range. JAMA Ophthalmology. 2016;134(5):496–502.
76. Chen A, Nouri-Mahdavi K, Otarola FJ, Yu F, Afifi AA, Caprioli J. Models of glaucomatous visual field loss. Invest Ophthalmol Vis Sci. 2014;55(12):7881–7.
77. Warren JL, Mwanza JC, Tanna AP, Budenz DL. A statistical model to analyze clinician expert consensus on glaucoma progression using spatially correlated visual field data. Transl Vis Sci Technol. 2016;5(4):14.
78. Betz-Stablein BD, Morgan WH, House PH, Hazelton ML. Spatial modeling of visual field data for assessing glaucoma progression. Invest Ophthalmol Vis Sci. 2013;54(2):1544–53.
79. Anderson AJ. Comparison of three parametric models for glaucomatous visual field progression rate distributions. Transl Vis Sci Technol. 2015;4(4):2.
80. VanBuren J, Oleson JJ, Zamba GK, Wall M. Integrating independent spatio-temporal replications to assess population trends in disease spread. Stat Med. 2016;35(28):5210–21.
81. Wen JC, Lee CS, Keane PA, Xiao S, Rokem AS, Chen PP, et al. Forecasting future Humphrey visual fields using deep learning. PLoS One. 2019;14(4):e0214875.
82. Song Y, Ishikawa H, Wu M, Liu YY, Lucy KA, Lavinsky F, et al. Clinical prediction performance of glaucoma progression using a 2-dimensional continuous-time hidden markov model with structural and functional measurements. Ophthalmology. 2018;125(9):1354–61.
83. Hood DC, Kardon RH. A framework for comparing structural and functional measures of glaucomatous damage. Prog Retin Eye Res. 2007;26(6):688–710.
84. Zhu H, Crabb DP, Schlottmann PG, Lemij HG, Reus NJ, Healey PR, et al. Predicting visual function from the measurements of retinal nerve fiber layer structure. Invest Ophthalmol Vis Sci. 2010;51(11):5657–66.
85. Russell RA, Malik R, Chauhan BC, Crabb DP, Garway-Heath DF. Improved estimates of visual field progression using Bayesian linear regression to integrate structural information in patients with ocular hypertension. Invest Ophthalmol Vis Sci. 2012;53(6):2760–9.
86. Pollet-Villard F, Chiquet C, Romanet JP, Noel C, Aptel F. Structure-function relationships with spectral-domain optical coherence tomography retinal nerve fiber layer and optic nerve head measurements. Invest Ophthalmol Vis Sci. 2014;55(5):2953–62.
10 Artificial Intelligence in Retinopathy of Prematurity

Brittni A. Scruggs, J. Peter Campbell, and Michael F. Chiang

Introduction

Retinopathy of prematurity (ROP) is a leading cause of preventable childhood blindness worldwide. Extremely preterm infants are at risk of developing ROP given their low gestational age and low birth weight [1, 2]. There are a number of challenges for ROP screening and diagnosis using current technology. ROP screening requires either bedside ophthalmoscopic screening or telemedicine using remote interpretation of digital fundus imaging, and there are several potential challenges to ensuring every at-risk baby is diagnosed accurately and on time. Further, ROP diagnosis is sub-classified by zone, stage, and vascular changes, with each area demonstrating significant intra- and inter-expert subjectivity and disagreement.

Automated image analysis and deep learning systems for ROP have the potential to improve ROP care by improving the efficiency and accuracy of diagnosis and by facilitating quantitative disease monitoring and risk prediction [3]. This chapter focuses on the limitations of current methods for ROP diagnosis and highlights the recent major advances and the clinical and technical challenges of artificial intelligence (AI) for automated diagnosis of ROP in the real world.

Risk Factors and Prevalence

The Multicenter Study of Early Treatment for Retinopathy of Prematurity (ET-ROP) study found that 68% of infants born <1251 g developed mild ROP or worse [4]. The Cryotherapy for ROP (CRYO-ROP) study found that every additional gestational week led to a 19% decrease in developing severe ROP requiring treatment (i.e., threshold ROP), whereas each 100 g increase in birth weight was associated with a 27% decrease in threshold ROP [5]. ROP disease and severity are increased, including in more mature babies, in countries that lack oxygen saturation monitors and have little to no advanced training of neonatal staff and/or ophthalmologists [6].


Worldwide Impact

Blindness from ROP is often preventable with appropriate primary, secondary, and tertiary prevention [7]. However, approximately 16% of surviving infants born under 32 gestational weeks develop some degree of ROP [8]. A large population study of 184,700 preterm babies with ROP found that 20,000 babies became blind or severely visually impaired by ROP in 2010 [8]. The Global Burden of Disease study estimated that 257,000 years lived with disability worldwide in 2010 were associated with visual impairment secondary to ROP [9]. The greatest burden of disease currently is in Southeast Asia, Latin America, and North Africa [8]. The worldwide incidence will likely increase due to improved infant survival rates at increasingly younger gestational ages.

Pathophysiology

ROP is a vasoproliferative disease that occurs due to incomplete vascularization and non-physiologic hyperoxia [10]. Fetal retinal vascular development transitions from a process of vasculogenesis (i.e., in situ differentiation and growth of blood vessels) to angiogenesis (i.e., remodeling and expansion of new blood vessels) after 20 weeks gestational age [10, 11]. During the angiogenesis phase, physiologic hypoxia with an oxygen saturation of 60–70% is maintained in utero; however, oxygen saturation increases to >88% after birth, with many preterm babies requiring high levels of supplemental oxygen for prevention of serious sequelae, including death [10].

High and variable oxygen levels in the neonatal intensive care unit (NICU) and the lack of maternal protection decrease intraocular oxygen-related growth factors, such as insulin-like growth factor 1 (IGF-1) and vascular endothelial growth factor (VEGF), leading to vaso-obliteration (Phase I ROP) [10]. As the infant ages (usually >32 gestational weeks), the VEGF and IGF-1 drive increases proportionally to the amount of avascular retina present and the degree of hypoxia (Phase II ROP) [10]. This proliferative phase can lead to retinal neovascularization, vitreoretinal traction, retinal detachments, and blindness. Screening of preterm babies prior to phase II ROP (approximately 31 gestational weeks) is imperative for achieving good outcomes, as ROP often gets worse prior to improving; the average age at involution of ROP has been reported as 38.6 weeks [12]. Screening guidelines and current limitations are discussed later in this chapter.

Screening

Timely and accurate ROP diagnosis is essential to prevent blindness from ROP. ROP screening guidelines have been jointly published and updated by the American Academy of Pediatrics, the American Association for Pediatric Ophthalmology and Strabismus, and the American Academy of Ophthalmology [2, 13]. All infants <1500 g birth weight or <30 weeks gestational age should be screened at 31 weeks gestational age or 4 weeks chronological age, whichever is later. Selected infants with higher birth weights and/or gestational age >30 weeks may benefit from ROP screening; such infants include, but are not limited to, those with increased oxygen requirement and/or severe co-morbidities (e.g., lung disease, sepsis, necrotizing enterocolitis, profound anemia).

Limitations in Screening

Only approximately 5–10% of babies within a screening population will develop sight-threatening ROP. Unfortunately, there are often barriers to ensuring consistent screening of at-risk babies, especially in low- and middle-income countries (LMIC), including lack of equipment, inadequate training, personnel shortages, and inconsistent examinations between clinicians [3, 14]. The population at risk in these regions is significantly higher due to differences in the level of oxygen regulation.

There are also wide disparities in the distribution of ophthalmologists between rural vs. urban settings and between countries. Even when screening is performed in a timely fashion by trained ophthalmologists, the use of indirect ophthalmoscopy remains highly subjective and not readily interpretable; additional limitations of ROP diagnosis using ophthalmoscopy will be discussed later in this chapter. The use of AI for quantitative ROP diagnosis may enable monitoring of disease severity between NICUs within a geographic region and over time.

Telemedicine

Telemedicine allows a single expert to screen babies over a large region where screening ophthalmologists may be limited. Such programs also provide objective data (photos) for clinicians to detect change, a task that is difficult when humans rely solely on drawings or chart records. There are multiple examples of telemedicine programs worldwide that are successfully providing efficient screening methods for at-risk babies [15–19]. In the 2014 Stanford University Network for Diagnosis of Retinopathy of Prematurity (SUNDROP) trial, a wide-field lens was employed using the RetCam digital imaging system to obtain five fundus views per eye for 1755 infants over 6995 examinations [20]. The e-ROP Cooperative Group has published numerous studies on telemedicine approaches to evaluating acute-phase ROP [21, 22]. In one e-ROP study, the diagnostic accuracy of reader grading was determined across 5350 image set pairs, and the grading sensitivity and specificity for detecting referral-warranted ROP were 90.0% and 87.0%, respectively [21]. In a different telemedicine program, plus disease was detected in 95% of eyes [23]. Large databases of digital fundus images from these ROP telemedicine programs serve as the first step in the development of AI for automated image-based ROP diagnosis [24]. However, unique telemedicine challenges exist, including the wide range of image quality, fundus pigmentation, prevalence of disease, and ROP phenotype between geographic regions.

Classification

ROP is classified based on the location (zone), extent, and severity of disease (stage and vascular changes) according to the International Classification of Retinopathy of Prematurity (ICROP) guidelines developed in 1984 and refined in 2005 [25–27]. Zone I subtends 30 degrees centered around the optic disc; examiners often determine the radius of this circle as twice the distance from the fovea to the optic disc. Zone II extends from zone I to the temporal equator and to a tangential point on the nasal ora serrata. The remaining crescent, zone III, extends from zone II anterior to the ora serrata (Fig. 10.1).

Incomplete vascularization (i.e., immature vessels) in zone I or II without other pathology is considered Stage 0 ROP. Stage 1 ROP refers to the presence of a white demarcation line without dimension separating the vascularized retina from the anterior avascular region. Progression leads to stage 2 ROP, where the demarcation line thickens into a ridge, and ultimately to stage 3 ROP, with extraretinal fibrovascular proliferation along the ridge. Extrafoveal retinal detachments (stage 4A) and sub-total foveal-involving retinal detachments (stage 4B) occur due to the development of neovascularization of the retina with associated tractional membranes [25]. Extraretinal proliferation of nonvascular tissue and further traction can lead to stage 5 ROP, or total retinal detachment. The overall stage is determined by the individual clock hour with the worst stage. Figure 10.1 includes examples of ROP disease in different zones and at various stages.

Vascular abnormalities generally increase with more posterior disease and with higher stage and extent of peripheral disease. Plus disease is defined as venous dilation and arteriolar tortuosity in two or more quadrants within the posterior pole and greater than a standard published photograph [28]. An intermediate level, pre-plus disease, was introduced in the 2005 ICROP classification, reflecting the fact that vascular changes present on a continuum.

Fig. 10.1 ROP classification by zone, stage, and vessel changes. The top row shows mild ROP disease in zone I (upper panel) and in zone II (bottom panel). The montage photo shows the location of zones I and II. The asterisk indicates the location of the fovea. The bottom row depicts images labeled as normal vessels, pre-plus, or plus disease based on multiple (>3) expert consensus from ophthalmoscopy and image grading of vessel tortuosity and dilation. The far-right column depicts representative stages 1 through 4A. Black arrowheads highlight the faint demarcation line in an eye with stage 1 ROP. Note the temporal ridge in stage 2, the neovascularization present in stage 3, and the localized temporal retinal detachment in stage 4A. Stage 4B (sub-total retinal detachment with macula involvement) and stage 5 (total retinal detachment) are not shown

In rare cases, infants may develop aggressive posterior (AP)-ROP: progressive, posterior ROP with marked plus disease out of proportion to the peripheral retinal pathology; these eyes often manifest flat neovascularization that can be difficult to appreciate.

Treatment

'Threshold ROP' is a term for ROP requiring treatment, defined by the CRYO-ROP study as five or more contiguous or eight total clock-hours of stage 3 ROP in zone I or II in the presence of plus disease [12]. The Early Treatment for ROP (ET-ROP) trial further classified ROP into type 1 and type 2 pre-threshold disease to guide the treatment of infants with early laser before the development of threshold ROP [4]. Type 1 ROP remains the currently accepted treatment cutoff for ROP and is defined as (1) zone I stage 3 without plus disease; (2) any stage in zone I with plus disease; or (3) zone II stage 2 or 3 with plus disease. Despite these well-defined guidelines, some experts are overly aggressive in their treatment plans whereas others are more conservative.

Peripheral laser ablation permanently destroys the peripheral retina that is driving VEGF production. Although a mainstay treatment, laser is associated with strabismus and high myopia [29], and incomplete treatment such as skip lesions can lead to ROP progression and retinal detachment despite treatment. The Bevacizumab Eliminates the Angiogenic Threat of Retinopathy of Prematurity (BEAT-ROP) and Ranibizumab versus laser therapy for the treatment of very low birthweight infants with retinopathy of prematurity (RAINBOW) trials demonstrated the utility of using VEGF inhibitors (intravitreal bevacizumab or ranibizumab, respectively) instead of laser in certain cases, such as zone I stage 3 ROP with plus disease [30, 31].

Despite encouraging results showing fewer unfavorable ocular outcomes than laser therapy, intravitreal therapy for ROP introduces new challenges, including the possibility of end-organ effects from systemic exposure to anti-VEGF medications and the need for increased monitoring, potentially for years, to assess for reactivation. It is the authors' hope that automated computer systems may soon help clinicians decide which treatment is warranted for an individual infant and assess the risk of post-treatment progression.

Limitations in ROP Diagnosis and Management

Assessment of ROP diagnosis and severity depends on the subjective evaluation of zone, stage, and plus disease, and it is well established that there is wide inter-observer variability for all three components despite good relative agreement on ROP disease severity [32, 33]. Real-world images with high inter-observer variability among ROP experts (stage 1 vs. 2; pre-plus vs. plus disease) are provided in Fig. 10.2. Identifying plus disease remains the most critical finding for diagnosing threshold disease. However, agreement on plus disease between experts is imperfect due to systematic biases and differences in diagnostic thresholds along a continuum [32]. Gelman et al. found that when diagnosing plus disease, 22 experts had sensitivities and specificities ranging from 0.31 to 1.00 and 0.57 to 1.00, respectively, with agreement in only 21% of the images.

Ghergherehchi et al. proposed that the variability in plus disease diagnosis is partly due to attention to undefined vascular features [28].
that there is wide inter-observer variability for all attention to undefined vascular features [28]. For
Vessels
Stage

Fig. 10.2  Real world images with high inter-observer pre-plus or plus disease with no consensus by ROP
variability among ROP experts. The top row shows two experts. Similarly, the bottom row shows two telemedi-
fundus photos depicting arteriolar tortuosity and venous cine photos that some ROP experts documented stage 1
dilation in the posterior pole. These are photos from a tele- ROP, whereas others documented stage 2 ROP
medicine screening program that were graded as either

Fig. 10.3 Effect of field of view and peripheral vessel appearance on diagnostic interpretation. (a)–(c) Three montage images documenting vascular changes that were more evident in the periphery than in the posterior pole. Despite expert consensus that all three eyes had pre-plus, not plus, disease, the peripheral vessel tortuosity and dilation correlated with significant peripheral pathology that required ROP treatment with laser in all cases

example, vessel tortuosity can be quite striking the development of AI technology in ROP and
peripherally despite normal appearing vessels have contributed to the high medicolegal risk of
posteriorly (Fig. 10.3). This is important because ROP screening. Such limitations provided moti-
the standard photographs for ROP diagnosis vation for the development of numerous
appear narrower than the field of view (FOV) computer-­based systems for ROP, including the
obtained with bedside examination and/or tele- i-ROP deep learning (DL) system, which may be
medicine photos. Wide FOV allow different a way of standardizing disease severity and will
examiners to focus on different parts of the retina be discussed later in this chapter.
than originally described in ICROP, and inter-­
expert agreement is higher in plus disease diag-
nosis using wide-angle images [34]. Kim et  al. Early AI Systems for ROP Diagnosis
found lower accuracy when clinicians diagnosed
plus disease one quadrant at a time, suggesting The first computer-based systems for ROP
that clinicians subconsciously evaluate the whole diagnosis utilized manual tracings of dilation
eye even when they intend to carefully evaluate and tortuosity to produce an objective metric
plus by quadrant [35]. Figure 10.3 demonstrates of severity [38]. Such semi-automated ROP
the effect of FOV on severity appearance. diagnostic systems include ROPToolTM [39].
Most examiners do not routinely perform pho- Retinal Image multiScale Analysis (RISA)
tography at the time of examination, and limited [40], Computer Assisted Image Analysis of
objective data may contribute to the significant the Retina (CAIAR) [41], among others; these
inter-expert variability across different regions systems were reviewed by Wittenberg et  al. in
[28, 36, 37]. Increased use of photography offers 2012 [38]. As feature-­extraction-­based systems,
serial comparisons for monitoring ROP disease they all utilized manual or semi-automated sys-
and for improving ROP training. The lack of tems to quantify dilation and/or tortuosity for
objective diagnosis of ROP and the high rates of correlation with clinical diagnosis of ROP.  In
inter-observer variability have been hindrances to contrast to newer machine learning (ML) and
10  Artificial Intelligence in Retinopathy of Prematurity 133

DL systems, there was no automated image Convolutional neural networks (CNN) incor-
analysis performed by the computer; instead, porate image classification algorithms that differ
feature combinations and diagnostic cut-points from traditional feature extraction and ML sys-
were determined manually with clinicians label- tems. Using a large database of input images, the
ing or selecting findings within the images. CNN uses learnable weights and biases and gives
Comparisons of expert performance to the RISA importance to image features (e.g., tortuosity of
system demonstrated high diagnostic accuracy arterioles, dilation of venules) that best correlate
for plus disease using the computer-based anal- the input image with the diagnosis. The CNN
ysis [40, 42, 43]. However, these systems can- learns these features with or without pre-­
not process large numbers of images and do not processing but without explicit human input [46,
correlate well enough with ROP diagnosis to be 48, 49]. The CNN’s fully connected ‘output’
widely utilized [44]. layer classifies the image (e.g., absence or pres-
ence of plus disease) with improved performance
than feature-extraction-based ML approaches.
 utomated Detection of Plus
A Worrall et  al. reported the first fully automated
Disease plus disease diagnosis using a CNN; this study
used a real-world dataset that included input
Machine learning utilizes a classifier, such as a image discrepancies across experts [49]. This
support vector machine (SVM), that learns the system’s image recognition classifier performed
best relationship between image features and the as well as some of the human experts (92% accu-
diagnosis [45]. One approach to have more racy) [49].
explainable AI is to combine DL and ML meth- Brown et  al. reported the results of a fully
ods with traditional feature extraction, and sev- automated DL-based system for automated three-­
eral groups have attempted this for plus disease level diagnosis of plus disease [48]. This deep
[46, 47]. Mao et al. trained a DL network to seg- CNN, called the i-ROP DL system, was trained
ment retinal vessels and the optic disc and to and validated on more than 5000 images with a
diagnosis plus disease based on automated quan- single reference standard diagnosis (RSD) based
titative characterization of pathological features, on the consensus diagnosis of three independent
such as vessel tortuosity, width, fractal dimen- image graders and the clinical diagnosis. The
sion, and density [46]. area under the curve (AUC) for plus disease diag-
In 2015, a ML model with a trained SVM was nosis was excellent (0.98). On an independent
developed to determine the combination of fea- dataset of 100 images (i.e., not included in the
tures and FOV that best correlated with expert training set), the i-ROP DL system had higher
plus disease diagnosis [45]. This automated sys- diagnostic agreement with the RSD than seven
tem diagnosed plus disease as well as experts out of eight of the experts. For diagnosis of plus
when incorporating vascular tortuosity from both disease, the sensitivity and specificity of the algo-
arteries and veins with the widest FOV [45]. The rithm were 93% and 94%, respectively. These
accuracy was significantly lower using a FOV values increased to 100% and 94%, respectively,
comparable to that of the standard ICROP photo- when including pre-plus disease or worse [48].
graph; this suggested that experts consider the
vascular information from a large area of retina
when diagnosing plus disease. The montage Continuous Scoring for Plus Disease
images in Fig. 10.3 show examples of peripheral
vascular pathology that may influence diagnostic Vascular disease in ROP presents on a contin-
interpretation. Despite expert-level performance, uum, which likely explains why there is poor
this system was limited in clinical utility as it absolute agreement on plus disease classification
required manual tracing and segmentation of the between experts [32]. This finding motivated the
vessels as an input [45]. development of a quantitative severity scale using
134 B. A. Scruggs et al.

the i-ROP DL system. Redd et al. reported that a to classify zone or stage [55, 56]. For example,
scale from 1 to 9 could accurately detect type 1 DeepROP is a different automated ROP detection
ROP with an AUC of 0.95 [50]. Taylor et  al. system that was developed using deep neural net-
implemented the i-ROP DL algorithm to assign a works (DNNs) [57]. An identification DNN model
continuous ROP vascular severity score 1–9 and (Id-Net) and a grading DNN model (Gr-Net)
to classify images based on severity: no ROP, directly learned ROP features from big datasets,
mild ROP, type 2 ROP and pre-plus disease, or which were comprised of retinal photographs
type 1 ROP [51]. The continuous ROP vascular labeled by ROP experts. Both the identification
score was associated with the ICROP category of and the grading DNNs performed better than some
disease at a single point in time and the clinical of the human experts; impressively, the Id-Net
progression of ROP over time [51]. achieved a sensitivity of 96.62% (95%CI, 92.29–
Using the i-ROP dataset, Gupta et al. showed 98.89%) and a specificity of 99.32% (95%CI,
that these continuous scores reflected post-­ 99.98%) for ROP identification [57].
treatment regression in eyes with treatment Similarly, Hu et  al. developed a deep CNN
requiring-ROP [52]. Additionally, eyes requiring with a novel architecture to determine the pres-
multiple treatment sessions (laser or intravitreal ence and severity of ROP disease; a sub-network
injection of bevacizumab) had higher pre-­ designed to extract high-level features from
treatment ROP vascular severity scores compared images was connected to a second sub-network
with eyes requiring only a single treatment, sug- that predicted ROP severity (mild vs. severe)
gesting that treatment failure may be related to [58]. Using a feature aggregate operator, this sys-
more aggressive disease or disease treated at a tem was found to have a high classification accu-
later stage [52]. A recent study by Yildiz et al. and racy [58].
the iROP Consortium described iROP ASSIST, a Zhao et al. reported the development of a DL
fully automated system with CNN-like perfor- system that can automatically draw the border of
mance to diagnosis plus vs. not plus disease (0.94 zone 1 on a fundus image as a diagnostic aid [56].
AUC) [53]. Inspired by the algorithms of Ataer-­ Mulay et al. first reported the identification of a
Cansizoglu et al. [45, 54], this system uses hand- peripheral ROP ridge directly in a fundus image
crafted features with a combined neural network [55]. A CNN was trained by Coyner et al. in 2018
for automatic vessel segmentation, tracing, fea- to automatically assess the quality of retinal fun-
ture extraction, and classification; it is publicly dus images [59, 60]; this would serve well as a
available for generation of a vessel severity score prescreening method for telemedicine and
(0–100) from an input image [53]. computer-­based image analysis in ROP. Thus, DL
Improvement in the feature extraction process seems to hold promise for automated and objec-
will allow clinicians to achieve better perfor- tive diagnosis of ROP in digital fundus images.
mance levels without sacrificing explainability However, none of these systems are yet available
[53]. Ultimately, using similar automated quanti- for clinical use and further research is needed. A
tative severity scale for ROP diagnosis may help recent review by Scruggs et al. offers recommen-
optimize treatment regimens by better predicting dations for future AI research applied to ROP
the preterm infants at risk for treatment failure [61], including using optical coherence tomogra-
and disease recurrence [52]. Future clinical trials phy (OCT) and OCT-angiography (OCT-­A) to
may use a quantitative scale to help evaluate identify the structural signs (e.g., vitreoretinal
treatment thresholds. traction) preceding disease progression [62, 63].

 utomated Classification of ROP


A Challenges to AI Implementation
Stage and Zone
Ting et al. published on the clinical and technical
Most studies have focused on computer-based sys- challenges of DL applications in ophthalmology
tems to diagnose plus disease; however, several [64]. While AI holds great promise for improving
studies report using DL to grade ROP severity or care for ROP, the gap between scientific discovery
10  Artificial Intelligence in Retinopathy of Prematurity 135

Table 10.1  Main challenges of AI implementation for ROP diagnosis in clinical practice
Main challenges Potential Solutions
Generalizability • CNNs often do not generalize well to • Validation of AI system performance on the
unseen data target population prior to clinical use using
• Qualitatively different populations and images of varying quality and fields of view
phenotypes being studied, such as in •  Datasets tested in different populations
LMIC •  Open-access datasets and software
• Differences in the ways the images • Automated DL-enhanced algorithms integrated
were acquired into commonly used cameras (e.g., RetCam) or
• Technical differences between camera into cloud-based systems.
systems
• Resolution and quality of input images
or labels
Explainability • Inability to explain how the algorithm • Combination of deep learning methods with
arrived at a conclusion traditional feature extraction [46, 47, 53]
• “Black box” nature of clinical • Correlation of disease specific features with
diagnosis, in general [65] the CNN diagnostic outcome [47]
• Difficult to develop methodology for • Rigorous clinical validation demonstrating
understanding the high-level features improvement in outcomes despite lack of
that CNNs use for discrimination complete transparency
• Use of activation maps to highlight feature
areas on that image that contributed to
classification
Regulatory and • ROP care is the highest medicolegal • Precise indication for use and evidence of
medicolegal issues risk within ophthalmology effectiveness in a real world population
• Need to adjudicate liability from care • Innovation of evaluation methods by the Food
decisions informed by AI [66] and Drug Agency (FDA) to ensure safe
• Regulatory requirements will continue implementation
to evolve

and clinically useful implementation of technol- direct supervision during ophthalmology training
ogy remains wide. The main potential challenges [70]. Chan et al. demonstrated that there was sig-
hindering the deployment of DL systems include nificant variability in diagnostic accuracy among
ensuring generalizability, explainability, and retinal fellows when analyzing ROP images com-
overcoming regulatory and medicolegal issues pared to RSDs [71]. Both Chan et al. and Myung
[64]. Table 10.1 outlines these challenges as they et  al. demonstrated the inconsistent accuracy of
apply to AI for ROP diagnosis. detecting type 2 ROP and treatment-requiring
ROP by fellows [71, 72].
These studies raise serious concerns for ROP
AI for ROP Training screening performed by inexperienced examin-
ers, and there are no accepted criteria for mini-
If ROP experts often do not agree on how to diag- mum necessary supervision, exams, treatments,
nose ROP or on the diagnosis of individual etc. for clinical competency for ROP cares.
babies, it is not surprising that ROP trainees find Improved global education for ROP training is
the task of ROP diagnosis challenging as well. It necessary to ensure treatments are performed
is well established that ophthalmology graduates adequately. The development of AI systems for
complete residency, as well as ophthalmology automated diagnosis in ROP may facilitate the
fellowship programs, without confidence in their incorporation of these algorithms within medical
ability to diagnose ROP [67–69]. Fewer than a training to standardize ROP education and certi-
third of learners perform ROP screenings under fication [69].
136 B. A. Scruggs et al.

Conclusions randomized trial. Trans Am Ophthalmol Soc.


2004;102:233–48. discussion 248–250.
5. Schaffer DB, Palmer EA, Plotsky DF, et al. Prognostic
ROP is a leading cause of preventable child- factors in the natural course of retinopathy of pre-
hood blindness worldwide, yet the diagno- maturity. The Cryotherapy for Retinopathy of
sis remains both subjective and qualitative. Prematurity Cooperative Group. Ophthalmology.
1993;100(2):230–7.
Significant intra- and inter-expert variabil- 6. Chan-Ling T, Gole GA, Quinn GE, Adamson SJ,
ity limits the efficiency and accuracy of ROP Darlow BA.  Pathophysiology, screening and treat-
screening and diagnosis. AI-assisted screen- ment of ROP: a multi-disciplinary perspective. Prog
ing may lead to automated, quantifiable, and Retin Eye Res. 2018;62:77–119.
7. Norman M, Hellström A, Hallberg B, et al. Prevalence
objective diagnosis in ROP to improve accu- of severe visual disability among preterm children
racy while lessening the screening burden in with retinopathy of prematurity and association with
LMIC.  Providing objectivity to ROP educa- adherence to best practice guidelines. JAMA Netw
tion, AI may improve trainee performance on Open. 2019;2(1):e186801.
8. Blencowe H, Lawn JE, Vazquez T, Fielder A, Gilbert
ROP management. Already, AI has enabled the C.  Preterm-associated visual impairment and esti-
development of an ROP vascular severity score mates of retinopathy of prematurity at regional and
that correlates with ICROP disease classifica- global levels for 2010. Pediatr Res. 2013;74(Suppl
tion and shows promise for quantitative disease 1):35–49.
9. Vos T, Flaxman AD, Naghavi M, et  al. Years lived
monitoring, improved risk prediction, and post-­ with disability (YLDs) for 1160 sequelae of 289 dis-
treatment identification of treatment failure and eases and injuries 1990-2010: a systematic analysis
recurrence. Integrated into a telemedicine sys- for the Global Burden of Disease Study 2010. Lancet.
tem, AI could significantly benefit ROP clinical 2012;380(9859):2163–96.
10. Smith LE, Hard AL, Hellström A.  The biology

care and may also improve early identification of retinopathy of prematurity: how knowledge
of severe ROP prior to the development of reti- of pathogenesis guides treatment. Clin Perinatol.
nal detachment. 2013;40(2):201–14.
11. Patan S.  Vasculogenesis and angiogenesis. Cancer

Treat Res. 2004;117:3–32.
Acknowledgement  Grant Information: This chapter was 12.
Multicenter trial of cryotherapy for retinopa-
supported by grants R01EY19474, K12 EY027720, and thy of prematurity. One-year outcome–structure
P30EY10572 from the National Institutes of Health and function. Cryotherapy for Retinopathy of
(Bethesda, MD), by grants SCH-1622679, SCH-1622542, Prematurity Cooperative Group. Arch Ophthalmol.
& SCH-1622536 from the National Science Foundation 1990;108(10):1408–16.
(Arlington, VA), by The Heed Foundation, and by unre- 13. Fierson WM, Ophthalmology AAoPSo,
stricted departmental funding and a Career Development Ophthalmology AAo, Strabismus AAfPOa,
Award (JPC) from Research to Prevent Blindness (New Orthoptists AAoC. Screening examination of prema-
York, NY). ture infants for retinopathy of prematurity. Pediatrics.
2013;131(1):189–195.
14. Gilbert C.  Retinopathy of prematurity: a global per-
spective of the epidemics, population of babies at
References risk and implications for control. Early Hum Dev.
2008;84(2):77–82.
1. Flynn JT, Bancalari E, Bachynski BN, et  al. 15.
Fierson WM, Capone A, Ophthalmology
Retinopathy of prematurity. Diagnosis, severity, and AAoPSo, American Academy of Ophthalmology
natural history. Ophthalmology. 1987;94(6):620–9. AeAoCO. Telemedicine for evaluation of retinopathy
2. Fierson WM. American Academy of Pediatrics of prematurity. Pediatrics. 2015;135(1):e238–54.
Section on Ophthalmology; American Academy 16. Quinn GE, Ying GS, Daniel E, et  al. Validity of a
of Ophthalmology; American Association for telemedicine system for the evaluation of acute-­
Pediatric Ophthalmology and Strabismus; American phase retinopathy of prematurity. JAMA Ophthalmol.
Association of Certified Orthoptists. Pediatrics 2014;132(10):1178–84.
2018;142(6):e20183061. 17. Weaver DT, Murdock TJ.  Telemedicine detection of
3. Valikodath N, Cole E, Chiang MF, Campbell JP, Chan type 1 ROP in a distant neonatal intensive care unit. J
RVP. Imaging in retinopathy of prematurity. Asia Pac AAPOS. 2012;16(3):229–33.
J Ophthalmol (Phila). 2019;8(2):178–86. 18. Chiang MF, Melia M, Buffenn AN, et  al. Detection
4. Good WV, Group ETfRoPC. Final results of the Early of clinically significant retinopathy of prematu-
Treatment for Retinopathy of Prematurity (ETROP) rity using wide-angle digital retinal photography: a
10  Artificial Intelligence in Retinopathy of Prematurity 137

report by the American Academy of Ophthalmology. 32. Kalpathy-Cramer J, Campbell JP, Erdogmus D,

Ophthalmology. 2012;119(6):1272–80. et  al. Plus disease in retinopathy of prematurity:
19. Ells AL, Holmes JM, Astle WF, et  al. Telemedicine improving diagnosis by ranking disease severity and
approach to screening for severe retinopathy using quantitative image analysis. Ophthalmology.
of prematurity: a pilot study. Ophthalmology. 2016;123(11):2345–51.
2003;110(11):2113–7. 33. Campbell JP, Ataer-Cansizoglu E, Bolon-Canedo V,
20. Fijalkowski N, Zheng LL, Henderson MT, et  al.
et al. Expert diagnosis of plus disease in retinopathy
Stanford University Network for Diagnosis of of prematurity from computer-based image analysis.
Retinopathy of Prematurity (SUNDROP): five years JAMA Ophthalmol. 2016;134(6):651–7.
of screening with telemedicine. Ophthalmic Surg 34. Rao R, Jonsson NJ, Ventura C, et al. Plus disease in
Lasers Imaging Retina. 2014;45(2):106–13. retinopathy of prematurity: diagnostic impact of field
21. Quinn GE, Ells A, Capone A, et al. Analysis of dis- of view. Retina. 2012;32(6):1148–55.
crepancy between diagnostic clinical examination 35.
Kim SJ, Campbell JP, Kalpathy-Cramer J,
findings and corresponding evaluation of digital et  al. Accuracy and reliability of eye-based vs
images in the telemedicine approaches to evaluating ­quadrant-­based diagnosis of plus disease in reti-
acute-phase retinopathy of prematurity study. JAMA nopathy of prematurity. JAMA Ophthalmol.
Ophthalmol. 2016;134(11):1263–70. 2018;136(6):648–55.
22. Ying GS, Pan W, Quinn GE, Daniel E, Repka MX, 36. Reynolds JD, Dobson V, Quinn GE, et al. Evidence-­
Baumritter A.  Intereye agreement of retinopathy of based screening criteria for retinopathy of prema-
prematurity from image evaluation in the telemedicine turity: natural history data from the CRYO-ROP
approaches to evaluating of acute-phase ROP (e-ROP) and LIGHT-ROP studies. Arch Ophthalmol.
Study. Ophthalmol Retina. 2017;1(4):347–54. 2002;120(11):1470–6.
23. Schwartz SD, Harrison SA, Ferrone PJ, Trese
37. Fleck BW, Williams C, Juszczak E, et  al. An inter-
MT.  Telemedical evaluation and management national comparison of retinopathy of prematu-
of retinopathy of prematurity using a fiber- rity grading performance within the Benefits of
optic digital fundus camera. Ophthalmology. Oxygen Saturation Targeting II trials. Eye (Lond).
2000;107(1):25–8. 2018;32(1):74–80.
24. Chee RI, Darwish D, Fernandez-Vega A, et al. Retinal 38. Wittenberg LA, Jonsson NJ, Chan RV, Chiang

telemedicine. Curr Ophthalmol Rep. 2018;6(1):36–45. MF. Computer-based image analysis for plus disease
25.
International Committee for the Classification diagnosis in retinopathy of prematurity. J Pediatr
of Retinopathy of Prematurity. The International Ophthalmol Strabismus. 2012;49(1):11–9; quiz 10,
Classification of Retinopathy of Prematurity revisited. 20.
Arch Ophthalmol. 2005;123(7):991–9. 39. Wallace DK, Zhao Z, Freedman SF.  A pilot study
26. The International Committee for the Classification using “ROPtool” to quantify plus disease in retinopa-
of the Late Stages of Retinopathy of Prematurity. An thy of prematurity. J AAPOS. 2007;11(4):381–7.
international classification of retinopathy of prematu- 40. Gelman R, Jiang L, Du YE, Martinez-Perez ME,

rity. II. The classification of retinal detachment. Arch Flynn JT, Chiang MF. Plus disease in retinopathy of
Ophthalmol. 1987;105(7):906–12. prematurity: pilot study of computer-based and expert
27. The Committee for the Classification of Retinopathy diagnosis. J AAPOS. 2007;11(6):532–40.
of Prematurity. An international classification of 41. Shah DN, Wilson CM, Ying GS, et  al. Comparison
retinopathy of prematurity. Arch Ophthalmol. of expert graders to computer-assisted image analy-
1984;102(8):1130–4. sis of the retina in retinopathy of prematurity. Br J
28. Ghergherehchi L, Kim SJ, Campbell JP, Ostmo S, Ophthalmol. 2011;95(10):1442–5.
Chan RVP, Chiang MF. Plus disease in retinopathy of 42. Chiang MF, Gelman R, Jiang L, Martinez-Perez

prematurity: more than meets the ICROP? Asia Pac J ME, Du YE, Flynn JT. Plus disease in retinopathy of
Ophthalmol (Phila). 2018;7(3):152–5. prematurity: an analysis of diagnostic performance.
29. Geloneck MM, Chuang AZ, Clark WL, et  al.
Trans Am Ophthalmol Soc. 2007;105:73–84. discus-
Refractive outcomes following bevacizumab mono- sion 84-75.
therapy compared with conventional laser treatment: 43. Koreen S, Gelman R, Martinez-Perez ME, et  al.

a randomized clinical trial. JAMA Ophthalmol. Evaluation of a computer-based system for plus
2014;132(11):1327–33. disease diagnosis in retinopathy of prematurity.
30. Mintz-Hittner HA, Kennedy KA, Chuang AZ, Group Ophthalmology. 2007;114(12):e59–67.
B-RC.  Efficacy of intravitreal bevacizumab for 44. Wilson CM, Wong K, Ng J, Cocker KD, Ells AL,
stage 3+ retinopathy of prematurity. N Engl J Med. Fielder AR. Digital image analysis in retinopathy of
2011;364(7):603–15. prematurity: a comparison of vessel selection meth-
31. Stahl A, Lepore D, Fielder A, et  al. Ranibizumab ods. J AAPOS. 2012;16(3):223–8.
versus laser therapy for the treatment of very low 45. Ataer-Cansizoglu E, Bolon-Canedo V, Campbell JP,
birthweight infants with retinopathy of prematurity et al. Computer-based image analysis for plus disease
(RAINBOW): an open-label randomised controlled diagnosis in retinopathy of prematurity: performance
trial. Lancet. 2019;394(10208):1551–9. of the “i-ROP” system and image features associ-
138 B. A. Scruggs et al.

ated with expert diagnosis. Transl Vis Sci Technol. retinopathy of prematurity. AMIA Annu Symp Proc.
2015;4(6):5. 2018;2018:1224–32.
46. Mao J, Luo Y, Liu L, et al. Automated diagnosis and 60. Coyner AS, Swan R, Campbell JP, et  al. Automated
quantitative analysis of plus disease in retinopathy of fundus image quality assessment in retinopathy of
prematurity based on deep convolutional neural net- prematurity using deep convolutional neural net-
works. Acta Ophthalmol. 2019. works. Ophthalmol Retina. 2019;3(5):444–50.
47. Graziani M, Brown JM, Andrearczyk V, et  al.
61.
Scruggs BA, Chan RVP, Kalpathy-Cramer J,
Improved interpretability for computer-aided sever- Chiang MF, Campbell JP.  Artificial Intelligence in
ity assessment of retinopathy of prematurity. SPIE Retinopathy of Prematurity Diagnosis. Transl Vis Sci
Medical Imaging. San Diego, CA; 2019. Technol. 2020;9(2).
48. Brown JM, Campbell JP, Beers A, et  al. Automated 62. Campbell JP.  Why do we still rely on ophthalmos-
diagnosis of plus disease in retinopathy of prematu- copy to diagnose retinopathy of prematurity? JAMA
rity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):759–60.
Ophthalmol. 2018;136(7):803–10. 63. De Fauw J, Ledsam JR, Romera-Paredes B, et  al.
49. Worrall DE, Wilson CM, Brostow GJ.  Automated
Clinically applicable deep learning for diag-
retinopathy of prematurity case detection with con- nosis and referral in retinal disease. Nat Med.
volutional neural networks. Deep learning and data 2018;24(9):1342–50.
labeling for medical applications. Athens; 2016. 64. Ting DSW, Peng L, Varadarajan AV, et al. Deep learn-
50. Redd TK, Campbell JP, Brown JM, et al. Evaluation ing in ophthalmology: the technical and clinical con-
of a deep learning image assessment system for siderations. Prog Retin Eye Res. 2019.
detecting severe retinopathy of prematurity. Br J 65. Reid JE, Eaton E.  Artificial intelligence for pedi-

Ophthalmol. 2018. atric ophthalmology. Curr Opin Ophthalmol.
51. Taylor S, Brown JM, Gupta K, et al. Monitoring dis- 2019;30(5):337–46.
ease progression with a quantitative severity scale 66. Shah NH, Milstein A, Bagley SC.  Making machine
for retinopathy of prematurity using deep learning. learning models clinically useful. JAMA. 2019.
JAMA Ophthalmol. 2019. 67.
Patel SN, Martinez-Castellanos MA, Berrones-­
52. Gupta K, Campbell JP, Taylor S, et al. A quantitative Medina D, et al. Assessment of a tele-education sys-
severity scale for retinopathy of prematurity using tem to enhance retinopathy of prematurity training by
deep learning to monitor disease regression after international ophthalmologists-in-training in Mexico.
treatment. JAMA Ophthalmol. 2019. Ophthalmology. 2017;124(7):953–61.
53. Yildiz VM, Tian P, Yildiz I, et al. Plus disease in reti- 68. Campbell JP, Swan R, Jonas K, et al. Implementation
nopathy of prematurity: convolutional neural network and evaluation of a tele-education system for the diag-
performance using a combined neural network and nosis of ophthalmic disease by international trainees.
feature extraction approach. 2020;9(2). AMIA Annu Symp Proc. 2015;2015:366–75.
54. Ataer-Cansizoglu E, You S, Kalpathy-Cramer J,
69. Chan RV, Patel SN, Ryan MC, et  al. The Global
Keck K, Chiang MF, Erdogmus D.  OBSERVER Education Network for Retinopathy of Prematurity
AND FEATURE ANALYSIS ON DIAGNOSIS OF (Gen-Rop): development, implementation, and evalu-
RETINOPATHY OF PREMATURITY.  IEEE Int ation of a novel tele-education system (An American
Workshop Mach Learn Signal Process. 2012:1–6. Ophthalmological Society Thesis). Trans Am
55. Mulay S, Ram K, Sivaprakasam M, Vinekar A. Early Ophthalmol Soc. 2015;113:T2.
detection of retinopathy of prematurity stage using 70. Al-Khaled T, Mikhail M, Jonas KE, et al. Training of
deep learning approach. Paper presented at: SPIE residents and fellows in retinopathy of prematurity
Medical Imaging, 2019, San Diego, CA. around the world: an international web-based survey.
56. Zhao J, Lei B, Wu Z, et al. A deep learning framework J Pediatr Ophthalmol Strabismus. 2019;56(5):282–7.
for identifying zone I in RetCam images. Vol 7. IEEE 71. Paul Chan RV, Williams SL, Yonekawa Y, Weissgold
Access; 2019. p. 103530–7. DJ, Lee TC, Chiang MF.  Accuracy of retinopathy
57. Wang J, Ju R, Chen Y, et  al. Automated retinopathy of prematurity diagnosis by retinal fellows. Retina.
of prematurity screening using deep neural networks. 2010;30(6):958–65.
EBioMedicine. 2018;35:361–8. 72. Myung JS, Paul Chan RV, Espiritu MJ, et al. Accuracy
58. Hu J, Chen Y, Zhong J, Ju R, Yi Z. Automated analy- of retinopathy of prematurity image-based diagnosis
sis for retinopathy of prematurity by deep neural net- by pediatric ophthalmology fellows: implications for
works. IEEE Trans Med Imaging. 2019;38(1):269–79. training. J AAPOS. 2011;15(6):573–8.
59. Coyner AS, Swan R, Brown JM, et  al. Deep learn-
ing for image quality assessment of fundus images in
Artificial Intelligence in Diabetic
Retinopathy
11
Andrzej Grzybowski and Piotr Brona

Epidemiology of Diabetic Diabetic retinopathy is one of the major com-


Retinopathy plications of diabetes, estimated to be the leading
cause of blindness among working-age adults
Over the last four decades the number of people globally [3].
living with diabetes has more than quadrupled Prevalence of DR and of proliferative DR
from 108 million in 1980 to an estimated 422 (pDR) varies between type 1 and 2 diabetes and
million in 2014. At the same time diabetes preva- among the different regions of the world.
lence among adults has almost doubled to 8.5% Prevalence of DR among type 2 diabetics is
[1]. Future projections estimate that, by 2035, reported between 20 and 40% in most studies. In
592 million people will have diabetes, with the type 1 diabetes, in Europe and the USA, reported
largest rise in low- and middle-income regions prevalence vary widely between reports ranging
[2]. There is no doubt that diabetes constitutes a from 36.5 to 93.6% [3]. Of those with DR an
significant problem for global health and wellbe- approximate one third may have vision threaten-
ing. It is a disease that is prevalent all over the ing DR with either proliferative DR or diabetic
world, in the affluent, resource rich countries and macular edema (DME). Overall DR prevalence is
much poorer developing countries. Diabetes can higher among Western communities as compared
cause a number of significant complications, to Asian regions [3]. Singapore is a notable
each of them associated with significant morbid- exception to this, with a much higher prevalence
ity, requiring different, highly qualified medical of DR, but comparable to developed Western
personnel to diagnose and treat them. This poses countries.
a challenge for the local health services which It appears incidence of DR among diabetics is
often struggle with either delivering or funding also increasing in some regions. A study based in
the appropriate care. Spain found yearly incidence of DR to increase
by almost 1% over an 8-year lifespan, from
8.09% in 2007 to 8.99% in 2014, with incidence
A. Grzybowski (*) of DME also increasing [4]. The increasing
Department of Ophthalmology, University of Warmia
and Mazury, Olsztyn, Poland worldwide population, coupled with increasing
prevalence of diabetes and increased incidence of
Institute for Research in Ophthalmology, Foundation
for Ophthalmology Development, Poznan, Poland DR all lead to increasing number of patients with
ocular complications of diabetes.
P. Brona
Department of Ophthalmology, Poznan City Hospital, Adding to the global burden of pDR and
Poznan, Poland DME, these appear to be prognostic of other

© Springer Nature Switzerland AG 2021 139


A. Grzybowski (ed.), Artificial Intelligence in Ophthalmology,
https://doi.org/10.1007/978-3-030-78601-4_11
140 A. Grzybowski and P. Brona

d­ iabetes complications like nephropathy, periph- referrals with suspected proliferative disease and
eral neuropathy and cardiovascular events [3]. over 52,000 referrals for suspected maculopathy
or pre-proliferative diabetic retinopathy, with an
overall rate of DR of 2.8%.
 onventional Screening Initiatives
C The aforementioned screening programme
of DR: Telemedicine was expected to reduce the number of people
considered legally blind in England from 4200 to
There have been many DR screening initiatives less than 1000. It appears this goal has been
throughout the world with varying degrees of accomplished with a 2014 report showing DR is
coverage and longevity. Nevertheless, only a few no longer the leading cause of certifiable blind-
countries were able to successfully establish and ness in England and Wales for the first time in
continue DR screening on national level, most 50 years [5].
prominently—UK and Singapore. It appears
such programme is also functioning within
Denmark, however very little information regard- Wales
ing it is available in English.
The Diabetic Retinopathy Screening Service for
Wales (DRSSW), established in 2002, is a mobile
United Kingdom screening service. Similarly to the English pro-
gramme, two fundus images are taken per eye.
Each country within the UK established their Patients with sight-threatening DR are referred to
own national screening programme. The specific a hospital-based retinal service. 30 screening
protocols and grading methods vary, however all teams serve 220 locations within Wales, achiev-
are based on digital, colour fundus photography. ing patient over of about 80%.
The programmes cover all diabetics over the age
of 12 years old with vision of at least light per-
ception in one eye. Scotland

Scotland started its DR screening in 2006.


England Qualifying patients are identified automatically
using the Scottish Care Information-Diabetes
The NHS Diabetic Eye Screening Programme Collaboration database. Screening is based on a
(NDESP) is a continuation of an English screen- single macula-centred image per eye, with mydri-
ing programme set up in 2006. Patients are asis as required. Images are later sent to grading
screened on annual visits, with two fundus centres. Thanks to automatic patient selection,
images per eye—one macula- and one disc-­ patient coverage is above 99%.
centred. Images are taken after mydriasis. These
images are later digitally sent to one of central-
ised grading centres. The sheer scale of piloting Northern Ireland
and implementing such an initiative is impres-
sive, in years 2015–2016 the programme invited The Northern Ireland Diabetic Retinopathy
more than 2.5 million diabetics to attend screen- Screening Programme (NIDRSP) was estab-
ing with an uptake of 82.8% [5]. It also gives us lished in 2002. Its similar in functioning to the
an important insight into the epidemiology of DR welsh programme DRSSW. Patients are referred
in the local population. Between 2015 and 2016 to the programme by their GPs, with trained
screening resulted in just under 8000 urgent readers grading the photographs.
11  Artificial Intelligence in Diabetic Retinopathy 141

Ireland the world. Some of those are similarly long-­


standing projects that control their population
DR screening was first identified as a key goal in yearly, while most were discontinued or singular
2008, but only introduced in 2013 under the screening efforts. Even though so many screen-
name Diabetic RetinaScreen. The Irish pro- ing projects were attempted, only the few afore-
gramme screens diabetics 12 and older. Diabetic mentioned countries were able to implement
RetinaScreen supervises both annual fundus nationwide screening, further highlighting the
image-based screening and DR treatment and difficulty of such undertaking.
consists of both stationary and mobile
community-­based screening centers. The grading
follows the English system closely both in terms  utomated Screening for Diabetic
A
of the grading matrix and the quality assurance Retinopathy
protocols. According to the latest report, screen-
ing uptake is around 67% and rose considerably The idea of adopting a computer programme in
since the screening was first introduced [6]. assessing fundus images for DR is certainly not
new. First report, that we were able to find, of
such endeavour was published in 1996 by Gardner
Singapore and colleagues. Almost 25 years ago, the authors,
established a neural network trained on 147 dia-
Singapore began widespread DR screening almost betic and 32 normal fundus images and aimed to
three decades ago in 1991. At that time Polaroid train it to recognise particular features of an
non-mydriatic fundus photography was chosen, as image—vessels, haemorrhages and exudates. Due
images could be taken by trained staff, instead of to the many constraints, including computational
ophthalmologists. Images were reviewed by the capacity, each images was divided into small
local hospital-based ophthalmologist. At the time it squares 20 or 30 pixels wide and later assessed by
was the first and only nationwide DR screening pro- an ophthalmologist as containing either vessels,
gramme. The Singaporean screening initiative was exudates, haemorrhages and micro-aneurysms or
revived in an updated version reflecting the techno- normal retina without vessels [8].
logical advancements and possibilities and is now Another study done in 2000 describes a mixed
known as Singapore Integrated Diabetic Retinopathy technique where algorithms designed to enhance
Programme (SIDRP). Based on primary care clinics round features in an image were used to select for
equipped with retinal cameras and specialised read- micro-aneurysms in a fundus image. This was
ing centres employing trained graders the pro- later assessed by a neural network to determine
gramme aims for a result within 24 h of screening. the significance of the round feature extracted.
Cost effectiveness analysis has shown that this This resulted in a sensible detection rate for
telemedicine-based model generated $173 of cost images containing microaneurysms (81%) as
savings per patient compared to the previous compared to the opinion of a clinical research fel-
screening model where family physicians graded low [9].
the images themselves after special training [7]. Several studies explored the subject in the early
2000s, without the use of neural networks, relying
on various pre-established image-­ analysis tech-
Local Screening Initiatives niques, such as automated detection of anatomical
landmarks in fundus images (optic disc, blood ves-
Other than the established national screening sels, fovea etc.) coupled with specifically designed
programmes, there have been a large number of algorithms for detection of DR features. Among
smaller-scaled local screening initiatives all other those, first three reports of a system, later known
142 A. Grzybowski and P. Brona

as Retinalyze and discussed in further sections, Iowa Detection Program. Based on the publicly
were published showing relatively good sensitivi- available set of fundus images with/without DR—
ties of 71–93% and specificities of 72–86%, these the Messidor-2 dataset, the sensitivity improved
were based on small sample sizes reaching 137 from 94.4% to 96.8% and specificity from a con-
patients in the largest study [10, 11]. fidence interval of 55.7%–63% to 87% [12]. For
All of those studies were done in the pre-­ the Iowa Detection Program, deep learning fea-
digitisation era, meaning images, in the form of tures were added on top of already existing algo-
slides, taken from a fundus camera had to be rithms, many other initiatives attempted to
scanned by hand. This was done using a slide establish entirely new deep-based learning DR
reader or scanner to achieve a workable digital ver- detection software. Establishing automated or
sion of the image. The process was time consuming semi-automated screening, with the use of AI,
and required specialized equipment, and additional will require striking a careful balance between
processing steps introduced potential image arte- sensitivity and specificity, imaging modality,
facts and loss of quality. The lack of centralised gradeability of the images, all of which will need
databases and digital storage of fundus images to be weighed against the potential cost. The cost-
meant training and verification images were hard to benefit balance is not universal and will vary
acquire. As a consequence, most studies suffered depending on the relationship of those parameters
from low number of images used, as compared to with the relevant population characteristics, such
modern models using tens of thousands of fundus as the prevalence of DR and sight-­ threatening
images to establish and validate a system. DR, availability of treatment, cost and availability
Even though at that time automated screening of trained staff etc. A recent paper explores the
was severely limited from a technical standpoint, potential approaches to making a health economic
a number of people already attempted devising assessment and safety analysis of implementing
suitable screening methods, recognising the novel AI DR solutions into widespread screening
potential of new technology to enhance or substi- [13]. Deep learning DR detection has been found
tute human-based grading. to be cost-effective in developed countries, like
Singapore and United Kingdom. However, there
are no published studies looking into the feasibil-
Deep Learning Algorithms ity of implementing AI DR screening in countries
without a robust teleophthalmology screening
In subsequent years, with increasing digitisation, programme setup beforehand and other resource-
new ways of approaching the subject of auto- limited settings Table 11.1 [13].
mated image analysis were made possible. Up Described further are several significant initia-
until 2010s experts designed algorithms for detec- tives for AI-based diabetic retinopathy detection.
tion of specific features of DR like micro aneu-
rysms or haemorrhages. In deep learning the
software is presented with a fundus image as a IDx-DR
whole and a pre-specified result for that image.
Over the course of analysing many such images, IDx-DR is combined DR screening solution that
often thousands, it starts being able to distinguish incorporates the aforementioned DR screening
between images with different results. What sepa- algorithm with image quality assessment and
rates one result from another does not have to be feedback system. Submission of images is done
explicitly specified by its designers. The advent of using the IDx-DR client, which is a stand-alone
deep learning-based DR detection revealed a sig- piece of software. The IDx-DR client features a
nificant improvement in the accuracy of newly system for resubmission of images deemed to be
developed or improved systems. Abramoff and of too low quality. The threshold for a positive
colleagues reported how the introduction of deep result has been set as ‘more than mild’ diabetic
learning techniques, allowed a significant retinopathy according to the ICDR grading scale
improvement to the already established, classi- or signs of diabetic macular edema. IDx-DR
cally designed, automated DR software—the offers one additional result level of vision threat-
11  Artificial Intelligence in Diabetic Retinopathy 143

Table 11.1  The list of deep learning - based DR screening algorithms available at the end of 2020
Name of the Country of
software origin Classification level Comments
IDx-DR USA Per patient First AI autonomous diagnostic device to be FDA
rDR/no rDR approved.
Class IIa medical device in EU
Eyeart USA Per patient Second AI software to receive FDA approval. Approved by
rDR/no rDR Canadian FDA
Class IIa medical device in EU
RetmarkerDR Portugal DR/no DR Previously used in various screening initiatives in Portugal
Microaneurysm Class IIa medical device in EU
turnover rate
SELENA + Singapore Per patient Scheduled to be implemented into national DR screening
rDR/no rDR in Singapore
Google USA Per picture Studies surrounding real-world implementation based in
algorithm rDR/no rDR India, Thailand. Currently no official software package
available outside of research studies
MediosAI India Per patient Integrated into an offline smartphone app to be paired with
rDR/no rDR the Remidio fundus-on-phone device
Verisee Taiwan rDR/no rDR Relatively new algorithm, recently approved by the
Taiwanese FDA-equivalent government body
Pegasus United rDR/no rDR Operated by the Orbis non-profit organisation
Kingdom
RetCAD Netherlands rDR/no rDR Detects referable AMD as well
Retinalyze Denmark Per image, retinal Detects AMD related changes as well, also offers an
changes/no changes automated glaucoma screening module
OphtAI France Per patient rDR/no Also detects glaucoma and AMD
rDR and DR grade

Fig. 11.1  IDx-DR image submission screen. Printed with Permission © IDx Technologies

ening DR, indicative of a suspicion of more and all images need to be submitted for a result to
advanced, possibly proliferative DR. Screening is be produced. The algorithm is able to cope with
based on four fundus images per patient, two some quality loss utilizing the overlap of the two
from each eye, one macula- and one disc-centred image fields (Fig. 11.1).
144 A. Grzybowski and P. Brona

Although on the front-end, the user is pre- the algorithm had no access to. With odds stacked
sented with a screening result in one of the four against it, the AI was still able to exceed all end-
categories—no DR, mtmDR, vision threatening points set before the trial began, endpoints at sen-
DR and insufficient quality, on the back-end sitivity of 87.2% (>85%), specificity of 90.7%
IDx-DR produces a numerical value representing (>82.5%), and imageability rate of 96.1% (among
its assessment of likelihood of mtmDR. Currently patients deemed imageable by the reading cen-
it uses defined cut-offs to sort the patient into an ter). The landmark FDA decision to allow
appropriate category. Theoretically, this means IDx-DR to operate within the United States was
that the IDx-DR output could be adjusted to max- largely based on the results of this study [14]. In
imise either sensitivity or specificity depending US, according to the FDA approved use, IDx-DR
on the needs of a given screening initiative. needs to be coupled with the Topcon NW-400
IDx-DR is the first autonomous diagnostic non-mydriatic fundus camera.
software and one of the very first AI-based soft- Previously to this study, were a number of
ware’s in medicine to receive Federal Drug and studies published on IDx-DR, though none as
Administration (FDA) approval. In a self-titled significant. Notably its performance against the
pivotal trial, IDx-DR software was studied in a Messidor-2 dataset was significantly higher than
real-world application. A little under 900 patients in the above described trial, with 96.8% sensitiv-
were screened using IDx-DR coupled with ity and specificity of 87%. In another real-life
Topcon NW-400 automatic fundus camera in a study, performed in Netherlands, 1410 patients
primary care setting. The staff operating the were screened within the Dutch diabetic care sys-
IDx-DR client and taking the fundus images were tem. Three experts graded the resultant images
not IDx-DR or clinical trial staff, but pre-existing according to ICDR and EURODIAB grading
employees of those clinics who underwent stan- scales, resulting in significantly different algo-
dardised training. This is important as in a sce- rithm performance depending on the scale used.
nario of large-scale DR screening deployment For EURODIAB IDx-DR sensitivity and speci-
specialised staff, say in ophthalmology imaging ficity was 91% and 84%, whereas for ICDR they
may be harder to produce and acquire that the were 68% and 86% respectively. The signifi-
necessary technical equipment. In previous trials cantly lower performance when compared to
of IDx-DR and other AI algorithms the perfor- ICDR criteria could all be attributed to a single
mance of the AI was compared to human grading aspect of ICDR—judging a single haemorrhage
with the same information available, which was as at least moderate DR, the authors note that
mostly the fundus images. Sometimes to should this be changed the sensitivity changes
strengthen the human grading standard against from 68% to 96.1% [15].
which the AI was compared, several persons This is a great illustration of how important
graded each image with a consensus grading that grading criteria are. A number of differing crite-
followed. This trial took an even more stringent, ria have been used in different studies so far,
extreme approach—giving the human graders a Eurodiab, ICDR, ETDRS, some studies use local
lot more available information, while keeping the grading guidelines, with each being one of the
AI limited to the four fundus images taken by most significant parts affecting the outcome and
relatively inexperienced staff, albeit with an auto- final performance indicators published. The first
matic fundus camera and selective mydriasis. question and most important question in estab-
This was compared to grading done on four lishing DR screening is ‘what is the screening
­stereoscopic, widefield fundus images taken by trying to accomplish?’. In the simplest form the
professional technicians and graded by an estab- aim of a DR screening initiative should be finding
lished, independent reading center—the those patients, who will require a specialty oph-
Wisconsin Fundus Photograph Reading Center. thalmology visit before the next screening epi-
Presence of clinically significant diabetic macu- sode. This seems to hold true for established
lar edema (CSME) was additionally established traditional screening programmes in developed
based on macula OCT imaging, which of course countries. However, depending on the region and
11  Artificial Intelligence in Diabetic Retinopathy 145

result. Since its introduction it went through a


period of inactivity and was reintroduced in 2013,
with modern era machine learning improve-
ments. It is certified with the Conformité
Européenne (CE) level I under the previous regu-
lations. Retinalyze additionally offers screening
towards AMD and glaucoma, from the same fun-
dus photos.

RetmarkerDR

Retmarker is a DR detection system originating in


Portugal. It is one of the first AI screening tools
successfully implemented into real life screening,
not just for the purpose of a clinical trial. The cen-
tral region of Portugal has a longstanding DR
screening programme established back in 2001. In
2011 RetmarkerDR has been implemented into the
Fig. 11.2  IDX-DR result page. Printed with Permission
© IDx Technologies already existing, human-grader based DR screen-
ing programme. This screening is based on several
teams of photographers equipped with mobile fun-
resources available this can change. In a setting dus cameras. These screening units rotate between
of poorer countries, with many-fold less ophthal- different healthcare providers covering their whole
mologists and low availability of treatment, one route over the course of 12 months and then repeat-
might want to put the bar for referral higher. ing this cycle. Patients with diabetes and no history
Nevertheless, the scale used to measure and qual- of DR treatment are invited for screening at a local
ify the retinal changes present, needs to be health centre. These images are later collated and
backed-up by the risk of DR progression and risk sent on a weekly basis to a centralized reading cen-
to vision at a given level (Fig. 11.2). tre (Fig. 11.3).
The Retmarker software forms the first line of
analysis for those submitted images. Images in
Retinalyze which the algorithm detects signs of DR, or pro-
gression of DR in case of repeat screening, are
Retinalyze is a DR screening system developed sent for human grading, similarly with images
in Denmark. As mentioned, it is one of the very deemed low quality by the algorithm. In this case
first published automated DR analysis programs, Retmarker is used in the preliminary ‘disease’ or
with initial reports of its efficacy starting in 2003, ‘no disease’ sorting, which then specifies the
based on scanned 35 mm film fundus images. It need for human grader assessment of the ‘dis-
features a web-based interface, with a per-image ease’ sub-group. For quality assurance a certain
result. Images are submitted through the inter- number of DR negative results are sent for human
face, utilising a secure internet protocol. Results analysis as well, with the human graders blinded
are presented in terms of number/severity of to the AI decision.
detected retinal changes as either no changes, Such implementation of an AI algorithm to
mild retinal changes, or severe changes. An inter- detect DR relies on very high sensitivity, as false
esting feature of Retinalyze is being able to see negatives will rarely be discovered, but can com-
an annotated image with the detected retinal promise on specificity. As long as it eliminates a
changes highlighted, therefore being able to get a significant number of images from human analy-
glimpse into what led to the algorithms final sis, without missing cases of advanced disease,
146 A. Grzybowski and P. Brona

Fig. 11.3  RetmarkerDR exam manager

the process will likely be resource effective, as sons of AI DR systems ever published [19]. This
even a specificity of 50% means almost halving study, done for the purpose of assessing a poten-
the human grader work. tial introduction of autonomous DR detection
A noteworthy feature that distinguishes software into the existing English DR screening
Retmarker from other algorithms is its ability to programme, invited AI DR software makers to
take previous screenings into account. By com- submit their algorithm for the testing. Three sys-
paring the fundus images taken on a previous tems participated in the testing, RetmarkerDR,
screening visit, the system is able to track retinal Eyeart and iGradingM.  Because of technical
changes and determine if progression occurred. issues iGradingM, a DR detection software born
This leads to another interesting avenue—track- in Scotland, was disqualified from the study and
ing microaneurysms. Microaneurysms disappear its parent company has since dissolved. The study
over time and new ones form. Tracking those involved images taken from consecutive, routine
changes using traditional, human-grader based, screening visits of over 20,000 patients to an
methods is very labour intensive, but is virtually English DR screening centre, which were previ-
instantaneous for an AI. The rate of microaneu- ously graded as per the national screening proto-
rysms appearing and disappearing was named col were processed by the systems, and any
microaneurysm turn-over rate. A number of stud- discrepancies in grading between the AI and
ies have been published showing this parameter human-graders were sent to an external reading
is a promising predictive factor for future DR centre. Both the efficiency in detecting DR, refer-
progression [9, 16–18]. Although these studies able DR and cost-effectiveness were studied [19].
consistently linked increased MA turn-over to The study concluded with the following sensitiv-
increased chance of DR, to establish a clinically ity levels:
significant and actionable link between lesion
turn-over and diabetic retinopathy progression • EyeArt 94.7% for any retinopathy, 93.8% for
would require further work (Fig. 11.4). referable retinopathy (human graded as either
In addition to being introduced as a part of ungradable, maculopathy, preproliferative, or
screening in Portugal, RetmarkerDR was also proliferative), 99.6% for proliferative
studied in one of the only head-to-head compari- retinopathy;
• Retmarker 73.0% for any retinopathy, 85.0% for referable retinopathy, 97.9% for proliferative retinopathy.

Specificity:

• 20% for Eyeart for any DR
• 52.3% for Retmarker for any DR

Although the sensitivity levels are much higher for Eyeart, this is equalised by the reverse situation in specificity. Of note are the remarkably low specificity levels for both systems as compared to more recent reports and estimates of those and other software. It is important to realise that although the study was originally published in 2016, it started some years prior; during that period machine-learning and image analysis methods improved dramatically, and one can assume the algorithms evaluated at that time have since improved as well.

Eyeart

Eyeart, the second software compared for the purpose of the British screening programme, as described above, is being developed by Eyenuk Inc., based in Los Angeles, USA. It additionally offers another product—Eyemark for tracking DR progression which, similarly to Retmarker, offers MA turnover measurements. Eyeart is able to take in a variable number of pictures per patient, making it suitable for various screening scenarios without further adjustments needed, in contrast to some of its competitors. This solves a number of issues, as was illustrated by IDx-DR, which had to be specially modified to accept the single image per eye of the Messidor-2 dataset, instead of its typical input of two images.

Eyeart had been verified retrospectively on a database of 78,685 patient encounters (a total of 627,490 images) with a refer/no refer result and a final screening sensitivity of 91.7% and specificity of 91.5%, as compared to the Eye Picture Archive Communication System (EyePACS) graders; however, only the abstract for the study was available on-line. It appears Eyenuk has decided to pursue this line of enquiry further with the publishing of a full study, done on more than 100,000 consecutive patient visits from the EyePACS database. A total of 850,908 images were analysed, collected from 404 primary care facilities between 2014 and 2015.
Patients generally had eight images taken, four per eye: one image of the external eye, a single macula-disc-centred image and an image temporal to the disc, though no patient was disqualified because of the number of images taken or their resolution. The images were almost evenly split between non-mydriatic at 54% and mydriatic at 46%. The final results in terms of detecting referable DR were 91.3% sensitivity and 91.1% specificity, in line with the previous partial results. Sensitivity for detecting higher DR levels that are treatable—either severe or proliferative DR was 98.5%, and 97.1% for detecting CSME (as compared to human graders assessing the same fundus pictures). The system's accuracy did not seem to change depending on mydriasis, with 98.0% and 98.8% sensitivities for detecting treatable DR in non-mydriatic and mydriatic encounters respectively. Only 910 patient encounters, less than 1%, were deemed non-screenable by Eyeart; of those, 198 encounters had previously been assigned as insufficient for full human grading. Nevertheless, of those 910 screening episodes over one third had severe or proliferative DR; the authors note that the system treats non-screenable patients as positive, for the purpose of patient safety [20].

Eyeart analysed the whole cohort of over 100,000 screening encounters, almost a million images, in less than 2 full days [20]. Assuming an average 30 seconds of grading time per image, the same task would take about 7000 work-hours, or about 4 full-time graders working for a whole year, showing just how much faster computer analysis can be. Of course, in an actual screening scenario no one is grading thousands of images at a time, with a quick result available within minutes of the screening being much more satisfactory, but AI can do that too, 24 h a day, every day of the year (Figs. 11.5 and 11.6).
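The arithmetic behind that estimate is easy to reproduce (the 30 seconds per image comes from the text above; the 1800 annual working hours per full-time grader is our own assumption):

# Back-of-the-envelope grading-time estimate for the EyePACS cohort [20].
images = 850_908
seconds_per_image = 30                          # assumed human grading time
work_hours = images * seconds_per_image / 3600
print(f"{work_hours:,.0f} work-hours")          # -> 7,091 work-hours
print(f"{work_hours / 1800:.1f} grader-years")  # -> 3.9, i.e. about
                                                #    4 full-time graders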
Eyeart achieved similar results, in terms of sensitivity, to the aforementioned UK study looking into AI DR screening viability, though there is a very considerable discrepancy in specificity between the two studies [19, 21].

Fig. 11.5  EyeArt result page

Fig. 11.6  EyeArt result page
As mentioned before, these studies were not done in the same time period, and further improvements to the system probably account for the increase in its accuracy. Indeed, the authors themselves describe the improvement that version 1.2 of Eyeart (still based on traditional image analysis techniques) has undergone with the inclusion of multiple convolutional neural networks.

Eyeart was also measured against the Messidor-2 dataset, with a referable DR screening sensitivity of 93.8% and specificity of 72.2%. Importantly, this dataset does not have a pre-defined result or grading attached to it, therefore necessitating a separate set of graders to judge it for the standard that the AI is compared against. This grading is separate for each study, further hampering the ability to directly compare any systems involved.

Eyeart has recently published the results of its most robust clinical trial to date. The study was pre-registered, as with the IDx-DR pivotal trial, and comprised a similar number of patients—893 patients screened in total. The screening was performed in primary-care clinics, with two-field non-mydriatic fundus photography first and 4-field mydriatic imaging second. The study compared the ability of Eyeart to detect clinically significant DME, or moderate non-proliferative DR or higher, based on the two-field imaging, against the external reading centre's (the Wisconsin Fundus Photograph Reading Center, as was used in the IDx-DR trial) grading decision using the four wide-field stereoscopic images per eye. For non-mydriatic screening EyeArt was shown to have high sensitivity at 95.5%, good specificity at 86%, and gradeability of 87.5%. When dilating patients from the initially ungradable group, the system's overall gradeability rose to 97.4%, while retaining the same sensitivity, and specificity rose by 0.5% to 86.5%. Although this trial did not involve OCT imaging for the detection of DME, in all other respects it appears similar to the IDx-DR clinical trial, with similar results in terms of both systems' accuracy.
Another result, perhaps even more surprising than the stellar performance of the AI, was a comparison based on a subset of the patients in this trial that had undergone dilated ophthalmoscopy after the fundus imaging. A total of 497 patients were tested across 10 U.S. clinical centres, some specialty retinal centres and others general ophthalmology clinics. This was compared against the adjudicated decision of the Wisconsin reading center based on the 4 wide-field stereoscopic fundus photographs. Although the ophthalmoscope-based examinations had a high specificity of 99.5%, this was coupled with an abysmal sensitivity of 28.1% overall. Even among the retina specialty centres the sensitivity rate was only 59.1% [22]. This shows that human-based grading using ophthalmoscopy, as one of the tools commonly available in primary-care clinics, is very unlikely to be a sensible screening solution, if even ophthalmologists struggle with its accuracy.

The most recent study regarding Eyeart was done on 30,000 images taken from the English DR screening programme and followed a very similar protocol and analysis pattern to the only comparative study on AI in DR screening [19, 23]. Images from three different centers were graded according to the established national screening protocol. Among 30,405 screening episodes, Eyeart flagged all 462 cases of moderate and severe DR. Overall sensitivity for rDR was 95.7%, with 54% specificity. Although the specificity is once again lower than in other studies, it is still a very significant increase from the 20% specificity in the previous study [19, 23]. The authors concluded that with the introduction of such an AI system into the currently established national screening protocol, replacing the primary grader, the overall human grading workload could be halved.

'Google' Algorithm

The potential application of new artificial intelligence solutions for the analysis of fundus images, DR particularly, caught the attention of not only smaller, independent teams and companies but also industry giants—Google. This is not Google's only foray into medical AI, with teams at Google collaborating to find solutions for automated analysis of histopathology images, among other non-image-analysis-related publications. A Google Inc.-sponsored study introducing their automated DR screening algorithm was published in 2016 by Gulshan and colleagues. To develop the algorithm the authors gathered over 128,000 macula-centered images from patients presenting for their diabetic screening in India and the US. To validate the resultant algorithm, a random set of images from the same data source was chosen; those images were not used in the creation of the algorithm. The image set for both development and validation consisted of mixed mydriatic and non-mydriatic photos from several different fundus camera models. Additionally, the authors tested the algorithm against the aforementioned French dataset, Messidor-2. The algorithm achieved impressive results at a sensitivity of 96.1% and specificity of 93.9% (tuned for high sensitivity), and a sensitivity of 87.0% and specificity of 98.5% (tuned for specificity). The respective numbers for the Messidor-2 dataset were 97.5% and 93.4% (high sensitivity) and 90.3% and 98.1% (high specificity) [24]. Although these accuracy results are among the highest published, and the sample size is considerable, this study stood out in that it put a lot of emphasis on the selection of human graders and their validation. Initially, for the development of the dataset, the study invited 54 US-licensed ophthalmologists or ophthalmology trainees in the last year of residency, with each grading between 20 and 62,508 images. As a result, each image was graded between 3 and 7 times. Final DR status and gradeability of the image were set based on majority decision. Graders were sometimes shown images they had previously marked, to measure intra-grader reliability, or how often, given the same image, the grader decides on the same result. Sixteen graders went through enough volume of images for this to be feasibly calculated, and the top 7 or 8 ophthalmologists, based on this measure, were chosen to grade all the images from the validation datasets. Inter-grader reliability was also measured for 26 of the ophthalmologists. The mean intra-grader reliability for the 16 graders for referable DR was 94%, and inter-grader reliability for the 26 graders was 95.5%.
Even when choosing the most self-consistent graders out of several board-certified ophthalmologists, the mean agreement rate for referable DR images was only 77.7% for the EyePacs-1 dataset, with complete agreement among all eight graders achieved in less than 20% of referable DR images. Grader agreement was much better for non-referable DR images, with complete agreement on 85.1% of the non-referable cases [24]. This highlights just how many caveats the currently universally accepted grading method and gold standard of certified human grading can have. Out of 16 graders, on average, 4 out of 100 images were marked differently each time they were assessed by the same person. Among the 8 most self-consistent graders, only 20% of referable DR cases were judged as such by all graders.
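Both reliability measures discussed here reduce to simple proportions; the sketch below shows one way to compute them (the function names and data layout are ours, for illustration only):

from collections import Counter

def majority_grade(grades):
    """Reference standard by simple majority of independent grades."""
    return Counter(grades).most_common(1)[0][0]

def intra_grader_reliability(repeat_pairs):
    """Fraction of images given the same grade when re-shown to the
    same grader: a list of (first grade, second grade) pairs."""
    agree = sum(1 for first, second in repeat_pairs if first == second)
    return agree / len(repeat_pairs)

print(majority_grade(["referable", "referable", "non-referable"]))
# -> 'referable'
print(intra_grader_reliability([("r", "r"), ("r", "n"), ("n", "n"),
                                ("n", "n")]))
# -> 0.75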
Issues surrounding human grading were further explored in a subsequent 2018 study [25]. In it, the authors built upon the previously described work by Gulshan in terms of developing an improved algorithm, expanding the training dataset and exploring different presently used grading protocols. The authors implemented a solution where the software outputs several numbers ranging from 0 to 1, each indicating its confidence that the image represents a given severity level of DR. This appears to be very similar to the back-end solution implemented by IDx-DR, which also outputs its confidence level in the result being more than moderate DR, although this is not presented to the end user. This allows relatively easy adjustments to the system's sensitivity-specificity balance, focusing on either of those measures.
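A hypothetical sketch of how such per-severity confidences can be collapsed into a tunable referable/non-referable decision (the severity names follow the 5-point scale used by these systems; the function and threshold value are invented for illustration and are not any vendor's actual logic):

REFERABLE = ("moderate", "severe", "proliferative")

def is_referable(confidences, threshold):
    """confidences: dict mapping severity level -> score in [0, 1]."""
    referable_score = sum(confidences[level] for level in REFERABLE)
    return referable_score >= threshold

scores = {"none": 0.05, "mild": 0.15, "moderate": 0.55,
          "severe": 0.20, "proliferative": 0.05}
print(is_referable(scores, threshold=0.5))  # -> True
# Lowering the threshold favours sensitivity; raising it favours
# specificity.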
This study ended up with three different 'grading pools'—EyePacs graders, certified ophthalmologists and retinal specialists. Additionally, an adjudication protocol was introduced in cases of disagreement among the retinal specialists, with both asynchronous and live adjudication sessions until an agreement was reached [25]. This is in contrast to the first work, which relied only on majority decision. The new algorithm was based on well over 1.5 million retinal images, with 3737 images with adjudicated grading used to fine-tune the system and 1958 images used for validation. The validation set was graded by three retinal specialists on their own, and the grading was repeated later with face-to-face adjudication of all images between all three specialists. Additionally, three separate ophthalmologists graded the images on their own. The adjudicated grade was set as the gold standard for further comparisons.

All of the graders had high specificity—97.5%, 97.9% and 99.1% for the ophthalmologists and 99.1%, 99.3%, 99.3% for the retinal specialists. Sensitivities, however, were much lower, with ophthalmologists ranging from 75.2% to 76.4% individually and 83.8% as majority decision, as compared to the adjudicated grading [25]. Even the majority decision grading of the retinal specialists showed room for improvement at 88.1%, with individual sensitivities of 74.6%, 74.6% and 82.1%. Most cases of discrepancy between the majority grading of ophthalmologists and the adjudicated result stemmed from missed MAs—36%; misinterpreted image artefacts that can be construed as MAs or small haemorrhages—20%; and misclassified haemorrhages—16%.

After implementing the adjudication procedure and fine-tuning the autonomous system, it achieved accuracy levels comparable to any of the retinal specialists or ophthalmologists involved [25].

A prospective trial was done to assess the real-world viability of the algorithm, utilising many of the lessons learned from the two above-described studies [26]. The trial was done in two hospitals in India on a total of 3049 diabetics attending their appointments in the local general ophthalmology and vitreoretinal clinics, as well as telescreening initiatives. During their appointments, macula-centered 40–45 degree fundus images were taken, mainly with a Forus 3nethra camera, a compact, low-cost fundus camera [26]. All images were non-mydriatic and were not included in further therapeutic decisions for the patients, as they carried on with their appointments. All images were later graded by a non-physician trained grader and a retinal specialist.
All images taken from one of the two centres, 997 patients in total, also underwent grading by three retinal specialists with an adjudication process as in the previous study. Additionally, any images from the second centre with any discrepancies between any of the graders or the algorithm output (5-point DR grading and DME status) were also adjudicated. The results, in terms of human grading accuracy in detecting rDR, were largely similar to those in the previous study—the four human graders had sensitivities between 73.4% and 88.8%, with specificities between 83.5% and 98.7%. The algorithm had comparable performance, at a sensitivity of 88.9% at the first centre and 92.1% at the second centre, with respective specificities of 92.2% and 95.2% [26]. The 'Google' DR algorithm was trained on images taken from many different cameras, of which only 0.3% were taken by this specific fundus camera, yet it showed very good performance on images taken using it, suggesting the algorithm is able to deal with different equipment being used to take the images [26]. Although the algorithm and its results appear very promising, with good accuracy, it does require further work in order to be used in a clinical setting, which the authors point out themselves. Firstly, as it currently has no image quality assessment capabilities, only images deemed gradable by the adjudication panel were included in this latest study. Additionally, as with all other algorithms, its place within, and the precise protocols of, wide-spread screening and integration into the existing clinical workflow or outside of it remain to be devised and assessed. This latest study was designed specifically for the algorithm not to interfere with the established clinical set-up.

SELENA+, Singapore Algorithm

Singapore, one of the very few countries that have an established national DR screening programme, is also at the forefront of testing deep learning for DR detection. Ting and colleagues used images from the on-going Singapore National Diabetic Retinopathy Screening Program (SIDRP), which were additionally graded by two senior non-physician graders and adjudicated by a senior retinal specialist in case of conflicting grading. Overall, 72,610 images taken in the years 2010–2013 were included in the training dataset, and a further 71,896 from the years 2014–2015 were used for the primary validation dataset. The system was additionally validated using images from multi-ethnic populations from Singapore, and using images taken in screening studies from around the world—China, the African-American Eye Disease Study (US based), the Royal Victoria Eye Hospital (Australia), Mexico and the University of Hong Kong. These studies included between 1052 and 15,798 images, for a total validation dataset of 112,618 images from more than 56 thousand patients. Reference standards varied between the different studies, but all included at least two graders, with the largest study by image volume (n = 15,798) also including retinal specialist arbitration.

For the primary validation, that is the data from SIDRP years 2014–2015, the system demonstrated a sensitivity of 90.5% for detecting referable DR, comparable to professional graders on the same dataset at 91.5%, as compared to the final retinal specialist arbitration decision. Specificity of this solution was 91.6%, lower than that of professional graders at 99.3%. Interestingly, the system proved better at detecting sight-threatening DR at 100%, with trained graders rated at only 88.6%, again at the cost of lower specificity. As the study included multiple ethnic populations, yet was devised only on the basis of SIDRP images, the authors analysed whether it showed racial or other biases. This was made possible by the large racial diversity among the validation datasets—Malay, Indian, Chinese, White, African-American and Hispanic. The algorithm achieved comparable performance in different subgroups of patients by race; additionally, age, sex, and glycaemic control did not affect the accuracy of the algorithm.

Verisee

Verisee, an algorithm developed in Taiwan, was described in a recent paper. The algorithm was developed based on single-field images taken previously at the National Taiwan University, with a single fundus camera [27].
The images were graded by two ophthalmologists undergoing fellowship training, with an experienced retinal specialist employed for adjudication. The algorithm was trained on about 37,000 images, with 1875 images used for validation. The validation dataset was not used for training, but was taken with the same camera at the same location. The algorithm achieved 92.2% specificity and 89.5% sensitivity for any DR, and 89.2% and 90.1% for rDR. The algorithm exceeded the sensitivity for detecting rDR achieved by the ophthalmologists in this study, which was calculated at 71.1%, and did much better than internal physicians at detecting any DR (64.3% sensitivity, 71.9% specificity, based on diagnoses available in chart records) [27]. Although these results are promising, due to the low volume and homogeneity of the validation dataset, the performance of the algorithm in other scenarios remains uncertain. Nevertheless, the algorithm has been approved by the Taiwanese FDA-equivalent body and is scheduled to be implemented into real-world screening in Taiwan in the near future.

RetCAD

RetCAD is a recently published system, developed in the Netherlands, allowing for joint detection of DR and AMD from fundus images [28]. It is the only study to show an algorithm's effectiveness at screening for both AMD and rDR at the same time. The validation dataset was rather small relative to other studies described here, and comprised 600 images. Nevertheless, the software achieved good accuracy and was able to distinguish between rDR and referable AMD rather well, with a sensitivity of 90.1% and specificity of 90.6% [28]. Unlike the SELENA software, which can also detect both AMD and DR, both diseases were tested at the same time, instead of testing the accuracy against AMD and DR on separate datasets [29]. RetCAD was tested against the publicly available datasets of Messidor-2, for DR detection, and the Age-Related Eye Disease Study dataset, for AMD, achieving favourable results. However, for all of the above datasets, including the development and validation datasets, only images of good quality were chosen.

OphtAI

OphtAI is a relatively new entry to the commercial AI DR detection market. It originates from a joint venture of two French medical IT companies, Evolucare and ADCIS; it was developed in France and possesses a class IIa CE certification. The DR algorithm was developed based on a dataset of over 275,000 eyes from a French medical imaging database [30]. It is mostly a cloud-based service accessible through a web interface, my.ophtai.com, which allows between 1 and 6 images per patient to be sent for analysis and offers a DR grading result in a few seconds, along with a confidence rating and a heatmap of the suspect retinal changes. OphtAI is also available as a locally hosted platform, depending on local regulations. While the software additionally detects referable DR, diabetic macular edema, glaucoma and AMD from fundus images, there are plans for the next version to assess general eye health and to detect over 10 specific pathologies and 27 disease signs, expanding the number of detected pathologies to over 30. The DR detection algorithm was compared against the Messidor-2 dataset with very promising results [30, 31]. We would expect further publications related to the verification and efficacy of this algorithm in the coming years.

Other AI DR Solutions

The initiatives described so far focused mostly on the aspect of image analysis. One of the hurdles to overcome in their development regarded the equipment and technique used to take the fundus images, and how those might affect the system's diagnostic ability or its image quality detection protocols. The use of different fundus cameras by many different technicians can introduce a lot of variability in picture quality, resolution or sharpness. IDx-DR, for example, is only approved for use in the US when coupled, not only with a single brand of fundus devices, but with a specific fundus camera—the Topcon NW-400.
Other initiatives employ a number of computational techniques to normalize each image to a standard deemed appropriate for the system. Another line of thinking is that using images from multiple fundus cameras in training the algorithm may help it ignore the non-relevant, fundus-camera-dependent changes in the images. This strategy appears to be working, with most developers reporting no significant impact of the fundus camera used on their systems' final accuracy. This issue is particularly important in the case of low-cost or mobile fundus cameras. Introducing DR screening in low-resource regions of the world is costly not just in terms of grading but also in terms of equipment cost and portability; establishing permanent, stationary screening points is unlikely to be viable in settings with low population density and low patient mobility. Even in developed, wealthy countries, wide-spread screening is often done utilising mobile screening units, as exemplified by some of the UK-based screening strategies. The rapid development of AI in diabetic retinopathy did not go unnoticed by companies that already function in the fundus image field, with companies developing dedicated AI DR screening solutions for their existing fundus imaging hardware.

DR Detection with the Use of Mobile Devices

Another widespread invention of the digital era—the smartphone, with its relatively cheap cost and ubiquity, appears promising in regards to mobile, low-cost screening. In one study, retinal images of 296 patients taken with a smartphone-based add-on and software, the 'Remidio Fundus on phone' device, were analysed by Eyeart software. Even though the Eyeart algorithms had not been trained on smartphone-based fundus photography, it achieved a sensitivity of 99.3% for referable DR and 99.1% for sight-threatening DR, with specificities of 68.8% and 80.4% respectively [32]. Since that study was done, Remidio have developed their own in-house DR analysis software, embedded into their current generation of Fundus on phone devices (Fig. 11.7).

Fig. 11.7  Remidio FOP device. Printed with permission from Remidio

The software side of Remidio's DR detection system was named Medios AI. These results have since been replicated in another similar study by V and colleagues, where 3-field, dilated retinal images taken with the Remidio mobile camera were compared to the diagnosis of a vitreoretinal speciality resident and specialist based on the same pictures. The images were taken by a healthcare professional with no experience in using fundus cameras, with the offline system achieving similarly high accuracy results [33]. In a similar study done on 297 patients, the system's performance was measured against that of 2 vitreoretinal specialists, with a final sensitivity of the AI in detecting referable DR of 98.8% and specificity of 86.7% [34]. This was further corroborated by a study looking into 900 adult subjects with diabetes in India, where five retinal specialists graded images taken with the Remidio mobile camera for any DR or rDR. This was later compared to the Medios AI software running offline on an iPhone 6, a 6-year-old mobile device that currently costs less than 200 USD for a refurbished model.
Medios AI achieved good results, with sensitivity and specificity for any DR of 83.3% and 95.5%, and for rDR of 93% and 92.5% [35]. For Medios AI, all studies so far compared AI and grader performance on the same source material: pictures taken with the mobile camera. A study similar to those done by IDx-DR and Eyeart, where the chosen system is compared to a diagnosis based on professional, multi-field fundus imaging, might provide additional insight and comparability of those systems with the mobile approach (Fig. 11.8).

Fig. 11.8  MediosAI Image selection. Printed with permission from MediosAI

The big difference in Remidio's DR screening system, other than implementing it directly into the fundus imaging device, is performing the analysis entirely offline, without the need for internet access. Although access to wireless internet sources is spreading all over the world, this can be a hugely important factor in screening remote and underprivileged communities, where internet access is sometimes not possible and very often unreliable. This approach is picking up steam, with more mobile, smartphone-based or smartphone-aided fundus imaging solutions being studied and considered for adoption in DR screening. Smartphones, coupled with a compatible mobile fundus camera attachment or device, provide a low-cost, highly mobile and highly scalable DR screening solution, especially if the analysis is integrated into the smartphone itself. A recent study conducted in India compared the effectiveness of four such devices in human-based DR grading [36] (Fig. 11.9).

Fig. 11.9  MediosAI report. Printed with permission from MediosAI
It appears the company Bosch has also taken a similar approach in improving its 'Bosch Mobile Eye Care' fundus camera and developing an in-house DR diagnostic algorithm to be implemented within the fundus camera itself. Single-field images taken with their camera, without pharmacological mydriasis, were analysed by a convolutional neural network-based AI software to deliver a disease/no-disease or insufficient quality output. The system is cloud based and would require internet access. Out of 1128 eyes studied, 44 (3.9%) were deemed inconclusive by the algorithm, with just 4 out of 568 patients having images from both eyes of insufficient quality. The study compared the AI's performance with grading based on 7-field stereoscopic, mydriatic, ETDRS imaging done on the same eye. The Bosch DR Algorithm achieved good results, with sensitivity, specificity, PPV, and NPV rates of 91%, 96%, 94%, and 95% respectively [37]. However, little is known about the grading criteria employed in this study; in contrast to other similar works, it employs a purely disease-positive/negative criterion, rather than the more useful rDR/non-rDR distinction [37]. Unfortunately, no further reports of this algorithm's effectiveness are available at this time.
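For reference, the four measures reported above all derive from the same 2×2 confusion matrix; a generic sketch (the counts below are invented to roughly match the reported rates, not the study's actual table):

def screening_metrics(tp, fp, fn, tn):
    """Standard screening metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased eyes correctly flagged
        "specificity": tn / (tn + fp),  # healthy eyes correctly cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

print(screening_metrics(tp=91, fp=6, fn=9, tn=94))
# -> sensitivity 0.91, specificity 0.94, PPV ~0.94, NPV ~0.91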
Even though mobile screening does appear very appealing, and as exemplified above the results are very promising, it is conceivable that the lower image quality obtained when using mobile fundus cameras might affect the accuracy of the AI system used to grade it. A recent study compared the performance of a deep learning-based DR detection system on a benchmark, curated image set taken with a desktop camera against its accuracy on images taken with a handheld fundus camera [38]. Although the software, dubbed Pegasus, did exceptionally well on the curated desktop dataset, with 93.4% sensitivity and 94.2% specificity, this did not translate to an equal detection rate on the handheld camera images, with a statistically significant decrease in accuracy. The parameters for the handheld camera dataset were 81.6% sensitivity and 81.7% specificity—a drop of more than 10% for each of the parameters [38]. Mobile screening setups and portable cameras are very attractive means for introducing widespread screening. However, testing on curated, high-quality datasets will overestimate the real-world accuracy. Testing of the software should be done in a scenario as close to the desired implementation as possible, to achieve accuracy metrics that will be true to real-life screening.

New Technologies in Retina Imaging and DR Screening

Although most DR screening efforts are directed towards the analysis of fundus images, there have been significant advancements in employing AI for the analysis of optical coherence tomography (OCT). OCT is commonly used in assessing and monitoring DR and DME on an individual patient basis. Several metrics, like central macular thickness, help establish some objective parameters; nevertheless, the evaluation of OCT scans is still subjective and user-dependent, similarly to evaluating fundus pictures. A further development of OCT, OCT angiography (OCTA), allows for non-invasive tracing of retinal and choroid vasculature; the role of OCTA in common ophthalmic practice is not firmly defined, and there are few objective quantifications possible. First attempts at using OCTA data for machine learning and automated analysis of DR patients have already been made. OCTA data from 106 patients with type II diabetes and either no DR (n = 23) or mild non-proliferative DR (n = 83) was used to train an algorithm to detect DR features from superficial and deep retinal maps [39]. Using a combined approach drawing on both layers, the system demonstrated an overall accuracy of 94.3%, sensitivity of 97.9%, specificity of 87.0%, and an area under the curve (AUC) of 92.4% [39]. Although the relatively high reliability measures are promising, it is important to note that the validation was done on the training subset. Nevertheless, the study has shown that OCTA can be subjected to deep learning and automated analysis, and we may very well see more such initiatives in the future. The specific computational techniques for detecting DR from OCTA have been further explored in a recent study comparing different neural network approaches to analysing OCTA and their results. The best performing algorithm achieved an accuracy of 0.90–0.92 [40].
Teaching general practitioners (GPs) to take photos with a mobile fundus camera and subsequently grade them might be an alternative method of widening access to DR screening, without the use of AI or automated systems. A recent study looked into training GPs in Sri Lanka to take and grade fundus photos taken with a mobile camera (Zeiss-Visuscout100®). The GPs underwent a training programme delivered by two retinologists; however, of the nine doctors that undertook the training, only the two with the best test grading results were chosen for the study. The GPs took and graded non-dilated and subsequently mydriatic fundus images; their performance was graded against the decision of a retinal specialist after performing a dilated fundus examination using slit lamp biomicroscopy and indirect ophthalmoscopy. Assuming ungradable subjects as referable, the two GPs achieved sensitivities for detecting rDR of 85% and 87%, with specificities of 72% and 77%, for non-mydriatic screening, rising to 89% and 93% specificity and 95% and 96% sensitivity after mydriasis. Although this shows that training GPs to screen for rDR is theoretically feasible and can achieve good diagnostic accuracy, both the availability of GPs and their ability to take on additional workload are limited. In the aforementioned study only the two best performing GPs (measured as agreement with the retinal specialist on a test image set) were included; unlike with an automated system, the accuracy would likely vary between different GP graders [41].

Approaching the issues surrounding DR screening from a different direction is RetinaRisk, a software developed in Iceland. RetinaRisk aims to decrease the overall burden of yearly DR screening by safely extending the time between screenings for part of the DR population. Although not explicitly derived from machine learning, it is based on the analysis of extensive datasets. The algorithm takes in patients' parameters, such as gender, age, HbA1c level, DR status, diabetes type and duration, and blood pressure level. As a result, the algorithm presents a recommended time till the next screening, which may be longer than the traditionally accepted yearly interval, but may also be shorter, for a subset of patients at high risk of developing DR complications. In a recent study based in one Norwegian ophthalmic practice between 2014 and 2019, the average screening interval was extended to 23 months, as compared to a 14-month average for the control group with fixed screening intervals [42].
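A hypothetical sketch of what such a risk-based interval recommender looks like from the caller's side; the input fields mirror the parameters listed above, but the scoring rules are invented for illustration and have nothing to do with RetinaRisk's validated model:

from dataclasses import dataclass

@dataclass
class Patient:
    gender: str
    age: int
    hba1c_percent: float
    dr_status: str          # e.g. "none", "mild", "moderate"
    diabetes_type: int      # 1 or 2
    diabetes_years: int
    systolic_bp: int        # mmHg

def recommended_interval_months(p):
    """Toy rule: start from the traditional 12 months and adjust
    with crude risk markers. Not a clinical algorithm."""
    if p.dr_status == "none" and p.hba1c_percent < 7.0 and p.systolic_bp < 140:
        return 24   # low estimated risk: extend the interval
    if p.dr_status != "none" or p.hba1c_percent >= 9.0:
        return 6    # high estimated risk: shorten it
    return 12

print(recommended_interval_months(
    Patient("f", 58, 6.5, "none", 2, 8, 128)))  # -> 24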
Conclusions

Deep learning DR diagnostic software is currently a rapidly developing topic. During the last decade we have seen the concepts surrounding automatic DR screening evolve from a few expert-designed algorithms with varying measures of accuracy to a multitude of different approaches employing the newest developments in deep learning and other fields. We have seen progressively more robust studies emerge, proving the diagnostic or decision-support algorithms to be accurate and reliable, some based on millions of images, others with particularly rigorous setting of their gold standard. During the last 2 years, a number of software packages have been approved by regulatory bodies around the world and are well on their way to being implemented into wide-spread screening in the respective countries. Following the general worldwide trend, increasing emphasis is being placed on mobile solutions, which may prove to be a better fit for resource-starved regions. Although the body of evidence speaking for the various algorithms is quite large and constantly increasing, there are significant shortcomings in our current study of AI in DR. Virtually all of the current studies looking into and measuring DR algorithms are sponsored by or dependent on the respective algorithm's company. Independent studies are very few and far between. For a long time the only independent and the only robust comparison available, published by Tufail and colleagues in 2016, compared algorithms tested in 2013. Since that time deep learning and related concepts have progressed almost beyond recognition, and many of the algorithms described here are being constantly updated. This situation changed only recently with the publishing of a study comparing multiple AI DR detection algorithms in an anonymised fashion, which made it clear the algorithms' accuracy can vary significantly, but unfortunately without giving readers any insight into the performance of any given algorithm [43].
We recently published a much smaller study comparing two algorithms on a local dataset [44]. Nevertheless, independent studies, particularly comparisons or studies establishing objective criteria through which the respective algorithms could be compared, are missing, with organisations, end-users and consumers left with a considerable dilemma when trying to choose an algorithm for screening their local population.

References

1. Klein BEK. Overview of epidemiologic studies of diabetic retinopathy. Ophthalmic Epidemiol. 2007;14(4):179–83.
2. Guariguata L, Whiting DR, Hambleton I, Beagley J, Linnenkamp U, Shaw JE. Global estimates of diabetes prevalence for 2013 and projections for 2035. Diabetes Res Clin Pract. 2014;103(2):137–49.
3. Lee R, Wong TY, Sabanayagam C. Epidemiology of diabetic retinopathy, diabetic macular edema and related vision loss. Eye Vis [Internet]. 2015 Sep 30 [cited 2020 Feb 7];2. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4657234/
4. Romero-Aroca P, de la Riva-Fernandez S, Valls-Mateu A, Sagarra-Alamo R, Moreno-Ribas A, Soler N. Changes observed in diabetic retinopathy: eight-year follow-up of a Spanish population. Br J Ophthalmol. 2016;100(10):1366–71.
5. Scanlon PH. The English National Screening Programme for diabetic retinopathy 2003–2016. Acta Diabetol. 2017;54(6):515–25.
6. Pandey R, Morgan MM, Murphy C, Kavanagh H, Acheson R, Cahill M, et al. Irish National Diabetic RetinaScreen Programme: report on five rounds of retinopathy screening and screen-positive referrals. (INDEAR study report no. 1). Br J Ophthalmol. 2020; Published Online First: 17 December 2020.
7. Nguyen HV, GSW T, Tapp RJ, Mital S, DSW T, Wong HT, et al. Cost-effectiveness of a national telemedicine diabetic retinopathy screening program in Singapore. Ophthalmology. 2016;123(12):2571–80.
8. Gardner GG, Keating D, Williamson TH, Elliott AT. Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool. Br J Ophthalmol. 1996;80(11):940–4.
9. Hipwell JH, Strachan F, Olson JA, KC MH, Sharp PF, Forrester JV. Automated detection of microaneurysms in digital red-free photographs: a diabetic retinopathy screening tool. Diabet Med. 2000;17(8):588–94.
10. Hansen AB, Hartvig NV, Jensen MS, Borch-Johnsen K, Lund-Andersen H, Larsen M. Diabetic retinopathy screening using digital non-mydriatic fundus photography and automated image analysis. Acta Ophthalmol Scand. 2004;82(6):666–72.
11. Larsen M, Godt J, Larsen N, Lund-Andersen H, Sjølie AK, Agardh E, et al. Automated detection of fundus photographic red lesions in diabetic retinopathy. Invest Ophthalmol Vis Sci. 2003;44(2):761–6.
12. Abràmoff MD, Lou Y, Erginay A, Clarida W, Amelon R, Folk JC, et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest Ophthalmol Vis Sci. 2016;57(13):5200–6.
13. Xie Y, Gunasekeran DV, Balaskas K, Keane PA, Sim DA, Bachmann LM, et al. Health economic and safety considerations for artificial intelligence applications in diabetic retinopathy screening. Transl Vis Sci Technol. 2020;9(2):22.
14. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit Med. 2018;1(1):1–8.
15. Van Der Heijden AA, Abramoff MD, Verbraak F, van Hecke MV, Liem A, Nijpels G. Validation of automated screening for referable diabetic retinopathy with the IDx-DR device in the Hoorn Diabetes Care System. Acta Ophthalmol (Copenh). 2018;96(1):63–8.
16. Haritoglou C, Kernt M, Neubauer A, Gerss J, Oliveira CM, Kampik A, et al. Microaneurysm formation rate as a predictive marker for progression to clinically significant macular edema in nonproliferative diabetic retinopathy. Retina. 2014;34(1):157–64.
17. Nunes S, Pires I, Rosa A, Duarte L, Bernardes R, Cunha-Vaz J. Microaneurysm turnover is a biomarker for diabetic retinopathy progression to clinically significant macular edema: findings for type 2 diabetics with nonproliferative retinopathy. Ophthalmologica. 2009;223(5):292–7.
18. Pappuru RK, Ribeiro L, Lobo C, Alves D, Cunha-Vaz J. Microaneurysm turnover is a predictor of diabetic retinopathy progression. Br J Ophthalmol. 2019;103(2):222–6.
19. Tufail A, Kapetanakis VV, Salas-Vega S, Egan C, Rudisill C, Owen CG, et al. An observational study to assess if automated diabetic retinopathy image assessment software can replace one or more steps of manual imaging grading and to determine their cost-effectiveness. Health Technol Assess. 2016;20(92):1–72.
20. Bhaskaranand M, Ramachandra C, Bhat S, Cuadros J, Nittala MG, Sadda SR, et al. The value of automated diabetic retinopathy screening with the EyeArt system: a study of more than 100,000 consecutive encounters from people with diabetes. Diabetes Technol Ther. 2019;21(11):635–43.
21. Solanki K, Bhaskaranand M, Bhat S, Ramachandra C, Cuadros J, Nittala MG, et al. Automated diabetic retinopathy screening: large-scale study on consecutive patient visits in a primary care setting. In: Diabetologia. Springer; 2016. p. S64.
22. Ipp E, Shah VN, Bode BW, Sadda SR. 599-P: diabetic retinopathy (DR) screening performance of general ophthalmologists, retina specialists, and artificial intelligence (AI): analysis from a pivotal multicenter prospective clinical trial. Diabetes [Internet]. 2019 [cited 2020 Feb 26];68(Supplement 1). Available from: https://diabetes.diabetesjournals.org/content/68/Supplement_1/599-P
23. Heydon P, Egan C, Bolter L, Chambers R, Anderson J, Aldington S, et al. Prospective evaluation of an artificial intelligence-enabled algorithm for automated diabetic retinopathy screening of 30 000 patients. Br J Ophthalmol. 2020;bjophthalmol-2020-316594.
24. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
25. Krause J, Gulshan V, Rahimy E, Karth P, Widner K, Corrado GS, et al. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology. 2018;125(8):1264–72.
26. Gulshan V, Rajan RP, Widner K, Wu D, Wubbels P, Rhodes T, et al. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. 2019;137(9):987–93.
27. Hsieh Y-T, Chuang L-M, Jiang Y-D, Chang T-J, Yang C-M, Yang C-H, et al. Application of deep learning image assessment software VeriSee for diabetic retinopathy screening. J Formos Med Assoc. 2021;120(1, Part 1):165–71.
28. González-Gonzalo C, Sánchez-Gutiérrez V, Hernández-Martínez P, Contreras I, Lechanteur YT, Domanian A, et al. Evaluation of a deep learning system for the joint automated detection of diabetic retinopathy and age-related macular degeneration. Acta Ophthalmol (Copenh). 2020;98(4):368–77.
29. DSW T, Cheung CY-L, Lim G, GSW T, Quang ND, Gan A, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211–23.
30. Quellec G, et al. Instant automatic diagnosis of diabetic retinopathy. arXiv e-prints: arXiv-1906. 2019. https://arxiv.org/abs/1906.11875.
31. Quellec G, et al. Automatic detection of rare pathologies in fundus photographs using few-shot learning. Med Image Anal. 2020;61:101660. https://doi.org/10.1016/j.media.2020.101660. https://arxiv.org/abs/1907.09449.
32. Rajalakshmi R, Subashini R, Anjana RM, Mohan V. Automated diabetic retinopathy detection in smartphone-based fundus photography using artificial intelligence. Eye. 2018;32(6):1138–44.
33. Natarajan S, Jain A, Krishnan R, Rogye A, Sivaprasad S. Diagnostic accuracy of community-based diabetic retinopathy screening with an offline artificial intelligence system on a smartphone. JAMA Ophthalmol. 2019;137(10):1182–8.
34. Sosale B, Sosale AR, Murthy H, Sengupta S, Naveenam M. Medios–an offline, smartphone-based artificial intelligence algorithm for the diagnosis of diabetic retinopathy. Indian J Ophthalmol. 2020;68(2):391–5.
35. Sosale B, Aravind SR, Murthy H, Narayana S, Sharma U, SGV G, et al. Simple, mobile-based artificial intelligence algorithm in the detection of diabetic retinopathy (SMART) study. BMJ Open Diabetes Res Care. 2020;8(1):e000892.
36. MWM W, Mishra DK, Hartmann L, Shah P, Konana VK, Sagar P, et al. Diabetic retinopathy screening using smartphone-based fundus imaging in India. Ophthalmology. 2020;127(11):1529–38.
37. Bawankar P, Shanbhag N, SS K, Dhawan B, Palsule A, Kumar D, et al. Sensitivity and specificity of automated analysis of single-field non-mydriatic fundus photographs by Bosch DR Algorithm—comparison with mydriatic fundus photography (ETDRS) for screening in undiagnosed diabetic retinopathy. PLoS One. 2017;12(12):e0189854.
38. Rogers TW, Gonzalez-Bueno J, Franco RG, Star EL, Marín DM, Vassallo J, et al. Evaluation of an AI system for the detection of diabetic retinopathy from images captured with a handheld portable fundus camera: the MAILOR AI study. Eye. 2020:1–7.
39. Sandhu HS, Eladawi N, Elmogy M, Keynton R, Helmy O, Schaal S, et al. Automated diabetic retinopathy detection using optical coherence tomography angiography: a pilot study. Br J Ophthalmol. 2018;102(11):1564–9.
40. Heisler M, Karst S, Lo J, Mammo Z, Yu T, Warner S, et al. Ensemble deep learning for diabetic retinopathy detection using optical coherence tomography angiography. Transl Vis Sci Technol. 2020;9(2):20.
41. Piyasena MMPN, Yip JL, MacLeod D, Kim M, Gudlavalleti VSM. Diagnostic test accuracy of diabetic retinopathy screening by physician graders using a hand-held non-mydriatic retinal camera at a tertiary level medical clinic. BMC Ophthalmol. 2019;19(1):89.
42. Estil S, Steinarsson ÆÞ, Einarsson S, Aspelund T, Stefánsson E. Diabetic eye screening with variable screening intervals based on individual risk factors is safe and effective in ophthalmic practice. Acta Ophthalmol (Copenh). 2020;98(4):343–6.
43. Lee AY, Yanagihara RT, Lee CS, Blazes M, Jung HC, Chee YE, et al. Multicenter, head-to-head, real-world validation study of seven automated artificial intelligence diabetic retinopathy screening systems. Diabetes Care. 2021;44(5):1168–75.
44. Grzybowski A, Brona P. Analysis and comparison of two artificial intelligence diabetic retinopathy screening algorithms in a pilot study: IDx-DR and Retinalyze. J Clin Med. 2021;10(11):2352.
12 Google and DeepMind: Deep Learning Systems in Ophthalmology

Xinle Liu, Akinori Mitani, Terry Spitz, Derek J. Wu, and Joseph R. Ledsam

Xinle Liu and Akinori Mitani contributed equally.

X. Liu · A. Mitani · D. J. Wu, Google Health, Palo Alto, CA, USA
T. Spitz, Google Health, London, UK
J. R. Ledsam (*), DeepMind, London, UK

Introduction

Over the last century the field of ophthalmology has seen major advances as a result of technological development. The introduction of digital fundoscopy changed workflows across the field, enabling large-scale screening programs for diabetic retinopathy [1]. Optical coherence tomography (OCT) has allowed practitioners to visualize ophthalmic anatomy in 3 dimensions (3D) and provided greater insights into diseases such as age-related macular degeneration, glaucoma and retinal vascular conditions [2]. Among the more recent innovations, artificial intelligence (AI; Box 12.1) is poised to have a major impact on the field, promising increased accessibility of screening programs [3] and automated virtual triage [4].

There is a large and growing body of work demonstrating the impact of applying AI methods to ophthalmology. A wide range of subspecialties are covered, including medical retina [4, 5], in particular diabetic retinopathy [3, 6], detection of glaucoma [7, 8] and cataract management [9], among many others. The first autonomous AI system approved by the U.S. Food and Drug Administration (FDA) was for the detection of diabetic retinopathy (DR) [10]. Despite the promise of AI, challenges remain. The 2020 American Diabetes Association guidelines, which discuss AI-enabled DR screening as "an alternative to traditional screening approaches", nonetheless note with caution that "the benefits and optimal utilization of this type of screening have yet to be fully determined" [11]. Beyond DR, no other ophthalmic AI systems have been approved by the FDA as of Q1 2020, partially because there are significant challenges to overcome throughout the development of an AI system to be used in clinical practice, with nuances at every step and both room to improve and potential to explore.

To overcome these challenges, it is essential to have a holistic approach that takes into account the patient pathways, clinical workflows, and how healthcare professionals will interact with a model (Fig. 12.1). Such an approach will help ensure the development of AI models that meet or exceed regulatory requirements (e.g. FDA in the U.S., Conformité Européenne (CE) in Europe) and satisfy unmet clinical needs in a safe and efficacious way. With such a holistic approach, AI may transform how patients interact with both community and clinical eye care, expanding access to clinical expertise globally. This patient-centered approach is the cornerstone of Google and DeepMind's strategy in applying AI to ophthalmology.
Fig. 12.1  The development lifecycle shows the stages of taking applied AI research through to deployment and beyond: applied research, product development, clinical trials, regulatory approval, product deployment, and post-market surveillance. Tasks within applied research (engaging with partners, data collection, label creation, modeling) and within labelling (engaging graders, preparing guidelines, grading, adjudication) are expanded upon in greater detail. Applied research covers typical AI model development tasks, with data and label acquisition, and modeling (training, evaluating and testing). The medical grading process shown is potentially time consuming and directly influences the final model quality. Grading requires clear guidelines that are often the result of multiple iterations by medical experts. Data can then be graded either independently by several graders or potentially even adjudicated until consensus is reached between the graders
Fig. 12.2  Representative publications or announcements from Google along the development lifecycle for diabetic retinopathy screening. [Figure contents, by lifecycle stage. Applied research: Gulshan et al. 2016, "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs"; Krause et al. 2018, "Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy"; Huston et al. 2019, "Quality control challenges in crowdsourcing medical labeling"; Schaekermann et al. 2019, "Remote tool-based adjudication for grading diabetic retinopathy". Product development: Smith-Morris et al. 2018, "Diabetic retinopathy and the cascade into vision loss"; Bouskill et al. 2018, "Blind spots in telemedicine: a qualitative study of staff workarounds to resolve gaps in diabetes management"; Sayres et al. 2018, "Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy"; Schaekermann et al. 2018, "Expert discussions improve comprehension of difficult cases in medical image assessment". Clinical trials: Gulshan et al. 2018, "Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India"; Ruamviboonsuk et al. 2019, "Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program". Product deployment and post-market surveillance: Beede et al. 2020, "A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy"; Verily Blog, "Launching a powerful new screening tool for diabetic eye disease in India"; Google The Keyword 2018, "AI for social good in Asia Pacific".]

This chapter aims to provide an overview of the work at Google and DeepMind. We start by examining example applications in medical retina, glaucoma and other subspecialties of ophthalmology. We follow this with a section on clinical translational research, and finally discuss applications of AI beyond ophthalmic diseases.

Applied Deep Learning Research Work in Eye Diseases

Color Fundus Photography (CFP)

Our initial contributions applying AI to ophthalmology were in the field of medical retina, where color fundus photography has been widely used for diagnosing and screening multiple eye diseases with established grading systems (Fig. 12.2).
The first application of deep learning (DL) to the retina at Google was for DR screening [3]. DR has been the leading cause of preventable vision loss among the working-age population [12], remaining a global health burden [13]. Early detection and proper follow-up for timely treatment is the key to preventing irreversible vision loss from DR [14]. This necessitates scalable screening programs to cover the increasing global population with diabetes [15], and automated grading that could potentially improve the efficacy and availability of such screening programs. To achieve this, we applied DL by using a neural network architecture called Inception-v3, shown to be effective for image classification of non-medical images (e.g. cats vs. dogs) [3]. In this study, Inception-v3 was used to detect DR using the 5-point International Clinical Diabetic Retinopathy scale [16]: none, mild, moderate, severe and proliferative (Fig. 12.3). Working together with our collaborators, we determined that the most clinically relevant model would detect referable DR at the level of moderate or above, the threshold at which follow-up visits to ophthalmologists are normally requested. This first work showed DR detection performance on par with general ophthalmologists, achieving an Area Under the Receiver-Operating Characteristic Curve (AUC) of 0.99 when evaluated against a reference standard determined by the majority opinion of US board-certified ophthalmologist graders [3].

Box 12.1 Terminology
• Artificial Intelligence (AI): AI is a general term for the broad research field of developing intelligent systems.
• Machine Learning (ML): Within the field of AI, ML describes algorithms that perform tasks requiring intelligence by learning from examples.
• Deep Learning (DL): DL is a particular form of ML loosely inspired by biological neural networks, where algorithms process information through a complex network of adaptive artificial compute units ("neurons").

Next, we observed that intergrader variability still exists (refer to the Grading subsection below), and the majority opinions were sometimes different from opinions arrived at after discussion within a panel of graders. This is because taking the majority can ignore "minority opinions" that reflect actual pathology. For example, if a single grader pointed out the existence of a subtle abnormality, other graders might change their grade after the discussion even though they had initially missed it independently. By tuning the model using more reliable labels based on such adjudicated grades, the algorithm further achieved a performance comparable to retina specialists (Fig. 12.4) [17].

Fig. 12.3  A deep learning system for detecting DR from CFPs. A CFP is used as input to Inception-v3. The model is a deep neural network made up of building blocks that include convolutions, average pooling, max pooling, concats, dropouts, and fully connected layers, ending in a softmax output. The output is the relative likelihood that the input image is each one of five grades of DR, and whether the image itself is gradable for DR. The 5-class output is then separated via the dotted line to determine a referability result
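A minimal sketch of this kind of setup, using the open-source Inception-v3 backbone that ships with TensorFlow (see Box 12.3): a 5-class softmax head is attached to the backbone, and the probabilities of the moderate-or-worse grades are summed into a referability score. The input size, class ordering and untrained weights are illustrative assumptions, not the published configuration.

```python
# Minimal sketch: 5-class DR grading on an Inception-v3 backbone, with the
# 5-class output collapsed into a "referable DR" score (moderate or worse).
# Class order, input size and weights are assumptions for illustration.
import numpy as np
import tensorflow as tf

NUM_GRADES = 5  # none, mild, moderate, severe, proliferative

base = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, pooling="avg",
    input_shape=(299, 299, 3))
head = tf.keras.layers.Dense(NUM_GRADES, activation="softmax")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=head)

def referable_dr_probability(cfp_batch):
    """P(referable DR) = P(moderate) + P(severe) + P(proliferative)."""
    probs = model.predict(cfp_batch)   # shape: (batch, 5)
    return probs[:, 2:].sum(axis=1)    # grades at or above moderate

# Random stand-in for a batch of preprocessed CFPs:
dummy = np.random.rand(1, 299, 299, 3).astype("float32")
print(referable_dr_probability(dummy))
```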

Fig. 12.4  Model performance for detecting referable diabetic retinopathy. Receiver operating characteristic (ROC) curve (sensitivity (%) against 1 − specificity (%)) for our DR model published in Krause et al. [17] (moderate or worse diabetic retinopathy model, AUC = 0.986), demonstrating performance on par with retinal specialists. Also shown is the performance of generalists, assessed in our previous 2016 publication [3]

To evaluate how the model would generalize to actual clinical settings, model performance needs to be validated to ensure at least decent generalization to new data and populations. We have conducted two validation studies to date, a prospective study in India and a retrospective study in Thailand. The model performed on par with manual grading in both of these studies [18, 19], and another prospective study is under way in Thailand [20].
In addition to DR, diabetic screening programs must catch a wide range of common referable ophthalmic diseases that may coexist in the diabetic population, including Age-related Macular Degeneration (AMD), glaucoma, and Retinal Vein Occlusion (RVO). Similar to how DR can manifest as hard exudates, AMD can present with lesions called drusen, and RVO with obstructions. Glaucoma, the second leading cause of blindness [21] and the leading cause of irreversible blindness [22] worldwide, is more challenging to diagnose due to ambiguities and subjectivity, and the diagnosis generally requires a number of other clinical data such as visual fields. Fortunately, many signs of glaucoma-related neuropathy (such as a high cup-to-disc ratio, neuroretinal rim notching, and retinal nerve fiber layer defects) are visible from a fundus photograph. In training a model for glaucoma, we also collected feature-level grades (e.g. vertical elongation of the optic cup, parapapillary atrophy, disc hemorrhage), in addition to referable glaucomatous optic neuropathy grades. We showed that a DL model's predictions of glaucoma suspect correlate well with glaucomatous neuropathy and actual glaucoma diagnoses [23].
Grading

Machine learning models require labeled data for both development and validation. In ophthalmology, these labels are generally obtained via grading by ophthalmologists. Both model training and performance evaluation are dependent on the quality of the grades provided. However, there is significant intergrader variability for DR grading (Fig. 12.5).
One central tenet of our approach to reducing grading variability is to create guidelines that result in consistent and reproducible grades for given diseases. This involves getting experts to grade a small set of cases on prototype guidelines, quantitatively evaluating their intergrader agreement, having the experts come together to discuss and resolve disagreements, and finally revising the guidelines for clarity and better alignment. This process is repeated until the agreement metrics (such as Krippendorff's alpha or Cohen's kappa) plateau.
Our experience further suggests that while the model training process is generally resilient to variability or "noise" in the training set, highly reliable grades are more important for the validation set, to ensure the ability to measure model performance precisely. If multiple graders reach a consensus via discussion, the grade is generally more reliable than simply taking the most frequent initial grade among them. However, a face-to-face discussion or even an online meeting is often difficult in remote labelling settings. With our customized platform for grading, if desired, cases with disagreements can also be adjudicated by graders via asynchronous discussion to reach consensus [24].
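The agreement metrics named above are one-liners with scikit-learn (one of the open-source packages listed in Box 12.3). A minimal sketch on hypothetical 5-point DR grades from two graders:

```python
# Minimal sketch: chance-corrected intergrader agreement on 5-point DR
# grades via Cohen's kappa. The grades below are hypothetical.
from sklearn.metrics import cohen_kappa_score

grader_a = [0, 1, 2, 2, 4, 0, 3, 1, 0, 2]   # 0 = none ... 4 = proliferative
grader_b = [0, 1, 1, 2, 4, 0, 2, 1, 0, 2]

# Unweighted kappa treats all disagreements equally; quadratic weighting
# penalizes none-vs-proliferative far more than mild-vs-moderate.
print("kappa:         ", cohen_kappa_score(grader_a, grader_b))
print("weighted kappa:", cohen_kappa_score(grader_a, grader_b,
                                           weights="quadratic"))
```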
Optical Coherence Tomography (OCT)

Despite being more expensive than CFP, OCT usage is growing in community eye care settings [25] because it enables the diagnosis of macular conditions with greater accuracy and the identification of early pathological changes. As such, the community use of OCT could lead to better management of patients. Regular remote follow-up via virtual clinics is also rapidly becoming a standard of care [26, 27]. However, such a shift to remote assessment may come at a cost. The lack of sufficient local clinical expertise has led to a high referral or false positive rate, and the increased workload and number of referrals can burden tertiary care sites. This problem is exacerbated by the increase in prevalence of sight-threatening diseases for which OCT is the gold standard of initial assessment [28].
AI offers a potential solution to this problem, by identifying abnormalities and by triaging scans to appropriate virtual clinics. To evaluate the potential of AI in this setting, we applied DL to triage macular OCT [4].

Fig. 12.5  Intergrader variability. The variability between individual graders (columns) is shown for 19 cases (rows). All graders were board-certified ophthalmologists. Each cell shows the DR grade from an individual grader. By looking along a whole row, such as those highlighted by the two black rectangles, one can see cases where there was significant variability between individuals

Fig. 12.6  AI framework from OCT paper [4]. (a) Raw retinal OCT scan (pictured here with 6 × 6 × 2.3 mm³ centered at the macula). (b) Deep segmentation network ensemble, trained with 877 manually segmented OCT scans. (c) Resulting tissue segmentation map hypotheses. (d) Deep classification network ensemble, trained with 14,884 tissue maps with confirmed diagnoses and optimal referral decisions. (e) Predicted diagnosis probabilities for each pathology and referral suggestions. (Reproduced from [4])

In this study the authors proposed a network that consisted of two stages, both consisting of 3D DL models (Fig. 12.6). The first stage automatically segments an OCT scan, creating a 3D tissue segmentation map of up to 15 classes covering anatomy and pathology (including neurosensory retina, retinal pigment epithelium, intraretinal and subretinal fluid, and hyper-reflective material), plus three artifact classes. This tissue map is then passed to the second stage, a classification network that provides a referral suggestion consistent with the clinical pathways in the UK, and one or more specific diagnoses.
The two-step approach produces an intermediate representation: the 3D tissue segmentation. De Fauw et al. showed that this approach offers several advantages (Box 12.2), including easier generalization to new OCT manufacturers, by retraining just the segmentation network using a relatively small number of scans.
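A heavily simplified sketch of this two-stage inference flow is shown below: a segmentation network maps an OCT volume to per-voxel tissue probabilities, and a separate classification network sees only the resulting tissue map, never the raw voxels. The tiny layer stacks, input shape and class counts are placeholders, not the published architecture.

```python
# Minimal sketch of a two-stage OCT pipeline: segmentation -> tissue map
# -> referral classification. All shapes and depths are illustrative.
import numpy as np
import tensorflow as tf

N_TISSUE = 15 + 3      # tissue classes plus artifact classes, per the text
N_REFERRAL = 4         # urgent / semi-urgent / routine / observation only
VOL = (32, 64, 64, 1)  # toy OCT volume: depth, height, width, channel

# Stage 1: per-voxel tissue probabilities (a stand-in for the 3D
# segmentation ensemble described in the text).
seg = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=VOL),
    tf.keras.layers.Conv3D(8, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv3D(N_TISSUE, 1, activation="softmax"),
])

# Stage 2: referral decision from the tissue map alone. Because it never
# sees raw voxels, it is insulated from device-specific appearance.
clf = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=VOL[:3] + (N_TISSUE,)),
    tf.keras.layers.Conv3D(8, 3, strides=2, activation="relu"),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(N_REFERRAL, activation="softmax"),
])

def triage(oct_volume):
    tissue_probs = seg.predict(oct_volume)                    # (b,D,H,W,18)
    tissue_map = tf.one_hot(np.argmax(tissue_probs, -1), N_TISSUE)
    return clf.predict(tissue_map)                            # (b,4)

print(triage(np.random.rand(1, *VOL).astype("float32")))
```

In such an arrangement, supporting a new OCT device would mean retraining only the segmentation stage, which is the generalization advantage the text describes.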

Box 12.2 Advantages of Using Segmentation as an Intermediate Representation
• Generalizability
  – An intermediate representation is ideally device independent, and segmentation offers one way to achieve this. When truly independent, the number of training cases needed to generalize a model to a new device will be considerably reduced, as only the segmentation model needs retraining. Segmentation models can often be trained with fewer examples, and they are more robust to class imbalance.
• Interpretability
  – By highlighting important anatomical and pathological tissue types, the segmentation provides useful information in contextualizing model decisions. A failure of the segmentation, or high variance between different segmentation instances in an ensemble (a group of models trained to perform the same task), can indicate a case that may need to be manually reviewed. Conversely, the presence of any of the predefined pathologies in the segmentation can support a model decision.
• Quantifying pathology
  – It is straightforward to derive clinically important measurements from segmentation results, such as center point thickness, central subfield thickness, and the presence and volumes of intraretinal and subretinal fluid, drusen and various other pathologies.
• Education and assisted read
  – Just as segmentation can help interpret model decisions, it may also be valuable in demonstrating key regions in an image associated with certain diagnoses. This could be used in medical education, particularly given the rapid growth of OCT usage. The fact that AI can assist with grading CFPs [29] also suggests that similar or greater value may be derived for OCT, for which experts are in even shorter supply.
• Subgroup categorization
  – The quantities of segmented tissue compartments or pathology can ease the categorization of patients into subgroups. This may be particularly useful in determining patient eligibility for clinical trials or as part of subgroup analysis in research studies.

The model provides a referral suggestion for over 50 different retinal pathologies that may be of clinical interest. To compare the model performance to human experts, retinal specialists and optometrists determined the reference standard for the test set using the OCT and additional clinical information (CFP, patient history, etc.) that would be available in routine clinical practice.
The model performed on par with retinal specialists, achieving an overall accuracy of 94.5% for triaging into four different referral categories. Encouragingly, not a single case with sight-threatening pathology was classified as normal by the model, implying all such cases would be referred to specialists as expected.

Diabetic Macular Edema (DME)

OCT has been the primary diagnostic modality for DME, which is characterized by retinal thickening, intraretinal fluid (IRF) presence and subretinal fluid (SRF) presence. DME has been the leading cause of blindness in the diabetic population [30, 31], and it can be sight-threatening especially when the pathologies affect an area within 500 μm of the fovea (center-involving (ci-) DME). Therefore, early detection and treatment of ci-DME is essential to prevent vision loss. However, most screening centers are only equipped with CFP cameras, not with OCT devices, by virtue of cost. The presence of hard exudates (HEs) on CFP is used as a proxy, leading to a high false positive rate and unnecessary referrals to specialists [32, 33].
Varadarajan et al. approached this problem by training a DL model to predict the OCT-derived DME labels using only CFPs as inputs. The DME labels included both objective pathologies (e.g. retinal thickness values, intraretinal fluid presence, subretinal fluid presence) and clinical diagnoses such as ci-DME or non-ci-DME where available. The trained DL model produced substantially fewer false positives than doctors who looked for HEs (consistent with current practice) [34]. In terms of its ability to detect the presence of IRF and SRF, the model demonstrated AUCs of 0.81 (95% confidence interval (CI): [0.81, 0.86]) and 0.88 (95% CI: [0.85, 0.91]), respectively (Fig. 12.7).
This study demonstrated the possibility of using DL to detect subtle signals in common imaging modalities (such as CFP) even when the current standard for diagnosis involves more specialized, invasive, or time-consuming measurement modalities. Such specialized modalities include OCT angiography, fluorescein angiography, refraction, intraocular pressure, intraocular length, visual acuity, visual field, ultra-wide field images, stereo fundus images, etc.
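The "quantifying pathology" point from Box 12.2 applies directly here: once a labeled tissue map exists, fluid volumes reduce to counting voxels. A minimal NumPy sketch, with made-up class labels and voxel dimensions:

```python
# Minimal sketch: deriving IRF/SRF volumes from a labeled segmentation map.
# Class labels and voxel spacing are assumptions for illustration only.
import numpy as np

IRF, SRF = 4, 5                      # hypothetical tissue class labels
VOXEL_MM3 = 0.012 * 0.047 * 0.047    # hypothetical voxel size in mm^3

seg_map = np.random.randint(0, 16, size=(49, 128, 128))  # toy tissue map

irf_volume = np.count_nonzero(seg_map == IRF) * VOXEL_MM3
srf_volume = np.count_nonzero(seg_map == SRF) * VOXEL_MM3
print(f"IRF: {irf_volume:.4f} mm^3, SRF: {srf_volume:.4f} mm^3")
```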

Fig. 12.7  ROC curve of a DL model to detect ci-DME taking CFPs as input (sensitivity against 1 − specificity). When evaluated using ci-DME determined based on the OCT, the AUC was 0.89 (95% CI: [0.87, 0.91]). The model's ROC curve was substantially higher than that of retina specialists grading CFPs (overall judgement, or judging by HEs within 500 microns, 1 disc diameter (DD) or 2 DD of the fovea), irrespective of the location of any HEs found and their distance from the fovea [34]

Further research is warranted to investigate if the need for such specialized measurements can be reduced by applying DL to common modalities.

AI for Scientific Discovery

The DL models described so far have been developed to reproduce tasks that ophthalmologists perform, namely detecting eye diseases. Next we will discuss some work beyond typical ophthalmology practice: predicting future eye disease progression and detecting signs of systemic diseases.

Disease Progression

Generally, disease diagnosis is a classification problem, such as "what is visible in this image?" However, it is also possible to predict future disease incidence or progression, using both traditional risk factor-based approaches [35] and AI [36]. The ability to predict patients' risk of adverse events enables better matching of patients to treatment, a process known as risk stratification.

Age-Related Macular Degeneration (AMD) Progression

AMD is the leading cause of irreversible blindness for people aged 50 years or older in industrialized countries [37]. There are several pieces of evidence to support risk stratification for AMD. First, earlier intervention may improve outcomes [38], and the lack of resources to evaluate all patients regularly suggests a need to prioritize scarce resources [39]. It may even be possible to prophylactically treat patients at high risk of progressing to neovascular (nv-) AMD, the most severe form of AMD, a hypothesis that is being actively investigated by several trials [40, 41] (Fig. 12.8). We will next describe two efforts applying DL for AMD risk stratification: the first is based on stereo pairs of CFPs of a single eye, and the second is based on OCT.

Fig. 12.8  Progression from early to exudative AMD. The images depict the progression of AMD over time (0, 5, 8, 11 and 13 months) in the fellow (i.e., other) eye of a patient who is already receiving injections for exudative AMD in their first eye. In this case, on diagnosis of exudative AMD (ex-AMD) in one eye at 0 months, the patient commences intravitreal therapy in the affected first eye and is followed up regularly for further treatment. The timing of each follow-up visit varies depending on the treatment regimen of the first eye as well as patient and clinic factors. At each visit, both CFP and OCT scans are captured for both eyes. The fellow eye converts to ex-AMD at 11 months, highlighted by the red box

Stereo Pairs of CFPs Based AMD Progression

Babenko et al. developed a DL model to predict progression to nv-AMD within 1 year based on stereo pairs of color fundus photographs of a single eye; both fellow eyes and eyes from completely unaffected patients were included. To combine the information from the two images, a late fusion approach was adopted, where the left and right CFPs were processed by neural networks with shared weights and the predictions for the two CFPs were averaged. The model had much higher sensitivity than manually graded severity using either 4-category or 9-step scales, across multiple initial conditions [42].
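The late-fusion arrangement is straightforward to express in Keras: a single backbone instance is applied to each image of the stereo pair, so the weights are shared by construction, and the two per-image predictions are averaged. The backbone, input size and output head below are illustrative assumptions, not the published model.

```python
# Minimal sketch: late fusion of a stereo CFP pair with shared weights.
# Reusing one `backbone` object on both inputs shares its weights; the two
# predictions are then averaged. All details are illustrative.
import tensorflow as tf

backbone = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(299, 299, 3)),
    tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(progression in 1 y)
])

left = tf.keras.Input(shape=(299, 299, 3), name="left_view")
right = tf.keras.Input(shape=(299, 299, 3), name="right_view")

fused = tf.keras.layers.Average()([backbone(left), backbone(right)])
model = tf.keras.Model(inputs=[left, right], outputs=fused)
model.summary()
```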

OCT Based AMD Progression

Where available, OCT is the standard imaging modality for diagnosing and monitoring AMD. The majority of patients who develop nv-AMD in one eye are monitored regularly for conversion in their second (fellow) eye, because patients often rely on the fellow eye in daily life, and the fellow eye is at high risk of losing vision. We curated a dataset of fellow eye OCTs from patients already being treated for nv-AMD in their first eye, and used this dataset to train a DL model to predict fellow eye progression to nv-AMD [43]. Using a ground truth of conversion date, the model achieved a per-volumetric-scan sensitivity of 80% at 55% specificity, and 34% sensitivity at 90% specificity. The model outperformed baselines of predictions based only on drusen and intraretinal hyperreflective material, as well as expert retina specialists and optometrists.

DR Incidence

As most patients will develop DR within two decades of diabetes onset [14], millions of diabetic patients worldwide have been recommended to be regularly screened (yearly or every 2 years) for DR by the International Council of Ophthalmology (ICO) [44], the American Academy of Ophthalmology (AAO) [45], the American Diabetes Association (ADA) [46], and others. These screening programs have provided longitudinal follow-up data that enabled research on predicting DR incidence and progression. The ability to risk-stratify diabetic patients for developing DR could potentially lead to personalized medication and lifestyle coaching, as well as early diagnosis and treatment to avoid vision loss [13, 47]. Preliminary results in identifying patients at high risk of developing DR have been described at the time of this writing [48].

Predicting the Presence of Systemic Conditions

In addition to clinical applications in ophthalmology, recent work has shown the potential for detecting signs of systemic diseases. Many systemic conditions can manifest in the eye, the most prominent example being diabetes. More interestingly, the retina is the only organ where microvasculature is visible for visual examination or imaging non-invasively. We will next cover work on specific systemic diseases.

Cardiovascular Risk Factors

Identifying patients at high risk of future cardiovascular events is crucial in preventing cardiovascular disease [49], the leading cause of global death [50]. Cardiovascular risk factors may manifest in the eyes; for example, severe hypertension can lead to hypertensive retinopathy [51]. Poplin et al. showed that models developed on CFPs can accurately predict blood pressure, age, self-reported sex and other cardiovascular disease risk factors (Fig. 12.9) [53], and these findings were confirmed by other researchers on an external validation set [54]. This demonstrates the potential for further positive interactions between clinical practices for ophthalmic and systemic disease management.
Two of the findings were particularly unexpected: first, though age was known to affect the appearance of the vessels, quantifying age within an error range of a few years was not known to be possible; second, sex-associated differences in the retina were not previously known to appear in CFPs.

Age Systolic blood pressure


a b
250
UK Biobank UK Biobank
80 EyePACS

200
Predicated [mmHg]
Predicated [year]

60

150
40

100
20

20 40 60 80 100 150 200 250


Actual [year] Actual [mmHg]

Fig. 12.9  Predictions of age and systolic blood pressure. the UK Biobank validation dataset [52]. The diagonal
(a) Predicted and actual age in the two validation datasets, lines represent perfect correlation between predicted and
UK Biobank [52] and EyePACS (http://www.eyepacs. actual values. (Plot was recreated using data from [53])
com). (b) Predicted and actual systolic blood pressure on
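Evaluating such continuous predictions mostly comes down to comparing the predicted and actual vectors, for example by mean absolute error and correlation. A tiny sketch with stand-in numbers:

```python
# Minimal sketch: evaluating continuous predictions (e.g. age from CFPs).
# The arrays are stand-ins, not study data.
import numpy as np

actual = np.array([43.0, 57.0, 61.0, 38.0, 70.0])
predicted = np.array([46.1, 54.8, 63.0, 41.2, 66.9])

mae = np.mean(np.abs(predicted - actual))
r = np.corrcoef(actual, predicted)[0, 1]
print(f"MAE: {mae:.1f} years, Pearson r: {r:.2f}")
```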

To better understand how the models compute their predictions, we applied soft attention [55] to produce heatmaps of where the model focused for each image (Fig. 12.10). The findings suggested that the models used different parts of the images for each task: age-related features were distributed over the whole eye, smoking-status-related features were along the blood vessels, and self-reported sex features were near the macula.

Anemia

Anemia is another example of a systemic disease posing global public health problems [56].

Fig. 12.10  Attention maps. (a) The top left image is a sample color fundus photograph from the UK Biobank dataset. (b), (c) and (d) The remaining images show the same retinal image in black and white with the soft attention heatmap overlaid in green, indicating the areas that the DL model is using to make the prediction for the image: age (actual 57.6 years, predicted 59.1 years), self-reported sex (actual female, predicted female) and current smoker status (actual nonsmoker, predicted nonsmoker). (Plot was recreated using data from [53])

Extending the previous work on predicting cardiovascular risk factors, Mitani et al. showed that ML models can predict blood hemoglobin levels and detect anemia from CFPs [57] (Fig. 12.11). In this study, multiple model explanation methods were applied to show that most of the attribution of the model goes to the optic disc and the blood vessels nearby. In addition, they used occlusion analysis to examine how performance is impacted by removing certain parts of the image. This analysis confirmed that the region around the optic disc is most crucial for predicting hemoglobin and anemia.
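Occlusion analysis of this kind needs no access to model internals: mask a region of every image, re-score the set, and compare the resulting AUC against the unmasked baseline. A schematic sketch, in which `model`, `images` (N, H, W, 3) and binary `labels` are assumed to already exist:

```python
# Minimal sketch of occlusion analysis: zero out a central square, re-run
# the model, and compare AUCs. `model`, `images` and `labels` are assumed
# to exist; the choice of region and fraction is illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_center_blocked(model, images, labels, frac=0.4):
    """AUC after blocking a central square covering `frac` of the area."""
    masked = images.copy()
    h, w = images.shape[1:3]
    sh, sw = int(h * frac ** 0.5), int(w * frac ** 0.5)
    top, left = (h - sh) // 2, (w - sw) // 2
    masked[:, top:top + sh, left:left + sw, :] = 0.0
    return roc_auc_score(labels, model.predict(masked).ravel())

# Compare against the unmasked AUC to estimate how much of the signal the
# blocked region carried:
# baseline = roc_auc_score(labels, model.predict(images).ravel())
# print(baseline - auc_with_center_blocked(model, images, labels))
```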
Challenges in Translating AI to Clinical Practice

We have described many applications that may, if deployed, improve patient care. To achieve this potential it is essential to couple developmental research with user research to ensure the models are meeting real clinical needs. Clinicians will need to be able to respond to model decisions appropriately, and to engage with models in a proactive way. While this is still an active area of research, a few specific examples demonstrate the required work. We investigated the impact of DL models for DR on graders when presented as AI-based assistance, finding improvements in accuracy and confidence at the expense of grading time [29]. We further conducted an observational human-centered study [62] to identify socio-environmental factors that impact model performance in a number of real-world clinical deployments [63]. The insights from this research will ultimately inform clinical trials. Prospective evidence of efficacy [18] and additional studies on clinical impact, such as economic cost and patient outcomes, are also important in understanding clinical applicability.

Fig. 12.11  Saliency maps and effects of occluding parts of the image on the prediction of anemia from CFPs. (a) Example CFP from UK Biobank. (b) Saliency map for predicting anemia using GradCAM [58]. (c) Smooth integrated gradients [59, 60]. (d) Guided backprop [61]. (e) Effects of masking the top and bottom of the image on the prediction of anemia and moderate anemia, as a function of the proportion of blocked area. (f) Masked CFP example for (e): 80% of the area blocked, anemia AUC 0.79, moderate anemia AUC 0.87. (g) Same as (e) for masking the center core. (h) Masked CFP example for (g): 40% of the area blocked, anemia AUC 0.80, moderate anemia AUC 0.88. The proportions of blocked area for (f) and (h) were chosen to give an AUC for predicting anemia close to 0.80, illustrating how much of the image must be occluded to produce a similar performance decrease and highlighting the importance of the optic disc in detecting anemia from CFPs. (Plot was recreated using data from [57])

As illustrated by these studies, it is crucial not only to show that AI is accurate on retrospective datasets, but also to demonstrate how to implement such techniques in a way that can benefit everyone in practice [64]. Regulatory bodies have to balance safety and innovation, and standards have not yet been well established for this new technology. For each application, the evaluation has to be clinically meaningful. Comparison methods between algorithms should be established, including how to choose a representative test set, so that users can optimize the selection for the target use-cases and monitor any potential performance shifts even after the initial decision. In addition, further research is needed to understand how and when AI-based systems can introduce bias or even fail, so that we can monitor for and mitigate potential harm. For example, a system can learn to use confounders that further reinforce existing bias, or may learn features only helpful for specific populations. Alternatively, if it learned spurious correlations that exist only in the original datasets, the system may underperform when applied elsewhere. Improving the interpretability of the models, together with careful subgroup analysis, would help identify such issues and contribute to building more transparent and trustworthy systems that serve the global population.

Summary

In this chapter we have provided a brief overview of the contributions from Google and DeepMind to the field of AI in Ophthalmology. We have discussed several different phases, from early research and development (see also Box 12.3) to implementations in clinical trials, with examples of each. Finally, we have demonstrated the potential of AI for Scientific Discovery. A considerable amount of work remains ahead to answer fundamental questions around topics such as bias, uncertainty, safety, interpretability and generalizability. As these hurdles and challenges are overcome, we are confident that AI will transform patient care in ophthalmology.

Box 12.3 Open Source Software for AI
• Google and DeepMind contribute open-source tools and models to the community:
  – TensorFlow framework (www.tensorflow.org) for model development and serving [65]
  – Google Colaboratory (colab.research.google.com) for data analysis and preparation [66]
  – Model architecture used for predicting conversion to wet age-related macular degeneration using deep learning [43, 67]
• We also use common open-source software and models:
  – Inception-v3 network architecture ([68], available in TensorFlow as InceptionV3 [69]) for fundus models
  – UNet [70] architecture for OCT segmentation
  – NumPy [71]
  – Pandas
  – Matplotlib
  – Seaborn
  – SciPy
  – Sklearn
• Proprietary custom software is used for other tasks specific to our computing infrastructure, such as medical data preprocessing and management, scalable image grading, distributed training of DL models, runtime inference and model serving, most of which is available in similar third-party offerings.

Acknowledgements We would like to thank Y. Liu, D. Webster, O. Ronneberger and P. Kohli for their guidance and feedback.

Bibliography

1. Bernardes R, et al. Digital ocular fundus imaging: a review. Ophthalmologica. 2011;226(4):161–81. https://www.karger.com/Article/Abstract/329597
2. Drexler W, Fujimoto JG. State-of-the-art retinal optical coherence tomography. Prog Retin Eye Res. 2008;27(1):45–88. https://www.sciencedirect.com/science/article/pii/S1350946207000444
3. Gulshan V, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10. https://jamanetwork.com/journals/jama/fullarticle/2588763
4. De Fauw J, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–50. https://www.nature.com/articles/s41591-018-0107-6
5. Lee CS, et al. Deep learning is effective for classifying normal versus age-related macular degeneration OCT images. Ophthalmol Retina. 2017;1(4):322–7. https://www.sciencedirect.com/science/article/pii/S2468653016301749
6. Ting DSW, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019;103:167–75. https://bjo.bmj.com/content/103/2/167.abstract
7. Chen X, et al. Automatic feature learning for glaucoma detection based on deep learning. In: International conference on medical image computing and computer-assisted intervention; 2015. p. 669–77. https://link.springer.com/chapter/10.1007%2F978-3-319-24574-4_80
8. Chen X, et al. Glaucoma detection based on deep convolutional neural network. In: 2015 37th annual international conference of the IEEE engineering in medicine and biology society (EMBC); 2015. p. 715–8. https://ieeexplore.ieee.org/document/7318462
9. Long E, et al. An artificial intelligence platform for the multihospital collaborative management of congenital cataracts. Nat Biomed Eng. 2017;1(2):1–8. https://www.nature.com/articles/s41551-016-0024
10. Abràmoff MD, et al. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digital Med. 2018;1:39. https://www.nature.com/articles/s41746-018-0040-6
11. American Diabetes Association. Microvascular complications and foot care: standards of medical care in diabetes. Diabetes Care. 2020;43(Suppl 1):S135–51. https://care.diabetesjournals.org/content/43/Supplement_1/S135
12. Cheung N, et al. Diabetic retinopathy. Lancet. 2010;376(9735):124–36. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(09)62124-3
13. Ting DSW, et al. Diabetic retinopathy: global prevalence, major risk factors, screening practices and public health challenges: a review. Clin Exp Ophthalmol. 2016;44:260–77. https://doi.org/10.1111/ceo.12696
14. Fong DS, et al. Retinopathy in diabetes. Diabetes Care. 2004;27(Suppl 1):S84–7. https://care.diabetesjournals.org/content/27/suppl_1/s84
15. Guariguata L, et al. Global estimates of diabetes prevalence for 2013 and projections for 2035. Diabetes Res Clin Pract. 2014;103(2):137–49. https://www.sciencedirect.com/science/article/pii/S0168822713003859
16. Wilkinson CP, et al. Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales. Ophthalmology. 2003;110(9):1677–82. https://www.ncbi.nlm.nih.gov/pubmed/13129861
17. Krause J, et al. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology. 2018;125(8):1264–72. https://www.aaojournal.org/article/S0161-6420(17)32698-2/abstract
18. Gulshan V, et al. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. 2019;137(9):987–93. https://jamanetwork.com/journals/jamaophthalmology/fullarticle/2734990. Accessed Apr 2020.
19. Ruamviboonsuk P, et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. npj Digital Med. 2019;2. https://www.nature.com/articles/s41746-019-0099-8
20. Detecting Center-Involved Diabetic Macular Edema from Analysis of Retina Images Using Deep Learning. 2018. http://www.clinicaltrials.in.th/index.php?tp=regtrials&menu=trialsearch&smenu=fulltext&task=search&task2=view1&id=3819
21. Quigley HA, Broman AT. The number of people with glaucoma worldwide in 2010 and 2020. Br J Ophthalmol. 2006;90:262–7. https://bjo.bmj.com/content/90/3/262
22. Tham Y-C, et al. Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology. 2014;121(11):2081–90. https://www.sciencedirect.com/science/article/abs/pii/S0161642014004333
23. Phene S, et al. Deep learning and glaucoma specialists: the relative importance of optic disc features to predict glaucoma referral in fundus photographs. Ophthalmology. 2019;126(12):1627–39. https://www.sciencedirect.com/science/article/pii/S0161642019318755
24. Schaekermann M, et al. Remote tool-based adjudication for grading diabetic retinopathy. Transl Vis Sci Technol. 2019;8(40). http://tvst.arvojournals.org/article.aspx?articleid=2757836
25. Fidalgo BR, et al. Role of advanced technology in the detection of sight-threatening eye disease in a UK community setting. BMJ Open Ophthalmol. 2019;4(1). https://bmjophth.bmj.com/content/4/1/e000347
26. Buchan JC, et al. How to defuse a demographic time bomb: the way forward? Eye. 2017;31:1519–22. https://www.nature.com/articles/eye2017114
27. Whited JD, et al. A modeled economic analysis of a digital teleophthalmology system as used by three federal healthcare agencies for detecting proliferative diabetic retinopathy. Telemed e-Health. 2005;11:641–51. https://doi.org/10.1089/tmj.2005.11.641
28. Bourne RRA, et al. Magnitude, temporal trends, and projections of the global prevalence of blindness and distance and near vision impairment: a systematic review and meta-analysis. Lancet Glob Health. 2017;5:e888–97. https://www.thelancet.com/journals/langlo/article/PIIS2214-109X(17)30293-0/fulltext
29. Sayres R, et al. Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy. Ophthalmology. 2019;126(4):552–64. https://www.aaojournal.org/article/S0161-6420(18)31575-6/fulltext
30. Mitchell P, et al. Cost-effectiveness of ranibizumab in treatment of diabetic macular oedema (DME) causing visual impairment: evidence from the RESTORE trial. Br J Ophthalmol. 2012;96:688–93. https://bjo.bmj.com/content/96/5/688
31. Romero-Aroca P. Managing diabetic macular edema: the leading cause of diabetes blindness. World J Diabetes. 2011;2(6):98–104. https://www.ncbi.nlm.nih.gov/pubmed/21860693
32. Mackenzie S, et al. SDOCT imaging to identify macular pathology in patients diagnosed with diabetic maculopathy by a digital photographic retinal screening programme. PLoS One. 2011;6(5):e14811. https://doi.org/10.1371/journal.pone.0014811
33. Wong RL, et al. Are we making good use of our public resources? The false-positive rate of screening by fundus photography for diabetic macular oedema. Hong Kong Med J. 2017;23(4):356–64. https://pubmed.ncbi.nlm.nih.gov/28684650/
34. Varadarajan AV, et al. Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning. Nat Commun. 2020;11(130). https://www.nature.com/articles/s41467-019-13922-8
35. D'Agostino RB, et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743–53. https://www.ncbi.nlm.nih.gov/pubmed/18212285
36. Tomašev N, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature. 2019;572:116–9. https://www.nature.com/articles/s41586-019-1390-1
37. Wong WL, et al. Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. Lancet Glob Health. 2014;2(2):e106–16.
38. Lim JH, et al. Delay to treatment and visual outcomes in patients treated with anti-vascular endothelial growth factor for age-related macular degeneration. Am J Ophthalmol. 2012;153(4):678–86. https://www.ajo.com/article/S0002-9394(11)00721-5
39. Action on AMD. Optimising patient management: act now to ensure current and continual delivery of best possible patient care. Eye. 2012;26(S1). https://www.nature.com/articles/eye2011342
40. Heier JS. IAI versus sham as prophylaxis against conversion to neovascular AMD (PRO-CON). clinicaltrials.gov. https://clinicaltrials.gov/ct2/show/NCT02462889
41. Southern California Desert Retina Consultants, MC. Prophylactic ranibizumab for exudative age-related macular degeneration (PREVENT). clinicaltrials.gov. https://clinicaltrials.gov/ct2/show/NCT02140151
42. Babenko B, et al. Predicting progression of age-related macular degeneration from fundus images using deep learning. arXiv; Apr 2019. https://arxiv.org/pdf/1904.05478.pdf
43. Yim J, et al. Predicting conversion to wet age related macular degeneration using deep learning. Nat Med. 2020. https://www.nature.com/articles/s41591-020-0867-7
44. International Council of Ophthalmology. ICO guidelines for diabetic eye care. 2017. http://www.icoph.org/downloads/ICOGuidelinesforDiabeticEyeCare.pdf
45. AAO PPP Retina/Vitreous Committee, Hoskins Center for Quality Eye Care. Diabetic Retinopathy PPP 2019. 2019. https://www.aao.org/preferred-practice-pattern/diabetic-retinopathy-ppp
46. Solomon SD. Diabetic retinopathy: a position statement by the American Diabetes Association. Diabetes Care. 2017;40(3):412–8. https://care.diabetesjournals.org/content/40/3/412
47. Dornhorst A, Merrin PK. Primary, secondary and tertiary prevention of non-insulin-dependent diabetes. Postgrad Med J. 1994;70(826):529–35. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2397691
48. Bora A, et al. Deep learning for predicting the progression of diabetic retinopathy using fundus images. ARVO abstract; 2020.
49. Goff DC, et al. ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation. 2014;129:S49–73.
50. WHO. The top 10 causes of death. 2018. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
51. Wong TY, Mitchell P. Hypertensive retinopathy. N Engl J Med. 2004;351(22):2310–7.
52. Sudlow C, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779.
53. Poplin R, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2:158–64. https://www.nature.com/articles/s41551-018-0195-0
54. Ting DSW, Wong TY. Eyeing cardiovascular risk factors. Nat Biomed Eng. 2018;2:140–1. https://www.nature.com/articles/s41551-018-0210-5
55. Xu K, et al. Show, attend and tell: neural image caption generation with visual attention. 2015. https://arxiv.org/abs/1502.03044
56. McLean E, et al. Worldwide prevalence of anaemia, WHO vitamin and mineral nutrition information system, 1993–2005. Public Health Nutr. 2009;12(4):444–54. https://pubmed.ncbi.nlm.nih.gov/18498676/
57. Mitani A, et al. Detection of anaemia from retinal fundus images via deep learning. Nat Biomed Eng. 2020;4:18–27. https://www.nature.com/articles/s41551-019-0487-z
58. Selvaraju RR, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 2019;128:336–59.
59. Smilkov D, et al. SmoothGrad: removing noise by adding noise. 2017. https://arxiv.org/abs/1706.03825
60. Sundararajan M, et al. Axiomatic attribution for deep networks. In: Proceedings of the 34th international conference on machine learning; 2017. p. 3319–28.
61. Springenberg JT, et al. Striving for simplicity: the all convolutional net. 2014. https://arxiv.org/abs/1412.6806
62. Jaimes A, et al. Human-centered computing: toward a human revolution. Computer. 2007;40(5):30–4.
63. Beede E, et al. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In: Proceedings of the 2020 CHI conference on human factors in computing systems; 2020. p. 1–12. https://doi.org/10.1145/3313831.3376718
64. Kelly CJ, et al. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17:195. https://doi.org/10.1186/s12916-019-1426-2
65. Abadi M, et al. TensorFlow: a system for large-scale machine learning. In: OSDI'16: proceedings of the 12th USENIX conference on operating systems design and implementation; 2016. p. 265–83. https://doi.org/10.5555/3026877.3026899
66. Carneiro T, et al. Performance analysis of Google Colaboratory as a tool for accelerating deep learning applications. IEEE Access. 2018;6:61677–85. https://ieeexplore.ieee.org/abstract/document/8485684
67. Google Health. Model architecture for predicting conversion to wet age related macular degeneration using deep learning. https://github.com/google-health/imaging-research/wet-amd-prediction
68. Szegedy C, et al. Rethinking the inception architecture for computer vision. Comput Vis Pattern Recognit. 2016. https://www.researchgate.net/publication/306281834_Rethinking_the_Inception_Architecture_for_Computer_Vision
69. InceptionV3. https://www.tensorflow.org/api_docs/python/tf/keras/applications/InceptionV3
70. Ronneberger O, et al. U-Net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention; 2015. p. 234–41. https://doi.org/10.1007/978-3-319-24574-4_28
71. van der Walt S, et al. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng. 2011;13(2):22–30. https://www.researchgate.net/publication/224223550_The_NumPy_Array_A_Structure_for_Efficient_Numerical_Computation
72. Bouskill KE, et al. Blind spots in telemedicine: a qualitative study of staff workarounds to resolving gaps in chronic disease care. BMC Health Serv Res. 2018;18:617. https://research.google/pubs/pub47345/
73. Google. AI for social good in Asia Pacific. The Keyword; Dec 2018. https://www.blog.google/around-the-globe/google-asia/ai-social-good-asia-pacific
74. Google Research. TensorFlow: large-scale machine learning on heterogeneous distributed systems. 2015. https://www.tensorflow.org/about/bib
75. Hutson M, et al. Quality control challenges in crowdsourcing medical labeling. 2019. https://research.google/pubs/pub48327/
76. Schaekermann M, et al. Expert discussions improve comprehension of difficult cases in medical image assessment. In: CHI conference on human factors in computing systems (CHI '20), April 25–30, 2020, Honolulu, HI. New York: ACM; 2020. https://doi.org/10.1145/3313831.3376290
77. Shlens J. Train your own image classifier with Inception in TensorFlow. Google AI Blog; 9 Mar 2016. https://ai.googleblog.com/2016/03/train-your-own-image-classifier-with.html. Accessed 6 May 2020.
78. Smith-Morris C, et al. Diabetic retinopathy and the cascade into vision loss. Med Anthropol. 2020;39(2):109–22. https://pubmed.ncbi.nlm.nih.gov/29338335/
79. Verily. Launching a powerful new screening tool for diabetic eye disease in India. Verily Blog; 2019. https://blog.verily.com/2019/02/launching-powerful-new-screening-tool.html. Accessed Apr 2020.
tecture for computer vision. Comput Vis Pattern
13 Singapore Eye Lesions Analyzer (SELENA): The Deep Learning System for Retinal Diseases

David Chuen Soong Wong, Grace Kiew, Sohee Jeon, and Daniel Ting

Introduction

Machine learning (ML) describes the process of identifying patterns within data and thereby predicting relationships based on the learned patterns. Since this simulates human intelligence in part, ML is considered a form of "artificial intelligence" (AI), a term coined by McCarthy in 1956 [1]. Deep learning (DL) is a category of ML algorithms that use artificial neural networks to process data such as images, passing them through many layers of interconnected mathematical nodes. These "neurons" detect features progressively to eventually provide an output, typically a classification (e.g. disease or no disease) [2]. Due to innovations in parallelised computing, DL has emerged as a viable, useful and popular form of AI, as its performance outstrips traditional machine learning techniques in many applications [2–5].
AI applications are already changing healthcare as we know it in myriad ways and at multiple levels [6]. At the health systems level, AI could improve the efficiency of hospitals and community care by assisting decision making in the allocation of resources and the monitoring of unwell and well patients alike. At the patient level, the advances in AI and other technologies are heralding a new era of telemedicine, facilitating the monitoring of health at home and holistic, personalised healthcare. For clinicians, AI has the potential to revolutionise the practice of Ophthalmology and many other specialties, largely through the automated detection of lesions [6].
Ophthalmology is at the forefront of AI adoption in healthcare [7–9]. Deep learning systems have been used to diagnose a range of leading causes of blindness from digital fundus photographs, such as diabetic retinopathy (DR) [7, 10–12], age-related macular degeneration (AMD) [13–15], glaucoma [15, 16] and retinopathy of prematurity [17], as well as to predict risk factors of cardiovascular disease [18]. Deep learning has also been used for disease diagnosis and for monitoring progression and treatment response using optical coherence tomography (OCT) imaging [19–21].

Financial Disclosure: Dr. Ting holds a patent for a deep learning system for retinal diseases, and is co-founder and shareholder of EyRIS, Singapore.

D. C. S. Wong, School of Clinical Medicine, University of Cambridge, Cambridge, UK
G. Kiew, Royal Bournemouth and Christchurch Hospitals NHS Foundation Trust, Bournemouth, UK
S. Jeon, Keye Eye Center, Seoul, South Korea
D. Ting (*), Duke-NUS Medical School, Singapore National Eye Center, Singapore, Singapore


In order to implement these DL systems, multiple stages of testing must be carried out to confirm the efficacy of the methods and to ensure patient safety. The Singapore Eye Lesion Analyzer (SELENA) was developed and tested at the Singapore Eye Research Institute (SERI) in 2017 [15] to aid tertiary care referral decisions based on fundoscopy photographs taken during routine screening for diabetic retinopathy (DR). The conception of this system showcases some of the remarkable capabilities of deep learning in Ophthalmology and healthcare more widely. In this chapter we discuss the initial development of SELENA in Singapore, its testing on African populations, and the detection of cardiovascular risk factors.

SELENA: Development, Validation and Testing

Diabetes is a major and increasing global health challenge: 600 million people are predicted to be diabetic by 2040, and roughly a third will have diabetic retinopathy (DR) [22]. DR is one of the most prevalent causes of irreversible blindness in the world [23, 24], and the Asia-Pacific region will see the greatest increase in the number of patients with DR during the ongoing diabetes epidemic [23, 25]. Screening with subsequent timely referral and treatment is a universal strategy for preventing DR, and can be performed by many health practitioners, including but not limited to ophthalmologists, optometrists and general practitioners. However, such screening programmes face an increasing burden of patients, limited availability of human assessors and poor financial sustainability [26]. A multi-disciplinary group of researchers from the Singapore Eye Research Institute (SERI) addressed this issue by tapping into the recent developments in AI technology described in the introduction of this chapter. The overall goal was to develop a system that would improve patient outcomes by increasing the efficiency of DR screening in many populations.
The Singapore Eye Lesion Analyzer (SELENA) provides a recommendation on the need for referral to tertiary ophthalmic care, as well as a severity grade for diabetic retinopathy (DR). In addition to multiple grades of DR, SELENA is also able to detect other leading causes of blindness globally: possible glaucoma and age-related macular degeneration (AMD). The first objective of the project was to train and validate this system to detect these diseases by analysing retinal fundus images obtained from community-based DR screening of patients in Singapore. Further external validation of the performance of the system with referable DR was carried out using datasets from 10 multi-ethnic populations collected from different countries with diverse community- and hospital-based diabetic populations. Another objective of the project was to determine how SELENA might fit into two different models of DR screening: fully automated (in countries without national screening programmes) or assistive (semi-automated, with referable cases detected by SELENA assessed by humans).

Design

A key strength of SELENA is the multi-ethnic nature of the patients contributing to all the datasets used during training and validation. For example, the training set for referable diabetic retinopathy contained 76,370 images taken from 13,099 patients of Chinese, Malay and Indian ethnicities, obtained 2010–2013 by the Singapore National Diabetic Retinopathy Screening Programme. The validation set consisted of images taken between 2014 and 2015 from the same screening programme. External validation was extremely thorough, using 10 datasets that included a total of 112,648 images from 15,157 patients of Chinese, Malay, Indian, White, Hispanic and African-American ethnicities. This was a striking effort to reduce ethnic bias during the development of the algorithm, and to ensure that its performance was generalisable. Images were captured by a range of camera types and labelled by multiple assessors, always including subspecialists in retina or glaucoma (for the glaucoma dataset), and experienced professional graders.
SELENA was composed of eight modified versions of a convolutional neural network (CNN) known as VGG-19 (Fig. 13.1).

Fig. 13.1  Convolutional neural network architecture of SELENA. The algorithm consists of 8 modified variants of the VGG-19 CNN. Figure reproduced with permission from Ting et al. 2017 (JAMA) [15]; please see the original paper for full details. Briefly, a template retinal fundus photograph (a) is processed by a deep CNN consisting of a succession of network modules, each containing a series of convolutional maps, resulting in a final output node (f) for each class trained for. For the classification of severity, a second deep CNN is given locally contrast-normalized images (g) as input; the final disease severity score is then the mean of the outputs. This is repeated for DR, AMD and glaucoma. Additional CNNs were trained to reject images of insufficient image quality, as well as invalid input (i.e. not a retinal image)
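The decision flow implied by the caption (reject non-retinal or ungradable inputs, then average a template-image CNN's score with a contrast-normalized-image CNN's score for each disease) can be sketched as below. The model objects, dictionary layout and threshold are placeholders, not the published implementation.

```python
# Minimal sketch of a SELENA-style decision flow: rejection checks first,
# then per-disease ensembling of paired CNN scores. All names, models and
# the threshold are placeholders for illustration.
def selena_like_grade(img, norm_img, disease_nets, reject_nets, thresh=0.5):
    """disease_nets: {'DR': (cnn_a, cnn_b), 'AMD': ..., 'glaucoma': ...};
    reject_nets: {'nonretinal': cnn, 'ungradable': cnn}.
    `img` is the template image, `norm_img` its contrast-normalized twin."""
    for reason, net in reject_nets.items():
        if net.predict(img[None])[0, 0] > thresh:
            return {"rejected": reason}
    # Each disease score is the mean of the two networks' outputs.
    return {
        disease: float((cnn_a.predict(img[None])[0, 0] +
                        cnn_b.predict(norm_img[None])[0, 0]) / 2.0)
        for disease, (cnn_a, cnn_b) in disease_nets.items()
    }
```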

Two networks classified DR severity, two networks identified possible referable glaucoma, two networks identified referable AMD, one network assessed image quality, and one network rejected invalid non-retinal images. Each CNN was trained by progressive exposure to randomly selected images from the training set together with the ground truth as determined by the human labellers, thus gradually learning the appropriate features (i.e. modifying the weight values) for classification via gradient descent.

Results

In the primary validation dataset there were 71,896 images from 14,880 patients, with a mean age of 60.2 years (SD 2.2); 54.6% were men. Within this cohort, the prevalence of referable DR was 3%, vision-threatening DR 0.6%, possible glaucoma 0.1% and AMD 2.5%. Standard measures of performance were used to show that for referable DR, SELENA achieved an AUC of 0.936 (95% CI, 0.925–0.943), sensitivity of 90.5% (95% CI, 87.3%–93.0%), and specificity of 91.6% (95% CI, 91.0%–92.2%). For vision-threatening DR, the statistics showed greater AUC and sensitivity and slightly lower specificity: AUC of 0.958 (95% CI, 0.956–0.961), sensitivity of 100% (95% CI, 94.1%–100.0%), and specificity of 91.1% (95% CI, 90.7%–91.4%). The key statistics are shown in Table 13.1. This performance was generalizable, as the AUCs from the 10 external validation sets ranged between 0.889 and 0.983.

Table 13.1  Performance of SELENA on the primary validation dataset, showing area under the receiver operating characteristic curve (AUC), sensitivity and specificity

Disease category | AUC (95% CI) | Sensitivity, % (95% CI) | Specificity, % (95% CI)
Referable DR | 0.936 (0.925–0.943) | 90.5 (87.3–93.0) | 91.6 (91.0–92.2)
VTDR | 0.958 (0.956–0.961) | 100 (94.1–100.0) | 91.1 (90.7–91.4)
Possible glaucoma | 0.942 (0.929–0.954) | 96.4 (81.7–99.9) | 87.2 (86.8–87.5)
AMD | 0.931 (0.928–0.935) | 93.2 (91.1–99.8) | 88.7 (88.3–89.0)

DR diabetic retinopathy, VTDR vision-threatening diabetic retinopathy, AMD age-related macular degeneration, CI confidence interval
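Numbers like those in Table 13.1 follow a standard recipe: the AUC is computed from the continuous scores, while sensitivity and specificity are read off at the pre-fixed operating threshold discussed below. A sketch with hypothetical scores and labels, using scikit-learn:

```python
# Minimal sketch: AUC plus sensitivity/specificity at a pre-fixed threshold.
# Scores and labels are hypothetical, not study data.
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])     # 1 = referable DR
scores = np.array([.1, .3, .8, .7, .2, .9, .4, .1, .6, .3])
THRESHOLD = 0.5                                        # fixed before testing

pred = scores >= THRESHOLD
sensitivity = (pred & (labels == 1)).sum() / (labels == 1).sum()
specificity = (~pred & (labels == 0)).sum() / (labels == 0).sum()
print(f"AUC={roc_auc_score(labels, scores):.3f}  "
      f"sens={sensitivity:.2f}  spec={specificity:.2f}")
```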

It is difficult to compare the typical statistical measures of performance between deep learning systems due to the different datasets used for training and validation, and the different reference standards applied [9]. It therefore follows that statistical measures of performance must be supplemented with additional information in order to gauge not only the clinical performance and relevance of the algorithm, but also the technical implications of its implementation in the real world. For example, many algorithms detecting DR have been developed, and all have achieved good performance statistically, but they vary considerably in their architecture and dataset characteristics [9]. The SELENA system was tested using a pre-fixed operating threshold on all of the testing sets, including the 10 external validation sets that encompassed all the ethnic groups mentioned above. SELENA still performed well under these stringent testing conditions, suggesting that its architecture was robust and generalisable, and therefore suitable for the next stages of testing.

Additional Testing of SELENA on the African Population

All AI systems need to be validated in order to prove real-world feasibility. While an AI system may perform well on the clinical trial dataset on which it was trained, this may not always translate to a similar performance in a primary care population in a real-world setting, where most screening programmes take place. The architecture of SELENA was therefore adapted and validated with further populations.
Bellemo et al. [27] described a deep learning system for diabetic retinopathy that was trained with 76,370 retinal fundus images from 13,099 patients with diabetes who had participated in the Singapore Integrated Diabetic Retinopathy Program. The AI system was trained using an ensemble of a VGGNet and a ResNet architecture (Fig. 13.2) prior to being validated on an entirely different population of patients attending a mobile screening unit in five urban centres in the Copperbelt province of Zambia. The validation dataset consisted of 4504 fundus images from 1574 patients with diabetes who were invited to take part in a retinopathy screening programme run by the Kitwe Central Hospital Eye Unit, in partnership with Konkola Copper Mines (Chingola, Zambia) and Frimley Park Hospital Eye Department (Frimley, UK).
When comparing the training and validation datasets, the notable differences include the ethnic composition of the dataset populations (a mixture of Chinese, Malay, Indian and other ethnicities for the training dataset from Singapore, and 100% African in the validation dataset from Zambia). The retinal cameras used for capturing fundus images, the image resolution and the width of field also differed between the two datasets.
Validation using this prospective external dataset showed a high AUC of the AI system for referable diabetic retinopathy of 0.973 (95% CI 0.969–0.978), with sensitivity of 92.25% (95% CI 90.10–94.12) and specificity of 89.04% (95% CI 87.85–90.28). The system also attained a sensitivity of 99.42% (95% CI 99.15–99.68) for vision-threatening diabetic retinopathy and a sensitivity of 97.19% (95% CI 96.61–97.77) for diabetic macular oedema. As described earlier in the chapter, SELENA has previously been shown to have good performance with African American Eye Disease Study datasets [15], providing further evidence for the reliability of this AI system in detecting referable diabetic retinopathy in dark fundi.
Taking this further, the Singapore team sought to find risk factors for referable DR from the information acquired during the study.

Fig. 13.2  Modified network architecture of SELENA. The algorithm was modified to include a ResNet model to support more layers and thus analyse more image features. (The schematic shows a template retinal fundus image entering parallel VGGNet and ResNet modules, built from convolutional, max-pooling, fully connected and average pooling layers, with an identity shortcut connection in the ResNet branch; the VGGNet score and ResNet score are combined into an ensembled image score.) Figure reproduced from Bellemo et al. 2019 (Lancet Digital Health) [27]
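The score-level fusion shown in Fig. 13.2 can be illustrated with a minimal Python sketch. Here `vgg_model`, `resnet_model` and their `predict` methods are stand-ins for the two trained networks, and the equal weighting is an assumption of this illustration rather than the published configuration:

```python
def ensembled_dr_score(image, vgg_model, resnet_model):
    """Average the two networks' outputs into one image-level score,
    mirroring the score-level fusion sketched in Fig. 13.2. Equal
    weighting is an assumption of this illustration."""
    vgg_score = vgg_model.predict(image)        # scalar score from VGGNet
    resnet_score = resnet_model.predict(image)  # scalar score from ResNet
    return 0.5 * (vgg_score + resnet_score)     # ensembled image score
```

Averaging the outputs of two differently structured networks tends to reduce the variance of the prediction relative to either network alone, which is the usual motivation for ensembling.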

A multivariate analysis was performed to look at systemic risk factors for referable DR for the AI model and human graders; both identified the same risk factors of longer duration of diabetes, higher levels of glycated haemoglobin and increased systolic blood pressure as associated with referable DR. The AUCs for systemic risk factors for the AI model (0.723; 95% CI 0.691–0.754) and human graders (0.741; 95% CI 0.710–0.771) were comparable (p = 0.432).

This study provides further evidence for the clinical feasibility of AI applications in population screening programmes, with the AI system performing to an acceptable clinical standard, comparable to human graders, in a prospectively recruited population in a resource-poor country, despite having been trained on a population with a completely different ethnic composition. This is especially relevant as the greatest need for AI applications is in third world countries with poor healthcare resources and a shortage of health professionals, where patients struggle to access healthcare expertise. Low-resource countries stand to gain the most from the application of AI systems to healthcare screening programmes, and this study represents a potential way forward in which the AI model replaces human graders in identifying patients requiring referral for further assessment and treatment. The implications of this are far-reaching, as AI systems may be able to perform screening comparable to that of human graders in a fraction of the time previously needed and with minimal human resource consumption, allowing the development of screening programmes previously limited by lack of manpower and resources in countries such as Zambia. Ophthalmologists in these areas can then focus their resources and time towards treating cases with sight-threatening diabetic retinopathy.

However, lack of resources in countries such as Zambia may also pose other challenges to the development of an AI-based screening programme, owing to the poor telecommunication infrastructure in place. A feasible strategy would have to take into account the need for computing power and telecommunication network requirements, either by providing the necessary elements for the AI system to be implemented, or alternatively by using the AI system as a standalone system or integrating it with the retinal cameras to be used. Moreover, screening programmes by themselves do not improve a population's health; a robust treatment strategy for those identified by the screening programme would also need to be in place to make a significant difference to the health of the general population.

Systemic Vascular Risk Factor Associations

DL systems may also be very useful in the research setting. Epidemiologic research in particular would benefit from DL systems because vast amounts of data may be efficiently analysed. To explore potential roles in this space, the performance of SELENA was compared to that of human assessors in reviewing retinal images for DR prevalence and risk factors [28].

A total of 93,293 retinal images from 18,912 patients of multiple races were collected from the Singapore Integrated Diabetic Retinopathy Screening Program (SiDRP), Singapore Malay Eye Study (SiMES), Singapore Indian Eye Study (SINDI), Singapore Chinese Eye Study (SCES) [29], Beijing Eye Study (BES) [30], African American Eye Disease Study (AFEDS) [31], Chinese University of Hong Kong [32], and Diabetes Management Project Melbourne (DMP Melb) [33]. The SiDRP, AFEDS, and DMP images were graded at the Singapore Eye Research Institute, and the SiMES, SINDI, and SCES images at the Blue Mountain Eye Study reading centre in Australia, by non-ophthalmologists. Images from Beijing and Hong Kong were graded by general ophthalmologists and retinal specialists, respectively. The average time per image for human assessors was 2–5 min.

The input for the DL system was 76,370 optic disc- and macula-centred retinal images. The output nodes were the individual DR severity levels according to the International Clinical Diabetic Retinopathy Severity Scale (ICDRSS) classification [34]. The input images were composed of 88.3% normal retina, 6.4% mild non-proliferative DR (NPDR), 3.8% moderate NPDR, and 1.5% vision-threatening DR (VTDR; severe NPDR and proliferative DR). To train the DL system with the ICDRSS classification, the weights of the DL system were adjusted with stochastic gradient descent. For validation, the DL model predicted a raw confidence score for each severity level output node. The scores were linearly weighted to produce a single image-level DR score, using two separate models: one trained with the original image and one with its contrast-equalized version. These were averaged into an eye-level DR score. The DR score was translated into a DR grading according to previously specified score thresholds. Patients were classified as ungradable and excluded from analysis when both eyes were ungradable. Additional manual grading was done on DL system-ungradable images. The time taken to pre-process and analyse retinal images using a graphics processing unit (GPU) for eight datasets was recorded.

The AUC and level of agreement in detection of the three outcomes were calculated with the human assessor grading as a reference. The AUC was 0.863 (95% CI, 0.854–0.871) for any DR, 0.963 (95% CI, 0.956–0.969) for referable DR, and 0.950 (95% CI, 0.940–0.959) for VTDR. Human assessors estimated the prevalence of any DR, referable DR, and VTDR as 15.9%, 6.5%, and 4.1%, while the DLS estimated 16.1%, 6.4%, and 3.7% (P = 0.59, 0.46, and 0.07), respectively.
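The image-to-grade scoring pipeline described above can be sketched in a few lines of Python. The severity weights and grade thresholds below are placeholders for illustration only; the study's actual values are not reproduced here:

```python
import numpy as np

# Placeholder linear weights for the five ICDRSS severity levels and
# placeholder score cut-offs; not the study's actual values.
SEVERITY_WEIGHTS = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
GRADE_THRESHOLDS = [0.5, 1.5, 2.5, 3.5]

def image_dr_score(confidences):
    """Linearly weight the per-severity confidence scores of one image
    into a single image-level DR score."""
    p = np.asarray(confidences, dtype=float)
    return float(np.dot(p / p.sum(), SEVERITY_WEIGHTS))

def eye_dr_grade(conf_original, conf_contrast_equalized):
    """Average the scores of the two model variants (original and
    contrast-equalized input) into an eye-level score, then map the
    score to a grade via the fixed thresholds."""
    score = 0.5 * (image_dr_score(conf_original) +
                   image_dr_score(conf_contrast_equalized))
    return sum(score >= t for t in GRADE_THRESHOLDS)
```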
Patient demographics and systemic risk factors (e.g. age, sex, ethnicity, diabetes duration, haemoglobin A1c (HbA1c), systolic blood pressure (SBP), diastolic blood pressure, body mass index, total cholesterol, and triglyceride levels) were evaluated as variables. A pooled analysis of the eight individual datasets of human-assessed DR outcomes was performed by random-effect multivariate logistic regression. Results show that diabetes duration, increased HbA1c, and increased SBP were significantly associated with any DR, referable DR, and VTDR (p < 0.001 for all evaluations) for both SELENA and the human assessors.

An additional analysis of the relationship with risk factors was done by calculating odds ratios from meta-analysis. Both the human assessors and the DLS identified younger age, longer diabetes duration, increased HbA1c, and increased SBP as risk factors for increasing DR severity on the forest plot (P < 0.001), which was consistent with previously published studies [35, 36].

It is noteworthy that it took 10.4 h for SELENA versus 1554.8 h for human assessors (i.e. 553.9 man-days of roughly 6.5 h each, or over 2 years) to analyse 93,293 images; SELENA was thus 360 times faster than the human assessors and provided similar results. Although there were 7391 images ungradable by SELENA, which required secondary manual grading by human assessors (an additional 123.2 h or 19.0 man-days), the total duration including the additional grading was 125.4 h (21.1 man-days). Hence SELENA greatly reduced the time and cost of data processing for DR research using retinal images. This shows that DL systems like SELENA could be an invaluable tool for health systems, greatly assisting in public health research by automating much of the work.

SELENA could be used for research purposes in large-scale population-based epidemiological studies with tremendous cost and time benefits. Future research should focus on the validation of algorithms on real-world ocular images from different ethnicities and various imaging machines. Development of an algorithm trained on multimodal imaging approaches accompanied by longitudinal clinical data would open a new era of clinical research.

Future Directions

In this chapter, we have outlined the development of SELENA and its high performance in detecting referable DR in multiple populations of varying ethnicities. SELENA has been shown to perform very well in detecting not only referable DR but also possible glaucoma and AMD, showing that this DL system can be a valuable, multifunctional tool in the screening process. Throughout testing, SELENA was employed in scenarios designed to test the model's application in real-world screening in both high- and low-resource environments. Importantly, these findings were shown using prospective trial designs and pre-fixed operating thresholds, showing that SELENA is robust and generalisable [9]. Strikingly, SELENA has also been shown to be a valuable tool for epidemiological research, because the DL system greatly reduced the number of hours required by human assessors for analysing images [28].

However, there are many regulatory, ethical, social and technical challenges to implementing any AI algorithm in ophthalmology and healthcare more generally. Large prospective clinical trials with AI systems like SELENA are needed to evaluate the safety and efficacy of the proposed screening DL system, particularly taking into account diverse hardware, population characteristics and local logistical challenges. In addition, studying the real-world impact of SELENA will be critical to determine the effects on clinical care more widely. The impact on health systems may include increased demand for follow-up and treatment in tertiary care; on the other hand, the demand may be reduced by fewer false positives compared to existing systems. This will have to be tested in the real world so that the AI system may be implemented in a safe manner in multiple healthcare settings worldwide.
The Singapore AI team is currently working on expanding the scope of SELENA, for example integrating optical coherence tomography imaging for retinal disease detection and grading, improving detection of glaucoma and other anterior segment diseases, as well as the prediction of myopia. The group is also pursuing the integration of genetic, epigenetic and proteomic information in the SELENA algorithm, with the hope of ushering in a new era of personalised care for patients, ultimately improving health outcomes for people all around the world.

References

1. Moor J. The Dartmouth College artificial intelligence conference: the next fifty years. AI Magazine. 2006:87–91.
2. Lecun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
3. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks; 2012.
4. Le QV, Ranzato MA, Monga R, Devin M, Chen K, Corrado GS, et al. Building high-level features using large scale unsupervised learning; 2012.
5. Raina R, Madhavan A, Ng AY. Large-scale deep unsupervised learning using graphics processors; 2009.
6. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44–56.
7. Ting DSW, Lin H, Ruamviboonsuk P, Wong TY, Sim DA. Artificial intelligence, the internet of things, and virtual clinics: ophthalmology at the digital translation forefront. Lancet Digital Health. 2020;2(1):e8–9.
8. Ting DSW, Pasquale LR, Peng L, Campbell JP, Lee AY, Raman R, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019;103(2):167–75.
9. Ting DSW, Peng L, Varadarajan AV, Keane PA, Burlina PM, Chiang MF, et al. Deep learning in ophthalmology: the technical and clinical considerations. Prog Retin Eye Res. 2019;72:100759.
10. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017;124(7):962–9.
11. Abràmoff MD, Lou Y, Erginay A, Clarida W, Amelon R, Folk JC, et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest Ophthalmol Vis Sci. 2016;57(13):5200–6.
12. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
13. Grassmann F, Mengelkamp J, Brandl C, Harsch S, Zimmermann ME, Linkohr B, et al. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology. 2018;125(9):1410–20.
14. Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 2017;135(11):1170–6.
15. Ting DSW, Cheung CY, Lim G, Tan GSW, Quang ND, Gan A, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211–23.
16. Li Z, He Y, Keel S, Meng W, Chang RT, He M. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology. 2018;125(8):1199–206.
17. Brown JM, Campbell JP, Beers A, Chang K, Ostmo S, Chan RP, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):803–10.
18. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2(3):158.
19. Lee CS, Tyring AJ, Deruyter NP, Wu Y, Rokem A, Lee AY. Deep-learning based, automated segmentation of macular edema in optical coherence tomography. Biomed Opt Express. 2017;8(7):3440–8.
20. De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342.
21. Kermany DS, Goldbaum M, Cai W, Valentim CC, Liang H, Baxter SL, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–31.e9.
22. Yau JWY, Rogers SL, Kawasaki R, Lamoureux EL, Kowalski JW, Bek T, et al. Global prevalence and major risk factors of diabetic retinopathy. Diabetes Care. 2012;35(3):556.
23. Taylor HR. Global blindness: the progress we are making and still need to make. Asia-Pac J Ophthalmol. 2019;8(6).
24. Flaxman SR, Bourne RR, Resnikoff S, Ackland P, Braithwaite T, Cicinelli MV, et al. Global causes of blindness and distance vision impairment 1990–2020: a systematic review and meta-analysis. Lancet Glob Health. 2017;5(12):e1221–e34.
25. Chua J, Lim CXY, Wong TY, Sabanayagam C. Diabetic retinopathy in the Asia-Pacific. Asia-Pac J Ophthalmol. 2018;7(1):3–16.
26. Ting DSW, Cheung GCM, Wong TY. Diabetic retinopathy: global prevalence, major risk factors, screening practices and public health challenges: a review. Clin Exp Ophthalmol. 2016;44(4):260–77.
27. Bellemo V, Lim ZW, Lim G, Nguyen QD, Xie Y, Yip MY, et al. Artificial intelligence using deep learning to screen for referable and vision-threatening diabetic retinopathy in Africa: a clinical validation study. Lancet Digital Health. 2019;1(1):e35–44.
28. Ting DSW, Cheung CY, Nguyen Q, Sabanayagam C, Lim G, Lim ZW, et al. Deep learning in estimating prevalence and systemic risk factors for diabetic retinopathy: a multi-ethnic study. npj Digital Medicine. 2019:1–8.
29. Tan GS, Gan A, Sabanayagam C, Tham YC, Neelam K, Mitchell P, et al. Ethnic differences in the prevalence and risk factors of diabetic retinopathy: the Singapore epidemiology of eye diseases study. Ophthalmology. 2018;125(4):529–36.
30. Xu J, Xu L, Wang YX, You QS, Jonas JB, Wei WB. Ten-year cumulative incidence of diabetic retinopathy. The Beijing Eye Study 2001/2011. PLoS One. 2014;9(10):e111320.
31. McKean-Cowdin R, Fairbrother-Crisp A, Torres M, Lastra C, Choudhury F, Jiang X, et al. The African American eye disease study: design and methods. Ophthalmic Epidemiol. 2018;25(4):306–14.
32. Tang FY, Ng DS, Lam A, Luk F, Wong R, Chan C, et al. Determinants of quantitative optical coherence tomography angiography metrics in patients with diabetes. Sci Rep. 2017;7(1):2575.
33. Lamoureux EL, Fenwick E, Xie J, McAuley A, Nicolaou T, Larizza M, et al. Methodology and early findings of the diabetes management project: a cohort study investigating the barriers to optimal diabetes care in diabetic patients with and without diabetic retinopathy. Clin Exp Ophthalmol. 2012;40(1):73–82.
34. Wilkinson CP, Ferris FL 3rd, Klein RE, Lee PP, Agardh CD, Davis M, et al. Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales. Ophthalmology. 2003;110(9):1677–82.
35. Jones CD, Greenwood RH, Misra A, Bachmann MO. Incidence and progression of diabetic retinopathy during 17 years of a population-based screening program in England. Diabetes Care. 2012;35(3):592–6.
36. Thomas RL, Dunstan F, Luzio SD, Roy Chowdury S, Hale SL, North RV, et al. Incidence of diabetic retinopathy in people with type 2 diabetes mellitus attending the diabetic retinopathy screening service for Wales: retrospective analysis. BMJ. 2012;344:e874.
14  Automatic Retinal Imaging and Analysis: Age-Related Macular Degeneration (AMD) within Age-Related Eye Disease Studies (AREDS)

T. Y. Alvin Liu and Neil M. Bressler

In this chapter, we focus on a series of collaborations between clinicians from the School of Medicine and computer scientists from the Applied Physics Lab at the Johns Hopkins University. The described studies utilized deep learning (DL) and focused on analysis of color fundus photographs (CFP) of patients with age-related macular degeneration (AMD), the leading cause of central vision loss in persons over age 50 in the United States [1] and around the world. The dataset used for training and testing was derived from the Age-Related Eye Disease Studies (AREDS) [2], a longitudinal cohort study funded by the National Eye Institute with over 4500 participants and roughly 130,000 CFPs taken with a 30-degree camera. The ground truth of the deep learning systems (DLS) was based on the annotations (gradings) by trained graders at the University of Wisconsin Fundus Photograph Reading Center, which is the designated reading center for the AREDS.

In the first study [3], a DLS was trained with a combination of transfer learning, using the universal features taken from the fully connected layer of the pre-trained OverFeat [4] deep convolutional neural network (DCNN), and a linear support vector machine (LSVM). Only a subset of the AREDS dataset, 5664 images, was used. Three sets of classification experiments were performed: 4-class classification (no AMD vs. early AMD vs. intermediate AMD vs. advanced AMD); 3-class classification (no or early AMD vs. intermediate AMD vs. advanced AMD); 2-class classification (no or early AMD vs. intermediate or advanced AMD). The DLS's performance was compared to that of a human ophthalmologist.

Classification problem  DLS accuracy (kappa)  Ophthalmologist accuracy (kappa)
4-class  79.4% (0.6962)  75.8% (0.6583)
3-class  81.5% (0.7226)  85.0% (0.7748)
2-class  93.4% (0.8482)  95.2% (0.8897)
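The first study's pipeline, frozen deep features fed to a linear SVM, can be sketched as follows. The random feature matrix below is a placeholder standing in for the fully connected layer activations of the pre-trained DCNN (OverFeat itself is not assumed to be available here), and the array sizes are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Random placeholders standing in for the frozen CNN features (one vector
# per fundus photograph) and the 4-class AMD severity labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))
labels = rng.integers(0, 4, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)   # linear SVM on the fixed deep features
print("held-out accuracy:", clf.score(X_te, y_te))
```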

To achieve the same level of performance, a DLS will need more data for training to perform progressively more fine-grained classifications. Given that the size of the training data was fixed in this study, a drop in accuracy and in correlation with the gold standard (kappa score) was seen with progressively more fine-grained classifications, as expected. The trend was observed for both the DLS and the human ophthalmologist. Overall, the DLS's performance was comparable to that of a human ophthalmologist, and the DLS showed
robust performance in a 2-class classification task, i.e., detecting from CFP the AMD that should be referred for management. This was one of the first DL studies in ophthalmology, and it demonstrated that DL was a promising technique for analyzing CFPs from patients with AMD.

The second study [5] aimed to extend the findings of the first study [3], and involved a more complicated study design. A DLS was developed for the 2-class classification task of differentiating between "no or early AMD" and "intermediate or advanced AMD", and its performance was evaluated using fivefold cross-validation, in which four folds were used for training and one fold was used for testing (with a rotation of the folds). Different experiments were performed using different variations of the AREDS dataset: using the entire AREDS dataset including the stereo pairs (133,821 images) vs. using only one of the stereo pair for each eye (67,401 images) vs. using only one of the stereo pair for each eye minus images of poor quality (66,943 images). Two DL approaches were compared: using the AREDS data to re-train only the final LSVM classification stage of the OverFeat DCNN (as was done in the first study) vs. using the AREDS data to optimize all of the DCNN weights over all layers of the AlexNet DCNN [6]. The latter DL approach was more computationally advanced and intensive, and consistently produced a superior result. A human ophthalmologist also independently graded a subset of 5000 images to enable comparison with the DLS. When the entire AREDS dataset was used and the data was split at a patient-visit level, the superior DL approach achieved an area-under-the-receiver-operating-characteristic-curve (AUC-ROC) value of 0.96, an accuracy of 91.6% ± 0.1% and a kappa of 0.829 ± 0.003. When the entire AREDS dataset was used and the data was split at a patient level, the superior DL approach achieved an AUC-ROC value of 0.94, an accuracy of 88.7% ± 0.7% and a kappa of 0.770 ± 0.013. For comparison, the human ophthalmologist achieved an accuracy of 90.2% and a kappa of 0.800. This study confirmed that a DLS could be trained to reliably differentiate between non-referable and referable AMD in CFPs (Fig. 14.1), with performance comparable to that of a human ophthalmologist.
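The distinction between splitting at the patient-visit level and at the patient level matters because images from the same patient are correlated, so a visit-level split can leak information into the test fold. A patient-level split of the kind compared above can be sketched with scikit-learn's GroupKFold; the arrays are placeholders, not the AREDS data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(1)
features = rng.normal(size=(1000, 16))      # placeholder per-image features
labels = rng.integers(0, 2, size=1000)      # referable vs non-referable
patient_ids = rng.integers(0, 200, size=1000)

# Fivefold cross-validation in which no patient contributes images to
# both the training fold and the testing fold.
for train_idx, test_idx in GroupKFold(n_splits=5).split(
        features, labels, groups=patient_ids):
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```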

Fig. 14.1  An example of non-referable (left) and referable (right) AMD



The third study [7] explored technical refinements that could improve a DLS's ability to differentiate between non-referable and referable AMD in CFPs. It included only one image from each stereo pair in the AREDS dataset (67,401 images in total) and utilized a different DL approach. This approach used transfer learning and fine-tuned the original ResNet [8] DCNN weights; it first performed a 4-step classification [9] (step-1: no or only small drusen (<63 μm) and no pigmentary abnormalities; step-2: multiple small drusen or medium-sized drusen (63–125 μm) and/or pigmentary abnormalities; step-3: large drusen (≥125 μm) or numerous medium-sized drusen and pigmentary abnormalities; step-4: choroidal neovascularization or geographic atrophy), and then fused the four classes into two (non-referable and referable). The resultant DLS, when compared to the AlexNet DLS used in the second study [5], produced a statistically significant superior performance in terms of accuracy (91.6% vs. 88.4%), sensitivity (89.0% vs. 84.5%) and specificity (93.6% vs. 91.5%).

Based on longitudinal outcomes data, the AREDS 9-step severity scale [10, 11] incorporates detailed quantification of drusen area and pigmentary abnormalities, and provides 5-year risk estimates for progression to advanced AMD, which includes choroidal neovascularization, central geographic atrophy, or both. While this severity scale provides fairly granular relative risk estimations for developing advanced AMD, the grading is so complex that it is likely no human ophthalmologist uses the scale. Nevertheless, the scale could be useful: for example, an eye with a step-1 score carries a 0.3% chance of progression, while an eye with a step-9 score carries a 53% chance [10]. The fourth study [12] made two major contributions: using DL for fine-grained, 9-step severity classification, and for 5-year progression risk calculation inferred directly from a CFP as input. Only one image from each stereo pair in the AREDS dataset (58,370 images in total) was used. The dataset was split at the patient level, and the training, validation and testing subsets comprised 88%, 2% and 10% of the images, respectively. The DLS was trained with the ResNet 50 [13] DCNN. The 5-year progression risk inference was performed using three methods: soft prediction, hard prediction and regressed prediction. "Soft prediction" first calculated the class probabilities for each of the nine severity steps, using the ResNet 50 derived classifications and SoftMax output values, and computed the 5-year risk estimate as the expected value of class risk under this probability. "Hard prediction" likewise calculated the nine class probabilities, but designated the risk associated with the class with maximum probability as the 5-year risk estimate. In other words, the soft estimate is the probability-weighted average of the nine tabulated class risks, whereas the hard estimate is the tabulated risk of the single most probable class. "Regressed prediction" skipped the 9-step severity scale prediction step and directly mapped an input CFP image to a 5-year risk estimate by using the ResNet 50 in regression mode.

For the 9-step severity scale classification task, the DLS achieved a linearly weighted kappa score of 0.738, suggesting a high degree of correlation with the gold standard established at the reading center. However, variation in the DLS's classification accuracy was seen across the nine classes, which was likely driven by the imbalances in sample size available for each class. For example, of the 58,370 images involved in this study, there were 24,411 step-1 images, but only 1160 step-9 images. For the 5-year progression risk estimation task, the "hard prediction" method performed best in most classes, and the overall mean estimation error of the three methods ranged from 3.47% to 5.29%, indicating a relatively small estimation error by the DLS.

The fifth study [14] produced by this series of collaborations pivoted from utilizing DL for classification tasks to utilizing DL for generative tasks. Specifically, generative adversarial networks (GANs) [15] were used to create high-resolution CFPs with various stages of AMD (Fig. 14.2). GANs have two main components: "generative" and "discriminative". The "generative" network uses training data to generate synthetic images, which are then presented to the "discriminative" network, which is responsible for discriminating between the synthetic and real images. The two networks are "adversarial" in that the "generative" network aims to generate synthetic images that can "fool" the "discriminative" network. These two networks are then trained iteratively against each other to ultimately maximize the "authenticity" of the synthetic images.
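A minimal PyTorch sketch of one such adversarial update is shown below. The generator G, the discriminator D (assumed here to end in a sigmoid so that it outputs probabilities) and the two optimizers are placeholders defined elsewhere, and this is a generic GAN step rather than the specific variant used in the study:

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_g, opt_d, real_images, latent_dim=128):
    """One generic adversarial update. D is assumed to output a
    probability (sigmoid output) that an image is real."""
    n = real_images.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # The generator maps random latent vectors to synthetic images.
    fake_images = G(torch.randn(n, latent_dim))

    # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
    d_loss = (F.binary_cross_entropy(D(real_images), ones) +
              F.binary_cross_entropy(D(fake_images.detach()), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: update G so that D labels its output as real.
    g_loss = F.binary_cross_entropy(D(fake_images), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```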

Fig. 14.2  Sample synthetic images generated by GANs: early (left), intermediate (middle) and advanced (right) AMD

Since the description of GANs by Goodfellow et al. in 2014 [15], many variants of GANs have been developed, e.g. fully-connected GANs, Laplacian Pyramid GANs, boundary equilibrium GANs, self-attention GANs, BigGANs, conditional GANs and progressive GANs (ProGANs) [16]. This study adopted ProGANs; the experiments involved two retinal specialists, and the major findings were as follows. First, the synthetic images were of high quality: there was no difference in gradeability detected between the synthetic and real images as determined by the two retinal specialists. Second, the synthetic images were judged realistic in most cases. The two retinal specialists only achieved 59.5% and 53.7% accuracy, respectively, in discriminating real AMD images from synthetic images; accuracies this close to 50% suggested that the retinal specialists' ability to identify the synthetic images was similar to random chance. Third, the synthetic images generated by the GANs were useful for training a DL algorithm. The authors showed that a DLS, trained entirely with synthetic images, still achieved a respectable AUC-ROC value of 0.9235 and accuracy of 82.92% in differentiating between non-referable vs. referable AMD when tested against real AMD images.

In summary, a series of collaborations between retinal specialists at the School of Medicine and computer scientists at the Applied Physics Lab at the Johns Hopkins University produced five publications that explored various aspects of DL applications in AMD using data from CFPs of AREDS: from referable AMD determination to fine-grained AMD severity classification to DCNN technical refinements to synthetic image generation with GANs. Future directions may include the following: using multimodal imaging from the AREDS, such as CFPs and optical coherence tomography imaging, to develop more powerful DLSs; using GANs to supplement the under-represented classes of images in the 9-step severity scale to improve progression risk estimations; and testing DLSs trained with AREDS images against images collected in clinical practice.

References

1. Bressler NM. Age-related macular degeneration is the leading cause of blindness. JAMA. 2004;291(15):1900–1.
2. Age-Related Eye Disease Study Research Group. The Age-Related Eye Disease Study (AREDS): design implications. AREDS report no. 1. Control Clin Trials. 1999;20(6):573–600.
3. Burlina P, Pacheco KD, Joshi N, Freund DE, Bressler NM. Comparing humans and deep learning performance for grading AMD: a study in using universal deep features and transfer learning for automated AMD analysis. Comput Biol Med. 2017;82:80–6.
4. Razavian AS, Azizpour H, Sullivan J, Carlsson S. CNN features off-the-shelf: an astounding baseline for recognition. Paper presented at: the Institute of Electrical and Electronics Engineers Conference of computer vision and pattern recognition; May 12, 2014; Stockholm, Sweden. https://arxiv.org/pdf/1403.6382.pdf
5. Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 2017;135(11):1170–6.
6. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
7. Burlina P, Joshi N, Pacheco KD, Freund DE, Kong J, Bressler NM. Utility of deep learning methods for referability classification of age-related macular degeneration. JAMA Ophthalmol. 2018;136(11):1305–7.
8. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; June 27–30, 2016; Las Vegas, NV. p. 771–8.
9. Age-Related Eye Disease Study Research Group. The age-related eye disease study system for classifying age-related macular degeneration from stereoscopic color fundus photographs: the age-related eye disease study report number 6. Am J Ophthalmol. 2001;132(5):668–81.
10. Davis MD, Gangnon RE, Lee LY, et al. The age-related eye disease study severity scale for age-related macular degeneration: AREDS report no. 17. Arch Ophthalmol. 2005;123(11):1484–98.
11. Ying GS, Maguire MG, Alexander J, Martin RW, Antoszyk AN; Complications of Age-related Macular Degeneration Prevention Trial Research Group. Description of the age-related eye disease study 9-step severity scale applied to participants in the complications of age-related macular degeneration prevention trial. Arch Ophthalmol. 2009;127(9):1147–51.
12. Burlina PM, Joshi N, Pacheco KD, Freund DE, Kong J, Bressler NM. Use of deep learning for detailed severity characterization and estimation of 5-year risk among patients with age-related macular degeneration. JAMA Ophthalmol. 2018;136(12):1359–66.
13. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE conference on computer vision and pattern recognition (CVPR). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2016. p. 771–8.
14. Burlina PM, Joshi N, Pacheco KD, Liu TYA, Bressler NM. Assessment of deep generative models for high-resolution synthetic retinal image generation of age-related macular degeneration. JAMA Ophthalmol. 2019.
15. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. Adv Neural Inf Process Syst. 2014:2672–80.
16. Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017).
G.  Description of the age-related eye disease study arXiv:1710.10196 (2017).
15  Artificial Intelligence for Keratoconus Detection and Refractive Surgery Screening

José Luis Reyes Luis and Roberto Pineda

Background

Keratoconus (KCN) is a progressive non-inflammatory corneal ectasia that is over 90% bilateral in reported cases [1]. Interestingly, initial unilateral presentations are described in approximately 4.5% of patients; however, 50% of this population will develop signs of keratoconus in the 'unaffected eye' within 16 years [2]. The prevalence of keratoconus is also variable, ranging from 1 in 50 people in regions such as Central India [3] to 1 in 2000 people in countries such as the USA [4]. In early disease, patients with keratoconus are typically asymptomatic. Most of them will require glasses later in the course of the disease and, at some point if the disease advances, corneal transplant may be necessary [5].

Fortunately, the distinction between mild and advanced keratoconus is not challenging, because this disease has particular biomicroscopic and retinoscopic manifestations that can be easily recognized. In addition, there are many commercially available diagnostic tools, including topography, tomography, and anterior segment optical coherence tomography, that are highly accurate in the screening and classification of manifest keratoconus. However, the differentiation of the normal population from keratoconus suspects (KCS) as well as from forme fruste keratoconus (FFKC) patients is still a dilemma for ophthalmologists. Moreover, there is no standardized consensus regarding definitions for KCS and FFKC [5]. In general, for the majority of the literature, the term "keratoconus suspect" has been applied to patients with abnormal topography who do not have keratoconus in the fellow eye, while "forme fruste keratoconus" has been employed in patients with relatively normal corneal topography in one eye and clinical keratoconus in their fellow eyes [6].

On the other hand, Laser Vision Correction (LVC) procedures are among the most performed surgeries worldwide. Acknowledging the elective nature of the procedure, potential complications must be carefully considered and avoided, as this surgery can induce substantial biomechanical alterations that can result in iatrogenic ectasias [7–9]. The first LVC iatrogenic ectasia was reported by Seiler in 1998. In his report, Seiler described a case of a patient with a previous diagnosis of FFKC who underwent laser-assisted in situ keratomileusis (LASIK) surgery. Since this report, FFKC has been considered a contraindication for LASIK surgery, and its identification has become an essential step during refractive surgery screening [10]. However, preventing iatrogenic ectasia is not so easy, as the main challenge is to detect atypical corneas at high risk for developing ectasia while minimizing the number of normal patients who are denied surgery [11].

This must be performed employing risk factor detection methods with high sensitivity and specificity [9, 11].

Recently, ectasia scoring systems have refined the ability of some devices to detect early forms of keratoconus. For example, the Belin/Ambrosio enhanced ectasia display (BAD-D) by Pentacam (Oculus, Wetzlar, Germany) combines elevation-based maps and pachymetry to screen corneal ectasias with high sensitivity and specificity (Fig. 15.1). This is particularly useful for evaluating patients at risk for post-operative ectasia and for identification of early or subclinical keratoconus. A final BAD-D score, also referred to as "final D", greater than 1.88 has shown a 2.5% false positive rate with less than 1% false negatives [12].

Moreover, the use of Artificial Intelligence (AI) techniques has been recommended for difficult KCN diagnoses (such as differentiation of the normal population from KCS and FFKC, as well as refractive surgery screening), as AI constitutes a valuable instrument that has been proven to identify early signs of ectasias with a high degree of reliability [2].

Artificial Intelligence

One common term used in AI is big data. This concept refers to the aggregate of data that has been generated over years in scientific research projects. This information is extremely extensive and complex; therefore, its storage and interpretation with traditional tools is not possible [13]. AI provides the investigator with useful tools that allow a better understanding and analysis of that data [8]. To achieve that objective, big data needs initially to be organized in a dataset.

Fig. 15.1  The Belin-Ambrosio Enhanced Ectasia Display (BAD-D) in a case with early keratoconus. Abnormal
reported indices are represented in yellow and red

Then, in order to determine the presence of patterns in an extensive group of information, big data must undergo a process called data mining. This pattern recognition is very important and cannot be achieved manually (due to the complexity of the information). Data mining can be performed using two approaches: (1) supervised classification, which is assisted by prototypes or examples; or (2) unsupervised classification or clustering, which examines relationships between the properties of the objects [13].

The first study involving the use of AI for corneal pathologies was performed by Maeda in 1994 [11]. Since then, several AI techniques have been employed for corneal topography interpretation and early keratoconus detection [11, 14].

Artificial Intelligence Techniques

"Machine learning" is a popular AI technique which has the objective of mimicking human learning. This method has the advantage of giving computers the aptitude to 'learn' without being programmed to do so [5]. This is accomplished through machine programs that allow the boosting of output using a previously known information dataset, and through the use of multiple types of algorithms, e.g. the classifier algorithm (see Box 15.1). Machine learning algorithms are "trained" on a guidance data set, and afterwards are "validated" on a different set of test or verification data, in order to ensure the external validity of this process [5].

Box 15.1 Classifier
In machine learning, a classifier is an algorithm that implements classification to find patterns within a dataset. As previously exposed, and according to the way new information is obtained, these algorithms can be divided into two types: supervised or unsupervised. Classifiers can be used, for example, in the classification of topographic maps as normal, astigmatic, or with KCN [14].

Although most of the AI studies for KCN detection and refractive screening are based on supervised machine learning techniques, other types of AI, including unsupervised machine learning techniques, expert decision trees [15], and multivariate logistic regression analysis [16], have also proven to be useful.

Studies

Supervised Machine Learning Classifiers

Neural Networks

"Neural network" (NN) refers to the AI method that simulates human neurological processing abilities with the objective of interpolating or estimating complicated data [17]. This method has the capacity of detecting characteristics that are concealed within the input data, without requiring additional information about the logic structures between input and output data. Therefore, for developing a successful interpretation system and obtaining an accurate response, the establishment of the input data is fundamental in this technique.

To date, several studies have used NN for corneal topography classification. The first NN study was reported in 1995 by Maeda and collaborators. In their publication, topographic maps were classified (using 11 indices of the computer-assisted videokeratoscope TMS-1 [Computed Anatomy, New York, NY]) into normal, with-the-rule astigmatism, KCN mild to advanced, post-photorefractive keratectomy and post penetrating keratoplasty. The NN backpropagation model (which is a supervised, multilayer, feed-forward network) was used, obtaining a correct classification in 80% of cases (with >90% specificity, and 44–100% sensitivity) [17].
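A supervised, multilayer, feed-forward network of the kind described above can be sketched with scikit-learn. The 11-index feature matrix and five map categories below are random placeholders rather than TMS-1 data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 11))       # 11 topographic indices per map
y = rng.integers(0, 5, size=600)     # five map categories (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)
# One hidden layer, trained by gradient-based backpropagation.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X_tr, y_tr)
print("held-out accuracy:", net.score(X_te, y_te))
```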
In 1997, Smolek and Klyce published their first NN study. Using 10 indices of the computer-assisted videokeratoscope TMS-1, they were able to detect KCS and KCN severity with 100% accuracy, 100% sensitivity, and 100% specificity [18]. Four years later, these two authors also reported their NN findings using wavelet data from the computer-assisted videokeratoscope TMS-1. The NN was able to distinguish normal from prior refractive surgery corneas with 99.3% accuracy, 99.1% sensitivity and 100% specificity. These outcomes were considerably higher than those obtained by clinicians (who only achieved 65% sensitivity and 95% specificity) [19].

Accardo and Pensiero, on the other hand, used NN to evaluate unilateral and bilateral indices, as well as early KCN maps, obtained with the videokeratoscope EyeSys (EyeSys Vision, Houston, Texas). This method allowed the authors to differentiate early KCN from non-KCN corneas with 94.1% sensitivity and 97.6% specificity [20].

In 2008, Vieira de Carvalho and Barbosa employed NN and discriminant analysis to classify corneal shapes using Zernike coefficients. EyeSys videokeratoscope data was used as input, achieving 94% accuracy with NN and 84.8% accuracy with discriminant analysis [21].

Six years later, Silverman and collaborators compared the ability of NN and linear discriminant analysis to differentiate between normal and KCN corneas. This was done by using maps of epithelial and stromal thickness obtained from the Artemis 1 (ArcScan, Morrison, CO) digital ultrasound scanner. Both classifiers obtained an Area Under the Receiver Operating Characteristic curve (AUROC) of 1.0 (indicative of complete separation of groups); however, NN sensitivity and specificity were slightly better when compared to linear discriminant analysis (98.9% and 99.5% versus 94.6% and 99.2%, respectively) [11].

Kovacs et al., in turn, analyzed unilateral and bilateral indices obtained with Scheimpflug imaging (Pentacam corneal topographer). NN differentiated normal and KCN corneas with an AUROC of 0.99, 100% sensitivity, and 95% specificity; however, differentiation between normal and FFKC corneas was slightly less accurate, as it only achieved an AUROC of 0.97, 90% sensitivity, and 90% specificity [2].

Automated Decision Tree Classification

The "automated decision tree" classification method is a machine learning algorithm that is generated employing a sample of training data with a known class assignment. This adaptable investigation tool is suitable for nonparametric data analysis and has been used for data mining [22]. In this technique, cutoff values of discriminant variables are designated, giving place to successive 'nodes', 'branches' (which divide into two mutually exclusive subgroups), and 'leaves' (the final decision of class assignment).

Smadja and collaborators used the automated decision tree classification method to discriminate inputs from the Galilei corneal topographer (Ziemer Ophthalmic Systems, Port, Switzerland). Posterior surface asymmetry index and corneal volume were the two most important discriminant variables in their study. With this method, normal and KCN corneas were classified with 100% sensitivity and 93.6% specificity, while normal and FFKC corneas were categorized with 93.6% sensitivity and 97.2% specificity [22].

WEKA Software Classifiers (SVM, Random Forest, Bayes Network)

More recently, an open-source machine learning software package was created at the University of Waikato, New Zealand. This software, called the Waikato Environment for Knowledge Analysis (WEKA) workbench, contains a compilation of machine learning algorithms and data processing tools for data mining [23]. Examples of the learning algorithms offered by WEKA are the supervised classifiers Support Vector Machine (SVM), Bayes Network, Multi-Layer Perceptron (MLP), Naive Bayes, Random Forest and Radial Basis Function Neural Network (RBFNN) [13].

In 2010, Souza et al. evaluated the performance of SVM, MLP and RBFNN in KCN detection using indices from the Orbscan II scanning slit topographer (Bausch & Lomb, Quebec, Canada). All three machine learning algorithms showed high KCN detection indices: no difference was found between SVM and MLP efficiency (both with 0.99 AUROC and 100% sensitivity), while RBFNN also reached high values (0.98 AUROC and 98% sensitivity) [24].

Later, Ruiz Hidalgo and collaborators evaluated the KCN detection effectiveness of the SVM algorithm using 22 parameters obtained from the Pentacam corneal topographer.
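The train-then-validate workflow of these supervised classifiers (Box 15.1) can be sketched as follows. The 22-parameter feature matrix is a random placeholder, not actual device output, and tenfold cross-validation is an illustrative choice rather than the protocol of any particular study:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 22))       # 22 tomographic parameters per eye
y = rng.integers(0, 2, size=500)     # 1 = KCN, 0 = non-KCN (placeholder)

svm = SVC(kernel="rbf")              # support vector machine classifier
scores = cross_val_score(svm, X, y, cv=10)   # tenfold cross-validation
print("mean cross-validated accuracy:", scores.mean())
```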
Their results demonstrated a KCN vs non-KCN classification accuracy of 98.9%, sensitivity of 99.1%, and specificity of 98.5%, while an FFKC vs non-KCN classification achieved an accuracy of 93.1%, sensitivity of 79.1%, and specificity of 97.9%. Finally, the authors were able to classify five groups [KCN, FFKC, astigmatic, PR (post-refractive surgery) and normal] with 88.8% accuracy, 89% sensitivity, and 95.2% specificity [25].

More recently, a study published by Lopes et al. intended to detect corneal ectasia susceptibility by analyzing Pentacam tomographic data from patients with unilateral asymmetric ectasia, clinical KCN, and stable LASIK. The authors compared five classifiers (Random Forest, Naive Bayes, NN, SVM, and regularized Discriminant Analysis), of which the Random Forest yielded the highest accuracy (80–85.2% sensitivity and 96.6% specificity). To date, this is the largest machine learning study for ectasia susceptibility and early detection of clinical ectasia [8].

Unsupervised Machine Learning

Contrary to supervised machine learning techniques, unsupervised algorithms do not need pre-labelled data for training [26]. This allows investigators to use a non-biased approach that employs multiple comprehensive parameters for analysis. Although to date most of the machine learning techniques for automatic detection of KCN are supervised, Yousefi et al. conducted an unsupervised machine learning study using a big dataset of corneal images obtained with the swept-source CASIA SS-1000 corneal OCT. The classification was able to detect KCN cases with 97.7% sensitivity and 94.1% specificity. To the best of our knowledge, this is the first and only study that employs unsupervised machine learning in corneal ectasia detection [26].

Non-machine Learning Classifiers

Linear Discriminant Analysis

"Linear Discriminant Analysis" (LDA) is a technique commonly used for data reduction that can help as a pre-processing step for pattern classification applications and machine learning. Its goal is to display the original data in a smaller dimensional space while maximizing the separation between categories [27].

Saad and Gatinel used LDA to differentiate KCN, FFKC, and non-KCN corneas using 51 Orbscan II topographic indices. The percentage of thickness increase and the maximum posterior corneal elevation were the most important contributors and, with this technique, the differentiation capacity between normal and FFKC groups and between normal and KCN groups reached AUROCs of 0.98 and 0.99, respectively. These results suggest that it is plausible to accurately differentiate normal from FFKC eyes using topography indices [28].

Two years later, the same investigators used the OPD-Scan corneal analyzer to obtain Zernike coefficients and differentiate between normal, FFKC and KCN corneas using LDA. An AUROC of 0.98 for the distinction between normal and FFKC corneas was reported, while an AUROC of 0.96 between normal and KCN corneas was reached [29].

Expert System Classifier

The "Expert System Classifier" is a non-machine learning AI technique in which resolutions are reached through an extensive set of decision rules, deductive decisions, and step-by-step logical operations [17]. In 1994, Maeda and collaborators reported the first Expert System Classifier results and combined them with LDA in order to differentiate KCN from non-KCN corneas. Using indices from TMS-1 maps, this method resulted in 96% accuracy, 89% sensitivity and 99% specificity [15].

Multivariate Logistic Regression Analysis

Although multivariate logistic regression analysis is not a machine learning technique, the classification results that have been obtained with this classic statistical method are worth including in this chapter. This method is convenient for models with dichotomous dependent variables and uses logistic regression coefficients to estimate odds ratios for each independent variable [30].
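Concretely, the odds ratio for each independent variable is simply the exponential of its logistic regression coefficient. A minimal sketch with random placeholder data and hypothetical index names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3))        # three candidate indices (placeholder)
y = rng.integers(0, 2, size=400)     # 1 = FFKC, 0 = normal (placeholder)

model = LogisticRegression().fit(X, y)
# The odds ratio per one-unit increase in each index is exp(coefficient).
odds_ratios = np.exp(model.coef_[0])
print(dict(zip(["index_1", "index_2", "index_3"], odds_ratios.round(2))))
```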
In 2018, Hwang and collaborators used a multivariate logistic regression analysis to differentiate normal from FFKC corneas. They combined five Scheimpflug Pentacam indices with 11 anterior segment OCT RT-Vue-100 (Optovue, Fremont, CA) indices, obtaining a classification system with 100% sensitivity and 100% specificity. The major contributors in this analysis were the epithelial thickness variability and the total focal corneal thickness variability from OCT, and the anterior curvature and topometric indices from Scheimpflug tomography [16]. In Fig. 15.2, anterior segment OCT and Scheimpflug images similar to the ones used in Hwang's study are displayed.

Corneal Biomechanics

Recently, a spotlight has fallen on the evaluation of corneal biomechanics, not only due to its capability of diagnosing masked corneal ectasias, but also for guiding therapeutic interventions such as crosslinking. Nowadays, there are few devices for in vivo evaluation of corneal biomechanics; these include the ocular response analyzer (ORA) and the Corvis ST [31–33]. Biomechanics have contributed to the expansion of corneal imaging, and more studies are expected to improve datasets for AI in keratoconus and refractive screening. In Fig. 15.3, the results of a patient with a very asymmetric ectasia (VAE-E) evaluated with the Ambrosio, Roberts and Vinciguerra (ARV) biomechanical and tomographic assessment, performed with the Corvis ST device, are shown.

Limitations

The term "ground truth" is used to depict the algorithm result that the AI technique is programmed to obtain; however, reaching that ground truth is not a simple task. Large datasets can be more reliable for designating the ground truth, as cutoffs can be set to include only the advanced forms of the disease (e.g. manifest corneal ectasias). However, in cases in which there is no expert consensus on the definition of a specific condition, such as early ectasias following refractive surgery, a reliable classification is difficult to achieve [5].

In addition, the lack of large sample sizes for low-incidence conditions (such as FFKC and early post-refractive surgery ectasia) and the scarcity of exposure to a sufficient variety of presentations of the same pathology and its differential diagnoses (e.g. ectasias following refractive surgery, KCN, and pellucid marginal degeneration) undermine the results of the algorithms and reduce their external validity [14].

Advantages of AI in Cornea

The advantages of using AI in corneal ectasias include:

• Automated interpretation of topographic maps. This may help unskilled observers enhance their decision-making performance to a more skilled level [17].
• Early keratoconus detection. KCS and FFKC detection by AI models has achieved good results, allowing ophthalmologists to diminish progression by the administration of early treatments. One of these treatments is riboflavin-induced UVA cross-linking, which seems to be an effective and enduring treatment when the detection is done at early stages [34].
• Customized LASIK ablations. Recently, the AI capacity to combine data has been focused on optimizing refractive results. This led to the design of LASIK ablations based on axial length measurements, total eye wavefront input, and detailed corneal and anterior segment tomography data; this first module was called Innoveyes by Alcon. With this information, mathematical models of ray-tracing, prediction of biomechanical changes of the cornea, and the anticipated epithelial remodeling can also be obtained [35].

Fig. 15.2  Combined image of anterior segment OCT and Scheimpflug tomography from a patient with KCN in the left eye. The Scheimpflug tomography shows the anterior elevation and curvature maps (top), and the corneal thickness and posterior elevation maps (middle). The total corneal thickness and epithelial thickness map can be found in the anterior segment OCT report (bottom)

Fig. 15.3  The ARV Biomechanical and Tomographic Display in a patient's left eye with keratoconus. Shown below are the Corvis Biomechanical Index (CBI), Tomographic Biomechanical Index (TBI) and Belin-Ambrosio Enhanced Ectasia Display (BAD-D). Abnormal reported indices are represented in yellow and red. Image courtesy of Dr. Renato Ambrosio MD PhD

Conclusions

Due to the lack of validated datasets across techniques, as well as the absence of a consistent definition for KCN or its early manifestations, it is difficult to truly determine whether one AI technique offers a significant advantage over the others (e.g. FFKC and KCS). Despite this, AI techniques have proven to be fruitful in KCN detection and refractive surgery screening, with promising capacity for continued progress as imaging instrumentation and methods become more advanced [5].

References

1. Burns DM, Johnston FM, Frazer DG, Patterson C, Jackson AJ, et al. Keratoconus: an analysis of corneal asymmetry. Br J Ophthalmol. 2004;88(10):1252–5. https://doi.org/10.1136/bjo.2003.033670.
2. Kovács I, Miháltz K, Kránitz K, Juhász É, Takács Á, Dienes L, et al. Accuracy of machine learning classifiers using bilateral data from a Scheimpflug camera for identifying eyes with preclinical signs of keratoconus. J Cataract Refract Surg. 2016;42(2):275–83. https://doi.org/10.1016/j.jcrs.2015.09.020.
3. Jonas JB, Nangia V, Matin A, Kulkarni M, Bhojwani K. Prevalence and associations of keratoconus in rural Maharashtra in Central India: the Central India eye and medical study. Am J Ophthalmol. 2009;148(5):760–5. https://doi.org/10.1016/j.ajo.2009.06.024.
4. Kennedy RH, Bourne WM, Dyer JA. A 48-year clinical and epidemiologic study of keratoconus. Am J Ophthalmol. 1986;101(3):267–73.
5. Lin SR, Ladas JG, Bahadur GG, Al-Hashimi S, Pineda R. A review of machine learning techniques for keratoconus detection and refractive surgery screening. Semin Ophthalmol. 2019;34(4):317–26. https://doi.org/10.1080/08820538.2019.1620812.
6. Klyce SD. Chasing the suspect: keratoconus. Br J Ophthalmol. 2009;93(7):845–7. https://doi.org/10.1136/bjo.2008.147371.
7. Ambrósio R, Randleman JB. Screening for ectasia risk: what are we screening for and how should we screen for it? J Refract Surg. 2013;29(4):230–2. https://doi.org/10.3928/1081597X-20130318-01.
8. Lopes BT, Ramos IC, Salomão MQ, Guerra FP, Schallhorn SC, Schallhorn JM, et al. Enhanced tomographic assessment to detect corneal ectasia based on artificial intelligence. Am J Ophthalmol. 2018;195:223–32. https://doi.org/10.1016/j.ajo.2018.08.005.
9. Santhiago MR, Smadja D, Gomes BF, Mello GR, Monteiro ML, Wilson SE, et al. Association between the percent tissue altered and post–laser in situ keratomileusis ectasia in eyes with normal preoperative topography. Am J Ophthalmol. 2014;158(1). https://doi.org/10.1016/j.ajo.2014.04.002.
10. Seiler T, Quurke AW. Iatrogenic keratectasia after LASIK in a case of forme fruste keratoconus. J Cataract Refract Surg. 1998;24(7):1007–9.
11. Silverman RH, Urs R, Roychoudhury A, Archer TJ, Gobbe M, Reinstein DZ. Epithelial remodeling as basis for machine-based identification of keratoconus. Invest Ophthalmol Vis Sci. 2014;55(3):1580. https://doi.org/10.1167/iovs.13-12578.
12. Belin MW, Villavicencio OF, Ambrósio RR. Tomographic parameters for the detection of keratoconus. Eye Contact Lens. 2014;40(6):326–30. https://doi.org/10.1097/ICL.0000000000000077.
13. Amancio DR, Comin CH, Casanova D, Travieso G, Bruno OM, Rodrigues FA, et al. A systematic comparison of supervised classifiers. PLoS One. 2014;9(4). https://doi.org/10.1371/journal.pone.0094137.
14. Klyce SD. The future of keratoconus screening with artificial intelligence. Ophthalmology. 2018;125(12):1872–3. https://doi.org/10.1016/j.ophtha.2018.08.019.
15. Maeda N, Klyce SD, Smolek MK, Thompson HW. Automated keratoconus screening with corneal topography analysis. Invest Ophthalmol Vis Sci. 1994;35(6):2749–57.
16. Hwang ES, Perez-Straziota CE, Kim SW, Santhiago MR, Randleman JB. Distinguishing highly asymmetric keratoconus eyes using combined Scheimpflug and spectral-domain OCT analysis. Ophthalmology. 2018;125(12):1862–71. https://doi.org/10.1016/j.ophtha.2018.06.020.
17. Maeda N, Klyce SD, Smolek MK. Neural network classification of corneal topography. Preliminary demonstration. Invest Ophthalmol Vis Sci. 1995;36(7):1327–35.
18. Smolek MK, Klyce SD. Current keratoconus detection methods compared with a neural network approach. Invest Ophthalmol Vis Sci. 1997;38(11):2290–9.
19. Smolek MK, Klyce SD. Screening of prior refractive surgery by a wavelet-based neural network. J Cataract Refract Surg. 2001;27(12):1926–31.
20. Accardo P, Pensiero S. Neural network-based system for early keratoconus detection from corneal topography. J Biomed Inform. 2002;35(3):151–9.
21. Carvalho LAVD, Barbosa MS. Neural networks and statistical analysis for classification of corneal videokeratography maps based on Zernike coefficients: a quantitative comparison. Arq Bras Oftalmol. 2008;71(3):337–41.
22. Smadja D, Touboul D, Cohen A, Doveh E, Santhiago MR, Mello GR, et al. Detection of subclinical keratoconus using an automated decision tree classification. Am J Ophthalmol. 2013;156(2). https://doi.org/10.1016/j.ajo.2013.03.034.
23. Witten IH, Frank E, Hall MA, Pal CJ. Data mining: practical machine learning tools and techniques. 4th ed. Amsterdam: Elsevier; 2016. p. 7–15.
24. Souza MB, Medeiros FW, Souza DB, Garcia R, Alves MR. Evaluation of machine learning classifiers in keratoconus detection from Orbscan II examinations. Clinics (Sao Paulo). 2010;65(12):1223–8. https://doi.org/10.1590/s1807-59322010001200002.
25. Hidalgo IR, Rodriguez P, Rozema JJ, Dhubhghaill SN, Zakaria N, Tassignon M-J, et al. Evaluation of a machine-learning classifier for keratoconus detection based on Scheimpflug tomography. Cornea. 2016;35(6):827–32. https://doi.org/10.1097/ICO.0000000000000834.
26. Yousefi S, Yousefi E, Takahashi H, Hayashi T, Tampo H, Inoda S, et al. Keratoconus severity identification using unsupervised machine learning. PLoS One. 2018;13(11). https://doi.org/10.1371/journal.pone.0205998.
27. Tharwat A, Gaber T, Ibrahim A, Hassanien AE. Linear discriminant analysis: a detailed tutorial. AI Commun. 2017;30(2):169–90. https://doi.org/10.3233/AIC-170729.
28. Saad A, Gatinel D. Topographic and tomographic properties of forme fruste keratoconus corneas. Invest Ophthalmol Vis Sci. 2010;51(11):5546. https://doi.org/10.1167/iovs.10-5369.
29. Saad A, Gatinel D. Evaluation of total and corneal wavefront high order aberrations for the detection of forme fruste keratoconus. Invest Ophthalmol Vis Sci. 2012;53(6):2978. https://doi.org/10.1167/iovs.11-8803.
30. Alexopoulos EC. Introduction to multivariate regression analysis. Hippokratia. 2010;14(Suppl 1):23–8.
31. Yuan A, Pineda R. Developments in imaging of corneal biomechanics. Int Ophthalmol Clin. 2019;59(4):1–17. https://doi.org/10.1097/IIO.0000000000000286.
32. Gokul A, Vellara HR, Patel DV. Advanced anterior segment imaging in keratoconus: a review. Clin Exp Ophthalmol. 2018;46(2):122–32. https://doi.org/10.1111/ceo.13108.
33. De Stefano VSD, Dupps WJ. Biomechanical diagnostics of the cornea. Int Ophthalmol Clin. 2017;57(3):75–86. https://doi.org/10.1097/IIO.0000000000000172.
34. Keating A, Roberto Pineda II, Colby K. Corneal cross linking for keratoconus. Semin Ophthalmol. 2010;25(5–6):249–55. https://doi.org/10.3109/08820538.2010.518503.
35. Sightmap & InnovEyes – YouTube [Internet]. [cited 2020 Mar 24]. https://www.youtube.com/watch?v=CPcRoH0qcPM
16  Artificial Intelligence for Cataract Management

Haotian Lin, Lixue Liu, and Xiaohang Wu

H. Lin (*) · L. Liu · X. Wu
State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, People's Republic of China

Artificial Intelligence for Detection and Grading of Cataracts

Cataract was estimated to be responsible for 52.6 million (18.2–109.6 million) visually impaired patients in 2015 [1]. Considering the association between cataracts and aging, the prevalence of cataracts is expected to increase with the global trend of population aging [2]. However, it is expensive in terms of both time and cost for ophthalmologists to inspect cataracts, particularly in low-income and middle-income countries, given the high interregional disparity of medical resources [3]. Therefore, many researchers have focused on the use of AI for automated identification and grading of cataracts, which can be helpful for large-population screening [4–7] (Fig. 16.1).

Fig. 16.1  AI for cataract detection and grading (panels: cataract diagnosis; severity evaluation)

With the universalization of cataract surgery, an increasing number of cataract patients have had their eyes operated on. Many researchers have included these postoperative eyes in datasets for better differentiation of lens status. Acharya et al. proposed an artificial neural network classifier for the identification of normal, cataract and postoperative slit-lamp photographs in 2010 [4]. A fuzzy K-means clustering algorithm, a kind of traditional machine learning technique, was applied to detect the features specific to each class, as sketched below. Using 140 photographs, this system exhibited an average rate of 93.3% in detecting the different lens conditions. Although this system produced acceptable results, it was not appropriate for clinical adoption since it did not identify referable cases in each class.
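A minimal, self-contained sketch of fuzzy c-means clustering of this kind is shown below. The two-dimensional synthetic features, cluster count, and fuzziness exponent are illustrative assumptions; they are not the image features or settings used by Acharya et al.

```python
# A minimal sketch of fuzzy c-means clustering, in the spirit of the fuzzy
# K-means step described above. Feature vectors are synthetic placeholders.
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, seed=0):
    """Cluster rows of X into n_clusters with fuzziness exponent m."""
    rng = np.random.default_rng(seed)
    # Random initial membership matrix; each row sums to 1.
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Cluster centers as membership-weighted means.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distance from every sample to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # Membership update from inverse relative distances.
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# Toy 2-D feature vectors standing in for features of normal, cataract and
# postoperative slit-lamp images (synthetic, for illustration only).
X = np.vstack([np.random.randn(40, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
centers, U = fuzzy_c_means(X, n_clusters=3)
print(U.argmax(axis=1))  # hard class assignment per image
```

Unlike hard K-means, each image retains a graded membership in every class, which is what allows class-specific features to be characterized.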
A universal artificial intelligence platform for collaborative management of cataracts was later developed by Wu et al. [6]. For each slit-lamp photograph, this AI platform was designed to perform three steps: mode recognition, cataract diagnosis and severity evaluation. After that, a management decision of referral or follow-up was given (a minimal sketch of this triage logic appears below). The three-step strategy extended the context of this AI agent to multiple capture modes and to cataracts of different etiologies and phenotypes. Accordingly, this platform showed great potential to be implemented in clinical scenarios and to improve the efficiency as well as the coverage of ophthalmology care, which was later proved in this study. In the traditional healthcare system, an ophthalmologist was estimated to serve 4000 patients in a year. In the novel tertiary healthcare system proposed by the authors, however, an ophthalmologist could serve as many as 40,806 patients in a year with the assistance of mobile devices, community-based healthcare facilities and this AI platform. The results suggested that this collaborative platform and referral pattern were highly efficient and could be expanded to the




management of other ophthalmic diseases, with updated user-accessible mobile devices and automatic examination instruments.
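The sketch below illustrates the three-step triage logic described above as a chain of models. The Decision structure, function names, labels, and the referral rule are hypothetical stand-ins for illustration; they are not taken from Wu et al.'s platform.

```python
# A minimal sketch of a three-step triage pipeline: mode recognition ->
# cataract diagnosis -> severity evaluation -> management decision.
# All names and labels here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Decision:
    capture_mode: str
    has_cataract: bool
    severity: str
    action: str

def manage(photo, mode_model, diagnosis_model, severity_model) -> Decision:
    mode = mode_model(photo)                    # step 1: capture-mode recognition
    cataract = diagnosis_model(photo, mode)     # step 2: cataract present or not
    if not cataract:
        return Decision(mode, False, "none", "routine follow-up")
    severity = severity_model(photo, mode)      # step 3: severity grading
    action = "referral" if severity == "referable" else "follow-up"
    return Decision(mode, True, severity, action)

# Usage with trivial stand-in models (any image classifiers could be plugged in):
decision = manage(
    photo=None,
    mode_model=lambda p: "mydriatic",
    diagnosis_model=lambda p, m: True,
    severity_model=lambda p, m: "referable",
)
print(decision)
```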
Artificial Intelligence for Preoperative Assessment of Cataract Surgery

The current standardized management of a visually impairing cataract is surgical removal of the cataractous lens and implantation of an intraocular lens in its place [8]. Cataract surgery is one of the most commonly performed surgical procedures today, with more than 11 million eyes undergoing intraocular lens (IOL) implantation worldwide every year [9–11]. The postoperative vision of patients depends largely on an accurate calculation of the IOL optical power, which represents a challenge for ophthalmologists [12]. Researchers have integrated artificial intelligence into this process to achieve better visual outcomes for patients [13, 14] (Fig. 16.2).

Fig. 16.2  AI for preoperative assessment (panels: surgical complexity evaluation; IOL power calculation; diagram labels: K, Ax, ELP, IOL)

The most commonly used formulas for IOL power calculation currently belong to the vergence formula category, such as the SRK/T formula [15]. However, their accuracy is only able to achieve ±0.5 diopter from the intended target refraction in 60–80% of eyes. Findl et al. tried improving IOL power calculation through optimization of the prediction of the pseudophakic anterior chamber depth (pACD), an important yet hard-to-estimate variable in the SRK/T formula [13]. The performance of a multilayer perceptron, a type of artificial neural network, was compared with that of state-of-the-art linear regression for the prediction of pACD from a variety of preoperative biometric parameters. It turned out that there was no significant improvement from linear regression to the multilayer perceptron in terms of the correlation coefficient in the prediction of pACD. Replacing the SRK/T pACD prediction with the neural network pACD prediction enhanced the performance of the formula by only 0.01 D in mean absolute error of the IOL power calculation. This may be a result of the rather small dataset of only 77 eyes. Sramka et al. made another attempt on a dataset containing information about up to 2194 eyes [14]. In this study, IOL power was calculated from clinical data through two machine learning models: a Support Vector Machine Regression model and a Multilayer Neural Network Ensemble model. The prediction errors of the two machine learning models were significantly lower than those of the SRK/T formula, indicating a strong potential of machine learning models for improving IOL calculations.
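A minimal sketch of this kind of machine-learning IOL regression, comparing an SVR and an MLP under cross-validation, is shown below. The feature set and the synthetic labels are assumptions for illustration; the labels are loosely shaped like the classic SRK regression formula (P = A − 2.5·AL − 0.9·K), whereas real labels would come from postoperative refraction data, and the published models' settings are not reproduced here.

```python
# A minimal sketch of machine-learning IOL power prediction from
# preoperative biometry; data and feature names are synthetic assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(23.5, 1.2, n),   # axial length (mm)
    rng.normal(43.5, 1.5, n),   # mean keratometry (D)
    rng.normal(3.1, 0.4, n),    # anterior chamber depth (mm)
])
# Synthetic "true" IOL power in the shape of the classic SRK regression,
# plus measurement noise (for illustration only).
y = 118.4 - 2.5 * X[:, 0] - 0.9 * X[:, 1] + rng.normal(0, 0.3, n)

for name, model in [
    ("SVR", SVR(kernel="rbf", C=10.0)),
    ("MLP", MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                         random_state=0)),
]:
    pipe = make_pipeline(StandardScaler(), model)
    mae = -cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: mean absolute error {mae.mean():.3f} D")
```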
Artificial Intelligence for Postoperative Management of Cataract Surgery

Generally, cataract surgery is safe and effective, with 84–94% of eyes achieving best-corrected visual acuity of 20/30 or better at 6 months after surgery [16, 17]. Still, postoperative complications can arise, the most common of which is posterior capsule opacification (PCO) [8]. It is the consequence of proliferation of retained lens epithelial cells and can cause decreased visual acuity, blurred vision, or glare. Various factors affect the occurrence of PCO, including patient age at the surgery [18], sex [19], diabetes mellitus [20], IOL material [19] and so on. Because of the complex nature of PCO pathogenesis, AI models are expected to be useful in predicting its occurrence (Fig. 16.3).

Fig. 16.3  AI for postoperative follow-up (panels: high intraocular pressure; posterior capsular opacification)

Mohammadi et al. applied an artificial neural network to predict the occurrence of clinically significant PCO 2 years after phacoemulsification [21]. A total of 10 input variables were selected to develop the model. Trained and tested on a dataset of 352 eyes, this model produced a reasonable accuracy of 87%. Another attempt was made by Jiang et al. [22]. They used slit-lamp images rather than variables for the prediction of PCO progression. For each of 1015 patients, slit-lamp images of six consecutive postoperative reexamination stages were collected: the 3rd, 6th, 9th, 12th, 18th and 24th months. A temporal sequence network was then constructed to




predict the existence of PCO at the 24th postoperative month (a minimal sketch of such an architecture follows below). This proposed model offered exceptional performance, with a sensitivity of 88.55%, a specificity of 94.31% and an area under the curve of 0.9718. Prediction models of this kind help to plan treatment strategies and to provide early warning for the patients.
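The sketch below shows one way such a temporal sequence network could be wired: per-visit image features from a small CNN encoder fed through a recurrent layer that outputs the PCO probability at month 24. The layer sizes and image resolution are illustrative assumptions, not the network of Jiang et al.

```python
# A minimal sketch of a temporal sequence network over six follow-up visits
# (months 3, 6, 9, 12, 18, 24); architecture sizes are assumptions.
import torch
import torch.nn as nn

class TemporalPCONet(nn.Module):
    def __init__(self, feat_dim=64, hidden=32):
        super().__init__()
        # Tiny CNN encoder applied to each visit's slit-lamp image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # probability of PCO at month 24

    def forward(self, x):                   # x: (batch, visits, 3, H, W)
        b, t = x.shape[:2]
        feats = self.encoder(x.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.rnn(feats)
        return torch.sigmoid(self.head(h[-1]))

batch = torch.randn(2, 6, 3, 64, 64)        # 2 patients, 6 visits each
print(TemporalPCONet()(batch).shape)        # torch.Size([2, 1])
```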
Future of AI-Based Cataract Management

As discussed above, AI technologies have been applied to multiple aspects of cataract management. From a visionary perspective, AI applications may be extended to other cataract-related areas such as preoperative risk stratification for cataract surgery, prediction of postoperative visual and refractive outcomes, and patient assessment for implants such as multifocal or accommodative IOLs. It is becoming apparent that AI technologies will play a crucial role in the revolution of healthcare service delivery within the field of ophthalmology and serve as a powerful addition to the current diagnostic and therapeutic armamentarium of cataract specialists.

References

1. Flaxman SR, Bourne RRA, Resnikoff S, Ackland P, Braithwaite T, Cicinelli MV, et al. Global causes of blindness and distance vision impairment 1990-2020: a systematic review and meta-analysis. Lancet Glob Health. 2017;5(12):e1221–e34.
2. Song P, Wang H, Theodoratou E, Chan KY, Rudan I. The national and subnational prevalence of cataract and cataract blindness in China: a systematic review and meta-analysis. J Glob Health. 2018;8(1):010804.
3. Ramke J, Zwi AB, Lee AC, Blignault I, Gilbert CE. Inequality in cataract blindness and services: moving beyond unidimensional analyses of social position. Br J Ophthalmol. 2017;101(4):395–400.
4. Acharya RU, Yu W, Zhu K, Nayak J, Lim TC, Chan JY. Identification of cataract and post-cataract surgery optical images using artificial intelligence techniques. J Med Syst. 2010;34(4):619–28.
5. Gao X, Lin S, Wong TY. Automatic feature learning to grade nuclear cataracts based on deep learning. IEEE Trans Biomed Eng. 2015;62(11):2693–701.
6. Wu X, Huang Y, Liu Z, Lai W, Long E, Zhang K, et al. Universal artificial intelligence platform for collaborative management of cataracts. Br J Ophthalmol. 2019;
7. Xu X, Zhang L, Li J, Guan Y, Zhang L. A hybrid global-local representation CNN model for automatic cataract grading. IEEE J Biomed Health Inform. 2019;
8. Liu YC, Wilkins M, Kim T, Malyugin B, Mehta JS. Cataracts. Lancet. 2017;390(10094):600–12.
9. Frampton G, Harris P, Cooper K, Lotery A, Shepherd J. The clinical effectiveness and cost-effectiveness of second-eye cataract surgery: a systematic review and economic evaluation. Health Technol Assess. 2014;18(68):1–205, v–vi.
10. Wang W, Yan W, Muller A, He M. A global view on output and outcomes of cataract surgery with national indices of socioeconomic development. Invest Ophthalmol Vis Sci. 2017;58(9):3669–76.
11. Abell RG, Vote BJ. Cost-effectiveness of femtosecond laser-assisted cataract surgery versus phacoemulsification cataract surgery. Ophthalmology. 2014;121(1):10–6.
12. Olsen T. Sources of error in intraocular lens power calculation. J Cataract Refract Surg. 1992;18(2):125–9.
13. Findl O, Struhal W, Dorffner G, Drexler W. Analysis of nonlinear systems to estimate intraocular lens position after cataract surgery. J Cataract Refract Surg. 2004;30(4):863–6.
14. Sramka M, Slovak M, Tuckova J, Stodulka P. Improving clinical refractive results of cataract surgery by machine learning. PeerJ. 2019;7:e7202.
15. Melles RB, Holladay JT, Chang WJ. Accuracy of intraocular lens calculation formulas. Ophthalmology. 2018;125(2):169–78.
16. Ewe SY, Abell RG, Oakley CL, Lim CH, Allen PL, McPherson ZE, et al. A comparative cohort study of visual outcomes in femtosecond laser-assisted versus phacoemulsification cataract surgery. Ophthalmology. 2016;123(1):178–82.
17. Ruit S, Tabin G, Chang D, Bajracharya L, Kline DC, Richheimer W, et al. A prospective randomized clinical trial of phacoemulsification vs manual sutureless small-incision extracapsular cataract surgery in Nepal. Am J Ophthalmol. 2007;143(1):32–8.
18. Apple DJ, Solomon KD, Tetz MR, Assia EI, Holland EY, Legler UF, et al. Posterior capsule opacification. Surv Ophthalmol. 1992;37(2):73–116.
19. Ando H, Ando N, Oshika T. Cumulative probability of neodymium:YAG laser posterior capsulotomy after phacoemulsification. J Cataract Refract Surg. 2003;29(11):2148–54.
20. Ebihara Y, Kato S, Oshika T, Yoshizaki M, Sugita G. Posterior capsule opacification after cataract surgery in patients with diabetes mellitus. J Cataract Refract Surg. 2006;32(7):1184–7.
21. Mohammadi SF, Sabbaghi M, Hashemi H, Alizadeh S, Majdi M, et al. Using artificial intelligence to predict the risk for posterior capsule opacification after phacoemulsification. J Cataract Refract Surg. 2012;38(3):403–8.
22. Jiang J, Liu X, Liu L, Wang S, Long E, Yang H, et al. Predicting the progression of ophthalmic disease based on slit-lamp images using a deep temporal sequence network. PLoS One. 2018;13(7):e0201142.
17  Artificial Intelligence in Refractive Surgery

Yan Wang, Mohammad Alzogool, and Haohan Zou

Y. Wang (*) · H. Zou
Tianjin Eye Hospital, Tianjin Eye Institute, Tianjin Key Laboratory of Ophthalmology and Visual Science, Nankai University School of Medicine, Nankai University, Tianjin, China
Clinical College of Ophthalmology, Tianjin Medical University, Tianjin, China

M. Alzogool
Tianjin Eye Hospital, Tianjin Eye Institute, Tianjin Key Laboratory of Ophthalmology and Visual Science, Nankai University School of Medicine, Nankai University, Tianjin, China

Refractive surgery has undergone rapid advancements in the last decades, with good visual effects and long-term safety [1]. Several refractive surgery types are available, both corneal and lens-based. The present chapter focuses on corneal refractive surgery with laser vision correction, since these procedures are the most popular worldwide, including the earliest application, excimer laser photorefractive keratectomy (PRK); the most commonly performed laser refractive surgery, laser-assisted in situ keratomileusis (LASIK); and small incision lenticule extraction (SMILE) surgery, which uses a femtosecond laser. With constantly updating surgical technology and ophthalmic examination equipment, more clinical data related to surgery are being generated, and more accuracy is required for preoperative assessment and screening. Therefore, artificial intelligence may be needed to assist in diagnosis and surgical procedures.

Application of AI in Refractive Surgery

Application of AI for Surgical Screening and Suspected Keratoconus Detection

Surgical safety is the basis for the development of refractive surgery worldwide. Screening the populations susceptible to keratoconus or keratectasia is an important step before surgery. Due to the occult onset of keratoconus, its early stages are often difficult to detect. Furthermore, it greatly affects the patient's vision with a risk of blindness; thus far, clinical diagnosis remains difficult.

Accordingly, many approaches from different perspectives have been proposed for using machine learning technology to assist in the research and diagnosis of keratoconus [2]. These mainly include the following:

1. AI diagnostic algorithms based on a single detection device
With the continuous updating of corneal topography technology, from the earliest Placido disk imaging topographers based on the anterior corneal surface to the development of tomography devices that can map both the anterior and posterior corneal surfaces, the amount of measured data produced has exponentially increased.


Also, the focus has shifted from the analysis of anterior corneal surface parameters to the comprehensive analysis of full corneal parameters. By incorporating artificial intelligence algorithms and producing effective machine learning models, significant improvements have been achieved in the diagnostic rate of keratoconus and in different classification tasks, such as distinguishing keratoconus, subclinical keratoconus, high-astigmatism corneas, and corneas after refractive surgery [3].

2. AI diagnostic algorithms for combined multimode data
In order to improve the diagnostic accuracy for keratoconus, keratectasia, and other related diseases, different screening devices are often combined during the clinical diagnosis process to develop algorithms for multi-source diagnostic methods, including anterior segment OCT devices, optical aberration measuring instruments, confocal microscopy, and in vivo measurement of corneal biomechanics. However, because the results of different functional instruments have different meanings, and because there are large differences between the results of different instruments within the same functional category, analyzing these inspection parameters is difficult and complicated. To make the models more compatible and robust, and also convenient for research and application across different devices and clinics, a retrospective analysis was performed to extend diagnostic algorithms from a single corneal topographic device to cross-platform data from three different topographic device sources [4], realizing the analysis and evaluation of maps obtained from a variety of topographic devices. Meanwhile, a diagnostic model combined with corneal biomechanical measurement has also proven to have high diagnostic performance [5, 6].

3. Improving diagnostic efficiency for suspected patients in different populations with AI algorithms
In order to identify the diagnostic differences among different ethnic groups, different population-based studies have been performed, showing that ethnic origin influences keratoconus incidence and that the corneal physiological parameters of people in different regions also show regional distribution characteristics [7]. Recently, our team completed a study among 2000 participants based on Scheimpflug corneal tomography parameters to establish a model that can diagnose subclinical keratoconus with high accuracy. We used the support vector machine (SVM) and the gradient boosted decision tree (GBDT), an iterative machine learning algorithm composed of multiple decision trees that screens attribute features with larger weights, to construct a subclinical keratoconus diagnosis model, and performed a 10-fold cross-validation to verify the accuracy. The model achieved a 95.53% diagnostic accuracy. The accuracy of the model in distinguishing subclinical keratoconus from the normal cornea was 96.67%, and the accuracy in distinguishing keratoconus from the normal cornea was 98.91%. In particular, suspected patients had a high diagnostic accuracy [8].

4. Finding new medical rules or connections and providing more clinical clues
In addition to being able to predict and diagnose different diseases and achieve good disease classification performance in the clinical setting, AI can also discover new medical laws or connections that have never been noticed or discovered before. Our research found that the diagnostic features of keratoconus include not only common clinical indexes such as central corneal astigmatism, the index of surface variance, the asymmetry index, corneal thickness, and posterior corneal surface height (posterior corneal surface elevation), but also clarified the significance of the aspheric parameters for the diagnosis of suspected keratoconus. Moreover, collecting large sample data from different centers and different populations to establish and train machine-learning algorithms can further enhance the universality and generalizability of diagnostic models [9].

5. Analysis of images and big data
The emergence of numerous machine learning models has provided more possibilities for the research and application concerning the assisted diagnosis of keratoconus detection, through the testing and comparison of various
17  Artificial Intelligence in Refractive Surgery 209

algorithms to find the optimal model, thereby improving the ability of disease diagnosis. The research mainly focuses on the analysis of images and data (Fig. 17.1). Some of the most commonly used methods are the support vector machine (SVM) [10], decision tree (DT) [9], multilayer perceptron (MLP), radial basis function network (RBFNN) [11], and convolutional neural networks (CNN) [12]. Generally, the dataset includes a training set and one or more validation or test sets. The training set is mainly used to build and train the model, and the test or validation set is used to evaluate the model. K-fold, leave-one-out, and other cross-validation methods are used to perform proper internal validation on the training dataset, and it is better to verify the model again with another dataset derived from clinical data. However, the validation of most studies is still based on the dataset itself and lacks clinical data validation. (A minimal sketch of such a cross-validated classifier follows this list.)

Fig. 17.1  AI assists in the diagnosis of keratoconus and other related ectatic corneal disorders. A corneal topographic map or corneal morphological parameters are input, and the model is processed to achieve classification and grading for the different cases
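The following is a minimal sketch of the SVM/GBDT workflow with 10-fold cross-validation described above, written with scikit-learn. The features and labels are synthetic placeholders standing in for Scheimpflug tomography parameters; the published models' parameters are not reproduced here.

```python
# A minimal sketch: SVM and gradient-boosted decision trees evaluated with
# 10-fold cross-validation on synthetic stand-ins for corneal tomography data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for curvature, elevation, and pachymetry indices labelled
# normal vs. subclinical keratoconus (synthetic, for illustration only).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)

for name, model in [
    ("SVM", make_pipeline(StandardScaler(), SVC())),
    ("GBDT", GradientBoostingClassifier(random_state=0)),
]:
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: 10-fold accuracy {acc.mean():.3f} ± {acc.std():.3f}")
```

As the text notes, a held-out clinical dataset, and not only this internal cross-validation, is what ultimately establishes external validity.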
AI Application to Improve Surgical Accuracy and Personalized Design

To ensure the accuracy and predictability of corneal refractive surgery, the risk of overcorrection or undercorrection must be reduced [13, 14]. Previous nomogram reports on adjusting the magnitudes of spherical equivalent and astigmatism before PRK [15] and LASIK [16] have shown that using multiple regression analysis to establish a nomogram model that considers numerous factors, including age, diopter, and corneal curvature, improves the accuracy of PRK and LASIK surgery [17]. However, with the development of new surgical techniques and multi-source clinical examination equipment, more factors influencing the outcome of surgery have been observed, such as temperature, humidity [18], wind speed, and air pressure [19]. Unlike PRK and LASIK, the nomogram adjustment of SMILE surgery needs to consider more factors and depends more on the experience of the surgeon. The former analysis of data using only a single factor or a small sample is insufficient for these newly arising needs.

With the help of association analysis, information gain, classifiers, and other algorithms, it is possible to identify the correlations between these factors and to analyze their impact on refractive surgery. Based on artificial intelligence technology, the research and development of new intelligent refractive surgery platforms can assist doctors in completing the entire process from preoperative screening and parameter design to outcome prediction.

Our team selected 1146 cases that underwent SMILE surgery with ideal postoperative results.
210 Y. Wang et al.

Fig. 17.2  Various factors affect the accuracy and predictability of SMILE surgery outcomes. Using AI can achieve comprehensive analysis and control of the influencing factors

From these samples, the nominal features were transformed into binary ones, and the numeric features were normalized into the range [0, 1]. The critical features affecting the nomogram values were identified according to information gain analysis. The multilayer perceptron algorithm was used to train the artificial neural network model to predict the SMILE nomogram, and clinical control experiments were conducted for validation (a minimal sketch of this workflow follows below).
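The sketch below illustrates this workflow with scikit-learn: binarized nominal features, [0, 1] normalization, feature screening, and an MLP regressor. Mutual information is used as a stand-in for information gain, and all feature names, the toy target, and the network size are illustrative assumptions, not the study's actual variables.

```python
# A minimal sketch of the nomogram-prediction workflow described above;
# data and names are synthetic assumptions for illustration only.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 45, n).astype(float),
    "preop_sphere": rng.normal(-5.0, 2.0, n),
    "keratometry": rng.normal(43.0, 1.5, n),
    "eye_is_right": rng.integers(0, 2, n).astype(float),  # nominal -> binary
})
y = -0.05 * df["preop_sphere"].to_numpy() + rng.normal(0, 0.05, n)  # toy target

X = MinMaxScaler().fit_transform(df)                 # features into [0, 1]
gain = mutual_info_regression(X, y, random_state=0)  # information-gain screen
keep = np.argsort(gain)[-3:]                         # keep 3 strongest features
mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0)
mlp.fit(X[:, keep], y)
print(mlp.predict(X[:5, keep]))                      # predicted nomogram values
```

The feature-screening step reflects the caution stated below: adding more attributes does not necessarily improve the model and can cause overfitting.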
Moreover, we compared the outcomes of the surgeon group with those of the machine learning group in terms of safety, efficacy, and predictability. The results showed that the efficacy index in the machine learning group (1.48 ± 1.08) was significantly higher than that in the surgeon group (1.3 ± 0.27) (t = −2.17, P < 0.05), and 83% of the eyes in the surgeon group and 93% of the eyes in the machine learning group were within ±0.50 D. The error of SE correction was −0.09 ± 0.024 and −0.23 ± 0.021 for the machine learning and surgeon groups, respectively. The outcomes of the machine learning group in all aspects reached the level of experienced surgeons or were even better [20] (Fig. 17.2). This study proves the feasibility of AI in the design of refractive surgery treatment strategies. However, it is worth noting that when building a nomogram model, more data attributes do not necessarily make a better model; including too much information will lead to overfitting and reduced model accuracy.

In addition, it is necessary to create a worldwide refractive surgery database. The clinical data of refractive surgery are growing at an unprecedented rate all over the world. The establishment of a standardized public database is fundamental for the in-depth development of AI in this field, and it is also key to the development and evaluation of algorithm models. Multi-center clinical research will greatly enhance the safety and efficiency of the models, reduce misdiagnosis, missed diagnosis, and overtreatment during the refractive surgery treatment process, and help ensure precision medicine, which is highly beneficial.

Trends and Challenges of AI in Refractive Surgery

In addition to corneal refractive surgery, intraocular refractive procedures (such as the Implantable Collamer Lens, ICL), Intrastromal Corneal Ring Segment implantation (such as the intrastromal corneal ring, ICR), and other vision correction procedures have played an increasingly important role in correcting refractive errors such as high myopia and astigmatism [21, 22]. AI can also be applied in these areas.

The application of AI solves many current real-world research problems that are difficult to solve even with ideal experimental models, and refractive surgery is a good example; AI goes beyond traditional statistical methods and provides a good solution for this type of surgery. The research and application of AI-assisted refractive surgery has not only shown great potential in diagnosis and treatment, but also has great advantages in the prediction of surgical results and in patient management. With the popularization of developing artificial intelligence technologies such as deep learning [23], and with the aid of algorithms such as convolutional neural networks (CNN) and recurrent neural networks (RNN), effective analysis of complex high-dimensional data, including heterogeneous data from multiple sources such as images, audio, and video, and the management of data such as corneal topographic maps and OCT, can be achieved. This gradually leads to accurate diagnosis and treatment in the field of refractive surgery.

Fig. 17.3  AI runs through the entire process of refractive surgery, assisting doctors in accomplishing all tasks from preoperative screening, surgery design, intraoperative control, and postoperative management

The use of artificial intelligence in refractive surgery will shift from assisting in diagnosis to assisting in treatment and health management (Fig. 17.3). Although AI has been shown to be effective in clinical diagnosis and decision-making, artificial intelligence, particularly deep learning, is still a black box, which makes it difficult to explain what is going on inside. When using AI to assist decision-making, it must be combined with clinical practice and disease characteristics, and doctors need a deeper understanding of the relationship between the two; otherwise, mistakes may occur.

Meanwhile, large-scale clinical applications also need to consider ethical and legal issues. The interpretability of algorithms and the establishment of relevant specifications will also be focal issues that need to be addressed in the future. For the discipline of vision correction, which is inseparable from images and data, refractive surgeons should actively embrace the convenience brought by artificial intelligence to help the discipline develop faster and more accurately.

References

1. Kim TI, Alio Del Barrio JL, Wilkins M, Cochener B, Ang M. Refractive surgery. Lancet. 2019;393(10185):2085–98. https://doi.org/10.1016/S0140-6736(18)33209-4.
2. Lin SR, Ladas JG, Bahadur GG, Al-Hashimi S, Pineda R. A review of machine learning techniques for keratoconus detection and refractive surgery screening. Semin Ophthalmol. 2019;34(4):317–26. https://doi.org/10.1080/08820538.2019.1620812.
3. Ruiz Hidalgo I, Rodriguez P, Rozema JJ, Ni Dhubhghaill S, Zakaria N, Tassignon MJ, et al. Evaluation of a machine-learning classifier for keratoconus detection based on Scheimpflug tomography. Cornea. 2016;35(6):827–32. https://doi.org/10.1097/ICO.0000000000000834.
4. Mahmoud AM, Roberts C, Lembach R, Herderick EE, McMahon TT, Clek SG. Simulation of machine-specific topographic indices for use across platforms. Optom Vis Sci. 2006;83(9):682–93. https://doi.org/10.1097/01.opx.0000232944.91587.02.
5. Machado AP, Lyra JM, Ambrósio R, Ribeiro G, LPN A, Xavier C, et al., editors. Comparing machine-learning classifiers in keratoconus diagnosis from ORA examinations. Berlin: Springer; 2011.
6. Ambrosio R Jr, Lopes BT, Faria-Correia F, Salomao MQ, Buhren J, Roberts CJ, et al. Integration of Scheimpflug-based corneal tomography and biomechanical assessments for enhancing ectasia detection. J Refract Surg. 2017;33(7):434–43. https://doi.org/10.3928/1081597X-20170426-02.
7. Ma R, Liu Y, Zhang L, Lei Y, Hou J, Shen Z, et al. Distribution and trends in corneal thickness parameters in a large population-based multicenter study of young Chinese adults. Invest Ophthalmol Vis Sci. 2018;59(8):3366–74. https://doi.org/10.1167/iovs.18-24332.
8. Zou HH, Xu JH, Zhang L, Ji SF, Wang Y. Assistant diagnose for subclinical keratoconus by artificial intelligence. Zhonghua Yan Ke Za Zhi. 2019;55(12):911–5. https://doi.org/10.3760/cma.j.issn.0412-4081.2019.12.008.
9. Lopes BT, Ramos IC, Salomao MQ, Guerra FP, Schallhorn SC, Schallhorn JM, et al. Enhanced tomographic assessment to detect corneal ectasia based on artificial intelligence. Am J Ophthalmol. 2018;195:223–32. https://doi.org/10.1016/j.ajo.2018.08.005.
10. Ruiz Hidalgo I, Rozema JJ, Saad A, Gatinel D, Rodriguez P, Zakaria N, et al. Validation of an objective keratoconus detection system implemented in a Scheimpflug tomographer and comparison with other methods. Cornea. 2017;36(6):689–95. https://doi.org/10.1097/ICO.0000000000001194.
11. Souza MB, Medeiros FW, Souza DB, Garcia R, Alves MR. Evaluation of machine learning classifiers in keratoconus detection from Orbscan II examinations. Clinics (Sao Paulo). 2010;65(12):1223–8. https://doi.org/10.1590/s1807-59322010001200002.
12. Lavric A, Valentin P. KeratoDetect: keratoconus detection algorithm using convolutional neural networks. Comput Intell Neurosci. 2019;2019:8162567. https://doi.org/10.1155/2019/8162567.
13. Jin HY, Wan T, Wu F, Yao K. Comparison of visual results and higher-order aberrations after small incision lenticule extraction (SMILE): high myopia vs. mild to moderate myopia. BMC Ophthalmol. 2017;17(1):118. https://doi.org/10.1186/s12886-017-0507-2.
14. Zhang J, Wang Y, Wu W, Xu L, Li X, Dou R. Vector analysis of low to moderate astigmatism with small incision lenticule extraction (SMILE): results of a 1-year follow-up. BMC Ophthalmol. 2015;15:8. https://doi.org/10.1186/1471-2415-15-8.
15. Shapira Y, Vainer I, Mimouni M, Sela T, Munzer G, Kaiserman I. Myopia and myopic astigmatism photorefractive keratectomy: applying an advanced multiple regression-derived nomogram. Graefes Arch Clin Exp Ophthalmol. 2019;257(1):225–32. https://doi.org/10.1007/s00417-018-4101-y.
16. Moniz N, Fernandes ST. Nomogram for treatment of astigmatism with laser in situ keratomileusis. J Refract Surg. 2002;18(3 Suppl):S323–6.
17. Liyanage SE, Allan BD. Multiple regression analysis in myopic wavefront laser in situ keratomileusis nomogram development. J Cataract Refract Surg. 2012;38(7):1232–9. https://doi.org/10.1016/j.jcrs.2012.02.043.
18. Seider MI, McLeod SD, Porco TC, Schallhorn SC. The effect of procedure room temperature and humidity on LASIK outcomes. Ophthalmology. 2013;120(11):2204–8. https://doi.org/10.1016/j.ophtha.2013.04.015.
19. Neuhaus-Richard I, Frings A, Ament F, Görsch IC, Druchkiv V, Katz T, et al. Do air pressure and wind speed influence the outcome of myopic laser refractive surgery? Results from the Hamburg weather study. Int Ophthalmol. 2014;34(6):1249–58. https://doi.org/10.1007/s10792-014-9923-y.
20. Cui T, Wang Y, Ji S, Li Y, Hao W, Zou H, et al. Applying machine learning techniques in nomogram prediction and analysis for SMILE treatment. Am J Ophthalmol. 2020;210:71–7. https://doi.org/10.1016/j.ajo.2019.10.015.
21. Sanders DR, Doney K, Poco M. United States Food and Drug Administration clinical trial of the Implantable Collamer Lens (ICL) for moderate to high myopia: three-year follow-up. Ophthalmology. 2004;111(9):1683–92. https://doi.org/10.1016/j.ophtha.2004.03.026.
22. Zadnik K, Money S, Lindsley K. Intrastromal corneal ring segments for treating keratoconus. Cochrane Database Syst Rev. 2019;5. https://doi.org/10.1002/14651858.CD011150.pub2.
23. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88. https://doi.org/10.1016/j.media.2017.07.005.
18  Artificial Intelligence in Cataract Surgery Training

Nouf Alnafisee, Sidra Zafar, Kristen Park, Satyanarayana Swaroop Vedula, and Shameema Sikder

N. Alnafisee
Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, UK

S. Zafar · S. Sikder (*)
The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA
e-mail: ssikder1@jhmi.edu

K. Park · S. S. Vedula
Malone Center for Engineering in Healthcare, Department of Computer Science, The Johns Hopkins University Whiting School of Engineering, Baltimore, MD, USA
e-mail: kpark38@jhu.edu; swaroop@jhu.edu

Cataract surgery is one of the most common surgical procedures performed across the world. It is estimated that by 2050, almost 50 million individuals will need cataract surgery in the United States (U.S.) alone [1]. Given the growing incidence of cataracts, cataract surgery training has become an increasingly important component of ophthalmology residency, and failing to learn the procedure to mastery can be consequential. Moreover, surgeons must often sustain high volumes of procedures even after Board certification to retain their skill in cataract surgery, minimize the risk of complications, and optimize patient care.

In addition to the six core competencies mandated by the Accreditation Council for Graduate Medical Education, the American Board of Ophthalmology (ABO) has also identified surgical proficiency as a competence that should be met by ophthalmology training programs [2]. However, minimal guidance has been provided by the regulating bodies on the best methods to assess surgical skill [3]. In the case of cataract surgery, graduating ophthalmology residents in the United States are required to complete a minimum of 86 cataract surgeries as primary surgeon [2]. A simple case log, however, is both insufficient and crude for assessing competence, with minimal, if any, value in terms of providing feedback. Other conventional methods for skill assessment, including the traditional apprenticeship model of teaching surgery, often lack standardization and objectivity. Although there has been a reliance on structured rating scales for skill assessment in recent years [4], such feedback is often not timely or complete. In one large academic orthopedic surgery program, 58% of residents reported that evaluations were rarely or never completed in a timely manner, and more than 30% of the assessments were completed more than 1 month after a rotation's end. In summary, new approaches to reliable, valid, and universally accessible methods to assess skill are necessary to improve care in cataract surgery.

Recent advances in surgical data science have enabled novel methods for surgical skill assessment. These include techniques to directly analyse instrument motion, video, or other data from the surgical field, as well as manual approaches such as crowdsourcing. However, the validity of these approaches for skill


assessment in the operating room (OR) remains to be seen [5, 6]. This brings us to the following questions: (1) Where does ophthalmology stand in relation to other surgical specialties regarding the use of new techniques for the assessment of surgical technical skill? (2) What do we have to do to bridge this gap and begin implementing automated methods for technical skill assessment in ophthalmology training, specifically in cataract surgery?

Broadly, measures of technical skill in cataract surgery may be derived from three data sources: (I) surgeon performance data, (II) the usage of a phacoemulsification machine or other devices [7], and (III) clinical outcomes [8–10] (Fig. 18.1). Figure 18.2 illustrates the different sources of data, the ease with which they can be obtained, and access to current methods to use them to assess technical skill. Each source of data, and the methods that have been utilized to analyze them for technical skill, is discussed below.

Fig. 18.1  An overview of the scope for sources of data for surgical skill assessment and the training context in which they may be obtained (inputs: direct observation, video, instrument motion, hand motion, eye tracking, and other data such as phacoemulsification data; processing: expert ratings, machine learning, deep learning, crowdsourcing, performance metrics; outputs: feedback, skill, clinical outcomes; contexts: benchtop, wetlab, OR)

Fig. 18.2  Overview of different sources of data, ease with which they can be obtained, and access to current methods to use them to assess technical skill (plotted by ease of capture against usefulness of the data; items include eye gaze, automated ratings from DL algorithms, crowdsourced ratings using video, expert ratings using video, wetlab feedback, feedback in the OR, video review feedback, simulators, and phacoemulsification data)

Sources of Data for Technical Skill Assessment in Cataract Surgery

I. Surgeon performance data
(a) Direct observation
The most prevalent and the most subjective method of assessment involves the direct observation of the surgeon's live performance. Direct observation currently plays a critical role in training surgeons because it allows for feedback both during and after the procedure. However, it requires the presence of an experienced surgeon and/or educator and is not readily reproducible, owing to large interobserver variations in both assessment and feedback.
(b) Instrument motion data (sensor-based)
Motion analysis techniques can assess a surgeon's dexterity based on the movements of their hands and fingers.



To date, sensor-based motion analysis in ophthalmology has only been used for assessing skill in corneal suturing [11] and oculoplastics [12]. Motion analysis has been shown to stratify performance based on surgical experience, with more experienced surgeons using less time, movement, and distance to complete the given task [11, 12]. However, to date, this has only been applied in a wet lab environment, and the first-year costs of the required equipment added up to $19,060 [13].
(c) Video-based
Video recordings are common during cataract surgery and can be a rich source of data for analyzing surgeon performance. Phacotracking [14], a computer vision-based method, can measure instrument motion during cataract surgery to provide objective metrics of performance. A study by Smith et al. analyzed 20 intraoperative surgical videos and determined that the metrics provided by this methodology (the total time taken, the total path length, and the number of movements made by the surgeon) successfully discriminate between expert and novice surgeons (a minimal sketch of these metrics follows below). Similar results were also reported by Din et al. and Balal et al. [15, 16]. Balal et al. also reported higher levels of variability among junior surgeons and updated the previous tracking method by adding stable feature points to the frames and tracking their movement to correctly identify surgical instrument movements.
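The sketch below computes the three motion metrics named above from tracked two-dimensional instrument-tip coordinates. The movement threshold and frame rate are illustrative assumptions, not published values.

```python
# A minimal sketch of objective motion metrics (total time, total path
# length, number of movements) from tracked instrument-tip coordinates.
import numpy as np

def motion_metrics(xy, fps=30.0, move_thresh=2.0):
    """xy: (frames, 2) array of instrument-tip pixel coordinates."""
    steps = np.linalg.norm(np.diff(xy, axis=0), axis=1)
    total_time = len(xy) / fps            # seconds of recorded footage
    path_length = steps.sum()             # total distance travelled, pixels
    moving = steps > move_thresh          # frames with noticeable displacement
    # Count runs of consecutive moving frames as discrete movements.
    n_movements = int(np.count_nonzero(np.diff(moving.astype(int)) == 1))
    if moving.size and moving[0]:
        n_movements += 1                  # a run starting at the first frame
    return total_time, path_length, n_movements

track = np.cumsum(np.random.randn(900, 2) * 3, axis=0)  # 30 s synthetic track
print(motion_metrics(track))
```

With metrics like these, expert and novice trajectories can be compared directly, without a human rater.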

II. Surgical device usage
Ogawa et al. developed the Surgical Media Center (SMC) (Abbott Medical Optics, Inc.), a cataract surgery recording device capable of measuring changes in multiple objective parameters, including phacoemulsification power, vacuum level, aspiration rate over time, and foot pedal position [7]. By doing so, the SMC can detect inappropriate phacoemulsification techniques by analyzing graphs and can elucidate the cause of intraoperative complications. Ogawa et al. found significant differences in the performance metrics measured by the SMC between expert (n = 3) and novice surgeons (n = 3). The authors stated that the time to reach maximum vacuum and the speed of increase in vacuum may be regarded as indicators of skillfulness in handling the phacoemulsification device [7]. To date, the SMC is the only assessment tool utilizing intraoperative findings for resident education, and it can be applied to performance in both OR and wet lab settings.
III. Clinical outcomes
The technical skill of the operating surgeon has been proposed to be an important determinant of postoperative outcomes [17]. While tracking resident complication rates may allow for monitoring progress and ensuring patient safety, the rates of clinically significant complications, such as posterior capsule rupture and vitreous loss, remain low. Furthermore, outcomes may not be readily mapped to specific feedback information that can drive learning and improvements in performance.

Approaches to Assessment

(a) Structured rating scales
Rubrics are one of the earliest attempts at increasing objectivity in surgical assessment, allowing for a more structured approach and higher-quality feedback. However, despite continued efforts to make them more objective, the issue of subjectivity remains.
Multiple rubrics are presently available for the assessment of technical skill in cataract surgery. These include the Objective Assessment of Skills in Intraocular Surgery (OASIS) [18], Global Rating Assessment of Skills in Intraocular Surgery (GRASIS) [19], Subjective Phacoemulsification Skills Assessment (SPESA) [20], Objective Structured Assessment of Cataract Surgical Skill (OSACSS) [21], the International Council of Ophthalmology–approved Ophthalmology Surgical Competency Assessment Rubrics (ICO-OSCARS) [22, 23], and the Objective Structured Assessments of Technical Skills (OSATS) [24]. Although rubrics remain the most affordable assessment method, with an annual cost of approximately $4000 [13], their widespread use and implementation have been limited by their time- and resource-intensive nature. In more recent years, rubrics such as the Iowa Department of Ophthalmology Objective Wet Laboratory Structured Assessment of Skill and Technique (OWLSAT) [25] and the Eye Surgical Skills Assessment Test (ESSAT) [26, 27] have been developed for assessing trainee performance in the wet lab. However, costs that may reach up to approximately $370,000 may hinder widespread implementation [13].
Assessments using structured rating scales may be obtained from individual experts or from a crowd, i.e., through crowdsourcing. In the crowdsourcing methodology, a random sample of unrelated individuals, who have an incentive to perform repetitive tasks but are not necessarily experts in the domain, has been found to yield skill ratings that are accurate on average. Recent studies show that crowdsourcing can rapidly provide skill assessments that are comparable to those obtained from expert ratings. Evidence has accumulated in robotic surgery, urology, laparoscopic surgery, and gynaecology [28–30]. To our knowledge, there is limited research on crowdsourcing in cataract surgical skill assessment.
(b) Objective data (video-based)
In recent years, there has been a push towards the use of machine learning (ML) and deep learning (DL) models for the

objective assessment of surgical skill by utilizing video-based data. The goal with ML is for a computer to "learn" certain patterns from labeled datasets in order to analyze novel data and make informed predictions based on specific algorithms [31]. DL is a subset of machine learning that can process vast amounts of data to solve more complex queries. This is done using hierarchical algorithms structured in an "artificial neural network", a process inspired by how neurons in the human brain work [31].
Use of these algorithms for technical skill assessment in ophthalmology is still in its early stages, with the large majority of work focusing on methods for anatomical segmentation and tool detection/classification, both for phase and step recognition in cataract surgery [32]. Broadly, two approaches are available to obtain videos of the phases of cataract surgical procedures: (1) content-based video retrieval, which involves matching a video to other, similar videos in a data set, and (2) decomposing a procedure video into its constituent phases (segmentation) and assigning each segment a phase label (classification). In the first approach, videos are transformed into fixed-dimensional feature representations using computer vision techniques and then evaluated with distance metrics within the feature space; a minimal sketch of this retrieval step follows below. On the other hand, methods for the second approach include computer vision techniques and deep learning.
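The sketch below illustrates the first approach: nearest-neighbour retrieval over fixed-dimensional feature vectors using a distance metric. The feature extractor is a placeholder (random vectors here); published CBVR systems use far richer video representations.

```python
# A minimal sketch of content-based video retrieval by nearest-neighbour
# search in a shared feature space; features are synthetic placeholders.
import numpy as np

def retrieve(query_feat, library_feats, k=3):
    """Return indices of the k library videos closest to the query."""
    # Cosine distance between unit-normalized feature vectors.
    q = query_feat / np.linalg.norm(query_feat)
    lib = library_feats / np.linalg.norm(library_feats, axis=1, keepdims=True)
    dist = 1.0 - lib @ q
    return np.argsort(dist)[:k]

library = np.random.randn(100, 128)   # 100 videos, 128-D features each
query = np.random.randn(128)
print(retrieve(query, library))       # indices of the 3 most similar videos
```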
Anatomical Segmentation, Tool Recognition, and Task Recognition

(a) Machine learning
Surgical procedures can typically be "chaptered" into different levels of granularity: the surgical procedure, the phases, the steps, the activities, and the physical gestures. Different levels of granularity can currently be achieved through either a top-down or a bottom-up approach. In the context of a bottom-up approach, computer-assisted surgical (CAS) systems play an essential role by retrieving low-level information from the operating room (OR) for automatic recognition of high-level surgical tasks. Consequently, such frameworks can automatically detect procedures, evaluate surgeon performance, and increase surgical efficiency as well as the quality of care in the OR. However, before a surgical procedure can be "chaptered," certain steps often need to be taken, including extracting information on the various instruments in the surgical field of view and anatomical segmentation. Several limitations exist for extracting this information. First, instrument recognition is often made difficult by the similarities in appearance between instruments or by the differences in instrument scale, color gradient, or orientation. Secondly, for anatomical segmentation in cataract surgery, the pupil is often the region of interest, and automatic segmentation may be limited by interferences in the microscope field of vision that obscure the pupil. Bouget et al. used a bag-of-words, ML-based approach that detected surgical tools with 84% accuracy and an image-based analysis that detected the pupil as the region of interest with 95% accuracy. Together, the addition of these two modules within the framework resulted in the automatic detection of eight phases of cataract surgery (preparation, betadine injection, corneal incision, capsulorhexis, phacoemulsification, cortical aspiration, IOL implantation, IOL adjustment and wound sealing) with 94% accuracy [33]. Lalys similarly proposed a high-level task recognition system based on application-dependent visual cues and time series analysis, using either a Dynamic Time Warping (DTW) or a Hidden Markov Model (HMM) algorithm (a minimal DTW sketch follows at the end of this subsection). Five subsystems based on visual cues were implemented: color-oriented visual cues (simple histogram intersection), texture-oriented visual cues (a bag-of-words approach), shape-oriented visual cues (a Haar classifier trained for instrument categorization, with a bag-of-words approach used for other instrument detection), and all other visual cues. Using this framework, the authors achieved a global recognition rate of almost 94% for

12 phases of cataract surgery (preparation, (b) Deep learning


betadine injection, paracentesis, main inci- Recently, deep learning models that
sion, viscoelastic injection, capsulorhexis, employ convoluted neural networks (CNNs)
phacoemulsification, cortical aspiration of to automatically process and learn from input
big lens pieces, cortical aspiration of the data have been employed for surgical skill
reminiscent lens, expansion of main incision, assessment, achieving significantly more
IOL insertion, IOL adjustment and wound success than previous state-of-the-art meth-
sealing) and suggested that their framework ods [39, 40]. Training CNNs, however,
could be adjusted to recognize the major sur- requires large amounts of data. To mitigate
gical phases of any new procedure [34]. this need, Zisimopoulos et al. utilized surgi-
Building upon their initial work, Lalys et al. later attempted to identify a lower level of granularity: surgical activities, which they defined as “the use of one surgical tool for one surgical action performed on one anatomical structure” [35]. Using a dataset of 20 cataract surgeries, 18 activities were identified, containing up to 25 pairs of activities. A frame-by-frame recognition rate of 64.5% was achieved using image analysis techniques [35]. However, the algorithms developed by Bouget et al. and Lalys et al. are unable to identify surgical tasks in real time. To address this, Quellec et al. proposed the use of Content-Based Video Retrieval (CBVR), which aims at finding videos or video segments that are similar to the query video inside a video collection. Quellec et al. found that the application of the CBVR system compares favorably with a state-of-the-art human action recognition system for real-time recognition of most high-level surgical tasks in epiretinal membrane surgery and cataract surgery. Training the algorithm, however, required a substantial amount of time, lasting 16 h per surgical task or action on average [36]. Expanding upon their work, Quellec et al. successfully used CBVR for the automatic segmentation and categorization of cataract surgery tasks, achieving an average area under the ROC curve (AUROC) of 0.83 while analyzing 186 surgeries performed by ten surgeons of various experience levels [37]. Most recently, Charrière et al. successfully identified cataract surgical phases (AUROC 0.83) and steps (AUROC 0.69) in real time, via the recognition of surgical instruments in a frame (labelled by surgeons), and the use of CBVR [38].

(b) Deep learning

Recently, deep learning models that employ convolutional neural networks (CNNs) to automatically process and learn from input data have been employed for surgical skill assessment, achieving significantly more success than previous state-of-the-art methods [39, 40]. Training CNNs, however, requires large amounts of data. To mitigate this need, Zisimopoulos et al. utilized surgical simulation to train deep learning models for cataract surgery instrument detection and segmentation. In doing so, theirs was the first attempt to train deep learning models for surgical instrument detection on simulated data while demonstrating promising results to generalize on real data [41]. The same group later attempted to use recurrent neural networks (RNNs) for surgical tool and phase recognition in cataract videos by first identifying the instrument with a CNN, and then applying CNN features to segment phases and label them [42]. Using the CATARACTS data set, Zisimopoulos et al. reported a 78.3% frame-level accuracy [42]. Similar AUROCs were also reported by Yu et al. and Primus et al. for automated detection of phases in cataract surgery procedures [32, 43]. Al Hajj et al. later proposed a model that aimed to take advantage of motion information instead of analyzing images independently inside the video stream. The CNNs were evaluated using 30 cataract surgery videos (6 h of video) and successfully detected 10 surgical tools with 95% accuracy [44].
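A generic sketch of this CNN-then-RNN pattern in PyTorch (not the architecture used in [42]; the ResNet-18 frame encoder, the GRU and the tensor shapes are illustrative assumptions):

import torch
import torch.nn as nn
from torchvision import models

class PhaseRNN(nn.Module):
    """Per-frame CNN features fed to a recurrent layer for phase labelling."""
    def __init__(self, n_phases, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any frame encoder would do
        backbone.fc = nn.Identity()                # keep the 512-D frame features
        self.encoder = backbone
        self.rnn = nn.GRU(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_phases)

    def forward(self, clip):                       # clip: (B, T, 3, 224, 224)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out)                      # per-frame phase logits

logits = PhaseRNN(n_phases=12)(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)                                # torch.Size([2, 8, 12])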
Capsulorhexis as an Example of ML and DL Assessment

Computer vision technology has undergone a significant amount of transformation over the past few decades. In surgical fields, computer vision applications have been used to track surgeons’ hand motion from captured videos and assess technical skill and dexterity. In ophthalmology, Zhu et al. were the first to use this technology for 23 cataract surgery (Kitaro) videos and analyze capsulorhexis performance in terms of 3 metrics:
spatiality, duration and motion. Compared to expert gradings, their algorithm achieved 85.2% “soft” accuracy and 58.3% “hard” accuracy on the motion score. The authors defined “soft” accuracy as results being correct when grades were the same or adjacent, and “hard” accuracy when gradings were the same. Additionally, the authors found their algorithm to be more consistent and more effective at discriminating between extremes in skill level than human evaluators [45].

Deep neural networks were most recently used by Kim et al. for the objective assessment of technical skill in capsulorhexis. In their study, one expert surgeon manually annotated each video for technical skill (expert/novice) using the two items for capsulorhexis (a, commencement of flap and follow-through; b, formation and completion) on the ICO-OSCAR:phaco rubric. An expert was someone who received a score of five on at least one item for capsulorhexis and at least four on the other item. The CNN models distinguished skill class with 84.8% accuracy when instrument tip velocities were used and 63.4% accuracy when optical flow fields were used [46].

Another recent study sought to predict technical expertise by means of ML approaches through context-specific metrics of capsulorhexis, as conventional objective metrics often ignore task-specific contexts and have little relevance for feedback. The tips and insertion sites of Utrata forceps were manually annotated and used with time point notations of sub-, post-, supra-, and pre-incisional quadrants to compute different types of metrics that addressed specific components of the task, i.e., grasp/tear motions, tool position/angle, and quadrant specificity. A random forests algorithm was used to map the various metrics to the expertise label described above, and it was found that different subsets of the metric set yielded varying measures of algorithm performance. Preliminary results show that a random forests algorithm composed of 10 trees modeling metrics related to tool position and distance had a sensitivity of 0.72, specificity of 0.70, and AUROC of 0.75, demonstrating that context-specific metrics can encode information about technical expertise that can be translated into instructional feedback.
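The final mapping step can be sketched with scikit-learn; the 10-tree random forest mirrors the study, while the features and expertise labels below are random stand-ins for the annotated capsulorhexis metrics:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))            # stand-in tool position/distance metrics
y = rng.integers(0, 2, size=120)         # 0 = novice, 1 = expert

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))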
Current State and Future Direction

The future of cataract surgery skills assessment is moving towards autonomous techniques for more effective and efficient surgical skill acquisition. This push stems from the need to ensure both quality and safety in surgery [47].

Most studies to date have focused on differentiating novices from expert surgeons rather than exploring the nuanced differences between them. We believe that being able to identify such differences and assess an “intermediate” level surgeon would be more relevant for resident assessment. In addition, a marked variation exists in the definition of “experts” and “novices” across the different studies, pointing out the necessity to unify definitions. Many studies also did not take into consideration many of the confounding factors that could have affected surgical performance when measuring parameters, such as the level of anatomical knowledge. Some papers compared medical students rather than junior trainees to experienced surgeons, which could falsely increase the reliability of their results, as the difference between experienced surgeons and students may be more marked than that between surgeons and residents [48].

In ophthalmology, machine learning has been used for tool recognition and pupil segmentation to identify the stages of cataract surgery using only visual cues from surgical videos [33, 34]. However, to establish the ground truth, the dataset must be labelled by an “expert” in the field. This process can be very time-consuming and highly subjective, leading to variations in the defined ground truth. One solution to these disadvantages is using crowd evaluators for rating performances. Crowdsourcing presents a unique opportunity to obtain large amounts of data in a cost-effective and time-efficient manner. More importantly, studies have found crowdsourced evaluations to be reliable and accurate [49]. Kim et al. showed that, with the proper training and resources, crowdsourced workers (CWs) could identify surgical tools with 88% accuracy when compared to the ground truth (an expert’s annotation) [49]. Crowdsourcing has also been shown to be effective for recognizing differing
levels of surgical skill in robotic, laparoscopic and urologic procedures [50, 51].

Recent technical advances, particularly in deep learning, have transformed algorithms, enabling accurate identification of surgical instruments and automated segmentation, as well as the classification and analysis of cataract surgery videos for data-driven, objective, and valid assessment and feedback. In clinical practice, the power of CNNs can be leveraged for cataloguing datasets that can be used in different applications for surgical training. Despite these advances, the pace of progress has been limited by the scale of data that can be captured in the OR and by the extensive manual annotation of anatomy and human activities that is required before such datasets can be processed. For instance, if one were to analyze new aspects of a video that had previously not been annotated, relabelling the whole dataset would be needed, which would be an extremely tedious and time-consuming task.

As previously mentioned, motion analysis techniques have yet to be implemented for cataract surgery assessment and have only been used in corneal suturing and oculoplastics [11, 12]. Besides the costs involved [13], the main disadvantage of using this form of motion analysis is that it cannot be used in the OR, as it may compromise sterility and add clutter or extra steps to a surgical routine. Machine learning techniques have been applied to motion analysis metrics in non-ophthalmic surgical fields for surgical phase recognition [52, 53]. Different subcategories of motion analysis currently exist: (1) hand motion analysis, (2) tool motion analysis, (3) eye motion tracking, and (4) muscle contraction analysis [6]. Data from hand-worn motion sensors have been reasonably successful at differentiating between surgeons of different levels [53–55]. In 2018, multiple papers were published on sensor-based tool motion tracking for surgical skills assessment [56–59] and prediction of surgical outcomes [60, 61]. Some achieved nearly perfect results at skills-level classification [62–64] and outperformed state-of-the-art methods [65].

Modelling the optimally performed surgery (performed by an expert) is a challenge but necessary to assess novice performance and measure where they stand in comparison. This can be an issue if experts use slightly different techniques or different tools to do so. Another issue is the lack of uniform criteria for novice, intermediate, and expert surgeons in the studies mentioned. In addition, data on surgical skill level are continuous rather than categorical, ranging between the different levels of expertise with no discrete boundaries between them. Furthermore, the relationship between objectively measured performance metrics and expertise classification is non-linear. Hajshirmohammadi and Payandeh proposed a solution for this problem by using fuzzy set theory to classify novice, intermediate, and expert performance on a laparoscopic VR surgical simulator [66], although they did not achieve optimal results.

The ultimate goal of these technological advances would be to improve the surgeon learning curve and to ensure that patients are able to navigate their surgical care. In the future, deep learning algorithms may advance to a point where they can provide autonomous feedback to surgeons, thus eliminating the traditional apprentice model. However, before that can be achieved, it is important to lay the groundwork. In this regard, DL algorithms need to be refined even further for accurate tool recognition and phase segmentation. Only when algorithms can understand the granular concepts can they learn higher-level concepts. Implementation of such algorithms can subsequently provide a scalable, rapid, and objective method for the assessment of technical skill across all surgical disciplines [45, 46, 67–72].

Skills Assessment in Other Surgical Specialties

Automated surgical skills assessment methods in non-ophthalmic specialties have progressed faster than in ophthalmology [6, 60, 73]. This is particularly true for robotic surgery [56, 59–61, 63, 65, 67, 68, 74, 75], the data from which have been deemed to be the most transparent, scalable, and comprehensive [5]. There has also been notable progress in laparoscopic surgical assessment [53, 55, 58, 69, 70].
The majority of phase recognition via ML algorithms has been applied to laparoscopic surgery, but there has also been progress in phase recognition for minimally invasive surgery (MIS) [76, 77] and robotic surgery [78], with some studies combining computer vision with kinematic data [79]. ML used for surgical assessment has mostly been applied to robotic surgery [80], with the majority of focus being on kinematic data [63, 74] and some research on computer vision [60, 67, 68, 74, 75]. Fard et al. used motion data to assess surgical skill in robotic surgery and suggested that the classification methods they proposed (k-nearest neighbours, logistic regression and support vector machines) could be used to provide tailored feedback for trainees [59]. Zia et al. also used kinematic data for skills classification and generated “task highlights” that showed which parts of the procedure contributed the most to the final scores, allowing for individual feedback [63]. An important goal expressed by these studies is to provide tailored feedback for trainees in real time [59, 63] and to predict surgical outcomes using ML methods.

The future of assessing cataract surgical skills is moving towards automated methods to increase efficiency and objectivity. Such methods involve extracting data from video recordings, virtual reality surgical simulators, and potentially sensor-based motion analysis, to be processed via machine learning or deep learning algorithms and produce meaningful assessments. Further research is needed to refine these methods in order to incorporate these measures into the training of ophthalmic residents.

References

1. National Eye Institute. Cataract data and statistics [Internet]. 2019. https://www.nei.nih.gov/learn-about-eye-health/resources-for-health-educators/eye-health-data-and-statistics/cataract-data-and-statistics
2. Accreditation Council for Graduate Medical Education.
3. Lee AG, Volpe N. The impact of the new competencies on resident education in ophthalmology. Ophthalmology. 2004;111(7):1269–70.
4. Puri S, Sikder S. Cataract surgical skill assessment tools. J Cataract Refract Surg [Internet]. 2014;40(4):657–65. https://doi.org/10.1016/j.jcrs.2014.01.027.
5. Vedula SS, Ishii M, Hager GD. Objective assessment of surgical technical skill and competency in the operating room. Annu Rev Biomed Eng [Internet]. 2017;19(1):301–25. https://doi.org/10.1146/annurev-bioeng-071516-044435.
6. Levin M, McKechnie T, Khalid S, Grantcharov TP, Goldenberg M. Automated methods of technical skill assessment in surgery: a systematic review. J Surg Educ [Internet]. 2019;1–11. https://www.sciencedirect.com/science/article/pii/S1931720419301643?dgcid=raven_sd_aip_email
7. Ogawa T, Shiba T, Tsuneoka H. Usefulness of surgical media center as a cataract surgery educational tool. J Ophthalmol. 2016;2016.
8. Gauba V, Tsangaris P, Tossounis C, Mitra A, McLean C, Saleh GM. Human reliability analysis of cataract surgery. Arch Ophthalmol. 2008;126(2):173–7.
9. Cox A, Dolan L, MacEwen CJ. Human reliability analysis: a new method to quantify errors in cataract surgery. Eye. 2008;22(3):394–7.
10. Finn AP, Borboli-Gerogiannis S, Brauner S, Peggy Chang HY, Chen S, Gardiner M, et al. Assessing resident cataract surgery outcomes using medicare physician quality reporting system measures. J Surg Educ [Internet]. 2016;73(5):774–9. https://doi.org/10.1016/j.jsurg.2016.04.007.
11. Saleh GM, Voyazis Y, Hance J, Ratnasothy J, Darzi A. Evaluating surgical dexterity during corneal suturing. Arch Ophthalmol. 2006;124(9):1263–6.
12. Saleh GM, Sim D, Lindfield D, Borhani M, Ghoussayni S, Gauba V. Motion analysis as a tool for the evaluation of oculoplastic surgical skill. Arch Ophthalmol. 2008;126(2):213–6.
13. Nandigam K, Soh J, Gensheimer WG, Ghazi A, Khalifa YM. Cost analysis of objective resident cataract surgery assessments. J Cataract Refract Surg [Internet]. 2015;41(5):997–1003. https://doi.org/10.1016/j.jcrs.2014.08.041.
14. Smith P, Tang L, Balntas V, Young K, Athanasiadis Y, Sullivan P, et al. “PhacoTracking”: an evolving paradigm in ophthalmic surgical training. JAMA Ophthalmol. 2013;131(5):659–61.
15. Balal S, Smith P, Bader T, Tang HL, Sullivan P, Thomsen ASS, et al. Computer analysis of individual cataract surgery segments in the operating room. Eye [Internet]. 2019;33(2):313–9. http://www.nature.com/articles/s41433-018-0185-1
16. Din N, Smith P, Emeriewen K, Sharma A, Jones S, Wawrzynski J, et al. Man versus machine: software training for surgeons – an objective evaluation of human and computer-based training tools for cataract surgical performance. J Ophthalmol. 2016;2016.
17. Low SAW, Braga-Mele R, Yan DB, El-Defrawy S. Intraoperative complication rates in cataract surgery performed by ophthalmology resident trainees compared to staff surgeons in a Canadian academic center. J Cataract Refract Surg [Internet]. 2018;44(11):1344–9. https://doi.org/10.1016/j.jcrs.2018.07.028.
18. Cremers SL, Ciolino JB, Ferrufino-Ponce ZK, Henderson BA. Objective assessment of skills in intraocular surgery (OASIS). Ophthalmology. 2005;112(7):1236–41.
19. Cremers SL, Lora AN, Ferrufino-Ponce ZK. Global rating assessment of skills in intraocular surgery (GRASIS). Ophthalmology. 2005;112(10):1655–60.
20. Feldman BH, Geist CE. Assessing residents in phacoemulsification. Ophthalmology. 2007;114(8):1586.e2.
21. Saleh GM, Gauba V, Mitra A, Litwin AS, Chung AKK, Benjamin L. Objective structured assessment of cataract surgical skill. Arch Ophthalmol. 2007;125(3):363–6.
22. Swaminathan M, Ramasubramanian S, Pilling R, Li J, Golnik K. ICO-OSCAR for pediatric cataract surgical skill assessment. J AAPOS [Internet]. 2016;20(4):364–5. https://doi.org/10.1016/j.jaapos.2016.02.015.
23. Golnik KC, Beaver H, Gauba V, Lee AG, Mayorga E, Palis G, et al. Cataract surgical skill assessment. Ophthalmology [Internet]. 2011;118(2):427–427.e5. https://linkinghub.elsevier.com/retrieve/pii/S0161642010010341
24. RCOphth. Objective Assessment of Surgical and Technical Skills (OSATS) [Internet]. https://www.rcophth.ac.uk/curriculum/ost/assessments/workplace-based-assessments/objective-assessment-of-surgical-and-technical-skills-osats/
25. Lee AG, Greenlee E, Oetting TA, Beaver HA, Johnson AT, Boldt HC, et al. The Iowa ophthalmology wet laboratory curriculum for teaching and assessing cataract surgical competency. Ophthalmology. 2007;114(7):21–6.
26. Taylor JB, Binenbaum G, Tapino P, Volpe NJ. Microsurgical lab testing is a reliable method for assessing ophthalmology residents’ surgical skills. Br J Ophthalmol. 2007;91(12):1691–4.
27. Fisher JB, Binenbaum G, Tapino P, Volpe NJ. Development and face and content validity of an eye surgical skills assessment test for ophthalmology residents. Ophthalmology. 2006;113(12):2364–70.
28. Dai JC, Lendvay TS, Sorensen MD. Crowdsourcing in surgical skills acquisition: a developing technology in surgical education. J Grad Med Educ [Internet]. 2017;9(6):697–705. http://www.ncbi.nlm.nih.gov/pubmed/29270257.
29. Polin MR, Siddiqui NY, Comstock BA, Hesham H, Brown C, Lendvay TS, et al. Crowdsourcing: a valid alternative to expert evaluation of robotic surgery skills. Am J Obstet Gynecol [Internet]. 2016;215(5):644.e1–7. http://www.ncbi.nlm.nih.gov/pubmed/27365004
30. Kowalewski TM, Comstock B, Sweet R, Schaffhausen C, Menhadji A, Averch T, et al. Crowd-sourced assessment of technical skills for validation of basic laparoscopic urologic skills tasks. J Urol [Internet]. 2016;195(6):1859–65. https://doi.org/10.1016/j.juro.2016.01.005.
31. Sheikh AY, Fann JI. Artificial intelligence. Thorac Surg Clin [Internet]. 2019;29(3):339–50. https://doi.org/10.1016/j.thorsurg.2019.03.011.
32. Yu F, Silva Croso G, Kim TS, Song Z, Parker F, Hager GD, et al. Assessment of automated identification of phases in videos of cataract surgery using machine learning and deep learning techniques. JAMA Netw Open. 2019;2(4):e191860.
33. Bouget D, Lalys F, Jannin P, et al. Surgical tools recognition and pupil segmentation for cataract surgical process modeling. In: Medicine meets virtual reality – NextMed. 2012. p. 78–84.
34. Lalys F, Riffaud L, Bouget D, Jannin P. A framework for the recognition of high-level surgical tasks from video images for cataract surgeries. IEEE Trans Biomed Eng. 2012;59(4):966–76.
35. Lalys F, Bouget D, Riffaud L, Jannin P. Automatic knowledge-based recognition of low-level tasks in ophthalmological procedures. Int J Comput Assist Radiol Surg. 2013;8(1):39–49.
36. Quellec G, Charrière K, Lamard M, Droueche Z, Roux C, Cochener B, et al. Real-time recognition of surgical tasks in eye surgery videos. Med Image Anal. 2014;18(3):579–90.
37. Quellec G, Lamard M, Cochener B, Cazuguel G. Real-time task recognition in cataract surgery videos using adaptive spatiotemporal polynomials. IEEE Trans Med Imaging. 2015;34(4):877–87.
38. Charrière K, Quellec G, Lamard M, Martiano D, Cazuguel G, Coatrieux G, et al. Real-time analysis of cataract surgery videos using statistical models. Multimed Tools Appl. 2017;76(21):22473–91.
39. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012:1097–105.
40. Zhang Y, Qiu Z, Yao T, Liu D, Mei T. Fully convolutional adaptation networks for semantic segmentation. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2014:6810–8.
41. Zisimopoulos O, Flouty E, Stacey M, Muscroft S, Giataganas P, Nehme J, et al. Can surgical simulation be used to train detection and classification of neural networks? Healthc Technol Lett. 2017;4(5):216–22.
42. Zisimopoulos O, Flouty E, Luengo I, Giataganas P, Nehme J, Chow A, et al. DeepPhase: surgical phase recognition in CATARACTS videos. In: Lecture Notes in Computer Science (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). 2018;11073 LNCS. p. 265–72.
43. Primus MJ, Putzgruber-Adamitsch D, Taschwer M, Münzer B, El-Shabrawi Y, Böszörmenyi L, et al. Frame-based classification of operation phases in cataract surgery videos. 2018. p. 241–53. https://doi.org/10.1007/978-3-319-73603-7_20
44. Al Hajj H, Lamard M, Charriere K, Cochener B, Quellec G. Surgical tool detection in cataract surgery videos through multi-image fusion inside a convolutional neural network. Proc Annu Int Conf IEEE Eng Med Biol Soc EMBS. 2017:2002–5.
45. Zhu J, Luo J, Soh JM, Khalifa YM. A computer vision-based approach to grade simulated cataract surgeries. Mach Vis Appl. 2014;26(1):115–25.
46. Kim TS, O’Brien M, Zafar S, Hager GD, Sikder S, Vedula SS. Objective assessment of intraoperative technical skill in capsulorhexis using videos of cataract surgery. Int J Comput Assist Radiol Surg [Internet]. 2019;14(6):1097–105. https://doi.org/10.1007/s11548-019-01956-8
47. Spiteri A, Aggarwal R, Kersey T, Benjamin L, Darzi A, Bloom P. Phacoemulsification skills training and assessment. Br J Ophthalmol. 2010;94(5):536–41.
48. Selvander M, Åsman P. Cataract surgeons outperform medical students in Eyesi virtual reality cataract surgery: evidence for construct validity. Acta Ophthalmol. 2013;91(5):469–74.
49. Kim TS, Malpani A, Reiter A, Hager GD, Sikder S, Swaroop Vedula S. Crowdsourcing annotation of surgical instruments in videos of cataract surgery. In: Stoyanov D, Taylor Z, Balocco S, Sznitman R, Martel A, Maier-Hein L, et al., editors. Intravascular imaging and computer assisted stenting and large-scale annotation of biomedical data and expert label synthesis. Cham: Springer International; 2018. p. 121–30.
50. Chen C, White L, Kowalewski T, Aggarwal R, Lintott C, Comstock B, et al. Crowd-sourced assessment of technical skills: a novel method to evaluate surgical performance. J Surg Res [Internet]. 2014;187(1):65–71. https://doi.org/10.1016/j.jss.2013.09.024.
51. Prebay ZJ, Peabody JO, Miller DC, Ghani KR. Video review for measuring and improving skill in urological surgery. Nat Rev Urol [Internet]. 2019;16(4):261–7. https://doi.org/10.1038/s41585-018-0138-2.
52. Bardram JE, Doryab A, Jensen RM, Lange PM, Nielsen KLG, Petersen ST. Phase recognition during surgical procedures using embedded and body-worn sensors. In: 2011 IEEE International Conference on Pervasive Computing and Communications, PerCom 2011. 2011. p. 45–53.
53. Kowalewski K-F, Garrow CR, Schmidt MW, Benner L, Müller-Stich BP, Nickel F. Sensor-based machine learning for workflow detection and as key to detect expert level in laparoscopic suturing and knot-tying. Surg Endosc [Internet]. 2019. https://doi.org/10.1007/s00464-019-06667-4
54. Watson RA. Use of a machine learning algorithm to classify expertise: analysis of hand motion patterns during a simulated surgical task. Acad Med. 2014;89(8):1163–7.
55. Miao T, Tomikawa M, Akahoshi T, Hashizume M, Lefor AK, Souzaki R, et al. Feasibility of an AI-based measure of the hand motions of expert and novice surgeons. Comput Math Methods Med. 2018;2018:1–6.
56. Wang Z, Majewicz FA. Deep learning with convolutional neural network for objective skill evaluation in robot-assisted surgery. Int J Comput Assist Radiol Surg. 2018;13(12):1959–70.
57. Forestier G, Fawaz HI, Weber J, Idoumghar L, Muller P-A, Petitjean F, et al. Surgical motion analysis using discriminative interpretable patterns. Artif Intell Med. 2018;91:3–11.
58. Oquendo YA, Riddle EW, Hiller D, Blinman TA, Kuchenbecker KJ. Automatically rating trainee skill at a pediatric laparoscopic suturing task. Surg Endosc [Internet]. 2018;32(4):1840–57. https://doi.org/10.1007/s00464-017-5873-6.
59. Fard MJ, Ameri S, Darin Ellis R, Chinnam RB, Pandya AK, Klein MD. Automated robot-assisted surgical skill evaluation: predictive analytics approach. Int J Med Robot Comput Assist Surg. 2018;14(1):1–10.
60. Hung AJ, Chen J, Gill IS. Automated performance metrics and machine learning algorithms to measure surgeon performance and anticipate clinical outcomes in robotic surgery. JAMA Surg [Internet]. 2018;153(8):770. https://doi.org/10.1001/jamasurg.2018.1512
61. Hung AJ, Chen J, Che Z, Nilanon T, Jarc A, Titus M, et al. Utilizing machine learning and automated performance metrics to evaluate robot-assisted radical prostatectomy performance and predict outcomes. J Endourol [Internet]. 2018;32(5):438–44. https://doi.org/10.1089/end.2018.0035.
62. Ismail Fawaz H, Forestier G, Weber J, Idoumghar L, Muller PA. Evaluating surgical skills from kinematic data using convolutional neural networks. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). 2018;11073 LNCS:214–21.
63. Zia A, Essa I. Automated surgical skill assessment in RMIS training. Int J Comput Assist Radiol Surg. 2018;13(5):731–9.
64. Zia A, Sharma Y, Bettadapura V, Sarin EL, Essa I. Video and accelerometer-based motion analysis for automated surgical skills assessment. Int J Comput Assist Radiol Surg. 2018;13(3):443–55.
65. Wang Z, Fey AM. SATR-DL: improving surgical skill assessment and task recognition in robot-assisted surgery with deep neural networks. Proc Annu Int Conf IEEE Eng Med Biol Soc EMBS. 2018;(1):1793–6.
66. Hajshirmohammadi I, Payandeh S. Fuzzy set theory for performance evaluation in a surgical simulator. Presence Teleoperators Virtual Environ. 2007;16(6):603–22.
67. Zhang Y, Law H, Kim T-K, Miller D, Montie J, Deng J, et al. PD58-12 surgeon technical skill assessment using computer vision-based analysis. J Urol [Internet]. 2018;199(4S). https://doi.org/10.1016/j.juro.2018.02.2800
68. Law H, Ghani K, Deng J. Surgeon technical skill assessment using computer vision based analysis. Proc Mach Learn Healthc. 2017;68.
69. Handelman A, Schnaider S, Schwartz-Ossad A, Barkan R, Tepper R. Computerized model for objectively evaluating cutting performance using a laparoscopic box trainer simulator. Surg Endosc [Internet]. 2018. https://doi.org/10.1007/s00464-018-6598-x
70. Alonso-Silverio GA, Pérez-Escamirosa F, Bruno-Sanchez R, Ortiz-Simon JL, Muñoz-Guerrero R, Minor-Martinez A, et al. Development of a laparoscopic box trainer based on open source hardware and artificial intelligence for objective assessment of surgical psychomotor skills. Surg Innov. 2018;25(4):380–8.
71. Jin A, Yeung S, Jopling J, Krause J, Azagury D, Milstein A, et al. Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. In: Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018. 2018. p. 691–9.
72. Miller B, Azari D, Yu YH, Radwin R, Le B, Wi M. Use of machine learning algorithms to classify surgical maneuvers. 2019;201(4):2019.
73. Fard MJ, Ameri S, Chinnam RB, Pandya AK, Klein MD, Ellis RD. Machine learning approach for skill evaluation in robotic-assisted surgery. 2016. http://arxiv.org/abs/1611.05136
74. Chen J, Cheng N, Cacciamani G, Oh P, Lin-Brande M, Remulla D, et al. Objective assessment of robotic surgical technical skill: a systematic review. J Urol. 2019;201(3):461–9.
75. Loukas C. Video content analysis of surgical procedures. Surg Endosc [Internet]. 2018;32(2):553–68. https://doi.org/10.1007/s00464-017-5878-1.
76. Klank U, Padoy N, Feussner H, Navab N. Automatic feature generation in endoscopic images. Int J Comput Assist Radiol Surg. 2008;3(3–4):331–9.
77. Blum T, Feußner H, Navab N. Modeling and segmentation of surgical workflow from laparoscopic video. In: Lecture Notes in Computer Science (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). 2010;6363 LNCS(PART 3). p. 400–7.
78. Reiley CE, Hager GD. Decomposition of robotic surgical tasks: an analysis of subtasks and their correlation to skill. Model Monit Comput Assist Interv. 2009.
79. Voros S, Hager GD. Towards “real-time” tool-tissue interaction detection in robotically assisted laparoscopy. In: Proceedings of 2nd Bienn IEEE/RAS-EMBS Int Conf Biomed Robot Biomechatronics, BioRob 2008. 2008. p. 562–7.
80. Zia A. Automated benchmarking of surgical skills using machine learning. 2018.
19 Artificial Intelligence in Ophthalmology Triaging

Yiran Tan, Stephen Bacchi, and Weng Onn Chan
South Australian Institute of Ophthalmology, Royal Adelaide Hospital, Adelaide, Australia

Triaging in Ophthalmology

Triaging in ophthalmology describes the process of classifying patients according to the severity and urgency of their medical condition. Effective triage methods are essential for any ophthalmic practice to enable efficient workflow, good clinical management, and appropriate resource allocation. A tertiary ophthalmology center can expect a large number of referrals from multiple sources, including primary care physicians, medical specialists, optometrists, emergency departments, other hospitals and patient self-referrals. To ensure timely intervention for patients with urgent ophthalmic problems, external referrals are manually sorted into one of several categories, from most urgent to least urgent. Triaging classification practices vary around the world and may include dichotomous categorization of urgent vs. non-urgent, or numerical categorization. Each hospital or local health network usually has its own guideline for managing referrals. The Central Adelaide Local Health Network (CALHN) in Adelaide, Australia, has a tertiary adult ophthalmology referral center that provides assessment and treatment for all sub-specialty areas of adult ophthalmology [1]. Adult triage criteria are divided into four categories (see Fig. 19.1). Emergency referrals include time-dependent sight-threatening pathologies such as acute angle closure crisis, retinal detachment, eye trauma and orbital cellulitis.

While the triaging process may appear relatively straightforward, accurate triaging is complicated by the diverse nature of ophthalmic presentations and varying clinical guidelines.

Conventionally, triaging of referrals in ophthalmology is conducted by both medical and nursing professionals. Referrals may be subcategorized based on the patient’s age, geographic location, nature of presenting complaint and acuity of symptoms. The Royal Australian and New Zealand College of Ophthalmologists (RANZCO) has dedicated referral pathways for the management of specific ophthalmic conditions such as age-related macular degeneration, glaucoma and diabetic retinopathy [2–4]. Optometry Australia classifies adult referrals into five categories, including conditions requiring immediate presentation to the emergency department, immediate referral to the optometrist, referral within 24 hours, next available appointment within 7 days, referral to an ophthalmologist if within 28 days of surgery, and co-management with general practitioners [5]. In the United Kingdom, referral guidelines exist for each individual National Health Service (NHS) Trust. The Oxford Health NHS Foundation Trust, for example, has a referral guideline based on specific ophthalmic conditions, with recommended

Fig. 19.1 Adult triage criteria for referrals at a tertiary ophthalmology center in Adelaide, Australia. Referrals from general practitioners, specialists, walk-in self-referrals, the emergency department, other hospitals and inpatient consults are triaged by an ophthalmology nurse or medical officer into Emergency (to be discussed with the on-call registrar), Category 1 (target within 4 weeks), Category 2 (target within 3 months) or Category 3 (target within 12 months). Acute sight-threatening pathologies are usually sent to the emergency department and then referred to the ophthalmology service by a phone call notification in addition to the written referral

referral pathway to either eye casualty or a specified subspecialty clinic [6].

Despite the availability of dedicated referral guidelines, triaging in ophthalmology remains a dynamic process. Effective triaging requires the responsible health professional to have sound clinical knowledge, compassion, and the ability to identify and correct errors.

Application of Artificial Intelligence

Current processes involved in the categorization of referrals are both time consuming and prone to error. Human errors may stem from a lack of clinical experience, failure to follow protocols, written miscommunication or transcription error. The goal of applying artificial intelligence (AI) to ophthalmology triaging is to improve both the accuracy and efficiency of the triage process.

Machine learning (ML) assisted triaging has been successfully applied to other fields of medicine. For example, ML methods appear to be useful in triaging undifferentiated patients in the emergency department and have been shown to be accurate in the triage of chronic obstructive pulmonary disease (COPD) exacerbations. The majority of ML studies in ophthalmology have focused on image interpretation, including the application of ML to aid diagnosis based upon fundus photographs, visual field analysis and optical coherence tomography [7–9]. The application of deep learning (DL) and natural language processing (NLP) to the issue of triaging is a novel approach established at the South Australian Institute of Ophthalmology (SAIO), Australia. Results from
preliminary studies suggest that ML, in particular DL, can accurately assist with the triaging of ophthalmology referrals.

Deep Learning Analysis

A pilot study was first conducted in 2018 involving the use of retrospectively collected outpatient ophthalmology referrals to determine how effectively NLP can identify referrals requiring a “category one” (urgent) prioritization. A secondary aim was to emulate human triaging and determine accuracy across three referral categories (Categories 1–3) [10]. A referral database was established from consecutive referrals received by the Royal Adelaide Hospital Ophthalmology Department between January 2018 and March 2019. Referrals for these patients had all been made within the previous 24 months. Pertinent information, including clinical synopsis, triage categorisation, and referral source, were manually extracted from scanned electronic referrals using commercially available optical character recognition (OCR) software (Adobe, San Jose, CA).

The DL analysis of ophthalmology triage notes was conducted in three phases (see Fig. 19.2):

Fig. 19.2 Flowchart demonstrating the development of the artificial intelligence guided triage model. Data collection: referral collection and text extraction, branching into sequence-independent and sequence-dependent streams. Pre-processing: negation detection, punctuation removal, and word stemming and tokenization, followed by count-vectorisation (sequence-independent stream) or padding (sequence-dependent stream) and a 75%/25% train–test split. Model development: models of increasing complexity are developed and applied within fivefold cross-validation. Performance analysis: the final model is applied to the test set
• Phase 1: Pre-processing of data
• Phase 2: Model Development
• Phase 3: Performance Analysis

Data Pre-processing

During pre-processing, individuals for whom there was incomplete referral data or outcome data were excluded from analysis. Referral text punctuation, such as commas and colons, was also removed.

The following pre-processing methods were applied to the referral database corpus:

• Negation detection describes the process of flagging words that follow a negative cue. For example, in the sentence “there was no change in vision”, both “change” and “vision” would be flagged as negated.
• Word stemming and tokenisation describe the processes of removing word endings and replacing unique words with unique numbers, which are then referred to as “tokens”. For example, “hypertension” and “hypertensive” may both be shortened to “hyperten”.
• Count vectorisation describes the process of transforming a collection of text documents into a vector of term or token counts.

Count vectorisation and negation detection were applied to the text prior to analysis with word-sequence-independent algorithms such as the Artificial Neural Network (ANN) and Random Forest models. The proportion of the total number of unique words included in the corpus was considered a hyperparameter, which is a parameter set before the learning process commences. Prior to analysis by a CNN, token sequences were padded, through the addition of blank tokens, to provide a consistent sequence length. Training and testing datasets were created by randomly splitting the dataset (75% training dataset, 25% testing dataset).
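A minimal sketch of these pre-processing steps, assuming a simple cue-list negation detector and scikit-learn’s CountVectorizer; the cue list and example referrals are illustrative, and stemming (e.g. with a Porter stemmer) is omitted for brevity:

import re
from sklearn.feature_extraction.text import CountVectorizer

NEGATION_CUES = {"no", "not", "denies", "without"}   # illustrative cue list

def mark_negations(text):
    """Flag words that follow a negation cue, e.g. 'no change' -> 'NEG_change'."""
    out, negated = [], False
    for raw in text.lower().split():
        word = re.sub(r"[^\w]", "", raw)
        if word in NEGATION_CUES:
            negated = True        # a real detector would also reset this flag
            continue
        out.append("NEG_" + word if negated else word)
    return " ".join(out)

referrals = ["There was no change in vision.", "Urgent review: raised IOP, left eye."]
processed = [mark_negations(r) for r in referrals]

vectoriser = CountVectorizer()              # count vectorisation
X = vectoriser.fit_transform(processed)     # one column per token
print(vectoriser.get_feature_names_out())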
Classifier Development

Various neural network architecture models were trialled on the training set using fivefold cross-validation. Initially, simple models that employed relatively small numbers of nodes and hidden layers were trialled. Further layers and model complexity were added until accuracy on the training dataset ceased to improve. Hyperparameter tuning was also conducted on the training data. The final CNN architecture included an embedding layer, a dropout layer, a convolutional layer, a maximum pooling layer, and five dense hidden layers (nodes varying from 512 to 128). The 99% most frequently appearing word tokens in the corpus were incorporated into the model.
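A sketch of this architecture in Keras; the vocabulary size, sequence length, embedding width and the widths of the three middle dense layers are assumptions, as the text specifies only the outer hidden layers (512 and 128 nodes):

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len = 5000, 300   # illustrative values

model = keras.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 64),          # embedding layer
    layers.Dropout(0.5),                       # dropout layer
    layers.Conv1D(128, 5, activation="relu"),  # convolutional layer
    layers.GlobalMaxPooling1D(),               # maximum pooling layer
    layers.Dense(512, activation="relu"),      # five dense hidden layers,
    layers.Dense(384, activation="relu"),      # 512 down to 128 nodes
    layers.Dense(256, activation="relu"),
    layers.Dense(192, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # urgent vs. non-urgent
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])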
Model Assessment

The developed models then had their accuracy in the prediction of triage categorisation assessed on the unseen test dataset. Youden’s index was used for binary classification tasks to select the cut-off score for each model. Initially, all models were used to predict the binary outcome of urgent vs. non-urgent referrals. The primary outcome was the area under the receiver operating characteristic curve (AUC). Other outcomes assessed included accuracy, F1 score, positive predictive value (PPV), negative predictive value (NPV), sensitivity and specificity. Examples of results using different cut-off scores, demonstrating high sensitivity or high specificity, were generated for the best performing model.

The best performing model on the binary outcome was then employed to predict the actual numerical triage category (category one vs. two vs. three) assigned to each referral. Under the secondary aim, the primary outcome was classification accuracy.

Due to the pilot nature of the study, no statistical tests were conducted to demonstrate the superiority of one model as compared to another.
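The cut-off selection with Youden’s index can be sketched in a few lines; the labels and model scores below are illustrative:

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])               # 1 = urgent
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.5, 0.3, 0.65])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr                        # Youden's J = sensitivity + specificity - 1
cutoff = thresholds[np.argmax(j)]    # cut-off score that maximises J
print(f"Youden-optimal cut-off: {cutoff:.2f}")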

Accuracy of AI Models

The accuracy of DL guided triaging in ophthalmology is based on a single pilot study of 208 participants. The dataset included 118 category one referrals, 61 category two referrals and 29 category three referrals. Referrals were triaged
by senior nurse practitioners with more than 15 years of clinical triaging experience at a tertiary ophthalmology centre. The mean length of referral synopsis was 68.1 words (IQR 25–93, range 2–293 words). Referral sources included general practitioners (51, 24.5%), optometrists (57, 27.4%), specialists (98, 47.1%), and the emergency department (2, 1.0%). The referrals included both internal referrals, from within the tertiary hospital (64, 30.8%), and external referrals (144, 69.2%).

Identification of Urgent Referrals

The CNN model achieved the highest AUC of 0.83 and accuracy of 0.81 in categorising urgent vs. non-urgent ophthalmology referrals. This was followed by the ANN (AUC 0.81 and accuracy 0.77) and logistic regression models (AUC 0.79 and accuracy 0.77). The Random Forest (AUC 0.77 and accuracy 0.73) and Decision Tree (AUC 0.58 and accuracy 0.6) models achieved lower accuracies. When different cut-off scores were employed for the CNN model, high specificities or high sensitivities were able to be achieved, at the expense of overall accuracy.

Coefficients of the most strongly predictive words were extracted from the logistic regression model to gauge the words on which the models may be placing the most emphasis. The word stems that were most predictive of category one triage included “urgent”, “vision”, “IOP”, “disc” and “left”. The word stems that were most predictive of non-category one classification were “cataract”, “diabet”, “le”, “mr”, and “diseas”.

Allocation of Specific Referral Category

DL models struggled to discriminate between multiple triage categories. When the CNN was applied to the classification task of identifying the specific referral category (Category 1–3), a significantly lower accuracy was achieved (0.65).

Clinical Implications

Despite a limited sample size, the pilot study demonstrated promising results to support the use of artificial intelligence in the triaging of ophthalmology referrals. With sufficient training, it may be feasible to develop a program that would flag urgent referrals requiring presentation to the eye casualty clinic. Accuracy from the existing CNN model is achieved based on text entry alone. The accuracy of the CNN model could be improved by using multi-modal inputs such as patient demographics, source of referral and clinical images.

It should also be noted that the most predictive words from the logistic regression model are not necessarily representative of the most predictive words, or combinations of words, in the CNN or ANN models. Models such as the CNN and ANN capture higher levels of association than individual words, and it is accordingly more challenging to present the weightings attributed to parts of text in these models in an interpretable fashion. Regardless of the type of model employed, the entire body of a referral text would be analysed before a suggested categorisation was made.

Challenges in AI Assisted Triaging

Small dataset size: Like all ML derived programs, the accuracy of a pre-trained model is greatly dependent on the nature of the training dataset. A small dataset will prevent the ML algorithm from gaining sufficient exposure to examples of each triage category. As a result, the ML model will fail to understand and recognize meaningful differences between triage content. In the pilot study, for example, there were only 29 referrals to which a category three prioritisation was allocated. Small dataset size also creates the problem of overfitting, where words such as “left” are included among those with high predictive value. Furthermore, the training dataset must also contain balanced and well-represented examples from every possible type of referral. It is unrealistic to expect a ML model to recognize ocular
emergencies outside of the examples that it is trained with.

Building a sufficiently large triage database is a significant and difficult undertaking. Foremost, referral documents are not standardized and differ significantly depending on the source. The document format may be hard-copy notes, emails, scanned documents, or electronic medical records. The current process of extracting text from referral notes is labor-intensive. Despite the availability of OCR, the majority of referral notes still require further human processing to ensure that the correct information is transcribed. The efficiency will likely improve with a more streamlined and consistent means of receiving referrals in the age of digital medicine. For example, with the availability of electronic medical records (EMR), a number of internal referral pathways have shifted away from faxed paper documents. In many tertiary medical centers, referrals can be sent directly via the EMR operating system and extracted as a monthly report. This will enable a large quantity of data to be captured.

Distant labels: DL models contain document classifiers that are built to detect the general concepts conveyed in the text. In the real world, the clinical urgency of referrals is often influenced by factors other than the presenting complaint. Human guided triaging will take into account a patient’s age, geographic location, past history and social circumstances. Based on the final triage category alone, ML classifiers will have difficulty emulating the complex thought process behind the triage process. The distant-label problem can be addressed by splitting the triaging task into two parts: firstly, to detect the concepts conveyed in each document, and secondly, to decide on the triage category. Managing the problem of distant labels will require the triaging personnel to manually “tag” the referral with all of the factors that led to their decision. For example, in a referral for undifferentiated vision loss to hand movement (HM) acuity, the tags for emergency triage could be “visual acuity HM” and “only eye”.

Specialized vocabulary: Ophthalmology referrals often contain specialized medical jargon that is difficult for a base DL model to interpret. A base NLP model is typically trained on a very large general-purpose body of text to provide a sound fundamental understanding of English vocabulary and grammar. Medical jargon, in particular the terms used in ophthalmology, unfortunately falls outside the scope of common English language. For example, a base model which can differentiate left and right would have great difficulty recognizing the Latin abbreviations “OU” or “OD” that are commonly found in ophthalmology reports. In AI assisted triaging, there is a need to train a specialized language model that understands the language of medicine, and not just common English. Building a specialized language model is a very time-consuming and resource-heavy process that would require collection of, and subsequent training on, millions of unlabelled documents.

The Future

In the digital age, attempts to integrate AI into clinical practice will continue to stay at the forefront of ophthalmic research. AI assisted triaging in ophthalmology has demonstrated early potential in discriminating between urgent vs. non-urgent referrals. The South Australian Institute of Ophthalmology (SAIO), in collaboration with the Australian Institute for Machine Learning (AIML), is conducting further derivation tests on an expanded dataset. The ongoing development involves training an off-the-shelf document classification model, leveraging a large pre-trained DistilBERT model for language understanding. Interim analysis based on an expanded sample size of 1000 referrals showed promising results, with improved ability to discriminate between multiple triage categories. Validation accuracies of up to 80% were obtained when triaging referrals to either emergency (within 24 hours), category 1 (within 4 weeks) or category 2 (within 1 year).
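A minimal sketch of such a DistilBERT classifier with the Hugging Face transformers library; the checkpoint, label mapping and example referral are illustrative, and the fine-tuning loop on labelled referrals is omitted:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)   # emergency / category 1 / category 2

referral = "Sudden painless loss of vision in the right eye since this morning."
inputs = tokenizer(referral, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(-1))   # class probabilities (classification head untrained here)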
The use of a small database remains the greatest limitation of the preliminary pilot study. Future research relating to AI assisted triaging should endeavour to use larger sample sizes,
consultant-level triage allocation, and data from multiple centres. SAIO is currently building an expanded referral database to allow further ML testing to be conducted. A particular aim is to improve the efficiency of the data capture process by incorporating a text extraction function within the AI algorithm. This will allow digital referrals in various formats to be inputted directly into the model without the need for human processing. If the preliminary results are validated in a subsequent derivation study, SAIO hopes to eventually conduct randomized controlled trials to test the accuracy of AI assisted triaging against the gold standard of triage by a consultant ophthalmologist.

References

1. Central Adelaide Local Health Network. Ophthalmology outpatient service information, triage and referral guideline. In: Ophthalmology. Vol 0.1. SA Health; 2018.
2. The Royal Australian and New Zealand College of Ophthalmologists. Referral pathway for AMD management. In: RANZCO. 2020.
3. The Royal Australian and New Zealand College of Ophthalmologists. Referral pathway glaucoma management. In: RANZCO. 2019.
4. The Royal Australian and New Zealand College of Ophthalmologists. Patient screening and referral pathway guidelines for diabetic retinopathy (including diabetic maculopathy). In: RANZCO. 2019.
5. Optometrists Association Australia. Eye health referral guidelines. In: Optometry Australia. 2020.
6. Patel C, Rosen P, Hornby S, Mahalingham N, Hayles S, Stocker T. Referral guideline ophthalmology overview. In: NHS Oxfordshire Clinical Commissioning Group; 2018.
7. Raman R, Srinivasan S, Virmani S, Sivaprasad S, Rao C, Rajalakshmi R. Fundus photograph-based deep learning algorithms in detecting diabetic retinopathy. Eye (Lond). 2019;33(1):97–109.
8. Li F, Wang Z, Qu G, Song D, Yuan Y, Xu Y, Gao K, Luo G, Xiao Z, Lam DSC, Zhong H, Qiao Y, Zhang X. Automatic differentiation of glaucoma visual field from non-glaucoma visual field using deep convolutional neural network. BMC Med Imaging. 2018;18(1):35.
9. Yoon J, Han J, Park JI, Hwang JS, Han JM, Sohn J, Park KH, Hwang DD. Optical coherence tomography-based deep-learning model for detecting central serous chorioretinopathy. Sci Rep. 2020;10(1):18852.
10. Tan Y, Bacchi S, Casson RJ, Selva D, Chan W. Triaging ophthalmology outpatient referrals with machine learning: a pilot study. Clin Exp Ophthalmol. 2020;48(2):169–73.
20 Deep Learning Applications in Ocular Oncology

T. Y. Alvin Liu and Zelia M. Correa
Retina Division, Wilmer Eye Institute, Johns Hopkins University, Baltimore, MD, USA

Ocular oncology is a subspecialty within ophthalmology that primarily focuses on the diagnosis and treatment of intraocular malignancies and ocular surface malignancies. Examples of intraocular malignancy include retinoblastoma, uveal melanoma and metastatic tumor from systemic dissemination. Examples of ocular surface malignancy include conjunctival squamous cell carcinoma, conjunctival melanoma and conjunctival lymphoma. Currently, deep learning (DL) is the cutting-edge machine learning technique for medical image analysis. However, given the relative rarity of intraocular and ocular surface malignancies and the need for a large amount of data in order to train a deep learning system (DLS), only two studies have been published to date discussing the application of DL in the field of ocular oncology. In this chapter, we will highlight these two publications that focus on uveal melanoma (UM), and discuss the future directions of DL applications in ocular oncology.

UM (Fig. 20.1) is the most common primary intraocular malignancy in adults [1]. This neoplasm arises from the uveal tract, which includes the iris, ciliary body and choroid. Although UM is relatively rare in the United States, with an incidence of 5.2 cases per million [2], it is a potentially lethal disease, with an overall 5-year survival rate of 80.9% that has not improved in the past four decades [2], despite the advancement in local treatment, which typically involves plaque brachytherapy and proton beam irradiation.

Fig. 20.1 A choroidal melanoma located at the inferotemporal edge of the macula

In the first published study on DL, Sun et al. [3] used DL techniques to detect the expression of BRCA1-associated protein 1 (BAP1) in histopathology slides. The BAP1 protein is involved in tumor suppression and is produced by the BAP1 gene, which is located on chromosome 3p21.1 [4]. Inactivating mutations of the BAP1 gene have been shown to increase the metastatic potential of UM [5] and to be present in 81–84% of metastatic UM tumors [6–8]. In this study, 47 enucleated eyes from 47 patients were included.
The paraffin blocks were cut into 4-micron-thick sections that were incubated with mouse monoclonal antibodies against BAP1 and counterstained with hematoxylin eosin (H&E). The resultant glass slides were digitally scanned, and the region of interest containing the UM tumor was further cropped into numerous 256 × 256 pixel image patches. In total, 8176 histopathology image patches were generated in this fashion. Each image patch was annotated twice by an ophthalmic pathologist, who established the ground truth, and each image patch was classified as one of the four categories: positive (positive for BAP1 expression, 2576 patches), negative (negative for BAP1 expression, 4720 patches), blurred (too blurred to be classified, 560 patches) and excluded (lack of UM cells, 320 patches). The 8176 image patches were randomly split into a training (6800 image patches) and testing (1376 image patches) subset. The authors applied transfer learning to a pre-trained DenseNet-121 network, and achieved a sensitivity of 97.09%, specificity of 98.12%, and overall accuracy of 97.10% in predicting nuclear BAP1 expression.
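The transfer-learning step can be sketched as follows with torchvision; the head replacement mirrors the four patch categories above, while the weight choice and the decision to freeze the trunk are assumptions, as these details are not specified here:

import torch.nn as nn
from torchvision import models

model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, 4)
# output classes: positive / negative / blurred / excluded

for p in model.features.parameters():   # optionally freeze the pretrained trunk
    p.requires_grad = False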
While this study represents the first DL study in the field of ocular oncology, the resultant DLS ultimately only emulates what a human pathologist is capable of performing—identifying images with positive BAP1 staining. In addition, the reported methodology has limited clinical practicality, as it requires histopathology slides obtained from enucleated eyes, and the current standard of care for most eyes with UM is globe-preserving local therapy with either plaque brachytherapy or proton beam irradiation.

In the second study, our group (Liu et al. [9]) extended the application of DL techniques in digital pathology slide analysis in UM, and aimed to train a DLS to perform a task that is impossible for a human pathologist—predicting the gene expression profile (GEP) in smeared slides stained with H&E alone, obtained from fine needle aspiration biopsy (FNAB) of uveal melanomas. The underlying hypothesis is that cancer cell morphology reflects the underlying genetics, and that careful analysis of cytopathology images will provide helpful prediction of the biological behavior of the tumor and the clinical course of the patient. UM is unique among malignancies in that GEP, independent of other clinicopathological parameters, has been shown to be the most robust method currently available to predict long-term metastasis risk and survival. UM patients can be divided into two classes by GEP: class 1 and class 2, and there is a stark contrast in long-term survival between the two classes—the 92-month survival probability in class 1 patients is 95% vs. 31% in class 2 patients [10, 11].

In this study, 20 de-identified FNAB cytology slides from 20 patients with UM underwent H&E staining. Whole-slide scanning was performed for each cytology slide at a magnification of 40×, and native-resolution crops containing melanoma cells were saved. Each snapshot image measured 1716 pixels (width) × 926 pixels (height), and was further split into eight tiles of equal size. The tiles were then screened and selected for further processing only if at least one melanoma cell was present. Typically, each slide generated hundreds of 40× snapshot images, and out of the 20 slides, a total of 26,351 unique image tiles were generated. Schematic representation of the data processing is shown in Fig. 20.2.

Fig. 20.2 Schematic representation of data processing. (Top panel) Whole-slide scanning; one slide per patient. (Middle panel) Snapshot image manually captured at 40×; multiple 40× images were captured from each slide. (Bottom panel) Each 40× image was further divided into eight tiles of equal size

The GEP ground truth at the slide level was established by the commercially available DecisionDx-UM® test (Friendswood, Texas), and the GEP label designated to a particular slide was propagated to all the image tiles generated from that slide. That is, if “slide 1” was determined to be GEP class 1 by the DecisionDx-UM® test, then all the image tiles generated from “slide 1” were labeled as “class 1.”
brachytherapy or proton beam irradiation. The authors applied transfer learning to a pre-­
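To make the transfer-learning recipe above concrete, a minimal PyTorch sketch is shown below: a DenseNet-121 backbone pre-trained on ImageNet, with its final layer swapped for a four-way head matching the patch categories. The directory layout, hyperparameters and training loop are illustrative assumptions, not the authors' published code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models, transforms
from torchvision.datasets import ImageFolder

# Pre-trained DenseNet-121; replace the ImageNet classifier with a
# 4-way head: positive / negative / blurred / excluded.
model = models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, 4)

# 256 x 256 patches, normalized with ImageNet statistics.
tfm = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Hypothetical layout: patches/train/<category>/*.png
train_loader = DataLoader(ImageFolder("patches/train", tfm),
                          batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:          # one fine-tuning epoch
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```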
In the second study, our group (Liu et al. [9]) extended the application of DL techniques to digital pathology slide analysis in UM, and aimed to train a DLS to perform a task that is impossible for a human pathologist—predicting the gene expression profile (GEP) from smeared slides stained with H&E alone, obtained by fine needle aspiration biopsy (FNAB) of uveal melanomas. The underlying hypothesis is that cancer cell morphology reflects the underlying genetics, and that careful analysis of cytopathology images will therefore provide helpful prediction of the biological behavior of the tumor and the clinical course of the patient. UM is unique among malignancies in that GEP, independent of other clinicopathological parameters, has been shown to be the most robust method currently available to predict long-term metastasis risk and survival. UM patients can be divided into two classes by GEP: class 1 and class 2, and there is a stark contrast in long-term survival between the two classes—the 92-month survival probability in class 1 patients is 95% vs. 31% in class 2 patients [10, 11].

In this study, 20 de-identified FNAB cytology slides from 20 patients with UM underwent H&E staining. Whole-slide scanning was performed for each cytology slide at a magnification of 40×, and native-resolution crops containing melanoma cells were saved. Each snapshot image measured 1716 pixels (width) × 926 pixels (height), and was further split into eight tiles of equal size. The tiles were then screened and selected for further processing only if at least one melanoma cell was present. Typically, each slide generated hundreds of 40× snapshot images, and out of the 20 slides, a total of 26,351 unique image tiles were generated. A schematic representation of the data processing is shown in Fig. 20.2. The GEP ground truth at the slide level was established by the commercially available DecisionDx-UM® test (Friendswood, Texas), and the GEP label designated to a particular slide was propagated to all the image tiles generated from that slide. That is, if "slide 1" was determined to be GEP class 1 by the DecisionDx-UM® test, then all the image tiles generated from "slide 1" were labeled as "class 1."

Fig. 20.2 Schematic representation of data processing. (Top panel) Whole-slide scanning; one slide per patient. (Middle panel) Snapshot images manually captured at 40×; multiple 40× images were captured from each slide. (Bottom panel) Each 40× image was further divided into eight tiles of equal size

The authors applied transfer learning to a pre-trained ResNet-152 network for the binary classification problem of distinguishing between class 1 and class 2 image tiles. Because of the low amount of data (patient) variation, the choice of validation slides would have a strong effect on model performance, so "leave-one-out" cross-validations were performed to evaluate the DLS's performance. To test each of the 20 slides/patients, 10 models using different training/validation splits were trained. Specifically, for each of the leave-one-out cross-validations, 10 random samplings for the validation subset selection were performed. If "slide 1" was used as the testing slide, then the other 19 slides were used for model development: 17 slides for training and 2 slides for validation (one from class 1 and one from class 2). "Slide 1" was then tested 10 different times by 10 different models that were generated from 10 random and different combinations of training and validation slides. For example, model #1 would use "slide 2" and "slide 11" for validation; model #2 would use "slide 3" and "slide 12"; model #3 would use "slide 4" and "slide 13"; and so on. Eventually, 10 models were generated, and the mean accuracy of these 10 models was obtained.
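The nested resampling scheme described here—hold one slide out for testing, then repeat the training/validation split ten times—can be sketched as follows. The train() and evaluate() callables are hypothetical stand-ins for the actual model-fitting pipeline.

```python
import random

slides = [f"slide {i}" for i in range(1, 21)]   # 20 slides, one per patient
class1, class2 = slides[:10], slides[10:]       # assumed class membership

def repeated_loo(train, evaluate, n_repeats=10, seed=0):
    rng = random.Random(seed)
    mean_accuracy = {}
    for test_slide in slides:                   # leave one slide out
        accs = []
        for _ in range(n_repeats):              # 10 random validation picks
            pool1 = [s for s in class1 if s != test_slide]
            pool2 = [s for s in class2 if s != test_slide]
            val = [rng.choice(pool1), rng.choice(pool2)]   # one per class
            train_set = [s for s in slides
                         if s != test_slide and s not in val]  # 17 slides
            model = train(train_set, val)
            accs.append(evaluate(model, test_slide))
        mean_accuracy[test_slide] = sum(accs) / n_repeats
    return mean_accuracy
```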
If the lower bound of the 95% confidence interval (CI) of these accuracies exceeded 50%, then it was concluded that the GEP of "slide 1" was correctly predicted. This process was repeated for all 20 slides/patients, such that each slide/patient was evaluated 10 times by 10 different models. Out of this cohort of 20 UM patients, the DLS achieved a point estimate of 75% (15/20 patients) accuracy in predicting the GEP at the patient level.
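The per-slide decision rule is easy to state in code. The chapter does not specify exactly how the confidence interval was computed; the sketch below assumes a normal-approximation interval over the ten per-model accuracies, which is one plausible reading, used purely for illustration.

```python
import math

def slide_correct(accuracies, z=1.96):
    """A slide's GEP is deemed correctly predicted when the lower bound
    of the 95% CI of its 10 per-model accuracies exceeds 50%."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in accuracies) / (n - 1))
    lower = mean - z * sd / math.sqrt(n)   # normal-approximation 95% CI
    return lower > 0.5

print(slide_correct([0.72, 0.65, 0.81, 0.70, 0.77,
                     0.69, 0.74, 0.66, 0.79, 0.71]))   # -> True
```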
Given that GEP is the most robust prognostic test available, with a high correlation with survival, the study by Liu et al. [9] showed that survival prognostication could be performed from H&E cytopathology slides alone in UM using DL. However, further work is required, including prospective validation of the DLS with a larger patient sample size, use of actual survival data as the reference standard in DLS development, and combination of DL-based pathology image analysis with other clinical parameters in an ensemble algorithm that incorporates multiple machine learning techniques. In addition, both studies by Sun and Liu required extensive manual data processing to identify regions of interest and to differentiate high-quality image tiles with usable information from low-quality image tiles with artifacts. This approach is time-consuming and labor-intensive, and thus limited in scalability and adoptability. Further research is required to develop novel approaches, such as unsupervised clustering, to enable efficient, large-scale processing of digital ophthalmic pathology data for machine learning purposes.

References

1. Singh AD, Turell ME, Topham AK. Uveal melanoma: trends in incidence, treatment, and survival. Ophthalmology. 2011;118(9):1881–5.
2. Aronow ME, Topham AK, Singh AD. Uveal melanoma: 5-year update on incidence, treatment, and survival (SEER 1973-2013). Ocul Oncol Pathol. 2018;4(3):145–51.
3. Sun M, Zhou W, Qi X, et al. Prediction of BAP1 expression in uveal melanoma using densely-connected deep classification networks. Cancers (Basel). 2019;11(10).
4. Murali R, Wiesner T, Scolyer RA. Tumours associated with BAP1 mutations. Pathology. 2013;45(2):116–26.
5. Stalhammar G, See TRO, Phillips SS, Grossniklaus HE. Density of PAS positive patterns in uveal melanoma: correlation with vasculogenic mimicry, gene expression class, BAP-1 expression, macrophage infiltration, and risk for metastasis. Mol Vis. 2019;25:502–16.
6. Griewank KG, van de Nes J, Schilling B, et al. Genetic and clinico-pathologic analysis of metastatic uveal melanoma. Mod Pathol. 2014;27(2):175–83.
7. Harbour JW, Onken MD, Roberson ED, et al. Frequent mutation of BAP1 in metastasizing uveal melanomas. Science. 2010;330(6009):1410–3.
8. Koopmans AE, Verdijk RM, Brouwer RW, et al. Clinical significance of immunohistochemistry for detection of BAP1 mutations in uveal melanoma. Mod Pathol. 2014;27(10):1321–30.
9. Liu TYA, Zhu H, Chen H, et al. Gene expression profile prediction in uveal melanoma using deep learning: a pilot study for development of an alternative survival prediction tool. Ophthalmol Retina. 2020;
10. Onken MD, Worley LA, Ehlers JP, Harbour JW. Gene expression profiling in uveal melanoma reveals two molecular classes and predicts metastatic death. Cancer Res. 2004;64(20):7205–9.
11. Onken MD, Worley LA, Char DH, et al. Collaborative Ocular Oncology Group report number 1: prospective validation of a multi-gene prognostic assay in uveal melanoma. Ophthalmology. 2012;119(8):1596–603.
21  Artificial Intelligence in Neuro-ophthalmology

Dan Milea and Raymond Najjar

Introduction

Despite the current hype in almost every area of society and medicine, artificial intelligence (AI) is not very new. Machine learning (ML), one of the first AI methods, was described in the 1950s, followed more recently by an extraordinary technical development of computers and processing power leading to the current achievements of deep learning (DL). DL is the current state-of-the-art technique in ML, particularly well adapted for image analysis. DL techniques have allowed new algorithms to make predictions on various outcomes and diagnoses, based on «learning» (training) on large datasets, using either a supervised (labelled data) or unsupervised (unlabelled data) approach. As in other areas of DL, the input data is used for training purposes and needs to be selected according to the highest clinical diagnostic standards, because it represents the «ground truth» or «reference standard». All the results provided by a DL algorithm will be compared to this ground truth. After the training phase, the performance of the algorithm is tested, first via internal validation (cross-validation) and, more importantly, via external testing (in totally novel datasets). The datasets used for training, internal validation and external testing need to be distinct and should not intersect at any time. The performance of an algorithm is expressed as diagnostic accuracy, sensitivity, specificity and area under the receiver operating characteristic curve (AUC), compared to the reference standard. The current, extraordinary boom of AI creates a pressing need for rigorous, controlled, prospective evaluations to demonstrate the impact of AI systems on health outcomes. In response to this need, AI consensus groups have recently elaborated international guidelines regarding AI interventions, covering the instructions and skills required for use of AI systems, considerations for the handling of input and output data, the human–AI interaction, and the analysis of error cases [1].
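As a concrete illustration of these performance measures, the snippet below computes AUC, sensitivity, specificity and accuracy for a binary classifier against a reference standard, using scikit-learn; the arrays are toy values, not data from any study.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])     # reference standard
y_score = np.array([0.92, 0.81, 0.40, 0.25,     # algorithm outputs
                    0.07, 0.55, 0.73, 0.30])

auc = roc_auc_score(y_true, y_score)            # threshold-free

y_pred = (y_score >= 0.5).astype(int)           # one operating point
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                    # true positive rate
specificity = tn / (tn + fp)                    # true negative rate
accuracy = (tp + tn) / len(y_true)

print(f"AUC={auc:.2f}, Se={sensitivity:.2f}, "
      f"Sp={specificity:.2f}, Acc={accuracy:.2f}")
```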
Numerous DL models have already been successfully trained in various areas of medicine (dermatology, radiology, pathology, etc.) to accurately identify and classify abnormal images for disease prediction. This approach has been particularly successful in ophthalmology, for automated identification on retinal fundus images of diabetic retinopathy (DR), retinopathy of prematurity, other retinopathies, glaucoma, etc. [2]. To date, two algorithms have been authorized by the FDA for detection of DR on retinal fundus images (IDx-DR, Coralville, and EyeArt, EyeNuk), based on pivotal prospective studies [3]. Deep learning methods have been successfully applied to other retinal imaging modalities, i.e. optical coherence tomography (OCT), for identification of diabetic macular edema, glaucoma, etc. More interestingly, recent «machine to machine» learning techniques have allowed deep learning to directly predict OCT parameters (i.e. retinal nerve fiber layer thickness, or diabetic macular edema grade) using monoscopic retinal fundus photographs, with higher specificity than the prediction provided by doctors.

Deep learning algorithms have been deployed on fundus images for the detection of the most common optic neuropathy, glaucoma, based on the large numbers of optic disc images available for this condition. The optic disc features in glaucoma are relatively specific (increased cup/disc ratio, notching, etc.) and differ from the abnormal features in patients affected by other optic neuropathies. The initial studies for optic disc classification in glaucoma provided good results, despite several methodological limitations inherent to the nature of the "reference standard". Indeed, the reference standard was often established on subjective, post-hoc assessments of the optic discs, performed by randomly selected ophthalmologists/graders, and not on clinical information obtained in the native datasets. In order to circumvent this limitation, a recent, very elegant study used a deep learning algorithm trained on images paired with objective reference standards (i.e. average retinal nerve fiber layer (RNFL) thickness values), and subsequently applied to stereoscopic optic disc images, a technique called "machine to machine" [4]. In other words, this algorithm was able to predict and quantify the retinal neuronal loss at the optic disc based only on analysis of retinal images. Several other recent studies have suggested that modern deep learning systems can achieve high sensitivity, specificity, and generalizability for detecting glaucoma on retinal images alone, in a cost-effective and time-efficient manner. Only very few studies, mostly based on computer-based analysis, have aimed to automatically detect "non-glaucomatous", or "neuro-ophthalmic", optic disc abnormalities, given the relative scarcity of these conditions. Indeed, deep learning algorithms are notoriously dependent on large training datasets, making this approach particularly difficult in Neuro-Ophthalmology.

Achievements of AI in Neuro-ophthalmology

Early computer-aided diagnostic systems have aimed to automatically detect neuro-ophthalmic optic disc abnormalities on retinal fundus images [5], including papilledema, achieving good results, with high accuracy and substantial agreement with the Frisen severity classification provided by expert neuro-ophthalmologists [6]. Neuro-ophthalmic abnormalities affecting the optic discs (i.e. optic disc swelling in inflammatory/ischemic/compressive optic neuropathies, papilledema associated with intracranial hypertension, optic nerve head drusen, optic atrophy in chronic optic neuropathies, etc.) are however rare conditions, explaining the scarce published literature in this field.

In 2020, an international Neuro-Ophthalmology consortium (BONSAI—Brain and Optic Nerve Study with Artificial Intelligence) published the results obtained with a dedicated deep-learning system aiming to discriminate optic discs with papilledema, normal discs, and discs with nonpapilledema abnormalities [7]. Globally, this large retrospective collaborative study included multi-ethnic populations from 24 neuro-ophthalmology sites in 15 countries on three continents, using fundus images obtained with a large set of fundus camera brands. The training and validation data sets
from 6779 patients included 14,341 photographs of abnormal and normal optic discs; external validation testing was performed in 1505 images from five separate international centers. In this retrospective in silico dataset, papilledema detection (Fig. 21.1) was achieved with a high AUC (0.96), high sensitivity (96%) and good specificity (85%). The first results of this ML-based international study suggest that future computer programs examining digital funduscopic images may classify disorders of the optic discs in clinical situations which are notoriously difficult for non-ophthalmologists to manage.

Fig. 21.1 Example of class activation probability (heatmap), displayed on an optic disc image with confirmed papilledema

The next natural question was whether such DL algorithms could perform better than humans in predicting optic nerve conditions on retinal photographs. For this purpose, a recent study compared the diagnostic performance of two expert neuro-ophthalmologists with that of the BONSAI algorithm for discriminating various optic disc abnormalities [8]. Unsurprisingly, the algorithm needed a significantly shorter time (25 s) than the included neuro-ophthalmology experts to classify a sample of 800 optic disc images. On the same large sample, the performance of the deep learning system at classifying optic disc abnormalities was at least as good as that of the two expert neuro-ophthalmologists (the system correctly classified 85% of photographs, compared with 80–84% by the two experts). It is however important to note that these evaluations were performed purely on fundoscopic images, without taking into account other clinical signs (visual loss, headache, obscurations, tinnitus, etc.), which are of paramount importance in everyday clinics. Therefore, prospective, real-life studies are needed to validate DL algorithms as potential diagnostic aids in relevant clinical settings, at best applied to non-mydriatic images, using convenient (possibly handheld) cameras.

Other areas of Neuro-Ophthalmology (i.e. eye movement disorders, visual fields, pupil recordings, multimodal imaging modalities, genetic mutations, etc.) have been only rarely explored with AI. It is probable that the low data availability in relatively rare neuro-ophthalmic conditions may explain this gap. Multicentric collaborations involving methodologists and computer engineers should lead in the future to new AI developments in these areas.

Summary

In summary, AI still has limited applications in Neuro-ophthalmology, mainly focusing on detection of optic nerve head abnormalities in various neuro-ophthalmic conditions. The ability of AI to accurately detect abnormal optic discs may be implemented, if further validated in real-life situations, for patients' triaging in non-ophthalmic set-ups. Future prospective studies including large clinical datasets are needed to establish whether such systems may become a safe, effective, attractive and affordable solution for non-ophthalmologists. Other clinical neuro-ophthalmic features and neuro-ophthalmic disorders (in particular those involving eye movements/pupils) are currently underexplored with artificial intelligence, despite being well adapted for such investigations. However, AI could be a very helpful tool in the future, especially if associated with Tele-Neuro-Ophthalmology, in the current COVID-19 era.
References

1. Cruz Rivera S, Liu X, Chan AW, Denniston AK, Calvert MJ, SPIRIT-AI and CONSORT-AI Working Group; SPIRIT-AI and CONSORT-AI Steering Group; SPIRIT-AI and CONSORT-AI Consensus Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med. 2020;26(9):1351–63.
2. Ting DSW, et al. Deep learning in ophthalmology: the technical and clinical considerations. Prog Retin Eye Res. 2019;72:100759.
3. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit Med. 2018;1:39.
4. Medeiros FA, Jammal AA, Thompson AC. From machine to machine: an OCT-trained deep learning algorithm for objective quantification of glaucomatous damage in fundus photographs. Ophthalmology. 2019;126(4):513–21.
5. Milea D, Singhal S, Najjar RP. Artificial intelligence for detection of optic disc abnormalities. Curr Opin Neurol. 2020a;33(1):106–10.
6. Echegaray S, et al. Automated analysis of optic nerve images for detection and staging of papilledema. Invest Ophthalmol Vis Sci. 2011;52:7470–8.
7. Milea D, Najjar RP, Zhubo J, Ting D, Vasseneix C, Xu X, Aghsaei Fard M, Fonseca P, Vanikieti K, Lagrèze WA, La Morgia C, Cheung CY, Hamann S, Chiquet C, Sanda N, Yang H, Mejico LJ, Rougier M-B, Kho R, Thi Ha Chau T, Singhal S, Gohier P, Clermont-Vignal C, Cheng C-Y, Jonas JB, Yu-Wai-Man P, Fraser CL, Chen JJ, Ambika S, Miller NR, Liu Y, Newman NJ, Wong TY, Biousse V, BONSAI Group. Artificial intelligence to detect papilledema from ocular fundus photographs. N Engl J Med. 2020b;382(18):1687–95.
8. Biousse V, Newman NJ, Najjar RP, Vasseneix C, Xu X, Ting DS, Milea LB, Hwang JM, Kim DH, Yang HK, Hamann S, Chen JJ, Liu Y, Wong TY, Milea D, BONSAI (Brain and Optic Nerve Study with Artificial Intelligence) Study Group. Optic disc classification by deep learning versus expert neuro-ophthalmologists. Ann Neurol. 2020. https://doi.org/10.1002/ana.25839.
22  Artificial Intelligence Using the Eye as a Biomarker of Systemic Risk

Rachel Marjorie Wei Wen Tseng, Tyler Hyungtaek Rim, Carol Y. Cheung, and Tien Yin Wong

Introduction

Artificial intelligence (AI) technology is transforming healthcare. One major feature in ophthalmology is attributable to the ability to directly visualize retinal blood vessels, measuring 100–300 μm in size, via digital colour fundus photographs (CFP). The use of CFP offers a unique and easily accessible opportunity to study the human microcirculation [1] and to characterize systemic diseases that manifest themselves in the retina. Identifying systemic disease risk factors through such means (i.e., via the eye) is an area of emerging research, with major epidemiological studies showing that changes in the retinal vasculature mirror systemic microcirculation changes.

To date, a number of large population-based studies have assessed the association between such retinal microvascular abnormalities and systemic diseases. This includes retinal vascular changes that are commonly found in hypertensive and/or diabetic retinopathy and their association with systemic diseases such as cardiovascular disease (CVD) and neurological diseases [2–8]. One example is that patients with Alzheimer's disease (AD) have retinal features which may be early biomarkers of AD (i.e. narrower retinal vein calibre [9], more tortuous retinal vessels [10], reduced blood flow [11], etc.), which may facilitate the early detection of AD [12]. The calibre of the retinal vessels (widths of retinal arterioles and venules) has been established as a promising biomarker for CVD risk estimation. For example, narrower retinal arterioles are associated with CVD and hypertension, a finding more common in females [13], while wider retinal venules are associated with an increased risk of stroke [14–16] and diabetes [17].

Building on such large-scale epidemiological studies, the application of AI technology, specifically deep learning (DL), to CFPs is advancing new research directions in retina-systemic disease relationships. This review provides a comprehensive summary of the various systemic-related outcomes that can be predicted via CFP using AI-DL technology. The current studies fall into two basic groups: (1) cross-sectional studies that use AI-DL technology on CFP to detect or estimate systemic risk factors (e.g., age, blood pressure, smoking) or other biomarkers (e.g., coronary artery calcium) (Fig. 22.1); (2) longitudinal studies that use AI-DL technology on CFP to predict the incidence or risk of systemic disease (e.g., cardiovascular disease events or mortality).

Fig. 22.1 Framework for artificial intelligence to evaluate systemic disease via the eye. Cross-sectional associations: detection/estimation of systemic and lifestyle risk factors (e.g., age, sex, hemoglobin, smoking status); replacement of existing biomarkers (e.g., coronary artery calcium, retinal vessel calibre); detection of specific systemic diseases (e.g., chronic kidney disease, anaemia, diabetic peripheral neuropathy). Longitudinal associations: prediction of future outcomes (e.g., all-cause mortality, cardiovascular mortality, cardiovascular events)

Cross-Sectional Studies

Prediction of Demographic and Lifestyle Factors

Figure 22.1 shows the framework for using AI-DL on CFPs to evaluate systemic diseases. Table 22.1 illustrates the various studies in which AI-DL on CFP has been used for the prediction of demographic and lifestyle factors. Among the identified studies, most investigated age as a variable predictable from CFP via AI-DL. Chronological age is the most reliable at portraying growth milestones accurately [19], and the retina is considered the "window" to the whole body. Therefore, predicting age from CFP via AI-DL could provide valuable information about the status of a target organ and/or the body [24]. In addition to age as a predictor, the ability to identify sex with high confidence from CFP via AI-DL has been demonstrated in similar studies. For example, Rim et al. [20] showed fair results in their external multi-ethnic test sets for both age (coefficient of determination, R2 = 0.36–0.63) and sex (area under the curve [AUC] = 0.80–0.91) predictions, demonstrating reasonable generalizability in predicting sex and age from CFPs. Nonetheless, it still remains unclear which part of the retina contributes to the prediction of sex.

In terms of lifestyle factors, smoking status is commonly assessed because of the direct link between CVD and smoking habits. The effect of smoking on the retinal vasculature was previously reported, with studies showing that cigarette smoking was linked with a wider retinal venular caliber [25, 26]. Other studies have also demonstrated that one's smoking status is associated with CVD due to the dual effects on both the retinal and systemic circulation, as suggested by visible changes in the retinal vascular structure [26–28]. Along with the fair results obtained from the internal test sets of three unique studies (AUC = 0.71–0.86) [18, 21, 23], this allowed ophthalmic researchers to conclude with good confidence that smoking status was predominantly predicted using the retinal vessels via AI-DL.

Prediction of Body Composition Factors

Table 22.2 summarizes the three main studies that used AI-DL to predict body composition factors based on CFP. The association between an increased body-mass index (BMI) and mortality, in the form of stroke [29], cancer [30, 31], etc., has been established because BMI is a common measure of adiposity [32]. However, just like many other systemic factors, BMI prediction from CFP via AI-DL is not suitable for clinical application yet. This is due to the great variability in mean absolute error (MAE) with a low generalisability across the ethnic groups in cohort studies [18, 20].
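The general recipe behind these prediction studies—a convolutional backbone with a single-output regression head, judged by MAE and the coefficient of determination—can be sketched briefly. The backbone (ResNet-50) and L1 loss below are arbitrary illustrative choices; the cited studies each used their own architectures.

```python
import numpy as np
import torch.nn as nn
from torchvision import models

# CNN backbone with a one-unit regression head (e.g., age in years).
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 1)
loss_fn = nn.L1Loss()   # L1 loss optimizes mean absolute error directly

def mae_and_r2(y_true, y_pred):
    """The two metrics reported throughout Tables 22.1 and 22.2."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.abs(y_true - y_pred).mean()
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return mae, 1.0 - ss_res / ss_tot

print(mae_and_r2([62, 45, 71, 58], [60.1, 48.9, 69.5, 55.2]))
```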
Table 22.1  Applications of AI in determining demographic and lifestyle factors

Demographic factors
- Age
  - Poplin et al., 2018 [18]: internal test set MAE = 3.26 years (CI = 3.22–3.31), R2 = 0.74; external test set NA
  - Kim et al., 2020 [19]: internal, overall: MAE = 3.06 years, R2 = 0.92; subgroup with hypertension (HTN): MAE = 3.46 years, R2 = 0.74; subgroup with diabetes (DM): MAE = 3.55 years, R2 = 0.75; subgroup with smokers: MAE = 2.65 years, R2 = 0.86; external NA
  - Rim et al., 2020 [20]: internal MAE = 2.43 years, R2 = 0.83; external MAE = 3.38–4.50 years, R2 = 0.36–0.63
  - Gerrits et al., 2020 [21]: internal MAE = 2.78 years, R2 = 0.89; external NA
  - Zhu et al., 2020 [22]: internal NA; external MAE = 3.50 years, R2 = 0.83
- Sex
  - Poplin et al., 2018 [18]: internal AUC = 0.97 (CI = 0.966–0.971); external NA
  - Kim et al., 2020 [19]: internal, overall: AUC = 0.97; HTN: AUC = 0.96; DM: AUC = 0.96; smokers: AUC = 0.98; external NA
  - Rim et al., 2020 [20]: internal AUC = 0.96, accuracy = 0.91; external AUC = 0.80–0.91, accuracy = 0.70–0.85
  - Gerrits et al., 2020 [21]: internal AUC = 0.97 (CI = 0.96–0.98), accuracy = 0.93; external NA
Environmental factors
- Smoking status
  - Poplin et al., 2018 [18]: internal AUC = 0.71 (CI = 0.70–0.73); external NA
  - Vaghefi et al., 2019 [23]: internal AUC = 0.86, accuracy = 88.9%; external NA
  - Gerrits et al., 2020 [21]: internal AUC = 0.78, accuracy = 0.81; external NA

AUC area under the receiver operating characteristic curve, CI confidence interval, MAE mean absolute error, NA results not available

Recently, researchers discovered that body muscle mass is a more reliable measure of cardiometabolic risk than BMI. Using body muscle mass as a variable allowed for the detection of an age-related condition reflecting skeletal muscle loss, called sarcopenia [20]. Rim et al. developed an AI-DL model that enabled the quantification of muscle mass from CFP. The MAE (6.09 kg) was high and the coefficient of determination (R2 = 0.33) was low in the external testing set, reiterating the need for further validation studies before assessing whether CFP could be used as an alternative screening tool for sarcopenia.
Table 22.2  Applications of AI in determining body composition factors

Body composition factors
- Body mass index (BMI)
  - Poplin et al., 2018 [18]: internal MAE = 3.29 units (CI = 3.24–3.34), R2 = 0.13; external NA
  - Rim et al., 2020 [20]: internal MAE = 2.15 kg/m2 (CI = 2.12–2.19), R2 = 0.17; external MAE = 2.37–3.52 kg/m2, R2 = 0.01–0.14
- Body muscle mass
  - Rim et al., 2020 [20]: internal MAE = 5.11 kg (CI = 5.04–5.19), R2 = 0.52; external MAE = 6.09 kg (CI = 5.96–6.23), R2 = 0.33
- Height
  - Rim et al., 2020 [20]: internal MAE = 5.20 cm (CI = 5.13–5.28), R2 = 0.42; external MAE = 5.48–7.09 cm, R2 = 0.08–0.28
- Body weight
  - Rim et al., 2020 [20]: internal MAE = 7.69 kg (CI = 7.57–7.81), R2 = 0.36; external MAE = 8.28–11.81 kg, R2 = 0.04–0.19
- Relative fat mass/percentage body fat
  - Gerrits et al., 2020 [21]: internal MAE = 5.68 units, R2 = 0.43; external NA
  - Rim et al., 2020 [20]: internal MAE = 4.71 kg (CI = 4.64–4.78), R2 = 0.23; external MAE = 4.50 kg (CI = 4.39–4.60), R2 = 0.08

CI confidence interval, MAE mean absolute error, NA results not available

Prediction of Neurological Diseases

The retina and the brain share a special relationship, since both structures develop from the neural tube and are part of the central nervous system [33]. Embryonic origin aside, visual changes have also been reported among the first symptoms in many patients diagnosed with Alzheimer's disease [33, 34], with some studies showing the aggregation of amyloid beta monomers in the retina of these patients [35]. In the realm of AI-DL, compared to other major body systems, the limited number of published studies suggests that there is much room to explore the relationship between the brain and the retina. Existing studies vary in the type of ocular imaging technologies used to explore the association between neurological diseases and the retina [35–37]. In particular, Lim et al. evaluated the potential of an AI-DL model for ischemic stroke risk assessment from CFP, and this resulted in a varying AUC of 0.685–0.994 across six different datasets [38]. Additionally, the team also found that retinal vessel calibre could be predictive of ischemic stroke in patients, although there was low generalizability when the model was tested on unseen images [38]. Considering the emerging potential of retinal imaging as a non-invasive strategy, employing AI-DL on CFP could act as opportunistic screening for neurological diseases, and ultimately increase screening adherence rates in the community.

Prediction of Cardiovascular and Circulatory Disorders

Table 22.3 details the AI-DL studies focused on predicting systemic risk factors and specific diseases (e.g., anaemia and hypertension) of the circulatory system. Blood pressure (BP) is an important indicator for CVD, but it is also a means of maintaining homeostasis and therefore fluctuates with body and emotional status. Using the retina as a biomarker instead is preferred because the retina shows accumulated damage due to high blood pressure and fluctuates comparably less than conventional BP measurements, rendering it a more stable marker.
Table 22.3  Applications of AI in predicting systemic diseases involved with the circulatory system

Hypertension (HTN) or related biomarkers
- Blood pressure (systolic = SBP, diastolic = DBP), mmHg
  - Poplin et al., 2018 [18]: internal SBP MAE = 11.23 mmHg (CI = 11.18–11.51), R2 = 0.36; DBP MAE = 6.42 mmHg (CI = 6.33–6.52), R2 = 0.32; external NA
  - Rim et al., 2020 [20]: internal SBP MAE = 9.29 mmHg (CI = 9.16–9.43), R2 = 0.31; DBP MAE = 7.20 mmHg (CI = 7.09–7.30), R2 = 0.35; external SBP MAE = 10.55–13.95 mmHg, R2 = 0.17–0.21; DBP MAE = 7.14–8.09 mmHg, R2 = 0.16–0.27
  - Gerrits et al., 2020 [21]: internal SBP MAE = 8.96 mmHg, R2 = 0.40; DBP MAE = 6.84 mmHg, R2 = 0.24; external NA
- Hypertension
  - Dai et al., 2020 [39]: internal AUC = 0.651; external NA
  - Zhang et al., 2020 [40]: internal AUC = 0.766; external NA
Anaemia or related biomarkers
- Anaemia
  - Mitani et al., 2019 [41]: internal AUC (metadata/fundus/combined) = 0.73/0.87/0.88; AUC = 0.89 (combined, in diabetes subgroup); external NA
- Haemoglobin levels (Hb)
  - Mitani et al., 2019 [41]: internal MAE (metadata/fundus/combined) = 0.73/0.67/0.64 g/dL (CI = 0.72–0.74/0.66–0.68/0.62–0.64), AUC = 0.74/0.87/0.88; external NA
  - Rim et al., 2020 [20]: internal MAE = 0.79 g/dL (CI = 0.78–0.80), R2 = 0.56; external MAE = 0.93–0.98 g/dL, R2 = 0.06–0.33
  - Gerrits et al., 2020 [21]: internal MAE = 0.61%, R2 = 0.34; external NA
- Haematocrit levels
  - Mitani et al., 2019 [41]: internal MAE (metadata/fundus/combined) = 2.10/1.94/1.83% (CI = 2.07–2.13/1.91–1.97/1.80–1.86); external NA
  - Rim et al., 2020 [20]: internal MAE = 2.03% (CI = 2.00–2.06), R2 = 0.57; external MAE = 2.62–2.81%, R2 = 0.09–0.26
- Red blood cell count
  - Mitani et al., 2019 [41]: internal MAE (metadata/fundus/combined) = 0.26/0.26/0.25 × 10^12/L (CI = 0.26–0.27/0.25–0.26/0.25–0.25); external NA
  - Rim et al., 2020 [20]: internal MAE = 0.26 × 10^12/L (CI = 0.25–0.26), R2 = 0.45; external MAE = 0.33–0.37 × 10^12/L, R2 = −0.02–0.14

AUC area under the receiver operating characteristic curve, CI confidence interval, MAE mean absolute error, NA results not available
The results of applying AI-DL to date show that, unlike for other body factors (e.g., height and weight), there is generalisability across the different ethnic groups for blood pressure prediction [6]. However, the R2 value is somewhat low, with ranges of 0.24–0.40. Apart from using blood pressure as a biomarker for hypertension, disease prediction of hypertension was also reported by Dai et al. and Zhang et al., but modest predictability was observed, with a combined AUC of 0.651–0.766.

For the prediction of anaemia and/or related biomarkers, biomarkers including hemoglobin, hematocrit, and red blood cell count were predicted from CFP in three separate studies [20, 21, 41]. Anaemia was also predictable using an AI-DL model developed by Mitani et al., with a modest AUC of 0.88 using a combined model of systemic risk factors and CFPs [41]. Rim et al. tested these hematologic factors in external datasets with varying ethnicities, but generalizability was limited among other ethnic groups [20].

Prediction of Metabolic and Endocrinological Diseases

Table 22.4 details the studies on the endocrine system. Of the different systemic diseases and/or variables that were tested using the AI-DL model, the prediction of biomarkers related to diabetes, including glucose and HbA1c, showed somewhat low predictive performance in both the internal and external test sets.

Testosterone (MAE = 3.76 nmol/L, R2 = 0.54) was predictable from CFP, but Gerrits et al. discovered that the AI-DL model trained to predict testosterone levels was indirectly predicting sex as well. The team additionally found that the performance of the model was affected when it was trained on solely males or females, indicating that sex had an indirect effect on the performance of the model when predicting testosterone [21].

Table 22.4  Applications of AI in predicting systemic diseases involved with the endocrine system

Diabetes or related biomarkers
- Diabetes/blood glucose control
  - Rim et al., 2020 [20]: internal fasting blood glucose MAE = 8.55 mg/dL (CI = 8.40–8.71), R2 = 0.11; external MAE = 10.10 mg/dL (CI = 9.83–10.36), R2 = 0.05
  - Babenko et al., 2020 [42]: internal AUC = 0.702; external NA
- Diabetic peripheral neuropathy
  - Benson et al., 2020 [43]: internal accuracy = 89%; external NA
- Hyperglycemia
  - Zhang et al., 2020 [40]: internal AUC = 0.880; external NA
- HbA1c
  - Poplin et al., 2018 [18]: internal MAE = 1.39% (CI = 1.29–1.50), R2 = 0.09; external NA
  - Rim et al., 2020 [20]: internal MAE = 0.33% (CI = 0.32–0.33), R2 = 0.13; external MAE = 0.35% (CI = 0.34–0.36), R2 = 0.07
  - Gerrits et al., 2020 [21]: internal MAE = 0.61%, R2 = 0.34; external NA
Lipid-related biomarkers
- Dyslipidaemia
  - Zhang et al., 2020 [40]: internal AUC = 0.703; external NA
- HDL cholesterol
  - Rim et al., 2020 [20]: internal MAE = 9.45 mg/dL, R2 = 0.13; external MAE = 9.46 mg/dL, R2 = 0.08
Other biomarkers
- Testosterone
  - Gerrits et al., 2020 [21]: internal MAE = 3.76 nmol/L, R2 = 0.54; external NA

AUC area under the receiver operating characteristic curve, CI confidence interval, HbA1c hemoglobin A1C, HDL high-density lipoprotein, MAE mean absolute error, NA results not available
Apart from systemic risk factors and related biomarkers, endocrine system-related diseases were also reported, including a machine learning system created by Benson et al. to predict diabetic peripheral neuropathy, which showed relatively good performance (AUC = 0.89) [43]. The prediction of other conditions such as dyslipidaemia and diabetes showed moderate predictive performance.

Prediction of Kidney Disease

Among systemic diseases, chronic kidney disease (CKD) is infamously described by the American Society of Nephrology as a silent killer. The traditional way of screening for CKD is invasive and includes collecting serum creatinine levels [44]. As such, screening for CKD is limited and a tough challenge for most communities. Applying AI to CFP for the detection of CKD would not only act as an adjunct screening tool for CKD but could also be a gamechanger in increasing the detection rate and lowering the mortality rate of patients with CKD.

Despite the huge potential impact, few studies have explored the prediction of CKD from CFP. Of note, Sabanayagam et al. predicted CKD (AUC = 0.73) with modest generalisability, and a separate AI-DL system developed by Kang et al. achieved an AUC of 0.81, although no external validation was conducted (Table 22.5). In particular, the performance of the model developed by Sabanayagam et al. was generally stable (AUC > 0.9 for all models) across the different models (i.e. CFP, risk factors, combined) that were trained, suggesting that risk factor information was not required for CKD risk assessment in patients. In the study, CKD was strictly defined as an estimated glomerular filtration rate of less than 60 mL/min/1.73 m2. This estimation was based on factors such as creatinine level, age, sex, and body weight. Given that predicting age and sex via AI-DL on CFP is well-established and highly accurate, this suggests a high potential for using AI-DL on CFP to predict CKD in the future, as an opportunistic screening method that can be integrated into CFP workflows and an eventual replacement for traditional invasive methods.
CKD would not only act as an adjunct screening
tool for CKD but could also be a gamechanger in Table 22.6 details the studies focused on other
increasing the detection rate and lowering the retinal imaging biomarkers. There is strong evi-
mortality rate of patients with CKD. dence from epidemiological studies that changes
Despite the huge potential impact, few studies in the retinal vasculature mirror systemic micro-
have explored the prediction of CKD from circulation changes. However, the process for
CFP.  Of note, Sabanayagam et  al., predicted assessing retinal vascular changes is time-­
CKD (AUC = 0.73) with modest generalisability consuming and requires professional training,
and a separate AI-DL system developed by Kang which has limited the expansion and wider appli-
et  al., achieved an AUC of 0.81 although no cation of these traditional methods in other pri-
external validation was conducted (Table  22.5). mary care settings outside of ophthalmology
In particular, the performance of the model devel- [50]. To address these challenges, semi-­automated

Table 22.5  Applications of AI in predicting systemic diseases involved with the renal system

Chronic kidney disease (CKD) or related biomarkers
- CKD
  - Sabanayagam et al., 2020 [45]: internal AUC (CFP/RF/combined) = 0.911/0.916/0.938; subgroup of DM patients: AUC = 0.889/0.899/0.925; subgroup of HTN patients: AUC = 0.889/0.889/0.918; external AUC (CFP/RF/combined) = 0.733–0.835/0.829–0.887/0.810–0.858
  - Kang et al., 2020 [46]: internal overall AUC = 0.81; AUC = 0.81–0.87 as HbA1c levels increased from <6.5% to >10%; external NA
- Creatinine
  - Rim et al., 2020 [20]: internal MAE = 0.11 (CI = 0.11–0.11), R2 = 0.38; external MAE = 0.11–0.17, R2 = 0.01–0.26

AUC area under the receiver operating characteristic curve, CI confidence interval, MAE mean absolute error, RF risk factor, NA results not available
Table 22.6  Applications of AI in predicting other established imaging biomarkers

Other imaging biomarkers
- Retinal vessel calibre
  - Cheung et al., 2020 [47]: internal ICC = 0.88–0.95; external ICC = 0.69–0.92
- Coronary artery calcification
  - Son et al., 2020 [48]: internal AUC = 0.823–0.832; external NA
- Sonographically confirmed carotid artery atherosclerosis
  - Chang et al., 2020 [49]: internal AUC = 0.713; external NA

AUC area under the receiver operating characteristic curve, ICC intraclass correlation coefficient, NA results not available

Fig. 22.2 Use of SIVA-DLS to assess width of retinal vessels in a more efficient, objective, and quantifiable way. (Panels: input into the SIVA-DLS; heatmap of CRAE score; heatmap of CRVE score)

To address these challenges, semi-automated software, such as the Singapore I Vessel Assessment deep-learning system (SIVA-DLS), was created and applied to CFP, allowing for a more efficient, objective, and quantifiable way of assessing the width of retinal vessels from CFP through the use of heat maps (Fig. 22.2). The SIVA-DLS study reported high intraclass correlation coefficients (0.82–0.95) between the SIVA-DLS and validated human measurements.

Coronary artery calcium (CAC) is a pre-clinical marker of atherosclerosis, a cardiovascular condition that implicates the circulatory system, and is strongly associated with risk of clinical CVD [51]. Measurement of CAC scores has increasingly been used for stratification of CVD risk. Recently, Son et al. created an AI-DL model to predict abnormal CAC from CFP both unilaterally and bilaterally, and the performance (AUC = 0.823–0.832) was promising [48]. In addition, carotid intima-media thickness (CIMT) was measured using ultrasonography, by averaging three measurements made 10 mm proximal to the bifurcation, and was used as the proxy marker for atherosclerosis. Chang et al. developed an AI-DL model to predict carotid artery atherosclerosis, and the model was able to predict sonographically confirmed carotid artery atherosclerosis with an AUC of 0.713 [49]. Both the CAC and CIMT models were not tested externally, and further validation is required prior to assessing the clinical applicability of these models.

Longitudinal Studies

Leveraging the predictive values and cross-sectional outcomes that are estimated using AI-DL and CFP, research in the retinal biomarker field is currently expanding towards predicting future events (Table 22.7). Since applying AI-DL to CFP is sufficient for risk factor prediction of systemic diseases, there is a high possibility that CFP could also be directly associated with, and therefore a good predictor of, the incidence of CVD events.
Table 22.7  Applications of AI in predicting longitudinal outcomes of systemic diseases

CVD events
- Poplin et al., 2018 [18]. Ground truth (data): age, sex, SBP, BMI, smoking status. Dataset: UK Biobank. Results: AUC = 0.70 (CI = 0.648–0.740)
- Cheung et al., 2020 [47]. Ground truth: retinal vessel calibre (the SEED study). Datasets: SEED, SP2, Dunedin, HKCES, AHES, RICP, IRED, CUHK-STDR, GUSTO, SiDRP, CVD screening study, BES, UK Biobank, KSH, Austin Health study. Results: after adjusting for factors such as demographic factors, BMI and total cholesterol level, the team found that narrower CRAE in zone B (HR = 1.12/OR = 1.88) and narrower CRAE in zone C (HR = 1.13/OR = 1.67) were independently associated with incident CVD events and 10-year all-cause mortality in the SEED study and BES study, respectively (HR values for the SEED study, OR values for the BES study)

CVD mortality
- Chang et al., 2020 [49]. Ground truth: carotid artery atherosclerosis. Dataset: HPC-SNUH (Health Promotion Center of Seoul National University Hospital, Korea). Results: AUC = 0.713 for predicting carotid artery atherosclerosis; patients with a deep-learning fundoscopic atherosclerosis score (DL-FAS) greater than 0.66 had a significantly higher risk of CVD mortality than those with a score below 0.33 (HR = 8.33); the risk association between DL-FAS and CVD mortality was significant in patients with intermediate to high Framingham risk scores

All-cause mortality
- Zhu et al., 2020 [22]. Ground truth: age. Dataset: UK Biobank. Results: the correlation between retinal age and chronological age (retinal age gap) was strong (0.83, P < 0.001, MAE = 3.50 years); Cox regression models showed that as the difference between the age predicted from CFP and chronological age increased, the mortality risk increased by 2% (HR = 1.02)

SEED Singapore Epidemiology of Eye Diseases, SP2 Singapore Prospective Study Program, HKCES Hong Kong Children Eye Study, AHES Australian Heart Eye Study, RICP Retinal Imaging in Chest Pain study, IRED Retinal Imaging in Renal Disease study, CUHK-STDR CUHK Sight-Threatening Diabetic Retinopathy study, GUSTO Growing Up in Singapore Towards Healthy Outcomes birth cohort, SiDRP Singapore Integrated Diabetic Retinopathy Program, CVD cardiovascular disease screening using retinal vascular imaging study, BES Beijing Eye Study, KSH Kangbuk Samsung Health study, HR hazard ratio, BMI body mass index, OR odds ratio, MAE mean absolute error
Recent work includes survival analysis for risk stratification of incident CVD events and mortality, using the predicted probability of CVD occurrence from CFP at baseline. Current efforts extend the conventional Cox proportional hazards model by using deep features of CFP as the input to an optimal predictive model. In the past, variables such as age, sex, socioeconomic status, and other CVD risk factors were manually or statistically selected for survival analysis; presently, the Cox model can be generated using deep features of CFP that are learned from the association between CFP and various risk factors via DL. Apart from this hybrid model, other papers have used different methods, ranging from neural networks to machine learning techniques. This includes Cox-nnet [52], DeepSurv [53], and Nnet-survival [54]. Currently, no studies have applied these recent networks to CFP to predict systemic-related event outcomes using time-series data, suggesting that there is room to explore the relationship between the incidence of systemic disease outcomes and the retina.
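A minimal sketch of this hybrid survival modelling, assuming the lifelines library and an invented DataFrame in which retinal_score stands in for a deep feature derived from CFP (for example, a predicted CVD probability or a retinal age gap):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical cohort: one row per participant.
df = pd.DataFrame({
    "time_years":    [5.0, 3.2, 7.1, 2.4, 6.8, 4.5, 8.0, 1.9],
    "event":         [0, 1, 0, 1, 0, 1, 0, 1],   # 1 = CVD event / death
    "age":           [61, 70, 55, 74, 66, 57, 52, 77],
    "sex_male":      [1, 0, 1, 1, 0, 0, 1, 0],
    "retinal_score": [0.21, 0.74, 0.12, 0.88, 0.66, 0.38, 0.18, 0.91],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time_years", event_col="event")
cph.print_summary()   # hazard ratios, e.g. HR per unit of retinal_score
```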
Of the various studies that investigate the applications of AI-DL in predicting longitudinal outcomes of systemic diseases, a few should be mentioned. Poplin et al. [18] predicted CVD risk factors from CFP via AI-DL and thereafter used the results to predict major adverse cardiac events (MACE) over 5 years in the UK Biobank. The performance of the AI-DL model was similar to that of the European Systematic COronary Risk Evaluation (SCORE) risk calculator [18]. Another study by Chang et al. used proxy markers, such as CIMT and the existence of carotid artery plaque, to train an AI-DL model to predict atherosclerosis [49]. The study demonstrated that the retinal biomarker was significantly associated with an increased risk, represented by the hazard ratio, of CVD mortality after adjusting for the Framingham risk score (FRS). In addition, Cheung et al. looked at CVD and its association with retinal vessel calibre [47]. The team found that a narrower central retinal arteriolar equivalent measured by SIVA-DLS was associated with incident CVD and all-cause mortality in two prospective cohorts. Lastly, Zhu et al. used data from the UK Biobank to demonstrate that the difference between one's chronological age and the age predicted from CFP (the retinal age gap) independently predicted one's mortality risk, and that the mortality risk was positively associated with the retinal age gap. These studies show the potential of CFP as a screening tool for risk stratification and delivery of tailored interventions.

Areas of Future Research

When assessing the performance of AI-DL models, the predictability and generalisability of the AI-DL model must be evaluated appropriately. Predictability refers to how accurate the AI-DL model is at predicting the desired result. The AI-DL model that predicts age from CFP is an example, given its relatively high coefficient of determination in comparison to the somewhat low coefficient of determination produced by the AI-DL model that predicts BP. There are no specific guidelines or cut-off values for the coefficient of determination, but the predictability of AI-DL systems on specific biomarkers can be inferred from the performance of these systems on internal test sets. Predictability, however, does not guarantee generalisability. In terms of generalisability, it is crucial to determine how well the AI-DL model will perform in different clinical settings and in different populations with varying ethnicities. Rim et al. demonstrated this by testing their AI-DL models on external multi-ethnic testing datasets. In this case, the AI-DL model that predicts BP from CFP demonstrates good generalisability even though its coefficient of determination is not as high as for age prediction. Taking into account these two performance indices when developing AI-DL models would substantiate the use of retinal markers as surrogate markers and proxies for conventional markers in the future.
Additional challenges that could affect the performance of AI-DL models on CFP include the limited number of datasets available for systemic diseases, the adoption of retinal examinations into CVD guidelines, and gaining acceptance from physicians, patients, and the public. In particular, the collation of CFP datasets according to systemic diseases is difficult given the division of specialties, with ophthalmology commonly being separated from other departments. The division of specialties also makes the incorporation of retinal examinations into other clinical pathways, and therefore into CVD guidelines, more challenging. Additionally, when a new AI-DL system is created and has the potential of being adopted in clinical practice, it is imperative for researchers to prove, and simultaneously convince different stakeholders, including physicians, patients, and the public, that the specific AI-DL model benefits healthcare systems, addresses current unmet needs, and would add substantial value to existing clinical settings and technologies prior to its employment in real-world settings.

Conclusion

This review presents systemic disease factors that have been predicted from the retina using CFP (Fig. 22.1). Various studies have demonstrated the versatility of AI-DL in assessing systemic diseases and risk factors through the use of CFP. Further efforts to discover other systemic risk factors and biomarkers that could be predicted from the retina are underway. To date, there remain insufficient prospective studies and a lack of evidence in real-world settings, and therefore the clinical application of AI-DL models using CFP is limited. Moving forward, prospective studies conducted in real-world settings will be needed as evidence that the AI-DL models are beneficial, and to ensure that AI-DL models are safe and cost-effective.

References

1. Wagner SK, Fu DJ, Faes L, Liu X, Huemer J, Khalid H, et al. Insights into systemic disease through retinal imaging-based oculomics. Transl Vis Sci Technol. 2020;9(2):6.
2. Rim TH, Teo AWJ, Yang HHS, Cheung CY, Wong TY. Retinal vascular signs and cerebrovascular diseases. J Neuroophthalmol. 2020;40(1):44–59.
3. McGeechan K, Liew G, Macaskill P, Irwig L, Klein R, Klein BE, et al. Prediction of incident stroke events based on retinal vessel caliber: a systematic review and individual-participant meta-analysis. Am J Epidemiol. 2009;170(11):1323–32.
4. Wong TY, McIntosh R. Systemic associations of retinal microvascular signs: a review of recent population-based studies. Ophthalmic Physiol Opt. 2005;25(3):195–204.
5. Lim M, Sasongko MB, Ikram MK, Lamoureux E, Wang JJ, Wong TY, et al. Systemic associations of dynamic retinal vessel analysis: a review of current literature. Microcirculation. 2013;20(3):257–68.
6. Sabanayagam C, Lye WK, Klein R, Klein BE, Cotch MF, Wang JJ, et al. Retinal microvascular calibre and risk of diabetes mellitus: a systematic review and participant-level meta-analysis. Diabetologia. 2015;58(11):2476–85.
7. Kim DH, Chaves PHM, Newman AB, Klein R, Sarnak MJ, Newton E, et al. Retinal microvascular signs and disability in the Cardiovascular Health Study. Arch Ophthalmol (Chicago, Ill: 1960). 2012;130(3):350–6.
8. Wong TY, McIntosh R. Hypertensive retinopathy signs as risk indicators of cardiovascular morbidity and mortality. Br Med Bull. 2005;73–74:57–70.
9. Kesler A, Vakhapova V, Korczyn AD, Naftaliev E, Neudorfer M. Retinal thickness in patients with mild cognitive impairment and Alzheimer's disease. Clin Neurol Neurosurg. 2011;113(7):523–6.
10. Cheung CY, Ong YT, Ikram MK, Ong SY, Li X, Hilal S, et al. Microvascular network alterations in the retina of patients with Alzheimer's disease. Alzheimers Dement. 2014;10(2):135–42.
11. Feke GT, Hyman BT, Stern RA, Pasquale LR. Retinal blood flow in mild cognitive impairment and Alzheimer's disease. Alzheimers Dement (Amst). 2015;1(2):144–51.
12. Frost S, Kanagasingam Y, Sohrabi H, Vignarajan J, Bourgeat P, Salvado O, et al. Retinal vascular biomarkers for early detection and monitoring of Alzheimer's disease. Transl Psychiatry. 2013;3(2):e233.
13. McGeechan K, Liew G, Macaskill P, Irwig L, Klein R, Klein BEK, et al. Meta-analysis: retinal vessel caliber and risk for coronary heart disease. Ann Intern Med. 2009;151(6):404–13.
14. Cheung CY, Tay WT, Ikram MK, Ong YT, De Silva DA, Chow KY, et al. Retinal microvascular changes and risk of stroke: the Singapore Malay Eye Study. Stroke. 2013;44(9):2402–8.
15. Kawasaki R, Xie J, Cheung N, Lamoureux E, Klein R, Klein BE, et al. Retinal microvascular signs and risk of stroke: the Multi-Ethnic Study of Atherosclerosis (MESA). Stroke. 2012;43(12):3245–51.
16. Wong TY, Klein R, Couper DJ, Cooper LS, Shahar E, Hubbard LD, et al. Retinal microvascular abnormalities and incident stroke: the Atherosclerosis Risk in Communities Study. Lancet. 2001;358(9288):1134–40.
17. Nguyen TT, Wang JJ, Sharrett AR, Islam FMA, Klein R, Klein BEK, et al. Relationship of retinal vascular caliber with diabetes and retinopathy. The Multi-Ethnic Study of Atherosclerosis (MESA). Diabetes Care. 2008;31(3):544–9.
18. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2(3):158–64.
19. Kim YD, Noh KJ, Byun SJ, Lee S, Kim T, Sunwoo L, et al. Effects of hypertension, diabetes, and smoking on age and sex prediction from retinal fundus images. Sci Rep. 2020;10(1):4623.
20. Rim TH, Lee G, Kim Y, Tham YC, Lee CJ, Baik SJ, et al. Prediction of systemic biomarkers from retinal photographs: development and validation of deep-learning algorithms. Lancet Digit Health. 2020;2(10):e526–e36.
21. Gerrits N, Elen B, Craenendonck TV, Triantafyllidou D, Petropoulos IN, Malik RA, et al. Age and sex affect deep learning prediction of cardiometabolic risk factors from retinal images. Sci Rep. 2020;10(1):9432.
22. Zhu Z, Shi D, Peng G, Tan Z, Shang X, Hu W, et al. Retinal age as a predictive biomarker for mortality risk. medRxiv. 2020.
23. Vaghefi E, Yang S, Hill S, Humphrey G, Walker N, Squirrell D. Detection of smoking status from retinal images; a Convolutional Neural Network study. Sci Rep. 2019;9(1):7180.
24. Zhu Z, Shi D, Peng G, Tan Z, Shang X, Hu W, Liao H, Zhang X, Huang Y, Yu H, Meng W, Wang W, Yang X, He M. Retinal age as a predictive biomarker for mortality risk. medRxiv. 2020.
25. Kifley A, Liew G, Wang JJ, Kaushik S, Smith W, Wong TY, et al. Long-term effects of smoking on retinal microvascular caliber. Am J Epidemiol. 2007;166(11):1288–97.
26. Ikram MK, de Jong FJ, Vingerling JR, Witteman JC, Hofman A, Breteler MM, et al. Are retinal arteriolar or venular diameters associated with markers for cardiovascular disorders? The Rotterdam Study. Invest Ophthalmol Vis Sci. 2004;45(7):2129–34.
27. Sun C, Wang JJ, Mackey DA, Wong TY. Retinal vascular caliber: systemic, environmental, and genetic associations. Surv Ophthalmol. 2009;54(1):74–95.
28. Kifley A, Wang JJ, Cugati S, Wong TY, Mitchell P. Retinal vascular caliber, diabetes, and retinopathy. Am J Ophthalmol. 2007;143(6):1024–6.
29. Song YM, Sung J, Davey Smith G, Ebrahim S. Body mass index and ischemic and hemorrhagic stroke: a prospective study in Korean men. Stroke. 2004;35(4):831–6.
30. Reeves GK, Pirie K, Beral V, Green J, Spencer E, Bull D. Cancer incidence and mortality in relation to body mass index in the Million Women Study: cohort study. BMJ. 2007;335(7630):1134.
31. Calle EE, Rodriguez C, Walker-Thurmond K, Thun MJ. Overweight, obesity, and mortality from cancer in a prospectively studied cohort of U.S. adults. N Engl J Med. 2003;348(17):1625–38.
32. Shah NR, Braverman ER. Measuring adiposity in patients: the utility of body mass index (BMI), percent body fat, and leptin. PLoS One. 2012;7(4):e33308.
33. Chiquita S, Rodrigues-Neves AC, Baptista FI, Carecho R, Moreira PI, Castelo-Branco M, et al. The retina as a window or mirror of the brain changes detected in Alzheimer's disease: critical aspects to unravel. Mol Neurobiol. 2019;56(8):5416–35.
34. Sadun AA, Borchert M, DeVita E, Hinton DR, Bassi CJ. Assessment of visual impairment in patients with Alzheimer's disease. Am J Ophthalmol. 1987;104(2):113–20.
35. Hart NJ, Koronyo Y, Black KL, Koronyo-Hamaoui M. Ocular indicators of Alzheimer's: exploring disease in the retina. Acta Neuropathol. 2016;132(6):767–87.
36. Jiang H, Wei Y, Shi Y, Wright CB, Sun X, Gregori G, et al. Altered macular microvasculature in mild cognitive impairment and Alzheimer disease. J Neuroophthalmol. 2018;38(3):292–8.
37. Harju M, Tuominen S, Summanen P, Viitanen M, Pöyhönen M, Nikoskelainen E, et al. Scanning laser Doppler flowmetry shows reduced retinal capillary blood flow in CADASIL. Stroke. 2004;35(11):2449–52.
38. Lim G, Lim ZW, Xu D, Ting DSW, Wong TY, Lee ML, et al. Feature isolation for hypothesis testing in retinal imaging: an ischemic stroke prediction case study. Proc AAAI Conf Artif Intell. 2019;33(01):9510–5.
39. Dai G, He W, Xu L, Pazo EE, Lin T, Liu S, et al. Exploring the effect of hypertension on retinal microvasculature using deep learning on East Asian population. PLoS One. 2020;15(3):e0230111.
40. Zhang L, Yuan M, An Z, Zhao X, Wu H, Li H, et al. Prediction of hypertension, hyperglycemia and dyslipidemia from retinal fundus photographs via deep learning: a cross-sectional study of chronic diseases in central China. PLoS One. 2020;15(5):e0233166.
41. Mitani A, Huang A, Venugopalan S, Corrado GS, Peng L, Webster DR, et al. Detection of anaemia from retinal fundus images via deep learning. Nat Biomed Eng. 2020;4(1):18–27.
42. Babenko B, Mitani A, Traynis I, Kitade N, Singh P, Maa A, Cuadros J, Corrado GS, Peng L, Webster DR, Varadarajan A, Hammel N, Liu Y. Detecting hidden signs of diabetes in external eye photographs. arXiv. 2020.
43. Benson J, Estrada T, Burge M, Soliz P, editors. Diabetic peripheral neuropathy risk assessment using digital fundus photographs and machine learning. 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 20–24 July; 2020.
44. Wong TY, Xu D, Ting D, Nusinovici S, Cheung C, Shyong TE, Cheng C-Y, Lee ML, Hsu W, Sabanayagam C. Artificial intelligence deep learning system for predicting chronic kidney disease from retinal images. IOVS. 2019;60:1468.
45. Sabanayagam C, Xu D, Ting DSW, Nusinovici S, Banu R, Hamzah H, et al. A deep learning algorithm to detect chronic kidney disease from retinal photographs in community-based populations. Lancet Digit Health. 2020;2(6):e295–302.
22  Artificial Intelligence Using the Eye as a Biomarker of Systemic Risk 255

46. Kang EY HY, Li C, Huang Y, Kuo C, Kang J, Chen 50. Walsh JB.  Hypertensive retinopathy. Description,

K, Lai C, Wu W, Hwang Y. A deep learning model for classification, and prognosis. Ophthalmology.
detecting early renal function impairment using reti- 1982;89(10):1127–31.
nal fundus images: model development and validation 51. Detrano R, Guerci AD, Carr JJ, Bild DE, Burke G,
study. JMIR Med Inf. 2020. Folsom AR, et al. Coronary calcium as a predictor of
47. Cheung CY, Xu D, Cheng CY, Sabanayagam C,
coronary events in four racial or ethnic groups. N Engl
Tham YC, Yu M, et al. A deep-learning system for the J Med. 2008;358(13):1336–45.
assessment of cardiovascular disease risk via the mea- 52. Ching T, Zhu X, Garmire LX.  Cox-nnet: An artifi-
surement of retinal-vessel calibre. Nat Biomed Eng. cial neural network method for prognosis prediction
2020. of high-throughput omics data. PLoS Comput Biol.
48. Son J, Shin JY, Chun EJ, Jung K-H, Park KH, Park 2018;14(4):e1006076.
SJ.  Predicting high coronary artery calcium score 53. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang
from retinal fundus images with deep learning algo- T, Kluger Y.  DeepSurv: personalized treatment rec-
rithms. Transl Vis Sci Technol. 2020;9(2):28. ommender system using a Cox proportional hazards
49. Chang J, Ko A, Park SM, Choi S, Kim K, Kim SM, deep neural network. BMC Med Res Methodol.
et al. Association of cardiovascular mortality and deep 2018;18(1):24.
learning-funduscopic atherosclerosis score derived 54. Gensheimer MF, Narasimhan B. A scalable discrete-­
from retinal fundus images. Am J Ophthalmol. time survival model for neural networks. PeerJ.
2020;217:121–30. 2019;7:e6257.
23 Artificial Intelligence in Calculating the IOL Power

John G. Ladas and Shawn R. Lin

Introduction

The use of artificial intelligence and machine learning will transform medicine as it has many other fields. Within ophthalmology, AI has already been introduced in glaucoma, retina, and various anterior segment diseases as a potential means to efficiently and accurately screen for and grade disease. The majority of these advances involve identification of an image, a diagnosis or outcome through "pattern recognition" and large datasets [1, 2].

In general, AI works by learning the relationship of multiple variables from a large data set and weighing them accordingly as they relate to a specific outcome or image. This is useful in identifying patterns in diagnosis, making management decisions, scoring a prognosis, or performing automation for laborious tasks. "Pattern recognition" identification is one of the first tasks one can ask of artificial intelligence in medicine. An additional task is to navigate a set of factors that are likely to change over time in order to direct an intervention toward a desired outcome. This may be a more interesting application, and indeed AI is a dynamic solution which continues to grow and improve as the training data set becomes larger. Its strength lies in its scalability, flexibility and malleability to include additional variables and complexities.

According to the World Health Organization, approximately 32 million cataract surgeries will be performed yearly as of the year 2020, making it the most commonly performed surgery in the field of ophthalmology [3]. While the goal of cataract surgery is to improve a patient's overall visual function, appropriate consideration of IOL selection allows it to become the ultimate refractive surgery. This is achieved by meeting the desired refractive outcome.

Further, cataract surgery and IOL calculations are ideally suited for the application of AI and deep learning. The procedure is precise, the outcome is mathematical and, perhaps most importantly, a highly accurate outcome or "ground truth" is known within several weeks. The algorithms or development process honed with use in cataract surgery could potentially be transferred to other fields in medicine. Thus, cataract surgery seems to be an ideal test bed for refining these learning algorithms within medicine.

J. G. Ladas (*)
Wilmer Eye Institute, Baltimore, MD, USA
Maryland Eye Consultants and Surgeons, Silver Spring, MD, USA

S. R. Lin
Stein Eye Institute, Los Angeles, CA, USA
e-mail: slin@jsei.ucla.edu


Types of Machine Learning

There are two categories of machine learning, and both could be applicable to cataract surgery and IOL calculations: supervised and unsupervised learning. Unsupervised learning uses input data to discover similarities among data sets. For instance, with enough data, one could attempt to find the specific characteristics of eyes at risk for a suboptimal refractive outcome. The learning, however, would not be able to help correct the error. To achieve that, one would need outcome data.

Supervised learning is the other branch of machine learning; it utilizes outcome data, in addition to the input variables, to develop a predictive model. Supervised learning can further be subdivided into classification and regression. An example of classification involves the use of images to determine whether a patient has diabetic retinopathy. Regression-based supervised learning uses specific algorithms to establish the relationship between input variables and a continuous outcome. As previously stated, this is why cataract surgery, and specifically IOL calculation, is perfectly suited for this task.

Artificial Intelligence for Diagnosis and Grading

The use of artificial intelligence has been explored for the diagnosis and grading of cataracts. This type of functionality has the potential to have a large impact in diagnosing community cases and referring appropriate surgical candidates to a health care system.

A study published in 2010 in the Journal of Medical Systems demonstrated early work on the training of artificial intelligence to distinguish cataractous from non-cataractous and pseudophakic patients by mydriatic slit lamp photos. However, the example cataracts shown in this paper are dense white lenses, which may limit the utility of this approach [4]. A paper published in the IEEE Transactions on Biomedical Engineering described a method for grading cataracts using a convolutional neural network built on a dataset of 5378 mydriatic slit lamp photos; it showed good agreement with the ground truth as established by an expert grader [5]. A recent paper published in BJO described a system combining image recognition of slit lamp images and home-based cell phone photos with a questionnaire to identify referable cataracts. With this system, the authors estimate that an ophthalmologist can improve productivity tenfold by leveraging AI to monitor and evaluate 40,000 patients a year instead of 4000 [6].

History of IOL Calculations

Formulae, throughout the years, have been classified multiple ways. The most common has been by "generations". The first generation of formulae includes the SRK formula, based completely on linear regression: IOL power = A constant − 2.5 × AL − 0.9 × K. The A constant was calculated mathematically and can be adjusted to fine tune the formula [7, 8]. The second generation of formulas added "adjustments" that were dependent on axial length. The third generation formulas were theoretical formulas developed to further improve the prediction accuracy of the effective lens position (ELP). These formulas include the Hoffer Q, Holladay 1, and SRK/T. Although different, they each used AL and K power to predict the ELP. Additional formulas included measured anterior chamber depth (ACD) and lens thickness (LT) to enhance prediction of the ELP [9–11].

A step towards the integration of AI in IOL calculations took place in 2015 with the introduction of the concept of an IOL 'super formula' [12]. Although previous generations of IOL formulas were thought of as two-dimensional algebraic equations, this methodology depicted formulas in three dimensions. Ladas et al. demonstrated that these formulas could be represented graphically and combined for potential adjustment of specific areas of these calculations. The formula as published served as a dependable backbone or framework for the integration of AI. Furthermore, it provides a malleable framework which allows for targeted improvement within the formula. We will discuss more on this later in the chapter.
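To make the regression character of the first-generation approach concrete, the sketch below fits SRK-style coefficients to a synthetic dataset with scikit-learn (the toolset the authors mention later in this chapter). The data and variable names are illustrative assumptions only; the classic SRK I coefficients (−2.5 for AL, −0.9 for K) are used to generate the toy data so that the fit can be checked against them.

```python
# Illustrative sketch: recovering SRK-style coefficients by linear regression.
# The data below are synthetic; a real fit would use measured AL/K and the
# back-calculated "ideal" IOL power of each operated eye.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
al = rng.normal(23.5, 1.2, n)          # axial length (mm)
k = rng.normal(43.5, 1.5, n)           # mean keratometry (D)
a_constant = 118.4                     # typical A constant, for illustration
# Simulated "ideal" IOL power following the SRK I relationship plus noise:
iol_power = a_constant - 2.5 * al - 0.9 * k + rng.normal(0, 0.3, n)

model = LinearRegression().fit(np.column_stack([al, k]), iol_power)
print("AL coefficient:", model.coef_[0])    # close to -2.5
print("K coefficient:", model.coef_[1])     # close to -0.9
print("Offset (A constant):", model.intercept_)
```

As the fitted intercept plays the role of the A constant, this also illustrates why early regression formulas were adjusted by varying the offset only.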

Input Variables to Consider in Any Algorithm

To date, at least 20 potential variables have been used to help refine these formulas. In addition to AL and net K power, which are the most important, other variables include the aforementioned LT and ACD. Holladay included seven variables, including preoperative refraction, white-to-white, and age [13]. This was done without AI. Later he determined that further axial length adjustment was needed [14]. Along these lines, Haigis also adjusted his formula with three constants that vary the shape of the power prediction curve based on corneal power, axial length, ACD and lens geometry. Again, this was not done with AI, but it highlights the importance of factoring multiple variables that are interrelated.

Other potential variables that have been proposed or demonstrated to have an impact on IOL power calculation include posterior corneal power, true power of the cornea, ratio of anterior and posterior segment, IOL power and design, measured equatorial lens position, aphakic refraction, race, gender, and age [7, 9, 10, 13–18]. Unfortunately, these do not occur in a vacuum but are intimately related to each other. This makes them, perhaps, perfectly suited for the use of AI.

There are many factors that must be considered to make AI successful in IOL calculations. The general steps through which an AI formula can be built can be divided into data gathering, cleaning, and training. In the data gathering step, input variables are collected, ideally directly from the source (biometry devices) with minimal translation. In addition, outcome data (post-operative manifest refractions) must be gathered from the medical record. In the case of lens power calculations, minimal data cleaning is required, as the data exists already in numerical format and contains little noise. However, some filtering is often performed, such as selecting only eyes achieving a certain refractive target, or eyes absent comorbidities, in order to ensure the most accurate input data possible. This data is then fed into a training algorithm. The most popular deep learning toolset available at the time of publication is Google's TensorFlow. The algorithm is allowed to run iteratively to attain the set of weights that most closely maps inputs to known outcomes, and this set of weights constitutes the new lens formula algorithm.
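As a rough illustration of that training loop, the sketch below defines a small fully connected regression network with TensorFlow's Keras API. The feature set (AL, K, ACD, LT stand-ins), network size and training schedule are arbitrary assumptions for demonstration, not the configuration of any published formula.

```python
# Minimal sketch of iterative weight fitting with TensorFlow/Keras.
# X holds preoperative biometry (e.g. AL, K, ACD, LT) and y the observed
# postoperative spherical equivalent; placeholders are used here.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4)).astype("float32")   # placeholder biometry
y = rng.normal(size=(1000, 1)).astype("float32")   # placeholder outcomes

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),                      # predicted refraction
])
model.compile(optimizer="adam", loss="mse")
# Each epoch is one iterative pass adjusting the weights toward the outcomes.
model.fit(X, y, epochs=20, batch_size=32, verbose=0)
```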
Review of Current Formulas with AI

One of the first discussions of the use of a neural network came from Clarke in 1997 [19]. The study was limited by a small number of eyes in the training set (200) and test set (95), and there was no description of the algorithm that he created. As he stated, the main disadvantage at that point was that running the algorithm required "substantial computing power and memory". Further, the input data used in this study measured LT and AL by ultrasound biometry, now known to be less accurate than optical biometry. Nonetheless, the use of a computer to help adjust a formula was introduced. Beyond this early study, there is a paucity of information in the peer reviewed literature.

With the advent of computing power and more data came more interest in using AI in calculations. From a theoretical standpoint, there are two ways to approach this problem: using a fixed set of data and then building an algorithm directly from this data, or utilizing data to adjust an existing algorithm.

Hill reported the use of a radial basis function to help determine IOL power from a group of 3445 eyes (Version 1). Although the formula itself was never published, it has been examined in other studies. A white paper supplied by Haag-Streit describes it as a purely data-driven solution that works best with the specific biometer and the specific lens from which it was derived (Lenstar 900 and SN60WF) [20]. Fundamentally, this is a "regression" formula that uses deep learning to "back calculate" a predicted outcome from a known data set.

There is also a function that flags a calculation as "in bounds" or out of bounds if the calculation is deemed unreliable. The dataset has since been expanded to include 12,419 eyes (Version 2).

Another approach can be to adjust an existing formula. This was introduced by Ladas and colleagues and is inherent in the most recent version of their formula. This approach has been called the Ladas PLUS method, and was presented at ASCRS, ESCRS and AAO [21–23]. This algorithm works with any formula: supervised machine learning algorithms are trained to predict the error between the formula's predicted outcome and the actual outcome, and the formula is adjusted accordingly. Further, we have recently demonstrated that we can improve multiple generations of formulae with our methodology [24].
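A minimal sketch of this kind of error-correction layer is shown below: a regressor is trained on the residual between a baseline formula's prediction and the observed outcome, and its output is added back to the baseline prediction. The column names and the choice of regressor are placeholder assumptions, not the actual Ladas PLUS implementation, which has not been published in full.

```python
# Sketch of adjusting an existing formula by learning its residual error.
# `df` is a hypothetical pandas DataFrame holding biometry (al, k, acd),
# the baseline formula's prediction, and the observed postoperative SE.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = ["al", "k", "acd"]

def fit_plus_adjustment(df: pd.DataFrame) -> GradientBoostingRegressor:
    # Target = error of the baseline formula on each training eye.
    residual = df["observed_se"] - df["baseline_predicted_se"]
    return GradientBoostingRegressor().fit(df[FEATURES], residual)

def adjusted_prediction(model, df: pd.DataFrame) -> pd.Series:
    # Final prediction = baseline formula + learned correction.
    return df["baseline_predicted_se"] + model.predict(df[FEATURES])
```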
Because of the lack of publication on this subject, it is difficult to report on the specific methodologies that others may or may not use. Our most recent publication utilized gathering, filtering and cleansing mechanisms to obtain eyes with one type of IOL and a best corrected visual acuity threshold, as well as to exclude eyes with co-morbidities. At this point, we used software (Python 3.7 with the scikit-learn package) to refine a baseline formula. The supervised learning algorithms we tested were support vector regression (SVR), extreme gradient boosting (XGBoost), and an artificial neural network (ANN).

An important step in developing an algorithm is making sure that the data obtained demonstrates a normal distribution. This is done with a Shapiro-Wilk test. Further, steps to prevent overfitting of a model are taken by performing a fivefold cross validation within the training set. This was all done after randomly separating the data into ten equal parts and subsequently using nine of the ten to train the algorithm and testing on the remaining tranche. This was sequentially done ten times.
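The sketch below mirrors that validation recipe under stated assumptions: `X` and `y` stand in for the cleaned features and outcomes, normality is screened with SciPy's Shapiro-Wilk test, and an SVR model is scored with fivefold cross validation using scikit-learn's built-in machinery (a simplification of the ten-way split described above).

```python
# Sketch of the described validation steps: normality screening followed by
# cross-validated scoring of a supervised regressor.
import numpy as np
from scipy.stats import shapiro
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))      # placeholder biometric features
y = rng.normal(size=2000)           # placeholder outcomes

stat, p_value = shapiro(y)          # Shapiro-Wilk normality test
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

model = make_pipeline(StandardScaler(), SVR())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv,
                         scoring="neg_mean_absolute_error")
print("Cross-validated MAE:", -scores.mean())
```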
There have also been other publications describing techniques of adjusting an existing formula [25]. Indeed, Sramka et al. used supervised learning techniques and demonstrated equal or better performance than standard formulas. Kane has also reported including elements of artificial intelligence in his own formula but has never described the methodology [26].

There are multiple detailed approaches to these issues which are beyond the scope of this chapter. A recent publication from our group outlines the exact methodology for each of these steps as they apply to IOL calculations.

Challenges of AI Integration

There are a few principles which will govern the speed of adoption of new lens formulas: the accuracy and amount of data, and trust. First, larger datasets may allow an algorithm to account for outliers that are poorly represented in smaller datasets. Existing machine learning algorithms are trained on tens of thousands of eyes. Creating a public dataset of a hundred thousand, a million, or even more eyes could help us design better formulas. One of the authors of this chapter, SRL, is in the process of creating a public dataset of high-quality lens data. Using this dataset, new formulas can be developed more rapidly and tested against a known benchmark. In addition, a public dataset would allow individuals and organizations outside of ophthalmology to work on this problem.

Another way to achieve the goal of obtaining large amounts of accurate data is to make the process automated. Indeed, the outcome data which AI relies upon, the manifest refraction (MRx), is often suboptimal due to technique variability, room length, the patient's subjective participation, and the time taken to perform measurements. The use of post-operative autorefraction (ARx) or wavefront data can potentially help eliminate most issues that occur with MRx acquisition. However, the correlation between ARx and MRx for the purposes of IOL formula optimization is still unclear and is currently being investigated in ongoing studies. Furthermore, with 'big data' stored within an automated refractor, one would be able to characterize an eye as one with 'standard' parameters or one with 'unusual' parameters. Thus, AI could pre-operatively highlight eyes that are 'at-risk' for a post-operative refractive surprise so that the surgeon may pay extra attention to pre-operative IOL calculation.

Regardless of how the input variables and outcome data are accumulated, their accumulation will lead to advances in this field. For instance, the website Kaggle.com coordinates competitions in which organizations provide data sets and machine learning researchers compete to derive the best algorithms. This is where Netflix famously ran its $1,000,000 competition to create a better movie recommendation algorithm.

Finally, clinicians often do not adopt new algorithms until they have seen enough clinical evidence that they can trust the outcomes. As a field, we can accelerate the development of trust by providing a larger dataset by which researchers could compare new algorithms to existing ones. For example, widespread adoption of a new algorithm could occur more quickly if a clinician knew that the new algorithm had been compared to previously used algorithms in a public dataset of 1 million eyes. A global public dataset could allow not only for advancements in algorithm accuracy, but also for regional fine tuning, by giving clinicians the ability to extend the base algorithm for their patient population by adding custom data.

A key factor in the success of any artificial intelligence or machine learning algorithm is the quality of the input and outcome data as well as the number of samples. The input data currently used in lens formulas is quite accurate with modern biometry devices, but the field will benefit from the continued development of more accurate measurements, as well as from the incorporation of new types of measurements. This big data and 'crowd-sourced' approach could eventually use millions of data points to achieve very high levels of accuracy, and could evolve over time into a 'living formula' that continuously improves as new data is added.

Conclusion

The integration of AI into IOL calculations will continue to grow, and with the ability to accumulate reliable outcome data it will expand exponentially within the next few years. Further, the network effect of comparing formulas as well as outcomes in our modern world will forever change the way that formulas are developed and compared.

References

1. Heath Jeffery RC, Smith M. Artificial intelligence in ophthalmology: current applications and emerging issues [published online ahead of print, 2020 Jan 23]. Clin Exp Ophthalmol. *Stevenson CH, Hong SC, Ogbuehi KC. Development of an artificial intelligence system to classify pathology and clinical features on retinal fundus images. Clin Exp Ophthalmol. 2019;47(4):484–9.
2. Hogarty DT, Mackey DA, Hewitt AW. Current state and future prospects of artificial intelligence in ophthalmology: a review. Clin Exp Ophthalmol. 2019;47(1):128–39.
3. World Health Organization. Blindness: vision 2020—control of major blinding diseases and disorders. http://www.who.int/mediacentre/factsheets/fs214/en/. Accessed Jan 2020.
4. Acharya RU, Yu W, Zhu K, et al. Identification of cataract and post-cataract surgery optical images using artificial intelligence techniques. J Med Syst. 2010;34(4):619–28.
5. Gao X, Lin S, Wong TY. Automatic feature learning to grade nuclear cataracts based on deep learning. IEEE Trans Biomed Eng. 2015.
6. Wu X, Huang Y, Liu Z. Universal artificial intelligence platform for collaborative management of cataracts. Br J Ophthalmol. 2019;103(11):1553–60.
7. Olsen T. Calculation of intraocular lens power: a review. Acta Ophthalmol Scand. 2007;85(5):472–85.
8. Olsen T, Thom K, Corydon L. Theoretical versus SRK I and SRK II calculation of intraocular lens power. J Cataract Refract Surg. 1990;16(2):217–25.
9. Barrett GD. An improved universal theoretical formula for intraocular lens power prediction. J Cataract Refract Surg. 1993;19(6):713–20.
10. Haigis W. Strahldurchrechnung in Gauß'scher Optik zur Beschreibung des Systems Brille-Kontaktlinse-Hornhaut-Augenlinse (IOL). In: Schott K, Jacobi KW, Freyler H, editors. Kongreß d. Deutschen Ges. f. Intraokularlinsen Implantation. Berlin: Springer; 1991. p. 233–46.
11. Olsen T. Prediction of the effective postoperative (intraocular lens) anterior chamber depth. J Cataract Refract Surg. 2006;32(3):419–24.
12. Ladas JG, Siddiqui AA, Devgan U, Jun AS. A 3-D "super surface" combining modern intraocular formulas to generate a "super formula" and maximize accuracy. JAMA Ophthalmol. 2015;133(12):1431–6.
13. Mahdavi S, Holladay J. IOLMaster 500 and integration of the Holladay 2 formula for intraocular lens calculations. Eur Ophthal Rev. 2011;5(2):134–5.
14. Wang L, Holladay JT, Koch DD. Wang-Koch axial length adjustment for the Holladay 2 formula in long eyes. J Cataract Refract Surg. 2018;44(10):1291–2.
15. Cooke DL, Cook TL. Approximating sum-of-segments axial length from a traditional optical low-coherence reflectometry measurement. J Cataract Refract Surg. 2019;45(3):351–4.
16. Olsen T, Corydon L, Gimbel H. Intraocular lens power calculation with an improved anterior chamber depth prediction algorithm. J Cataract Refract Surg. 1995;21(3):313–9.
17. Yoo YS, Whang WJ, Hwang KY. Use of the crystalline lens equatorial plane (LEP) as a new parameter for predicting postoperative IOL position. Am J Ophthalmol. 2019;198:17–24.
18. Olsen T. The Olsen formula. In: Shammas HJ, editor. Intraocular lens power calculations. Thorofare, NJ: Slack; 2004. p. 27–38.
19. Clarke GP, Burmeister JB. Comparison of intraocular lens computations using a neural network versus the Holladay formula. J Cataract Refract Surg. 1997;23(10):1585–9.
20. Hill-RBF Method. Released: October 2017/V2.0. Haag-Streit AG, Koeniz, Switzerland. https://www.haag-streit.com/fileadmin/Haag-Streit_Diagnostics/biometry/EyeSuite_IOL/Brochures_Flyers/White_Paper_Hill-RBF_Method_20160819_2_0.pdf. Accessed April 2020.
21. Siddiqui AA, Ladas JG, Nutkiewicz M. Evaluation of new IOL formula that integrates artificial intelligence. Paper presentation at: American Society of Cataract and Refractive Surgery (ASCRS) annual meeting, Washington, DC, April 2018.
22. Ladas JG. Artificial intelligence and big data in IOL calculations. European Society of Cataract and Refractive Surgeons (ESCRS) Annual Meeting, September 14, 2019.
23. Ladas JG. Artificial intelligence in ophthalmology. American Academy of Ophthalmology (AAO) Annual Meeting, Spotlight Session, October 13, 2019.
24. Ladas J, Ladas D, Lin SR, Devgan U, Siddiqui AA, Jun AS. Improvement of multiple generations of intraocular lens calculation formulae with a novel approach using artificial intelligence. Trans Vis Sci Tech. 2021. (In Press).
25. Sramka M, Slovak M, Tuckova J, Stodulka P. Improving clinical refractive results of cataract surgery by machine learning. July 2, 2019. PubMed 31304064. www.Peerj.com/articles/7202/. Accessed April 2020.
26. Connell BJ, Kane JX. Comparison of the Kane formula with existing formulas for intraocular lens power selection. BMJ Open Ophthalmol. 2019;4:e000251. https://doi.org/10.1136/bmjophth-2018-000251.
24 Practical Considerations for AI Implementation in IOL Calculation Formulas

Guillaume Debellemanière, Alain Saad, and Damien Gatinel

History

The Postoperative spherical Equivalent prediction using ARtificial intelligence and Linear algorithms (PEARL) project was initiated in the Anterior Segment and Refractive Surgery Department at Rothschild Foundation by the authors of this chapter. This ever-evolving work aims to assess the potential of Artificial Intelligence (AI) techniques in the IOL calculation field, determining the optimal architecture of AI-based formulas and the remaining role of optics within them, as well as addressing the common problems encountered in IOL calculation (such as IOL constant adjustment, post-refractive surgery calculations, and the use of new IOL models) in the specific context of AI-based formulas. It resulted in a succession of IOL calculation formulas known under the name "PEARL-DGS", DGS representing the initials of the last names of the authors. The final objective of the PEARL project is the open-source release of the code allowing reproduction of the steps necessary for formula development and algorithm training.

General Points About AI

AI: A Word of Caution

AI can be defined as "the part of computer science concerned with designing intelligent computer systems, that is, systems that exhibit characteristics we associate with intelligence in human behaviour" [1]. This ambitious research field has gained renewed interest since 2012, with advances in image recognition allowed by the advent of deep learning. While deep learning techniques have indeed revolutionized some specific fields (computer vision, speech recognition, image and video generation, automatic translation, natural language processing...), it must be recognized that AI has also often been used as a marketing tool. It is therefore important for every doctor to know the current limits of this technology.

Some Common Misconceptions About Artificial Intelligence

– AI algorithms don't usually "learn" on an ongoing basis. The algorithms must be trained with clean and verified data, and the performances of the resulting model must be evaluated before it can be used. AI models can of course be regularly refined, but they don't typically learn with every new example that is provided to them.

The authors sincerely thank Dr Radhika Rampat, Corneo-Plastic Unit, Queen Victoria Hospital, East Grinstead, UK, for careful proofreading of this chapter.

G. Debellemanière · A. Saad · D. Gatinel (*)
Adolphe de Rothschild Foundation Hospital, Paris, France
e-mail: gdebellemaniere@for.paris


– AI tools can help to accurately predict future outcomes using previous experience. This requires clean data composed of predictors (features) and resulting outcomes. However, the predictions can only be as good as the input data: modern algorithms can use the information contained within the training datasets to their full potential, but no "extra information" can be extracted, or guessed, by any digital super-intelligence.

– Even the most sophisticated machine learning regression algorithms are not intrinsically different from a standard linear regression. The mathematics behind the training and prediction processes differ between algorithms, as do their prediction performances on complex datasets, especially when dealing with a very high number of predictors or training examples; however, every regression algorithm aims at predicting a continuous target value using data from previous experience, and can only be as accurate as the input data. No algorithm is intrinsically better than another: their performances vary depending on the kind of problem at hand, and in fact using complex algorithms to solve simple problems usually leads to overfitting. This means that very good performances may be seen on the training set, with disappointing performances on external datasets and in real life.

Machine Learning-Based IOL Formulas Architecture

Pure AI-Based Formulas

Machine learning algorithms are used to predict a numerical target using data from previous experience. Within the context of IOL calculation, various formula architectures can be designed. In pure AI-based IOL formulas, the prediction of the refractive result of the surgery does not rely on optical assumptions. The calculation is a statistical inference which uses preoperative biometric parameters to predict the postoperative spherical equivalent (SE). In particular, the effective lens position (ELP, distance from the anterior corneal surface to the IOL principal plane, used in thin lens formulas) and/or the ALP (distance from the anterior corneal surface to the anterior IOL surface, used in thick lens formulas) is not considered in the calculation process of those formulas. The Hill-RBF formula is not published, but is described on the calculator website [2] as "entirely data driven" and the underlying algorithm as a radial basis function (RBF) network, i.e. a neural network using a non-linear RBF activation function in the neurons composing its hidden layer.

Machine learning regression algorithms are sophisticated regression techniques that do not differ in essence from linear regression: the latter is in fact the simplest and most popular algorithm of its kind. The SRK formula [3, 4] was one of the most successful regression formulas. However, the coefficients associated with the keratometry and axial length were fixed (to 0.9 and 2.5, respectively) and were not intended to be adapted, "retrained", to fit different IOL models: the prediction was adjusted by varying the offset of the regression only.

In IOL calculation, for a given IOL model and a given eye, the IOL power implanted and the postoperative refractive result are almost perfectly linearly interdependent. Hence, pure AI-based IOL formulas could theoretically be designed to either predict the postoperative spherical equivalent for a given IOL power (Fig. 24.1a), or the recommended IOL power to choose to reach a specific refractive target (Fig. 24.1b). Either the IOL power or the postoperative outcome has to be among the inputs, because of the necessary correspondence between a given IOL power and a given refractive result. The biometric parameters by themselves only give information about the post-operative physical position of a given lens: if optical formulas are not used at all, either the IOL power or the post-operative refraction have to be among the predictors. This information can also be implicit (by using the predictions from other formulas for a given IOL power, like in the Ladas [5] formula, for example).
[Fig. 24.1: two flow diagrams. (a) Biometric inputs (AL, ARC, ACD, LT, ..., age, gender) plus the implanted IOL power feed an ML regression algorithm that outputs the postoperative SE obtained with that power. (b) The same biometric inputs plus the postoperative SE feed an ML regression algorithm that outputs the IOL power implanted.]

Fig. 24.1 Pure AI-based formulas architectures. Pure AI formulas can be designed to either predict the postoperative spherical equivalent for a given IOL power and set of biometric parameters (a) or to predict the IOL power implanted given the postoperative SE and the biometric parameters (b). Because the intended SE is almost never hyperopic, the postoperative SE carries "hidden" information about the eye: as a consequence, architecture (b) should never be used
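A minimal sketch of architecture (a) follows, under the assumption that a cleaned table of biometry, implanted IOL power and observed postoperative SE is available; the synthetic data, feature ordering and the gradient boosted tree model are illustrative choices only.

```python
# Sketch of architecture (a): an ML regressor predicts the postoperative SE
# from biometry plus a candidate IOL power. Data are synthetic placeholders;
# feature columns mirror Fig. 24.1.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(23.5, 1.2, n),    # AL (mm)
    rng.normal(7.7, 0.25, n),    # ARC (mm)
    rng.normal(3.1, 0.4, n),     # ACD (mm)
    rng.normal(4.5, 0.4, n),     # LT (mm)
    rng.normal(21.0, 3.0, n),    # implanted IOL power (D)
])
y = rng.normal(-0.2, 0.5, n)     # placeholder postoperative SE (D)

model = HistGradientBoostingRegressor().fit(X, y)

# Prediction phase: score one eye for several candidate powers and pick the
# power whose predicted SE is closest to the refractive target.
eye = X[0]
target = -0.25
candidates = np.arange(18.0, 24.5, 0.5)
preds = [model.predict([np.r_[eye[:4], p]])[0] for p in candidates]
best = candidates[int(np.argmin(np.abs(np.array(preds) - target)))]
print("selected power:", best)
```

Note that, consistent with the caption above, the postoperative SE appears only as the training target, never as an input.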

physical position of a given lens: if optical formu-


las are not used at all, either the IOL power or the The reverse phenomenon is not true: know-
post-operative refraction have to be among the ing the IOL power implanted does not give
predictors. This information can also be implicit any hidden information about the postopera-
(by using the predictions from other formulas for a tive SE.  It can be tempting to design IOL
given IOL power, like in the Ladas [5] formula, for formulas using the post-­operative outcome
example). as an input in their predictive algorithm:
However, the fact that the desired real-life those formulas will very accurately (and
refractive outcome is usually emmetropic or blindly!) predict the IOL power implanted
slightly myopic leads to the imperative necessity on retrospective datasets, by incorrectly
of avoiding the use of the postoperative refractive assimilating the real postoperative spherical
outcome among the predictors in any IOL calcu- equivalent to the refractive target. However,
lation algorithm (see box below). This is because when used in real-life, the performances of
this value carries in itself indirect information those formulas would strongly deteriorate,
about the degree of atypicality of a given eye, because the target and the real refractive out-
especially if the eye is hyperopic postoperatively. come will differ. Hence, IOL formulas
Hence, pure-AI formulas should only be designed should always be evaluated on their perfor-
to predict the spherical postoperative equivalent, mances in blindly predicting the post-opera-
using the IOL power implanted as an input. tive outcome for a given IOL power, and
IOL formulas that require the “target” (in
fact, the real refractive outcome) to predict
The features chosen to train an algorithm
the IOL power (even in a blind manner)
must carry exactly the same kind of infor-
should be considered with caution.
mation in the training process and in real
life. A particularly relevant example of that
imperative in IOL calculation is the use of Because they are not based on optical assump-
the postoperative spherical equivalent as a tions, pure-AI formulas could have the advantage
feature to predict the IOL power (or the of being less susceptible to non-measured (e.g.
physical position of the IOL). Consider the posterior radius of curvature of the cornea, PRC)
following situation: a surgeon is asked to or mismeasured (e.g. extreme ALs [6, 7]) param-
blindly guess the IOL power that was eters used in formulas based on optics. One of the
implanted for a given set of patients. He drawbacks is that the pure-AI approach does not
has access to their biometry and to the allow for straightforward adjustment of the IOL
refractive outcome. The surgeon notices constant, making the adaptation to other lens mod-
that some patients are hyperopic. Knowing els potentially more difficult for those formulas.
that the intended target is almost never
hyperopic, he will probably conclude that,
in those patients, the chosen IOL power AI-Enhanced Optical Formulas
was not the one that targeted hyperopia, but
the one that targeted emmetropia: he will AI and optics can also be used in combination.
usually be right. Even without knowing The eye is an optical system and the refraction
anything about cataract surgery, if the set is obeys predictable physical laws. Physical laws
large enough, a careful observer will notice are, by definition, the most efficient way to pre-
this phenomenon and increase its accuracy dict physical phenomena: it is therefore the
in predicting the IOL power that was authors’ opinion that the prediction processes in
implanted. Powerful algorithms that have any formula should be limited to what cannot be
access to the postoperative outcome will measured or calculated, and that not using optics
behave the same way. in IOL formulas is the equivalent of letting an
algorithm “reinvent” the laws of physics.

ML algorithms can be used to predict the parameters in a given optical formula if:

1) They are not usually measured or known preoperatively (otherwise a prediction would be irrelevant)
2) Their value is related to other parameters that are known or measured preoperatively, to be used as features in the model
3) Their value can be accurately measured or calculated, even post-operatively, to determine the model target.

The postoperative lens position and the posterior corneal radius are the only parameters that fulfill all of those requirements. The refractive indices of the cornea, aqueous and vitreous are directly used in the optical formulas that calculate the refraction of the pseudophakic eye and as such could theoretically be interesting to predict. Furthermore, the refractive index of the crystalline lens indirectly influences the AL measurement [6–9]. Condition number 1 is fulfilled because they are not measured by current biometers. The refractive indices of the eye segments could potentially vary statistically with their thickness/length (lens and vitreous), with the patient's age [10–12], and/or with a history of corneal or retinal surgery [13], thus theoretically allowing a prediction from preoperatively known parameters (condition 2). However, they are difficult to measure, preventing the determination of the target that could allow the training of an algorithm: no gold standard refractive index measurements can be performed.

Both the posterior corneal radius and the ALP can be accurately measured (with a corneal topographer and, post-operatively, with a biometer, respectively). They can also be back-calculated from the other optical parameters of the eye [14]: in this case, the resulting value accounts for the approximations made in the measurement or the estimation of all the other parameters. The PEARL-DGS formula is based on the prediction of the Theoretical Internal Lens Position (TILP), which is defined as the theoretical distance between the posterior corneal surface and the anterior IOL surface, back-calculated from postoperative data.

The optical equations used in IOL formulas allow calculation of the spherical equivalent of the eye at the spectacle plane from the axial length of the eye, the geometrical characteristics of two lenses (the cornea and the IOL), the distance between those two lenses, and the refractive indices of the eye segments and IOL. The inner working of an AI-enhanced thin lens optical formula is shown in Fig. 24.2.

[Fig. 24.2: flow diagram. Solid lines: corneal power (R/K), ACD, AL and other predictors feed an ML regression algorithm that outputs the ELP, which is entered with the considered IOL power into a thin lens equation to produce the predicted postoperative SE. Dotted lines: the implanted IOL power and the real postoperative SE are used to back-calculate the ELP target used for training.]

Fig. 24.2 AI-enhanced formulas use ML algorithms to predict the lens position (and/or the posterior corneal radius) and use the predicted value(s) in optical equations. In this example, thin lens equations are used. The calculation process is represented using solid lines. The ELP back-calculation process is used to calculate the target value for the eyes of the training set and is represented by dotted lines. The triple-optimized Haigis formula is a specific case of this kind of formula, where the ELP predictors are limited to ACD and AL and where the ML algorithm is a linear regression
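The optical core of such a formula can be written in a few lines. The sketch below is a generic thin lens vergence calculation, not the published PEARL-DGS thick lens implementation; a single refractive index of 1.336 behind the cornea and a 12 mm spectacle vertex distance are assumed, both common conventions. It returns the spectacle-plane SE predicted for a given corneal power, AL, IOL power and ELP — the quantity that the ML-predicted ELP feeds into.

```python
# Minimal thin lens vergence sketch (distances in metres, powers in dioptres).
N_EYE = 1.336    # assumed aqueous/vitreous refractive index
VERTEX = 0.012   # assumed spectacle vertex distance (m)

def predicted_se(k, al_m, iol_power, elp_m):
    """Spectacle-plane SE of a pseudophakic eye under the thin lens model."""
    v_in = N_EYE / (al_m - elp_m) - iol_power        # vergence required at the IOL
    r_cornea = N_EYE / (elp_m + N_EYE / v_in) - k    # refraction at the corneal plane
    return r_cornea / (1 + VERTEX * r_cornea)        # transpose to the spectacle plane

# Example: K = 43.5 D, AL = 23.65 mm, a 21 D IOL sitting 5.1 mm behind the cornea
# yields roughly -0.8 D.
print(round(predicted_se(43.5, 0.02365, 21.0, 0.0051), 2))
```

Solving the same equation for the ELP instead of the refraction is what the dotted back-calculation path in Fig. 24.2 performs.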

Thin lens optical equations simplify the calcu- eters) and on engineered data (the back-­calculated
lations by ignoring the thicknesses of the lenses ALP).
considered in the post-operative eye (cornea and
IOL), thus removing the notion of principal plane
positions; lenses are defined by their refractive  ata Cleaning Based on Biometric
D
power without other considerations. This approx- Parameters
imation is close to the reality for lenses with sym-
metrical anterior and posterior radii, but is Eyes with illogical or impossible biometric val-
responsible for errors when dealing with asym- ues should be discarded. This situation can hap-
metrical lenses [8]. Thick lens equations do not pen even in high quality datasets (for example,
use this simplification. false low lens thickness readings can be encoun-
tered in hard brunescent cataracts). Some biom-
eters indicate quality index measurements for
The Haigis formula [14, 15] is unique in
each measured parameter: eyes with measure-
replacing the traditional IOL constant,
ment errors should be discarded. An efficient
which acts as a simple offset in the other
way to spot and discard eyes with outliers in
published classical thin lens formulas
biometric parameters is to create a distribution
(Holladay 1, Hoffer Q, SRK/T) by an offset
curve for every parameter, and visually deter-
(a0) and two coefficients (a1 for ACD and
mine the limits to apply (Fig. 24.4). Outliers can
a2 for AL) that determine a bivariate linear
also be eliminated by identifying eyes having
regression. In its single-optimized version,
biometric values beyond a certain number of
a1 and a2 are fixed and a0 acts as a standard
standard deviations away from the mean (usu-
IOL constant. Haigis also allows calcula-
ally 3). If an aggressive outliers elimination
tion of the “perfect thin lens ELP”, i.e. the
strategy is chosen, care must be taken to evalu-
ELP value which, when entered in the thin
ate the resulting formula on eyes with extreme
lens optical formula along with the preop-
AL and corneal radii. It can be easier and
erative keratometry, axial length and the
quicker to manually adapt a given formula to
implanted IOL power, leads to the real post-
extreme eyes, rather than trying to train an algo-
operative refractive outcome. The Haigis
rithm to correctly predict the right values
formula in its triple-optimized version
(whether the postoperative IOL position or the
allows training of a linear regression algo-
postoperative SE) in very atypical eyes.
rithm to predict this value, leading to the
determination of a new offset and two new
coefficients. The triple-optimized version
of the Haigis formula can then be consid-  ata Cleaning Based on TILP
D
ered as the first optical IOL formula capable Back-Calculation
of being completely re-trained using data,
because it has no hard-coded algorithm to A useful strategy to detect outliers on big datasets
predict the ELP, unlike other open-source is to back-calculate the TILP from the corneal
thin lens formulas (see Fig. 24.3). radius, theoretical posterior corneal radius, IOL
parameters (known or estimated) and axial
length. This can be done using thick lens equa-
tions [22, 23]. Eyes with very high or very low
 ata Quality Control
D TILP can then be considered as outliers. A distri-
and Preparation bution plot of the TILP (Fig. 24.5) is helpful to
choose the limits to apply. A scatter plot display-
Datasets have to be cleaned before being used in ing the TILP as a function of the AL can also be
the formula building process. This data cleaning created (Fig. 24.6), allowing the detection of evi-
can be based on native data (the biometric param- dent outliers.
[Fig. 24.3: four flow diagrams showing the inner structure of the Holladay 1 (n = 4/3), Hoffer Q (n = 1.3375), SRK/T (n = 1.333) and Haigis (n = 1.3315) formulas, each leading from R/K, ACD, AL, constants and IOL power through an ELP computation to a thin lens equation and the predicted postoperative SE.]

Fig. 24.3 Inner workings of the four classical thin lens formulas [15–21]. The acronyms of the original publications are used in those diagrams. Any transformation different from a multiplication or an addition is represented by an "f" ("function") round cell on the diagram. ACD: anatomic anterior chamber depth measured from corneal epithelium to lens; AG: anterior chamber diameter from angle to angle; ALm: modified AL; Cw: computed corneal width; H: corneal height; K: corneal power; LCOR: corrected AL; RETHICK: retinal thickness; RT: retinal thickness. In the Haigis formula, a1 and a2 are used as the coefficients weighting the ACD and AL respectively in a multiple regression, while a0 is the intercept value. In triple-optimized mode, the multiple regression is re-fitted to new data. There is no hard-coded ELP prediction rule in this formula. Reprinted with permission from Debellemanière et al.: The PEARL-DGS formula: development of an open-source machine learning-based thick IOL calculation formula, 2021, American Journal of Ophthalmology, in press
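Triple optimization therefore amounts to refitting the regression ELP = a0 + a1·ACD + a2·AL on back-calculated "perfect" ELP values. A minimal sketch, assuming a hypothetical DataFrame `df` in which the back-calculated ELP has already been computed for every training eye:

```python
# Sketch of Haigis triple optimization: refit a0, a1, a2 by linear regression
# on the back-calculated "perfect thin lens ELP" of a training set.
# Columns acd, al (mm) and perfect_elp are assumed to exist in `df`.
import pandas as pd
from sklearn.linear_model import LinearRegression

def triple_optimize(df):
    reg = LinearRegression().fit(df[["acd", "al"]], df["perfect_elp"])
    a0 = reg.intercept_         # offset
    a1, a2 = reg.coef_          # ACD and AL coefficients
    return a0, a1, a2

def predicted_elp(a0, a1, a2, acd, al):
    # Bivariate linear ELP prediction used inside the thin lens equation.
    return a0 + a1 * acd + a2 * al
```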

Dataset Size

The minimum number of eyes necessary to develop a formula is difficult to assess. We determined that, for a thick lens formula using a basic multiple regression algorithm, the number of eyes that allows the SD of the mean prediction error (PE) to start to reach a plateau was around 2000 eyes (Fig. 24.7).

As a general rule, it is always preferable to obtain the biggest possible dataset; however, data quality should never be compromised. It is an error to give priority to quantity over quality in datasets used in IOL formula design, because any undetected outlier is susceptible to increase the standard deviation of the mean prediction error of the final formula. It is interesting to note that, more than the total number of eyes, it is the number of extreme eyes in a given dataset that matters most, especially regarding axial length. Indeed, eyes shorter than 22 mm and longer than 26 mm together represent less than 20% of a typical dataset, but IOL formulas are routinely evaluated [24] on their performances on those extreme eyes. The relationship between the AL and the postoperative IOL position is non-linear and characterized by threshold effects for extreme AL values (Fig. 24.8). It has also been hypothesized that the refractive index of the vitreous could be different in very long eyes [25]. Extreme eyes are also statistically associated with extreme IOL powers, which can be characterized by unusual lens shape factors (meniscus lenses for low/negative IOL powers, biconvex asymmetrical shapes for very high IOL powers). It is important to have enough eyes with extreme AL to adapt a formula to those situations.

Training Set/Test Set Constitution

Ideally, the test set should be totally independent from the training set: the ideal situation is to have two (or more) datasets from different centers. This is not always possible in practice.

Fig. 24.4 Distribution of the six classical biometric parameters in a non-curated dataset. ARC anterior radius of curvature of the cornea, AL axial length, ACD anterior chamber depth, LT lens thickness, CCT central corneal thickness, WTW white-to-white. Lower and upper limits can be set for each parameter. Very short AL values can sometimes indicate retinal detachment; very short LT values can be found in eyes after cataract surgery (if a post-operative biometric measurement is mistakenly included in a dataset), or in brunescent cataracts. Thin corneas can indicate eyes that underwent corneal refractive surgery, and very thick corneas can be secondary to corneal decompensation. Very small WTW values are usually not representative of the anatomical reality and should be discarded

Fig. 24.5 Distribution of the back-calculated theoretical internal lens position (TILP) in a non-curated dataset. "Impossible" values should be discarded

Fig. 24.6 Representation of the back-calculated theoretical internal lens position (TILP) as a function of axial length in a non-curated dataset. Outliers can easily be spotted and eliminated
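A minimal sketch of the cleaning steps just described, on a hypothetical biometry DataFrame: hard plausibility limits chosen from the distribution plots, then the 3-SD rule, then a TILP plausibility window. All column names and cut-off values are illustrative assumptions, not the limits used by the PEARL-DGS authors.

```python
# Sketch of outlier elimination on a biometry DataFrame (columns in mm).
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Hard limits read off the distribution curves (illustrative values).
    limits = {"al": (19, 32), "acd": (1.5, 5.0), "lt": (2.5, 6.5)}
    for col, (lo, hi) in limits.items():
        df = df[df[col].between(lo, hi)]
    # 2. Eyes beyond 3 standard deviations from the mean of each parameter.
    for col in limits:
        mu, sd = df[col].mean(), df[col].std()
        df = df[(df[col] - mu).abs() <= 3 * sd]
    # 3. Implausible back-calculated TILP values (illustrative window).
    return df[df["tilp"].between(3.0, 7.5)]
```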
If the test set and the training set come from the same center, it is important to avoid the situation where a patient has one eye in both sets, in order to avoid data contamination. It is also preferable to have only one eye per patient in the test set [24]. However, we found no drawbacks in having both eyes of a given patient in the training set. Hence, we recommend the following three steps to constitute the sets (implemented in the sketch after this list):

1. Randomly split the main dataset into a training set and a test set
2. Identify patients that have one eye in both sets and move both of their eyes into the training set
3. Identify patients that have both eyes in the test set and randomly delete one of those eyes, without including it in the training set
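The sketch below implements the three steps on a hypothetical DataFrame with one row per eye and a `patient_id` column; the column names and split fraction are assumptions.

```python
# Sketch of the three-step training/test set constitution.
import pandas as pd

def split_sets(df: pd.DataFrame, test_fraction: float = 0.2):
    # 1. Random eye-level split.
    test = df.sample(frac=test_fraction, random_state=0)
    train = df.drop(test.index)
    # 2. Patients with one eye in each set: move both eyes to the training set.
    shared = set(train["patient_id"]) & set(test["patient_id"])
    train = pd.concat([train, test[test["patient_id"].isin(shared)]])
    test = test[~test["patient_id"].isin(shared)]
    # 3. Patients with both eyes in the test set: keep a single random eye,
    #    without moving the discarded eye into the training set.
    test = test.sample(frac=1.0, random_state=0).drop_duplicates("patient_id")
    return train, test
```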
Management of IOL Models Diversity

IOL anterior and posterior radii of curvature, thicknesses, refractive indices and haptic styles differ between IOL models. Those properties are different between IOL models for a given IOL power, and the way those parameters vary along the IOL power range is also specific to a given model. It is therefore, in our opinion, not recommended to mix eyes implanted with various IOL models in the same IOL formula development dataset, whatever the underlying IOL formula architecture choice.

Machine Learning Models Inputs

Standard Biometric Parameters

The values measured by recent biometers usually include the anterior radius of curvature of the cornea (ARC), axial length (AL), anterior chamber depth (ACD), lens thickness (LT), central corneal thickness (CCT) and corneal diameter (white-to-white, WTW). An example of their relative importance in the TILP prediction is shown in Fig. 24.9. AQD stands for aqueous chamber depth (ACD − CCT) and VCD stands for vitreous chamber depth (AL − [CCT + AQD + LT]). AQD and VCD were preferred to AL and ACD to reduce the collinearity between variables and to facilitate the feature importance study.

Fig. 24.7 Representation of the SD of the mean PE of two generic formulas, obtained on a test set of 700 eyes, as a function of the number of eyes in the training set. The Pure AI formula is an XGBoost model trained to predict the postoperative spherical equivalent using the six biometric parameters plus the implanted IOL power as inputs, with no hyperparameter optimization. The Thick lens + XGBoost formula is an XGBoost algorithm trained to predict the TILP value using the six biometric parameters as an input: this value is then used in thick lens equations. Note that the SD of formula (b) decreases faster and gets lower than the SD of formula (a)

It is important to remember not to include the corneal radius and corneal thickness among the postoperative IOL position predictors if the formula is developed for eyes with a history of corneal refractive surgery or corneal graft, because those surgically modified values are no longer helpful to predict the IOL position. The ACD must be considered with caution in eyes that underwent radial keratotomy for the same reasons. LT is statistically strongly correlated to the ACD, but nonetheless increases the post-operative IOL position prediction accuracy. The role of corneal thickness and corneal diameter is more debated: the Kane [27] and EVO [28] formulas use only the former, and the Barrett Universal II (BU II) formula only the latter [29].

Posterior Corneal Radius

The posterior corneal radius was not measured by biometers until recently. Its optical importance is obvious, and the possibility of estimating the corneal power using real measurements rather than the keratometric index is promising. IOL formula lens position predictive algorithms have to be retrained with new back-calculated postoperative IOL position values if the corneal power is calculated rather than estimated, because both methods can yield different results (for a given patient, and on average for a given population) depending on the keratometric index value used in the considered formula: the measured total corneal power cannot be used "out-of-the-box" to predict the lens position in a formula formerly based on the keratometric index.

The potential usefulness of the PRC to predict the postoperative lens position has not yet been evaluated, to the best of our knowledge. It could be hypothesized that, because of the strong ARC/PRC correlation, this value could enhance the postoperative lens position prediction in post-corneal refractive surgery eyes, where the ARC has been modified but the PRC is still representative of the native corneal shape.

Parameters Related to the Patient

Patient age, preoperative refraction, gender and ethnicity have been used in IOL formulas, and can be considered as inputs in an ML model.

New Biometric Parameters

The advent of OCT allows the definition of new biometric parameters, such as equatorial lens position [30], lens meridian parameter, thickness of the anterior and posterior parts of the crystalline lens, and anterior segment length [31]. Those parameters are not directly usable in the optical part of a formula but could be useful to more accurately predict the IOL position.

Fig. 24.8 Median TILP value for the six classical biometric parameters along their value range. Eyes were sorted according to the value of each parameter. Every 100 eyes, the mean biometric parameter value was calculated for the next 500 eyes and the median TILP value for this group was determined. A threshold is clearly visible around 27 mm for the AL. AQD stands for aqueous chamber depth (distance from the posterior corneal surface to the anterior crystalline lens surface)

Fig. 24.9 Feature importance study for two algorithms trained to predict the TILP in a generic thick lens formula. On the left, a multiple regression model was trained using normalized data, thus allowing the direct comparison of the multiple regression coefficients. On the right, feature importance of the gradient boosted tree algorithm is studied using SHapley Additive exPlanations (SHAP) values [26]. Despite the difference between those algorithms and the way their importance is studied, there is a good concordance between them in relation to feature importance ranking, magnitude and sign
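A SHAP-based importance study of the kind shown in Fig. 24.9 can be sketched as follows; the synthetic data, feature names and the use of XGBoost are assumptions for illustration, and the `shap` package is assumed to be installed.

```python
# Sketch of a SHAP feature importance study on a gradient boosted tree model.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
feature_names = ["ARC", "AQD", "LT", "CCT", "VCD", "WTW"]
X = rng.normal(size=(500, 6))
y = rng.normal(size=500)                 # placeholder TILP values

model = xgb.XGBRegressor().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one value per eye and per feature
shap.summary_plot(shap_values, X, feature_names=feature_names)
```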

Predictive Models Building

Algorithm Choice

No algorithm or algorithm family can be considered universally superior to the others. The performance of a given algorithm varies depending on the type of problem to solve, the dataset size and the number of predictive features, among other parameters: this phenomenon is known as the "no free lunch theorem" [32]. Empirically, we obtained good outcomes with gradient boosted trees, neural nets, multiple regression, and support vector regression.

Hyperparameters Optimization

Hyperparameters allow one to tune an algorithm, define its architecture, and thus control the learning process. They can be seen as the "control knobs" of an algorithm. Basic linear or multiple regression is the simplest regression algorithm that can be designed and, as such, is the only one that does not have any hyperparameters. In the case of gradient boosted trees, hyperparameters include the maximum depth of each tree, the fraction of observations that are sampled into each tree, and criteria that control when new leaves and/or new trees are created. In neural nets, hyperparameters control the number of layers, the number of neurons in each layer, the maximum number of iterations, and so on.

Hyperparameters cannot be chosen a priori. Their best combination for a given algorithm must be searched for, usually using cross-validation. Cross-validation is a process by which the training set is divided into n groups (usually around 5), and each group is used in turn as a temporary test set while the others are used for training to predict the target with the selected set of hyperparameters. The average of the prediction performance over the subgroups is then computed. This process is repeated for every set of hyperparameters, within the limits defined by the researcher. Hyperparameters should never be optimized using the test set: doing so would immediately lead to overfitting and compromise the final outcomes in real life. Similarly, cross-validation is not a valid method to assess the final performance of a given model.
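Such a search is usually automated. The minimal sketch below grid-searches two gradient boosted tree hyperparameters with 5-fold cross-validation; it assumes scikit-learn and pre-existing X_train and y_train arrays holding the biometric inputs and the back-calculated lens positions, and the candidate values are purely illustrative:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Candidate "control knob" settings; the limits are defined by the researcher
param_grid = {
    "max_depth": [2, 3, 4],        # maximum depth of each tree
    "subsample": [0.5, 0.8, 1.0],  # fraction of observations sampled per tree
}

# cv=5: each of the five groups serves once as the temporary test group,
# and the scores are averaged for every hyperparameter combination
search = GridSearchCV(
    GradientBoostingRegressor(),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)  # the held-out test set is never touched here
print(search.best_params_)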
What is overfitting? This phenomenon can be compared to what would happen to a student who obtained an exam in advance and learned it by rote instead of understanding the lesson: his results would probably be very good for this one exam, but disappointing for any other evaluation. Machine learning algorithms, if overly complex and over-trained on a specific dataset, can easily be led to "learn by heart" the dataset. They will obtain very good outcomes on this specific dataset but will perform poorly in real life. Overfitting can be prevented by avoiding adding unnecessary complexity to the models and by evaluating the performance of an algorithm (and formula) on new data, from other centers, ideally in a blind manner.

Model Training and Evaluation

Once the hyperparameters are chosen, the algorithm can be trained using the whole training set, and its performance finally assessed on the test set. While it can be tempting to restart the whole process if the performance is judged unsatisfactory, this would lead to overfitting and should be avoided. The test set should ideally be used only once. If the dataset is big enough, it can be useful to create a test set (used only once) and a validation set (that will be used more often, to test different iterations of a given formula, for example).
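This splitting discipline can be made explicit in code. A minimal sketch assuming scikit-learn and a pandas table of eyes held in a hypothetical variable named eyes:

from sklearn.model_selection import train_test_split

# First carve off the test set: it will be opened only once, at the very end
development_set, test_set = train_test_split(eyes, test_size=0.15, random_state=7)

# Then split the remainder into training and validation subsets; the
# validation set may be reused to compare successive formula iterations
training_set, validation_set = train_test_split(
    development_set, test_size=0.15, random_state=7
)

Restarting the whole process after looking at the test set results would silently turn the test set into a second validation set, which is precisely the overfitting scenario described in the box above.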
The constitution of a reference dataset, held by a recognized scientific authority, could help standardize the evaluation of IOL formulas. This dataset would include information regarding preoperative biometric parameters and biometric device type, the IOL model and power implanted, and the mean postoperative refraction for the entire dataset, allowing formula constant adjustment. Information about patient age, gender and ethnicity could also be included. Individual postoperative refractions should be kept secret, and formula evaluation would be performed independently by the reference authority holding the dataset. This evaluation should not be too frequent (e.g. twice a year), to avoid deliberate overfitting. The eyes comprising this dataset should never be used by any formula inventor, for the same reason.

Description of the Current PEARL-DGS Formula

General Principles

The PEARL-DGS formula [22, 23, 33] is a thick lens formula based on the prediction of the theoretical internal lens position (TILP). This parameter is the theoretical distance between the posterior corneal surface and the anterior IOL surface, back-calculated from postoperative data. It is a theoretical anatomical distance, independent of both the lens principal plane positions and the corneal thickness. The sum-of-segments AL [6] replaces the traditional AL value, and is approximated by the Cooke-modified AL (CMAL) [9]. The TILP corresponds to the value leading to the real postoperative SE when entered in thick lens equations along with the other optical parameters of the eye and IOL. It is predicted using various machine learning algorithms comprising regular multiple regression, support vector regression, gradient boosted trees and neural networks. The refractive index values of the Atchison model eye [34] are used, except for the corneal index, which was determined empirically during the formula development process. Ideally, the real geometric parameters of the IOL are used during the development process; otherwise, the formula can be developed using theoretical IOL parameters (for example, a biconvex symmetric geometry), and a study of the mean TILP prediction error along the IOL power range is proposed.
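The TILP back-calculation itself can be illustrated with a paraxial vergence trace. The sketch below searches for the corneo-IOL distance that, entered in thick lens equations together with the measured postoperative SE, places the image exactly on the retina. It is a simplified model with illustrative refractive indices and IOL geometry, assuming SciPy; it is not the published PEARL-DGS implementation, which additionally relies on the CMAL and an empirically determined corneal index:

from scipy.optimize import brentq

N_CORNEA, N_AQUEOUS, N_VITREOUS = 1.376, 1.336, 1.336  # illustrative indices

def translate(v, d, n):
    """Propagate a reduced vergence v (D) over a distance d (m) in a medium of index n."""
    return v / (1.0 - (d / n) * v)

def retina_mismatch(tilp, eye, iol, measured_se):
    """Image-position error (m) produced by a candidate TILP, given the measured SE."""
    v = measured_se / (1.0 - 0.012 * measured_se)       # SE referred to the corneal plane
    v += (N_CORNEA - 1.0) / eye["r_cornea_ant"]         # anterior corneal surface
    v = translate(v, eye["cct"], N_CORNEA)
    v += (N_AQUEOUS - N_CORNEA) / eye["r_cornea_post"]  # posterior corneal surface
    v = translate(v, tilp, N_AQUEOUS)                   # the unknown corneo-IOL distance
    v += (iol["n"] - N_AQUEOUS) / iol["r_ant"]          # anterior IOL surface
    v = translate(v, iol["t"], iol["n"])
    v += (N_VITREOUS - iol["n"]) / iol["r_post"]        # posterior IOL surface
    image_position = N_VITREOUS / v                     # image distance behind the IOL
    retina_position = eye["al"] - eye["cct"] - tilp - iol["t"]
    return image_position - retina_position

eye = {"al": 23.65e-3, "cct": 0.55e-3, "r_cornea_ant": 7.8e-3, "r_cornea_post": 6.5e-3}
iol = {"n": 1.52, "t": 0.8e-3, "r_ant": 16e-3, "r_post": -16e-3}

tilp = brentq(retina_mismatch, 2e-3, 7e-3, args=(eye, iol, -0.25))
print(f"Back-calculated TILP: {tilp * 1e3:.2f} mm")

For this sample eye the root lies near 4.5 mm; in the actual formula, such back-calculated values become the targets that the machine learning layer is trained to predict from preoperative biometry.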

[Fig. 24.10 is a flow diagram. Inputs: ARC, CCT, ACD, WTW, LT, AL. Derived quantities: PRC (f1), AQD, CMAL (f2), corrected CMAL (f3), TILP (f4). The thick lens equations combine the eye refractive indices, TILP, IOL power and IOL geometry (R iol ant., R iol post., IOL thickness, n IOL) to yield the predicted postoperative SE and the emmetropizing corneal power.]

Fig. 24.10 General outline of the PEARL-DGS formula prediction process. The PRC is deduced from the ARC (f1). AL and LT are used to calculate the CMAL (f2). The CMAL is corrected before being used as an input to predict the TILP (f3). The raw CMAL value is used in the optical part of the formula. The ARC and CCT are used in the optical part of the formula, and also used as inputs to predict the TILP. WTW, AQD and LT are only used to predict the TILP. The TILP is then predicted using six biometric parameters as inputs in various ML models combined in ensemble methods (f4). Reprinted with permission from Debellemanière et al.: The PEARL-DGS formula: development of an open-source machine learning-based thick IOL calculation formula, 2021, American Journal of Ophthalmology, in press
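As an illustration of the f4 step, the sketch below combines one estimator per model family named in the caption by averaging their predictions with scikit-learn's VotingRegressor. The estimator settings are placeholders rather than the tuned PEARL-DGS hyperparameters, X_train, y_train and X_test are assumed to exist, and the inputs are assumed to be already normalized:

from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# One estimator per family: multiple regression, support vector regression,
# gradient boosted trees and a small neural network
ensemble = VotingRegressor(estimators=[
    ("regression", LinearRegression()),
    ("svr", SVR(kernel="rbf")),
    ("gbt", GradientBoostingRegressor()),
    ("net", MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000)),
])
ensemble.fit(X_train, y_train)         # X_train: biometric inputs, y_train: TILP
tilp_predictions = ensemble.predict(X_test)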

In an article describing the PEARL-DGS formula and evaluating its performances on two test sets of 677 and 262 eyes, the PEARL-DGS formula yielded the lowest SD on the first set (±0.382 D), followed by K6 and Olsen (±0.394 D), EVO 2.0 (±0.398 D), and RBF 3.0 and BUII (±0.402 D), as well as the lowest SD on the second set (±0.269 D), followed by Olsen (±0.272 D), K6 (±0.276 D), EVO 2.0 (±0.277 D) and BUII (±0.301 D) [23].

Several independent peer-reviewed studies have evaluated and compared the PEARL-DGS formula along with other fourth-generation IOL calculation formulas. In three [35, 36, 37] out of seven studies, PEARL-DGS ranked first, with a median absolute error (MedAE) varying between 0.190 and 0.310 D and a percentage of eyes with a postoperative refractive error of <0.5 diopter varying between 74% and 87.1%. In their patient cohort with short axial eye lengths, Wendelstein et al. [35] showed that the PEARL-DGS, Okulix, Kane and Castrop formulae had the lowest MAE (0.260, 0.300, 0.300 and 0.270 D, respectively). Evaluating the refractive results of 171 eyes, Rocha-de-Lossada et al. [38] found that Barrett and PEARL-DGS performed best for medium eyes (MAE = 0.237 D and 0.263 D, respectively; % eyes <0.5 D = 89.34% and 86.89%, respectively).

References

1. Barr A, Feigenbaum EA. Chapter I – Introduction. In: Barr A, Feigenbaum EA, editors. The handbook of artificial intelligence. Butterworth-Heinemann; 1981. p. 1–17.

2. Hill W. Hill-RBF Formula 3.0 [Internet]. Hill-RBF Calculator Version 3.0. https://rbfcalculator.com/. Accessed 3 Feb 2021.
3. Sanders D, Retzlaff J, Kraff M, Kratz R, Gills J, Levine R, et al. Comparison of the accuracy of the Binkhorst, Colenbrander, and SRK implant power prediction formulas. J Am Intraocul Implant Soc. 1981;7(4):337–40.
4. Sanders DR, Retzlaff J, Kraff MC. Comparison of empirically derived and theoretical aphakic refraction formulas. Arch Ophthalmol. 1983;101(6):965–7.
5. Ladas JG, Siddiqui AA, Devgan U, Jun AS. A 3-D "Super Surface" combining modern intraocular lens formulas to generate a "Super Formula" and maximize accuracy. JAMA Ophthalmol. 2015;133(12):1431–6.
6. Wang L, Cao D, Weikert MP, Koch DD. Calculation of axial length using a single group refractive index versus using different refractive indices for each ocular segment: theoretical study and refractive outcomes. Ophthalmology. 2019;126(5):663–70.
7. Cooke DL, Cooke TL, Suheimat M, Atchison DA. Standardizing sum-of-segments axial length using refractive index models. Biomed Opt Express. 2020;11(10):5860–70.
8. Haigis W. Intraocular lens calculation in extreme myopia. J Cataract Refract Surg. 2009;35(5):906–11.
9. Cooke DL, Cooke TL. Approximating sum-of-segments axial length from a traditional optical low-coherence reflectometry measurement. J Cataract Refract Surg. 2019;45(3):351–4.
10. Bahrami M, Hoshino M, Pierscionek B, Yagi N, Regini J, Uesugi K. Refractive index degeneration in older lenses: a potential functional correlate to structural changes that underlie cataract formation. Exp Eye Res. 2015;140:19–27.
11. Kasthurirangan S, Markwell EL, Atchison DA, Pope JM. In vivo study of changes in refractive index distribution in the human crystalline lens with age and accommodation. Invest Ophthalmol Vis Sci. 2008;49(6):2531–40.
12. Dubbelman M, Van der Heijde GL. The shape of the aging human lens: curvature, equivalent refractive index and the lens paradox. Vision Res. 2001;41(14):1867–77.
13. Patel S, Tutchenko L. The refractive index of the human cornea: a review. Cont Lens Anterior Eye. 2019;42(5):575–80.
14. Haigis W. Intraocular lens power calculations. In: Shammas HJ, editor. SLACK Incorporated; 2004.
15. Haigis W, Lege B, Miller N, Schneider B. Comparison of immersion ultrasound biometry and partial coherence interferometry for intraocular lens calculation according to Haigis. Graefes Arch Clin Exp Ophthalmol. 2000;238(9):765–73.
16. Retzlaff JA, Sanders DR, Kraff MC. Development of the SRK/T intraocular lens implant power calculation formula. J Cataract Refract Surg. 1990;16(3):333–40.
17. Retzlaff JA, Sanders DR, Kraff MC. Development of the SRK/T intraocular lens implant power calculation formula: Erratum. J Cataract Refract Surg. 1990;16(4):528.
18. Hoffer KJ. The Hoffer Q formula: a comparison of theoretic and regression formulas. J Cataract Refract Surg. 1993;19(6):700–12.
19. Zuberbuhler B, Morrell AJ. Errata in printed Hoffer Q formula. J Cataract Refract Surg. 2007;33(1):2; author reply 2–3.
20. Hoffer KJ. Errors in self-programming the Hoffer Q formula. Eye. 2007;21(3):429; author reply 430.
21. Holladay JT, Prager TC, Chandler TY, Musgrove KH, Lewis JW, Ruiz RS. A three-part system for refining intraocular lens power calculations. J Cataract Refract Surg. 1988;14(1):17–24.
22. Gatinel D, Debellemanière G, Saad A, Dubois M, Rampat R. Determining the theoretical effective lens position of thick intraocular lenses for machine learning-based IOL power calculation and simulation. Transl Vis Sci Technol. 2021.
23. Debellemanière G, Dubois M, Gauvin M, Wallerstein A, Brenner LF, Rampat R, et al. The PEARL-DGS formula: development of an open-source machine learning-based thick IOL calculation formula. (under review).
24. Hoffer KJ, Savini G. Update on intraocular lens power calculation study protocols: the better way to design and report clinical trials. Ophthalmology [Internet]. 2020. https://doi.org/10.1016/j.ophtha.2020.07.005.
25. Wang L, Shirayama M, Ma XJ, Kohnen T, Koch DD. Optimizing intraocular lens power calculations in eyes with axial lengths above 25.0 mm. J Cataract Refract Surg. 2011;37(11):2018–27.
26. Lundberg S, Lee S-I. A unified approach to interpreting model predictions [Internet]. arXiv [cs.AI]. 2017. Available from: http://arxiv.org/abs/1705.07874.
27. Kane JX. Kane formula calculator [Internet]. Kane Formula. https://www.iolformula.com/. Accessed 15 Mar 2021.
28. Yeo TK. EVO formula [Internet]. The Emmetropia Verifying Optical (EVO) formula. https://www.evoiolcalculator.com/. Accessed 1 Feb 2021.
29. Barrett G. Barrett Universal II Calculator [Internet]. Barrett Universal II. https://calc.apacrs.org/barrett_universal2105/. Accessed 15 Mar 2021.
30. Martinez-Enriquez E, Pérez-Merino P, Durán-Poveda S, Jiménez-Alfaro I, Marcos S. Estimation of intraocular lens position from full crystalline lens geometry: towards a new generation of intraocular lens power calculation formulas. Sci Rep. 2018;8(1):9829.
31. Yoo Y-S, Whang W-J, Kim H-S, Joo C-K, Yoon G. New IOL formula using anterior segment three-dimensional optical coherence tomography. PLoS One. 2020;15(7):e0236137.
32. Wolpert DH. The lack of a priori distinctions between learning algorithms. Neural Comput. 1996;8(7):1341–90.
33. Debellemanière G, Saad A, Gatinel D. PEARL DGS calculator [Internet]. IOL Solver. www.iolsolver.com. Accessed 14 Mar 2021.
34. Atchison DA. Optical models for human myopic eyes. Vision Res. 2006;46(14):2236–50.
35. Wendelstein J, Hoffmann P, Hirnschall N, Fischinger IR, Mariacher S, Wingert T, et al. Project hyperopic power prediction: accuracy of 13 different concepts for intraocular lens calculation in short eyes. Br J Ophthalmol [Internet]. https://doi.org/10.1136/bjophthalmol-2020-318272. Accessed 27 Jan 2021.
36. Diogo H-F, Maria EL, Rita S-P, Pedro G, Vitor M, João F, Nuno A. Anterior chamber depth, lens thickness and intraocular lens calculation formula accuracy: nine formulas comparison. Br J Ophthalmol. bjophthalmol-2020-317822.
37. Leonardo T, Kenneth JH, Piero B, Domenico S-L, Giacomo S. Outcomes of IOL power calculation using measurements by a rotating Scheimpflug camera combined with partial coherence interferometry. J Cataract Refract Surg. 2020;46:1618–23.
38. Rocha-de-Lossada C, Colmenero-Reina E, Flikier D, Castro-Alonso F-J, Rodriguez-Raton A, García-Madrona J-L, et al. Intraocular lens power calculation formula accuracy: comparison of 12 formulas for a trifocal hydrophilic intraocular lens. Eur J Ophthalmol. 2020;1120672120980690.