Artificial Intelligence in Ophthalmology

Andrzej Grzybowski, Editor
Department of Ophthalmology, University of Warmia and Mazury, Olsztyn, Poland

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Andrzej Grzybowski
"If you do not get feedback, your confidence grows much faster than your accuracy."
(Tetlock P., Gardner D. Superforecasting: The Art and Science of Prediction, Crown Publishing, 2016)
The Promise of Artificial Intelligence

The term "artificial intelligence" (AI) was coined on August 31, 1955, when John McCarthy, Marvin L. Minsky, Nathaniel Rochester, and Claude E. Shannon submitted "A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence" [1, 2]. It was, however, Alan Turing who mentioned computer intelligence during a public lecture in London in 1947, and in 1948 he introduced many of the central concepts of AI in a report entitled "Intelligent Machinery" [3]. Moreover, in 1950 Turing proposed the test, originally called the imitation game and later known as the Turing test, as a way to confirm that the intelligent behavior of a machine was equivalent to that of a human. A human evaluator is asked to determine the nature of a partner (human or machine) based on a text-only conversation [1–3].

After decades of slow progress since the Turing test was proposed, AI has finally blossomed. Many new technologies and applications are available, and there is great enthusiasm about the promise of AI in health care. It holds the potential to improve patient and practitioner outcomes, reduce costs by preventing errors and unnecessary procedures, and provide population-wide health improvements. We have entered the fourth stage of the Industrial Revolution that began in the eighteenth century, and its defining feature may well be the use of AI technologies (Fig. 1.1).

The results of an annual competition known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) provide interesting insights into recent developments in AI technology (Fig. 1.2). Over the years 2010–2016 there was a steady decrease in the error rates of the algorithms presented, and in 2017, 29 of the 38 competing teams had error rates lower than 5% (considered to be the human threshold). Thus in 10 years AI algorithms exceeded human performance in image recognition.

There are many promising applications for AI in health care, addressing a variety of aims and taking many different approaches (Table 1.1). For example, misdiagnoses constitute a huge,

A. Grzybowski (*)
Department of Ophthalmology, University of Warmia and Mazury, Olsztyn, Poland
Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, Poznan, Poland

© Springer Nature Switzerland AG 2021
A. Grzybowski (ed.), Artificial Intelligence in Ophthalmology, https://doi.org/10.1007/978-3-030-78601-4_1
Fig. 1.1 The four main stages of the Industrial Revolution that began in the eighteenth century
[Fig. 1.2: error rates (0.0–0.5) of ILSVRC algorithms by year, 2011–2016]
1 Artificial Intelligence in Ophthalmology: Promises, Hazards and Challenges 3
Table 1.1 Some ambitious expectations for AI in health care. Adapted from Topol E. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. Basic Books, New York, 2019
• outperform doctors
• help to diagnose what is presently undiagnosable
• help to treat what is presently untreatable
• recognize on images what is presently unrecognizable
• predict the unpredictable
• classify the unclassifiable
• decrease workflow inefficiencies
• decrease hospital admissions and readmissions
• increase medication adherence
• decrease patient harm
• decrease or eliminate misdiagnosis

although poorly recognized, medical problem. A study published in 2014 estimated that diagnostic errors affect at least 5% of US adults (12 million people) per year [4]. More recently a systematic review and meta-analysis reported that the rate of diagnostic errors causing adverse events among hospitalized patients was 0.7% [5]. Furthermore, diagnostic error is the most important reason for malpractice litigation in the United States, accounting for 31% of malpractice lawsuits in 2017 [2]. The creation of AI programs to identify and analyze diagnostic errors could be an important step in addressing this problem [6].

Eric Topol has proposed that AI could help shift medicine into "deep medicine" by allowing physicians to devote more time to crucial relationships with their patients, an aspect of medicine that cannot be replaced by any AI technology [2]. It is also interesting to consider whether AI might enrich the doctor-patient relationship, enabling a shift from the present "shallow medicine" into "deep medicine," based on deep empathy and connection [2]. Success in building such relationships is very much related to the amount of time doctors can spare for patients and the extent of the personal contact they have with their patients. The average clinic visit in the United States lasts 7 min for an established patient and 12 min for a new patient. In many Asian countries, clinic visits last as little as 2 min per patient [2]. Making this situation even worse, part of this time must be devoted to completing electronic health records, further limiting personal contact. A study published in 2017 that asked patients to describe how they perceive their physician found that the most common negative responses were "rushed," "busy," and "hurried" [7]. These reactions are manifestations of "shallow medicine."

One of the arguments supporting the use of AI in medicine is that the human cognitive capacity to effectively manage information is often exceeded by the quantity of data generated. Each year the world produces zettabytes of data (roughly, enough to fill a trillion smartphones) [2]. Moreover, unlike humans, who have bad days and emotions, and who get tired, with subsequent decreases in performance and accuracy, AI works 24/7 without vacations or complaints [2].

AI-based technologies employing deep-learning (DL) approaches have proven effective in supporting decisions in many medical specialties, including radiology, cardiology, oncology, dermatology, and ophthalmology, among others. For example, AI/DL algorithms (also referred to as AI/DL models in the following text) have been shown to reduce waiting times, improve medication adherence, customize insulin dosages, and help interpret magnetic resonance images. The number of AI life-science papers listed in PubMed increased from 596 in 2010 to 12,422 in 2019 [8]. The number of papers on the use of AI in the field of ophthalmology has also increased dramatically (Figs. 1.3 and 1.4).

AI/DL algorithms have been used to detect diseases based on image analysis, with fundus photos and optical coherence tomography (OCT) scans analyzed for retinal diseases, chest radiographs assessed for lung diseases, and skin photos analyzed for skin disorders. Retinal photos have also been used to identify risk factors related to cardiovascular disorders, including blood pressure, smoking, and body mass index [9]. Using DL models trained on data from over 280,000 patients and validated on two independent data sets, Poplin et al. predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg), and major adverse cardiac events (AUC = 0.70) (Fig. 1.5) [9].
The COVID-19 pandemic has raised expectations for the use of AI in data analysis. So far it has been used in epidemic modeling, detection of misinformation, diagnostics, vaccine and drug development, triage and patient outcomes, and identification of regions of greatest need [10].

Regulating AI-Based Medical Devices: Demonstrating Benefit and Safety

One of many challenges in the field of AI is determining what constitutes evidence of impact and benefit for AI medical devices and who should assess the evidence [2]. The majority of AI studies are conducted in experimental conditions and based on preselected data. They might provide inadequate insight into the use of AI applications in heterogeneous, real-world care settings.

Lee et al. tested seven algorithms being used clinically around the world, including one with US Food and Drug Administration (FDA) approval and four whose developers have submitted applications for FDA approval. They found that most of these algorithms performed worse in real-world, compared with experimental, situations, with only three of seven and one of seven having comparable sensitivity and specificity to the human graders, respectively. Only one algorithm performed as well as human graders [11]. Another of the algorithms tested performed
Fig. 1.5 Attention maps for a single retinal fundus image (panels include systolic and diastolic blood pressure, SBP and DBP). The top left image is a sample retinal image in color from the UK Biobank data set. The remaining images show the same retinal image in black-and-white. The soft attention heat map for each prediction is overlaid in green, indicating the areas of the heat map that the neural-network model is using to make the prediction for the image. Source: Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster DR. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018 Mar;2(3):158–164
significantly worse than human graders at all levels of DR severity; it missed 25.58% of cases of advanced retinopathy, which could have serious consequences. One of the potential hazards of the clinical use of algorithms identified in this study was the risk of applying an algorithm trained on a particular demographic group to a population that differs in factors such as ethnicity, age, and sex. Moreover, many studies of algorithms developed with AI have excluded low-quality images, treating them as ungradable, and excluded patients with comorbid eye diseases, making the studies less reflective of real-world conditions.

The study by Lee et al. shows the importance and limitations of the registration process for AI-based medical devices. FDA registration is based on a centralized system, which does not have a specific, easily accessible regulatory pathway for AI-based medical devices. The FDA clears medical devices through three pathways: the premarket approval pathway, the de novo premarket review, and the 510(k) pathway [12, 13]. The leading AI disciplines in medicine are radiology, cardiology, internal medicine/endocrinology, neurology, ophthalmology, emergency medicine, and oncology. FDA approvals of AI-based medical devices have increased steadily in recent years; there were 9 in 2015, 13 in 2016, 32 in 2017, 67 in 2018, and 77 in 2019, with the majority of devices designed for use in radiology, cardiology, and neurology [12]. Interestingly, 85% of FDA-approved medical devices in the years 2015–2019 were intended for use by health-care professionals, and only 15% for use by patients. The best-known FDA-approved AI-based medical devices in the field of ophthalmology are IDx-DR (2018), the first software to provide screening decisions that do not have to be interpreted by a clinician, and Eyenuk (2020), which, like IDx-DR, screens for diabetic retinopathy.

In the European Economic Area, which includes the European Union (EU) countries and the European Free Trade Association (EFTA) members (Iceland, Liechtenstein, Norway, and Switzerland), medical devices are approved in a decentralized manner. Conformité Européenne (CE) marking indicates conformity with EU health, safety, and environmental-protection standards. For the lowest-risk medical devices (CE class I), the manufacturer ensures that the product complies with regulations, and an approval procedure is not required. The registration procedure for higher-risk medical devices (CE class IIa, IIb, and III) is handled by private entities, called notified bodies, that have been accredited to assess the devices and issue a CE mark.

Thirteen CE-marked AI-based medical devices were approved in 2015, 27 in 2016, 26 in 2017, 55 in 2018, and 100 in 2019. The majority were designed for use in radiology, general hospital care, cardiology, neurology, ophthalmology (12 devices), and pathology, and most were class IIa (40%), class I (35%), or class IIb (12%) devices [12]. Of the AI-based devices that were CE-marked between 2015 and 2019, 124 (52%) were also FDA approved, making up 56% of the AI-based tools that the FDA approved in those years. Bigger companies were more likely to get both approvals, whereas smaller companies were more likely to obtain only a CE mark. The authors of this study suggested that the European approval system was less rigorous than the US one. This conclusion is supported by an FDA report on 12 devices that received CE approval only and later were found to be unsafe or ineffective [13, 14]. A major problem in studying CE-marked devices in the European Economic Area is the lack of a publicly available register of approved devices comparable to the FDA register. Moreover, the information submitted to the notified bodies is confidential. In 2022, a new European database on medical devices (Eudamed), providing a live picture of the lifecycle of medical devices, will become operational. It will be composed of six modules: actor registration, unique device identification (UDI), device registration, notified bodies and certificates, clinical investigations and performance studies, and vigilance and market surveillance [15].

Access to Reliable Data

DL algorithm training requires large data sets with thousands or even hundreds of thousands of diverse, well-balanced, and accurately labeled images [16]. The resources required for an AI
[Fig. 1.6: resources required for an AI study, including test data, validation data, a gold standard/benchmark, and performance evidence]
study are presented in Fig. 1.6. The enormous numbers of required images rarely can be obtained from individual centers; thus they are secured from data repositories or centers that agree to share data. There is a growing need for consensus on standardized definitions of medical entities; conventions for data formatting; identification of units of measure; protocols for data cleaning, harmonization, and validation; standards for sharing and reusing data and for sharing the code implementing AI models; and the adoption of open application program interfaces to AI models [17]. This is required for data sharing and open communication in AI, which is critical for conducting the reproducible research that is necessary before AI technology can be adopted in health care.

Kermany et al. used a DL analysis of a data set of optical coherence tomography images for triage and diagnosis of choroidal neovascularization, diabetic macular edema, and drusen. They demonstrated performance comparable to that of human experts and provided a more transparent and interpretable diagnosis by highlighting the regions recognized by the neural network. Further, they showed that a transfer-learning approach produced only modestly worse results (a twofold increase of error, compared with the full data set) while using approximately 20 times fewer images. They also demonstrated the wider utility of this approach by applying it to the identification of pediatric pneumonia using chest X-ray images. They provided their data and code in a publicly available database to facilitate their use by other biomedical researchers in order to improve the performance of future models [18].

Transfer learning (Figs. 1.7 and 1.8) has been used in recent years to build classification models for medical images because the number of images that can be used for training is relatively small compared to the number of images available to train general models [19] (Fig. 1.9).

[Figs. 1.7 and 1.8: transfer-learning schematic. Instead of training a new model from scratch for a new (target) task, knowledge from a data set with labels assigned for other (source) tasks is transferred; the learning process can be faster because knowledge from the other tasks is reused]

Another approach to meet the need for large, annotated training data sets might be the use of low-shot DL algorithms. Low-shot learning (LSL), also known as few-shot learning, is a type of machine-learning (ML) problem in which the training dataset contains limited information. It is well known that many real-life situations, including rare diseases (e.g., serpiginous choroidopathy or angioid streaks in pseudoxanthoma elasticum) and non-typical presentations or subtypes of common disorders, are prone to AI bias due to the paucity or imbalance of data. These deficiencies may also result in less accurate future models. When addressing this sort of bias, dividing data according to some patient features (e.g., age, sex, and race/ethnicity) may result in smaller data sets that may be insufficient for training models for these particular groups. The study by Burlina et al. showed that the performance of widely used DL methods degraded substantially when used with limited data sets, but LSL methods performed better and might be applied in retinal diagnostics when a limited number of retina images are available for training [20].

Another approach that has been suggested by several authors to address the problem of limited data sets is the use of generative adversarial networks (GANs) to synthesize new images from a training data set of real images. GANs are ML models that can generate new data with the same statistics as the training set (Fig. 1.10). For example, a GAN trained on photographs can generate photographs of non-existing persons that look as authentic as real humans (Fig. 1.11). Artificial
[Fig. 1.9 schematic: a model trained on 1000 general categories is adapted via transfer learning to classify retinopathy grades 0–3]
Fig. 1.9 The schematic diagram of transfer learning. Source: Lingling Li et al. Diabetic retinopathy identification system based on transfer learning. 2020. J. Phys.: Conf. Ser. 1544 012133. https://doi.org/10.1088/1742-6596/1544/1/012133. Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence
photos can be found at https://thispersondoesnotexist.com. Many applications of GANs have been proposed, including art, fashion, advertising, science, and video games; however, concerns about malicious uses have also been raised, e.g., to produce fake, possibly incriminating, photographs and videos.

Burlina et al. used the Age-Related Eye Disease Study data set of over 130,000 fundus images to generate a similar number of synthetic images to train DL models. The performance of DL models trained with the synthetic images was nearly as good as the performance of models trained on real images [21]. Liu et al. have shown that 92% of synthetic OCT images had sufficient quality for further clinical interpretation. Only about 26–30% of synthetic post-therapeutic images could be accurately identified as synthetic images (Fig. 1.8) [22]. The accuracy of models trained on synthetic images to predict wet or dry macular status was 0.85 (95% CI 0.74–0.95) [22]. In a study by Zheng et al., the image quality of real versus synthetic OCT images was similar as assessed by two retinal specialists. The accuracy of discrimination of real versus synthetic
[Fig. 1.10: GAN schematic: a deep convolutional network generates images; a discriminator receives both real and generated images and judges whether each is real, with fine-tune training of both networks]
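The adversarial setup sketched in Fig. 1.10 can be illustrated with a deliberately tiny, non-medical toy: a linear generator learning to mimic a one-dimensional Gaussian, opposed by a logistic discriminator. Every detail here (the distributions, learning rates, and the linear generator itself) is invented for illustration and is far simpler than the deep convolutional networks used for OCT or fundus images, but the alternating update structure is the same:

```python
# Toy GAN sketch: generator g(z) = a*z + b tries to mimic N(4, 1);
# discriminator D(x) = sigmoid(w*x + c) tries to tell real from fake.
import math
import random

random.seed(0)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

a, b = 1.0, 0.0      # generator parameters
w, c = 0.0, 0.0      # discriminator parameters
lr, batch = 0.05, 16

for step in range(2000):
    real = [random.gauss(4.0, 1.0) for _ in range(batch)]
    z = [random.gauss(0.0, 1.0) for _ in range(batch)]
    fake = [a * zi + b for zi in z]

    # Discriminator update: minimize -log D(real) - log(1 - D(fake)).
    gw = gc = 0.0
    for x in real:
        d = sigmoid(w * x + c)
        gw += -(1 - d) * x     # gradient of -log D(x)
        gc += -(1 - d)
    for x in fake:
        d = sigmoid(w * x + c)
        gw += d * x            # gradient of -log(1 - D(x))
        gc += d
    w -= lr * gw / (2 * batch)
    c -= lr * gc / (2 * batch)

    # Generator update (non-saturating loss): minimize -log D(g(z)).
    ga = gb = 0.0
    for zi, x in zip(z, fake):
        d = sigmoid(w * x + c)
        ga += -(1 - d) * w * zi
        gb += -(1 - d) * w
    a -= lr * ga / batch
    b -= lr * gb / batch

print(round(b, 2))  # the generator's mean drifts toward the real mean of 4
```

In practice both players are deep networks and the gradients come from automatic differentiation; the hand-written derivatives above merely stand in for that machinery.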
will set a balance between individual protection and the common good. One approach to protecting privacy and increasing sample size is to share DL algorithms with local institutions for retraining purposes, but without sharing the private data used to build the algorithms. This model-to-data approach, also known as federated learning, was tested in ophthalmology and was shown to work effectively [26].

According to the US National Institute of Standards and Technology, biometric data, including retina images, are personally identifiable information and should be protected from inappropriate access. Although AI models have been shown to diagnose and stage some ocular diseases from fundus photographs, OCT, and visual-field images, most AI algorithms were tested on data sets that did not correspond well to real-world conditions. Patient populations were usually homogenous, and poor-quality images and patients with multiple pathologies were excluded. Future studies are needed to validate algorithms on ocular images from heterogeneous populations, including both good- and poor-quality images. Otherwise, we may face the situation of "good AI gone bad." The tendency to cherry-pick the best results might make the situation even worse. AI algorithms can behave unpredictably when applied in real life. Algorithm performance can degrade after deployment due to changes between the training and testing conditions (dataset shift), caused, for example, by using images generated by a different device than the one used for the training set or collected in a different clinical environment [27–30]. Moreover, algorithms may return different outputs at different times when presented with similar inputs [31, 32]; they can be affected by minor changes in image quality or extraneous data on an image [32–35]. All these problems might lead to misdiagnosis and erroneous treatment suggestions, breaching trust in AI technologies. An error in an AI system could harm hundreds or even thousands of patients.

A recent report from the National Academy of Medicine [36] highlights some important challenges in the further development of AI applications in health care (Table 1.2). Its authors advocate the use of openly accessible, standardized, population-representative data; addressing explicit and implicit biases related to AI; developing and deploying appropriate training and educational programs for health workers to support health-care AI; and balancing innovation and safety through the use of regulation and legislation to promote trust.

Table 1.2 Practical challenges to the advancement and application of AI tools in clinical settings
• Workflow integration: Understand the technical, cognitive, social, and political factors in play and incentives impacting integration of AI into health care workflows.
• Enhanced explainability and interpretability: To promote integration of AI into health care workflows, consider what needs to be explained and approaches for ensuring understanding by all members of the health care team.
• Workforce education: Promote educational programs to inform clinicians about AI/machine learning approaches and to develop an adequate workforce.
• Oversight and regulation: Consider the appropriate regulatory mechanism for AI/machine learning and approaches for evaluating algorithms and their impact.
• Problem identification and prioritization: Catalog the different areas of health care and public health where AI/machine learning could make a difference, focusing on intervention-driven AI.
• Clinician and patient engagement: Understand the appropriate approaches for involving consumers and clinicians in AI/machine learning prioritization, development, and integration, and the potential impact of AI/machine learning algorithms on the patient-provider relationship.
• Data quality and access: Promoting data quality, access, and sharing, as well as the use of both structured and unstructured data and the integration of non-clinical data, is critical to developing effective AI tools.
Source: Matheny ME, Thadaney Israni S, Ahmed M, Whicher D. AI in Health Care: The Hope, the Hype, the Promise, the Peril. Washington, DC: National Academy of Medicine; 2019. https://nam.edu/artificial-intelligence-special-publication

To understand the limitations of AI-based models in health care and the responsibilities of
manufacturers and users of AI software as a medical device (SaMD), an MI-CLAIM checklist was proposed for use in AI software development [37]. Its purpose is to enable a direct assessment of clinical impact, including considerations of fairness and bias, and to allow rapid replication of the technical design by any legitimate clinical AI study. The MI-CLAIM checklist has six parts (Table 1.3), including (1) Study design; (2) Separation of data into partitions for model train-

Table 1.3 The MI-CLAIM checklist [Source: Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, Arnaout R, Kohane IS, Saria S, Topol E, Obermeyer Z, Yu B, Butte AJ. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020 Sep;26(9):1320–1324]
Before paper submission
Study design (Part 1) | Completed: page number | Notes if not completed
The clinical problem in which the model will be employed is clearly detailed in the □
paper.
The research question is clearly stated. □
The characteristics of the cohorts (training and test sets) are detailed in the text. □
The cohorts (training and test sets) are shown to be representative of real-world clinical □
settings.
The state-of-the-art solution used as a baseline for comparison has been identified and □
detailed.
Data and optimization (Parts 2, 3) | Completed: page number | Notes if not completed
The origin of the data is described and the original format is detailed in the paper. □
Transformations of the data before it is applied to the proposed model are described. □
The independence between training and test sets has been proven in the paper. □
Details on the models that were evaluated and the code developed to select the best □
model are provided.
Is the input data type structured or unstructured? □
Model performance (Part 4) | Completed: page number | Notes if not completed
The primary metric selected to evaluate algorithm performance (e.g., AUC, F-score, □
etc.), including the justification for selection, has been clearly stated.
The primary metric selected to evaluate the clinical utility of the model (e.g., PPV, □
NNT, etc.), including the justification for selection, has been clearly stated.
The performance comparison between baseline and proposed model is presented with □
the appropriate statistical significance.
Model examination (Part 5) | Completed: page number | Notes if not completed
Examination technique 1a □
Examination technique 2a □
A discussion of the relevance of the examination results with respect to model/ □
algorithm performance is presented.
A discussion of the feasibility and significance of model interpretability at the case □
level if examination methods are uninterpretable is presented.
A discussion of the reliability and robustness of the model as the underlying data □
distribution shifts is included.
Reproducibility (Part 6): choose appropriate tier of transparency Notes
Tier 1: complete sharing of the code □
Tier 2: allow a third party to evaluate the code for accuracy/fairness; share the results of □
this evaluation
Tier 3: release of a virtual machine (binary) for running the code on new data without □
sharing its details
Tier 4: no sharing □
PPV positive predictive value, NNT numbers needed to treat
a Common examination approaches based on study type: for studies involving exclusively structured data, coefficients and sensitivity analysis are often appropriate; for studies involving unstructured data in the domains of image analysis or natural language processing, saliency maps (or equivalents) and sensitivity analyses are often appropriate
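For teams applying the checklist, its six parts can also be kept in code as a simple data structure and checked mechanically before submission. The helper below is hypothetical (it is not part of the MI-CLAIM specification), and the item wording is abbreviated from Table 1.3:

```python
# Hypothetical helper for tracking MI-CLAIM completion; part names
# follow the six MI-CLAIM parts, item wording abbreviated.
MI_CLAIM = {
    "Study design (Part 1)": [
        "clinical problem detailed",
        "research question stated",
        "cohort characteristics detailed",
        "cohorts representative of real-world settings",
        "state-of-the-art baseline identified",
    ],
    "Data and optimization (Parts 2, 3)": [
        "data origin and format described",
        "data transformations described",
        "train/test independence proven",
        "model selection code provided",
        "input data type stated",
    ],
    "Model performance (Part 4)": [
        "primary performance metric justified",
        "clinical utility metric justified",
        "baseline comparison with statistics",
    ],
    "Model examination (Part 5)": [
        "examination techniques reported",
        "relevance of results discussed",
        "interpretability discussed",
        "robustness under data shift discussed",
    ],
    "Reproducibility (Part 6)": ["transparency tier chosen (1-4)"],
}

def missing_items(completed):
    """Return checklist items not yet marked complete."""
    return [item for part in MI_CLAIM.values()
            for item in part if item not in completed]

done = {"research question stated", "clinical problem detailed"}
print(len(missing_items(done)))  # 16 items still open in this example
```

A pre-submission script built this way can fail loudly when any item is unaccounted for, which matches the checklist's goal of making omissions visible.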
Table 1.4 Major topics of CONSORT-AI extension
1. State the inclusion and exclusion criteria at the level of participants
2. State the inclusion and exclusion criteria at the level of the input data
3. Describe how the AI intervention was integrated into the trial setting, including any onsite or offsite requirements
4. State which version of the AI algorithm was used
5. Describe how the input data were acquired and selected for the AI intervention
6. Describe how poor-quality or unavailable input data were assessed and handled
7. Specify whether there was human–AI interaction in the handling of the input data, and what level of expertise was required for users
8. Specify the output of the AI intervention
9. Explain how the AI intervention's outputs contributed to decision-making or other elements of clinical practice
10. Describe results of any analysis of performance errors and how errors were identified, where available. If no such analysis was planned or done, explain why not.
Source: Adapted from Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit Health. 2020 Oct;2(10):e537–e548

It should also be remembered that AI algorithms can be designed to perform in unethical ways. For example, Uber's software Greyball allowed the company to identify and circumvent local regulations, and Volkswagen's algorithm allowed vehicles to pass emission tests by reducing their emissions of nitrogen oxide during testing. AI algorithms could be tuned to generate increased profits for their owners by recommending particular drugs, tests, or the like without clinical users' awareness. AI systems are also vulnerable to cybersecurity attacks that could cause their algorithms to misclassify medical information [31].

Seven essential factors for designing AI for social good were proposed by Floridi et al. (Table 1.5) [38]. The authors propose falsifiability as an essential factor in improving the trustworthiness of a technological application, i.e., for an SaMD to be trustworthy, its safety should be falsifiable. Critical requirements for a device to be fully functional must be specified and must be testable. If falsifiability is not possible, then the critical requirements cannot be checked, and the system should not be deemed trustworthy [38].
ing and model testing; (3) Optimization and final model selection; (4) Performance evaluation; (5) Model examination; and (6) Reproducible pipeline. The CONSORT-AI and SPIRIT-AI working groups have proposed reporting guidelines for clinical trials of interventions involving AI. A summary of these guidelines is presented in Table 1.4.

Inherent conflicts of interest should be acknowledged. Manufacturers who develop and market SaMD have a strong financial interest in presenting their products positively. Thus, conflicts of interest exist if they fund, conduct, and publish results of studies, including those that might report deficiencies in their products. Many of the published papers in the field of AI-based diabetic retinopathy screening, particularly those using CE-marked and FDA-approved algorithms, were conducted by manufacturers or patent owners.

Cost-Effectiveness of AI-Based Devices

One of the arguments for AI-based medical devices is that they can reduce medical costs and eliminate unnecessary procedures. A study from Singapore found that a semiautomated model that combined a DL system with human assessment achieved the best economic returns, leading to savings of 19.5% in screening for diabetic retinopathy. An earlier study from the UK reported cost-savings of 12.8–21.0%; however, a simple comparison between them is not possible due to the different models of DR screening in the two countries (two-stage screening in Singapore and three-stage screening in the UK) and their different DR classification systems. The authors of both studies argued that a semiautomated system produces more savings than a fully automated system due to the lower rate of false positives and unnecessary specialist visits [39, 40].
14 A. Grzybowski
1 Artificial Intelligence in Ophthalmology: Promises, Hazards and Challenges

This book aims to provide ophthalmologists and other visual professionals and researchers with an overview of current research into the use of AI in ophthalmology. Together with a team of international experts from Europe, North America, and Asia, we present an overview of the most important documentary research in ophthalmology on ML and AI technologies and their benefits. We discuss the use of AI in the diagnosis of some retinal and corneal disorders, the diagnosis of congenital cataract, neuro-ophthalmology, glaucoma, intraocular lens calculation methods, ocular oncology, medical ophthalmology triaging, cataract-surgery training, refractive surgery, and the assessment and prediction of systemic diseases through the use of the eye. Chapters on digital-image analysis, AI basics, and technical aspects of AI provide the reader with knowledge not commonly possessed by ophthalmologists, but required to understand the topic in both its field-specific and broader contexts. The very important chapter on AI safety and efficacy outlines the challenges ophthalmology will face with the introduction and widespread dissemination of this technology. Although we have covered all of the major areas of AI/ML technology in ophthalmology, research in this field is progressing so quickly that some new concepts that emerged at the end of 2020 and in early 2021 do not appear on these pages. However, evidence-based medicine often demands that we await more evidence to verify early reports and assess the real value of new technologies or applications. I would like to thank all the contributors for sharing their knowledge in this new and fascinating discipline, which has great potential to change ophthalmology.

Acknowledgements I would like to thank Aleksandra Lemanik, Foundation for Ophthalmology Development, Poznan, Poland, and Tomasz Krzywicki, Faculty of Mathematics and Computer Science, University of Warmia and Mazury, Olsztyn, Poland, for their help in preparing illustrations, and Szymon Wilk, Faculty of Computing and Telecommunications, Poznan University of Technology, Poznan, Poland, for his valuable discussion on this chapter.

References

1. Mitchell M. Artificial intelligence: a guide for thinking humans. Penguin UK; 2019.
2. Topol E. Deep medicine: how artificial intelligence can make healthcare human again. New York: Basic Books; 2019.
3. Copeland BJ. Artificial intelligence. Encyclopedia Britannica, 11 August 2020. https://www.britannica.com/technology/artificial-intelligence. Accessed 18 Mar 2021.
4. Singh H, Meyer AN, Thomas EJ. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations. BMJ Qual Saf. 2014;23(9):727–31.
5. Gunderson CG, Bilan VP, Holleck JL, et al. Prevalence of harmful diagnostic errors in hospitalised adults: a systematic review and meta-analysis. BMJ Qual Saf. 2020;29:1008–18.
6. Zwaan L, Singh H. Diagnostic error in hospitals: finding forests not just the big trees. BMJ Qual Saf. 2020;29(12):961–4.
7. Singletary B, Patel N, Heslin M. Patient perceptions about their physician in 2 words: the good, the bad, and the ugly. JAMA Surg. 2017;152(12):1169–70.
8. Benjamens S, Dhunnoo P, Meskó B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit Med. 2020;3:118.
9. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster DR. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2(3):158–64.
10. Chen J, See KC. Artificial intelligence for COVID-19: rapid review. J Med Internet Res. 2020;22(10):e21476.
11. Lee AY, Yanagihara RT, Lee CS, Blazes M, Jung HC, Chee YE, Gencarella MD, Gee H, Maa AY, Cockerham GC, Lynch M, Boyko EJ. Multicenter, head-to-head, real-world validation study of seven automated artificial intelligence diabetic retinopathy screening systems. Diabetes Care. 2021;dc201877. https://doi.org/10.2337/dc20-1877.
12. Muehlematter UJ, Daniore P, Vokinger KN. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015–20): a comparative analysis. Lancet Digit Health. 2021;3(3):e195–203.
13. Hwang TJ, Kesselheim AS, Vokinger KN. Lifecycle regulation of artificial intelligence- and machine learning-based software devices in medicine. JAMA. 2019;322(23):2285–6.
14. Hwang TJ, Sokolov E, Franklin JM, Kesselheim AS. Comparison of rates of safety issues and reporting of trial outcomes for medical devices approved in the European Union and United States: cohort study. BMJ. 2016;353:i3323.
15. European Commission. Medical devices—EUDAMED. 17 June 2020. https://ec.europa.eu/growth/sectors/medical-devices/new-regulations/eudamed_en. Accessed 15 Jan 2021.
16. Ting DSW, Liu Y, Burlina P, Xu X, Bressler NM, Wong TY. AI for medical imaging goes deep. Nat Med. 2018;24(5):539–40.
17. Wang SY, Pershing S, Lee AY, AAO Taskforce on AI and AAO Medical Information Technology Committee. Big data requirements for artificial intelligence. Curr Opin Ophthalmol. 2020;31(5):318–23.
18. Kermany DS, Goldbaum M, Cai W, Valentim CCS, Liang H, Baxter SL, McKeown A, Yang G, Wu X, Yan F, Dong J, Prasadha MK, Pei J, Ting MYL, Zhu J, Li C, Hewett S, Dong J, Ziyar I, Shi A, Zhang R, Zheng L, Hou R, Shi W, Fu X, Duan Y, Huu VAN, Wen C, Zhang ED, Zhang CL, Li O, Wang X, Singer MA, Sun X, Xu J, Tafreshi A, Lewis MA, Xia H, Zhang K. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–1131.e9.
19. Rampasek L, Goldenberg A. Learning from everyday images enables expert-like diagnosis of retinal diseases. Cell. 2018;172(5):893–5.
20. Burlina P, Paul W, Mathew P, Joshi N, Pacheco KD, Bressler NM. Low-shot deep learning of diabetic retinopathy with potential applications to address artificial intelligence bias in retinal diagnostics and rare ophthalmic diseases. JAMA Ophthalmol. 2020;138(10):1070–7.
21. Burlina PM, Joshi N, Pacheco KD, Liu TYA, Bressler NM. Assessment of deep generative models for high-resolution synthetic retinal image generation of age-related macular degeneration. JAMA Ophthalmol. 2019;137:258–64.
22. Liu Y, Yang J, Zhou Y, Wang W, Zhao J, Yu W, Zhang D, Ding D, Li X, Chen Y. Prediction of OCT images of short-term response to anti-VEGF treatment for neovascular age-related macular degeneration using generative adversarial network. Br J Ophthalmol. 2020;104(12):1735–40.
23. Zheng C, Xie X, Zhou K, Chen B, Chen J, Ye H, Li W, Qiao T, Gao S, Yang J, Liu J. Assessment of generative adversarial networks model for synthetic optical coherence tomography images of retinal disorders. Transl Vis Sci Technol. 2020;9(2):29.
24. Liu TYA, Farsiu S, Ting DS. Generative adversarial networks to predict treatment response for neovascular age-related macular degeneration: interesting, but is it useful? Br J Ophthalmol. 2020;104(12):1629–30.
25. Lee CS, Lee AY. Clinical applications of continual learning machine learning. Lancet Digit Health. 2020;2(6):e279–81.
26. Mehta N, Lee CS, Mendonça LSM, Raza K, Braun PX, Duker JS, Waheed NK, Lee AY. Model-to-data approach for deep learning in optical coherence tomography intraretinal fluid segmentation. JAMA Ophthalmol. 2020;138(10):1017–24.
27. Larson DB, Harvey H, Rubin DL, Irani N, Tse JR, Langlotz CP. Regulatory frameworks for development and evaluation of artificial intelligence-based diagnostic imaging algorithms: summary and recommendations. J Am Coll Radiol. 2021;18(3 Pt A):413–24.
28. Wang X, Liang G, Zhang Y, Blanton H, Bessinger Z, Jacobs N. Inconsistent performance of deep learning models on mammogram classification. J Am Coll Radiol. 2020;17:796–803.
29. Subbaswamy A, Schulam P, Saria S. Preventing failures due to dataset shift: learning predictive models that transport. Proc Mach Learn Res. 2019;89:3118–27.
30. Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics. 2020;21:345–52.
31. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17:195.
32. Winkler JK, Fink C, Toberer F. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 2019;155:1135–41.
33. Finlayson SG, Bowers JD, Ito J. Adversarial attacks on medical machine learning. Science. 2019;363:1287–9.
34. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 2018;15:e1002683.
35. Antun V, Renna F, Poon C, Adcock B, Hansen AC. On instabilities of deep learning in image reconstruction and the potential costs of AI. Proc Natl Acad Sci U S A. 2020; pii: 201907377. https://doi.org/10.1073/pnas.1907377117.
36. Matheny ME, Thadaney Israni S, Ahmed M, Whicher D. AI in health care: the hope, the hype, the promise, the peril. Washington, DC: National Academy of Medicine; 2019. https://nam.edu/artificial-intelligence-special-publication
37. Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, Arnaout R, Kohane IS, Saria S, Topol E, Obermeyer Z, Yu B, Butte AJ. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26(9):1320–4.
38. Floridi L, Cowls J, King TC, Taddeo M. How to design AI for social good: seven essential factors. Sci Eng Ethics. 2020;26(3):1771–96.
39. Xie Y, Nguyen QD, Hamzah H, Lim G, Bellemo V, Gunasekeran DV, Yip MYT, Qi Lee X, Hsu W, Li Lee M, Tan CS, Tym Wong H, Lamoureux EL, Tan GSW, Wong TY, Finkelstein EA, Ting DSW. Artificial intelligence for teleophthalmology-based diabetic retinopathy screening in a national programme: an economic analysis modelling study. Lancet Digit Health. 2020;2(5):e240–9.
40. Tufail A, Rudisill C, Egan C, Kapetanakis VV, Salas-Vega S, Owen CG, Lee A, Louw V, Anderson J, Liew G, Bolter L, Srinivas S, Nittala M, Sadda S, Taylor P, Rudnicka AR. Automated diabetic retinopathy image assessment software: diagnostic accuracy and cost-effectiveness compared with human graders. Ophthalmology. 2017;124(3):343–51.
2
Basics of Artificial Intelligence for Ophthalmologists
Ikram Issarti and Jos J. Rozema
I. Issarti · J. J. Rozema (*)
Visual Optics Lab Antwerp (VOLANTIS), Department of Ophthalmology, Antwerp University Hospital, Edegem, Belgium
Department of Medicine and Health Sciences, Antwerp University, Wilrijk, Belgium

The past decade has seen a steep rise in the number of applications of Artificial Intelligence (AI), especially for repetitive or complex tasks where humans may quickly suffer from either a drifting attention span or subtle inconsistencies. Such systems are often more cost efficient, thus accelerating their adoption and acceptance and consequently increasing people's reliance on AI. But an understanding of its inner workings is often lacking, and many tend to approach it as a 'black box', at the risk of uncritically accepting whatever output it produces. Although by its very nature AI is opaque about how it reaches a result, there are statistical methods to objectively assess the quality of its output. As AI becomes a popular subject within the scientific community and health care practice, this chapter explains the basic principles of AI in a comprehensive, step-by-step manner, along with examples of ophthalmological applications. Special attention will be paid to the differences between AI, machine learning (ML), and deep learning (DL), highly interconnected techniques that are often confused for one another.

After being considered science fiction for a long time, the first scientific step towards intelligent machines was taken by Alan Turing, who in 1950 developed the famous Turing test [1]. This involves an interview with open-ended questions to determine whether the intelligence of the interviewee is human or artificial. If this distinction can no longer be made, within certain predefined margins, true machine intelligence has been accomplished. The concept suggests that a machine could, in principle, think and simulate human intelligence through behaviour such as learning, interpreting and communicating. This concept is referred to as artificial intelligence.

The period between 1956 and 1974 became known as the Golden Age of AI. This time saw a massive growth in computing power, which made it possible to test the ideas of McCulloch and Pitts [2] that the brain's neurons may be described by simple logical operators (AND, OR and NOT), leading to the first AI algorithms, called neural networks. This illustrates how, from the very beginning, AI has been inspired by biological phenomena to mimic human abilities and behaviour, such as the ability to learn and adapt to real-life scenarios. These ideas were expanded upon in the decades that followed with the introduction of new techniques, until in the 1990s the first ophthalmological applications started to emerge for the screening of glaucoma [3], diabetic retinopathy [4] and keratoconus [5]. A more detailed overview of ophthalmological applications can be found in a recent review paper [6] or the other chapters.

Overview

Artificial Intelligence is a very broad field of study encompassing a wide range of techniques that allow machines to display ever more intelligent behaviour (Fig. 2.1). Machine learning is one of the most important subfields of AI. Although ML and AI are often confused, AI also includes other approaches not included in machine learning, such as expert systems: knowledge- or rule-based systems that emulate human cognitive and reasoning abilities by following certain guidelines to perform a decision-making process [7]. Meanwhile, ML refers to a group of mathematical algorithms that learn from experience (data) by mimicking human learning behaviour to perform new tasks. ML is able to fit complex data sets, extract new knowledge, imitate complex behaviour, and predict and classify based on prior data. Another well-known group of algorithms is deep learning (DL), a subset of machine learning based on artificial neural networks. DL is able to simultaneously analyse multiple layers of data. These layers consist of data processing units, called neurons, that allow them to analyse large amounts of data at once while preserving the data's spatial distribution. DL systems have seen significant successes in applications such as pattern recognition, image processing, and speech recognition.

Fig. 2.1 The nested relationship between Artificial Intelligence, Machine Learning and Deep Learning

The training process of ML and DL is very similar to that found in schools, with a professor teaching his students. From a large amount of given data, the algorithm learns how to describe a specific topic in a model (knowledge acquisition), which is subsequently validated using unseen data to evaluate its generalizability. Finally, the performance of the algorithm is evaluated based on several guidelines given in section "Performance Evaluation".

Data Basis

Data is the fuel of AI and can come from different sources, such as the web, videos, audio, text, etc. It is comprised of massive amounts of bits, binary values of zeros and ones, that can be reorganized to form structured data that is usually easier for AI algorithms to process, such as a relational database or a spreadsheet. It is also possible to work with unstructured data without predefined formatting (e.g. audio, video, text, etc.), or a hybrid of structured and unstructured data called semi-structured data. Finally, one can consider time series data, consisting of structured or unstructured data in sequential time steps [8]. A good understanding of data structures allows a proper AI implementation. Some highlights are given in section "Conducting a Machine Learning Analysis", but more details are available in the data mining literature and data pre-processing textbooks [9].
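As a minimal illustration of these data structures (a sketch with made-up values, not taken from the chapter), structured records can be rearranged into the [input, output] format that supervised learning expects:

```python
# Hypothetical structured data: one record per patient (invented values).
records = [
    {"age": 64, "iop_mmhg": 21.0, "cup_disc_ratio": 0.7, "label": "glaucoma"},
    {"age": 51, "iop_mmhg": 14.5, "cup_disc_ratio": 0.3, "label": "healthy"},
    {"age": 70, "iop_mmhg": 24.0, "cup_disc_ratio": 0.8, "label": "glaucoma"},
]

features = ["age", "iop_mmhg", "cup_disc_ratio"]

# Restructure into an [input, output] format: a numeric matrix X and labels y.
X = [[r[f] for f in features] for r in records]
y = [r["label"] for r in records]

print(X[0])  # [64, 21.0, 0.7]
print(y)     # ['glaucoma', 'healthy', 'glaucoma']
```

The same reshaping step, with the output column dropped, yields the [input]-only format used for unsupervised learning.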
Common Tasks

…structure and patterns within large datasets. The most typical tasks for ML are classification, clustering and prediction.

• Classification involves sorting new cases into two or more groups (Fig. 2.2a). In healthcare, classification could be used for diagnosis (healthy or abnormal) or the identification of biological markers.
• Clustering: in clustering the algorithm divides a dataset into several, previously unknown clusters (groups) with certain properties in common (Fig. 2.2b). Clustering can be used to e.g. distinguish the different stages of a disease.
• Prediction consists of building a model based on historical data to forecast unknown parameter values in the future, to e.g. predict the outcome of a surgical procedure or treatment (Fig. 2.2c).
• Regression: while classification problems sort data into different sets of classes or categories, a regression problem predicts the values of a continuous variable rather than categorical variables. This problem is also referred to as a prediction task.

Fig. 2.2 Examples of (a) classification, (b) clustering and (c) prediction (panel c: drug infusion in mg/m² over time in days, comparing patient response, ML simulation and ML prediction)

Learning Models

AI algorithms can be trained in any one of four methods:

• Supervised Learning ('with professor') teaches a ML algorithm the desired output
(answer) given an input with labelled categories. Based on this the algorithm learns the characteristics of each category, so when it is presented with an unseen input, it will be able to assign it to the right output class (category). Supervised algorithms are mostly used for classification problems (Fig. 2.2a), where points can be assigned to three pre-defined classes (e.g. healthy, pathological and suspect), or for prediction problems (Fig. 2.2c), such as predicting the future evolution of a tumour.
• Unsupervised Learning ('without professor') algorithms assign data to multiple subgroups (clusters) with similar properties within the input data, without being given desired answers or outputs. Unsupervised learning can be applied to classification problems with unknown outputs, as is illustrated in Fig. 2.2b, where the algorithm identified three clusters based on the available data.
• Semi-supervised learning combines supervised and unsupervised learning by giving the desired output for only a small number of inputs. After training based on the labelled data, the algorithm uses unsupervised learning for the unlabelled data to create new clusters. Ultimately these clusters are themselves labelled and added to the previous outputs. This method is used when not all outputs are available.
• Reinforcement learning is a training method in which an algorithm must define its own response based on trial and error, much as in human learning. This can be applied when there is a continuous change in the situation to which the machine needs to adapt and respond. Although quite advanced, its use remains limited within the field of medicine, to e.g. systems that learn from the successes and failures of clinical trials in the literature to suggest new approaches for testing.

Machine Learning Algorithms

There are dozens of machine learning algorithms described in the literature. For reasons of conciseness, only the most common ones will be listed below.

(Non)-linear Regression

Regression analysis is a well-known statistical method that builds a mathematical model from prior observations to make a prediction, which constitutes the basis of machine learning. If the relation between the input and output is linear, the model is called linear regression (Fig. 2.3). For example, one can score the progress of a disease based on several observed variables (x1, x2, …, xn) by assigning a weight (w1, w2, …, wn) to each variable indicating their relative…
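The weighted-sum idea behind linear regression can be sketched in a few lines of plain Python; the data and the single predictor are invented for illustration, and the weights are estimated by ordinary least squares:

```python
# Hypothetical data: a disease score as a function of one observed variable.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares estimates for the model y = w0 + w1 * x.
w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
w0 = mean_y - w1 * mean_x

print(round(w1, 2), round(w0, 2))  # slope ≈ 1.97, intercept ≈ 0.09
```

With more variables the same principle yields one weight per variable, fitted jointly; a nonlinear relation requires transforming the inputs or a nonlinear model.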
P(KC | Thin) = P(Thin | KC) · P(KC) / [P(KC) · P(Thin | KC) + P(NL) · P(Thin | NL)]
where all the terms on the right-hand side can easily be estimated beforehand from a large data set after weighting P(KC | Thin, W) with a weight W. In practice, however, the classification is based on many variables, increasing the chance of interdependence between parameters. Naïve Bayes chooses to ignore this interdependence, which, despite being a severe oversimplification, has demonstrated good results in classification, especially in text recognition, spam detection, and medical diagnosis.

Support Vector Machine (SVM)

SVM is a more recently developed form of supervised machine learning used for both regression and classification. These algorithms are preferred by experts for their accurate and reproducible results with less computation power, while being robust enough to handle small samples. In short, SVM is a binary classifier that searches for a dividing plane to separate distinct data. The orientation and position of this dividing plane are determined by the closest points, called support vectors, in an attempt to maximise the margins between the distinct groups [8, 10].

K-Means

As a form of unsupervised clustering, this method groups unlabelled input data into a predefined number of K clusters; as such, it can operate on a large dataset. For example, given a dataset of normal and diseased subjects, but without a clear classification of the disease's features, K-means can be used to define K distinct clusters with statistical characteristics that may be used clinically. It does this by first randomly choosing centroid points (Fig. 2.4) and seeing whether the neighbouring points can efficiently be divided into the
Fig. 2.4 (a) Input data; (b) K-means clustering, starting from randomly placed centroids (open squares) that are gradually adjusted until an equilibrium has been reached (open circles)
requested number of clusters by assessing their distance to the centroids. If this is not the case, the centroid points are iteratively shifted until a certain minimum distance is achieved.

K-Nearest Neighbour (KNN)

KNN is a supervised machine learning method to address regression and classification problems. It is considered a 'lazy' algorithm, as it does not require training, relying instead on the assumption that similar inputs remain close to one another. The algorithm computes the distance between a test data set and its K nearest neighbours to form a large cluster. Whenever a new data point is presented for classification, KNN will look in the database at the K points nearest to the new point to determine to which group this point should belong.

Decision Trees

A decision tree is a directed data structure that uses a series of yes/no questions for classification. The start of the decision tree is the root node with a yes/no input question. From this starts a series of decision paths (branches) where the algorithm makes a decision through a computed probability that ultimately leads to a leaf of the tree, corresponding with the outcome [10]. Decision trees can also be combined into a structure called a Random Forest, a way to average multiple decision trees built from different parts of the training set to reduce the risk of overfitting often experienced by a single decision tree. The increased performance comes at the cost of some loss of interpretability and of accurate fit to the training data, however.

Artificial Neural Networks (ANN)

Artificial neural networks are a family of algorithms that, inspired by the human brain, form interconnected structures of artificial neurons. These structures interact with each other to mimic the complex behaviour found in real neurons, such as self-adapting, self-organizing, and real-time learning from examples. The human brain consists of many neurons that are interconnected through a large number of axons to exchange signals. When a neuron receives a specific input signal through its connections, its cellular body will generate a new signal through the axons and transmit it to other dendritic cells (Fig. 2.5) [11]. This biological architecture is emulated in artificial neurons, where dendritic signals represent the neural inputs X = (xj)j that are assigned synaptic weights θij. The cellular body is represented by a nonlinear activation function that operates on the input signal to create an output signal y that is passed on to the next…
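The artificial neuron just described — inputs xj, synaptic weights θij, and a nonlinear activation function — can be sketched as follows; the inputs and weights are invented, and the logistic sigmoid stands in for the activation function:

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of the inputs followed by a
    nonlinear activation function (here the logistic sigmoid)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # output signal y

x = [0.5, -1.0, 2.0]       # neural inputs (dendritic signals), made up
theta = [0.8, 0.4, -0.3]   # synaptic weights, made up
y = neuron(x, theta, bias=0.1)
print(round(y, 3))  # ≈ 0.378
```

A network stacks many such neurons in layers, passing each layer's outputs on as the next layer's inputs.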
…and a validation set (25%), each with a different purpose. The training set is fed into the machine learning algorithm, the test set is used to perform an internal validation at each iteration of the training, while the validation set helps to assess the algorithm's performance after it has stabilized.

ML models use metrics that compare the real, measured values with the model's predictions to assess performance at each iteration using the training and test sets. This procedure uses the training set to evaluate the learning performance, and the validation set to evaluate the model's generalisability. The most common metrics include accuracy, sensitivity, specificity, and precision, defined in Table 2.1.

During the training process it is especially important to be mindful of overfitting and underfitting of the model. Overfitting occurs when the model has become overly detailed, to the point where it begins to fit random statistical variations (noise). This can be noticed by a continued improvement of the metrics on the training set, but a stabilization or worsening of the test set's metrics (Fig. 2.6). Underfitting, on the other hand, is the opposite, where the model cannot account for various relations due to a lack of well-discriminating parameters. Consequently, it is good practice to rule out overfitting and underfitting before computing the performance metrics on the validation set. Meanwhile, in a good fit the performance of the training and test sets should be very similar.

Confusion Matrix

One popular way to represent performance is a confusion matrix that compares the algorithm's classifications to the actual classification, using the metrics in Table 2.2. Ideally, the non-diagonal values should remain 0. Usually confusion matrices are binary, but they may be expanded to include more than two classes.
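The metrics of Tables 2.1 and 2.2 follow directly from the four cell counts of a binary confusion matrix; the counts below are invented for illustration:

```python
# Hypothetical screening result: cell counts of a binary confusion matrix.
tp, fn = 45, 5    # abnormal cases: identified correctly / missed
tn, fp = 90, 10   # normal cases: identified correctly / flagged as abnormal

accuracy    = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)

print(accuracy, sensitivity, specificity, round(precision, 3))
# → 0.9 0.9 0.9 0.818
```

Note that precision lags behind the other metrics here because of the 10 false positives, which is exactly the behaviour Table 2.1 describes.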
Receiver Operating Characteristic Curve (ROC)

Table 2.1 Performance metrics

True positive (TP): abnormal cases identified correctly
True negative (TN): normal cases identified correctly
False positive (FP): normal cases classified as abnormal (Type I error)
False negative (FN): abnormal cases classified as normal (Type II error)
Accuracy = (TP + TN) / (TP + FP + FN + TN): percentage of times the algorithm is correct
Sensitivity = TP / (TP + FN): percentage of true positives correctly identified
Specificity = TN / (TN + FP): percentage of true negatives correctly identified
Precision = TP / (TP + FP): ratio of correctly classified positives; high precision relates to a low false positive rate
Cut-off: value or point designated as the limit of a group

Fig. 2.6 Under- and overfitting: error of the training and test scores as a function of model complexity

Table 2.2 Confusion matrix

                     Actually positive   Actually negative
Predicted positive   True positive      False positive
Predicted negative   False negative     True negative

The ROC curve is a plot of the true positive rate as a function of the false positive rate (i.e. 100 − specificity) for different cut-offs. This aims to
find the optimal cut-off that maximizes the balance between specificity and sensitivity. Typically, curves close to the top-left corner of the plot represent 'ideal' models, with a false positive rate of zero and a true positive rate of one, while curves close to the diagonal approximate random noise (Fig. 2.7). ROC curves can also be summarized by the Area Under the Curve (AUC), which ranges between 0.5 for random noise and 1 for perfect models.

Fig. 2.7 Examples of ROC curves (a perfect classifier reaches the top-left corner; a random classifier follows the diagonal)

K-Fold Cross Validation Testing

This validation technique splits the training data into K folds, selects one for validation and builds the model based on the remaining (K − 1) folds. It can be considered a K-repetitive hold-out validation, where the test set and the validation set are independent for each iteration or run. There are several variations of the technique, such as Stratified K-Fold Cross Validation, Leave-P-Out Cross Validation, etc.

…look at the available parameters and what kind of outcome can be expected. Based on this assessment, you can decide which parameters may be the most appropriate for the analysis, or whether parameters must be combined or transformed before proceeding. Next, perform the pre-processing to remove missing values and outliers, and restructure the data into an [input, output] format for supervised learning, or an [input] format for unsupervised learning. Once the data is ready for training, select an AI algorithm based on the task to perform (e.g. classification, clustering, time series prediction, etc.). If multiple AI algorithms are available, select the one that is easiest to implement and adjust, to avoid a loss of time and computational cost. The actual training of the AI involves a non-linear optimisation based on your inputs and outputs. This step is mostly done in a black-box format incorporated in your chosen software environment (e.g. Python, Weka, Matlab, etc.), yielding a trained model along with its performance metrics. During this phase you should keep an eye out for overfitting, underfitting or large errors. If any of these occur, despite retraining several times, the parameters selected in the data pre-processing step are not representative enough and other parameters should be chosen before retraining the model. Once the model performs well based on the metrics, ROC curves and confusion matrices of the test sets, it is time to assess the model's performance using the unseen validation data. Ideally, this should be completely independent data from a different centre, if possible. If this validation is also satisfactory, the model development is complete. A full overview of these steps is given in Fig. 2.8.

Software for Machine Learning Implementation
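As a hypothetical illustration in plain Python (none of the data, class names or thresholds come from the chapter), the hold-out workflow described above — split, train, internal test, then final validation on unseen data — might look like this, with a simple nearest-centroid rule standing in for the chosen algorithm:

```python
import random

random.seed(0)

# Invented two-class data: (feature_x, feature_y, label).
data = [(random.gauss(1, 0.5), random.gauss(1, 0.5), "healthy") for _ in range(40)] + \
       [(random.gauss(3, 0.5), random.gauss(3, 0.5), "diseased") for _ in range(40)]
random.shuffle(data)

# Hold-out split: training, internal test, and final validation sets.
train, test, valid = data[:40], data[40:60], data[60:]

def fit_centroids(rows):
    """'Training': compute one centroid per class (stand-in for a real algorithm)."""
    centroids = {}
    for label in {r[2] for r in rows}:
        pts = [(x, y) for x, y, l in rows if l == label]
        centroids[label] = (sum(p[0] for p in pts) / len(pts),
                            sum(p[1] for p in pts) / len(pts))
    return centroids

def predict(centroids, x, y):
    # Assign the class whose centroid is closest (squared Euclidean distance).
    return min(centroids, key=lambda l: (x - centroids[l][0]) ** 2 +
                                        (y - centroids[l][1]) ** 2)

def accuracy(centroids, rows):
    return sum(predict(centroids, x, y) == l for x, y, l in rows) / len(rows)

model = fit_centroids(train)
print("test accuracy:", accuracy(model, test))        # internal validation
print("validation accuracy:", accuracy(model, valid))  # unseen data
```

In practice the training and evaluation calls would be supplied by the chosen environment (Python with its ML libraries, Weka, Matlab, etc.); the structure of the workflow stays the same.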
to help the physician reach a conclusion, rather than a Diagnostic System to replace the physician's expertise altogether. Finally, AI may help with screening, patient follow-up and scheduling, filling out patient files, letters and administration, provided it is supervised and corrected afterwards by a physician. Current AI systems would therefore have to be embedded in a human context. Provided such AI systems are developed with respect for human interaction, empathy and privacy, this could optimize time use and reduce waiting times in hospitals.

Conclusion

For very well-delineated tasks in ophthalmology, AI can reach exceptional levels of performance that supersede those of human ophthalmologists. The technology suffers from a number of limitations, however, that make it unwise to rely solely on its output. Instead, AI systems are ideal to work in partnership with ophthalmologists, for example for disease detection or as a decision support system.

References

1. Turing AM. Computing machinery and intelligence. Mind. 1950;LIX:433–60.
2. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. 1943;5:115–33.
3. Goldbaum MH, et al. Interpretation of automated perimetry for glaucoma by neural network. Invest Ophthalmol Vis Sci. 1994;35:3362–73.
4. Gardner GG, Keating D, Williamson TH, Elliott AT. Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool. Br J Ophthalmol. 1996;80:940–4.
5. Maeda N, Klyce SD, Smolek MK. Neural network classification of corneal topography. Preliminary demonstration. Invest Ophthalmol Vis Sci. 1995;36:1327–35.
6. Consejo A, Melcer T, Rozema JJ. Introduction to Machine Learning for ophthalmologists. Semin Ophthalmol. 2019;34:19–41.
7. Choi RY, Coyner AS, Kalpathy-Cramer J, Chiang MF, Campbell JP. Introduction to Machine Learning, Neural Networks, and Deep Learning. Transl Vis Sci Technol. 2020;9:14.
8. Taulli T. Artificial intelligence basics: a non-technical introduction. Apress; 2019. https://doi.org/10.1007/978-1-4842-5028-0.
9. Aggarwal CC. Data mining: the textbook. Springer; 2015.
10. Rebala G, Ravi A, Churiwala S. An introduction to Machine Learning. Springer; 2019.
11. Lo JT-H. Functional model of biological neural networks. Cogn Neurodyn. 2010;4:295–313.
12. Gupta N, Trindade BL, Hooshmand J, Chan E. Variation in the best fit sphere radius of curvature as a test to detect keratoconus progression on a Scheimpflug-based corneal tomographer. J Refract Surg. 2018;34:260–3.
13. Kohonen T. Self-organization of very large document collections: state of the art. In: Niklasson L, Bodén M, Ziemke T, editors. ICANN 98. Springer; 1998. p. 65–74. https://doi.org/10.1007/978-1-4471-1599-1_6.
14. Sutton RS, Barto AG. Reinforcement learning: an introduction. MIT Press. p. 352.
15. Hope T, Resheff YS, Lieder I. Learning TensorFlow. O'Reilly Media; 2017.
16. Witten IH, Cunningham SJ, Frank E. Weka: practical machine learning tools and techniques with Java implementations.
17. Lantz B. Machine learning with R: learn how to use R to apply powerful machine learning methods and gain an insight into real-world applications. Packt Publ; 2013.
18. Goldbaum MH, et al. Interpretation of automated perimetry for glaucoma by neural network. Invest Ophthalmol Vis Sci. 1994;35:3362–73.
19. Goldbaum MH, et al. Comparing machine learning classifiers for diagnosing glaucoma from standard automated perimetry. Invest Ophthalmol Vis Sci. 2002;43:162–9.
20. Huang M-L, Chen H-Y, Lin J-C. Rule extraction for glaucoma detection with summary data from StratusOCT. Invest Ophthalmol Vis Sci. 2007;48:244–50.
21. Kim SJ, Cho KJ, Oh S. Development of machine learning models for diagnosis of glaucoma. PLoS One. 2017;12:e0177726.
22. Coyner AS, et al. Automated fundus image quality assessment in retinopathy of prematurity using deep convolutional neural networks. Ophthalmol Retina. 2019;3:444–50.
23. Brown JM, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136:803–10.
24. Gulshan V, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316:2402–10.
25. Bogunovic H, et al. Machine learning of the progression of intermediate age-related macular degeneration based on OCT imaging. Invest Ophthalmol Vis Sci. 2017;58:BIO141–50.
30 I. Issarti and J. J. Rozema
26. Arcadu F, et al. Deep learning algorithm predicts diabetic retinopathy progression in individual patients. Npj Digit Med. 2019;2:1–9.
27. Souza MB, Medeiros FW, Souza DB, Garcia R, Alves MR. Evaluation of machine learning classifiers in keratoconus detection from Orbscan II examinations. Clinics. 2010;65:1223–8.
28. Arbelaez MC, Versaci F, Vestri G, Barboni P, Savini G. Use of a support vector machine for keratoconus and subclinical keratoconus detection by topographic and tomographic data. Ophthalmology. 2012;119:2231–8.
29. Lopes BT, et al. Enhanced tomographic assessment to detect corneal ectasia based on artificial intelligence. Am J Ophthalmol. 2018;195:223–32.
30. Issarti I, et al. Computer aided diagnosis for suspect keratoconus detection. Comput Biol Med. 2019; https://doi.org/10.1016/j.compbiomed.2019.04.024.
31. Achiron A, et al. Predicting refractive surgery outcome: machine learning approach with big data. J Refract Surg. 2017;33:592–7.
32. Yoo TK, et al. Adopting machine learning to automatically identify candidate patients for corneal refractive surgery. Npj Digit Med. 2019;2:1–9.
33. Valdés-Mas MA, et al. A new approach based on Machine Learning for predicting corneal curvature (K1) and astigmatism in patients with keratoconus after intracorneal ring implantation. Comput Methods Prog Biomed. 2014;116:39–47.
34. Poplin R, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2:158–64.
35. Schrijvers EMC, et al. Retinopathy and risk of dementia: the Rotterdam Study. Neurology. 2012;79:365–70.
36. Korot E, et al. Will AI replace ophthalmologists? Transl Vis Sci Technol. 2020;9:2.
37. Blasi ZD, Harkness E, Ernst E, Georgiou A, Kleijnen J. Influence of context effects on health outcomes: a systematic review. Lancet. 2001;357:757–62.
38. Smadja D, et al. Detection of subclinical keratoconus using an automated decision tree classification. Am J Ophthalmol. 2013;156:237–246.e1.
3 Overview of Artificial Intelligence Systems in Ophthalmology

Paisan Ruamviboonsuk, Natsuda Kaothanthong, Thanaruk Theeramunkong, and Varis Ruamviboonsuk
One of the first successful systems of artificial intelligence (AI) in health care can be traced back to a study in the late 1970s. In this study by Yu et al. [1], a computer was able to recommend choices of antibiotics for the treatment of meningitis with an acceptability rate of 65%. This rate may not be very high, but the corresponding acceptability rates of faculty specialists who performed the same task ranged only from 42.5% to 62.5%. This early AI system therefore clearly outperformed the specialists.

About two decades later, in the late 1990s, a system of AI in ophthalmology by Sinthanayothin et al. [2] was able to recognize the optic disc from retinal images with both sensitivity and specificity as high as 99.1%, and to recognize the fovea from the same images with a sensitivity and specificity of 80.4% and 99.1%, respectively. These results were far more promising than those of the earlier AI for choosing antibiotics stated previously.

It took about another two decades before another system of AI in ophthalmology, the iDx-DR [3], became the first system of AI in health care approved by the United States Food and Drug Administration (U.S. FDA) for automated detection of diabetic retinopathy (DR) for referrals to ophthalmologists in primary care settings.

The approval of the iDx-DR has placed AI in ophthalmology at the forefront of AI in health care, even though the studies of AI in other fields, such as pathology and radiology, may outnumber the studies of AI in ophthalmology. While the majority of studies of AI in health care focus on the objective of screening or early detection of diseases, such as screening for DR, AI is also useful for other tasks in ophthalmology [4]. AI has been studied for automated segmentation of retinal layers in macular edema due to DR, age-related macular degeneration (AMD), and retinal vein occlusion (RVO); for automated segmentation of the optic nerve head (ONH) in glaucoma; and for automated extraction of features, such as nucleus and capsule in cataract. In addition, AI has been studied for therapeutic and prognostic predictions, such as prediction of the requirement for anti-vascular endothelial growth factor (anti-VEGF) injections and prediction of visual outcome after treatment of AMD.

P. Ruamviboonsuk (*)
Department of Ophthalmology, Rajavithi Hospital, Bangkok, Thailand
N. Kaothanthong · T. Theeramunkong
Sirindhorn International Institute of Technology, Thammasat University, Pathumtani, Thailand
e-mail: natsuda@siit.tu.ac.th; thanaruk@siit.tu.ac.th
V. Ruamviboonsuk
Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
[Figure: training pipeline — image dataset (raw images) → feature extraction → labelled feature vector, together with the label of each image → machine learning algorithm; a separate path performs prediction or inferencing]
Fig. 3.2 Automatic feature extraction using the connectionist approach: Deep Learning
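As a rough illustration of the pipeline in Fig. 3.2, the following sketch separates feature extraction, training, and inferencing. All helper names and the nearest-centroid "learner" are our own simplifications for illustration, not taken from the chapter.

```python
# Sketch of the training and inference pipeline of Fig. 3.2
# (hypothetical helper names; any feature extractor / classifier could be used).

def extract_features(image):
    """Toy feature extraction: mean intensity and intensity range."""
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat) - min(flat)]

def train(images, labels):
    """'Training': store the mean feature vector per label (nearest-centroid)."""
    grouped = {}
    for img, lab in zip(images, labels):
        grouped.setdefault(lab, []).append(extract_features(img))
    return {lab: [sum(c) / len(c) for c in zip(*vecs)]
            for lab, vecs in grouped.items()}

def predict(model, image):
    """'Inferencing': label of the closest stored centroid."""
    f = extract_features(image)
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(f, model[lab])))

bright = [[9, 9], [9, 9]]
dark = [[1, 1], [1, 1]]
model = train([bright, dark], ["disease", "no disease"])
# predict(model, [[8, 9], [9, 8]]) → "disease"
```

The point of the sketch is the separation of stages: raw images are reduced to feature vectors, the labelled vectors drive the learning step, and new images pass through the same extractor before prediction.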
from the domain expert. This is why it is known as a "black box". The architecture of DL is a multilayered stack of simple modules, see Fig. 3.2. It is capable of discovering the feature representations from a set of raw input data for classification [9]. Each module transforms the input from the previous levels into a representation at a higher, slightly more abstract level. With multiple levels of transformation, very complex features, which are much too high dimensional to be accessible for human interpretation, are extracted and inferences can be performed [10].

Overview of Conventional ML Algorithms

Supervised Learning Approach

Given input-output pairs assigned by humans, supervised learning finds a pattern of input features that discriminates the different desired outputs. This pattern is considered as knowledge that is used to predict the output of a new input feature. There are many methods for supervised learning.

Naïve Bayes is a probabilistic method based on Bayes' theorem. It finds the probability that the desired output, denoted by A, will occur when the input feature, denoted by B, is presented, see the equation below:

P(A|B) = P(B|A) P(A) / P(B)

For example, in an image segmentation application, A is defined as a class, such as disease or no disease, and B is a feature extracted from a pixel in an image. The pixel B is classified as either disease or no disease using a ratio of the likelihood of the feature B occurring in the area of class A.

Naïve Bayes can also be applied for image classification, for example, the probability that the features of an input image are DME or non-DME. There are many Naïve Bayes theorem-based methods, such as multinomial Naïve Bayes for predicting classes and Gaussian Naïve Bayes for predicting continuous values [11].

Support Vector Machine (SVM) finds a line segment or a hyperplane that optimally discriminates the features of two classes. The data on one
side of the hyperplane should contain the data of the same class as much as possible. Figure 3.3 shows an example of an image classification application using SVM, where each point represents an image feature. New data, when given to SVM, are classified according to their positions on the plane [12]. SVM can be applied to both image classification and image segmentation. The latter is done by assessing the features of each pixel and classifying them as either background (no disease) or foreground (disease) for segmentation (Fig. 3.11).

Decision Tree is a binary tree-based method that recursively divides the features of input in the training dataset into two parts until the optimal split between each class of the output has been reached. For each separation, the value of a feature is used to optimally divide the features into the two subgroups, such as disease and no disease. Each subgroup is divided further using the same or a different value of feature. Figure 3.4 shows an example of a Decision Tree for classifying the disease. The features in the training set are separated into two subgroups using "Feature1". Each subgroup is divided further using "Feature2", where the training data of "Feature2 < YY" are mostly in Class 1. The subdivision is continued until the optimal class separation is found, as shown in the gray-shaded labels.

Instead of relying on only one tree, Random Forest applies multiple trees to learn the input features [12]. Criteria for separating the input features into homogeneous groups are defined differently for each tree. To predict the output of a new image, features are extracted and used by the decision trees in the forest. The classification outputs decided by each tree are voted on, and the output that achieved the highest number of votes becomes the prediction result.

Artificial Neural Network (ANN) extracts relevant features from the input data by learning from examples, without explicitly stating the rules to perform classification tasks [13]. It applies the concept of a connected neural network, where each neuron adjusts the weights (the optimal parameters) from the precedent neurons for the learning process. ANN has been applied in many tasks and is also a fundamental building block of deep learning.
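The weight-adjustment idea behind an ANN can be illustrated with a single neuron, a perceptron, trained here on the logical AND function. This is a toy sketch of our own; real ANNs stack many such neurons and use gradient-based learning.

```python
# A single artificial neuron (perceptron): weights are adjusted from labelled
# examples, with no hand-written classification rules.

def step(x):
    return 1 if x >= 0 else 0

def train_perceptron(samples, labels, lr=0.1, epochs=20):
    """Adjust weights from labelled examples (here: learning logical AND)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(samples, labels):
            pred = step(w[0] * x1 + w[1] * x2 + b)
            err = y - pred                    # error drives the weight update
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]                         # logical AND
w, b = train_perceptron(samples, labels)
preds = [step(w[0] * x1 + w[1] * x2 + b) for x1, x2 in samples]
# preds == [0, 0, 0, 1]
```

Each misclassified example nudges the weights toward the correct answer, which is the "neuron adjusts the weights from the precedent neurons" mechanism described above, reduced to one neuron.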
Fig. 3.3 An illustration of a classification using a support vector machine (SVM)

Whereas the supervised learning approach relies on the input-output pairs assigned by humans,
[Fig. 3.4, reconstructed: an example decision tree — the root splits on Feature1 (<XX vs. ≥XX), the subgroups split further on Feature2 (<YY vs. ≥YY), leading to class labels such as Class 1]
the unsupervised learning approach requires only the input features, to separate them homogeneously. The aim of the unsupervised learning approach is to discover a structure or distribution in the input data in order to learn more about each separated group.

Unsupervised learning is used when the input-output pair is not provided. It has widely been used in image segmentation tasks to separate the set of pixels into groups of background and foreground, or a region of an object of interest (Fig. 3.11). In addition, it is also applied for studying the objects in each homogeneous group.

K-Nearest Neighbor (KNN) finds sets of objects whose features are similar to the input features. The distance among the input features is used as a similarity measure [14]. Given a feature of new data, the classification result is achieved by voting among the k objects closest to the new data.

Boosting Algorithms

Boosting is a generic algorithm that aims to improve the accuracy of the prediction result. Instead of relying on the prediction outcome of a single model, boosting algorithms apply multiple weak classifiers trained with new data to achieve a good classifier [15]. The outcome of the precedent weak model is connected to a new model together with the new data, to train and improve the prediction outcome. There are many boosting algorithms, where each applies different measures to improve the prediction accuracy. AdaBoost [16] and Gradient Boosting are examples of these algorithms that are used in ophthalmology, particularly for the prediction task.

Overview of DL Algorithms

Methods of DL that are commonly used in ophthalmology may be classified into Convolutional Neural Networks (CNNs), Pre-trained Unsupervised Networks (PUNs) [17], and Recurrent/Recursive Neural Networks (RNNs) [18]. Among these three different categories of DL networks, CNN has been used more extensively in medical image recognition [19, 20], including in ophthalmology.

Convolutional Neural Network (CNN)

The CNN is designed to automatically extract features in two-dimensional data, such as images, while merging semantically similar features into one in order to reduce sparseness. The features extracted using CNN can preserve important information for obtaining a good prediction. In addition, one or multiple images can be used as input, and a single diagnostic feature is designed as the output, such as disease presence or absence.

Fig. 3.5 Architecture of Convolutional Neural Network. (a) Input layer. (b) Convoluted layer. (c) Max pooling layer

The architecture of CNN comprises three layers: Input Layer, Convolution Layer, and Pooling Layer (see Fig. 3.5). An input image is placed in the first Input layer, as shown in Fig. 3.5a. The image is a two-dimensional array, where each cell in the array is in a three-color channel: red, green, and blue. Each channel is considered as a matrix and applied for a feature extraction. To obtain a rich representation, the input image is
divided into smaller subimages. Each subimage is used in the subsequent layers.

The Convolution Layer in Fig. 3.5b then applies a filter to extract a feature from the input matrix. The objective of this layer is feature extraction using different filters. The output of the Convolution Layer varies according to the filters, such as edge or texture [21]. Figure 3.6 illustrates two examples of the convolution result using two different gradient filters [22]. Blood vessels can be clearly represented using Filter A, while the optic disc can be visualized using Filter B. Since the performance of the prediction model also depends on the weights (optimal parameters) [23], many filters are therefore applied to extract features from an image. The problem of using too many filters to obtain a rich representation of an image is the sparseness of the feature, which results in a low accuracy. To cope with this limitation, CNN utilizes the Pooling Layer for a dimensionality reduction.

[Fig. 3.6, reconstructed filter kernels:
Filter A = [1 0 −1; 1 0 −1; 1 0 −1], Filter B = [1 1 1; 0 0 0; −1 −1 −1]]

The Pooling Layer in Fig. 3.5c applies a filter to preserve important information of the extracted feature in the previous layer and down-sample it into a smaller size. The filter can be of any size, such as the 3×3 filter shown in Fig. 3.5c. The value of the extracted features is summarized using one of three mappings: Max Pooling, Average Pooling, and Sum Pooling. In addition to the dimension reduction, the Pooling Layer is useful for extracting dominant features to achieve rotational and positional invariance. In other words, it can distinguish between non-disease and disease locations in an image.

CNN in DL

Applying the CNN in DL is typically structured as a series of stages, where each stage consists of a CNN unit, as shown in Fig. 3.7. The number of units depends on the number of filters, also called the network width. The stages are connected as shown in Fig. 3.8a. The number of stages refers to the network depth. The first few stages focus on mapping features from input images. The later stages apply the features from the previous stages as input, to merge semantically similar features into one [9, 19]. The last stage is a fully-connected layer for predicting the classification result. This fully-connected layer can be viewed as a conventional ML that applies an input feature from the Pooling Layer for a classification.

The architectures of CNN can vary in the arrangement of the number and size of filters for feature extraction, in the connections between these features, and in the network depth, as depicted in Fig. 3.8. The number of filters defines the width of the network, while the depth refines the learning capability of the network. Many CNN architectures have been created, such as AlexNet [17], VGG [24], Inception [25], ResNet [26], and EfficientNet [27]. They employ the
Fig. 3.7 An overview of the deep learning neural network using CNN. A CNN unit (input layer, convoluted layer, and pooling layer) forms a level in a deep learning network; each level may have a different width
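The convolution and max-pooling operations of a single CNN unit can be sketched directly. This is a pure-Python toy of our own: the 6×6 "image" is invented, and the kernel matches the vertical-edge Filter A of Fig. 3.6.

```python
# Toy sketch of one CNN unit: convolution followed by max pooling
# (illustrative only; real systems use optimized libraries).

def conv2d(image, kernel):
    """Valid convolution (strictly, cross-correlation, as in most DL libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(feature_map, size=2):
    """Down-sample by taking the maximum of each size x size window."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

filter_a = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]   # vertical-edge kernel
image = [[0, 0, 9, 9, 9, 9]] * 6                  # dark left half, bright right
fmap = conv2d(image, filter_a)   # large magnitude only at the vertical edge
pooled = max_pool(fmap)          # 2x2 max pooling halves each dimension
```

The feature map responds strongly (large magnitude) only at the column where intensity changes, and pooling then compresses that map, which is the dimension-reduction role described for the Pooling Layer.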
[Fig. 3.8 diagrams, reconstructed: (a) image → stacked CNN units (the number of levels is the depth) → pooling → fully connected layer → output model; (b) the same pipeline with a short-cut connection skipping some layers]
Fig. 3.8 Examples of the deep neural network using CNN. (a) A plain network. (b) A network with a short-cut connection to skip some layers
same structure of CNN for feature extraction; however, the number of layers and feature mappings (filters) vary. Their efficiency, such as accuracy and time for training, also varies.

AlexNet

The difference between the original CNN and AlexNet is the number of convolution layers. AlexNet comprises eight layers, of which five are convolution layers and the rest are fully-connected layers [17]. It is the first architecture that applies the multiple-layered CNN with a graphics processing unit (GPU) to accelerate the computational time of the DL.

VGG

Since increasing the depth of the CNN showed more accurate classification results [24], the architecture of VGG applies a small-sized filter of 3×3 and deeper weight layers to the CNN. The architectures of VGG16 and VGG19 are similar, except in the depth of the weight layers. With a smaller filter, compared to the original CNN, VGG shows a significant improvement of the result.

Inception

The drawback of increasing the depth and width of deep neural networks is overfitting of the prediction model when the training size is limited. Overfitting means the model performs much poorer on a new dataset than on the training dataset. Also, a uniformly increased network size means a dramatically increased use of computational resources. The architecture of Inception is to increase the depth and the width of the network to achieve a higher accuracy while keeping the computational budget constant [25].

ResNet

Although increasing the depth of a network improves its performance, it can lead to degradation. The problem of degradation causes saturation of the network's performance. Instead of passing through every layer in the network, ResNet applies a "shortcut" connection to skip some CNN layers, see Fig. 3.8b. The shortcut is used when the output feature of a layer is the same as the one before; then this particular layer can be skipped, and the degradation problem can be resolved [26]. There are many configurations of the ResNet architecture: ResNet34, ResNet50, etc. The number refers to the depth of the network.

EfficientNet

Other than scaling up the depth of deep neural networks, such as in VGG [24], Inception [25], and ResNet [26], scaling up the width [28] and increasing the image resolution [29] are other means to improve network performance. EfficientNet [27] presents a compound scaling method which achieves the maximal accuracy by uniformly scaling network width, depth, and resolution of
[Fig. 3.10 panels, reconstructed: the cost J(w) plotted over iterations for (a) a small and (b) a large value of the learning rate]
Fig. 3.10 A comparison of a small value (a) and a large value (b) of the learning rate
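The behaviour compared in Fig. 3.10 can be reproduced with one-dimensional gradient descent on a toy cost J(w) = w², a minimal example of our own rather than anything from the chapter.

```python
# Gradient descent on the 1-D cost J(w) = w**2, illustrating Fig. 3.10:
# a small learning rate converges smoothly, a large one overshoots and diverges.

def descend(lr, w=1.0, steps=25):
    """Iterate w <- w - lr * dJ/dw with J(w) = w**2 (so dJ/dw = 2w)."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

w_small = descend(lr=0.1)   # |w| shrinks toward the minimum at 0
w_large = descend(lr=1.1)   # |w| grows: each update overshoots the minimum
```

With lr = 0.1 each step multiplies w by 0.8, so the cost falls steadily; with lr = 1.1 each step multiplies w by −1.2, so the iterate oscillates with growing amplitude, which is the divergent curve of panel (b).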
reduce the number of training iterations without the limitation due to epochs. With transfer learning, these initial weight values can be assigned from an already available model, also called a "pre-trained" model. There are a number of pre-trained models for CNN-based DL [34].

Table 3.1 A confusion matrix for predicting two outcomes

                      Actual labels
Predicted results     Disease           No disease
Disease               True Positive     False Positive
No disease            False Negative    True Negative
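From the counts in Table 3.1, the sensitivity and specificity figures quoted throughout this chapter are computed as follows (the counts below are invented purely for illustration):

```python
# Sensitivity and specificity from the confusion matrix of Table 3.1
# (illustrative counts, not taken from any study in this chapter).

def sensitivity(tp, fn):
    """True positive rate: diseased cases correctly flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: healthy cases correctly passed."""
    return tn / (tn + fp)

tp, fp, fn, tn = 90, 20, 10, 180
sens = sensitivity(tp, fn)   # 90 / (90 + 10) = 0.9
spec = specificity(tn, fp)   # 180 / (180 + 20) = 0.9
```

Sweeping the decision threshold of a classifier trades one quantity against the other, which is what an ROC curve (and its AUC) summarizes.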
Research by Gulshan et al. [45]: the algorithm in this study was developed from more than 100,000 retinal images and was validated in two other independent datasets of more than 10,000 images. This study was the first to show the achievement of both sensitivity and specificity at 95% and AUC at 99% for detecting referable DR when validated in independent datasets.

This system was further validated in another retrospective dataset of more than 20,000 images of 7000 patients in a nationwide screening program of DR, to detect moderate NPDR or worse [46]. This AI model achieved 97% sensitivity, compared to 74% for the human graders in this screening program, whereas a slightly lower specificity of 96%, compared to 98% for humans, was found. When validated prospectively in two private hospitals in India with more than 3000 patients to detect moderate NPDR or worse, this model still achieved about 90% sensitivity with slightly higher than 90% specificity, which was better than manual grading [47].

Another large-scale study of a DL system for DR screening is from Singapore, in which more than 70,000 images were used for the development of the algorithm. The highlight of this study by Ting et al. [48] was the largest population to date, with more than 100,000 images of independent datasets of various races for validation. This AI-based software, now called SELENA (VGG-19), was able to detect STDR with 100% sensitivity, 91% specificity, and 0.96 AUC.

SELENA was also explored for DR screening in Zambia, a country in Africa where resources are scant (for example, the number of ophthalmologists is less than three per million Zambian population), in a study by Bellemo et al. [49]. The performances of SELENA in this study were found to be on par with those in Ting et al. [42] mentioned previously. SELENA was also found to be able to estimate the prevalence and systemic risk factors of DR similarly to human assessors; hence, this showed the potential roles of DL in epidemiology studies [50].

There was another system of DL for DR screening developed in China by Li et al. [51]. Developed from more than 70,000 images and validated in more than 30,000 images of three independent multiethnic datasets, this algorithm achieved a performance on par with other systems for detecting STDR. The authors highlighted 77% of false negatives as undetected intraretinal microvascular abnormalities.

All of the aforementioned systems of DL for DR screening were designed to be applied to color fundus photography (CFP). Apart from detecting referable DR, their performances on detecting diabetic macular edema (DME) were similar to the detection of STDR [46] (DME is almost always counted as part of STDR). The identification of DME from CFP, however, could be problematic, since in real clinical practice DME is identified using images from optical coherence tomography (OCT), which is three-dimensional. To overcome the limitation of the two-dimensional images of CFP, the presence of hard exudates in the macular area was used as a proxy for detecting DME when grading CFP. There are always cases for which the identification of DME from CFP and from OCT is not in concordance [52]. In addition, the prevalence of DME based on each modality is significantly different [53]. An interesting study by Varadarajan et al. used paired data of CFP and OCT to train a DL algorithm to learn to detect OCT-derived DME from grading on CFP only [54].

Developed from more than 6000 of the paired images, this algorithm (CNN: Inception V3) was able to detect center-involved DME (CI-DME) from CFP in the testing set of 1000 CFP images with 85% sensitivity and 80% specificity, whereas three retinal specialists who graded the same CFP images using hard exudates as a proxy for CI-DME had, on average, similar sensitivity but about half of the specificity, at 45%. In validation on another independent dataset of 990 CFP images, the sensitivity and specificity of this algorithm were lowered to 57% and 91%, respectively, whereas the sensitivity and specificity of graders who graded the same images were even lower than the algorithm, at 55% and 79%, respectively. It was noted that the data in the development dataset in this study were from tertiary care settings, while those in the independent set were from primary care settings.
However, this study showed a potential of AI to make predictions across two imaging modalities, or across two kinds of labelled data (other examples: predicting gender from CFP, blood pressure from CFP, etc. [55]), when trained with pairs of both imaging modalities or trained with pairs of both data labellings (label gender data with CFP, label blood pressure data with CFP, etc.). This concept is sometimes called "label transfer".

Classification of AMD

The aim for AI to screen AMD has been widely assessed recently. Many attempts had been made before using other means, such as the Amsler grid [56] or a hyperacuity device [57], for screening AMD, with fair success. A recent study in South Korea found systematic, population-wide retinal photography of people more than 40 years old by non-specialists for screening AMD to be cost-effective [58]. Another study found screening for AMD in concurrent programs for screening of DR to be cost-effective as well [59]. It is still not known whether screening for AMD replacing the non-specialists with AI is also cost-effective.

Most of the AI systems for screening and classification of AMD are developed using CNN and use CFP as inputs. Fewer studies used OCT images as inputs. SELENA, one of the first systems for screening AMD, was initially applied in patients with diabetes. Although used for screening AMD, the algorithm in SELENA was developed from a training dataset of more than 72,000 images of patients with diabetes in Singapore and Malaysia and a testing dataset of almost 36,000 images of patients in the same population. The output in this study was defined as referable AMD [48].

There are other studies on AI for screening AMD that were developed from CFP of the Age-Related Eye Disease Study (AREDS) [60], which is a large randomized controlled trial comparing vitamin supplements and placebo for AMD development and progression. Since the AREDS collected CFP as films, they were digitized for applying AI. A study by Burlina et al. [61] (CNN: AlexNet) used a training and testing dataset of almost 54,000 and 14,000 images from AREDS, while a study by Grassman et al. [62] (using various CNNs) used approximately 87,000 and 34,000 images. The former study used existing grades from AREDS for training, while the latter required a trained ophthalmologist to label data for the training; both studies provided outputs as grades according to the AREDS Classification. Burlina et al. classified the outputs into two classes, referable and non-referable, whereas Grassman et al. classified the outputs into the nine steps of AREDS and three late AMD stages. Both studies achieved sensitivity and specificity of around 90% for the testing dataset. The study by Grassman et al. conducted validation in an external dataset of more than 5000 images and achieved a sensitivity and specificity of 82.2% and 97.1% for detecting intermediate AMD, and a sensitivity and specificity of 100% and 96.5% for late AMD. The system by Burlina et al. could later classify the 9-step AREDS and predict the 5-year risk of progression to advanced AMD with acceptable error [63].

There are other AI systems for classification of AMD from spectral-domain OCT (SD-OCT) images. Some systems use DL for classifying AMD directly from OCT images, whereas some systems apply conventional ML for automated segmentation of fluid or biomarkers on OCT images as the first step, then use DL classifiers for classification later. Studies by Kermany et al. [64] and Treder et al. [65] are examples of the former. Both used "transfer learning" from an existing, open-sourced, pre-trained ImageNet deep neural network (DNN) with 1000 output categories to train on OCT images for AMD.

Kermany et al. trained the ImageNet network on four categories: choroidal neovascularization (CNV), DME, drusen, and normal. With a training dataset of more than 100,000 images (37,000 CNV, 11,000 DME, 8600 drusen, and 51,000 normal) and 1000 images for validation with equal distribution of the four categories, the system achieved an AUC of 98% with accuracy, sensitivity and specificity around 95%. The authors also performed the occlusion test to uncover the potential "black box" created by the model.
Treder et al. [65], on the other hand, trained and tested their system using over 1000 images (90% for training, which contained 70% AMD and 30% controls, and 10% for testing, which contained 50% each of AMD and controls).

Lee et al. [66] linked data from electronic medical records (EMR) with OCT images to develop a CNN system (VGG16) to classify AMD. With approximately 100,000 OCT images with linked EMR data points, half normal and the other half AMD, the system achieved AUC and accuracy around 90%.

Studies by Prahs et al. [67] and Hwang et al. [68] are examples of applying AI for classification of OCT images for decision-making. Prahs et al. trained a CNN (GoogLeNet, or Inception) with inputs of AMD, DME, RVO, and CSC, and outputs of "requiring" or "not requiring" anti-VEGF treatment, labeled by treating clinicians. This study conducted validation on an external dataset of more than 5500 images and achieved sensitivity of 90% and specificity of 96%. Hwang et al. not only used three different DL systems, VGG16, InceptionV3, and ResNet50, to train on OCT images of normal, dry AMD, active wet AMD, and inactive wet AMD; they also studied the DL as a cloud-based platform. The authors found that the three CNN systems performed similarly for classification of the four categories of AMD, with slightly lower performance on dry AMD. They also found a potential for prediction of longitudinal changes after treatment of wet AMD with 90% accuracy.

The major AI system designed to perform both automated segmentation of OCT images and subsequent classification tasks for AMD and other retinal diseases is by De Fauw et al. [69]. For the segmentation, the authors applied a three-dimensional U-Net architecture as a deep segmentation network to delineate OCT scans, using more than 1000 manually segmented training images to form tissue segmentation maps of the OCT scans. Another classification network, a customized CNN of 29 layers with 5 pooling layers, developed from 14,884 training tissue maps with confirmed diagnoses and referral decisions, was applied to the segmented OCT maps. This system then makes classifications into different retinal diagnoses, for example, normal, CNV, macular hole, central serous chorioretinopathy, vitreomacular traction, etc., and also referral suggestions: urgent, semi-urgent, routine, and observation. The authors found that, on testing the model on an independent test set of 997 patients (252 urgent, 230 semi-urgent, 266 routine, 249 observation), an AUC of 99.9 was achieved for urgent referral, whereas the error rate of 3.4% was on par with that of retinal specialists and better than that of optometrists.

Classification in Glaucoma

The diagnosis of glaucoma may require identification of many co-existing parameters, such as increased optic nerve head cupping, characteristic loss of the retinal nerve fiber layer, or characteristic defects of the visual field. These may make diagnosis of glaucoma by AI more complex compared with retinal disease.

The SELENA system by Ting et al. [48] can also detect glaucomatous optic nerve head (GONH); this part of the algorithms was developed from CFP of more than 120,000 patients with diabetes. Li et al. [70] also developed another DL system (VGG) for detecting GONH from more than 50,000 CFP graded by more than 20 ophthalmologists. The identification of referable glaucoma in these studies by Ting et al. and Li et al. may have a limitation from relying only on GONH, since even ophthalmologists might not have high agreement among themselves on grading GONH [71]. There are other retinal imaging technologies deployed for detecting glaucoma, such as OCT, confocal scanning laser ophthalmoscopy (CSLO), and scanning laser polarimetry (SLP), on which AI can be applied.

Before this era of DL, many models of conventional ML were applied for detecting glaucoma from both time- and spectral-domain OCT images of the optic nerve head (ONH) with acceptable performance [72]. Muhammad et al. showed that a hybrid DL model (AlexNet and a Random Forest classifier) was able to analyze single scans of SS-OCT images to classify between normal
44 P. Ruamviboonsuk et al.
and glaucoma suspects with 93% accuracy [73]. Christopher et al. applied a Principal Component Analysis (PCA) approach of unsupervised ML to analyze retinal nerve fiber layer (RNFL) thickness maps from SS-OCT and showed that this approach could achieve the highest AUC, of 0.95, compared to SD-OCT-based circumferential RNFL thickness measurements and visual field global indices for detection of glaucoma. Using stereoscopic CFP as the standard for defining glaucoma, this model could also detect glaucoma progression with the highest AUC [74] compared to the other means.

Visual field (VF) progression, another important indicator of worsening glaucoma, had been detected using back-propagation neural networks since the early 2000s in the Advanced Glaucoma Intervention Study (AGIS), with an AUC of 0.92 [75]. In another study, Yousefi et al. introduced a new glaucoma VF index calculated by an unsupervised ML method, Gaussian Mixture Model and Expectation (GEM), to detect VF progression. This model was trained on more than 2000 VFs and tested on a longitudinal cohort of 270 eyes followed every 6 months. This new AI-based index outperformed existing indices, such as Global or Region-wise, by finding the time to progression of 25% of the eyes in the longitudinal cohort at 3.5 years, compared with 4.5 years for the Region-wise and 5.2 years for the Global index [76].

In a recent study, Li et al. [77] compared (1) a DL-CNN (VGG architecture), (2) conventional ML models (SVM, RF, k-NN), (3) rule-based algorithms (AGIS and the enhanced glaucoma staging system [GSS2]), and (4) human experts for grading 300 VFs to differentiate between glaucoma and non-glaucoma patients. The CNN, developed from the same dataset of 4000 VF images, achieved an accuracy of 0.876, with specificity and sensitivity of 0.826 and 0.932 respectively, whereas the accuracy of the three ML models was around 0.65, that of the human experts around 0.6, and that of AGIS and GSS2 around 0.5.

A study by Bowd et al. [78] combined structural data (OCT) and functional data (VF) to train conventional ML models (Bayesian classifiers) and found their performances improved with the combined data for classification between glaucoma and non-glaucoma patients.

Medeiros et al. [79], on the other hand, applied the concept of "label transfer", stated previously, in glaucoma. They trained a CNN (ResNet34) with more than 30,000 paired data of both CFP images and RNFL thickness to predict the RNFL thickness from analyzing only the CFP. In the test set of around 6200 CFP images, the model could predict the RNFL thickness with a strong correlation between predicted and observed RNFL thickness values (Pearson r = 0.832; R2 = 69.3%; P < 0.001), with a mean absolute error of the predictions of 7.39 μm. The AUC of the classification of glaucoma and normal was 0.94 both for the prediction made by the CNN on grading only CFP images and for the actual RNFL measurement.

In the next study, by Jammal et al., the same team of authors validated their AI model with another set of 490 CFP images of 490 eyes of 370 subjects graded by two glaucoma specialists for the probability of glaucomatous optic neuropathy (GON) and estimates of cup-to-disc ratios (C/D). The AUC for classifying GON from CFP was higher for their AI model compared with the glaucoma specialists, 0.529 vs 0.411 [80]. This concept of "label transfer" may be applied more in AI in ophthalmology in the future.

Classification of Cataract

One of the first AI systems for grading cataract was by Gao et al. [81]. This system (Recursive Neural Networks combined with Support Vector Regression) was trained to grade severity of cataract from slit lamp photography on a decimal score, from 0.1 to 5.0, based on the Wisconsin Grading System. The testing was performed on 5378 images; the system achieved a 70.7% exact agreement ratio (R0), an 88.4% rate of decimal grading error less than 0.5 (Re0.5), and a 99.0% rate of decimal grading error less than 1.0 (Re1.0) when compared against clinical integral grading.

A much larger-scale AI system for classification of cataract was by Wu et al. [82] in China. The authors trained a DL-CNN (ResNet) to classify photographs of the anterior segment in three steps. The first was to classify according to mode of illumination (slit or diffuse) and method of capture (mydriatic or non-mydriatic); the second was according to diagnosis: normal, cataract, or postoperative cataract; the third was according to severity: mild or severe nuclear sclerosis, with visual axis involvement or not. The development dataset was from 37,638 slit lamp photographs of 16,611 patients, with 80% for training and 20% for testing. Validation of the models in the testing set found relatively good performances, with confusion metrics of more than 90% across the board. The performances on images from slit or diffuse illumination, mydriatic or non-mydriatic, were similar.

However, when the models were validated in four separate community hospitals in real-world situations, some of the confusion metrics of the AI dropped. The metrics found to be lower than 90% in the real-world testing were: sensitivity for classifying normal at 71.3%; specificity for classifying cataract at 83.9%; and sensitivity/specificity for classifying severe nuclear sclerosis at 73%/86% and mild nuclear sclerosis at 86%/73%, respectively. The authors claimed, however, that using this AI-based assistance for the referral system, an ophthalmologist may serve more than 40,000 persons a year, compared to 4000 persons without AI. This kind of AI model using anterior segment photography for determination of referable cataract should be subjected to further assessment, since many patients in the real world may live with relatively dense nuclear sclerosis without visual impairment or interference with daily life. The indication for cataract surgery, in addition, may still vary across different eye care services.

AI, in theory, would also be ideal as a tool for intraocular lens (IOL) power calculation. Sramka et al. [83] demonstrated that a conventional ML model, Support Vector Machine Regression, and a multilayered neural network ensemble achieved significantly better performance for IOL power calculation compared to conventional methods of calculation without AI. Another study by Koprowski et al. [84] deployed ANN to evaluate corneal power after refractive surgery.

Interestingly, other than patient care, AI has also been used for ophthalmic training in cataract surgery. The aim is to use DL to identify manually pre-segmented phases of cataract surgery procedures to assist in developing efficient and effective skill training tools. In a study by Yu et al. [85], from a dataset of videos of 100 cataract surgery procedures, the authors applied different AI algorithms, SVM, CNN, and CNN-RNN, to identify phases in the cataract surgery videos. Modelling time series of labels of instruments in use provided the highest accuracy for the identification. In another study, by Morita et al. [86], a DL-CNN model, Inception V3, was also trained to extract important phases, capsulorrhexis and nuclear extraction, from cataract surgery videos with a high correct response rate and low errors.

Screening for Retinopathy of Prematurity and Pediatric Cataract

Retinopathy of prematurity (ROP) shares some similarities with DR: early detection by retinal examination is essential to reduce the risk of visual loss. CFP of both diseases can be captured using commercially available cameras. Digital CFP imposes less burden on both examiners and examinees for examination of ROP [86] and also allows the possibility of screening by neonatologists [87]. The main difference between the two diseases is the urgency of treatment in ROP, which is within 72 h for Plus ROP [88]. The term "referable DR" in screening settings may have different definitions depending on the resources available in screening programs [89], while "referable ROP" is universally agreed as the definition of Plus disease [90]. Similar to DR screening, telemedicine has been deployed successfully for screening of ROP [91]. This technology allows addressing two important barriers to ROP screening: interexaminer variability due to subjectivity and an inadequate number of qualified trained examiners [92]. Further alleviation of these barriers is expected with application of AI.
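Agreement statistics of the kind reported above for structure-prediction models, such as Pearson r, R², and mean absolute error for predicted versus observed RNFL thickness, are straightforward to compute from paired values. A minimal sketch with synthetic numbers, not data from any cited study:

```python
# Illustrative only: Pearson correlation, R^2, and mean absolute error
# between predicted and observed values (e.g. RNFL thickness in um).
# All numbers below are synthetic.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_absolute_error(xs, ys):
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

observed  = [94.0, 88.5, 101.2, 76.4, 83.0, 90.7]   # hypothetical, um
predicted = [91.2, 90.1,  98.8, 80.0, 85.5, 88.9]

r = pearson_r(predicted, observed)
mae = mean_absolute_error(predicted, observed)
print(f"Pearson r = {r:.3f}, R^2 = {r * r:.3f}, MAE = {mae:.2f} um")
```

Reporting both a correlation (how well predictions track the measurement) and an absolute error in the measurement's own units, as the RNFL study does, gives complementary views of model quality.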
There were a number of studies on conventional ML for determination of tortuosity and width of retinal vessels in Plus ROP in the era prior to DL [93–97]. All of these models required manual annotations in implementation. Fully automated software for detection of ROP, without additional manual annotations, came with the application of CNNs. DeepROP by Wang J et al. [98] is one of the first CNN systems for ROP detection. The system (modified Inception-BN nets pre-trained on ImageNet) was developed in China with the largest dataset of ROP to date (more than 20,000 images).

Another system, i-ROP-DL (the earlier version of i-ROP had not yet adopted DL [97]), developed in the U.S. with more than 5000 images, was demonstrated in a study by Redd et al. [99] to successfully classify severity of ROP into type 1 and type 2. Another study of i-ROP-DL, by Brown et al. [100], compared the DL model (Inception V1 and U-Net) with eight international ROP experts on the output of Plus disease. Using the consensus of image-based diagnosis combined with ophthalmoscopy as the gold standard, the authors found the DL system agreed with this standard more than six of the eight experts did. This i-ROP-DL system, in addition, was able to convert the probability of its prediction, via a linear formula, into ROP severity scores.

In a subsequent study of i-ROP-DL, Taylor et al. [101] showed that these ROP severity scores could potentially be applied to monitor ROP disease progression. In monitoring more than 870 infants with ROP over time, the median scores of those who progressed to require treatment were significantly higher than those of infants who did not progress, at each of the postmenstrual age time points in the study.

There was another CNN system for ROP developed from a smaller dataset of 1500 images from Canada and England. This system could also detect Plus ROP with accuracy comparable to the other systems [102].

Apart from ROP, another major AI system was developed for detection of congenital cataract in China [103–105]. This system, called Congenital Cataract-Cruiser, or CC-Cruiser, is a cloud-based AI applied to slit-lamp images, comprising three CNNs (AlexNet) for screening, assessment of severity, and treatment recommendation. It was compared with experienced pediatric ophthalmologists in five clinics in China in a prospective randomized controlled trial evaluating 350 patients younger than 14 years old. Applying grading of images by a panel of three experienced pediatric ophthalmologists as the gold standard, the accuracy of this AI system was significantly lower than that of the experts for detection of cataract (87% vs 99%) and recommendation of treatment (71% vs 97%). The duration for reaching results, however, was significantly shorter for the AI system (2.8 vs 8.5 min).

Other pediatric ophthalmology conditions for which AI is studied include detection of strabismus and refractive error, and prediction of future high myopia and reading disability [106].

Overview of Systems for Automated Segmentation

The difference between AI-based classification and AI-based automated segmentation may be clarified from the outputs. The classification systems generally provide outputs as stages of a disease classification, or binary outputs such as referable or non-referable. Most of the automated segmentation systems, on the other hand, provide outputs as biomarkers or pathological characteristics, such as subretinal fluid, retinal pigment epithelial detachment, or the optic nerve head. The objectives of automated segmentation are assisting specialists in busy clinics or assisting researchers in research, for time and cost saving. The objectives of the classification systems are generally for the benefit of patients, such as early detection or screening for disease.

From a bioengineering perspective, the objective of an image segmentation task is to find the interesting area, such as an area of characteristics of disease in images, then separate the image pixels in that area into background and foreground. Figure 3.11 shows segmentation results of the macular fluid in an OCT image. In this example, the pixels of the fluid area are separated as foreground from the rest, which are background. Each of the image pixels is then classified into either a class of background or foreground. To train the model, a set of training images with labelling of the fluid area by experts, also called the ground truth, is provided (Fig. 3.11). The ML model then learns to find the patterns of the pixels that separate them into foreground and background classes. To validate the model, the ML model then uses these patterns to categorize image pixels in new images as either background or foreground.

Fig. 3.11 Example of the input images (OCT slices), their ground truths, and the segmentation results (two examples shown)

A conventional ML method, such as Principal Component Analysis (PCA), is generally applied to reduce the sparseness of the extracted features. Another method, such as SVM, is applied to classify each pixel into the two classes [107]. DL can also be applied for the classification [108]. A common statistic used for assessment of accuracy of automated segmentation is the Dice coefficient [109]. The hand-drawn outline of the segmented area by human experts is used as the ground truth.

Retinal OCT images are excellent inputs for automated segmentation, and not only in DR: Fang et al. [110] were able to segment the nine retinal layers in OCT images of patients with dry AMD. While macular fluid was generally the most common output for segmentation of retinal OCT images, photoreceptor cell loss was the output for segmentation of retinal OCT in patients with retinitis pigmentosa and choroideremia [111]. Segmentation of choroidal thickness in patients with AMD from enhanced-depth OCT (EDI-OCT) images was also performed successfully in another study [112]. Adaptive Optics Scanning Light Ophthalmoscopy images from patients with Stargardt's disease were also segmented by a Multi-Dimensional RNN for localization of cone cells [32].

While the automated segmentation of OCT images of the macular region is important in retinal disease, the region of interest (ROI) for automated segmentation in glaucoma is at the ONH. Many studies used different segmentation algorithms for ONH segmentation from CFP. Almost all of the studies reported higher than 90% accuracy [72]. Hagiwara et al. [113] proposed a common workflow for conventional computer-aided diagnosis of glaucoma using CFP images as follows: (1) Image Input, (2) Pre-Processing, (3) Segmentation, (4) Feature Extraction, (5) Feature Selection and Ranking, and (6) Classification. Some studies performed segmentation of the peripapillary area, which included RNFL, neural retina, retinal pigment epithelium, choroid, peripapillary sclera, and lamina cribrosa, from OCT images [114, 115].

Overview of Systems for Prediction of Natural History or Treatment Outcomes

Prediction aims to foresee the outcome of the disease, either with or without treatment, when a set of features, such as best corrected visual acuity (BCVA) or recurrence of macular edema at a certain time period, which are related to but not necessarily within the images themselves, is input. Longitudinal data or images are usually required for development of AI systems for prediction.

There has been an attempt to predict progression of DR at 6 months, 12 months, and 2 years for individual patients who did not have any ocular intervention during that period [116]. Using CFP of patients with DME who were randomized to the sham arm in the clinical trials of RISE and RIDE, a system of deep CNNs (Inception V3) was developed to predict progression of DR [117]. The CFPs in this AI model were based on the standard 7-field ETDRS photography; each field had its own CNN for prediction of 2-step or more worsening of DR according to the Diabetic Retinopathy Severity Scale (DRSS), and the results of each CNN were then combined using Random Forest. The prediction at 12 months by the system of combined models achieved the best performance, with an AUC of 0.75, whereas the prediction of this system based on some peripheral fields of ETDRS surprisingly outperformed the prediction based on the posterior fields of ETDRS. However, the generalizability of this system might still be in doubt due to (1) the limited availability of the ETDRS 7-field photography, (2) the lack of validation in an independent dataset, and (3) the applicability of the models in patients without DME, since this study included only CFP of patients with DME.

A study by Rohm et al. [118] reflected an attempt to develop AI systems to improve patient care and facilitate research from the database of their own institute. Based on electronic health records collected in the form of a Data Warehouse [119], the authors developed and compared different conventional ML models (AdaBoost.R2, Gradient Boosting, Random Forests, Extremely Randomized Trees, and Lasso) that predicted VA at 3 months and 12 months after 3 monthly injections of anti-VEGF agents for neovascular AMD. The prediction by the models at 3 months had a mean absolute error between 5.5 and 9 letters, and a root mean square error between 7 and 10 letters. The prediction at 12 months was slightly worse.

Other studies on AI-based prediction of treatment outcomes were also centered on data from major clinical trials. For example, data from Protocol T [120] of the drcr.net trials were used for prediction of BCVA at 12 months after treatment with anti-VEGF injections for patients with DME [121]. Data from the HARBOR Study [122] were used for prediction of (1) BCVA at 12 months after treatment [123], (2) requirement of low or high numbers of anti-VEGF injections [124], and (3) advanced AMD conversion for patients with AMD [125]. For retinal vein occlusion, data from the CRYSTAL Study [126] were used for prediction of BCVA and recurrence of macular edema at 12 months [127].

The Future of AI in Ophthalmology

We are in a transition phase of AI in ophthalmology. There will be an abundance of AI systems coming for more specific tasks in ophthalmology at the present time and in the future. No more doubt will be cast on AI's performance in research, but many questions remain for deployment. The robust performance of many systems may not be carried over into the real world. Since comparison between different AI systems is difficult, the choosing of an AI system to be used in eye care will not be easy either. While researchers will be unravelling the "black box", patient acceptability, data privacy, data protection, and regulations, including medico-legal aspects [128], will be issues that every AI system will face in the future.
Acknowledgement This work is partially supported under the Thailand Research Fund grant number RTA6280015 and the Ratchadapisek Sompoch Endowment Fund under Telehealth Cluster, Chulalongkorn University.

References

1. Yu VL, Fagan LM, Wraith SM, Clancey WJ, Scott AC, Hannigan J, Blum RL, Buchanan BG, Cohen SN. Antimicrobial selection by a computer: a blinded evaluation by infectious diseases experts. JAMA J Am Med Assoc. 1979;242:1279–82.
2. Sinthanayothin C, Boyce JF, Cook HL, Williamson TH. Automated localisation of the optic disc, fovea, and retinal blood vessels from digital colour fundus images. Br J Ophthalmol. 1999;83:902–10.
3. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med. 2019;25:30–6.
4. Schmidt-Erfurth U, Sadeghipour A, Gerendas BS, Waldstein SM, Bogunović H. Artificial intelligence in retina. Prog Retin Eye Res. 2018;67:1–29.
5. Han J, Kamber M, Pei J. Data mining: concepts and techniques. Data Min Concepts Tech. 2012; https://doi.org/10.1016/C2009-0-61819-5.
6. Alpaydin E. Introduction to machine learning. 4th ed. MIT Press; 2020.
7. Indyk P, Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC '98 Proc. 30th Annu. ACM Symp. Theory Comput. 1998. p. 604–13.
8. Bengio Y, LeCun Y. Scaling learning algorithms towards AI. 2007.
9. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
10. Hosseini M-P, Lu S, Kamaraj K, Slowikowski A, Venkatesh HC. Deep Learning Architecture. In: Pedrycz W, Chen S-M, editors. Deep Learning: concepts and architectures. Cham: Springer International; 2020. p. 1–24.
11. Lewis DD. Naive (Bayes) at forty: the independence assumption in information retrieval. In: Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics). Springer; 1998. p. 4–15.
12. Alpaydin E. Introduction to Machine Learning. 3rd ed. https://doi.org/10.1007/978-1-62703-748-8_7.
13. Graupe D. Principles of Artificial Neural Networks. 2013. https://doi.org/10.1142/8868.
14. Weinberger KQ, Saul LK. Distance Metric Learning for large margin nearest neighbor classification. J Mach Learn Res. 2009;10:207–44.
15. Hastie T, Rosset S, Zhu J, Zou H. Multi-class AdaBoost. Stat Interface. 2009;2:349–60.
16. Schapire RE. Explaining AdaBoost. In: Empir. Inference Festschrift Honor Vladimir N. Vapnik. Berlin: Springer; 2013. p. 37–52.
17. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. 2012.
18. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–80.
19. Lu W, Tong Y, Yu Y, Xing Y, Chen C, Shen Y. Applications of artificial intelligence in ophthalmology: general overview. J Ophthalmol. 2018; https://doi.org/10.1155/2018/5278196.
20. Shen D, Wu G, Suk H-I. Deep Learning in medical image analysis. Annu Rev Biomed Eng. 2017;19:221–48.
21. Andrearczyk V, Whelan PF. Using filter banks in Convolutional Neural Networks for texture classification. Pattern Recognit Lett. 2016;84:63–9.
22. Robinson R. Convolutional Neural Networks – basics. 2017. https://mlnotebook.github.io/post/CNN1/. Accessed 20 Mar 2020.
23. Saxe AM, Koh PW, Chen Z, Bhand M, Suresh B, Ng AY. On random weights and unsupervised feature learning. In: 28th Int Conf Mach Learn ICML 2011. 2011. p. 1089–96.
24. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: 3rd Int. Conf. Learn. Represent. ICLR 2015 – Conf. Track Proc. 2015.
25. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. IEEE Computer Society; 2015. p. 1–9.
26. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. IEEE Computer Society; 2016. p. 770–8.
27. Tan M, Le QV. EfficientNet: rethinking model scaling for Convolutional Neural Networks. In: 36th Int Conf Mach Learn ICML 2019. 2019. p. 10691–700.
28. Zagoruyko S, Komodakis N. Wide residual networks. In: Br. Mach. Vis. Conf. 2016, BMVC 2016. British Machine Vision Association; 2016. p. 87.1–87.12.
29. Huang Y, Cheng Y, Bapna A, et al. GPipe: efficient training of giant neural networks using pipeline parallelism. In: Adv. Neural Inf. Process. Syst. 32 (NIPS 2019). Vancouver; 2019. p. 103–12.
30. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell. 2013;35:1798–828.
31. Christopher M, Belghith A, Bowd C, Proudfoot JA, Goldbaum MH, Weinreb RN, Girkin CA, Liebmann JM, Zangwill LM. Performance of Deep Learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs. Sci Rep. 2018;8:1–13.
32. Davidson B, Kalitzeos A, Carroll J, Dubra A, Ourselin S, Michaelides M, Bergeles C. Automatic cone photoreceptor localisation in healthy and Stargardt afflicted retinas using deep learning. Sci Rep. 2018;8:1–13.
33. Diaz GI, Fokoue-Nkoutche A, Nannicini G, Samulowitz H. An effective algorithm for hyperparameter optimization of neural networks. IBM J Res Dev. 2017; https://doi.org/10.1147/JRD.2017.2709578.
34. Models for image classification with weights trained on ImageNet. https://keras.io/applications/#models-for-image-classification-with-weights-trained-on-imagenet. Accessed 27 Mar 2020.
35. Ting DSW, Lee AY, Wong TY. An ophthalmologist's guide to deciphering studies in artificial intelligence. Ophthalmology. 2019;126:1475–9.
36. Raman R, Srinivasan S, Virmani S, Sivaprasad S, Rao C, Rajalakshmi R. Fundus photograph-based deep learning algorithms in detecting diabetic retinopathy. Eye. 2019;33:97–109.
37. Dy JG, Brodley CE. Feature selection for unsupervised learning. 2004.
38. Wilson JMG, Jungner G, World Health Organization. Principles and practice of screening for disease. 1968.
39. Dobrow MJ, Hagens V, Chafe R, Sullivan T, Rabeneck L. Consolidated principles for screening based on a systematic review and consensus process. CMAJ. 2018;190:E422–9.
40. Abràmoff MD, Lou Y, Erginay A, Clarida W, Amelon R, Folk JC, Niemeijer M. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Investig Ophthalmol Vis Sci. 2016;57:5200–6.
41. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digit Med. 2018;1:1–8.
42. van der Heijden AA, Abramoff MD, Verbraak F, van Hecke MV, Liem A, Nijpels G. Validation of automated screening for referable diabetic retinopathy with the IDx-DR device in the Hoorn Diabetes Care System. Acta Ophthalmol. 2018;96:63–8.
43. Tufail A, Rudisill C, Egan C, et al. Automated diabetic retinopathy image assessment software: diagnostic accuracy and cost-effectiveness compared with human graders. Ophthalmology. 2017;124:343–51.
44. Rajalakshmi R, Subashini R, Anjana RM, Mohan V. Automated diabetic retinopathy detection in smartphone-based fundus photography using artificial intelligence. Eye. 2018;32:1138–44.
45. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA – J Am Med Assoc. 2016;316:2402–10.
46. Ruamviboonsuk P, Krause J, Chotcomwongse P, et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. npj Digit Med. 2019;2:1–9.
47. Gulshan V, Rajan RP, Widner K, et al. Performance of a Deep-Learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. 2019;137:987–93.
48. Ting DSW, Cheung CYL, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA – J Am Med Assoc. 2017;318:2211–23.
49. Bellemo V, Lim G, Rim TH, et al. Artificial intelligence screening for diabetic retinopathy: the real-world emerging application. Curr Diab Rep. 2019; https://doi.org/10.1007/s11892-019-1189-3.
50. Ting DSW, Cheung CY, Nguyen Q, et al. Deep learning in estimating prevalence and systemic risk factors for diabetic retinopathy: a multi-ethnic study. npj Digit Med. 2019;2:1–8.
51. Li Z, Keel S, Liu C, et al. An automated grading system for detection of vision-threatening referable diabetic retinopathy on the basis of color fundus photographs. Diabetes Care. 2018;41:2509–16.
52. Mackenzie S, Schmermer C, Charnley A, Sim D, Tah V, Dumskyj M, Nussey S, Egan C. SDOCT imaging to identify macular pathology in patients diagnosed with diabetic maculopathy by a digital photographic retinal screening programme. PLoS One. 2011; https://doi.org/10.1371/journal.pone.0014811.
53. Wang YT, Tadarati M, Wolfson Y, Bressler SB, Bressler NM. Comparison of prevalence of diabetic macular edema based on monocular fundus photography vs optical coherence tomography. JAMA Ophthalmol. 2016;134:222–8.
54. Varadarajan AV, Bavishi P, Ruamviboonsuk P, et al. Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning. Nat Commun. 2020; https://doi.org/10.1038/s41467-019-13922-8.
55. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster DR. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2:158–64.
56. Faes L, Bodmer NS, Bachmann LM, Thiel MA, Schmid MK. Diagnostic accuracy of the Amsler grid and the preferential hyperacuity perimetry in the screening of patients with age-related macular degeneration: systematic review and meta-analysis. Eye. 2014;28:788–96.
57. AREDS2-HOME Study Research Group, Chew EY, Clemons TE, Bressler SB, Elman MJ, Danis RP, Domalpally A, Heier JS, Kim JE, Garfinkel R. Randomized trial of a home monitoring system for early detection of choroidal neovascularization home monitoring of the eye (HOME) study. Ophthalmology. 2014;121:535–44.
58. Ho R, Song LD, Choi JA, Jee D. The cost-effectiveness of systematic screening for age-related macular degeneration in South Korea. PLoS One. 2018; https://doi.org/10.1371/journal.pone.0206690.
59. Chew EY, Schachat AP. Should we add screening of age-related macular degeneration to current screening programs for diabetic retinopathy?
72. Zheng C, Johnson TV, Garg A, Boland MV. Artificial
Ophthalmology. 2015;122:2155–6. intelligence in glaucoma. Curr Opin Ophthalmol.
60. Chew EY, Clemons TE, SanGiovanni JP, et al. 2019;30:97–103.
Lutein + zeaxanthin and omega-3 fatty acids 73. Muhammad H, Fuchs TJ, De Cuir N, De Moraes
for age-related macular degeneration: the Age- CG, Blumberg DM, Liebmann JM, Ritch R, Hood
Related Eye Disease Study 2 (AREDS2) ran- DC. Hybrid Deep Learning on single wide-field opti-
domized clinical trial. JAMA – J Am Med Assoc. cal coherence tomography scans accurately classifies
2013;309:2005–15. glaucoma suspects. J Glaucoma. 2017;26:1086–94.
61. Burlina PM, Joshi N, Pekala M, Pacheco KD, 74. Christopher M, Belghith A, Weinreb RN, Bowd C,
Freund DE, Bressler NM. Automated grading of Goldbaum MH, Saunders LJ, Medeiros FA, Zangwill
age-related macular degeneration from color fundus LM. Retinal nerve fiber layer features identified by
images using deep convolutional neural networks. unsupervised machine learning on optical coherence
JAMA Ophthalmol. 2017;135:1170–6. tomography scans predict glaucoma progression.
62. Grassmann F, Mengelkamp J, Brandl C, Harsch S, Investig Ophthalmol Vis Sci. 2018;59:2748–56.
Zimmermann ME, Linkohr B, Peters A, Heid IM, 75. Lin A, Hoffman D, Gaasterland DE, Caprioli
Palm C, Weber BHF. A Deep Learning algorithm J. Neural networks to identify glaucomatous
for prediction of age-related eye disease study visual field progression. Am J Ophthalmol.
severity scale for age-related macular degeneration 2003;135:49–54.
from color fundus photography. Ophthalmology. 76. Yousefi S, Kiwaki T, Zheng Y, Sugiura H, Asaoka
2018;125:1410–20. R, Murata H, Lemij H, Yamanishi K. Detection
63. Burlina PM, Joshi N, Pacheco KD, Freund DE, Kong of longitudinal visual field progression in glau-
J, Bressler NM. Use of Deep Learning for detailed coma using machine learning. Am J Ophthalmol.
severity characterization and estimation of 5-year 2018;193:71–9.
risk among patients with age-related macular degen- 77. Li F, Wang Z, Qu G, et al. Automatic differentiation
eration. JAMA Ophthalmol. 2018;136:1359–66. of Glaucoma visual field from non-glaucoma visual
64. Kermany DS, Goldbaum M, Cai W, et al. Identifying filed using deep convolutional neural network. BMC
medical diagnoses and treatable diseases by image- Med Imaging. 2018;18:35.
based Deep Learning. Cell. 2018;172:1122–1131. 78. Bowd C, Hao J, Tavares IM, Medeiros FA, Zangwill
e9. LM, Lee TW, Sample PA, Weinreb RN, Goldbaum
65. Treder M, Lauermann JL, Eter N. Automated detec- MH. Bayesian machine learning classifiers for com-
tion of exudative age-related macular degeneration in bining structural and functional measurements to
spectral domain optical coherence tomography using classify healthy and glaucomatous eyes. Investig
deep learning. Graefe’s Arch Clin Exp Ophthalmol. Ophthalmol Vis Sci. 2008;49:945–53.
2018;256:259–65. 79. Medeiros FA, Jammal AA, Thompson AC. From
66. Lee CS, Baughman DM, Lee AY. Deep Learning is machine to machine: an OCT-trained deep learning
effective for classifying normal versus age-related algorithm for objective quantification of glaucoma-
macular degeneration OCT images. Kidney Int Rep. tous damage in fundus photographs. Ophthalmology.
2017;1:322–7. 2019;126:513–21.
67. Prahs P, Radeck V, Mayer C, Cvetkov Y, Cvetkova 80. Jammal AA, Thompson AC, Mariottoni EB,
N, Helbig H, Märker D. OCT-based deep learning Berchuck SI, Urata CN, Estrela T, Wakil SM, Costa
algorithm for the evaluation of treatment indica- VP, Medeiros FA. Human versus machine: compar-
tion with anti-vascular endothelial growth factor ing a Deep Learning algorithm to human gradings
medications. Graefe’s Arch Clin Exp Ophthalmol. for detecting glaucoma on fundus photographs. Am
2018;256:91–8. J Ophthalmol. 2020;211:123–31.
68. Hwang DK, Hsu CC, Chang KJ, et al. Artificial 81. Gao X, Lin S, Wong TY. Automatic feature learning
intelligence-based decision-making for age-related to grade nuclear cataracts based on deep learning.
macular degeneration. Theranostics. 2019;9:232–45. IEEE Trans Biomed Eng. 2015;62:2693–701.
69. De Fauw J, Ledsam JR, Romera-Paredes B, et al. 82. Wu X, Huang Y, Liu Z, et al. Universal artificial
Clinically applicable deep learning for diag- intelligence platform for collaborative management
nosis and referral in retinal disease. Nat Med. of cataracts. Br J Ophthalmol. 2019;103:1553–60.
2018;24:1342–50. 83. Sramka M, Slovak M, Tuckova J, Stodulka
70. Li Z, He Y, Keel S, Meng W, Chang RT, He P. Improving clinical refractive results of cataract
M. Efficacy of a Deep Learning system for surgery by machine learning. PeerJ. 2019; https://
detecting glaucomatous optic neuropathy based doi.org/10.7717/peerj.7202.
on color fundus photographs. Ophthalmology. 84. Koprowski R, Lanza M, Irregolare C. Corneal power
2018;125:1199–206. evaluation after myopic corneal refractive surgery
71. Abrams LS, Scott IU, Spaeth GL, Quigley HA, using artificial neural networks. Biomed Eng Online.
Varma R. Agreement among optometrists, ophthal- 2016;15:121.
mologists, and residents in evaluating the optic disc 85. Yu F, Silva Croso G, Kim TS, Song Z, Parker F,
for glaucoma. Ophthalmology. 1994;101:1662–7. Hager GD, Reiter A, Vedula SS, Ali H, Sikder
52 P. Ruamviboonsuk et al.
S. Assessment of automated identification of phases 99. Redd TK, Campbell JP, Brown JM, et al. Evaluation
in videos of cataract surgery using machine learning of a deep learning image assessment system for
and deep learning techniques. JAMA Netw Open. detecting severe retinopathy of prematurity. Br J
2019;2:e191860. Ophthalmol. 2019;103:580–4.
86. Morita S, Tabuchi H, Masumoto H, Yamauchi T, 100. Brown JM, Campbell JP, Beers A, et al. Automated
Kamiura N. Real-time extraction of important surgi- diagnosis of plus disease in retinopathy of prema-
cal phases in cataract surgery videos. Sci Rep. 2019; turity using deep convolutional neural networks.
https://doi.org/10.1038/s41598-019-53091-8. JAMA Ophthalmol Am Med Assoc. 2018:803–10.
87. Gilbert C, Wormald R, Fielder A, Deorari A, Zepeda- 101. Taylor S, Brown JM, Gupta K, et al. Monitoring dis-
Romero LC, Quinn G, Vinekar A, Zin A, Darlow ease progression with a quantitative severity scale
B. Potential for a paradigm change in the detection for retinopathy of prematurity using Deep Learning.
of retinopathy of prematurity requiring treatment. JAMA Ophthalmol. 2019;137:1022–8.
Arch Dis Child Fetal Neonatal Ed. 2016;101:F6–7. 102. Worrall DE, Wilson CM, Brostow GJ. Automated
88. Salvin JH, Lehman SS, Jin J, Hendricks DH. Update retinopathy of prematurity case detection with
on retinopathy of prematurity: treatment options and convolutional neural networks. https://doi.
outcomes. Curr Opin Ophthalmol. 2010;21:329–34. org/10.1007/978-3-319-46976-8.
89. Wong TY, Sun J, Kawasaki R, et al. Guidelines 103. Long E, Lin H, Liu Z, et al. An artificial intelligence
on diabetic eye care: the International Council of platform for the multihospital collaborative man-
Ophthalmology recommendations for screening, agement of congenital cataracts. Nat Biomed Eng.
follow-up, referral, and treatment based on resource 2017;1:1–8.
settings. Ophthalmology. 2018;125:1608–22. 104. Lin H, Li R, Liu Z, et al. Diagnostic efficacy and
90. Davitt BV, Wallace DK. Plus disease. Surv therapeutic decision-making capacity of an artificial
Ophthalmol. 2009;54:663–70. intelligence platform for childhood cataracts in eye
91. Daniel E, Quinn GE, Hildebrand PL, et al. Validated clinics: a multicentre randomized controlled trial.
system for centralized grading of retinopathy of EClinicalMedicine. 2019;9:52–9.
prematurity: telemedicine approaches to evaluating 105. Liu X, Jiang J, Zhang K, et al. Localization and diag-
acute-phase Retinopathy of Prematurity (e-ROP) nosis framework for pediatric cataracts based on slit-
Study. JAMA Ophthalmol. 2015;133:675–82. lamp images using deep features of a convolutional
92. Ting DSW, Pasquale LR, Peng L, Campbell JP, Lee neural network. PLoS One. 2017;12:e0168606.
AY, Raman R, Tan GSW, Schmetterer L, Keane PA, 106. Reid JE, Eaton E. Artificial intelligence for pedi-
Wong TY. Artificial intelligence and deep learning in atric ophthalmology. Curr Opin Ophthalmol.
ophthalmology. Br J Ophthalmol. 2019;103:167–75. 2019;30:337–46.
93. Capowski JJ, Kylstra JA, Freedman SF. A numeric 107. Alsaih K, Lemaitre G, Rastgoo M, Massich J, Sidibé
index based on spatial frequency for the tortuos- D, Meriaudeau F. Machine learning techniques for
ity of retinal vessels and its application to plus diabetic macular edema (DME) classification on
disease in retinopathy of prematurity. Retina. SD-OCT images. Biomed Eng Online. 2017;16:68.
1995;15:490–500. 108. Schlegl T, Waldstein SM, Bogunovic H, Endstraßer
94. Heneghan C, Flynn J, O’Keefe M, Cahill F, Sadeghipour A, Philip AM, Podkowinski D,
M. Characterization of changes in blood vessel width Gerendas BS, Langs G, Schmidt-Erfurth U. Fully
and tortuosity in retinopathy of prematurity using automated detection and quantification of macular
image analysis. Med Image Anal. 2002;6:407–29. fluid in OCT using Deep Learning. Ophthalmology.
95. Swanson C, Cocker KD, Parker KH, Moseley MJ, 2018;125:549–58.
Fielder AR. Semiautomated computer analysis of 109. Zou KH, Warfield SK, Bharatha A, Tempany CMC,
vessel growth in preterm infants without and with Kaus MR, Haker SJ, Wells WM, Jolesz FA, Kikinis
ROP. Br J Ophthalmol. 2003;87:1474–7. R. Statistical validation of image segmentation qual-
96. Gelman R, Martinez-Perez ME, Vanderveen DK, ity based on a Spatial Overlap Index. Acad Radiol.
Moskowitz A, Fulton AB. Diagnosis of plus disease 2004;11:178–89.
in retinopathy of prematurity using retinal image 110. Fang L, Cunefare D, Wang C, Guymer RH, Li S,
multiScale analysis. Investig Ophthalmol Vis Sci. Farsiu S. Automatic segmentation of nine retinal
2005;46:4734–8. layer boundaries in OCT images of non-exudative
97. Ataer-Cansizoglu E, Bolon-Canedo V, Campbell AMD patients using deep learning and graph search.
JP, et al. Computer-based image analysis for plus Biomed Opt Express. 2017;8:2732.
disease diagnosis in retinopathy of prematurity: 111. Camino A, Wang Z, Wang J, Pennesi ME, Yang P,
performance of the “i-ROP” system and image fea- Huang D, Li D, Jia Y. Deep learning for the segmen-
tures associated with expert diagnosis. Transl Vis Sci tation of preserved photoreceptors on en face optical
Technol. 2015;4:5. coherence tomography in two inherited retinal dis-
98. Wang J, Ju R, Chen Y, Zhang L, Hu J, Wu Y, Dong eases. Biomed Opt Express. 2018;9:3092.
W, Zhong J, Yi Z. Automated retinopathy of pre- 112. Chen M, Wang J, Oguz I, VanderBeek BL, Gee
maturity screening using deep neural networks. JC. Automated segmentation of the choroid in
EBioMedicine. 2018;35:361–8. EDI- OCT images with retinal pathology using
3 Overview of Artificial Intelligence Systems in Ophthalmology 53
convolution neural networks. In: Lect. Notes 121. Gerendas BS, Bogunovic H, Sadeghipour A,
Comput. Sci. (including Subser. Lect. Notes Artif. Schlegl T, Langs G, Waldstein SM, Schmidt-Erfurth
Intell. Lect. Notes Bioinformatics). Springer; U. Computational image analysis for prognosis deter-
2017. p. 177–184. mination in DME. Vision Res. 2017;139:204–10.
113. Hagiwara Y, Koh JEW, Tan JH, Bhandary SV, Laude 122. Suner IJ, Yau L, Lai P. HARBOR Study: one-year
A, Ciaccio EJ, Tong L, Acharya UR. Computer- results of efficacy and safety of 2.0 mg versus 0.5 mg
aided diagnosis of glaucoma using fundus images: ranibizumab in patients with subfoveal choroidal
a review. Comput Methods Programs Biomed. neovascularization secondary to age-related macu-
2018;165:1–12. lar degeneration | IOVS | ARVO Journals. Invest
114. Devalla SK, Chin KS, Mari JM, Tun TA, Strouthidis Ophthalmol Vis Sci. 2012;53.
N, Aung T, Thiéry AH, Girard MJA. A deep learn- 123. Schmidt-Erfurth U, Bogunovic H, Sadeghipour
ing approach to digitally stain optical coherence A, Schlegl T, Langs G, Gerendas BS, Osborne A,
tomography images of the optic nerve head. Investig Waldstein SM. Machine learning to analyze the
Ophthalmol Vis Sci. 2018;59:63–74. prognostic value of current imaging biomarkers
115. Devalla SK, Renukanand PK, Sreedhar BK, et al. in neovascular age-related macular degeneration.
DRUNET: a dilated-residual U-Net deep learn- Ophthalmol Retin. 2018;2:24–30.
ing network to segment optic nerve head tissues in 124. Bogunovic H, Waldstein SM, Schlegl T, Langs G,
optical coherence tomography images. Biomed Opt Sadeghipour A, Liu X, Gerendas BS, Osborne A,
Express. 2018;9:3244. Schmidt-Erfurth U. Prediction of anti-VEGF treat-
116. Arcadu F, Benmansour F, Maunz A, Willis J, ment requirements in neovascular AMD using a
Haskova Z, Prunotto M. Deep learning algorithm machine learning approach. Invest Ophthalmol Vis
predicts diabetic retinopathy progression in indi- Sci. 2017;58:3240–8.
vidual patients. npj Digit Med. 2019. https://doi. 125. Schmidt-Erfurth U, Waldstein SM, Klimscha S,
org/10.1038/s41746-019-0172-3. Sadeghipour A, Hu X, Gerendas BS, Osborne A,
117. Nguyen QD, Brown DM, Marcus DM, et al. Bogunović H. Prediction of individual disease con-
Ranibizumab for diabetic macular edema: results version in early AMD using artificial intelligence.
from 2 phase iii randomized trials: RISE and Investig Ophthalmol Vis Sci. 2018;59:3199–208.
RIDE. Ophthalmology. 2012;119:789–801. 126. Larsen M, Waldstein SM, Boscia F, et al.
118. Rohm M, Tresp V, Müller M, Kern C, Manakov I, Individsualized ranibizumab regimen driven by sta-
Weiss M, Sim DA, Priglinger S, Keane PA, Kortuem bilization criteria for central retinal vein occlusion:
K. Predicting visual acuity by using machine twelve-month results of the CRYSTAL Study. In:
learning in patients treated for neovascular age- Ophthalmology. Elsevier; 2016. p. 1101–1111.
related macular degeneration. Ophthalmology. 127. Vogl WD, Waldstein SM, Gerendas BS, Schlegl T,
2018;125:1028–36. Langs G, Schmidt-Erfurth U. Analyzing and predict-
119. Kortüm KU, Müller M, Kern C, Babenko A, Mayer ing visual acuity outcomes of anti-VEGF therapy
WJ, Kampik A, Kreutzer TC, Priglinger S, Hirneiss by a longitudinal mixed effects model of imag-
C. Using electronic health records to build an oph- ing and clinical data. Investig Ophthalmol Vis Sci.
thalmologic data warehouse and visualize patients’ 2017;58:4173–81.
data. Am J Ophthalmol. 2017;178:84–93. 128. Grzybowski A, Brona P, Lim G, Ruamviboonsuk P,
120. Wells JA, Glassman AR, Ayala AR, et al. Aflibercept, Tan GSW, Abramoff M, Ting DSW. Artificial intel-
bevacizumab, or ranibizumab for diabetic macular ligence for diabetic retinopathy screening: a review.
edema. N Engl J Med. 2015;372:1193–203. Eye. 2019;34:451–60.
4 Autonomous Artificial Intelligence Safety and Trust

Michael D. Abramoff
system, thus offering a pathway for other autonomous AI to be reimbursed.

No prior processes existed for determining the safety, efficacy and equity of such autonomous AI systems that might serve as guidance. This chapter goes through the why and how of ethical and accountability principles for autonomous AI, explains how they were practically addressed, and how they form the foundation for a series of requirements for autonomous AI that were used in design, in the de novo FDA clearance process, and in ongoing implementation. It leans heavily on the concepts addressed in "Lessons Learned About Autonomous AI: Finding a Safe, Efficacious, and Ethical Path Through the Development Process" [10].

Ultimately, successful introduction of autonomous AI into the healthcare system is dependent on trust in this technology, and it is thus vital that everyone involved with autonomous AI helps build trust [11].

Autonomous AI for the Diabetic Retinopathy Exam

The patient and societal benefits of early detection of diabetic retinopathy are well established [12–15]. This diabetic eye exam is typically performed as an indirect dilated retinal examination, using slitlamp biomicroscopy and indirect binocular ophthalmoscopy, as well as macular optical coherence tomography, by an ophthalmologist, retina specialist, or optometrist. Thus, many practice guidelines referred to this method as the standard of care [16]. In the 1990s, evidence showed that telemedicine for diabetic retinopathy, where retinal images are taken and then evaluated remotely, has at least a similar patient safety profile as the dilated retinal examination [17, 18]. While improving access, traditional telemedicine does not allow instantaneous, point of care diagnosis, nor has the safety, efficacy and equity of its process been established in a scientifically valid, hypothesis-testing manner. Thus, trust has generally been limited. There is a delay of typically days between when the patient is imaged and the availability of an ophthalmologist to read images and diagnose the patient. In many cases, low image quality requires calling the patient back for another examination. Thus, safer, more trustworthy solutions were desirable.

Patient adherence to any form of regular diabetic eye examination is generally low around the world, a state of affairs primarily caused by lack of access and convenience. Recent research shows that in the US, adherence to a regular documented diabetic eye exam remains as low as 15.3%, even though all practice guidelines and standards recommend regular diabetic eye exams [16, 19, 20]. Compared to testing at a remote laboratory, point of care diagnosis increases access, as has been shown for point of care A1C testing [21, 22], and has even been shown to improve clinical outcome [23].

Historically, clinicians are trained during medical school, residency and possibly fellowship, and their medical competency then continues to be overseen by Medical Boards. However, in practice, clinicians are rarely validated on their safety, efficacy and equity for a specific diagnostic process against a valid standard. Furthermore, consistency of clinicians performing the diabetic eye exam is limited, and the only scientifically valid studies comparing clinicians to the most rigorous patient outcome-based prognostic reference standard show that clinician accuracy, expressed as sensitivity, does not exceed 50% [24, 25]. Additionally, documentation of a diabetic eye exam in the chart, once performed, is a human-driven process with many points of potential failure, so that too often, the fact that a diabetic eye exam was performed cannot be verified from the patient's chart. Operational coding systems for physician activities are typically not fine-grained enough to establish that a diabetic eye exam was performed [26–28]. Instead, typically only the fact that the physician spent time interacting with the patient is documented, if it is documented at all.

Together, these are major challenges with process integrity for the diabetic eye exam (both traditional, in the chair, as well as through telemedicine). In other words, the traceability of what exactly happens to the patient when the need for a diabetic eye exam is established is limited, because the accuracy of the diagnostic process is unknown, and because the documentary evidence that a diabetic eye exam took place is completely determined by fallible clinician documentation [29]. To increase trust, program integrity for the diabetic eye exam needs to be as high as possible.

The use of artificial intelligence for medical diagnostic purposes started in the 1960s with Mycin, an AI that helped physicians prescribe antibiotics [30]. This continued through the 1980s with algorithms such as the perceptron [31] and multilayer neural network learning using backpropagation [32]. Safe performance of these AI systems was limited, mainly because of a lack of high-quality, maximally objective input data. Input consisted of physicians interpreting the patient's symptoms and signs and then typing them in, a process with an inherently low signal to noise ratio. Instead, much of the foundational methodology for modern autonomous AI has been developed over the past decades. The recent introduction of affordable digital retinal cameras with high-quality complementary metal-oxide semiconductor (CMOS) image sensors was a pivotal moment. These CMOS image sensors make it possible to acquire images of high fidelity and consistency, such as retinal images of people with diabetes, and thus provide highly objective input data for AI algorithms. The resulting higher performance and potentially higher safety is necessary for increasing trust in autonomous AI.

In summary, diagnostic autonomous AI systems, such as IDx-DR, provide a direct diagnostic recommendation for the point of care diagnosis of diabetic retinopathy and diabetic macular edema. Autonomous AI allows the diagnosis of diabetic retinopathy and diabetic macular edema to be performed in real time, with the goals of higher adherence, improved access, higher cost-effectiveness [33], increased accuracy and diagnosability [34], and increased program integrity.

Can We Trust It? Concerns About Autonomous AI

The idea of 'a computer making a diagnosis' generates concerns in both physicians and patients, as is to be anticipated with any new technology. We will address the most common ones in this section: undesired racial, sex or ethnic bias, validation, what type of trial and what to compare the AI against, what safety threshold to use, data usage, and liability [10].

A recent study of an AI showed that using medical cost as a proxy for patients' overall health needs led to inappropriate racial bias in allocating healthcare resources, as black patients were incorrectly deemed to have lower risk compared to white patients because their incurred costs were lower for a given health risk status [35]. Another study showed a consistent decrease in performance for underrepresented sex categories in the AI's machine learning training data when a minimum balance was not fulfilled [36].

To demonstrate AI systems' safety, scientifically valid studies that are replicable are essential. The design of these studies is a concern. Some, like healthcare pundit Eric Topol, have proposed that Randomized Clinical Trials (RCT) are the gold standard for diagnostic AI [37], though there is clear evidence that other study designs can work as well or better [38]. For example, RCTs for diagnostic AI require an arm where patient outcome, including need for intervention, is decided based only on the AI output, without possible override by a clinician. If the AI is not perfect, this requires potentially withholding effective treatment for a treatable condition where we know how to improve outcome. Most Institutional Review Boards see this as unethical, as the benefit to other patients and society may not outweigh the harm to the patient who has been diagnosed by AI [39, 40]. As diagnostic AIs are typically designed for conditions where effective interventions are available, RCTs are ethically unsound. In other words, while a null hypothesis of "no effect" works well in most interventional trials, a null hypothesis of "not informative" is not appropriate for validation of diagnostic AI [41].

Replicability is another big problem affecting trust in safety: without preregistration, AI performance tends to be overestimated, and successful study replication becomes less likely. In fact, when comparing trials with and without preregistration, trial effect sizes are larger when they lack preregistration [34, 42].
Often, to demonstrate AI systems' safety, their output is compared against the diagnosis by clinicians or groups of clinicians, called a "reference standard" [43]. This approach assumes that the AI system's safety is highest when it most closely matches clinicians' diagnosis. There are several problems with this approach: (a) showing that an AI system is safer than clinicians is impossible, as a discrepancy between clinician and AI system will by definition be attributed to an error by the AI rather than the clinician; (b) clinicians differ greatly on typical diagnostic tasks, in many cases in 30% of cases or more [44], and it is thus impossible to determine which clinician is right and which is wrong; (c) if compared to clinical outcome, such as with a prognostic standard, the ultimate determinant of clinical relevance, clinicians perform poorly, for example achieving only 33% and 34% in the only two studies comparing clinicians diagnosing diabetic retinopathy as determined by the Wisconsin Reading Center ETDRS standard [24, 25]. Thus, estimates of safety are greatly affected by the choice of reference standard.

In 2007, Fenton and colleagues first demonstrated the importance of rigorous validation of AI in the actual workflow setting, rather than in a modeled laboratory setting [45]. In this pivotal study, the outcomes of women undergoing breast cancer screening by a radiologist assisted by a previously FDA-approved assistive AI system were compared to women who underwent breast cancer screening by a radiologist without such an assistive AI. The assistive AI had, in 2000, been approved by FDA on the basis of a study that showed that, when used in isolation, the assistive AI had high diagnostic accuracy compared to radiologists. When this assistive AI system was tested in a study design that reflected actual usage, assisting a radiologist who makes the final clinical decision, the study showed worse outcomes for the women who underwent breast cancer screening with AI assistance.

For continued trust and acceptance, the autonomous AI's impact on clinic workflow needs to be optimized, and optimal clinic workflow of the patient around the autonomous AI is an important aspect. One metric is so-called population diagnosability. Population diagnosability is defined as

    PS = (nn + np) / (nn + np + nx + ni)

where

    np = number of subjects that received a positive diagnostic result
    nn = number of subjects that received a negative diagnostic result
    nx = number of subjects that were excluded for any reason from study completion
    ni = number of subjects that received an insufficient input quality result

For example, if in a completed validation study the total number of subjects recruited for the study, n, is 1000, 200 subjects (nx) were excluded from analysis for various reasons, and the autonomous AI gave an invalid input quality result (ni) for 100 subjects, PS = 0.7. Obviously, a higher population diagnosability improves efficiency and especially workflow, as there is less risk of an individual patient not being diagnosed by the autonomous AI and needing to fall back on the in-person exam.

Safety, for a diagnostic process, including for an autonomous AI system, is typically measured as sensitivity, as appropriate for binary outcomes, more so than metrics such as Receiver Operator Characteristic (ROC) analysis [46]. In many cases, the purpose of the diagnostic process is to find (true) cases in a population. While a high sensitivity maximizes the efficiency of such a process, a population-level analysis will show that adherence with the diagnostic process plays as important a role. For example, if a diagnostic process has a high sensitivity of 90%, but only 10% of a population undergoes the diagnostic process, the cases in the remaining 90% of the population will not be found, lowering the
"population achieved sensitivity". In order to account for this diagnostic access bias [47], the population achieved sensitivity (PAS), or "compliance corrected sensitivity", can be calculated as follows:

    PAS = (sc × c × pc) / (c × pc + (1 − c) × pnc)

where

• sc = sensitivity (as determined in the compliant population)
• c = compliance
• pc = measured prevalence in the compliant population
• pnc = estimated prevalence in the noncompliant population

If we assume that pc ≅ pnc, i.e. the prevalence of the disease is the same in the non-compliant as in the compliant part of the population, we can use the simplified estimate sc × c:

    PAS ≅ sc × c

This estimated PAS would then form an upper bound, as in most cases, prevalence in the non-compliant subpopulation is higher than in the compliant part.

For example, if compliance c with the diabetic eye exam is 15% [19], and the minimal acceptable sensitivity is 85% [34], the population achieved sensitivity (PAS) = 0.13. In other words, only 13% of all cases in the population will be identified correctly with this diagnostic system.

These metrics show the importance of including workflow and patient experience in general in the study design, as a focus on sensitivity in the compliant population may not give the optimal population benefit, and increasing compliance with the autonomous AI may have more effect on its safety than increasing sensitivity.

The development of any AI requires vast amounts of clinical data. There are many statutes and regulations covering patient-derived data, such as HIPAA and HITECH [48]. Ultimately, whether patient-derived data belongs to the patient, the physician, the hospital system, or even whoever paid for acquisition has not been fully litigated, and as such can easily lead to concerns and controversy. For example, in one case patient data for training AI was obtained through an agreement with a health system [49]. While agreements were in place, patients and physicians were not aware of this data usage, leading to confusion, so that the Department of Health and Human Services became involved. In another example, a class action lawsuit alleging failure to adequately deidentify patient data for AI training was initiated against an academic health system.

Autonomous AI is typically designed and validated for a narrow diagnostic task, and so-called incidental findings, which potentially could have been discovered by a general exam, are not flagged. For our example, which diagnoses diabetic retinopathy and diabetic macular edema, other diagnoses such as glaucoma or macular degeneration will not be made, as the autonomous AI is neither designed nor validated for anything but DR. While there is widespread evidence for the effectiveness and cost-effectiveness of early detection of diabetic retinopathy [15], this is currently not yet the case for glaucoma [50], macular degeneration [51] and many other eye diseases. Therefore, clinical trials for a specific AI will typically not be designed or powered to analyze diagnostic accuracy on other retinal abnormalities or other eye abnormalities in people with diabetes. However, it is important to note that little evidence exists on how accurately clinicians can diagnose these incidental findings: their performance has not been evaluated in formal studies. In fact, such studies may be logistically impossible and challenging to power, and therefore are unlikely to be performed because of the enormous number of subjects required. For example, at a prevalence of 5 per million, evaluating the diagnostic performance of either ophthalmologists or autonomous AI for choroidal melanoma diagnosis would require a clinical trial with approximately 40 million subjects [52].

In the past, liability for medical errors contributed to by AI has generally been attributed to the physician using it [53]. While this is
60 M. D. Abramoff
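Returning to the population achieved sensitivity estimate and the trial-size arithmetic above: both fit in a few lines of Python. This is an illustrative sketch only; the function names, and the assumption of roughly 200 required cases in the rare-disease example, are ours and not from the chapter.

```python
# Illustrative sketch of the population achieved sensitivity (PAS)
# estimate; function and variable names are ours, not the chapter's.

def population_achieved_sensitivity(sc, c, pc, pnc):
    """sc: sensitivity in the compliant population; c: compliance;
    pc, pnc: prevalence in the compliant / noncompliant populations."""
    return (sc * c * pc) / (c * pc + (1 - c) * pnc)

def simplified_pas(sc, c):
    """Upper-bound estimate PAS ~= sc * c, valid when pc ~= pnc."""
    return sc * c

def subjects_needed(prevalence, cases_required):
    """Trial size needed to observe a given number of cases of a rare
    disease; cases_required is an assumed figure, for illustration."""
    return cases_required / prevalence

# Chapter example: 15% compliance and 85% sensitivity give PAS ~= 0.1275,
# so only about 13% of all cases in the population are identified.
pas = simplified_pas(0.85, 0.15)

# At a prevalence of 5 per million, even ~200 observed cases (an assumed
# figure) already imply a trial on the order of 40 million subjects [52].
n = subjects_needed(5e-6, 200)
```

Because the denominator grows when pnc exceeds pc, the simplified estimate sc × c is indeed an upper bound, as the text notes.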
While this is acceptable practice for assistive AI, as ultimately the medical decision is made by the physician using the AI, this may not be the case for autonomous AI, where after all, the medical decision is made by the AI, without input by that physician. Nevertheless, many AI creators have publicly refused to take liability for their AI products, as shown by the ongoing liability debate concerning autonomous driving [54].

Trust in autonomous AI is essential, and definitely, lack of trust has negatively affected other medical innovations. In the early 2000s, gene therapy effectively went through a moratorium on research funding, including closure of research institutions, after several young people died in poorly planned and executed gene therapy studies [55]. Only in 2017, almost two decades later, did FDA approve the first ever gene therapy, for the RPE65 variant of Leber's Congenital Amaurosis [56]. In short, trust in autonomous AI needs to be earned.

Building Trust: An Ethical Foundation for Autonomous AI Requirements

Successful introduction of autonomous AI into the healthcare system is dependent on trust, and it is essential that everyone involved with autonomous AI helps build such trust [6, 11]. This is best done through an ethical foundation. Previously, Char, Abramoff et al. derived a foundation for autonomous AI in healthcare from bioethical principles as well as accountability principles, implemented in Abramoff et al. [10]. We ensured alignment with classical bioethics principles, such as Beauchamp and Childress [57]. The following table is modified from our paper [10]; copyright retained by the authors (Table 4.1).

Table 4.1 Deriving requirements from bioethical principles

Autonomous AI requirement (relevant bioethical principles [57]):
• Improve patient outcome, as shown either by direct evidence or linked clinical literature, and aligned with evidence based clinical standards of care/practice patterns from quality of care organizations, professional medical societies and patient organizations, while accounting for safety, efficacy and equity (Nonmaleficence, Beneficence, and Equity)
• Design so its operations are maximally reducible to characteristics aligned with scientific knowledge of human clinician cognition (Beneficence, Non-maleficence)
• Maximize traceability of patient derived data, and commensurate data stewardship, accountability, and authorization, including by adherence to accepted standards (Accountability and Respect for Autonomy)
• Validate rigorously for safety, efficacy and equity, using preregistered clinical studies, by comparing against clinical outcome, or outcome surrogates in the case of chronic diseases, in the intended clinical workflow and usage, as shown by either direct or linked evidence (Nonmaleficence and Equity)
• Assume liability commensurate with indications for use and autonomy (Accountability and Equity)

The following sections will illustrate these various requirements for autonomous AI as they are relevant. Addressing such requirements during the design, validation and implementation is expected to increase trust from all healthcare stakeholders, including patients, physicians, regulators, and payers.

Autonomous AI System Design Requirements

Considerations for autonomous AI design can have unexpected and profound ethical implications. When autonomous AI is designed so its operations are maximally reducible to characteristics aligned with scientific knowledge of human clinician cognition, it ethically aligns with non-maleficence. The use of patho-physiologically sound priors, such as biomarkers, and exploiting higher order coherence in the input data, not only helps gain trust from regulators, physicians and patients, but also improves safety and equity, thus aligning with non-maleficence and justice.
4 Autonomous Artificial Intelligence Safety and Trust 61
Machine learning algorithms that mimic closely how clinicians diagnose have been shown to be more robust to small perturbations in the input, to show less catastrophic failure, and to be less likely to exhibit inappropriate racial and other bias [58, 59]. Black box or gray box algorithm designs make such bias harder to mitigate and detect, while the speed and scalability can multiply the effect of inappropriate bias faster than traditional enforcement efforts can react. Take as an example the design of autonomous AI for diabetic retinopathy: for over 150 years, clinicians have evaluated a patient's retina for the different indicators of diabetic retinopathy such as hemorrhages, microaneurysms, and neovascularization [60, 61]. These are all indicators or biomarkers that are invariant with regard to race, ethnicity, sex, and age. Using multiple, statistically dependent detectors for such biomarkers [62, 63], each optimized using machine learning algorithms, mitigates the problems sketched.

Designing the autonomous AI so that it improves patient outcome, as shown either by direct evidence or linked clinical literature, ensures patient benefit, and ethically aligns with non-maleficence and justice. Similarly, alignment with evidence based clinical standards of care/practice patterns from quality of care organizations, professional medical societies and patient organizations, while accounting for safety, efficacy and equity, again aligns ethically with non-maleficence and justice. 'Glamour AI', or AI which is technologically of high interest but does not improve outcome, is thus avoided by this requirement.

For our example, when the goal is to create a real time point of care autonomous retinal exam for diabetic retinopathy, non-maleficence was maximized by ensuring that diabetic macular edema was diagnosed accurately by the autonomous AI system, as is standard of care [8, 16]. Somewhat surprisingly, this design requirement is different from that of most AI designs, which focus only on the classic ischemic variant of diabetic retinopathy, and ignore macular edema. At most, they only look for exudates, which in many cases are not proxies for the presence or absence of diabetic macular edema [34, 64].

Evidence of the relationship to outcome is essential to confirm that the autonomous AI improves outcome. The level of disease diagnosed by an autonomous AI can align directly with the risk of poor outcome: an IDx-DR negative output (less than ETDRS level 35 and no macular edema) implies a risk of 1.7% or less of proliferative retinopathy in 3 years, and a risk of 2.4% or less of DME in 1 year, while an IDx-DR positive output (ETDRS level 35 or higher, or center-involved or clinically significant macular edema) confers a risk of at least 18% of proliferative retinopathy in 3 years, as well as a risk of 17.7% of developing DME in 1 year, if left untreated [34, 65]. The autonomous AI output is thus tied to patient relevant clinical outcome.

Autonomous AI System Validation Requirements

Just as is the case for its design, the validation of an autonomous AI can also have unexpected and profound ethical implications. To maximize trust, autonomous AI should be validated rigorously for safety, efficacy and equity, ideally using preregistered clinical studies, by comparing the AI against clinical outcome, or prognostic standards in the case of chronic diseases, within the intended clinical workflow and usage, as shown by either direct or linked evidence; in this way non-maleficence and equity are maximized. The meaning of each of these terms is explained here.

In accordance with the bioethical principle of non-maleficence [6, 10], AI validation studies should test hypotheses of safety, efficacy and equity. Scientific validity of such studies is higher the more replicable they are, and common reporting standards [66], CONSORT-AI [67], preregistration of study and analysis protocols [42, 68], and a validated relationship to patient outcome [10] are important factors that enhance replicability, and are consistent with US Federal regulations.
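The disease-level cut-off and its associated outcome risks described above can be expressed as a simple rule. This is a hypothetical illustration of the published thresholds [34, 65], not the actual IDx-DR implementation; the function name and output labels are ours.

```python
# Hypothetical illustration of the validated disease-level cut-off [34, 65];
# NOT the actual IDx-DR algorithm. The ETDRS level and macular edema status
# are assumed to come from a grading of the retinal images.

def dr_exam_output(etdrs_level, macular_edema):
    """Binary autonomous AI output with the associated outcome risks."""
    if etdrs_level >= 35 or macular_edema:
        return {
            "output": "positive",
            "risk_proliferative_dr_3yr": ">= 18%",
            "risk_dme_1yr": "17.7%",  # if left untreated
        }
    return {
        "output": "negative",
        "risk_proliferative_dr_3yr": "<= 1.7%",
        "risk_dme_1yr": "<= 2.4%",
    }
```

The point of such a rule is that each output is tied to a validated risk of poor outcome, rather than to an arbitrary score.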
While standards have been set for preregistration, and especially Good Clinical Practice (GCP) [7], depending on the risk of harm to the patient, these can be burdensome and can require substantial resources from AI creators, and thus may not always be achievable for AI validation studies, in accordance with the risk of patient harm. Broadly, preregistration includes public registration of the in- and exclusion criteria, the entire protocol, and the statistical analysis plan on a site such as clinicaltrials.gov; a hypothesis-testing design with predefined endpoints; a predefined method for statistical analysis; predefined inclusion and exclusion criteria; a predefined sampling protocol; a plan for handling of the trial data by an independent Contract Research Organization or third party; and prohibition of access by the researchers to the subject level results before finalizing the statistical analysis.

Also in accordance with non-maleficence is to validate the autonomous AI against what matters to the patient. This is most likely clinical outcome: the clinical event relevant to the patient, or an event of which the patient is aware and wants to avoid, including death, loss of vision, visual field loss, the need for ventilatory support, or other events causing a reduction in quality of the patient's life [69].

For acute diseases or interventions, clinical outcome may be immediate and easy to measure, such as visual acuity in the case of myopia or central retinal artery occlusion. However, for the many chronic diseases that autonomous AI has particular potential for, such as diabetic retinopathy, glaucoma or macular degeneration, clinical outcome may take years to manifest. As a consequence, surrogate endpoints have been developed [70] to reduce the cost and shorten the duration of trials, especially in the drug approval process. One type of surrogate endpoint is a phenotype: a laboratory measurement, or a physical sign, used as a substitute for a clinical outcome that measures directly how a patient feels, functions or survives. Another type is a prognostic standard: (combinations of) biomarkers that are associated with prognosis. Changes induced by a therapy on a prognostic standard, or other surrogate endpoints, should reflect changes in clinical outcome. Examples of surrogate endpoints are suppression of ventricular arrhythmias, or reduction in cholesterol level, in cardiovascular trials. Prognostic standards include positive pathology in a biopsy, or progression on mammograms, in breast cancer, or evidence of mitosis in skin cancer. Within diabetic retinopathy, the Early Treatment Diabetic Retinopathy Study (ETDRS) scale and the DRCR.net macular edema scale are prognostic standards [65, 71].

Obviously, while requiring less time and fewer resources than true outcome in the case of chronic disease, prognostic standard based reference standards still require considerable effort. This is an important reason why clinician derived reference standards, rather than outcome or surrogate based reference standards, are so widely used in AI. This practice is so widespread that it is almost a standard. A widely cited meta-analysis of the quality of evidence of AI accuracy, while mentioning the potential of AI to improve outcome, takes as a given the comparison to clinician derived ground truth. Validation against (surrogate) clinical outcome is not even considered [72].

In addition to unknown validity against outcome, reproducibility (different clinicians evaluating the same patient differently in 30–50% of cases), repeatability (the same clinician evaluating the same patient differently in 20–30% of cases), and temporal drift (clinicians systematically evaluating the same hypothetical patient differently over generations of clinicians) are other major issues to be addressed [24, 25, 73]. As the evidence for a given treatment based on a given evaluation may have been derived decades ago, the latter, temporal drift, is a form of bias that is especially pernicious and difficult to correct for. When prognostic standards or outcome are unfeasible, optimal correction for reproducibility and repeatability requires strict evaluation protocols and independent verification where possible.

Summarizing the above, clinical outcome or prognostic standards are preferred for primary endpoints, provided their validity against true clinical outcome has already been rigorously established. Only if these are not available can non outcome associated endpoints be used, but that should then be clear in the autonomous AI labelling [69] (Table 4.2).
Table 4.2 Reference standards © 2020 Abramoff

• Level I Reference Standard: A reference standard that either is a clinical outcome, a prognostic standard or other surrogate outcome. If the surrogate outcome is derived from an independent reading center, validation against outcome is required, as is published evidence of temporal drift, reproducibility, and repeatability metrics
• Level II Reference Standard: A reference standard established by an independent reading center with published temporal drift, reproducibility, and repeatability metrics. A Level II reference standard has not been validated to correlate to a clinical outcome
• Level III Reference Standard: A reference standard created from the same modality as used by the AI, by adjudicating or voting of multiple independent expert readers, documented to be masked, with published reproducibility and repeatability metrics. A Level III reference standard has not been established by an independent reading center, and has not been validated to correlate with a clinical outcome
• Level IV Reference Standard: All other reference standards, created by single readers or non-expert readers, without an established protocol. A Level IV reference standard has not been derived from an independent reading center, has not been validated to correlate with a clinical outcome, and there are no published reproducibility and repeatability metrics

Validation in the envisioned context, environment and workflow, in "locked" form, so that its performance is known and persists in real world clinical practice, is desirable for aligning with non-maleficence. For example, the negative effect of AI on outcome that was shown in the Fenton study could thus have been prevented [45].

Updates to the autonomous AI should be evaluated on their potential effect on patient risk of harm. Changing the font size of the autonomous AI user interface will have substantially lower risk than adding training data for potential performance improvement. A standardized process that links the level of evidence to potential risk of patient harm for each type of system update allows analysis of such "continuous learning". At the highest risk of patient harm, a fully locked autonomous AI, once validated, cannot have its training data automatically updated based on new inputs, as then the safety, efficacy and equity are not known. In a narrow sense, "continuous learning" AI systems, where learning is used to describe incorporating new training data as the system processes new inputs during deployment, will require new validation at the same level as the original, in chronic disease.

Ensuring that the validation is applicable for the clinical use case requires workflow analysis, and where possible, mimicking workflow during the trial. In our example, this required the trial to be performed in primary care clinics, in the standard diabetes management workflow, without modifications to the clinic environment, and with operators recruited from existing staff without prior experience or training.

Autonomous AI System Implementation Requirements

For enhancing trust, ethical alignment with patient autonomy from a patient-derived perspective is important, as we have derived previously [10]. This means focusing on maximizing traceability of patient derived data, including commensurate data stewardship, accountability, and authorization, as well as adherence to accepted standards. Obviously, this applies to data usage during the design phase, but it is of primary importance during deployment. Alignment with patient autonomy may preclude so-called "data plays", where the purpose of the autonomous AI is maximizing the value of resellable patient derived data, rather than providing a patient diagnosis.

Operationally, autonomous AI creators have an obligation to lawfully collect data, in the US requiring compliance with HIPAA/HITECH, as well as other applicable statutory and regulatory rules, in a manner that is transparent about the purpose and scope for which the data will be used [48]. Data used by the autonomous AI creator should be traceable to an authorization to use such data. Transparency on the part of autonomous AI creators, through written agreements, is essential to assess whether patients have adequately authorized use of data. Physicians and AI creators together are accountable directly to patients and each must take full responsibility for protecting patient rights as stewards of patient derived data.
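The Level I–IV hierarchy of Table 4.2 can be condensed into a decision rule. This sketch is ours, written for illustration under the chapter's definitions; the parameter names are hypothetical, and the Level III branch assumes the grading used the same imaging modality as the AI.

```python
# Illustrative decision rule for the Table 4.2 reference standard levels
# (1 = strongest evidence, 4 = weakest); parameter names are ours.

def reference_standard_level(outcome_based, independent_reading_center,
                             masked_expert_panel, published_reliability_metrics):
    if outcome_based:
        return 1  # clinical outcome, prognostic standard, or other surrogate
    if independent_reading_center and published_reliability_metrics:
        return 2  # reading-center standard, not validated against outcome
    if masked_expert_panel and published_reliability_metrics:
        return 3  # adjudicated masked expert readers, same modality as the AI
    return 4      # single or non-expert readers, no established protocol
```

Encoding the hierarchy this way makes explicit that each lower level drops one of the evidentiary guarantees of the level above it.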
Additionally, the rules require auditable processes and security controls to ensure that data is being used in accordance with the scope for which it was authorized, and to protect the data from unauthorized use or access.

It is critical that autonomous AI system validation requirements also include ongoing monitoring of real-world performance after deployment. Typically, this is achieved by instituting a comprehensive Quality Management System (QMS), such as that under 21 CFR 820, that accommodates user feedback, complaints, reportable events, and ongoing product monitoring. Performance data monitored under the QMS should include a predefined protocol for determining whether the autonomous AI system results remain within the specified performance range that aligns with safe, effective, and equitable use of the AI system. In addition, ongoing monitoring of real-world performance includes all other quality responsibilities that remain within developers' control, such as usability, user experience, product performance (which includes uptime, bugs, and issues), and necessary safety controls (which include a comprehensive framework for cybersecurity, data protection, and data privacy).

Program integrity is maximized through the inherent validation of the autonomous AI safety and equity, as well as through design controls for the integration of the AI with the wider clinical EHR and charting system, ensuring automated population of the EHR with the diagnostic outputs and management.

Creators of healthcare autonomous AI should assume liability for harm related to the inadequate performance of the device when used properly and on-label. This is essential for adoption: it is inappropriate for a clinician, using an autonomous AI to make a diagnosis they are not comfortable making themselves, to have full medical liability for harm caused by that AI. This position has been confirmed by the American Medical Association in its 2019 AI Policy [1]. Just like a physician grading an examination would be held responsible for their diagnosis, creators of autonomous AI products have obtained medical malpractice insurance. This paradigm for responsibility shifts medical liability for a medical diagnostic from the provider managing the patient's diabetes, who orders the autonomous point of care retinal examination, to the autonomous AI creator.

However, medical decisions by autonomous AI on individual patients typically cannot be unequivocally labeled as correct or incorrect, especially in chronic diseases where outcomes may emerge years later. On populations of patients, however, the medical decisions can be compared statistically to the desired decisions, for example to claimed correctness, and that is where the liability will be focused. Another issue is that, while autonomous AI is preferably compared to patient outcome or surrogate outcome, this requires enormous resources that will not be available for the individual patient where liability is at stake. Then, the autonomous AI decision will be compared to an individual physician or group of physicians, lacking validation and thus with unknown correspondence to outcome or surrogate outcome. This is obviously an issue for so-called continuous learning AI systems. These distinctions will need to be resolved as various AI applications move forward.

Summary

Successful introduction of autonomous AI into the healthcare system can be achieved. For example, at the University of Iowa, introduction of autonomous AI for the diabetic eye exam has led to greatly increased compliance of people with diabetes with their annual eye exams. This was especially important during the recent COVID-19 pandemic, when patients, already at higher risk of morbidity and mortality from the virus, were reluctant to make the extra visit to an ophthalmologist that would otherwise have been required. Even after all clinics were shut down for a few weeks, the care gaps resulting from patients not getting their diabetic eye exam were closed within weeks, so that there is almost complete compliance in this diabetes population. Using the autonomous AI system, patients who would otherwise not have been able to get an exam for at least 6 months were able to get in and complete the exam within minutes. Because of the program integrity enabled by autonomous AI, patients and payers were ensured that the diabetic eye exam was fully documented, billed and coded.
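The predefined performance-range protocol described for post-deployment QMS monitoring might look like the following check. The threshold and names here are hypothetical, for illustration only; an actual QMS would prespecify the bound from the validation study.

```python
# Hypothetical QMS-style check that deployed sensitivity stays within the
# prespecified range from validation; threshold and names are illustrative.

def within_validated_range(true_positives, false_negatives, lower_bound=0.80):
    """True if observed sensitivity is at or above the prespecified bound."""
    observed = true_positives / (true_positives + false_negatives)
    return observed >= lower_bound

# Example: 46 of 50 confirmed disease cases detected -> sensitivity 0.92,
# which stays within the assumed validated range.
```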
24. Pugh JA, Jacobson JM, Van Heuven WA, Watters JA, Tuley MR, Lairson DR, et al. Screening for diabetic retinopathy. The wide-angle retinal camera. Diabetes Care. 1993;16(6):889–95. http://www.ncbi.nlm.nih.gov/pubmed/8100761.
25. Lin DY, Blumenkranz MS, Brothers RJ, Grosvenor DM. The sensitivity and specificity of single-field nonmydriatic monochromatic digital fundus photography with remote image interpretation for diabetic retinopathy screening: a comparison with ophthalmoscopy and standardized mydriatic color photography. Am J Ophthalmol. 2002;134(2):204–13.
26. Thorwarth WT Jr. From concept to CPT code to compensation: how the payment system works. J Am Coll Radiol. 2004;1(1):48–53. https://www.ncbi.nlm.nih.gov/pubmed/17411519.
27. Chiang MF, Casper DS, Cimino JJ, Starren J. Representation of ophthalmology concepts by electronic systems: adequacy of controlled medical terminologies. Ophthalmology. 2005;112(2):175–83. https://www.ncbi.nlm.nih.gov/pubmed/15691548.
28. Steindel SJ. A comparison between a SNOMED CT problem list and the ICD-10-CM/PCS HIPAA code sets. Perspect Health Inf Manag. 2012;9:1b. https://www.ncbi.nlm.nih.gov/pubmed/22548020.
29. Linder JA, Kaleba EO, Kmetik KS. Using electronic health records to measure physician performance for acute conditions in primary care: empirical evaluation of the community-acquired pneumonia clinical quality measure set. Med Care. 2009;47(2):208–16. https://www.ncbi.nlm.nih.gov/pubmed/19169122.
30. Shortliffe EH, Davis R, Axline SG, Buchanan BG, Green CC, Cohen SN. Computer-based consultations in clinical therapeutics: explanation and rule acquisition capabilities of the MYCIN system. Comput Biomed Res. 1975;8(4):303–20. http://www.ncbi.nlm.nih.gov/pubmed/1157471.
31. Fukushima K. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern. 1980;36(4):193–202. http://www.ncbi.nlm.nih.gov/pubmed/7370364.
32. Rumelhart DE, McClelland JL, University of California San Diego PDP Research Group. Parallel distributed processing: explorations in the microstructure of cognition. Cambridge, MA: MIT Press; 1986.
33. Wolf RM, Channa R, Abramoff MD, Lehmann HP. Cost-effectiveness of autonomous point-of-care diabetic retinopathy screening for pediatric patients with diabetes. JAMA Ophthalmol. 2020. https://www.ncbi.nlm.nih.gov/pubmed/32880616.
34. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. Nat Digit Med. 2018;1(1):39. https://doi.org/10.1038/s41746-018-0040-6.
35. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–53. https://www.ncbi.nlm.nih.gov/pubmed/31649194.
36. Larrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci U S A. 2020;117(23):12592–4. https://www.ncbi.nlm.nih.gov/pubmed/32457147.
37. Angus DC. Randomized clinical trials of artificial intelligence. JAMA. 2020. https://www.ncbi.nlm.nih.gov/pubmed/32065828.
38. Pearl J, Mackenzie D. The book of why: the new science of cause and effect. New York: Basic Books; 2018.
39. Bossuyt PM, Lijmer JG, Mol BW. Randomised comparisons of medical tests: sometimes invalid, not always efficient. Lancet. 2000;356(9244):1844–7. https://www.ncbi.nlm.nih.gov/pubmed/11117930.
40. Korevaar DA, Gopalakrishna G, Cohen JF, Bossuyt PM. Targeted test evaluation: a framework for designing diagnostic accuracy studies with clear study hypotheses. Diagn Progn Res. 2019;3:22. https://www.ncbi.nlm.nih.gov/pubmed/31890896.
41. Lu B, Gatsonis C. Efficiency of study designs in diagnostic randomized clinical trials. Stat Med. 2013;32(9):1451–66. https://www.ncbi.nlm.nih.gov/pubmed/23071073.
42. Kaplan RM, Irvin VL. Likelihood of null effects of large NHLBI clinical trials has increased over time. PLoS One. 2015;10(8):e0132382. https://www.ncbi.nlm.nih.gov/pubmed/26244868.
43. Ting DSW, Peng L, Varadarajan AV, Keane PA, Burlina PM, Chiang MF, et al. Deep learning in ophthalmology: the technical and clinical considerations. Prog Retin Eye Res. 2019;72:100759. https://www.ncbi.nlm.nih.gov/pubmed/31048019.
44. Van Dijk HW, Verbraak FD, Kok PHB, Oberstein SYL, Schlingemann RO, Russell SR, et al. Variability in photocoagulation treatment of diabetic macular oedema. Acta Ophthalmol. 2013;91(8):722–7. https://doi.org/10.1111/j.1755-3768.2012.02524.x.
45. Fenton JJ, Taplin SH, Carney PA, Abraham L, Sickles EA, D'Orsi C, et al. Influence of computer-aided detection on performance of screening mammography. N Engl J Med. 2007;356(14):1399–409.
46. Sonka M, Fitzpatrick JM. Handbook of medical imaging – volume 2, medical image processing and analysis. Bellingham, WA: The International Society for Optical Engineering Press; 2000.
47. Sackett DL. Bias in analytic research. J Chronic Dis. 1979;32(1–2):51–63. https://www.ncbi.nlm.nih.gov/pubmed/447779.
48. Blumenthal D. Launching HITECH. N Engl J Med. 2010;362(5):382–5. http://www.ncbi.nlm.nih.gov/pubmed/20042745.
49. Copeland R, Needleman S. Google's 'Project Nightingale' triggers federal inquiry. WSJ. 2019. https://www.wsj.com/articles/behind-googles-project-nightingale-a-health-data-gold-mine-of-50-million-patients-11573571867.
50. Moyer VA, U.S. Preventive Services Task Force. Screening for glaucoma: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med. 2013;159(7):484–9. https://www.ncbi.nlm.nih.gov/pubmed/24325017.
51. Chou R, Dana T, Bougatsos C, Grusing S, Blazina I. Screening for impaired visual acuity in older adults: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA. 2016;315(9):915–33. https://www.ncbi.nlm.nih.gov/pubmed/26934261.
52. McLaughlin CC, Wu XC, Jemal A, Martin HJ, Roche LM, Chen VW. Incidence of noncutaneous melanomas in the U.S. Cancer. 2005;103(5):1000–7. https://www.ncbi.nlm.nih.gov/pubmed/15651058.
53. Sullivan HR, Schweikart SJ. Are current tort liability doctrines adequate for addressing injury caused by AI? AMA J Ethics. 2019;21(2):E160–6. https://www.ncbi.nlm.nih.gov/pubmed/30794126.
54. Maier S. Elon take the wheel. Minnesota Law Rev. 2017. https://minnesotalawreview.org/2017/01/24/elon-take-the-wheel/.
55. Chandler RJ, Venditti CP. Gene therapy for metabolic diseases. Transl Sci Rare Dis. 2016;1(1):73–89. https://www.ncbi.nlm.nih.gov/pubmed/27853673.
56. Russell S, Bennett J, Wellman JA, Chung DC, Yu ZF, Tillman A, et al. Efficacy and safety of voretigene neparvovec (AAV2-hRPE65v2) in patients with RPE65-mediated inherited retinal dystrophy: a randomised, controlled, open-label, phase 3 trial. Lancet. 2017;390(10097):849–60. https://www.ncbi.nlm.nih.gov/pubmed/28712537.
57. Beauchamp TL, Childress JF. Principles of biomedical ethics. 8th ed. New York: Oxford University Press; 2019.
58. Shah A, Lynch S, Niemeijer M, Amelon R, Clarida W, Folk J, et al., editors. Susceptibility to misdiagnosis of adversarial images by deep learning based retinal image analysis algorithms. Proceedings – International Symposium on Biomedical Imaging; 2018.
59. Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS. Adversarial attacks on medical machine learning. Science. 2019;363(6433):1287–9. https://www.ncbi.nlm.nih.gov/pubmed/30898923.
60. Friedenwald J, Day R. The vascular lesions of diabetic retinopathy. Bull Johns Hopkins Hosp. 1950;86(4):253–4. http://www.ncbi.nlm.nih.gov/pubmed/15411556.
61. MacKenzie S. A case of glycosuric retinitis, with comments. (Microscopical examination of the eyes by Mr. Nettleship). Roy London Ophthal Hosp Rep. 1879;9(134).
62. Hubel DH, Wiesel TN. Receptive fields of single neurones in the cat's striate cortex. J Physiol. 1959;148:574–91.
63. Ts'o DY, Frostig RD, Lieke EE, Grinvald A. Functional organization of primate visual cortex revealed by high resolution optical imaging. Science. 1990;249(4967):417–20.
64. Wang YT, Tadarati M, Wolfson Y, Bressler SB, Bressler NM. Comparison of prevalence of diabetic macular edema based on monocular fundus photography vs optical coherence tomography. JAMA Ophthalmol. 2016;134(2):222–8. http://www.ncbi.nlm.nih.gov/pubmed/26719967.
65. Fundus photographic risk factors for progression of diabetic retinopathy. ETDRS report number 12. Early Treatment Diabetic Retinopathy Study Research Group. Ophthalmology. 1991;98(5 Suppl):823–33.
66. Cohen JF, Korevaar DA, Altman DG, Bruns DE, Gatsonis CA, Hooft L, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016;6(11):e012799. https://www.ncbi.nlm.nih.gov/pubmed/28137831.
67. Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, Chan A-W, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26(9):1364–74. https://doi.org/10.1038/s41591-020-1034-x.
68. US Food and Drug Administration (FDA). FDA permits marketing of artificial intelligence-based device to detect certain diabetes-related eye problems. Washington, DC; 2018. https://www.fda.gov/newsevents/newsroom/pressannouncements/ucm604357.htm.
69. Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Ann Intern Med. 1996;125(7):605–13. https://www.ncbi.nlm.nih.gov/pubmed/8815760.
70. Temple R. A regulatory authority's opinion about surrogate endpoints. In: Nimmo W, Tucker G, editors. Clinical measurement in drug evaluation. New York: Wiley; 1995.
71. Browning DJ, Glassman AR, Aiello LP, Bressler NM, Bressler SB, Danis RP, et al. Optical coherence tomography measurements and analysis methods in optical coherence tomography studies of diabetic macular edema. Ophthalmology. 2008;115(8):1366–71, 71 e1. http://www.ncbi.nlm.nih.gov/pubmed/18675696.
72. Nagendran M, Chen Y, Lovejoy CA, Gordon AC, Komorowski M, Harvey H, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689. https://www.ncbi.nlm.nih.gov/pubmed/32213531.
73. Lin AP, Katz LJ, Spaeth GL, Moster MR, Henderer JD, Schmidt CM Jr, et al. Agreement of visual field interpretation among glaucoma specialists and comprehensive ophthalmologists: comparison of time and methods. Br J Ophthalmol. 2011;95(6):828–31. http://www.ncbi.nlm.nih.gov/pubmed/20956271.
5 Technical Aspects of Deep Learning in Ophthalmology
Zhiqi Chen and Hiroshi Ishikawa
Deep learning (DL) is a specific category within machine learning. The prototype of DL dates back to the 1940s, when Walter Pitts and Warren McCulloch designed a computational model to mimic the neural networks of the human brain [1]. Initially, neural networks were clumsy and inefficient, and they did not become useful until 1985, when the concept of back propagation was applied to them [2]. In 1989, LeCun first demonstrated the feasibility of convolutional neural networks with back propagation for recognizing handwritten zip codes [3]. As computational speeds increased exponentially with the development of graphics processing units (GPUs), neural networks gained more layers and began to compete with support vector machines. Moreover, neural networks are scalable and able to continue improving as more parameters and training data are added, given the increased supply of computational power. In 2009, ImageNet, which collected over 14 million labeled images from 1000 categories, was launched [4]. Soon after, in 2012, a DL model won the ImageNet recognition challenge decisively for the first time, decreasing the top-5 error rate from 26.1% to 15.3% [5]. DL models have dominated these challenges since then. In some applications, such as traffic sign recognition, diabetic retinopathy classification, and Go (a traditional Chinese strategy game like chess), DL has even exceeded human performance [6–8]. DL provides a powerful framework for learning. The deep structure enables the algorithm to represent complex functions. Consequently, given models and datasets that are big enough, deep learning can learn the mapping from input data to output data and thus perform complex real-life tasks.

Clinical diagnosis of many eye diseases relies on characteristic patterns in the visualization of the eye and its surrounding structures. This high dependence on imaging makes DL a natural fit for the field of ophthalmology. In recent years, research involving DL in ophthalmology has risen exponentially [7, 9–13]. DL-based AI technology is likely to aid the clinical decision-making process in the near future, improving overall medical service quality.

In this chapter, we provide a formal introduction to and definition of deep learning concepts, techniques and architectures. We begin the chapter with the main forms of learning. Next, we review the basics of the most basic and simple DL architecture, deep feedforward neural networks.

Z. Chen: Department of Electrical and Computer Engineering, New York University, Brooklyn, NY, USA; Department of Ophthalmology, NYU Langone Health, New York, NY, USA. e-mail: zc1337@nyu.edu
H. Ishikawa (*): Department of Ophthalmology, NYU Langone Health, New York, NY, USA; Department of Biomedical Engineering, New York University, Brooklyn, NY, USA. e-mail: Hiroshi.Ishikawa@nyulangone.org
Supervised Learning and Unsupervised Learning
Fig. 5.2 Sigmoid activation function. The sigmoid activation function is a nonlinear function which maps the input signal in between 0 and 1. Such nonlinear activation functions increase the ability of a DFN to represent complex functions
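The sigmoid nonlinearity shown in Fig. 5.2 can be written in a couple of lines. A minimal sketch (NumPy assumed; the sample inputs are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    """Map any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# The function is exactly 0.5 at x = 0 and saturates toward
# 0 and 1 at the extremes, which is what gives a DFN its nonlinearity.
print(sigmoid(0.0))                          # 0.5
print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # values strictly between 0 and 1
```

By symmetry, sigmoid(x) + sigmoid(−x) = 1 for any x, which is a quick sanity check on an implementation.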
Fig. 5.3 Example of convolution operation with 3×3 kernel size. Each element in the output array is the sum of pointwise multiplication of a 3×3 convolution kernel and a 3×3 corresponding patch in the input array
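The operation in Fig. 5.3 can be sketched directly: slide the kernel over the input and, at each position, sum the pointwise product of the kernel and the underlying patch. A minimal sketch (NumPy assumed; as is common in deep learning, the kernel is applied without flipping, i.e., cross-correlation; the sample image and kernel are made up):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D convolution ('valid' mode): each output element is the sum of the
    pointwise product of the kernel and the corresponding image patch."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0       # illustrative 3x3 averaging kernel
result = conv2d_valid(image, kernel)
print(result)                        # [[ 5.  6.], [ 9. 10.]]
```

Note how a 4×4 input and a 3×3 kernel yield a 2×2 output: one value per valid kernel position.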
Fig. 5.5 Structure of RNN. A is a chunk of neural network which takes input x at time step t and outputs a hidden state value h at time step t. The loop passes the information from one time step of the network to the next step. The output at the next time step is calculated based on the previous hidden state as well as the current input
Recurrent Neural Networks (RNNs)

Many tasks in ophthalmology involve sequence data. Recurrent neural networks are a class of neural networks designed for tasks involving sequential data. An RNN is recurrent in that it performs the same operation on the input at every time step and produces the current output based on the previous output. A hidden unit is maintained to store past history. In this way, RNNs map an input sequence to an output sequence. Similar to CNNs, which share weights across every local patch of an array, RNNs, once unfolded in time, share the same weights at every time step of the sequence, as shown in Fig. 5.5.

Although hidden units are designed to learn to store past information, the stored information is lossy and thus does not capture long-term dependencies. To overcome this drawback, a class of more complicated neural networks, long short-term memory (LSTM) networks, was proposed [16]. As shown in Fig. 5.6, LSTMs augment RNNs with an explicit memory governed by three gates, an input gate, a forget gate and an output gate, which make it easier to remember the input for a long time.
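The two ideas above, a shared operation applied at every time step, and an explicit gated memory, can be contrasted in a few lines. A minimal sketch (NumPy assumed; dimensions and initialization are made up for illustration; this is an untrained model, not the networks of [16]):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 3, 4                                 # made-up dimensions

def rnn_step(x_t, h_prev, W_xh, W_hh):
    """Plain RNN: the same operation reused at every time step."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

def lstm_step(x_t, h_prev, c_prev, W, b):
    """LSTM: an explicit memory cell c controlled by three gates."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = 1 / (1 + np.exp(-z[0:n_h]))              # forget gate
    i = 1 / (1 + np.exp(-z[n_h:2 * n_h]))        # input gate
    g = np.tanh(z[2 * n_h:3 * n_h])              # candidate memory content
    o = 1 / (1 + np.exp(-z[3 * n_h:]))           # output gate
    c = f * c_prev + i * g                       # update the explicit memory
    return o * np.tanh(c), c                     # hidden state, memory

# Shared weights, reused at every step of the unfolded sequence (Fig. 5.5).
W_xh = rng.normal(size=(n_h, n_in)) * 0.1
W_hh = rng.normal(size=(n_h, n_h)) * 0.1
W = rng.normal(size=(4 * n_h, n_h + n_in)) * 0.1
b = np.zeros(4 * n_h)

h_rnn = np.zeros(n_h)
h_lstm = c_lstm = np.zeros(n_h)
for x_t in rng.normal(size=(5, n_in)):           # unfold over 5 time steps
    h_rnn = rnn_step(x_t, h_rnn, W_xh, W_hh)
    h_lstm, c_lstm = lstm_step(x_t, h_lstm, c_lstm, W, b)
print(h_rnn.shape, h_lstm.shape)
```

The RNN carries everything in one lossy hidden state, whereas the LSTM keeps a separate cell state whose contents the forget and input gates explicitly erase or write.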
Fig. 5.6 Structure of LSTM. LSTM is made of three gates, the input, forget and output gate. Each gate acts as a filter to control the information flow explicitly. Thus, LSTM has longer-term dependency compared to RNN, which has only one gate to filter the input information
Fig. 5.7 Structure of GAN. GAN is composed of a generator and a discriminator. The generator is trained to generate plausible samples from noise in order to fool the discriminator, while the discriminator is trained to discriminate between the fake and true samples
In standard RNNs, the shared module has only a single neural layer, while that of LSTMs has four neural layers, through which the three gates interact in a special way. The first layer, called the "forget gate layer", looks at the previous hidden state and the current input and outputs a number between 0 and 1 for each neuron, deciding what information is going to be thrown away from memory. The second layer, called the "input gate layer", generates the scalar for each neuron that decides what information is going to be remembered in memory. Then, in the third layer, the old memory after forgetting is combined with the new information that the input gate decided to remember. Finally, the output gate decides what is going to be output based on the updated memory. Theoretical and empirical evidence shows that LSTMs have longer-term dependency compared to standard RNNs [16]. Therefore, they are well suited to modeling longitudinal disease progression and changes.

Generative Adversarial Networks (GANs)

GANs, which have the capacity to generate data without explicitly modelling the underlying distribution, are one of the most interesting recent innovations in DL [17]. GANs are a special form of neural networks in which two networks, a generator and a discriminator, are trained alternately. The generator is trained to produce realistic data samples from random noise, while the discriminator is trained to discriminate between the fake samples generated by the generator and the real samples. Figure 5.7 shows the structure of GANs. CNNs and RNNs can be easily incorporated into GANs to deal with array data and sequence data.

There are two potential applications of GANs in ophthalmology. The first focuses on the generator, which is able to extract the underlying structure of the data and learn to generate new data samples.
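The generator/discriminator structure of Fig. 5.7 can be sketched without training machinery. A minimal sketch (NumPy assumed; single linear layers stand in for deep networks, the "real" sample is made up, and the alternating gradient updates of [17] are deliberately omitted; this shows structure only):

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Generator: maps random noise to a fake data sample (here 8 values).
G = rng.normal(size=(8, 4)) * 0.5
def generator(noise):
    return np.tanh(G @ noise)

# Discriminator: scores a sample with the probability that it is real.
D_w = rng.normal(size=8) * 0.5
def discriminator(sample):
    return sigmoid(D_w @ sample)

real = rng.normal(size=8)              # stand-in for a real data sample
fake = generator(rng.normal(size=4))   # a generated (fake) sample

# Adversarial objectives: the discriminator wants D(real) -> 1 and
# D(fake) -> 0, while the generator wants D(fake) -> 1. Training would
# alternate gradient steps on these two losses.
d_loss = -np.log(discriminator(real)) - np.log(1.0 - discriminator(fake))
g_loss = -np.log(discriminator(fake))
print(fake.shape, float(d_loss), float(g_loss))
```

Because the two losses pull in opposite directions, improving one network makes the other's task harder, which is exactly the adversarial dynamic the text describes.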
Plausible OCT scans generated from random noise [18], as well as denoised scans generated from real scans [19], can be achieved by the generator of a GAN. The second application focuses on the discriminator, using it as a learned prior to detect abnormal samples. For example, Zhou et al. trained a GAN with only healthy data and used the trained discriminator to detect anomalies in OCT [20].

Conclusions

DL is an emerging set of tools that provides various potential solutions for applications in ophthalmology. The range of applications is very wide, from diagnosis and strategizing treatment to understanding pathogenesis and forecasting disease prognosis. Understanding the various DL techniques will help clinicians and researchers leverage the potential of DL applications in ophthalmology, improving the quality and delivery of ophthalmic care.

References

1. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Mathematical Biophys. 1943;5(4):115–33.
2. Lecun Y. Une procedure d'apprentissage pour reseau a seuil asymmetrique (A learning scheme for asymmetric threshold networks). In: Proceedings of Cognitiva 85, Paris, France. 1985. p. 599–604.
3. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1(4):541–51.
4. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–55.
5. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. 2012. p. 1097–105.
6. Cireşan D, Meier U, Masci J, Schmidhuber J. Multi-column deep neural network for traffic sign classification. Neural Netw. 2012;32:333–8.
7. Ruamviboonsuk P, et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. NPJ Digital Med. 2019;2(1):1–9.
8. Silver D, et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;529(7587):484–9.
9. Ting DSW, Cheung CYL, Lim G, Tan GSW, Quang ND, Gan A, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211–23.
10. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
11. Abràmoff MD, Lou Y, Erginay A, Clarida W, Amelon R, Folk JC, Niemeijer M. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest Ophthalmol Vis Sci. 2016;57(13):5200–6.
12. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017;124(7):962–9.
13. Maetschke S, Antony B, Ishikawa H, Wollstein G, Schuman J, Garnavi R. A feature agnostic approach for glaucoma detection in OCT volumes. PLoS One. 2019;14(7):e0219126.
14. Hubel DH, Wiesel TN. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. J Physiol. 1962;160:106–54.
15. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016.
16. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
17. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), vol. 2. Cambridge, MA: MIT Press; 2014. p. 2672–80.
18. Zheng C, Xie X, Zhou K, Chen B, Chen J, Ye H, et al. Assessment of generative adversarial networks model for synthetic optical coherence tomography images of retinal disorders. Transl Vis Sci Technol. 2020;9(2):29.
19. Halupka KJ, Antony BJ, Lee MH, Lucy KA, Rai RS, Ishikawa H, et al. Retinal optical coherence tomography image enhancement via deep learning. Biomed Optics Express. 2018;9(12):6205–21.
20. Zhou K, Gao S, Cheng J, Gu Z, Fu H, Tu Z, … Liu J. Sparse-GAN: sparsity-constrained generative adversarial network for anomaly detection in retinal OCT image. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE; 2020. p. 1227–31.
6 Selected Image Analysis Methods for Ophthalmology
Tomasz Krzywicki
Fig. 6.1 General diagram of the image analysis process: a retinal fundus image from a digital device passes through the inference stage (classification, regression, clustering) to the prediction stage (decision class, forecast, group index)
The structure of this chapter is as follows. First, we introduce the basics and structure of digital images. Then, the preprocessing stage of images is described. Next, we discuss the image registration stage, which prepares the retinal images for analysis in the inference stage. The last part of the chapter presents artificial neural networks with a convolutional layer as a tool for the classification of retinal images, including the detection of diseases.

Digital Images

Digital images are two-dimensional (2D) images. In formal terms, an image is a function f(r, c) represented as a matrix, whose elements are signal values at specific pairs of coordinates (r, c), corresponding to pixels. Typically the signal is intensity, but it may also be, for example, temperature. Digital images, unlike analog images, take finite and discrete values for a given pair of coordinates. Each digital image is composed of a finite number of pixels, each of which has a specific location in space and an intensity value. The function f(r, c) can be defined as follows:

f(r, c) = [ f(0, 0)       f(0, 1)       …   f(0, C−1)
            f(1, 0)       f(1, 1)       …   f(1, C−1)
            …
            f(R−1, 0)     f(R−1, 1)     …   f(R−1, C−1) ]   (6.1)

where R and C are the numbers of rows and columns of the image.
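This matrix view of an image maps directly onto an array type. A small concrete illustration (NumPy assumed; the pixel values are made up):

```python
import numpy as np

# A digital grayscale image: a matrix whose entry f[r, c] is the signal
# (here, 8-bit intensity) at row r, column c, i.e., at one pixel.
f = np.array([[ 12,  40,  40],
              [ 80, 200, 120],
              [  0, 255,  64]], dtype=np.uint8)

rows, cols = f.shape
print(rows, cols)        # 3 3
print(f[1, 1])           # intensity at coordinates r = 1, c = 1 -> 200

# The values are finite and discrete: uint8 pixels lie in {0, ..., 255}.
print(f.min(), f.max())  # 0 255
```

For a color image the same idea applies per layer: an RGB image is three such matrices stacked, one per color component.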
Fig. 6.2 RGB color space
Fig. 6.3 HSV color space
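The two color spaces of Figs. 6.2 and 6.3 can be converted between with Python's standard colorsys module, which works on components scaled to [0, 1]. A minimal sketch:

```python
import colorsys

# RGB describes a color by its red, green and blue components (Fig. 6.2);
# HSV re-parameterizes it as hue, saturation and value (Fig. 6.3).
red_hsv = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
blue_hsv = colorsys.rgb_to_hsv(0.0, 0.0, 1.0)
print(red_hsv)    # (0.0, 1.0, 1.0): hue 0 deg, fully saturated, full value
print(blue_hsv)   # hue 240/360 = 2/3, fully saturated, full value

# The conversion is (numerically) invertible:
print(colorsys.hsv_to_rgb(*blue_hsv))   # approximately (0.0, 0.0, 1.0)
```

Separating hue from value is why HSV is popular in preprocessing: uneven illumination mostly perturbs the value channel while leaving hue comparatively stable.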
In general, intensity refers to the individual components of a color space. Intensity normalization is a process used to compensate for artifacts due to uneven illumination of the retinal tissue by the imaging modality. Global linear intensity normalization is the simplest image normalization operation; it is applied to each layer (component in the color model) and can be expressed as follows:

g(r, c) = (f(r, c) − min) · (newMax − newMin) / (max − min) + newMin   (6.2)

where:

• f is the original image
• g is the new image created by the intensity normalization operation
• min and max are the minimum and maximum pixel values in the considered layer of the original image
• newMin and newMax are the new minimum and maximum pixel values in the considered layer of the new image

In addition to global intensity normalization methods, there are also local methods that take into account the characteristics of the neighbourhood. Salem et al. [10] present several methods for preprocessing retinal fundus images. Generic local approaches are among the popular intensity normalization methods for these images. Dedicated methods have also been proposed; one of them employs color constancy [11] and detects vessels in retinal images by segmenting the retinal vasculature. Other dedicated methods estimate an illumination model and correct the intensity according to the expected luminance [12–15]. This model is a mask image in which the pixel values are estimates of the reflectance of the tissue. Since the illumination source is usually unknown, the model is obtained by assuming that the local illumination variation is lower than the global variation for the entire image.

Contrast enhancement is one of the typical steps in the image preprocessing stage, and it is employed in multiple domains of image analysis, including ophthalmology. Its purpose is to improve the details and fidelity of the image by emphasizing its structure. Generic methods are among the frequently used approaches; however, the most commonly used are approaches targeted at the subsequent analysis stages, which facilitate the subsequent registration of images and the inference on their basis.

The sharpening filter is one of the simplest generic methods to improve the contrast of an image. It creates a new image that is less blurred than the original, so more details are visible. However, although the resulting image is cleaner than the original, it may also contain some extra noise, resulting in additional distortion.

Contrast enhancement is applied iteratively to each layer (component in the color model). The operation can be done by applying a convolution filter to the original image, which can be expressed by the following formula:

g(r, c) = (w ∗ f)(r, c) = Σ_{dr=−a}^{a} Σ_{dc=−b}^{b} w(dr, dc) · f(r + dr, c + dc)   (6.3)

where:

• f is the original image
• w is the convolution filter (kernel) of size (2a + 1) × (2b + 1)
• g is the new image created in the processing operation

A sample 3 × 3 sharpening filter (where a = 1 and b = 1) to be used in the convolution operation can be represented as:

[  0  −1   0 ]
[ −1   5  −1 ]   (6.4)
[  0  −1   0 ]
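Equations 6.2–6.4 translate almost literally into code. A minimal sketch (NumPy assumed; the 3×3 test image is made up, borders are zero-padded, and, like Eq. 6.3 as written, the kernel is applied without flipping):

```python
import numpy as np

def normalize_intensity(f, new_min=0.0, new_max=255.0):
    """Global linear intensity normalization (Eq. 6.2), one layer at a time."""
    fmin, fmax = f.min(), f.max()
    return (f - fmin) * (new_max - new_min) / (fmax - fmin) + new_min

def filter2d(f, w):
    """Convolution filtering (Eq. 6.3) with zero padding at the borders."""
    a, b = w.shape[0] // 2, w.shape[1] // 2
    padded = np.pad(f, ((a, a), (b, b)))
    g = np.zeros_like(f, dtype=float)
    for r in range(f.shape[0]):
        for c in range(f.shape[1]):
            g[r, c] = np.sum(w * padded[r:r + 2 * a + 1, c:c + 2 * b + 1])
    return g

# The 3x3 sharpening kernel of Eq. 6.4 (a = b = 1).
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

f = np.array([[50., 50., 50.],
              [50., 90., 50.],
              [50., 50., 50.]])
g = filter2d(f, sharpen)
print(normalize_intensity(f).min(), normalize_intensity(f).max())  # 0.0 255.0
print(g[1, 1])   # the bright centre pixel is emphasized: 5*90 - 4*50 = 250
```

The sharpening effect is visible at the centre pixel: its contrast against the flat 50-valued surround is amplified from 40 to 200 levels, which is exactly the structure-emphasizing behaviour described above.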
Fig. 6.4 Image sharpening filter operation. (a) Original image; (b) Sharpened image

The effect of the filter is shown in Fig. 6.4. More advanced methods that may be applied to retinal images include generic approaches operating on the local contrast [16, 17]. However, these methods can introduce noise into images. Single-scale [18] and multi-scale [19] linear filters have also been considered, but unfortunately they filter out relevant image details. Other clinical image processing methods are described in [20] in the context of developing a system to identify retinal disease in retinal fundus images.

Retinal Image Registration

The image registration process requires test and reference images to obtain an estimate of the aligning transformation. This transformation warps the test image so that characteristic retinal points in the test image occur at the same locations as in the reference image. The fundus retinal image registration process may be difficult to complete due to optical differences across modalities or devices, optical distortions, anatomical changes due to lesions or disease, and viewpoint differences due to projective distortion of the curved surface of the eye.

The applications of the retinal image registration process depend on when the test and reference images are taken. Fundus retinal images obtained during the same examination can be combined to obtain a higher-resolution image [21, 22], as they are devoid of anatomical changes. This allows for more accurate measurements of the eye parameters; however, the overlap of the fundus retinal images must be significant. Fundus retinal images with little overlap are used to create mosaics that expand the sensor's field of view [23, 24]; an example is given in Fig. 6.5.

Fig. 6.5 Registration of retinal images into mosaic. (a) First of original images; (b) Second of original images; (c) Registration result

Longitudinal studies of the retina [25, 26] can be done by registering images acquired at different examinations. Accurate registration of fundus retinal images of the same region can be useful in detecting small but significant changes (i.e., small in size in the image, but relevant to the patient's condition), such as local hemorrhages or differences in vasculature width.

The first retinal image registration methods were global and consisted in matching global similarities of the entire test and reference images transformed into an appropriate representation, spatial [27] or frequency [28], assuming that the intensities in both images are consistent. However, when using global methods, problems may arise due to uneven illumination and anatomical changes of the eye as captured by the test and reference images.

Popular retinal image registration methods include local methods that rely on well-localized features or keypoints [22, 24]. A keypoint in the image can be understood as a characteristic place, for example a corner; examples can be seen in Fig. 6.6. For more efficient processing, keypoints are represented as numerical vectors called descriptors. The SIFT [29] and SURF algorithms are popular generic methods of searching for keypoints in the image [30, 31]. They can be used to detect vessels, bifurcations and crossovers; detected keypoints may be included as features of the retina [24]. A good alternative to the patented SURF and SIFT general methods can be the Oriented FAST and Rotated BRIEF (ORB) [32] algorithms.

Local methods in image registration are more frequently used than global methods. They are especially useful for images with limited overlap, due to the increased specificity that point matches provide. Local methods are also much better suited to the registration of images with anatomical changes, due to their resistance to differences in images.
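As a concrete illustration of the frequency-domain global approach mentioned above, the translation between two images can be estimated by phase correlation. A minimal sketch (NumPy assumed; pure translation only, with a synthetic test image produced by circularly shifting the reference; this is not a full registration pipeline):

```python
import numpy as np

def phase_correlation_shift(reference, test):
    """Estimate the (row, col) translation between two equally sized images
    via phase correlation in the frequency domain (a global method)."""
    F_ref = np.fft.fft2(reference)
    F_test = np.fft.fft2(test)
    cross_power = F_test * np.conj(F_ref)
    cross_power /= np.abs(cross_power) + 1e-12      # keep phase only
    correlation = np.real(np.fft.ifft2(cross_power))
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Interpret peaks past the midpoint as negative (wrap-around) shifts.
    return tuple(int(p) if p <= s // 2 else int(p - s)
                 for p, s in zip(peak, correlation.shape))

rng = np.random.default_rng(2)
reference = rng.normal(size=(64, 64))
test = np.roll(reference, shift=(5, -3), axis=(0, 1))  # shifted copy
print(phase_correlation_shift(reference, test))        # (5, -3)
```

The sharp correlation peak makes this method robust for well-overlapping image pairs, but, as the text notes, global methods degrade under uneven illumination and anatomical change, which is where keypoint-based local methods take over.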
Among the methods of image registration, it is worth mentioning multimodal ophthalmic image registration, a process integrating information stored in two or more images captured using multiple imaging modalities. It is challenging because geometric deformations are an inseparable part of multimodal ophthalmic imaging. These include deformations resulting from heterogeneity in the optical specifications of the imaging devices and from patient-dependent factors. In [33], the authors proposed a method using Laplacian features, a Hessian affine feature space and phase correlation to register blue autofluorescence, near-infrared reflectance and color fundus photographs of the ocular posterior pole.

Fig. 6.6 Keypoints of the image marked in green

Inference and Prediction Using CNNs

This section presents the last two stages of image processing, inference and prediction, focused on classification. Classification is often used in assessing a patient's condition, when a patient is assigned to a diagnostic category or a risk class. Classification is also used in retinal image quality assessment to indicate whether a given image can be reliably used to establish a diagnosis.

A CNN is one of the Deep Learning (DL) models. More specifically, it is a kind of feedforward Artificial Neural Network (ANN) based on a stacked structure of specialized layers of neurons, with each layer specialized in recognizing specific patterns in images. Figure 6.7 shows a general diagram of a CNN.

Fig. 6.7 General diagram of a CNN: an input image passes through a convolutional layer producing feature maps, then a max-pooling layer and flattening, and finally dense layers yielding the output (prediction)

A CNN usually consists of more than one convolution layer. These are then augmented by densely connected classical ANN layers, as in the case of a multi-layer neural network. The CNN schema (construction of features and their use) has been designed to best exploit the structure of the two-dimensional layers of the input images. Convolutional layers of a CNN also have fewer parameters than dense layers, which may reduce the risk of overtraining the model.
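The pipeline of Fig. 6.7 can be traced end to end with a toy forward pass. A minimal sketch (NumPy assumed; the network is untrained and the sizes are made up, so the "prediction" is meaningless; the point is the shape of the data at each stage):

```python
import numpy as np

rng = np.random.default_rng(3)

def conv_layer(image, kernels):
    """Valid convolution of a 2D image with a bank of kernels -> feature maps."""
    kh, kw = kernels.shape[1:]
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    maps = np.zeros((len(kernels), oh, ow))
    for k, kernel in enumerate(kernels):
        for r in range(oh):
            for c in range(ow):
                maps[k, r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return np.maximum(maps, 0.0)                 # ReLU nonlinearity

def max_pool(maps, size=2):
    """2x2 max-pooling: keep the strongest response in each block."""
    n, h, w = maps.shape
    h2, w2 = h // size, w // size
    cropped = maps[:, :h2 * size, :w2 * size]
    return cropped.reshape(n, h2, size, w2, size).max(axis=(2, 4))

image = rng.normal(size=(8, 8))                  # input image
kernels = rng.normal(size=(4, 3, 3)) * 0.1       # convolutional layer
features = max_pool(conv_layer(image, kernels))  # feature maps -> pooled (4,3,3)
flat = features.ravel()                          # flattening
W_dense = rng.normal(size=(2, flat.size)) * 0.1  # dense layer: 2 classes
logits = W_dense @ flat
probs = np.exp(logits) / np.exp(logits).sum()    # softmax output (prediction)
print(features.shape, probs)
```

Note the parameter economy the text mentions: the four 3×3 kernels contribute only 36 weights regardless of image size, while the dense layer alone already needs 72.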
Retinal Image Quality Assessment

The quality of an image is an important issue affecting image analysis, registration and segmentation. The reliability of these methods depends largely on the quality of the images on which they operate. Ophthalmic images may lose quality due to the behavior of the device operator at the time of image capture, resulting in differences in camera exposure and focal plane errors.

Zago et al. [34] present an approach to retinal image quality assessment at the moment of acquisition, aiming to assist health care professionals during fundus examination. They suggest using a CNN pretrained on non-medical images for extracting general image features. The efficiency of the proposed method was tested on two publicly available databases (i.e., DRIMDB and ELSA-Brasil), and the best decision model obtained an accuracy of 98.6%, a sensitivity of 97.1% and a specificity of 100%.

Mahapatra et al. [35] propose a method for retinal fundus image assessment inspired by the human visual system. A saliency map is used to identify differences of particular regions from their neighbors with respect to image features. The method obtained a sensitivity of 98.2% and a specificity of 97.8%.

Diagnostic Support and Prediction

DR is a serious and prevalent disease, and therefore a lot of research has been done on supporting its automatic diagnosis based on fundus retinal images. Many studies used a CNN as a classifier, and the images were preprocessed using the image preprocessing techniques described in this chapter; selected relevant works are summarized below.

Sahlsten et al. [36] propose a system for diagnosing DR in which binary (non-referable/referable) and multi-class (five DR stages) classification are considered. The decision model in the binary classification task obtained an accuracy of 94%, a sensitivity of 89.6% and a specificity of 97.4%. In the multi-class classification task, the decision model obtained an accuracy of 86.9%. It can be seen that with the multi-class classification of the patient's condition, the accuracy decreased compared to the binary classification.

Arcadu et al. [37] propose an approach to predict the progression of DR. More specifically, the study predicts future DR in terms of a two-step worsening on the Early Treatment DR Severity Scale. The authors estimate the severity of DR assessed at 6, 12, and 24 months. The decision models for the considered time periods obtained areas under the curve (AUC) of 0.68 (sensitivity, 66%; specificity, 77%), 0.79 (sensitivity, 91%; specificity, 65%), and 0.77 (sensitivity, 79%; specificity, 72%), respectively.

Lam et al. [38] propose an approach to DR grading on fundus retinal images cropped using Otsu's method [39] to isolate the circular image of the retina. The study used popular CNN architectures such as GoogLeNet and AlexNet. The most accurate results in the binary classification task, for both sensitivity and specificity, were achieved with the GoogLeNet architecture, with an accuracy of 97%, a sensitivity of 95% and a specificity of 96%.

Gulshan et al. [40] present a method to automatically detect DR and diabetic macular oedema in retinal fundus images. The method was evaluated in identifying gradable (high-quality) and ungradable (poor-quality) images. The method obtained a sensitivity of 90.3% and a specificity of 98.1% in detecting DR, and a sensitivity of 87.0% and a specificity of 98.5% in detecting diabetic macular oedema.

Conclusions

This chapter presents the process of retinal fundus image analysis, including preprocessing, registration, inference and prediction. Preprocessing focuses on intensity normalization and contrast enhancement methods. The methods of inference are presented using the example of classification, where decision classes correspond to image quality or diagnostic categories.

The preprocessing and image registration steps are required in a broad variety of use cases. The simplest one is image inspection by a medical professional, where preprocessing and registration are required to improve the fidelity of the acquired image while clarifying or accenting its structure and anatomical features. Retinal image registration is also able to help monitor and evaluate the effectiveness of treatment by combining multiple retinal images into improved or larger images, or by comparing these images, which is essential for monitoring a disease and assessing its treatment efficacy.

Preprocessing and registration steps are also utilized as initial steps in automatic diagnosis. In practice, the most popular image-based inference method employs CNNs, due to their very good results.

References

1. Grosso A. Hypertensive retinopathy revisited: some answers, more questions. Br J Ophthalmol. 2005;89:1646–54. https://doi.org/10.1136/bjo.2005.072546.
2. Danis RP, Davis MD. Proliferative diabetic retinopathy, diabetic retinopathy. Totowa, NJ: Humana Press; 2008. p. 29–65.
3. Matsui M, Tashiro T, Matsumoto K, Yamamoto S. A study on automatic and quantitative diagnosis of fundus photographs. I. Detection of contour line of retinal blood vessel images on color fundus photographs. Nippon Ganka Gakkai Zasshi. 1973;77(8):907–18.
4. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JAWM, van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88. https://doi.org/10.1016/j.media.2017.07.005.
5. Lathuilière S, Mesejo P, Alameda-Pineda X, Horaud R. A comprehensive analysis of deep regression. IEEE Trans Pattern Anal Mach Intell. 2020;42(9):2065–81. https://doi.org/10.1109/TPAMI.2019.2910523.
6. Min E, Guo X, Liu Q, Zhang G, Cui J, Long J. A survey of clustering with deep learning: from the perspective of network architecture. IEEE Access. 2018;6:39501–14. https://doi.org/10.1109/ACCESS.2018.2855437.
7. Stanescu L, Burdescu DD, Stoica C. Color image segmentation applied to medical domain. In: Yin H, Tino P, Corchado E, Byrne W, Yao X, editors. Intelligent Data Engineering and Automated Learning – IDEAL 2007. Lecture Notes in Computer Science, vol. 4881. Berlin: Springer; 2007. https://doi.org/10.1007/978-3-540-77226-2_47.
8. Suzuki N. Distinction between manifestations of diabetic retinopathy and dust artifacts using three-dimensional HSV color space (Version 10005546). 2016. https://doi.org/10.5281/zenodo.1126874.
9. Semary N. A proposed HSV-based pseudo coloring scheme for enhancing medical image. 2018. p. 81–92. https://doi.org/10.5121/csit.2018.80407.
10. Salem NM, Nandi AK. Novel and adaptive contribution of the red channel in pre-processing of colour fundus images. J Franklin Inst. 2007;344(3–4):243–56. https://doi.org/10.1016/j.jfranklin.2006.09.001.
11. Zhao Y, Liu Y, Wu X, Harding S, Zheng Y. Retinal vessel segmentation: an efficient graph cut approach with Retinex and local phase. PLoS One. 2015;10(4):1–22.
12. Kolar R, Odstrcilik J, Jan J, Harabis V. Illumination correction and contrast equalization in colour fundus images. In: European Signal Processing Conference. 2011. p. 298–302.
13. Foracchia M, Grisan E, Ruggeri A. Luminosity and contrast normalization in retinal images. Med Image Anal. 2005;9(3):179–90.
14. Narasimha-Iyer H, Can A, Roysam B, Stewart V, Tanenbaum HL, Majerovics A, Singh H. Robust detection and classification of longitudinal changes in color retinal fundus images for monitoring diabetic retinopathy. IEEE Trans Biomed Eng. 2006;53(6):1084–98.
15. Grisan E, Giani A, Ceseracciu E, Ruggeri A. Model-based illumination correction in retinal images. In: IEEE International Symposium on Biomedical Imaging: Nano to Macro. 2006. p. 984–7.
16. Walter T, Massin P, Erginay A, et al. Automatic detection of microaneurysms in color fundus images. Med Image Anal. 2007;11:555–66. https://doi.org/10.1016/j.media.2007.05.001.
17. Fleming A, Philip S, Goatman K, Olson J, Sharp P. Automated microaneurysm detection using local contrast normalization and local vessel detection. IEEE Trans Med Imaging. 2006;25(9):1223–32.
18. Qidwai U, Qidwai U. Blind deconvolution for retinal image enhancement. In: IEEE EMBS Conference on Biomedical Engineering and Sciences. 2010. p. 20–25.
19. Sivaswamy J, Agarwal A, Chawla M, Rani A, Das T. Extraction of capillary non-perfusion from fundus fluorescein angiogram. In: Fred A, Filipe J, Gamboa H, editors. Biomedical engineering systems and technologies. Berlin: Springer; 2009. p. 176–88.
20. Rajan K, Sreejith C. Retinal image processing and classification using convolutional neural networks. In: Pandian D, Fernando X, Baig Z, Shi F, editors. Proceedings of the International Conference on ISMAC in Computational Vision and Bio-Engineering 2018 (ISMAC-CVB). Lecture Notes in Computational Vision and Biomechanics, vol. 30. Cham: Springer; 2019. https://doi.org/10.1007/978-3-030-00665-5_120.
21. Meitav N, Ribak EN. Improving retinal image resolution with iterative weighted shift-and-add. J Opt Soc Am A. 2011;28(7):1395–402. https://doi.org/10.1364/JOSAA.28.001395.
86 T. Krzywicki
22.
Hernandez-Matas C, Zabulis X. Super resolu- 31. Hernandez-Matas C, Zabulis X, Argyros AA. Retinal
tion for fundoscopy based on 3D image registra- image registration through simultaneous camera pose
tion. 36th Annual International Conference of and eye shape estimation. 38th Annual International
the IEEE Engineering in Medicine and Biology Conference of the IEEE Engineering in Medicine and
Society. 2014. p. 6332–8. https://doi.org/10.1109/ Biology Society (EMBC). 2016. p. 3247–51. https://
EMBC.2014.6945077. doi.org/10.1109/EMBC.2016.7591421.
23. Can A, Stewart CV, Roysam B, Tanenbaum HL. A 32. Rublee E, Rabaud V, Konolige K, Bradski G. ORB:
feature-based technique for joint, linear estimation of an efficient alternative to SIFT or SURF. Proceedings
high-order image-to-mosaic transformations: mosa- of the IEEE International Conference on Computer
icing the curved human retina. IEEE Trans Pattern Vision. 2011. p. 2564–71. https://doi.org/10.1109/
Anal Mach Intell. 2002;24(3):412–9. https://doi. ICCV.2011.6126544.
org/10.1109/34.990145. 33. Suthaharan S, Rossi EA, Snyder V, Chhablani J,
24. Ryan N, Heneghan C, de Chazal P. Registration of Lejoyeux R, Sahel J-A, Dansingani K. Laplacian fea-
digital retinal images using landmark correspondence ture detection and feature alignment for multi-modal
by expectation maximization. Image Vis Comput. ophthalmic image registration using phase correlation
2004;22(11):883–98. https://doi.org/10.1016/j. and Hessian affine feature space. Sig Process. 2020;
imavis.2004.04.004. https://doi.org/10.1016/j.sigpro.2020.107733.
25. Narasimha-Iyer H, Can A, Roysam B, Tanenbaum 34. Zago GT, Andreão RV, Dorizzi B, Salles EOT. Retinal
HL, Majerovics A. Integrated analysis of vascular image quality assessment using deep learning. Comp
and nonvascular changes from color retinal fun- Biol Med. 2018;103:64–70. ISSN 0010-4825. https://
dus image sequences. IEEE Trans Biomed Eng. doi.org/10.1016/j.compbiomed.2018.10.004.
2007;54(8):1436–45. https://doi.org/10.1109/ 35. Mahapatra D, Roy PK, Sedai S, Garnavi R. Retinal
TBME.2007.900807. image quality classification using saliency maps
26. Troglio G, Benediktsson JA, Moser G, Serpico SB, and CNNs. In: International Workshop on Machine
Stefansson E. Unsupervised change detection in Learning in Medical Imaging. Springer; 2016.
multitemporal images of the human retina. In: Multi p. 172–9.
modality state-of-the-art medical image segmentation 36. Sahlsten J, Jaskari J, Kivinen J, Turunen L, Jaanio
and registration methodologies, vol. 1. Boston, MA: E, Hietala K, Kaski K. Deep learning fundus image
Springer US; 2011. p. 309–37. analysis for diabetic retinopathy and macular edema
27. Reel PS, Dooley LS, Wong KCP, Börner A. Robust grading. 2019. arXiv preprint arXiv:1904.08764.
retinal image registration using expectation maximi- 37. Arcadu F, Benmansour F, Maunz A, Willis J, Haskova
sation with mutual information. IEEE International Z, Prunotto M. Deep learning algorithm predicts dia-
Conference on Acoustics, Speech and Signal betic retinopathy progression in individual patients.
Processing. 2013. p. 1118–1122. https://doi. NPJ Digit Med. 2019;2(1):1–9.
org/10.1109/ICASSP.2013.6637824. 38. Lam C, Yi D, Guo M, Lindsey T. Automated detection
28. Cideciyan AV, Jacobson SG, Kemp CM, Knighton of diabetic retinopathy using deep learning. AMIA
RW, Nagel JH. Registration of high resolution images Summits Transl Sci Proc. 2018;2018:147.
of the retina. SPIE Med Imaging. 1652;1992:310–22. 39.
Otsu N. A threshold selection method from
https://doi.org/10.1117/12.59439. gray-level histograms. IEEE Trans Syst Man
29.
Lowe DG. Distinctive image features from Cybernetics. 1979;9(1):62–6. https://doi.org/10.1109/
scale-invariant keypoints. Int J Comput Vis. TSMC.1979.4310076.
2004;60(2):91–110. https://doi.org/10.1023/B:V 40. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D,
ISI.0000029664.99615.94. Narayanaswamy A, et al. Development and validation
30. Lin Y, Medioni G. Retinal image registration from of a deep learning algorithm for detection of diabetic
2D to 3D. IEEE Conference on Computer Vision retinopathy in retinal fundus photographs. JAMA.
and Pattern Recognition. 2008. p. 1–8. https://doi. 2016;316(22):2402–10.
org/10.1109/CVPR.2008.4587705.
7 Experimental Artificial Intelligence Systems in Ophthalmology: An Overview

Joelle A. Hallak, Kathleen Emily Romond, and Dimitri T. Azar
Fig. 7.1 Simple neural network and its components (inputs a1, a2; hidden nodes b1, b2; output c1; weights w1–w6; bias node weights bw1–bw3; total error fed back through back propagation). Reconstructed from Taylor M. Neural Networks Math: Visual

Fig. 7.2 Example of a large-scale network with a variety of ophthalmic data types as input (imaging, clinical and genetic) to determine a certain output. Each data type learns a useful featurization in its lower-level towers. The data from each tower is then merged and flows through higher levels, allowing the deep neural network to perform inference across data types [3]
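As a hedged illustration, the 2-2-1 network of Fig. 7.1 (inputs a1/a2, hidden nodes b1/b2, output c1, weights w1–w6, bias weights bw1–bw3, squared "total error") can be written out with NumPy. The sigmoid activation, learning rate, and training values are assumptions made for this sketch, not details taken from the figure's source:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights shaped like Fig. 7.1: a bias node feeds every non-input unit.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # w1..w4 (inputs -> hidden)
b1 = rng.normal(size=2)        # bias weights bw1, bw2
W2 = rng.normal(size=(2, 1))   # w5, w6 (hidden -> output)
b2 = rng.normal(size=1)        # bias weight bw3

def forward(x):
    h = sigmoid(x @ W1 + b1)   # hidden activations b1, b2
    y = sigmoid(h @ W2 + b2)   # output c1
    return h, y

def train_step(x, t, lr=0.5):
    """One backpropagation step on the squared total error 0.5*(y - t)^2."""
    global W1, b1, W2, b2
    h, y = forward(x)
    # output-layer delta: dE/dy times the sigmoid derivative y*(1-y)
    delta2 = (y - t) * y * (1 - y)
    # hidden-layer delta, propagated back through W2
    delta1 = (delta2 @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ delta2
    b2 -= lr * delta2.sum(axis=0)
    W1 -= lr * x.T @ delta1
    b1 -= lr * delta1.sum(axis=0)
    return 0.5 * np.sum((y - t) ** 2)

x = np.array([[0.05, 0.10]])   # illustrative input a1, a2
t = np.array([[0.99]])         # illustrative target for c1
errors = [train_step(x, t) for _ in range(200)]
```

Repeating the step drives the total error down, which is the feedback loop the figure labels "Back Propagation".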
keratoconus, refractive surgery, cataracts, strabismus, retinopathy of prematurity (ROP), and neuro-ophthalmology. Reinforcement learning (RL) and inverse reinforcement learning (IRL), and their potential applications for surgical simulation, are also discussed. Additionally, in light of the COVID-19 pandemic, we conclude by highlighting the impact and potential role of AI systems in ophthalmology in the post-COVID-19 era.

AI Applications in Corneal Diagnosis

AI applications have a decades-long history in corneal topography interpretation, and their value has been demonstrated for enhancing clinical decisions for patients with corneal diseases. With advancements in technology, we are able to evaluate corneal curvatures and elevation maps, tissue anatomy, histology, and biomechanical properties. Machine learning (ML) and AI techniques are useful tools to help in corneal disease classification (normal, early suspect keratoconus, and keratoconus), allowing early interventions, like collagen cross-linking, to prevent progression and vision loss. Additionally, ML and AI techniques provide useful tools for refractive surgery screening, in vivo corneal morphology exams, and corneal surgeries.

Keratoconus Screening and Classification

ML models for classifying keratoconus have been implemented on parameters from one device and a combination of devices. Deep learning applications have also been used for the detection of keratoconus.

Maeda et al. were one of the earliest groups to apply learning techniques on imaging data [12]. They used a classification tree combined with a linear discriminant function to distinguish between a keratoconus and a non-keratoconic pattern. Smolek and Klyce were the first to use a classification neural network for keratoconus screening to detect the presence of keratoconus or keratoconus suspects [13]. They reported that the neural networks completely distinguished keratoconus from keratoconus suspects and from topographies that resembled keratoconus. Additionally, their network approach equaled the sensitivity of currently used tests for keratoconus detection and outperformed them in terms of accuracy and specificity [13].

Viera and Barbosa [14] showed that Zernike polynomials are reliable parameters as inputs to a feed-forward artificial neural network and a discriminant analysis technique, with reported precisions of 94% and 84.8%, whereas Kovacs et al. [15] used tomographic data, topographic data, and keratoconus indices from the Scheimpflug camera. They reported that classifiers trained on bilateral data were better than unilateral single parameters in discriminating fellow eyes of patients with keratoconus from controls (ROC 0.96 vs. 0.88). However, this result may be due to including one eye in the training set and the second from the same patient in the test set.

Several studies combined parameters from more than one device. Hwang et al. [16] extracted variables from slit-scan tomography and spectral-domain OCT to differentiate between normal controls and the clinically normal fellow eyes of highly asymmetric keratoconus subjects. The best discrimination was found when using a combination of variables from both instruments, with spectral-domain OCT corneal thickness measures followed by anterior corneal measures from tomography being most important. Ambrosio et al. [17] used a tomographic and biomechanical index (TBI), which combined Scheimpflug-based corneal tomography and biomechanics. Their multinational retrospective study employed several AI techniques, including logistic regression, a support vector machine (SVM), and random forest (RF), all verified by leave-one-out cross-validation (LOOCV), to distinguish normal versus ectatic cases. The RF/LOOCV performed the best of all trained and tested methods, and was successful in detecting subclinical (fruste) ectasia, with an area under the receiver operating characteristic curve (AUROC) for all ectasia cases of 0.996. Furthermore, the TBI cutoff value
of 0.79 showed 100% sensitivity and specificity for detecting clinical ectasia [17].

More recently, some studies have been using deep learning methods for the detection of keratoconus. Kamiya et al. [18] used a deep learning model on colour-coded maps obtained by anterior segment optical coherence tomography (AS-OCT) to discriminate keratoconus from normal corneas and to classify the grade of the disease.

Corneal Refractive Surgery

Ocular imaging technology has evolved in recent years to address candidacy issues in corneal refractive surgery. A preoperative refractive surgery exam results in imaging, text, and numeric data. These data are a perfect example for multi-view and multi-task learning algorithms, with data integration, feature selection, and modeling applications (Fig. 7.3).

Fig. 7.3 Example of a data integration technique used for multi-view data in refractive surgery for modeling predictions

Neural networks in refractive surgery applications have been designed to predict outcomes and for candidate screening [19, 20]. Balidis et al. developed an algorithm to predict the need for retreatment in patients undergoing refractive surgery for myopia [19]. They used a computerized query to select patients who underwent PRK, LASEK, Epi-LASIK, or LASIK, and investigated 13 factors related to refractive surgery (such as age, room temperature and humidity, astigmatism axis, stromal thickness, laser ablation method, keratometry values, and laser characteristics). After data preprocessing, a learning vector quantizer (LVQ) neural network was employed for classification. LVQ has a nonlinear classification property, which is preferred for this task. They reported a sensitivity of 0.88 and specificity of 0.93 [19]. Two studies have developed learning algorithms for refractive surgery screening. Using Pentacam images as input, Xie et al. developed a deep learning model for screening of candidates for refractive surgery [21]. A total of 6465 corneal tomographic images of 1385 patients from Zhongshan Ophthalmic Center, Guangzhou, China, were used to develop the AI model. Their model, the Pentacam InceptionResNetV2 Screening System (PIRSS), achieved an overall detection accuracy of 0.947 on the validation data set, and on the independent test data set it achieved an overall detection accuracy of 0.95, which was comparable with that of senior refractive surgeons (0.928) [21]. In another recent study, Yoo et al. developed a machine learning architecture integrating multiple data sources from patient demographics, Pentacam imaging, and ophthalmic examinations to identify candidates for refractive surgery [20]. Five algorithms (SVM, artificial neural networks, random forests, LASSO, and AdaBoost) were used to classify
normal controls versus ectasia-risk patients, followed by an ensemble classifier to improve performance. Training and internal validation were conducted using subjects who had visited between 2016 and 2017, and external validation was performed using subjects who had visited in 2018. With the ensemble classifier, the reported AUCs were 0.983 and 0.972 in the internal and external validation sets, respectively [20].

Recent research efforts involve modifying neural networks to increase AI interpretability. Yoo et al. developed a multiclass machine learning model whose output selects the type of refractive surgery procedure a patient may be eligible for [22]. They constructed a multiclass XGBoost model to classify patients into four categories: laser epithelial keratomileusis, laser in situ keratomileusis, small incision lenticule extraction, and contraindication groups. The model was trained on the clinical decisions of experts and ophthalmic measurements. The SHapley Additive exPlanations technique was adopted to explain the output of the XGBoost model. They reported an accuracy of 0.81 and 0.789 when testing on the internal and external validation datasets, respectively. The SHapley Additive exPlanations for the results were consistent with prior knowledge from ophthalmologists [22].

Cataract Diagnosis and Grading

The use of DL algorithms for automatic grading of cataracts has been explored using photographs from slit-lamp microscopes. An early study digitized images before extracting gray-level statistics from within the relevant circular regions of the nucleus [23]. These extracted features were fed into a neural network to produce classifications. Other approaches include Li et al.'s use of support vector machine regression to predict grade of cataract [24]. The authors introduced the first automatic system for nucleus region detection in slit-lamp images. Local features were extracted on the basis of anatomical landmarks, and validation of more than 5000 images against a clinical ground truth achieved a structure detection success rate of 95% with an average grading difference of 0.36 on a 5.0 scale [24]. Further improvements have since been reported, including Srivastava et al.'s proposal of extracting features representing the visibility, or lack thereof, of lens parts [25]. While previous works had focused on brightness and color metrics of the eye, these authors used features which represented the fading of edges of distinct anatomical regions as the severity of nuclear cataract increased. The integration of these new features with previously used features provided better accuracy than either feature type used alone [25].

Limitations of DL techniques, including incomplete, redundant, or noisy representations, have been addressed in more recent work. One study addressed these challenges with a system which first automatically learned features for grading cataract through local filters and then predicted cataract grade using support vector regression [26]. The filters were acquired by clustering patches of images showing lenses from the same grading class and then fed through a convolutional neural network as well as a set of recursive neural networks to produce higher-order features. Validation using these selected features on a population-based dataset of over 5000 images showed a mean absolute error of 0.304 [26].

Studies which examine the use of DL algorithms as a means of accurate diagnosis and monitoring of cataract among vulnerable populations have shown similarly promising results, including Liu et al.'s presentation of a convolutional neural network to grade slit-lamp images among pediatric patients [27]. Their algorithm offered excellent mean accuracy, sensitivity, and specificity for classification (97.07%, 97.28%, and 96.83%). Zhang et al. [28] discussed the need for an automatic diagnosis aid for rural populations in China that lack access to expert care; their study used fundus images, a more accessible imaging modality than slit-lamp photography, as input to a six-level cataract grading system and achieved an average accuracy of 92.66%.
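Several of the keratoconus and cataract studies above validate small-sample classifiers with leave-one-out cross-validation (LOOCV). A minimal sketch of that loop, with synthetic Gaussian features and a simple nearest-centroid classifier standing in for the real clinical indices and the SVM/RF models used in those papers:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-ins for tomographic/biomechanical indices:
# two Gaussian clusters for "normal" (0) and "ectatic" (1) eyes.
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
               rng.normal(2.0, 1.0, size=(30, 4))])
y = np.array([0] * 30 + [1] * 30)

def nearest_centroid_predict(X_train, y_train, x):
    """Predict the class whose training-set centroid is closest to x."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

# LOOCV: each sample is held out once and predicted by a model
# fit on the remaining n-1 samples.
n = len(y)
correct = 0
for i in range(n):
    mask = np.arange(n) != i
    pred = nearest_centroid_predict(X[mask], y[mask], X[i])
    correct += int(pred == y[i])
loocv_accuracy = correct / n
```

LOOCV is attractive for the small cohorts typical of these studies because every sample serves as a test case, at the cost of fitting the model n times.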
AI Applications in Strabismus and ROP

The most significant advances in AI applications for pediatric ophthalmology involve the automated detection of retinopathy of prematurity. Reid et al. summarize AI applications in pediatric ophthalmology [29]. In addition to ROP, machine learning has also been applied to the classification of pediatric cataracts, prediction of postoperative complications following cataract surgery, detection of strabismus and refractive error, prediction of future high myopia, and diagnosis of reading disability. In addition, techniques have been used for the study of visual development, vessel segmentation in pediatric fundus images, and ophthalmic image synthesis.

Detection of Strabismus

Strabismus is a common affliction, affecting approximately 4% of the population [30]. Often diagnosed in infancy or childhood, the disease presents as misalignment of the eyes so that each eye looks in a different direction. Causes include problems with the eye muscles, the nerves of the eye in charge of transmitting information, the area of the brain which controls eye movements, or other eye injuries or disease. Untreated, it may lead to amblyopia, impairment of depth perception and 3D vision, or permanent vision loss, as the brain suppresses the image contributed by the weaker eye. Accurate and timely diagnosis is a topic of interest among researchers of late, as earlier diagnosis will allow for earlier treatment, improving quality of life for patients not only in terms of visual outcomes, but also through improved self-image and self-confidence. Traditional diagnosis may require several tests performed and interpreted by a clinician, which may include the Maddox rod test, the corneal light reflex test, and the commonly used gold standard, the prism and alternate cover test (PCT). The level of expertise of the examiners has been shown to affect diagnostic decisions [31, 32], bringing to light the need for a more objective and accurate approach through automatic methods. Techniques explored thus far have included image-based or video-based methods and eye-tracking methods.

Image- and video-based approaches to automatic diagnostic aids have included Yang et al.'s [33] prospective observational pilot study on 30 intermittent exotropia patients, 30 esotropia patients, and 30 orthotropic patients who were able to cooperate with a PCT. Two ophthalmologists independently performed the PCT for each subject to examine the angle of deviation. Using an infrared camera, full-face photos were taken with a selective wavelength filter placed over either eye. The resulting images were then fed through a 3D strabismus photo analyzer, and the angles of deviation obtained by the automatic and manual methods were compared, showing an excellent positive correlation (R = 0.900, P < 0.001) [33]. Another study selected features and classified images using a supervised support vector machine learning algorithm to mimic the diagnostic process of the Hirschberg test, which calculates the magnitude of yaw in the non-fixating eye by measuring the luminous reflection displacement of the cornea [34]. Their fully automated system was able to find the region of the eyes in images, locate the limbus and area of brightness, and classify the images among previously diagnosed cases and controls, achieving 94.14% sensitivity, 95.38% specificity, 98.78% positive predictive value (PPV), and 83.07% negative predictive value (NPV). Chandna et al. [35] used measures obtained from prism and cover tests as input to their backpropagation neural networks to produce a differential diagnosis. Despite an average reported accuracy of 100%, their system was limited to vertical strabismus.

Several other studies have presented algorithms with their own limitations [36–38]. These studies, along with the studies described above, used equipment that would be difficult to obtain or prohibitively expensive at some clinics, only diagnosed one direction of strabismus, or were not sensitive enough to detect unapparent strabismus. Valente et al. [39] attempted to address these limitations in their 2017 study, which used digital video to detect strabismus through the cover test. The only materials needed were a
digital camera and a regular computer for image and video processing, and their methods achieved 87% accuracy while recognizing deviations lower than 1Δ.

Several eye-tracking techniques have also been proposed. Corneal light reflection was used in one early animal study to characterize binocular misalignment in macaque monkeys with strabismus through measuring eye alignment errors to fixation targets which were presented throughout the subject's field of gaze at any distance [40]. The method allowed for horizontal and vertical error measurements and showed similar accuracy when compared to standard prism and cover assessments. Another eye-tracking system automated a Hirschberg test for infant subjects using a two-camera gaze-estimation system [41]. The Hirschberg ratio and angle kappa (the angle between the optical and visual axes) were determined through measurements of optical axis direction, corneal reflexes, and coordinates of the center of the entrance pupil when the infants looked at images presented on a computer monitor. This method was only tested on five infants and needed further verification [41]. Chen et al. [42] used an eye-tracking system, Tobii X2-60, to collect gaze data. Subjects were asked to look at nine points on a screen while the tracker detected the gaze points and eye movements. Gaze deviation maps were created and combined to form a gaze deviation image, which was fed into a convolutional neural network for categorization. The best performing model of the six attempted achieved 96% sensitivity and 94.1% specificity [42]. Lu et al.'s deep neural network for telemedicine applications was proposed as an automated solution to provide diagnosis for patients living in remote communities and to reduce the burden on on-site specialists [43]. Their algorithm showed excellent results, with a reported detection performance of 0.933 sensitivity, 0.96 specificity, and 0.9865 AUC.

Retinopathy of Prematurity

Retinopathy of prematurity (ROP) accounts for 6–18% of childhood blindness worldwide [44]. Early identification and treatment of the disease have shown better visual outcomes, and therefore accurate and timely diagnosis is critical [45]. There is variability in the diagnostic process for ROP, and inconsistent classification agreement (plus vs. pre-plus vs. normal) among clinicians has been observed [46, 47]. Challenges which may precipitate variation in clinical diagnoses include geographic differences in training, vagueness of the definition of ROP, differing cut points on the continuous spectrum of vascular abnormality as judged by clinicians, and a currently used standard published photograph for plus disease from the 1980s showing a smaller and more magnified view compared to currently available images [48]. As discussed elsewhere in this book, AI and ML techniques have been explored as assistive tools to aid with these challenges and improve diagnostic accuracy.

First attempts at automatic and semi-automatic imaging analysis used wide-angle RetCam images and focused on quantification of dilation and vascular tortuosity for plus disease diagnosis. The diagnostic decisions of these systems were compared to expert diagnosis, and clinical applications were not realized because of either limitations in usability or lack of agreement with the experts [49–51]. In particular, an ML system proposed by Ataer-Cansizoglu et al. [52] (Imaging & Informatics in ROP, i-ROP) showed promising classification capabilities by comparing different cropped shapes and sizes of images, as well as extracted tortuosity and dilation features. Their algorithm showed high diagnostic accuracy (95%) using a large circular six disc-diameter crop of wide-angle RetCam images and a metric which combined arterial and venous tortuosity, which was shown to be comparable to 3 experts (96%, 94%, 92%) and higher than the mean accuracy of 31 non-experts (81%). Despite the algorithm performing well, manual segmentation of the images limited usability in a real-world setting [52].

Brown et al. [5] presented their fully automated i-ROP DL system in 2018, which was developed to provide a three-level plus disease diagnosis in ROP patients and performed well on both internal and external validation sets. 5511
RetCam photographs were collected over a 5-year period, and U-Net vessel segmentation and a pretrained Inception-V1 network were used on the images. The algorithm was compared to gold standard diagnosis performed by an ophthalmic examination by one expert and image analysis by three experts. A fivefold cross-validation showed AUCs of 0.94 and 0.98 for normal vs. pre-plus and plus, and plus vs. normal and pre-plus classifications, respectively. External validation on an independent dataset of 100 wide-angle RetCam images continued to show excellent specificity and sensitivity, with plus classification exhibiting 93% sensitivity and 94% specificity and pre-plus or worse exhibiting 100% sensitivity and 94% specificity [5].

Building on Brown et al.'s work, Redd et al. [53] used the same DL system to classify images and produce a probability-based nine-level disease severity scale. The reference standard diagnosis integrated image-based and ophthalmic diagnoses from expert clinicians; 4861 examinations from 870 infants were included in the analysis and showed an AUC of 0.960 and 94% sensitivity for detecting type-1 ROP and an AUC of 0.910 for clinically significant ROP, showing the system's ability to recognize broad as well as specific categories of the disease. Furthermore, the i-ROP DL severity score correlated with disease severity decided by expert graders, lending evidence that (i) ROP phenotypes present on a continuum of severity from mild to severe and (ii) this severity can be measured automatically and accurately with their proposed system [53].

More recently, Mao et al. [54] presented their deep convolutional neural network (DCNN), which provided a diagnostic decision and analyzed pathological features of ROP to generate quantitative metrics for tortuosity, width, fractal dimension, and vessel density. Three DCNNs were used in the network architecture. A modified U-Net segmented the blood vessels while another segmented the optic disc. Three-level classification was performed by DenseNet. Whole images were used during training of the system, and data augmentation was performed by flipping the images vertically and horizontally as well as rotating in three-degree increments. The authors reported a sensitivity and specificity of 95.1% and 97.8% for diagnosis of plus disease and 92.4% and 97.4% for pre-plus or worse [54].

Neuro-ophthalmology Applications

Identifying abnormalities of the optic nerve can help to reveal vision-related neurological conditions. Detection of papilledema, for example, can alert clinicians to possible elevated intracranial pressure from life-threatening brain tumors or blood clots in the brain. Fundus photographs can capture changes to the optic nerve head, allowing for examination without the use of an ophthalmoscope, a tool which non-ophthalmologists may find difficult to use. AI systems have been used on fundus photo data to detect abnormalities in the optic nerve head and have been proposed as a diagnostic aid for patients with neuro-ophthalmic disease who may present at non-neurologic or non-ophthalmic health care clinics.

Optic Disc Abnormalities

Changes to the optic nerve head due to neuro-ophthalmic causes are rare compared to optic disc abnormalities caused by glaucoma, and the ability of deep learning to aid in the detection of non-glaucomatous vs. glaucomatous optic neuropathy has been explored. Yang et al.'s [55] convolutional neural network built to perform such a task showed a diagnostic accuracy of 93.4% sensitivity, 81.8% specificity, and 0.874 AUC, with false positive cases exhibiting optic discs that were tilted or showed extensive peripapillary atrophy. The rarity of neuro-ophthalmic changes in optic disc appearance, compared with glaucoma-caused changes, may contribute to neuro-ophthalmic misdiagnosis. Non-ophthalmic physicians may not be confident in identifying markers of neuro-ophthalmic disease, including those presenting on images of optic discs, and there is a resulting unmet need for an objective approach to abnormal optic disc diagnosis. Indeed, a 2019 review found high rates of misdiagnosis in neuro-ophthalmology in the literature, with patients undergoing unnecessary, costly treatments and tests before being seen by a neuro-ophthalmologist [56].
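The flip-and-rotate augmentation described for Mao et al.'s ROP DCNN can be sketched as below. The toy image and the nearest-neighbour resampling are illustrative assumptions; only the vertical/horizontal flips and small-angle rotations are taken from the description above:

```python
import numpy as np

def rotate_nn(image, deg):
    """Nearest-neighbour rotation of a 2-D image about its centre."""
    theta = np.deg2rad(deg)
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.indices((h, w))
    # inverse-map each output pixel back to its source coordinate
    sy = np.cos(theta) * (ys - cy) + np.sin(theta) * (xs - cx) + cy
    sx = -np.sin(theta) * (ys - cy) + np.cos(theta) * (xs - cx) + cx
    sy = np.clip(np.rint(sy).astype(int), 0, h - 1)
    sx = np.clip(np.rint(sx).astype(int), 0, w - 1)
    return image[sy, sx]

def augment(image, angles=(3, -3)):
    """Yield flipped and slightly rotated copies of a 2-D image array."""
    yield np.flip(image, axis=0)          # vertical flip
    yield np.flip(image, axis=1)          # horizontal flip
    for deg in angles:                    # small rotations
        yield rotate_nn(image, deg)

fundus = np.arange(64 * 64, dtype=float).reshape(64, 64)  # toy "image"
augmented = list(augment(fundus))
```

In a real training pipeline each augmented copy keeps the label of the original image, multiplying the effective size of a scarce labelled dataset.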
A few studies have attempted automatic detection of non-glaucomatous optic disc abnormalities, using ML techniques to diagnose and stage neuro-ophthalmic conditions by detecting optic disc irregularities, such as papilledema and optic disc atrophy, in fundus photographs. One early study graded papilledema severity using ML and decision tree forest classification to extract optic disc margin features, vascular features, and peripapillary texture features from fundus photographs [57]. The system achieved substantial agreement with a severity grading ground truth provided by an expert neuro-ophthalmologist. In a subsequent study, a support vector machine (SVM) was used, alongside a hybrid feature extraction method, to achieve excellent accuracy (92.86%) of papilledema detection on a dataset of 160 fundus images, and even higher accuracy (97.85%) for grading papilledema images as 'mild' or 'severe' [58]. Fatima et al. [59] also presented a computer-aided system to detect papilledema. After preprocessing of fundus photos to detect the optic disk and perform vessel segmentation, 26 features were extracted which represented optic disk change. The best features were selected within four categories: color, texture, vascular, and disc margin obstruction. Again, an SVM system was tested on 160 fundus images from two publicly available data sets, STARE and AFIO, achieving a 95.6% accuracy for the STARE dataset, 87.4% for the AFIO dataset, and 85.9% for the combined data [59].

A deep learning system has been introduced to detect papilledema vs. other abnormalities vs. normal status, using fundus photographs from a multi-site, multiethnic dataset of over 15,000 photos [60]. 14,341 images from 19 sites in 11 countries were used for training and internal testing, and 1505 images from five other sites were used for external validation. The internal testing showed a 0.99 AUC for both papilledema detection vs. normal and other abnormalities, and normal status detection vs. papilledema and other abnormalities. External validation resulted in an AUC of 0.96 for the detection of papilledema, with sensitivity and specificity of 96.4% and 84.7%, respectively [60].

Other studies attempting multi-level classification of optic disk images have included Ahn et al.'s proposed ML techniques for distinguishing between optic neuropathies and pseudopapilledema, and Yang et al.'s identification of optic disc pallor [61, 62]. A deep learning system with the ability to determine laterality (right vs. left eye) from fundus photo features was proposed recently. This algorithm could help with the laborious and time-consuming task of sorting and labeling images and was developed so that determination could be achieved for photos displaying normal or abnormal optic discs [63].

Reinforcement and Inverse Reinforcement Learning for Surgical Applications

Reinforcement learning (RL) and inverse reinforcement learning (IRL), or apprenticeship learning, are AI algorithms that are being explored for healthcare applications. Both methods may have potential applications to surgery, surgical robotics, and surgical training and assessment. In RL, the goal is finding the optimal policy. Once an optimal value function is learned, it is possible to generate the optimal policy (surgical skill) for a given task from the value function [64]. RL algorithms learn by trial and error, taking as input sequences of interactions (histories) between the decision maker, in this case the surgeon, and their environment [65]. At every decision point, the RL algorithm chooses an action according to its policy and receives new observations and immediate outcomes (rewards). In IRL, we recover the reward function: the optimal policy is given by an expert or another agent (mentor), and we find out what the reward function is. RL and IRL may have potential in healthcare when learning requires physician demonstration, for example in learning to suture wounds for robotic-assisted surgery [66]. As such, there is potential for applications in robotic-assisted surgery (RAS) in ophthalmology and training.

RL can learn from a surgeon's motions, enhancing RAS. Additionally, segmentation techniques can reconstruct open wounds from imaging, and a suturing method can be generated by finding an optimal trajectory taking into consideration external factors, such as joint motions
96 J. A. Hallak et al.
and obstacles. Image-trained RNNs can also be developed to tie knots autonomously by learning sequences of events, such as the surgeon's hand movements.

IRL may also have potential applications for surgical training. These applications assume that the mentor has the same reward function as the observer/trainee and chooses from the same set of actions. The idea is then to infer the reward function of the mentor so as to produce the observed behavior. IRL assumes that the reward function is expressed as a linear function of known features. A prime example is teaching a person how to drive: rather than tell a young, unseasoned driver what the reward function is, it is more natural to demonstrate to them how to drive [67]. Supervised methods have been developed to mimic the mentor. However, there are instances where blindly following the mentor's trajectory may not work because the environment encountered is different (consider the example of highway driving and varying traffic patterns). Abbeel and Ng were the first to apply IRL for apprenticeship learning in simulation studies and reported that the policy found will have performance comparable to or better than that of the expert, on the expert's unknown reward function [67].

Challenges in applying learning algorithms to surgical applications will be more difficult to overcome and include correctly localizing an instrument's position and orientation in surgical scenes, data collection, and adapting to completely unknown and unique situations, which may cause serious surgical errors [64]. Limiting these applications to simulation studies and training may be the safest approach.

AI in the Post-COVID-19 Era

COVID-19 has transformed all aspects of medicine and ophthalmology, from clinical care and research to educational and public health activities. We predict that in the post-COVID-19 era, AI applications will further expand and play a central role in ophthalmology. Telemedicine will be embraced more as a tool for healthcare delivery. Home monitors and smart devices, with AI-embedded applications, and their connectivity to ophthalmic care providers will allow for immediate intervention. The need to overcome the hurdles of affordability and data privacy will be a priority for many of these applications. Additionally, challenges in translating deep learning applications due to limited explainability and interpretability need to be addressed to ensure targeted representation and to identify potential biases. AI researchers and clinicians should work together to increase interpretability and integrate AI systems into the clinical decision-making process in a responsible manner.

References

1. Ting DSW, Pasquale LR, Peng L, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019;103(2):167–75. https://doi.org/10.1136/bjophthalmol-2018-313173.
2. Hogarty DT, Mackey DA, Hewitt AW. Current state and future prospects of artificial intelligence in ophthalmology: a review. Clin Exp Ophthalmol. 2019;47(1):128–39. https://doi.org/10.1111/ceo.13381.
3. Esteva A, Robicquet A, Ramsundar B, et al. A guide to deep learning in healthcare. Nat Med. 2019;25:24–9. https://doi.org/10.1038/s41591-018-0316-z.
4. Treder M, Lauermann JL, Eter N. Deep learning-based detection and classification of geographic atrophy using a deep convolutional neural network classifier. Graefes Arch Clin Exp Ophthalmol. 2018;256(11):2053–60. https://doi.org/10.1007/s00417-018-4098-2.
5. Brown JM, Campbell JP, Beers A, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):803–10. https://doi.org/10.1001/jamaophthalmol.2018.1934.
6. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10. https://doi.org/10.1001/jama.2016.17216.
7. Wang M, Pasquale LR, Shen LQ, et al. Reversal of glaucoma hemifield test results and visual field features in glaucoma. Ophthalmology. 2018;125(3):352–60. https://doi.org/10.1016/j.ophtha.2017.09.021.
8. Lee CS, Baughman DM, Lee AY. Deep learning is effective for classifying normal versus age-related macular degeneration OCT images. Ophthalmol Retina. 2017;1(4):322–7.
9. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology.
2017;124(7):962–9. https://doi.org/10.1016/j.ophtha.2017.02.008.
10. Schmidt-Erfurth U, Waldstein SM, Klimscha S, et al. Prediction of individual disease conversion in early AMD using artificial intelligence. Invest Ophthalmol Vis Sci. 2018;59(8):3199–208. https://doi.org/10.1167/iovs.18-24106.
11. Rich WL III, Chiang MF, Lum F, Hancock R, Parke DW II. Performance rates measured in the American Academy of Ophthalmology IRIS© Registry (Intelligent Research in Sight). Ophthalmology. 2018;125(5):782–4. https://doi.org/10.1016/j.ophtha.2017.11.033.
12. Maeda N, Klyce SD, Smolek MK, Thompson HW. Automated keratoconus screening with corneal topography analysis. Invest Ophthalmol Vis Sci. 1994;35(6):2749–57.
13. Smolek MK, Klyce SD. Current keratoconus detection methods compared with a neural network approach. Invest Ophthalmol Vis Sci. 1997;38(11):2290–9.
14. de Carvalho LAV, Barbosa MS. Neural networks and statistical analysis for classification of corneal videokeratography maps based on Zernike coefficients: a quantitative comparison. Arquivos Brasileiros de Oftalmologia. 2008;71:337–41. https://doi.org/10.1590/S0004-27492008000300006.
15. Kovács I, Miháltz K, Kránitz K, et al. Accuracy of machine learning classifiers using bilateral data from a Scheimpflug camera for identifying eyes with preclinical signs of keratoconus. J Cataract Refract Surg. 2016;42(2):275–83. https://doi.org/10.1016/j.jcrs.2015.09.020.
16. Hwang ES, Perez-Straziota CE, Kim SW, Santhiago MR, Randleman JB. Distinguishing highly asymmetric keratoconus eyes using combined Scheimpflug and spectral-domain OCT analysis. Ophthalmology. 2018;125(12):1862–71. https://doi.org/10.1016/j.ophtha.2018.06.020.
17. Ambrósio R Jr, Lopes BT, Faria-Correia F, Salomão MQ, Bühren J, Roberts CJ, Elsheikh A, Vinciguerra R, Vinciguerra P. Integration of Scheimpflug-based corneal tomography and biomechanical assessments for enhancing ectasia detection. J Refract Surg. 2017;33(7):434–43. https://doi.org/10.3928/1081597X-20170426-02.
18. Kamiya K, Ayatsuka Y, Kato Y, et al. Keratoconus detection using deep learning of colour-coded maps with anterior segment optical coherence tomography: a diagnostic accuracy study. BMJ Open. 2019;9(9):e031313. https://doi.org/10.1136/bmjopen-2019-031313.
19. Balidis M, Papadopoulou I, Malandris D, Zachariadis Z, Sakellaris D, Vakalis T, et al. Using neural networks to predict the outcome of refractive surgery for myopia. 4Open. 2019;2:29.
20. Yoo TK, Ryu IH, Lee G, Kim Y, Kim JK, Lee IS, et al. Adopting machine learning to automatically identify candidate patients for corneal refractive surgery. NPJ Digit Med. 2019;2(1):59. https://doi.org/10.1038/s41746-019-0135-8.
21. Xie Y, Zhao L, Yang X, et al. Screening candidates for refractive surgery with corneal tomographic-based deep learning. JAMA Ophthalmol. 2020;2020:e200507. https://doi.org/10.1001/jamaophthalmol.2020.0507. [published online ahead of print, 2020 Mar 26].
22. Yoo TK, Ryu IH, Choi H, et al. Explainable machine learning approach as a tool to understand factors used to select the refractive surgery technique on the expert level. Transl Vis Sci Technol. 2020;9(2):8. https://doi.org/10.1167/tvst.9.2.8.
23. Duncan DD, Shukla OB, West SK, et al. New objective classification system for nuclear opacification. J Opt Soc Am A Opt Image Sci Vis. 1997;14:1197–204.
24. Li H, Lim JH, Liu J, et al. An automatic diagnosis system of nuclear cataract using slit-lamp images. IEEE Trans Biomed Eng. 2010;57:1690–8.
25. Srivastava R, Gao X, Yin F, et al. Automatic nuclear cataract grading using image gradients. J Med Imaging (Bellingham). 2014;1:014502.
26. Gao X, Lin S, Wong TY. Automatic feature learning to grade nuclear cataracts based on deep learning. IEEE Trans Biomed Eng. 2015;62:2693–701.
27. Liu X, Jiang J, Zhang K, Long E, Cui J, et al. Localization and diagnosis framework for pediatric cataracts based on slit-lamp images using deep features of a convolutional neural network. PLoS One. 2017;12(3):e0168606.
28. Zhang H, Niu K, Xiong Y, et al. Automatic cataract grading methods based on deep learning. Comput Methods Programs Biomed. 2019;182:104978.
29. Reid JE, Eaton E. Artificial intelligence for pediatric ophthalmology. Curr Opin Ophthalmol. 2019;30(5):337–46. https://doi.org/10.1097/ICU.0000000000000593.
30. Hertle RW. Clinical characteristics of surgically treated adult strabismus. J Pediatr Ophthalmol Strabismus. 1998;35(3):138–68.
31. Anderson HA, Manny RE, Cotter SA, et al. Effect of examiner experience and technique on the alternate cover test. Optom Vis Sci. 2010;87:168–75.
32. Hrynchak PK, Herriot C, Irving EL. Comparison of alternate cover test reliability at near in non-strabismus between experienced and novice examiners. Ophthalmic Physiol Opt. 2010;30:304–9.
33. Yang HK, Seo JM, Hwang JM, Kim KG. Automated analysis of binocular alignment using an infrared camera and selective wavelength filter. Investig Ophthalmol Vis Sci. 2013;54:2733–7.
34. De Almeid JDS, Silva AC, De Paiva AC, et al. Computational methodology for automatic detection of strabismus in digital images through Hirschberg test. Comput Biol Med. 2012;42:135–46.
35. Chandna A, Fisher A, Cunninghan I, et al. Pattern recognition of vertical strabismus using an artificial neural network (StrabNet). Strabismus. 2009;17(4):131–8.
36. Kim TY, Seo SS, Kim YJ, et al. A new software for quantitative measurement of strabismus based on digital image. J Korea Multimedia Soc. 2012;15(5):595–605.
37. Seo MW, Yang HK, Hwang JM, Seo JM. The automated diagnosis of strabismus using an infrared camera. 6th Eur Conf Int Fed Med Biol Eng. 2015;45:142–5.
38. Khumdat N, Phukpattaranont P, Tengtrisorn S. Development of a computer system for strabismus screening. In: 6th Biomedical Engineering International Conference. IEEE; 2013. p. 1–5.
39. Valente TL, de Almeida JD, Silva AC, et al. Automatic diagnosis of strabismus in digital videos through cover test. Comput Methods Prog Biomed. 2017;140:295–305.
40. Quick MW, Boothe RG. A photographic technique for measuring horizontal and vertical eye alignment throughout the field of gaze. Investig Ophthalmol Vis Sci. 1992;33:234–46.
41. Model D, Eizenman M. An automated Hirschberg test for infants. IEEE Trans Biomed Eng. 2011;58:103–9.
42. Chen Z, Fu H, Lo WL, Chi Z. Strabismus recognition using eye-tracking data and convolutional neural networks. J Healthc Eng. 2018;2018:1–9.
43. Lu J, Feng J, Fan Z, et al. Automated strabismus detection based on deep neural networks for telemedicine applications. 2018. https://deepai.org/publication/automated-strabismus-detection-based-on-deep-neural-networks-for-telemedicine-applications. Accessed 31 Jul 2020.
44. Fleck BW, Dangata Y. Causes of visual handicap in the Royal Blind School, Edinburgh, 1991–2. Br J Ophthalmol. 1994 May;78(5):421.
45. Early Treatment for Retinopathy of Prematurity Cooperative Group, Good WV, Hardy RJ, Dobson V, et al. Final visual acuity results in the early treatment for retinopathy of prematurity study. Archiv Ophthalmol. 2010;128(6):663.
46. Chan RP, Williams SL, Yonekawa Y, et al. Accuracy of retinopathy of prematurity diagnosis by retinal fellows. Retina (Philadelphia, PA). 2010;30(6):958.
47. Myung JS, Chan RV, Espiritu MJ, et al. Accuracy of retinopathy of prematurity image-based diagnosis by pediatric ophthalmology fellows: implications for training. J Am Assoc Pediatric Ophthalmol Strabismus. 2011;15(6):573–8.
48. Ting DS, Peng L, Varadarajan AV, et al. Deep learning in ophthalmology: the technical and clinical considerations. Progr Retinal Eye Res. 2019;72:100759.
49. Koreen S, Gelman R, Martinez-Perez ME, et al. Evaluation of a computer-based system for plus disease diagnosis in retinopathy of prematurity. Ophthalmology. 2007;114(12):e59–67.
50. Wilson CM, Wong K, Ng J, Cocker KD, et al. Digital image analysis in retinopathy of prematurity: a comparison of vessel selection methods. J Am Assoc Pediatric Ophthalmol Strabismus. 2012;16(3):223–8.
51. Abbey AM, Besirli CG, Musch DC, et al. Evaluation of screening for retinopathy of prematurity by ROP tool or a lay reader. Ophthalmology. 2016;123(2):385–90.
52. Ataer-Cansizoglu E, Bolon-Canedo V, Campbell JP. Computer-based image analysis for plus disease diagnosis in retinopathy of prematurity: performance of the "i-ROP" system and image features associated with expert diagnosis. Transl Vis Sci Technol. 2015;4(6):5.
53. Redd TK, Campbell JP, Brown JM, et al. Evaluation of a deep learning image assessment system for detecting severe retinopathy of prematurity. Br J Ophthalmol. 2019;103(5):580–4.
54. Mao J, Luo Y, Liu L, et al. Automated diagnosis and quantitative analysis of plus disease in retinopathy of prematurity based on deep convolutional neural networks. Acta Ophthalmol. 2020;98(3):e339–45.
55. Yang HK, Kim YJ, Sung JY, et al. Efficacy for differentiating nonglaucomatous versus glaucomatous optic neuropathy using deep learning systems. Am J Ophthalmol. 2020;2.
56. Stunkel L, Newman NJ, Biousse V. Diagnostic error and neuro-ophthalmology. Curr Opin Neurol. 2019;32(1):62–7.
57. Echegaray S, Zamora G, Yu H, et al. Automated analysis of optic nerve images for detection and staging of papilledema. Invest Ophthalmol Vis Sci. 2011;52:7470–8.
58. Akbar S, Akram MU, Sharif M, et al. Decision support system for detection of papilledema through fundus retinal images. J Med Syst. 2017;41:66.
59. Fatima KN, Hassan T, Akram MU, et al. Fully automated diagnosis of papilledema through robust extraction of vascular patterns and ocular pathology from fundus photographs. Biomed Opt Express. 2017;8:1005–24.
60. Milea D, Najjar RP, Zhubo J, Ting D, Vasseneix C, Xu X, et al. Artificial intelligence to detect papilledema from ocular fundus photographs. N Engl J Med. 2020;382(18):1687–95.
61. Ahn JM, Kim S, Ahn KS, et al. Accuracy of machine learning for differentiation between optic neuropathies and pseudopapilledema. BMC Ophthalmol. 2019;19:178.
62. Yang HK, Oh JE, Han SB, et al. Automatic computer-aided analysis of optic disc pallor in fundus photographs. Acta Ophthalmol (Copenh). 2019;97:e519–25.
63. Liu TYA, Ting DSW, Yi PH, Wei J, Zhu H, Subramanian PS, et al. Deep learning and transfer learning for optic disc laterality detection: implications for machine learning in neuro-ophthalmology. J Neuroophthalmol. 2020;40(2):178–84.
64. Kassahun Y, Yu B, Tibebu AT, Stoyanov D, Giannarou S, Metzen JH, et al. Surgical robotics beyond enhanced dexterity instrumentation: a survey of machine learning techniques and their role in intelligent and autonomous surgical actions. Int J CARS. 2016;11(4):553–68. https://doi.org/10.1007/s11548-015-1305-z.
65. Gottesman O, Johansson F, Komorowski M, Faisal A, Sontag D, Doshi-Velez F, et al. Guidelines for reinforcement learning in healthcare. Nat Med. 2019;25(1):16–8. https://doi.org/10.1038/s41591-018-0310-5.
66. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9. https://doi.org/10.1038/s41591-018-0316-z.
67. Abbeel P, Ng AY. Apprenticeship learning via inverse reinforcement learning. In: Proceedings of the twenty-first international conference on Machine learning (ICML '04). Banff, AB, Canada: Association for Computing Machinery; 2004. p. 1. https://doi.org/10.1145/1015330.1015430.
8 Artificial Intelligence in Age-Related Macular Degeneration (AMD)

Introduction

Fig. 8.1 Demonstration of the AMD pathologies, according to the Age-Related Eye Disease Study (AREDS)
and/or pigmentary abnormalities at the macula (Fig. 8.1). There are two forms of late AMD: neovascular AMD and atrophic AMD, defined by geographic atrophy (GA).

Automated image analysis tools have demonstrated promising results in biology and medicine [8–13]. Early retinal image classification systems of color fundus photographs (CFP) had adopted traditional machine learning with human-engineered features [14]. Recently, deep learning, a subfield of machine learning, has generated substantial interest in the field of ophthalmology [8, 15–20]. Compared to the traditional imaging workflow that relies heavily on human grading, which is highly time-consuming, deep learning enables more accurate and efficient approaches.

Due to the importance of deep learning in the imaging-based analysis of AMD, this chapter aims to discuss the role of medical imaging, empowered by AI, in detecting AMD, which will encourage future practical applications and methodological research.

In the following, we first introduce common data modalities and publicly available datasets and then summarize popular machine learning methods in AMD detection, classification, and prognosis. Finally, we discuss several open problems and challenges. We hope to provide guidance for researchers and ophthalmologists.
In this section, we briefly introduce common data modalities and publicly available data sets that have been widely used in AI-powered AMD research.

Common Data Modalities

Three imaging modalities are commonly used in AI-powered AMD research.

Color fundus photography (CFP) uses a fundus camera to record color images of the interior surface of the eye [21]. A fundus camera, or retinal camera, is a specialized low-power microscope with an attached camera designed to photograph the interior surface of the eye, including the retina, retinal vasculature, optic disc, macula, and posterior pole (i.e., the fundus). The retina is imaged to document conditions such as diabetic retinopathy, age-related macular degeneration, macular edema, and retinal detachment.

Fundus autofluorescence (FAF) is a non-invasive retinal imaging modality used in clinical practice to provide a density map of lipofuscin, the predominant ocular fluorophore, in the retinal pigment epithelium (RPE) [22]. Classically, FAF utilizes blue-light excitation and then collects emissions within preset spectra to form a brightness map reflecting the distribution of lipofuscin in the RPE. Thus, it provides detailed insight into the health of the RPE.

Optical coherence tomography (OCT) is another non-invasive imaging test, which uses light waves to take cross-sectional pictures of the retina [23]. More recently, advances in OCT technology have made it possible to investigate the feasibility of using OCT for the diagnosis and monitoring of retinal diseases such as glaucoma and AMD. Existing studies have used single or multiple modalities for AMD-related disease diagnosis and prognosis, such as using CFP and FAF to detect the presence of RPD [24] and using FAF and OCT to detect the presence of GA [25].

Publicly Available Data Sets

The Age-Related Eye Disease Study (AREDS), sponsored by the National Eye Institute (National Institutes of Health), was a 12-year multi-center prospective cohort study of the clinical course, prognosis, and risk factors of AMD, as well as a phase III randomized clinical trial to assess the effects of nutritional supplements on AMD progression [26]. In short, 4757 participants aged 55–80 years were recruited between 1992 and 1998 at 11 retinal specialty clinics in the United States. The inclusion criteria were wide, from no AMD in either eye to late AMD in one eye. The participants were randomly assigned to placebo, antioxidants, zinc, or the combination of antioxidants and zinc. The AREDS dataset is publicly accessible to researchers by request at dbGaP (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000001.v3.p1).

Similarly, AREDS2 was a multi-center phase III randomized clinical trial that analyzed the effects of different nutritional supplements on the course of AMD [27]. 4203 participants aged 50–85 years were recruited between 2006 and 2008 at 82 retinal specialty clinics in the United States. The inclusion criteria were the presence of either bilateral large drusen or late AMD in one eye and large drusen in the fellow eye. The participants were randomly assigned to placebo, lutein/zeaxanthin, docosahexaenoic acid (DHA) plus eicosapentaenoic acid (EPA), or the combination of lutein/zeaxanthin and DHA plus EPA. AREDS supplements were also administered to all AREDS2 participants because they were by then considered the standard of care [28].

AMD Scales

AREDS Simplified Severity Scale and 4-step Scale: Longitudinal analysis of the AREDS cohort led to the development of the patient-based AREDS Simplified Severity Scale for AMD, based on CFP [29] (Fig. 8.2a). This simplified scale provides convenient risk factors for assessing the risk of progression to late AMD that can be determined by clinical examination or by less
104 Y. Peng et al.
Fig. 8.2 Different AMD scales. (a) AREDS Simplified Severity Scale scoring schematic for participants with and without late age-related macular degeneration (AMD). Pigmentary abnormalities: 0 = no, 1 = yes; drusen size: 0 = small or none, 1 = medium, 2 = large; late AMD: 0 = no, 1 = yes. (b) AREDS Severity Scale scores 1–9, defined by graders from four categories: geographic atrophy (0/1, i.e., absent/present), increased pigment (0/1, i.e., absent/present), depigmentation (graded 0–3), and drusen area (graded 0–5). The final AREDS Severity Scale score (steps 1–9, shown shaded in different colors) is defined by the combination of findings from these four categories
demanding photographic procedures than used in the AREDS. The scale combines risk factors from both eyes to generate an overall score for the individual, based on the presence of one or more large drusen (diameter >125 μm) and/or AMD pigmentary abnormalities at the macula of each eye [29]. The Simplified Severity Scale is clinically useful in that it allows ophthalmologists to predict an individual's 5-year risk of developing late AMD. This 5-step scale (from score 0 to 4) estimates the 5-year risk of the development of late AMD in at least one eye as 0.4%, 3.1%, 11.8%, 25.9%, and 47.3%, respectively [29].

9-step AREDS Severity Scale: The AREDS 9-step Severity Scale was designed to allow reading center graders to perform comprehensive classification of AMD-related features on fundus photographs, as a research tool [30] (Fig. 8.2b). The detailed nature of the system means that ophthalmologists in clinical practice very rarely use it. In short, the 9-step severity scale combines a 6-step drusen area scale with a 5-step pigmentary abnormality scale. The 5-year risk of advanced AMD increases progressively from less than 1% in step 1 to about 50% in step 9.

Methods

Automated image analysis tools have demonstrated promising results in biology and medicine [8–13]. In particular, deep learning, a subfield of machine learning, has recently generated substantial interest in the field of ophthalmology [8, 15–20]. In general, deep learning is the process of training algorithmic models with labeled data (e.g., CFP categorized manually as containing pigmentary abnormalities or not), where these models can then be used to assign labels automatically to new data. Deep learning differs from traditional machine learning methods in that specific image features do not need to be pre-specified by experts in the field. Instead, the image features are learned directly from the images themselves. Past studies have utilized deep learning systems for the identification of various retinal diseases, including diabetic retinopathy [31–36], glaucoma [36–39], and retinopathy of prematurity [40]. In recent years, we have seen the successful use of deep learning on AMD (Fig. 8.3).

Deep Learning in AMD

Classification of 2-class and 4-class AMD severity: Recently, several deep learning systems have been developed for the classification of CFP into AMD severity scales, at the level of the individual eye. They applied deep neural networks to a 2-class classification problem in which the task was to distinguish referable AMD (intermediate or advanced disease) from non-referable AMD (no or early disease) [16, 20, 36, 41, 42]. Burlina et al. used transfer learning and universal
Fig. 8.3 Timeline of deep learning studies in AMD: classification of referable vs. non-referable AMD using CFP (Burlina et al., 2017a; Lee et al., 2017; Ting et al., 2017); 4-class AMD severity using CFP and universal features (Burlina et al., 2017b); dry-AMD and AMD detection using OCT images (Karri et al., 2017; Lee et al., 2017); 9-step AREDS severity scale using CFP and an ensemble model (Grassmann et al., 2018); wet-AMD detection using ultra-wide-field fundus images (Matsuba et al., 2018); GA in fundus autofluorescence (Treder et al., 2018b); late AMD prognosis at 5 years using CFP (Burlina et al., 2018); 9-step AREDS severity scale using CFP and a multi-task strategy (Chen et al., 2019); AMD simplified severity score using CFP of both eyes (Peng et al., 2019); multiple clinical referral suggestions on OCT images (De Fauw et al., 2019); GA detection in color fundus photographs (Keenan et al., 2019); AMD prognosis over a wide time interval using CFP of both eyes and demographic information (Peng et al., 2020); AMD prognosis beyond the inquired year using CFP and genotypes (Yan et al., 2020); and RPD detection in intermediate to late AMD using FAF and CFP (Keenan et al., 2020)
features to address a 4-class AMD severity classification problem (no, early, intermediate, and advanced disease) and obtained 79.4% vs. 75.8% accuracy for machine and physician grading, respectively [16].

Classification of 9-step AREDS Severity Scale In a more complicated scenario, in which the aim was to classify AMD according to the 9-step AREDS Severity Scale, Grassmann et al. trained an ensemble of six different neural network architectures with a random forest approach to predict the 9-step AREDS severity scale directly in an AREDS test set, with an overall accuracy of 63.3%, outperforming human graders [17] (Fig. 8.2b). However, 1-step classification of images according to the 9-step AREDS Severity Scale does not reflect regular human grading practice. In reading centers, graders instead first calculate individual scores for four different AMD characteristics (drusen area, geographic atrophy, increased pigmentation, and depigmentation), then combine the scores for these four characteristics into the 9-step non-advanced AREDS severity scale [30]. Hence, a deep learning approach that predicts the overall AREDS severity score directly, without these intermediate steps, may have lower transparency and decreased information content for research and clinical purposes [43, 44]. To address these potential criticisms, Chen et al. designed four deep learning models, each of which is responsible for classifying an individual characteristic, and trained them separately using a multi-task strategy [45]. Evaluation on both the AREDS and AREDS2 datasets showed that the accuracy of the model exceeded the current state-of-the-art model by >10%.

Classification of AREDS Simplified Severity Scale Besides AMD classification at the level of the individual eye, it is also helpful to obtain one overall score for the individual from both eyes (Fig. 8.2a). This is particularly relevant because estimates of rates of progression to late AMD are highly influenced by the status of fellow eyes, as the behavior of the two eyes is highly correlated [29]. To this end, Peng et al. proposed a deep learning framework to automatically identify AMD severity from CFP of both eyes [46]. It mimicked the human grading process by first detecting individual risk factors (drusen and pigmentary abnormalities) in each eye and then combining values from both eyes to assign an AMD score to each patient. Thus, the model closely matches the clinical decision-making process, which allows an ophthalmologist to inspect and visualize an interpretable result.

Prediction of risk of progression to late AMD Besides AMD classification, making accurate time-based predictions of progression to late AMD is also clinically critical. This would enable improved decision-making regarding: (i) medical treatments, especially oral supplements known to decrease progression risk; (ii) lifestyle interventions, mainly smoking cessation and dietary changes; and (iii) intensity of patient monitoring, e.g., frequent reimaging in the clinic or tailored home monitoring programs [47–51]. It would also aid the design of future clinical trials, which could be enriched for participants at high risk of progression events [52].

Currently, three methods are available clinically for using CFP to predict the risk of progression. Of the three, the most commonly used is the AREDS Simplified Severity Scale, described above [29]. The second is an online risk calculator [53]. Like the Simplified Severity Scale, its inputs include the presence of macular drusen and pigmentary abnormalities; however, it can also receive the individual's age, smoking status, and basic genotype information consisting of two SNPs (when available). The third is a deep learning-based architecture that predicts progression with improved accuracy and transparency in two steps: image classification followed by survival analysis [54]. The model was developed and clinically validated on two datasets, from AREDS and AREDS2.

Classification of AMD on Optical Coherence Tomography Besides CFP, OCT also plays a major role in the detection of AMD [55]. Several recent studies have reported robust performance in
8 Artificial Intelligence in Age-Related Macular Degeneration (AMD) 107
the automated classification of AMD from OCT scans. Karri et al. fine-tuned a CNN model to classify OCT images of dry AMD [56]. Lee et al. developed an algorithm to categorize OCT images as either normal or AMD [20]. The images were linked to clinical data points from the electronic health record, and gold-standard labels were extracted using the ICD-9 diagnosis codes. At a patient level, the model achieved an area under the ROC curve of 97.45% with an accuracy of 93.45%. De Fauw et al. further demonstrated expert performance on multiple clinical referral suggestions for two independent test datasets [57].

Deep Learning in GA

Geographic atrophy (GA) is the defining lesion of the atrophic form of late AMD. GA in AMD has been estimated to affect over 5 million people worldwide [3, 4]. Unlike neovascular AMD, no drug therapies are available to prevent GA, slow its enlargement, or restore lost vision; this makes it an important research priority [58, 59]. Rapid and accurate identification of eyes with GA could improve the recruitment of eligible patients for future clinical trials and eventually lead to early identification of appropriate patients for proven treatments.

Since the original description of GA by Gass [60], clinical definitions have varied between research groups [61]. In the AREDS, GA was defined as a sharply demarcated, usually circular zone of partial or complete depigmentation of the RPE, typically with exposure of underlying large choroidal blood vessels, at least as large as grading circle I-1 (1/8 disc diameter in diameter) [62]. The sensitivity of the retina to light stimuli is markedly decreased (i.e., dense scotomata) in areas affected by GA. The natural history of the disease involves progressive enlargement of GA lesions over time, with visual acuity decreasing markedly as the central macula becomes involved [58].

The identification of GA by ophthalmologists conducting dilated fundus examinations is sometimes challenging. This may be particularly true for cases with early GA, comprising smaller lesions with less extensive RPE atrophy. In addition, the increasing prevalence of GA (through aging populations in many countries) will translate to greater demand for retinal services. As such, deep learning approaches involving retinal images, obtained perhaps using telemedicine-based devices, might support GA detection and diagnosis. However, these approaches would require establishing evidence-based and 'explainable' systems that have undergone extensive validation and demonstrated performance metrics at least non-inferior to those of clinical ophthalmologists in routine practice.

In contrast to studies of AMD classification, few studies have explicitly focused on GA. Treder et al. detected and classified GA in FAF images using a deep learning algorithm [63]. Two classifiers were built, one to distinguish healthy patients from patients with GA and the other to distinguish patients with GA from those with other retinal diseases. Both achieved high accuracy. Keenan et al. conducted an empirical study to investigate the performance of deep learning models on CFP [64]. The first model predicted GA presence in a population of eyes ranging from no AMD to advanced AMD; the second model predicted central GA (CGA) presence in the same population; and the third model predicted CGA presence in the subset of eyes with GA. Experiments demonstrated that deep learning could achieve high accuracy for the detection of GA, and compared favorably with human retinal specialists for the detection of CGA.

Deep Learning in RPD

Reticular pseudodrusen (RPD), also known as subretinal drusenoid deposits, have been identified as another disease feature independently associated with an increased risk of progression to late AMD [65]. Unlike soft drusen, which are located in the sub-retinal pigment epithelial (RPE) space, RPD are thought to represent aggregations of material in the subretinal space between the RPE and photoreceptors [65, 66]. Compositional differences have also been found between soft drusen and RPD [66].
108 Y. Peng et al.
24. Chen Q, Keenan TDL, Allot A, Peng Y, Agrón E, Domalpally A, Klaver CCW, Luttikhuizen DT, Colyer MH, Cukras CA, Wiley HE, Teresa Magone M, Cousineau-Krieger C, Wong WT, Zhu Y, Chew EY, Lu Z; AREDS2 Deep Learning Research Group. Multimodal, multitask, multiattention (M3) deep learning detection of reticular pseudodrusen: toward automated and accessible classification of age-related macular degeneration. J Am Med Inform Assoc. 2021;28(6):1135–48. https://doi.org/10.1093/jamia/ocaa302.
25. Arslan J, Samarasinghe G, Benke KK, et al. Artificial intelligence algorithms for analysis of geographic atrophy: a review and evaluation. Transl Vis Sci Technol. 2020;9(2):57. https://doi.org/10.1167/tvst.9.2.57.
26. Age-Related Eye Disease Study Research Group. The Age-Related Eye Disease Study (AREDS): design implications. AREDS report no. 1. Control Clin Trials. 1999;20(6):573–600. https://doi.org/10.1016/s0197-2456(99)00031-8.
27. AREDS2 Research Group, Chew EY, Clemons T, et al. The Age-Related Eye Disease Study 2 (AREDS2): study design and baseline characteristics (AREDS2 report number 1). Ophthalmology. 2012;119(11):2282–9. https://doi.org/10.1016/j.ophtha.2012.05.027.
28. American Academy of Ophthalmology Retina/Vitreous Panel. Preferred Practice Pattern® Guidelines. Age-related macular degeneration. Am Acad Ophthalmol. 2015.
29. Ferris FL, Davis MD, Clemons TE, et al. A simplified severity scale for age-related macular degeneration: AREDS report no. 18. Arch Ophthalmol. 2005;123(11):1570–4. https://doi.org/10.1001/archopht.123.11.1570.
30. Davis MD, Gangnon RE, Lee L-Y, et al. The Age-Related Eye Disease Study severity scale for age-related macular degeneration: AREDS report no. 17. Arch Ophthalmol. 2005;123(11):1484–98. https://doi.org/10.1001/archopht.123.11.1484.
31. Choi JY, Yoo TK, Seo JG, Kwak J, Um TT, Rim TH. Multi-categorical deep learning neural network to classify retinal images: a pilot study employing small database. PLoS One. 2017;12(11):e0187336. https://doi.org/10.1371/journal.pone.0187336.
32. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017;124(7):962–9. https://doi.org/10.1016/j.ophtha.2017.02.008.
33. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10. https://doi.org/10.1001/jama.2016.17216.
34. Raju M, Pagidimarri V, Barreto R, Kadam A, Kasivajjala V, Aswath A. Development of a deep learning algorithm for automatic diagnosis of diabetic retinopathy. Stud Health Technol Inform. 2017;245:559–63.
35. Takahashi H, Tampo H, Arai Y, Inoue Y, Kawashima H. Applying artificial intelligence to disease staging: deep learning for improved staging of diabetic retinopathy. PLoS One. 2017;12(6):e0179790. https://doi.org/10.1371/journal.pone.0179790.
36. Ting DSW, Cheung CY-L, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211–23. https://doi.org/10.1001/jama.2017.18152.
37. Asaoka R, Murata H, Iwase A, Araie M. Detecting preperimetric glaucoma with standard automated perimetry using a deep learning classifier. Ophthalmology. 2016;123(9):1974–80. https://doi.org/10.1016/j.ophtha.2016.05.029.
38. Cerentini A, Welfer D, Cordeiro d'Ornellas M, Pereira Haygert CJ, Dotto GN. Automatic identification of glaucoma using deep learning methods. Stud Health Technol Inform. 2017;245:318–21.
39. Muhammad H, Fuchs TJ, De Cuir N, et al. Hybrid deep learning on single wide-field optical coherence tomography scans accurately classifies glaucoma suspects. J Glaucoma. 2017;26(12):1086–94. https://doi.org/10.1097/IJG.0000000000000765.
40. Brown JM, Campbell JP, Beers A, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):803–10. https://doi.org/10.1001/jamaophthalmol.2018.1934.
41. Matsuba S, Tabuchi H, Ohsugi H, et al. Accuracy of ultra-wide-field fundus ophthalmoscopy-assisted deep learning, a machine-learning technology, for detecting age-related macular degeneration. Int Ophthalmol. Published online May 2018. https://doi.org/10.1007/s10792-018-0940-0.
42. Treder M, Lauermann JL, Eter N. Automated detection of exudative age-related macular degeneration in spectral domain optical coherence tomography using deep learning. Graefes Arch Clin Exp Ophthalmol. 2018;256(2):259–65. https://doi.org/10.1007/s00417-017-3850-3.
43. Doshi-Velez F, Kim B. Towards a rigorous science of interpretable machine learning. arXiv preprint. 2017. https://arxiv.org/abs/1702.08608.
44. Madumal P, Miller T, Vetere F, Sonenberg L. Towards a grounded dialog model for explainable artificial intelligence. arXiv preprint. 2018. https://arxiv.org/abs/1806.08055.
45. Chen Q, Peng Y, Keenan T, et al. A multi-task deep learning model for the classification of age-related macular degeneration. AMIA Jt Summits Transl Sci Proc. 2019;2019:505–14. https://pubmed.ncbi.nlm.nih.gov/31259005.
46. Peng Y, Dharssi S, Chen Q, et al. DeepSeeNet: a deep learning model for automated classification of patient-based age-related macular degeneration severity from color fundus photographs. Ophthalmology. 2018;125(3):369–90. https://doi.org/10.1016/j.ophtha.2017.08.038.
70. Domalpally A, Agrón E, Pak JW, et al. Prevalence, risk, and genetic association of reticular pseudodrusen in age-related macular degeneration: Age-Related Eye Disease Study 2 report 21. Ophthalmology. 2019;126(12):1659–66. https://doi.org/10.1016/j.ophtha.2019.07.022.
71. Alten F, Clemens CR, Heiduschka P, Eter N. Characterisation of reticular pseudodrusen and their central target aspect in multi-spectral, confocal scanning laser ophthalmoscopy. Graefes Arch Clin Exp Ophthalmol. 2014;252(5):715–21. https://doi.org/10.1007/s00417-013-2525-y.
72. Ueda-Arakawa N, Ooto S, Tsujikawa A, Yamashiro K, Oishi A, Yoshimura N. Sensitivity and specificity of detecting reticular pseudodrusen in multimodal imaging in Japanese patients. Retina. 2013;33(3):490–7. https://doi.org/10.1097/IAE.0b013e318276e0ae.
73. van Grinsven MJJP, Buitendijk GHS, Brussee C, et al. Automatic identification of reticular pseudodrusen using multimodal retinal image analysis. Invest Ophthalmol Vis Sci. 2015;56(1):633–9. https://doi.org/10.1167/iovs.14-15019.
74. Keenan TDL, Chen Q, Peng Y, et al. Deep learning automated detection of reticular pseudodrusen from fundus autofluorescence images or color fundus photographs in AREDS2. Ophthalmology. Published online May 21, 2020. https://doi.org/10.1016/j.ophtha.2020.05.036.
75. Garrity ST, Sarraf D, Freund KB, Sadda SR. Multimodal imaging of nonneovascular age-related macular degeneration. Invest Ophthalmol Vis Sci. 2018;59(4):AMD48–64. https://doi.org/10.1167/iovs.18-24158.
76. Holz FG, Sadda SR, Staurenghi G, et al. Imaging protocols in clinical studies in advanced age-related macular degeneration: recommendations from classification of atrophy consensus meetings. Ophthalmology. 2017;124(4):464–78. https://doi.org/10.1016/j.ophtha.2016.12.002.
77. Beede E, Baylor E, Hersch F, Iurchenko A, Wilcox L, Ruamviboonsuk P, Vardoulakis LM. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI '20). 2020. p. 1–12. https://doi.org/10.1145/3313831.3376718.
9 AI and Glaucoma

Zhiqi Chen, Gadi Wollstein, Joel S. Schuman, and Hiroshi Ishikawa
Glaucoma is characterized by progressive loss of retinal ganglion cells (RGCs) and their axons, which may result in optic nerve head (ONH) and retinal nerve fiber layer (RNFL) changes and eventually lead to vision loss and irreversible blindness [1–3]. Since glaucoma is a slowly progressing disease with irreversible neural damage, early diagnosis and sensitive progression monitoring are essential to glaucoma management. For clinical assessment, structural (e.g. fundus photography (Fig. 9.1), optical coherence tomography (OCT, Fig. 9.2)) and functional (e.g. visual field (VF, Fig. 9.3)) measurements are commonly assessed in addition to the conventional observations (e.g. optic disc assessment, intraocular pressure (IOP)). Various longitudinal studies on glaucoma progression have reported contradicting non-linear relationships between structural and functional measurements [4–10]. There are complex, non-linear, asynchronous interactions between them, which are not yet fully understood.

Recently, artificial intelligence (AI) has started to make an impact in ophthalmology [11–15]. Deep learning (DL) is a class of state-of-the-art machine learning (ML) algorithms that are especially tailored to extract meaningful features from complex and high-dimensional data. Consequently, AI algorithms, especially DL, have the potential to revolutionize the diagnosis and management of glaucoma based on the interpretation of functional and/or structural information, and even to improve the understanding of glaucoma by defining the structural features responsible for certain functional damage and identifying phenotypes that follow similar progression patterns. Table 9.1 summarizes current DL applications in glaucoma.

In this chapter, we provide an overview of current AI applications and challenges in glaucoma. Section "Glaucoma Diagnosis" introduces AI utilization in detecting glaucoma; section "Longitudinal Analysis" focuses on the role of AI in longitudinal projection; section "Structural-Functional Correlation" summarizes developments of AI in finding the

Z. Chen: Department of Electrical and Computer Engineering, New York University, Brooklyn, NY, USA; Department of Ophthalmology, NYU Langone Health, New York, NY, USA. e-mail: zc1337@nyu.edu
G. Wollstein · H. Ishikawa (*): Department of Ophthalmology, NYU Langone Health, New York, NY, USA; Department of Biomedical Engineering, New York University, Brooklyn, NY, USA. e-mail: Gadi.Wollstein@nyulangone.org; Hiroshi.Ishikawa@nyulangone.org
J. S. Schuman: Department of Electrical and Computer Engineering, New York University, Brooklyn, NY, USA; Department of Ophthalmology, NYU Langone Health, New York, NY, USA; Department of Biomedical Engineering, New York University, Brooklyn, NY, USA. e-mail: Joel.Schuman@nyulangone.org
Fig. 9.2 Examples of Cirrus OCT reports from a healthy case (a) and a glaucomatous case (b). The image is color-coded (red, orange, and yellow represent thicker areas while green and blue represent thinner areas)

Fig. 9.3 Examples of Humphrey 24-2 VF reports from a healthy eye (a) and a glaucomatous eye (b). Figure (b) shows advanced visual field damage with a superior nasal deficit, large inferior nasal step and superior temporal step, and paracentral scotoma
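A VF report like the one in Fig. 9.3 is a sparse set of test-point sensitivities rather than a regular image, so CNN-based approaches discussed in this chapter must first arrange the points on a grid. The sketch below is a simplified illustration of that preprocessing step, using plain nearest-cell binning at the 6-degree spacing of the 24-2 pattern; it is an assumption for demonstration, not the Voronoi parcelation or other mappings used by the cited studies.

```python
import math
import numpy as np

# Illustrative sketch (assumption, not a published implementation) of
# turning VF test points -- tuples of (x_deg, y_deg, sensitivity_dB) --
# into a coarse 2D image a CNN can convolve over. 24-2 points sit at
# +/-3, +/-9, +/-15, +/-21 degrees, i.e. half-spacing offsets on a
# 6-degree grid, so floor(coord / spacing) bins them without collisions.
def vf_to_image(points, grid=(8, 8), spacing=6.0):
    img = np.zeros(grid, dtype=np.float32)
    cy, cx = grid[0] // 2, grid[1] // 2
    for x_deg, y_deg, db in points:
        r = cy - 1 - math.floor(y_deg / spacing)  # rows: superior field on top
        c = cx + math.floor(x_deg / spacing)      # cols: temporal/nasal axis
        if 0 <= r < grid[0] and 0 <= c < grid[1]:
            img[r, c] = db
    return img


# Three central points; the 2 dB point mimics a localized defect.
demo = [(-3.0, 3.0, 28.0), (3.0, 3.0, 30.0), (3.0, -3.0, 2.0)]
img = vf_to_image(demo)
# img[3, 3] == 28, img[3, 4] == 30, img[4, 4] == 2; all other cells 0
```

In practice the dB values would typically be normalized, and mappings that respect the anatomical layout (such as the Voronoi parcelation of Kucur et al. [28]) preserve spatial relations better than uniform binning.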
representations of data that optimally solve the problem [44–47]. DL models utilize multiple processing layers to obtain scalability and learn hierarchical feature representations of data with multiple levels of abstraction, which are suitable for classification. Therefore, DL models have been studied to improve the accuracy of automated glaucoma diagnosis, as summarized in Table 9.2.

Functional Defects as Input

In clinical practice, VF testing is widely used as the gold standard for disease diagnosis and glaucomatous damage assessment. Several classical ML classifiers (i.e., multilayer perceptron (MLP), SVM, linear (LDA) and quadratic (QDA) discriminant analysis, Parzen window, and mixture of Gaussians (MOG)) have been proposed to automatically discriminate between normal eyes and eyes with pre-perimetric glaucoma based on visual fields, and have shown promising performance [48–50]. With the growth of computational capacity, deeper models have become feasible to implement. Asaoka et al. [26] proposed a multi-layer feed-forward neural network (FNN) with a stacked denoising autoencoder to classify pre-perimetric glaucoma VFs and
healthy VFs, and achieved better performance than shallower ML models.

Previous work showed promising performance in the classification of VFs. Yet, these methods treated each VF point as an individual feature and failed to leverage spatial information within VFs. Spatial information is useful for discovering VF defect patterns and therefore helps glaucoma diagnosis [27]. Thus, incorporating spatial information into ML classifiers may boost their discrimination ability. The CNN is an evolution of the FNN which replaces matrix multiplication with convolution to process spatial information. Thus, researchers started to implement CNN models to discriminate VFs.

Kucur et al. [28] converted VFs to images using a Voronoi parcelation [51]. A seven-layer CNN, which explicitly took spatial information into account through spatial convolutions, was used to discriminate between healthy and early glaucomatous VFs with those converted images as input. Results demonstrated the superiority of the CNN over an NN that did not consider spatial information (average precision score: 0.874 ± 0.095 vs. 0.843 ± 0.089). By computing the gradient, saliency maps can be obtained to visualize important pixels that contribute most to the outputs of the CNN. The saliency maps suggested that CNNs were capable of detecting patterns of localized VF defects.

Li et al. [29] took the probability map of pattern deviation (PD) from the VF reports as the input to a CNN. Results showed that the CNN achieved higher accuracy compared to ophthalmologists,
rule-based methods (Advanced Glaucoma Intervention Study (AGIS) criteria and Glaucoma Staging System (GSS) criteria), and traditional machine learning algorithms (SVM, RF, k-nearest neighbor).

Structural Damage as Input

Assessment of structural damage has become a practical standard for glaucoma diagnosis. Early studies focused on structural measurements obtained from imaging techniques such as confocal scanning laser ophthalmoscopy (CSLO) and scanning laser polarimetry (SLP) [47, 52–58]. Promising performance of ML classifiers on structural parameters, such as optic disc parameters measured by CSLO and RNFL measurements from SLP, was reported. However, owing to the popularity of the technologies, recent AI-based studies on structural glaucomatous damage have focused on fundus photography and OCT.

Fundus Photography
Fundus photography is a well-established and cost-effective imaging technique to identify features of the fundus region including the fovea, macula, optic disc (OD), and optic cup (OC). Glaucoma can be identified by optic nerve cupping. Thus the cup-to-disc ratio (CDR), which measures the vertical diameter of the optic cup relative to that of the disc, is one of the most important biomarkers for glaucoma diagnosis. A higher CDR value indicates a higher probability of glaucoma. Therefore, many AI-based studies have focused on automatic segmentation of the OD and OC using deep learning [16–21]. Segmentation-based methods, however, lack sufficiently discriminative representations and are easily affected by noise and low image quality. Moreover, predefined clinical parameters lack complex morphological information that might be useful in the diagnosis. Therefore, more recent methods simultaneously learn discriminative representations that optimize classification results directly from fundus images.

In 2015, Chen et al. [15] proposed a six-layer CNN to classify glaucoma and non-glaucoma eyes directly from fundus images from the publicly available ORIGA [59] and SCES [60] datasets. Experimental results showed AUCs of 0.831 and 0.887 on ORIGA and SCES, respectively. In a later work, Chen et al. [61] designed a novel CNN which embedded a multilayer perceptron to discriminate glaucoma and non-glaucoma patterns in fundus images.

In 2018, Fu et al. [22] proposed a Disc-aware Ensemble Network (DENet) which consisted of four streams to integrate the hierarchical context of global fundus images and local OD regions. The first stream used a Residual Network (ResNet) [62] to learn the global representation on the whole fundus image directly and produce a glaucoma classification probability. The second stream adapted the U-shape CNN (U-Net) [63], an efficient DL model for medical image segmentation, to produce the disc probability map and a glaucoma classification probability. The third stream took the cropped OD region as input and output a classification probability through a ResNet. The fourth stream applied a pixel-wise polar transformation to transfer the cropped original image to the polar coordinate system, in order to enlarge the cup region and augment the data; a ResNet was then trained to output a classification probability. The model was trained on the ORIGA dataset and yielded testing accuracies of 0.832 on SCES and 0.666 on SINDI.

Later, Li et al. [23] applied Inception-v3 [64] to a private dataset to detect referable glaucomatous optic neuropathy (GON) and achieved an AUC of 0.986 with a sensitivity of 0.956 and a specificity of 0.92. Results also showed that other eye conditions could greatly affect detection accuracy: high or pathological myopia contributed most to false-negative results, while physiologic cupping and pathological myopia were the most common reasons for false-positive results.

Though previous methods demonstrated the efficiency of DL in glaucoma diagnosis, DL methods suffer from overfitting due to
the relatively small datasets available and the large number of parameters to be trained. In 2018, Chakravarty et al. [24] presented a multi-task CNN that segmented the OD and OC on fundus images and jointly classified the image as glaucoma or non-glaucoma. The proposed method was evaluated on the REFUGE dataset, achieving an average Dice score (which measures the overlap between segmentations and ground truths) of 0.92 for OD segmentation, 0.84 for OC segmentation, and an AUC of 0.95 for classification. The cross-task design reduced the number of parameters and ensured good generalization of the model on a small dataset. In another work, Chai et al. [25] designed a multi-branch neural network (MB-NN) model to leverage domain knowledge, including important measures (e.g., CDR), for glaucoma diagnosis. The first branch extracted hidden features directly from the fundus image through a CNN. The second branch used Faster-RCNN [65], a deep learning framework for object detection, to obtain the optic disc region; another CNN was then used to extract local hidden features. The third branch used a fully convolutional network (FCN) [66] to segment the OD, OC, and peripapillary atrophy (PPA), and then calculated measures related to the disc, cup, and PPA. RNFL defects (roughly wedge-shaped regions starting from the OD) detected by another CNN, and non-image features (e.g. age, IOP, eyesight, and symptoms) from case reports, were also inputs to the third branch. The proposed framework was verified on a private dataset and achieved an accuracy of 0.915, a sensitivity of 0.9233, and a specificity of 0.909.

OCT
OCT, a non-invasive imaging technique that provides micrometer-resolution cross-sectional and volumetric images of the retina, has emerged as the de facto standard for objective quantification of structural damage in glaucoma. Similarly, early studies focused on comparisons of various classical ML classifiers using parameters measured by OCT [67–69]. Though classical ML classifiers classified glaucoma with satisfying accuracy, their limitation is the reliance on segmentation of the retinal layers, which uses handcrafted features and is prone to errors. Therefore, deeper and segmentation-free methods were desired to avoid this problem.

In 2017, Muhammad et al. [30] used a pretrained CNN model for feature extraction and a random forest model for classification. Though the proposed model is deep, the images fed into the model were still generated by conventional segmentation methods: (1) retinal ganglion cell plus inner plexiform layer (GCIPL) thickness map; (2) RNFL thickness map; (3) GCIPL probability map; (4) RNFL probability map; (5) en face projection. The results showed that the proposed method with the RNFL probability map as input outperformed conventional OCT and VF clinical metrics but fell short of an experienced human expert.

In 2018, Fu et al. [31] presented a Multi-Context Deep Network (MCDN) to classify angle-closure and open-angle glaucoma based on anterior segment optical coherence tomography (AS-OCT). The anterior chamber angle (ACA) region was first localized by a data-driven AS-OCT structure segmentation method [22] to compute clinical parameters (e.g., anterior chamber width, lens vault, chamber height, iris curvature, and anterior chamber area). A linear SVM was employed to predict an angle-closure probability based on these clinical parameters. The localized ACA region and the original scan were then fed into two parallel CNNs to gain local and global discriminative representations, respectively, and output an angle-closure probability. Finally, the probabilities from the clinical parameters and the CNNs were averaged to produce the final result. Experimental results showed that the proposed method is effective for angle-closure glaucoma screening. Detailed analysis of the three input streams showed that DL-based global discriminative features did not work as well as handcrafted visual features (AUC 0.894 vs. 0.924), while DL-based local discriminative features achieved on-par performance with handcrafted features (AUC 0.920 vs. 0.924).

In 2019, Maetschke et al. [32] proposed a 3D CNN trained directly on raw spectral domain optical coherence tomography (SD-OCT) volumes of the ONH to classify healthy and glaucomatous eyes. The class activation map (CAM)
analysis found that the neuroretinal rim, optic disc cupping, and the lamina cribrosa and its surrounding area were significantly associated with the classification results, which aligned with commonly used clinical markers for glaucoma diagnosis such as neuroretinal thinning at the superior and inferior segments and increased cup volume.

In the same year, RNFL probability maps, which are generated from swept-source optical coherence tomography (SS-OCT) to superimpose structural changes onto VF locations, were also used to train CNNs to discriminate between glaucomatous and healthy eyes [33]. CAM analysis suggested that anatomical variation in blood vessel or RNFL location caused ambiguity in false positives and false negatives. This discovery might be useful for future improvement of DL systems by supplying information about blood vessels.

Combining Structure and Function

Many studies have also developed effective ML classifiers combining structural and functional data. Global VF indices (mean defect, corrected loss variance, and short-term fluctuation) in combination with structural data (CDR, rim area, cup volume, and nerve fiber layer height) analyzed by an ANN were capable of correctly identifying glaucomatous eyes with an accuracy of 88% in an early study [70]. This figure was higher than that of the same ANN trained with only structural or only functional data. The growth of computational ability accommodated larger models and larger inputs. Bowd et al. [71] took complete VF maps and OCT RNFL thickness measurements of 32 sectors to train multiple ML classifiers. In a later study, Silva et al. [24] tested several classifiers, including bagging (BAG), naïve Bayes (NB), MLP, radial basis function (RBF), RF, ensemble selection (ENS), classification tree (CTREE), AdaBoost M1 (ADA), and SVM, using 17 RNFL thickness parameters (average thickness, 4 quadrants, and 12 clock-hour measurements) together with mean deviation (MD), pattern standard deviation (PSD), and the glaucoma hemifield test (GHT). RF achieved the best AUC of 0.946.

Generally speaking, DL models are capable of learning discriminative representations and identifying glaucoma patients. However, comparing those methods remains challenging because of the variety of training and testing datasets and validation methods. The possibility of extracting previously undiscovered knowledge, such as unknown glaucoma-related structures/features that are highly associated with glaucomatous damage, and glaucoma phenotyping, is the most exciting aspect of DL. Therefore, increasing the interpretability of DL to visualize learned knowledge will be critical to the future development of DL in glaucoma diagnosis.

Longitudinal Analysis

Accelerated retinal ganglion cell loss, together with functional damage, is a characteristic feature of glaucoma progression. Therefore, identifying progression and estimating the rate of loss, either structurally or functionally, are crucial to glaucoma management.

The current clinical gold standard for progression analysis is the Guided Progression Analysis (GPA) provided by commercial software developed by Carl Zeiss [72, 73]. The software allows clinicians to evaluate the patient's functional or structural loss over time compared to his or her own baseline, which is a composite of two initial examinations. Event-based and trend-based analysis are two approaches to determining whether progression exists. Event-based analysis evaluates changes from baseline against the expected variability, which is determined by the 95% confidence intervals of the magnitude of fluctuation in stable glaucoma patients from empirical datasets; progression is defined as a change exceeding the expected variability. Trend-based analysis estimates the rate of change over time using linear regression. While GPA is useful for defining and quantifying glaucoma progression, it does not forecast future progression, which could augment clinical decision making.

For VF forecasting, Caprioli et al. [74] projected individual VFs through an exponential
120 Z. Chen et al.
model, which characterized fast or slow progression rates in VF loss better than linear models. However, both the linear and exponential models assume a constant rate of VF loss, whereas rates usually decay over time [75]. To better depict glaucomatous damage, Chen et al. [76] compared pointwise linear, exponential, and logistic functions, as well as combinations of functions, and showed that a combination of exponential and logistic functions predicted future progression better. These methods treated test points as individual points and did not incorporate spatial correlations between VF test points at each time point. Several statistical methods have since been proposed to incorporate spatio-temporal correlations in VFs [77–80].

Application of DL to this kind of predictive medicine is particularly interesting for the management of glaucoma, since many factors contributing to the rate or severity of glaucoma progression remain unknown. But unlike the more definitive task of glaucoma diagnosis, there has been limited investigation into the potential of DL for predicting future findings. Park et al. [34] developed a recurrent neural network (RNN) to predict the sixth visual field test. The performance of the RNN was compared with that of pointwise linear regression. Results showed that VFs predicted by the RNN were more accurate than those predicted by linear regression (root mean square error (RMSE): 4.31 ± 2.54 dB vs. 4.96 ± 2.76 dB, p < 0.001), and the RNN was more robust (smaller and more slowly increasing RMSE as the false negative rate increases). However, the proposed method required a large number of VF tests over a long period of time, so many years of VF testing would be needed to accurately predict future VFs. To overcome this problem, Wen et al. [81] trained a deep learning model on the temporal history of a large group of patients to accurately predict future VFs up to 5.5 years ahead given only a single VF test, with a correlation of 0.92 between MD on predicted VFs and MD on actual future VFs.

For structural progression forecasting, Song et al. [82] proposed a 2D continuous-time hidden Markov model to predict average circumpapillary RNFL thickness and VFI. Sedai et al. [35] developed a ML regressor to forecast circumpapillary RNFL thickness at the next visit from multimodal data, including clinical (age and IOP), structural (circumpapillary RNFL thickness derived from OCT scans and DL-extracted OCT features), and functional (VF parameters) data from three prior visits, together with the inter-visit intervals. Chen et al. [36] also investigated predictive DL for structural loss: a time-aware long short-term memory network was designed to predict the fifth-visit GCIPL thickness map based on four prior maps, taking the uneven intervals between visits into account.

Structural-Functional Correlation

The relationship between structural loss and functional loss has been a controversial topic on which there is still no general consensus. Early work investigated classical ML models that map function from structure, such as LR [83], a Bayesian framework with a radial basis function [84], Bayesian LR [85], and logarithmic regression [86]. However, model performance has been limited and highly dependent on assumptions of a linear relationship or a Gaussian distribution of variability in VF measurements, which is not optimal given that this distribution is usually heavy-tailed. Given the success of DL in identifying and forecasting glaucoma, DL may help to improve the understanding of the structural-functional relationship in glaucoma. In addition, VF tests are subjective, time-consuming, and very noisy; thus, accurately estimating VF from OCT may help to reduce unnecessary VF testing in eyes that are estimated to be stable.

In 2017, Uesaka et al. [37] proposed two methods to estimate full-resolution 10-2 mode VF maps from retinal thickness (RT) data, including GCIPL, RNFL, and RCL thickness maps. The two proposed methods were affine structured non-negative matrix factorization (ASNMF) and a CNN. Results showed that ASNMF worked better for small data sizes while the CNN was powerful
9 AI and Glaucoma 121
for large data sizes. An average root mean squared error (RMSE) of 7.27 dB was achieved by ASNMF and 6.79 dB by the CNN.

Later, in 2018, Sugiura et al. [38] reduced the overfitting of CNNs by pattern-based regularization (PBR), which utilized characteristic patterns obtained from a large amount of non-paired VF-RT data. Characteristic VF patterns were extracted with an unsupervised learning method, and the model was then regularized by adding a regularization term to the loss function that penalizes the model if its estimate is far from the manifold formed by the extracted VF patterns. Moreover, the location-wise estimation at the last layer of the CNN was replaced by group-wise estimation to reduce network parameters: VF locations were first categorized into several groups by functional similarity, and an estimation model was then shared within each group. The model achieved an RMSE of 6.16 dB.

In 2019, Christopher et al. [39] applied ResNet50 to detect eyes with glaucomatous visual field damage (GVFD) and to predict VF MD, PSD, and mean VF sectoral PD from RNFL thickness maps, RNFL en face images, and CSLO images. Model parameters were initialized by transfer learning, training the model on ImageNet, a large image recognition dataset, and fine-tuning it on a private training dataset in order to reduce overfitting.

Previous work relied on segmentation-based features, which are prone to errors, especially with advanced glaucoma and other co-existing ocular pathologies. Segmentation-free DL methods have therefore also been explored. In 2019, Maetschke et al. [40] inferred VFI and MD directly from OCT volumes of the ONH or the macula to eliminate the need for layer segmentation. The proposed 3D CNN was compared with several classical ML methods using segmentation-based OCT features and proved to outperform them. In 2020, Christopher et al. [41] used U-Net to predict full-resolution 24-2 and 10-2 mode VF maps from unsegmented SD-OCT circle scans. The R2 of the predicted results ranged from 0.07 to 0.71 for 24-2 mode and from 0.01 to 0.85 for 10-2 mode.

Other AI Applications in Glaucoma

One application of AI is to discover new knowledge in glaucoma. Mendoza et al. [42] developed a DL method to predict age, sex, and race based on Spectralis OCT RNFL circle scans from healthy individuals, glaucoma suspects, and glaucoma patients. For predicting age, a MAE (95% CI) of 4.5 years (3.9, 5.2) and a strong association (R2 (95% CI) of 0.73) between predicted and actual age were achieved. The AUCs (95% CI) for predicting race and sex were 0.96 (0.86, 0.99) and 0.70 (0.57, 0.80), respectively. These results suggest that DL can learn demographic features, including age, race, and sex, that are not apparent to human observers, and imply that there is still undiscovered knowledge in retinal OCT scans.

Another application of AI is to enhance OCT scans. Halupka et al. [43] presented a CNN trained with either a mean squared error loss or a generative adversarial network (GAN) with Wasserstein distance and perceptual similarity to reduce speckle noise in OCT images from both healthy and glaucomatous eyes. The results demonstrated the effectiveness of CNNs at denoising OCT B-scans while preserving the structural features of retinal layers. Such denoising methods could be extremely useful in the analysis pipeline, helping to ensure the reliability of subsequent disease assessment.

Conclusion

In this chapter, we discussed the role of AI in glaucoma. Accurate automated diagnosis and prognosis of glaucoma may help clinicians increase efficiency, minimize diagnostic errors, and improve the overall quality of glaucoma treatment. With its ability to extract meaningful information from high-dimensional and complex multi-modal data, AI may help to discover new biomarkers, patterns, or knowledge that improve the current understanding of glaucoma, which could in turn promote research and development of new treatments.

There are still several challenges for clinical applications of AI in glaucoma. First, datasets
used in many studies are small and collected from homogeneous populations, while modern AI systems require very large training datasets and are often subject to numerous sources of variability. Tremendous effort would be required to collect a large and general dataset for glaucoma research. Second, the definition of glaucoma is not clear-cut: disagreements about disease phenotype definitions often arise between experienced ophthalmologists, so it is hard to obtain high-quality ground-truth labels. Third, despite many efforts to increase the interpretability of AI models, they are still regarded as "black boxes", which limits their clinical adoption; it is therefore crucial to develop more visualization tools for AI algorithms. Despite these challenges, AI will likely have a positive impact on research and clinical practice in glaucoma.

References

1. Tan O, Chopra V, Lu AT, et al. Detection of macular ganglion cell loss in glaucoma by Fourier-domain optical coherence tomography. Ophthalmology. 2009;116(12):2305–2314.e1–e2.
2. Quigley HA, Broman AT. The number of people with glaucoma worldwide in 2010 and 2020. Br J Ophthalmol. 2006;90(3):262–7.
3. Ramulu P. Glaucoma and disability: which tasks are affected, and at what stage of disease? Curr Opin Ophthalmol. 2009;20:92.
4. Hood DC, Tsamis E, Bommakanti NK, Joiner DB, Al-Aswad LA, Blumberg DM, et al. Structure-function agreement is better than commonly thought in eyes with early glaucoma. Invest Ophthalmol Vis Sci. 2019;60(13):4241–8.
5. Rao HL, Zangwill LM, Weinreb RN, Leite MT, Sample PA, Medeiros FA. Structure-function relationship in glaucoma using spectral-domain optical coherence tomography. Arch Ophthalmol. 2011;129(7):864–71.
6. Leite MT, Zangwill LM, Weinreb RN, Rao HL, Alencar LM, Medeiros FA. Structure-function relationships using the Cirrus spectral domain optical coherence tomograph and standard automated perimetry. J Glaucoma. 2012;21(1):49.
7. Wollstein G, Kagemann L, Bilonick RA, Ishikawa H, Folio LS, Gabriele ML, et al. Retinal nerve fibre layer and visual function loss in glaucoma: the tipping point. Br J Ophthalmol. 2012;96(1):47–52.
8. Malik R, Swanson WH, Garway-Heath DF. Structure–function relationship in glaucoma: past thinking and current concepts. Clin Exp Ophthalmol. 2012;40(4):369–80.
9. Harwerth RS, Wheat JL, Fredette MJ, Anderson DR. Linking structure and function in glaucoma. Prog Retin Eye Res. 2010;29(4):249–71.
10. Garg A, Hood DC, Pensec N, Liebmann JM, Blumberg DM. Macular damage, as determined by structure-function staging, is associated with worse vision-related quality of life in early glaucoma. Am J Ophthalmol. 2018;194:88–94.
11. Taylor P, Kalpathy-Cramer J. Machine learning has arrived! Aaron Lee, MD, MSCI-Seattle, Washington.
12. Rahimy E. Deep learning applications in ophthalmology. Curr Opin Ophthalmol. 2018;29(3):254–60.
13. Ting DSW, Pasquale LR, Peng L, Campbell JP, Lee AY, Raman R, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019;103(2):167–75.
14. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
15. Chen X, Xu Y, Wong DWK, Wong TY, Liu J. Glaucoma detection based on deep convolutional neural network. In: 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2015. p. 715–8.
16. Thakur N, Juneja M. Survey on segmentation and classification approaches of optic cup and optic disc for diagnosis of glaucoma. Biomed Signal Process Control. 2018;42:162–89.
17. Shankaranarayana M, Ram SM, Mitra K, Sivaprakasam K. Joint optic disc and cup segmentation using fully convolutional and adversarial networks. In: Fetal, infant and ophthalmic medical image analysis, vol. 10554. Cham: Springer; 2017. p. 168–76.
18. Zilly J, Buhmann JM, Mahapatra D. Glaucoma detection using entropy sampling and ensemble learning for automatic optic cup and disc segmentation. Comput Med Imaging Graphics. 2017;55:28–41.
19. Sevastopolsky A. Optic disc and cup segmentation methods for glaucoma detection with modification of U-Net convolutional neural network. Pattern Recognit Image Anal. 2017;27(3):618–24.
20. Fu H, Cheng J, Xu Y, Wong DWK, Liu J, Cao X. Joint optic disc and cup segmentation based on multi-label deep network and polar transformation. IEEE Trans Med Imaging. 2018:1–9.
21. Al-Bander B, Zheng Y. Dense fully convolutional segmentation of the optic disc and cup in colour fundus for glaucoma diagnosis. Symmetry. 2018;10(4):87.
22. Fu H, Cheng J, Xu Y, Zhang C, Wong DWK, Liu J, Cao X. Disc-aware ensemble network for glaucoma screening from fundus image. IEEE Trans Med Imaging. 2018;37(11):2493–501.
23. Zhixi L, He Y, Keel S, Meng W, Chang R, He M. Efficacy of a deep learning system for detecting
glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology. 2018;125(8):1199–206.
24. Chakravarty A, Sivswamy J. A deep learning based joint segmentation and classification framework for glaucoma assessment in retinal color fundus images. arXiv preprint arXiv:1808.01355.
25. Chai Y, Liu H, Xu J. Glaucoma diagnosis based on both hidden features and domain knowledge through deep learning models. Knowl-Based Syst. 2018;161:147–56.
26. Asaoka R, Murata H, Iwase A, Araie M. Detecting preperimetric glaucoma with standard automated perimetry using a deep learning classifier. Ophthalmology. 2016;123(9):1974–80.
27. Sample PA, Chan K, Boden C, Lee TW, Blumenthal EZ, Weinreb RN, et al. Using unsupervised learning with variational bayesian mixture of factor analysis to identify patterns of glaucomatous visual field defects. Invest Ophthalmol Vis Sci. 2004;45(8):2596–605.
28. Kucur ŞS, Holló G, Sznitman R. A deep learning approach to automatic detection of early glaucoma from visual fields. PLoS One. 2018;13(11):e0206081.
29. Li F, Wang Z, Qu G, Song D, Yuan Y, Xu Y, et al. Automatic differentiation of Glaucoma visual field from non-glaucoma visual filed using deep convolutional neural network. BMC Med Imaging. 2018;18(1):35.
30. Muhammad H, Fuchs T, De Cuir N, De Moraes C, Blumberg D, Liebmann J, Ritch R, Hood D. Hybrid deep learning on single wide-field optical coherence tomography scans accurately classifies glaucoma suspects. J Glaucoma. 2017;26(12):1086–94.
31. Fu H, Xu Y, Lin S, Wong D, Mani B, Mahesh M, Aung T, Liu J. Multi-context deep network for angle-closure glaucoma screening in anterior segment OCT. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2018. p. 356–63.
32. Maetschke S, Antony B, Ishikawa H, Wollstein G, Schuman J, Garnavi R. A feature agnostic approach for glaucoma detection in OCT volumes. PLoS One. 2019;14(7):e0219126.
33. Thakoor KA, Li X, Tsamis E, Sajda P, Hood DC. Enhancing the accuracy of glaucoma detection from OCT probability maps using convolutional neural networks. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2019. p. 2036–40.
34. Park K, Kim J, Lee J. Visual field prediction using recurrent neural network. Sci Rep. 2019;9(1):1–12.
35. Sedai S, Antony B, Ishikawa H, Wollstein G, Schuman JS, Garnavi R. Forecasting retinal nerve fiber layer thickness from multimodal temporal data incorporating OCT volumes. Ophthalmol Glaucoma. 2020;3(1):14–24.
36. Chen Z, Wang Y, Wollstein G, de los Angeles Ramos-Cadena M, Schuman J, Ishikawa H. Macular GCIPL thickness map prediction via time-aware convolutional LSTM. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE; 2020. p. 1–5.
37. Uesaka T, Morino K, Sugiura H, Kiwaki T, Murata H, Asaoka R, Yamanishi K. Multi-view learning over retinal thickness and visual sensitivity on glaucomatous eyes. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. p. 2041–50.
38. Sugiura H, Kiwaki T, Yousefi S, Murata H, Asaoka R, Yamanishi K. Estimating glaucomatous visual sensitivity from retinal thickness with pattern-based regularization and visualization. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018. p. 783–92.
39. Christopher M, Bowd C, Belghith A, Goldbaum MH, Weinreb RN, Fazio MA, et al. Deep learning approaches predict glaucomatous visual field damage from OCT optic nerve head en face images and retinal nerve fiber layer thickness maps. Ophthalmology. 2020;127(3):346–56.
40. Maetschke S, Antony B, Ishikawa H, Wollstein G, Schuman J, Garnavi R. Inference of visual field test performance from OCT volumes using deep learning. arXiv preprint arXiv:1908.01428. 2019.
41. Christopher M, Proudfoot JA, Bowd C, Belghith A, Goldbaum MH, Rezapour J, et al. Deep learning models based on unsegmented OCT RNFL circle scans provide accurate detection of glaucoma and high resolution prediction of visual field damage. Invest Ophthalmol Vis Sci. 2020;61(7):1439.
42. Mendoza L, Christopher M, Belghith A, Bowd C, Rezapour J, Fazio MA, et al. Deep learning models predict age, sex and race from OCT optic nerve head circle scans. Invest Ophthalmol Vis Sci. 2020;61(7):2012.
43. Halupka KJ, Antony BJ, Lee MH, Lucy KA, Rai RS, Ishikawa H, et al. Retinal optical coherence tomography image enhancement via deep learning. Biomed Opt Express. 2018;9(12):6205–21.
44. Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European Conference on Computer Vision. Cham: Springer; 2014. p. 818–33.
45. Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. 2013.
46. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 2921–9.
47. Bowd C, Chan K, Zangwill LM, et al. Comparing neural networks and linear discriminant functions for glaucoma detection using confocal scanning laser ophthalmoscopy of the optic disc. Invest Ophthalmol Vis Sci. 2002;43:3444–54.
48. Goldbaum MH, Sample PA, White H, Colt B, Raphaelian P, Fechtner RD, Weinreb RN. Interpretation of automated perimetry for glaucoma by neural network. Invest Ophthalmol Vis Sci. 1994;35(9):3362–73.
49. Chan K, Lee TW, Sample PA, Goldbaum MH, Weinreb RN, Sejnowski TJ. Comparison of machine learning and traditional classifiers in glaucoma diagnosis. IEEE Trans Biomed Eng. 2002;49(9):963–74.
50. Goldbaum MH, Sample PA, Chan K, Williams J, Lee TW, Blumenthal E, et al. Comparing machine learning classifiers for diagnosing glaucoma from standard automated perimetry. Invest Ophthalmol Vis Sci. 2002;43(1):162–9.
51. Aurenhammer F. Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR). 1991;23(3):345–405.
52. Townsend KA, Wollstein G, Danks D, et al. Heidelberg retina tomograph 3 machine learning classifiers for glaucoma detection. Br J Ophthalmol. 2008;92:814–8. https://doi.org/10.1136/bjo.2007.133074.
53. Zangwill LM, Chan K, Bowd C, et al. Heidelberg retina tomograph measurements of the optic disc and parapapillary retina for detecting glaucoma analyzed by machine learning classifiers. Invest Ophthalmol Vis Sci. 2004;45:3144–51. https://doi.org/10.1167/iovs.04-0202.
54. Uchida H, Brigatti L, Caprioli J. Detection of structural damage from glaucoma with confocal laser image analysis. Invest Ophthalmol Vis Sci. 1996;37:2393–401.
55. Adler W, Peters A, Lausen B. Comparison of classifiers applied to confocal scanning laser ophthalmoscopy data. Methods Inf Med. 2008;47:38–46. https://doi.org/10.3414/ME0348.
56. Bowd C, Zangwill LM, Medeiros FA, et al. Confocal scanning laser ophthalmoscopy classifiers and stereophotograph evaluation for prediction of visual field abnormalities in glaucoma-suspect eyes. Invest Ophthalmol Vis Sci. 2004;45:2255–62.
57. Weinreb RN, Zangwill L, Berry CC, et al. Detection of glaucoma with scanning laser polarimetry. Arch Ophthalmol. 1998;116:1583–9. https://doi.org/10.1001/archopht.116.12.1583.
58. Bowd C, Medeiros FA, Zhang Z, et al. Relevance vector machine and support vector machine classifier analysis of scanning laser polarimetry retinal nerve fiber layer measurements. Invest Ophthalmol Vis Sci. 2005;46:1322–9. https://doi.org/10.1167/iovs.04-1122.
59. Zhang Z, Yin FS, Liu J, Wong WK, Tan NM, Lee BH, et al. Origa-light: an online retinal fundus image database for glaucoma analysis and research. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology. IEEE; 2010. p. 3065–8.
60. Sng CC, Foo LL, Cheng CY, Allen JC, He M, Krishnaswamy G, Nongpiur ME, Friedman DS, Wong TY, Aung T. Determinants of anterior chamber depth: the Singapore Chinese Eye Study. Ophthalmology. 2012;119(6):1143–50.
61. Chen X, Xu Y, Yan S, Wong DWK, Wong TY, Liu J. Automatic feature learning for glaucoma detection based on deep learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer; 2015. p. 669–77.
62. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 770–8.
63. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer; 2015. p. 234–41.
64. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 2818–26.
65. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. 2015. p. 91–9.
66. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. p. 3431–40.
67. Bizios D, Heijl A, Hougaard JL, Bengtsson B. Machine learning classifiers for glaucoma diagnosis based on classification of retinal nerve fibre layer thickness parameters measured by Stratus OCT. Acta Ophthalmol. 2010;88(1):44–52.
68. Barella KA, Costa VP, Gonçalves Vidotti V, Silva FR, Dias M, Gomi ES. Glaucoma diagnostic accuracy of machine learning classifiers using retinal nerve fiber layer and optic nerve data from SD-OCT. J Ophthalmol. 2013.
69. Kotsiantis SB, Zaharakis I, Pintelas P. Supervised machine learning: a review of classification techniques. In: Maglogiannis I, et al., editors. Emerging Artificial Intelligence Applications in Computer Engineering. IOS Press; 2007. p. 3–24.
70. Brigatti L, Hoffman D, Caprioli J. Neural networks to identify glaucoma with structural and functional measurements. Am J Ophthalmol. 1996;121:511–21.
71. Bowd C, Hao J, Tavares IM, et al. Bayesian machine learning classifiers for combining structural and functional measurements to classify healthy and glaucomatous eyes. Invest Ophthalmol Vis Sci. 2008;49:945–53.
72. Leung CKS, Cheung CYL, Weinreb RN, Qiu K, Liu S, Li H, et al. Evaluation of retinal nerve fiber layer progression in glaucoma: a study on optical coherence tomography guided progression analysis. Invest Ophthalmol Vis Sci. 2010;51(1):217–22.
73. Na JH, Sung KR, Baek S, Lee JY, Kim S. Progression of retinal nerve fiber layer thinning in glaucoma assessed by cirrus optical coherence tomography-guided progression analysis. Curr Eye Res. 2013;38(3):386–95.
74. Caprioli J, Mock D, Bitrian E, Afifi AA, Yu F, Nouri-Mahdavi K, Coleman AL. A method to measure and
predict rates of regional visual field decay in glaucoma. Invest Ophthalmol Vis Sci. 2011;52(7):4765–73.
75. Otarola F, Chen A, Morales E, Yu F, Afifi A, Caprioli J. Course of glaucomatous visual field loss across the entire perimetric range. JAMA Ophthalmol. 2016;134(5):496–502.
76. Chen A, Nouri-Mahdavi K, Otarola FJ, Yu F, Afifi AA, Caprioli J. Models of glaucomatous visual field loss. Invest Ophthalmol Vis Sci. 2014;55(12):7881–7.
77. Warren JL, Mwanza JC, Tanna AP, Budenz DL. A statistical model to analyze clinician expert consensus on glaucoma progression using spatially correlated visual field data. Transl Vis Sci Technol. 2016;5(4):14.
78. Betz-Stablein BD, Morgan WH, House PH, Hazelton ML. Spatial modeling of visual field data for assessing glaucoma progression. Invest Ophthalmol Vis Sci. 2013;54(2):1544–53.
79. Anderson AJ. Comparison of three parametric models for glaucomatous visual field progression rate distributions. Transl Vis Sci Technol. 2015;4(4):2.
80. VanBuren J, Oleson JJ, Zamba GK, Wall M. Integrating independent spatio-temporal replications to assess population trends in disease spread. Stat Med. 2016;35(28):5210–21.
81. Wen JC, Lee CS, Keane PA, Xiao S, Rokem AS, Chen PP, et al. Forecasting future Humphrey visual fields using deep learning. PLoS One. 2019;14(4):e0214875.
82. Song Y, Ishikawa H, Wu M, Liu YY, Lucy KA, Lavinsky F, et al. Clinical prediction performance of glaucoma progression using a 2-dimensional continuous-time hidden markov model with structural and functional measurements. Ophthalmology. 2018;125(9):1354–61.
83. Hood DC, Kardon RH. A framework for comparing structural and functional measures of glaucomatous damage. Prog Retin Eye Res. 2007;26(6):688–710.
84. Zhu H, Crabb DP, Schlottmann PG, Lemij HG, Reus NJ, Healey PR, et al. Predicting visual function from the measurements of retinal nerve fiber layer structure. Invest Ophthalmol Vis Sci. 2010;51(11):5657–66.
85. Russell RA, Malik R, Chauhan BC, Crabb DP, Garway-Heath DF. Improved estimates of visual field progression using Bayesian linear regression to integrate structural information in patients with ocular hypertension. Invest Ophthalmol Vis Sci. 2012;53(6):2760–9.
86. Pollet-Villard F, Chiquet C, Romanet JP, Noel C, Aptel F. Structure-function relationships with spectral-domain optical coherence tomography retinal nerve fiber layer and optic nerve head measurements. Invest Ophthalmol Vis Sci. 2014;55(5):2953–62.
10
Artificial Intelligence in Retinopathy of Prematurity
Brittni A. Scruggs, J. Peter Campbell, and Michael F. Chiang
Fig. 10.1 ROP classification by zone, stage, and vessel changes. The top row shows mild ROP disease in zone I (upper panel) and in zone II (bottom panel). The montage photo shows the location of zones I and II. The asterisk indicates the location of the fovea. The bottom row depicts images with a label of normal vessels, pre-plus, or plus disease based on multiple (>3) expert consensus from ophthalmoscopy and image grading of vessel tortuosity and dilation. The far-right column depicts representative stages 1 through 4A. Black arrowheads highlight the faint demarcation line in an eye with stage 1 ROP. Note the temporal ridge in stage 2, the neovascularization present in stage 3, and the localized temporal retinal detachment in stage 4A. Stage 4B (sub-total retinal detachment with macula involvement) and stage 5 (total retinal detachment) are not shown
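The grading elements summarized in the caption (zone, stage, and plus-disease status), together with the ET-ROP type 1 criteria described under "Treatment" below, reduce to a simple decision rule. The sketch below is purely illustrative; the `RopExam` class and `is_type1_rop` function are our own names, not part of any published ROP software.

```python
from dataclasses import dataclass

# Illustrative encoding of a single ROP exam: zone (1-3), stage (0-5),
# and plus-disease status ("none", "pre-plus", or "plus").
@dataclass
class RopExam:
    zone: int
    stage: int
    plus: str

def is_type1_rop(exam: RopExam) -> bool:
    """ET-ROP type 1 (treatment-requiring) ROP, per the criteria in this
    chapter: 1) zone I stage 3 without plus disease; 2) any stage in
    zone I with plus disease; or 3) zone II stage 2 or 3 with plus disease."""
    has_plus = exam.plus == "plus"
    if exam.zone == 1 and exam.stage == 3:
        return True          # criterion 1 (holds with or without plus)
    if exam.zone == 1 and has_plus:
        return True          # criterion 2
    if exam.zone == 2 and exam.stage in (2, 3) and has_plus:
        return True          # criterion 3
    return False
```

Note that zone I stage 3 qualifies regardless of plus status, because criteria 1 and 2 together cover both cases.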
marked plus disease out of proportion to the peripheral retinal pathology; these eyes often manifest flat neovascularization that can be difficult to appreciate.

Treatment

'Threshold ROP' is the term for ROP requiring treatment as defined by the CRYO-ROP study: five or more contiguous, or eight total, clock-hours of stage 3 ROP in zone I or II in the presence of plus disease [12]. The Early Treatment for ROP (ET-ROP) trial further classified ROP into type 1 and type 2 pre-threshold disease to guide the treatment of infants with early laser before the development of threshold ROP [4]. Type 1 ROP remains the currently accepted treatment cutoff for ROP and is defined as 1) zone I stage 3 without plus disease; 2) any stage in zone I with plus disease; or 3) zone II stage 2 or 3 with plus disease. Despite these well-defined guidelines, some experts are overly aggressive in their treatment plans whereas others are more conservative.

Peripheral laser ablation permanently destroys the peripheral retina that is driving VEGF production. Although a mainstay treatment, laser is associated with strabismus and high myopia [29], and incomplete treatment, such as skip lesions, can lead to ROP progression and retinal detachment despite treatment. The Bevacizumab Eliminates the Angiogenic Threat of Retinopathy of Prematurity (BEAT-ROP) and Ranibizumab versus laser therapy for the treatment of very low birthweight infants with retinopathy of prematurity (RAINBOW) trials demonstrated the utility of using VEGF inhibitors (intravitreal bevacizumab or ranibizumab, respectively) instead of laser in certain cases, such as zone I stage 3 ROP with plus disease [30, 31]. Despite encouraging results showing fewer unfavorable ocular outcomes than laser therapy, intravitreal therapy for
10 Artificial Intelligence in Retinopathy of Prematurity 131
ROP introduces new challenges, including the possibility of end-organ effects from systemic exposure to anti-VEGF medications and the need for increased monitoring, potentially for years, to assess for reactivation. It is the authors' hope that automated computer systems may soon help clinicians decide which treatment is warranted for an individual infant and assess the risk of post-treatment progression.

Limitations in ROP Diagnosis and Management

Assessment of ROP diagnosis and severity depends on the subjective evaluation of zone, stage, and plus disease, and it is well established that there is wide inter-observer variability for all three components despite good relative agreement on ROP disease severity [32, 33]. Real-world images with high inter-observer variability among ROP experts (stage 1 vs. 2; pre-plus vs. plus disease) are provided in Fig. 10.2. Identifying plus disease remains the most critical finding for diagnosing threshold disease. However, agreement on plus disease between experts is imperfect due to systematic biases and differences in diagnostic thresholds along a continuum [32]. Gelman et al. found that when diagnosing plus disease, 22 experts had sensitivities and specificities ranging from 0.31 to 1.00 and 0.57 to 1.00, respectively, with agreement in only 21% of the images.

Ghergherehchi et al. proposed that the variability in plus disease diagnosis is partly due to attention to undefined vascular features [28]. For
Vessels
Stage
Fig. 10.2 Real world images with high inter-observer pre-plus or plus disease with no consensus by ROP
variability among ROP experts. The top row shows two experts. Similarly, the bottom row shows two telemedi-
fundus photos depicting arteriolar tortuosity and venous cine photos that some ROP experts documented stage 1
dilation in the posterior pole. These are photos from a tele- ROP, whereas others documented stage 2 ROP
medicine screening program that were graded as either
132 B. A. Scruggs et al.
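The Gelman et al. sensitivity and specificity ranges quoted above, and the 21% full-agreement figure, are straightforward to compute from grader labels. A minimal sketch in Python (the label arrays below are illustrative, not data from the study):

```python
# Sensitivity, specificity, and simple percent agreement for binary
# plus-disease labels (1 = plus, 0 = not plus). Labels are illustrative,
# not data from the studies cited in the text.

def sensitivity(truth, pred):
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def specificity(truth, pred):
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    return tn / (tn + fp)

def percent_agreement(grades):
    """Fraction of images on which every grader gave the same label.

    `grades` is a list of per-grader label lists, one label per image.
    """
    n_images = len(grades[0])
    unanimous = sum(1 for i in range(n_images)
                    if len({g[i] for g in grades}) == 1)
    return unanimous / n_images

reference = [1, 1, 0, 0, 1, 0, 0, 1]   # consensus reference standard
grader_a  = [1, 1, 0, 1, 1, 0, 0, 1]   # one expert's reads
grader_b  = [1, 0, 0, 0, 1, 0, 1, 1]

print(sensitivity(reference, grader_a))   # 1.0
print(specificity(reference, grader_a))   # 0.75
print(percent_agreement([grader_a, grader_b]))  # 0.625
```

With many graders and a continuum of disease, high sensitivity can coexist with low unanimous agreement, which is exactly the pattern the studies above describe.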
Fig. 10.3 Effect of field of view and peripheral vessel appearance on diagnostic interpretation. (a)–(c) Three montage images documenting vascular changes that were more evident in the periphery than in the posterior pole. Despite expert consensus that all three eyes had pre-plus, not plus, disease, the peripheral vessel tortuosity and dilation correlated with significant peripheral pathology that required ROP treatment with laser in all cases
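The tortuosity and dilation highlighted in the figure captions are the raw material for the semi-automated systems discussed later in this chapter; one common hand-crafted summary is the arc-to-chord tortuosity index of a vessel centerline. A minimal sketch (the centerline coordinates are illustrative, not measurements from any study):

```python
# Arc-to-chord tortuosity index for a vessel centerline, the kind of
# hand-crafted feature used by semi-automated ROP tools: a straight
# vessel scores 1.0 and tortuous vessels score higher. The centerline
# points below are illustrative.
import math

def tortuosity_index(points):
    # Arc length: sum of segment lengths along the traced centerline.
    arc = sum(math.dist(points[i], points[i + 1])
              for i in range(len(points) - 1))
    # Chord length: straight-line distance between the endpoints.
    chord = math.dist(points[0], points[-1])
    return arc / chord

straight = [(0, 0), (1, 0), (2, 0), (3, 0)]
wiggly   = [(0, 0), (1, 1), (2, -1), (3, 0)]

print(tortuosity_index(straight))  # 1.0
print(tortuosity_index(wiggly))    # > 1.0
```

Real systems combine such indices with vessel width (dilation) measurements before classification, but the index above captures the core geometric idea.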
For example, vessel tortuosity can be quite striking peripherally despite normal-appearing vessels posteriorly (Fig. 10.3). This is important because the standard photographs for ROP diagnosis appear narrower than the field of view (FOV) obtained with bedside examination and/or telemedicine photos. A wide FOV allows different examiners to focus on different parts of the retina than originally described in ICROP, and inter-expert agreement is higher in plus disease diagnosis using wide-angle images [34]. Kim et al. found lower accuracy when clinicians diagnosed plus disease one quadrant at a time, suggesting that clinicians subconsciously evaluate the whole eye even when they intend to carefully evaluate plus disease by quadrant [35]. Figure 10.3 demonstrates the effect of FOV on severity appearance.

Most examiners do not routinely perform photography at the time of examination, and limited objective data may contribute to the significant inter-expert variability across different regions [28, 36, 37]. Increased use of photography offers serial comparisons for monitoring ROP disease and for improving ROP training. The lack of objective diagnosis of ROP and the high rates of inter-observer variability have been hindrances to the development of AI technology in ROP and have contributed to the high medicolegal risk of ROP screening. Such limitations provided motivation for the development of numerous computer-based systems for ROP, including the i-ROP deep learning (DL) system, which may be a way of standardizing disease severity and will be discussed later in this chapter.

Early AI Systems for ROP Diagnosis

The first computer-based systems for ROP diagnosis utilized manual tracings of dilation and tortuosity to produce an objective metric of severity [38]. Such semi-automated ROP diagnostic systems include ROPtool [39], Retinal Image multiScale Analysis (RISA) [40], and Computer-Assisted Image Analysis of the Retina (CAIAR) [41], among others; these systems were reviewed by Wittenberg et al. in 2012 [38]. As feature-extraction-based systems, they all utilized manual or semi-automated methods to quantify dilation and/or tortuosity for correlation with the clinical diagnosis of ROP. In contrast to newer machine learning (ML) and
DL systems, there was no automated image analysis performed by the computer; instead, feature combinations and diagnostic cut-points were determined manually, with clinicians labeling or selecting findings within the images. Comparisons of expert performance to the RISA system demonstrated high diagnostic accuracy for plus disease using the computer-based analysis [40, 42, 43]. However, these systems cannot process large numbers of images and do not correlate well enough with ROP diagnosis to be widely utilized [44].

Automated Detection of Plus Disease

Machine learning utilizes a classifier, such as a support vector machine (SVM), that learns the best relationship between image features and the diagnosis [45]. One approach to more explainable AI is to combine DL and ML methods with traditional feature extraction, and several groups have attempted this for plus disease [46, 47]. Mao et al. trained a DL network to segment retinal vessels and the optic disc and to diagnose plus disease based on automated quantitative characterization of pathological features, such as vessel tortuosity, width, fractal dimension, and density [46].

In 2015, an ML model with a trained SVM was developed to determine the combination of features and FOV that best correlated with expert plus disease diagnosis [45]. This automated system diagnosed plus disease as well as experts when incorporating vascular tortuosity from both arteries and veins with the widest FOV [45]. The accuracy was significantly lower using a FOV comparable to that of the standard ICROP photograph; this suggested that experts consider vascular information from a large area of retina when diagnosing plus disease. The montage images in Fig. 10.3 show examples of peripheral vascular pathology that may influence diagnostic interpretation. Despite expert-level performance, this system was limited in clinical utility as it required manual tracing and segmentation of the vessels as an input [45].

Convolutional neural networks (CNNs) incorporate image classification algorithms that differ from traditional feature extraction and ML systems. Using a large database of input images, a CNN uses learnable weights and biases to give importance to the image features (e.g., tortuosity of arterioles, dilation of venules) that best correlate the input image with the diagnosis. The CNN learns these features with or without preprocessing but without explicit human input [46, 48, 49]. The CNN's fully connected 'output' layer classifies the image (e.g., absence or presence of plus disease) with better performance than feature-extraction-based ML approaches.

Worrall et al. reported the first fully automated plus disease diagnosis using a CNN; this study used a real-world dataset that included input image discrepancies across experts [49]. This system's image recognition classifier performed as well as some of the human experts (92% accuracy) [49].

Brown et al. reported the results of a fully automated DL-based system for three-level diagnosis of plus disease [48]. This deep CNN, called the i-ROP DL system, was trained and validated on more than 5000 images with a single reference standard diagnosis (RSD) based on the consensus diagnosis of three independent image graders and the clinical diagnosis. The area under the curve (AUC) for plus disease diagnosis was excellent (0.98). On an independent dataset of 100 images (i.e., not included in the training set), the i-ROP DL system had higher diagnostic agreement with the RSD than seven out of eight of the experts. For diagnosis of plus disease, the sensitivity and specificity of the algorithm were 93% and 94%, respectively. These values increased to 100% and 94%, respectively, when including pre-plus disease or worse [48].

Continuous Scoring for Plus Disease

Vascular disease in ROP presents on a continuum, which likely explains why there is poor absolute agreement on plus disease classification between experts [32]. This finding motivated the development of a quantitative severity scale using
the i-ROP DL system. Redd et al. reported that a scale from 1 to 9 could accurately detect type 1 ROP, with an AUC of 0.95 [50]. Taylor et al. implemented the i-ROP DL algorithm to assign a continuous ROP vascular severity score (1–9) and to classify images based on severity: no ROP, mild ROP, type 2 ROP and pre-plus disease, or type 1 ROP [51]. The continuous ROP vascular score was associated with the ICROP category of disease at a single point in time and with the clinical progression of ROP over time [51].

Using the i-ROP dataset, Gupta et al. showed that these continuous scores reflected post-treatment regression in eyes with treatment-requiring ROP [52]. Additionally, eyes requiring multiple treatment sessions (laser or intravitreal injection of bevacizumab) had higher pre-treatment ROP vascular severity scores compared with eyes requiring only a single treatment, suggesting that treatment failure may be related to more aggressive disease or disease treated at a later stage [52]. A recent study by Yildiz et al. and the i-ROP Consortium described i-ROP ASSIST, a fully automated system with CNN-like performance for diagnosing plus vs. not-plus disease (0.94 AUC) [53]. Inspired by the algorithms of Ataer-Cansizoglu et al. [45, 54], this system uses hand-crafted features with a combined neural network for automatic vessel segmentation, tracing, feature extraction, and classification; it is publicly available for generation of a vessel severity score (0–100) from an input image [53].

Improvement in the feature extraction process will allow clinicians to achieve better performance without sacrificing explainability [53]. Ultimately, using a similar automated quantitative severity scale for ROP diagnosis may help optimize treatment regimens by better predicting the preterm infants at risk for treatment failure and disease recurrence [52]. Future clinical trials may use a quantitative scale to help evaluate treatment thresholds.

…to classify zone or stage [55, 56]. For example, DeepROP is a different automated ROP detection system that was developed using deep neural networks (DNNs) [57]. An identification DNN model (Id-Net) and a grading DNN model (Gr-Net) directly learned ROP features from big datasets, which were comprised of retinal photographs labeled by ROP experts. Both the identification and the grading DNNs performed better than some of the human experts; impressively, the Id-Net achieved a sensitivity of 96.62% (95% CI, 92.29–98.89%) and a specificity of 99.32% (95% CI, 99.98%) for ROP identification [57].

Similarly, Hu et al. developed a deep CNN with a novel architecture to determine the presence and severity of ROP; a sub-network designed to extract high-level features from images was connected to a second sub-network that predicted ROP severity (mild vs. severe) [58]. Using a feature aggregate operator, this system was found to have a high classification accuracy [58].

Zhao et al. reported the development of a DL system that can automatically draw the border of zone I on a fundus image as a diagnostic aid [56]. Mulay et al. first reported the identification of a peripheral ROP ridge directly in a fundus image [55]. A CNN was trained by Coyner et al. in 2018 to automatically assess the quality of retinal fundus images [59, 60]; this would serve well as a prescreening method for telemedicine and computer-based image analysis in ROP. Thus, DL seems to hold promise for automated and objective diagnosis of ROP in digital fundus images. However, none of these systems is yet available for clinical use, and further research is needed. A recent review by Scruggs et al. offers recommendations for future AI research applied to ROP [61], including using optical coherence tomography (OCT) and OCT angiography (OCT-A) to identify structural signs (e.g., vitreoretinal traction) preceding disease progression [62, 63].
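The deep learning systems above share a common output shape: a softmax over diagnosis classes, which can either be thresholded for a categorical call or collapsed into a continuous severity score. A minimal sketch of the latter (the 1/5/9 anchor values and the logits are illustrative assumptions, not the published i-ROP formula):

```python
# Collapsing a three-class (normal / pre-plus / plus) softmax output
# into a continuous 1-9 severity score, in the spirit of the i-ROP DL
# vascular severity scale. The anchor values (1, 5, 9) and the logits
# below are illustrative assumptions, not real model output.
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def severity_score(probs, anchors=(1.0, 5.0, 9.0)):
    """Expected severity under the class probabilities."""
    return sum(p * a for p, a in zip(probs, anchors))

logits = [0.2, 1.1, 2.4]   # hypothetical network outputs
probs = softmax(logits)    # [P(normal), P(pre-plus), P(plus)]
score = severity_score(probs)
print(round(score, 2))     # between 1 and 9, weighted toward plus
```

Because the score is an expectation rather than an argmax, small shifts in the class probabilities move it smoothly, which is what makes it usable for monitoring regression and progression over time.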
Table 10.1 Main challenges of AI implementation for ROP diagnosis in clinical practice

Generalizability
Main challenges:
• CNNs often do not generalize well to unseen data
• Qualitatively different populations and phenotypes being studied, such as in LMIC
• Differences in the ways the images were acquired
• Technical differences between camera systems
• Resolution and quality of input images or labels
Potential solutions:
• Validation of AI system performance on the target population prior to clinical use, using images of varying quality and fields of view
• Datasets tested in different populations
• Open-access datasets and software
• Automated DL-enhanced algorithms integrated into commonly used cameras (e.g., RetCam) or into cloud-based systems

Explainability
Main challenges:
• Inability to explain how the algorithm arrived at a conclusion
• "Black box" nature of clinical diagnosis, in general [65]
• Difficult to develop methodology for understanding the high-level features that CNNs use for discrimination
Potential solutions:
• Combination of deep learning methods with traditional feature extraction [46, 47, 53]
• Correlation of disease-specific features with the CNN diagnostic outcome [47]
• Rigorous clinical validation demonstrating improvement in outcomes despite lack of complete transparency
• Use of activation maps to highlight the areas of an image that contributed to classification

Regulatory and medicolegal issues
Main challenges:
• ROP care carries the highest medicolegal risk within ophthalmology
• Need to adjudicate liability from care decisions informed by AI [66]
• Regulatory requirements will continue to evolve
Potential solutions:
• Precise indication for use and evidence of effectiveness in a real-world population
• Innovation of evaluation methods by the Food and Drug Administration (FDA) to ensure safe implementation
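Table 10.1 mentions activation maps as one answer to the explainability challenge. A model-agnostic cousin of that idea is occlusion sensitivity: mask one patch of the image at a time and record how much the predicted disease probability drops. The sketch below uses a toy brightness-based `predict` as a stand-in for a trained CNN (an assumption for illustration only):

```python
# Occlusion sensitivity: a simple, model-agnostic way to approximate
# the "activation map" idea in Table 10.1. Mask one patch at a time
# and record how much the model's output drops. The toy `predict`
# below (mean intensity of the central region) stands in for a real
# CNN; with a trained network you would call the model instead.

def predict(image):
    """Toy stand-in for a CNN's disease probability."""
    h, w = len(image), len(image[0])
    centre = [image[r][c] for r in range(h // 4, 3 * h // 4)
                          for c in range(w // 4, 3 * w // 4)]
    return sum(centre) / len(centre)

def occlusion_map(image, patch=2, fill=0.0):
    base = predict(image)
    h, w = len(image), len(image[0])
    heat = [[0.0] * w for _ in range(h)]
    for r0 in range(0, h, patch):
        for c0 in range(0, w, patch):
            # Copy the image and grey out one patch.
            masked = [row[:] for row in image]
            for r in range(r0, min(r0 + patch, h)):
                for c in range(c0, min(c0 + patch, w)):
                    masked[r][c] = fill
            drop = base - predict(masked)
            # Record the output drop over the occluded patch.
            for r in range(r0, min(r0 + patch, h)):
                for c in range(c0, min(c0 + patch, w)):
                    heat[r][c] = drop
    return heat

# An 8x8 "image" whose centre is bright: occluding central patches
# should produce the largest drops in the heat map.
img = [[1.0 if 2 <= r <= 5 and 2 <= c <= 5 else 0.0 for c in range(8)]
       for r in range(8)]
heat = occlusion_map(img)
print(heat[3][3] > heat[0][0])  # True: the centre matters most
```

The resulting heat map can be overlaid on the fundus image so a clinician can check whether the regions driving the classification are plausible (e.g., posterior-pole vessels rather than artifacts).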
…and clinically useful implementation of technology remains wide. The main potential challenges hindering the deployment of DL systems include ensuring generalizability and explainability, and overcoming regulatory and medicolegal issues [64]. Table 10.1 outlines these challenges as they apply to AI for ROP diagnosis.

AI for ROP Training

If ROP experts often do not agree on how to diagnose ROP or on the diagnosis of individual babies, it is not surprising that ROP trainees find the task of ROP diagnosis challenging as well. It is well established that ophthalmology graduates complete residency, as well as ophthalmology fellowship programs, without confidence in their ability to diagnose ROP [67–69]. Fewer than a third of learners perform ROP screenings under direct supervision during ophthalmology training [70]. Chan et al. demonstrated that there was significant variability in diagnostic accuracy among retinal fellows when analyzing ROP images compared to RSDs [71]. Both Chan et al. and Myung et al. demonstrated the inconsistent accuracy of detecting type 2 ROP and treatment-requiring ROP by fellows [71, 72].

These studies raise serious concerns about ROP screening performed by inexperienced examiners, and there are no accepted criteria for the minimum necessary supervision, examinations, treatments, etc. for clinical competency in ROP care. Improved global education in ROP is necessary to ensure treatments are performed adequately. The development of AI systems for automated diagnosis in ROP may facilitate the incorporation of these algorithms within medical training to standardize ROP education and certification [69].
…report by the American Academy of Ophthalmology. Ophthalmology. 2012;119(6):1272–80.
19. Ells AL, Holmes JM, Astle WF, et al. Telemedicine approach to screening for severe retinopathy of prematurity: a pilot study. Ophthalmology. 2003;110(11):2113–7.
20. Fijalkowski N, Zheng LL, Henderson MT, et al. Stanford University Network for Diagnosis of Retinopathy of Prematurity (SUNDROP): five years of screening with telemedicine. Ophthalmic Surg Lasers Imaging Retina. 2014;45(2):106–13.
21. Quinn GE, Ells A, Capone A, et al. Analysis of discrepancy between diagnostic clinical examination findings and corresponding evaluation of digital images in the telemedicine approaches to evaluating acute-phase retinopathy of prematurity study. JAMA Ophthalmol. 2016;134(11):1263–70.
22. Ying GS, Pan W, Quinn GE, Daniel E, Repka MX, Baumritter A. Intereye agreement of retinopathy of prematurity from image evaluation in the telemedicine approaches to evaluating acute-phase ROP (e-ROP) study. Ophthalmol Retina. 2017;1(4):347–54.
23. Schwartz SD, Harrison SA, Ferrone PJ, Trese MT. Telemedical evaluation and management of retinopathy of prematurity using a fiberoptic digital fundus camera. Ophthalmology. 2000;107(1):25–8.
24. Chee RI, Darwish D, Fernandez-Vega A, et al. Retinal telemedicine. Curr Ophthalmol Rep. 2018;6(1):36–45.
25. International Committee for the Classification of Retinopathy of Prematurity. The international classification of retinopathy of prematurity revisited. Arch Ophthalmol. 2005;123(7):991–9.
26. The International Committee for the Classification of the Late Stages of Retinopathy of Prematurity. An international classification of retinopathy of prematurity. II. The classification of retinal detachment. Arch Ophthalmol. 1987;105(7):906–12.
27. The Committee for the Classification of Retinopathy of Prematurity. An international classification of retinopathy of prematurity. Arch Ophthalmol. 1984;102(8):1130–4.
28. Ghergherehchi L, Kim SJ, Campbell JP, Ostmo S, Chan RVP, Chiang MF. Plus disease in retinopathy of prematurity: more than meets the ICROP? Asia Pac J Ophthalmol (Phila). 2018;7(3):152–5.
29. Geloneck MM, Chuang AZ, Clark WL, et al. Refractive outcomes following bevacizumab monotherapy compared with conventional laser treatment: a randomized clinical trial. JAMA Ophthalmol. 2014;132(11):1327–33.
30. Mintz-Hittner HA, Kennedy KA, Chuang AZ; BEAT-ROP Cooperative Group. Efficacy of intravitreal bevacizumab for stage 3+ retinopathy of prematurity. N Engl J Med. 2011;364(7):603–15.
31. Stahl A, Lepore D, Fielder A, et al. Ranibizumab versus laser therapy for the treatment of very low birthweight infants with retinopathy of prematurity (RAINBOW): an open-label randomised controlled trial. Lancet. 2019;394(10208):1551–9.
32. Kalpathy-Cramer J, Campbell JP, Erdogmus D, et al. Plus disease in retinopathy of prematurity: improving diagnosis by ranking disease severity and using quantitative image analysis. Ophthalmology. 2016;123(11):2345–51.
33. Campbell JP, Ataer-Cansizoglu E, Bolon-Canedo V, et al. Expert diagnosis of plus disease in retinopathy of prematurity from computer-based image analysis. JAMA Ophthalmol. 2016;134(6):651–7.
34. Rao R, Jonsson NJ, Ventura C, et al. Plus disease in retinopathy of prematurity: diagnostic impact of field of view. Retina. 2012;32(6):1148–55.
35. Kim SJ, Campbell JP, Kalpathy-Cramer J, et al. Accuracy and reliability of eye-based vs quadrant-based diagnosis of plus disease in retinopathy of prematurity. JAMA Ophthalmol. 2018;136(6):648–55.
36. Reynolds JD, Dobson V, Quinn GE, et al. Evidence-based screening criteria for retinopathy of prematurity: natural history data from the CRYO-ROP and LIGHT-ROP studies. Arch Ophthalmol. 2002;120(11):1470–6.
37. Fleck BW, Williams C, Juszczak E, et al. An international comparison of retinopathy of prematurity grading performance within the Benefits of Oxygen Saturation Targeting II trials. Eye (Lond). 2018;32(1):74–80.
38. Wittenberg LA, Jonsson NJ, Chan RV, Chiang MF. Computer-based image analysis for plus disease diagnosis in retinopathy of prematurity. J Pediatr Ophthalmol Strabismus. 2012;49(1):11–9; quiz 10, 20.
39. Wallace DK, Zhao Z, Freedman SF. A pilot study using "ROPtool" to quantify plus disease in retinopathy of prematurity. J AAPOS. 2007;11(4):381–7.
40. Gelman R, Jiang L, Du YE, Martinez-Perez ME, Flynn JT, Chiang MF. Plus disease in retinopathy of prematurity: pilot study of computer-based and expert diagnosis. J AAPOS. 2007;11(6):532–40.
41. Shah DN, Wilson CM, Ying GS, et al. Comparison of expert graders to computer-assisted image analysis of the retina in retinopathy of prematurity. Br J Ophthalmol. 2011;95(10):1442–5.
42. Chiang MF, Gelman R, Jiang L, Martinez-Perez ME, Du YE, Flynn JT. Plus disease in retinopathy of prematurity: an analysis of diagnostic performance. Trans Am Ophthalmol Soc. 2007;105:73–84; discussion 84–5.
43. Koreen S, Gelman R, Martinez-Perez ME, et al. Evaluation of a computer-based system for plus disease diagnosis in retinopathy of prematurity. Ophthalmology. 2007;114(12):e59–67.
44. Wilson CM, Wong K, Ng J, Cocker KD, Ells AL, Fielder AR. Digital image analysis in retinopathy of prematurity: a comparison of vessel selection methods. J AAPOS. 2012;16(3):223–8.
45. Ataer-Cansizoglu E, Bolon-Canedo V, Campbell JP, et al. Computer-based image analysis for plus disease diagnosis in retinopathy of prematurity: performance of the "i-ROP" system and image features associated with expert diagnosis. Transl Vis Sci Technol. 2015;4(6):5.
46. Mao J, Luo Y, Liu L, et al. Automated diagnosis and quantitative analysis of plus disease in retinopathy of prematurity based on deep convolutional neural networks. Acta Ophthalmol. 2019.
47. Graziani M, Brown JM, Andrearczyk V, et al. Improved interpretability for computer-aided severity assessment of retinopathy of prematurity. SPIE Medical Imaging. San Diego, CA; 2019.
48. Brown JM, Campbell JP, Beers A, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):803–10.
49. Worrall DE, Wilson CM, Brostow GJ. Automated retinopathy of prematurity case detection with convolutional neural networks. Deep learning and data labeling for medical applications. Athens; 2016.
50. Redd TK, Campbell JP, Brown JM, et al. Evaluation of a deep learning image assessment system for detecting severe retinopathy of prematurity. Br J Ophthalmol. 2018.
51. Taylor S, Brown JM, Gupta K, et al. Monitoring disease progression with a quantitative severity scale for retinopathy of prematurity using deep learning. JAMA Ophthalmol. 2019.
52. Gupta K, Campbell JP, Taylor S, et al. A quantitative severity scale for retinopathy of prematurity using deep learning to monitor disease regression after treatment. JAMA Ophthalmol. 2019.
53. Yildiz VM, Tian P, Yildiz I, et al. Plus disease in retinopathy of prematurity: convolutional neural network performance using a combined neural network and feature extraction approach. 2020;9(2).
54. Ataer-Cansizoglu E, You S, Kalpathy-Cramer J, Keck K, Chiang MF, Erdogmus D. Observer and feature analysis on diagnosis of retinopathy of prematurity. IEEE Int Workshop Mach Learn Signal Process. 2012:1–6.
55. Mulay S, Ram K, Sivaprakasam M, Vinekar A. Early detection of retinopathy of prematurity stage using deep learning approach. Paper presented at: SPIE Medical Imaging, 2019, San Diego, CA.
56. Zhao J, Lei B, Wu Z, et al. A deep learning framework for identifying zone I in RetCam images. Vol 7. IEEE Access; 2019. p. 103530–7.
57. Wang J, Ju R, Chen Y, et al. Automated retinopathy of prematurity screening using deep neural networks. EBioMedicine. 2018;35:361–8.
58. Hu J, Chen Y, Zhong J, Ju R, Yi Z. Automated analysis for retinopathy of prematurity by deep neural networks. IEEE Trans Med Imaging. 2019;38(1):269–79.
59. Coyner AS, Swan R, Brown JM, et al. Deep learning for image quality assessment of fundus images in retinopathy of prematurity. AMIA Annu Symp Proc. 2018;2018:1224–32.
60. Coyner AS, Swan R, Campbell JP, et al. Automated fundus image quality assessment in retinopathy of prematurity using deep convolutional neural networks. Ophthalmol Retina. 2019;3(5):444–50.
61. Scruggs BA, Chan RVP, Kalpathy-Cramer J, Chiang MF, Campbell JP. Artificial intelligence in retinopathy of prematurity diagnosis. Transl Vis Sci Technol. 2020;9(2).
62. Campbell JP. Why do we still rely on ophthalmoscopy to diagnose retinopathy of prematurity? JAMA Ophthalmol. 2018;136(7):759–60.
63. De Fauw J, Ledsam JR, Romera-Paredes B, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–50.
64. Ting DSW, Peng L, Varadarajan AV, et al. Deep learning in ophthalmology: the technical and clinical considerations. Prog Retin Eye Res. 2019.
65. Reid JE, Eaton E. Artificial intelligence for pediatric ophthalmology. Curr Opin Ophthalmol. 2019;30(5):337–46.
66. Shah NH, Milstein A, Bagley SC. Making machine learning models clinically useful. JAMA. 2019.
67. Patel SN, Martinez-Castellanos MA, Berrones-Medina D, et al. Assessment of a tele-education system to enhance retinopathy of prematurity training by international ophthalmologists-in-training in Mexico. Ophthalmology. 2017;124(7):953–61.
68. Campbell JP, Swan R, Jonas K, et al. Implementation and evaluation of a tele-education system for the diagnosis of ophthalmic disease by international trainees. AMIA Annu Symp Proc. 2015;2015:366–75.
69. Chan RV, Patel SN, Ryan MC, et al. The Global Education Network for Retinopathy of Prematurity (Gen-Rop): development, implementation, and evaluation of a novel tele-education system (an American Ophthalmological Society thesis). Trans Am Ophthalmol Soc. 2015;113:T2.
70. Al-Khaled T, Mikhail M, Jonas KE, et al. Training of residents and fellows in retinopathy of prematurity around the world: an international web-based survey. J Pediatr Ophthalmol Strabismus. 2019;56(5):282–7.
71. Paul Chan RV, Williams SL, Yonekawa Y, Weissgold DJ, Lee TC, Chiang MF. Accuracy of retinopathy of prematurity diagnosis by retinal fellows. Retina. 2010;30(6):958–65.
72. Myung JS, Paul Chan RV, Espiritu MJ, et al. Accuracy of retinopathy of prematurity image-based diagnosis by pediatric ophthalmology fellows: implications for training. J AAPOS. 2011;15(6):573–8.
11 Artificial Intelligence in Diabetic Retinopathy

Andrzej Grzybowski and Piotr Brona

…diabetes complications like nephropathy, peripheral neuropathy, and cardiovascular events [3].

Conventional Screening Initiatives of DR: Telemedicine

There have been many DR screening initiatives throughout the world, with varying degrees of coverage and longevity. Nevertheless, only a few countries have been able to successfully establish and continue DR screening on a national level, most prominently the UK and Singapore. Such a programme also appears to be functioning in Denmark; however, very little information about it is available in English.

United Kingdom

Each country within the UK has established its own national screening programme. The specific protocols and grading methods vary; however, all are based on digital colour fundus photography. The programmes cover all diabetics over the age of 12 years with vision of at least light perception in one eye.

…referrals with suspected proliferative disease and over 52,000 referrals for suspected maculopathy or pre-proliferative diabetic retinopathy, with an overall rate of DR of 2.8%.

The aforementioned screening programme was expected to reduce the number of people considered legally blind in England from 4200 to fewer than 1000. It appears this goal has been accomplished, with a 2014 report showing that DR is no longer the leading cause of certifiable blindness in England and Wales for the first time in 50 years [5].

Wales

The Diabetic Retinopathy Screening Service for Wales (DRSSW), established in 2002, is a mobile screening service. Similarly to the English programme, two fundus images are taken per eye. Patients with sight-threatening DR are referred to a hospital-based retinal service. Thirty screening teams serve 220 locations within Wales, achieving patient coverage of about 80%.

Scotland
…as RetinaLyze and discussed in further sections, were published, showing relatively good sensitivities of 71–93% and specificities of 72–86%; these were based on small sample sizes, reaching 137 patients in the largest study [10, 11].

All of those studies were done in the pre-digitisation era, meaning images, in the form of slides taken from a fundus camera, had to be scanned by hand. This was done using a slide reader or scanner to achieve a workable digital version of the image. The process was time-consuming and required specialized equipment, and the additional processing steps introduced potential image artefacts and loss of quality. The lack of centralised databases and digital storage of fundus images meant training and verification images were hard to acquire. As a consequence, most studies suffered from the low number of images used, compared with modern models using tens of thousands of fundus images to establish and validate a system.

Even though at that time automated screening was severely limited from a technical standpoint, a number of people had already attempted to devise suitable screening methods, recognising the potential of new technology to enhance or substitute human-based grading.

Deep Learning Algorithms

In subsequent years, with increasing digitisation, new ways of approaching the subject of automated image analysis became possible. Up until the 2010s, experts designed algorithms for the detection of specific features of DR, like microaneurysms or haemorrhages. In deep learning, the software is presented with a fundus image as a whole and a pre-specified result for that image. Over the course of analysing many such images, often thousands, it becomes able to distinguish between images with different results. What separates one result from another does not have to be explicitly specified by the designers. The advent of deep learning-based DR detection brought a significant improvement in the accuracy of newly developed or improved systems. Abramoff and colleagues reported how the introduction of deep learning techniques allowed a significant improvement to the already established, classically designed, automated DR software, the Iowa Detection Program. Based on a publicly available set of fundus images with/without DR (the Messidor-2 dataset), the sensitivity improved from 94.4% to 96.8% and the specificity from a confidence interval of 55.7–63% to 87% [12]. For the Iowa Detection Program, deep learning features were added on top of already existing algorithms; many other initiatives attempted to establish entirely new deep learning-based DR detection software. Establishing automated or semi-automated screening with the use of AI will require striking a careful balance between sensitivity and specificity, imaging modality, and gradeability of the images, all of which will need to be weighed against the potential cost. The cost-benefit balance is not universal and will vary depending on the relationship of those parameters with the relevant population characteristics, such as the prevalence of DR and sight-threatening DR, availability of treatment, cost and availability of trained staff, etc. A recent paper explores potential approaches to making a health economic assessment and safety analysis of implementing novel AI DR solutions into widespread screening [13]. Deep learning DR detection has been found to be cost-effective in developed countries, like Singapore and the United Kingdom. However, there are no published studies looking into the feasibility of implementing AI DR screening in countries without a robust teleophthalmology screening programme set up beforehand, or in other resource-limited settings (Table 11.1) [13].

Described further are several significant initiatives for AI-based diabetic retinopathy detection.

IDx-DR

IDx-DR is a combined DR screening solution that incorporates the aforementioned DR screening algorithm with an image quality assessment and feedback system. Submission of images is done using the IDx-DR client, which is a stand-alone piece of software. The IDx-DR client features a system for resubmission of images deemed to be of too low quality. The threshold for a positive result has been set as 'more than mild' diabetic retinopathy according to the ICDR grading scale or signs of diabetic macular edema. IDx-DR offers one additional result level of vision threat-
Table 11.1 Deep learning-based DR screening algorithms available at the end of 2020

IDx-DR (USA). Classification: per patient, rDR/no rDR. First AI autonomous diagnostic device to be FDA approved; Class IIa medical device in the EU.
EyeArt (USA). Classification: per patient, rDR/no rDR. Second AI software to receive FDA approval; approved by the Canadian FDA; Class IIa medical device in the EU.
RetmarkerDR (Portugal). Classification: DR/no DR, plus microaneurysm turnover rate. Previously used in various screening initiatives in Portugal; Class IIa medical device in the EU.
SELENA+ (Singapore). Classification: per patient, rDR/no rDR. Scheduled to be implemented into national DR screening in Singapore.
Google algorithm (USA). Classification: per picture, rDR/no rDR. Studies surrounding real-world implementation based in India and Thailand; currently no official software package available outside of research studies.
MediosAI (India). Classification: per patient, rDR/no rDR. Integrated into an offline smartphone app to be paired with the Remidio fundus-on-phone device.
VeriSee (Taiwan). Classification: rDR/no rDR. Relatively new algorithm, recently approved by the Taiwanese FDA-equivalent government body.
Pegasus (United Kingdom). Classification: rDR/no rDR. Operated by the Orbis non-profit organisation.
RetCAD (Netherlands). Classification: rDR/no rDR. Detects referable AMD as well.
RetinaLyze (Denmark). Classification: per image, retinal changes/no changes. Detects AMD-related changes as well; also offers an automated glaucoma screening module.
OphtAI (France). Classification: per patient, rDR/no rDR and DR grade. Also detects glaucoma and AMD.
Fig. 11.1 IDx-DR image submission screen. Printed with Permission © IDx Technologies
ening DR, indicative of a suspicion of more advanced, possibly proliferative DR. Screening is based on four fundus images per patient, two from each eye, one macula- and one disc-centred, and all images need to be submitted for a result to be produced. The algorithm is able to cope with some quality loss by utilizing the overlap of the two image fields (Fig. 11.1).
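The image-level supervision described at the start of this section—whole image in, single label out, with no hand-specified lesion features—can be sketched with a toy classifier. This is an illustrative stand-in only (synthetic "images" and plain logistic regression instead of a deep network, not any vendor's actual system): the model must discover for itself which pixel patterns separate the two classes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_image(has_lesions):
    """Synthetic 16x16 'fundus': DR images carry a few bright dot artefacts."""
    img = rng.normal(0.5, 0.05, (16, 16))
    if has_lesions:
        for _ in range(6):                    # stand-ins for microaneurysms
            r, c = rng.integers(0, 16, 2)
            img[r, c] += 1.0
    return img.ravel()

# Whole images plus one image-level label each; no feature is hand-specified.
X = np.array([make_image(i % 2 == 1) for i in range(400)])
y = np.array([i % 2 for i in range(400)], dtype=float)
X = X - X.mean(axis=0)                        # centre features for stable training

# Logistic regression trained by gradient descent: the smallest model that
# learns a decision rule from labels alone; deep networks add capacity,
# not a new supervision principle.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

acc = ((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == (y == 1)).mean()
print(f"training accuracy: {acc:.2f}")
```

The classifier ends up weighting the bright-dot pixels highly without ever being told they matter—the same principle, at toy scale, as the deep learning systems described in this chapter.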
144 A. Grzybowski and P. Brona
Although on the front end the user is presented with a screening result in one of four categories—no DR, mtmDR, vision-threatening DR, or insufficient quality—on the back end IDx-DR produces a numerical value representing its assessment of the likelihood of mtmDR. Currently it uses defined cut-offs to sort the patient into the appropriate category. Theoretically, this means that the IDx-DR output could be adjusted to maximise either sensitivity or specificity depending on the needs of a given screening initiative.

IDx-DR is the first autonomous diagnostic software, and one of the very first AI-based software products in medicine, to receive Food and Drug Administration (FDA) approval. In a self-titled pivotal trial, the IDx-DR software was studied in a real-world application. A little under 900 patients were screened using IDx-DR coupled with the Topcon NW-400 automatic fundus camera in a primary care setting. The staff operating the IDx-DR client and taking the fundus images were not IDx-DR or clinical trial staff, but pre-existing employees of those clinics who underwent standardised training. This is important, as in a scenario of large-scale DR screening deployment, specialised staff—say, in ophthalmic imaging—may be harder to come by than the necessary technical equipment.

In previous trials of IDx-DR and other AI algorithms, the performance of the AI was compared to human grading with the same information available, which was mostly the fundus images. Sometimes, to strengthen the human grading standard against which the AI was compared, several persons graded each image, with a consensus grading to follow. This trial took an even more stringent, extreme approach—giving the human graders far more information, while keeping the AI limited to the four fundus images taken by relatively inexperienced staff, albeit with an automatic fundus camera and selective mydriasis. This was compared to grading done on four stereoscopic, widefield fundus images taken by professional technicians and graded by an established, independent reading center—the Wisconsin Fundus Photograph Reading Center. Presence of clinically significant diabetic macular edema (CSME) was additionally established based on macular OCT imaging, which of course the algorithm had no access to. With the odds stacked against it, the AI was still able to exceed all endpoints set before the trial began: sensitivity of 87.2% (>85%), specificity of 90.7% (>82.5%), and an imageability rate of 96.1% (among patients deemed imageable by the reading center). The landmark FDA decision to allow IDx-DR to operate within the United States was largely based on the results of this study [14]. In the US, according to the FDA-approved use, IDx-DR needs to be coupled with the Topcon NW-400 non-mydriatic fundus camera.

A number of studies on IDx-DR had been published before this one, though none as significant. Notably, its performance against the Messidor-2 dataset was significantly higher than in the above-described trial, with 96.8% sensitivity and 87% specificity. In another real-life study, performed in the Netherlands, 1410 patients were screened within the Dutch diabetic care system. Three experts graded the resultant images according to the ICDR and EURODIAB grading scales, resulting in significantly different algorithm performance depending on the scale used. For EURODIAB, IDx-DR sensitivity and specificity were 91% and 84%, whereas for ICDR they were 68% and 86% respectively. The significantly lower performance against ICDR criteria could be attributed almost entirely to a single aspect of ICDR—judging a single haemorrhage as at least moderate DR; the authors note that should this be changed, the sensitivity rises from 68% to 96.1% [15].

This is a great illustration of how important grading criteria are. A number of differing criteria have been used in different studies so far—EURODIAB, ICDR, ETDRS, and in some studies local grading guidelines—with the choice of criteria being one of the most significant factors affecting the outcome and the final performance indicators published. The first and most important question in establishing DR screening is 'what is the screening trying to accomplish?'. In the simplest form, the aim of a DR screening initiative should be finding those patients who will require a specialty ophthalmology visit before the next screening episode. This seems to hold true for established traditional screening programmes in developed countries. However, depending on the region and
RetmarkerDR
the process will likely be resource effective, as even a specificity of 50% means almost halving the human grader workload.

A noteworthy feature that distinguishes Retmarker from other algorithms is its ability to take previous screenings into account. By comparing the fundus images taken on a previous screening visit, the system is able to track retinal changes and determine whether progression has occurred. This leads to another interesting avenue—tracking microaneurysms. Microaneurysms disappear over time and new ones form. Tracking those changes using traditional, human-grader-based methods is very labour intensive, but is virtually instantaneous for an AI. The rate of microaneurysms appearing and disappearing was named the microaneurysm turnover rate. A number of studies have been published showing this parameter is a promising predictive factor for future DR progression [9, 16–18]. Although these studies consistently linked increased MA turnover to an increased chance of DR progression, establishing a clinically significant and actionable link between lesion turnover and diabetic retinopathy progression would require further work (Fig. 11.4).

In addition to being introduced as a part of screening in Portugal, RetmarkerDR was also studied in one of the only head-to-head comparisons of AI DR systems ever published [19]. This study, done for the purpose of assessing a potential introduction of autonomous DR detection software into the existing English DR screening programme, invited AI DR software makers to submit their algorithms for testing. Three systems participated: RetmarkerDR, EyeArt and iGradingM. Because of technical issues iGradingM, a DR detection software born in Scotland, was disqualified from the study, and its parent company has since dissolved. Images from consecutive, routine screening visits of over 20,000 patients to an English DR screening centre, previously graded as per the national screening protocol, were processed by the systems, and any discrepancies in grading between the AI and human graders were sent to an external reading centre. Both the efficiency in detecting DR and referable DR, and the cost-effectiveness, were studied [19]. The study reported the following sensitivity levels:

• EyeArt: 94.7% for any retinopathy, 93.8% for referable retinopathy (human graded as either ungradable, maculopathy, preproliferative, or proliferative), 99.6% for proliferative retinopathy;
• Retmarker: 73.0% for any retinopathy, 85.0% for referable retinopathy, 97.9% for proliferative retinopathy.

Specificity for any DR was:

• 20% for EyeArt
• 52.3% for Retmarker

Although the sensitivity levels are much higher for EyeArt, this is offset by the reverse situation in specificity. Of note are the remarkably low specificity levels for both systems as compared to more recent reports and estimates of these and other software packages. It is important to realise that although the study was published in 2016, it started some years prior; machine learning and image analysis methods improved dramatically over that period, and one can assume the algorithms evaluated have since improved as well.

EyeArt

EyeArt, the second software compared for the purpose of the British screening programme described above, is developed by Eyenuk Inc., based in Los Angeles, USA. The company additionally offers another product—Eyemark, for tracking DR progression—which, similarly to Retmarker, offers MA turnover measurements. EyeArt is able to take in a variable number of pictures per patient, making it suitable for various screening scenarios without further adjustments, in contrast to some of its competitors. This solves a number of issues, as was illustrated by IDx-DR, which had to be specially modified to accept the single image per eye of the Messidor-2 dataset instead of its typical input of two images.

EyeArt had been verified retrospectively on a database of 78,685 patient encounters (a total of 627,490 images) with a refer/no refer result and a final screening sensitivity of 91.7% and specificity of 91.5%, as compared to the Eye Picture Archive Communication System (EyePACS) graders; however, only the abstract for that study was available online. Eyenuk has since pursued this line of enquiry further with the publication of a full study, done on more than 100,000 consecutive patient visits from the EyePACS database. A total of 850,908 images were analysed, collected from 404 primary care facilities between 2014 and 2015. Patients generally had eight images taken, four per eye: one
image of the external eye, a macula-centred and a disc-centred image, and an image temporal to the disc, though no patient was disqualified because of the number of images taken or their resolution. The images were almost evenly split between non-mydriatic (54%) and mydriatic (46%). The final results in terms of detecting referable DR were 91.3% sensitivity and 91.1% specificity, in line with the previous partial results. Sensitivity for detecting the higher DR levels that are treatable—either severe or proliferative DR—was 98.5%, and 97.1% for detecting CSME (as compared to human graders assessing the same fundus pictures). The system's accuracy did not seem to change depending on mydriasis, with 98.0% and 98.8% sensitivities for detecting treatable DR in non-mydriatic and mydriatic encounters respectively. Only 910 patient encounters, less than 1%, were deemed non-screenable by EyeArt; of those, 198 encounters had previously been assigned as insufficient for full human grading. Nevertheless, of those 910 screening episodes over one third had severe or proliferative DR; the authors note that the system treats non-screenable patients as positive for the purpose of patient safety [20].

EyeArt analysed the whole cohort of over 100,000 screening encounters—almost a million images—in less than 2 full days [20]. Assuming an average 30 seconds of grading time per image, the same task would take about 7000 work-hours, or about 4 full-time graders working for a whole year, showing just how much faster computer analysis can be. Of course, in an actual screening scenario no one grades thousands of images at a time, and a quick result available within minutes of the screening is much more satisfactory—but AI can do that too, 24 h a day, every day of the year (Figs. 11.5 and 11.6).

EyeArt achieved similar results in terms of sensitivity to the aforementioned UK study looking into AI DR screening viability, though there is a very considerable discrepancy in specificity between the two studies [19, 21]. As mentioned
before, these studies were not done in the same time period, and further improvements to the system probably account for the increase in its accuracy. Indeed, the authors themselves describe the improvement that version 1.2 of EyeArt (still based on traditional image analysis techniques) underwent with the inclusion of multiple convolutional neural networks.

EyeArt was also measured against the Messidor-2 dataset: referable DR screening sensitivity was 93.8%, with a specificity of 72.2%. Importantly, this dataset does not have a pre-defined result or grading attached to it, necessitating a separate set of graders to establish the standard that the AI is compared against. This grading is separate for each study, further hampering the ability to directly compare any of the systems involved.

Eyenuk has recently published the results of its most robust clinical trial to date. The study was pre-registered, as with the IDx-DR pivotal trial, and comprised a similar number of patients—893 screened in total. The screening was performed in primary-care clinics, with two-field non-mydriatic fundus photography first and 4-field mydriatic imaging second. The study compared the ability of EyeArt to detect clinically significant DME, or moderate non-proliferative DR or higher, based on the two-field imaging, against the grading decision of an external reading centre (the Wisconsin Fundus Photograph Reading Center, as was used in the IDx-DR trial) using four wide-field stereoscopic images per eye. For non-mydriatic screening, EyeArt was shown to have high sensitivity at 95.5%, good specificity at 86%, and gradeability of 87.5%. When patients from the initially ungradable group were dilated, the system's overall gradeability rose to 97.4%, while retaining the same sensitivity, and specificity rose by 0.5% to 86.5%. Although this trial did not involve OCT imaging for the detection of DME, in all other respects it appears similar to the IDx-DR clinical trial, with
similar results in terms of both systems' accuracy.

Another result, perhaps even more surprising than the stellar performance of the AI, was a comparison based on a subset of the patients in this trial who had undergone dilated ophthalmoscopy after the fundus imaging. A total of 497 patients were examined across 10 U.S. clinical centres, some specialty retinal centres and others general ophthalmology clinics. This was compared against the adjudicated decision of the Wisconsin reading center based on the four wide-field stereoscopic fundus photographs. Although the ophthalmoscope-based examinations had a high specificity of 99.5%, this was coupled with an abysmal sensitivity of 28.1% overall. Even among the retina specialty centres the sensitivity was only 59.1% [22]. This shows that human grading using ophthalmoscopy, one of the tools commonly available in primary-care clinics, is very unlikely to be a sensible screening solution if even ophthalmologists struggle with its accuracy.

The most recent study regarding EyeArt was done on 30,000 images taken from the English DR screening programme and followed a very similar protocol and analysis pattern to the only comparative study on AI in DR screening [19, 23]. Images from three different centers were graded according to the established national screening protocol. Among 30,405 screening episodes, EyeArt flagged all 462 cases of moderate and severe DR. Overall sensitivity for rDR was 95.7%, with 54% specificity. Although the specificity is once again lower than in other studies, it is still a very significant increase from the 20% specificity in the previous study [19, 23]. The authors concluded that with the introduction of such an AI system into the currently established national screening protocol, replacing the primary grader, the overall human grading workload could be halved.

'Google' Algorithm

The potential application of new artificial intelligence solutions for the analysis of fundus images, DR particularly, caught the attention not only of smaller, independent teams and companies but also of an industry giant—Google. This is not Google's only foray into medical AI, with teams at Google collaborating on solutions for automated analysis of histopathology images, as well as other, non-image-analysis-related work. A Google Inc.-sponsored study introducing their automated DR screening algorithm was published in 2016 by Gulshan and colleagues. To develop the algorithm, the authors gathered over 128,000 macula-centered images from patients presenting for diabetic screening in India and the US. To validate the resultant algorithm, a random set of images from the same data source was chosen; those images were not used in the creation of the algorithm. The image set for both development and validation consisted of mixed mydriatic and non-mydriatic photos from several different fundus camera models. Additionally, the authors tested the algorithm against the aforementioned French dataset—Messidor-2. The algorithm achieved impressive results, with a sensitivity of 96.1% and specificity of 93.9% (tuned for high sensitivity), and a sensitivity of 87.0% and specificity of 98.5% (tuned for specificity). The respective numbers for the Messidor-2 dataset were 97.5% and 93.4% (high sensitivity) and 90.3% and 98.1% (high specificity) [24].

Although these accuracy results are among the highest published, and the sample size is considerable, this study stood out in that it put a lot of emphasis on the selection of human graders and their validation. Initially, for the development of the dataset, the study invited 54 US-licensed ophthalmologists or ophthalmology trainees in their last year of residency, with each grading between 20 and 62,508 images. As a result, each image was graded between 3 and 7 times. Final DR status and gradeability of the image were set based on majority decision. Graders were sometimes shown images they had previously marked, to measure intra-grader reliability—how often, given the same image, the grader decides on the same result. Sixteen graders went through a sufficient volume of images for this to be feasibly calculated, and the top 7 or 8 ophthalmologists based on this measure were chosen to grade all the images from the validation datasets. Inter-
grader reliability was also measured for 26 of the ophthalmologists. The mean intra-grader reliability for the 16 graders for referable DR was 94%, and inter-grader reliability for the 26 graders was 95.5%.

Even when choosing the most self-consistent graders out of several board-certified ophthalmologists, the mean agreement rate for referable DR images was only 77.7% for the EyePACS-1 dataset, with complete agreement among all eight graders achieved in less than 20% of referable DR images. Grader agreement was much better for non-referable DR images, with complete agreement on 85.1% of the non-referable cases [24]. This highlights just how many caveats the current universally accepted grading method and gold standard—certified human grading—can have. Out of 16 graders, on average, 4 out of 100 images were marked differently each time they were assessed by the same person. Among the 8 most self-consistent graders, only 20% of referable DR cases were judged as such by all graders.

Issues surrounding human grading were further explored in a subsequent 2018 study [25]. In it, the authors built upon the previously described work by Gulshan, developing an improved algorithm, expanding the training dataset, and exploring the different grading protocols presently in use. The authors implemented a solution where the software outputs several numbers ranging from 0 to 1, each indicating its confidence that the image represents a given severity level of DR. This appears to be very similar to the back-end solution implemented by IDx-DR, which also outputs a confidence level in the result being more than moderate DR, although this is not presented to the end user. Such a design allows relatively easy adjustment of the system's sensitivity-specificity balance, focusing on either of those measures.

This study ended up with three different 'grading pools'—EyePACS graders, certified ophthalmologists, and retinal specialists. Additionally, an adjudication protocol was introduced in cases of disagreement by the retinal specialists, with both asynchronous and live adjudication sessions held until an agreement was reached [25]. This is in contrast to the first work, which relied only on majority decision. The new algorithm was based on well over 1.5 million retinal images, with 3737 adjudicated-grade images used to fine-tune the system and 1958 images used for validation. The validation set was graded by three retinal specialists on their own, and the grading was repeated later with face-to-face adjudication of all images between all three specialists. Additionally, three separate ophthalmologists graded the images on their own. The adjudicated grade was set as the gold standard for further comparisons.

All of the graders had high specificity—97.5%, 97.9% and 99.1% for the ophthalmologists, and 99.1%, 99.3% and 99.3% for the retinal specialists. Sensitivities, however, were much lower, with ophthalmologists ranging from 75.2% to 76.4% individually, and reaching 83.8% as a majority decision, as compared to the adjudicated grading [25]. Even the majority-decision grading of the retinal specialists showed room for improvement at 88.1%, with individual sensitivities of 74.6%, 74.6% and 82.1%. Most cases of discrepancy between the majority grading of the ophthalmologists and the adjudicated result stemmed from missed MAs (36%); misinterpreted image artefacts that can be construed as MAs or small haemorrhages (20%); and misclassified haemorrhages (16%).

After implementing the adjudication procedure and fine-tuning the autonomous system, it achieved accuracy levels comparable to any of the retinal specialists or ophthalmologists involved [25].

A prospective trial was done to assess the real-world viability of the algorithm, utilising many of the lessons learned from the two above-described studies [26]. The trial was done in two hospitals in India on a total of 3049 diabetics attending their appointments in the local general ophthalmology and vitreoretinal clinics, as well as telescreening initiatives. During their appointments, macula-centered 40–45 degree fundus images were taken, mainly with a Forus 3nethra camera, a compact, low-cost fundus camera [26]. All images were non-mydriatic and were not included in further therapeutic decisions for the patients, as they carried on with their appointments. All images were later graded by a non-
physician trained grader and a retinal specialist. All images taken at one of the two centres, 997 patients in total, also underwent grading by three retinal specialists with an adjudication process as in the previous study. Additionally, any images from the second centre with any discrepancies between any of the graders or the algorithm output (5-point DR grading and DME status) were also adjudicated. The results in terms of human grading accuracy in detecting rDR were largely similar to those in the previous study—the four human graders had sensitivities between 73.4% and 88.8%, with specificities between 83.5% and 98.7%. The algorithm had comparable performance, at a sensitivity of 88.9% at the first centre and 92.1% at the second, with respective specificities of 92.2% and 95.2% [26]. The 'Google' DR algorithm was trained on images taken with many different cameras, of which only 0.3% were taken by this specific fundus camera, yet it showed very good performance on images taken using it, suggesting the algorithm is able to deal with different equipment being used to take the images [26]. Although the algorithm and its results appear very promising, with good accuracy, further work is required before it can be used in a clinical setting, as the authors point out themselves. Firstly, as it currently has no image quality assessment capabilities, only images deemed gradable by the adjudication panel were included in this latest study. Additionally, as with all other algorithms, its place within the precise protocols of widespread screening, and its integration into or outside of the existing clinical workflow, remain to be devised and assessed. This latest study was designed specifically for the algorithm not to interfere with the established clinical set-up.

SELENA+, Singapore Algorithm

All training images were additionally graded by two senior non-physician graders and adjudicated by a senior retinal specialist in case of conflicting grading. Overall, 72,610 images taken in the years 2010–2013 were included in the training dataset, and a further 71,896 from the years 2014–2015 were used for the primary validation dataset. The system was additionally validated using images from multi-ethnic populations in Singapore, and using images taken in screening studies from around the world—China, the African-American Eye Disease Study (US based), the Royal Victoria Eye Hospital (Australia), Mexico and the University of Hong Kong. These studies included between 1052 and 15,798 images each, for a total validation dataset of 112,618 images from more than 56 thousand patients. Reference standards varied between the different studies, but all included at least two graders, with the largest study by image volume (n = 15,798) also including retinal specialist arbitration.

For the primary validation—that is, the data from SIDRP years 2014–2015—the system demonstrated a sensitivity of 90.5% for detecting referable DR, comparable to professional graders on the same dataset at 91.5%, as compared to the final retinal specialist arbitration decision. Specificity of this solution was 91.6%, lower than that of professional graders at 99.3%. Interestingly, the system proved better at detecting sight-threatening DR, at 100%, with trained graders rated at only 88.6%—again at the cost of lower specificity. As the study included multiple ethnic populations, yet the algorithm was devised only on the basis of SIDRP images, the authors analysed whether it showed racial or other biases. This was made possible by the large racial diversity among the validation datasets—Malay, Indian, Chinese, White, African-American and Hispanic. The algorithm achieved comparable performance in different subgroups of patients by race; additionally, age, sex, and glycaemic control did not affect the accuracy of the algorithm.
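The subgroup analysis just described can be sketched as follows. The data here are entirely synthetic (hypothetical group labels and an assumed ~91% accurate screener, not SIDRP data); the point is only the mechanics of computing per-group sensitivity and specificity and comparing them for signs of bias:

```python
import numpy as np

def sens_spec(y_true, y_pred):
    """Sensitivity and specificity from binary ground truth and predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

rng = np.random.default_rng(1)
n = 6000
groups = rng.choice(["Malay", "Indian", "Chinese"], size=n)  # hypothetical cohort
y_true = rng.integers(0, 2, size=n)           # 1 = referable DR (ground truth)
# Simulated screener: wrong ~9% of the time, independent of group membership.
wrong = rng.random(n) < 0.09
y_pred = np.where(wrong, 1 - y_true, y_true)

per_group = {}
for g in ("Malay", "Indian", "Chinese"):
    mask = groups == g
    per_group[g] = sens_spec(y_true[mask], y_pred[mask])
    print(f"{g:8s} sensitivity {per_group[g][0]:.3f}  specificity {per_group[g][1]:.3f}")
```

Because the simulated error rate is independent of group, all per-group figures cluster around the overall accuracy; a real bias audit looks for subgroups whose sensitivity or specificity falls well outside that cluster.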
However, for all of the above datasets, including the development and validation datasets, only images of good quality were chosen.

VeriSee

VeriSee was developed based on images taken previously at the National Taiwan University with a single fundus camera [27]. The images were graded by two ophthalmologists undergoing fellowship training, with an experienced retinal specialist employed for adjudication. The algorithm was trained on about 37,000 images, with 1875 images used for validation. The validation dataset was not used for training, but was taken with the same camera at the same location. The algorithm achieved 92.2% specificity and 89.5% sensitivity for any DR, and 89.2% and 90.1% for rDR. The algorithm exceeded the sensitivity for detecting rDR achieved by the ophthalmologists in this study, which was calculated at 71.1%, and did much better than internal medicine physicians at detecting any DR (64.3% sensitivity, 71.9% specificity, based on diagnoses available in chart records) [27]. Although these results are promising, due to the low volume and homogeneity of the validation dataset, the performance of the algorithm in other scenarios remains uncertain. Nevertheless, the algorithm has been approved by the Taiwanese FDA-equivalent body and is scheduled to be implemented into real-world screening in Taiwan in the near future.

RetCAD

RetCAD is a recently published system, developed in the Netherlands, allowing for joint detection of DR and AMD from fundus images [28]. The accompanying study is the only one to show an algorithm's effectiveness at screening for both AMD and rDR at the same time. The validation dataset was rather small relative to the other studies described here, comprising 600 images. Nevertheless, the software achieved good accuracy and was able to distinguish between rDR and referable AMD rather well, with a sensitivity of 90.1% and specificity of 90.6% [28]. Unlike with the SELENA software, which can also detect both AMD and DR, both diseases were tested at the same time, instead of testing the accuracy against AMD and DR on separate datasets [29]. RetCAD was also tested against the publicly available Messidor-2 dataset for DR detection and the Age-Related Eye Disease Study dataset for AMD, achieving favourable results.

OphtAI

OphtAI is a relatively new entry to the commercial AI DR detection market. It originates from a joint venture of two French medical IT companies, Evolucare and ADCIS; it was developed in France and possesses a Class IIa CE certification. The DR algorithm was developed based on a dataset of over 275,000 eyes from a French medical imaging database [30]. It is mostly a cloud-based service accessible through a web interface, my.ophtai.com, which allows between 1 and 6 images per patient to be sent for analysis and offers a DR grading result in a few seconds, along with a confidence rating and a heatmap of the suspect retinal changes. OphtAI is also available as a locally hosted platform, depending on local regulations. The software additionally detects referable DR, diabetic macular edema, glaucoma and AMD from fundus images, and there are plans for the next version to assess general eye health and to detect over 10 specific pathologies and 27 disease signs, expanding the number of detected pathologies to over 30. The DR detection algorithm was compared against the Messidor-2 dataset with very promising results [30, 31]. Further publications related to the verification and efficacy of this algorithm can be expected in the coming years.

Other AI DR Solutions

The initiatives described so far focused mostly on the aspect of image analysis. One of the hurdles to overcome in their development regarded the equipment and technique used to take the fundus images, and how these might affect the system's diagnostic ability or its image quality detection protocols. Use of different fundus cameras by many different technicians can introduce a lot of variability in picture quality, resolution or sharpness. IDx-DR, for example, is only approved for use in the US when coupled, not only with a sin-
Fig. 11.8 MediosAI image selection. Printed with permission from MediosAI
Fig. 11.9 MediosAI report. Printed with permission from MediosAI
USD for a refurbished model. Medios AI achieved good results, with sensitivity and specificity for any DR of 83.3% and 95.5%, and for rDR of 93% and 92.5% [35]. For Medios AI, all studies so far compared AI and grader performance on the same source material: pictures taken with the mobile camera. A study similar to those done for IDx-DR and EyeArt, where the chosen system is compared to a diagnosis based on professional, multi-field fundus imaging, might provide additional insight and comparability of those systems with the mobile approach (Fig. 11.8).

The big difference in Remidio's DR screening system, other than implementing it directly into the fundus imaging device, is that it performs the analysis entirely offline, without need for internet access. Although access to wireless internet is spreading all over the world, this can be a hugely important factor in screening remote and underprivileged communities, where internet access is sometimes not possible and very often unreliable. This approach is picking up steam, with more mobile, smartphone-based or smartphone-aided fundus imaging solutions being studied and considered for adoption in DR screening. Smartphones coupled with a compatible mobile fundus camera attachment or device provide a low-cost, highly mobile and highly scalable DR screening solution, especially if the analysis is integrated into the smartphone itself. A recent study conducted in India compared the effectiveness of four such devices in human-based DR grading [36] (Fig. 11.9).
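Sensitivity/specificity pairs like these, and the predictive values quoted for other systems in this chapter, all derive from the same 2×2 confusion matrix of AI output versus reference grading. As a minimal sketch (the function name and the counts below are our own illustration, not data from any cited study):

```python
def screening_metrics(tp, fp, fn, tn):
    """Compute the four accuracy measures commonly quoted in DR
    screening studies from a 2x2 confusion matrix (AI vs. reference)."""
    sensitivity = tp / (tp + fn)  # proportion of diseased eyes flagged
    specificity = tn / (tn + fp)  # proportion of healthy eyes passed
    ppv = tp / (tp + fp)          # positive predictive value
    npv = tn / (tn + fn)          # negative predictive value
    return sensitivity, specificity, ppv, npv

# Illustrative counts only:
sens, spec, ppv, npv = screening_metrics(tp=91, fp=40, fn=9, tn=960)
print(f"sens={sens:.2f} spec={spec:.2f} ppv={ppv:.2f} npv={npv:.2f}")
```

Note that, unlike sensitivity and specificity, PPV and NPV depend on disease prevalence in the screened population, which is one reason such figures are hard to compare directly across screening cohorts.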
It appears the company Bosch has also taken a similar approach in improving its 'Bosch Mobile Eye Care' fundus camera and developing an in-house DR diagnostic algorithm to be implemented within the fundus camera itself. Single-field images taken with their camera, without pharmacological mydriasis, were analysed by a convolutional neural network-based AI software to deliver a disease/no-disease or insufficient-quality output. The system is cloud based and would require internet access. Out of 1128 eyes studied, 44 (3.9%) were deemed inconclusive by the algorithm, with just 4 out of 568 patients having images from both eyes of insufficient quality. The study compared the AI's performance with grading based on 7-field stereoscopic, mydriatic, ETDRS imaging done on the same eye. The Bosch DR Algorithm achieved good results, with sensitivity, specificity, PPV, and NPV rates of 91%, 96%, 94%, and 95% respectively [37]. However, little is known about the grading criteria employed in this study; in contrast to other similar works, it employs purely a disease positive/negative criterion, rather than the more useful rDR/non-rDR distinction [37]. Unfortunately, no further reports of this algorithm's effectiveness are available at this time.

Even though mobile screening does appear very appealing, and as exemplified above the results are very promising, it is conceivable that the lower image quality obtained when using mobile fundus cameras might affect the accuracy of the AI system used to grade it. A recent study compared the performance of a deep learning-based DR detection system on a benchmark, curated image set taken with a desktop camera against its accuracy on images taken with a handheld fundus camera [38]. Although the software, dubbed Pegasus, did exceptionally well on the curated desktop dataset, with 93.4% sensitivity and 94.2% specificity, this did not translate to an equal detection rate in the handheld camera images, with a statistically significant decrease in accuracy. The parameters for the handheld camera dataset were 81.6% sensitivity and 81.7% specificity—a drop of more than 10% for each of the parameters [38]. Mobile screening setups and portable cameras are very attractive means for introducing widespread screening. However, testing on curated, high-quality datasets will overestimate real-world accuracy. Testing the software should be done in a scenario as close to the desired implementation as possible, to achieve accuracy metrics that will be true to real-life screening.

New Technologies in Retina Imaging and DR Screening

Although most DR screening efforts are directed towards analysis of fundus images, there have been significant advancements in employing AI for analysis of optical coherence tomography (OCT). OCT is commonly used in assessing and monitoring DR and DME on an individual patient basis. Several metrics, like central macular thickness, help us establish some objective parameters; nevertheless, the evaluation of OCT scans is still subjective and user-dependent, similarly to evaluating fundus pictures. A further development of OCT—OCT angiography (OCTA)—allows for non-invasive tracing of retinal and choroid vasculature; the role of OCTA in common ophthalmic practice is not firmly defined, and there are few objective quantifications possible. First attempts at using OCTA data for machine learning and automated analysis of DR patients have already been made. OCTA data from 106 patients with type II diabetes and either no DR (n = 23) or mild non-proliferative DR (n = 83) was used to train an algorithm to detect DR features from superficial and deep retinal maps [39]. Using a combined approach drawing on both layers, the system demonstrated overall accuracy of 94.3%, sensitivity of 97.9%, specificity of 87.0%, and an area under curve (AUC) of 92.4% [39]. Although the relatively high reliability measures are promising, it is important to note that the validation was done on the training subset. Nevertheless, the study has shown that OCTA can be subjected to deep learning and automated analysis, and we may very well see more such initiatives in the future. The specific computational techniques for detecting DR from OCTA have been further explored in a recent study comparing different neural network approaches to analysing OCTA
11 Artificial Intelligence in Diabetic Retinopathy 157
and their results. The best performing algorithm achieved accuracy of 0.90–0.92 [40].

Teaching general practitioners (GPs) to take photos with a mobile fundus camera and subsequently grade them might be an alternative method of widening access to DR screening, without the use of AI or automated systems. A recent study looked into training GPs in Sri Lanka to take and grade fundus photos taken with a mobile camera (Zeiss Visuscout 100®). The GPs underwent a training programme delivered by two retinologists; however, of the nine doctors that undertook the training, only the two with the best test grading results were chosen for the study. The GPs took and graded non-dilated and subsequently mydriatic fundus images, and their performance was graded against the decision of a retinal specialist after a dilated fundus examination using slit lamp biomicroscopy and indirect ophthalmoscopy. Assuming ungradable subjects as referable, the two GPs achieved sensitivities for detecting rDR of 85% and 87%, with specificities of 72% and 77%, for non-mydriatic screening, rising to sensitivities of 95% and 96%, with specificities of 89% and 93%, after mydriasis. Although this shows that training GPs to screen for rDR is theoretically feasible and can achieve good diagnostic accuracy, both the availability of GPs and their ability to take on additional workload are limited. In the aforementioned study only the two best performing GPs (measured as agreement with the retinal specialist on a test image set) were included; unlike with an automated system, the accuracy would likely vary between different GP graders [41].

Approaching the issues surrounding DR screening from a different direction is RetinaRisk, a software developed in Iceland. RetinaRisk aims to decrease the overall burden of yearly DR screening by safely extending the time between screenings for part of the DR population. Although not explicitly derived from machine learning, it is based on analysis of extensive datasets. The algorithm takes in patients' parameters, such as gender, age, HbA1c level, DR status, diabetes type and duration, and blood pressure level. As a result, the algorithm presents a recommended time till the next screening, which may be longer than the traditionally accepted yearly interval, but may also be shorter for a subset of patients at high risk of developing DR complications. In a recent study based in one Norwegian ophthalmic practice between 2014 and 2019, the average screening interval was extended to 23 months, as compared to a 14-month average for the control group with fixed screening intervals [42].

Conclusions

Deep learning DR diagnostic software is currently a rapidly developing topic. During the last decade we have seen the concepts surrounding automatic DR screening evolve from a few expert-designed algorithms with varying measures of accuracy to a multitude of different approaches employing the newest developments in deep learning and other fields. We have seen progressively more robust studies emerge, proving the diagnostic or decision-support algorithms to be accurate and reliable, some based on millions of images, others with a particularly rigorous setting of their gold standard. During the last 2 years, a number of software packages have been approved by regulatory bodies around the world and are well on their way to being implemented into widespread screening in the respective countries. Following the general worldwide trend, increasing emphasis is being placed on mobile solutions, which may prove to be a better fit for resource-starved regions. Although the body of evidence speaking for the various algorithms is quite large and constantly increasing, there are significant shortcomings in our current study of AI in DR. Virtually all of the current studies looking into and measuring DR algorithms are sponsored by or dependent on the respective algorithm's company. Independent studies are very few and far between. For a long time the only independent and the only robust comparison available, published by Tufail and colleagues in 2016, compared algorithms tested in 2013. Since that time deep learning and related concepts have progressed almost beyond recognition, and many of the algorithms described here are being constantly updated. This situation changed only recently with the publishing of a study comparing multiple AI DR detection algorithms in an anonymised fashion, which made it clear that the algorithms' accuracy can vary significantly, but unfortunately gave readers no insight into the performance of any given algorithm [43]. We recently published a much smaller study comparing two algorithms on a local dataset [44]. Nevertheless, independent studies, particularly comparisons or studies establishing objective criteria through which the respective algorithms could be compared, are missing, with organisations, end-users or consumers left with a considerable dilemma when trying to choose an algorithm for screening their local population.

References

1. Klein BEK. Overview of epidemiologic studies of diabetic retinopathy. Ophthalmic Epidemiol. 2007;14(4):179–83.
2. Guariguata L, Whiting DR, Hambleton I, Beagley J, Linnenkamp U, Shaw JE. Global estimates of diabetes prevalence for 2013 and projections for 2035. Diabetes Res Clin Pract. 2014;103(2):137–49.
3. Lee R, Wong TY, Sabanayagam C. Epidemiology of diabetic retinopathy, diabetic macular edema and related vision loss. Eye Vis [Internet]. 2015 Sep 30 [cited 2020 Feb 7];2. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4657234/
4. Romero-Aroca P, de la Riva-Fernandez S, Valls-Mateu A, Sagarra-Alamo R, Moreno-Ribas A, Soler N. Changes observed in diabetic retinopathy: eight-year follow-up of a Spanish population. Br J Ophthalmol. 2016;100(10):1366–71.
5. Scanlon PH. The English National Screening Programme for diabetic retinopathy 2003–2016. Acta Diabetol. 2017;54(6):515–25.
6. Pandey R, Morgan MM, Murphy C, Kavanagh H, Acheson R, Cahill M, et al. Irish National Diabetic RetinaScreen Programme: report on five rounds of retinopathy screening and screen-positive referrals (INDEAR study report no. 1). Br J Ophthalmol. 2020; Published Online First: 17 December 2020.
7. Nguyen HV, Tan GSW, Tapp RJ, Mital S, Ting DSW, Wong HT, et al. Cost-effectiveness of a national telemedicine diabetic retinopathy screening program in Singapore. Ophthalmology. 2016;123(12):2571–80.
8. Gardner GG, Keating D, Williamson TH, Elliott AT. Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool. Br J Ophthalmol. 1996;80(11):940–4.
9. Hipwell JH, Strachan F, Olson JA, McHardy KC, Sharp PF, Forrester JV. Automated detection of microaneurysms in digital red-free photographs: a diabetic retinopathy screening tool. Diabet Med. 2000;17(8):588–94.
10. Hansen AB, Hartvig NV, Jensen MS, Borch-Johnsen K, Lund-Andersen H, Larsen M. Diabetic retinopathy screening using digital non-mydriatic fundus photography and automated image analysis. Acta Ophthalmol Scand. 2004;82(6):666–72.
11. Larsen M, Godt J, Larsen N, Lund-Andersen H, Sjølie AK, Agardh E, et al. Automated detection of fundus photographic red lesions in diabetic retinopathy. Invest Ophthalmol Vis Sci. 2003;44(2):761–6.
12. Abràmoff MD, Lou Y, Erginay A, Clarida W, Amelon R, Folk JC, et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest Ophthalmol Vis Sci. 2016;57(13):5200–6.
13. Xie Y, Gunasekeran DV, Balaskas K, Keane PA, Sim DA, Bachmann LM, et al. Health economic and safety considerations for artificial intelligence applications in diabetic retinopathy screening. Transl Vis Sci Technol. 2020;9(2):22.
14. Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit Med. 2018;1(1):1–8.
15. Van Der Heijden AA, Abramoff MD, Verbraak F, van Hecke MV, Liem A, Nijpels G. Validation of automated screening for referable diabetic retinopathy with the IDx-DR device in the Hoorn Diabetes Care System. Acta Ophthalmol (Copenh). 2018;96(1):63–8.
16. Haritoglou C, Kernt M, Neubauer A, Gerss J, Oliveira CM, Kampik A, et al. Microaneurysm formation rate as a predictive marker for progression to clinically significant macular edema in nonproliferative diabetic retinopathy. Retina. 2014;34(1):157–64.
17. Nunes S, Pires I, Rosa A, Duarte L, Bernardes R, Cunha-Vaz J. Microaneurysm turnover is a biomarker for diabetic retinopathy progression to clinically significant macular edema: findings for type 2 diabetics with nonproliferative retinopathy. Ophthalmologica. 2009;223(5):292–7.
18. Pappuru RK, Ribeiro L, Lobo C, Alves D, Cunha-Vaz J. Microaneurysm turnover is a predictor of diabetic retinopathy progression. Br J Ophthalmol. 2019;103(2):222–6.
19. Tufail A, Kapetanakis VV, Salas-Vega S, Egan C, Rudisill C, Owen CG, et al. An observational study to assess if automated diabetic retinopathy image assessment software can replace one or more steps of manual imaging grading and to determine their cost-effectiveness. Health Technol Assess. 2016;20(92):1–72.
20. Bhaskaranand M, Ramachandra C, Bhat S, Cuadros J, Nittala MG, Sadda SR, et al. The value of automated diabetic retinopathy screening with the EyeArt system: a study of more than 100,000 consecutive encounters from people with diabetes. Diabetes Technol Ther. 2019;21(11):635–43.
21. Solanki K, Bhaskaranand M, Bhat S, Ramachandra C, Cuadros J, Nittala MG, et al. Automated diabetic retinopathy screening: large-scale study on consecutive patient visits in a primary care setting. In: Diabetologia. New York: Springer; 2016. p. S64.
22. Ipp E, Shah VN, Bode BW, Sadda SR. 599-P: diabetic retinopathy (DR) screening performance of general ophthalmologists, retina specialists, and artificial intelligence (AI): analysis from a pivotal multicenter prospective clinical trial. Diabetes [Internet]. 2019 [cited 2020 Feb 26];68(Supplement 1). Available from: https://diabetes.diabetesjournals.org/content/68/Supplement_1/599-P
23. Heydon P, Egan C, Bolter L, Chambers R, Anderson J, Aldington S, et al. Prospective evaluation of an artificial intelligence-enabled algorithm for automated diabetic retinopathy screening of 30 000 patients. Br J Ophthalmol. 2020; bjophthalmol-2020-316594.
24. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
25. Krause J, Gulshan V, Rahimy E, Karth P, Widner K, Corrado GS, et al. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology. 2018;125(8):1264–72.
26. Gulshan V, Rajan RP, Widner K, Wu D, Wubbels P, Rhodes T, et al. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. 2019;137(9):987–93.
27. Hsieh Y-T, Chuang L-M, Jiang Y-D, Chang T-J, Yang C-M, Yang C-H, et al. Application of deep learning image assessment software VeriSee™ for diabetic retinopathy screening. J Formos Med Assoc. 2021;120(1, Part 1):165–71.
28. González-Gonzalo C, Sánchez-Gutiérrez V, Hernández-Martínez P, Contreras I, Lechanteur YT, Domanian A, et al. Evaluation of a deep learning system for the joint automated detection of diabetic retinopathy and age-related macular degeneration. Acta Ophthalmol (Copenh). 2020;98(4):368–77.
29. Ting DSW, Cheung CY-L, Lim G, Tan GSW, Quang ND, Gan A, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211–23.
30. Quellec G, et al. Instant automatic diagnosis of diabetic retinopathy. arXiv e-prints. 2019. https://arxiv.org/abs/1906.11875
31. Quellec G, et al. Automatic detection of rare pathologies in fundus photographs using few-shot learning. Med Image Anal. 2020;61:101660. https://doi.org/10.1016/j.media.2020.101660
32. Rajalakshmi R, Subashini R, Anjana RM, Mohan V. Automated diabetic retinopathy detection in smartphone-based fundus photography using artificial intelligence. Eye. 2018;32(6):1138–44.
33. Natarajan S, Jain A, Krishnan R, Rogye A, Sivaprasad S. Diagnostic accuracy of community-based diabetic retinopathy screening with an offline artificial intelligence system on a smartphone. JAMA Ophthalmol. 2019;137(10):1182–8.
34. Sosale B, Sosale AR, Murthy H, Sengupta S, Naveenam M. Medios—an offline, smartphone-based artificial intelligence algorithm for the diagnosis of diabetic retinopathy. Indian J Ophthalmol. 2020;68(2):391–5.
35. Sosale B, Aravind SR, Murthy H, Narayana S, Sharma U, SGV G, et al. Simple, mobile-based artificial intelligence algorithm in the detection of diabetic retinopathy (SMART) study. BMJ Open Diabetes Res Care. 2020;8(1):e000892.
36. Wintergerst MWM, Mishra DK, Hartmann L, Shah P, Konana VK, Sagar P, et al. Diabetic retinopathy screening using smartphone-based fundus imaging in India. Ophthalmology. 2020;127(11):1529–38.
37. Bawankar P, Shanbhag N, SS K, Dhawan B, Palsule A, Kumar D, et al. Sensitivity and specificity of automated analysis of single-field non-mydriatic fundus photographs by Bosch DR Algorithm—comparison with mydriatic fundus photography (ETDRS) for screening in undiagnosed diabetic retinopathy. PLoS One. 2017;12(12):e0189854.
38. Rogers TW, Gonzalez-Bueno J, Franco RG, Star EL, Marín DM, Vassallo J, et al. Evaluation of an AI system for the detection of diabetic retinopathy from images captured with a handheld portable fundus camera: the MAILOR AI study. Eye. 2020:1–7.
39. Sandhu HS, Eladawi N, Elmogy M, Keynton R, Helmy O, Schaal S, et al. Automated diabetic retinopathy detection using optical coherence tomography angiography: a pilot study. Br J Ophthalmol. 2018;102(11):1564–9.
40. Heisler M, Karst S, Lo J, Mammo Z, Yu T, Warner S, et al. Ensemble deep learning for diabetic retinopathy detection using optical coherence tomography angiography. Transl Vis Sci Technol. 2020;9(2):20.
41. Piyasena MMPN, Yip JL, MacLeod D, Kim M, Gudlavalleti VSM. Diagnostic test accuracy of diabetic retinopathy screening by physician graders using a hand-held non-mydriatic retinal camera at a tertiary level medical clinic. BMC Ophthalmol. 2019;19(1):89.
42. Estil S, Steinarsson ÆÞ, Einarsson S, Aspelund T, Stefánsson E. Diabetic eye screening with variable screening intervals based on individual risk factors is safe and effective in ophthalmic practice. Acta Ophthalmol (Copenh). 2020;98(4):343–6.
43. Lee AY, Yanagihara RT, Lee CS, Blazes M, Jung HC, Chee YE, et al. Multicenter, head-to-head, real-world validation study of seven automated artificial intelligence diabetic retinopathy screening systems. Diabetes Care. 2021;44(5):1168–75.
44. Grzybowski A, Brona P. Analysis and comparison of two artificial intelligence diabetic retinopathy screening algorithms in a pilot study: IDx-DR and RetinaLyze. J Clin Med. 2021;10(11):2352.
12 Google and DeepMind: Deep Learning Systems in Ophthalmology

Development Lifecycle

(Figure panels: Engaging graders, Preparing guidelines, Grading, Adjudication)

Fig. 12.1 The development lifecycle shows the stages of taking applied AI research through to deployment and beyond. Tasks within applied research and within labelling are expanded upon in greater detail. Applied research covers typical AI model development tasks, with data and label acquisition, and modeling (training, evaluating and testing). The medical grading process shown is potentially time consuming and directly influences the final model quality. Grading requires clear guidelines that are often the result of multiple iterations by medical experts. Data can then be graded either independently by several graders or potentially even adjudicated until consensus is reached between the graders
Fig. 12.2 Representative publications or announcements from Google along the development lifecycle for diabetic retinopathy screening:
• Gulshan et al. 2016, "Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs"
• Krause et al. 2018, "Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy"
• Smith-Morris et al. 2018, "Diabetic retinopathy and the cascade into vision loss"
• Bouskill et al. 2018, "Blind spots in telemedicine: a qualitative study of staff workarounds to resolve gaps in diabetes management"
• Gulshan et al. 2019, "Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India"
• Ruamviboonsuk et al. 2019, "Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program"
• Huston et al. 2019, "Quality control challenges in crowdsourcing medical labeling"
• Sayres et al. 2018, "Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy"
• Beede et al. 2020, "A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy"
• Verily Blog 2019, "Launching a powerful new screening tool for diabetic eye disease in India"
• Google The Keyword 2018, "AI for social good in Asia Pacific"
The first application of deep learning (DL) to the retina in Google was for DR screening [3]. DR has been the leading cause of preventable vision loss among the working-age population [12], remaining a global health burden [13]. Early detection and proper follow-up for timely treatment is the key to preventing irreversible vision loss from DR [14]. This necessitates scalable screening programs to cover the increasing global population with diabetes [15], and automated grading that could potentially improve the efficacy and availability of such screening programs. To achieve this, we applied DL using a neural network architecture called Inception-v3, shown to be effective for image classification of non-medical images (e.g. cats vs. dogs) [3]. In this study, Inception-v3 was used to detect DR using the 5-point International Clinical Diabetic Retinopathy scale [16]: none, mild, moderate, severe and proliferative (Fig. 12.3). Working together with our collaborators, we determined that the most clinically relevant model would detect referable DR at the level of moderate or above, the threshold at which follow-up visits to ophthalmologists are normally requested. This first work showed DR detection performance on par with general ophthalmologists, achieving an Area Under the Receiver-Operating Characteristic Curve (AUC) of 0.99 when evaluated against a reference standard determined by the majority opinion of US board-certified ophthalmologist graders [3].

Next, we observed that intergrader variability still exists (refer to the Grading subsection below), and the majority opinions were sometimes different from opinions arrived at after discussion within a panel of graders. This is because taking the majority can ignore "minority opinions" that reflect actual pathology. For example, if a single grader pointed out the existence of a subtle abnormality, other graders might change the grade after the discussion even though they had initially missed it independently. By tuning the model using more reliable labels based on such adjudicated grades, the algorithm further achieved a performance comparable to retina specialists (Fig. 12.4) [17].

Box 12.1 Terminology
• Artificial Intelligence (AI)
–– AI is a general term for the broad research field of developing intelligent systems.
• Machine Learning (ML)
–– Within the field of AI, ML describes algorithms that perform tasks requiring intelligence by learning from examples.
• Deep Learning (DL)
–– DL is a particular form of ML loosely inspired by biological neural networks, where algorithms process information through a complex network of adaptive artificial compute units ("neurons").
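The referable-DR threshold described above can be expressed as a simple reduction over a model's 5-class output. The sketch below is our own illustration of that mapping, not the published model's actual output head:

```python
# ICDR 5-point scale, indexed as for a 5-class softmax output.
ICDR_GRADES = ["none", "mild", "moderate", "severe", "proliferative"]
REFERABLE_THRESHOLD = 2  # "moderate" and above triggers referral

def referable_dr(class_probs):
    """Collapse 5-class DR probabilities into a binary referable-DR
    score by summing the probability mass at moderate or worse."""
    return sum(class_probs[REFERABLE_THRESHOLD:])

probs = [0.05, 0.15, 0.50, 0.20, 0.10]  # hypothetical softmax output
score = referable_dr(probs)             # approx. 0.80 -> refer
```

A downstream system would then compare this score against an operating point chosen for the desired sensitivity/specificity trade-off.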
Fig. 12.3 A deep learning system for detecting DR from CFPs. A CFP is used as input to Inception v3. The model is a deep neural network made up of building blocks that include convolutions, average pooling, max pooling, concats, dropouts, and fully connected layers. The output is the relative likelihood that the input image is each one of five grades of DR, and whether the image itself is gradable for DR. The 5-class output here is then separated via the dotted line to determine a referability result
164 X. Liu et al.
(ROC plot: Moderate or Worse Diabetic Retinopathy Model, AUC = 0.986; axes: sensitivity (%) vs. 1 − specificity (%); generalists and specialists shown as points)

Fig. 12.4 Model performance for detecting referable diabetic retinopathy. Receiver operating characteristic (ROC) curve for our DR model published in Krause et al. [17], demonstrating performance on par with retinal specialists. Also shown is the performance of generalists, assessed in our previous 2016 publication [3]
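The AUC values quoted here (0.99, 0.986) summarise the whole ROC curve in one number: the probability that a randomly chosen diseased case receives a higher model score than a randomly chosen healthy one (the Mann-Whitney formulation). A small self-contained sketch with made-up scores:

```python
def auc_from_scores(scores_pos, scores_neg):
    """AUC via the Mann-Whitney statistic: the fraction of
    (diseased, healthy) pairs ranked correctly, ties counting half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical referable-DR scores:
diseased = [0.9, 0.8, 0.75, 0.6]
healthy = [0.4, 0.3, 0.55, 0.2]
print(auc_from_scores(diseased, healthy))  # 1.0: perfect separation
```

An AUC of 0.5 corresponds to random ranking; production toolkits compute the same quantity more efficiently from sorted scores.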
To evaluate how the model would generalize in actual clinical settings, the model performance needs to be validated to ensure at least decent generalization to new data and populations. We have conducted two validation studies to date, a prospective study in India and a retrospective study in Thailand. The model performed on par with manual grading in both of these studies [18, 19], and another prospective study is under way in Thailand [20].

In addition to DR, diabetic screening programs must catch a wide range of common referrable ophthalmic diseases that may coexist in the diabetic population, including Age-related Macular Degeneration (AMD), glaucoma, and Retinal Vein Occlusion (RVO). Similar to how DR can manifest as hard exudates, AMD can present with lesions called drusen, and RVO with obstructions. Glaucoma, the second leading cause of blindness [21] and the leading cause of irreversible blindness [22] worldwide, is more challenging to diagnose due to ambiguities and subjectivity, and the diagnosis generally requires a number of other clinical data, such as visual fields. Fortunately, many signs of glaucoma-related neuropathy (such as a high cup-to-disc ratio, neuroretinal rim notching, or a retinal nerve fiber layer defect) are visible from a fundus photograph. In training a model for glaucoma, we also collected feature-level grades (e.g. vertical elongation of the optic cup, parapapillary atrophy, disc hemorrhage), in addition to referable glaucomatous optic neuropathy grades. We showed that a DL model's predictions of glaucoma suspect correlate well with glaucomatous neuropathy and actual glaucoma diagnoses [23].

Grading

Machine learning models require labeled data for both development and validation. In ophthalmology, these labels are generally obtained via grading by ophthalmologists. Both model training and performance evaluation are dependent on the quality of the grades provided. However, there is significant intergrader variability for DR grading (Fig. 12.5).

One central tenet of our approach to reducing grading variability is to create guidelines that result in consistent and reproducible grades for given diseases. This involves getting experts to grade a small set of cases on prototype guidelines, quantitatively evaluating their intergrader agreement, having the experts come together to discuss and resolve disagreements, and finally revising the guidelines for clarity and better alignment. This process is repeated until the agreement metrics (such as Krippendorff's alpha or Cohen's kappa) plateau.

Our experience further suggests that while the model training process is generally resilient to variability or "noise" in the training set, highly reliable grades are more important for the validation set, to ensure the ability to measure model performance precisely. If multiple graders reach a consensus via discussion, the grade is generally more reliable than simply taking the most frequent initial grade among them. However, having a face-to-face discussion or even an online meeting is often difficult in remote labelling settings. With our customized platform for grading, if desired, cases with disagreements can also be adjudicated by graders via asynchronous discussion to reach consensus [24].

Optical Coherence Tomography (OCT)

Despite being more expensive than CFP, OCT usage is growing in community eye care settings [25] because it enables the diagnosis of macular conditions with greater accuracy and the identification of early pathological changes. As such, the community use of OCT could lead to better management of patients. Regular remote follow-up via virtual clinics is also rapidly becoming a standard of care [26, 27]. However, such a shift to remote assessment may come at a cost. The lack of sufficient local clinical expertise has led to a high referral or false positive rate, and the increased workload and number of referrals can burden tertiary care sites. This problem is exacerbated by the increase in prevalence of sight-threatening diseases for which OCT is the gold standard of initial assessment [28].

AI offers a potential solution to this problem, by identifying abnormalities and by triaging scans to appropriate virtual clinics. To evaluate the potential of AI in this setting, we applied DL to triage macular OCT [4]. In this study the
Fig. 12.5 Intergrader variability. The variability between individual graders (columns) is shown for 19 cases (rows). All graders were board certified ophthalmologists. Each cell shows the DR grade from an individual grader. By looking along a whole row, such as those highlighted by the two black rectangles, one can see cases where there was significant variability between individuals
166 X. Liu et al.
Fig. 12.6 AI framework from OCT paper [4]. (a) Raw retinal OCT (pictured here with 6 × 6 × 2.3 mm3 centered at the macula). (b) Deep segmentation network, trained with manually segmented OCT scans. (c) Resulting tissue segmentation map. (d) Deep classification network, trained with tissue maps with confirmed diagnoses and optimal referral decisions. (e) Predicted diagnosis probabilities for each pathology and referral suggestions. (Reproduced from [4])
…substantially higher than retina specialists grading CFPs, irrespective of the location of any HEs found and their distance from the fovea in disc diameters (DD) [34]. (ROC plot: model AUC 0.89, 95% CI 0.87–0.91, compared against retina specialists' overall judgement and their judgements for HE < 500 microns, HE < 1 DD and HE < 2 DD; axes: sensitivity vs. 1 − specificity)
Further research is warranted to investigate if the need for such specialized measurements can be reduced by applying DL to common modalities.

…[36]. The ability to predict patients' risk of adverse events enables better matching of patients to treatment, a process known as risk stratification.
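Risk stratification can be illustrated with a small sketch. The quantile-based tiering below is a generic illustration (not the method of [36]): patients are ranked by a model's predicted risk and divided into groups, so that follow-up intensity can be matched to risk tier.

```python
import numpy as np

def risk_tiers(predicted_risks, n_groups=3):
    """Assign each patient a risk tier (0 = lowest) by quantile of predicted risk."""
    risks = np.asarray(predicted_risks, dtype=float)
    # interior quantile cut-points, e.g. the 33rd and 67th percentiles for tertiles
    edges = np.quantile(risks, np.linspace(0, 1, n_groups + 1)[1:-1])
    return np.searchsorted(edges, risks, side="right")

tiers = risk_tiers([0.10, 0.20, 0.30, 0.40, 0.50, 0.60])  # tiers 0, 0, 1, 1, 2, 2
```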
...
Fig. 12.8 Progression from early to exudative AMD. The images depict the progression of AMD over time in the fellow (i.e., other) eye of a patient who is already receiving injections for exudative AMD in their first eye. In this case, on diagnosis of exudative AMD (ex-AMD) in one eye at 0 months, the patient commences intravitreal therapy in the affected first eye and is followed up regularly for further treatment. The timing of each follow-up visit varies depending on the treatment regimen of the first eye as well as patient and clinic factors. At each visit, both CFP and OCT scans are captured for both eyes. The fellow eye converts to ex-AMD at 11 months, highlighted by the red box
be regularly screened (yearly or every 2 years) for DR, by the International Council of Ophthalmology (ICO) [44], American Academy of Ophthalmology (AAO) [45], American Diabetes Association (ADA) [46], etc. These screening programs provided longitudinal follow-up data that enabled research on predicting DR incidence and progression. The ability to risk-stratify diabetic patients for developing DR could potentially lead to personalized medication and lifestyle coaching, as well as early diagnosis and treatment to avoid vision loss [13, 47]. Preliminary results in identifying patients at high risk of developing DR have been described at the time of this writing ([48], accepted).

Predicting the Presence of Systemic Conditions

In addition to clinical applications in ophthalmology, recent work has shown the potential for detecting signs of systemic diseases. Many systemic conditions can manifest in the eye, the most prominent example being diabetes. More interestingly, the retina is the only organ where microvasculature is visible for visual examination or non-invasive imaging. We will next cover work on specific systemic diseases.

Cardiovascular Risk Factors

Identifying patients at high risk of future cardiovascular events is crucial in preventing cardiovascular diseases [49], the leading cause of death globally [50]. Cardiovascular risk factors may manifest in the eyes; for example, severe hypertension can lead to hypertensive retinopathy [51]. Poplin et al. showed that models developed on CFPs can accurately predict blood pressure, age, self-reported sex and other cardiovascular disease risk factors (Fig. 12.9) [53], and these findings were confirmed by other researchers on an external validation set [54]. This demonstrates the potential for further positive interactions between clinical practices for ophthalmic and systemic disease management.

Two of the findings were particularly unexpected: first, though age was known to affect the appearance of the vessels, quantifying age within an error range of a few years was not known to be possible; second, sex-associated differences in the retina were not previously known to appear in CFPs. To better understand how the models compute the predictions, we applied soft-attention [55] to produce heatmaps of where the model focused for each image (Fig. 12.10). The findings suggested that the models used different parts of the images for each task, e.g. age-related features were distributed over the whole eye, smoking-status-related features were along the blood vessels, and self-reported sex features were near the macula.

Fig. 12.9 Predictions of age and systolic blood pressure. (a) Predicted and actual age in the two validation datasets, UK Biobank [52] and EyePACS (http://www.eyepacs.com). (b) Predicted and actual systolic blood pressure on the UK Biobank validation dataset [52]. The diagonal lines represent perfect correlation between predicted and actual values. (Plot was recreated using data from [53])
12 Google and DeepMind: Deep Learning Systems in Ophthalmology 171

Anemia

Anemia is another example of a systemic disease that poses global public health problems [56].
Fig. 12.10 Attention maps. (a) The top left image is a sample color fundus photograph from the UK Biobank dataset. (b), (c) and (d) The remaining images show the same retinal image in black and white with the soft-attention heatmap overlaid in green, indicating the areas that the DL model is using to make the prediction for the image. (Plot was recreated using data from [53])
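The green overlay of Fig. 12.10 can be approximated in a few lines of NumPy. This is a sketch of the visualization step only; the attention map itself would come from the model, and the array shapes and green-channel choice merely mirror the figure's description:

```python
import numpy as np

def overlay_attention(gray_image, attention):
    """Overlay a model attention map on a grayscale image as a green tint.
    gray_image, attention: 2-D arrays with intensities in [0, 1]."""
    # normalize the attention map to [0, 1]
    attn = (attention - attention.min()) / (np.ptp(attention) + 1e-8)
    rgb = np.stack([gray_image] * 3, axis=-1).astype(float)
    rgb[..., 1] = np.clip(rgb[..., 1] + attn, 0.0, 1.0)  # boost green where attention is high
    return rgb
```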
Extending the previous work of predicting cardiovascular risk factors, Mitani et al. showed that ML models can predict blood hemoglobin levels and detect anemia from CFPs [57] (Fig. 12.11). In this study, multiple model explanation methods were applied to show that most of the attribution of the model goes to the disc and the blood vessels nearby. In addition, they used occlusion analysis to examine how performance is impacted by removing certain parts of the image. This analysis confirmed that the region around the optic disc is most crucial for predicting hemoglobin and anemia.

Challenges in Translating AI to Clinical Practice

We have described many applications that may, if deployed, improve patient care. To achieve this potential it is essential to couple developmental research with user research to ensure the models are meeting real clinical needs. Clinicians will need to be able to respond to model decisions appropriately, and to engage with models in a proactive way. While this is still an active area of research, a few specific examples help demonstrate the required work. We investigated the impact of DL models for DR on graders when presented as AI-based assistance, finding improvements in accuracy and confidence at the expense of grading time [29]. We further conducted an observational human-centered study [62] to identify socio-environmental factors that impact model performance in a number of real-world clinical deployments [63]. The insights from this research will ultimately inform clinical trials. Prospective evidence of efficacy [18] and additional studies on clinical impact, such as economic cost and patient outcomes, are also important in understanding clinical applicability.
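Occlusion analysis of the kind used in [57] can be sketched generically: mask part of the image, re-run the model, and measure how much the prediction degrades. The toy predictor below is purely illustrative (a stand-in for a trained network), but the masking step is the actual technique:

```python
import numpy as np

def occlude_disc(image, center, radius):
    """Return a copy of the image with a circular region replaced by the image mean."""
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2
    out = image.copy()
    out[mask] = image.mean()
    return out

def toy_model(image, disc_center=(8, 8), radius=3):
    """Illustrative stand-in for a trained model: reads brightness near the 'optic disc'."""
    yy, xx = np.ogrid[:image.shape[0], :image.shape[1]]
    region = (yy - disc_center[0]) ** 2 + (xx - disc_center[1]) ** 2 <= radius ** 2
    return float(image[region].mean())

fundus = np.zeros((16, 16))
fundus[4:13, 4:13] = 1.0                                # bright "disc" region
baseline = toy_model(fundus)                            # prediction with the disc visible
occluded = toy_model(occlude_disc(fundus, (8, 8), 4))   # prediction with the disc masked
# a large drop (baseline - occluded) indicates the masked region matters to the model
```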
Fig. 12.11 Saliency maps and effects of occluding parts of the image on the prediction of anemia from CFPs. (a) Example CFP from UK Biobank. (b) Saliency map for predicting anemia using GradCAM [58], (c) smooth integrated gradients [59, 60], (d) guided backprop [61], (e) effects of masking top and bottom of the image on prediction of anemia (in orange) and moderate anemia (in blue), (f) masked CFP example for (e), (g) same as (e) for masking the center core, (h) same as (f), CFP example for (g). Proportions of the blocked area for (f) and (h) were chosen to have AUC for predicting anemia close to 0.80, to illustrate the proportions that need to be occluded to result in a similar performance decrease, highlighting the importance of the optic disc in detecting anemia from CFPs. (Plot was recreated using data from [57])
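Of the saliency methods in Fig. 12.11, Grad-CAM [58] is the simplest to sketch: each channel of a convolutional layer's activations is weighted by its average gradient, and only positively contributing regions are kept. The arrays below are placeholders for a real network's activations and gradients:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer.
    activations, gradients: arrays of shape (H, W, C)."""
    alpha = gradients.mean(axis=(0, 1))                      # per-channel importance
    cam = np.tensordot(activations, alpha, axes=([2], [0]))  # weighted sum over channels
    return np.maximum(cam, 0.0)                              # ReLU: keep positive evidence

acts = np.ones((2, 2, 3))
grads = np.stack([np.full((2, 2), g) for g in (1.0, -1.0, 0.5)], axis=-1)
heatmap = grad_cam(acts, grads)  # each location: 1*1 + 1*(-1) + 1*0.5 = 0.5
```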
As illustrated by these studies, it is crucial not only to show that AI is accurate on retrospective datasets, but also to demonstrate how to implement such techniques in a way that can benefit everyone in practice [64]. Regulatory bodies have to balance safety and innovation, and standards have not yet been well established for this new technology. For each application, the evaluation has to be clinically meaningful. Comparison methods between algorithms should be established, including how to choose a representative test set, so that users can optimize the selection for the target use-cases and monitor any potential performance shifts even after the initial decision. In addition, further research is needed to understand how and when AI-based systems can introduce bias or even fail, so that we can monitor for and mitigate potential harm. For example, the system can learn to use confounders that further reinforce existing bias, or may learn features only helpful for specific populations. Alternatively, if it learned a spurious correlation that exists only in the original datasets, the system may underperform when applied elsewhere. Improving the interpretability of the models and careful subgroup analysis would help identify such issues and contribute to building more transparent and trustworthy systems that serve the global population.

Summary

In this chapter we have provided a brief overview of the contributions from Google and DeepMind to the field of AI in Ophthalmology. We have discussed several different phases, from early research and development (see also Box 12.3) to implementations in clinical trials, with examples of each. Finally, we have demonstrated the potential of AI for Scientific Discovery. A considerable amount of work remains ahead to answer fundamental questions around topics such as bias, uncertainty, safety, interpretability and generalizability. As these hurdles and challenges are overcome, we are confident that AI will transform patient care in ophthalmology.

Box 12.3 Open Source Software for AI
• Google and DeepMind contribute open-source tools and models to the community:
  – TensorFlow framework (www.tensorflow.org) for model development and serving [65]
  – Google Colaboratory (colab.research.google.com) for data analysis and preparation [66]
  – Model architecture used for predicting conversion to wet age-related macular degeneration using deep learning in [43, 67]
• We also use common open-source software and models:
  – Inception-v3 network architecture ([68], available in TensorFlow as InceptionV3 [69]) for fundus models
  – UNet [70] architecture for OCT segmentation
  – NumPy [71]
  – Pandas
  – Matplotlib
  – Seaborn
  – SciPy
  – Scikit-learn
• Proprietary custom software is used for other tasks specific to our computing infrastructure, such as medical data preprocessing and management, scalable image grading, distributed training of DL models, runtime inference and model serving; most of this is available in similar third-party offerings.

Acknowledgements We would like to thank Y. Liu, D. Webster, O. Ronneberger and P. Kohli for their guidance and feedback.
27. Whited JD, et al. A modeled economic analysis of a digital teleophthalmology system as used by three federal healthcare agencies for detecting proliferative diabetic retinopathy. Telemed e-Health. 2005;11:641–51. https://doi.org/10.1089/tmj.2005.11.641
28. Bourne RRA, et al. Magnitude, temporal trends, and projections of the global prevalence of blindness and distance and near vision impairment: a systematic review and meta-analysis. Lancet Glob Health. 2017;5:e888–97. https://www.thelancet.com/journals/langlo/article/PIIS2214-109X(17)30293-0/fulltext
29. Sayres R, et al. Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy. Ophthalmology. 2019;126(4):552–64. https://www.aaojournal.org/article/S0161-6420(18)31575-6/fulltext
30. Mitchell P, et al. Cost-effectiveness of ranibizumab in treatment of diabetic macular oedema (DME) causing visual impairment: evidence from the RESTORE trial. Br J Ophthalmol. 2012;96:688–93. https://bjo.bmj.com/content/96/5/688
31. Romero-Aroca P. Managing diabetic macular edema: the leading cause of diabetes blindness. World J Diabetes. 2011;2(6):98–104. https://www.ncbi.nlm.nih.gov/pubmed/21860693
32. Mackenzie S, et al. SDOCT imaging to identify macular pathology in patients diagnosed with diabetic maculopathy by a digital photographic retinal screening programme. PLoS One. 2011;6(5):e14811. https://doi.org/10.1371/journal.pone.0014811
33. Wong RL, et al. Are we making good use of our public resources? The false-positive rate of screening by fundus photography for diabetic macular oedema. Hong Kong Med J. 2017;23(4):356–64. https://pubmed.ncbi.nlm.nih.gov/28684650/
34. Varadarajan AV, et al. Predicting optical coherence tomography-derived diabetic macular edema grades from fundus photographs using deep learning. Nat Commun. 2020;11(130). https://www.nature.com/articles/s41467-019-13922-8
35. D'Agostino RB, et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation. 2008;117(6):743–53. https://www.ncbi.nlm.nih.gov/pubmed/18212285
36. Tomašev N, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature. 2019;572:116–9. https://www.nature.com/articles/s41586-019-1390-1
37. Wong WL, et al. Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. Lancet Glob Health. 2014;2(2):e106–16.
38. Lim JH, et al. Delay to treatment and visual outcomes in patients treated with anti-vascular endothelial growth factor for age-related macular degeneration. Am J Ophthalmol. 2012;153(4):678–86. https://www.ajo.com/article/S0002-9394(11)00721-5
39. Action on AMD. Optimising patient management: act now to ensure current and continual delivery of best possible patient care. Eye. 2020;26(S1). https://www.nature.com/articles/eye2011342
40. Heier JS. IAI versus sham as prophylaxis against conversion to neovascular AMD (PRO-CON). clinicaltrials.gov. https://clinicaltrials.gov/ct2/show/NCT02462889
41. Southern California Desert Retina Consultants, MC. Prophylactic ranibizumab for exudative age-related macular degeneration (PREVENT). clinicaltrials.gov. https://clinicaltrials.gov/ct2/show/NCT02140151
42. Babenko B, et al. Predicting progression of age-related macular degeneration from fundus images using deep learning. arXiv, Apr 2019. https://arxiv.org/pdf/1904.05478.pdf
43. Yim J, et al. Predicting conversion to wet age related macular degeneration using deep learning. Nat Med. 2020. https://www.nature.com/articles/s41591-020-0867-7
44. International Council of Ophthalmology. ICO guidelines for diabetic eye care. 2017. http://www.icoph.org/downloads/ICOGuidelinesforDiabeticEyeCare.pdf
45. AAO PPP Retina/Vitreous Committee, Hoskins Center for Quality Eye Care. Diabetic Retinopathy PPP 2019. 2019. https://www.aao.org/preferred-practice-pattern/diabetic-retinopathy-ppp
46. Solomon SD. Diabetic retinopathy: a position statement by the American Diabetes Association. Diabetes Care. 2017;40(3):412–8. https://care.diabetesjournals.org/content/40/3/412
47. Dornhorst A, Merrin PK. Primary, secondary and tertiary prevention of non-insulin-dependent diabetes. Postgrad Med J. 1994;70(826):529–35. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2397691
48. Bora A, et al. Deep learning for predicting the progression of diabetic retinopathy using fundus images. ARVO Abstract, 2020.
49. Goff DC, et al. ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation. 2014;129:S49–73.
50. WHO. The top 10 causes of death. 2018. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
51. Wong TY, Mitchell P. Hypertensive retinopathy. N Engl J Med. 2004;22(351):2310–7.
52. Sudlow C, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779.
53. Poplin R, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2:158–64. https://www.nature.com/articles/s41551-018-0195-0
54. Ting DSW, Wong TY. Eyeing cardiovascular risk factors. Nat Biomed Eng. 2018;2:140–1. https://www.nature.com/articles/s41551-018-0210-5
55. Xu K, et al. Show, attend and tell: neural image caption generation with visual attention. 2015. https://arxiv.org/abs/1502.03044
56. McLean E, et al. Worldwide prevalence of anaemia, WHO vitamin and mineral nutrition information system, 1993–2005. Public Health Nutr. 2009;12(4):444–54. https://pubmed.ncbi.nlm.nih.gov/18498676/
57. Mitani A, et al. Detection of anaemia from retinal fundus images via deep learning. Nat Biomed Eng. 2020;4:18–27. https://www.nature.com/articles/s41551-019-0487-z
58. Selvaraju RR, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 2019;128:336–59.
59. Smilkov D, et al. SmoothGrad: removing noise by adding noise. 2017. https://arxiv.org/abs/1706.03825
60. Sundararajan M, et al. Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning, 2017, p. 3319–28.
61. Springenberg JT, et al. Striving for simplicity: the all convolutional net. 2014. https://arxiv.org/abs/1412.6806
62. Jaimes A, et al. Human-centered computing: toward a human revolution. Computer. 2007;40(5):30–4.
63. Beede E, et al. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, p. 1–12. https://doi.org/10.1145/3313831.3376718
64. Kelly CJ, et al. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17:195. https://doi.org/10.1186/s12916-019-1426-2
65. Abadi M, et al. TensorFlow: a system for large-scale machine learning. In: OSDI '16: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, 2016, p. 265–83. https://doi.org/10.5555/3026877.3026899
66. Carneiro T, et al. Performance analysis of Google Colaboratory as a tool for accelerating deep learning applications. IEEE Access. 2018;6:61677–85. https://ieeexplore.ieee.org/abstract/document/8485684
67. Google Health. Model architecture for predicting conversion to wet age related macular degeneration using deep learning. https://github.com/google-health/imaging-research/wet-amd-prediction
68. Szegedy C, et al. Rethinking the inception architecture for computer vision. Comput Vis Pattern Recognit. 2016. https://www.researchgate.net/publication/306281834_Rethinking_the_Inception_Architecture_for_Computer_Vision
69. InceptionV3. https://www.tensorflow.org/api_docs/python/tf/keras/applications/InceptionV3
70. Ronneberger O, et al. U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, p. 234–41. https://doi.org/10.1007/978-3-319-24574-4_28
71. van der Walt S, et al. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng. 2011;13(2):22–30. https://www.researchgate.net/publication/224223550_The_NumPy_Array_A_Structure_for_Efficient_Numerical_Computation
72. Bouskill KE, et al. Blind spots in telemedicine: a qualitative study of staff workarounds to resolving gaps in chronic disease care. BMC Health Serv Res. 2018;18:617. https://research.google/pubs/pub47345/
73. Google. AI for social good in Asia Pacific. The Keyword, Dec 2018. https://www.blog.google/around-the-globe/google-asia/ai-social-good-asia-pacific
74. Google Research. TensorFlow: large-scale machine learning on heterogeneous distributed systems. 2015. https://www.tensorflow.org/about/bib
75. Hutson M, et al. Quality control challenges in crowdsourcing medical labeling. 2019. https://research.google/pubs/pub48327/
76. Schaekermann M, et al. Expert discussions improve comprehension of difficult cases in medical image assessment. In: CHI Conference on Human Factors in Computing Systems (CHI '20), April 25–30, 2020, Honolulu, HI. New York: ACM; 2020. https://doi.org/10.1145/3313831.3376290
77. Shlens J. Train your own image classifier with Inception in TensorFlow. Google AI Blog, 9 Mar 2016. https://ai.googleblog.com/2016/03/train-your-own-image-classifier-with.html. Accessed 6.5.2020
78. Smith-Morris C, et al. Diabetic retinopathy and the cascade into vision loss. Med Anthropol. 2020;39(2):109–22. https://pubmed.ncbi.nlm.nih.gov/29338335/
79. Verily. Launching a powerful new screening tool for diabetic eye disease in India. Verily Blog; 2019. https://blog.verily.com/2019/02/launching-powerful-new-screening-tool.html. Accessed Apr 2020.
13 Singapore Eye Lesions Analyzer (SELENA): The Deep Learning System for Retinal Diseases

David Chuen Soong Wong, Grace Kiew, Sohee Jeon, and Daniel Ting
In order to implement these DL systems, multiple stages of testing must be carried out to confirm the efficacy of the methods and to ensure patient safety. The Singapore Eye Lesion Analyzer (SELENA) was developed and tested at the Singapore Eye Research Institute (SERI) in 2017 [15] to aid tertiary care referral decisions based on fundoscopy photographs taken during routine screening for diabetic retinopathy (DR). The conception of this system showcases some of the remarkable capabilities of deep learning in Ophthalmology and healthcare more widely. In this chapter we discuss the initial development of SELENA in Singapore, its testing on African populations, and the detection of cardiovascular risk factors.

SELENA: Development, Validation and Testing

SELENA is also able to detect other leading causes of blindness globally: possible glaucoma and age-related macular degeneration (AMD). The first objective of the project was to train and validate this system to detect these diseases by analysing retinal fundus images obtained from community-based DR screening of patients in Singapore. Further external validation of the performance of the system with referable DR was carried out using datasets from 10 multi-ethnic populations collected from different countries with diverse community and hospital-based diabetic populations. Another objective of the project was to determine how SELENA might fit in two different models of DR screening: fully automated (in countries without national screening programmes) or assistive (semi-automated, as referable cases detected by SELENA were assessed by humans).
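The two screening models can be sketched as a simple decision rule. The threshold value and labels below are illustrative placeholders, not SELENA's published operating points:

```python
def triage(referable_score, mode="fully_automated", threshold=0.5):
    """Route a screening result under the two deployment models.
    fully_automated: the model's decision is final.
    assistive: model-flagged cases are routed to a human grader."""
    referable = referable_score >= threshold
    if mode == "fully_automated":
        return "refer to tertiary care" if referable else "routine rescreening"
    if mode == "assistive":
        return "human grader review" if referable else "routine rescreening"
    raise ValueError(f"unknown mode: {mode}")
```

In assistive mode the model acts as a filter, so human graders review only the minority of cases the model flags as referable.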
[Figure: network diagram with labelled panels (a)–(f): template retinal fundus photograph, network modules (input map, weight map/kernel, convolutional map, max-pooling), fully-connected layer and output layer; ensembled image scores for DR, AMD and glaucoma, plus ungradable and non-retinal scores, combined into a referable diagnosis]

Fig. 13.1 Convolutional neural network architecture of SELENA. The algorithm consists of 8 modified variants of the VGG-19 CNN. Figure reproduced with permission from Ting et al. 2017 (JAMA) [15]. Please see the original paper for full details. Briefly, steps (a) to (f) consist of the processing of a template image (a) by a deep CNN consisting of a succession of network modules, each containing a series of convolutional maps. This results in a final output node (f) for each class trained for. For the classification of severity, a second deep CNN was provided locally contrast-normalized images (g) as input; the final disease severity score is then the mean of the outputs. This was repeated for DR, AMD and glaucoma. Additional CNNs were trained to reject images for insufficient image quality, as well as for being invalid input (i.e. not being a retinal image)
referable glaucoma, two networks identified referable AMD, one network assessed image quality, and one network rejected invalid non-retinal images. Each CNN was trained by progressive exposure to randomly selected images from the training set together with the ground truth as determined by the human labellers, thus gradually learning the appropriate features (i.e. modifying the weight values) for classification via gradient descent.

Results

In the primary validation dataset there were 71,896 images from 14,880 patients, with a mean age of 60.2 (SD 2.2), and 54.6% were men. Within this cohort, the prevalence of referable DR was 3%, vision-threatening DR 0.6%, possible glaucoma 0.1% and AMD 2.5%. Standard measures of performance were used to show that for referable DR, SELENA achieved an AUC of 0.936 (95% CI, 0.925–0.943), sensitivity of 90.5% (95% CI, 87.3%–93.0%), and specificity of 91.6% (95% CI, 91.0%–92.2%). For vision-threatening DR, the statistics showed greater AUC and sensitivity and slightly lower specificity: AUC of 0.958 (95% CI, 0.956–0.961), sensitivity of 100% (95% CI, 94.1%–100.0%), and specificity of 91.1% (95% CI, 90.7%–91.4%). The key statistics are shown in Table 13.1. This performance was generalizable, as the AUCs from the 10 external validation sets ranged between 0.889 and 0.983.

It is difficult to compare the typical statistical measures of performance between deep learning systems due to the different datasets used for training and validation, and the different reference standards applied [9]. It therefore follows that statistical measures of performance must be supple-
180 D. C. S. Wong et al.
Table 13.1 Performance of SELENA on the primary validation dataset, showing area under the receiver-operating
curve (AUC), sensitivity and specificity
Disease category AUC (95% CI) Sensitivity (95% CI) Specificity (95% CI)
Referable DR 0.936 (0.925–0.943) 90.5 (87.3–93.0) 91.6 (91.0–92.2)
VTDR 0.958 (0.956–0.961) 100 (94.1–100.0) 91.1 (90.7–91.4)
Possible glaucoma 0.942 (0.929–0.954) 96.4 (81.7–99.9) 87.2 (86.8–87.5)
AMD 0.931 (0.928–0.935) 93.2 (91.1–99.8) 88.7 (88.3–89.0)
DR diabetic retinopathy, VTDR vision-threatening diabetic retinopathy, AMD age-related macular degeneration,
CI confidence interval
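The statistics in Table 13.1 can be reproduced from raw predictions with a few lines of code. The sketch below (with made-up labels and scores) computes AUC via the Mann-Whitney formulation and a Wilson 95% interval for a proportion such as sensitivity; note that confidence intervals for AUC itself are usually obtained by bootstrapping instead:

```python
import math
import numpy as np

def auc(labels, scores):
    """AUC = probability that a random positive outscores a random negative (ties count 0.5)."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # all positive-vs-negative comparisons
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def wilson_ci_95(successes, total, z=1.96):
    """Wilson score interval for a proportion, e.g. sensitivity = TP / (TP + FN)."""
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # -> 0.75
```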
[Figure: network modules taking template retinal fundus images as input and producing an ensembled image score]

Fig. 13.2 Modified network architecture of SELENA. The algorithm was modified to include a ResNet model to support more layers and thus analyse more image features. Figure reproduced from Bellemo et al. 2019 (Lancet Digital Health) [27]
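The ResNet modification referenced in Fig. 13.2 rests on one idea: residual (skip) connections, which add a layer's input back to its output so that very deep stacks remain trainable. A minimal sketch, where the inner transformation `f` stands in for the convolutional layers:

```python
import numpy as np

def residual_block(x, f):
    """Residual connection: output = ReLU(f(x) + x).
    The identity path lets gradients bypass f, easing training of deeper networks."""
    return np.maximum(f(x) + x, 0.0)

# with a zero transformation, the block reduces to a ReLU of the input itself
out = residual_block(np.array([1.0, -2.0]), lambda v: np.zeros_like(v))
```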
variate analysis was performed to look at sys- tion with a completely different ethnic composi-
temic risk factors for referable DR for the AI tion. This is especially relevant as the greatest
model and human graders; both identified the need for AI applications is in third world coun-
same risk factors of longer duration of diabetes, tries with poor healthcare resources and a short-
higher level of glycated haemoglobin and age of health professionals, where patients
increased systolic blood pressure as associated struggle to access healthcare expertise. Low-
with referable DR. AUC for systemic risk factors resource countries stand to gain the most from
for the AI model (0.723 (95% CI 0.691–0.754)) the application of AI systems to healthcare
and human graders (0.741 (95% CI 0.710–0.771)) screening programmes, and this study represents
were comparable (p = 0.432). a potential way forward where the AI model
This study provides further evidence for the replaces human graders in identifying patients
clinical feasibility of AI applications in popula- requiring referral for further assessment and
tion screening programmes, with the AI system treatment. The implications of this are far-
performing to an acceptable clinical standard reaching, as AI systems may be able to perform
comparable to human graders in a prospectively screening comparable to that of human graders in
recruited population in a resource-poor country a fraction of the time previously needed and with
despite having been trained on a different popula- minimal human resource consumption, allowing
182 D. C. S. Wong et al.
development of screening programmes previ- and the SiMES, SINDI, and SCES images at the
ously limited by lack of manpower and resources Blue Mountain Eye Study reading centre in
in countries such as Zambia. Ophthalmologists in Australia by non-ophthalmologists. Images from
these areas can then focus their resources and Beijing and Hong Kong were graded by general
time towards treating cases with sight-threatening ophthalmologists and retinal specialists, respec-
diabetic retinopathy. tively. The average time per image for human
However, lack of resources in countries such assessors was 2–5 min.
as Zambia may also pose other challenges to the The input for the DL system was 76,370 optic
development of an AI-based screening pro- disc– and macula-centred retinal images. The
gramme due to the poor telecommunication infra- output nodes were the individual DR severity lev-
structure in place. A feasible strategy would have els according to International Clinical Diabetic
to take into account the need for computing power Retinopathy Severity Scale (ICDRSS) classifica-
and telecommunication network requirements, either by providing the necessary elements for the AI system to be implemented, or alternatively by using the AI system as a standalone system or integrating it with the retinal cameras to be used. Moreover, screening programmes by themselves do not improve a population's health; a robust treatment strategy for those identified by the screening programme would also need to be in place to make a significant difference to the health of the general population.

Systemic Vascular Risk Factor Associations

DL systems may also be very useful in the research setting. Epidemiologic research in particular would benefit from DL systems because vast amounts of data may be efficiently analysed. To explore potential roles in this space, the performance of SELENA was compared to that of human assessors in reviewing retinal images for DR prevalence and risk factors [28].

A total of 93,293 retinal images from 18,912 patients of multiple races were collected from the Singapore Integrated Diabetic Retinopathy Screening Program (SiDRP), Singapore Malay Eye Study (SiMES), Singapore Indian Eye Study (SINDI), Singapore Chinese Eye Study (SCES) [29], Beijing Eye Study (BES) [30], African American Eye Study (AFEDS) [31], Chinese University of Hong Kong [32], and Diabetes Management Project Melbourne (DMP Melb) [33]. The SiDRP, AFEDS, and DMP images were graded at the Singapore Eye Research Institute according to the International Clinical Diabetic Retinopathy Severity Scale (ICDRSS) classification [34]. The input images were composed of 88.3% normal retina, 6.4% mild non-proliferative DR (NPDR), 3.8% moderate NPDR, and 1.5% vision-threatening DR (VTDR; severe NPDR and proliferative DR). To train the DL system with the ICDRSS classification, the weights of the DL system were adjusted with stochastic gradient descent. For validation, the DL model predicted a raw confidence score for each severity-level output node. The scores were linearly weighted to produce a single image-level DR score; two separate models were used, one trained on the original image and one on its contrast-equalized version, and their outputs were averaged into an eye-level DR score. The DR score was translated into a DR grading according to previously specified score thresholds. Patients were classified as ungradable and excluded from analysis when both eyes were ungradable; additional manual grading was done on DL-system-ungradable images. The time taken to pre-process and analyse retinal images using a graphics processing unit (GPU) was recorded for eight datasets.

The AUC and the level of agreement in detection of the three outcomes were calculated with the human assessor grading as reference. The AUC was 0.863 (95% CI, 0.854–0.871) for any DR, 0.963 (95% CI, 0.956–0.969) for referable DR, and 0.950 (95% CI, 0.940–0.959) for VTDR. Human assessors estimated the prevalence of any DR, referable DR, and VTDR as 15.9%, 6.5%, and 4.1%, while the DLS estimated 16.1%, 6.4%, and 3.7% (P = 0.59, 0.46, and 0.07), respectively.

Patient demographics and systemic risk factors (e.g. age, sex, ethnicity, diabetes duration,
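The image-to-eye-level score fusion described above can be sketched in a few lines. Everything numeric here is an illustrative placeholder — the class weights and grade thresholds are NOT the values used by SELENA:

```python
# Sketch of the described fusion: per-image severity confidences are linearly
# weighted into a scalar DR score, the original-image and contrast-equalized
# models are averaged per eye, and the eye-level score is thresholded into a
# grade. All weights/thresholds are illustrative placeholders, not SELENA's.

CLASS_WEIGHTS = [0.0, 0.25, 0.5, 0.75, 1.0]    # none .. proliferative DR

GRADE_THRESHOLDS = [                            # highest threshold <= score wins
    (0.8, "vision-threatening DR"),
    (0.5, "referable DR"),
    (0.2, "any DR"),
    (0.0, "no DR"),
]

def image_score(confidences):
    """Collapse the per-severity confidence scores into one image-level score."""
    return sum(w * c for w, c in zip(CLASS_WEIGHTS, confidences))

def eye_score(conf_original, conf_contrast_eq):
    """Average the two models' image-level scores into an eye-level score."""
    return 0.5 * (image_score(conf_original) + image_score(conf_contrast_eq))

def grade(score):
    """Translate the eye-level DR score into a grade via fixed thresholds."""
    for threshold, label in GRADE_THRESHOLDS:
        if score >= threshold:
            return label

s = eye_score([0.1, 0.1, 0.2, 0.3, 0.3], [0.2, 0.1, 0.2, 0.2, 0.3])
print(grade(s))  # -> referable DR
```

The thresholding step is where a deployment would encode the "previously specified score thresholds" mentioned above; in the study, images the system could not grade were routed to manual grading rather than forced through this mapping.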
13 Singapore Eye Lesions Analyzer (SELENA): The Deep Learning System for Retinal Diseases

myopia. The group is also pursuing the integration of genetic, epigenetic and proteomic information in the SELENA algorithm, with the hope of ushering in a new era of personalised care for patients, ultimately improving health outcomes for people all around the world.

References

1. Moor J. The Dartmouth College artificial intelligence conference: the next fifty years. AI Magazine. 2006:87–91.
2. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
3. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks; 2012.
4. Le QV, Ranzato MA, Monga R, Devin M, Chen K, Corrado GS, et al. Building high-level features using large scale unsupervised learning; 2012.
5. Raina R, Madhavan A, Ng AY. Large-scale deep unsupervised learning using graphics processors; 2009.
6. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44–56.
7. Ting DSW, Lin H, Ruamviboonsuk P, Wong TY, Sim DA. Artificial intelligence, the internet of things, and virtual clinics: ophthalmology at the digital translation forefront. Lancet Digital Health. 2020;2(1):e8–9.
8. Ting DSW, Pasquale LR, Peng L, Campbell JP, Lee AY, Raman R, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019;103(2):167–75.
9. Ting DSW, Peng L, Varadarajan AV, Keane PA, Burlina PM, Chiang MF, et al. Deep learning in ophthalmology: the technical and clinical considerations. Prog Retin Eye Res. 2019;72:100759.
10. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017;124(7):962–9.
11. Abràmoff MD, Lou Y, Erginay A, Clarida W, Amelon R, Folk JC, et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Invest Ophthalmol Vis Sci. 2016;57(13):5200–6.
12. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
13. Grassmann F, Mengelkamp J, Brandl C, Harsch S, Zimmermann ME, Linkohr B, et al. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology. 2018;125(9):1410–20.
14. Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 2017;135(11):1170–6.
15. Ting DSW, Cheung CY, Lim G, Tan GSW, Quang ND, Gan A, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211–23.
16. Li Z, He Y, Keel S, Meng W, Chang RT, He M. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology. 2018;125(8):1199–206.
17. Brown JM, Campbell JP, Beers A, Chang K, Ostmo S, Chan RP, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol. 2018;136(7):803–10.
18. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2(3):158.
19. Lee CS, Tyring AJ, Deruyter NP, Wu Y, Rokem A, Lee AY. Deep-learning based, automated segmentation of macular edema in optical coherence tomography. Biomed Opt Express. 2017;8(7):3440–8.
20. De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342.
21. Kermany DS, Goldbaum M, Cai W, Valentim CC, Liang H, Baxter SL, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–31.e9.
22. Yau JWY, Rogers SL, Kawasaki R, Lamoureux EL, Kowalski JW, Bek T, et al. Global prevalence and major risk factors of diabetic retinopathy. Diabetes Care. 2012;35(3):556.
23. Taylor HR. Global blindness: the progress we are making and still need to make. Asia-Pac J Ophthalmol. 2019;8(6).
24. Flaxman SR, Bourne RR, Resnikoff S, Ackland P, Braithwaite T, Cicinelli MV, et al. Global causes of blindness and distance vision impairment 1990–2020: a systematic review and meta-analysis. Lancet Glob Health. 2017;5(12):e1221–e34.
25. Chua J, Lim CXY, Wong TY, Sabanayagam C. Diabetic retinopathy in the Asia-Pacific. Asia-Pac J Ophthalmol. 2018;7(1):3–16.
26. Ting DSW, Cheung GCM, Wong TY. Diabetic retinopathy: global prevalence, major risk factors, screening practices and public health challenges: a review. Clin Exp Ophthalmol. 2016;44(4):260–77.
27. Bellemo V, Lim ZW, Lim G, Nguyen QD, Xie Y, Yip MY, et al. Artificial intelligence using deep learning to screen for referable and vision-threatening diabetic retinopathy in Africa: a clinical validation study. Lancet Digital Health. 2019;1(1):e35–44.
28. Ting DSW, Cheung CY, Nguyen Q, Sabanayagam C, Lim G, Lim ZW, et al. Deep learning in estimating prevalence and systemic risk factors for diabetic retinopathy: a multi-ethnic study. npj Digital Medicine. 2019:1–8.
29. Tan GS, Gan A, Sabanayagam C, Tham YC, Neelam K, Mitchell P, et al. Ethnic differences in the prevalence and risk factors of diabetic retinopathy: the Singapore Epidemiology of Eye Diseases Study. Ophthalmology. 2018;125(4):529–36.
30. Xu J, Xu L, Wang YX, You QS, Jonas JB, Wei WB. Ten-year cumulative incidence of diabetic retinopathy. The Beijing Eye Study 2001/2011. PLoS One. 2014;9(10):e111320.
31. McKean-Cowdin R, Fairbrother-Crisp A, Torres M, Lastra C, Choudhury F, Jiang X, et al. The African American eye disease study: design and methods. Ophthalmic Epidemiol. 2018;25(4):306–14.
32. Tang FY, Ng DS, Lam A, Luk F, Wong R, Chan C, et al. Determinants of quantitative optical coherence tomography angiography metrics in patients with diabetes. Sci Rep. 2017;7(1):2575.
33. Lamoureux EL, Fenwick E, Xie J, McAuley A, Nicolaou T, Larizza M, et al. Methodology and early findings of the diabetes management project: a cohort study investigating the barriers to optimal diabetes care in diabetic patients with and without diabetic retinopathy. Clin Exp Ophthalmol. 2012;40(1):73–82.
34. Wilkinson CP, Ferris FL 3rd, Klein RE, Lee PP, Agardh CD, Davis M, et al. Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales. Ophthalmology. 2003;110(9):1677–82.
35. Jones CD, Greenwood RH, Misra A, Bachmann MO. Incidence and progression of diabetic retinopathy during 17 years of a population-based screening program in England. Diabetes Care. 2012;35(3):592–6.
36. Thomas RL, Dunstan F, Luzio SD, Roy Chowdury S, Hale SL, North RV, et al. Incidence of diabetic retinopathy in people with type 2 diabetes mellitus attending the diabetic retinopathy screening service for Wales: retrospective analysis. BMJ. 2012;344:e874.
14 Automatic Retinal Imaging and Analysis: Age-Related Macular Degeneration (AMD) within Age-Related Eye Disease Studies (AREDS)

T. Y. Alvin Liu and Neil M. Bressler
In this chapter, we focus on a series of collaborations between clinicians from the School of Medicine and computer scientists from the Applied Physics Lab at the Johns Hopkins University. The described studies utilized deep learning (DL) and focused on analysis of color fundus photographs (CFP) of patients with age-related macular degeneration (AMD), the leading cause of central vision loss in persons over age 50 in the United States [1] and around the world. The dataset used for training and testing was derived from the Age-Related Eye Disease Studies (AREDS) [2], a longitudinal cohort study funded by the National Eye Institute with over 4500 participants and roughly 130,000 CFPs taken with a 30-degree camera. The ground truth of the deep learning systems (DLS) was based on the annotations (gradings) by trained graders at the University of Wisconsin Fundus Photograph Reading Center, which is the designated reading center for the AREDS.

In the first study [3], a DLS was trained with a combination of transfer learning, using the universal features taken from the fully connected layer of the pre-trained OverFeat [4] deep convolutional neural network (DCNN), and a linear support vector machine (LSVM). Only a subset of the AREDS dataset, 5664 images, was used. Three sets of classification experiments were performed: 4-class classification (no AMD vs. early AMD vs. intermediate AMD vs. advanced AMD); 3-class classification (no or early AMD vs. intermediate AMD vs. advanced AMD); and 2-class classification (no or early AMD vs. intermediate or advanced AMD). The DLS's performance was compared to that of a human ophthalmologist.
Classification problem   DLS accuracy   DLS kappa   Ophthalmologist accuracy   Ophthalmologist kappa
4-class                  79.4%          0.6962      75.8%                      0.6583
3-class                  81.5%          0.7226      85.0%                      0.7748
2-class                  93.4%          0.8482      95.2%                      0.8897
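The kappa values in the table above measure chance-corrected agreement with the reading-center grading. As a minimal illustration (with made-up labels, not study data), Cohen's kappa can be computed from paired gradings like this; the linearly weighted kappa reported later in this chapter additionally gives partial credit to near-miss grades:

```python
# Cohen's kappa: agreement between two graders corrected for the agreement
# expected by chance given each grader's marginal label frequencies.
# kappa = (p_observed - p_expected) / (1 - p_expected)
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum((freq_a[lab] / n) * (freq_b[lab] / n)
                     for lab in set(rater_a) | set(rater_b))
    return (p_observed - p_expected) / (1 - p_expected)

# Illustrative labels only (not study data): 9 of 10 gradings agree.
dls   = ["ref", "non", "non", "ref", "non", "non", "ref", "non", "non", "non"]
truth = ["ref", "non", "non", "ref", "non", "non", "non", "non", "non", "non"]
print(round(cohens_kappa(dls, truth), 3))  # -> 0.737
```

Note how 90% raw agreement shrinks to kappa of about 0.74 once chance agreement on the dominant "non-referable" class is discounted — which is why the table reports kappa alongside accuracy.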
To achieve the same level of performance, a DLS will need more training data to perform progressively more fine-grained classifications. Given that the size of the training data was fixed in this study, a drop in accuracy and in correlation with the gold standard (kappa score) was seen with progressively more fine-grained classifications, as expected. The trend was observed for both the DLS and the human ophthalmologist. Overall, the DLS's performance was comparable to that of a human ophthalmologist, and the DLS showed

T. Y. Alvin Liu (*) · N. M. Bressler
Retina Division, Department of Ophthalmology (Wilmer Eye Institute), Johns Hopkins University School of Medicine, Baltimore, MD, USA
e-mail: tliu25@jhmi.edu
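The shape of the first study's pipeline — fixed "universal" features from a frozen pre-trained DCNN feeding a small trainable linear classifier — can be sketched as follows. Everything here is a stand-in: a toy feature function replaces OverFeat's fully connected layer, and a perceptron replaces the linear SVM:

```python
# Transfer-learning sketch: the deep network is NOT trained; it only maps
# images to fixed feature vectors. Only the linear classifier on top learns.
# Stand-ins: toy feature function instead of OverFeat, perceptron instead of
# a linear SVM.

def pretrained_features(image):
    # Stand-in for the frozen fully connected layer of a pre-trained DCNN:
    # any fixed, non-learned mapping from image to feature vector.
    return [sum(image) / len(image), max(image) - min(image)]

def train_linear(features, labels, epochs=20, lr=0.1):
    """Fit a linear decision boundary on the fixed features (perceptron rule)."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):        # y in {-1, +1}
            margin = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * margin <= 0:                   # misclassified -> update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy "images": class +1 is brighter than class -1 (illustrative only).
images = [[0.8, 0.9, 0.7], [0.9, 0.8, 0.9], [0.1, 0.2, 0.1], [0.2, 0.1, 0.2]]
labels = [1, 1, -1, -1]
feats = [pretrained_features(im) for im in images]
w, b = train_linear(feats, labels)
print([predict(w, b, f) for f in feats])  # -> [1, 1, -1, -1]
```

The design point is data efficiency: because only the thin linear layer is trained, a subset as small as 5664 images can suffice, whereas training a full DCNN from scratch typically cannot.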
cation [9] (step-1: no or only small drusen (<63 μm) and no pigmentary abnormalities; step-2: multiple small drusen or medium-sized drusen (63–125 μm) and/or pigmentary abnormalities; step-3: large drusen (≥125 μm) or numerous medium-sized drusen and pigmentary abnormalities; step-4: choroidal neovascularization or geographic atrophy), and then fused the four classes into two (non-referable and referable). The resultant DLS, when compared to the AlexNet DLS used in the second study [5], produced a statistically significant superior performance in terms of accuracy (91.6% vs. 88.4%), sensitivity (89.0% vs. 84.5%) and specificity (93.6% vs. 91.5%).

Based on longitudinal outcomes data, the AREDS 9-step severity scale [10, 11] incorporates detailed quantification of drusen area and pigmentary abnormalities, and provides 5-year risk estimates for progression to advanced AMD, which includes choroidal neovascularization, central geographic atrophy, or both. While this severity scale provides fairly granular relative risk estimations for developing advanced AMD, the grading is so complex that it is likely no human ophthalmologist uses the scale. Nevertheless, the scale could be useful: for example, an eye with a step-1 score carries a 0.3% chance of progression, while an eye with a step-9 score carries a 53% chance [10]. The fourth study [12] had two major contributions: using DL for fine-grained, 9-step severity classification and for 5-year progression risk calculation inferred directly from a CFP as input. Only one image from each stereo pair in the AREDS dataset (58,370 images in total) was used. The dataset was split at the patient level, and the training, validation and testing subsets comprised 88%, 2% and 10% of the images, respectively. The DLS was trained with the ResNet 50 [13] DCNN. The 5-year progression risk inference was performed using three methods: soft prediction, hard prediction and regressed prediction. "Soft prediction" first calculated the class probabilities for each of the nine steps, using the ResNet 50 derived classifications and SoftMax output values, and computed the 5-year risk estimate as the expected value of class risk under this probability. "Hard prediction" also first calculated the class probabilities for each of the nine steps, using the ResNet 50 derived classifications, but designated the risk associated with the class with maximum probability as the 5-year risk estimate. "Regressed prediction" skipped the 9-step severity scale prediction step and directly mapped an input CFP image to a 5-year risk estimate by using the ResNet 50 in regression mode.

For the 9-step severity scale classification task, the DLS achieved a linearly weighted kappa score of 0.738, suggesting a high degree of correlation with the gold standard established at the reading center. However, variation in the DLS's classification accuracy was seen across the nine classes, which was likely driven by the imbalances in the sample size available for each class. For example, of the 58,370 images involved in this study, there were 24,411 step-1 images but only 1160 step-9 images. For the 5-year progression risk estimation task, the "hard prediction" method performed best in most classes, and the overall mean estimation error of the three methods ranged from 3.47% to 5.29%, indicating a relatively small estimation error by the DLS.

The fifth study [14] produced by this series of collaborations pivoted from utilizing DL for classification tasks to utilizing DL for generative tasks. Specifically, generative adversarial networks (GANs) [15] were used to create high-resolution CFPs with various stages of AMD (Fig. 14.2). GANs have two main components: "generative" and "discriminative". The "generative" network uses training data to generate synthetic images, which are then presented to the "discriminative" network, which is responsible for discriminating between the synthetic and real images. The two networks are "adversarial" in that the "generative" network aims to generate synthetic images that can "fool" the "discriminative" network. The two networks are trained iteratively against each other to ultimately maximize the "authenticity" of the synthetic images. Since the description of GANs by Goodfellow et al. in 2014 [15], many variants of GANs have been developed, e.g. fully-connected GANs, Laplacian Pyramid GANs, boundary equilibrium GANs, self-attention GANs, BigGANs, conditional GANs and progressive GANs (ProGANs) [16].

Fig. 14.2 Sample synthetic images generated by GANs: early (left), intermediate (middle) and advanced (right) AMD

This study adopted ProGANs; the experiments involved two retinal specialists, and the major findings were as follows. First, the synthetic images were of high quality: no difference in gradeability was detected between the synthetic and real images as determined by the two retinal specialists. Second, the synthetic images were judged realistic in most cases. The two retinal specialists only achieved 59.5% and 53.7% accuracy, respectively, in discriminating real AMD images from synthetic images; accuracies this close to 50% suggest that the retinal specialists' ability to identify the synthetic images was similar to random chance. Third, the synthetic images generated by the GANs were useful for training a DL algorithm: the authors showed that a DLS trained entirely with synthetic images still achieved a respectable AUC-ROC value of 0.9235 and accuracy of 82.92% in differentiating between non-referable vs. referable AMD when tested against real AMD images.

In summary, a series of collaborations between retinal specialists at the School of Medicine and computer scientists at the Applied Physics Lab at the Johns Hopkins University produced five publications that explored various aspects of DL applications in AMD using data from CFPs of the AREDS: from referable AMD determination to fine-grained AMD severity classification to DCNN technical refinements to synthetic image generation with GANs. Future directions may include the following: using multimodal imaging from the AREDS, such as CFPs and optical coherence tomography imaging, to develop more powerful DLSs; using GANs to supplement the under-represented classes of images in the 9-step severity scale to improve progression risk estimations; and testing DLSs trained with AREDS images against images collected in clinical practice.

References

1. Bressler NM. Age-related macular degeneration is the leading cause of blindness. JAMA. 2004;291(15):1900–1.
2. Age-Related Eye Disease Study Research Group. The Age-Related Eye Disease Study (AREDS): design implications. AREDS report no. 1. Control Clin Trials. 1999;20(6):573–600.
3. Burlina P, Pacheco KD, Joshi N, Freund DE, Bressler NM. Comparing humans and deep learning performance for grading AMD: a study in using universal deep features and transfer learning for automated analysis. Comput Biol Med. 2017;82:80–6.
4. Razavian AS, Azizpour H, Sullivan J, Carlsson S. CNN features off-the-shelf: an astounding baseline for recognition. Paper presented at: the Institute of Electrical and Electronics Engineers Conference on computer vision and pattern recognition; May 12, 2014; Stockholm, Sweden. https://arxiv.org/pdf/1403.6382.pdf
5. Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 2017;135(11):1170–6.
6. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. https://papers.nips.cc/paper/4824-imagenet-
classification-with-deep-convolutional-neural-networks.pdf
7. Burlina P, Joshi N, Pacheco KD, Freund DE, Kong J, Bressler NM. Utility of deep learning methods for referability classification of age-related macular degeneration. JAMA Ophthalmol. 2018;136(11):1305–7.
8. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of IEEE conference on computer vision and pattern recognition; June 27–30, 2016; Las Vegas, NV. p. 771–8.
9. Age-Related Eye Disease Study Research Group. The age-related eye disease study system for classifying age-related macular degeneration from stereoscopic color fundus photographs: the age-related eye disease study report number 6. Am J Ophthalmol. 2001;132(5):668–81.
10. Davis MD, Gangnon RE, Lee LY, et al. The age-related eye disease study severity scale for age-related macular degeneration: AREDS report no. 17. Arch Ophthalmol. 2005;123(11):1484–98.
11. Ying GS, Maguire MG, Alexander J, Martin RW, Antoszyk AN, Complications of Age-Related Macular Degeneration Prevention Trial Research Group. Description of the age-related eye disease study 9-step severity scale applied to participants in the complications of age-related macular degeneration prevention trial. Arch Ophthalmol. 2009;127(9):1147–51.
12. Burlina PM, Joshi N, Pacheco KD, Freund DE, Kong J, Bressler NM. Use of deep learning for detailed severity characterization and estimation of 5-year risk among patients with age-related macular degeneration. JAMA Ophthalmol. 2018;136(12):1359–66.
13. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE conference on computer vision and pattern recognition (CVPR). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2016. p. 771–8.
14. Burlina PM, Joshi N, Pacheco KD, Liu TYA, Bressler NM. Assessment of deep generative models for high-resolution synthetic retinal image generation of age-related macular degeneration. JAMA Ophthalmol. 2019.
15. Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. Adv Neural Inf Process Syst. 2014:2672–80.
16. Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196; 2017.
15 Artificial Intelligence for Keratoconus Detection and Refractive Surgery Screening

J. L. Reyes Luis and R. Pineda
ity and specificity risk factor detection methods [9, 11].

Recently, ectasia scoring systems have refined the ability of some devices to detect early forms of keratoconus. For example, the Belin/Ambrosio enhanced ectasia display (BAD-D) of the Pentacam (Oculus, Wetzlar, Germany) combines elevation-based maps and pachymetry to screen for corneal ectasias with high sensitivity and specificity (Fig. 15.1). This is particularly useful for evaluating patients at risk of post-operative ectasia and for identifying early or subclinical keratoconus. A final BAD-D score, also referred to as "final D", greater than 1.88 has shown a 2.5% false positive rate with less than 1% false negatives [12].

Fig. 15.1 The Belin-Ambrosio Enhanced Ectasia Display (BAD-D) in a case with early keratoconus. Abnormal indices are represented in yellow and red

Moreover, the use of Artificial Intelligence (AI) techniques has been recommended for difficult KCN diagnoses (such as differentiation of the normal population from KCS and FFKC, as well as refractive surgery screening), as AI constitutes a valuable instrument that has been proven to identify early signs of ectasia with a high degree of reliability [2].

Artificial Intelligence

One common term used in AI is big data. This concept refers to the aggregate of data that has been generated over years of scientific research projects. This information is extremely extensive and complex; therefore, its storage and interpretation with traditional tools is not possible [13]. AI provides the investigator with useful tools that allow a better understanding and analysis of those data [8]. To achieve that objective, big data needs first to be organized in a dataset. Then, in order to determine the presence of patterns in an extensive group of information, big data must
undergo a process called data mining. This pattern recognition is very important and cannot be achieved manually (due to the complexity of the information). Data mining can be performed using two approaches: (1) supervised classification, which is assisted by prototypes or examples; or (2) unsupervised classification, or clustering, which examines relationships between the properties of the objects [13].

The first study involving the use of AI for corneal pathologies was performed by Maeda in 1994 [11]. Since then, several AI techniques have been employed for corneal topography interpretation and early keratoconus detection [11, 14].

Artificial Intelligence Techniques

"Machine learning" is a popular AI technique whose objective is to mimic human learning. This method has the advantage of giving computers the aptitude to 'learn' without being explicitly programmed to do so [5]. This is accomplished through programs that boost output using a previously known information dataset, and through the use of multiple types of algorithms, e.g. classifier algorithms (see Box 15.1). Machine learning algorithms are "trained" on a guidance data set and are afterwards "validated" on a different set of test or verification data, in order to ensure the external validity of the process [5].

Box 15.1 Classifier
In machine learning, a classifier is an algorithm that implements classification to find patterns within a dataset. As previously exposed, and according to the way new information is obtained, these algorithms can be divided into two groups: supervised or unsupervised. Classifiers can be used, for example, in the classification of topographic maps as normal, astigmatic, or with KCN [14].

Although most of the AI studies for KCN detection and refractive screening are based on supervised machine learning techniques, other types of AI, including unsupervised machine learning techniques, expert decision trees [15], and multivariate logistic regression analysis [16], have also proven to be useful.

Studies

Supervised Machine Learning Classifiers

Neural Networks
"Neural network" (NN) refers to the AI method that simulates human neurological processing abilities with the objective of interpolating or estimating complicated data [17]. This method has the capacity to detect characteristics that are concealed within the input data, without requiring additional information about logic structures between input and output data. Therefore, for developing a successful interpretation system and obtaining an accurate response, the establishment of the input data is fundamental in this technique.

To date, several studies have used NN for corneal topography classification. The first NN study was reported in 1995 by Maeda and collaborators. In their publication, topographic maps were classified (using 11 indices of the computer-assisted videokeratoscope TMS-1 [Computed Anatomy, New York, NY]) into normal, astigmatism with the rule, KCN mild to advanced, post-photorefractive keratectomy and post penetrating keratoplasty. A NN backpropagation model (a supervised, multilayer, feed-forward network) was used, obtaining a correct classification in 80% of cases (with >90% specificity and 44–100% sensitivity) [17].

In 1997, Smolek and Klyce published their first NN study. Using 10 indices of the computer-assisted videokeratoscope TMS-1, they were able to detect KCS and KCN severity with 100% accuracy, 100% sensitivity, and 100% specificity [18]. Four years later, these two authors also
reported their NN findings using wavelet data from the computer-assisted videokeratoscope TMS-1. The NN was able to distinguish normal from prior refractive surgery corneas with 99.3% accuracy, 99.1% sensitivity and 100% specificity. These outcomes were considerably higher than those obtained by clinicians (who only achieved 65% sensitivity and 95% specificity) [19].

Accardo and Pensiero, on the other hand, used NN to evaluate unilateral and bilateral indices, as well as early KCN maps, obtained with the videokeratoscope EyeSys (EyeSys Vision, Houston, Texas). This method allowed the authors to differentiate early KCN from non-KCN corneas with 94.1% sensitivity and 97.6% specificity [20].

In 2008, Vieira de Carvalho and Barbosa employed NN and discriminant analysis to classify corneal shapes using Zernike coefficients. EyeSys videokeratoscope data was used as input, achieving 94% accuracy with NN and 84.8% accuracy with discriminant analysis [21].

Six years later, Silverman and collaborators compared the ability of NN and linear discriminant analysis to differentiate between normal and KCN corneas. This was done using maps of epithelial and stromal thickness obtained from the Artemis 1 (ArcScan, Morrison, CO) digital ultrasound scanner. Both classifiers obtained an Area Under the Receiver Operating Characteristic curve (AUROC) of 1.0 (indicative of complete separation of the groups); however, NN sensitivity and specificity were slightly better than those of linear discriminant analysis (98.9% and 99.5% versus 94.6% and 99.2%, respectively) [11].

Kovacs et al., in turn, analyzed unilateral and bilateral indices obtained with Scheimpflug imaging (Pentacam corneal topographer). NN differentiated normal and KCN corneas with an AUROC of 0.99, 100% sensitivity, and 95% specificity; however, differentiation between normal and FFKC corneas was slightly less accurate, as it only achieved an AUROC of 0.97, 90% sensitivity, and 90% specificity [2].

Automated Decision Tree Classification
The "automated decision tree" classification method is a machine learning algorithm that is generated employing a sample of training data with a known class assignment. This adaptable investigation tool is suitable for nonparametric data analysis and has been used for data mining [22]. In this technique, cutoff values of discriminant variables are designated, giving place to successive 'nodes', 'branches' (which divide into two mutually exclusive subgroups), and 'leaves' (the final decision of class assignment).

Smadja and collaborators have used the automated decision tree classification method to discriminate inputs from the Galilei corneal topographer (Ziemer Ophthalmic Systems, Port, Switzerland). The posterior surface asymmetry index and corneal volume were the two most important discriminant variables in their study. With this method, normal and KCN corneas were classified with 100% sensitivity and 93.6% specificity, while normal and FFKC corneas were categorized with 93.6% sensitivity and 97.2% specificity [22].

WEKA Software Classifiers (SVM, Random Forest, Bayes Network)
More recently, an open-source machine learning software package was created at the University of Waikato, New Zealand. This software, called the Waikato Environment for Knowledge Analysis (WEKA) workbench, contains a compilation of machine learning algorithms and data processing tools for data mining [23]. Examples of the learning algorithms offered by WEKA are the supervised classifiers Support Vector Machine (SVM), Bayes Network, Multi-Layer Perceptron (MLP), Naive Bayes, Random Forest and Radial Basis Function Neural Network (RBFNN) [13].

In 2010, Souza et al. evaluated the performance of SVM, MLP and RBFNN for KCN detection using indices from the Orbscan II scanning slit topographer (Bausch & Lomb, Quebec, Canada). The results showed high KCN detection indices for all three machine learning algorithms: no difference was found between SVM and MLP efficiency (0.99 AUROC and 100% sensitivity each), and RBFNN also reached high values (0.98 AUROC and 98% sensitivity) [24].

Later, Ruiz Hidalgo and collaborators evaluated the KCN detection effectiveness of the SVM algorithm using 22 parameters obtained from the Pentacam corneal topographer. Their results demonstrated a KCN vs non-KCN classification
accuracy of 98.9%, sensitivity of 99.1%, and specificity of 98.5%, while FFKC vs non-KCN classification achieved an accuracy of 93.1%, sensitivity of 79.1%, and specificity of 97.9%. Finally, the authors were able to classify five groups [KCN, FFKC, astigmatic, PR (post-refractive surgery) and normal] with 88.8% accuracy, 89% sensitivity, and 95.2% specificity [25].

More recently, a study published by Lopes et al. intended to detect corneal ectasia susceptibility by analyzing Pentacam tomographic data of patients with unilateral asymmetric ectasia, clinical KCN, and stable LASIK. The authors compared five classifiers (Random Forest, Naive Bayes, NN, SVM, and regularized discriminant analysis), of which the Random Forest yielded the highest accuracy (80–85.2% sensitivity and 96.6% specificity). To date, this is the largest machine learning study for ectasia susceptibility and early detection of clinical ectasia [8].

Unsupervised Machine Learning

Contrary to supervised machine learning techniques, unsupervised algorithms do not need pre-labelled data for training [26]. This allows investigators to use a non-biased approach that employs multiple comprehensive parameters for analysis. Although to date most of the machine learning techniques for automatic detection of KCN are supervised, Yousefi et al. conducted an unsupervised machine learning study using a big dataset

Linear Discriminant Analysis
Linear discriminant analysis (LDA) is widely used in classification applications and machine learning. Its goal is to display the original data in a smaller dimensional space while maximizing the separation between categories [27].

Saad and Gatinel used LDA to differentiate KCN, FFKC, and non-KCN using 51 Orbscan II topographic indices. Percentage of thickness increase and maximum posterior corneal elevation were the most important contributors and, with this technique, the differentiation capacity between normal and FFKC groups and between normal and KCN groups reached an AUROC of 0.98 and 0.99, respectively. These results suggest that it is plausible to accurately differentiate normal from FFKC eyes using topography indices [28].

Two years later, the same investigators used the OPD-Scan corneal analyzer to obtain Zernike coefficients and differentiate between normal, FFKC and KCN corneas using LDA. An AUROC of 0.98 was reported for the distinction between normal and FFKC corneas, while an AUROC of 0.96 was reached between normal and KCN corneas [29].

Expert System Classifier
The "Expert System Classifier" is a non-machine-learning AI technique in which resolutions are reached through an extensive set of decision rules, deductive decisions, and step-by-step logical operations [17]. In 1994, Maeda and collaborators reported the first Expert System Classifier results and combined them with LDA in order to differ-
of corneal images obtained with the corneal entiate KCN and non-KCN. Using indices from
swept-source 1000 CASIA OCT. The classifica- TMS-1 maps, this method resulted in 96% accu-
tion was able to detect KCN cases with 97.7% racy, 89% sensitivity and 99% specificity [15].
sensitivity and 94.1% specificity. To the best of
our knowledge, this is the first and only study that ultivariate Logistic Regression
M
employs unsupervised machine learning in cor- Analysis
neal ectasia detection [26]. Although multivariate logistic regression analy-
sis is not a machine learning technique, the clas-
sification results that have been obtained with
Non-machine Learning Classifiers this classic statistical method are worth includ-
ing in this chapter. This method is convenient
inear Discriminant Analysis
L for models with dichotomous dependent vari-
“Linear Discriminant Analysis” (LDA) is a tech- ables and uses logistic regression coefficients to
nique commonly used for data reduction that can estimate odds ratios for each independent vari-
help as a pre-processing step for pattern classifi- able [30].
198 J. L. Reyes Luis and R. Pineda
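The studies above all summarize classifier performance with the same three numbers: sensitivity and specificity at a fixed decision threshold, and the threshold-free area under the ROC curve (AUROC). As a minimal illustration of how these figures relate to a classifier's raw scores, here is a standard-library Python sketch; the labels and scores are invented toy values, not data from any cited study.

```python
# Toy illustration of the metrics reported in this chapter: sensitivity,
# specificity (at a fixed threshold), and AUROC (threshold-free).
# Labels: 1 = keratoconus-like, 0 = normal; scores: classifier outputs.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.92, 0.85, 0.70, 0.40, 0.55, 0.30, 0.25, 0.15, 0.10, 0.05]

def sens_spec(labels, scores, threshold=0.5):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(labels, scores):
    # Probability that a random positive scores above a random negative
    # (ties count half) -- the Mann-Whitney formulation of the AUROC.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

sensitivity, specificity = sens_spec(labels, scores)
print(f"sens={sensitivity:.2f} spec={specificity:.2f} "
      f"AUROC={auroc(labels, scores):.2f}")
```

Moving the threshold trades sensitivity against specificity, which is why the AUROC, integrating over all thresholds, is the preferred single-number comparison between the algorithms discussed here.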
In 2018, Hwang and collaborators used a multivariate logistic regression analysis to differentiate normal from FFKC corneas. They combined five Scheimpflug Pentacam indices with 11 anterior segment OCT RT-Vue-100 (Optovue, Fremont, CA) indices, obtaining a classification system with 100% sensitivity and 100% specificity. The major contributors in this analysis were the epithelial thickness variability, the total focal corneal thickness variability from OCT, and the anterior curvature and topometric indices from Scheimpflug tomography [16]. In Fig. 15.2, anterior segment OCT and Scheimpflug images similar to the ones used in Hwang's study are displayed.

Recently, a spotlight has fallen on the evaluation of corneal biomechanics, not only due to its capability of diagnosing masked corneal ectasias, but also for guiding therapeutic interventions such as crosslinking. Nowadays, the few devices available for in vivo evaluation of corneal biomechanics include the ocular response analyzer (ORA) and the Corvis ST [31–33]. Biomechanics have contributed to the expansion of corneal imaging, and more studies are expected to improve datasets for AI in keratoconus and refractive screening. Fig. 15.3 shows the results of a patient with very asymmetric ectasia (VAE-E) evaluated with the Ambrosio, Roberts, and Vinciguerra (ARV) biomechanical and tomographic assessment, performed with the Corvis ST device.

Limitations

The term "ground truth" is used to depict the algorithm result that the AI technique is programmed to obtain; however, reaching that ground truth is not a simple task. Large datasets can be more reliable for designating the ground truth, as cutoffs can be set to include only the advanced forms of the disease (e.g. manifest corneal ectasias). However, in cases in which there is no expert consensus on the definition of a specific condition, such as early ectasias following refractive surgery, a reliable classification is difficult to achieve [5].

In addition, the lack of large sample sizes for low-incidence conditions (such as FFKC and early post-refractive surgery ectasia) and the scarcity of exposure to a sufficient variety of presentations of the same pathology and its differential diagnoses (e.g. ectasias following refractive surgery, KCN, and pellucid marginal degeneration) undermine the results of the algorithms and reduce their external validity [14].

The advantages of using AI in corneal ectasias include:

• Automated interpretation of topographic maps. This may help unskilled observers enhance their decision-making performance to a more skilled level [17].
• Early keratoconus detection. KCN and FFKC detection by AI models has achieved good results, allowing ophthalmologists to diminish progression by administering early treatments. One of these treatments is riboflavin-induced UVA cross-linking, which seems to be an effective and enduring treatment when the detection is done at early stages [34].
• Customized LASIK ablations. Recently, the AI capacity to combine data has been focused on optimizing refractive results. This led to the design of LASIK ablations based on axial length measurements, total eye wavefront input, and detailed corneal and anterior segment tomography data; this first module was called Innoveyes by Alcon. With this information, mathematical models of ray-tracing, prediction of biomechanical changes of the cornea, and the anticipated epithelial remodeling can also be obtained [35].
Fig. 15.2 Combined image of anterior segment OCT and Scheimpflug tomography from a patient with KCN in the left eye. The Scheimpflug tomography shows the anterior elevation and curvature maps (top), and the corneal thickness and the posterior elevation maps (middle). The total corneal thickness and epithelial thickness map can be found in the anterior segment OCT report (bottom)
Fig. 15.3 The ARV Biomechanical and Tomographic Display in a patient's left eye with keratoconus. We can see below the Corvis Biomechanical Index (CBI), Tomographic Biomechanical Index (TBI), and Belin Ambrosio Enhanced Ectasia Display (BAD-D). Abnormal reported indices are represented in yellow and red. Image courtesy of Dr. Renato Ambrosio MD PhD
management of other ophthalmic diseases, with updated user-accessible mobile devices and automatic examination instruments.

Artificial Intelligence for Preoperative Assessment of Cataract Surgery

The current standardized management of a visually impairing cataract is surgical removal of the cataractous lens and implantation of an intraocular lens in its place [8]. Cataract surgery is one of the most commonly performed surgical procedures today, with more than 11 million eyes undergoing intraocular lens (IOL) implantation worldwide every year [9–11]. The postoperative vision of patients depends largely on an accurate calculation of the IOL optical power, which represents a challenge for ophthalmologists [12]. Researchers have integrated artificial intelligence into this process to achieve better visual outcomes for patients [13, 14] (Fig. 16.2).

Currently, the most commonly used formulas for IOL power calculation are from the Vergence
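Vergence-type formulas share a thin-lens core: the IOL power is the difference between the vergence needed to focus light on the retina and the vergence arriving from the cornea at the estimated lens position. The sketch below shows only that textbook core under a made-up but typical biometry example; real formulas (SRK/T, Holladay, etc.) add regression-derived effective-lens-position prediction and device-specific corrections, none of which are reproduced here.

```python
# Simplified thin-lens vergence calculation for IOL power (emmetropia target).
# AL = axial length, K = average corneal power, ELP = effective lens position.
# This is the generic core shared by vergence formulas, not any vendor's
# exact published formula.
def iol_power(al_mm, k_diopters, elp_mm, n=1.336):
    al, elp = al_mm / 1000.0, elp_mm / 1000.0              # mm -> m
    needed = n / (al - elp)                                # vergence to reach retina
    arriving = k_diopters / (1 - (elp / n) * k_diopters)   # corneal vergence at IOL
    return needed - arriving

# Illustrative eye: AL 23.5 mm, K 43.5 D, assumed ELP 5.25 mm
print(round(iol_power(23.5, 43.5, 5.25), 1))  # ~20.7 D
```

The sensitivity of the result to the assumed ELP is what makes the calculation hard, and it is precisely this ELP estimation step that machine-learning approaches such as Findl et al. [13] and Sramka et al. [14] target.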
16 Artificial Intelligence for Cataract Management 205
dict the existence of PCO in the 24th month postoperatively. This proposed model offered exceptional performance, with a sensitivity of 88.55%, a specificity of 94.31%, and an area under the curve of 0.9718. Prediction models of this kind help to plan treatment strategies and to provide early warning for the patients.

Future of AI-Based Cataract Management

As discussed above, AI technologies have been applied to multiple aspects of cataract management. From a visionary perspective, AI applications may be extended to other cataract-related areas such as preoperative risk stratification for cataract surgery, prediction of postoperative visual and refractive outcomes, and patient assessment for implants such as multifocal or accommodative IOLs. It is becoming apparent that AI technologies will play a crucial role in the revolution of healthcare service delivery within the field of ophthalmology and serve as a powerful addition to the current diagnostic and therapeutic armamentarium of cataract specialists.

References

1. Flaxman SR, Bourne RRA, Resnikoff S, Ackland P, Braithwaite T, Cicinelli MV, et al. Global causes of blindness and distance vision impairment 1990-2020: a systematic review and meta-analysis. Lancet Glob Health. 2017;5(12):e1221–e34.
2. Song P, Wang H, Theodoratou E, Chan KY, Rudan I. The national and subnational prevalence of cataract and cataract blindness in China: a systematic review and meta-analysis. J Glob Health. 2018;8(1):010804.
3. Ramke J, Zwi AB, Lee AC, Blignault I, Gilbert CE. Inequality in cataract blindness and services: moving beyond unidimensional analyses of social position. Br J Ophthalmol. 2017;101(4):395–400.
4. Acharya RU, Yu W, Zhu K, Nayak J, Lim TC, Chan JY. Identification of cataract and post-cataract surgery optical images using artificial intelligence techniques. J Med Syst. 2010;34(4):619–28.
5. Gao X, Lin S, Wong TY. Automatic feature learning to grade nuclear cataracts based on deep learning. IEEE Trans Biomed Eng. 2015;62(11):2693–701.
6. Wu X, Huang Y, Liu Z, Lai W, Long E, Zhang K, et al. Universal artificial intelligence platform for collaborative management of cataracts. Br J Ophthalmol. 2019;
7. Xu X, Zhang L, Li J, Guan Y, Zhang L. A hybrid global-local representation CNN model for automatic cataract grading. IEEE J Biomed Health Inform. 2019;
8. Liu YC, Wilkins M, Kim T, Malyugin B, Mehta JS. Cataracts. Lancet. 2017;390(10094):600–12.
9. Frampton G, Harris P, Cooper K, Lotery A, Shepherd J. The clinical effectiveness and cost-effectiveness of second-eye cataract surgery: a systematic review and economic evaluation. Health Technol Assess. 2014;18(68):1–205, v–vi.
10. Wang W, Yan W, Muller A, He M. A global view on output and outcomes of cataract surgery with national indices of socioeconomic development. Invest Ophthalmol Vis Sci. 2017;58(9):3669–76.
11. Abell RG, Vote BJ. Cost-effectiveness of femtosecond laser-assisted cataract surgery versus phacoemulsification cataract surgery. Ophthalmology. 2014;121(1):10–6.
12. Olsen T. Sources of error in intraocular lens power calculation. J Cataract Refract Surg. 1992;18(2):125–9.
13. Findl O, Struhal W, Dorffner G, Drexler W. Analysis of nonlinear systems to estimate intraocular lens position after cataract surgery. J Cataract Refract Surg. 2004;30(4):863–6.
14. Sramka M, Slovak M, Tuckova J, Stodulka P. Improving clinical refractive results of cataract surgery by machine learning. PeerJ. 2019;7:e7202.
15. Melles RB, Holladay JT, Chang WJ. Accuracy of intraocular lens calculation formulas. Ophthalmology. 2018;125(2):169–78.
16. Ewe SY, Abell RG, Oakley CL, Lim CH, Allen PL, McPherson ZE, et al. A comparative cohort study of visual outcomes in femtosecond laser-assisted versus phacoemulsification cataract surgery. Ophthalmology. 2016;123(1):178–82.
17. Ruit S, Tabin G, Chang D, Bajracharya L, Kline DC, Richheimer W, et al. A prospective randomized clinical trial of phacoemulsification vs manual sutureless small-incision extracapsular cataract surgery in Nepal. Am J Ophthalmol. 2007;143(1):32–8.
18. Apple DJ, Solomon KD, Tetz MR, Assia EI, Holland EY, Legler UF, et al. Posterior capsule opacification. Surv Ophthalmol. 1992;37(2):73–116.
19. Ando H, Ando N, Oshika T. Cumulative probability of neodymium:YAG laser posterior capsulotomy after phacoemulsification. J Cataract Refract Surg. 2003;29(11):2148–54.
20. Ebihara Y, Kato S, Oshika T, Yoshizaki M, Sugita G. Posterior capsule opacification after cataract surgery in patients with diabetes mellitus. J Cataract Refract Surg. 2006;32(7):1184–7.
21. Mohammadi SF, Sabbaghi M, Hashemi H, Alizadeh S, Majdi M, et al. Using artificial intelligence to predict the risk for posterior capsule opacification after phacoemulsification. J Cataract Refract Surg. 2012;38(3):403–8.
22. Jiang J, Liu X, Liu L, Wang S, Long E, Yang H, et al. Predicting the progression of ophthalmic disease based on slit-lamp images using a deep temporal sequence network. PLoS One. 2018;13(7):e0201142.
17 Artificial Intelligence in Refractive Surgery

Yan Wang, Mohammad Alzogool, and Haohan Zou
ing on the anterior corneal surface parameter analysis then changed to focus on the comprehensive analysis of full corneal parameters. By incorporating artificial intelligence algorithms and producing effective machine learning models, significant improvements in the diagnostic rate of keratoconus and in different classification tasks — keratoconus, subclinical keratoconus, high-astigmatism cornea, and cornea after refractive surgery — have been achieved [3].

2. AI diagnostic algorithms for combined multimode data
In order to improve the diagnostic accuracy for keratoconus, keratectasia, and other related diseases, different screening devices are often combined during the clinical diagnosis process to develop algorithms for multi-source diagnostic methods, including anterior segment OCT devices, optical aberration measuring instruments, confocal microscopy, and in vivo measurement of corneal biomechanics. However, because the results of different functional instruments have different meanings, and there are large differences in the results of different instruments within the same functional category, analyzing these examination parameters is difficult and complicated. To make the model more compatible and robust, and also convenient for research and application across different devices and clinics, a retrospective analysis was performed to extend diagnostic algorithms from a single corneal topographic device to the cross-platform data of three different topographic device sources [4], realizing the analysis and evaluation of maps obtained from a variety of topographic devices. Meanwhile, a diagnostic model combined with corneal biomechanical measurement has also proven to have high diagnostic performance [5, 6].

3. Improving diagnostic efficiency for suspected patients in different populations with AI algorithms
In order to identify the diagnostic differences among different ethnic groups, different population-based studies have been performed, showing that ethnic origin influences keratoconus incidence and that corneal physiological parameters of people in different regions also show regional distribution characteristics [7]. Recently, our team completed a study among 2000 participants based on Scheimpflug corneal tomography parameters to establish a model that can diagnose subclinical keratoconus with high accuracy. We used the support vector machine (SVM) and the gradient boosted decision tree (GBDT), an iterative machine learning algorithm composed of multiple decision trees that screens attribute features with larger weights, to construct a subclinical keratoconus diagnosis model, and performed a 10-fold cross-validation to verify the accuracy. The model achieved a 95.53% diagnostic accuracy. The accuracy of the model in distinguishing subclinical keratoconus from the normal cornea was 96.67%, and the accuracy in distinguishing keratoconus from the normal cornea was 98.91%. In particular, suspected patients had a high diagnostic accuracy [8].

4. Finding new medical rules or connections that provide more clinical clues
In addition to being able to predict and diagnose different diseases and achieve good disease classification performance in the clinical setting, AI can also discover new medical rules or connections that have never been noticed or discovered before. Our research found that the diagnostic features of keratoconus include not only common clinical indexes such as central corneal astigmatism, index of surface variance, asymmetry index, corneal thickness, and posterior corneal surface height (posterior corneal surface elevation), but also clarified the significance of the aspheric parameters for suspicious keratoconus diagnosis. Moreover, collecting large-sample data from different centers and different populations to establish and train machine-learning algorithms can further enhance the universality and generalizability of diagnostic models [9].

5. Analysis of images and big data
The emergence of numerous machine learning models has provided more possibilities for research and application concerning the assisted diagnosis of keratoconus, through testing and comparison of various algorithms to find the optimal model, thereby improving the ability of disease diagnosis. The research mainly focuses on the analysis of images and data (Fig. 17.1). Some of the most commonly used methods are the support vector machine (SVM) [10], decision tree (DT) [9], multilayer perceptron (MLP), radial basis function neural network (RBFNN) [11], and convolutional neural networks (CNN) [12]. Generally, the dataset includes a training set and one or more validation or test sets. The training set is mainly used to build and train the model, and the test or validation set is used to evaluate the model. K-fold, leave-one-out, and other cross-validation methods are used to perform proper internal validation on the training dataset, and it is better to verify the model again with another dataset derived from clinical data. However, the validation of most studies is still based on the dataset itself and lacks clinical data validation.

AI Application to Improve Surgical Accuracy and Personalized Design

To ensure the accuracy and predictability of corneal refractive surgery, the risk of overcorrection or undercorrection must be reduced [13, 14]. Previous nomogram reports on adjusting the magnitudes of spherical equivalent and astigmatism before PRK [15] and LASIK [16] have shown that using multiple regression analysis to establish a nomogram model that considers numerous factors, including age, diopter, and corneal curvature, improves the accuracy of PRK and LASIK surgery [17]. However, with the development of new surgical techniques and multi-source clinical examination equipment, more factors influencing the outcome of surgery have been observed, such as temperature, humidity [18], wind speed, and air pressure [19]. Unlike PRK and LASIK, the nomogram adjustment of SMILE surgery needs to consider more factors and depends more on the experience of the surgeon. The former analysis of data using only a single factor or a small sample is insufficient for newly arising needs.

With the help of association analysis, information gain, classifiers, and other algorithms, it is possible to identify the correlations between these factors and analyze their impact on refractive surgery. Based on artificial intelligence technology, research and development of new intelligent refractive surgery platforms can assist doctors in completing the entire process from preoperative screening and parameter design to outcome prediction.

Our team selected 1146 cases that underwent SMILE surgery with ideal postoperative results.

Fig. 17.1 AI assists in the diagnosis of keratoconus and other related ectatic corneal disorders. Given an input corneal topographic map or corneal morphological parameters, the model achieves classification and grading for different cases
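The k-fold cross-validation procedure described above — partition the data into k folds, hold each fold out once for validation while training on the rest, then average the scores — can be sketched in plain Python. The data and the one-parameter threshold "model" below are toy stand-ins, not the SVM/GBDT models or clinical data of the study.

```python
# Minimal sketch of k-fold cross-validation: each fold serves once as the
# validation set while the remaining folds train the model. The 1-D data and
# the midpoint-threshold "model" are toy stand-ins for illustration only.
import random

random.seed(0)
data = [(random.gauss(0, 1), 0) for _ in range(50)] + \
       [(random.gauss(2, 1), 1) for _ in range(50)]
random.shuffle(data)

def fit_threshold(train):
    # "Model" = midpoint between the class means on the training fold.
    m0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    m1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    return (m0 + m1) / 2

def cross_validate(data, k=10):
    folds = [data[i::k] for i in range(k)]   # round-robin partition into k folds
    accs = []
    for i in range(k):
        val = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        t = fit_threshold(train)
        accs.append(sum((x > t) == bool(y) for x, y in val) / len(val))
    return sum(accs) / k   # mean validation accuracy across the k folds

print(f"10-fold mean accuracy: {cross_validate(data):.2f}")
```

Setting k equal to the number of samples gives the leave-one-out variant also mentioned in the text; as the text notes, neither replaces a genuinely external clinical validation set.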
Fig. 17.2 Various factors affect the accuracy and predictability of SMILE surgery outcomes. Using AI can achieve comprehensive analysis and control of influencing factors
From these samples, the nominal features were transformed into binary ones, and the numeric features were normalized into the range [0, 1]. The critical features affecting the nomogram values were resolved according to information gain analysis. The multilayer perceptron algorithm was used to train an artificial neural network model to predict the SMILE nomogram, and clinical control experiments were conducted for validation. Moreover, we compared the outcomes of the surgeon group with the machine learning group in terms of safety, efficacy, and predictability. The results showed that the efficacy index in the machine learning group (1.48 ± 1.08) was significantly higher than that in the surgeon group (1.3 ± 0.27) (t = −2.17, P < 0.05), and 83% of the eyes in the surgeon group and 93% of the eyes in the machine learning group were within ±0.50 D. The error of SE correction was −0.09 ± 0.024 and −0.23 ± 0.021 for the machine learning and surgeon groups, respectively. The outcomes of all aspects of the machine learning group reached the level of experienced surgeons or were even better [20] (Fig. 17.2). This study proves the feasibility of AI in the design of refractive surgery treatment strategies. However, it is worth noting that when building a nomogram model, more data attributes do not necessarily make a better model; including too much information will lead to overfitting and reduced model accuracy.

In addition, it is necessary to create a worldwide refractive surgery database. The clinical data of refractive surgery are growing at an unprecedented rate all over the world. The establishment of a standardized public database is fundamental for in-depth AI development in this field, and it is also the key to the development and evaluation of algorithm models. Together with multi-center clinical research, this will greatly enhance the safety and efficiency of the models, reduce misdiagnosis, missed diagnosis, and overtreatment during the refractive surgery treatment process, and help ensure precision medicine.
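The information-gain screening used above to select the critical nomogram features measures how much a feature reduces the entropy of the outcome when the data are split on it. A small sketch with toy binary features and outcomes (not the study's actual SMILE variables):

```python
# Sketch of entropy-based information-gain feature screening, as used to
# rank candidate nomogram features. Features and outcomes are toy values.
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(feature, labels):
    # H(labels) minus the weighted entropy of labels within each feature value.
    n = len(labels)
    remainder = 0.0
    for v in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

outcome   = [1, 1, 1, 0, 0, 0, 1, 0]   # e.g. "nomogram adjustment needed"
feature_a = [1, 1, 1, 0, 0, 0, 1, 0]   # perfectly predictive of the outcome
feature_b = [1, 1, 0, 0, 1, 1, 0, 0]   # statistically independent of it

print(information_gain(feature_a, outcome))  # 1.0 (maximal gain)
print(information_gain(feature_b, outcome))  # 0.0 (no gain)
```

Ranking candidate features by this gain and keeping the top-scoring ones is the "screens attribute features with larger weights" idea mentioned earlier, and it directly counters the overfitting warning above: features with negligible gain add attributes without adding information.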
Fig. 17.3 AI runs through the entire process of refractive surgery, assisting doctors in accomplishing all tasks from preoperative screening and surgery design to intraoperative control and postoperative management
take. Meanwhile, large-scale clinical applications also need to consider ethical and legal issues. The interpretation of algorithms and the establishment of relevant specifications will also be a focal issue that needs to be addressed in the future. For the discipline of vision correction, which is inseparable from images and data, refractive surgeons should actively embrace the convenience brought by artificial intelligence to help this discipline develop faster and more accurately.

References

1. Kim TI, Alio Del Barrio JL, Wilkins M, Cochener B, Ang M. Refractive surgery. Lancet. 2019;393(10185):2085–98. https://doi.org/10.1016/S0140-6736(18)33209-4.
2. Lin SR, Ladas JG, Bahadur GG, Al-Hashimi S, Pineda R. A review of machine learning techniques for keratoconus detection and refractive surgery screening. Semin Ophthalmol. 2019;34(4):317–26. https://doi.org/10.1080/08820538.2019.1620812.
3. Ruiz Hidalgo I, Rodriguez P, Rozema JJ, Ni Dhubhghaill S, Zakaria N, Tassignon MJ, et al. Evaluation of a machine-learning classifier for keratoconus detection based on Scheimpflug tomography. Cornea. 2016;35(6):827–32. https://doi.org/10.1097/ICO.0000000000000834.
4. Mahmoud AM, Roberts C, Lembach R, Herderick EE, McMahon TT, Clek SG. Simulation of machine-specific topographic indices for use across platforms. Optom Vis Sci. 2006;83(9):682–93. https://doi.org/10.1097/01.opx.0000232944.91587.02.
5. Machado AP, Lyra JM, Ambrósio R, Ribeiro G, LPN A, Xavier C, et al., editors. Comparing machine-learning classifiers in keratoconus diagnosis from ORA examinations. Berlin: Springer; 2011.
6. Ambrosio R Jr, Lopes BT, Faria-Correia F, Salomao MQ, Buhren J, Roberts CJ, et al. Integration of Scheimpflug-based corneal tomography and biomechanical assessments for enhancing ectasia detection. J Refract Surg. 2017;33(7):434–43. https://doi.org/10.3928/1081597X-20170426-02.
7. Ma R, Liu Y, Zhang L, Lei Y, Hou J, Shen Z, et al. Distribution and trends in corneal thickness parameters in a large population-based multicenter study of young Chinese adults. Invest Ophthalmol Vis Sci. 2018;59(8):3366–74. https://doi.org/10.1167/iovs.18-24332.
8. Zou HH, Xu JH, Zhang L, Ji SF, Wang Y. Assistant diagnose for subclinical keratoconus by artificial intelligence. Zhonghua Yan Ke Za Zhi. 2019;55(12):911–5. https://doi.org/10.3760/cma.j.issn.0412-4081.2019.12.008.
9. Lopes BT, Ramos IC, Salomao MQ, Guerra FP, Schallhorn SC, Schallhorn JM, et al. Enhanced tomographic assessment to detect corneal ectasia based on artificial intelligence. Am J Ophthalmol. 2018;195:223–32. https://doi.org/10.1016/j.ajo.2018.08.005.
10. Ruiz Hidalgo I, Rozema JJ, Saad A, Gatinel D, Rodriguez P, Zakaria N, et al. Validation of an objective keratoconus detection system implemented in a Scheimpflug tomographer and comparison with other methods. Cornea. 2017;36(6):689–95. https://doi.org/10.1097/ICO.0000000000001194.
11. Souza MB, Medeiros FW, Souza DB, Garcia R, Alves MR. Evaluation of machine learning classifiers in keratoconus detection from Orbscan II examinations. Clinics (Sao Paulo). 2010;65(12):1223–8. https://doi.org/10.1590/s1807-59322010001200002.
12. Lavric A, Valentin P. KeratoDetect: keratoconus detection algorithm using convolutional neural networks. Comput Intell Neurosci. 2019;2019:8162567. https://doi.org/10.1155/2019/8162567.
13. Jin HY, Wan T, Wu F, Yao K. Comparison of visual results and higher-order aberrations after small incision lenticule extraction (SMILE): high myopia vs. mild to moderate myopia. BMC Ophthalmol. 2017;17(1):118. https://doi.org/10.1186/s12886-017-0507-2.
14. Zhang J, Wang Y, Wu W, Xu L, Li X, Dou R. Vector analysis of low to moderate astigmatism with small incision lenticule extraction (SMILE): results of a 1-year follow-up. BMC Ophthalmol. 2015;15:8. https://doi.org/10.1186/1471-2415-15-8.
15. Shapira Y, Vainer I, Mimouni M, Sela T, Munzer G, Kaiserman I. Myopia and myopic astigmatism photorefractive keratectomy: applying an advanced multiple regression-derived nomogram. Graefes Arch Clin Exp Ophthalmol. 2019;257(1):225–32. https://doi.org/10.1007/s00417-018-4101-y.
16. Moniz N, Fernandes ST. Nomogram for treatment of astigmatism with laser in situ keratomileusis. J Refract Surg. 2002;18(3 Suppl):S323–6.
17. Liyanage SE, Allan BD. Multiple regression analysis in myopic wavefront laser in situ keratomileusis nomogram development. J Cataract Refract Surg. 2012;38(7):1232–9. https://doi.org/10.1016/j.jcrs.2012.02.043.
18. Seider MI, McLeod SD, Porco TC, Schallhorn SC. The effect of procedure room temperature and humidity on LASIK outcomes. Ophthalmology. 2013;120(11):2204–8. https://doi.org/10.1016/j.ophtha.2013.04.015.
19. Neuhaus-Richard I, Frings A, Ament F, Görsch IC, Druchkiv V, Katz T, et al. Do air pressure and wind speed influence the outcome of myopic laser refractive surgery? Results from the Hamburg weather study. Int Ophthalmol. 2014;34(6):1249–58. https://doi.org/10.1007/s10792-014-9923-y.
20. Cui T, Wang Y, Ji S, Li Y, Hao W, Zou H, et al. Applying machine learning techniques in nomogram prediction and analysis for SMILE treatment. Am J Ophthalmol. 2020;210:71–7. https://doi.org/10.1016/j.ajo.2019.10.015.
21. Sanders DR, Doney K, Poco M. United States Food and Drug Administration clinical trial of the Implantable Collamer Lens (ICL) for moderate to high myopia: three-year follow-up. Ophthalmology. 2004;111(9):1683–92. https://doi.org/10.1016/j.ophtha.2004.03.026.
22. Zadnik K, Money S, Lindsley K. Intrastromal corneal ring segments for treating keratoconus. Cochrane Database Syst Rev. 2019;5. https://doi.org/10.1002/14651858.CD011150.pub2.
23. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88. https://doi.org/10.1016/j.media.2017.07.005.
18 Artificial Intelligence in Cataract Surgery Training

Nouf Alnafisee, Sidra Zafar, Kristen Park, Satyanarayana Swaroop Vedula, and Shameema Sikder
Cataract surgery is one of the most common sur- However, minimal guidance has been provided
gical procedures performed across the world. It is by the regulating bodies on the best methods to
estimated that by 2050, almost 50 million indi- assess for surgical skill [3]. In the case of cata-
viduals will need cataract surgery in the United ract surgery, graduating ophthalmology resi-
States (U.S.) alone [1]. Given the growing inci- dents in the United States are required to
dence of cataracts, cataract surgery training has become an increasingly important component of ophthalmology residency, and failing to learn the procedure to mastery can be consequential. Moreover, surgeons must often sustain high volumes of procedures even after Board certification to retain their skill in cataract surgery, minimize the risk of complications, and optimize patient care.
In addition to the six core competencies mandated by the Accreditation Council for Graduate Medical Education, the American Board of Ophthalmology (ABO) has also identified surgical proficiency as a competence that should be met by ophthalmology training programs [2]. Trainees must complete a minimum of 86 cataract surgeries as primary surgeon [2]. A simple case log, however, is both insufficient and crude for assessing competence, with minimal, if any, value in terms of providing feedback. Other conventional methods for skill assessment, including the traditional apprenticeship model of teaching surgery, often lack standardization and objectivity. Although there has been a reliance on structured rating scales for skill assessment in recent years [4], such feedback is often not timely or complete. In one large academic orthopedic surgery program, 58% of residents reported that evaluations were rarely or never completed in a timely manner, and more than 30% of the assessments were completed more than 1 month after a rotation's end. In summary, new approaches to reliable, valid, and universally accessible methods of skill assessment are necessary to improve care in cataract surgery.
Recent advances in surgical data science have enabled novel methods for surgical skill assessment. These include techniques to directly analyse instrument motion, video, or other data from the surgical field, as well as manual approaches such as crowdsourcing. However, the validity of these approaches for skill assessment in the operating room (OR) remains to be seen [5, 6]. This brings us to the following questions: (1) Where does ophthalmology stand in relation to other surgical specialties regarding the use of new techniques for the assessment of surgical technical skill? (2) What do we have to do to bridge this gap and begin implementing automated methods for technical skill assessment in ophthalmology training, specifically in cataract surgery?
Broadly, measures of technical skill in cataract surgery may be derived from three data sources: (I) surgeon performance data, (II) the usage of a phacoemulsification machine or other devices [7], and (III) clinical outcomes [8–10] (Fig. 18.1). Figure 18.2 illustrates the different sources of data, the ease with which they can be obtained, and access to current methods to use them to assess technical skill. Each source of data, and the methods that have been used to analyze it for technical skill, is discussed below.

N. Alnafisee
Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, UK

S. Zafar · S. Sikder (*)
The Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA
e-mail: ssikder1@jhmi.edu

K. Park · S. S. Vedula
Malone Center for Engineering in Healthcare, Department of Computer Science, The Johns Hopkins University Whiting School of Engineering, Baltimore, MD, USA
e-mail: kpark38@jhu.edu; swaroop@jhu.edu

Sources of Data for Technical Skill Assessment in Cataract Surgery

I. Surgeon performance data
(a) Direct observation
The most prevalent and the most subjective method of assessment involves direct observation of the surgeon's live performance. Direct observation currently plays a critical role in training surgeons because it allows for feedback both during and after the procedure. However, it requires the presence of an experienced surgeon and/or educator and is not readily reproducible, owing to large interobserver variation in both assessment and feedback.
(b) Instrument motion data (sensor-based)
Motion analysis techniques can assess a surgeon's dexterity based on the movements of their hands and fingers.
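Sensor-based motion data is typically reduced to summary metrics, such as total path length and mean speed, before surgeons of different skill levels are compared. The following is a minimal Python sketch with illustrative metrics and made-up sample data, not the protocol of any study cited here:

```python
import math

def motion_metrics(samples):
    """Compute simple dexterity metrics from timed 3-D hand positions.

    `samples` is a list of (t, x, y, z) tuples. Path length and mean
    speed are two summary statistics commonly derived from sensor-based
    motion data; shorter, smoother paths are generally associated with
    greater expertise.
    """
    path_length = 0.0
    for (t0, *p0), (t1, *p1) in zip(samples, samples[1:]):
        path_length += math.dist(p0, p1)  # Euclidean step distance
    duration = samples[-1][0] - samples[0][0]
    return {"path_length": path_length,
            "duration": duration,
            "mean_speed": path_length / duration if duration else 0.0}

# A contrived trace: one unit of movement per second for three seconds.
trace = [(0, 0, 0, 0), (1, 1, 0, 0), (2, 2, 0, 0), (3, 3, 0, 0)]
print(motion_metrics(trace))
# {'path_length': 3.0, 'duration': 3, 'mean_speed': 1.0}
```

In practice many more metrics are used (number of movements, motion smoothness, economy of motion), but they are all reductions of the same timestamped position stream.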
[Fig. 18.1 is a flow diagram: input (surgical performance data) captured by direct observation, video (benchtop, wet lab, or OR), instrument motion, hand motion, and eye tracking is processed by expert ratings, machine learning, deep learning, or crowdsourcing to produce outputs such as feedback, skill, and clinical outcomes.]
Fig. 18.1 An overview of the scope for sources of data for surgical skill assessment and the training context in which they may be obtained
18 Artificial Intelligence in Cataract Surgery Training 217
[Fig. 18.2 is a schematic plotting data sources (eye gaze, automated ratings from DL algorithms, crowdsourced ratings using video, video review feedback, simulators, and phacoemulsification data) against two axes: increasing ease of capture and increasing usefulness.]
Fig. 18.2 Overview of different sources of data, ease with which they can be obtained, and access to current methods to use them to assess technical skill
II. Surgical device usage
Ogawa et al. developed the Surgical Media Center (SMC) (Abbott Medical Optics, Inc.), a cataract surgery recording device capable of measuring changes in multiple objective parameters, including phacoemulsification power, vacuum level, aspiration rate over time, and foot pedal position [7]. By doing so, the SMC can detect inappropriate phacoemulsification techniques by analyzing graphs and elucidate the cause of intraoperative complications. Ogawa et al. found significant differences in performance metrics measured by the SMC between expert (n = 3) and novice surgeons (n = 3). The authors stated that the time to reach maximum vacuum and the speed of increase in vacuum may be regarded as indicators of skill in handling the phacoemulsification device [7]. To date, the SMC is the only assessment tool utilizing intraoperative findings for resident education, and it can be applied to performance in both OR and wet lab settings.
III. Clinical outcomes
The technical skill of the operating surgeon has been proposed to be an important determinant of postoperative outcomes [17]. While tracking resident complication rates may allow for monitoring progress and ensuring patient safety, the rates of clinically significant complications, such as posterior capsule rupture and vitreous loss, remain low. Furthermore, outcomes may not be readily mapped to specific feedback information that can drive learning and improvements in performance.

Approaches to Assessment

(a) Structured rating scales
Rubrics are one of the earliest attempts at increasing objectivity in surgical assessment, allowing for a more structured approach and higher-quality feedback. However, despite continued efforts to make them more objective, the issue of subjectivity remains.
Multiple rubrics are presently available for the assessment of technical skill in cataract surgery. These include the Objective Assessment of Skills in Intraocular Surgery (OASIS) [18], Global Rating Assessment of Skills in Intraocular Surgery (GRASIS) [19], Subjective Phacoemulsification Skills Assessment (SPESA) [20], Objective Structured Assessment of Cataract Surgical Skill (OSACSS) [21], International Council of Ophthalmology–approved Ophthalmology Surgical Competency Assessment Rubrics (ICO-OSCAR) [22, 23], and the Objective Structured Assessment of Technical Skills (OSATS) [24]. Although rubrics remain the most affordable assessment method, with an annual cost of approximately $4000 [13], their widespread use and implementation have been limited by their time- and resource-intensive nature. In more recent years, rubrics such as the Iowa Department of Ophthalmology Objective Wet Laboratory Structured Assessment of Skill and Technique (OWLSAT) [25] and the Eye Surgical Skills Assessment Test (ESSAT) [26, 27] have been developed for assessing trainee performance in the wet lab. However, costs that may reach up to approximately $370,000 may hinder widespread implementation [13].
Assessments using structured rating scales may be obtained from individual experts or from a crowd, i.e., through crowdsourcing. In the crowdsourcing methodology, a random sample of unrelated individuals, who have an incentive to perform repetitive tasks but are not necessarily experts in the domain, has been found to yield skill ratings that are accurate on average. Recent studies show that crowdsourcing can rapidly provide skill assessments that are comparable to those obtained from expert ratings. Evidence has accumulated in robotic surgery, urology, laparoscopic surgery, and gynaecology [28–30]. To our knowledge, there is limited research on crowdsourcing in cataract surgical skill assessment.
(b) Objective data (video-based)
In recent years, there has been a push towards the use of machine learning (ML) and deep learning (DL) models for the objective assessment of surgical skill by utilizing video-based data. The goal with ML is for a computer to "learn" certain patterns from labeled datasets in order to analyze novel data and make informed predictions based on specific algorithms [31]. DL is a subset of machine learning that can process vast amounts of data to solve more complex queries. This is done using hierarchical algorithms structured in an "artificial neural network", a process inspired by how neurons in the human brain work [31].
Use of these algorithms for technical skill assessment in ophthalmology is still in its early stages, with the large majority of work focusing on methods for anatomical segmentation and tool detection/classification, both for phase and step recognition in cataract surgery [32]. Broadly, two approaches are available to identify the phases of cataract surgical procedures from videos: (1) content-based video retrieval, which involves matching a video to other, similar videos in a data set, and (2) decomposing a procedure video into its constituent phases (segmentation) and assigning each segment a phase label (classification). In the first approach, videos are transformed into fixed-dimensional feature representations using computer vision techniques and then compared with distance metrics within the feature space. Methods for the second approach include computer vision techniques and deep learning.

Anatomical Segmentation, Tool Recognition, and Task Recognition

(a) Machine learning
Surgical procedures can typically be "chaptered" into different levels of granularity: the surgical procedure, the phases, the steps, the activities, and the physical gestures. Different levels of granularity can currently be achieved through either a top-down or a bottom-up approach. In the context of a bottom-up approach, computer-assisted surgical (CAS) systems play an essential role by retrieving low-level information from the operating room (OR) for automatic recognition of high-level surgical tasks. Consequently, such frameworks can automatically detect procedures, evaluate surgeon performance, and increase surgical efficiency as well as the quality of care in the OR. However, before a surgical procedure can be "chaptered," certain steps often need to be taken, including extracting information on the various instruments in the surgical field of view and anatomical segmentation.
Several limitations exist for extracting this information. First, instrument recognition is often made difficult by the similarities in appearance between instruments or by differences in instrument scale, color gradient, or orientation. Second, for anatomical segmentation in cataract surgery, the pupil is often the region of interest, and automatic segmentation may be limited by interferences in the microscope field of vision that obscure the pupil. Bouget et al. used a bag-of-words, ML-based approach that detected surgical tools with 84% accuracy and an image-based analysis that detected the pupil as the region of interest with 95% accuracy. Together, the addition of these two modules within the framework resulted in the automatic detection of eight phases of cataract surgery (preparation, betadine injection, corneal incision, capsulorhexis, phacoemulsification, cortical aspiration, IOL implantation, IOL adjustment and wound sealing) with 94% accuracy [33]. Lalys et al. similarly proposed a high-level task recognition system based on application-dependent visual cues and time-series analysis, using either a Dynamic Time Warping (DTW) or Hidden Markov Model (HMM) algorithm. Five subsystems based on visual cues were implemented: color-oriented visual cues (simple histogram intersection), texture-oriented visual cues (bag-of-words approach), shape-oriented visual cues (a Haar classifier trained for instrument categorization and a bag-of-words approach used for other instrument detection), and all other visual cues. Using this framework, the authors achieved a global recognition rate of almost 94% for
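The Dynamic Time Warping step mentioned above can be illustrated compactly. This is a generic, textbook DTW over one-dimensional cue sequences, not the implementation from the cited work:

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D cue sequences.

    DTW aligns two time series that unfold at different speeds, which is
    why it suits surgical video: the same sequence of phases may be
    performed faster or slower by different surgeons.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # insertion
                                    cost[i][j - 1],      # deletion
                                    cost[i - 1][j - 1])  # match
    return cost[n][m]

# A sequence stretched to twice its length still aligns perfectly,
# whereas a shifted sequence accumulates cost.
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # 0.0
print(dtw_distance([1, 2, 3], [2, 3, 4]))           # 2.0
```

Because DTW tolerates differences in tempo, a trainee performing the correct phase sequence slowly is still matched to the reference, while a missing or out-of-order phase raises the alignment cost.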
220 N. Alnafisee et al.
levels of surgical skill in robotic, laparoscopic, and urologic procedures [50, 51].
Recent technical advances, particularly in deep learning, have transformed algorithms, enabling accurate identification of surgical instruments and automated segmentation, as well as the classification and analysis of cataract surgery videos for data-driven, objective, and valid assessment and feedback. In clinical practice, the power of CNNs can be leveraged for cataloguing datasets that can be used in different applications for surgical training. Despite these advances, the pace of progress has been limited by the scale of data that can be captured in the OR and by the extensive manual annotation of anatomy and human activities that is required before such datasets can be processed. For instance, if one were to analyze new aspects of a video that had previously not been annotated, relabelling the whole dataset would be needed, an extremely tedious and time-consuming task.
As previously mentioned, motion analysis techniques have yet to be implemented for cataract surgery assessment and have only been used in corneal suturing and oculoplastics [11, 12]. Besides the costs involved [13], the main disadvantage of using this form of motion analysis is that it cannot be used in the OR, as it may compromise sterility and add clutter or extra steps to a surgical routine. Machine learning techniques have been applied to motion analysis metrics in non-ophthalmic surgical fields for surgical phase recognition [52, 53]. Different subcategories of motion analysis currently exist: (1) hand motion analysis, (2) tool motion analysis, (3) eye motion tracking, and (4) muscle contraction analysis [6]. Data from hand-worn motion sensors have been reasonably successful at differentiating between surgeons of different levels [53–55]. In 2018, multiple papers were published on sensor-based tool motion tracking for surgical skills assessment [56–59] and prediction of surgical outcomes [60, 61]. Some achieved nearly perfect results at skill-level classification [62–64] and outperformed state-of-the-art methods [65].
Modelling the optimally performed surgery (performed by an expert) is a challenge, but it is necessary to assess novice performance and measure where they stand in comparison. This can be an issue if experts use slightly different techniques or different tools to do so. Another issue is the lack of uniform criteria for novice, intermediate, and expert surgeons in the studies mentioned. In addition, data on surgical skill level are continuous rather than categorical, ranging between the different levels of expertise with no discrete boundaries between them. Furthermore, the relationship between objectively measured performance metrics and expertise classification is non-linear. Hajshirmohammadi and Payandeh proposed a solution for this problem, using fuzzy set theory to classify novice, intermediate, and expert performance on a laparoscopic VR surgical simulator [66], although they did not achieve optimal results.
The ultimate goal of these technological advances would be to improve the surgeon learning curve and to ensure that patients are able to navigate their surgical care. In the future, deep learning algorithms may advance to a point where they can provide autonomous feedback to surgeons, thus eliminating the traditional apprenticeship model. However, before that can be achieved, it is important to lay the groundwork. In this regard, DL algorithms need to be refined even further for accurate tool recognition and phase segmentation. Only when algorithms can understand the granular concepts can they learn higher-level concepts. Implementation of such algorithms can subsequently provide a scalable, rapid, and objective method for the assessment of technical skill across all surgical disciplines [45, 46, 67–72].

Skills Assessment in Other Surgical Specialties

Automated surgical skills assessment methods in non-ophthalmic specialties have progressed faster than in ophthalmology [6, 60, 73]. This is particularly true for robotic surgery [56, 59–61, 63, 65, 67, 68, 74, 75], the data of which has been deemed the most transparent, scalable, and comprehensive [5]. There has also been notable progress in laparoscopic surgical assessment [53, 55, 58, 69, 70].
The majority of phase recognition via ML algorithms has been applied to laparoscopic surgery, but there has also been progress in phase recognition for minimally invasive surgery (MIS) [76, 77] and robotic surgery [78], with some studies combining computer vision with kinematic data [79]. ML used for surgical assessment has mostly been applied to robotic surgery [80], with the majority of focus being on kinematic data [63, 74] and some research on computer vision [60, 67, 68, 74, 75]. Fard et al. used motion data to assess surgical skill in robotic surgery and suggested that the classification methods they proposed (k-nearest neighbours, logistic regression, and support vector machines) could be used to provide tailored feedback for trainees [59]. Zia et al. also used kinematic data for skills classification and generated "task highlights" that showed which parts of the procedure contributed the most to the final scores, allowing for individual feedback [63]. An important goal expressed by these studies is to provide tailored feedback for trainees in real time [59, 63] and to predict surgical outcomes using ML methods.
The future of assessing cataract surgical skills is moving towards automated methods to increase efficiency and objectivity. Such methods involve extracting data from video recordings, virtual reality surgical simulators, and potentially sensor-based motion analysis, to be processed via machine learning or deep learning algorithms into meaningful assessments. Further research is needed to refine these methods in order to incorporate these measures into the training of ophthalmic residents.

References

1. National Eye Institute. Cataract data and statistics [Internet]. 2019. https://www.nei.nih.gov/learn-about-eye-health/resources-for-health-educators/eye-health-data-and-statistics/cataract-data-and-statistics
2. Accreditation Council for Graduate Medical Education.
3. Lee AG, Volpe N. The impact of the new competencies on resident education in ophthalmology. Ophthalmology. 2004;111(7):1269–70.
4. Puri S, Sikder S. Cataract surgical skill assessment tools. J Cataract Refract Surg [Internet]. 2014;40(4):657–65. https://doi.org/10.1016/j.jcrs.2014.01.027.
5. Vedula SS, Ishii M, Hager GD. Objective assessment of surgical technical skill and competency in the operating room. Annu Rev Biomed Eng [Internet]. 2017;19(1):301–25. https://doi.org/10.1146/annurev-bioeng-071516-044435.
6. Levin M, McKechnie T, Khalid S, Grantcharov TP, Goldenberg M. Automated methods of technical skill assessment in surgery: a systematic review. J Surg Educ [Internet]. 2019;1–11. https://www.sciencedirect.com/science/article/pii/S1931720419301643?dgcid=raven_sd_aip_email
7. Ogawa T, Shiba T, Tsuneoka H. Usefulness of surgical media center as a cataract surgery educational tool. J Ophthalmol. 2016;2016.
8. Gauba V, Tsangaris P, Tossounis C, Mitra A, McLean C, Saleh GM. Human reliability analysis of cataract surgery. Arch Ophthalmol. 2008;126(2):173–7.
9. Cox A, Dolan L, MacEwen CJ. Human reliability analysis: a new method to quantify errors in cataract surgery. Eye. 2008;22(3):394–7.
10. Finn AP, Borboli-Gerogiannis S, Brauner S, Peggy Chang HY, Chen S, Gardiner M, et al. Assessing resident cataract surgery outcomes using medicare physician quality reporting system measures. J Surg Educ [Internet]. 2016;73(5):774–9. https://doi.org/10.1016/j.jsurg.2016.04.007.
11. Saleh GM, Voyazis Y, Hance J, Ratnasothy J, Darzi A. Evaluating surgical dexterity during corneal suturing. Arch Ophthalmol. 2006;124(9):1263–6.
12. Saleh GM, Sim D, Lindfield D, Borhani M, Ghoussayni S, Gauba V. Motion analysis as a tool for the evaluation of oculoplastic surgical skill: evaluation of oculoplastic surgical skill. Arch Ophthalmol. 2008;126(2):213–6.
13. Nandigam K, Soh J, Gensheimer WG, Ghazi A, Khalifa YM. Cost analysis of objective resident cataract surgery assessments. J Cataract Refract Surg [Internet]. 2015;41(5):997–1003. https://doi.org/10.1016/j.jcrs.2014.08.041.
14. Smith P, Tang L, Balntas V, Young K, Athanasiadis Y, Sullivan P, et al. "PhacoTracking": an evolving paradigm in ophthalmic surgical training. JAMA Ophthalmol. 2013;131(5):659–61.
15. Balal S, Smith P, Bader T, Tang HL, Sullivan P, Thomsen ASS, et al. Computer analysis of individual cataract surgery segments in the operating room. Eye [Internet]. 2019;33(2):313–9. http://www.nature.com/articles/s41433-018-0185-1
16. Din N, Smith P, Emeriewen K, Sharma A, Jones S, Wawrzynski J, et al. Man versus machine: software training for surgeons – an objective evaluation of human and computer-based training tools for cataract surgical performance. J Ophthalmol. 2016;2016.
17. Low SAW, Braga-Mele R, Yan DB, El-Defrawy S. Intraoperative complication rates in cataract surgery performed by ophthalmology resident trainees com-
tional neural network. Proc Annu Int Conf IEEE Eng Med Biol Soc EMBS. 2017:2002–5.
45. Zhu J, Luo J, Soh JM, Khalifa YM. A computer vision-based approach to grade simulated cataract surgeries. Mach Vis Appl. 2014;26(1):115–25.
46. Kim TS, O'Brien M, Zafar S, Hager GD, Sikder S, Vedula SS. Objective assessment of intraoperative technical skill in capsulorhexis using videos of cataract surgery. Int J Comput Assist Radiol Surg [Internet]. 2019;14(6):1097–105. https://doi.org/10.1007/s11548-019-01956-8
47. Spiteri A, Aggarwal R, Kersey T, Benjamin L, Darzi A, Bloom P. Phacoemulsification skills training and assessment. Br J Ophthalmol. 2010;94(5):536–41.
48. Selvander M, Åsman P. Cataract surgeons outperform medical students in Eyesi virtual reality cataract surgery: evidence for construct validity. Acta Ophthalmol. 2013;91(5):469–74.
49. Kim TS, Malpani A, Reiter A, Hager GD, Sikder S, Swaroop Vedula S. Crowdsourcing annotation of surgical instruments in videos of cataract surgery. In: Stoyanov D, Taylor Z, Balocco S, Sznitman R, Martel A, Maier-Hein L, et al., editors. Intravascular imaging and computer assisted stenting and large-scale annotation of biomedical data and expert label synthesis. Cham: Springer International; 2018. p. 121–30.
50. Chen C, White L, Kowalewski T, Aggarwal R, Lintott C, Comstock B, et al. Crowd-sourced assessment of technical skills: a novel method to evaluate surgical performance. J Surg Res [Internet]. 2014;187(1):65–71. https://doi.org/10.1016/j.jss.2013.09.024.
51. Prebay ZJ, Peabody JO, Miller DC, Ghani KR. Video review for measuring and improving skill in urological surgery. Nat Rev Urol [Internet]. 2019;16(4):261–7. https://doi.org/10.1038/s41585-018-0138-2.
52. Bardram JE, Doryab A, Jensen RM, Lange PM, Nielsen KLG, Petersen ST. Phase recognition during surgical procedures using embedded and body-worn sensors. In: 2011 IEEE International Conference on Pervasive Computing and Communications (PerCom). 2011. p. 45–53.
53. Kowalewski K-F, Garrow CR, Schmidt MW, Benner L, Müller-Stich BP, Nickel F. Sensor-based machine learning for workflow detection and as key to detect expert level in laparoscopic suturing and knot-tying. Surg Endosc [Internet]. 2019. https://doi.org/10.1007/s00464-019-06667-4
54. Watson RA. Use of a machine learning algorithm to classify expertise: analysis of hand motion patterns during a simulated surgical task. Acad Med. 2014;89(8):1163–7.
55. Miao T, Tomikawa M, Akahoshi T, Hashizume M, Lefor AK, Souzaki R, et al. Feasibility of an AI-based measure of the hand motions of expert and novice surgeons. Comput Math Methods Med. 2018;2018:1–6.
56. Wang Z, Majewicz FA. Deep learning with convolutional neural network for objective skill evaluation in robot-assisted surgery. Int J Comput Assist Radiol Surg. 2018;13(12):1959–70.
57. Forestier G, Fawaz HI, Weber J, Idoumghar L, Muller P-A, Petitjean F, et al. Surgical motion analysis using discriminative interpretable patterns. Artif Intell Med. 2018;91:3–11.
58. Oquendo YA, Riddle EW, Hiller D, Blinman TA, Kuchenbecker KJ. Automatically rating trainee skill at a pediatric laparoscopic suturing task. Surg Endosc [Internet]. 2018;32(4):1840–57. https://doi.org/10.1007/s00464-017-5873-6.
59. Fard MJ, Ameri S, Darin Ellis R, Chinnam RB, Pandya AK, Klein MD. Automated robot-assisted surgical skill evaluation: predictive analytics approach. Int J Med Robot Comput Assist Surg. 2018;14(1):1–10.
60. Hung AJ, Chen J, Gill IS. Automated performance metrics and machine learning algorithms to measure surgeon performance and anticipate clinical outcomes in robotic surgery. JAMA Surg [Internet]. 2018;153(8):770. https://doi.org/10.1001/jamasurg.2018.1512
61. Hung AJ, Chen J, Che Z, Nilanon T, Jarc A, Titus M, et al. Utilizing machine learning and automated performance metrics to evaluate robot-assisted radical prostatectomy performance and predict outcomes. J Endourol [Internet]. 2018;32(5):438–44. https://doi.org/10.1089/end.2018.0035.
62. Ismail Fawaz H, Forestier G, Weber J, Idoumghar L, Muller PA. Evaluating surgical skills from kinematic data using convolutional neural networks. Lect Notes Comput Sci. 2018;11073 LNCS:214–21.
63. Zia A, Essa I. Automated surgical skill assessment in RMIS training. Int J Comput Assist Radiol Surg. 2018;13(5):731–9.
64. Zia A, Sharma Y, Bettadapura V, Sarin EL, Essa I. Video and accelerometer-based motion analysis for automated surgical skills assessment. Int J Comput Assist Radiol Surg. 2018;13(3):443–55.
65. Wang Z, Fey AM. SATR-DL: improving surgical skill assessment and task recognition in robot-assisted surgery with deep neural networks. Proc Annu Int Conf IEEE Eng Med Biol Soc EMBS. 2018;(1):1793–6.
66. Hajshirmohammadi I, Payandeh S. Fuzzy set theory for performance evaluation in a surgical simulator. Presence Teleoperators Virtual Environ. 2007;16(6):603–22.
67. Zhang Y, Law H, Kim T-K, Miller D, Montie J, Deng J, et al. PD58-12 Surgeon technical skill assessment using computer vision-based analysis. J Urol [Internet]. 2018;199(4S). https://doi.org/10.1016/j.juro.2018.02.2800
68. Law H, Ghani K, Deng J. Surgeon technical skill assessment using computer vision based analysis. Proc Mach Learn Healthc. 2017;68.
69. Handelman A, Schnaider S, Schwartz-Ossad A, Barkan R, Tepper R. Computerized model for objectively evaluating cutting performance using a laparoscopic box trainer simulator. Surg Endosc [Internet]. 2018. https://doi.org/10.1007/s00464-018-6598-x
[Fig. 19.1 is a flow diagram: an ophthalmology referral is triaged by an ophthalmology nurse or medical officer into Emergency (to be discussed with the on-call registrar), Category 1 (target within 4 weeks), Category 2 (target within 3 months), or Category 3 (target within 12 months).]
Fig. 19.1 Adult triage criteria for referrals at a tertiary ophthalmology center in Adelaide, Australia. Acute sight-threatening pathologies are usually sent to the emergency department and then referred to the ophthalmology service by a phone call notification in addition to the written referral
referral pathway to either eye casualty or a specified subspecialty clinic [6].
Despite the availability of dedicated referral guidelines, triaging in ophthalmology remains a dynamic process. Effective triaging requires the responsible health professional to have sound clinical knowledge, compassion, and the ability to identify and correct errors.

Application of Artificial Intelligence

Current processes involved in the categorization of referrals are both time consuming and prone to error. Human errors may stem from a lack of clinical experience, failure to follow protocols, written miscommunication, or transcription error. The goal of applying artificial intelligence (AI) to ophthalmology triaging is to improve both the accuracy and efficiency of the triage process.
Machine learning (ML) assisted triaging has been successfully applied to other fields of medicine. For example, ML methods appear to be useful in triaging undifferentiated patients in the emergency department and have been shown to be accurate in the triage of chronic obstructive pulmonary disease (COPD) exacerbations. The majority of ML studies in ophthalmology have focused on image interpretation, including the application of ML to aid diagnosis based upon fundus photographs, visual field analysis, and optical coherence tomography [7–9]. The application of deep learning (DL) and natural language processing (NLP) to the issue of triaging is a novel approach established at the South Australian Institute of Ophthalmology (SAIO), Australia. Results from
19 Artificial Intelligence in Ophthalmology Triaging 229
preliminary studies suggest that ML, in particular DL, can accurately assist with the triaging of ophthalmology referrals.

Deep Learning Analysis

A pilot study was first conducted in 2018 involving the use of retrospectively collected outpatient ophthalmology referrals to determine how effectively NLP can identify referrals requiring a "category one" (urgent) prioritization. A secondary aim was to emulate human triaging and determine accuracy across three referral categories (Categories 1–3) [10]. A referral database was established from consecutive referrals received by the Royal Adelaide Hospital Ophthalmology Department between January 2018 and March 2019. Referrals for these patients had all been made within the previous 24 months. Pertinent information, including clinical synopsis, triage categorisation, and referral source, was manually extracted from scanned electronic referrals using commercially available optical character recognition (OCR) software (Adobe, San Jose, CA).
The DL analysis of ophthalmology triage notes was conducted in three phases (see Fig. 19.2):
[Fig. 19.2 is a flowchart: referral collection feeds a pre-processing phase (negation detection, punctuation removal, word stemming and tokenization), followed by a model development phase (develop and increase model complexity; apply the model to the set within 5-fold cross-validation), and a performance analysis phase (final model applied to the test set).]
Fig. 19.2 Flowchart demonstrating the development of an artificial intelligence guided triage model
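The pre-processing phase of Fig. 19.2 can be sketched in a few lines of Python. The negation cue list and suffix stemmer below are deliberately crude, hypothetical stand-ins for production components (a clinical negation algorithm and a proper stemmer):

```python
import re

# Hypothetical cue words; a real system would use a clinical negation
# algorithm (e.g. NegEx-style rules) and a linguistically sound stemmer.
NEGATIONS = {"no", "not", "denies", "without"}
SUFFIXES = ("ing", "ed", "s")

def preprocess(text):
    """Tokenize a referral note, crudely stem, and mark negated terms.

    Mirrors the pre-processing steps sketched in Fig. 19.2: punctuation
    removal, word stemming/tokenization, and negation detection (terms
    after a negation cue are prefixed with 'NEG_' so "no pain" is not
    conflated with "pain").
    """
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    out, negate = [], False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True
            continue
        for suf in SUFFIXES:
            # Only strip a suffix from reasonably long words.
            if tok.endswith(suf) and len(tok) > len(suf) + 3:
                tok = tok[: -len(suf)]
                break
        out.append("NEG_" + tok if negate else tok)
    return out

print(preprocess("Sudden vision loss, no pain."))
# ['sudden', 'vision', 'loss', 'NEG_pain']
```

The resulting token lists would then feed the model-development phase, where the classifier is tuned within 5-fold cross-validation before a single final model is applied to the held-out test set.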
230 Y. Tan et al.
emergencies outside of the examples that it is trained with.
Building a sufficiently large triage database is a significant and difficult undertaking. Foremost, referral documents are not standardized and differ significantly depending on the source. The document format may be hard-copy notes, emails, scanned documents, or electronic medical records. The current process of extracting text from referral notes is labor-intensive. Despite the availability of OCR, the majority of referral notes still require further human processing to ensure that the correct information is transcribed. Efficiency will likely improve with a more streamlined and consistent means of receiving referrals in the age of digital medicine. For example, with the availability of electronic medical records (EMR), a number of internal referral pathways have shifted away from faxed paper documents. In many tertiary medical centers, referrals can be sent directly via the EMR operating system and extracted as a monthly report. This will enable a large quantity of data to be captured.
Distant labels: DL models contain document classifiers that are built to detect the general concepts conveyed in the text. In the real world, the clinical urgency of referrals is often influenced by factors other than the presenting complaint. Human-guided triaging will take into account a patient's age, geographic location, past history, and social circumstances. Based on the final triage category alone, ML classifiers will have difficulty emulating the complex thought process behind the triage decision. The distant-label problem can be addressed by splitting the triaging task into two parts: first, detecting the concepts conveyed in each document, and second, deciding on the triage category. Managing the problem of distant labels will require the triaging personnel to manually "tag" the referral with all of the factors that led to their decision. For example, in a referral for undifferentiated vision loss to hand movement (HM) acuity, the tags for emergency triage could be "visual acuity HM" and "only eye".
Specialized vocabulary: Ophthalmology referrals often contain specialized medical jargon that is difficult for a base DL model to interpret. A base NLP model is typically trained on a very large general-purpose body of text to provide a sound fundamental understanding of English vocabulary and grammar. Medical jargon, in particular terms used in ophthalmology, unfortunately falls outside the scope of common English. For example, a base model that can differentiate left and right would have great difficulty recognizing the Latin-derived terms "OU" or "OD" commonly found in ophthalmology reports. In AI-assisted triaging, there is a need to train a specialized language model that understands the language of medicine, not just common English. Building a specialized language model is a very time-consuming and resource-heavy process that would require collection of, and subsequent training on, millions of unlabelled documents.

The Future

In the digital age, attempts to integrate AI into clinical practice will continue to stay at the forefront of ophthalmic research. AI-assisted triaging in ophthalmology has demonstrated early potential in discriminating between urgent and non-urgent referrals. The South Australian Institute of Ophthalmology (SAIO), in collaboration with the Australian Institute for Machine Learning (AIML), is conducting further derivation tests on an expanded dataset. The ongoing development involves training an off-the-shelf document classification model, leveraging a large pre-trained DistilBERT model for language understanding. Interim analysis based on an expanded sample size of 1000 referrals showed promising results, with improved ability to discriminate between multiple triage categories. Validation accuracies of up to 80% were obtained when triaging referrals to either emergency (within 24 hours), category 1 (within 4 weeks), or category 2 (within 1 year).
The use of a small database remains the greatest limitation of the preliminary pilot study. Future research relating to AI-assisted triaging should endeavour to use larger sample sizes,
19 Artificial Intelligence in Ophthalmology Triaging 233
consultant-level triage allocation, and data from multiple centres. SAIO is currently building an expanded referral database to allow further ML testing to be conducted. A particular aim is to improve the efficiency of the data capture process by incorporating a text extraction function within the AI algorithm. This will allow digital referrals in various formats to be inputted directly into the model without the need for human processing. If the preliminary results are validated in a subsequent derivation study, SAIO hopes to eventually conduct randomized controlled trials to test the accuracy of AI-assisted triaging against the gold standard of triage by a consultant ophthalmologist.
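The two-part decomposition described under "Distant labels" (first tag the concepts a referral conveys, then decide the triage category from the tags) can be sketched as follows. The concept cues and triage rules here are invented for illustration only; the actual pilot used trained ML classifiers rather than keyword matching.

```python
# Stage 1: concept tagging; stage 2: tags -> triage category.
# CONCEPTS and the rules in triage() are hypothetical examples.
CONCEPTS = {
    "visual acuity HM": ["hand movement", "hm acuity"],
    "only eye": ["only eye", "monocular patient"],
    "cataract": ["cataract"],
}

def tag_concepts(referral_text):
    """Stage 1: detect which clinical concepts a referral conveys."""
    text = referral_text.lower()
    return {concept for concept, cues in CONCEPTS.items()
            if any(cue in text for cue in cues)}

def triage(tags):
    """Stage 2: map the detected tags to a triage category."""
    if "visual acuity HM" in tags or "only eye" in tags:
        return "emergency (within 24 hours)"
    if "cataract" in tags:
        return "category 2 (within 1 year)"
    return "category 1 (within 4 weeks)"  # default: semi-urgent review

note = "Undifferentiated vision loss to hand movement acuity in an only eye."
print(triage(tag_concepts(note)))  # emergency (within 24 hours)
```

Separating the stages means the human-supplied tags (the "distant labels" fix) train stage 1, while stage 2 can remain a small, auditable rule set or classifier.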
20 Deep Learning Applications in Ocular Oncology
T. Y. Alvin Liu and Zelia M. Correa
The paraffin blocks were cut into 4-micron-thick sections that were incubated with mouse monoclonal antibodies against BAP1 and counterstained with hematoxylin eosin (H&E). The resultant glass slides were digitally scanned, and the region of interest containing the UM tumor was further cropped into numerous 256 × 256 pixel image patches. In total, 8176 histopathology image patches were generated in this fashion. Each image patch was annotated twice by an ophthalmic pathologist, who established the ground truth, and each image patch was classified into one of four categories: positive (positive for BAP1 expression, 2576 patches), negative (negative for BAP1 expression, 4720 patches), blurred (too blurred to be classified, 560 patches) and excluded (lack of UM cells, 320 patches). The 8176 image patches were randomly split into a training (6800 image patches) and testing (1376 image patches) subset. The authors applied transfer learning to a pre-trained DenseNet-121 network, and achieved a sensitivity of 97.09%, specificity of 98.12%, and overall accuracy of 97.10% in predicting nuclear BAP1 expression. While this study represents the first DL study in the field of ocular oncology, the resultant DLS ultimately only emulates what a human pathologist is capable of performing: identifying images with positive BAP1 staining. In addition, the reported methodology has limited clinical practicality, as it requires histopathology slides obtained from enucleated eyes, and the current standard of care for most eyes with UM is globe-preserving local therapy with either plaque brachytherapy or proton beam irradiation.

In the second study, our group (Liu et al. [9]) extended the application of DL techniques in digital pathology slide analysis in UM, and aimed to train a DLS to perform a task that is impossible for a human pathologist: predicting gene expression profile (GEP) in smeared slides stained with H&E alone, obtained from fine needle aspiration biopsy (FNAB) of uveal melanomas. The underlying hypothesis is that cancer cell morphology reflects the underlying genetics, and that careful analysis of cytopathology images will provide helpful prediction of the biological behavior of the tumor and the clinical course of the patient. UM is unique among malignancies in that GEP, independent of other clinicopathological parameters, has been shown to be the most robust method currently available to predict long-term metastasis risk and survival. UM patients can be divided into two classes by GEP: class 1 and class 2, and there is a stark contrast in long-term survival between the two classes: the 92-month survival probability in class 1 patients is 95% vs. 31% in class 2 patients [10, 11].

In this study, 20 de-identified FNAB cytology slides from 20 patients with UM underwent H&E staining. Whole-slide scanning was performed for each cytology slide at a magnification of 40×, and native-resolution crops containing melanoma cells were saved. Each snapshot image measured 1716 pixels (width) × 926 pixels (height), and was further split into eight tiles of equal size. The tiles were then screened and selected for further processing only if at least one melanoma cell was present. Typically, each slide generated hundreds of 40× snapshot images, and out of the 20 slides, a total of 26,351 unique image tiles were generated. A schematic representation of the data processing is shown in Fig. 20.2. The GEP ground truth at the slide level was established by the commercially available DecisionDx-UM® test (Friendswood, Texas), and the GEP label designated to a particular slide was propagated to all the image tiles generated from that slide. That is, if "slide 1" was determined to be GEP class 1 by the DecisionDx-UM® test, then all the image tiles generated from "slide 1" were labeled as "class 1."

The authors applied transfer learning to a pre-trained ResNet-152 network for the binary classification problem of distinguishing between class 1 and class 2 image tiles. Due to the low amount of data (patient) variation, the validation slides would have a strong effect on the model performance, so "leave-one-out" cross-validations were performed to evaluate the DLS's performance. To test each of the 20 slides/patients, 10 models using different training/validation splits were trained. Specifically, for each of the leave-one-out cross-validations, 10 random samplings for the validation subset selection were performed. If "slide 1" was used as the testing
20 Deep Learning Applications in Ocular Oncology 237
Fig. 20.2 Schematic
representation of data
processing. (Top Panel)
Whole slide scanning;
one slide per patient.
(Middle Panel) Snapshot
image manually
captured at 40×;
multiple 40× images
were captured from each
slide. (Bottom Panel)
Each 40× image was
further divided into eight
tiles of equal sizes
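The tiling step shown in the bottom panel of Fig. 20.2 (each 1716 × 926 snapshot split into eight equal tiles) can be sketched as below. The 2 × 4 grid is an assumption: the chapter states only that eight equal tiles were produced, and a 2 × 4 grid is one layout that divides 1716 × 926 evenly.

```python
def tile_boxes(width, height, cols=4, rows=2):
    """Return (left, top, right, bottom) pixel boxes splitting a
    width x height snapshot into rows*cols equal tiles.
    The 2 x 4 grid is an illustrative assumption, not from the study."""
    assert width % cols == 0 and height % rows == 0, "must divide evenly"
    tw, th = width // cols, height // rows
    return [(c * tw, r * th, (c + 1) * tw, (r + 1) * th)
            for r in range(rows) for c in range(cols)]

boxes = tile_boxes(1716, 926)
print(len(boxes), boxes[0])  # 8 tiles, each 429 x 463 pixels
```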
slide, then the other 19 slides were used for model development: 17 slides for training and 2 slides for validation (one from class 1 and one from class 2). "Slide 1" was then tested 10 different times by 10 different models that were generated by 10 random and different combinations of training and validation slides. For example, model #1 would use "slide 2" and "slide 11" for validation, model #2 would use "slide 3" and "slide 12", model #3 would use "slide 4" and "slide 13", etc. Eventually, 10 models were generated, and the
238 T. Y. A. Liu and Z. M. Correa
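The evaluation scheme described above (leave one slide out for testing, then draw ten random training/validation partitions of the remaining slides) can be sketched as follows. For brevity this sketch draws the two validation slides without the class-balancing constraint the study used (one slide from each GEP class).

```python
import random

def leave_one_out_splits(slides, n_models=10, n_val=2, seed=0):
    """For each held-out test slide, draw n_models random
    train/validation partitions of the remaining slides."""
    rng = random.Random(seed)
    scheme = {}
    for test_slide in slides:
        rest = [s for s in slides if s != test_slide]
        models = []
        for _ in range(n_models):
            val = rng.sample(rest, n_val)              # 2 validation slides
            train = [s for s in rest if s not in val]  # 17 training slides
            models.append((train, val))
        scheme[test_slide] = models
    return scheme

slides = [f"slide {i}" for i in range(1, 21)]
scheme = leave_one_out_splits(slides)
# 20 held-out slides x 10 models each; every model trains on 17 slides
```

Averaging the ten models' predictions per held-out slide reduces the strong dependence on which slides happened to land in the small validation set.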
accurately identify and classify abnormal images for disease prediction. This approach has been particularly successful in ophthalmology, for automated identification on retinal fundus images of diabetic retinopathy (DR), retinopathy of prematurity, other retinopathies, glaucoma, etc. [2]. To date, two algorithms have been authorized by the FDA for detection of DR on retinal fundus images (IDx-DR, Coralville, and EyeArt, EyeNuk), based on pivotal prospective studies [3]. Deep learning methods have been successfully applied to other retinal imaging modalities, e.g. optical coherence tomography (OCT) for identification of diabetic macular edema, glaucoma, etc. More interestingly, recent "machine to machine" learning techniques have allowed deep learning to directly predict OCT parameters (e.g. retinal nerve fiber layer thickness, or diabetic macular edema grade) using monoscopic retinal fundus photographs, with higher specificity than the prediction provided by doctors.

Deep learning algorithms have been deployed on fundus images for the detection of the most common optic neuropathy, glaucoma, based on the large numbers of available optic disc images in this condition. The optic disc features in glaucoma are relatively specific (increased cup/disc ratio, notching, etc.) and differ from the abnormal features in patients affected by other optic neuropathies. The initial studies of optic disc classification in glaucoma provided good results, despite several methodological limitations inherent to the nature of the "reference standard". Indeed, the reference standard was often established on subjective, post-hoc assessments of the optic discs, performed by randomly selected ophthalmologists/graders, and not on clinical information obtained in the native datasets. In order to circumvent this limitation, a recent, very elegant study used a deep learning algorithm trained on images against objective reference standards (i.e. average retinal nerve fiber layer (RNFL) thickness values) and subsequently applied to stereoscopic optic disc images, a technique called "machine to machine" [4]. In other words, this algorithm was able to predict and quantify the retinal neuronal loss at the optic disc, based only on analysis of retinal images. Several other recent studies have suggested that modern deep learning systems can achieve high sensitivity, specificity, and generalizability for detecting glaucoma on retinal images alone, in a cost-effective and time-efficient manner. Only very few studies, mostly based on computer-based analysis, have aimed to automatically detect "non-glaucomatous" or "neuro-ophthalmic" optic disc abnormalities, given the relative scarcity of these conditions. Indeed, deep learning algorithms are notoriously dependent on large training datasets, making this approach particularly difficult in neuro-ophthalmology.

Achievements of AI in Neuro-ophthalmology

Early computer-aided diagnostic systems have aimed to automatically detect neuro-ophthalmic optic disc abnormalities on retinal fundus images [5], including papilledema, achieving good results, with high accuracy and substantial agreement with the Frisen severity classification provided by expert neuro-ophthalmologists [6]. Neuro-ophthalmic abnormalities affecting the optic discs (e.g. optic disc swelling in inflammatory/ischemic/compressive optic neuropathies, papilledema associated with intracranial hypertension, optic nerve head drusen, optic atrophy in chronic optic neuropathies, etc.) are however rare conditions, explaining the scarce published literature in this field.

In 2020, an international neuro-ophthalmology consortium (BONSAI: Brain and Optic Nerve Study with Artificial Intelligence) published the results obtained with a dedicated deep-learning system aiming to discriminate optic discs with papilledema, normal discs, and discs with nonpapilledema abnormalities [7]. This large retrospective collaborative study included multi-ethnic populations from 24 neuro-ophthalmology sites in 15 countries on three continents, using fundus images obtained with a large set of fundus camera brands. The training and validation data sets
21 Artificial Intelligence in Neuro-ophthalmology 241
Fig. 22.1 Framework for artificial intelligence to evaluate systemic disease via the eye
risk factors (e.g., age, blood pressure, smoking) or other biomarkers (e.g., coronary artery calcium) (Fig. 22.1); 2) longitudinal studies that use AI-DL technology on CFP to predict the incidence or risk of systemic disease (e.g., cardiovascular disease events or mortality).

Cross-Sectional Studies

Prediction of Demographic and Lifestyle Factors

Figure 22.1 shows the framework for using AI-DL on CFPs to evaluate systemic diseases.

Table 22.1 illustrates various studies in which AI-DL on CFP has been used for the prediction of demographic and lifestyle factors. Among the identified studies, most investigated age as a predictable variable from CFP via AI-DL. Chronological age is the most reliable at portraying growth milestones accurately [19], and the retina is considered the "window" to the whole body. Therefore, predicting age from CFP via AI-DL could provide valuable information about the status of a target organ and/or the body [24]. In addition to age as a predictor, the ability to identify sex with high confidence from CFP via AI-DL has been demonstrated in similar studies. For example, Rim et al. [20] showed fair results in their external multi-ethnic test sets for both age (coefficient of determination, R² = 0.36–0.63) and sex (area under the curve [AUC] = 0.80–0.91) predictions, demonstrating reasonable generalizability in predicting sex and age from CFPs. Nonetheless, it still remains unclear which part of the retina contributes to the prediction of sex.

In terms of lifestyle factors, smoking status is commonly assessed because of the direct link between CVD and smoking habits. The effect of smoking on the retinal vasculature was previously reported, with studies showing that cigarette smoking was linked with a wider retinal venular caliber [25, 26]. Other studies have also demonstrated that one's smoking status is associated with CVD due to the dual effects on both the retinal and systemic circulation, as suggested by visible changes in the retinal vascular structure [26–28]. Along with the fair results obtained from the internal test sets of three unique studies (AUC = 0.71–0.86) [18, 21, 23], this allowed ophthalmic researchers to conclude with good confidence that smoking status was predominantly predicted from the retinal vessels via AI-DL.

Prediction of Body Composition Factors

Table 22.2 summarizes the three main studies that used AI-DL to predict body composition factors based on CFP. The association between an increased body-mass index (BMI) and mortality, in the form of stroke [29], cancer [30, 31], etc., has been established because BMI is a common measure of adiposity [32]. However, just like many other systemic factors, BMI prediction from CFP via AI-DL is not suitable for clinical application yet. This is due to the great variability in mean absolute error (MAE), with low generalisability across the ethnic groups in cohort studies [18, 20].
22 Artificial Intelligence Using the Eye as a Biomarker of Systemic Risk 245
Recently, researchers discovered that body muscle mass is a more reliable measure of cardiometabolic risk than BMI. Using body muscle mass as a variable allowed for the detection of an age-related condition reflecting skeletal muscle loss, called sarcopenia [20]. Rim et al. developed an AI-DL model that enabled the quantification of muscle mass from CFP. The MAE (6.09 kg) was high, and the coefficient of determination (R² = 0.33) was low in the external testing set, reiterating the need for further validation studies before assessing whether CFP could be used as an alternative screening tool for sarcopenia.
246 R. M. W. W. Tseng et al.
Prediction of Neurological Diseases

The retina and the brain share a special relationship, since both structures develop from the neural tube and are part of the central nervous system [33]. Embryonic origin aside, visual changes have also been reported among the first symptoms in many patients diagnosed with Alzheimer's disease [33, 34], with some studies showing the aggregation of amyloid beta monomers in the retina in these patients [35]. In the realm of AI-DL, compared to other major body systems, the limited number of published studies suggests that there is much room to explore the relationship between the brain and the retina. Existing studies vary in the type of ocular imaging technologies used to explore the association between neurological diseases and the retina [35–37]. In particular, Lim et al. evaluated the potential of an AI-DL model for ischemic stroke risk assessment from CFP, which resulted in a varying AUC of 0.685–0.994 across six different datasets [38]. Additionally, the team found that retinal vessel calibre could be predictive of ischemic stroke in patients, although there was low generalizability when the model was tested on unseen images [38]. Considering the emerging potential of retinal imaging as a non-invasive strategy, employing AI-DL on CFP could act as opportunistic screening for neurological diseases, and ultimately increase screening adherence rates in the community.

Prediction of Cardiovascular and Circulatory Disorders

Table 22.3 details the AI-DL studies focused on predicting systemic risk factors and specific diseases (e.g., anaemia and hypertension) of the circulatory system. Blood pressure (BP) is an important indicator for CVD, but it is also a means to maintain homeostasis and therefore fluctuates based on body and emotional status. Using the retina as a biomarker instead is preferred because the retina shows accumulated damage due to high blood pressure and experiences comparably less fluctuation than conventional BP measurements, rendering it a more stable marker. The results of applying
Table 22.3 Applications of AI in predicting systemic diseases involved with the circulatory system

| Systemic disease/variable | Author, Year | Internal test set | External test set |
|---|---|---|---|
| Hypertension (HTN) or related biomarkers: blood pressure (systolic = SBP, diastolic = DBP), mmHg | Poplin et al., 2018 [18] | SBP: MAE = 11.23 mmHg, CI = 11.18–11.51, R² = 0.36; DBP: MAE = 6.42 mmHg, CI = 6.33–6.52, R² = 0.32 | NA |
| | Rim et al., 2020 [20] | SBP: MAE = 9.29 mmHg, CI = 9.16–9.43, R² = 0.31; DBP: MAE = 7.20 mmHg, CI = 7.09–7.30, R² = 0.35 | SBP: MAE = 10.55–13.95 mmHg, R² = 0.17–0.21; DBP: MAE = 7.14–8.09 mmHg, R² = 0.16–0.27 |
| | Gerrits et al., 2020 [21] | SBP: MAE = 8.96 mmHg, R² = 0.40; DBP: MAE = 6.84 mmHg, R² = 0.24 | NA |
| Hypertension | Dai et al., 2020 [39] | AUC = 0.651 | NA |
| | Zhang et al., 2020 [40] | AUC = 0.766 | NA |
| Anaemia or related biomarkers: anaemia | Mitani et al., 2019 [41] | Metadata/fundus/combined: AUC = 0.73/0.87/0.88; AUC = 0.89 (combined, diabetes subgroup) | NA |
| Haemoglobin levels (Hb) | Mitani et al., 2019 [41] | Metadata/fundus/combined: MAE = 0.73/0.67/0.64 g/dL, CI = 0.72–0.74/0.66–0.68/0.62–0.64, AUC = 0.74/0.87/0.88 | NA |
| | Rim et al., 2020 [20] | MAE = 0.79 g/dL, CI = 0.78–0.80, R² = 0.56 | MAE = 0.93–0.98 g/dL, R² = 0.06–0.33 |
| | Gerrits et al., 2020 [21] | MAE = 0.61%, R² = 0.34 | NA |
| Haematocrit levels | Mitani et al., 2019 [41] | Metadata/fundus/combined: MAE = 2.10/1.94/1.83%, CI = 2.07–2.13/1.91–1.97/1.80–1.86 | NA |
| | Rim et al., 2020 [20] | MAE = 2.03%, CI = 2.00–2.06, R² = 0.57 | MAE = 2.62–2.81%, R² = 0.09–0.26 |
| Red blood cell count | Mitani et al., 2019 [41] | Metadata/fundus/combined: MAE = 0.26/0.26/0.25 ×10¹²/L, CI = 0.26–0.27/0.25–0.26/0.25–0.25 | NA |
| | Rim et al., 2020 [20] | MAE = 0.26 ×10¹²/L, CI = 0.25–0.26, R² = 0.45 | MAE = 0.33–0.37 ×10¹²/L, R² = −0.02 to 0.14 |

AUC area under the receiver operating characteristic curve, CI confidence interval, MAE mean absolute error, NA results not available
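As a quick reference for the MAE and R² figures reported in the tables of this chapter, the two metrics can be computed as below. The blood-pressure values here are invented for illustration and are not data from any cited study.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average absolute deviation, expressed in
    the biomarker's own units (e.g. mmHg for blood pressure)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot. It can be
    negative on external data when the model predicts worse than the
    mean of the observations (cf. the -0.02 entry in Table 22.3)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative systolic BP values (mmHg), invented for this sketch.
truth = [120, 135, 150, 110, 142]
pred = [126, 130, 141, 118, 136]
print(mae(truth, pred), r_squared(truth, pred))  # MAE = 6.8 mmHg
```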
AI-DL to date show that, unlike other body factors (e.g., height and weight), there is generalisability across the different ethnic groups for blood pressure prediction [6]. However, the R² value is somewhat low, with ranges of 0.24–0.40. Apart from using blood pressure as a biomarker for hypertension, disease prediction of hypertension was also reported by Dai et al. and Zhang et al., but modest predictability was observed, with a combined AUC of 0.651–0.766.

For the prediction of anaemia and/or related biomarkers, biomarkers including hemoglobin, hematocrit, and red blood cell count were predicted from CFP in three separate studies [20, 21, 41]. Anaemia was also predictable using an AI-DL model developed by Mitani et al., with a modest AUC of 0.88 using a combined model of systemic risk factors and CFPs [41]. Rim et al. tested these hematologic factors in external datasets with varying ethnicities, but generalizability was limited among other ethnic groups [20].

Prediction of Metabolic and Endocrinological Diseases

Table 22.4 details the studies on the endocrine system. Of the different systemic diseases and/or variables that were tested using the AI-DL model, the prediction of biomarkers related to diabetes, including glucose and HbA1c, showed somewhat low predictive performance in both the internal and external test sets.

Testosterone (MAE = 3.76 nmol/L, R² = 0.54) was predictable from CFP, but Gerrits et al. discovered that the AI-DL model trained to predict testosterone levels was indirectly predicting sex as well. The team additionally found that the performance of the model was affected when it was trained solely on males or females, indicating that sex had an indirect effect on the performance of the model when predicting testosterone [21]. Apart from systemic risk factors and related biomarkers, endocrine system-
Table 22.4 Applications of AI in predicting systemic diseases involved with the endocrine system

| Systemic disease/variable | Author, Year | Internal test set | External test set |
|---|---|---|---|
| Diabetes or related biomarkers: diabetes/blood glucose control | Rim et al., 2020 [20] | Fasting blood glucose: MAE = 8.55 mg/dL, CI = 8.40–8.71, R² = 0.11 | MAE = 10.10 mg/dL, CI = 9.83–10.36, R² = 0.05 |
| | Babenko et al., 2020 [42] | AUC = 0.702 | NA |
| Diabetic peripheral neuropathy | Benson et al., 2020 [43] | Accuracy = 89% | NA |
| Hyperglycemia | Zhang et al., 2020 [40] | AUC = 0.880 | NA |
| HbA1c | Poplin et al., 2018 [18] | MAE = 1.39%, CI = 1.29–1.50, R² = 0.09 | NA |
| | Rim et al., 2020 [20] | MAE = 0.33%, CI = 0.32–0.33, R² = 0.13 | MAE = 0.35%, CI = 0.34–0.36, R² = 0.07 |
| | Gerrits et al., 2020 [21] | MAE = 0.61%, R² = 0.34 | NA |
| Lipid-related biomarkers: dyslipidaemia | Zhang et al., 2020 [40] | AUC = 0.703 | NA |
| HDL cholesterol | Rim et al., 2020 [20] | MAE = 9.45 mg/dL, R² = 0.13 | MAE = 9.46 mg/dL, R² = 0.08 |
| Other biomarkers: testosterone | Gerrits et al., 2020 [21] | MAE = 3.76 nmol/L, R² = 0.54 | NA |

AUC area under the receiver operating characteristic curve, CI confidence interval, HbA1c hemoglobin A1C, HDL high-density lipoprotein, MAE mean absolute error, NA results not available
related diseases were also reported, including a machine learning system created by Benson et al. to predict diabetic peripheral neuropathy, which showed relatively good performance (AUC = 0.89) [43]. The prediction of other conditions, such as dyslipidaemia and diabetes, showed moderate predictive performance.

Prediction of Kidney Disease

Among systemic diseases, chronic kidney disease (CKD) is infamously described by the American Society of Nephrology as a silent killer. The traditional way of screening for CKD is invasive and includes collecting serum creatinine levels [44]. As such, screening for CKD is limited and a tough challenge for most communities. Applying AI to CFP for the detection of CKD would not only act as an adjunct screening tool for CKD, but could also be a gamechanger in increasing the detection rate and lowering the mortality rate of patients with CKD.

Despite the huge potential impact, few studies have explored the prediction of CKD from CFP. Of note, Sabanayagam et al. predicted CKD (AUC = 0.73) with modest generalisability, and a separate AI-DL system developed by Kang et al. achieved an AUC of 0.81, although no external validation was conducted (Table 22.5). In particular, the performance of the model developed by Sabanayagam et al. was generally stable (AUC > 0.9 for all models) across the different models (i.e. CFP, risk factors, combined) that were trained, suggesting that risk factor information was not required for CKD risk assessment in patients. In the study, CKD was strictly defined as an estimated glomerular filtration rate of less than 60 mL/min/1.73 m². This estimation was based on factors such as creatinine level, age, sex, and bodyweight. Given that predicting age and sex via AI-DL on CFP is well-established and highly accurate, this suggests the high potential of using AI-DL on CFP to predict CKD in the future as an opportunistic screening method that could be integrated into CFP workflows and eventually replace traditional invasive methods.

Other Retinal Biomarkers

Table 22.6 details the studies focused on other retinal imaging biomarkers. There is strong evidence from epidemiological studies that changes in the retinal vasculature mirror systemic microcirculation changes. However, the process for assessing retinal vascular changes is time-consuming and requires professional training, which has limited the expansion and wider application of these traditional methods in other primary care settings outside of ophthalmology [50]. To address these challenges, semi-automated
Table 22.5 Applications of AI in predicting systemic diseases involved with the renal system

| Systemic disease/variable | Author, Year | Internal test set | External test set |
|---|---|---|---|
| Chronic kidney disease (CKD) or related biomarkers: CKD | Sabanayagam et al., 2020 [45] | AUC (CFP/RF/combined) = 0.911/0.916/0.938; DM subgroup: AUC = 0.889/0.899/0.925; HTN subgroup: AUC = 0.889/0.889/0.918 | AUC (CFP/RF/combined) = 0.733–0.835/0.829–0.887/0.810–0.858 |
| | Kang et al., 2020 [46] | Overall AUC = 0.81; AUC = 0.81–0.87 as HbA1c levels increased from <6.5% to >10% | NA |
| Creatinine | Rim et al., 2020 [20] | MAE = 0.11, CI = 0.11–0.11, R² = 0.38 | MAE = 0.11–0.17, R² = 0.01–0.26 |

AUC area under the receiver operating characteristic curve, CI confidence interval, MAE mean absolute error, RF risk factor, NA results not available
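The eGFR-style estimate described in the text (based on creatinine level, age, sex, and bodyweight) is consistent with a Cockcroft-Gault-type creatinine clearance. The chapter does not name the exact equation used, so the formula below is shown only as an illustration of how such an estimate and the <60 threshold behave; the cited study may have used a different equation (e.g. CKD-EPI).

```python
def cockcroft_gault(age, weight_kg, serum_cr_mg_dl, female):
    """Estimated creatinine clearance (mL/min), Cockcroft-Gault formula.
    Illustration only; not necessarily the equation used in the study."""
    crcl = (140 - age) * weight_kg / (72 * serum_cr_mg_dl)
    return crcl * 0.85 if female else crcl

# Threshold from the text: classified as CKD if the estimate is below 60.
estimate = cockcroft_gault(age=60, weight_kg=72, serum_cr_mg_dl=1.0, female=False)
print(estimate, estimate < 60)  # 80.0 -> not classified as CKD
```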
Fig. 22.2 Use of SIVA-DLS to assess width of retinal vessels in a more efficient, objective, and quantifiable way (panels: input into the SIVA-DLS; heatmap of CRAE score; heatmap of CRVE score)
software, such as the Singapore I Vessel Assessment deep-learning system (SIVA-DLS), was created and applied on CFP, allowing for a more efficient, objective, and quantifiable way of assessing the width of retinal vessels from CFP through the use of heat maps (Fig. 22.2). The SIVA-DLS study reported high intra-class correlation coefficients (0.82–0.95) between the SIVA-DLS and validated human measurements.

Coronary artery calcium (CAC) is a pre-clinical marker of atherosclerosis, a cardiovascular condition that could implicate the circulatory system, and is strongly associated with risk of clinical CVD [51]. Measurement of CAC scores has increasingly been used for stratification of CVD risk. Recently, Son et al. created an AI-DL model to predict abnormal CAC from CFP both unilaterally and bilaterally, and the performance (AUC = 0.823–0.832) was promising [48]. In addition, carotid intima-media thickness (CIMT) was measured using ultrasonography by averaging three measurements made 10 mm proximal to the bifurcation, and was used as the proxy marker for atherosclerosis. Chang et al. developed an AI-DL model to predict carotid artery atherosclerosis, and the model was able to predict sonographically confirmed carotid artery atherosclerosis with an AUC of 0.713 [49]. Both the CAC and CIMT models were not tested externally, and further validation is required prior to assessing the clinical applicability of these models.

Longitudinal Studies

Leveraging the predictive values and cross-sectional outcomes that are estimated using AI-DL and CFP, research in the retinal biomarker field is currently expanding towards predicting future events (Table 22.7). Since applying AI-DL to CFP is sufficient for risk factor prediction of systemic diseases, there is a high possibility that CFP could also be directly associated with, and therefore a good predictor of, the incidence of CVD events. Recent work includes the survival analysis of risk stratification for incident CVD
events and mortality using the predicted probability of CVD occurrence from CFP at baseline. The conventional Cox proportional hazards model is an extension of current efforts at creating an optimal predictive model using deep features of CFP as an input. In the past, variables such as age, sex, socioeconomic status, and other CVD risk factors were manually or statistically selected for survival analysis, but presently the Cox model is generated using the deep features of CFP that are observed based on the association between CFP and various risk factors via DL. Apart from this hybrid model, other papers have used different methods, ranging from neural networks to machine learning techniques, including Cox-nnet [52], DeepSurv [53], and Nnet-survival [54]. Currently, no studies have applied these recent networks with CFP to predict systemic-related event outcomes using time-series data, suggesting that there is room to explore the relationship between the incidence of systemic disease outcomes and the retina.

Of the various studies that investigate the applications of AI-DL in predicting longitudinal outcomes of systemic diseases, a few should be mentioned. Poplin et al. [18] predicted CVD risk factors from CFP via AI-DL and thereafter used the results to predict Major Adverse Cardiac Events (MACE) over 5 years in the UK Biobank. The performance of the AI-DL model was similar to the performance of the European Systematic COronary Risk Evaluation (SCORE) risk calculator [18]. Another study by Chang et al. used proxy markers such as CIMT and the existence of carotid artery plaque to train an AI-DL model to predict atherosclerosis [49]. The study demonstrated that the retinal biomarker was significantly associated with an increased risk, represented by hazard ratio, of CVD mortality after adjusting for the Framingham risk score (FRS). A further study demonstrated that the gap between retina-predicted age and chronological age from CFP (the retinal age gap) independently predicted one's mortality risk, and that the mortality risk was positively associated with the retinal age gap. These studies show the potential of CFP as a screening tool for risk stratification and delivery of tailored interventions.

Areas of Future Research

When assessing the performance of AI-DL models, the predictability and generalisability of the AI-DL model must be evaluated appropriately. Predictability refers to how accurate the AI-DL model is at predicting the desired result. The AI-DL model that predicts age from CFP is an example, given its relatively high coefficient of determination in comparison to the somewhat low coefficient of determination produced by the AI-DL model that predicts BP. There are no specific guidelines or cut-off values for the coefficient of determination, but the predictability of AI-DL systems on specific biomarkers can be inferred from the performance of these systems on internal test sets. Predictability, however, does not guarantee generalisability. In terms of generalisability, it is crucial to determine how well the AI-DL model will perform in different clinical settings and in different populations with varying ethnicities. Rim et al. demonstrated this by testing their AI-DL models on external multi-ethnic testing datasets. In this case, the AI-DL model that predicts BP from CFP demonstrates good generalisability even though the coefficient of determination is not as high as it is for age prediction. Taking into account these two performance indices when developing AI-DL models would substantiate the use of retinal markers as a surrogate and proxy for conventional markers in the future.
In addition, Cheung et al. looked at CVD and its Additional challenges that could implicate the
association to retinal vessel calibre [47]. The performance of AI-DL models on CFP include
team found that a narrow central retinal arteriolar the limited number of datasets available for sys-
equivalent measured by SIVA-DLS was associ- temic diseases, the adoption of retinal examina-
ated with incident CVD and all-cause mortality tions into CVD guidelines, as well as gaining
in two prospective cohorts. Lastly, Zhu et al. used acceptance from physicians, patients, and the
data from the UK biobank to demonstrate that the public. In particular, the collation of CFP datasets
difference between one’s age and the predicted according to systemic diseases is tough given the
46. Kang EY HY, Li C, Huang Y, Kuo C, Kang J, Chen K, Lai C, Wu W, Hwang Y. A deep learning model for detecting early renal function impairment using retinal fundus images: model development and validation study. JMIR Med Inf. 2020.
47. Cheung CY, Xu D, Cheng CY, Sabanayagam C, Tham YC, Yu M, et al. A deep-learning system for the assessment of cardiovascular disease risk via the measurement of retinal-vessel calibre. Nat Biomed Eng. 2020.
48. Son J, Shin JY, Chun EJ, Jung K-H, Park KH, Park SJ. Predicting high coronary artery calcium score from retinal fundus images with deep learning algorithms. Transl Vis Sci Technol. 2020;9(2):28.
49. Chang J, Ko A, Park SM, Choi S, Kim K, Kim SM, et al. Association of cardiovascular mortality and deep learning-funduscopic atherosclerosis score derived from retinal fundus images. Am J Ophthalmol. 2020;217:121–30.
50. Walsh JB. Hypertensive retinopathy. Description, classification, and prognosis. Ophthalmology. 1982;89(10):1127–31.
51. Detrano R, Guerci AD, Carr JJ, Bild DE, Burke G, Folsom AR, et al. Coronary calcium as a predictor of coronary events in four racial or ethnic groups. N Engl J Med. 2008;358(13):1336–45.
52. Ching T, Zhu X, Garmire LX. Cox-nnet: an artificial neural network method for prognosis prediction of high-throughput omics data. PLoS Comput Biol. 2018;14(4):e1006076.
53. Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol. 2018;18(1):24.
54. Gensheimer MF, Narasimhan B. A scalable discrete-time survival model for neural networks. PeerJ. 2019;7:e6257.
23 Artificial Intelligence in Calculating the IOL Power

John G. Ladas and Shawn R. Lin
Types of Machine Learning

There are two categories of machine learning, and both could be applicable to cataract surgery and IOL calculations: supervised and unsupervised learning. Unsupervised learning uses input data to discover similarities among data sets. For instance, with enough data, one could attempt to find the specific characteristics of eyes at risk for a suboptimal refractive outcome. That learning, however, would not be able to help correct the error. To achieve that, one would need outcome data.

Supervised learning is the other branch of machine learning; it utilizes outcome data, in addition to the input variables, to develop a predictive model. Supervised learning can further be subdivided into classification and regression. An example of classification involves the use of images to determine whether a patient has diabetic retinopathy. Regression-based supervised learning uses specific algorithms to establish the relationship between input variables and the outcome. As previously stated, this is why cataract surgery, and specifically IOL calculation, is perfectly suited for this task.

Artificial Intelligence for Diagnosis and Grading

The use of artificial intelligence has been explored for the diagnosis and grading of cataracts. This type of functionality has the potential to have a large impact in diagnosing community cases and referring appropriate surgical candidates to a health care system.

A study in 2010 published in the Journal of Medical Systems demonstrated early work on the training of artificial intelligence to identify cataractous versus non-cataractous and pseudophakic patients by mydriatic slit lamp photos. However, the example cataracts shown in this paper are dense white lenses, which may limit the utility of this approach [4]. A paper published in the IEEE Transactions on Biomedical Engineering described a method for grading cataracts using a convolutional neural network built on a dataset of 5378 mydriatic slit lamp photos. The paper showed good agreement with the ground truth as established by an expert grader [5]. A recent paper published in BJO described a system combining image recognition of slit lamp images and home-based cell phone photos with a questionnaire to identify referable cataracts. With this system, the authors estimate that an ophthalmologist can improve productivity 10× by leveraging AI to monitor and evaluate 40,000 patients a year instead of 4000 [6].

History of IOL Calculations

Formulae, throughout the years, have been classified multiple ways. The most common has been by "generations". The first generation of formulae includes the SRK formula, based completely on linear regression: IOL power = A constant − 2.5 × AL − 0.9 × K. The A constant was calculated mathematically and can be adjusted to fine-tune the formula [7, 8]. The second generation of formulas added "adjustments" that were dependent on axial length. The third generation formulas were theoretical formulas developed to further improve the prediction accuracy of the effective lens position (ELP). These formulas include the Hoffer Q, Holladay 1, and SRK/T. Although different, they each used AL and K power to predict the ELP. Additional formulas included measured anterior chamber depth (ACD) and lens thickness (LT) to enhance prediction of the ELP [9–11].

A step towards the integration of AI in IOL calculations took place in 2015 with the introduction of the concept of an IOL 'super formula' [12]. Although previous generations of IOL formulas were thought of as two-dimensional algebraic equations, this methodology depicted formulas in three dimensions. Ladas et al. demonstrated that these formulas could be represented graphically and combined for potential adjustment of specific areas of these calculations. The formula as published served as a dependable backbone or framework for integration of AI. Furthermore, it provides a malleable framework which allows for targeted improvement within the formula. We will discuss more on this later in the chapter.

Input Variables to Consider in Any Algorithm

To date, there are at least 20 potential variables that have been used to help refine these formulas. In addition to AL and net K power, which are the most important, other variables include the aforementioned LT and ACD. Holladay included seven variables, including preoperative refraction, white-to-white, and age [13]. This was done without AI. Later he determined that further axial length adjustment was needed [14]. Along these lines, Haigis also adjusted his formula with three constants that would vary the shape of the power prediction curve based on corneal power, axial length, ACD and lens geometry. Again, this was not done with AI, but it highlights the importance of factoring multiple variables that are interrelated.

Other potential variables that have been proposed or demonstrated to have an impact on IOL power calculation include posterior corneal power, true power of the cornea, ratio of anterior and posterior segment, IOL power and design, measured equatorial lens position, aphakic refraction, race, gender, and age [7, 9, 10, 13–18]. Unfortunately, these do not occur in a vacuum but are intimately related to each other, which is perhaps why they are perfectly suited for the use of AI.

There are many factors that must be considered to make AI successful in IOL calculations. The general steps through which an AI formula can be built can be divided into data gathering, cleaning, and training. In the data gathering step, input variables are collected, ideally directly from the source (biometry devices) with minimal translation. In addition, outcome data (post-operative manifest refractions) must be gathered from the medical record. In the case of lens power calculations, minimal data cleaning is required, as the data exists already in numerical format and contains little noise. However, some filtering is often performed, such as selecting only eyes achieving a certain refractive target, or eyes absent comorbidities, in order to ensure the most accurate input data possible. This data is then fed into a training algorithm. The most popular deep learning toolset available at the time of publication is Google's TensorFlow. The algorithm is allowed to run iteratively to attain the closest set of weights to match inputs to known outcomes, and this set of weights constitutes the new lens formula algorithm.

Review of Current Formulas with AI

One of the first discussions of the use of a neural network came from Clarke in 1992 [19]. The study was limited by a small number of eyes in the training set (200) and test set (95). There was no description of the algorithm that he created. As he stated, the main disadvantage at that point was that running the algorithm required "substantial computing power and memory". Further, the input data used in this study measured LT and AL by ultrasound biometry, now known to be less accurate than optical biometry. Nonetheless, the use of a computer to help adjust a formula was introduced. Beyond this early study, there is a paucity of information in the peer reviewed literature.

With the advent of computer power and more data came more interest in using AI in calculations. From a theoretical standpoint, there are two ways to approach this problem: using a fixed set of data and then building an algorithm directly from this data, or utilizing data to adjust an existing algorithm.

Hill reported the use of a radial basis function to help determine IOL power from a group of 3445 eyes (Version 1). Although the formula itself was never published, it has been examined in other studies. A white paper supplied by Haag-Streit describes it as a pure data-driven solution that works best with the specific biometer and the specific lens from which it was derived (Lenstar 900 and SN60WF) [20]. Fundamentally, this is a "regression" formula that uses deep learning to "back calculate" a predicted outcome from a known data set. There is also a function that describes a calculation as "in bounds" or out of bounds if the calculation is deemed unreliable. The dataset has since been expanded to include 12,419 eyes (Version 2).

Another approach can be to adjust an existing formula. This was introduced by Ladas and colleagues and is inherent in the most recent version of their formula. This approach has been called the Ladas PLUS method, and was presented at ASCRS, ESCRS and AAO [21–23]. This algorithm works with any formula and adjusts it based on supervised machine learning algorithms that were developed to predict the error between any formula's predicted outcome and the actual outcome. Further, we have recently demonstrated that we can improve multiple generations of formulae with our methodology [24].

Because of the lack of publication on this subject, it is difficult to report on the specific methodologies that others may or may not use. Our most recent publication utilized gathering, filtering and cleansing mechanisms to obtain eyes with one type of IOL and a best corrected visual acuity threshold, as well as to exclude eyes with comorbidities. At this point, we used software (Python 3.7 with the scikit-learn package) to refine a baseline formula. The supervised learning algorithms we tested were support vector regression (SVR), extreme gradient boosting (XGBoost), and an artificial neural network (ANN). An important step in developing an algorithm is making sure that the data obtained demonstrates a normal distribution. This is done with a Shapiro-Wilk test. Further, steps to prevent overfitting of a model are taken by performing a fivefold cross validation within the training set. This was all done after randomly separating the data into ten equal parts and subsequently using nine of the ten to train the algorithm and test on the remaining tranche. This was sequentially done ten times.

There have also been other publications describing techniques of adjusting an existing formula [25]. Indeed, Sramka et al. used supervised learning techniques and demonstrated equal or better performance than standard formulas. Kane has also reported including elements of artificial intelligence in his own formula but has never described the methodology [26].

There are multiple detailed approaches to these issues which are beyond the scope of this chapter. A recent publication from our group outlines the exact methodology for each of these steps as they apply to IOL calculations.

Challenges of AI Integration

There are a few principles which will govern the speed of adoption of new lens formulas: the accuracy and amount of data, and trust. First, larger datasets may allow an algorithm to account for outliers that are poorly represented in smaller datasets. Existing machine learning algorithms are trained on tens of thousands of eyes. Creating a public dataset of a hundred thousand, a million, or even more eyes could potentially help us design better formulas. One of the authors of this chapter, SRL, is in the process of creating a public dataset of high-quality lens data. Using this dataset, new formulas can be developed more rapidly and tested against a known benchmark. In addition, a public dataset would allow individuals and organizations outside of ophthalmology to work on this problem.

Another way to achieve the goal of obtaining large amounts of accurate data is to make the process automated. Indeed, the outcome data which AI relies upon, the manifest refraction, is often suboptimal due to technique variability, room length, the patient's subjective participation, and the time taken to perform measurements. The use of post-operative autorefraction or wavefront data can potentially help eliminate most issues that occur with MRx acquisition. However, the correlation between ARx and MRx for the purposes of IOL formula optimization is still unclear and is currently being investigated in ongoing studies. Furthermore, with 'big data' stored within an automated refractor, one would be able to characterize an eye as one with 'standard' parameters or one with 'unusual' parameters. Thus, AI could pre-operatively highlight eyes that are 'at-risk' for a post-operative refractive surprise so that the surgeon may pay extra attention to pre-operative IOL calculation.
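The supervised-learning workflow described earlier in this chapter, filtering eyes, checking the outcome distribution with a Shapiro-Wilk test, then cross-validating several regressors, can be sketched in the chapter's own toolchain, Python with scikit-learn. Everything below is synthetic and illustrative: the biometric distributions, the target definition and the model settings are stand-ins, and scikit-learn's GradientBoostingRegressor is used in place of the XGBoost library.

```python
# Illustrative sketch only: synthetic data, invented settings.
# GradientBoostingRegressor stands in for the XGBoost library.
import numpy as np
from scipy.stats import shapiro
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
n = 300
X = np.column_stack([
    rng.normal(23.5, 1.2, n),   # axial length, mm
    rng.normal(43.5, 1.5, n),   # mean keratometry, D
    rng.normal(3.2, 0.4, n),    # anterior chamber depth, mm
    rng.normal(4.5, 0.4, n),    # lens thickness, mm
])
# Target: a baseline formula's prediction error (the quantity a
# formula-adjusting approach models), simulated here
y = 0.05 * (X[:, 0] - 23.5) - 0.03 * (X[:, 1] - 43.5) + rng.normal(0, 0.1, n)

# Normality check on the target, as described in the text
stat, p_value = shapiro(y)

models = {
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "GBR": GradientBoostingRegressor(random_state=0),
    "ANN": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(8,),
                                      max_iter=1000, random_state=0)),
}
# Fivefold cross-validation within the training set
scores = {name: cross_val_score(m, X, y, cv=5,
                                scoring="neg_mean_absolute_error").mean()
          for name, m in models.items()}
```

Each entry of `scores` is a negated mean absolute error, so the least negative model performs best on this toy problem.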
However the input variables and outcome data are accumulated, their accumulation will lead to advances in this field. For instance, a website called Kaggle.com coordinates competitions in which organizations provide data sets and machine learning researchers compete to derive the best algorithms. This is where Netflix famously ran its $1,000,000 competition to create a better movie recommendation algorithm.

Finally, clinicians often do not adopt new algorithms until they have seen enough clinical evidence that they can trust the outcomes. As a field, we can accelerate the development of trust by providing a larger dataset by which researchers could compare new algorithms to existing ones. For example, widespread adoption of a new algorithm could occur more quickly if a clinician knew that the new algorithm had been compared to previously used algorithms in a public dataset of 1 million eyes. A global public dataset could allow not only for advancements in algorithm accuracy, but also for regional fine-tuning by giving clinicians the ability to extend the base algorithm for their patient population by adding custom data.

A key factor in the success of any artificial intelligence or machine learning algorithm is the quality of the input and outcome data as well as the number of samples. The input data currently used in lens formulas is quite accurate with modern biometry devices, but the continued development of more accurate measurements, as well as the incorporation of new types of measurements, will continue to further the field. This big data and 'crowd-sourced' approach could eventually use millions of data points to achieve very high levels of accuracy. This approach could evolve over time and become a 'living formula' that continuously improves as new data is added.

Conclusion

The integration of AI into IOL calculations will continue to grow, and with the ability to accumulate reliable outcome data it will exponentially increase within the next few years. Further, the network effect of comparing formulas as well as outcomes in our modern world will forever change the way that formulas are developed and compared.

References

1. Heath Jeffery RC, Smith M. Artificial intelligence in ophthalmology: current applications and emerging issues [published online ahead of print, 2020 Jan 23]. Clin Exp Ophthalmol. *Stevenson CH, Hong SC, Ogbuehi KC. Development of an artificial intelligence system to classify pathology and clinical features on retinal fundus images. Clin Exp Ophthalmol. 2019;47(4):484–9.
2. Hogarty DT, Mackey DA, Hewitt AW. Current state and future prospects of artificial intelligence in ophthalmology: a review. Clin Exp Ophthalmol. 2019;47(1):128–39.
3. World Health Organization. Blindness: vision 2020—control of major blinding diseases and disorders. http://www.who.int/mediacentre/factsheets/fs214/en/. Accessed Jan 2020.
4. Acharya RU, Yu W, Zhu K, et al. Identification of cataract and post-cataract surgery optical images using artificial intelligence techniques. J Med Syst. 2010;34(4):619–28.
5. Gao X, Lin S, Wong TY. Automatic feature learning to grade nuclear cataracts based on deep learning. IEEE Trans Biomed Eng. 2015.
6. Wu X, Huang Y, Liu Z. Universal artificial intelligence platform for collaborative management of cataracts. Br J Ophthalmol. 2019;103(11):1553–60.
7. Olsen T. Calculation of intraocular lens power: a review. Acta Ophthalmol Scand. 2007;85(5):472–85.
8. Olsen T, Thom K, Corydon L. Theoretical versus SRK I and SRK II calculation of intraocular lens power. J Cataract Refract Surg. 1990;16(2):217–25.
9. Barrett GD. An improved universal theoretical formula for intraocular lens power prediction. J Cataract Refract Surg. 1993;19(6):713–20.
10. Haigis W. Strahldurchrechnung in Gauß'scher Optik zur Beschreibung des Systems Brille-Kontaktlinse-Hornhaut-Augenlinse (IOL). In: Schott K, Jacobi KW, Freyler H, editors. Kongreß d. Deutschen Ges. f. Intraokularlinsen Implantation. Berlin: Springer; 1991. p. 233–46.
11. Olsen T. Prediction of the effective postoperative (intraocular lens) anterior chamber depth. J Cataract Refract Surg. 2006;32(3):419–24.
12. Ladas JG, Siddiqui AA, Devgan U, Jun AS. A 3-D "super surface" combining modern intraocular formulas to generate a "super formula" and maximize accuracy. JAMA Ophthalmol. 2015;133(12):1431–6.
13. Mahdavi S, Holladay J. IOLMaster 500 and integration of the Holladay 2 formula for intraocular lens calculations. Eur Ophthal Rev. 2011;5(2):134–5.
14. Wang L, Holladay JT, Koch DD. Wang-Koch axial length adjustment for the Holladay 2 formula in long eyes. J Cataract Refract Surg. 2018;44(10):1291–2.
15. Cooke DL, Cook TL. Approximating sum-of-segments axial length from a traditional optical low-coherence reflectometry measurement. J Cataract Refract Surg. 2019;45(3):351–4.
16. Olsen T, Corydon L, Gimbel H. Intraocular lens power calculation with an improved anterior chamber depth prediction algorithm. J Cataract Refract Surg. 1995;21(3):313–9.
17. Yoo YS, Whang WJ, Hwang KY. Use of the crystalline lens equatorial plane (LEP) as a new parameter for predicting postoperative IOL position. Am J Ophthalmol. 2019;198:17–24.
18. Olsen T. The Olsen formula. In: Shammas HJ, editor. Intraocular lens power calculations. Thorofare, NJ: Slack; 2004. p. 27–38.
19. Clarke GP, Burmeister JB. Comparison of intraocular lens computations using a neural network versus the Holladay formula. J Cataract Refract Surg. 1997;23(10):1585–9.
20. Hill-RBF Method. Released: October 2017/V2.0. Haag-Streit AG, Koeniz, Switzerland. https://www.haag-streit.com/fileadmin/Haag-Streit_Diagnostics/biometry/EyeSuite_IOL/Brochures_Flyers/White_Paper_Hill-RBF_Method_20160819_2_0.pdf. Accessed April 2020.
21. Siddiqui AA, Ladas JG, Nutkiewicz M. Evaluation of a new IOL formula that integrates artificial intelligence. Paper presentation at: American Society of Cataract and Refractive Surgery (ASCRS) annual meeting, Washington, DC, April 2018.
22. Ladas JG. Artificial intelligence and big data in IOL calculations. European Society of Cataract and Refractive Surgeons (ESCRS) annual meeting, September 14, 2019.
23. Ladas JG. Artificial intelligence in ophthalmology. American Academy of Ophthalmology (AAO) annual meeting, Spotlight Session, October 13, 2019.
24. Ladas J, Ladas D, Lin SR, Devgan U, Siddiqui AA, Jun AS. Improvement of multiple generations of intraocular lens calculation formulae with a novel approach using artificial intelligence. Transl Vis Sci Technol. 2021. (In press).
25. Sramka M, Slovak M, Tuckova J, Stodulka P. Improving clinical refractive results of cataract surgery by machine learning. July 2, 2019. PubMed 31304064. www.peerj.com/articles/7202/. Accessed April 2020.
26. Connell BJ, Kane JX. Comparison of the Kane formula with existing formulas for intraocular lens power selection. BMJ Open Ophthalmol. 2019;4:e000251. https://doi.org/10.1136/bmjophth-2018-000251.
24 Practical Considerations for AI Implementation in IOL Calculation Formulas

Guillaume Debellemanière, Alain Saad, and Damien Gatinel
performance of the resulting model must be evaluated before it can be used. AI models can of course be regularly refined, but they don't typically learn with every new example that is provided to them.
– AI tools can help to accurately predict future outcomes using previous experience. This requires clean data composed of predictors (features) and resulting outcomes. However, the predictions can only be as good as the input data: modern algorithms can use the information contained within the training datasets to their full potential, but no "extra information" can be extracted, or guessed, by any digital super-intelligence.
– Even the most sophisticated machine learning regression algorithms are not intrinsically different from a standard linear regression. The mathematics behind the training and prediction processes differ between algorithms, as do their prediction performances on complex datasets, especially when dealing with a very high number of predictors or training examples; however, every regression algorithm aims at predicting a continuous target value using data from previous experience, and can only be as accurate as the input data. No algorithm is intrinsically better than another: their performances vary depending on the kind of problem at hand, and in fact using complex algorithms to solve simple problems usually leads to overfitting. This means that very good performance may be seen on the training set, with disappointing performance on external datasets and in real life.

Machine Learning-Based IOL Formulas Architecture

Pure AI-Based Formulas

Machine learning algorithms are used to predict a numerical target using data from previous experience. Within the context of IOL calculation, various formula architectures can be designed. In pure AI-based IOL formulas, the prediction of the refractive result of the surgery does not rely on optical assumptions. The calculation is a statistical inference which uses preoperative biometric parameters to predict the postoperative spherical equivalent (SE). In particular, the effective lens position (ELP, the distance from the anterior corneal surface to the IOL principal plane, used in thin lens formulas) and/or the ALP (the distance from the anterior corneal surface to the anterior IOL surface, used in thick lens formulas) is not considered in the calculation process of those formulas. The Hill-RBF formula is not published, but is described on the calculator website [2] as "entirely data driven" and the underlying algorithm as a radial basis function (RBF) network, i.e. a neural network using a non-linear RBF activation function in the neurons composing its hidden layer.

Machine learning regression algorithms are sophisticated regression techniques that do not differ in essence from linear regression: the latter is in fact the simplest and most popular algorithm of its kind. The SRK formula [3, 4] was one of the most successful regression formulas. However, the coefficients associated with the keratometry and axial length were fixed (to 0.9 and 2.5, respectively) and were not intended to be adapted, "retrained", to fit different IOL models: the prediction was adjusted by varying the offset of the regression only.

In IOL calculation, for a given IOL model and a given eye, the IOL power implanted and the postoperative refractive result are almost perfectly linearly interdependent. Hence, pure AI-based IOL formulas could theoretically be designed to either predict the postoperative spherical equivalent for a given IOL power (Fig. 24.1a), or the recommended IOL power to choose to reach a specific refractive target (Fig. 24.1b). Either the IOL power or the postoperative outcome has to be among the inputs, because of the necessary correspondence between a given IOL power and a given refractive result. The biometric parameters by themselves only give information about the post-operative
[Fig. 24.1 diagram: (a) AL, ARC, ACD, LT, ..., age, gender, and the implanted IOL power (prediction phase: a given IOL power) feed an ML regression algorithm whose output is the postoperative SE obtained with the implanted IOL power (prediction phase: the predicted SE for a given IOL power). (b) AL, ARC, ACD, LT, ..., age, gender, and the postoperative SE (prediction phase: the target SE) feed an ML regression algorithm whose output is the IOL power implanted (prediction phase: the IOL power to achieve the target SE).]

Fig. 24.1 Pure AI-based formulas architectures. Pure AI formulas can be designed to either predict the postoperative spherical equivalent for a given IOL power and set of biometric parameters (a) or to predict the IOL power implanted given the postoperative SE and the biometric parameters (b). Because the intended SE is almost never hyperopic, the postoperative SE carries "hidden" information about the eye: as a consequence, architecture (b) should never be used.
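Architecture (a) can be sketched with any off-the-shelf regressor: biometric inputs plus the implanted IOL power as features, the postoperative SE as the target. In the sketch below the data, the SRK-style emmetropic-power proxy used to simulate outcomes, and the choice of a random forest are all illustrative assumptions, not part of any published formula.

```python
# Illustrative sketch of Fig. 24.1 architecture (a); all data synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 400
al = rng.normal(23.5, 1.0, n)         # axial length, mm
k = rng.normal(43.5, 1.5, n)          # keratometry, D
acd = rng.normal(3.2, 0.3, n)         # anterior chamber depth, mm
iol_power = rng.normal(21.0, 3.0, n)  # implanted IOL power, D

# Simulated postoperative SE: an SRK-style expression (A = 118) serves
# as a rough emmetropic-power proxy; the 0.07 D/D slope is invented
emmetropic = 118.0 - 2.5 * al - 0.9 * k
postop_se = -0.07 * (iol_power - emmetropic) + rng.normal(0, 0.25, n)

X = np.column_stack([al, k, acd, iol_power])
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, postop_se)

# Prediction phase: predicted SE for a candidate IOL power in a new eye
candidate = np.array([[23.8, 43.0, 3.1, 20.5]])
predicted_se = model.predict(candidate)[0]
```

Predicting the recommended IOL power for a target SE (architecture b) would simply swap `iol_power` and `postop_se` between features and target, which is exactly the design the caption above warns against.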
ML algorithms can be used to predict the parameters in a given optical formula if:

1) They are not usually measured or known preoperatively (otherwise a prediction would be irrelevant);
2) Their value is related to other parameters that are known or measured preoperatively, to be used as features in the model;
3) Their value can be accurately measured or calculated, even post-operatively, to determine the model target.

The postoperative lens position and the posterior corneal radius are the only parameters that fulfill all of those requirements. The refractive indices of the cornea, aqueous and vitreous are directly used in the optical formulas that calculate the refraction of the pseudophakic eye, and as such could theoretically be interesting to predict. Furthermore, the refractive index of the crystalline lens indirectly influences the AL measurement [6–9]. Condition number 1 is fulfilled because they are not measured by current biometers. The refractive indices of the eye segments could potentially vary statistically with their thickness/length (lens and vitreous), with the patient's age [10–12], and/or with a history of corneal or retinal surgery [13], thus theoretically allowing a prediction from preoperatively known parameters (condition 2). However, they are difficult to measure, preventing the determination of the target that could allow the training of an algorithm: no gold standard refractive index measurements can be performed.

Both the posterior corneal radius and the ALP can be accurately measured (with a corneal topographer and, post-operatively, with a biometer, respectively). They can also be back-calculated from the other optical parameters of the eye [14]: in this case, the resulting value accounts for the approximations made in the measurement or the estimation of all the other parameters. The PEARL-DGS formula is based on the prediction of the Theoretical Internal Lens Position (TILP), which is defined as the theoretical distance between the posterior corneal surface and the anterior IOL surface, back-calculated from postoperative data.

The optical equations used in IOL formulas make it possible to calculate the spherical equivalent of the eye at the spectacle plane from the axial length of the eye, the geometrical characteristics of two lenses (the cornea and the IOL), the distance between those two lenses, and the refractive indices of the eye segments and IOL. The inner working of an AI-enhanced thin lens optical formula is shown in Fig. 24.2.

[Fig. 24.2 diagram: preoperative parameters (K, ACD, AL, ...) feed an ML regression algorithm that outputs the ELP; the ELP, a considered IOL power and the biometric parameters enter a thin lens equation that outputs the predicted postoperative SE. Dotted lines: the implanted IOL power and the real postoperative SE are used to back-calculate the ELP.]

Fig. 24.2 AI-enhanced formulas use ML algorithms to predict the lens position (and/or the posterior corneal radius) and use the predicted value(s) in optical equations. In this example, thin lens equations are used. The calculation process is represented using solid lines. The ELP back-calculation process is used to calculate the target value for the eyes of the training set and represented by dotted lines. The triple-optimized Haigis formula is a specific case of this kind of formula, where the ELP predictors are limited to ACD and AL and where the ML algorithm is a linear regression.
268 G. Debellemanière et al.
Thin lens optical equations simplify the calculations by ignoring the thicknesses of the lenses considered in the post-operative eye (cornea and IOL), thus removing the notion of principal plane positions; lenses are defined by their refractive power without other considerations. This approximation is close to reality for lenses with symmetrical anterior and posterior radii, but is responsible for errors when dealing with asymmetrical lenses [8]. Thick lens equations do not use this simplification.

The Haigis formula [14, 15] is unique in replacing the traditional IOL constant, which acts as a simple offset in the other published classical thin lens formulas (Holladay 1, Hoffer Q, SRK/T), by an offset (a0) and two coefficients (a1 for ACD and a2 for AL) that determine a bivariate linear regression. In its single-optimized version, a1 and a2 are fixed and a0 acts as a standard IOL constant. Haigis also allows calculation of the "perfect thin lens ELP", i.e. the ELP value which, when entered in the thin lens optical formula along with the preoperative keratometry, axial length and the implanted IOL power, leads to the real postoperative refractive outcome. The Haigis formula in its triple-optimized version allows a linear regression algorithm to be trained to predict this value, leading to the determination of a new offset and two new coefficients. The triple-optimized version of the Haigis formula can then be considered the first optical IOL formula capable of being completely re-trained using data, because it has no hard-coded algorithm to predict the ELP, unlike other open-source thin lens formulas (see Fig. 24.3).

Data Quality Control and Preparation

Datasets have to be cleaned before being used in the formula building process. This data cleaning can be based on native data (the biometric parameters) and on engineered data (the back-calculated ALP).

Data Cleaning Based on Biometric Parameters

Eyes with illogical or impossible biometric values should be discarded. This situation can happen even in high-quality datasets (for example, false low lens thickness readings can be encountered in hard brunescent cataracts). Some biometers indicate quality index measurements for each measured parameter: eyes with measurement errors should be discarded. An efficient way to spot and discard eyes with outliers in biometric parameters is to create a distribution curve for every parameter and visually determine the limits to apply (Fig. 24.4). Outliers can also be eliminated by identifying eyes having biometric values beyond a certain number of standard deviations away from the mean (usually 3). If an aggressive outlier elimination strategy is chosen, care must be taken to evaluate the resulting formula on eyes with extreme AL and corneal radii. It can be easier and quicker to manually adapt a given formula to extreme eyes than to try to train an algorithm to correctly predict the right values (whether the postoperative IOL position or the postoperative SE) in very atypical eyes.

Data Cleaning Based on TILP Back-Calculation

A useful strategy to detect outliers in big datasets is to back-calculate the TILP from the corneal radius, theoretical posterior corneal radius, IOL parameters (known or estimated) and axial length. This can be done using thick lens equations [22, 23]. Eyes with very high or very low TILP can then be considered outliers. A distribution plot of the TILP (Fig. 24.5) is helpful to choose the limits to apply. A scatter plot displaying the TILP as a function of the AL can also be created (Fig. 24.6), allowing the detection of evident outliers.
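The standard-deviation rule described above can be sketched in a few lines (a deliberately naive filter; note that a single extreme value inflates the SD itself, which is one reason visual inspection of the distribution curves remains useful):

```python
import statistics

def flag_outliers(values, n_sd=3.0):
    """Return a mask that is True where a value lies more than n_sd
    standard deviations away from the sample mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) > n_sd * sd for v in values]
```

Applied once per biometric parameter (AL, ACD, LT, and so on), the mask marks eyes to discard; the `n_sd` threshold plays the role of the "usually 3" cut-off mentioned above.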
24 Practical Considerations for AI Implementation in IOL Calculation Formulas 269
Fig. 24.3 Inner workings of the four classical thin lens formulas [15–21]. The acronyms of the original publications are used in those diagrams. Any transformation different from a multiplication or an addition is represented by an "f" ("function") round cell on the diagram. ACD: Anatomic anterior chamber measured from corneal epithelium to lens; AG: Anterior chamber diameter from angle to angle; ALm: modified AL; Cw: Computed corneal width; H: Corneal height; K: Corneal power; LCOR: Corrected AL; RETHICK: Retinal thickness; RT: Retinal thickness. In the Haigis formula, a1 and a2 are used as the coefficients weighting the ACD and AL respectively in a multiple regression, while a0 is the intercept value. In triple-optimized mode, the multiple regression is re-fitted to new data. There is no hard-coded ELP prediction rule in this formula. Reprinted with permission from Debellemanière et al.: The PEARL-DGS formula: development of an open-source machine learning-based thick IOL calculation formula, 2021, American Journal of Ophthalmology, in press.
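As the caption notes, the triple-optimized Haigis predictor is just a bivariate linear regression, ELP = a0 + a1·ACD + a2·AL. Fitting it is an ordinary least-squares problem; a self-contained sketch via the normal equations (the function and data names are our own):

```python
def fit_haigis_coefficients(acd, al, elp):
    """Fit elp ≈ a0 + a1*acd + a2*al by ordinary least squares,
    solving the 3x3 normal equations by Gauss-Jordan elimination."""
    rows = [[1.0, x, y] for x, y in zip(acd, al)]
    # Normal equations: (X^T X) a = X^T elp
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, elp)) for i in range(3)]
    m = [xtx[i] + [xty[i]] for i in range(3)]  # augmented matrix
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(3):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [a - f * b for a, b in zip(m[r], m[c])]
    return [m[i][3] / m[i][i] for i in range(3)]  # (a0, a1, a2)
```

On noise-free synthetic data the call recovers (a0, a1, a2) exactly; on a real dataset the same fit plays the role of the triple optimization described above.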
Fig. 24.4 Distribution of the six classical biometric parameters in a non-curated dataset. ARC anterior radius of curvature of the cornea, AL axial length, ACD anterior chamber depth, LT lens thickness, CCT central corneal thickness, WTW white-to-white. Lower and upper limits can be set for each parameter. Very short AL values can sometimes indicate retinal detachment; very short LT values can be found in eyes after cataract surgery (if a postoperative biometric measurement is mistakenly included in a dataset), or in brunescent cataracts. Thin corneas can indicate eyes that underwent corneal refractive surgery, and very thick corneas can be secondary to corneal decompensation. Very small WTW values are usually not representative of the anatomical reality and should be discarded.
Fig. 24.5 Distribution of the back-calculated theoretical internal lens position (TILP) in a non-curated dataset. "Impossible" values should be discarded.

Fig. 24.6 Representation of the back-calculated theoretical internal lens position (TILP) as a function of axial length in a non-curated dataset. Outliers can easily be spotted and eliminated.
center, it is important to avoid the situation where a patient has one eye in both sets, in order to avoid data contamination. It is also preferable to have only one eye per patient in the test set [24]. However, we found no drawbacks in having both eyes of a given patient in the training set. Hence, we recommend the following three steps to constitute the sets:

1. Randomly split the main dataset into a training set and a test set
2. Identify patients that have one eye in both sets and move both of their eyes into the training set
3. Identify patients that have both eyes in the test set and randomly delete one of those eyes, without including it in the training set
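The three steps above can be sketched directly. The record layout, `(patient_id, eye_data)` tuples, is our own assumption:

```python
import random

def patient_level_split(eyes, test_fraction=0.2, seed=42):
    """Split (patient_id, eye_data) records so that no patient appears in
    both sets and the test set keeps at most one eye per patient."""
    rng = random.Random(seed)
    eyes = eyes[:]           # work on a copy
    rng.shuffle(eyes)
    n_test = int(len(eyes) * test_fraction)
    test, train = eyes[:n_test], eyes[n_test:]       # step 1: random split
    # Step 2: patients with an eye in both sets -> move both eyes to training
    train_ids = {pid for pid, _ in train}
    train += [e for e in test if e[0] in train_ids]
    test = [e for e in test if e[0] not in train_ids]
    # Step 3: patients with both eyes in the test set -> keep one eye,
    # discarding the other without adding it to the training set
    seen, deduped = set(), []
    for e in test:
        if e[0] not in seen:
            seen.add(e[0])
            deduped.append(e)
    return train, deduped
```

Seeding the random generator keeps the split reproducible, which matters when successive iterations of a formula are compared on the same data.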
Management of IOL Models Diversity

IOL anterior and posterior radii of curvature, thicknesses, refractive indices and haptic styles differ between IOL models. Those properties are different between IOL models for a given IOL power, and the way those parameters vary along the IOL power range is also specific to a given model. It is therefore, in our opinion, not recommended to mix eyes implanted with various IOL models in the same IOL formula development dataset, whatever the underlying IOL formula architecture choice.

Machine Learning Models Inputs

Standard Biometric Parameters

The values measured by recent biometers usually include the anterior radius of curvature of the cornea (ARC), axial length (AL), anterior chamber depth (ACD), lens thickness (LT), central corneal thickness (CCT) and corneal diameter (white-to-white, WTW). An example of their relative importance in the TILP prediction is shown in Fig. 24.9. AQD stands for aqueous chamber depth (ACD − CCT) and VCD stands for vitreous chamber depth (AL − [CCT + AQD + LT]). AQD and VCD were preferred to AL and ACD to reduce the collinearity between variables and to facilitate the feature importance study.

It is important to remember not to include the corneal radius and corneal thickness among the postoperative IOL position predictors if the formula is developed for eyes with a history of corneal refractive surgery or corneal graft, because those surgically modified values are no
Fig. 24.7 Representation of the SD of the mean PE of two generic formulas, obtained on a test set of 700 eyes, as a function of the number of eyes in the training set. The Pure AI formula is an XGBoost model trained to predict the postoperative spherical equivalent using the six biometric parameters + the implanted IOL power as inputs, with no hyperparameter optimization. The Thick lens + XGBoost formula is an XGBoost algorithm trained to predict the TILP value using the six biometric parameters as an input: this value is then used in thick lens equations. Note that the SD of formula (b) decreases faster and gets lower than the SD of formula (a).
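The SD of the prediction error plotted in Fig. 24.7, together with the MAE, MedAE and percentage-of-eyes statistics used throughout formula comparisons, can be computed from paired predicted and achieved refractions (a sketch; sign conventions for the prediction error vary between studies):

```python
import statistics

def formula_metrics(predicted_se, achieved_se):
    """Classic IOL formula accuracy metrics, from spherical equivalents in D."""
    pe = [a - p for p, a in zip(predicted_se, achieved_se)]  # prediction errors
    abs_pe = [abs(e) for e in pe]
    return {
        "ME": statistics.fmean(pe),          # mean (signed) error
        "SD": statistics.stdev(pe),          # SD of the prediction error
        "MAE": statistics.fmean(abs_pe),     # mean absolute error
        "MedAE": statistics.median(abs_pe),  # median absolute error
        "pct_within_0.5D": 100 * sum(e <= 0.5 for e in abs_pe) / len(abs_pe),
    }
```

The same function applied to each candidate formula on a common test set reproduces the kind of ranking quoted later in this chapter.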
longer helpful to predict the IOL position. The ACD must be considered with caution in eyes that underwent radial keratotomy for the same reasons. LT is statistically strongly correlated to the ACD, but nonetheless increases the post-operative IOL position prediction accuracy. The role of corneal thickness and corneal diameter is more debated: the Kane [27] and EVO [28] formulas use only the former, and the Barrett Universal II (BU II) formula only the latter [29].

power cannot be used "out-of-the-box" to predict the lens position in a formula formerly based on the keratometric index.

The potential usefulness of the PRC to predict the postoperative lens position has not yet been evaluated, to the best of our knowledge. It could be hypothesized that, because of the strong ARC/PRC correlation, this value could enhance the postoperative lens position prediction in post-corneal refractive surgery eyes, where the ARC has been modified but the PRC is still representative of the native corneal shape.
Fig. 24.8 Median TILP value for the six classical biometric parameters along their value range. Eyes were sorted according to the value of each parameter. Every 100 eyes, the mean biometric parameter value was calculated for the next 500 eyes and the median TILP value for this group was determined. A threshold is clearly visible around 27 mm for the AL. AQD stands for aqueous chamber depth (distance from the posterior corneal surface to the anterior crystalline lens surface).
Fig. 24.9 Feature importance study for two algorithms trained to predict the TILP in a generic thick lens formula. On the left, a multiple regression model was trained using normalized data, thus allowing the direct comparison of the multiple regression coefficients. On the right, feature importance of the gradient boosted tree algorithm is studied using SHapley Additive exPlanations (SHAP) values [26]. Despite the difference between those algorithms and the way their importance is studied, there is a good concordance between them in relation to feature importance ranking, magnitude and sign.
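Computing SHAP values requires a dedicated library, but a simpler model-agnostic diagnostic in the same spirit is permutation importance: shuffle one input column at a time and measure how much the prediction error grows. A sketch (all names are our own):

```python
import random

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    mse = lambda yp: sum((a - b) ** 2 for a, b in zip(yp, y)) / len(y)
    base = mse([predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the feature/target association
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            deltas.append(mse([predict(row) for row in Xp]) - base)
        importances.append(sum(deltas) / n_repeats)
    return importances
```

A feature the model ignores gets an importance of zero, since shuffling it leaves every prediction unchanged.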
parameters are not directly usable in the optical part of a formula but could be useful to more accurately predict the IOL position.

Predictive Models Building

Algorithm Choice

No algorithm or algorithm family can be considered universally superior to the others. The performances of a given algorithm vary depending on the type of problem to solve, the dataset size, and the number of predictive features, among other parameters: this phenomenon is known as the "no free lunch theorem" [32]. Empirically, we obtained good outcomes with gradient boosted trees, neural nets, multiple regression, and support vector regression.

Hyperparameters Optimization

Hyperparameters make it possible to tune an algorithm, define its architecture, and thus control the learning process. They can be seen as the "control knobs" of an algorithm. The basic linear or multiple regression is the simplest regression algorithm that can be designed and, as such, is the only one that doesn't have any hyperparameters. In the case of gradient boosted trees, hyperparameters include the maximum depth of each tree, the fraction of observations that are sampled into each tree, and criteria that control when new leaves and/or new trees are created. In neural nets, hyperparameters control the number of layers, the number of neurons in each layer, the maximum number of iterations, and so on.

Hyperparameters cannot be chosen a priori. Their best combination for a given algorithm must be searched for, usually using cross-validation. Cross-validation is a process by which the training set is divided into n groups (usually around 5); each group is used in turn as a temporary test set while the others are used to train the model to predict the target with the selected set of hyperparameters. The average of the prediction performance on the subgroups is then computed. This process is repeated for every set of hyperparameters, within the limits defined by the researcher.

Hyperparameters should never be optimized using the test set: doing so would immediately
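The cross-validation loop described above can be sketched end to end for a toy one-hyperparameter model — a one-dimensional ridge slope, chosen purely for brevity:

```python
def k_fold_indices(n, k=5):
    """Yield (train_idx, val_idx) pairs for k roughly equal contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start, idx = 0, list(range(n))
    for size in fold_sizes:
        yield idx[:start] + idx[start + size:], idx[start:start + size]
        start += size

def grid_search_ridge_slope(x, y, lambdas, k=5):
    """Pick the ridge penalty minimizing mean validation MSE over k folds."""
    def fit(train_idx, lam):  # closed-form slope of y ≈ w*x with L2 penalty lam
        sxy = sum(x[i] * y[i] for i in train_idx)
        sxx = sum(x[i] ** 2 for i in train_idx)
        return sxy / (sxx + lam)
    def cv_mse(lam):
        scores = []
        for tr, va in k_fold_indices(len(x), k):
            w = fit(tr, lam)
            scores.append(sum((y[i] - w * x[i]) ** 2 for i in va) / len(va))
        return sum(scores) / len(scores)
    return min(lambdas, key=cv_mse)
```

Real formula work would swap the closed-form fit for the chosen algorithm and the single penalty for a grid of hyperparameter combinations; the fold logic is unchanged.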
lead to overfitting and compromise the final outcomes in real life. Similarly, cross-validation is not a valid method to assess the final performances of a given model.

What is overfitting? This phenomenon can be compared to what would happen to a student who obtained an exam in advance and learned it by rote instead of understanding the lesson: his results would probably be very good for this one exam, but would be deceptive for any other evaluation. Machine learning algorithms, if overly complex and over-trained on a specific dataset, can easily be led to "learn by heart" the dataset. They will obtain very good outcomes on this specific dataset but will perform poorly in real life. Overfitting can be prevented by avoiding adding unnecessary complexity to the models and by evaluating the performances of an algorithm (and formula) on new data, from other centers, ideally in a blind manner.

Model Training and Evaluation

Once the hyperparameters are chosen, the algorithm can be trained using the whole training set, and its performances finally assessed on the test set. While it can be tempting to restart the whole process if the performances are judged unsatisfactory, this would lead to overfitting and should be avoided. The test set should ideally be used only once. If the dataset is big enough, it can be useful to create a test set (used only once) and a validation set (that will be used more often, to test different iterations of a given formula for example).

The constitution of a reference dataset, held by a recognized scientific authority, could help standardize IOL formula evaluation. This dataset would include information regarding preoperative biometric parameters and biometric device type, IOL model and power implanted, and mean postoperative refraction for the entire dataset, allowing formula constant adjustment. Information about patient age, gender and ethnicity could also be included. Individual postoperative refractions should be kept secret, and formula evaluation would be performed independently by the reference authority holding the dataset. This evaluation should not be too frequent (e.g. twice a year) to avoid deliberate overfitting. The eyes comprising this dataset should never be used by any formula inventor, for the same reason.

Description of the Current PEARL-DGS Formula

General Principles

The PEARL-DGS formula [22, 23, 33] is a thick lens formula based on the prediction of the theoretical internal lens position (TILP). This parameter is the theoretical distance between the posterior corneal surface and the anterior IOL surface, back-calculated from postoperative data. It is a theoretical anatomical distance, independent of both the lens principal plane positions and the corneal thickness. The sum-of-segments [6] AL replaces the AL value and is approximated by the Cooke-modified AL [9] (CMAL). The TILP corresponds to the value leading to the real postoperative SE when entered in thick lens equations along with the other optical parameters of the eye and IOL; it is predicted using various machine learning algorithms comprising regular multiple regression, support vector regression, gradient boosted trees and neural networks. The refractive index values of the Atchison model eye [34] are used, except for the corneal index, which was determined empirically during the formula development process. Ideally, the real geometric parameters of the IOL are used during the development process; otherwise, the formula can be developed using theoretical IOL parameters (for example, a biconvex symmetric geometry), and a study of the mean TILP prediction error along the IOL power range is proposed.

Fig. 24.10 General outline of the PEARL-DGS formula prediction process. The PRC is deduced from the ARC (f1). AL and LT are used to calculate the CMAL (f2). The CMAL is corrected before being used as an input to predict the TILP (f3). The raw CMAL value is used in the optical part of the formula. The ARC and CCT are used in the optical part of the formula, and also used as an input to predict the TILP. WTW, AQD and LT are only used to predict the TILP. The TILP is then predicted using six biometric parameters as inputs in various ML models combined in ensemble methods (f4). Reprinted with permission from Debellemanière et al.: The PEARL-DGS formula: development of an open-source machine learning-based thick IOL calculation formula, 2021, American Journal of Ophthalmology, in press.

In an article describing the PEARL-DGS formula and evaluating its performances on two test sets of 677 and 262 eyes, the PEARL-DGS formula yielded the lowest SD on the first set (±0.382 D), followed by K6 and Olsen (±0.394 D), EVO 2.0 (±0.398 D), RBF 3.0 and BUII (±0.402 D), as well as the lowest SD on the second set (±0.269 D), followed by Olsen (±0.272 D), K6 (±0.276 D), EVO 2.0 (±0.277 D) and BUII (±0.301 D) [23].

Several independent peer-reviewed studies have evaluated and compared the PEARL-DGS formula with other fourth-generation IOL calculation formulas. In three [35–37] out of seven studies, PEARL-DGS ranked first, with a median absolute error (MedAE) varying between 0.190 and 0.310 D and a percentage of eyes with a postoperative refractive error of <0.5 D varying between 74% and 87.1%. In their patient cohort with short axial eye length, Wendelstein et al. [35] showed that the PEARL-DGS, Okulix, Kane and Castrop formulae had the lowest MAE (0.260, 0.300, 0.300 and 0.270 D, respectively). Evaluating the refractive results of 171 eyes, Rocha-de-Lossada et al. [38] found that Barrett and PEARL-DGS performed best for medium eyes (MAE = 0.237 and 0.263, respectively; % eyes <0.5 D = 89.34% and 86.89%, respectively).

References

1. Barr A, Feigenbaum EA. Chapter I – Introduction. In: Barr A, Feigenbaum EA, editors. The handbook of artificial intelligence. Butterworth-Heinemann; 1981. p. 1–17.
2. Hill W. Hill-RBF Formula 3.0 [Internet]. Hill-RBF Calculator Version 3.0. https://rbfcalculator.com/. Accessed 3 Feb 2021.
3. Sanders D, Retzlaff J, Kraff M, Kratz R, Gills J, Levine R, et al. Comparison of the accuracy of the Binkhorst, Colenbrander, and SRK implant power prediction formulas. J Am Intraocul Implant Soc. 1981;7(4):337–40.
4. Sanders DR, Retzlaff J, Kraff MC. Comparison of empirically derived and theoretical aphakic refraction formulas. Arch Ophthalmol. 1983;101(6):965–7.
5. Ladas JG, Siddiqui AA, Devgan U, Jun AS. A 3-D "Super Surface" combining modern intraocular lens formulas to generate a "Super Formula" and maximize accuracy. JAMA Ophthalmol. 2015;133(12):1431–6.
6. Wang L, Cao D, Weikert MP, Koch DD. Calculation of axial length using a single group refractive index versus using different refractive indices for each ocular segment: theoretical study and refractive outcomes. Ophthalmology. 2019;126(5):663–70.
7. Cooke DL, Cooke TL, Suheimat M, Atchison DA. Standardizing sum-of-segments axial length using refractive index models. Biomed Opt Express. 2020;11(10):5860–70.
8. Haigis W. Intraocular lens calculation in extreme myopia. J Cataract Refract Surg. 2009;35(5):906–11.
9. Cooke DL, Cooke TL. Approximating sum-of-segments axial length from a traditional optical low-coherence reflectometry measurement. J Cataract Refract Surg. 2019;45(3):351–4.
10. Bahrami M, Hoshino M, Pierscionek B, Yagi N, Regini J, Uesugi K. Refractive index degeneration in older lenses: a potential functional correlate to structural changes that underlie cataract formation. Exp Eye Res. 2015;140:19–27.
11. Kasthurirangan S, Markwell EL, Atchison DA, Pope JM. In vivo study of changes in refractive index distribution in the human crystalline lens with age and accommodation. Invest Ophthalmol Vis Sci. 2008;49(6):2531–40.
12. Dubbelman M, Van der Heijde GL. The shape of the aging human lens: curvature, equivalent refractive index and the lens paradox. Vision Res. 2001;41(14):1867–77.
13. Patel S, Tutchenko L. The refractive index of the human cornea: a review. Cont Lens Anterior Eye. 2019;42(5):575–80.
14. Haigis W. Intraocular lens power calculations. In: Shammas HJ, editor. SLACK Incorporated; 2004.
15. Haigis W, Lege B, Miller N, Schneider B. Comparison of immersion ultrasound biometry and partial coherence interferometry for intraocular lens calculation according to Haigis. Graefes Arch Clin Exp Ophthalmol. 2000;238(9):765–73.
16. Retzlaff JA, Sanders DR, Kraff MC. Development of the SRK/T intraocular lens implant power calculation formula. J Cataract Refract Surg. 1990;16(3):333–40.
17. Retzlaff JA, Sanders DR, Kraff MC. Development of the SRK/T intraocular lens implant power calculation formula: Erratum. J Cataract Refract Surg. 1990;16(4):528.
18. Hoffer KJ. The Hoffer Q formula: a comparison of theoretic and regression formulas. J Cataract Refract Surg. 1993;19(6):700–12.
19. Zuberbuhler B, Morrell AJ. Errata in printed Hoffer Q formula. J Cataract Refract Surg. 2007;33(1):2; author reply 2–3.
20. Hoffer KJ. Errors in self-programming the Hoffer Q formula. Eye. 2007;21(3):429; author reply 430.
21. Holladay JT, Prager TC, Chandler TY, Musgrove KH, Lewis JW, Ruiz RS. A three-part system for refining intraocular lens power calculations. J Cataract Refract Surg. 1988;14(1):17–24.
22. Gatinel D, Debellemanière G, Saad A, Dubois M, Rampat R. Determining the theoretical effective lens position of thick intraocular lenses for machine learning-based IOL power calculation and simulation. Transl Vis Sci Technol. 2021.
23. Debellemanière G, Dubois M, Gauvin M, Wallerstein A, Brenner LF, Rampat R, et al. The PEARL-DGS formula: development of an open-source machine learning-based thick IOL calculation formula. (under review).
24. Hoffer KJ, Savini G. Update on intraocular lens power calculation study protocols: the better way to design and report clinical trials. Ophthalmology [Internet]. 2020. https://doi.org/10.1016/j.ophtha.2020.07.005.
25. Wang L, Shirayama M, Ma XJ, Kohnen T, Koch DD. Optimizing intraocular lens power calculations in eyes with axial lengths above 25.0 mm. J Cataract Refract Surg. 2011;37(11):2018–27.
26. Lundberg S, Lee S-I. A unified approach to interpreting model predictions [Internet]. arXiv [cs.AI]. 2017. Available from http://arxiv.org/abs/1705.07874.
27. Kane JX. Kane formula calculator [Internet]. Kane Formula. https://www.iolformula.com/. Accessed 15 Mar 2021.
28. Yeo TK. EVO formula [Internet]. The Emmetropia Verifying Optical (EVO) formula. https://www.evoiolcalculator.com/. Accessed 1 Feb 2021.
29. Barrett G. Barrett Universal II Calculator [Internet]. Barrett Universal II. https://calc.apacrs.org/barrett_universal2105/. Accessed 15 Mar 2021.
30. Martinez-Enriquez E, Pérez-Merino P, Durán-Poveda S, Jiménez-Alfaro I, Marcos S. Estimation of intraocular lens position from full crystalline lens geometry: towards a new generation of intraocular lens power calculation formulas. Sci Rep. 2018;8(1):9829.
31. Yoo Y-S, Whang W-J, Kim H-S, Joo C-K, Yoon G. New IOL formula using anterior segment three-dimensional optical coherence tomography. PLoS One. 2020;15(7):e0236137.
32. Wolpert DH. The lack of a priori distinctions between learning algorithms. Neural Comput. 1996;8(7):1341–90.
33. Debellemanière G, Saad A, Gatinel D. PEARL DGS calculator [Internet]. IOL Solver. www.iolsolver.com. Accessed 14 Mar 2021.
34. Atchison DA. Optical models for human myopic eyes. Vision Res. 2006;46(14):2236–50.
35. Wendelstein J, Hoffmann P, Hirnschall N, Fischinger IR, Mariacher S, Wingert T, et al. Project hyperopic power prediction: accuracy of 13 different concepts for intraocular lens calculation in short eyes. Br J Ophthalmol [Internet]. https://doi.org/10.1136/bjophthalmol-2020-318272. Accessed 27 Jan 2021.
36. Diogo H-F, Maria EL, Rita S-P, Pedro G, Vitor M, João F, Nuno A. Anterior chamber depth, lens thickness and intraocular lens calculation formula accuracy: nine formulas comparison. Br J Ophthalmol. bjophthalmol-2020-317822.
37. Leonardo T, Kenneth JH, Piero B, Domenico S-L, Giacomo S. Outcomes of IOL power calculation using measurements by a rotating Scheimpflug camera combined with partial coherence interferometry. J Cataract Refract Surg. 2020;46:1618–23.
38. Rocha-de-Lossada C, Colmenero-Reina E, Flikier D, Castro-Alonso F-J, Rodriguez-Raton A, García-Madrona J-L, et al. Intraocular lens power calculation formula accuracy: comparison of 12 formulas for a trifocal hydrophilic intraocular lens. Eur J Ophthalmol. 2020;1120672120980690.
Index
A developments, 1, 2
Accreditation Council for Graduate Medical DL models, 3
Education, 215 doctor-patient relationship, 3
Accuracy of DL guided triaging in ophthalmology, 230 GANs, 8–10
multi-task classification task, referral category, 231 guidelines, 13
urgent referrals, 231 human cognitive capacity, 3
Adult triage criteria for referrals at tertiary Industrial Revolution, 1, 2
ophthalmology center, 228 life-science papers, 3, 4
Age-Related Eye Disease Study (AREDS), 42, 103, medical specialties, 3
187–190 optical coherence tomography images, 7
Age-related macular degeneration (AMD), 187–190 programs creation, 3
AREDS, 103 resources, 6, 7
automated image analysis, 102 risk factors, 3, 5
classification, 42, 43 safety, 6
clinical features, 101, 102 transfer learning, 7–9
deep learning (see Deep learning) Turing test, 1
imaging modalities, 103 AI-based systems, 173
limitation, 108 AI-DL models on external multi-ethnic testing
prediction, 48 datasets, 252
Simplified Severity Scale, 103–105 AI-enhanced formulas use ML algorithms, 267
AlexNet, 37 AI-enhanced optical formulas, 266–268
Algorithm family, 274 AI-enhanced thin lens optical formula, 267
American Board of Ophthalmologists (ABO), 215 algorithms, 263
Anaemia, 248 assisting in diagnosis and surgery procedures, 207
Anatomical segmentation, 219–220 for cataract detection and grading, 204
Anemia, 171, 172 in corneal ectasias, advantages of, 198
Applied deep learning research work in eye diseases, for corneal pathologies, 195
162–168 diagnostic algorithm, combined multimode data, 208
Area Under the Curve (AUC), 25 enabled DR screening, 161
AREDS 9-step severity scale, 189 guided triage model, 229
Artificial intelligence (AI), 177 models, 264
adoption in healthcare, 177 and optics, 266
AI-assisted refractive surgery, research and for preoperative assessment, 204
application, 211 for scientific discovery, 168–173
AI-based assistance, 172 disease diagnosis, 168
AI-based automated segmentation, 46–48 disease progression, 169
AI-based cataract management, 206 systemic conditions, 170–172
AI-based devices tools, 264
aims, 1, 3 Artificial neural networks (ANN), 22, 23, 34, 205,
benefit for, 4, 6 231, 260
challenges, 10–13 Artificial neurons, 22
continuous learning, 10 Assessments using structured rating scales, 218
cost-effectiveness, 13, 14 Atchinson model eye, 275
design factors, 13, 14 Australian Institute for Machine Learning (AIML), 232
E H
Ectasia scoring systems, 194 Haigis formula, 267–269
Effective lens position (ELP, 264 Hand motion analysis, 222
Electronic medical record (EMR), 43, 232 Hidden Markov Model (HMM) algorithm, 219
Emergency referrals, 227 Hill-RBF formula, 264
Excimer laser photorefractive keratectomy (PRK), 207 Holdout validation, 23
Expert decision tree, 195 Human action recognition system for real-time
Expert system classifier, 197 recognition, 220
Extreme gradient boosting (XGBoost), 260 Human guided triaging, 232
Eye motion tracking, 222 Hyperparameters, 274
Eye Surgical Skills Assessment Test (ESSAT), 218 optimization, 274–275
tuning, 230
F
Feedforward neural network (FFN), 23 I
Fine needle aspiration biopsy (FNAB) Image analysis, 220
cytology, 236 CNN
of uveal melanomas, 236 diagnostic support and prediction, 84
Forme fruste keratoconus (FFKC) patients, 193 image quality assessment, 84
Framingham risk score (FRS), 252 keypoints, 83
Frisen severity classification, 240 digital images, 78, 79
Fundus autofluorescence (FAF), 103 image preprocessing
Fuzzy K-means clustering algorithm, 203 contrast enhancement, 80, 81
Fuzzy set theory, 222 intensity normalization, 80
image registration process, 81–83
preprocessing and registration, 77
G process, 77
Gene expression profile (GEP), 236, 238 ImageNet Large Scale Visual Recognition Challenge
General practitioners (GPs), 156, 157 (ILSVRC), 1
Generative adversarial networks (GANs), 8–10, 74, 75, Implantable Collamer Lens (ICL), 211
189, 190 Insulin growth factor (IGF-1), 128
Index 283
Machine learning-based IOL formulas architecture, 264–268
Machine to machine technique, 240
Major Adverse Cardiac Events (MACE), 252
Management decision of referral or follow-up, 203
Manual data processing, 238
MATLAB, 26
Mean absolute error (MAE), 244
Median absolute error (MedAE), 276
Median TILP value, 273
Medical laws or connections, AI, 208
Metrics, 23, 24
Model assessment, 230
Model training and evaluation, 275
Model training process, 165
Modern deep learning systems, 240
Modified network architecture of SELENA, 181
Motion analysis techniques, 216, 222
Multi-Context Deep Network (MCDN), 118
Multilayer perceptron (MLP), 23, 70, 71, 196, 209
Multivariate logistic regression analysis, 197, 198
Muscle contraction analysis, 222

N
Naïve Bayes, 33
  algorithm, 21
  Random Forest, 196
National Health Service (NHS) Trusts, 227
Natural language processing (NLP), triaging in ophthalmology, 228
Negation detection, 230
Netflix, 261
Neural network (NN), 17, 195, 259
  architecture models, 230
  components, 88
  for corneal topography classification, 195
  data types, 88
  and linear discriminant analysis, 196
  unilateral and bilateral indices, 196
Neuro-ophthalmic abnormalities affecting the optic discs, 240
Neuro-ophthalmic disease
  fundus photographs, 94
  optic disc abnormalities, 94, 95
Neuro-ophthalmic optic disc abnormalities on retinal fundus images, 240
NHS Diabetic Eye Screening Programme (NDESP), 140
No free lunch theorem, 274
Non-proliferative DR (NPDR), 182
Northern Ireland Diabetic Retinopathy Screening Programme (NIDRSP), 140
Numerical triage category, 230

O
Objective Assessment of Skills in Intraocular Surgery (OASIS), 218
Objective Structured Assessment of Cataract Surgical Skill (OSACSS), 218
Objective Structured Assessments of Technical Skills (OSATS), 218
Ocular oncology, 235, 236, 238
Optic disc abnormalities, 94, 95
Optic disc classification in glaucoma, 240
Optic disc features in glaucoma, 240
Optic neuropathies, 240
Optical coherence tomography (OCT), 103, 165–167, 169
OverFeat deep convolutional neural network (DCNN), 187
Overfitting, 24, 275
Oxford Health NHS Foundation Trust, 227

P
Papilledema, 95
Pattern recognition identification, 257
PEARL-DGS formula, 267, 269, 275, 276
Phacoemulsification techniques, 218
Phacotracking, 217
Phase recognition via ML algorithms, 223
Population achieved sensitivity (PAS), 59
Population-based studies, 208, 243
Population diagnosability, 58
Posterior capsule opacification (PCO), 205
Posterior corneal radius, 272
Post-hoc assessments of optic discs, 240
Postoperative spherical Equivalent Prediction using ARtificial Intelligence and Linear algorithms (PEARL) project, 263
Precision medicine, 210
Predictability and generalisability of the AI-DL model, 252
Prediction error (PE), 269
Predictive models building, 274–275
Pre-operative refraction, 259
Pre-trained Unsupervised Network (PUN), 38
Proliferative DR (pDR), 139
Pseudophakic anterior chamber depth (pACD), 205
Pure AI-based formulas, 264–266
Python, 25

Q
Quality management system (QMS), 64

R
Radial basis function (RBF) network, 209, 264
Random forests algorithm, 22, 34, 221, 231
Receiver operating characteristic (ROC) curve, 24
Recurrent neural network (RNN), 23, 38, 73, 74, 120, 211, 220
Referable cataracts, 258
Reference dataset, 275
Reference standard, 58, 63
Refractive indices of eye, 267
Refractive surgery
  analysis of images and data, 209