
The New England Journal of Medicine

Review Article

AI in Medicine
Jeffrey M. Drazen, M.D., Editor; Isaac S. Kohane, M.D., Ph.D., Guest Editor; and Tze-Yun Leong, Ph.D., Guest Editor

Where Medical Statistics Meets Artificial Intelligence
David J. Hunter, M.B., B.S., and Christopher Holmes, Ph.D.

From the Nuffield Department of Population Health (D.J.H.) and the Department of Statistics and Nuffield Department of Medicine (C.H.), University of Oxford, Oxford, and the Alan Turing Institute, London (C.H.) — both in the United Kingdom. Dr. Holmes can be contacted at chris.holmes@stats.ox.ac.uk or at the Department of Statistics, University of Oxford, 24-29 St. Giles’, Oxford OX1 3LB, United Kingdom.

N Engl J Med 2023;389:1211-9. DOI: 10.1056/NEJMra2212850
Copyright © 2023 Massachusetts Medical Society.
Statistics emerged as a distinct discipline around the beginning of the 20th century. During this time, fundamental concepts were developed, including the use of randomization in clinical trials, hypothesis testing, likelihood-based inference, P values, and Bayesian analysis and decision theory.1,2 Statistics rapidly became an essential element of the applied sciences, so much so that in 2000, the editors of the Journal cited “Application of Statistics to Medicine” as one of the 11 most important developments in medical science over the previous 1000 years.3 Statistics concerns reasoning with incomplete information and the rigorous interpretation and communication of scientific findings from data. Statistics includes determination of the optimal design of experiments and accurate quantification of uncertainty regarding conclusions and inferential statements from data analysis, expressed through the language of probability.
In the 21st century, artificial intelligence (AI) has emerged as a valuable approach in data science and a growing influence in medical research,4-6 with an accelerating pace of innovation. This development is driven, in part, by the enormous expansion in computer power and data availability. However, the very features that make AI such a valuable additional tool for data analysis are the same ones that make it vulnerable from a statistical perspective. This paradox is particularly pertinent for medical science. Techniques that are adequate for targeted advertising to voters and consumers or that enhance weather prediction may not meet the rigorous demands of risk prediction or diagnosis in medicine.7,8 In this review article, we discuss the statistical challenges in applying AI to biomedical data analysis and the delicate balance that researchers face in wishing to learn as much as possible from data while ensuring that data-driven conclusions are accurate, robust, and reproducible.

We begin by highlighting a distinguishing feature of AI that makes it such a powerful approach while at the same time making it statistically vulnerable. We then explore three particular challenges at the interface of statistics and AI that are of particular relevance to medical studies: population inference versus prediction, generalizability and interpretation of evidence, and stability and statistical guarantees. We focus on issues of data analysis and interpretation of findings. Space constraints preclude a discussion of the important area of AI and experimental design or a deep dive into the emerging area of generative AI and medical chatbots; however, we comment on this emerging area briefly.

Feature Representation Learning
Traditional statistical modeling uses careful hands-on selection of measurements and data features to include in an analysis — for example, which covariates to include in a regression model — as well as any transformation or standardization of measurements. Semiautomated data-reduction techniques such as random forests and forward- or backward-selection stepwise regression have assisted statisticians in this hands-on selection for decades. Modeling assumptions and features are typically explicit, and the dimensionality of the model, as quantified by the number of parameters, is usually known. Although this approach uses expert judgment to provide high-quality manual analysis, it has two potential deficiencies. First, it cannot be scaled to very large data sets — for instance, millions of images. Second, the assumption is that the statistician either knows or is able to search for the most appropriate set of features or measurements to include in the analysis (Fig. 1A).
Arguably the most impressive and distinguishing aspect of AI is its automated ability to search and extract arbitrary, complex, task-oriented features from data — so-called feature representation learning.9-11 Features are algorithmically engineered from data during a training phase in order to uncover data transformations that are correct for the learning task. Optimality is measured by means of an “objective function” quantifying how well the AI model is performing the task at hand. AI algorithms largely remove the need for analysts to prespecify features for prediction or manually curate transformations of variables. These attributes are particularly beneficial in large, complex data domains such as image analysis, genomics, or modeling of electronic health records. AI models can search through potentially billions of nonlinear covariate transformations to reduce a large number of variables to a smaller set of task-adapted features. Moreover, somewhat paradoxically, increasing the complexity of the AI model through additional parameters, which occurs in deep learning, only helps the AI model in its search for richer internal feature sets, provided training methods are suitably tailored.12,13
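To make this distinction concrete, the following Python sketch (illustrative only; the simulated data, model sizes, and variable names are our own assumptions, not drawn from any study cited here) contrasts a regression model that relies on a hand-specified interaction feature with a small neural network that learns an internal representation by minimizing a cross-entropy objective function:

```python
# Illustrative sketch: hand-specified features vs. learned feature representations.
# The data are simulated; a nonlinear rule links covariates to the outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))                 # 10 raw covariates
y = (X[:, 0] * X[:, 1] > 0.5).astype(int)       # outcome depends on an interaction
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def hand_features(X):
    # Traditional route: the analyst must know to prespecify the interaction term.
    return np.column_stack([X, X[:, 0] * X[:, 1]])

logit = LogisticRegression(max_iter=1000).fit(hand_features(X_train), y_train)

# AI route: the network's hidden layers search for task-adapted features
# by minimizing a cross-entropy objective function during training.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000,
                    random_state=0).fit(X_train, y_train)

print("logit + hand-crafted feature:", logit.score(hand_features(X_test), y_test))
print("MLP with learned features:   ", mlp.score(X_test, y_test))
```

In this simulation, the network can approximate the interaction without being told that it exists, which is the essence of feature representation learning; the price is that its internal features are not directly inspectable.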
The result is that the trained AI models can engineer data-adaptive features that are beyond the scope of features that humans can engineer, leading to impressive task performance. The problem is that such features can be hard to interpret, are brittle in the face of changing data, and lack common sense in the use of background knowledge and qualitative checks that statisticians bring to bear in deciding on a feature set to use in a model. AI models are often unable to trace the evidence line from data to features, making auditability and verification challenging. Thus, greater checks and balances are needed to ensure the validity and generalizability of AI-enabled scientific findings (Fig. 1B).14,15

The checking of AI-supported findings is particularly important in the emerging field of generative AI through self-supervised learning, such as large language models and medical science chatbots that may be used, among many applications, for medical note taking in electronic health records.16 Self-supervised learning by these foundation models involves vast quantities of undocumented training data and the use of broad objective functions to train the models with trillions of parameters (at the time of this writing). This is in contrast to the “supervised” learning with AI prediction models, such as deep learning classifiers, in which the training data are known and labeled according to the clinical outcome, and the training objective is clear and targeted to the particular prediction task at hand. Given the opaqueness of generative AI foundation models, additional caution is needed for their use in health applications.
Prediction versus Population Inference

AI is especially well suited to, and largely designed for, large-scale prediction tasks.17 This is true, in part, because with such tasks, the training objective for the model is clear, and the evaluation metric in terms of predictive accuracy is usually well characterized. Adaptive models and algorithms can capitalize on large quantities of annotated data to discover patterns in covariates that associate with outcomes of interest. A good example is predicting the risk of disease.18 However, the ultimate goal of most medical studies is not explicitly to predict risk but rather to gain an understanding of some biologic mechanism or cause of disease in the wider population or to assist in the development of new therapies.19,20 There is an evidence gap between a good predictive model that operates at the individual level and the ability to make inferential statements about the population.21 Statistics is mainly concerned with population inference tasks and the generalizability of evidence obtained from one study to an understanding of a scientific hypothesis in the wider population. Prediction is an important yet simpler task, whereas scientific inference often has a greater influence on mechanistic understanding. As Hippocrates observed, “It is more important to know what sort of person has a disease than to know what sort of disease a person has.”

An example comes from the recent coronavirus disease 2019 (Covid-19) pandemic. Various prediction tools for determining whether a person has severe acute respiratory syndrome coronavirus 2 infection have been reported,22 but moving from individual prediction to inference regarding the population prevalence and an understanding of at-risk subgroups in the population is much more challenging.23
An additional challenge in the use of predictive tools is that there are many ways to measure and report predictive accuracy — for instance, with the use of measures such as the area under the receiver-operating-characteristic curve, precision and recall, mean squared error, positive predictive value, misclassification rate, net reclassification index, and log probability score. Choosing a measure that is appropriate for the context is vitally important, since accuracy in one of these measures may not translate to accuracy in another and may not relate to a clinically meaningful measure of performance or safety.24,25 In contrast, inferential targets and estimands for population statistics tend to be less ambiguous, and the uncertainty is more clearly characterized through the use of P values, confidence intervals, and credible intervals. That said, robust, accurate AI prediction models indicate the existence of repeatable signals and stable associations in the data that warrant further investigation.26 Bayesian procedures have an inherent link between prediction and inference through the use of joint probabilistic modeling.27-29
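The point that accuracy in one measure need not translate to accuracy in another can be illustrated with a short sketch (scikit-learn; the simulated risk score and the 10% prevalence are hypothetical): a score that ranks patients almost perfectly, and therefore has an excellent area under the receiver-operating-characteristic curve, can still report systematically distorted probabilities and only moderate precision.

```python
# Illustrative sketch: the same predictions scored under different accuracy measures.
# Data and probabilities are simulated; the threshold and numbers are illustrative.
import numpy as np
from sklearn.metrics import (roc_auc_score, precision_score, recall_score,
                             brier_score_loss, log_loss)

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.1, size=10000)          # 10% disease prevalence
# A well-ranking but miscalibrated risk score: correct ordering, inflated probabilities
p_hat = np.clip(0.3 + 0.5 * y_true + rng.normal(0, 0.15, 10000), 0.01, 0.99)
y_pred = (p_hat > 0.5).astype(int)

print("AUC:        ", roc_auc_score(y_true, p_hat))      # ranking of cases vs. noncases
print("precision:  ", precision_score(y_true, y_pred))
print("recall:     ", recall_score(y_true, y_pred))
print("Brier score:", brier_score_loss(y_true, p_hat))   # accuracy of the probabilities
print("log loss:   ", log_loss(y_true, p_hat))
```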
An interesting area where AI prediction methods and statistical inference meet is causal machine learning that pays particular attention to inferential quantities.30-32 Adoption of structural causal modeling or potential outcomes frameworks, with tools such as directed acyclic graphs, uses domain knowledge to reduce the probability that an AI model will make data-driven mistakes such as misspecifying the temporal relationship between exposure and outcome, conditioning on a variable that is caused by both exposure and disease (a “collider”), or highlighting a spurious association — for example, a batch effect in a biomarker study.33 Causal inference methods may also be applied to AI for the interpretation of radiologic or pathological images34 and for clinical decision making and diagnosis,35 and they may facilitate the handling of high-dimensional confounders.36 Although AI methods may automate and assist in applying causal inference methods to biomedical data, human judgment is likely to be necessary for the foreseeable future, if only because different AI algorithms may present us with different conclusions. Moreover, in order to avoid potential bias arising from ascertainment, mediation, and confounding, causal analysis from observational data requires assumptions that lie outside that which is learnable from the data.
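A small simulation makes the collider mistake concrete (plain NumPy; the variables are generic stand-ins rather than any real exposure or disease): exposure and disease are generated independently, yet restricting the analysis to high values of a variable caused by both, such as hospitalization, manufactures an association.

```python
# Illustrative sketch of collider bias: exposure E and disease liability D are
# simulated as independent, but selecting on a collider C (caused by both)
# induces a spurious E-D correlation in the conditioned subset.
import numpy as np

rng = np.random.default_rng(2)
n = 200000
E = rng.normal(size=n)               # exposure
D = rng.normal(size=n)               # disease liability, independent of E
C = E + D + rng.normal(size=n)       # collider: caused by both E and D

print("corr(E, D), full population: ", np.corrcoef(E, D)[0, 1])   # approximately 0
selected = C > 1.0                   # e.g., analyzing only hospitalized patients
print("corr(E, D), conditioned on C:", np.corrcoef(E[selected], D[selected])[0, 1])  # negative
```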


Generalizability and Interpretation

One challenge in interpreting AI results is that algorithms for internal feature representation are designed to automatically adapt their complexity to the task at hand, with nearly infinite flexibility in some approaches. This flexibility is a great strength but also requires care to avoid overfitting to data. The use of regularization and controlled stochastic optimization of model parameters during training can help prevent overfitting but also means that AI algorithms have poorly defined notions of statistical degrees of freedom and the number of free parameters. Thus, traditional statistical guarantees against overoptimism cannot be used, and techniques such as cross-validation and held-out samples to mimic true out-of-sample performance must be substituted, with the trade-off that the amount of data available for discovery is reduced. With these factors taken together, the risk is overinterpretation of the generalizability and reproducibility of results.
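The substitution of held-out performance for analytic guarantees can be sketched as follows (scikit-learn; the pure-noise data set is an illustrative extreme): a flexible model fit to 500 covariates with no true signal appears perfect when scored on its own training data, whereas cross-validation reveals chance-level performance, at the cost of data spent on validation folds.

```python
# Illustrative sketch: a flexible model fit to pure noise looks perfect in-sample;
# cross-validation exposes the overfitting, at the cost of data spent on validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 500))        # 500 covariates, 300 patients
y = rng.integers(0, 2, size=300)       # outcome unrelated to the covariates

model = RandomForestClassifier(random_state=0).fit(X, y)
print("in-sample accuracy:     ", model.score(X, y))                       # near 1.0
print("5-fold cross-validation:", cross_val_score(model, X, y, cv=5).mean())  # near 0.5
```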

Practices that medical scientists should pay careful attention to in planning AI-enabled studies include releasing all code and providing clear statements on model fitting and held-out data used for reporting of accuracy so as to facilitate external assessment of the reproducibility of findings.15 A recent report by McKinney et al. on the use of AI for predicting breast cancer on the basis of mammograms37 prompted a call by Haibe-Kains et al. for greater transparency: “In their study, McKinney et al. showed the high potential of AI for breast cancer screening. However, the lack of details of the methods and algorithm code undermines its scientific value.”38 The use of traditional statistical prediction methods alongside interpretable AI methods can contribute to an understanding of the prediction signal and can mitigate nonsensical associations. The clear reporting of results and availability of code add to the potential for external replication and refinement by other groups but may be limited by a tendency to seek intellectual property rights for commercial AI products.

Figure 1. Characteristics of Statistical and Artificial Intelligence (AI) Models.
[Diagram: Panel A, Statistical Model — expert judgment by statisticians, clinicians, and epidemiologists; hands-on selection of measurements or features for prediction (e.g., risk of diabetes and other coexisting conditions); curation of transformation or standardization methods; a prespecified analysis strategy; a stable process whose conclusions are reproducible, auditable, and verifiable but that is harder to scale to very large, multimodal, or high-dimensional data sets. Panel B, AI Model — automated search and extraction of arbitrary, complex, task-oriented features to develop a prediction algorithm; an automated but opaque process that can handle very large, multimodal, or high-dimensional data sets but whose conclusions are harder to reproduce or audit.]
As shown in Panel A, statisticians, in conjunction with clinicians, can use expert judgment to design studies and analyze the resulting data. To avoid “data dredging,” the analysis is often prespecified in a statistical analysis plan, which may include such details as a listing of primary and secondary hypotheses and specification of variables that will be controlled for, how the variables will be categorized, which statistical methods will be used, how they will provide protection against type 1 error, and even how the tables will be presented. Additional or post hoc analyses are considered to be exploratory. A second statistician or statisticians starting with the same data and statistical analysis plan should produce almost identical results. These principles are challenged by high-dimensional data (i.e., data with many variables from multiple sources), for which there may be numerous alternative approaches to reducing the data to a smaller number of variables, with many options for analyses, and the statistician may “drown” in the data. As shown in Panel B, an AI algorithm can sift through vast amounts of data, but the way in which findings are derived from the data may be opaque, and it may be impossible for an analyst starting with the same data to succinctly describe and reproduce the analysis and results. Of most concern is the possibility of overfitting and of false positive results leading to findings that are not reproducible. Biases in the data that a human may understand may not be known to an AI algorithm. Internal reproducibility should be assessed by using methods such as partitioning the data into discovery and test sets. The generalizability to other data sets may be limited by idiosyncrasies in the first data set that are not shared with apparently similar data sets.
AI approaches may be useful in winnowing down a data set with a very large number of features, such as “-omic” data sets (e.g., metabolomic, proteomic, or genomic data), into a smaller number of features that can then be tested with the use of conventional statistical methods. Popular AI methods such as random forests, XGBoost, and Bayesian additive regression trees39-41 all provide “feature relevance” ranking of covariates, and statistical methods such as the least absolute shrinkage and selection operator42 use explicit variable selection as part of the model fitting. Although many AI procedures may not effectively distinguish between highly correlated variables, standard regression techniques with a smaller number of AI-selected features may do so. Feature reduction also helps the human analyst examine the data and apply additional constraints on an analysis that are based on previous subject knowledge. For example, feature A is often confounded by feature X, or a latency period of several years between exposure to feature A and the disease outcome means that no relationship is expected in early follow-up. Some similarities and differences between AI and conventional statistics are summarized in Table 1.
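As a brief sketch of this winnowing step (scikit-learn; the simulated “-omic” matrix and the two truly informative columns are hypothetical), a random forest supplies a feature-relevance ranking and the lasso performs explicit variable selection:

```python
# Illustrative sketch: feature-relevance ranking (random forest) and explicit
# variable selection (lasso) used to winnow a wide, simulated "-omic" matrix.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 1000))                           # 1000 candidate biomarkers
y = 2 * X[:, 3] - 1.5 * X[:, 42] + rng.normal(0, 1, 400)   # only two carry signal

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
top_rf = np.argsort(rf.feature_importances_)[::-1][:10]    # relevance ranking
print("top-ranked features (forest):", top_rf)

lasso = LassoCV(cv=5).fit(X, y)                            # explicit selection
print("features kept by the lasso:  ", np.flatnonzero(lasso.coef_))
```

The surviving handful of features could then be carried into a conventional regression model for formal testing.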
AI approaches challenge some recent trends in conventional statistical analysis of clinical and epidemiologic studies. Randomized trials of investigational drugs have been held to a high standard of rigor, and concerns about overinterpretation of the results of secondary end-point and subgroup analyses have led to an even stronger focus on prespecified description of primary hypotheses and control of the familywise error rate in order to limit false positive results. Protocols now often specify the precise estimands and methods of analysis that will be used to obtain P values for inference and may include the covariates to be controlled for and the dummy tables that will be filled in once data are complete. Analyses in observational studies are usually less rigorously prespecified, although a statistical analysis plan established before the start of data analysis is increasingly expected as supplementary material in published reports.43
AI approaches, in contrast, often seek patterns in the data that are not prespecified, which is one of the strengths of such approaches (as discussed above), and thus the potential for false positive results is increased unless rigorous procedures to assess the reproducibility of findings are incorporated. New reporting guidelines and recommendations for AI in medical science have been established to ensure greater trust and generalizability of conclusions.44-48 Moreover, highly adaptive AI algorithms inherit all the biases and unrepresentativeness that might be present in the training data, and in using black-box AI prediction tools, it can be difficult to judge whether predictive signals arise as a result of confounding from hidden biases in the data.49-52 Methods from the field of explainable AI (XAI) can help counter opaque feature representation learning,53 but for applications in which safety is a critical issue, the black-box nature of AI models warrants careful consideration and justification.54
Table 1. Similarities and Differences between Artificial Intelligence and Conventional Statistics.

Prior hypotheses. AI methods: agnostic or very general. Conventional statistical methods: specific; often categorized as primary, secondary, and exploratory.

Techniques (examples). AI methods: random forests, neural networks, XGBoost. Conventional statistical methods: parametric and nonparametric comparisons between groups; regression and survival models with linear predictors.

Stability (end-to-end). AI methods: analyses are more prone to instability and variability as a result of application domains (e.g., multimodal data integration) and user choices in algorithm specification (e.g., architecture in deep learning). Conventional statistical methods: stable analyses that follow prespecification of a statistical analysis plan with minimal available user-defined choices in model specification.

Applications. AI methods: analysis of images, outputs from monitors, massive data sets (e.g., electronic health records, natural language processing). Conventional statistical methods: data with a smaller number of predictors, tabular data, randomized trials.

Purpose. AI methods: pattern discovery; automatic feature representation; feature reduction to a smaller, more manageable set; prediction models. Conventional statistical methods: statistical inference and testing of specific factors for departure from a null hypothesis, control of confounding and ascertainment bias, quantification of uncertainty.

Reproducibility. AI methods: often internal (i.e., performed with original data set); cross-validation or split samples. Conventional statistical methods: ideally external (i.e., performed with “new” data); formal tests of significance against null hypotheses.

Barriers. AI methods: increasingly, use of proprietary algorithms not available to other researchers; lack of clarity in reporting. Conventional statistical methods: slow progress in sharing of primary data to allow others to check or extend results.

Interpretability. AI methods: often black-box; automatic algorithmic feature engineering introduces opaqueness. Conventional statistical methods: explicit features, clear number of free parameters and degrees of freedom.

Equity. AI methods: data-driven feature learning susceptible to biases present in data, compounding health inequities. Conventional statistical methods: less flexible, more explicit (interpretable) models, which are more easily checked for equity if relevant data are available.

Obermeyer and colleagues55 describe an AI-informed algorithm that was applied to a population of 200 million persons in the United States each year in order to identify patients who were at highest risk for incurring substantial health care costs and to refer them to “high-risk care management programs.” Their analysis suggested that the algorithm unintentionally discriminated against Black patients. The reason appears to be that at every level of health care expenditure and age, Black patients have more coexisting conditions than White patients do but may access health care less frequently. Thus, the algorithm with an objective function that set out to predict health care utilization on the basis of previous costs did not recognize race-related disparities in health care needs. In the future, AI algorithms may be sufficiently sophisticated to avoid this sort of discrimination, but this example illustrates both the need for human experts in clinical practice and health care policy to explore the consequences of AI applications in these domains and the need to carefully specify objective functions for training and evaluation.
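A stylized simulation conveys the mechanism (plain NumPy; the groups, effect sizes, and access parameter are hypothetical and are not drawn from Obermeyer et al.): when two groups have the same distribution of health needs but one accesses care less often, ranking patients by cost rather than by need under-refers that group.

```python
# Illustrative sketch of the objective-function pitfall: ranking patients by a
# cost proxy rather than by true health need penalizes a group that incurs
# lower costs at the same level of need. All quantities are hypothetical.
import numpy as np

rng = np.random.default_rng(5)
n = 200000
group = rng.integers(0, 2, size=n)              # two groups with equal true need
need = rng.gamma(2.0, 1.0, size=n)              # true health need
access = np.where(group == 1, 0.6, 1.0)         # group 1 uses care less, per unit need
cost = need * access * rng.lognormal(0, 0.2, n) # observed costs understate group 1's need

top_cost = cost >= np.quantile(cost, 0.9)       # refer top decile by predicted cost
top_need = need >= np.quantile(need, 0.9)       # refer top decile by true need

print("group-1 share of population:        ", group.mean())            # about 0.5
print("group-1 share among top-cost decile:", group[top_cost].mean())  # well below 0.5
print("group-1 share among top-need decile:", group[top_need].mean())  # about 0.5
```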
Stability and Statistical Guarantees

Medical science is an iterative process of observation and hypothesis refinement with cycles of experimentation, analysis, and conjecture, leading to further experiments and ultimately toward a level of evidence that refutes existing theories and supports new therapies, lifestyle recommendations, or both. Analytic methods, including traditional statistical and AI algorithms, are used to enhance the efficiency of this scientific cycle. The context and consequences of decisions made on the basis of evidence reported in medical studies carry with them important implications for the health of patients.

To a large extent, the concern about preventing false positive results in conventional medical statistics centers on the potential clinical consequences of such results. For example, patients may be harmed by the licensing of a drug that has no benefit and may have adverse effects. In genetic analyses, falsely concluding that a chromosomal segment or a genetic variant is associated with a disease can lead to much wasted effort attempting to understand the causal association. For this reason, the field has insisted on high LOD (logarithm of the odds) scores for linkage and very small P values for an association in genomewide studies as evidence that the association is a priori likely to represent a true positive result. In contrast, if data are being analyzed to decide whether one should be shown a particular advertisement on a browser, even a small improvement over random assignment is an improvement, and a mistake imposes a financial penalty only on the advertiser.
This difference between statistical analysis in medicine and AI analysis has consequences for the potential of AI to affect medical science, since most AI methods are designed outside of medicine and have evolved to improve performance in nonmedical domains (e.g., image classification of house numbers for mapping software56). In medical science, the stakes are higher, either because the conclusions may be used in the clinic or, at a minimum, false positive results will drain scientific resources and distract scientists. Trust in the robustness and stability of analyses and reporting is vital in order for the medical science community to proceed efficiently and safely. Stability refers to the end-to-end variability in the analysis, by persons skilled in the art of analysis, from project conception to end-user reporting or deployment. AI-enabled studies are increasing in complexity with the integration of multiple data techniques and data fusion. Thus, assessment of the end-to-end stability of the analysis that includes data engineering, as well as model choice, becomes vital.57,58
Methods that provide statistical guarantees for AI findings, such as in subgroup analysis in randomized trials59 or observational studies,60 can help. In the emerging area of machine learning operations, which combines machine learning, software development, and information technology operations, particular attention is paid to the importance of data engineering in the AI development cycle61 and the problem of “garbage in, garbage out,” which can affect automated machine learning in the absence of careful human intervention.
There are many examples of data analysis in medical science in which we undertake an “agnostic” analysis because a specific hypothesis does not exist, or if it does, it is global (e.g., some genetic variants among the very large number being tested are associated with the disease of interest). This obviously leads to a substantial multiplicity problem. Multiplicity can be controlled by using standard approaches such as the Bonferroni correction or explicitly using a Bayesian prior specification on hypotheses, but new AI approaches to graphical procedures for controlling multiplicity are being developed.62
n engl j med 389;13 nejm.org September 28, 2023 1217


The New England Journal of Medicine
Downloaded from nejm.org by MOHEMED SANOWFER on September 27, 2023. For personal use only. No other uses without permission.
Copyright © 2023 Massachusetts Medical Society. All rights reserved.
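A minimal sketch of split conformal prediction, one version of the conformal approach mentioned above (scikit-learn; the model, data, and 90% target coverage are illustrative assumptions, not a definitive implementation of reference 65):

```python
# Illustrative sketch of split conformal inference: a held-out calibration set
# turns any point predictor into prediction intervals with finite-sample,
# distribution-free coverage (here targeting 90%).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(3000, 5))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=3000)

train, calib, test = np.split(rng.permutation(3000), [1500, 2500])
model = RandomForestRegressor(random_state=0).fit(X[train], y[train])

# Conformity scores on the calibration set: absolute residuals.
scores = np.abs(y[calib] - model.predict(X[calib]))
q = np.quantile(scores, 0.9 * (1 + 1 / len(calib)))   # conformal quantile

pred = model.predict(X[test])
covered = np.mean(np.abs(y[test] - pred) <= q)
print(f"interval = prediction +/- {q:.2f}; empirical coverage = {covered:.3f}")
```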
Statistical Sense and the Art of Statistics

Much of the art of applied statistics and the skills of a trained statistician or epidemiologist involve factors that lie outside the data and, hence, cannot be captured by data-driven AI algorithms alone. These factors include careful design of experiments, an understanding of the research question and study objectives, and tailoring of models to the research question in the context of an existing knowledge base, with ascertainment and selection bias accounted for and a healthy suspicion of results that look too good to be true, followed by careful model checking. Bringing these skills to bear on AI-enabled studies through “human-in-the-loop” development (in which AI supports and assists expert human judgment) will enhance the effect and uptake of AI methods and highlight methodologic and theoretical gaps to be addressed for the benefit of medical science. AI has much to bring to medical science. Statisticians should embrace AI, and in response, the field of AI will benefit from increased statistical thinking.

Disclosure forms provided by the authors are available with the full text of this article at NEJM.org.

References

1. Gorroochurn P. Classic topics on the history of modern mathematical statistics: from Laplace to more recent times. Medford, MA: Wiley, 2016.
2. Wasserman L. All of statistics. New York: Springer, 2013.
3. Looking back on the millennium in medicine. N Engl J Med 2000;342:42-9.
4. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med 2019;380:1347-58.
5. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25:44-56.
6. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med 2022;28:31-8.
7. Parikh RB, Obermeyer Z, Navathe AS. Regulation of predictive analytics in medicine. Science 2019;363:810-2.
8. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med 2019;25:30-6.
9. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 2013;35:1798-828.
10. Schölkopf B, Locatello F, Bauer S, et al. Toward causal representation learning. Proc IEEE 2021;109:612-34.
11. Choi E, Bahadori MT, Searles E, et al. Multi-layer representation learning for medical concepts. In: Proceedings and abstracts of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 13–17, 2016. San Francisco: Special Interest Group on Knowledge Discovery and Data Mining, 2016.
12. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436-44.
13. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, MA: MIT Press, 2016.
14. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med 2019;17:195.
15. Vollmer S, Mateen BA, Bohner G, et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ 2020;368:l6927.
16. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023;388:1233-9.
17. Agrawal A, Gans J, Goldfarb A. Prediction machines: the simple economics of artificial intelligence. Cambridge, MA: Harvard Business Press, 2018.
18. Goldstein BA, Navar AM, Carter RE. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges. Eur Heart J 2017;38:1805-14.
19. Bzdok D, Engemann D, Thirion B. Inference and prediction diverge in biomedicine. Patterns (N Y) 2020;1:100119.
20. Cox DR. Statistical modeling: the two cultures. Stat Sci 2001;16:216-8.
21. Bzdok D, Ioannidis JPA. Exploration, inference, and prediction in neuroscience and biomedicine. Trends Neurosci 2019;42:251-62.
22. Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 2020;369:m1328.
23. Rossman H, Segal E. Nowcasting the spread of SARS-CoV-2. Nat Microbiol 2022;7:16-7.
24. Oren O, Gersh BJ, Bhatt DL. Artificial intelligence in medical imaging: switching from radiographic pathological data to clinically meaningful endpoints. Lancet Digit Health 2020;2(9):e486-e488.
25. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet 2019;393:1577-9.
26. Sammut S-J, Crispin-Ortuzar M, Chin S-F, et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 2022;601:623-9.
27. Rubin DB. Bayesian inference for causal effects: the role of randomization. Ann Stat 1978;6:34-58.
28. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. New York: Taylor and Francis, 2014.
29. Fong E, Holmes C, Walker SG. Martingale posterior distributions. November 22, 2021 (https://doi.org/10.48550/arXiv.2103.15671). Preprint.
30. Peters J, Janzing D, Schölkopf B. Elements of causal inference: foundations and learning algorithms. Cambridge, MA: MIT Press, 2017.
31. Van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011.
32. Blakely T, Lynch J, Simons K, Bentley R, Rose S. Reflection on modern methods: when worlds collide-prediction, machine learning and causal inference. Int J Epidemiol 2021;49:2058-64.
33. Sanchez P, Voisey JP, Xia T, Watson HI, O’Neil AQ, Tsaftaris SA. Causal machine learning for healthcare and precision medicine. R Soc Open Sci 2022;9:220638.
34. Castro DC, Walker I, Glocker B. Causality matters in medical imaging. Nat Commun 2020;11:3673.
35. Richens JG, Lee CM, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nat Commun 2020;11:3923.
36. Clivio O, Falck F, Lehmann B, Deligiannidis G, Holmes C. Neural score matching for high-dimensional causal inference. Presented at the 25th International Conference on Artificial Intelligence and Statistics, virtual, March 28–30, 2022. Poster (https://virtual.aistats.org/virtual/2022/poster/3447).
37. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577:89-94.
38. Haibe-Kains B, Adam GA, Hosny A, et al. Transparency and reproducibility in artificial intelligence. Nature 2020;586(7829):E14-E16.
39. Breiman L. Random forests. Mach Learn 2001;45:5-32.
40. Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Ann Appl Stat 2010;4:266-98.
41. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings and abstracts of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 13–17, 2016. San Francisco: Special Interest Group on Knowledge Discovery and Data Mining, 2016.
42. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol 1996;58:267-88.
43. Harrington D, D’Agostino RB Sr, Gatsonis C, et al. New guidelines for statistical reporting in the Journal. N Engl J Med 2019;381:285-6.
44. Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: the STARD-AI Steering Group. Nat Med 2020;26:807-8.
45. CONSORT-AI and SPIRIT-AI Steering Group. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nat Med 2019;25:1467-8.
46. Norgeot B, Quer G, Beaulieu-Jones BK, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med 2020;26:1320-4.
47. DECIDE-AI Steering Group. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat Med 2021;27:186-7.
48. Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020;368:m689.
49. Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med 2021;27:2176-82.
50. Parikh RB, Teeple S, Navathe AS. Addressing bias in artificial intelligence in health care. JAMA 2019;322:2377-8.
51. Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A survey on bias and fairness in machine learning. ACM Comput Surv 2021;54:1-35.
52. Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical machine learning in healthcare. Annu Rev Biomed Data Sci 2021;4:123-44.
53. Tjoa E, Guan C. A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Trans Neural Netw Learn Syst 2021;32:4793-813.
54. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 2019;1:206-15.
55. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366:447-53.
56. Deng L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process Mag 2012;29:141-2.
57. Yu B, Kumbier K. Veridical data science. Proc Natl Acad Sci U S A 2020;117:3920-9.
58. Gelman A, Loken E. The statistical crisis in science: data-dependent analysis — a “garden of forking paths” — explains why many statistically significant comparisons don’t hold up. Am Sci 2014;102:460 (https://www.americanscientist.org/article/the-statistical-crisis-in-science).
59. Watson JA, Holmes CC. Machine learning analysis plans for randomised controlled trials: detecting treatment effect heterogeneity with strict control of type I error. Trials 2020;21:156.
60. Nie X, Wager S. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 2021;108:299-319.
61. Treveil M, Omont N, Stenac C, et al. Introducing MLOps. Newton, MA: O’Reilly, 2020.
62. Zhan T, Hartford A, Kang J, Offen W. Optimizing graphical procedures for multiplicity control in a confirmatory clinical trial via deep learning. Stat Biopharm Res 2022;14:92-102.
63. Cox DR. Some problems connected with statistical inference. Ann Math Stat 1958;29:357-72.
64. Stone M. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Series B Stat Methodol 1974;36:111-47.
65. Lei J, G’Sell M, Rinaldo A, Tibshirani RJ, Wasserman L. Distribution-free predictive inference for regression. J Am Stat Assoc 2018;113:1094-111.

Copyright © 2023 Massachusetts Medical Society.