Review Article
AI in Medicine
Jeffrey M. Drazen, M.D., Editor, Isaac S. Kohane, M.D., Ph.D., Guest Editor,
and Tze‑Yun Leong, Ph.D., Guest Editor
From the Nuffield Department of Population Health (D.J.H.) and the Department of Statistics and Nuffield Department of Medicine (C.H.), University of Oxford, Oxford, and the Alan Turing Institute, London (C.H.) — both in the United Kingdom. Dr. Holmes can be contacted at chris.holmes@stats.ox.ac.uk or at the Department of Statistics, University of Oxford, 24-29 St. Giles', Oxford OX1 3LB, United Kingdom.

N Engl J Med 2023;389:1211-9. DOI: 10.1056/NEJMra2212850. Copyright © 2023 Massachusetts Medical Society.

Statistics emerged as a distinct discipline around the beginning of the 20th century. During this time, fundamental concepts were developed, including the use of randomization in clinical trials, hypothesis testing, likelihood-based inference, P values, and Bayesian analysis and decision theory.1,2 Statistics rapidly became an essential element of the applied sciences, so much so that in 2000, the editors of the Journal cited "Application of Statistics to Medicine" as one of the 11 most important developments in medical science over the previous 1000 years.3 Statistics concerns reasoning with incomplete information and the rigorous interpretation and communication of scientific findings from data. Statistics includes determination of the optimal design of experiments and accurate quantification of uncertainty regarding conclusions and inferential statements from data analysis, expressed through the language of probability.
In the 21st century, artificial intelligence (AI) has emerged as a valuable approach in data science and a growing influence in medical research,4-6 with an accelerating pace of innovation. This development is driven, in part, by the enormous expansion in computer power and data availability. However, the very features that make AI such a valuable additional tool for data analysis are the same ones that make it vulnerable from a statistical perspective. This paradox is particularly pertinent for medical science. Techniques that are adequate for targeted advertising to voters and consumers or that enhance weather prediction may not meet the rigorous demands of risk prediction or diagnosis in medicine.7,8 In this review article, we discuss the statistical challenges in applying AI to biomedical data analysis and the delicate balance that researchers face in wishing to learn as much as possible from data while ensuring that data-driven conclusions are accurate, robust, and reproducible.

We begin by highlighting a distinguishing feature of AI that makes it such a powerful approach while at the same time making it statistically vulnerable. We then explore three challenges at the interface of statistics and AI that are of particular relevance to medical studies: population inference versus prediction, generalizability and interpretation of evidence, and stability and statistical guarantees. We focus on issues of data analysis and interpretation of findings. Space constraints preclude a discussion of the important area of AI and experimental design or a deep dive into generative AI and medical chatbots; however, we comment on this emerging area briefly.
…clude in a regression model — as well as any transformation or standardization of measurements. Semiautomated data-reduction techniques such as random forests and forward- or backward-selection stepwise regression have assisted statisticians in this hands-on selection for decades. Modeling assumptions and features are typically explicit, and the dimensionality of the model, as quantified by the number of parameters, is usually known. Although this approach uses expert judgment to provide high-quality manual analysis, it has two potential deficiencies. First, it cannot be scaled to very large data sets — for instance, millions of images. Second, the assumption is that the statistician either knows or is able to search for the most appropriate set of features or measurements to include in the analysis (Fig. 1A).

Arguably the most impressive and distinguishing aspect of AI is its automated ability to search and extract arbitrary, complex, task-oriented features from data — so-called feature representation learning.9-11 Features are algorithmically engineered from data during a training phase in order to uncover data transformations that are correct for the learning task. Optimality is measured by means of an "objective function" quantifying how well the AI model is performing the task at hand. AI algorithms largely remove the need for analysts to prespecify features for prediction or manually curate transformations of variables. These attributes are particularly beneficial in large, complex data domains such as image analysis, genomics, or modeling of electronic health records. AI models can search through potentially billions of nonlinear covariate transformations to reduce a large number of variables to a smaller set of task-adapted features. Moreover, somewhat paradoxically, increasing the complexity of the AI model through additional parameters, which occurs in deep learning, only helps the AI model in its search for richer internal feature sets, provided training methods are suitably tailored.12,13

The result is that the trained AI models can engineer data-adaptive features that are beyond the scope of features that humans can engineer, leading to impressive task performance. The problem is that such features can be hard to interpret, are brittle in the face of changing data, and lack common sense in the use of background knowledge and qualitative checks that statisticians bring to bear in deciding on a feature set to use in a model. AI models are often unable to trace the evidence line from data to features, making auditability and verification challenging. Thus, greater checks and balances are needed to ensure the validity and generalizability of AI-enabled scientific findings (Fig. 1B).14,15

The checking of AI-supported findings is particularly important in the emerging field of generative AI through self-supervised learning, such as large language models and medical science chatbots that may be used, among many applications, for medical note taking in electronic health records.16 Self-supervised learning by these foundation models involves vast quantities of undocumented training data and the use of broad objective functions to train the models with trillions of parameters (at the time of this writing). This is in contrast to the "supervised" learning with AI prediction models, such as deep learning classifiers, in which the training data are known and labeled according to the clinical outcome, and the training objective is clear and targeted to the particular prediction task at hand. Given the opaqueness of generative AI foundation models, additional caution is needed for their use in health applications.

Prediction versus Population Inference

AI is especially well suited to, and largely designed for, large-scale prediction tasks.17 This is true, in part, because with such tasks, the training objective for the model is clear, and the evaluation metric in terms of predictive accuracy is usually well characterized. Adaptive models and algorithms can capitalize on large quantities of annotated data to discover patterns in covariates that associate with outcomes of interest. A good example is predicting the risk of disease.18 However, the ultimate goal of most medical studies is not explicitly to predict risk but rather to gain an understanding of some biologic mechanism or cause of disease in the wider population or to assist in the development of new therapies.19,20 There is an evidence gap between a good predictive model that operates at the individual level and the ability to make inferential statements about the population.21 Statistics is mainly concerned with population inference tasks and the generalizability of evidence obtained from one…
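The idea that features are engineered during training, with optimality scored by an objective function, can be caricatured in a few lines. The following is a minimal, purely illustrative sketch: a toy "search" over a handful of candidate covariate transformations, each scored by the logistic log-loss it achieves on hypothetical data. The data, the outcome rule, and the candidate set are all invented for illustration; real feature representation learning searches a vastly larger, parameterized space by gradient methods rather than enumerating fixed transformations.

```python
import math

# Illustrative only: a toy automated "feature search" scored by an
# objective function (mean logistic log-loss). Hypothetical data:
# the outcome is 1 whenever |x| > 1, so no linear function of raw x
# separates the classes, but x^2 or |x| does.
xs = [-2.2, -1.8, -1.4, -1.1, -0.6, -0.3, 0.0, 0.4, 0.7, 1.2, 1.6, 2.1]
ys = [1 if abs(x) > 1.0 else 0 for x in xs]

candidates = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "abs": lambda x: abs(x),
    "sine": lambda x: math.sin(x),
}

def log_loss_after_fit(feature, steps=2000, lr=0.5):
    """Fit logistic regression on one derived feature by gradient
    descent; return the final mean log-loss (the objective)."""
    zs = [feature(x) for x in xs]
    n = len(zs)
    a = b = 0.0
    for _ in range(steps):
        ps = [1.0 / (1.0 + math.exp(-(a * z + b))) for z in zs]
        a -= lr * sum((p - y) * z for p, y, z in zip(ps, ys, zs)) / n
        b -= lr * sum(p - y for p, y in zip(ps, ys)) / n
    ps = [1.0 / (1.0 + math.exp(-(a * z + b))) for z in zs]
    eps = 1e-12
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(ys, ps)) / n

losses = {name: f_loss for name, f_loss in
          ((name, log_loss_after_fit(f)) for name, f in candidates.items())}
best = min(losses, key=losses.get)  # the objective function picks the feature
```

The objective function, not the analyst, selects the transformation: here the quadratic or absolute-value feature wins because it makes the toy outcome linearly separable, whereas the raw covariate does not. Deep learning performs the same selection implicitly, over billions of candidate nonlinear transformations at once.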
[Figure 1. Panel A: Statistical Model. Statisticians, clinicians, and epidemiologists; risk prediction (e.g., risk of diabetes and other coexisting conditions). Panel B: AI Model.]
…study, McKinney et al. showed the high potential of AI for breast cancer screening. However, the lack of details of the methods and algorithm code undermines its scientific value."38 The use of traditional statistical prediction methods alongside interpretable AI methods can contribute to an understanding of the prediction signal and can mitigate nonsensical associations. The…

[Table 1. Similarities and Differences between Artificial Intelligence and Conventional Statistics.]
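One common way to probe the prediction signal of a black-box model, complementing the interpretable methods mentioned above, is permutation importance: shuffle one covariate and measure how much predictive accuracy drops. The sketch below is purely illustrative — the data, the 1-nearest-neighbour "black box", and the simple whole-data-set permutation variant are all assumptions for the example, not methods described in this article.

```python
import random

# Illustrative only: probe a black-box classifier with permutation
# importance. Hypothetical data: feature 0 carries the class signal,
# feature 1 is pure noise.
rng = random.Random(0)
X, y = [], []
for label in (0, 1):
    for _ in range(20):
        X.append([label * 3.0 + rng.gauss(0.0, 0.3), rng.uniform(0.0, 1.0)])
        y.append(label)

def predict_1nn(train_X, train_y, x):
    """Black-box stand-in: 1-nearest-neighbour by squared distance."""
    i = min(range(len(train_X)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(train_X[j], x)))
    return train_y[i]

def loo_accuracy(data_X, data_y):
    """Leave-one-out accuracy of the 1-NN black box."""
    hits = 0
    for i in range(len(data_X)):
        train_X = data_X[:i] + data_X[i + 1:]
        train_y = data_y[:i] + data_y[i + 1:]
        hits += predict_1nn(train_X, train_y, data_X[i]) == data_y[i]
    return hits / len(data_X)

baseline = loo_accuracy(X, y)

def permutation_importance(feature):
    """Accuracy drop when one feature column is randomly shuffled
    (a simple variant that permutes the column across the whole set)."""
    col = [row[feature] for row in X]
    rng.shuffle(col)
    X_perm = [row[:feature] + [v] + row[feature + 1:]
              for row, v in zip(X, col)]
    return baseline - loo_accuracy(X_perm, y)

importances = [permutation_importance(f) for f in range(2)]
```

Shuffling the signal feature destroys most of the accuracy, while shuffling the noise feature barely matters; such probes help a reviewer check that a black-box prediction rests on sensible covariates rather than nonsensical associations.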
…disease can lead to much wasted effort attempting to understand the causal association. For this reason, the field has insisted on high LOD (logarithm of the odds) scores for linkage and very small P values for an association in genomewide studies as evidence that the association is a priori likely to represent a true positive result. In contrast, if data are being analyzed to decide whether one should be shown a particular advertisement on a browser, even a small improvement over random assignment is an improvement, and a mistake imposes a financial penalty only on the advertiser.

This difference between statistical analysis in medicine and AI analysis has consequences for the potential of AI to affect medical science, since most AI methods are designed outside of medicine and have evolved to improve performance in nonmedical domains (e.g., image classification of house numbers for mapping software56). In medical science, the stakes are higher, either because the conclusions may be used in the clinic or, at a minimum, false positive results will drain scientific resources and distract scientists. Trust in the robustness and stability of analyses and reporting is vital in order for the medical science community to proceed efficiently and safely. Stability refers to the end-to-end variability in the analysis, by persons skilled in the art of analysis, from project conception to end-user reporting or deployment. AI-enabled studies are increasing in complexity with the integration of multiple data techniques and data fusion. Thus, assessment of the end-to-end stability of the analysis that includes data engineering, as well as model choice, becomes vital.57,58

Methods that provide statistical guarantees for AI findings, such as in subgroup analysis in randomized trials59 or observational studies,60 can help. In the emerging area of machine learning operations, which combines machine learning, software development, and information technology operations, particular attention is paid to the importance of data engineering in the AI development cycle61 and the problem of "garbage in, garbage out," which can affect automated machine learning in the absence of careful human intervention.

There are many examples of data analysis in medical science in which we undertake an "agnostic" analysis because a specific hypothesis does not exist, or if it does, it is global (e.g., some genetic variants among the very large number being tested are associated with the disease of interest). This obviously leads to a substantial multiplicity problem. Multiplicity can be controlled by using standard approaches such as the Bonferroni correction or explicitly using a Bayesian prior specification on hypotheses, but new AI approaches to graphical procedures for controlling multiplicity are being developed.62 Another standard approach is to validate findings in an independent data set on the basis of whether the AI predictions are reproduced. Where such independent validation is not possible, we must resort to mimicking this approach by using in-sample partitioning. Dividing the data into two sets, one for discovery and one for validation, can provide statistical guarantees on discovery findings.63 More generally, multiple splits with the use of cross-validation can estimate future predictive risk,64 although statistical uncertainty in the predictive risk estimate is harder to assess. Emerging techniques in conformal inference look promising for quantifying uncertainty in prediction settings.65

Statistical Sense and the Art of Statistics

Much of the art of applied statistics and the skills of a trained statistician or epidemiologist involve factors that lie outside the data and, hence, cannot be captured by data-driven AI algorithms alone. These factors include careful design of experiments, an understanding of the research question and study objectives, and tailoring of models to the research question in the context of an existing knowledge base, with ascertainment and selection bias accounted for and a healthy suspicion of results that look too good to be true, followed by careful model checking. Bringing these skills to bear on AI-enabled studies through "human-in-the-loop" development (in which AI supports and assists expert human judgment) will enhance the effect and uptake of AI methods and highlight methodologic and theoretical gaps to be addressed for the benefit of medical science. AI has much to bring to medical science. Statisticians should embrace AI, and in response, the field of AI will benefit from increased statistical thinking.

Disclosure forms provided by the authors are available with the full text of this article at NEJM.org.
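The Bonferroni correction mentioned in the discussion of multiplicity is simple enough to show in full: with m hypotheses, each is tested at level alpha/m (equivalently, each P value is multiplied by m and compared with alpha), which bounds the probability of any false positive by alpha. A minimal sketch, using hypothetical P values invented for illustration:

```python
# Illustrative only: the Bonferroni correction controls the family-wise
# error rate by testing each of m hypotheses at level alpha / m.
# These P values are hypothetical, not taken from any study.
p_values = [0.0004, 0.003, 0.012, 0.04, 0.09, 0.21, 0.33, 0.47, 0.62, 0.88]
alpha = 0.05
m = len(p_values)

naive_hits = [p for p in p_values if p < alpha]           # no correction
bonferroni_hits = [p for p in p_values if p < alpha / m]  # level alpha / m

# Equivalent formulation: adjust each P value upward, then compare
# the adjusted values with the original alpha.
adjusted = [min(1.0, p * m) for p in p_values]
```

With these 10 hypothetical tests, the uncorrected threshold declares four findings while the Bonferroni threshold of 0.005 retains only two — the price paid, in power, for the strong guarantee against any false positive across the family of tests.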
References
1. Gorroochurn P. Classic topics on the history of modern mathematical statistics: from Laplace to more recent times. Medford, MA: Wiley, 2016.
2. Wasserman L. All of statistics. New York: Springer, 2013.
3. Looking back on the millennium in medicine. N Engl J Med 2000;342:42-9.
4. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med 2019;380:1347-58.
5. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 2019;25:44-56.
6. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med 2022;28:31-8.
7. Parikh RB, Obermeyer Z, Navathe AS. Regulation of predictive analytics in medicine. Science 2019;363:810-2.
8. He J, Baxter SL, Xu J, Xu J, Zhou X, Zhang K. The practical implementation of artificial intelligence technologies in medicine. Nat Med 2019;25:30-6.
9. Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 2013;35:1798-828.
10. Schölkopf B, Locatello F, Bauer S, et al. Toward causal representation learning. Proc IEEE 2021;109:612-34.
11. Choi E, Bahadori MT, Searles E, et al. Multi-layer representation learning for medical concepts. In: Proceedings and abstracts of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 13–17, 2016. San Francisco: Special Interest Group on Knowledge Discovery and Data Mining, 2016.
12. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436-44.
13. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, MA: MIT Press, 2016.
14. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med 2019;17:195.
15. Vollmer S, Mateen BA, Bohner G, et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ 2020;368:l6927.
16. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023;388:1233-9.
17. Agrawal A, Gans J, Goldfarb A. Prediction machines: the simple economics of artificial intelligence. Cambridge, MA: Harvard Business Press, 2018.
18. Goldstein BA, Navar AM, Carter RE. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges. Eur Heart J 2017;38:1805-14.
19. Bzdok D, Engemann D, Thirion B. Inference and prediction diverge in biomedicine. Patterns (N Y) 2020;1:100119.
20. Cox DR. Statistical modeling: the two cultures. Stat Sci 2001;16:216-8.
21. Bzdok D, Ioannidis JPA. Exploration, inference, and prediction in neuroscience and biomedicine. Trends Neurosci 2019;42:251-62.
22. Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 2020;369:m1328.
23. Rossman H, Segal E. Nowcasting the spread of SARS-CoV-2. Nat Microbiol 2022;7:16-7.
24. Oren O, Gersh BJ, Bhatt DL. Artificial intelligence in medical imaging: switching from radiographic pathological data to clinically meaningful endpoints. Lancet Digit Health 2020;2(9):e486-e488.
25. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet 2019;393:1577-9.
26. Sammut S-J, Crispin-Ortuzar M, Chin S-F, et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 2022;601:623-9.
27. Rubin DB. Bayesian inference for causal effects: the role of randomization. Ann Stat 1978;6:34-58.
28. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. New York: Taylor and Francis, 2014.
29. Fong E, Holmes C, Walker SG. Martingale posterior distributions. Preprint, November 22, 2021 (https://doi.org/10.48550/arXiv.2103.15671).
30. Peters J, Janzing D, Schölkopf B. Elements of causal inference: foundations and learning algorithms. Cambridge, MA: MIT Press, 2017.
31. Van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data. New York: Springer, 2011.
32. Blakely T, Lynch J, Simons K, Bentley R, Rose S. Reflection on modern methods: when worlds collide — prediction, machine learning and causal inference. Int J Epidemiol 2021;49:2058-64.
33. Sanchez P, Voisey JP, Xia T, Watson HI, O'Neil AQ, Tsaftaris SA. Causal machine learning for healthcare and precision medicine. R Soc Open Sci 2022;9:220638.
34. Castro DC, Walker I, Glocker B. Causality matters in medical imaging. Nat Commun 2020;11:3673.
35. Richens JG, Lee CM, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nat Commun 2020;11:3923.
36. Clivio O, Falck F, Lehmann B, Deligiannidis G, Holmes C. Neural score matching for high-dimensional causal inference. Presented at the 25th International Conference on Artificial Intelligence and Statistics, virtual, March 28–30, 2022. Poster (https://virtual.aistats.org/virtual/2022/poster/3447).
37. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577:89-94.
38. Haibe-Kains B, Adam GA, Hosny A, et al. Transparency and reproducibility in artificial intelligence. Nature 2020;586(7829):E14-E16.
39. Breiman L. Random forests. Mach Learn 2001;45:5-32.
40. Chipman HA, George EI, McCulloch RE. BART: Bayesian additive regression trees. Ann Appl Stat 2010;4:266-98.
41. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings and abstracts of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 13–17, 2016. San Francisco: Special Interest Group on Knowledge Discovery and Data Mining, 2016.
42. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol 1996;58:267-88.
43. Harrington D, D'Agostino RB Sr, Gatsonis C, et al. New guidelines for statistical reporting in the Journal. N Engl J Med 2019;381:285-6.
44. Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: the STARD-AI Steering Group. Nat Med 2020;26:807-8.
45. CONSORT-AI and SPIRIT-AI Steering Group. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nat Med 2019;25:1467-8.
46. Norgeot B, Quer G, Beaulieu-Jones BK, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med 2020;26:1320-4.
47. DECIDE-AI Steering Group. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat Med 2021;27:186-7.
48. Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ 2020;368:m689.
49. Seyyed-Kalantari L, Zhang H, McDermott MBA, Chen IY, Ghassemi M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat Med 2021;27:2176-82.
50. Parikh RB, Teeple S, Navathe AS. Addressing bias in artificial intelligence in health care. JAMA 2019;322:2377-8.
51. Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A survey on bias and fairness in machine learning. ACM Comput Surv 2021;54:1-35.
52. Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical machine learning in healthcare. Annu Rev Biomed Data Sci 2021;4:123-44.
53. Tjoa E, Guan C. A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Trans Neural Netw Learn Syst 2021;32:4793-813.
54. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 2019;1:206-15.
55. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019;366:447-53.
56. Deng L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process Mag 2012;29:141-2.
57. Yu B, Kumbier K. Veridical data science. Proc Natl Acad Sci U S A 2020;117:3920-9.
58. Gelman A, Loken E. The statistical crisis in science: data-dependent analysis — a "garden of forking paths" — explains why many statistically significant comparisons don't hold up. Am Sci 2014;102:460 (https://www.americanscientist.org/article/the-statistical-crisis-in-science).
59. Watson JA, Holmes CC. Machine learning analysis plans for randomised controlled trials: detecting treatment effect heterogeneity with strict control of type I error. Trials 2020;21:156.
60. Nie X, Wager S. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 2021;108:299-319.
61. Treveil M, Omont N, Stenac C, et al. Introducing MLOps. Newton, MA: O'Reilly, 2020.
62. Zhan T, Hartford A, Kang J, Offen W. Optimizing graphical procedures for multiplicity control in a confirmatory clinical trial via deep learning. Stat Biopharm Res 2022;14:92-102.
63. Cox DR. Some problems connected with statistical inference. Ann Math Stat 1958;29:357-72.
64. Stone M. Cross-validatory choice and assessment of statistical predictions. J R Stat Soc Series B Stat Methodol 1974;36:111-47.
65. Lei J, G'Sell M, Rinaldo A, Tibshirani RJ, Wasserman L. Distribution-free predictive inference for regression. J Am Stat Assoc 2018;113:1094-111.