Dembrower 2020
Summary
Background We examined the potential change in cancer detection when using an artificial intelligence (AI) cancer-detection software to triage certain screening examinations into a no radiologist work stream, and then after regular radiologist assessment of the remainder, triage certain screening examinations into an enhanced assessment work stream. The purpose of enhanced assessment was to simulate selection of women for more sensitive screening promoting early detection of cancers that would otherwise be diagnosed as interval cancers or as next-round screen-detected cancers. The aim of the study was to examine how AI could reduce radiologist workload and increase cancer detection.

Methods In this retrospective simulation study, all women diagnosed with breast cancer who attended two consecutive screening rounds were included. Healthy women were randomly sampled from the same cohort; their observations were given elevated weight to mimic a frequency of 0·7% incident cancer per screening interval. Based on the prediction score from a commercially available AI cancer detector, various cutoff points for the decision to channel women to the two new work streams were examined in terms of missed and additionally detected cancer.

Findings 7364 women were included in the study sample: 547 were diagnosed with breast cancer and 6817 were healthy controls. When including 60%, 70%, or 80% of women with the lowest AI scores in the no radiologist stream, the proportion of screen-detected cancers that would have been missed were 0, 0·3% (95% CI 0·0–4·3), or 2·6% (1·1–5·4), respectively. When including 1% or 5% of women with the highest AI scores in the enhanced assessment stream, the potential additional cancer detection was 24 (12%) or 53 (27%) of 200 subsequent interval cancers, respectively, and 48 (14%) or 121 (35%) of 347 next-round screen-detected cancers, respectively.

Interpretation Using a commercial AI cancer detector to triage mammograms into no radiologist assessment and enhanced assessment could potentially reduce radiologist workload by more than half, and pre-emptively detect a substantial proportion of cancers otherwise diagnosed later.

Funding Stockholm City Council.

Copyright © 2020 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.

Lancet Digital Health 2020; 2: e468–74

Department of Physiology and Pharmacology (K Dembrower MD, P Lindholm PhD), Department of Pathology and Oncology (M Salim MD, F Strand PhD), and Department of Medical Epidemiology and Biostatistics (M Eklund PhD), Karolinska Institute, Stockholm, Sweden; Department of Radiology, Capio Sankt Görans Hospital, Stockholm, Sweden (K Dembrower); Department of Medical Radiation Physics and Nuclear Medicine (E Wåhlin MSc), Department of Radiology (M Salim), and Breast Radiology (F Strand), Karolinska University Hospital, Stockholm, Sweden; and Department of Computational Science and Technology, KTH Royal Institute of Technology and Science for Life Laboratory, Stockholm, Sweden (Y Liu MSc, K Smith PhD)

Correspondence to: Dr Karin Dembrower, Capio Sankt Görans Hospital, 112 81 Stockholm, Sweden
karin.dembrower@ki.se
Research in context
Evidence before this study
Several artificial intelligence (AI) cancer-detection software algorithms have been developed for mammography. There is evidence that some software algorithms are now at a performance level comparable to radiologists in assessing mammograms, even though validation in a true screening cohort is still absent. In addition to assisting the radiologist in tumour detection, there are several potential roles for AI in the screening process, which have not yet been investigated fully.

Added value of this study
In this retrospective simulation study we show that a commercial AI cancer-detector algorithm could be used in triaging mammograms to decrease radiologist time spent on clearly negative mammograms, and use these resources for women at risk of having a false-negative screening. After AI algorithm scoring of mammograms, the lowest 60% could be triaged to a no radiologist work stream without missing any cancer that would otherwise have been screen detected. Then, after negative radiologist assessment of the remaining mammograms, the highest AI scores could be used to identify mammograms for an enhanced assessment work stream with a substantial enrichment of false-negative assessments.

Implications of all the available evidence
Commercial AI algorithms as independent readers of screening mammography assessment are now performing on a clinically relevant level. AI-based scoring can be used to reallocate radiologist time from clearly negative mammograms towards cases where cancer might go undetected. AI has the potential to promote early detection and thereby increase overall survival for breast cancer patients.
[...] study sample was derived from the case-control subset of CSAW and consisted of women from the Karolinska University Hospital uptake area examined between Feb 10, 2009, and Dec 10, 2015. Exclusions are described in figure 2. The main exclusion criteria were that healthy women must have had at least 2 years of follow-up, and must have participated in two consecutive screening rounds within 2·5 years to enable image analysis based on the previous screening mammogram. The final study sample consisted of 7364 women: 547 diagnosed women and 6817 healthy controls. The Regional Cancer Center Stockholm-Gotland reported that during the study time, the participation rate varied between 71% and 75%. Two radiologists assessed each mammography examination, and determined whether it was normal or whether there was a suspicious finding. If there was a suspicious finding by any of the radiologists, the examination went on to a consensus discussion, in which it was decided whether the woman would be recalled or not. During the study time, the recall rate varied between 2·0% and 2·6%. For this retrospective study, the research was approved by the Swedish ethical review board, which waived the need for individual informed consent.

Figure 2: Study population flow chart
Healthy: random sampling of 10 000 women
  3183 excluded
    1249 without two consecutive examinations
    995 with <2 year follow-up
    909 examined after Dec 31, 2015
    4 with unknown radiologists
    26 with implants
  6817 healthy women included in the final study sample
Breast cancer detected: 747 women diagnosed with breast cancer*
  200 excluded
    177 without two consecutive screening rounds
    22 with >2·5 years between last image and diagnosis
    1 excluded because mammogram was acquired after diagnosis
  547 women with breast cancer included in the final study sample
*All women who were diagnosed with breast cancer at Karolinska University Hospital between 2010 and 2015, within the screening age range of 40–74 years, with complete screening examination, without previous breast cancer and without implants.

Images
All included women had a complete four-view, full-field digital mammography examination. All mammograms were acquired on Hologic equipment (Hologic, Marlborough, MA, USA). Percent mammographic density was estimated by the publicly available LIBRA software, version 1.0.4, from the University of Pennsylvania (Philadelphia, PA, USA).15

Deep neural network
The AI cancer detector algorithm used for detection of tumour signs was sourced from a commercial vendor (Lunit, Seoul, South Korea), version 5.5.0.16 The vendor has stated that the algorithm was originally trained on 170 230 mammograms, from 36 468 women diagnosed with breast cancer and 133 762 healthy controls. The mammograms used in the original training came from five institutions: three from South Korea, one from the USA, and one from the UK.16 The mammograms were acquired on equipment from GE Healthcare, Hologic, and Siemens, and consisted of both screening and diagnostic mammograms. In the standard version, the AI cancer-detector software visually highlights areas in the mammograms where the suspicion of tumour is above a certain threshold. However, in this study we did not use the images, but instead used the underlying prediction score of the algorithm. The generated prediction score for tumour presence was a decimal number between 0 and 1, where 1 represented the highest level of suspicion. The AI score on examination level was defined by the maximum of the image-level prediction scores.

Statistical analysis
Since our study sample was enriched with positive cases, we applied an 11-times up-sampling of healthy women to mimic the ratio in the source screening cohort (approximately 0·7% incident cancer per screening interval).3 To examine the AI cancer detector in a rule-out role (no radiologist work stream), we determined the number and percentage of women diagnosed with screen-detected breast cancer for each decile of AI score to understand how many would be missed in relation to the population proportion included. For diagnosed women, we also defined the AI score separately for the ipsilateral and contralateral breast, to examine the association with the side on which the cancer was verified. We tested for statistical differences with the Student's t test. The analysis then focused entirely on women with negative screening examinations. The analysis was based on complete-case analysis.

To examine the AI cancer detector in a rule-in role (enhanced assessment work stream), the AI score of women with a negative examination after radiological assessment was divided into percentiles starting with the highest score. We examined alternative definitions of what proportion of the highest scores would be included in the enhanced assessment work stream: top 1%, 2%, 5%, 10%, 15%, or 20%. For each alternative, we determined the number of subsequent interval cancers and next-round, screen-detected cancers according to the recorded radiologist assessments. The potential additional cancer detection rate was calculated as the number of these subsequent cancers divided by the total number of women included in the work stream. The potential additional cancer detection rate for women whose AI score was not in the selected top proportion would be 0 because they would have no additional examination.

To explore alternative prediction models for the enhanced assessment work stream, we fitted logistic regression models for the traditional predictor mammographic density as well as for AI score. We estimated odds ratios (ORs) with 95% CI for both breasts, and calculated the area under the receiver-operating curve
(AUC). We examined the net effect on cancer detection by scenarios of combining various population proportions for the no radiologist work stream (fewer resources but potentially missing some cancers) and for the enhanced assessment work stream (more resources and potentially increasing cancer detection).

Role of the funding source
The funding source, Stockholm County Council, provided funds for the entire project but had no influence over how any aspect of the work was carried out. The funder of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Results
See Online for appendix
The study population is described in the appendix (p 2). 7364 women were included: 547 diagnosed with breast cancer, and 6817 healthy controls. 347 cancers were detected at screening and 200 were detected clinically as interval cancers. The median age at mammography was 53·6 years (IQR 15·4). The simulated screening population contained 75 534 women, resulting in 0·74% cancer incidence over one screening interval. The median and dispersion of AI scores are reported in the appendix (p 3).

We account for a scenario where the AI cancer detector assessed all screening examinations as a single reader without radiologists (ie, the no radiologist work stream) in table 1. We determined that the AI score did not miss any cancer that would otherwise have been screen detected for mammograms with the 60% lowest AI scores. For the 70%, 80%, and 90% lowest AI scores, there were one, nine, and 14 missed cancers, respectively, which corresponded to 0·3%, 2·6%, and 4·0%, respectively, of all screen-detected cancers in the population.

In the enhanced assessment work stream, we showed that among the top 1% of AI scores for women with negative mammograms after double reading, there were 24 (12%) interval cancers and 48 (14%) next-round screen-detected cancers (table 2). The results for the top 5% were 53 (27%) interval cancers and 121 (35%) next-round screen-detected cancers. Expressing the total potential detection, both interval cancers and screen-detected cancers, as a detection rate corresponded to 114 per 1000 women within the top 1% AI scores and 34 per 1000 women within the top 5% AI scores. The raw AI scores for the cutoff points for the two novel work streams are presented in the appendix (p 4).

Alternative prediction models for the enhanced assessment work stream are shown in table 3. The OR for predicting interval cancer was 2·01 (95% CI 1·98–2·18; AUC 0·74) and 1·59 (1·50–1·68; 0·67) for maximum AI score and mammographic density, respectively. The corresponding numbers for predicting next-round screen-detected cancer were 2·29 (2·22–2·38; 0·76) and 1·12 (1·06–1·18; 0·65) for maximum AI score and mammographic density, respectively. The OR was markedly higher for the breast containing the cancer than for the other breast when using the AI score, while the OR was similar between breasts for mammographic density.

The potential net change in cancer detection when using AI score to save resources (no radiologist work stream) and to increase potential cancer detection (enhanced assessment work stream) is shown in table 4, based on results from tables 1 and 2. If no radiologist resources were used for 90% of women with the lowest AI scores and were invested into doing MRI for the top 2% AI scores (that were negative after radiologist double reading of the mammograms), a net of 89 of 547 cancers would potentially have been detected up to 2 years earlier, corresponding to a detection rate of 59 cancers per 1000 supplemental screening examinations.

              n     Proportion (95% CI)
Lowest 10%    0     0 (NA)
Lowest 20%    0     0 (NA)
Lowest 30%    0     0 (NA)
Lowest 40%    0     0 (NA)
Lowest 50%    0     0 (NA)
Lowest 60%    0     0 (NA)
Lowest 70%    1     0·3% (0·0–4·3)
Lowest 80%    9     2·6% (1·1–5·4)
Lowest 90%    14    4·0% (2·1–6·9)
All           347   100·0% (NA)
AI computer-aided detection score shows the upper cut-point for the no radiologist work stream. n=74 987 healthy women and 547 cancer diagnoses. NA=not applicable.
Table 1: Number of screen-detected cancers that would be missed in the no radiologist work stream depending on the proportion of the population lowest scores included

                         Interval cancer   Next-round screen-detected   Cancer of both categories   Additional cancer
                         (n=200)           cancer (n=347)               (n=547)                     detection rate*
Highest 1% (n=633)       24 (12%)          48 (14%)                     72 (13%)                    114/1000
Highest 2% (n=1445)      32 (16%)          71 (21%)                     103 (19%)                   71/1000
Highest 5% (n=5073)      53 (27%)          121 (35%)                    174 (32%)                   34/1000
Highest 10% (n=8746)     73 (37%)          155 (45%)                    228 (42%)                   26/1000
Highest 15% (n=12 571)   86 (43%)          183 (53%)                    269 (49%)                   21/1000
Highest 20% (n=16 181)   100 (50%)         204 (59%)                    304 (56%)                   19/1000
All (n=75 534)           200 (100%)        347 (100%)                   547 (100%)                  7/1000
Data are n (%) or n/n. *The ratio was calculated with the total number of women in the population selected as the denominator.
Table 2: Potential detection of interval and next-round screen-detected cancer in the enhanced assessment work stream depending on the proportion of the population highest scores (after negative double-reading) included

Discussion
In this study we show that a commercial AI cancer-detector algorithm could be used as both a single reader to [...]
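As a check on the figures reported in the Results, the size of the simulated population and the additional cancer detection rates in table 2 can be recomputed from the published counts. The Python sketch below is illustrative only: the variable and function names are hypothetical, rounding to whole cancers per 1000 women is an assumption about how the rates were printed, and this is not the study's analysis code.

```python
# Counts taken from the Methods, Results, and table 2.
HEALTHY_SAMPLED = 6817     # healthy controls in the study sample
CANCERS = 547              # women diagnosed with breast cancer
UPSAMPLING_FACTOR = 11     # up-sampling applied to healthy women

simulated_healthy = HEALTHY_SAMPLED * UPSAMPLING_FACTOR
simulated_population = simulated_healthy + CANCERS

# Table 2 rows: (women in work stream, interval cancers,
#                next-round screen-detected cancers)
TABLE_2 = {
    "Highest 1%": (633, 24, 48),
    "Highest 2%": (1445, 32, 71),
    "Highest 5%": (5073, 53, 121),
    "Highest 10%": (8746, 73, 155),
    "Highest 15%": (12571, 86, 183),
    "Highest 20%": (16181, 100, 204),
    "All": (75534, 200, 347),
}


def rate_per_1000(cancers: int, women: int) -> int:
    """Subsequent cancers per 1000 women in the work stream (assumed rounding)."""
    return round(1000 * cancers / women)


rates = {
    label: rate_per_1000(interval + next_round, women)
    for label, (women, interval, next_round) in TABLE_2.items()
}

print(simulated_healthy, simulated_population)
for label, rate in rates.items():
    print(f"{label}: {rate}/1000")
```

Under these assumptions the recomputed values agree with the 74 987 healthy women and 75 534 women in total stated in the text, and with the rate column of table 2.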
Table 3: AI score and mammographic percent density as alternative predictors for triaging into the enhanced assessment work stream detecting subsequent interval cancers and next-round screen-detected cancers

[...] screening (28% of all cancers).3 Whether choosing a 60% or 90% population threshold, a massive reduction in radiologist workload would result, and AI as an independent rule-out reader has great potential.
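The trade-off between the two work streams can be made concrete with the counts from tables 1 and 2. The Python sketch below computes the share of screen-detected cancers missed at each rule-out threshold and the net number of cancers detected earlier, as summarised in table 4. The names are hypothetical. The denominator for the net rate is an assumption: the text does not state it, but using the nominal top share of the 75 534-woman simulated population reproduces the printed rates.

```python
SCREEN_DETECTED = 347   # screen-detected cancers in the population
POPULATION = 75534      # simulated screening population

# Table 1: screen-detected cancers missed, keyed by the percentage of
# lowest AI scores sent to the no radiologist work stream.
MISSED = {60: 0, 70: 1, 80: 9, 90: 14}

# Table 2: interval plus next-round screen-detected cancers, keyed by the
# percentage of highest AI scores sent to enhanced assessment.
DETECTED_EARLIER = {1: 72, 2: 103, 5: 174, 10: 228}


def missed_percent(bottom: int) -> float:
    """Percentage of screen-detected cancers missed by the rule-out stream."""
    return round(100 * MISSED[bottom] / SCREEN_DETECTED, 1)


def net_change(bottom: int, top: int) -> tuple[int, int]:
    """Net cancers detected earlier and the rate per 1000 examinations.

    Assumption: the denominator is the nominal top share of the whole
    simulated population, not the per-row n given in table 2.
    """
    net = DETECTED_EARLIER[top] - MISSED[bottom]
    examinations = POPULATION * top / 100
    return net, round(1000 * net / examinations)


for bottom in MISSED:
    cells = [net_change(bottom, top) for top in DETECTED_EARLIER]
    print(f"Bottom {bottom}%: missed {missed_percent(bottom)}%, net {cells}")
```

For example, `net_change(90, 2)` gives a net of 89 cancers at 59 per 1000 examinations, the scenario highlighted in the Results.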
                      Top 1% AI score   Top 2% AI score   Top 5% AI score   Top 10% AI score
Bottom 60% AI score   72 (95/1000)      103 (68/1000)     174 (46/1000)     228 (30/1000)
Bottom 70% AI score   71 (94/1000)      102 (68/1000)     173 (46/1000)     227 (30/1000)
Bottom 80% AI score   63 (83/1000)      94 (62/1000)      165 (44/1000)     219 (29/1000)
Bottom 90% AI score   58 (77/1000)      89 (59/1000)      160 (42/1000)     214 (28/1000)
Data are n (detection rate). n=75 534 women, of which 547 were diagnosed with breast cancer. Net number of additional cancers (detection rate per 1000 examinations) calculated by proportion of women in the no radiologist work stream and proportion of women in the enhanced assessment work stream. AI=artificial intelligence.
Table 4: Number of cancers detected earlier by the enhanced assessment work stream subtracted by screen-detected cancers missed in the no radiologist work stream

[...] mammogram. For each threshold of assigning enhanced assessment, we found the relative reduction of subsequent interval cancers and of next-round screen-detected cancers was of similar magnitude. If the examined method is implemented clinically, the reduction in interval cancer would most likely be apparent by a continuously lower number of interval cancers. However, earlier detection of next-round screen-detected cancers would mainly be apparent by an increased number of screen-detected cancers at the first screening. In the continuation, the number of screen-detected cancers would not be reduced, but a stage shift towards smaller cancer could be expected. We found that the accuracy in predicting future interval cancer and next-round screen-detected cancer was markedly higher for AI computer-aided detection score than for mammographic density. Mammographic density has previously been established as a strong risk factor for interval cancer.18,19 In many US states, legislation requires that women with high mammographic density should be informed that they are at risk of reduced mammographic sensitivity, and should discuss supplemental screening with their health-care provider.20 Kerlikowske and colleagues21 argue that it would not be acceptable to leave women with an interval cancer rate above 1 per 1000 without further examination. They found that when combining density with a traditional breast cancer risk model, there were 35 interval cancer cases among the 24 294 women identified by their model in a simulated screening cohort of 100 000 women. However, their additional cancer detection rate of 1·4 per 1000 examinations is much below the 6·2 per 1000 examinations that would result from preventing all interval cancers for the 20% of women with the highest AI scores
in our study. The US study and our study differ in at least two aspects: in the USA, screening is mostly annual, not biennial as in our study, and the interval cancer rate is around 13% according to the US Breast Cancer Surveillance Consortium.22

The AI algorithm and mammographic density might capture complementary explanatory factors for interval cancer. We speculate that the AI algorithm finds subtle tumour markers that were previously unidentified, while mammographic density is most likely associated with the risk of masking, or obstruction, of tumour signs. A useful feature of the AI cancer detector is that the software produces an image with a marker for the localisation of the suspicious finding in the mammography image. Therefore, one could consider routing women with a high AI score to targeted ultrasound examination, guided by the localisation shown on the AI cancer-detector software. Based on previous studies using MRI-guided localisation for second-look ultrasound, there is reason to believe that many cancers could be detected.23 In a clinical setting, there are many considerations (eg, local availability, radiological expertise, and economic considerations) to take into account when deciding whether the supplemental method should include one or all of MRI, contrast-enhanced mammography, or AI cancer-detector-guided ultrasound.

The resources saved by the no radiologist triage consist of the radiologist assessments and discussions of screening mammograms, while the resources expended for the enhanced assessment triaging include doing the ultrasound or MRI examination and following assessments, discussions, and further work-up. The number of averted radiologist assessments required to finance one MRI examination varies by country and setting. However, even if it were necessary to save 90% of the mammography assessments to finance MRI examinations for 1% of the population, the net change in cancer detection is still positive by a wide margin.

On a population level, the suggested balanced strategy would most probably result in a marked increase in cancers detected early. In a previous study24 we showed that the women generally seemed to have positive attitudes towards using a computer program to assess mammograms and to triage for MRI screening. However, a few individual women might end up with clinically detected cancer that, in hindsight, could have been detected by a radiologist on the previous screening mammogram. This could be a starting point for an important conversation between policy makers and screening participants.

An important strength of this study is that the AI algorithm is commercially available, and has never previously been exposed to images from our department or our equipment. Additionally, the cohort of women comes from a population-based screening population. A weakness of our study is that we did not study the full screening cohort, but instead used a case-control design to improve computing efficiency. A limitation of our study was our requirement that all women must have had a previous mammogram not more than 30 months before diagnosis, which consequently affected the proportion of interval cancer (28% before and 37% after these exclusions). A second limitation was that we were not informed of the location of the radiological findings and could therefore not examine whether the AI algorithm finding was at the same location as where cancer was later found. Another limitation was that all the women were from Sweden, and the results could differ in a population with a different ethnic and geographical composition. Additionally, our programme was based on biennial screening, and results from an annual screening programme could be different. Results in a clinical setting could differ from our study, for example if radiologist performance is affected by knowing that there is an AI algorithm potentially detecting missed cancer signs. A final limitation was that the specific cutoff points for the AI algorithm were derived in our setting, using our equipment and acquisition settings for the mammograms.

In conclusion, our retrospective study shows that using a commercial AI cancer detector to triage mammograms into no radiologist assessment and enhanced assessment could potentially reduce radiologist workload by more than half, and pre-emptively detect a substantial proportion of cancers otherwise diagnosed later. Retrospective trials in other settings and a prospective trial would be needed to validate our findings.

Contributors
All authors have contributed to different parts of the Article. KD searched the literature, collected data, did analysis, interpreted data, and wrote the Article. EW, YL, and MS collected data. KS interpreted data. PL interpreted data and collected data. ME designed the study, interpreted data, and wrote the Article. FS searched the literature, collected data, did analysis, prepared figures, interpreted data, and designed the study. In addition, all authors were involved in drafting the work or revising it critically for important intellectual content, or in the final approval of the version submitted for publication.

Declaration of interests
FS declares receiving consulting fees from Collective Minds Radiology, unrelated to this Article. All other authors declare no competing interests.

Data sharing
All data collected for the study cannot be made publicly available due to Swedish and European regulations, and permission from the original information owner for the use of data in research. However, contact the last author (FS; firstname.lastname@ki.se) for academic inquiries into the possibility of applying for access to de-identified data through a Data Transfer Agreement procedure, which will then require permission from the head of the department at Karolinska Institute. The examined AI algorithm is a commercially available third-party product that we have no rights to share.

Acknowledgments
The funding source (Stockholm City Council, grant number 20170802) provided funds for the entire project, but had neither an influence over how any aspect of the work was carried out, nor any impact on the transparency of the Article. We were allowed to use the AI algorithm free of charge by Lunit, South Korea; the company had no influence over the research question, nor any other aspect of the work carried out, nor any impact on the transparency of the Article.