
ORIGINAL RESEARCH • BREAST IMAGING

AI-based Strategies to Reduce Workload in Breast Cancer Screening with Mammography and Tomosynthesis: A Retrospective Evaluation
José Luis Raya-Povedano, MD  •  Sara Romero-Martín, PhD, MD  •  Esperanza Elías-Cabot, MD  • 
Albert Gubern-Mérida, PhD  •  Alejandro Rodríguez-Ruiz, PhD  •  Marina Álvarez-Benito, PhD, MD
From the Breast Cancer Unit, Department of Radiology, Hospital Universitario Reina Sofía, Av Menéndez Pidal s/n, Córdoba 14004, Spain (J.L.R.P., S.R.M., E.E.C.,
M.Á.B.); Maimonides Institute for Biomedical Research of Córdoba, Córdoba, Spain (J.L.R.P., S.R.M., E.E.C., M.Á.B.); and Department of Clinical Science, ScreenPoint
Medical, Nijmegen, the Netherlands (A.G.M., A.R.R.). Received August 31, 2020; revision requested October 23; revision received January 5, 2021; accepted January 14.
Address correspondence to J.L.R.P. (e-mail: joseluisrayapovedano@gmail.com).
The study was funded by the Hospital Universitario Reina Sofía in Córdoba, Spain.

Conflicts of interest are listed at the end of this article.

Radiology 2021; 300:57–65  • https://doi.org/10.1148/radiol.2021203555 •  Content codes:

Background:  The workflow of breast cancer screening programs could be improved given the high workload and the high number of
false-positive and false-negative assessments.

Purpose:  To evaluate if using an artificial intelligence (AI) system could reduce workload without reducing cancer detection in
breast cancer screening with digital mammography (DM) or digital breast tomosynthesis (DBT).

Materials and Methods:  Consecutive screening-paired and independently read DM and DBT images acquired from January 2015 to
December 2016 were retrospectively collected from the Córdoba Tomosynthesis Screening Trial. The original reading settings were
single or double reading of DM or DBT images. An AI system computed a cancer risk score for DM and DBT examinations
independently. Each original setting was compared with a simulated autonomous AI triaging strategy (the least suspicious examinations
for AI are not human-read; the rest are read in the same setting as the original, and examinations not recalled by radiologists but
graded as very suspicious by AI are recalled) in terms of workload, sensitivity, and recall rate. The McNemar test with Bonferroni
correction was used for statistical analysis.

Results:  A total of 15 987 DM and DBT examinations (which included 98 screening-detected and 15 interval cancers) from 15 986
women (mean age ± standard deviation, 58 years ± 6) were evaluated. In comparison with double reading of DBT images (568
hours needed, 92 of 113 cancers detected, 706 recalls in 15 987 examinations), AI with DBT would result in 72.5% less workload
(P < .001, 156 hours needed), noninferior sensitivity (95 of 113 cancers detected, P = .38), and 16.7% lower recall rate (P < .001,
588 recalls in 15 987 examinations). Similar results were obtained for AI with DM. In comparison with the original double reading
of DM images (222 hours needed, 76 of 113 cancers detected, 807 recalls in 15 987 examinations), AI with DBT would result in
29.7% less workload (P < .001), 25.0% higher sensitivity (P < .001), and 27.1% lower recall rate (P < .001).

Conclusion:  Digital mammography and digital breast tomosynthesis screening strategies based on artificial intelligence systems could
reduce workload up to 70%.
Published under a CC BY 4.0 license.

Earlier detection by means of mammography-based screening results in a 20%–35% reduction in breast cancer mortality (1,2). Consequently, breast cancer screening programs have been established in many countries to diagnose the disease as early as possible.

However, mammography-based screening programs are subject to some limitations. First, mammography sensitivity is lower with higher breast density (3). This leads to up to 20%–30% of breast cancers not being detected during screening and later manifesting symptomatically as interval cancers (4). Second, it is estimated that at least one out of three women participating in screening will have a false-positive recall during her lifetime (5), which not only adds harm to women but also increases the cost and workload of health care systems.

In general, the screening workload is high. The vast majority of mammograms in asymptomatic women will have a normal outcome, and no further action will be taken (6). Double reading detects between 9% and 20% more breast cancers, but it also results in more false-positive recalls and adds extra reading workload to screening (4).

More recently, digital breast tomosynthesis (DBT) has been shown to improve breast cancer screening detection rates by 30%–90% compared with digital mammography (DM), with a diverse impact on recall rate (7–10). Nevertheless, reading a DBT examination approximately doubles reading time for radiologists, which could be a barrier for implementing DBT as a screening modality in some settings (7,10–13).

In recent years, deep learning–based artificial intelligence (AI) systems have been quickly evolving in the field of breast imaging, surpassing the performance and clinical value of traditional computer-aided detection systems for mammography (14). Some of these systems can automatically detect breast cancer in two-dimensional mammograms and DBT images, with a performance level comparable with that of radiologists (15–17).

Abbreviations
AI = artificial intelligence, DBT = digital breast tomosynthesis, DM = digital mammography

Summary
Digital mammography and digital breast tomosynthesis screening strategies based on artificial intelligence systems could reduce workload up to 70% without reducing sensitivity by 5% or more.

Key Results
■ In a retrospective study of 15 987 mammograms, artificial intelligence (AI) reduced screening workload up to 70% for both digital mammography (DM)– or digital breast tomosynthesis (DBT)–based screening programs without reducing sensitivity by 5% or more.
■ Using AI to transition from DM screening to DBT screening would yield a reduction of 30% in workload, a 25% improvement in sensitivity, and a reduction of 27% in recall rate.

Some studies have investigated whether AI systems can be used in screening programs to reduce radiologists’ workload without negatively affecting the quality of outcomes (18–20). However, these are limited and have only investigated the use of AI to reduce workload in DM-based screening programs.

In this study, we retrospectively evaluate how AI could be used to reduce workload without reducing cancer detection in different screening settings, whether the screening is based on single or double reading of DM or DBT images.

Materials and Methods
This retrospective study was compliant with the Health Insurance Portability and Accountability Act. The study included anonymized and retrospectively collected screening examinations. Women were included from a single institution. The retrospective use of these anonymized data was approved by our hospital’s institutional review board, and the requirement for informed consent was waived. The study was not financially supported by any grant or company. ScreenPoint Medical provided the software for the study. The authors who were not employees of or consultants for ScreenPoint Medical had control of the data and information submitted for publication at all times.

Study Population
The data for this study were retrospectively collected from the Córdoba Tomosynthesis Screening Trial (21), a prospective screening trial that collected consecutive examinations in 16 067 women (one woman had a bilateral breast cancer and had two different examinations included in the original trial) who were screened with both two-view DM and two-view DBT between January 2015 and December 2016. The images were acquired with a Selenia Dimensions device (Hologic). This paired trial compared the screening performance of DM alone versus that of DBT (added to DM or with synthetic mammography) in terms of recall rate and cancer detection rate. These data only overlap with those reported in the original publication of the trial (21).

Age, breast density, histopathologic results of biopsy procedures, and interval cancer diagnosis were retrieved from the medical records. Race was not individually recorded, but the majority of the population was White. In addition to the original trial exclusion criteria (21), examinations were excluded if there were problems retrieving the mammograms from the picture archiving and communication system prior to the AI processing.

Original Screening Reading Settings
The DM and DBT images were independently read by four out of five dedicated breast radiologists (including J.L.R.P. and S.R.M., 15 and 3 years of experience, respectively) in four reading arms as described in Figure 1. The readers were blinded to the outcomes of the other arms.

Figure 1:  Diagram illustrates how digital mammography (DM) and digital breast tomosynthesis (DBT) images (with synthetic mammography [SM]) were read during the
trial, in four independent readings by four radiologists. During the trial, a woman was recalled when any of the four arms recalled the examination (no arbitration or consen-
sus). The assessments of each reading arm were recorded.

58 radiology.rsna.org  n  Radiology: Volume 300: Number 1—July 2021



Figure 2:  Flowchart of the original screening strategies and how they were compared with the artificial intelligence (AI)–based screening strategy. If the original setting
used digital mammography (DM), AI scores computed on DM images were used. Similarly, if the original setting used digital breast tomosynthesis (DBT), the AI scores com-
puted on only DBT images were used. Cases were considered very likely normal if the AI score was 7 or lower (approximately 70% of screening volume). Additionally, the
examinations not recalled by radiologists but with an AI score among the 2% most suspicious examinations in the cohort were considered automatically recalled.
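The triaging rule summarized in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the software used in the study: the `Exam` record, its field names, and the `ai_suspicion_percentile` field (the examination's percentile rank of AI suspicion within the cohort) are hypothetical, and how the 2% most suspicious examinations were ranked in practice is an assumption here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Exam:
    # All field names are hypothetical, chosen for illustration only.
    ai_exam_score: int              # AI examination score, 1 to 10
    ai_suspicion_percentile: float  # assumed cohort percentile of AI suspicion, 0 to 100
    reader_recalls: List[bool]      # original recall decision of each human reader


def triage(exam: Exam, normal_cutoff: int = 7,
           auto_recall_pct: float = 2.0) -> Tuple[int, bool]:
    """Return (number of human reads, recalled) for one examination."""
    if exam.ai_exam_score <= normal_cutoff:
        # Very likely normal for AI: not human-read and not recalled.
        return 0, False
    human_recall = any(exam.reader_recalls)
    # Examinations within the most suspicious 2% are recalled automatically.
    ai_recall = exam.ai_suspicion_percentile > 100.0 - auto_recall_pct
    return len(exam.reader_recalls), human_recall or ai_recall


def simulate(exams: List[Exam]) -> Tuple[int, int]:
    """Return (total human reads, total recalls) over a cohort."""
    results = [triage(e) for e in exams]
    return sum(r for r, _ in results), sum(1 for _, rec in results if rec)
```

Note that under this rule an examination scored 7 or lower is never recalled, even if a radiologist recalled it in the original setting, which is how the simulated strategy can lose cancers as well as add them.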

Because of this interpretation setting, it was possible to compute the performance of the following settings, hereinafter referred to as original screening settings: double reading of DM images (if either reader recalls, the case is recalled), double reading of DBT images (if either reader recalls, the case is recalled), and single reading of DBT images (with synthetic mammograms).

AI System
The AI system used in this study (Transpara, version 1.6.0; ScreenPoint Medical) was previously investigated in other publications (17,19,20,22–24). This system uses deep learning to detect lesions suspicious for breast cancer on DM and DBT images. The most suspicious findings detected by the system are marked on every image and assigned a score between 1 and 100. Based on the maximum suspicious finding present in the examination, a proprietary conversion table generates an examination score from 1 to 10, indicating the increasing likelihood that a visible cancer is present on the mammogram. The DBT images and the DM images of each examination were independently processed by the AI system, resulting in two AI scores per examination: an AI-DM score and an AI-DBT score.

AI-based Screening Strategy
For each original setting, an autonomous AI triaging screening strategy was retrospectively simulated, aiming to reduce workload while maintaining sensitivity (detailed in Fig 2).

In this AI strategy, the least suspicious examinations for AI (those assumed very likely normal with an AI score of 7 or lower, approximately 70% according to the device specifications; the cutoff was chosen based on previous research [20] indicating that replacing double reading with single reading for these very likely normal cases would not reduce screening sensitivity by more than 5%) would not be human-read, and the rest of examinations would be read as in the original setting (single or double reading of DM or DBT images). Additionally, the examinations not recalled by radiologists but within the 2% most suspicious examinations as graded by AI would be automatically recalled in order to potentially improve sensitivity (the cutoff was chosen taking into account radiologists’ recall rate at this site).

The output of the AI triaging was analyzed by a panel of radiologists (J.L.R.P., S.R.M., E.E.C., and M.Á.B., with 20, 8, 3, and 20 years of experience, respectively), and findings were considered true-positive only if the system correctly localized them and assigned them the highest suspicion score at the examination (on the region suspicion scale of 1–100).

Statistical Analysis
First, the distribution of AI examination scores in DM and DBT was computed for different groups of examinations based on ground truth (95% CIs were computed using the Wilson binomial method).
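As an illustration of the Wilson binomial method named above, the following Python sketch computes a 95% Wilson score interval for a proportion. It is a minimal sketch using the standard Wilson formula without continuity correction (an assumption, since the article does not state which variant was used); the function name is ours.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Proportions reported in this article, for example 2 of 76 and 1 of 92
low_dm, high_dm = wilson_ci(2, 76)
low_dbt, high_dbt = wilson_ci(1, 92)
```

With these inputs the sketch gives intervals of roughly 0.72% to 9.10% and 0.19% to 5.90%, in line with the intervals the article reports for those two proportions.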


The screening reading workload, sensitivity (including screening-detected and interval cancers), and recall rate (ie, the number of examinations recalled by either the AI system or radiologists divided by total examinations) were compared between each original screening setting and the AI-based screening strategy by using the McNemar test for paired data, with an α of .05 indicating statistical significance. Additionally, the AI-based strategy in DBT was compared with the original double reading of DM.

Screening workload was defined as the number of readings, and an estimate in hours was computed using the average reading time per examination originally reported in this cohort (21): 25 seconds for a DM examination and 64 seconds for a DBT plus DM or synthetic mammography examination.

To control for multiple comparisons (four in total; see Fig 2), Bonferroni correction was applied. P = .013 (ie, .05/4) was considered to indicate a significant difference after Bonferroni correction. To control for multiple comparisons of the end point metrics (workload, sensitivity, and recall rate), these were tested sequentially for each comparison.

The hypothesis was that in the AI-based strategy, workload could be significantly reduced, with noninferior sensitivity and recall rate (prespecified noninferiority margin difference of 5%, in relative terms). Noninferiority was concluded if the sensitivity or the recall rate was superior (higher sensitivity, lower recall rate) in the AI-based setting, and the lower limit of the 95% CI of the difference was greater than the negative value of the prespecified noninferiority margin. If noninferiority was concluded, superior sensitivity and recall rate in the AI-based strategies were sequentially tested using the McNemar test.

Table 1: Summary of Demographic Characteristics

Characteristic                                      No. of Women (n = 15 986)*
No. of examinations                                 15 987
Age at screening
  50–54 years                                       6173 (38.6)
  55–59 years                                       3800 (23.8)
  60–64 years                                       3335 (20.9)
  64–69 years                                       2678 (16.8)
Mean age at screening (y)†                          58 ± 6
Breast density‡
  A                                                 3648 (22.8)
  B                                                 8153 (51.0)
  C                                                 3749 (23.5)
  D                                                 436 (2.7)
Original outcomes
  No. of normal readings (with 2-year follow-up)    14 795 (92.5)
  No. of false-positive recalls
    (at either DBT or DM)                           1078 (6.7)
  No. of screening-detected cancers
    (at either DBT or DM)                           98 (0.6)
  No. of interval cancers                           15 (0.1)
Note.—Unless otherwise specified, data are numbers of women (n = 15 986), with percentages in parentheses. DBT = digital breast tomosynthesis, DM = digital mammography.
* One woman had a bilateral breast cancer and had two different examinations included in the original trial, for a total of 15 987 examinations.
† Data are means ± standard deviations.
‡ Breast density was graded according to the American College of Radiology Breast Imaging Reporting and Data System lexicon.

Figure 3:  Flowchart of data selection. In total, 15 987 examinations from 15 986 women were included. PACS = picture archiving and communications system.

Table 2: Summary of Cancer Characteristics

Characteristic                  Screening-detected Cancers (n = 98)    Interval Cancers (n = 15)
Morphologic type
  Mass                          54 (55)                                13 (87)
  Architectural distortion      20 (20)                                1 (6.7)
  Asymmetry                     3 (3.1)                                1 (6.7)
  Calcifications                21 (21)                                0 (0)
Histologic type
  Invasive ductal carcinoma     68 (69)                                12 (80)
  Invasive lobular carcinoma    4 (4.1)                                1 (6.7)
  Other invasive                0 (0)                                  1 (6.7)
  Ductal carcinoma in situ      26 (27)                                1 (6.7)
Grade
  I                             45 (46)                                4 (27)
  II                            34 (34)                                6 (40)
  III                           19 (20)                                5 (33)
Size (mm)*                      20.7 ± 14.4                            26.6 ± 17.1
Note.—Unless otherwise specified, data are numbers of cancers, with percentages in parentheses.
* Data are means ± standard deviations.

Results
Participant and Examination Characteristics
From the 16 067 women in the cohort, 15 987 examinations in 15 986 women (mean age ± standard deviation, 58 years ± 6)


Figure 4:  Bar graphs show distribution of artificial intelligence (AI) examination scores across different groups of examinations in the paired digital mammography (DM)–
digital breast tomosynthesis (DBT) cohort (all examinations, noncancer recalled examinations, screening-detected cancers, and interval cancers). AI scores were computed
for DM and DBT examinations independently. The ground truth was computed for DM-based screening outcomes and for DBT-based screening outcomes (which includes
DBT plus DM and DBT plus synthetic mammography workflows). For interval cancers, the only difference is whether the AI scores were computed for DM or DBT images.

Table 3: Comparison of the Original Settings and the AI-based Strategy in Terms of Workload, Sensitivity (Cancers Detected), and Recall Rate

Metric            Original Setting without AI     With Simulated Autonomous AI Triaging    Relative Difference*       P Value
DM: double human reading
  Workload†       31 974 (222)                    9100 (63)                                −71.5 (−72.4, −70.6)       <.001‡
  Sensitivity§    67.3 (76/113) [58.2, 75.2]      69.0 (78/113) [60.0, 76.8]               2.63 (−4.9, 11.4)||        .68
  Recall rate§    5.1 (807/15 987) [4.7, 5.4]     4.2 (671/15 987) [3.9, 4.5]              −16.9 (−24.0, −11.0)       <.001‡
DBT: double human reading
  Workload†       31 974 (568)                    8830 (156)                               −72.4 (−72.9, −71.9)       <.001‡
  Sensitivity§    81.4 (92/113) [73.3, 87.5]      84.1 (95/113) [76.2, 89.7]               3.26 (−2.2, 9.4)||         .38
  Recall rate§    4.4 (706/15 987) [4.1, 4.8]     3.7 (588/15 987) [3.4, 4.0]              −16.7 (−23.4, −8.6)        <.001‡
DBT: single human reading
  Workload†       15 987 (284)                    4415 (78)                                −72.4 (−73.3, −71.5)       <.001‡
  Sensitivity§    77.0 (87/113) [68.4, 83.8]      79.6 (90/113) [71.3, 86.0]               3.45 (−1.2, 9.8)||         .38
  Recall rate§    3.0 (481/15 987) [2.8, 3.3]     3.1 (499/15 987) [2.9, 3.4]              3.74 (−3.7, 12.8)          .41
Note.—AI = artificial intelligence, DBT = digital breast tomosynthesis, DM = digital mammography.
* Data are percentages, with 95% CIs in parentheses.
† Unless otherwise specified, data are number of reads, with number of hours in parentheses.
‡ Significant difference.
§ Unless otherwise specified, data are percentages, with raw data in parentheses and 95% CIs in brackets.
|| Noninferior.
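The hours shown in parentheses for workload in Table 3 can be approximated from the number of reads and the average reading times stated in the Statistical Analysis section (25 seconds per DM examination, 64 seconds per DBT examination). The Python sketch below is illustrative only; the dictionary and function names are ours.

```python
# Average reading time per examination, in seconds, as stated in the article.
SECONDS_PER_READ = {"DM": 25, "DBT": 64}


def workload_hours(n_reads: int, modality: str) -> float:
    """Estimated reading workload in hours for a given number of reads."""
    return n_reads * SECONDS_PER_READ[modality] / 3600.0


dm_double = workload_hours(31974, "DM")      # about 222 hours
dbt_double = workload_hours(31974, "DBT")    # about 568 hours
dbt_double_ai = workload_hours(8830, "DBT")  # about 157 hours
```

For 8830 DBT reads this sketch gives approximately 157 hours, slightly above the 156 hours reported in the table; the difference may come from rounding or truncation in the original computation, which the article does not describe.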


Figure 5:  A, Digital mammography (DM) and, B, digital breast tomosynthesis (DBT) images in a 66-year-old woman not recalled during
any of the original readings. Artificial intelligence (AI) identified a spiculated mass (outlined) on images obtained with both techniques during
screening and assigned a region score of 82 at DM and 95 at DBT. This woman would have automatically been recalled only at DBT. C,
DM image obtained 4 months later, after she discovered a palpable lump (not related to the actual cancer). Biopsy was performed, and
an interval cancer, a grade II invasive ductal carcinoma of 6 mm, was diagnosed in the lesion that would have been recalled by AI. The AI
examination score of this case was 10.

were included (99.5%) (Fig 3). Eighty-one examinations (five noncancer recalled examinations and 76 normal examinations) from 81 women were excluded because of problems retrieving the data from the picture archiving and communication system. In total, 113 examinations were labeled as showing cancers (98 screening-detected and 15 interval). The characteristics of the selected cohort are included in Tables 1 and 2.

Distribution of AI Scores
The distribution of AI scores across the different groups of examinations in the cohort is shown in Figure 4, computed for both DM and DBT examinations independently. The distribution of AI scores is homogeneous for all screening examinations (approximately 10% in each score category), whereas only a minority of screening-detected cancers were scored 1–7: two of 76 DM-based screening-detected cancers (2.6%; 95% CI: 0.72, 9.10) and one of 92 DBT-based screening-detected cancers (1.1%; 95% CI: 0.19, 5.90). At the same time, AI examinations scored 1–7 comprise 11 437 of 15 987 of the DM-based screening volume (71.5%; 95% CI: 70.8, 72.2) and 11 572 of 15 987 of the DBT-based screening volume (72.4%; 95% CI: 71.7, 73.1).

Given that this group of cases with scores 1–7 includes less than 5% of screening-detected cancers, it was estimated that this is an optimal cutoff point to differentiate likely normal examinations in the proposed AI-based strategies (negative predictive value, 99.98% [95% CI: 99.94, 99.99] in DM and 99.99% [95% CI: 99.95, 99.99] in DBT), similar to previous studies (20).

Simulated AI-based Strategy
The comparison of the original screening strategy with the AI-based strategy is presented in Table 3. Consistently across DM-based and DBT-based screening, a workload reduction of 71.5% (9100 of 31 974 reads; 95% CI: 70.6, 72.4) and 72.4% (8830 of 31 974 reads; 95% CI: 71.9, 72.9) was observed with the AI-based screening strategy.

The AI-based strategy resulted in noninferior sensitivity across different screening settings: 76 of 113 cancers detected in double reading of DM images versus 78 of 113 when using AI (relative difference, 2.63%; 95% CI: −4.9, 11.4; P = .68); 92 of 113 cancers detected in double reading of DBT images versus 95 of 113 when using AI (relative difference, 3.26%; 95% CI: −2.2, 9.4; P = .38); and 87 of 113 cancers detected in single reading of DBT images versus 90 of 113 when using AI (relative difference, 3.45%; 95% CI: −1.2, 9.8; P = .38).

When compared with double readings, the AI-based strategy was associated with an overall reduction in recall rate of 16.9% (671 of 15 987 women recalled with AI vs 807 of 15 987 without AI; 95% CI: 11.0, 24.0; P < .001) and 16.7% (588 of 15 987 women recalled with AI vs 706 of 15 987 without AI; 95% CI: 8.6, 23.4; P < .001) in DM and DBT double readings settings, respectively. When compared with single reading of DBT images, recall rate showed a nonsignificant increment (499 of 15 987 women recalled with AI vs 481 of 15 987 without AI; relative difference, 3.74%; 95% CI: −3.77, 12.83; P = .41).

Examinations Recalled Only by AI
In double reading of DM images, AI additionally recalled a total of 210 examinations (four of which were true-positive results). In double reading of DBT images, AI additionally recalled a total of 206 examinations (four of which were true-positive results). In single reading of DBT images, AI additionally recalled a total of 218 examinations (four of which were true-positive results). Therefore, in the group of additional cases recalled by AI only, the positive predictive value ranged from 1.8% to 1.9%.
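The paired sensitivity comparisons can be illustrated with an exact (binomial) McNemar test, which depends only on the discordant pairs. The Python sketch below rests on two assumptions: the article does not state whether the exact or the chi-square form of the test was used, and the discordant counts are inferred here from the reported figures (cancers scored 1–7 that the original reading detected, versus cancers added by the AI recall) rather than taken from a published table.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value.

    b: pairs positive only in the original setting
    c: pairs positive only in the AI-based setting
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Bonferroni-corrected threshold for four comparisons, as described in the article.
ALPHA_CORRECTED = 0.05 / 4

# Inferred discordant counts (assumption): cancers lost to triage vs cancers added by AI.
p_dm_double = mcnemar_exact(2, 4)   # 76 -> 78 of 113 cancers detected
p_dbt_double = mcnemar_exact(1, 4)  # 92 -> 95 of 113 cancers detected
```

Under these assumptions the sketch gives P values of about 0.69 and 0.38, close to the reported .68 and .38, and both are well above the corrected threshold of .0125.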


Figure 6:  Images obtained in mediolateral oblique (top) and craniocaudal (bottom) views. A, Digital mammography (DM) images and C, digital breast tomosynthesis (DBT)
images in a 65-year-old woman recalled only because of the original DBT readings. B, AI-processed DM images show a focal asymmetry (outlined in red in the mediolateral
oblique view by AI, with a region score of 94; nonrecalled at DM readings). The yellow diamond outlines a cluster of calcifications, with a region score of 42, not related to the
actual cancer. D, AI-processed DBT images show spiculated mass (also outlined by AI in both views, with a region score of 95). AI would have correctly recalled that cancer
lesion in both techniques (B and D). Grade I invasive ductal carcinoma of 18 mm was diagnosed at percutaneous biopsy. The AI examination score of this case was 10.

The four cancers added by AI at DM examinations were all originally screening-detected with DBT only (two of the four were ductal carcinoma in situ, one was a low-grade invasive ductal cancer, and one was a high-grade invasive ductal cancer).

Among the four cancers added by AI at DBT examinations, two were originally screening-detected at DM and two were interval cancers (in total, three of the four were ductal carcinoma in situ and one was a high-grade invasive ductal cancer). Thirteen of 15 interval cancers were not detected with any AI-based strategy (not present in the top 2% of suspicion among AI scores), although nine of these interval cancers are included in the group of examinations with AI scores of 8–10 (the top 30% of suspicion among AI scores).

Illustrative examples of examinations in the study where AI showed additional value are shown in Figures 5 and 6.

Comparison of Unaided Double Reading of DM Images with AI-based Double Reading of DBT Images
When comparing the AI-based strategy of DBT to the original double reading of DM (Table 4), it was observed that AI-based DBT screening would have been carried out with a smaller workload (156 hours vs 222 hours, a relative workload reduction of 29.7% [95% CI: 23.8, 36.2], P < .001). The sensitivity would have been 25.0% higher in relative terms (95% CI: 15.8, 36.3; P < .001), with 95 of 113 cancers detected with AI-DBT screening (84.1%; 95% CI: 76.2, 89.7) and 76 of 113 with unaided DM screening (67.3%; 95% CI: 58.2, 75.2). Moreover, the recall rate would have been 27.1% lower in relative terms (95% CI: 24.1, 30.3; P < .001), with 588 of 15 987 women recalled with AI-DBT screening (3.7%; 95% CI: 3.4, 4.0) and 807 of 15 987 women recalled with unaided DM screening (5.1%; 95% CI: 4.7, 5.4).


Table 4: Comparison of the AI Strategy for DBT with the Original Double Reading of DM Images

Metric              Double Reading of DM Images without AI    Double Reading of DBT Images with Autonomous AI Triaging    Relative Difference*       P Value
Workload†           31 974 (222)                              8830 (156)                                                  −29.7 (−36.2, −23.8)       <.001‡
Sensitivity (%)§    67.3 (76/113) [58.2, 75.2]                84.1 (95/113) [76.2, 89.7]                                  +25.0 (15.8, 36.3)         <.001‡
Recall rate (%)§    5.1 (807/15 987) [4.7, 5.4]               3.7 (588/15 987) [3.4, 4.0]                                 −27.1 (−30.3, −24.1)       <.001‡
Note.—AI = artificial intelligence, DBT = digital breast tomosynthesis, DM = digital mammography.
* Data are percentages, with 95% CIs in parentheses.
† Unless otherwise specified, data are number of reads, with number of hours in parentheses.
‡ Significant difference.
§ Unless otherwise specified, data in parentheses are raw data, with 95% CIs in brackets.
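The relative differences in Table 4 follow from the raw values, and the noninferiority rule for sensitivity described in the Statistical Analysis section reduces to a comparison of the lower 95% CI limit with the negative 5% margin. The Python sketch below illustrates both; the function names are ours, and the CI limits are taken as inputs from the reported tables because the article does not describe how they were computed.

```python
def relative_difference(original: float, with_ai: float) -> float:
    """Relative difference in percent of the AI-based value versus the original value."""
    return 100.0 * (with_ai - original) / original


def is_noninferior(ci_lower_pct: float, margin_pct: float = 5.0) -> bool:
    """Sensitivity rule: lower 95% CI limit of the relative difference above -margin."""
    return ci_lower_pct > -margin_pct


sens_diff = relative_difference(76, 95)        # cancers detected: about +25.0%
recall_diff = relative_difference(807, 588)    # recalls: about -27.1%
workload_diff = relative_difference(222, 156)  # hours: about -29.7%
```

Because both strategies are evaluated on the same 113 cancers and 15 987 examinations, counts can be used directly in place of rates. For example, a reported lower limit of −4.9% for the sensitivity difference in DM double reading satisfies the rule, since −4.9 is greater than −5.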

Discussion
Current breast cancer screening programs have a high workload for radiologists and an objectionable number of false-positive and false-negative assessments. Our findings highlight how an artificial intelligence (AI) system could reduce up to 70% of the workload in digital mammography (DM)– and digital breast tomosynthesis (DBT)–based breast cancer screening without reducing the sensitivity by 5% or more, indicating that workload can be reduced while maintaining the overall program sensitivity. This was achieved when this proportion of least suspicious examinations for AI would not be read by radiologists, while, at the same time, AI could be used as an additional complementary reader to recall cases not recalled by radiologists. Letting radiologists read this group of the 70% least suspicious examinations led to more recalls. Moreover, AI could be used to transition from DM screening to DBT screening with a 30% reduction in workload (P < .001), a 25% improvement in sensitivity (P < .001), and a 27% reduction in recall rate (P < .001).

Although several studies investigated how AI could reduce workload in screening programs with DM (18–20,25), to our knowledge, this is the first study to investigate AI-based strategies to reduce workload in DBT using real screening cohorts. Furthermore, because our study uses paired DM and DBT examinations, it was possible to determine AI-based strategies for DBT that could replace standard DM screening without increasing workload, one of the biggest limitations of introducing DBT into clinical practice. To our knowledge, our results have not been reported in any other comparison between DM- and DBT-based screening where transitioning to DBT is always associated with an increase in workload (11,26).

Previous studies in DM have suggested that it could be safe (ie, no sensitivity reduction) to use AI to reduce screening workload between 20% and 50% (18–20). In our study, we found this to be 70%. This threshold of 70% to define the optimal group of least suspicious examinations was proposed by Balta et al (20) using the same AI system in a DM screening cohort and could also be reproduced in our study (including DBT). In comparison, earlier studies using previous versions of the same AI system found that the group of the 20% least suspicious examinations would be the most optimal threshold (19), also suggesting how the continuous development of AI systems could keep bringing this threshold further up in the future.

Our study has limitations. It was only performed with data from a single site and single mammography and AI vendor. Moreover, because it was a retrospective study and the AI scenarios were simulated, it is not possible to know the impact on radiologists’ performance in the setting where they would, for example, read only the 30% most suspicious screening examinations. In addition, in the analysis of the AI system, readers were blinded to prior examinations, as opposed to radiologist screening assessments, which requires further analyses to understand the clinical impact of using AI in screening when AI does not include prior information. Finally, although our results suggest that no human reading of low-suspicion examinations would be the most optimal for the screening program cost-efficiency, further legal discussions would be needed to establish a framework where this strategy is safe for all the parties involved in screening.

In conclusion, our study shows a strategy with an artificial intelligence (AI) system where screening workload could be safely reduced up to 70% for both digital mammography (DM)– and digital breast tomosynthesis (DBT)–based programs, as well as allow the transition from DM- to DBT-based screening without an increase in workload. Given the increasing lack of expert breast radiologists as well as the increased workload associated with the introduction of DBT, new strategies potentially using AI could be necessary to maintain the cost-efficiency of screening programs. Further prospective studies are needed to validate our findings.

Acknowledgments: The authors thank the Department of Informatics at Hospital Universitario Reina Sofía for their help in retrieving images from picture archiving and communication system and their support in processing them.

Author contributions: Guarantors of integrity of entire study, J.L.R.P., M.Á.B.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, J.L.R.P., E.E.C., A.R.R., M.Á.B.; clinical studies, J.L.R.P., E.E.C., A.R.R., M.Á.B.; experimental studies, A.G.M.; statistical analysis, J.L.R.P.; and manuscript editing, J.L.R.P., S.R.M., A.G.M., A.R.R., M.Á.B.

Disclosures of Conflicts of Interest: J.L.R.P. disclosed no relevant relationships. S.R.M. disclosed no relevant relationships. E.E.C. disclosed no relevant relationships. A.G.M. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: is an employee of ScreenPoint Medical. Other relationships: disclosed no relevant relationships. A.R.R. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: is an employee of ScreenPoint Medical. Other relationships: disclosed no relevant relationships. M.Á.B. disclosed no relevant relationships.


References

1. Hakama M, Coleman MP, Alexe DM, Auvinen A. Cancer screening: evidence and practice in Europe 2008. Eur J Cancer 2008;44(10):1404–1413.
2. Smith RA, Cokkinides V, Brooks D, Saslow D, Brawley OW. Cancer screening in the United States, 2010: a review of current American Cancer Society guidelines and issues in cancer screening. CA Cancer J Clin 2010;60(2):99–119.
3. Boyd NF, Guo H, Martin LJ, et al. Mammographic density and the risk and detection of breast cancer. N Engl J Med 2007;356(3):227–236.
4. Mellado Rodríguez M, Osa Labrador AM. Breast cancer screening: current status [in Spanish]. Radiología 2013;55(4):305–314.
5. Castells X, Molins E, Macià F. Cumulative false positive recall rate and association with participant related factors in a population based breast cancer screening programme. J Epidemiol Community Health 2006;60(4):316–321.
6. GLOBOCAN. Cancer Today. International Agency for Research on Cancer. World Health Organization, 2018. http://gco.iarc.fr/today. Accessed August 21, 2020.
7. Skaane P, Bandos AI, Gullien R, et al. Prospective trial comparing full-field digital mammography (FFDM) versus combined FFDM and tomosynthesis in a population-based screening programme using independent double reading with arbitration. Eur Radiol 2013;23(8):2061–2071.
8. Ciatto S, Houssami N, Bernardi D, et al. Integration of 3D digital mammography with tomosynthesis for population breast-cancer screening (STORM): a prospective comparison study. Lancet Oncol 2013;14(7):583–589.
9. Lång K, Andersson I, Rosso A, Tingberg A, Timberg P, Zackrisson S. Performance of one-view breast tomosynthesis as a stand-alone breast cancer screening modality: results from the Malmö Breast Tomosynthesis Screening Trial, a population-based study. Eur Radiol 2016;26(1):184–190.
10. Pattacini P, Nitrosi A, Giorgi Rossi P, et al. Digital Mammography versus Digital Mammography Plus Tomosynthesis for Breast Cancer Screening: The Reggio Emilia Tomosynthesis Randomized Trial. Radiology 2018;288(2):375–385.
11. Caumo F, Zorzi M, Brunelli S, et al. Digital Breast Tomosynthesis with Synthesized Two-Dimensional Images versus Full-Field Digital Mammography for Population Screening: Outcomes from the Verona Screening Program. Radiology 2018;287(1):37–46.
12. Bernardi D, Ciatto S, Pellegrini M, et al. Application of breast tomosynthesis in screening: incremental effect on mammography acquisition and reading time. Br J Radiol 2012;85(1020):e1174–e1178.
13. Aase HS, Holen ÅS, Pedersen K, et al. A randomized controlled trial of digital breast tomosynthesis versus digital mammography in population-based screening in Bergen: interim analysis of performance indicators from the To-Be trial. Eur Radiol 2019;29(3):1175–1186.
14. Lehman CD, Wellman RD, Buist DS, et al. Diagnostic Accuracy of Digital Screening Mammography With and Without Computer-Aided Detection. JAMA Intern Med 2015;175(11):1828–1837.
15. Kim HE, Kim HH, Han BK, et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health 2020;2(3):e138–e148.
16. McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577(7788):89–94.
17. Rodríguez-Ruiz A, Lång K, Gubern-Mérida A, et al. Stand-Alone Artificial Intelligence for Breast Cancer Detection in Mammography: Comparison With 101 Radiologists. J Natl Cancer Inst 2019;111(9):916–922.
18. Yala A, Schuster T, Miles R, Barzilay R, Lehman C. A Deep Learning Model to Triage Screening Mammograms: A Simulation Study. Radiology 2019;293(1):38–46.
19. Rodríguez-Ruiz A, Lång K, Gubern-Mérida A, et al. Can we reduce the workload of mammographic screening by automatic identification of normal exams with artificial intelligence? A feasibility study. Eur Radiol 2019;29(9):4825–4832.
20. Balta C, Rodríguez-Ruiz A, Mieskes C, Karssemeijer N, Heywang-Köbrunner SH. Going from double to single reading for screening exams labeled as likely normal by AI: what is the impact? In: Bosmans H, Marshall N, Van Ongeval C, eds. Proceedings of SPIE: 15th International Workshop on Breast Imaging (IWBI2020). Vol 11513. Bellingham, Wash: International Society for Optics and Photonics, 2020; 115130D.
21. Romero Martín S, Raya Povedano JL, Cara García M, Santos Romero AL, Pedrosa Garriguet M, Álvarez Benito M. Prospective study aiming to compare 2D mammography and tomosynthesis + synthesized mammography in terms of cancer detection and recall. From double reading of 2D mammography to single reading of tomosynthesis. Eur Radiol 2018;28(6):2484–2491.
22. Rodríguez-Ruiz A, Krupinski E, Mordang JJ, et al. Detection of Breast Cancer with Mammography: Effect of an Artificial Intelligence Support System. Radiology 2019;290(2):305–314.
23. Sasaki M, Tozaki M, Rodríguez-Ruiz A, et al. Artificial intelligence for breast cancer detection in mammography: experience of use of the ScreenPoint Medical Transpara system in 310 Japanese women. Breast Cancer 2020;27(4):642–651.
24. Dustler M, Dahlblom V, Tingberg A, Zackrisson S. The effect of breast density on the performance of deep learning-based breast cancer detection methods for mammography. In: Bosmans H, Marshall N, Van Ongeval C, eds. Proceedings of SPIE: 15th International Workshop on Breast Imaging (IWBI2020). Vol 11513. Bellingham, Wash: International Society for Optics and Photonics, 2020; 1151324.
25. Kyono T, Gilbert FJ, van der Schaar M. Improving Workflow Efficiency for Mammography Using Machine Learning. J Am Coll Radiol 2020;17(1 Pt A):56–63.
26. Dang PA, Freer PE, Humphrey KL, Halpern EF, Rafferty EA. Addition of tomosynthesis to conventional digital mammography: effect on image interpretation time of screening examinations. Radiology 2014;270(1):49–56.
