Original Investigation
Rationale and Objectives: Federal legislation requires patient notification of dense mammographic breast tissue because increased density is a marker of breast cancer risk and can limit the sensitivity of mammography. As previously described, we clinically implemented our deep learning model, with high clinical acceptance, at the academic breast imaging practice where the model was developed. Our objective was to externally validate our deep learning model against radiologist breast density assessments in a community breast imaging practice.

Materials and Methods: Our deep learning model was implemented at a dedicated breast imaging practice staffed by both academic and community breast imaging radiologists in October 2018. The deep learning model's assessment of mammographic breast density was presented to the radiologist during routine clinical practice at the time of mammogram interpretation. We identified 2174 consecutive screening mammograms after implementation of the deep learning model. Radiologist agreement with the model's assessment was measured and compared across radiologist groups.

Results: Both academic and community radiologists had high clinical acceptance of the deep learning model's density prediction, with 94.9% (academic) and 90.7% (community) acceptance for dense versus nondense categories (p < 0.001). The proportion of mammograms assessed as dense by all radiologists decreased from 47.0% before deep learning model implementation to 41.0% after deep learning model implementation (p < 0.001).

Conclusion: Our deep learning model had a high clinical acceptance rate among both academic and community radiologists and reduced the proportion of mammograms assessed as dense. This is an important step toward validating our deep learning model prior to potential widespread implementation.
Key Words: Breast Density; Mammography; Deep Learning.
© 2020 The Association of University Radiologists. Published by Elsevier Inc. All rights reserved.
Abbreviations: DL, deep learning; BI-RADS, Breast Imaging Reporting and Data System; PACS, Picture Archiving and Communication System
Acad Radiol 2020; &:1–6

From the Massachusetts General Hospital, 55 Fruit Street, WAC-240, Boston, MA 02114 (B.N.D., C.D.L.); Massachusetts Institute of Technology, Cambridge, Massachusetts (A.Y., R.B., J.X.). Received November 11, 2019; revised December 11, 2019; accepted December 12, 2019. Funding: There is no funding to disclose. Address correspondence to: B.N.D. e-mail: bdontchos@mgh.harvard.edu

https://doi.org/10.1016/j.acra.2019.12.012

INTRODUCTION

Increased mammographic breast density can mask cancers on mammography and is an independent risk factor for breast cancer (1-3). As a result, more than 30 states passed legislation requiring direct patient notification of increased mammographic breast density. Individual state laws vary, but most require direct notification of patients that their increased mammographic breast density may mask breast cancer and that they may benefit from supplemental screening tests such as ultrasound or MRI. In February 2019, federal legislation was passed with similar requirements, expanding the potential impact of dense mammogram assessments.

Mammographic breast density assessment is subjective and varies widely between and within radiologists (4-10). In one study of 83 radiologists, there was considerable variation in qualitative Breast Imaging Reporting and Data System (BI-RADS) density assessments, with a range of 6%-85% of mammograms being assessed as heterogeneously or extremely dense (4). In another study of 34 radiologists, the intraradiologist density agreement among women with two exams ranged from 62% to 87% (6).

ARTICLE IN PRESS
DONTCHOS ET AL    Academic Radiology, Vol &, No &&, && 2020

In addition, one study across
multiple states demonstrated a change in radiologists' density assessments after notification laws were enacted (5). These studies show not only that mammographic density assessments are subjective and variable but also that they can be influenced.
Commercially available methods of automated breast density assessment have produced mixed results in agreement with expert density assessments, with both over- and underreporting of density compared to radiologists' qualitative assessments (11,12). A recent study found significant differences in density assessments of the same 4170 women by two software programs (Volpara, Volpara Solutions, Wellington, New Zealand; and Quantra, Hologic, Bedford, MA), with one program assessing 37% of the women as having dense breast tissue and the other program assessing 51% of the same women as having dense breast tissue; radiologists assessed 43% of the same set of mammograms as having dense breast tissue (12).

Deep learning (DL) has been gaining traction in radiology (13-16). Specifically, some early studies have evaluated algorithms that automatically assess mammographic breast density (17,18). In the first clinical implementation of such a DL algorithm, we developed and implemented a DL model to assess mammographic breast density in routine clinical practice at an academic breast imaging center. After clinical implementation, the interpreting radiologist accepted the DL model assessment in 94% of exams (19). While this study was an important initial step, the model was not externally validated on patients or radiologists outside the institution that provided the reference images and density assessments to train the model.

The primary aim of this study is to measure the clinical acceptance of the DL model's density assessment in routine clinical practice among both academic and community breast imaging radiologists. The secondary aim is to measure the influence the DL model had on radiologists' density assessments.

METHODS

Our retrospective study was approved by our institutional review board (with a waiver of the need to obtain informed consent) and was compliant with the Health Insurance Portability and Accountability Act.

Development of the Deep Learning Model

We developed and tested our DL model using 58,894 randomly selected digital mammograms (Hologic, Bedford, MA) without exclusion criteria from 39,272 women screened between January 2009 and May 2011, using a deep convolutional neural network, ResNet-18 (21), with PyTorch (2018, version 0.31; pytorch.org), as previously described (19). In brief, women were randomly assigned to training, development, and test sets, resulting in 41,479; 8738; and 8677 mammograms for each set, respectively. Breast density was recorded by one of 12 academic radiologists subspecialized in breast imaging, with between 5 and 33 years of experience, following the American College of Radiology BI-RADS lexicon of category a (almost entirely fatty), b (scattered areas of fibroglandular density), c (heterogeneously dense), and d (extremely dense).

Clinical Implementation and Influence of the Deep Learning Model

After 6 months of evaluation at our primary academic breast imaging practice (19), the DL model was implemented in an identical fashion at our partner community breast imaging practice on June 1, 2018. Our partner community breast imaging site is a dedicated breast imaging center located in a suburban ambulatory care center. The radiologist staffing is shared equally between our academic practice radiologists and our partner community practice radiologists. Screening mammograms performed at the community practice site are divided randomly and equally between the two groups of breast imaging radiologists, then assessed and reported independently.

Identical to our academic breast imaging practice, mammograms from the community breast imaging practice were automatically retrieved after image acquisition and processed by the DL algorithm; the DL density assessment (BI-RADS category a, b, c, or d) was sent to a commercially available mammography reporting software program (Magview, 2018, version 8.0.143; Burtonsville, MD). We allowed a several-month training period for our community practice radiologists, from June 1, 2018 to September 30, 2018, to familiarize themselves with the DL model workflow integration.

Eight academic radiologists (range of 2-24 years of experience; mean of 8.3 years; median of 4 years) and five community radiologists (range of 4-20 years of experience; mean of 11.6 years; median of 10 years) interpreted screening mammograms during the clinical implementation phase. All eight of the academic radiologists and four of the five community radiologists completed a dedicated fellowship training program in breast imaging (the one community radiologist who did not complete a dedicated breast imaging fellowship has practiced clinical breast imaging for 17 years). None of these 13 radiologists contributed prospective density assessments to the training, development, or test sets for the DL algorithm.

During the screening mammogram review, all radiologists were provided with the DL model's density assessment in the electronic imaging report provided by the reporting software. The final density assessment of the mammogram was at the discretion of the radiologist, that is, to agree or disagree with the DL model. All mammograms were analyzed on our review workstations (Hologic, Bedford, MA) following our routine clinical workflow for mammogram assessment and reporting.

We identified consecutive digital screening mammograms performed at the community breast imaging site during two time periods: 5696 mammograms before DL model implementation, from June 1, 2017 to May 31, 2018 (preimplementation); and 2174 mammograms from October 1, 2018 to February 28, 2019 (postimplementation). No
Academic Radiology, Vol &, No &&, && 2020 EXTERNAL VALIDATION OF A DEEP LEARNING MODEL
mammograms were excluded (e.g., due to prior surgery, implants, etc.). Note that density legislation in our state went into effect on January 1, 2015, well before the study. Both academic and community radiologists interpreted screening mammograms at the community breast imaging site during both time periods. The group of academic and community radiologists assessing the mammograms was nearly identical in each time period, save for one academic radiologist who left the practice before the DL model was implemented. We measured the proportion of mammograms categorized by the radiologist as dense or nondense and across the four BI-RADS categories for the final read during the pre- and postimplementation time periods, compared across academic and community radiologists. Finally, we recorded the proportion of mammograms categorized by the DL model as dense or nondense and across the four BI-RADS categories during the postimplementation time period.

Statistical Analysis

We quantified the types of disagreements and computed agreement between the final assessment and the DL assessment across the four BI-RADS categories, estimating with weighted kappa using linear weighting. Kappa statistics were compared across 5000 bootstrap samples to assess significance. To compare density distributions, we compared academic radiologists, community radiologists, and all radiologists across time periods 1 and 2 using Pearson's chi-squared test to calculate significance. We computed all statistics using scikit-learn (scikit-learn.org, v0.19.1).

RESULTS

After clinical implementation of the DL model, 2174 screening mammograms were prospectively interpreted by academic (1079 exams) and community radiologists (1095 exams). Mean age, age distribution, and race were similar between academic and community radiologist assessed exams. Academic radiologists assessed 39.3% of exams as dense and community radiologists assessed 42.8% of exams as dense (p = 0.09). The DL model assessed a similar proportion of exams as dense in the academic (34.9%) and community radiologist assessed exams (34.1%) (p = 0.67). Availability of a comparison exam was similar across academic (94.1%) and community (92.8%) radiologist assessed exams (p = 0.23) (Table 1).

The academic radiologists had high clinical acceptance of the DL model's density prediction, with 94.9% acceptance for dense versus nondense categories, and 92.1% acceptance
TABLE 1. Patient demographics and mammographic breast density assessment distribution after implementation of the DL model. Significance comparisons made between academic and community radiologists; tests used: two-tailed t test and Pearson's chi-square test.
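The agreement analysis described in the Methods (linearly weighted kappa across the four BI-RADS categories, with bootstrap resampling for inference) can be sketched as follows. This is a minimal illustration on simulated labels, not study data, and assumes scikit-learn's `cohen_kappa_score`:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Simulated four-way BI-RADS density categories (0=a, 1=b, 2=c, 3=d):
# a DL assessment and a final radiologist read that agrees with the
# model roughly 90% of the time.
n = 1000
dl = rng.integers(0, 4, size=n)
noise = rng.integers(0, 4, size=n)
radiologist = np.where(rng.random(n) < 0.9, dl, noise)

# Linearly weighted kappa penalizes disagreements in proportion to
# their distance on the ordered a-d scale.
kappa = cohen_kappa_score(dl, radiologist, weights="linear")

# Bootstrap resampling of exams (the study used 5000 samples; 1000
# here to keep the sketch fast) to attach a confidence interval.
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    boot.append(cohen_kappa_score(dl[idx], radiologist[idx], weights="linear"))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"weighted kappa = {kappa:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")
```

Comparing two radiologist groups then amounts to computing the kappa difference within each bootstrap sample and checking whether the resulting interval excludes zero.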
TABLE 2. Percent acceptance of community and academic radiologists with the DL model density assessment after clinical implementation, with 95% confidence intervals in parentheses.
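Percent acceptance, as reported in Table 2, amounts to tabulating the DL assessment against the final radiologist assessment and taking the agreeing fraction, both across the four BI-RADS categories and after collapsing to dense (c, d) versus nondense (a, b). A minimal sketch with illustrative labels (not study data), assuming scikit-learn's `confusion_matrix`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

CATS = ["a", "b", "c", "d"]  # BI-RADS density categories

# Illustrative paired assessments (not study data).
dl_pred = np.array(["b", "b", "c", "d", "a", "c", "b", "c"])
final   = np.array(["b", "c", "c", "d", "a", "c", "b", "d"])

# Four-way tabulation: rows = DL assessment, columns = final read;
# acceptance is the diagonal fraction.
cm = confusion_matrix(dl_pred, final, labels=CATS)
four_way_acceptance = np.trace(cm) / cm.sum()

# Dense vs nondense: collapse a/b -> nondense and c/d -> dense.
def is_dense(x):
    return np.isin(x, ["c", "d"])

binary_acceptance = np.mean(is_dense(dl_pred) == is_dense(final))

print(f"four-way acceptance: {four_way_acceptance:.2%}")
print(f"dense/nondense acceptance: {binary_acceptance:.2%}")
```

Note that dense/nondense acceptance is always at least as high as four-way acceptance, since disagreements within the nondense (a vs b) or dense (c vs d) pairs still count as binary agreement.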
TABLE 3. Distribution of prospectively reported density assessments before implementation of the DL model (6/1/2017 to 5/31/2018) and after implementation of the DL model (10/1/2018 to 2/28/2019).
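The pre- versus postimplementation shift summarized in Table 3 reduces to a chi-squared test on a 2x2 table of dense versus nondense counts. A sketch using SciPy's `chi2_contingency` (the paper reports computing statistics with scikit-learn; SciPy is substituted here for the test itself). The counts are reconstructed from the reported proportions (47.0% of 5696 preimplementation exams, 41.0% of 2174 postimplementation exams) and are therefore approximate:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Approximate dense/nondense counts reconstructed from the reported
# proportions of exams assessed as dense in each period.
pre_total, post_total = 5696, 2174
pre_dense = round(0.470 * pre_total)    # ~2677
post_dense = round(0.410 * post_total)  # ~891

table = np.array([
    [pre_dense, pre_total - pre_dense],     # preimplementation
    [post_dense, post_total - post_dense],  # postimplementation
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.1e}")
```

With these reconstructed counts the test yields p well below 0.001, consistent with the significance level the paper reports for the pre/post comparison.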
logical next step to support our model's expanded use in a variety of practice types and patient populations.

The American College of Radiology has highlighted the variability of breast density assessments in a statement discussing potential outcomes and harms of mandatory density notification (22). We showed a 6% absolute reduction in dense mammogram assessments after DL model implementation, and more expanded use of our DL model could have direct beneficial implications for limited imaging resources, patients, and payers, with substantial cost reduction by reducing the number of women potentially referred for supplemental screening tests and/or high-risk clinic evaluations. Interestingly, when the radiologists did not accept the DL model assessment, most cases were categorized into a higher density category (i.e., when the DL model's assessment was scattered areas of fibroglandular density, the disagreeing radiologist much more often assessed density as heterogeneously dense as opposed to almost entirely fatty), demonstrating the radiologists' tendency to increase the density category.

There is also potential for improved consistency of density assessments given the known wide variation in radiologist density assessments, as Sprague et al. have shown, with between 6% and 85% of exams categorized as dense in their large study (4). After clinical implementation of our model, the academic radiologists categorized 39.3% of exams as dense, and the community radiologists categorized 42.8% as dense, while our model categorized 34.9% and 34.1% of exams as dense in the two groups of patients, highlighting the potential for the model to improve radiologist consistency. This is increasingly important because risk models such as the Tyrer-Cuzick v8 and Breast Cancer Surveillance Consortium 5-year risk models incorporate mammographic breast density into their prediction criteria (23,24).

There were limitations to our study. Our DL model was developed and trained with the reference standard being the original interpreting radiologist's assessment, which is known to be prone to inter- and intrareader variation. We acknowledge that acceptance of the DL density assessment was measured in an unblinded manner, and that "acceptance" is not equivalent to "truth" or "accuracy". Note, however, that the early studies showing an association between incremental breast density and increased breast cancer risk were based on subjective radiologist assessments, implying that the radiologist's subjective assessment carries validity as a marker for risk prediction (1,3). It is important to note that a reduction in the proportion of mammograms assessed as dense, as influenced by the DL model, is only meaningful if the assessments are accurate. The model would not be beneficial if women with dense breasts were inaccurately categorized as nondense (or conversely, if women with nondense breasts were inaccurately categorized as dense). Nevertheless, the high rate of clinical acceptance among both academic and community practice fellowship-trained breast imagers is evidence that the DL model provides reasonable guidance. While a secondary aim of this study, measuring the model's influence on radiologists warrants further study on a much larger volume of screening mammograms. Prior studies have used CT or MRI breast density as the reference standard to compare to automated breast density assessments, yet this approach is limited given the small sample sizes (100-200 patients) and sample bias toward high-risk patients (25,26). CT and MRI have the potential benefit of a more objective volumetric assessment of fibroglandular tissue volume compared to a subjective radiologist assessment of mammographic breast density; however, development of our model utilized mammograms in over 30,000 women at average risk, and using CT or MRI for this

Figure 1. (a) Comparison of the original interpreting radiologist assessment with the deep learning (DL) model assessment for four-way mammographic breast density classification for all radiologists. (b) Corresponding examples of mammograms with concordant and discordant assessments by the radiologist and the DL model.
purpose would not be feasible. It is important to point out that density assessment guidance has changed over time with the publication of the fifth edition of the BI-RADS manual in 2013; however, this was several years before our study took place and likely had little impact on our results (27). Finally, we did not directly measure the influence of the DL model on intrareader agreement, as clinical cases are not double read.

In conclusion, this external validation demonstrated high clinical acceptance of our DL model in routine clinical practice among both academic and community radiologists, suggesting our model may have widespread applicability. The DL model also influenced radiologists by significantly reducing the proportion of screening mammograms categorized as dense after DL model implementation. In the era of federally mandated breast density notification, consideration should be given to more widespread use of a DL model to predict mammographic breast density, as this tool could supply more accurate information to patients and help healthcare systems more appropriately utilize limited supplemental screening resources.

REFERENCES

1. Boyd NF, Byng JW, Jong RA, et al. Quantitative classification of mammographic densities and breast cancer risk: results from the Canadian National Breast Screening Study. J Natl Cancer Inst 1995; 87:670–675.
2. Carney PA, Miglioretti DL, Yankaskas BC, et al. Individual and combined effects of age, breast density, and hormone replacement therapy use on the accuracy of screening mammography. Ann Intern Med 2003; 138:168–175.
3. Whitehead J, Carlile T, Kopecky KJ, et al. Wolfe mammographic parenchymal patterns. A study of the masking hypothesis of Egan and Mosteller. Cancer 1985; 56:1280–1286.
4. Sprague BL, Conant EF, Onega T, et al. Variation in mammographic breast density assessments among radiologists in clinical practice: a multicenter observational study. Ann Intern Med 2016; 165:457–464. doi:10.7326/M15-2934.
5. Bahl M, Baker JA, Bhargavan-Chatfield M, et al. Impact of breast density notification legislation on radiologists' practices of reporting breast density: a multi-state study. Radiology 2016; 280:701–706. doi:10.1148/radiol.2016152457.
6. Spayne MC, Gard CC, Skelly J, et al. Reproducibility of BI-RADS breast density measures among community radiologists: a prospective cohort study. Breast J 2012; 18:326–333. doi:10.1111/j.1524-4741.2012.01250.x.
7. Berg WA, Campassi C, Langenberg P, et al. Breast imaging reporting and data system: inter- and intraobserver variability in feature analysis and final assessment. AJR Am J Roentgenol 2000; 174:1769–1777. doi:10.2214/ajr.174.6.1741769.
8. Ray KM, Price ER, Joe BN. Breast density legislation: mandatory disclosure to patients, alternative screening, billing, reimbursement. AJR Am J Roentgenol 2015; 204:257–260. doi:10.2214/AJR.14.13558.
9. Sobotka J, Hinrichs C. Breast density legislation: discussion of patient utilization and subsequent direct financial ramifications for insurance providers. J Am Coll Radiol 2015; 12:1011–1015. doi:10.1016/j.jacr.2015.04.015.
10. Kerlikowske K, Grady D, Barclay J, et al. Variability and accuracy in mammographic interpretation using the American College of Radiology Breast Imaging Reporting and Data System. J Natl Cancer Inst 1998; 90:1801–1809.
11. Youk JH, Gweon HM, Son EJ, et al. Automated volumetric breast density measurements in the era of the BI-RADS fifth edition: a comparison with visual assessment. AJR Am J Roentgenol 2016; 206:1056–1062. doi:10.2214/AJR.15.15472.
12. Brandt KR, Scott CG, Ma L, et al. Comparison of clinical and automated breast density measurements: implications for risk prediction and supplemental screening. Radiology 2016; 279:710–719. doi:10.1148/radiol.2015151261.
13. Bahl M, Barzilay R, Yedidia AB, et al. High-risk breast lesions: a machine learning model to predict pathologic upgrade and reduce unnecessary surgical excision. Radiology 2017:170549. doi:10.1148/radiol.2017170549.
14. Kohli M, Prevedello LM, Filice RW, et al. Implementing machine learning in radiology practice and research. AJR Am J Roentgenol 2017; 208:754–760. doi:10.2214/AJR.16.17224.
15. Lakhani P, Sundaram B. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 2017; 284:574–582. doi:10.1148/radiol.2017162326.
16. Geras K, Wolfson S, Shen Y, et al. High-resolution breast cancer screening with multi-view deep convolutional neural networks. arXiv.org, 2017.
17. Kallenberg M, et al. Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring. IEEE Trans Med Imaging 2016; 35:1322–1331.
18. Wu N, Geras KJ, Shen Y, et al. arXiv.org.
19. Lehman CD, Yala A, Schuster T, et al. Mammographic breast density assessment using deep learning: clinical implementation. Radiology 2018:180694. doi:10.1148/radiol.2018180694.
20. Kim DW, Jang HY, Kim KW, et al. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers. Korean J Radiol 2019; 20:405–410. doi:10.3348/kjr.2019.0025.
21. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016; pp. 770–778.
22. ACR Statement on Reporting Breast Density in Mammography Reports and Patient Summaries. American College of Radiology, 26 November 2017.
23. Brentnall AR, Cuzick J, Buist DSM, et al. Long-term accuracy of breast cancer risk assessment combining classic risk factors and breast density. JAMA Oncol 2018; 4:e180174. doi:10.1001/jamaoncol.2018.0174.
24. Vachon CM, Pankratz VS, Scott CG, et al. The contributions of breast density and common genetic variation to breast cancer risk. J Natl Cancer Inst 2015; 107. doi:10.1093/jnci/dju397.
25. Gubern-Mérida A, Kallenberg M, Platel B, et al. Volumetric breast density estimation from full-field digital mammograms: a validation study. PLoS One 2014; 9:e85952.
26. Wang J, Azziz A, Fan B, et al. Agreement of mammographic measures of volumetric breast density to MRI. PLoS One 2013; 8:e81653.
27. American College of Radiology. ACR BI-RADS Atlas: Mammography. 5th ed. Reston, VA; 2013.