
Clinical Imaging 65 (2020) 96–99


Understanding artificial intelligence based radiology studies: What is overfitting?

Simukayi Mutasa, Shawn Sun, Richard Ha
Columbia University Medical Center, New York Presbyterian Hospital, 622 West 168th Street, PB-1-301, New York, NY 10032, United States of America

Keywords: Overfitting; Artificial intelligence; Machine learning

Abstract

Artificial intelligence (AI) is a broad umbrella term used to encompass a wide variety of subfields dedicated to creating algorithms to perform tasks that mimic human intelligence. As AI development grows closer to clinical integration, radiologists will need to become familiar with the principles of artificial intelligence to properly evaluate and use this powerful tool. This series aims to explain certain basic concepts of artificial intelligence and their applications in medical imaging, starting with the concept of overfitting.

1. Introduction

Artificial intelligence (AI) is a broad umbrella term used to encompass a wide variety of subfields dedicated to, simply put, creating algorithms to perform tasks that mimic human intelligence. Machine learning is a subfield of artificial intelligence which involves the creation of algorithms that can parse data and modify themselves to produce a desired output. Machine learning was applied heavily to the field of computer vision, where it frequently relied on hand-crafted features such as edge detection algorithms or shape detectors. Deep learning is a type of machine learning that uses multiple layers to extract progressively higher level features as the algorithm is trained on structured data, creating its own composition of features which it determines to be important. There is much excitement around deep learning's ability to discover previously unknown relationships in data and to perform almost any complex mapping given correct training.

Since the success of AlexNet in the ImageNet challenge in 2012, deep learning algorithms have seen remarkable advancements in the field of medical imaging [1-3]. This technology has been applied to oncological detection, characterization, and monitoring in recent studies and has achieved impressive results [4-15]. Recently, we are beginning to see research results which suggest performance similar to, or better than, radiologists for various tasks [16]. As development grows closer to clinical integration, radiologists will need to become familiar with the principles of artificial intelligence to properly evaluate and use this powerful tool. This series aims to explain certain basic concepts of artificial intelligence and their applications in medical imaging, starting with the concept of overfitting.

"Radiologists will not be replaced by AI. Radiologists who use and understand AI will replace radiologists who don't."
Curt Langlotz [17]

2. Overfitting

Overfitting is a major obstacle for AI technology, but what exactly is overfitting? Burnham describes "the essence of overfitting is to have unknowingly extracted some of the residual variation as if that variation represented underlying model structure" [18]. In layman's terms, overfitting means that an AI model has learned in a manner that is only applicable to the training sample and is no longer generalizable to the overall population (Fig. 1).

For example, if an algorithm designed to distinguish between dogs and cats is trained only with the German shepherd dogs and Siamese cats in Fig. 2, it will perform well if subsequently tested only on German shepherd dogs and Siamese cats. However, if the algorithm is then asked to distinguish other types of dogs and cats (Fig. 3), which it has not seen before, its performance will decrease substantially.

We can recognize that certain characteristics, such as the white fur pattern and blue eye color of Siamese cats, pertain to that individual breed but are not representative of all cats. However, if the model only has access to the pictures in Fig. 2, it will have no way of learning this piece of knowledge. Medical studies will similarly require diverse training samples in order to draw generalizable conclusions that will apply to the population at large.


Corresponding author at: Columbia University Medical Center, New York Presbyterian Hospital, 622 West 168th Street, PB-1-301, New York, NY 10032, United States of America.
E-mail addresses: stm9116@nyp.org (S. Mutasa), shs2179@cumc.columbia.edu (S. Sun), rh2616@columbia.edu (R. Ha).
https://doi.org/10.1016/j.clinimag.2020.04.025
Received 9 March 2020; Received in revised form 10 April 2020; Accepted 17 April 2020
© 2020 Elsevier Inc. All rights reserved.

Fig. 1. A graphical representation of overfitting. (Left) The trend line hasn't learned enough patterns from the data and has failed to capture the dominant trend.
(Middle) The trend line is a good fit for this set of data. (Right) The trend line has learned too many patterns and has lost the dominant trend. This algorithm would
not be generalizable to new data.

Fig. 2. Example of German shepherd dogs and Siamese cats.

Fig. 3. Example of different types of dogs and cats.
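The behavior illustrated in Fig. 1 is easy to reproduce numerically. The following is a minimal sketch (not from the paper) that fits polynomials of increasing degree to a noisy synthetic trend; the sine-wave "ground truth" and the specific degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)  # trend + noise
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)  # the true underlying trend

for degree in (1, 4, 12):  # underfit, reasonable fit, overfit (cf. Fig. 1)
    # Least-squares polynomial fit (high degrees may warn about conditioning).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-12 fit drives the training error toward zero by chasing the noise, while its error on new points grows: the numerical signature of the right-hand panel of Fig. 1.
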

2.1. Digging deeper

A typical deep learning algorithm consists of a neural network architecture that begins with an input, connects to layers of nodes, and ultimately ends in an output. During the learning process, the parameters or "weights" within the nodes are continuously adjusted to minimize the difference between the output and the correct answer (ground truth). This extracts the features in the input that were most important to solving the question. However, if a model is run through the same training data too many times without enough regularization, the model will inevitably capture residual variation, or noise, as features and interpret these as parameters useful for prediction, thus decreasing the overall generalizability [18,19].
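The update loop described above can be written in a few lines. Below is a hypothetical PyTorch sketch (PyTorch and the random stand-in data are assumptions, not the authors' code) that adjusts the weights to minimize a loss while watching a held-out validation set, where overfitting shows up as a widening train/validation gap.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in data: 200 training and 50 validation "cases" with random labels.
x_tr, y_tr = torch.randn(200, 64), torch.randint(0, 2, (200,))
x_va, y_va = torch.randn(50, 64), torch.randint(0, 2, (50,))

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x_tr), y_tr)  # difference between output and ground truth
    loss.backward()                    # gradient of the loss w.r.t. every weight
    optimizer.step()                   # adjust the weights to shrink the loss

    with torch.no_grad():
        val_loss = loss_fn(model(x_va), y_va)
    if epoch % 50 == 0:
        # Training loss keeps falling even on pure noise, while validation
        # loss stalls or rises: the signature of overfitting.
        print(f"epoch {epoch}: train {loss.item():.3f}, val {val_loss.item():.3f}")
```
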
3. Overcoming overfitting

The most effective way to mitigate overfitting is to collect more training data. Ideally, the training data would be truly representative of the overall population. In the case of distinguishing cats and dogs, examples of many breeds of dogs and cats would be necessary in the training set (Fig. 3). In the original ImageNet competition, where deep learning neural networks first publicly demonstrated their power, researchers had the luxury of 1.4 million images to work with [21]. These datasets are orders of magnitude larger than the medical image datasets most AI radiology studies use, one major reason why current medical studies may be prone to overfitting [22,23]. Though there is no set amount of "sufficient" training data, as this number will vary depending on the study and the question asked, a general starting point of around 1000 images per class will have a good chance of training a classifier accurately; this is based largely on the ImageNet database, which contained around 1000 images per class. Many factors can reduce this requirement, including the architecture type, how representative of the population the training data is, how distinguishable the different classes are, and the methods employed for regularization.

Medical data has historically been difficult to amass due to concerns about patient confidentiality and the cost of obtaining high quality ground-truth annotated data [22,23].


However, the efforts of large data archives, such as The Cancer Imaging Archive (TCIA), are making de-identified, public medical imaging data more accessible to researchers around the world [24]. A recent Google AI project utilized the National Lung Screening Trial (NLST) data, including 42,000 CT cases [16]. The 2019 RSNA AI Challenge for Intracranial Hemorrhage Detection and Classification assembled a dataset of over 25,000 annotated brain CTs [25]. The success of these recent large dataset studies will undoubtedly foster future efforts to provide large data to AI researchers.

3.1. Technical solutions

Researchers have also developed several creative workarounds to use limited training data efficiently and prevent overfitting. One strategy, data augmentation, artificially increases the size of a training dataset by creating image variants from the original dataset. This can involve random rigid affine transformations, such as flipping, rotation, cropping, and skewing, or even the introduction of artifacts, to diversify the images without straying too far from the original label, as shown in Fig. 4 [22,26].

Fig. 4. Data augmentation: artificially increasing the size of a training dataset by creating image variants from the original dataset, such as flipping, rotation, cropping, and skewing, without straying too far from the original label.

Kim et al., in their effort to detect fractures, used this technique to create 11,112 training images from an original dataset of only 1389 images [27]. Warped images, while providing variation, do not provide the same degree of data enrichment that additional, separate examples do.

In a study published in 2019 using a CNN to predict breast cancer molecular subtype [15], Ha et al. used this augmentation technique to increase sample size: Fig. 5 shows a single input example of an enhancing mass with multiple random affine warps applied. This technique alters the mass slightly with a rigid transformation, effectively creating additional unique samples of the same mass to be inputted to the network.
Another strategy is transfer learning, which involves taking a network that has already been trained on a different dataset, such as the ImageNet collection, and fine-tuning it with a small amount of additional training data to direct the network to a specific task [11]. This utilizes the concept that visual tasks require similar processing, for example, the recognition of edges and simple shapes. Similarly, deep features, or off-the-shelf features, can be extracted from pre-trained CNNs and then applied to a new task [29]. Several groups have demonstrated increased performance using transfer learning [30-32]. Typically, transfer learning works best when the data the original network was trained on is similar to the new data. As such, there is no current consensus that networks pre-trained to recognize cats and dogs and then fine-tuned on medical images perform consistently better than networks trained from random initializations.
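A minimal transfer-learning sketch, again assuming PyTorch/torchvision (illustrative, not the authors' implementation): load ImageNet weights, freeze the feature extractor, and retrain only a new task-specific head.

```python
import torch
from torchvision import models

# Start from a network pre-trained on ImageNet (torchvision 0.13+ weights API).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False  # freeze the pre-trained feature extractor

# Replace the 1000-class ImageNet head with a new two-class head.
model.fc = torch.nn.Linear(model.fc.in_features, 2)

# Fine-tuning updates only the new head's parameters.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

With more target data, some or all of the frozen layers can instead be unfrozen and fine-tuned at a lower learning rate.
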
Another approach to limiting overfitting is to modify the learning algorithm so that the model generalizes better, a strategy known as regularization [33,34]. One such technique, dropout, limits the capacity of the network.

3.2. Digging deeper

Dropout is based on the insight that an ensemble of neural networks, each trained on the same data but with slightly different considerations, generalizes better. By randomly removing nodes during training, dropout effectively trains a large number of different network architectures. The theory behind this is that the architecture of a neural network can compensate for individual nodal deficiencies; this compensation, however, does not generalize to unseen data, and disrupting the architecture slightly by dropping random nodes removes the compensation effect [35].
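In code, dropout is a single layer; the sketch below (illustrative architecture and dropout rate, not from the paper) zeroes a random half of the hidden activations on each training pass, so every pass effectively trains a different sub-network.

```python
from torch import nn

model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # each hidden activation is dropped with probability 0.5
    nn.Linear(128, 2),
)

# model.train() enables the random dropping; model.eval() disables it, so the
# full, ensemble-averaged network is used at test time.
```
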
Another technique for regularization is L1/L2 regularization. As a neural network learns, each feature is assigned a weight that determines how significant that feature is. L1/L2 regularization limits the magnitude of these weights so that no single feature overwhelms the others [36]. This has the effect of encouraging agreement among many input features when coming to a conclusion, as opposed to preferring the input of a few overwhelming features.
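A sketch of both penalties in PyTorch (the penalty strengths are illustrative assumptions): L2 regularization is built into most optimizers as weight_decay, while an L1 term can be added to the loss by hand.

```python
import torch
from torch import nn

model = nn.Linear(64, 2)
loss_fn = nn.CrossEntropyLoss()

# L2 regularization: weight_decay shrinks every weight toward zero at each step.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

x, y = torch.randn(32, 64), torch.randint(0, 2, (32,))  # stand-in batch
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss = loss + 1e-4 * sum(p.abs().sum() for p in model.parameters())  # L1 penalty
loss.backward()
optimizer.step()
```
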
Batch normalization (BN), initially developed to mitigate a game of "broken telephone" that can occur when layers in a neural network are unable to learn simultaneously, also has a small regularizing effect. The theoretical motivation for batch normalization is outside the scope of this paper; in brief, it forces the activation maps to be normalized to a learned mean and standard deviation. Because these statistics are estimated on each mini-batch, this introduces noise to the input of each layer in the neural network, which produces a regularization effect [37].
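As a sketch, batch normalization is inserted between a layer and its activation (an illustrative block, not from the paper):

```python
from torch import nn

block = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),  # re-normalize the 16 activation maps to a learned mean/std
    nn.ReLU(),
)

# The normalizing statistics are computed per mini-batch, so each example's
# output depends on which other examples were sampled with it; this
# batch-to-batch jitter is the source of the mild regularizing effect.
```
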

4. Importance of external validation

The exciting results of recent AI radiology studies certainly generate much anticipation towards a future where radiologists utilize AI to better save lives [38]. However, the pitfall of overfitting highlights the need for external validation of AI before clinical implementation. There have already been cases of neural network performance degrading on data from a different institution [39,40]. To prove the validity of their results to clinicians, deep neural networks need to demonstrate performance on external data different from their training data. Some researchers have even emphasized the need for prospective, multi-center cohort studies [41] and for holding AI technology to the same level of scrutiny as new clinical drugs. Undoubtedly, the field of AI in medical imaging is still in its infancy, as studies achieving that level of validation are extremely rare [42].
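In practice, external validation reduces to scoring a frozen model on data it never influenced. A hypothetical sketch (the model and data loaders here are assumptions):

```python
import torch

def accuracy(model, loader):
    """Fraction of correct predictions over a test set the model never saw."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# internal_acc = accuracy(model, internal_test_loader)  # same-institution hold-out
# external_acc = accuracy(model, external_test_loader)  # outside institution
# A large drop from internal to external accuracy suggests the model learned
# institution-specific features rather than generalizable ones.
```
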

Fig. 5. Images of a single input example of an enhancing mass with multiple random affine warps applied for data augmentation.

5. Conclusion


Overfitting is a common pitfall in which AI models capture noise or superficial information rather than truly distinguishing disease. Models that are overfitted will have high training performance but severely decreased accuracy upon encountering new data. This can be overcome by increasing the amount of training data, by data augmentation, or by several other techniques such as regularization and dropout. Before AI algorithms can be incorporated into clinical use, external validation will be necessary to ensure generalizability.

Declaration of competing interest

No disclosures. No conflict of interest.

References

[1] Langlotz CP, Allen B, Erickson BJ, et al. A roadmap for foundational research on artificial intelligence in medical imaging: from the 2018 NIH/RSNA/ACR/The Academy Workshop. Radiology 2019;291(3):781-91. https://doi.org/10.1148/radiol.2019190613.
[2] Thrall JH, Li X, Li Q, et al. Artificial intelligence and machine learning in radiology: opportunities, challenges, pitfalls, and criteria for success. J Am Coll Radiol 2018;15(3 Pt B):504-8. https://doi.org/10.1016/j.jacr.2017.12.026.
[3] Hosny A, Parmar C, Quackenbush J, et al. Artificial intelligence in radiology. Nat Rev Cancer 2018;18(8):500-10.
[4] Huang X, Shan J, Vaidya V. Lung nodule detection in CT using 3D convolutional neural networks. IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). 2017. p. 379-83.
[5] Tsehay YK, Lay NS, Roth HR, et al. Convolutional neural network based deep-learning architecture for prostate cancer detection on multiparametric magnetic resonance images. Proceedings of SPIE 2017. https://doi.org/10.1117/12.2254423.
[6] Kooi T, Litjens G, van Ginneken B, et al. Large scale deep learning for computer aided detection of mammographic lesions. Med Image Anal 2017;35:303-12.
[7] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015. p. 3431-40. Boston, MA, USA.
[8] Moeskops P, Wolterink JM, van der Velden BHM, et al. Deep learning for multi-task medical image segmentation in multiple modalities. Medical Image Computing and Computer-Assisted Intervention (MICCAI 2016). 2016. p. 478-86. Athens, Greece.
[9] Cheng JZ, Ni D, Chou YH, et al. Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans. Sci Rep 2016;6:24454. https://doi.org/10.1038/srep24454.
[10] Ding Y, Sohn JH, Kawczynski MG, et al. A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain. Radiology 2019;290(2):456-64. https://doi.org/10.1148/radiol.2018180958.
[11] Mazurowski MA, Buda M, Saha A, et al. Deep learning in radiology: an overview of the concepts and a survey of the state of the art with focus on MRI. J Magn Reson Imaging 2019;49:939-54. https://doi.org/10.1002/jmri.26534.
[12] Patriarche JW, Erickson BJ. Part 1. Automated change detection and characterization in serial MR studies of brain-tumor patients. J Digit Imaging 2007;20:203-22.
[13] Ha R, Chang P, Karcich J, et al. Axillary lymph node evaluation utilizing convolutional neural networks using MRI dataset. J Digit Imaging 2018;31(6):851-6. https://doi.org/10.1007/s10278-018-0086-7.
[14] Ha R, Mutasa S, Sant EP, et al. Accuracy of distinguishing atypical ductal hyperplasia from ductal carcinoma in situ with convolutional neural network-based machine learning approach using mammographic image data. Am J Roentgenol 2019;212(5):1166-71. https://doi.org/10.2214/AJR.18.20250.
[15] Ha R, Mutasa S, Karcich J, et al. Predicting breast cancer molecular subtype with MRI dataset utilizing convolutional neural network algorithm. J Digit Imaging 2019;32(2):276-82. https://doi.org/10.1007/s10278-019-00179-2.
[16] Ardila D, Kiraly AP, Bharadwaj S, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 2019;25:954-61. https://doi.org/10.1038/s41591-019-0447-x.
[17] Langlotz C. RSNA annual meeting. November 27, 2017.
[18] Burnham KP, Anderson DR. Model selection and multimodel inference. 2nd ed. Springer-Verlag; 2002.
[19] England JR, Cheng PM. Artificial intelligence for medical image analysis: a guide for authors and reviewers. Am J Roentgenol 2018;212(3):513-9. https://doi.org/10.2214/AJR.18.20490.
[21] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. arXiv:1409.0575v3 [cs.CV]; 2015.
[22] Chartrand G, Cheng PM, Vorontsov E, et al. Deep learning: a primer for radiologists. RadioGraphics 2017;37(7):2113-31. https://doi.org/10.1148/rg.2017170077.
[23] Parmar C, Barry JD, Hosny A, et al. Data analysis strategies in medical imaging. Clin Cancer Res 2018;24(15):3492-9. https://doi.org/10.1158/1078-0432.CCR-18-0385.
[24] Clark K, Vendt B, Smith K, et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging 2013;26(6):1045-57. https://doi.org/10.1007/s10278-013-9622-7.
[25] AI challenge. RSNA. https://www.rsna.org/en/education/ai-resources-and-training/ai-image-challenge. Accessed 16 January 2020.
[26] Yamashita R, Nishio M, Do RKG, et al. Convolutional neural networks: an overview and application in radiology. Insights Imaging 2018;9(4):611-29.
[27] Kim DH, Mackinnon T. Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks. Clin Radiol 2018;73:439-45.
[29] Paul R, Schabath M, Balagurunathan Y, et al. Explaining deep features using radiologist-defined semantic features and traditional quantitative features. Tomography 2019;5(1):192-200. https://doi.org/10.18383/j.tom.2018.00034.
[30] Nishio M, Sugiyama O, Yakami M, et al. Computer-aided diagnosis of lung nodule classification between benign nodule, primary lung cancer, and metastatic lung cancer at different image size using deep convolutional neural network with transfer learning. PLoS One 2018;13(7):e0200721. https://doi.org/10.1371/journal.pone.0200721.
[31] Samala RK, Chan HP, Hadjiiski LM, et al. Multi-task transfer learning deep convolutional neural network: application to computer-aided diagnosis of breast cancer on mammograms. Phys Med Biol 2017;62(23):8894-908. https://doi.org/10.1088/1361-6560/aa93d4.
[32] Maqsood M, Nazir F, Khan U, et al. Transfer learning assisted classification and detection of Alzheimer's disease stages using 3D MRI scans. Sensors (Basel) 2019;19(11):2645. https://doi.org/10.3390/s19112645.
[33] Sra S, Nowozin S, Wright SJ. Optimization for machine learning. MIT Press; 2012.
[34] Bell RM, Koren Y. Lessons from the Netflix prize challenge. SIGKDD Explor Newsl 2007;9(2):75-9.
[35] Srivastava N, Hinton GE, Krizhevsky A, et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929-58.
[36] Nowlan SJ, Hinton GE. Simplifying neural networks by soft weight-sharing. Neural Comput 1992;4(4).
[37] Luo P, Wang X, Shao W, et al. Towards understanding regularization in batch normalization. 7th International Conference on Learning Representations (ICLR); 2019.
[38] Jha S, Topol EJ. Adapting to artificial intelligence: radiologists and pathologists as information specialists. JAMA 2016;316(22):2353-4. https://doi.org/10.1001/jama.2016.17438.
[39] Zech JR, Badgeley MA, Liu M, et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med 2018;15(11):e1002683. https://doi.org/10.1371/journal.pmed.1002683.
[40] Yasaka K, Abe O. Deep learning and artificial intelligence in radiology: current applications and future directions. PLoS Med 2018;15(11):e1002707. https://doi.org/10.1371/journal.pmed.1002707.
[41] Park SH, Do KH, Choi JI, et al. Principles for evaluating the clinical implementation of novel digital healthcare devices. J Korean Med Assoc 2018;61:765-75.
[42] Kim DW, Jang HY, Kim KW, et al. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images: results from recently published papers. Korean J Radiol 2019;20(3):405-10. https://doi.org/10.3348/kjr.2019.0025.
