
ORIGINAL REPORTS

Multi-Institutional Validation of Deep Learning for Pretreatment Identification of Extranodal Extension in Head and Neck Squamous Cell Carcinoma
Benjamin H. Kann, MD1; Daniel F. Hicks, MD2; Sam Payabvash, MD3; Amit Mahajan, MD3; Justin Du, BS4; Vishal Gupta, MD2;
Henry S. Park, MD, MPH4; James B. Yu, MD4; Wendell G. Yarbrough, MD, MMHC5; Barbara A. Burtness, MD6; Zain A. Husain, MD7; and
Sanjay Aneja, MD4
abstract

PURPOSE Extranodal extension (ENE) is a well-established poor prognosticator and an indication for adjuvant
treatment escalation in patients with head and neck squamous cell carcinoma (HNSCC). Identification of ENE
on pretreatment imaging represents a diagnostic challenge that limits its clinical utility. We previously developed
a deep learning algorithm that identifies ENE on pretreatment computed tomography (CT) imaging in patients
with HNSCC. We sought to validate our algorithm performance for patients from a diverse set of institutions and
compare its diagnostic ability to that of expert diagnosticians.
METHODS We obtained preoperative, contrast-enhanced CT scans and corresponding pathology results from
two external data sets of patients with HNSCC: an external institution and The Cancer Genome Atlas (TCGA)
HNSCC imaging data. Lymph nodes were segmented and annotated as ENE-positive or ENE-negative on the
basis of pathologic confirmation. Deep learning algorithm performance was evaluated and compared directly to
two board-certified neuroradiologists.
RESULTS A total of 200 lymph nodes were examined in the external validation data sets. For lymph nodes from
the external institution, the algorithm achieved an area under the receiver operating characteristic curve (AUC)
of 0.84 (83.1% accuracy), outperforming radiologists’ AUCs of 0.70 and 0.71 (P = .02 and P = .01). Similarly, for lymph nodes from the TCGA, the algorithm achieved an AUC of 0.90 (88.6% accuracy), outperforming radiologist AUCs of 0.60 and 0.82 (P < .0001 and P = .16). Radiologist diagnostic accuracy improved when
receiving deep learning assistance.
CONCLUSION Deep learning successfully identified ENE on pretreatment imaging across multiple institutions,
exceeding the diagnostic ability of radiologists with specialized head and neck experience. Our findings suggest
that deep learning has utility in the identification of ENE in patients with HNSCC and has the potential to be
integrated into clinical decision making.
J Clin Oncol 38. © 2019 by American Society of Clinical Oncology

ASSOCIATED CONTENT
Data Supplement
Author affiliations and support information (if applicable) appear at the end of this article. Accepted on November 19, 2019 and published at jco.org on December 9, 2019: DOI https://doi.org/10.1200/JCO.19.02031

INTRODUCTION

Head and neck squamous cell carcinoma (HNSCC) is diagnosed in > 550,000 patients annually worldwide and leads to > 300,000 deaths.1 Treatment options for locally advanced HNSCC include definitive chemoradiation or up-front surgery followed by adjuvant management dictated by pathologic risk factors.2,3

Tumor extranodal extension (ENE), which occurs when tumor infiltrates through the lymph node capsule, is a well-established poor prognostic factor that was recently incorporated into the American Joint Committee on Cancer 8th edition staging system for HNSCC.4 Furthermore, pathologic ENE is an indication for adjuvant treatment escalation, with the addition of chemotherapy to dose-intensified radiotherapy. Trimodality therapy, which is associated with increased treatment-related morbidity and health care costs, has not been shown to improve disease control or survival compared with chemoradiation alone.5-9 Therefore, patients with ENE, including those with treatment-sensitive, human papilloma virus (HPV)-related HNSCC, may be better served with a nonsurgical approach.10

Currently, a barrier to incorporating ENE in clinical decision making is the inability to diagnose it on diagnostic imaging. As a result, trimodality therapy is commonly indicated because of unexpected discovery of pathologic ENE.11-14 There is a need to develop better methods to detect ENE in the pretreatment setting to guide patient staging, risk stratification, and


appropriate management. However, thus far, attempts to reliably identify ENE by clinicians using conventional imaging methods have been unsuccessful.15-20

Deep learning, a machine learning technique in the field of artificial intelligence that uses layered neural networks to analyze data, has emerged as a promising tool for medical image analysis and has shown diagnostic performance similar to trained diagnosticians.21-24 Previously, we developed a deep learning (DL) algorithm that successfully identified pathologic ENE from pretreatment CT imaging in an institutional patient cohort.25 In the current study, we sought to validate the algorithm performance on patients with HNSCC from different institutions, directly compare model performance to board-certified radiologists, and evaluate the benefit of using the algorithm to assist diagnosticians.

METHODS

The study was approved by the institutional review boards of the participating institutions, and each granted an exemption and waiver of informed consent. The report is in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis Statement (type 4).26

Deep Learning Algorithm Development and Internal Validation

Deep learning for medical image analysis uses raw pixel data as input that is then passed through a neural network of progressively more complex representations of that data to generate a label prediction.24 During the neural network training process, a mathematical algorithm with millions of hierarchically layered parameters is iteratively trained and tested on labeled data sets, with the goal of minimizing the error of prediction versus the “true” label. After training on labeled data, neural networks can be used to predict labels on new, unseen data. A specific type of neural network, the convolutional neural network, has achieved excellent performance in image classification and object identification problems, surpassing performance of other machine learning algorithms,23,27,28 and this was chosen as the framework for development of the DL algorithm, DualNet, described in detail previously.25

In prior work, the DL algorithm was trained on a data set of 2,875 segmented and labeled lymph nodes on CT scan for patients with HNSCC who underwent lymph node dissection at Yale–New Haven Hospital from May 2013 to March 2017. Details regarding training data and lymph node segmentation and labeling are found in the Data Supplement. The DL algorithm achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.91 for ENE identification when evaluated on a blinded test set from the same institution.25 In the current study, we tested the DL algorithm on two external validation data sets for ENE identification (Fig 1).

External Validation Data Sets

For external validation of the algorithm, we used data sets from Mount Sinai Hospital (New York, NY) and The Cancer Genome Atlas (TCGA) HNSCC imaging data accessed through The Cancer Imaging Archive (TCIA). The Mount Sinai Hospital data set was compiled from a retrospective database of patients diagnosed with HNSCC who underwent neck lymph node dissection (LND) from 2006 to 2017. Patients were included if they had a preoperative CT scan of the neck performed with intravenous contrast within 2 months of LND. Patients with prior history of neck surgery or radiation therapy were excluded. Nodal ENE status was determined on the basis of pathology reports (Data Supplement). De-identified clinical information was captured at the patient and lymph node levels.

To assess the generalizability of the deep learning algorithm, a second external data set was aggregated from the TCGA Head-Neck Squamous Cell Carcinoma data collection with linked radiologic data housed by TCIA.29,30 The TCGA data set represents a geographically diverse patient population across a variety of CT scanners and imaging protocols. Patients in this cohort were diagnosed with HNSCC and underwent LND from 1993 to 2013 at seven institutions. The results derived from this data set are in whole or part on the basis of data generated by the TCGA Research Network.31 Details regarding CT scan characteristics are found in the Data Supplement.

Deep Learning Performance and Evaluation

We calculated the necessary sample size of labeled lymph nodes for each data set to be at least 70 lymph nodes to evaluate the study’s primary end point, AUC of the ROC curve (Data Supplement).

For model testing, the segmented lymph node region-of-interest data were preprocessed and input into the DL algorithm in single-unit batches. The output probabilities for ENE on a node-by-node basis were compared against the respective ground truth label. ROC graphs were generated. Differences in AUC values were compared using the DeLong method.32 The probability value threshold selected to calculate secondary end points of sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and raw accuracy was the one that maximized the Youden index (YI = sensitivity + specificity − 1) on previous internal validation. P values were two-sided, and a value < .05 was considered statistically significant. Statistical analyses were performed in Python v3.6.
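As a concrete illustration of this operating-point procedure, the sketch below freezes the Youden-optimal cutoff on one set of predictions and computes the secondary metrics at that cutoff. The data and variable names are synthetic placeholders, not the study's analysis code.

```python
# Minimal sketch of Youden-index threshold selection and the secondary
# metrics reported in this study. Illustrative data only.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def youden_threshold(y_true, y_prob):
    """Cutoff maximizing sensitivity + specificity - 1 (i.e., tpr - fpr)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    return thresholds[np.argmax(tpr - fpr)]

def operating_point_metrics(y_true, y_prob, threshold):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "AUC": roc_auc_score(y_true, y_prob),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "accuracy": (tp + tn) / len(y_true),
    }

# Toy example: fix the cutoff on internal validation, then apply it unchanged.
rng = np.random.default_rng(0)
y_internal = rng.integers(0, 2, 200)
p_internal = np.clip(y_internal * 0.4 + rng.normal(0.3, 0.2, 200), 0, 1)
cutoff = youden_threshold(y_internal, p_internal)
print(operating_point_metrics(y_internal, p_internal, cutoff))
```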


[Figure 1: schematic of DualNet inputs (preoperative CT, segmented lymph node region of interest, size-preserving and size-invariant representations) and outputs (probability of nodal metastasis, probability of ENE).]

FIG 1. Deep learning algorithm framework. For each patient computed tomography scan, the algorithm uses 3-dimensional (3D) segmented lymph node region-of-interest inputs with two representations: a size-preserving input and a size-invariant input. These inputs are fed into the DualNet 3D convolutional neural network (CNN), which outputs a probability of nodal metastasis and extranodal extension (ENE) for each input lymph node.
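To make the two-branch design in the caption concrete, here is a minimal PyTorch sketch of how such a dual-input 3D CNN can be wired. Layer counts, channel widths, and class names are invented for illustration; this does not reproduce the published DualNet architecture or its weights.

```python
# Hypothetical dual-branch 3D CNN sketch inspired by the Fig 1 description.
import torch
import torch.nn as nn

class Branch3D(nn.Module):
    """One 3D convolutional branch over a single-channel CT ROI volume."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # pool to a fixed-length feature vector
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (batch, 32)

class DualNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.size_preserving = Branch3D()  # ROI at fixed voxel spacing
        self.size_invariant = Branch3D()   # ROI rescaled to a fixed grid
        self.head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x_preserving, x_invariant):
        feats = torch.cat([self.size_preserving(x_preserving),
                           self.size_invariant(x_invariant)], dim=1)
        # Two independent probabilities: nodal metastasis and ENE.
        return torch.sigmoid(self.head(feats))

model = DualNetSketch()
roi_fixed_spacing = torch.randn(1, 1, 32, 32, 32)  # single-unit batch, as in testing
roi_fixed_grid = torch.randn(1, 1, 32, 32, 32)
p_metastasis, p_ene = model(roi_fixed_spacing, roi_fixed_grid)[0]
```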

Observer Evaluation and Comparison With Deep Learning

ENE identification by two board-certified radiologists (R1 and R2), with fellowship training and Certificates of Added Qualification in Neuroradiology and 20 years of combined head and neck cancer experience, was evaluated and compared directly to the DL algorithm across both validation data sets. The observer reviews were conducted using OsirixMD (Pixmeo, Switzerland) software during single sessions. The lymph node segmentations were overlaid on the entire CT scan and made available for review. Each review was conducted in isolation, and radiologists were blinded to the segmentation labels. Performance was evaluated with AUC, sensitivity, specificity, PPV, NPV, and raw accuracy. Cohen κ score was used to evaluate interobserver agreement.

To evaluate the incremental benefit of the DL algorithm to diagnostic radiologists, diagnostician review was repeated with access to the DL algorithm ENE prediction probability alongside each lymph node. Each observer was then given the opportunity to revise his or her initial ENE prediction and record a new one. Deep learning–assisted performance was compared with that of the observers alone with the same metrics as above.

RESULTS

Patient and Lymph Node Characteristics

Patient and lymph node–level characteristics are found in Table 1. From 82 patients included in the Mount Sinai data set, 130 lymph nodes were segmented and labeled. Of these, 21 (16.2%) were malignant with ENE. Median short-axis diameter (SAD) was 2.3 cm (range, 1.4-3.6 cm) for ENE, 2.0 cm (range, 1.2-4.5 cm) for malignant without ENE, and 1.2 cm (range, 0.6-1.9 cm) for benign nodes. Of all lymph nodes, 115 (88.5%) had SAD ≥ 1 cm.

For the TCIA-TCGA data, 70 lymph nodes were labeled from 62 patients. Of segmented nodes, 17 (24.3%) were malignant with ENE. Median SAD was 2.3 cm (range, 1.1-3.5 cm) for ENE, 1.5 cm (range, 1.1-2.5 cm) for malignant without ENE, and 1.2 cm (range, 0.6-1.7 cm) for benign nodes. Of all lymph nodes, 65 (92.9%) had SAD ≥ 1 cm.

Deep Learning Performance on External Data Sets

The DL algorithm yielded an AUC of 0.84 (95% CI, 0.75 to 0.93) for ENE identification on the Mount Sinai data set (Fig 2) and 0.90 (95% CI, 0.81 to 0.99) on the TCIA-TCGA data set (Fig 3). Using the maximized YI threshold probability, sensitivity was 0.71 and 0.82, specificity was 0.85 and 0.91, and accuracy was 83.1% and 88.6% on the Mount Sinai and TCIA-TCGA data sets, respectively (Table 2). The algorithm was executed at a rate of 3.4 seconds per case on the desktop central processing unit and < 1 second per case on the graphics processing unit (Data Supplement).

Observer Performance and Comparison With Deep Learning on External Data Sets

The radiologist observers yielded AUCs of 0.70 (95% CI, 0.59 to 0.82) and 0.71 (95% CI, 0.60 to 0.82) on the Mount Sinai data set, and 0.60 (95% CI, 0.49 to 0.71) and 0.82 (95% CI, 0.71 to 0.94) on the TCIA-TCGA data set. Sensitivities for the radiologists were 0.62 and 0.67, specificities were 0.79 and 0.75, and accuracies were 76.2% and 73.8% on the Mount Sinai data set. Sensitivities were 0.24 and 0.71, specificities 0.96 and 0.94, and accuracies 78.6% and 88.6% on the TCIA-TCGA data set. Interobserver agreement was moderate for the Mount Sinai data set (κ, 0.43) and low for the TCIA-TCGA data set (κ, 0.29). The observers made predictions at a mean rate of 37.7 seconds per case (range, 34.3-41.1 seconds).

AUC of the DL algorithm was superior to that of both observers (P = .01 [DL v R1]; P = .02 [DL v R2]) for the external institution data set. For the TCIA-TCGA data set, AUC of the DL algorithm was superior to R1 (P < .0001) and numerically higher than that of R2, although this difference did not reach statistical significance (P = .16).
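The AUC P values above come from the DeLong method cited in Methods.32 Below is a hedged NumPy sketch of that test for two correlated AUCs evaluated on the same lymph nodes, written from the published formulas; variable names are illustrative and this is not the study's analysis code.

```python
# Sketch of DeLong's test for two correlated ROC AUCs (DeLong et al, 1988).
import numpy as np
from scipy.stats import norm

def _placements(pos_scores, neg_scores):
    """Midrank placement values: V10 per positive case, V01 per negative case."""
    v10 = np.array([(np.sum(p > neg_scores) + 0.5 * np.sum(p == neg_scores))
                    / len(neg_scores) for p in pos_scores])
    v01 = np.array([(np.sum(pos_scores > q) + 0.5 * np.sum(pos_scores == q))
                    / len(pos_scores) for q in neg_scores])
    return v10, v01

def delong_test(y_true, scores_a, scores_b):
    """AUCs and two-sided P value for paired predictions on the same cases."""
    y_true = np.asarray(y_true)
    scores_a, scores_b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    pos, neg = y_true == 1, y_true == 0
    v10a, v01a = _placements(scores_a[pos], scores_a[neg])
    v10b, v01b = _placements(scores_b[pos], scores_b[neg])
    auc_a, auc_b = v10a.mean(), v10b.mean()
    s10, s01 = np.cov(v10a, v10b), np.cov(v01a, v01b)  # 2x2 covariances
    var_diff = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / pos.sum() \
             + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / neg.sum()
    z = (auc_a - auc_b) / np.sqrt(var_diff)
    return auc_a, auc_b, 2 * norm.sf(abs(z))

# e.g., delong_test(ene_labels, dl_probabilities, radiologist_scores)
```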


TABLE 1. Study Patient and Lymph Node Characteristics, No. (%)

Characteristic | Model Development: Patients (N = 270) | Model Development: Lymph Nodes (n = 653) | Mount Sinai Validation: Patients (N = 82) | Mount Sinai Validation: Lymph Nodes (n = 130) | TCIA-TCGA Validation: Patients (N = 62) | TCIA-TCGA Validation: Lymph Nodes (n = 70)

Primary cancer site
Oropharynx | 72 (26.7) | 178 (27.3) | 41 (50.0) | 71 (54.6) | 1 (1.6) | 1 (1.4)
Oral cavity | 106 (39.3) | 251 (38.4) | 32 (39.0) | 44 (33.8) | 51 (82.3) | 59 (84.3)
Larynx/hypopharynx/nasopharynx | 48 (17.8) | 126 (19.3) | 9 (11.0) | 15 (11.6) | 10 (16.1) | 10 (14.3)
Salivary gland | 18 (6.7) | 36 (5.5) | 0 (0) | 0 (0) | 0 (0) | 0 (0)
Unknown/other | 26 (9.6) | 62 (9.5) | 0 (0) | 0 (0) | 0 (0) | 0 (0)

Pathologic T stage
T0 | 5 (1.9) | 17 (2.6) | 0 (0) | 0 (0) | 0 (0) | 0 (0)
T1 | 36 (13.3) | 91 (13.9) | 25 (30.5) | 38 (29.2) | 5 (8.1) | 5 (7.1)
T2 | 72 (26.7) | 172 (26.3) | 32 (39.0) | 53 (40.8) | 16 (25.8) | 18 (25.7)
T3 | 37 (13.7) | 94 (14.4) | 9 (11.0) | 15 (11.5) | 19 (30.6) | 22 (31.4)
T4 | 44 (16.3) | 107 (16.4) | 16 (19.5) | 24 (18.5) | 21 (33.9) | 24 (34.4)
Unknown | 76 (28.2) | 172 (26.3) | 0 (0) | 0 (0) | 1 (1.6) | 1 (1.4)

Pathologic N stage
N0 | 83 (30.7) | 185 (28.3) | 17 (20.7) | 20 (15.4) | 24 (38.7) | 28 (40.0)
N1 | 38 (14.1) | 82 (12.6) | 11 (13.4) | 17 (13.1) | 12 (19.4) | 14 (20.0)
N2 | 76 (28.2) | 209 (32.0) | 53 (64.6) | 92 (72.7) | 24 (38.7) | 26 (37.1)
N3 | 9 (3.3) | 33 (5.1) | 1 (1.2) | 1 (0.8) | 0 (0) | 0 (0)
Unknown | 64 (23.7) | 144 (22.0) | 0 (0) | 0 (0) | 2 (3.2) | 2 (2.9)

HPV/p16 status
Negative | 188 (69.6) | 454 (69.5) | 44 (53.7) | 67 (51.5) | 6 (9.7) | 7 (10.0)
Positive | 76 (28.2) | 185 (28.3) | 38 (46.3) | 63 (48.5) | 0 (0) | 0 (0)
Unknown | 6 (2.2) | 14 (2.2) | 0 (0) | 0 (0) | 56 (90.3) | 63 (90.0)

Lymph node pathology
Negative | — | 380 (58.2) | — | 55 (42.3) | — | 29 (41.4)
Nodal metastasis, ENE-negative | — | 153 (23.4) | — | 54 (41.5) | — | 24 (34.3)
Nodal metastasis, ENE-positive | — | 120 (18.4) | — | 21 (16.2) | — | 17 (24.3)

Abbreviations: ENE, extranodal extension; HPV, human papilloma virus; TCGA, The Cancer Genome Atlas; TCIA, The Cancer Imaging Archive.

Observer Performance With Deep Learning Assistance

Observer ENE identification performance when offered DL algorithm assistance is found in Table 3. For R1, deep learning assistance resulted in 18 changed decisions (13.9%) for the Mount Sinai data set and 16 (12.9%) for the TCIA-TCGA data set. This led to an AUC increase from 0.70 to 0.78 for the Mount Sinai data set (P = .22) and from 0.60 to 0.82 for the TCIA-TCGA data set (P = .0003), and sensitivity increases from 0.62 to 0.71 and 0.24 to 0.71, respectively. For R2, deep learning assistance resulted in 5 changed decisions (4.0%) for the Mount Sinai data set and none (0%) for the TCIA-TCGA data set, yielding no significant changes in performance metrics. In addition, interobserver agreement increased with deep learning assistance.

HPV-Related HNSCC Subgroup Analysis

HPV DNA and/or p16-positive status was available for all oropharyngeal carcinomas in the Mount Sinai data set. Twenty of the labeled lymph nodes (28.2%) were from HPV-related malignancies. Within this subgroup, AUCs of the DL algorithm, R1, and R2 were 0.81, 0.75, and 0.56, respectively.
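The interobserver agreement values (κ) reported above and in Tables 2 and 3 correspond to Cohen's kappa computed on the two radiologists' paired binary node-level calls. A one-line computation with scikit-learn, on hypothetical reads rather than study data:

```python
# Hypothetical binary ENE calls for two readers; not study data.
from sklearn.metrics import cohen_kappa_score

r1_calls = [1, 0, 0, 1, 0, 1, 0, 0]
r2_calls = [1, 0, 1, 1, 0, 0, 0, 0]
kappa = cohen_kappa_score(r1_calls, r2_calls)  # ~0.47 here: moderate agreement
```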


[Figure 2: ROC curves, Mount Sinai data set. DL (AUC = 0.84), R1 (AUC = 0.70), R1 with DL (AUC = 0.78), R2 (AUC = 0.71), R2 with DL (AUC = 0.71); true-positive rate v false-positive rate.]

FIG 2. Receiver operating characteristic plots for extranodal extension identification for deep learning (DL [DualNet]) algorithm and radiologists (R1, R2): Mount Sinai patients. AUC, area under the curve.

[Figure 3: ROC curves, TCIA-TCGA data set. DL (AUC = 0.90), R1 (AUC = 0.60), R1 with DL (AUC = 0.82), R2 (AUC = 0.82), R2 with DL (AUC = 0.82); true-positive rate v false-positive rate.]

FIG 3. Receiver operating characteristic plots for extranodal extension identification for deep learning (DL [DualNet]) algorithm and radiologists (R1, R2): The Cancer Imaging Archive–The Cancer Genome Atlas patients.

DISCUSSION

In this study we used two diverse data sets to confirm that a DL algorithm could successfully identify ENE on CT imaging for patients with HNSCC. In addition, we demonstrated that the algorithm’s diagnostic performance surpassed that of board-certified radiologists with specialized head and neck cancer experience. The algorithm demonstrated generalizability across heterogeneous clinical settings, comprising a variety of CT scanners, practice locations, and patient populations. Last, our findings suggest that algorithm assistance may improve radiologist performance, although additional investigation is needed to confirm this. The study highlights the difficulty human diagnosticians experience in identifying ENE, with modest discriminatory performance and poor to moderate interobserver agreement. This deep learning–based approach to ENE identification offers several additional advantages over the current standard, including reproducibility, objectivity, near-instantaneous reporting (> 10 times faster than clinicians), and the ability to adjust probability thresholds to achieve varying balance of sensitivity and specificity to suit the clinical scenario. The study validates the DL algorithm as a clinical decision-making tool for patients with HNSCC.

The DL algorithm has utility in a number of complex clinical situations. First, the algorithm could decrease patient morbidity and health care costs by reducing the need for trimodality therapy and by selecting patients appropriate for surgery with minimal adjuvant therapy. Second, it could be used to help select appropriate patients for clinical trials for which ENE is an exclusion criterion. For instance, the ongoing Eastern Cooperative Oncology Group 3311 trial seeks to de-escalate adjuvant therapy for HPV-related HNSCC, but patients who are found to have ENE postoperatively are not eligible for the experimental arms in the trial because of their high risk (ClinicalTrials.gov identifier: NCT01898494). Third, the algorithm could help identify patients with ENE who may benefit from treatment intensification. Finally, it could be used to guide patient staging and risk stratification.

To our knowledge, the DL algorithm represents the first externally validated deep learning algorithm used to identify ENE. The AUCs of 0.90 and 0.84 achieved on the external data sets are superior compared with direct radiologist comparison and historical results.16,17,19 Meanwhile, observer performance in this study (AUC, 0.70-0.71) was comparable to that of prior reports, which have yielded AUCs of 0.62-0.69.15,18 Interobserver agreement in our study was lower than that observed in a prior study of two observers (0.59) and may indicate a higher level of ambiguity in the radiologic appearance of ENE in our data sets.16 At a balanced probability threshold, the deep learning algorithm yielded a sensitivity of 0.71 and specificity of 0.85, which are higher than those of prior studies of diagnostician performance.15,16,33 Radiologist ENE detection performance varies widely in the literature, and, in even the most successful studies, high sensitivity comes at the expense of low specificity, or vice versa.19,33,34 Historical comparison is fraught in evaluating ENE identification, because performance is highly contingent on the difficulty of the data set. Importantly, our study reports a direct performance comparison with what is considered the gold standard in human ENE identification.

The reasons for inferior human observer performance are likely multifactorial. ENE is often a microscopic phenomenon, making it difficult for humans to identify with the naked eye. The 3-dimensional pixel-by-pixel analysis inherent in the deep learning strategy may provide an intrinsic advantage in this regard.23,35 In addition, ENE reporting is not often part of radiologic documentation, and no feedback mechanism exists to train clinicians to identify ENE in real time. Highlighting this issue, the American College of Radiology–Data Science Institute has designated ENE identification as a key use case for artificial intelligence.36


TABLE 2. DL Algorithm Performance for ENE Identification and Radiologist Comparisons on External Test Sets

Performance Metric | DL (Internal Test Set, n = 98) | DL (Mount Sinai, n = 130) | R1 (Mount Sinai) | R2 (Mount Sinai) | DL (TCIA-TCGA, n = 70) | R1 (TCIA-TCGA) | R2 (TCIA-TCGA)
AUC (95% CI) | 0.91 (0.86 to 0.96) | 0.84 (0.75 to 0.93) | 0.70 (0.59 to 0.82) | 0.71 (0.60 to 0.82) | 0.90 (0.81 to 0.99) | 0.60 (0.49 to 0.71) | 0.82 (0.71 to 0.94)
Accuracy, % | 85.7 | 83.1 | 76.2 | 73.8 | 88.6 | 78.6 | 88.6
Sensitivity | 0.88 | 0.71 | 0.62 | 0.67 | 0.82 | 0.24 | 0.71
Specificity | 0.85 | 0.85 | 0.79 | 0.75 | 0.91 | 0.96 | 0.94
PPV | 0.66 | 0.48 | 0.36 | 0.34 | 0.74 | 0.67 | 0.80
NPV | 0.95 | 0.94 | 0.91 | 0.92 | 0.94 | 0.80 | 0.91
Youden index | 0.73 | 0.56 | 0.41 | 0.42 | 0.73 | 0.20 | 0.65

NOTE. Cohen κ score (R1 v R2): 0.43 (Mount Sinai), 0.29 (TCIA-TCGA).
Abbreviations: AUC, area under the curve; DL, deep learning (DualNet algorithm); ENE, extranodal extension; NPV, negative predictive value; PPV, positive predictive value; R, radiologist.

An advantage of deep learning is that it is iteratively trained and tuned to the specific task of interest, in this case, ENE identification. Radiologist training and experience also affect performance. Given the specialized radiologists in this study, observer accuracy may be higher than expected in less-specialized practice settings. We hypothesize that the net benefit of deep learning could be even greater extrapolated to other settings, such as low-volume centers or those in middle- and low-resource countries.37

Despite high performance on the external data sets, the DL algorithm did exhibit a slight decrease in discriminatory ability on the external data compared with the internally validated set from previous work.25 This is not surprising, as the algorithm was trained only on the internal set, and there are variations between the data sets, in terms of CT scanner parameters, lymph node sizes, and proportion of nodes with ENE, that made the algorithm’s task of ENE identification more difficult. Therefore, additional studies exploring algorithm generalizability are warranted. The authors recognize that there are nearly limitless options when designing neural network–based algorithms and training heuristics, and although experimenting with additional novel network strategies may yield improved results, substantial investigation undertaken during the neural network’s development phase yielded the current architecture, which performed better than others tested.

The study has several other limitations. Time from CT scan to surgery in this study is reflective of real-world clinical practice, in that it was not strictly standardized. Although this may influence the accuracy of ENE prediction, scans were limited to those that were performed within 2 months before surgery, and ENE predictions within this time frame have been found to be stable.34 Because the training and validation data sets primarily consist of lymph nodes with short-axis diameter ≥ 1 cm, the DL algorithm should be applied with caution to smaller lymph nodes until it can be validated in this subset. In addition, in the evaluation of deep learning as a diagnostic assist, only one of the observers meaningfully used the algorithm’s prediction. The reasons for this may be related to experience and assuredness or to technical aspects of displaying the deep learning results to the observer.

TABLE 3. Radiologist Performance With DL Assistance

Performance Metric | R1 With DL (Mount Sinai) | R2 With DL (Mount Sinai) | R1 With DL (TCIA-TCGA) | R2 With DL (TCIA-TCGA)
AUC (95% CI) | 0.78 (0.67 to 0.88) | 0.71 (0.59 to 0.82) | 0.82 (0.71 to 0.94) | 0.82 (0.71 to 0.94)
Accuracy, % | 82.3 | 73.1 | 88.6 | 88.6
Sensitivity | 0.71 | 0.67 | 0.71 | 0.71
Specificity | 0.84 | 0.74 | 0.94 | 0.94
PPV | 0.47 | 0.33 | 0.80 | 0.80
NPV | 0.94 | 0.92 | 0.91 | 0.91
Youden index | 0.55 | 0.41 | 0.65 | 0.65

NOTE. Cohen κ score with DL assistance (R1 v R2): 0.74 (Mount Sinai), 0.75 (TCIA-TCGA).
Abbreviations: AUC, area under the curve; DL, deep learning (DualNet algorithm); NPV, negative predictive value; PPV, positive predictive value; R, radiologist.


Although our findings suggest DL assistance could improve radiologist performance in ENE detection, our study was not powered to detect a statistically significant improvement with DL assistance. Additional investigation is needed to determine the optimal way to integrate the algorithm to augment observer predictions. Although the radiologists in this study had substantial head and neck cancer experience, there may exist diagnosticians nationally who exclusively practice in head and neck cancer, in which case the performance gap between the DL algorithm and human interpretation may be smaller. This likely occurs at only a minority of highly specialized cancer centers, whereas the majority of head and neck cancer imaging in clinical practice is read by neuroradiologists and general radiologists. Our study analyzes ENE at the lymph node level and does not account for within-patient correlation. This could underestimate variation within our data set. Given that the number of lymph nodes used for each patient is relatively small, we do not believe this contributes significant bias to our model. Finally, the distribution of lymph node classes used in the study was predetermined and may be different from that in real-world application of the model. ENE identification may perhaps be most useful for patients with HPV-related malignancies, and although the model performed well on the subgroup of HPV-related lymph nodes from the external institution, HPV status was not available in the TCIA-TCGA data set. Dedicated testing in the setting of HPV-related HNSCC is underway, and prospective testing is being planned to apply the model to a real-world clinical workflow in the management of HNSCC.

Deep learning can identify lymph node ENE on pretreatment imaging for patients with HNSCC from external institutions with high performance, exceeding that of board-certified radiologists with specialized head and neck experience. Radiologist performance can be augmented with deep learning assistance. Deep learning shows promise as a tool to risk stratify patients with HNSCC and select appropriate management, with the goal of reducing treatment-related morbidity and health care costs.

AFFILIATIONS
1Department of Radiation Oncology, Dana-Farber Cancer Institute/Brigham and Women’s Hospital, Harvard Medical School, Boston, MA
2Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY
3Department of Radiology, Yale School of Medicine, New Haven, CT
4Department of Therapeutic Radiology, Yale School of Medicine, New Haven, CT
5Department of Otolaryngology/Head and Neck Surgery, University of North Carolina School of Medicine, Chapel Hill, NC
6Department of Medicine, Yale School of Medicine, New Haven, CT
7Department of Radiation Oncology, Odette Cancer Centre, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada

CORRESPONDING AUTHOR
Benjamin H. Kann, MD, Department of Radiation Oncology, Dana-Farber Cancer Institute/Brigham and Women’s Hospital, 75 Francis St, Boston, MA 02115; Twitter: @BenjaminKannMD; e-mail: benjamin_kann@dfci.harvard.edu

PRIOR PRESENTATION
Presented at the Annual American Society for Radiation Oncology Conference, Chicago, IL, September 17, 2019.

SUPPORT
Supported by an Eastern Cooperative Oncology Group–American College of Radiology Imaging Network Paul Carbone Research Fellowship Grant.

AUTHORS’ DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST AND DATA AVAILABILITY STATEMENT
Disclosures provided by the authors and data availability statement (if applicable) are available with this article at DOI https://doi.org/10.1200/JCO.19.02031.

AUTHOR CONTRIBUTIONS
Conception and design: Benjamin H. Kann, Sam Payabvash, Vishal Gupta, Zain A. Husain, Sanjay Aneja
Financial support: Sanjay Aneja
Administrative support: Sanjay Aneja
Provision of study material or patients: Amit Mahajan, Vishal Gupta, Sanjay Aneja
Collection and assembly of data: Benjamin H. Kann, Daniel F. Hicks, Sam Payabvash, Amit Mahajan, Wendell G. Yarbrough, Sanjay Aneja
Data analysis and interpretation: Benjamin H. Kann, Sam Payabvash, Justin Du, Vishal Gupta, Henry S. Park, James B. Yu, Barbara A. Burtness, Zain A. Husain, Sanjay Aneja
Manuscript writing: All authors
Final approval of manuscript: All authors
Accountable for all aspects of the work: All authors

REFERENCES
1. Jemal A, Simard EP, Dorell C, et al: Annual report to the nation on the status of cancer, 1975–2009, featuring the burden and trends in human papillomavirus
(HPV)–associated cancers and HPV vaccination coverage levels. J Natl Cancer Inst 105:175-201, 2013
2. Adelstein DJ, Li Y, Adams GL, et al: An intergroup phase III comparison of standard radiation therapy and two schedules of concurrent chemoradiotherapy in
patients with unresectable squamous cell head and neck cancer. J Clin Oncol 21:92-98, 2003
3. Bernier J, Cooper JS, Pajak TF, et al: Defining risk levels in locally advanced head and neck cancers: A comparative analysis of concurrent postoperative radiation plus chemotherapy trials of the EORTC (#22931) and RTOG (#9501). Head Neck 27:843-850, 2005
4. Huang SH, O’Sullivan B: Overview of the 8th Edition TNM Classification for Head and Neck Cancer. Curr Treat Options Oncol 18:40, 2017
5. Sher DJ, Fidler MJ, Tishler RB, et al: Cost-effectiveness analysis of chemoradiation therapy versus transoral robotic surgery for human papillomavirus-
associated, clinical N2 oropharyngeal cancer. Int J Radiat Oncol Biol Phys 94:512-522, 2016

Journal of Clinical Oncology 7

Downloaded from ascopubs.org by Western General Hospital on December 9, 2019 from 129.215.017.190
Copyright © 2019 American Society of Clinical Oncology. All rights reserved.
Kann et al

6. Sethia R, Yumusakhuylu AC, Ozbay I, et al: Quality of life outcomes of transoral robotic surgery with or without adjuvant therapy for oropharyngeal cancer.
Laryngoscope 128:403-411, 2018
7. Cooper JS, Pajak TF, Forastiere AA, et al: Postoperative concurrent radiotherapy and chemotherapy for high-risk squamous-cell carcinoma of the head and
neck. N Engl J Med 350:1937-1944, 2004
8. Ling DC, Chapman BV, Kim J, et al: Oncologic outcomes and patient-reported quality of life in patients with oropharyngeal squamous cell carcinoma treated with
definitive transoral robotic surgery versus definitive chemoradiation. Oral Oncol 61:41-46, 2016
9. Nichols AC, Theurer J, Prisman E, et al: A phase II randomized trial for early-stage squamous cell carcinoma of the oropharynx: Radiotherapy versus trans-oral
robotic surgery (ORATOR). J Clin Oncol 37, 2019 (suppl; abstr 6006)
10. Ang KK, Harris J, Wheeler R, et al: Human papillomavirus and survival of patients with oropharyngeal cancer. N Engl J Med 363:24-35, 2010
11. Weinstein GS, Quon H, O’Malley BW Jr, et al: Selective neck dissection and deintensified postoperative radiation and chemotherapy for oropharyngeal cancer: A
subset analysis of the University of Pennsylvania transoral robotic surgery trial. Laryngoscope 120:1749-1755, 2010
12. Subramanian HE, Park HS, Barbieri A, et al: Pretreatment predictors of adjuvant chemoradiation in patients receiving transoral robotic surgery for squamous
cell carcinoma of the oropharynx: A case control study. Cancers Head Neck 1:7, 2016
13. White-Gilbertson S, Nelson S, Zhan K, et al: Analysis of the National Cancer Data Base to describe treatment trends in stage IV oral cavity and pharyngeal
cancers in the United States, 1998-2012. J Registry Manag 42:146-151, quiz 156-157, 2015
14. McMullen CP, Garneau J, Weimar E, et al: Occult nodal disease and occult extranodal extension in patients with oropharyngeal squamous cell carcinoma
undergoing primary transoral robotic surgery with neck dissection. JAMA Otolaryngol Head Neck Surg 145:701, 2019
15. Maxwell JH, Rath TJ, Byrd JK, et al: Accuracy of computed tomography to predict extracapsular spread in p16-positive squamous cell carcinoma. Laryngoscope 125:1613-1618, 2015
16. Carlton JA, Maxwell AW, Bauer LB, et al: Computed tomography detection of extracapsular spread of squamous cell carcinoma of the head and neck in
metastatic cervical lymph nodes. Neuroradiol J 30:222-229, 2017
17. Url C, Schartinger VH, Riechelmann H, et al: Radiological detection of extracapsular spread in head and neck squamous cell carcinoma (HNSCC) cervical
metastases. Eur J Radiol 82:1783-1787, 2013
18. Chai RL, Rath TJ, Johnson JT, et al: Accuracy of computed tomography in the prediction of extracapsular spread of lymph node metastases in squamous cell
carcinoma of the head and neck. JAMA Otolaryngol Head Neck Surg 139:1187-1194, 2013
19. Patel MR, Hudgins PA, Beitler JJ, et al: Radiographic imaging does not reliably predict macroscopic extranodal extension in human papilloma virus-associated
oropharyngeal cancer. ORL J Otorhinolaryngol Relat Spec 80:85-95, 2018
20. Kann BH, Buckstein M, Carpenter TJ, et al: Radiographic extracapsular extension and treatment outcomes in locally advanced oropharyngeal carcinoma. Head
Neck 36:1689-1694, 2014
21. Esteva A, Kuprel B, Novoa RA, et al: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542:115-118, 2017 [Erratum: Nature
546:686, 2017]
22. Hwang EJ, Park S, Jin K-N, et al: Development and validation of a deep learning-based automated detection algorithm for major thoracic diseases on chest
radiographs. JAMA Netw Open 2:e191095, 2019
23. Ardila D, Kiraly AP, Bharadwaj S, et al: End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat
Med 25:954-961, 2019
24. LeCun Y, Bengio Y, Hinton G: Deep learning. Nature 521:436-444, 2015
25. Kann BH, Aneja S, Loganadane GV, et al: Pretreatment identification of head and neck cancer nodal metastasis and extranodal extension using deep learning
neural networks. Sci Rep 8:14036, 2018
26. Collins GS, Reitsma JB, Altman DG, et al: Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): The
TRIPOD statement. Ann Intern Med 162:55-63, 2015
27. Ji S, Yang M, Yu K: 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35:221-231, 2013
28. Baumgartner CF, Oktay O, Rueckert D: Fully convolutional networks in medical imaging: Applications to image enhancement and recognition, in Lu L, Zheng Y,
Carneiro G, et al (eds): Deep Learning and Convolutional Neural Networks for Medical Image Computing. Cham, Switzerland, Springer, 2017, pp 159-179
29. Clark K, Vendt B, Smith K, et al: The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. J Digit Imaging 26:1045-1057,
2013
30. Zuley ML, Jarosz R, Kirk S, et al: Radiology data from The Cancer Genome Atlas Head-Neck Squamous Cell Carcinoma [TCGA-HNSC] collection. https://wiki.cancerimagingarchive.net/x/VYG0
31. National Cancer Institute: The Cancer Genome Atlas program. http://cancergenome.nih.gov/
32. DeLong ER, DeLong DM, Clarke-Pearson DL: Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric
approach. Biometrics 44:837-845, 1988
33. Prabhu RS, Magliocca KR, Hanasoge S, et al: Accuracy of computed tomography for predicting pathologic nodal extracapsular extension in patients with head-
and-neck cancer undergoing initial surgical resection. Int J Radiat Oncol Biol Phys 88:122-129, 2014
34. Almulla A, Noel CW, Lu L, et al: Radiologic-pathologic correlation of extranodal extension in patients with squamous cell carcinoma of the oral cavity:
Implications for future editions of the TNM classification. Int J Radiat Oncol Biol Phys 102:698-708, 2018
35. He K, Zhang X, Ren S, et al: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. http://arxiv.org/abs/1502.01852
36. ACR Data Science Institute: TOUCH-AI directory. https://www.acrdsi.org/DSI-Services/TOUCH-AI
37. Guo J, Li B: The application of medical artificial intelligence technology in rural areas of developing countries. Health Equity 2:174-181, 2018


AUTHORS’ DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST


The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO’s conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/jco/site/ifc.
Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians (Open Payments).

Benjamin H. Kann
Patents, Royalties, Other Intellectual Property: Computed tomography-based method for analysis of malignancy (patent pending)

Amit Mahajan
Stock and Other Ownership Interests: Gilead Sciences, Celgene, Atara Biotherapeutics, Bausch Health, Aurinia Pharmaceuticals, CRISPR Therapeutics

Henry S. Park
Honoraria: RadOncQuestions

James B. Yu
Consulting or Advisory Role: Augmenix
Research Funding: 21st Century Oncology (Inst)

Wendell G. Yarbrough
Consulting or Advisory Role: Olympus Medical Systems

Barbara A. Burtness
Honoraria: AstraZeneca
Consulting or Advisory Role: Merck, Debiopharm Group, AstraZeneca, Bristol-Myers Squibb, Alligator Bioscience, Aduro BioTech, GlaxoSmithKline, Celgene, Cue Biopharma, Maverick Therapeutics, Rakuten, Nanobiotix, MacroGenics, ALX Oncology
Research Funding: Merck (Inst), Aduro BioTech (Inst), Formation Biologics (Inst), Bristol Myers (Inst), Exelixis (Inst)
Travel, Accommodations, Expenses: Merck, Debiopharm Group, Boehringer Ingelheim

Zain A. Husain
Research Funding: Merck Sharp & Dohme (Inst)
Travel, Accommodations, Expenses: Elekta

Sanjay Aneja
Consulting or Advisory Role: Prophet Consulting (I)
Research Funding: The MedNet
Patents, Royalties, Other Intellectual Property: Provisional patent of deep learning optimization algorithm
Travel, Accommodations, Expenses: Prophet Consulting (I)

No other potential conflicts of interest were reported.

