
European Journal of Radiology 142 (2021) 109894


Detection and PI-RADS classification of focal lesions in prostate MRI: Performance comparison between a deep learning-based algorithm (DLA) and radiologists with various levels of experience
Seo Yeon Youn a, Moon Hyung Choi a,b,*, Dong Hwan Kim a, Young Joon Lee a,b, Henkjan Huisman c, Evan Johnson d, Tobias Penzkofer e, Ivan Shabunin f, David Jean Winkel g, Pengyi Xing h, Dieter Szolar i, Robert Grimm j, Heinrich von Busch j, Yohan Son k, Bin Lou l, Ali Kamen l

a Department of Radiology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
b Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
c Department of Radiology, Radboud University Medical Center, Nijmegen, The Netherlands
d Department of Radiology, New York University, NY, USA
e Department of Radiology, Charité, Universitätsmedizin Berlin, Berlin, Germany
f Patero Clinic, Moscow, Russia
g Department of Radiology, University Hospital of Basel, Basel, Switzerland
h Department of Radiology, Changhai Hospital, Shanghai, China
i Diagnostikum Graz Süd-West, Graz, Austria
j Diagnostic Imaging, Siemens Healthcare, Erlangen, Germany
k Siemens Healthineers Ltd., Seoul, Republic of Korea
l Digital Technology and Innovation, Siemens Healthineers, Princeton, NJ, USA

A R T I C L E I N F O

Keywords: Deep learning; Prostate; Prostate neoplasms; Prostate imaging reporting and data system; Magnetic resonance imaging

A B S T R A C T

Purpose: To compare the performance of lesion detection and Prostate Imaging-Reporting and Data System (PI-RADS) classification between a deep learning-based algorithm (DLA), clinical reports and radiologists with different levels of experience in prostate MRI.
Methods: This retrospective study included 121 patients who underwent prebiopsy MRI and prostate biopsy. More than five radiologists (Reader groups 1, 2: residents; Readers 3, 4: less-experienced radiologists; Reader 5: expert) independently reviewed biparametric MRI (bpMRI). The DLA results were obtained using bpMRI. The reference standard was based on pathologic reports. The diagnostic performance of the PI-RADS classification of the DLA, clinical reports, and radiologists was analyzed using AUROC. Dichotomous analysis (PI-RADS cutoff value ≥ 3 or 4) was performed, and the sensitivities and specificities were compared using McNemar's test.
Results: Clinically significant cancer [CSC, Gleason score ≥ 7] was confirmed in 43 patients (35.5%). The AUROC of the DLA (0.828) for diagnosing CSC was significantly higher than that of Reader 1 (AUROC, 0.706; p = 0.011), significantly lower than that of Reader 5 (AUROC, 0.914; p = 0.013), and similar to clinical reports and other readers (p = 0.060–0.661). The sensitivity of the DLA (76.7%) was comparable to those of all readers and the clinical reports at a PI-RADS cutoff value ≥ 4. The specificity of the DLA (85.9%) was significantly higher than those of clinical reports and Readers 2–3 and comparable to all others at a PI-RADS cutoff value ≥ 4.

Abbreviations: AI, artificial intelligence; bpMRI, biparametric MRI; CSC, clinically significant prostate cancer; DLA, deep learning-based algorithm; mpMRI,
multiparametric MRI; PI-RADS, Prostate Imaging-Reporting and Data System; PSA, prostate-specific antigen; TRUS, transrectal ultrasonography.
* Corresponding author at: Department of Radiology, Eunpyeong St. Mary’s Hospital, College of Medicine, The Catholic University of Korea, 1021, Tongil-ro,
Eunpyeong-gu, Seoul 03312, Republic of Korea.
E-mail addresses: cmh@catholic.ac.kr (M.H. Choi), rhdwngkwk333@naver.com (D.H. Kim), yjleerad@catholic.ac.kr (Y.J. Lee), henkjan.huisman@radboudumc.nl
(H. Huisman), tobias.penzkofer@charite.de (T. Penzkofer), shabunin@pateroclinic.ru (I. Shabunin), davidjean.winkel@usb.ch (D.J. Winkel), 746992685@qq.com
(P. Xing), dieter.szolar@diagnostikum-graz.at (D. Szolar), robertgrimm@siemens-healthineers.com (R. Grimm), heinrich.von_busch@siemens-healthineers.com
(H. von Busch), yohan.son@siemens-healthineers.com (Y. Son), bin.lou@siemens-healthineers.com (B. Lou), ali.kamen@siemens-healthineers.com (A. Kamen).

https://doi.org/10.1016/j.ejrad.2021.109894
Received 2 May 2021; Received in revised form 30 June 2021; Accepted 1 August 2021
Available online 5 August 2021
0720-048X/© 2021 Elsevier B.V. All rights reserved.

Conclusions: The DLA showed moderate diagnostic performance, at a level between that of residents and that of an expert, in detecting lesions and classifying them according to PI-RADS. The performance of the DLA was similar to that of clinical reports issued by various radiologists in clinical practice.

1. Introduction

Prostate MRI is beneficial for the diagnosis of prostate cancer in biopsy-naïve patients and patients with prior negative biopsy results. Multicentric, randomized studies have shown that prostate MRI before biopsy and MRI-targeted biopsy is superior to standard transrectal ultrasonography (TRUS)-guided biopsy in patients at clinical risk [1,2]. A meta-analysis revealed that prebiopsy MRI is a more favorable diagnostic work-up than systematic biopsy for the diagnosis of clinically significant prostate cancer (CSC) [3].

Because of increasing evidence and guideline recommendations, the demand for prostate MRI is growing, and radiologists face a substantial increase in the number of referrals. The Prostate Imaging-Reporting and Data System (PI-RADS) provides fundamental guidelines for the interpretation of prostate MRI [4]; however, the PI-RADS score can be interpreted differently by different radiologists [5]. In particular, radiologists with less experience show greater inter-reader variability in PI-RADS scoring [5]. Machine learning-based automatic detection and classification of focal lesions in the prostate gland may help reduce radiologists' reading time and inter-reader variability [6].

Recently, deep learning-based artificial intelligence (AI) algorithms have shown valuable performance in differentiating prostate cancer from normal tissue and in estimating the probability of prostate cancer [7–11]. As PI-RADS is the standard for reporting prostate MRI, a deep learning-based algorithm for both lesion detection and PI-RADS classification is needed in clinical practice. Recently, a deep learning-based algorithm for PI-RADS classification was demonstrated [12]; however, this approach did not address the detection of prostate lesions and required manual lesion segmentation by a radiologist. Few studies have compared the performance of a machine in detecting focal lesions and calculating cancer probability with radiologists' PI-RADS classification [7,13]. To the best of our knowledge, no studies have compared the performance of AI against radiologists with various levels of experience in prostate MRI.

The purpose of this study was to compare the diagnostic performance of lesion detection and PI-RADS classification between a deep learning-based algorithm (DLA), clinical reports and radiologists with different levels of experience in prostate MRI.

2. Materials and methods

This retrospective study was approved by the Institutional Review Board (KC19DISI0933), and informed consent was waived.

2.1. Study population

Patients with clinical suspicion of prostate cancer who underwent prebiopsy prostate MRI between November 2015 and February 2019 followed by TRUS-guided biopsy were eligible (n = 923). One of two experienced genitourinary radiologists performed the TRUS-guided biopsy after reviewing the MRI images and reports. Systematic biopsy with or without additional targeted biopsy was performed. Targeted biopsy was performed using cognitive fusion, and the location of the targeted lesion was recorded in the biopsy report. Only patients who underwent prostate MRI using the same 3-T MRI machine (MAGNETOM Verio, Siemens Healthcare, Erlangen, Germany) were included in this study. The exclusion criteria were 1) known prostate cancer before prostate MRI (n = 38), 2) MRI examination performed using a 1.5-T (n = 612) or other-vendor 3-T MRI machine (n = 44), 3) MRI examination used in the development of the DLA (n = 100), 4) cases in which the DLA did not run (n = 1) because of a technical error during MRI scanning, not a technical error of the DLA, and 5) patients with very high PSA (>40 ng/mL, n = 7). Finally, a total of 121 patients were enrolled in this study (Fig. 1).

Fig. 1. Flow diagram of subject selection.

Clinical information on age, prostate-specific antigen (PSA) level, use of a 5α-reductase inhibitor, clinical report of prostate MRI, biopsy result, number of previous TRUS-guided biopsies before MRI, interval between MRI and biopsy, prostatectomy result, and interval between MRI and prostatectomy in patients who underwent prostatectomy was collected from electronic medical records.

2.2. MR imaging techniques

The MRI examinations in this study consisted of 58 biparametric MRIs (bpMRIs) and 63 multiparametric MRIs (mpMRIs) acquired with a pelvic phased-array coil. The following sequences were acquired with the following parameters: axial, sagittal, and coronal T2-weighted images (T2WIs): repetition time [TR] > 3,200 ms; echo time [TE], 80–100 ms; matrix, 320 × 320; slice thickness, 3 mm; field of view [FOV], 200–220 mm; axial diffusion-weighted images (DWIs) with b-values of 0, 50, 500, and 1,000 sec/mm2: slice thickness, 3 mm; matrix, 100 × 100; FOV, 200–220 mm. DWI with a b-value of 1,500 sec/mm2 was obtained in some patients, but apparent diffusion coefficient (ADC) maps were calculated from DWI with b-values of 50 and 1,000 sec/mm2 to maintain consistency in the imaging protocol.

2.3. DLA for prostate MRI

A non-commercially available, deep learning-based prototype software (Prostate AI version 1.2.1, build date 2019-11-27, Siemens Healthcare) was used. The algorithm was trained and validated with 2,170 bpMRIs from seven institutions, including ours [14]. Our institution provided 100 cases for the development of the DLA, and these cases were excluded from the current study. The DLA was designed to evaluate bpMRI using axial T2WI and DWI with high b-values to detect suspicious prostatic lesions and categorize them according to PI-RADS v2. Only T2WI and DWI were loaded into the DLA, even in patients who underwent mpMRI. It displays abnormal areas using a suspicion map on T2-weighted images and shows the segmented abnormal lesion. It also automatically produces a draft radiologic report containing text information about the PI-RADS classification, size, and location of the lesions, in order of higher PI-RADS score and larger size, for the user to review and edit. The DLA results provide PI-RADS scores and a snapshot of each of up to five lesions per patient. Detailed information on the DLA is provided in the Supplementary material.

2.4. Prostate MRI review by radiologists

Two reader groups of radiology residents (Reader group 1, composed of four 2-year residents, and Reader group 2, composed of four 3-year residents) and three board-certified radiologists (Reader 3, 4.5 years of experience; Reader 4, 5 years of experience; and Reader 5, 10 years of experience in prostate imaging), all blinded to the biopsy results, independently reviewed bpMRI (three planes of T2WIs, axial DWIs [b = 0, 1,000 sec/mm2], and an ADC map). Given that the radiology residents were not familiar with prostate MRI, the four residents with similar levels of experience in each group split the cases, and the integration of the four residents' reviews was considered a single reader review in Reader groups 1 and 2. Reader 5 had completed the MRI reviews more than four months before matching the other readers' MRI reviews to the pathology results. All readers recorded the number, location, and size of suspicious prostate lesions and the PI-RADS version 2 score. For later comparison with the reference standard, all detected prostate lesions (PI-RADS scores from 2 to 5) were captured on axial images with indicators on the picture archiving and communication system by all readers. The index lesion was determined by the highest PI-RADS score or by the largest diameter if multiple lesions with the same PI-RADS score existed.

2.5. Clinical report as a routine interpretation

In the routine clinical process at our institution, six board-certified abdominal/genitourinary radiologists with at least six years of experience in prostate MRI reviewed prostate MRI examinations and wrote the radiological reports. More than 1,000 prostate MRIs are performed each year at our institution. The number, size, location, PI-RADS score, and snapshot of each of up to five lesions in each patient report were collected. In some reports made before the release of PI-RADS v2, the conclusion did not follow the guidelines; these reports were reinterpreted according to PI-RADS v2 by a radiologist who was not aware of the biopsy results: an indeterminate conclusion was interpreted as PI-RADS 3, and a definite conclusion of prostate cancer was interpreted as PI-RADS 4 or 5 according to tumor size and extracapsular extension.

2.6. Reference standard for prostate cancer

Prostate cancer was defined as any cancer regardless of Gleason grade group, and CSC was defined as Gleason grade group ≥ 2 (Gleason score 7 [3 + 4] or higher). Reader 5 reviewed the pathologic results and the MRI results from the DLA and all other readers at least 4 months after Reader 5's own image review and determined whether the lesions were cancers based on the pathology. For patients who underwent systematic and targeted biopsy, the reader assessed whether prostate cancer was confirmed in the targeted lesion. If prostate cancer was confirmed in systematic cores other than the targeted cores, the reader matched the biopsy results to the focal lesion on prostate MRI. In patients who underwent prostatectomy, the schematic diagram of the histopathology map that depicted the cancer area, rather than whole-mount pathology, was the primary reference for prostate cancer location. When the Gleason score differed between biopsy and prostatectomy pathology in a patient, the prostatectomy result was used for evaluation.

2.7. Statistical analysis

The PI-RADS assessments from the DLA, clinical reports and each radiologist were compared with those of Reader 5 (the most experienced radiologist) using weighted kappa statistics to analyze inter-reader agreement.

The diagnostic performance of per-patient PI-RADS scores for all readings was analyzed using the area under the receiver operating characteristic (ROC) curve (AUROC) in diagnosing all prostate cancers or CSCs. The ROC curves were compared using DeLong's method. We also performed dichotomous analysis of the diagnostic performance using either PI-RADS ≥ 3 or ≥ 4 as the cutoff value. Given that physicians dichotomously decide whether to conduct prostate biopsy, we considered that dichotomous analysis would make the findings easier to understand intuitively. To calculate the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy in detecting prostate cancer, we defined true or false positives/negatives for an index lesion at the per-patient level. For example, a true positive means that a reader (DLA, clinical report, or radiologist) correctly detected prostate cancer at the same location and categorized the lesion with at least the PI-RADS score cutoff value. The sensitivities and specificities of the clinical report and radiologists were compared with those of the DLA using McNemar's test.

Statistical analysis was performed using SPSS software, version 23.0 (IBM, Armonk, NY, USA) and MedCalc version 19.2.0


(MedCalc Software, Mariakerke, Belgium). A p-value of < 0.05 was considered statistically significant. To compare diagnostic performance among the clinical reports, the five radiologists, and the DLA, p-values were multiplied by 6 according to the Bonferroni correction.

3. Results

3.1. Patient characteristics

Among the 121 patients (mean age 68.2 ± 8.5 years, range 47–85 years), 52 (43.0%) were diagnosed with prostate cancer, and 43 (35.5%) were confirmed to have CSC. The median prostate-specific antigen (PSA) level was 6.5 ng/mL (interquartile range 4.5–10.4 ng/mL). Twenty-three patients underwent radical prostatectomy. The demographic, clinical, and pathologic information of the patients is summarized in Table 1.

Table 1
Clinical characteristics of the 121 patients.

Characteristics  Total (n = 121)
Mean age (range) (years)  68.2 ± 8.5 (47–85)
Median PSA (interquartile range) (ng/mL)  6.5 (4.5–10.4)
Use of 5α-reductase inhibitor  4 (3.3%)
Number of previous TRUS-guided biopsies before MRI
  None  87 (71.9%)
  One time  29 (24.0%)
  Two times  3 (2.5%)
  Three times  1 (0.8%)
  Four times  1 (0.8%)
Median time interval between MRI and biopsy (interquartile range) (days)  17 (9–26)
Median time interval between MRI and prostatectomy (interquartile range) (days)  44 (31–67)
Pathologically proven prostate cancers  52 (43.0%)
Pathologically proven CSCs  43 (35.5%)
Maximum Gleason score
  6 (3 + 3)  9 (17.3%)
  7 (3 + 4)  24 (42.3%)
  7 (4 + 3)  13 (23.1%)
  8 (4 + 4)  6 (11.5%)
  9 (4 + 5)  3 (5.8%)

CSC, clinically significant prostate cancer; PSA, prostate-specific antigen; TRUS, transrectal ultrasonography.

3.2. PI-RADS assessment and inter-reader agreement for DLA, clinical reports, and radiologists

The proportions of PI-RADS categories determined by the DLA, clinical reports, and radiologists varied. The detailed distribution of the PI-RADS scores by the DLA, clinical reports, and Readers 1–5 is shown in Fig. 2. The DLA assigned 60.3%, 0.8%, 9.9% and 27.3% of cases as PI-RADS 1, 3, 4, and 5, respectively. Inter-reader agreement for the PI-RADS score was moderate (κ, 0.461) between the DLA and Reader 5 and varied from poor to good between the other readers and Reader 5 (κ, 0.340, 0.457, 0.467, 0.609 for Readers 1–4, respectively). Inter-reader agreement for the PI-RADS score between the clinical reports and Reader 5 was moderate (κ, 0.422).

3.3. Comparison of diagnostic performance using AUROC

For diagnosing all prostate cancers, the AUROC of the DLA was 0.808, which was significantly higher than those of Reader 1 (AUROC, 0.698; p = 0.031) and the clinical reports (AUROC, 0.687; p = 0.015) and was similar to those of Readers 2–5 (AUROC, 0.786, 0.729, 0.862 and 0.874; p = 0.623, 0.101, 0.174 and 0.098, respectively) (Fig. 3a). In the diagnosis of CSCs, the AUROC of the DLA was 0.828, with no significant difference from those of Readers 2–4 and the clinical reports (AUROC, 0.811, 0.754, 0.882, and 0.730; p = 0.661, 0.110, 0.122, and 0.060, respectively). The performance of the DLA was superior to that of Reader group 1 (AUROC, 0.706; p = 0.011) but inferior to that of Reader 5 (AUROC, 0.914; p = 0.013) (Fig. 3b).

3.4. Dichotomous analysis

The sensitivities and specificities of the DLA, clinical reports, and radiologists in the diagnosis of prostate cancer (all prostate cancers or CSCs) varied widely (Tables 2 and 3). For both all prostate cancers and CSCs, no significant difference in sensitivity was noted between the DLA results and any of the readers or clinical reports at a PI-RADS cutoff value of either ≥ 3 or ≥ 4, except for Reader 5 at a PI-RADS cutoff value ≥ 3. The DLA showed significantly higher specificity in diagnosing all prostate cancers and CSCs relative to any of the radiologists and the clinical reports at a PI-RADS cutoff value ≥ 3. The sensitivities and specificities of the DLA and Reader 5 did not differ significantly when using a PI-RADS cutoff value ≥ 4 for diagnosing all prostate cancers and CSCs. The PPV also varied among the radiologists and clinical reports. The DLA showed a better PPV than all radiologists and clinical reports.

Fig. 2. Proportion (%) of PI-RADS scores assigned by the DLA, clinical reports and Readers 1–5. The distribution of PI-RADS scores determined by the DLA, clinical reports and Readers 1–5 was variable.
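The AUROC comparisons in Section 3.3 can be understood through the standard equivalence between the AUROC and the Mann-Whitney U statistic: with ordinal per-patient scores such as PI-RADS categories, the AUROC equals the probability that a randomly chosen cancer case receives a higher score than a randomly chosen non-cancer case, with ties counted as one half. A minimal sketch (the score lists below are illustrative, not the study data):

```python
def auroc(scores_pos, scores_neg):
    """AUROC via the Mann-Whitney U statistic.

    Probability that a positive (cancer) case receives a higher score
    than a negative case; ties contribute 1/2. Works directly on
    ordinal scores such as per-patient PI-RADS categories.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical PI-RADS scores for cancer and non-cancer patients.
cancer = [5, 5, 4, 4, 3, 2]
benign = [1, 1, 2, 2, 3, 4]
print(round(auroc(cancer, benign), 3))  # → 0.847
```

DeLong's method, used in the paper to compare curves, builds on this same rank-based estimator.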


The accuracy of the DLA was higher than those of Readers 1–3 and the clinical reports, and comparable to those of Readers 4 and 5. Figs. 4 and 5 and Supplementary Figs. 2 and 3 show representative examples of cancer detection by the DLA and the radiologists.

Fig. 3. Receiver operating characteristic (ROC) curves of the DLA, clinical reports and Readers 1–5. In the diagnosis of all cancers, the performance of the DLA is better than that of the clinical reports (p = 0.015); there are no significant differences in performance between the DLA and Readers 2–5 (a). In the diagnosis of clinically significant cancers, only Reader 5 shows better performance than the DLA (b).

4. Discussion

The PI-RADS assignments and the performance in diagnosing prostate cancer varied depending on the radiologists' experience in this study. The DLA showed moderate diagnostic performance, at a level between that of residents and that of an expert, in detecting lesions and classifying them according to PI-RADS. The diagnostic performance of the residents and less-experienced radiologists was not significantly better than that of the DLA. Moreover, the performance of the DLA was also similar to that of the clinical reports for diagnosing CSCs. Only the expert had significantly superior diagnostic performance to the DLA based on ROC curve analysis.

The most important strength of the DLA in this study was its higher specificity than those of the radiologists and clinical reports while maintaining sensitivity. For both all prostate cancers and CSCs, the DLA showed significantly higher specificity than all readers at a PI-RADS cutoff value ≥ 3; the same was observed at a PI-RADS cutoff value ≥ 4 but without statistical significance. High specificity was a characteristic of the DLA in this study, not a general characteristic of other algorithms; the specificities of U-Net ranged from 24% to 55% in previous studies [7,13]. Using a novel false-positive reduction network in the pipeline, the DLA succeeded in increasing specificity. Moreover, the DLA showed less reduction in specificity than the radiologists when the cutoff value was reduced from 4 to 3. This result may be caused by the difference in the proportion of PI-RADS 3 scores between the DLA and the radiologists: the proportion of PI-RADS 3 scores from the DLA was only 0.8%, whereas that for all readers and clinical reports ranged from 9.9% to 24.0%. Given the ambiguous meaning of PI-RADS 3, reducing the number of PI-RADS 3 scores may help reduce the number of unnecessary biopsies in clinical practice [15]. The sensitivity and specificity of the DLA (PI-RADS cutoff value ≥ 4) were 76.7% and 85.9%, respectively, which were comparable to those of the expert and those obtained in a previous meta-analysis, which showed 79% pooled sensitivity and 88% pooled specificity for bpMRI [16]. No significant difference in sensitivity was noted between the DLA and the clinical reports at a PI-RADS cutoff value of either ≥ 3 or ≥ 4, and the AUROC of the DLA (0.828) was similar to that of the clinical reports (AUROC, 0.730; p = 0.060).

When using the DLA, PI-RADS 1 was assigned in 60.3% of all cases, and PI-RADS 2 was not assigned. According to the DLA's pipeline, the DLA detects abnormal lesions using a localization network that separates PI-RADS 1 and 2 from PI-RADS ≥ 3 and then assigns PI-RADS scores from 3 to 5. Therefore, the distribution of PI-RADS scores 1 and 2 from the DLA differed from that of the radiologists. Despite this distribution, the inter-reader agreement between the DLA and the expert was moderate, not inferior to that of the other radiologists except for Reader 4. In addition, discrimination between PI-RADS scores 1 and 2 is not important for diagnosing prostate cancer in clinical practice.

Deep learning-based AI algorithms have shown valuable performance in differentiating prostate cancer from normal tissues [7–11]. A few studies have shown good performance of a machine probability score in detecting and classifying lesions compared with radiologists' PI-RADS classification [7,13]. In those studies, algorithm performance was compared with clinical routine interpretations from 8 or 9 radiologists, but the radiologist reviews were not analyzed individually. In our study, more than five radiologists with various levels of experience independently reviewed the prostate MRIs. As such, we could compare the individual diagnostic performance between radiologists with various levels of experience and the DLA. We also analyzed clinical reports from routine interpretation and compared performance between the DLA and a mixture of board-certified radiologists.

The inter-reader agreement of PI-RADS scores between the radiologists and the expert varied according to experience. The DLA and the clinical reports showed moderate agreement with the expert, and the DLA resulted in slightly higher inter-reader agreement than the clinical reports. Previous studies indicated that radiologist inter-reader agreement for PI-RADS category assignment varied from poor to good [17–21], with more experienced radiologists showing greater inter-reader agreement. Therefore, the moderate agreement between the DLA and the expert radiologist seems promising, and DLA-based PI-RADS categorization may help reduce inter-reader variability in clinical practice.
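The weighted kappa statistic used for the agreement analysis credits partial agreement between nearby PI-RADS categories rather than treating every mismatch equally. A minimal sketch with linear weights (an assumption; the study does not state its weighting scheme), run on hypothetical score lists rather than the study's reads:

```python
from collections import Counter

def weighted_kappa(r1, r2, categories=(1, 2, 3, 4, 5)):
    """Linearly weighted kappa for two raters' ordinal scores."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)
    # Disagreement weight grows linearly with category distance.
    w = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    obs = Counter(zip(r1, r2))
    m1, m2 = Counter(r1), Counter(r2)
    # Observed vs chance-expected weighted disagreement.
    d_obs = sum(w[idx[a]][idx[b]] * c / n for (a, b), c in obs.items())
    d_exp = sum(
        w[idx[a]][idx[b]] * (m1[a] / n) * (m2[b] / n)
        for a in categories for b in categories
    )
    return 1.0 - d_obs / d_exp

# Hypothetical PI-RADS assignments from two readers.
reader_a = [1, 2, 3, 4, 5, 3, 4, 2]
reader_b = [1, 2, 4, 4, 5, 3, 5, 1]
print(round(weighted_kappa(reader_a, reader_b), 3))  # → 0.765
```

Quadratic weights (squaring the category distance) are an equally common choice and would change the numeric values but not the overall interpretation scale.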


Table 2
Dichotomous analysis of the DLA, clinical reports and Readers 1–5; reference standard based on the presence of any pathologically proven prostate cancer.
Sensitivity, % Corrected P value* Specificity, % Corrected P value* Accuracy, % PPV, % NPV, %

DLA
PI-RADS ≥ 3 73.1 (38/52) Reference 87.0 (60/69) Reference 81.0 (98/121) 80.9 (38/47) 81.1 (60/74)
PI-RADS ≥ 4 69.2 (36/52) Reference 88.4 (61/69) Reference 80.2 (97/121) 81.8 (36/44) 79.2 (61/77)
Reader group 1
PI-RADS ≥ 3 69.2 (36/52) >0.99 40.6 (28/69) <0.001 52.9 (64/121) 46.8 (36/77) 63.6 (28/44)
PI-RADS ≥ 4 57.7 (30/52) 0.654 68.1 (47/69) 0.042 63.6 (77/121) 57.7 (30/52) 68.1 (47/69)
Reader group 2
PI-RADS ≥ 3 76.9 (40/52) >0.99 49.3 (34/69) <0.001 61.2 (74/121) 53.3 (40/75) 73.9 (34/46)
PI-RADS ≥ 4 71.2 (37/50) >0.99 65.2 (45/69) 0.030 67.8 (82/121) 60.7 (37/61) 78.3 (45/60)
Reader 3
PI-RADS ≥ 3 82.7 (43/52) >0.99 29.0 (20/69) <0.001 52.1 (63/121) 46.7 (43/92) 69.0 (20/29)
PI-RADS ≥ 4 80.8 (42/52) 0.876 50.7 (35/69) <0.001 63.6 (77/121) 55.3 (42/76) 77.8 (35/45)
Reader 4
PI-RADS ≥ 3 90.4 (47/52) 0.072 60.9 (42/69) <0.001 73.6 (89/121) 63.5 (47/74) 89.4 (42/47)
PI-RADS ≥ 4 86.5 (45/52) 0.072 79.7 (55/69) >0.99 82.6 (100/121) 76.3 (45/59) 88.7 (55/62)
Reader 5
PI-RADS ≥ 3 92.3 (48/52) 0.036 58.0 (40/69) <0.001 72.7 (88/121) 62.3 (48/77) 90.9 (40/44)
PI-RADS ≥ 4 84.6 (44/52) 0.234 81.2 (56/69) >0.99 82.6 (100/121) 77.2 (44/59) 87.5 (56/64)
Clinical reports
PI-RADS ≥ 3 84.6 (44/52) >0.99 23.2 (16/69) <0.001 49.6 (60/121) 44.4 (44/99) 72.7 (16/22)
PI-RADS ≥ 4 78.8 (41/52) >0.99 36.2 (25/69) <0.001 54.5 (66/121) 47.1 (41/87) 73.5 (25/34)

DLA, deep learning-based algorithm; PI-RADS, prostate imaging-reporting and data system; PPV, positive predictive value; NPV, negative predictive value.
*
Bonferroni corrected p-value; p-values were multiplied by 6.
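The per-patient metrics in Tables 2 and 3 follow directly from the four confusion-matrix counts shown in parentheses. A quick sketch reproducing the DLA row of Table 2 at the PI-RADS ≥ 3 cutoff (TP = 38, FN = 14, FP = 9, TN = 60, taken from the table):

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Per-patient diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # detected cancers / all cancers
        "specificity": tn / (tn + fp),  # correct negatives / all negatives
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# DLA, PI-RADS >= 3, all prostate cancers (counts from Table 2).
m = diagnostic_metrics(tp=38, fn=14, fp=9, tn=60)
print({k: round(100 * v, 1) for k, v in m.items()})
# → {'sensitivity': 73.1, 'specificity': 87.0, 'accuracy': 81.0,
#    'ppv': 80.9, 'npv': 81.1}
```

The same function reproduces any other row of Tables 2 and 3 from its parenthesized counts.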

Table 3
Dichotomous analysis of the DLA, clinical reports and Readers 1–5; reference standard based on the presence of pathologically proven clinically significant prostate cancer.
Sensitivity, % Corrected P value* Specificity, % Corrected P value* Accuracy, % PPV, % NPV, %

DLA
PI-RADS ≥ 3 81.4 (35/43) Reference 84.6 (66/78) Reference 83.5 (101/121) 74.5 (35/47) 89.2 (66/74)
PI-RADS ≥ 4 76.7 (33/43) Reference 85.9 (67/78) Reference 82.6 (100/121) 75.0 (33/44) 87.0 (67/77)
Reader group 1
PI-RADS ≥ 3 76.7 (33/43) >0.99 43.6 (34/78) <0.001 55.4 (67/121) 42.9 (33/77) 77.3 (34/44)
PI-RADS ≥ 4 62.8 (27/43) 0.420 67.9 (53/78) 0.066 66.1 (80/121) 51.9 (27/52) 76.8 (53/69)
Reader group 2
PI-RADS ≥ 3 83.7 (36/43) >0.99 50.0 (39/78) <0.001 62.0 (75/121) 48.0 (36/75) 84.8 (39/46)
PI-RADS ≥ 4 79.1 (34/43) >0.99 65.4 (50/77) 0.036 71.2 (84/118) 55.7 (34/61) 87.7 (50/57)
Reader 3
PI-RADS ≥ 3 88.4 (38/43) >0.99 30.8 (24/78) <0.001 51.2 (62/121) 41.3 (38/92) 82.8 (24/29)
PI-RADS ≥ 4 86.0 (37/43) >0.99 50.0 (39/78) <0.001 62.8 (76/121) 48.7 (37/76) 86.7 (39/45)
Reader 4
PI-RADS ≥ 3 95.3 (41/43) 0.420 57.7 (45/78) <0.001 71.1 (86/121) 55.4 (41/74) 95.7 (45/47)
PI-RADS ≥ 4 90.7 (39/43) 0.420 74.4 (58/78) 0.468 80.2 (97/121) 66.1 (39/59) 92.5 (58/62)
Reader 5
PI-RADS ≥ 3 100.0 (43/43) 0.048 56.4 (44/78) <0.001 71.9 (87/121) 55.8 (43/77) 100.0 (44/44)
PI-RADS ≥ 4 93.0 (40/43) 0.234 78.2 (61/78) >0.99 83.5 (101/121) 70.2 (40/57) 95.3 (61/64)
Clinical reports
PI-RADS ≥ 3 88.4 (38/43) >0.99 24.4 (19/78) <0.001 47.1 (57/121) 38.4 (38/99) 86.4 (19/22)
PI-RADS ≥ 4 83.7 (36/43) >0.99 37.2 (29/78) <0.001 53.7 (65/121) 41.4 (36/87) 85.3 (29/34)

DLA, deep learning-based algorithm; PI-RADS, prostate imaging-reporting and data system; PPV, positive predictive value; NPV, negative predictive value.
*
Bonferroni corrected p-value; p-values were multiplied by 6.
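The paired sensitivity and specificity comparisons behind the corrected p-values in Tables 2 and 3 use McNemar's test, which depends only on the discordant pairs (patients on whom exactly one of the two readers is correct), followed by the Bonferroni multiplication by 6 described in Section 2.7. A sketch using the exact binomial form, with hypothetical discordant counts rather than the study's data:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant counts.

    b = cases only reader A got right, c = cases only reader B got
    right; under H0 each discordant case is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def bonferroni(p, n_comparisons=6):
    """Bonferroni correction as used in the study (multiply by 6)."""
    return min(1.0, p * n_comparisons)

# Hypothetical: DLA alone correct in 2 patients, a reader alone in 8.
p = mcnemar_exact(2, 8)     # exact p = 7/64 = 0.109375
print(p, bonferroni(p))     # corrected: 0.109375 * 6 = 0.65625
```

Large statistics packages use the same discordant-pair logic, with a chi-square approximation when the discordant counts are large.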

The performance of PI-RADS across 26 centers with members in the Society of Abdominal Radiology Prostate Cancer Disease-focused Panel varied widely in a previous study [22]. Even radiologists with a high level of experience in prostate MRI showed variable results. Therefore, greater dedication to training and the development of a quality assurance program are necessary in prostate MRI [22]. In our institution, abdominal/genitourinary radiologists interpret prostate MRI as part of the routine clinical process. From this point of view, even though the performance of the DLA was not superior to that of the expert's retrospective review, it makes sense that the performance of the DLA was superior to that of clinical reports made by multiple board-certified radiologists, not all of whom were experts in prostate imaging. In addition, the DLA showed similar sensitivity to, and higher specificity than, radiologists with various levels of experience in retrospective review. From these results, we believe that the DLA can assist radiologists as a second reader to reduce variability in PI-RADS assessment. In a previous study, a consensus of a deep convolutional neural network (DCNN) and the radiologist's PI-RADS score showed better diagnostic performance than either the DCNN or the PI-RADS score alone [9]. Further research on the value of the DLA as a decision support tool is needed.

5. Limitations

Our study has several limitations. First, both the size of the study group and the number of readers were small. The DLA is prototype software that could analyze images only from a single-vendor (Siemens Healthcare) 3-T MRI machine during the study period, so only 121 MRIs were enrolled. Given that the number of readers was relatively small, the comparison of DLA performance might have generalization issues. Second, selection bias may exist because the patients who underwent biopsy were not representative of patients with clinical suspicion of prostate cancer in the real world.


Fig. 4. True positive lesion detected by the readers and the DLA. MRI of the prostate gland was performed in a 71-year-old male with an elevated PSA (19.9 ng/mL). The axial T2-weighted image shows an ill-defined low-signal-intensity mass (area with yellow dotted line) in both peripheral zones at the prostate base (a). The mass shows high signal intensity on the diffusion-weighted image (b = 1,000 sec/mm2) (b) and low values on the ADC map (c). The DLA detected the same lesion and assigned PI-RADS category 5. The DLA indicates the abnormal area using a suspicion map and presents the lesion as a pink area on the T2-weighted image (d). Readers 2 and 3 missed this lesion, and Readers 1, 4 and 5 detected it. In the clinical report, the PI-RADS score was 1. Clinically significant cancer (Gleason score 7 [4 + 3]) was confirmed by biopsy. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 5. False-negative result of the DLA. Prostate MRI was performed in an 80-year-old man with an elevated PSA (15.0 ng/mL). The axial T2-weighted image shows a focal homogeneous low-signal-intensity lesion (area within the white dotted line) in the anterior aspect of the transition zone (a). The mass shows high signal intensity on the diffusion-weighted image (b = 1000 sec/mm2) (b) and low values on the ADC map (c). The DLA did not detect the lesion, but all readers detected it as PI-RADS ≥ 4. In the clinical report, the lesion was also classified as PI-RADS 4. Clinically significant cancer (Gleason score 7 [3 + 4]) was confirmed by biopsy.


cancer in the real world. However, the percentage of PI-RADS 1 or 2 scores assigned by Reader 5 in the study population was 36.0%, which was similar to the 33% reported in a previous prospective study [23]. This study included patients from the period in which the usefulness of prebiopsy MRI was still being established; therefore, many patients underwent prostate biopsy even with negative prebiopsy MRI results. Third, the reference standard was mainly based on biopsy results. We tried to minimize this limitation by performing targeted biopsy and considering all systematic biopsy results. In addition, including only radical prostatectomy patients would have induced bias by excluding many patients who underwent active surveillance, systemic therapy, or focal therapy. Fourth, Reader groups 1 and 2 each consisted of four residents with similar experience in prostate imaging. Although this analysis cannot assess each resident's diagnostic performance, we believe that the results reflect the overall diagnostic performance of residents with similar levels of experience. Fifth, our results did not strictly follow PI-RADS v2, because we used bpMRI rather than mpMRI. However, many previous studies have shown comparable diagnostic performance between bpMRI and mpMRI [16,19,24–27]. Sixth, the evaluation of MRIs obtained at one of the institutions that had provided cases (100/2,170 cases) for developing the DLA could overestimate the performance of the DLA in this study. Seventh and lastly, PI-RADS v2 recommends high b-values (≥ 1400 sec/mm2). In our study, DWI with a highest b-value of 1000 sec/mm2 was used for review by the radiologists and analysis by the DLA, which could potentially reduce overall diagnostic performance for both the readers and the DLA. However, this was because the study included patients who underwent prostate MRI before the release of PI-RADS v2, when high b-values ≥ 1400 sec/mm2 were not routinely part of the scan protocol.

6. Conclusion

This study provides the first comparison between a DLA and radiologists with various levels of experience in PI-RADS classification. The DLA showed moderate diagnostic performance, at a level between those of residents and an expert, for detecting lesions and classifying them according to PI-RADS. The performance of the DLA was similar to that of clinical reports from various radiologists in clinical practice.

Funding

This work was supported by the National Research Foundation of Korea (NRF) under Grant 2018R1D1A1B07050160.

CRediT authorship contribution statement

Seo Yeon Youn: Methodology, Formal analysis, Writing - review & editing. Moon Hyung Choi: Conceptualization, Methodology, Writing - review & editing, Supervision. Dong Hwan Kim: Investigation, Formal analysis. Young Joon Lee: Investigation, Methodology. Henkjan Huisman: Investigation. Evan Johnson: Investigation. Tobias Penzkofer: Investigation. Ivan Shabunin: Investigation. David Jean Winkel: Investigation. Pengyi Xing: Investigation. Dieter Szolar: Investigation. Robert Grimm: Data curation, Software. Heinrich von Busch: Software. Yohan Son: Software, Resources. Bin Lou: Software. Ali Kamen: Software.

Declaration of Competing Interest

Robert Grimm, Heinrich von Busch, Yohan Son, Bin Lou, and Ali Kamen are employees of Siemens Healthineers or Siemens Healthcare. The other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ejrad.2021.109894.

References

[1] V. Kasivisvanathan, M. Emberton, C.M. Moore, MRI-Targeted Biopsy for Prostate-Cancer Diagnosis, N. Engl. J. Med. 379 (6) (2018) 589–590.
[2] H.U. Ahmed, A. El-Shater Bosaily, L.C. Brown, R. Gabe, R. Kaplan, M.K. Parmar, Y. Collaco-Moraes, K. Ward, R.G. Hindley, A. Freeman, A.P. Kirkham, R. Oldroyd, C. Parker, M. Emberton, Diagnostic accuracy of multi-parametric MRI and TRUS biopsy in prostate cancer (PROMIS): a paired validating confirmatory study, The Lancet 389 (10071) (2017) 815–822.
[3] F.J.H. Drost, D. Osses, D. Nieboer, C.H. Bangma, E.W. Steyerberg, M.J. Roobol, I.G. Schoots, Prostate Magnetic Resonance Imaging, with or Without Magnetic Resonance Imaging-targeted Biopsy, and Systematic Biopsy for Detecting Prostate Cancer: A Cochrane Systematic Review and Meta-analysis, Eur. Urol. 77 (1) (2020) 78–94.
[4] J.C. Weinreb, J.O. Barentsz, P.L. Choyke, F. Cornud, M.A. Haider, K.J. Macura, D. Margolis, M.D. Schnall, F. Shtern, C.M. Tempany, H.C. Thoeny, S. Verma, PI-RADS Prostate Imaging - Reporting and Data System: 2015, Version 2, Eur. Urol. 69 (1) (2016) 16–40.
[5] G.A. Sonn, R.E. Fan, P. Ghanouni, N.N. Wang, J.D. Brooks, A.M. Loening, B.L. Daniel, K.J. To'o, A.E. Thong, J.T. Leppert, Prostate Magnetic Resonance Imaging Interpretation Varies Substantially Across Radiologists, Eur. Urol. Focus 5 (4) (2019) 592–599.
[6] A.R. Padhani, B. Turkbey, Detecting Prostate Cancer with Deep Learning for MRI: A Small Step Forward, Radiology 293 (3) (2019) 618–619.
[7] P. Schelb, S. Kohl, J.P. Radtke, M. Wiesenfarth, P. Kickingereder, S. Bickelhaupt, T.A. Kuder, A. Stenzinger, M. Hohenfellner, H.P. Schlemmer, K.H. Maier-Hein, D. Bonekamp, Classification of Cancer at Prostate MRI: Deep Learning versus Clinical PI-RADS Assessment, Radiology 293 (3) (2019) 607–617.
[8] S. Yoo, I. Gujrathi, M.A. Haider, F. Khalvati, Prostate Cancer Detection using Deep Convolutional Neural Networks, Sci. Rep. 9 (1) (2019) 19518.
[9] Y. Song, Y.D. Zhang, X. Yan, H. Liu, M. Zhou, B. Hu, G. Yang, Computer-aided diagnosis of prostate cancer using a deep convolutional neural network from multiparametric MRI, J. Magn. Reson. Imaging 48 (6) (2018) 1570–1577.
[10] Y. Sumathipala, N. Lay, B. Turkbey, C. Smith, P.L. Choyke, R.M. Summers, Prostate cancer detection from multi-institution multiparametric MRIs using deep convolutional neural networks, J. Med. Imaging (Bellingham) 5 (4) (2018) 044507.
[11] J. Ishioka, Y. Matsuoka, S. Uehara, Y. Yasuda, T. Kijima, S. Yoshida, M. Yokoyama, K. Saito, K. Kihara, N. Numao, T. Kimura, K. Kudo, I. Kumazawa, Y. Fujii, Computer-aided diagnosis of prostate cancer on magnetic resonance imaging using a convolutional neural network algorithm, BJU Int. 122 (3) (2018) 411–417.
[12] T. Sanford, S.A. Harmon, E.B. Turkbey, D. Kesani, S. Tuncer, M. Madariaga, C. Yang, J. Sackett, S. Mehralivand, P. Yan, S. Xu, B.J. Wood, M.J. Merino, P.A. Pinto, P.L. Choyke, B. Turkbey, Deep-Learning-Based Artificial Intelligence for PI-RADS Classification to Assist Multiparametric Prostate MRI Interpretation: A Development Study, J. Magn. Reson. Imaging (2020).
[13] P. Schelb, X. Wang, J.P. Radtke, M. Wiesenfarth, P. Kickingereder, A. Stenzinger, M. Hohenfellner, H.P. Schlemmer, K.H. Maier-Hein, D. Bonekamp, Simulated clinical deployment of fully automatic deep learning for clinical prostate MRI assessment, Eur. Radiol. (2020).
[14] X. Yu, B. Lou, B. Shi, D. Winkel, N. Arrahmane, M. Diallo, T. Meng, H. von Busch, R. Grimm, B. Kiefer, D. Comaniciu, A. Kamen, H. Huisman, A. Rosenkrantz, T. Penzkofer, I. Shabunin, M.H. Choi, Q. Yang, D. Szolar, False Positive Reduction Using Multiscale Contextual Features for Prostate Cancer Detection in Multi-Parametric MRI Scans, in: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), 2020, pp. 1355–1359.
[15] M. de Rooij, B. Israel, M. Tummers, H.U. Ahmed, T. Barrett, F. Giganti, B. Hamm, V. Logager, A. Padhani, V. Panebianco, P. Puech, J. Richenberg, O. Rouviere, G. Salomon, I. Schoots, J. Veltman, G. Villeirs, J. Walz, J.O. Barentsz, ESUR/ESUI consensus statements on multi-parametric MRI for the detection of clinically significant prostate cancer: quality requirements for image acquisition, interpretation and radiologists' training, Eur. Radiol. (2020).
[16] Z. Kang, X. Min, J. Weinreb, Q. Li, Z. Feng, L. Wang, Abbreviated Biparametric Versus Standard Multiparametric MRI for Diagnosis of Prostate Cancer: A Systematic Review and Meta-Analysis, AJR Am. J. Roentgenol. 212 (2) (2019) 357–365.
[17] M.D. Greer, J.H. Shih, N. Lay, T. Barrett, L. Bittencourt, S. Borofsky, I. Kabakus, Y.M. Law, J. Marko, H. Shebel, M.J. Merino, B.J. Wood, P.A. Pinto, R.M. Summers, P.L. Choyke, B. Turkbey, Interreader Variability of Prostate Imaging Reporting and Data System Version 2 in Detecting and Assessing Prostate Cancer Lesions at Prostate MRI, AJR Am. J. Roentgenol. (2019) 1–8.
[18] B.G. Muller, J.H. Shih, S. Sankineni, J. Marko, S. Rais-Bahrami, A.K. George, J.J. de la Rosette, M.J. Merino, B.J. Wood, P. Pinto, P.L. Choyke, B. Turkbey, Prostate Cancer: Interobserver Agreement and Accuracy with the Revised Prostate Imaging Reporting and Data System at Multiparametric MR Imaging, Radiology 277 (3) (2015) 741–750.
[19] M.H. Choi, C.K. Kim, Y.J. Lee, S.E. Jung, Prebiopsy Biparametric MRI for Clinically Significant Prostate Cancer Detection With PI-RADS Version 2: A Multicenter Study, AJR Am. J. Roentgenol. 212 (4) (2019) 839–846.
[20] C.P. Smith, S.A. Harmon, T. Barrett, L.K. Bittencourt, Y.M. Law, H. Shebel, J.Y. An, M. Czarniecki, S. Mehralivand, M. Coskun, B.J. Wood, P.A. Pinto, J.H. Shih, P.L. Choyke, B. Turkbey, Intra- and interreader reproducibility of PI-RADSv2: A multireader study, J. Magn. Reson. Imaging 49 (6) (2019) 1694–1703.
[21] A.B. Rosenkrantz, L.A. Ginocchio, D. Cornfeld, A.T. Froemming, R.T. Gupta, B. Turkbey, A.C. Westphalen, J.S. Babb, D.J. Margolis, Interobserver Reproducibility of the PI-RADS Version 2 Lexicon: A Multicenter Study of Six Experienced Prostate Radiologists, Radiology 280 (3) (2016) 793–804.


[22] A.C. Westphalen, C.E. McCulloch, J.M. Anaokar, S. Arora, N.S. Barashi, J.O. Barentsz, T.K. Bathala, L.K. Bittencourt, M.T. Booker, V.G. Braxton, P.R. Carroll, D.D. Casalino, S.D. Chang, F.V. Coakley, R. Dhatt, S.C. Eberhardt, B.R. Foster, A.T. Froemming, J.J. Fütterer, D.M. Ganeshan, M.R. Gertner, L.M. Gettle, S. Ghai, R.T. Gupta, M.E. Hahn, R. Houshyar, C. Kim, C.K. Kim, C. Lall, D.J.A. Margolis, S.E. McRae, A. Oto, R.B. Parsons, N.U. Patel, P.A. Pinto, T.J. Polascik, B. Spilseth, J.B. Starcevich, V.S. Tammisetti, S.S. Taneja, B. Turkbey, S. Verma, J.F. Ward, C.A. Warlick, A.R. Weinberger, J. Yu, R.J. Zagoria, A.B. Rosenkrantz, Variability of the Positive Predictive Value of PI-RADS for Prostate MRI across 26 Centers: Experience of the Society of Abdominal Radiology Prostate Cancer Disease-focused Panel, Radiology 296 (1) (2020) 76–84.
[23] A.R. Padhani, J. Barentsz, G. Villeirs, A.B. Rosenkrantz, D.J. Margolis, B. Turkbey, H.C. Thoeny, F. Cornud, M.A. Haider, K.J. Macura, C.M. Tempany, S. Verma, J.C. Weinreb, PI-RADS Steering Committee: The PI-RADS Multiparametric MRI and MRI-directed Biopsy Pathway, Radiology 292 (2) (2019) 464–474.
[24] S. Woo, C.H. Suh, S.Y. Kim, J.Y. Cho, S.H. Kim, M.H. Moon, Head-to-Head Comparison Between Biparametric and Multiparametric MRI for the Diagnosis of Prostate Cancer: A Systematic Review and Meta-Analysis, AJR Am. J. Roentgenol. 211 (5) (2018) W226–W241.
[25] D. Junker, F. Steinkohl, V. Fritz, J. Bektic, T. Tokas, F. Aigner, T.R.W. Herrmann, M. Rieger, U. Nagele, Comparison of multiparametric and biparametric MRI of the prostate: are gadolinium-based contrast agents needed for routine examinations? World J. Urol. 37 (4) (2019) 691–699.
[26] X.K. Niu, X.H. Chen, Z.F. Chen, L. Chen, J. Li, T. Peng, Diagnostic Performance of Biparametric MRI for Detection of Prostate Cancer: A Systematic Review and Meta-Analysis, AJR Am. J. Roentgenol. 211 (2) (2018) 369–378.
[27] M. Alabousi, J.P. Salameh, K. Gusenbauer, L. Samoilov, A. Jafri, H. Yu, A. Alabousi, Biparametric vs multiparametric prostate magnetic resonance imaging for the detection of prostate cancer in treatment-naive patients: a diagnostic test accuracy systematic review and meta-analysis, BJU Int. 124 (2) (2019) 209–220.
