
2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI)

April 3-7, 2020, Iowa City, Iowa, USA

FALSE POSITIVE REDUCTION USING MULTISCALE CONTEXTUAL FEATURES FOR
PROSTATE CANCER DETECTION IN MULTI-PARAMETRIC MRI SCANS

Xin Yu1, Bin Lou1, Bibo Shi1, David Winkel1,2, Nacim Arrahmane1, Mamadou Diallo1, Tongbai Meng1,
Heinrich von Busch3, Robert Grimm3, Berthold Kiefer3, Dorin Comaniciu1, Ali Kamen1,
ProstateAI Clinical Collaborators∗

1 Digital Technology and Innovation Division, Siemens Healthineers, Princeton, NJ, USA
2 Universitätsspital Basel, Basel, Switzerland
3 Diagnostic Imaging, Siemens Healthineers, Erlangen, Germany

ABSTRACT

Prostate cancer (PCa) is the most prevalent and one of the leading causes of cancer death among men. Multi-parametric MRI (mp-MRI) is a prominent diagnostic scan, which could help in avoiding unnecessary biopsies for men screened for PCa. Artificial intelligence (AI) systems could help radiologists to be more accurate and consistent in diagnosing clinically significant cancer from mp-MRI scans. Lack of specificity has recently been identified as one of the weak points of such assistance systems. In this paper, we propose a novel false positive reduction network to be added to the overall detection system to further analyze lesion candidates. The new network utilizes multiscale 2D image stacks of these candidates to discriminate between true and false positive detections. We trained and validated our network on a dataset with 2170 cases from seven different institutions and tested it on a separate independent dataset with 243 cases. With the proposed model, we achieved an area under the curve (AUC) of 0.876 on discriminating between true and false positive detected lesions and improved the AUC from 0.825 to 0.867 on overall identification of clinically significant cases.

Index Terms — Deep learning, prostate cancer, false positive reduction, mp-MRI

1. INTRODUCTION

Prostate cancer (PCa) is one of the most prevalent cancers in 2019 among males in the United States. It is estimated that over 3.6 million men have a history of PCa and 174,650 cases will be newly diagnosed in 2019 [1]. In recent years, multi-parametric MRI (mp-MRI) has shown its utility as a non-invasive imaging tool for detection, localization, and classification of PCa [2], and there is growing evidence that biparametric MRI (bp-MRI) protocols, consisting of only T2-weighted (T2w) imaging and diffusion-weighted imaging (DWI), offer similar diagnostic performance compared to the mp-MRI approach [3]. Many attempts have been made to help radiologists in detecting and classifying PCa lesions using mp-MRI. Litjens et al. [4] and Cao et al. [5] indicated that computer-aided detection (CAD) systems using either mp-MRI or bp-MRI scans can achieve detection sensitivity comparable to a radiologist's. However, compared to radiologists' performance, CAD systems usually have relatively lower specificity. This can in turn lead to overdiagnosis or overtreatment, which should be avoided as much as possible. Therefore, false positive reduction (FPR) is considered an essential part of any assistance system. Furthermore, there are significant intensity pattern similarities between cancerous and benign tissues within the prostate gland as expressed in the various contrasts of MRI scans. This, together with size variations among cancerous lesions, makes FPR for PCa lesions particularly challenging.

Inspired by recent articles aiming at lung nodule FPR [6, 7, 8], we propose a novel approach for utilizing multiscale images to learn better image features and effectively remove false positives in PCa detection. The FPR network utilizes the results of an up-stream detection network. These detected lesions can be seen as candidates that need to be further analyzed and separated into true and false lesions. The detection network uses Prostate Imaging Reporting and Data System (PI-RADS) scores, the standard for reporting prostate MRI findings [9], as ground truth. In this system, a lesion is regarded as clinically significant if its PI-RADS score is equal to or greater than 3 [10]. The strategies used in our proposed FPR network are summarized as follows: (1) utilizing 2.5D inputs to incorporate more out-of-plane contextual information; (2) extracting image features from different fields of view (i.e., multiscale images); (3) devising a fusion module to assign optimal weights to different scales based on their contributions to the final classification; and (4) applying a multi-task loss function to extract the most discriminative set of features. The proposed model was validated on an independent dataset with 243 patients and showed a significant improvement in terms of removing false positive lesions.

∗ A list of members and affiliations appears at the end of the paper.
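As a rough illustration of strategies (1) and (2) above, the following sketch (hypothetical array shapes and helper names, not the authors' implementation) crops 2.5D multiscale patches around a candidate location from a single-contrast volume; a nearest-neighbor index trick stands in for a proper image resize:

```python
import numpy as np

def crop_25d_multiscale(volume, center, scales=(32, 48, 64), out_size=48):
    """Crop 2.5D multiscale patches around a candidate.

    volume: (slices, H, W) array; center: (s, y, x) candidate peak,
    assumed far enough from the borders. Returns one
    (3, out_size, out_size) stack per scale: the candidate slice plus
    its two neighbors, cropped at each field of view and resized
    in-plane to a common size.
    """
    s, y, x = center
    patches = []
    for size in scales:
        half = size // 2
        # previous, current and next slice as channels (2.5D input)
        stack = volume[s - 1:s + 2, y - half:y + half, x - half:x + half]
        # nearest-neighbor resize to the common in-plane size
        # (toy stand-in for a proper interpolation-based resize)
        idx = np.arange(out_size) * size // out_size
        patches.append(stack[:, idx][:, :, idx])
    return patches

# toy volume: 30 slices of 240x240, candidate near the gland center
vol = np.random.rand(30, 240, 240).astype(np.float32)
multi = crop_25d_multiscale(vol, center=(15, 120, 120))
print([p.shape for p in multi])  # [(3, 48, 48), (3, 48, 48), (3, 48, 48)]
```

In the paper's pipeline the three resulting stacks are fed to separate residual-block branches before fusion; here they are simply returned as a list.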

978-1-5386-9330-8/20/$31.00 ©2020 IEEE

Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on May 27,2020 at 11:23:30 UTC from IEEE Xplore. Restrictions apply.
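Strategy (4), the multi-task loss detailed in Section 2.4, can be sketched numerically as follows. This is a minimal NumPy version under assumed shapes (per-sample probabilities, deep features, and class centers); α, β and λ follow the paper's notation, while the helper names are ours:

```python
import numpy as np

def weighted_bce(p, y, alpha=3.0):
    # binary cross-entropy with a larger penalty (alpha) on
    # misclassified positive samples, averaged over the batch
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(alpha * y * np.log(p) + (1 - y) * np.log(1 - p))

def center_loss(x, y, c0, c1, beta=2.0):
    # pull deep features toward their class center; positives (y=1)
    # are weighted by beta to enforce tighter intra-class variation
    d1 = np.sum((x - c1) ** 2, axis=1)
    d0 = np.sum((x - c0) ** 2, axis=1)
    return 0.5 * np.sum(beta * y * d1 + (1 - y) * d0)

def total_loss(p, y, x, c0, c1, lam=0.2):
    # combined multi-task objective: L_total = L_BCE + lambda * L_C
    return weighted_bce(p, y) + lam * center_loss(x, y, c0, c1)

# toy batch: 4 candidates with 8-dimensional deep features
rng = np.random.default_rng(0)
y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.9, 0.1, 0.8, 0.3])
x = rng.normal(size=(4, 8))
c0, c1 = np.zeros(8), np.ones(8)
print(total_loss(p, y, x, c0, c1))
```

In training, the class centers c0 and c1 would themselves be updated from the running feature statistics [14]; they are fixed here purely for illustration.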
2. PROPOSED FRAMEWORK

Our overall system for detecting clinically significant PCa lesions has two stages: (1) PCa detection using a detection network, and (2) FPR using the detection results from the up-stream detection network. In the first stage, we aim to achieve a high sensitivity, whereas in the second stage we aim to minimize the false detection rate with a minimal impact on the sensitivity. The whole pipeline is depicted in Fig. 1.

2.1. Prostate Cancer Detection

The detection network had a UNet architecture with 2D residual blocks [11, 12]. The Res-UNet was designed to have 5 down-sampling and 5 up-sampling residual blocks. For each residual block, the "bottleneck" architecture [12] with a stack of 3 layers was adopted. There were 16 filters in the first residual block, with the filter count doubling for every block.

Our detection network was evaluated using 3D volumes. Each patient imaging study was passed through the detection network separately, and the output was a heatmap volume where the lesion candidates or regions were designated with non-zero values. The heatmap was then thresholded to generate a set of 3D connected components. A true positive (TP) detection was identified if the detection was within the annotated lesion boundary or less than 5 mm away from the lesion center. Otherwise, the detection was considered a false positive (FP). Lesions without a matched detection were regarded as false negatives (FNs). The detected components were used as inputs for the FPR network. TP and FP detections were respectively used as positive and negative samples for training the down-stream FPR network. After computing the connected component peaks, we cropped small patches from the original set of scans and used them as inputs for the training of the FPR network.

2.2. 2.5D Multiscale Data

Our FPR network was a 2.5D network by design. For each 2D input slice I_s, we also used its neighboring slices I_{s-1} and I_{s+1} as additional channels. For this reason, we refer to it as a 2.5D input. The advantage of the 2.5D input is that a) in the case of true positive lesions, we expect the pattern to extend to the neighboring slices to some extent, and based on this consistency across images the network can identify them as true positives; and b) since the detection network operates only in 2D, the FPR network has a chance to focus on inconsistencies and the lack of coherent signatures across the slices and, by taking the context into account, detect and eliminate false positives.

In addition to the contextual information, we used a multiscale strategy to better capture the discriminative patterns. This strategy was employed by providing several images of different fields of view for the same region to the network. Since the bp-MRI images have relatively low resolution in the through-plane direction, we only implemented this multiscale strategy in-plane. To fit various lesion sizes, we used dimensions of 32×32, 48×48 and 64×64 for the three scales, respectively, and resized them all to 48×48 as the FPR network input.

2.3. FPR Network Architecture

The FPR network starts with a feature encoding module in which three different cropped fields of view are fed into 3 groups of residual blocks independently before the fusion step. Each group consists of 3 consecutive residual blocks of the basic architecture [12]. The filter number for the first block was 16, with the size doubling for the following blocks. After feature concatenation, we added a Squeeze-and-Excitation (SE) block [13] to allocate different weights to each channel. Another residual block was adopted after the SE block for further feature fusion. The output was flattened using global average pooling and ultimately fed into 2 fully connected layers with sizes of 128 and 1, respectively, to achieve the final classification. The network structure is shown in Fig. 2. Images coming from different fields of view ended up having different weights and hence importance. For example, the smaller-scale images best demonstrated local lesion features, whereas the larger-scale images emphasized the difference between the lesion and its surroundings. This parallel design enables the network to learn a more detailed set of features from the different scales. Adding the SE block helped to adjust the weight for each channel, resulting in different contributions of the scales to the final classification.

2.4. Loss Function for FPR Network

We used binary cross-entropy loss (BCEL) as our main loss function to train the network. Our goal was to minimize the FPs while having a minimal impact on the overall detection sensitivity. To achieve this, we used a weight α to impose a larger penalty on misclassified positive samples:

L_BCE = -[α y_i log(p_i) + (1 - y_i) log(1 - p_i)]    (1)

where p_i ∈ [0, 1] is the predicted lesion probability and y_i ∈ {0, 1} is the ground-truth label. In addition to BCEL, center loss (CL) [14] was employed as a secondary term to enhance the discriminative power between classes while minimizing the distances between samples from the same class. We assumed that the intra-class variation should be smaller for TP samples as compared to FP ones; therefore, we set a higher weight for the true positive samples:

L_C = (1/2) ( β Σ_{i=1}^{m} y_i ||x_i - c_1||_2^2 + Σ_{i=1}^{m} (1 - y_i) ||x_i - c_0||_2^2 )    (2)

where x_i ∈ R^d is the deep feature and c_0, c_1 ∈ R^d denote the two class centers of the deep features. The total loss is a


combination of BCEL and CL during training, L_total = L_BCE + λ L_C, where λ is a hyper-parameter that was tuned to control the balance between the two loss functions.

Fig. 1. Depiction of the overall processing pipeline consisting of Detection Network and FPR Network.

Fig. 2. Architecture for the false positive reduction network.

3. EXPERIMENTAL RESULTS

3.1. MRI Data and Pre-processing

Datasets from seven institutions with 2170 cases in total were used for the analysis: 1736 cases (80%) for training and 434 cases (20%) for validation. We used another 243 cases from the ProstateX challenge public dataset [4] as the test dataset to evaluate the performance of the various models. All images and corresponding clinical reports were carefully reviewed by a radiologist with four years of experience in radiology and subspecialty training in prostate MRI examinations. Lesion contours were manually re-annotated in 3D using an internally developed annotation tool based on the original clinical reports. All images were registered to the T2w images and resampled to a voxel spacing of 0.5mm×0.5mm×3mm, with an image size of 240×240×30 [15]. T2w images were linearly normalized to [0, 1] using the 0.05 and 99.5 percentiles of the image's intensity histogram as the lower and upper thresholds. Since the actual ADC value is highly relevant to the clinical significance of lesions [9], we normalized ADC intensities from [0, 3000] to [0, 1] by a constant value. The DWI B-2000 images were first normalized to the median intensity in the prostate region of the corresponding DWI B-50 images, and then normalized by a constant value to map the range of intensities into [0, 1].

3.2. Model Training

We trained and fixed the detection network before conducting experiments on the FPR network. Four types of image contrasts were used as inputs: T2w, DWI ADC and DWI B-2000 MR images, plus a binary prostate region mask providing anatomical information. The prostate mask was generated from the T2w volume using a learning-based method as presented in [16, 17, 18]. The overall model was trained for 200 epochs with lesion masks as training labels. Detection model selection was based on the highest Dice coefficient on the validation set.

Experiments for the FPR network were conducted using the patches generated from the outputs of the detection network. 80% of the detections were used for training and the other 20% for validation. The network was trained using ADAM as the optimizer for 100 epochs, with L2 regularization of 10^-4. A rotation range of [-45°, 45°], a shift range of [-5, 5], and vertical flips were adopted for data augmentation. Each input sample had 9 channels, including all the sequences from the previous and next slices, while the ground-truth label came only from the middle slice. The numbers of samples from the positive and negative classes were balanced within each batch during training. In our experiments, we set the dimension reduction ratio to 8 when the SE block was included in the model. The weight for CL was set to λ = 0.2, and the weights of the positive samples in the BCEL and CL computations were assigned as α = 3 and β = 2, respectively.

3.3. FPR Results

Our baseline was a network with 3 residual blocks followed by 2 fully connected layers with sizes 256 and 1, respectively. Inputs for the baseline were 2D data without 2.5D contextual or multiscale images.

We conducted an ablation study to verify the impact of the 2.5D data, multiscale inputs, SE block and center loss. The area under the curve (AUC) was evaluated at the 2D sample level, the 3D component level and the patient case level. The results are shown in TABLE 1. Component level suspicion scores were calculated by averaging the suspicion score of each 2D slice within the stack. The component level AUC reflects the overall FPR network performance on discriminating true and false positive


lesions. As our results demonstrate, the false positive detection performance improved from 0.829 to 0.837 by adding the 2.5D contextual data, and further improved to 0.876 by using multiscale information based on our proposed approach. The SE block and center loss showed their efficacy by increasing the sample level AUC from 0.887 to 0.895 and 0.891, respectively. The component level AUC also improved from 0.844 to 0.867. The most recent FPR method, proposed by Kim et al. and achieving state-of-the-art performance for lung nodule detection, was re-implemented for comparison, and the result demonstrates that our method's performance is superior. Finally, we combined the FPR network results with the up-stream detection network to perform an overall case level analysis. In this analysis, the case level suspicion score determines whether a patient potentially has a clinically significant lesion. We defined the maximum value of the detection network heatmap as the suspicion score. The case level AUC is shown in the third column of TABLE 1. The first row shows the case level performance without using any FPR network. The results indicate that we have substantially increased the overall case level detection performance by using the FPR network.

Table 1. Evaluation of the effectiveness of 2.5D multiscale data, SE block and center loss, and a comparison with another multiscale method, on sample, component and case level AUC.

Settings              | Samples | Components | Cases
w/o FPR               |    /    |     /      | 0.825
baseline              |  0.866  |   0.829    | 0.842
baseline+2.5D         |  0.879  |   0.837    | 0.844
Proposed w/o CL & SE  |  0.887  |   0.844    | 0.863
Proposed w/o CL       |  0.895  |   0.867    | 0.868
Proposed w/o SE       |  0.891  |   0.867    | 0.868
Proposed              |  0.897  |   0.876    | 0.867
Kim et al.            |  0.876  |   0.847    | 0.857

The overall detection performance after adding the FPR network was also evaluated using a free-response receiver operating characteristic (FROC) analysis. The FROC reflects the relationship between the detection sensitivity and the false positive rate per patient. The FROC performance and case level ROC are shown in Fig. 3. At a sensitivity greater than 90%, we reduced the FP rate per patient by 39.3% (1.17 versus 0.71). At a sensitivity greater than 87%, we reduced the FP rate per patient by 52.3% (1.09 versus 0.52). Fig. 4 also demonstrates two examples of the detection results before and after adding the FPR, with their corresponding MRI images.

Fig. 3. Quantitative comparison between the detection performance before and after adding the FPR network. (a) Lesion level FROC; (b) case level ROC (Original AUC=0.825, After FPR AUC=0.867).

Fig. 4. Two examples of the detection result before and after FPR. (a) T2w image with original detection; (b) ADC images; (c) B-2000 images; (d) T2w image with detection after FPR. Red: FP, Green: TP, Magenta: ground truth.

4. CONCLUSION

This paper presents a novel network for multiscale feature fusion that incorporates more contextual information to reduce false positive detections of prostate cancer lesions in MRI scans. We also use an auxiliary task of minimizing the intra-class variation to improve the feature extraction for classification. Our experiments on an independent dataset demonstrate the efficacy of our model in removing false positive detections. We demonstrate an overall improvement in terms of both FROC and case level AUC with regard to the detection of clinically significant lesions.

Disclaimer: The concepts and information presented in this paper are based on research results that are not commercially available.

ProstateAI Clinical Collaborators

Henkjan Huisman4, Andrew Rosenkrantz5, Tobias Penzkofer6, Ivan Shabunin7, Moon Hyung Choi8, Qingsong Yang9, Dieter Szolar10

4 Radboud University Medical Center, Nijmegen, NL. 5 New York University, New York City, NY, USA. 6 Charité, Universitätsmedizin Berlin, Berlin, Germany. 7 Patero Clinic, Moscow, Russia. 8 Eunpyeong St. Mary's Hospital, Catholic University of Korea, Seoul, Republic of Korea. 9 Radiology Department, Changhai Hospital of Shanghai, China. 10 Diagnostikum Graz Süd-West, Graz, Austria.

5. REFERENCES

[1] Kimberly D. Miller, Leticia Nogueira, Angela B. Mariotto, Julia H. Rowland, K. Robin Yabroff, Catherine M. Alfano, Ahmedin Jemal, Joan L. Kramer, and Rebecca L. Siegel, "Cancer treatment and survivorship statistics, 2019," CA: A Cancer Journal for Clinicians, vol. 69, no. 5, pp. 363-385, 2019.

[2] Romaric Loffroy, Olivier Chevallier, Morgan Moulin, Sylvain Favelier, Pierre-Yves Genson, Pierre Pottecher, Gilles Crehange, Alexandre Cochet, and Luc Cormier, "Current role of multiparametric magnetic resonance imaging for prostate cancer," Quantitative Imaging in Medicine and Surgery, vol. 5, no. 5, p. 754, 2015.

[3] Christiane K Kuhl, Robin Bruhn, Nils Krämer, Sven Nebelung, Axel Heidenreich, and Simone Schrading, "Abbreviated biparametric prostate MR imaging in men with elevated prostate-specific antigen," Radiology, vol. 285, no. 2, pp. 493-505, 2017.

[4] Geert Litjens, Oscar Debats, Jelle Barentsz, Nico Karssemeijer, and Henkjan Huisman, "Computer-aided detection of prostate cancer in MRI," IEEE Transactions on Medical Imaging, vol. 33, no. 5, pp. 1083-1092, 2014.

[5] Ruiming Cao, Amirhossein Mohammadian Bajgiran, Sohrab Afshari Mirak, Sepideh Shakeri, Xinran Zhong, Dieter Enzmann, Steven Raman, and Kyunghyun Sung, "Joint prostate cancer detection and Gleason score prediction in mp-MRI via FocalNet," IEEE Transactions on Medical Imaging, 2019.

[6] Bum-Chae Kim, Jee Seok Yoon, Jun-Sik Choi, and Heung-Il Suk, "Multi-scale gradual integration CNN for false positive reduction in pulmonary nodule detection," Neural Networks, vol. 115, pp. 1-10, 2019.

[7] Qi Dou, Hao Chen, Lequan Yu, Jing Qin, and Pheng-Ann Heng, "Multilevel contextual 3-D CNNs for false positive reduction in pulmonary nodule detection," IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1558-1567, 2016.

[8] Zhancheng Zhang, Xinyi Li, Qingjun You, and Xiaoqing Luo, "Multicontext 3D residual CNN for false positive reduction of pulmonary nodule detection," International Journal of Imaging Systems and Technology, vol. 29, no. 1, pp. 42-49, 2019.

[9] Andrei S Purysko, Andrew B Rosenkrantz, Jelle O Barentsz, Jeffrey C Weinreb, and Katarzyna J Macura, "PI-RADS version 2: a pictorial update," RadioGraphics, vol. 36, no. 5, pp. 1354-1372, 2016.

[10] Jeffrey C Weinreb, Jelle O Barentsz, Peter L Choyke, Francois Cornud, Masoom A Haider, Katarzyna J Macura, Daniel Margolis, Mitchell D Schnall, Faina Shtern, Clare M Tempany, et al., "PI-RADS Prostate Imaging - Reporting and Data System: 2015, version 2," European Urology, vol. 69, no. 1, pp. 16-40, 2016.

[11] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.

[13] Jie Hu, Li Shen, and Gang Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132-7141.

[14] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao, "A discriminative feature learning approach for deep face recognition," in European Conference on Computer Vision. Springer, 2016, pp. 499-515.

[15] Atilla P Kiraly, Clement Abi Nader, Ahmet Tuysuzoglu, Robert Grimm, Berthold Kiefer, Noha El-Zehiry, and Ali Kamen, "Deep convolutional encoder-decoders for prostate cancer detection and classification," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 489-497.

[16] Dong Yang, Daguang Xu, S Kevin Zhou, Bogdan Georgescu, Mingqing Chen, Sasa Grbic, Dimitris Metaxas, and Dorin Comaniciu, "Automatic liver segmentation using an adversarial image-to-image network," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 507-515.

[17] Haozhe Jia, Yong Xia, Yang Song, Donghao Zhang, Heng Huang, Yanning Zhang, and Weidong Cai, "3D APA-Net: 3D adversarial pyramid anisotropic convolutional network for prostate segmentation in MR images," IEEE Transactions on Medical Imaging, 2019.

[18] Donghao Zhang, Yang Song, Dongnan Liu, Haozhe Jia, Siqi Liu, Yong Xia, Heng Huang, and Weidong Cai, "Panoptic segmentation with an end-to-end cell R-CNN for pathology image analysis," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 237-244.


