
Uncertainty Quantification for Medical Image Segmentation Using Dynamic Label Factor Allocation Among Multiple Raters

Wei Ji 1,2,*, Wenting Chen 1,3,*, Shuang Yu 1, Kai Ma 1, Li Cheng 2, Linlin Shen 3, and Yefeng Zheng 1

1 Tencent HealthCare
{shirlyyu, kylekma, yefengzheng}@tencent.com
2 University of Alberta
{wji3, lcheng5}@ualberta.ca
3 School of Computer Science, Shenzhen University, Shenzhen, China
chenwenting2017@email.szu.edu.cn, llshen@szu.edu.cn

* Wei Ji and Wenting Chen contributed equally.

Abstract. The quantification of the uncertainty resulting from inter-observer variability is an important problem for automatic medical image segmentation, but it has rarely been investigated in the deep learning field. Existing methods generate the ground-truth label by evenly merging the labels given by different experts, based on the assumption that every expert has the same level of expertise. However, experts have different expertise levels, and their gradings may vary due to different clinical backgrounds, subjective errors, or image quality. In this paper, we propose a novel dynamic training strategy for the uncertainty quantification of medical image segmentation, which dynamically assigns different weights to different raters' annotations to obtain a dynamic ground-truth and further uses it to train the model. We quantitatively evaluate the proposed method on the QUBIQ Challenge datasets, and the results demonstrate its effectiveness.

Keywords: Uncertainty estimation · Dynamic label factor allocation · Image segmentation · Medical image.

1 Introduction

Along with the development of deep learning, it has been reported that image segmentation [1, 13, 12] and object detection [4, 10, 11] have reached human-level performance for some specific tasks, for example, longitudinal brain tumor volumetry segmentation [6]. Many medical datasets are labeled by multiple experts, so as to avoid the subjective bias or potential problems caused by different levels of clinical domain knowledge, negligence of subtle symptoms, and varying image quality [8]. The final ground-truth label is then generally obtained via majority vote, weighted average of the raw gradings, or other fusion techniques.

This multi-expert annotation procedure raises the problem of inter-rater variability. However, how to better quantify the consistency, or inter-rater variability, among different experts has rarely been studied in the deep learning community.
Recently, several deep learning works have started to pay attention to the inter-rater variability problem. Guan et al. [2] introduced the weighted doctor net to predict the label of each rater individually and averaged the individual predictions to make the final prediction. Jungo et al. [5] observed that models trained on the fused ground-truth label tend to under-estimate the uncertainty, while models trained with the individual labels can reflect the disagreements among experts. Yu et al. [9] utilized the raw multi-rater annotations to develop a difficulty-aware glaucoma classification model.
In this paper, in order to take advantage of both the fused final labels and the individual raters' annotations, we propose a novel training strategy, named dynamic label factor allocation, to exploit the importance of different raters and further improve the performance of uncertainty estimation for medical image segmentation on four QUBIQ datasets.

Fig. 1. Architecture of our proposed framework, including a segmentation network based on U-Net, the dynamic label factor allocation mechanism, and the training loss for each rater. The label factor allocation panels illustrate (a) average weight assignment, (b) label sampling, and (c) random weight assignment.

2 Method

Fig. 1 shows an overview of the proposed uncertainty estimation method for medical image segmentation using dynamic label factor allocation among multiple raters. The overall architecture is composed of three parts: (1) the segmentation network, (2) the dynamic label factor allocation, and (3) the individual training loss for the prediction of each rater. In this section, we use prostate segmentation as an example task to describe our method.

For the segmentation network, we adopt the widely used U-Net [7] with a pretrained ResNet-34 as the encoder; the decoder has six outputs corresponding to the six raters, and each output has two channels for the two tasks of prostate segmentation. To exploit the importance of the raw annotations labeled by different raters, we apply a dynamic label factor allocation mechanism that randomly generates weights for the importance of individual raters, based on which the ground-truth is dynamically obtained. We concatenate the dynamic label factor with the embedding features produced by the encoder of the segmentation network, and feed the concatenation into the decoder to generate the prediction for each rater. Then, for each rater, we multiply the prediction probability map pixel-wise with the corresponding label factor to obtain the output prediction of that branch. Similarly, the pixel-wise multiplication of the rater's raw annotation with the dynamic label factor is computed as the ground-truth of that branch. Afterwards, the model is supervised via binary cross entropy losses on both 1) the predictions of the six raters against the corresponding weighted raw annotations, and 2) the fused prediction against the fused ground-truth label. Finally, the model prediction is obtained by summing the weighted predictions of all individual raters.

2.1 Dynamic label factor allocation

Considering that each rater has a different level of confidence for the task due to different clinical expertise, we introduce a novel training strategy, the dynamic label factor allocation, to exploit the importance of different raters and further improve model generalization by using different label factors as an indirect data augmentation strategy. The dynamic allocation strategy consists of three types of label factor allocation mechanisms, i.e., (1) average weight assignment, (2) label sampling, and (3) random weight assignment.
For the average weight assignment, we first set the label factor z ∈ R^{N×1×1}, where N denotes the number of raters. Each element of z is equal to 1/N, meaning that each rater has the same confidence. Then, we expand the label factor z to the dimension N × H × W, where H and W denote the height and width of the embedding features f ∈ R^{C×H×W} encoded by the encoder of the segmentation network, and C represents the number of channels of f. Afterwards, a 1×1 convolutional layer is applied to expand the factor maps to the same number of channels as the feature map f, yielding the factor maps M ∈ R^{C×H×W}. Finally, the factor maps M and the embedding features f are concatenated and fed into the decoder to generate the prediction for each rater. A sketch of this step follows.
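To make this step concrete, the following is a minimal PyTorch sketch of the factor map generation and feature fusion, assuming N = 6 raters and 512-channel embedding features of size 16 × 16 as in Fig. 1; the module name and interface are our own illustration, not the authors' released code.

import torch
import torch.nn as nn

class FactorMapFusion(nn.Module):
    """Expand a label factor z (N x 1 x 1) into factor maps, lift them to C
    channels with a 1x1 convolution, and concatenate them with the encoder
    features f, as described above. Illustrative sketch only."""

    def __init__(self, num_raters: int = 6, feat_channels: int = 512):
        super().__init__()
        # 1x1 convolution that maps the N factor channels to C channels.
        self.conv1x1 = nn.Conv2d(num_raters, feat_channels, kernel_size=1)

    def forward(self, z: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        # z: (B, N, 1, 1) label factor; f: (B, C, H, W) embedding features.
        _, _, h, w = f.shape
        factor_maps = z.expand(-1, -1, h, w)   # (B, N, H, W)
        m = self.conv1x1(factor_maps)          # M: (B, C, H, W)
        return torch.cat([m, f], dim=1)        # (B, 2C, H, W), fed to decoder

For the average weight assignment, z is simply filled with 1/N, e.g., z = torch.full((1, 6, 1, 1), 1.0 / 6).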
For the label sampling and random weight assignment mechanisms, the process of generating the factor maps is the same as that of the average weight assignment mechanism; only the label factor itself differs. The label sampling mechanism randomly sets one element of the label factor to 1 and the remaining elements to 0, while the random weight assignment mechanism allocates a random probability to each element of the label factor and then normalizes the label factor so that it sums to 1, as sketched below.
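The three mechanisms thus differ only in how the label factor is drawn. Below is a minimal sketch of the three options, assuming one scalar weight per rater; the function name and interface are our own illustration.

import torch

def make_label_factor(num_raters: int, mode: str) -> torch.Tensor:
    """Draw a label factor z of shape (num_raters, 1, 1) whose elements sum
    to 1, following the three mechanisms described above (a sketch)."""
    if mode == "average":
        # (1) Average weight assignment: every rater gets weight 1/N.
        z = torch.full((num_raters,), 1.0 / num_raters)
    elif mode == "sampling":
        # (2) Label sampling: one randomly chosen rater gets weight 1,
        # all others get weight 0.
        z = torch.zeros(num_raters)
        z[torch.randint(num_raters, (1,))] = 1.0
    elif mode == "random":
        # (3) Random weight assignment: random weights normalized to sum to 1.
        z = torch.rand(num_raters)
        z = z / z.sum()
    else:
        raise ValueError(f"unknown mode: {mode}")
    return z.view(num_raters, 1, 1)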
To obtain the final prediction of each rater, we multiply the prediction y'_i (i ∈ {1, 2, ..., N}) of each rater by the corresponding label factor r_i. To construct the weighted label of each rater, we also perform the pixel-wise multiplication of the raw label y_i given by each rater and the corresponding label factor. Afterwards, we compute the binary cross entropy loss between the weighted prediction of each rater (y'_i · r_i) and the weighted label of each rater (y_i · r_i) to supervise the training of the segmentation network. In addition, we sum the weighted predictions over all raters to obtain the final prediction of our framework. The objective functions are defined as:

\[
\mathrm{Loss} = \lambda_1\,\mathrm{BCE}\Big(\sum_{i=1}^{N} y_i \cdot r_i,\ \sum_{i=1}^{N} y'_i \cdot r_i\Big) + \lambda_2 \sum_{i=1}^{N} \mathrm{BCE}\big(y_i \cdot r_i,\ y'_i \cdot r_i\big) \tag{1}
\]

\[
\mathrm{BCE}(\mathrm{target}, \mathrm{pred}) = -\sum_{c=1}^{2} \mathrm{target}_c \log(\mathrm{pred}_c) \tag{2}
\]
As shown in Eq. 1, the first term is the binary cross entropy loss between the final prediction (i.e., the weighted summation of all raters' predictions) and the overall fused ground-truth, while the second term is the loss between the weighted prediction of each rater and the corresponding weighted raw annotation. We set both λ1 and λ2 to 0.5, and c denotes the c-th class of the output.
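To make the objective concrete, the following is a sketch of Eq. 1 and Eq. 2 in PyTorch, assuming preds and labels are length-N lists of per-rater probability maps y'_i and raw annotations y_i of shape (B, 2, H, W), and r holds the N scalar label factors; the function names are our own.

import torch

def bce(target: torch.Tensor, pred: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Eq. 2: cross entropy over the two output channels (dim 1).
    return -(target * torch.log(pred.clamp_min(eps))).sum(dim=1).mean()

def multi_rater_loss(preds, labels, r, lam1=0.5, lam2=0.5):
    # Weight every rater's prediction and raw label by its factor r_i.
    w_preds = [p * r[i] for i, p in enumerate(preds)]
    w_labels = [y * r[i] for i, y in enumerate(labels)]
    # First term of Eq. 1: fused (weighted) ground truth vs. fused prediction.
    fused = bce(sum(w_labels), sum(w_preds))
    # Second term: per-rater weighted label vs. weighted prediction.
    per_rater = sum(bce(wl, wp) for wl, wp in zip(w_labels, w_preds))
    return lam1 * fused + lam2 * per_rater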

3 Experiments

3.1 Implementation details and data processing

We evaluate the proposed method on the validation sets of four different CT and MRI datasets from the QUBIQ Challenge, i.e., the prostate, brain tumor, brain growth, and kidney datasets, with the cases split according to the QUBIQ Challenge protocol. In the training stage, we resize all images to 512×512 pixels and apply random flipping for data augmentation. The Adam optimizer is adopted to train the model for a maximum of 5,000 iterations with a batch size of 8, and the initial learning rate is set to 0.0001. To assess our models, we adopt the soft Dice score as the evaluation metric, computed at ten thresholds ranging from 0 to 0.9, following the metric used in the QUBIQ Challenge.
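For reference, the following is a sketch of the soft Dice metric as we understand it from the challenge description, binarizing both maps at the ten thresholds 0.0, 0.1, ..., 0.9 and averaging the resulting Dice scores; the official QUBIQ implementation may differ in details.

import numpy as np

def soft_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Average Dice over ten thresholds in [0, 0.9]; pred and gt are
    probability maps in [0, 1]. A sketch of the evaluation metric."""
    scores = []
    for t in np.arange(0.0, 1.0, 0.1):  # thresholds 0.0, 0.1, ..., 0.9
        p = (pred > t).astype(np.float32)
        g = (gt > t).astype(np.float32)
        inter = (p * g).sum()
        scores.append((2.0 * inter + eps) / (p.sum() + g.sum() + eps))
    return float(np.mean(scores))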

3.2 Quantitative evaluation

To demonstrate the effectiveness of our proposed method, we compare it with existing multi-rater methods on four different QUBIQ datasets. To generate different ground-truth labels for supervised learning, there are three main methods: the commonly used majority voting [9]; label sampling, which randomly selects a label from the pool of multiple annotations [3]; and the weighted doctor net, which predicts each individual rater's annotation using multiple branches [2].

Table 1. Comparison with existing methods on four different QUBIQ datasets (%).

Methods               Kidney  Brain growth  Brain tumor           Prostate
                                            task1  task2  task3   task1  task2
Majority voting        46.81      78.31     83.08  76.24  77.53   53.04  56.89
Label sampling         47.57      79.27     83.78  77.21  78.05   53.95  57.33
Weighted doctor net    48.71      80.11     84.45  79.43  80.27   55.41  59.03
Ours                   49.58      81.54     85.02  80.49  81.74   56.37  60.14

As listed in Table 1, the weighted doctor net surpasses the majority voting and label sampling methods, which indicates that the labels given by all raters should be utilized, instead of only the label of a single rater or only the fused ground-truth. Moreover, our proposed method achieves the best performance and outperforms the three above-mentioned methods by a clear margin, indicating that the proposed method can properly exploit the importance of individual raters to better quantify the uncertainty.

References

1. Chen, W., Yu, S., Wu, J., Ma, K., Bian, C., Chu, C., Shen, L., Zheng, Y.: TR-GAN: Topology ranking GAN with triplet loss for retinal artery/vein classification. arXiv preprint arXiv:2007.14852 (2020)
2. Guan, M.Y., Gulshan, V., Dai, A.M., Hinton, G.E.: Who said what: Modeling individual labelers improves classification. arXiv preprint arXiv:1703.08774 (2017)
3. Jensen, M.H., Jørgensen, D.R., Jalaboi, R., Hansen, M.E., Olsen, M.A.: Improving uncertainty estimation in convolutional neural networks using inter-rater agreement. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 540–548. Springer (2019)
4. Ji, W., Li, J., Zhang, M., Piao, Y., Lu, H.: Accurate RGB-D salient object detection via collaborative learning. In: European Conference on Computer Vision (2020)
5. Jungo, A., Meier, R., Ermis, E., Blatti-Moreno, M., Herrmann, E., Wiest, R., Reyes, M.: On the effect of inter-observer variability for a reliable estimation of uncertainty of medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 682–690. Springer (2018)
6. Meier, R., Knecht, U., Loosli, T., Bauer, S., Slotboom, J., Wiest, R., Reyes, M.: Clinical evaluation of a fully-automatic segmentation method for longitudinal brain tumor volumetry. Scientific Reports 6, 23376 (2016)
7. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
8. Schaekermann, M., Beaton, G., Habib, M., Lim, A., Larson, K., Law, E.: Understanding expert disagreement in medical data analysis through structured adjudication. Proceedings of the ACM on Human-Computer Interaction 3(CSCW), 1–23 (2019)
9. Yu, S., Zhou, H.Y., Ma, K., Bian, C., Chu, C., Liu, H., Zheng, Y.: Difficulty-aware glaucoma classification with multi-rater consensus modeling. arXiv preprint arXiv:2007.14848 (2020)
10. Zhang, M., Ji, W., Piao, Y., Li, J., Zhang, Y., Xu, S., Lu, H.: LFNet: Light field fusion network for salient object detection. IEEE Transactions on Image Processing 29, 6276–6287 (2020)
11. Zhang, M., Li, J., Ji, W., Piao, Y., Lu, H.: Memory-oriented decoder for light field salient object detection. In: Advances in Neural Information Processing Systems. pp. 898–908 (2019)
12. Zhao, H., Li, H., Cheng, L.: Improving retinal vessel segmentation with joint local loss by matting. Pattern Recognition 98, 107068 (2020)
13. Zhao, H., Li, H., Maurer-Stroh, S., Guo, Y., Deng, Q., Cheng, L.: Supervised segmentation of un-annotated retinal fundus images by synthesis. IEEE Transactions on Medical Imaging 38(1), 46–56 (2018)
