Professional Documents
Culture Documents
of the pixel/voxel indicates higher uncertainty. Note that, diction between different misclassification costs;
though there are other uncertainty estimation methods in deep • A training strategy to separately tackle the segmentation
learning models, e.g., Bayesian modeling, dropout and input on certain and uncertain regions in different ways.
augmentation, we notice that in semi-supervised segmentation Our model outperforms the baselines and several state-
tasks, most previous semi-supervised methods [50] [9] [51] [3] of-the-art methods with higher accuracy on various medical
[60] tend to adopt the simple uncertainty estimation method image segmentation tasks, including CT pancreas, MR endo-
via setting a predefined threshold on the output of softmax cardium, and MR ACDC multi-structures segmentation. We
layer, as it is easy to implement. For simplicity, we refer have released our implementation at https://github.com/koncle/
to this conventional way of estimation as the softmax-based CoraNet.
confidence in the following parts.
Despite the promising performance of conventional methods
II. R ELATED W ORK
of semi-supervised medical image segmentation on different
tasks, we observe that the following two issues should be A. Medical Image Segmentation
addressed. In these years, convolutional neural networks have achieved
Observation 1: “It is the first step that matters the most”. state-of-the-art (SOTA) performance in medical image seg-
Under the semi-supervised setting, the initial pseudo label mentation. In terms of the segmentation under supervised
prediction for unlabeled samples plays a crucial role in the setting, among the current SOTA methods, the most popular
process of segmentation. The more errors it introduces at framework is the fully convolutional network (FCN) [35]
the beginning of training, the more error propagations it based encoder-decoder architecture with the representative
might cause in the proceeding process. Therefore, how to model like U-Net [45]. So far, to make the conventional
ensure a safe and reliable model initialization becomes critical. encoder-decoder architecture more effective and robust, re-
Most previous methods estimate “uncertainty” with softmax- searchers have made great efforts in the following three
based confidence by seeking a trade-off between precision directions: 1) developing novel structures, including the 3D
and recall. Can we investigate a novel method of estimating structure, recurrent neural network (RNN) based model and
“uncertainty”? cascaded framework [39] [7] [4] [48] [12] [62], 2) designing
Observation 2: “Not all the regions will be equally treated”. novel network blocks, including attention mechanism, dense
In previous semi-supervised segmentation, for unlabeled sam- connection, inception or multi-scale fusion [8] [65] [11] [57]
ples, usually both the certain region (with high confidence) [41], and 3) utilizing sophisticated loss functions [58] [10]
and the uncertain region (with low confidence) are input to the [26] [61], significantly improving segmentation accuracy.
same segmentation network, which might not only render it
hard to make full use of certain region, but also underestimate
the complexity of the uncertain region. Can we separately treat B. Uncertainty Estimation
their segmentation in a unified framework? In deep neural network models, a reliable estimation of
To this end, in this paper we investigate the estimation uncertainty plays a crucial role in quantifying the confidence
of uncertainty from a new perspective. Considering the fact of predictions. However, as the “ground truth” uncertainty
that when assigned different misclassification costs on classes, is usually not available, uncertainty estimation remains a
the predicted label of a pixel during segmentation becomes challenging task.
inconsistent (to a certain degree), so its current prediction So far, several methods have been developed for uncertainty
seems not to be certain enough (see Figure 1). Hence, to 1) estimation, especially by the way of supervised learning. For
provide a robust estimation of uncertainty and 2) separately example, in Bayesian uncertainty modeling based methods,
segment certain and uncertain regions, we present a novel the posterior distribution over parameters is calculated on the
framework namely Conservative-Radical network (CoraNet) training samples with Bayesian neural networks to estimate the
for semi-supervised segmentation. uncertainty. Since it is intractable to compute exact Bayesian
Our CoraNet model consists of three major components: inference, there would be several approximate variants [36]
1) a conservative-radical module (CRM), which involves and [21] [6] of Bayesian neural networks for efficient training.
maintains a main segmentation model, to indicate the certain For example, Monte Carlo dropout was proposed to perform
and uncertain region masks by predicting the inconsistency dropout at test time to estimate the uncertainty [19]. An
between different misclassification costs; 2) a certain region ensemble-based estimation method was presented in [30],
segmentation network (C-SN) to update the model with pre- while data augmentation based methods were also utilized
dicted certain regions; and 3) an uncertain region segmentation to estimate the uncertainty by investigating the predictions
network (UC-SN) to segment the predicted uncertain region under different augmentation operators [2] [54]. Moreover,
and update the model. These three components could be uncertainty estimation also triggered some interests in med-
alternatively trained in an end-to-end manner. Overall, the ical image analysis [24] [14] [17]. However, our estimation
contributions of our work include: method substantially differs from previous uncertainty esti-
• A framework of semi-supervised segmentation with a mation methods that mainly adopt supervised learning with
new method of estimating pixel-wise uncertainty; sufficient training samples. Most previous works either 1)
• A novel conservative-radical module to automatically utilize predefined thresholds for determining certain/uncertain
identify uncertain regions according to inconsistency pre- regions or 2) measure the difference of output in different
3
initialization or augmentations. By contrast, our goal is to framework based on inequality constraints, which could tol-
estimate uncertainty by employing different costs without any erate uncertainties during segmentation in inferred knowledge
prerequisite, e.g., choosing specific augmentation operators. in semi-supervised manner. Yu et al. [60] developed a novel
Therefore, we believe that our way of uncertainty estimation, uncertainty-aware semi-supervised framework for left atrium
which has never been investigated before, is novel. segmentation on MR images, while Li et al. [34] employed
the mean teacher-based framework to optimize the network
by a weighted combination of common supervised loss and
C. Semi-supervised Learning
regularization loss. Sun et al. [50] developed a hierarchical
Recently, with the development of deep learning, many attention module with a teacher model to output the pseudo
deep learning-based semi-supervised learning (SSL) methods labels and a student model to combine manual labels and
have been developed. For example, Π-model [29] adopts a pseudo labels for joint training. In [64], a teacher-student
self-ensembling strategy by imposing a consistency prediction model was investigated by introducing perturbation-sensitive
constraint between two augmentations of the same sample. sample mining to jointly perform semantic segmentation and
Another self-ensembling method, i.e., temporal ensembling feature distillation. The consistency between different decoders
[29], extends Π-model by considering the network predictions was imposed during segmentation in [18] [42]. Huo et al.
over previous epochs. In the mean-teacher model [52], a [22] developed an iterative training scheme in semi-supervised
teacher model is maintained by averaging the model weights of segmentation manner. Luo et al. [37] presented a dual-task-
consecutive student models, which was a simple yet effective consistency semi-supervised framework with a dual-task deep
method in SSL. Miyato et al. [40] proposed a virtual adversar- network that jointly predicts segmentation maps and geometry-
ial training (VAT) model by introducing the adversarial train- aware level set representation. The mutual information based
ing with virtual labels in semi-supervised scenarios. Moreover, regularization and transformation consistency were integrated
given recent progress in data augmentation [15] [16], several in [44] for semi-supervised segmentation. A unified framework
methods have been proposed to apply data augmentation in with uncertainty-aware modeling was developed to address
semi-supervised learning, e.g., MixMatch [5] and FixMatch both the unsupervised domain adaptation and semi-supervised
[49]. In summary, most of recent state-of-the-art SSL models segmentation tasks in [59].
adopted 1) feature-based or image-based augmentation to To better summarize and compare these works, we roughly
benefit the learning from unlabeled data, and 2) specifically group them into EM-based or CR-based methods in Table
designed methods (e.g., temporal ensembling) to reduce the I. As is shown, despite their success, most of these works
possible noise in unlabeled data. tend to estimate uncertainty by evaluating the pixel/voxel-
wise entropy. Yet, since uncertainty could greatly influence
TABLE I the subsequent learning process, we aim to investigate the
T HE COMPARISON OF RECENT METHODS . F OR SIMPLICITY, WE USE EM
AND CR TO INDICATE THE CATEGORIES OF ENTROPY MINIMIZATION AND
possibility of developing other safe and reliable methods
CONSISTENT REGULARIZATION , RESPECTIVELY. of estimating uncertainty. So we propose a novel way of
modeling region-level uncertainty with CRM, which could
Method
Treat region Modeling unlabeled data help the model to separately treat different regions, a method
separately? High conf. reg. Low conf. reg.
proven effective in our evaluation.
[28] (MICCAI’19) - EM EM Remark. In summary, our proposed method is substantially
[60] (MICCAI’19) - CR CR
[34] (TNNLS’21) - CR CR different from previous semi-supervised segmentation meth-
[25] (ICCV’19) - EM EM ods. Drawing on the semi-supervised learning, these previous
[64] (MICCAI’20) - CR CR
[18] (MICCAI’20) - CR CR methods usually are either EM-based or CR-based methods,
[42] (CVPR’20) - CR CR so based on the EM or CR architecture, they usually introduce
[9] (MICCAI’19) - CR CR additional consistency-based constraints, e.g., output-based
[37] (AAAI’21) - CR CR
[44] (MIDL’20) - CR CR consistency [18] [42] [44] [56] or task-based consistency [37]
[59] (MedIA’20) - CR CR [59]. By contrast, our method is highly novel in 1) providing
√
Ours EM CR a region-level mask to indicate certain/uncertain regions with
our proposed CRM, and in 2) training the subsequent segmen-
tation module by incorporating the merits from both CR and
EM based architectures. According to our best knowledge,
D. Semi-supervised Medical Image Segmentation the aforementioned two aspects have never been investigated
To alleviate the challenge of manual delineation, several in previous works.
methods have been recently proposed to improve segmenta-
tion performance in the semi-supervised environment. These III. T HE P ROPOSED M ETHOD
previous methods usually fall into either entropy minimization
(EM) based methods or consistency regularization (CR) based A. Problem Analysis
methods. For example, Bai et al. [3] proposed a method For clear clarification, we show a visualized example in
namely semiFCN that performs self-training for cardiac MR Figure 1 to illustrate the case of segmentation with just two
image segmentation by integrating labeled and unlabeled data classes—each pixel is classified into background (i.e., class 0)
during training. Kervadec et al. [28] presented a segmentation or object (i.e., class 1) for simplicity. For the loss function, we
4
Update
Segmentation
Radical Add mask
result
segmentation
employ the popularly used cross-entropy loss for illustration. aware loss, specific convolution or pooling. Although it shows
Before providing our definition of uncertainty, we introduce potential to locate the uncertain region, we still aim to
two cost-sensitive settings as follows: 1) leverage the high-confidence certain region to continu-
• Object conservative setting (e.g., 5:1): The cost of mis- ously complement the current labeled samples, and
segmenting one background pixel to object class equals 2) reduce incorrect segmentation caused by low-confidence
to that of mis-segmenting five object pixels to background uncertain region.
class. Obviously, it performs segmentation in an object- Accordingly, we employ a separate self-training strategy to
caution manner, since in most case only the most reliable treat the subsequent segmentation of certain and uncertain
central part of the object is segmented. In this setting, the regions in the unlabeled data separately. For the certain region,
predicted object class under object conservative setting is being aware of its high confident prediction, we employ its
highly confident. prediction as pseudo label for self-training. For the uncertain
• Object radical setting (e.g., 1:5): We reverse the mis- region, we aim to obtain a more robust estimation to prevent
classification cost in object conservative setting. The cost the possible error propagation.
of mis-segmenting one object pixel to background class Our overall framework is summarized in Figure 2 with
equals to that of mis-segmenting five background pixels three components: 1) a conservative-radical module (CRM),
to object class. It is a background-caution setting that 2) a certain region segmentation network (C-SN), and 3) an
indicates the predicted background is highly confident. uncertain region segmentation network (UC-SN). Note that,
On the same training set, by employing these two cost- in Figure 2, though we split them for better visualization, the
sensitive settings to train two separate segmentation models, main segmentation model and student model are actually the
as shown in Figure 1, we can obtain the corresponding object- same model.
conservation and object-radical models.1 With the obtained Notation. For a semi-supervised segmentation task, given
models, we then input the same image to these two models m labeled samples with manual delineation as ground-truth
to obtain corresponding conservative and radical segmentation mask (label) and n unlabeled samples with no ground-truth
masks, respectively. We calculate the difference between these information, we denote the labeled samples as Xl1 , Xl2 , ...,
two segmentation masks to define: Xlm ∈ RH0 ×W0 and their corresponding label maps as Y1l ,
Y2l , ..., Ym
l
∈ BH0 ×W0 . The unlabeled samples are denoted as
• Certain region: the pixels with the same prediction under
X1 , X2 , ..., Xun ∈ RH0 ×W0 . Note that, H0 and W0 indicate
u u
both conservative and radical settings, since they are
the height and width of original images, respectively. It is
being robust to different costs.
worthwhile to mention that, in the following description and
• Uncertain region: the pixels with different predictions
evaluation, we first train a model on both labeled and unlabeled
under conservative and radical settings, indicating their
samples and then deploy the trained model to segment the
current predictions are uncertain.
unseen samples without any intersection to previous observed
We now extensively analyze the ways in which our esti- labeled/unlabeled samples, so it falls into an inductive rather
mation method largely differs from traditional softmax-based than transductive semi-supervised segmentation setting.
confidence estimation methods. From Figure 1, we can observe
that most part of the image falls into the certain region,
however uncertainty is also possible in non-boundary location. B. Certainty-aware Prediction: CRM
Our definition of uncertainty requires neither task-specific Recall that the conventional cross-entropy loss Lce defined
assumption nor empirical predefined condition, e.g., boundary- in the segmentation task is evaluated between the prediction
pz,k of the z-th pixel on the k-th class and its corresponding
1 We will use conservative and radical for simplicity in the following. label qz,k , so we denote two separate losses as Lce-con and
5
Lce-rad to indicate the aforementioned conservative and radical C. Separate Self-training: C-SN and UC-SN
settings, respectively. Specifically, for the i-th labeled image We treat the certain region and the uncertain region dif-
(1 ≤ i ≤ m), by employing the conventional cross-entropy ferently. For the certain region indicated by Mu,cj , regarding
loss Lce with different misclassification costs, we formulate its high confident prediction, we implement C-SN by using
Lce-con and Lce-rad for a binary segmentation task as follows: its prediction as the pseudo label for self-training. For the
×W0
2 H0X uncertain region indicated by Mu,uc j , given its prediction
might be unreliable in current stage, we utilize UC-SN by
X
Lce-con (Xli , Yil ; E, Dcon ) = − wkcon qz,k logpz,k ,
k=1 z=1 employing an advanced model, i.e., mean teacher, for label
assignment.
×W0
2 H0X 1) C-SN for certain region segmentation: For the certain
region indicated by Mu,c
X
Lce-rad (Xli , Yil ; E, Drad ) = − wkrad qz,k logpz,k . j , we directly use the prediction
k=1 z=1 as their pseudo label and self-train the main segmentation
model parameterized by encoder E and decoder D in the
As we employ the encoder-decoder segmentation structure, next iteration. Regarding its consistent prediction with both
our normal loss Lce , conservative loss Lce-con and radical loss conservative and radical losses, we believe that it has high
Lce-rad are parameterized by a common encoder E with shared confidence on its prediction, which could aid the learning
parameters, and three separate decoders D, Dcon and Drad , in the next round with two benefits: 1) it provides efficient
respectively. Then, as the corresponding weights for different training in that the model could directly assign the label
classes, wkcon and wkrad are defined as follows: to most parts of image, as most regions usually fall into
( ( the certain region, and 2) the certain region could be used
α k = 0 1 k=0
wkcon = and wkrad = to complement the labeled samples to aid robust training
1 k=1 α k = 1, particularly when the labeled samples are limited.
Specifically, the objective loss of C-SN is formulated as:
where α > 1 indicates the high misclassification cost in
n
current class compared with the other class. In this paper, we 1X
Lc-sn (E, D) = Lce-m (Xuj , Yju , Mu,c
j ; E, D), (7)
set α as 5 in the implementation. n j=1
Therefore, the total loss Lcrm of CRM is calculated on all
the labeled samples, which can be defined as where Lce-m is defined as:
" ×W0
2 H0X
m X
X Lce-m (Xuj , Yju , Mu,c
j ; E, D) = − Mu,c
j,z qz,k logpz,k ,
Lcrm (E, D, Dcon , Drad ) = Lce (Xli , Yil ; E, D)
k=1 z=1
i=1
where Mu,c denotes the z-th pixel of Mu,c
j .
#
j,z
+Lce-con (Xli , Yil ; E, Dcon ) + Lce-rad (Xli , Yil ; E, Drad ) . 2) UC-SN for uncertain region segmentation: Due to its
low confident prediction, the uncertain region could not be
(1) directly used as pseudo label, so we incorporate the deep semi-
supervised learning framework proposed earlier to achieve
Next, we can obtain three separate segmentation sub-models
reliable pseudo label assignment instead. In particular, we
F, Fcon and Frad with the same encoder but separate de-
introduce the mean teacher model into our framework, where
coders. Based on these sub-models, for the j-th unlabeled
the main segmentation model (parameterized by encoder E
image, we have conservative, radical, and normal predictions
and decoder D) is regarded as the student model. In addaition
as Yju,con , Yju,rad , and Yju , respectively:
to the student model, another teacher model parameterized by
encoder E0 and decoder D0 with the same encoder-decoder
Yju = F(Xuj ; E, D) ∈ BH0 ×W0 , (2)
structure is also employed in the training process.
Specifically, the student and teacher models are imposed
Yju,con = Fcon (Xuj ; E, Dcon ) ∈ BH0 ×W0 , (3) to have a consistent prediction on unlabeled samples. The
consistent loss can be formulated as follows:
Yju,rad = Frad (Xuj ; E, Drad ) ∈ BH0 ×W0 . (4) n 2
1X
Lcnt = Mu,uc
j ⊗ (F(Xuj ; E, D) − F 0 (Xuj ; E0 , D0 )) ,
With obtained conservative and radical predictions, the n j=1
certain and uncertain masks of the j-th unlabeled image, Mu,c
j (8)
and Mu,uc
j can be defined as 0
where F and F denote the student segmentation model
and teacher segmentation model, respectively, and ⊗ means
Mu,uc = Yju,rad ⊕ Yju,con ∈ BH0 ×W0 , (5)
j element-wise multiplication. Also, given Et , Dt , E0t and D0t
at the t-th iteration, the parameter update of E0t+1 and D0t+1
Mu,c
j = 1 − Mu,uc
j ∈ BH0 ×W0 , (6) at the (t+1)-th iteration is similar to that in the conventional
mean teacher model as
where ⊕ is pixel-wise XOR operator to indicate the inconsis-
tency between Yju,con and Yju,rad , and 1 is an all-one matrix. E0t+1 = βE0t + (1 − β)Et , D0t+1 = βD0t + (1 − β)Dt , (9)
6
Algorithm 1 Proposed CoraNet Training phase: Our method contains merely two additional
Training input: labeled data Xl1 , Xl2 , ..., Xlm and their label decoders (Dcon and Drad ), each consisting of just three
Y1l , Y2l , ..., Yml
, unlabeled data Xu1 , Xu2 , ..., Xun convolutional layers (i.e., two 3×3 and one 1×1 convolutions).
Training output: segmentation model F parameterized by In this sense, our model could avoid introducing too many
encoder E and decoder D additional parameters. Also, although our separate self-training
1: //model initialization involves a teacher network, its weights are indeed the moving
2: E, D, Dcon , Drad ← initialized by Eqn. (1) average of the weights of the student network, therefore not
3: E0 , D0 ← initialized with E, D introducing any additional parameters here. In addition, in
u,uc
4: ∀j, Mj and Mu,cj ← initialized by Eqns. (5) and (6) separate self-training module, the student model is actually
5: while not converged do the same model as the main segmentation model in Figure 2.
6: //certainty-based segmentation Test phase: It is worth noting that although our method
7: Update E and D using Eqn. (7) consists of three decoders, i.e., D, Dcon and Drad , during the
8: //uncertainty-based segmentation test phase, two of them (i.e., Dcon , Drad ) are to be discarded
9: Update E and D using Eqn. (8) later in our final model, as indicated in Algorithm 1. Therefore,
10: //update CoraNet model it is as efficient as U-Net.
11: Update E, D, Dcon , Drad using Eqn. (1) Based on the above analysis, in our test phase, the inference
12: //update mask parameters only involve those from the main segmentation
13: Update Mu,uc j and Mu,c
j using Eqns. (5) and (6) model parameterized by E and D. For instance, in CT
14: //update teacher model pancreas segmentation, 1) in the training phase, the parameter
15: Update E0 and D0 using Eqn. (9) size of U-Net and our model is 29.6M and 29.8M, respectively,
16: end while 2) in the test phase, they are both 29.6M after discarding Dcon
and Drad . Since our method would not introduce too many
Test input: test image Xt1 , ..., Xtk additional parameters, it is fast to perform training and testing.
In particular, when implementing our model on an NVIDIA
Test output: results Ŷ1t , ..., Ŷkt
GeForce RTX 2080TI GPU, the training time of the popularly
1: Calculate ∀i, Ŷit ← F(Xti ; E, D) used mean teacher model [52] and our model is 2.685h and
2.863h, respectively, and the test time for segmenting 1,414
slices is 18.75s and 18.85s, respectively. In this sense, although
where β is the smoothing coefficient parameter. our method involves two additional decoders, they have less
Algorithm 1 summarizes the overall procedure of the negative effect on efficiency in both training and test phases.
method we propose. 3) Weight Setting: For how to balance the weight of Lc-sn
and Lcnt , we find that setting the weighting parameter as 1:1
(i.e., setting Lc-sn and Lcnt with equal weight) could achieve
D. Discussion
satisfactory performance. This is probably because the loss
1) Network Architecture: Although our method also con- normally depends on the production of 1) the number of pixels
sists of one encoder and several decoders, it is fundamentally in this region and 2) the ratio of possible misclassified pixels
different from previous works [18] [42] for the following in this region.
reasons: • For certain region, the number of total pixels is relatively
Goals are different: the two decoders (i.e., Dcon and Drad ) large while the ratio of misclassification could be small.
in our method aim to estimate their difference in later separate • For uncertain region, the number of total pixels is rela-
self-training. In this sense, as a novel way to estimate the tively small while the ratio of misclassification could be
uncertainty, the utilization of our two decoders (i.e., Dcon and large.
Drad ) could be seen as an intermediate step to generate the We notice these two losses are often roughly the same in most
region-level uncertain mask before being discarded later in the cases. Moreover, setting an extremely large or small value
test phase. Previous works [18] [42] attempt to seek a con- to this weighting parameter certainly deteriorates the perfor-
sistent prediction between different decoders, often following mance. To eliminate possible negative impacts that might be
the conventional way of consistency regularization in semi- brought in by introducing extra hyper parameter, the weight
supervised learning. to balance Lc-sn and Lcnt is set to 1:1 in our paper.
Implementations are different: our two decoders are op-
timized with different objective functions (i.e., radical and
IV. E XPERIMENTS
conservative settings in Eqn. (1)) to generate diverse results for
uncertainty estimation, while previous works [18] [42] usually A. Setting
utilize the same objective function on different decoders by For the implementation, we employ a four-level U-Net
either introducing different perturbations or employing data as our backbone network that outputs the feature map with
augmentations to aid consistency training. 64 channels in the last level. In each level, there are two
2) Efficiency: Compared with the simple yet popularly- convolutional layers, both followed by batch normalization,
used baseline U-Net, our method is more efficient in network ReLU and a max pooling layer. Next, we utilize the trans-
parameters: posed convolutional layer for up-sampling. In the last level,
7
segmentation goal. After that, each slice is cropped to the size 0.6
0.2
data augmentation by utilizing several simple and effective 0
operators, i.e., 1) randomly rotating the image within -30 to 0.8
PPV TPR CSI
pixel size of 0.9-1.6 mm and the inter-slice thickness of 6-13 Fig. 3. The PPV, TPR, CSI of initial prediction on the endocardium dataset
mm. In this dataset, our goal is to accurately segment the (endo. in short) as shown on the top row and the pancreas dataset (panc. in
boundary of the LV cavity (i.e., endocardium). short) as shown on the bottom row.
TPR
PPV
CSI
0.8
0.45 0.4
0.75 0.4
0.35
0.35
0.7
0.3
0.3
Fig. 4. The change of PPV, TPR, and CSI during the training on the pancreas dataset. This validates that a good initial prediction could significantly reduce
error propagation in following segmentation.
1st -epoch 5th -epoch 10th -epoch 15th -epoch 20th -epoch 1st -epoch 5th -epoch 10th -epoch 15th -epoch 20th -epoch
Fig. 5. Uncertain region mask predicted every five epochs on the endocardium dataset (left) and the pancreas dataset (right). The yellow curve indicates the
ground truth, while the red region indicates the uncertainty mask.
Method DSC (%) Prec. (%) Rec. (%) HD (voxel) and separate self-training losses.
U-Net [45] 57.53 67.78 52.92 27.65 We report the results of DSC, precision, recall and HD in Table
Π-model [29] 57.69 68.98 55.20 26.37 IV. We could observe that using all the components proposed
TE [29] 61.18 71.79 60.22 21.94
MT [52] 62.50 68.97 62.16 23.34 in our work benefits the overall segmentation performance.
UA-MT [60] 63.82 63.47 73.12 20.64 Also, by comparing the setting of using and not using certain
Ours 67.01 68.41 74.19 15.90 and uncertain masks, we validate the efficacy of certain mask
predicted by our CRM.
We also investigate the performance of our method with
different proportions of the labeled samples in the training TABLE IV
A BLATION STUDY ON CT PANCREAS SEGMENTATION .
set. In particular, we introduce three different ratios of labeled
samples to unlabeled samples, i.e., 1:4, 1:2, and 1:1. As is Method DSC (%) Prec. (%) Rec. (%) HD (voxel)
shown by their DSC values in Table III, our method could
seg 60.10 69.67 54.68 25.07
consistently achieve better and more stable segmentation per- seg+st 62.33 68.72 63.07 22.59
formance with different ratios of labeled to unlabeled samples seg+st (mask) 64.23 67.90 71.99 19.53
than the comparison methods. seg+mt 61.23 68.04 60.78 23.43
seg+mt (mask) 63.69 69.60 70.23 20.40
ours 67.01 68.41 74.19 15.90
3 https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/.
4 https://github.com/smlaine2/tempens.
5 https://github.com/smlaine2/tempens. To further investigate the performance, we implement our
6 https://github.com/CuriousAI/mean-teacher. method in its 3D version by utilizing V-Net and ResNet-
7 https://github.com/yulequan/UA-MT. 18 as the backbone to compare with recent methods in
10
Table V, including DAN [63], ADVNET [53], UA-MT [60], and apply random horizontal flip and random scale with ratio
SASSNet [33], DTC [37] and UMCT [59]. In addition, the between [0.8, 1.1] in the 2D setting, respectively. We perform
backbone of DAN, ADVNET, UA-MT, SASSNet and DTC forward pass 20 times to obtain the average result. For MC
is V-Net, whereas the backbone of UMCT is ResNet-18. dropout, we add the dropout to the last downsampled feature
Specifically, the backbone of V-Net consists of five layers with and the final feature with a dropout rate of 0.5. Then, we
four downsampling and upsampling layers. The downsampling conduct forward pass 50 times with dropout enabled in the
layer is implemented with 3D convolution with stride of 2 and test time and average their results. For both test augmentation
the upsampling layer with 3D transposed convolution. Similar and MC dropout, we adopt the confidence threshold of 0.8
to the 2D setting discussed earlier, we first pretrain V-Net 30 for the certain and uncertain mask generation. The aleatoric
epochs and then conduct self-training in the following 200 uncertainty captures the inherent uncertainty in the observation
epochs. For the backbone of ResNet-18, we modify ResNet- and assumes the normal distribution in logits. We implement
18 into 3D version by extending the first 7×7 convolution this method with two branches of the same structure as our
to 7×7×3 and changing all other 3×3 convolutional layers separate self-training branches and output the f and σ. The
into 3×3×1 that can be trained as a 3D convolutional layer. noise is sampled 20 times in the 2D setting and 10 times in the
During the training stage, we only apply random crop as 3D setting, respectively. To generate the certain and uncertain
the augmentation method. For fair comparison with these masks, we manually set the standard deviation less than 0.001
baselines, we utilize 12 labeled and 50 unlabeled volumes as the certain region and the rest as the uncertain region. The
to train our model, as in all these baselines. The V-Net is result in Table VI shows the advantage of our method in both
only trained with 12 labeled data in the supervised setting, 2D and 3D.
while other baseline methods are trained with both labeled and
unlabeled data in the semi-supervised setting. In the inference TABLE VI
phase, we adopt a sliding window strategy to obtain the final T HE COMPARISON ON DSC OF DIFFERENT UNCERTAINTY ESTIMATION
METHODS ON BOTH 2D AND 3D CT PANCREAS SEGMENTATION . I N THE
results with a stride of 16×16×4. We do not perform any 2D SETTING , WE EMPLOY 50% LABELED DATA , WHILE IN THE 3D
postprocessing or model ensembling for fair comparison. Since SETTING , WE UTILIZE 20% LABELED DATA , WHICH IS CONSISTENT WITH
all these methods follow the same setting, we directly report OUR ORIGINAL SETTING .
TABLE IX
T HE COMPARISON OF DIFFERENT METHODS ON THE ACDC DATASET.
R EPORTED VALUES ARE AVERAGES ( STANDARD DEVIATION IN
PARENTHESES ) FOR 3 RUNS WITH DIFFERENT RANDOM SEEDS .
DSC (%)
Method
RV Myo LV mean
Fully supervised 88.98 (0.09) 84.95 (0.15) 92.44 (0.33) 88.79 (0.13)
contour more specifically, or Myo in short). Then, the average Original image Mean teacher Ours Ground truth
segmentation accuracy of these three classes is reported. To
evaluate the performance, we introduce the Dice score (DSC) Fig. 9. The visual segmentation examples of mean teacher [52] and our
method on ACDC segmentation.
as the evaluation metric. For fair comparison with baselines,
we use 20% samples in the dataset as the labeled samples,
and the rest as the unlabeled samples, as consistent with the demonstrated that our method outperforms other related base-
setting introduced in [43] and [44]. lines in terms of segmentation accuracy.
We report the values of DSC of our method and the
comparison methods in Table IX. Since all these methods ACKNOWLEDGEMENT
are evaluated under the same setting, we directly report their The authors would like to thank Dr. Wenjuan Xie (NNU),
performance presented in their literature. Note that, since the Lihe Yang (NJU), and Zhen Zhao (USYD & NJU) for discus-
number of slices along z-axis is sometimes limited (i.e., less sion and comments. The work of Yinghuan Shi, Jian Zhang,
than 10), thus making 3D convolutions hard to perform, above and Tong Ling was supported by National Key Research and
methods usually adopted the 2D setting [53] [44] [23]. For fair Development Program of China (2019YFC0118300), and Lei
comparison, we follow their 2D setting. As shown in this table, Qi was supported by China Postdoctoral Science Foundation
not only has our method largely outperformed the comparison Project (2021M690609) and Jiangsu Natural Science Founda-
methods on the mean segmentation accuracy, our results are tion Project (BK20210224).
also close to those of the fully-supervised trained model.
We also present several visualized segmentation examples R EFERENCES
of mean teacher and our method in Figure 9 for comparison. [1] A. Andreopoulos and J. K. Tsotsos, “Efficient and generalizable sta-
Different colors indicate different classes for segmentation. tistical models of shape and appearance for analysis of cardiac MRI,”
Medical Image Analysis, vol. 12, no. 3, pp. 335–357, 2008.
Compared with mean teacher, our method also obtains more [2] M. Ayhan and P. Berens, “Test-time data augmentation for estimation
precise segmentation on some difficult boundary regions. of heteroscedastic aleatoric uncertainty in deep neural networks,” in
Proceedings of Medical Imaging with Deep Learning, 2018.
[3] W. Bai, O. Oktay, M. Sinclair, H. Suzuki, M. Rajchl, G. Tarroni,
B. Glocker, A. King, P. M. Matthews, and D. Rueckert, “Semi-
V. C ONCLUSION supervised learning for network-based cardiac MR image segmentation,”
in International Conference on Medical Image Computing and Computer
How to estimate and handle uncertainty remains a cru- Assisted Intervention. Cham, Switzerland: Springer, 2017, pp. 253–260.
cial problem in semi-supervised medical image segmentation. [4] W. Bai, H. Suzuki, C. Qin, G. Tarroni, O. Oktay, P. M. Matthews,
In this paper, we proposed a novel method of estimating and D. Rueckert, “Recurrent neural networks for aortic image sequence
segmentation with sparse annotations,” in International Conference on
uncertainty by capturing the inconsistent prediction between Medical Image Computing and Computer Assisted Intervention. Cham,
multiple cost-sensitive settings. Our definition of uncertainty Switzerland: Springer, 2018, pp. 586–594.
directly relies on the classification output without requiring [5] D. Berthelot, N. Carlini, I. J. Goodfellow, N. Papernot, A. Oliver, and
C. Raffel, “MixMatch: A holistic approach to semi-supervised learning,”
any predefined boundary-aware assumption. Based on this arXiv preprint arXiv:1905.02249, 2019.
definition, we also presented a separate self-training strategy to [6] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight
treat certain and uncertain regions differently. To achieve end- uncertainty in neural network,” in Proceedings of the 32nd International
Conference on Machine Learning, 2015, pp. 1613–1622.
to-end training, three components, i.e., CRM, C-SN and UC- [7] H. Chen, Q. Dou, L. Yu, J. Qin, and P.-A. Heng, “VoxResNet: Deep
SN were developed to train our semi-supervised segmentation voxelwise residual networks for brain segmentation from 3D MR
model. By extensively evaluating the proposed method in images,” NeuroImage, vol. 170, pp. 446–455, 2018.
[8] L. Chen, P. Bentley, K. Mori, K. Misawa, M. Fujiwara, and D. Rueck-
various medical image segmentation tasks, e.g., CT pancreas, ert, “DRINet for medical image segmentation,” IEEE Transactions on
MR endocardium, and multi-class ACDC segmentation, we Medical Imaging, vol. 37, no. 11, pp. 2453–2462, 2018.
13
[9] S. Chen, G. Bortsova, A. G.-U. Juárez, G. van Tulder, and M. de Bruijne, age Computing and Computer Assisted Intervention. Cham, Switzer-
“Multi-task attention-based semi-supervised learning for medical image land: Springer, 2019, pp. 568–576.
segmentation,” in International Conference on Medical Image Comput- [29] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learn-
ing and Computer Assisted Intervention. Cham, Switzerland: Springer, ing,” in International Conference on Learning Representations, 2017.
2019, pp. 457–465. [30] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable
[10] X. Chen, B. M. Williams, S. R. Vallabhaneni, G. Czanner, R. Williams, predictive uncertainty estimation using deep ensembles,” in Advances in
and Y. Zheng, “Learning active contour models for medical image seg- Neural Information Processing Systems, 2017, p. 6405–6416.
mentation,” in IEEE/CVF Conference on Computer Vision and Pattern [31] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised
Recognition, 2019, pp. 11 624–11 632. learning method for deep neural networks,” in Workshop on Challenges
[11] X. Chen, R. Zhang, and P. Yan, “Feature fusion encoder decoder in Representation Learning, ICML, 2013, pp. 1–6.
network for automatic liver lesion segmentation,” arXiv preprint [32] J. Li, X. Lin, H. Che, H. Li, and X. Qian, “Probability map guided bi-
arXiv:1903.11834, 2019. directional recurrent U-Net for pancreas segmentation,” arXiv preprint
[12] P. F. Christ, F. Ettlinger, F. Grün, M. E. A. Elshaera, J. Lipkova, arXiv:1903.00923, 2019.
S. Schlecht, F. Ahmaddy, S. Tatavarty, M. Bickel, P. Bilic et al., [33] S. Li, C. Zhang, and X. He, “Shape-aware semi-supervised 3D semantic
“Automatic liver and tumor segmentation of CT and MRI volumes segmentation for medical images,” in International Conference on
using cascaded fully convolutional neural networks,” arXiv preprint Medical Image Computing and Computer Assisted Intervention. Cham,
arXiv:1702.05970, 2017. Switzerland: Springer, 2020, pp. 552–561.
[13] K. W. Clark, B. A. Vendt, K. E. Smith, J. Freymann, J. Kirby, P. Koppel, [34] X. Li, L. Yu, H. Chen, C.-W. Fu, L. Xing, and P.-A. Heng,
S. M. Moore, S. R. Phillips, D. R. Maffitt, M. Pringle et al., “The “Transformation-consistent self-ensembling model for semisupervised
Cancer Imaging Archive (TCIA): Maintaining and operating a public medical image segmentation,” IEEE Transactions on Neural Networks
information repository,” Journal of Digital Imaging, vol. 26, no. 6, pp. and Learning Systems, vol. 32, no. 2, pp. 523–534, 2020.
1045–1057, 2013. [35] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
[14] M. Combalia, F. Hueto, S. Puig, J. Malvehy, and V. Vilaplana, “Uncer- for semantic segmentation,” in Proceedings of the IEEE Conference on
tainty estimation in deep neural networks for dermoscopic image clas- Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
sification,” in IEEE/CVF Conference on Computer Vision and Pattern [36] C. Louizos and M. Welling, “Structured and efficient variational deep
Recognition Workshops, 2020, pp. 3211–3220. learning with matrix Gaussian posteriors,” in Proceedings of the 33rd
[15] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “AutoAug- International Conference on Machine Learning, 2016, pp. 1708–1716.
ment: Learning augmentation strategies from data,” in Proceedings of the [37] X. Luo, J. Chen, T. Song, Y. Chen, G. Wang, and S. Zhang, “Semi-
IEEE Conference on Computer Vision and Pattern Recognition, 2019, supervised medical image segmentation through dual-task consistency,”
pp. 113–123. arXiv preprint arXiv:2009.04448, 2020.
[16] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “RandAugment:
[38] Y. Man, Y. Huang, J. Feng, X. Li, and F. Wu, “Deep Q learning
Practical automated data augmentation with a reduced search space,” in
driven CT pancreas segmentation with geometry-aware U-Net,” IEEE
Proceedings of the IEEE Conference on Computer Vision and Pattern
Transactions on Medical Imaging, vol. 38, no. 8, pp. 1971–1980, 2019.
Recognition Workshops, 2020, pp. 702–703.
[39] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional
[17] Z. Eaton-Rosen, F. Bragman, S. Bisdas, S. Ourselin, and M. J. Car-
neural networks for volumetric medical image segmentation,” in Pro-
doso, “Towards safe deep learning: Accurately quantifying biomarker
ceedings of the Fourth International Conference on 3D Vision. IEEE,
uncertainty in neural network predictions,” in International Conference
2016, pp. 565–571.
on Medical Image Computing and Computer Assisted Intervention.
Springer, 2018, pp. 691–699. [40] T. Miyato, S. Maeda, M. Koyama, and S. Ishii, “Virtual adversarial
[18] K. Fang and W.-J. Li, “DMNet: Difference minimization network training: a regularization method for supervised and semi-supervised
for semi-supervised segmentation in medical images,” in International learning,” IEEE Transactions on Pattern Analysis and Machine Intelli-
Conference on Medical Image Computing and Computer Assisted Inter- gence, vol. 41, no. 8, pp. 1979–1993, 2018.
vention, 2020, pp. 532–541. [41] A. Nazir, M. N. Cheema, B. Sheng, H. Li, P. Li, P. Yang, Y. Jung,
[19] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: J. Qin, J. Kim, and D. D. Feng, “OFF-eNET: An optimally fused fully
Representing model uncertainty in deep learning,” in Proceedings of the end-to-end network for automatic dense volumetric 3D intracranial blood
33rd International Conference on Machine Learning, 2016, pp. 1050– vessels segmentation,” IEEE Transactions on Image Processing, vol. 29,
1059. pp. 7192–7202, 2020.
[20] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, [42] Y. Ouali, C. Hudelot, and M. Tami, “Semi-supervised semantic segmen-
and J. Liu, “CE-Net: Context encoder network for 2D medical image tation with cross-consistency training,” in Proceedings of the IEEE/CVF
segmentation,” IEEE Transactions on Medical Imaging, vol. 38, no. 10, Conference on Computer Vision and Pattern Recognition, 2020, pp.
pp. 2281–2292, 2019. 12 674–12 684.
[21] J. M. Hernández-Lobato and R. P. Adams, “Probabilistic backpropaga- [43] J. Peng, G. Estrada, M. Pedersoli, and C. Desrosiers, “Deep co-
tion for scalable learning of Bayesian neural networks,” in Proceedings training for semi-supervised image segmentation,” Pattern Recognition,
of the 32nd International Conference on Machine Learning - Volume p. 107269, 2020.
37, 2015, p. 1861–1869. [44] J. Peng, M. Pedersoli, and C. Desrosiers, “Mutual information deep
[22] X. Huo, L. Xie, J. He, Z. Yang, W. Zhou, H. Li, and Q. Tian, regularization for semi-supervised segmentation,” in Medical Imaging
“ATSO: Asynchronous teacher-student optimization for semi-supervised with Deep Learning, 2020, pp. 601–613.
image segmentation,” in Proceedings of the IEEE/CVF Conference on [45] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional net-
Computer Vision and Pattern Recognition, 2021, pp. 1235–1244. works for biomedical image segmentation,” in International Conference
[23] J. Peng, M. Pedersoli, and C. Desrosiers, “Boosting semi-supervised on Medical Image Computing and Computer Assisted Intervention.
image segmentation with global and local mutual information regular- Cham, Switzerland: Springer, 2015, pp. 234–241.
ization,” arXiv preprint arXiv:2103.04813, 2021. [46] H. R. Roth, A. Farag, E. B. Turkbey, L. Lu, J. Liu, and
[24] A. Jungo and M. Reyes, “Assessing reliability and challenges of un- R. M. Summers. (2016) Data from Pancreas-CT. [Online]. Available:
certainty estimations for medical image segmentation,” in International https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT
Conference on Medical Image Computing and Computer Assisted Inter- [47] H. R. Roth, L. Lu, A. Farag, H. Shin, J. Liu, E. B. Turkbey, and
vention, 2019, pp. 48–56. R. M. Summers, “DeepOrgan: Multi-level deep convolutional networks
[25] T. Kalluri, G. Varma, M. Chandraker, and C. V. Jawahar, for automated pancreas segmentation,” in International Conference on
“Universal semi-supervised semantic segmentation,” arXiv preprint Medical Image Computing and Computer Assisted Intervention, 2015,
arXiv:1811.10323, 2018. pp. 556–564.
[26] D. Karimi and S. E. Salcudean, “Reducing the Hausdorff distance in [48] Y. Shi, W. Yang, Y. Gao, and D. Shen, “Does manual delineation
medical image segmentation with convolutional neural networks,” IEEE only provide the side information in ct prostate segmentation?” in
Transactions on Medical Imaging, vol. 39, no. 2, pp. 499–513, 2020. International Conference on Medical Image Computing and Computer
[27] A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian Assisted Intervention. Cham, Switzerland: Springer, 2017, pp. 692–700.
deep learning for computer vision?” arXiv preprint arXiv:1703.04977, [49] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk,
2017. A. Kurakin, H. Zhang, and C. Raffel, “FixMatch: Simplifying semi-
[28] H. Kervadec, J. Dolz, É. Granger, and I. B. Ayed, “Curriculum semi- supervised learning with consistency and confidence,” arXiv preprint
supervised segmentation,” in International Conference on Medical Im- arXiv:2001.07685, 2020.
14