A New Framework of Swarm Learning Consolidating Knowledge From Multi-Center Non-IID Data For Medical Image Segmentation
IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 42, NO. 7, JULY 2023
Abstract: Large training datasets are important for deep learning-based methods. For medical image segmentation, it could be however difficult to obtain large number of labeled training images solely from one center. Distributed learning, such as swarm learning, has the potential to use multi-center data without breaching data privacy. However, data distributions across centers can vary a lot due to the diverse imaging protocols and vendors (known as feature skew). Also, the regions of interest to be segmented could be different, leading to inhomogeneous label distributions (referred to as label skew). With such non-independently and identically distributed (Non-IID) data, the distributed learning could result in degraded models. In this work, we propose a novel swarm learning approach, which assembles local knowledge from each center while at the same time overcomes forgetting of global knowledge during local training. Specifically, the approach first leverages a label skew-awared loss to preserve the global label knowledge, and then aligns local feature distributions to consolidate global knowledge against local feature skew. We validated our method in three Non-IID scenarios using four public datasets, including the Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation (M&Ms) dataset, the Federated Tumor Segmentation (FeTS) dataset, the Multi-Modality Whole Heart Segmentation (MMWHS) dataset and the Multi-Site Prostate T2-weighted MRI segmentation (MSProsMRI) dataset. Results show that our method could achieve superior performance over existing methods. Code will be released via https://zmiclab.github.io/projects.html once the paper gets accepted.

Index Terms: Medical image, non-IID, segmentation, swarm learning.

I. INTRODUCTION

THE gains of deep learning for medical image segmentation could highly depend on the amount and diversity of labeled training data. The quantity of data at a single medical center is usually limited, especially for rare diseases [1]. Centralized learning methods, which assemble data from multiple centers, may not be applicable due to the data privacy issue [2]. To solve this problem, federated learning has been proposed, which trains a model on distributed datasets without the exchange of privacy-sensitive data between centers [3].

Recently, research on federated learning has gone beyond privacy issues, to further investigate novel methods well handling issues of security, transparency and fairness [4]. For healthcare applications [2], there are two widely adopted approaches. One is to rely on an aggregation server on secure hardware [5], [6], and the other is to adopt the peer-to-peer communication [4], [7]. Among them, swarm learning, a new decentralized paradigm, has been recently proposed [4]. This paradigm can keep both the manipulation of data and parameters locally. The decentralized paradigm provides a promising solution for privacy protection and fairness. However, the performance could degrade when it comes to the problem of non-independent and identically distributed (Non-IID) data coming from different centers [8], [9]. This is mainly due to the fact that the local training on Non-IID data could update the local models in different directions, and averaging these models with large discrepancy may deviate the optimization and deteriorate performance of the final model.

The main challenges in Non-IID data for decentralized learning have three folds, i.e., feature skew, label skew and quantity skew [9]. The former two are dominant and will be fully investigated in this work, while the third one is less studied here, as it becomes an issue of sample sizes for image segmentation tasks. Fig. 1 illustrates the two skew issues with segmentation of short-axis cardiac magnetic resonance (CMR) images. Feature skew could originate from the difference in imaging protocols, the strength of magnetic field in magnetic resonance imaging (MRI), or different demographics, which lead to the covariate shift [11].

Label skew commonly exists in multi-center data with diverse forms, among which one prevailing scenario is the partial annotation of training images. Fig. 1 shows three CMR images with three types of annotations for three different studies, i.e., different structures of interest to be segmented. This is because cardiac MRI could be used in studies of various cardiac disorders, and the images from different centers could

Manuscript received 5 August 2022; revised 13 October 2022; accepted 5 November 2022. Date of publication 9 November 2022; date of current version 29 June 2023. This work was supported by the National Natural Science Foundation of China under Grant 61971142, Grant 62111530195, Grant 62011540404, and Grant 71991471. (Corresponding author: Xiahai Zhuang.)
Zheyao Gao and Xiahai Zhuang are with the School of Data Science, Fudan University, Shanghai 200433, China (e-mail: zygao20@fudan.edu.cn; zxh@fudan.edu.cn).
Fuping Wu is with the School of Data Science, and Department of Statistics, Fudan University, Shanghai 200433, China (e-mail: 17110690006@fudan.edu.cn).
Weiguo Gao is with the School of Data Science, and School of Mathematical Sciences, Fudan University, Shanghai 200433, China (e-mail: wggao@fudan.edu.cn).
Digital Object Identifier 10.1109/TMI.2022.3220750
1558-254X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Chaitanya Bharathi Institute of Tech - HYDERABAD. Downloaded on September 09,2023 at 06:07:55 UTC from IEEE Xplore. Restrictions apply.
GAO et al.: NEW FRAMEWORK OF SWARM LEARNING CONSOLIDATING KNOWLEDGE FROM MULTI-CENTER NON-IID DATA 2119
2120 IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 42, NO. 7, JULY 2023
Fig. 2. The overall framework of the proposed method: the paradigm of swarm learning (left) and the proposed method for local training (right).
In local training, the proposed method adopts LaSA loss and FeSA regularization to tackle the problems from label and feature skew, respectively.
As shown in the procedure E, LaSA is computed with unbiased probability and unbiased partial label. The unbiased probability is calculated by
summing up the probabilities for unlabeled classes and the unbiased partial label is derived similarly by summing up values for unlabeled classes
in the weak label form of partial label. RFeSA is computed by matching the sample mean and variance from local model with the running mean and
variance parameters in batch normalization layers of the global model.
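The merging step described in this caption can be sketched for a single pixel in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`merge_unlabeled`, `weak_label`, `lasa_ce_pixel`), the list-based representation and the `eps` constant are hypothetical, and an unannotated pixel is assumed to carry a weak label that is uniform over the unlabeled classes.

```python
import math


def merge_unlabeled(vec, labeled):
    """Keep the entries of labeled classes and sum all other entries.

    vec: per-class values for one pixel (probabilities or weak-label values).
    labeled: indices of the classes annotated at this center.
    Returns (values of the labeled classes, merged value of the unlabeled ones).
    """
    labeled = list(labeled)
    kept = [vec[c] for c in labeled]
    merged = sum(v for c, v in enumerate(vec) if c not in labeled)
    return kept, merged


def weak_label(num_classes, labeled, true_class=None):
    """One-hot label for an annotated pixel, else uniform over unlabeled classes."""
    if true_class is not None:
        return [1.0 if c == true_class else 0.0 for c in range(num_classes)]
    n_unlabeled = num_classes - len(labeled)
    return [0.0 if c in labeled else 1.0 / n_unlabeled for c in range(num_classes)]


def lasa_ce_pixel(probs, label, labeled, eps=1e-12):
    """Cross entropy between the merged ("unbiased") probability and label."""
    q_kept, q_merged = merge_unlabeled(probs, labeled)
    y_kept, y_merged = merge_unlabeled(label, labeled)
    loss = -sum(y * math.log(q + eps) for y, q in zip(y_kept, q_kept))
    loss -= y_merged * math.log(q_merged + eps)
    return loss


# Hypothetical example: 4 classes, of which classes 0 and 1 are annotated locally.
probs = [0.1, 0.2, 0.6, 0.1]
unannotated = lasa_ce_pixel(probs, weak_label(4, [0, 1]), [0, 1])
annotated = lasa_ce_pixel(probs, weak_label(4, [0, 1], true_class=1), [0, 1])
```

For an unannotated pixel the loss depends only on the summed probability of the unlabeled classes, so it does not push those classes toward a uniform distribution the way a per-class cross entropy on the weak label would.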
structures of interests to segment, i.e., partial labeling. Without confusion, we denote the partial label from center k as $y_k$.

Both label skew and feature skew can diverge a standard local training from the global updates, leading to degraded performance of the aggregated model [8]. To tackle this, we propose a novel decentralized learning framework. As shown in Fig. 2, the left subfigure presents the paradigm of standard swarm learning, as described in Section III-A. Each center sends the local model to a decentralized communication network and receives models from other centers to derive the global model. Our contribution is focused on the local training process which is illustrated in the right part of Fig. 2. Our method uses two new functions, i.e., a label skew-awared (LaSA) loss and a feature skew-awared (FeSA) regularization term. LaSA loss adopts the unbiased probability derived from the output of networks to formulate the segmentation loss for label skew scenario. FeSA regularization handles the feature skew problem by matching the statistic of intermediate features in local clients with the global model. Formally, the objective for the local training with partially labeled data $(x_k, y_k)$ is given by,

$$\mathcal{L}(f_{\theta_k}(x_k), y_k) = \mathcal{L}_{LaSA}(f_{\theta_k}(x_k), y_k) + \lambda R_{FeSA}(\theta_k; \theta^{[t]}, x_k), \tag{5}$$

where $\lambda$ is the balancing parameter. The pseudo code for the whole pipeline is shown in Algorithm 1.

The proposed strategies deal with the label skew and feature skew challenges in distributed learning by avoiding the divergence of local updates during local training. This divergence resembles the consequence of catastrophic forgetting in continual learning [23]. In local training, each center fine-tunes the aggregated model using local data without an access to the data from other centers. Hence, the local model could forget the knowledge learned from the segmentation tasks in other centers, resulting in different updating directions.

Therefore, we first develop the LaSA loss to preserve the global knowledge of full label (global distribution) when training the model locally. Then, we design the FeSA regularization to distill the knowledge of feature distributions from the global model. In the following, we elaborate on the details of the two contributions in Section III-C and Section III-D, respectively.

C. Label Skew-Awared Loss $\mathcal{L}_{LaSA}$

For pixel i, let $q^i$, $y_k^i$ and $y^i$ respectively denote the segmentation prediction vector, the gold standard label vector of partial annotations from center k and the ground truth label which is a one-hot vector. Elements of these three vectors can be accessed using index of c, which is also the index of the class set. Let $C$ be the set of all classes in the segmentation task. In label skew scenario, $C_k$ represents the set of annotated classes in center k, and $\bar{C}_k = C \setminus C_k$ denotes the set of unlabeled classes.

For partial label scenario, we use the weak label form [10] as Fig. 1 illustrates. Formally, for a pixel i of an image x which is partially annotated with label class set $C_k$ in center k, if pixel i is annotated, we have $y_k^i = y^i$; otherwise, the elements of vector $y_k^i$ are given by,

$$y^i_{k,c} = \begin{cases} 0, & \text{for } c \in C_k \\ \dfrac{1}{|\bar{C}_k|}, & \text{for } c \in \bar{C}_k \end{cases}.$$

For the label skew issue, we propose a new loss to preserve the label knowledge for unlabeled classes, i.e., LaSA loss. The
LaSA loss is formulated with unbiased probability, as follows,

$$\mathcal{L}_{LaSA}(f_{\theta_k}(x_k), y_k) = \sum_{i \in \Omega} \mathcal{L}_{seg}(\hat{q}_k^i, \hat{y}_k^i), \tag{6}$$

where $\Omega$ is the set of pixels in the image, $\hat{q}_k^i$ and $\hat{y}_k^i$ are the unbiased prediction probability and unbiased partial label for pixel i, respectively, defined as,

$$\hat{q}^i_{k,c} = \begin{cases} q^i_c & \text{if } c \in C_k \\ \sum_{j \in \bar{C}_k} q^i_j & \text{if } c \in \bar{C}_k \end{cases}, \tag{7}$$

and

$$\hat{y}^i_{k,c} = \begin{cases} y^i_{k,c} & \text{if } c \in C_k \\ \sum_{j \in \bar{C}_k} y^i_{k,j} & \text{if } c \in \bar{C}_k \end{cases}. \tag{8}$$

Here, we further introduce two symbols for convenience of notation, i.e., $\hat{q}^i_{k,\bar{C}} = \hat{q}^i_{k,c}, \forall c \in \bar{C}_k$ and $\hat{y}^i_{k,\bar{C}} = \hat{y}^i_{k,c}, \forall c \in \bar{C}_k$, for the derivation of losses over unlabeled classes.

In Eq. (6), $\mathcal{L}_{seg}(\cdot)$ could be adapted from the loss functions for fully-supervised segmentation. Here, we provide the implementation with cross entropy (CE) loss [29], focal (FC) loss [30] and Dice (DC) loss [31], as follows,

$$\mathcal{L}_{LaSA[CE]} = -\sum_{i \in \Omega} \Big[ \sum_{c \in C_k} \hat{y}^i_{k,c} \log(\hat{q}^i_{k,c}) + \hat{y}^i_{k,\bar{C}} \log(\hat{q}^i_{k,\bar{C}}) \Big], \tag{9}$$

$$\mathcal{L}_{LaSA[FC]} = -\sum_{i \in \Omega} \Big[ \sum_{c \in C_k} \alpha_c \hat{y}^i_{k,c} (1 - \hat{q}^i_{k,c})^{\gamma} \log(\hat{q}^i_{k,c}) + \hat{y}^i_{k,\bar{C}} (1 - \hat{q}^i_{k,\bar{C}})^{\gamma} \log(\hat{q}^i_{k,\bar{C}}) \Big], \tag{10}$$

$$\mathcal{L}_{LaSA[DC]} = 1 - \sum_{c \in C_k} \frac{2 \sum_{i \in \Omega} \hat{y}^i_{k,c} \hat{q}^i_{k,c}}{\sum_{i \in \Omega} \hat{y}^i_{k,c} + \sum_{i \in \Omega} \hat{q}^i_{k,c}} - \frac{2 \sum_{i \in \Omega} \hat{y}^i_{k,\bar{C}} \hat{q}^i_{k,\bar{C}}}{\sum_{i \in \Omega} \hat{y}^i_{k,\bar{C}} + \sum_{i \in \Omega} \hat{q}^i_{k,\bar{C}}}, \tag{11}$$

where $\alpha_c$ and $\gamma$ are hyperparameters in focal loss, which are introduced to control the weights of losses over each class and pixel. One can see that the formulation in Eq. (9) is similar to the marginal loss in [32]. We will show that the label skew-awared loss in our framework could consolidate label knowledge.

In the following, we provide the theoretical explanation of the proposed LaSA loss by analyzing the problem of fully-supervised segmentation losses in label skew scenario. Without loss of generality, here we take CE loss as an example.

For segmentation task in local client k, the overall CE loss is the aggregation of CE losses over each pixel, which is given by,

$$\mathcal{L}_{CE}(f^i_{\theta_k}(x_k), y^i_k) = -\Big[ \sum_{c \in C_k} y^i_{k,c} \log(q^i_c) + \sum_{c \in \bar{C}_k} y^i_{k,c} \log(q^i_c) \Big]. \tag{12}$$

One can see that for an unlabeled pixel i, the CE loss maximizes the output probabilities $q^i_c$ for all unlabeled classes $c \in \bar{C}_k$. However, since log(·) is a concave function, the gradient computed for the most probable class is the least. As shown in Fig. 3, $\mathcal{L}_{CE}$ could result in more evenly distributed probabilities for unlabeled classes. Optimization of this loss leads to increase of the uncertainty over unlabeled pixels, which expedites the forgetting of label knowledge for unlabeled structures.

An alternative solution is to apply partial CE that only computes losses over labeled classes [33],

$$\mathcal{L}^i_{PCE}(f_{\theta_k}(x_k), y_k) = -\sum_{c \in C_k} y^i_{k,c} \log(q^i_c). \tag{13}$$

In this case, no loss is computed over unlabeled pixels. However the prediction for unlabeled pixels is affected by the losses
derived for labeled pixels. Although it does not explicitly degrade the performance of a model over the unlabeled pixels, it cannot alleviate the catastrophic forgetting during local training.

Compared with the above losses, our proposed LaSA loss, adopting the unbiased probability, could derive a proper estimation for losses over unlabeled pixels and consolidate the label knowledge from the global model. In Eq. (9), the loss for an unlabeled pixel i is derived by the cross entropy between the unbiased prediction probability $\hat{q}^i_{k,\bar{C}}$ and unbiased partial label $\hat{y}^i_{k,\bar{C}}$. In this formulation, the largest gradient is obtained for the most probable unlabeled class among classes in $\bar{C}_k$. The derivative of Eq. (9) with respect to the model output could be derived by,

$$\frac{\partial \mathcal{L}^i_{LaSA[CE]}(f_{\theta_k}(x_k), y_k)}{\partial f^i_c} = -q^i_c \Big( \frac{1}{\hat{q}^i_{k,\bar{C}}} - 1 \Big) \propto -q^i_c. \tag{14}$$

From this formula, one can see that LaSA tends to maximize the prediction for the most probable class predicted by the global model during local training. In other words, the loss is able to memorize the prediction of global model on unlabeled pixels during local training, thus consolidates the global knowledge of full labeling. Fig. 3 illustrates and compares the gradients computed by LaSA and other loss functions.

D. Feature Skew-Awared Regularization $R_{FeSA}$

For feature skew, we propose to introduce an extra regularization term using the feature statistics before the batch normalization (BN) module [34], i.e., the FeSA regularization term,

$$R_{FeSA}(\theta_k; \theta^{[t]}, x_k) = \sum_{l} \Big( \| \mu_l - \tilde{\mu}_{k,l} \|_2 + \| \sigma_l^2 - \tilde{\sigma}_{k,l}^2 \|_2 \Big), \tag{15}$$

where $\mu_l$ and $\sigma_l^2$ are the running mean and variance of parameters in BN module for layer l in the global model; $\tilde{\mu}_{k,l} \in \mathbb{R}^{C_l}$ and $\tilde{\sigma}^2_{k,l} \in \mathbb{R}^{C_l}$ are the sample mean and variance in the local model. The subscript k indicates the variables are associated with local client k and $C_l$ denotes the number of channels in layer l. They are calculated along the batch and spatial dimensions of the features from layer l, given a batch of input images sampled from local distribution $P_k(X)$.

In Eq. (15), the running mean and variance ($\mu_l$ and $\sigma_l^2$) are calculated through aggregation of the corresponding parameters in the local model of last communication epoch. The computation of the running mean and variance parameters in local models is the same as in [34], namely they are updated by,

$$\mu^{[m]}_{k,l} = M \mu^{[m-1]}_{k,l} + (1 - M) \tilde{\mu}_{k,l}, \tag{16}$$

$$(\sigma^2)^{[m]}_{k,l} = M (\sigma^2)^{[m-1]}_{k,l} + (1 - M) \tilde{\sigma}^2_{k,l}, \tag{17}$$

where m represents the index of iteration and M is the momentum between (0, 1). It is set empirically according to the number of batches in one local training epoch.

The intuition behind the method is that the feature statistics computed in BN layers contain the traits of image distribution $P(X)$ [35]. The feature statistics learned for local distribution $P_k(X)$ could diverge from those of the global model. To preserve the knowledge of global statistics of layer l (i.e. $E(X_l)$ and $D(X_l)$), we consider to force the sample statistics calculated during local training to be close to the running statistics in global model using L2 distance. Here, we estimate the true global feature statistics by,

$$E(X_l) \overset{1}{\approx} \sum_{k=1}^{K} \frac{n_k}{N} E_k(X_l) \overset{2}{\approx} \sum_{k=1}^{K} \frac{n_k}{N} \mu_{k,l} = \mu_l, \tag{18}$$

$$D(X_l) \overset{1}{\approx} \sum_{k=1}^{K} \frac{n_k}{N} D_k(X_l) \overset{2}{\approx} \sum_{k=1}^{K} \frac{n_k}{N} \sigma^2_{k,l} = \sigma^2_l, \tag{19}$$

where $E_k(X_l)$ and $D_k(X_l)$ are the mean and variance of features from layer l of the local model in center k. The second approximations in both formulas are based on the assumption that the running statistics are accurate estimations of the true statistics. In Eq. (18), the weighted sum of local means is an unbiased estimation for the mean of global feature distribution which supports the first approximation. For variance estimation, the first approximation in Eq. (19) is based on the assumption that the local mean $\mu_{k,l}$ is similar for each center. This assumption is reasonable because we attempt to align the local means with the global mean during training.

IV. EXPERIMENTS

We validated the applicability of the proposed method using four datasets, i.e., the M&Ms dataset [36], the FeTS dataset [37], the MSProsMRI [11] dataset and the MMWHS [38] dataset. We first demonstrated the effectiveness of the LaSA loss for label skew and FeSA regularization for feature skew in Section IV-C. We then compared the framework with other state-of-the-art (SOTA) methods in three situations in Section IV-D, including the scenario having both label skew and feature skew issues, the scenario with solely label skew and the scenario with solely feature skew. In the experiments, we performed Wilcoxon test to report the significance of the difference between two approaches.

A. Datasets

1) M&Ms Dataset: is composed of 320 short-axis Cardiac Magnetic Resonance (CMR) cases from 4 different scanner vendors, including 95 cases from Siemens (Vendor A), 125 cases from Philips (Vendor B), 50 cases from General Electric (Vendor C) and 50 cases from Canon (Vendor D). It provides annotations for 3 cardiac structures, including left ventricle (LV), right ventricle (RV) and the left ventricular myocardium (Myo). We resampled all data with the in-plane resolution of 1.25×1.25 mm and cropped them into 192×192 region of interest (ROI). We took each vendor as a local client such that the image distribution in each client is different. To simulate the label skew, during training, only the annotations of LV and RV are available for the data in Vendor A, LV and Myo for Vendor B, all cardiac structures as one foreground (illustrated in Fig. 1) for Vendor C and RV for
TABLE II
ABLATION STUDY FOR THE PROPOSED METHOD. RESULTS ARE EVALUATED IN DICE SCORE
Fig. 4. Visualization of segmentation results on M&Ms dataset. Arrows highlight the areas which are referred to in the text.
TABLE III
SUMMARY OF M&Ms DATASET. ONLY PART OF LABELS WERE USED FOR TRAINING IN EACH CENTER. LV: LEFT VENTRICLE, RV: RIGHT VENTRICLE; MYO: MYOCARDIUM; BG: BACKGROUND

Vendor D. We randomly split the data of each client by the ratio of 50%:15%:35% for training, validation and test. The dataset information is summarized in Table III.

2) FeTS Dataset: includes multi-institutional multi-parametric Magnetic Resonance Imaging (mpMRI) scans of brain tumor from Federated Tumor Segmentation (FeTS) challenge 2022 [37]. It was built upon the dataset from RSNA-ASNR-MICCAI BraTS 2021 challenge [39] with their real-world partitioning and the collaborative of independent institutions in a real-world federation [40]. The training data consists of 1251 cases and each case has four 240 × 240 × 155 structural MRI images including native (T1), post-contrast T1-weighted (T1Gd), T2-weighted (T2), and T2 FLuid Attenuated Inversion Recovery (FLAIR) volumes. The pre-processing steps for all images include z-score normalization, rigid registration and resolution resampling to 1 mm³. The provided annotation includes the enhancing tumor (ET), necrotic tumor core (NCR) and peritumoral edematous and infiltrated tissue (ED). To simulate the label skew scenario, we removed the annotations for one or two structures in the training data of each center. Since quantity skew is not the main focus of this work, we merged the data from the centers listed in the partitioning-2 file from the FeTS challenge. The information of the final partitioned dataset is summarized in Table IV. In the experiment, the dataset was split by the ratio of 60%:15%:25% for training, validation and test in each center.

3) MMWHS Dataset: has 60 cardiac CT images from MICCAI'17 Multi-Modality Whole Heart Segmentation challenge. The annotations for the whole heart substructures include the left ventricle (LV), right ventricle (RV), left atrium (LA), right atrium (RA), myocardium (Myo), ascending aorta (AO), and the pulmonary artery (PA). We resampled the images into 2×2×2 mm and cropped them with ROI of 96×96×64. To simulate the scenario with label skew, we first randomly and equally divided them into 4 parts as the datasets for each center. We performed 4-fold cross validation on this dataset in our experiment. In each center, we then only used the annotations of 2 or 3 substructures in each center during training. Specifically, we have Myo, LV and LA for Center A; RA and PA for Center B; RV, RA and AO for Center C; RV, AO and PA for Center D. The dataset information is summarized in Table V.

The MSProsMRI dataset includes 30 cases from RUNMC [41] (Center A), 30 cases from BMC [41] (Center B), 19 cases from HCRUDB [42] (Center C), 13 cases from UCL [43] (Center D), 12 cases from BIDMC [43] (Center E), and 12 from HK [43] (Center F). We resampled all these data with the axial plane resolution of 0.625×0.625 mm and cropped them into ROI of 384×384. We performed 4-fold
TABLE IV
SUMMARY OF FeTS DATASET. THE DATASET WAS MANUALLY SPLIT INTO NINE CENTERS. ORG CENTER DENOTES CENTER INDEX OF THE ORIGINAL PARTITION PROVIDED BY THE FeTS CHALLENGE. ONLY PART OF LABELS WERE USED FOR TRAINING IN EACH CENTER. ET: ENHANCING TUMOR; NCR: NECROTIC TUMOR CORE; ED: PERITUMORAL EDEMATOUS AND INFILTRATED TISSUE; BG: BACKGROUND
TABLE V
SUMMARY OF MMWHS DATASET. THE DATASET WAS MANUALLY SPLIT IN FOUR CENTERS. ONLY PART OF LABELS WERE USED FOR TRAINING IN EACH CENTER. LV: LEFT VENTRICLE; RV: RIGHT VENTRICLE; LA: LEFT ATRIUM; RA: RIGHT ATRIUM; MYO: MYOCARDIUM OF LV; AO: ASCENDING AORTA; PA: PULMONARY ARTERY
TABLE VI
LAYER CHOICE FOR BN STATISTIC MATCHING IN FeSA REGULARIZATION. CRITICAL LAYERS REFER TO THE DOWNSAMPLE AND (OR) UPSAMPLE CONVOLUTION LAYERS

Fig. 5. Plot of the knowledge forgetting in Dice score during epoch 200 in Center D. The vertical axis represents the average Dice score for LV and Myo achieved by local model in center D on the validation data from all other centers.
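The BN statistic matching whose layer choice Table VI examines can be illustrated with a small plain-Python sketch of Eqs. (15)-(17). It is an illustration under stated assumptions rather than the authors' code: the names (`channel_stats`, `fesa_regularizer`, `update_running`), the dictionary layout keyed by layer name, and the use of the biased (population) variance are hypothetical choices, and a practical implementation would operate on framework tensors instead of lists.

```python
import math


def channel_stats(features):
    """Per-channel sample mean and biased variance.

    features: list of channels, each a flat list of activations with the
    batch and spatial dimensions already flattened together.
    """
    means = [sum(ch) / len(ch) for ch in features]
    variances = [
        sum((v - m) ** 2 for v in ch) / len(ch) for ch, m in zip(features, means)
    ]
    return means, variances


def l2_distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fesa_regularizer(local_features, global_running_stats):
    """Sum over the selected layers of the mean and variance mismatches.

    local_features: {layer_name: features} from the local model.
    global_running_stats: {layer_name: (running_mean, running_var)} taken
    from the global model; only the layers listed here are matched.
    """
    total = 0.0
    for layer, (mu_g, var_g) in global_running_stats.items():
        mu_l, var_l = channel_stats(local_features[layer])
        total += l2_distance(mu_g, mu_l) + l2_distance(var_g, var_l)
    return total


def update_running(running, sample, momentum):
    """Running-statistic update where `momentum` weights the OLD value."""
    return [momentum * r + (1.0 - momentum) * s for r, s in zip(running, sample)]


# Hypothetical layer "up1" with two channels and two activations per channel.
feats = {"up1": [[1.0, 3.0], [0.0, 0.0]]}
reg = fesa_regularizer(feats, {"up1": ([0.0, 0.0], [1.0, 0.0])})
```

Restricting `global_running_stats` to a subset of layers is how the layer choice would be expressed in this sketch. Note that some deep learning libraries define the momentum as the weight of the new sample, the reverse of the convention used here.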
TABLE VII
COMPARISON STUDY ON M&Ms DATASETS. RESULTS ARE EVALUATED IN DICE SCORE. BOLD TEXT DENOTES THE BEST RESULTS ACHIEVED BY DISTRIBUTED LEARNING METHODS WITH PARTIAL LABEL

TABLE VIII
COMPARISONS OF DIFFERENT STRATEGIES FOR LABEL SKEW ON MMWHS DATASETS. RESULTS ARE EVALUATED BY 4-FOLD CROSS VALIDATION IN DICE SCORE. BOLD TEXT DENOTES THE BEST RESULTS ACHIEVED BY DISTRIBUTED LEARNING METHODS WITH PARTIAL LABEL
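Since the tables report results in Dice score, a minimal sketch of the per-class metric may help in reading them. The function name `dice_score`, the flat label-map representation, and the choice to return `None` when a class is absent from both maps are assumptions made for illustration; the evaluation code behind the tables is not given in the text.

```python
def dice_score(pred, target, cls):
    """Dice overlap for one class between two flat integer label maps.

    Returns None when the class appears in neither map, leaving the
    handling of that case to the caller.
    """
    intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    size_pred = sum(1 for p in pred if p == cls)
    size_target = sum(1 for t in target if t == cls)
    if size_pred + size_target == 0:
        return None
    return 2.0 * intersection / (size_pred + size_target)


# Hypothetical 4-pixel label maps with classes 0, 1 and 2.
pred = [0, 1, 1, 2]
target = [0, 1, 2, 2]
scores = {c: dice_score(pred, target, c) for c in (0, 1, 2)}
```

Averaging such per-class scores over structures and test cases gives the kind of average Dice score the tables summarize.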
while the performance of the swarm learning method showed dramatic drop. It means that the proposed regularization and LaSA loss could effectively consolidate old knowledge, thus keeping the local training consistent with global updates.

2) Choice of Layers for FeSA Regularization: Because using all the layers for BN statistic matching could restrict the plasticity of the model, we conducted experiments to select a better layer subset for the computation of the regularization term, i.e., $R_{FeSA}(\theta_k; \theta^{[t]}, x_k)$ in Eq. (15). Since the model we adopted was UNet, we studied whether the features in encoder layers or decoder layers play a more important role in the regularization. We first had all the layers in the encoder or decoder for regularization. According to Table VI, we found that there was no significant difference across these choices (Wilcoxon test p-values > 0.6 for each pair of choices). Hence, we further relaxed the constraint, and chose solely features from several critical layers (i.e. features after each downsample or upsample convolution layer) for BN statistic matching. The results in Table VI showed that using the features from upsample layers for regularization could obtain better results than using features from downsample layers.

3) Effect of Communication Rate: In swarm learning, communication rate refers to the number of epochs or iterations in local training between two consecutive model aggregation operations. Ideally, if the model aggregation is implemented

D. Comparison Study

In this section, we performed a comprehensive comparison study with other SOTA methods. The methods chosen for comparisons should have the following three features: (1) distributed learning methods related to the idea of knowledge consolidation; (2) global-model-based methods, in which all centers share one model that is generalizable for data from unseen centers; (3) methods based on decentralized frameworks without server-side manipulation.

1) Application to Multi-Center and Multi-Vendor Cardiac Segmentation With Label and Feature Skews: To further validate the proposed method in the scenario with both label and feature skew, we compared it with other SOTA methods for cardiac segmentation on M&Ms dataset, including: (1) FedProx [8], which used L2 regularization during local training; (2) FedCurv [14], which used importance matrix to penalize the change of important parameters; (3) FML [19], which used personalization strategy; (4) Swarm, which was the standard swarm learning framework using partial cross-entropy loss; and (5) Sup&Cen, which trained the same network with all the data and full labels together in a fully-supervised and centralized manner.

Table VII reported the comparison results. With the proposed regularization term and LaSA loss, our method resulted in the best performance. Notably, compared with the Sup&Cen
Fig. 7. Visualization of the segmentation results for MMWHS datasets. Arrows highlight the areas which are referred to in the text.
TABLE IX
COMPARISON RESULTS (DICE SCORES) ON PROSTATE DATASETS. EACH METHOD WAS IMPLEMENTED WITH 4-FOLD CROSS VALIDATION. BOLD TEXT DENOTES THE BEST RESULTS ACHIEVED BY DISTRIBUTED LEARNING METHODS. 'CENTRALIZED' DENOTES THE TRIVIAL METHOD TRAINING WITH CENTRALIZED DATA
TABLE X
COMPARISON STUDY ON FeTS DATASETS. RESULTS ARE EVALUATED IN DICE SCORE. BOLD TEXT DENOTES THE BEST RESULTS ACHIEVED BY DISTRIBUTED LEARNING METHODS WITH PARTIAL LABEL. ET: ENHANCING TUMOR; WT: WHOLE TUMOR (COMPRISING ET, ED AND NCR); TC: TUMOR CORE (COMPRISING ED AND ET)

method, the Dice score of our method was only 0.4% lower. It indicates that our method could effectively handle the label skew and feature skew challenge in the distributed learning framework. Other methods could only slightly improve the performance of the swarm learning method in this scenario. This is reasonable as these methods attempt to either avoid the divergence of training between the local and global models by restricting the magnitude of parameter changes or forcing their prediction to be consistent. They do not explicitly deal with the knowledge forgetting of feature distribution and label annotation, particularly the latter one.

2) Application to Multi-Institutional Brain Tumor Segmentation With Label and Feature Skews: To demonstrate the generalizability to other applications, we also compared the proposed method with SOTA methods for brain tumor segmentation on FeTS datasets. The compared methods were similar with those in the comparison study on M&Ms datasets.

Following the evaluation metrics of the FeTS challenge, we reported the average Dice score on enhancing tumor (ET), whole tumor (WT), tumor core (TC) in Table X. Compared with other distributed learning methods, our method achieved the best performance in average Dice score of all evaluated regions. Moreover, our method performed comparably to the Sup&Cen method in the segmentation of WT and TC. This demonstrates that the proposed LaSA and FeSA loss could effectively alleviate the issues caused by label and feature skew in FeTS datasets. In terms of the performance on ET, the average Dice score of our method was slightly lower than that of Sup&Cen. The reason could be that the segmentation of ET is difficult and there could exist non-negligible inter-observer variability of manual annotation across centers. Thus, consolidating the knowledge in other centers could be misleading.

3) Application to Whole Heart Segmentation With Label Skew Setting: To demonstrate the superiority of the proposed method in the Non-IID scenario where only label skew exists, we compared with another strategy [27] (i.e., Pseudo) for partial supervision learning that could be applied to the swarm learning framework. The Swarm+Pseudo method generates pseudo labels using the global model for the supervision of unlabeled structures.

As presented in Table VIII, the swarm learning with CE loss and PCE loss almost failed in this scenario. This is because these methods could not estimate an accurate loss for unlabeled structures, and thus failed to preserve the knowledge
TABLE XI
STUDY OF THE EFFECT OF QUANTITY SKEW FOR THE PROPOSED NON-IID SL METHOD ON M&Ms DATASET. THE INDEX IN THE COLUMN OF 'LIMITED CENTER' REPRESENTS THE CENTER WHICH HAS SCARCE TRAINING DATA
of labels except 2 or 3 out of 7 structures which are provided with annotations in a local center. Both the Swarm+Pseudo method and our proposed Non-IID swarm learning achieved satisfactory performance. From Table VII one can see that the results of Swarm+Pseudo are very close to those of Sup&Cen, which can be taken as an upper bound of federated learning methods. Despite the limited space for improvement, our method still obtained a 1.3% higher average Dice score than Swarm+Pseudo. The reason could be that although both methods alleviate the forgetting of true labels in the local update by distilling knowledge from the global model, the pseudo label generated by the global model can be inaccurate. This inaccurate label could mislead the optimization of the local model.
The visual results are presented in Fig. 7. We can observe that, compared with the ground truth, the segmentation errors of our method mainly occurred in Myo and PA. The reasons are twofold. (1) Myo and PA are hard to segment, as the results of the Sup&Cen method are not accurate for them either. (2) The supervision for Myo is weak since only Center A could provide annotations for Myo. For difficult structures, more supervision is needed. It is also observed that the performance on a structure is not determined by whether the annotation is available locally. For example, Center B and Center D could not provide annotations for Myo during training. However, the results of Myo in these two centers are visually more accurate than those in Center A. The reason could be that our proposed method can effectively preserve the label knowledge from other centers during training, and the performance difference is probably due to the quality of images in each center.
4) Application to Multi-Site Prostate MRI Segmentation With Feature Skew: To show the effectiveness of our method in the feature skew scenario, we applied the method to segmentation of the multi-site prostate T2-weighted MRI dataset, and compared with other SOTA methods that do not apply the personalization strategy, including (1) the standard swarm learning method, (2) FedProx [8] and (3) FedCurv [15].
Table IX presents the results of the compared global methods. Compared with the Swarm method, the proposed Non-IID SL achieved significant improvements in Center C and E by 4.7% (p=0.005) and 5.8% (p=0.002), respectively. This indicates that by applying the feature statistic alignment in the local training, our method could learn robust feature representations and thus improve the performance. Moreover, Non-IID SL achieved the best performance in average Dice score among all distributed learning methods, with all p-values less than 0.05 by the paired Wilcoxon tests. The reason could be that our method used image distribution related regularization during local training, while other methods applied parameter-based regularization which could restrict the plasticity of local models.
V. CONCLUSION
In this paper, we have presented a new framework for decentralized and privacy-protected distributed swarm learning. This framework employs the label skew-awared (LaSA) loss, which preserves the prediction for unlabeled structures by the global model during local training, to avoid the label skew issue. Moreover, we have proposed the feature skew-awared (FeSA) regularization to align feature distributions of different centers through batch normalization statistic matching, which alleviates the effect of feature skew. The results showed that the proposed skew-awared functions were effective in tackling the different distributions and could consolidate the knowledge learned from local clients. The applications to the three tasks of medical image segmentation have demonstrated that the proposed method could achieve comparable results to the fully-supervised and centralized methods.
APPENDIX
A. Study of Quantity Skew
Here we discuss the effect of quantity skew on the proposed method. Quantity skew, especially with highly unbalanced numbers of subjects across centers, could degrade the performance of federated learning methods, though this challenge may stem from the different sample sizes rather than from the Non-IID problem in segmentation tasks. However, as the proposed method adopts a knowledge consolidation strategy, the effect of quantity skew could be reduced significantly.
For demonstration, we implemented the proposed method on the M&Ms dataset with unbalanced numbers across centers. Specifically, we reduced the numbers of training and validation data in one of the four centers to be only 5 and 3, respectively, and kept the others unchanged. The implementation details are the same as described in Section IV-B.
As shown in Table XI, compared with the results trained with a relatively balanced dataset, quantity skew did not lead to an obvious performance decline. The reason could be twofold. First, our method involves knowledge consolidation strategies. Centers with few training samples could also acquire knowledge from the global model. Second, we performed data augmentations in each center and ensured equal local training iterations in one communication round such that the global model will not be biased toward centers with more samples.
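The second point of the quantity skew study, equal local training iterations per communication round, can be illustrated with a minimal Python sketch. It assumes (this is not stated in the paper) that a center with scarce data simply cycles through its samples until the fixed iteration budget is reached; the function name, batch size, iteration count, and sample identifiers are all hypothetical, and data augmentation is omitted.

```python
from itertools import cycle, islice


def fixed_iteration_batches(samples, batch_size, num_iterations):
    """Yield exactly `num_iterations` mini-batches from one center's samples.

    Assumption for illustration: a center with few samples is cycled through
    repeatedly, so every center performs the same number of local updates
    in a communication round regardless of its dataset size.
    """
    if not samples:
        raise ValueError("a center needs at least one training sample")
    stream = cycle(samples)  # endless iterator over the local samples
    for _ in range(num_iterations):
        # Take the next `batch_size` items; the stream keeps its position.
        yield list(islice(stream, batch_size))


# Hypothetical centers: one with scarce data (5 subjects) and one larger center.
scarce_center = [f"scarce_{i}" for i in range(5)]
large_center = [f"large_{i}" for i in range(40)]

scarce_batches = list(
    fixed_iteration_batches(scarce_center, batch_size=4, num_iterations=10)
)
large_batches = list(
    fixed_iteration_batches(large_center, batch_size=4, num_iterations=10)
)

# Both centers contribute the same number of local updates per round.
print(len(scarce_batches), len(large_batches))
```

Under this assumption the scarce center revisits its few subjects more often within a round, which is where per-center data augmentation would help to limit overfitting.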
REFERENCES
[1] L. Qu, N. Balachandar, and D. L. Rubin, “An experimental study of data heterogeneity in federated learning methods for medical imaging,” 2021, arXiv:2107.08371.
[2] N. Rieke et al., “The future of digital health with federated learning,” NPJ Digit. Med., vol. 3, no. 1, pp. 1–7, 2020.
[3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Artif. Intell. Statist., 2017, pp. 1273–1282.
[4] S. Warnat-Herresthal et al., “Swarm learning for decentralized and confidential clinical machine learning,” Nature, vol. 594, no. 7862, pp. 265–270, 2021.
[5] S. Pati et al., “Federated learning enables big data for rare cancer boundary detection,” 2022, arXiv:2204.10836.
[6] M. J. Sheller, G. A. Reina, B. Edwards, J. Martin, and S. Bakas, “Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, A. Crimi, S. Bakas, H. Kuijf, F. Keyvan, M. Reyes, and T. van Walsum, Eds. Cham, Switzerland: Springer, 2019, pp. 92–104.
[7] A. Guha Roy, S. Siddiqui, S. Pölsterl, N. Navab, and C. Wachinger, “BrainTorrent: A peer-to-peer environment for decentralized federated learning,” 2019, arXiv:1905.06731.
[8] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proc. Mach. Learn. Syst., vol. 2, 2020, pp. 429–450.
[9] P. Kairouz et al., “Advances and open problems in federated learning,” Found. Trends Mach. Learn., vol. 14, nos. 1–2, pp. 1–210, Jun. 2021.
[10] B. Wu, S. Lyu, and B. Ghanem, “ML-MG: Multi-label learning with missing labels using a mixed graph,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4157–4165.
[11] Q. Liu, Q. Dou, L. Yu, and P. A. Heng, “MS-Net: Multi-site network for improving prostate segmentation with heterogeneous MRI data,” IEEE Trans. Med. Imag., vol. 39, no. 9, pp. 2713–2724, Sep. 2020.
[12] K. Gilbert et al., “Independent left ventricular morphometric atlases show consistent relationships with cardiovascular risk factors: A UK biobank study,” Sci. Rep., vol. 9, no. 1, pp. 1–9, 2019.
[13] C. Petitjean et al., “Right ventricle segmentation from cardiac MRI: A collation study,” Med. Image Anal., vol. 19, no. 1, pp. 187–202, 2015.
[14] N. Shoham et al., “Overcoming forgetting in federated learning on non-IID data,” 2019, arXiv:1910.07796.
[15] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou, “FedBN: Federated learning on non-IID features via local batch normalization,” 2021, arXiv:2102.07623.
[16] K. Bonawitz et al., “Practical secure aggregation for privacy-preserving machine learning,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2017, pp. 1175–1191.
[17] X. Li, Y. Gu, N. Dvornek, L. H. Staib, P. Ventola, and J. S. Duncan, “Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results,” Med. Image Anal., vol. 65, Oct. 2020, Art. no. 101765.
[18] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 3557–3568.
[19] T. Shen et al., “Federated mutual learning,” 2020, arXiv:2006.16765.
[20] T. Li, S. Hu, A. Beirami, and V. Smith, “Ditto: Fair and robust federated learning through personalization,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 6357–6368.
[21] L. Collins, H. Hassani, A. Mokhtari, and S. Shakkottai, “Exploiting shared representations for personalized federated learning,” 2021, arXiv:2102.07078.
[22] K. James et al., “Overcoming catastrophic forgetting in neural networks,” Proc. Nat. Acad. Sci. USA, vol. 114, no. 13, pp. 3521–3526, Mar. 2017.
[23] M. Delange et al., “A continual learning survey: Defying forgetting in classification tasks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3366–3385, Jul. 2021.
[24] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 3987–3995.
[25] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 139–154.
[26] A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “PodNet: Pooled outputs distillation for small-tasks incremental learning,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 86–102.
[27] A. Douillard, Y. Chen, A. Dapogny, and M. Cord, “PLOP: Learning without forgetting for continual semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 4040–4050.
[28] Y. Lu, X. Huang, Y. Dai, S. Maharjan, and Y. Zhang, “Blockchain and federated learning for privacy-preserved data sharing in industrial IoT,” IEEE Trans. Ind. Informat., vol. 16, no. 6, pp. 4177–4186, Jun. 2020.
[29] I. Szita and A. Lörincz, “Learning tetris using the noisy cross-entropy method,” Neural Comput., vol. 18, no. 12, pp. 2936–2941, Dec. 2006.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[31] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical image segmentation,” in Proc. 4th Int. Conf. 3D Vis. (DV), Oct. 2016, pp. 565–571.
[32] G. Shi, L. Xiao, Y. Chen, and S. K. Zhou, “Marginal loss and exclusion loss for partially supervised multi-organ segmentation,” Med. Image Anal., vol. 70, May 2021, Art. no. 101979.
[33] F. Cermelli, M. Mancini, S. Rota Bulo, E. Ricci, and B. Caputo, “Modeling the background for incremental learning in semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9233–9242.
[34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 448–456.
[35] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu, “Adaptive batch normalization for practical domain adaptation,” Pattern Recognit., vol. 80, pp. 109–117, Aug. 2018.
[36] V. M. Campello et al., “Multi-centre, multi-vendor and multi-disease cardiac segmentation: The M&Ms challenge,” IEEE Trans. Med. Imag., vol. 40, no. 12, pp. 3543–3554, Dec. 2021.
[37] S. Pati et al., “The federated tumor segmentation (FeTS) challenge,” 2021, arXiv:2105.05874.
[38] X. Zhuang et al., “Evaluation of algorithms for multi-modality whole heart segmentation: An open-access grand challenge,” Med. Image Anal., vol. 58, Dec. 2019, Art. no. 101537.
[39] U. Baid et al., “The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification,” 2021, arXiv:2107.02314.
[40] G. A. Reina et al., “OpenFL: An open-source framework for federated learning,” 2021, arXiv:2105.06413.
[41] N. Bloch et al., “NCI-ISBI 2013 challenge: Automated segmentation of prostate structures,” Cancer Imag. Arch., vol. 370, p. 6, Oct. 2015.
[42] G. Lemaître, R. Martí, J. Freixenet, J. C. Vilanova, P. M. Walker, and F. Meriaudeau, “Computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: A review,” Comput. Biol. Med., vol. 60, pp. 8–31, May 2015.
[43] G. Litjens et al., “Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge,” Med. Image Anal., vol. 18, no. 2, pp. 359–373, 2014.
[44] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent, 2015, pp. 234–241.
[45] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein, “NnU-Net: A self-configuring method for deep learning-based biomedical image segmentation,” Nature Methods, vol. 18, no. 2, pp. 203–211, Dec. 2020.