A New Framework of Swarm Learning Consolidating Knowledge From Multi-Center Non-IID Data For Medical Image Segmentation

IEEE TRANSACTIONS ON MEDICAL IMAGING, VOL. 42, NO. 7, JULY 2023
Zheyao Gao, Fuping Wu, Weiguo Gao, and Xiahai Zhuang
Abstract: Large training datasets are important for deep learning-based methods. For medical image segmentation, however, it can be difficult to obtain a large number of labeled training images solely from one center. Distributed learning, such as swarm learning, has the potential to use multi-center data without breaching data privacy. However, data distributions across centers can vary a lot due to the diverse imaging protocols and vendors (known as feature skew). Also, the regions of interest to be segmented could be different, leading to inhomogeneous label distributions (referred to as label skew). With such non-independently and identically distributed (Non-IID) data, the distributed learning could result in degraded models. In this work, we propose a novel swarm learning approach, which assembles local knowledge from each center while at the same time overcoming the forgetting of global knowledge during local training. Specifically, the approach first leverages a label skew-awared loss to preserve the global label knowledge, and then aligns local feature distributions to consolidate global knowledge against local feature skew. We validated our method in three Non-IID scenarios using four public datasets, including the Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Segmentation (M&Ms) dataset, the Federated Tumor Segmentation (FeTS) dataset, the Multi-Modality Whole Heart Segmentation (MMWHS) dataset and the Multi-Site Prostate T2-weighted MRI segmentation (MSProsMRI) dataset. Results show that our method could achieve superior performance over existing methods. Code will be released via https://zmiclab.github.io/projects.html once the paper gets accepted.

Index Terms: Medical image, non-IID, segmentation, swarm learning.

Manuscript received 5 August 2022; revised 13 October 2022; accepted 5 November 2022. Date of publication 9 November 2022; date of current version 29 June 2023. This work was supported by the National Natural Science Foundation of China under Grant 61971142, Grant 62111530195, Grant 62011540404, and Grant 71991471. (Corresponding author: Xiahai Zhuang.)
Zheyao Gao and Xiahai Zhuang are with the School of Data Science, Fudan University, Shanghai 200433, China (e-mail: zygao20@fudan.edu.cn; zxh@fudan.edu.cn).
Fuping Wu is with the School of Data Science, and Department of Statistics, Fudan University, Shanghai 200433, China (e-mail: 17110690006@fudan.edu.cn).
Weiguo Gao is with the School of Data Science, and School of Mathematical Sciences, Fudan University, Shanghai 200433, China (e-mail: wggao@fudan.edu.cn).
Digital Object Identifier 10.1109/TMI.2022.3220750

I. INTRODUCTION

The gains of deep learning for medical image segmentation could highly depend on the amount and diversity of labeled training data. The quantity of data at a single medical center is usually limited, especially for rare diseases [1]. Centralized learning methods, which assemble data from multiple centers, may not be applicable due to the data privacy issue [2]. To solve this problem, federated learning has been proposed, which trains a model on distributed datasets without the exchange of privacy-sensitive data between centers [3].

Recently, research on federated learning has gone beyond privacy issues, to further investigate novel methods that handle issues of security, transparency and fairness [4]. For healthcare applications [2], there are two widely adopted approaches. One is to rely on an aggregation server on secure hardware [5], [6], and the other is to adopt peer-to-peer communication [4], [7]. Among them, swarm learning, a new decentralized paradigm, has been recently proposed [4]. This paradigm can keep both the manipulation of data and parameters local. The decentralized paradigm provides a promising solution for privacy protection and fairness. However, the performance could degrade when it comes to the problem of non-independent and identically distributed (Non-IID) data coming from different centers [8], [9]. This is mainly due to the fact that the local training on Non-IID data could update the local models in different directions, and averaging these models with large discrepancy may deviate the optimization and deteriorate performance of the final model.

The main challenges in Non-IID data for decentralized learning are threefold, i.e., feature skew, label skew and quantity skew [9]. The former two are dominant and will be fully investigated in this work, while the third one is less studied here, as it becomes an issue of sample sizes for image segmentation tasks. Fig. 1 illustrates the two skew issues with segmentation of short-axis cardiac magnetic resonance (CMR) images. Feature skew could originate from the difference in imaging protocols, the strength of magnetic field in magnetic resonance imaging (MRI), or different demographics, which lead to the covariate shift [11].

Label skew commonly exists in multi-center data with diverse forms, among which one prevailing scenario is the partial annotation of training images. Fig. 1 shows three CMR images with three types of annotations for three different studies, i.e., different structures of interest to be segmented. This is because cardiac MRI could be used in studies of various cardiac disorders, and the images from different centers could
be solely partially annotated, due to the diverse interests of clinical studies. For example, in myocardial viability studies only the myocardium of left ventricle (LV) is annotated [12]; while for the diagnosis of right ventricle (RV) related pathologies (e.g. pulmonary hypertension, coronary heart disease and dysplasia), only the right ventricle requires segmentation [13]. Therefore, given the same type of images, the segmentation could be conceptually different. These inconsistent annotations from the same imaging area across centers result in the Non-IID distribution of the labels, which is referred to as the label skew.

Fig. 1. Illustration of feature skew and label skew. The first row shows the log histogram of typical images from three centers. The sample slices of different distributions are presented in the second row. The third row presents three types of partial labeling and the weak label form [10], where the four elements in label vectors denote the probabilities of pixels being non-cardiac background (BG), myocardium (Myo), left ventricle (LV) and right ventricle (RV), respectively. UN means unlabeled pixels.

To alleviate the skew problems, approaches have been proposed [1]. For example, Shoham et al. [14] proposed to adopt the elastic weight consolidation strategy (EWC) to prevent the local models from drifting apart by overcoming the catastrophic forgetting. Li et al. [15] proposed a personalized method that keeps the batch normalization locally during model aggregation to mitigate the feature skew problem. Although these methods have shown potential for classification tasks when training with Non-IID data, they do not take into consideration the label skew in segmentation tasks. This could lead to different local updating directions, and thus could degrade performance of the aggregated model.

In this work, we propose a new privacy protected, decentralized distributed learning framework for medical image segmentation, which tackles both the feature and label skew problems. Specifically, we propose the label skew-awared (LaSA) loss with unbiased probability for unlabeled structures, which maximizes the prediction for the most probable class predicted by the global model during local training. This LaSA loss is able to preserve the full label knowledge for local models. Furthermore, we propose the feature skew-awared (FeSA) regularization and use batch normalization (BN) statistic matching to align the feature distributions between centers, which alleviates the effect of feature skew. We validate our proposed framework comprehensively in three scenarios using four public datasets, and the results demonstrate the great potential in the tasks of medical image segmentation.

Our contributions are summarized as follows:

• We propose a novel decentralized distributed learning framework for medical image segmentation, which could train models using partially labeled images from multiple centers for full label segmentation.
• We introduce an unbiased probability-based, label skew-awared loss to handle label skew, by consolidating global knowledge from local knowledge of partial labels.
• We propose a feature statistic-based regularization term that could cope with feature skewed data by distilling robust feature distribution knowledge during local training.
• We validate the proposed framework in three segmentation tasks to demonstrate its applicability in different scenarios.

This paper is organized as follows: Section II provides a brief summary of related works, including federated learning, swarm learning and distributed learning with Non-IID data. In Section III, we first present the standard swarm learning as preliminaries, and then describe the framework of our method in detail. The results of parameter studies and comparisons are presented in Section IV. Finally, Section V concludes the paper.

II. RELATED WORKS

A. Data Privacy and Distributed Learning

The concept of distributed machine learning was first introduced by [3]. They proposed the first federated learning algorithm, i.e., FedAvg, that allows multiple clients to collaboratively train a deep learning model, while keeping the data stored locally. Although FedAvg meets the basic requirements for data privacy protection, it needs to send user-sensitive data (i.e., model parameters or gradients) to a central server. Since private information could be recovered from the model gradients or parameters during training, the previous works [16], [17] have proposed secured methods for parameter aggregation. In medical image analysis, Li et al. [17] proposed to add random noise to shared weights to avoid the leaking of private information. Most recently, swarm learning [4] was proposed to ensure the security and fairness by sending the parameters peer-to-peer through an encrypted and decentralized manner of blockchain. Compared with other privacy protection methods, it is effective and efficient as it only requires minimal modification to the basic FedAvg method. However, swarm learning also suffers from the same problem when learning from Non-IID data as FedAvg does, which we mainly aim to tackle in this work.

B. Distributed Learning With Non-IID Data

Since the data distribution is different in each center, a common idea is to build a personalized model from the shared global model based on local distributions. A straightforward method is to fine-tune the global model on local data [18]. However, such naive fine-tuning could result in overfitting on local training data [9]. To avoid overfitting, some methods [19], [20] proposed to build a personalized model, which
was trained with local data and applied distillation strategies. Other methods [15], [21] proposed to keep distribution related layers in local centers during model aggregation, instead of building an entire personalized model. Although the methods based on personalization have achieved excellent performance in Non-IID scenarios, it is arduous to yield a robust global model. Hence, the generalization ability of the resulting models to unseen data from other centers could still be limited.

Another strategy to handle this problem is to add regularization terms during local training to avoid divergence of local models. For example, Li et al. [8] first proposed to use an L2-norm regularization to constrain parameters from deviating too far from the global model. To design a more delicate term, Shoham et al. [14] proposed to adapt a solution tackling catastrophic forgetting by estimating the importance matrices for all parameters.

C. Overcoming Forgetting

The concept of overcoming forgetting was first introduced in continual learning by [22], which is aimed to maintain the performance of a model for old tasks when the model is being trained for new tasks [23]. It was then applied in distributed learning to handle the Non-IID issue by [14]. Here, we mainly focus on the regularization-based approaches, which could be easily adapted to the tasks of distributed learning. Inspired by the work of [22], several works have been proposed to restrict the alteration of parameters that are important to previous tasks [24], [25]. However, the estimation of importance matrices could be inaccurate, thus undermining the effectiveness of the techniques. Alternatively, there are methods alleviating the forgetting through feature distillation. Douillard et al. [26] proposed to use spatially pooled features for the preservation of old knowledge. Similarly, Arthur et al. [27] applied this idea in continual segmentation tasks. In feature distillation methods, the strategy for feature extraction is essential to the model performance. Based on this idea, we propose to employ distribution-related features to deal with the Non-IID issues.

III. METHOD

A. Preliminaries

Let $P_k(\mathcal{X}, \mathcal{Y})$ be the joint distribution of image variable ($\mathcal{X}$) and label variable ($\mathcal{Y}$) in center $k$, where $k = 1 \ldots K$; let $P(\mathcal{X}, \mathcal{Y})$ be the joint distribution for all data. The goal of distributed learning is to train a model, denoted as $f_\theta : \mathcal{X} \to \mathcal{Y}$, with parameter set $\theta$, by minimizing the risk on data from the true distribution without data exchange between centers. In this work, we solely consider a degraded objective that minimizes summation of empirical risks on each local distribution, i.e.,

$$\hat{\theta} = \arg\min_{\theta} \mathbb{E}_{(x,y)\sim P(\mathcal{X},\mathcal{Y})}\left[\mathcal{L}(f_\theta(x), y)\right] \tag{1}$$

$$\approx \arg\min_{\theta} \sum_{k=1}^{K} \mathbb{E}_{(x,y)\sim P_k(\mathcal{X},\mathcal{Y})}\left[\mathcal{L}(f_\theta(x), y)\right], \tag{2}$$

where $\mathcal{L}(\cdot)$ is the loss function measuring the dissimilarity between prediction and ground truth. Table I summarizes the key symbols.

TABLE I
REFERENCES FOR THE MATHEMATICAL SYMBOLS

Swarm learning involves local update and communication between centers [4]. At each round of communication (denoted with superscript $t$), every center receives the model parameters from the other $K - 1$ centers and then performs parameter aggregation,

$$\theta^{[t]} = \sum_{k=1}^{K} \frac{n_k}{N}\, \theta_k^{[t]}, \tag{3}$$

where $n_k$ is the number of training samples from center $k$ and $N = \sum_{k=1}^{K} n_k$. Each local model is initialized by the aggregated parameters, and then updated via local training until the next communication round. The objective of local training for center $k$ is formulated as follows,

$$\min_{\theta_k} \mathbb{E}_{(x,y)\sim P_k(\mathcal{X},\mathcal{Y})}\left[\mathcal{L}(f_{\theta_k}(x), y)\right], \tag{4}$$

which leads to local optimal $\theta_k^{[t]}$. This process is repeated until the aggregated model converges. This framework is similar to the most popular federated averaging algorithm (FedAvg) [3], except that the aggregated model is decentralized, namely there is no central server, and the communication is performed through blockchain-based peer-to-peer networking [28] between all clients. However, the underlying assumption of vanilla swarm learning is that local data from each center should be in the same distribution, i.e., $P_k(\mathcal{X}, \mathcal{Y}) = P(\mathcal{X}, \mathcal{Y})$. In reality, this assumption could be false, especially in the scenario of multi-center medical image analysis, where data are Non-IID.

B. Non-IID Swarm Learning

In this work, we consider a Non-IID scenario where both feature skew and label skew can exist. Specifically, we focus on the covariate shift for the problem of feature skew (different $P(\mathcal{X})$) and the concept shift for label skew (different $P(\mathcal{Y}|\mathcal{X})$). The former can be caused by different protocols of image acquisition, and the latter is common in medical image segmentation, as different centers could have different
Fig. 2. The overall framework of the proposed method: the paradigm of swarm learning (left) and the proposed method for local training (right).
In local training, the proposed method adopts LaSA loss and FeSA regularization to tackle the problems from label and feature skew, respectively.
As shown in the procedure E, LaSA is computed with unbiased probability and unbiased partial label. The unbiased probability is calculated by
summing up the probabilities for unlabeled classes and the unbiased partial label is derived similarly by summing up values for unlabeled classes
in the weak label form of partial label. RFeSA is computed by matching the sample mean and variance from local model with the running mean and
variance parameters in batch normalization layers of the global model.
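As a rough illustration of the label-side computation described in the caption, the following NumPy sketch merges the entries of all unlabeled classes into a single entry, for both the predicted probabilities and the weak-form partial labels, and then evaluates a cross-entropy on the merged vectors. The function names (`merge_unlabeled`, `lasa_ce`), the pixels-by-classes array layout, the toy numbers and the small `eps` constant are assumptions made for this example; it is a sketch of the idea, not the authors' released implementation.

```python
import numpy as np

def merge_unlabeled(v, labeled):
    """Keep entries of labeled classes and sum the remaining (unlabeled) ones.

    v: array of shape (num_pixels, num_classes), probabilities or weak labels.
    labeled: list of class indices annotated at this center.
    Returns an array of shape (num_pixels, len(labeled) + 1); the last column
    holds the summed mass of all unlabeled classes.
    """
    v = np.asarray(v, dtype=float)
    unlabeled = [c for c in range(v.shape[1]) if c not in labeled]
    merged = v[:, unlabeled].sum(axis=1, keepdims=True)
    return np.concatenate([v[:, labeled], merged], axis=1)

def lasa_ce(q, y_weak, labeled, eps=1e-12):
    """Cross-entropy between merged labels and merged predictions, summed over pixels."""
    q_hat = merge_unlabeled(q, labeled)
    y_hat = merge_unlabeled(y_weak, labeled)
    return float(-(y_hat * np.log(q_hat + eps)).sum())

# Toy case (hypothetical numbers): 4 classes, only class 1 is annotated here.
q = np.array([[0.1, 0.2, 0.6, 0.1],        # pixel predicted mostly as class 2
              [0.1, 0.7, 0.1, 0.1]])       # pixel predicted mostly as class 1
y_weak = np.array([[1/3, 0.0, 1/3, 1/3],   # unlabeled pixel: uniform over unlabeled classes
                   [0.0, 1.0, 0.0, 0.0]])  # labeled pixel: one-hot
loss = lasa_ce(q, y_weak, labeled=[1])
```

In this toy case the unlabeled pixel is penalized only through the total probability assigned to the unlabeled classes, so the way that mass is split among them (here mostly on class 2) is left untouched by the loss.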
structures of interest to segment, i.e., partial labeling. Without confusion, we denote the partial label from center $k$ as $y_k$.

Both label skew and feature skew can diverge a standard local training from the global updates, leading to degraded performance of the aggregated model [8]. To tackle this, we propose a novel decentralized learning framework. As shown in Fig. 2, the left subfigure presents the paradigm of standard swarm learning, as described in Section III-A. Each center sends the local model to a decentralized communication network and receives models from other centers to derive the global model. Our contribution is focused on the local training process which is illustrated in the right part of Fig. 2. Our method uses two new functions, i.e., a label skew-awared (LaSA) loss and a feature skew-awared (FeSA) regularization term. LaSA loss adopts the unbiased probability derived from the output of networks to formulate the segmentation loss for label skew scenario. FeSA regularization handles the feature skew problem by matching the statistic of intermediate features in local clients with the global model. Formally, the objective for the local training with partially labeled data $(x_k, y_k)$ is given by,

$$\mathcal{L}(f_{\theta_k}(x_k), y_k) = \mathcal{L}_{LaSA}(f_{\theta_k}(x_k), y_k) + \lambda \mathcal{R}_{FeSA}(\theta_k; \theta^{[t]}, x_k), \tag{5}$$

where $\lambda$ is the balancing parameter. The pseudo code for the whole pipeline is shown in Algorithm 1.

The proposed strategies deal with the label skew and feature skew challenges in distributed learning by avoiding the divergence of local updates during local training. This divergence resembles the consequence of catastrophic forgetting in continual learning [23]. In local training, each center fine-tunes the aggregated model using local data without an access to the data from other centers. Hence, the local model could forget the knowledge learned from the segmentation tasks in other centers, resulting in different updating directions.

Therefore, we first develop the LaSA loss to preserve the global knowledge of full label (global distribution) when training the model locally. Then, we design the FeSA regularization to distill the knowledge of feature distributions from the global model. In the following, we elaborate on the details of the two contributions in Section III-C and Section III-D, respectively.

C. Label Skew-Awared Loss $\mathcal{L}_{LaSA}$

For pixel $i$, let $q^i$, $y_k^i$ and $y^i$ respectively denote the segmentation prediction vector, the gold standard label vector of partial annotations from center $k$ and the ground truth label which is a one-hot vector. Elements of these three vectors can be accessed using index $c$, which is also the index of the class set. Let $\mathcal{C}$ be the set of all classes in the segmentation task. In label skew scenario, $\mathcal{C}_k$ represents the set of annotated classes in center $k$, and $\bar{\mathcal{C}}_k = \mathcal{C} \setminus \mathcal{C}_k$ denotes the set of unlabeled classes.

For partial label scenario, we use the weak label form [10] as Fig. 1 illustrates. Formally, for a pixel $i$ of an image $x$ which is partially annotated with label class set $\mathcal{C}_k$ in center $k$, if pixel $i$ is annotated, we have $y_k^i = y^i$; otherwise, the elements of vector $y_k^i$ are given by,

$$y_{k,c}^i = \begin{cases} 0, & \text{for } c \in \mathcal{C}_k \\ \dfrac{1}{|\bar{\mathcal{C}}_k|}, & \text{for } c \in \bar{\mathcal{C}}_k \end{cases}.$$

For the label skew issue, we propose a new loss to preserve the label knowledge for unlabeled classes, i.e., LaSA loss. The
Fig. 3. Illustration for the gradients of (a) CE, (b) PCE and (c) LaSA loss on an unlabeled pixel in one center with regard to the predicted logits. In this center, the labeled class is $c_2$. As shown by the blue boxes, the class of this unlabeled pixel is predicted to be $c_3$ by the global model. The dashed line illustrates the resulting probability distributions by different losses. Figure (a) indicates that by using CE, the gradient computed for the most probable class is the least, resulting in more evenly distributed probabilities. Figure (b) shows that PCE does not compute gradients for unlabeled classes, which cannot prevent the knowledge forgetting. In figure (c), the proposed LaSA not only derives a negative gradient for $c_2$, using the knowledge from local training data, but also generates the largest positive gradient for the most probable class ($c_3$) predicted from the global model. This consolidates the global knowledge.

Algorithm 1 Non-IID Swarm Learning

Notations: Initialize model parameters: $\theta_k^{[0]}$ for each center $k$; total optimization epochs: $T$; running mean and variance in the $l$th layer of global model: $\mu_l, \sigma_l^2$; sample mean and variance from the $l$th layer of local model: $\tilde{\mu}_{k,l}, \tilde{\sigma}_{k,l}^2$.

for each epoch $t = 0, 1, 2, \ldots, T - 1$ do
    for each center $k = 1, 2, \ldots, K$ do
        $q_k, \tilde{\mu}_k, \tilde{\sigma}_k^2 = f_{\theta_k}(x_k)$
        compute $\hat{q}_k, \hat{y}_k$ using Eq. (7), (8)
        $\mathcal{L}_{LaSA} = \sum_{i\in\Omega} \mathcal{L}_{seg}(\hat{q}_k^i, \hat{y}_k^i)$ (Please refer to Eq. (6))
        $\mathcal{R}_{FeSA} = \sum_l \left( \|\mu_l - \tilde{\mu}_{k,l}\|_2 + \|\sigma_l^2 - \tilde{\sigma}_{k,l}^2\|_2 \right)$ (Please refer to Eq. (15))
        $\mathcal{L} = \mathcal{L}_{LaSA} + \lambda \mathcal{R}_{FeSA}$ (Please refer to Eq. (5))
        $\theta_k^{[t+1]} \leftarrow \mathrm{Optimizer}(\mathcal{L}, \theta_k^{[t]})$
    end for
    $\theta^{[t+1]} \leftarrow \sum_{k=1}^{K} \frac{n_k}{N}\, \theta_k^{[t+1]}$
    for each client $k$ do
        $\theta_k^{[t+1]} \leftarrow \theta^{[t+1]}$
    end for
end for
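Two steps of Algorithm 1 lend themselves to a short NumPy sketch: the FeSA term, which compares the global running BN statistics with the local sample statistics layer by layer, and the sample-size-weighted parameter aggregation. The helper names (`fesa_regularizer`, `aggregate`), the dictionary-of-arrays parameter layout and the toy numbers are assumptions for illustration only; the peer-to-peer communication and the optimizer step are deliberately left out.

```python
import numpy as np

def fesa_regularizer(global_stats, local_stats):
    """Sum over layers of ||mu_l - mu~_kl||_2 + ||var_l - var~_kl||_2.

    Both arguments are lists with one (mean, variance) pair per BN layer,
    each a 1-D array with one entry per channel.
    """
    total = 0.0
    for (mu_g, var_g), (mu_k, var_k) in zip(global_stats, local_stats):
        total += np.linalg.norm(mu_g - mu_k) + np.linalg.norm(var_g - var_k)
    return float(total)

def aggregate(local_params, num_samples):
    """Weighted average of local parameters with weights n_k / N."""
    n = np.asarray(num_samples, dtype=float)
    weights = n / n.sum()
    return {
        name: sum(w * p[name] for w, p in zip(weights, local_params))
        for name in local_params[0]
    }

# Toy example (hypothetical values): one BN layer with two channels, two centers.
global_stats = [(np.array([0.0, 0.0]), np.array([1.0, 1.0]))]
local_stats = [(np.array([3.0, 4.0]), np.array([1.0, 1.0]))]
reg = fesa_regularizer(global_stats, local_stats)  # mean term 5.0, variance term 0.0

params_a = {"w": np.array([1.0, 2.0])}
params_b = {"w": np.array([3.0, 6.0])}
theta = aggregate([params_a, params_b], num_samples=[30, 10])
```

With 30 and 10 samples the weights are 0.75 and 0.25, so the center with more data dominates the aggregated parameters; in a real training loop the regularizer would be computed on framework tensors so that gradients can flow back to the local model.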
LaSA loss is formulated with unbiased probability, as follows,

$$\mathcal{L}_{LaSA}(f_{\theta_k}(x_k), y_k) = \sum_{i\in\Omega} \mathcal{L}_{seg}(\hat{q}_k^i, \hat{y}_k^i), \tag{6}$$

where $\Omega$ is the set of pixels in the image, $\hat{q}_k^i$ and $\hat{y}_k^i$ are the unbiased prediction probability and unbiased partial label for pixel $i$, respectively, defined as,

$$\hat{q}_{k,c}^i = \begin{cases} q_c^i, & \text{if } c \in \mathcal{C}_k \\ \sum_{j\in\bar{\mathcal{C}}_k} q_j^i, & \text{if } c \in \bar{\mathcal{C}}_k \end{cases}, \tag{7}$$

and

$$\hat{y}_{k,c}^i = \begin{cases} y_{k,c}^i, & \text{if } c \in \mathcal{C}_k \\ \sum_{j\in\bar{\mathcal{C}}_k} y_{k,j}^i, & \text{if } c \in \bar{\mathcal{C}}_k \end{cases}. \tag{8}$$

Here, we further introduce two symbols for convenience of notation, i.e., $\hat{q}_{k,\bar{\mathcal{C}}}^i = \hat{q}_{k,c}^i, \forall c \in \bar{\mathcal{C}}_k$ and $\hat{y}_{k,\bar{\mathcal{C}}}^i = \hat{y}_{k,c}^i, \forall c \in \bar{\mathcal{C}}_k$, for the derivation of losses over unlabeled classes.

In Eq. (6), $\mathcal{L}_{seg}(\cdot)$ could be adapted from the loss functions for fully-supervised segmentation. Here, we provide the implementation with cross entropy (CE) loss [29], focal (FC) loss [30] and Dice (DC) loss [31], as follows,

$$\mathcal{L}_{LaSA[CE]} = -\sum_{i\in\Omega} \left[ \sum_{c\in\mathcal{C}_k} \hat{y}_{k,c}^i \log(\hat{q}_{k,c}^i) + \hat{y}_{k,\bar{\mathcal{C}}}^i \log(\hat{q}_{k,\bar{\mathcal{C}}}^i) \right], \tag{9}$$

$$\mathcal{L}_{LaSA[FC]} = -\sum_{i\in\Omega} \left[ \sum_{c\in\mathcal{C}_k} \alpha_c\, \hat{y}_{k,c}^i (1 - \hat{q}_{k,c}^i)^{\gamma} \log(\hat{q}_{k,c}^i) + \hat{y}_{k,\bar{\mathcal{C}}}^i (1 - \hat{q}_{k,\bar{\mathcal{C}}}^i)^{\gamma} \log(\hat{q}_{k,\bar{\mathcal{C}}}^i) \right], \tag{10}$$

$$\mathcal{L}_{LaSA[DC]} = 1 - \sum_{c\in\mathcal{C}_k} \frac{2\sum_{i\in\Omega} \hat{y}_{k,c}^i\, \hat{q}_{k,c}^i}{\sum_{i\in\Omega} \hat{y}_{k,c}^i + \sum_{i\in\Omega} \hat{q}_{k,c}^i} - \frac{2\sum_{i\in\Omega} \hat{y}_{k,\bar{\mathcal{C}}}^i\, \hat{q}_{k,\bar{\mathcal{C}}}^i}{\sum_{i\in\Omega} \hat{y}_{k,\bar{\mathcal{C}}}^i + \sum_{i\in\Omega} \hat{q}_{k,\bar{\mathcal{C}}}^i}, \tag{11}$$

where $\alpha_c$ and $\gamma$ are hyperparameters in focal loss, which are introduced to control the weights of losses over each class and pixel. One can see that the formulation in Eq. (9) is similar to the marginal loss in [32]. We will show that the label skew-awared loss in our framework could consolidate label knowledge.

In the following, we provide the theoretical explanation of the proposed LaSA loss by analyzing the problem of fully-supervised segmentation losses in label skew scenario. Without loss of generality, here we take CE loss as an example. For segmentation task in local client $k$, the overall CE loss is the aggregation of CE losses over each pixel, which is given by,

$$\mathcal{L}_{CE}^i(f_{\theta_k}(x_k), y_k^i) = -\left[ \sum_{c\in\mathcal{C}_k} y_{k,c}^i \log(q_c^i) + \sum_{c\in\bar{\mathcal{C}}_k} y_{k,c}^i \log(q_c^i) \right]. \tag{12}$$

One can see that for an unlabeled pixel $i$, the CE loss maximizes the output probabilities $q_c^i$ for all unlabeled classes $c \in \bar{\mathcal{C}}_k$. However, since $\log(\cdot)$ is a concave function, the gradient computed for the most probable class is the least. As shown in Fig. 3, $\mathcal{L}_{CE}$ could result in more evenly distributed probabilities for unlabeled classes. Optimization of this loss leads to increase of the uncertainty over unlabeled pixels, which expedites the forgetting of label knowledge for unlabeled structures.

An alternative solution is to apply partial CE that only computes losses over labeled classes [33],

$$\mathcal{L}_{PCE}^i(f_{\theta_k}(x_k), y_k) = -\sum_{c\in\mathcal{C}_k} y_{k,c}^i \log(q_c^i). \tag{13}$$

In this case, no loss is computed over unlabeled pixels. However the prediction for unlabeled pixels is affected by the losses
derived for labeled pixels. Although it does not explicitly degrade the performance of a model over the unlabeled pixels, it cannot alleviate the catastrophic forgetting during local training.

Compared with the above losses, our proposed LaSA loss, adopting the unbiased probability, could derive a proper estimation for losses over unlabeled pixels and consolidate the label knowledge from the global model. In Eq. (9), the loss for an unlabeled pixel $i$ is derived by the cross entropy between the unbiased prediction probability $\hat{q}_{k,\bar{\mathcal{C}}}^i$ and unbiased partial label $\hat{y}_{k,\bar{\mathcal{C}}}^i$. In this formulation, the largest gradient is obtained for the most probable unlabeled class among classes in $\bar{\mathcal{C}}_k$. The derivative of Eq. (9) with respect to the model output could be derived by,

$$\frac{\partial \mathcal{L}_{LaSA[CE]}^i(f_{\theta_k}(x_k), y_k)}{\partial f_c^i} = -q_c^i \left( \frac{1}{\hat{q}_{k,\bar{\mathcal{C}}}^i} - 1 \right) \propto -q_c^i. \tag{14}$$

From this formula, one can see that LaSA tends to maximize the prediction for the most probable class predicted by the global model during local training. In other words, the loss is able to memorize the prediction of global model on unlabeled pixels during local training, thus consolidating the global knowledge of full labeling. Fig. 3 illustrates and compares the gradients computed by LaSA and other loss functions.

D. Feature Skew-Awared Regularization $\mathcal{R}_{FeSA}$

For feature skew, we propose to introduce an extra regularization term using the feature statistics before the batch normalization (BN) module [34], i.e., the FeSA regularization term,

$$\mathcal{R}_{FeSA}(\theta_k; \theta^{[t]}, x_k) = \sum_l \left( \|\mu_l - \tilde{\mu}_{k,l}\|_2 + \|\sigma_l^2 - \tilde{\sigma}_{k,l}^2\|_2 \right), \tag{15}$$

where $\mu_l$ and $\sigma_l^2$ are the running mean and variance of parameters in BN module for layer $l$ in the global model; $\tilde{\mu}_{k,l} \in \mathbb{R}^{C^l}$ and $\tilde{\sigma}_{k,l}^2 \in \mathbb{R}^{C^l}$ are the sample mean and variance in the local model. The subscript $k$ indicates the variables are associated with local client $k$ and $C^l$ denotes the number of channels in layer $l$. They are calculated along the batch and spatial dimensions of the features from layer $l$, given a batch of input images sampled from local distribution $P_k(\mathcal{X})$.

In Eq. (15), the running mean and variance ($\mu_l$ and $\sigma_l^2$) are calculated through aggregation of the corresponding parameters in the local model of last communication epoch. The computation of the running mean and variance parameters in local models is the same as in [34], namely they are updated by,

$$\mu_{k,l}^{[m]} = M \mu_{k,l}^{[m-1]} + (1 - M)\, \tilde{\mu}_{k,l} \tag{16}$$

$$(\sigma^2)_{k,l}^{[m]} = M (\sigma^2)_{k,l}^{[m-1]} + (1 - M)\, \tilde{\sigma}_{k,l}^2 \tag{17}$$

where $m$ represents the index of iteration and $M$ is the momentum between (0, 1). It is set empirically according to the number of batches in one local training epoch.

The intuition behind the method is that the feature statistics computed in BN layers contain the traits of image distribution $P(\mathcal{X})$ [35]. The feature statistics learned for local distribution $P_k(\mathcal{X})$ could diverge from those of the global model. To preserve the knowledge of global statistics of layer $l$ (i.e. $E(X_l)$ and $D(X_l)$), we propose to force the sample statistics calculated during local training to be close to the running statistics in global model using L2 distance. Here, we estimate the true global feature statistics by,

$$E(X_l) \overset{1}{\approx} \sum_{k=1}^{K} \frac{n_k}{N} E_k(X_l) \overset{2}{\approx} \sum_{k=1}^{K} \frac{n_k}{N} \mu_{k,l} = \mu_l, \tag{18}$$

$$D(X_l) \overset{1}{\approx} \sum_{k=1}^{K} \frac{n_k}{N} D_k(X_l) \overset{2}{\approx} \sum_{k=1}^{K} \frac{n_k}{N} \sigma_{k,l}^2 = \sigma_l^2, \tag{19}$$

where $E_k(X_l)$ and $D_k(X_l)$ are the mean and variance of features from layer $l$ of the local model in center $k$. The second approximations in both formulas are based on the assumption that the running statistics are accurate estimations of the true statistics. In Eq. (18), the weighted sum of local means is an unbiased estimation for the mean of global feature distribution which supports the first approximation. For variance estimation, the first approximation in Eq. (19) is based on the assumption that the local mean $\mu_{k,l}$ is similar for each center. This assumption is reasonable because we attempt to align the local means with the global mean during training.

IV. EXPERIMENTS

We validated the applicability of the proposed method using four datasets, i.e., the M&Ms dataset [36], the FeTS dataset [37], the MSProsMRI [11] dataset and the MMWHS [38] dataset. We first demonstrated the effectiveness of the LaSA loss for label skew and FeSA regularization for feature skew in Section IV-C. We then compared the framework with other state-of-the-art (SOTA) methods in three situations in Section IV-D, including the scenario having both label skew and feature skew issues, the scenario with solely label skew and the scenario with solely feature skew. In the experiments, we performed Wilcoxon test to report the significance of the difference between two approaches.

A. Datasets

1) M&Ms Dataset: is composed of 320 short-axis Cardiac Magnetic Resonance (CMR) cases from 4 different scanner vendors, including 95 cases from Siemens (Vendor A), 125 cases from Philips (Vendor B), 50 cases from General Electric (Vendor C) and 50 cases from Canon (Vendor D). It provides annotations for 3 cardiac structures, including left ventricle (LV), right ventricle (RV) and the left ventricular myocardium (Myo). We resampled all data with the in-plane resolution of 1.25×1.25 mm and cropped them into 192×192 region of interest (ROI). We took each vendor as a local client such that the image distribution in each client is different. To simulate the label skew, during training, only the annotations of LV and RV are available for the data in Vendor A, LV and Myo for Vendor B, all cardiac structures as one foreground (illustrated in Fig. 1) for Vendor C and RV for
Vendor D. We randomly split the data of each client by the ratio of 50%:15%:35% for training, validation and test. The dataset information is summarized in Table III.

TABLE II
ABLATION STUDY FOR THE PROPOSED METHOD. RESULTS ARE EVALUATED IN DICE SCORE

Fig. 4. Visualization of segmentation results on M&Ms dataset. Arrows highlight the areas which are referred to in the text.

TABLE III
SUMMARY OF M&Ms DATASET. ONLY PART OF LABELS WERE USED FOR TRAINING IN EACH CENTER. LV: LEFT VENTRICLE, RV: RIGHT VENTRICLE; MYO: MYOCARDIUM; BG: BACKGROUND

2) FeTS Dataset: includes multi-institutional multi-parametric Magnetic Resonance Imaging (mpMRI) scans of brain tumor from Federated Tumor Segmentation (FeTS) challenge 2022 [37]. It was built upon the dataset from RSNA-ASNR-MICCAI BraTS 2021 challenge [39] with their real-world partitioning and the collaborative of independent institutions in a real-world federation [40]. The training data consists of 1251 cases and each case has four 240 × 240 × 155 structural MRI images including native (T1), post-contrast T1-weighted (T1Gd), T2-weighted (T2), and T2 FLuid Attenuated Inversion Recovery (FLAIR) volumes. The pre-processing steps for all images include z-score normalization, rigid registration and resolution resampling to 1 mm³. The provided annotation includes the enhancing tumor (ET), necrotic tumor core (NCR) and peritumoral edematous and infiltrated tissue (ED). To simulate the label skew scenario, we removed the annotations for one or two structures in the training data of each center. Since quantity skew is not the main focus of this work, we merged the data from the centers listed in the partitioning-2 file from the FeTS challenge. The information of the final partitioned dataset is summarized in Table IV. In the experiment, the dataset was split by the ratio of 60%:15%:25% for training, validation and test in each center.

3) MMWHS Dataset: has 60 cardiac CT images from MICCAI'17 Multi-Modality Whole Heart Segmentation challenge. The annotations for the whole heart substructures include the left ventricle (LV), right ventricle (RV), left atrium (LA), right atrium (RA), myocardium (Myo), ascending aorta (AO), and the pulmonary artery (PA). We resampled the images into 2×2×2 mm and cropped them with ROI of 96×96×64. To simulate the scenario with label skew, we first randomly and equally divided them into 4 parts as the datasets for each center. We performed 4-fold cross validation on this dataset in our experiment. In each center, we then only used the annotations of 2 or 3 substructures during training. Specifically, we have Myo, LV and LA for Center A; RA and PA for Center B; RV, RA and AO for Center C; RV, AO and PA for Center D. The dataset information is summarized in Table V.

The MSProsMRI dataset includes 30 cases from RUNMC [41] (Center A), 30 cases from BMC [41] (Center B), 19 cases from HCRUDB [42] (Center C), 13 cases from UCL [43] (Center D), 12 cases from BIDMC [43] (Center E), and 12 from HK [43] (Center F). We resampled all these data with the axial plane resolution of 0.625×0.625 mm and cropped them into ROI of 384×384. We performed 4-fold
cross validation on this dataset in our experiments. Since the annotations only include one structure, we assume no label skew in this setting. As described in the work of [11], the feature distribution of the dataset varies across these centers, causing significant feature skew across centers.

TABLE IV
Summary of FeTS Dataset. The Dataset Was Manually Split Into Nine Centers. Org Center Denotes Center Index of the Original Partition Provided by the FeTS Challenge. Only Part of Labels Were Used for Training in Each Center. ET: Enhancing Tumor; NCR: Necrotic Tumor Core; ED: Peritumoral Edematous and Infiltrated Tissue; BG: Background

TABLE V
Summary of MMWHS Dataset. The Dataset Was Manually Split in Four Centers. Only Part of Labels Were Used for Training in Each Center. LV: Left Ventricle; RV: Right Ventricle; LA: Left Atrium; RA: Right Atrium; Myo: Myocardium of LV; AO: Ascending Aorta; PA: Pulmonary Artery

TABLE VI
Layer Choice for BN Statistic Matching in FeSA Regularization. Critical Layers Refer to the Downsample and (or) Upsample Convolution Layers

Fig. 5. Plot of the knowledge forgetting in Dice score during epoch 200 in Center D. The vertical axis represents the average Dice score for LV and Myo achieved by local model in center D on the validation data from all other centers.

B. Implementation Details

To avoid the effect of quantity skew, we randomly sampled an equal number of images in each center with random augmentations including rotation, flipping and elastic deformation during one local training epoch. All images were normalized using z-score normalization before data augmentation. For M&Ms and MSProsMRI datasets, we adopted the 2D UNet [44] as backbone, and trained the networks using the Adam optimizer with an initial learning rate of 1e-3, batch size of 128 and 15 iterations per epoch. For both FeTS and MMWHS datasets, we referred to nnUnet [45] and designed a 3D segmentation network. The initial learning rate was set to 2e-4. Random crop was performed during training. The batch size, number of cases sampled during one epoch and patch size were 2, 120, 128×128×128 for FeTS datasets and 6, 30, 96×96×64 for MMWHS datasets. For all experiments, we empirically set the λ in Eq. (5) to be 0.1, the momentum M in Eq. (17) to be 0.9 and trained the networks for 500 epochs. The label skew-awared loss was implemented with cross entropy. The framework was implemented using PyTorch and ran on one Nvidia RTX 3090 GPU.

C. Parameter Study

1) Effectiveness of LaSA Loss and FeSA Regularization: We studied the effectiveness of LaSA loss and FeSA regularization using M&Ms dataset. Swarm+FeSA method was implemented by adding the FeSA regularization to the partial cross entropy (PCE) loss (Eq. (13)) in the local training of swarm learning. Swarm+LaSA method replaced the PCE loss with LaSA loss but without regularization. According to Table II, by applying the regularization term alone, the improvement over the swarm learning method was marginal. However, when replacing the naive loss with LaSA loss, the average Dice score was improved by 7.3%. This is probably due to the fact that label skew could trigger the forgetting of label knowledge in the local model, which is more likely to diverge the local training from the global update than feature skew. One can also observe from Table II that our proposed method, employing both the regularization term and LaSA loss, obtained a statistically significant improvement (Wilcoxon test p-value < 0.05) of 2.3% Dice score, compared to the method with solely LaSA loss, i.e., Swarm+LaSA. This is reasonable because once the label skew is tackled, the feature skew would become the main issue that affects the divergence between the local training and the global update.

The qualitative segmentation examples are shown in Fig. 4 for visual comparisons. One can observe that combining the two terms could produce more accurate results, and the improvement is particularly evident in the challenging structure, i.e., RV.

To further validate that the proposed method could alleviate the catastrophic forgetting during local training, we visualized the global knowledge preserved by the local model during training. Here, we illustrated the knowledge preservation in Center D. The global knowledge was measured by the average Dice score of unlabeled structures (i.e., LV and Myo) achieved by the local model on validation data from Center A, B and C. According to Fig. 5, adopting the proposed training strategy, the performance hardly declined on data from other centers,
while the performance of the swarm learning method showed dramatic drop. It means that the proposed regularization and LaSA loss could effectively consolidate old knowledge, thus keeping the local training consistent with global updates.

TABLE VII
Comparison Study on M&Ms Datasets. Results Are Evaluated in Dice Score. Bold Text Denotes the Best Results Achieved by Distributed Learning Methods With Partial Label

TABLE VIII
Comparisons of Different Strategies for Label Skew on MMWHS Datasets. Results Are Evaluated by 4-Fold Cross Validation in Dice Score. Bold Text Denotes the Best Results Achieved by Distributed Learning Methods With Partial Label

Fig. 6. Effect of the communication rate on the performance of M&Ms datasets.

2) Choice of Layers for FeSA Regularization: Because using all the layers for BN statistic matching could restrict the plasticity of the model, we conducted experiments to select a better layer subset for the computation of the regularization term, i.e., $R_{FeSA}(\theta_k; \theta^{[t]}, x_k)$ in Eq. (15). Since the model we adopted was UNet, we studied whether the features in encoder layers or decoder layers play a more important role in the regularization. We first used all the layers in the encoder or decoder for regularization. According to Table VI, we found that there was no significant difference across these choices (Wilcoxon test p-values > 0.6 for each pair of choices). Hence, we further relaxed the constraint, and chose solely features from several critical layers (i.e., features after each downsample or upsample convolution layer) for BN statistic matching. The results in Table VI showed that using the features from upsample layers for regularization could obtain better results than using features from downsample layers.

3) Effect of Communication Rate: In swarm learning, communication rate refers to the number of epochs or iterations in local training between two consecutive model aggregation operations. Ideally, if the model aggregation is implemented every training iteration, then the distributed learning is equivalent to the centralized training method with partial label. Here we studied how communication rate could influence the performance of the aggregated model. We conducted the ablation study with model aggregation per 5, 15, 30, 50, 100, 200, 500 iterations. As shown in Fig. 6, the performance was robust with solely 0.5% discrepancy in Dice if the communication rate was set between 5 and 100 iterations. Obvious decrease was only observed for communication rates larger than 200 iterations.

D. Comparison Study

In this section, we performed a comprehensive comparison study with other SOTA methods. The methods chosen for comparisons should have the following three features: (1) distributed learning methods related to the idea of knowledge consolidation; (2) global-model-based methods, in which all centers share one model that is generalizable for data from unseen centers; (3) methods based on decentralized frameworks without server-side manipulation.

1) Application to Multi-Center and Multi-Vendor Cardiac Segmentation With Label and Feature Skews: To further validate the proposed method in the scenario with both label and feature skew, we compared it with other SOTA methods for cardiac segmentation on M&Ms dataset, including: (1) FedProx [8], which used L2 regularization during local training; (2) FedCurv [14], which used importance matrix to penalize the change of important parameters; (3) FML [19], which used personalization strategy; (4) Swarm, which was the standard swarm learning framework using partial cross-entropy loss; and (5) Sup&Cen, which trained the same network with all the data and full labels together in a fully-supervised and centralized manner.

Table VII reported the comparison results. With the proposed regularization term and LaSA loss, our method resulted in the best performance. Notably, compared with the Sup&Cen
method, the Dice score of our method was only 0.4% lower. It indicates that our method could effectively handle the label skew and feature skew challenge in the distributed learning framework. Other methods could only slightly improve the performance of the swarm learning method in this scenario. This is reasonable as these methods attempt to avoid the divergence of training between the local and global models either by restricting the magnitude of parameter changes or by forcing their predictions to be consistent. They do not explicitly deal with the knowledge forgetting of feature distribution and label annotation, particularly the latter one.

Fig. 7. Visualization of the segmentation results for MMWHS datasets. Arrows highlight the areas which are referred to in the text.

TABLE IX
Comparison Results (Dice Scores) on Prostate Datasets. Each Method Was Implemented With 4-Fold Cross Validation. Bold Text Denotes the Best Results Achieved by Distributed Learning Methods. 'Centralized' Denotes the Trivial Method Training With Centralized Data

TABLE X
Comparison Study on FeTS Datasets. Results Are Evaluated in Dice Score. Bold Text Denotes the Best Results Achieved by Distributed Learning Methods With Partial Label. ET: Enhancing Tumor; WT: Whole Tumor (Comprising ET, ED and NCR); TC: Tumor Core (Comprising ED and ET)

2) Application to Multi-Institutional Brain Tumor Segmentation With Label and Feature Skews: To demonstrate the generalizability to other applications, we also compared the proposed method with SOTA methods for brain tumor segmentation on FeTS datasets. The compared methods were similar to those in the comparison study on M&Ms datasets.

Following the evaluation metrics of the FeTS challenge, we reported the average Dice score on enhancing tumor (ET), whole tumor (WT), tumor core (TC) in Table X. Compared with other distributed learning methods, our method achieved the best performance in average Dice score of all evaluated regions. Moreover, our method performed comparably to the Sup&Cen method in the segmentation of WT and TC. This demonstrates that the proposed LaSA loss and FeSA regularization could effectively alleviate the issues caused by label and feature skew in FeTS datasets. In terms of the performance on ET, the average Dice score of our method was slightly lower than that of Sup&Cen. The reason could be that the segmentation of ET is difficult and there could exist non-negligible inter-observer variability of manual annotation across centers. Thus, consolidating the knowledge in other centers could be misleading.

3) Application to Whole Heart Segmentation With Label Skew Setting: To demonstrate the superiority of the proposed method in the Non-IID scenario where only label skew exists, we compared with another strategy [27] (i.e., Pseudo) for partial supervision learning that could be applied to the swarm learning framework. The Swarm+Pseudo method generates pseudo labels using the global model for the supervision of unlabeled structures.

As presented in Table VIII, the swarm learning with CE loss and PCE loss almost failed in this scenario. This is because these methods could not estimate an accurate loss for unlabeled structures, and thus failed to preserve the knowledge
of labels except 2 or 3 out of 7 structures which are provided with annotations in a local center. Both the Swarm+Pseudo method and our proposed Non-IID swarm learning achieved satisfactory performance. From Table VIII one can see that the results of Swarm+Pseudo are very close to those of Sup&Cen, which can be taken as an upper bound of federated learning methods. Despite the limited space for improvement, our method still obtained 1.3% higher average Dice score than Swarm+Pseudo. The reason could be that although both methods alleviate the forgetting of true label in local update by distilling knowledge from the global model, the pseudo label generated by the global model can be inaccurate. This inaccurate label could mislead the optimization of the local model.

TABLE XI
Study of the Effect of Quantity Skew for the Proposed Non-IID SL Method on M&Ms Dataset. The Index in the Column of 'Limited Center' Represents the Center Which Has Scarce Training Data

The visual results are presented in Fig. 7. We can observe that compared with the ground truth, the segmentation errors of our method mainly occurred in Myo and PA. The reasons are twofold. (1) Myo and PA are hard to segment, as the results of the Sup&Cen method are not accurate in them either. (2) The supervision for Myo is weak since only Center A could provide annotations for Myo. For difficult structures, more supervision is needed. It is also observed that the performance of a structure is not determined by whether the annotation is available locally. For example, Center B and Center D could not provide annotations for Myo during training. However, the results of Myo in these two centers are visually more accurate than that in Center A. The reason could be that our proposed method can effectively preserve the label knowledge from other centers during training and the performance difference is probably due to the quality of images in each center.

4) Application to Multi-Site Prostate MRI Segmentation With Feature Skew: To show the effectiveness of our method in the feature skew scenario, we applied the method to segmentation of the multi-site prostate T2-weighted MRI dataset, and compared with other SOTA methods that do not apply the personalization strategy, including (1) the standard swarm learning method, (2) FedProx [8] and (3) FedCurv [14].

Table IX presents the results of compared global methods. Compared with the Swarm method, the proposed Non-IID SL achieved significant improvements in Center C and E by 4.7% (p=0.005) and 5.8% (p=0.002), respectively. This indicates that by applying the feature statistic alignment in the local training, our method could learn robust feature representations and thus improve the performance. Moreover, Non-IID SL achieved the best performance in average Dice score among all distributed learning methods with all p-values less than 0.05 by the paired Wilcoxon tests. The reason could be that our method used image distribution related regularization during local training, while other methods applied parameter-based regularization which could restrict the plasticity of local models.

V. CONCLUSION

In this paper, we have presented a new framework for decentralized and privacy-protected distributed swarm learning. This framework employs the label skew-awared (LaSA) loss, which preserves the prediction for unlabeled structures by the global model during local training, to avoid the label skew issue. Moreover, we have proposed the feature skew-awared (FeSA) regularization to align feature distributions of different centers through batch normalization statistic matching, which alleviates the effect of feature skew. The results showed that the proposed skew-awared functions were effective in tackling the different distributions and could consolidate the knowledge learned from local clients. The applications to the three tasks of medical image segmentation have demonstrated that the proposed method could achieve comparable results to the fully-supervised and centralized methods.

APPENDIX

A. Study of Quantity Skew

Here we discuss the effect of quantity skew on the proposed method. Quantity skew, especially with highly unbalanced numbers of subjects across centers, could degrade the performance of federated learning methods, though in segmentation tasks this challenge may stem from the different sample sizes rather than from the Non-IID problem. However, as the proposed method adopts a knowledge consolidation strategy, the effect of quantity skew could be reduced significantly.

For demonstration, we implemented the proposed method on M&Ms dataset with unbalanced numbers across centers. Specifically, we reduced the numbers of training and validation data in one of the four centers to be only 5 and 3, respectively, and kept the others unchanged. The implementation details are the same as described in Section IV-B.

As shown in Table XI, compared with the results trained with the relatively balanced dataset, quantity skew did not lead to obvious performance decline. The reason could be twofold. First, our method involves knowledge consolidation strategies. Centers with few training samples could also acquire knowledge from the global model. Second, we performed data augmentations in each center and ensured equal local training iterations in one communication round such that the global model will not be biased toward centers with more samples.
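
The second point, equal local training iterations per communication round, can be sketched as drawing the same number of samples from every center regardless of its dataset size. The helper `sample_equal_per_center`, the center names, and the dataset sizes below are hypothetical and only illustrate the idea. Sampling with replacement is an assumption made here so that a center with very few cases (e.g., 5) can still fill a round; the text does not state how the sampling was implemented.

```python
import random
from typing import Dict, List, Optional


def sample_equal_per_center(
    center_data: Dict[str, List[str]],
    n_per_round: int,
    seed: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Draw n_per_round items from each center, with replacement."""
    rng = random.Random(seed)
    return {
        name: rng.choices(items, k=n_per_round)
        for name, items in center_data.items()
    }


# Hypothetical centers: one has only 5 training cases, the others have more.
centers = {
    "A": [f"A_case{i}" for i in range(5)],
    "B": [f"B_case{i}" for i in range(60)],
    "C": [f"C_case{i}" for i in range(40)],
}
batch_plan = sample_equal_per_center(centers, n_per_round=30, seed=0)
for name, picked in batch_plan.items():
    print(name, len(picked))  # every center contributes 30 samples
```

In practice each drawn case would then pass through the random augmentations before being fed to the local model, so repeated draws from a small center do not yield identical inputs.
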
REFERENCES

[1] L. Qu, N. Balachandar, and D. L. Rubin, “An experimental study of data heterogeneity in federated learning methods for medical imaging,” 2021, arXiv:2107.08371.
[2] N. Rieke et al., “The future of digital health with federated learning,” NPJ Digit. Med., vol. 3, no. 1, pp. 1–7, 2020.
[3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Artif. Intell. Statist., 2017, pp. 1273–1282.
[4] S. Warnat-Herresthal et al., “Swarm learning for decentralized and confidential clinical machine learning,” Nature, vol. 594, no. 7862, pp. 265–270, 2021.
[5] S. Pati et al., “Federated learning enables big data for rare cancer boundary detection,” 2022, arXiv:2204.10836.
[6] M. J. Sheller, G. A. Reina, B. Edwards, J. Martin, and S. Bakas, “Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, A. Crimi, S. Bakas, H. Kuijf, F. Keyvan, M. Reyes, and T. van Walsum, Eds. Cham, Switzerland: Springer, 2019, pp. 92–104.
[7] A. Guha Roy, S. Siddiqui, S. Pölsterl, N. Navab, and C. Wachinger, “BrainTorrent: A peer-to-peer environment for decentralized federated learning,” 2019, arXiv:1905.06731.
[8] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proc. Mach. Learn. Syst., vol. 2, 2020, pp. 429–450.
[9] P. Kairouz et al., “Advances and open problems in federated learning,” Found. Trends Mach. Learn., vol. 14, nos. 1–2, pp. 1–210, Jun. 2021.
[10] B. Wu, S. Lyu, and B. Ghanem, “ML-MG: Multi-label learning with missing labels using a mixed graph,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4157–4165.
[11] Q. Liu, Q. Dou, L. Yu, and P. A. Heng, “MS-Net: Multi-site network for improving prostate segmentation with heterogeneous MRI data,” IEEE Trans. Med. Imag., vol. 39, no. 9, pp. 2713–2724, Sep. 2020.
[12] K. Gilbert et al., “Independent left ventricular morphometric atlases show consistent relationships with cardiovascular risk factors: A UK biobank study,” Sci. Rep., vol. 9, no. 1, pp. 1–9, 2019.
[13] C. Petitjean et al., “Right ventricle segmentation from cardiac MRI: A collation study,” Med. Image Anal., vol. 19, no. 1, pp. 187–202, 2015.
[14] N. Shoham et al., “Overcoming forgetting in federated learning on non-IID data,” 2019, arXiv:1910.07796.
[15] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou, “FedBN: Federated learning on non-IID features via local batch normalization,” 2021, arXiv:2102.07623.
[16] K. Bonawitz et al., “Practical secure aggregation for privacy-preserving machine learning,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2017, pp. 1175–1191.
[17] X. Li, Y. Gu, N. Dvornek, L. H. Staib, P. Ventola, and J. S. Duncan, “Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results,” Med. Image Anal., vol. 65, Oct. 2020, Art. no. 101765.
[18] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 3557–3568.
[19] T. Shen et al., “Federated mutual learning,” 2020, arXiv:2006.16765.
[20] T. Li, S. Hu, A. Beirami, and V. Smith, “Ditto: Fair and robust federated learning through personalization,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 6357–6368.
[21] L. Collins, H. Hassani, A. Mokhtari, and S. Shakkottai, “Exploiting shared representations for personalized federated learning,” 2021, arXiv:2102.07078.
[22] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks,” Proc. Nat. Acad. Sci. USA, vol. 114, no. 13, pp. 3521–3526, Mar. 2017.
[23] M. Delange et al., “A continual learning survey: Defying forgetting in classification tasks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3366–3385, Jul. 2021.
[24] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 3987–3995.
[25] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 139–154.
[26] A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle, “PodNet: Pooled outputs distillation for small-tasks incremental learning,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 86–102.
[27] A. Douillard, Y. Chen, A. Dapogny, and M. Cord, “PLOP: Learning without forgetting for continual semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 4040–4050.
[28] Y. Lu, X. Huang, Y. Dai, S. Maharjan, and Y. Zhang, “Blockchain and federated learning for privacy-preserved data sharing in industrial IoT,” IEEE Trans. Ind. Informat., vol. 16, no. 6, pp. 4177–4186, Jun. 2020.
[29] I. Szita and A. Lörincz, “Learning tetris using the noisy cross-entropy method,” Neural Comput., vol. 18, no. 12, pp. 2936–2941, Dec. 2006.
[30] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[31] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical image segmentation,” in Proc. 4th Int. Conf. 3D Vis. (DV), Oct. 2016, pp. 565–571.
[32] G. Shi, L. Xiao, Y. Chen, and S. K. Zhou, “Marginal loss and exclusion loss for partially supervised multi-organ segmentation,” Med. Image Anal., vol. 70, May 2021, Art. no. 101979.
[33] F. Cermelli, M. Mancini, S. Rota Bulo, E. Ricci, and B. Caputo, “Modeling the background for incremental learning in semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9233–9242.
[34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 448–456.
[35] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu, “Adaptive batch normalization for practical domain adaptation,” Pattern Recognit., vol. 80, pp. 109–117, Aug. 2018.
[36] V. M. Campello et al., “Multi-centre, multi-vendor and multi-disease cardiac segmentation: The M&Ms challenge,” IEEE Trans. Med. Imag., vol. 40, no. 12, pp. 3543–3554, Dec. 2021.
[37] S. Pati et al., “The federated tumor segmentation (FeTS) challenge,” 2021, arXiv:2105.05874.
[38] X. Zhuang et al., “Evaluation of algorithms for multi-modality whole heart segmentation: An open-access grand challenge,” Med. Image Anal., vol. 58, Dec. 2019, Art. no. 101537.
[39] U. Baid et al., “The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification,” 2021, arXiv:2107.02314.
[40] G. A. Reina et al., “OpenFL: An open-source framework for federated learning,” 2021, arXiv:2105.06413.
[41] N. Bloch et al., “NCI-ISBI 2013 challenge: Automated segmentation of prostate structures,” Cancer Imag. Arch., vol. 370, p. 6, Oct. 2015.
[42] G. Lemaître, R. Martí, J. Freixenet, J. C. Vilanova, P. M. Walker, and F. Meriaudeau, “Computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: A review,” Comput. Biol. Med., vol. 60, pp. 8–31, May 2015.
[43] G. Litjens et al., “Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge,” Med. Image Anal., vol. 18, no. 2, pp. 359–373, 2014.
[44] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent, 2015, pp. 234–241.
[45] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation,” Nature Methods, vol. 18, no. 2, pp. 203–211, Dec. 2020.
