
Biomedical Signal Processing and Control 53 (2019) 101562


Adversarial learning for deformable registration of brain MR image using a multi-scale fully convolutional network

Luwen Duan a,b, Gang Yuan b, Lun Gong c, Tianxiao Fu d, Xiaodong Yang b, Xinjian Chen e, Jian Zheng b,∗

a School of Biomedical Engineering, University of Science and Technology of China, Hefei, 230000, China
b Department of Medical Imaging, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, 215163, China
c The Key Laboratory of Mechanism Theory and Equipment Design of Ministry of Education, Tianjin University, Tianjin, 300072, China
d Department of Radiation Oncology, The First Affiliated Hospital of Soochow University, Suzhou, 215006, China
e School of Electronics and Information Engineering, Soochow University, Suzhou, 215006, China
∗ Corresponding author. E-mail address: zhengj@sibet.ac.cn (J. Zheng).

Article info

Article history:
Received 21 November 2018
Received in revised form 20 March 2019
Accepted 10 May 2019
Available online 23 May 2019

Keywords:
Deformable registration
Deep network
Unsupervised model
Multi-scale
Adversarial training

Abstract

Background and objective: Deformable registration is very significant for various clinical image applications. Unfortunately, existing conventional medical image registration approaches, which involve time-consuming iterative optimization, have not reached the level of routine clinical practice in terms of registration time and robustness. The aim of this study is to propose a tuning-free 3D image registration model based on an adversarial deep network, and to achieve rapid and highly accurate registration.

Methods: We propose a fully convolutional network (FCN) to regress the 3D dense deformation field in one shot from the to-be-registered image pair. To precisely regress the complex deformation and produce an optimal registration, we design the FCN as a novel multi-scale framework that captures complementary multi-scale image features and effectively characterizes the spatial correspondence between the image pair. Moreover, we simultaneously learn a discriminator network to discriminate between the two registered images, where the discrimination loss helps to further update the FCN. Through this adversarial training strategy, the registration network is urged to produce a registered image pair that is indistinguishable to the discriminator.

Results: We perform registration experiments on four different brain MR datasets using the model trained on the ADNI database. Compared with several state-of-the-art registration algorithms, including recent deep learning-based methods, the proposed method provides a considerable increase of more than 4% in terms of Dice similarity coefficient (DSC). Moreover, our model also obtains comparable distance errors. More significantly, our model achieves a highly accurate 3D registration result in 0.74 s on average, a roughly hundred-fold speed-up over conventional registration methods.

Conclusions: The proposed model shows consistently high performance on various registration tasks in under a second without any additional parameter tuning, which proves its potential for real-time clinical applications.

© 2019 Elsevier Ltd. All rights reserved.
https://doi.org/10.1016/j.bspc.2019.101562

1. Introduction

Deformable registration is the building block for various medical image analysis tasks, such as atlas-based image segmentation [1], radiation assessment [2], image-guided surgery [3] and so on. The aim of registration is to match all corresponding anatomical points in the two images to the same coordinate system through a plausible spatial transformation or mapping.

Existing conventional methods [4,6–14] regard deformable registration as a high-dimensional iterative optimization procedure that maximizes the similarity between the registered image pair. These conventional algorithms roughly fall into two categories: (1) Intensity-based registration [5–7], where the intensity information is used to match voxels between the image pair. However, intensity information alone is not sufficient to support fine anatomical correspondences. Although some modified similarity metrics incorporating spatial information [8,9] have been proposed recently, these increasingly complex metrics inevitably add further computational cost. (2) Feature-based registration [10–14], where the geometrical features of the landmarks are adopted to


guide the local matching. The performance of these methods depends heavily on the hand-crafted features. Features that are designed well for specific image data, however, do not necessarily apply well to other data. In either case, these methods have not been widely integrated into commercial registration software supporting clinical decisions, due to their large time consumption and lack of robustness. Therefore, a rapid and highly accurate registration method, which is robust to various data, is in demand.

Inspired by the outstanding performance of deep learning technology in the fields of image segmentation [15], classification [16] and detection [17], some researchers have introduced deep learning networks into the registration model. Wu et al. [18] used a convolutional stacked auto-encoder (CAE) and Kearney et al. [19] used a deep convolutional inverse graphics network (DCIGN) to extract features automatically instead of hand-crafted features. However, these networks are independent of the registration stage, so there is no verification that the extracted features are the most beneficial ones for registration. In a different way, Yang et al. [20] proposed a convolutional neural network (CNN) for predicting the initial momenta of the LDDMM registration algorithm. These methods, to some extent, simplified the registration task and improved registration efficiency, but all these networks still need to be coupled with a conventional registration framework to eventually complete the registration task.

In order to avoid the conventional iterative optimization, some studies utilized CNNs to model the direct mapping from the raw images to the deformation in a supervised manner, in which the quality of the supervised ground-truth deformations plays a key role in the network training. To acquire ground-truth displacement vector fields (DVFs), Sokooti et al. [21] built synthetic DVFs to generate image pairs with known artificial deformations as training data, but the synthetic deformations are too simple to effectively account for the complicated deformations within realistic medical images. Cao et al. [22] and Fan et al. [23] used the DVFs estimated by conventional image registration algorithms as ground truth, where the validity of the approximate ground-truth DVFs is limited by the conventional methods. Rohé et al. [24] evaluated target spatial transformations by aligning the regions of interest (ROIs) of the image pair, in which the ROIs need to be segmented beforehand. While plausible, these ground-truth DVFs are inevitably time-consuming and laborious to build.

Most recently, taking the difficulty of acquiring ground-truth data into consideration, some studies have turned to unsupervised learning models for image registration. de Vos et al. [25] learned an unsupervised image registration network by maximizing the similarity between the registered image pair, thus alleviating the requirement for supervised data. However, this method only produces the displacement of the center pixel of the selected patch each time, and later relies on B-splines for spatial interpolation, which may fail to recover a dense deformation field. Moreover, this kind of patch-wise model cannot support large displacements, and the performance of the trained model strongly depends on the selection of training patch samples. In contrast, Shan et al. [26] proposed a fully convolutional network (FCN) to regress the image-wise global 2D deformation field all at once. Furthermore, Balakrishnan et al. [27] and Li et al. [28] extended this FCN model to 3D medical image registration. However, the registration accuracy of these deep learning-based methods, which is only comparable to that of conventional registration toolboxes, still needs to be improved for precise 3D medical image registration.

In this study, we adopt a similar unsupervised approach and investigate further improvements for rapid and more accurate registration of 3D medical images. Our unique contributions are as follows:

(1) We propose an image-wise FCN to rapidly generate the 3D global deformation field in one shot. Furthermore, in order to effectively learn the highly non-linear mapping from the image pair to the deformation field, the proposed FCN is designed as a novel multi-scale framework. This FCN is capable of capturing complementary multi-scale features to precisely regress the complex deformation and obtain a highly accurate registration.

(2) Without the need for ground-truth deformations, the proposed deformation generator (the multi-scale FCN) is trained based on image similarity. Furthermore, considering the fact that maximizing the intensity-based image similarity might not guarantee the optimal alignment, we add an adversarial training objective term. That is, we use an additional discriminator network to judge whether the image pair is well aligned and to give mismatch feedback that further trains the FCN. This urges the registered image from the FCN to be perceptually more similar to the ground-truth target image.

Fig. 1. Overview of the unsupervised registration method. An FCN is trained to learn the mapping from the input to-be-registered 3D image pair (M, F) to the dense deformation field ϕ, which is then employed to deform M into M(ϕ). The training loss estimates the similarity between the generated M(ϕ) and F and imposes smoothness on ϕ.

(3) The proposed networks not only produce well-registered images straightaway but also give direct feedback on the alignment quality, which is of practical significance especially for image-fusion based clinical guidance.

2. Method

2.1. Adversarial deformable registration

2.1.1. Unsupervised FCN for medical image registration

Fig. 1 depicts the overview of the unsupervised registration method. Let F and M respectively denote the fixed image and the moving image to be registered over the image domain Ω ⊂ ℝ³. We model a function G(F, M; θ) = ϕ by using an FCN framework, where ϕ is the dense deformation field for registering M to F as the warped moving image M(ϕ). θ denotes the learnable network parameters of the FCN, which are updated by maximizing the similarity between M(ϕ) and F; thus the training of this registration model does not need known ground-truth deformations.

Each voxel at an integer coordinate of M(ϕ) is calculated by tri-linear interpolation at the corresponding warped location (given by ϕ), which constitutes a fully differentiable spatial transformation module. As the similarity metric, we adopt the normalized cross correlation (NCC²) [29]. We evaluate the mean of the local NCC² over the n × n × n (n = 9 in this study) patches centered at each voxel, so that we can enforce a more restrictive intensity mapping between every corresponding image pair. NCC² is computed as:

NCC²(F, M(ϕ)) = (1/V) Σ_{p∈Ω} [Σ_{q∈N(p)} (F(q) − F̄_N(p))(M(ϕ)(q) − M̄(ϕ)_N(p))]² / [Σ_{q∈N(p)} (F(q) − F̄_N(p))² · Σ_{q∈N(p)} (M(ϕ)(q) − M̄(ϕ)_N(p))² + ε]   (1)

where ε is a small constant that avoids a zero denominator, N(p) is the n × n × n local patch centered at voxel p, F̄_N(p) and M̄(ϕ)_N(p) represent the mean intensities over the local patch in the fixed image and the warped moving image respectively, and V is the volume of the image. In this study, NCC² can be efficiently estimated through convolutional operations on M(ϕ) and F. A higher NCC² indicates a better registration, so the similarity term in the loss function to be minimized is denoted as:

L_similarity(F, M(ϕ)) = −NCC²(F, M(ϕ))   (2)

In addition, we add a diffusion regularizer to impose a smoothness constraint on the spatial gradients of ϕ, to avoid an unrealistic or discontinuous deformation field:

L_smooth(ϕ) = Σ_{p∈Ω} ‖∇ϕ(p)‖²   (3)

2.1.2. Adversarial training

The Generative Adversarial Network (GAN) [30] has shown remarkable success in image generation [31], synthesis [32] and segmentation [33]. In order to produce a registered moving image that is more similar to the corresponding fixed image, we introduce a similar adversarial strategy to train the registration network. We add an additional discriminator network right after the registration model of Section 2.1.1. That is to say, the whole registration model consists of two deep networks: 1) the generator network G (i.e., the "Registration FCN" in Fig. 1), which estimates the deformation field from the to-be-registered input images and generates the warped moving image M(ϕ); and 2) the discriminator network D, which discriminates between the registered image pair F and M(ϕ). The discrimination scores are also used to further update the parameters θ of the FCN in backpropagation. The whole registration scheme is detailed in Fig. 2.

The discriminator D tries to correctly discriminate F from the M(ϕ) produced by the generator G. The loss function for training D is:

L_D(F, M(ϕ)) = L_H(D(F), 1) + L_H(D(M(ϕ)), 0)   (4)

where D(·) is the output of network D, which reflects the likelihood of the input being F (ideally D(·) → 1) or M(ϕ) (ideally D(·) → 0), and L_H denotes the binary cross-entropy loss function:

L_H(Y′, Y) = −Σ_i [Y_i log(Y′_i) + (1 − Y_i) log(1 − Y′_i)]   (5)

where Y is the label of the input image (1 for F and 0 for M(ϕ)) and Y′ ∈ [0, 1] is the discrimination score D(·).

At the same time, the generator G aims to generate a well-registered image pair that is indistinguishable for D. Hence, the optimization with regard to G turns into the maximization of L_D, so the training of the proposed registration model is an adversarial minimax game between the D network and the G network. As suggested in [30], we instead minimize the binary cross entropy between the discrimination probability of M(ϕ) and the label "1", so as to speed up the convergence of G. Therefore the adversarial loss, used as the third ("adversary") term for updating the G network, is defined as:

L_adversary(M(ϕ)) = L_H(D(M(ϕ)), 1)   (6)

The adversarial term uses the discrimination feedback of D to increase the probability of the generated M(ϕ) being taken for F, and thus penalizes the divergence between F and M(ϕ). Adding up (2), (3) and (6), the objective function for training G is:

L_G(F, M(ϕ)) = λ₁ L_similarity(F, M(ϕ)) + λ₂ L_smooth(ϕ) + λ₃ L_adversary(M(ϕ))   (7)

The two networks are trained in an alternating way. First, the D network is updated by using a random mini-batch of fixed images and the currently registered moving images (corresponding to the generation of G). Then, the G network is updated on another random mini-batch of to-be-registered image pairs. The cycle is repeated until neither network can improve further while the generated M(ϕ) is highly similar to F. Thus, the output of the G network ends up being the expected displacement field, since the image pair achieves maximum similarity [27,28].

2.1.2.1. Network architectures.

2.1.2.1.1. Registration generator G.

2.1.2.1.1.1. Multi-scale fully convolutional network. The proposed registration network G is detailed in Fig. 3. It is a fully convolutional network in an encoder-decoder fashion with skip connections, similar to the U-Net architecture [34]. In contrast to U-Net, the proposed FCN owns a unique property: it captures complementary multi-scale image features by fusing different operation strategies in the encoder sub-network.

The encoder sub-network involves two encoding streams with different volumetric operations. Each encoding stream contains two decreasing resolution levels, and features of the same resolution from the two streams are combined and then fed to the next level. In one branch, we perform two convolutional layers with a stride of 1 plus one max-pooling layer, similar to the encoding operations in a typical U-Net. In the other branch, we first employ max pooling and then a 2-dilated convolution, which allows a large receptive field with fewer trainable parameters. All the above convolutional layers have a 3 × 3 × 3 kernel size followed by Leaky ReLU activation.

Fig. 2. Overall flowchart of the proposed adversarial registration model. Generator G generates deformation field ϕ and produces warped moving image M(ϕ). Then, both
M(ϕ) and F are fed to the discriminator D, where the discrimination score will be used to further update G in backpropagation.
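To make the training objectives concrete, the following is a minimal sketch of the loss terms in Eqs. (1)–(7), assuming TensorFlow 2 / tf.keras; it is an illustration, not the authors' code. The local NCC² is computed with all-ones convolution kernels, mirroring the statement that NCC² can be estimated through convolutional operations; volumes are assumed to have shape [batch, D, H, W, 1] and flows [batch, D, H, W, 3], and all names are illustrative.

```python
# Sketch (not the authors' implementation) of the losses in Eqs. (1)-(7).
import tensorflow as tf

def local_ncc2(fixed, warped, win=9, eps=1e-5):
    """Mean squared local NCC over win^3 windows, Eq. (1), via convolutions."""
    ones = tf.ones([win, win, win, 1, 1])
    vol = float(win ** 3)
    win_sum = lambda x: tf.nn.conv3d(x, ones, strides=[1] * 5, padding="SAME")
    f_s, m_s = win_sum(fixed), win_sum(warped)
    # Local cross term and variances, written so that the window means
    # never have to be materialized explicitly.
    cross = win_sum(fixed * warped) - f_s * m_s / vol
    f_var = win_sum(fixed * fixed) - f_s * f_s / vol
    m_var = win_sum(warped * warped) - m_s * m_s / vol
    return tf.reduce_mean(cross * cross / (f_var * m_var + eps))

def similarity_loss(fixed, warped):
    # Eq. (2): higher NCC^2 means better alignment, so negate it.
    return -local_ncc2(fixed, warped)

def smoothness_loss(flow):
    """Diffusion regularizer on the spatial gradients of phi, Eq. (3)."""
    dx = flow[:, 1:] - flow[:, :-1]
    dy = flow[:, :, 1:] - flow[:, :, :-1]
    dz = flow[:, :, :, 1:] - flow[:, :, :, :-1]
    return sum(tf.reduce_mean(d * d) for d in (dx, dy, dz))

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(d_fixed, d_warped):
    # Eq. (4): label 1 for F, label 0 for M(phi).
    return (bce(tf.ones_like(d_fixed), d_fixed)
            + bce(tf.zeros_like(d_warped), d_warped))

def generator_loss(fixed, warped, flow, d_warped, lam=(1.0, 1.0, 0.5)):
    # Eq. (7), with the adversarial term of Eq. (6); the lambda weights
    # follow the values reported in Section 3.2.
    adv = bce(tf.ones_like(d_warped), d_warped)
    return (lam[0] * similarity_loss(fixed, warped)
            + lam[1] * smoothness_loss(flow) + lam[2] * adv)
```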

Fig. 3. G network architecture, which consists of a multi-scale encoder sub-network and a decoder sub-network. The input is the concatenated to-be-registered image pair F and M. The output is a 3-dimensional vector map corresponding to the displacements in X, Y and Z respectively.

The subsequent decoder sub-network involves two increasing image pyramids, symmetric to the encoder sub-network. There are two identical convolutional operations and one deconvolutional layer with a stride of 2 at each resolution level. In the last layer, a simple convolution builds 3 channels, yielding the displacements in the X, Y and Z dimensions respectively, on the same grid as the initial input images.

Two points need to be explained:

(1) We first apply one convolutional layer to the original images, and then perform the two different volumetric operation strategies described above on the feature maps from this convolutional layer rather than on the original images. By doing so, we preserve the original neighborhood information and alleviate the information loss (in the first green down-sampled feature maps of Fig. 3) caused by the max-pooling operation.

(2) In the decoder sub-network, we employ a deconvolutional operation instead of up-sampling to learn the non-linear interpolation, which may help to precisely recover high-resolution complex image deformations. The deconvolutional layer used in this study is the transposed convolutional layer of [35].

2.1.2.1.1.2. Multi-scale feature learning. The proposed registration network G aggregates two-scale features in each encoding level. Fig. 4 illustrates the detailed procedure of the 1st-level feature extraction in a 2D view. The last blue voxel (top right) in the 1st-scale feature map is extracted by max pooling and one 2-dilated convolution on the 10 × 10 blue region of the input image, whereas the corresponding red voxel (bottom right) in the 2nd-scale feature map is extracted by two convolutions plus max pooling on the 6 × 6 red region of the input image. These two complementary feature voxels respectively characterize the input image at a fine scale (6 × 6) and a coarse scale (10 × 10) with identical center regions. Thus, we separately use these two different operations with different receptive fields for multi-scale feature learning [36]. Then, these multi-scale feature maps are fused as the 2nd-level input.
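One encoding level of this two-stream design might be sketched as follows, assuming a tf.keras functional model; the filter counts, names, and the exact placement of pooling are illustrative interpretations of the description above, not the authors' code.

```python
# Illustrative tf.keras sketch of one encoding level of the multi-scale G:
# a fine-scale U-Net-style stream and a coarse-scale dilated stream whose
# same-resolution outputs are fused by concatenation.
from tensorflow.keras import layers

def conv_block(x, filters, dilation=1):
    # 3x3x3 convolution followed by Leaky ReLU, as described above.
    x = layers.Conv3D(filters, 3, padding="same", dilation_rate=dilation)(x)
    return layers.LeakyReLU()(x)

def multi_scale_level(fine_in, coarse_in, filters):
    # Coarse (1st-scale) stream: max pooling first, then one 2-dilated
    # convolution, enlarging the receptive field with few parameters.
    coarse = layers.MaxPooling3D(2)(coarse_in)
    coarse = conv_block(coarse, filters, dilation=2)

    # Fine (2nd-scale) stream: two stride-1 convolutions plus max pooling,
    # as in a typical U-Net encoder.
    fine = conv_block(fine_in, filters)
    fine = conv_block(fine, filters)
    fine = layers.MaxPooling3D(2)(fine)

    # Same-resolution features from the two streams are fused and fed on;
    # the coarse features are also kept for the next level's coarse stream.
    fused = layers.Concatenate()([coarse, fine])
    return fused, coarse
```

Per the description above, a second level would then call multi_scale_level(fused, coarse, ...), so that the coarse stream keeps operating on the foregoing coarse-scale features while the fine stream uses the fused features.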

Fig. 4. An interpretation of multi-scale feature extraction in the 1st-level encoder network in a 2D view (10 × 10). The last blue pixel (top right) in the 1st-scale feature map and the last red pixel (bottom right) in the 2nd-scale feature map correspond to the 10 × 10 blue region and the 6 × 6 red region of the input image, respectively. The fusion symbol ⊕ denotes the concatenation operation.

Fig. 5. D network architecture in the adversarial registration model, estimating the probability of the input image being the fixed image F.

The 2nd-level feature extraction process is similar to the 1st-level stage illustrated above. As illustrated in Fig. 3, we implement the 1st-scale (coarse-scale) feature extraction on the foregoing coarse-scale features, and simultaneously implement the 2nd-scale feature extraction on the fused features of the 1st level, since this shows overall better performance in our experiments than implementing both operations on the fused features. We speculate that such a strategy extracts coarser features and enables G to cover a large range of displacements. The fusion of these multi-scale feature representations provides abundant contextual information.

2.1.2.1.2. Discriminator D. Our proposed discriminator network D is a typical CNN architecture. It contains three phases of convolution, batch normalization (BN), ReLU and pooling layers. These are followed by two consecutive convolutional layers with BN and ReLU activation, which are then flattened and fed into three fully connected layers. The last layer uses a sigmoid output to estimate the probability that the input image is the fixed image F rather than the warped moving image M(ϕ) from G. As shown in Fig. 5, the kernel sizes of the convolutional layers are first 5 × 5 × 5 and then 3 × 3 × 3, and the numbers of filters are 24, 24, 32, 48 and 64 in sequence. For the fully connected layers, the numbers of output nodes are 128, 32 and 1.

3. Experiments and results

3.1. Dataset

For training, we randomly select 5000 brain MR image pairs from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.ucla.edu/) as training data. All images are preprocessed with brain extraction using FreeSurfer [37], bias correction [38] and intensity normalization. In addition, all images are pre-registered linearly to the Montreal Neurological Institute (MNI) 152 space by using the FMRIB Software Library's (FSL) FLIRT.

For testing, we choose four brain datasets: LPBA40 [39], IBSR18 [40], CUMC12 [40] and MGH10 [40]. The four datasets are already skull-stripped as described in [40]. We perform pairwise registration evaluation on all images, resulting in a total of 2168 independent registration pairs (1560 from LPBA40, 306 from IBSR18, 132 from CUMC12, 90 from MGH10). All to-be-registered pairs are also linearly registered into MNI152 space.

Since the proposed FCN-based model takes the whole 3D image as input, and taking the memory limitation of the GPU into consideration, we preprocess all images in the training and testing datasets to share the same size (96 × 96 × 96) and the same resolution (2 mm × 2 mm × 2 mm) in this study.

3.2. Implementation

We implement our networks in Keras [41] with a TensorFlow backend [42]. The training and testing experiments are carried out on a single NVIDIA GeForce GTX 1080 GPU with 8 GB RAM. We use the Adam optimizer to train the D network and the G network with learning rates of 2 × 10⁻⁵ and 2 × 10⁻⁴ respectively. The loss weights for training the generator G are λ₁ = λ₂ = 1, λ₃ = 0.5.
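With these optimizer settings, the alternating update of Section 2.1.2 might be sketched as follows. This is a hedged illustration, not the authors' training script: generator, discriminator, warp and the loss functions are assumed to be defined elsewhere (e.g., as in the earlier loss sketch), and, unlike the description in Section 2.1.2, this compressed sketch reuses one mini-batch for both updates.

```python
# Hedged sketch of the alternating adversarial update (Section 2.1.2)
# with the learning rates reported above.
import tensorflow as tf

d_opt = tf.keras.optimizers.Adam(2e-5)   # discriminator D
g_opt = tf.keras.optimizers.Adam(2e-4)   # generator G

@tf.function
def train_step(moving, fixed):
    # 1) Update D on fixed images and the current warped moving images.
    flow = generator([moving, fixed], training=False)
    warped = warp(moving, flow)
    with tf.GradientTape() as tape:
        d_loss = discriminator_loss(discriminator(fixed, training=True),
                                    discriminator(warped, training=True))
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # 2) Update G so that M(phi) becomes indistinguishable from F for D.
    with tf.GradientTape() as tape:
        flow = generator([moving, fixed], training=True)
        warped = warp(moving, flow)
        g_loss = generator_loss(fixed, warped, flow,
                                discriminator(warped, training=False))
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
```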

Fig. 6. Overlap achieved by different registration methods on the four datasets. The boxplots show the average of the mean target overlaps across all labels in each dataset, where the mean target overlap is the mean of the target overlap over all registration cases.

3.3. Evaluation for label overlap

To verify the robustness of the proposed model, we directly apply the model trained on the ADNI dataset to all 2168 testing registration pairs of the four datasets, without any additional parameter tuning. We compare our results with the conventional deformable registration methods in [40].

Likewise, we follow the evaluation method in [40] and estimate the target overlap (TO) of identically labeled brain regions in the two images after registration:

TO = |l_m ∩ l_f| / |l_f|   (8)

where l_m and l_f denote the corresponding labeled regions in the warped moving image and the fixed image. There are 56, 96, 130 and 106 manually labeled regions in the LPBA40, IBSR18, CUMC12 and MGH10 datasets respectively. For each testing dataset, we calculate the TO averaged first across all registration cases and then across all labeled regions in the dataset. The corresponding TO results of the conventional registration algorithms evaluated in [40] are compared in Fig. 6.

As shown in Fig. 6, the medians of the TO values for the four datasets from [FNIRT, SyN [43], SPM D [44], proposed method] are: LPBA40: [0.701, 0.715, 0.682, 0.736]; IBSR18: [0.506, 0.529, 0.540, 0.531]; CUMC12: [0.435, 0.516, 0.520, 0.517]; MGH10: [0.505, 0.586, 0.543, 0.554]. We can observe that the proposed method achieves the best TO accuracy on the LPBA40 dataset. For the other three datasets, the TO values are also on par with the top-performing SyN or SPM D methods, and the difference is not significant. In addition, an important observation is that our method produces scarcely any low outliers in the boxplots, which shows that it can consistently provide high alignment accuracy for all brain regions, whether a region is big or small. These results demonstrate that the proposed method works robustly on the various registration cases of all four datasets.

3.4. Evaluation for the role of the multi-scale FCN

The proposed deformation-regression Multi-Scale G network (denoted as MSG) is an extension of the typical U-Net obtained by adding a complementary coarse-scale feature encoding stream (i.e., the 1st-scale feature maps in Fig. 4). To prove its effectiveness, we delete the 1st-scale encoding stream (green feature maps in Fig. 3) to simplify the generator into a Single-Scale G network (denoted as SSG) similar to the U-Net architecture. Further, for fairness, we augment the depth of SSG by adding a third downsampling scale (additional pooling, convolutional layers and a deconvolutional layer) to guarantee a network complexity comparable with that of MSG.

We evaluate the Dice similarity coefficient (DSC) of the gray matter (GM) and white matter (WM) tissues based on the GM and WM segmentation labels in the datasets. The average DSC of all registration pairs in each dataset (2168 pairs in total) is shown in Table 1. By comparing the results obtained by adversarial registration using the proposed multi-scale network (MSG + D) with those using the single-scale G network (SSG + D), we can see that the multi-scale feature fusion effectively improves the DSC on all datasets. By further comparing the results of SSG and MSG, we notice that multi-scale feature learning is also beneficial for the single generator G, even when we do not add the discriminator D to train the network G with the adversarial strategy.

Fig. 7 illustrates the visual registration results of SSG + D and MSG + D. From the axial view in the top row, we observe that the proposed MSG + D provides a better registration of the ventricle, as marked by the upper red ROI. Likewise, for the lower ROI, with its big appearance discrepancy and complicated deformation, the proposed MSG + D also shows a more accurate registration. Furthermore, in the bottom sagittal view, the warped moving image from SSG + D fails to match the fixed image and still retains the large structure in the red ROI of the moving image. These results demonstrate that the addition of coarse-scale information is effective for registering image pairs with high appearance variation and large deformation.
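Both tissue-overlap measures used in this and the preceding subsection reduce to simple set operations on label volumes. The following NumPy sketch (illustrative, not the authors' evaluation code) shows the target overlap of Eq. (8) and the DSC reported in Table 1:

```python
# Minimal sketch of the overlap measures: target overlap, Eq. (8),
# and the Dice similarity coefficient used in Tables 1 and 2.
import numpy as np

def target_overlap(warped_labels, fixed_labels, label):
    lm = warped_labels == label           # region in warped moving image
    lf = fixed_labels == label            # region in fixed image
    return np.logical_and(lm, lf).sum() / lf.sum()          # Eq. (8)

def dice(warped_labels, fixed_labels, label):
    lm = warped_labels == label
    lf = fixed_labels == label
    return 2.0 * np.logical_and(lm, lf).sum() / (lm.sum() + lf.sum())
```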

Table 1
The average (± standard deviation) DSC (%) of the WM and GM tissues. (SSG = Single-Scale G network without the discriminator D; MSG = Multi-Scale G network without the discriminator D; SSG + D = Single-Scale G network with the discriminator D; MSG + D = Multi-Scale G network with the discriminator D.)

Dataset   Tissue   SSG            MSG            SSG + D        MSG + D (Proposed)
LPBA40    GM       72.81 ± 3.61   74.62 ± 3.57   74.19 ± 3.42   75.77 ± 3.53
LPBA40    WM       73.70 ± 2.14   76.54 ± 2.07   76.07 ± 2.11   77.95 ± 2.05
IBSR18    GM       81.88 ± 2.10   83.37 ± 2.12   83.43 ± 2.16   84.83 ± 2.10
IBSR18    WM       76.69 ± 2.48   78.99 ± 2.32   78.92 ± 2.04   80.30 ± 2.13
CUMC12    GM       68.80 ± 2.25   71.16 ± 1.99   71.14 ± 1.97   73.40 ± 1.57
CUMC12    WM       75.92 ± 2.67   79.05 ± 1.73   79.21 ± 1.62   81.16 ± 0.82
MGH10     GM       72.93 ± 0.87   75.68 ± 0.92   76.17 ± 0.96   78.63 ± 1.04
MGH10     WM       76.23 ± 1.67   80.14 ± 0.97   80.77 ± 0.87   83.05 ± 0.64

Fig. 7. Qualitative comparison of registration results by SSG + D (adversarial registration based on typical U-Net) and MSG + D (Proposed).

3.5. Evaluation for the role of adversarial training

The adversarial training strategy is another important contribution of this paper. The quantitative registration results obtained from the models with and without the discriminator network D can also be compared in Table 1. The results of MSG + D are more accurate than those of MSG on all datasets. Since the additional discriminator network D provides precise alignment-quality feedback for training the registration network G, it can further refine the tissue matching and improve the average DSC. The DSC values of SSG + D and SSG likewise indicate the contribution of the adversarial training.

Visual registration results of MSG + D and MSG are illustrated in Fig. 8. The registered red ROIs generated by MSG + D are more similar to the corresponding ROIs in the fixed image. It follows that the adversarial training strategy can indeed help narrow the gap between the registered moving image and the fixed image. Taking the axial view in the top row as an example, maximizing the NCC² between the fixed image and the registered image from MSG tends to judge that the two images have reached the optimal alignment, while the discriminator D offers a more accurate alignment discrimination and guides a better registration.

3.6. Comparison with state-of-the-art registration methods

In this section, to further evaluate the registration accuracy of the proposed method, we compare it with the popular Diffeomorphic Demons (D.Demons [45]) and three state-of-the-art registration methods: (1) SRWCR-based B-splines registration, using the spatially region-weighted correlation ratio as the similarity metric [9], with 3 resolutions of 200 iterations each; and (2) deep learning-based registration methods, VoxelMorph CNN [27] and Li [28], which we reproduce for comparison. All four of these methods have been demonstrated to be high-performing and robust for medical image registration.

3.6.1. Quantitative evaluation of DSC and distance error

We estimate the DSC values of the GM and WM registered by these four methods and by the proposed model. The DSCs on the four testing datasets are compared in Table 2. For all four datasets, the proposed method always shows a considerable increase of more than 4% compared with the fine-tuned D.Demons and the newest deep learning-based registration methods [27,28]. When compared with the most advanced B-spline-based method [9], our method also yields overall better results, except for the WM of the MGH10 dataset. Although SRWCR [9] provides a slightly higher DSC on the MGH10 dataset, the result of our method is still comparable. What is noteworthy is that these reported results of the proposed method on the four datasets are obtained using the model trained on the ADNI dataset, without any parameter tuning.

The DSC measure cannot explicitly assess boundary gaps between corresponding regions of the image pair. Therefore, apart from DSC, we also evaluate the distance error (DE) of the region borders

Fig. 8. Qualitative comparison of registration results by MSG (registration based on the proposed multi-scale FCN without using adversarial training) and MSG + D (Proposed).

Table 2
The average (± standard deviation) DSC (%) of the WM and GM tissues by D.Demons, SRWCR [9], VoxelMorph [27], Li [28] and the proposed method.

Dataset   Tissue   D.Demons       SRWCR [9]      VoxelMorph [27]   Li [28]        MSG + D (Proposed)
LPBA40    GM       68.57 ± 4.64   72.79 ± 2.70   70.91 ± 3.55      70.46 ± 3.30   75.77 ± 3.53
LPBA40    WM       72.55 ± 2.30   76.35 ± 1.77   72.89 ± 2.01      71.31 ± 1.74   77.95 ± 2.05
IBSR18    GM       73.56 ± 3.70   84.16 ± 3.29   79.95 ± 2.15      79.88 ± 2.03   84.83 ± 2.10
IBSR18    WM       69.92 ± 2.93   80.13 ± 2.19   75.25 ± 2.46      73.89 ± 1.94   80.30 ± 2.13
CUMC12    GM       64.77 ± 2.23   70.91 ± 2.85   66.55 ± 1.97      65.99 ± 1.88   73.40 ± 1.57
CUMC12    WM       70.72 ± 1.97   80.41 ± 1.70   74.71 ± 1.85      73.73 ± 1.14   81.16 ± 0.82
MGH10     GM       71.72 ± 2.29   75.65 ± 2.80   72.69 ± 1.06      72.85 ± 1.07   78.63 ± 1.04
MGH10     WM       73.15 ± 1.18   84.09 ± 0.90   76.27 ± 1.10      76.87 ± 0.83   83.05 ± 0.64

[40] in this section. Concretely, the DE of region r (DE_r) is calculated as the mean of the minimum distances between each boundary point p of the warped moving region r (M_r B_p) and the entire boundary set of the fixed region r (F_r B):

DE_r = (1/P) Σ_{p=1}^{P} min dist(M_r B_p, F_r B)   (9)

where P is the number of border points of region r. DE_r is similar to the maximum-likelihood Hausdorff distance (MHD) [46] metric. The DE is then denoted as the average of DE_r over all regions.

Since the border information of the regions is available only in the LPBA40 dataset, we evaluate the DE over the 56 labeled brain regions of the LPBA40 dataset. From the results in Table 3, the proposed method outperforms D.Demons by 23%. It is noteworthy that D.Demons is a non-parametric diffeomorphic registration model, which determines the exact displacement for each voxel. This shows that the displacements of all voxels can be precisely estimated even though we produce the rapid image-wise dense DVF all at once. Likewise, our method obtains higher accuracy than the state-of-the-art deep learning-based registration methods [27,28], with the error reduced by 12%. Even though the distance error of the SRWCR-based B-spline registration [9] is a little smaller, the proposed method still achieves comparable results without the need for any iterative optimization.

In order to further compare our method with these methods visually, we illustrate one registration case for each testing dataset in Fig. 9. By comparing the detailed differences in the marked red ROIs, we can see that the proposed method consistently produces higher similarity to the fixed image in the three slices of different orientations.

3.6.2. Computation costs

We compare our registration speed with the CPU or GPU implementations of these top-performing methods. Table 4 shows the average computation cost of the registration cases. By avoiding any iterative optimization and any overlap of patches, our method registers a pair of images in under a second, hundreds of times faster than the GPU implementation of SRWCR. One can also see that the times of [27,28] are a little shorter, but the more significant fact is that our registration accuracy is much better according to the above evaluations of DSC and DE.
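For completeness, a small sketch of the border distance error of Eq. (9), assuming the region boundaries are given as arrays of 3D point coordinates; a KD-tree is one way to answer the minimum-distance queries, and the names are illustrative.

```python
# Illustrative computation of the border distance error DE_r of Eq. (9).
import numpy as np
from scipy.spatial import cKDTree

def distance_error(warped_boundary, fixed_boundary):
    """Mean minimum distance from each warped-region boundary point to
    the fixed region's boundary set (both: arrays of shape [P, 3])."""
    dists, _ = cKDTree(fixed_boundary).query(warped_boundary)
    return dists.mean()
```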

Table 3
The average (± standard deviation) DE (mm) by D.Demons, SRWCR [9], VoxelMorph [27], Li [28] and the proposed method.

                      Initial       D.Demons      SRWCR [9]     VoxelMorph [27]   Li [28]       Proposed
Distance error (mm)   2.97 ± 3.20   2.70 ± 2.91   2.02 ± 2.57   2.47 ± 2.79       2.35 ± 2.75   2.20 ± 2.66

Fig. 9. Example registration cases from the four image datasets for D.Demons, SRWCR, VoxelMorph, Li [28] and the proposed method (from left to right), shown in the axial, coronal and sagittal sections (from top to bottom).

Table 4
The average registration time (in seconds) of the different registration methods for registering a 96 × 96 × 96 brain image.

           D.Demons (CPU)   SRWCR [9] (GPU)   VoxelMorph [27] (GPU)   Li [28] (GPU)   Proposed (GPU)
Time (s)   120.8            64.3              0.68                    0.61            0.74
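The per-pair times in Table 4 amount to one forward pass of the trained generator plus warping; a hedged sketch of how such an average could be measured (with generator and warp as assumed in the earlier sketches):

```python
# Rough timing of one registration: a forward pass of G plus warping.
import time

def mean_registration_time(generator, warp, moving, fixed, runs=20):
    start = time.perf_counter()
    for _ in range(runs):
        flow = generator.predict([moving, fixed], verbose=0)
        warp(moving, flow)
    return (time.perf_counter() - start) / runs
```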

4. Discussion

4.1. Network architecture and training

In this study, the proposed deformation generator G is a multi-scale FCN. Traditional multi-scale models usually resize the image multiple times or operate on image patches of different sizes. These methods, however, tend to be costly. In contrast, we run two different operation streams simultaneously in the encoder sub-network to extract features at two scales [36]. This all-in-one strategy enables the network G to comprehensively analyze sufficient and complementary image information without increasing the computational cost. Thus, the network G is better suited to modeling the complex non-linear mapping of image registration.

On the other hand, in image registration, the neighborhood information is responsible for detecting the displacement of the center voxel, so an important requirement is that the neighborhood size be at least as large as the maximum displacement. For instance, the feature descriptors adopted in feature-based registration methods should cover a large neighborhood, since the displacements to be found are unknown. In our FCN-based registration model, the features from the network play the role of such complex descriptors, where the neighborhood covered by the final features is related to the receptive field of the last encoding feature. Taking this into consideration, we design the G network to extract one more coarse feature in the encoder sub-network than the typical U-Net. Therefore, the fusion of the fine-scale and coarse-scale features not only helps the network G to precisely regress the complex deformation but also enables it to accurately estimate large deformations. In the sagittal view in the bottom row of Fig. 7, the single-scale network fails to register the image pair well and still retains the big structure in the red ROI of the moving image. This demonstrates that the addition of coarse-scale information contributes to registering image pairs with high appearance variations.

The adversarial training strategy is the other significant contribution of this study. An additional discriminator network D is proposed, following the registration network G, for evaluating the registration result and then further refining it. This discriminator D can assign the warped moving image a score on how likely it is to be the fixed image. Since the discrimination score is estimated from the deep features of the two images, which are the most descriptive, the discrimination score (i.e., the adversarial term) in the loss function of the G network can serve as a more restrictive and comprehensive similarity metric than NCC². Thus, the additional discriminator is conducive to guiding the optimal registration, which has been confirmed in Fig. 8.

From the DSC values for GM and WM in Table 2, our method achieves the overall highest accuracy, except only for the WM of the MGH10 dataset. Comprehensively analyzing the results in Tables 1 and 2, we find that using only the Multi-Scale G network (MSG) leads to a slightly worse performance than the SRWCR method [9] on the IBSR18, CUMC12 and MGH10 datasets, although the accuracy is still comparable. Using the adversarial model with the single-scale G network (SSG + D) also results in similar performance. In contrast, using only the single-scale G network (SSG) shows noticeably degraded performance on all four datasets and falls short of the registration accuracy of the top-performing SRWCR [9]. This further demonstrates that the proposed multi-scale FCN and the adversarial training are both effective and significant contributions.
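To make the receptive-field argument above concrete, the standard receptive-field recurrence reproduces the 10 × 10 and 6 × 6 regions of Fig. 4 for the coarse and fine encoding streams (measured from the stream inputs, i.e., excluding the initial stride-1 convolution; a 2-dilated 3 × 3 kernel spans 5 samples). This is a worked check, not part of the original paper:

```python
# Receptive-field check for the coarse and fine encoding streams of Fig. 4.
def receptive_field(ops):
    rf, jump = 1, 1
    for kernel_extent, stride in ops:
        rf += (kernel_extent - 1) * jump   # growth at the current spacing
        jump *= stride                     # spacing of samples in the input
    return rf

coarse = [(2, 2), (5, 1)]        # max pool 2, then 2-dilated 3x3 conv
fine = [(3, 1), (3, 1), (2, 2)]  # two 3x3 convs, then max pool 2
print(receptive_field(coarse), receptive_field(fine))  # -> 10 6
```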

Finally, one more point to be noted is that only the generator G is required for a new registration application; it is capable of directly producing the registered images in one shot.

4.2. Limitations and further improvement

While the proposed adversarial registration model can produce rapid image registration with high accuracy, some aspects still need further improvement:

(1) The displacement predicted directly by the multi-scale G cannot be guaranteed to be a diffeomorphic deformation. Inspired by the concept of diffeomorphic demons [45], we will explore incorporating the Lie group exponential map operation [47] into our registration network, so as to endow the predicted deformation with the diffeomorphic properties of a Lie group, such as invertibility and topology preservation.

(2) The adversarial training strategy can provide effective guidance for refining the registration similarity; however, it inevitably adds training time and complexity. In addition, there is another issue with the adversarial network: the parameters of the generator G and the discriminator D are updated independently of each other, and there is no theoretical proof that the gradient-based optimization converges to an equilibrium of the two networks. This problem is beyond the scope of this study, but will be an interesting topic for our future research. We will investigate an improved adversarial strategy to train the registration network.

5. Conclusion

In this study, we have presented a novel multi-scale FCN (the generator G) for the deformation regression of 3D image registration. To generate warped moving images that are more similar to the fixed image, we also introduced a discriminator network D and adopted the adversarial training strategy to further update the FCN. The experimental results in Sections 3.4 and 3.5 have demonstrated that both modifications enhance the deformation-regression FCN and are beneficial for registration tasks with large appearance variations and complex anatomical deformations.

The comparison experiments also show that the registration accuracy of our proposed model is on par with various top-performing conventional registration methods, at a much lower time cost. Furthermore, the registration results of our method are verified to be better than those of several state-of-the-art deep learning-based methods. In summary, our registration model can consistently perform well on different datasets without any parameter tuning in less than a second, which proves that it possesses the accuracy and robustness needed for potential real-time clinical applications in the future.

Moreover, the proposed registration model not only can directly generate registered images with the multi-scale G network, but also can estimate the quality of the registration with the discriminator D, which can help clinicians detect potential problems when they use image-fusion guidance.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China [grant number 61701492], in part by the Jiangsu Science and Technology Department [grant number BK20170392], in part by the Suzhou Municipal Science and Technology Bureau [grant number SYG201825], in part by the Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences [grant number Y753181305], in part by the Fudan University-SIBET Medical Engineering Joint Fund [grant number YG2017-011], and in part by the Youth Innovation Promotion Association CAS [grant number 2014281].

References

[1] P. Aljabar, et al., Multi-atlas based segmentation of brain images: atlas selection and its effect on accuracy, NeuroImage 46 (3) (2009) 726–738.
[2] C. Lee, et al., Assessment of parotid gland dose changes during head and neck cancer radiotherapy using daily megavoltage computed tomography and deformable image registration, Int. J. Radiat. Oncol. Biol. Phys. 71 (5) (2008) 1563–1571.
[3] J.A. Collins, et al., Improving registration robustness for image-guided liver surgery in a novel human-to-phantom data framework, IEEE Trans. Med. Imaging 36 (7) (2017) 1502–1510.
[4] M. Holden, et al., A review of geometric transformations for nonrigid body registration, IEEE Trans. Med. Imaging 27 (1) (2008) 111–128.
[5] D. Rueckert, et al., Nonrigid registration using free-form deformations: application to breast MR images, IEEE Trans. Med. Imaging 18 (8) (1999) 712–721.
[6] T. Vercauteren, X. Pennec, A. Perchant, N. Ayache, Diffeomorphic demons: efficient non-parametric image registration, NeuroImage 45 (1) (2009) S61–S72.
[7] H. Rivaz, et al., Nonrigid registration of ultrasound and MRI using contextual conditioned mutual information, IEEE Trans. Med. Imaging 33 (3) (2014) 708–725.
[8] X. Zhuang, et al., A nonrigid registration framework using spatially encoded mutual information and free-form deformations, IEEE Trans. Med. Imaging 30 (10) (2011) 1819–1828.
[9] L. Gong, C. Zhang, L. Duan, X. Chen, J. Zheng, Non-rigid image registration using spatially region-weighted correlation ratio and GPU-acceleration, IEEE J. Biomed. Health Inform. (2018).
[10] D. Shen, C. Davatzikos, HAMMER: hierarchical attribute matching mechanism for elastic registration, IEEE Trans. Med. Imaging 21 (11) (2002) 1421–1439.
[11] J. Zheng, J. Tian, K. Deng, X. Dai, X. Zhang, M. Xu, Salient feature region: a new method for retinal image registration, IEEE Trans. Inf. Technol. Biomed. 15 (2) (2011) 221–232.
[12] F. Zhu, M. Ding, X. Zhang, Self-similarity inspired local descriptor for non-rigid multi-modal image registration, Inf. Sci. 372 (2016) 16–31.
[13] Y. Ou, A. Sotiras, N. Paragios, C. Davatzikos, DRAMMS: deformable registration via attribute matching and mutual-saliency weighting, Med. Image Anal. 15 (4) (2011) 622–639.
[14] El Rube, et al., Image registration based on multi-scale SIFT for remote sensing images, Proc. ICSPCS (2009) 1–5.
[15] D. Nie, L. Wang, E. Adeli, C. Lao, W. Lin, D. Shen, 3-D fully convolutional networks for multimodal isointense infant brain image segmentation, IEEE Trans. Cybern. (2018).
[16] G. Mohan, M.M. Subashini, MRI based medical image analysis: survey on brain tumor grade classification, Biomed. Signal Process. Control 39 (2018) 139–161.
[17] H. Jin, Z. Li, R. Tong, L. Lin, A deep 3D residual CNN for false positive reduction in pulmonary nodule detection, Med. Phys. (2018).
[18] G. Wu, M. Kim, Q. Wang, B.C. Munsell, D. Shen, Scalable high performance image registration framework by unsupervised deep feature representations learning, IEEE Trans. Biomed. Eng. 63 (7) (2016) 1505–1516.
[19] V. Kearney, S. Haaf, A. Sudhyadhom, G. Valdes, T.D. Solberg, An unsupervised convolutional neural network-based algorithm for deformable image registration, Phys. Med. Biol. (2018).
[20] X. Yang, R. Kwitt, M. Styner, M. Niethammer, Quicksilver: fast predictive image registration – a deep learning approach, NeuroImage 158 (2017) 378–396.
[21] H. Sokooti, et al., Nonrigid image registration using multi-scale 3D convolutional neural networks, in: Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2017.
[22] X. Cao, et al., Deformable image registration using cue-aware deep regression network, IEEE Trans. Biomed. Eng. (2018).
[23] J. Fan, X. Cao, P.-T. Yap, D. Shen, BIRNet: brain image registration using dual-supervised fully convolutional networks, Proc. IEEE CVPR (2018).
[24] M.M. Rohé, M. Datar, T. Heimann, M. Sermesant, X. Pennec, SVF-Net: learning deformable image registration using shape matching, in: Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2017, pp. 266–274.
[25] B.D. de Vos, et al., End-to-end unsupervised deformable image registration with a convolutional neural network, Proc. DLMIA (2017) 204, Springer.
[26] S. Shan, X. Guo, W. Yan, et al., Unsupervised end-to-end learning for deformable medical image registration, arXiv preprint arXiv:1711.08608, 2017.
[27] G. Balakrishnan, A. Zhao, et al., An unsupervised learning model for deformable medical image registration, Proc. IEEE CVPR (2018) 9252–9260.
[28] H. Li, Y. Fan, Non-rigid image registration using self-supervised fully convolutional networks without training data, Proc. IEEE ISBI (2018) 1075–1078.
[29] H. Zhou, H. Rivaz, Registration of pre- and postresection ultrasound volumes with noncorresponding regions in neurosurgery, IEEE J. Biomed. Health Inform. 20 (5) (2016) 1240–1249.
[30] I. Goodfellow, et al., Generative adversarial nets, Adv. Neural Inf. Process. Syst. (2014) 2672–2680.
[31] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434, 2015.

[32] D. Nie, et al., Medical image synthesis with deep convolutional adversarial networks, IEEE Trans. Biomed. Eng. (2018).
[33] P. Moeskops, M. Veta, M.W. Lafarge, K.A. Eppenhof, J.P. Pluim, Adversarial training and dilated convolutions for brain MRI segmentation, Proc. DLMIA (2017) 56–64, Springer.
[34] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–241.
[35] M.D. Zeiler, G.W. Taylor, R. Fergus, Adaptive deconvolutional networks for mid and high level feature learning, in: Proc. IEEE International Conference on Computer Vision (ICCV), 2011.
[36] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149.
[37] B. Fischl, FreeSurfer, NeuroImage 62 (2) (2012) 774–781. Available: http://surfer.nmr.mgh.harvard.edu/.
[38] N. Tustison, et al., N4ITK: improved N3 bias correction, IEEE Trans. Med. Imaging (2010).
[39] D.W. Shattuck, M. Mirza, V. Adisetiyo, C. Hojatkashani, G. Salamon, et al., Construction of a 3D probabilistic atlas of human cortical structures, NeuroImage 39 (2008) 1064–1080.
[40] A. Klein, et al., Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI registration, NeuroImage 46 (3) (2009) 786–802.
[41] F. Chollet, et al., Keras, 2015. Available: https://github.com/fchollet/keras.
[42] M. Abadi, et al., TensorFlow: large-scale machine learning on heterogeneous distributed systems, arXiv preprint arXiv:1603.04467, 2016.
[43] B. Avants, C. Epstein, M. Grossman, J. Gee, Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain, Med. Image Anal. 12 (1) (2008) 26–41.
[44] J. Ashburner, K.J. Friston, Diffeomorphic registration using geodesic shooting and Gauss–Newton optimisation, NeuroImage 55 (3) (2011) 954–967.
[45] T. Vercauteren, X. Pennec, A. Perchant, N. Ayache, Diffeomorphic demons: efficient non-parametric image registration, NeuroImage 45 (1) (2009) S61–S72.
[46] J.W. Suh, et al., CT-PET weighted image fusion for separately scanned whole body rat, Med. Phys. 39 (1) (2012) 533–542.
[47] V. Arsigny, O. Commowick, X. Pennec, N. Ayache, A log-Euclidean framework for statistics on diffeomorphisms, in: Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2006, pp. 924–931.
