
Journal of Visual Communication and Image Representation 73 (2020) 102950
A view-free image stitching network based on a global homography


Lang Nie, Chunyu Lin *, Kang Liao, Meiqin Liu, Yao Zhao
Institute of Information Science, Beijing Jiaotong University, Beijing Key Laboratory of Advanced Information Science and Network, Beijing 100044, China

ARTICLE INFO

Keywords: 41A05, 41A10, 65D05, 65D17

ABSTRACT

Image stitching is a traditional but challenging computer vision task, aiming to obtain a seamless panoramic image. Recently, researchers have begun to study the image stitching task using deep learning. However, the existing learning methods assume a relatively fixed view during image capturing and thus show poor generalization to flexible-view cases. To address this problem, we present a cascaded view-free image stitching network based on a global homography. This novel image stitching network places no restriction on the views of the input images and is implemented in three stages. In particular, we first estimate a global homography between two input images from different views. Then we propose a structure stitching layer to obtain a coarse stitching result using the global homography. In the last stage, we design a content revision network to eliminate ghosting effects and refine the content of the stitching result. To enable efficient learning on various views, we also present a method to generate synthetic datasets for network training. Experimental results demonstrate that, compared with traditional methods, our method achieves almost 100% elimination of artifacts in overlapping areas at the cost of acceptable slight distortions in non-overlapping areas. In addition, the proposed method is view-free and more robust, especially in scenes where feature points are difficult to detect.

1. Introduction

Image stitching is a technology that can create a seamless panorama or high-resolution image by stitching images with overlapping parts. The images may be obtained at different moments, from different perspectives, or with different sensors. In recent years, it has received increasing attention and has become a popular topic in photographic graphics, surveillance video [1], VR [2], etc.

Classical image stitching follows these steps. First, a 3 × 3 homography matrix including translation, rotation, scaling and vanishing-point transformation is estimated after feature extraction and feature matching between a pair of images. Then the homography is utilized to warp the original image into alignment with the other one. Finally, the original image and the warped image are fused to get the stitching result. However, this basic algorithm must satisfy a basic assumption: the scene of the picture should be nearly planar [3]. In fact, the depth of the image content always varies, which violates this prior hypothesis. Therefore, it easily causes ghosting effects or misalignments in the overlapping parts of the stitched image. In order to mitigate ghosting effects and improve stitching quality, some existing image stitching algorithms calculate multiple content-aware local warpings [4–11] to align the overlapping parts of images, and some reduce the artifacts generated by the projective transformation by finding optimal seams [12–15] around objects. As for deep stitching methods, some [16–20] are qualified for stitching images from arbitrary views, but only some steps of their frameworks, such as feature extraction or feature matching, are achieved by deep learning, so they cannot be called complete deep image stitching models. Some other methods [21–23] are implemented entirely with deep learning, but they are specially designed for specific conditions, such as fixed views.

Different from these deep stitching methods, we aim to establish a complete deep learning model that can handle images captured from arbitrary views. In this paper, we present a cascaded view-free image stitching network based on the global homography, which can eliminate the ghosting effects as much as possible.

The overview of our approach is illustrated in Fig. 1(e). Specifically, the first stage is the homography estimation. Different from the existing deep homography estimations [24–26], the proportion of overlapping parts between the two images in our image stitching is much lower, which brings great challenges to the stitching performance. To address this problem, we introduce a global correlation layer [27,28] into this stage


This paper has been recommended for acceptance by Zicheng Liu.
* Corresponding author.
E-mail address: cylin@bjtu.edu.cn (C. Lin).

https://doi.org/10.1016/j.jvcir.2020.102950
Received 19 April 2020; Received in revised form 17 July 2020; Accepted 10 October 2020
Available online 4 November 2020
1047-3203/© 2020 Elsevier Inc. All rights reserved.

Fig. 1. The pipeline of our proposed deep stitching network and the dataset generation.

by calculating the global correlation between every position in the feature maps. In the second stage, we use the homography obtained in the first stage to warp the original image and perform average fusion on the warped images to get the result of structure stitching. The last stage is named the content revision stage. Based on the known structure prior, we design a content revision network to eliminate ghosting effects in overlapping areas and revise the content of the stitching to get perceptually natural results.

In addition, training a network to generate stitched images requires a sufficiently large training set. Inspired by DeTone et al. [24], we improve the fashion of constructing a synthetic dataset for deep homography estimation and propose to create a seemingly infinite dataset consisting of quadruples (IA, IB, f, Label) for image stitching. The pipeline of the dataset generation is depicted in Fig. 1 (a)-(d). Our specially designed network and dataset enable our algorithm to deal with image inputs of arbitrary views, which frees deep image stitching from the limitation of the image views.

In the experiments, we show that the proposed approach outperforms previous methods by a large margin, demonstrating its efficacy for deep image stitching. The contributions of this paper can be summarized as follows:

1) We present a view-free image stitching network. To the best of our knowledge, this is the first work that can tackle images from arbitrary views in the field of deep image stitching.
2) To eliminate the ghosting effect as much as possible, we design a global correlation layer and a structure-to-content gradual stitching module.
3) To enable a parallax-tolerant stitching algorithm, we construct a synthetic dataset for network training, which displays more challenging overlapping parts than previous methods.

The remainder of this paper is organized as follows: The related work is described in Section 2. The proposed view-free image stitching network and the constructed synthetic dataset are presented in Section 3 and Section 4, respectively. Section 5 describes the experiments. The conclusion and future directions are discussed in Section 6.

2. Related work

In this section, we briefly review traditional image stitching methods and deep image stitching methods.

2.1. Traditional image stitching

Among the traditional image stitching methods, some compute multiple content-aware local warpings to replace the single global warping and reach better local alignments [4–11], while others propose to find optimal seams around objects to reduce the artifacts generated by image fusion [12–15].

In the first group of methods, Gao et al. [4] find that aligning pictures by a single homography has obvious limitations and easily produces artifacts when the depth of a scene is inconsistent. By separating an image into foreground and background, this problem is alleviated by computing a dual-homography warping (DHW). Based on this inspiring work, as-projective-as-possible (APAP) warping [6] is proposed to divide the image into dense grids and align each grid with a unique local homography. Zaragoza et al. greatly reduce the ghosting effects caused by inconsistent depth at the cost of introducing geometric distortion in non-overlapping areas. Meanwhile, Chang et al. [7] propose a shape-preserving half-projective (SPHP) warping, in which the projective transformation of the non-overlapping regions is gradually reduced to a similarity transformation including only translation, rotation and scaling, to reduce the geometric distortion caused by the projective transformation. Combining APAP with SPHP, Lin et al. [10] present an adaptive as-natural-as-possible (ANAP) warping to produce more perceptually natural stitching results.

Seam-driven methods are also influential. In the work of Eden et al. [12], in addition to a data cost which constrains content consistency, a seam cost is introduced to enhance smooth transitions in overlapping areas. Since we look forward to a seamless stitching picture, seam-cutting can be a crucial step for obtaining a perceptually seamless result. A seam-driven image stitching strategy [15] was proposed that finds the best homography through seam-cutting. Then, Zhang et al. [13] propose an alignment algorithm to find an optimal seam for stitching images, after which a multi-band blending algorithm is used to get the final stitching results. Based on this work, Lin et al. [14] improved the seam-driven


method by adding constraints that preserve curve and line structures to realize parallax-tolerant image stitching.

2.2. Deep image stitching

Compared with traditional image stitching methods, deep stitching methods are still developing. Some works [16–18] replace traditional feature detection with CNNs to pursue better stitching performance, while others use neural networks to learn to match features [19] or to estimate transformation parameters from detected features [20]. Besides these, there are some complete deep learning methods which are specially designed for specific conditions, such as fixed views [21,23] or fixed views with fisheye cameras [22]. As far as we know, there is currently no complete deep learning method that can perform view-free image stitching.

3. Our method

To stitch images from arbitrary views, we propose a view-free image stitching network based on the global homography. As shown in Fig. 1 (e), our approach consists of three stages, each of which plays a unique role in the image stitching; the specific network structures are shown in Fig. 2 and Fig. 3. In this section, we first describe the proposed homography estimation stage. Then, we present the structure stitching stage and the content revision stage. Finally, the loss functions of the view-free image stitching network are formulated.

3.1. Homography estimation stage

Stitching images with large parallax directly using deep learning is very difficult, because it integrates the tasks of feature detection, feature matching, homography estimation, image registration, and image fusion into CNNs. Therefore, to free the CNNs from such complicated tasks, a pre-trained homography estimation network is necessary. In this stage, we aim to obtain the projective transformation between images, which can provide alignment information for the subsequent stitching work. More importantly, this stage is crucial to enable our network to perform view-free image stitching.

Fig. 2. The process of the homography estimation stage.

The steps of traditional homography estimation include feature extraction, feature matching, and homography solving by the direct linear transformation (DLT) algorithm [29]. DLT serves as a basic approach to estimating the projective transformation from a set of matched points between two images captured from different views. In contrast, most deep homography estimation schemes [24–26] are accomplished simply by convolutional neural networks, which serve as excellent feature extractors but relatively lack the ability of feature matching. Considering that both homography estimation and optical flow estimation [28,30–32] can be cast as a problem of establishing dense correspondences between a pair of images, we introduce the global correlation layer [27,28,32] usually used in optical flow estimation into our work to enhance the feature matching part.

Consistent with other homography estimations [24–26], we use grayscale images (G_A, G_B) as the network inputs. First, a feature extractor with shared weights is used to learn feature representations and reduce the dimension of the feature maps in the blue part of Fig. 2, where every small blue block includes two convolutional layers and a max-pooling layer. After performing L2 normalization on the feature maps F_A^l, F_B^l \in \mathbb{R}^{W_l \times H_l \times C_l}, a global correlation layer is adopted to learn the feature-wise global similarities CV^l \in \mathbb{R}^{W_l \times H_l \times (W_l \times H_l)} between the two feature maps. For every position in F_A^l, we calculate its correlation with each position in F_B^l as follows:

CV^l(x_1, x_2) = \frac{\langle F_A^l(x_1), F_B^l(x_2) \rangle}{\lVert F_A^l(x_1) \rVert \, \lVert F_B^l(x_2) \rVert}, \quad x_1, x_2 \in \mathbb{Z}^2,   (1)

where x_1 and x_2 define locations in the feature maps and F_A^l(x_1) represents the one-dimensional feature vector at location x_1 in feature map F_A^l. As shown in the above formula, CV^l(x_1, x_2) varies between 0 and 1; the more similar F_A^l(x_1) and F_B^l(x_2) are, the larger CV^l(x_1, x_2) is, which also means the better the features match.
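To make Eq. (1) concrete, the following PyTorch-style sketch computes the global correlation volume between two feature maps. The function name and tensor layout are our illustrative choices, not the paper's released code.

import torch
import torch.nn.functional as F

def global_correlation(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Cost volume of Eq. (1): entry [b, y1, x1, y2, x2] is the cosine
    similarity between position (y1, x1) in feat_a and (y2, x2) in feat_b.
    feat_a, feat_b: feature maps of shape (B, C, H, W)."""
    b, c, h, w = feat_a.shape
    # L2-normalize each feature vector so the dot product is a cosine similarity.
    fa = F.normalize(feat_a.view(b, c, h * w), dim=1)  # (B, C, HW)
    fb = F.normalize(feat_b.view(b, c, h * w), dim=1)  # (B, C, HW)
    corr = torch.bmm(fa.transpose(1, 2), fb)           # (B, HW, HW)
    return corr.view(b, h, w, h, w)

In practice, the (B, HW, HW) volume can be reshaped to (B, HW, H, W) and fed to the regression network described below.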
Subsequently, a regression network composed of three convolutional layers and two fully connected layers is used to process CV^l and predict the offsets f, which correspond one-to-one with the homography and will be explained in detail in Section 4. Finally, we transform the predicted offsets into the corresponding homography H in tensor form through the DLT algorithm.

The estimated global homography provides global alignment information that can be easily utilized by the subsequent structure stitching stage. With this homography, we can transform image pairs with large parallax into image pairs with small parallax. By converting the problem of stitching images with large parallax to that of stitching images with small parallax, the difficulty of image stitching by CNNs is significantly reduced.
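To make the offsets-to-homography step concrete, the sketch below solves H from the four vertex correspondences implied by the predicted offsets, using the standard DLT parameterization with H[2, 2] = 1. This is our own minimal formulation; the paper's tensor DLT may differ in details such as batching and normalization.

import torch

def dlt_homography(src_pts, dst_pts) -> torch.Tensor:
    """Solve the homography H (with H[2, 2] = 1) mapping src_pts to dst_pts.
    src_pts, dst_pts: four (x, y) pairs, e.g. the corners of the input
    patch and the corners shifted by the predicted offsets f."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        # Each correspondence yields two linear equations in the 8 unknowns.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs += [u, v]
    A = torch.tensor(rows, dtype=torch.float64)
    b = torch.tensor(rhs, dtype=torch.float64)
    h = torch.linalg.solve(A, b)  # the 8 free homography parameters
    return torch.cat([h, torch.ones(1, dtype=torch.float64)]).view(3, 3)

For a 128 × 128 input patch, src_pts would be [(0, 0), (127, 0), (127, 127), (0, 127)] and dst_pts the same corners displaced by the predicted (f_ix, f_iy).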

Fig. 3. Structure stitching stage (left): roughly stitching the images in structure with the global homography. Content revision stage (right): revising the content of the image to enhance fusion quality.


3.2. Structure stitching stage

Training a stitching network from scratch without a structure prior is very difficult, as it needs to learn the structure information and content details of the stitched images simultaneously. To solve this problem, we divide the remaining stitching work into two parts: structure stitching and content revision. We determine the coarse contour of the stitched image at this stage, which provides important structural prior information for the subsequent content revision stage.

To realize structure stitching, we improve the Spatial Transformer Layer (STL) introduced in [33] and propose a Structure Stitching Layer (SSL) to obtain the structure information of the stitched images. This part is accomplished in four steps, as shown in Fig. 3 (left).

First of all, a grid with the same size as the stitching label is generated for each image input, and every element in the grid represents its 2-D spatial location (u, v). Second, we calculate the original coordinates in I_A and I_B corresponding to these grid locations as (x/z, y/z) by Eq. (2):

\begin{bmatrix} x \\ y \\ z \end{bmatrix} = H^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix},   (2)

where H represents the projective transformation from the perspective of I_B to that of I_A, and (x, y, z) are the homogeneous coordinates of the original images. We obtain the corresponding coordinates of I_BW by Eq. (2), while getting those of I_AW by replacing H with the identity matrix E in the same equation.

In the third step, we obtain the smooth warped images directly by bilinear interpolation. At last, a structure stitching result is generated by performing average fusion on I_AW and I_BW. Concretely, the pixel value of the overlapping area is equal to the sum of the pixel values of I_AW and I_BW, weighted by 0.5.

In our whole stitching model, SSL serves as the bridge that transforms the estimated H into the structure stitching result used by the content revision network. However, the processes of solving the inverse of H and of bilinear interpolation in SSL may affect the backpropagation of gradients during training. To avoid this problem, we train our homography estimation network and the remaining parts separately, without joint optimization. In this way, as a module combining spatial transformation and image fusion without trainable parameters, SSL has no impact on backpropagation no matter which part of the network is trained.
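A minimal PyTorch sketch of SSL follows, assuming H maps view B's coordinates into the canvas of view A's perspective, matching Eq. (2); grid conventions such as align_corners are our own choices rather than the paper's exact settings.

import torch
import torch.nn.functional as F

def structure_stitch(img_a, img_b, H, out_h, out_w):
    """Structure Stitching Layer sketch: warp each input onto the
    label-sized canvas via Eq. (2), then average-fuse the warped images.
    img_a, img_b: tensors of shape (1, C, h, w); H: a (3, 3) tensor."""
    ys, xs = torch.meshgrid(
        torch.arange(out_h, dtype=torch.float32),
        torch.arange(out_w, dtype=torch.float32),
        indexing="ij",
    )
    # Homogeneous canvas coordinates (u, v, 1) for every grid location.
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)  # (out_h, out_w, 3)

    def warp(img, M):
        # Eq. (2): map canvas locations back into source coordinates.
        src = grid @ torch.linalg.inv(M).T
        src = src[..., :2] / (src[..., 2:3] + 1e-8)
        h, w = img.shape[-2:]
        # Normalize to [-1, 1] for grid_sample; out-of-range samples become zeros.
        norm = torch.stack([2 * src[..., 0] / (w - 1) - 1,
                            2 * src[..., 1] / (h - 1) - 1], dim=-1)
        return F.grid_sample(img, norm.unsqueeze(0), mode="bilinear",
                             padding_mode="zeros", align_corners=True)

    warped_a = warp(img_a, torch.eye(3))  # identity warp E for I_A
    warped_b = warp(img_b, H)             # estimated homography for I_B
    return 0.5 * (warped_a + warped_b)    # average fusion

Note that plain 0.5 averaging halves the intensity where only one warped image is valid; a mask-based weighting for non-overlapping pixels is a natural refinement (our note, not a detail stated in the paper).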
3.3. Content revision stage

Based on the fact that mismatched features may cause errors in the homography estimation, ghosting effects are inevitable in the overlapping areas of the structure stitching images. Therefore, a content revision stage is proposed to eliminate artifacts in overlapping regions while keeping the content of the non-overlapping areas free from significant distortion.

To reach this goal, we design an encoder-decoder network to eliminate artifacts and revise the content of the stitching result. As shown in Fig. 3 (right), this module first extracts features from the structure stitching result with 8 convolutional layers, where the number of filters per layer is set to 64, 64, 128, 128, 256, 256, 512 and 512, respectively. With the number of convolution kernels increasing exponentially, the encoder decomposes the structure stitching result from a three-channel RGB image into multi-channel feature representations, and finally into a representation of feature bases. To reduce the computational load, a 2 × 2 max-pooling layer is adopted to reduce the dimensions of the feature maps after the 2nd, 4th and 6th convolutional layers. As for the decoder, its structure can be treated as a mirror of the encoder's, with three channels in the last layer. This decoder is designed to reorganize the feature bases into more complex feature representations and recombine them into the desired stitching result. Furthermore, to prevent the gradient vanishing problem and the information imbalance in each layer [34], we add skip connections between the low-level and high-level features with the same resolution.
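As a structural reference, a sketch of this encoder-decoder follows. The paper specifies the encoder widths, pooling positions, three output channels and skip connections; the decoder's upsampling operator and exact layer arrangement are our assumptions.

import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

class ContentRevision(nn.Module):
    """Content revision sketch: 8 encoder convs (64-64-128-128-256-256-512-512),
    2x2 max-pooling after the 2nd, 4th and 6th convs, a mirrored decoder,
    and skip connections between same-resolution features."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(conv_block(3, 64), conv_block(64, 64))
        self.enc2 = nn.Sequential(conv_block(64, 128), conv_block(128, 128))
        self.enc3 = nn.Sequential(conv_block(128, 256), conv_block(256, 256))
        self.enc4 = nn.Sequential(conv_block(256, 512), conv_block(512, 512))
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec3 = nn.Sequential(conv_block(512 + 256, 256), conv_block(256, 256))
        self.dec2 = nn.Sequential(conv_block(256 + 128, 128), conv_block(128, 128))
        self.dec1 = nn.Sequential(conv_block(128 + 64, 64), conv_block(64, 64))
        self.out = nn.Conv2d(64, 3, 3, padding=1)  # three channels in the last layer

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up(e4), e3], dim=1))  # skip connection
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return self.out(d1)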
3.4. Loss functions

In essence, deep homography networks [24–26] can be seen as regression tasks solving the eight parameters of a homography. Even in unsupervised estimation methods [25,26], the homography must be predicted before the unsupervised loss function can be calculated. Therefore, in our work, we directly take a simpler but more effective supervised approach to estimate the homography, minimizing the L2 distance between the predicted offsets \hat{f} and their ground truth f as follows:

L_H(\hat{f}, f) = \frac{1}{N} \lVert \hat{f} - f \rVert_2^2,   (3)

where N defines the number of components in the offsets \hat{f}.

In order to constrain the structure of the stitched images to be as close to that of the labels as possible, an L1 loss is applied at corresponding positions between outputs and labels:

L_S(\hat{I}, I) = \frac{1}{W \times H \times C} \lVert \hat{I} - I \rVert_1,   (4)

where \hat{I} and I denote the stitching result and the stitching label, respectively, and W, H and C define the width, height and channel number of the stitching result.

Inspired by [35,36], we also define a content loss based on VGG-19 [37] that encourages outputs and labels to have similar feature representations. With the content loss, the artifacts and the discontinuities at image seams, which can easily lead to dramatic changes in image features, can be significantly reduced. Let \psi_j(\cdot) be the feature map obtained from the j-th convolutional layer in VGG-19; the content loss is defined as:

L_C(\hat{I}, I) = \frac{1}{W_j \times H_j \times C_j} \lVert \psi_j(\hat{I}) - \psi_j(I) \rVert_2^2,   (5)

where W_j, H_j and C_j denote the width, height, and channel number of the feature map, respectively.

In general, L_H is used to constrain the homography estimation, while both L_S and L_C are used to constrain the stitching results to be as close to the ground truth as possible.
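The three losses translate directly into code. A sketch follows; the specific VGG-19 layer used for ψ_j and the pretrained-weight handling are our assumptions, since the paper does not pin down j here.

import torch.nn.functional as F
import torchvision

# Frozen VGG-19 features up to an intermediate conv layer (our choice of j).
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:18].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def loss_H(f_hat, f):
    # Eq. (3): mean squared error over the N offset components.
    return F.mse_loss(f_hat, f)

def loss_S(i_hat, i):
    # Eq. (4): per-element L1 distance between output and label.
    return F.l1_loss(i_hat, i)

def loss_C(i_hat, i):
    # Eq. (5): L2 distance between VGG-19 feature maps of output and label.
    return F.mse_loss(vgg(i_hat), vgg(i))

Because F.mse_loss and F.l1_loss average over all elements, they match the 1/N and 1/(W × H × C) normalizations in Eqs. (3)-(5).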
4. Dataset generation

Training neural networks usually requires a large amount of data. However, the existing datasets for image stitching contain only a small number of samples, which is not adequate for deep learning methods. To meet this requirement, we improve the method of generating a synthesized dataset for deep homography estimation [25] and propose a method that can generate seemingly infinite data for image stitching from any existing real image dataset, such as Microsoft COCO [38]. The pipeline of the dataset generation is illustrated in Fig. 1 (a)-(d), and the details of this process are described as follows.

In our work, each real image generates a quadruple (I_A, I_B, f, Label), of which I_A and I_B are a pair of images with overlapping areas from different perspectives, Label is the ground truth of the stitching result, and f represents the 8 offsets corresponding to a homography with 8 degrees of freedom. Compared with the pure homography estimation works [24–26], the overlapping area of our image pairs is significantly smaller.


In particular, before perturbing the four vertices of the blue square (p_{ix}, p_{iy} \in [-p, p]) to get different perspectives, we first perform random translation operations (t_x, t_y \in [-t, t]) to ensure that only a small area of the two images overlaps. Then f, the total offsets of the four vertices from the green square (Fig. 1 (a)(b)(c)), can be calculated by the following equations:

f_{ix} = p_{ix} + t_x, \quad f_{iy} = p_{iy} + t_y,   (6)

where i \in \{1, 2, 3, 4\} indexes the different points of the square, and f_{ix} and f_{iy} represent the offsets of point i in the x and y directions, respectively.

For convolutional neural networks, once we fix the size of the inputs, the size of the outputs is also determined. However, when it comes to image stitching, the size of the stitching result varies according to the different perspectives of an image pair, in spite of the identical image size. To resolve this contradiction, we set a maximum size for the stitching labels to ensure that all the labels can be completely included in the canvas. The maximum size can be calculated by expanding t + p pixels on the top, bottom, left and right of the image patch I_A. Finally, we obtain the perceptually seamless and natural stitching label by cropping the raw image to the preset size and filling the area outside the green and red squares with the value 0, as shown in Fig. 1 (e).
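The offset sampling behind Eq. (6) is simple to sketch; together with the DLT solve from Section 3.1, it yields the homography relating I_A and I_B. The uniform integer sampling is our assumption.

import random

def sample_offsets(t=64, p=25):
    """Total vertex offsets f of Eq. (6): one global translation (tx, ty)
    shared by all four vertices plus an independent per-vertex
    perturbation (pix, piy). t and p follow our training settings."""
    tx, ty = random.randint(-t, t), random.randint(-t, t)
    return [(random.randint(-p, p) + tx, random.randint(-p, p) + ty)
            for _ in range(4)]  # (f_ix, f_iy) for i = 1, ..., 4

Applying dlt_homography to the patch corners and the corners shifted by these offsets gives the warp used to extract I_B and to render the label canvas.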
5. Experiments

In this section, the presented homography estimation module is first compared with other deep homography estimation methods, to verify its ability to align images with large parallax. Furthermore, we conduct a comparison experiment between our approach and classic image stitching algorithms. Finally, an ablation study is performed to verify the efficiency of each component.

5.1. Implementation details

Since there is no benchmark dataset in the field of image stitching, we obtain a synthetic dataset by the method proposed in Section 4. In detail, we select 50,000 real images larger than 480 × 360 from MS-COCO to generate the training set, while we extract 1,000 to generate the test set. The size of the network inputs (I_A, I_B) is 128 × 128, while the size of the label is set to 304 × 304 to cover all the stitching results from arbitrary views. Considering that the overlap rate of the image inputs can be very small in image stitching, we set the maximum translation parameter t = 64 and the maximum perturbation parameter p = 25, which makes the overlap rate range from 20% to 100% in our dataset.

Subsequently, we train our network in two steps. First, we train only our homography estimation network by minimizing L_H for 50 epochs. The Adam optimizer [39] is adopted with a learning rate varying between 0.0002 and 0.00002. Second, the rest of our network is trained by simultaneously minimizing L_S and L_C with almost the same training strategy as in the previous step. In fact, in preliminary experiments we intended to train the whole network in the second step, expecting further optimization of the homography estimator under the constraints of the structure loss and content loss. However, the experimental results demonstrated that the accuracy of the homography estimation decreases when the gradient is back-propagated from the content revision module to the homography estimation module; thus, we train these two parts separately instead.
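The two-step schedule can be summarized by the skeleton below. It is schematic rather than runnable in isolation: homography_net, revision_net, solve_H and the data loaders stand in for the components described above, and the equal weighting of L_S and L_C is our assumption.

import torch

# Step 1: homography estimation network alone, minimizing L_H for 50 epochs.
opt_h = torch.optim.Adam(homography_net.parameters(), lr=2e-4)  # decayed toward 2e-5
for epoch in range(50):
    for g_a, g_b, f in gray_loader:            # grayscale inputs, GT offsets
        loss = loss_H(homography_net(g_a, g_b), f)
        opt_h.zero_grad(); loss.backward(); opt_h.step()

# Step 2: freeze the estimator; train the rest with L_S + L_C.
for param in homography_net.parameters():
    param.requires_grad_(False)
opt_r = torch.optim.Adam(revision_net.parameters(), lr=2e-4)
for epoch in range(50):
    for img_a, img_b, label in rgb_loader:
        with torch.no_grad():
            H = solve_H(homography_net, img_a, img_b)   # offsets -> DLT -> H
        coarse = structure_stitch(img_a, img_b, H, label.shape[-2], label.shape[-1])
        out = revision_net(coarse)
        loss = loss_S(out, label) + loss_C(out, label)
        opt_r.zero_grad(); loss.backward(); opt_r.step()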
5.2. Comparison with deep homography estimations

Existing deep homography estimations [24–26] have achieved promising and robust results. We note that both the supervised method [24] and the unsupervised method [25] adopt almost the same VGG-style architecture for the homography prediction; the only difference lies in their loss functions. But no matter what the loss function is, they perform almost the same in the experiments of [25]. Another thing to note is that the datasets they use [24,25] are quite different from ours. The main difference is that the overlap rate of the image pairs in their datasets ranges from 60% to 100%, while in our dataset the minimum overlap rate can be 20%. Thus, it is unfair to compare our work with their existing experimental results directly. Having observed this fact, we only compare the performance of the homography estimation across the different methods. To achieve this goal, we strictly control all conditions except the network structure to be completely consistent. These conditions include the dataset, the data pre-processing and augmentation, the choice of optimizer, the learning rate, the number of iterations, etc.

In this experiment, we evaluate our model on 1,000 test samples. In terms of the different overlap rates, we divide the test data into 8 categories to explore the average RMSE of the offsets predicted by the different models, as shown in Fig. 4. As we can see, the VGG-style network, which is adopted by the supervised [24] and unsupervised [25] methods, performs nearly 3× worse than our approach. With the decrease of the overlap rate, the RMSE of the VGG-style model also grows significantly faster than that of our model, which shows that our model is more robust in the large-parallax stitching case.

Fig. 4. A histogram of the average RMSE on our synthetic dataset. Compared with the VGG-style network that existing homography estimations [24,25] adopt, our network structure predicts the homography with smaller RMSE and is more robust to large displacements (low overlap rates).

In order to show the performance of the different models more intuitively, we also implement a set of alignment experiments on the eight categories. Fig. 5 shows some results, in which (b) is warped to align with (a) using the estimated homography. Compared with other learning methods, our results are closer to the ground truth in all cases. Obviously, the lower the overlap rate, the smaller the size of the warped images should be. However, when the overlap rate is below 50%, the size of the warped images obtained by the VGG-style methods is significantly more than 50% of the image size, which implies that these methods have a poor alignment ability at low overlap rates. In contrast, benefiting from the presented global correlation layer, our model still maintains considerable alignment performance at a low overlap rate.
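For reference, the metric in Fig. 4 can be computed as below; the exact reduction order is our reading of "average RMSE".

import torch

def average_rmse(f_hat: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
    """Per-sample RMSE over the 8 offset components, averaged over the
    batch; f_hat and f have shape (B, 8)."""
    return torch.sqrt(((f_hat - f) ** 2).mean(dim=1)).mean()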
5.3. Comparison with image stitching algorithms

The existing deep stitching method [23] stitches images from fixed views, while our approach works with arbitrary views. Therefore, it is not appropriate to compare their work with ours. Instead, we selected several representative image stitching algorithms for comparison. To be specific, we divide our test set into 4 categories according to the image content: repetitive patterns, background, far scene, and close scene. Then we compare our method with the traditional baseline, APAP [6], SPHP [7], and robust ELA [11]. The traditional baseline method consists of four steps: SIFT [40], RANSAC [41], DLT, and average fusion; it will be simply called the baseline in the following text for the sake of brevity. The results are illustrated in Fig. 6.


Fig. 5. Alignment performance of different models at different overlap rates. (a)(b): Image inputs used to predict a homography. (c): The alignment results of the VGG-style deep homography methods [24,25]. (d): The alignment results of the proposed deep homography method. (e): The ground truth alignment results. The bottom number indicates the overlap rate between (a) and (b).

Fig. 6. Comparison with existing image stitching algorithms. We stitch I_A and I_B to get a stitching result close to the ground truth using the baseline, APAP, robust ELA, SPHP, and our algorithm. Each row represents a different stitching scene.


As we can see, the results of the previous methods exhibit varying degrees of ghosting effects in the overlapping areas. On the contrary, our method can achieve almost 100% elimination of artifacts in overlapping areas at the cost of acceptable slight distortions in non-overlapping areas. This advantage benefits from our structure-to-content gradual stitching module, which makes full use of the structure prior given by the global homography and of the fine content revision network. Moreover, if we pay more attention to the content circled in red in Fig. 6, it is easy to see that our results are more perceptually natural, with more delicate texture, since the content revision eliminates the image blur caused by bilinear interpolation.

Another advantage is that our method is more robust, especially in scenes where feature corners are difficult to detect or the textures around feature points are too similar. Fig. 7 shows a failure case of the baseline, where 'RANSAC' denotes the intermediate result of the baseline after feature matching and the RANSAC algorithm. The red dots indicate mismatched points removed by RANSAC, and the green dot pairs represent correct matching points. As illustrated in Fig. 7, there is still a mismatch in the results after RANSAC, such as the leftmost green feature point, which directly leads to baseline results far from the ground truth. As for the other stitching methods, the results are similar to the baseline's. Different from extracting features with traditional algorithms with fixed parameters, our method extracts features by learning trainable convolutional filters, which have proven to perform remarkably in other similar fields such as optical flow estimation [28,30–32]. Thus, even under circumstances where traditional methods cannot detect sufficient feature points, our view-free network still shows quite reliable robustness.

Fig. 7. A failure case of the baseline. (i): Image inputs to be stitched. (ii): Our stitching result and the ground truth. (iii): Stitching result of the baseline and the process of RANSAC.

Table 1
Ablation studies on homography estimation, coarse stitching and refined stitching. The data represent the average PSNR and SSIM on 1,000 test sets.

            Architecture        PSNR      SSIM
  w/o HE    STL + CR            18.3225   0.8684
  w/o CR    VGG-HE + SSL        17.8983   0.8083
            GC-HE + SSL         20.7950   0.8445
  w/o SSL   GC-HE + STL + CR    23.8677   0.9118
  Ours      GC-HE + SSL + CR    24.8525   0.9241

5.4. Ablation study

Homography Estimation (HE): The first step of our deep image stitching is the homography estimation, which consists of feature extraction, feature matching, offset prediction, and solving the DLT. With the help of homography estimation, we obtain prior information about the image alignment that equips our network with the ability of view-free image stitching. To verify the importance of this module, we compare the performance with and without homography estimation, as shown in Table 1 ('w/o HE') and Fig. 8. Concretely, without the sufficient structural prior information provided by HE, it is difficult for a CNN to learn the process of view-free interpolation from the view of I_B to that of I_A for the non-overlapping areas; compare Fig. 8 'w/o HE' with 'Ours'. Moreover, we also divide HE into VGG-style homography estimation (VGG-HE) and global correlation homography estimation (GC-HE) according to the different network structures. In Section 5.2, we verified the performance of these two structures on homography estimation and image alignment. In the subsequent experiments, we also evaluate their effects on image stitching.

Structure Stitching Layer (SSL): SSL is used to stitch the structure of the images, and it helps to obtain the basic contour of the stitched image while ignoring possible ghosting effects. SSL is composed of the Spatial Transformer Layer (STL) and average fusion in tensor form. When we train the network with SSL, we obtain more prior information about the contour of the stitched image, and thus get more perceptually precise stitching results. As illustrated in Fig. 8, 'w/o SSL' and 'Ours' are similar in image content, while the contour of ours is smoother and closer to the ground truth, especially in the part circled in red. Therefore, ours achieves higher PSNR and SSIM, which means that our results have lower noise error and higher structural similarity, as shown in Table 1.

Fig. 8. Ablation experiments to verify the effectiveness of homography estimation (HE), the structure stitching layer (SSL) and the content revision network (CR) for image stitching.
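The PSNR/SSIM pairs in Table 1 can be reproduced per test sample with standard implementations such as scikit-image; the uint8 data range and channel axis below are our assumptions.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred: np.ndarray, label: np.ndarray):
    """PSNR and SSIM of one stitching result against its label.
    pred and label: uint8 RGB arrays of identical shape."""
    psnr = peak_signal_noise_ratio(label, pred)
    ssim = structural_similarity(label, pred, channel_axis=-1)
    return psnr, ssim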

7
L. Nie et al. Journal of Visual Communication and Image Representation 73 (2020) 102950

Content Revision Network (CR): The content revision network refines the structure stitching result obtained in the previous stage. For the overlapping parts of the images, this part of the network can remove almost all ghosting effects; for the non-overlapping areas, it can reshape the contour and structure of the stitched image to make the final results more natural at the seams. Table 1 and Fig. 8 also show the results with and without CR. Compared with our results, there are obvious ghosting effects in the stitching results without CR, where 'VGG-HE' produces more artifacts than 'GC-HE' since GC-HE estimates the homography with higher accuracy. This part is so important that the quality of the stitched images without CR is significantly worse than ours, as Table 1 shows, while a CR network alone has no effect. Therefore, only a complete pipeline that includes all three stages of HE, SSL, and CR can achieve natural view-free image stitching.

6. Conclusion

In this paper, we presented a view-free image stitching architecture that can be divided into 3 stages: homography estimation, structure stitching, and content revision. In the homography estimation stage, we extract features, match features, and predict a homography by convolutional layers, a global correlation layer, and a regression network, respectively. In the second stage, the homography is used to perform structure stitching by SSL to determine the coarse contour of the stitching results. In the content revision stage, we refine the content of the image while keeping the contour of the stitching result free from distortion. Moreover, a scheme for generating a dataset for view-free image stitching is proposed to enable efficient learning on various views. Compared with existing image stitching algorithms, our view-free image stitching network can achieve almost 100% elimination of artifacts in overlapping areas and is more robust, with significantly reduced ghosting effects, especially in scenes where feature corners are difficult to detect.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities (2018JBM011) and the National Natural Science Foundation of China (No. 61772066, No. 61972028).

References

[1] V.R. Gaddam, M. Riegler, R. Eg, C. Griwodz, P. Halvorsen, Tiling in interactive panoramic video: Approaches and evaluation, IEEE Trans. Multimedia 18 (2016) 1819–1831.
[2] R. Anderson, D. Gallup, J.T. Barron, J. Kontkanen, N. Snavely, C. Hernández, S. Agarwal, S.M. Seitz, Jump: Virtual reality video, ACM Trans. Graph. (TOG) 35 (2016) 1–13.
[3] Z. Zhu, E.M. Riseman, A.R. Hanson, Parallel-perspective stereo mosaics, in: Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), vol. 1, IEEE, 2001, pp. 345–352.
[4] J. Gao, S.J. Kim, M.S. Brown, Constructing image panoramas using dual-homography warping, in: CVPR 2011, IEEE, 2011, pp. 49–56.
[5] W.-Y. Lin, S. Liu, Y. Matsushita, T.-T. Ng, L.-F. Cheong, Smoothly varying affine stitching, in: CVPR 2011, IEEE, 2011, pp. 345–352.
[6] J. Zaragoza, T.-J. Chin, M.S. Brown, D. Suter, As-projective-as-possible image stitching with moving DLT, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2339–2346.
[7] C.-H. Chang, Y. Sato, Y.-Y. Chuang, Shape-preserving half-projective warps for image stitching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3254–3261.
[8] C.-H. Chang, Y.-Y. Chuang, A line-structure-preserving approach to image resizing, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 1075–1082.
[9] Y.-S. Chen, Y.-Y. Chuang, Natural image stitching with the global similarity prior, in: European Conference on Computer Vision, Springer, 2016, pp. 186–201.
[10] C.-C. Lin, S.U. Pankanti, K. Natesan Ramamurthy, A.Y. Aravkin, Adaptive as-natural-as-possible image stitching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1155–1163.
[11] J. Li, Z. Wang, S. Lai, Y. Zhai, M. Zhang, Parallax-tolerant image stitching based on robust elastic warping, IEEE Trans. Multimedia 20 (2017) 1672–1687.
[12] A. Eden, M. Uyttendaele, R. Szeliski, Seamless image stitching of scenes with large motions and exposure differences, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 2, IEEE, 2006, pp. 2498–2505.
[13] F. Zhang, F. Liu, Parallax-tolerant image stitching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3262–3269.
[14] K. Lin, N. Jiang, L.-F. Cheong, M. Do, J. Lu, SEAGULL: Seam-guided local alignment for parallax-tolerant image stitching, in: European Conference on Computer Vision, Springer, 2016, pp. 370–385.
[15] J. Gao, Y. Li, T.-J. Chin, M.S. Brown, Seam-driven image stitching, in: Eurographics (Short Papers), 2013, pp. 45–48.
[16] V.-D. Hoang, D.-P. Tran, N.G. Nhu, V.-H. Pham, et al., Deep feature extraction for panoramic image stitching, in: Asian Conference on Intelligent Information and Database Systems, Springer, 2020, pp. 141–151.
[17] Z. Shi, H. Li, Q. Cao, H. Ren, B. Fan, An image mosaic method based on convolutional neural network semantic features extraction, J. Signal Process. Syst. 92 (2020) 435–444.
[18] L. Wang, W. Yu, B. Li, Multi-scenes image stitching based on autonomous driving, in: 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), vol. 1, IEEE, 2020, pp. 694–698.
[19] T.A. Alzohairy, E. El-Dein, et al., Image mosaicing based on neural networks, Int. J. Comput. Appl. 975 (2016) 8887.
[20] M. Yan, Q. Yin, P. Guo, Image stitching with single-hidden layer feedforward neural networks, in: 2016 International Joint Conference on Neural Networks (IJCNN), IEEE, 2016, pp. 4162–4169.
[21] C. Shen, X. Ji, C. Miao, Real-time image stitching with convolutional neural networks, in: 2019 IEEE International Conference on Real-time Computing and Robotics (RCAR), IEEE, 2019, pp. 192–197.
[22] J. Li, Y. Zhao, W. Ye, K. Yu, S. Ge, Attentive deep stitching and quality assessment for 360° omnidirectional images, IEEE J. Select. Top. Signal Process. 14 (2019) 209–221.
[23] W.-S. Lai, O. Gallo, J. Gu, D. Sun, M.-H. Yang, J. Kautz, Video stitching for linear camera arrays, arXiv preprint arXiv:1907.13622 (2019).
[24] D. DeTone, T. Malisiewicz, A. Rabinovich, Deep image homography estimation, arXiv preprint arXiv:1606.03798 (2016).
[25] T. Nguyen, S.W. Chen, S.S. Shivakumar, C.J. Taylor, V. Kumar, Unsupervised deep homography: A fast and robust homography estimation model, IEEE Robot. Autom. Lett. 3 (2018) 2346–2353.
[26] J. Zhang, C. Wang, S. Liu, L. Jia, J. Wang, J. Zhou, Content-aware unsupervised deep homography estimation, arXiv preprint arXiv:1909.05983 (2019).
[27] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, M. Gelautz, Fast cost-volume filtering for visual correspondence and beyond, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2012) 504–511.
[28] D. Sun, X. Yang, M.-Y. Liu, J. Kautz, PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
[29] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2003.
[30] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, T. Brox, FlowNet: Learning optical flow with convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.
[31] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox, FlowNet 2.0: Evolution of optical flow estimation with deep networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2462–2470.
[32] P. Truong, M. Danelljan, R. Timofte, GLU-Net: Global-local universal network for dense flow and correspondences, arXiv preprint arXiv:1912.05524 (2019).
[33] M. Jaderberg, K. Simonyan, A. Zisserman, et al., Spatial transformer networks, in: Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
[34] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[35] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: European Conference on Computer Vision, Springer, 2016, pp. 694–711.
[36] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[37] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.


[39] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[40] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2004) 91–110.
[41] M.A. Fischler, R.C. Bolles, Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (1981) 381–395.
