ARTICLE INFO

Keywords: 41A05, 41A10, 65D05, 65D17

ABSTRACT

Image stitching is a traditional but challenging computer vision task, aiming to obtain a seamless panoramic image. Recently, researchers have begun to study the image stitching task using deep learning. However, the existing learning methods assume a relatively fixed view during image capturing and thus generalize poorly to flexible-view cases. To address this problem, we present a cascaded view-free image stitching network based on a global homography. This novel image stitching network places no restriction on the views of the input images and is implemented in three stages. In particular, we first estimate a global homography between two input images from different views. We then propose a structure stitching layer to obtain a coarse stitching result using the global homography. In the last stage, we design a content revision network to eliminate ghosting effects and refine the content of the stitching result. To enable efficient learning on various views, we also present a method to generate synthetic datasets for network training. Experimental results demonstrate that, compared with traditional methods, our method achieves almost 100% elimination of artifacts in overlapping areas at the cost of acceptable slight distortions in non-overlapping areas. In addition, the proposed method is view-free and more robust, especially in scenes where feature points are difficult to detect.
1. Introduction

Image stitching is a technology that can create a seamless panorama or high-resolution image by stitching images with overlapping parts. The images may be obtained at different moments, from different perspectives, or from different sensors. In recent years, it has received increasing attention and has become a popular topic in photographic graphics, surveillance video [1], VR [2], etc.

Classical image stitching follows these steps. First, a 3 × 3 homography matrix encoding translation, rotation, scaling, and vanishing-point transformation is estimated after feature extraction and feature matching between a pair of images. Then the homography is used to warp the original image into alignment with the other one. Finally, the original image and the warped image are fused to get the stitching result. However, this basic algorithm must satisfy a basic assumption: the scene in the picture should be nearly planar [3]. In fact, the depth of image contents always varies, which violates this hypothesis. Therefore, ghosting effects or misalignments easily arise in the overlapping parts of the stitched image. To mitigate ghosting effects and improve stitching quality, some existing image stitching algorithms compute multiple content-aware local warpings [4–11] to align the overlapping parts of images, and some reduce the artifacts generated by projective transformation by finding optimal seams [12–15] around objects. As for deep stitching methods, some [16–20] can handle images from arbitrary views, but only some steps of their frameworks, such as feature extraction or feature matching, are achieved by deep learning, so they cannot be called complete deep image stitching models. Other methods [21–23] are implemented entirely with deep learning, but they are specially designed for specific conditions, such as fixed views.

Different from these deep stitching methods, we aim to establish a complete deep learning model that can handle images captured from arbitrary views. In this paper, we present a cascaded view-free image stitching network based on a global homography, which eliminates ghosting effects as much as possible.

The overview of our approach is illustrated in Fig. 1(e). Specifically, the first stage is homography estimation. Different from existing deep homography estimation [24–26], the proportion of overlapping area between the two images in our image stitching setting is much lower, which poses great challenges to stitching performance. To address this problem, we introduce a global correlation layer [27,28] into this stage
☆ This paper has been recommended for acceptance by Zicheng Liu.
* Corresponding author.
E-mail address: cylin@bjtu.edu.cn (C. Lin).
https://doi.org/10.1016/j.jvcir.2020.102950
Received 19 April 2020; Received in revised form 17 July 2020; Accepted 10 October 2020
Available online 4 November 2020
1047-3203/© 2020 Elsevier Inc. All rights reserved.
L. Nie et al. Journal of Visual Communication and Image Representation 73 (2020) 102950
Fig. 1. The pipeline of our proposed deep stitching network and the dataset generation.
by calculating the global correlation between every position in the feature maps. In the second stage, we use the homography obtained in the first stage to warp the original image and perform average fusion on the warped images to get the structure stitching result. The last stage is the content revision stage. Based on the known structure prior, we design a content revision network to eliminate ghosting effects in overlapping areas and revise the content of the stitched image to obtain perceptually natural results.

In addition, training a network to generate stitched images requires a sufficiently large training set. Inspired by DeTone et al. [24], we improve the fashion of constructing a synthetic dataset for deep homography estimation and propose to create a seemingly infinite dataset consisting of quadruples (I_A, I_B, f, Label) for image stitching. The pipeline of the dataset generation is depicted in Fig. 1(a)-(d). Our specially designed network and dataset enable our algorithm to deal with image inputs from arbitrary views, which frees deep image stitching from the limitation of fixed image views.

In the experiments, we show that the proposed approach outperforms previous methods by a large margin, demonstrating its efficacy for deep image stitching. The contributions of this paper can be summarized as follows:

1) We present a view-free image stitching network. To the best of our knowledge, this is the first work that can tackle images from arbitrary views in the field of deep image stitching.
2) To eliminate the ghosting effect as much as possible, we design a global correlation layer and a structure-to-content gradual stitching module.
3) To enable a parallax-tolerant stitching algorithm, we construct a synthetic dataset for network training, which displays more challenging overlapping parts than previous methods.

The remainder of this paper is organized as follows: The related work is described in Section 2. The proposed view-free image stitching network and the constructed synthetic dataset are presented in Section 3 and Section 4, respectively. Section 5 describes the experiments. The conclusion and future directions are discussed in Section 6.

2. Related work

In this section, we briefly review traditional image stitching methods and deep image stitching methods.

2.1. Traditional image stitching

Among the traditional image stitching methods, some compute multiple content-aware local warpings to replace the single global warping and reach better local alignment [4–11], while others propose to find optimal seams around objects to reduce the artifacts generated by image fusion [12–15].

In the first group of methods, Gao et al. [4] find that aligning pictures with a single homography has obvious limitations, as it easily produces artifacts when the depth of a scene is inconsistent. By separating an image into foreground and background, this problem is alleviated with a dual-homography warping (DHW). Based on this inspiring work, as-projective-as-possible (APAP) warping [6] is proposed to divide the image into dense grids and align each grid with a unique local homography. Zaragoza et al. greatly reduce ghosting effects caused by inconsistent depth, at the cost of introducing geometric distortion in non-overlapping areas. Meanwhile, Chang et al. [7] propose a shape-preserving half-projective (SPHP) warp, in which the projective transformation of non-overlapping regions is gradually reduced to a similarity transformation including only translation, rotation, and scaling, reducing the geometric distortion caused by the projective transformation. Combining APAP with SPHP, Lin et al. [10] present an adaptive as-natural-as-possible (ANAP) warping to produce more perceptually natural stitching results.

Seam-driven methods are also influential. In the work of Eden et al. [12], in addition to a data cost that constrains content consistency, a seam cost is introduced to encourage smooth transitions in overlapping areas. Since we look forward to a seamless stitched picture, seam-cutting can be a crucial step for obtaining a perceptually seamless result. A seam-driven image stitching strategy [15] was proposed that finds the best homography through seam-cutting. Then, Zhang et al. [13] propose an alignment algorithm to find an optimal seam for stitching images, after which a multi-band blending algorithm is used to get the final stitching result. Based on this work, Lin et al. [14] improved seam-driven
Fig. 3. Structure stitching stage (left): roughly stitching images in structure with the global homography. Content revision stage (right): revising the content of the image to enhance fusion quality.
reduced.

3.2. Structure stitching stage

Training a stitching network from scratch without a structure prior is very difficult, as the network needs to learn the structure information and content details of stitched images simultaneously. To solve this problem, we divide the remaining stitching work into two parts: structure stitching and content revision. We determine the coarse contour of the stitched image at this stage, which provides important structural prior information for the subsequent content revision stage.

To realize structure stitching, we improve the Spatial Transformer Layer (STL) introduced in [33] and propose a Structure Stitching Layer (SSL) to obtain the structure information of stitched images. This part is accomplished in four steps as shown in Fig. 3 (left).

First of all, a grid with the same size as the stitching label is generated for each image input, and every element in the grid represents its 2-D spatial location (u, v). Second, we calculate the original coordinates in I_A and I_B corresponding to these grid locations as (x/z, y/z) by Eq. 2:

  [x, y, z]^T = H^{-1} [u, v, 1]^T,   (2)

where H represents the projective transformation from the perspective of I_B to that of I_A, and (x, y, z) is the homogeneous coordinates of the original

decoder, its structure can be treated as a mirror of the encoder's, with three channels in the last layer. This decoder is designed to reorganize the feature basis into a more complex feature representation and recombine the basis into the desired stitching result. Furthermore, to prevent the gradient vanishing problem and information imbalance in each layer [34], we add skip connections between the low-level and high-level features with the same resolution.

3.4. Loss functions

In essence, deep homography networks [24–26] can be seen as regression tasks that solve for the eight parameters of a homography. Even in unsupervised estimation methods [25,26], the homography must be predicted first before calculating the unsupervised loss function. Therefore, in our work, we directly take a simpler but more effective supervised approach to estimate the homography by minimizing the L2 distance between the predicted offsets f̂ and their ground truth f as follows:

  L_H(f̂, f) = (1/N) ||f̂ − f||_2^2,   (3)

where N is the number of components in the offsets f̂.

In order to constrain the structure of the stitched image to be as close as possible to that of the label, an L1 loss is used at corresponding positions between outputs and labels:

  L_S(Î, I) = (1/(W × H × C)) ||Î − I||_1,   (4)
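The inverse mapping of Eq. (2) and the losses of Eqs. (3) and (4) can be sketched in NumPy as follows. This is an illustrative reconstruction, not the paper's code; in particular it uses nearest-neighbor sampling where an STL would use differentiable bilinear sampling:

```python
import numpy as np

def warp_by_homography(img, H, out_h, out_w):
    """Inverse-warp img into an (out_h, out_w) canvas: each output grid
    location (u, v) is mapped through H^-1 (Eq. 2) and normalized to
    (x/z, y/z); nearest-neighbor sampling for brevity."""
    h, w = img.shape[:2]
    out = np.zeros((out_h, out_w) + img.shape[2:], dtype=img.dtype)
    Hinv = np.linalg.inv(H)
    for v in range(out_h):
        for u in range(out_w):
            x, y, z = Hinv @ np.array([u, v, 1.0])
            xs, ys = int(round(x / z)), int(round(y / z))
            if 0 <= xs < w and 0 <= ys < h:  # inside the source image
                out[v, u] = img[ys, xs]
    return out

def loss_H(f_hat, f):
    """Eq. 3: mean squared error over the N offset components."""
    return np.sum((f_hat - f) ** 2) / f.size

def loss_S(I_hat, I):
    """Eq. 4: mean absolute error over a W x H x C image."""
    return np.sum(np.abs(I_hat - I)) / I.size

img = np.arange(16.0).reshape(4, 4)
same = warp_by_homography(img, np.eye(3), 4, 4)  # identity H leaves img unchanged
assert np.array_equal(same, img)
```

With a non-identity H, output pixels whose pre-image (x/z, y/z) falls outside the source stay zero, which is what produces the non-overlapping canvas regions the later stages operate on.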
  f_i^x = p_i^x + t_x,
  f_i^y = p_i^y + t_y,   (6)
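If Eq. (6) follows the DeTone-style synthetic-data scheme described in the introduction [24], each ground-truth offset f_i combines a per-corner perturbation p_i with a translation (t_x, t_y) shared by all four corners. A speculative NumPy sketch; the ranges `rho` and `t_max` are assumed values for illustration, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def corner_offsets(rho=32.0, t_max=64.0):
    """Eq. 6: offsets f_i = p_i + t for the four corners, where p_i is a
    random per-corner perturbation in [-rho, rho] and (tx, ty) is one
    global translation shared by all corners (assumed ranges)."""
    p = rng.uniform(-rho, rho, size=(4, 2))      # per-corner p_i
    t = rng.uniform(-t_max, t_max, size=(1, 2))  # shared (tx, ty)
    return p + t                                 # rows are (f_i^x, f_i^y)

f = corner_offsets()
assert f.shape == (4, 2)
```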
Fig. 5. Alignment performance of different models at different overlap rates. (a)(b): Image inputs used to predict a homography. (c): The alignment results of VGG-style deep homography methods [24,25]. (d): The alignment results of the proposed deep homography method. (e): The ground-truth alignment results. The bottom number indicates the overlap rate between (a) and (b).
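The homography models compared in Fig. 5 regress corner offsets and then recover the 3 × 3 matrix by solving a direct linear transform (DLT). A minimal NumPy sketch of that solve from four correspondences, in the standard formulation [29] rather than the paper's implementation (no normalization or RANSAC):

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate a 3x3 homography H with dst ~ H @ src from >= 4 point
    correspondences via the direct linear transform (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The homography is the null vector of A: the last right-singular vector.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the scale ambiguity

# Four corner correspondences of a synthetic warp (hypothetical values).
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(0.1, 0.0), (1.2, 0.1), (0.9, 1.1), (0.0, 0.9)]
H = dlt_homography(src, dst)
for (x, y), (u, v) in zip(src, dst):
    p = H @ np.array([x, y, 1.0])
    assert np.allclose(p[:2] / p[2], (u, v), atol=1e-6)
```

With exactly four correspondences in general position the solve is exact; the learned offset prediction only changes where `dst` comes from.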
Fig. 6. Comparison with existing image stitching algorithms. We stitch I_A and I_B to get a stitching result close to the ground truth with the baseline, APAP, robust ELA, SPHP, and our algorithm. Each row represents a different stitching scene.
Fig. 7. A failure case of the baseline. (i): Image inputs to be stitched. (ii): Our stitching result and the ground truth. (iii): Stitching result of the baseline and the process of RANSAC.

Table 1
Ablation studies on homography estimation, coarse stitching and refined stitching. Data represents the average PSNR and SSIM on 1000 test sets.

  Architecture                 PSNR      SSIM
  w/o HE    STL + CR           18.3225   0.8684
  w/o CR    VGG-HE + SSL       17.8983   0.8083
            GC-HE + SSL        20.7950   0.8445
  w/o SSL   GC-HE + STL + CR   23.8677   0.9118
  Ours      GC-HE + SSL + CR   24.8525   0.9241

fusion, which will simply be called the baseline in the following text for the sake of brevity. The results are illustrated in Fig. 6.

As we can see, the results of previous methods exhibit varying degrees of ghosting effects in the overlapping areas. On the contrary, our method can achieve almost 100% elimination of artifacts in overlapping areas at the cost of acceptable slight distortions in non-overlapping areas. This advantage benefits from our structure-to-content gradual stitching module, which makes full use of the structure prior given by the global homography and of the fine content revision network. Moreover, if we pay attention to the content circled in red in Fig. 6, it is easy to see that our results are more perceptually natural.

5.4. Ablation study

Homography Estimation (HE): The first step of our deep image stitching is the homography estimation, which consists of feature extraction, feature matching, offset prediction, and solving the DLT. With the help of homography estimation, we can obtain prior information about the image alignment that equips our network with the ability of view-free image stitching. To verify the importance of this module, we compare the performance with and without homography estimation, as shown in Table 1 ('w/o HE') and Fig. 8. Concretely, without the sufficient structural prior information that HE provides, it is difficult for a CNN to learn the process of view-free interpolation from the view of I_B to that of I_A for the non-overlapping areas; compare Fig. 8 'w/o HE' with 'Ours'. Moreover, we also divide HE into VGG-style homography estimation (VGG-HE) and global correlation homography estimation (GC-HE) according to the different network structures. In Section 4.2, we verified the performance of these two structures on homography estimation and image alignment. In the subsequent experiments, we also evaluate their effects on image stitching.

Structure Stitching Layer (SSL): SSL is used to stitch the structure of images, and it helps to obtain the basic contour of the stitched image while ignoring possible ghosting effects. SSL is composed of a Spatial Transformation Layer (STL) and average fusion in tensor form. When we train the network with SSL, we can obtain more prior information about the contour of the stitched image and thus get more perceptually precise stitching results. As illustrated in Fig. 8, 'w/o SSL' and 'Ours' are similar

Fig. 8. Ablation experiments to verify the effectiveness of the homography estimation (HE), structure stitching layer (SSL) and content revision network (CR) for image stitching.
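The SSL described in the ablation is an STL-style warp followed by average fusion of the aligned tensors. A minimal NumPy sketch of the fusion step, assuming the warped images and their validity masks (1 inside each warped footprint, 0 outside) are already given; illustrative only, not the paper's code:

```python
import numpy as np

def average_fusion(warp_a, mask_a, warp_b, mask_b):
    """Average the two warped images wherever both are valid;
    keep the single valid image elsewhere."""
    num = warp_a * mask_a + warp_b * mask_b
    weight = mask_a + mask_b
    # Avoid division by zero outside both footprints.
    return np.where(weight > 0, num / np.maximum(weight, 1e-12), 0.0)

# Toy 1-D example: the two images overlap in the middle two pixels.
a = np.array([2.0, 2.0, 2.0, 0.0]); ma = np.array([1.0, 1.0, 1.0, 0.0])
b = np.array([0.0, 4.0, 4.0, 4.0]); mb = np.array([0.0, 1.0, 1.0, 1.0])
fused = average_fusion(a, ma, b, mb)
assert np.allclose(fused, [2.0, 3.0, 3.0, 4.0])
```

Averaging deliberately preserves any misalignment as visible ghosting, which is exactly the structural signal the content revision network is then trained to remove.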
in image content, while the contour of ours is smoother and closer to the ground truth, especially in the part circled in red. Therefore, ours achieves higher PSNR and SSIM, which means that our results have lower noise error and higher structural similarity, as shown in Table 1.

Content Revision Network (CR): The content revision network refines the structure stitching result obtained in the previous stage. For the overlapping parts of the images, this part of the network removes almost all ghosting effects; for the non-overlapping areas, it reshapes the contour and structure of the stitched image to make the final results more natural at the seams. Table 1 and Fig. 8 also show the results with and without CR. Compared with our full results, there are obvious ghosting effects in the stitching results without CR, where 'VGG-HE' has more artifacts than 'GC-HE' since GC-HE estimates the homography with higher accuracy. This part is so important that the quality of stitched images without CR is significantly worse than ours, as Table 1 shows, while a CR network alone has no effect. Therefore, only a complete pipeline that includes all three stages of HE, SSL, and CR can achieve natural view-free image stitching.

6. Conclusion

In this paper, we present a view-free image stitching architecture that can be divided into three stages: homography estimation, structure stitching, and content revision. In the homography estimation stage, we extract features, match features, and predict a homography by convolutional layers, a global correlation layer, and a regression network, respectively. In the second stage, the homography is used to perform structure stitching by SSL to determine the coarse contour of the stitching result. In the content revision stage, we refine the content of the image while keeping the contour of the stitching result free from distortion. Moreover, a scheme for generating a dataset for view-free image stitching is proposed to enable efficient learning on various views. Compared with existing image stitching algorithms, our view-free image stitching network can achieve almost 100% elimination of artifacts in overlapping areas and is more robust, with significantly reduced ghosting effects, especially in scenes where feature corners are difficult to detect.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities (2018JBM011) and the National Natural Science Foundation of China (No. 61772066, No. 61972028).

References

[1] V.R. Gaddam, M. Riegler, R. Eg, C. Griwodz, P. Halvorsen, Tiling in interactive panoramic video: Approaches and evaluation, IEEE Trans. Multimedia 18 (2016) 1819–1831.
[2] R. Anderson, D. Gallup, J.T. Barron, J. Kontkanen, N. Snavely, C. Hernández, S. Agarwal, S.M. Seitz, Jump: virtual reality video, ACM Trans. Graph. (TOG) 35 (2016) 1–13.
[3] Z. Zhu, E.M. Riseman, A.R. Hanson, Parallel-perspective stereo mosaics, in: Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), vol. 1, IEEE, 2001, pp. 345–352.
[4] J. Gao, S.J. Kim, M.S. Brown, Constructing image panoramas using dual-homography warping, in: CVPR 2011, IEEE, 2011, pp. 49–56.
[5] W.-Y. Lin, S. Liu, Y. Matsushita, T.-T. Ng, L.-F. Cheong, Smoothly varying affine stitching, in: CVPR 2011, IEEE, 2011, pp. 345–352.
[6] J. Zaragoza, T.-J. Chin, M.S. Brown, D. Suter, As-projective-as-possible image stitching with moving DLT, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2339–2346.
[7] C.-H. Chang, Y. Sato, Y.-Y. Chuang, Shape-preserving half-projective warps for image stitching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3254–3261.
[8] C.-H. Chang, Y.-Y. Chuang, A line-structure-preserving approach to image resizing, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 1075–1082.
[9] Y.-S. Chen, Y.-Y. Chuang, Natural image stitching with the global similarity prior, in: European Conference on Computer Vision, Springer, 2016, pp. 186–201.
[10] C.-C. Lin, S.U. Pankanti, K. Natesan Ramamurthy, A.Y. Aravkin, Adaptive as-natural-as-possible image stitching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1155–1163.
[11] J. Li, Z. Wang, S. Lai, Y. Zhai, M. Zhang, Parallax-tolerant image stitching based on robust elastic warping, IEEE Trans. Multimedia 20 (2017) 1672–1687.
[12] A. Eden, M. Uyttendaele, R. Szeliski, Seamless image stitching of scenes with large motions and exposure differences, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 2, IEEE, 2006, pp. 2498–2505.
[13] F. Zhang, F. Liu, Parallax-tolerant image stitching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3262–3269.
[14] K. Lin, N. Jiang, L.-F. Cheong, M. Do, J. Lu, Seagull: Seam-guided local alignment for parallax-tolerant image stitching, in: European Conference on Computer Vision, Springer, 2016, pp. 370–385.
[15] J. Gao, Y. Li, T.-J. Chin, M.S. Brown, Seam-driven image stitching, in: Eurographics (Short Papers), 2013, pp. 45–48.
[16] V.-D. Hoang, D.-P. Tran, N.G. Nhu, V.-H. Pham, et al., Deep feature extraction for panoramic image stitching, in: Asian Conference on Intelligent Information and Database Systems, Springer, 2020, pp. 141–151.
[17] Z. Shi, H. Li, Q. Cao, H. Ren, B. Fan, An image mosaic method based on convolutional neural network semantic features extraction, J. Signal Process. Syst. 92 (2020) 435–444.
[18] L. Wang, W. Yu, B. Li, Multi-scenes image stitching based on autonomous driving, in: 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), vol. 1, IEEE, 2020, pp. 694–698.
[19] T.A. Alzohairy, E. El-Dein, et al., Image mosaicing based on neural networks, Int. J. Comput. Appl. 975 (2016) 8887.
[20] M. Yan, Q. Yin, P. Guo, Image stitching with single-hidden layer feedforward neural networks, in: 2016 International Joint Conference on Neural Networks (IJCNN), IEEE, 2016, pp. 4162–4169.
[21] C. Shen, X. Ji, C. Miao, Real-time image stitching with convolutional neural networks, in: 2019 IEEE International Conference on Real-time Computing and Robotics (RCAR), IEEE, 2019, pp. 192–197.
[22] J. Li, Y. Zhao, W. Ye, K. Yu, S. Ge, Attentive deep stitching and quality assessment for 360° omnidirectional images, IEEE J. Select. Top. Signal Process. 14 (2019) 209–221.
[23] W.-S. Lai, O. Gallo, J. Gu, D. Sun, M.-H. Yang, J. Kautz, Video stitching for linear camera arrays, arXiv preprint arXiv:1907.13622 (2019).
[24] D. DeTone, T. Malisiewicz, A. Rabinovich, Deep image homography estimation, arXiv preprint arXiv:1606.03798 (2016).
[25] T. Nguyen, S.W. Chen, S.S. Shivakumar, C.J. Taylor, V. Kumar, Unsupervised deep homography: A fast and robust homography estimation model, IEEE Robot. Autom. Lett. 3 (2018) 2346–2353.
[26] J. Zhang, C. Wang, S. Liu, L. Jia, J. Wang, J. Zhou, Content-aware unsupervised deep homography estimation, arXiv preprint arXiv:1909.05983 (2019).
[27] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, M. Gelautz, Fast cost-volume filtering for visual correspondence and beyond, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2012) 504–511.
[28] D. Sun, X. Yang, M.-Y. Liu, J. Kautz, PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
[29] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2003.
[30] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, T. Brox, FlowNet: Learning optical flow with convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.
[31] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox, FlowNet 2.0: Evolution of optical flow estimation with deep networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2462–2470.
[32] P. Truong, M. Danelljan, R. Timofte, GLU-Net: Global-local universal network for dense flow and correspondences, arXiv preprint arXiv:1912.05524 (2019).
[33] M. Jaderberg, K. Simonyan, A. Zisserman, et al., Spatial transformer networks, in: Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
[34] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[35] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: European Conference on Computer Vision, Springer, 2016, pp. 694–711.
[36] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[37] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[38] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[39] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[40] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2004) 91–110.
[41] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (1981) 381–395.