
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 60, 2022, Art. no. 5601117

RRSGAN: Reference-Based Super-Resolution for Remote Sensing Image

Runmin Dong, Lixian Zhang, and Haohuan Fu, Member, IEEE

Abstract— Remote sensing image super-resolution (SR) plays an important role by supplementing the lack of original high-resolution (HR) images in the study scenarios of large spatial areas or long time series. However, due to the lack of imagery information in low-resolution (LR) images, single-image super-resolution (SISR) is an inherently ill-posed problem. In particular, it is difficult to reconstruct the fine textures of HR images at large upscaling factors (e.g., four times). In this work, based on Google Earth HR images, we explore the potential of the reference-based super-resolution (RefSR) method on remote sensing images, utilizing rich texture information from HR reference (Ref) images to reconstruct the details in LR images. This method can use existing HR images to help reconstruct the LR images of long time series or a specific time. We build a reference-based remote sensing SR data set (RRSSRD). Furthermore, by adopting the generative adversarial network (GAN), we propose a novel end-to-end reference-based remote sensing GAN (RRSGAN) for SR. RRSGAN can extract the Ref features and align them to the LR features. Eventually, the texture information in the Ref features can be transferred to the reconstructed HR images. In contrast to the existing RefSR methods, we propose a gradient-assisted feature alignment method that adopts the deformable convolutions to align the Ref and LR features and a relevance attention module (RAM) to improve the robustness of the model in different scenarios (e.g., land cover changes and cloud coverage). The experimental results demonstrate that RRSGAN is robust and outperforms the state-of-the-art SISR and RefSR methods in both quantitative evaluation and visual results, which indicates the great potential of the RefSR method for remote sensing tasks. Our code and data are available at https://github.com/dongrunmin/RRSGAN.

Index Terms— Deep learning, remote sensing imagery, super-resolution (SR).

Manuscript received July 29, 2020; revised November 16, 2020 and December 8, 2020; accepted December 16, 2020. Date of publication January 18, 2021; date of current version December 3, 2021. This work was supported in part by the fund of Shanghai Municipal Commission of Economy and Informatization under Grant 2019-RGZN-01015, in part by the National Key Research and Development Plan of China under Grant 2017YFA0604500, in part by the National Natural Science Foundation of China under Grant 51761135015 and Grant U1839206, and in part by the Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao). (Corresponding author: Haohuan Fu.)
Runmin Dong is with the Department of Earth System Science, Tsinghua University, Beijing 100084, China, and also with Shanghai SenseTime Intelligent Technology Company, Ltd., Shanghai 200030, China (e-mail: drm17@mails.tsinghua.edu.cn).
Lixian Zhang and Haohuan Fu are with the Department of Earth System Science, Tsinghua University, Beijing 100084, China (e-mail: zhanglx18@mails.tsinghua.edu.cn; haohuan@tsinghua.edu.cn).
Digital Object Identifier 10.1109/TGRS.2020.3046045

Fig. 1. Example of the LR image and the corresponding HR image. For convenient comparison, the LR image is bicubic upsampled to the size of the HR image. (a) Planet image (3-m resolution). (b) Google Earth image (0.6-m resolution).

I. INTRODUCTION

HIGH-RESOLUTION (HR) remote sensing images are crucial in urban planning [1], semantic labeling [2], [3], building extraction [4], [5], small object detection [6], and so on. However, due to the limitation of the underlying technology and the high cost of the hardware, the observed HR images often suffer from incomplete coverage in terms of space or time. They cannot meet the growing demand for production and applications [7]. The image super-resolution (SR) technology provides a low-cost and effective way to obtain the HR images by reconstructing HR images from relatively low-resolution (LR) but easily available images [8].

In recent years, single-image super-resolution (SISR) has received great attention. There are many SISR-related studies [8]–[10], especially deep learning-based methods [7]. However, because the textures of HR images are heavily destroyed in the degrading process and are difficult to recover, the convolutional neural network-based (CNN-based) SISR methods [11]–[13] often result in blurry effects. To address this issue, the generative adversarial network-based (GAN-based) SISR methods [14]–[16] are proposed and lead to breakthroughs on the visual quality. However, GAN-based SISR methods tend to generate fake textures and even produce artifacts. Moreover, due to the limited prior knowledge and the lack of imagery information in a single LR image, it is still a challenging problem to reconstruct the fine textures of HR at large upscaling factors (e.g., four times).

The publicly available Google Earth images have significantly facilitated the usability of HR (up to 0.3 m) images. Although many HR images are available, they may fail to meet the demands of applications (e.g., lacking images of long time series or a specific time [17]), in which case the SR technology is still needed. However, the existing HR Google Earth images can facilitate new opportunities for image SR tasks. An LR remote sensing image, in many cases, can be matched to a corresponding HR image (as shown in Fig. 1).
Intuitively, a Google Earth HR image can provide extra information and may help reconstruct fine textures in LR images. Therefore, for the remote sensing SR task, we consider using publicly available HR images from Google Earth as reference (Ref) images to help reconstruct the fine texture from LR images. Specifically, the Ref images can guide the SR process where the LR and Ref images contain similar contents, leading to SR images that have sharp visual quality and preserve the ground-truth class [18].

In the computer vision field, some studies have diverted from SISR and explored reference-based SR (RefSR) [18]–[20]. RefSR introduces additional reference images to compensate for the lost details in the LR images. The state-of-the-art methods follow the pattern of combining image alignment or patch matching and texture synthesis [20]. Existing RefSR studies are mainly based on two different types of assumptions: 1) the Ref images and LR images are aligned well or have high content similarity (e.g., the same object from different viewpoints or video frames) [19], [21], [22] or 2) the Ref images and LR images sometimes are significantly misaligned or have uncertain content similarity (e.g., from web image searches) [20], [23], [24]. However, the above assumptions do not fully match the remote sensing scenario in this article, mainly due to the following two reasons.

1) In remote sensing tasks, the HR Ref images can be easily matched to the LR images at the same location using the latitude and longitude information. Therefore, the Ref and LR images have a certain content similarity. However, image alignment is still necessary due to the different shooting viewpoints, which results in the different tilting directions of tall buildings, and deviation in geographic coordinates (usually within several pixels). The challenges of alignment are the large spectral difference between different sensors, land cover changes at different times, shifts in the viewpoint, and illumination variations.

2) Due to inevitable land cover changes, cloud coverage, and missing HR Ref images, there are higher requirements for the robustness of the model, especially in the Ref texture transfer process.

To address the above-mentioned issues, we propose a novel reference-based remote sensing GAN (RRSGAN) for SR, as shown in Fig. 2. Our approach works in a “feature extraction–alignment–transfer” fashion, as shown in Fig. 3. In the feature alignment process, objects with a larger offset usually come from tall buildings, which generally have apparent boundaries. Therefore, we propose a gradient-assisted feature alignment (GAFA) method to match the Ref features with the LR features for better use in the subsequent process. In the feature transfer process, we propose a relevance attention module (RAM) to suppress irrelevant information and enhance the relevant information of the Ref features to improve the robustness of the model.

The contributions of our work are summarized as follows.

1) To the best of our knowledge, we are one of the first to explore RefSR on remote sensing images. Based on the publicly available HR images from Google Earth, we build an open-source reference-based remote sensing super-resolution data set (RRSSRD).
2) We propose an end-to-end RefSR approach for remote sensing images, named RRSGAN. RRSGAN contains a feature extraction and alignment module, a multilevel texture transformer, and GAN-based losses in both image and gradient domains. Considering the misalignment problem between LR and Ref images and model robustness with different qualities of Ref images, we propose a GAFA method and an RAM to further release the potential of RefSR in remote sensing scenarios.
3) We demonstrate that our proposed RRSGAN is superior to both the state-of-the-art SISR methods and existing RefSR methods on RRSSRD. Our proposed method is also robust with different qualities of Ref images and performs well in real-world images. This work proves the great potential of the RefSR approach in the field of remote sensing.

The rest of this article is organized as follows. We introduce the related work in Section II, including the existing SR methods for remote sensing images, SISR, and the deep learning-based RefSR methods. In Section III, we give a detailed description of the proposed approach. Experimental results and discussion are provided in Section IV, and Section V concludes our work.

II. RELATED WORK

A. SR for Remote Sensing Images

In the field of remote sensing, depending on the number of input images, two main categories for improving the resolution of remotely sensed images are SISR and multi-image super-resolution (MISR) techniques [25]. MISR techniques in remote sensing aim to reconstruct high spatial frequency details from multiple LR versions of the same remote sensing scenes [26]–[30]. However, we can hardly apply MISR approaches to applications where multiple remotely sensed images of the same scene are not possible or difficult to obtain [31]. For such situations, SISR has become the more feasible SR technology [32].

With the rapid progress made in deep learning, the deep learning-based SISR methods outperform the traditional SISR methods and prove to be promising in the domain of remote sensing SR [31], [33]. The early CNN-based methods [34]–[36] usually retrained a network designed for natural images, such as SRCNN and VDSR, with remote sensing images. Recent progress of SR for remote sensing images can be summarized in four aspects.

1) Improving the structure of the SR network for remote sensing images. For example, Pan et al. [37] proposed a residual dense backprojection network (RDBPN)-based SISR method, which can utilize residual learning in both global and local manners. Jiang et al. [8] improved the performance regarding the details of remote sensing images by adding an edge enhancement module to the GAN architecture.

Zhang et al. [38] presented a scene-adaptive approach using a multiscale attention network to improve the SR performance on different scenes.
2) Using prior information to help the reconstruction. For example, Zhang et al. [39] introduced the varying saliency maps of different areas in remote sensing images to drive SISR.
3) Designing an approach for real-world image SR. For example, Wang et al. used real LR and HR images, which can be paired or unpaired remote sensing images, and proposed an unsupervised learning network named Cycle-CNN for remote sensing image SR.
4) Improving GAN-based SISR methods. For example, Lei et al. [10] proposed coupled-discriminated GANs (CDGANs) for reconstructing remote sensing images to relieve the problem that the discriminator is easily confused in the low-frequency regions.

Besides, there are also many studies for promoting the resolution of other types of remote sensing images, such as radar images [40], [41] and hyperspectral images [42], [43].

B. SISR

The deep learning-based SISR techniques aim to build a deep learning model that can learn a “universal” prior from the given training remotely sensed material and then infer the missing details of the LR images using the trained model [10]. Dong et al. [44] pioneered SRCNN, which used a three-layer CNN for mapping LR image and HR image pairs. Subsequently, the performance of SISR has been improved by applying more advanced deep learning architectures, such as residual blocks [45], dense blocks [11], and recursive blocks [46]. For example, SRResNet [14] employed the ResNet architecture for SR, and an enhanced deep SR (EDSR) network [12] further improved the results by removing the unnecessary batch normalization modules. Recently, WDSR [13] used the strategy of wide activation by expanding features before the rectified linear unit (ReLU) activation. DBPN [47] introduces backprojection modules to provide an error feedback mechanism.

The abovementioned methods aim to minimize the mean square error (MSE) or mean absolute error (MAE) between the SR and HR images, ignoring human perceptions. Therefore, GANs [48] have emerged as a promising architecture for reconstructing single frame images into a more photorealistic and perceptual version. Ledig et al. [14] used a GAN (SRGAN) to reconstruct a photorealistic single image with HR. Wang et al. [15] proposed ESRGAN to enhance the visual quality of reconstructed images further. Ma et al. [16] proposed a structure-preserving SR (SPSR) method with gradient guidance to generate perceptually pleasant details, which achieved state-of-the-art visual results.

C. RefSR

RefSR methods introduce additional Ref images with high-frequency details to reconstruct the missing fine texture of LR images. The Ref images used in daily scenarios are generally from different viewpoints [19], video frames [49], online image searches [20], and so on. Recent RefSR methods can be roughly classified into two categories: patch matching and image aligning. Some studies adopted the patch matching method to find the most relevant Ref feature for each LR patch [20], [23]. For example, Zheng et al. [23] proposed a deep learning-based approach for Ref and LR patch matching and synthesizing. SRNTT [20] adopted the patch matching method, which used a pretrained VGG network to extract Ref and LR features, divided the VGG features into small patches, and then calculated and compared the similarity between each Ref and LR patch to swap similar texture features. The advantage of patch matching is that it can handle long-distance dependence, making the model more robust in the case of significant misalignment between Ref and LR images. However, patch matching is usually computationally intensive and time-consuming. Besides, it cannot accurately deal with nonrigid image deformation, which often causes blocky artifacts.

Other studies adopted image aligning methods to alleviate the abovementioned problems [19], [21], [22]. However, they are based on the assumption that the Ref image and LR image are well-aligned or have a high content similarity. For example, CrossNet [19] applied optical flow to align Ref and LR images at different scales and then used the aligned Ref feature to perform multiscale feature fusion and synthesis using U-Net [50]. However, CrossNet is not an end-to-end SR method. CrossNet first applies an SISR model (e.g., SRResNet [14]) to upsample the LR image and then performs the subsequent “encoder–warping–decoder” process. The remote sensing scenario is more suitable for image aligning, as the Ref and LR images have a certain content similarity. However, the performance of such methods depends mainly on the alignment quality. Therefore, we propose a gradient-assisted pyramid, cascading and deformable alignment method to improve the aligning quality. Moreover, we design an end-to-end RefSR approach, and the proposed RAM can further improve the performance and robustness of our approach.

Fig. 2. Overall workflow of the adversarial learning of RRSGAN.

III. APPROACH

In this article, we aim to utilize a reference image (Ref) to assist in reconstructing an LR image. The proposed RRSGAN is a GAN-based RefSR method, and the overall workflow is shown in Fig. 2. RRSGAN consists of a generator and two discriminators. The generator can utilize the Ref images to help reconstruct the fine textures in LR images. To generate clear and visually favorable SR results, we use two discriminators D_I and D_G in both image and gradient domains.

Fig. 3. Generator of the proposed RRSGAN. The approach is designed in a “feature extraction–alignment–transfer” structure. First, the Ref features are
extracted and aligned to the LR features. In the texture transfer process, we first extract the LR features and then transfer the aligned Ref features in a
multiscale way.

The gradient information can help the model focus on neighboring configurations and better infer the local intensity of sharpness. Therefore, the generator can benefit from the two discriminators to learn the fine appearance and focus on avoiding the distortions of geometric details.

The generator of RRSGAN is an end-to-end network, as shown in Fig. 3. The approach is designed in a “feature extraction–alignment–transfer” structure. First, we need to extract the Ref features and align them to the LR features. Otherwise, the Ref features cannot be used effectively in the subsequent texture transfer process. Different from the patch-based method in SRNTT [20] and the flow-based method in CrossNet [19], we adopt deformable convolutions to align the Ref and LR features. Also, considering that objects with a larger offset usually come from tall buildings, we propose a gradient-assisted alignment method to improve the performance. Then, we extract the LR features and perform texture transfer from the Ref features. Following SRNTT [20], we transfer the aligned Ref features in a multiscale way. Instead of directly concatenating the LR and Ref features in the texture transfer, we propose an RAM to improve the robustness of the model. The RAM can enhance the relevant information to improve the performance and suppress the irrelevant information of the Ref features, which is useful for preventing the irrelevant Ref features from making the results worse than SISR methods.
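For illustration, the overall data flow of this structure can be sketched in PyTorch as follows. The module internals below (single convolutions standing in for the encoders, the gradient-assisted alignment, and the RAM-based texture transformer) are placeholders chosen for brevity, not the actual layers, which are described in the following sections; only the interfaces (LR plus Ref in, SR out as a residual on the upsampled LR image) reflect the design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RRSGANGeneratorSketch(nn.Module):
    """Illustrative skeleton only: extract Ref features, align them to the
    LR features, then transfer the aligned textures into the SR output."""
    def __init__(self, nf=64, scale=4):
        super().__init__()
        self.scale = scale
        self.ref_encoder = nn.Conv2d(3, nf, 3, 1, 1)    # stand-in for the Ref/gradient encoders
        self.lr_encoder = nn.Conv2d(3, nf, 3, 1, 1)     # stand-in for the LR feature extractor
        self.align = nn.Conv2d(nf * 2, nf, 3, 1, 1)     # stand-in for gradient-assisted alignment
        self.transfer = nn.Conv2d(nf * 2, nf, 3, 1, 1)  # stand-in for RAM + texture transformer
        self.to_rgb = nn.Conv2d(nf, 3, 3, 1, 1)

    def forward(self, lr, ref):
        lr_up = F.interpolate(lr, scale_factor=self.scale, mode='bilinear',
                              align_corners=False)                 # LR upsampled to Ref/HR size
        f_ref = self.ref_encoder(ref)                               # 1) feature extraction
        f_lr_up = self.lr_encoder(lr_up)
        f_ref_aligned = self.align(torch.cat([f_ref, f_lr_up], dim=1))    # 2) alignment
        fused = self.transfer(torch.cat([f_ref_aligned, f_lr_up], dim=1)) # 3) texture transfer
        return lr_up + self.to_rgb(fused)                           # residual SR prediction

lr = torch.randn(1, 3, 120, 120)
ref = torch.randn(1, 3, 480, 480)
sr = RRSGANGeneratorSketch()(lr, ref)   # -> (1, 3, 480, 480)
```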
We will elaborate on the details of the RRSGAN in the following sections. In Section III-A, we introduce the GAFA method. In Section III-B, we describe the end-to-end network, including the LR feature extractor, RAM, texture transfer, and discriminator. A group of loss functions is introduced in Section III-C, and the implementation details of the proposed method are presented in Section III-D.

A. Gradient Assisted Feature Alignment Method

The proposed GAFA contains a feature extraction module and a feature alignment module, as shown in Fig. 4(a) and (b), respectively. The purpose is to extract the Ref features and align them to the LR features. As we analyzed in Section I, the most difficult objects for feature alignment are usually tall buildings. Intuitively, the gradient information from the object boundary is useful for alignment. Therefore, in addition to extracting the image features, we explicitly calculate the image gradient to assist with feature alignment. Note that although the feature extractor can implicitly learn to extract useful features, including the image gradient, the explicit retention and utilization of the gradient features make the network concentrate more on the objects that are prone to deviation, as shown in the ablation studies in Section IV-F2. The gradient map for an image is obtained by computing the difference between the adjacent pixels. The gradient calculation of a pixel x = (x, y) in image I is defined as follows:

    I_x(x) = I(x + 1, y) − I(x − 1, y)
    I_y(x) = I(x, y + 1) − I(x, y − 1)
    ∇I(x) = (I_x(x), I_y(x))
    M(I) = ‖∇I‖_2    (1)

where M(·) computes the L2 norm of the gradient at each location and M(I) is referred to as the gradient map of image I. The image gradient ∇I can be efficiently computed by using convolution layers with fixed kernels.
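For example, a minimal PyTorch implementation of the gradient map M(I) in (1) with fixed (non-learnable) convolution kernels could look as follows; the function name and the small epsilon added for numerical stability are our own choices.

```python
import torch
import torch.nn.functional as F

def gradient_map(img: torch.Tensor) -> torch.Tensor:
    """M(I): per-pixel L2 norm of the central-difference gradient, as in (1).
    img: (N, C, H, W) tensor; returns (N, C, H, W)."""
    c = img.shape[1]
    kx = img.new_tensor([[0., 0., 0.], [-1., 0., 1.], [0., 0., 0.]])
    ky = img.new_tensor([[0., -1., 0.], [0., 0., 0.], [0., 1., 0.]])
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)   # one fixed kernel per channel
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ix = F.conv2d(img, kx, padding=1, groups=c)   # I_x = I(x+1, y) - I(x-1, y)
    iy = F.conv2d(img, ky, padding=1, groups=c)   # I_y = I(x, y+1) - I(x, y-1)
    return torch.sqrt(ix ** 2 + iy ** 2 + 1e-12)  # small eps keeps the sqrt differentiable

grad = gradient_map(torch.rand(1, 3, 120, 120))   # (1, 3, 120, 120)
```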
To match the size of the LR image to the Ref image, we first upscale the LR image to obtain LR↑ (↑ denotes the bilinear upsampling process). Instead of directly using the Ref image for alignment, we use Ref↓↑ (↓ denotes the bilinear downsampling process), resampled by first applying bilinear downsampling and then upsampling to the original Ref image. Then, LR↑ and Ref↓↑ are used to estimate the offset between the LR image and the Ref image. Eventually, we use the offset to align the Ref features. We use Ref↓↑ rather than the original Ref image for computing the offset because the different blur degrees of the Ref and LR↑ images increase the difficulty of the alignment.

Fig. 4. Gradient assisted feature alignment method. The symbol ↑ denotes the bicubic upsampling process, and ↓ denotes the bicubic downsampling process. The symbol ⊕ denotes feature concatenating on the channel. (a) Feature extraction module. (b) Feature alignment module.

In contrast, the blurry Ref↓↑ image is domain-consistent with LR↑, which helps the alignment. The effectiveness of the resample strategy of Ref images is demonstrated in Section IV-F2. As shown in Fig. 4(b), three pairs of data are used as inputs for the alignment, in which each pair contains an image and its corresponding gradient map.
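A minimal sketch of this input preparation is given below, assuming bilinear resampling and an upscaling factor of 4 as used in this work; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def prepare_alignment_inputs(lr: torch.Tensor, ref: torch.Tensor, scale: int = 4):
    """Build LR-up and Ref-down-up so that both offset-estimation inputs share
    the same blur level (sketch; bilinear resampling as in the text above)."""
    lr_up = F.interpolate(lr, scale_factor=scale, mode='bilinear', align_corners=False)
    ref_down = F.interpolate(ref, scale_factor=1.0 / scale, mode='bilinear', align_corners=False)
    ref_down_up = F.interpolate(ref_down, scale_factor=scale, mode='bilinear', align_corners=False)
    return lr_up, ref_down_up

lr = torch.rand(1, 3, 120, 120)
ref = torch.rand(1, 3, 480, 480)
lr_up, ref_down_up = prepare_alignment_inputs(lr, ref)   # both (1, 3, 480, 480)
```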
1) Feature Extraction Module: The feature extraction module contains two parallel encoders to extract the image and gradient features simultaneously. Each encoder consists of four convolution blocks, and each convolution block is composed of a convolution layer and a Leaky ReLU. The kernel size of the convolution layers is set to 5. The stride of the last two convolutional layers is set to 2 to obtain multiscale features. The number of feature maps in the image encoder and image gradient encoder is set to 64 and 16. The multiscale features include both semantic and textual information, contributing to feature alignment and transfer. Besides, the features extracted from the two parallel encoders are concatenated at the same level.
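A compact sketch of such a two-branch encoder is given below (kernel size 5, stride 2 in the last two blocks, 64 image-feature maps and 16 gradient-feature maps, branch outputs concatenated per level). The exact placement of the strided blocks and the three-channel gradient input are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    # one convolution (kernel 5) followed by a Leaky ReLU
    return nn.Sequential(nn.Conv2d(cin, cout, 5, stride, 2), nn.LeakyReLU(0.1, inplace=True))

class TwoBranchEncoder(nn.Module):
    """Parallel image / gradient encoders; strides 1, 1, 2, 2 give three scales."""
    def __init__(self, img_nf=64, grad_nf=16):
        super().__init__()
        strides = [1, 1, 2, 2]
        img_chs, grad_chs = [3] + [img_nf] * 4, [3] + [grad_nf] * 4
        self.img_blocks = nn.ModuleList(
            [conv_block(img_chs[i], img_chs[i + 1], strides[i]) for i in range(4)])
        self.grad_blocks = nn.ModuleList(
            [conv_block(grad_chs[i], grad_chs[i + 1], strides[i]) for i in range(4)])

    def forward(self, img, grad):
        feats = []
        for i in range(4):
            img, grad = self.img_blocks[i](img), self.grad_blocks[i](grad)
            if i >= 1:                                    # keep the three scales
                feats.append(torch.cat([img, grad], dim=1))  # concat branches per level
        return feats   # [level 1 (full res), level 2 (1/2), level 3 (1/4)]

f1, f2, f3 = TwoBranchEncoder()(torch.rand(1, 3, 480, 480), torch.rand(1, 3, 480, 480))
```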
2) Feature Alignment Module: The structure of the feature alignment module is shown in Fig. 4(b). We use deformable convolutions [51] to align the Ref and LR features. Deformable convolution aligns the features more flexibly, compared to explicit motion estimation or image warping. Inspired by the success of deformable convolution for aligning neighboring frames in the video field [52], we adopt a pyramid structure to estimate and propagate the offsets and generate aligned Ref features at multiple levels.

Specifically, we align the features in a coarse-to-fine manner for the entire image. After we obtain the three-level features in the feature extraction module, the offsets are predicted at each level by two convolution layers. Then, deformable convolution can utilize the predicted offsets to align the Ref features. Besides, the offsets and aligned features at the lth level can be further used to help predict the offsets and aligned features at the (l − 1)th level. At level 1, the cascading refinement [53] is used to improve the performance of the alignment further. A subsequent deformable alignment is cascaded to further refine the coarsely aligned Ref features [52]. Note that the alignment module does not require explicit supervision. Also, the offset is jointly learned with the whole model instead of being trained separately. The effectiveness of the feature alignment module is evaluated in Section IV-F2.
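A single level of this offset prediction followed by deformable alignment can be sketched with torchvision's DeformConv2d as below; the pyramid propagation between levels and the cascading refinement are omitted, and the offset-branch layout is only indicative.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformAlignLevel(nn.Module):
    """Predict offsets from concatenated Ref/LR features, then deformably
    convolve the Ref features so they line up with the LR features (one level)."""
    def __init__(self, nf=64, ksize=3):
        super().__init__()
        self.offset_conv = nn.Sequential(
            nn.Conv2d(nf * 2, nf, 3, 1, 1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(nf, 2 * ksize * ksize, 3, 1, 1))   # (dx, dy) per kernel tap
        self.dcn = DeformConv2d(nf, nf, ksize, padding=ksize // 2)

    def forward(self, f_ref, f_lr):
        offset = self.offset_conv(torch.cat([f_ref, f_lr], dim=1))
        return self.dcn(f_ref, offset)    # Ref features sampled at the offset positions

f_ref = torch.rand(2, 64, 120, 120)
f_lr = torch.rand(2, 64, 120, 120)
aligned = DeformAlignLevel()(f_ref, f_lr)   # (2, 64, 120, 120)
```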
B. End-to-End Network Structure

Without the requirement for a separate SISR network as in CrossNet [19] or a separate feature alignment network as in SRNTT [20], the proposed RRSGAN is trained in an end-to-end manner. We use the GAN architecture. The generator includes a feature alignment module (elaborated in Section III-A), an LR feature extractor, and a texture transformer, which contains the RAMs. We use two discriminators D_I and D_G for the image and gradient domains, respectively.

1) LR Feature Extractor: The input LR image is first forwarded to the feature extractor, which contains N residual blocks. Each block consists of two convolution layers and a Leaky ReLU, as shown in the pink box in Fig. 3. In this work, the kernel size of the convolution layers is set to 3 × 3, and the number of feature maps in each layer is set to 64.
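A sketch of one residual block and the resulting LR feature extractor (3 × 3 convolutions, 64 feature maps) is shown below; the placement of the identity skip and the head convolution are our reading of Fig. 3 rather than a verified implementation detail.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a Leaky ReLU in between, plus an identity skip."""
    def __init__(self, nf=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(nf, nf, 3, 1, 1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(nf, nf, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)

class LRFeatureExtractor(nn.Module):
    def __init__(self, nf=64, n_blocks=16):
        super().__init__()
        self.head = nn.Conv2d(3, nf, 3, 1, 1)
        self.blocks = nn.Sequential(*[ResidualBlock(nf) for _ in range(n_blocks)])

    def forward(self, lr):
        return self.blocks(self.head(lr))

feat = LRFeatureExtractor()(torch.rand(1, 3, 120, 120))   # (1, 64, 120, 120)
```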
2) RAM: Instead of directly combining the Ref and LR features for texture transfer, we use the correlation between the Ref and LR features to modify the Ref features before combining. Intuitively, the relevant information between Ref and LR features should be enhanced, and the less relevant information should be suppressed in the Ref features and replaced with the LR features.

Therefore, as shown in Fig. 5, we first combine the Ref and LR features and obtain attention maps through two convolution layers. The normalized attention maps are elementwise multiplied by the Ref features (denoted as ⊙). Then, we obtain additional information from the combined features using two other convolution layers, which are elementwise added to the Ref features (denoted as +). Finally, the optimized Ref features are combined with the LR features on the channel (denoted as ⊕) for the further texture transfer process. The RAM is designed to improve the robustness of the model and can be expressed as follows:

    f_RAM = C_a(f_Ref ⊕ f_LR) ⊙ f_Ref + C_b(f_Ref ⊕ f_LR)
    F_RAM = f_RAM ⊕ f_LR    (2)

where f_RAM represents the optimized Ref features, f_Ref and f_LR represent the Ref features and LR features, respectively, C_a(·) represents the computation containing two convolutional layers and a sigmoid layer, and C_b(·) represents the computation containing two convolutional layers.
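Equation (2) maps almost directly onto the following sketch, in which C_a is realized as two convolutions followed by a sigmoid and C_b as two convolutions; the kernel sizes and the intermediate activation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RAM(nn.Module):
    """Relevance attention module, following (2): attention-weight the Ref
    features, add a correction term, then concatenate with the LR features."""
    def __init__(self, nf=64):
        super().__init__()
        self.ca = nn.Sequential(                       # C_a: two convs + sigmoid
            nn.Conv2d(nf * 2, nf, 3, 1, 1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(nf, nf, 3, 1, 1), nn.Sigmoid())
        self.cb = nn.Sequential(                       # C_b: two convs
            nn.Conv2d(nf * 2, nf, 3, 1, 1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(nf, nf, 3, 1, 1))

    def forward(self, f_ref, f_lr):
        cat = torch.cat([f_ref, f_lr], dim=1)          # f_Ref ⊕ f_LR
        f_ram = self.ca(cat) * f_ref + self.cb(cat)    # attention * Ref + correction
        return torch.cat([f_ram, f_lr], dim=1)         # F_RAM = f_RAM ⊕ f_LR

out = RAM()(torch.rand(1, 64, 120, 120), torch.rand(1, 64, 120, 120))   # (1, 128, 120, 120)
```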
Fig. 5. Relevance attention module (RAM).

3) Texture Transformer: Inspired by SRNTT [20], we use residual blocks and skip connections to build a texture transformer. We gradually combine different levels of Ref features into the network, i.e., from level 3 to level 1. The SR texture reconstruction benefits from the multiscale texture transfer. There are three stages of texture transfer, corresponding to the three levels of features. Each stage contains an RAM and N residual blocks. Besides, the first two stages include an upsampling module, which upscales the features by subpixel convolution. In this process, the texture information in the Ref image can be effectively transferred to the SR results.
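One such stage can be sketched as below, with a plain convolution standing in for the RAM and the residual blocks, and subpixel (pixel-shuffle) convolution providing the ×2 upscaling between stages; all layer choices here are illustrative.

```python
import torch
import torch.nn as nn

class SubpixelUpsample(nn.Module):
    """x2 upsampling by subpixel (pixel-shuffle) convolution."""
    def __init__(self, nf=64):
        super().__init__()
        self.conv = nn.Conv2d(nf, nf * 4, 3, 1, 1)
        self.shuffle = nn.PixelShuffle(2)

    def forward(self, x):
        return self.shuffle(self.conv(x))

class TransferStageSketch(nn.Module):
    """One texture-transfer stage: fuse the aligned Ref features with the LR
    features (stand-in for RAM + residual blocks), then optionally upscale x2."""
    def __init__(self, nf=64, upsample=True):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv2d(nf * 2, nf, 3, 1, 1),
                                  nn.LeakyReLU(0.1, inplace=True))
        self.up = SubpixelUpsample(nf) if upsample else nn.Identity()

    def forward(self, f_lr, f_ref_aligned):
        return self.up(self.fuse(torch.cat([f_lr, f_ref_aligned], dim=1)))

f = torch.rand(1, 64, 120, 120)
stage1 = TransferStageSketch()                  # level 3 -> level 2 (x2)
stage2 = TransferStageSketch()                  # level 2 -> level 1 (x2)
stage3 = TransferStageSketch(upsample=False)    # level 1, no further upscaling
out = stage3(stage2(stage1(f, f), torch.rand(1, 64, 240, 240)),
             torch.rand(1, 64, 480, 480))       # (1, 64, 480, 480)
```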
4) Discriminator: Inspired by SPSR [16], we use two discriminators D_I and D_G for the image and gradient domains, respectively, which can help the generator learn the fine appearance and capture the geometric relationship. The two discriminators use the same network structure, which shares a similar model with the VGG-13 [54] network. Specifically, we replace the max-pooling layers with stride convolutions using a 4 × 4 kernel size to enable the network to learn its spatial downsampling. We also use the leaky-ReLU activation function to replace the ReLU activation function. Therefore, the discriminator has ten convolutional layers activated by the leaky-ReLU function, where the number of convolutional filters increases by a factor of 2 from 64 to 512 kernels as in the VGG network. Two fully connected layers follow the resulting 512 feature maps to provide the final assessment.
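A compact version of such a discriminator is sketched below: 4 × 4 strided convolutions replace pooling, Leaky ReLU is used throughout, the channel width doubles from 64 to 512, and two fully connected layers produce the realness score. The exact arrangement of the ten convolutions and the hidden width of the fully connected layers are assumptions.

```python
import torch
import torch.nn as nn

def d_block(cin, cout):
    # 3x3 conv + a strided 4x4 conv replacing max-pooling, both with Leaky ReLU
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, 1, 1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(cout, cout, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True))

class DiscriminatorSketch(nn.Module):
    """VGG-style discriminator with strided convolutions for downsampling."""
    def __init__(self):
        super().__init__()
        chs = [3, 64, 128, 256, 512, 512]
        self.features = nn.Sequential(*[d_block(chs[i], chs[i + 1]) for i in range(5)])
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                                  nn.Linear(512 * 16, 100), nn.LeakyReLU(0.2, inplace=True),
                                  nn.Linear(100, 1))      # two fully connected layers

    def forward(self, x):
        return self.head(self.features(x))    # one realness logit per image

logit = DiscriminatorSketch()(torch.rand(1, 3, 256, 256))   # (1, 1)
```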
C. Loss Function

Commonly used loss functions for SR methods include reconstruction loss L_rec, adversarial loss L_adv [48], and perceptual loss L_per [55]. In addition to the above three loss terms calculated between I_SR and I_HR, we adopt the gradient loss from [16], including the reconstruction loss L_g_rec and the adversarial loss L_g_adv, both of which are computed for the gradient maps. The gradient loss can help the network generate plausible geometric structures. Therefore, we have two discriminators, D_I and D_G, optimized by L_D_I and L_D_G, respectively. The overall loss is defined as

    L_total = L_rec + α L_per + β L_adv + γ L_g_rec + δ L_g_adv.    (3)

1) Reconstruction Loss: The reconstruction loss aims to preserve the spatial structure of the LR image. We use the L1 loss, which has shown effective sharpening of the SR images and a better convergence than the MSE. The reconstruction loss L_rec is defined as

    L_rec = ‖I_HR − I_SR‖_1    (4)

where I_HR denotes the HR image, I_SR denotes the SR image and I_SR = G(I_LR), G(·) denotes the generator, and I_LR denotes the LR image.

2) Adversarial Loss: The adversarial loss encourages the network to generate clear and visually favorable images [14]. The discriminator D_I and the generator G are optimized as follows:

    L_adv = −log(D_I(G(I_LR)))    (5)
    L_D_I = −log(D_I(I_HR)) − log(1 − D_I(I_SR))    (6)

where D_I is designed for distinguishing a real I_HR from a generated I_SR.

3) Perceptual Loss: Perceptual loss can enhance the visual quality of SR images by constraining the content similarity in the feature space between I_SR and I_HR. We use the pretrained 19-layer VGG network to extract the features of I_SR and I_HR. The perceptual loss L_per is defined as

    L_per = ‖φ_i(I_HR) − φ_i(I_SR)‖_1    (7)

where φ_i(·) denotes the ith layer output of the VGG19.

4) Gradient Loss: Gradient loss has been demonstrated to be useful to retain the geometric structure [16]. The gradient loss restricts the second-order relationship of neighboring pixels. We obtain the gradients of I_SR and I_HR by computing the difference between the adjacent pixels, as shown in (1). The gradient loss includes the gradient-based reconstruction loss L_g_rec and the gradient-based adversarial loss L_g_adv, which are defined as follows:

    L_g_rec = ‖M(I_HR) − M(I_SR)‖_1    (8)
    L_g_adv = −log(D_G(M(G(I_LR))))    (9)
    L_D_G = −log(D_G(M(I_HR))) − log(1 − D_G(M(I_SR)))    (10)

where M(·) denotes the gradient operation and D_G is designed for distinguishing a real M(I_HR) from a generated M(I_SR).
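Assembled in code, the objectives in (3)–(10) can be sketched as follows; here d_i, d_g, grad_map, and vgg_feat are callables standing in for D_I, D_G, M(·), and the pretrained VGG-19 feature extractor, and the binary-cross-entropy form of the adversarial terms is an implementation convention, not a statement of the exact code used in this work.

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, d_i, d_g, grad_map, vgg_feat,
                   alpha=0.1, beta=0.001, gamma=1.0, delta=0.001):
    """Total generator objective of (3). d_i / d_g return realness logits,
    grad_map implements M(.), vgg_feat returns perceptual features (all callables)."""
    l_rec = F.l1_loss(sr, hr)                                   # (4)
    l_per = F.l1_loss(vgg_feat(sr), vgg_feat(hr))               # (7)
    l_adv = F.binary_cross_entropy_with_logits(                 # (5), -log D_I(SR)
        d_i(sr), torch.ones_like(d_i(sr)))
    l_g_rec = F.l1_loss(grad_map(sr), grad_map(hr))             # (8)
    l_g_adv = F.binary_cross_entropy_with_logits(               # (9), -log D_G(M(SR))
        d_g(grad_map(sr)), torch.ones_like(d_g(grad_map(sr))))
    return l_rec + alpha * l_per + beta * l_adv + gamma * l_g_rec + delta * l_g_adv

def discriminator_loss(d, real, fake):
    """(6)/(10): -log D(real) - log(1 - D(fake)), for either D_I or D_G."""
    real_logit, fake_logit = d(real), d(fake.detach())
    return (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
            + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

# toy check with stand-in callables
d = lambda x: x.mean(dim=(1, 2, 3)).unsqueeze(1)
loss = generator_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
                      d, d, lambda x: x, lambda x: x)
```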

Fig. 6. Examples of HR-Ref pairs in the RRSSRD data set.


TABLE I
INFORMATION ABOUT THE RRSSRD

D. Implementation Details

Following the standard protocol, we obtain the LR images during training by downsampling the HR images using a bicubic kernel with downsampling factor r = 4. For each input minibatch, we randomly crop 16 patches of size 64 × 64 from the LR images. The corresponding HR patches have a size of 256 × 256. The texture transformer in an SR network contains three stages, each of which consists of 16 residual blocks. For the discriminators, we adopt a VGG-style network without BN layers. We set the weight hyperparameters α, β, γ, and δ to 0.1, 0.001, 1, and 0.001, respectively. The Adam optimizer is used for optimization with the parameters β1 = 0.9, β2 = 0.999, and ε = 1 × 10−8. The learning rates for both the generator and discriminators are set to 1 × 10−4 and are reduced to half at 50k, 100k, and 200k iterations. We first warm up the network for 30k iterations, where only L_rec and L_g_rec are applied. Then, we use all losses to train for a total of 300k iterations. We implement our models with the PyTorch framework and train them using 16 NVIDIA GTX 1080Ti GPUs.
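These settings correspond roughly to the following optimizer and schedule configuration; generator and discriminators are assumed to be the modules described in Section III, and the warm-up switching is only indicated, so this is a sketch rather than the exact training script.

```python
import torch

def build_optimizers(generator, discriminators, lr=1e-4):
    """Adam with beta1=0.9, beta2=0.999, eps=1e-8; LR halved at 50k/100k/200k iterations."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr,
                             betas=(0.9, 0.999), eps=1e-8)
    opt_d = torch.optim.Adam(
        (p for d in discriminators for p in d.parameters()),
        lr=lr, betas=(0.9, 0.999), eps=1e-8)
    milestones = [50_000, 100_000, 200_000]
    sched_g = torch.optim.lr_scheduler.MultiStepLR(opt_g, milestones, gamma=0.5)
    sched_d = torch.optim.lr_scheduler.MultiStepLR(opt_d, milestones, gamma=0.5)
    return opt_g, opt_d, sched_g, sched_d

WARMUP_ITERS = 30_000     # only L_rec and L_g_rec are applied during warm-up
TOTAL_ITERS = 300_000

def use_full_loss(iteration: int) -> bool:
    return iteration >= WARMUP_ITERS
```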
IV. EXPERIMENTS

A. Data Sets

To the best of our knowledge, the existing common data sets used for SR of remote sensing tasks [56], [57] do not provide coordinate information for each image, limiting the matching of reference images. Therefore, we build a benchmark data set for RefSR technology in this work, named the RRSSRD. This data set covers common classes of remote sensing scenes, including airport, bare land, beach, bridge, center, commercial, dense-residential, farmland, forest, industrial, meadow, medium-residential, park, parking, playground, pond, port, river, sparse-residential, viaduct, and so on.

Information about the RRSSRD is shown in Table I. Examples of the HR-Ref pairs in RRSSRD are shown in Fig. 6. RRSSRD consists of 4047 pairs of HR-Ref images with RGB bands. The HR images are acquired from WorldView-2 and GaoFen-2, and depict Xiamen and Jinan City, China. The Ref images are collected from Google Earth in 2019 with a spatial resolution of 0.6 m. We downsample each HR image 4 times to an LR image. The HR and Ref images are sized 480 × 480 pixels, and correspondingly, the LR images are sized 120 × 120 pixels.

Considering the model performance on different image sources and locations, we build four test data sets. Each test set consists of 40 pairs of HR-Ref images. In the first test set, the images are collected from WorldView-2 and depict Xiamen City, China. The images in the second test set are also taken in Xiamen City, but the HR images are collected from Microsoft Virtual Earth in 2018 with a spatial resolution of 0.5 m. In the third test set, the HR images are acquired from the GaoFen-2 (GF-2) satellite in 2018 with a spatial resolution of 0.8 m and depict Jinan City, China. The HR images in the fourth test set are collected from Microsoft Virtual Earth and depict Jinan in 2018 with a spatial resolution of 0.5 m. Note that all the Ref images are collected from Google Earth in 2019 with a spatial resolution of 0.6 m. LR images are obtained by ×4 bicubic downsampling from the HR images and are sized 120 × 120 pixels.

The Ref images are resized to 480 × 480 pixels, which is the same size as the HR images.

Furthermore, we test our method on real-world remotely sensed images from the GaoFen-1 (GF-1) satellite with a spatial resolution of 2 m. The corresponding Ref images are collected from Google Earth and have a spatial resolution of 0.6 m.

B. Evaluation Metrics

The PSNR and SSIM have been used as standard evaluation metrics in image SR [58]. Nevertheless, as revealed in some recent studies [59], [60], super-resolved images may sometimes have high PSNR and SSIM scores with oversmoothed results but tend to lack realistic visual results. Therefore, apart from the PSNR, the perception index (PI) [59] and the learned perceptual image patch similarity (LPIPS) [60] are included in our experiments. Besides, the PI and natural image quality evaluator (NIQE) [61] can be used as evaluation metrics on real-world images. The NIQE and PI were originally introduced as nonreference image quality assessment methods based on low-level statistical features [62]. The NIQE is obtained by computing the 36 identical natural scene statistical (NSS) features from patches of the same size from the image [61]. The PI is calculated by incorporating the criteria of Ma et al. [63] and NIQE as follows:

    PI = (1/2)((10 − Ma) + NIQE).    (11)

The LPIPS is a full-reference metric that measures perceptual image similarity using a pretrained deep network. We use the AlexNet [64] model to compute the l2 distance in the feature space. LPIPS can be calculated using a given image y and a ground-truth image y0 as follows:

    LPIPS(y, y0) = Σ_l (1/(H_l W_l)) Σ_{h,w} ‖ w_l ⊙ (f^l_{h,w} − f^l_{0,h,w}) ‖²_2    (12)

where H_l and W_l represent the height and width of the lth layer, respectively, f^l_{h,w} and f^l_{0,h,w} represent the features of the corresponding y and y0 of the lth layer at location (h, w), respectively, w_l is a learned weight vector, and ⊙ is the elementwise multiplication operation. Note that, in contrast to PSNR and SSIM, lower PI and LPIPS indicate better SR results.
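In practice, the PI of (11) is a one-line combination of the two no-reference scores, and LPIPS can be computed with a packaged implementation; the sketch below assumes the `lpips` Python package with its AlexNet backbone, which is a tooling choice of ours rather than part of the original evaluation code.

```python
import torch
import lpips   # pip install lpips; provides the learned LPIPS metric

def perception_index(ma_score: float, niqe_score: float) -> float:
    """PI = 0.5 * ((10 - Ma) + NIQE), as in (11); lower is better."""
    return 0.5 * ((10.0 - ma_score) + niqe_score)

# LPIPS with the AlexNet backbone; inputs expected in [-1, 1], shaped (N, 3, H, W)
lpips_fn = lpips.LPIPS(net='alex')

def lpips_distance(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Mean LPIPS over a batch; lower is better."""
    return lpips_fn(sr, hr).mean()
```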
C. Quantitative and Qualitative Comparison With Different Methods

In this section, we compare our proposed method with state-of-the-art SISR and RefSR methods on the four test data sets. The compared SISR methods include five CNN-based SISR methods (i.e., VDSR [11], SRResNet [14], MDSR [12], WDSR [13], and DBPN [47]), two state-of-the-art SR methods for remote sensing images (i.e., RDBPN [37] and Cycle-CNN [65]), and two GAN-based SISR methods (i.e., ESRGAN [15] and SPSR [16]). Two RefSR methods, i.e., the recently proposed CrossNet [19] and SRNTT [20], are also included in the comparison. Note that of these two methods, CrossNet is a CNN-based method and SRNTT is a GAN-based method. All these methods are fully optimized on our training data set to obtain their best performance for a fair comparison. Note that Cycle-CNN aims to reconstruct the real-world images and requires real LR images that are not generated from HR images. Thus, we add the real LR images from the GF-1 satellite for the Cycle-CNN method and also compare our method with Cycle-CNN both on the four test data sets and the real-world GF-1 data. The results using bicubic interpolation (Bicubic) are also included for comparison.

For better comparison with both GAN-based methods and CNN-based methods, we train two networks, i.e., RRSGAN and RRSNet. RRSNet, a simplified version of RRSGAN with the discriminators removed, uses only the reconstruction loss. RRSNet is evaluated to make a fair comparison with the CNN-based methods.

We quantitatively evaluated the SR results using four metrics, including PI, LPIPS, PSNR, and SSIM. In each row, the best result is highlighted in red. As shown in Table II, RRSNet exhibits the highest scores in the metrics of PSNR and SSIM, whereas RRSGAN achieves the best performance in the metric of LPIPS on all four test data sets. For the PI metric, our proposed approach outperforms the other SR methods on most test data sets.

Generally, CNN-based methods have better PSNR and SSIM because they focus on preserving the spatial structure of the LR images. However, the SR results of CNN-based methods suffer from a lack of realistic visual appearance, which causes worse LPIPS and PI. In contrast, GAN-based methods obtain better LPIPS and PI, as they use adversarial loss and perceptual loss, encouraging the network to generate visually favorable results. Besides, we notice that the performances of CrossNet and SRNTT are not satisfactory. The reason may be that the models are designed based on specific assumptions in common scenarios, and it is unreasonable to apply them to remote sensing scenarios directly. Owing to the usage of Ref features, RRSNet surpasses other CNN-based methods by a large margin. Compared with GAN-based methods, RRSGAN is not only very competitive on the image quality assessment but also performs well on PSNR and SSIM. The reason is that RRSGAN can utilize rich texture information from Ref images to reconstruct the details in LR images.

A visual comparison is presented in Fig. 7 and can further explain the quantitative results. The results of bicubic interpolation cannot produce extra details. Owing to the learning-based technologies, CNN-based SISR methods, such as SRResNet, MDSR, WDSR, DBPN, RDBPN, and Cycle-CNN, can reconstruct some texture details but still suffer from blurry contours due to the simplex optimization objective function. GAN-based SISR methods, such as ESRGAN and SPSR, have a better visual appearance but generate artificial artifacts and worsen the reconstruction results. The SR results of SRNTT suffer from the problem of blocky artifacts due to its patch matching method. Compared with other SR methods, our proposed approach recovers finer texture details, and the results are more natural and realistic.

Fig. 7. Visual comparison of our methods with different SR methods on the test sets.

D. Robustness of Our Proposed Method

In practical applications, RefSR methods need to be sufficiently robust against various Ref images with different quality levels. To test the robustness of our proposed approach, we simulate four scenarios, including Ref images from different image sources, covered by clouds, mismatched, and missing. Correspondingly, the four kinds of Ref images are retrieved from Microsoft Virtual Earth (MS), covered by clouds (Cloud), irrelevant images (Irrelevant), and black images (Black).

TABLE II
QUANTITATIVE COMPARISON WITH DIFFERENT METHODS. VDSR, SRRESNET, MDSR, WDSR, DBPN, RDBPN, AND CYCLE-CNN ARE CNN-BASED SISR METHODS. ESRGAN AND SPSR ARE GAN-BASED SISR METHODS. CROSSNET AND SRNTT ARE REFSR METHODS. FOR PSNR AND SSIM, A HIGHER SCORE INDICATES BETTER, WHEREAS FOR PI AND LPIPS, A LOWER SCORE INDICATES BETTER. IN EACH ROW, THE BEST RESULT IS HIGHLIGHTED IN RED

TABLE III
RESULTS OF THE USE OF DIFFERENT REF IMAGES ON THE FIRST TEST SET. FOR PSNR AND SSIM, A HIGHER SCORE INDICATES BETTER, WHEREAS FOR LPIPS, A LOWER SCORE INDICATES BETTER. RED INDICATES THE BEST AND BLUE INDICATES THE SECOND-BEST RESULTS

TABLE IV
RESULTS OF THE ABLATION STUDY ON REF TEXTURE TRANSFER. FOR PSNR AND SSIM, A HIGHER SCORE INDICATES BETTER, WHEREAS FOR PI AND LPIPS, A LOWER SCORE INDICATES BETTER. IN EACH COLUMN, THE BEST RESULT IS HIGHLIGHTED IN RED

To further demonstrate the robustness and effectiveness of our proposed method, we also use the bicubic upsampled LR images (LR X4) or HR images (HR) as references.

We calculate the quantitative results of the reconstruction under the conditions mentioned above on the first test set. The results presented in Table III show that the proposed approach is robust in handling the most common distortion cases in remote sensing. Even when the “Black” images or “Cloud” images are used as the references, the results of our method are still better than those of the GAN-based SISR methods (compared with Table II). This is due to the multiscale reconstruction structure and the gradient loss, which guarantee the baseline performance, and the RAM, which suppresses the irrelevant information in the Ref features.

Fig. 8. SR results of using different Ref images. The first and the third rows represent the different Ref images. The second and the fourth rows represent
the corresponding SR results.

Although the SR results of the model with irrelevant Ref images have similar quantitative scores in terms of PSNR and SSIM to those of the model with the Google image, a better relevance can produce more realistic textures, which is reflected in the metric of LPIPS. As shown in Fig. 8, using relevant Ref images, the SR results show sharp edges and display clear image content. In contrast, using irrelevant Ref images, the model degenerates to an SISR method and cannot reconstruct more details.

Specifically, the model achieves the best performance when we use the HR images as references. It demonstrates the effectiveness of the texture transfer from the Ref images, and the performance can be improved with highly relevant Ref images.

Fig. 9. Experimental results in a real-world scenario. The results of the PI and NIQE of each SR image are presented. A lower score indicates better results.

TABLE V
RESULTS OF THE ABLATION STUDY ON THE FEATURE ALIGNMENT METHOD. FOR PSNR AND SSIM, A HIGHER SCORE INDICATES BETTER, WHEREAS FOR PI AND LPIPS, A LOWER SCORE INDICATES BETTER. RED INDICATES THE BEST RESULTS

E. Test on Real-World Data

In this section, we show the SR results of real-world GaoFen-1 satellite data. As there are no HR images, we use two nonreference image quality metrics, i.e., PI and NIQE, to evaluate the perception results. For both metrics, a lower score indicates a better reconstruction result. As presented in Fig. 9, the reconstruction results in a real-world scenario reveal that our proposed approach achieves better perception results than the other SISR and RefSR methods. Note that Cycle-CNN is an SR method toward real-world remote sensing images, and we add the real LR images from the GF-1 satellite for this method. Although our method does not use the GF-1 satellite images in the training stage, the result of RRSNet is still competitive compared with Cycle-CNN. It demonstrates that our method performs well in the real-world scenario, owing to the use of real Ref images.

F. Ablation Studies

In this section, we verify the effectiveness of each component of our approach, including the Ref textual transfer, GAFA method, RAM, and gradient loss. We also discuss the hyperparameter tuning of loss weight and model efficiency.

1) Effectiveness of Ref Textual Transfer: To verify the effectiveness of the Ref textual transfer, we conduct ablation experiments. We use the same training strategy and network parameters as introduced in Section III-C, except for the used level of Ref features in texture transfer. Note that we keep the same feature alignment module and the same number of residual blocks at each stage in the comparison experiments instead of removing each entire stage of texture transfer to avoid different performances caused by the different depths of the network. “Without texture transfer” means that we do not use Ref features in the SR process, and this method can be regarded as an SISR method. Then, we gradually add different levels of Ref features into the network from level 1 to level 3. “With 1-level texture transfer” means that we only use the Ref level-1 features, which is shown as the largest orange box in Fig. 3. “With 2-level texture transfer” means that we use the Ref level 1 and Ref level 2 features. “With 3-level texture transfer” means that we use all three levels of Ref features. As shown in Table IV, the use of Ref features can significantly improve the performance of SR results compared with the SISR method. Gradually combining more in-depth Ref features can improve the performance of SR results. Therefore, we use three levels of Ref features in our method.

2) Effectiveness of the GAFA Method: We discuss the effect of feature alignment between Ref and LR images. We conduct four comparison experiments, as shown in Table V.

TABLE VI
RESULTS OF THE ABLATION STUDY ON FEATURE ALIGNMENT METHOD IN TERMS OF THE NUMBER OF ALIGNMENT LEVELS. FOR PSNR AND SSIM, A HIGHER SCORE INDICATES BETTER, WHEREAS FOR PI AND LPIPS, A LOWER SCORE INDICATES BETTER. IN EACH COLUMN, THE BEST RESULT IS HIGHLIGHTED IN RED

TABLE VII
RESULTS OF THE ABLATION STUDY ON FEATURE ALIGNMENT METHOD IN TERMS OF THE RESAMPLING STRATEGY OF REF IMAGES. WE REPORT LPIPS RESULTS ON TEST SETS. A LOWER SCORE INDICATES BETTER. IN EACH COLUMN, THE BEST RESULT IS HIGHLIGHTED IN RED

TABLE VIII
RESULTS OF THE ABLATION STUDY ON RAM. WE REPORT LPIPS RESULTS ON TEST SETS. A LOWER SCORE INDICATES BETTER. THE BEST RESULT IS HIGHLIGHTED IN RED

“Baseline” means that the extracted Ref features are directly used without the process of aligning to the LR features. “FlowNet” implies that the Ref features are aligned to the LR features by the flow-based method, which is applied in CrossNet [66]. “DConv without Grad” and “DConv with Grad” align the Ref features to the LR features by the deformable convolution method. The only difference is whether the gradient features are extracted and used to assist feature alignment.

As shown in Table V, “Baseline” only performs better than SRGAN (shown in Table II), which indicates that the Ref features cannot be reasonably used without feature alignment. “FlowNet” can effectively improve the performance compared with “Baseline.” “DConv” can further improve the performance compared with the flow-based method. The gradient-assisted method can effectively improve the performance in PSNR, SSIM, and LPIPS, which indicates that the reconstruction results are more credible and realistic than those not using the gradient-assisted method. The experimental results prove that feature alignment is an essential part of the RefSR approach and that a better feature alignment method can obtain better SR results.

We conduct the ablation study on the feature alignment method in terms of the number of alignment levels. As we discuss the optimal level of the Ref features in textual transfer in Section IV-F1, the number of alignment levels requires at least three to provide different Ref features. Therefore, we compare the performance of three-level alignment with four-level alignment. As shown in Table VI, the results of the three-level alignment method are similar to the results of the four-level alignment method, indicating that the features of level 4 have a minor contribution to improving feature alignment. Considering the computational cost, we use the three-level offset to align the LR domain and the Ref domain in our method.

We also conduct the ablation study on the feature alignment method in terms of the resampling strategy of Ref images. As shown in Table VII, the feature alignment method is effective even without the resampling strategy, while the resampling strategy further improves the results. The reason is that the resampled Ref features match the frequency band of the upsampled LR features, which can reduce the difficulty of feature alignment.

Fig. 10. Attention masks of different levels in RAM.

3) Effectiveness of RAM: To verify the effectiveness of the RAM, we remove the RAM and directly combine the Ref and LR features in the feature transfer process. We compare the results by using different Ref images. As shown in Table VIII, the use of RAM can effectively improve the robustness of the model in different scenarios.
TABLE IX
RESULTS OF THE ABLATION STUDY ON GRADIENT LOSS. FOR PSNR AND SSIM, A HIGHER SCORE INDICATES BETTER, WHEREAS FOR PI AND LPIPS, A LOWER SCORE INDICATES BETTER. IN EACH COLUMN, THE BEST RESULT IS HIGHLIGHTED IN RED

TABLE X
RESULTS OF DIFFERENT LOSS WEIGHTS. FOR PSNR AND SSIM, A HIGHER SCORE INDICATES BETTER, WHEREAS FOR PI AND LPIPS, A LOWER SCORE INDICATES BETTER. IN EACH COLUMN, THE BEST RESULT IS HIGHLIGHTED IN RED

TABLE XI
COMPARISON OF MODEL PARAMETERS AND INFERENCE RUNTIME. VDSR, SRRESNET, MDSR, WDSR, DBPN, RDBPN, AND CYCLE-CNN ARE CNN-BASED SISR METHODS. ESRGAN AND SPSR ARE GAN-BASED SISR METHODS. CROSSNET AND RRSNET ARE CNN-BASED REFSR METHODS. SRNTT AND RRSGAN ARE GAN-BASED REFSR METHODS

The reason is that RAM can suppress the influence of the less relevant information in the Ref features. The attention masks of different levels in RAM are presented in Fig. 10 and can further explain the process. As shown by the dark areas in the red rectangles, the areas with land cover changes between the LR image and the Ref image, caused by different seasons or building changes, have received less attention. The attention is focused on the relevant area between the LR image and the Ref image, as shown by the bright area in the green rectangles. Therefore, the RAM can improve the robustness of the model by suppressing the irrelevant information and enhancing the relevant information between the LR features and the Ref features.

4) Effectiveness of the Gradient Loss: We analyze the effect of the gradient loss. “Baseline” means that we only use the common SR loss functions, including the reconstruction loss L_rec, the adversarial loss L_adv, and the perceptual loss L_per. The gradient-based reconstruction loss L_g_rec and the gradient-based adversarial loss L_g_adv are added sequentially. As shown in Table IX, the gradient loss can improve the PI and LPIPS compared with those of the baseline model. It indicates that the use of the gradient loss can yield a more realistic visual appearance. Besides, although the additional gradient discriminator increases the training time and training difficulty, it does not increase the inference time of the test phase.

5) Hyperparameter Tuning of Loss Weight: We perform ablation experiments to understand the impact of the different loss terms in (3). We use the same training strategy and network parameters as introduced in Section III-C, except for different loss weights. First, we examine the effect of using only the reconstruction loss L_rec, i.e., with RRSNet, when α, β, γ, and δ are all set to 0. As shown in Table X, the highest PSNR and SSIM values can be obtained using only the reconstruction loss compared with other loss weight settings. However, the reconstruction loss function often leads to overly smoothed results and is weak in restoring natural and realistic textures. The introduction of the gradient losses L_g_rec and L_g_adv can greatly improve the visual quality of reconstruction, which has been verified in Section IV-F4.

To determine the appropriate setting of the loss weights, based on the commonly used loss weights in the SR methods [15], [16], we experiment with three sets of hyperparameters. Following the setting of different loss weights in [16], the gradient-based loss weight settings are consistent with the image-based loss weight settings, i.e., γ is set to 1 and δ is equal to β.

5) Hyperparameter Tuning of Loss Weight: We perform ablation experiments to understand the impact of the different loss terms in (1). We use the same training strategy and network parameters as introduced in Section III-C, except for different loss weights. First, we examine the effect of using only the reconstruction loss Lrec, i.e., with RRSNet, when α, β, γ, and δ are all set to 0. As shown in Table X, the highest PSNR and SSIM values can be obtained using only the reconstruction loss compared with the other loss weight settings. However, the reconstruction loss often leads to overly smoothed results and is weak in restoring natural and realistic textures. The introduction of the gradient losses Lg_rec and Lg_adv can greatly improve the visual quality of the reconstruction, which has been verified in Section IV-F4.
To determine the appropriate setting of the loss weights, based on the commonly used loss weights in the SR methods [15], [16], we experiment with three sets of hyperparameters. Following the setting of the loss weights in [16], the gradient-based loss weights are kept consistent with the image-based loss weights, i.e., γ is set to 1 and δ is equal to β. From Table X, we can see that the reasonable use of the adversarial loss Ladv and the perceptual loss Lper can greatly improve the perceptual effect and contribute to better PI and LPIPS values. However, excessive adversarial and perceptual loss weights can reduce the performance of the SR results or even lead to the failure of the feature alignment module. The reason is that our model learns to align feature maps in an unsupervised fashion, where we do not explicitly define a loss term for pixelwise offset estimation. This means that the feature alignment indirectly benefits from the final supervision of the SR results. In such a situation, the reconstruction loss can provide clearer guidance for learning the feature alignment than the adversarial loss and the perceptual loss. To balance the image reconstruction effect and the perceptual effect, we set the weight hyperparameters α, β, γ, and δ to 0.1, 0.001, 1, and 0.001, respectively.
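For clarity, a minimal sketch of how these weights could enter the overall objective is given below, assuming that (1) takes the common weighted-sum form with the image reconstruction weight fixed to 1; the assignment of α and β to the perceptual and adversarial terms is inferred from the discussion above and should be checked against the definition of (1).

# Assumed weighted-sum form of the overall generator objective (sketch of (1)):
# L = L_rec + alpha * L_per + beta * L_adv + gamma * L_g_rec + delta * L_g_adv
ALPHA, BETA, GAMMA, DELTA = 0.1, 0.001, 1.0, 0.001  # chosen weight setting

def total_generator_loss(l_rec, l_per, l_adv, l_g_rec, l_g_adv):
    return (l_rec
            + ALPHA * l_per     # perceptual loss
            + BETA * l_adv      # image-domain adversarial loss
            + GAMMA * l_g_rec   # gradient-based reconstruction loss
            + DELTA * l_g_adv)  # gradient-based adversarial loss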
6) Model Efficiency: In Table XI, we report the number of model parameters, the computational complexity, the training time, and the inference time (in the GPU mode) of the different SISR and RefSR methods. For the inference time, all the approaches are run on an NVIDIA GTX 1080Ti GPU and tested on 120 × 120 LR images. Correspondingly, the Ref input images are 480 × 480 pixels and are only used in the RefSR methods. For the training time, some of the SISR methods are trained for 1 000 000 iterations, including VDSR, SRResNet, EDSR, MDSR, and WDSR. Cycle-CNN and CrossNet are trained for 500 000 iterations. SRNTT is trained for 400 000 iterations. The rest of the methods are trained for 250 000 iterations. Note that the training time of SRNTT is measured only for the network training phase, excluding the offline feature swapping phase. In general, the training time of the GAN-based methods is longer than that of the CNN-based methods. The inference time of the RefSR methods is longer than that of the SISR methods due to the extra processing of the Ref images. Compared with SRNTT, our proposed methods can effectively reduce the inference time. In future work, we will further optimize our approach in terms of model efficiency.
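As a side note on how such GPU timings are commonly obtained, the sketch below times a forward pass with explicit CUDA synchronization; the model handle, input shapes, and run counts are placeholders rather than the exact benchmarking script behind Table XI.

# Sketch of GPU inference timing with explicit CUDA synchronization.
# `model`, the input shapes, and the run counts are placeholders.
import time
import torch

@torch.no_grad()
def time_inference(model, runs=100):
    device = torch.device("cuda")
    model = model.to(device).eval()
    lr = torch.randn(1, 3, 120, 120, device=device)   # LR input patch
    ref = torch.randn(1, 3, 480, 480, device=device)  # Ref input (RefSR methods only)
    for _ in range(10):                               # warm-up iterations
        model(lr, ref)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(lr, ref)
    torch.cuda.synchronize()
    return (time.time() - start) / runs               # average seconds per image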
V. CONCLUSION

In this article, we explore the use of reference (Ref) images to assist in the reconstruction of LR images in remote sensing tasks. We build a benchmark data set and propose RRSGAN, an end-to-end network with a GAFA module and a texture transformer. GAFA extracts the Ref features and aligns them to the LR features. The texture transformer can then effectively utilize the aligned Ref features to help reconstruct the fine textures in the LR images. Experimental results demonstrate the effectiveness and robustness of RRSGAN. This work also indicates the great potential of the RefSR approach in the field of remote sensing. In future work, we will further explore the performance of RefSR at a larger upscaling factor (e.g., eight times) and optimize our approach in terms of model efficiency.
REFERENCES

[1] R. Mathieu, C. Freeman, and J. Aryal, “Mapping private gardens in urban areas using object-oriented techniques and very high-resolution satellite imagery,” Landscape Urban Planning, vol. 81, no. 3, pp. 179–192, Jun. 2007.
[2] B. Pan, Z. Shi, X. Xu, T. Shi, N. Zhang, and X. Zhu, “CoinNet: Copy initialization network for multispectral imagery semantic segmentation,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 5, pp. 816–820, May 2019.
[3] R. Dong, W. Li, H. Fu, M. Xia, J. Zheng, and L. Yu, “Semantic segmentation based large-scale oil palm plantation detection using high-resolution satellite images,” Proc. SPIE, vol. 10988, May 2019, Art. no. 109880D.
[4] W. Li, C. He, J. Fang, J. Zheng, H. Fu, and L. Yu, “Semantic segmentation-based building footprint extraction using very high-resolution satellite images and multi-source GIS data,” Remote Sens., vol. 11, no. 4, p. 403, Feb. 2019.
[5] S. Yuan et al., “Long time-series analysis of urban development based on effective building extraction,” Proc. SPIE, vol. 11398, Apr. 2020, Art. no. 113980M.
[6] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, “SOD-MTGAN: Small object detection via multi-task generative adversarial network,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 206–221.
[7] W. Yang, X. Zhang, Y. Tian, W. Wang, J.-H. Xue, and Q. Liao, “Deep learning for single image super-resolution: A brief review,” IEEE Trans. Multimedia, vol. 21, no. 12, pp. 3106–3121, Dec. 2019.
[8] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, “Edge-enhanced GAN for remote sensing image superresolution,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5799–5812, Aug. 2019.
[9] N. Huang, Y. Yang, J. Liu, X. Gu, and H. Cai, “Single-image super-resolution for remote sensing data using deep residual-learning neural network,” in Proc. Int. Conf. Neural Inf. Process. Guangzhou, China: Springer, 2017, pp. 622–630.
[10] S. Lei, Z. Shi, and Z. Zou, “Coupled adversarial training for remote sensing image super-resolution,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 5, pp. 3633–3643, May 2020.
[11] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1646–1654.
[12] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” 2017, arXiv:1707.02921. [Online]. Available: http://arxiv.org/abs/1707.02921
[13] J. Yu et al., “Wide activation for efficient and accurate image super-resolution,” 2018, arXiv:1808.08718. [Online]. Available: http://arxiv.org/abs/1808.08718
[14] C. Ledig et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4681–4690.
[15] X. Wang et al., “ESRGAN: Enhanced super-resolution generative adversarial networks,” in Proc. Eur. Conf. Comput. Vis. Workshops (ECCVW), Sep. 2018.
[16] C. Ma, Y. Rao, Y. Cheng, C. Chen, J. Lu, and J. Zhou, “Structure-preserving super resolution with gradient guidance,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7769–7778.
[17] Q. Liu, J. C. Trinder, and I. L. Turner, “Automatic super-resolution shoreline change monitoring using Landsat archival data: A case study at Narrabeen–Collaroy Beach, Australia,” Proc. SPIE, vol. 11, no. 1, Mar. 2017, Art. no. 016036.
[18] Z.-S. Liu, W.-C. Siu, and Y.-L. Chan, “Reference based face super-resolution,” IEEE Access, vol. 7, pp. 129112–129126, 2019.
[19] H. Zheng, M. Ji, H. Wang, Y. Liu, and L. Fang, “CrossNet: An end-to-end reference-based super resolution network using cross-scale warping,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 88–104.
[20] Z. Zhang, Z. Wang, Z. Lin, and H. Qi, “Image super-resolution by neural texture transfer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7982–7991.
[21] H. Yue, X. Sun, J. Yang, and F. Wu, “Landmark image super-resolution by retrieving Web images,” IEEE Trans. Image Process., vol. 22, no. 12, pp. 4865–4878, Dec. 2013.
[22] Y. Wang, Y. Liu, W. Heidrich, and Q. Dai, “The light field attachment: Turning a DSLR into a light field camera using a low budget camera ring,” IEEE Trans. Vis. Comput. Graph., vol. 23, no. 10, pp. 2357–2364, Oct. 2017.
[23] H. Zheng et al., “Learning cross-scale correspondence and patch-based synthesis for reference-based super-resolution,” in Proc. Brit. Mach. Vis. Conf. (BMVC), T.-K. Kim, S. Zafeiriou, G. Brostow, and K. Mikolajczyk, Eds. BMVA Press, Sep. 2017, pp. 138.1–138.13, doi: 10.5244/C.31.138.
[24] F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo, “Learning texture transformer network for image super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 5791–5800.


[25] L. Yue, H. Shen, J. Li, Q. Yuan, H. Zhang, and L. Zhang, “Image super-resolution: The techniques, applications, and future,” Signal Process., vol. 128, pp. 389–408, Nov. 2016.
[26] R. Y. Tsai and T. S. Huang, “Multiframe image restoration and registration,” Adv. Comput. Vis. Image Process., vol. 1, no. 2, pp. 317–339, 1984.
[27] T. Akgun, Y. Altunbasak, and R. M. Mersereau, “Super-resolution reconstruction of hyperspectral images,” IEEE Trans. Image Process., vol. 14, no. 11, pp. 1860–1875, Nov. 2005.
[28] J. Ma, J. C.-W. Chan, and F. Canters, “An operational superresolution approach for multi-temporal and multi-angle remotely sensed imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 1, pp. 110–124, Feb. 2012.
[29] H. Shen, M. K. Ng, P. Li, and L. Zhang, “Super-resolution reconstruction algorithm to MODIS remote sensing images,” Comput. J., vol. 52, no. 1, pp. 90–100, Feb. 2008.
[30] F. Li, X. Jia, D. Fraser, and A. Lambert, “Super resolution for remote sensing images based on a universal hidden Markov tree model,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 3, pp. 1270–1278, Mar. 2010.
[31] R. Fernandez-Beltran, P. Latorre-Carmona, and F. Pla, “Single-frame super-resolution in remote sensing: A practical overview,” Int. J. Remote Sens., vol. 38, no. 1, pp. 314–354, Jan. 2017.
[32] D. Yang, Z. Li, Y. Xia, and Z. Chen, “Remote sensing image super-resolution: Challenges and approaches,” in Proc. IEEE Int. Conf. Digit. Signal Process. (DSP), Jul. 2015, pp. 196–200.
[33] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 136–144.
[34] Y. Luo, L. Zhou, S. Wang, and Z. Wang, “Video satellite imagery super resolution via convolutional neural networks,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 12, pp. 2398–2402, Dec. 2017.
[35] Z. Shao and J. Cai, “Remote sensing image fusion with deep convolutional neural network,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 5, pp. 1656–1669, May 2018.
[36] A. Xiao, Z. Wang, L. Wang, and Y. Ren, “Super-resolution for ‘Jilin-1’ satellite video imagery via a convolutional network,” Sensors, vol. 18, no. 4, p. 1194, Apr. 2018.
[37] Z. Pan, W. Ma, J. Guo, and B. Lei, “Super-resolution of single remote sensing image based on residual dense backprojection networks,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 10, pp. 7918–7933, Oct. 2019.
[38] S. Zhang, Q. Yuan, J. Li, J. Sun, and X. Zhang, “Scene-adaptive remote sensing image super-resolution using a multiscale attention network,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 7, pp. 4764–4779, Jul. 2020.
[39] L. Zhang, D. Chen, J. Ma, and J. Zhang, “Remote-sensing image superresolution based on visual saliency analysis and unequal reconstruction networks,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 6, pp. 4099–4115, Jun. 2020.
[40] I. Yanovsky, B. H. Lambrigtsen, A. B. Tanner, and L. A. Vese, “Efficient deconvolution and super-resolution methods in microwave imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 9, pp. 4273–4283, Sep. 2015.
[41] S. Kanakaraj, M. S. Nair, and S. Kalady, “SAR image super resolution using importance sampling unscented Kalman filter,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 2, pp. 562–571, Feb. 2018.
[42] X. Xu et al., “A new spectral-spatial sub-pixel mapping model for remotely sensed hyperspectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 11, pp. 6763–6778, Nov. 2018.
[43] C. Yi, Y.-Q. Zhao, and J. C.-W. Chan, “Hyperspectral image super-resolution based on spatial and spectral correlation fusion,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 7, pp. 4165–4177, Jul. 2018.
[44] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. Eur. Conf. Comput. Vis. Zürich, Switzerland: Springer, 2014, pp. 184–199.
[45] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016.
[46] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1637–1645.
[47] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1664–1673.
[48] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[49] J. Caballero et al., “Real-time video super-resolution with spatio-temporal networks and motion compensation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4778–4787.
[50] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Munich, Germany: Springer, 2015, pp. 234–241.
[51] J. Dai et al., “Deformable convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 764–773.
[52] X. Wang, K. C. K. Chan, K. Yu, C. Dong, and C. C. Loy, “EDVR: Video restoration with enhanced deformable convolutional networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019.
[53] T.-W. Hui, X. Tang, and C. C. Loy, “LiteFlowNet: A lightweight convolutional neural network for optical flow estimation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8981–8989.
[54] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[55] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis. Amsterdam, The Netherlands: Springer, 2016, pp. 694–711.
[56] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in Proc. 18th SIGSPATIAL Int. Conf. Adv. Geographic Inf. Syst. (GIS), 2010, pp. 270–279.
[57] G.-S. Xia et al., “Structural high-resolution satellite image indexing,” in Proc. ISPRS TC 7th Symp.–100 Years ISPRS, Vienna, Austria, Jul. 2010, pp. 298–303.
[58] M. Irani and S. Peleg, “Super resolution from image sequences,” in Proc. 10th Int. Conf. Pattern Recognit., vol. 2, Jun. 1990, pp. 115–120.
[59] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “The 2018 PIRM challenge on perceptual image super-resolution,” in Proc. Eur. Conf. Comput. Vis. Workshops (ECCVW), Sep. 2018.
[60] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 586–595.
[61] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a ‘completely blind’ image quality analyzer,” IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, Mar. 2013.
[62] L. Liu, B. Liu, H. Huang, and A. C. Bovik, “No-reference image quality assessment based on spatial and spectral entropies,” Signal Process., Image Commun., vol. 29, no. 8, pp. 856–863, Sep. 2014.
[63] C. Ma, C.-Y. Yang, X. Yang, and M.-H. Yang, “Learning a no-reference quality metric for single-image super-resolution,” Comput. Vis. Image Understand., vol. 158, pp. 1–16, May 2017.
[64] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017, doi: 10.1145/3065386.
[65] P. Wang, H. Zhang, F. Zhou, and Z. Jiang, “Unsupervised remote sensing image super-resolution using cycle CNN,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2019, pp. 3117–3120.
[66] A. Dosovitskiy et al., “FlowNet: Learning optical flow with convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2758–2766.

Runmin Dong received the bachelor's degree in information and computing science from the Department of Science, Beijing Jiaotong University, Beijing, China, in 2017. She is currently pursuing the Ph.D. degree in ecology with the Department of Earth System Science, Tsinghua University, Beijing. Her research interests include remote sensing image processing, deep learning, land cover mapping, image super-resolution reconstruction, and self-supervised representation learning.


Lixian Zhang received the bachelor's degree in engineering of surveying and mapping from the School of Geodesy and Geomatics, Wuhan University, Hubei, China, in 2018. He is pursuing the Ph.D. degree in ecology with the Department of Earth System Science, Tsinghua University, Beijing, China. His research interests include building extraction from remote sensing images, deep learning, and remote sensing image super-resolution reconstruction.

Haohuan Fu (Member, IEEE) received the Ph.D. degree in computing from Imperial College London, London, U.K., in 2009. He is a Professor with the Ministry of Education Key Laboratory for Earth System Modeling and the Department of Earth System Science, Tsinghua University, Beijing, China. He is also the Deputy Director of the National Supercomputing Center, Wuxi, China. His research interests include design methodologies for highly efficient and highly scalable simulation applications that can take advantage of emerging multicore, many-core, and reconfigurable architectures, and make full utilization of current Peta-Flops and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining platforms that combine the statistics methods and machine learning technologies.
