Abstract—Single image super-resolution (SISR) is a notoriously challenging ill-posed problem, which aims to obtain a high-resolution (HR) output from one of its low-resolution (LR) versions. To solve the SISR problem, powerful deep learning algorithms have recently been employed and have achieved state-of-the-art performance. In this survey, we review representative deep learning-based SISR methods and group them into two categories according to their major contributions to two essential aspects of SISR: the exploration of efficient neural network architectures for SISR, and the development of effective optimization objectives for deep SISR learning. For each category, a baseline is first established, and several critical limitations of the baseline are summarized. Then representative works on overcoming these limitations are presented based on their original contents as well as our critical understandings and analyses, and relevant comparisons are conducted from a variety of perspectives. Finally, we conclude this review with some vital current challenges and future trends in SISR leveraging deep learning algorithms.

Index Terms—Single image super-resolution, deep learning, neural networks, objective function

This work was partly supported by the National Natural Science Foundation of China (No. 61471216 and No. 61771276) and the Special Foundation for the Development of Strategic Emerging Industries of Shenzhen (No. JCYJ20170307153940960 and No. JCYJ20170817161845824). (Corresponding author: Wenming Yang)

W. Yang, X. Zhang, W. Wang and Q. Liao are with the Department of Electronic Engineering, Graduate School at Shenzhen, Tsinghua University, China (E-mail: {yang.wenming@sz, xc-zhang16@mails, wangwei17@mails, liaoqm@}.tsinghua.edu.cn).

Y. Tian is a PhD candidate at the University of Rochester, USA (E-mail: ytian21@ur.rochester.edu).

J.-H. Xue is with the Department of Statistical Science, University College London, UK (E-mail: jinghao.xue@ucl.ac.uk).

I. INTRODUCTION

DEEP learning (DL) [1] is a branch of machine learning that aims at learning hierarchical representations of data. Deep learning has shown prominent superiority over other machine learning algorithms in many artificial intelligence domains, such as computer vision [2], speech recognition [3] and natural language processing [4]. Generally speaking, DL is endowed with a strong capacity for handling substantial unstructured data, owing to two main contributors: the development of efficient computing hardware and the advancement of sophisticated algorithms.

Single image super-resolution (SISR) is a notoriously challenging ill-posed problem, because a specific low-resolution (LR) input can correspond to many possible high-resolution (HR) images, and the HR space (in most instances the natural image space) to which we intend to map the LR input is usually intractable [5]. Previous methods for SISR mainly have two drawbacks: one is the unclear definition of the mapping that we aim to develop between the LR space and the HR space, and the other is the inefficiency of establishing a complex high-dimensional mapping given massive raw data. Benefiting from a strong capacity for extracting effective high-level abstractions that bridge the LR space and the HR space, recent DL-based SISR methods have achieved significant improvements, both quantitatively and qualitatively.

In this survey, we attempt to give an overall review of recent DL-based SISR algorithms. Our main focus is on two areas: efficient neural network architectures designed for SISR, and effective optimization objectives for DL-based SISR learning. The reason for this taxonomy is that when we apply DL algorithms to tackle a specific task, it is best to consider both universal DL strategies and the specific domain knowledge. From the perspective of DL, although many other techniques such as data preprocessing [6] and model training techniques [7], [8] are also quite important, the combination of DL and domain knowledge in SISR is usually the key to success, and it is often reflected in the innovations of neural network architectures and optimization objectives for SISR. In each of these two focused areas, we first introduce a benchmark, and then discuss several representative works in terms of their original contributions and experimental results, as well as our comments and views.

The rest of the paper is arranged as follows. In Section II, we present relevant background concepts of SISR and DL. In Section III, we survey the literature on exploring efficient neural network architectures for various SISR tasks. In Section IV, we survey the research on proposing effective objective functions for different purposes. In Section V, we summarize some trends and challenges for DL-based SISR. We conclude this survey in Section VI.

II. BACKGROUND

A. Single Image Super-Resolution

Super-resolution (SR) [9] refers to the task of restoring high-resolution images from one or more low-resolution observations of the same scene. According to the number of input LR images, SR can be classified into single image super-resolution (SISR) and multi-image super-resolution (MISR). Compared with MISR, SISR is much more popular because of its high efficiency. Since an HR image with high perceptual quality has more valuable details, it is widely useful in many areas, such as medical imaging, satellite imaging and security
e.g. [57], have shown that inputting an inaccurate estimation has side effects on the final performance.

Although the normal deconvolution layer, which has already been included in popular open-source packages such as Caffe [58] and Tensorflow [59], offers quite a good solution to the first question, there is still an underlying problem: when we use the nearest neighbor interpolation, the points in the upsampled features are repeated several times in each direction. This configuration of the upsampled pixels is redundant. To deal with this problem, Shi et al. proposed an efficient sub-pixel convolution layer in [49], known as ESPCN, whose structure is shown in Fig.4. Rather than increasing resolution by explicitly enlarging feature maps as the deconvolution layer does, ESPCN expands the channels of the output features to store the extra points needed to increase resolution, and then rearranges these points to get the HR output through a certain mapping criterion. As the expansion is carried out in the channel dimension, a smaller kernel size suffices. [55] further shows that when the common but redundant nearest neighbor interpolation is replaced with an interpolation that pads the sub-pixels with zeros, the deconvolution layer can be simplified into the sub-pixel convolution in ESPCN. Obviously, compared with the nearest neighbor interpolation, this interpolation is more efficient, which also verifies the effectiveness of ESPCN.

2) The Deeper, The Better: In DL research, there is theoretical work [60] showing that the solution space of a DNN can be expanded by increasing its depth or its width. In some situations, to get more hierarchical representations more effectively, many works mainly focus on the improvement brought by increasing depth. Recently, various DL-based applications have also demonstrated the great power of very deep neural networks despite many training difficulties. VDSR [61] is the first very deep model used in SISR. As shown in Fig.5(a), VDSR is a 20-layer VGG-net [62]. The VGG architecture sets all kernel sizes to 3 × 3 (kernel sizes are usually odd and, taking the growth of the receptive field into account, 3 × 3 is the smallest such kernel size). To train this deep model, they used a relatively large initial learning rate to accelerate convergence, and used gradient clipping to prevent gradient explosion.

Besides the novel architecture, VDSR has made two more contributions. The first one is that a single model is used for multiple scales, since the SISR processes with different scale factors are strongly related to each other. This fact is the basis of many traditional SISR methods. Like SRCNN, VDSR takes the bicubic interpolation of the LR image as input. During training, VDSR mixes the bicubic versions of LR images with different scale factors. For larger scale factors (×3, ×4), the mapping for a smaller scale factor (×2) may also be informative. The second contribution is residual learning. Unlike the direct mapping from the bicubic version to HR, VDSR uses a deep CNN to learn the mapping from the bicubic version to the residual between the bicubic version and HR. They argued that residual learning can improve performance and accelerate convergence.

Since the convolution kernels in the nonlinear mapping part of VDSR are very similar, in order to reduce parameters, Kim et al. further proposed DRCN [63], which applies the same convolution kernel in the nonlinear mapping part 16 times, as shown in Fig.5(b). To overcome the difficulties of training a deep recursive CNN, a multi-supervised strategy is applied, and the final result can be regarded as the fusion of 16 intermediate results. The coefficients for fusion are a list of trainable positive scalars whose summation is 1. As they showed, DRCN and VDSR have quite similar performance.

Here we believe it is necessary to emphasize the importance of the multi-supervised training in DRCN. This strategy not only creates short paths through which the gradients can flow more smoothly during back-propagation, but also guides all the intermediate representations to reconstruct raw HR outputs. Finally, fusing all these raw HR outputs produces a strong result. However, as for the fusion, this strategy has two flaws: 1) once the weight scalars are determined in the training process, they do not change with different inputs; 2) using single scalars to weight HR outputs does not take pixel-wise differences into consideration; that is to say, it would be better to weight different parts adaptively.

A plain architecture like VGG-net is hard to make deeper. Various deep models based on skip connections can be extremely deep and have achieved state-of-the-art performance in many tasks. Among them, ResNet [64], [65] proposed by He et al. is the most representative one. Readers can refer to [66], [67] for further discussions on why ResNet works well. In [68], the authors proposed SRResNet, made up of 16 residual units (a residual unit consists of two nonlinear convolutions with residual learning). In each unit, batch normalization (BN) [69] is used for stabilizing the training process. The overall architecture of SRResNet is shown in Fig.5(c). Based on the original residual unit in [65], Tai et al. proposed DRRN [70], in which basic residual units are rearranged in a recursive topology to form a recursive block, as shown in Fig.5(d). Then, for the sake of parameter reduction, each block shares the same parameters and is reused recursively, like the single recursive convolution kernel in DRCN.

EDSR [71] proposed by Lee et al. has achieved the state-of-the-art performance up to now. EDSR has mainly made three improvements on the overall frame: 1) Compared with the residual unit used in previous work, EDSR removes the usage of BN, as shown in Fig.5(e). The original ResNet with BN was designed for classification, where inner representations are highly abstract and can be insensitive to the shift brought by BN. As for image-to-image tasks like SISR, since the input and output are strongly related, even if the convergence of the network is not a problem, such a shift may harm the final performance. 2) Besides regular depth increasing, EDSR also increases the number of output features of each layer on a large scale. To ease the difficulties of training such a wide ResNet, the residual scaling trick proposed in [72] is employed. 3) Also inspired by the fact that the SISR processes with different scale factors are strongly related to each other, when training the models for ×3 and ×4 scales, the authors of [71] initialized the parameters with the pre-trained ×2 network. This pre-training strategy accelerates the training and improves the final performance.

The effectiveness of the pre-training strategy in EDSR implies that models for different scales may share many intermediate representations. To explore this idea further, like
building a multi-scale architecture as VDSR does on the condition of bicubic input, the authors of EDSR proposed MDSR to achieve a multi-scale architecture, as shown in Fig.5(g). In MDSR, the convolution kernels for nonlinear mapping are shared across different scales; only the front convolution kernels for extracting features and the final sub-pixel upsampling convolution are different. At each update during training MDSR, minibatches for ×2, ×3 and ×4 are randomly chosen, and only the corresponding parts of MDSR are updated.

Besides ResNet, DenseNet [73] is another effective architecture based on skip connections. In DenseNet, each layer is connected with all the preceding representations, and bottleneck layers are used in units and blocks to reduce the parameter number. In [74], the authors pointed out that ResNet enables feature re-usage while DenseNet enables new feature exploration. Based on the basic DenseNet, SRDenseNet [75], as shown in Fig.5(f), further concatenates all the features from different blocks before the deconvolution layer, which is shown to be effective in improving performance. MemNet [76] proposed by Tai et al. uses a residual unit recursively to replace the normal convolution in the block of the basic DenseNet, and adds dense connections among different blocks, as shown in Fig.5(h). They explained that the local connections in the same block resemble short-term memory and the connections with previous blocks resemble long-term memory [77]. Recently, RDN [78] proposed by Zhang et al. uses a similar structure. In an RDN block, basic convolution units are densely connected like DenseNet, and at the end of an RDN block, a bottleneck layer is used, followed by residual learning across the whole block. Before entering the reconstruction part, features from all previous blocks are fused by dense connection and residual learning.

3) Combining Properties of the SISR Process with the Design of the CNN Frame: In this sub-section, we discuss some deep frames whose architectures or procedures are inspired by representative methods for SISR. Compared with the above-mentioned NN-oriented methods, these methods can be better interpreted, and are sometimes more sophisticated in handling some challenging cases of SISR.

Combining sparse coding with deep NN: The sparse prior in natural images, and the relationships between the HR and LR spaces rooted in this prior, were widely used for their great performance and theoretical support. SCN [79] proposed by Wang et al. uses the learned iterative shrinkage and thresholding algorithm (LISTA) [80], an algorithm for producing approximate estimations of sparse coding based on an NN, to solve the time-consuming inference in traditional sparse coding SISR. They further introduced a cascaded version (CSCN) [81] which employs multiple SCNs. Previous works such as SRCNN tried to explain general CNN architectures with sparse coding theory, which from today's view may be a bit unconvincing. SCN combines these two important concepts innovatively, and gains both quantitative and qualitative improvements.

Learning to ensemble by NN: Different models specialize in different image patterns of SISR. From the perspective of ensemble learning, a better result can be acquired by adaptively fusing various models with different purposes at the pixel level. Motivated by this idea, MSCN proposed by Liu et al. [82] develops an extra module in the form of a CNN, taking the LR as input and outputting several tensors with the same shape as the HR. These tensors can be viewed as adaptive element-wise weights for each raw HR output. By selecting an NN as the raw SR inference module, the raw estimating parts and the fusing part can be optimized jointly. However, in MSCN, the summation of the coefficients at each pixel is not 1, which may be a little incongruous.

Deep architectures with progressive methodology: Increasing SISR performance progressively has been extensively studied before, and many recent DL-based methods also exploit it from various perspectives. Here we mainly discuss three novel works within this scope: DEGREE [83], combining the progressive property of ResNet with traditional sub-band reconstruction; LapSRN [84], generating SR of different scales progressively; and PixelSR [85], leveraging conditional autoregressive models to generate SR pixel by pixel.

Compared with other deep architectures, ResNet is intriguing for its progressive properties. Taking SRResNet for example, one can observe that directly sending the representations produced by intermediate residual blocks to the final reconstruction part will also yield a quite good raw HR estimator. The deeper these representations are, the better the obtained results. A similar phenomenon for ResNet applied to recognition is reported in [66]. DEGREE proposed by Yang et al. combines this progressive property of ResNet with the sub-band reconstruction of traditional SR methods [86]. The residues learned in each residual block can be used to reconstruct high-frequency details, resembling the signals from a certain high-frequency band. To simulate sub-band reconstruction, a recursive residual block is used. Compared with traditional supervised sub-band recovery methods, which need to obtain sub-band ground truth by diverse filters, this simulation with a recursive ResNet avoids explicitly estimating intermediate sub-band components, benefiting from end-to-end representation learning.

As mentioned above, models for small scale factors can be used as a raw estimator for large-scale SISR. In the SISR community, SISR under big scale factors (e.g. ×8) has been a challenging problem for a long time. In such situations, plausible priors are imposed to restrict the solution space. One simple way to tackle this is to gradually increase resolution by adding extra supervision on the auxiliary SISR process of a small scale. Based on this heuristic prior, LapSRN proposed by Lai et al. uses the Laplacian pyramid structure to reconstruct HR outputs. LapSRN has two branches, the feature extraction branch and the image reconstruction branch, as shown in Fig.6. At each scale, the image reconstruction branch estimates a raw HR output of the present stage, and the feature extraction branch outputs a residue between the raw estimator and the corresponding ground truth, as well as extracting useful representations for the next stage.

When faced with big scale factors with severe loss of necessary details, some researchers suggest that synthesizing rational details can produce better results. In this situation, deep generative models to be mentioned in next sections could be
only with example-based training.

Recently, Ulyanov et al. showed in [98] that the structure of deep neural networks can capture a great deal of low-level image statistical prior. They reported that when neural networks are used to fit images of different statistical properties, the convergence speed for different kinds of images can also be different. As shown in Fig.9, naturally-looking images, whose different parts are highly relevant, will converge much faster. On the contrary, images such as noise and shuffled images, which have little inner relationship, tend to converge more slowly. Many inverse problems such as denoising and super-resolution are modeled as the pixel-wise summation of the original image and independent additive noise. Based on the observed prior, when used to fit these degraded images, the neural networks tend to fit the naturally-looking images first, which can be used to retain the naturally-looking parts as well as filter out the noisy ones. To illustrate the effectiveness of the proposed prior for SISR, given only the LR x_0, they took a fixed random vector z as input to fit the HR x with a randomly-initialized DNN f_θ by optimizing

    min_θ ‖d(f_θ(z)) − x_0‖²_2,    (5)

where d(·) is a common differentiable downsampling operator. The optimization is terminated in advance so that only the noisy parts are filtered out. Although this novel, totally unsupervised method is outperformed by supervised learning methods, it does considerably better than some other naive methods.

Deep architectures with internal examples: Internal-example SISR algorithms are based on the recurrence of small pieces of information across different scales of a single image, and are shown to be better at dealing with specific details that rarely exist in other external images [99]. ZSSR [100] proposed by Shocher et al. is the first work combining deep architectures with internal-example learning. In ZSSR, besides the image for testing, no extra images are needed, and all the patches for training are taken from different degraded pairs of the test image. As demonstrated in [101], the visual

C. Comparisons among Different Models and Discussion

In this section, we will summarize recent progress in deep architectures for SISR from two perspectives: quantitative comparisons among models trained for a specific degradation, and comparisons among models for handling non-specific degradations.

For the first part, the quantitative criteria mainly include:

1) PSNR/SSIM [104] for measuring reconstruction quality: Given two images I and Î, both with N pixels, the MSE and the peak signal-to-noise ratio (PSNR) are defined as

    MSE = (1/N) ‖I − Î‖²_F,    (6)

    PSNR = 10 log₁₀(L² / MSE),    (7)

where ‖·‖_F is the Frobenius norm and L is usually 255. The structural similarity index (SSIM) is defined as

    SSIM(I, Î) = [(2 µ_I µ_Î + k₁)(2 σ_IÎ + k₂)] / [(µ²_I + µ²_Î + k₁)(σ²_I + σ²_Î + k₂)],    (8)

where µ_I and σ²_I are the mean and variance of I, σ_IÎ is the covariance between I and Î, and k₁ and k₂ are constant relaxation terms.

2) The number of parameters of the NN, for measuring storage efficiency (Params).

3) The number of composite multiply-accumulate operations, for measuring computational efficiency (Mult&Adds): Since operations in NNs for SISR are mainly multiplications with additions, here we use the Mult&Adds of CARN [105] to measure computation, assuming that the wanted SR is 720p.

Notably, it has been shown in [48] and [49] that the training datasets have great influence on the final performance, and usually more abundant training data will lead to better results. Generally, these models are trained on three main datasets: 1) 91 images from [19] and 200 images from [106], called the 291 dataset (some models only use the 91 images); 2) images derived from ImageNet [107] randomly; and 3) the newly published DIV2K dataset [108]. Besides the different number
of images each dataset contains, the quality of the images in each dataset is also different. Images in the 291 dataset are usually small (on average 150 × 150), images in ImageNet are much bigger, while images in DIV2K are of very high quality. Because of the restricted resolution of the images in the 291 dataset, models trained on this set have difficulties in getting big patches with big receptive fields. Therefore, models based on the 291 dataset usually take the bicubic interpolation of the LR as input, which is quite time-consuming. Table I compares different models on the mentioned criteria.

From Table I we can see that, generally, as the depth and the number of parameters grow, the performance improves. However, the growth rate of performance levels off. Recently, some works on designing light models [109], [105], [110] and on learning sparse structural NNs [111] were proposed to achieve relatively good performance with less storage and computation, which is very meaningful in practice.

As for the second part, we mainly show that the performance of models trained for a specific degradation drops drastically when the true degradation mismatches the one assumed for training. For example, we use a 7 × 7 Gaussian kernel with width 1.6 and a direct downsampler with scale 2 to get an LR image, and then use EDSR trained with bicubic degradation, IRCNN, SRMD and ZSSR to generate SR outputs. As shown in Fig.10, the performance of EDSR drops drastically with obvious blur, while the other models for non-specific degradation perform quite well. Therefore, to tackle some longstanding problems in SISR such as unknown degradation, the direct usage of general deep learning techniques may not be enough. More effective solutions can be achieved by combining the power of DL with the specific properties of the SISR scene.

IV. OPTIMIZATION OBJECTIVES FOR DL-BASED SISR

A. Benchmark of Optimization Objectives for DL-based SISR

We select the MSE loss used in SRCNN as the benchmark. It is known that using MSE favors a high PSNR, and PSNR is a widely-used metric for quantitatively evaluating image restoration quality. Optimizing MSE can be viewed as a regression problem, leading to a point estimation of θ as

    min_θ Σ_i ‖F(x_i; θ) − y_i‖²,    (9)

where (x_i, y_i) is the ith training example and F(x; θ) is a CNN parameterized by θ. Here (9) can be interpreted in a probabilistic way by assuming Gaussian white noise (ε ∼ N(0, σ²I)) independent of the image in the regression model; the conditional probability of y given x then becomes a Gaussian distribution with mean F(x; θ) and diagonal covariance matrix σ²I, where I is the identity matrix:

    p(y|x) = N(y; F(x; θ), σ²I).    (10)

Then using maximum likelihood estimation (MLE) on the training examples with (10) will lead to (9).

The Kullback-Leibler divergence (KLD) between the conditional empirical distribution Pdata and the conditional model distribution Pmodel is defined as

    D_KL(Pdata ‖ Pmodel) = E_{z∼Pdata}[log(Pdata(z) / Pmodel(z))].    (11)

We call (11) the forward KLD, where z = y|x denotes the HR (SR) conditioned on its LR counterpart, and Pdata and Pmodel are the conditional distributions of HR|LR and SR|LR, respectively. Here E_{z∼Pdata}[log Pdata(z)] is an intrinsic term determined by the training data, regardless of the parameter θ of the model (or say the model distribution Pmodel(x; θ)). Hence, when we use the training samples to estimate the parameter θ, minimizing this KLD is equivalent to MLE.

Here we have demonstrated that MSE is a special case of MLE, and furthermore MLE is a special case of KLD. However, we may wonder what happens if the assumptions underlying these specializations are violated. This has led to some emerging objective functions, from four perspectives:

1) Translating MLE into MSE can be achieved by assuming Gaussian white noise. Although the Gaussian model is the most widely-used model for its simplicity and theoretical support, what if this independent Gaussian noise assumption is violated in a complicated scene like SISR?

2) To use MLE, we need to assume the parametric form of the data distribution. What if the parametric form is mis-specified?

3) Apart from the KLD in (11), are there any other distances between probability measures which we can use as optimization objectives for SISR?

4) Under specific circumstances, how can we choose suitable objective functions according to their properties?

Based on some solutions to these four questions, recent work on optimization objectives for DL-based SISR will be discussed in Sections IV-B, IV-C, IV-D and IV-E, respectively.

B. Objective Functions Based on Non-Gaussian Additive Noises

The poor perceptual quality of the SISR images obtained by directly optimizing MSE demonstrates a fact: using Gaussian additive noise in the HR space is not good enough. To tackle this problem, solutions are proposed from two aspects: use other distributions for this additive noise, or transform the HR space into some space where the Gaussian noise is reasonable.

1) Denote Additive Noise with Other Probability Distributions: In [112], Zhao et al. investigated the difference between the mean absolute error (MAE) and the MSE used to optimize NNs in image processing. Like (9), MAE can be written as

    min_θ Σ_i ‖F(x_i; θ) − y_i‖₁.    (12)

From the perspective of probability, (12) can be interpreted as introducing Laplacian white noise, and like (10), the conditional probability becomes

    p(y|x) = Laplace(y; F(x; θ), bI).    (13)

Compared with MSE in regression, MAE is believed to be more robust against outliers. As reported in [112], when MAE is used to optimize an NN, the NN tends to converge faster and produce better results. They argued that this might be because
Figure 10: Comparisons of 'monarch' in Set14 for scale 2 with Gaussian kernel degradation. We can see that, faced with a degradation mismatching the one assumed for training, the performance of EDSR drops drastically.
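The contrast between the MSE objective in (9) and the MAE objective in (12) can be sketched numerically. The snippet below is a minimal illustration (not any particular paper's implementation) of why the L1 loss is regarded as more robust to outliers: its per-element gradient is a bounded sign term, while the L2 gradient grows linearly with the residual.

```python
import numpy as np

def mse_loss_grad(pred, target):
    # Eq. (9)-style squared-error loss; the gradient grows linearly
    # with the residual, so a single outlier can dominate the update.
    r = pred - target
    return (r ** 2).mean(), 2.0 * r / r.size

def mae_loss_grad(pred, target):
    # Eq. (12)-style absolute-error loss; the gradient is sign(residual),
    # bounded no matter how large an outlier is.
    r = pred - target
    return np.abs(r).mean(), np.sign(r) / r.size

pred = np.array([0.1, 0.0, 5.0])   # last entry plays the role of an outlier
target = np.zeros(3)
_, g_l2 = mse_loss_grad(pred, target)
_, g_l1 = mae_loss_grad(pred, target)
# the outlier contributes 50x more than the 0.1 residual to the L2 gradient,
# but exactly as much as any other nonzero residual to the L1 gradient
```

With psi-free pixel losses like these, the choice between (9) and (12) is therefore a choice about how strongly large residuals should steer training.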
MAE could guide the NN to reach a better local minimum. Other similar loss functions in robust statistics can be viewed as modeling additive noise with other probability distributions. Although these specific distributions often cannot represent the unknown additive noise very precisely, their corresponding robust statistical loss functions are used in many DL-based SISR works for their conciseness and advantages over MSE.

2) Using MSE in a Transformed Space: Alternatively, we can search for a mapping Ψ to transform the HR space into a certain space where Gaussian white noise can be used reasonably. From this perspective, Bruna et al. [113] proposed the so-called perceptual loss to leverage deep architectures. In [113], the conditional probability of the residual r between HR and LR given the LR x is simulated by the Gibbs energy model:

    p(r|x) = exp(−‖Φ(x) − Ψ(r)‖² − log Z),    (14)

where Φ and Ψ are two mappings between the original spaces and the transformed ones, and Z is the partition function. The features produced by sophisticated supervised deep architectures have been shown to be perceptually stable and discriminative, denoted by Ψ(r).² Here Ψ is the corresponding deep architecture. In contrast, Φ is the mapping between the LR space and the manifold represented by Ψ(r), trained by minimizing the Euclidean distance as

    min_Φ ‖Φ(x) − Ψ(r)‖².    (15)

After Φ is obtained, the final result r can be inferred with SGD by solving

    min_r ‖Φ(x) − Ψ(r)‖².    (16)

For further improvement, [113] also proposed a fine-tuning algorithm in which Φ and Ψ can be fine-tuned to the data. Like the alternating updating in GANs, Φ and Ψ are fine-tuned with SGD based on the current r. However, this fine-tuning involves calculating the gradient of the partition function Z, which decomposes into the well-known positive phase and negative phase of learning. Hence, to avoid sampling within inner loops, a biased estimator of this gradient is chosen for simplicity.

² Either a scattering network or VGG can be denoted by Ψ. When Ψ is VGG, there is no residual learning and fine-tuning.
The inference algorithm in [113] is extremely time-consuming. To improve efficiency, Johnson et al. utilized this perceptual loss in an end-to-end training manner [114]. In [114], the SISR network is directly optimized with SGD by minimizing the MSE in the feature manifold produced by VGG-16:

min_θ ||Ψ(F(x; θ)) − Ψ(y)||²,  (17)

where Ψ is the mapping represented by VGG-16, F(x; θ) denotes the SISR network, and y is the ground truth. Compared with [113], [114] replaces the nonlinear mapping Φ and the expensive inference with an end-to-end trained CNN, and their results show that this change does not degrade the restoration quality but does accelerate the whole process.

Perceptual loss mitigates blurring and leads to more visually pleasing results than directly optimizing MSE in the HR space. However, there is still no theoretical analysis of why it works. In [113], the authors generally concluded that successful supervised networks used for high-level tasks can produce very compact and stable features. In these feature spaces, small pixel-level variations and much other trivial information can be omitted, so the feature maps mainly focus on the pixels that interest human observers. At the same time, owing to the deep architectures, the most specific and discriminative information of the input is shown to be retained in the feature spaces, as evidenced by the great performance of these models in various high-level tasks. From this perspective, using MSE in these feature spaces concentrates on the parts that are attractive to human observers with little loss of the original content, so perceptually pleasing results can be obtained.

C. Optimizing Forward KLD with Non-parametric Estimation

Parametric estimation methods such as MLE need to specify in advance the parametric form of the data distribution, and thus suffer from model misspecification. In contrast, non-parametric estimation methods such as kernel density estimation (KDE) fit the data without distributional assumptions, which is robust when the real distributional form is unknown. Based on non-parametric estimation, the contextual loss [115], [116] was recently proposed by Mechrez et al. to maintain natural image statistics. In the contextual loss, a Gaussian kernel function is applied:

K(x, y) = exp(−dist(x, y)/h − log Z),  (18)

where dist(x, y) can be any symmetric distance between x and y, h is the bandwidth, and the partition function Z = ∫ exp(−dist(x, y)/h) dy. Then P_data and P_model are

P_data(z) = Σ_{z_i∼P_data} K(z, z_i),  P_model(z) = Σ_{w_j∼P_model} K(z, w_j),  (19)

and (11) can be rewritten as

D_KL(P_data || P_model) = (1/N) Σ_k [ log Σ_{z_i∼P_data} K(z_k, z_i) − log Σ_{w_j∼P_model} K(z_k, w_j) ].  (20)

The first log term in (20) is a constant with respect to the model parameters. Let us denote the kernel K(z_k, w_j) in the second log term by A_kj. Then the optimization objective in (20) can be rewritten as

−(1/N) Σ_k log Σ_j A_kj.  (21)

With the Jensen inequality, we can get a lower bound of (21):

−(1/N) Σ_k log Σ_j A_kj ≥ −log( (1/N) Σ_k Σ_j A_kj ) ≥ 0.  (22)

The first equality holds if and only if ∀k, k′, Σ_j A_kj = Σ_j A_k′j. Both equalities hold if and only if ∀k, Σ_j A_kj = 1. When (21) reaches 0, the given lower bound also reaches 0. Therefore, we can take this lower bound as the optimization objective alternatively.

We can further simplify the lower bound in (22). It can be rewritten as

−log( (1/N) Σ_j ||A_j||₁ ),  (23)

where A_j = (A_1j, · · · , A_kj)ᵀ and ||·||₁ is the ℓ₁ norm. When the bandwidth h → 0, the affinity A_kj degrades into an indicator function: if x_k = y_j then A_kj ≈ 1, else A_kj ≈ 0. In this case, the ℓ₁ norm is well approximated by the ℓ∞ norm, which returns the maximum element of the vector. Thus (23) can degenerate into the contextual loss in [115], [116]:

−log( (1/N) Σ_j max_k A_kj ).  (24)

Recently, implicit maximum likelihood estimation (IMLE) [117] was proposed, and its conditional version has been applied to SISR [118]. Here we briefly show that minimizing the IMLE objective is equivalent to minimizing an upper bound of the forward KLD with KDE. Let us use a Gaussian kernel:

K(x, y) = (1/(√(2π) h)) exp(−||x − y||²/(2h²)).  (25)

As with (21), the optimization objective can be rewritten as

−(1/N) Σ_k log Σ_j exp(−||z_k − w_j||²/(2h²)).  (26)

With {w_j}_{j=1}^m and {z_k}_{k=1}^N, keeping only the nearest w_j for each z_k gives Σ_j exp(−||z_k − w_j||²/(2h²)) ≥ exp(−min_j ||z_k − w_j||²/(2h²)), so a simple upper bound of (26) is

(1/N) Σ_k min_j ||z_k − w_j||²/(2h²),  (27)

which equals the IMLE objective up to the constant scale 1/(2h²).
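The chain from the KDE-based forward-KLD objective to the nearest-neighbor (IMLE-style) bound can be checked numerically. This is a sketch of ours with made-up 1-D samples; the names z, w, h, and A follow the notation of the text, and the KDE objective drops the constant normalizers.

```python
# Toy 1-D check of the KDE forward-KLD objective, the contextual-style loss,
# and the nearest-neighbor upper bound. Data values are illustrative only.
import math

z = [0.0, 0.5, 1.0, 1.5]   # samples z_k drawn from P_data
w = [0.1, 0.9, 2.0]        # samples w_j drawn from P_model
h = 0.3                    # kernel bandwidth

def d2(a, b):
    """Scaled squared distance ||a - b||^2 / (2 h^2)."""
    return (a - b) ** 2 / (2 * h ** 2)

# KDE-based forward-KLD objective under a Gaussian kernel (constants dropped)
kld_obj = -sum(math.log(sum(math.exp(-d2(zk, wj)) for wj in w))
               for zk in z) / len(z)

# IMLE-style bound: keep only the nearest w_j for each z_k
imle_bound = sum(min(d2(zk, wj) for wj in w) for zk in z) / len(z)

# Since sum_j exp(-d) >= exp(-min_j d), the bound holds for any data:
assert kld_obj <= imle_bound

# Contextual-style loss: -log of the average best affinity per model sample
A = [[math.exp(-d2(zk, wj)) for wj in w] for zk in z]
ctx_loss = -math.log(sum(max(A[k][j] for k in range(len(z)))
                         for j in range(len(w))) / len(z))
```

The assertion passes for any choice of z, w, and h, which is the point of the bound: minimizing the average nearest-neighbor distance also pushes down the KDE-based KLD objective.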
Figure 12: Visual comparisons between MSE, MSE + GAN, and MAE + GAN + contextual loss (the authors of [68] and [116] released their results). We can see that the perceptual loss leads to a lower PSNR/SSIM but a better visual quality.

Figure 13: (a) The perception-distortion space is divided by the perception-distortion curve, below which is an area that cannot be attained. (b) Quantitative comparisons from the perception and distortion views, using the no-reference metric proposed by [95] and RMSE; the included methods are [47], [84], [71], [61], [68], [121], [116], [114].
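The perception-distortion trade-off behind Figure 13(a) can be made concrete in a toy linear-Gaussian model. The setup below is our own illustration, not taken from [127]: Y ~ N(0,1), X = Y + N(0,1), and the estimator family is Ŷ = aX, for which both the distortion (MSE) and a distribution distance are closed-form.

```python
# Toy linear-Gaussian illustration of the perception-distortion trade-off.
# Model (assumed): Y ~ N(0,1), X = Y + N(0,1), estimator Yhat = a * X.
import math

def mse(a):
    """Distortion: E[(Y - aX)^2] = 1 - 2a + 2a^2 for this model."""
    return 1 - 2 * a + 2 * a ** 2

def w2(a):
    """Perception: Wasserstein-2 distance between N(0,1) and N(0, 2a^2) = |1 - a*sqrt(2)|."""
    return abs(1 - a * math.sqrt(2))

a_mmse = 0.5                 # MMSE estimator E[Y|X]: best distortion
a_perc = 1 / math.sqrt(2)    # matches the output distribution exactly (w2 = 0)

assert mse(a_mmse) < mse(a_perc)   # best distortion ...
assert w2(a_perc) < w2(a_mmse)     # ... but worse perception, and vice versa
```

No single a wins on both axes: the distortion-optimal estimator shrinks the output variance and so moves its distribution away from that of Y, which is exactly the tension formalized in (32).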
perception-aimed losses. Recently, Blau et al. [127] discussed the inherent trade-off between the two kinds of losses. Their discussion can be simplified into an optimization problem:

P(D) = min_{P_Ŷ|X} d(P_Y, P_Ŷ)  s.t.  E[Δ(Y, Ŷ)] ≤ D,  (32)

where Δ(·, ·) is a distortion-aimed loss and d(·, ·) is a (pseudo) distance between distributions. Furthermore, the authors also proved that if d(·, ·) is convex in its second argument, then P(D) is monotonically non-increasing and convex. From this property, we can draw the curve of P(D) and easily see the trade-off shown in Fig. 13(a): improving one criterion must come at the cost of the other. However, as shown in Section IV-B, using MSE in the VGG feature space achieves a better quality, and choosing suitable Δ and d may ease this trade-off.

As for the perception-aimed losses mentioned in Sections IV-C and IV-D, up to now there is no rigorous analysis of their differences. Here we apply the no-reference quality assessment proposed by Ma et al. [95] together with RMSE to conduct quantitative comparisons, and the representative comparisons are depicted in Fig. 13(b). To summarize, we should be aware that there is no one-fits-all objective function, and we should choose one suitable to the context of the application.

V. Trends and Challenges

Along with the promising performance that DL algorithms have achieved in SISR, several important challenges as well as inherent trends remain, as follows.

1) Lighter Deep Architectures for Efficient SISR: Although advanced deep models have achieved high accuracy for SISR, it is still difficult to deploy these models in real-world scenarios, mainly due to their massive parameters and computation. To tackle this, we need to design lightweight deep models, or slim existing deep models, for SISR with fewer parameters and less computation at the expense of little or no performance degradation. Hence in the future, researchers are expected to
[24] J. Liu, W. Yang, X. Zhang, and Z. Guo, "Retrieval compensated group structured sparsity for image super-resolution," IEEE Transactions on Multimedia, vol. 19, no. 2, pp. 302–316, 2017.
[25] S. Schulter, C. Leistner, and H. Bischof, "Fast and accurate image upscaling with super-resolution forests," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3791–3799.
[26] K. Zhang, D. Tao, X. Gao, X. Li, and J. Li, "Coarse-to-fine learning for single-image super-resolution," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 5, pp. 1109–1122, 2017.
[27] J. Yu, X. Gao, D. Tao, X. Li, and K. Zhang, "A unified learning framework for single image super-resolution," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 4, pp. 780–792, 2014.
[28] C. Deng, J. Xu, K. Zhang, D. Tao, X. Gao, and X. Li, "Similarity constraints-based structured output regression machine: An approach to image super-resolution," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 12, pp. 2472–2485, 2016.
[29] W. Yang, Y. Tian, F. Zhou, Q. Liao, H. Chen, and C. Zheng, "Consistent coding scheme for single-image super-resolution via independent dictionaries," IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 313–325, 2016.
[30] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[31] H. A. Song and S.-Y. Lee, "Hierarchical representation using NMF," in International Conference on Neural Information Processing. Springer, 2013, pp. 466–473.
[32] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[33] N. Rochester, J. Holland, L. Haibt, and W. Duda, "Tests on a cell assembly theory of the action of the brain, using a large digital computer," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 80–93, 1956.
[34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.
[35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
[36] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[37] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[38] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," 2001.
[39] G. E. Hinton, "Learning multiple layers of representation," Trends in Cognitive Sciences, vol. 11, no. 10, pp. 428–434, 2007.
[40] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, no. 1. Barcelona, Spain, 2011, p. 1237.
[41] D. Cireşan, U. Meier, J. Masci, and J. Schmidhuber, "Multi-column deep neural network for traffic sign classification," Neural Networks, vol. 32, pp. 333–338, 2012.
[42] R. Salakhutdinov and H. Larochelle, "Efficient learning of deep Boltzmann machines," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 693–700.
[43] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[44] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," arXiv preprint arXiv:1401.4082, 2014.
[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[46] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning. MIT Press, Cambridge, 2016, vol. 1.
[47] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision. Springer, 2014, pp. 184–199.
[48] ——, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
[49] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[50] M. D. Zeiler, G. W. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2018–2025.
[51] V. Dumoulin and F. Visin, "A guide to convolution arithmetic for deep learning," arXiv preprint arXiv:1603.07285, 2016.
[52] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
[53] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[54] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[55] W. Shi, J. Caballero, L. Theis, F. Huszár, A. Aitken, C. Ledig, and Z. Wang, "Is the deconvolution layer the same as a convolutional layer?" arXiv preprint arXiv:1609.07009, 2016.
[56] C. Dong, C. C. Loy, and X. Tang, "Accelerating the super-resolution convolutional neural network," in European Conference on Computer Vision. Springer, 2016, pp. 391–407.
[57] N. Efrat, D. Glasner, A. Apartsin, B. Nadler, and A. Levin, "Accurate blur models vs. image priors in single image super-resolution," in 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 2832–2839.
[58] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
[59] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in OSDI, vol. 16, 2016, pp. 265–283.
[60] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, "On the number of linear regions of deep neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 2924–2932.
[61] J. Kim, J. Kwon Lee, and K. Mu Lee, "Accurate image super-resolution using very deep convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
[62] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[63] J. Kim, J. Kwon Lee, and K. Mu Lee, "Deeply-recursive convolutional network for image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1637–1645.
[64] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[65] ——, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[66] A. Veit, M. J. Wilber, and S. Belongie, "Residual networks behave like ensembles of relatively shallow networks," in Advances in Neural Information Processing Systems, 2016, pp. 550–558.
[67] D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. W.-D. Ma, and B. McWilliams, "The shattered gradients problem: If resnets are the answer, then what is the question?" arXiv preprint arXiv:1702.08591, 2017.
[68] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," arXiv preprint, 2017.
[69] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[70] Y. Tai, J. Yang, and X. Liu, "Image super-resolution via deep recursive residual network," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 4, 2017.
[71] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, "Enhanced deep residual networks for single image super-resolution," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, vol. 1, no. 2, 2017, p. 3.
[72] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in AAAI, vol. 4, 2017, p. 12.
[73] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
[74] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, "Dual path networks," in Advances in Neural Information Processing Systems, 2017, pp. 4470–4478.
[75] T. Tong, G. Li, X. Liu, and Q. Gao, "Image super-resolution using dense skip connections," in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 4809–4817.
[76] Y. Tai, J. Yang, X. Liu, and C. Xu, "MemNet: A persistent memory network for image restoration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4539–4547.
[77] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[78] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, "Residual dense network for image super-resolution," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[79] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, "Deep networks for image super-resolution with sparse prior," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 370–378.
[80] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Proceedings of the 27th International Conference on Machine Learning. Omnipress, 2010, pp. 399–406.
[81] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang, "Robust single image super-resolution via deep networks with sparse prior," IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3194–3207, 2016.
[82] D. Liu, Z. Wang, N. Nasrabadi, and T. Huang, "Learning a mixture of deep networks for single image super-resolution," in Asian Conference on Computer Vision. Springer, 2016, pp. 145–156.
[83] W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan, "Deep edge guided recurrent residual learning for image super-resolution," IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5895–5907, 2017.
[84] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, "Deep Laplacian pyramid networks for fast and accurate super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 624–632.
[85] R. Dahl, M. Norouzi, and J. Shlens, "Pixel recursive super resolution," arXiv preprint arXiv:1702.00783, 2017.
[86] A. Singh and N. Ahuja, "Super-resolution using sub-band self-similarity," in Asian Conference on Computer Vision. Springer, 2014, pp. 552–568.
[87] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016.
[88] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., "Conditional image generation with PixelCNN decoders," in Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
[89] M. Irani and S. Peleg, "Improving resolution by image registration," CVGIP: Graphical Models and Image Processing, vol. 53, no. 3, pp. 231–239, 1991.
[90] M. Haris, G. Shakhnarovich, and N. Ukita, "Deep backprojection networks for super-resolution," in Conference on Computer Vision and Pattern Recognition, 2018.
[91] X. Wang, K. Yu, C. Dong, and C. Change Loy, "Recovering realistic texture in image super-resolution by deep spatial feature transform," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
[92] K. Zhang, W. Zuo, and L. Zhang, "Learning a single convolutional super-resolution network for multiple degradations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3262–3271.
[93] R. Timofte, V. De Smet, and L. Van Gool, "Semantic super-resolution: When and where is it useful?" Computer Vision and Image Understanding, vol. 142, pp. 1–12, 2016.
[94] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, "Plug-and-play priors for model based reconstruction," in 2013 IEEE Global Conference on Signal and Information Processing. IEEE, 2013, pp. 945–948.
[95] T. Meinhardt, M. Moller, C. Hazirbas, and D. Cremers, "Learning proximal operators: Using denoising networks for regularizing inverse imaging problems," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1781–1790.
[96] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning deep CNN denoiser prior for image restoration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929–3938.
[97] T. Tirer and R. Giryes, "Image restoration by iterative denoising and backward projections," IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1220–1234, 2019.
[98] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," arXiv preprint arXiv:1711.10925, 2017.
[99] K. Zhang, X. Gao, D. Tao, and X. Li, "Single image super-resolution with multiscale similarity learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1648–1659, 2013.
[100] A. Shocher, N. Cohen, and M. Irani, "Zero-shot super-resolution using deep internal learning," arXiv preprint arXiv:1712.06087, 2017.
[101] M. Zontak and M. Irani, "Internal statistics of a single natural image," in 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011, pp. 977–984.
[102] T. Michaeli and M. Irani, "Nonparametric blind super-resolution," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 945–952.
[103] T. Tirer and R. Giryes, "Super-resolution based on image-adapted CNN denoisers: Incorporating generalization of training data and internal learning in test time," arXiv preprint arXiv:1811.12866, 2018.
[104] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[105] N. Ahn, B. Kang, and K.-A. Sohn, "Fast, accurate, and lightweight super-resolution with cascading residual network," arXiv preprint arXiv:1803.08664, 2018.
[106] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), vol. 2. IEEE, 2001, pp. 416–423.
[107] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 248–255.
[108] E. Agustsson and R. Timofte, "NTIRE 2017 challenge on single image super-resolution: Dataset and study," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, vol. 3, 2017, p. 2.
[109] Z. Yang, K. Zhang, Y. Liang, and J. Wang, "Single image super-resolution with a parameter economic residual-like convolutional neural network," in International Conference on Multimedia Modeling. Springer, 2017, pp. 353–364.
[110] Z. Hui, X. Wang, and X. Gao, "Fast and accurate single image super-resolution via information distillation network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 723–731.
[111] X. Fan, Y. Yang, C. Deng, J. Xu, and X. Gao, "Compressed multi-scale feature fusion network for single image super-resolution," Signal Processing, vol. 146, pp. 50–60, 2018.
[112] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, "Loss functions for neural networks for image processing," Computer Science, 2015.
[113] J. Bruna, P. Sprechmann, and Y. LeCun, "Super-resolution with deep convolutional sufficient statistics," arXiv preprint arXiv:1511.05666, 2015.
[114] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
[115] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor, "Learning to maintain natural image statistics," arXiv preprint arXiv:1803.04626, 2018.
[116] R. Mechrez, I. Talmi, and L. Zelnik-Manor, "The contextual loss for image transformation with non-aligned data," arXiv preprint arXiv:1803.02077, 2018.
[117] K. Li and J. Malik, "Implicit maximum likelihood estimation," arXiv preprint arXiv:1809.09087, 2018.
[118] K. Li, S. Peng, and J. Malik, "Super-resolution via conditional implicit maximum likelihood estimation," arXiv preprint arXiv:1810.01406, 2018.