
Information Fusion 94 (2023) 284–310

Contents lists available at ScienceDirect

Information Fusion
journal homepage: www.elsevier.com/locate/inffus

Lightweight image super-resolution based on deep learning: State-of-the-art and future directions

Garas Gendy a, Guanghui He a,∗, Nabil Sabor b

a Department of Micro-Nano Electronics, Shanghai Jiao Tong University, Shanghai 200240, China
b Electrical Engineering Department, Faculty of Engineering, Assiut University, Assiut 71516, Egypt

ARTICLE INFO

Keywords:
Lightweight image super-resolution
Convolution neural network
Attention mechanism
Model distillation
Residual learning

ABSTRACT

Recently, super-resolution (SR) techniques based on deep learning have attracted more and more attention, aiming to improve the resolution of images and videos. Most SR methods are related to other fields of computer vision such as image classification, image segmentation, and object detection. Building on the success of the image SR task, many image SR surveys have been introduced to summarize the recent work in the image SR domain. However, there is no survey that summarizes the models for the lightweight image SR domain. In this paper, we present a comprehensive survey of the state-of-the-art lightweight SR models based on deep learning. The SR techniques are grouped into six major categories: convolution-, residual-, dense-, distillation-, attention-, and extremely lightweight-based models. We also cover other issues related to the SR task, such as benchmark datasets and metrics for performance evaluation. Finally, we discuss some future directions and open problems that may help other researchers in the community.

1. Introduction

In recent years, the research community has given increasing attention to the image super-resolution (SR) task. The SR task aims to generate a high-resolution (HR) image from the corresponding low-resolution (LR) image. It is worth mentioning that the image SR task is also referred to by other names such as image scaling, interpolation, upsampling, zooming, and enlargement [1,2]. Depending on the practical usage, there are different image SR domains based on the type of input images, such as face image, hyperspectral image, real-world image, and video SR [3–6]. In addition, the image SR task can be clustered into two branches, namely single image SR (SISR) and multiple image SR (MISR). MISR uses multiple LR images to generate the SR image, while SISR produces the SR image from only a single input image [7]. Due to practical needs, more attention has been paid to the SISR task [1,2,8–10]. SISR can be divided into blind SR and non-blind SR. In non-blind SR, the degradation kernel is known, while blind SR is the task in which the degradation kernel is unknown and needs to be estimated. Based on model weight and computational cost, there are two main types of models: traditional deep learning methods and lightweight methods [11–15]. Due to the limited memory of many real-world devices, lightweight models have attracted more and more attention in the last few years [16].

Recently, many researchers have introduced reviews of SISR based on deep learning. Firstly, in [1], the authors introduced a taxonomy of SR-based deep networks into nine groups: linear, residual, multi-branch, recursive, progressive, attention-based, and adversarial designs. Also, in [2], the authors focused on deep learning-based image SR tasks and classified them into three major groups: supervised SR, unsupervised SR, and domain-specific SR. For real-world SR (RSISR), the authors in [5] reviewed the state-of-the-art SISR methods and divided them into four main groups, namely RSISR based on degradation modeling, RSISR based on image pairs, RSISR based on domain translation, and RSISR based on self-learning. In addition, in [17], the task of blind image super-resolution is discussed by grouping the existing methods into three different classes based on the degradation modeling and the data used for solving the SR model. In the case of face SR [3], the authors categorized existing methods according to the utilization of face-specific information. Finally, in [6], the authors classify the video SR methods into six sub-categories according to the ways of utilizing inter-frame information.

The previous SR reviews discussed general frameworks that may or may not be lightweight, independent of the applications the models are used for. Although the authors of the previous surveys reviewed the existing SR methods from different aspects, to the best of our knowledge none of them focuses on lightweight models. This motivates us to write a review summarizing the progress in the field of lightweight models. Our classification taxonomy is based on four main metrics: network design, loss function, framework, and training datasets.

∗ Corresponding author.
E-mail addresses: guanghui.he@sjtu.edu.cn (G. He), nabil_sabor@aun.edu.eg (N. Sabor).

https://doi.org/10.1016/j.inffus.2023.01.024
Received 4 September 2022; Received in revised form 18 January 2023; Accepted 30 January 2023
Available online 3 February 2023
1566-2535/© 2023 Elsevier B.V. All rights reserved.

According to the network design, we classify the lightweight models into six categories: Convolution-Based, Residual-Based, Dense-Based, Distillation-Based, Attention-Based, and Extremely Lightweight-Based. In addition, we summarize the datasets used and the previous competitions related to the lightweight image SR task. In this way, the paper gives a comprehensive overview of recent advances in lightweight image SR based on deep learning.

The main contributions of this survey are the following:

• We comprehensively review lightweight image SR methods based on deep learning, and classify them based on four metrics: network design, loss function, framework, and training datasets.
• We build a taxonomy of the lightweight models according to the network design into convolution-, residual-, dense-, distillation-, attention-, and extremely lightweight-based categories.
• We introduce the recent advances of lightweight deep learning-based SR techniques hierarchically and structurally.
• We discuss the challenges and present our expectations for future directions in the field of lightweight SR.
2. Background

In this section, we present the SR task [1]. To do this, let 𝑋 ∈ 𝑅^(𝐻×𝑊×𝐶) and 𝑋̂ ∈ 𝑅^(𝐻×𝑊×𝐶) represent the ground-truth image and the super-resolved image, respectively, where 𝐻, 𝑊, and 𝐶 denote the height, width, and number of channels, respectively. Also, 𝑦 is the degraded image corresponding to the HR image 𝑋, so we can represent the degradation process as:

𝑦 = 𝛷(𝑋, 𝜃𝜂), (1)

where 𝛷 represents the degradation function and 𝜃𝜂 indicates the degradation parameters (for example, the scaling factor, noise, etc.). In the lightweight scenario, both 𝑦 and 𝑋 are available, and the task of the model is to recover the degradation parameters 𝜃𝜂 or the degradation process. The aim of SR is to cancel the degradation effect and recover an approximate HR image 𝑋̂ similar to the ground-truth image 𝑋 as:

𝑋̂ = 𝛷⁻¹(𝑦, 𝜃𝜍), (2)

where 𝜃𝜍 represents the parameters of the function 𝛷⁻¹. In the case of an unknown degradation process, the recovery task becomes more complex, since several factors can affect the recovery, such as noise (sensor and speckle), blur (defocus and motion), compression, and other artifacts. Consequently, most previous research models the degradation as illustrated in Eq. (3):

𝑦 = (𝑋 ⊗ 𝑘) ↓𝑠 + 𝑛, (3)

where (𝑋 ⊗ 𝑘) is the convolution between the blur kernel 𝑘 and 𝑋, ↓𝑠 represents the downsampling process by a scaling factor 𝑠, and 𝑛 denotes additive white Gaussian noise (AWGN) with level 𝜎. Accordingly, the aim of image SR is to decrease the data fidelity term associated with the model of 𝑦 as follows:

𝐽(𝑋̂, 𝜃𝜂, 𝑘) = ‖(𝑋̂ ⊗ 𝑘) − 𝑦‖ + 𝛼𝛹(𝑋̂, 𝜃𝜂), (4)

where 𝐽(𝑋̂, 𝜃𝜂, 𝑘) represents the loss function, the first term is the data fidelity term, and the second term is a regularization term. Also, 𝛼 denotes the regularization balancing factor.
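To make the classical degradation model of Eq. (3) concrete, the following is a minimal PyTorch sketch that blurs an HR image with a Gaussian kernel, subsamples it by the scaling factor, and adds white Gaussian noise. The kernel size, blur width, and noise level are illustrative assumptions, not the settings of any specific benchmark.

import torch
import torch.nn.functional as F

def gaussian_kernel(size=13, sigma=2.0):
    # Isotropic Gaussian blur kernel k (size and width chosen arbitrarily).
    ax = torch.arange(size).float() - size // 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def degrade(x, scale=4, noise_sigma=0.01):
    # x: HR image tensor of shape (B, C, H, W) with values in [0, 1].
    c = x.shape[1]
    k = gaussian_kernel().to(x.dtype).to(x.device).repeat(c, 1, 1, 1)      # one kernel per channel
    pad = k.shape[-1] // 2
    blurred = F.conv2d(F.pad(x, [pad] * 4, mode="reflect"), k, groups=c)   # X ⊗ k
    lr = blurred[..., ::scale, ::scale]                                    # ↓s (direct subsampling)
    return lr + noise_sigma * torch.randn_like(lr)                         # + n (AWGN)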
3. Competitions and assessment metrics

3.1. Competitions

In recent years, to push forward the development of superior solutions, several competitions have been organized to help the development of SISR lightweight models. In Table 1, we summarize the previous competitions associated with lightweight SR, aiming to provide another perspective on some research trends.

The first competition related to lightweight SISR is the AIM 2019 challenge [24]. This competition aims to obtain a network design/solution that maintains the PSNR above a certain value while minimizing the number of parameters and the inference time (runtime). After that, the AIM 2020 Challenge [16] was organized to decrease one or more aspects of the networks, such as runtime, parameters, Multi-Adds, activations, and depth, while keeping a good PSNR. Because the obtained deep learning solutions are evaluated on desktop CPUs and GPUs, it is very difficult to measure the inference time on actual mobile devices. To address this problem, the Mobile AI workshop 2021 [22] was introduced as the first Mobile AI Workshop, to encourage authors to evaluate deep learning solutions on mobile devices.

Recently, the NTIRE 2022 Challenge [25] was organized as an extension of the AIM 2020 Challenge to search for more efficient models. This competition is based on the same criteria as the previous one, but it looks for even more efficient models. Lastly, the Mobile AI & AIM 2022 challenge [26] was organized like the previous Mobile AI workshop 2021, but the runtime is evaluated using the Synaptics VS680 Smart Home board.

3.2. Assessment metrics for super-resolved images

There are two ways to evaluate the quality of recovered SR images: a subjective way, such as human perception, and an objective way. In general, the subjective way is more straightforward and agrees better with practical usage, but this evaluation has some limitations [3]: (1) its result depends on personal preferences, and (2) its cost of evaluation and automation is very high. In contrast, objective evaluation is more convenient to use, even though different assessment metrics can produce results that are inconsistent with each other or with subjective evaluation. Table 2 shows the widely used metrics for the objective quality evaluation of SR results, including PSNR, SSIM [27], IFC [28], and LPIPS [29].

(1) PSNR: The peak signal-to-noise ratio (PSNR) is the most commonly used full-reference objective quality assessment metric for the SR task. Given 𝑋 and 𝑋̂ as defined in Section 2, the PSNR can be represented as

𝑃𝑆𝑁𝑅 = 10 ⋅ log10(𝐿² / 𝑀𝑆𝐸), (5)

where 𝑀𝑆𝐸 = (1 / 𝐻𝑊𝐶) ‖𝑋 − 𝑋̂‖² represents the mean squared error (MSE) between 𝑋 and 𝑋̂, and 𝐿 denotes the maximum value of a pixel (i.e., 255 for 8-bit images). As indicated in Eq. (5), the PSNR focuses on the proximity between corresponding pixels in 𝑋 and 𝑋̂, which can be poorly consistent with perceptual quality.
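As a quick reference, the PSNR of Eq. (5) can be computed in a few lines; this sketch assumes images stored as arrays with values in [0, 255] and uses L = 255.

import numpy as np

def psnr(x, x_hat, max_val=255.0):
    # Eq. (5): PSNR = 10 * log10(L^2 / MSE), with the MSE averaged over all pixels and channels.
    mse = np.mean((x.astype(np.float64) - x_hat.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

Note that many SR papers report PSNR on the luminance (Y) channel only and crop a border of a few pixels before averaging; those are benchmark conventions rather than part of Eq. (5).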
(2) SSIM [27]: To measure structural similarity, the structural similarity index (SSIM) [27] is used as a full-reference objective quality assessment metric. In particular, the comparisons are made jointly in terms of luminance (𝑙), contrast (𝑐), and structure (𝑠) as

𝑆𝑆𝐼𝑀(𝑋, 𝑋̂) = [𝑙(𝑋, 𝑋̂)]^𝛼 [𝑐(𝑋, 𝑋̂)]^𝛽 [𝑠(𝑋, 𝑋̂)]^𝛾, (6)

𝑙(𝑋, 𝑋̂) = (2𝜇𝑋 𝜇𝑋̂ + 𝐶1) / (𝜇𝑋² + 𝜇𝑋̂² + 𝐶1), 𝑐(𝑋, 𝑋̂) = (2𝜎𝑋 𝜎𝑋̂ + 𝐶2) / (𝜎𝑋² + 𝜎𝑋̂² + 𝐶2), 𝑠(𝑋, 𝑋̂) = (𝜎𝑋𝑋̂ + 𝐶3) / (𝜎𝑋 𝜎𝑋̂ + 𝐶3),

where 𝛼, 𝛽, and 𝛾 are control parameters for adjusting the relative importance. Also, 𝜇𝑋 and 𝜎𝑋 represent the mean and standard deviation of 𝑋, respectively; in a similar manner, 𝜇𝑋̂ and 𝜎𝑋̂ represent the mean and standard deviation of 𝑋̂, respectively. 𝜎𝑋𝑋̂ is the covariance between 𝑋 and 𝑋̂, and 𝐶1, 𝐶2, and 𝐶3 are constants. In comparison with PSNR, the SSIM reflects visual quality better. In general, PSNR and SSIM are used together to assess the quality of a super-resolved image when the corresponding ground-truth image is available.
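For illustration, the following sketch evaluates Eq. (6) with 𝛼 = 𝛽 = 𝛾 = 1 using global image statistics. Standard SSIM implementations instead average the index over local (e.g., 11 × 11 Gaussian-weighted) windows, so this simplified single-window version is only meant to make the formula concrete; the constants follow the common choice 𝐶1 = (0.01𝐿)², 𝐶2 = (0.03𝐿)², 𝐶3 = 𝐶2/2.

import numpy as np

def ssim_global(x, x_hat, max_val=255.0):
    # Single-window form of Eq. (6) with alpha = beta = gamma = 1.
    x, x_hat = x.astype(np.float64), x_hat.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    c3 = c2 / 2
    mu_x, mu_y = x.mean(), x_hat.mean()
    sig_x, sig_y = x.std(), x_hat.std()
    cov = ((x - mu_x) * (x_hat - mu_y)).mean()
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    c = (2 * sig_x * sig_y + c2) / (sig_x ** 2 + sig_y ** 2 + c2)
    s = (cov + c3) / (sig_x * sig_y + c3)
    return l * c * s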
(3) IFC [28]: As a full-reference metric, the information fidelity criterion (IFC) is used for assessing the quality of images using natural scene statistics. Many researchers have shown that the space of statistics formed by natural images can be characterized by various models (e.g., Gaussian scale mixtures). In general, distortions change the natural scene statistics and generate unnatural images. Based on this idea, the authors in [28] measure the visual quality of an image using natural scene and distortion models that quantify the mutual information between the reference and test images. Hence, the IFC can be used for the quality assessment of super-resolved images [30].

(4) LPIPS [29]: The learned perceptual image patch similarity (LPIPS) can be used as a reference-based image quality assessment metric. In particular, LPIPS is computed in a deep feature space via the 𝑙2 distance between the test image and the reference one, which agrees well with human judgments.
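In practice, LPIPS is usually computed with the reference implementation released by the authors of [29] (the lpips Python package). A minimal usage sketch is shown below; it assumes that package is installed and that images are given as tensors scaled to [-1, 1].

import torch
import lpips  # reference implementation of [29], installed separately

loss_fn = lpips.LPIPS(net="alex")          # AlexNet backbone; "vgg" is another option
sr = torch.rand(1, 3, 256, 256) * 2 - 1    # super-resolved image in [-1, 1]
hr = torch.rand(1, 3, 256, 256) * 2 - 1    # ground-truth image in [-1, 1]
with torch.no_grad():
    distance = loss_fn(sr, hr)             # lower means perceptually closer
print(distance.item())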


Table 1
Details of challenges on blind image super-resolution.
Competition name | Organizer | Website | Champion solution | Dataset used | Weight [M] | Multi-Adds [G] | PSNR
1. AIM 2019 challenge on constrained super-resolution | ICCV [2019] | https://data.vision.ee.ethz.ch/cvl/aim19/ | IMDN [18] | DIV2K [19] | 0.893 | 58.53 | 28.78
2. AIM 2020 challenge on efficient super-resolution | ECCV [2020] | https://data.vision.ee.ethz.ch/cvl/aim20 | RFDN [20] | DIV2K [19] & Flickr2K [21] | 0.433 | 27.10 | 28.75
3. Mobile AI workshop 2021 | CVPR [2021] | https://ai-benchmark.com/workshops/mai/2021 | XLSR [22] | DIV2K [19] | 0.022 | – | 29.87
4. NTIRE 2022 challenge on efficient super-resolution | CVPR [2022] | https://data.vision.ee.ethz.ch/cvl/ntire22/ | RLFN [23] | DIV2K [19] | 0.317 | 19.70 | 28.72
5. Mobile AI & AIM: Real-time image super-resolution 2022 | ECCV [2022] | https://data.vision.ee.ethz.ch/cvl/aim22/ | SCSRN | DIV2K [19] | 0.067 | – | 30.03

Fig. 1. Four main metrics for classifying the lightweight SR models.

Table 2
An overview of widely used assessment metrics for lightweight SR, adopted from [5].
Metrics | Published | Full/No-reference | Keywords
PSNR | – | Full-reference | Mean squared error
SSIM | TIP-2004 [27] | Full-reference | Structure similarity, luminance, contrast, structures
IFC | TIP-2005 [28] | Full-reference | Nature scene statistics, Gaussian scale mixtures
LPIPS | CVPR-2018 [29] | Full-reference | Deep features, human perceptual similarity

4. Classification taxonomy of the SISR lightweight models

Four main metrics, namely network design, loss function, framework, and training datasets, are considered for classifying the existing SISR lightweight models, as shown in Fig. 1. Table 3 shows the taxonomy of the existing lightweight SR models based on these metrics.

4.1. Network design

The performance of a model depends mainly on the structure of the designed network. According to the network design, we classify the existing methods into six categories: Convolution-Based, Residual-Based, Dense-Based, Distillation-Based, Attention-Based, and Extremely Lightweight-Based, as shown in Fig. 2. To classify the existing models more deeply, the distillation-based models are grouped into Feature Distillation and Model Distillation. Also, the attention-based models are categorized into six groups: Channel Attention, Spatial Attention, Multi-scale Attention, Pyramid Attention, Transformer-based Attention, and other attention models. Fig. 3 shows the main differences between the network designs of the first five major categories, while the extremely lightweight ones do not have a specific network design.


Fig. 2. Taxonomy of SISR lightweight models based on the network design criteria.

Fig. 3. Categories of the network design structures.

The design concept of each class will be discussed along with the recently published models of that class. Moreover, the advantages and limitations of each class will be explained.

4.1.1. Convolution-based methods
The main idea of convolution-based models is to use several stacked convolution layers, so the input flows sequentially from the first to the later layers. These convolution-based models can perform the upsampling operation early or late (early upsampling or late upsampling). The network architectures of the convolution-based methods are shown in Fig. 4. It is clear from the figure that the first, traditional models are based only on convolutions, while the most recent ones apply some more advanced types of operations in the model design. In this section, we survey some of the CNN-based methods.
1. SRCNN
A super-resolution convolutional neural network (SRCNN) is introduced in [31]; it is based on end-to-end learning of the mapping between the LR and HR images, as shown in Fig. 4(a). The concept of the SRCNN is to reformulate the sparse-coding-based SR methods as a deep convolutional neural network. Contrary to traditional sparse coding methods that handle each component separately, the SRCNN model can optimize all layers jointly. In addition, the SRCNN model has a lightweight structure with a high speed for practical online usage. Even though the SRCNN is an end-to-end learning network, its drawback is that it is based on shallow convolutions with a small receptive field, which limits its performance.
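A minimal PyTorch sketch of the three-layer SRCNN pipeline is given below. The 9-5-5 kernel sizes and 64/32 channel widths follow one commonly cited configuration, and the bicubic pre-upsampling reflects the early-upsampling design described above; treat these values as indicative rather than definitive.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    # Patch extraction -> non-linear mapping -> reconstruction (cf. Fig. 4(a)).
    def __init__(self, channels=3):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)
        self.map = nn.Conv2d(64, 32, kernel_size=5, padding=2)
        self.reconstruct = nn.Conv2d(32, channels, kernel_size=5, padding=2)

    def forward(self, lr, scale=2):
        # Early upsampling: the LR input is first interpolated to the HR size.
        x = F.interpolate(lr, scale_factor=scale, mode="bicubic", align_corners=False)
        x = F.relu(self.extract(x))
        x = F.relu(self.map(x))
        return self.reconstruct(x)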
2. FSRCNN
A fast super-resolution convolutional neural network (FSRCNN) is introduced in [34] to speed up SRCNN [31]. The FSRCNN redesigns the SRCNN in three aspects. Firstly, a deconvolution layer is used at the final stage of the model, as shown in Fig. 4(b); using the deconvolution layer to upsample images dramatically reduces the computation, and the network can learn the mapping between the LR and HR images without needing an interpolation step. Secondly, the input features are shrunk and then expanded to create a mapping layer, which allows using both additional mapping layers and smaller filter sizes. Moreover, selecting these parameter settings enables generic real-time CPU performance with a high SR quality. Finally, the FSRCNN achieves a large speedup over the SRCNN model with good restoration quality.
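The shrink-map-expand-deconvolution pattern of FSRCNN can be sketched as follows. The 56/12 channel widths, four mapping layers, and 9 × 9 deconvolution are the values commonly reported for this model, so they should be read as an assumed configuration rather than a definitive implementation.

import torch.nn as nn

class FSRCNNLike(nn.Module):
    # Feature extraction -> shrink -> mapping -> expand -> deconvolution upsampling (cf. Fig. 4(b)).
    def __init__(self, channels=3, d=56, s=12, m=4, scale=4):
        super().__init__()
        layers = [nn.Conv2d(channels, d, 5, padding=2), nn.PReLU(d),   # feature extraction on the LR input
                  nn.Conv2d(d, s, 1), nn.PReLU(s)]                     # shrink
        for _ in range(m):                                             # non-linear mapping
            layers += [nn.Conv2d(s, s, 3, padding=1), nn.PReLU(s)]
        layers += [nn.Conv2d(s, d, 1), nn.PReLU(d)]                    # expand
        self.body = nn.Sequential(*layers)
        # Late upsampling: a single deconvolution layer produces the HR output.
        self.deconv = nn.ConvTranspose2d(d, channels, 9, stride=scale,
                                         padding=4, output_padding=scale - 1)

    def forward(self, lr):
        return self.deconv(self.body(lr))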


Table 3
Taxonomy of the existing lightweight SR models.
No. Method Published Network design Loss function Framework Training dataset Keywords
1 SRCNN [31] TPAMI [2015] Convolution 𝑙2 −norm Caffe 91 images [32] + Deep convolutional
ILSVRC 2013 ImageNet networks
[33]
2 FSRCNN [34] ECCV [2016] Convolution 𝑙2 −norm Caffe 91 images [32] + Convolutional neural
general-100 dataset network
[33]
3 VDSR [35] CVPR [2016] Convolution 𝑙2 −norm Caffe 91 images [32] + Very deep
Berkeley Segmentation convolutional networks
Dataset (BSD) [36]
4 DRCN [37] CVPR [2016] Convolution 𝑙2 −norm Caffe 91 images [32] + BSD Deeply-Recursive
[36] convolutional
5 CNF [38] CVPR [2017] Convolution 𝑙2 −norm Caffe Open image dataset Fusing multiple
[39] convolution neural
networks
6 VDSR- IJCV [2020] Convolution 𝑙2 −norm PyTorch 91 images [32] + BSD Adaptive importance
f22+ILT [36] learning
[40]
7 SPBP [41] arXiv [2020] Convolution 𝑙1 −norm PyTorch DIV2K [19] Sub-Pixel
Back-Projection
network
8 FDN [42] Signal, image and video Convolution Mean Absolute Error Caffe DIV2K [19] Fusion diversion
processing [2021] (MAE) loss network
9 GhostSR [43] arXiv [2021] Convolution 𝑙2 −norm PyTorch DIV2K [19] Learning ghost features
1 DRRN [44] CVPR [2017] Residual 𝑙2 −norm Caffe 91 images [32] + BSD Deep recursive residual
[36] network
2 BTSRN [45] CVPRW [2017] Residual 𝑙2 −norm TensorFlow 91 images [32] + BSD Deep laplacian pyramid
[36] networks
3 LapSRN [46] CVPR [2017] Residual Charbonnier loss MatConvNet 91 images [32] + BSD Deep laplacian pyramid
[36] networks
4 SelNet [47] CVPRW [2017] Residual 𝑙2 −norm MatConvNet DIV2K [19] Selection units for
super-resolution
5 CARN [48] ECCV [2018] Residual 𝑙1 −norm PyTorch BSD [36] Cascading residual
network
6 CBPN [49] ICCV [2019] Residual 𝑙1 −norm PyTorch DIV2K [19] Hybrid residual feature
learning
7 SRRFN [50] ICCV [2019] Residual 𝑙1 −norm PyTorch DIV2K [19] Recursive fractal
network
8 MRFN [51] IEEE Trans. on MM [2020] Residual Weighted huber loss Caffe 91 images [32] + BSD Multi-Receptive-Field
[36] network
9 MFIN [52] JRTIP [2021] Residual 𝑙1 −norm PyTorch DIV2K [19] Multi-scale feature
integration network
10 WMRN [53] IEEE journal of automatica Residual 𝑙1 −norm, Total PyTorch DIV2K [19] Multi-path residual
sinica [2021] variation (TV) loss network
11 ELCRN [54] IJRTIP [2021] Residual 𝑙1 −norm PyTorch DIV2K [19] Efficient local cascading
residual network
12 AdderNet CVPR [2021] Residual 𝑙1 −norm PyTorch 91 images [32] + BSD Energy Efficient
[55] [36]
13 FADN [56] ICCV [2021] Residual 𝑙1 −norm, Frequency – DIV2K [19] Learning
mask loss frequency-aware
dynamic network
14 ASSLN [57] NIPS [2021] Residual Sparsity inducing loss PyTorch DIV2K [19], Flickr2K Aligned structured
[21] sparsity learning
15 OverNet [58] WACV [2021] Residual Multi-scale loss PyTorch DIV2K [19] Overscaling network
16 SMSR [59] CVPR [2021] Residual L1 loss, sparsity PyTorch DIV2K [19] Exploring Sparsity
regularization loss 𝐿𝑟𝑒𝑔
17 CLB [60] CVPRW [2022] Residual 𝑙1 −norm PyTorch DIV2K [19], Flickr2K Collapsible linear
[21] blocks
18 FMEN [61] CVPRW [2022] Residual 𝑙1 −norm PyTorch DIV2K [19], Flickr2K Fast and
[21] memory-efficient
network

19 HPUN [62] ArXiv [2022] Residual 𝑙1 −norm PyTorch DIV2K [19], Flickr2K Hybrid pixel-unshuffled
[21] network
20 RLFN [23] CVPRW [2022] Residual 𝑙1 −norm, contrastive PyTorch DIV2K [19] Residual Local Feature
loss Network
21 ShuffleMixer ArXiv [2022] Residual 𝑙1 −norm PyTorch DIV2K [19], Flickr2K An efficient ConvNet
[63] [21]
1 GLADSR [64] IEEE Trans. on MM [2020] Dense 𝑙1 −norm – DIV2K [19] Global–local adjusting
dense network
2 ESRN [65] AAAI [2020] Dense Adaptive joint loss – DIV2K [19] Efficient residual dense
block search
1 IDN [66] CVPR [2019] Feature 𝑙2 −norm PyTorch 91 images [32] + BSD Information distillation
distillation [36] network
2 IMDN [18] Proc. ACM Inter. Conf. on Feature 𝑙2 −norm PyTorch DIV2K [19] Information
MM [2019] distillation multi-distillation
network
3 RFDN [20] ECCV [2020] Feature 𝑙1 −norm PyTorch DIV2K [19] Residual feature
distillation distillation network
4 DCDN [67] ACM transactions on MM Feature 𝑙1 −norm PyTorch DIV2K [19] Dense connection
[2021] distillation distillation network
5 AIDN [68] CVPRW [2022] Feature 𝑙1 −norm PyTorch DIV2K [19], Flickr2K Asymmetric information
distillation [21] distillation network
6 BSRN [69] CVPRW [2022] Feature 𝑙1 −norm PyTorch DIV2K [19], Flickr2K Blueprint separable
distillation [21] residual network
7 FDIWN [70] AAAI [2022] Feature 𝑙1 −norm PyTorch DIV2K [19] Feature distillation
distillation interaction weighting
network
8 MRDN [71] ESWA [2022] Feature 𝑙1 −norm PyTorch DIV2K [19], Flickr2K A lightweight
distillation [21] multi-stage residual
distillation network
1 FSRCNN with ECCV [2020] Model Imitation loss, PyTorch DIV2K [19] Learning with
distillation distillation reconstruction loss, privileged information
[72] distillation loss
2 VDSR with CVPR [2021] Model Reconstruction loss, PyTorch DIV2K [19] Data-free knowledge
Distillation distillation Adversarial loss, distillation
[73] knowledge distillation
loss
3 CSD [74] arXiv [2021] Model Reconstruction loss, PyTorch DIV2K [19] Data-free knowledge
distillation contrastive loss distillation
4 IMDN-FSL ICCV [2021] Model Fourier space loss – DIV2K [19] Fourier space losses
[75] distillation
5 MemSR [76] PMLR [2022] Model Mean square error PyTorch DIV2K [19] Training
distillation (MSE), reconstruction memory-efficient
loss lightweight model
1 FERN [77] Neurocomputing [2020] Channel 𝑙1 −norm PyTorch DIV2K [19] Feature enhancement
attention residual network
2 LGCN [78] Information sciences Channel 𝑙1 −norm PyTorch 91 images [32] + BSD Group convolutional
[2019] attention [36] network
3 MCAN [79] ACCV [2020] Channel 𝑙1 −norm PyTorch DIV2K [19] Matrix Channel
attention Attention Network
4 𝐴2 𝐹 [80] ACCV [2020] Channel 𝑙1 −norm PyTorch DIV2K [19] Attentive auxiliary
attention feature
5 PRRN [81] Neurocomputing [2022] Channel 𝑙1 −norm PyTorch DIV2K [19], Flickr2K Progressive
attention [21] representation
recalibration
1 MSAN [82] IEEE access [2020] Spatial attention 𝑙2 −norm – DIV2K [19] Multi-scale spatial
attention networks
2 A-CubeNet ACM Inter. Conf. on MM Spatial attention 𝑙1 −norm PyTorch DIV2K [19] Attention cube
[83] [2020]
1 LAMRN [84] IEEE access [2021] Multi-scale 𝑙1 −norm PyTorch DIV2K [19] Attended multi-scale
attention residual network
2 LMAN [85] IEEE transactions on Multi-scale 𝑙1 −norm PyTorch DIV2K [19] Multi-scale aggregation
broadcasting [2020] attention
3 AMSRN [86] Knowledge-based systems Multi-scale 𝑙2 −norm – DIV2K [19] Multi-scale residual
[2020] attention networks with attention
4 MCSN [87] Neurocomputing [2021] Multi-scale 𝑙1 −norm PyTorch DIV2K [19] Multi-scale channel
attention attention network

5 MARAN [88] Multimedia tools and Multi-scale MAE loss PyTorch DIV2K [19] Multi-scale aggregated
applications [2021] attention residual attention
networks
1 PDAN [89] arXiv [2021] Pyramid 𝑙1 −norm PyTorch DIV2K [19] Pyramidal dense
attention attention networks
2 FPAN [90] arXiv [2021] Pyramid 𝑙1 −norm PyTorch DIV2K [19] Feedback pyramid
attention attention networks
3 BSPAN [91] Neurocomputing [2022] Pyramid 𝑙1 −norm PyTorch DIV2K [19] Balanced spatial feature
attention distillation and pyramid
attention network
1 ESRT [92] arXiv [2021] Transformer- 𝑙1 −norm PyTorch DIV2K [19] Efficient transformer
based
attention
2 HNCT [93] CVPRW [2022] Transformer- 𝑙1 −norm PyTorch DIV2K [19] A Hybrid network of
based CNN and transformer
attention
3 LKASR [94] KBS [2022] Transformer- 𝑙1 −norm PyTorch DIV2K [19] Large kernel attention
based
attention
4 SCET [95] CVPRW [2022] Transformer- 𝑙1 −norm PyTorch DIV2K [19], Flickr2K Self-calibrated efficient
based [21] transformer
attention
5 CFIN [96] arXiv [2022] Transformer- 𝑙1 −norm PyTorch DIV2K [19] Cross-receptive focused
based inference network
attention
1 MADNet [97] IEEE trans. on cybernetics Other attention 𝑙1 −norm, Total PyTorch DIV2K [19] Fast and lightweight
[2020] variation (TV) loss 𝑙𝑡𝑣 network
2 MAFFSRN ECCV [2020] Other attention – PyTorch DIV2K [19] Multi-attention
[98]
3 LatticeNet ECCV [2020] Other attention MAE and 𝑙2 −norm PyTorch DIV2K [19] Lattice block
[99]
4 PAN [100] ECCV [2020] Other attention 𝑙1 −norm PyTorch DIV2K [19], Flickr2K Pixel attention
[21]
5 HRAN [101] arXiv [2020] Other attention 𝑙1 −norm PyTorch DIV2K [19] Hierarchical residual
attention network
6 MPRNet WACV [2021] Other attention 𝑙1 −norm PyTorch DIV2K [19] Weighted multi-scale
[102] residual network
7 EMASRN TCSVT [2021] Other attention 𝑙1 −norm PyTorch DIV2K [19] Expectation-
[103] maximization attention
mechanism
8 DRSAN [104] IEEE Trans. on MM [2021] Other attention 𝑙1 −norm PyTorch DIV2K [19] A Dynamic Residual
Self-Attention
9 HRFFN [105] Neurocomputing [2022] Other attention 𝑙1 −norm PyTorch DIV2K [19] Lightweight
hierarchical residual
feature fusion network
1 s-LWSR TIP [2020] Extremely 𝑙1 −norm PyTorch DIV2K [19] Super lightweight
[106] lightweight
2 SESR [107] arXiv [2021] Extremely 𝑙1 −norm TensorFlow DIV2K [19] Collapsible linear
lightweight blocks
3 SplitSR [108] IMWUT [2021] Extremely 𝑙1 −norm TensorFlow DIV2K [19] End-to-end approach
lightweight for SR
4 XLSR [22] CVPR [2021] Extremely Charbonnier loss TensorFlow DIV2K [19] Lightweight
lightweight quantization robust
real-time
5 ABPN [109] CVPR [2021] Extremely 𝑙1 −norm TensorFlow DIV2K [19] Anchor-based plain net
lightweight for mobile
6 CDFM-Mobile ECCVW [2022] Extremely 𝑙1 −norm TensorFlow DIV2K [19] Channel mixing net for
[110] lightweight mobile

3. VDSR
Based on VGG-net [111], a very deep convolutional network (VDSR) is developed in [35] for SISR. The concept of VDSR is based on increasing the network depth to significantly improve the accuracy. The VDSR consists of 20 weight layers, as shown in Fig. 4(c).


Fig. 4. Network structures of the convolution-based methods.

The network is built on a deep structure of recursive cascading of small filters. Residual learning with very high learning rates (i.e., 10⁴ times higher than in SRCNN [31]) is used to speed up the convergence of the VDSR. Moreover, gradient clipping is utilized to ensure the training stability of the VDSR. One limitation of the VDSR model is its use of a fixed-size receptive field.
4. DRCN
A deeply-recursive convolutional network (DRCN) is suggested in [37] for SISR. The main idea of DRCN is to increase the recursion depth of the network without introducing additional convolution parameters, as illustrated in Fig. 4(d). In this model, recursive-supervision and skip-connection methods are utilized to solve the vanishing gradient issues during the training of the network. The recursive supervision method is used to generate the HR target image, while the skip connection from the input to the reconstruction layer is used for sharing the same information; this skip connection helps when there is a correlation between the input and output images. One benefit of the DRCN is that it can reduce the parameters with a weight-sharing strategy.
5. CNF
A super-resolution system is constructed in [38] based on multiple convolution neural networks (CNNs) for SISR. Each individual CNN is trained independently with a different network structure. Afterward, the outputs of the individual CNNs are fused using different fusion schemes such as context-wise network fusion (CNF), pixel-wise network fusion (PWF), and progressive network fusion (PNF). Fig. 4(e) shows the details of the CNF module. Compared to the individual networks, the whole fused network is fine-tuned to improve the overall accuracy. The obtained results show that the performance of the CNF is better than both PWF and PNF due to its ability to benefit from the fused features. A benefit of the CNF model is that the outputs of its convolution layers are summed as the final output, so this scheme is not limited to the individual networks shown here and can be used for fusing any type of SR model.
6. VDSR-f22+ILT
An adaptive importance learning scheme model (VDSR-f22+ILT) is introduced in [40] based on the VDSR model [35]; it utilizes adaptive importance learning (AIL) in SISR. During the model's training, the loss is used to dynamically update the importance of image pixels. Also, an importance penalty function is carefully designed to gradually increase the importance of individual pixels by solving a convex optimization problem. The model training starts with the easy-to-reconstruct pixels and afterward gradually targets the more complex pixels. Thus, the initialization of the network can obtain a better initial capacity. The key innovation of the VDSR-f22+ILT model is a new training method that casts importance learning as a joint optimization problem in the training phase.
7. SPBP
In order to decrease the number of parameters and the computational cost of the SISR model, the sub-pixel back-projection (SPBP) network is introduced in [41]. The SPBP model is able to trade off performance against computational complexity. Specifically, an iterative back-projection architecture is introduced that uses sub-pixel convolution in place of deconvolution layers, as shown in Fig. 4(f). In this model, the sub-pixel back-projection (SPBP) units are placed instead of the densely connected up- and down-projection units, which improves the performance. Thus, the SPBP can improve the performance by reconstructing accurate SR images.
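Sub-pixel convolution, which SPBP uses in place of deconvolution layers, can be expressed with PyTorch's PixelShuffle operator. The snippet below is a generic sketch of this upsampling step, not the SPBP block itself.

import torch
import torch.nn as nn

class SubPixelUp(nn.Module):
    # A 3x3 convolution expands the channels by r^2, then PixelShuffle
    # rearranges them into an r-times larger spatial grid.
    def __init__(self, channels=64, r=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

feat = torch.randn(1, 64, 32, 32)
print(SubPixelUp()(feat).shape)   # torch.Size([1, 64, 64, 64])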
8. FDN
Based on the fusion diversion idea, a lightweight SISR method named fusion diversion network (FDN) is designed in [42] with a basic building module of diversion and fusion. The fusion and diversion mechanism allows information to interact and transfer through the network, which effectively makes the model more expressive. In addition, a diversion and fusion block (DFB) is used to acquire features from feature clusters. The DFB can get feature clusters from the former blocks, as shown in Fig. 4(g), and at the same time use the convolution operation for feature extraction. Finally, a local fusion module is used to integrate the accumulated features adaptively. One benefit of the FDN model is that it uses the DFB block to provide connection channels and to eliminate redundancy by cutting, sending, and integrating features.
9. GhostSR
In [43], a shift operation is used to generate redundant features (i.e., ghost features), based on the observation that many features in SISR models are similar to each other. Unlike depthwise convolution, which is not friendly to graphics processing units (GPUs) or network processing units (NPUs), the shift operation can accelerate the inference of CNNs on common hardware. Based on the Gumbel-Softmax trick, the authors made the shift orientation learnable so that the shift operation could work for SISR. In this model, all filters of each convolutional layer are first clustered to identify the intrinsic ones that can extract intrinsic features for the pre-trained model. Then, these intrinsic features are moved along a specific orientation to derive ghost features. Moreover, the intrinsic and ghost features are concatenated to construct the complete output features. Finally, the benefit of GhostSR is that it uses a learnable shift operation in place of a proportion of the conventional filters to generate ghost features.


Discussion and limitations. Using CNNs for image SR has achieved a lot of success due to the ability of the models to extract more discriminative features and achieve good performance compared to traditional non-deep-learning methods. The CNN-based models solved the issues of the sparse coding methods by optimizing all layers jointly. The initial CNN methods are based on shallow layers, which limits their performance. Some techniques were used to improve these methods, such as using a deconvolution layer [34] and using the recursive idea [37]. In addition, other techniques were used to further improve the performance, such as fusion schemes [38,42] and sub-pixel back-projection [41]. The shift operation [43] is another solution used to improve the CNN-based methods. However, the CNN-based methods remain shallow due to the vanishing gradient issue. This problem prevents increasing the model size, which limits the performance, so there is a need to solve the vanishing gradient problem of convolution-based methods. In [112], residual learning is introduced to solve this problem and allow models to increase their size.

4.1.2. Residual-based methods
Because the image SR task is an image-to-image translation task, the output image is highly correlated with the input image. So, residual learning can help in learning the residual between the input and output images. This type of residual is called global residual learning, which avoids learning a complicated transformation from one whole image to another; the model only needs to learn a residual map to restore the missing high-frequency details. This residual learning reduces the model complexity and decreases the learning difficulty. In addition, similar to ResNet [112], a local residual is utilized to alleviate the degradation problem that comes with ever-increasing network depth. So, this type of residual learning can reduce training difficulty and enhance learning ability. Fig. 5 shows the main block diagrams developed based on these two types of residual learning. It is obvious from Fig. 5 that residual learning allows the network depth to be larger, so the model can extract more features. In addition, the figure shows that a variety of additional blocks are introduced based on the residual, which allows further improvement of the performance.
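The two kinds of residual learning described above can be summarized in a short sketch: local residual blocks add their input back to their output, and a global skip connection adds an upsampled copy of the input image to the network prediction so that only the high-frequency residual has to be learned. This is a generic illustration, not the exact block of any model in Fig. 5; the channel and block counts are arbitrary.

import torch.nn as nn
import torch.nn.functional as F

class LocalResidualBlock(nn.Module):
    # Local residual learning: y = x + F(x), which eases the optimization of deep stacks.
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))

class GlobalResidualSR(nn.Module):
    # Global residual learning: the network predicts only the missing details,
    # which are added to a bicubic-upscaled copy of the LR input.
    def __init__(self, channels=3, feats=64, blocks=4, scale=2):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(channels, feats, 3, padding=1)
        self.body = nn.Sequential(*[LocalResidualBlock(feats) for _ in range(blocks)])
        self.tail = nn.Conv2d(feats, channels, 3, padding=1)

    def forward(self, lr):
        base = F.interpolate(lr, scale_factor=self.scale, mode="bicubic", align_corners=False)
        return base + self.tail(self.body(self.head(base)))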
1. DRRN
A deep recursive residual network (DRRN) is introduced in [44], representing a deep and concise network for SISR. In particular, residual learning is used to reduce the training difficulty in both global and local ways. The residual block of the DRRN model is shown in Fig. 5(a). In this model, the parameters and the depth of the model are increased based on recursive learning. This model uses local residual learning (LRL) to solve the detail-loss issue that appears after many model layers. Also, rich image details are carried by the identity branch to the late layers, which improves the flow of the gradient. In addition, in the DRRN, recursive learning of residual units is used to make the model compact enough. Finally, one key benefit of the DRRN is that it combines recursive and residual network schemes in one framework.
2. BTSRN
A balanced two-stage residual network (BTSRN) is introduced in [45] for SISR. In this model, by constraining the depth of the deep residual network, a balance between performance and speed is achieved. The BTSRN is based on two-layer projected convolution (PConv) residual blocks, as shown in Fig. 5(b). The PConv uses a 1 × 1 convolution for feature map projection, which decreases the input size of the convolution kernel. The main advantage of the BTSRN model is that it contains only 10 residual blocks, so it is considered a very efficient model. This model achieved good results in the NTIRE 2017 SR competition [17] in both accuracy and speed.
3. LapSRN
To address the speed and accuracy problem, the Laplacian pyramid super-resolution network (LapSRN) is introduced in [46] to reconstruct the sub-band residuals of the SR images. The LapSRN network, as illustrated in Fig. 5(c), contains three types of components: convolutional layers, leaky ReLU, and deconvolutional layers. In this model, using the coarse-resolution feature maps as input at each pyramid level helps in predicting the high-frequency residuals. Afterward, a transposed convolution is used at the final level for upsampling. With these improvements, the LapSRN model does not need the bicubic interpolation pre-processing step. In addition, during training the authors used the robust Charbonnier loss function with deep supervision, so the model achieves high-quality reconstruction. Finally, the progressive reconstruction in the LapSRN model enables the generation of multi-scale predictions in one feed-forward pass, which assists in resource-aware applications.
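The Charbonnier loss used to train LapSRN is a differentiable approximation of the 𝑙1 loss. A common form is sketched below; the constant 𝜀 = 10⁻³ is the value typically quoted, but it should be treated as an assumption here.

import torch

def charbonnier_loss(pred, target, eps=1e-3):
    # L(x, y) = mean( sqrt((x - y)^2 + eps^2) ), a smooth variant of the l1 loss.
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()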
4. SelNet
A deep convolutional neural network with selection units (SelNet) is introduced in [47] for SISR. The idea of SelNet is to design a nonlinear unit called a selection unit (SU) by reinterpreting the ReLU as the point-wise multiplication of an identity mapping and a sigmoid-based selection module, as shown in Fig. 5(d). This SU can optimize the on–off switching control of the data passed through it, contrary to the conventional ReLU. As a result, the SU is more flexible and better than ReLU at handling nonlinearity. The key merit of the SelNet is that it uses gradient switching for faster convergence in training.
5. CARN
An accurate and lightweight deep network named cascading residual network (CARN) is suggested in [48] for SISR. The CARN network is designed to implement a cascading mechanism upon a residual network. The CARN model is based on adding multiple cascading connections to each intermediary layer at the local level, and these connections allow the model to propagate information and gradients efficiently. To further improve the performance, the ResNet [112] architecture is used in the middle of the CARN model, as shown in Fig. 5(e). Moreover, a local and a global level are used in the CARN cascading mechanism, in addition to the ResNet architecture, to incorporate features from multiple layers. So, the main advantage of the CARN is that it applies multiple cascaded connections for integrating multilevel features.
6. CBPN
An efficient SISR method named compact back-projection network (CBPN) [49] is developed that can learn hybrid residual features in the LR and HR space. The idea of CBPN is based on the reconstruction of the residual HR image. Specifically, a compact back-projection block is introduced to fulfill the hybrid residual feature learning. The CBPN can simultaneously cascade upsampling and downsampling layers for feature generation in both LR and HR space using filters with small sizes. Also, a new UD block and a reconstruction layer are designed for restoring the HR images, as shown in Fig. 5(f). This model achieves high efficiency in terms of parameters and operations with good performance.
7. SRRFN
A lightweight model called super-resolution recursive fractal network (SRRFN) is introduced in [50] based on the recursive fractal idea. This recursive fractal is used to fractalize the features many times. The details of the flexible and diverse fractal module (FM) are illustrated in Fig. 5(g). Furthermore, infinitely many possible topological substructures are introduced using a simple component through their unique characteristics. Then, self-similarity and an infinitely refined structure are used in the FM. The self-similar and infinitely refined structures can be represented by simple components based on recursion and iteration. As a result, the overall model is more fault-tolerant and robust, and the SRRFN model can be used as the feature extraction part in many low-level computer vision tasks.
8. MRFN
A multi-receptive-field network (MRFN) is developed in [51] that outperforms many SISR methods in three different aspects. Firstly, a multi-receptive-field (MRF) module extracts and fuses features from local to global at different receptive fields, as shown in Fig. 5(h).


Fig. 5. Network structures of the residual-based methods.

Applying these hierarchical features at different scales yields good mappings for extracting high-fidelity information. Secondly, in training, the authors combined the features from the MRF module using both skip connections and deep supervision to the preceding layers. In addition, using a deconvolution layer at the final stage helps to avoid artificial priors introduced by numerical data pre-processing and speeds up the restoration process. Finally, a weighted Huber loss function is used to adjust the back-propagated derivative values based on the residual value.


The key advantage of MRFN is that it can extract features from various receptive fields and then combine them for learning object/part-dependent mappings.
9. MFIN
A lightweight multi-scale feature integration network (MFIN) is introduced in [52] to address the problem of limited receptive fields in lightweight SR models. In more detail, the receptive fields are expanded for global features by serially cascading the multi-scale feature integration blocks (MFIBs), as shown in Fig. 5(i). Each MFIB includes both a multi-scale feature extraction module (MFEM) and a feature integration unit (FIU). The features in the MFEM are cascaded through the parallel cascading mechanism (PCM) inside the MFIB to enlarge the receptive field, while the backbone of the model is based on a serial cascading mechanism (SCM) to help the model gain multi-scale features with a large receptive field. Finally, the output of the MFEM is used in the FIU module to extract dense, pixel-wise, full-image dependencies. One benefit of the MFIN model is that it achieves real-time running performance.
10. WMRN
In [53], a weighted multi-scale residual network (WMRN) is introduced for SISR to balance computation and performance. Depthwise separable convolution (DS Conv) blocks are used to modify the residual structure, which improves the efficiency of the convolutional operations, as shown in Fig. 5(j). Also, seeking to improve the multi-scale representation capability, the weighted multi-scale residual blocks (WMRBs) are stacked together. In this model, dilated convolutions help the WMRB module benefit from feature representations at different scale-spaces. Finally, to enhance the information flow, a global residual connection is utilized to carry the high-frequency details to the final layers.
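Depthwise separable convolution, which the WMRN builds on, factorizes a standard convolution into a per-channel (depthwise) convolution followed by a 1 × 1 (pointwise) convolution, cutting parameters and Multi-Adds roughly by the kernel area when the channel count is large. A generic sketch:

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # groups=in_channels makes the 3x3 convolution act on each channel separately;
    # the following 1x1 convolution then mixes information across channels.
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))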
11. ELCRN
An efficient local cascading residual network (ELCRN) is introduced in [54] for lightweight real-time SISR. The ELCRN is built by stacking several efficient cascading residual blocks with a wide activation, as shown in Fig. 5(k). Also, a residual efficient channel attention module is used to improve the model performance by capturing interactions between feature channels. In addition, a wide activated efficient module (WAEM) is used to achieve better and faster performance. The WAEM improves the representation capability and turns the model into a lightweight framework. However, utilizing the ELCRN model on edge devices needs more exploration.
12. AdderNet
Adder neural networks (AdderNet) are suggested for SISR in [55] based on using additions to calculate the output features. Thus, AdderNets can avoid the huge energy cost of the conventional multiplications used in traditional convolutional neural networks. Due to the different calculation paradigm, it is challenging to transfer AdderNet from large-scale image classification to the SR task. So, the authors adapted AdderNet for SR by analyzing the relationship between an adder operation and identity mapping and by inserting shortcuts to achieve good SR performance. Afterward, they adjusted the feature distribution and refined the details using a learnable power activation. The advantage of AdderNets is that the adder-operation problem is solved using identity mappings and inserted shortcuts.
13. FADN
A frequency-aware dynamic network (FADN) is developed in [56] for the SISR task. The FADN uses the coefficients of the discrete cosine transform (DCT) domain to divide the input, as shown in Fig. 5(l). Specifically, the model processes the high-frequency part with expensive operations while assigning the lower-frequency part to cheap operations, which decreases the computation. The low-frequency areas, which include relatively few textural details, are not affected by the dynamic network. Moreover, to achieve end-to-end fine-tuning, predictor modules that handcraft frequency-aware masks are embedded in the dynamic network. The main merit of FADN is that it assigns cheap operations to low-frequency regions and vice versa.
14. ASSLN
An aligned structured sparsity learning (ASSL) network is developed in [57] for SISR. In this model, a weight normalization layer is introduced, and L2 regularization is used for scaling the sparsity parameters. In addition, a sparsity structure alignment penalty term is used across different layer locations for filter pruning; this penalty term minimizes the norm of the soft mask gram matrix. Also, an aligned structured sparsity learning strategy is used to make the training more efficient. One benefit of the ASSLN is that it can tackle the pruned-filter location mismatch issue by using the sparsity structure alignment penalty term to align the pruned filter indices across different layers.
15. OverNet
The overscaling network (OverNet) [58] is developed to avoid training a separate model for each scale in lightweight applications. The OverNet model handles arbitrary scale factors with a lightweight convolutional network for SISR, as shown in Fig. 5(m). The model is based on three major parts. Firstly, a recursive structure of skip and dense connections is used in the feature extractor to better reuse the model information. Secondly, a scale-agnostic reconstruction module improves the feature extractor performance and recovers accurate high-resolution images from the over-scaled feature maps of any SR architecture. Thirdly, a multi-scale loss function is utilized to generalize across scales. A key advantage of OverNet is that its overscaling head can be flexibly used with other SR models by simply replacing their upsampling module to improve the performance.
16. SMSR
A sparse mask SR (SMSR) network [59] is introduced that exploits the sparsity of image SR to enhance inference efficiency. The sparse masks are used to prune redundant computation. In the SMSR model, spatial masks learn to locate the important regions, while channel masks learn to mark the unimportant regions in the redundant channels, as shown in Fig. 5(n). The spatial and channel masks are used together to localize redundant computation at a fine-grained level. As a result, this model can localize and skip the redundant computation while maintaining good performance. The main merit of SMSR is that it decreases the parameters and computation cost with this pruning idea.
17. CLB
In [60], the authors used coarse-grained pruning to improve the acceleration of the network. This is done using collapsible linear blocks that are able to recover the representative ability of the pruned network, as shown in Fig. 5(o). In more detail, the collapsible linear block contains a multi-branch topology during training; it is then replaced by one convolution at the inference stage. The decoupling of the training-time and inference-time architectures is performed with the structural re-parameterization technique, which improves representation without additional inference cost. Moreover, a two-stage training mechanism with progressively larger patch sizes is used to ease the optimization procedure. This model was evaluated in the NTIRE 2022 efficient image super-resolution challenge [25] and achieved a good trade-off between latency and accuracy.
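Structural re-parameterization, on which the collapsible linear blocks of CLB are based, trains a multi-branch block and then folds it into a single convolution for inference. The sketch below collapses a 3 × 3 branch, a 1 × 1 branch, and an identity branch into one 3 × 3 kernel; it illustrates the general technique under these assumptions rather than the exact CLB block.

import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def collapse_branches(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    # Convolution is linear, so parallel branches summed at the output can be
    # merged offline into a single 3x3 convolution with identical behavior.
    c = conv3x3.out_channels
    w = conv3x3.weight.clone()
    w += F.pad(conv1x1.weight, [1, 1, 1, 1])   # place the 1x1 kernel at the center of a 3x3 kernel
    for i in range(c):
        w[i, i, 1, 1] += 1.0                   # identity branch as a centered delta kernel
    fused = nn.Conv2d(c, c, 3, padding=1)
    fused.weight.copy_(w)
    fused.bias.copy_(conv3x3.bias + conv1x1.bias)
    return fused

# Sanity check: the fused convolution matches the three-branch block.
x = torch.randn(1, 8, 16, 16)
b3, b1 = nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 1)
fused = collapse_branches(b3, b1)
print(torch.allclose(b3(x) + b1(x) + x, fused(x), atol=1e-5))   # True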
18. FMEN
In [61], multiple highly optimized convolution and activation layers are stacked to build a fast and memory-efficient network (FMEN) for SISR that requires little feature fusion. In this model, the authors introduce a sequential attention branch in which every pixel is assigned an importance factor based on local and global contexts. Also, the residual block is tailored for efficient image super-resolution (EISR): an enhanced residual block (ERB), shown in Fig. 5(p), is used to improve the inference speed. The main benefit of the FMEN model is that it uses the ERB instead of the plain residual block (RB), which is more friendly for deployment.


19. HPUN
A hybrid pixel-unshuffled network (HPUN) is developed in [62] for SISR. The HPUN is based on an efficient and effective downsampling module. It includes both pixel-unshuffled downsampling and self-residual depthwise separable convolutions, as shown in Fig. 5(q). The pixel-unshuffle operation is used to downsample the input features, and the channels are reduced by a grouped convolution. Also, the model performance is enhanced by adding the input features of the depthwise convolution to its output. The benefit of the HPUN is that it alleviates the weaknesses of depthwise separable convolution.
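Pixel-unshuffle is the inverse of the sub-pixel shuffle: it trades spatial resolution for channels, which is how HPUN obtains cheap downsampling. The snippet below sketches the generic operator (with an arbitrary grouped 1 × 1 reduction), not the exact HPUN module.

import torch
import torch.nn as nn

down = nn.PixelUnshuffle(2)              # space-to-depth with factor 2
x = torch.randn(1, 16, 64, 64)
y = down(x)                              # every 2x2 patch becomes 4 channels
print(y.shape)                           # torch.Size([1, 64, 32, 32])

reduce = nn.Conv2d(64, 16, kernel_size=1, groups=4)   # cheap grouped channel reduction
print(reduce(y).shape)                   # torch.Size([1, 16, 32, 32])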
20. RLFN
A residual local feature network (RLFN) is suggested in [23] for SISR. The RLFN is based on the idea of using three convolutional layers to learn residual local features, as shown in Fig. 5(r), with simplified feature aggregation to balance performance and runtime. In addition, a contrastive loss is used, based on the observation that the choice of intermediate features of its feature extractor can improve the model performance. Finally, a multi-stage warm-start training strategy is used, initializing each stage with pre-trained weights from previous stages to improve the model's performance. Based on these improvements, the RLFN won first place in the NTIRE 2022 efficient super-resolution challenge [25].
21. ShuffleMixer
In [63], a ShuffleMixer network is developed for SISR. The ShuffleMixer model relies on large-kernel convolution and channel split-shuffle operations; it uses a large-kernel ConvNet instead of the traditional stack of multiple small-kernel convolutions. In more detail, a large depthwise convolution is used with two projection layers, with channel splitting and shuffling, for mixing features efficiently. In addition, Fused-MBConvs are introduced for modeling the local connectivity of different features. The main advantage of the ShuffleMixer is its lower parameter count and FLOPs, obtained by fusing non-local and local spatial locations within a feature mixing block. The ShuffleMixer model took part in the NTIRE 2022 competition and achieved a good result.
Discussion and limitations. The first models to use residual learning for the SR task were based on deep recursive residuals [44,50]. After that, the traditional method of the Laplacian pyramid [45] was used to solve this task. In addition, deep learning techniques such as the cascading mechanism [48,54], compact back-projection [49], hybrid pixel-unshuffling [62], multi-receptive fields [51], and multi-scale feature integration [52,53] are also utilized for this SR task. Sparsity learning is another solution used to improve SR performance [57,59]. In addition, the frequency-aware dynamic idea [56] is used in the SR task. Also, the overscaling network [58] and adder neural networks [55] have particular concepts of training and of inserting shortcuts into the model. Even though residual learning helps to extend the model size, these models rely only on the receptive field of the convolution operation, which limits their ability to find long-range dependencies. So, there is a need for other techniques that help the model explore long-range dependencies.

4.1.3. Dense-based methods
The dense block was first developed in DenseNet [113] for high-level vision tasks. After that, this dense block was used in many computer vision tasks. In the dense block, all the feature maps of the previous layers are input to the current layer, and the current layer's feature map is fed to all subsequent layers. In this case, using the dense connection creates l ⋅ (l − 1)/2 connections in an l-layer dense block (for l ≥ 2). These dense connections help alleviate gradient vanishing, enhance signal propagation, and encourage feature reuse. However, in this block, the model size is substantially reduced by employing a small growth rate (i.e., the number of channels in dense blocks). Also, this block squeezes channels after concatenating all input feature maps. We discuss some dense methods related to lightweight image SR models below.
1. GLADSR
A global–local adjusting dense super-resolution network (GLADSR) is introduced in [64] to build a lightweight SR model based on global–local adjusting. The GLADSR structure is based on four parts: a feature extraction net (FENet), a basic SR network, a fine SR network, and a reconstruction network. The FENet is the first stage of the GLADSR model, and it includes a convolutional layer (Conv) used for shallow feature extraction. The model core is based on the SR and the fine SR networks, which share a similar structure. These networks include a separable pyramid upsampling module (SPU) and a nested dense group (NDG). The notions of grouping and perception are combined in the upsampling design of the SPU module. There are multiple global–local adjusting modules (GLAMs) in the NDG that are connected with dense skip connections. To make use of the adjusted features extracted by each GLAM, the connecting strategy is used to keep a flexible information flow. Finally, the reconstruction net is used as a refinement structure for the output image.
2. ESRN
A fast, lightweight, and accurate network named efficient super-resolution network (ESRN) [65] is developed for SISR. The ESRN contains an efficient residual dense block that can be used by a search algorithm with multiple objectives. Firstly, the model is accelerated using a variation of feature scale, which includes efficient residual dense blocks. Then, an evolutionary algorithm is used to search the locations of the pooling and upsampling operators. Also, the guidance of block credits is used in the network architecture search to achieve an accurate SR model. The block credit is generated during the evaluation process of the model, and it reflects the effect of the current block. Thus, by weighting the sampling probability of mutation, this block credit can help the evolution favor admirable blocks. One key advantage of the ESRN is that it fixes the defect of pooling by integrating local residual learning and global feature fusion.
Discussion and limitations. Dense-based convolution solves some problems of the residual-based models by aggregating information from different layers of the model to get more diversified features. Few models have been developed to solve the lightweight SR task based on the dense idea. The first model is based on global–local dense adjustment, and the second model is based on the variation of feature scale. In addition, these models can easily propagate the error signal to the earlier layers, which makes a strong gradient flow. However, they are still not able to extract non-local features, which limits their ability to find long dependencies.

4.1.4. Distillation-based methods
For lightweight SR, the main goal is to balance accuracy and efficiency. One method to achieve that balance is by using distillation methods. These distillation methods are divided into two branches: knowledge (model) distillation and feature distillation. Knowledge distillation is based on using two models, a teacher and a student, where the student is the lightweight model [72,73,75,76]. The second method of distillation is based on channel splitting of the feature maps [18,20,66–69,71,74].
A. Feature Distillation
These models are based on explicitly splitting the intermediate features into two parts along the channel dimension, so the model can use one part for retaining and the second one for further processing by succeeding convolution layers. Also, some of these feature distillation methods extract features at a granular level, retaining partial information and further processing the other features at each step. Fig. 6 shows the network structures of some of the feature distillation-based models that we review below.
size by employing a small growth rate (i.e., number of channels in dense 1. IDN
blocks). Also, this block can squeeze channels after concatenating all In [66], authors designed the first model based on using the infor-
input feature maps. We will discuss some of dense methods related to mation distillation, called information distillation network (IDN). The
lightweight image SR models. IDN structure consists of three main parts, namely feature extraction


Fig. 6. Network structures of the distillation-based methods.

The IDN structure consists of three main parts, namely feature extraction blocks, stacked information distillation blocks (DB), and reconstruction blocks. In the DB block, an enhancement unit is combined with a compression unit to extract local long- and short-path features effectively. The details of the enhancement unit are shown in Fig. 6(a). In particular, the enhancement unit is used to mix two different feature types, while the compression unit is used by the sequential blocks to distill more helpful information. Finally, one advantage of the IDN model is that it has comparatively few filters per layer and uses group convolution. However, the IDN still has a comparably large weight, so there is a need for more efficient models.

2. IMDN
An information multi-distillation network (IMDN) is introduced in [18] for SISR based on cascaded information multi-distillation blocks (IMDB), as illustrated in Fig. 6(b). In more detail, the hierarchical features can be extracted step-by-step using the distillation module. Afterward, the fusion module aggregates the hierarchical features according to the importance of the candidate features. Also, a contrast-aware channel attention (CCA) layer is used for task evaluation. Finally, an adaptive cropping strategy (ACS) is developed to process real images of varying sizes based on the same trained model.
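To make the progressive channel-splitting ("distillation") idea behind IDN- and IMDN-style blocks more concrete, the following is a minimal PyTorch sketch of one such block. The layer names, channel counts, and the 1/4 split ratio are illustrative assumptions for this sketch rather than the exact configuration of any of the published models.

import torch
import torch.nn as nn

class MultiDistillationBlock(nn.Module):
    """Minimal sketch of progressive channel splitting (IDN/IMDN style).
    At every step a thin slice of the features is 'distilled' (kept as-is)
    and the rest is refined by a further 3x3 convolution; all distilled
    slices are finally fused by a 1x1 convolution."""
    def __init__(self, channels=64, steps=3, distill_ratio=0.25):
        super().__init__()
        self.convs, self.splits = nn.ModuleList(), []
        c = channels
        for _ in range(steps):
            d = max(1, int(c * distill_ratio))   # distilled channels this step
            self.splits.append((d, c - d))
            self.convs.append(nn.Conv2d(c, c, 3, padding=1))
            c = c - d                            # only the remainder is refined next
        self.last = nn.Conv2d(c, c, 3, padding=1)
        fused = sum(d for d, _ in self.splits) + c
        self.fuse = nn.Conv2d(fused, channels, 1)
        self.act = nn.LeakyReLU(0.05, inplace=True)

    def forward(self, x):
        kept = []
        for conv, (d, r) in zip(self.convs, self.splits):
            x = self.act(conv(x))
            distilled, x = torch.split(x, (d, r), dim=1)
            kept.append(distilled)
        kept.append(self.act(self.last(x)))
        return self.fuse(torch.cat(kept, dim=1))

# Example: a 64-channel feature map keeps its shape through the block.
feats = torch.randn(1, 64, 48, 48)
out = MultiDistillationBlock()(feats)            # -> torch.Size([1, 64, 48, 48])

Because only a shrinking subset of channels is refined at each step, the block keeps the parameter and Multi-Adds budget low, which is exactly the property the feature distillation family exploits.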


The IMDN solved some problems of the IDN network, such as its large weights, but the IMDN model still has much room for improvement.
…computational costs than other methods, but its performance is not inferior to them.
3. RFDN 8. MRDN
A residual feature distillation network (RFDN) is introduced in [20] In [71], a lightweight multi-stage residual distillation network
using the channel splitting operation to build the feature distillation (MRDN) is introduced for SISR. The model is based on two ideas,
connection (FDCB) block. The FDCB is an enhancement version of the designing a multi-stage residual distillation block (MRDB) and using
previous IMDN [18]. The RFDB is shown in Fig. 6(c), is based on efficient pixel attention (EPA) module, as shown in Fig. 6(g). The MRDB
learning discriminative feature representations using multiple feature contains the channel separation and the skip connection for reducing
distillation and a shallow residual block (SRB). This SRB is designed the parameters and enhancing performance. In addition, the EPA is
using a convolutional layer, skip connection, and an activation unit. So, used by weighing different channels based on their importance, so the
adding the residual learning in the SRB can slightly increase the com- model can extract more discriminative features. Also, one other benefit
putation with no increase in the parameter numbers in comparison to of the EPA is that it can help the model to pay more attention to the
the original block in IMDN. In addition, an enhanced spatial attention high-frequency features conducive to detail recovery.
(ESA) is introduced in this model to further improve the performance. B. Model Distillation
The RFDN network is much more efficient in weight and has faster These models are based on learning a portable student network
runtime than IMDN. from a heavy teacher network based on transferring the knowledge
4. DCDN to a student network. In addition, different loss functions can be used
In [67], a dense connection distillation network (DCDN) is devel- between the output of the teacher and student, which can dramatically
oped based on combining the feature fusion block with the dense improve the student network.
connection distillation block (DCDB). These two blocks have the com- 1. FSRCNN with Distillation
ponents of selective cascading and dense distillation. Also, within the Based on the notion of learning with privileged information [72], a
distillation block, the dense connections are used to fuse shallow and distillation framework called fast super-resolution convolutional neural
deep features, which can help in the image reconstruction task. In par- network (FSRCNN) with Distillation is developed. The FSRCNN with
ticular, the dense distillation module in each DCDB block can provide Distillation consists of two networks with the same network architec-
some helpful information by concatenating all previous layers remain- ture, teacher and student networks. In the teacher network, the encoder
ing feature maps, as shown in Fig. 6(d). Then, a contrast-aware channel can learn the process of degradation based on an imitation loss by
attention mechanism is used for assessing the selected features. In subsampling HR images. While, the student network can learn from
addition, the contrast-aware channel attention layer (LCCA) is utilized intermediate features in the decoder that are transferred to the student
to enhance the performance of the model further. One advantage of this through feature distillation. Finally, in this framework, a decoder can
DCDN is it can use a distillation mechanism to reduce the parameters be used to initialize the student, which can help transfer the recon-
and computation. struction capability of the teacher to the student. The reason for using
5. AIDN the FSRCNN in this model is that it is hardware-friendly and has few
In [68], an asymmetric information distillation network (AIDN) numbers of parameters.
is developed based on multiplexing of distillation information and 2. VDSR with Distillation
extracting the asymmetric information. The multiplexing of distillation A study of the data-free compression technique in the task of SISR
information means repeating the processing of distilled information task is shown in [73], which is widely used in lightweight models. This
to extract the lost high-dimensional information. Also, the asymmet- compression technique has been applied to the very deep convolutional
ric information enhancement block (AIEB) can extract horizontal and network (VDSR) [35] and named the final model VDSR with Distilla-
vertical features by identifying different features of the image, as tion. This model is based on the model distillation idea of student and
shown in Fig. 6(e). The evaluation results showed that AIDN has a low teacher. In more detail, this study analyzed the connection between
parameters number and the ability to achieve a good balance between pre-trained model inputs and outputs. Then, a series of loss functions
performance and complexity. The benefit of the AIDN model is that are used in the generator to fully capture some useful information. In
it can recover lost information made by reducing channel dimension addition, the authors tried to train by synthesizing training samples that
based on using multiplexing of distilled information. have similar original data distribution in the generator training. Also,
6. BSRN the synthetic data is only used to deal with student network training
A blueprint separable residual network (BSRN) is suggested in [69] difficulty based on a progressive distillation scheme. Finally, the result
for SISR. The BSRN is based on two efficient designs, as shown in of the model illustrates that student networks can give good results
Fig. 6(f). Firstly, the blueprint separable convolution (BSConv) is used without training data and with fewer computational cost.
instead of the redundant convolution operation to reduce the pa- 3. CSD
rameters. This BSConv block is based on the depthwise convolution A contrastive self-distillation (CSD) framework is introduced in [74]
(DWconv) and 1 × 1 convolution. The second one is based on using to various off-the-shelf SR models compression and acceleration. This
more effective attention modules to improve the model efficiency. The model is based on the model distillation idea of student and teacher. So,
BSRN shows state-of-the-art and achieved good performance in NTIRE this CSD model is based on constructed student network from a target
2022 efficient SR challenge [25]. The main benefit of this BSRN model teacher model. This teacher–student idea is used to build a channel-
is it can take the place of the redundant convolution operation. splitting SR network. Afterward, the explicit knowledge transfer is used
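As a rough illustration of the blueprint separable convolution (BSConv) mentioned above, the sketch below replaces a standard 3 × 3 convolution with a pointwise (1 × 1) convolution followed by a depthwise convolution. The ordering of the two layers and the absence of normalization are simplifying assumptions for this sketch, not the exact BSRN configuration.

import torch
import torch.nn as nn

class BSConv(nn.Module):
    """Blueprint-style separable convolution: a 1x1 pointwise convolution
    mixes channels first, then a depthwise KxK convolution filters each
    channel spatially. The parameter count drops from C_in*C_out*K*K to
    roughly C_in*C_out + C_out*K*K."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=out_ch, bias=True)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))

x = torch.randn(1, 64, 32, 32)
print(BSConv(64, 64)(x).shape)   # torch.Size([1, 64, 32, 32])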
7. FDIWN to enhance SR images quality and PSNR/SSIM using a contrastive loss.
A feature distillation interaction weighted network (FDIWN) is de- In addition, a universal CSD scheme is used to compress and accelerate
veloped in [70] for SISR. This FDIWN is based on using a series of various SR networks, which can provide runtime friendly model for
feature shuffle weighted groups (FSWG) in the model backbone. FSWG practical using.
is designed based on utilizing wide-residual distillation interaction 4. IMDN-FSL
blocks (WDIB). The WDIB block is used for good feature distillation by A new loss function is introduced in [75] for the SR task, which
using wide identical residual weighting (WIRW) units and wide convo- represents high perceptual quality from a lightweight model. Also, this
lutional residual weighting (WCRW) units. In addition, a wide-residual loss is introduced in a generative adversarial network (GAN) [114]
distillation connection (WRDC) framework with a self-calibration fu- that includes both of generator and a discriminator. Due to the fact
sion (SCF) unit are used for flexibly and efficiently interacting features that the low-complexity generator network representative power can
in different scales. The advantage of the FDIWN is that it has lower be leveraged using powerful guidance to the parameter’s optimally.


So, this can enhance the performance based solely on using efficient (i.e., a constant) using global average pooling (GAP). After that, feeding
generator architecture using the introduced loss functions. Specifically, these descriptors to two dense layers for producing the channel-wise
a Fourier space supervision loss is utilized for the reconstruction of the scaling factors for input channels. We will discuss the detail of each
missing high-frequency (HF) details from the ground truth image. In ad- model as the following.
dition, a discriminator architecture is operated on the Fourier domain 1. FERN
for distribution match of the target HF. So that, this introduced loss A lightweight feature enhancement residual network (FERN) is
can work on the frequencies in Fourier-space, which can enhance the introduced in [77] based on incorporating the non-local operations in
perceptual image quality. The model is based on an information multi- the residual block. The non-locally enhanced residual block is designed
distillation network (IMDN), and the final model is named (IMDN-FSL). in FERN model for capturing the long-range dependencies, as shown in
One benefit of this MDN-FSL model is it can use the combination of Fig. 7(a). Also, the structure-aware channel attention layer is utilized in
spatial and frequency domain losses for performance improvement. FERN to use structural and textural details to improve the feature maps.
5. MemSR In addition, the basic block of the model is based on the residual-in-
In [76], the author used the model distillation of teacher and residual structure, which helps to solve the training difficulty. One key
student and tried to calculate a winning initialization from a complex advantage of this FERN model is that it can benefit from the non-locally
teacher network for a plain student network. So that, this approach can enhanced residual block to capture long-range dependencies.
improve the performance compared to the complex models. Then, the 2. LGCN
teacher model is converted to an equivalent large plain model to derive A lightweight group deconvolution network (LGCN) is introduced
the initialization of the plain student. This hardware-friendly model in [78] to solve the problem of demanding computational and memory.
is referred as MemSR. In addition, the initialization-aware feature These two problems prevent the SR models from being applied to
distillation is used in the student to enhance the model performance real-world applications. In more detail, the LGCN is developed for
further. Moreover, this model can make the accuracy and speed trade- SISR based on using several memory groups convolutional network
off low memory footprint. One key merit is that a model trained with (MGCN) in a cascading connection. The MGCN network, shown in
MemSR needs half the memory footprint compared to the one without Fig. 7(b), has two main advantages; firstly, it has several groups of
it. 1 × 1 convolutional with a structure of dense connection that decrease
Discussion and limitations. the LGCN parameters. Then, a 1 × 1 convolution is used to create
The models based on the feature distillation are incrementally im- a combination of a linear output group and collect progressive local
proved, starting from IDN [66] with only information distillation to information. In addition, the MGCN uses the channel attention unit for
information multi-distillation (IMDN) [18]. After that, some residual channel-wise relationship modeling to enhance performance. Finally,
learning is added to IMDN to generate the RFDN [20]. Then, the SRB the key advantage of this MGCN is that it uses the memory group
in the RFDN is further improved in the other feature distillation mod- convolutional networks cascaded and adopts hierarchical feature fusion
els [67–71,74]. For knowledge distillation, these models are based on for collecting all-level features to achieve good performance.
using certain well-known SR models and trying to decrease their weight 3. MCAN
and computation based on the teacher–student notion. In addition, In [79], a matrix channel attention network (MCAN) is introduced
initialization of the students is also an important factor considered in to benefit from the matrix channel attention in the SR task. The MCAN
training these types of models [72,76]. Also, some models have some can construct multi-connected channel attention blocks (MCAB) as a
new losses function that helps in solving this task [73–75]. Even though form of a matrix ensemble, as shown in Fig. 7(c). In this model,
these distillation-based methods achieved good performance using the the matrix-in-matrix (MIM) comprises the MCAB for the utilization of
channel splitting operation for feature distillation, these methods are hierarchical features. In addition, a hierarchical feature fusion (HFF)
limited by the convolution kernel which limits their ability to find long- block is designed to combine with the MIM structure to benefit from
range dependence. To solve this limitation, attention-based methods the hierarchical features of MIM in the LR image. So, the MCAN model
are developed to find the long-range dependence using the blocks that can be used in lightweight applications due to its small computationally
can extract non-local features. expensive. One merit of this MCAN model is that it can pass multiple
levels of information for both in-depth and in-breadth.
4.1.5. Attention-based methods 4. 𝐴2 𝐹
The previously discussed convolution, residual, dense, and distilla- An attentive auxiliary feature (𝐴2 𝐹 ) is introduced for SISR in [80]
tion SISR methods are designed by considering all spatial locations and that is applied to the bottom layers of the feature exploring of the
channels of the feature maps to have similar importance for solving SR model. Specifically, all the previous layer’s auxiliary features are pro-
tasks. In many cases, there is a need to design a model that can only jected common space features for the layers of the bottom exploration.
attend to a specific few features in a given layer. So that, attention- Afterward, a channel attention is used for the current layer’s most
based models [79,82,85–87] are developed based on the notion that common feature selection by utilizing the projected auxiliary features
not all the features have the same importance for the SR task. These and filtering the redundant information. With the help of 𝐴2 𝐹 block
types of models can be divided into six main categories: channel seen in Fig. 7(d), the 𝐴2 𝐹 model can achieve significant performance
attention, spatial attention, pyramid attention, multi-scale attention, with parameters of 320K and multi-adds less than 75G. So that, the
and Transformer-based attention methods, in addition to some other 𝐴2 𝐹 can easily be used in many practical applications that have limited
improvements. It is important to note that some methods can belong resources. The main difference between this 𝐴2 𝐹 method and other
to different categories, so we tried to divide them based on the main methods is it uses dense auxiliary features rather than the backbone
idea of the method. We will discuss each one in detail. features or the sparse skip connections.
A. Channel Attention 5. PRRN
The channel attention is widely used in lightweight SR models due A progressive representation recalibration network (PRRN) is de-
to its ability to extract more non-local features without needing large signed in [81] for SISR. The PRRN is able to learn the complete
numbers of weights and multiplications. The channel attention is based representations of the feature. In this model, a progressive represen-
on finding the interdependence and interaction of the feature repre- tation recalibration block (PRRB) is used for extracting features from
sentations between different channels [115]. This channel attention is pixel and channel spaces using a two-stage approach. In the first stage,
based on ‘‘squeeze-and-excitation’’, which can enhance learning ability the PRRB benefits from pixel and channel information for learning
by explicitly modeling channel interdependence. This channel attention feature regions. Then, in the second stage, the authors utilized channel
block is based on squeezing the input channel into a channel descriptor attention to adjust the important feature channel distribution. Finally,


Fig. 7. Network structures of the attention-based methods.


the main benefit of the PRRN is solving the problem of information fusion. One advantage of this LMAN methods is it can work better for
loss that cause using nonlinear operations in the channel attention large scaling factors such as 4 × and 8 ×.
mechanisms, by using shallow channel attention (SCA) mechanism to 3. AMSRN
simply learn each channel’s importance. In [86], both multi-scale residual and attention are combined to
B. Spatial Attention produce an attention-based multi-scale residual network (AMSRN) to
In addition to channel attention, there is a spatial attention, which solve the issue of the convolution neural network in practical SR
can solve the very limited local receptive fields of the SR models. So, applications. Specifically, the AMSRN contains a residual atrous spatial
this spatial attention can help in enhancing the representation ability. pyramid pooling (ASPP) block that is alternately stacked with a spa-
The detailed models that uses spatial attention are as the following. tial and channel-wise attention residual (SCAR) block for supporting
1. MSAN network framework, as shown in Fig. 7(i). The channel attention (CA)
The spatial attention module is used in [82] to produce a lightweight and spatial attention (SA) mechanisms are added to the SCAR using
multi-scale spatial attention network (MSAN). The MSAN is used for the double-layer convolution residual block. Moreover, in the SCAR
SISR to attain good performance with a limited number of parameters. block, the group convolution is used for decreasing the parameters
Also, seeking to broadcast abundant features to each layer, the dense and over-fitting prevention. In addition, a multi-scale feature attention
connection is adopted with feature fusion layers. Moreover, a double module is used to get the instructive multi-scale attention information.
residual structure is introduced to provide an extra skip connection. Specifically, the sub-pixel convolution and nearest interpolation layers
In addition, a multi-scale spatial attention block (MSAB) is used to are used jointly, so an upscale module is used to upscale the features
get information from multi-scale spatial contextual. Additionally, the using dual paths. Instead of separately utilizing the deconvolution layer
spatial attention block is used to locate the most informative features or sub-pixel convolution method, the dual-path upscale method is used.
of the scale, as shown in Fig. 7(e). Finally, One advantage of this MSAN A key benefit of this model is it uses a dual-path method to upsample
model is that it can extract dynamic multi-scale features for feature the information on low and high frequencies.
enrichment. 4. MCSN
2. A-CubeNet A multi-scale channel attention SR network (MCSN) is suggested
The idea of the attention cube is to simultaneously use all attention in [87] for SISR. The MCSN model is based on three issues. First,
mechanisms, such as spatial, channel-wise, and hierarchical dimen- a multi-scale feature fusion block (MSFFB) is incorporated for multi-
sions. So, an attention cube network (A-CubeNet) is developed in [83] scale features extraction using different receptive fields filter. Second,
for image restoration is based on learning feature expression and the authors utilized a channel shuffle attention mechanism (CSAM)
feature correlation methods. In this model, the adaptive dual attention to enhance the information flow across feature channels and improve
module (ADAM), which contains the adaptive spatial attention branch feature selection capacity, as shown in Fig. 7(j). Thirdly, the global
(ASAB) and the adaptive channel attention branch (ACAB) is used, as feature fusion connection (GFFC) is used for feature utilization im-
shown in Fig. 7(f). The ACAB is used for the receptive field expansion provement. One important benefit of this MCSN is it can effectively
that helps to discriminate different information types. This A-CubeNet extract various features and uses low-frequency image information to
model is able to find long-range dependencies between pixels and reconstruct high-quality images with good quality.
channels. So, it can enlarge the receptive field and distinguish different 5. MARAN
types of information, which can lead to more effective representations An efficient multi-scale aggregated residual attention network
of the features. (MARAN) is developed in [88]. The idea of MARAN is based on
C. Multi-scale Attention the multi-scale contextual information and multi-level features by
The multi-scale attention models are based on extracting multi-scale using multi-scale aggregated residual attention groups (MARAGs). The
features using different attention types. These types include some of the MARAN model consists of shallow feature extraction and multiple
previously discussed attention types of channel and spatial attention. recursively MARAGs. The MARAG includes cascaded multi-scale ag-
Also, these models can include some combination of the residual learn- gregated residual attention blocks (MARABs). The MARAB contains
ing and attention modules. The details of these multi-scale attention the multi-scale aggregated block, dual-attention unit, and a skip con-
models can be summarized as follows. nection, as shown in Fig. 7(k). So that the MARAB can use different
1. LAMRN scales for adaptively extracting the image features to achieve a well
A lightweight attended multi-scale residual network (LAMRN) is representation and also good spatial and channel dimension informa-
introduced [84] based on extracting multi-scale features using an at- tive content. Moreover, a multi-level feature fusion block (MLFFB) is
tended multi-scale residual block (AMSRB), shown in Fig. 7(g). In this used that ends with a reconstruction part. Finally, a key merit of the
model, the features discrimination is improved by using an efficient MARAN is it uses the MLFFB for the hierarchical features fusion output
channel attention block (ECA). Moreover, the low-level and high-level for efficient SR images.
features are fused using a double-attention fusion (DAF) block. In addi- D. Pyramid Attention
tion, the spatial and channel attention modules are used to get guidance The pyramid attention models are based on using a pyramid struc-
for feature fusion tasks based on low-level and high-level features. ture to extract cross-scale features. These methods are based on the
One benefit of the LAMRN is it can extract multi-scale information idea that noise signals decrease in cross-scale features. Also, generating
effectively. cross-scale feature represents a powerful feature extraction method.
2. LMAN The detail of these pyramid attention models can be summarized as
Based on aggregating features from many scales at the same time, follows.
a lightweight multi-scale aggregation network (LMAN) is introduced 1. PDAN
in [85] to work well for all scale factors with fewer parameters. In A pyramidal dense attention network (PDAN) is introduced in [89]
LMAN, a group-wise multi-scale block (GMB) is designed to obtain for lightweight SISR. In PDAN, the pyramidal dense learning is used to
discriminative features based on the multi-scale features extraction and extract deep features efficiently, as shown in Fig. 7(l). Also, there is a
fusion before a channel attention layer, as shown in Fig. 7(h). Also, a gradual increase inside a pyramidal dense block for densely connected
hierarchical spatial attention (HSA) mechanism is used for fusing the layer width. Then, to relieve the parameter explosion, an adaptive
local and global hierarchical features for reconstructing the HR image. group convolution is used to linearly grow the number of groups with
The HAS mechanism is used upon stacked GMBs to build a spatial dense convolutional layers. In addition, a joint attention is used to
enhanced residual group (SERG) to integrate the local hierarchical extract the spatial and channel dimensions cross-dimension interaction.
features and afterward used the SERGs for global hierarchical features So that, the main benefit of this PDAN model is that it can use the


jointed attention method for rich discriminative feature representa- images. So, using this VAM block can help the model achieves state-
tions. of-the-art results. The main benefit of this model is that it can use
2. FPAN the decoupling strategy to decompose the large kernel to reduce the
A feedback pyramid attention network (FPAN) is suggested in [90] complexity.
that benefits from features mutual dependencies. In more detail, a 4. SCET
feedback connection structure is introduced that uses high-level infor- A lightweight self-calibrated efficient Transformer (SCET) network
mation for enhancing the expression of low-level features, as shown is developed in [95] for SISR. This SCET is based on using the self-
in Fig. 7(m). In this method, each stage output is fed to the corre- calibrated module and efficient transformer block. The function of the
sponding layer input of the next state that can re-update the previous self-calibrated module is effective in extracting image features using
low-level filter. In addition, a pyramid non-local structure is used at the pixel attention mechanism. Also, an efficient transformer is utilized
different scales for modeling the global contextual information, which for exploiting contextual information by learning similar long-distance
can enhance the discriminative network representation. The pyramid features. So, the SCET can recover sufficient texture information. Based
non-local blocks are used for discriminative representation improve- on the merits of these previous components, these can give the model
ment based on long-distance spatial contextual information capturing powerful modeling capabilities in the spatial dimension and channel
at many scales. dimension.
3. BSPAN 5. CFIN
A balanced spatial feature distillation and pyramid attention In [96], a lightweight cross-receptive focused inference network
(BSPAN) is developed in [91] for SISR. This model can make trade-offs (CFIN) is introduced for SISR. This CFIN is a combination of a con-
among the extracted features of different attention types. In this model, volutional neural network (CNN) and a Transformer. In this model, the
a balanced spatial feature distillation block (BSFDB) is introduced cross-receptive field guide Transformer (CFGT) module is used to mod-
to take advantage of different attention features. So, the balancing ify the model weights based on combining the modulated convolution
attention block is used to balance between the spatial attention residual kernels and the local representative semantic information. Also, a CNN-
feature distillation (SARFD) and classical attention (CA). In addition, a based cross-scale information aggregation module (CIAM) is utilized to
pyramid attention is used for extracting long-range features among dif- help the model to concentrate on practical information and improve the
ferent image scales. So, the model can benefit from feature distillation performance of the Transformer stage. The main benefit of this model
and cross-scale features without having conflicting attention. is it can use context reasoning to improve the SR performance based
E. Transformer-based Attention on adaptively modifying the network weights.
These models are based on using attention as the main module F. Other Attention
for feature extraction. This attention module can be multi-head self- This section will group the other methods that do not belong to the
attention (MHA) or some variant of it. Also, most of these models previous 5 categories. These methods include some novel ideas such as
use multi-layer perceptron (MLP) for feature transformation. The detail LatticeNet [99], which is based on using the lattice structure to build
of the widely used models for lightweight SR can be summarized as the model. Also, one model is based on pixel attention that is similar
follows. to channel and spatial but works on the pixel level. The detail of each
1. ESRT method as the following.
An efficient super-resolution Transformer (ESRT) model is intro- 1. MADNet
duced in [92] for SISR. The ESRT considers a hybrid Transformer that Based on using many scales of attention in the network, a dense
firstly used in the CNN-based SR network to extract features. In more lightweight network (MADNet) [97] is introduced to express the multi-
detail, the ESRT is built from two backbones. Firstly, a lightweight CNN scale feature and learn the feature correlation. In more detail, a residual
backbone (LCB) is used for deep SR features extraction. Secondly, a multiscale module with an attention mechanism (RMAM) is developed
lightweight Transformer backbone (LTB) contains a series of efficient to improve the representation ability of multi-scale features, as shown
Transformers (ET) which can utilize limited GPU memory based on in Fig. 7(p). Also, a dual residual-path block (DRPB) is designed to
the efficient multi-head attention (EMHA) module. This EMHA model benefit from a hierarchical feature in the LR image. So, the MADNet
can dramatically reduce the used GPU memory. The main benefit of can take advantage of multilevel features and dense connections using
the ESRT is that it efficiently learns the relationship between similar the employed block. However, this MADNet does not fully utilize the
local blocks, so this can make the super-resolved region have more intermediate layer’s informative features, which limits its performance.
references. 2. MAFFSRN
2. HNCT A multi-attentive feature fusion super-resolution network
A hybrid network of CNN and Transformer (HNCT) is suggested (MAFFSRN) is introduced for SISR in [98] to efficient memory use and
in [93] for SISR. This HNCT includes four modules: shallow feature ex- better computational cost of real-world applications. The MAFFSRN
traction module, hybrid blocks of CNN and Transformer (HBCTs), dense feature extraction block contains feature fusion groups (FFGs) using
feature fusion module, and up-sampling module. So, the combination a stack of multi-attention blocks (MAB), as shown in Fig. 7(q). The
of the CNN and Transformer can help the HBCT to extract deep features FFG and MAB are used in the MAFFSRN to tackle the issue of vital
that are helpful for image reconstruction. The Swin Transformer block information vanishing during the flow of the network. So that these two
can be seen in Fig. 7(n). In addition, the authors utilized enhanced blocks enabled to grow the network depth at the same time, minimizes
spatial attention (ESA) to further enhance the performance. The main computational cost, which increased the network performance. Also,
merit of the HNCT is it can use both local and non-local priors and to minimize the computational cost and memory usage, the model
extract deep features at the same time, so it represents a flexible block. used two modifications. Firstly, the cost-efficient (CEA) block directly
3. LKASR applies the attention mechanism to input features. Also, the enhanced
A large kernel attention SR network (LKASR) is developed in [94] spatial attention (ESA) block [47] replaced Conv groups with dilated
for SISR. This LKASR contains three modules: shallow feature extrac- convolutions that represented a large spatial size. So, the key benefit of
tion, deep feature extraction, and high-quality image reconstruction. In this model is it can multi-attention blocks to enhance the performance.
the deep feature extraction module, there are multiple cascaded visual 3. LatticeNet
attention modules (VAM) that include a 1 × 1 convolution, large kernel In [99], the LatticeNet is developed for SISR based on utilizing
attention, and a feature refinement module as shown in Fig. 7(o). In lattice block (LB). The LB can use linear combinations of two RBs,
more detail, the VAM can work similarly to the Swin Transformer, as shown in Fig. 7(r). These RBs used as an attention mechanism
which works in iterative extraction of global and local features of in determining the combination coefficients of LB. The design of the


LB represents a good lightweight SR model, which can decrease pa- Dynamic residual attention (DRA) is used in this model, which can
rameters numbers to half while keeping a similar SR performance. In change its structure adaptively based on the input statistics. Also, a
addition, the LB can adaptively combine using combination coefficients dynamic residual module is utilized in the DRA block to look for the
of the RBs with the attention mechanism. This combination can up- interrelation between the residual paths and the input image statistics.
weight the important channels of feature maps that can achieve good In addition, a residual self-attention (RSA) module is used to improve
SR results. The merit of the LatticeNet is it uses the backward fusion the performance by generating 3-dimensional attention maps without
strategy to extract hierarchical contextual information. any extra parameters cost based on the residual structure cooperation.
4. PAN The key advantage of this DRSAN is that using a combination of DRA
Similar to the channel attention and spatial attention in formulation, and RSA helps balance the computational cost and performance.
a new pixel attention network (PAN) based on pixel attention (PA) 9. HRFFN
is introduced in [100]. This pixel attention is used to generate 3D A hierarchical residual feature network (HRFFN) is proposed in
attention maps that replace the 1D attention vector or 2D map. The [105] for SISR. In more detail, an enhanced residual block (ERB)
new PA block slightly increases the parameters but produces better is introduced using multiple mixed attention blocks (MABs) to en-
SR results. The overall network main branch consists of two building
hance the network representation ability. This ERB is much better than
blocks in addition to the reconstruction branch. The first one is similar
residual blocks for decreasing network parameters and computational
to the self-calibrated convolution but includes the PA layer named (SC-
complexity. Also, a hierarchical feature fusion strategy (HFFS) is used
PA) block, as shown in Fig. 7(s). However, the second branch groups
for generating more features from intermediate convolution layers.
the nearest-neighbor upsampling, convolution, and PA layers, which
Then, this strategy can use the hierarchical details in image to refine the
enhances the quality of the reconstruction image with a small addition
hierarchical features. Finally, both the dense global connection strategy
in parameters. Finally, the main benefit of the PAN is it can increase
(GDCS) and residual learning connection (RLC, at low, meditate, and
the dimension of attention maps from 1D to 3D.
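A pixel attention layer of the kind used by PAN can be written very compactly; the sketch below is a plausible minimal form (a 1 × 1 convolution followed by a sigmoid that produces a full C × H × W attention map), offered as an illustration rather than the authors' exact implementation.

import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Pixel attention: a 1x1 convolution plus sigmoid yields a 3D attention
    map (one weight per channel, row, and column) that re-weights every
    single feature value, unlike 1D channel or 2D spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return x * self.sigmoid(self.conv(x))

feats = torch.randn(1, 40, 64, 64)
print(PixelAttention(40)(feats).shape)   # torch.Size([1, 40, 64, 64])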
high levels) are used for HRFFN construction. So, the HRFFN benefit
5. HRAN
A lightweight SR model named hierarchical residual attention net- is that it can maximize the hierarchical features that usually lead to
work (HRAN) is designed in [101] for SISR. The HRAN uses an efficient the degradation of network reconstruction with a small number of
residual feature method with an attention aggregation block, as shown additional parameters.
in Fig. 7(t). In this model, for posterior usage, the hierarchically aggre- Discussion and limitations. The attention-based methods can be
gated feature banks are used at the network output to efficiently use the divided into six main categories: channel attention, spatial attention,
residual features. Simultaneously, the attention block is used based on pyramid attention, multi-scale attention, and transformer-based meth-
the idea that a hierarchical attention mechanism can benefit from the ods, in addition to some other improvements. The channel attention is
network’s most relevant features. In addition, successive operations are widely used for solving the SR task [77–81]; these models have mainly
used inside the network in the final output layer to prevent information been based on channel attention in [115] as a building the variant of
loss. This processing step is split into two independents simultaneously channel attention block. The merit of channel attention is its ability
carried out computation paths. So, this splitting step can use the LR to find interdependence and interaction of the feature representation
image to create an effective model that can reconstruct fine details in between different channels. Similar to channel attention, the spatial
the HR images. Finally, based on the hierarchically aggregate concept, attention module is used in [82,83] to improve the SR models based on
this HRAN advantage is it can use residual attention features groups to capturing long-distance spatial contextual information. Many attention-
make the preservation of finer details. based methods based on the multi-scale feature extraction [84–88] can
6. MPRNet use all the spatial and channel attention. The pyramid attention meth-
A multi-path residual network (MPRNet) is introduced in [102] ods [89–91] are based on using attention with pyramid architecture, so
using residual concatenation blocks stacked with adaptive residual these models can extract cross-scale features. The Transformer based
blocks (ARB), as shown in Fig. 7(u). The MPRNet model can well-focus attention methods [92–96] are mainly based on the multi-head self-
on enhancing the performance using spatial information via multi-path attention or some variant of it with multi-layer perceptron. The main
residual learning with small additional computation. So that, this model drawback of these Transformer-based methods is it uses long inference
is able to generate informative features and learn good spatial con- time due to calculating the self-attention mechanism. In addition to
text information. Also, it leverages multi-level representations before the previous main five design methods, there are some ideas that can
upsampling part. In addition, the MPRNet network allows efficient make some type of attention, including pixel attention (PA) [100],
information and gradient flow. Moreover, a two-fold attention module lattice block (LB) [99], hierarchical attention [101,105], expectation–
(TFAM) is included to achieve the network’s high representation ability. maximization [103], feature fusion [98], and self-attention [104]. So,
This module can refine the extracted information in both the channel the models based on attention can easily find the correlation between
and spatial axes for improving the network discriminative ability.
the features in the long-range, but it will compound with the computa-
7. EMASRN
tion cost. Also, many attention-based models are developed to solve the
An efficient SISR network [103] with an expectation–maximization
computation problem, but these methods need a lot of memory which
attention mechanism (EMASRN) is introduced for good balancing be-
limits their performance in low-memory devices such as mobile phones.
tween performance and applicability. The EMASRN deep projection
block is shown in Fig. 7(v). In particular, a progressive multi-scale fea-
ture extraction block (PMSFE) is used for different sizes of feature maps 4.1.6. Extremely lightweight
extraction. One advantage of the PMSFE is that it can progressively Because deep learning SR models (DL-SR) need high memory re-
share the adjacent scales feature information in the early stage. This quirements, these models have yet to be deployed on mobile devices.
leads to efficient fusion of the different features. In addition, an HR- So, there is a need for small and fast models; this means the mod-
size expectation–maximization attention block (HREMAB) is used for els should have a lower number of parameters and computational
long-range dependencies feature maps capturing. Finally, a feedback complexity than the other SR models. Most of the model usually
network is utilized to transfer each generated high-level feature to the contains small numbers of layers, and some other techniques can also
following generation’s shallow layer. be used to further improve the performance of these models, including
8. DRSAN post-training quantization and quantization-aware training, which can
A dynamic residual self-attention network (DRSAN) is introduced dramatically improved the performance. In this section, we will discuss
in [104] for efficient SISR. This DRSAN model is based on automa- some extremely lightweight models that can be applied to mobile
tion of the design of the residual connection among building blocks. devices.


1. s-LWSR quantization-aware training approach is utilized to further optimize the


In [106], a super lightweight SR network (s-LWSR) is introduced model at 8-bit quantize. One drawback of this model is its performance
to solve the issue of deploying DL-SR models on mobile devices. The is not comparably good in the Mobile AI & AIM 2022 challenge [26]
s-LWSR model consists of three main parts; Firstly, an information in the target hardware.
pool that is built to mix multi-level information from the first half of Discussion and limitations. The above surveyed models are devel-
the pipeline. So, this information pool can effectively abstract features oped to solve the mobile SR task based on three ideas: (1) use efficient
from the LR image. Second, a compression module is introduced to modules [106,108–110], (2) optimized for specific hardware [107,
further reduce the size of the parameter. Finally, several activation 116], and (3) using a specific type of training like 8-bit quantization
layers are removed from the model to retain more information for per- ware training [109,110,116]. These models can balance the perfor-
formance enhancement. One key merit of the s-LWSR model is it builds mance and computation dilemma, so these models can help the users
the information pool for transmitting features to high-dimensional to benefit from the massive progress of the deep learning field in
channels. a practical manner. Even though the previously discussed methods
2. SESR achieved good performance, there is a need for a more efficient model
Based on collapsible linear blocks, a super-efficient SR network that can work for large-scale factors.
(SESR) is introduced in [107] to enhance the quality of the image with
a high decrease in the model complexity. The SESR can perform 2 × 4.2. Loss functions
(1080p to 4K) and ×4 SISR (1080p to 8K) of constrained hardware. The
SESR model shows efficiency by simulating on a number of hardware To measure reconstruction error and guide the model optimization
performances for SISR on a commercial mobile neural processing unit of SR task, many loss functions are utilized as the basis for the SR field.
(NPU) for 1080p to 4K (2 ×) and 1080p to 8K (4 ×). The results show Initially, the pixel-wise 𝑙2 loss is used; however, it is found that it could
that the SESR is much faster than the prior art on mobile-NPUs. Also, not effectively indicate the reconstruction quality. So, there are many
one key benefit of the SESR model is it easy to deploy on a real mobile other loss functions (e.g., content loss [117], adversarial loss [118])
device with good performance. that are used for reconstruction error measure enhancement. These new
3. SplitSR losses help the model to produce further accurate and higher-quality
A novel hybrid architecture called split super-resolution model results. In this part of our review, we will discuss the loss functions
(SplitSR) is introduced [108] for SISR. The SplitSR model contains a that are widely used. For notation propose, we will follow Section 2.,
lightweight residual block called SplitSRBlock that is introduced to im- the target HR image 𝑋 and the generated HR image 𝑋̂ for shortness.
prove the latency and the accuracy for on-device SR. The SplitSRBlock
Pixel Loss. Pixel loss evaluates the pixel-wise two images differ-
can support channel-splitting, which helps retain spatial information
ence. This type of loss includes 𝑙1 loss (i.e., mean absolute error) and
of the residual blocks while decreasing the channel dimension compu-
𝑙2 loss (i.e., mean square error) and are given by
tation. The SplitSR model contains standard convolutional blocks and

lightweight residual blocks hybridization that can assist in the SplitSR ̂ 𝑋) = 1
𝑝𝑖𝑥𝑒𝑙_𝑙1 (𝑋, ‖𝑋̂ − 𝑋𝑖,𝑗,𝑘 ‖, (7)
tuning of the computational cost. The SplitSR model is evaluated ℎ𝑤𝑐 𝑖,𝑗,𝑘 𝑖,𝑗,𝑘
on a low-end ARM CPU and demonstrated that it has 5 × faster 1 ∑ ̂
̂ 𝑋) =
𝑝𝑖𝑥𝑒𝑙_𝑙2 (𝑋, (𝑋 − 𝑋𝑖,𝑗,𝑘 )2 , (8)
inference and good accuracy. Also, the SplitSR model is deployed into ℎ𝑤𝑐 𝑖,𝑗,𝑘 𝑖,𝑗,𝑘
the ZoomSR smartphone app as the first on-device SR deep learning
instance. Finally, in this model, a modern deep learning compiler is where ℎ, 𝑤, and 𝑐 represent the height, width, and the evaluated
used for implementing the proposed system and generating highly images number of channels, respectively. Moreover, what is named
efficient machine code, which improves the inference speed on their Charbonnier loss [46,119] is a modified of the pixel 𝑙1 loss, given by:
target embedded device. ∑√
̂ 𝑋) = 1
𝑝𝑖𝑥𝑒𝑙_𝑐ℎ𝑎 (𝑋, (𝑋̂ 𝑖,𝑗,𝑘 − 𝑋𝑖,𝑗,𝑘 )2 + 𝜖 2 , (9)
4. XLSR ℎ𝑤𝑐 𝑖,𝑗,𝑘
A hardware (Synaptics Dolphin NPU) limitation aware model is
suggested in [22] for extremely lightweight quantization robust real- where 𝜖 is a constant (e.g., 10−3 ) that is used to make the loss numer-
time SR network (XLSR). The authors were inspired by [116] for image ically stable. When this pixel loss is applied to make the generated HR
classification and developed a new building block, then applied the root image 𝑋̂ more similar to its ground truth X. In comparing pixel values
modules to the SISR problem. Also, the clipped ReLU is used at the last with 𝑙1 loss, the 𝑙2 loss penalizes bigger errors; however, it gives smooth
layer of the network seeking of robust uint8 quantization model. So output due to its tolerance. Practically, utilizing the 𝑙1 loss achieve good
that, the XLSR model achieved a great trade-off between performance performance and convergence over 𝑙2 loss [48,120,121]. The PSNR def-
and runtime. This XLSR model achieved the 1st position on Mobile AI inition is connected with a pixel-wise difference, leading to diminishing
workshop 2021 [16]. pixel loss, which maximizes PSNR. Due to this fact, the pixel loss grad-
5. ABPN ually turned out to use in a wide range due to it considering the image
An efficient architecture is designed for 8-bit quantization and de- quality (e.g., perceptual quality [117], textures [122]) into account.
ployed on mobile devices in [109]. In this model, a meta-node latency The results are the absence of the high-frequency information and are
experiment is conducted by decomposing lightweight SR architectures perceptually unsatisfying with over smooth textures [117,118,123].
that are used to determine the portable operations. Also, anchor-based Adversarial Loss. When GAN [124] is introduced in deep learning,
plain net (ABPN) is used to find the 8-bit quantization efficient archi- it receives a lot of attention, and it is also shown to be used in various
tecture. In the end, a quantization-aware training strategy is adopted vision tasks due to its powerful learning ability. For the image SR
for performance enhancement. This new learning strategy improves the task, it is easy to use adversarial learning. In such a case, we can only
PSNR with 2 dB for the INT8 quantized model without any parameter consider the generator as the SR model afterward, using an additional
cost. discriminator to judge the generator output is good. So, a model of
6. CDFM-Mobile SRGAN is used in [118] by applying the cross-entropy adversarial loss,
In [110], a channel mixing Net (CDFM-Mobile) is developed to solve the mobile SR. The CDFM-Mobile uses a channel mixing block composed of a pointwise convolution and deep features extraction. In addition, anchor-based residual learning and deep feature residual learning are used to enhance the performance further. Also, the …

like the following:

\mathcal{L}_{gan\_ce\_g}(\hat{X}; D) = -\log D(\hat{X}), (10)

\mathcal{L}_{gan\_ce\_d}(\hat{X}, X; D) = -\log D(X_s) - \log(1 - D(\hat{X})), (11)
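As a compact illustration of the pixel losses of Eqs. (7)–(9) and the adversarial losses of Eqs. (10)–(11), the PyTorch sketch below implements the Charbonnier loss and the SRGAN-style cross-entropy GAN losses. The discriminator is assumed to output raw logits; this is an implementation choice for the sketch rather than something specified in the surveyed papers.

import torch
import torch.nn.functional as F

def charbonnier_loss(sr, hr, eps=1e-3):
    """Eq. (9): a smooth variant of the pixel l1 loss."""
    return torch.mean(torch.sqrt((sr - hr) ** 2 + eps ** 2))

def generator_adv_loss(disc_logits_sr):
    """Eq. (10): the generator wants D(sr) to be classified as real (label 1)."""
    return F.binary_cross_entropy_with_logits(
        disc_logits_sr, torch.ones_like(disc_logits_sr))

def discriminator_adv_loss(disc_logits_hr, disc_logits_sr):
    """Eq. (11): real HR samples are labelled 1, super-resolved images 0."""
    real = F.binary_cross_entropy_with_logits(
        disc_logits_hr, torch.ones_like(disc_logits_hr))
    fake = F.binary_cross_entropy_with_logits(
        disc_logits_sr, torch.zeros_like(disc_logits_sr))
    return real + fake

# Toy usage with random tensors standing in for network outputs.
sr, hr = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
logits_sr, logits_hr = torch.randn(2, 1), torch.randn(2, 1)
print(charbonnier_loss(sr, hr).item(),
      generator_adv_loss(logits_sr).item(),
      discriminator_adv_loss(logits_hr, logits_sr).item())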


Table 4
An overview of datasets for lightweight SR; an improved version of the corresponding table in [2].
Dataset No. of images Avg. resolution Avg. pixels Format Usage Category keywords
BSDS100 [128] 100 – 154,401 JPG Train/Validation Animal, building, food, landscape, people, plant, etc.
BSDS300 [128] 300 (435, 367) 154,401 JPG Train/Validation Animal, building, food, landscape, people, plant, etc.
BSDS500 [129] 500 (432, 370) 154,401 JPG Train/Validation Animal, building, food, landscape, people, plant, etc.
DIV2K [19] 1000 (1972, 1437) 2,793,250 PNG Train/Validation Environment, flora, fauna, handmade object, people, scenery, etc.
General-100 [34] 100 (435, 381) 181, 108 BMP Train Animal, daily necessity, food, people, plant, texture, etc.
Manga109 [130] 109 (826, 1169) 966, 011 PNG Test Manga volume
L20 [131] 20 (3843, 2870) 11, 577, 492 PNG Test Animal, building, landscape, people, plant, etc.
OutdoorScene [132] 10 624 (553, 440) 249, 593 PNG Train/Test Animal, building, grass, mountain, plant, sky, water.
PIRM [133] 200 (617, 482) 292, 021 PNG Train/Validation Environments, flora, natural scenery, objects, people, etc.
Set5 [134] 5 (313, 336) 292, 021 PNG Test Baby, bird, butterfly, head, woman
Set14 [33] 14 (492, 446) 230, 203 PNG Test Humans, animals, insects, flowers, vegetables, comic, slides, etc.
T91 [32] 91 (264, 204) 58, 853 PNG Train Car, flower, fruit, human face, etc.
Urban100 [135] 100 (984, 797) 774, 314 PNG Test Architecture, city, structure, urban, etc.
Flickr1024 [21] 2 ∗ 1024 (762,480) 734,646 PNG Train/Validation Stereo images.

where 𝑔𝑎𝑛_𝑐𝑒_𝑔 and 𝑔𝑎𝑛_𝑐𝑒_𝑑 represent the adversarial loss of the gen- 5. Reviewed models comparison
erator and the discriminator 𝐷. Also, 𝑋𝑠 denotes ground truths images
randomly sampled and 𝑠 means random sample. 5.1. Model size analysis
The total variation (TV) loss [125] is introduced to generated
images noise suppression and is used in SR by Aly et al. [126]. The As shown in Table 5, we summarized some SR models based on the
definition for that loss is summing up the absolute differences between accuracy (i.e., PSNR), the size of the model (i.e., Num. of parameters),
neighboring pixels, and then it can find the value of noise in the images, and operations number (i.e., Multi-Adds). In this table, five benchmark
datasets Set5 [134], Set14 [33], BSD100 [128], Urban100 [135], and Manga109 [130] are used to measure the accuracy at the scales of 2 ×, 3 ×, and 4 ×. Note, all statistics are derived from the original

like the following:

\mathcal{L}_{TV}(\hat{X}) = \frac{1}{hwc} \sum_{i,j,k} \sqrt{(\hat{X}_{i,j+1,k} - \hat{X}_{i,j,k})^2 + (\hat{X}_{i+1,j,k} - \hat{X}_{i,j,k})^2}, (12)
papers. It is clear from the table that some attention-based models
Also, the TV loss is adopted by [118,127] for imposing spatial smooth- show good performance compared to other methods, but these methods
ness. mainly use large memory in comparison to other methods. In addition,
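A direct implementation of the total variation loss of Eq. (12) is shown below as a small sketch; averaging over all valid pixel positions follows the equation, while the boundary handling (simply cropping one row and one column of differences) is an assumption of this sketch.

import torch

def tv_loss(x):
    """Eq. (12): mean magnitude of horizontal and vertical differences
    between neighbouring pixels of a B x C x H x W tensor."""
    dh = x[:, :, 1:, :] - x[:, :, :-1, :]          # vertical neighbours
    dw = x[:, :, :, 1:] - x[:, :, :, :-1]          # horizontal neighbours
    return torch.sqrt(dh[:, :, :, :-1] ** 2 + dw[:, :, :-1, :] ** 2).mean()

print(tv_loss(torch.rand(1, 3, 32, 32)).item())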
In addition, there are other less common losses such as Multi-scale the distillation-based models have a lower number of parameters and
loss [58], Sparsity regularization loss, [59], adaptive joint loss [65], computational costs.
imitation loss, reconstruction loss, distillation loss [72], knowledge
distillation loss [73], mean absolute error (MAE) [99], weighted Huber 5.2. Visual and run time comparison
loss [51], and frequency mask loss [56] sometime used in lightweight
networks. We also tried to make a visual comparison among the different
methods, as shown in Fig. 8. In this section, the visual results are
4.3. Super-resolution datasets
obtained from the official code or by running the model using the
pre-trained model given by the authors. The figure shows that the
On the one hand, before 2017, there were few datasets, such as 91
distillation-based and attention-based methods clearly show better vi-
images [40] and Berkeley Segmentation Dataset (BSD) [36], are used in
sual comparison than the other methods. In addition, we also calculate
the SR task. These datasets are usually used for training the SR model.
the run time using a similar manner used in the AIM Challenge for a
On the other hand, after the DIV2K [19] dataset is introduced in the
fair comparison between the models, as shown in Fig. 9. It is clear from
NTIRE 2017 [136], most of the lightweight SR methods are trained
the figure that the distillation-based models have lower run time com-
using this dataset. In Table 4, we list the most recent datasets that
pared to the other methods. Also, it is clear that some attention-based
were used in the training and testing of SR models. These datasets
significantly differ in image amounts, quality, resolution, diversity, methods use more time than other methods.
etc. Among these datasets, some have LR-HR image pairs. In contrast,
others provide only the HR images, so to get the LR images, these 6. Conclusion and future directions
are generated by MATLAB imresize function with settings (i.e., bicubic
interpolation with anti-aliasing). This review is extensively showed the state-of-the-art methods used
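As a rough illustration of this degradation pipeline, the Python sketch below creates bicubic LR counterparts for a folder of HR images. The directory layout and file pattern are assumptions, and Pillow's bicubic resize only approximates MATLAB's imresize with anti-aliasing, so papers usually rely on MATLAB itself for exact comparability.

```python
from pathlib import Path
from PIL import Image

def generate_lr(hr_dir: str, lr_dir: str, scale: int = 4) -> None:
    """Create bicubic LR images for every HR PNG in hr_dir (approximate only)."""
    out = Path(lr_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(hr_dir).glob("*.png")):
        hr = Image.open(path).convert("RGB")
        # Crop so the HR size is an exact multiple of the scale factor.
        w, h = hr.size
        hr = hr.crop((0, 0, w - w % scale, h - h % scale))
        lr = hr.resize((hr.width // scale, hr.height // scale),
                       resample=Image.BICUBIC)
        lr.save(out / path.name)
```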
in lightweight image SR based on deep learning. We focus on the
4.4. Framework supervised SR models improvement. Because there are many unsolved
problems in the image super-resolution task, we will show some of
Before the modern deep learning framework, PyTorch [137] and these issues and indicate some promising future evolution trends. We
Tensorflow [138] were introduced, researchers were used traditional hope this survey will help image SR researchers contribute to future
frameworks like Caffe [31,34,35,37,38,42,44,51], MatConvNet [46,47] research and application developments.
to build their models. However, with introducing these two frameworks
PyTorch and Tensorflow, they were more widely used in SR task. Even 6.1. Network design
though Tensorflow [17,22,45,107,109] is available to the public before
the PyTorch, the PyTorch has gained wide popularity in many deep Good network design can help determine a hypothesis space and can
learning tasks. For the SR task, most of models [40,41,48,50,52,58,59] learn representations efficiently. This section will introduce ideas for
based on PyTorch as a framework. Tensorflow works on the concept the model enhancement. First, by considering a large receptive will lead
of the static graph, so it needs to define the computation graph of to more contextual information and generate accurate results. Second,
the model before running the model, but the PyTorch is based on due to the fact that shallow layers can work on low-level features and
the dynamic graph, which can dynamically defining/manipulating the deeper layers deal with higher-level features, so the combination be-
graph. tween them can significantly help the task of HR reconstruction. Third,

5. Reviewed models comparison

5.1. Model size analysis

As shown in Table 5, we summarize some representative SR models in terms of accuracy (i.e., PSNR), model size (i.e., number of parameters), and number of operations (i.e., Multi-Adds). In this table, five benchmark datasets, Set5 [134], Set14 [33], BSD100 [128], Urban100 [135], and Manga109 [130], are used to measure accuracy at scales of 2×, 3×, and 4×. Note that all statistics are taken from the original papers. It is clear from the table that some attention-based models show good performance compared to the other methods, but they mainly require more memory. In addition, the distillation-based models have fewer parameters and lower computational cost.

5.2. Visual and run time comparison

We also make a visual comparison among the different methods, as shown in Fig. 8. The visual results are obtained from the official code or by running the pre-trained models provided by the authors. The figure shows that the distillation-based and attention-based methods clearly produce better visual results than the other methods. In addition, we calculate the run time in a manner similar to that of the AIM Challenge for a fair comparison between the models, as shown in Fig. 9. It is clear from the figure that the distillation-based models have a lower run time than the other methods, while some attention-based methods take more time.
Fig. 8. The visualization results on the Urban100 dataset at 4×.

Fig. 9. The PSNR vs. Run time of different models.
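For readers reproducing the numbers in Table 5 and Fig. 9, the sketch below shows how the "Params" column and the Y-channel PSNR are commonly computed. It follows the usual SR evaluation convention (BT.601 luminance and a border crop equal to the scale factor) and is not the authors' exact evaluation script; individual papers may differ slightly in rounding or border handling.

```python
import numpy as np
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Number of trainable parameters (the 'Params' column in Table 5)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Luminance (Y) channel of an 8-bit RGB image, ITU-R BT.601 weights."""
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                   + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray, scale: int) -> float:
    """PSNR on the Y channel with a border crop of 'scale' pixels."""
    sr_y = rgb_to_y(sr.astype(np.float64))[scale:-scale, scale:-scale]
    hr_y = rgb_to_y(hr.astype(np.float64))[scale:-scale, scale:-scale]
    mse = np.mean((sr_y - hr_y) ** 2)
    return float(10.0 * np.log10(255.0 ** 2 / mse))
```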

6. Conclusion and future directions

This review extensively presented the state-of-the-art methods used in lightweight image SR based on deep learning, focusing on the improvement of supervised SR models. Because many problems in the image super-resolution task remain unsolved, we highlight some of these issues and indicate some promising future directions. We hope this survey will help image SR researchers contribute to future research and application developments.

6.1. Network design

A good network design helps determine the hypothesis space and allows representations to be learned efficiently. This section introduces several ideas for model enhancement. First, considering a large receptive field provides more contextual information and leads to more accurate results. Second, since shallow layers work on low-level features while deeper layers deal with higher-level features, combining them can significantly help the HR reconstruction task. Third, using context-specific attention can help the models produce realistic details. Finally, given the great success of graph neural networks, meta-learning, and reinforcement learning, these advanced methods could also be applied to SR in general and to lightweight SR in particular.

6.2. Learning strategies

Robust learning strategies are required to achieve satisfactory results, so loss function design and the sub-optimality of batch normalization (BN) are essential topics for future research. For the loss function, it is promising to impose constraints among the LR/HR/SR images, which can guide the optimization during training.


Table 5
The performance of representative lightweight SR models on five benchmark datasets at scales of 2×, 3×, and 4×, together with the number of parameters and operations.
Method Network design Scale Params Multi-Adds PSNR
Set5 Set14 BSD100 Urban100 Manga109
SRCNN Convolution 2 57k 52G 36.66 32.42 31.36 29.50 35.60
FSRCNN Convolution 2 12k 6.0G 37.00 32.63 31.53 29.88 36.67
VDSR Convolution 2 665k 612.6G 37.53 33.05 31.90 30.77 37.27
DRCN Convolution 2 1774k 17,974G 37.63 33.04 31.85 30.75 37.55
SPBP Convolution 2 629k 184G 37.95 33.54 32.15 31.89 –
DRRN Residual 2 297k 6796G 37.74 33.23 32.05 31.23 –
BTSRN Residual 2 410k 207G 37.75 33.20 32.05 31.63 –
LapSRN Residual 2 813k 29.9G 37.52 32.99 31.80 30.41 37.27
SelNet Residual 2 974k 225G 37.89 33.61 32.08 – –
CARN Residual 2 1592k 222G 37.76 33.52 32.09 31.92 38.36
LAMRN Residual 2 1390k 320G 38.09 33.87 32.22 32.31 39.14
MFIN Residual 2 704k 162G 38.05 33.67 32.22 32.38 –
MPRNet Residual 2 538k 32G 38.08 33.79 32.25 32.52 –
WMRN Residual 2 452k 103G 37.83 33.41 32.08 31.68 38.27
ELCRN Residual 2 535k 123G 37.99 33.52 32.16 32.06 –
OverNet Residual 2 900k 200G 38.11 33.71 32.24 32.44 –
SMSR Residual 2 985k 131G 38.00 33.64 32.17 32.19 38.76
FMEN Residual 2 748K 172G 38.10 33.75 32.26 32.41 38.95
HPUN Residual 2 714K 151G 38.09 33.79 32.25 32.37 39.07
RLFN Residual 2 527K 60.4G 38.07 33.72 32.22 32.33 –
ShuffleMixer Residual 2 394K 91G 38.01 33.63 32.17 31.89 38.83
GLADSR Dense 2 812k 187G 37.99 33.63 32.16 32.16 –
ESRN Dense 2 1014k 228G 38.04 33.71 32.23 32.37 –
IDN Feature distillation 2 553k 127G 37.83 33.30 32.08 31.27 38.01
IMDN Feature distillation 2 694k 158G 38.00 33.63 32.19 32.17 38.88
RFDN Feature distillation 2 534K 123.0G 38.05 33.68 32.16 32.12 38.88
AIDN Feature distillation 2 23k 19.21G 38.07 33.72 32.18 32.24 38.89
BSRN Feature distillation 2 332K 73G 38.10 33.74 32.24 32.34 39.14
MCAN Channel attention 2 1233k 191G 37.91 33.69 32.18 32.46 –
A²F Channel attention 2 1363k 306G 38.09 33.78 32.23 32.46 38.95
BSPAN Pyramid attention 2 1160K 187.8G 38.20 33.96 32.33 32.96 39.22
LKASR Transformer-based attention 2 947k 141G 38.17 33.84 32.31 32.69 39.12
CFIN Transformer-based attention 2 675K 116.9G 38.14 33.80 32.26 32.48 38.91
MADNet Other attention 2 878k 187G 37.94 33.46 32.10 32.46 –
MAFFSRN Other attention 2 790k 154G 38.07 33.59 32.23 32.38 –
LatticeNet Other attention 2 756k 169G 38.15 33.78 32.25 32.38 –
PAN Other attention 2 261K 70.5G 38.00 33.59 32.18 32.01 38.70
SESR Extremely lightweight 2 105k 24G 37.77 33.24 31.99 31.16 38.01
SRCNN Convolution 3 57K 52.7G 32.75 29.30 28.41 26.24 30.48
FSRCNN Convolution 3 12k 5.0G 33.16 29.43 28.53 26.43 31.10
VDSR Convolution 3 665K 612.6G 33.67 29.78 28.83 27.14 32.01
DRCN Convolution 3 1774K 9788.7G 33.82 29.76 28.80 27.14 32.24
DRRN Residual 3 297k 6796G 34.03 29.96 28.95 27.53 –
BTSRN Residual 3 410k 176.2G 34.03 29.90 28.97 27.75 –
SelNet Residual 3 1159K 120.0G 34.27 30.30 28.97 – –
CARN Residual 3 1592k 118.8G 34.29 30.29 29.06 28.06 33.50
LAMRN Residual 3 1411k 145.3G 34.55 30.41 29.17 28.43 33.96
MFIN Residual 3 713k 71.9G 34.44 30.34 29.12 28.29 –
MPRNet Residual 3 538k 32G 34.57 30.42 29.17 28.42 –
WMRN Residual 3 556K 57G 34.11 30.17 28.98 27.80 33.07
ELCRN Residual 3 543k 54.9G 34.33 30.32 29.07 28.06 –
OverNet Residual 3 900k 200G 34.49 30.47 29.17 28.37 –
SMSR Residual 3 985k 67.8G 34.40 30.33 29.10 28.25 33.68
FMEN Residual 3 757K 77.2G 34.45 30.40 29.17 28.33 33.86
HPUN Residual 3 723K 69.3G 34.63 30.52 29.22 28.49 34.12
ShuffleMixer Residual 3 415K 43G 34.40 30.37 29.12 28.08 33.69
GLADSR Dense 3 821K 88.2G 34.41 30.37 29.08 28.24 –
ESRN Dense 3 1014k 115.6G 34.46 30.43 29.15 28.42 –
IDN Feature distillation 3 553k 57.0G 34.11 29.99 28.95 27.42 32.71
IMDN Feature distillation 3 703K 71.5G 34.36 30.32 29.09 28.17 33.61
RFDN Feature distillation 3 541K 55.4G 34.41 30.34 29.09 28.21 33.67
AIDN Feature distillation 3 330K 19.21G 34.43 30.35 29.11 28.25 33.69
BSRN Feature distillation 3 340K 33.3G 34.46 30.47 29.18 28.39 34.05
MCAN Channel attention 3 1233K 95.4G 34.45 30.43 29.14 28.47 –
A²F Channel attention 3 1367k 136.3G 34.54 30.41 29.14 28.40 33.83
BSPAN Pyramid attention 3 1170K 97.9G 34.64 30.52 29.21 28.75 34.06
LKASR Transformer-based attention 3 947k 97.9G 34.64 30.55 29.20 28.55 34.11
CFIN Transformer-based attention 3 681K 53.5G 34.65 30.45 29.18 28.49 33.89
MADNet Other attention 3 930k 88.4G 34.26 30.29 29.04 27.91 –
MAFFSRN Other attention 3 807K 68.5G 34.45 30.40 29.13 28.26 –
LatticeNet Other attention 3 765K 76.3G 34.53 30.39 29.15 28.33 –
PAN Other attention 3 261K 39.0G 34.40 30.36 29.11 28.11 33.61
SRCNN Convolution 4 57K 52.7G 30.48 27.50 26.90 24.5 27.58
FSRCNN Convolution 4 12k 4.6G 30.71 27.59 26.98 24.62 27.90
VDSR Convolution 4 665K 612.6G 31.35 28.02 27.29 25.18 28.83
DRCN Convolution 4 1774K 9788.7G 31.53 28.02 27.23 25.18 28.93
DRRN Residual 4 297k 6796G 31.68 28.21 27.38 25.44 –
BTSRN Residual 4 410k 165.2G 31.85 28.20 27.47 25.74 –
LapSRN Residual 4 813k 149.4G 31.54 28.19 27.32 25.21 29.09
SelNet Residual 4 1417k 83.1G 32.00 28.49 27.44 – –
CARN Residual 4 1592k 90.9G 32.13 28.60 27.58 26.07 30.47
LAMRN Residual 4 1407k 85G 32.25 28.63 27.61 26.22 30.57
MFIN Residual 4 725k 41.8G 32.26 28.63 27.58 26.18 –
MPRNet Residual 4 538k 32G 32.38 28.69 27.63 26.31 –
WMRN Residual 4 536K 45.7G 32.00 28.47 27.49 25.89 30.11
ELCRN Residual 4 556k 32.07G 32.18 28.55 27.53 25.94 –
OverNet Residual 4 900k 200G 32.32 28.71 27.67 26.31 –
SMSR Residual 4 1006K 41.6G 32.12 28.55 27.55 26.11 30.54
FMEN Residual 4 769K 44.2G 32.24 28.70 27.63 26.28 30.70
HPUN Residual 4 734K 39.7G 32.40 28.80 27.70 26.38 31.00
RLFN Residual 4 543K 16.4G 32.24 28.62 27.60 26.17 –
ShuffleMixer Residual 4 411K 28G 32.21 28.66 27.61 26.08 30.65
GLADSR Dense 4 826K 52.6G 32.14 28.62 27.59 26.12 –
ESRN Dense 4 1014k 66.1G 32.26 28.63 27.62 26.24 –
IDN Feature distillation 4 553k 32.3G 31.82 28.25 27.41 25.41 29.41
IMDN Feature distillation 4 715K 40.9G 32.21 28.58 27.56 26.04 30.45
RFDN Feature distillation 4 550K 31.6G 32.24 28.61 27.57 26.11 30.58
AIDN Feature distillation 4 339K 20.27G 32.26 28.60 27.58 26.16 30.59
BSRN Feature distillation 4 352K 19.4G 32.35 28.73 27.65 26.27 30.84
MCAN Channel attention 4 1233K 83.1G 32.33 28.72 27.63 26.43 –
A²F Channel attention 4 1374K 77.2G 32.32 28.67 27.62 26.32 30.72
BSPAN Pyramid attention 4 1180K 55.55G 32.42 28.79 27.68 26.65 31.00
LKASR Transformer-based attention 4 1026k 62.6G 32.46 28.84 27.71 26.54 31.01
CFIN Transformer-based attention 4 699K 31.2G 32.49 28.74 27.68 26.39 30.73
MADNet Other attention 4 1002k 54.1G 32.11 28.52 27.52 25.89 –
MAFFSRN Other attention 4 830K 38.6G 32.20 28.62 27.59 26.16 –
LatticeNet Other attention 4 777K 43.6G 32.30 28.68 27.62 26.25 –
PAN Other attention 4 272K 28.2G 32.13 28.61 27.59 26.11 30.51
SESR Extremely lightweight 4 114.97K 6.62G 31.54 28.12 27.31 25.31 29.04

Also, since BN has been shown to be sub-optimal for SR [120,139,140], it is necessary to find more effective normalization techniques.
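This concern is typically reflected directly in the block design. As a hedged illustration, the EDSR-style residual block below [120] simply omits BN and instead uses residual scaling; the 0.1 factor is a commonly used value, not one prescribed by this survey.

```python
import torch.nn as nn

class ResidualBlockNoBN(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv with no batch normalization."""
    def __init__(self, channels: int = 64, res_scale: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Residual scaling is often used instead of BN to stabilize training.
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.body(x)
```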
6.3. Evaluation metrics

Finding the right evaluation metrics is a critical part of machine learning; without accurate measurements of performance, researchers will struggle to verify any improvements they make. Therefore, discovering suitable metrics for the SR task is a challenge that needs to be addressed, and further exploration is necessary.

6.4. Towards real-world scenarios

Image SR still has limitations in real-world problems, such as unknown degradations or the lack of LR-HR paired images in datasets. This section discusses some expectations for lightweight real-world image SR scenarios. For example, real-world images suffer from degradations such as blurring, additive noise, and compression artifacts, and networks trained on manually constructed datasets usually do not work well on real-world scenes. Many researchers have tried to handle this issue [127,141–143]; however, the developed methods still have problems such as training difficulty and over-idealized assumptions, and these problems are essential to solve. Finally, domain-specific applications are another aspect, because SR can work on domain-specific data or help other vision tasks. This ability allows SR models to be used in more specific domains, such as object tracking, video surveillance, medical imaging, and scene rendering.

CRediT authorship contribution statement

Garas Gendy: Writing – original draft, Data curation, Methodology. Guanghui He: Supervision, Methodology, Writing – review & editing. Nabil Sabor: Supervision, Conceptualization, Writing – review & editing.

Data availability

No data was used for the research described in the article.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant 2019YFB2204500, and in part by the National Natural Science Foundation of China under Grant 62074097.

References

[1] S. Anwar, S. Khan, N. Barnes, A deep journey into super-resolution: A survey, ACM Comput. Surv. 53 (3) (2020) 1–34.
[2] Z. Wang, J. Chen, S.C. Hoi, Deep learning for image super-resolution: A survey, IEEE Trans. Pattern Anal. Mach. Intell. 43 (10) (2020) 3365–3387.
[3] J. Jiang, C. Wang, X. Liu, J. Ma, Deep learning-based face super-resolution: A survey, ACM Comput. Surv. 55 (1) (2021) 1–36.
[4] M. Zhang, X. Sun, Q. Zhu, G. Zheng, A survey of hyperspectral image super-resolution technology, in: 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, IEEE, 2021, pp. 4476–4479.
[5] H. Chen, X. He, L. Qing, Y. Wu, C. Ren, R.E. Sheriff, C. Zhu, Real-world single image super-resolution: A brief review, Inf. Fusion 79 (2022) 124–145.
[6] H. Liu, Z. Ruan, P. Zhao, C. Dong, F. Shang, Y. Liu, L. Yang, Video super resolution based on deep learning: A comprehensive survey, 2020, arXiv preprint arXiv:2007.12928.


[7] F. Salvetti, V. Mazzia, A. Khaliq, M. Chiaberge, Multi-image super resolution of [32] J. Yang, J. Wright, T.S. Huang, Y. Ma, Image super-resolution via sparse
remotely sensed images using residual attention deep neural networks, Remote representation, IEEE Trans. Image Process. 19 (11) (2010) 2861–2873.
Sens. 12 (14) (2020) 2207. [33] R. Zeyde, M. Elad, M. Protter, On single image scale-up using sparse-
[8] G. Gendy, H. Mohammed, N. Sabor, G. He, A deep pyramid attention network representations, in: International Conference on Curves and Surfaces, Springer,
for single image super-resolution, in: 2021 9th International Japan-Africa 2010, pp. 711–730.
Conference on Electronics, Communications, and Computations (JAC-ECC), [34] C. Dong, C.C. Loy, X. Tang, Accelerating the super-resolution convolutional
IEEE, 2021, pp. 14–19. neural network, in: European Conference on Computer Vision, Springer, 2016,
[9] Y. Chen, L. Liu, V. Phonevilay, K. Gu, R. Xia, J. Xie, Q. Zhang, K. Yang, Image pp. 391–407.
super-resolution reconstruction based on feature map attention mechanism, [35] J. Kim, J.K. Lee, K.M. Lee, Accurate image super-resolution using very deep
Appl. Intell. 51 (7) (2021) 4367–4380. convolutional networks, in: Proceedings of the IEEE Conference on Computer
[10] K. Zeng, S. Ding, W. Jia, Single image super-resolution using a polymorphic Vision and Pattern Recognition, 2016, pp. 1646–1654.
parallel CNN, Appl. Intell. 49 (1) (2019) 292–300. [36] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing
[11] Y. Jo, S.W. Oh, J. Kang, S.J. Kim, Deep video super-resolution network using human-level performance on imagenet classification, in: Proceedings of the IEEE
dynamic upsampling filters without explicit motion compensation, in: Proceed- International Conference on Computer Vision, 2015, pp. 1026–1034.
ings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, [37] J. Kim, J.K. Lee, K.M. Lee, Deeply-recursive convolutional network for image
pp. 3224–3232. super-resolution, in: Proceedings of the IEEE Conference on Computer Vision
[12] X. Wang, K.C. Chan, K. Yu, C. Dong, C. Change Loy, Edvr: Video restoration and Pattern Recognition, 2016, pp. 1637–1645.
with enhanced deformable convolutional networks, in: Proceedings of the [38] H. Ren, M. El-Khamy, J. Lee, Image super resolution based on fusing multiple
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, convolution neural networks, in: Proceedings of the IEEE Conference on
2019. Computer Vision and Pattern Recognition Workshops, 2017, pp. 54–61.
[13] S. Li, F. He, B. Du, L. Zhang, Y. Xu, D. Tao, Fast spatio-temporal residual [39] https://github.com/openimages/dataset.
network for video super-resolution, in: Proceedings of the IEEE/CVF Conference [40] L. Zhang, P. Wang, C. Shen, L. Liu, W. Wei, Y. Zhang, A. Van Den Hengel,
on Computer Vision and Pattern Recognition, 2019, pp. 10522–10531. Adaptive importance learning for improving lightweight image super-resolution
[14] O. Thawakar, P.W. Patil, A. Dudhane, S. Murala, U. Kulkarni, Image and network, Int. J. Comput. Vis. 128 (2) (2020) 479–499.
video super resolution using recurrent generative adversarial network, in: 2019 [41] S. Banerjee, C. Ozcinar, A. Rana, A. Smolic, M. Manzke, Sub-pixel back-
16th IEEE International Conference on Advanced Video and Signal Based projection network for lightweight single image super-resolution, 2020, arXiv
Surveillance, AVSS, IEEE, 2019, pp. 1–8. preprint arXiv:2008.01116.
[15] E. Smith, S. Fujimoto, D. Meger, Multi-view silhouette and depth decomposition
[42] Z. Gu, L. Chen, Y. Zheng, T. Wang, T. Li, Fusion diversion network for fast,
for high resolution 3d object representation, Adv. Neural Inf. Process. Syst. 31
accurate and lightweight single image super-resolution, Signal Image Video
(2018).
Process. 15 (6) (2021) 1351–1359.
[16] K. Zhang, M. Danelljan, Y. Li, R. Timofte, J. Liu, J. Tang, G. Wu, Y. Zhu, X.
[43] Y. Nie, K. Han, Z. Liu, A. Xiao, Y. Deng, C. Xu, Y. Wang, Ghostsr: Learning
He, W. Xu, et al., AIM 2020 challenge on efficient super-resolution: Methods
ghost features for efficient image super-resolution, 2021, arXiv preprint arXiv:
and results, in: European Conference on Computer Vision, Springer, 2020, pp.
2101.08525.
5–40.
[44] Y. Tai, J. Yang, X. Liu, Image super-resolution via deep recursive residual
[17] A. Liu, Y. Liu, J. Gu, Y. Qiao, C. Dong, Blind image super-resolution: A survey
network, in: Proceedings of the IEEE Conference on Computer Vision and
and beyond, IEEE Trans. Pattern Anal. Mach. Intell. (2022).
Pattern Recognition, 2017, pp. 3147–3155.
[18] Z. Hui, X. Gao, Y. Yang, X. Wang, Lightweight image super-resolution
[45] Y. Fan, H. Shi, J. Yu, D. Liu, W. Han, H. Yu, Z. Wang, X. Wang, T.S. Huang, Bal-
with information multi-distillation network, in: Proceedings of the 27th Acm
anced two-stage residual networks for image super-resolution, in: Proceedings of
International Conference on Multimedia, 2019, pp. 2024–2032.
the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
[19] E. Agustsson, R. Timofte, Ntire 2017 challenge on single image super-resolution:
2017, pp. 161–168.
Dataset and study, in: Proceedings of the IEEE Conference on Computer Vision
[46] W.-S. Lai, J.-B. Huang, N. Ahuja, M.-H. Yang, Deep laplacian pyramid networks
and Pattern Recognition Workshops, 2017, pp. 126–135.
for fast and accurate super-resolution, in: Proceedings of the IEEE Conference
[20] J. Liu, J. Tang, G. Wu, Residual feature distillation network for lightweight
on Computer Vision and Pattern Recognition, 2017, pp. 624–632.
image super-resolution, in: European Conference on Computer Vision, Springer,
[47] J.-S. Choi, M. Kim, A deep convolutional neural network with selection units for
2020, pp. 41–55.
super-resolution, in: Proceedings of the IEEE Conference on Computer Vision
[21] Y. Wang, L. Wang, J. Yang, W. An, Y. Guo, Flickr1024: A large-scale dataset
and Pattern Recognition Workshops, 2017, pp. 154–160.
for stereo image super-resolution, in: Proceedings of the IEEE/CVF International
[48] N. Ahn, B. Kang, K.-A. Sohn, Fast, accurate, and lightweight super-resolution
Conference on Computer Vision Workshops, 2019.
with cascading residual network, in: Proceedings of the European Conference
[22] M. Ayazoglu, Extremely lightweight quantization robust real-time single-image
on Computer Vision, ECCV, 2018, pp. 252–268.
super resolution for mobile devices, in: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2021, pp. 2472–2479. [49] F. Zhu, Q. Zhao, Efficient single image super-resolution via hybrid residual
[23] F. Kong, M. Li, S. Liu, D. Liu, J. He, Y. Bai, F. Chen, L. Fu, Residual local feature learning with compact back-projection network, in: Proceedings of the
feature network for efficient super-resolution, in: Proceedings of the IEEE/CVF IEEE/CVF International Conference on Computer Vision Workshops, 2019.
Conference on Computer Vision and Pattern Recognition, 2022, pp. 766–776. [50] J. Li, Y. Yuan, K. Mei, F. Fang, Lightweight and accurate recursive fractal net-
[24] S. Gu, M. Danelljan, R. Timofte, M. Haris, K. Akita, G. Shakhnarovic, N. work for image super-resolution, in: Proceedings of the IEEE/CVF International
Ukita, P.N. Michelini, W. Chen, H. Liu, et al., Aim 2019 challenge on image Conference on Computer Vision Workshops, 2019.
extreme super-resolution: Methods and results, in: 2019 IEEE/CVF International [51] Z. He, Y. Cao, L. Du, B. Xu, J. Yang, Y. Cao, S. Tang, Y. Zhuang, Mrfn: Multi-
Conference on Computer Vision Workshop, ICCVW, IEEE, 2019, pp. 3556–3564. receptive-field network for fast and accurate single image super-resolution, IEEE
[25] Y. Li, K. Zhang, R. Timofte, L. Van Gool, F. Kong, M. Li, S. Liu, Z. Du, D. Liu, Trans. Multimed. 22 (4) (2019) 1042–1054.
C. Zhou, et al., Ntire 2022 challenge on efficient super-resolution: Methods and [52] Z. He, K. Liu, Z. Liu, Q. Dou, X. Yang, A lightweight multi-scale feature
results, in: Proceedings of the IEEE/CVF Conference on Computer Vision and integration network for real-time single image super-resolution, J. Real-Time
Pattern Recognition, 2022, pp. 1062–1102. Image Process. 18 (4) (2021) 1221–1234.
[26] A. Ignatov, R. Timofte, M. Denna, A. Younes, et al., Efficient and accurate [53] L. Sun, Z. Liu, X. Sun, L. Liu, R. Lan, X. Luo, Lightweight image super-resolution
quantized image super-resolution on mobile npus, mobile AI & AIM 2022 via weighted multi-scale residual network, IEEE/CAA J. Autom. Sin. 8 (7)
challenge: Report, in: Proceedings of the European Conference on Computer (2021) 1271–1280.
Vision (ECCV) Workshops, Vol. 2, 2022. [54] H. Yang, Q. Dou, K. Liu, Z. Liu, R. Francese, X. Yang, Efficient local cascading
[27] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: residual network for real-time single image super-resolution, J. Real-Time Image
from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) Process. 18 (4) (2021) 1235–1246.
(2004) 600–612. [55] D. Song, Y. Wang, H. Chen, C. Xu, C. Xu, D. Tao, Addersr: Towards energy
[28] H.R. Sheikh, A.C. Bovik, G. De Veciana, An information fidelity criterion for efficient image super-resolution, in: Proceedings of the IEEE/CVF Conference
image quality assessment using natural scene statistics, IEEE Trans. Image on Computer Vision and Pattern Recognition, 2021, pp. 15648–15657.
Process. 14 (12) (2005) 2117–2128. [56] W. Xie, D. Song, C. Xu, C. Xu, H. Zhang, Y. Wang, Learning frequency-aware
[29] R. Zhang, P. Isola, A.A. Efros, E. Shechtman, O. Wang, The unreasonable dynamic network for efficient super-resolution, in: Proceedings of the IEEE/CVF
effectiveness of deep features as a perceptual metric, in: Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 4308–4317.
Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595. [57] Y. Zhang, H. Wang, C. Qin, Y. Fu, Aligned structured sparsity learning for
[30] C.-Y. Yang, C. Ma, M.-H. Yang, Single-image super-resolution: A benchmark, in: efficient image super-resolution, Adv. Neural Inf. Process. Syst. 34 (2021).
European Conference on Computer Vision, Springer, 2014, pp. 372–386. [58] P. Behjati, P. Rodriguez, A. Mehri, I. Hupont, C.F. Tena, J. Gonzalez, Overnet:
[31] C. Dong, C.C. Loy, K. He, X. Tang, Image super-resolution using deep con- Lightweight multi-scale super-resolution with overscaling network, in: Proceed-
volutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2) (2015) ings of the IEEE/CVF Winter Conference on Applications of Computer Vision,
295–307. 2021, pp. 2694–2703.


[59] L. Wang, X. Dong, Y. Wang, X. Ying, Z. Lin, W. An, Y. Guo, Exploring sparsity in [89] H. Wu, J. Gui, J. Zhang, J.T. Kwok, Z. Wei, Pyramidal dense attention networks
image super-resolution for efficient inference, in: Proceedings of the IEEE/CVF for lightweight image super-resolution, 2021, arXiv preprint arXiv:2106.06996.
Conference on Computer Vision and Pattern Recognition, 2021, pp. 4917–4926. [90] H. Wu, J. Gui, J. Zhang, J.T. Kwok, Z. Wei, Feedback pyramid attention
[60] L. Wang, D. Li, L. Tian, Y. Shan, Efficient image super-resolution with collapsi- networks for single image super-resolution, 2021, arXiv preprint arXiv:2106.
ble linear blocks, in: Proceedings of the IEEE/CVF Conference on Computer 06966.
Vision and Pattern Recognition, 2022, pp. 817–823. [91] G. Gendy, N. Sabor, J. Hou, G. He, Balanced spatial feature distillation and pyra-
[61] Z. Du, D. Liu, J. Liu, J. Tang, G. Wu, L. Fu, Fast and memory-efficient network mid attention network for lightweight image super-resolution, Neurocomputing
towards efficient image super-resolution, in: Proceedings of the IEEE/CVF 508 (2022) 157–166.
Conference on Computer Vision and Pattern Recognition, 2022, pp. 853–862. [92] Z. Lu, H. Liu, J. Li, L. Zhang, Efficient transformer for single image
[62] B. Sun, Y. Zhang, S. Jiang, Y. Fu, Hybrid pixel-unshuffled network for super-resolution, 2021, arXiv preprint arXiv:2108.11084.
lightweight image super-resolution, 2022, arXiv preprint arXiv:2203.08921. [93] J. Fang, H. Lin, X. Chen, K. Zeng, A hybrid network of CNN and transformer for
[63] L. Sun, J. Pan, J. Tang, ShuffleMixer: An efficient ConvNet for image lightweight image super-resolution, in: Proceedings of the IEEE/CVF Conference
super-resolution, 2022, arXiv preprint arXiv:2205.15175. on Computer Vision and Pattern Recognition, 2022, pp. 1103–1112.
[64] X. Zhang, P. Gao, S. Liu, K. Zhao, G. Li, L. Yin, C.W. Chen, Accurate and [94] H. Feng, L. Wang, Y. Li, A. Du, LKASR: Large kernel attention for lightweight
efficient image super-resolution via global-local adjusting dense network, IEEE image super-resolution, Knowl.-Based Syst. 252 (2022) 109376.
Trans. Multimed. 23 (2020) 1924–1937. [95] W. Zou, T. Ye, W. Zheng, Y. Zhang, L. Chen, Y. Wu, Self-calibrated efficient
[65] D. Song, C. Xu, X. Jia, Y. Chen, C. Xu, Y. Wang, Efficient residual dense block transformer for lightweight super-resolution, in: Proceedings of the IEEE/CVF
search for image super-resolution, in: Proceedings of the AAAI Conference on Conference on Computer Vision and Pattern Recognition, 2022, pp. 930–939.
Artificial Intelligence, Vol. 34, 2020, pp. 12007–12014. [96] W. Li, J. Li, G. Gao, J. Zhou, J. Yang, G.-J. Qi, Cross-receptive focused
[66] Z. Hui, X. Wang, X. Gao, Fast and accurate single image super-resolution via inference network for lightweight image super-resolution, 2022, arXiv preprint
information distillation network, in: Proceedings of the IEEE Conference on arXiv:2207.02796.
Computer Vision and Pattern Recognition, 2018, pp. 723–731. [97] R. Lan, L. Sun, Z. Liu, H. Lu, C. Pang, X. Luo, Madnet: A fast and lightweight
[67] Y. Li, J. Cao, Z. Li, S. Oh, N. Komuro, Lightweight single image super-resolution network for single-image super resolution, IEEE Trans. Cybern. 51 (3) (2020)
with dense connection distillation network, ACM Trans. Multimed. Comput. 1443–1453.
Commun. Appl. (TOMM) 17 (1s) (2021) 1–17. [98] A. Muqeet, J. Hwang, S. Yang, J. Kang, Y. Kim, S.-H. Bae, Multi-attention based
[68] Z. Zong, L. Zha, J. Jiang, X. Liu, Asymmetric information distillation network ultra lightweight image super-resolution, in: European Conference on Computer
for lightweight super resolution, in: Proceedings of the IEEE/CVF Conference Vision, Springer, 2020, pp. 103–118.
on Computer Vision and Pattern Recognition, 2022, pp. 1249–1258. [99] X. Luo, Y. Xie, Y. Zhang, Y. Qu, C. Li, Y. Fu, Latticenet: Towards lightweight
[69] Z. Li, Y. Liu, X. Chen, H. Cai, J. Gu, Y. Qiao, C. Dong, Blueprint separable image super-resolution with lattice block, in: European Conference on Computer
residual network for efficient image super-resolution, in: Proceedings of the Vision, Springer, 2020, pp. 272–289.
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. [100] H. Zhao, X. Kong, J. He, Y. Qiao, C. Dong, Efficient image super-resolution
833–843. using pixel attention, in: European Conference on Computer Vision, Springer,
[70] G. Gao, W. Li, J. Li, F. Wu, H. Lu, Y. Yu, Feature distillation interaction 2020, pp. 56–72.
weighting network for lightweight image super-resolution, in: Proceedings of [101] P. Behjati, P. Rodriguez, A. Mehri, I. Hupont, C.F. Tena, J. Gonzalez, Hierar-
the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 661–669. chical residual attention network for single image super-resolution, 2020, arXiv
[71] X. Yang, Y. Guo, Z. Li, D. Zhou, T. Li, MRDN: A lightweight multi-stage preprint arXiv:2012.04578.
residual distillation network for image super-resolution, Expert Syst. Appl. [102] A. Mehri, P.B. Ardakani, A.D. Sappa, MPRNet: Multi-path residual network for
(2022) 117594. lightweight image super resolution, in: Proceedings of the IEEE/CVF Winter
[72] W. Lee, J. Lee, D. Kim, B. Ham, Learning with privileged information for Conference on Applications of Computer Vision, 2021, pp. 2704–2713.
efficient image super-resolution, in: European Conference on Computer Vision, [103] X. Zhu, K. Guo, S. Ren, B. Hu, M. Hu, H. Fang, Lightweight image super-
Springer, 2020, pp. 465–482. resolution with expectation-maximization attention mechanism, IEEE Trans.
[73] Y. Zhang, H. Chen, X. Chen, Y. Deng, C. Xu, Y. Wang, Data-free knowledge dis- Circuits Syst. Video Technol. 32 (3) (2021).
tillation for image super-resolution, in: Proceedings of the IEEE/CVF Conference [104] K. Park, J.W. Soh, N.I. Cho, Dynamic residual self-attention network for
on Computer Vision and Pattern Recognition, 2021, pp. 7852–7861. lightweight single image super-resolution, IEEE Trans. Multimed. (2021).
[74] Y. Wang, S. Lin, Y. Qu, H. Wu, Z. Zhang, Y. Xie, A. Yao, Towards compact single [105] J. Qin, F. Liu, K. Liu, G. Jeon, X. Yang, Lightweight hierarchical residual feature
image super-resolution via contrastive self-distillation, 2021, arXiv preprint fusion network for single-image super-resolution, Neurocomputing 478 (2022)
arXiv:2105.11683. 104–123.
[75] D. Fuoli, L. Van Gool, R. Timofte, Fourier space losses for efficient percep- [106] B. Li, B. Wang, J. Liu, Z. Qi, Y. Shi, S-lwsr: Super lightweight super-resolution
tual image super-resolution, in: Proceedings of the IEEE/CVF International network, IEEE Trans. Image Process. 29 (2020) 8368–8380.
Conference on Computer Vision, 2021, pp. 2360–2369. [107] K. Bhardwaj, M. Milosavljevic, A. Chalfin, N. Suda, L. O’Neil, D. Gope, L. Meng,
[76] K. Wu, C.-K. Lee, K. Ma, MemSR: Training memory-efficient lightweight model R. Matas, D. Loh, Collapsible linear blocks for super-efficient super resolution,
for image super-resolution, in: International Conference on Machine Learning, 2021, arXiv preprint arXiv:2103.09404.
PMLR, 2022, pp. 24076–24092. [108] X. Liu, Y. Li, J. Fromm, Y. Wang, Z. Jiang, A. Mariakakis, S. Patel, SplitSR: An
[77] Z. Hui, X. Gao, X. Wang, Lightweight image super-resolution with feature end-to-end approach to super-resolution on mobile devices, Proc. ACM Interact.
enhancement residual network, Neurocomputing 404 (2020) 50–60. Mob. Wearable Ubiquitous Technol. 5 (1) (2021) 1–20.
[78] A. Yang, B. Yang, Z. Ji, Y. Pang, L. Shao, Lightweight group convolutional [109] Z. Du, J. Liu, J. Tang, G. Wu, Anchor-based plain net for mobile image super-
network for single image super-resolution, Inform. Sci. 516 (2020) 220–233. resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision
[79] H. Ma, X. Chu, B. Zhang, Accurate and efficient single image super-resolution and Pattern Recognition, 2021, pp. 2494–2502.
with matrix channel attention network, in: Proceedings of the Asian Conference [110] G. Gendy, N. Sabor, J. Hou, G. He, Real-time channel mixing net for mobile im-
on Computer Vision, 2020. age super-resolution, in: Proceedings of the European Conference on Computer
[80] X. Wang, Q. Wang, Y. Zhao, J. Yan, L. Fan, L. Chen, Lightweight single- Vision (ECCV) Workshops, 2022.
image super-resolution network with attentive auxiliary feature learning, in: [111] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale
Proceedings of the Asian Conference on Computer Vision, 2020. image recognition, in: Proceedings of the International Conference on Learning
[81] R. Wen, Z. Yang, T. Chen, H. Li, K. Li, Progressive representation recalibration Representations, 2015.
for lightweight super-resolution, Neurocomputing 504 (2022) 240–250. [112] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition,
[82] J.W. Soh, N.I. Cho, Lightweight single image super-resolution with multi-scale in: Proceedings of the IEEE Conference on Computer Vision and Pattern
spatial attention networks, IEEE Access 8 (2020) 35383–35391. Recognition, 2016, pp. 770–778.
[83] Y. Hang, Q. Liao, W. Yang, Y. Chen, J. Zhou, Attention cube network for [113] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected
image restoration, in: Proceedings of the 28th ACM International Conference convolutional networks, in: Proceedings of the IEEE Conference on Computer
on Multimedia, 2020, pp. 2562–2570. Vision and Pattern Recognition, 2017, pp. 4700–4708.
[84] Y. Yan, X. Xu, W. Chen, X. Peng, Lightweight attended multi-scale residual [114] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
network for single image super-resolution, IEEE Access 9 (2021) 52202–52212. A.C. Courville, Y. Bengio, Generative adversarial nets, in: NIPS, 2014.
[85] J. Wan, H. Yin, Z. Liu, A. Chong, Y. Liu, Lightweight image super-resolution [115] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of
by multi-scale aggregation, IEEE Trans. Broadcast. 67 (2) (2020) 372–382. the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp.
[86] H. Liu, F. Cao, C. Wen, Q. Zhang, Lightweight multi-scale residual networks 7132–7141.
with attention for image super-resolution, Knowl.-Based Syst. 203 (2020) [116] Y. Ioannou, D. Robertson, R. Cipolla, A. Criminisi, Deep roots: Improving cnn
106103. efficiency with hierarchical filter groups, in: Proceedings of the IEEE Conference
[87] W. Li, J. Li, J. Li, Z. Huang, D. Zhou, A lightweight multi-scale channel attention on Computer Vision and Pattern Recognition, 2017, pp. 1231–1240.
network for image super-resolution, Neurocomputing 456 (2021) 327–337. [117] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer
[88] S. Pang, Z. Chen, F. Yin, Lightweight multi-scale aggregated residual attention and super-resolution, in: European Conference on Computer Vision, Springer,
networks for image super-resolution, Multimedia Tools Appl. (2021) 1–23. 2016, pp. 694–711.


[118] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, [131] R. Timofte, R. Rothe, L. Van Gool, Seven ways to improve example-based single
A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution image super resolution, in: Proceedings of the IEEE Conference on Computer
using a generative adversarial network, in: Proceedings of the IEEE Conference Vision and Pattern Recognition, 2016, pp. 1865–1873.
on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690. [132] X. Wang, K. Yu, C. Dong, C.C. Loy, Recovering realistic texture in image
[119] A. Bruhn, J. Weickert, C. Schnörr, Lucas/Kanade meets Horn/Schunck: Com- super-resolution by deep spatial feature transform, in: Proceedings of the IEEE
bining local and global optic flow methods, Int. J. Comput. Vis. 61 (3) (2005) Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
211–231. [133] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, L. Zelnik-Manor, The 2018 pirm
[120] B. Lim, S. Son, H. Kim, S. Nah, K. Mu Lee, Enhanced deep residual networks challenge on perceptual image super-resolution, in: Proceedings of the European
for single image super-resolution, in: Proceedings of the IEEE Conference on Conference on Computer Vision (ECCV) Workshops, 2018.
Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144. [134] M. Bevilacqua, A. Roumy, C. Guillemot, M.L. Alberi-Morel, Low-complexity
[121] H. Zhao, O. Gallo, I. Frosio, J. Kautz, Loss functions for image restoration with single-image super-resolution based on nonnegative neighbor embedding, in:
neural networks, IEEE Trans. Comput. Imaging 3 (1) (2016) 47–57. Proceedings of the British Machine Vision Conference, 2012, pp. 135.1–135.10.
[122] M.S. Sajjadi, B. Scholkopf, M. Hirsch, Enhancenet: Single image super-resolution [135] J.-B. Huang, A. Singh, N. Ahuja, Single image super-resolution from transformed
through automated texture synthesis, in: Proceedings of the IEEE International self-exemplars, in: Proceedings of the IEEE Conference on Computer Vision and
Conference on Computer Vision, 2017, pp. 4491–4500. Pattern Recognition, 2015, pp. 5197–5206.
[123] Z. Wang, E.P. Simoncelli, A.C. Bovik, Multiscale structural similarity for image [136] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, L. Zhang, Ntire 2017
quality assessment, in: The Thrity-Seventh Asilomar Conference on Signals, challenge on single image super-resolution: Methods and results, in: Proceedings
Systems & Computers, 2003, Vol. 2, Ieee, 2003, pp. 1398–1402. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
[124] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, 2017, pp. 114–125.
A. Courville, Y. Bengio, Generative adversarial nets, Adv. Neural Inf. Process. [137] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A.
Syst. 27 (2014). Desmaison, L. Antiga, A. Lerer, Automatic differentiation in pytorch, 2017.
[125] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image transla- [138] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S.
tion using cycle-consistent adversarial networks, in: Proceedings of the IEEE Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale
International Conference on Computer Vision, 2017, pp. 2223–2232. machine learning, in: 12th {USENIX} Symposium on Operating Systems Design
[126] H.A. Aly, E. Dubois, Image up-sampling using total-variation regularization and Implementation ({OSDI} 16), 2016, pp. 265–283.
with a new observation model, IEEE Trans. Image Process. 14 (10) (2005) [139] Y. Wang, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung,
1647–1659. C. Schroers, A fully progressive approach to single-image super-resolution,
[127] Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, L. Lin, Unsupervised image super- in: Proceedings of the IEEE Conference on Computer Vision and Pattern
resolution using cycle-in-cycle generative adversarial networks, in: Proceedings Recognition Workshops, 2018, pp. 864–873.
of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, [140] R. Chen, Y. Qu, K. Zeng, J. Guo, C. Li, Y. Xie, Persistent memory residual net-
2018, pp. 701–710. work for single image super resolution, in: Proceedings of the IEEE Conference
[128] D. Martin, C. Fowlkes, D. Tal, J. Malik, A database of human segmented natural on Computer Vision and Pattern Recognition Workshops, 2018, pp. 809–816.
images and its application to evaluating segmentation algorithms and measuring [141] K. Zhang, W. Zuo, L. Zhang, Learning a single convolutional super-resolution
ecological statistics, in: Proceedings Eighth IEEE International Conference on network for multiple degradations, in: Proceedings of the IEEE Conference on
Computer Vision. ICCV 2001, Vol. 2, IEEE, 2001, pp. 416–423. Computer Vision and Pattern Recognition, 2018, pp. 3262–3271.
[129] P. Arbelaez, M. Maire, C. Fowlkes, J. Malik, Contour detection and hierarchical [142] Y. Bei, A. Damian, S. Hu, S. Menon, N. Ravi, C. Rudin, New techniques for
image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (5) (2010) preserving global structure and denoising with low information loss in single-
898–916. image super-resolution, in: Proceedings of the IEEE Conference on Computer
[130] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, K. Aizawa, Vision and Pattern Recognition Workshops, 2018, pp. 874–881.
Sketch-based manga retrieval using manga109 dataset, Multimedia Tools Appl. [143] A. Bulat, G. Tzimiropoulos, Super-fan: Integrated facial landmark localization
76 (20) (2017) 21811–21838. and super-resolution of real-world low resolution faces in arbitrary poses with
gans, in: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2018, pp. 109–117.
