
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 204 (2022) 907–913

www.elsevier.com/locate/procedia

International Conference on Industry Sciences and Computer Science Innovation
SwinIR Transformer Applied for Medical Image Super-Resolution
Muralikrishna Puttagunta a, Ravi Subban b,∗, Nelson Kennedy Babu C c

a Dept of Computer Science, School of Engineering and Technology, Pondicherry University, India
b Dept of Computer Science, School of Engineering and Technology, Pondicherry University, India
c Computer Science and Engineering, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai, India

Abstract

Super-resolution refers to artificially enhancing the resolution of a low-resolution (LR) image to obtain a high-resolution (HR) image, and it is an effective technique in image processing and computer vision. Depending on the modality used in medical image processing, various factors may affect the spatial resolution of an image. Increasing the resolution of medical images by super-resolution is essential for a more precise comprehension of the anatomy. Recently, several papers have shown that deep learning can be applied successfully, producing state-of-the-art results for various practical medical image applications. This paper evaluates single-image SR architectures, including SRGAN, BSRGAN, Real-ESRGAN, and SwinIR, on medical images. SwinIR considerably improved the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) compared to the other architectures.
© 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the International Conference on Industry Sciences and Computer Sciences Innovation.
Keywords: Image super-resolution; deep learning; Swin Transformer

1. Introduction

The term "image super-resolution" refers to the process of reconstructing a high-resolution (HR) image from its low-resolution (LR) counterpart. In recent years, super-resolution (SR) image reconstruction has been an essential topic of study in the domains of digital image processing and computer vision, owing to its ability to overcome the inherent resolution limitations of low-cost image sensors [1]. SR has many practical applications, including upgrading blurry and noisy images and videos to high definition (HD), robust pattern recognition, and small-object identification. A better-quality image obtained with SR yields greater precision in medical imaging analysis for correct disease identification [2]. The spatial resolution of medical images is often inadequate due to restrictions such as image acquisition time, hardware limits, or low irradiation dosage. Numerous super-resolution strategies, such as optimization- and learning-based approaches, have been developed to address these challenges.

∗ Corresponding author. Tel.: +91-984-393-0392.
E-mail address: sravicite@gmail.com
1877-0509 © 2022 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2022.08.110

Deep learning (DL) techniques are widely applicable to medical image processing tasks such as detection, segmentation, classification, and registration. With the advent of DL techniques for SR, researchers in medical imaging initially offered applications and later novel architectures that allow the integration of priors to improve network performance and simplify follow-up analysis and study [3]. The absence of high-quality reference images and the use of specific image priors or constraints are two significant impediments to generalizing deep learning SR approaches to medical images. It is challenging to generate HR ground-truth images in clinical imaging owing to various constraints.

2. Literature review

The super-resolution convolutional neural network (SRCNN) [4] is a foundation of deep learning research on super-resolution. The SRCNN network is not very deep; it comprises only three stages: patch extraction and representation, non-linear mapping, and reconstruction. Kim et al. proposed a very deep residual network (VDSR) [5] for SR, applying residual learning to predict the mapping from LR to HR images with a 20-layer architecture inspired by VGG-Net. In residual learning, the network learns to estimate a residual image: the difference between a high-resolution reference image and a low-resolution image that has been upscaled to the reference size with bicubic interpolation. The residual image stores the image's high-frequency detail. Gradient clipping enables a fast learning rate for training, which promotes convergence despite the architecture's enormous size. Gradient clipping works by truncating each gradient so that all gradients are confined to a set range.
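
As an illustration of these two ideas, the following minimal PyTorch sketch combines SRCNN's three-stage layout (the 9-5-5 kernel sizes and 64/32 channel widths follow the original SRCNN) with VDSR-style residual learning and gradient clipping; the toy tensors, learning rate, and clip value are illustrative, not values reported by the papers.

import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-stage SRCNN: patch extraction, non-linear mapping, reconstruction."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, padding=4),   # patch extraction and representation
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2),  # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=5, padding=2),   # reconstruction
        )

    def forward(self, x):  # x: LR image upscaled to HR size by bicubic interpolation
        return self.body(x)

model = SRCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # high rate enabled by clipping
lr_up = torch.rand(1, 1, 64, 64)   # toy bicubic-upscaled LR input
hr = torch.rand(1, 1, 64, 64)      # toy HR target

# VDSR-style residual learning: the network estimates HR - LR_up, and gradient
# clipping confines every gradient component to a fixed range before the step.
pred = lr_up + model(lr_up)
loss = nn.functional.mse_loss(pred, hr)
loss.backward()
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.01)
optimizer.step()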
Ledig et al. [6] introduced the first GAN-based SR framework (SRGAN), which comprises a generator network for producing HR images, a discriminator network for distinguishing generated images from real-world images, and a loss function that integrates GAN and perceptual losses. Gu et al. [7] developed MedSRGAN (Medical Images SR using GAN), a deep learning technique for SR in medical imaging. Its generator, the Residual Whole Map Attention Network (RWMAN), collects valuable information across channels and focuses attention on significant areas. MedSRGAN also introduces a pairwise discriminator that distinguishes pairs of HR/SR and LR images, and a multi-task loss function combining adversarial loss, content loss, and adversarial feature loss to guide the SR image toward greater reliability and fidelity. Zhang et al. [8] constructed a degradation model that randomly mixes blur, downsampling, and noise to model real image degradation properly. By generating training data with this degradation model and training a blind SISR model on it, BSRGAN demonstrates exceptional restoration capability.
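
A minimal sketch of this idea for a grayscale numpy image follows; note that the actual BSRGAN pipeline draws on a much richer pool of degradations (anisotropic blur kernels, several noise models, JPEG compression) than the three operations shown here.

import random
import numpy as np
from scipy.ndimage import gaussian_filter

def random_degrade(img, scale=4):
    """BSRGAN-style degradation sketch: blur, downsampling, and noise
    applied once each, in a randomly shuffled order, to synthesize an LR image."""
    ops = ["blur", "down", "noise"]
    random.shuffle(ops)
    for op in ops:
        if op == "blur":
            img = gaussian_filter(img, sigma=random.uniform(0.2, 3.0))
        elif op == "down":
            img = img[::scale, ::scale]  # simple nearest-style downsampling
        else:
            img = img + np.random.normal(0, random.uniform(1, 25), img.shape)
    return np.clip(img, 0, 255)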
The primary architecture of ESRGAN [9] is identical to SRGAN, with a few variations. The Residual in Residual Dense Block (RRDB) is a component of ESRGAN that combines multi-level residual networks with dense connections, without batch normalization. The model employs the Residual-in-Residual block as its basic convolution block, rather than a plain residual block or a simple convolution trunk, to provide more accurate gradient flow at a fine level. Additionally, the generator omits batch normalization layers, which keeps the model from smoothing out image artefacts; without batch normalization, ESRGAN renders the artefacts' sharp edges more faithfully. ESRGAN employs a relativistic discriminator to estimate more precisely the likelihood that an image is real or fake. During adversarial training, the generator minimizes a linear combination of the perceptual difference between real and fake images (measured with a pre-trained VGG19 network), the pixel-wise absolute difference between real and fake images, and the relativistic average loss between real and fake images.
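
The relativistic average generator loss can be sketched as follows, where real_logits and fake_logits stand for raw discriminator outputs on real and generated batches; the loss weights in the closing comment echo those reported for ESRGAN but are shown only as an indication.

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def relativistic_g_loss(real_logits, fake_logits):
    """The discriminator judges whether a real image is *more realistic* than
    the average fake, and a fake more realistic than the average real."""
    real_vs_fake = real_logits - fake_logits.mean()
    fake_vs_real = fake_logits - real_logits.mean()
    zeros = torch.zeros_like(real_logits)
    ones = torch.ones_like(fake_logits)
    return bce(real_vs_fake, zeros) + bce(fake_vs_real, ones)

# Total generator objective (coefficients as reported for ESRGAN, indicative only):
# loss_G = perceptual_vgg19 + 5e-3 * relativistic_g_loss(...) + 1e-2 * l1_pixel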
Real-ESRGAN [10] is a sophisticated modification of ESRGAN that synthesizes training pairs with a more realistic degradation process in order to restore generic low-resolution images from the real world. Real-ESRGAN can repair the majority of real-world photographs and achieves far better visual performance than previous efforts, making it more usable in real-world applications. The most complex degradations usually emerge from elaborate combinations of many deterioration processes, such as camera imaging systems, image editing, and Internet transmission. Real-ESRGAN uses a second-order degradation mechanism, i.e. the degradation process applied twice, to strike an acceptable balance between simplicity and efficacy, and it is trained entirely on synthetic data produced by this high-order degradation model, which imitates intricate real-world degradations. Finally, a spectrally normalized U-Net discriminator enhances discriminator capability and stabilizes training.
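
Spectral normalization itself is available directly in PyTorch; a one-block sketch of a spectrally normalized discriminator layer follows (channel sizes are illustrative, not Real-ESRGAN's exact configuration).

import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral normalization constrains each layer's Lipschitz constant,
# which helps stabilize adversarial training of the U-Net discriminator.
block = nn.Sequential(
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2, inplace=True),
)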

3. Vision Transformer

The Transformer architecture was introduced for machine translation [11]. It is based entirely on self-attention and fully connected layers, resulting in an enticing trade-off between efficiency and performance, and it has achieved state-of-the-art performance on various natural language processing tasks [12]. Numerous efforts have been made in computer vision to incorporate Transformers with different types of attention. Recently, convolution-free models built on Transformer layers have shown comparable performance, establishing them as a viable alternative to convolutional architectures. Adapted to vision tasks, Transformers have been effectively employed in low-level image processing, image recognition, object detection, and action recognition. The Vision Transformer (ViT) is the first effort to replace regular convolution with a Transformer, and it can compete with or even outperform state-of-the-art convolutional models for image classification. To form the sequence elements, ViT flattens 2D image patches into vectors and passes them to the Transformer. The ViT architecture takes as input a grid of non-overlapping consecutive image patches with a resolution of N × N; generally, N = 16 or N = 8. Each patch is projected into an embedding vector via a linear projection layer, with an additional learnable position embedding. The patch embeddings are then processed by multi-head self-attention (MHSA) and feedforward layers to capture their long-range relationships and evolve the token embedding features.
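
A minimal PyTorch sketch of this tokenization step follows; a strided convolution is used as the standard equivalent of flattening each patch and applying a shared linear projection, and the default sizes match the common ViT-Base setting rather than anything specific to this paper.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into non-overlapping N x N patches, project each to an
    embedding vector, and add a learnable position embedding per token."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Strided conv == flatten-then-linear applied to every patch.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):                                  # x: (B, C, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        return tokens + self.pos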

3.1. Swin Transformer

The Swin Transformer (ST) [13], shown in Fig. 1, is a variant of the Vision Transformer. It creates hierarchical feature maps by merging image patches in deeper layers, and its computational cost is linear in input image size because self-attention is computed only inside each local window. With these hierarchical feature maps, the ST model can employ dense prediction techniques such as feature pyramid networks (FPN) or U-Net. The linear computational complexity comes from computing self-attention over non-overlapping windows that partition the image; since the number of patches in each window is constant, the complexity grows proportionally to image size. Thus, unlike earlier Transformer-based designs, which produce feature maps at a single resolution and have quadratic complexity, ST is well suited as a general-purpose backbone for vision applications such as image classification and dense recognition.
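
The window partition at the heart of this linear complexity can be sketched in a few lines of PyTorch, assuming the feature map dimensions are divisible by the window size:

import torch

def window_partition(x, win=7):
    """Split a feature map (B, H, W, C) into non-overlapping win x win windows
    so self-attention is computed only inside each window, giving a cost
    linear in image size."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)
    return windows  # (num_windows * B, win * win, C): one token group per window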

Fig. 1. Swin transformer architecture.

The Swin Transformer uses a patch size of 4 × 4 pixels, smaller than the original ViT's 16 × 16. A patch merging module at the start of every stage except stage 1 reduces the resolution of the feature map. Let C be the embedding dimension at the output of stage 1. The patch merging module concatenates the embeddings of each group of 2 × 2 patches into a 4C-dimensional embedding, which a linear layer then reduces to 2C. After patch merging, the number of embeddings therefore decreases by a factor of four, the group size. The merged embeddings are then processed by a series of Transformer blocks, dubbed Swin Transformer blocks, and this procedure is repeated to build lower-resolution feature maps in the subsequent stages. Together, the stages produce a pyramid of feature maps representing features at several scales.

The Swin Transformer's architecture is thus similar to that of many CNNs, in that the resolution drops by a factor of two on each side at every stage while the channel size doubles as the network goes deeper.
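
A minimal PyTorch sketch of the patch merging step described above; the real implementation also applies layer normalization before the linear reduction, which is omitted here.

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2 x 2 group of C-dimensional patch embeddings into a
    4C vector, then reduce to 2C with a linear layer: resolution halves,
    channels double."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):              # x: (B, H, W, C), H and W even
        tl = x[:, 0::2, 0::2, :]       # top-left patch of each 2 x 2 group
        bl = x[:, 1::2, 0::2, :]       # bottom-left
        tr = x[:, 0::2, 1::2, :]       # top-right
        br = x[:, 1::2, 1::2, :]       # bottom-right
        merged = torch.cat([tl, bl, tr, br], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(merged)                  # (B, H/2, W/2, 2C)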

3.2. SwinIR Transformer

The SwinIR Transformer [14], shown in Fig. 2, has considerable potential because it combines the advantages of the Transformer and the CNN. Thanks to its local attention mechanism, it shares the CNN's ability to process large images; by leveraging the shifted-window architecture, it gains the Transformer's strength in modelling long-range dependencies. SwinIR uses the Swin Transformer as its foundation and comprises three modules: shallow feature extraction, deep feature extraction, and high-quality image reconstruction. The shallow feature extraction module extracts shallow features with a convolution layer and passes them directly to the reconstruction module to retain low-frequency information. The deep feature extraction module's Residual Swin Transformer Blocks (RSTB) use several Swin Transformer layers for local attention and cross-window interaction; each block also employs a residual connection as a shortcut for feature aggregation and ends with a convolutional layer for feature augmentation. Finally, the reconstruction module combines shallow and deep features to produce a high-quality image.

Fig. 2. SwinIR transformer architecture for SR image.
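
To make the three-module layout concrete, here is a structural PyTorch sketch; the RSTB stack is replaced by a placeholder, and the channel width and upscaling factor are illustrative defaults rather than the paper's exact configuration.

import torch.nn as nn

class SwinIRSketch(nn.Module):
    """Structural sketch of SwinIR: shallow extraction, deep extraction
    (placeholder for the RSTB stack), and high-quality reconstruction."""
    def __init__(self, ch=3, dim=60, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(ch, dim, 3, padding=1)     # shallow feature extraction
        self.deep = nn.Identity()                           # stands in for the RSTBs
        self.conv_last = nn.Conv2d(dim, dim, 3, padding=1)  # conv closing the deep module
        self.reconstruct = nn.Sequential(                   # image reconstruction
            nn.Conv2d(dim, ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),                         # sub-pixel upsampling
        )

    def forward(self, x):
        s = self.shallow(x)                    # shallow features keep low frequencies
        d = self.conv_last(self.deep(s)) + s   # residual shortcut aggregates features
        return self.reconstruct(d)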

4. Performance evaluation

In this paper, image quality was evaluated with two standard measures: the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [15]. The PSNR between an LR image X and an HR image X', both with N pixels, is defined as

PSNR = 10 \log_{10} \left( \frac{L^2}{MSE} \right)   (1)

where MSE = \frac{1}{N} \sum_{i=1}^{N} \left( X(i) - X'(i) \right)^2 and L = 255 for 8-bit pixel encoding.

The SSIM is defined in equations (2) to (5):

SSIM = [l(X, X')]^{\alpha} \, [c(X, X')]^{\beta} \, [s(X, X')]^{\gamma}   (2)

l(X, X') = \frac{2 \mu_X \mu_{X'} + C_1}{\mu_X^2 + \mu_{X'}^2 + C_1}   (3)

c(X, X') = \frac{2 \sigma_X \sigma_{X'} + C_2}{\sigma_X^2 + \sigma_{X'}^2 + C_2}   (4)

s(X, X') = \frac{\sigma_{XX'} + C_3}{\sigma_X \sigma_{X'} + C_3}   (5)

\mu_X and \sigma_X denote the mean and standard deviation of X, and \mu_{X'} and \sigma_{X'} those of X'. \sigma_{XX'} is the covariance between X and X'. C_1, C_2, and C_3 are constants, and \alpha, \beta, and \gamma are weighting parameters, set to \alpha = \beta = \gamma = 1 here. With the simplification C_3 = C_2 / 2, the SSIM reduces to

SSIM = \frac{(2 \mu_X \mu_{X'} + C_1)(2 \sigma_{XX'} + C_2)}{(\mu_X^2 + \mu_{X'}^2 + C_1)(\sigma_X^2 + \sigma_{X'}^2 + C_2)}   (6)
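
As a small sketch of how these measures can be computed, the PSNR below follows Eq. (1) directly, while SSIM defers to scikit-image's structural_similarity, which implements the windowed mean, variance, and covariance comparison behind Eqs. (2) to (6); inputs are assumed to be 8-bit grayscale numpy arrays of equal shape.

import numpy as np
from skimage.metrics import structural_similarity

def psnr(x, x_ref, L=255.0):
    """PSNR per Eq. (1): 10 * log10(L^2 / MSE) for 8-bit images."""
    mse = np.mean((x.astype(np.float64) - x_ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10(L ** 2 / mse)

def ssim(x, x_ref):
    """SSIM per Eqs. (2)-(6), via scikit-image's windowed implementation."""
    return structural_similarity(x, x_ref, data_range=255)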

5. Results

We used chest X-ray, skin lesion, and fundus images collected from different datasets for the experiments. Although they differ in dimension, all of these images were scaled to a resolution of 256 × 256 pixels. We downloaded pre-trained models from the GitHub repository https://github.com/JingyunLiang/SwinIR to evaluate the super-resolution architectures on medical images. Each medical image was then submitted to the models, and the HR images produced by BSRGAN, Real-ESRGAN, and SwinIR were generated and displayed. The PSNR and SSIM of the output images were estimated against the input images. As Table 1 shows, SwinIR obtains better performance in terms of SSIM than the other image super-resolution models. Figure 3 illustrates a visual comparison of the SRGAN, BSRGAN, Real-ESRGAN, and SwinIR approaches for various input medical images.
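
The evaluation loop can be sketched roughly as follows; load_pretrained_sr_model and the file names are hypothetical placeholders (actual model loading follows the repository's instructions), and psnr and ssim are the helpers sketched in Section 4.

import numpy as np
from PIL import Image

model = load_pretrained_sr_model("swinir")  # hypothetical loader, repo-specific
for path in ["chest_xray_1.png", "skin_lesion_1.png", "fundus_1.png"]:
    img = np.array(Image.open(path).convert("L").resize((256, 256)))
    sr = model(img)  # model-specific inference; output resized to match input
    print(path, "PSNR:", psnr(sr, img), "SSIM:", ssim(sr, img))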

Table 1. The PSNR and SSIM of different architectures for different input images.

Image           Metric   SRGAN    BSRGAN   RealESRGAN   SwinIR
Chest X-ray-1   PSNR     30.086   32.086   31.6284      33.27
                SSIM     0.761    0.786    0.783        0.845
Chest X-ray-2   PSNR     33.23    34.99    32.702       34.88
                SSIM     0.865    0.896    0.908        0.923
Skin lesion-1   PSNR     31.231   30.534   28.53        29.704
                SSIM     0.843    0.851    0.862        0.862
Skin lesion-2   PSNR     32.43    34.87    32.83        34.64
                SSIM     0.861    0.8876   0.894        0.921
Fundus-1        PSNR     32.34    34.56    33.86        35.623
                SSIM     0.887    0.915    0.926        0.932
Fundus-2        PSNR     37.23    38.376   37.918       38.104
                SSIM     0.893    0.921    0.932        0.943

Fig. 3. Visual comparison of results for different medical images. The SR image generated by SwinIR is compared with images produced by SRGAN, BSRGAN, and RealESRGAN. (a) First column: input images; (b), (c), (d), (e): columns of images generated by SRGAN, BSRGAN, RealESRGAN, and SwinIR, respectively.

6. Conclusion

Image resolution is critical for extracting information from medical images in medical image processing. Improved image resolution enables more accurate identification of the patient's ailment. However, medical images often

include a high level of noise and abnormalities due to the human body's anatomical structure and the sensor limits of the image acquisition device. Super-resolution approaches based on GANs and Transformers have been described to address this issue. An excellent super-resolution algorithm such as the SwinIR Transformer makes it possible to raise the resolution of low-resolution medical images to acceptable levels.

References

[1] Zamzmi, Ghada, Sivaramakrishnan Rajaraman, and Sameer Antani. (2020) "Accelerating Super-Resolution and Visual Task Analysis in Medical Images." Applied Sciences 10 (12): 4282. https://doi.org/10.3390/app10124282
[2] Li, Y., B. Sixou, and F. Peyrin. (2021) "A Review of the Deep Learning Methods for Medical Images Super Resolution Problems." IRBM 42 (2): 120–133. https://doi.org/10.1016/j.irbm.2020.08.004
[3] Wang, Zhihao, Jian Chen, and Steven C.H. Hoi. (2020) "Deep learning for image super-resolution: A survey." IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (10): 3365–3387. https://doi.org/10.1109/TPAMI.2020.2982166
[4] Dong, Chao, Chen Change Loy, Kaiming He, and Xiaoou Tang. (2016) "Image Super-Resolution Using Deep Convolutional Networks." IEEE Transactions on Pattern Analysis and Machine Intelligence 38: 295–307. https://doi.org/10.1109/TPAMI.2015.2439281
[5] Kim, Jiwon, Jung Kwon Lee, and Kyoung Mu Lee. (2016) "Accurate Image Super-Resolution Using Very Deep Convolutional Networks." IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 1646–1654. https://doi.org/10.1109/CVPR.2016.182
[6] Ledig, Christian, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, et al. (2017) "Photo-realistic single image super-resolution using a generative adversarial network." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 4681–4690. https://doi.org/10.1109/CVPR.2017.19
[7] Gu, Yuchong, Zitao Zeng, Haibin Chen, Jun Wei, Yaqin Zhang, Binghui Chen, Yingqin Li, et al. (2020) "MedSRGAN: Medical Images Super-Resolution Using Generative Adversarial Networks." Multimedia Tools and Applications 79 (29): 21815–21840. https://doi.org/10.1007/s11042-020-08980-w
[8] Zhang, Kai, Jingyun Liang, Luc Van Gool, and Radu Timofte. (2021) "Designing a practical degradation model for deep blind image super-resolution." In Proceedings of the IEEE/CVF International Conference on Computer Vision: 4791–4800. https://arxiv.org/abs/2103.14006
[9] Wang, Xintao, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. (2019) "ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks." Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) LNCS: 63–79. https://doi.org/10.1007/978-3-030-11021-5
[10] Wang, Xintao, Liangbin Xie, Chao Dong, and Ying Shan. (2021) "Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data." In Proceedings of the IEEE/CVF International Conference on Computer Vision: 1905–1914. https://doi.org/10.1109/iccvw54120.2021.00217
[11] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. (2017) "Attention Is All You Need." In Advances in Neural Information Processing Systems: 5999–6009. https://dl.acm.org/doi/pdf/10.5555/3295222.3295349
[12] Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. (2018) "Improving Language Understanding by Generative Pre-Training."
[13] Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. (2021) "Swin transformer: Hierarchical vision transformer using shifted windows." In Proceedings of the IEEE/CVF International Conference on Computer Vision: 10012–10022. http://arxiv.org/abs/2103.14030
[14] Liang, Jingyun, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. (2021) "SwinIR: Image restoration using swin transformer." In Proceedings of the IEEE/CVF International Conference on Computer Vision: 1833–1844. http://arxiv.org/abs/2108.10257
[15] Wang, Zhou, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. (2004) "Image quality assessment: from error visibility to structural similarity." IEEE Transactions on Image Processing 13 (4): 600–612. https://doi.org/10.1109/TIP.2003.819861
