
Comparative Study of Image Super Resolution Models for Face Images

Park, Gunwoo (20635825, gparkab@connect.ust.hk)
Yu, Chun Ho (20775792, chyuar@connect.ust.hk)
Cho, Young Beom

Abstract

Reconstructing human faces from old, blurry, low-resolution images is complex, with many uses in image restoration, medical imaging, and video enhancement. Nevertheless, current super-resolution (SR) techniques frequently fail to create faithful and realistic facial images that maintain the original photos' identity and expression. In this work, we apply a recent deep learning technique, the Hybrid Attention Transformer (HAT), to train an SR model that can produce high-quality, high-resolution facial images from low-resolution inputs, using the Flickr-Faces-HQ (FFHQ) dataset, which consists of 70,000 high-quality PNG images with a variety of backgrounds, ages, and ethnicities. The HAT model uses a transformer architecture to improve its generation and representation skills, and it uses attention mechanisms to extract both local and global features from the face images. We assess image quality and structural similarity using the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) measures, respectively. Our results demonstrate that the model produces faithful and realistic facial images that closely resemble the source photos, outperforming the other SR methods in both PSNR and SSIM. Our research indicates that the HAT model has potential for several image processing and computer vision applications, in addition to demonstrating its efficacy in facial image restoration.

1. Introduction

Reconstructing human faces from outdated, low-quality, and low-resolution images is a challenging task with numerous applications in fields such as medical imaging, video enhancement, and image restoration. This study aims to utilize deep learning supersampling methods to convert low-resolution facial photos into high-quality, high-resolution images. We employ the Flickr-Faces-HQ (FFHQ) dataset, consisting of 70,000 high-quality PNG images with diverse backgrounds, ages, and ethnicities, to train a deep learning model that identifies fundamental patterns and characteristics of high-resolution facial photos. The deep learning method, known as super-resolution (SR), is then applied to generate high-resolution images from low-resolution inputs. Our goal is to produce high-quality facial images that closely resemble the source photos. The performance of our model is evaluated using the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) measures, which assess image quality and structural similarity, respectively. By leveraging the FFHQ dataset and deep learning techniques, this study contributes to the advancement of facial image reconstruction and restoration.

2. Related work

Face super-resolution has been extensively studied in the literature. We provide a short review of existing methods.

2.1. SRCNN

Image Super-Resolution Using Deep Convolutional Networks (SRCNN) [1] is a groundbreaking paper that introduces a novel approach to image super-resolution (SR) using deep convolutional neural networks (CNNs). SRCNN achieves state-of-the-art results on various SR benchmarks, surpassing traditional SR methods by a significant margin. Its main contribution lies in learning an end-to-end mapping from low-resolution to high-resolution images with a lightweight, fully convolutional architecture. By contrast, bicubic interpolation only utilizes local information from neighboring pixels, failing to leverage global context and external knowledge, which ultimately limits SR performance.
2.2. ESRGAN

Real-ESRGAN, a groundbreaking approach in image super-resolution (SR), utilizes exclusively synthetic data for training, eliminating the need for large-scale real-world datasets and enabling precise data manipulation [2]. Its architecture comprises a pre-trained SR network and a GAN-based perceptual loss module, achieving state-of-the-art SR performance. While Real-ESRGAN offers data efficiency and precise control, it may suffer from domain shift and computationally intensive training [3]. The Hybrid Attention Transformer (HAT) model [5], with its hybrid attention mechanism, generates sharper high-resolution images but may not be as data-efficient. Ultimately, the choice between HAT and Real-ESRGAN depends on the application's specific requirements.

2.3. SwinIR

In "SwinIR: Image Restoration Using Swin Transformer" [4], the authors introduce an image restoration model leveraging the Swin Transformer to reconstruct high-quality images from degraded versions. The SwinIR architecture comprises shallow feature extraction, deep feature extraction with Residual Swin Transformer Blocks (RSTB), and high-quality image reconstruction, significantly outperforming contemporary CNN-based models in super-resolution, denoising, and JPEG artifact reduction tasks. From a critical standpoint, while SwinIR's performance is commendable, it could be argued that the model's reliance on large datasets for training might limit its adaptability to more diverse, real-world scenarios. Moreover, the complexity of its transformer-based structure might present challenges in understanding and modifying the network compared to more transparent CNN-based models, which could hinder its broader adoption in the community or its integration with models like HAT, which are designed for high-level vision tasks and could benefit from the granular improvements SwinIR offers.

2.4. HAT

The Hybrid Attention Transformer (HAT) for image restoration proposes a novel network that synergizes channel attention with window-based self-attention mechanisms to enhance image restoration tasks such as super-resolution, denoising, and compression artifact reduction [5, 6]. HAT introduces an overlapping cross-attention module to improve cross-window feature interactions, addressing limitations in existing Transformer networks. Furthermore, it employs a same-task pre-training strategy using large-scale datasets to exploit the model's capabilities fully. Extensive experiments demonstrate that HAT achieves state-of-the-art performance quantitatively and qualitatively, with publicly available code and models. The HAT model stands out for its innovative integration of attention mechanisms, which significantly broadens the range of input information used for image restoration. This is particularly advantageous for handling complex textures and repeated patterns, which are common challenges in image reconstruction tasks. The model's remarkable performance and scalability make it an excellent candidate for practical applications, suggesting its potential to set a new benchmark in the field of image restoration.

3. Data

3.1. Flickr-Faces-HQ dataset (FFHQ)

The Flickr-Faces-HQ dataset (FFHQ) contains 70,000 high-quality face images at 1,024 × 1,024 resolution, with considerable variation in terms of age, ethnicity, and image background. It also has good coverage of accessories such as eyeglasses, sunglasses, and hats [7].

Images of the FFHQ dataset were crawled from Flickr; therefore, the dataset includes the biases of that website. The images are cropped and aligned automatically using dlib, and only images under permissive licenses were collected [7].

Figure 1. FFHQ dataset image teaser [7]

We selected 18,000 images out of the 70,000 for training. We down-scaled each image to 256 × 256 for the input data and kept the original image for score and loss calculation. The 256 × 256 images were used as model training input, and the output resolution matches the original 1,024 × 1,024, i.e., 4× supersampling throughout the project.

We adopt multiple models (SRCNN, ESRGAN, SwinIR, and HAT) for the project. Due to the nature of SRCNN training, the SRCNN pipeline includes an additional preprocessing step that crops the images into smaller 33 × 33 patches for convolution; a sketch of this preprocessing follows Figure 2.

Figure 2. Example of cropped images for SRCNN
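For concreteness, the following is a minimal sketch of this preprocessing. It assumes the FFHQ PNGs sit in a local directory; the directory names are illustrative, while the 18,000-image cutoff, the 4× bicubic downscale, and the 33 × 33 SRCNN patch size follow the description above.

```python
# Downscale FFHQ images to 256x256 inputs and cut 33x33 patches for SRCNN.
# Directory names are illustrative assumptions, not project paths.
from pathlib import Path
from PIL import Image

HR_DIR = Path("ffhq_hr")          # original 1024x1024 images (ground truth)
LR_DIR = Path("ffhq_lr")          # 256x256 bicubic inputs
PATCH_DIR = Path("ffhq_patches")  # 33x33 crops used only for SRCNN
PATCH = 33

for d in (LR_DIR, PATCH_DIR):
    d.mkdir(parents=True, exist_ok=True)

for hr_path in sorted(HR_DIR.glob("*.png"))[:18000]:  # 18,000 training images
    hr = Image.open(hr_path).convert("RGB")
    lr = hr.resize((256, 256), Image.BICUBIC)         # 4x bicubic downscale
    lr.save(LR_DIR / hr_path.name)
    # SRCNN-only step: tile the low-resolution image into 33x33 crops.
    for top in range(0, 256 - PATCH + 1, PATCH):
        for left in range(0, 256 - PATCH + 1, PATCH):
            crop = lr.crop((left, top, left + PATCH, top + PATCH))
            crop.save(PATCH_DIR / f"{hr_path.stem}_{top}_{left}.png")
```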
4. Methods

We suggest using several models and comparing their performance to solve the issue of supersampling old face images. We chose four popular models (SRCNN, ESRGAN, SwinIR, and HAT) for super-resolution. SRCNN is a three-layer convolutional neural network that learns an end-to-end mapping between images of different resolutions. ESRGAN, an improved version of SRGAN, uses a relativistic discriminator and a perceptual loss to produce realistic and intricate images. SwinIR captures rich features and long-range dependencies using a residual structure and the self-attention mechanism. HAT is a hybrid attention network that blends channel attention, window-based self-attention, and overlapping cross-attention to adaptively highlight key areas and details. Using a dataset of old face photos, we train and test these models and assess their performance using a variety of metrics, including PSNR and SSIM. We aim to identify the optimal model capable of generating faithful, high-quality supersampled images of historical faces.

4.1. Baseline: Bicubic Interpolation

A popular technique for resizing images is bicubic interpolation, which interpolates pixel values using a cubic polynomial function. It is an easy-to-use, quick method that produces smooth, continuous images; however, when the scaling factor is high, it can introduce artifacts such as aliasing, ringing, or blurring. Bicubic interpolation therefore serves as the baseline for our project. Comparing against it allows us to quantify the improvement our models achieve over the standard approach and to pinpoint the areas where our models excel and fall short in terms of image fidelity and quality. For image super-resolution, bicubic interpolation is a commonly used and recognized baseline that offers a reasonable and impartial comparison for our project; a minimal sketch is shown below.
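The baseline takes only a few lines. The following is a minimal sketch, assuming images are batched PyTorch tensors in [0, 1]; any image library with bicubic resampling would serve equally well.

```python
# Bicubic 4x upscaling baseline on (N, C, H, W) tensors in [0, 1].
import torch
import torch.nn.functional as F

def bicubic_baseline(lr: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """Upscale a batch of low-resolution images by `scale` using bicubic."""
    sr = F.interpolate(lr, scale_factor=scale, mode="bicubic", align_corners=False)
    # Cubic kernels can overshoot; clamp back to the valid intensity range.
    return sr.clamp(0.0, 1.0)
```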
4.2. Multi-model optimization

Supersampling old face images is a complex problem because it must preserve facial expressions and features while improving resolution. Current techniques may yield unrealistic, distorted, or blurry results that do not correspond to the original images. As a result, we employ a multi-model approach and evaluate the models' performance using different metrics. In this manner, the optimal model for producing faithful, high-quality, supersampled images of vintage faces can be identified. By utilizing multiple models, we can investigate the trade-offs between various aspects of image quality, such as fidelity, realism, and sharpness. Our method can yield important insights for future image super-resolution research and applications, and it is thorough and rigorous because it covers a broad range of models and metrics.

4.3. SRCNN

Our way of using the SRCNN model for face reconstruction is as follows:

1) We first prepare the training and test sets. The FFHQ dataset, which has 70,000 high-quality photos of faces at a resolution of 1024 × 1024, can be used, separated into 60,000 photos for training and 10,000 for testing. We use bicubic interpolation to downsample the original images by a factor of 4, producing 256 × 256 images as the low-resolution counterparts.

2) The SRCNN model then needs to be implemented. The architecture suggested by Dong et al. can be employed, comprising three convolutional layers with filters measuring 9 × 9, 3 × 3, and 5 × 5, respectively. The first layer extracts features from the low-resolution images, the second maps them to high-resolution space, and the third rebuilds the high-resolution images. In this case, the optimizer is Adam, and the loss function is the mean squared error (MSE); a sketch of this network follows this section.

Figure 3. SRCNN model diagram [1]

3) The SRCNN model needs to be trained and tested. The PyTorch framework is a valuable tool for implementing and executing the model. Using the training data, we train the model for 200 epochs before evaluating it on the testing data. The structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR) are the two metrics we use to gauge how well the model performs. Additionally, we use image plots to compare the SRCNN model's output with the original images, bicubic interpolation, ESRGAN, and SwinIR to see how they differ.
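A minimal sketch of this network is given below. It matches the three-layer design described above (9 × 9, 3 × 3, 5 × 5 filters); the 64/32 channel widths and the learning rate are common choices and are assumptions here, not values stated in the text.

```python
# Three-layer SRCNN sketch with MSE loss and the Adam optimizer.
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),        # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x is the bicubic-upscaled input; SRCNN refines it at full size.
        return self.net(x)

model = SRCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumption
criterion = nn.MSELoss()  # MSE loss, as stated above
```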
4.4. ESRGAN

We tried two different versions of ESRGAN, namely ESRGAN and Real-ESRGAN; the testing method is given below:

1) We first prepare the training and test sets. The FFHQ dataset and the bicubic interpolation are the same as used for the SRCNN model. 15,900 images were chosen from the dataset for training, and 100 were chosen for testing. The low-resolution images can then be produced by downsampling the training and testing images by a factor of 4 using bicubic interpolation.

2) The ESRGAN model must then be implemented using an RRDBNet generator and a discriminator. We employ the generator and discriminator architecture suggested by Wang et al. [2]. The generator comprises 23 residual-in-residual dense block (RRDB) networks and upscales the low-resolution images by a factor of 4 using a sub-pixel convolution layer. The discriminator is an RRDB network that classifies the input images as real or fake using five RRDBs and a global average pooling layer. The loss function for the generator and discriminator can be the BCE-with-logits loss, combined with the content loss and the pixel-wise loss. Hyperparameters were set through repeated fine-tuning, including a batch size of 4, weight decay, and 25 training epochs. Based on this, we implement the basic ESRGAN and Real-ESRGAN models for training and testing.

Figure 4. ESRGAN model diagram [2]

Figure 5. RRDB diagram [2]

3) The ESRGAN model needs to be trained and tested. We use the PyTorch framework to implement and run the model. Using the training data, we train the model and assess its performance on the testing data with the same metrics as the SRCNN model, including PSNR and SSIM. Additionally, we use image plots to compare the outcomes of the ESRGAN model with those of bicubic interpolation, the SRCNN model, the SwinIR model, and the original images. A simplified sketch of the adversarial objective follows below.
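The following sketch illustrates the training objective described above. G and D are assumed to be the RRDB generator and discriminator built elsewhere, and the loss weight is illustrative. For brevity it uses a standard (non-relativistic) adversarial term, whereas ESRGAN proper uses a relativistic average discriminator and adds a perceptual loss.

```python
# Simplified GAN objective sketch with BCE-with-logits, L1 pixel loss.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # the BCE-with-logits loss mentioned above
pixel_loss = nn.L1Loss()      # pixel-wise term

def generator_step(G, D, lr_img, hr_img, adv_weight=5e-3):
    sr = G(lr_img)                               # 4x super-resolved prediction
    logits = D(sr)
    adv = bce(logits, torch.ones_like(logits))   # G wants D to output "real"
    return pixel_loss(sr, hr_img) + adv_weight * adv, sr

def discriminator_step(D, sr, hr_img):
    real_logits, fake_logits = D(hr_img), D(sr.detach())
    real = bce(real_logits, torch.ones_like(real_logits))
    fake = bce(fake_logits, torch.zeros_like(fake_logits))
    return 0.5 * (real + fake)
```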
4.5. SwinIR

One possible way to use the SwinIR model for supersampling the face images of the FFHQ dataset is as follows:

1) For the SwinIR model implementation, we can use the official PyTorch implementation of SwinIR, provided by the KAIR framework. Based on the Swin Transformer, SwinIR divides input feature maps into windows and processes the features within each window using self-attention. Shallow feature extraction, deep feature extraction, and high-quality image reconstruction are the three components of SwinIR. Specifically, residual Swin Transformer blocks (RSTB) with multiple Swin Transformer layers and a residual connection make up the deep feature extraction module. The architecture and hyperparameters used in [4], such as the number of RSTBs, channels, window size, patch size, and heads, can be reused.

Figure 6. SwinIR model diagram [4]

2) We fine-tune the SwinIR model on the FFHQ dataset. Using the same loss function, optimizer, and learning rate scheduler as in the original paper, we load the pre-trained model and train it on the FFHQ dataset. We fine-tune the model for 100 iterations using the training data before testing it on the test data. Several metrics can be used to assess the model's performance, including the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR). Additionally, we use image plots to compare the outcomes of the refined SwinIR model with the original images, the other models we apply, and bicubic interpolation. A sketch of this fine-tuning loop is shown below.
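The following is a hedged sketch of the 100-iteration fine-tuning loop. Here `model` is a SwinIR network from the SwinIR/KAIR codebase, `train_loader` yields (low-resolution, high-resolution) pairs, and the checkpoint path and learning rate are assumptions; the L1 loss and Adam optimizer mirror the original paper's setup.

```python
# Load a pre-trained SR checkpoint and fine-tune for a fixed iteration budget.
import torch

def finetune(model, train_loader, ckpt="swinir_x4_pretrained.pth",
             iters=100, lr=2e-4, device="cuda"):
    model.load_state_dict(torch.load(ckpt, map_location=device))
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.L1Loss()
    step = 0
    while step < iters:
        for lr_img, hr_img in train_loader:
            sr = model(lr_img.to(device))
            loss = criterion(sr, hr_img.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= iters:
                break
    return model
```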
4.6. HAT

We need to implement the HAT model. We can use the GitHub repository, which provides the official PyTorch implementation of HAT. HAT is a model based on the hybrid attention transformer, combining channel attention and window-based self-attention schemes. It also introduces an overlapping cross-attention module to enhance the interaction between neighboring window features. For the hyperparameters, the default learning rate is 1e-4, the patch size is 16 × 16, there are eight attention heads per layer and 20 encoder/decoder layers, and the embedding dimension is 128. We will fine-tune the model to obtain the best result we can by manipulating these hyperparameters; these settings are collected in the configuration sketch below.
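For reference, the hyperparameters above are gathered in one place here. The key names are illustrative; the official HAT repository configures these values through its own YAML option files.

```python
# Hyperparameter summary for our HAT runs (key names are illustrative).
hat_config = {
    "learning_rate": 1e-4,  # default learning rate
    "patch_size": 16,       # 16x16 patches
    "num_heads": 8,         # attention heads per layer
    "num_layers": 20,       # encoder/decoder layers
    "embed_dim": 128,       # embedding dimension
    "upscale": 4,           # 4x super-resolution, as in the rest of the project
}
```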
4.7. Alternative method

Using a single model pre-trained on a sizable and varied dataset of face images, like FFHQ or CelebA-HQ, is another possible solution to this problem. This method could use transfer learning to adapt the model to the domain of old face photos. It has the benefit of being quick and easy to use, because it does not require training multiple models or evaluating their performance. The drawback is that it may not capture the unique traits and nuances of vintage face photos, like wrinkles, spots, or hairstyles. Furthermore, the pre-trained model might add biases or artifacts, such as altering the gender, skin tone, or facial expression, that are not in the original photos. Our method is preferable because it compares the performance of multiple models using different metrics; in this manner, the optimal model that produces faithful, high-quality super-sampled images of vintage faces while maintaining their original features and expressions can be identified. Since our method covers a broad range of models and metrics, it is more thorough and rigorous, and it can yield insightful information for future studies and applications of image super-resolution.

Figure 7. (a) original image; (b) result from HAT; (c) result from bicubic interpolation; (d) result from SRCNN; (e) result from ESRGAN; (f) result from SwinIR.

5. Experiment

Dataset and Model Setup. Our experiment commenced with the FFHQ dataset, a comprehensive collection of 70,000 high-resolution PNG images of human faces. Owing to its high-quality content and detailed attributes, the FFHQ dataset was ideal for training our super-resolution models.

The models we trained were cloned from repositories available on GitHub [8, 9, 10, 11]. The training process was computationally intensive and time-consuming, with each model taking approximately six days to train. Despite the resource intensity, this phase was crucial for ensuring that our models could accurately learn and replicate the high-resolution characteristics of the FFHQ images.

Model Evaluation. The primary objective of our experiment was to assess the performance of the different super-resolution models, namely SRCNN, ESRGAN, SwinIR, and HAT [8, 9, 10, 11]. We evaluated these models on two prominent image quality metrics: the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM). PSNR is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation, making it a standard for measuring the reconstruction quality of lossy compression codecs. SSIM, on the other hand, compares the similarity between two images; the SSIM index can be viewed as a quality measure of one of the images being compared, provided the other image is regarded as of perfect quality. Higher PSNR and SSIM values indicate a better model; as Table 1 shows, HAT scores highest among all the models.
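For reference, both metrics can be computed in a few lines. The following is a minimal sketch, assuming 8-bit RGB numpy arrays of equal shape; PSNR follows directly from the MSE, and SSIM uses scikit-image's implementation (the channel_axis argument requires scikit-image >= 0.19).

```python
# PSNR from MSE, and SSIM via scikit-image, on uint8 RGB arrays (H, W, 3).
import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def ssim(ref: np.ndarray, test: np.ndarray) -> float:
    return structural_similarity(ref, test, channel_axis=-1, data_range=255)
```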
The results of our experiment are visually presented in Figure 7 (a)-(f). An analysis of these results reveals distinct characteristics and performance levels for each super-resolution model.

Method   PSNR    SSIM
SRCNN    27.855  0.819
ESRGAN   28.667  0.826
SwinIR   30.098  0.860
HAT      30.535  0.866

Table 1. PSNR and SSIM of the four models; higher is better, and HAT achieves the best scores.

The HAT model's output closely resembles the ground truth image. The processed image appears to have less noise, as if a filter had been applied, resulting in a higher exposure. This can be attributed to HAT's advanced learning capabilities, which are designed to suppress noise while enhancing the overall quality of upscaling. The model effectively learns the high-frequency components, resulting in more visually pleasing images that maintain a balance between sharpness and naturalness.

The SRCNN model [1], on the other hand, seems to have made minimal changes to the low-resolution input image. Visually, the output appears to be a higher-resolution version of the original low-resolution input, with no significant enhancement in detail or quality. However, it is worth noting that SRCNN requires the least computational power among the models we tested. It has a simpler network structure, which results in faster computation times but at the cost of more advanced super-resolution features.

The ESRGAN model shows an impressive level of detail enhancement [2]. This is due to the model's perceptual loss function, which optimizes textures and details during the upscaling process. However, this strength can also be a weakness: in its quest to add detail, ESRGAN can sometimes introduce elements that deviate from the original image's content. This is because the model uses a generative adversarial network (GAN) structure; the generator tries to create images that the discriminator cannot distinguish from high-resolution images, and in the process it sometimes generates high-frequency details that are not present in the ground truth image, leading to objects or details that should not be there.

Based on our experimental results, we conclude that HAT [5, 6] is the best-performing method among the tested approaches for facial image restoration. HAT consistently demonstrates superior performance in preserving crucial facial features such as eyelashes and eyebrows, resulting in more realistic and visually appealing images. In comparison, SwinIR and ESRGAN show inferior detail preservation, with ESRGAN being particularly prone to overwriting eyelashes with other color pixels. This finding highlights the effectiveness of HAT in capturing and restoring fine details in facial images, which is essential for accurate and realistic representation.

In summary, our experimental results demonstrate the superior performance of HAT over state-of-the-art methods such as SwinIR and ESRGAN in the context of facial image restoration. The ability of HAT to preserve fine details such as eyelashes and eyebrows contributes to its overall effectiveness in producing high-quality, realistic images. Each super-resolution model has its strengths and weaknesses, and the choice of model depends on the specific requirements of the task; in our case, HAT performed the best overall, offering a good balance between detail enhancement, noise suppression, and faithfulness to the original image.

6. Conclusion

Our study highlights the effectiveness of the HAT model in facial image restoration, making it a promising approach for various applications in the field of image processing and computer vision. However, it is important to recognize that each super-resolution model has its strengths and weaknesses, and the choice of model depends on the specific requirements of the task. Future work could explore integrating the strengths of these models to develop a more robust and versatile super-resolution method for a wider range of applications.
References

[1] C. Dong, C. C. Loy, K. He, and X. Tang, "Image Super-Resolution Using Deep Convolutional Networks," arXiv:1501.00092 [cs], Jul. 2015. [Online]. Available: https://arxiv.org/abs/1501.00092
[2] X. Wang et al., "ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks," arXiv:1809.00219, 2018. [Online]. Available: https://arxiv.org/abs/1809.00219
[3] X. Wang, L. Xie, C. Dong, and Y. Shan, "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data," arXiv:2107.10833, Aug. 2021. [Online]. Available: https://arxiv.org/abs/2107.10833 (accessed Jul. 10, 2023)
[4] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, "SwinIR: Image Restoration Using Swin Transformer," arXiv:2108.10257, Aug. 2021. [Online]. Available: https://arxiv.org/abs/2108.10257
[5] X. Chen, X. Wang, J. Zhou, and D. Chen, "Activating More Pixels in Image Super-Resolution Transformer," arXiv:2205.04437, May 2022. doi: 10.48550/arXiv.2205.04437
[6] X. Chen et al., "HAT: Hybrid Attention Transformer for Image Restoration," arXiv:2309.05239, Sep. 2023. doi: 10.48550/arXiv.2309.05239
[7] "NVlabs/ffhq-dataset," GitHub, Apr. 09, 2021. [Online]. Available: https://github.com/NVlabs/ffhq-dataset
[8] S. Salaria, "Using The Super-Resolution Convolutional Neural Network for Image Restoration," GitHub, Nov. 17, 2023. [Online]. Available: https://github.com/xoraus/Super-Resolution-CNN-for-Image-Restoration (accessed Dec. 06, 2023)
[9] X. Wang, "xinntao/ESRGAN," GitHub, Jul. 26, 2023. [Online]. Available: https://github.com/xinntao/ESRGAN
[10] J. Liang, "SwinIR: Image Restoration Using Swin Transformer," GitHub, Oct. 02, 2022. [Online]. Available: https://github.com/JingyunLiang/SwinIR (accessed Oct. 02, 2022)
[11] "HAT," GitHub, Dec. 06, 2023. [Online]. Available: https://github.com/XPixelGroup/HAT (accessed Dec. 06, 2023)
