
EAI/Springer Innovations in Communication

and Computing

Series Editor
Imrich Chlamtac, European Alliance for Innovation, Ghent, Belgium
The impact of information technologies is creating a new world that is not yet fully
understood. The extent and speed of the economic, lifestyle, and social changes
already perceived in everyday life are hard to estimate without understanding the
technological driving forces behind them. This series presents contributed volumes
featuring the latest research and development in the various information engineering
technologies that play a key role in this process. The range of topics, focusing
primarily on communications and computing engineering, includes, but is not limited
to, wireless networks; mobile communication; design and learning; gaming;
interaction; e-health and pervasive healthcare; energy management; smart grids;
internet of things; cognitive radio networks; computation; cloud computing;
ubiquitous connectivity; and, more generally, smart living, smart cities, Internet of
Things, and more. The series publishes a combination of expanded papers selected
from hosted and sponsored European Alliance for Innovation (EAI) conferences
that present cutting-edge, global research as well as provide new perspectives on
traditional related engineering fields. This content, complemented with open calls
for contribution of book titles and individual chapters, together maintains Springer’s
and EAI’s high standards of academic excellence. The audience for the books
consists of researchers, industry professionals, and advanced-level students, as well as
practitioners in related fields of activity, including information and communication
specialists, security experts, economists, urban planners, doctors, and in general
representatives of all those walks of life affected by and contributing to the information
revolution.
Indexing: This series is indexed in Scopus, Ei Compendex, and zbMATH.
About EAI – EAI is a grassroots member organization initiated through cooperation
between businesses, public, private, and government organizations to address
the global challenges of Europe’s future competitiveness and link the European
research community with its counterparts around the globe. EAI reaches out to
hundreds of thousands of individual subscribers on all continents and collaborates
with an institutional member base including Fortune 500 companies, government
organizations, and educational institutions, providing a free research and innovation
platform. Through its open free membership model, EAI promotes a new research
and innovation culture based on collaboration, connectivity, and recognition of
excellence by the community.
B. Vinoth Kumar • P. Sivakumar • B. Surendiran •
Junhua Ding
Editors

Smart Computer Vision


Editors

B. Vinoth Kumar
PSG College of Technology
Coimbatore, Tamil Nadu, India

P. Sivakumar
PSG College of Technology
Coimbatore, Tamil Nadu, India

B. Surendiran
National Institute of Technology Puducherry
Thiruvettakudy, Karaikal, India

Junhua Ding
University of North Texas
Denton, TX, USA

ISSN 2522-8595 ISSN 2522-8609 (electronic)


EAI/Springer Innovations in Communication and Computing
ISBN 978-3-031-20540-8 ISBN 978-3-031-20541-5 (eBook)
https://doi.org/10.1007/978-3-031-20541-5

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Computer vision is a field of computer science that works on enabling computers
to see, identify, and process images in the same way that human vision does,
and then provide appropriate output. It is like imparting human intelligence and
instincts to a computer. It is an interdisciplinary field that trains computers to
interpret and understand the visual world from digital images and videos. The main
objective of this edited book is to address and disseminate state-of-the-art research
and development in the applications of intelligent techniques for computer vision.
This book provides contributions that include theory, case studies, and intelligent
techniques pertaining to computer vision applications. This will help readers
to grasp a broad point of view and the essence of the recent advances in this
field. The prospective audience would be researchers, professionals, practitioners,
and students from academia and industry who work in this field.
We hope the chapters presented will inspire future research both from theoretical
and practical viewpoints to spur further advances in the field. A brief introduction
to each chapter is as follows.
Chapter 1 discusses the machine learning approaches applied to automatic sports
video summarization. Chapter 2 proposes a new technique for lecture video seg-
mentation and key frame extraction. The results are compared against six existing
state-of-the-art techniques based on computational time and shot transitions.
Chapter 3 presents a system to detect the potholes in the pathways/roadways
using machine learning and deep learning approaches. It uses HOG (histogram
of oriented gradients) and LBP (local binary pattern) features to enhance the
classification algorithms’ performance.
Chapter 4 aims to explore various feature extraction techniques and shape
detection approaches required for image retrieval. It also discusses the real-time
applications of shape feature extraction and object recognition techniques with
examples.
Chapter 5 describes an approach for texture image classification based on Gray
Level Co-occurrence Matrix (GLCM) features and machine learning algorithms.
Chapter 6 presents an overview of unimodal and multimodal affective computing. It
also discusses the various machine learning and deep learning techniques for affect
recognition. Chapter 7 proposes a deep learning model for content-based image

retrieval. It uses the K-Means clustering algorithm and Hamming distance for faster
retrieval of the image.
Chapter 8 provides a bio-inspired convolutional neural network (CNN)-based
model for COVID-19 diagnosis. A cuckoo search algorithm is used to improve
the performance of the CNN model. Chapter 9 presents convolutional CapsNet for
detecting COVID-19 disease using chest X-ray images. The model obtains fast and
accurate diagnostic results with fewer trainable parameters.
Chapter 10 proposes a deep learning framework for an automated hand gesture
recognition system. The proposed framework classifies the input hand gestures,
each represented by a histogram of oriented gradients feature vector, into a
predefined number of gesture classes.
Chapter 11 presents a new hierarchical deep learning-based approach for semantic
segmentation of 3D point clouds. It involves a nearest neighbor search for local
feature extraction followed by an auxiliary pretrained network for classification.
Chapter 12 presents a model that performs automatic colorization for colored
and grayscale images without human intervention. The proposed model predicts
the colors of new images with good accuracy, close to the real images. In the future,
such automatic colorization techniques can help reveal the details of vintage
grayscale images or movies very clearly.
Chapter 13 proposes a generative adversarial network (GAN) for hyperspectral
image classification. It uses dynamic mode decomposition (DMD) to reduce the
redundant features in order to attain better classification. Chapter 14 presents a brief
introduction to the methodologies used for identifying diabetic retinopathy. It
also uses convolutional neural network models to achieve effective classification
for diabetic retinopathy detection from retinal fundus images.
Chapter 15 proposes a modified differential evolution (DE), best neighborhood
DE (BNDE), to solve discrete-valued benchmark and real-world optimization
problems. The proposed algorithm increases the exploitation and exploration capa-
bilities of DE to reach the optimal solution faster. In addition, the proposed
algorithm is applied to grayscale image enhancement.
Chapter 16 presents an overview of the main swarm-based solutions proposed
to solve problems related to computer vision. It presents a brief description of
the principles behind swarm algorithms, as well as the basic operations of swarm
methods that have been applied in computer vision.
We are grateful to the authors and reviewers for their excellent contributions toward
making this book possible. Our special thanks go to Mary James (EAI/Springer
Innovations in Communication and Computing) for the opportunity to organize this
edited volume.

We are grateful to Ms. Eliška Vlčková (Managing Editor at EAI – European
Alliance for Innovation) for the excellent collaboration.
We hope the chapters presented will inspire researchers and practitioners from
academia and industry to spur further advances in the field.

Coimbatore, Tamil Nadu, India B. Vinoth Kumar
Coimbatore, Tamil Nadu, India P. Sivakumar
Puducherry, Karaikal, India B. Surendiran
Denton, TX, USA Junhua Ding
January 2023
Contents

A Systematic Review on Machine Learning-Based Sports Video
Summarization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Vani Vasudevan and Mohan S. Gounder
Shot Boundary Detection from Lecture Video Sequences Using
Histogram of Oriented Gradients and Radiometric Correlation . . . . . . . . . . . 35
T. Veerakumar, Badri Narayan Subudhi, K. Sandeep Kumar,
Nikhil O. F. Da Rocha, and S. Esakkirajan
Detection of Road Potholes Using Computer Vision and Machine
Learning Approaches to Assist the Visually Challenged. . . . . . . . . . . . . . . . . . . . . 61
U. Akshaya Devi and N. Arulanand
Shape Feature Extraction Techniques for Computer Vision
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
E. Fantin Irudaya Raj and M. Balaji
GLCM Feature-Based Texture Image Classification Using
Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
R. Anand, T. Shanthi, R. S. Sabeenian, and S. Veni
Progress in Multimodal Affective Computing: From Machine
Learning to Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
M. Chanchal and B. Vinoth Kumar
Content-Based Image Retrieval Using Deep Features
and Hamming Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
R. T. Akash Guna and O. K. Sikha
Bioinspired CNN Approach for Diagnosing COVID-19 Using
Images of Chest X-Ray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
P. Manju Bala, S. Usharani, R. Rajmohan, T. Ananth Kumar,
and A. Balachandar


Initial Stage Identification of COVID-19 Using Capsule Networks . . . . . . . . 203
Shamika Ganesan, R. Anand, V. Sowmya, and K. P. Soman
Deep Learning in Autoencoder Framework and Shape Prior for
Hand Gesture Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Badri Narayan Subudhi, T. Veerakumar, Sai Rakshit Harathas,
Rohan Prabhudesai, Venkatanareshbabu Kuppili, and Vinit Jakhetiya
Hierarchical-Based Semantic Segmentation of 3D Point Cloud
Using Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
J. Narasimhamurthy, Karthikeyan Vaiapury, Ramanathan Muthuganapathy,
and Balamuralidhar Purushothaman
Convolution Neural Network and Auto-encoder Hybrid Scheme
for Automatic Colorization of Grayscale Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
A. Anitha, P. Shivakumara, Shreyansh Jain, and Vidhi Agarwal
Deep Learning-Based Open Set Domain Hyperspectral Image
Classification Using Dimension-Reduced Spectral Features . . . . . . . . . . . . . . . . 273
C. S. Krishnendu, V. Sowmya, and K. P. Soman
An Effective Diabetic Retinopathy Detection Using Hybrid
Convolutional Neural Network Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Niteesh Kumar, Rashad Ahmed, B. H. Venkatesh, and M. Anand Kumar
Modified Discrete Differential Evolution with Neighborhood
Approach for Grayscale Image Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Anisha Radhakrishnan and G. Jeyakumar
Swarm-Based Methods Applied to Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . 331
María-Luisa Pérez-Delgado

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
A Systematic Review on Machine
Learning-Based Sports Video
Summarization Techniques

Vani Vasudevan and Mohan S. Gounder

1 Introduction

Sports video summarization is one of the interesting fields of research as it aims
to generate highlights of broadcast video. Usually, broadcast sports
videos are long, and audiences may not have enough time to watch the entire
duration of the game. Sports like soccer (football), basketball, baseball,
tennis, golf, cricket, and rugby are played for 90–180 minutes per
match. Hence, manually creating a summary that contains only the events and exciting
moments of interest pertaining to an individual sport is an intensive human task.
There are several learning and non-learning-based techniques in the literature
that attempt to automate the process of creating such highlight or summary
videos. In addition, in recent years, advances in deep learning techniques
have also contributed remarkable results in sports video summarization.
Figure 1 shows the growing number of publications associated with
“sports video summarization” over the past two decades.
Video summarization techniques [2] have been widely used in many types of
sports. The choice of sports/games for this systematic review is based on
the following criteria: (1) sports with a high audience base (https://www.topendsports.com/world/lists/popular-sport/fans.html), (2) sports with more sponsorship,
(3) sports with more watch views/hours, (4) sports where the research potential is
high with large datasets, (5) sports where the need for technological advancement is
very high, (6) frequency of the occurrence of the game/sport in a year, (7) number
of countries participating in the sports, and (8) number of countries hosting the

V. Vasudevan
Department of CSE, Nitte Meenakshi Institute of Technology, Bengaluru, India
M. S. Gounder
Department of ISE, Nitte Meenakshi Institute of Technology, Bengaluru, India

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
B. V. Kumar et al. (eds.), Smart Computer Vision, EAI/Springer Innovations in Communication and Computing, https://doi.org/10.1007/978-3-031-20541-5_1

[Bar chart: Number of Publications in Sports Video Summarization, 2000–2020]

Fig. 1 Number of publications in sports video summarization from 2000 to 2020. (Data from Google Scholar advanced search with “sports video summarization” OR “sports highlights” anywhere in the article)

[Bar chart: Number of publications by type of popular sport (soccer, basketball, baseball, tennis, golf, cricket, rugby, handball)]

Fig. 2 Number of publications based on types of popular sports videos used to generate video highlights from 2000 to 2020. (Data from Google Scholar advanced search with “sports video summarization” OR “sports highlights” < type of sport > anywhere in the article)

event. Figure 2 shows the publications based on the types of sports videos used
to generate highlights where “type of sport” is substituted with soccer/football,
basketball, baseball, etc. Based on the various criteria considered along with the
number of publications in the literature, we have confined our scope of review to
soccer, tennis, and cricket sports. The rest of this paper is organized as follows.
In Sect. 2, we review the techniques established for sports video summarization
since 2000. Some important ideas, algorithms, and methods that evolved over time
for video highlight generation specific to two popular sports, namely soccer and
cricket, are reviewed in greater depth, with a quick review of other
sports, in Sect. 3. In Sect. 4, the scope of future research, weaknesses in the methods
used, and possible solutions are discussed. We conclude the paper in Sect. 5.

2 Two Decades of Research in Sports Video Summarization

In this section, we review the history of sports video summarization from multiple
aspects, including the techniques established for sports video summarization, the learning
and non-learning techniques applied in sports video highlight generation, and the
evaluation metrics used. A generic architecture of the video summarization process
is shown in Fig. 3. It shows that a long-duration sports video is processed with
the help of various sports video summarization techniques, considering different
factors, to finally generate sports video highlights or a summary. Figure 4 shows
the techniques established in the last two decades in the field of sports video
summarization.

Fig. 3 Generic architecture of sports video summarization

Fig. 4 Techniques established for sports video summarization



The techniques established for sports video summarization are shown in Fig. 4.
According to the literature we reviewed, the techniques are broadly classified
into feature-based, cluster-based, excitement-based, and key event-based approaches.

2.1 Feature-Based Approaches

Most sports events can be summarized based on features like color, motion,
gestures of players or umpires/referees, combinations of audio and visual cues, text
that displays the scores, and objects. For example, soccer and short versions of cricket
are played with different colored jerseys for each team. Recognizing the signals of referees
or umpires involves identifying some key gestures.
The method in [42] proposed a dominant color-based video summarization to
extract the highlights of a sports video. The key frames are extracted based on
color histogram analysis. Such features give additional confidence when combined with
other visual features for key frame extraction; however, this method did not adopt any
such additional visual features to identify the key frames. Still, it shows that the color
factor plays an important role in sports video summarization. Figure
5 shows all such factors that influence sports video summarization. The
dominant field color is one of the major features in sports like cricket, soccer, and other
field games. It can also be used to classify on-field and off-field events, the crowd,
and player or umpire detection [64, 67].
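As a rough illustration of this kind of color-histogram analysis, the sketch below flags candidate key frames whenever the HSV color histogram changes sharply with respect to the last selected key frame; the histogram size, distance measure, and threshold are illustrative assumptions, not the settings used in [42].

```python
import cv2

def histogram_key_frames(video_path, threshold=0.4):
    """Pick key frames where the HSV color histogram changes sharply.

    `threshold` is an illustrative value, not taken from the paper.
    """
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # Coarse 2D hue-saturation histogram, normalized for comparison
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is None or cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            key_frames.append(idx)      # large histogram distance -> candidate key frame
            prev_hist = hist            # compare subsequent frames to this key frame
        idx += 1
    cap.release()
    return key_frames
```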
Motion-based and gesture features are proposed in [63, 67]. Motion is
key in any sport. When camera motion is also considered, the challenge
of extracting the events or key frames becomes more complicated. In [63], the
summarization is more event based and is presented as cinematography. The
authors proposed an interesting method to not only summarize the soccer video
but also identify scenes of intense competition between players and emotional
events. In addition, they also proposed to divide the video into many clips
based on cinematographic features like video production techniques, shot
transitions, and camera motions. Intense competition is identified based on the
movements of players, attack, defense of the goal, etc. The reaction of players or the
crowd is considered for emotional moments. This also accounts for what happened in the
scene, who was involved in the scene, and how the players and audience reacted to
the scene.
The video is converted into segments of semantic shots, and each of
them is then divided into clips based on camera motion. The interest level of each of these clips is
measured based on cinematography and motion features. This work also classifies
the shots as long view, close view, and medium view. Interestingly, a segment is
created based on the semantics in the scene, thus forming a semantic structure of the soccer
video. The factors influencing the summarization in this method, as per Fig. 5, are
events, movement, object/event detection, and camera motion.
Fig. 5 Factors influencing sports video summarization

In [67], the authors introduced a dataset called SNOW, which is used to identify
the pose of umpires in cricket. They identified four important events in cricket,
namely, Six, No Ball, Out, and Wide Ball, based on umpire pose or gesture
recognition. Pretrained convolutional networks such as Inception V3 and VGG19
have been used for extracting the features. The classification of the poses is based
on an SVM. The authors have attempted to create a database for public use, and
it has been made available online for download. Some of the factors influencing the video
summarization in this method are visual cues and object detection.
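A minimal sketch of the pretrained-CNN-plus-SVM pipeline described above, using Keras' InceptionV3 as a frozen feature extractor and scikit-learn's SVC as the pose classifier; the input size, kernel, and label names are assumptions for illustration, not the exact configuration used with the SNOW dataset.

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from sklearn.svm import SVC

# Frozen ImageNet-pretrained backbone used purely as a feature extractor
backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):
    """images: array of shape (N, 299, 299, 3), RGB order."""
    return backbone.predict(preprocess_input(images.astype("float32")), verbose=0)

def train_pose_classifier(X_train, y_train):
    """X_train: umpire frames; y_train: hypothetical gesture labels such as
    'six', 'no_ball', 'out', 'wide' (illustrative names, not the dataset's)."""
    clf = SVC(kernel="linear")            # linear SVM over the CNN features
    clf.fit(extract_features(X_train), y_train)
    return clf
```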
Another interesting feature used in many of the video summarization methods
is audio [9, 35, 44, 45, 66, 92]. From Fig. 5, it is evident that audio features
like commentator’s voice and crowd cheering are key factors influencing the
video summarization. In [9], an audiovisual-based approach has been presented.
The audio signal’s instantaneous energy and local periodicity are measured to
extract highlighted moments in the soccer game. In the work proposed by [35],
commentator’s voice, referee’s whistle, and crowd voice are considered to find
the exciting events. The events related to soccer games are goal, penalty shootout,
red card, etc. The authors also considered the audio noise during such events like
musical instruments and applied Empirical Mode Decomposition to filter them.
Another method proposed by [45] also applies audiovisual features to extract key
frames from sports video. The audio features considered here are the excitement
events identified by spikes in the signal due to crowd cheering. In addition,
visual features such as scorecard detection have been obtained using deep learning
methods. As in [9], the influencing factors are visual cues and crowd audio. The
authors of [44] proposed an interesting method to detect highlights based on
referee whistle sound detection. A band-pass filter is designed to accentuate the
whistle sound of the referee and suppress any other audio events. A decision rules-
based [1, 39, 61, 84] time-dependent threshold method is applied to detect
the regions where the whistle sound occurs. The authors used an extensive 12-hour
test signal from various soccer, football, and basketball games. Like in [9, 35,
44, 45, 83], the method proposed in [66] employs audio features such as the spectrum of the
signal during key events like a goal. This is applied on top of key event detection
using visual and color features. Some of the factors used in this method
are audio, color, visual cues, replays, excitement, batting and bowling shots [3], and
player detection. This method is strongly dependent on the video production
style, and its detection accuracy is only 70%. A time-frequency-based feature
extraction is used to calculate local autocorrelations on complex Fourier values in
[92]. The extracted features are then used to detect exciting scenes. The authors
considered environmental noise and showed that the method is robust and
performs well. However, the commentator’s voice is not considered.
Looking carefully at all the methods that used audio as one of the features,
there is certainly scope for gaining additional confidence in extracting or identifying the key
frames or key events in a sports video. The majority of them focused on identifying
events based on crowd or spectator cheering. Though additional noise is present,
some of the methods like [44, 66, 92] have applied techniques to deal with it.
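To make the audio-based idea concrete, the sketch below band-pass filters the audio track (around a presumed whistle band) and flags segments whose short-time energy spikes well above the median, loosely in the spirit of [9, 44]; the band edges, window length, and threshold factor are illustrative assumptions rather than values from those papers.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def exciting_segments(audio, fs, band=(2000.0, 4000.0), win_s=0.5, factor=3.0):
    """Return (start, end) times in seconds whose band-limited short-time
    energy exceeds `factor` times the median energy.

    `band`, `win_s`, and `factor` are illustrative, not values from the
    papers discussed above.
    """
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, audio)
    win = int(win_s * fs)
    n = len(filtered) // win
    energy = np.array([np.sum(filtered[i * win:(i + 1) * win] ** 2) for i in range(n)])
    thresh = factor * np.median(energy)
    return [(i * win_s, (i + 1) * win_s) for i, e in enumerate(energy) if e > thresh]
```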

2.2 Cluster-Based Approaches

The clustering-based methods work by clustering similar frames or shots and
then processing these clusters as required. In [13], a Fuzzy C-Means clustering
method is applied to cluster video frames based on color features. A shot detection
algorithm is also used to find the number of clusters. Through this method, the authors
attempted to improve the computation speed and accuracy of video
summarization. Another method [54] attempted to develop a hierarchical summary
based on a state transition model. In addition, the authors also used other cues like
text, audio, and expert choice to improve the accuracy of the proposed algorithm. This
method uses visual, text, and pitch cues as the factors of influence (Fig. 5). In [50],
a neuro-fuzzy-based approach has been proposed to segment the shots. The content
of the shots is identified by semantically meaningful slots or windows. Hierarchical
clustering of the identified windows provides a textual summary to the user based on
the content of the video in the shot. The method claims to generate textual annotations
of videos automatically. It is also used to compare the similarity between any two
video shots. In [79], a statistical classifier based on Gaussian mixture models with
an unsupervised model is proposed. This method mainly adopts audio features to
find the mismatch between test and pretrained models, which is also discussed in
Sect. 2.3.
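A minimal NumPy sketch of Fuzzy C-Means clustering that could be applied to per-frame color histograms in the spirit of [13]; the number of clusters, fuzzifier m, and stopping tolerance are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c=5, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Cluster the rows of X (e.g., per-frame color histograms) into c fuzzy clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)          # memberships of each frame sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # distances between every cluster center and every frame descriptor
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-10
        # standard FCM membership update: u_ji = 1 / sum_k (d_ji / d_ki)^(2/(m-1))
        U_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1)), axis=1)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return centers, U      # U[j, i] = membership of frame i in cluster j
```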

2.3 Excitement-Based Approaches

All sports and games have some moments that can be identified as moments
of excitement. These moments could be part of the players’ reactions, the crowd’s reaction,
the referee’s actions, or even the commentator’s reaction. The players’ reactions and
expressions include high-fives, fist pumps, and aggressive, tense, and smiling expressions. The
crowd’s and the commentator’s excitement can be identified by the energy of the audio signal
or the tone of the commentators. Some of the works already reported
based on audio [9, 35, 44, 45, 66, 92] use these features to extract key events.
In addition to them, works like [59, 75, 79] exploit
such excitement-based features to identify key events that eventually contribute to
summarizing the sports video. In [58], the authors have used multiple features to identify
the key frames. Information from players’ reactions and expressions, spectators’
cheers, and the commentator’s tone is used to identify key events. It has been found that
these methods are applied to summarize sports like tennis and golf. The excitement-
based highlight generation reported in [75] considers the secondary events in a
cricket match, like dropped catches and pressure moments, based on a strategy that
includes the loudness of the video, the category associated with the primary event, and replays.
Player celebration detection, excitement of the crowd, appeals, and some
intense commentaries are considered as excitement features. The method has been
extensively tested on cricket videos. Another method that exploits the commentator’s
speech is proposed in [79]. The method uses a statistical classifier based on Gaussian
mixture models with unsupervised model adaptation. The acoustic mismatch between
the training and testing data is compensated using a maximum a posteriori adaptation
method.
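A hedged sketch of GMM-based excited-commentary scoring, loosely following the idea in [79]: one mixture model is trained on features of excited speech and one on neutral speech, and each audio segment is scored by an average log-likelihood ratio. The choice of features (e.g., MFCCs) and the number of mixture components are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_excitement_models(excited_feats, neutral_feats, n_components=8):
    """Each *_feats is an (N, d) array of per-frame audio features (e.g., MFCCs)."""
    gmm_excited = GaussianMixture(n_components=n_components,
                                  covariance_type="diag").fit(excited_feats)
    gmm_neutral = GaussianMixture(n_components=n_components,
                                  covariance_type="diag").fit(neutral_feats)
    return gmm_excited, gmm_neutral

def excitement_score(segment_feats, gmm_excited, gmm_neutral):
    """Average log-likelihood ratio; positive values suggest excited commentary."""
    return float(np.mean(gmm_excited.score_samples(segment_feats)
                         - gmm_neutral.score_samples(segment_feats)))
```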

2.4 Key Event-Based Approaches

Every sport has its own list of key events, and summarization can be carried out
based on such key events. This obviously allows viewers to get the
most exciting events of the sport of their choice. For example, soccer
has key events such as goals, fouls, shots, etc. There are a substantial number of
publications [4, 7, 34, 38, 39, 41, 48, 65, 75, 76, 81, 86, 91] that address video
summarization based on key events. In [4], an unsupervised
framework for soccer goal event detection using external textual sources, typically
reports from sports websites, has been proposed. Instead of segmenting the
actual video based on the visual and aural contents [73], this method claims to
be more efficient since noneventful segments are discarded. The method seems
very promising and can be applied to any sport that has live text coverage
on websites. An approach based on language-independent, multistage
classification is employed for the detection of key acoustic events in [7]. The method
has been applied to rugby. Though the method is similar to most approaches using
audio features, it differs in the way it treats audio events independently
of language. A hybrid approach based on learning and non-learning methods has
been proposed in [34] to automatically summarize sports video. The key events are
goals, fouls, shots, etc. An SVM-based method is applied for shot boundary detection,
and a view classification algorithm is used for identifying game-field intensities and
player bounding box sizes. In [39], automatic sports video summarization
based on key events in replays has been proposed. As shown in Fig. 5, the factor
that influences this work is the replay. The frames corresponding to the replays are
enclosed between gradual transitions. A thresholding-based approach is employed
to detect these transitions. For each key event, a motion history image is generated
by applying a Gaussian mixture model. A trained extreme learning machine (ELM)
classifier is used to learn the various events for labeling the key events, detecting
replays, and generating the game summarization. They applied this method to four
different sports containing 20 videos. In contrast to event detection within the game,
a method has been proposed in [41] to classify crowd events by categorizing
the video contents into marriage, cricket, shopping mall, and Jallikattu. The method
applies a deep CNN that learns features from the training set of data. However,
the method is good at classifying the events into a labeled outcome. Interestingly, a
more customizable highlight generation method is proposed in [48]. The videos are
divided into semantic slots, and then importance-based event selection is employed to
include the important events in the highlights. The authors considered cricket
videos for highlight generation. Again, the work proposed in [75] exploits audio
intensity in addition to replays, player celebrations, and playfield scenarios as key
events. Further, player stroke segmentation [26–29] and compilation in cricket
can be used for highlight generation specific to a player in a match. A more
general analysis of various computer vision systems from the soccer video semantics
point of view is given in [81]. The interpretation of the scene is based on the complexity of
the semantics. This work investigates and analyzes various approaches.
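A hedged sketch of the thresholding idea used to detect the gradual transitions that enclose replays, as in [39]: frame differences that stay above a low threshold are accumulated, and a transition is declared when the accumulated change exceeds a high threshold. Both thresholds and the difference measure are illustrative assumptions.

```python
def gradual_transitions(frame_diffs, t_low=0.08, t_high=0.35):
    """Twin-comparison style detection of gradual transitions.

    frame_diffs: sequence of normalized histogram differences between
    consecutive frames. t_low and t_high are illustrative thresholds.
    Returns (start, end) frame-index pairs of candidate gradual transitions.
    """
    transitions, start, acc = [], None, 0.0
    for i, d in enumerate(frame_diffs):
        if start is None:
            if d > t_low:              # potential start of a gradual transition
                start, acc = i, d
        else:
            if d > t_low:
                acc += d               # keep accumulating while differences stay raised
            else:
                if acc > t_high:       # accumulated change large enough -> transition
                    transitions.append((start, i))
                start, acc = None, 0.0
    return transitions
```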

2.5 Object Detection

Object detection is one of the important computer vision tasks applied in video
summarization. Techniques used for detecting the objects in each image or frame
have gone through remarkable breakthroughs. Hence, it is important to
understand the evolution as well as the state-of-the-art techniques used for detecting
the objects present in an image. This covers a wide range of techniques, from
simple histogram-based techniques to complex, computationally intensive deep
learning techniques. The techniques that evolved over a period of two decades
in turn address challenges [96] in object detection, which include but are not limited
to the following aspects: objects under different viewpoints, illuminations, and
intraclass variations; object rotation and scale changes; accurate object localization;
dense and occluded object detection; and speeding up detection. Figure 6 shows
the predominant object detection techniques (object detectors) that evolved over two
decades, including the latest developments in 2021.

Fig. 6 Evolution of object detectors in two decades

Between 2000 and 2011, that is, before the rebirth of the deep convolutional
neural network, more subtle and robust handcrafted techniques were applied to detect
the objects present in a frame or an image. These are referred to as traditional object
detectors. With limited computing resources, researchers made remarkable
contributions to detecting objects based on handcrafted features. Between 2001 and
2004, Viola and Jones achieved real-time detection of human faces with the help of a
Pentium III CPU [87]. This detector was named the VJ detector and works on the
sliding window concept. The VJ detector improved its detection performance and
reduced computational overhead through the integral image, which uses Haar wavelets,
feature selection with the help of the AdaBoost algorithm, and detection cascades, a
multistage detection paradigm that spends more computation on face targets than on
background windows [88, 96]. This approach can certainly contribute to player
detection in any sports video. In 2005, the histogram of oriented gradients (HOG)
detector was proposed by Dalal and Triggs [11]. It was another important milestone
as it balances feature invariance with discriminative power. To detect objects of various
sizes, the HOG detector rescales the input frame or image multiple times while keeping
the detection window size the same. The HOG detector has been one of the important
object detectors used in various computer vision applications, including sports video
processing. Between 2008 and 2012, the Deformable Part-based Model (DPM) and
its variants were the peak of the object detectors that evolved in the traditional object
detector era. DPM was proposed by Felzenszwalb in 2008 [18] as an extension to
HOG, and its variants were later proposed by Girshick. DPM uses a divide-and-conquer
approach, where the model is learned by decomposing an object into parts, and
detection ensembles the decomposed object parts to form a complete object. The
model comprises a root filter and many part filters. This model has been further
enriched [16, 17, 21, 22] to deal with real-world objects with significant variations.
A weakly supervised learning method was developed in DPM to learn all the
configurations of part filters as latent variables. This has been further formulated as
Multi-Instance Learning, and some important techniques such as hard negative mining,
bounding box regression, and context priming were also applied to improve the
detection performance [16, 21].
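As a concrete example of a traditional detector, OpenCV ships a pretrained HOG-plus-linear-SVM pedestrian detector that can serve as a simple baseline for player detection in sports frames; the stride, padding, and scale values below are common defaults rather than settings from [11].

```python
import cv2

# OpenCV's pretrained HOG + linear SVM pedestrian detector
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_players(frame):
    """Return bounding boxes (x, y, w, h) of person-like regions in a BGR frame."""
    rects, weights = hog.detectMultiScale(frame, winStride=(8, 8),
                                          padding=(8, 8), scale=1.05)
    return rects
```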

From 2012 onwards, the deep learning era began, with the ability to learn high-
level feature representations of an image [50] and the availability of the necessary
computational resources. The region-based CNN (RCNN) [24] was then proposed and
became a breakthrough in object detection with the help of deep learning
models. In this era, there were two genres of object detection, namely, two-stage
detection, with a coarse-to-fine process, and one-stage detection, which completes the process
in a single step [96]. RCNN extracts a set of object proposals by selective search.
Each proposal is then rescaled to a fixed-size image and given as input to a CNN
model based on AlexNet [50] to extract features. In the end, linear SVM classifiers
are used to predict the presence of an object within each region. RCNN achieved a
significant performance improvement over DPM. Even though RCNN made
significant improvements, it had the drawback of redundant feature computations on
many overlapping proposals, which led to slow detection speed even with a GPU. In the
same year, the Spatial Pyramid Pooling Network (SPPNet) [33] model was proposed
to overcome this drawback. Earlier CNN models required a fixed-size input, for
example, a 224x224 image for AlexNet [51]. In SPPNet, the Spatial Pyramid Pooling
layer generates a fixed-length representation regardless of the size of the
image or region of interest, without rescaling it. SPPNet proved to be more
than twenty times faster than RCNN, as it avoids redundancy while computing the
convolutional features. Though SPPNet improved the detection performance
in terms of speed, it had drawbacks: training was still multistage, and it
only fine-tuned its fully connected (FC) layers. In 2015, Fast RCNN [23] was
proposed to overcome the drawbacks of SPPNet. Fast RCNN trains a detector and
a bounding box regressor simultaneously under the same network configuration.
This improved the detection speed to 200 times faster than RCNN. Though there was
an improvement in detection speed, it was still limited by the proposal detection step.
This led to the proposal of Faster RCNN [71], where object proposals are generated
with a CNN model. Faster RCNN was the first end-to-end and near real-time object
detector. Even though Faster RCNN overcame the drawback of Fast RCNN, there
was still computational redundancy, which led to further developments, namely, RFCN
[10] and Light-Head RCNN [56]. In 2017, the Feature Pyramid Network (FPN) [57]
was proposed. It uses a top-down architecture to build high-level semantics at all
scales. FPN has become a basic building block of the latest detectors.
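A hedged sketch of running an off-the-shelf two-stage detector, torchvision's COCO-pretrained Faster R-CNN with a ResNet-50 FPN backbone, on a single video frame; the confidence threshold is an assumption, and COCO class 1 (person) is used as a stand-in for player detection.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# COCO-pretrained Faster R-CNN with a ResNet-50 FPN backbone
# (older torchvision versions use pretrained=True instead of weights="DEFAULT")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

@torch.no_grad()
def detect_persons(frame_rgb, score_thresh=0.7):
    """frame_rgb: HxWx3 uint8 RGB image. Returns person boxes as (x1, y1, x2, y2)."""
    pred = model([to_tensor(frame_rgb)])[0]
    keep = (pred["labels"] == 1) & (pred["scores"] > score_thresh)  # COCO label 1 = person
    return pred["boxes"][keep].cpu().numpy()
```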
Meanwhile, in 2015, You Only Look Once (YOLO) was proposed. YOLO was
the first one-stage detector in the deep learning era. It followed an entirely different
approach from the previously evolved models: it applies a single neural network to
the full image. The network divides the image into regions and predicts bounding
boxes and probabilities for each region simultaneously. Despite its improvements,
YOLO suffers from a drop in localization accuracy compared with two-stage
detectors, especially for detecting small objects. Based on the initial model, a series
of improvements [8, 68–70] were proposed that further improved the detection
accuracy and speed. At almost the same time, the Single Shot MultiBox
Detector (SSD) [89], a second one-stage detector, evolved. The main
contribution of SSD was to introduce multi-reference and multi-resolution detection
techniques, which significantly improved the detection accuracy for some small objects.
Despite their high speed and simplicity, one-stage detectors lacked the accuracy
of two-stage detectors, and hence RetinaNet [58] was proposed. RetinaNet focused
on the foreground–background class imbalance issue by introducing a new loss
function called focal loss. It reshapes the standard cross-entropy loss to put
more focus on hard, misclassified samples during training. With focal loss, RetinaNet
achieved accuracy comparable to two-stage detectors while maintaining a very high
detection speed. Deep learning models continue to evolve [55, 77, 94, 95] by considering
both detection accuracy and speed. From the object detectors that have evolved over the
last two decades, and especially with the rebirth of deep learning models, it is now
possible to choose and apply appropriate detectors to solve most computer
vision-based problems, including sports video summarization.
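The focal loss idea behind RetinaNet, sketched for binary classification in PyTorch; alpha = 0.25 and gamma = 2 follow the commonly cited defaults, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples and focuses on hard ones.

    logits: raw scores; targets: 0/1 tensor of the same shape.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```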

2.6 Performance Metrics

Most of the research works in sports video summarization have used the following
objective metrics to evaluate the constructed models’ performance.

2.6.1 Objective Metrics

1. Accuracy: Represents the ratio of the correctly labeled replay/non-replay frames or the key events/non-key events to the total number of frames or events [36, 47, 50, 60].

   Accuracy = (TP + TN) / (P + N)

   where TP: True Positive, TN: True Negative, P: Positive, N: Negative


2. Error: Represents the ratio of mislabeled replay frames (both FP and FN) to the total number of frames [36, 47].

   Error = (FP + FN) / (P + N)

   where FP: False Positive, FN: False Negative


3. Precision: Represents the ratio of correctly labeled frames to the total detected frames [4, 32, 34, 60, 90, 93].

   Precision = TP / (TP + FP)

4. Recall: Represents the ratio of true detections of frames to the actual number of frames [4, 32, 34, 59, 90, 93].

   Recall = TP / (TP + FN)

5. F1-Score: The harmonic mean of precision and recall. It is computed because some methods have higher precision and lower recall, or vice versa [35, 46, 54].

   F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

6. Confusion Matrix (CM): Represents predicted positives and negatives against the actual positives and negatives present in the chosen dataset. It is a highly recommended model evaluation tool in the literature. In [32], a confusion matrix with goal, foul, shoot, and non-highlight events was used to compute precision and recall percentages.
7. Receiver Operating Characteristic (ROC) curve: True Positive Rate (TPR) vs. False Positive Rate (FPR).
   It is desirable to use ROC curves when evaluating binary decision problems, as they show how the number of correctly classified positive examples varies with the number of incorrectly classified negative examples [12].
   As the False Positive Rate (FPR) increases (i.e., more non-highlight plays are allowed to be classified incorrectly), it is desirable that the True Positive Rate (TPR) increases as quickly as possible (i.e., the derivative of the ROC curve is high) [19].
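All of the objective metrics above reduce to simple functions of the TP/FP/TN/FN counts of a confusion matrix; the helper below is a small illustrative sketch rather than code from any of the cited works.

```python
def summary_metrics(tp, fp, tn, fn):
    """Compute the objective metrics listed above from raw counts."""
    p, n = tp + fn, tn + fp                    # actual positives and negatives
    accuracy = (tp + tn) / (p + n)
    error = (fp + fn) / (p + n)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "error": error,
            "precision": precision, "recall": recall, "f1": f1}
```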
Other than the above listed objective metrics, which were predominantly used in
the last two decades, the following user experiences (subjective metrics) were also
used as an alternative performance evaluation metric in most of the sports video
summarization works.

2.6.2 Subjective Metrics Based on User Experience

1. The quality of each summary is evaluated in seven levels: extremely good, good,
upper average, average, below average, bad, and extremely bad [64].
2. Mean Opinion Score (MOS): Considering the following user experience rating,
(i) the overall highlights viewing experience is enjoyable, entertaining, and
pleasant and not marred by unexciting scenes, (ii) the generated scenes do
not begin or end abruptly, and (iii) the scenes are acoustically and/or visually
exciting.
3. Human vs. system detected shots (closeup, crowd, replay, sixer) [5].
4. Normalized discounted cumulative gain (nDCG) metric, which is a standard retrieval
measure computed as follows:

   nDCG(k) = (1/Z) * sum_{i=1..k} (2^rel_i − 1) / log2(i + 1)

where rel_i is the relevance score assigned by the users to clip_i, and Z is a
normalization factor ensuring that the perfect ranking produces an nDCG score of
1.
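A direct transcription of this formula, assuming the normalization factor Z is the DCG of the ideal ranking (relevance scores sorted in decreasing order):

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """relevances: user-assigned relevance scores of the returned clips, in ranked order."""
    def dcg(scores):
        scores = np.asarray(scores, dtype=float)[:k]
        return np.sum((2.0 ** scores - 1.0) / np.log2(np.arange(2, scores.size + 2)))
    ideal = dcg(sorted(relevances, reverse=True))   # normalization factor Z
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```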

3 Evolution of Ideas, Algorithms, and Methods for Sports


Video Summarization

In this section, the evolution of some video summarization ideas, algorithms,
and methods (learning and non-learning) over a period of two decades
with respect to two popular sports, cricket and soccer, is reviewed and summarized
in Tables 1, 2, and 3, respectively. In addition, a quick glimpse of the ideas,
algorithms, and methods proposed for other sports (Table 4), including rugby, tennis,
baseball, basketball, volleyball, football, golf, snooker, handball, hockey, and ice
hockey, is also given. In a nutshell, most of the algorithms used seem to be
based on feature extraction and key event detection. Also, most of the methods
can be classified as learning or non-learning based. Non-learning methods are mostly
used in the preprocessing, key event detection, and highlight generation stages, whereas
learning methods are used in the shot boundary classification, shot view classification,
and feature extraction stages. Notably, in the last decade (2012 to present), that
is, since the rebirth of the convolutional neural network (CNN), almost all the stages in
sports video summarization or video highlight generation are efficiently handled by
deep learning models and their variants.

4 Scope for Future Research in Video Summarization

Some of the works [34, 63, 74] that are specific to cricket and soccer video
summarization have several weaknesses, or certain aspects are not addressed properly. In
this section, the weaknesses and the scope for future research are discussed based on
the outcomes of selected papers. Section 4.1 groups the weaknesses under certain
categories. Section 4.2 highlights the scope for further research in sports video
summarization.

Table 1 Ideas that evolved over a period in sports video summarization


Study   Year of publication   Major idea   Type of sports
[14] 2003 Framework based on cinematic and Cricket
object-based features
[48] 2006 Extracted events and semantic concepts Cricket
[49] 2008 Caption analysis and audio energy level Cricket
based
[62] 2009 HMM-based approach Cricket
[53] 2010 Priority curve based Cricket
[92] 2010 Time-frequency feature extraction to detect Cricket
excitement scenes in audio cues
[9] 2010 Personalized summarization of soccer sport Cricket
using both audio and visual cues
[80] 2010 Excited commentator speech detection with Cricket
unsupervisory model adaptation
[54] 2011 Automated highlight generation Cricket
[80] 2011 Unsupervised event detection framework Cricket
[93] 2011 Machine learning based Soccer
[90] 2011 Logo detection and replay sequence Soccer
[6] 2012 SVM-based shot classification Soccer
[46] 2013 Framework in encoded MPEG video Soccer
[42] 2013 Dominant color-based extraction of key Cricket
frames
[4] 2013 Framework for goal event detection through Cricket, soccer,
collaborative multimodal (textual cues, rugby, football, and
visual, and aural) analysis tennis
[63] 2014 Cinematography and motion analysis Soccer
[72] 2015 Annotation of cricket videos Cricket
[7] 2015 Key acoustic events detection Rugby
[30] 2015 Automatic summarization of hockey video Soccer
[5] 2015 Multilevel hierarchical framework to Soccer
generate highlights by interactive user input
[47] 2015 Bayesian network based Soccer
[66] 2015 Audiovisual descriptor based Soccer
[39] 2016 Learning and non-learning-based approach Cricket
[43] 2017 Real-time classification Soccer
[78] 2017 Parameterized approach with end-end Soccer
human-assisted pipeline
[15] 2017 Hidden-to-observable transferring Markov Soccer
model
[37] 2017 Court aware Volleyball
[19] 2018 CNN-based classification Cricket
[31] 2018 CNN-based approach to detect no balls Cricket
[75] 2018 Event-driven and excitement based Cricket
[82] 2018 Deep players’ action recognition features Soccer
[27] 2019 Cricket stroke dataset creation Cricket
[52] 2019 Outcome classification in cricket using deep Cricket
learning
[36] 2019 Classify bowlers Cricket
[60] 2019 AlexNet CNN-based approach Soccer
[41] 2019 CNN-based crowd event detection Cricket
[35] 2019 Decomposed audio information Soccer
[40] 2019 Confined elliptical local ternary patterns and Cricket, tennis,
extreme learning machine baseball, and
basketball
[59] 2019 Multimodal excitement features Golf and tennis
[26] 2020 Cricket stroke localization Cricket
[64] 2020 Transfer learning for scene classification Soccer
[34] 2020 Hybrid approach Soccer
[45] 2020 Content aware summarization—audiovisual Cricket, soccer,
approach rugby, basketball,
baseball, football,
tennis, snooker,
handball, hockey, ice
hockey, and
volleyball
[32] 2020 Multimodal multi-labeled extraction Soccer

4.1 Common Weaknesses of Existing Methods

In this section, the existing methods and their weaknesses are highlighted based on
the categories mentioned below.

4.1.1 Audio-Based Methods

– Audio excitement of audiences or spectators may create noise over the commentator’s speech. The noise level of the audience will sometimes be mixed with instrument sounds, or the spectrum may remain high for more than a specific time. Also, the commentator’s voice may be masked by the audience’s cheers. This can be a considerable challenge to address.
– Some methods depend on speech-to-text, whose accuracy is directly tied to Google or other similar APIs. Other services like Microsoft Cognitive Services or IBM Watson are not attempted in the reviewed works. Natural language processing to extract the players’ conversations is also not addressed in any of the works.
– Many of the methods do not state the exact number of training and testing samples used when the audio features are considered.

Table 2 Notable research work in cricket sport video summarization


Study Algorithms Methods Output
[36] 1. Transfer learning 1. Pretrained VGG16 to build the 1. Classify the bowlers
classifier based on action
2. Created a dataset
containing bowlers
[13] 1. Dominant color Naïve Bayes classifier – 1. All slow-motion
region detection Learning Method for Short segments in a game
2. Robust shot boundary Classification and non-learning based on cinematic
detection methods for other algorithms features
3. Shot classification 2. All goals in a game
4. Goal detection based on cinematic
5. Referee detection features
6. Penalty box detection 3. Slow-motion
segments classified
according to
object-based features
[31] 1. Transfer learning 1. Inception V3 CNN for transfer Classified results as “no
2. Video resizing learning ball” or not
2. SoftMax activation function
and SVM used for high-level
reasoning
[47] 1. Hierarchical 1. Semantic base rule Summarized video
feature-based classifier 2. Importance ordering and based on the events and
2. Finding concept and video pruning of concepts concepts
event rank 3. Importance ordering and
video pruning of events
4. Temporal ordering of events
and concepts
[27] 1. Temporal localized 1. Grayscale histogram A dataset of videos
stroke segmentation difference feature to detect shot containing strokes
2. Shot boundary boundaries played by batsmen
detection 2. Cut predictions using random
3. Cut predictions forest
3. SVM for finding the first
frames
4. Machine learning algorithm to
extract video shots for first frame
HOG features
[48] 1. Sum of absolute 1. Caption recognition Summarized video
difference model for 2. Event detection for excitement using the events and
caption recognition clips captions
2. Short-time zero 3. Performance measure of event
crossing for estimating detection and caption recognition
spectral properties of
audio
[62] 1. Hidden Markov 1. Shot boundary detection and Summarized video
model for state key frame extraction based on based on the features of
transition color changes color and excitement
2. Shot classification using view
classification probabilities
[93] 1. Acoustic feature Mel bank filtering, local Highlight scene
extraction autocorrelation on complex generation
2. Highlight scene Fourier values – non-learning
detection methods
Complex Subspace Method –
Unsupervised Learning
[75] 1. Event detection 1. Frame difference for video Video summary with
2. Video shot shot important events
segmentation 2. CNN + SVM framework for
3. Replay detection replay detection
4. Scoreboard detection 3. Pretrained AlexNet for OCR
5. Playfield scenario 4. CNN + SVM for classifying
detection frames
5. Audio cues for excitement
detection
6. AlexNet for player celebration
[9] 1. Highlighted moment 1. Hot spot/special moment Resource constrained
detection through audio detection based on two acoustic summarization based on
cues features user’s narrative
2. Shot (or clip) 2. View type subsequence preference
boundary matching
detection/video 3. Lagrangian optimization and
segmentation convex-hull approximation –
3. Sub-summaries non-learning methods
detection
[80] 1. Excited speech 1. Gaussian mixture models Event highlights
segmentation through 2. Unsupervised model generated based on
pretrained pitched adaptation – average excited speech score
speech segment log-likelihood ratio score
2. Excited speech 3. Maximum a posteriori
detection adaptation – learning methods
[81] Event detection 1. Unsupervised event discovery Video highlights of
Highlight clip detection based on color histogram of cricket
oriented gradients
2. Supervised phase trains SVM
from clips labeled as highlight or
non-highlight
[10] 1. Shot boundary 1.Computation of convex hull of Collection of
detection the benefit/cost curve of each nonoverlapping
2. Video segmentation segment sub-summaries under
3. Candidate 2. Lagrangian relaxation – the given
sub-summaries non-learning methods user-preferences and
preparation duration constraint.
4. Metadata extraction
[54] 1. Pitch segmentation 1. Temporal segmentation to Match summarization
using K-means detect boundaries and wickets with semantic results
2. SVM classifier to 2. Replay detection using Hough like batting, bowling,
recognize digits transform-based tracking boundary, etc.
3. Finite state 3. Ad detection using transitions
automation model based 4. Camera motion using KLT
on semantic rules method
5. Scene change using hard cut
detection
6. Crowd view detection using
textures
7. Boundary view detection
using field segmentation
[39] 1. Excitement detection 1. Rule-based induction to find Summarized video with
2. Key events detection excited clips key events
3. Decision tree for 2. Score caption region using
video summarization temporal image averaging
3. OCR to recognize the
characters
4. Gradual transition detection by a dual
threshold-based method
[41] Event recognition CNN (baseline and VGG16) to Classification of crowd
detect predefined events video into four classes:
marriage, cricket,
Jallikattu, and shopping
mall
[42] Playfield and 1. Color histogram analysis Extracted key frame
non-playfield detection 2. Extraction of dominant color
frames-thresholding hue values –
non-learning methods
[26] 1. Construction of two 1. Pretrained C3D model with Two cricket strokes
learning-based GRU training datasets
localization pipelines 2. Boundary detection with first
2. Boundary detection frame classification
3. Modified weighted mean
TIoU for single category
temporal localization problem
[72] 1. Video shot 1. K-means clustering to build Annotated video clips
recognition visual vocabulary containing events of
2. Shot classification 2. Shot representation by bag of interest
3. Text classification words
3. Classification using multiclass
Kernel SVM
4. Linear SVM for bowler and
batsman category
[52] 1. Jittered sampling 1. Pretrained VGGNet is used on Automatically
2. Temporal ImageNet dataset for transfer generated commentary.
augmentation learning Classify the outcome of
3. Training; hyper 2. LRCN to classify the each ball as run, dot,
parameter tuning ball-by-ball activities boundary and wicket
[19] Video classification of 1. Adam optimizer for training Classified shots of
cricket shots the model cricket video that
2. CNN model with 13 layers for belongs to cut shot,
classification cover drive, straight
drive, pull shot, leg
glance shot, and scoop
shot
[53] 1. Play break detection 1. Block creation Summarized video
2. Event detection 2. Thresholding the duration based on the priorities
through visual, audio, between continuous long shots block merging
and text cues 3. Detecting low-level features
3. Peak detection (find (occurrence of replay scene, the
similar events) excited audience or commentator
speech, certain camera motion or
certain highlight event-related
sound, and crowd excitement)
4. Grass pixel ratio to detect the
boundaries in cricket
5. Audio feature extraction:
Root mean square volume
Zero crossing rate
Pitch period
Frequency
Centroid
Frequency bandwidth
Energy ratio
6. Priority assignment – non
learning methods
7. SVM to identify text line from
frame
8. Optical Character Recognition
(OCR) for text recognition –
learning methods

4.1.2 Shot and Boundary Detection

– The motivation for choosing some of the core classifiers, such as the HRF-DBN in [74] for labeling each shot and the RF classifier for dividing the shots, is not clearly given.
– Umpire jerseys and their colors are one of the key elements in detecting the umpire frames. Though it appears to be a straightforward approach, there is no discussion of the challenges faced while segmenting the frames, for example, the color variation of jerseys due to different light intensities [74].

Table 3 Notable research work in soccer sport video summarization


Study Algorithms Methods Output
[5] For the video clip 1. RGB color histogram for Detected features and
1. Key frame detection frame boundary detection event close-up, replay,
2. Frame extraction 2. RGB to HSV for key crowd, and sixer on 292
3. Replay frame detection frame detection frames from 4 min
4. Event frame detection 3. Visual features (grass 52 seconds video
5. Crowd frame detection pixel ratio, edge pixel
6. Close-up frame ratio, skin color ratio) for
detection shot classification
7. Sixer detection 4. Haar wavelets for
close-up detection
5. Edge detection from
YCbCr converted frame
for crowd detection
6. Black pixel percentage
measure to detect sixer
7. Sliding window for
event detection –
non-learning methods
[6] 1. Dominant color SVM classifier – learning Classified long, medium,
extraction method and infield and outfield
2. Connected components close-up shots
(players), middle rectangle
and vertical strips, two
horizontal strips features
extraction
[46] 1. Low-level visual 1. Feature extracted using Summarized MPEG-1
information extraction non-learning methods video
2. Grass modeling and 2. Hierarchy of SVM
detection classifiers – learning
3. Camera motion method
estimation
4. Shot boundary detection
(a) Abrupt transition
detection
(b) Dissolve transition
detection
(c) Logo transition
modeling and detection
5. Shot-type classification
6. Playfield zone
classification
7. Replay detection
8. Audio analysis
9. Ranking
[47] 1. Extraction of exciting 1. Non-learning methods Generated highlights based
clips from audio cues using to detect different views on the selected labeled
short time audio energy 2. Bayes Belief Network clips based on the degree
algorithm (to assign semantic of importance
2. Event detection and concept labels to the
classification (annotation) exciting clips: goals, saves,
using hierarchical tree yellow cards, red cards,
3. Exciting clip selection and kicks in video
4. Temporal ordering of sequence) – learning
selected exciting clips method
[43] Scene classification 1. Radial basis Real-time video indexing
decompositions of a color and dataset
address space followed by
Gabor wavelets in
frequency space
2. The above is used to
train SVM classifier
[35] 1. Split audio and video 1. Empirical Mode Generated events (goals,
2. Intrinsic Mode Function Decomposition (EMD) to shots on goal, shots off
(IMF) extraction from filter the noise and extract goal, red card, yellow card,
audio signal audio penalty decision)
3. Feature extraction from 2. Non-learning methods
energy matrix of the signal to extract features and
((a) energy level of the compute shot score and
frame in shot, (b) audio summary generation
power increment, (c)
average audio energy
increment in continuous
shots, (d) whistle detector)
[66] 1. Video shot segmentation 1. VJ AdaBoost method Highlight generated based
2. MPEG-7-based audio with skin filter for face on user input
descriptor detection – learning
3. Whistle detector method
4. MPEG-7 motion 2. Other algorithms used
descriptor non-learning methods such
5. MPEG-7 color as Discrete Fourier
descriptor Transform for whistle
6. Replay detector detection
7. Persons detector
8. Long-shot detector
9. Zooms detector
[93] 1. Shot boundary detection 1. SVM and NN (replay Highlights the most
2. Shot-type, play break and scoreboard) important events that
classification 2. K-means include goals and goal
3. Replay detection 3. Hough transform attempts
4. Scoreboard detection (vertical goal post
5. Excitement event detection)
detection 4. Gabor filter (Goal Net)
6. Logo-based event 5. Volume of each audio
detection frame,
7. Audio loudness subharmonic-to-harmonic
detection ratio-based pitch
determination, dynamic
thresholds – learning and
non-learning methods
[78] 1. Define segmentation 1. Background subtraction The output video is
points using GMM for replay parameterized based on
2. Replay detection detection events over time and the
3. Player detection and 2. YOLO for player user priority list.
interpolation detection
4. Soccer event 3. Histogram of optical
segmentation flow to capture player
5. Bin-packing to select motion – learning and
subset of plays based on non-learning methods
utility from eight bins
[18] Video classification of 1. Adam optimizer for Classified shots of cricket
cricket shots training the model video that belongs to cut
2. CNN model with 13 shot, cover drive, straight
layers for classification drive, pull shot, leg glance
shot, and scoop shot
[60] 1. Shot classification 1. AlexNet CNN for shot Classified shots of sports
classification video with classes like
close, crowd, long, and
medium shots
[64] Scene classification Pretrained AlexNet CNN Classified shots into
batting, bowling, boundary,
crowd, and close-up
[90] 1. Detect candidate set for 1. Difference and Detected replay from given
logo template accumulated difference in video sequence
2. Find logo template from a window of 20 frames
the candidate set 2. K-means clustering to
3. Match the logo (pair find exact logo template
logo for replay detection) 3. Adaptive criterion:
frame difference and mean
intensity of the current
frame with those of the
logo template – learning
and non-learning methods
[82] 1. Video segmentation 1. Two-stream deep neural User-generated sports
2. Highlight classification network (1. Holistic video (UGSV)
feature stream: 2D CNN 2.
Body joint stream: 3D
CNN) – trained from lower
layer to the top layers by
using a UGSV
summarization dataset.
2. LSTM (highlight
classification)
[45] 1. Scorebox detection 1. Nonoverlapping sliding Highlight generated based
(binary map hole filling window operation on on user preferences
algorithm) frame pairs for scorebox
2. OCR to recognize text in detection
scorebox 2. OCR using deep CNN
3. Parse and clean with 25 layers
algorithm to recognize 3. Butterworth band-pass
clearly text from text filter and Savitzky-Golay
region smoothing filter for audio
4. Audio feature extraction feature extraction
5. Key frame detection 4. Speech to text using
(start and end frame Google API2 – both
estimation algorithm) learning and non-learning
methods are used.
[32] 1. Unimodal learning 1. (a) Multibranch 1. Highlight generated
2. Multimodal learning Convolutional Networks based on unimodal
3. Multimodal and (merge the convolutional learning
multi-label learning features from input frames; 2. Highlight generated
then the regression value is based on multimodal
obtained) learning
(b) 3D CNN to capture 3. Highlight generated
more temporal and motion based on multimodal,
information. multi-label learning
(c) Long-term Recurrent
Convolutional Networks
uses pretrained CNN
model to extract features.
2. Pretrained CNN features
with NN (latent features
fusion (LFF) and
pretrained CNN features
with deep NN (early
features fusion (EFF))
3. (a) Construct a network
for training each label
separately
(b) Jointly train a
multi-label network, and
extract the joint features
from the last dense layer.
[63] 1. Shot detection Non-learning methods Generated video summary
(a) Shot classification used to identify important based on user input on the
(b) Replay detection events from long, medium, length (N clips) of the
2. Video segmentation close, and replay views summary
(collection of successive
shots)
3. Clip boundary
calculation
4. Interest level measure
[34] 1. Shot boundary detection 1. Linear SVM classifier Summary with replay and
2. Shot view classification 2. Green color dominance without replay segments
(global view, medium and threshold frequencies
view, and close-up view) over player bounding box
3. Replay detection 3. Histogram difference
4. Play break detection between logo frames
5. Penalty box detection 4. Statistical measure for
6. Key event detection key event detections

– The shots are classified into only specific categories as in [6, 9, 34, 45, 60]. The
shot detection in these works is not generalized. For example, in [34], the size
of the bounding box is the major parameter to decide between medium and long
shots. This may go wrong if the shadows are detected as boundaries. The authors
have not addressed such issues.
– The authors of [34] have used an algorithm to find three parallel lines in a frame to compute the near-goal ratio. Because of the camera angle, these lines may not appear parallel due to perspective distortion, and this issue has not been addressed.
– The clip boundaries are detected using the camera motion [34]. This may not be
applicable to other sports where the camera keeps moving.

4.1.3 Resolution and Samples

– In most of the works, the video samples and their resolutions are assumed to be much lower than broadcast quality. The frame resolution is mentioned as 640 × 480 [74]. In practice, the video resolution will not always be the same: either every video must be down-sampled to the standard size that is processed, or the method must be flexible enough to process videos of any resolution.
– The number of samples used in methods such as [74] is significantly small, and the performance of the models is justified only with such a low number of samples. The impact on the performance of the algorithm when the number of samples is increased is not specified.
– Only a few of the methods mention the sports series that has been used for the video samples, and the reason for choosing a specific series is not clearly given.

Table 4 Notable research work in other sports video summarization


Study Algorithms Methods Output
[4] 1. Shot boundary 1. Rule-based approach to Detected goal events
detection – rank tracing classify (far view, close-up
algorithm view) shots – learning
2. Short view method and non-learning
classification – dominant methods for other
view detection, playfield algorithms
region segmentation,
object-size determination,
and shot classification
3. Minute by minute
textual cues – event
keyword matching
Time stamp extraction, text
video synchronization
Event search localization
4. Candidate shot list
generation
5. Candidate ranking
[7] 1. Feature extraction 1. Mel Frequency Cepstral Generated video highlights
(referee’s whistle, Coefficients (MFCC) and
commentators’ exciting their first order
speech) delta-MFCC – to represent
2. Multistage classification whistle sound and exciting
3. Highlight generation speech
2. Multistage Gaussian
mixture models (GMM) –
learning method to learn
and classify: Stage 1:
speech and nonspeech
Stage 2: Excited (from
speech) or whistle (from
nonspeech)
Five GMM models: (a) a
speech model, (b) an
excited speech model, (c)
an unexcited speech
model, (d) a whistle model,
and (e) a model to classify
all other acoustic events
3. Decision window and
onset and offset
determination for scene
[30] 1. Shot detection 1. Structural Similarity Events are tagged by above
2. Penalty cornet and Index Measure (SSIM) event names and stitched
penalty stroke detection 2. Color segmentation and in the order of appearance
3. Umpire gesture morphological operations – based on user preferences
detection long shot/umpire shot to generate customized
4. Foul detection 3. Field color detection and highlights.
5. Replay and logo shot skin color detection –
detection close-up shot
6. Goal detection 4. Hough transformation
and morphological
operations – goal post
shot – Non-learning
methods
[37] 1. Rally scene detection 1. Unsupervised shot Extracted highlights from
clustering based on HSV unimodal integrated with
histogram multimodal
2. Correlation analysis
between court position and
ball position
3. Rally rank evaluation –
adjusted R squares –
learning methods
[40] 1. Replay segment 1. Thresholding-based Detected replay events
extraction approach (fade-in and
2. Key events detection fade-out transition during
(a) Motion pattern start and end of replay)
detection 2. Gaussian mixture model
(b) Feature extraction (GMM) to extract
silhouettes and generate
motion history image
(MHI) for each key event
3. Confined elliptical local
ternary patterns (CE-LTPs)
for feature extraction
4. Extreme learning
machine (ELM) classifier
for key event detection –
learning methods
[59] 1. Audio analysis (crowd 1. Sound net classifier Automatically extracting
cheer and the commentator (crowd cheer and highlights
excitement detection) commentator speech)
2. Visual analysis 2. Linear SVM classifier
(players – action (commentator excitement
recognition, shot boundary detection – tone based)
detection) 3. Speech to text
3. Text analysis (text conversion (commentator
based: 60 words/phrases excitement)
dictionary) 4. VGG-16 model (player
action of celebration)
5. Optical Character
Recognition (OCR) –
learning methods

– The methods that use the AlexNet CNN and the Decision Tree classifier do not use a substantial number of samples to evaluate the robustness of the model. Some of them used as few as 50 samples.
– The datasets have a limited number of samples, as in [25, 64, 67]. The videos have been chosen based on predefined view types, and there seems to be no preprocessing done to classify the videos into different types of views.

4.1.4 Events Detection

– Only four key events are considered in [34]. Likewise, in most of the event-based methods, only the standard events relevant to the sport (boundaries in cricket, goals in soccer, etc.) are considered. Any attempt to increase the number of key events, or to include sub-events of the main key events, would have made these works better.
– The audio events in most of the works [7, 45, 66, 92] use cheer energy or the spectrum of the commentators. Attempts to use in-field audio, such as stump microphones and umpires' microphones, have not been reported in any of the works.
– The replay is the key to identifying the semantic segments of the video in [20, 40, 90]. If the replay is not identified, there will be misleading semantics in the segments.
– The human subjective evaluation may not always reveal the true results, as in [74]. The samples used for evaluation are too few in number to support any proper conclusion.

4.2 Scope for Further Research

The major objective of sports video summarization is to reduce the length of the broadcast video in such a way that the shortened video shows only the interesting events. Every sports video is lengthy in duration; some sports, like cricket, have days-long videos. When automated highlight generation is applied, it has to deal with a huge volume of video data. With current video broadcasting standards, every video to be processed will be at a high resolution, sometimes up to 4K or 8K. Given these constraints, any algorithm developed to automate video summarization should address these requirements. Hence, any prospective research should deal with such high resolutions and high volumes of data.
In addition to the resolution of the video, the lack of standardized datasets is a huge letdown for researchers who want to benchmark their results. Only a handful of researchers [25, 67] have attempted to create datasets for sports summarization. Most of the proposed methods employed custom-created datasets or used commonly available data from sources like YouTube. Creating and standardizing datasets with a huge collection of samples for each type of sports video is therefore one of the high-priority research projects.
In the literature, it is found that most of the methods [6, 21, 34, 36, 38, 43, 50, 54, 59, 67, 72, 85] applied two levels of model building. Since video summarization mainly involves detecting or segmenting objects and classifying them, advanced object detection and classification methods like YOLO [68, 69], R-CNN [71], etc. can be used to reduce the computational complexity, as sketched below. Since these methods are standardized in terms of detection and classification, sports video summarization models can employ them to reduce model building time as well.
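As a rough, non-authoritative illustration of this idea, the following sketch runs a pretrained off-the-shelf detector (here torchvision's Faster R-CNN, chosen only as an example and assuming a recent torchvision/OpenCV installation) over sampled frames of a broadcast video; the sampling rate and score threshold are arbitrary assumptions rather than values taken from the reviewed works.

```python
import cv2
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Pretrained COCO detector used as-is; no sports-specific fine-tuning is implied.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_on_frames(video_path, every_n=30, score_thresh=0.7):
    """Run the pretrained detector on every n-th frame and collect detections."""
    cap = cv2.VideoCapture(video_path)
    detections, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                out = model([to_tensor(rgb)])[0]
            keep = out["scores"] > score_thresh     # keep confident detections only
            detections.append((idx, out["labels"][keep], out["boxes"][keep]))
        idx += 1
    cap.release()
    return detections
```

Such generic detections (players, persons, scoreboard-like regions, etc.) could then feed the event-level classifiers surveyed above, although the exact coupling would depend on the individual method.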
Along similar lines, with methods like YOLO [68, 69], which handle common object classification, an attempt can be made to build a pretrained object detection model specific to sports. There are many common events or scenes among the games under consideration. For example, a green playfield, player jerseys with names and numbers printed on them, umpires, and scoreboards are common to almost all sports. Pretrained models can be used to classify or detect such common labels in sports video, and the same can then be extended to sports-specific models.
As stated earlier, video summarization is a compute-intensive process. It needs to process thousands of frames and hundreds of video shots before it classifies or identifies proper results. The strength of any video processing algorithm is that it can be implemented on a parallel architecture. The best solution for parallelism at the consumer level is to utilize the multicore processing capabilities of CPUs and GPUs. Only selected works from the literature [60] have addressed the use of GPUs. However, there exists a huge potential for exploiting GPUs with the support of suitable hardware architectures and software development tools like CUDA; a minimal illustration is given below.
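The following fragment is only a hedged sketch, assuming a PyTorch-based frame classifier, of how batched GPU inference could speed up frame-level processing; the ResNet-18 backbone, batch size, and preprocessed input shape are illustrative assumptions, not choices reported in the surveyed papers.

```python
import torch
import torchvision

# Route computation to the GPU (via CUDA) when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.resnet18(weights="DEFAULT").to(device).eval()

def classify_frames(frames, batch_size=64):
    """frames: float tensor of shape (N, 3, 224, 224), already resized and normalized."""
    preds = []
    with torch.no_grad():
        for start in range(0, frames.shape[0], batch_size):
            batch = frames[start:start + batch_size].to(device, non_blocking=True)
            preds.append(model(batch).argmax(dim=1).cpu())   # predicted class per frame
    return torch.cat(preds)
```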
Further, methods that process the videos in real time to detect key events for video summarization would produce very attractive results. The results can be customized depending on the duration of the summary that a viewer is interested in watching. Embedding such methods as a product in real-time television streaming boxes would bring much commercial value to video summarization, not only for sports but also for other events like coverage of musical events, festivals, and other functions.

5 Conclusion

In this chapter, a systematic review of the latest developments in sports video summarization has been presented. The chapter summarized some of the key methods in detail by analyzing the methods and algorithms used for various sports and events. It is believed that the weaknesses identified in each of the papers can potentially lead to further research avenues for prospective researchers in this domain. Though most of the papers focus on resolving the problem of video summary generation, each method has its own merits and demerits. Understanding these methods will certainly help to benchmark their results against any further developments in this area. Another major contribution of this work is to identify the methods that exploit the latest machine learning methods and high-performance computing. It is also shown that very few methods have deployed GPU or multicore computing. This further opens room for budding researchers to explore the potential of GPUs for processing such high volumes of data as sports video. The results of such summarization can be instantly compared with the highlights generated by broadcasting channels at the end of every match. Going further, the results of such highlights should also include some additional key events and drama, not just the key events of the games. For instance, in soccer, if a player is given a red card, manual editing will show all the events related to the player that led to the red card, and sometimes his activities from previous matches are also shown by manual editors. The automated system should be capable enough to identify such key events and include them in the highlights. Some other elements, like the pre- and post-match ceremonies, players' entries, injuries to players, etc., should also be captured. Eventually, machine learning-based methods should learn to include the style of commentary, the teams' jerseys, noise removal from common cheering, series-specific scene transitions, and smooth commentary or video cuts that will potentially reduce the human editor's work. It is anticipated that this chapter will become one of the standard references for researchers actively developing video summarization algorithms using learning or non-learning approaches.

References

1. Rahman, A. A., Saleem, W., & Iyer, V. V. Driving behavior profiling and prediction in KSA
using smart phone sensors and MLAs. In 2019 IEEE Jordan international joint conference on
Electrical Engineering and Information Technology (JEEIT) (pp. 34–39).
2. Ajmal, M., Ashraf, M. H., Shakir, M., Abbas, Y., & Shah, F. A. (2012). Video summarization:
Techniques and classification. In Computer vision and graphics (Vol. 7594). ISBN: 978-3-642-
33563-1.
3. Sen, A., Deb, K., Dhar, P. K., & Koshiba, T. (2021). CricShotClassify: An approach to
classifying batting shots from cricket videos using a convolutional neural network and gated
recurrent unit. Sensors, 21, 2846. https://doi.org/10.3390/s21082846
4. Halin, A. A., & Mandava, R. (2013, January). Goal event detection in soccer videos via
collaborative multimodal analysis. Pertanika Journal of Science and Technology, 21(2), 423–
442.
5. Amruta, A. D., & Kamde, P. M. (2015, March). Sports highlight generation system based on
video feature extraction. IJRSI (2321–2705), II(III).
6. Bagheri-Khaligh, A., Raziperchikolaei, R., & Moghaddam, M. (2012). A new method for shot
classification in soccer sports video based on SVM classifier. In Proceedings of the 2012 IEEE
Southwest Symposium on Image Analysis and Interpretation (SSIAI). Santa Fe, NM.
7. Baijal, A., Jaeyoun, C., Woojung, L., & Byeong-Seob, K. (2015). Sports highlights generation
based on acoustic events detection: A rugby case study. In 2015 IEEE International Conference
on Consumer Electronics (ICCE) (pp. 20–23). https://doi.org/10.1109/ICCE.2015.7066303
8. Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv:2004.10934 [cs.CV].
9. Chen, F., De Vleeschouwer, C., Barrobés, H. D., Escalada, J. G., & Conejero, D. (2010).
Automatic summarization of audio-visual soccer feeds. In 2010 IEEE international conference
on Multimedia and Expo (pp. 837–842). https://doi.org/10.1109/ICME.2010.5582561
10. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-fcn: Object detection via region-based fully
convolutional networks. In Advances in neural information processing systems (pp. 379–387).
11. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005
IEEE Computer Society conference on Computer Vision and Pattern Recognition (CVPR ‘05)
(Vol. 1, pp. 886–893). https://doi.org/10.1109/CVPR.2005.177
12. Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06) (pp. 233–240). ACM, New York, NY, USA. https://doi.org/10.1145/1143844.1143874
13. Asadi, E., & Charkari, N. M. (2012). Video summarization using fuzzy c-means clustering. In
20th Iranian conference on Electrical Engineering (ICEE2012) (pp. 690–694). https://doi.org/
10.1109/IranianCEE.2012.6292442
14. Ekin, A., Tekalp, A., & Mehrotra, R. (2003). Automatic soccer video analysis and summariza-
tion. IEEE Transactions on Image Processing, 12(7), 796–807.
15. Fani, M., Yazdi, M., Clausi, D., & Wong, A. (2017). Soccer video structure analysis by parallel
feature fusion network and hidden-to-observable transferring Markov model. IEEE Access, 5,
27322–27336.
16. Felzenszwalb, P. F., Girshick, R. B., & McAllester, D. (2010). Cascade object detection with
deformable part models. In 2010 IEEE computer society conference on Computer Vision and
Pattern Recognition (pp. 2241–2248). https://doi.org/10.1109/CVPR.2010.5539906
17. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010, September). Object
detection with discriminatively trained part-based models. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 32(9), 1627–1645. https://doi.org/10.1109/TPAMI.2009.167
18. Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, mul-
tiscale, deformable part model. In 2008 IEEE conference on Computer Vision and Pattern
Recognition (pp. 1–8). https://doi.org/10.1109/CVPR.2008.4587597
19. Foysal, M. F., Islam, M., Karim, A., & Neehal, N. (2018). Shot-Net: A convolutional neural
network for classifying different cricket shots. In Recent trends in image processing and pattern
recognition. Springer Singapore.
20. Ghanem, B., Kreidieh, M., Farra, M., & Zhang, T. (2012). Context-aware learning for
automatic sports highlight recognition. In Proceedings of the 21st International Conference
on Pattern Recognition (ICPR2012) (pp. 1977–1980).
21. Girshick, R. B. (2012). From rigid templates to grammars: object detection with structured
models (Ph.D. Dissertation). University of Chicago, USA. Advisor(s) Pedro F. Felzenszwalb.
Order Number: AAI3513455.
22. Girshick, R. B., Felzenszwalb, P. F., & Mcallester, D. A. (2011). Object detection with grammar
models. In Proceedings of the 24th international conference on Neural Information Processing
Systems (NIPS’11) (pp. 442–450). Curran Associates Inc., Red Hook, NY, USA.
23. Girshick, R. (2015). Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 1440–1448). https://doi.org/10.1109/ICCV.2015.169
24. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2016, January 1). Region-based con-
volutional networks for accurate object detection and segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 38(1), 142–158. https://doi.org/10.1109/
TPAMI.2015.2437384
25. Gonzalez, A., Bergasa, L., Yebes, J., & Bronte, S. (2012). Text location in complex images. In
IEEE ICPR.
26. Gupta, A., & Muthaiah, S. (2020). Viewpoint constrained and unconstrained Cricket stroke
localization from untrimmed videos. Image and Vision Computing, 100.
27. Gupta, A., & Muthaiah, S. (2019). Cricket stroke extraction: Towards creation of a large-scale
cricket actions dataset. arXiv:1901.03107 [cs.CV].
28. Gupta, A., Karel, A., & Sakthi Balan, M. (2020). Discovering cricket stroke classes in trimmed
telecast videos. In N. Nain, S. Vipparthi, & B. Raman (Eds.), Computer vision and image
processing. CVIP 2019. Communications in computer and information science (Vol. 1148).
Springer Singapore.
29. Arpan, G., Ashish, K., & Sakthi Balan, M. (2021). Cricket stroke recognition using hard and
soft assignment based bag of visual words. In Communications in computer and information
science (pp. 231–242). Springer Singapore. https://doi.org/10.1007/2F978-981-16-1092-2021
30. Hari, R. (2015, November). Automatic summarization of hockey videos. IJARET (0976–6480),
6(11).
31. Harun-Ur-Rashid, M., Khatun, S., Trisha, Z., Neehal, N., & Hasan, M. (2018). Crick-net: A
convolutional neural network based classification approach for detecting waist high no balls in
cricket. arXiv preprint arXiv:1805.05974.
32. He, J., & Pao, H.-K. (2020). Multi-modal, multi-labeled sports highlight extraction. In 2020
international conference on Technologies and Applications of Artificial Intelligence (TAAI)
(pp. 181–186). https://doi.org/10.1109/TAAI51410.2020.00041
33. He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional
networks for visual recognition. In European conference on Computer Vision (pp. 346–361).
Springer.
34. Khurram, I. M., Aun, I., & Nudrat, N. (2020). Automatic soccer video key event detection and
summarization based on hybrid approach. Proceedings of the Pakistan Academy of Sciences, A
Physical and Computational Sciences (2518–4245), 57(3), 19–30.
35. Islam, M. R., Paul, M., Antolovich, M., & Kabir, A. (2019). Sports highlights generation using
decomposed audio information. In IEEE International Conference on Multimedia & Expo
Workshops (ICMEW) (pp. 579–584). https://doi.org/10.1109/ICMEW.2019.00105
36. Islam, M., Hassan, T., & Khan, S. (2019). A CNN-based approach to classify cricket bowlers
based on their bowling actions. In 2019 IEEE international conference on Signal Processing,
Information, Communication & Systems (SPICSCON) (pp. 130–134). https://doi.org/10.1109/
SPICSCON48833.2019.9065090
37. Takahiro, I., Tsukasa, F., Shugo, Y., & Shigeo, M. (2017). Court-aware volleyball video
summarization. In ACM SIGGRAPH 2017 posters (SIGGRAPH ‘17) (pp. 1–2). Associa-
tion for Computing Machinery, New York, NY, USA, Article 74. https://doi.org/10.1145/
3102163.3102204
38. Javed, A., Malik, K. M., Irtaza, A., et al. (2020). A decision tree framework for shot
classification of field sports videos. The Journal of Supercomputing, 76, 7242–7267. https://
doi.org/10.1007/s11227-020-03155-8
39. Javed, A., Bajwa, K., Malik, H., Irtaza, A., & Mahmood, M. (2016). A hybrid approach for
summarization of cricket videos. In IEEE International Conference on Consumer Electronics-
Asia (ICCE-Asia). Seoul.
40. Javed, A., Irtaza, A., Khaliq, Y., & Malik, H. (2019). Replay and key-events detection for sports
video summarization using confined elliptical local ternary patterns and extreme learning
machine. Applied Intelligence, 49, 2899–2917. https://doi.org/10.1007/s10489-019-01410-x
41. Jothi Shri, S., & Jothilakshmi, S. (2019). Crowd video event classification using convolutional
neural network. Computer Communications, 147, 35–39.
42. Kanade, S. S., & Patil, P. M. (2013, March). Dominant color based extraction of key frames for
sports video summarization. International Journal of Advances in Engineering & Technology,
6(1), 504–512. ISSN: 2231-1963.
43. Kapela, R., McGuinness, K., & O’Connor, N. (2017). Real-time field sports scene classification
using colour and frequency space decompositions. Journal of Real-Time Image Process, 13,
725–737.
44. Kathirvel, P., Manikandan, S. M., & Soman, K. P. (2011, January). Automated referee whistle
sound detection for extraction of highlights from sports video. International Journal of
Computer Applications (0975–8887), 12(11), 16–21.
45. Khan, A., Shao, J., Ali, W., & Tumrani, S. (2020). Content-aware summarization of broadcast
sports videos: An audio–visual feature extraction approach. Neural Process Letter, 1945–
1968.
46. Kiani, V., & Pourreza, H. R. (2013). Flexible soccer video summarization in compressed
domain. In ICCKE 2013 (pp. 213–218). https://doi.org/10.1109/ICCKE.2013.6682798
47. Kolekar, M. H., & Sengupta, S. (2015). Bayesian network-based customized highlight
generation for broadcast soccer videos. IEEE Transactions on Broadcasting, (2), 195–209.
48. Kolekar, M. H., & Sengupta, S. (2006). Event-importance based customized and automatic
cricket highlight generation. In IEEE international conference on Multimedia and Expo.
Toronto, ON.
49. Kolekar, M. H., & Sengupta, S. (2008). Caption content analysis based automated cricket
highlight generation. In National Communications Conference (NCC). Mumbai.
50. Bhattacharya, K., Chaudhury, S., & Basak, J. (2004, December 16–18). Video summarization:
A machine learning based approach. In ICVGIP 2004, Proceedings of the fourth Indian con-
ference on Computer Vision, Graphics & Image Processing (pp. 429–434). Allied Publishers
Private Limited, Kolkata, India.
51. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1 (NIPS'12) (pp. 1097–1105). Curran Associates Inc., Red Hook, NY, USA.
52. Kumar, R., Santhadevi, D., & Janet, B. (2019). Outcome classification in cricket using deep
learning. In IEEE international conference on Cloud Computing in Emerging Markets CCEM.
Bengaluru.
53. Kumar Susheel, K., Shitala, P., Santosh, B., & Bhaskar, S. V. (2010). Sports video sum-
marization using priority curve algorithm. International Journal on Computer Science and
Engineering (0975–3397), 02(09), 2996–3002.
54. Kumar, Y., Gupta, S., Kiran, B., Ramakrishnan, K., & Bhattacharyya, C. (2011). Automatic
summarization of broadcast cricket videos. In IEEE 15th International Symposium on Con-
sumer Electronics (ISCE). Singapore.
55. Li, Y., Chen, Y., Wang, N., & Zhang, Z. (2019). Scale-aware trident networks for object
detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 6053–
6062). https://doi.org/10.1109/ICCV.2019.00615
56. Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., & Sun, J. (2017). Light-head r-cnn: In defense of
two-stage object detector. arXiv preprint arXiv:1711.07264.
57. Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid
networks for object detection. In IEEE conference on Computer Vision and Pattern Recognition
(CVPR) (pp. 936–944). https://doi.org/10.1109/CVPR.2017.106
58. Lin, T., Goyal, P., Girshick, R., He, K., & Dollár, P. (2018, July). Focal loss for dense object
detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2), 318–327.
https://doi.org/10.1109/TPAMI.2018.2858826
59. Merler, M., Mac, K. N. C., Joshi, D., Nguyen, Q. B., Hammer, S., Kent, J., Xiong, J., Do, M.
N., Smith, J. R., & Feris, R. S. (2019, May). Cricket automatic curation of sports highlights
using multimodal excitement features. IEEE Transactions on Multimedia, 21(5), 1147–1160.
https://doi.org/10.1109/TMM.2018.2876046
60. Minhas, R., Javed, A., Irtaza, A., Mahmood, M., & Joo, Y. (2019). Shot classification of field
sports videos using AlexNet Convolutional Neural Network. Applied Sciences, 9(3), 483.
61. Mohan, S., & Vani, V. (2016). Predictive 3D content streaming based on decision tree classifier
approach. In S. Satapathy, J. Mandal, S. Udgata, & V. Bhateja (Eds.), Information systems
design and intelligent applications. Advances in intelligent systems and computing (Vol. 433).
Springer. https://doi.org/10.1007/978-81-322-2755-7_16
62. Namuduri, K. (2009). Automatic extraction of highlights from a cricket video using MPEG-
7 descriptors. In First international communication systems and networks and workshops.
Bangalore.
63. Nguyen, N., & Yoshitaka, A. (2014). Soccer video summarization based on cinematography
and motion analysis. In 2014 IEEE 16th international workshop on Multimedia Signal
Processing (MMSP) (pp. 1–6). https://doi.org/10.1109/MMSP.2014.6958804
64. Rafiq, M., Rafiq, G., Agyeman, R., Choi, G., & Jin, S.-I. (2020). Scene classification for sports
video summarization using transfer learning. Sensors, 20, 1702.
65. Raj, R., Bhatnagar, V., Singh, A. K., Mane, S., & Walde, N. (2019, May). Video sum-
marization: Study of various techniques. In Proceedings of IRAJ international conference,
arXiv:2101.08434.
66. Raventos, A., Quijada, R., Torres, L., & Tarrés, F. (2015). Automatic summarization of soccer
highlights using audio-visual descriptors. Springer Plus, 4, 1–13.
67. Ravi, A., Venugopal, H., Paul, S., & Tizhoosh, H. R. (2018). A dataset and preliminary
results for umpire pose detection using SVM classification of deep features. In 2018 IEEE
Symposium Series on Computational Intelligence (SSCI) (pp. 1396–1402). https://doi.org/
10.1109/SSCI.2018.8628877
68. Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In 2017 IEEE
conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6517–6525). https://
doi.org/10.1109/CVPR.2017.690
69. Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint
arXiv:1804.02767.
70. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-
time object detection. In Proceedings of the IEEE conference on Computer Vision and Pattern
Recognition (pp. 779–788).
71. Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards real-time object
detection with region proposal. arXiv:1506.01497 [cs.CV].
72. Sharma, R., Sankar, K., & Jawahar, C. (2015). Fine-grain annotation of cricket videos. In
Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition (ACPR). Kuala Lumpur,
Malaysia.
73. Shih, H. (2018). A survey of content-aware video analysis for sports. IEEE Transactions on
Circuits and Systems for Video Technology, 28(5), 1212–1231.
74. Shingrakhia, H., & Patel, H. (2021). SGRNN-AM and HRF-DBN: A hybrid machine learning
model for cricket video summarization. The Visual Computer, 38, 2285. https://doi.org/
10.1007/s00371-021-02111-8
75. Shukla, P., Sadana, H., Verma, D., Elmadjian, C., Ramana, B., & Turk, M. (2018). Automatic
cricket highlight generation using event-driven and excitement-based features. In IEEE/CVF
conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Salt Lake City,
UT.
76. Sreeja, M. U., & KovoorBinsu, C. (2019). Towards genre-specific frameworks for video
summarisation: A survey. Journal of Visual Communication and Image Representation (1047–
3203), 62, 340–358. https://doi.org/10.1016/j.jvcir.2019.06.004
77. Su, Y., Wang, W., Liu, J., Jing, P., & Yang, X. (2020). DS-Net: Dynamic spatiotemporal network for video salient object detection. arXiv:2012.04886 [cs.CV].
78. Sukhwani, M., & Kothari, R. A parameterized approach to personalized variable length
summarization of soccer matches. arXiv preprint arXiv:1706.09193.
79. Sun, Y., Ou, Z., Hu, W., & Zhang, Y. (2010). Excited commentator speech detection
with unsupervised model adaptation for soccer highlight extraction. In 2010 international
conference on Audio, Language, and Image Processing (pp. 747–751). https://doi.org/10.1109/
ICALIP.2010.5685077
80. Tang, H., Kwatra, V., Sargin, M., & Gargi, U. (2011). Detecting highlights in sports videos:
Cricket as a test case. In IEEE international conference on Multimedia and Expo. Barcelona.
81. Saba, T., & Altameem, A. (2013, August). Analysis of vision based systems to detect real time
goal events in soccer videos. International Journal of Applied Artificial Intelligence, 27(7),
656–667. https://doi.org/10.1080/08839514.2013.787779
82. Antonio, T.-d.-P., Yuta, N., Tomokazu, S., Naokazu, Y., Marko, L., & Esa, R. (2018, August).
Summarization of user-generated sports video by using deep action recognition features. IEEE
Transactions on Multimedia, 20(8), 2000–2010.
83. Tien, M.-C., Chen, H.-T., Hsiao, C. Y.-W. M.-H., & Lee, S.-Y. (2007). Shot classification of
basketball videos and its application in shooting position extraction. In Proceedings of the
IEEE international conference on Acoustics, Speech and Signal Processing (ICASSP 2007).
84. Vadhanam, B. R. J., Mohan, S., Ramalingam, V., & Sugumaran, V. (2016). Performance
comparison of various decision tree algorithms for classification of advertisement and non-
advertisement videos. Indian Journal of Science and Technology, 9(1), 48–65.
85. Vani, V., Kumar, R. P., & Mohan, S. Profiling user interactions of 3D complex meshes for
predictive streaming and rendering. In Proceedings of the fourth international conference on
Signal and Image Processing 2012 (ICSIP 2012) (pp. 457–467). Springer, India.
86. Vani, V., & Mohan, S. (2021). Advances in sports video summarization – a review based
on cricket video. In The 34th international conference on Industrial, Engineering & Other
Applications of Applied Intelligent Systems, Special Session on Big Data and Intelligence
Fusion Analytics (BDIFA 2021). Accepted for publication in Springer LNCS.
87. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple
features. In Proceedings of the 2001 IEEE Computer Society conference on Computer Vision
and Pattern Recognition. CVPR 2001 (p. I-I). https://doi.org/10.1109/CVPR.2001.990517
88. Viola, P., & Jones, M. (2004). Robust real-time face detection. International Journal of
Computer Vision, 57(2), 137–154.
89. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016).
SSD: Single shot multibox detector. In European conference on computer vision (pp. 21–37).
Springer.
90. Xu, W., & Yi, Y. (2011, September). A robust replay detection algorithm for soccer video.
IEEE Signal Processing Letters, 18(9), 509–512. https://doi.org/10.1109/LSP.2011.2161287
91. Khan, Y. S., & Pawar, S. (2015). Video summarization: Survey on event detection and
summarization in soccer videos. International Journal of Advanced Computer Science and
Applications (IJACSA), 6(11). https://doi.org/10.14569/IJACSA.2015.061133
92. Ye, J., Kobayashi, T., & Higuchi, T. Audio-based sports highlight detection by Fourier local
auto-correlations. In Proceedings of the 11th annual conference of the International Speech
Communication Association, INTERSPEECH 2010 (pp. 2198–2201).
93. Hossam, Z. M., Nashwa, E.-B., Ella, H. A., & Tai-hoon, K. (2011). Machine learning-based
soccer video summarization system, multimedia, computer graphics and broadcasting (Vol.
263). ISBN: 978-3-642-27185-4.
94. Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Singleshot refinement neural network
for object detection. In IEEE CVPR.
95. Zhang, S., Wen, L., Lei, Z., & Li, S. Z. (2021, February). RefineDet++: Single-shot refinement
neural network for object detection. IEEE Transactions on Circuits and Systems for Video
Technology, 31(2), 674–687. https://doi.org/10.1109/TCSVT.2020.2986402
96. Zou, Z., Shi, Z., Guo, Y., & Ye, J. (2019). Object detection in 20 years: A survey. arXiv preprint
arXiv:1905.05055.
Shot Boundary Detection from Lecture
Video Sequences Using Histogram
of Oriented Gradients and Radiometric
Correlation

T. Veerakumar, Badri Narayan Subudhi, K. Sandeep Kumar,


Nikhil O. F. Da Rocha, and S. Esakkirajan

1 Introduction

Due to the rapid growth and development of multimedia techniques, e-learning is gaining more popularity. In the last few years, a lot of online courses have been uploaded to the Internet for basic study use. The user performs basic tasks like browsing the content of a particular video or analyzing some specific parts of a video lecture. Sometimes, a video may contain a specific lecture on some topics or subtopics. The main difficulty lies in finding specific pieces of knowledge in a video because it is unstructured. For example, if someone wants to analyze specific content or to attend to some specific part of the video for a particular lecture, then he or she has to watch the entire two- or three-hour lecture. Hence, browsing and analyzing these videos is quite a challenging and tiresome job. Thus, smooth browsing and indexing of lecture videos are considered to be a primary task of computer vision. One likely solution is to segment a video into different shots so as to facilitate the students' learning and minimize learning time.
Recent years have seen a rapid increase in the storage of visual information. This has made scientists find ways to index visual data and retrieve it efficiently.

T. Veerakumar () · K. S. Kumar · N. O. F. Da Rocha


Department of Electronics and Communication Engineering, National Institute of Technology
Goa, Farmagudi, Ponda, Goa, India
e-mail: tveerakumar@nitgoa.ac.in
B. N. Subudhi
Department of Electrical Engineering, Indian Institute of Technology Jammu, Nagrota, Jammu,
India
S. Esakkirajan
Department of Instrumentation and Control Engineering, PSG College of Technology,
Coimbatore, Tamil Nadu, India


Content-Based Video Retrieval (CBVR) is an area of research that caters to such demands of the users [1]. In this process, a video is first segmented
into successive shots. However, to automate the process of shot segmentation,
the analysis of the subsequent frames for changes in visual content is necessary.
These changes can be abrupt or gradual. After detecting the shot boundaries, key
frames are extracted from each shot. Key frames provide a suitable abstraction and
framework for video indexing, browsing, and retrieval. The usage of key frames
significantly reduces the amount of data required in video indexing and provides an
organizational framework for dealing with video content. Users, while searching for a video of their interest, browse the videos randomly and view only certain key frames that match the content of the search query. CBVR has various stages like shot
segmentation, key frame extraction, feature extraction, feature indexing, retrieval
mechanism, and result ranking mechanism [1].
These key frames are used for image-based video retrieval, where an image is given as a query to retrieve a video from a collection of lecture videos. A variety of approaches have been reported in the literature. The simplest method is the pixel-wise difference between consecutive frames [2], but it is very sensitive to camera motion. An approach based on local statistical differences is proposed in [3], obtained by dividing the image into a few regions and comparing statistical measures like the mean and standard deviation of the gray levels within the regions of the image. However, this approach is found to be computationally burdensome. The most common and popular methods used for shot boundary detection are based on histograms [4–6]. The simplest one computes the gray-level or color histogram of the two images; if the sum of the bin-wise differences between the two histograms is above a threshold, a shot boundary is assumed. It may be noted that these approaches are relatively stable, but the absence of spatial information may produce substantial dissimilarities between the frames and hence incurs a reduction in accuracy. Mutual information computed from the joint histogram of consecutive frames has also been used to solve this task [7]. Renowned machine learning and pattern recognition methods like neural networks [8], KNN [9], fuzzy clustering [10, 11], and support vector machines [12] have also been used for shot boundary detection. Shot boundary detection based on an orthogonal polynomial method is proposed in [13], where an orthogonal polynomial function is used to identify the shot boundaries in the video sequence.
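As a point of reference for the histogram-based family of methods discussed above (and not the scheme proposed in this chapter), the following minimal sketch flags a cut whenever the bin-wise gray-level histogram difference between consecutive frames exceeds a threshold; the bin count and threshold are arbitrary assumptions that would need tuning per video.

```python
import cv2
import numpy as np

def histogram_cuts(video_path, bins=64, threshold=0.4):
    """Flag a shot boundary when the normalized bin-wise histogram difference
    between consecutive frames exceeds the threshold."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
        hist /= hist.sum() + 1e-8              # normalize to a probability-like vector
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(idx)                   # shot boundary assumed at this frame
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts
```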
In essence, previous works reveal that researchers have proposed numerous types of features and dissimilarity measures. Many state-of-the-art techniques suffer from the difficulty of selecting the thresholds and window size. Moreover, such methods limit the accuracy of shot boundary detection by generating false positives under illumination change. The next phase after shot detection is key frame extraction. A key frame is a representative frame for an individual shot. One of the popular approaches to key frame extraction uses singular value decomposition (SVD) and correlation minimization [14, 15]. Another method for key frame extraction is KS-SIFT [16]; it extracts local visual features using SIFT, represented as feature vectors, from a selected group of frames of a video shot. The KS-SIFT method analyzes those feature vectors to eliminate near-duplicate key frames, which helps to keep a compact key frame set. However, it takes more computation time, and the approach is found to be complex. The Robust Principal Component Analysis (RPCA) method has also been introduced to extract the key frames of a video. RPCA provides a stable tool for data analysis and dimensionality reduction. Under the RPCA framework, the input data set is decomposed into a sum of low-rank and sparse components. This approach is based on an l1-norm optimization technique [17]; however, the method is more complicated and takes high computational time. The problem of moving object segmentation using background subtraction is introduced in [18]. Moving object segmentation is very important for many applications: visual surveillance in both outdoor and indoor environments, traffic control, behavior detection during sport activities, and so on. A new approach to the detection and classification of scene breaks in video sequences is discussed in [19]. This work is able to detect and classify a variety of scene breaks, including cuts, fades, dissolves, and wipes, even in sequences involving significant motion. A novel dual-stage approach for abrupt transition detection is introduced in [20], which is able to withstand certain illumination and motion effects. A hybrid shot boundary detection method is developed by integrating a high-level fuzzy Petri net (HLFPN) model with keypoint matching [21]. The HLFPN model with histogram difference is performed as a pre-detection step. Next, the speeded-up robust features (SURF) algorithm, which is reliably robust to image affine transformation and illumination variation, is used to figure out all possible false shots and the gradual transitions based on the assumption from the HLFPN model. The top-down design can effectively lower the computational complexity of the SURF algorithm. From the above discussion, it may be concluded that shot boundary detection and key frame extraction are important tasks in image and video analysis and need attention. It is also noted that works on lecture video analysis for shot boundary detection are very few [22].
This article focuses on shot boundary detection and key frame extraction for
lecture video sequences. Here, the combined advantages of the Histogram of Oriented Gradients (HOG) [23] and radiometric correlation with an entropic measure are leveraged to perform the task of shot boundary detection. The key frames from
the video are obtained by analyzing the peaks and valleys of the radiometric
correlation plot against different frames of the lecture video. In the proposed scheme
initially, HOG features are extracted from each frame. The similarities between
the consecutive image frames are obtained by finding the radiometric correlation
between the HOG features. To analyze the shot transitions, the radiometric correlation between consecutive frames is plotted. The radiometric correlation
for the complete lecture video is found to have a significant amount of uncertainty
due to the variation in color, illumination, or object motion in consecutive frames
of a lecture video scene. Hence, the concept of entropic measure is used here. In
the proposed scheme, a center sliding window is considered on the radiometric
correlation plot to compute the entropy at each frame. Similarly, the analysis of
peaks and valleys of the radiometric correlation plot is used to find the key frames
from each shot. The proposed scheme is tested on several lecture sequences, and
seven results are reported in this article. The results obtained by the proposed
scheme are compared with six existing state-of-the-art techniques by considering
the computational time and shot detection.
This article is organized as follows. Section 2 describes the proposed algorithm. The simulation results with discussions and future works are given in Sect. 3. Finally, conclusions are drawn in Sect. 4.

2 Shot Boundary Detection and Key Frame Extraction

The block diagram of the proposed shot boundary detection scheme is shown in
Fig. 1. The proposed scheme follows three steps: feature extraction, shot boundary
detection, and key frame extraction. In the proposed scheme, HOG features are first extracted from all the frames of the sequence. The extracted HOG feature vector of each frame is then compared with that of the subsequent frame using the radiometric correlation [23] measure. Then the local entropy corresponding to the radiometric
correlation is obtained to identify the shot boundaries in the lecture video. In the
next step, the key frames from each shot are extracted by analyzing the peaks and
valleys of the radiometric correlation.

Fig. 1 Flowchart of the proposed technique: the video is split into frames, HOG features are extracted from the ith and (i + 1)th frames, their radiometric correlation is computed, the entropy over a sliding window is evaluated, a shot boundary is detected when the entropy falls below a threshold, and key frames are then extracted



2.1 Feature Extraction

In the proposed scheme, we have used the HOG feature for our analysis. The HOG feature was initially suggested for the purpose of object detection [23] in computer vision and image processing. The method is based on evaluating well-normalized histograms of image gradient orientations. The basic idea is that object appearance and shape can often be characterized well by the distribution of intensity gradients or edge directions without precise knowledge of the corresponding gradient or edge positions. HOG captures the edge or gradient structure that describes the shape, and it does so with an easily controllable degree of invariance to geometric and photometric transformations: translations or rotations make little difference if they are much smaller than the spatial or orientation bin size. Since it is gradient based, it captures the object shape information very well. The essential thought behind HOG is to describe the object appearance and shape within the image by the distribution of intensity gradients. The histograms are contrast-normalized by calculating a measure of the intensity across the image and then normalizing all the values. As comparing consecutive frames of the video is the key to detecting a shot boundary, using HOG is a good choice as it is computationally fast. The proposed shot boundary detection algorithm uses the radiometric correlation and an entropic measure for shot transition identification, which are discussed in detail in the next sections.
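As an illustrative sketch only, per-frame HOG descriptors of the kind used here could be computed with a library such as scikit-image; the resize target and the cell, block, and orientation settings below are common defaults assumed for illustration, not the parameters reported by the authors.

```python
import cv2
from skimage.feature import hog

def frame_hog(frame, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2)):
    """Return a 1-D HOG descriptor for one video frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (320, 240))    # fixed size so all frames give equal-length vectors
    return hog(gray,
               orientations=orientations,
               pixels_per_cell=pixels_per_cell,
               cells_per_block=cells_per_block,
               block_norm="L2-Hys",
               feature_vector=True)
```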

2.2 Radiometric Correlation for Interframe Similarity Measure

The basic idea behind the shot boundary detection in a lecture sequence is to find
the similarity/correlation between the consecutive frames in the video and point
out the discontinuity from it. In this regard, we have considered the radiometric
correlation-based similarity measure to find the correlation between the frames. The
extracted HOG features are compared in between consecutive frames to estimate the
radiometric correlation. Here, it is assumed that the time instant is same as that of
the frame instant. Let the successive frames of a sequence is represented by It (x, y)
−−→
and It − 1 (x, y) and the extracted HOG feature vectors be represented by HOGt and
−−→
HOGt−1 , respectively. Then the radiometric correlation is given by [23]
−−−→ −−−−−→ −−−→ −−−−−→
m HOGt .HOGt−1 − m HOGt m HOGt−1
R (It (x, y) , It−1 (x, y)) =   ,
−−−→ −−−−−→
v HOGt v HOGt−1
(1)
where $m\left(\overrightarrow{HOG_t} \cdot \overrightarrow{HOG_{t-1}}\right)$ represents the mean of the product of the extracted
feature vectors and can be obtained as

$$m\left(\overrightarrow{HOG_t} \cdot \overrightarrow{HOG_{t-1}}\right) = \frac{1}{n}\, \overrightarrow{HOG_{t-1}}\, \overrightarrow{HOG_t}^{\,T}, \qquad (2)$$
where $\overrightarrow{HOG_{t-1}}$ and $\overrightarrow{HOG_t}$ are the extracted HOG feature vectors (with matrices of
size $1 \times n$) for the $(t-1)$th and $t$th frames, respectively, and $n$ is the dimension of the
HOG features computed from each frame. $m\left(\overrightarrow{HOG_t}\right)$ and $v\left(\overrightarrow{HOG_t}\right)$ represent the
mean and variance of the HOG feature vector of the $t$th frame. The HOG features can
be represented as $\overrightarrow{HOG_t} = \left[HOG_{(t,1)}, HOG_{(t,2)}, HOG_{(t,3)}, \ldots, HOG_{(t,n)}\right]$. Hence, the
mean vector can be computed as

$$m\left(\overrightarrow{HOG_t}\right) = \frac{1}{n} \sum_{i=1}^{n} HOG_{(t,i)}, \qquad (3)$$

and

$$v\left(\overrightarrow{HOG_t}\right) = \frac{1}{n} \sum_{i=1}^{n} \left(HOG_{(t,i)} - m\left(\overrightarrow{HOG_t}\right)\right)^2. \qquad (4)$$

The radiometric correlation varies in the range [0, 1]. From the radiometric
correlation values obtained, a threshold is required to detect the shot boundary.
The radiometric correlation values for consecutive frames are calculated. So, for
N frames, (N−1) radiometric correlation values can be obtained. Figure 2a shows
the plot of radiometric correlation vs. frames of lecture video 1 sequence.
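A small sketch of how the radiometric correlation of Eqs. (1)-(4) could be computed between the HOG feature vectors of two consecutive frames is given below; the square root in the denominator follows the standard normalized-correlation form assumed in the reconstruction of Eq. (1), and NumPy is an implementation choice, not part of the original MATLAB code.

```python
import numpy as np

def radiometric_correlation(hog_t, hog_t_minus_1):
    """Radiometric correlation of Eq. (1) between two HOG feature vectors."""
    n = hog_t.size
    mean_prod = np.dot(hog_t_minus_1, hog_t) / n                # Eq. (2)
    m_t, m_t1 = hog_t.mean(), hog_t_minus_1.mean()              # Eq. (3)
    v_t, v_t1 = hog_t.var(), hog_t_minus_1.var()                # Eq. (4)
    return (mean_prod - m_t * m_t1) / np.sqrt(v_t * v_t1)       # Eq. (1)

# For a sequence of N frames this yields N - 1 correlation values:
# corr = [radiometric_correlation(f[i], f[i - 1]) for i in range(1, len(f))]
```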

2.3 Entropic Measure for Distinguishing Shot Transitions

After obtaining the radiometric correlation, the next step is shot boundary detection.
The aim now is to identify the discontinuity point in this radiometric distribution of
the consecutive frames. In Fig. 2a, it can be seen that there is a significant difference
in the radiometric correlation values from one frame to another. However, finding
the discontinuity in these values that corresponds to the shot transition is very
difficult, and keeping a threshold directly on these similarity values is not a good
idea as they vary widely. Therefore, we use a moving window-based entropy measure
on the radiometric correlation: rather than thresholding the radiometric correlation
values themselves, a one-dimensional overlapping moving window is considered
over these values to compute the entropic measure, which improves the performance.
A moving window is considered over the radiometric correlation plot. From
the radiometric correlation plot, the entropy is calculated for each location of the
window. In information theory, entropy is used as a measure of uncertainty, and this

Fig. 2 Lecture video 1: (a) radiometric correlation for different frames, (b) corresponding entropy values

gives the average information. Hence, the use of entropy in our work will reduce
the randomness or vagueness of the local radiometric correlation. We calculate the
entropy $E_m$ at each point (frame) $m$ of the radiometric correlation values using the formula

$$E_m = \sum_{i \in \eta_m} p_i \log\left(\frac{1}{p_i}\right), \qquad (5)$$

where $\eta_m$ represents the considered neighborhood at location $m$, $i$ represents the
frame instant, and $p_i$ is the local radiometric correlation value at frame $i$ within that neighborhood.
As the frame contents change at the shot boundary, a lower entropy value of the
radiometric correlation may be obtained; this is expected at a shot transition and also
makes the choice of threshold easier.
The entropy values obtained for lecture video 1 are plotted in Fig. 2b, where a white
dot marks the detected shot boundary. To identify
the threshold or shot transition, we have considered a variance-based selection
strategy on entropic plot Em . For the selection of the threshold, we have considered
each location in the x-axis of the entropic plot Em as a threshold and searched for a
particular frame position where the sum of the variance along the left and right side
of the point will be high. The total variance is computed as

$$\sigma_m = \sigma_l + \sigma_r, \qquad (6)$$

where $\sigma_m$ is the total variance at frame position $m$, and $\sigma_l$ and $\sigma_r$ are the variances
computed from the entropic values on the left and right sides of $m$.
Then the threshold value is obtained by finding a point $m$ such that

$$Th = \arg\min_j \left(\sigma_j\right), \qquad (7)$$

where $j$ represents the candidate threshold position for the shot transition. For lecture
video 1, applying Eq. (7) detects the shot boundary at the 247th frame. The sequence
whose results are being explained has two shots, and hence one shot boundary is
detected. However, the scheme can also handle videos with more than two shots:
for a video with $P$ shots, the total number of thresholds ($Th$) will be $(P-1)$. For
automatic shot boundary detection, we assume that $j$ is a vector, represented by
$\vec{j} = \{j_1, j_2, \ldots, j_{P-1}\}$, and to obtain the thresholds we consider a maximum number
of possible components in $\vec{j}$ as $j_{P-1}$. The threshold is then represented by a vector
$\overrightarrow{Th} = \{Th_1, Th_2, \ldots, Th_{P-1}\}$.
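The following sketch illustrates the moving-window entropy of Eq. (5) and the variance-based split of Eqs. (6)-(7) over the radiometric correlation values; the per-window normalization of $p_i$ and the handling of the window borders are assumptions, and the (7 × 1) window length anticipates the value discussed in Sect. 3.2.

```python
import numpy as np

def windowed_entropy(corr, win=7):
    """Entropy of Eq. (5) over a 1-D sliding window on the correlation values."""
    corr = np.asarray(corr, dtype=float)
    half = win // 2
    entropy = np.zeros_like(corr)
    for m in range(corr.size):
        window = corr[max(0, m - half): m + half + 1]
        p = window / (window.sum() + 1e-12)        # treat local values as probabilities
        p = p[p > 0]
        entropy[m] = np.sum(p * np.log(1.0 / p))   # Eq. (5)
    return entropy

def split_threshold(entropy):
    """Variance-based split of Eqs. (6)-(7), following the argmin of Eq. (7)."""
    entropy = np.asarray(entropy, dtype=float)
    sigma = [np.var(entropy[:j]) + np.var(entropy[j:])
             for j in range(1, entropy.size)]
    return int(np.argmin(sigma)) + 1               # frame position used as the threshold
```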

2.4 Key Frame Extraction

Once the shot boundaries are extracted from a given sequence, there is a need to
extract the key frames to represent each shot. It can be seen from the graph in Fig. 3
that there is variation in the similarity measure within a particular shot. The maxima of this
variation represent the frames that are most similar to their neighboring frames. The
idea here is to pick the frames at these maxima of the similarity distribution as key
frames; these maxima are the peaks of the distribution. If we
can properly isolate these maxima, then we can find the key frames. However, there
will be temporal redundancy in between the consecutive frames of the video; hence,
it is not a good idea to take two maxima that are close to each other. It is to be
noted that most of the shots contain significant variation in radiometric similarity
measure due to noise or illumination change. Hence, before maxima are picked
for shot representation, the similarity distribution corresponding to each frame is
smoothened by a one-dimensional smoothing filter. Using this scheme, the different
key frames for different shots are detected.
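A minimal sketch of this key-frame selection step is shown below; the smoothing window length and the minimum distance between peaks are illustrative assumptions, and SciPy's peak finder stands in for whatever peak-picking routine was used in the original implementation.

```python
import numpy as np
from scipy.signal import find_peaks

def key_frames_for_shot(similarity, smooth_win=9, min_distance=30):
    """Smooth the per-shot similarity curve and return the peak positions."""
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.convolve(similarity, kernel, mode='same')   # 1-D smoothing filter
    peaks, _ = find_peaks(smoothed, distance=min_distance)    # candidate key-frame indices
    return peaks
```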
Figure 3 shows the key frames for different shots (a total of three) of lecture
video 1 sequence: shot 1 [41, 187, 323], shot 2 [434, 772, 1013, 1249], and shot 3
[1291, 1345, 1394, 1464]. Once the key frames are extracted, it is then checked whether
the visual contents of two consecutive key frames are the same. Hence, the radiometric
correlation is obtained between consecutive key frames, and significant key
frames are selected as final key frames for a particular shot. We also applied the
same on lecture video 1, and we obtained the final key frames as [187, 1013, 1291,
1394].

Fig. 3 Location of the shots and key frames of lecture video 1. (a) Location of the shots. (b) Shot
1 key frames: [41, 187, 323]. (c) Shot 2 key frames [434, 772, 1013, 1249]. (d) Shot 3 key frames
[1291, 1345, 1394, 1464]

3 Results and Discussions

To assess the effectiveness of the proposed algorithm, the results obtained using the
proposed methodology are compared with those obtained using six different state-
of-the-art techniques. The proposed algorithm is implemented in MATLAB and is
run on Pentium D, 2.8 GHz PC with 2G RAM, and Windows 8 operating system.
Experiments are carried out on several lecture sequences. However, for illustration,
we have provided results on seven test sequences. This section is further divided
into two parts: (i) analysis of results and (ii) discussions and future works. The
former part presents a detailed visual illustration for the different sequences, while
the latter part provides a quantitative analysis of the results along with a discussion
of the proposed scheme and future works.

Fig. 4 Key frames for lecture video 1 [187, 1013, 1291, 1394] out of 1497 frames and three shots

3.1 Analysis of Results

Four key frames are extracted from the lecture video 1 that are given by the
frame numbers [187, 1013, 1291, 1394]. These extracted key frames are shown
in Fig. 4. Corresponding visual illustration for radiometric correlation and extracted
key frames are shown in Figs. 2 and 3.
Similarly, the radiometric correlation values are computed for the lecture video 2
sequence, and the resulting graph is shown in Fig. 5. It may be observed that one
shot boundary, i.e., two shots, is detected here. The peak and valley analysis reveals
four major peaks in shot 1 and three major peaks in shot 2. The red marks in Fig. 5
indicate the maxima (peaks) selected as key frames, given by the frame numbers
[18, 78, 143, 228, 277, 389, 467]. However, many of the key frames selected at this
stage have a large mutual correlation; hence, after refinement (as discussed in
Sect. 2.4), we obtained two key frames, as shown in Fig. 6.
The third example considered for our experiments is lecture video 3 sequence.
The radiometric correlation plot with corresponding entropy value plot of this
sequence is shown in Fig. 7. The automated thresholding scheme on the entropic
plot produced two shots for this sequence. The key frame extraction process results
in 11 key frames, and after pruning, six key frames are retained, as shown in
Fig. 8.
Similar experiments are conducted on other sequences to validate our results.
The fourth example considered is the lecture video 4 sequence. The radiometric
correlation plot with the corresponding entropy values is shown in Fig. 9, and the key
frames extracted from this sequence are shown in Fig. 10. This sequence is found to
contain a total of four shots, and after the proposed pruning mechanism, a total of
four key frames are detected. It may be noted that this video contains camera
movements/jitter; however, the proposed scheme overcomes this without false
detections. A detailed discussion with an example of camera jitter is provided in
Sect. 3.2.
The next example considered for our experiment is the lecture video 5 sequence,
whose radiometric correlation plot and selected key frames are shown in Figs. 11
and 12. This sequence has several instances of fade-in and fade-out. However, the
proposed scheme is still able to effectively identify the exact number of key frames.
A detailed analysis with an example is provided in Sect. 3.2.

Fig. 5 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 2 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Location of
the shots. (d) Key frames of shot 1 [18, 78, 143, 228]. (e) Key frames of shot 2 [277, 389, 467]

The next examples considered are lecture video 6 and lecture video 7 sequences.
The entropic value plots with the shot categorization and the selected key frames are
provided in Figs. 13, 14, 15, and 16. In these two sequences, the scene undergoes

Fig. 6 Key frames for lecture video 2 [143, 389] out of 505 frames and two shots

zoom-in and zoom-out conditions. However, this does not affect the results of the
proposed scheme.
All the sequences considered in our experiment are chosen to validate the proposed
scheme in different challenging scenarios: camera motion, scenes with different
subtopics, and camera zoom-in and zoom-out. A detailed analysis of these is provided
in the next section.

3.2 Discussions and Future Works

In this section, we provide a quantitative analysis of the results with brief discussions
of the advantages/disadvantages and other issues related to the proposed work. The
efficiency of the algorithm is assessed in terms of key frame extraction and
computational complexity. The computational times for the proposed and existing
algorithms are given in Tables 1 and 2. From these tables, it can be observed that the
proposed algorithm takes more computational time than PWPD, CHBA, and ECR,
but these algorithms are not as good as the proposed algorithm in key frame
extraction. The other existing algorithms, LTD, KS-SIFT, and RPCA, produce key
frame extraction results similar to those of the proposed algorithm, but the proposed
algorithm requires much less computational time. From this, we can conclude that
the proposed algorithm performs better in key frame extraction with less
computational complexity.
Here, it is required to mention that the shot boundary identification from lecture
sequence is a challenging task. The similarity among the frames of the video
contains a large amount of uncertainty due to variation in color, artificial effects
like fade-in and fade-out, illumination changes, object motion, camera jitter, and
zooming and shrinking. The proposed scheme is found to provide better results for
all of these considered scenarios. The performance of the proposed scheme in each
of these scenarios is discussed with examples as follows:
Figure 17 shows two examples of motion blur condition. The two examples
depict the two frames from two different shots of lecture video 2 and lecture video

Fig. 7 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 3 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Location of
the shots. (d) Key frames of the shot 1 [7, 90, 150, 231]. (e) Key frames of the shot 2 [241, 344,
429]. (f) Key frames of the shot 3 [481, 562, 634, 706]

5 sequences, respectively. Due to motion blur, it is hard to recognize them as part
of the same shot; hence, different existing schemes identify them as parts of two
distinct shots. However, the proposed scheme can distinguish them as part of the

Fig. 8 Key frames for lecture video 3 [7, 150, 241, 429, 481, 706]

single shot. This happens due to the capability of HOG features used in the proposed
scheme.
Figure 18 shows an example of shots with fade-in and fade-out conditions. Other
schemes are not able to represent them as two different shots; instead, they detect
three different shots: text on the board, the professor, and the fade-in/fade-out frames.
However, the proposed scheme correctly represents them as two shots for each
sequence. This is due to the capability of the entropic measure, which diminishes
the effect of variations in the radiometric correlation measure.
Figure 19 shows an example from the lecture video 4 sequence where the scene
undergoes camera jitter or movements. Owing to the combination of the HOG
feature with the radiometric similarity measure, the proposed scheme detects these
frames as part of a single shot.
A similar analysis is made on the lecture video 4 and lecture video 6 sequences
under zoom-in and zoom-out conditions (shown in Fig. 20). The view variation
caused by the camera zooming in and out is also found to be detected as a single
shot by the proposed scheme, as shown in Fig. 20. This is due to the integration of
the radiometric similarity with the entropic measure, which deals with real-life
uncertainty and efficiently detects the shot transitions in the considered challenging
scenarios. Figure 21 shows another example with noise. The proposed scheme does
not split these frames into different shots, whereas the existing techniques fail in
this respect.
From the above analysis, we find that the proposed scheme provides better results
against variation in color, artificial effects like fade-in and fade-out, illumination
changes, object motion, camera jitter, zooming and shrinking, and noisy video
scenes. It is to be noted that most of the false detections of key frames by the other
schemes considered for comparison in Tables 1 and 2 are due to the aforementioned
effects. The effectiveness of the proposed scheme can be attributed to two phases.
In the first phase, the use of HOG features preserves the shape information of a
given lecture video. The shape information includes details of the text on the board,
drawings, slides, pictures, the teaching professor, etc. It is also to be noted that, as
reported in the literature, the HOG feature

Fig. 9 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 4 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames
of the shot 1 and 2 [5, 127, 351, 494]. (d) Key frames of the shot 3 and 4 [502, 647, 808, 922,
1018]

is known to provide good results against illumination changes, motion blur, and
noisy video scenes. This is quite evident from Figs. 17, 18, 19, 20, and 21. In the
second phase, the radiometric similarity between the frames is computed and the
variation in it is reduced by mapping it to an entropic scale. This minimizes false
detections of key frames and is effective against fade-in, fade-out, zoom-in and
zoom-out, and other irrelevant effects in the video.

Fig. 10 Key frames for lecture video 4 [127, 494, 647, 1018]

Fig. 11 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 5 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames
of the shots 1, 2, and 3 [2, 245, 487, 721, 776, 909]

Fig. 12 Key frames for lecture video 5 [2, 487, 776]

There are a few parameters used in the proposed scheme that need further
discussion. One of the important parameters used in this article is the one-dimensional
window size, i.e., the neighborhood used for computing the entropy from the
radiometric similarity plot. In the proposed scheme, we have used a fixed window

Fig. 13 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 6 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames
of the shots 1 and 2 [128, 481, 1422]

Fig. 14 Key frames for lecture video 6 [128, 481, 1422]

of size (7 × 1) for all the considered video sequences. However, a variable-sized
window may also be considered. For all the considered sequences, the choice of
window size may affect the performance of the proposed scheme: if the number of
frames in a particular video is high and a small window is chosen, there will be many
false shot transitions, whereas if the number of frames is low and a large window is
chosen, a few shot transitions may be missed. A tabular representation of the
performance of the proposed scheme on all the considered sequences with different
window sizes is provided in Table 3. The proposed scheme is tested with window
sizes (11 × 1), (9 × 1), (7 × 1), (5 × 1), and (3 × 1), and the number of key frames
detected for each window size is presented in Table 3. It is also observed from this

Fig. 15 Radiometric correlation, corresponding entropy plots, and extracted key frames from each
shot for lecture video 7 sequence. (a) Radiometric correlation. (b) Entropy values. (c) Key frames
of the shots 1, 2, and 3 [243, 824, 1236, 1618, 2015, 4143]

Fig. 16 Key frames for lecture video 7 [243, 1236, 2015, 4143]

table that the window sizes (5 × 1), (7 × 1), and (9 × 1) give almost the same results
for most of the sequences in terms of the number of key frames detected, and the
results obtained with the window sizes (7 × 1) and (9 × 1) are identical. Hence,
averaging all the results obtained by manual trial and error suggests that a (7 × 1)
window size provides an acceptable result, and we therefore fixed the window size
to (7 × 1). It is to be noted that all experiments are performed on a frame size of
320 × 240.
In this article, all the results reported in Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21 and Tables 1, 2, 3 for comparison with the proposed
scheme were generated by the authors in their laboratory using MATLAB software.
The codes for all the considered techniques were implemented in an optimized
manner so as to validate the proposed scheme on the same scale.
Table 1 Comparison of different lecture videos with existing algorithms (number of key frames, key frame numbers, and computational time CT in seconds)

Lecture video 1 (# frames = 1497)
  PWPD [2]: 4 key frames [246, 771, 1024, 1295], CT 182.84
  CHBA [4]: 4 key frames [246, 771, 1024, 1295], CT 456.93
  ECR [14]: 7 key frames [246, 771, 1024, 1295, 1314, 1321, 1326], CT 985.66
  LTD [5]: 6 key frames [186, 558, 838, 943, 1189, 1266], CT 1208.36
  KS-SIFT [16]: 6 key frames [186, 558, 751, 943, 1076, 1189], CT 1319.57
  RPCA [17]: 7 key frames [186, 558, 838, 943, 1076, 1189, 1266], CT 1328.66
  Proposed: 4 key frames [186, 772, 1289, 1392], CT 845.36

Lecture video 2 (# frames = 505)
  PWPD [2]: 1 key frame [240], CT 13.45
  CHBA [4]: 5 key frames [90, 134, 180, 240, 270], CT 34.99
  ECR [14]: 2 key frames [12, 240], CT 31.29
  LTD [5]: 5 key frames [79, 143, 277, 389, 467], CT 68.94
  KS-SIFT [16]: 5 key frames [9, 143, 223, 389, 467], CT 72.37
  RPCA [17]: 6 key frames [9, 143, 223, 277, 389, 467], CT 75.22
  Proposed: 2 key frames [143, 389], CT 45.96

Lecture video 3 (# frames = 737)
  PWPD [2]: 2 key frames [240, 480], CT 66.02
  CHBA [4]: 10 key frames [66, 180, 240, 273, 303, 420, 480, 531, 600, 681], CT 524.19
  ECR [14]: 9 key frames [130, 240, 480, 601, 605, 606, 609, 613, 642], CT 560.33
  LTD [5]: 7 key frames [66, 150, 240, 429, 480, 642, 706], CT 665.87
  KS-SIFT [16]: 7 key frames [66, 180, 273, 420, 480, 681, 706], CT 705.69
  RPCA [17]: 7 key frames [10, 150, 240, 429, 481, 600, 706], CT 719.02
  Proposed: 6 key frames [7, 150, 241, 429, 481, 706], CT 410.73

Lecture video 4 (# frames = 1025)
  PWPD [2]: 5 key frames [127, 394, 847, 981, 1018], CT 255.22
  CHBA [4]: 5 key frames [112, 506, 647, 901, 1001], CT 502.81
  ECR [14]: 8 key frames [102, 409, 647, 709, 811, 905, 992, 1013], CT 1027.92
  LTD [5]: 7 key frames [27, 323, 419, 647, 709, 899, 1001], CT 1278.45
  KS-SIFT [16]: 5 key frames [127, 480, 617, 712, 1022], CT 1899.14
  RPCA [17]: 8 key frames [102, 399, 619, 700, 833, 909, 999, 1020], CT 1928.83
  Proposed: 4 key frames [127, 494, 647, 1018], CT 1021.22

Table 2 Comparison of different lecture videos with existing algorithms (number of key frames, key frame numbers, and computational time CT in seconds)

Lecture video 5 (# frames = 963)
  PWPD [2]: 6 key frames [2, 144, 685, 902, 1012, 1219], CT 193.95
  CHBA [4]: 6 key frames [2, 255, 685, 912, 1012, 1219], CT 418.64
  ECR [14]: 7 key frames [2, 245, 681, 915, 1022, 1219, 1408], CT 848.25
  LTD [5]: 6 key frames [25, 145, 802, 1005, 1219, 1408], CT 1064.55
  KS-SIFT [16]: 4 key frames [2, 144, 951, 1219], CT 1406.69
  RPCA [17]: 7 key frames [2, 144, 778, 951, 1077, 1219, 1425], CT 1481.73
  Proposed: 3 key frames [2, 144, 1219], CT 894.52

Lecture video 6 (# frames = 1507)
  PWPD [2]: 3 key frames [297, 965, 1501], CT 230.35
  CHBA [4]: 8 key frames [105, 303, 481, 551, 719, 845, 909, 1378], CT 476.27
  ECR [14]: 8 key frames [82, 125, 398, 592, 704, 899, 1004, 1365], CT 914.96
  LTD [5]: 6 key frames [82, 762, 998, 1092, 1304, 1405], CT 1108.12
  KS-SIFT [16]: 4 key frames [97, 709, 1065, 1385], CT 1724.55
  RPCA [17]: 6 key frames [127, 762, 827, 1065, 1284, 1495], CT 1781.95
  Proposed: 3 key frames [128, 481, 1422], CT 952.86

Lecture video 7 (# frames = 4327)
  PWPD [2]: 5 key frames [228, 456, 921, 2547, 4129], CT 239.88
  CHBA [4]: 8 key frames [54, 228, 456, 756, 921, 1221, 2547, 4529], CT 504.39
  ECR [14]: 8 key frames [84, 218, 456, 756, 921, 1221, 2547, 4019], CT 958.27
  LTD [5]: 7 key frames [32, 218, 460, 756, 921, 2221, 4071], CT 1169.51
  KS-SIFT [16]: 6 key frames [84, 241, 456, 756, 2224, 4071], CT 1804.28
  RPCA [17]: 7 key frames [84, 218, 456, 756, 1221, 2224, 4050], CT 1881.37
  Proposed: 4 key frames [243, 1236, 2015, 4143], CT 994.24

Fig. 17 Detected as part of single shot with motion blur

Fig. 18 Detected as part of single shot with fade-in and fade-out

The proposed scheme is mainly designed for lecture video segmentation, i.e., shot
boundary detection in lecture video sequences. It is to be noted that a lecture
sequence mostly has two or three different kinds of frames or shots, which include
the face of the professor, the written text/slides, and the hand of the professor. The
transitions between these shots in a video typically occur as face of the professor to
hand, hand to text on the board, board to hand, and then again hand to the professor's
face. In a few cases, the view may change from the text to the face of the professor
and back to the text. Hence, before the start of every new topic or subtopic, most
videos undergo a transition of the form old topic/subtopic to face of the professor to
new topic/subtopic, and the proposed scheme detects these as three different shots.
However, in rare cases, the transition may occur directly from an old topic/subtopic
to a new topic/subtopic. In this case, it is difficult to identify the shot transition.
Segments of the radiometric correlation plot and the corresponding entropic value
plot of a shot containing a combination of two subtopics are shown in Fig. 22. The
proposed scheme fails to separate the two contents into two different shots because
the significant change in scene view is not reflected in the radiometric similarity;
hence, the entropic plot fails to distinguish them. One way to solve this issue is to
split the radiometric correlation plot into different parts and then compute the
entropy values locally in each part. Figure 23 shows such an example, where the
entropic plot becomes easily separable at the change of topic/subtopic in the video.
This is a preliminary result, and the choice of where to split the radiometric plot is
currently manual; in the future, we would like to work more on this issue. The
proposed scheme mainly identifies the gradual shot transitions, and in the future, we
would like to develop techniques that can also determine the soft transitions.

Table 3 Performance comparison with different window sizes: number of key frames detected

Window size: Lecture video 1 (# frames = 1497), 2 (505), 3 (737), 4 (1927), 5 (1484), 6 (1807), 7 (1890)
  (11 × 1): 3, 2, 4, 3, 2, 3, 3
  (9 × 1):  4, 2, 6, 4, 3, 3, 4
  (7 × 1):  4, 2, 6, 4, 3, 3, 4
  (5 × 1):  4, 2, 7, 4, 4, 3, 4
  (3 × 1):  5, 3, 7, 5, 4, 5, 5

Fig. 19 Detected as part of single shot with camera movements and jitter

Fig. 20 Detected as single shot for zoomed in and out condition with view variation for different
video

Fig. 21 Detected single shot in the presence of noise

Fig. 22 Radiometric
correlation plot and entropic
values for a shot with a
combination of two subtopics

4 Conclusions

In this article, a shot boundary detection and key frame extraction technique for
lecture video sequences is proposed, using an integration of HOG and radiometric
correlation with an entropy-based thresholding scheme. In the proposed approach,

Fig. 23 Radiometric
correlation plot and obtained
split entropic values

the advantages of the HOG feature are explored to describe each frame effectively.
The n-dimensional HOG features extracted from consecutive image frames are
compared using the radiometric correlation measure. The radiometric correlation
over the complete video is found to contain a significant amount of uncertainty due
to variations in color, illumination, and camera and object motion. To deal with
these uncertainties, entropic thresholding is applied to it to find the shot boundaries.
After detecting the shot boundaries, the key frames of each shot are obtained by
analyzing the peaks and valleys of the entropy-based distribution associated with
the radiometric correlation measure. The proposed scheme is tested on several
lecture sequences, and results on seven lecture video sequences are reported here.
The results obtained by the proposed scheme are compared against six existing
state-of-the-art techniques in terms of computational time and shot detection, and
the proposed scheme is found to perform better.

References

1. Hu, W., Xie, N., Li, L., Zeng, X., & Maybank, S. (2011). A survey on visual content-based
video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C:
Applications and Reviews, 41, 797–819.
2. Zhang, H. J., Kankanhalli, A., & Smoliar, S. W. (1993). Automatic partitioning of full-motion
video. ACM/Springer Multimedia System, 1, 10–28.
3. Huang, C. L., & Liao, B. Y. (2001). A robust scene-change detection method for video
segmentation. IEEE Transactions on Circuits and Systems for Video Technology, 11, 1281–
1288.
4. Borecsky, J. S., & Rowe, L. A. (1996). Comparison of video shot boundary detection
techniques. Proceedings of SPIE, 2670, 170–179.
5. Grana, C., & Cucchiara, R. (2007). Linear transition detection as a unified shot detection
approach. IEEE Transactions on Circuits and Systems for Video Technology, 17, 483–489.
6. Patel, N. V., & Sethi, I. K. (1997). Video shot detection and characterization for video
databases. Pattern Recognition, 30, 583–592.
7. Cernekova, Z., Pitas, I., & Nikou, C. (2006). Information theory-based shot cut/fade detection
and video summarization. IEEE Transactions on Circuits and Systems for Video Technology,
16, 82–91.
8. Lee, M. H., Yoo, H. W., & Jang, D. S. (2006). Video scene change detection using neural
network: Improved ART2. Expert Systems and Applications, 31, 13–25.
9. Cooper, M., & Foote, J. (2005). Discriminative techniques for keyframe selection. In Proceed-
ings of ICME (pp. 502–505). Amsterdam, The Netherlands.
10. Haoran, Y., Rajan, D., & Chia, L. T. (2006). A motion-based scene tree for browsing and
retrieval of compressed video. Information Systems, 31, 638–658.

11. Cooper, M., Liu, T., & Rieffel, E. (2007). Video segmentation via temporal pattern classifica-
tion. IEEE Transactions on Multimedia, 9, 610–618.
12. Duan, F. F., & Meng, F. (2020). Video shot boundary detection based on feature fusion and
clustering technique. IEEE Access, 8, 214633–214645.
13. Abdulhussain, S. H., Ramli, A. R., Mahmmod, B. M., Saripan, M. I., Al-Haddad, S. A. R., &
Jassim, W. A. (2019). Shot boundary detection based on orthogonal polynomial. Multimedia
Tools and Applications, 78(14), 20361–20382.
14. Lei, S., Xie, G., & Yan, G. (2014). A novel key-frame extraction approach for both video
summary and video index. The Scientific World Journal, 1–9.
15. Bendraou, Y., Essannouni, F., Aboutajdine, D., & Salam, A. (2017). Shot boundary detection
via adaptive low rank and SVD-updating. Computer Vision and Image Understanding, 161,
20–28.
16. Barbieri, T. T. S., & Goularte, R. (2014). KS-SIFT: a keyframe extraction method based on
local features. In IEEE International Symposium on Multimedia (pp. 13–17). Taichung.
17. Dang, C., & Radha, H. (2015). RPCA-KFE: Key frame extraction for video using robust
principal component analysis. IEEE Transactions on Image Processing, 24, 3742–3753.
18. Dalal, N., & Triggs, B. (2005). Histogram of oriented gradients for human detection.
Proceedings of CVPR, 1, 886–893.
19. Spagnolo, P., Orazio, T. D., Leo, M., & Distante, A. (2006). Moving object segmentation by
background subtraction and temporal analysis. Image and Vision Computing, 24, 411–423.
20. Zabih, R., Miller, J., & Mai, K. A. (1995). A feature-based algorithm for detecting and
classifying scene breaks. In Proceedings of ACM Multimedia (pp. 189–200). San Francisco,
CA.
21. Singh, A., Thounaojam, D. M., & Chakraborty, S. (2020, June). A novel automatic shot
boundary detection algorithm: Robust to illumination and motion effect. Signal, Image Video
Process., 14(4), 645–653.
22. Subudhi, B. N., Veerakumar, T., Esakkirajan, S., & Chaudhury, S. (2020). Automatic lecture
video skimming using shot categorization and contrast based features. Expert Systems with
Applications, 149, 113341.
23. Shen, R. K., Lin, Y. N., Juang, T. T. Y., Shen, V. R. L., & Lim, S. Y. (2018, March). Automatic
detection of video shot boundary in social media using a hybrid approach of HLFPN and
keypoint matching. IEEE Transactions on Computational Social Systems, 5(1), 210–219.
Detection of Road Potholes Using
Computer Vision and Machine Learning
Approaches to Assist the Visually
Challenged

U. Akshaya Devi and N. Arulanand

1 Introduction

It can be challenging for blind people to move around different places indepen-
dently. The presence of potholes, curbs, and staircases is a hindrance for blind
people to travel to various places freely without having to rely on others. The
necessity in identifying the potholes, curbs, and other obstacles on the pathway has
led many researchers to build smart systems to assist blind people. Various smart
systems incorporated in the walking stick, wearable system, etc. are being proposed
in achieving the aim of pothole detection for blind users.
The proposed system is a vision-based experimental study that employs machine
learning classification with computer vision techniques and a deep learning object
detection model to detect potholes with improved precision and speed. In machine
learning classification with computer vision approach, the images are preprocessed
and features extraction methods such as HOG (Histogram of Oriented Gradients)
and LBP (Local Binary Pattern) are applied with an assumption that use of a fusion
of feature vector of HOG and LBP feature descriptors will improve the classification
performance. Various classification models are implemented and compared using
performance evaluation metrics and methodologies. The process is extended to
pothole localization for the images that are classified as pothole images. The proof
of the hypothesis, i.e., the use of a fusion of feature extraction methods will improve
the performance of the classification model, is derived. The second approach is
pothole detection using a deep learning model. Through the years, deep learning
has proven to provide reliable solutions to real-world problems involving computer
vision and image analysis. The convolutional neural network in deep learning plays

U. Akshaya Devi · N. Arulanand
Department of Computer Science and Engineering, PSG College of Technology, Coimbatore, India


a vital role in extracting features and classifying the data precisely. In this approach,
YOLO v3 model is implemented for the pothole detection system. The results of
the detection of potholes are analyzed, and the efficiency of the proposed system in
outdoor real-time navigation for visually challenged people is studied.

2 Related Works

Mae M. Garcillanosa et al., [1] implemented a system to detect and report the
presence of potholes using image processing techniques. The system was installed
in a vehicle with a camera and Raspberry Pi that will monitor the pavements. The
processing was performed on the real-time video at a rate of 8 frames per second.
Canny edge detection, contour detection, and final filtering were carried out on each
video frame. The location and image of the pothole are captured when the pothole is
detected, which can later be viewed. The system achieved an accuracy of 93.72% in
pothole detection, but an improvement was required in recognizing the normal road
conditions. The total processing time was 0.9967 seconds for video frames with
the presence of potholes and 0.8994 seconds for video frames with normal road
conditions.
Aravinda S. Rao et al. [2] proposed a system to detect potholes, staircases,
and curbs using a systematic computer vision algorithm. An Electronic Travel Aid
(ETA) equipped with a camera and a laser was employed to capture the pathway. The
camera was mounted on the ETA with an angle of 30°–45° between the camera and
the vertical axis and a distance of 0.5 meters between the camera and the pathway.
The Canny edge detection algorithm and Hough transform were used to process
each frame in the video to detect the laser lines. The output of the Hough transform
that depicts the number of intersecting lines was transformed into the Histogram of
Intersections (HoI) feature. The Gaussian Mixture Model (GMM) learning model
was utilized to detect whether the pathway is safe or unsafe. The system gave an
accuracy of over 90% in detecting the potholes. Since the system uses laser patterns
to identify the potholes, it can only be used during the nighttime.
Kanza Azhar et al. [3] proposed a system to detect the presence of potholes for
proper maintenance of the roadways. For classifying pothole/non-pothole images,
HOG (Histogram of Oriented Gradients) representation of the input image was
generated. The HOG feature vector was provided to the Naïve Bayes classifier as
it has higher scalability and strong independent nature. For the images classified
as an image containing pothole(s), localization of pothole(s) was carried out using
a technique called graph-based segmentation using normalized cut. The system
attained an accuracy of 90%, precision of 86.5%, recall of 94.1%, and a processing
time of 0.673 seconds.
The core idea of the research work by Muhammad Haroon Yousaf et al. [4]
is to detect and localize the potholes in an input image. The input image was
converted from RGB color space to grayscale and resized to 300 × 300 pixels.
The system was implemented using the following steps: feature extraction, visual

vocabulary construction, histogram of words generation, and classification using
Support Vector Machine (SVM). The Scale-Invariant Feature Transform (SIFT) was
used to represent the pavement images as a visual vocabulary of words. To test and
train the histogram of words, support vector algorithm was applied. The system gave
an accuracy of 95.7%.
Yashon O. Ouma et al. [5] developed a system to detect potholes on asphalt
road pavements and estimate the areal extent of the detected potholes. The Fuzzy
c-means clustering algorithm was used to partition each pixel in the image into a
collection of M-fuzzy cluster centers. Since the FCM is prone to noise or outliers, a
small weight was assigned to the noisy data points and a large weight to the clean
data points to estimate an accurate cluster center. It was followed by morphological
reconstruction. The clusters as an output of the FCM clustering algorithm were used
to deduce and characterize the region as linear cracks, non-distress areas, or no-data
regions. The mean CPU runtime of the system was 95 seconds. The dice coefficient
of similarity, Jaccard Index, and sensitivity metric for the pothole detection were
87.5%, 77.7%, and 97.6% respectively.
Byeong-ho Kang et al. [6] introduced a system that involves 2D LiDAR laser
sensor-based pothole detection and vision-based pothole detection. In the 2D
LiDAR-based pothole detection method, the steps include filtering, clustering, line
extraction, and gradient of data function. In the vision-based pothole detection
method, the steps include noise filtering, brightness control, binarization, additive
noise filtering, edge extraction, object extraction, noise filtering, and detection of
potholes. The system exhibited a low error rate in the detection of potholes.
Emir Buza et al. [7] proposed an unsupervised vision-based method with the
utilization of image processing and spectral clustering for the identification and
estimation of potholes. An accuracy of 81% was obtained for the estimation of
pothole regions on images with varied sizes and shapes of potholes.
Ping Ping et al. [8] proposed a pothole detection system using deep learning
algorithms. The models include YOLO V3 (You Only Look Once), SSD (Single
Shot Detector), HOG (Histogram of Oriented Gradients) with SVM (Support Vector
Machine), and Faster R-CNN (Region-based Convolutional Neural Network). The
data preparation involved labeling the images by creating bounding boxes on the
objects using an image labeling tool. The resultant XML data of the image were
appended to a CSV file. This file was used as an input file for the deep learning
models. The performance comparison of the four models indicated that the YOLO
V3 model performed well with an accuracy of 82% in detecting the potholes.
The system proposed by Aritra Ray et al. [9] constitutes a low-power and
portable embedded device to serve as a visual aid for visually impaired people.
The system uses a distance sensor and pressure sensor to detect potholes or speed
breakers when the user is walking along the roadside. The device containing the
sensors was attached to the walking stick and the communication takes place
through voice messages. A simple mobile application was also developed that can
be launched by pressing the volume up button of the mobile phone. Using speech
communication, the user can provide his/her destination and will be guided to the
location using the Google Maps navigation facility. The smart

portable device attached to the foldable walking stick was assembled with the
following: ATmega328 8-bit microcontroller, HC-SR04 ultrasonic distance sensor,
signal conditioner, pressure sensor, speaker, android device, walking stick, buzzer
(piezoelectric), and power supply (using Li-Ion rechargeable batteries - 2500 mAh,
AA1.2V X 4). The pressure sensor was attached to the bottom end of the walking
stick. When the user strikes the walking stick on the ground, the reading is taken
from the pressure sensor as well as the ultrasonic sensor. Using the Pythagoras
theorem, a predefined value is set for the value that would be sensed by the ultrasonic
sensor. If the currently sensed value exceeds that predefined value, the system
informs the user that there is a presence of obstacles like potholes. If the currently
sensed value is lesser than the predefined value, the system informs the user that
there is a presence of obstacles like speed breakers. The sensitivity of object
detection exceeded 96%.
It can be noted from the previous works that there is a scope of improvement
in detection accuracy as well as processing speed, and the false-negative outcomes
in the detection results can be reduced. Most of the related works are targeted for
periodic assessment and maintenance of the roadways in which the system takes
high runtime, whereas the pothole detection for the visually challenged requires
the system to perform with high speed and accuracy that swiftly alerts the user
if any pothole is detected. Thus, the main idea behind the proposed approach is
to develop a precise, fast pothole detection system that is effective and beneficial
for the visually challenged. Two approaches (machine learning algorithm with
computer vision techniques and a deep learning model) were implemented using
suitable machine learning and deep learning models for real-time pothole detection.
The system is trained with pothole images of various shapes and textures to provide
a broad solution. In the case of machine learning algorithm and computer vision
approach, the system performs localization of pothole region only if the image
is classified as a pothole image. This step helps in improving the computational
efficiency of the system as it reduces the number of false-negative outcomes by the
system.

3 Methodologies

3.1 Pothole Detection Using Machine Learning and Computer


Vision

The first approach of pothole detection comprises image processing techniques to
preprocess the input data, computer vision algorithms such as HOG (Histogram of
Oriented Gradients), LBP (Local Binary Patterns) feature descriptors to extract the
feature set, and machine learning classifiers to classify pothole/non-pothole images.
The system architecture is shown in Fig. 1. Various feature extraction methods,
machine learning algorithms, and evaluation methodologies are described briefly
in the following subsections.

Fig. 1 System architecture



HOG Feature Descriptor


The HOG (Histogram of Oriented Gradients) feature descriptor is a well-known
algorithm to extract important features for building the image detection and
recognition system. The input image of the HOG feature descriptor must be of
standard size and color scale (grayscale). The horizontal (gx ) and vertical (gy )
gradients are computed by filtering the image with the kernels Mx and My as in
Eqs. 1 and 2.

$$M_x = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix} \qquad (1)$$

$$M_y = \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} \qquad (2)$$

Equations 3 and 4 are used to determine the value of the gradient magnitude “g”
and gradient angle “θ .”

$$g = \sqrt{g_x^2 + g_y^2} \qquad (3)$$

$$\theta = \tan^{-1}\left(\frac{g_y}{g_x}\right) \qquad (4)$$

Assume that the images are resized to a standard size of 200 × 152 pixels and the
parameters such as pixels per cell, cells per block, and number of orientations are
set to (8,8), (2,2), 9 respectively. Thereby, each image is divided into 475 (25 × 19)
nonoverlapping cells of 8 × 8 pixels. In each cell, the magnitude values of the 64
pixels are binned and cumulatively added into nine buckets of gradient orientation
histogram (Fig. 2).

Fig. 2 Gradient orientation histogram with nine bins (orientations); the x-axis shows the orientation (0–180) and the y-axis the accumulated gradient magnitude

Fig. 3 LBP feature extraction

A block of 2 × 2 cells is slid across the image. In every block, the corresponding
histogram of each of the four cells is normalized into a 36 × 1 element vector. This
process is repeated until the feature vector of the entire image is computed. The
prime benefit of the HOG feature descriptor is its capability of extracting the basic
yet meaningful information of an object such as shape, outline, etc. It is simpler,
less powerful, and faster in computation compared to deep learning object detection
models.
LBP Feature Descriptor
Local Binary Patterns (LBP) feature descriptor is mainly used for texture classifica-
tion. To compute the LBP feature vector, neighborhood thresholding is computed for
each pixel in the image, and the existing pixel value is replaced with the threshold
result. For example, the image is divided into 3 × 3 pixel cells as shown in Fig.
3. The pixel value of the eight neighbors is compared with the value of the center
pixel (value = 160). If the value of the center pixel is greater than the pixel value
of the neighbor, the neighboring pixel takes the value “0”; otherwise, it is “1.” The
resultant eight-bit binary code is converted into a decimal number and stored as
the center pixel value. This procedure is implemented for all the pixels in the input
image. A histogram is computed for the image with the number of bins ranging
from 0 to 255 where each bin denotes the frequency of that value. The histogram is
normalized to obtain a one-dimensional feature vector.
The main advantages of the LBP feature descriptor are computational simplicity
and discriminative nature. Such properties in a feature descriptor are highly useful
in real-time settings.
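To illustrate how the fused HOG + LBP feature vector described in this chapter could be formed, a short sketch is given below; the 128 × 96 input size and the HOG cell/block/orientation settings follow the implementation section, whereas the LBP neighborhood (P = 8, R = 1, default method with a 256-bin histogram) is an assumption consistent with the description above.

```python
import cv2
import numpy as np
from skimage.feature import hog, local_binary_pattern

def fused_features(gray_image):
    """Concatenate a HOG descriptor and a normalized 256-bin LBP histogram."""
    img = cv2.resize(gray_image, (128, 96))
    hog_vec = hog(img, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2), block_norm='L2-Hys')
    lbp = local_binary_pattern(img, P=8, R=1, method='default')   # 8 neighbours, radius 1
    hist, _ = np.histogram(lbp, bins=256, range=(0, 256))
    hist = hist.astype(float) / (hist.sum() + 1e-7)                # normalized LBP histogram
    return np.concatenate([hog_vec, hist])                         # fused feature vector
```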

Machine Learning Models


Various machine learning models employed in the system are Adaboost, Gaussian
Naïve Bayes, Random Forest, and Support Vector Machine. The Adaboost classifier
is an iterative boosting algorithm that combines multiple weak classifiers to get an
accurate strong classifier. It is trained iteratively by selecting the training set based
on the accurate prediction of the previous training. The weights of each classifier are
set randomly during the first iteration, and the weights of each classifier during the
successive iterations are based on its classification accuracy in its previous iteration.
This process is continued until a maximum number of estimators is reached. This
boosting algorithm combines a set of weak learners to generate a strong learner that
shows a better classification accuracy and lower generalization error. The Naïve
Bayes classifier is based on the Bayes’ theorem with an assumption of strong
independence between the features. It is ideal for real-time applications due to its
simplicity and faster predictions. Support Vector Machine (SVM) algorithm is a
supervised learning algorithm that is used for both classification and regression
problems. The SVM algorithm aims to create a decision boundary that can segregate
n-dimensional space to distinctly classify the data points. This decision boundary is
called a hyperplane. The SVM classifier chooses the extreme points/vectors to create
the hyperplane. These vectors are called support vectors, and hence the algorithm
is termed Support Vector Machine. The Random Forest algorithm is an ensemble
learning method used for classification and regression problems. A Random Forest
model applies several decision trees on random subsets of the dataset, and enhanced
prediction accuracy is obtained by combining the results of the individual decision
trees. The model provides high prediction accuracy and limits overfitting to some
extent.
Performance Evaluation
The evaluation of a machine learning model is an important aspect to measure the
effectiveness of the model for a given problem. Accuracy, precision, recall, and F1
score are the performance metrics used in this work to determine the performance
of a model. The performance metrics are computed as shown in the Eqs. 5, 6, 7,
and 8:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (6)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (7)$$

$$\text{F1 score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (8)$$

where TP, TN, FP, and FN correspond to true positive, true negative, false positive, and
false negative instances, respectively. Accuracy is the number of correctly predicted
instances out of all the instances. Precision quantifies the number of positively
predicted instances that actually belong to the positive class. Recall quantifies the
number of correctly predicted positive instances made out of all positive instances
in the dataset. The F1 score also called F-score or F-Measure provides a single score
that balances both precision and recall. It can be described as a weighted average of
precision and recall.
In addition to the performance metrics, the AUC-ROC curve is plotted for binary
classifiers. ROC curve (Receiver Operating Characteristic curve) is a probability
curve that is plotted with FPR (false-positive rate) against TPR (true-positive rate or
recall). The AUC (Area Under the Curve) score defines the capability of the model
to distinguish between the positive class and negative class. The score usually ranges
from 0.0 to 1.0 where a score of 0.0 denotes the inability of the model to distinguish
between positive/negative classes and a value of 1.0 denotes the strong ability of the
model to distinguish between positive/negative classes.
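For reference, the metrics of Eqs. (5)-(8) and the AUC score can be computed with scikit-learn as in the following sketch; the variable names y_true, y_pred, and y_score are placeholders for the ground-truth labels, predicted labels, and positive-class scores of a binary pothole classifier.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def summarize(y_true, y_pred, y_score):
    """Accuracy, precision, recall, F1 (Eqs. 5-8) and AUC for a binary classifier."""
    return {
        'accuracy':  accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall':    recall_score(y_true, y_pred),
        'f1':        f1_score(y_true, y_pred),
        'auc':       roc_auc_score(y_true, y_score),
    }
```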
Localization of Potholes
The three steps in localization of potholes are pre-segmentation using k-means
clustering, construction of Region Adjacency Graph (RAG), and normalized graph
cut. In the pre-segmentation stage, the image is segmented using k-means clustering.
The result of this step will give the centroid of all the segmented clusters. In
the second step, the Region Adjacency Graph is constructed using mean colors.
The obtained clusters are represented as nodes where any two adjacent nodes are
separated by an edge in the RAG. The nodes that are similar in color are merged,
and the value of edges is set as the difference in the average of RGB of the adjacent
nodes. On the Region Adjacency Graph, a two-way normalized cut is performed
recursively as step 3. Thereby, the result will contain a set of nodes where any two
points in the same node have a high degree of similarity and any two points in
different nodes have a high degree of dissimilarity. As a result, the pothole region
can be clearly differentiated from the other regions from the image.
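A hedged sketch of this localization pipeline using scikit-image is given below; SLIC superpixels are used here as a k-means-style stand-in for the pre-segmentation step, and depending on the scikit-image version the graph utilities live in skimage.graph or skimage.future.graph.

```python
# In scikit-image < 0.20 the graph utilities are in skimage.future.graph instead.
from skimage import segmentation, graph, color

def localize_pothole_regions(rgb_image, n_segments=100):
    """Pre-segment, build a mean-colour RAG and apply a recursive two-way N-cut."""
    labels = segmentation.slic(rgb_image, n_segments=n_segments, start_label=1)
    rag = graph.rag_mean_color(rgb_image, labels, mode='similarity')
    ncut_labels = graph.cut_normalized(labels, rag)             # recursive two-way normalized cut
    return color.label2rgb(ncut_labels, rgb_image, kind='avg')  # regions painted with mean colour
```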

3.2 Pothole Detection Using Deep Learning Model

YOLO V3 Model
The YOLO (You Only Look Once) object detection algorithm is based on a single
deep convolutional neural network called Darknet-53 architecture. The YOLO v3
model is viewed as a single regression problem where a single neural network is

Fig. 4 YOLO v3 model (reproduced from Joseph Redmon et al. 2016) [12]

trained on an entire image. An input image is split into an S × S grid as shown in
Fig. 4. The prediction of B bounding boxes and a confidence score for each bounding
box takes place for every grid cell. The confidence score of a bounding box, as in Eq.
9, is the product of the probability that the object is present in the box and the
Intersection over Union (IOU) between the predicted box and actual truth.

$$\text{Confidence Score} = P(\text{Object}) \times IOU \qquad (9)$$

For each bounding box, values of x, y, width, height, and confidence score are
predicted. The x and y values represent the center coordinates of the bounding
box with respect to the grid cell. The product of conditional class probabilities
(P(Classi |Object)) and the individual bounding box confidence scores gives the
confidence scores of each class in the bounding box. This score indicates the
probability of the presence of a class in the box and how well the predicted box
fits the object. The main advantages of the YOLO object detection algorithm are the
fast processing of images in real-time and low false detections.
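As an illustration only (the chapter trains YOLO v3 with the Darknet framework), the sketch below shows how a trained pothole model could be run through OpenCV's DNN module; the configuration/weight file names, the 416 × 416 input size, and the confidence threshold are assumptions.

```python
import cv2

# Assumed file names for a Darknet-trained pothole model.
net = cv2.dnn.readNetFromDarknet('yolov3-pothole.cfg', 'yolov3-pothole.weights')
output_layers = net.getUnconnectedOutLayersNames()

def detect_potholes(image, conf_threshold=0.5):
    """Return bounding boxes (x, y, w, h) of detections above the threshold."""
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), (0, 0, 0),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes = []
    for output in net.forward(output_layers):
        for det in output:                          # [cx, cy, bw, bh, objectness, class scores...]
            class_scores = det[5:]
            confidence = det[4] * class_scores.max()  # objectness times class probability
            if confidence > conf_threshold:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)))
    return boxes
```

In practice, overlapping detections would additionally be suppressed, for example with cv2.dnn.NMSBoxes.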

4 Implementation

The proposed work was implemented on Intel Core i5 1.60 GHz CPU with 8 GB
RAM. To implement the machine learning models with computer vision techniques,
the Jupyter notebook Web application was used to write and execute the python
code. Pothole detection dataset from Kaggle was used as dataset 1 [10]. The size
of the dataset was 197 MB containing 320 pothole images and 320 non-pothole
images (640 images in total). Dataset 2 was created manually (using Google image

Fig. 5 (a) Selected region of interest (ROI). (b) After setting the pixels of the region external to
the ROI to 0

search) with 504 images consisting of 252 pothole images and 252 non-pothole
images. The size of the dataset was 14.5 MB. OpenCV (Open-Source Computer
Vision Library) is an open-source library that is mainly used for programming real-
time applications that involve image processing and computer vision models. In this
work, the OpenCV library was used to read an image from the source directory,
convert it from RGB to grayscale, resize the image, and filter the image.
To ensure that all the images have a standard size, the scale of the images was
resized to 128 × 96 pixels. Since the images will be divided into 8 × 8 patches
during the feature extraction stage, a size of 128 × 96 pixels is preferable. The
RGB images in the dataset contain three layers of pixel values ranging from 0 to
255 and are hence computationally more expensive to process. Thus, RGB to grayscale
conversion was performed to reduce the computational complexity. The Gaussian filter
(Gaussian blur) is a widely used image filtering technique to reduce noise and
intricate details. This low-pass blurring filter that smooths the edges and removes
noise from an image is considered to be efficient for thresholding, edge detection,
and finding contours in an image. Thus, it will improve the efficiency of the pothole
localization procedure during region clustering and construction of the Region
Adjacency Graph (RAG). A Gaussian filter of kernel size 5 was applied to the image.
The pathway/road in the input image is the only required portion to determine the
presence of potholes (Fig. 5a). The remaining portion of the image was selected as a
polygonal region, and the pixel values were set to 0 (Fig. 5b). Therefore, the portion
of the road/pathway was selected as the region of interest. The HOG features and
a fusion of HOG and LBP features were extracted. These features were applied to
various machine learning classifiers to classify the pothole and non-pothole images.
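The preprocessing chain described above can be sketched as follows; the OpenCV calls mirror the stated steps (grayscale conversion, resizing to 128 × 96, a 5 × 5 Gaussian blur, and zeroing the pixels outside a polygonal region of interest), while the example polygon vertices are placeholders.

```python
import cv2
import numpy as np

def preprocess(image_bgr, roi_polygon):
    """Grayscale -> resize -> Gaussian blur -> zero out pixels outside the ROI polygon."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (128, 96))                        # (width, height)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)               # 5 x 5 Gaussian kernel
    mask = np.zeros_like(blurred)
    cv2.fillPoly(mask, [np.array(roi_polygon, dtype=np.int32)], 255)
    return cv2.bitwise_and(blurred, mask)                     # pixels outside the ROI become 0

# Placeholder polygon covering roughly the lower half of the 128 x 96 frame:
# roi = [(0, 95), (0, 48), (127, 48), (127, 95)]
```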
The Adaboost, Gaussian Naïve Bayes, Random Forest, and Support Vector
Machine algorithms were selected for the proposed work. The train/test set was
split in the ratio of 70:30. To find optimum parameters for the classifiers, the grid
search algorithm was used. The grid search algorithm chooses the hyperparameters
by employing exhaustive search on the set of parameters given for the classification
model. It estimates the performance for every combination of the given parameters
and chooses the best performing combination of hyperparameters. The RBF (radial

basis function) kernel SVM was selected using the grid search method. The values of
hyperparameters c and gamma were set to 100 and 1, respectively. For the Random
Forest classifier, the values of hyperparameters such as n_estimators (total number
of trees), criterion, max_depth (maximum depth of the tree), min_samples_leaf
(minimum number of instances needed to be at a leaf node), and min_samples_split
(minimum number of instances needed to split an internal node) were set to
100, “gini” (Gini impurity), 5, 5, 5 respectively. For the Adaboost classifier, the
values of hyperparameters such as n_estimators (maximum number of estimators
required) and learning rate were set to 200 and 0.2, respectively. The machine
learning classifiers were evaluated using a cross-validation method. Subsequently,
the pothole localization was performed for the positively predicted images.
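A sketch of this model selection step with scikit-learn is shown below; the feature matrix and the grid of candidate values are stand-ins, while the final hyperparameter values follow the ones reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Stand-in for the fused HOG+LBP feature matrix and pothole/non-pothole labels
X = np.random.rand(640, 100)
y = np.random.randint(0, 2, 640)

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Exhaustive grid search over SVM hyperparameters; the combination selected in
# this work was an RBF kernel with C = 100 and gamma = 1
svm_grid = GridSearchCV(SVC(), {"kernel": ["rbf"], "C": [1, 10, 100],
                                "gamma": [0.01, 0.1, 1]})
svm_grid.fit(X_train, y_train)
print(svm_grid.best_params_)

# Classifiers configured with the hyperparameter values reported above
rf = RandomForestClassifier(n_estimators=100, criterion="gini", max_depth=5,
                            min_samples_leaf=5, min_samples_split=5)
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.2)
```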
To implement the deep learning model, the Google Colaboratory notebook with
a single GPU was utilized. The size of the initial dataset was 270 MB with 1106
labeled pothole images [11]. Image data augmentation techniques, which process
and modify the original image to create variations of that image, were employed
on the images of the initial dataset. The techniques such as horizontal flip, change of
image contrast, and incorporation of Gaussian noise were adopted to synthetically
expand the size of the dataset. The resultant images of various data augmentation
operations are shown in Fig. 6.
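A possible implementation of these three augmentation operations with OpenCV and NumPy is sketched below; the contrast gain and noise standard deviation are illustrative values.

```python
import cv2
import numpy as np

def augment(image):
    """Return horizontally flipped, contrast-changed and noise-added variants."""
    flipped = cv2.flip(image, 1)                              # horizontal flip

    contrast = cv2.convertScaleAbs(image, alpha=1.5, beta=0)  # contrast change (gain is illustrative)

    noise = np.random.normal(0, 15, image.shape)              # Gaussian noise, sigma=15 (illustrative)
    noisy = np.clip(image.astype(np.float64) + noise, 0, 255).astype(np.uint8)

    return flipped, contrast, noisy
```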
The data augmentation methods benefit the deep learning model as larger training
data leads to an enhanced generalization of the neural network, reduction of
overfitting, and improvement in real-time detections. The dataset obtained after data
augmentation was 773 MB in size with 2500 pothole images and 2500 non-pothole
images. The size of the input images was 416 × 416 pixels. The object labels in each
image were represented using a text file containing five parameters: object class, x-
center, y-center, width, and height. The object class is an integer number given for
each object, with values ranging from 0 to (number of classes − 1). The x-center,
y-center, width, and height are float values relative to the width and height of the
image. The dataset was split into train/test set with a ratio of 70:30. The number
of iterations was set as 6000, and batch size for training and testing was set as 64
and 1, respectively. Various performance metrics such as precision, recall, F1 score,
mean Average Precision (mAP), and prediction time were computed on the test set.
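The label format can be illustrated with a small helper that converts a pixel-space bounding box into the normalized representation described above; the box coordinates in the example are arbitrary.

```python
def to_yolo_label(object_class, box, image_width, image_height):
    """Convert a pixel-space box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = ((x_min + x_max) / 2) / image_width     # float relative to image width
    y_center = ((y_min + y_max) / 2) / image_height    # float relative to image height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{object_class} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a pothole (class 0) in a 416 x 416 image
print(to_yolo_label(0, (100, 250, 220, 330), 416, 416))
```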

5 Result Analysis

In the approach of machine learning and computer vision, the classification report
comprising accuracy, precision, recall, and F1 score was generated and tabulated
(Table 1) for all the models. To estimate the classification model accurately, the k-
fold cross-validation method was utilized. In k-fold cross-validation, the dataset is
divided into k equal-sized partitions. The classifier is trained on k−1 partitions, and
the remaining partition is used for testing; the score of the k runs is averaged and
used for performance estimation. In this work, the machine learning models were
evaluated using the 10-fold cross-validation method.
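A sketch of this evaluation with scikit-learn's cross_val_score is shown below, using random stand-in data in place of the fused HOG and LBP features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

# Stand-in feature matrix and labels for the fused HOG+LBP features
X = np.random.rand(640, 100)
y = np.random.randint(0, 2, 640)

clf = AdaBoostClassifier(n_estimators=200, learning_rate=0.2)
scores = cross_val_score(clf, X, y, cv=10)      # 10-fold cross-validation
print(f"Average accuracy: {scores.mean():.4f}")
```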

Fig. 6 (a) Original image and the resultant images of (b) horizontal flip, (c) contrast change, and
(d) Gaussian noise addition

Table 1 Classification performance report for different feature sets as input (dataset 1)

                HOG features                           Combination of HOG and LBP features
ML classifiers  Accuracy  Precision  Recall  F1 score  Accuracy  Precision  Recall  F1 score
Adaboost        86.66%    87%        87%     87%       96.66%    97%        97%     97%
Naïve-Bayes     85.33%    87%        85%     85%       90.66%    91%        91%     91%
Random Forest   87.33%    88%        87%     87%       95.33%    96%        95%     95%
SVM             90.66%    91%        91%     91%       92.66%    93%        93%     93%

The average accuracy scores acquired using the 10-fold cross-validation for all the
classifiers are shown in Table 2.
The ROC curve (Receiver Operating Characteristic curve) is plotted for models
that use only the HOG feature set and models that use a fusion of HOG and LBP
feature sets (Fig. 7).
The AUC scores were computed from the ROC curves, and the results are
tabulated in Table 3. It can be noted that the AUC score for all the classification
models that use the fused HOG and LBP feature set is above 90%. This
performance improvement shows that adopting the fused HOG and LBP
features for the classification model helps achieve better results.
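The ROC curves and AUC scores can be produced as sketched below; the labels and scores here are synthetic placeholders standing in for the test-set outputs of the fitted classifiers.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Placeholder labels and scores; in the chapter these come from the fitted
# classifiers' decision scores on the 30% test split
y_test = np.random.randint(0, 2, 200)
y_score = np.clip(0.6 * y_test + 0.5 * np.random.rand(200), 0, 1)

fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend()
plt.show()
```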

Table 2 Average accuracy computed with 10-fold cross-validation method

                              Dataset 1                                          Dataset 2
Machine learning classifiers  HOG features  Combination of HOG and LBP features  HOG features  Combination of HOG and LBP features
Adaboost                      89.18%        94.38%                                85.16%        87.33%
Naïve-Bayes                   88.58%        89.97%                                82.83%        83.5%
Random Forest                 88.77%        93.58%                                85%           82%
SVM                           87.57%        88.57%                                84.16%        85.16%
Table 3 Comparison of AUC scores obtained using different feature sets

                              Dataset 1                                          Dataset 2
Machine learning classifiers  HOG features  Combination of HOG and LBP features  HOG features  Combination of HOG and LBP features
Adaboost                      93.28%        99.24%                                86.03%        95.78%
Naïve-Bayes                   88.58%        95.06%                                87.01%        92.17%
Random Forest                 95.51%        98.81%                                90.86%        95.24%
SVM                           95.44%        97.54%                                90.09%        94.22%

Fig. 7 ROC curves for the classification models that use (a) the HOG feature set extracted from
images of dataset 1, (b) the fusion of HOG and LBP feature sets extracted from images of dataset 1,
(c) the HOG feature set extracted from images of dataset 2, and (d) the fusion of HOG and LBP
feature sets extracted from images of dataset 2

The Adaboost classification algorithm shows the best performance among all the
classifiers. Further, the exact location of the pothole region must be determined and
highlighted. Therefore, the normalized graph cut segmentation using RAG (Region
Adjacency Graph) was employed for pothole localization in positively classified
images. Figures 8 and 9 depict the process and result of pothole detection using
classification and localization.
In the deep learning approach, detection results of the YOLO v3 model run on
the test data are shown in Table 4. The prediction time for YOLO v3 was 26.90
milliseconds. The sample output of pothole detection by the YOLO v3 model is
shown in Fig. 10.
Based on the outcome of classification using HOG features and fusion of HOG
and LBP features, it is evident that the fusion of HOG and LBP features improves
the classification performance of the machine learning models.
The classification results of machine learning algorithms convey that the
Adaboost classifier with the HOG and LBP feature set outperforms all the other
classifiers. For creating a bounding box around the pothole region, localization of
potholes was performed using normalized graph cut using RAG (Region Adjacency
Graph). The overall detection time for this approach is approximately 0.35 seconds.
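A sketch of this localization step with scikit-image is given below; the superpixel count and the input path are illustrative, and the graph module import path depends on the scikit-image version.

```python
import cv2
from skimage.segmentation import slic
from skimage import graph   # in older scikit-image versions: from skimage.future import graph

# Positively classified image (placeholder path), converted to RGB for scikit-image
image = cv2.cvtColor(cv2.imread("positively_classified.jpg"), cv2.COLOR_BGR2RGB)

# Over-segment the image into superpixel regions
labels = slic(image, n_segments=200, compactness=10, start_label=1)

# Build a Region Adjacency Graph weighted by colour similarity and apply the
# normalized graph cut to cluster the regions; the pothole region can then be
# located among the resulting clusters and enclosed in a bounding box
rag = graph.rag_mean_color(image, labels, mode="similarity")
segmented_labels = graph.cut_normalized(labels, rag)
```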

Fig. 8 A step-by-step illustration of normalized graph cut segmentation using Region Adjacency
Graph (RAG)

Fig. 9 Sample output of Adaboost classification and normalized graph cut segmentation using
RAG for detection of potholes

Table 4 Results of YOLO v3 model detection

Metric      Resultant value
Precision   83%
Recall      87%
F1 score    85%
mAP         88.01%

This approach to pothole detection does not require high-performance processors
such as a GPU to run smoothly. However, the results contain a few false positives
during the localization of potholes.
The YOLO v3 model achieved a mean Average Precision (mAP) of 88.01% and
a fast inference time for detecting the pothole(s) in an image. With a prediction time
of 26.90 milliseconds, the model can process up to 37 frames per second. However,
the requirement of higher processing power and disk space makes the model
unsuitable for low-power edge devices.

Fig. 10 Sample output of pothole detection performed using YOLO v3 model for pothole and
non-pothole images

6 Conclusion

A system to detect potholes in pathways/roadways can be highly useful and
convenient for visually challenged people. This work presented two different
approaches for pothole detection. In the machine learning and computer vision
approach, we introduced a fusion of features by combining HOG and LBP features
to enhance the classification performance. The results of various classifiers have
shown improved performance with the usage of fusion of HOG and LBP features.
Among all classifiers, the Adaboost algorithm with the fusion of HOG and LBP
features attained the highest accuracy of 96.6% in classifying pothole and non-
pothole images. For detecting the exact location of potholes, normalized graph cut
segmentation with Region Adjacency Graph was implemented. The inference time
to detect potholes using Adaboost classifier with the segmentation algorithm was
0.35 seconds approximately. But a few false positive outcomes were present during
the localization of potholes.
In the deep learning approach, the YOLO v3 model exhibited a favorable
outcome with mAP of 88.01% and a rapid prediction time of 26.90 milliseconds.
However, the model requires the system to run on processors with high computa-
tion power.
Therefore, we can realize the Adaboost classifier and normalized graph cut
using RAG in a real-time pothole detection system on a low-power and economical
microprocessor if we have cost and computation power constraints. It can also be
integrated with stereo camera/dual camera technology that generates a depth map,
thereby detecting the presence of potholes accurately and reducing the false-positive
predictions. In the absence of cost and power limitations, the YOLO v3 model can
be employed in GPU-based embedded devices.

References

1. Garcillanosa, M. M., Pacheco, J. M. L., Reyes, R. E., & San Juan, J. J. P. (2018). Smart
detection and reporting of potholes via image-processing using Raspberry-Pi microcontroller.
In 10th international conference on knowledge and smart technology (KST), Chiang Mai,
Thailand. 31 Jan–3 Feb, 2018.
2. Rao, A. S., Gubbi, J., Palaniswami, M., & Wong, E. (2016). A vision-based system to detect
potholes and uneven surfaces for assisting blind people. In IEEE international conference on
communications (ICC), Kuala Lumpur, Malaysia. 22–27 May, 2016.
3. Azhar, K., Murtaza, F., Yousaf, M. H., & Habib, H. A. (2016). Computer vision based detection
and localization of potholes in Asphalt Pavement images. In IEEE Canadian conference on
electrical and computer engineering (CCECE), Vancouver, BC, Canada. 15–18 May, 2016.
4. Yousaf, M. H., Azhar, K., Murtaza, F., & Hussain, F. (2018). Visual analysis of asphalt
pavement for detection and localization of potholes. Advanced Engineering Informatics,
Elsevier, 38, 527–537.
5. Ouma, Y. O., & Hahn, M. (2017). Pothole detection on asphalt pavements from 2D-colour
pothole images using fuzzy c-means clustering and morphological reconstruction. Automation
in construction, Elsevier, 83, 196–211.
6. Kang, B.-H., & Choi, S.-I. (2017). Pothole detection system using 2D LiDAR and camera. In
Ninth international conference on ubiquitous and future networks (ICUFN), Milan, Italy. 4–7
July, 2017.
7. Buza, E., Omanovic, S., & Huseinovic, A. (2013). Pothole detection with image processing
and spectral clustering. In Recent advances in computer science and networking, 2013.
8. Ping, P., Yang, X., & Gao, Z. (2020). A deep learning approach for street pothole detection. In
IEEE sixth international conference on big data computing service and applications, Oxford,
UK. 3–6 Aug, 2020.
9. Ray, A., & Ray, H. (2019). Smart portable assisted device for visually impaired people. In
International conference on intelligent sustainable systems (ICISS), Palladam, India, 21–22
Feb, 2019.
10. Atulya Kumar. (2020). Kaggle pothole detection dataset. https://www.kaggle.com/
atulyakumar98/pothole-detection-dataset
11. Atikur Rahman Chitholian. (2020). YOLO v3 pothole detection dataset. https://
public.roboflow.com/object-detection/pothole
12. Redmon, J. (2016). You only look once: Unified, real-time object detection. In IEEE conference
on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA. 27–30 June, 2016.
Shape Feature Extraction Techniques for
Computer Vision Applications

E. Fantin Irudaya Raj and M. Balaji

1 Introduction

The act of identifying similar objects in a digital image is defined as object
recognition in computer vision [1]. Many challenges, such as rotations, variations
in pose, poor illumination, scaling, and occlusion, make shape-based object
recognition difficult [2]. Several methods have been developed to improve the
accuracy and ease of recognition of shape-based objects. Matching is a key
aspect of the digital object recognition system and one of the major concerns in
object recognition [3]. A primary objective of matching is to measure, compare, and
validate image data in order to perform accurate recognition. In object recognition,
the matching process entails some form of search, during which the set of features
extracted is compared to the set of features stored in a database for detection.
Appropriate and equivalent features must be extracted from the input image data
to complete this task [4].
Numerous methods and approaches for object recognition in computer vision
applications are discussed in the literature [5–8]. Translation and rotation invariant
qualities are critical in most image classification tasks and must be addressed in
the image retrieval features in any strategy for object recognition. The shape-
based object recognition procedure is divided into three steps. They are (a) data
preprocessing, (b) feature extraction, and (c) classification of digital images. Image
data is preprocessed in the preprocessing stage to make it clearer or noise-free for the
feature extraction procedure. There are numerous sorts of filtering techniques used

E. F. I. Raj ()
Department of Electrical and Electronics Engineering, Dr. Sivanthi Aditanar College of
Engineering, Tiruchendur, Tamil Nadu, India
M. Balaji
Department of Electrical and Electronics Engineering, SSN College of Engineering, Chennai,
Tamil Nadu, India


in this stage to improve image quality by reducing noise and making images clearer
for measuring current features [9]. The feature extraction stage extracts feature from
preprocessed images to make the recognition task easier and more accurate. Many
future extracting techniques are available to extract the important features of the
object present in the image. The retrieved features are then saved in a database.
The classifier will then utilize the database to look for and identify a comparable
image based on the input image attributes. Among all of these procedures, feature
extraction is one of the most important for making object detection simpler and
more precise [10].
Shape feature extraction is important in various applications, including (1)
shape retrieval, (2) shape recognition and classification, (3) shape alignment and
registration, and (4) shape estimation and simplification. Shape retrieval is the
process of looking for full shapes that seem to be identical to a query shape in a large
database of shapes [11]. In general, all shapes that are within a specific distance
of the query, or the first limited shapes with the shortest distance, are calculated.
Shape recognition and classification is the process of determining if a given shape
resembles a model well or which database class is the most comparable. The
processes of converting or interpreting one shape to match other shapes completely
or partially are known as shape alignment and registration [12]. Estimation and
simplification of shapes reduce the number of elements (points, segments, etc.)
while maintaining similarity to the original.

2 Feature Extraction

The layout, texture, color, and shape of an object are used by the majority of image
retrieval systems. Its shape defines the physical structure of an object. Moment,
region, border, and so on can all be used to depict it. These depictions can be used
to recognize objects, match shapes, and calculate shape dimensions. The structural
patterns of surfaces of cloth, grass, grain, and wood are examples of texture.
Normally, it refers to the repeating of basic texture pieces known as Texel. A Texel
is made up of many pixels that are placed in a random or periodic pattern. Artificial
textures are usually periodic or deterministic, but natural textures are often random.
Linear, uneven, smooth, fine, or coarse textures are all possibilities. Texture can be
divided into two types in image analysis: statistical and structural. The textures in
a statistical approach are random. Textures in the structural approach are entirely
structural and predictable, repeating according to certain deterministic or random
placement principles. Another approach is also proposed in the literature, which
combines statistical and structural analysis; these are called mosaic models. They
represent a random geometrical process. Although texture, color, and
shape are key aspects of image retrieval, they are rendered useless when the image
in the database or the input image lacks such qualities. An example is that the query
image is a plain version containing only black and white lines.

Translation and rotation invariant qualities must be taken into consideration
when selecting features for image retrieval in most object recognition tasks. The
following are the two primary methodologies used to categorize invariant methods:
(a) invariant features and (b) image alignment. Invariant image qualities that do
not change when the object is rotated or translated are used in the invariant feature
methodology. Although this method is more commonly employed, it is still reliant
on geometric features. In the image alignment approach, the object recognition
algorithm transforms the image so that the object in the image is positioned in a
defined standard position. The extraction of geometric information such as boundary
curvature is the mainstay of this method. Segmentation is required when an object
image contains numerous objects.
The term “feature” has a highly application-dependent definition in general. A
feature is the consequence of certain projected values on the data stream input. To
effectively execute the object recognition task, the extracted feature is compared
to the stored feature data in the image database. Numerous techniques for feature
extraction have been developed and explained in literature to make shape-based
object identification faster and more efficient. The following sections go over some
of the most important feature extraction techniques.

3 Various Techniques in Feature Extraction

3.1 Histograms of Edge Directions

Kim et al. [13] proposed a watermarking algorithm for grayscale text document
images based on edge direction histograms. Edge image matching is a common
comparison technique in computer
vision and retrieval of the image. This edge directions histogram is an important tool
for object detection in images with the absence of color information and identical
color information [14]. For this feature extraction, the edge is extracted using the
Canny edge operator, and the related edge directions are then quantized into 72 bins
of 5° each [15]. Histograms of edge directions (HED) can also be used to represent
shapes.
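A simple sketch of such a histogram of edge directions is shown below; the Canny thresholds are illustrative, and the gradient orientation is used as the edge direction.

```python
import cv2
import numpy as np

def edge_direction_histogram(gray, bins=72):
    """Histogram of edge directions: Canny edges + gradient angles in 5-degree bins."""
    edges = cv2.Canny(gray, 100, 200)                       # thresholds are illustrative
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    angles = (np.degrees(np.arctan2(gy, gx)) + 360) % 360   # 0-360 degrees

    hist, _ = np.histogram(angles[edges > 0], bins=bins, range=(0, 360))
    return hist / max(hist.sum(), 1)                        # normalized histogram
```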

3.2 Harris Corner Detector

This detector has been utilized in a wide range of image matching
applications, demonstrating its effectiveness for efficient motion tracking [16].
Although these feature detectors are commonly referred to as corner detectors, they
are capable of detecting any image region with significant gradients in all directions
at a prearranged scale. This approach is ineffective for matching images of varying
sizes because it is sensitive to variations in image scale.
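A minimal Harris corner sketch with OpenCV is shown below; the block size, aperture and k values are the usual illustrative settings, not prescribed by the text.

```python
import cv2
import numpy as np

# Placeholder path; the Harris detector expects a float32 grayscale image
gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Harris response with block size 2, Sobel aperture 3 and k = 0.04
response = cv2.cornerHarris(gray, 2, 3, 0.04)

# Keep points whose response is a fraction of the strongest corner response
corners = np.argwhere(response > 0.01 * response.max())
```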

3.3 Scale-Invariant Feature Transform

In [17, 18], the authors created a scale-invariant feature transform (SIFT) by
combining the concepts of feature-based and histogram-based picture descriptors.
Image data is transformed into scale-invariant coordinates relative to local features
using this transformation. This method provides many features that cover the image
at all scales and locations, which is an essential trait. A typical image with a
resolution of 500 × 500 pixels will yield around 2000 stable features (even though it
depends upon the image’s content). For object recognition, the selection of features
is especially important. Before being utilized for image recognition and matching,
SIFT features are taken from a set of reference images and stored in a database. A
new image is matched by comparing each feature in the new image to the primary
database and finding matching features based on the Euclidean distance between
their feature vectors.
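A sketch of SIFT feature extraction and Euclidean-distance matching with OpenCV is given below; SIFT_create is available in recent OpenCV builds, and the 0.75 ratio threshold and file paths are illustrative choices.

```python
import cv2

img1 = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors by Euclidean (L2) distance, keeping matches that pass
# Lowe's ratio test against the second-nearest neighbour
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
```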

3.4 Eigenvector Approaches

Every image is described by a minimal number of coefficients, kept in a database,
and processed effectively for object recognition using the eigenvector technique [19,
20]. The method is increasingly advanced. Although it is an effective strategy, it
does have certain disadvantages. The primary disadvantage is that any change in
each pixel value induced by manipulations such as translation, rotation, or scaling
will modify the image’s eigenvector representation. The eigenspace is calculated
while taking all possible alterations into account to address this issue [21].
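One common way to realize this idea is principal component analysis over flattened images, sketched below with random stand-in data; the image size and number of components are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Each row is a flattened grayscale image (placeholder random data)
images = np.random.rand(200, 64 * 64)

# Describe every image by a small number of coefficients in the eigenspace
pca = PCA(n_components=20)
coefficients = pca.fit_transform(images)

# A query image is projected into the same eigenspace and compared by distance
query = pca.transform(np.random.rand(1, 64 * 64))
nearest = np.argmin(np.linalg.norm(coefficients - query, axis=1))
```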

3.5 Angular Radial Partitioning

In the Angular Radial Partitioning (ARP) approach, edge detection is conducted
after the images stored in the database are converted to grayscale [22]. To achieve
scale invariance, the edge image is partitioned by surrounding circles, which are
determined from the intersection points of the edges, and angles are measured for
the feature extraction technique employed in the image retrieval comparison
procedure. The approach takes advantage of the circle surrounding an object's edge
to generate a number of radial divisions of the edge image; after the surrounding
circle is created, equidistant circles are constructed to extract the features required
for scale invariance.

3.6 Edge Pixel Neighborhood Information

The Edge Pixel Neighborhood Information (EPNI) method identifies neighboring
edge pixels whose structure will create an enhanced feature vector [23, 24]. In
the image retrieval matching procedure, this feature vector is used. This method
is invariant in terms of translation and scale but not in terms of rotation.

3.7 Color Histograms

In [25, 26], the authors introduced histogram-based image retrieval methodologies.
A color histogram is constructed by extracting the colors of the image and measuring
the number of instances of each distinct color in the image array for a particular
color image. Histograms are translation and rotation invariant and respond slowly
to occlusion, scale changes, and changes in the angle of view. Due to the gradual
change in histograms with perspective, a three-dimensional object can also be
described by a limited number of histograms [27, 28]. This approach utilizes the
color histogram of a test object to retrieve an image of a similar object in the
database. This method has a significant disadvantage in that it is more sensitive
to the color of the input object image and the intensity of the light source.
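A sketch of colour histogram extraction and comparison with OpenCV is shown below; the 8 x 8 x 8 binning and the file paths are illustrative.

```python
import cv2

def color_histogram(path):
    """8 x 8 x 8 bin colour histogram over the B, G and R channels."""
    image = cv2.imread(path)
    hist = cv2.calcHist([image], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

# Histograms of the query and a database image are compared, e.g. by correlation
query_hist = color_histogram("query.jpg")            # placeholder paths
db_hist = color_histogram("database_image.jpg")
score = cv2.compareHist(query_hist, db_hist, cv2.HISTCMP_CORREL)
```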

3.8 Edge Histogram Descriptor

A histogram made up of edge pixels is called the Edge Histogram Descriptor (EHD)
[29]. It is an excellent texture signature approach that can also be used to match
images. But the main drawback is that it is a rotation-variant approach. In its texture
section, the standard MPEG-7 defines the EHD [30]. This technique is beneficial
for image-to-image matching. This descriptor, on the other hand, is ineffective at
describing rotation invariance.

3.9 Shape Descriptor

The shape is a critical fundamental feature that is used to describe the content of
an image. However, as a result of occlusion, noise, and arbitrary distortion, the
shape of the object is frequently corrupted, complicating the object recognition
problem. Shapes are represented using shape characteristics that are either based on
the boundary plus the inside content or on the shape boundary information. Object
identification uses a variety of shape features, which are evaluated based on how
well they allow users to retrieve comparable forms from the database.

Shape descriptors can be used to quickly find similar shapes in a database,
even if they have been affinely transformed in some way, such as scaled, flipped,
rotated, or translated [31]. It can also be used to quickly find noise-affected shapes,
defective shapes, and human-tolerated shapes for shape recovery and comparison.
A shape descriptor should be capable of retrieving images for the widest possible
variety of shapes, not just specific ones. As a result, it should be application generic.
Low calculation complexity is one of the shape descriptor’s most important
characteristics. The calculation can be accelerated and computation complexity
reduced by adding fewer picture attributes in the calculation technique, making it
more resilient.
The following are some crucial characteristics that must be present in effective
shape features [32]. They are (a) identifiability, (b) rotation invariance, (c) scale
invariance, (d) translation invariance, (e) occultation invariance, (f) affine invari-
ance, (g) noise resistance, (h) reliable, (i) well-defined range, and (j) statistically
independent. A shape descriptor is a collection of variables that are used to describe
a specific characteristic of a shape. It tries to measure a shape in a way that is
compatible with how humans see it. A form characteristic that can efficiently find
comparable shapes from a database is required for a high recognition rate of retrieval
[33]. The features are usually represented as a vector. The following requirements
should be met by the shape feature: (1) It should be simple to calculate the distance
between descriptors; else, implementation will take a long time. (2) It should be
efficiently denoted and stored so that the descriptor vector size does not get too
huge. (3) It should be comprehensive enough to accurately describe the shape.
Types of Shape Features
For shape retrieval applications, several shape explanation and depiction approaches
have been developed [34]. Depending on whether shape features are extracted from
the entire shape region or just the contour, the methodologies for describing and
depicting shapes are classified into two categories: (a) contour-based approach
and (b) region-based approach. Every technique is then broken down into two
approaches: global and structural. Both the global and structural techniques act by
determining whether the shape is defined as a whole or by segments.
Contour-Based Approach
Boundary or contour information can be recovered using contour-based algorithms
[35]. The shape representation is further classified into the structural and global
approaches. The discrete approach is also known as the structural approach because
it breaks the shape boundary information into segments or subparts called
primitives. The structural methodology is usually represented as a string or a tree
(or graph) that will be utilized for image retrieval matching. The global approaches
do not partition the shape into subparts and build the feature vector and perform the
matching procedure using the complete boundary information. As a result, they’re
often referred to as continuous approaches.
A multidimensional numeric feature vector is constructed from the shape contour
information, in the global approach, which will be used in the matching phase.

Manipulation of the Euclidean distance or point-to-point matching completes
the matching procedure. Shapes are divided into primitives, which are boundary
segments in the structural method because the contour information is broken into
segments. The result is stored as S = s1, s2, ..., sn, where si is an element of a
shape feature that contains a characteristic such as orientation, length, and so on.
The string can also be used to directly portray the shape or as an input parameter of
an image retrieval system.
The structural method’s primary limitation is the generation of features and primi-
tives. There is no appropriate definition for an object or shape since the number of
primitives required for each shape is unknown. The other constraint is calculating
its effectiveness. This method does not ensure the best possible match. It is more
reliable than global techniques because changes in object contour cause changes in
primitives.
Region-Based Approach
Region-based techniques can tackle problems that contour-based techniques can’t
[36]. They are more durable and can be used in a variety of situations. They are
capable of dealing with shape defection. The region-based method considers all
pixels in the shape region, so the complete region is used to represent and describe
the shape. In the same way that contour-based approaches are separated into global
and structural methods, region-based approaches are separated into global and
structural approaches depending on whether or not they divide the shapes into
subparts. For shape description and representation, global methodologies consider
the complete shape region. Region-based structural approaches, which divide the
shape region into subparts, are utilized for shape description and representation.
The challenges that region-based structural approaches have are similar to those
that contour structural techniques have.
Contour-based techniques are more prominent than region-based techniques for
the following reasons: (a) The contour of a shape, rather than its inside substance, is
important in various applications, and (b) humans can easily recognize shapes based
on their contours. But this method also has some shortfalls. They are as follows: (a)
Contours are not attainable in a number of applications. (b) Because small sections
of forms are used, contour-based shape descriptors are susceptible to noise and
changes. (c) In some applications, the inside content of the shape is more important
than its contour. Figure 1 shows the classification and some examples of shape descriptors
in detail.

4 Shape Signature

The shape signature refers to the one-dimensional shape feature function obtained
from the shape’s edge coordinates. The shape signature generally holds the per-
spective shape property of the object. Shape signatures can define the entire
shape; they are also commonly used as a preprocessing step before other feature

Fig. 1 Shape descriptor – classification and example

extraction procedures. The important one-dimensional feature functions are (a)
centroid distance function, (b) chord length function, and (c) area function.

4.1 Centroid Distance Function

Centroid distance function (CDF) is defined as the distance of contour points from
the shape’s centroid (x0, y0) and is represented by Eq. (1) [37].

r(n) = √[(x(n) − x0)² + (y(n) − y0)²]    (1)

The centroid is located at the coordinates (x0 , y0 ), which are the average of the x
and y coordinates for all contour points. A shape’s boundary is made up of a series
of contour or boundary points. A radius is a straight line that connects the centroid
to a point on the boundary. The Euclidean distance is used in the CDF model to
capture a shape’s radii lengths from its centroid at regular intervals as the shape’s
descriptor [38]. Let  be the regular interval (in degrees) between two radii (Fig.
2). K = 360/ then gives the number of intervals. All radii lengths are normalized
by dividing by the longest radius length from the extracted radii lengths.
Moreover, without sacrificing generality, assume that the intervals are considered
clockwise from the x-axis. The shape descriptor can then be represented as a vector,
as illustrated in Eq. 2. Figure 3 illustrates the centroid distance function approach
plot of a shape boundary.

Fig. 2 Centroid distance


function (CDF) approach

Fig. 3 Centroid distance plot of shape boundary


S = {r0, rθ, r2θ, . . . , r(K−1)θ}    (2)

This method has both advantages and disadvantages. Its main advantage is that it is
translation-invariant, because the centroid, which designates the shape’s position, is
subtracted from the edge coordinates. The main drawback is that this method fails
to properly depict the shape if there are multiple boundary points at the same
interval.

4.2 Chord Length Function

The chord length function (CLF) is calculated from the shape contour without using
a reference point [39]. As shown in Fig. 4, the CLF of each contour point C is
the shortest distance between C and the other contour point C’ such that line CC’

Fig. 4 Chord length function (CLF) approach

is orthogonal to the tangent vector at C. This method also has some merits and
demerits. The important merit is that this method is translation-invariant, and it
addresses the issue of biased reference points (the fact that the centroid is frequently
biased by contour defects or noise). The demerit is that the chord length function
is extremely sensitive to noise, and even smoothed shape boundaries can cause an
extreme burst in the signature.

4.3 Area Function

In the area function (AF) approach, as the contour points along the shape edge are
traversed, the area of the triangle formed by two consecutive contour points and
the centroid changes as well [40]. This yields an area function that can be thought of
as a shape representation, as illustrated in Fig. 5. Let An denote the area between
consecutive edge points Pn, Pn+1, and the centroid C. The area function approach
and its plot of a shape boundary are shown in Figs. 5 and 6.
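A sketch of the area function for a closed contour is shown below, using the cross-product form of the triangle area.

```python
import numpy as np

def area_function(contour):
    """Area of the triangle formed by consecutive contour points and the centroid."""
    pts = contour.astype(float)
    c = pts.mean(axis=0)                               # centroid C
    nxt = np.roll(pts, -1, axis=0)                     # P(n+1), wrapping around the contour

    # Triangle area An via the cross product of (Pn - C) and (Pn+1 - C)
    v1, v2 = pts - c, nxt - c
    return 0.5 * np.abs(v1[:, 0] * v2[:, 1] - v1[:, 1] * v2[:, 0])
```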

5 Real-Time Applications of Shape Feature Extraction and Object Recognition

The shape is an important visual and emerging feature for explaining image content.
One of the most difficult problems in developing effective content-based image
retrieval is the usage of object shape [41]. Because determining the similarity
between shapes is difficult, a precise explanation of shape content is impossible.
Thus, in shape-based image retrieval, two steps are critical: shape feature extraction
and similarity calculation among the extracted features. Some of the real-time

Fig. 5 Area function (AF) approach

Fig. 6 Area function plot of shape boundary

applications of shape feature extraction and object recognition are explained
further.

5.1 Fruit Recognition

Fruit recognition can be accomplished in a variety of ways by utilizing the shape
feature [42]. One of the fruit recognition algorithms that uses the shape feature is as
follows:
Step 1: First, gather images of various types of fruits with varying shapes. Figure 7
depicts an orange image.
Step 2: The images are divided into two sets: training and testing.

Fig. 7 Image of an orange

Fig. 8 Binarized image of an orange

Step 3: Convert all images to binary so that the fruit pixels are 1s and the residual
pixels are 0s [43], as shown in Fig. 8.
Step 4: The Canny edge detector is an edge detection operator that detects a wide
range of edges in images using a multistage approach. The Canny edge detection
algorithm [44] is used to extract the fruit contour, as shown in Fig. 9.
Step 5: For each image, compute the centroid distance [45]. Figure 10 depicts Fig.
9’s centroid distance plot.
Step 6: Euclidean distance measurement is used to compare the centroid distance
between training and testing images [46].
Step 7: The test fruit image is distinguished from the training images by the smallest
difference [47].

Fig. 9 Contour of the orange fruit

Fig. 10 The centroid distance plot of Fig. 9

5.2 Leaf Recognition

The shape feature can be used in a variety of ways to recognize leaves [48]. The
following is an example of a leaf recognition algorithm that uses the shape feature:
Step 1: First, gather some images of various sorts of leaves with varying shapes. A
leaf is depicted in Fig. 11.
Step 2: The images are classified into two parts: training and testing.
Step 3: Convert all images to binary, with the leaf pixels being 1s and the residual
pixels being 0s (Fig. 12) [43].
Step 4: Following that, the leaf contour is extracted using the Canny edge detection
algorithm (Fig. 13) [44]. It is an image processing approach that identifies points
in a digital image that have discontinuities or sharp changes in brightness.
Step 5: Calculate the seven Hu moments [49] associated with each image. Figure 14
depicts the plot of Fig. 13’s seven Hu moment values.
Step 6: Euclidean distance measurement is used to compare moments between
training and testing images.
Step 7: The test leaf image is distinguished from the training images by the smallest
difference [47].

Fig. 11 Image of a leaf

Fig. 12 Binarized image of the leaf depicted in Fig. 11

Fig. 13 Contour of the leaf



Fig. 14 The Hu moment plot of Fig. 13

Fig. 15 (a) Target image, (b) test image

5.3 Object Recognition

There are two images in this case: the test image and the target
image. The test image depicts a scene of flowers in front of a window, with the target
image (flower) to be found using scale-invariant feature transform (SIFT).
Step 1: Input target image (flower) – (Fig. 15a).
Step 2: Input test image (scene with cluttered objects) – (Fig. 15b).
Step 3: By using SIFT, find 100 strongest points in the target image – (Fig. 16a).
Step 4: By using SIFT, find 200 strongest points in the test image – (Fig. 16b).
Step 5: Calculate putatively matched points by comparing the two images – (Fig.
17).
Step 6: Calculate the points that are exactly matched – (Fig. 18).
Step 7: A polygon should be drawn around the region of exactly matching points –
(Fig. 19).
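The steps above can be sketched with OpenCV as follows; SIFT_create requires a recent OpenCV build, the file paths are placeholders, and the ratio threshold and RANSAC tolerance are illustrative values.

```python
import cv2
import numpy as np

target = cv2.imread("flower.jpg", cv2.IMREAD_GRAYSCALE)   # target image (Step 1)
scene = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)     # test image (Step 2)

# Steps 3-4: SIFT keypoints and descriptors in both images
sift = cv2.SIFT_create()
kp_t, des_t = sift.detectAndCompute(target, None)
kp_s, des_s = sift.detectAndCompute(scene, None)

# Step 5: putative matches by descriptor distance with Lowe's ratio test
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_t, des_s, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Step 6: exactly matched points via a RANSAC homography
src = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp_s[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Step 7: project the target outline into the scene and draw the polygon
h, w = target.shape
corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
polygon = cv2.perspectiveTransform(corners, H)
cv2.polylines(scene, [np.int32(polygon)], True, 255, 3)
```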

Fig. 16 (a) 100 strongest points in the target image, (b) 200 strongest points in the test image

Fig. 17 Putatively matching points in target and test images

Fig. 18 Exactly matching points in the test and target images



Fig. 19 Detected target in the test image

6 Recent Works

Many recent works on shape feature extraction in computer vision are reported in
the literature. The same approach can be used in many recent applications such as
robotics, fault detection, autonomous vehicle management systems, the Industry 4.0
framework, medical applications, etc. A few of them are listed here for reference.
Yang et al. [50] explained fish detection and behavior analysis using various
computer vision models in intelligent aquaculture, and Foysal et al. [51] presented
a smartphone application for detecting garment fit using a computer vision
approach. In [52–54], the authors detailed various autonomous vehicle management
systems: a comprehensive review of vehicle detection, traffic-light recognition by
autonomous vehicles in the city environment, and pothole detection in the roadways
for such vehicles. In [55], Das et al. generated parking area patterns from vehicle
positions in an aerial image using computer vision with Mask R-CNN (Mask
Region-based Convolutional Neural Network).
Devaraja et al. [56] explained computer vision-based grasping by the robotic
hands used in industries. In addition, the authors describe shape recognition
by autonomous robots in an industrial environment. In [57], the authors detailed
the computer vision-based robotic equipment used in the medical field and its
importance in surgeries. In [58], the authors describe robotic underwater vehicles
that use computer vision to monitor deepwater animals. The high efficiency of the
system can be attained by employing machine learning techniques along with
computer vision.
In [59], the authors detailed computer vision-enabled, support vector
machine-assisted fault detection in industrial textures. Cho et al. [60] explained
the fault analysis and fault detection in a wind turbine system using an artificial
neural network along with a Kalman filter using computer vision approaches. In
[61, 62], the author detailed fault detection in the aircraft wings and sustainable
fault detection of electrical facilities using computer vision methodologies. In
[63–65], the authors detailed the neural networks and deep learning configuration
and a computer vision approach for identifying and classifying faults in switched
reluctance motor drive used in automobile and power generation applications.
Esteva et al. [57] explained deep learning-enabled computer vision applications
in their work. Combining deep learning with a computer vision-based approach
yields higher accuracy and classification performance. Naresh et al. [66] detailed
computer vision-based health-care management through mobile communication;
the work focuses on telemedicine-based applications.
Pillai et al. [67] discussed COVID-19 detection using computer vision and deep
convolutional neural networks. They provided a detailed analysis and compared the
results with already existing conventional methodologies.
In the real world, the use of computer vision techniques in health care improves
disease prognosis and patient care. Recent advancements in object detection and
image classification can significantly assist medical imaging. Medical imaging, also
known as medical image analysis, is a technique for visualizing specific organs and
tissues in order to provide a more precise diagnosis. Several research studies in pathology,
radiology, and dermatology [68–70] have shown encouraging results in complicated
medical diagnostics tasks. Computer vision has been utilized in a variety of health-
care applications to help doctors make better treatment decisions for their patients.
In the medical field, computer vision applications have proven to be quite useful,
particularly in the detection of brain tumors [71]. Furthermore, researchers have
discovered a slew of benefits of employing computer vision and deep learning
algorithms to diagnose breast cancer [72]. It can help automate the detection process
and reduce the risks of human error if it is trained with a large database of images
containing both healthy and malignant tissue [73].
Other important works are also available in the literature. Here, only a few of the
important and recent works are listed for reference.

7 Summary and Conclusion

The present work focuses on the shape feature extraction techniques used
in computer vision applications. Various feature extraction techniques are also
explained in detail. Histogram-based image retrieval feature extraction approaches
used in computer vision include the Edge Histogram Descriptor and histograms
of edge directions. The eigenvector approach is particularly sensitive to changes in
individual pixel values induced by scaling, rotation, or translation. ARP is
invariant in terms of scale and rotation. The EPNI method is invariant in terms of
scale and translation but not in terms of rotation. Noise affects the color histogram.
But the color histogram approach is insensitive to rotation and translation.
Shape description and representation approaches are divided into two categories:
contour-based approaches and region-based approaches. Both sorts of approaches
are further subdivided into global and structural techniques. Although contour-based
techniques are more popular than region-based techniques, they still have significant
drawbacks. The region-based approaches can circumvent these restrictions. Shape
signatures are frequently utilized as a preprocessing step before the extraction of
other features. The most significant one-dimensional feature functions are also
presented in the current work. Some of the real-time feature extraction and object
recognition applications used in computer vision are explained in detail. In addition
to that, the latest recent works related to shape feature extraction with computer
vision are also listed.

References

1. Bhargava, A., & Bansal, A. (2021). Fruits and vegetables quality evaluation using computer
vision: A review. Journal of King Saud University-Computer and Information Sciences, 33(3),
243–257.
2. Zhang, L., Pan, Y., Wu, X., & Skibniewski, M. J. (2021). Computer vision. In Artificial
intelligence in construction engineering and management (pp. 231–256). Springer.
3. Dong, C. Z., & Catbas, F. N. (2021). A review of computer vision–based structural health
monitoring at local and global levels. Structural Health Monitoring, 20(2), 692–743.
4. Iqbal, U., Perez, P., Li, W., & Barthelemy, J. (2021). How computer vision can facilitate
flood management: A systematic review. International Journal of Disaster Risk Reduction,
53, 102030.
5. Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003, October). Context-
based vision system for place and object recognition. In Computer vision, IEEE international
conference on (Vol. 2, pp. 273–273). IEEE Computer Society.
6. Liang, M., & Hu, X. (2015). Recurrent convolutional neural network for object recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3367–
3375).
7. Kortylewski, A., Liu, Q., Wang, A., Sun, Y., & Yuille, A. (2021). Compositional convolutional
neural networks: A robust and interpretable model for object recognition under occlusion.
International Journal of Computer Vision, 129(3), 736–760.
8. Alom, M. Z., Hasan, M., Yakopcic, C., Taha, T. M., & Asari, V. K. (2021). Inception recurrent
convolutional neural network for object recognition. Machine Vision and Applications, 32(1),
1–14.
9. Cisar, P., Bekkozhayeva, D., Movchan, O., Saberioon, M., & Schraml, R. (2021). Computer
vision based individual fish identification using skin dot pattern. Scientific Reports, 11(1), 1–
12.
10. Saba, T. (2021). Computer vision for microscopic skin cancer diagnosis using handcrafted and
non-handcrafted features. Microscopy Research and Technique, 84(6), 1272–1283.
11. Li, Y., Ma, J., & Zhang, Y. (2021). Image retrieval from remote sensing big data: A survey.
Information Fusion, 67, 94–115.
12. Lucny, A., Dillinger, V., Kacurova, G., & Racev, M. (2021). Shape-based alignment of the
scanned objects concerning their asymmetric aspects. Sensors, 21(4), 1529.
13. Kim, Y. W., & Oh, I. S. (2004). Watermarking text document images using edge direction
histograms. Pattern Recognition Letters, 25(11), 1243–1251.
14. Bakheet, S., & Al-Hamadi, A. (2021). A framework for instantaneous driver drowsiness
detection based on improved HOG features and Naïve Bayesian classification. Brain Sciences,
11(2), 240.
15. Heidari, H., & Chalechale, A. (2021). New weighted mean-based patterns for texture analysis
and classification. Applied Artificial Intelligence, 35(4), 304–325.

16. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision, 60(2), 91–110.
17. Linde, O., & Lindeberg, T. (2012). Composed complex-cue histograms: An investigation of the
information content in receptive field based image descriptors for object recognition. Computer
Vision and Image Understanding, 116(4), 538–560.
18. Hazgui, M., Ghazouani, H., & Barhoumi, W. (2021). Evolutionary-based generation of rotation
and scale invariant texture descriptors from SIFT keypoints. Evolving Systems, 12, 1–13.
19. Shapiro, L. S., & Brady, J. M. (1992). Feature-based correspondence: An eigenvector approach.
Image and Vision Computing, 10(5), 283–288.
20. Park, S. H., Lee, K. M., & Lee, S. U. (2000). A line feature matching technique based on an
eigenvector approach. Computer Vision and Image Understanding, 77(3), 263–283.
21. Schiele, B., & Crowley, J. L. (2000). Recognition without correspondence using multidimen-
sional receptive field histograms. International Journal of Computer Vision, 36(1), 31–50.
22. Chalechale, A., Mertins, A., & Naghdy, G. (2004). Edge image description using angular radial
partitioning. IEE Proceedings-Vision, Image and Signal Processing, 151(2), 93–101.
23. Chalechale, A., & Mertins, A. (2002, Oct). An abstract image representation based on
edge pixel neighborhood information (EPNI). In EurAsian conference on information and
communication technology (pp. 67–74). Springer.
24. Wang, Z., & Zhang, H. (2008, July). Edge linking using geodesic distance and neighborhood
information. In 2008 IEEE/ASME international conference on advanced intelligent mechatron-
ics (pp. 151–155). IEEE.
25. Chakravarti, R., & Meng, X. (2009, April). A study of color histogram based image retrieval.
In 2009 sixth international conference on information technology: New generations (pp. 1323–
1328). IEEE.
26. Liu, G. H., & Wei, Z. (2020). Image retrieval using the fused perceptual color histogram.
Computational Intelligence and Neuroscience, 2020, 8876480.
27. Mohseni, S. A., Wu, H. R., Thom, J. A., & Bab-Hadiashar, A. (2020). Recognizing induced
emotions with only one feature: A novel color histogram-based system. IEEE Access, 8,
37173–37190.
28. Chaki, J., & Dey, N. (2021). Histogram-based image color features. In Image Color Feature
Extraction Techniques (pp. 29–41). Springer.
29. Park, D. K., Jeon, Y. S., & Won, C. S. (2000, November). Efficient use of local edge histogram
descriptor. In Proceedings of the 2000 ACM workshops on multimedia (pp. 51–54).
30. Alreshidi, E., Ramadan, R. A., Sharif, M., Ince, O. F., & Ince, I. F. (2021). A comparative
study of image descriptors in recognizing human faces supported by distributed platforms.
Electronics, 10(8), 915.
31. Virmani, J., Dey, N., & Kumar, V. (2016). PCA-PNN and PCA-SVM based CAD systems for
breast density classification. In Applications of intelligent optimization in biology and medicine
(pp. 159–180). Springer.
32. Chaki, J., Parekh, R., & Bhattacharya, S. (2016, January). Plant leaf recognition using
a layered approach. In 2016 international conference on microelectronics, computing and
communications (MicroCom) (pp. 1–6). IEEE.
33. Tian, Z., Dey, N., Ashour, A. S., McCauley, P., & Shi, F. (2018). Morphological segmenting and
neighborhood pixel-based locality preserving projection on brain fMRI dataset for semantic
feature extraction: An affective computing study. Neural Computing and Applications, 30(12),
3733–3748.
34. Chaki, J., Parekh, R., & Bhattacharya, S. (2018). Plant leaf classification using multiple
descriptors: A hierarchical approach. Journal of King Saud University-Computer and Infor-
mation Sciences, 32, 1158.
35. AlShahrani, A. M., Al-Abadi, M. A., Al-Malki, A. S., Ashour, A. S., & Dey, N. (2018).
Automated system for crops recognition and classification. In Computer vision: Concepts,
methodologies, tools, and applications (pp. 1208–1223). IGI Global.
36. Chaki, J., & Parekh, R. (2012). Designing an automated system for plant leaf recognition.
International Journal of Advances in Engineering & Technology, 2(1), 149.

37. Dey, N., Roy, A. B., Pal, M., & Das, A. (2012). FCM based blood vessel segmentation method
for retinal images. arXiv preprint arXiv:1209.1181.
38. Chaki, J., & Parekh, R. (2011). Plant leaf recognition using shape based features and neural
network classifiers. International Journal of Advanced Computer Science and Applications,
2(10), 41.
39. Kulfan, B. M. (2008). Universal parametric geometry representation method. Journal of
Aircraft, 45(1), 142–158.
40. Dey, N., Das, P., Roy, A. B., Das, A., & Chaudhuri, S. S. (2012, Oct). DWT-DCT-SVD
based intravascular ultrasound video watermarking. In 2012 world congress on information
and communication technologies (pp. 224–229). IEEE.
41. Zhang, D., & Lu, G. (2001, Aug). Content-based shape retrieval using different shape
descriptors: A comparative study. In IEEE international conference on multimedia and expo,
2001. ICME 2001 (pp. 289–289). IEEE Computer Society.
42. Patel, H. N., Jain, R. K., & Joshi, M. V. (2012). Automatic segmentation and yield measurement
of fruit using shape analysis. International Journal of Computer Applications, 45(7), 19–24.
43. Gampala, V., Kumar, M. S., Sushama, C., & Raj, E. F. I. (2020). Deep learning based image
processing approaches for image deblurring. Materials Today: Proceedings.
44. Deivakani, M., Kumar, S. S., Kumar, N. U., Raj, E. F. I., & Ramakrishna, V. (2021). VLSI
implementation of discrete cosine transform approximation recursive algorithm. Journal of
Physics: Conference Series, 1817(1), 012017 IOP Publishing.
45. Priyadarsini, K., Raj, E. F. I., Begum, A. Y., & Shanmugasundaram, V. (2020). Comparing
DevOps procedures from the context of a systems engineer. Materials Today: Proceedings.
46. Chaki, J., Dey, N., Moraru, L., & Shi, F. (2019). Fragmented plant leaf recognition: Bag-
of-features, fuzzy-color and edge-texture histogram descriptors with multi-layer perceptron.
Optik, 181, 639–650.
47. Chouhan, A. S., Purohit, N., Annaiah, H., Saravanan, D., Raj, E. F. I., & David, D. S. (2021).
A real-time gesture based image classification system with FPGA and convolutional neural
network. International Journal of Modern Agriculture, 10(2), 2565–2576.
48. Lee, K. B., & Hong, K. S. (2013). An implementation of leaf recognition system using leaf
vein and shape. International Journal of Bio-Science and Bio-Technology, 5(2), 57–66.
49. Chaki, J., & Parekh, R. (2017, Dec). Texture based coin recognition using multiple descriptors.
In 2017 international conference on computer, electrical & communication engineering
(ICCECE) (pp. 1–8). IEEE.
50. Yang, L., Liu, Y., Yu, H., Fang, X., Song, L., Li, D., & Chen, Y. (2021). Computer vision
models in intelligent aquaculture with emphasis on fish detection and behavior analysis: A
review. Archives of Computational Methods in Engineering, 28(4), 2785–2816.
51. Foysal, K. H., Chang, H. J., Bruess, F., & Chong, J. W. (2021). SmartFit: Smartphone
application for garment fit detection. Electronics, 10(1), 97.
52. Abbas, A. F., Sheikh, U. U., AL-Dhief, F. T., & Haji Mohd, M. N. (2021). A comprehensive
review of vehicle detection using computer vision. Telkomnika, 19(3), 838.
53. Liu, X., & Yan, W. Q. (2021). Traffic-light sign recognition using capsule network. Multimedia
Tools and Applications, 80(10), 15161–15171.
54. Dewangan, D. K., & Sahu, S. P. (2021). PotNet: Pothole detection for autonomous vehicle
system using convolutional neural network. Electronics Letters, 57(2), 53–56.
55. Das, M. J., Boruah, A., Malakar, J., & Bora, P. (2021). Generating parking area patterns
from vehicle positions in an aerial image using mask R-CNN. In Proceedings of international
conference on computational intelligence and data engineering (pp. 201–209). Springer.
56. Devaraja, R. R., Maskeliūnas, R., & Damaševičius, R. (2021). Design and evaluation of
anthropomorphic robotic hand for object grasping and shape recognition. Computers, 10(1),
1.
57. Esteva, A., Chou, K., Yeung, S., Naik, N., Madani, A., Mottaghi, A., et al. (2021). Deep
learning-enabled medical computer vision. NPJ Digital Medicine, 4(1), 1–9.
58. Katija, K., Roberts, P. L., Daniels, J., Lapides, A., Barnard, K., Risi, M., et al. (2021). Visual
tracking of deepwater animals using machine learning-controlled robotic underwater vehicles.
102 E. F. I. Raj and M. Balaji

In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp.


860–869).
59. Tellaeche Iglesias, A., Campos Anaya, M. Á., Pajares Martinsanz, G., & Pastor-López, I.
(2021). On combining convolutional autoencoders and support vector machines for fault
detection in industrial textures. Sensors, 21(10), 3339.
60. Cho, S., Choi, M., Gao, Z., & Moan, T. (2021). Fault detection and diagnosis of a blade
pitch system in a floating wind turbine based on Kalman filters and artificial neural networks.
Renewable Energy, 169, 1–13.
61. Almansoori, N. N., Malik, S., & Awwad, F. (2021). A novel approach for fault detection in the
aircraft body using image processing. In AIAA Scitech 2021 Forum (p. 0520).
62. Kim, J. S., Choi, K. N., & Kang, S. W. (2021). Infrared thermal image-based sustainable fault
detection for electrical facilities. Sustainability, 13(2), 557.
63. Raj, E. F. I., & Balaji, M. (2021). Analysis and classification of faults in switched reluctance
motors using deep learning neural networks. Arabian Journal for Science and Engineering,
46(2), 1313–1332.
64. Sijini, A. C., Fantin, E., & Ranjit, L. P. (2016). Switched reluctance Motor for Hybrid Electric
Vehicle. Middle-East Journal of Scientific Research, 24(3), 734–739.
65. Raj, E. F. I., & Kamaraj, V. (2013, March). Neural network based control for switched
reluctance motor drive. In 2013 IEEE international conference ON emerging trends in
computing, communication and nanotechnology (ICECCN) (pp. 678–682). IEEE.
66. Naresh, E., Sureshkumar, K. R., & Sahana, P. S. (2021). Computer vision in healthcare
management system through mobile communication. Elementary Education Online, 20(5),
2105–2117.
67. Pillai, V. G., & Chandran, L. R. (2021). COVID-19 detection using computer vision and
deep convolution neural network. Cybernetics, cognition and machine learning applications:
Proceedings of ICCCMLA 2020, 323.
68. Razzak, M. I., Naz, S., & Zaib, A. (2018). Deep learning for medical image processing:
Overview, challenges and the future. Classification in BioApps (pp. 323–350).
69. Neri, E., Caramella, D., & Bartolozzi, C. (2008). Image processing in radiology. Medical
radiology. Diagnostic imaging. Springer.
70. Fourcade, A., & Khonsari, R. H. (2019). Deep learning in medical image analysis: A third eye
for doctors. Journal of Stomatology, Oral and Maxillofacial Surgery, 120(4), 279–288.
71. Mohan, G., & Subashini, M. M. (2018). MRI based medical image analysis: Survey on brain
tumor grade classification. Biomedical Signal Processing and Control, 39, 139–161.
72. Tariq, M., Iqbal, S., Ayesha, H., Abbas, I., Ahmad, K. T., & Niazi, M. F. K. (2021). Medical
image based breast cancer diagnosis: State of the art and future directions. Expert Systems with
Applications, 167, 114095.
73. Selvathi, D., & Poornila, A. A. (2018). Deep learning techniques for breast cancer detection
using medical image analysis. In Biologically rationalized computing techniques for image
processing applications (pp. 159–186). Springer.
GLCM Feature-Based Texture Image
Classification Using Machine Learning
Algorithms

R. Anand, T. Shanthi, R. S. Sabeenian, and S. Veni

1 Introduction

A picture describes a scene efficiently and conveys information in a better way than text alone. Human visual perception helps us interpret many details from an image. Almost 90% of the data processed by the human brain is visual, which helps the brain respond to and process visual data 60,000 times faster than any other form of data. Image processing systems require the image to be represented in digital form. A digital image is a two-dimensional array of numbers, where the numbers represent the intensity values of the image at various spatial locations. These pixels possess spatial coherence that can be exploited by performing arithmetic operations such as addition, subtraction, etc. Statistical manipulation of the pixel values helps to develop image processing techniques for a variety of applications. Most of these techniques employ feature extraction as one of their steps. A variety of features such as colour, shape, and texture can be extracted from digital images. Among these, texture features such as fine, coarse, smooth, grained, etc., play an important role.

R. Anand ()
Department of ECE, Sri Eshwar College of Engineering, Coimbatore, India
T. Shanthi · R. S. Sabeenian
Research Member in Sona SIPRO, Department of ECE, Sona College of Technology, Salem,
India
e-mail: Shanthi@sonatech.ac.in; Sabeenian@sonatech.ac.in
S. Veni
Department of Electronics and Communication Engineering, Amrita School of Engineering,
Coimbatore, India
e-mail: S_veni@cb.amrita.edu


The texture of an image describes the distribution of intensity values in the image. The spatial distribution of intensities provides the texture information, and the texture characterises an image or a portion of it. This information can be used to extract several valuable features that help in segmentation and classification. Texture feature calculation [1] uses the content of the GLCM to give a measure of the variation in intensity at the pixel of interest. Images with varying textures have certain characteristics that can be extracted statistically. Generally, statistical approaches comprise four prevalent methods: GLCM, the histogram method, the autocorrelation method, and morphological operations. Each of these methods has advantages and disadvantages. Out of the four, this chapter adopts GLCM, from which 10 different features are extracted. These features are elaborated in the next section, and the literature on statistical approaches is summarised in Table 1. In the work [2], Elli, Maria, and Yi-Fan extracted sentiment from reviews and analysed the results to build a business model. They claimed that the implemented tools were robust enough to give high precision, and the use of business analytics made their decisions more consistent. They also worked on detecting emotions from reviews, predicting gender from user names, and detecting fake reviews. Python was the programming language used, and multinomial naive Bayes (MNB) and support vector machine (SVM) were the main classifiers.
In [3], the authors applied existing supervised machine learning algorithms to predict a review rating on a given numerical scale. They used hold-out cross-validation with 70% of the data for training and 30% for testing, and applied different classifiers to determine precision and recall values. The authors of [4] applied and extended current work in natural language processing and sentiment analysis to Amazon review data. Naive Bayes and decision-list classifiers were used to tag a given review as positive or negative, using reviews selected from the books and Kindle sections of Amazon.
The authors of [5] aimed to build a system that visualizes the sentiment of reviews by scraping data from Amazon URLs and preprocessing it. They applied NB, SVM, and maximum entropy classifiers. Since the work focuses on summarizing product reviews into their main points, no precision values are reported; the results are presented in statistical charts. In [6], the authors built a model for predicting product ratings from the review text using a bag of words, testing both unigram and bigram models on a subset of Amazon video-game user reviews from UCSD. Time-based models did not work well because the variance in average rating between years, months, and days was relatively small. Between unigrams and bigrams, unigrams produced the more precise results, and popular unigrams were highly useful predictors of ratings because of their larger variance; unigram results performed 15.89% better than bigrams.
In [7], various feature extraction and selection techniques for sentiment analysis are examined. The authors first collected the Amazon dataset and then performed preprocessing to remove stop words and special characters. They applied phrase-level, single-word, and multiword feature selection and extraction techniques, with naive Bayes as the classifier, and concluded that naive Bayes gives better results at the phrase level than with single-word or multiword features. The main limitation of that work is that only a naive Bayes classifier was used, which does not give a complete picture. Paper [8] used simpler algorithms, so the approach is easy to understand; it achieves high precision with SVM but does not work well on very large datasets. The methods used were support vector machine (SVM), logistic regression, and decision trees. In [9], TF-IDF is used as a supplementary experiment; ratings are predicted using a bag of words, but only a few models are used, namely a linear regression model evaluated with root mean square error. These are some of the related works, and we endeavoured to make our work more efficient by selecting the best ideas from them and applying them together. In our system, a large amount of data is used to give efficient results and support better decisions. Moreover, an active learning approach is used to label datasets, which can dramatically speed up many machine learning tasks, and several types of feature extraction methods are included. To the best of our knowledge, the proposed approach gives higher precision than existing research works. The strengths and weaknesses of the statistical approaches to texture image classification are summarised in Table 1.

Table 1 Strengths and weaknesses of statistical approach methods for texture image classification

Morphological operation [5]
Strengths: Efficient for aperiodic image textures.
Weaknesses: 1. Morphological operations are not applicable to periodic images.

Autocorrelation method [6]
Strengths: 1. Overcomes illumination distortion and is robust to noise. 2. Low computational complexity.
Weaknesses: 1. Real-time applications on large images need high computation. 2. Not suitable for all kinds of textures.

Grey-level co-occurrence matrix [7]
Strengths: 1. Captures the spatial relationship of pixels through 10 different statistical computations. 2. Provides Contrast, Energy, Homogeneity, Mean, Standard Deviation, Entropy, RMS, Variance, Smoothness, and IDM. 3. High accuracy rate.
Weaknesses: 1. High computational time. 2. Choosing the optimum movement (offset) vector is problematic. 3. Requires a feature selection procedure. 4. Accuracy depends on the offset and rotation.

Histogram method [8]
Strengths: 1. Less computation. 2. Invariant to translation and rotation. 3. Mathematically solvable.
Weaknesses: 1. Sensitive to noise. 2. Low recognition rate.
2 GLCM

The Grey-Level Co-occurrence Matrix (GLCM) is a square matrix obtained from the input image. The dimension of the GLCM equals the number of grey levels in the input image. For example, an 8-bit image has 256 grey levels in the range [0, 255]; for such an image, the GLCM has 256 rows and 256 columns, with each row/column representing one of the intensity values. The second-order statistics are obtained by considering pairs of pixels related to each other by a chosen direction and distance. The Grey-Level Co-occurrence Matrices therefore provide second-order statistics on the texture. The GLCM of an image depends on the direction and offset values. The direction can be any one of the eight possible directions shown in Fig. 1, and the offset represents the distance between the pixels. If the distance between the pixels is 1, the immediate neighbouring pixel in the chosen direction is considered. In this way, several GLCM matrices can be obtained from a single image (Fig. 1).

2.1 Computation of GLCM Matrix

The GLCM is a square matrix with the same number of rows and columns and contains non-negative counts only. It is an N × N matrix, where N denotes the number of possible grey levels in the image. For example, a 2-bit image has four grey levels (0–3), which results in a GLCM of size 4 × 4 whose rows and columns correspond to the grey values 0–3. Consider the following image f(x, y) of size 5 × 5, with its grey-level representation given in Figs. 2 and 3.

The matrix G_{θ,d} = G_{0,1} (direction θ = 0°, distance d = 1) represents the GLCM. The first row corresponds to the grey value 0, and the next rows to the grey values 1, 2, and 3.

Fig. 1 Co-occurrence directions and offsets: 0° [0, D], 45° [−D, D], 90° [−D, 0], 135° [−D, −D], 180° [0, −D], 225° [D, −D], 270° [D, 0], 315° [D, D]
Fig. 2 The intensity values and their corresponding grey levels for an image segment f(x, y) with four grey levels:

f(x, y) =
1 0 2 2 1
1 0 0 1 2
1 3 1 1 3
0 1 1 1 3
0 2 2 1 2

Similarly, the first column corresponds to the grey value 0, and the next columns to the grey values 1, 2, and 3. The first element in the first row of the GLCM gives the count of occurrences of the pair (0, 0) in the 0° direction. Looking at the input matrix, the pair (0, 0) occurs at only one location; hence, the first cell of the GLCM equals 1. The second element in the first row gives the count of occurrences of the grey value 1 next to the grey value 0 in the 0° direction. The pair (0, 1) occurs at two locations; hence, the second element of the GLCM equals 2. Similarly, the third and fourth elements are calculated from the occurrences of the pairs (0, 2) and (0, 3). The second row of the GLCM is computed from the occurrences of the pairs (1, 0), (1, 1), (1, 2), and (1, 3); the third row from the pairs (2, 0), (2, 1), (2, 2), and (2, 3); and the fourth row from the pairs (3, 0), (3, 1), (3, 2), and (3, 3). The resulting GLCM matrix G_{0,1} is given in Fig. 4.
A GLCM computed with a single offset is not sufficient for image analysis. For example, the GLCM computed in the 0° direction alone is not adequate to extract information from an image with vertical details. Since the input image may contain details in any direction, GLCMs with different directions and different distance values are computed from a single image, and the average of all these matrices is used for further analysis. Every value in this matrix is then divided by the total number of pixel pairs counted from the input image (i.e., by the sum of all entries of the matrix) to obtain the normalised GLCM. The normalised GLCM g(m, n) can be used to extract several features from the image. These features are elaborated in the following section.
Fig. 3 Computation of the first row in the co-occurrence matrix: the 5 × 5 image f(x, y) is scanned for the horizontally adjacent pairs (0, 0), (0, 1), (0, 2), and (0, 3)

Fig. 4 GLCM matrix G_{0,1}:

G_{0,1} =
1 2 2 0
2 3 2 3
0 2 2 0
0 1 0 0
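To make the computation concrete, the following Python sketch (plain NumPy; the function and variable names are our own, not from the chapter) reproduces the G_{0,1} matrix of Fig. 4 from the 5 × 5 example image and then normalises it by the total number of counted pairs, as described above.

import numpy as np

def glcm(image, levels, d_row=0, d_col=1):
    # Count how often grey level i is followed by grey level j at the given offset.
    g = np.zeros((levels, levels), dtype=np.int64)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + d_row, c + d_col
            if 0 <= rr < rows and 0 <= cc < cols:
                g[image[r, c], image[rr, cc]] += 1
    return g

# The 5 x 5 example image with four grey levels (Fig. 2)
f = np.array([[1, 0, 2, 2, 1],
              [1, 0, 0, 1, 2],
              [1, 3, 1, 1, 3],
              [0, 1, 1, 1, 3],
              [0, 2, 2, 1, 2]])

G01 = glcm(f, levels=4, d_row=0, d_col=1)   # 0 degrees, distance 1 (Fig. 4)
print(G01)

# Normalise by the total number of counted pairs (20 for this image and offset)
g_norm = G01 / G01.sum()

Libraries such as scikit-image also provide a graycomatrix function (greycomatrix in older releases) in skimage.feature that computes GLCMs for several distances and angles at once, which is convenient when the matrices of all eight directions have to be averaged.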
2.2 GLCM Features

Let g(m, n) represent the normalised GLCM with N grey levels, and let μ_x, σ_x and μ_y, σ_y denote the mean and standard deviation of the marginal probability distributions P_x(m) and P_y(n), respectively, where

P_x(m) = Σ_{n=0}^{N−1} g(m, n)                              (1)

P_y(n) = Σ_{m=0}^{N−1} g(m, n)                              (2)

The mean values of the marginal distributions P_x(m) and P_y(n) are given as

μ_x = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} m · g(m, n)               (3)

μ_x = Σ_{m=0}^{N−1} m · P_x(m)                              (4)

μ_y = Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} n · g(m, n)               (5)

μ_y = Σ_{n=0}^{N−1} n · P_y(n)                              (6)

The standard deviations of the marginal distributions P_x(m) and P_y(n) are given as

σ_x² = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m − μ_x)² g(m, n)       (7)

σ_y² = Σ_{n=0}^{N−1} Σ_{m=0}^{N−1} (n − μ_y)² g(m, n)       (8)

The sum and difference distributions are defined as

P_{x+y}(l) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} g(m, n),  with m + n = l,   (9)

for l = 0 to 2(N − 1), and

P_{x−y}(l) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} g(m, n),  with |m − n| = l,   (10)

for l = 0 to N − 1.

2.2.1 Energy

The energy (E) is computed as the sum of the squares of the elements of the GLCM. It returns a value in the range [0, 1]; an energy value of 1 indicates a constant image. It also reveals the uniformity of the image, as shown in Eq. 11:

Energy (E) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} [g(m, n)]²   (11)

2.2.2 Entropy

Entropy measures the disorder or complexity of an image, i.e., the amount of randomness in it. If the entropy is large, the image is not texturally uniform, and complex textures give high entropy values. It is computed from g(m, n) using Eq. 12:

Entropy (En) = −Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} g(m, n) × log(g(m, n))   (12)

2.2.3 Sum Entropy

SEn = −Σ_{m=2}^{2N} P_{x+y}(m) log(P_{x+y}(m))   (13)

2.2.4 Difference Entropy

DEn = −Σ_{m=0}^{N−1} P_{x−y}(m) log(P_{x−y}(m))   (14)
2.2.5 Contrast

Contrast measures the spatial frequency of an image and is the difference moment of the GLCM. It reflects the difference between the highest and lowest values of an adjacent set of pixels and is 0 for a constant image. It is also referred to as inertia. Contrast can be calculated from g(m, n) using the following equation:

Contrast (C) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m − n)² g(m, n)   (15)

2.2.6 Variance

This statistic measures heterogeneity and is strongly correlated with first-order statistics such as the standard deviation. It returns a high value for elements that differ greatly from the average value of g(m, n). It is also referred to as the sum of squares. Variance can be calculated for g(m, n) using the following equation, where μ denotes the mean of the input image:

Variance (V) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m − μ)² g(m, n)   (16)

2.2.7 Sum Variance

SV = Σ_{m=2}^{2N} (m − SEn)² P_{x+y}(m)   (17)

2.2.8 Difference Variance

DV = Σ_{m=0}^{N−1} m² P_{x−y}(m)   (18)

2.2.9 Local Homogeneity or Inverse Difference Moment (IDM)

Homogeneity reflects the consistency of the arrangement of the input image g(m, n). If the arrangement follows a regular pattern, the image is said to be homogeneous. A homogeneity value of 1 indicates a constant image. Mathematically, it can be expressed by the following equation:

Homogeneity (H) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} g(m, n) / (1 + (m − n)²)   (19)

2.2.10 Correlation

If the input image g(m, n) is highly correlated between adjacent pixels, the image is said to be auto-correlated (the autocorrelation of the input data with itself after shifting by one pixel). Correlation measures the linear dependency between pixels at the respective locations and can be calculated by the following equation:

Corr = [Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m × n) g(m, n) − (μ_x × μ_y)] / (σ_x × σ_y)   (20)

2.2.11 RMS Contrast

Root mean square (RMS) contrast measures the standard deviation of the pixel intensities. It does not depend on the angular frequency or the spatial distribution of contrast in the input image. Mathematically, it can be expressed as

RMS Contrast (RC) = sqrt( (1/N²) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} (I_ij − Ī)² )   (21)

where I denotes the pixel intensity values normalised between 0 and 1 and Ī is their mean.

2.2.12 Cluster Shade

Cluster shade measures the unevenness of the input matrix and gives information about the uniformity of the image. Disproportionate images result in higher cluster shade values. The cluster shade is computed using the following equation:

CS = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m + n − μ_x − μ_y)³ × g(m, n)   (22)

2.2.13 Cluster Prominence

Cluster prominence is also used to measure the asymmetry of the image. A higher cluster prominence indicates that the image is less symmetric, while a smaller variance in grey levels results in a lower cluster prominence value.

CP = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (m + n − μ_x − μ_y)⁴ × g(m, n)   (23)
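As a rough illustration of Eqs. 11–23, the sketch below computes a subset of the features from a normalised GLCM g(m, n). It is a minimal sketch under the definitions above, not the authors' implementation; the remaining features follow the same pattern.

import numpy as np

def glcm_features(g):
    # g: normalised GLCM (all entries sum to 1)
    N = g.shape[0]
    m, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")

    mu_x = np.sum(m * g)                                   # Eq. 3
    mu_y = np.sum(n * g)                                   # Eq. 5
    sigma_x = np.sqrt(np.sum((m - mu_x) ** 2 * g))         # Eq. 7
    sigma_y = np.sqrt(np.sum((n - mu_y) ** 2 * g))         # Eq. 8

    eps = 1e-12                                            # avoid log(0) and division by zero
    return {
        "energy":        np.sum(g ** 2),                                   # Eq. 11
        "entropy":       -np.sum(g * np.log(g + eps)),                     # Eq. 12
        "contrast":      np.sum((m - n) ** 2 * g),                         # Eq. 15
        "homogeneity":   np.sum(g / (1.0 + (m - n) ** 2)),                 # Eq. 19
        "correlation":   (np.sum(m * n * g) - mu_x * mu_y)
                         / (sigma_x * sigma_y + eps),                      # Eq. 20
        "cluster_shade": np.sum((m + n - mu_x - mu_y) ** 3 * g),           # Eq. 22
    }

With the normalised matrix g_norm from the earlier sketch, glcm_features(g_norm) returns a small dictionary of feature values that can be concatenated into a feature vector per image.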

3 Machine Learning Algorithms

Texture features of an image are calculated considering only one band at a time; channel information can be consolidated using PCA before calculating the texture features. Texture features can be used for both supervised and unsupervised image classification. Random Forest [10], a classification method, builds multiple models using various bootstrapped feature sets. The algorithm includes the following stages: to construct a single tree in the ensemble, the training set is bootstrapped, and the fresh sample is used to grow that tree; to identify the optimal split variable, a random selection of features is drawn from the training set every time a node of the tree is split. The random forest takes extra time in the validation procedure but shows acceptable performance, and it is compared here with KNN and SVM. The K-nearest neighbours (KNN) classifier, as used in the technique suggested by Shanthi et al. [1], is considered next. Suppose samples of two classes, "a" and "o", are plotted in a two-dimensional feature space and "c" is a new feature vector that must be classified. KNN identifies the K nearest neighbours of "c" without regard to their labels and assigns "c" to the majority class among them. For example, with k equal to 3, the three nearest neighbours of "c" are found; if one of them is an "a" and the other two are "o", then "o" receives two votes and "a" one, so the vector "c" is assigned to class "o". When K equals 1, the class is defined by the single closest neighbour. Computation time for KNN prediction is extremely long; however, training is quicker than that of random forest. Despite the improved training time, it takes more processing resources to handle data in higher dimensions. Finally, this chapter examines how well these algorithms perform when compared to SVM. The nearest-neighbour method identifies the observations most comparable to the one we are trying to predict, and those observations serve as a reasonable proxy for the answer, since the most likely response can be determined from the values around the observation.
The choice of the integer k matters: smaller values of k force the algorithm to adapt closely to the data we are using, putting it at risk of overfitting while allowing it to fit complicated borders between classes, whereas bigger K values smooth out the ups and downs of the actual data and result in smoother class separators. KNN prediction takes a lot of time to compute, yet it can be trained in a fraction of the time of
Fig. 5 Separating hyperplane, margin, and support vectors in SVM (positive and negative classes)

random forest. The HSI method for training may run faster but is more demanding
on memory. In this chapter, we have used support vector machine (SVM) [11, 12]
for texture image classification. This method falls under the category of supervised
machine learning [9].
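A hedged sketch of how the three classifiers discussed above could be compared on GLCM feature vectors using scikit-learn is given below; X and y are assumed to come from the feature-extraction step sketched earlier, and the hyper-parameters shown are illustrative rather than those used in the chapter.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y):
    # X: (n_samples, n_features) GLCM feature vectors, y: texture class labels
    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "KNN (k=3)":     KNeighborsClassifier(n_neighbors=3),
        "SVM (linear)":  SVC(kernel="linear", C=1.0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
        print(f"{name}: mean accuracy = {scores.mean():.3f}")

Such a comparison mirrors the layout of Table 5, where accuracy, precision, F1 score, sensitivity, and specificity are reported for the three classifiers.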
The support vector machine was introduced in 1992 as a supervised machine learning algorithm. It gained popularity because of its high accuracy rate and low error rate. SVM is one of the best examples of a kernel method, a key area of machine learning. The idea behind SVM is to use a nonlinear mapping function φ that transforms data from the input space to a feature space in which the data become linearly separable, as shown in Fig. 5 [2].
The SVM then automatically discovers the optimal separating hyperplane, which is nothing but a decision surface. The equation of the hyperplane is derived from the line equation y = ax + b; even though a hyperplane generalises a line, its equation is written as shown below [13, 14], where w and x are vectors whose product is computed as the dot product of the two, as shown in Eq. 24:

w^T x = 0   (24)

Any hyperplane can be framed as the set of points x satisfying w · x + b = 0. Two such hyperplanes are chosen, and based on the values obtained, the samples are classified into class 1 and class 2 as given in Eqs. 25 and 26:

w · x_i + b ≥ 1 for x_i belonging to class 1   (25)

w · x_i + b ≤ −1 for x_i belonging to class 2   (26)

Here, an optimization problem arises because the goal is to maximize the margin among all possible hyperplanes meeting the constraints. The hyperplane with the smallest ‖w‖ is chosen because it provides the biggest margin. The optimization problem is given in Eq. 27:

minimize_{w,b} (1/2) ‖w‖²   subject to   y^(i) (w^T x_i + b) ≥ 1   (27)

The solution to the above problem is the pair (w, b) with the smallest possible ‖w‖, i.e., the largest margin. The hyperplane that satisfies the constraints is taken as the optimal hyperplane.
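To make Eqs. 24–27 concrete, the sketch below fits a linear SVM on a small toy problem and reads off w and b; the margin width between the two bounding hyperplanes is 2/‖w‖. scikit-learn solves the equivalent dual problem internally, so this is only an illustration of the formulation, and the toy data are invented.

import numpy as np
from sklearn.svm import SVC

# Toy two-class data (class +1 vs class -1)
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # class +1
              [0.0, 0.5], [0.5, 0.0], [1.0, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates a hard margin

w = clf.coef_[0]                  # weight vector w in w.x + b = 0
b = clf.intercept_[0]             # bias b
margin = 2.0 / np.linalg.norm(w)  # distance between the two margin hyperplanes

print("w =", w, "b =", b, "margin width =", margin)
print("support vectors:\n", clf.support_vectors_)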

4 Dataset Description

The dataset from the Centre for Image Analysis at the Swedish University of Agricultural Sciences and Uppsala University [3] has been used in this chapter. In total, 4480 images of 28 different texture classes were captured using a Canon EOS 550D DSLR camera; sample images are shown in Fig. 6. Each texture class has around 160 images, of which 112 are used for training and 48 for testing, as shown in Table 2. Figure 7 shows the complete flowchart of the proposed method for texture image classification using GLCM features. In the first step, the segmented image is resized to 576 × 576.
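A minimal sketch of the 112/48 per-class split described above is given below; the directory layout and file naming are assumptions for illustration only and do not reflect the actual organisation of the Kylberg dataset. Each selected image would then be resized to 576 × 576 before GLCM feature extraction.

import os
import random

DATA_DIR = "kylberg_texture"          # hypothetical folder: one sub-folder per texture class
TRAIN_PER_CLASS, TEST_PER_CLASS = 112, 48

train_files, test_files = [], []
for class_name in sorted(os.listdir(DATA_DIR)):
    files = sorted(os.listdir(os.path.join(DATA_DIR, class_name)))
    random.Random(0).shuffle(files)   # reproducible shuffle per class
    train_files += [(os.path.join(DATA_DIR, class_name, f), class_name)
                    for f in files[:TRAIN_PER_CLASS]]
    test_files  += [(os.path.join(DATA_DIR, class_name, f), class_name)
                    for f in files[TRAIN_PER_CLASS:TRAIN_PER_CLASS + TEST_PER_CLASS]]

print(len(train_files), "training and", len(test_files), "testing images")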

5 Experiment Results

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. To separate the two
classes of data points, there are many possible hyperplanes that could be chosen. Our
objective is to find a plane that has the maximum margin, i.e., the maximum distance
between data points of both classes. Maximizing the margin distance provides some
reinforcement so that future data points can be classified with more confidence.
Hyperplanes are decision boundaries that help classify the data points. Data points
falling on either side of the hyperplane can be attributed to different classes. Also,
the dimension of the hyperplane depends upon the number of features. If the number
of input features is 2, then the hyperplane is just a line. If the number of input
features is 3, then the hyperplane becomes a two-dimensional plane. It becomes
difficult to imagine when the number of features exceeds 3. Support vectors are data
points that are closest to the hyperplane and influence its position and orientation. Using these support vectors, we maximize the margin of the classifier; deleting a support vector would change the position of the hyperplane. The classification results are evaluated using the following confusion-matrix entities:
Fig. 6 Sample images from the 28 different texture classes: Blanket1, Blanket2, Canvas, Ceiling1, Ceiling2, Cushion1, Floor1, Floor2, Grass1, Lentils1, Linseeds1, Oatmeal, Pearl sugar, Rice1, Rice2, Rug1, Sand1, Scarf1, Scarf2, Screen1, Seat1, Seat2, Sesame seeds, Stone1, Stone2, Stone3, Stonelab, Wall

1. True positive (TP): number of defective images accurately tagged as defective
2. True negative (TN): number of non-defective images properly identified as such
3. False positive (FP): number of non-defective images incorrectly classified as defective
4. False negative (FN): number of defective images incorrectly identified as non-defective

FP is referred to as a type-1 error, while FN is referred to as a type-2 error, as shown in Table 3. A type-1 error is to some degree tolerable in medical diagnosis as compared with a type-2
Table 2 Dataset descriptions


Class label Training samples Testing samples Total samples
Blanket1 112 48 160
Blanket2 112 48 160
Canvas 112 48 160
Ceiling1 112 48 160
Ceiling2 112 48 160
Cushion1 112 48 160
Floor1 112 48 160
Floor2 112 48 160
Grass1 112 48 160
Lentils1 112 48 160
Linseeds1 112 48 160
Oatmeal 112 48 160
Pearl sugar 112 48 160
Rice1 112 48 160
Rice2 112 48 160
Rug1 112 48 160
Sand 1 112 48 160
Scarf1 112 48 160
Scarf2 112 48 160
Screen1 112 48 160
Seat1 112 48 160
Seat2 112 48 160
Sesame seeds 112 48 160
Stone1 112 48 160
Stone2 112 48 160
Stone3 112 48 160
Stonelab 112 48 160
Wall 112 48 160

Fig. 7 Flowchart of the proposed texture image classification method

error. When the type-2 error is high, it implies that a greater proportion of individuals with an illness are classified as healthy, which may result in serious consequences [15–17]. Table 4 illustrates the confusion matrix for a multiclass problem. For each class, the entities TP, TN, FP, and FN may be computed using the following equations:
Table 3 Confusion matrix for binary classification problem


Predicted class
Binary classification Defective image Non-defective image
Actual class Defective image True positive False negative
Non-defective image False positive True negative

Table 4 Confusion matrix for multiclassification problem


Predicted class
Multiclass classification Class 1 Class 2 Class 3 Class 4
Actual class Class 1 X11 X12 X13 X14
Class 2 X21 X22 X23 X24
Class 3 X31 X32 X33 X34
Class 4 X41 X42 X43 X44

1. True positive (TP) of class A = X_AA
2. True negative (TN) of class A = Σ_{i=1}^{4} X_ii − X_AA
3. False positive (FP) of class A = Σ_{i=1}^{4} X_iA − X_AA
4. False negative (FN) of class A = Σ_{i=1}^{4} X_Ai − X_AA

For example, the TP, TN, FP, and FN values of Class 1 are computed as:

1. TP of Class 1 = X_11
2. TN of Class 1:

TN = Σ_{i=1}^{4} X_ii − X_11 = X_11 + X_22 + X_33 + X_44 − X_11 = X_22 + X_33 + X_44   (28)

3. FP of Class 1:

FP = Σ_{i=1}^{4} X_i1 − X_11 = X_11 + X_21 + X_31 + X_41 − X_11 = X_21 + X_31 + X_41   (29)

4. FN of Class 1:

FN = Σ_{i=1}^{4} X_1i − X_11 = X_11 + X_12 + X_13 + X_14 − X_11 = X_12 + X_13 + X_14   (30)
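The per-class entities can be read directly off a confusion matrix such as Table 7; the sketch below transcribes the chapter's Eqs. 28–30 (note that the TN term follows the chapter's diagonal-based definition).

import numpy as np

def per_class_entities(cm, class_index):
    # TP, TN, FP, FN for one class, following Eqs. 28-30 of this chapter
    # (TN is taken as the sum of the remaining diagonal elements).
    tp = cm[class_index, class_index]
    tn = np.trace(cm) - tp                 # Eq. 28
    fp = cm[:, class_index].sum() - tp     # Eq. 29 (column sum minus TP)
    fn = cm[class_index, :].sum() - tp     # Eq. 30 (row sum minus TP)
    return tp, tn, fp, fn

# Example usage, with the multiclass confusion matrix of Table 4 stored as a NumPy array `cm`:
# tp, tn, fp, fn = per_class_entities(cm, class_index=0)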
Table 5 Comparison of different machine learning algorithms for texture data

Measure        Random forest    KNN      SVM
Accuracy       94.45            95.42    99.35
Precision      90.41            81.21    92.09
F1 score       0.86             0.84     0.92
Sensitivity    88.48            89.69    91.65
Specificity    90.41            94.35    99.66

These support vectors are the points that help us build the SVM. The performance of the proposed system is measured in terms of sensitivity, specificity, accuracy, precision, false positive rate, and false negative rate. Sensitivity and specificity are important measures in classification; the accuracy of the system represents its exactness with respect to classification, and for a system to be precise, repeated measurements of the same object must be close to one another. The overall classification accuracy of the proposed system is around 99.4% with a precision of 92.4%. The false negative rate and false positive rate are very low, around 0.003 and 0.085, respectively, and the sensitivity and specificity of the system are around 91.5% and 99.7%. The texture classes 2, 4, 5, 7, 9, 12, and 19 are classified with better accuracy and precision than the other classes, as shown in Table 6.

5.1 Performance Metrics


5.1.1 Sensitivity

Sensitivity is also called the true positive rate (TPR), recall, or probability of detection. It is a metric for true positives and measures the completeness of the test.

5.1.2 Specificity

Specificity is also referred to as the true negative rate (TNR). It quantifies the true negatives, and improving specificity helps reduce type-1 errors.

5.1.3 False Positive Rate (FPR)

The false positive rate (FPR) is also known as the false alarm rate. It is the ratio of misclassified negative samples to the total number of negative samples (Table 5).
Fig. 8 Performance measures, their formulas, and their best and worst scores:

Sensitivity/TPR = TP / (TP + FN)              best score 100, worst score 0
Specificity/TNR = TN / (TN + FP)              best score 100, worst score 0
False positive rate = FP / (TN + FP)          best score 0, worst score 1
False negative rate = FN / (TP + FN)          best score 0, worst score 1
Accuracy = (TP + TN) / (TP + FP + TN + FN)    best score 100, worst score 0
Precision = TP / (TP + FP)                    best score 1, worst score 0
Negative predictive value = TN / (TN + FN)    best score 1, worst score 0
F1 score = 2·TP / (2·TP + FP + FN)            best score 1, worst score 0
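For completeness, the measures listed in Fig. 8 can be wrapped in a small helper that takes the per-class entities computed above; this is a direct transcription of those formulas, returning fractions that can be multiplied by 100 where percentages are reported.

def performance_measures(tp, tn, fp, fn):
    # All values returned as fractions in [0, 1]; multiply by 100 for percentages.
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr":         fp / (tn + fp),
        "fnr":         fn / (tp + fn),
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "precision":   tp / (tp + fp),
        "npv":         tn / (tn + fn),
        "f1_score":    2 * tp / (2 * tp + fp + fn),
    }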

5.1.4 False Negative Ratio (FNR)

The false negative ratio (FNR) is also called the miss rate. It is the ratio of misclassified positive samples to the total number of positive samples.
A system's performance is measured in terms of its efficiency. The efficiency with which a system solves a classification problem is quantified using metrics such as sensitivity, specificity, false positive rate, false negative rate, accuracy, precision, and F1 score [18, 19]. The formulas used to compute these metrics, along with their reference values, are shown in Fig. 8; a sample prediction is shown in Fig. 9; the per-class results of the proposed method are given in Table 6; and the confusion matrix for all 28 classes is shown in Table 7.
Fig. 9 Texture prediction image using support vector machine

6 Conclusion

The texture of an image is a description of the spatial arrangement of colours or intensities in the image, and it can be used to categorize the image into several classes. A large number of texture features can be computed mathematically and used for image analysis. The method proposed in this chapter combines the texture features computed from the GLCM matrix with standard machine learning algorithms for image classification. The overall classification accuracy of the proposed system is around 99.4% with a precision of 92.4%. The classification accuracy can be further improved by increasing the size of the dataset.
Table 6 Comparison of different SVM-based performance measures for texture data


Class Accuracy Precision F1 score Sensitivity Specificity
1 99.76 97.87 0.97 95.83 99.92
2 100 100 1 100 100
3 99.43 91.84 0.93 93.75 99.66
4 99.6 100 0.95 89.58 100
5 99.68 100 0.96 91.67 100
6 99.27 97.56 0.9 83.33 99.92
7 99.92 100 0.99 97.92 100
8 99.35 93.48 0.91 89.58 99.75
9 99.43 91.84 0.93 93.75 99.66
10 99.11 86.27 0.89 91.67 99.41
11 99.19 89.58 0.9 89.58 99.58
12 99.43 100 0.92 85.42 100
13 98.95 88.64 0.86 82.98 99.58
14 99.11 86.27 0.89 91.67 99.41
15 99.03 83.33 0.88 93.75 99.25
16 99.19 91.3 0.89 87.5 99.66
17 99.51 95.65 0.94 91.67 99.83
18 99.6 97.78 0.95 91.67 99.92
19 99.76 100 0.97 93.75 100
20 99.11 84.91 0.89 93.75 99.33
21 98.87 81.48 0.86 91.67 99.16
22 99.51 92 0.94 95.83 99.66
23 98.8 78.95 0.86 93.75 99
24 99.27 93.33 0.9 87.5 99.75
25 98.95 82.69 0.87 91.49 99.25
26 99.43 90.2 0.93 95.83 99.58
27 99.19 88 0.9 91.67 99.5
28 99.43 95.56 0.92 89.58 99.83
Average 99.35 92.09 0.92 91.65 99.66
Table 7 Confusion matrix for 28 classes
Actual class Predicted class
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
1 46 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2 0 48 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 45 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
4 1 0 0 43 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0
5 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 1 0
6 0 0 0 0 0 40 0 0 0 0 0 0 0 3 1 0 0 0 0 0 0 0 4 0 0 0 0 0
7 0 0 0 0 0 0 47 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 1 0 43 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
9 0 0 0 0 0 0 0 0 45 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0
10 0 0 0 0 0 0 0 0 0 44 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 43 0 0 2 1 0 0 0 0 0 0 0 2 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 4 0 41 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 3 0 39 0 0 0 2 0 0 0 0 0 3 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 2 0 0 0 44 2 0 0 0 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 1 0 0 0 0 45 2 0 0 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 42 0 0 0 0 0 0 0 0 2 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 3 1 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 44 0 0 2 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 45 2 0 0 0 0 0 0 0 0
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 45 3 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 44 1 0 0 0 0 0 0
22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 46 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 45 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 42 4 0 0 0
25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 43 3 0 0
26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 46 0 0
27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 44 2
28 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 43
References

1. Shanthi, T., Sabeenian, R. S., Manju, K., Paramasivam, M. E., Dinesh, P. M., & Anand, R.
(2021). Fundus image classification using hybridized GLCM features and wavelet features.
ICTACT Journal of Image and Video Processing, 11(03), 2345–2348.
2. Veni, S., Anand, R., & Vivek, D. (2020). Driver assistance through geo-fencing, sign board
detection and reporting using android smartphone. In: K. Das, J. Bansal, K. Deep, A. Nagar, P.
Pathipooranam, & R. Naidu (Eds.), Soft computing for problem solving. Advances in Intelligent
Systems and Computing (Vol. 1057). Singapore: Springer.
3. Kylberg, G. The Kylberg Texture Dataset v. 1.0, Centre for Image Analysis, Swedish University
of Agricultural Sciences and Uppsala University, External report (Blue series) No. 35.
Available online at: http://www.cb.uu.se/gustaf/texture/
4. Anand, R., Veni, S., & Aravinth, J. (2016) An application of image processing techniques for
detection of diseases on brinjal leaves using k-means clustering method. In 2016 International
Conference on Recent Trends in Information Technology (ICRTIT). IEEE.
5. Sabeenian, R. S., & Palanisamy, V. (2009). Texture-based medical image classification of
computed tomography images using MRCSF. International Journal of Medical Engineering
and Informatics, 1(4), 459.
6. Sabeenian, R. S., & Palanisamy, V. (2008). Comparison of efficiency for texture image
classification using MRMRF and GLCM techniques. Published in International Journal of
Computers Information Technology and Engineering (IJCITAE), 2(2), 87–93.
7. Haralick, R. M., Shanmugam, K., & Dinstein, I. H. (1973). Textural features for image
classification. IEEE Transactions on Systems, Man, and Cybernetics, 6, 610–621.
8. Varma, M., & Zisserman, A. (2005). A statistical approach to texture classification from single
images. International Journal of Computer Vision, 62(1–2), 61–81.
9. Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers.
Neural Processing Letters, 9(3), 293–300.
10. Shanthi, T., Sabeenian, R. S. (2019). Modified AlexNet architecture for classification of
diabetic retinopathy images. Computers and Electrical Engineering, 76, 56–64.
11. Sabeenian, R. S., Paramasivam, M. E., Selvan, P., Paul, E., Dinesh, P. M., Shanthi, T., Manju,
K., & Anand, R. (2021). Gold tree sorting and classification using support vector machine
classifier. In Advances in Machine Learning and Computational Intelligence (pp. 413–422).
Singapore: Springer.
12. Shobana, R. A., & Shanthi, D. T. (2018). GLCM based plant leaf disease detection using
multiclass SVM. International Journal For Research & Development In Technology, 10(2),
47–51.
13. Scholkopf, B., & Smola, A. J. (2001). Learning with kernels: support vector machines,
regularization, optimization, and beyond. MIT press.
14. Bennett, K. P., & Demiriz, A. (1999). Semi-supervised support vector machines. In Advances
in Neural Information Processing Systems (pp. 368–374).
15. Anand, R., Shanthi, T., Nithish, M. S., & Lakshman, S. (2020). Face recognition and
classification using GoogLeNET architecture. In: Das, K., Bansal, J., Deep, K., Nagar, A.,
Pathipooranam, P., & Naidu, R. (Eds.), Soft computing for problem solving. Advances in
Intelligent Systems and Computing (Vol. 1048). Singapore: Springer.
16. Shanthi, T., Sabeenian, R. S., & Anand, R. (2020). Automatic diagnosis of skin diseases using
convolution neural network. Microprocessors and Microsystems, 76, 103074.
17. Hall-Beyer, M. (2000). GLCM texture: A tutorial. In National Council on Geographic
Information and Analysis Remote Sensing Core Curriculum 3.
18. Shanthi, T., Anand, R., Annapoorani, S., & Birundha, N. (2023). Analysis of phonocardiogram
signal using deep learning. In D. Gupta, A. Khanna, S. Bhattacharyya, A. E. Hassanien,
S. Anand, & A. Jaiswal (Eds.), International conference on innovative computing and
communications (Lecture notes in networks and systems) (Vol. 471). Springer. https://doi.org/
10.1007/978-981-19-2535-1_48
19. Kandasamy, S. K., Maheswaran, S., Karuppusamy, S. A., Indra, J., Anand, R., Rega, P., &
Kathiresan, K. (2022). Design and fabrication of flexible Nanoantenna-based sensor using
graphene-coated carbon cloth. Advances in Materials Science & Engineering.
Progress in Multimodal Affective
Computing: From Machine Learning
to Deep Learning

M. Chanchal and B. Vinoth Kumar

1 Introduction

Emotions and sentiments play a significant role in our day-to-day lives. They help in decision-making, learning, communication, and handling situations. Affective computing is a technology that aims to detect, perceive, interpret, process, and replicate emotions from given data sources by using different types of techniques. The word "affect" is a synonym for "emotion." Affective computing technology is a human-computer interaction system that processes data captured through cameras, microphones, and sensors and provides the user's emotional state. Advances in signal processing and AI have led to the use of affective computing in medicine, industry, and academia alike for detecting and processing affective information from data sources [5]. Emotions can be recognized either from one type of data or from more than one type of data; hence, affective computing can be classified broadly into two types: unimodal affective computing and multimodal affective computing. Figure 1 depicts an overview of affective computing.
Unimodal systems are those in which the emotions are recognized from one
type of data. Generally, human beings rely on multimodal information more than
unimodal. This is because one can understand a person’s intention by looking at
his/her facial expression when he/she is speaking. In this case, both the audio and
video data provide more information than the information that is provided from
one type of data. For example, during an online class, the teacher can interpret

M. Chanchal ()
Department of Computer Science and Engineering, Amrita School of Engineering, Coimbatore,
Amrita Vishwa Vidyapeetham, Coimbatore, India
B. Vinoth Kumar
Department of Information Technology, PSG College of Technology, Coimbatore, India
e-mail: bvk.it@psgtech.ac.in

Fig. 1 Overview of affective computing (sensing human affect, understanding and modelling affect, recognizing the affect response, and affect expression response, between the human and the emotive system)

Fig. 2 Multimodal affective computing: data sources (image, audio, video, physiological signals) are pre-processed, followed by feature extraction and selection, model selection and training, and affect computing

more accurately whether the students have understood the class by both looking at the students' expressions and asking for their feedback, rather than by asking for their feedback alone.
The way people express their opinion varies from person to person. One person may express his/her opinion more verbally, while another may express it through facial expression [12]. Thus, a model that can interpret emotion for any type of person is required, and this is where multimodal affective computing plays a major role. Unimodal systems are the building blocks of multimodal systems. A multimodal system outperforms a unimodal system since more than one type of data is used for interpretation. The multimodal affective computing structure is presented in Fig. 2.
To date, only very limited survey analyses have been done on multimodal affective computing, and previous studies do not concentrate specifically on machine learning and deep learning approaches. With the advancement of AI techniques, a number of machine learning and deep learning algorithms can be applied to multimodal affective computing. The objective of this chapter is to provide a clear idea of the various machine learning and deep learning methods used for multimodal affect computing. In addition, details about the various datasets, modalities, and fusion techniques are elaborated. The remaining part
of this chapter is organized into multiple sections. Section 2 presents the available datasets, Sect. 3 elaborates on the various features used for affect recognition, Sect. 4 explains the various fusion techniques, Sect. 5 describes the various machine learning and deep learning techniques for multimodal affect recognition, Sect. 6 provides a discussion, and finally, Sect. 7 concludes the chapter.

2 Available Datasets

In the literature, two types of datasets are found: publicly available datasets and datasets collected from subjects based on a predecided concept. In the latter, subjects are selected based on the tasks that need to be performed, and the respective data are collected for further processing. This section describes the publicly available datasets for multimodal affective computing; the various kinds of datasets for multimodal affect analysis are summarised in Table 1.

2.1 DEAP Dataset

The DEAP dataset [1] was collected from 32 subjects who watched 40 one-minute video clips that stimulated emotions. Based on this, EEG signals and Peripheral Physiological Signals (PPS) were captured; the PPS include both electromyographic (EMG) and EOG data. It uses four emotion dimensions (valence, arousal, liking, and dominance), each rated on a scale of 1–9.

2.2 AMIGOS Dataset

The AMIGOS database [4] was collected from subjects using two different experimental settings, mainly for mood, personality, and affect research. In the first experimental setting, 40 subjects watched 16 short videos, each varying between 51 and 150 s. In the second experimental setting, some subjects watched four long videos in different scenarios, both individually and in groups. Wearable sensors were used to obtain the EEG, ECG, and GSR signals in this dataset. It also contains face and depth data that were collected using separate equipment.

2.3 CHEAVD 2.0 Dataset

It is an extension of the CHEAVD dataset [14], adding 4178 samples to it. The CHEAVD 2.0 dataset was collected from Chinese movies, soap operas, and
Table 1 Datasets for multimodal affective computing

DEAP [1]: Modality A+V+T+B; 32 subjects; facial, text, EEG, and PPS signals; emotions: arousal, valence, dominance, and liking
AMIGOS [4]: Modality A+V+B; 40 subjects; facial, EEG, ECG, and GSR signals; emotions: valence, arousal, dominance, liking, familiarity, and basic emotions
CHEAVD 2.0 [14]: Modality A+V; 527 subjects; audio and video data; emotions: neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, and excitement
RECOLA [26]: Modality A+V+B; 46 subjects; audio, video, ECG, and EDA signals; emotions: arousal and valence
IEMOCAP [28]: Modality A+V+T; 10 subjects (5 male, 5 female); audio, video, and lexical; emotions: angry, sad, happy, and neutral
CMU-MOSEI [17]: Modality A+V; 1000 subjects; audio and video data; emotions: angry, disgust, fear, happy, sad, and surprise
SEED IV [30]: Modality A+V+B; 44 subjects; audio, video, and EEG signals; emotions: happy, sad, fear, and neutral
AVEC 2014 [9]: Modality A+V; audio, video, and lexical; BDI-II depression scale range
SEWA [26]: Modality A+V+T; 64 subjects; audio, video, and textual; emotions: arousal and valence
AVEC 2018 [7]: Modality A+V+T; 64 subjects; audio, video, and lexical; emotions: arousal, valence, and preference for commercial products
DAIC-WOZ [22]: Modality A+V+T; audio, facial feature, voice, facial action, and eye feature; BDI-II depression scale range
UVA Toddler [23]: Modality A+V; 61 subjects; audio, video, and head and body pose; Classroom Assessment Scoring System (CLASS) dimension (positive or negative)
MET [23]: Modality A+V; 3000 subjects; audio and video features; Classroom Assessment Scoring System (CLASS) dimension (positive or negative)

A audio, V video, T text, B biological; ECG electrocardiography, EEG electroencephalography, GSR galvanic skin response, EDA electrodermal activity, PPS Peripheral Physiological Signals

TV shows that contain background noise so as to simulate real-world conditions. This dataset has 474 min of emotional segments. It contains 527 speakers ranging from children to the elderly, with a gender distribution of 58.4%
male and 41.6% female. The duration of the video clips ranges from 1 to 19 s, with an average duration of 3.3 s.

2.4 RECOLA Dataset

Remote COLlaborative and Affective (RECOLA) dataset [26] is a multimodal


database of spontaneous collaborative and affective interactions in French. Forty-
six French-speaking subjects were involved in this task. The subjects were recorded
during a video conference in dyadic interaction when completing a task that required
collaboration. This includes a total of 9.5 h. Six annotators measured the emotions
on the basis of two dimensions: arousal and valence. The dataset contains audio,
video, and physiological signals like electrocardiogram (ECG) and electrodermal
activity (EDA).

2.5 IEMOCAP Dataset

Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [28] contains


speech, text, and face modalities that were collected from ten actors during
dyadic interaction using motion capture camera. The conversations included both
spontaneous and scripted sessions. There are four labeled annotations: angry, sad,
happy, and neutral. The dataset had five sessions, and each session was between one
female and one male (two speakers).

2.6 CMU-MOSEI Dataset

The CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI)


dataset [17] contained 23453 annotated video segments. It included 1000 different
speakers and 250 topics that were taken from social media. Six labeled annotations
are there in this dataset. They are angry, disgust, fear, happy, sad, and surprise.

2.7 SEED IV Dataset

The SEED IV dataset [30] contains four annotated emotions: happy, sad, fear, and neutral. Forty-four subjects participated, of whom 22 were female college students. They were asked to assess their emotions while watching the film clips as sad, happy, fear, or neutral, with ratings from −5 to 5 on two dimensions: arousal and valence. The valence scale ranges from sad to happy, and the
arousal scale ranges from calm to excited. At the end, 72 film clips were selected
that had the highest match among the subjects. The duration of each clip was 2 min.

2.8 AVEC 2014 Dataset

It is a subset of the audiovisual depression language corpus [9]. The dataset has 300 video clips, recorded using web cameras and microphones while people were interacting with a computer. One to four recordings were taken of each subject, with a gap of 2 weeks between recordings. The length of the video clips is between 6 s and 4 min. The dataset contains subjects aged 18–63, with an average age of 31.5 years. The BDI-II depression scale ranges from 0 to 63, where 0–10 is normal, 11–16 is mild depression, 17–20 is borderline, 21–30 is moderate depression, 31–40 is severe depression, and above 40 is extreme depression. The highest score recorded was 45.

2.9 SEWA Dataset

The Sentiment Analysis in the Wild (SEWA) dataset [26] contains audio and video recordings collected using web cameras and microphones, annotated with the natural emotion dimensions arousal and valence. The dataset includes a total of 64 subjects with ages ranging from 18 to 60 years, with the training set containing 36 subjects, the validation set 14 subjects, and the testing set 16 subjects. The subjects were paired (a total of 32 pairs), made to watch commercial videos, and asked to discuss the content of the video with their partner for up to 3 min. The dataset includes text, audio, and video data. Six German-speaking annotators (three males and three females) annotated the dataset for arousal and valence.

2.10 AVEC 2018 Dataset

It is an extension of the AVEC 2017 database [7]. AVEC 2017 is based on the SEWA dataset of German culture with 64 subjects, having 36 for training, 14 for validation, and 16 for testing. In the AVEC 2018 dataset, the testing set is extended with new subjects of Hungarian culture in the same age range as the German subjects. This dataset includes both audio and video recordings. It annotates three dimensions: arousal, valence, and preference for the commercial products, all on a scale ranging from −1 to +1. The duration of the recordings is 40 s to 3 min, and the emotions are annotated every 100 ms.
2.11 DAIC-WOZ Dataset

The Distress Analysis Interview Corpus depression dataset [22] includes clinical interviews used for the diagnosis of psychological conditions such as anxiety, depression, and posttraumatic stress disorder. It includes audio and video recordings and questionnaire responses from interviews conducted by a virtual interviewer called Ellie, controlled by a human interviewer in another room. It contains 189 interview sessions. Each interview contains the audio file of the session, 68 facial points of the subject, HoG (Histogram of Oriented Gradients) facial features, head pose, eye features, a file of continuous facial actions, a file containing the subject's voice, and a transcript file of the interview. All features except the transcript file are time-series data.

2.12 UVA Toddler Dataset

The University of Virginia (UVA) Toddler dataset [23] has 192 videos, each of 45–60 min, collected from 61 child-care centres with toddlers of 2–3 years old. The videos are recorded using a digital camera with an integrated microphone. Each video covers a day of preschool, including individual and group activities, outdoor play, and shared meals, with activities such as singing, reading, and playing with blocks and toys. Each session includes an average of 1.7 teachers and 7.59 students. The dataset includes video, audio (along with background noise), and head and body pose.

2.13 MET Dataset

The Measures of Effective Teaching (MET) dataset [23] is one of the Classroom Assessment Scoring System (CLASS)-coded video datasets. It includes 16,000 videos in which 3000 teachers teach language, mathematics, arts, and science in both middle and elementary schools across six districts of the USA. The data were collected using 360° cameras with integrated microphones placed in the centre of the classroom to capture both the teachers and the students properly.

3 Features for Affect Recognition

Affective computing requires extraction of meaningful information from the gath-


ered data. This can be done by using various techniques. This section describes
the various modalities along with their techniques used in multimodal affective
Fig. 3 Categories of modality for affect recognition: physiological (EEG, ECG, GSR, PPS; captured through sensors) and behavioral (audio, video, textual, facial expression; naturally observed or obtained through computer interaction)

computing. The modalities acquired from a subject fall into two broad categories: physiological and behavioral. In this section, the primary focus is on audio, visual, textual, facial expression, and biological signals, along with the techniques used to process them. Of these, audio, visual, textual, and facial expression fall under the behavioral category, while biological signals are the physiological modality. Figure 3 shows the modality categories.

3.1 Audio Modality

Audio is one medium for capturing the emotions of a user. The openSMILE toolkit is one popular method for extracting audio features such as pitch, intensity of utterance, bandwidth, pause duration, and perceptual linear predictive (PLP) coefficients [19]. Mel-frequency cepstral coefficients (MFCC) [28] are the most popular audio features. Nowadays, deep neural networks are increasingly used for better extraction of audio features.
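To make this concrete, the sketch below computes MFCCs, a pitch track, and an RMS-energy measure with librosa and pools them into a fixed-length utterance vector. This is only an illustrative sketch of such a pipeline (the surveyed works mostly rely on openSMILE), and the file name and parameter choices are placeholder assumptions.

import numpy as np
import librosa

# Load a mono utterance at 16 kHz (path and sample rate are illustrative assumptions)
y, sr = librosa.load("utterance.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # frame-level MFCC matrix (13, T)
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)        # frame-level pitch estimate
energy = librosa.feature.rms(y=y)                    # intensity proxy (RMS energy)

# One common utterance-level representation: mean and std of each frame-level feature
feature_vector = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [f0.mean(), f0.std(), energy.mean(), energy.std()],
])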

3.2 Visual Modality

Similar to audio, the visual modality is an important source of features for affect computing. The openSMILE toolkit is one common method for extraction of visual features [19]. Also, Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) are sometimes used as the baseline visual feature set [14]. Deep neural networks like ResNet-50,
DenseNet, VGG Face, MobileNet, and HRNet can also be used for extracting better
visual features [25].
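As a hedged illustration of using one of these backbones as an off-the-shelf visual feature extractor, the snippet below embeds a single video frame with an ImageNet-pretrained ResNet-50 in Keras. The frame path is a placeholder, and average pooling into a 2048-dimensional vector is one common choice rather than the exact setup of the works cited above.

import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# ImageNet-pretrained backbone without the classification head
backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")

img = image.load_img("frame_0001.jpg", target_size=(224, 224))   # placeholder frame
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
embedding = backbone.predict(x)[0]   # 2048-dimensional high-level visual feature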

3.3 Textual Modality

For affect computing, text features play a vital role. Textual features are of two types: Crowd-Sourced Annotation (CSA) features and DISfluency and Non-verbal Vocalization (DIS-NV) features. The DIS-NV features are obtained through manual annotation. The CSA features are extracted by removing stop words like “a,” “and,” “the,” etc. and then lemmatizing the remaining words using the Natural Language Toolkit [19]. Parts of Speech (PoS), n-gram features, and TF-IDF (Term Frequency-Inverse Document Frequency) are useful features for emotion recognition. Google English word embeddings (Word2Vec) [12] and Global Vectors (GloVe) are also used to extract textual features.
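A minimal sketch of the CSA-style preprocessing described above is given below: stop-word removal and lemmatization with NLTK, followed by TF-IDF vectorization with scikit-learn. The tiny corpus is an invented placeholder, and the nltk.download calls are only needed once per environment.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Keep alphabetic tokens, drop stop words, lemmatize the rest
    tokens = [t for t in text.lower().split() if t.isalpha() and t not in stop_words]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

corpus = ["I am really happy with this product", "This makes me so angry and sad"]
tfidf = TfidfVectorizer(ngram_range=(1, 2))                     # unigram and bigram features
X = tfidf.fit_transform([preprocess(doc) for doc in corpus])    # document-term matrix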

3.4 Facial Expression

Facial expression can be captured using the AdaBoost algorithm with Haar-like features [8]. The Chehra algorithm can be used to locate the facial points in the image frame [11]. With these facial points, face features can be further extracted using a Facial Action Unit (AU) recognition algorithm [17]. LBP-TOP can be used to extract the face features [14]. Facial landmarks can be detected in an image using OpenFace, which is an open-source tool [22].
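As a rough sketch of facial landmark extraction, the snippet below uses dlib's 68-point shape predictor as a readily scriptable stand-in for tools such as OpenFace or Chehra; it is not the pipeline of the cited works. The image path and the pretrained .dat model file (downloaded separately) are assumptions.

import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed path

img = dlib.load_rgb_image("subject_frame.jpg")       # placeholder image
for face in detector(img, 1):                        # upsample once to catch small faces
    shape = predictor(img, face)
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    # The 68 (x, y) landmark points can feed AU recognition or geometric face features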

3.5 Biological Signals

The biological signals include ECG (electrocardiography), EEG (electroencephalography), GSR (galvanic skin response), and PPS (peripheral physiological signals). All these signals are captured using appropriate electrodes attached to the human body. NeuroScan is one system that can be used for recording and analyzing EEG signals [30]. Shimmer3 sensors can be used for measuring ECG [2].

4 Various Fusion Techniques

Multimodal affect computing involves fusion of the various modalities that are captured. In order to perform the analysis, the modalities are combined using various fusion techniques. Fusion of data from several modalities provides richer information and thus achieves
a result with very good accuracy. There are two main levels of fusion: feature-level fusion (early fusion) and decision-level fusion (late fusion). Also, there are other fusion techniques like hierarchical fusion, model-level fusion, and score-level fusion. This section describes these fusion techniques.

4.1 Decision-Level or Late Fusion

Decision-level fusion [20] is a technique that combines the emotion recognition results of several unimodal classifiers by using an algebraic combination method. Each modality is given individually as input to its unimodal classifier, and the results obtained from these emotion classifiers are combined by algebraic rules like “Sum,” “Min,” “Max,” etc. Hence, decision-level fusion is called late fusion. In decision-level fusion, a unimodal emotion recognizer is built for each of the multimodal feature sets. The main advantage of decision-level fusion is that the complete knowledge about each individual modality can be applied separately.
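A minimal numerical sketch of this idea is given below: each unimodal classifier outputs a class-score vector, and the vectors are combined with an algebraic rule such as “Sum” or “Max”. The class names and probability values are invented for illustration.

import numpy as np

classes = ["angry", "happy", "neutral", "sad"]
p_audio = np.array([0.10, 0.60, 0.20, 0.10])   # output of the unimodal audio classifier
p_video = np.array([0.05, 0.40, 0.35, 0.20])   # output of the unimodal video classifier
p_text  = np.array([0.15, 0.55, 0.20, 0.10])   # output of the unimodal text classifier

scores = np.vstack([p_audio, p_video, p_text])
fused_sum = scores.sum(axis=0)                 # "Sum" rule
fused_max = scores.max(axis=0)                 # "Max" rule

print(classes[int(np.argmax(fused_sum))])      # fused decision: "happy"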

4.2 Hierarchical Fusion

The hierarchical fusion technique uses different multimodal feature sets at different levels of its hierarchy [19]. For example, the set of ideas or perceived emotion annotation types of features are used in the lower layers of a model, whereas abstract features like text, audio, or video features are used in the higher layers. This method fuses two-stream networks at different levels of the hierarchy to improve the performance of emotion recognition.

4.3 Score-Level Fusion

Score-level fusion is another variant of decision-level fusion [29]. It is mainly used in audiovisual emotion recognition systems. The class score values are obtained using a technique such as equally weighted summation. The final predicted category is the emotion category that has the highest value in the fused score vector. Score-level fusion is based on combining the individual classification scores, which indicate the possibility that a sample belongs to each class, whereas decision-level fusion combines the individual predicted class labels.

4.4 Model-Level Fusion

Model-level fusion is a compromise between feature-level fusion and decision-level fusion [29]. It exploits the correlation between the data observed from the different modalities, fusing the data in a relaxed manner, and it mainly uses a probabilistic approach. This approach is used mainly for the audio and video modalities. For example, in neural networks, model-level fusion is done by fusing the features at the hidden layers of the network, where the multiple modalities are given as input. An additional hidden layer is then added to learn the joint feature from the fused feature vector.
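A minimal Keras sketch of this kind of hidden-layer fusion is shown below, assuming two modalities, arbitrary input dimensions, and four emotion classes; none of these values come from the surveyed works.

from tensorflow.keras import layers, Model

audio_in = layers.Input(shape=(128,), name="audio_features")
video_in = layers.Input(shape=(512,), name="video_features")

h_audio = layers.Dense(64, activation="relu")(audio_in)   # audio hidden representation
h_video = layers.Dense(64, activation="relu")(video_in)   # video hidden representation

fused = layers.Concatenate()([h_audio, h_video])          # fusion at the hidden level
joint = layers.Dense(64, activation="relu")(fused)        # learns the joint feature
output = layers.Dense(4, activation="softmax")(joint)     # e.g., four emotion classes

model = Model(inputs=[audio_in, video_in], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])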

5 Multimodal Affective Computing Techniques

Multimodal affective computing is a method of emotion recognition from more than one modality. The features extracted from the data are used to train a model for emotion recognition, and this model can use any type of technique. In recent years, a growing number of machine learning and deep learning techniques have been used in multimodal affective computing. This section describes the various machine learning and deep learning techniques for emotion recognition. Machine learning- and deep learning-based techniques for multimodal affect computing are summarized in Tables 2 and 3, respectively.

5.1 Machine Learning-Based Techniques

Eun-Hye Jang et al. [10] proposed a method for fear-level detection using physiological measures like skin conductance level and response (SCL, SCR), heart rate (HR), pulse transit time (PTT), fingertip temperature (FT), and respiratory rate (RR). The task was performed using data collected from 230 subjects who were asked to watch fear-inducing video clips. Correlation and linear regression among the physiological measures were performed to check the fear intensity, and ML techniques like the nonparametric Spearman's rank correlation coefficient were used. SCR and HR were positively correlated with the intensity, whereas SCL, RR, and FT were negatively correlated. The method showed an accuracy of 92.5% on fear-inducing clips. Oana Balan et al. [2] proposed an automated fear-level detection and acrophobia virtual therapy system. It used galvanic skin response (GSR), heart rate (HR), and electroencephalography (EEG) values from subjects who played an acrophobia game and who were undergoing in vivo therapy and virtual reality therapy. Two classifiers were used: one to determine the present fear level and another one to determine the game level that needs to be played next. ML techniques like Support Vector Machine, Random Forest, k-Nearest Neighbors,

Table 2 Machine learning-based techniques for multimodal affect computing

Eun-Hye Jang et al. [10]: Dataset – data collected from 230 subjects; Features – SCL, SCR, HR, PTT, FT, and RR; Models – Spearman's rank correlation; Result – 92.5% accuracy.
Oana Balan et al. [2]: Dataset – data collected from subjects playing an acrophobia game; Features – GSR, HR, and EEG; Models – SVM, RF, KNN, Linear Discriminant Analysis, and four deep neural network models; Result – two scales: SVM – 89.5%, DNN – 79.12%; four scales: KNN – 52.75%, SVM – 42.5%.
Seul-Kee Kim et al. [13]: Dataset – data collected directly from subjects (showing them clips of real pedestrian environments); Features – EEG, ECG, and GSR; Models – t-tests or Mann-Whitney U, ANOVAs, or Kruskal-Wallis tests; Result – comparison done using significance values.
Cheng-Hung Wang et al. [27]: Dataset – data collected from 136 subjects; Features – textual features and facial expression; Models – t-test and Cohen's d standard; Result – highest effect value (0.71).
Jose Maria Garcia-Garcia et al. [5]: Dataset – data collected from six children; Features – facial expression, key strokes, and speech features; Models – t-test; Result – SUS score with lesser attempts (60% less).
Shahla Nemati et al. [20]: Dataset – DEAP dataset; Features – video, audio, and text features; Models – SVM and Naive Bayes; Result – SVM had better accuracy (92%).
Javier Marín-Morales et al. [16]: Dataset – data collected from 60 subjects; Features – EEG and heart rate variability (HRV); Models – SVM-RFE and LOSO cross-validation; Result – arousal – 75%, valence – 71.21%.
Li Ya et al. [14]: Dataset – CHEVAD 2.0 dataset; Features – audio and video features; Models – SVM with decision-level and feature-level fusion; Result – decision-level fusion – 35.7% MAP, feature-level fusion – 21.7% MAP.
Deger Ayata et al. [1]: Dataset – DEAP emotion dataset; Features – GSR and PPG (photoplethysmography); Models – KNN, RF, and decision tree methods; Result – arousal – 72.06%, valence – 71.05%.
Asim Jan et al. [9]: Dataset – AVEC 2014 dataset; Features – audio and visual features; Models – Feature Dynamic History Histogram (FDHH) algorithm, Motion history histogram (MHH), PLS regression, and LR techniques; Result – FDHH had better RMSE and MAE.
Sandeep Nallan Chakravarthula et al. [3]: Dataset – data collected from 62 couples; Features – acoustic, lexical, and behavioral features; Models – SVM; Result – recall% better by 13–20%.
Nathan L. Henderson et al. [6]: Dataset – data collected from 119 subjects; Features – posture data and electrodermal activity data; Models – SVM and neural network; Result – Kappa score of the multimodal approach was better.
Papakostas M et al. [21]: Dataset – data collected from 45 subjects; Features – visual and physiological information; Models – RF, Gradient Boosting classifier, and SVM classifier; Result – F1 score of the multimodal approach was better.
Anand Ramakrishnan et al. [23]: Dataset – UVA toddler dataset and MET dataset; Features – audio features and face images of both the teacher and students; Models – ResNet, Pearson correlation, and Spearman correlation; Result – ResNet correlation values: positive – 0.55, negative – 0.63; Pearson correlation: positive – 0.36, negative – 0.41; Spearman correlation: positive – 0.48, negative – 0.53.
Dongmin Shin et al. [24]: Dataset – data collected from 30 subjects; Features – EEG and ECG signals; Models – Bayesian network (BN), SVM, and MLP; Result – BN had the highest accuracy (98.56%).

Abbreviations: SCL, SCR – skin conductance level and response; HR – heart rate; PTT – pulse transit time; FT – fingertip temperature; RR – respiratory rate; GSR – galvanic skin response; EEG – electroencephalography; ECG – electrocardiographic; SVM – Support Vector Machine; RF – Random Forest; KNN – k-Nearest Neighbors

Linear Discriminant Analysis, and four deep neural network models were used. The models were compared based on accuracy, and it was found that the DNN model had the highest accuracy of 79.12% for the player-independent modality and SVM had an accuracy of 89.5% for the player-dependent modality with two scales, whereas for four scales, the highest accuracies were obtained by KNN (52.75%) and SVM (42.5%).
Seul-Kee Kim et al. [13] proposed a method to determine the fear of crime using multimodality, based on data collected from subjects who were shown clips of real pedestrian environments. Features like electroencephalographic (EEG), electrocardiographic (ECG), and galvanic skin response (GSR) signals were used. To compare the difference in fear of crime between the two groups (i.e., the Low Fear of crime Group (LFG) and the High Fear of crime Group (HFG)), techniques like independent t-tests or Mann-Whitney U tests were used. To compare the fear of crime across the video clips shown to the subjects, ANOVAs or Kruskal-Wallis tests were used. The values were compared at a significance level of p < 0.05. Cheng-Hung Wang et al. [27] proposed a method for multimodal

Table 3 Deep learning-based techniques for multimodal affect computing

Michal Muszynski et al. [19]: Dataset – LIRIS-ACCEDE dataset; Features – audiovisual features, lexical features, physiological reactions like GSR, and ACCeleration signals (ACC); Models – LSTM, DBN, and SVR; Result – LSTM had the best result (A – arousal, V – valence): MSE (A – 0.260, V – 0.070), CC (A – 0.251, V – 0.266), CCC (A – 0.111, V – 0.143).
Joaquim Comas et al. [4]: Dataset – AMIGOS dataset; Features – facial and physiological signals like ECG, EEG, and GSR; Models – BMMN (bio multi model network); Result – accuracy: arousal – 87.53%, valence – 65.05%.
Jiaxin Ma et al. [15]: Dataset – DEAP dataset; Features – EEG signals and physiological signals; Models – deep LSTM network, residual LSTM network, and Multimodal Residual LSTM (MM-ResLSTM) network; Result – MM-ResLSTM had the best result: arousal – 92.87%, valence – 92.30%.
Panagiotis Tzirakis et al. [26]: Dataset – RECOLA dataset; Features – audio and video features; Models – LSTM network; Result – raw signals of audio and video: arousal – 78.9%, valence – 69.1%; raw signals of audio with raw and geometric signals of video: arousal – 78.8%, valence – 73.2%.
Seunghyun Yoon et al. [28]: Dataset – IEMOCAP dataset; Features – text and audio features; Models – Multimodal Dual Recurrent Encoder (MDRE); Result – WAP value – 0.718; accuracy: 68.8% to 71.8%.
Trisha Mittal et al. [17]: Dataset – IEMOCAP and CMU-MOSEI datasets; Features – facial, text, and speech features; Models – LSTM; Result – increase of 2–7% in F1 and 5–7% in MA.
Siddharth et al. [11]: Dataset – AMIGOS dataset; Features – EEG, ECG, GSR, and frontal videos; Models – Extreme Learning Machine (ELM) along with 10-fold cross-validation; Result – EEG and frontal videos: accuracy – 52.51%; GSR and ECG: accuracy – 38.28%.
Wei-Long Zheng et al. [30]: Dataset – data collected from 44 subjects; Features – EEG and eye movements; Models – bimodal deep auto-encoder (BDAE) and SVM; Result – accuracy – 85.11%.
Shiqing Zhang et al. [29]: Dataset – RML database, eNTERFACE05 database, and BAUM-1s database; Features – audio and video features; Models – 3D-CNN with DBN; Result – accuracy: RML database (80.36%), eNTERFACE05 database (85.97%), and BAUM-1s (54.57%).
Eesung Kim et al. [12]: Dataset – IEMOCAP dataset; Features – acoustic and lexical features; Models – deep neural network (DNN); Result – WAR – 66.6, UAR – 68.7.
Huang Jian et al. [7]: Dataset – AVEC 2018 dataset; Features – visual, acoustic, and textual features; Models – LSTM-RNN; Result – arousal – 0.599–0.524, valence – 0.721–0.577, liking – 0.314–0.060.
Qureshi et al. [22]: Dataset – DAIC-WOZ depression dataset; Features – acoustic, visual, and textual features; Models – attention-based fusion network with deep neural network (DNN); Result – accuracy – 60.61%.
Luntian Mou et al. [18]: Dataset – data collected from 22 subjects; Features – eye features and vehicle and environmental data; Models – attention-based CNN-LSTM network; Result – accuracy – 95.5%.
Panagiotis Tzirakis et al. [26]: Dataset – SEWA dataset; Features – text, audio, and video features; Models – attention-based fusion strategies; Result – arousal – 69%, valence – 78.3%.

Abbreviations: GSR – galvanic skin response; EEG – electroencephalography; ECG – electrocardiographic; SVM – Support Vector Machine; LSTM – Long Short-Term Memory; DBN – Deep Belief Network; RNN – Recurrent Neural Network; SVR – Support Vector Regression

emotion computing for a tutoring system. It used textual features and facial expressions collected from 136 subjects. The technique used was the t-test, and Cohen's d standard was used to determine the significance level. This model compared the test results obtained from a normal Internet teaching group and an affective teaching group. Pretest and posttest were conducted for both groups, and it was found that the posttest of the emotional teaching group produced a moderate to high effect value (0.71) and a closer significance value.
Jose Maria Garcia-Garcia et al. [5] proposed a multimodal affect computing method to improve the user experience of educational software applications. Facial expression, key strokes, and speech features were the features used. The method used was the t-test to compare the means of all the datasets and to test the null
hypothesis. The test was done using two types of systems: one with an emotion recognition application and one without emotion recognition. The System Usability Scale (SUS) score was used to determine which system performs better, and it was found that the one with emotion recognition had a better SUS score, requiring fewer attempts (60% less) and using less help. Shahla Nemati et al. [20] proposed
a method for hybrid latent space data fusion technique for emotion recognition.
Video, audio, and text features were used from the DEAP dataset. SVM and Naive
Bayes were used as a classifier. Feature-level fusion and decision-level fusion are
employed in this model using Marginal Fisher Analysis (MFA) to cross-modal
factor analysis (CFA) and canonical correlation analysis (CCA). In feature-level
fusion, SVM classifier outperforms Naïve Bayes classifier. But in decision-level
fusion, it is mainly dependent on the type of classifier.
Javier Marín-Morales et al. [16] proposed a method for emotion recognition
using the brain and heartbeat dynamics like electroencephalography (EEG) and
heart rate variability (HRV). The data was collected from a total of 60 subjects.
SVM-RFE and LOSO cross-validation techniques were used for emotion recogni-
tion. Two predictions were done, one for arousal and another one for valence. The
features were extracted from HRV, EEG band power, and EEG MPS. It was found
that the arousal dimension attained an accuracy of 75%, and valence had an accuracy
of 71.21%. Li Ya et al. [14] proposed a multimodal emotion recognition challenge
using the audio and video features of CHEVAD 2.0 dataset. SVM classifier was
used for emotion recognition. Two fusion techniques were compared. They are
decision-level fusion and feature-level fusion. Among the two fusion techniques,
decision-level fusion with 35.7% in MAP was better than feature-level fusion, which
had only 21.7% in MAP. Also, its results were compared with the individual feature
predictions. It was found that with audio alone and video alone, it had 39.2% and
21.7%, respectively.
Deger Ayata et al. [1] proposed music recommendation system based on
emotions using the GSR (galvanic skin response) signals and PPG (photo plethys-
mography) signals obtained from 32 subjects in DEAP emotion dataset. The features
are extracted from these signals and fused using feature-level fusion technique.
The classifiers are fed with the feature vector to obtain the arousal and valence
values. KNN, Random Forest, and decision tree methods are used for emotion
identification. The arousal and valence accuracy were compared with using only
GSR signal, only PPS signals, and multimodal features. It was found that accuracy
of fused method had better accuracy for both arousal (72.06%) and valence
(71.05%). Asim Jan et al. [9] proposed a method for automatic depression-level
analysis using the audio and visual features. Two methods are compared. Feature
Dynamic History Histogram (FDHH) algorithm is a fusion technique to produce
dynamic feature vector. Motion history histogram (MHH) is used to get the features
of visual data and then fuse it with audio data. PLS regression and LR techniques
have been used for determining the correlation between the feature space and for
the depression scale. On comparison FDHH was better with less MAE and RMSE
values.
Sandeep Nallan Chakravarthula et al. [3] proposed the suicidal risk prediction
among the military couples using the conversation among the couples. It used
features like acoustic, lexical, and behavioral aspects from the couple’s conversation
that was collected from a total of 62 couples, having a total of 124 people.
The model was used to check three scenarios: none, ideation, and attempt. The
recall% of the proposed system was 13–20% better than chance. Principal
Component Analysis (PCA) was done to get only the important features. Support
Vector Machine was used as a classifier for the determination of risk prediction.
Nathan L. Henderson et al. [6] proposed a method for affect detection for game-
based learning environment. The posture data and Q-sensors were used to get the
electrodermal activity data. The data was collected from a total of 119 subjects who
were involved in the TC3Sim training. Two types of fusion techniques were tested:
feature-level and decision-level fusion. Based on these data, classifiers like Support
Vector Machine and neural network were used to determine the student’s affective
states. The results were compared using Kappa score taking only EDA data, only
posture data, and then multimodal data. The classifier performance was improved
when the EDA data was combined with posture data.
Papakostas M et al. [21] proposed a method for understanding and categorization
of driving distraction. It made use of visual and physiological information. The data
was collected from 45 subjects who were exposed to four different distractions
(three cognitive and one physical). Both early fusion and late fusion were tested.
It was tested for two class (mental and physical distraction) and four class (text,
cognitive task, listening to radio, GPS interaction). The two class and four class
test results were compared for visual features alone, physiological features alone,
and early fusion and late fusion. Classifiers like Random Forest (RF) with about
100 decision trees, Gradient Boosting classifier, and SVM classifier with linear and
RBF kernel were used for determining the driver distraction. In both the two-class and four-class settings, the performance using visual features alone was comparatively low (around 15% in F1 score), and thus visual features cannot be used in stand-alone mode.
Anand Ramakrishnan et al. [23] proposed a method for automatic classroom
observation. It used the audio features, face image of both the teacher and students
from the UVA toddler dataset, and MET dataset. It is used to determine the
Classroom Assessment Scoring System's (CLASS) positive and negative climate aspects. A ResNet classifier was used, and Pearson and Spearman correlations were used for evaluation. Using ResNet, the correlation values were 0.55 and 0.63 for the positive and negative aspects. The Pearson correlation resulted in correlation values of 0.36 and 0.41 on the positive and negative aspects, respectively, for the UVA dataset. On the MET dataset, the Spearman correlation was compared with the Pearson correlation, and the Spearman correlation values were better for both positive and negative aspects (0.48 and 0.53). Dongmin Shin
et al. [24] developed an emotion recognition system using EEG and ECG signals. It
recognized six types of feelings: amusement, fear, sadness, joy, anger, and disgust.
The noise was removed from the signals to create the data table. The classifier used
is the Bayesian network (BN) classifier. Also, BN classifier was compared with
MLP and SVM. The accuracy of all three classifiers was evaluated for EEG signals alone and for the EEG signal combined with the ECG signal. It was seen that the multimodal BN result had the highest accuracy of 98.56%, which was a 35.78% increase in accuracy.

5.2 Deep Learning-Based Techniques

Michal Muszynski et al. [19] proposed a method for recognizing the emotions
that were induced when watching movies. Audiovisual features, lexical features,
physiological reactions like galvanic skin response (GSR), and ACCeleration
signals (ACC) were used from the LIRIS-ACCEDE dataset. In order to determine
the emotion from the multimodal signals, LSTM, DBN, and SVR models were
compared against each other for arousal and valence on the basis of MSE, Pearson correlation coefficient (CC), and concordance correlation coefficient (CCC) for both unimodal and multimodal settings. LSTM outperformed SVR and DBN with MSE (A – 0.260, V – 0.070), CC (A – 0.251, V – 0.266), and CCC (A – 0.111, V – 0.143),
where A stands for arousal and V for valence. Joaquim Comas et al. [4] proposed
a method for emotion recognition using the facial and physiological signals like
ECG, EEG, and GSR from the AMIGOS dataset. Deep learning techniques like
CNN (Convolution Neural Network) is used for emotion recognition. BMMN (bio
multi model network) is used to estimate the affect state using the features extracted
using the Bio Auto encoder (BAE). Three networks are tested: BMMN that uses the
features directly, BMMN-BAE1 that uses only the latent features extracted using
BAE, and BMMN-BAE2 that used latent features along with the essential features.
The BMMN-BAE2 model outperformed all the other models with an arousal accuracy of 87.53% and a valence accuracy of 65.05%.
Jiaxin Ma et al. [15] proposed an emotion recognition system using the EEG
signals and physiological signals of the DEAP dataset. The dataset was compared
for deep LSTM network, residual LSTM network, and Multimodal Residual
LSTM (MM-ResLSTM) network for both arousal and valence emotions. The MM-
ResLSTM outperformed the other two methods with an accuracy of 92.87% for
arousal and 92.30% for valence. Also, the proposed method was tested against some state-of-the-art methods like SVM, MESAE, KNN, LSTM, BDAE, and DCCA. Among all the methods, MM-ResLSTM had the best accuracy. Panagiotis Tzirakis
et al. [26] proposed a method for emotion recognition using the audio and video
features of RECOLA dataset. The audio and video features were extracted using
the ResNet. These extracted features were used for emotion recognition using the
LSTM network. The proposed model was compared against few other state-of-
the-art methods like the Output-Associative Relevance Vector Machine Staircase
Regression (OA RVM-SR) and strength modeling system proposed by Han et al. for
both arousal and valence prediction. The proposed method outperformed all other
methods with an accuracy of 78.9% for arousal and 69.1% for valence using raw
signals of both audio and video and 78.8% for arousal and 73.2% for valence using
raw signals of audio and raw and geometric signals of video.
Seunghyun Yoon et al. [28] proposed a multimodal speech emotion recognition
system using the text and audio features from IEMOCAP dataset. This model is
used to identify four emotions like happy, sad, angry, and neutral. The Multimodal
Dual Recurrent Encoder (MDRE) containing two RNNs is used for the prediction
of speech emotions. The proposed model is compared against the Audio Recurrent
Encoder (ARE) and Text Recurrent Encoder (TRE) using the weighted average
precision (WAP) score. It was found that the MDRE model had a better WAP value of 0.718, with accuracy ranging from 68.8% to 71.8%. Trisha Mittal et al. [17]
proposed a multiplicative multimodal emotion recognition (M3ER) system that uses
facial, text, and speech features. This was done using the IEMOCAP and CMU-
MOSEI dataset. Deep learning models were used for feature extraction to remove
the inefficient signals, and finally, an LSTM is used for emotion classification. The
results of M3ER were compared with the existing SOTA methods using the F1 score and Mean Accuracy (MA) score. Applying the modality check on ineffective modalities of the dataset gave an increase of 2–5% in F1 and 4–5% in MA, and when the dataset additionally underwent a proxy feature regeneration step, it led to a further increase of 2–7% in F1 and 5–7% in MA for the M3ER model, which was better than the SOTA models.
Siddharth et al. [11] proposed a multimodal affective computing using the EEG,
ECG, GSR, and frontal videos of the subject from the AMIGOS dataset. The features
are extracted using the CNN-VGG network. The Extreme Learning Machine (ELM)
along with 10-fold cross-validation and sigmoid function were used to train the
emotions like arousal, valence, liking, and dominance at a scale of 1–9. The
features are tested for emotion classification individually and also as multimodal.
By combining EEG and frontal videos, the accuracy was 52.51%, which is better
than the accuracy obtained individually for the features. By combining the features
like GSR and ECG, the accuracy was 38.28%. Wei-Long Zheng et al. [30] proposed
a model for emotion recognition using EEG signals and eye movements. The
data was collected from 44 subjects. Bimodal deep auto-encoder (BDAE) was
used to extract the shared features of both EEG and eye movements. Restricted Boltzmann machines (RBMs) were used, one for EEG and another for eye movements, to extract the features. Finally, an SVM was used as a classifier to perform the emotion classification. The model was tested for accuracy on individual features and on the multimodal combination. The multimodal combination had an accuracy of 85.11%, which was better compared to EEG signals alone (70.33%) and eye movements alone (67.82%).
Shiqing Zhang et al. [29] proposed a method for emotion recognition using audio
and visual features. The model was tested on RML database, the acted eNTER-
FACE05 database, and the spontaneous BAUM-1s database. The CNN and 3D-CNN
are used to capture the audio and video features, respectively. The result from these
networks is fed to a DBN along with a fusion network to produce the fused features.
Linear SVM is used as a classifier for emotion classification. The proposed model combines the CNN features with a DBN to build the fusion network. The model is tested and compared with unimodal features and different fusion methods like feature level, score level, and FC for all three datasets. Among all, the proposed method with DBN achieved the highest accuracy for all three datasets. The accuracies were RML database (80.36%), eNTERFACE05 database (85.97%), and BAUM-1s (54.57%). Eesung Kim et al. [12] proposed a method for emotion
recognition using acoustic and lexical features of IEMOCAP dataset. The emotion
recognition was compared using the weighted average recall (WAR) and UAR. The
deep neural network (DNN) is used for feature extraction and also as a classifier.
The proposed model was compared with the results obtained using only lexical
features and also few state-of-the-art methods like LLD+MMFCC+BOWLexicon,
LLD+BOWCepstral+GSVmean+BOW+eVector, LLD+mLRF, and Hierarchical
Attention Fusion Model. Of all the models, the proposed model had the highest WAR and UAR values of 66.6 and 68.7, respectively. Huang Jian et al. [7] proposed a
model for emotion recognition using the visual, acoustic, and textual features. These
features are used from AVEC 2018 dataset. The features are extracted and fused
using both feature-level and decision-level fusion and compared. LSTM-RNN is
used to train the model with the features extracted and emotion classification is
performed. The comparison is done for unimodal (using only visual, only audio,
and only textual features), but the multimodal features had better prediction of emotions like arousal, valence, and liking. The German part of the dataset showed good performance with the proposed multimodal approach, with values of 0.599–0.524 for arousal, 0.721–0.577
for valence, and 0.314–0.060 for liking. For the Hungarian part, the performance
was good for textual features. Qureshi et al. [22] proposed a method for estimation
of depression level in an individual using multimodality like acoustic, visual, and
textual features. The features were extracted from DAIC-WOZ depression dataset.
An attention-based fusion network is used, and deep neural network (DNN) is
used for classification of depression in PHQ-8 score scale. RMSE, MAE, and
accuracy were used to test the Depression Level Regression (DLR) and Depression
Level Classification (DLC). To test the multimodality, two models based on single-task representation learning (ST-DLR-CombAtt and ST-DLC-CombAtt) and two others based on multitask representation learning (MT-DLR-CombAtt and MT-DLC-CombAtt) were used. The multimodal approach had a better classification accuracy of 60.61%.
Luntian Mou et al. [18] proposed a model for determining the driver stress
level using the eye features and vehicle and environmental data. The data were
collected from a total of 22 subjects. The stress level was classified into three
classes: low, medium, and high. An attention-based CNN-LSTM network is used as
a classifier. The proposed model is compared with other state-of-the-art multimodal
methods where the handcrafted features are used and also some unimodal method.
It was seen that the attention-based CNN-LSTM network outperformed all the other
state-of-the-art methods with an accuracy of 95.5%. Panagiotis Tzirakis et al. [26]
proposed affect computing using the text, audio, and video features from the SEWA
dataset. The Concordance Correlation Coefficient (ρc) was used to determine the agreement level between the predictions and the annotations, combining the correlation coefficient with their mean square difference. On the SEWA dataset, the proposed model was evaluated with single features alone and with different fusion strategies like concatenation, hierarchical attention, self-attention, residual self-attention, cross-modal self-attention, and cross-modal hierarchical self-attention. The two emotions tested for are arousal and valence. The model was also compared with a few state-of-the-art methods, and it was seen that it outperformed them for the textual, visual, and multimodal settings.

6 Discussion

The advancement in human-computer interaction has led to a shift from unimodal to multimodal analysis for affective computing. Also, using more modalities for affect detection can be more appropriate than relying on only a single feature. Earlier, only still images were used for affective computing. Nowadays,
the advancement in technology has led to the usage of audio and video formats for
affect detection. From the above study, it can be seen that a number of datasets are
available for audio, video, and textual data. But very few datasets are available for
biological signals. Most of the biological signals are obtained by getting the data
directly from the subjects based on the experiment that needs to be performed. In
multimodal affective computing, fusion techniques play a major role. A number of fusion techniques have been discussed in the above sections. Since more than one modality is used in multimodal affective computing, fusion techniques are applied to those modalities. The fusion technique is chosen based on the dataset and the model that is selected. The most commonly used fusion technique is feature-level fusion.
One problem with the publicly available datasets is that they often contain only posed or acted expressions. Choosing an appropriate dataset is a challenging task. In many cases, more naturalistic data are used. The features extracted from
these data may be numerous, and hence, feature reduction is essential. Only
nonredundant and relevant data are required for further processing and to increase
the speed and processing of the affect computation algorithm. Also, an appropriate
classification algorithm needs to be selected based on the dataset.
Initially, a number of machine learning techniques had been applied for affective
computing. But the advancement in AI has led to the usage of deep learning
techniques for affective computing. From our literature survey, it is clear that
most of the studies that involved emotion recognition used modalities like audio,
video, or textual data. But for studies that were application oriented like stress-
level detection, fear-level detection, education sector, or so on, more of biological
signals were used. This is because the physiological response of the human body
helps in determining these kinds of expressions much better rather than using only
audio, video, and textual data. The physiological responses were collected using
the sensors. For affective computation, some methods had used features that were
manually extracted, whereas some studies have used deep learning techniques for
feature extraction.
As the survey demonstrates, there are various research challenges in multimodal affective computing. One important direction would be to focus on application-oriented studies that could be helpful in real-world applications. Manually extracted features and features extracted by deep learning can be compared to determine which gives the better result in affect computation. Another aspect of future work would be the usage of biological signals for these kinds of affect computation. These biological signals reveal more about a person and hence can be especially helpful in the medical field. If these biological signals are used together with other modalities, it would be a major advance in many medical research fields.
In many cases, machine learning- and deep learning-based multimodal affective computing has been used in a general sense for emotion recognition. Very little research has focused on specific areas like detection of depression level in humans, detection of fear level, and more specifically a particular kind of fear (e.g., aquaphobia, etc.). Multimodal affective computing can be used in applications like the education sector for determining a student's level of understanding during online learning. It can also be used in the medical field to determine fear and depression levels, which can indicate whether a person is prone to certain medical ailments. It can be used to aid people with autism through technologies supporting communication development. Multimodal affective computing also finds application in music players that play songs based on mood, or in determining the emotions of a person watching an advertisement on TV.

7 Conclusion

This chapter gives a brief overview of affective computing and how emotions are recognized. A brief introduction to unimodal and multimodal affective computing has been provided. The available datasets, their modalities, and the emotions covered in each dataset have also been explained. In addition, the various features used for affect recognition and the fusion techniques are elaborated. The machine learning and deep learning techniques for affect recognition are explained, along with a discussion of what features were used for affect recognition and against what other techniques each proposed methodology was compared. Also, a few challenges in this research field have been identified: to use real-time datasets for the study, to investigate more before capturing the data for a study, to better understand model selection, and to extend the research to application-oriented studies.

References

1. Ayata, D., Yaslan, Y., & Kamasak, M. E. (2018). Emotion based music recommendation system
using wearable physiological sensors. IEEE Transactions on Consumer Electronics, 64(2),
196–203.
2. Bălan, O., Moise, G., Moldoveanu, A., Leordeanu, M., & Moldoveanu, F. (2020). An
investigation of various machine and deep learning techniques applied in automatic fear level
detection and acrophobia virtual therapy. Sensors, 20(2), 496.
3. Chakravarthula, S. N., Nasir, M., Tseng, S. Y., Li, H., Park, T. J., Baucom, B., et al. (2020,
May). Automatic prediction of suicidal risk in military couples using multimodal interaction
cues from couples conversations. In ICASSP 2020–2020 IEEE international conference on
Acoustics, Speech and Signal Processing (ICASSP) (pp. 6539–6543). IEEE.
4. Comas, J., Aspandi, D., & Binefa, X. (2020, November). End-to-end facial and physiological
model for affective computing and applications. In 2020 15th IEEE international conference
on Automatic Face and Gesture Recognition (FG 2020) (pp. 93–100). IEEE.
5. Garcia-Garcia, J. M., Penichet, V. M., Lozano, M. D., Garrido, J. E., & Law, E. L. C.
(2018). Multimodal affective computing to enhance the user experience of educational software
applications. Mobile Inf Syst, 2018.
6. Henderson, N. L., Rowe, J. P., Mott, B. W., & Lester, J. C. (2019). Sensor-based data fusion for
multimodal affect detection in game-based learning environments. In EDM (workshops) (pp.
44–50).
7. Huang, J., Li, Y., Tao, J., Lian, Z., Niu, M., & Yang, M. (2018, October). Multimodal
continuous emotion recognition with data augmentation using recurrent neural networks. In
Proceedings of the 2018 on audio/visual emotion challenge and workshop (pp. 57–64).
8. Huang, Y., Yang, J., Liao, P., & Pan, J. (2017). Fusion of facial expressions and EEG for
multimodal emotion recognition. Computational Intelligence and Neuroscience, 2017, 1.
9. Jan, A., Meng, H., Gaus, Y. F. B. A., & Zhang, F. (2017). Artificial intelligent system for
automatic depression level analysis through visual and vocal expressions. IEEE Transactions
on Cognitive and Developmental Systems, 10(3), 668–680.
10. Jang, E. H., Byun, S., Park, M. S., & Sohn, J. H. (2020). Predicting individuals’ experienced
fear from multimodal physiological responses to a fear-inducing stimulus. Advances in
Cognitive Psychology, 16(4), 291.
11. Jung, T. P., & Sejnowski, T. J. (2018, July). Multi-modal approach for affective computing. In
2018 40th annual international conference of the IEEE Engineering in Medicine and Biology
Society (EMBC) (pp. 291–294). IEEE.
12. Kim, E., & Shin, J. W. (2019, May). Dnn-based emotion recognition based on bottleneck
acoustic features and lexical features. In ICASSP 2019-2019 IEEE international conference
on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6720–6724). IEEE.
13. Kim, S. K., & Kang, H. B. (2018). An analysis of fear of crime using multimodal measurement.
Biomedical Signal Processing and Control, 41, 186–197.
14. Li, Y., Tao, J., Schuller, B., Shan, S., Jiang, D., & Jia, J. (2018, May). Mec 2017: Multimodal
emotion recognition challenge. In 2018 first Asian conference on Affective Computing and
Intelligent Interaction (ACII Asia) (pp. 1–5). IEEE.
15. Ma, J., Tang, H., Zheng, W. L., & Lu, B. L. (2019, October). Emotion recognition using
multimodal residual LSTM network. In Proceedings of the 27th ACM International conference
on multimedia (pp. 176–183).
16. Marín-Morales, J., Higuera-Trujillo, J. L., Greco, A., Guixeres, J., Llinares, C., Scilingo, E.
P., et al. (2018). Affective computing in virtual reality: Emotion recognition from brain and
heartbeat dynamics using wearable sensors. Scientific Reports, 8(1), 1–15.
17. Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., & Manocha, D. (2020, April). M3er:
Multiplicative multimodal emotion recognition using facial, textual, and speech cues. In
Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 02, pp. 1359–1367).
18. Mou, L., Zhou, C., Zhao, P., Nakisa, B., Rastgoo, M. N., Jain, R., & Gao, W. (2021). Driver
stress detection via multimodal fusion using attention-based CNN-LSTM. Expert Systems with
Applications, 173, 114693.
19. Muszynski, M., Tian, L., Lai, C., Moore, J., Kostoulas, T., Lombardo, P., et al. (2019).
Recognizing induced emotions of movie audiences from multimodal information. IEEE
Transactions on Affective Computing, 12, 36–52.
20. Nemati, S., Rohani, R., Basiri, M. E., Abdar, M., Yen, N. Y., & Makarenkov, V. (2019). A
hybrid latent space data fusion method for multimodal emotion recognition. IEEE Access, 7,
172948–172964.
21. Papakostas, M., Riani, K., Gasiorowski, A. B., Sun, Y., Abouelenien, M., Mihalcea, R.,
& Burzo, M. (2021, April). Understanding driving distractions: A multimodal analysis on
distraction characterization. In 26th international conference on Intelligent User Interfaces
(pp. 377–386).
22. Qureshi, S. A., Saha, S., Hasanuzzaman, M., & Dias, G. (2019). Multitask representation
learning for multimodal estimation of depression level. IEEE Intelligent Systems, 34(5), 45–
52.
23. Ramakrishnan, A., Zylich, B., Ottmar, E., LoCasale-Crouch, J., & Whitehill, J. (2021). Toward
automated classroom observation: Multimodal machine learning to estimate class positive
climate and negative climate. IEEE Transactions on Affective Computing.
24. Shin, D., Shin, D., & Shin, D. (2017). Development of emotion recognition interface using
complex EEG/ECG bio-signal for interactive contents. Multimedia Tools and Applications,
76(9), 11449–11470.
25. Tzirakis, P., Chen, J., Zafeiriou, S., & Schuller, B. (2021). End-to-end multimodal affect
recognition in real-world environments. Information Fusion, 68, 46–53.
26. Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., & Zafeiriou, S. (2017). End-to-
end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected
Topics in Signal Processing, 11(8), 1301–1309.
27. Wang, C. H., & Lin, H. C. K. (2018). Emotional design tutoring system based on multimodal
affective computing techniques. International Journal of Distance Education Technologies
(IJDET), 16(1), 103–117.
28. Yoon, S., Byun, S., & Jung, K. (2018, December). Multimodal speech emotion recognition
using audio and text. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 112–
118). IEEE.
29. Zhang, S., Zhang, S., Huang, T., Gao, W., & Tian, Q. (2017). Learning affective features with
a hybrid deep model for audio–visual emotion recognition. IEEE Transactions on Circuits and
Systems for Video Technology, 28(10), 3030–3043.
30. Zheng, W. L., Liu, W., Lu, Y., Lu, B. L., & Cichocki, A. (2018). Emotionmeter: A multimodal
framework for recognizing human emotions. IEEE transactions on cybernetics, 49(3), 1110–
1122.
Content-Based Image Retrieval Using
Deep Features and Hamming Distance

R. T. Akash Guna and O. K. Sikha

1 Introduction

Rapid evolution of smart devices and social media applications has resulted in a large volume of visual data. The exponential growth of visual content demands models that can effectively index and retrieve relevant information according to the user's requirement. Image retrieval is broadly used in Web services to search for similar images. In its early days, text-based queries were used to retrieve images [6], which required large-scale manual annotation. In this context, content-based image retrieval systems have gained popularity as a hot research topic since the 1990s. Content-based image retrieval systems use visual features of an image to
retrieve similar images from the database. Most of the state-of-the-art CBIR models
extract low-level feature representations such as color descriptors [1, 2], shape
descriptors [3, 4], and texture descriptors [5, 8] for image retrieval. Usage of low-
level image features causes CBIR to be a heuristic technique. The major drawback
of classical CBIR models is the semantic gap between the feature representation
and user’s retrieval concept. The low-level semantics fails to reduce the semantic
gap when the image database grows large. Also, similarities among different classes
increase as the dimensionality of the database increases. Figure 1 further illustrates
the shortcomings of the classical CBIR models, which use low-level image features.
In Fig. 1, both a and b share similar texture and color although they belong to two different classes (Africa and beach). Figure 1c, d have different color and texture even though they belong to the same class (mountains). Figure 1e, f shows images from
the same class (Africa) with different shape, texture, and color.

R. T. Akash Guna · O. K. Sikha


Department of Computer Science and Engineering, Amrita School of Engineering, Coimbatore,
India
Amrita Vishwa Vidyapeetham, Coimbatore, India


Fig. 1 Illustration of failure cases of classical features on CBIR

Fig. 2 Illustration of a general CBIR architecture [21]

1.1 Content-Based Image Retrieval: Review

Content-based image retrieval systems use visual semantics of an image to retrieve similar images from large databases. The basic architecture of a CBIR system is
shown in Fig. 2. In general, feature descriptors are used to extract significant features
from the image data. Whenever a query image comes, the same set of features are
extracted and are compared with the feature vectors stored in the database. Similar
images are retrieved based on the similarity measures like Euclidean distance as
depicted in Fig. 2.
Deep learning has been widely researched for constructing CBIR systems. Wan
Ji et al. [11] in 2014 introduced Convolutional Neural Networks (CNN) to form
feature representations for content-based image retrieval.
Usage of CNNs for CBIR achieved greater accuracy than classical CBIR
systems. Babenko [23] in 2014 transfer learned state-of-the-art CNN models trained for ImageNet classification for feature representation. Transfer learning significantly
improved the accuracy of state-of-the-art CBIR systems since the base model is
heavily trained on a large volume of image data. Lin Kevin et al. [22] generated binary hash codes using CNNs for fast image retrieval. This method was highly scalable to increases in the dataset size. Putzu [25] in 2020 introduced a CBIR system using a relevance feedback mechanism where the users are expected to give their feedback on misclassifications with respect to the retrieval results. Based on the user-level feedback, the CBIR model alters its parameters or similarity measures to get better accuracy. The major drawback of relevance feedback-based CBIR models is that the accuracy of those systems purely depends on the feedback provided by the user. If the user fails to give proper feedback, then the system may fail. Some common applications of CBIR are discussed in [32–36].
The primary objective of this chapter is to investigate the effectiveness of high-
level semantic features computed by deep learning models for image retrieval. The
major contributions of this chapter are follows:
• Transfer learned deep features are used as high-level image representation for
CBIR.
• Applicability of Hamming distance as a distance metric for deep feature vectors
is explored.
• Clustering the dataset before retrieval to enable faster retrieval is experimented with.
The organization of the chapter is as follows. Section 2 describes the background of CNN. The proposed model is detailed in Sect. 3. Sections 4 and 5 detail the dataset used for experimentation and the results obtained, respectively. Finally, the chapter concludes with Sect. 6.

2 Background: Basics of CNN

Convolutional Neural Networks (CNNs) are deep learning networks that detect
visual patterns present in the input images. Fukushima introduced the concept of Convolutional Neural Networks (CNNs) in 1980 with a model initially named the “Neocognitron” [7], since it resembled the working of cells in the visual cortex. Figure 3 shows the
basic architecture of a simple Convolutional Neural Network (CNN). Deep neural
networks are capable of computing high-level features that can distinguish objects
more precisely than classical features. Since CNN “doesn’t need a teacher,” it
automatically finds features suitable to distinguish and retrieve the images. A basic
CNN can have six layers as shown in Fig. 3: input layer, convolutional layer, RELU
layer, pooling layer, dense layer, and output classification layer.
1. Input layer: This layer holds the input raw image data for processing.
2. Convolution layer: It extracts features from the input image by convolving with
filters of various size. Hyper parameters like stride (number of pixels that a
kernel/filter can skip) can be tuned to get better accuracy.

Fig. 3 Simple Convolutional Neural Network [10] used for classification

3. ReLU layer: Rectified Linear Unit layer acts as a thresholding layer that converts
any value less than zero as 0.
4. Pooling layer: It reduces the feature map dimension to avoid the possibility of
overfitting.
5. Dense layer: Feature map obtained from the pooling layer is flattened into a
vector form, which is then fed into the final classification layer.
6. Output/classification layer: Predicts the final class of the input image. Here, the number of neurons is equal to the number of classes.
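A minimal Keras sketch of such a six-layer CNN is given below. The image size, filter count, and number of classes are illustrative assumptions rather than values taken from Fig. 3.

from tensorflow.keras import layers, models

num_classes = 10
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),                 # input layer (raw image)
    layers.Conv2D(32, kernel_size=3, strides=1),       # convolution layer
    layers.ReLU(),                                     # ReLU thresholding layer
    layers.MaxPooling2D(pool_size=2),                  # pooling layer
    layers.Flatten(),
    layers.Dense(128, activation="relu"),              # dense layer
    layers.Dense(num_classes, activation="softmax"),   # output/classification layer
])
model.summary()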

3 Proposed Model

This section describes the proposed CNN-based model for high-level feature
extraction and image retrieval in detail. The CNN model being used is described
first followed by the methodology to extract feature vectors from the model, and
then the techniques used to retrieve similar images are described. In this work, a
reduced InceptionV3 Network [9] is chosen as the feature extractor. The generalized
architecture of the proposed CBIR framework is shown in Fig. 4, and the steps
followed are described below:
1. The high-level feature representation for the database images is calculated by
passing it through the pretrained CNN (Inception V3) model.
2. The features are then clustered into N clusters using an adaptive K-means
clustering algorithm.
3. For a query image, the feature vector is calculated by passing it into the pretrained
CNN model.
4. The least distant cluster of the feature vector is found.
5. The least distant images are retrieved from that cluster by using similarity measures.

Fig. 4 Architecture of the proposed model

3.1 Transfer Learning Using Pretrained Weights

The proposed model explored a pretrained inception network for extracting high-
level feature representation for the candidate images. The model is transfer learned
with pretrained weights for ImageNet [24] classification. Transfer learning [14] is
a well-known technique through which a pretrained model for a similar task can
be used to train another model. Transferred weights used for training the model
improves the quality of features captured in a short span of time. The model is
initially trained like a classification model; thus, the addition of transferred weights
to the model gave a surge in the results produced.
The proposed model consists of a chain of ten inception blocks. The input tensor
to an inception block is convoluted in four different paths, and the output of those
paths are concatenated together. This model is chosen owing to its capability to
deduce different features using a single input tensor. An inception block has four
different paths. Path 1 consists of three convolutional blocks, path 2 consists of
two convolutional blocks, path 3 consists of an Average Pooling Layer followed
by a convolutional block, and path 4 consists of a single convolutional block. A
convolutional block has a convolutional layer, batch normalization layer [12], and
a ReLU [13] activation layer stacked in the above order. The architecture of an
inception block is visualized in Fig. 5. Following the inception blocks, a global max
pooling layer and three deep layers are present. The dimension of the final deep
layer is equal to the number of different classes present in the image database used
for retrieval. The activation of the final layer is a SoftMax activation function that
normalizes the values of the final deep layer to a scale of 0–1.
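A hedged Keras sketch of this setup is shown below, using the standard ImageNet-pretrained InceptionV3 as a stand-in for the reduced ten-block network, followed by global max pooling and three dense layers with a SoftMax output sized to the number of classes. The widths of the first two dense layers are assumptions, since the text does not specify them.

from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

num_classes = 10                                   # e.g., the classes of Wang's dataset
backbone = InceptionV3(weights="imagenet", include_top=False,
                       input_shape=(299, 299, 3))

x = layers.GlobalMaxPooling2D()(backbone.output)
df1 = layers.Dense(1024, activation="relu", name="DF1")(x)     # assumed width
df2 = layers.Dense(512, activation="relu", name="DF2")(df1)    # assumed width
df3 = layers.Dense(num_classes, activation="softmax", name="DF3")(df2)

model = Model(inputs=backbone.input, outputs=df3)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# The model is first trained as a classifier on the retrieval database, as described above.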

3.2 Feature Vector Extraction

Deep layers of inception network were explored as a feature descriptor. The feature
vectors obtained from deep layers 1–3 are denoted as DF1, DF2, and DF3, respectively. Figure 6 depicts the feature extraction from the deep layers of the inception
network. Ji Wan compared the effectiveness of feature vectors extracted from the
deep layers (1–3) of inception network in [1]. Their study concludes that the
penultimate layer (DF2) produced better results compared to DF1 and DF3. Inspired
by the work of Ji Wan, feature vector extracted from DF2 is used in this work.
Figure 8 shows the intermediate results received when calculating deep features
from intermediate layers of Inception Resnet for all classes of Wang’s dataset. The
extracted feature vectors are then stored as csv files.
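A minimal sketch of extracting DF2 from the penultimate dense layer is shown below, assuming a trained Keras model is available; the saved-model file name and the layer name "deep_2" are hypothetical identifiers used only for illustration.

```python
# Build a sub-model whose output is the penultimate dense layer (DF2),
# then run the database images through it and store the features as CSV.
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("cbir_inception.h5")   # hypothetical saved model
feature_extractor = tf.keras.Model(inputs=model.input,
                                   outputs=model.get_layer("deep_2").output)

def extract_features(image_batch):
    """image_batch: preprocessed float array of shape (N, 299, 299, 3)."""
    return feature_extractor.predict(image_batch, verbose=0)

# Example: features = extract_features(database_images)
#          np.savetxt("db_features.csv", features, delimiter=",")
```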

3.3 Clustering

The feature vector obtained from the inception model is then fed into a clustering
module to perform an initial grouping. A K-means [15] clustering algorithm is used
for clustering the extracted features. The objective of introducing clustering is to
reduce the searching time for a query image. K-means clustering is an iterative
algorithm that tries to separate the feature vectors into K nonoverlapping groups
using an expectation–maximization approach. It uses Euclidean distance to measure the
distance between a feature vector and centroid of a cluster and assigns the feature
vector to the cluster that is least distanced from the feature vector. The centroid
of that cluster is then updated with the feature vector added to the group. The
mathematical representation of K-means clustering is given as:

J = \sum_{i=1}^{m} \sum_{j=1}^{K} w_{ij} \left\| x_i - \mu_j \right\|^2 \qquad (1)

where J is the objective function, wij = 1 for feature vector xi if it belongs to cluster
j; otherwise, wij = 0. Also, μj is the centroid of the cluster of xi .
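A minimal sketch of this clustering step with scikit-learn is shown below; the number of clusters N and the feature file name are assumptions (an adaptive scheme would choose N from the data).

```python
# Cluster the stored deep features; KMeans' inertia_ equals the objective J of
# Eq. (1): the sum of squared Euclidean distances to the assigned centroids.
import numpy as np
from sklearn.cluster import KMeans

features = np.loadtxt("db_features.csv", delimiter=",")   # hypothetical file name
N = 10                                                     # assumed cluster count
kmeans = KMeans(n_clusters=N, n_init=10, random_state=0).fit(features)

print("J =", kmeans.inertia_)
print("cluster assignments:", kmeans.labels_[:10])
```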

Fig. 5 Architecture of an inception block present in an Inception V3 Network [9]



Fig. 6 A deep neural network that could be used as a feature extractor by using the intermediate deep layers DF1, DF2, and DF3 following the convolutional layers

3.4 Retrieval Using Distance Metrics

Relevant images from the database are retrieved by calculating the distance between
the input image feature vector and feature vectors stored in the database. This
work compares two distance metrics, Euclidean distance and Hamming distance,
for calculating the similarity.

3.4.1 Euclidean Distance

Euclidean distance [16] represents the shortest distance between two points. It is
given as the square root of the sum of squared differences.


E.D = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2)

where n is the dimension of the feature vectors, and xi and yi are elements of the
feature vectors x and y, respectively.

3.4.2 Hamming Distance

Hamming distance [17] measures the similarity between two feature vectors.
Hamming distance for two feature vectors is the number of positions at which
corresponding characters are different.
H.D = \sum_{i=1}^{n} \sim (x_i = y_i) \qquad (3)

where n is the dimension of the feature vectors, xi and yi are elements of the
feature vectors x and y, respectively, the term (xi = yi) evaluates to 1 if both elements are equal and 0 otherwise, and ∼ denotes negation.
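A minimal sketch of the two similarity measures of Eqs. (2) and (3) is shown below. Since the deep features are real-valued, they are binarized with a threshold before computing the Hamming distance; this binarization step is one plausible choice assumed for illustration, not stated by the authors.

```python
import numpy as np

def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))           # Eq. (2)

def hamming_distance(x, y, threshold=0.0):
    xb = x > threshold                              # binarize sparse deep features (assumption)
    yb = y > threshold
    return int(np.sum(xb != yb))                    # Eq. (3): count of differing positions

x = np.array([0.0, 3.1, 0.0, 1.1, 15.7])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
print(euclidean_distance(x, y), hamming_distance(x, y))
```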

4 Dataset Used

This section describes the datasets used for experimentation. Wang's dataset, a
tailor-made dataset for content-based image analysis, and its larger version, the
COREL-10000 dataset, are chosen for the analysis. Wang's dataset consists of 1000
images divided into 100 images per class, and the classes of Wang's dataset are
African tribe, beach, bus, dinosaur, elephant, flower, food, mountain, Rome, and
horses.
The COREL dataset comprises 10,000 images downloaded from the COREL
photo gallery and is widely used for CBIR applications [18–20]. The dataset
comprises 100 classes with 100 images in each class. Figure 7 shows sample
images from Wang's dataset and the COREL dataset (Fig. 8).

Fig. 7 Sample images from Corel dataset and Wang’s dataset

5 Results and Discussions

The model is experimented with Euclidean distance and Hamming distance as
similarity metrics on both of the datasets. The model is tested to retrieve 40, 50,
60, and 70 images on Wang's dataset and the COREL dataset. Average
precision is used as the performance metric for evaluating the proposed CBIR model.
Average Precision:
Precision is one of the commonly used measures for evaluating image retrieval
algorithms that is defined as:

\text{Precision} = \frac{\text{Similar images retrieved}}{\text{Total images retrieved}}

Since there is more than one category of images, we use average precision.
Average precision is defined as:


\text{Average precision} = \frac{\sum_{k=0}^{n} \text{Precision}[k]}{\text{Number of categories } (n)}
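A short sketch of how these evaluation quantities can be computed is shown below; the per-class precision values in the example are hypothetical.

```python
# Average precision over categories, plus the count of classes above the 0.95
# threshold used in the result tables that follow.
def average_precision(per_class_precision, threshold=0.95):
    n = len(per_class_precision)
    avg = sum(per_class_precision) / n
    above = sum(p > threshold for p in per_class_precision)
    return avg, above

precisions = [1.0, 0.98, 0.92, 0.97, 1.0, 0.96, 0.90, 0.99, 0.94, 0.95]  # hypothetical
print(average_precision(precisions))   # (average precision, #classes > 0.95)
```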

Fig. 8 The features that were computed by the intermediate convolutional layers of the feature
extractor CNN

Number of classes with precision >0.95:


Since the proposed model uses deep features that are sparse in nature, it is necessary
to know the number of categories with a high precision for final retrieval. Hence, a
threshold of 0.95 is set for the retrieval task.

5.1 Retrieval Using Euclidean Distance


5.1.1 Retrieving 40 Images

On retrieving 40 images per class from Wang's dataset using Euclidean distance,
an average precision of 0.946 is obtained. Seven classes (out of 10) were retrieved
with a precision greater than 0.95. When retrieving 40 images from the COREL dataset,
the average precision obtained is 0.961, in which 92 classes have a precision above 0.95.

5.1.2 Retrieving 50 Images

An average precision of 0.944 is achieved when retrieving 50 images per class from
Wang's dataset using Euclidean distance. Seven out of the 10 classes were retrieved
with a precision above 0.95. On the COREL dataset, with an average precision of 0.955,
91 classes had a precision above the threshold of 0.95.

5.1.3 Retrieving 60 Images

Retrieval of 60 images per class from Wang's dataset resulted in an average
precision of 0.94, and 5 of the 10 classes had a precision above 0.95. Retrieval of
60 images from the COREL dataset gave an average precision of 0.952, while 87 of
100 classes had a precision above 0.95 during retrieval.

5.1.4 Retrieving 70 Images

When retrieving 70 images per class from Wang's dataset, Euclidean distance
retrieved images with an average precision of 0.932. The number of classes with
precision greater than 0.95 is 5. When the retrieval is done on the COREL dataset,
the average precision is 0.95, and 84 classes out of 100 had a precision above 0.95.

5.2 Retrieval Using Hamming Distance


5.2.1 Retrieving 40 Images

Retrieval of 40 images per class from Wang's dataset resulted in an average
precision of 0.957, while eight classes were retrieved with a precision above 0.95.
On retrieving the same number of images from the COREL dataset, we obtained
an average precision of 0.946 while retrieving 90 classes with a precision above
0.95.

5.2.2 Retrieving 50 Images

When 50 images per class were retrieved from Wang's dataset using Hamming
distance with an average precision of 0.956, eight classes were retrieved with a
precision above 0.95. When retrieved from the COREL dataset, the average
precision was 0.943, and 90 classes were retrieved with a precision above 0.95.

5.2.3 Retrieving 60 Images

Retrieval of 60 images per class from Wang's dataset resulted in images being
retrieved with an average precision of 0.942, and eight classes had a precision
above 0.95. Retrieving the same number of images from the COREL dataset resulted in
an average precision of 0.941, and 86 of 100 classes had a precision above
0.95 during the retrieval.

5.2.4 Retrieving 70 Images

When retrieving 70 images per class from the Wang’s dataset, Hamming distance
retrieved images with an average precision of 0.926 while 7 classes were retrieved
with a precision of greater than 0.95. Hamming distance produced an average
precision of 0.94 from the COREL dataset while retrieving 85 classes with precision
of above the threshold value of 0.95.
Table 1 summarizes the average precision obtained using Euclidean and Hamming
distance on images from Wang's dataset and the COREL dataset. From the table,
it is evident that the deep features obtained from the proposed model are effective for
image retrieval. The transition from retrieving 40 images to 70 images from Wang's
dataset using Euclidean distance caused the average precision to drop by 1.4%,
while the number of classes with precision above 0.95 reduced from 7 to 5.
During the transition from 40 to 70 images on the COREL dataset, the
precision reduced by 1.1%, while the number of classes above the threshold
reduced from 92 to 84. Hamming distance produced a change of 3.1% on
Wang's dataset, but the number of classes above the threshold reduced only from 8 to 7.

Table 1 Average precision obtained for image retrieval using Euclidean distance and Hamming distance for images from Wang's dataset and COREL dataset

                              Euclidean distance               Hamming distance
Number of images retrieved    40      50      60      70       40      50      60      70
Wang's dataset                0.946   0.944   0.944   0.932    0.957   0.956   0.942   0.926
COREL dataset                 0.961   0.95    0.95    0.95     0.946   0.943   0.941   0.95

Fig. 9 Graphical representation of the decrease in average precision from retrieval of 40–70 images (X axis: number of images retrieved; Y axis: percentage of change of average precision; series: corel euc, corel ham, wangs euc, wangs ham)

On the COREL dataset, the change in precision on the transition from 40 images to 70
images was only 0.6%, while the number of classes reduced from 90 to 85. The
precision for each class is represented graphically in Fig. 10. From the figure, it is
evident that although the precision of Euclidean distance is slightly higher for small
retrieval sizes, it drops faster, which in turn increases the error rate. Figure 9 shows
that Hamming distance has a higher precision than Euclidean distance as the number
of images to be retrieved increases. To further illustrate the effectiveness of the
proposed retrieval model, the retrieval results obtained from Wang's dataset for 10
images using Euclidean distance and Hamming distance are shown in Fig. 11, and
the statistics are tabulated in Table 2.

5.3 Retrieval Analysis Between Euclidean Distance and Hamming Distance

On the retrieval of 10 images from each class, as shown in Fig. 11A, B, both of
the distance metrics showed a 100 percent retrieval precision. Among the retrieved
images, a few images were commonly retrieved by both of the metrics, and a few

Fig. 10 Graphical representation of precision obtained for the COREL dataset (A, B) and for
Wang's dataset (C, D). (a) Graphical representation of precision on retrieval of N images
using (A) Euclidean distance and (B) Hamming distance from the COREL dataset. The X
axis represents the classes of the COREL dataset and the Y axis represents precision. (b) Graphical
representation of precision on retrieval of N images using (C) Euclidean distance and
(D) Hamming distance from Wang's dataset. The X axis represents the classes of Wang's dataset
and the Y axis represents precision

classes had a higher number of images that were similar, and a few classes had minimal
similarity barring the input image. The number of identical images retrieved by the two
metrics from each class is visualized in Fig. 12. These similarities show
that some classes, like beaches and mountains, have certain internal clusters with
unique features that make the retrieval more efficient. Although classes like Rome, flowers,
dinosaurs, and horses were retrieved with a precision of 100%, the number of
identical images retrieved showed that these classes have inseparable images within the

classes. The primary goal of content-based image retrieval is to retrieve the images
most similar in content. When we look at the images retrieved using Hamming
distance and Euclidean distance, we find certain subtle differences between the
retrieved images, and Hamming distance gave higher precision than Euclidean distance as the
number of images increases, as shown in Fig. 9.
Difference in Horse Class:
Consider the retrieval of the horse class in Fig. 11A–H and Fig. 11B–H. The query
image has two horses, a white horse with a brown foal and a brown horse. All of the
images retrieved by Hamming distance contain the same horse
and foal (refer to Fig. 11A–H), but when retrieved using Euclidean distance, images
containing multiple horses and images containing horses of different colors were

Table 2 Retrieval of 10 images using Euclidean distance and Hamming distance

                    Correctly retrieved
Classes of dataset  Euclidean  Hamming
Africa              10         10
Beach               10         10
Mountain            10         10
Bus                 10         10
Horses              10         10
Flowers             10         10
Elephants           10         10
Dinosaurs           10         10
Rome                10         10
Food                10         10

also retrieved (refer to Fig. 11B–H). Figure 13 shows the distribution of subclasses
of the retrieved horse images.
Difference in Flower Class:
The input image provided for retrieval from the flowers class had red petals, as
in Fig. 11A–G, with leaves visible in the background. The images retrieved
using Hamming distance had red or reddish-pink petals in all the retrieved images,
and leaves were visible (refer to Fig. 11A–H). The images retrieved by Euclidean
distance showed a wider variety of colors such as red, reddish-pink, pink, orange,
and yellow. In all retrieved images, leaves were visible, but not to the substantial
extent seen in the input image and in the images retrieved using
Hamming distance (refer to Fig. 11B–H). Figure 14 shows the number of retrieved
images belonging to each subcategory of the flower class.
Difference in Dinosaurs Class:
The dinosaur class has two major subclusters that differ only by orientation. The
dinosaurs in the first cluster face to the left, while the other dinosaurs face to the right.
The input image provided to the retrieval system had a dinosaur oriented toward the
right. One major characteristic of this dinosaur is its long neck. All the
images retrieved by Hamming distance were oriented toward the right, and it can also
be noticed that all the dinosaurs retrieved had long necks (refer to Fig. 11A–G). A
handful of images retrieved by Euclidean distance were either oriented toward
the left or had a shorter neck (refer to Fig. 11B–G). Figure 15 shows the statistics of
the number of images from each subclass of the dinosaurs category.
Difference in Rome Class:
The input image to the Rome class contained an image of the Colosseum. Hamming
distance was able to retrieve only one image of the Colosseum out of the 10 images
retrieved from the Rome category (refer to Fig. 11A–I), whereas Euclidean distance was
able to retrieve a greater number of images of the Colosseum from the Rome class
(refer to Fig. 11B–I). The statistics of the number of images containing the Colosseum
retrieved by Euclidean and Hamming distance are shown in Fig. 16.

Fig. 11 Results on retrieving 10 images using a sample image from each category of Wang's
dataset. Results for each class contain the input image and 10 retrieved images. The classes are
represented in the order: (A) Africa, (B) beach, (C) mountains, (D) bus, (E) dinosaurs, (F) elephants,
(G) flowers, (H) horses, (I) Rome, (J) food

Fig. 12 Same images retrieved by both distance metrics (X axis: classes of Wang's dataset; Y axis: number of same images)



Fig. 13 Comparison of interclass clusters of horses (number of images per subgroup: similar, multiple horses, wrong colour; Hamming vs. Euclidean)

Fig. 14 Comparison of interclass clusters of flowers (number of images per subgroup: red, reddish-pink, pink, orange, yellow; Hamming vs. Euclidean)

5.4 Comparison with State-of-the-Art Models

To further evaluate the effectiveness of deep features for image retrieval, the obtained
results are compared against state-of-the-art CBIR models with classical features
reported in the literature. CBIR models proposed by Lin et al. [27], Irtaza et al. [26],
Wang et al. [28], and Walia et al. [29, 30] are considered for comparison. The CBIR
system based on CNN proposed by Hamreras et al. [31] is also compared with the
proposed model. Table 3 compares the proposed deep feature-based CBIR model
with the state-of-the-art classical feature-based models in terms of precision. From the
table, it can be inferred that deep feature-based image retrieval produced good
results compared to the other models under consideration across all the classes. Table
4 compares the average precision achieved by the proposed model against the
SOA models under consideration. The dinosaurs class was retrieved with an average
precision of 99.1, which was the maximum average precision received among the

Fig. 15 Comparison of interclass clusters of dinosaurs (number of images per subgroup: right+tall, left+tall, right+short, left+short; Hamming vs. Euclidean)

Fig. 16 Clusters in the Rome class (number of images per subgroup: with Colosseum, without Colosseum; Hamming vs. Euclidean)

SOA models. While horses, flowers, and bus classes have an average precision of
81.0, 84.95, and 73.85, Rome, food, elephants, beach, and mountain classes have
an average precision of less than 60%. When compared to the SOA models, our
proposed model produces a greater average precision for all the 10 classes of Wang’s
dataset (Africa, beach, bus, dinosaurs, elephants, horses, food, mountain, flowers,
Rome).
Table 5 compares the recall results of state-of-the-art CBIR models with the
proposed model. From the table, it is evident that the proposed model outperforms
all of the state-of-the-art models, giving perfect results on retrieving 20
images from each class of Wang's dataset. Table 6 presents the average recall
achieved by each class for the SOA models. The dinosaurs class was retrieved with an
average recall of 19.82, which was the maximum average recall received among
the SOA models. While horses, flowers, and bus classes have an average recall of

Table 3 Precision comparison of the proposed CBIR model with state-of-the-art models on retrieving 20 images

Wang's database  Lin et al. [27]  Irtaza et al. [26]  Wang et al. [28]  Walia et al. [29]  Walia et al. [30]  Hamreras et al. [31]  Proposed model
African          55.5             53                  80.5              41.25              73                 93.33                 100
Beach            66               46                  56                71                 39.25              90                    100
Rome             53.5             59                  48                46.75              46.25              96.67                 100
Bus              84               73                  70.5              59.25              82.5               100                   100
Dinosaurs        98.25            99.75               100               99.5               98                 100                   100
Elephants        63.75            51                  53.75             62                 59.25              100                   100
Flowers          88.5             76.75               93                80.5               86                 96.67                 100
Horse            87.25            70.25               89                68.75              89.75              100                   100
Mountain         48.75            62.5                52                69                 41.75              83.83                 100
Food             68.75            70.75               62.25             29.25              53.45              96.83                 100
Average          71.425           66.2                67.2              60.91              66.92              95.73                 100

Table 4 Comparison of average precision of the proposed model against six state-of-the-art models across 10 classes of Wang's dataset

                     Average precision
Classes of dataset   Proposed  SOA     Difference
Africa               100       65.97   34.03
Beach                100       61.37   38.03
Mountain             100       59.64   40.36
Bus                  100       78.20   21.8
Horses               100       84.1    15.9
Flowers              100       86.90   13.1
Elephants            100       64.95   35.05
Dinosaurs            100       99.25   0.75
Rome                 100       58.36   41.64
Food                 100       63.50   36.5

16.2, 16.99, and 14.77, Rome, food, elephants, beach, and mountain classes have an
average recall of less than 12.
When compared to the SOA models, our proposed model produces a greater
average recall for all the 10 classes of Wang’s dataset (Africa, beach, bus, dinosaurs,
elephants, horses, food, mountain, flowers, Rome).

6 Conclusion

This chapter proposed a deep learning-based content-based image retrieval model using
Hamming distance as the similarity metric. While comparing Euclidean distance
and Hamming distance for image retrieval, it is found that Hamming distance
produced less change in precision as the number of images retrieved changes,
allowing more classes to retain a precision above 0.95. On retrieving 10 images
using Euclidean and Hamming distance, it was noticed that some of the images
received were similar. In different images retrieved by Euclidean and Hamming
distance, we found the presence of interclass clusters in classes such as horses,
dinosaurs, Rome, and flowers of the Wang’s dataset. Hamming distance does a
better job in identifying the interclass clusters since Hamming distance retrieved
more images related to the interclass cluster of the given input image. The proposed
model that uses Hamming distance for retrieval when compared to SOA models
produced a significant increase in average precision and average recall. The class-
wise precision and recall were also significantly higher. Some classes that received
low precision and recall using SOA models received a higher precision and recall
when using the proposed model.
We conclude our chapter on the note that Hamming distance performs better
than Euclidean distance when the dataset becomes very large and the number of images
to be retrieved is high. Hamming distance is also capable of identifying interclass
clusters within classes, which is relevant when the number of images in the database is

Table 5 Recall comparison of the proposed CBIR model with state-of-the-art models on retrieving 20 images

Wang's database  Lin et al. [27]  Irtaza et al. [26]  Wang et al. [28]  Walia et al. [29]  Walia et al. [30]  Hamreras et al. [31]  Proposed model
African          11.1             10.6                16.1              8.25               14.6               18.6                  20
Beach            13.2             9.2                 11.2              14.2               7.85               18.0                  20
Rome             10.7             11.8                9.6               9.35               9.25               19.33                 20
Bus              16.8             14.6                14.1              11.85              16.5               20                    20
Dinosaurs        19.65            19.95               20                19.9               19.6               20                    20
Elephants        12.75            10.2                10.75             12.4               11.85              20                    20
Flowers          17.7             15.35               18.6              16.1               17.2               19.33                 20
Horse            17.45            14.05               17.8              13.75              17.95              20                    20
Mountain         9.75             12.5                10.4              13.8               8.35               16.6                  20
Food             13.75            14.15               12.45             5.85               10.69              19.33                 20
Average          14.285           13.24               14.1              12.545             13.384             19.125                20

Table 6 Comparison of average recall of the proposed model against six state-of-the-art models across 10 classes of Wang's dataset

                     Average recall
Classes of dataset   Proposed  SOA      Difference
Africa               20        13.20    6.8
Beach                20        12.275   7.725
Mountain             20        11.90    8.1
Bus                  20        15.64    4.37
Horses               20        16.83    3.62
Flowers              20        17.38    2.62
Elephants            20        12.99    7.01
Dinosaurs            20        19.85    0.15
Rome                 20        11.67    8.33
Food                 20        12.66    7.34

large and diverse within classes. This leads to the retrieval of images that are more
similar in content.

7 Future Works

Future work includes evaluating the effectiveness of the proposed model on microscopic
images and exploring preprocessing techniques to enhance the proposed model when
applied to a medical dataset.

References

1. Pass, G., & Zabih, R. (1996). Histogram refinement for content-based image retrieval. In
Proceedings third IEEE workshop on applications of computer vision. WACV’96 (pp. 96–102).
https://doi.org/10.1109/ACV.1996.572008
2. Konstantinidis, K., Gasteratos, A., & Andreadis, I. (2005). Image retrieval based on fuzzy color
histogram processing. Optics Communications, 248(4–6), 375–386.
3. Jain, A. K., & Vailaya, A. (1996). Image retrieval using color and shape. Pattern Recognition,
29(8), 1233–1244.
4. Folkers, A., & Samet, H. (2002). Content-based image retrieval using Fourier descriptors on a
logo database. In Object recognition supported by user interaction for service robots (Vol. 3).
IEEE.
5. Manjunath, B. S., & Ma, W. Y. (1996). Texture features for browsing and retrieval of image
data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837–842. https://doi.org/10.1109/34.531803
6. Hörster, E., Lienhart, R., & Slaney, M. (2007). Image retrieval on large-scale image databases.
Proceedings of the 6th ACM international conference on Image and video retrieval.
7. Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model
for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets
(pp. 267–285). Springer.

8. Haralick, R. M., Shanmugam, K., & Dinstein, I. H. (1973). Textural features for image
classification. IEEE Transactions on Systems, Man, and Cybernetics, 6, 610–621.
9. Szegedy, C., et al. (2016). Rethinking the inception architecture for computer vision. Proceed-
ings of the IEEE conference on computer vision and pattern recognition.
10. LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition. Proceedings
of the IEEE, 86(11), 2278–2324.
11. Wan, J., et al. (2014). Deep learning for content-based image retrieval: A comprehensive study.
Proceedings of the 22nd ACM international conference on Multimedia.
12. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
13. Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines.
ICML.
14. Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge
and Data Engineering, 22(10), 1345–1359.
15. Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering algorithm.
Pattern Recognition, 36(2), 451–461.
16. Danielsson, P.-E. (1980). Euclidean distance mapping. Computer Graphics and Image Pro-
cessing, 14(3), 227–248.
17. Norouzi, M., Fleet, D. J., & Salakhutdinov, R. R. (2012). Hamming distance metric learning.
In Advances in neural information processing systems.
18. Wang, J. Z., Li, J., & Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive integrated
matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 23(9), 947–963.
19. Tao, D., et al. (2006). Direct kernel biased discriminant analysis: A new content-based image
retrieval relevance feedback algorithm. IEEE Transactions on Multimedia, 8(4), 716–727.
20. Bian, W., & Tao, D. (2009). Biased discriminant Euclidean embedding for content-based image
retrieval. IEEE Transactions on Image Processing, 19(2), 545–554.
21. Haldurai, L., & Vinodhini, V. (2015). Parallel indexing on color and texture feature extraction
using R-tree for content based image retrieval. International Journal of Computer Sciences and
Engineering, 3, 11–15.
22. Lin, K., et al. (2015). Deep learning of binary hash codes for fast image retrieval. Proceedings
of the IEEE conference on computer vision and pattern recognition workshops.
23. Babenko, A., et al. (2014). Neural codes for image retrieval. In European conference on
computer vision. Springer.
24. Chollet, F., et al. (2015). Keras. https://keras.io.
25. Putzu, L., Piras, L., & Giacinto, G. (2020). Convolutional neural networks for relevance
feedback in content based image retrieval. Multimedia Tools and Applications, 79(37), 26995–
27021.
26. Irtaza, A., Jaffar, M. A., Aleisa, E., & Choi, T.-S. (2014). Embedding neural networks for
semantic association in content based image retrieval. Multimedia Tools and Applications,
72(2), 1911–1931.
27. Lin, C.-H., Chen, R.-T., & Chan, Y.-K. (2009). A smart content-based image retrieval system
based on color and texture feature. Image and Vision Computing, 27(6), 658–665.
28. Wang, X.-Y., Yu, Y.-J., & Yang, H.-Y. (2011). An effective image retrieval scheme using color,
texture and shape features. Computer Standards & Interfaces, 33(1), 59–68.
29. Walia, E., & Pal, A. (2014). Fusion framework for effective color image retrieval. Journal of
Visual Communication and Image Representation, 25(6), 1335–1348.
30. Walia, E., Vesal, S., & Pal, A. (2014). An effective and fast hybrid framework for color image
retrieval. Sensing and Imaging, 15(1), 93.
31. Hamreras, S., et al. (2019). Content based image retrieval by convolutional neural networks.
In International work-conference on the interplay between natural and artificial computation.
Springer.
32. Sikha, O. K., & Soman, K. P. (2021). Dynamic Mode Decomposition based salient edge/region
features for content based image retrieval. Multimedia Tools and Applications, 80, 15937.

33. Akshaya, B., Sri, S., Sathish, A., Shobika, K., Karthika, R., & Parameswaran, L. (2019).
Content-based image retrieval using hybrid feature extraction techniques. In Lecture notes in
computational vision and biomechanics (pp. 583–593).
34. Karthika, R., Alias, B., & Parameswaran, L. (2018). Content based image retrieval of remote
sensing images using deep learning with distance measures. Journal of Advanced Research in
Dynamical and Control System, 10(3), 664–674.
35. Divya, M. O., & Vimina, E. R. (2019). Performance analysis of distance metric for con-
tent based image retrieval. International Journal of Engineering and Advanced Technology
(IJEAT), 8(6), 2249.
36. Byju, A. P., Demir, B., & Bruzzone, L. (2020). A progressive content-based image retrieval
in JPEG 2000 compressed remote sensing archives. IEEE Transactions on Geoscience and
Remote Sensing, 58, 5739–5751.
Bioinspired CNN Approach for Diagnosing COVID-19 Using Images of Chest X-Ray

P. Manju Bala, S. Usharani, R. Rajmohan, T. Ananth Kumar, and A. Balachandar

1 Introduction

COVID-19, a novel virus, was revealed in December 2019, at Wuhan, China [1].
This is a member of coronavirus class; however, it is more virulent and hazardous
than the other coronaviruses [2]. Many nations are allowed to administer the
COVID-19 trial to a minor group of participants due to limited diagnostic facilities.
Although there are significant attempts to develop a feasible method for diagnosing COVID-
19, a key stumbling block continues to be the health care available in many
nations. There is also a pressing need to develop a simple and easy way to identify
and diagnose COVID-19. As the percentage of patients afflicted with this virus
grows by the day, physicians are finding it increasingly difficult to complete the
clinical diagnosis in the limited time available [3]. One of most significant areas
of study is clinical image processing, which provides identification and prediction
model for a number of diseases, including the MERS coronavirus and COVID-19,
concerning many others. Imaging techniques have increasingly gained prominence
and effort. As a result, interpreting these images needs knowledge and numerous
methods to improve, simplify, and provide a proper treatment [4]. Numerous efforts
have been made to use computer vision and artificial intelligence techniques to
establish an efficient and quick technique to detect infected patients earlier on.
For example, digital image processing with supervised learning technique has been
developed for COVID-19 identification by fundamental genetic fingerprints used for
quick virus categorization [5]. Using a deep learning method, a totally spontaneous
background is created to diagnose COVID-19 as of chest X-ray [6]. The information
was acquired from clinical sites in order to effectively diagnose COVID-19 and

P. M. Bala · S. Usharani · R. Rajmohan · T. A. Kumar · A. Balachandar
Department of Computer Science and Engineering, IFET College of Engineering, Villupuram, Tamilnadu, India
e-mail: tananthkumar@ifet.ac.in


distinguish it from influenza as well as other respiratory illnesses. For diagnosing


COVID-19 in X-ray image of chest, a combined deep neural network architecture
is suggested in [7]. First, the image of chest X-ray contrast was improved, and
the background noise was minimized. The training parameters from two distinct
preconditioning deep neural methods are combined and then utilized to identify and
distinguish between normal and COVID-19-affected individuals. They used a com-
bined heterogeneous deep learning system to develop their abilities based on images
of chest X-ray for pulmonary COVID-19 analysis [8]. A comprehensive analysis
of multiple deep learning methods for automatic COVID-19 detection from CXR
employing CNN, SVM, Naive Bayes, KNN, and Decision Tree, as well as different
neural deep learning structures has been presented [9]. An innovative method to
aid in the detection of COVID-19 for comparison and rating reasons, the multi-
criteria judgment (MCJ) technique, was combined with optimization technique,
while variance was employed to generate the values of factors as well as the SVM
classification has been used for COVID-19 detection [10]. Artificial intelligence and
the IoT [11] to create a model for diagnosing COVID-19 cases in intelligent health
care have been presented. Deep image segmentation, adjustment of preconditioning
deep neural networks (DNN), and edge preparation of a built DNN-based COVID-
19 categorization from Chest X-ray imageries were suggested in [12]. Using chest
X-ray images [13] offered alternative designs of autonomous deep neural network
for the categorization of COVID-19 from ordinary patients. ResNet has the optimum
showing, with a precision of 98.35 percent. The Stacked RNN model [14] was
suggested for the identification of COVID-19 patients using images of chest X-
ray. To mitigate the loss of training samples, this approach employs a variety of
pretrained methods. Based on literature related to the research, we may conclude
that precision and optimal timing continue to be a significant issue for physicians
in minimizing human pain. Traditional artificial intelligence (AI) algorithms have
encountered various issues when utilized on X-ray-based pulmonary COVID-19
detection, including caught-in-state space, tedious noise susceptibility, and ambi-
guity. The constraint of dimensions is the most essential and challenging. When it
comes to characteristic selection, there are generally two methods: (1) The filtering
technique provides ratings to each characteristic based on statistical parameters,
and (2) the induction-based technique relies on a metaheuristic search over all potential groups of
features [15]. Bioinspired Particle Swarm Algorithm (PSA) methods are important
optimization techniques for increasing and enhancing the efficiency of selecting
characteristics [16]. The Chinese ministry declared that COVID-19 identification, whether as
a critical indicator by reverse transcription testing or in hospitalized patients, should be
validated by genetic analysis of lung or blood samples. Because of the present
public health crisis, the limited sensitivity of the primary real-time polymerase chain reaction test
makes it difficult to identify and treat many COVID-19 cases. Furthermore, the
disease is extremely infectious; a larger population is at danger of sickness. Rather
than waiting for positive viral testing, the diagnosis now encompasses all patients
who demonstrate the common COVID-19 lung bacterial meningitis characteristic.
This method allows officials to isolate and treat the patient more quickly. Even if
death doesn’t really happen at COVID-19, many patients recovered with lifelong

lung loss. As per the WHO (World Health Organization), COVID-19 also causes
pores in the chest, similar to MERS, providing them a “hexagonal appearance.”
Some of the ways for controlling pneumonia is digital chest imaging. Machine
learning (ML)-based image analytical techniques for the recognition, measurement,
and monitoring of MERS-CoV (Middle East respiratory syndrome coronavirus)
were created to discriminate among individuals with coronavirus and those who
were not. Deep learning method to autonomously partition all lung and disease
locations using chest radiography is developed. To create an earlier model for
detecting COVID-19, influenza and pneumonia-bacterial meningitis in a healthy
case utilizing image data and in-depth training methodologies are identified. In
the investigation by the authors, they created a deep neural approach based on
COVID-19 radiography alterations of X-ray images that can bring out the visual
features of COVID-19 prior pathologic tests, thus reducing critical time for illness
detection. MERS features like pneumonia can be seen on chest X-ray images
and computer tomography scans, according to the author study. Data mining
approaches to discriminate between MERS and predictable influenza depending on
X-ray pictures were used in the research. The clinical features of 40 COVID-19
participants, indicating that coughing, severe chronic fatigue, and weariness were
common beginning symptoms, has been evaluated. All 40 patients were determined
to have influenza, and the chest X-Ray examination was abnormal. The author
team identified the first signs of actual COVID-19 infection at the Hong Kong
University [17]. The author proposed a statistical methodology to predict the actual
amount of instances discovered in COVID-19 during January 2020. They came to
the conclusion that there were 469 unregistered instances between January 1 and
January 14, 2020. They also stated that the number of instances has increased.
Using information from 555 Chinese people relocated from Wuhan on the 29th
and 31st of January 2020, the author suggested a COVID-19 disease rate predictive
models in China. According to their calculations, the anticipated rate is 9.6 percent,
with a death rate of 0.2 percent to 0.5 percent. Unfortunately, the number of
Asian citizens moved from Wuhan is insufficient to assess illness and death. A
mathematical method to identify the chance of infection for COVID-19 has been
proposed. Furthermore, they estimated that the maximum will be attained after
2 weeks. The prediction of persistent human dissemination of COVID-19 from
48 patients was based on information from Thompson’s (2020) research [18]. The
researchers study created a prototype of the COVID-19 risk of dying calculation.
For two other situations, the percentages are 5.0 percent and 8.3 percent. For the
two situations, the biological number was calculated to be 2.0 and 3.3, respectively.
COVID-19 could cause an outbreak, according to the projections. X-ray imaging is
utilized to check for fracture, bone dislocations, respiratory problems, influenza, and
malignancies in the national health. Computed tomography is a type of advanced X-
ray that evaluates the extremely easy structure of the functioning amount of the body
and provides sharper images of soft inside organs and tissues. CT is faster, better,
more dependable, and less harmful than X-rays. Death can increase if COVID-19
infection is not detected and treated immediately.

To summarize this article, the following contributions are made:


• The CIFAR dataset is used for the normalization procedure.
• A cuckoo-based hash function (CHF) is realized to determine the regions of interest
of the COVID-19 X-ray images. In CHF, we represent the intention to move to a
destination with a probability less than 1 in order to ensure that the total number
of regions to assess remains constant. Additionally, we take an arbitrary number
from the image and assign it to a position.
• The training accuracy and validation accuracy are incorporated and tested.
The rest of the paper is organized as follows: Section II introduces the back-
ground work related to CNN approach in terms of diagnosing COVID-19 using
chest X-ray images. Section III outlines the approaches and tools used for diagnos-
ing COVID-19. Section IV describes cuckoo-based hash function to determine the
regions of X-ray images. Section V discusses the implementation and settings of
the model. Finally, Section VI concludes the accuracy of the proposed COVID-19
disease classification.

2 Related Work

The use of an X-ray image of chest has grown commonplace in recent years. A
chest X-ray is used to evaluate a patient’s respiratory status, including evolution
of the infection and any accident-related wounds. In comparison to CT scan
images, chest X-ray has shown encouraging outcomes in the period of COVID-
19. Moreover, due to the domain's rapid growth, academics have become
less aware of advances across many techniques, and as a result, knowledge of
the different algorithms is waning. As a consequence, artificial neural networks, particle
swarm optimization, firefly algorithm, and evolutionary computing dominate the
research on bioinspired technology. The researchers then investigated and discussed
several techniques relevant to the bioinspired field, making it easier to select the best
match algorithm for each research [17]. Big data can be found in practically every
industry. Furthermore, the researchers of this research emphasize the significance
of using an information technology rather than existing data processing techniques
such as textual data and neural networks [19]. A fuzzy logic learning tree approach
is utilized in this study to improve image storing information performance [20].
The researchers’ purpose is to provide the concept of image recommendations
from friends (IRFF) and a comprehensive methodology for it. The significance
of reproductive argumentative and their numerous requests in the domain of
background subtraction has been said according to the author of this research.
Health care, outbreaks, face recognition, traffic management, image translation,
image analysis, and 3D image production are some of the uses of GAN that
have been discovered [21]. In interacting with radiographic images, state-of-the-
art computing and machine learning have investigated an amount of choices to
make diagnoses. The rapid increase of deep neural networks and their benefits to the

health-care industry has been unstoppable since the 1985. Specular reflection class
activation transfer has been standing up with DNN to conquer over the identification
of COVID-19. Deep learning techniques have been operating interactively to assist
in the analysis of COVID-19. Aside from the time restrictions, deep neural networks
(DNN) are providing confidence in the analysis of COVID-19 utilizing chest X-ray
data, with no negative cases. DNN’s main advantage is that it detects vital properties
without the need for human contact.
Given the present condition, and irrespective of COVID-19 confirmation,
it is critical to diagnose COVID-19 in a timely manner so that diagnosed patients
can be kept free of additional respiratory infection. Image categorization and
information extraction play a significant role in the nature of X-ray of chest and
diagnostic image procedures. Based on autonomous pulmonary classification, a
convolutional deep neural network is needed to retrieve significant information and
partition the pulmonary region more accurately. In this article, an SRGAN+VGG
framework is designed, where a deep neural network named visual geometrical team
infrastructure (VGT16) is being used to recognize the COVID-19 favorable and
unfavorable results from the image of chest X-ray, and a deep learning model is
used to rebuild those chest x-ray images to excellent quality. A convolutional neural
network identified as Disintegrate, Transmit, and Composition was employed for
the categorization of chest X-ray pictures with COVID-19 illness. DNN investigates
the image dataset’s category limitations by a session disintegration methodology to
handle with any irregularities. Multiple preconditioning methods, such as VGT16
and ResNet, have been deployed for categorization of COVID-19 chest X-ray
pictures from a normal image of chest X-ray to one impacted with influenza
using a supervised learning process. Dense convolutional network was employed
in this study to improve the outcomes of COVID-19 illness utilizing the suggested
bioinspired CNN model.

3 Approaches and Tools

3.1 CIFAR Dataset of Chest X-Ray Image

COVID-19 was diagnosed based on two sets of chest X-ray images taken
from two different sources. Joseph Paul Cohen and colleagues [22] were able to
enlarge the COVID-19 chest X-ray data collection by using images from various
publicly available sources. Four hundred ninety-five of them have been identified as
having COVID-19 antibodies. Figure 1 shows the distribution of the dataset, which
contains 950 images as of the time of this writing. COVID-19 findings were
present in 53.3 percent of all the images, while normal-healthy X-ray findings were
present in 46.7 percent of all the images. The standard X-ray images of a healthy chest
were contributed by Paul Mooney, who independently created the collection after
reading an article in the same journal

Fig. 1 Data distribution of images (COVID-19: 54%; normal/regular: 46%)

written by Thomas Kermany and his colleagues [23]. This CIFAR dataset contains
a total of 1341 regular and healthy photos. Roughly one-third of the photos were
chosen at random. It is essential to ensure that an excess of normal-healthy chest
radiographs is not included, because this prevents learning from unbalanced datasets.
If one class of the dataset has many more samples than the other, training favors that
class, limiting the images that can be used. All X-rays fall into one of two classes:
normal or healthy X-rays, and COVID-19 X-rays. Considering gender, there are
346 males who have the disease and 175 females who have it. The results show that
88 of the COVID-19-positive patients diagnosed between the ages of 20 and 40 are
20- to 30-year-olds. The largest group, 175 patients, were 41–61 years old. COVID-19 was
detected in 172 patients who were between the ages of 62 and 82 years old.
Even medical specialists may find the X-ray images challenging to interpret. We
propose that our approach might be able to help them achieve their goals. PyTorch is
utilized for the model's development and implementation. PyTorch is a deep learning
library with tensor computation functionality, and tensor computations are an essential
element in developing deep learning algorithms. It is developed by Facebook's AI
Research Lab. The explosion of interest in this field among researchers has resulted
in the advance of several leading-edge algorithms, including those for NLP and
computer vision, which have been applied throughout the entire field of deep learning.
In using chest X-rays of COVID-19 for diagnosis, one of the most important goals is to
develop a model for image classification, which is the main purpose of using PyTorch here.
Image classification models can be of considerable interest to clinicians, particularly
those who utilize X-ray imaging.

3.2 Image Scaling in Preprocessing

Two types of views have been chosen to distinguish between the lung image scenario
and the infection affecting it: posteroanterior (PA) and anteroposterior (AP). In
the posteroanterior (PA) view, the X-ray of the patient’s chest is taken from the
posterior to anterior ends of the patient’s upper body. When it occurs with chest X-
rays, the term “anteroposterior” refers to the X-ray taken from the patient’s anterior

Fig. 2 Normal image

to posterior coverage. Before the labeling process can begin, the images must first
be scaled to accommodate data augmentation, which takes place before the images
are labeled. It is necessary to scale the image (Fig. 2) before preprocessing
because it will be subjected to several transformations.
Compared to the chest X-ray images of regular patients, which are readily available,
the COVID-19 data is limited. K-Fold Cross-Validation is one of
the methods that can be used to handle skewed datasets, which are defined as
those that have a significant difference in the amount of data per class.
While staging the data throughout the study, all of the images were sampled
in proportion to the data in order to avoid overfitting the results to the datasets.
The following data transformations are applied: data augmentation and
preprocessing.
The next step involves loading the dataset with images that are positive for
COVID-19 and normal in appearance. Data augmentation is the process of creating
novel data from available data using a few simple image manipulation methods,
sometimes referred to as data synthesis. By including augmentation, the model's
generalization is improved, and the risk of overfitting to the training data is
reduced. Using image augmentation, additional information can be added to the
existing dataset without manual effort.
PyTorch's torchvision library can be used to accomplish all of these tasks. In addi-
tion to data transformation and handling, torchvision offers deep learning models
that are already defined and at the field's cutting edge. An image augmentation
technique may include image rotation, image shifting, image flipping, or image
noising, among other things. The image transforms for training and validation are
performed with only a small amount of data, resulting in the creation of additional
data. Preprocessed input images (Fig. 3) are always required for pretrained models.
The datasets are first read into PIL images (Python Imaging Library format), to which
a sequence of transformations is then applied. ToTensor converts a
PIL image of shape (H, W, C) with pixel values in the range [0–255]
to a floating-point FloatTensor of shape (C, H, W) with values in

Fig. 3 Preprocessed image

the range of [0–1]. Images are normalized to the range 0 to 1, using a mean of 0.5
and a standard deviation of 0.5.

\text{Input} = \frac{\text{Input} - \mu}{\text{Standard deviation}} \qquad (1)

\text{Input} = \frac{\text{Input} - 0.5}{0.5}
In this case, μ is the mean, equal to 0.5. The number of channels is denoted by C,
the height by H, and the width by W; H and W must be at least 224. Normalization
values were calculated using the mean and standard deviation, with the mean being
[0.484, 0.455, 0.405] and the standard deviation being [0.228, 0.225, 0.226],
respectively, for the data. In this case, the CIFAR dataset is used for the normalization
procedure. The CIFAR dataset is a group of images that is commonly used to train deep
learning algorithms. It is a widely used dataset for image classification and is also used
by scientists with different algorithms. Imaging networks are used in many applications,
and CIFAR is a well-known dataset for machine learning and computer vision algorithms,
such as image recognition. In total, the collection has more than 1.2 million images in
10,000 different categories, which can be searched by keyword. On the other hand, the
data in this dataset must be loaded onto high-end hardware; a CPU alone cannot handle
datasets of this size and complexity.
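A minimal sketch of the preprocessing and augmentation pipeline described above is given below, assuming PyTorch/torchvision; the specific augmentations chosen (flip, rotation) are illustrative rather than the authors' exact list.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),               # H and W must be at least 224
    transforms.RandomHorizontalFlip(),           # simple augmentation (illustrative)
    transforms.RandomRotation(10),
    transforms.ToTensor(),                       # PIL [0-255] -> FloatTensor [0-1], shape (C, H, W)
    transforms.Normalize(mean=[0.484, 0.455, 0.405],
                         std=[0.228, 0.225, 0.226]),  # values quoted in the text
])

val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.484, 0.455, 0.405], std=[0.228, 0.225, 0.226]),
])
```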

3.3 Training and Validation Steps

During training and validation of the model, the dataset is divided in an 80/20 ratio to
avoid utilizing skewed datasets. For each folder, the images are tagged
using the class name of the folder in which they are located. Additionally, the
DataLoader loads the labeled images into memory along with their class names.
This divides the dataset into two distinct classes: one for regular, healthy X-rays
and the other for COVID-19 X-rays. In either case, data is loaded onto
the GPU (via CUDA) or the CPU before moving on to model definition.
Torchvision is a sub-package that supports deep learning image classification,
object detection, image segmentation, etc. For image processing, Torchvision
offers pretrained, built-in deep learning models.
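The following sketch shows the labeling, 80/20 split, and loading steps under these assumptions; the folder name "chest_xray/" with one sub-folder per class is a hypothetical layout, and in practice the validation subset would use the non-augmented transform.

```python
# ImageFolder derives labels from folder names; random_split gives the 80/20 split.
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = datasets.ImageFolder("chest_xray/", transform=transform)   # hypothetical layout

n_train = int(0.8 * len(dataset))                  # 80/20 split
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # CUDA GPU or CPU
print(dataset.classes)                             # e.g., ['covid', 'normal']
```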

3.4 Deep Learning Model

The CNN model is a form of neural network that allows us to obtain higher-level
representations of image input. Unlike traditional image processing, which requires
the user to specify the feature representations, a CNN takes the originally
captured image, trains on it, and then extracts the characteristics for
improved categorization. The structure of the brain deeply inspires this branch of
machine learning. Signal and image processing modalities such as MRI, CT, and X-ray
are widely used when applying deep learning to images, as described in Fig. 4. Deep
learning models are configured through the CNN model parameters using deep feature
extraction techniques.
The visual system of the human brain inspired CNNs. The goal of CNNs is to
enable computers to see the world in the same way that humans do. Image iden-
tification and interpretation, image segmentation, and natural language processing
can all benefit from CNNs in this fashion [24]. CNNs feature convolutional, max
pooling, and nonlinear activation layers and are a type of deep neural network.
The convolutional layer, which is considered a CNN’s core layer, performs the
“convolution” action that gives the network its name.
Convolutional neural networks are trained in a manner similar to classic machine
learning models. Figure 5 shows that the convolution layers occupy the odd-numbered
positions, while the sharing and subsampling layers occupy the even-numbered positions,
excluding the input and output layers. Figure 6 shows that the CNN has eight different
layers linked to sharing layers through kernels; the group (batch) dimension is 100, and
the model is limited to 100 epochs.
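An illustrative PyTorch sketch loosely following the Fig. 6 description is given below: three kernel-size-3 convolutions with 8, 6, and 4 feature maps, each followed by a stride-2 pooling ("sharing") layer; the classifier head and activation choices are assumptions for illustration, not the authors' exact network.

```python
import torch.nn as nn

class SmallCovidCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 8 feature maps
            nn.Conv2d(8, 6, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 6 feature maps
            nn.Conv2d(6, 4, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 4 feature maps
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, n_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```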

Fig. 4 Classification approach for COVID-19 (COVID-19 lung scan → preprocessing → applying classifiers and fine-tuning CNN features → training/testing → classification → detecting COVID-19)

Fig. 5 Convolution layers of a neural network (input layer x1 … xp, hidden layers, output layer y0 … yq)



Fig. 6 CNN architecture with kernels (image input → convolution, kernel size 3, 8 feature maps → sharing, stride 2 → convolution, kernel size 3, 6 feature maps → sharing, stride 2 → convolution, kernel size 3, 4 feature maps → sharing, stride 2 → COVID-19 output)

4 Cuckoo-Based Hash Function

The cuckoo algorithm is realized to determine the regions of interest in the COVID-
19 X-ray images. Here, the cuckoo-based hash function (CHF) is a contour
metaheuristic process that utilizes constant variance as a method of search. Modeled
on cuckoos searching for the best nest in which to lay eggs, a method of this kind is
investigated to obtain pixel locations. Almost every pixel Pi is a destination
that might be suitable for applying the information gain and can potentially be selected
from the pixel locations that meet the function's criteria. It is assumed that, for the
purposes of medical imaging, the X-ray pixel intensities Pi are probable locations. We
can greatly increase efficiency by placing the initial egg-nesting cuckoos throughout the
whole X-ray spatial domain. In CHF, we represent the intention to move to a destination
with a probability less than 1 in order to ensure that the total number of regions to
assess remains constant. Additionally, we take an arbitrary number from the image
and assign it to a position.
The pixel selection over the X-ray image based on cuckoo hash functions is
modeled as:
P_i^{T_i+1} = P_i^{T_i} + \varphi \cdot \mathrm{Loc}(\rho, \sigma, \tau) \qquad (2)

\rho = \mathrm{Least}\left( \frac{\varphi_{Hu1}(P_i) + \varphi_{Hu2}(P_i)}{2}, \frac{\varphi_{Sa1}(P_i) + \varphi_{Sa2}(P_i)}{2}, \frac{\varphi_{Br1}(P_i) + \varphi_{Br2}(P_i)}{2} \right) \qquad (3)

\sigma = \mathrm{Least}\left( \frac{\omega_{Hu1}(P_i) + \omega_{Hu2}(P_i)}{2}, \frac{\omega_{Sa1}(P_i) + \omega_{Sa2}(P_i)}{2}, \frac{\omega_{Br1}(P_i) + \omega_{Br2}(P_i)}{2} \right) \qquad (4)

\tau = \mathrm{Least}\left( \frac{\vartheta_{Hu1}(P_i) + \vartheta_{Hu2}(P_i)}{2}, \frac{\vartheta_{Sa1}(P_i) + \vartheta_{Sa2}(P_i)}{2}, \frac{\vartheta_{Br1}(P_i) + \vartheta_{Br2}(P_i)}{2} \right) \qquad (5)

where Ti stands for the event time period, P_i^{T_i} represents the chosen pixel location,
\varphi defines the measured normal-variance distance, Loc(\rho, \sigma, \tau) expresses the location
of the current pixel in terms of rows and columns, Hu denotes the hue value of the
pixel location, Sa denotes the saturation value, and Br denotes the brightness value of the pixel location.

\phi_{Hu1} = \frac{Mean(Hu_1(P_i)) - Hu(P_i)}{width}    (6)

\phi_{Hu2} = \frac{Mean(Hu_2(P_i)) - Hu(P_i)}{height}    (7)

\omega_{Hu1} = \frac{Mean(Hu_1(P_i)) \, (\phi_{Hu1}(P_i) - Hu(P_i))}{width}    (8)


\omega_{Hu2} = \frac{Mean(Hu_2(P_i)) \, (\phi_{Hu2}(P_i) - Hu(P_i))}{height}    (9)

\vartheta_{Hu1} = \frac{Mean(Hu_1(P_i)) \, (\omega_{Hu1}(P_i) - Hu(P_i))}{width}    (10)

\vartheta_{Hu2} = \frac{Mean(Hu_2(P_i)) \, (\omega_{Hu2}(P_i) - Hu(P_i))}{height}    (11)

\phi_{Sa1} = \frac{Mean(Sa_1(P_i)) - Sa(P_i)}{width}    (12)

\phi_{Sa2} = \frac{Mean(Sa_2(P_i)) - Sa(P_i)}{height}    (13)

\omega_{Sa1} = \frac{Mean(Sa_1(P_i)) \, (\phi_{Sa1}(P_i) - Sa(P_i))}{width}    (14)

\omega_{Sa2} = \frac{Mean(Sa_2(P_i)) \, (\phi_{Sa2}(P_i) - Sa(P_i))}{height}    (15)

\vartheta_{Sa1} = \frac{Mean(Sa_1(P_i)) \, (\omega_{Sa1}(P_i) - Sa(P_i))}{width}    (16)

\vartheta_{Sa2} = \frac{Mean(Sa_2(P_i)) \, (\omega_{Sa2}(P_i) - Sa(P_i))}{height}    (17)

\phi_{Br1} = \frac{Mean(Br_1(P_i)) - Br(P_i)}{width}    (18)


\phi_{Br2} = \frac{Mean(Br_2(P_i)) - Br(P_i)}{height}    (19)

\omega_{Br1} = \frac{Mean(Br_1(P_i)) \, (\phi_{Br1}(P_i) - Br(P_i))}{width}    (20)

\omega_{Br2} = \frac{Mean(Br_2(P_i)) \, (\phi_{Br2}(P_i) - Br(P_i))}{height}    (21)

\vartheta_{Br1} = \frac{Mean(Br_1(P_i)) \, (\omega_{Br1}(P_i) - Br(P_i))}{width}    (22)

\vartheta_{Br2} = \frac{Mean(Br_2(P_i)) \, (\omega_{Br2}(P_i) - Br(P_i))}{height}    (23)

The vital points of the X-ray image are realized using the feature vector for pixels P_i,
which uses integer positions. The notion of conditional variance is used to enhance the
stochastic search, as in the recommended CHF. At each phase, the distance of the normal
distribution is set using a unique dissemination that is calculated as

Loc(\rho, \sigma, \tau) = \begin{cases} \dfrac{\tau}{\sigma} \, \lim_{\rho \to \infty} \dfrac{1 + \frac{1}{\rho}}{2(\rho - \sigma)^{5/2}}, & 0 < \rho < \sigma < \tau < \infty \\ 0, & \text{location not satisfied} \end{cases}    (24)

The hypothesis for which a pixel in the X-ray image will be deselected is based
on the following formula:

D\!\left(P_i^{T_i+1}\right) = \begin{cases} 1 - P_i^{T_i}, & \text{location not chosen} \\ P_i^{T_i}, & \text{location chosen} \end{cases}    (25)

Based upon the above equations, the proposed cuckoo-based hash search algorithm
works as follows:
Algorithm 1 Cuckoo-Based Hash Search
1. Evaluate and obtain the dissemination points from the X-ray image using Eqs.
(2) to (23).
2. Declare the variance values for CHF: Loc(ρ , σ , τ ).

3. At random points, assume the pixel variations in the tissue regions.


4. For each and every time period at a particular iteration:
(a) Use Eq. 24 to determine the normal distribution of random pixel points.
(b) After that, use Eq. 25 to obtain chosen and abandon location points.
(c) Compare all the chosen pixel points based upon their distance function.
(d) Choose the best pixel points by evaluating them based on the health function.
(e) Add the positions to the chosen random pixel point list.
(f) Repeat the process for each iteration Ti.
5. Best chosen pixel points act as the dissemination points for X-ray image.
6. End process.
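To make the search loop concrete, the following is a hedged Python sketch of Algorithm 1. The health (fitness) function and the surrogate for the Loc(·) dissemination of Eq. (24) are illustrative placeholders, not the authors' exact formulation.

import numpy as np

def loc_distribution(rho, sigma, tau):
    # Stand-in for Eq. (24): returns 0 when the ordering constraint fails.
    if not (0 < rho < sigma < tau):
        return 0.0
    return tau / (sigma * 2.0 * (sigma - rho) ** 2.5 + 1e-9)

def health(pixel_value):
    # Hypothetical fitness: brighter pixels score higher (placeholder only).
    return float(pixel_value)

def cuckoo_hash_search(image, n_points=50, n_iter=20, rng=None):
    rng = rng or np.random.default_rng(0)
    h, w = image.shape
    # Step 3: start from random pixel positions in the tissue region.
    points = rng.integers(0, [h, w], size=(n_points, 2))
    for _ in range(n_iter):                       # Step 4
        rho, sigma, tau = np.sort(rng.random(3))  # surrogate variance values
        step = loc_distribution(rho, sigma, tau)
        candidates = points + rng.normal(scale=1 + step, size=points.shape)
        candidates = np.clip(candidates.round().astype(int), 0, [h - 1, w - 1])
        # Steps 4(b)-(d): keep a candidate only if it improves the health score.
        for k in range(n_points):
            if health(image[tuple(candidates[k])]) > health(image[tuple(points[k])]):
                points[k] = candidates[k]
    return points  # Step 5: best points act as dissemination points

# Example usage on a dummy "X-ray" image
dummy = np.random.rand(64, 64)
print(cuckoo_hash_search(dummy, n_points=10, n_iter=5)[:3])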

5 Research Data and Model Settings

The experiments were conducted on two types of data: open datasets and images from
actively participating health centers. The Kaggle dataset is an accessible collection of
chest X-ray images. There are 1341 training chest X-ray images in this collection, with
234 normal cases and 390 cases of pulmonary pneumonia. The images are in Joint
Photographic Experts Group (JPEG) format, with resolutions on the order of 5 k × 5 k
or 4 k × 4 k pixels, acquired from a common investigative tool. The research set is a
group of images we obtained from health-care professionals and clinics. The material
is available to the public, but it was made accessible for specific purposes in this work.
To secure and de-identify personal details, all medical data is protected. There are 65
frontal chest X-rays in this set, containing 11 normal cases, 35 pulmonary illness cases
(pneumonia and lung cancer), and 11 health-related instances. The images are taken
from a standard monitoring instrument, i.e., in JPEG format and of varying sizes.
For every dataset, the data samples were separated into training and testing sets at
random. Only the training data was used to determine appropriate model parameters,
and only the testing data was used to validate the results.
Both clinical and laboratory methods were used to get variables. The chest X-ray
photos of 40 COVID-19 patients were collected from the GitHub library. It includes
chest X-ray or CT scans, the majority of which are of persons with COVID-19,
pneumonia, influenza, and lung opacity. Furthermore, 40 normal X-ray images of
the chest were used from the Kaggle repository.
Our research used a chest X-ray image collection from 40 healthy patients and
40 COVID-19 patients. All images in this sample have been rescaled to 320 × 320
pixels. Chest X-ray pictures of COVID-19 and normal patients are shown in Fig. 7.
Deep features are collected from the fully connected layer and fed to the predictor
for learning purposes. A specialized layer extracts the significant properties of the
CNN models while also providing functionality. The attributes are fed into the
classifier, which is used to diagnose COVID-19 disorders.
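A minimal sketch of this pipeline is given below, assuming a ResNet50 backbone as the deep-feature extractor and a logistic-regression classifier; both choices, and the dummy 320 × 320 images, are illustrative assumptions rather than the chapter's exact configuration.

import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

# Backbone used purely as a deep-feature extractor (assumption: ResNet50 with
# global average pooling in place of the fully connected layer; weights are
# left random here to avoid a download).
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights=None, pooling="avg", input_shape=(320, 320, 3))

# Placeholder stand-ins for the rescaled 320 x 320 chest X-ray images and labels.
x = np.random.rand(8, 320, 320, 3).astype("float32")
y = np.array([0, 1] * 4)  # 0 = normal, 1 = COVID-19

features = backbone.predict(x, verbose=0)          # deep features
clf = LogisticRegression(max_iter=1000).fit(features, y)
print(clf.predict(features[:2]))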

Fig. 7 Proposed model for prediction of COVID-19

5.1 Estimates of the Proposed Model’s Accuracy

Training and validation accuracy and training and validation loss are some of the
measures used to assess the model's performance. Accuracy is crucial when evaluating
classification models: it is the percentage of correct predictions out of the total number
of predictions. Training accuracy reflects how well the model fits the instances it has
been trained on. The accuracy of each epoch was recorded here and a final chart was
generated; the graphs show that as the number of epochs grows, the model's accuracy
grows as well, so the training accuracy should be high. The greatest training accuracy
recorded here is 99.14 percent. Training accuracy is critical in this setting since the
researchers need to detect a larger number of valid COVID-19 instances. As illustrated
in Fig. 8, the training accuracy indicates that the model is acquiring all the features
correctly and that there will be minimal misinterpretation. Validation accuracy, also
called testing accuracy, is the accuracy determined on data that the model has not been
trained on; it reflects only samples the model has not seen so far, and its purpose is to
determine how well the model generalizes. Validation accuracy ought to be lower than
or equal to training accuracy, and the model can be said to be overfitting when there is
a significant gap between validation and training accuracy. The figure illustrates the
validation accuracy at each epoch, which is somewhat higher than the training
accuracy. Validation accuracy is important in this context to accurately describe the
test data, since both classes are significant; it essentially reflects how well the program
will identify COVID-19 and regular cases from chest scans. As a result, the system
should be able to categorize a wider range of correct cases, as shown in Fig. 9. The
error that occurs on the training data is referred to as training loss. The loss is a
metric that shows how poorly the model performs. When a

[Plot: training accuracy (y-axis) vs. epochs (x-axis)]

Fig. 8 Training accuracy vs. epochs

[Plot: validation accuracy (y-axis) vs. epochs (x-axis)]

Fig. 9 Validation accuracy vs. epochs



[Plot: training loss (y-axis) vs. epochs (x-axis)]

Fig. 10 Training loss vs. epochs

model predicts exactly, its loss is zero; otherwise, the loss is larger. The basic goal of
training a model is to reach a minimal value of the loss function, and every time a loss
is computed, the weights are updated to reduce it. It is therefore evident that the training
loss must be kept to a minimum, since the smaller the loss, the more accurate the system
is. The degradation of the model can be observed at each epoch: as the number of epochs
increases, the loss decreases, although it has increased in certain circumstances. Training
loss basically indicates how effectively the system is learning in each cycle and which
parameters are included, so that the model makes fewer errors in the next cycle and can
successfully differentiate between the COVID-19 and regular cases. As demonstrated in
Fig. 10, a lower loss indicates that the model is very efficient and that there are fewer
mistakes in the categorization of COVID and regular instances. The validation loss is
nearly identical to the training loss, except that it is computed on the validation set;
during its computation, the weights are not changed. Validation loss should be comparable
to training loss: the system is overfitting if the validation loss is higher than the training
loss, and underfitting if the validation loss is lower than the training loss. Validation loss
must be kept to a minimum, and a tiny amount of overfitting can be tolerated. As
illustrated in Fig. 11, the graph demonstrates the validation loss for each epoch, and the
validation loss decreases as the number of epochs grows. Validation loss indicates how
much inaccuracy the system has when identifying the COVID-19 and regular cases in
the testing data. When the validation loss is minimal, the system makes fewer mistakes
when categorizing the COVID and regular cases in the testing data.

[Plot: validation loss (y-axis) vs. epochs (x-axis)]

Fig. 11 Validation loss vs. epochs

6 Conclusion

With the outbreak of COVID-19, a global crisis has emerged and affected people
worldwide. Keeping up with the demand for medical supplies and testing kits is
nearly impossible, even for the most developed countries. Because there are not
enough testing kits available, a rise in COVID-19 infections can occur because many
infections are not discovered. Prevention is paramount when it comes to reducing
spread and mortality rates. The computer-aided diagnosis system is currently under
development and would use radiographic films of patients' chest X-rays to predict
COVID-19. COVID-19 diagnosis methods have gained much attention, with many
researchers devoting their time to the problem.
The use of deep neural networks for aerial views allows for a more comprehen-
sive understanding of the spread and treatment of COVID-19 in our approach. Deep
feature extraction and the CSA approach were applied to identify coronaviruses
in chest X-ray images from the GitHub and Kaggle repositories. To extract the
concepts, 11 CNN models have been pretrained to serve as classifiers for CSA.
Statistical research is performed to identify which classification pattern is the most
effective. The training and validation accuracy are improved using the cuckoo-based
function. The accuracy of the proposed COVID-19 disease classification model is
98.54 percent.

References

1. Singh, A. K., Kumar, A., Mufti Mahmud, M., Kaiser, S., & Kishore, A. (2021). COVID-19
infection detection from chest X-ray images using hybrid social group optimization and support
vector classifier. Cognitive Computation, 1–13.

2. Dhiman, G., Chang, V., Singh, K. K., & Shankar, A. (2021). Adopt: Automatic deep learning
and optimization-based approach for detection of novel coronavirus covid-19 disease using
x-ray images. Journal of Biomolecular Structure and Dynamics, 1–13.
3. Anter, A. M., Oliva, D., Thakare, A., & Zhang, Z. (2021). AFCM-LSMA: New intelligent
model based on Lévy slime mould algorithm and adaptive fuzzy C-means for identification
of COVID-19 infection from chest X-ray images. Advanced Engineering Informatics, 49,
101317.
4. Altan, A., & Karasu, S. (2020). Recognition of COVID-19 disease from X-ray images by
hybrid model consisting of 2D curvelet transform, chaotic salp swarm algorithm and deep
learning technique. Chaos, Solitons & Fractals, 140, 110071.
5. Elaziz, M. A., Hosny, K. M., Salah, A., Darwish, M. M., Songfeng, L., & Sahlol, A. T. (2020).
New machine learning method for image-based diagnosis of COVID-19. PLoS One, 15(6),
e0235187.
6. Dev, K., Khowaja, S. A., Bist, A. S., Saini, V., & Bhatia, S. (2021). Triage of potential
COVID-19 patients from chest X-ray images using hierarchical convolutional networks. Neural
Computing and Applications, 1–16.
7. Dhiman, G., Kumar, V. V., Kaur, A., & Sharma, A. (2021). DON: Deep learning and
optimization-based framework for detection of novel coronavirus disease using X-ray images.
Interdisciplinary Sciences: Computational Life Sciences, 1–13.
8. Kavitha, S., & Inbarani, H. (2021). Bayes wavelet-CNN for classifying COVID-19 in chest
X-ray images. In Computational vision and bio-inspired computing (pp. 707–717). Springer.
9. Pathan, S., Siddalingaswamy, P. C., & Ali, T. (2021). Automated detection of Covid-19 from
chest X-ray scans using an optimized CNN architecture. Applied Soft Computing, 104, 107238.
10. El-Kenawy, El-Sayed, M., Mirjalili, S., Ibrahim, A., Alrahmawy, M., El-Said, M., Zaki, R.
M., & Metwally Eid, M. (2021). Advanced meta-heuristics, convolutional neural networks,
and feature selectors for efficient COVID-19 X-ray chest image classification. IEEE Access, 9,
36019–36037.
11. Alorf, A. (2021). The practicality of deep learning algorithms in COVID-19 detection:
Application to chest X-ray images. Algorithms, 14(6), 183.
12. Vrbančič, G., Pečnik, Š., & Podgorelec, V. (2020). Identification of COVID-19 X-ray images
using CNN with optimized tuning of transfer learning. In 2020 International Conference on
INnovations in Intelligent SysTems and Applications (INISTA) (pp. 1–8). IEEE.
13. Bahgat, W. M., Balaha, H. M., AbdulAzeem, Y., & Badawy, M. M. (2021). An optimized
transfer learning-based approach for automatic diagnosis of COVID-19 from chest x-ray
images. PeerJ Computer Science, 7, e555.
14. Rajpal, S., Lakhyani, N., Singh, A. K., Kohli, R., & Kumar, N. (2021). Using handpicked
features in conjunction with ResNet-50 for improved detection of COVID-19 from chest X-ray
images. Chaos, Solitons & Fractals, 145, 110749.
15. Toğaçar, M., Ergen, B., & Cömert, Z. (2020). COVID-19 detection using deep learning models
to exploit Social Mimic Optimization and structured chest X-ray images using fuzzy color and
stacking approaches. Computers in Biology and Medicine, 121, 103805.
16. Balachandar, A., Santhosh, E., Suriyakrishnan, A., Vigensh, N., Usharani, S., & Manju Bala, P.
(2021). Deep learning technique based visually impaired people using YOLO V3 framework
mechanism. In 2021 3rd International Conference on Signal Processing and Communication
(ICPSC) (pp. 134–138). IEEE.
17. Gopalakrishnan, A., Manju Bala, P., & Ananth Kumar, T. (2020). An advanced bio-inspired
shortest path routing algorithm for SDN controller over VANET. In 2020 International
Conference on System, Computation, Automation and Networking (ICSCAN) (pp. 1–5). IEEE.
18. Thompson, R. N. (2020). Novel coronavirus outbreak in Wuhan, China, 2020: Intense
surveillance is vital for preventing sustained transmission in new locations. Journal of Clinical
Medicine, 9(2), 1–8.
19. Ucar, F., & Korkmaz, D. (2020). COVIDiagnosis-Net: Deep Bayes-SqueezeNet based diagno-
sis of the coronavirus disease 2019 (COVID-19) from X-ray images. Medical Hypotheses, 140,
109761.

20. Shams, M. Y., Elzeki, O. M., Elfattah, M. A., Medhat, T., & Hassanien, A. E. (2020). Why
are generative adversarial networks vital for deep neural networks? A case study on COVID-
19 chest X-ray images. In Big data analytics and artificial intelligence against COVID-19:
Innovation vision and approach (pp. 147–162). Springer.
21. Pereira, R. M., Bertolini, D., Teixeira, L. O., Silla Jr, C. N., & Costa, Y. M. G. (2020).
COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios.
Computer Methods and Programs in Biomedicine, 194, 105532.
22. Cohen, J. P., Morrison, P., & Dao, L. (2020). COVID-19 image data collection.
arXiv:2003.11597.
23. Kermany, D., Zhang, K., & Goldbaum, M. (2018). Labeled optical coherence tomography
(OCT) and chest X-ray images for classification. Mendeley data, 2.
24. Reshi, A. A., Rustam, F., Mehmood, A., Alhossan, A., Alrabiah, Z., Ahmad, A., Alsuwailem,
H., & Choi, G. S. (2021). An efficient CNN model for COVID-19 disease detection based on
X-ray image classification. Complexity, 2021, 1.
Initial Stage Identification of COVID-19
Using Capsule Networks

Shamika Ganesan, R. Anand, V. Sowmya, and K. P. Soman

1 Introduction

Coronavirus is an outbreak that has been exceedingly infectious and has spread
rapidly worldwide, with common symptoms such as fever, toxicity, myalgia, or
weariness. COVID-19 is treated differently depending on the degree of the illness,
although typically antibiotics, cough drugs, antipyretics, and painkillers are effective
[1]. The initial phase of treatment is the discovery of the disease. COVID-19 is
mostly detected by a swab test along with an X-ray or CT scan of the lung [2].
Among these medical assessments, chest X-rays are the most affordable for a typical
person. A major barrier for adopting radiographic imaging is the scarcity of
adequately qualified radiologists who can interpret X-ray pictures in a timely and
reliable manner. Artificial intelligence (AI) was widely used to speed up biological
research. AI is frequently employed with deep learning algorithms in numerous
applications such as image detection, data categorization, and picture segmentation
[3, 4]. Pneumonia in individuals infected with COVID-19 can occur when the virus
progresses to the lungs. Many in-depth study investigations have discovered the
condition employing an X-ray chest imagery approach [5]. In the early diagnosis
and treatment of COVID-19 illness, computed tomography and X-ray imaging play
a vital role [4]. Because X-ray pictures are cheaper and faster and involve less radiation
exposure, they are preferred to CT images [3, 5]. However, pneumonia cannot
be diagnosed mechanically. White spots on X-ray pictures must be analyzed by
a specialist and explained in depth. However, these patches might be mistaken

S. Ganesan · V. Sowmya · K. P. Soman


Center for Computational Engineering and Networking (CEN), Amrita School of Engineering,
Coimbatore, India
R. Anand
Department of Electronics and Communication Engineering, Sri Eshwar College of Engineering,
Coimbatore and Sona College of Technology, Salem, India


with TB or bronchitis, which might result in misdiagnosis. It is therefore highly
desirable to create strong and precise artificial intelligence models from a small
amount of imperfect data. The CNN is currently the strategy employed by most
researchers who utilize deep learning technology to find COVID-19, as it is an
excellent feature extractor. CNNs, however, have some faults, such as being unable
to record spatial feature correlations and being unable to recognize images under
rotation or other transformations. In 2017 [6], the Capsule Network was introduced
in detail. The Capsules in a Capsule Network are collections of connected neurons
representing several characteristics of a particular object, such as pose, texture, and
tone. Capsule Networks can capture whole visual attributes, poses, and spatial
interactions through neuron "packing," so that Capsules can recognize particular
pattern types and lessen the network's dependency on vast datasets. The Capsule
Network has been shown to be an effective replacement for CNNs. This motivates
advances in the CapsNet area for the identification of COVID-19 using chest X-rays.
This research therefore leads us to explore a CapsNet-based architecture that imitates
the real-world situation in the form of affine-transformed data. The main aims of the
proposed work are as follows: 1. To carry out a complete study and analysis of
CapsNet applications for COVID-19 patients' chest X-ray data. 2. To compare its
performance with standard deep learning architectures from the literature.

2 Literature Review

Over the years, humankind has seen a variety of pandemics, some of which
are more damaging to humans than others. We are faced with the COVID-19
coronavirus, a new unseen opponent, and this is a difficult war. The COVID-19
pandemic continues to have a disastrous impact on the health and well-being
of the global population, with people infected by severe acute respiratory
syndrome coronavirus 2 (SARS-CoV-2). The purpose of this literature review is
to provide the latest results on the basic science of SARS-CoV-2. On 31 December
2019, the Wuhan Health Commission in Hubei, China, warned the National
Health Commission, the China CDC, and WHO of a cluster of 27 pneumonia
cases of undetermined origin [6]. These patients experienced a host of symptoms,
including fever, dyspnea, and dry cough, and bilateral glassy opacity in the lungs
was found in radiographical examinations. Due to its high population density and
proximity to a market that sold live animals, Wuhan became the hub of the human–
animal connection. Furthermore, the fast transmission in Wuhan was supported
by the absence of early containment due to the failure to accurately trace the
exposure history in early patient cases. This led to the announcement of the viral
pneumonia outbreak by the World Health Organization (WHO) on 30 January
2020. The 2019 Coronavirus (COVID-19) was identified by the WHO on 11 March
2020 as a pandemic based on the global logarithmic increase in cases. On January
7, 2020, China CDC detected the virus known as the novel coronavirus 2019

(CoV 2019). The SARS-CoV-2 virus is closely related to the 2002 SARS Coronavirus
(SARS-CoV-1). A variety of distinct coronaviruses can cause the common cold.
The virus can become an infectious virus if these coronaviruses discover a mammal
reservoir that provides an appropriate cellular environment to multiply the virus
and to acquire a series of advantageous genetic changes. Similar to the SARS-
CoV-1 and MERS-CoV viruses, the origin of the SARS-CoV-2 genome has been
traced to bats [7]. Coronavirus disease 2019 (COVID-19) is the outcome of
SARS-CoV-2 viral infection. In the joint WHO-China publication on COVID-19 [8],
the COVID-19 symptomatology was thoroughly studied. In 85% of the instances,
COVID-19 patients develop pyrexia through the course of their condition, whereas
only 45% are initially febrile [9]. Cough is also seen in 67.7% of the patients, and
33.4% produce sputum. Dyspnea, sore throat, and nasal congestion were noted in
18.6%, 13.9%, and 4.8% of patients, respectively [10]. Constitutional symptoms, such
as muscle or bone pain, chills, and headache, are found in 14.8%, 11.4%, and
13.6% of the patients, respectively [9]. GI symptoms such as nausea or vomiting,
and diarrhea, are reported in 5% and 3.7% of the patients, respectively. Even
in many rich countries, the health system is on the point of collapse as demand
for intensive care units is simultaneously increasing. The number of patients
in intensive care units is increasing with COVID-19 pneumonia. Deep learning
algorithms in recent years have continued to show remarkable results both in
the field of medical image processing and in a number of other fields. Tests are
carried out to get meaningful findings from medical data utilizing deep learning
algorithms. Effective screening of infected people using a primary screening
methodology, i.e., radiological imaging using chest X-rays, is a major milestone in the
struggle against COVID-19. Early studies showed that COVID-19 patients exhibit
abnormalities in their chest X-ray imagery. As a result, a wide range of artificial
intelligence (AI) deep learning approaches have been created, with encouraging
findings regarding accuracy in the detection of COVID-19 infected persons utilizing
thoracic X-rays. However, these advanced AI systems have remained closed source
and are not readily available for further study and development by the scientific community
[10]. As an automated prediction technique for COVID-19, a deep convolutional
network based on pretrained transfer learning models and chest X-ray images has
been developed. Deep learning is a machine learning subdiscipline inspired by the
structure of the brain. They used pretrained models such as ResNet50, InceptionV3,
and Inception-ResNetV2 to increase the accuracy of prediction for small X-ray
datasets [20]. (i) The recommended models have a full end-to-end design that
removes the necessity for the extraction and selection of handcrafted features. (ii) They
show that ResNet50 is the most effective of the three pretrained models. (iii)
Chest X-ray photos are the most suitable tool to identify COVID-19. (iv) Pretrained
Hand washing is one of the preventative strategies advised by the World Health
Organization for at least 20 seconds after visiting public places. Soap or hand
sanitizers should be used with at least 60% ethanol [11]. It is also a good idea
to keep your hands away from the given T-zone face (eyes, nose, and mouth),
as the virus penetrates the upper respiratory system. Avoid interaction with those

having symptoms already, as well as crowds and overcrowded environments. It is


necessary to prohibit travel to the area afflicted by the outbreak. A healthy individual
should be at least six feet apart from persons with symptoms. The full set of personal
protective equipment (PPE) must be used by all healthcare professionals working with
COVID-19, including surgical masks, double gloves, complete dressings, and eye
shields. N95 masks, which block 95% of droplets, should be used just before
procedures such as tracheostomy, tracheal intubation, bronchoscopy, cardiopulmonary
resuscitation (CPR), and non-invasive ventilation (NIV) are performed [11], because
these procedures may aerosolize the virus. Closing schools, enterprises, airspace,
and athletic events all serves to contain community transmission. To limit
the potential for contracting COVID-19, high-risk patients, such as those above
65 or those with chronic comorbidities, must additionally quarantine
independently. Unfortunately, as of March 2020, there is no licensed
vaccination for COVID-19. The mRNA-1273 vaccine, developed by ModernaTx
Inc (Cambridge, MA, USA), is one of the most promising possibilities [12]. The
full-length, prefusion-stabilized spike glycoprotein (S) of the SARS-CoV-2 virus
is encoded by mRNA-1273, which is contained within a lipid nanoparticle. The
safety profile, reactogenicity, and immunogenicity of this vaccine are now being
evaluated in healthy patients in a Phase I, Open-Label, Dose-Ranging clinical study
(NCT04283461). The project is expected to be completed in June 2021. The most
significant studies pertaining to the individual factors that determine the clinical
course and therapy of COVID-19 are summarized in this literature review. This
COVID-19 pandemic serves as a reminder of the unpredictability of ongoing plans
to control SARS-CoV-2 illness, both primary and secondary [13]. In this day of data
abundance, precise modeling of current data and the elimination of disinformation
can help with this planning. Rapidly updated surveillance data, the availability
of reliable authorized information, and a multidisciplinary strategy that bridges
the knowledge gap between fundamental and clinical sciences are all factors that
can help boost countermeasures against this pandemic. Following the pneumonia
caused by unclear causes in Wuhan, China [12], the author Chaolin Huang and
his colleagues stated that 41 sick persons had been proven to be infected
with COVID-19 [14] and were taken to a Wuhan City hospital with a variety of
symptoms including fever, dry cough, lethargy, and other nonspecific signs and
symptoms. The respiratory system of the human body becomes gradually infected.
As a result, we collected chest X-ray medical images from both sick and
healthy individuals. Deep learning models have proven to be successful in the
biomedical area. The use of deep learning models for lung segmentation [8] and for
diagnosing breast cancer [15], epilepsy [18], and pneumonia [19] has enhanced the
appeal of these methods in the biomedical area. COVID-19 has recently been diagnosed
with the help of radiology scans. [7] created a deep-learning-based technique for
diagnosing COVID-19 that included both binary class and multi-class analyses.
The generated model has a 98.78% accuracy rate for the binary class (COVID-
19 vs. Normal) and a 93.48% accuracy rate for the multi-class (COVID-19 vs.
Normal vs. pneumonia). [9] developed a deep learning model for COVID-19 illness
detection and compared it to seven different deep learning algorithms. For the binary

class problem, an average of 74.29% accuracy was attained. By constructing the


ResNet50, InceptionV3, and Inception ResNetv2 deep learning models, Narin et al.
[10] were able to diagnose COVID-19 using chest X-ray pictures. The data were
verified with fivefold cross-validation once the binary classification procedure was
completed in the research. The performance of the models was assessed using five
distinct criteria, with the ResNet50 model achieving an average accuracy of 98%.
A deep learning model for diagnosing COVID-19 illness was developed by Wang
et al. [11]. The suggested method’s performance was assessed using sensitivity
and accuracy measures, and the findings were compared to those of the VGG19
and ResNet50 deep learning models. The average accuracy was 93.3% at the conclusion of the research. [12]
used a deep learning model to extract characteristics from COVID-19’s X-ray
pictures and then categorized them using SVM. F1-scores and Kappa values were
used to evaluate the performance of this hybrid model, which was built by mixing
ResNet50 and SVM models. This strategy, compared to other ways, was shown
to be more successful, according to the study. Capsule Networks were utilized
by Afshar et al. [13] to diagnose COVID-19 patients. Specificity, sensitivity, and
accuracy values were used to assess the suggested model’s effectiveness, and the
results showed that it was accurate to 98.3%. [14] created Capsule Networks that
used X-ray pictures to diagnose COVID-19 instances. The constructed model was
compared to the deep learning models Inceptionv3, ResNet50, and DenseNet121,
and the proposed strategy was shown to be more successful. COVID-19 diagnostic
investigations using computed tomography are also available, in addition to X-ray
pictures. Zheng et al. [16] constructed a unique deep learning algorithm, which
they tested on 499 CT scans. At the conclusion of the research, the average accuracy
was 88.55%. CT scans were utilized to detect COVID-19 illness and differentiate
it from pneumonia in a study by Ying et al. [17]. In this chapter, the suggested
deep learning model was compared to various existing models in the literature, and
the model’s performance was assessed using accuracy, precision, recall, AUC, and
F1-scores.

3 Dataset Description

As previously indicated, we evaluated pretraining the model as a first step in


order to potentially improve the COVID-CAPS’ diagnostics capabilities. In contrast
to Reference [10], which used the ImageNet dataset [11] for pretraining, we
created and employed an X-ray dataset. The rationale for not utilizing ImageNet

Fig. 1 Zero and affine transformed X-ray image. (a) Zero degree centered X-ray. (b) Affine
transformed X-ray

for pretraining is that the nature of the pictures in that dataset (natural pictures)
is completely different from that of the COVID-19 X-ray dataset. It is believed that
utilizing a model that has been pretrained on comparable X-ray pictures would
provide a superior boost to COVID-CAPS. The whole COVID-CAPS model is initially
trained on the external data, with the number of final Capsules set to the number
of output classes in the external set, for pretraining with an external dataset. The
final Capsule layer is changed with two Capsules to represent positive and negative
COVID-19 instances, in order to fine-tune the model using the COVID-19 dataset.
All of the other Capsule layers are fine-tuned, whereas the traditional layers are
set to the pretraining weights. In our work, we have used the Kaggle COVID-19
Radiography database, which consists of two classes, viz. COVID-19 and Normal,
with 208 samples from each class for training, making 416 training samples
altogether. For testing, we have created two sets of data: one with 10° of rotation
and one with 30° of rotation. Each test set uses 32 samples, 23 rotated by the
corresponding angle of rotation and 9 that are not rotated. Figure 1a shows a
zero-degree centered X-ray image. Figure 1b shows an affine transformed X-ray
image.

4 Methodology

4.1 Overview of Layers Present in Convolutional Neural


Networks
4.1.1 Convolutional Layer

The convolutional layer computes a simple dot product between a defined region
of the image and the collection of kernel functions called filters [15]. In most
situations, an image with dimensions M × N × C is used as input. The length and
breadth of the picture are M and N, and the number of color channels is usually
C = 3 (red, green, and blue). The feature map is the output of the convolution layer,

and its dimensions are W × Q × K, as given in Eq. 1. The following equation can be
used to obtain the W and Q values. The feature map dimensions depend on a variety
of parameters such as the filter size (F), zero padding (P), stride (S), and number of
filters (K) [16]:

W = \frac{M - F + 2P}{S} + 1    (1)
Filter parameters such as the filter size (F) indicate the size of the filter or kernel that
will be utilized for convolution. Kernel functions are often odd-sized square matrices,
such as 3 × 3, 5 × 5, 7 × 7, etc. An odd value is desirable so that the kernel matrix's
center may be placed on the pixel on which it operates.

4.1.2 Stride(S)

A filter or kernel matrix must be translated over the input matrix vertically from top
to bottom and horizontally from left to right, covering all the elements in the input
matrix [15]. Stride controls this translational movement; it represents the number of
steps taken by the kernel matrix at each move. For example, when a 3 × 3 kernel
function is applied over a 7 × 7 input with stride S = 2, the movement of the filter
results in an output of size 3 × 3.
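Substituting these values into Eq. 1 confirms this: W = (7 − 3 + 0)/2 + 1 = 3, so the output is indeed 3 × 3.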

4.1.3 Pooling Layer

The pooling layer is an important layer in a CNN. Pooling is used for downsampling.


It mainly involves reducing the spatial dimensions of the feature maps [18]. This
considerably reduces the number of parameters required for training the network,
and by reducing the trainable parameters, it minimizes the computations required.
The pooling layer becomes essential in networks with a greater number of layers.
Applying a 2 × 2 pooling filter (average, maximum, or sum) reduces the size of the
feature map to half of its original size. For example, a 2 × 2 pooling filter applied to
a 224 × 224 × 64 feature map reduces the size of the output feature map to
112 × 112 × 64. However, the depth of the feature map remains unaltered.
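This halving of the spatial size can be checked with a short snippet (assuming tensorflow.keras; the random feature map is only a placeholder):

import tensorflow as tf

feature_map = tf.random.normal((1, 224, 224, 64))            # batch of one
pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(feature_map)
print(pooled.shape)                                          # (1, 112, 112, 64)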

4.1.4 ReLU Activation Functions

The ReLU function returns values in the range [0, ∞) [19, 20]. It is the most often
utilized activation function. The negative values in the input matrix are set to zero,
so this activation may be implemented using a simple thresholding scheme. This
function reduces the amount of computation by avoiding the simultaneous activation
of all neurons, as indicated in Eq. 2. Compared to sigmoidal and tangent functions,
this function also converges more quickly. When the gradient is zero, this function
cannot update all of the weights during back-propagation, and when

employed with a fast learning rate, it results in a higher number of dead neurons [17].

ReLU(x) = \max(0, x)    (2)
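A one-line NumPy illustration of Eq. 2 (the input values are arbitrary):

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.maximum(0, x))   # negative entries are clipped to zero: [0. 0. 0. 1.5 3.]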

4.1.5 Generalized Supervised Deep Learning Flowchart

The art and science of teaching computers to make judgments based on data
without being explicitly programmed is known as machine learning. In general,
there are three forms of learning in machine learning: supervised, unsupervised,
and reinforcement learning. We employed a convolutional neural network based
on supervised machine learning in this chapter. The supervised machine learning
model [8] is trained on a labeled dataset to predict the result of the sample data.
The suggested methodology for identifying different pneumonia instances, such as
viral and bacterial pneumonia in COVID-19, is described in the next section.

5 Proposed Work

5.1 Capsule Networks

Convolutional neural networks (CNNs) form the basic computing structures for
deep learning architectures. While CNNs try to grasp the generalized features of
an image, they do not focus on geometrical or orientational information such as
relative size of the features, angle of rotation of the features, etc. The idea of Capsule
Networks is to explicitly capture such information, wherein the dimension of each
Capsule would be equal to the number of such orientational information captured.
Each Capsule is a vector (a set of feature maps in case of multi-dimensional data),
where the size of the vector (the number of feature maps) that constitutes a Capsule
is equal to its dimension. An encoder and a decoder are the two most important
components of a Capsule Network. They have a total of six layers. The first three
layers of the encoder oversee accepting the input picture and transforming it to a
vector format (16-dimensional). The convolutional neural network, which is the first
layer of the encoder, extracts the picture’s fundamental characteristics. The Primary
Caps Network is the second layer, and it takes those fundamental traits and looks
for more intricate patterns among them. It might detect the spatial link between
specific strokes, for example. In the Primary Caps Network, different datasets
have varying numbers of Capsules; for example, the MNIST dataset includes 32
Capsules. The Digit Caps Network is the third layer, and the number of Capsules
in it changes as well. The encoder produces a 16-dimensional vector that is sent
to the decoder after these layers. The decoder is made up of three levels. It takes
the 16-dimensional vector and uses the data it must try to rebuild the same image
from scratch. The network becomes more resilient because of its ability to generate

[Diagram: Conv1 → PrimaryCaps → ConvCaps → Flatten → FCCaps, with routing between stages; local spatial routing from Conv1 to PrimaryCaps and from PrimaryCaps to ConvCaps via matrix transformation and routing probabilities]

Fig. 2 Block diagram for Capsule Network architecture

predictions based on its prior knowledge. The input image or a feature map first
undergoes the process of forming Capsules using a transformation matrix W, and
the obtained activity vector û is used to compute the CapsNet output, s. Finally,
a squash function is applied to obtain the prediction vector, v. This ensures that the
output is pushed toward zero if the value is too small or toward one if the value is
too large. The block diagram for the Capsule Network is given in Fig. 2.

\hat{u}_{j|i} = W_{ij} u_i    (3)

s_j = \sum_i c_{ij} \hat{u}_{j|i}    (4)

v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}    (5)
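The following NumPy sketch walks through Eqs. (3)-(5) for a single higher-level Capsule with fixed, uniform coupling coefficients; the capsule counts and dimensions are illustrative assumptions, and the dynamic update of the coefficients is omitted.

import numpy as np

def squash(s, eps=1e-9):
    # Eq. (5): shrink short vectors toward 0 and long vectors toward unit length.
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(32, 8))       # 32 lower-level capsules of dimension 8
W = rng.normal(size=(32, 16, 8))   # transformation matrices for one output capsule
u_hat = np.einsum("ikd,id->ik", W, u)   # Eq. (3): u_hat_{j|i} = W_{ij} u_i
c = np.full((32, 1), 1.0 / 32)          # uniform coupling coefficients c_{ij}
s_j = np.sum(c * u_hat, axis=0)         # Eq. (4): coupled sum over lower capsules
v_j = squash(s_j)                       # Eq. (5)
print(np.linalg.norm(v_j))              # output length is below 1 by construction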

5.2 Proposed Architecture

In this chapter, we propose an architecture (as shown in Algorithm 1) containing


CNN layers for feature extraction and a Capsule layer to capture the dimension
information. Our model consists of an input layer that takes in 128 .× 128 grayscale
chest X-ray images, the first convolutional layer with 512 filters of kernel size

Algorithm 1: CapsNet model


Input: in_shape, n_class, routing, batch_size (in_shape is a tuple; n_class, routing, and batch_size are integers)
Output: Model
Initialization
CapsNet(in_shape, n_class, routing, batch_size)
Input → Image
Conv1(512 filters, 3×3 kernel, 1×1 stride) → Input
Conv2(256 filters, 3×3 kernel, 1×1 stride) → Conv1
PrimaryCaps(inputs, dim_capsule, n_channels, kernel_size, strides, padding) → Conv2
DigitCaps(num_capsule, dim_capsule, routings) → PrimaryCaps
Out_caps → DigitCaps
Softmax → Out_caps
return Softmax
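A hedged Keras sketch of the encoder portion of Algorithm 1 is given below. Capsule layers are not built into Keras, so the DigitCaps/dynamic-routing stage is indicated only by a comment, and the PrimaryCaps stage is approximated with a convolution, a reshape into 8-dimensional capsules, and the squash non-linearity; layer sizes follow the description in the text.

import tensorflow as tf
from tensorflow.keras import layers, models

def squash(s, eps=1e-9):
    norm_sq = tf.reduce_sum(tf.square(s), axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / tf.sqrt(norm_sq + eps)

def build_capsnet_encoder(in_shape=(128, 128, 1)):
    inputs = layers.Input(shape=in_shape)
    x = layers.Conv2D(512, 3, strides=1, activation="relu")(inputs)   # Conv1
    x = layers.Conv2D(256, 3, strides=1, activation="relu")(x)        # Conv2
    # PrimaryCaps: a 9x9 convolution whose outputs are grouped into 8-D capsules.
    x = layers.Conv2D(256, 9, strides=2)(x)
    x = layers.Reshape((-1, 8))(x)
    primary_caps = layers.Lambda(squash)(x)
    # A DigitCaps layer with 3 dynamic-routing iterations would follow here
    # (custom layer, omitted in this sketch); its 16-D activity vectors feed a
    # softmax over the output classes.
    return models.Model(inputs, primary_caps)

build_capsnet_encoder().summary()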

Fig. 3 Block diagram of the proposed methodology

3 × 3, and the second convolutional layer with 256 filters of kernel size 3 × 3.
The extracted features are passed into a Primary Caps layer that creates Capsules
of dimension 8 by using a convolution of kernel size 9 × 9. These Capsules are
passed on to the Digit Caps layer, where dynamic routing is carried out with
3 routings per update. The generated output, called the activity vector, gives a
16-dimensional output corresponding to each class. The magnitude of each of these
vectors is passed through a SoftMax layer to obtain the classification probabilities.
A detailed block diagram for our proposed model is given in Fig. 3. The architecture
was decided experimentally based on the results given in Table 1. Experiments were
conducted for 52 epochs with a batch size of 32 and an Adam optimizer with a
learning rate of 0.0001. Because the training loss increased when the number of
epochs was increased further, the training was limited to 52 epochs. Initially, 416
samples for training and 32 samples for testing, each of dimension 128 × 128 × 1,
were used with a batch size of 32. The highest average is

Table 1 Experimental results obtained from the proposed architecture


Conv1 Conv2 Degree of rotation Accuracy 10° Accuracy 30° Average accuracy
64 64 10 50.00 65.63 57.81
128 128 10 65.63 46.87 56.25
256 128 10 56.25 34.37 45.31
512 128 10 65.25 40.62 52.94
64 64 30 50.00 65.63 57.81
128 512 30 34.38 59.38 46.88
256 64 30 34.38 50.00 42.18
512 256 30 62.50 68.75 65.63

with 512 filters in Conv1 and 256 filters in Conv2, and the corresponding number
of learnable parameters is 1,935,136. We used the Adam optimizer with a learning
rate of 10^{-3}, 100 epochs, and a batch size of 16. The training dataset (described
in Sect. 4) was separated into two parts, training (70%) and validation (30%), with
the training set used to train the model and the validation set used to choose the
best model. The chosen model is then assessed on the testing set. To indicate the
performance, the following four measures are used: accuracy, sensitivity, specificity,
and AUC (area under the curve). Following that, we present the outcomes. We
utilized the same dataset as Reference [10] to carry out our studies. This information
was gathered from two publicly available sources [19, 20]. Datasets of chest X-rays
are accessible. Normal and COVID-19 are the two labels in the produced dataset.
As the title suggests, the primary purpose of this research is to discover COVID-19
positive patients. We divided the labels into two categories: positive and negative.
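An illustrative training setup following the settings quoted above (Adam optimizer, a 70/30 train/validation split, positive/negative labels) might look as follows; the data arrays and the stand-in classifier are placeholders, not the chapter's actual model or data.

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

x = np.random.rand(416, 128, 128, 1).astype("float32")   # placeholder images
y = np.random.randint(0, 2, size=416)                    # 0 = negative, 1 = positive

x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.30, random_state=0)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 128, 1)),
    tf.keras.layers.Dense(2, activation="softmax"),
])  # stand-in classifier; in practice the CapsNet model would be used here
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_tr, y_tr, validation_data=(x_val, y_val),
          epochs=1, batch_size=32, verbose=0)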
The confusion matrix is used to evaluate the efficacy of any model. The number
of properly predicted outcomes and the number of wrongly predicted outcomes are
separated into classes in the confusion matrix, which represents the performance of
a classifier.

5.3 Metrics for Evaluation


5.3.1 Accuracy

Accuracy refers to the percentage of correct predictions (TP and TN) compared to the


total number of predictions.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (6)

5.3.2 Precision

Precision is the percentage of correct positive predictions in relation to the total number
of positive predictions.

Precision = \frac{TP}{TP + FP}    (7)

5.3.3 Recall

Recall is the percentage of correct positive predictions out of the total number of
samples in a given class. True positive rate (TPR) is another term for recall.

Recall = \frac{TP}{TP + FN}    (8)

5.3.4 F1-Score

F1-score denotes the harmonic mean of the recall and precision values.

F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}    (9)

5.3.5 False Positive Rate (FPR)

The false positive rate is the proportion of cases where a positive class is predicted when the actual outcome is negative.

FPR = \frac{FP}{FP + TN}    (10)
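These metrics can be computed directly from a confusion matrix, for example with scikit-learn (the labels below are placeholder values):

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # 1 = COVID-19, 0 = normal
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("f1       :", f1_score(y_true, y_pred))
print("fpr      :", fp / (fp + tn))                   # FP/(FP+TN), Eq. (10)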

Experimental results for CapsNet are provided in Table 1. A comparative study


of the proposed CapsNet with standard CNN architectures for affine transform is
provided in Table 2. Performance comparison of the proposed method with the
standard CNN architectures in terms of computational intensity by calculating the
number of learnable parameters is given in Table 3. For an effective understanding of
how efficiently affine transformations have been captured by CapsNet, the different
statistical measures with a varying number of filters for both 10° and 30° of rotation
are provided in Tables 4 and 5.
According to the performance measures of the above experiments, the CapsNet
deep learning framework has a high sensitivity to COVID-19, and the network
framework can actually generate sufficient features on short datasets to achieve accu-
rate COVID-19 identification. The findings of experiment I demonstrate that using
a Capsule Network may significantly increase CNN’s performance in detecting

Table 2 Comparison of the standard CNN architectures with the proposed CapsNet architecture for affine transform

Model                              Accuracy (%)
DenseNet 121                       87.50
ResNet 50                          75.00
VGG 16                             53.13
VGG 19                             53.13
Xception                           58.75
Inception V3                       71.88
MobileNet                          34.38
Proposed architecture (CapsNet)    68.75

Table 3 Comparison of the standard CNN architectures with the proposed CapsNet architecture for affine transform with learnable parameters

Model                              Learnable parameters
DenseNet 121                       7,978,856
ResNet 50                          25,583,592
VGG 16                             134,268,738
VGG 19                             138,357,544
Xception                           22,855,952
Inception V3                       23,626,728
MobileNet                          4,231,976
Proposed architecture (CapsNet)    1,935,136

Table 4 Global statistics

Metric name           Value
Label-wise accuracy   97.7%
Image-wise accuracy   97.7%
Hamming loss          2.3%
Cross-entropy         0.06

Table 5 Global statistics averaged


Metric name Micro-average Macro-average Weighted average
Precision 97.7% 97.8% 97.8%
Recall 97.7% 97.7% 97.7%
F1* 97.7% 97.7% 97.7%
Average precision (AP) 99.9% 100.0% 100.0%
ROC AUC 99.9% 100.0% 100.0%

COVID-19 chest X-rays. Experiment II demonstrates that CapsNet is resilient and


does not require data augmentation or pretraining. Experiment III will also be used
to examine CapsNet’s capacity to identify classifications (Normal, COVID-19),
and when compared to COVID-Net, the framework achieves ideal performance.
Following that, we will go over the three types of comparison trials mentioned
above in detail. Experiment number one: The detection capacities of DenseNet121
and ResNet50 may be compared when utilizing a typical CNN alone to identify
COVID-19 chest X-rays, as shown in Table 2. Aside from accuracy, the indices
are significantly different, with a sensitivity differential of about 20%. This is

Table 6 Per class statistics


Class Precision* Recall* F1* Ap* ROC AUC*
Normal 95.7% 100.0% 97.8% 100.0% 100.0%
COVID 100.0% 95.4% 97.6% 100.0% 100.0%

critical because the better the sensitivity, the less likely it is that COVID-19 patients
would be mistakenly identified. DenseNet121 identified 22 COVID-19 patients
as Normal, whereas ResNet50 identified 24 COVID-19 patients as Normal, with
ResNet50 having nearly double the amount of false detections as DenseNet121. This
conclusively demonstrates that DenseNet121 outperforms ResNet50 in identifying
COVID-19. This might be attributed to DenseNet 121’s ability to harvest deeper
features due to the network level deepening. DenseNet 121 is also superior to
ResNet 50 because of its outstanding feature transfer and feature reuse abilities.
DenseNet121, on the other hand, does not deliver sufficient results. Table 1 further
shows that the combination of a CNN and a Capsule Network is significantly
better at identifying COVID-19 than the CNN alone. DenseNet121 is combined
with CapsNet to make CapsNet, and ResNet50 is combined with CapsNet to make
ResNet 50, and these two models are tested for their ability to identify COVID-19.
When combined, these two frameworks outperform the ResNet50 and DenseNet121
models. As seen in Fig. 5, both CapsNet and ResNet50 are very sensitive to COVID-
19 recognition. ResNet 50, on the other hand, misses five COVID-19 patients, but
CapsNet misses only three, implying that CapsNet is better. Our research aims to
improve the detection sensitivity of COVID-19 while also determining how long
it takes to train. Under the same settings and with the same testing equipment, the
same dataset is utilized to train ResNet 50 and CapsNet. 30 epochs in all, with
ResNet 50 lasting 12 hours, 47 minutes, and 56 seconds and CapsNet taking 6
hours, 3 minutes, and 24 seconds. CapsNet requires less than half the training time
of ResNet 50 and also gives considerable time savings.
Using the aforementioned dataset, Table 6 shows that COVID-CAPS obtained an
accuracy of 62.50%, a precision of 57.50%, a recall of 62.50%, and an F1-score of
62.50% for 10-degree rotation, and an accuracy of 68.4%, a precision of 67.60%, a
recall of 68.75%, and an F1-score of 68.75% for 30-degree rotation. False positive
instances were researched further to see which categories are most likely to be
misclassified as COVID-19. Normal instances
account for 54% of false positives, while COVID patients account for just 17%
of false positives, respectively. We compare our results to those of Reference [12],
which employed a binarized version of the same dataset, as given in Figs. 4 and 5. In
terms of precision and specificity, COVID-CAPS exceeds its equivalent. The model
suggested in Reference [12], which comprises 23 million trainable parameters,
has a greater sensitivity. Reference [6] has more study on the binarized version
of identical X-ray images. We did not compare the COVID-CAPS performance
to this study since the negative label only comprises normal patients (as opposed
to all normal, bacterial, and non-COVID viral cases being labeled as negative).

[Bar chart: accuracy, precision, recall, and F1-score for different Conv1 + Conv2 filter-size combinations]

Fig. 4 Statistical measures of the proposed architecture with 10-degree rotation

Notably, the proposed COVID-CAPS has only 1,935,136 trainable parameters.


Because the model recommended in Reference [12] has 23 million trainable
parameters, COVID-CAPS can be trained and used more quickly. Additionally,
it alleviates the need for large computational resources. The proposed architecture
performs better than VGG, Xception, and MobileNet for affine transforms. The
number of learnable parameters used in the proposed architecture is less than that
of all the standard CNN architectures, as presented in Table 3. Figures 6 and 7 show
the prediction output of Normal and COVID image classification using the Capsule
Network without rotations.
Figure 8 describes the learning curve of image-wise training accuracy. The accuracy
saturates after about 500 seconds, so 500 seconds of training time is more than
enough for classification with the help of Capsule Networks. Figure 9 shows the
training learning curve with cross-entropy values. Entropy is an important parameter
during the training process; here the entropy values are very low after 500 seconds
of training time. Between 200 and 250 seconds of training time, the entropy values
fluctuate because of similar pixel values of images in different classes. Figure 10a
and b shows the precision-recall (AUC) curves for both training and validation of
the images, giving a statistical review of the performance metrics of the Capsule
Network.

[Bar chart: accuracy, precision, recall, and F1-score for different Conv1 + Conv2 filter-size combinations]

Fig. 5 Statistical measures of the proposed architecture with 30-degree rotation testing

[Predicted scores: Normal 100%, Covid 0%]
Fig. 6 Normal image prediction using Capsule Networks

6 Conclusion

In this chapter, we explore the use of CapsNet for capturing rotational variations
in COVID chest X-ray data to mimic a real-world scenario. We train our model with
chest X-ray images from the Kaggle COVID-19 Radiography dataset that are all
centered at 0°. In order to verify that the orientation details are efficiently captured
by CapsNet, we test our model with two types of testing data: 10° rotated and

[Predicted scores: Normal 0%, Covid 100%]
Fig. 7 COVID image prediction using Capsule Networks

[Learning curve for image-wise accuracy: training and validation values vs. training time (seconds). Note: the learning curve for image-wise accuracy in multi-label classification is plotted using a 50% score threshold for all classes]

Fig. 8 Training learning curve with image-wise accuracy

30° rotated. Based on the results, it is clear that CapsNet is able to capture affine
transformed data though it has not been trained with such data. Comparing the
accuracy with standard architectures, it is evident that CapsNet is more efficient
for COVID-19 chest X-ray classification, especially considering the computational
efficiency of the proposed architecture. As future scope, we wish to observe the
performance of CapsNet on a larger dataset in order to generalize the results.
Also, modifications of the architecture to include more convolutional layers might
yield a different set of features for better classification.

[Plot: cross-entropy loss vs. training time (seconds), showing training values, validation values, and the best saved model]

Fig. 9 Training learning curve with cross-entropy values

[Precision-recall curves for all classes, panels (a) and (b): precision (%) vs. recall (%)]

Fig. 10 (a) and (b) AUC (precision-recall) curves for the training and validation of the images

Deep Learning in Autoencoder
Framework and Shape Prior for Hand
Gesture Recognition

Badri Narayan Subudhi, T. Veerakumar, Sai Rakshit Harathas,


Rohan Prabhudesai, Venkatanareshbabu Kuppili, and Vinit Jakhetiya

1 Introduction

For the past several years, humans have interacted with computers and machines
using wired devices such as the keyboard and mouse, which have been adequate for
most tasks. The advancement in science and technology has led to the invention of
complex embedded systems requiring faster interfacing devices. These input devices
are well established but have limitations when it comes to the naturalness and speed
of human-machine interaction. Visual interpretation of gestures can be useful in
preserving the naturalness and ease of interaction.
Gesture is a type of nonverbal communication where a human being communicates
with the help of physical positions or movements of any body part either in place
of or in concurrence with speech. Hand gestures include physical positioning or
movement of fingers and sometimes entire hand. Gesture recognition process is a
subject in computer vision and communication fields with the aim of recognizing
human gestures using mathematical computation. Gesture recognition system can

B. N. Subudhi
Department of Electrical Engineering, Indian Institute of Technology Jammu, Nagrota, Jammu,
India
T. Veerakumar () · S. R. Harathas · R. Prabhudesai
Department of Electronics and Communication Engineering, National Institute of Technology
Goa, Farmagudi, Ponda, Goa, India
e-mail: tveerakumar@nitgoa.ac.in
V. Kuppili
Department of Computer Science and Engineering, National Institute of Technology Goa,
Farmagudi, Ponda, Goa, India
V. Jakhetiya
Department of Computer Science & Engineering, Indian Institute of Technology Jammu,
Nagrota, Jammu, India


be seen as a model for gadgets to interpret human body language, leading to it


replacing the hardware interfacing devices. It allows humans to interact with the
computer naturally without the use of any hardware devices. Using this concept of
hand gesture recognition, the cursor on the computer screen can be moved by simple
movement of fingers or hand accordingly. This could lead to decrease in use of input
devices such as mouse, keyboard, and also touch screens [1].
Recently, there has been a lot of interest in identifying hand gestures. This
hand gesture recognition system has numerous applications such as machinery
control, computer games, and in 3D world. Sign language describes one of the
most organized set of gestures. Gestures can be grouped mainly into two classes,
static and dynamic. A static gesture is a specific way of hand organization and pose
and is depicted by a single image. A dynamic gesture is a moving gesture and is
illustrated by a string of images. Real-time hand gesture recognition is used as a link
between human and computer and also as a medium of communication for deaf and
dumb people [1]. Research on hand gesture recognition techniques can be divided
into three classes: glove-based, vision-based, and drawing gesture approaches. The
two approaches glove-based analysis and analysis of drawing gestures involve
external hardware parts. One of the major challenges of using glove-based technique
includes limitation of hand movement. Also, being in continuous contact with
the devices can cause health hazards, where magnetic devices increase the risks
of cancer. The vision-based approach depends on how human beings discern
information from the surroundings. This method does not involve the use of any
external hardware part and thus makes the gestural interaction more empirical
preserving its naturalness. Vision-based approaches are comparatively easy, natural,
and cheap compared to the ones involving use of hardware [2].
In this article, a deep learning framework is proposed for hand gestures recog-
nition. In the proposed scheme, the stored hand gesture images are initially
preprocessed to extract the exact hand regions from the images. A set of histogram
of oriented gradients (HOG) features is then extracted from each of the hand gesture
image. A deep neural network with autoencoding framework is designed to train the
images from the database. With the trained deep neural net, a set of testing data
are checked for gesture recognition. The proposed scheme classifies the input hand
gestures into some predefined number of gesture classes. The proposed scheme
is tested on three different hand gesture databases. The results obtained by the
proposed scheme are compared with those of the four different state-of-the-art
techniques and are found to be better as compared to the existing state-of-the-art
techniques. The performance of the proposed scheme is validated by using different
percentage of the training samples with k-fold cross validation.
The organization of the remaining portion of this article is as follows. In Sect.
2 of the manuscript, the literature survey on hand gesture recognition is provided.
The proposed scheme with detailed description on each part is provided in Sect.
3. Section 4 presents the simulation results and discussions on three considered
databases. The conclusion of the proposed scheme and future works are drawn in
Sect. 5.

2 State-of-the-Art Techniques

Krueger et al. [3] considered the work on artificial reality, one of the fundamental
research efforts in which the user tries to communicate with the digital world.
Oka et al. [4] highlighted the various techniques that users can employ to
communicate with the outside world and also discussed fingertip movements in
images. They presented a methodology for recognizing symbolic gestures based on
fingertip motions. The authors used explicit invasive devices with color markers to
detect finger movements. They were able to detect the movements even against
complicated backgrounds, and the detection was independent of illumination.
Invasive devices are hardware devices placed on the human body (in this case, they
were placed directly on the hand to detect the gesture). Mitra and Acharya [5]
have worked on gesture recognition by analyzing
the various signs made by an individual involving various body parts like hands,
arms, face, etc. with importance given to the hand and facial gestures. Hidden
Markov model (HMM) is used by the authors, which is mostly used to remove the
spatiotemporal instability. Freeman and Roth [6] gave a view about histogram of
the oriented gradients for recognizing hand gestures, and he presented a method to
recognize hand gestures, developed by McConnell [7]. The said scheme uses local
orientation histograms. The features were extracted from histogram of orientation
gradients, which were further used to classify and divide the gestures into various
classes. Stergiopoulou and Papamarkos [8] used neural network to recognize hand
gestures and used the YCbCr color model instead of the RGB. The advantage in
YCbCr model is the hand can be segmented even if the lighting conditions are
poor and can be used for complex backgrounds. Atsalakis et al. [9] proposed a
theory for color estimation and color reduction. Both color and spatial features are
given as input to the above techniques, and similarity functions are used for vector
comparison. Thus, the above algorithm can be applied to any model independent of
the background. Chen et al. [10] worked on hand gesture recognition that consisted
of four stages. The first stage included acquiring the image from the camera and
tracking the moving hand to obtain the segmented hand region. This was followed
by extraction of features from the segmented hand image. Feature extraction
included the Fourier descriptors to obtain the spatial features. The temporal features
were obtained from the motion analysis of the hand gesture. The feature vector
consists of the temporal and the spatial features that were obtained in the previous
steps. The HMM was used for hand gesture classification. An accuracy of 90% was obtained for 20
classes. In the past years, the research community has witnessed substantial amount
of work done in hand gesture recognition using neural networks. One of the
significant works in hand gesture recognition was proposed by Symeonidis [11], which uses
the histogram of oriented gradients to obtain the feature vector. Malik [12] developed a
method that captures and recognizes hand gestures. The HMM is used for gesture
recognition. Using HMM, it is difficult to detect the initial and the final points, so
the Baum-Welch algorithm is used, through which the unknown parameters of the HMM can be
obtained. Hasan [13] put forward a hand gesture recognition technique using depth

camera. Depth of the camera is defined as the distance of the image from the camera.
The features used for classifying the images into different classes are area and the
orientation. To find the depth of the image, kinect sensor can be used that can track
the body movement of the person. A real-time system was proposed by Ao Tang
et al. [14] to recognize the gestures. Both color and depth are used as features of
the image and classified using deep neural networks. Use of principal component
analysis for hand gesture recognition is also well cited [15].
Gesture recognition with skin color detection has gained popularity in the last
decade [16]. A gesture recognition scheme was proposed by Kawulok et al. [17] in which
a new skin detection algorithm for color images performs spatial analysis with newly
proposed texture-based discriminative skin-presence features. A feature fusion
scheme is also explored in this regard for gesture recognition. Yao and Fu [18]
proposed a hand gesture recognition scheme in which semiautomatic labeling of
RGB color and depth data is used to label hand patches. The 3D contour model is used to
acquire the hand region, and nearest neighbor method is used for correspondence. A
new superpixel-based hand gesture recognition scheme is developed by Wang et al.
[19], where the skeleton information from kinect is used for extraction of the hand
region. The textures and depths are represented in the form of superpixels to retain
the shape information. The earthmover distance is used to measure the dissimilarity
between the hand gestures. Chen et al. [20] proposed a gesture recognition system
using an interactive image segmentation scheme. The authors have used Gaussian
Mixture Model for image modeling, and Expectation Maximum algorithm is used
to learn the parameters of it. Gibbs random field is used for image segmentation.

3 Proposed Gesture Recognition Scheme

Considering the state-of-the-art literature on hand gesture recognition, it may be
concluded that hand gesture recognition is an important task to be addressed.
From medical diagnosis systems to daily-life aids for blind people, hand gesture
recognition has important applications. Hand gesture recognition using neural
network classifiers is reported in the literature. However, a major drawback is that
most of these works consider low-level intensity features as the input to the neural
classifier. The neural classifier is also found not to provide good accuracy in the
gesture recognition task. To the authors' knowledge, none of the works in the
state-of-the-art literature on hand gesture recognition reported the use of HOG
features with a deep neural network. In the proposed scheme, we have explored the
advantages of both for hand gesture recognition.
The proposed scheme consists of mainly three stages: preprocessing, feature
extraction, and classification. Block diagram of the proposed gesture recognition
scheme is provided in Fig. 1. The images of hand gestures are stored in a database
and are preprocessed to make them suitable for further stages of processing. In the
subsequent stage, the feature vectors are extracted from each gesture image. The

[Block diagram: Input Hand Gesture Image → Preprocessing → Feature Extraction → Classification Using Deep NN → Recognized Gesture Output]

Fig. 1 Block diagram of the proposed gesture recognition scheme

feature vectors are used to train the set of images using deep learning framework
based on which the new hand gesture images from test set are classified into
predefined finite number of classes.

3.1 Preprocessing

A good hand gesture recognition system has to tackle many difficulties like
illumination variations, complex backgrounds, and scaling. Illumination variations
may badly affect the extracted hand skin region under different lighting
conditions. In complex backgrounds, images contain other objects in the scene along
with the hand pixels. These objects may have colors similar to skin, which leads to
misclassification. Hand poses have different sizes in the gesture
image, which gives rise to a scaling effect. Hand gesture images vary with respect to the size
and the quality, depending on the image capturing source. Hand gesture images are
also affected by the background and the lighting conditions. The background pixels
may sometimes resemble as the hand pixels, which makes it difficult to recognize
the gesture. Image of the same hand gesture taken in different lighting conditions
may give different results. Therefore, it is necessary to convert all the images into a
suitable form where the hand pixels are easily separable from the non-hand pixels.
In the proposed scheme, we have used the following steps for preprocessing: color
conversion, background removal, bounding box, and resizing.

3.1.1 Color Space Conversion

The first step in preprocessing is to eliminate the luminance effect by splitting


the image in RGB color space to luminance and chrominance components. In the
proposed scheme, YCbCr is used to differentiate the luminance and chrominance
components, where Y is the luminance part and Cb and Cr are the chrominance
components [13]. Figure 2 shows three examples of gesture images: gesture 1,
gesture 3, and gesture 5, where the RGB to YCbCr color conversion results with
the individual histogram for Y, Cb, and Cr components are presented.
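A minimal sketch of this conversion step is given below, assuming OpenCV and a hypothetical input file "gesture1.png". Note that OpenCV loads images in BGR order and its conversion constant is COLOR_BGR2YCrCb (channel order Y, Cr, Cb); the histogram calls only illustrate how the per-channel histograms of Fig. 2 could be obtained.

```python
# Sketch of RGB (BGR) to YCbCr conversion and per-channel histograms with OpenCV.
import cv2

bgr = cv2.imread("gesture1.png")                  # hypothetical gesture image
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)    # split luminance/chrominance
Y, Cr, Cb = cv2.split(ycrcb)

# Per-channel histograms, as visualised in Fig. 2
hist_y  = cv2.calcHist([Y],  [0], None, [256], [0, 256])
hist_cb = cv2.calcHist([Cb], [0], None, [256], [0, 256])
hist_cr = cv2.calcHist([Cr], [0], None, [256], [0, 256])
```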

3.1.2 Background Removal

To recognize the hand gesture, it is necessary to separate the hand pixels from
non-hand pixels. This is done by thresholding the YCbCr image. This results in
separation of hand pixels from non-hand pixels. After obtaining the segmented form
of the hand gesture from the YCbCr color model, various morphological operations,
dilation and erosion operations, are used to obtain the proper preprocessed image.
The background is black in color while the hand is white. Figure 3 shows three
examples of gesture images, gesture 2, gesture 3, and gesture 5, where the RGB
to YCbCr color converted images are thresholded by Kapur’s thresholding scheme
[21]. The obtained image may have isolated holes and noisy object pixels. A
combined operation of dilation and erosion is used on the thresholded images to
remove noisy object pixels, which is shown in Fig. 3. From this figure, it is possible
to observe that segmentation of the hand gesture is very clear.
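The sketch below illustrates this background-removal step. The chapter uses Kapur's entropy thresholding [21]; here a fixed Cb/Cr skin range is used as a simple stand-in (the range values are assumptions), followed by the combined dilation/erosion clean-up described above.

```python
# Sketch of background removal: skin-range thresholding in YCrCb space plus
# morphological closing/opening to remove isolated holes and noisy pixels.
import cv2
import numpy as np

def segment_hand(ycrcb):
    """Return a binary mask: hand pixels white, background black."""
    # Approximate skin range in (Y, Cr, Cb); an assumption, not the paper's thresholds
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # drop noisy specks
    return mask
```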

3.1.3 Bounding Box and Resizing

The bounding box is used to find a minimum box containing the maximum
percentage of hand pixels in gesture image. The bounding box has edges parallel
to the Cartesian coordinate axis. It is mathematically expressed as an array of
coordinate axis of the edges of the box. The hand image is then cropped to overcome
the effects of hand size and camera depth. The images are then resized to a particular
size that seems optimal for offering enough data. Figure 4 shows output of the hand
gesture after using bounding box for two images of gesture 4.
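A short sketch of the bounding-box and resizing step follows: the tightest axis-aligned box around the white hand pixels is found, the image is cropped to it, and the crop is resized to a fixed resolution. The 64 × 64 target size is an assumption for illustration.

```python
# Sketch of bounding-box extraction, cropping and resizing of the hand mask.
import cv2
import numpy as np

def crop_and_resize(mask, size=(64, 64)):
    ys, xs = np.nonzero(mask)                 # coordinates of hand pixels
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()
    cropped = mask[y0:y1 + 1, x0:x1 + 1]      # minimum box containing the hand
    return cv2.resize(cropped, size, interpolation=cv2.INTER_NEAREST)
```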

3.2 Feature Extraction

It is a challenging task to recognize human gestures because of the variable appear-


ance and different poses. Therefore, it is necessary to extract the robust features,
which can uniquely define any gesture in testing conditions. Local orientation
gradients or the boundary directions can be used to determine the shape and the

Fig. 2 Conversion of RGB to YCbCr color space and YCbCr histogram separately for images:
gesture 1, gesture 3, and gesture 5

Fig. 3 Segmentation of hand images for background removal for images: gesture 2, gesture 3, and
gesture 5

Fig. 4 Output of the hand gesture after using bounding box for image: gesture 4

appearance of the object. The histogram of these orientation gradients known as


HOG is used as the feature vector for this work. Histograms of oriented gradients
have been widely used as an important feature for different computer vision
applications including segmentation [22], pose estimation [23], object detection and
tracking [24], video skimming [25], etc. It captures the local object appearance
and shape perfectly and is also invariant to local geometric and photometric
transformations like translation or rotation and global illumination changes. To
extract these features, at each frame location of the video, a 1D histogram of
gradient directions or edge orientations is accumulated. For better illumination
invariance and to reduce the effects of shadow, the histograms are normalized. The
normalized descriptor histogram is referred to as histograms of oriented gradients
(HOG) descriptors.
This process counts the frequency of the oriented gradients in localized portions
of an image. The image is divided into small localized portions called cells, and
for the pixels within each cell, a histogram of orientation gradients is found. The
feature vector is the combination of the occurrences of particular range of oriented
gradient values. The main idea behind the histogram of orientation gradients (HOG)
feature vector is that the shape and the appearance of the object can be described by
the frequency of the edge directions. The extracted HOG feature of gesture 1 and
gesture 2 is shown in Fig. 5.
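A hedged sketch of HOG extraction with scikit-image is shown below; the 9 orientation bins and the cell/block sizes are common defaults, not values stated in the chapter.

```python
# Sketch of HOG feature extraction for a preprocessed grayscale gesture image.
from skimage.feature import hog

def hog_features(gray_image):
    features, hog_image = hog(
        gray_image,
        orientations=9,
        pixels_per_cell=(8, 8),
        cells_per_block=(2, 2),
        block_norm="L2-Hys",       # block normalisation for illumination invariance
        visualize=True,
    )
    return features, hog_image     # feature vector and a visualisation like Fig. 5
```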

Fig. 5 Extracted HOG feature image and plot for images: gesture 1 and gesture 2

3.3 Classification

The gradient descent method is mainly employed in neural networks, and the
framework is commonly known as the multilayer back propagation method. Here,
the error function (obtained by taking the difference between the target vector
and the predicted output vector) is to be minimized to obtain the optimal weights
connecting the neurons. In simple words, it is the method used to compute the
gradients of the error function with respect to the network weights, governed by
predefined hyperparameters such as the learning rate and the number of epochs,
with the aim of minimizing the error. In back propagation, the errors are propagated
"backwards" through the computations performed to obtain the loss, and the
weights are changed in the direction that reduces the error. Given the error at the
output of the network, the back propagation algorithm modifies or redistributes the
weights of the network so that the error becomes as small as possible [26].
Deep learning [27] is a type of machine learning algorithm that uses more than
one hidden layer. It involves a cascade of many layers of nonlinear processing units
to obtain the optimal features. Deep learning is mainly used in classification and
pattern analysis, learning multiple levels of features where each consecutive layer
uses the output of the previous layer as its input. In the most general case, there are two sets
of neurons consisting of nodes as defined. The first set of neurons receives the input
from the input layer, which passes the modified input to the other set of neurons.
Deep neural network consists of multiple processing layers between the input layer
and the output layer with multiple linear and nonlinear conversions. Architecture of
deep neural network is provided in Fig. 6.
Deep neural network can be viewed as an artificial neural network that consists
of many hidden layers between the input and the output nodes. It divides the desired
complicated mapping into a series of nested multiple simple mappings each defined
by a separate layer. The input layer is called the visible layer as the original input is
known to us. This is followed by a sequence of hidden layers and an output layer. In
a gesture dataset, various shapes of the fingers and palm express the human gesture. Hence,
such data contain many factors of variation. Deep learning algorithms
have already established their effectiveness in capturing the statistical variations in the

Fig. 6 Architecture of deep neural network (input layer, hidden layers, output layer)

data and are hence able to easily discriminate the important variations as different
gestures for recognition.
In the proposed scheme, the considered deep neural network architecture is a
feed-forward network, which has more than one layer of hidden units between the
input and output layers. Each neuron in the hidden layer is indexed by j, and a
logistic (sigmoid) activation function f is used. This is expressed as

y_j = f(x_j) = \frac{1}{1 + e^{-x_j}}        (1)

where y_j is the output of a neural unit and x_j is its input. The input x_j is
defined as

x_j = b_j + \sum_i y_i w_{ij}        (2)

where b_j is the input bias for the jth neuron and w_{ij} is the weight connecting
the ith neuron and the jth neuron. For each gesture class, the output unit j
converts its input into a class probability Pr_j using the softmax function:

\Pr_j = \frac{e^{x_j}}{\sum_k e^{x_k}}        (3)

where k runs over the gesture classes.
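To make Eqs. (1)–(3) concrete, the following minimal NumPy sketch computes the hidden-layer activations and the softmax class probabilities for one input feature vector; the weight shapes and names are illustrative assumptions, not the chapter's actual implementation.

```python
# Minimal NumPy sketch of Eqs. (1)-(3): a sigmoid hidden layer followed by a
# softmax output layer converting activations into class probabilities.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                 # Eq. (1)

def forward(y_in, W, b, V, c):
    x_hidden = b + y_in @ W                         # Eq. (2): x_j = b_j + sum_i y_i w_ij
    y_hidden = sigmoid(x_hidden)                    # hidden-layer outputs y_j
    x_out = c + y_hidden @ V                        # output-layer activations
    e = np.exp(x_out - x_out.max())                 # numerically stable softmax
    return e / e.sum()                              # Eq. (3): probabilities Pr_j

rng = np.random.default_rng(0)
hog_vec = rng.random(81)                            # a toy HOG feature vector
W, b = rng.normal(size=(81, 32)), np.zeros(32)      # assumed hidden-layer size
V, c = rng.normal(size=(32, 5)), np.zeros(5)        # 5 gesture classes (NITG-like)
print(forward(hog_vec, W, b, V, c).sum())           # probabilities sum to 1
```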


In the proposed scheme, we have used the autoencoder [28] configuration for
deep neural network. An autoencoder network consists of unsupervised learning

Fig. 7 Diagram of an
x1 x’1
autoencoder

x2 z1 x’2

x3 z2 x’3

x4 z3 x’4

x5 x’5

network that applies back propagation by defining the target values same as the
inputs. The autoencoder tries to learn the function with the aim of making output
similar to the input. It is often trained using the methods of back propagation.
Architecture of the autoencoder is given in Fig. 7. The complexity of the network is
kept manageable by requiring the dimension of the input to equal the dimension of
the output, i.e., x' = x. Thus, the objective function is defined as

J = \sum_{n=1}^{N} \left\| x(n) - \hat{x}(n) \right\|^2        (4)

where \hat{x}(n) = Q^{-1} Q x(n) and Q is the full transformation matrix. Hence, the
modified objective function is written as

J = \sum_{n=1}^{N} \left\| x(n) - Q^{-1} Q x(n) \right\|^2        (5)

In the training phase of the autoencoder, for input x, activation function is used
at each unit of the hidden layer, as defined in Eq. (1). Then in the output layer, the
output x’ is obtained. The back propagation algorithm is used to back propagate the
error to the designed network to perform the weight updates [28].
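A hedged Keras sketch of such an autoencoder is given below, following the 5–3–5 structure of Fig. 7: the target equals the input (x' = x) and a squared reconstruction error analogous to Eq. (4) is minimised by back propagation. The layer sizes, activations, and optimiser settings are assumptions, not the chapter's configuration.

```python
# Sketch of a small autoencoder trained to reconstruct its input (x' = x).
import tensorflow as tf

input_dim, code_dim = 5, 3                     # matches the 5-3-5 diagram in Fig. 7
inputs = tf.keras.Input(shape=(input_dim,))
code = tf.keras.layers.Dense(code_dim, activation="sigmoid")(inputs)
outputs = tf.keras.layers.Dense(input_dim, activation="linear")(code)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")   # squared error, cf. Eq. (4)

# x_train is a hypothetical (N, 5) array; the target is the input itself:
# autoencoder.fit(x_train, x_train, epochs=50, batch_size=32)
```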

4 Simulation Results and Discussions

The proposed algorithm is run on Pentium D, 2.8 GHz PC with 2GB RAM,
Ubuntu operating system, and Python. We have tested with three hand gesture
image databases: HGR1 [29], Thomas Moeslund’s Gesture Recognition [30], and
NITG. The results obtained by the proposed scheme are compared with that of

the different existing state-of-the-art techniques. The evaluation of the proposed


scheme is carried out with four state-of-the-art-techniques: neural network with
back propagation [14], skin detection using discriminative feature [17], contour
model [18], and interactive segmentation [20] based hand gesture recognition
schemes.
HGR1 Database This database contains the gestures from Polish Sign Language
and American Sign Language. In addition, some special signs were included. The
database was developed as a part of the hand detection and pose estimation project,
supported by the Polish Ministry of Science and Higher Education under research
grant no. IP2011 023071 [29]. This database contains a total of 899 images
for 25 different gestures. The considered gestures are obtained from 12 different
individuals. All the considered images are taken in uncontrolled background and
lighting condition. The considered images vary from size 174 × 131 up to
640 × 480. The proposed scheme is applied on HGR1 database. The preprocessing
scheme with HOG feature extraction scheme is applied on this database. The deep
neural network scheme is applied considering the k-fold cross-validation scheme.
For training purposes, we have considered the training sample of 70%, 50%, 40%,
30%, 20%, 15%, 10%, 7%, 5%, and 3%. For each training set, we have run the
proposed deep neural network framework for 50 random runs. The corresponding
results obtained on different training samples are presented in Table 1. We have
considered the best, the worst, and the average values of the output in percentage.
The performance of the proposed scheme is evaluated by comparing it against four
other state-of-the-art-techniques: NN using back propagation [14], discriminative
feature [17], contour model [18], and interactive segmentation [20] based hand
gesture recognition scheme. It may be observed from this table that at lower
percentages of training samples, the proposed scheme outperforms the other
considered schemes.
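The evaluation protocol above can be sketched as follows: for each training percentage, 50 random train/test splits are drawn and the best, worst, and average test accuracies are recorded. The MLP classifier used here is only a stand-in for the chapter's HOG-plus-deep-autoencoder network, so the code is a hedged illustration of the protocol, not the reported implementation.

```python
# Sketch of the "vary the training percentage, 50 random runs" evaluation loop.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def evaluate(X, y, fractions=(0.7, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1, 0.07, 0.05, 0.03),
             runs=50):
    results = {}
    for frac in fractions:
        accs = []
        for seed in range(runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=frac, random_state=seed)
            clf = MLPClassifier(hidden_layer_sizes=(256, 128),
                                max_iter=300, random_state=seed)
            clf.fit(X_tr, y_tr)
            accs.append(accuracy_score(y_te, clf.predict(X_te)))
        results[frac] = (max(accs), min(accs), float(np.mean(accs)))  # best, worst, avg.
    return results
```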
Thomas Moeslund’s Gesture Recognition Database This database contains
2060 images of hand gesture for different static signs used in the English alphabet.
All the considered images are grayscale image with a dimension of 248 × 256.
All the gestures images are obtained in controlled environmental condition with a
dark background, and the person’s arm is covered with a similar black cloth. All
the gestures are performed at various scales, translations, and rotations in the plane
parallel to the image plane. The number of images captured for each gesture is
not constant and varies gesture to gesture. A total of 24 gestures are recorded in
the image format [30]. The performance of the proposed scheme is compared with
those of the other four considered state-of-the-art techniques. The results obtained
by all these methods and the proposed scheme are provided in Table 2. It may
be observed that the proposed scheme outperforms all the considered schemes in
terms of the percentage of correct classifications.
NITG Database This database contains a total of 875 images for five different
gestures. The considered gestures are number 1, 2, 3, 4, and 5. For each considered
number, 175 images by 25 individuals are considered with uncontrolled (different
Table 1 Evaluation of HGR1 database (each cell: Best / Worst / Avg. accuracy in %)

Training %age | NN using back propagation [14] | Discriminative feature [17] | Contour model [18] | Iterative segmentation [20] | Proposed scheme
70  | 95 / 92 / 93 | 96 / 94 / 96 | 96 / 93 / 96 | 97 / 94 / 95 | 97 / 95 / 96
50  | 93 / 90 / 92 | 95 / 92 / 94 | 94 / 91 / 95 | 95 / 93 / 93 | 96 / 93 / 94
40  | 91 / 87 / 87 | 92 / 88 / 90 | 91 / 89 / 91 | 94 / 92 / 93 | 96 / 92 / 94
30  | 90 / 85 / 88 | 90 / 87 / 89 | 89 / 88 / 90 | 93 / 90 / 91 | 95 / 92 / 93
20  | 88 / 85 / 86 | 89 / 85 / 86 | 87 / 85 / 88 | 93 / 89 / 90 | 94 / 91 / 92
15  | 85 / 80 / 81 | 85 / 82 / 83 | 87 / 84 / 86 | 92 / 89 / 89 | 93 / 91 / 91
10  | 80 / 74 / 78 | 80 / 75 / 80 | 85 / 81 / 84 | 90 / 87 / 87 | 92 / 90 / 90
7   | 71 / 65 / 68 | 74 / 65 / 72 | 75 / 68 / 76 | 88 / 82 / 80 | 91 / 89 / 86
5   | 62 / 58 / 61 | 65 / 52 / 64 | 68 / 56 / 69 | 78 / 65 / 74 | 89 / 85 / 84
3   | 59 / 39 / 56 | 64 / 44 / 60 | 61 / 45 / 64 | 65 / 55 / 62 | 82 / 72 / 79
Table 2 Evaluation of Thomas Moeslund's database (each cell: Best / Worst / Avg. accuracy in %)

Training %age | NN using back propagation [14] | Discriminative feature [17] | Contour model [18] | Iterative segmentation [20] | Proposed scheme
70  | 98 / 95 / 96 | 97 / 96 / 95 | 97 / 95 / 94 | 98 / 96 / 96 | 98 / 97 / 96
50  | 97 / 95 / 94 | 96 / 95 / 93 | 95 / 94 / 94 | 96 / 95 / 94 | 97 / 95 / 95
40  | 95 / 93 / 90 | 95 / 92 / 91 | 94 / 93 / 92 | 96 / 94 / 93 | 97 / 95 / 94
30  | 93 / 90 / 88 | 95 / 89 / 90 | 94 / 90 / 91 | 95 / 90 / 92 | 96 / 92 / 94
20  | 91 / 88 / 85 | 94 / 85 / 89 | 93 / 89 / 88 | 94 / 87 / 89 | 95 / 90 / 91
15  | 89 / 78 / 81 | 92 / 80 / 84 | 90 / 83 / 85 | 92 / 85 / 86 | 94 / 88 / 89
10  | 85 / 72 / 76 | 90 / 78 / 81 | 88 / 81 / 80 | 90 / 80 / 82 | 94 / 85 / 87
7   | 80 / 65 / 71 | 88 / 73 / 75 | 81 / 72 / 76 | 85 / 77 / 78 | 92 / 83 / 86
5   | 75 / 55 / 63 | 78 / 65 / 71 | 75 / 68 / 70 | 80 / 70 / 72 | 90 / 80 / 82
3   | 68 / 39 / 51 | 70 / 55 / 68 | 72 / 58 / 65 | 75 / 65 / 68 | 85 / 75 / 79

illumination, textural background) environmental conditions. All the considered


images are of size varying from 198 × to 512 × 320. The proposed scheme is
applied on this database. The preprocessing scheme with HOG feature extraction
scheme is applied on this database. The deep neural network scheme is applied on
this database considering the k-fold cross-validation scheme.
For training purposes, we have considered the training sample of 70%, 50%,
40%, 30%, 20%, 15%, 10%, 7%, 5%, and 3%. For each training set, we have run the
proposed deep neural network framework for 50 random runs. The corresponding
results obtained on different training samples are presented in Table 3. We have
considered the best, worst, and the average values of the output in percentage (%).
It can be seen that for higher training percentage, the accuracy for both neural
network and deep learning is higher. But as the training percentage decreases, the
classification accuracy also decreases. For training percentage 7%, 5%, and 3%,
the average accuracy is 72%, 64%, and 59%, respectively, for backpropagation and
88%, 88%, and 77%, respectively, for deep neural network. It is observed that the
accuracy for lower training percentage is better for deep neural network as compared
to back propagation and the other three considered state-of-the-art techniques.
For more clarity on the performance of the proposed scheme, the confusion
matrices obtained for the training data of 3%, 5%, 7%, 10%, 15%, 20%, 30%, and
40% are provided in Fig. 8. It may be observed from these results that a significant
amount of gesture recognition accuracy is obtained by the proposed scheme with a
lesser percentage of training data.
Plots of classification accuracy (in %) against training percentage for all the
considered techniques and the proposed scheme, for the average over 50 runs as
well as the best and worst performance, are provided in Fig. 9a–c. It may be observed
from all these plots that a better result is obtained on NITG database by the proposed
scheme as compared to all the considered techniques. By comparing these results, it
is also observed that the results obtained by the iterative segmentation technique
provide a close accuracy at higher training data. However, the performance by
neural network with back propagation, contour model, and discriminative features
provided lesser accuracy than the proposed scheme. This corroborates our findings.

5 Conclusions and Future Works

Vision-based interfaces are feasible and popular at this moment because the
computer is able to communicate with the user through a webcam without requiring
any interfacing device between human and machine. Hence, users are able
to perform human-machine interaction (HMI) with these user-friendly features,
preserving naturalness at higher speed. Therefore, the computer vision
algorithm should be reliable and fast. There should be no delay between the
gestures being captured and the response time of the computer in recognizing the
Table 3 Evaluation of NITG database (each cell: Best / Worst / Avg. accuracy in %)

Training %age | NN using back propagation [14] | Discriminative feature [17] | Contour model [18] | Iterative segmentation [20] | Proposed scheme
70  | 98 / 90 / 95 | 98 / 94 / 96 | 99 / 95 / 96 | 99 / 96 / 96 | 99 / 96 / 97
50  | 96 / 89 / 94 | 96 / 92 / 94 | 97 / 93 / 95 | 98 / 94 / 95 | 98 / 94 / 96
40  | 95 / 90 / 92 | 96 / 91 / 93 | 96 / 92 / 94 | 96 / 93 / 94 | 96 / 94 / 95
30  | 96 / 87 / 91 | 97 / 88 / 92 | 98 / 88 / 92 | 98 / 89 / 92 | 98 / 89 / 93
20  | 91 / 72 / 86 | 93 / 85 / 89 | 95 / 87 / 90 | 96 / 88 / 91 | 96 / 89 / 92
15  | 93 / 60 / 82 | 93 / 80 / 88 | 94 / 85 / 89 | 95 / 86 / 90 | 95 / 87 / 92
10  | 85 / 62 / 77 | 90 / 77 / 80 | 92 / 80 / 85 | 93 / 82 / 88 | 95 / 83 / 90
7   | 88 / 45 / 72 | 89 / 65 / 78 | 90 / 78 / 80 | 91 / 80 / 85 | 93 / 84 / 88
5   | 76 / 35 / 64 | 81 / 55 / 71 | 85 / 70 / 75 | 88 / 75 / 80 | 92 / 83 / 88
3   | 79 / 28 / 59 | 80 / 50 / 65 | 81 / 65 / 72 | 85 / 68 / 75 | 83 / 70 / 77

Fig. 8 Confusion matrices with training of (a) 3%, (b) 5%, (c) 7%, (d) 10%, (e) 15%, (f) 20%,
(g) 30%, and (h) 40%

gesture. Also, the vision-based interfaces are low cost compared to the other gesture
recognition techniques, which make use of interfacing devices.
In this article, we have designed a system that aims at recognizing static
hand gestures using vision-based approach. The images of hand gestures taken
from the camera are firstly preprocessed to extract the hand part from the image.
This is followed by the extraction of histogram of orientation gradients (HOG)
feature vector. We have classified the hand gestures using deep neural network.
The proposed scheme is tested on three different hand gesture databases: HGR1,
Moeslund, and NITG. The results obtained are compared against four state-of-the-
art techniques. From the results, we can conclude that the accuracy of deep neural

[Three plots (average, best, and worst performance): classification accuracy (%) vs. training percentage for NN using back propagation, discriminative feature, contour model, iterative segmentation, and the proposed scheme]

Fig. 9 Classification accuracy vs. training percentage plot for (a) average performance, (b) best
performance, and (c) worst performance

network is much better. Also, with less training data, we have achieved higher
accuracy. It may be observed from the results that, on average, a 14% improvement
in recognition accuracy is obtained by the proposed scheme against the competitive
state-of-the-art recognition techniques. The proposed method also helps in
overcoming the effects of illumination and camera depth.
In the future, this work will be extended to real-world applications by designing
a smartphone app. Using such an app on a smartphone, a deaf and dumb person can
interact with other people for their daily needs.

References

1. Pavlovic, V., Sharma, R., & Huang, T. (1997). Visual interpretation of hand gestures for
human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 19(7), 677–695.

2. Stenger, B., Thayananthan, A., Torr, P., & Cipolla, R. (2006). Model-based hand tracking using
a hierarchical Bayesian filter. IEEE Transactions on Pattern Analysis and Machine Intelligence,
28(9), 1372–1384.
3. Krueger, M. (1991). Artificial reality. Addison-Wesley Professional.
4. Oka, K., Sato, Y., & Koike, H. (2002). Real-time fingertip tracking and gesture recognition.
University of Tokyo, Report.
5. Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. IEEE Transactions on Systems,
Man, and Cybernetics—Part C: Applications and Reviews, 37(3), 311.
6. Freeman, W. T., & Roth, M. (1995). Orientation histograms for hand gesture recognition. In
IEEE International Workshop on Automatic Face and Gesture Recognition.
7. McConnell, R. K. (1986). Method of and apparatus for pattern recognition. U. S. Patent No.
4,567,610.
8. Stergiopoulou, E., & Papamarkos, N. (2009). Hand gesture recognition using a neural network
shape fitting technique. Engineering Applications of Artificial Intelligence, 22(8), 1141–1158.
9. Atsalakis, A., Papamarkos, N., & Andreadis, I. (2005). Image dominant colors estimation and
color reduction via a new self-growing and self-organized neural network. In Proceedings
of the 10th Iberoamerican Congress conference on Progress in Pattern Recognition, Image
Analysis and Applications
10. Chen, F. S., Fu, C. M., & Huang, C. L. (2003). Hand gesture recognition using a real-time
tracking method and hidden Markov models. Image Vision Computing, 21(8), 745–758.
11. Symeonidis, K. (2000, August 23). Hand gesture recognition using neural networks. School of
Electronic and Electrical Engineering, Report.
12. Malik, S. (2003, December 18). Real-time hand tracking and finger tracking for interaction.
CSC2503F Project Report.
13. Hasan, M. M. (2010). HSV brightness factor matching for gesture recognition system.
International Journal of Image Processing (IJIP), 4(5), 456–467.
14. Tang, A., Lu, K., Wang, Y., Huang, J., & Li, H. (2013). A real-time hand posture recognition
system using deep neural networks. ACM Transactions on Intelligent Systems and Technology,
9(4), 1–23.
15. Birk, H., Moeslund, T. B., & Madsen, C. B. (1997). Real-time recognition of hand alphabet
gestures using principal component. In 10th Scandinavian Conference on Image Analysis (pp.
261–268).
16. Kawulok, M. (2013). Fast propagation-based skin regions segmentation in color images.
In 10th IEEE International Conference and Workshops on Automatic Face and Gesture
Recognition (FG), Shanghai (pp. 1–7).
17. Kawulok, M., Kawulok, J., & Nalepa, J. (2014). Spatial-based skin detection using discrimi-
native skin-presence features. Pattern Recognition Letters, 41, 3–13.
18. Yao, Y., & Fu, Y. (2014). Contour model-based hand-gesture recognition using the kinect
sensor. IEEE Transactions on Circuits and Systems for Video Technology, 24(11), 1935–1944.
19. Wang, C., Liu, Z., & Chan, S. C. (2015). Superpixel-based hand gesture recognition with kinect
depth camera. IEEE Transactions on Multimedia, 17(1), 29–39.
20. Chen, D., Li, G., Sun, Y., Kong, J., Jiang, G., Tang, H., Ju, Z., Yu, H., & Liu, H. (2017). An
interactive image segmentation method in hand gesture recognition. Sensors, 7(2), 253–270.
21. Kapur, J. N., Sahoo, P. K., & Wong, A. K. C. (1985). A new method for gray-level picture
thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image
Processing, 29(3), 273–285.
22. Kato, T., Relator, R., Ngouv, H., Hirohashi, Y., Takaki, O., Kakimoto, T., & Okada, K. (2011).
Segmental HOG: New descriptor for glomerulus detection in kidney microscopy image. BMC
Bioinformatics, 16(1), 1–16.
23. Wang, B., Liang, W., Wang, Y., & Liang, Y. (2013). Head pose estimation with combined
2D SIFT and 3D HOG features. In Seventh International Conference on Image and Graphics,
Qingdao (pp. 650–655).
24. Jung, H., Tan, J. K., Ishikawa, S., & Morie, T. (2011). Applying HOG feature to the detection
and tracking of a human on a bicycle. In 11th International Conference on Control, Automation
and Systems, Gyeonggi-do (pp. 1740–1743).

25. Subudhi, B. N., Veerakumar, T., Yadav, D., Suryavanshi, A. P., & Disha, S. N. (2017). Video
skimming for lecture video sequences using histogram based low level features. In IEEE 7th
International Advance Computing Conference (pp. 684–689).
26. Ahmed, T. (2012). A neural network based real time hand gesture recognition system.
International Journal of Computer Applications, 59(4), 17.
27. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. (2010). Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion. The Journal of Machine Learning Research, 11, 3371–3408.
28. Xue, G., Liu, S., & Ma, Y. (2020). A hybrid deep learning-based fruit classification using
attention model and convolution autoencoder. Complex Intelligent Systems.
29. Database for hand gesture recognition [Online]. Available: http://sun.aei.polsl.pl/~mkawulok/
gestures/. Accessed April 24, 2020.
30. Moeslund Gesture [Online]. Available: http://www-prima.inrialpes.fr/FGnet/data/12-
MoeslundGesture/database.html. Accessed June 16, 2020.
Hierarchical-Based Semantic
Segmentation of 3D Point Cloud Using
Deep Learning

J. Narasimhamurthy, Karthikeyan Vaiapury, Ramanathan Muthuganapathy,


and Balamuralidhar Purushothaman

1 Introduction

Deep learning and convolutional neural networks have shown promising results on
various computer vision problems like image classification, image segmentation,
face detection, image inpainting, etc. These results are made possible due to enor-
mous amounts of image datasets and the great emphasis laid on annotation.
Similar techniques were ideated and implemented on 3D datasets like voxelized 3D
shapes, triangulated 3D meshes, and 3D point cloud data [7, 12].
In this work, we are interested in performing analysis on 3D point cloud data,
which are inherently available in an unordered and unstructured form. Point clouds
also have variable number of points in each scene, which makes it even challenging
and harder to deal with. Existing approaches on image datasets fail to work with
unstructured data. Advanced CNNs-based technology usually requires structured
grids of data as input. Naive approach for dealing with raw and unstructured point
cloud data would be to voxelize the data, which involves lot of preprocessing and
loss in data. Recently, techniques like Randla-net [1] and recurrent slice networks
[2] are used for efficient semantic segmentation of large-scale point clouds.
Some of the works make the input point clouds order independent by building
permutation invariant representations for the data points like building a kd tree in
[13]. But these approaches come with own limitations that we can input only point
clouds with size of the form 2∧n.

J. Narasimhamurthy () · K. Vaiapury · B. Purushothaman


TCS Research and Innovation, Bangalore, Karnataka, India
e-mail: karthikeyan.vaiapury@tcs.com; balamurali.p@tcs.com
R. Muthuganapathy
Department of Engineering design, IIT Madras, Chennai, Tamil Nadu, India
e-mail: mraman@iitm.ac.in


Some recent techniques like [3] propose a new method of extracting order-
invariant features using global average pooling. PointNet was the first paper
to propose this architecture, which does not involve any domain transformation
of the point clouds. It has achieved state-of-the-art results in shape classification,
segmentation, and retrieval tasks.
Very recent approaches like [9] propose a similar technique to address the
issue of order invariance. It computes Fisher vectors as derivatives of the Gaussian
likelihood of all the points. As the Gaussian likelihood involves a summation, the
sum makes the Fisher vectors order independent. They have shown state-of-the-art
results in shape classification and segmentation.
Main contributions of our work include:
(a) Ability to perform semantic segmentation on point clouds with variable number
of sizes using global max pooling.
(b) Redefining the traditional segmentation problem into an independent per point
classification problem aided by two-step hierarchy of local and global features.
(c) Using classification as an auxiliary task for extracting higher-level features.

2 Related Work

Deep learning on 3D data broadly falls into one of the following three categories:
(a) voxel based, (b) multi view based, and (c) graph based. Extending the successful
convolutional neural network architectures to 3D shape analysis involves an exponential
computation cost, which limits usage to lower-resolution voxelized grids,
leading to poor performance. Also, performing convolution inside the object is
not particularly helpful. Some previous works like [10] address this issue to
compute convolution effectively on the surface voxels by defining a new convolution
operation that adapts to the shape of the surface voxels instead of a conventional
rectangular kernel.
Multi-view-based architectures leverage the advantages of image-based
CNNs in that they have similar data for deep processing. Usually in these approaches,
the 3D data is projected onto different planes, and these images are analyzed using
deep networks; [6, 11], etc., fall under this category. Graph-based methods perform
spectral convolutions on the graph constructed using 3D meshes.
With the development of 3D point cloud sensors like LiDAR, Kinect, etc., the
availability of 3D point cloud data has massively increased. Point cloud data helps
in indoor navigation and outdoor navigation of autonomous vehicles, nondestructive
machine vision inspection, and production audit. Better benchmarks for 3D point
cloud analysis are extremely important to leverage this data. 3D point clouds
obtained from these techniques are sparse, unordered, and extremely noisy. Point
cloud analysis includes but is not limited to segmentation, reconstruction, etc.

Several learning-based approaches have been applied to address these issues
in performing deep learning on point cloud data. PointNet [3] was the first paper
to work on raw 3D point clouds. It solves both classification and segmentation tasks
on the ModelNet40 [12] and ShapeNet datasets, respectively. The authors proposed
a novel architecture to address the problem of permutation invariance, and the
features extracted in their network are order independent. But the features extracted
by them are global features.
To address the issue of local feature extraction, they proposed another architec-
ture in [4]. PointNet and PointNet++ both work directly on the raw point cloud
data, i.e., just the x, y, and z coordinates of all the points. However, these
architectures can only take in point clouds of fixed input size.
[13] used a kd-tree step to make point clouds order invariant. Their work also
requires the number of points to be a power of 2 to fully grow the tree.

3 NN-Based Point Cloud Segmentation Using Octrees

This network takes in points with x, y, and z coordinates and returns the segmenta-
tion labels of each point (refer to Fig. 1).

3.1 Box Search by Octrees

Octree-based methods have been applied in analyzing 3D voxel grids like in [8]. In
our work, we construct an octree to efficiently query the neighbors for all the points
in a scene. The faster approach mentioned in [14] has been used to construct an
octree for each point cloud scene.
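The sketch below illustrates the kind of neighbourhood query this step performs. The chapter builds an octree; here scipy's cKDTree stands in for it, returning all points inside a fixed-radius ball (analogous to the box search) around each query point. The radius value and array sizes are assumptions.

```python
# Hedged sketch of the per-point neighbourhood query used for local features.
import numpy as np
from scipy.spatial import cKDTree

def neighbourhoods(points, radius=0.1):
    """points: (N, 3) array of x, y, z coordinates."""
    tree = cKDTree(points)
    # For every point, indices of all points inside the query ball around it
    return tree.query_ball_point(points, r=radius)

pts = np.random.rand(2048, 3).astype(np.float32)   # a toy point-cloud scene
nbrs = neighbourhoods(pts)
print(len(nbrs[0]), "neighbours found for the first point")
```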

3.2 Feature Hierarchy

The proposed architecture consists of a two-level hierarchy in feature extraction.


We use the neighborhood points for a given point to compute the local features. We
also use the pretrained network for classification task to extract global features of
the given scene. The local features and the global features are then aggregated and
are fed to a deep neural network to predict the segmentation label of the given point.
Likewise, labels of all the given points in a scene are predicted.
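A hedged TensorFlow sketch of this two-level hierarchy is given below: shared convolutions over the neighbourhood points, a global max pool yielding a local descriptor, concatenation with a global feature taken from the (frozen) classification branch, and fully connected layers predicting the per-point label. The filter counts, the global feature dimension, and the number of part labels are assumptions; the layer sequence loosely follows the architecture column of Table 1.

```python
# Sketch of the segmentation head combining local and global features.
import tensorflow as tf

NUM_PART_LABELS = 4                                     # e.g. parts of the "car" category

def build_segmentation_head(global_feat_dim=1024):
    nbr_pts = tf.keras.Input(shape=(None, 3))           # variable-size neighbourhood
    global_feat = tf.keras.Input(shape=(global_feat_dim,))

    x = tf.keras.layers.Conv1D(64, 3, padding="same", activation="relu")(nbr_pts)  # 1x3 conv
    x = tf.keras.layers.Conv1D(64, 1, activation="relu")(x)                        # 1x1 conv
    x = tf.keras.layers.Conv1D(128, 1, activation="relu")(x)                       # 1x1 conv
    local_feat = tf.keras.layers.GlobalMaxPooling1D()(x) # order/size-invariant descriptor

    h = tf.keras.layers.Concatenate()([local_feat, global_feat])
    h = tf.keras.layers.Dense(512, activation="relu")(h)
    h = tf.keras.layers.Dense(256, activation="relu")(h)
    out = tf.keras.layers.Dense(NUM_PART_LABELS, activation="softmax")(h)
    return tf.keras.Model([nbr_pts, global_feat], out)
```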

Fig. 1 Proposed architecture for 3D segmentation and classification

3.3 Permutation Invariance

As said earlier, we also employ the global max pooling operator as order invariant
transformation. The max pooling operation is done at the end of the convolution
layers.

3.4 Size Invariance

Since we perform nearest neighbor box search as opposed to KNN, we will end up
having variable number of input points for local feature extractor, but the network
architecture is designed in such a way that it doesn’t fail under any input shape.
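The following tiny NumPy check illustrates both points: the global max over the point axis is unchanged by any permutation of the input points, and it yields a fixed-length descriptor regardless of how many points are fed in. The feature dimensions are arbitrary toy values.

```python
# Demonstration of order and size invariance of global max pooling.
import numpy as np

feats = np.random.rand(7, 64)               # 7 points, 64-d per-point features
perm = np.random.permutation(len(feats))

pooled = feats.max(axis=0)                  # global max pooling over points
pooled_perm = feats[perm].max(axis=0)
assert np.allclose(pooled, pooled_perm)     # same descriptor for any point order

more_points = np.random.rand(25, 64)        # a differently sized cloud
assert more_points.max(axis=0).shape == pooled.shape   # same descriptor length
```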

3.5 Architecture Details

4 Experiments

We performed several experiments, each a slight modification of the other.
The first network was trained on the segmentation branch of the above network
using neighborhood data extracted from octrees, and the network then predicts the

Fig. 2 Sample car models from shapenet dataset

segmentation labels of each point. Cross entropy loss is computed on these predicted
labels, whose gradients are back-propagated to train the network. It was found that
there was a huge class imbalance, since the network was trained on segmentation
labels directly. Post scaling is applied as a method to correct the predicted softmax
probabilities. We then realized that, since we are breaking the segmentation problem
into a neighborhood-wise classification problem (analogous to patch-wise
segmentation in images), the predicted labels do not semantically understand the
object. So, to include the ability to semantically segment a particular neighborhood,
we posed the classification problem as an auxiliary task to extract the global features
of an object and used these features in predicting the segmentation label of a point
from its neighborhood. This gives the network the ability to semantically segment
a point cloud scene.
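The class-imbalance handling mentioned above can be sketched as a frequency-weighted cross-entropy loss together with "post scaling" of the predicted softmax probabilities by inverse class priors at inference time. The exact weighting scheme used in the experiments is not specified, so inverse frequency is an assumption here.

```python
# Hedged sketch of class-weighted cross entropy and post scaling of probabilities.
import numpy as np

def class_weights(labels, num_classes):
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    return counts.sum() / (num_classes * np.maximum(counts, 1.0))   # inverse frequency

def weighted_cross_entropy(probs, labels, weights):
    """probs: (N, C) softmax outputs, labels: (N,) integer labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-weights[labels] * np.log(picked + 1e-12)))

def post_scale(probs, priors):
    """Rescale softmax outputs by inverse class priors and renormalise."""
    scaled = probs / priors
    return scaled / scaled.sum(axis=1, keepdims=True)
```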

4.1 Implementation and Dataset Details

The proposed framework is implemented using tensorflow on Nvidia Tesla K20


GPU Linux machine. For performance evaluation, we have used standard publicly
available shapenet segmentation [5] challenge dataset, which has 16 different
object categories for both classification and segmentation. Sample car models from
shapenet dataset can be seen in Fig. 2.
Firstly, we trained the network on categorical cross entropy loss for 18 hrs.
We obtained a classification accuracy of 89% on the test car dataset. Then we
used the frozen weights of the classification network, took the activations after its
global max pooling, and concatenated them with the global max pooling activations
of the segmentation network. The loss is computed over the segmentation labels,
and gradients of only the segmentation weights are updated. The standard evaluation
metric, the IoU score, is computed for part segmentation and reported in Table 1.
The training and validation losses for the model with and without the feature
hierarchy are provided in Fig. 3a–c.
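For reference, the part-segmentation IoU metric reported in Table 1 can be sketched as the per-part intersection-over-union averaged over parts, given predicted and ground-truth labels for every point of a shape. Treating a part that is absent from both prediction and ground truth as a perfect match is a common convention and an assumption here.

```python
# Hedged sketch of mean per-part IoU for point-wise segmentation labels.
import numpy as np

def mean_part_iou(pred_labels, gt_labels, num_parts):
    ious = []
    for part in range(num_parts):
        pred_mask = pred_labels == part
        gt_mask = gt_labels == part
        union = np.logical_or(pred_mask, gt_mask).sum()
        if union == 0:                      # part absent in both -> count as perfect
            ious.append(1.0)
        else:
            inter = np.logical_and(pred_mask, gt_mask).sum()
            ious.append(inter / union)
    return float(np.mean(ious))
```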
The segmentation network is also trained on NVIDIA TESLA K20 GPU Linux
machine for 3 days. The segmentation branch is only trained on the models
belonging to the car category. The same network can be slightly modified to
248

Table 1 Experimental results

Experiment 1: Kd tree-based implementation on car. Network architecture: KDT-Net in [3]. Classification results: n/a. IoU score: 0.54.

Experiment 2: Weighted categorical loss, post scaling, and other techniques to handle class imbalance. Network architecture: same architecture, KDT-Net in [3]. Classification results: n/a. IoU score: M2 (trained with weights, inference also with weights): 0.234; M4 (trained with weights, inference without weights): 0.324; M3 (trained without weights, inference with weights): 0.5364.

Experiment 3: Naive implementation of octree-based NN search inside a box around a point, then classification of the per-point label using the points around it. Network architecture: one 1×3 CONV, two 1×1 CONV, max operation, 2 FC (512, 256), softmax over labels. Classification results: n/a. IoU score: 0.67 after 20 epochs.

Experiment 4: Two-level hierarchy with classification as auxiliary task. Classification results: 0.89 on shapenet challenge data. IoU score: 0.61 after 20 epochs.

Fig. 3 (a) Training loss for model without feature hierarchy. (b) Validation loss for model without
feature hierarchy. (c) Training and validation loss for model with feature hierarchy

incorporate segmentation of all the other object categories too. The qualitative
results for car part segmentation on the shapenet dataset and on a Porsche car model
are provided in Figs. 4 and 5 and are quite promising.

4.2 List of Experiments

4.3 Learning Curves

4.4 Qualitative Results


4.4.1 Shapenet Dataset

5 Conclusions and Future Work

Our method can produce effective segmentation even for shapes different from
the ones used during training. However, the network is observed to perform

Fig. 4 (a–d) Car part segmentation result (shapenet dataset)

Fig. 5 Porsche car segmentation result

comparatively worse when using the hierarchy of features than without the global
feature extractor. This could be attributed to the fact that the global features
extracted here are features that distinguish the given object from other objects.
It should be noted, however, that global features extracted using auto-encoders
could possibly help the feature hierarchy. Moreover, since we trained the
segmentation network on only one object category, global features extracted from a
classification network may perform well when training simultaneously on all
categories. This needs to be explored and is a possible direction for future work.

Acknowledgments The first author was a student at IIT Madras supported by a TCS research
internship at TCS Research and Innovation, Tata Consultancy Services Limited. The authors thank
the reviewers for their comments and suggestions for improvement.

References

1. Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., Trigoni, N., & Markham, A. (2020).
Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the
IEEE conference on computer vision and pattern recognition.

2. Huang, Q., Wang, W., & Neumann, U. (2018). Recurrent slice networks for 3d segmentation
of point clouds. In Proceedings of the IEEE conference on computer vision and pattern
recognition.
3. Qi, C., Su, H., Mo, K., & Guibas, L. (2017). PointNet: Deep learning on point sets for 3D
classification and segmentation. In International conference on computer vision and pattern
recognition.
4. Qi, C., Su, H., Mo, K., & Guibas, L. (2017). PointNet++: Deep hierarchical feature learning
on point sets in a metric space, In Conference on neural information processing systems.
5. Yi, L., Su, H., Shao, L., & Savva, M. (2017). Large scale 3D shape reconstruction and
segmentation from shapenet core55. In International conference on computer Vision.
6. Su, H., Maji, S., Kalogerakis, E., & Miller, E. (2015). Multi view convolutional neural networks
for 3D shape recognition. In International conference on computer Vision.
7. Dai, A., Chang, A., Savva, M., Halber, M., Funkhouser, T., & Niessner, M. (2017). ScanNet:
Richly-annotated 3D reconstructions of indoor scenes. In International conference on computer
vision and pattern recognition.
8. Riegler, G., Ulusoy, A., & Geiger, A. (2017). OctNet: Learning deep 3D representations at high
resolutions. In International conference on computer vision and pattern recognition.
9. Shabat, Y., Lindenbaum, M., & Fischer, A. (2017). 3D point cloud classification and segmen-
tation using 3D modified fisher vector representation for convolutional neural networks, arXiv.
https://arxiv.org/abs/1711.08241
10. Li, Y., Pirk, S., Su, H., Qi, C., & Guibas, L. (2016). FPNN: Field probing neural networks for
3D data. In Conference on neural information processing systems.
11. Shi, B., Bai, S., Zhou, Z., & Bai, X. (2015). DeepPano: Deep panoramic representation for 3-D
shape recognition. In Signal Processing Letters.
12. http://modelnet.cs.princeton.edu/
13. Klokov, R., & Lempitsky, V. (2017). Escape from cells: Deep kd networks for the recognition
of 3D point cloud models. In International conference on computer vision.
14. Behley, J., Steinhage, V., & Cremers, A. B. (2015). Efficient radius neighbor search in three
dimensional point clouds. In IEEE International conference on robotics and automation.
Convolution Neural Network
and Auto-encoder Hybrid Scheme for
Automatic Colorization of Grayscale
Images

A. Anitha , P. Shivakumara, Shreyansh Jain, and Vidhi Agarwal

1 Introduction

Early research on converting grayscale images into colored images relied on computers
with some human intervention. Broadly, two approaches are pursued in computer vision
for image colorization: user-oriented and automatic colorization. The user-oriented
approach relies on human intervention, with rules specifying how to color the pixels of
a given grayscale image. This approach requires constant human intervention, and the
output tends to be dull pictures with shading issues. Automation was first carried out
using statistical techniques, but their drawbacks paved the way for new techniques. The
latest methodology uses machine learning, employing neural networks to perform
automatic colorization by making the system learn the process. Varga and Szirányi [1]
suggested a scheme of automatic colorization for animation images, as they differ from
natural images, but it suffers from high color uncertainty and poor image shading.
Salve et al. [2] proposed image colorization using the Google image classifier and
ResNet V2 but failed to produce an optimal implementation due to a poor computation
process. Putri et al. [3]

A. Anitha ()
School of Information Technology and Engineering, Vellore Institute of Technology, Vellore,
Tamil Nadu, India
e-mail: aanitha@vit.ac.in
P. Shivakumara
B-2-18, Block B, Department of Computer System and Technology, Faculty of Computer Science
and Information Technology, University of Malaya (UM), Kuala Lumpur, Malaysia
S. Jain · V. Agarwal
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil
Nadu, India


used a sketch inversion model to convert plain sketches into colorful images. This
approach can handle various geometric transformations but failed to work well since it
was trained on a limited dataset. Existing efforts in image colorization research can be
divided into scribble-based, example-based, and learning-based approaches, as proposed
by Varga and Szirányi [1]. Scribble-based colorization can be applied to static images
and to continuous frames of images. Levin et al. [4] introduced a scribble-oriented
colorization technique in which the user indicates the region of interest (ROI) to be
colored by placing scribbles on the picture. This algorithm was enhanced by Huang et
al. [5] to decrease color blending at the edges of the images. Yatziv and Sapiro [6]
proposed a model to determine the RGB value of a pixel from a combination of multiple
scribbles, with the distance between a scribble and the pixel calculated by a distance
metric. Example-oriented colorization was suggested by Reinhard et al. [7] for
transferring color from one image to another using statistical analysis. Likewise,
Welsh et al. [8] proposed a method of finding images with similar pixel concentrations
and transferring color information to the matched pixels of the grayscale picture based
on neighborhood statistics. Irony et al. [9] integrated example-oriented and
scribble-oriented colorization to learn picture colorization. Charpiat et al. [10]
classified and forecasted the expected variation of RGB pixel intensity at each pixel by
defining a variable spatial coherency criterion. Learning-based colorization models
learn from image colorization data to avoid human intervention. Bugeau and Ta [11]
introduced a trained model for color prediction by dividing images into square patches
around each pixel. Cheng et al. [12] proposed a three-layered deep neural network for
extracting characteristics from raw grayscale values together with sophisticated
semantic features. Dahl [13] concatenated such semantic features to train a deep neural
network, using a convolutional neural network as a feature extractor; an encoder was
then trained and employed to produce the color channels. The predicted colors are
coherent for the most part, although the scheme has a strong tendency to desaturate
the images or render them invariant to illumination.
Semantic features play a vital role in automatic colorization. To color an image
efficiently and effectively, the machine should be fed with information about the
semantic composition of each image and its localization. For instance, the shade of a
chameleon changes according to its environment; similarly, the ocean looks mostly blue
during the daytime but is totally dark at night. To make the machine learn this, we
propose a CNN model combined with an auto-encoder. The model is used to identify
significant features, is trained on the given dataset of different categories, and
predicts the color for new grayscale images.
The chapter is organized as follows: Section 1 presented the introduction; Sect. 2
covers the basics of convolution neural networks; Sect. 3 describes the auto-encoder and
decoder model; Sect. 4 presents the proposed methodology; Sect. 5 demonstrates the
experimental results and analysis; and the chapter ends with the conclusion in Sect. 6.

2 Basics of Convolution Neural Network

Convolution neural network (CNN) is a deep learning technique for processing image
data; it has the capacity to adaptively learn features from low-level to high-level
patterns. A CNN is generally composed of three types of layers: convolution, pooling,
and fully connected layers. The convolution and pooling layers perform feature
extraction, whereas the fully connected layers establish the classification by mapping
the features to the target values.
The basic architecture of CNN is represented in Fig. 1.

2.1 Convolutional Layer

This is the initial layer in a CNN for feature extraction from the input images. A
filter of a particular size R × R, where R is a small number (R > 2), is convolved
with the input image and slid over the whole image to perform dot products. The
resulting matrix is called the feature map, which captures information about every
corner and edge of the image obtained by applying the filter to the input image. The
purpose of the feature map is to gain knowledge about the input image, and these
features are fed to the other layers for better learning of the whole input image.

2.2 Pooling Layer

Generally, a pooling layer follows the convolution layer. This layer reduces the
computation cost by shrinking the feature map produced by the convolution layer. It is
achieved by reducing the connections between

Fig. 1 Basic architecture of convolution neural network



the layers, and it operates separately on each feature map. Various pooling operations
are used, such as max pooling, average pooling, and global pooling.
In max pooling, the largest value in each region of the feature map is taken. Average
pooling computes the mean value over the considered image section, while sum pooling
computes the sum. The pooling layer generally serves as a bridge between the
convolution layer and the fully connected layer.

2.3 Fully Connected Layer

The fully connected layer is a typical neural network structure used to connect the
pooling layer outputs to the output layer. The layer consists of weights and biases,
with neurons forming hidden layers between two different layers. The flattening of the
image is performed in the previous layer, and the flattened vector is fed to the fully
connected layers, where it passes through a few more fully connected layers that compute
the mathematical functions and produce the classified output.

2.4 Overfitting or Dropout

Generally, the basic learning process suffers from overfitting. Overfitting is
identified when the training data produces a negative impact on the performance of the
model. To overcome this issue, a dropout layer can be utilized: neurons are dropped
either at random or because their weights are negligible, which decreases the effective
size of the model during the training procedure.

2.5 Activation Functions

Lastly, a major component of the CNN model is the activation function.
Activation functions are used to learn and estimate any kind of continuous and complex
relationship among the variables of the network. Put simply, they determine which
information of the model should be passed forward and which should be removed from the
network. Numerous commonly used activation functions, such as ReLU, Softmax, tanh, and
Sigmoid, have been proposed in various research models. These activation functions are
chosen according to the task: the Sigmoid activation function is preferred for binary
classification, whereas the Softmax activation function is preferred for multi-class
classification.
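
A minimal Keras sketch (illustrative only; the layer sizes and counts are assumptions, not this chapter's model) that ties the above pieces together: convolution plus ReLU for feature extraction, max pooling to shrink the feature maps, dropout against overfitting, and a fully connected softmax head for multi-class classification of 32 × 32 RGB images.

import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                        # bridge to the fully connected part
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                     # randomly drops units during training
    tf.keras.layers.Dense(10, activation="softmax"),  # multi-class output
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
cnn.summary()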

3 Auto-encoder and Decoder Model

To make a modern system learn, it is advisable to feed a huge dataset as input for
training. In image processing, data collected from various sources can be fed into the
system for automatic colorization. As the amount of data increases, the execution time
grows and the learning performance may suffer. To feed the system with huge amounts of
data without data loss, the data can first be compressed; this is where the concept of
the auto-encoder comes in. An auto-encoder is an unsupervised artificial neural network
(ANN) that is trained to compress and encode the data efficiently and then learns how to
rebuild the data from the compressed encoded representation so that it is almost
identical to the original input. An auto-encoder comprises four main components,
encoder, bottleneck, decoder, and reconstruction loss, as represented in Fig. 2. In
Fig. 2, x1, x2, x3, ..., xn represent the input, and the hidden layers on the encoder
side are the convolution, flatten, and dense layers; the reconstruction side mirrors
this with dense, reshape, and convolution layers to recover an output with the same
dimension as the original image. The encoder converts the input into an encoded
representation of lower dimension. The bottleneck contains the reduced representation of
the input data to be stored. The decoder learns how to unwrap the data from the reduced
form so that it is close to the original input data. The reconstruction loss helps in
checking the performance of the decoder. The whole process of the auto-encoder is
illustrated in Fig. 3.
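
The following small Keras sketch illustrates the four components described above on an arbitrary flattened input (the 784-dimensional size and layer widths are assumptions): an encoder, a bottleneck holding the compressed code, a decoder, and a mean squared error reconstruction loss. Training uses the input itself as the target.

import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))                                 # flattened input x
encoded = tf.keras.layers.Dense(128, activation="relu")(inputs)       # encoder
bottleneck = tf.keras.layers.Dense(32, activation="relu")(encoded)    # compressed code
decoded = tf.keras.layers.Dense(128, activation="relu")(bottleneck)   # decoder
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(decoded)   # reconstruction x_hat

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")                     # reconstruction loss
# Training uses the input itself as the target: autoencoder.fit(x, x, ...)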

4 Proposed Research Methodology

This chapter aims to perform automatic colorization by machine. To achieve this aim, a
convolution neural network with an auto-encoder model is used for the proposed system.
Convolutional neural networks (CNN) have emerged as one of the de facto standards for
solving various problems related to image classification. They came to the limelight and
became popular because of the low error rates (less than 4%) achieved in the ImageNet
challenge. The main reason for their success is their capability to discover and discern
colors, shapes, and patterns within images and to relate them to object classes. This is
also one of the main reasons CNNs can produce such well-defined colorization, as object
classes, prototypes, and silhouettes are generally correlated with color options. Apart
from CNNs, auto-encoders are used for image classification. As discussed in Sect. 3,
auto-encoders are a self-learning unsupervised technique in which neural networks are
employed for representation learning. They employ back-propagation, fixing the target
values to be equal to the input values, i.e., y(i) = x(i), where i indexes the samples,
x(i) denotes the input, and y(i) denotes the target values. Most of the automatic

Fig. 2 General architecture of auto-encoder (input layer x1, ..., xn; hidden encoder and decoder layers; output layer x̂1, ..., x̂n)

Fig. 3 Auto-encoder components and its encoded representation (encoder → bottleneck/latent space representation → decoder)

colorization approaches used linear regression as their methodology, which produced
outputs comparatively inferior to the proposed one. The absence of a direct function to
convert a grayscale image to a colored image further

Fig. 4 Proposed research methodology (encoder: 32×32×1 input → Conv1, Conv2, Conv3 with stride 2 giving 16×16×64, 8×8×128, 4×4×256 → flatten (4096) → fully connected → latent vector h of dimension 256; decoder: reshape → DeConv3, DeConv2 with stride 2 → 32×32 output)

motivates the proposed method. The architecture of the proposed research method is
depicted in Fig. 4.
In this chapter, a statistical learning-driven approach is used to solve the problem. In
the initial phase, a convolution neural network (CNN) was designed and built that
accepts a grayscale image as its input and generates a colorized version of the input
image as its output. To accomplish this, the neural network is trained on thousands of
colored images, and the output it generates is based solely on the images it has learned
from. This also removes the need for human intervention to generate the desired image.
If the neural network is well trained, the output should be an image close to what the
user is looking for. The main aim of the chapter is to efficiently and accurately
convert a given grayscale image to its corresponding colored image with the help of deep
learning models along with auto-encoders. The proposed model is trained to learn on its
own from the available datasets without any human intervention. Another aim is to
provide a direct function to convert a grayscale image to a colored image, which can be
embedded in various software in the near future.

4.1 Data Description and Design Approaches

Real-time data involves various colors and perceptions, so it is not enough to make the
system learn only about filling in color. Every image can be placed at various (x, y)
coordinates and may be scaled, rotated, or transposed within those coordinates. Hence,
rather than simply teaching the model automatic colorization, it is necessary to make
the model learn the semantic features of the image. To adopt the proposed idea, the
convolution neural network has been designed as follows. Generally, 32-bit precision is
used when training a neural network; since the input data are stored and used in
training, it is better to convert the input values to float and to scale the input
between 0.0 and 1.0, so each input is divided by 255 to make the learning rate
reasonably good. The Canadian Institute for Advanced Research (CIFAR-10) dataset is
considered one of the best image datasets used with neural networks. The dataset
consists of 60,000 color images of dimension 32 × 32 in 10 classes, with 6000 images per
class. For example, the cat class contains

Fig. 5 Sample dataset

6000 images of cats in various positions and colors. For the proposed model, the total
dataset is divided at random into 50,000 training images and 10,000 testing images. For
better performance measurement, the dataset is divided into batches of 10,000 images
each, giving five training batches and one testing batch. The test batch contains
exactly 1000 randomly selected images from each class, and the training batches hold the
rest of the images in random order. Some training batches may contain more images from
one class, but overall the training batches contain exactly 5000 images from each class.
Initially, the dataset is not uniform, consisting of colored images, so it needs to be
converted into black and white before being used in the training process. The sample
dataset depicted in Fig. 5 shows random images from each class.
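
A short preprocessing sketch, assuming Keras' built-in CIFAR-10 loader: the pixel values are scaled to [0.0, 1.0] by dividing by 255, and a grayscale copy is produced to serve as the network input while the original color images remain the targets. The luminance weights used for the conversion are a common convention, not taken from the chapter.

import numpy as np
import tensorflow as tf

(x_train, _), (x_test, _) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0        # scale pixels to [0, 1]
x_test = x_test.astype("float32") / 255.0

def to_gray(rgb):
    # Luminance-weighted grayscale conversion (weights are a common convention)
    gray = np.dot(rgb[..., :3], [0.299, 0.587, 0.114])
    return gray[..., np.newaxis]                    # shape (N, 32, 32, 1)

x_train_gray = to_gray(x_train)
x_test_gray = to_gray(x_test)
print(x_train_gray.shape, x_train.shape)            # (50000, 32, 32, 1) (50000, 32, 32, 3)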
The initial stage of training involves preprocessing techniques such as converting the
images into grayscale and batch normalization; the data are then passed to the encoder
model for compression and to the decoder for reconstruction of the compressed data. The
reconstructed data is fed into the convolution neural network. Normalization is
performed by scaling the input values between 0.0 and 1.0: each pixel is divided by 255,
and during training, pixels with high values are considered significant, whereas pixels
with low values are considered less significant, so the features are extracted based on
the values obtained. This normalization provides uniformity among the images in the
dataset. Given

a black-and-white picture as input, the proposed work attacks the difficulty of
hallucinating a plausible color version of the picture. The problem is underconstrained,
so previous research approaches have either relied on significant user interaction or
resulted in desaturated colors. The proposed fully automated approach produces lively
and sensible colors. We acknowledge the inherent uncertainty of the problem by
presenting it as a classification task and use class rebalancing during training to
increase the variability of colors in the results. The system is implemented as a
feed-forward pass through the CNN at test time and is trained on more than one million
color images. An input of 50,000 images of size 32 × 32 with three channels (the RGB
channels) is used in the proposed model, and three convolution layers extract the
features of the grayscale images. The flatten layer, the dense layers, and finally the
latent vector contain all the features of the image in reduced size. The flatten layer
converts the max-pooled feature map into a single column that is passed to the fully
connected layer. The dense layer is a deeply connected neural network layer that makes
the model learn the features by receiving input from all neurons of its previous layer;
it performs a matrix–vector multiplication to generate the image. The summary of the
encoder model displays the whole network with all its details, such as the number of
trainable and non-trainable parameters, the number of nodes, and the shape of each
layer. After the encoder model is created, the decoder model takes over. The latent
vector contains the features of the images, which now need to be converted into the
output image. This model is the exact opposite of the encoder model: here, the latent
vector is the input, followed by the dense layers, the 2D convolution layers, and
finally the output layer, which is a 2D matrix representing the image. The dimensions of
the image are kept the same so that there is no change in the image size. The
combination of the encoder and decoder models results in a single unit called the
auto-encoder, as specified in Sect. 3. The main target of the chapter is to convert a
grayscale picture into a colored picture having the red–green–blue composition. Although
there exists a direct function to convert a colored image into a grayscale picture,
there is no direct function for the reverse process.
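
The following sketch is consistent with the dimensions quoted in this section (32 × 32 grayscale input, stride-2 convolutions with 64/128/256 filters, "same" padding, a 4 × 4 × 256 = 4096 flattened layer, a 256-dimensional latent vector, and Conv2DTranspose layers in the decoder); the kernel sizes and the three-channel RGB output are assumptions rather than the authors' published code.

import tensorflow as tf

latent_dim = 256
enc_in = tf.keras.Input(shape=(32, 32, 1))                                                  # grayscale input
x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(enc_in)    # 16x16x64
x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)        # 8x8x128
x = tf.keras.layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)        # 4x4x256
x = tf.keras.layers.Flatten()(x)                                                            # 4*4*256 = 4096
latent = tf.keras.layers.Dense(latent_dim)(x)                                               # bottleneck
encoder = tf.keras.Model(enc_in, latent, name="encoder")

dec_in = tf.keras.Input(shape=(latent_dim,))
y = tf.keras.layers.Dense(4 * 4 * 256, activation="relu")(dec_in)
y = tf.keras.layers.Reshape((4, 4, 256))(y)
y = tf.keras.layers.Conv2DTranspose(128, 3, strides=2, padding="same", activation="relu")(y)   # 8x8x128
y = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(y)    # 16x16x64
dec_out = tf.keras.layers.Conv2DTranspose(3, 3, strides=2, padding="same", activation="sigmoid")(y)  # 32x32x3 (assumed RGB output)
decoder = tf.keras.Model(dec_in, dec_out, name="decoder")

autoencoder = tf.keras.Model(enc_in, decoder(encoder(enc_in)), name="colorization_autoencoder")
autoencoder.summary()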

5 Experimental Analysis and Result

The proposed method was run in Google Colab, a free online cloud-based Jupyter notebook
environment. Google Colab provides a Python development environment that runs in the
browser on Google Cloud, and the matplotlib library package is used to plot the images.
The CIFAR-10 dataset is fed into the three-layered CNN model, with the three-channel RGB
images passed to the convolution layers. The input layer receives a grayscaled image,
and each neuron receives input from every element of the previous layer. The output
layer is a multi-class label as

Fig. 6 ReLU activation function (y = 0 for x < 0 and y = x for x ≥ 0)

explained in the sample dataset of Fig. 5. The hidden layers consist of convolution
layers, activation function layers, pooling layers, and a fully connected neural
network. Identifying the intensity of a grayscale image is a simple task, but automating
this for huge volumes of multi-channel color data makes the computation very intensive.
The task of the auto-encoder model is to compress this voluminous data into a reduced
form without loss; this compressed data is fed into the convolution layers. Since the
dataset images are combinations of multiple colors, the convolution layers use a deep
network of [64, 128, 256] filters with max pooling to reduce the spatial dimension. The
stride controls the amount of movement of the filter over the image; in the proposed
model, the stride is fixed at 2, so the filter moves two steps at a time in the
convolution layer. Since the aim is to keep the spatial dimension of the output the same
as the input, the padding is set to "same." ReLU (rectified linear unit) is used as the
activation function; the ReLU function is given as y = max(0, x), where x is the input
value, as depicted in Fig. 6. Since the ReLU activation function is simple, fast, and
converges quickly, it is the most commonly used activation function in neural networks,
especially in CNNs; it sets negative values to zero and keeps all other values
unchanged. The flatten layer is used to convert the input matrix into a 1D array of
elements. The auto-encoder generates a latent vector from the input data and recovers
the input using the decoder. The latent vector in the proposed model has 256 dimensions.
In total, this part of the proposed model has 1,418,496 trainable parameters.
To maintain the same dimension for the input values in the convolution and fully
connected layers, the TensorSpec function is used. The proposed model creates a dense
layer of 4 × 4 × 256 with dtype float32; a single one-dimensional array of 4 × 4 × 256 =
4096 elements is created. The Conv2DTranspose function is used in the decoder to
reconstruct the reduced input. The auto-encoder model combines the encoder and decoder
parameters: the encoder has 1,418,496 parameters and the decoder 2,013,315, so the
auto-encoder model has 3,431,811 trainable parameters in total. To judge the performance
of the trained model, metrics are used; in the proposed model, the regression metric
mean squared error is used to compile the auto-encoder.
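
A hedged training sketch reusing the names from the earlier sketches (autoencoder, x_train_gray, x_train, x_test_gray, x_test): the model is compiled with the mean squared error loss, an accuracy metric is tracked as in the tables below, and the optimizer ("adam", "rmsprop", or "sgd"), batch size, and epoch count are the quantities varied in the experiments that follow.

# Compile with MSE as described above; the chapter also reports an accuracy metric.
autoencoder.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
history = autoencoder.fit(
    x_train_gray, x_train,                   # input: grayscale, target: colour
    validation_data=(x_test_gray, x_test),
    batch_size=32,                           # 64 is the other setting compared below
    epochs=50,
)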

5.1 Classification and Validation Process

This section explores the classification and prediction process for the considered
dataset. The proposed model was trained using the CNN, with the data compressed through
the auto-encoder process. The classification process was carried out using various
optimizers such as Adam, RMSProp, and stochastic gradient descent (SGD). Adam is an
optimizer that uses a stochastic gradient descent method based on adaptive estimation to
train deep learning models; it also handles noise reduction, as it performs optimization
by integrating the AdaGrad and RMSProp algorithms. To check the fitness of the model,
validation is performed with customized epochs and varying batch sizes. Generally, the
batch size determines the number of samples considered per batch.
The dataset contains 50,000 training inputs, the batch size is fixed at 32, and training
starts with 5 epochs. Once the model is trained to its best classification accuracy, we
can make it predict for unobserved images. The classification was carried out with
epoch = 5, epoch = 10, epoch = 30, and epoch = 50, and batch sizes of 32 and 64 were
used to train the model. The computed classification accuracy for batch size = 32 with
SGD, RMSProp, and Adam is provided in Table 1.
From Table 1, it is clearly seen that for epoch 50, RMSProp attains a classification
accuracy of 64.89%, SGD attains 65.02%, and the Adam optimizer attains 66.9%, making
Adam the best of the compared optimizers. Hence, the learning rate and optimizer are
well chosen, and the classification accuracy against every epoch is depicted in Fig. 7.
In Fig. 7, the exponential trend line for SGD and the linear trend line for Adam are
indicated to show the increase in accuracy over the computations at 5, 10, 15, 20, 25,
30, 35, 40, 45, and 50 epochs.
Similarly, the classification accuracy is computed for batch size = 64 with the SGD,
RMSProp, and Adam optimizers for epochs from 5, 10, 15,

Table 1 Classification accuracy for various optimizers with batch size = 32


Batch size = 32
Classification accuracy (%) Validation loss Validation accuracy (%)
Epoch SGD RMSProp ADAM SGD RMSProp ADAM SGD RMSProp ADAM
5 52.79 47.12 56.00 0.008 0.0091 0.0075 52.35 44.99 50.79
10 55.35 53.42 57.86 0.0076 0.0086 0.0076 54.3 52.45 51.76
15 59.23 57.21 62 0.0074 0.0081 0.0075 50.49 50.03 51.89
20 61.73 58.34 63.4 0.0076 0.0079 0.0076 51.38 50.63 51.9
25 63.36 61.98 64.04 0.0077 0.0079 0.0076 51.74 52.44 51.39
30 63.5 62.56 63.72 0.0077 0.0078 0.0077 51.02 52.16 51.11
35 63.88 63.58 64.32 0.0077 0.0075 0.0077 50.9 49.16 51.1
40 63.98 63.92 64.69 0.0078 0.0077 0.0077 50.97 51.84 52.01
45 64.19 64.5 65 0.0078 0.0077 0.0078 51.34 52.89 52.46
50 65.02 64.89 66.9 0.0078 0.0076 0.0078 51.95 51.76 52.78

Fig. 7 Classification accuracy for batch size = 32

Table 2 Classification accuracy for batch size = 64


Batch size = 64
Classification accuracy (%) Validation loss Validation accuracy (%)
Epoch SGD RMSProp ADAM SGD RMSProp ADAM SGD RMSProp ADAM
5 45.34 44.58 46.43 0.103 0.0091 0.0101 50.46 46.24 48.16
10 46.1 45.76 48.16 0.0097 0.0086 0.0095 51.67 46.98 48.2
15 47.67 46.99 48.2 0.0086 0.0081 0.0087 52.16 47.89 52.6
20 49.34 48.67 51.56 0.0083 0.0079 0.0081 53.09 48.92 51.9
25 52.01 51.34 52.11 0.0083 0.0079 0.0081 53.98 50.01 53.18
30 53.79 52.89 54.56 0.0081 0.0078 0.0079 54.24 51.42 54.56
35 55.76 54.95 56.89 0.0081 0.0075 0.0079 55.1 51.98 55.38
40 57.99 57.01 58.12 0.0079 0.0077 0.0077 55.89 52.41 56.27
45 59.1 58.47 60.34 0.0079 0.0077 0.0078 56.57 53.03 57.16
50 61.42 60.34 62.98 0.0079 0.0076 0.0078 57.34 53.98 57.98

20, 25, 30, 35, 40, 45, to 50. Table 2 presents the classification accuracy for batch
size = 64, obtained with ReLU as the activation function and the SGD, RMSProp, and Adam
optimizers. SGD reaches an accuracy of 61.42%, RMSProp 60.34%, and Adam 62.98%. Even
though the batch size increases, the proposed method gives lower accuracy than that
obtained with batch size 32: the accuracy for batch size = 32 is 3.92% higher than for
batch size = 64, so it is sufficient to train the proposed model with a batch size of
32. The classification accuracy obtained for batch size 64 is shown in Fig. 8. Along
with the classification accuracy, the proposed model's validation accuracy and
validation loss were computed. Tables 1 and 2 contain the validation accuracy for batch
sizes 32 and 64, respectively, and Figs. 9 and 10 portray the validation accuracy for
batch sizes 32 and 64, respectively.

Fig. 8 Classification accuracy – batch size 64

Fig. 9 Validation accuracy – batch size 32

Fig. 10 Validation accuracy – batch size 64



Table 3 Validation loss for batch sizes 32 and 64


Batch size = 32 Batch size = 64
Validation loss Validation loss
Epoch SGD RMSProp ADAM SGD RMSProp ADAM
5 0.008 0.0091 0.0075 0.0103 0.0091 0.0101
10 0.0076 0.0086 0.0076 0.0097 0.0086 0.0095
15 0.0074 0.0081 0.0075 0.0086 0.0081 0.0087
20 0.0076 0.0079 0.0076 0.0083 0.0079 0.0081
25 0.0077 0.0079 0.0076 0.0083 0.0079 0.0081
30 0.0077 0.0078 0.0077 0.0081 0.0078 0.0079
35 0.0077 0.0075 0.0077 0.0081 0.0075 0.0079
40 0.0078 0.0077 0.0077 0.0079 0.0077 0.0077
45 0.0078 0.0077 0.0078 0.0079 0.0077 0.0078
50 0.0078 0.0076 0.0078 0.0079 0.0076 0.0078

Fig. 11 Validation loss – batch size 32

Validation loss is an indicator of how well the data partition works for the research
model: if the validation loss is equal to or less than the training loss, we can
conclude that the training and testing data partition is good for the proposed model.
The validation loss obtained with the RMSProp, SGD, and Adam optimizers is reported in
Table 3 and Figs. 11 and 12. Figure 11 gives a clear idea of the validation loss
obtained from the various optimizers for batch size = 32, and Fig. 12 shows the
validation loss for batch size = 64. From Figs. 11 and 12, even though the validation
loss is almost the same by the end of training at epoch = 50, the validation loss varies
considerably in the initial stages of training.
During the training process, there is the problem of overfitting, which concerns how
long training should be performed with respect to the accuracy obtained. Figure 13
clearly shows the overfitting that occurs during training from epoch 1 to epoch 50. To
exhibit the figure in a clear

Fig. 12 Validation loss – batch size 64

Fig. 13 Overfitting: classification accuracy for epochs 40–60 (batch size = 32, Adam optimizer)

manner, only epochs 40 to 60 are displayed in Fig. 13, for batch size = 32 and the Adam
optimizer. From Fig. 13, we can clearly see that at epoch = 50 the accuracy achieved is
66.9%, whereas from epoch 51 to 60 the accuracy decreases. This clearly shows that
training up to epoch = 50 is sufficient for a better classification process.
The proposed model's accuracy and overall loss for epoch = 30 are displayed using the
matplotlib library, as shown in Fig. 14. Since the proposed model is trained to 66.90%
accuracy, it can now be used to predict the automatic colorization of grayscale images.
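
A short matplotlib sketch, assuming the history object returned by fit() in the earlier training sketch, that reproduces the style of Fig. 14: training and test accuracy on the left and training and test loss on the right, plotted against the epoch index.

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["accuracy"], label="Train")
ax1.plot(history.history["val_accuracy"], label="Test")
ax1.set(title="Model accuracy", xlabel="Epoch", ylabel="Accuracy")
ax1.legend()
ax2.plot(history.history["loss"], label="Train")
ax2.plot(history.history["val_loss"], label="Test")
ax2.set(title="Model loss", xlabel="Epoch", ylabel="Loss")
ax2.legend()
plt.show()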

Fig. 14 Model accuracy and loss for the proposed model for epoch = 30 (train and test curves)

5.2 Prediction

The prediction process is carried out with the predict function, where grayscale images
are given as input and colored images are expected as output. The next step is to
display the results, so a variable is created in which the colored images output by the
neural network are saved. To perform the prediction, 25 images are saved and organized
as a 5 × 5 grid. To distinguish between the actual and the predicted images, both are
shown as figures for epoch = 5, epoch = 10, epoch = 30, and epoch = 50, and the original
image is checked against the predicted image. For better understanding, the observed and
original images for epochs at equal intervals are given in the following figures. In
addition, a number of random pictures are selected from the original dataset and checked
against the predicted automatically colorized images in three different grid
arrangements: 5 × 5, 4 × 4, and 3 × 3. Figure 15 displays the original vs. predicted
images for epoch = 5, and Fig. 16 shows the same for epoch = 10. Similarly, the original
and predicted images for epoch = 30 and epoch = 50 are shown in Figs. 17 and 18,
respectively.
As can be seen, there is not much difference between the original and the predicted
images; the predicted images are only slightly dull in color. Overall, the images can be
recognized and identified, and the network performs quite well on black-and-white
images.
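
A hedged prediction sketch, again reusing names from the earlier sketches (autoencoder, x_test_gray, x_test): 25 grayscale test images are passed through the trained auto-encoder with predict, and the colorized outputs are displayed next to the originals in a 5 × 5 grid.

import matplotlib.pyplot as plt

predicted = autoencoder.predict(x_test_gray[:25])     # colourized outputs in [0, 1]

def show_grid(images, title):
    # Display 25 images in a 5x5 grid; use a gray colormap for single-channel inputs.
    plt.figure(figsize=(6, 6))
    plt.suptitle(title)
    for i in range(25):
        plt.subplot(5, 5, i + 1)
        plt.imshow(images[i].squeeze(), cmap=None if images[i].shape[-1] == 3 else "gray")
        plt.axis("off")
    plt.show()

show_grid(x_test[:25], "Original images")
show_grid(predicted, "Predicted colorization")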
Fig. 15 Original vs. predicted image for epoch = 5

Fig. 16 Original vs. predicted image for epoch = 10



Fig. 17 Original vs. predicted image for epoch = 30

Fig. 18 Original vs. Predicted image for epoch = 50



6 Conclusion

Advances in the field of computer vision have reached the milestone of automatic
colorization, predicting the color of pixels from trained data. Based on the
experimental analysis, it is concluded that the proposed model gives better color
prediction for new grayscale images as the number of layers increases. The results show
that the predicted images are close to the real images. The proposed work gives better
color prediction for new images with the Adam optimizer, achieving 66.9% accuracy, which
is a good prediction accuracy compared with other self-learning or supervised machine
learning algorithms. Such automatic colorization without human intervention helps in
various real-life applications.

References

1. Varga, D., & Szirányi, T. (2016). Fully automatic image colorization based on Convolutional
Neural Network. In 2016 23rd International Conference on Pattern Recognition (ICPR) (pp.
3691–3696). IEEE.
2. Salve, S., Shah, T., Ranjane, V., & Sadhukhan, S. (2018). Notice of violation of IEEE
publication principles: Automatization of coloring grayscale images using convolutional
neural network. In 2018 Second International Conference on Inventive Communication and
Computational Technologies (ICICCT) (pp. 1171–1175). IEEE.
3. Putri, V. K., & Fanany, M. I. (2017). Sketch plus colorization deep convolutional neural
networks for photos generation from sketches. In 2017 4th International Conference on
Electrical Engineering, Computer Science and Informatics (EECSI) (pp. 1–6). IEEE.
4. Levin, A., Lischinski, D., & Weiss, Y. (2004). Colorization using optimization. In ACM
SIGGRAPH 2004 Papers (pp. 689–694).
5. Huang, Y. C., Tung, Y. S., Chen, J. C., Wang, S. W., & Wu, J. L. (2005). An adaptive edge
detection based colorization algorithm and its applications. Proceedings of the 13th annual
ACM international conference on Multimedia, 351–354.
6. Yatziv, L., & Sapiro, G. (2006). Fast image and video colorization using chrominance blending.
IEEE Transactions on Image Processing, 15(5), 1120–1129.
7. Reinhard, E., Ashikhmin, M., Gooch, B., & Shirley, P. (2001). Color transfer between images.
IEEE Computer Graphics and Applications, 21(5), 34–41.
8. Welsh, T., Ashikhmin, M., & Mueller, K. (2002). Transferring color to greyscale images. ACM
Transactions on Graphics, 21(3), 277–280.
9. Irony, R., Cohen-Or, D., & Lischinski, D. (2005). Colorization by example. In Eurographics
Symposium on Rendering.
10. Charpiat, G., Hofmann, M., & Schölkopf, B. (2008). Automatic image colorization via multi-
modal predictions. In Computer Vision – ECCV 2008 (pp. 126–139).
11. Bugeau, A., & Ta, V. T. (2012). Patch-based image colorization. Proceedings of the IEEE
International Conference on Pattern Recognition, 3058–3061.
12. Cheng, Z., Yang, Q., & Sheng, B. (2015). Deep colorization. Proceedings of the IEEE
International Conference on Computer Vision, 415–423.
13. Dahl, R. (2016, January). Automatic colorization. http://tinyclouds.org/colorize/
Deep Learning-Based Open Set Domain
Hyperspectral Image Classification Using
Dimension-Reduced Spectral Features

C. S. Krishnendu, V. Sowmya, and K. P. Soman

1 Introduction

Hyperspectral remote sensing focuses on targets by observing the Earth's surface with
ground-based, airborne, or spaceborne imaging systems [1]. The primary objective of
hyperspectral imaging is to recognize objects and materials with the assistance of
sensors [1]. A hyperspectral image is generally viewed as a cube containing spatial and
spectral data: the X and Y axes represent the spatial information, and Z represents the
spectral information. The hyperspectral image is subjected to some preprocessing
procedures prior to analysis. The information from the scene is kept in each voxel in
terms of the apparent reflectance of the material present on the ground. Every pixel
corresponds to a specific object in the scene, so the dimensionality of hyperspectral
images is enormous, and efficient feature extraction techniques are therefore required
for hyperspectral imaging. In recent years, deep learning has been used for image
processing tasks. This paper uses the idea of domain adaptation [3–6]: training and test
data are taken from different domains. In this situation, the classifier cannot work
properly, and the accuracy is reduced. Numerous methodologies have been proposed to
address this domain adaptation problem [13–16]. Domain adaptation is a strategy that
manages the situation where a model is trained on a label-rich domain and applied in a
label-scarce domain. However, these techniques depend on the assumption that the data
for training (source) and testing (target) come from the same distribution. In [7], the
authors utilized the idea of Grassmannian manifolds and proposed aligning source and
target samples in a shared subspace; that was the first work on domain adaptation. This
paper utilizes a technique that helps us manage the open set domain issue with regard

C. S. Krishnendu · V. Sowmya () · K. P. Soman


Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering,
Amrita Vishwa Vidyapeetham, Coimbatore, India
e-mail: V_sowmya@cb.amrita.edu


to domain adaptation [8]. In this setting, the distribution of the known data in the
target (testing) set is different from that of the data in the source (training) set.
For some applications, training data is scarce due to the high cost of obtaining
annotated data [2]. Known samples are classes that are shared by both source and target,
and unknown samples are classes that are absent in the source; the issue is that we have
no prior knowledge of which samples are unknown. In [8], a GAN is utilized for data
generation, and the classifier assists in identifying known and unknown samples: the
generator creates data to train on the unknown samples. Here, we also utilize a similar
idea of adversarial learning to generate samples and to distinguish known from unknown
data samples. The large dimensionality of hyperspectral datasets brings complexity in
computation. One technique to address this issue is feature reduction, a process of
projecting data into a lower-dimensional feature space, which helps in reducing
redundant features. In machine learning, various feature extraction methods exist for
dimension reduction, such as Principal Component Analysis (PCA), diffusion maps,
Independent Component Analysis (ICA), Local Linear Embedding (LLE), etc. In [9, 10],
dynamic mode decomposition (DMD) was used as a dimension reduction technique for
hyperspectral image classification and was shown to be more efficient than existing
conventional methods.
Two separate dimensionality reduction methods are utilized here, and their impact on
reducing the redundancy of features is investigated. The work comprises two stages. In
the first stage, dynamic mode decomposition (DMD) is utilized for dimension reduction
[11], and the viability of the technique is investigated through accuracy and
computation time. In the second stage, we investigate a novel Chebyshev polynomial-based
strategy for the feature reduction of HSI. Chebyshev approximation (Chebfun) is not
commonly used in image processing tasks, but it has been applied to several problems in
engineering [12, 13]. The method is built upon approximation theory and can perform
numerical computing with functions [14]. In [15], Chebyshev approximation is used for
improving epoch extraction from telephonic speech, where it outperforms the conventional
methods. Chebyshev approximation is also found to be very effective in estimating power
system frequency along with variational mode decomposition in [16]. So this approach has
been applied in many areas and has shown its significance, but it has not been explored
for image classification tasks. Here, this technique is used for approximating the
spectral features of a hyperspectral image. A function f(x), defined in the interval
[−1, 1], can be represented by a unique polynomial interpolant P_N(x) at the Chebyshev
points. A good approximation of a function f(x) with reduced coefficients is obtained by
evaluating the Chebyshev polynomial series at the Chebyshev points rather than at
uniformly spaced points [15]. Chebyshev approximation thus uses an adaptive procedure to
find the right points so as to represent the function to approximately machine
precision; the resulting values are called Chebyshev coefficients [14]. Thus, the major
contribution of the present work is the dimension-reduced, Chebyshev-approximated,
spectral feature-based open set domain adaptation for hyperspectral image
classification. The structure of the remainder of the paper is as follows: the next
sections present the methodology and the experimental
results and analysis, and the final section concludes the paper.

2 Methodology

The principal objective of this work is dimension reduction for open set domain HSI
classification. At first, the hyperspectral data of each dataset without feature
reduction is utilized to train the model. The training and test data have different
class distributions. Since there is no training data for the unknown samples, a GAN
model is utilized for sample generation in order to train the network and to
characterize known and unknown samples. The GAN model incorporates a generator and a
classifier. The classifier is trained to output P(y = K + 1 | x_t) = t, where x_t is the
target sample, y is the label, and t lies between 0 and 1. Here, we take t as 0.5: if
the output probability is less than 0.5, the sample is recognized as known, and if it is
greater than 0.5, it is identified as unknown [10]. Before dimension reduction, the 3D
HSI data is converted to an mn × b format, where mn represents the number of pixels and
b represents the number of bands. Figure 1 shows the flow of the proposed work. The work
includes two phases. In the primary stage, we utilize DMD as the strategy for feature
reduction. DMD helps to acquire the dynamic modes (eigenvalues and eigenvectors) of a
nonlinear system. The HSI data is first converted to a 2D matrix structure; then the
final column is replicated and appended to the matrix. The same matrix is split into two
matrices, X1 and X2. The singular value decomposition of X1 is computed, and a low-rank
approximation (reduced operator) matrix S is acquired [10]. After performing QR
decomposition of the eigenvectors of S, the matrix P, known as the permutation matrix,
is acquired; a rough sketch of this pipeline is given below.
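
The following rough numpy/scipy sketch follows our reading of the pipeline described above (reshape to mn × b, append a repeated last column, split into X1 and X2, SVD, reduced operator, eigendecomposition, pivoted QR). How the permutation is read off the QR factors is an interpretation of the text, not released code; consult [9, 10] for the exact formulation.

import numpy as np
from scipy.linalg import qr

def dmd_band_order(cube):
    """cube: HSI array of shape (rows, cols, bands). Returns band indices
    ordered from most to least significant by the pivoted QR step."""
    m, n, b = cube.shape
    X = cube.reshape(m * n, b).astype(np.float64)        # mn x b matrix
    X = np.hstack([X, X[:, -1:]])                        # repeat the final column
    X1, X2 = X[:, :-1], X[:, 1:]                         # two shifted matrices

    U, s, Vt = np.linalg.svd(X1, full_matrices=False)    # low-rank approximation of X1
    S = U.T @ X2 @ Vt.T @ np.diag(1.0 / s)               # reduced operator S
    _, W = np.linalg.eig(S)                              # eigenvectors of S (dynamic modes)

    # Column-pivoted QR on the (real part of the) eigenvector matrix: the pivot
    # order plays the role of the permutation matrix P that ranks the bands.
    _, _, pivots = qr(W.real, pivoting=True)
    return pivots

# Example: keep the most significant 20% of Salinas bands (44 of 224)
# order = dmd_band_order(hsi_cube)       # hsi_cube is a hypothetical loaded scene
# selected_bands = order[:44]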
This P matrix contains the ordering of the eigenvectors. The spectral bands are
organized in the matrix in such a way that the bottom contains the least significant and
the top the most significant bands. The accuracy of the model is noted for each
percentage reduction of bands: half of the features (bands) are removed at first, and
the reduction is repeated percentage-wise until the highest possible reduction of
features is accomplished without loss of information. The learning rate corresponding to
the highest accuracy is considered the optimal parameter for the best outcome. For each
feature reduction, the computation time and the number of learnable parameters are also
noted. In the second stage, a Chebyshev-based approximation method is used for the
dimension reduction of HSI. Using DMD, the maximum possible reduction for each dataset
with the highest classification accuracy is identified by comparing performance before
and after each dimension reduction. The novelty of the present work lies in using
Chebyshev approximation and analyzing the performance of Chebyshev-approximated features
on image datasets. Suppose 30% is the maximum possible reduction obtained with
comparable classification accuracy for one dataset using DMD; Chebyshev approximation is

Fig. 1 Block diagram of the proposed work

used to check whether we can truncate the spectral features to 20% and 10% of the bands
with better classification accuracy. Here also, the data should be in the form of a 2D
matrix before applying the Chebyshev approximation, where each row of the matrix
represents a pixel belonging to a class. Chebyshev approximation helps to truncate the
data to a minimum number of coefficients, and these coefficients are capable of
reconstructing the data to almost machine precision. The model is trained with these
truncated coefficients, and the performance is analyzed in terms of classification
accuracy, PSNR (peak signal-to-noise ratio), and computation time. Each time, the data
is truncated to a smaller number of coefficients to obtain the highest achievable
reduction, keeping all parameters such as the number of epochs, learnable parameters,
etc. the same. The data variation can be seen by comparing the plots of the spectral
signature of each pixel belonging to each class before and after truncation (a small
sketch of the truncation step is given below). In this stage, three hyperspectral
datasets, named Salinas, Salinas A, and Pavia U, are used for the experiment.
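
A small sketch of the truncation idea using numpy's Chebyshev tools: each spectral signature is fitted with a truncated Chebyshev series and reconstructed from those coefficients alone. A least-squares fit on uniformly spaced samples is used here as a stand-in for the Chebfun-style interpolation at Chebyshev points described in the text, and the synthetic signature is illustrative only.

import numpy as np
from numpy.polynomial import chebyshev as C

def chebyshev_truncate(signature, n_coeffs):
    """Fit a spectral signature (1-D vector of length b) with a truncated
    Chebyshev series and reconstruct it from those coefficients alone."""
    b = len(signature)
    x = np.linspace(-1.0, 1.0, b)                    # map band indices to [-1, 1]
    coeffs = C.chebfit(x, signature, deg=n_coeffs - 1)
    reconstruction = C.chebval(x, coeffs)
    return coeffs, reconstruction

# A smooth synthetic "spectrum" with 224 bands, truncated to 45 coefficients (~20%)
bands = np.linspace(0.0, 2.0, 224)
sig = 0.5 + 0.4 * np.sin(3.0 * np.pi * bands) * np.exp(-bands)
coeffs, rec = chebyshev_truncate(sig, 45)
mse = np.mean((sig - rec) ** 2)
psnr = 10.0 * np.log10(1.0 / mse)                    # PSNR for reflectance scaled to [0, 1]
print(coeffs.shape, round(psnr, 2))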

2.1 Dataset

This section gives a brief description of the HSI datasets. A total of three datasets
are utilized for the experiments: Salinas, Salinas A, and the Pavia University dataset.

Table 1 Dataset description for Salinas dataset [17]

Class  Class labels                                        Source (Salinas with six classes)  Target (Salinas with eight classes)
0      Broccoli green weeds 1                              1607                               402
1      Corn senesced green weeds                           2622                               656
2      Lettuce romaine 4 wk                                854                                214
3      Lettuce romaine 5 wk                                1542                               385
4      Lettuce romaine 6 wk                                733                                183
5      Lettuce romaine 7 wk                                856                                214
6      Unknown class (broccoli green weeds 2 and fallow)   0                                  5702
       Total samples                                       8214                               7756

2.2 Salinas

This dataset was gathered over Salinas Valley, California, utilizing the AVIRIS sensor
[17], and it has a spatial resolution of 3.7 m per pixel. Its dimension is
512 × 217 × 224, and the total number of classes is 16. Table 1 gives the details of the
samples and classes. The hyperspectral data include bare soils, vegetables, and vineyard
fields. Out of the 16 classes, 6 classes are taken as the source (training data), and 8
classes of the Salinas dataset are taken as the target (testing data). The train–test
split is 80%–20%. Around 5702 samples are utilized only for testing (target); these 5702
samples are acquired by joining two classes, namely fallow and broccoli green weeds 2.

2.3 Salinas A

Salinas A is a subscene of the Salinas image, which comprises 86 × 83 × 224 pixels with
six classes. Table 2 provides the class and sample details [17]. Out of a total of six
classes, five classes are taken as the source (training data), and six classes of the
Salinas A dataset are taken as the target (testing data). The train–test split is
80%–20% for the five classes that are common to both source and target, and one class
with around 799 samples is utilized only for the target dataset.

2.4 Pavia U

Pavia U is one of the two scenes acquired by the ROSIS sensor over Pavia, Northern
Italy [17].

Table 2 Dataset description for Salinas A dataset [17]


Class   Class labels                 Source (Salinas A with five classes)   Target (Salinas A with six classes)
0       Broccoli green weeds 1       313                                    78
1       Corn senesced green weeds    1074                                   269
2       Lettuce romaine 4 wk         493                                    123
3       Lettuce romaine 5 wk         1220                                   305
4       Lettuce romaine 6 wk         539                                    135
5       Lettuce romaine 7 wk         0                                      799
        Total samples                3639                                   1709

Table 3 Dataset description for Pavia U with five classes (source) and Pavia U with eight classes
(target) [17]
Class   Class labels                                              Source (Pavia U with five classes)   Target (Pavia U with eight classes)
0       Asphalt                                                   5305                                 1325
1       Gravel                                                    1679                                 420
2       Trees                                                     2451                                 613
3       Painted metal sheets                                      1076                                 269
4       Bare soil                                                 4023                                 1006
5       Unknown class (meadows, self-blocking bricks, shadows)    0                                    23,278
        Total samples                                             14,534                               26,911

The Pavia University image comprises 610 × 610 pixels, the number of bands is 103,
it contains a total of nine classes, and the spatial resolution is 1.3 meters. Table 3
shows the dataset description for Pavia U with five classes as the source and Pavia U
with eight classes as the target. Five classes of Pavia U are shared by both the source
and the target. Three classes with around 23,278 samples are used only for testing
(target); these samples are obtained by combining the classes shadows, self-blocking
bricks, and meadows.

3 Experiment Results

3.1 Dimensionality Reduction Based on Dynamic Mode Decomposition

The proposed work is implemented on three datasets and assessed through classi-
fication accuracy. The same model is used for the dimension-reduced datasets. The
optimal values used for the raw dataset to obtain good results are a batch size of 128
and 500 epochs. A learning rate of 0.0001 is used for both the raw and the
dimension-reduced data, and the Adam optimizer with a cosine ramp-down schedule
is used for training the network.
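
A minimal PyTorch-style sketch of this training configuration (Adam optimizer, learning rate 0.0001, cosine ramp-down over 500 epochs, batches of 128) is shown below; the placeholder network, the CosineAnnealingLR schedule, and the cross-entropy loss are assumptions used for illustration, since the actual GAN-based open-set model is not reproduced here.

```python
import torch
from torch import nn, optim

# Hypothetical stand-in classifier; the actual open-set model is not shown here.
model = nn.Sequential(nn.Linear(224, 256), nn.ReLU(), nn.Linear(256, 7))

optimizer = optim.Adam(model.parameters(), lr=1e-4)      # learning rate 0.0001
# Cosine ramp-down of the learning rate over the 500 training epochs.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    """loader is assumed to yield (features, labels) batches of size 128."""
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()               # step the cosine schedule once per epoch
```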

Table 4 Classification results for Salinas dataset


                         Without dimension reduction   With dimension reduction (20% of bands) using DMD
Overall accuracy (%)     98.42                         97.66
Elapsed time (min)       98                            48

Table 5 Classification results for shuffled Salinas dataset


                         Shuffled dataset 1              Shuffled dataset 2              Shuffled dataset 3
                         100% of     20% of the          100% of     20% of the          100% of     20% of the
                         bands       bands (44 bands)    bands       bands (44 bands)    bands       bands (44 bands)
Overall accuracy (%)     99.01       98.89               98.78       97.51               99.32       98.41
Elapsed time (min)       95          46                  91          38                  87          39

The hardware configuration used for the experiment is an Intel Core i5-8250U CPU
with a clock speed of 1.6 GHz. The GPU driver version is NVIDIA-SMI 391.25, and
CUDA 10.0 is used for computation.

3.1.1 Salinas Dataset

For the Salinas dataset, Salinas with six classes is taken as the source, and Salinas
with eight classes is taken as the target. Accuracies are obtained for the Salinas
dataset for various learning rates. With a learning rate of 0.0001, the accuracy for
20% of the bands (44 bands) is comparable to that of the raw data. The highest
classification accuracy of 98.42% is achieved at a learning rate of 0.0001 for the raw
data comprising 224 bands, with an unknown-class accuracy of 98.40% and a
known-class accuracy of 98.00%; hence, the learning rate is set to 0.0001. Likewise,
for 44 bands, the highest classification accuracy of 97.66% is achieved at a learning
rate of 0.0001, with an unknown-class accuracy of 98.45% and a known-class
accuracy of 96.88%, so the learning rate is again fixed at 0.0001. For the Salinas
dataset, 20% of the spectral information is therefore sufficient to obtain practically
identical accuracy. Table 4 compares the classification accuracies for the raw and
dimension-reduced data and also shows the time taken for network training.
To verify the efficiency of the approach, we shuffled the dataset in such a way that
every class appears under both the known and unknown classes. For the Salinas
dataset, shuffling has been done three times, and the performance of the model on
these new sets is analyzed using overall accuracy (OA) and computation time.
Classification results for the shuffled Salinas dataset are shown in Table 5.

Table 6 Classification results for Salinas A dataset


                         Without dimension reduction   With dimension reduction (10% of bands) using DMD
Overall accuracy (%)     99.80                         98.53
Elapsed time (min)       33                            12

3.1.2 Salinas A Dataset

As the next step, the Salinas A dataset is used for network training. Five classes are
used as the source, and six classes are used as the target. Classification accuracies
are practically the same for 10% of the bands (22 bands) at a learning rate of 0.0001.
The highest classification accuracy of 99.80% is achieved at a learning rate of 0.0001
for the raw dataset comprising 224 bands, with an unknown-class accuracy of
99.67% and a known-class accuracy of 99.89%; consequently, the learning rate is set
to 0.0001. Likewise, for 22 bands (10%), the highest accuracy of 98.53% is achieved
at a learning rate of 0.0001, with an unknown-class accuracy of 98.32% and a
known-class accuracy of 99.01%, so the learning rate is again set to 0.0001. For the
Salinas A dataset, 10% of the bands are sufficient to obtain comparable accuracy
with less computational time. Table 6 presents the accuracies for the feature-reduced
data and the time taken for network training; it is clear from the table that the
computational time is also reduced after feature reduction.
To verify the efficiency of the approach, we shuffled the dataset in such a way that
every class appears under both the known and unknown classes. For the Salinas A
dataset, shuffling has been done five times, and the performance of the model on
these new sets is analyzed using overall accuracy (OA) and computation time.
Classification results for the shuffled Salinas A dataset are shown in Table 7.

3.1.3 Pavia University Dataset

As our third analysis, the Pavia U dataset is used for network training. Five classes
are used as the source, and eight classes are used as the target. Classification
accuracies are practically the same for 30% of the bands (31 bands) at a learning rate
of 0.0001. The highest classification accuracy of 83.10% is achieved at a learning
rate of 0.0001 for the raw data comprising 103 bands, with an unknown-class
accuracy of 86.22% and a known-class accuracy of 60.42%; hence, the learning rate
is set to 0.0001. Similarly, for 31 bands (30% of the bands), the highest classification
accuracy of 82.3% is achieved at a learning rate of 0.0001, so it is set to 0.0001; the
unknown-class accuracy is 85.74% and the known-class accuracy is 59.42%. For the
Pavia U dataset, 30% of the bands are sufficient to obtain practically identical
accuracy with less computational time. Table 8 presents the accuracies for the
feature-reduced data and the time taken for network training.
Table 7 Classification results for shuffled Salinas A dataset
              Dataset 1              Dataset 2              Dataset 3              Dataset 4              Dataset 5
              100%      10% of       100%      10% of       100%      10% of       100%      10% of       100%      10% of
              bands     bands        bands     bands        bands     bands        bands     bands        bands     bands
OA (%)        99.83     98.75        98.78     97.32        98.98     97.65        99.90     98.98        99.65     98.71
Time (min)    35        19           32        13           38        15           28        12           32        13

Table 8 Classification results for Pavia U dataset


                         Without dimension reduction   With dimension reduction (30% of bands) using DMD
Overall accuracy (%)     83.10                         82.30
Elapsed time (min)       216                           118

Table 9 Classification results for shuffled Pavia U dataset


                         Shuffled dataset 1              Shuffled dataset 2              Shuffled dataset 3
                         100% of     30% of the          100% of     30% of the          100% of     30% of the
                         bands       bands (31 bands)    bands       bands (31 bands)    bands       bands (31 bands)
Overall accuracy (%)     77.89       76.37               78.72       76.30               81.98       80.90
Elapsed time (min)       234         109                 226         112                 243         123

It is clear from the table that the computational time is likewise reduced after feature
reduction.
To verify the efficiency of the approach, we shuffled the dataset in such a way that
every class appears under both the known and unknown classes. For the Pavia U
dataset, shuffling has been done three times, and the performance of the model on
these new sets is analyzed using overall accuracy (OA) and computation time.
Classification results for the shuffled Pavia U dataset are shown in Table 9.

3.2 Dimension Reduction Using Chebyshev Polynomial Approximation

The previous experimental results using DMD show that, even with dimension
reduction, the model is able to achieve almost the same classification accuracy as
with the raw HSI data. The experimental results also show that the maximum
possible reduction in feature dimension with comparable classification accuracy is
20% of the total number of available bands (44 bands) for the Salinas dataset, 10% of
the bands (22 bands) for the Salinas A dataset, and 30% of the bands (31 bands) for
the Pavia U dataset. Shuffling of classes has been done for every dataset to analyze
the accuracy and computation time at each percentage of dimension reduction. As an
extension of the experiment, the spectral features are approximated using Chebyshev
coefficients. Chebyshev approximation is used to check whether it is possible to
reduce the spectral features further while maintaining good classification accuracy,
since it truncates the data to a minimum number of coefficients.

Table 10 Classification results for Salinas dataset after dimension reduction using Chebfun
                          44 coefficients (20% of the bands)   34 coefficients   24 coefficients   14 coefficients
OA (%)                    99.30                                98.74             95.43             90.94
Computation time (min)    46                                   38                29                11
PSNR                      25.07                                25.02             23.24             19.42

The model is trained with these truncated coefficients, and the performance is
analyzed in terms of classification accuracy and computation time. The PSNR is also
analyzed as an error measure for every dataset after Chebyshev approximation. The
spectral signature of one pixel corresponding to each class has been plotted to
visualize the data variation in the dataset after dimension reduction. The results
obtained after approximating the spectral features using Chebyshev coefficients are
given in this section.
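
For reference, a simple PSNR computation between the original and the reconstructed spectra could look like the following sketch; the choice of the peak value (the maximum absolute value of the reference data) is an assumption, since the exact definition used in the experiments is not stated.

```python
import numpy as np

def psnr(original, reconstructed):
    """Peak signal-to-noise ratio (in dB) between the original spectra and
    the spectra reconstructed from truncated Chebyshev coefficients."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    mse = np.mean((original - reconstructed) ** 2)
    if mse == 0:
        return float("inf")                 # identical signals
    peak = np.max(np.abs(original))         # assumed peak value of the reference data
    return 10.0 * np.log10(peak ** 2 / mse)
```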

3.2.1 Salinas Dataset

As mentioned before, 44 bands (20% of the bands) were the maximum possible
reduction achieved for the Salinas dataset using DMD. Initially, the spectral features
are truncated to 44 coefficients, and the model is trained with these 44 coefficients.
Three parameters are taken for the validation process in this phase: overall accuracy
(OA), computation time, and PSNR (peak signal-to-noise ratio). Overall accuracy
includes the known- and unknown-class accuracies. Table 10 shows the
classification results obtained for the Salinas dataset after dimension reduction using
the Chebyshev polynomial approximation method. The same number of epochs and
learning rate that were employed in the case of DMD are used here to run the model.
From the table, it is evident that the accuracy increases to 99.30% for classification
using 44 coefficients with Chebyshev polynomial approximation, compared with
97.66% for 20% of the bands (44 bands) using DMD. The reduction process is then
repeated until the minimum number of coefficients with comparable classification
accuracy is reached. The PSNR is computed between the original dataset and the
dimensionally reduced dataset. From the table, it is evident that the overall accuracy
and PSNR obtained with fewer coefficients remain comparable to the results for 44
coefficients.
As the next step, the efficiency of the approach has been verified by shuffling the
dataset in such a way that every class appears under both the known and unknown
classes. For the Salinas dataset, shuffling has been done three times, and the
performance of the model on these new sets is analyzed using overall accuracy (OA)
and computation time. The results show that this approach is efficient in reducing the
spectral features without further information loss. Classification results for the
shuffled Salinas dataset are shown in Table 11. The table compares the classification
accuracies obtained for the maximum possible reduction of bands in the case of
DMD and the maximum possible reduction of coefficients in the case of Chebyshev
approximation.

Table 11 Classification results for shuffled Salinas dataset


                         Shuffled dataset 1              Shuffled dataset 2              Shuffled dataset 3
                         44 bands   34 coefficients      44 bands   34 coefficients      44 bands   34 coefficients
                         (DMD)      (Chebyshev)          (DMD)      (Chebyshev)          (DMD)      (Chebyshev)
Overall accuracy (%)     98.89      98.77                97.51      96.90                98.41      98.40
Elapsed time (min)       46         41                   38         26                   39         32

To visualize the variation in the pattern of the data, we have plotted the spectral
signature of one pixel corresponding to one of the classes of the Salinas dataset
before and after truncation using Chebyshev approximation. Figure 2(i) shows the
comparison of the plots of one pixel corresponding to class 0 (broccoli green weeds
1) before and after truncation with 34 and 24 coefficients. The data has been
normalized to the −1 to 1 range.
It is evident from Fig. 2(i) that the plot of the pixel after truncation is almost similar
to the plot of the original pixel before truncation when 34 coefficients are used; with
24 coefficients, however, the signal is smoothened and more information is lost,
hence the lower accuracy.
Again, we plotted the spectral signature of one pixel corresponding to another class.
Figure 2(ii) shows the comparison of the plots of one pixel corresponding to class 1
(corn senesced green weeds) before and after truncation with 34 and 24 coefficients.
It is evident from Fig. 2(ii) that the plot after truncation with 34 coefficients is almost
identical to the original, whereas with 24 coefficients the signal is smoothened, more
information is lost, and the accuracy drops.
Figure 3 shows the comparison of the classification maps obtained for the Salinas
dataset under different dimension reductions and also compares the maps obtained
with the two techniques. It is evident that the map obtained after Chebyshev
approximation shows better clarity of the scene.

3.2.2 Salinas A Dataset

As mentioned before, 22 bands (10% of the bands) were the maximum possible
reduction achieved for the Salinas A dataset using DMD. Initially, the spectral
features are truncated to 22 coefficients, and the model is trained with these 22
coefficients. Three parameters are taken for the validation process in this phase:
overall accuracy (OA), computation time, and PSNR (peak signal-to-noise ratio).
Overall accuracy includes the known- and unknown-class accuracies.
Table 12 shows the classification results obtained for the Salinas A dataset after
dimension reduction using the Chebyshev polynomial approximation method.

Fig. 2 (i) Spectral signature plot corresponding to one sample from class 0 (broccoli green weeds
1) for the Salinas original dataset. (ii) Spectral signature plot corresponding to one sample from
class 1 (corn senesced green weeds) for the Salinas original dataset. (a) Plot between the number of
bands and the reflectance value. (b) Plot after using Chebyshev without truncation. (c) Plot after
using Chebyshev approximation with truncated 34 and 24 coefficients

The same
number of epochs and learning rate that were employed in the case of DMD are used
here to run the model. From the table, it is evident that the accuracy increases to
99.64% for classification using 22 coefficients with Chebyshev polynomial
approximation, compared with 98.53% for 10% of the bands (22 bands) using DMD.
The reduction process is then repeated until the minimum number of coefficients
with comparable classification accuracy is reached. The PSNR is computed between
the original dataset and the dimensionally reduced dataset, and the computation time
is reduced by the dimension reduction.

Fig. 3 Classification maps obtained before and after dimension reduction for Salinas dataset.
(a) Without dimension reduction. (b) After using DMD (44 bands). (c) After using Chebyshev
approximation (34 coefficients)

Table 12 Classification results for Salinas A dataset after dimension reduction using Chebyshev
                          22 coefficients (10% of the bands)   15 coefficients   10 coefficients
OA (%)                    99.64                                99.50             97.84
Computation time (min)    13                                   10                10
PSNR                      21.78                                20.08             17.54

From the table, it is evident that the overall accuracy and PSNR obtained with fewer
coefficients remain comparable to the results for 22 coefficients.
As the next step, the efficiency of the approach has been verified by shuffling the
dataset in such a way that every class appears under both the known and unknown
classes. For the Salinas A dataset, shuffling has been done five times, and the
performance of the model on these new sets is analyzed using overall accuracy (OA)
and computation time. The results show that this approach is efficient in reducing the
spectral features without further information loss. Classification results for the
shuffled Salinas A dataset are shown in Table 13. The table compares the
classification accuracies obtained for the maximum possible reduction of bands in
the case of DMD and the maximum possible reduction of coefficients in the case of
Chebyshev approximation. To visualize the variation in the pattern of the data, we
have plotted the spectral signature of one pixel corresponding to one of the classes of
the Salinas A dataset before and after truncation using Chebyshev approximation.
Figure 4(i) shows the comparison of the plots of one pixel corresponding to class 5
(lettuce romaine 7 wk) before and after truncation with 15 and 10 coefficients. The
data has been normalized to the −1 to 1 range.
It is evident from Fig. 4(i) that the plot of the pixel after truncation is almost similar
to the plot of the original pixel before truncation in the case of using 15 coefficients.
Table 13 Classification results for shuffled Salinas A dataset
              Dataset 1                Dataset 2                Dataset 3                Dataset 4                Dataset 5
              22 bands  15 coeff       22 bands  15 coeff       22 bands  15 coeff       22 bands  15 coeff       22 bands  15 coeff
              (DMD)     (Chebyshev)    (DMD)     (Chebyshev)    (DMD)     (Chebyshev)    (DMD)     (Chebyshev)    (DMD)     (Chebyshev)
OA (%)        98.75     99.70          97.32     98.67          97.65     97.34          97.09     99.50          98.71     99.43
Time (min)    19        11             13        10             15        10             12        12             13        12

Fig. 4 (i) Spectral signature plot corresponding to one sample from class 5 (lettuce romaine 7 wk)
for the Salinas A original dataset. (ii) Spectral signature plot corresponding to one sample from
class 4 (lettuce romaine 6 wk) for the Salinas A original dataset. (a) Plot between the number of
bands and the reflectance value. (b) Plot after using Chebyshev without truncation. (c) Plot after
using Chebyshev approximation with truncated 15 and 10 coefficients

With 10 coefficients, however, the signal is smoothened and more information is
lost, hence the lower accuracy.
The spectral signature of one pixel corresponding to another class is also plotted.
Figure 4(ii) shows the comparison of the plots of one pixel corresponding to class 4
(lettuce romaine 6 wk) before and after truncation with 15 and 10 coefficients. It is
evident from Fig. 4(ii) that the plot after truncation with 15 coefficients is almost
identical to the original, whereas with 10 coefficients the signal is smoothened, more
information is lost, and the accuracy drops.

Fig. 5 Classification maps obtained before and after dimension reduction for Salinas A dataset.
(a) Without dimension reduction. (b) After using DMD (22 bands). (c) After using Chebyshev
approximation (15 coefficients)

Figure 5 shows the comparison of the classification maps obtained for the Salinas A
dataset under different dimension reductions and also compares the maps obtained
with the two techniques. It is evident that the map obtained after Chebyshev
approximation shows better clarity of the scene.

3.2.3 Pavia U Dataset

As mentioned before, 31 bands (30% of the bands) were the maximum possible
reduction achieved for the Pavia U dataset using DMD. Initially, the spectral features
are truncated to 31 coefficients, and the model is trained with these 31 coefficients.
Three parameters are taken for the validation process in this phase: overall accuracy
(OA), computation time, and PSNR (peak signal-to-noise ratio). Overall accuracy
includes the known- and unknown-class accuracies. Table 14 shows the
classification results obtained for the Pavia U dataset after dimension reduction using
the Chebyshev polynomial approximation method. The same number of epochs and
learning rate that were employed in the case of DMD are used here to run the model.
From the table, it is evident that the accuracy increases to 82.84% for classification
using 31 coefficients with Chebyshev polynomial approximation, compared with
82.30% for 30% of the bands (31 bands) using DMD. The reduction process is then
repeated until the minimum number of coefficients with comparable classification
accuracy is reached. The PSNR is computed between the original dataset and the
dimensionally reduced dataset, and the computation time is reduced by the
dimension reduction. From the table, it is also evident that the overall accuracy and
PSNR obtained with fewer coefficients remain comparable to the results for 31
coefficients. As the next step, the efficiency of the approach has been verified by
shuffling the dataset in such a way that every class appears under both the known
and unknown classes. For the Pavia U dataset, shuffling has been done three times,
and the performance of the model on these new sets is analyzed using overall
accuracy (OA) and computation time. The results show that this approach is efficient
in reducing the spectral features without further information loss. Classification
results for the shuffled Pavia U dataset are shown in Table 15.

Table 14 Classification results for Pavia U dataset after dimension reduction using Chebfun
                          31 coefficients (30% of the bands)   21 coefficients   11 coefficients
OA (%)                    82.84                                92.24             90.94
Computation time (min)    116                                  104               98
PSNR                      40.77                                41.13             41.09

Table 15 Classification results for shuffled Pavia U dataset


                 Shuffled dataset 1             Shuffled dataset 2             Shuffled dataset 3
                 31 bands  21 coefficients      31 bands  21 coefficients      31 bands  21 coefficients
                 (DMD)     (Chebyshev)          (DMD)     (Chebyshev)          (DMD)     (Chebyshev)
OA (%)           76.37     82.44                76.30     72.44                80.90     81.45
Time (min)       109       97                   112       100                  123       112

The table shows a comparison of the classification accuracies obtained for the
maximum possible reduction of bands in the case of DMD and the maximum
possible reduction of coefficients in the case of Chebyshev approximation.
To visualize the variation in the pattern of the data, we have plotted the spectral
signature of one pixel corresponding to one of the classes of the Pavia U dataset
before and after truncation using Chebyshev approximation. Figure 6(i) shows the
comparison of the plots of one pixel corresponding to class 2 (trees) before and after
truncation with 21 and 11 coefficients. The data has been normalized to the −1 to 1
range.
It is evident from Fig. 6(i) that the plot of the pixel after truncation is almost similar
to the plot of the original pixel before truncation when 21 coefficients are used; with
11 coefficients, however, the signal is smoothened and more information is lost,
hence the lower accuracy. The spectral signature of one pixel corresponding to
another class is also plotted. Figure 6(ii) shows the comparison of the plots of one
pixel corresponding to class 0 (asphalt) before and after truncation with 21 and 11
coefficients. It is evident from Fig. 6(ii) that the plot after truncation with 21
coefficients is almost identical to the original, whereas with 11 coefficients the signal
is smoothened, more information is lost, and the accuracy drops.
Figure 7 shows the comparison of the classification maps obtained for the Pavia U
dataset under different dimension reductions and also compares the maps obtained
with the two techniques. It is evident that the map obtained after Chebyshev
polynomial approximation shows better clarity of the scene.

Fig. 6 (i) Spectral signature plot corresponding to one sample from class 2 (trees) for the Pavia U
original dataset. (ii) Spectral signature plot corresponding to one sample from class 0 (asphalt). (a)
Plot between the number of bands and the reflectance value. (b) Plot after using Chebyshev without
truncation. (c) Plot after using Chebyshev approximation truncated to 21 and 11 coefficients

4 Conclusion

In this work, open set domain adaptation with a GAN model has been applied for
HSI classification, and the same model has also been applied to the dimension-
reduced datasets. In the first stage, dynamic mode decomposition (DMD) is used as
the feature reduction procedure, and the results show that this method is very
effective in removing the redundancy of bands without much loss of information. It
performed well on open set domain HSI classification, where the main aim of the
model is to identify the classes that were absent during training as unknown.


Fig. 7 Classification maps obtained before and after dimension reduction for Pavia U dataset.
(a) Without dimension reduction. (b) After using DMD (31 bands). (c) After using Chebyshev
approximation (21 coefficients)

From the results for the three datasets, it is evident that the model is able to achieve
practically identical classification accuracy to that of the raw dataset even after
dimension reduction, and the computation time is likewise reduced after dimension
reduction for all three datasets. Additionally, the results show that 20% of the bands
for the Salinas dataset, 10% of the bands for the Salinas A dataset, and 30% of the
bands for the Pavia U dataset are the highest achievable feature reductions that result
in practically the same accuracies. In the next phase, we explored a novel Chebyshev
approximation-based dimensionality reduction technique for HSI classification to
check whether it is possible to reduce the dimension of each dataset further with
comparable classification accuracy. The performance of the model is analyzed in
terms of overall accuracy, computation time, and PSNR. The results show that the
Chebyshev polynomial approximation is a very effective approach for approximating
the spectral features and results in good classification accuracy. The computational
time taken by the model when using the reduced data is much lower than with the
raw data for each dataset. Also, the experimental results show that only 15
coefficients are needed for Salinas A, about 34 coefficients for the Salinas dataset,
and 21 coefficients for the Pavia U dataset, which results in better classification
accuracy.

References

1. Goetz, A. F., Vane, G., Solomon, J. E., & Rock, B. N. (1985). Imaging spectrometry for earth
remote sensing. Science, 228(4704), 1147–1153.
2. Pau, P. B., & Gall, J. (2017). Open set domain adaptation. Proceedings of the IEEE
International Conference on Computer Vision.
3. Hoffman, J., Rodner, E., Donahue, J., Kulis, B., & Saenko, K. (2014). Asymmetric and
category invariant feature transformations for domain adaptation. International Journal of
Computer Vision, 109(1–2), 28–41.

4. Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: An
unsupervised approach. In IEEE Conference on Computer Vision and Pattern Recognition (pp.
999–1006).
5. Saenko, K., Kulis, B., Fritz, M., & Darrell, T. (2010). Adapting visual category models to new
domains. In IEEE European conference on computer vision (pp. 213–226).
6. Chopra, S., Balakrishnan, S., & Gopalan, R. (2013). DLID: Deep learning for domain adap-
tation by interpolating between domains. In ICML workshop on challenges in representation
learning.
7. Gong, B., Shi, Y., Sha, F., & Grauman, K. (2012). Geodesic flow kernel for unsupervised
domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (pp.
2066–2073).
8. Saito, K., Yamamoto, S., Ushiku, Y., & Harada, T. (2018). Open set domain adaptation by back
propagation, ArXiv, 1804.10427v2[cs.CV].
9. Fong, M. (2007). Dimension reduction on hyperspectral images. University of California.
10. Megha, P., Sowmya, V., & Soman, K. P. (2018). Effect of dynamic mode decomposition based
dimension reduction technique on hyperspectral image classification. In Computational signal
processing and analysis (pp. 89–99). Springer.
11. Krishnendu, C. S., Sowmya, V., & Soman, K. P. (2021). Impact of dimension reduced spectral
features on open set domain adaptation for hyperspectral image classification. In Evolution in
computational intelligence (pp. 737–746). Springer.
12. Aldhaher, S., Luk, P. C. K., & Whidborne, J. F. (2014). Electronic tuning of misaligned coils
in wireless power transfer systems. IEEE Transactions on Power Electronics, 29(11), 5975–
5982.
13. Lee, S.-P., Cho, B.-L., Ha, J.-S., & Kim, Y.-S. (2015). Target angle estimation of multifunction
radar in search mode using digital beamforming technique. Journal of Electromagnetic Waves
and Applications, 29(3), 331–342.
14. Driscoll, T. A., Hale, N., & Trefethen, L. N. (Eds.). (2014). Chebfun guide. Pafnuty Publica-
tions.
15. Gowri, B., Ganga, K. P., Soman, K. P., & Govind, D. (2018). Improved epoch extraction from
telephonic speech using Chebfun and zero frequency filtering. In Interspeech.
16. Mohan, N., & Soman, K. P. (2018). Power system frequency and amplitude estimation using
variational mode decomposition and chebfun approximation system. In 2018 twenty fourth
national conference on communications (NCC). IEEE.
17. Hyperspectral image dataset available at http://www.ehu.eus/ccwintco/index.php/
HyperspectralRemoteSensingScenes
An Effective Diabetic Retinopathy
Detection Using Hybrid Convolutional
Neural Network Models

Niteesh Kumar, Rashad Ahmed, B. H. Venkatesh, and M. Anand Kumar

1 Introduction

Diabetic retinopathy can be described as an eye disease caused by damage to specific
blood vessels [11], namely the arteries and veins of the photosensitive tissue at the
rear of the eye (retina), and it usually affects both eyes. Initially, there are no
symptoms or only mild vision problems, but the condition eventually leads to
blindness. It is therefore necessary to detect and categorize diabetic retinopathy at an
early stage for effective treatment and prevention of vision loss [10]. Diabetic
retinopathy identification is a long-term process requiring manual intervention, in
which a qualified physician needs to analyze and measure digital color images of the
retina. Based on previous studies on DR treatment and the Wisconsin Epidemiologic
Study of Diabetic Retinopathy, a widely used classification of the 5 different stages
of DR was initially proposed by Wilkinson in 2003. A brief description is given in
Table 1. The stages are: no apparent retinopathy (I), mild non-proliferative diabetic
retinopathy (NPDR) (II), moderate NPDR (III), severe NPDR (IV), and proliferative
diabetic retinopathy (V) [4]. Table 1 also provides an insight into the severity of the
DR stages according to the observations made through dilated pupils during an
ophthalmoscopy checkup, also referred to as fundoscopy.

N. Kumar · R. Ahmed · B. H. Venkatesh · M. Anand Kumar
Department of Information Technology, National Institute of Technology Karnataka, Mangalore,
India
e-mail: m_anandkumar@nitk.edu.in


Table 1 Diabetic severity labels of the dataset [4]


Stage Dilated ophthalmoscopy Severity
1 No abnormalities No DR
2 Microaneurysms only Mild non-proliferative DR
3 Retinal dot and blot hemorrhages Moderate non-proliferative DR
4 Definite venous beading in 2 or more quadrants Severe non-proliferative DR
5 Neovascularization pre-retinal hemorrhage Proliferative DR

2 Related Work

For aneurysm detection as an aspect of DR (diabetic retinopathy), [2] proposed a
deep learning framework based on the InceptionV3 model for feature extraction. The
preprocessing step in this research involved cleaning of the data, normalization,
transformation, and finally feature selection. For classification, a multiclass SVM
classifier was implemented, since the key parameters are defined only once instead
of performing repeated binary analysis. The method yielded good performance, with
average accuracy ranging between 72 and 78%, but in cases where the fundus image
lost focus, the approach performed poorly. For microaneurysm identification in
fundus images, an ensemble framework has also been proposed [14], in which the
results of various classifiers are combined for microaneurysm classification.
Harun et al. [5] make use of an artificial neural network (ANN) model, namely a
multi-layered perceptron (MLP) trained by Levenberg–Marquardt (LM) and
Bayesian regularization (BR), to classify whether a fundus image shows signs of DR
or not. Nineteen binary features were extracted to feed the input layer of the MLP
network. The classification was run over a number of epochs, and the MLP trained
by BR proved better than the MLP trained by LM, producing an average accuracy of
66.04%. The accuracy in this case could have been improved if an optimal number
of hidden nodes had been configured. This research describes how automatic
analysis of retinal images can ease the diagnosis and screening of diabetic
retinopathy. Image segmentation is done using a neural network and fuzzy
clustering. In noisy regions, this approach fails, and the resulting regions become too
sensitive to light, thereby leading to incorrect segmentation of the image.
The detection of DR stages using color fundus images has been proposed by Sodhu
and Khatkar [16]. Features are extracted from the raw images using image
processing techniques and then fed into a support vector machine (SVM) combined
with fuzzy C-means clustering. This combination of fuzzy C-means clustering, the
SVM technique, and preprocessing improves the detection of blood vessels and the
optic disk, and the hybrid approach is used to analyze and detect diabetic
retinopathy. In an earlier application of neural networks, the ImageNet (AlexNet)
CNN architecture was implemented [8]. The network comprises 650,000 neurons
and over 60 million parameters. An important feature of this implementation is the
rectified linear unit (ReLU) activation function, which introduces nonlinearity into
the neural network architecture. Apart from ReLU, there are other nonlinear
functions such as the hyperbolic tangent (tanh) and the sigmoid function, but these
are saturating in nature. The reason for opting for ReLU as the activation function is
that it reaches a given training error rate about six times faster than tanh. The
implementation also features dropout, which prevents the neural network model
from overfitting. Since the distribution of the dataset over the 5 different categories
of DR was not even, the method could only produce an average accuracy of 71.06%.
Like the previously discussed AlexNet model, a number of neural network models
have evolved over the course of technological development. One such instance is the
VGG16 model [15], comprising 16 weighted layers and about 138 million
parameters. Its simplicity and standard way of implementation are among the prime
reasons to adopt this architecture. All the convolutional hidden layers use 3 × 3
filters, while 2 × 2 kernels are used in the pooling layers, and the number of filters
starts from 64 and goes up to 512, all powers of 2. At the time, such an architecture
provided about 51% average accuracy.

3 Methodology

Staging of DR at its various phases is critical for the detection of diabetes mellitus
(DM), which is often the initial progression of the disease that later develops into
retinopathy. In order to determine the stage of diabetic retinopathy objectively and
accurately, the purpose of this research is to incorporate an automatic, standalone
approach based on image processing and machine learning. The application of
image processing in the biomedical domain is immense, ranging from malignant and
benign tumor detection using segmentation to efficient and cost-effective analysis of
brain MRI scans. The step that follows image analysis is the integration of a deep
learning framework to help categorize the analyzed medical data. Since medical data
is largely composed of visual attributes such as images, standard [1] machine
learning approaches make it difficult to analyze such data in the presence of
environmental constraints; hence, CNN models are adopted in scenarios involving
large quantities of visual input. In the past, deep learning algorithms have been used
to identify diabetic retinopathy in retinal images in large-scale scientific studies.
Nevertheless, these studies involve several steps and are aimed only at binary
classification. One of the earlier works aimed at the development of an automatic
severity classifier for diabetic retinopathy with CNNs. In the following research, we
explore the performance of different CNN and hybrid CNN models in order to
correctly categorize the data into the different DR labels mentioned in Table 1. The
dataset comprises 8407 labeled fundoscopic images provided by the EyePACS
foundation. The hybrid versions of the CNN models combine standard CNN models
with other machine learning techniques; here we focus on integrating the CNN
model with support vector machine [3] and random forest classifiers. We found that
the proposed deep learning architecture performs quite well, with an average
accuracy of 75%.

3.1 Research Objectives

The objectives of this study are to:

– Detect blood vessels.
– Categorize the different stages of diabetic retinopathy into no DR, mild DR,
moderate DR, severe DR, and proliferative DR.
– Explore the performances of different CNN and hybrid CNN models and compare
the accuracy between them.

3.2 Feature Selection

Feature selection is the process by which the number of input variables of the data is
reduced before being applied to a classifier. Feature selection improves the
performance of the model and can be used to identify and remove variables that are
not relevant and do not increase the accuracy of the classifier [6]. We have chosen
two features: blood vessel area and exudate area. Small and delicate blood vessels
can break beneath the tissue and cover the white of the eye, resulting in eye redness.
Such redness indicates a hemorrhage, which is a sign of diabetic risk. The normal
value for the blood vessel area is 36,230.56 [17], and this value decreases as
contraction occurs in diabetic retinopathy; in the retina, blood vessels get damaged
when diabetes occurs. The yellow patches are called hard exudates. Hence, the
exudate area is an informative feature, and we consider it in the feature selection
process.
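
A rough, hedged sketch of how the two hand-crafted features (blood-vessel area and hard-exudate area) could be estimated with OpenCV is given below; the green-channel processing, CLAHE settings, and threshold values are illustrative assumptions and not the procedure used in this study.

```python
import cv2
import numpy as np

def vessel_and_exudate_areas(fundus_bgr):
    """Return rough, illustrative estimates (in pixels) of the blood-vessel
    area and the hard-exudate area of a fundus image."""
    green = fundus_bgr[:, :, 1]                        # vessels contrast best in the green channel

    # Vessel map: enhance contrast, then pick up dark, thin structures.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(green)
    background = cv2.medianBlur(enhanced, 21)
    vessels = cv2.subtract(background, enhanced)       # vessels are darker than the background
    _, vessel_mask = cv2.threshold(vessels, 15, 255, cv2.THRESH_BINARY)

    # Exudate map: bright yellowish patches appear bright in the green channel.
    _, exudate_mask = cv2.threshold(green, 200, 255, cv2.THRESH_BINARY)

    return int(np.count_nonzero(vessel_mask)), int(np.count_nonzero(exudate_mask))
```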

3.3 Proposed Models

For the classification of the retinal images, we deploy a convolutional neural
network model. To achieve better performance and accuracy, the neural network
model is coupled with different classifiers, namely random forest and support vector
machine. The models (Fig. 1) are described below.

Fig. 1 Diagrammatic representation of Proposed Model

3.3.1 CNN Model

The convolutional neural network (CNN) is effective for image recognition and
classification. When the input images are fed into CNN models, the main aim is to
extract features from the images. Preprocessing such as blurring, sharpening, and
edge detection is applied to the input images. The rectified linear unit (ReLU)
activation is used after every convolutional layer of the CNN. ReLU introduces
nonlinearity; it is a nonlinear operation that replaces all negative values in the feature
map with zero. While keeping most of the important features, the dimensionality of
each feature map is reduced using pooling and subsampling; pooling is done by
taking the sum, average, or maximum of a sub-region of the feature map. After these
operations are stacked into layers, a softmax activation function is applied to the
output feature map, which completes the classification process.
For the classification of the DR stages, we implemented our proposed convolutional
neural network model; in the following subsections, we describe the proposed model
and the further configurations used to leverage these CNN implementations.
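
The following PyTorch sketch illustrates the sequence of operations described above (convolution, ReLU, max-pooling, and a softmax output over the five DR stages); the specific layer sizes and the class name SimpleDRNet are assumptions for illustration and do not reproduce the exact proposed model.

```python
import torch
from torch import nn

class SimpleDRNet(nn.Module):
    """Minimal CNN: convolution -> ReLU -> max-pooling, repeated, then a
    dense layer and a softmax over the five DR stages."""
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 2x2 pooling halves each spatial dimension
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),
            nn.Linear(128, n_classes),             # logits for the 5 DR stages
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        # Softmax gives class probabilities; during training one would
        # normally feed the raw logits to a cross-entropy loss instead.
        return torch.softmax(logits, dim=1)
```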

3.3.2 CNN with SVM Classifier

Figure 2 represents the hybrid CNN with SVM classifier model. The proposed
model consists of a CNN for feature extraction from the images, and the extracted
features are used by an SVM for classification.

Fig. 2 Coupled architecture of CNN and SVM classifier

The CNN consists of 5 layers, in which the first layer is the input layer and the last
layer is the output layer. The normalized image shape fed into the input layer is
144 × 256 raw pixels. As shown in Fig. 2, C1 is a convolution layer with 32 filters of
size 5 × 5, C2 is a convolution layer with 64 filter maps of size 3 × 3, and C3 is a
convolution layer with 128 filter maps of size 3 × 3. The output layer is a dense layer
with 128 units. Between the convolutional layers, max-pooling filters of size 2 × 2
are arranged consecutively, and the features are flattened before the dense output
layer. The intermediate feature vector generated by the dense layer is fed into the
support vector machine classifier for segmentation and classification.
As shown in Fig. 2, after flattening, the output of the dense layer is fed into the SVM
classifier. The support vector machine (SVM) [13] is a classifier that performs well
when there is a clear margin of separation between classes, and its effectiveness is
observed in high-dimensional spaces. Since diabetic retinopathy detection is a
multiclass classification problem, we need the SVM to classify the multiclass labels.
The complexity of SVM increases when there are more than two classes to classify;
to handle this, approaches such as the one-against-all support vector machine
(OAASVM) can be used, in which, for N-class problems (N > 2), N binary SVM
classifiers are built [7].
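
A hedged sketch of such a hybrid CNN+SVM pipeline is given below: a feature extractor loosely following the layer sizes described for Fig. 2 (32 5×5, 64 3×3, and 128 3×3 filters with 2×2 max-pooling and a 128-unit dense layer) produces a 128-dimensional feature vector, which is then classified with a one-against-all linear SVM from scikit-learn; the global average pooling before the dense layer and the LinearSVC hyperparameters are assumptions, not the exact configuration used.

```python
import torch
from torch import nn
from sklearn.svm import LinearSVC

# Illustrative feature extractor loosely matching the description of Fig. 2.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 128), nn.ReLU(),        # dense layer whose output is the feature vector
)

def extract_features(images):
    """images: tensor of shape (N, 3, 144, 256) -> (N, 128) feature matrix."""
    cnn.eval()
    with torch.no_grad():
        return cnn(images).numpy()

def fit_hybrid_svm(train_images, train_labels):
    """One-against-all SVM on the CNN features (LinearSVC is one-vs-rest by default)."""
    svm = LinearSVC(C=1.0, max_iter=10000)
    svm.fit(extract_features(train_images), train_labels)
    return svm
```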

3.3.3 CNN with RF Classifier

Figure 3 represents the hybrid CNN with RF classifier model. The proposed model
consists of a CNN for feature extraction from the images, and the extracted features
are used by a random forest for classification.

Fig. 3 Coupled architecture of CNN and random forest classifier

The CNN feature extraction is the same as that explained for the CNN with SVM
classifier; the main difference is the coupled classifier, which in this case is the
random forest (RF) classifier. A random forest consists of many individual decision
trees that operate as an ensemble [12]. Each decision tree predicts the DR class, and
the class with the maximum votes becomes the prediction of the model.
The CNN model is trained to extract features from the input image, and the fully
connected layer of the CNN is replaced by a random forest classifier to classify the
image pattern. The output of the dense layer of the CNN produces a feature vector
representing the image pattern, consisting of 645 values. The random forest classifier
is trained using the image features produced by the CNN model; the trained random
forest then uses these features to perform the classification task and make decisions
on test images. In the experiment, the random forest contains 50 individual decision
trees, with the other hyperparameters kept at their default values.
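
Swapping the SVM for a random forest of 50 trees requires only a small change; the sketch below assumes the extract_features helper from the previous sketch and leaves the remaining scikit-learn hyperparameters at their defaults (the fixed random_state is an added assumption for reproducibility).

```python
from sklearn.ensemble import RandomForestClassifier

def fit_hybrid_rf(train_images, train_labels):
    """Train a random forest of 50 trees on the CNN feature vectors."""
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(extract_features(train_images), train_labels)
    return rf
```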

4 Experimental Results and Analysis

The experiment is performed on a high-end server with an NVIDIA Tesla V100
32 GB passive GPU; the server supports up to 3 double-width and 6 single-width
GPU cards. The server's storage controller provides 12 Gbps PCIe 3.0 connectivity
with a 2 GB NV cache and supports RAID levels 1, 5, 6, 10, and 50. The CPU
configuration of the server is 2 × Intel Xeon Gold 6240R (2.4 GHz, 24 cores/48
threads, 10.4 GT/s, 35.75 MB cache, Turbo, HT, 165 W) with DDR4-2933 memory.

Table 2 Training and testing dataset distribution


Class labels Number of samples in train set Number of samples in test set
0 1834 230
1 1222 154
2 1222 154
3 917 115
4 917 115

The original dataset repository published by Kaggle consisted of over 35,000
fundoscopic images. Because these images were not collected in a controlled
laboratory environment, they are of relatively heterogeneous nature: image
resolutions range from 2594 × 1944 pixels to 4752 × 3168 pixels, and due to
sub-optimal lighting conditions, the images contain some amount of noise. From the
Kaggle dataset, 8407 representative, high-quality images constituting about 8 GB of
data were selected to build the dataset used for training and testing the proposed
models reported in this chapter. Out of the 8407 images, 6112 images are used for
training the model (Table 2). Finally, for testing purposes, almost 10% of the images,
i.e., 768 images, are employed. The images are chosen in such a way that each stage
in the reorganized dataset has a reasonably balanced number of samples. From
Table 3, we can observe that the models using the LeakyReLU activation function
show a larger improvement when feature selection is applied. This is mostly due to
the loss of information in the ReLU activation function when the output value
becomes negative; in the case of LeakyReLU, this negative output is not discarded
but scaled by a small factor.
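
The difference between the two activation functions can be seen in a few lines of NumPy; the negative slope of 0.01 used here for LeakyReLU is an assumed value, not necessarily the one used in the experiments.

```python
import numpy as np

def relu(x):
    # ReLU discards negative responses entirely.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # LeakyReLU keeps a small, scaled version of negative responses,
    # so information in negative activations is not completely lost.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))          # [0.  0.  0.  1.5]
print(leaky_relu(x))    # [-0.02  -0.005  0.  1.5]
```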
The accuracy metric is adopted to measure and compare the performance of the
different CNN-based classifiers. Equation (1) gives the formal definition of accuracy,
where $\chi_i = 1$ if the predicted class is the true class for image $i$, and
$\chi_i = 0$ otherwise:

\[
\text{Accuracy} = \frac{1}{m} \sum_{i=1}^{m} \chi_i \tag{1}
\]
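
Equation (1) translates directly into a short NumPy computation, shown here purely for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Eq. (1): fraction of test images whose predicted class equals the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    chi = (y_true == y_pred).astype(float)   # chi_i = 1 if prediction i is correct, else 0
    return chi.mean()                        # (1/m) * sum over i of chi_i

print(accuracy([0, 1, 2, 2, 4], [0, 1, 1, 2, 4]))   # 0.8
```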

5 Conclusion and Future Work

The test accuracy obtained by the models is in the range of 70–75% using just 24%
of the images available in the DR dataset. Our experimental results indicate the
importance of CNN and machine learning techniques for the detection of the
different diabetic retinopathy stages. Even on such small-sized training data, the
accuracy of the models is reasonable, indicating that there is room for further
improvements to the models in the future.

Table 3 Accuracy of the proposed classifiers


Models     Activation function   Test accuracy without feature selection   Test accuracy with feature selection
CNN        LeakyReLU             74.48                                     75.01
CNN        ReLU                  74.12                                     74.11
CNN+RF     LeakyReLU             73.74                                     74.01
CNN+RF     ReLU                  73.99                                     73.97
CNN+SVM    LeakyReLU             74.12                                     75
CNN+SVM    ReLU                  72.48                                     73.13
SVM        None                  55.29                                     73.14

Table 4 Comparison of average accuracy metric over recent DR classifiers


Research Methodology Average test accuracy
[11] Linear kernel SVM 69.04
[2] InceptionV3, Multiclass SVM 76.1
[14] Gaussian mixture model 74
[5] MLP 66.04
[9] Gradient-weighted class activation 77.11
[8] ImageNet 71.06
[15] VGG16 51
Proposed method CNN+SVM 75

The models can be employed behind a user-friendly interface for specialists,
especially ophthalmologists, to evaluate the severity level of DM by recognizing the
different DR stages, further supporting proper management and prognosis of
diabetic retinopathy.
Table 4 lists the average testing accuracy of various studies performed over the years
in the field of DR classification. Clearly, [9] boasts the best performance in terms of
average test accuracy with 77.11%. Since the dataset distribution used in [9] does not
completely overlap with that of the proposed method, it is difficult to make a direct
comparison; nonetheless, the proposed method has performed quite well considering
the performance metrics of the recently developed methods.
Since the domain of deep learning is constantly evolving, further efforts can be made
to improve the performance of the models. The accuracy of the proposed model
depends on the quality of the dataset and of the training, as it employs only machine
learning techniques. Computer vision techniques can be employed to detect the
different important parts of the retina, such as cotton wool spots, which will help the
CNN models select the most important features. Real-life fundoscopic images
contain noise, and the proposed model does not have a layer to deal with different
kinds of noise; image preprocessing techniques can be used to remove noise from
the images so that the model can work more efficiently. Last but not least, different
machine learning techniques can be combined to build hybrid models as novel
state-of-the-art DR classifiers that can provide better performance.

Acknowledgments We would like to express our gratitude toward the Information Technology
Department of NITK, Surathkal for its kind cooperation and encouragement that helped us in the
completion of this project entitled “An Effective Diabetic Retinopathy Detection using Hybrid
Convolutional Neural Network Models.” We would like to thank the department for providing
the necessary cluster and GPU technology to implement the project in a preferable environment.
We are grateful for the guidance and constant supervision as well as for providing necessary
information regarding the project and also for its support in completing the project.

References

1. Bhatia, K., Arora, S., & Tomar, R. (2016). Diagnosis of diabetic retinopathy using machine
learning classification algorithm. In 2016 2nd International Conference on Next Genera-
tion Computing Technologies (NGCT) (pp. 347–351). https://doi.org/10.1109/NGCT.2016.
7877439
2. Boral, Y. S., & Thorat, S. S. (2021). Classification of diabetic retinopathy based on hybrid
neural network. In 2021 5th International Conference on Computing Methodologies and Com-
munication (ICCMC) (pp. 1354–1358). https://doi.org/10.1109/ICCMC51019.2021.9418224
3. Carrera, E. V., González, A., & Carrera, R. (2017). Automated detection of diabetic retinopathy
using SVM. In 2017 IEEE XXIV International Conference on Electronics, Electrical Engi-
neering and Computing (INTERCON) (pp. 1–4). https://doi.org/10.1109/INTERCON.2017.
8079692
4. Cuadros, J., Bresnick, G. (2009). EyePACS: an adaptable telemedicine system for diabetic
retinopathy screening. Journal of Diabetes Science and Technology, 3, 509–516.
5. Harun, N. H., Yusof, Y., Hassan, F., & Embong, Z. (2019). Classification of fundus images for
diabetic retinopathy using artificial neural network. In 2019 IEEE Jordan International Joint
Conference on Electrical Engineering and Information Technology (JEEIT) (pp. 498–501).
https://doi.org/10.1109/JEEIT.2019.8717479
6. Herliana, A., Arifin, T., Susanti, S., & Hikmah, A. B. (2018). Feature selection of diabetic
retinopathy disease using particle swarm optimization and neural network. In: 2018 6th
International Conference on Cyber and IT Service Management (CITSM) (pp. 1–4). https://
doi.org/10.1109/CITSM.2018.8674295
7. Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector
machines. IEEE Transactions on Neural Networks, 13(2), 415–425. https://doi.org/10.1109/
72.991427
8. Jayakumari, C., Lavanya, V., & Sumesh, E. P. (2020). Automated diabetic retinopathy detection
and classification using ImageNet convolution neural network using fundus images. In: 2020
International Conference on Smart Electronics and Communication (ICOSEC) (pp. 577–582).
https://doi.org/10.1109/ICOSEC49089.2020.9215270
9. Jiang, H., Xu, J., Shi, R., Yang, K., Zhang, D., Gao, M., Ma, H., & Qian, W. (2020). A multi-
label deep learning model with interpretable Grad-CAM for diabetic retinopathy classification.
In: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine Biology
Society (EMBC) (pp. 1560–1563). https://doi.org/10.1109/EMBC44109.2020.9175884
10. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep
convolutional neural networks. In: Proceedings of the 25th International Conference on Neural
Information Processing Systems (Vol. 1, pp. 1097–1105). NIPS’12, Red Hook, NY, USA:
Curran Associates.
11. Kumar, S., & Kumar, B. (2018). Diabetic retinopathy detection by extracting area and number
of microaneurysm from colour fundus image. In: 2018 5th International Conference on
Signal Processing and Integrated Networks (SPIN) (pp. 359–364). https://doi.org/10.1109/
SPIN.2018.8474264

12. Ramani, R. G., Shanthamalar J., J., & Lakshmi, B. (2017). Automatic diabetic retinopathy
detection through ensemble classification techniques automated diabetic retinopathy classifi-
cation. In: 2017 IEEE International Conference on Computational Intelligence and Computing
Research (ICCIC) (pp. 1–4). https://doi.org/10.1109/ICCIC.2017.8524342
13. Roy, A., Dutta, D., Bhattacharya, P., & Choudhury, S. (2017). Filter and fuzzy C means based
feature extraction and classification of diabetic retinopathy using support vector machines. In:
2017 International Conference on Communication and Signal Processing (ICCSP) (pp. 1844–
1848). https://doi.org/10.1109/ICCSP.2017.8286715
14. Roychowdhury, S., Koozekanani, D. D., & Parhi, K. K. (2016). Automated detection of
neovascularization for proliferative diabetic retinopathy screening. In: 2016 38th Annual
International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)
(pp. 1300–1303). https://doi.org/10.1109/EMBC.2016.7590945
15. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556.
16. Sodhu, P. S., & Khatkar, K. (2014). A hybrid approach for diabetic retinopathy analysis.
International Journal of Computer Application and Technology, 1(7), 41–48.
17. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,
V., & Rabinovich, A. (2015). Going deeper with convolutions. In: 2015 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (pp. 1–9). https://doi.org/10.1109/CVPR.
2015.7298594
Modified Discrete Differential Evolution
with Neighborhood Approach for
Grayscale Image Enhancement

Anisha Radhakrishnan and G. Jeyakumar

1 Introduction

Evolutionary Algorithms (EAs) are potential optimization tools for a wide range
of benchmarking and real-world optimization problems. The most prominent
algorithms under EAs are Differential Evolution (DE), Genetic Algorithm (GA),
Genetic Programming (GP), Evolutionary Programming (EP), and Evolutionary
Strategies (ES). Though the algorithmic structure of these algorithms is similar,
their performance varies based on different factors, viz., population representation,
variation operations, selection operations, and the nature of the problem to be
solved.
Among all these algorithms, DE is simpler and is applicable to complex real-valued
parameter optimization problems. The differential mutation operation of
DE makes it not directly applicable to discrete parameter optimization problems.
Considering the unique advantages of DE, extending its applicability to discrete
optimization problems is an active area of research. In computer vision, good-contrast
images play a vital role in many image processing applications. Over the
past few decades, extensive research has been carried out on metaheuristic approaches
for automatic image enhancement.
The objective of the study presented in this chapter is to propose an algorithmic
change to DE, by adding a new mapping technique, to make it suitable for discrete
optimization problems. The performance of DE with the proposed mapping technique
was tested on benchmarking travelling salesperson problems (TSPs) and an image
enhancement problem. The algorithmic structure, design of experiments, results,
and discussion are presented in this chapter.

A. Radhakrishnan · G. Jeyakumar ()


Department of Computer Science and Engineering, Amrita School of Engineering, Amrita
Vishwa Vidyapeetham, Coimbatore, India
e-mail: r_anisha@cb.amrita.edu; g_jeyakumar@cb.amrita.edu


The remainder of this chapter is organized as follows: Sect. 2 summarizes the
related works, Sect. 3 describes the theory of DE, Sect. 4 describes the proposed
approach, Sect. 5 explains Phase I of the experiment, Sect. 6 presents Phase II
of the experiment, and finally, Sect. 7 concludes the chapter.

2 Related Works

Though DE is a native algorithm for real-valued parameter optimization, it has been
extended to discrete-valued parameter optimization with relevant changes in
its algorithmic structure. These changes are made either at the population level or at the
operator level. Similar such works are highlighted below.
A mapping mechanism for converting continuous variables to discrete ones was
proposed in [1]. The authors also suggested a way to move the solution faster toward
optimality. The MADEB algorithm with a binary mutation operator was proposed
in [2] to discretize the DE mutation operation. Interestingly, in [3], an application-specific
discrete mapping technique was added to DE. The application attempted
in this work was multi-target assignment for multiple unmanned aerial vehicles.
Similar to [3], many application-specific changes have been suggested for DE in the
literature. For solving an antenna design problem, a binary DE named NBDE was
presented in [4]. For a computer vision and image processing problem, a DE-based
algorithm to detect circles was introduced in [5]. A new operator named "position
swapping" was added to DE in [6] and used for an index tracking mechanism.
The forward and backward transformation for converting real values to integer
values and vice versa was discussed in [7]. There are also research ideas in the
literature discussing appropriate population representations for making DE apt
for discrete-valued problems. In [8], a discrete representation for the candidates
in the population was proposed and tested on a flow shop scheduling problem.
A modified DE with a binary mutation rule was introduced in [9] and tested
on a few discrete problems. Taking TSP as the benchmarking problem, a set of discrete
DE algorithms was introduced in [10]. A novel mutation operation for spatial data
analysis, named the linear ordering-based mutation operator, was introduced in [11].
A set of changes to DE was proposed in [12] for solving
vehicle routing problems. An attempt to solve the discrete RFID reader placement
problem was made in [13], with an application-specific mapping. The keyframe
extraction problem was addressed in [14] using a modified DE algorithm.
The work presented in [13] was extended by the authors in [15] to experiment with DE on a
discrete multi-objective optimization problem.
Image enhancement is a technique that makes the information in images more
expressible. It transforms the original image into an enhanced image that is visually
better and in which an object can be distinguished from the background. The purpose of
enhancement is to improve image quality, focus on certain features, and strengthen
the interpretation of an image. It plays a vital role in computer vision and image processing,
where it is a preprocessing phase in applications like image analysis and remote

sensing. Regions with low contrast appear dark, and high-contrast regions
appear unnaturally illuminated. The outcome of both is a loss of pertinent
information. Thus, the optimal enhancement of image contrast that represents the
relevant information of the input image is a difficult task [16, 17]. There is no
generic approach for image enhancement; the approaches are image dependent. Histogram
Equalization (HE) and its variants are effectively applied to enhance the contrast of
the image and are widely used in several image processing applications [18–20]. The
major drawback of this approach is that, for darker and lighter images, it does not
produce a quality enhanced image due to noise and loss of relevant information.
In recent years, several bioinspired algorithmic approaches have been used in image
contrast enhancement [21]. These algorithms help in searching for the optimal mapping
of the gray levels of the input image to new gray levels with enhanced image
contrast. Automatic image enhancement requires a well-defined evaluation criterion
that is valid over a wide range of datasets. An approach that tunes the parameters of a
transformation function can be adopted. The transformation function is evaluated
by the objective function. Bioinspired algorithms search for the optimal combination
of transformation parameters stochastically. Embedding a population-based approach
in image enhancement has gained wide popularity in recent years. This approach
helps to explore and exploit such complex problems and search the solution space
to achieve the optimal parameter setting [22, 23]. Plenty of literature indicates the
application of metaheuristic algorithms for image contrast enhancement.
Pal et al. used a Genetic Algorithm (GA) for automating the operator selection for
image enhancement [24]. Saitoh used a Genetic Algorithm for modeling the intensity
mapping of the transformation function; this approach generated better results with
respect to execution time [25]. Braik et al. examined Particle Swarm Optimization
(PSO) by increasing the entropy and edge details [26]. Dos Santos Coelho et al. [27]
modeled three DE variants by adopting an objective function similar to that proposed in
[26]. The advantage of this approach was faster convergence, but it could not provide
suitable statistical evidence for the quality of the enhanced image. Shanmugavadivu
and Balasubramanian proposed a new method that avoids the mean shift that happens
during the equalization process [28]. This method could preserve the brightness of the
enhanced images. Hybridization approaches have also been found to improve the quality
of the enhanced image. Mahapatra et al. proposed a hybridization approach in which PSO
is combined with the Negative Selection Algorithm (NSA) [29]. This method could
preserve the number of edge pixels. Suresh and Lal [30] investigated a Modified
Differential Evolution (MDE), which could avoid premature convergence. It also
enhanced the exploitation capability of DE by adopting the Lévy flight from Cuckoo
search. In MDE, the mean intensity of the enhanced image is preserved.
A comparative study of five traditional image enhancement algorithms
(Histogram Equalization, Local Histogram Equalization, Contrast Limited Adaptive
Histogram Equalization, Gamma Correction, and Linear Contrast Stretch) was
presented in [31]. A study on the effect of image enhancement was carried out in
[32], using weighted thresholding enhancement techniques.
On understanding the interesting research attempts in making DE suitable for
solving discrete-valued parameter optimization problems, and the importance of the
image enhancement process in computer vision applications, this chapter proposes
to investigate the novelty of a modified DE to solve discrete TSPs and an image
enhancement problem.

3 Differential Evolution

Differential Evolution (DE) is a probabilistic population-based approach that has
gained recognition among other Evolutionary Algorithms (EAs) because of its
simplicity and robustness. The algorithm was formulated to solve problems in the continuous
domain and was modeled by Storn and Price in 1997 [33]. The self-organizing capability of this
algorithm led researchers to extend DE to the discrete domain. The research work
presented in this chapter is an effort to improve the exploration and exploitation
nature of DE by mapping the genes of the mutant vector appropriately.

3.1 Classical Differential Evolution

The classical DE has two phases: population initialization (the first phase),
followed by the evolution phase (the second phase). Mutation and crossover are
performed during the evolution phase. Selection of the candidate happens thereafter,
replacing a candidate in the population and thereby generating the population for
the next generation. This is iterated until the termination criterion is met.
(a) Population Initialization – In this phase, the candidate set is generated in a
uniformly distributed fashion. The set of candidate solutions is
$C^g = \{C_k^g : k = 1, 2, 3, \ldots, n\}$, where $g$ denotes the generation and $n$ denotes the
size of the population. $C_k^g$ denotes a $d$-dimensional vector,
$C_k^g = (c_{1,k}^g, c_{2,k}^g, c_{3,k}^g, \ldots, c_{d,k}^g)$, and is generated using a random uniform distribution, as
mentioned in Eq. (1).

$$C_k^g = C_L + (C_U - C_L) * rand(0, 1) \qquad (1)$$

where $C_L$ and $C_U$ represent the lower bound and upper bound of the search space $S^g$.
(b) Evolution – In this phase, the mutation operation, which is a crucial step in DE,
is performed. Three random vectors are selected to generate the mutant vector:
the weighted difference of two of them is added to the base vector. The mutant vector $v_k^g$ for
every target vector $C_k^g$ is generated using Eq. (2):

$$v_k^g = c_{r1}^g + F \left( c_{r2}^g - c_{r3}^g \right) \qquad (2)$$

where $r_1$, $r_2$, and $r_3$ are random vectors in the population with $r_1 \neq r_2 \neq r_3$, and
$F$ is the Mutation Factor, with a value in the range [0, 1].

Once the mutant vector is generated, the crossover operation is performed
between the mutant vector and the parent. The crossover is performed
between the mutant vector $v_k^g = (v_{1,k}^g, v_{2,k}^g, v_{3,k}^g, \ldots, v_{D,k}^g)$ and the target vector
$C_k^g = (c_{1,k}^g, c_{2,k}^g, c_{3,k}^g, \ldots, c_{D,k}^g)$, with a crossover probability $Cr \in [0, 1]$, and
a trial vector $U_k^g = (u_{1,k}^g, u_{2,k}^g, u_{3,k}^g, \ldots, u_{D,k}^g)$ is generated. Each gene of the trial vector
is produced as $u_{i,k}^g = v_{i,k}^g$ if $rand_k \leq Cr$ and $u_{i,k}^g = c_{i,k}^g$ otherwise, where $i \in \{1, 2, 3, \ldots, D\}$.
Selection is performed after crossover, and the individual with the better fitness value
moves to the next generation (it can be the trial vector or the target vector). These
operations (mutation, crossover, and selection) are repeated until the termination criterion is met.
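For readers who prefer code, the following is a minimal NumPy sketch of one generation of the classical DE scheme described above (rand/1 mutation with binomial crossover); the function name, the bounds, and the maximization convention are illustrative assumptions rather than the chapter's actual implementation.

```python
import numpy as np

def de_generation(pop, fitness, F=0.8, Cr=0.9, C_L=0.0, C_U=1.0):
    """One generation of classical DE (rand/1/bin) over a real-valued population."""
    n, d = pop.shape
    new_pop = pop.copy()
    for k in range(n):
        # Mutation (Eq. 2): base vector plus weighted difference of two random vectors
        r1, r2, r3 = np.random.choice([i for i in range(n) if i != k], 3, replace=False)
        mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), C_L, C_U)
        # Binomial crossover between the mutant and the target vector
        mask = np.random.rand(d) <= Cr
        mask[np.random.randint(d)] = True     # keep at least one gene from the mutant
        trial = np.where(mask, mutant, pop[k])
        # Selection: the better of trial and target survives
        # (maximization assumed here; reverse the comparison for minimization, e.g. TSP)
        if fitness(trial) > fitness(pop[k]):
            new_pop[k] = trial
    return new_pop
```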

4 Proposed Approach

This section proposes a new mapping approach, Best Neighborhood Differential
Evolution (BNDE). DE cannot be directly applied to discrete-valued parameter
problems (also called combinatorial optimization problems), and an appropriate
mapping approach for the mutant vector is required to enhance the exploration
and exploitation of the algorithm. In the proposed method, the initialization of the population is
performed in the same way as in the classical DE. In the evolution phase, a few genes of the mutant
vector are selected with a probability. The neighbors of those selected genes are replaced
in the mutant vector with the gene that has the best optimal value. The genes can
come from the best, average, or worst candidates in the population. The proposed
approach was investigated to solve the classical travelling salesman problem. The
algorithmic components of the BNDE algorithm used for the experiment are described
below. The structure of the algorithm is depicted in Fig. 1.
• Initializing the Population – The population was initialized by the random positioning
of the city nodes. The Euclidean distance was considered to get the fitness of the
candidates in the population.
• Fitness Evaluation – The fitness of each candidate is evaluated using the objective
function. The candidate with the shortest path has the best fitness and the one with
the longest path has the worst fitness. Based on the fitness, the selection of individuals
is performed. For the mapping approach, the best, average, and worst candidates are
selected.
• Mutation – The DE variant DE/BEST/RAND2/BIN is considered for the mutation, as
mentioned in Eq. (3).

$$v_k^g = c_{BEST}^g + F \left( c_{r4}^g - c_{r3}^g \right) + F \left( c_{r1}^g - c_{r2}^g \right) \qquad (3)$$

Fig. 1 Algorithmic structure of enhanced DE

The BNDE mapping approach is performed.


• Crossover – Crossover is performed and the trial vector is obtained.
• Selection Scheme – Fitness-based selection was considered. The fitness of the
trial vector is compared with that of the target vector, and the vector with the better fitness
is considered for the next generation.
The enhanced DE in Fig. 1 is described for the travelling salesman problem.
Initialization of the population was performed by random positioning of the city nodes.
The objective function evaluates the fitness of each candidate in the population.
The TSPLIB dataset was considered for the experiment. The candidate with the
minimum distance has the better fitness value. The Euclidean distance was used for
calculating the distance. The candidates in the population were ranked based on
the fitness value. Candidates with the best, average, and worst fitness were
considered to perform the BNDE mapping. DE/BEST/RAND2/BIN was considered
for finding the mutant vector [33]. The quality of the best gene was considered; thus,
the best vector was chosen as the base vector to improve the exploitation of the mutant
vector, and the weighted scale of random vectors was considered to improve the
exploration of the mutant vector. This approach replaces the neighbors of the gene
selected in the mutant vector with the best gene from the best, average, or worst
candidates. This approach can improve the quality of the candidate, as we are
mapping the neighbors with a potential gene, and could converge to a better optimal
solution. Crossover was applied to normalize the searching. The candidate with the
better fitness was selected for the next iteration.

4.1 Best Neighborhood Differential Evolution (BNDE) Mapping

The Best Neighborhood Differential Evolution is a mapping approach proposed to
enhance the exploration and exploitation of Differential Evolution in the discrete
domain. A probability is generated for all the genes in the mutant vector. For a gene
(say n) that draws a probability less than 0.3, its adjacent neighbors n + 1 and
n − 1 are considered. The neighbors are replaced with the best gene for only 30 percent of
the genes, to preserve randomness. For datasets with higher dimensions,
mapping 50% of the genes with the gene of a good candidate improves the result, but for datasets
with lower dimensions it results in stagnation and more computational
time. Hence, 30% gene replacement is considered, for better results in both lower and
higher dimensions. The exploration and exploitation processes are also balanced to
search for better candidates. This method could prevent premature convergence and
yield better solutions than other mapping approaches. The neighbors of the selected
genes are replaced with a better gene, considering the best, average, and worst
selected candidates. This approach could yield stability in the exploration and exploitation
of the enhanced Differential Evolution. The algorithm for BNDE is presented in Fig. 2.
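Complementing Fig. 2, the sketch below illustrates the neighborhood mapping step in Python under stated assumptions: the reference candidates (best, average, worst) and a gene-level scoring function are supplied by the caller, since the chapter does not fix how the single "best gene" is evaluated.

```python
import numpy as np

def bnde_mapping(mutant, references, gene_score, p_select=0.3):
    """Sketch of the BNDE neighborhood mapping applied to a mutant vector.

    references: reference candidates of the population (e.g. best, average, worst);
    gene_score: assumed user-supplied function scoring a single gene value;
    p_select:   roughly 30% of the genes are selected, as stated in the chapter.
    """
    d = len(mutant)
    mapped = np.array(mutant, dtype=float)
    selected = np.where(np.random.rand(d) < p_select)[0]
    for i in selected:
        # Gene with the best score among the reference candidates at position i
        best_gene = max((ref[i] for ref in references), key=gene_score)
        # Replace the adjacent neighbors (i-1 and i+1) of the selected gene
        for j in (i - 1, i + 1):
            if 0 <= j < d:
                mapped[j] = best_gene
    return mapped
```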
Fig. 2 Algorithm for best neighborhood differential evolution

This chapter presents the proposed algorithm in two phases. In Phase I, the
proposed algorithm is compared with other existing algorithms, and in Phase II,
the performance of the proposed algorithm is validated on an image processing
application. The design of experiments, results, and discussion for the Phase I and
Phase II experiments are presented next.

5 Phase I – Performance Comparison

In this phase, the performance of BNDE is compared with existing mapping
algorithms for solving the travelling salesman problem.

5.1 Design of Experiments – Phase I

The parameter setting of DE was carried out with appropriate values after trial and
error. The population size n was set to 100. The mutation scale factor (F) was considered
in the range [0.6, 1.5], following [1, 34]; F > 1 solves many problems, and F < 1 shows good
convergence. The optimal value of F lies between $F_{min}$ = 0.6 and $F_{max}$ = 1.5, and it is calculated
using Eq. (4) below (as given in [1]):

$$F = \frac{F_{min} - F_{max}}{MaxFit} \times cfe + F_{max} \qquad (4)$$

where cfe is the number of times the objective function has been evaluated, and MaxFit is
the maximum number of fitness evaluations. The crossover rate (Cr) was set to 0.9,
the number of generations (g) to 2000, and the number of runs (r) to 30.
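As a small illustration of Eq. (4), the linearly decreasing mutation factor can be computed as follows (a sketch; the function name is illustrative):

```python
def mutation_factor(cfe, max_fit, f_min=0.6, f_max=1.5):
    """Mutation factor F of Eq. (4): decreases linearly from f_max to f_min
    as the number of objective function evaluations (cfe) grows toward max_fit."""
    return (f_min - f_max) / max_fit * cfe + f_max

# For example, halfway through the run F is midway between the bounds:
print(mutation_factor(cfe=1000, max_fit=2000))   # -> 1.05
```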
All the DE mapping approaches were implemented on a computer system with
8 GB RAM and an i7 processor running the Windows 7 operating system, using the Python 3.6
programming language. The performance analysis of BNDE was carried out with the
travelling salesman problem (TSP) as the benchmarking problem.
TSP is an NP-hard problem. The solution of a TSP is the shortest path for the
salesman to visit all the cities (nodes) in the city map, with the constraint of visiting
each city only once. Six different instances of TSP from the TSPLIB dataset were
considered for the experiment. Each candidate in the population is a possible path
for the salesman. The Euclidean distance was calculated to evaluate the fitness of the
path. The objective function defined to measure the distance (D) is given in Eq. (5).

$$D = D_{1,k} + \sum_{j=1}^{k-1} D_{j,j+1} \qquad (5)$$

where $D_{j,j+1}$ denotes the distance between node $j$ and its neighbor node $j + 1$, $D_{1,k}$ denotes
the distance closing the tour between the first and the last node, and $k$ denotes
the total number of nodes in the city map.

The performance metrics used for the comparative study were the best of the optimal
values ($A_{ov}$) and the error rate ($E_r$). $A_{ov}$ is the best of the optimal values attained
by a variant over all its runs, and it is calculated using Eq. (6).

$$A_{ov} = \mathrm{Best\ of}(ov_i) \qquad (6)$$

where $ov_i$ is the optimal value obtained at run $i$.

The error rate ($E_r$) is a measure that indicates how much $A_{ov}$ deviates from the
expected optimal solution of the problem, and it was calculated using Eq. (7).

$$E_r = \frac{A_{ov} - AS}{AS} \times 100 \qquad (7)$$

where $A_{ov}$ is the obtained solution, and $AS$ is the actual solution.
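For illustration, Eqs. (6) and (7) can be evaluated as in the short sketch below; the example uses the known eil51 optimum (429) and the BNDE best value reported in Table 1 (863), while the two additional run values are hypothetical.

```python
def best_and_error_rate(run_optima, actual_solution):
    """Best-of-runs value A_ov (Eq. 6) and its error rate E_r in percent (Eq. 7)."""
    a_ov = min(run_optima)                  # TSP is a minimization problem
    e_r = (a_ov - actual_solution) / actual_solution * 100.0
    return a_ov, e_r

a_ov, e_r = best_and_error_rate([863, 901, 877], 429)
print(a_ov, e_r)                            # -> 863 101.1655..., matching Er ≈ 101.16 in Table 4
```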

5.2 Results and Discussions – Phase I

The performance of BNDE was compared with the existing mapping approaches of DE
for TSP using Aov and Er. Six TSP datasets were considered for the
experiment: att48, eil51, berlin52, st70, pr76, and eil76. Six existing
mapping techniques were implemented, and their performance was analyzed
empirically and statistically. The mapping approaches referred to in this experiment are the
Truncation Procedure (TP1) [35], TP rounding to the closest integer number (TP2)
[1], TP only with the integer part (TP3) [35], rank based (RB) [36], largest ranked value
(LRV) [37], and Best Match Value (BMV) [1]. The algorithms with all the mapping
techniques considered in this experiment were applied to solve the TSP datasets with
the same DE parameters. The empirical results obtained for the best optimal values are
shown in Table 1. Table 2 presents the results obtained for the average optimal values.
Table 3 shows the comparison of the results obtained for the worst optimal values.
Table 1 presents the comparison of the BNDE approach with the state-of-the-art
mapping approaches using the best optimal value obtained from 30 independent runs.
BNDE could outperform TP1, TP2, TP3, RB, and LRV, but not BMV. Similarly,

Table 1 Comparison of BNDE with existing mapping approaches with best value obtained
Dataset    Optimal solution    TP1 Best    TP2 Best    TP3 Best    RB Best    LRV Best    BMV Best    BNDE Best
att48 33,523 95,309 98,565 97,298 97,309 90,293 58,679 75,031
eil51 429 1112 1118 1106 1124 1078 766 863
berlin52 7544.37 20585.53 18,768 19,670 20,166 19,977 13,292 15,758
st70 675 2550 2506 2625 2559 2574 1597 1902
pr76 108,159 230,691 236,060 416,316 318,696 397,748 235,442 311,432
eil76 545.39 1825 1809 1860 1890 1865 1184 1394

Table 2 Comparison of BNDE with existing mapping approaches with average value obtained
Dataset    Optimal solution    TP1 Avg    TP2 Avg    TP3 Avg    RB Avg    LRV Avg    BMV Avg    BNDE Avg
att48 33,523 102364.83 104100.56 102361.56 101795.83 102043.30 70946.70 83238.40
eil51 429 1157.23 1159.233 1156 1174.20 1180.46 832.93 959.83
berlin52 7544.37 21,431 20308.90 21107.33 21080.80 20925.66 14814.80 17185.26
st70 675 2677.47 2654.07 2727.33 2711.43 2716.83 1876.03 2116.43
pr76 108,159 254327.63 252948.50 431907.76 344088.63 415970.40 269644.20 333313.46
eil76 545.39 1922.86 1920.07 1932.63 1937.03 1935.17 1364.03 1516.40

Table 3 Comparison of BNDE with existing mapping approaches, with worst value obtained
Dataset    Optimal solution    TP1 Worst    TP2 Worst    TP3 Worst    RB Worst    LRV Worst    BMV Worst    BNDE Worst
att48 33,523 106,700 98,565 107,536 106,878 105,805 82,000 91,649
eil51 429 1216 1118 1206 1218 1233 898 1046
berlin52 7544.37 19,559 21,266 21,914 21,660 21,911 16,579 18,519
st70 675 2807 2779 2810 2798 2779 2155 2270
pr76 108,159 276,195 267,409 440,280 426,851 437,337 302,095 351,145
eil76 545.39 1998 1973 1985 1982 1994 1577 1613

Table 4 Comparison of error rate with existing mapping approaches, with best value obtained
Dataset    Optimal solution    TP1 Er    TP2 Er    TP3 Er    RB Er    LRV Er    BNDE Er    BMV Er
att48 33,523 184.30 194.02 190.24 190.27 169.34 123.81 75.04
eil51 429 159.20 160.60 157.80 162.00 151.28 101.16 78.55
berlin52 7544.37 172.85 148.76 160.72 167.29 164.79 108.87 76.18
st70 675 277.77 271.25 288.88 279.11 281.33 181.77 136.59
pr76 108,159 113.28 118.25 284.91 194.65 267.74 187.93 117.68
eil76 545.39 234.62 231.68 241.04 246.54 241.95 155.59 117.09

for the other two cases of comparing the approaches with the average values and the worst
values, the BNDE also failed to outperform BMV. Table 4 presents the error rate (Er)
calculated with the best values. The comparison shows that BNDE could outperform
TP1, TP2, TP3, RB, and LRV (except for pr76 with TP1 and TP2). Overall, it
is observed that BNDE could outperform the existing mapping approaches, except
BMV. However, the performance of BNDE was comparable to that of BMV. Further tuning
of BNDE to make it better than BMV is taken up as a future study of this work.
To validate these findings, a statistical significance analysis was performed using the
two-tailed Wilcoxon Signed Ranks Test for paired samples. The optimal
value and error rate were measured for the independent runs of the algorithm. The
parameters used for this test were number of samples n, test statistics (T), critical
value of T (T Crit), level of significance (α), the z-score, the p-value, and the
significance. The observations are summarized in Table 5. The ‘+’ indicates that
the performance difference between the BNDE and the counterpart approach is
statistically significant, and the ‘-’ indicates that the performance difference is
not statistically significant. For the BNDE-TP1, BNDE-TP2, and BNDE-TP3 pairs,
the outperformance of BNDE was statistically significant for all the datasets,
except the att48 dataset. For the BNDE-RB and BNDE-LRV pairs, it is observed
that the outperformance of BNDE was statistically significant for all the datasets.
Interestingly, for the BNDE-BMV pair, though BMV empirically outperformed BNDE,
the performance differences were not statistically significant.

Table 5 Statistical analysis of BNDE using Wilcoxon Signed Ranks Test


            TP1                                            TP2
Dataset   T      T-Critic   p-value   Significance     Dataset   T    T-Critic   p-value   Significance
att48 151 137 0.09367 No – att48 153 137 0.1020 No −
eil51 117.5 137 0.018 Yes + eil51 105 137 0.0087 Yes +
berlin52 66 137 0.00061 Yes + berlin52 72 137 0.0010 Yes +
st70 48 137 – Yes + st70 56 137 0.0003 Yes +
pr76 17 137 – Yes + pr76 1 137 0.0000 Yes +
eil76 11 137 – Yes + eil76 11 137 0.0000 Yes +
            TP3                                            RB
Dataset   T      T-Critic   p-value   Significance     Dataset   T    T-Critic   p-value   Significance
att48 139 137 0.0545 No − att48 126 137 0.02848 Yes +
eil51 104 137 0.0082 Yes + eil51 103 137 0.00773 Yes +
berlin52 51 137 0.0002 Yes + berlin52 74 137 0.00111 Yes +
st70 51 137 0.0002 Yes + st70 36 137 0.0000 Yes +
pr76 3 137 0.0000 Yes + pr76 29 137 0.0000 Yes +
eil76 11 137 0.0000 Yes + eil76 11 137 0.0000 Yes +
            LRV                                            BMV
Dataset   T      T-Critic   p-value   Significance     Dataset   T    T-Critic   p-value   Significance
att48 122 137 0.023 Yes + att48 107 137 0.0098 No −
eil51 91 137 0.0036 Yes + eil51 107 137 0.0098 No −
berlin52 64 137 0.00052 Yes + berlin52 50 137 0.00017 No −
st70 55 137 0.00026 Yes + st70 10 137 0.0000 No −
pr76 18 137 0.0000 Yes + pr76 1 137 0.0000 No −
eil76 11 137 0.0000 Yes + eil76 11 137 0.0000 No −

From the statistical analysis in Table 5, we can see that BNDE significantly outperformed
the existing algorithms (except BMV). The few exceptions were TP1, TP2, and TP3 on the
att48 dataset, for which the improved performance of BNDE was not statistically significant.

6 Phase II – Image Processing Application

In the second phase of the experiment, an attempt was made to apply the BNDE mapping approach
to find the optimal parameter combination of the transformation function for image
contrast enhancement. Since the tuning of the parameters for image contrast
enhancement is a combinatorial optimization problem, this mapping approach can
explore and exploit the optimal parameter combinations.
Transformation Function – The local enhancement method [38] was applied using the
contrast improvement mentioned in Eq. (8):

$$h(x, y) = A(x, y)\left[f(x, y) - m(x, y)\right] + m(x, y) \qquad (8)$$

where $A(x, y) = \frac{k \cdot M}{\sigma(x, y)}$ and $k \in [0, 1]$.

$h(x, y)$ represents the new intensity value of the pixel at $(x, y)$, $f(x, y)$ denotes
the intensity value of the original pixel, $m(x, y)$ represents the local mean, $\sigma(x, y)$ is the
local standard deviation, and $M$ denotes the mean of the pixel intensities of the grayscale
image. The modified local enhancement is used to incorporate the edge parameters
$a$, $b$, and $c$ as

$$h(x, y) = \frac{k \cdot M}{\sigma(x, y) + b}\left[c \cdot f(x, y) - m(x, y)\right] + m(x, y)^a \qquad (9)$$

where $a \in [1, 2]$, $b \in [0, 2]$, $c \in [0, 0.5]$, and $k \in [0, 1]$.
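A minimal sketch of the transformation in Eq. (9) is shown below; the 3×3 window, the use of scipy.ndimage for the local statistics, and the final clipping to [0, 255] are assumptions of this sketch rather than details given in the chapter.

```python
import numpy as np
from scipy import ndimage

def local_enhancement(img, a, b, c, k, window=3):
    """Contrast transformation of Eq. (9) applied to a grayscale image."""
    f = img.astype(float)
    M = f.mean()                                        # global mean intensity
    m = ndimage.uniform_filter(f, size=window)          # local mean m(x, y)
    m2 = ndimage.uniform_filter(f * f, size=window)
    sigma = np.sqrt(np.maximum(m2 - m * m, 0.0))        # local standard deviation
    h = (k * M) / (sigma + b) * (c * f - m) + m ** a    # Eq. (9); b > 0 avoids division by zero
    return np.clip(h, 0, 255)                           # clipping is an added safeguard
```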


The fitness of the transformed images was calculated to identify the better candidates:

$$F(Ti) = \frac{\log_2\left(\log_2(S(Ti))\right) \cdot E(Ti) \cdot H(Ti) \cdot P(Ti) \cdot L(Ti)}{h \cdot w} \qquad (10)$$

where $F(Ti)$ indicates the fitness of the transformed image, and $S(Ti)$ denotes the sum of the
pixel values of the image edges. The Canny edge detector is used. To add the intensities of
the pixels that are white, the white pixel positions are considered in the original image,
and the sum of those pixel intensities is denoted $S(Ti)$. $E(Ti)$ denotes the count of the
pixels that form the image edges; the edge detector detects the count of white pixels. The
Shannon entropy of an image is calculated using

$$H(Ti) = -\sum_{i=0}^{255} p_i \log_2(p_i) \qquad (11)$$

where $p_i$ denotes the probability of the $i$th intensity value of the image, which ranges over [0, 255].

$$P(Ti) = 20 \log_{10}(255) - 10 \log_{10}(MSE(Ti)) \qquad (12)$$

where

$$MSE(Ti) = \frac{1}{h \cdot w} \sum_{i=0}^{h-1} \sum_{j=0}^{w-1} \left[Ti(i, j) - T0(i, j)\right]^2 \qquad (13)$$

and the original image is denoted as $T0$.

$$L(Ti) = \frac{\left(2\mu_{Ti}\mu_{T0} + c_1\right)\left(2\sigma_{Ti T0} + c_2\right)}{\left(\mu_{Ti}^2 + \mu_{T0}^2 + c_1\right)\left(\sigma_{Ti}^2 + \sigma_{T0}^2 + c_2\right)} \qquad (14)$$

where $\mu_{Ti}$ denotes the mean of the pixel values of the transformed image, $\mu_{T0}$ denotes the mean of
the pixel values of the original image, $\sigma_{Ti}$ is the variance of the transformed image, $\sigma_{T0}$ is the
variance of the original image, $\sigma_{Ti T0}$ is their covariance, and $h$ and $w$ denote the height and width
of the image.

Fig. 3 BNDE for image enhancement
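The terms of Eqs. (10)–(13) can be computed along the lines of the sketch below. The use of scikit-image's Canny detector, the choice of summing the transformed image's intensities at the edge positions for S(Ti), and the omission of the L(Ti) term of Eq. (14) are assumptions and simplifications of this sketch, not details fixed by the chapter.

```python
import numpy as np
from skimage.feature import canny

def fitness_terms(transformed, original):
    """Approximate the terms of Eqs. (10)-(13) for a pair of grayscale images."""
    Ti = transformed.astype(float)
    T0 = original.astype(float)
    h, w = Ti.shape
    edges = canny(Ti / 255.0)                       # boolean edge map (Canny detector)
    S = Ti[edges].sum()                             # sum of intensities at edge pixels (assumption)
    E = int(edges.sum())                            # number of edge pixels
    hist, _ = np.histogram(Ti, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    H = -(p * np.log2(p)).sum()                     # Shannon entropy, Eq. (11)
    mse = ((Ti - T0) ** 2).mean()                   # Eq. (13)
    P = 20 * np.log10(255) - 10 * np.log10(mse)     # PSNR, Eq. (12)
    F = np.log2(np.log2(S)) * E * H * P / (h * w)   # Eq. (10) without the L(Ti) term
    return F, S, E, H, P, mse
```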

6.1 Design of Experiments – Phase II

The proposed approach was implemented on a computer system with 8 GB RAM and an i5
processor running Mac OS, using the Python 3.6 programming language. The performance
analysis of BNDE was carried out with a mouse cell embryo dataset. The performance
of BNDE was compared with Histogram Equalization and Contrast Limited Adaptive Histogram
Equalization (CLAHE). The parameter setting of the BNDE approach was population size = 20,
number of iterations = 20, and number of runs = 10. For the Mutation Factor
F and Cr, the values were the same as those applied for the TSP problem.
algorithmic approach of BNDE for image contrast enhancement is shown in Fig. 3.
A random population of size 20 was generated, with the parameters a, b,
c, and k generated randomly within their boundary ranges. Local image
enhancement was applied, and the fitness of each transformed image was evaluated.
The DE/rand/1/bin variant was used. Three random vectors from the population
were selected for the rand/1 mutation. The genes of the mutant vector were replaced
with the best gene after comparing with the best, average, and worst candidates. Since
the proposed mapping approach was applied to an image processing application, the
BNDE approach was modified according to the application: the replacement of the best
gene was carried out for the selected genes with a probability greater than 0.25.
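A small sketch of the Phase II population initialization, using the parameter bounds of Eq. (9) and the population size of 20 stated above (the layout of the population array is an assumption of the sketch):

```python
import numpy as np

# Bounds of the transformation parameters (a, b, c, k) taken from Eq. (9)
bounds = np.array([[1.0, 2.0],    # a
                   [0.0, 2.0],    # b
                   [0.0, 0.5],    # c
                   [0.0, 1.0]])   # k
pop_size = 20
# Each row of the population holds one random (a, b, c, k) combination
population = bounds[:, 0] + (bounds[:, 1] - bounds[:, 0]) * np.random.rand(pop_size, 4)
```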

6.2 Results and Discussions – Phase II

The ideal combination of the parameters a, b, c, and k was selected based on the
best fitness value, and the transformed image with these parameters was considered
for the evaluation. The original image was compared with the enhanced image, and
its histogram was analyzed. The result is shown in Table 6 (Tables 6.a and 6.b). The first
column from the left represents the original image and its edges detected using the Canny
edge detector. The second column represents the histogram of the original image. The third
column from the left represents the enhanced image and its detected edges. The fourth
column represents the histogram of the enhanced image. Based on the analysis of the
edges detected for the enhanced image, it is observed that the number of detected edge pixels
is higher. By comparing the histograms of the original and enhanced images,
it is observed that the contrast is improved in the enhanced images.
A comparison of the BNDE approach with existing algorithms was also
performed. Two existing algorithms, Histogram Equalization and CLAHE, were
considered for the analysis. Table 7 (Tables 7.a, 7.b, and 7.c) shows the results
obtained. Though the BNDE approach could outperform Histogram Equalization,
it was observed that more edges could be detected in the image enhanced by the CLAHE
algorithm. Histogram Equalization generated noisy enhanced images.
The performance of the state-of-the-art algorithms was assessed using the
metrics Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE), entropy,
and Structural Similarity Index (SSIM). Table 8 summarizes the results obtained.
The comparison of BNDE with Histogram Equalization and CLAHE is summarized
in Table 8. From the analysis of the values obtained for the metrics, the PSNR
value obtained for BNDE is better than those of the other algorithms. MSE is inversely
related to PSNR; the MSE obtained for BNDE is lower than those of
Histogram Equalization and CLAHE. Entropy is another metric used for measuring
the quality of an image; it is a measure of the randomness in the image. The BNDE approach
could obtain a lower entropy than the state-of-the-art algorithms. A reduced entropy
value shows that the enhanced image is more homogeneous than the input image. The SSIM
obtained for BNDE outperforms the other methods. SSIM measures the similarity of the
enhanced image with the input image. The BNDE approach could enhance the image while
still ensuring similarity to the input image.

Table 6.a Comparison of original image and the enhanced images (1–6)
Original image Original image Histogram Enhanced image (BNDE approach) BNDE approach Histogram
Table 6.b Comparison of original image and the enhanced images (7–12)
Original image Original image Histogram Enhanced image (BNDE approach) BNDE approach Histogram

Table 7.a Comparison of enhanced images (1–4)


Original image Hist-equalization CLAHE BNDE

7 Conclusions

This chapter presented a study in two phases. Phase I of the study proposed an
improved Differential Evolution algorithm, named BNDE. BNDE adds
a proposed mapping approach to improve the exploration and exploitation
nature of DE. The proposed mapping technique makes DE suitable for solving discrete
optimization problems. The performance of the proposed algorithm was evaluated
on benchmarking TSPs and compared with six different state-of-the-art similar

Table 7.b Comparison of enhanced images (5–8)


Original image Hist-equalization CLAHE BNDE

algorithms. The empirical studies revealed that the proposed algorithm works better
than all the approaches except one (the BMV approach, for which the performance
difference was not statistically significant). Though it could not outperform the BMV
approach, their performance was comparable. The statistical studies highlighted that,
in the cases where BNDE outperformed, the difference was statistically significant. In Phase
II, the proposed algorithm was tested on an image enhancement application and

Table 7.c Comparison of enhanced images (9–12)


Original image Hist-equalization CLAHE BNDE

compared with two classical image enhancement algorithms. The results obtained
for the performance metrics used in the experiment reiterated the quality of the
proposed algorithm: it could outperform the classical algorithms in terms of the PSNR,
MSE, and SSIM values.
This approach is validated on grayscale images. For validation on RGB images,
the mapping approach can be applied to the red, green, and blue
Table 8 Comparison of BNDE with existing algorithms
Dataset HISTO CLAHE BNDE
PSNR Entropy MSE SSIM PSNR Entropy MSE SSIM PSNR Entropy MSE SSIM
7_19_M16 11.21 7.91 105.47 0.1547 28.06 5.45 31.37 0.8711 69.74 4.311 0.00022135 0.9999985
7_19_2ME2 11.343 7.90 105.62 0.1549 26.07 5.72 38.72 0.8700 66.10 4.40 0.00011719 0.99999817
7_19_2ME4 11.28 7.90 107.06 0.1471 26.92 5.51 31.96 0.8738 65.27 4.241 0.00013346 0.99999771
7_19_2ME5 11.29 7.89 103.63 0.1399 27.01 5.49 32.86 0.8728 64.115 4.20 0.00020833 0.99999754
7_19_2ME6 11.29 7.93 102.05 0.1775 27.43 5.50 31.97 0.8803 64.11 4.44 0.00020833 0.99999754
7_19_2ME8 11.39 7.91 104.19 0.1848 27.30 5.53 30.465 0.8808 62.17 4.49 0.00022135 0.99999736
7_19_2ME10 11.185 7.913 03.002 0.17365 27.98 5.35 27.11 0.8859 62.17 4.37 0.00022135 0.99999734
7_19_2ME13 11.294 7.92 102.04 0.17923 27.55 5.55 33.14 0.8794 68.85 4.41 0.00013346 0.99999845
7_19_2ME15 11.576 9.97 104.567 0.2567 28.44 5.80 33.04 0.8356 63.82 4.93 0.00028971 0.99999757
7_19_1ME1 11.23 7.95 105.30 0.1299 27.44 5.66 38.56 0.8176 63.45 4.43 0.00021159 0.99999744
7_19_1ME9 11.292 7.94 105.650 0.129 27.45 5.67 37.43 0.8223 62.66 4.447 0.00020833 0.99999748
7_19_1ME10 11.269 7.94 105.25 0.1255 27.50 5.66 37.96 0.8210 64.41 4.38 0.00018555 0.99999768

channels. The future scope of this work is to extend the mapping approach to color
images. This study will also be enhanced further by hybridizing other optimization
techniques with DE to make BNDE outstanding among all the state-of-the-art
mapping techniques.

References

1. Ali, I. M., Essam, D., & Kasmarik, K. (2019). A novel differential evolution mapping technique
for generic combinatorial optimization problems. Applied Soft Computing, 80, 297–309.
2. Santucci, V., Baioletti, M., Di Bari, G., & Milani, A. (2019). A binary algebraic differential
evolution for the multidimensional two-way number partitioning problem. European Confer-
ence on Evolutionary Computation in Combinatorial Optimization, 11451, 17–22.
3. Ming, Z., Zhao Linglin, S., Xiaohong, M. P., & Yanhang, Z. (2017). Improved discrete mapping
differential evolution for multi-unmanned aerial vehicles cooperative multi-targets assignment
under unified model. International Journal of Machine Learning and Cybernetics, 8(3), 765–
780.
4. Goudos, S. (2017). Antenna design using binary differential evolution: Application to discrete-
valued design problems. IEEE antennas and propagation magazine, 59(1), 74–93.
5. Cuevas, E., Zaldivar, D., Perez Cisneros, M. A., & Ramirez-Ortegon, M. A. (2011). Circle
detection using discrete differential evolution optimization. Pattern Analysis and Applications,
14(1), 93–107.
6. Davendra, D., & Onwubolu, G. (2009). Forward backward transformation. In Differential
evolution: A handbook for global permutation-based combinatorial optimization (pp. 35–80).
Springer.
7. Wang, L., Pan, Q.-K., Suganthan, P. N., & Wang, W. (2010). A novel hybrid discrete differential
evolution algorithm for blocking flow shop scheduling problems. Computers & Operations
Research, 37(3), 509–520.
8. Viale Jacopo, B., ThiemoKrink, S. M., & Paterlini, S. (2009). Differential evolution and
combinatorial search for constrained index-tracking. Annals of Operations Research, 172(1),
39–59.
9. Wagdy, A. (2016). A new modified binary differential evolution algorithm and its applications.
Applied Mathematics & Information Sciences, 10(5), 1965–1969.
10. Sauer, J. G., & Coelho, L. (2008). Discrete differential evolution with local search to solve
the traveling salesman problem: Fundamentals and case studies. In Proceedings of 7th IEEE
international conference on conference: cybernetic intelligent systems.
11. Uher, V., Gajdo, P., Radecky, M., & Snasel, V. (2016). Utilization of the discrete differential
evolution for optimization in multidimensional point clouds. Computational Intelligence and
Neuroscience, 13(1–14).
12. Lingjuan, H. O. U., & Zhijiang, H. O. U. (2013). A novel discrete differential evolution
algorithm. Indonesian Journal of Electrical Engineering, 11(4).
13. Rubini, N., Prashanthi, C. V., Subanidha, S., & Jeyakumar, G. (2017). An optimization
framework for solving RFID reader placement problem using differential evolution algorithm.
In Proceedings of ICCSP-2017 – International conference on communication and signal
proceedings.
14. Abraham, K. T., Ashwin, M., Sundar, D., Ashoor, T., & Jeyakumar, G. (2017). Empirical
comparison of different key frame extraction approaches with differential evolution based
algorithms. In Proceedings of ISTA-2017 – 3rs international symposium on intelligent system
technologies and applications.

15. Shinde, S. S., Devika, K., Thangavelu, S., & Jeyakumar, G. Multi-objective evolutionary
algorithm based approach for solving RFID reader placement problem using weight-vector
approach with opposition-based learning method. International Journal of Recent Technology
and Engineering (IJRTE) 2277–3878, 7(5), 177–184.
16. Lu, X., Wang, Y., & Yuan, Y. (2013). Graph-regularized low-rank representation for destriping
of hyperspectral images. IEEE Transactions on Geoscience and Remote Sensing, 51(7), 4009–
4018.
17. Lu, X., & Li, X. (2014). Multiresolution imaging. IEEE Transactions on Cybernetics, 44(1),
149–160.
18. Sujee, R., & Padmavathi, S. (2017). Image enhancement through pyramid histogram matching.
International Conference on Computer Communication and Informatics (ICCCI), 2017, 1–5.
https://doi.org/10.1109/ICCCI.2017.8117748
19. Zhu, H., Chan, F. H., & Lam, F. K. (1999). Image contrast enhancement by constrained
local histogram equalization. Computer Vision and Image Understanding, 73, 281–290. https:/
/doi.org/10.1006/cviu.1998.0723
20. Chithirala, N., et al. (2016). Weighted mean filter for removal of high density salt and pepper
noise. In 2016 3rd international conference on advanced computing and communication
systems (ICACCS) (Vol. 1). IEEE.
21. Radhakrishnan, A., & Jeyakumar, G. (2021). Evolutionary algorithm for solving combinatorial
optimization—A review. In H. S. Saini, R. Sayal, A. Govardhan, & R. Buyya (Eds.),
Innovations in computer science and engineering (Lecture notes in networks and systems)
(Vol. 171). Springer.
22. Gorai, A., & Ghosh, A. (2009). Gray-level image enhancement by particle swarm optimization.
Proc IEEE World Cong Nature Biol Inspired Comput, 72–77.
23. Munteanu, C., & Rosa, A. Gray-scale image enhancement as an automatic process driven by
evolution. IEEE Transactions on Systems, Man, and Cybernetics: Systems.
24. Pal, S. K., Bhandari, D., & Kundu, M. K. (1994). Genetic algorithms for optimal image
enhancement. Pattern Recognition Letters, 15(3), 261–271.
25. Saitoh, F. (1999). Image contrast enhancement using genetic algorithm. Proceedings of IEEE
International Conference on Systems, Man and Cybernetics, 4, 899–904.
26. Braik, M., Sheta, A., & Ayesh, A. (2007). Particle swarm optimisation enhancement approach
for improving image quality. International Journal of Innovative Computing and Applications,
1(2), 138–145.
27. dos Santos Coelho, L., Sauer, J. G., & Rudek, M. (2009). Differential evolution optimization
combined with chaotic sequences for image contrast enhancement. Chaos, Solitons & Fractals,
42(1), 522–529.
28. Shanmugavadivu, P., & Balasubramanian, K. (2014). Particle swarm optimized multi-objective
histogram equalization for image enhancement. Optics & Laser Technology, 57, 243–251.
29. Mahapatra, P. K., Ganguli, S., & Kumar, A. (2015). A hybrid particle swarm optimization
and artificial immune system algorithm for image enhancement. Soft Computing, 19(8), 2101–
2109.
30. Suresh, S., & Lal, S. (2017). Modified differential evolution algorithm for contrast and
brightness enhancement of satellite images. Applied Soft Computing, 61, 622–641.
31. Harichandana, M., Sowmya, V., Sajithvariyar, V. V., & Sivanpillai, R. (2020). Comparison of
image enhancement techniques for rapid processing of post flood images. The International
Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Xliv-M-2-
2020, 45–50.
32. Sony, O., Palanisamy, T., & Paramanathan, P. (2021). A study on the effect of thresholding
enhancement for the classification of texture images. Journal of The Institution of Engineers
(India): Series B, 103, 29. https://doi.org/10.1007/s40031-021-00610-9
33. Storn, R., & Price, K. (1997). Differential evolution–a simple and efficient heuristic for global
optimization over continuous spaces. Journal of Global Optimization, 11(4), 341–359.
34. Rönkkönen, J., Kukkonen, S., & Price, K. V. (2005). Real-parameter optimization with
differential evolution. Congress on Evolutionary Computation, 506–513.

35. Li, H., & Zhang, L. (2014). A discrete hybrid differential evolution algorithm for solving
integer programming problems. Engineering Optimization, 46(9), 1238–1268.
36. Liu, B., Wang, L., & Jin, Y.-H. (2007). An effective pso-based memetic algorithm for flow
shop scheduling. IEEE Transactions on Systems Man and Cybernetics Part B, 37(1), 18–27.
37. Li, X., & Yin, M. (2013). A hybrid cuckoo search via lévy flights for the permutation flow shop
scheduling problem. International Journal of Production Research, 51(16), 4732–4754.
38. Keerthanaa, K., & Radhakrishnan, A. (2020). Performance enhancement of adaptive image
contrast approach by using artificial bee colony algorithm. 2020 Fourth International Confer-
ence on Computing Methodologies and Communication (ICCMC), 255–260.
Swarm-Based Methods Applied to
Computer Vision

María-Luisa Pérez-Delgado

Abbreviations

Below are the abbreviations used in the chapter:

AA Artificial ants
ABC Artificial bee colony
ALO Ant lion optimizer
BA Bat algorithm
BFO Bacterial foraging optimization
CRS Crow search
CSO Cat swarm optimization
CT Computed tomography
CUS Cuckoo search
FA Firefly algorithm
FPA Flower pollination algorithm
FSA Fish swarm algorithm
GWO Gray wolf optimization
MR Magnetic resonance
PSO Particle swarm optimization
RGB Red, Green, Blue
WO Whale optimization

M.-L. Pérez-Delgado ()


University of Salamanca, Escuela Politécnica Superior de Zamora, Zamora, Spain
e-mail: mlperez@usal.es


1 Introduction

Nowadays, computer vision has become a very important element in many sectors,
such as the development of autonomous vehicles, the surveillance and supervision
systems, the manufacturing industry, or the health care sector [1]. It involves
the application of different image processing operations to analyze the data and
extract relevant information. The dimensionality of the data makes many of these
operations have a high computational cost. This requires applying methods with
reasonable execution time to generate solutions. Among such methods, swarm-
based algorithms have been successfully applied in various image processing
operations.
This chapter shows the application of this type of solution to various image
processing tasks related to computer vision. The objective is not to include an
exhaustive list of articles, since the length of the chapter does not allow it. Rather,
the chapter focuses on the most recent and interesting proposals where swarm-based
solutions have been successfully applied.

2 Brief Description of Swarm-Based Methods

Swarm-based algorithms define a metaheuristic approach to solve complex prob-


lems [2, 3]. These algorithms mimic the behavior observed in natural systems
in which all individuals of a swarm or population contribute to solve a problem.
This collective behavior was simulated to apply it to solve optimization problems.
Certainly, it has been shown that swarm-based methods can perform well for
complex problems [4–6].
Various swarm algorithms have been proposed in recent years [7]. Although
each algorithm has its peculiarities, they all share the same basic structure. The
first operation initializes the population. This operation generally associates each
individual with a random solution of the search space. Then, an iterative process
is applied to improve the current solutions associated with the individuals in the
population. At each iteration, the quality or fitness of the solutions is determined.
This value is computed by applying the objective function of the problem (or a
modification of said function) to the solution represented by each individual. The
solution with the best fitness is considered the solution to the problem in the current
iteration. Then, the population shares information to try to move the individuals to
better areas of the search space. This operation is different for each swarm-based
method, but it always moves some individual (all or some of them) to new positions
(generally more promising positions) in the search space. The iterative process
continues until the stopping criterion of the algorithm is met. This occurs when
the algorithm has performed a specific number of iterations or when the solution
converges. At the end of the iterations, the solution to the problem is the best solution
found by the swarm throughout the iterations.

Algorithm 1 PSO algorithm

1: Set initial values for $x_i(0)$ and $v_i(0)$, for $i = 1, \ldots, P$
2: Set $b_i(0) = x_i(0)$, for $i = 1, \ldots, P$
3: Compute $g(0)$ according to Eq. 1

$$g(t) = \{b_i(t) \mid fitness(b_i(t)) = \max_j (fitness(b_j(t)))\} \qquad (1)$$

4: for $t = 1$ to $TMAX$ do
5:   Compute $v_i(t)$, $x_i(t)$ and $b_i(t)$, for $i = 1, \ldots, P$, according to Eqs. 2, 3 and 4, respectively

$$v_i(t) = \omega v_i(t-1) + \phi_1 \epsilon_1 \left[ b_i(t-1) - x_i(t-1) \right] + \phi_2 \epsilon_2 \left[ g(t-1) - x_i(t-1) \right] \qquad (2)$$

$$x_i(t) = x_i(t-1) + v_i(t) \qquad (3)$$

$$b_i(t) = \begin{cases} x_i(t) & \text{if } fitness(x_i(t)) > fitness(b_i(t-1)) \\ b_i(t-1) & \text{otherwise} \end{cases} \qquad (4)$$

6:   Compute $g(t)$ according to Eq. 1
7: end for

The preceding paragraph gives an overview of the operations of the swarm-based
algorithms. This information is completed by describing the specific operations
of the particle swarm optimization (PSO) algorithm, which is one of the most
widely used swarm algorithms (Algorithm 1). The variables used in the description
of this algorithm are defined as follows. A swarm of $P$ particles is used to solve
a problem defined in an $r$-dimensional space. We consider that the algorithm will
conclude after performing $TMAX$ iterations. At iteration $t$ of the algorithm, particle
$i$ has a position $x_i(t)$ and a velocity $v_i(t)$ and remembers the best position it
has found so far, $b_i(t)$ (with $x_i(t) = (x_{i1}(t), \ldots, x_{ir}(t))$, $v_i(t) = (v_{i1}(t), \ldots, v_{ir}(t))$,
$b_i(t) = (b_{i1}(t), \ldots, b_{ir}(t))$, and $i = 1, \ldots, P$). $g(t)$ denotes the best position
found by the swarm up to iteration $t$ (the solution to the problem). $fitness(a)$
represents the function applied to compute the quality of a solution $a$. Finally, $\omega$, $\phi_1$,
and $\phi_2$ are predefined weights, while $\epsilon_1$ and $\epsilon_2$ are random vectors. Equations 1 and 4
are defined considering that the problem to be solved is a maximization problem.
Figure 1 graphically shows the elements that condition the movement of a particle
within the solution space.
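A minimal NumPy sketch of Algorithm 1 is given below for a maximization problem; the initialization ranges, the weight values, and the example fitness function are placeholders chosen only for illustration.

```python
import numpy as np

def pso(fitness, r=2, P=30, T_MAX=100, omega=0.7, phi1=1.5, phi2=1.5):
    """Particle swarm optimization for a maximization problem (Algorithm 1)."""
    x = np.random.uniform(-1.0, 1.0, (P, r))     # positions x_i(0) (illustrative range)
    v = np.zeros((P, r))                         # velocities v_i(0)
    b = x.copy()                                 # best position found by each particle
    b_fit = np.array([fitness(p) for p in b])
    g = b[np.argmax(b_fit)].copy()               # best position found by the swarm (Eq. 1)
    for _ in range(T_MAX):
        e1 = np.random.rand(P, r)                # random vectors of Eq. 2
        e2 = np.random.rand(P, r)
        v = omega * v + phi1 * e1 * (b - x) + phi2 * e2 * (g - x)   # Eq. 2
        x = x + v                                                   # Eq. 3
        x_fit = np.array([fitness(p) for p in x])
        improved = x_fit > b_fit                                    # Eq. 4
        b[improved], b_fit[improved] = x[improved], x_fit[improved]
        g = b[np.argmax(b_fit)].copy()                              # Eq. 1
    return g

# Example: maximize -||x||^2, whose optimum is at the origin
print(pso(lambda p: -np.dot(p, p)))
```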
Table 1 lists the swarm-based solutions mentioned in this chapter, along with a
reference that provides the reader with detailed information on each method.

Fig. 1 PSO determines the new position of particle $i$ ($x_i(t)$), taking into account its previous
position ($x_i(t-1)$), the best position found by the particle ($b_i(t-1)$), the best position found by
the swarm ($g(t-1)$), and the current velocity of the particle ($v_i(t)$)

Table 1 Basic references for the swarm-based algorithms cited in this article

Swarm-based method                       Reference
Artificial bee colony (ABC)              [8]
Artificial ants (AA)                     [9]
Ant lion optimizer (ALO)                 [10]
Bat algorithm (BA)                       [11]
Bacterial foraging optimization (BFO)    [12]
Cuckoo search (CUS)                      [13]
Cat swarm optimization (CSO)             [14]
Crow search (CRS)                        [15]
Flower pollination algorithm (FPA)       [16]
Firefly algorithm (FA)                   [17]
Fish swarm algorithm (FSA)               [18]
Gray wolf optimization (GWO)             [19]
Particle swarm optimization (PSO)        [20]
Whale optimization (WO)                  [21]

3 Some Advantages of Swarm-Based Methods

Computer vision systems require handling noisy, complex, and dynamic images.
For the system to be useful, it must interpret the image data accurately and quickly.
Many operations related to computer vision can be formulated as optimization
problems (segmentation, classification, tracking, etc.). Many of these problems are
difficult to solve for different reasons (the high dimensionality of the data, the large
volume of data to be processed, the noise in the data, the size and characteristics
of the solution space, etc.). Therefore, the resulting problems are often high-
dimensional optimization problems with complex search spaces that can include
complex constraints. Finding the optimal solution to these problems is very difficult,
and operations require a lot of execution time. The characteristics of these problems
make the classical optimization techniques not suitable for their resolution. For this

reason, various optimization techniques have been proposed in recent years to avoid
the problems of classical techniques. These methods have been applied to solve
optimization problems for which classical techniques do not work satisfactorily.
Swarm-based methods have been successfully applied to solve several computer
vision tasks, providing a good solution since they avoid getting stuck in local optima.
Swarm-based algorithms were developed to solve optimization problems, and
they have been successfully applied to many problems in different areas [2, 3].
These algorithms have been applied to complex non-linear optimization problems.
They are also useful to solve high-dimensional and multimodal problems. Further-
more, these methods require little a priori knowledge of the problem and have low
computational cost.
The characteristics of swarm-based methods give them several advantages over
classical optimization algorithms:
• Individuals are very simple, which facilitates their implementation.
• Individuals do their work independently, so swarm-based algorithms are highly
parallelizable.
• The system is flexible, as the swarm can respond to internal disturbances
and changes in the environment. In addition, the swarm can adapt to both
predetermined stimuli and new stimuli.
• The system is scalable, because it can range from very few individuals to a very
large number of them.
• The control of the swarm is distributed among its members, and this allows
the swarm to give a rapid local response to a change. This operation is quick
because it is not necessary to communicate with a central control or with all the
individuals in the swarm.
• The system is robust. Since there is no central control in the swarm, the system
can obtain a solution even if several individuals fail.
• Individuals interact locally with other individuals and also with the environment.
This behavior is useful for problems where there is no global knowledge of the
environment. In this case, the individuals exchange locally available information,
and this allows obtaining a good global solution to the problem. In addition, the
system can adapt to changes in its environment.
• In order to apply some classical methods, it is necessary to make assumptions
about the characteristics of the problem to be solved or the data to be processed.
In contrast, swarm-based methods do not make assumptions and can be applied
to a wide range of problems.
• These methods can explore a larger region of the search space and can avoid
premature convergence. Since the swarm evaluates several feasible solutions
in parallel, this prevents the system from being trapped in local minima.
Even if some individuals fall into a local optimum, other individuals may find
a promising solution.
• These algorithms include few parameters, and these parameters do not need to be
finely tuned for the algorithm to work.

4 Swarm-Based Methods and Computer Vision

Many of the image processing operations discussed below are closely related and are
often applied sequentially to an image. However, the operations have been separated
into several sections, each citing swarm-based solutions that focus on the specific
operation.

4.1 Feature Extraction

Feature extraction is a preliminary task for other image processing operations, since
it reduces the dimensionality of the data that those operations must handle. This
operation obtains the most relevant information from the image and represents it
in a lower dimensional space. The set of features obtained by this operation can be
used as input information to apply other processing to the image.
When a feature set has been extracted from an image, feature selection allows
selecting a subset of features from the entire set of candidate features. This is
a complex task, and swarm-based solutions have been proposed to reduce the
computational cost. In general, the interesting feature subset is conditioned by the
image processing that will be applied to those features. For this reason, the feature
selection operation is usually the previous step to another more general operation
that conditions the features to be selected (Fig. 2). For example, this occurs when
selecting features for image classification. Several swarm-based methods have been
used for feature selection to classify images, such as AA [22], PSO [23, 24], or
ABC [25].
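A common encoding for this task, sketched below under assumed details (a k-nearest-neighbour classifier, cross-validated accuracy, and a simple size penalty; none of these choices is taken from the cited papers), represents each particle as a binary mask over the candidate features and lets a binary PSO with a sigmoid transfer function maximize the resulting fitness.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def subset_fitness(mask, X, y, alpha=0.9):
    """Fitness of a binary feature mask: cross-validated accuracy plus a small
    reward for keeping the subset small."""
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
    return alpha * acc + (1 - alpha) * (1 - mask.sum() / mask.size)

def binary_pso_select(X, y, n_particles=20, iters=30, omega=0.7, phi=1.5):
    """Binary PSO sketch; returns a boolean mask of the selected features."""
    d = X.shape[1]
    v = np.zeros((n_particles, d))
    pos = np.random.rand(n_particles, d) > 0.5            # binary positions
    best, best_val = pos.copy(), np.array([subset_fitness(p, X, y) for p in pos])
    g = best[np.argmax(best_val)].copy()
    for _ in range(iters):
        r1, r2 = np.random.rand(n_particles, d), np.random.rand(n_particles, d)
        v = (omega * v + phi * r1 * (best.astype(float) - pos.astype(float))
             + phi * r2 * (g.astype(float) - pos.astype(float)))
        pos = np.random.rand(n_particles, d) < 1.0 / (1.0 + np.exp(-v))  # sigmoid transfer
        vals = np.array([subset_fitness(p, X, y) for p in pos])
        improved = vals > best_val
        best[improved], best_val[improved] = pos[improved], vals[improved]
        g = best[np.argmax(best_val)].copy()
    return g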
Feature selection is an important aspect of hyperspectral image processing, as
it allows selecting the relevant bands of the image in order to reduce the dimen-
sionality. PSO was used in [26] to select features and then apply a convolutional
neural network to classify hyperspectral images. PSO was also applied in [27], but
combining two swarms: one of them estimates the optimal number of bands and the
other selects the bands. The proposal of [28] combines PSO with genetic algorithms
for feature selection. PSO operations are applied to update the particles, and then a
new population is generated by applying the operators of the genetic algorithm. The
method automatically determines the number of features to select. Other researchers
have applied various swarm algorithms to address the same problem, including
GWO [29], ALO [30], CUS [31], ABC [32], or FA [33].

Fig. 2 Feature extraction and feature selection are two initial steps for other image processing
operations
Feature selection is also important for image steganalysis, which is the process of
detecting hidden messages in an image. It has been performed by ABC [34], PSO
[35, 36], or GWO [37]. In addition, other articles that apply swarms to this problem
are described in [38].
Detecting an object or region of interest within an image is highly dependent on
the image features being analyzed. The objective of feature detection is to identify
features, such as edges, shapes, or specific points. Swarm-based solutions reduce
the time required to perform this operation.
Several articles describe the use of artificial ants for edge detection. The proposal
presented in [39] uses the algorithm called ant system, while the methods described
in [40] and [41] use the ant colony system algorithm. In all the cases, ants are
used to obtain the edge information. On the other hand, the method described in
[42] applies artificial ants as a second operation to improve the edge information
obtained by other conventional edge detection algorithms (the Sobel and Canny
edge detection approaches). Other proposals for the application of artificial ants for
edge detection are described in [43] and [44]. Other swarm-based methods that
have been applied for edge detection are PSO [45] and ABC [46].
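As a rough illustration of how these ant-based detectors operate (a simplified sketch combining ideas common to [39–44] rather than the algorithm of any single reference; all parameter names are assumptions), ants can walk over the pixel grid, preferring neighbours with a high gradient magnitude, and deposit pheromone there; thresholding the final pheromone map yields an edge map.

import numpy as np

def aco_edges(image, n_ants=200, steps=300, evap=0.05, alpha=1.0, beta=2.0, thr=0.3):
    """Simplified ant-based edge detection on a grayscale image (2-D float array)."""
    gy, gx = np.gradient(image.astype(float))
    heuristic = np.hypot(gx, gy)
    heuristic /= heuristic.max() + 1e-12            # normalized gradient magnitude
    tau = np.full(image.shape, 1e-3)                # pheromone map
    pos = np.column_stack([np.random.randint(0, image.shape[0], n_ants),
                           np.random.randint(0, image.shape[1], n_ants)])
    moves = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for _ in range(steps):
        for a in range(n_ants):
            r, c = pos[a]
            cand = [(r + dr, c + dc) for dr, dc in moves
                    if 0 <= r + dr < image.shape[0] and 0 <= c + dc < image.shape[1]]
            weights = np.array([tau[p] ** alpha * (heuristic[p] + 1e-12) ** beta
                                for p in cand])
            nxt = cand[np.random.choice(len(cand), p=weights / weights.sum())]
            tau[nxt] += heuristic[nxt]              # deposit pheromone on the visited pixel
            pos[a] = nxt
        tau *= (1.0 - evap)                         # pheromone evaporation
    return (tau / tau.max()) > thr                  # boolean edge map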
There are also articles that describe the application of swarms for shape detection.
PSO and genetic algorithms were combined in [47] to define a method that detects
circles. ABC was applied in [48] to detect circular shapes, while BFO was applied
in [49].

4.2 Image Segmentation

Image segmentation consists of decomposing an image into non-overlapping regions.
Interesting parts can then be extracted from the image for further processing. For
example, this makes it possible to separate different objects and also to separate an
object from the background (Fig. 3). Image segmentation is very important in
computer vision applications, as it is a preliminary step for other operations such as
image understanding or image recognition. Several techniques are commonly used
for image segmentation, such as clustering, thresholding, edge detection, or region
identification. To analyze swarm-based solutions, we will focus on the first two
approaches.

Fig. 3 Example of segmentation process applied to extract the objects from the background
Clustering algorithms are one of the simplest segmentation techniques. The
pixels of the image are divided into clusters or groups of similar pixels, and each
cluster is represented by a single color. PSO was used in [50] to define the initial
centroids to apply the well-known K-means clustering method. The centroid of a
cluster is the value used to represent that cluster. The same methods were combined
in [51], but in this case PSO not only defines the initial centroids for K-means but
also determines the number of centroids.
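The sketch below illustrates one way such a hybrid can be organized (assuming a fixed number of clusters and flattened pixel vectors, e.g. RGB triples; it is not the exact formulation of [50] or [51]): each particle encodes a complete set of centroids, its fitness is the quantization error, and the best particle seeds a standard K-means run.

import numpy as np
from sklearn.cluster import KMeans

def quantization_error(centroids, pixels):
    """Mean distance from each pixel to its nearest centroid (lower is better)."""
    d = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1).mean()

def pso_seed_kmeans(pixels, k=4, n_particles=15, iters=40, omega=0.7, phi=1.5):
    """PSO over sets of k centroids; for large images, a random pixel subsample
    keeps the fitness evaluation affordable."""
    n, dim = pixels.shape
    idx = np.random.randint(0, n, (n_particles, k))
    x = pixels[idx].astype(float)                   # particles: (P, k, dim) centroid sets
    v = np.zeros_like(x)
    b, b_val = x.copy(), np.array([quantization_error(p, pixels) for p in x])
    g = b[np.argmin(b_val)].copy()
    for _ in range(iters):
        r1 = np.random.rand(*x.shape)
        r2 = np.random.rand(*x.shape)
        v = omega * v + phi * r1 * (b - x) + phi * r2 * (g - x)
        x = x + v
        vals = np.array([quantization_error(p, pixels) for p in x])
        better = vals < b_val
        b[better], b_val[better] = x[better], vals[better]
        g = b[np.argmin(b_val)].copy()
    # The best particle provides the initial centroids for a standard K-means run
    return KMeans(n_clusters=k, init=g, n_init=1).fit(pixels)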
The proposal of [52] combines FSA with the fuzzy c-means clustering method.
The first method is used to determine the number of clusters for the second method
and also to optimize the selection of the initial centroids.
Artificial ants were applied in [53]. In this case, an ant is assigned to each pixel
and moves around the image looking for low grayscale regions. When the algorithm
concludes, the pheromone accumulated by the ants allows the pixels to be classified
as black or white. The proposal of [54] also uses artificial ants, but the information
used to define the clusters is the gray value, the gradient, and the neighborhood of
the pixels. The method described in [55] applies the ant-tree algorithm, which is
an ant-based method in which the ants represent items that are connected in a tree
structure to define clusters.
Thresholding methods are popular techniques for image segmentation due to their
simplicity. They divide the pixels of the image based on their intensity and determine
the boundaries between classes. Bi-level thresholding is applied to divide an image
into two classes (e.g., the background and the object of interest), while multi-level
thresholding is used to divide it into more than two classes. The methods used to
compute the thresholds can be divided into non-parametric and parametric. Non-
parametric methods determine the thresholds by optimizing some criteria, and they
have been proven to be more accurate than parametric methods. Several thresholding
criteria have been proposed. The Otsu criterion is a very popular method that selects
optimal thresholds by maximizing the between-class variance [84]. Entropy-based
criteria maximize the sum of entropy for each class and are also widely used. Among
the criteria of this type, we can mention the Kapur entropy [85], the Tsallis entropy
[86], the minimum cross entropy [87], or the fuzzy entropy.
Many swarm-based methods have been applied to determine the thresholds in
multi-level thresholding (Table 2). In general, they define the fitness function of the
swarm using some of the thresholding criteria described above.
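To make the role of the fitness function concrete, the sketch below evaluates the Otsu between-class variance [84] of a grayscale histogram for a candidate threshold vector; any of the swarm methods in Table 2 could maximize this value over the thresholds. The histogram in the usage line is a random stand-in, not real image data.

import numpy as np

def otsu_between_class_variance(hist, thresholds):
    """Between-class variance of a 256-bin histogram for a candidate threshold set."""
    p = hist / hist.sum()                           # gray-level probabilities
    levels = np.arange(len(p))
    total_mean = (levels * p).sum()
    cuts = [0] + sorted(int(t) for t in thresholds) + [len(p)]
    variance = 0.0
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        w = p[lo:hi].sum()                          # class probability
        if w > 0:
            mu = (levels[lo:hi] * p[lo:hi]).sum() / w    # class mean
            variance += w * (mu - total_mean) ** 2
    return variance

# Example: evaluate a candidate 3-threshold solution proposed by a swarm
hist = np.random.randint(0, 500, 256).astype(float)      # stand-in for an image histogram
print(otsu_between_class_variance(hist, [60, 120, 190]))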

Table 2 Swarm-based methods applied to multi-level thresholding for image segmentation

Swarm   References   Criterion
ABC     [56]         Otsu
        [57]         Kapur
        [58]         Tsallis
        [59]         Kapur, Otsu
        [60]         Kapur, Otsu, Tsallis
FA      [61]         Otsu
        [62]         Kapur, Otsu
        [63]         Tsallis and Kapur
        [64]         Otsu, Kapur, minimum cross entropy
        [65]         Fuzzy entropy
        [66]         Minimum cross entropy
CUS     [67]         Kapur
        [68]         Tsallis
        [62]         Kapur and Otsu
        [69]         Minimum cross entropy
        [70]         Kapur, Otsu, Tsallis
PSO     [71–73]      Otsu
        [74]         Kapur
        [59, 75]     Kapur, Otsu
        [76]         Minimum cross entropy
GWO     [77]         Kapur
        [78]         Kapur, Otsu
BA      [79]         Otsu
        [80]         Kapur, Otsu
AA      [81]         Otsu
WO      [82]         Otsu, Kapur
CRS     [83]         Kapur

4.3 Image Classification

Image classification is the process of identifying groups of similar image primitives.
These image primitives can be pixels, regions, line elements, etc., depending on the
problem encountered. Many basic image processing techniques such as quantization
or segmentation can be viewed as different instances of the classification problem.
A classification method can be applied to associate an image with a specific class
(Fig. 4). Another possibility is to classify parts of the image as belonging to certain
classes (river, road, forest, etc.).
Several swarm-based approaches have been proposed to associate images with
specific classes. In general, swarm methods are combined with other methods to
define a classification system. For example, the system described in [88] uses PSO
to update the weights of a neural network that classifies color images. PSO was also used
in [89, 90], and [91] to define the optimal architecture of a convolutional neural
network applied to classify images. A system to classify fruit images was proposed
in [92] that applies a variant of ABC to train the neural network that performs
the classification. The solution proposed in [93] to classify remote-sensing images
uses a Naïve Bayes classifier and applies CUS to define the classifier weights. The
system for identifying and classifying plant leaf diseases described in [94] uses
BFO to define the weights of a radial basis function neural network.

Fig. 4 Example of a classification system that can distinguish ripe and unripe tomatoes from an
image
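A sketch of how a particle can encode the weights of a small classifier network is shown below (a generic one-hidden-layer network with assumed sizes and training accuracy as the fitness; the cited systems use their own architectures and objective functions). A swarm such as the PSO sketched earlier would search over the flattened weight vector.

import numpy as np

def nn_forward(weights, X, n_hidden, n_classes):
    """Decode a flat weight vector into a one-hidden-layer network and run it.
    Expected length: d*n_hidden + n_hidden + n_hidden*n_classes + n_classes."""
    d = X.shape[1]
    w1 = weights[:d * n_hidden].reshape(d, n_hidden)
    b1 = weights[d * n_hidden:d * n_hidden + n_hidden]
    off = d * n_hidden + n_hidden
    w2 = weights[off:off + n_hidden * n_classes].reshape(n_hidden, n_classes)
    b2 = weights[off + n_hidden * n_classes:]
    h = np.tanh(X @ w1 + b1)
    return h @ w2 + b2

def classification_fitness(weights, X, y, n_hidden=8, n_classes=3):
    """Fitness of a particle = training accuracy of the decoded network."""
    pred = nn_forward(weights, X, n_hidden, n_classes).argmax(axis=1)
    return (pred == y).mean()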
Swarm algorithms have also been applied to define classification methods that
allow classifying parts of an image.
Omran et al. described two applications of PSO for this type of image classification,
using each particle to represent the means of all the clusters. In the first case,
the fitness function tries to minimize the intra-cluster distance and to maximize the
inter-cluster distance [95]. In the second case, the function includes a third element
to minimize the quantization error [96].
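A hedged sketch of a fitness of this kind is shown below (the exact terms and weights used in [95, 96] may differ): each particle holds the centroids of all clusters, the worst average intra-cluster distance is penalized, and the smallest distance between centroids is rewarded.

import numpy as np

def clustering_fitness(centroids, pixels, w1=0.5, w2=0.5):
    """Lower is better: large intra-cluster spread is penalized,
    large separation between centroids is rewarded."""
    d = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    intra = max(d[labels == k, k].mean()                 # worst average intra-cluster distance
                for k in range(len(centroids)) if np.any(labels == k))
    pair = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    inter = pair[np.triu_indices(len(centroids), k=1)].min()   # closest pair of centroids
    return w1 * intra - w2 * inter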
The method described in [97] uses artificial ants to classify remote-sensing
images, so that different land uses are identified in the image. The same problem
was solved in [98], but applying PSO. The method described in [99] classifies a
high-resolution urban satellite image to identify 5 land cover classes: vegetation,
water, urban, shadow, and road. The article proposes two classification methods,
which apply artificial ants and PSO, respectively. Another crop classification system
was defined in [100]. This system uses PSO to train a neural network that can
differentiate 13 types of crops in a radar image.

4.4 Object Detection

Object detection consists of finding the image of a specific object within another more
general image or in a video sequence (Fig. 5). Automatic object detection is a very
important operation in computer vision, but it is difficult due to many factors such
as rotation, scale, occlusion, or viewpoint changes. The practical applications of this
operation include surveillance, medical image analysis, or image retrieval, among
others.

Fig. 5 Example of object detection trying to find tomatoes in an image. The figure shows the object
to be detected (left) and the instances of that object identified in a general image (right)
A method that uses feature-based object classification together with PSO is
proposed in [101]. This method allows finding multiple instances of objects of a
class. Each particle of the swarm is a local region classifier. The objective function
measures the confidence that the image distribution in the region currently analyzed
by a particle is a member of the object class. PSO was also used in [102] to define
a feature-based method to distinguish a salient object from the background of the
image.
Model-based methods use a mathematical model to describe the object to be
recognized. Said model must preserve the key visual properties of the object
category, so that it can be used to identify objects of that category with variations
due to deformations, occlusions, illumination, etc.
A model-based system that uses PSO was described in [103]. In this case,
the object detection operation is considered as an optimization problem, and the
objective function to be maximized represents the similarity between the model and
a region of the image under investigation. PSO is used to optimize the parameters
of the deformable template that represents the object to be found.
The method presented in [104] applies PSO to detect traffic signs in real time.
In this case, the signs are defined as sets of three-dimensional points that define the
contour. The fitness function of PSO detects a sign belonging to a certain category
and, at the same time, estimates its position relative to the camera’s reference frame.
The proposal of [105] uses PSO to optimize the parameters of a support vector
machine that is used to identify traffic signs.
Active contour models, also called snakes, are deformable models applied to
detect the contour of an object in an image. Control points are defined near the
object of interest and are moved to conform to the shape of the object. Tseng
et al. used PSO to define an active contour model that uses several swarms, each
associated with a control point [106]. The method proposed in [107] uses ABC to
apply an active contour model.
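The sketch below gives a generic contour energy of the kind such swarms can minimize (a textbook-style formulation with elasticity, bending, and edge-attraction terms; it is not the specific model of [106] or [107], and the weights are placeholders). Each candidate solution is the set of control point coordinates, and lower energy means a better fit to the object boundary.

import numpy as np

def snake_energy(points, gradient_mag, alpha=0.5, beta=0.5):
    """Energy of a closed contour given by control points (N, 2) in (row, col).
    Internal terms favor short, smooth contours; the external term pulls points
    toward strong image gradients."""
    nxt = np.roll(points, -1, axis=0)
    prv = np.roll(points, 1, axis=0)
    elastic = np.sum(np.linalg.norm(nxt - points, axis=1) ** 2)             # elasticity
    bending = np.sum(np.linalg.norm(prv - 2 * points + nxt, axis=1) ** 2)   # curvature
    rows = np.clip(points[:, 0].astype(int), 0, gradient_mag.shape[0] - 1)
    cols = np.clip(points[:, 1].astype(int), 0, gradient_mag.shape[1] - 1)
    external = -np.sum(gradient_mag[rows, cols])                            # attraction to edges
    return alpha * elastic + beta * bending + external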

Model-based methods include graph models, which break the object into parts
and represent each one by a graph vertex. This approach is considered in [108],
which applies artificial ants for road extraction from very high-resolution satellite
images. First, the image is segmented to generate image objects. These objects are
then used to define the nodes of the graph that the ants will traverse to define a binary
roadmap. At the end of the process, the binary roadmap is vectorized to obtain the
center lines of the road network.
A cuckoo search-based method was applied in [109] to detect vessels in a
synthetic aperture radar image.
The proposals described in [110] and [111] define two template-matching
methods that apply ABC. Template-matching methods try to find a sub-image,
called template, within another image. The objective function proposed in [110]
computes the difference between the RGB level histograms corresponding to the
target object and the template object. On the other hand, the absolute sum of
differences between pixels of the target image and the template image was used
in [111] to define the fitness function.
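A generic sketch of such a fitness is given below (assumptions: grayscale images stored as 2-D arrays and a candidate solution encoded by the top-left corner of the window; the cited works use their own encodings). A swarm method such as ABC would minimize this value over the (x, y) offsets.

import numpy as np

def sad_fitness(image, template, x, y):
    """Sum of absolute differences between the template and the image window
    whose top-left corner is (x, y); lower means a better match."""
    h, w = template.shape
    x, y = int(round(x)), int(round(y))
    if x < 0 or y < 0 or y + h > image.shape[0] or x + w > image.shape[1]:
        return np.inf                               # candidate falls outside the image
    window = image[y:y + h, x:x + w].astype(float)
    return np.abs(window - template.astype(float)).sum()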
A method for visual target recognition for low altitude aircraft was described
in [112]. It is a shape matching method that uses ABC to optimize the matching
parameters.
Before concluding this section, it should be noted that object recognition is
a necessary operation for object tracking. Several applications of PSO for object
tracking appear in [113, 114], or [115]. Other swarm-based solutions considered
are CUS [116, 117], BA [118], and FA [119].

4.5 Face Recognition

Face recognition is an interesting area of image analysis. It has a wide range of
applications, including human–computer interaction and security systems (Fig. 6).
Face recognition is a difficult operation due to the variability of many parameters,
such as scale, size, pose, expression, hair, and environmental parameters.
The quality of a face recognition system is highly influenced by the set of
features selected to complete the operation. The most discriminant features should
be selected, especially those that are not affected by variations in scale, facial
expressions, pose, or illumination. Several swarm-based methods have been used
to improve feature selection for face recognition, including artificial ants [120], FA
[121], PSO [122, 123], CUS [124], BFO [125], or BA [126].

Fig. 6 Blocks of a security system that includes face recognition
The authors of [127] addressed the face recognition problem when the illumina-
tion conditions are not suitable. They used a sensor that simultaneously takes two
face images (visible and infrared) and applied PSO to merge both images.
BFO was used in [128] to recognize faces with age variations. Since aging affects
each facial region differently, they defined specific weights for the information
extracted from each area and applied the swarm-based algorithm to combine the
features of global and local facial regions.
In addition to using swarm algorithms for feature selection in face recognition,
these algorithms are also combined with other methods to define a face recognition
system. The face recognition system defined in [129] combines support vector
machines with PSO. In this case, PSO was used to optimize the support vector
machine parameters. The proposal of [130] defines a system based on linear
discriminant analysis in which BFO was used to define the optimal principal
components. The method described in [131] combines ABC with Volterra kernels.
The solution described in [132] combines PSO with a neural network to optimize
the parameters of the network. CUS was combined with principal component
analysis and intrinsic discriminant analysis in [133]. The proposal of [134] applies
PSO and ABC to define a classifier for face recognition. The system defined in [135]
combines a neural network with FA and uses the fireflies to define the parameters of
the network.

4.6 Gesture Recognition

Humans show many emotions through facial expressions (happiness, sadness, anger,
etc.). The recognition of these expressions is useful for the analysis of customer
satisfaction, video games, or virtual reality, among other applications. Several
swarm-based methods have been proposed for the automatic recognition of facial
expressions. They use FA [136], PSO [137], CSO [138], or GWO [139]. On
the other hand, the method described in [140] proposes a three-dimensional facial
expression recognition model that uses artificial ants and PSO.
Face recognition and facial expression recognition are related to head pose
estimation. Head pose estimation is a difficult problem in computer vision. This
problem was addressed in [141] by a method that uses images from a depth camera
and applies the PSO algorithm after formulating the task as an optimization problem.
The method presented in [142] is a PSO-based solution for three-dimensional head
pose estimation that uses a commodity depth camera. A variant of ABC was used in
[143] for face pose estimation from a single image.
Human motion recognition is a process that requires detecting changes in the
position of a human posture or gesture in a video sequence. The starting point of
this process is the identification of the initial position. Tracking human motion from
video sequences has applications in fields such as human–computer interaction and
surveillance.
Several articles propose methods for hand pose estimation based on image
analysis. The method described in [144] uses PSO to estimate hand pose from
two-dimensional silhouettes of a hand extracted from a multi-camera system. The
proposal presented in [145] uses PSO to estimate the three-dimensional pose of the
hand. Another PSO-based method is proposed in [146] to estimate the pose of a
hand that interacts with an object. The problem of tracking hand articulations was
solved in [147] by a model that uses PSO. In this case, the input information was
obtained by a Kinect sensor, which includes an image and a depth map.
Human body pose estimation from images is an interesting starting point for
more complex operations, such as tracking human body pose in a video sequence.
The proposal of [148] uses PSO to estimate the human pose from still images. The
input data used by this method is a set of multi-view images of a person sitting at
a table. On the other hand, BA was used in [149] to estimate the pose of a human
body in video sequences. PSO was applied in [150] to estimate upper-body posture
from multi-view markerless sequences.
A system to detect a volleyball player from a video sequence was proposed in
[151]. The authors of the article analyzed the application of several swarm methods,
concluding that CUS generates the best results.
PSO was applied in [152] for markerless full-body articulated human motion
tracking. They used multi-view video sequences acquired in a studio environment.
The same swarm-based method was used in [153] to define a model for three-
dimensional tracking of the human body.
A PSO-based solution was described in [154] to track multiple pedestrians in a
crowded scene.

4.7 Medical Image Processing

Many techniques can be applied to obtain medical images, such as magnetic
resonance (MR) imaging, computed tomography (CT), or X-ray. The images
obtained by these methods provide very useful information for making medical
decisions. To this end, various image processing techniques are often applied to
medical images.
Many articles describe swarm-based solutions that apply to medical images some
operations already discussed in previous sections, such as segmentation (Table 3),
classification (Table 4), or feature selection (Table 5).
Image registration is an interesting operation applied to medical images. In
general, the images obtained by different techniques must be compared or combined
by experts to make decisions. To combine these images properly, they must first be
geometrically and temporally aligned. This alignment process is called registration.
Table 6 shows several articles that apply swarm-based methods to medical image
registration.
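As a minimal sketch of how a swarm can drive registration (assuming 2-D grayscale images, a rigid transform parameterized by a rotation angle and a translation, and a sum-of-squared-differences measure; the works in Table 6 use richer transforms and similarity metrics), a swarm optimizer would search the parameter space to minimize the cost below.

import numpy as np
from scipy import ndimage

def registration_cost(params, fixed, moving):
    """Dissimilarity (SSD) between the fixed image and the moving image
    transformed by params = (angle_deg, shift_row, shift_col)."""
    angle, dy, dx = params
    warped = ndimage.shift(ndimage.rotate(moving, angle, reshape=False), (dy, dx))
    return float(((fixed.astype(float) - warped) ** 2).sum())

# A swarm optimizer (for instance the PSO loop sketched earlier) would minimize
# registration_cost over (angle, dy, dx), e.g. with
# fitness = lambda p: -registration_cost(p, fixed, moving)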

Table 3 Swarm-based methods applied to medical image segmentation


Swarm References Image type
ABC [155] MR brain image
[156, 157] MR brain image
(Combines ABC with fuzzy c-means)
[158] CT images to segment the liver area
(Clustering method to segment the liver area)
AA [159] Fundus photographs for exudate segmentation
[160, 161] MR brain image
[162] MR brain image
(Combines artificial ants with fuzzy segmentation)
CUS [163] Microscopic image
[164] MR brain image to detect brain tumors
PSO [165] Stomach images
(PSO optimizes the parameters for Otsu criterion)
[166] MR brain image
(PSO selects the optimal cluster centers for the
fuzzy c-means method that performs
segmentation)
[167] CT images to detect lung tumor
(PSO selects the optimal cluster centers for the
fuzzy c-means method that performs
segmentation)
[168] Several types of medical images
(Active contour-based image segmentation)
[169] MR angiography
(PSO estimates the parameters of a finite mixture
model that fits the intensity histogram of the
image)
GWO [170] Skin images to detect melanoma
(GWO optimizes a multilayer perceptron neural
network designed to detect melanoma)
FPA [171] CT and MR imaging
BA [172] MR brain image to detect brain tumors
(BA selects the optimal cluster centers for the
fuzzy c-means method that performs
segmentation)
FA [173] MR brain image to detect brain tumors
(The fitness function of FA uses Tsallis entropy)

Table 4 Swarm-based methods applied to medical image classification


Swarm References Objective
ABC [174] Cervical cancer detection in CT images
GWO [175] Classification of MR brain images as normal or abnormal
(Combines GWO with neural networks)
PSO [176] Classification of MR brain images as normal or abnormal
(The classification is performed by a support vector machine
whose parameters are optimized by PSO)
[177] Detection breast abnormalities in mammograms
(Combines PSO with a neural network)
FA [178] Breast tumor classification
(FA updates the weights of the neural network that performs the
classification)

Table 5 Swarm-based methods applied to feature selection in medical images


Swarm References Feature selection for...
PSO [179] Skin cancer diagnosis
FA [180] Detection of brain tumors on MR brain image
ABC [181] Classification of breast lesion on mammogram images
GWO [182] Classification of brain images for Alzheimer detection
[183] Classification of cervical lesions as benign and malignant
BA [184] Classification of brain tumor by a support vector machine
CUS [185] Breast tumor identification on mammogram images

Table 6 Swarm-based methods applied to medical image registration


Swarm References Applied to...
AA [186] Brain images
(The result of the ant-based algorithm is provided to a neural network)
PSO [187] Several types of images
[188] Several types of images
(Combines PSO and differential evolution)
[189] Several types of images
(Describes several PSO-based methods published for this issue)
GWO [190] Brain images
CRS [191] CT and MR images

References

1. Szeliski, R. (2010). Computer vision: Algorithms and applications, Springer Science &
Business Media.
2. Panigrahi, B. K., Shi, Y., & Lim, M. H. (2011). Handbook of swarm intelligence: Concepts,
principles and applications (Vol. 8). Springer Science & Business Media.
3. Yang, X. S., Cui, Z., Xiao, R., Gandomi, A. H., & Karamanoglu, M. (2013). Swarm
intelligence and bio-inspired computation: theory and applications. Newnes.
4. Abraham, A., Guo, H., & Liu, H. (2006). Swarm intelligence: foundations, perspectives and
applications. In Swarm intelligent systems (pp. 3–25). Springer.
5. Abdulrahman, S. M. (2017). Using swarm intelligence for solving NP-hard problems.
Academic Journal of Nawroz University, 6(3), 46–50.
6. Hassanien, A. E., & Emary, E. (2018). Swarm intelligence: Principles, advances, and
applications. CRC Press.
7. Slowik, A. (2021). Swarm intelligence algorithms: Modifications and applications. CRC
Press.
8. Karaboga, D., & Basturk, B. (2007). A powerful and efficient algorithm for numerical func-
tion optimization: Artificial bee colony (ABC) algorithm. Journal of Global Optimization,
39(3), 459–471.
9. Dorigo, M., & Stützle, T. (2019). Ant colony optimization: overview and recent advances. In
Handbook of metaheuristics (pp. 311–351).
10. Mirjalili, S. (2015). The ant lion optimizer. Advances in Engineering Software, 83, 80–98.
11. Yang, X. S. (2010) A new metaheuristic bat-inspired algorithm. In González, J., Pelta, D.,
Cruz, C., Terrazas, G., & Krasnogor, N. (Eds.), Nature Inspired Cooperative Strategies for
Optimization (NICSO 2010) (pp. 65–74). Springer. 10.1007/978-3-642-12538-6_6
12. Passino, K. M. (2002). Biomimicry of bacterial foraging for distributed optimization and
control. IEEE Control Systems Magazine, 22(3), 52–67.
13. Yang, X. S., & Deb, S. (2009). Cuckoo search via Lévy flights. In 2009 World
Congress on Nature & Biologically Inspired Computing (NaBIC) (pp. 210–214). IEEE.
10.1109/NABIC.2009.5393690
14. Chu, S. C., & Tsai, P. W. (2007). Computational intelligence based on the behavior of cats.
International Journal of Innovative Computing, Information and Control, 3(1), 163–173.
15. Askarzadeh, A. (2016). A novel metaheuristic method for solving constrained engineering
optimization problems: Crow search algorithm. Computers & Structures, 169, 1–12.
16. Yang, X. S., Karamanoglu, M., & He, X. (2014). Flower pollination algorithm: a novel
approach for multiobjective optimization. Engineering Optimization, 46(9), 1222–1237.
17. Yang, X. S., & He, X. (2013). Firefly algorithm: recent advances and applications. Interna-
tional Journal of Swarm Intelligence, 1(1), 36–50.
18. Li, X. L., Shao, Z. J., & Qian, J. X. (2002). An optimizing method based on autonomous
animats: Fish-swarm algorithm. Systems Engineering - Theory and Practice, 22(11), 32–38.
19. Mirjalili, S., Mirjalili, S. M., & Lewis, A. (2014). Grey wolf optimizer. Advances in
Engineering Software, 69, 46–61.
20. Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of
ICNN’95-International Conference on Neural Networks (Vol. 4, pp. 1942–1948). IEEE.
10.1109/ICNN.1995.488968
21. Mirjalili, S., & Lewis, A. (2016). The whale optimization algorithm. Advances in Engineering
Software, 95, 51–67.
22. Chen, B., Chen, L., & Chen, Y. (2013) Efficient ant colony optimization for image feature
selection. Signal Processing, 93(6), 1566–1576.
23. Kumar, A., Patidar, V., Khazanchi, D., & Saini, P. (2016). Optimizing feature selection using
particle swarm optimization and utilizing ventral sides of leaves for plant leaf classification.
Procedia Computer Science, 89, 324–332.

24. Naeini, A. A., Babadi, M., Mirzadeh, S. M. J., & Amini, S. (2018). Particle swarm
optimization for object-based feature selection of VHSR satellite images. IEEE Geoscience
and Remote Sensing Letters, 15(3), 379–383.
25. Andrushia, A. D., & Patricia, A. T. (2020). Artificial bee colony optimization (ABC) for grape
leaves disease detection. Evolving Systems, 11(1), 105–117.
26. Ghamisi, P., Chen, Y., & Zhu, X. X. (2016). A self-improving convolution neural network
for the classification of hyperspectral data. IEEE Geoscience and Remote Sensing Letters,
13(10), 1537–1541.
27. Su, H., Du, Q., Chen, G., & Du, P. (2014). Optimized hyperspectral band selection using
particle swarm optimization. IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing, 7(6), 2659–2670.
28. Ghamisi, P., & Benediktsson, J. A. (2014). Feature selection based on hybridization of genetic
algorithm and particle swarm optimization. IEEE Geoscience and Remote Sensing Letters,
12(2), 309–313.
29. Medjahed, S. A., Saadi, T. A., Benyettou, A., & Ouali, M. (2016). Gray wolf optimizer for
hyperspectral band selection. Applied Soft Computing, 40, 178–186.
30. Wang, M., Wu, C., Wang, L., Xiang, D., & Huang, X. (2019). A feature selection approach for
hyperspectral image based on modified ant lion optimizer. Knowledge-Based Systems, 168,
39–48.
31. Medjahed, S. A., Saadi, T. A., Benyettou, A., & Ouali, M. (2015). Binary cuckoo search
algorithm for band selection in hyperspectral image classification. IAENG International
Journal of Computer Science, 42(3), 183–191.
32. Xie, F., Li, F., Lei, C., Yang, J., & Zhang, Y. (2019). Unsupervised band selection based on
artificial bee colony algorithm for hyperspectral image classification. Applied Soft Computing,
75, 428–440.
33. Su, H., Cai, Y., & Du, Q. (2016). Firefly-algorithm-inspired framework with band selection
and extreme learning machine for hyperspectral image classification. IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing, 10(1), 309–320.
34. Mohammadi, F. G., & Abadeh, M. S. (2014). Image steganalysis using a bee colony based
feature selection algorithm. Engineering Applications of Artificial Intelligence, 31, 35–43.
35. Chhikara, R. R., Sharma, P., & Singh, L. (2016). A hybrid feature selection approach based on
improved PSO and filter approaches for image steganalysis. International Journal of Machine
Learning and Cybernetics, 7(6), 1195–1206.
36. Adeli, A., & Broumandnia, A. (2018). Image steganalysis using improved particle swarm
optimization based feature selection. Applied Intelligence, 48(6), 1609–1622.
37. Pathak, Y., Arya, K., & Tiwari, S. (2019). Feature selection for image steganalysis using Levy
flight-based grey wolf optimization. Multimedia Tools and Applications, 78(2), 1473–1494.
38. Zebari, D. A., Zeebaree, D. Q., Saeed, J. N., Zebari, N. A., & Adel, A. Z. (2020). Image
steganography based on swarm intelligence algorithms: A survey. Test Engineering and
Management, 7(8), 22257–22269.
39. Nezamabadi-Pour, H., Saryazdi, S., & Rashedi, E. (2006). Edge detection using ant algo-
rithms. Soft Computing, 10(7), 623–628.
40. Tian, J., Yu, W., & Xie, S. (2008). An ant colony optimization algorithm for image edge
detection. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on
Computational Intelligence) (pp. 751–756). IEEE. 10.1109/CEC.2008.4630880
41. Baterina, A. V., & Oppus, C. (2010). Image edge detection using ant colony optimization.
WSEAS Transactions on Signal Processing, 6(2), 58–67.
42. Lu, D. S., & Chen, C. C. (2008). Edge detection improvement by ant colony optimization.
Pattern Recognition Letters, 29(4), 416–425.
43. Verma, O. P., Hanmandlu, M., & Sultania, A. K. (2010). A novel fuzzy ant system for edge
detection. In 2010 IEEE/ACIS 9th International Conference on Computer and Information
Science (pp. 228–233). IEEE. 10.1109/ICIS.2010.145
44. Etemad, S. A., & White, T. (2011). An ant-inspired algorithm for detection of image edge
features. Applied Soft Computing, 11(8), 4883–4893.

45. Setayesh, M., Zhang, M., & Johnston, M. (2009). A new homogeneity-based approach to edge
detection using PSO. In 2009 24th International Conference Image and Vision Computing
New Zealand (pp. 231–236). IEEE. 10.1109/IVCNZ.2009.5378404
46. Yigitbasi, E. D., & Baykan, N. A. (2013). Edge detection using artificial bee colony algorithm
(ABC). International Journal of Information and Electronics Engineering, 3(6), 634–638.
47. Dong, N., Wu, C. H., Ip, W. H., Chen, Z. Q., Chan, C. Y., & Yung, K. L. (2012). An
opposition-based chaotic GA/PSO hybrid algorithm and its application in circle detection.
Computers & Mathematics with Applications, 64(6), 1886–1902.
48. Cuevas, E., Sención-Echauri, F., Zaldivar, D., & Pérez-Cisneros, M. (2012) Multi-circle
detection on images using artificial bee colony (ABC) optimization. Soft Computing, 16(2),
281–296.
49. Dasgupta, S., Das, S., Biswas, A., & Abraham, A. (2010). Automatic circle detection on
digital images with an adaptive bacterial foraging algorithm. Soft Computing, 14(11), 1151–
1164.
50. Li, H., He, H., & Wen, Y. (2015). Dynamic particle swarm optimization and k-means
clustering algorithm for image segmentation. Optik, 126(24), 4817–4822.
51. Omran, M.G., Salman, A., & Engelbrecht, A. P. (2006). Dynamic clustering using particle
swarm optimization with application in image segmentation. Pattern Analysis and Applica-
tions, 8(4), 332–344.
52. Chu, X., Zhu, Y., Shi, J., & Song, J. (2010). Method of image segmentation based on
fuzzy c-means clustering algorithm and artificial fish swarm algorithm. In 2010 International
Conference on Intelligent Computing and Integrated Systems (pp. 254–257). IEEE.
53. Malisia, A. R., & Tizhoosh, H. R. (2006). Image thresholding using ant colony optimization.
In The 3rd Canadian Conference on Computer and Robot Vision (CRV’06) (pp. 26–26). IEEE.
10.1109/CRV.2006.42
54. Han, Y., & Shi, P. (2007). An improved ant colony algorithm for fuzzy clustering in image
segmentation. Neurocomputing, 70(4–6), 665–671.
55. Yang, X., Zhao, W., Chen, Y., & Fang, X. (2008). Image segmentation with a fuzzy clustering
algorithm based on ant-tree. Signal Processing, 88(10), 2453–2462.
56. Ye, Z., Hu, Z., Wang, H., & Chen, H. (2011). Automatic threshold selection based on
artificial bee colony algorithm. In 2011 3rd International Workshop on Intelligent Systems
and Applications (pp. 1–4). IEEE. 10.1109/ISA.2011.5873357
57. Horng, M. H. (2010). A multilevel image thresholding using the honey bee mating optimiza-
tion. Applied Mathematics and Computation, 215(9), 3302–3310.
58. Zhang, Y., & Wu, L. (2011). Optimal multi-level thresholding based on maximum Tsallis
entropy via an artificial bee colony approach. Entropy, 13(4), 841–859.
59. Akay, B. (2013). A study on particle swarm optimization and artificial bee colony algorithms
for multilevel thresholding. Applied Soft Computing, 13(6), 3066–3091.
60. Bhandari, A. K., Kumar, A., & Singh, G. K. (2015). Modified artificial bee colony based
computationally efficient multilevel thresholding for satellite image segmentation using
Kapur’s, Otsu and Tsallis functions. Expert Systems with Applications, 42(3), 1573–1601.
61. Sri Madhava Raja, N., Rajinikanth, V., & Latha, K. (2014). Otsu based optimal multilevel
image thresholding using firefly algorithm. Modelling and Simulation in Engineering, 2014.
10.1155/2014/794574
62. Brajevic, I., & Tuba, M. (2014). Cuckoo search and firefly algorithm applied to multilevel
image thresholding. In Yang, X. (Ed.), Cuckoo search and firefly algorithm. Studies in
Computational Intelligence (pp. 115–139). Springer.
63. Manic, K. S., Priya, R. K., & Rajinikanth, V. (2016). Image multithresholding based on
Kapur/Tsallis entropy and firefly algorithm. Indian Journal of Science and Technology, 9(12),
1–6. 10.17485/ijst/2016/v9i12/89949
64. He, L., & Huang, S. (2017). Modified firefly algorithm based multilevel thresholding for color
image segmentation. Neurocomputing, 240, 152–174.

65. Pare, S., Bhandari, A. K., Kumar, A., & Singh, G. K. (2018). A new technique for multilevel
color image thresholding based on modified fuzzy entropy and Lévy flight firefly algorithm.
Computers & Electrical Engineering, 70, 476–495.
66. Horng, M. H., & Liou, R. J. (2011). Multilevel minimum cross entropy threshold selection
based on the firefly algorithm. Expert Systems with Applications, 38(12), 14805–14811.
67. Bhandari, A. K., Singh, V. K., Kumar, A., & Singh, G. K. (2014). Cuckoo search algorithm
and wind driven optimization based study of satellite image segmentation for multilevel
thresholding using Kapur’s entropy. Expert Systems with Applications, 41(7), 3538–3560.
68. Agrawal, S., Panda, R., Bhuyan, S., & Panigrahi, B. K. (2013). Tsallis entropy based
optimal multilevel thresholding using cuckoo search algorithm. Swarm and Evolutionary
Computation, 11, 16–30.
69. Pare, S., Kumar, A., Bajaj, V., & Singh, G. K. (2017). An efficient method for multilevel color
image thresholding using cuckoo search algorithm based on minimum cross entropy. Applied
Soft Computing, 61, 570–592.
70. Suresh, S., & Lal, S. (2016). An efficient cuckoo search algorithm based multilevel threshold-
ing for segmentation of satellite images using different objective functions. Expert Systems
with Applications, 58, 184–209.
71. Gao, H., Xu, W., Sun, J., & Tang, Y. (2009). Multilevel thresholding for image segmentation
through an improved quantum-behaved particle swarm algorithm. IEEE Transactions on
Instrumentation and Measurement, 59(4), 934–946.
72. Liu, Y., Mu, C., Kou, W., & Liu, J. (2015). Modified particle swarm optimization-based
multilevel thresholding for image segmentation. Soft Computing, 19(5), 1311–1327.
73. Ghamisi, P., Couceiro, M. S., Martins, F. M., & Benediktsson, J. A. (2013). Multilevel
image segmentation based on fractional-order Darwinian particle swarm optimization. IEEE
Transactions on Geoscience and Remote Sensing, 52(5), 2382–2394.
74. Maitra, M., & Chatterjee, A. (2008). A hybrid cooperative–comprehensive learning based
PSO algorithm for image segmentation using multilevel thresholding. Expert Systems with
Applications, 34(2), 1341–1350.
75. Duraisamy, S. P., & Kayalvizhi, R. (2010). A new multilevel thresholding method using
swarm intelligence algorithm for image segmentation. Journal of Intelligent Learning Systems
and Applications, 2(03), 126–138.
76. Yin, P. Y. (2007). Multilevel minimum cross entropy threshold selection based on particle
swarm optimization. Applied Mathematics and Computation, 184(2), 503–513.
77. Li, L., Sun, L., Guo, J., Qi, J., Xu, B., & Li, S. (2017). Modified discrete grey wolf optimizer
algorithm for multilevel image thresholding. Computational Intelligence and Neuroscience,
2017. 10.1155/2017/3295769
78. Khairuzzaman, A. K. M., & Chaudhury, S. (2017). Multilevel thresholding using grey wolf
optimizer for image segmentation. Expert Systems with Applications, 86, 64–76.
79. Satapathy, S. C., Raja, N. S. M., Rajinikanth, V., Ashour, A. S., & Dey, N. (2018). Multi-
level image thresholding using Otsu and chaotic bat algorithm. Neural Computing and
Applications, 29(12), 1285–1307.
80. Alihodzic, A., & Tuba, M. (2014). Improved bat algorithm applied to multilevel image
thresholding. The Scientific World Journal, 2014. 10.1155/2014/176718
81. Liang, Y. C., Chen, A. H. L., & Chyu, C. C. (2006). Application of a hybrid ant colony
optimization for the multilevel thresholding in image processing. In King, I., Wang, J., Chan,
L., & Wang, D. (Eds.), International Conference on Neural Information Processing. Lecture
Notes in Computer Science (Vol. 4233, pp. 1183–1192). Springer.
82. Abd El Aziz, M., Ewees, A. A., Hassanien, A. E., Mudhsh, M., & Xiong, S. (2018). Multi-
objective whale optimization algorithm for multilevel thresholding segmentation. In Advances
in Soft Computing and Machine Learning in Image Processing (pp. 23–39). Springer.
83. Upadhyay, P., & Chhabra, J. K. (2020). Kapur’s entropy based optimal multi-
level image segmentation using crow search algorithm. Applied Soft Computing, 97.
10.1016/j.asoc.2019.105522

84. Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions
on Systems, Man, and Cybernetics, 9(1), 62–66.
85. Kapur, J. N., Sahoo, P. K., & Wong, A. K. (1985). A new method for gray-level picture
thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image
Processing, 29(3), 273–285.
86. Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statisti-
cal Physics, 52(1), 479–487.
87. Li, C. H., & Lee, C. (1993). Minimum cross entropy thresholding. Pattern Recognition, 26(4),
617–625.
88. Chandramouli, K., & Izquierdo, E. (2006). Image classification using chaotic particle swarm
optimization. In 2006 International Conference on Image Processing (pp. 3001–3004). IEEE.
10.1109/ICIP.2006.312968
89. Wang, B., Sun, Y., Xue, B., & Zhang, M. (2018). Evolving deep convolutional neural net-
works by variable-length particle swarm optimization for image classification. In 2018 IEEE
Congress on Evolutionary Computation (CEC) (pp. 1–8). IEEE. 10.1109/CEC.2018.8477735
90. Fielding, B., & Zhang, L. (2018). Evolving image classification architectures with enhanced
particle swarm optimisation. IEEE Access, 6, 68560–68575.
91. Junior, F. E. F., & Yen, G. G. (2019). Particle swarm optimization of deep neural networks
architectures for image classification. Swarm and Evolutionary Computation, 49, 62–74.
92. Wang, S., Zhang, Y., Ji, G., Yang, J., Wu, J., & Wei, L. (2015). Fruit classification by
wavelet-entropy and feedforward neural network trained by fitness-scaled chaotic ABC and
biogeography-based optimization. Entropy, 17(8), 5711–5728.
93. Yang, J., Ye, Z., Zhang, X., Liu, W., & Jin, H. (2017). Attribute weighted Naive Bayes for
remote sensing image classification based on cuckoo search algorithm. In 2017 International
Conference on Security, Pattern Analysis, and Cybernetics (SPAC) (pp. 169–174). IEEE.
10.1109/SPAC.2017.8304270
94. Chouhan, S. S., Kaul, A., Singh, U. P., & Jain, S. (2018). Bacterial foraging optimization
based radial basis function neural network (BRBFNN) for identification and classification of
plant leaf diseases: An automatic approach towards plant pathology. IEEE Access, 6, 8852–
8863.
95. Omran, M. G., Engelbrecht, A. P., & Salman, A. (2004). Image classification using
particle swarm optimization. In K. Tan, M. Lim, X. Yao, & L. Wang (Eds.),
Recent Advances in Simulated Evolution and Learning (pp. 347–365). World Scientific.
10.1142/9789812561794_0019
96. Omran, M., Engelbrecht, A. P., & Salman, A. (2005). Particle swarm optimization method
for image clustering. International Journal of Pattern Recognition and Artificial Intelligence,
19(03), 297–321.
97. Liu, X., Li, X., Liu, L., He, J., & Ai, B. (2008). An innovative method to classify remote-
sensing images using ant colony optimization. IEEE Transactions on Geoscience and Remote
Sensing, 46(12), 4198–4208.
98. Liu, X., Li, X., Peng, X., Li, H., & He, J. (2008). Swarm intelligence for classification of
remote sensing data. Science in China Series D: Earth Sciences, 51(1), 79–87.
99. Omkar, S., Kumar, M. M., Mudigere, D., & Muley, D. (2007). Urban satellite image
classification using biologically inspired techniques. In 2007 IEEE International Symposium
on Industrial Electronics (pp. 1767–1772). IEEE. 10.1109/ISIE.2007.4374873
100. Zhang, Y., & Wu, L. (2011). Crop classification by forward neural network with adaptive
chaotic particle swarm optimization. Sensors, 11(5), 4721–4743.
101. Owechko, Y., & Medasani, S. (2005). Cognitive swarms for rapid detection of objects and
associations in visual imagery. In Proceedings of 2005 IEEE Swarm Intelligence Symposium,
2005. SIS 2005. (pp. 420–423). IEEE.
102. Singh, N., Arya, R., & Agrawal, R. (2014). A novel approach to combine features for salient
object detection using constrained particle swarm optimization. Pattern Recognition, 47(4),
1731–1739.

103. Ugolotti, R., Nashed, Y. S., Mesejo, P., Ivekovič, Š., Mussi, L., & Cagnoni, S. (2013). Particle
swarm optimization and differential evolution for model-based object detection. Applied Soft
Computing, 13(6), 3092–3105.
104. Mussi, L., Cagnoni, S., & Daolio, F. (2009). GPU-based road sign detection using particle
swarm optimization. In 2009 Ninth International Conference on Intelligent Systems Design
and Applications (pp. 152–157). IEEE.
105. Maldonado, S., Acevedo, J., Lafuente, S., Fernández, A., & López-Ferreras, F. (2010). An
optimization on pictogram identification for the road-sign recognition task using SVMs.
Computer Vision and Image Understanding, 114(3), 373–383.
106. Tseng, C. C., Hsieh, J. G., & Jeng, J. H. (2009). Active contour model via multi-population
particle swarm optimization. Expert Systems with Applications, 36(3), 5348–5352.
107. Horng, M. H., Liou, R. J., & Wu, J. (2010). Parametric active contour model by using the
honey bee mating optimization. Expert Systems with Applications, 37(10), 7015–7025.
108. Maboudi, M., Amini, J., Hahn, M., & Saati, M. (2017). Object-based road extraction from
satellite images using ant colony optimization. International Journal of Remote Sensing,
38(1), 179–198.
109. Iwin, S., Sasikala, J., & Juliet, D. S. (2019). Optimized vessel detection in marine environment
using hybrid adaptive cuckoo search algorithm. Computers & Electrical Engineering, 78,
482–492.
110. Banharnsakun, A., & Tanathong, S. (2014). Object detection based on template matching
through use of best-so-far ABC. Computational Intelligence and Neuroscience, 2014.
10.1155/2014/919406
111. Chidambaram, C., & Lopes, H. S. (2009). A new approach for template matching in digital
images using an artificial bee colony algorithm. In 2009 World Congress on Nature & Bio-
logically Inspired Computing (NaBIC) (pp. 146–151). IEEE. 10.1109/NABIC.2009.5393631
112. Xu, C., & Duan, H. (2010). Artificial bee colony (ABC) optimized edge potential function
(EPF) approach to target recognition for low-altitude aircraft. Pattern Recognition Letters,
31(13), 1759–1772.
113. Zhang, X., Hu, W., Qu, W., & Maybank, S. (2010). Multiple object tracking via species-
based particle swarm optimization. IEEE Transactions on Circuits and Systems for Video
Technology, 20(11), 1590–1602.
114. Kobayashi, T., Nakagawa, K., Imae, J., & Zhai, G. (2007). Real time object tracking on
video image sequence using particle swarm optimization. In 2007 International Conference
on Control, Automation and Systems (pp. 1773–1778). IEEE. 10.1109/ICCAS.2007.4406632
115. Ramakoti, N., Vinay, A., & Jatoth, R. K. (2009). Particle swarm optimization aided Kalman
filter for object tracking. In 2009 International Conference on Advances in Computing,
Control, and Telecommunication Technologies (pp. 531–533). IEEE. 10.1109/ACT.2009.135
116. Walia, G. S., & Kapoor, R. (2014). Intelligent video target tracking using an evolutionary
particle filter based upon improved cuckoo search. Expert Systems with Applications, 41(14),
6315–6326.
117. Ljouad, T., Amine, A., & Rziza, M. (2014). A hybrid mobile object tracker based on the
modified cuckoo search algorithm and the Kalman filter. Pattern Recognition, 47(11), 3597–
3613.
118. Gao, M. L., Shen, J., Yin, L. J., Liu, W., Zou, G. F., Li, H. T., & Fu, G. X. (2016). A novel
visual tracking method using bat algorithm. Neurocomputing, 177, 612–619.
119. Gao, M. L., He, X. H., Luo, D. S., Jiang, J., & Teng, Q. Z. (2013). Object tracking using
firefly algorithm. IET Computer Vision, 7(4), 227–237. 10.1049/iet-cvi.2012.0207.
120. Kanan, H. R., & Faez, K. (2008). An improved feature selection method based on ant
colony optimization (ACO) evaluated on face recognition system. Applied Mathematics and
Computation, 205(2), 716–725.
121. Kotia, J., Bharti, R., Kotwal, A., & Mangrulkar, R. (2020). Application of firefly algorithm
for face recognition. In Dey, N. (Ed.), Applications of firefly algorithm and its variants (pp.
147–171). Springer.

122. Ramadan, R. M., & Abdel-Kader, R. F. (2009). Face recognition using particle swarm
optimization-based selected features. International Journal of Signal Processing, Image
Processing and Pattern Recognition, 2(2), 51–65.
123. Krisshna, N. A., Deepak, V. K., Manikantan, K., & Ramachandran, S. (2014). Face recogni-
tion using transform domain feature extraction and PSO-based feature selection. Applied Soft
Computing, 22, 141–161.
124. Tiwari, V. (2012). Face recognition based on cuckoo search algorithm. Indian Journal of
Computer Science and Engineering, 3(3), 401–405.
125. Jakhar, R., Kaur, N., & Singh, R. (2011). Face recognition using bacteria foraging
optimization-based selected features. International Journal of Advanced Computer Science
and Applications, 1(3), 106–111.
126. Kumar, D. (2017). Feature selection for face recognition using DCT-PCA and bat algorithm.
International Journal of Information Technology, 9(4), 411–423.
127. Raghavendra, R., Dorizzi, B., Rao, A., & Kumar, G. H. (2011). Particle swarm optimization
based fusion of near infrared and visible images for improved face verification. Pattern
Recognition, 44(2), 401–411.
128. Yadav, D., Vatsa, M., Singh, R., & Tistarelli, M. (2013). Bacteria foraging fusion for face
recognition across age progression. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops (pp. 173–179). 10.1109/CVPRW.2013.33
129. Wei, J., Jian-Qi, Z., & Xiang, Z. (2011). Face recognition method based on support vector
machine and particle swarm optimization. Expert Systems with Applications, 38(4), 4390–
4393.
130. Panda, R., Naik, M. K., & Panigrahi, B. K. (2011). Face recognition using bacterial foraging
strategy. Swarm and Evolutionary Computation, 1(3), 138–146.
131. Chakrabarty, A., Jain, H., & Chatterjee, A. (2013). Volterra kernel based face recognition
using artificial bee colony optimization. Engineering Applications of Artificial Intelligence,
26(3), 1107–1114.
132. Lu, Y., Zeng, N., Liu, Y., & Zhang, N. (2015). A hybrid wavelet neural network and switching
particle swarm optimization algorithm for face direction recognition. Neurocomputing, 155,
219–224.
133. Naik, M. K., & Panda, R. (2016). A novel adaptive cuckoo search algorithm for intrinsic
discriminant analysis based face recognition. Applied Soft Computing, 38, 661–675.
134. Nebti, S., & Boukerram, A. (2017). Swarm intelligence inspired classifiers for facial
recognition. Swarm and Evolutionary Computation, 32, 150–166.
135. Sánchez, D., Melin, P., & Castillo, O. (2017). Optimization of modular granular neural
networks using a firefly algorithm for human recognition. Engineering Applications of
Artificial Intelligence, 64, 172–186.
136. Zhang, L., Mistry, K., Neoh, S. C., & Lim, C. P. (2016). Intelligent facial emotion recognition
using moth-firefly optimization. Knowledge-Based Systems, 111, 248–267.
137. Mistry, K., Zhang, L., Neoh, S. C., Lim, C. P., & Fielding, B. (2016). A micro-GA embedded
PSO feature selection approach to intelligent facial emotion recognition. IEEE Transactions
on Cybernetics, 47(6), 1496–1509.
138. Sikkandar, H., & Thiyagarajan, R. (2021). Deep learning based facial expression recognition
using improved cat swarm optimization. Journal of Ambient Intelligence and Humanized
Computing, 12(2), 3037–3053.
139. Sreedharan, N. P. N., Ganesan, B., Raveendran, R., Sarala, P., & Dennis, B. (2018). Grey
wolf optimisation-based feature selection and classification for facial emotion recognition.
IET Biometrics, 7(5), 490–499.
140. Mpiperis, I., Malassiotis, S., Petridis, V., & Strintzis, M. G. (2008). 3D facial expression
recognition using swarm intelligence. In 2008 IEEE International Conference on Acoustics,
Speech and Signal Processing (pp. 2133–2136). IEEE. 10.1109/ICASSP.2008.4518064
141. Padeleris, P., Zabulis, X., & Argyros, A. A. (2012). Head pose estimation on depth data based
on particle swarm optimization. In 2012 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition Workshops (pp. 42–49). IEEE.

142. Meyer, G. P., Gupta, S., Frosio, I., Reddy, D., & Kautz, J. (2015). Robust model-based 3D
head pose estimation. In Proceedings of the IEEE International Conference on Computer
Vision (pp. 3649–3657).
143. Zhang, Y., & Wu, L. (2011). Face pose estimation by chaotic artificial bee colony. Interna-
tional Journal of Digital Content Technology and its Applications, 5(2), 55–63.
144. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2010). Markerless and efficient 26-DOF
hand pose recovery. In Asian Conference on Computer Vision (pp. 744–757). Springer.
145. Ye, Q., Yuan, S., & Kim, T. K. (2016). Spatial attention deep net with partial PSO
for hierarchical hybrid hand pose estimation. In B. Leibe, J. Matas, N. Sebe, & M.
Welling (Eds.), European Conference on Computer Vision (pp. 346–361). Springer.
10.1007/978-3-319-46484-8_21
146. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2011). Full DOF tracking of a hand inter-
acting with an object by modeling occlusions and physical constraints. In 2011 International
Conference on Computer Vision (pp. 2088–2095). IEEE. 10.1109/ICCV.2011.6126483
147. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2011). Efficient model-based 3D tracking of
hand articulations using Kinect. In J. Hoey, S. McKenna, & E. Trucco (Eds.), British Machine
Vision Conference (Vol. 1, pp. 2088–2095). 10.5244/C.25.101
148. Ivekovič, Š., Trucco, E., & Petillot, Y. R. (2008). Human body pose estimation with particle
swarm optimisation. Evolutionary Computation, 16(4), 509–528.
149. Akhtar, S., Ahmad, A., & Abdel-Rahman, E. M. (2012). A metaheuristic bat-inspired
algorithm for full body human pose estimation. In 2012 Ninth Conference on Computer and
Robot Vision (pp. 369–375). IEEE. 10.1109/CRV.2012.55
150. Robertson, C., & Trucco, E. (2006). Human body posture via hierarchical evolutionary
optimization. In British Machine Vision Conference (Vol. 6, pp. 111–118). 10.5244/C.20.102
151. Balaji, S., Karthikeyan, S., & Manikandan, R. (2021). Object detection using metaheuristic
algorithm for volley ball sports application. Journal of Ambient Intelligence and Humanized
Computing, 12(1), 375–385.
152. John, V., Trucco, E., & Ivekovic, S. (2010). Markerless human articulated tracking using
hierarchical particle swarm optimisation. Image and Vision Computing, 28(11), 1530–1547.
153. Zhang, X., Hu, W., Wang, X., Kong, Y., Xie, N., Wang, H., Ling, H., & Maybank, S. (2010). A
swarm intelligence based searching strategy for articulated 3D human body tracking. In 2010
IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops
(pp. 45–50). IEEE.
154. Thida, M., Eng, H. L., Monekosso, D. N., & Remagnino, P. (2013). A particle swarm
optimisation algorithm with interactive swarms for tracking multiple targets. Applied Soft
Computing, 13(6), 3106–3117.
155. Hancer, E., Ozturk, C., & Karaboga, D. (2013). Extraction of brain tumors from MRI
images with artificial bee colony based segmentation methodology. In 2013 8th International
Conference on Electrical and Electronics Engineering (ELECO) (pp. 516–520). IEEE.
0.1109/ELECO.2013.6713896
156. Taherdangkoo, M., Yazdi, M., & Rezvani, M. (2010). Segmentation of MR brain images using
FCM improved by artificial bee colony (ABC) algorithm. In Proceedings of the 10th IEEE
International Conference on Information Technology and Applications in Biomedicine (pp.
1–5). IEEE. 10.1109/ITAB.2010.5687803
157. Menon, N., & Ramakrishnan, R. (2015). Brain tumor segmentation in MRI images
using unsupervised artificial bee colony algorithm and FCM clustering. In 2015 Interna-
tional Conference on Communications and Signal Processing (ICCSP) (pp. 6–9). IEEE.
10.1109/ICCSP.2015.7322635
158. Mostafa, A., Fouad, A., Abd Elfattah, M., Hassanien, A. E., Hefny, H., Zhu, S. Y., & Schaefer,
G. (2015). CT liver segmentation using artificial bee colony optimisation. Procedia Computer
Science, 60, 1622–1630.
159. Pereira, C., Gonçalves, L., & Ferreira, M. (2015). Exudate segmentation in fundus images
using an ant colony optimization approach. Information Sciences, 296, 14–24.
160. Huang, P., Cao, H., & Luo, S. (2008). An artificial ant colonies approach to medical image
segmentation. Computer Methods and Programs in Biomedicine, 92(3), 267–273.
Swarm-Based Methods Applied to Computer Vision 355

161. Lee, M. E., Kim, S. H., Cho, W. H., Park, S. Y., & Lim, J. S. (2009). Segmentation of brain MR
images using an ant colony optimization algorithm. In 2009 Ninth IEEE International Con-
ference on Bioinformatics and Bioengineering (pp. 366–369). IEEE. 10.1109/BIBE.2009.58
162. Karnan, M., & Logheshwari, T. (2010). Improved implementation of brain MRI image seg-
mentation using ant colony system. In 2010 IEEE International Conference on Computational
Intelligence and Computing Research (pp. 1–4) IEEE. 10.1109/ICCIC.2010.5705897
163. Chakraborty, S., Chatterjee, S., Dey, N., Ashour, A. S., Ashour, A. S., Shi, F., & Mali, K.
(2017). Modified cuckoo search algorithm in microscopic image segmentation of hippocam-
pus. Microscopy Research and Technique, 80(10), 1051–1072.
164. Ilunga-Mbuyamba, E., Cruz-Duarte, J. M., Avina-Cervantes, J. G., Correa-Cely, C. R.,
Lindner, D., & Chalopin, C. (2016). Active contours driven by cuckoo search strategy for
brain tumour images segmentation. Expert Systems with Applications, 56, 59–68.
165. Li, Y., Jiao, L., Shang, R., & Stolkin, R. (2015). Dynamic-context cooperative quantum-
behaved particle swarm optimization based on multilevel thresholding applied to medical
image segmentation. Information Sciences, 294, 408–422.
166. Mekhmoukh, A., & Mokrani, K. (2015). Improved fuzzy C-means based particle swarm
optimization (PSO) initialization and outlier rejection with level set methods for MR brain
image segmentation. Computer Methods and Programs in Biomedicine, 122(2), 266–281.
167. Kavitha, P., & Prabakaran, S. (2019). A novel hybrid segmentation method with particle
swarm optimization and fuzzy c-mean based on partitioning the image for detecting lung
cancer. International Journal of Engineering and Advanced Technology, 8(5), 1223–1227.
168. Mandal, D., Chatterjee, A., & Maitra, M. (2014). Robust medical image segmentation
using particle swarm optimization aided level set based global fitting energy active contour
approach. Engineering Applications of Artificial Intelligence, 35, 199–214.
169. Wen, L., Wang, X., Wu, Z., Zhou, M., & Jin, J. S. (2015). A novel statistical cerebrovascular
segmentation algorithm with particle swarm optimization. Neurocomputing, 148, 569–577.
170. Parsian, A., Ramezani, M., & Ghadimi, N. (2017). A hybrid neural network-gray wolf
optimization algorithm for melanoma detection. Biomedical Research, 28(8), 3408–3411.
171. Wang, R., Zhou, Y., Zhao, C., & Wu, H. (2015). A hybrid flower pollination algorithm based
modified randomized location for multi-threshold medical image segmentation. Bio-medical
Materials and Engineering, 26(s1), S1345–S1351. 10.3233/BME-151432
172. Alagarsamy, S., Kamatchi, K., Govindaraj, V., Zhang, Y. D., & Thiyagarajan, A. (2019).
Multi-channeled MR brain image segmentation: A new automated approach combining bat
and clustering technique for better identification of heterogeneous tumors. Biocybernetics and
Biomedical Engineering, 39(4), 1005–1035.
173. Rajinikanth, V., Raja, N. S. M., & Kamalanand, K. (2017). Firefly algorithm assisted
segmentation of tumor from brain MRI using Tsallis function and Markov random field.
Journal of Control Engineering and Applied Informatics, 19(3), 97–106.
174. Agrawal, V., & Chandra, S. (2015). Feature selection using artificial bee colony algorithm
for medical image classification. In 2015 Eighth International Conference on Contemporary
Computing (IC3) (pp. 171–176). IEEE. 10.1109/IC3.2015.7346674
175. Ahmed, H. M., Youssef, B. A., Elkorany, A. S., Saleeb, A. A., & Abd El-Samie, F. (2018).
Hybrid gray wolf optimizer–artificial neural network classification approach for magnetic
resonance brain images. Applied Optics, 57(7), B25–B31.
176. Zhang, Y., Wang, S., Ji, G., & Dong, Z. (2013). An MR brain images classifier system via
particle swarm optimization and kernel support vector machine. The Scientific World Journal,
2013. 10.1155/2013/130134
177. Dheeba, J., Singh, N. A., & Selvi, S. T. (2014). Computer-aided detection of breast cancer on
mammograms: A swarm intelligence optimized wavelet neural network approach. Journal of
Biomedical Informatics, 49, 45–52.
178. Senapati, M. R., & Dash, P. K. (2013). Local linear wavelet neural network based breast tumor
classification using firefly algorithm. Neural Computing and Applications, 22(7), 1591–1598.
179. Tan, T. Y., Zhang, L., Neoh, S. C., & Lim, C. P. (2018). Intelligent skin cancer detection using
enhanced particle swarm optimization. Knowledge-based Systems, 158, 118–135.
356 M.-L. Pérez-Delgado

180. Jothi, G., & Hannah Inbarani, H. (2016). Hybrid tolerance rough set–firefly based supervised
feature selection for MRI brain tumor image classification. Applied Soft Computing, 46, 639–
651.
181. Santhi, S., & Bhaskaran, V. (2014). Modified artificial bee colony based feature selection: A
new method in the application of mammogram image classification. International Journal of
Scientific and Technology Research, 3(6), 1664–1667.
182. Shankar, K., Lakshmanaprabu, S., Khanna, A., Tanwar, S., Rodrigues, J. J., & Roy, N.
R. (2019). Alzheimer detection using group grey wolf optimization based features with
convolutional classifier. Computers & Electrical Engineering, 77, 230–243.
183. Sahoo, A., & Chandra, S. (2017). Multi-objective grey wolf optimizer for improved cervix
lesion classification. Applied Soft Computing, 52, 64–80.
184. Kaur, T., Saini, B. S., & Gupta, S. (2018). A novel feature selection method for brain tumor
MR image classification based on the Fisher criterion and parameter-free bat optimization.
Neural Computing and Applications, 29(8), 193–206.
185. Sudha, M., & Selvarajan, S. (2016). Feature selection based on enhanced cuckoo search for
breast cancer classification in mammogram image. Circuits and Systems, 7(04), 327–338.
186. Kavitha, C., & Chellamuthu, C. (2014). Medical image fusion based on hybrid intelligence.
Applied Soft Computing, 20, 83–94.
187. Wachowiak, M. P., Smolíková, R., Zheng, Y., Zurada, J. M., & Elmaghraby, A. S. (2004). An
approach to multimodal biomedical image registration utilizing particle swarm optimization.
IEEE Transactions on Evolutionary Computation, 8(3), 289–301.
188. Talbi, H., & Batouche, M. (2004). Hybrid particle swarm with differential evolution for mul-
timodal image registration. In 2004 IEEE International Conference on Industrial Technology,
2004. IEEE ICIT’04. (Vol. 3, pp. 1567–1572). IEEE. 10.1109/ICIT.2004.1490800
189. Rundo, L., Tangherloni, A., Militello, C., Gilardi, M. C., & Mauri, G. (2016). Mul-
timodal medical image registration using particle swarm optimization: A review. In
2016 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1–8). IEEE.
10.1109/SSCI.2016.7850261
190. Daniel, E., Anitha, J., Kamaleshwaran, K., & Rani, I. (2017). Optimum spectrum mask
based medical image fusion using gray wolf optimization. Biomedical Signal Processing and
Control, 34, 36–43.
191. Parvathy, V. S., & Pothiraj, S. (2020). Multi-modality medical image fusion using hybridiza-
tion of binary crow search optimization. Health Care Management Science, 23(4), 661–669.
Index

A
Affective computing, v, 127–148
Audio, 4–8, 14–23, 26, 27, 127–134, 136–147
Auto-encoder, 253–271
Automatic colorization, vi, 253–271

B
Bio inspired CNN, vi

C
Capsule network, 203–230
Chebyshev polynomial approximation, 282–285, 289, 290, 292
Chest X-ray images, vi, 182–187, 195, 199, 205, 211, 219
Classification, 5, 37, 81, 103, 136, 153, 182, 207, 226, 243, 273, 295, 334
Combinatorial optimization, 311, 318
Computer vision, 8, 9, 29, 35, 61–78, 81–99, 186, 188, 223, 230, 237, 243, 253, 303, 307, 308, 310, 331–346
Confusion matrix (CM), 12, 118, 123, 213, 239
Content based image retrieval, 151–177
Convolutional neural network (CNN), vi, 8, 10, 13–19, 22, 23, 27, 28, 61, 63, 69, 97, 98, 141, 144–146, 152–154, 161, 181–199, 204, 209–211, 214–217, 243–257, 259, 261, 263, 295–304
COVID-19, vi, 98, 181–199, 203–220
Cricket video summarization, 7, 14, 19, 22
Cuckoo search approach, 334, 342

D
Dataset, 1, 68, 104, 128, 153, 184, 204, 231, 243, 254, 274, 296, 309
Deep features for CBIR, 151–177
Deep learning, v, vi, 1, 5, 10, 11, 13, 61, 63, 64, 67, 69–70, 72, 76, 78, 98, 127–148, 152, 153, 175, 181–183, 185–191, 203–207, 210, 214, 223–240, 243–250, 255, 259, 263, 273–292, 296, 297, 303
Deep neural network (DNN), 23, 134, 138, 139, 141, 146, 158, 182, 184, 185, 189, 199, 224, 226, 231, 232, 234, 237, 239, 245, 254
Diabetic, vi, 295–304
Differential evolution, vi, 307–328
Digital image processing, 181
Dimensionality reduction, 37, 274, 278–282, 292, 299
3D point cloud processing, vi, 243–250
Dynamic mode decomposition (DMD), vi, 274, 275, 279, 280, 282–287, 289–292

E
Emotions, 4, 104, 127–132, 134–138, 141–148, 343
Entropy, 11, 37–42, 44, 45, 47–52, 55, 57, 58, 104, 105, 110, 215, 217, 247, 309, 319, 321, 327, 338, 339, 345

F
Feature, 4, 36, 61, 81, 103, 128, 151, 182, 204, 224, 244, 254, 273, 296, 308, 336
Feature descriptor, 61, 64, 66, 67, 152, 156
Fundus, vi, 296, 345

G
Gray level co-occurrence matrix, 36, 309

H
Hamming distance, vi, 151–177
Heuristics, 151, 182, 191, 307, 309, 332
Histogram of oriented gradients (HOG), v, 9, 16, 17, 35–58, 61–64, 66, 67, 71, 73–76, 78, 133, 224–226, 230, 231, 234, 237, 239
Human machine interface (HMI), 223, 237
Hyper spectral image classification, 273–292

I
Image enhancement, vi, 307–328
Image processing, 39, 62–64, 71, 93, 101, 181, 189, 205, 257, 274, 297, 307–309, 313, 318–324, 332, 336, 339, 344–346
Image retrieval, v, 81–83, 85–87, 90, 98, 151–177, 341

K
Key frames, 1, 4, 7, 16, 18, 20, 23, 36–38, 42–54, 56–58
K-means clustering, vi, 18, 22, 69, 156, 338

M
Machine learning (ML), v, 1–29, 36, 61–78, 97, 103–123, 127–148, 183, 184, 188, 189, 205, 210, 231, 253, 271, 274, 297, 302
Multimodal, v, 14, 15, 23, 26, 127–148, 335

N
Nearest neighborhood search, vi, 226, 246

O
Object recognition, v, 81, 83–85, 90, 95, 342
Octree, 245–246, 248
OpenSet Domain adaptation, 274, 291

P
Physiological, 128–131, 134, 135, 137, 139, 140, 143, 144, 147
Population-based methods, 309, 310, 312
Pothole detection, 61–64, 69, 71, 76, 77, 78, 97
Prediction, vi, 16, 68, 70, 72, 76, 77, 78, 113, 121, 142–146, 181, 183, 196, 205, 211, 213, 214, 217, 218, 254, 263, 268–271, 301

R
Radiometric correlation, 35–58
Retinopathy, vi, 295–304

S
Semantic segmentation, 27, 243–250
Shape descriptor, 85–88, 98
Shape feature extraction, v, 81–99
Shot boundary, 7, 13, 16, 17, 20, 22, 24–26, 35–58
Sports video classification, 19, 22
Sports video highlights, 3
Support vector machine (SVM), 5, 7, 10, 14, 16–22, 24, 26, 36, 63, 68, 72–76, 104, 105, 113–115, 119–122, 138, 139, 141–145, 182, 207, 296, 299–301, 303, 346
Swarm-based methods, 331–346

T
Text, 6, 7, 15, 18, 19, 23, 25, 26, 55, 72, 83, 104, 130–132, 135, 136, 138, 140–143, 145–147, 151
Texture image, v, 103–123
Travelling salesman (TSP) problem, 307, 308, 310–312, 314, 315, 320

V
Video, 1, 35, 62, 104, 127, 230, 340
Video summarization, v, 1–29

X
X Ray, vi, 181–199, 203–208, 211, 213, 215, 218, 219, 344
