Gyanendra K. Verma
Badal Soni
Salah Bourennane
Alexandre C. B. Ramos Editors
Data Science
Theory, Algorithms, and Applications
Transactions on Computer Systems
and Networks
Series Editor
Amlan Chakrabarti, Director and Professor, A. K. Choudhury School of
Information Technology, Kolkata, West Bengal, India
Transactions on Computer Systems and Networks is a unique series that aims
to capture advances in the evolution of computer hardware and software systems
and progress in computer networks. Computing systems today span from miniature
IoT nodes and embedded computing systems to large-scale cloud infrastructures,
which necessitates developing systems architecture, storage infrastructure, and
process management that work at various scales. Present-day networking
technologies provide pervasive global coverage and enable a multitude of
transformative technologies. The new landscape of computing comprises self-aware
autonomous systems built upon a software-hardware collaborative framework. These
systems are designed to execute critical and non-critical tasks involving a
variety of processing resources such as multi-core CPUs, reconfigurable
hardware, GPUs, and TPUs, which are managed through virtualisation, real-time
process management, and fault tolerance. While AI, machine learning, and deep
learning tasks are predominantly increasing in the application space, computing
systems research aims toward efficient means of data processing, memory
management, real-time task scheduling, and scalable, secure, and energy-aware
computing. The paradigm of computer networks also extends its support to this
evolving application scenario through various advanced protocols, architectures,
and services. This series aims to present leading works on advances in theory,
design, behaviour, and applications in computing systems and networks.
The Series accepts research monographs, introductory and advanced textbooks,
professional books, reference works, and select conference proceedings.
Data Science
Theory, Algorithms, and Applications
Editors
Gyanendra K. Verma
Department of Computer Engineering
National Institute of Technology Kurukshetra
Kurukshetra, India

Badal Soni
Department of Computer Science and Engineering
National Institute of Technology Silchar
Silchar, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
We dedicate this work to all those who directly
or indirectly contributed to its
accomplishment.
Preface
Digital information influences our everyday lives in various ways. Data science
provides us with tools and techniques to comprehend and analyze data. It is one
of the fastest-growing multidisciplinary fields, dealing with the acquisition,
analysis, integration, modeling, visualization, and interaction of large amounts
of data.

Currently, every sector of the economy produces a huge amount of data in
unstructured formats. Such data is available from various sources like web
services, databases, and online repositories; however, preprocessing it and
extracting meaningful intelligence from it is a challenging task. Artificial
intelligence plays a pivotal role in the analysis of this data.

With the evolution of artificial intelligence, it has become possible to analyze
and interpret information in real time. Deep learning models are widely used in
the analysis of big data for various applications, particularly in the area of
image processing.

This book aims to develop an understanding of data science theory and concepts,
and of data modeling using various machine learning algorithms, for a wide range
of real-world applications. In addition to presenting basic principles of data
processing, the book teaches standard models and algorithms for data analysis.
Acknowledgements
We are thankful to all the contributors who have generously given time and material
to this book. We would also like to extend our appreciation to those who have
continuously inspired us.
We are extremely thankful to the reviewers, who have carried out the most important
and critical part of any technical book: the evaluation of each of the submitted
chapters assigned to them.
We also express our sincere gratitude toward our publication partner, Springer,
especially to Ms. Kamiya Khatter and the Springer book production team for
continuous support and guidance in completing this book project.
Thank you.
Introduction
This book aims to provide readers with an understanding of data science, its
architectures, and its applications in various domains. Data science is helpful
in the extraction of meaningful information from unstructured data; its major
aspects are data modeling, analysis, and visualization. This book covers major
models, algorithms, and prominent applications of data science for solving
real-world problems. By the end of the book, we hope that our readers will have
an understanding of its concepts, different approaches, and models, and
familiarity with the implementation of data science tools and libraries.
Artificial intelligence has had a major impact on research and has raised the
performance bar substantially in many standard evaluations. Moreover, new
challenges can be tackled using artificial intelligence in the decision-making
process. However, it is very difficult to comprehend, let alone guide, the
process of learning in deep learning. There is an air of uncertainty about
exactly what and how these models learn, and this book is an effort to fill
those gaps.
Target Audience
The book is divided into three parts comprising a total of 27 chapters. Parts, distinct
groups of chapters, as well as single chapters are meant to be fairly independent
and also self-contained, and the reader is encouraged to study only relevant parts or
chapters. This book is intended for a broad readership. The first part provides the
theory and concepts of learning. Thus, this part addresses readers wishing to gain an
overview of learning frameworks. Subsequent parts delve deeper into research topics
and are aimed at the more advanced reader, in particular graduate and PhD students
as well as junior researchers. The target audience of this book includes academicians,
professionals, researchers, and students at engineering and medical institutions
working in the areas of data science and artificial intelligence.
Book Organization
This book is organized into three parts. Part I includes eight chapters that deal with
the theory and concepts of data science, Part II deals with data design and analysis,
and finally, Part III is based on the major applications of data science. This book
contains invited as well as contributed chapters.
The first part of the book focuses exclusively on the fundamentals of data science.
The chapters in this part cover active learning and ensemble learning concepts,
along with language processing concepts.
Chapter 1 describes a general active learning framework that has been proposed for
network intrusion detection. The authors have experimented with different learning
and sampling strategies on the KDD Cup 1999 dataset. The results show that complex
learning models outperform relatively simple ones, and that uncertainty and entropy
sampling outperform random sampling. Chapter 2 describes a bagging classifier, an ensemble
learning approach for student outcome prediction by employing base and meta-
classifiers. Additionally, performance analysis of various classifiers has been carried
out by an oversampling approach using SMOTE and an undersampling approach
using spread sampling. Chapter 3 presents the patient’s medical data security via bi-
chaos bi-order Fourier transform. In this work, authors have used three techniques
for medical or clinical image encryption, i.e., FRFT, logistic map, and Arnold map.
The results suggest that the complex hybrid combination makes the system more
robust and secure from the different cryptographic attacks than these methods alone.
In Chap. 4, word-sense disambiguation (WSD) for the Nepali language is performed
using variants of the Lesk algorithm such as direct overlap, frequency-based scoring,
and frequency-based scoring after dropping of the target word. Performance analysis
based on the elimination of stop words, the number of senses, and context
window size has been carried out. Chapter 5 presents a performance analysis of
different branch prediction schemes incorporated in ARM big.LITTLE architecture.
The comparison of these branch predictors has been carried out based on
performance, power dissipation, conditional branch mispredictions, IPC, execution
time, power consumption, etc. The results show that TAGE-LSC and perceptron
achieve the highest accuracy among the simulated predictors. Chapter 6 presents a
global feature representation using a new architecture SEANet that has been built
over SENet. An aggregate block implemented after the SE block aids in global feature
representation and reducing the redundancies. SEANet has been found to outperform
ResNet and SENet on two benchmark datasets—CIFAR-10 and CIFAR-100.
The subsequent chapters in this part are devoted to analyzing images. Chapter 7
presents an improved super-resolution of a single image through an external
dictionary formation for training and a neighbor embedding technique for reconstruction.
The task of dictionary formation is carried out so as to contain maximum structural
variations and the minimal number of images. The reconstruction stage is carried
out by the selection of overlapping pixels of a particular location. In Chap. 8, single-
step image super-resolution and denoising of SAR images are proposed using the
generative adversarial networks (GANs) model. The model shows improvement in
VGG16 loss as it preserves relevant features and reduces noise from the image. The
quality of results produced by the proposed approach is compared with the two-step
upscaling and denoising model and the baseline method.
The second part of the book focuses on the models and algorithms for data sciences.
The deep learning models, discrete wavelet transforms, principal component anal-
ysis, SenLDA, color-based classification model, and gray-level co-occurrence matrix
(GLCM) are used to model real-world problems.
Chapter 9 explores a deep learning technique based on OCR-SSD for car detection
and tracking in images. It also presents a solution for real-time license plate
recognition on a quadcopter in autonomous flight. Chapter 10 describes an algorithm for
gender identification based on biometric palm print using binarized statistical image
features. The filter size is varied with a fixed length of 8 bits to capture information
from the ROI palm prints. The proposed method outperforms baseline approaches
with an accuracy of 98%. Chapter 11 describes a Sudoku puzzle recognition and
solution study. Puzzle recognition is carried out using a deep belief network for feature
extraction. The puzzle solution is given by serialization of two approaches—parallel
rule-based methods and ant colony optimization. Chapter 12 describes a novel profile
generation approach for human action recognition. A DWT- and PCA-based method is proposed to detect
energy variation for feature extraction in video frames. The proposed method is
applied to various existing classifiers and tested on Weizmann’s dataset. The results
outperform baselines like the MACH filter.
The subsequent chapters in this part are devoted to more research-oriented models
and algorithms. Chapter 13 presents a novel filter and color-based classification
model to assess the ripeness of tobacco leaves for harvesting. The ripeness detection
is performed by a spot detection approach using a first-order edge extractor and a
second-order high-pass filtering. A simple thresholding classifier is then proposed
for the classification task. Chapter 14 proposes an automatic deep learning
framework for breast cancer detection and classification from hematoxylin and
eosin (H&E)-stained breast histopathology images, with 80.4% accuracy, to
supplement the analysis of medical professionals and prevent false negatives. Experimental
results yield that the proposed architecture provides better classification results as
compared to benchmark methods. Chapter 15 specifies a technique for indoor flying
of autonomous drones using image processing and neural networks. The route for
the drone is determined through the location of the detected object in the captured
image. The first detection technique relies on image-based filters, while the second
technique focuses on the use of CNN to replicate a real environment. Chapter 16
describes the use of a gray-level co-occurrence matrix (GLCM) for feature detection
in SAR images. The features detected by GLCM in SAR images find wide application,
as they identify various terrain types such as water, urban areas, and forests,
and any changes in these areas.
The third part of the book covers the major applications of data sciences in various
fields like biometrics, robotics, medical imaging, affective computing, security, etc.
Chapter 17 deals with signature verification using Galois field operator. The
features are obtained by building a normalized cumulative histogram. Offline
signature verification is also implemented using the K-NN classifier. Chapter 18 details a
face recognition approach in videos using 3D residual networks and comparing the
accuracy for different depths of residual networks. A CVBL video dataset has been
developed for the purpose of experimentation. The proposed approach achieves the
highest accuracy of 97% with DenseNets on the CVBL dataset. Chapter 19 presents a
robotic system in which microcontroller units (MCUs) with auto firmware communicate
with the fog layer through a smart edge node. The robot employs approaches such as
simultaneous localization and mapping (SLAM), other path-finding algorithms, and IR
sensors for obstacle detection. ML techniques and FastAi aid in the classification
of the dataset. Chapter 20 describes
an automatic tumor identification approach to classify MRIs of the brain. An advanced
CNN model consisting of convolution and a dense layer is employed to correctly
classify the brain tumors. The results exhibit the proposed model’s effectiveness in
brain tumor image classification. Chapter 21 presents a vision-based sensor
mechanism for lane detection in IVS. The lane markings on a structured road are
detected using image processing techniques such as edge detection and Hough space
transformation on KITTI data. Qualitative and quantitative analysis shows
satisfactory results. In Chapter 22, a deep convolutional neural network (DCNN)
is proposed for micro-expression recognition, as DCNNs have established their
presence in different image processing applications. CASME-II, a benchmark
database for micro-expression recognition, has been used for the experiments.
The experimental results reveal that the CNN-based variants give correct results
of 90% and 88% for four and six classes, respectively, which is beyond regular
methods.
In Chapter 23, the proposed semantic classification model intends to employ
modern embedding and aggregating methods which considerably enhance feature
discriminability and boost the performance of CNNs. The performance of this
framework is exhaustively tested across a wide dataset. The intuitive and robust systems
that use these techniques play a vital role in various sectors like security, military,
Editors and Contributors
Salah Bourennane received his Ph.D. degree from Institut National Polytechnique
de Grenoble, France. Currently, he is a Full Professor at the Ecole Centrale Marseille,
France. He is the head of the Multidimensional Signal Processing Group of Fresnel
Institute. His research interests include statistical signal processing and remote sensing.
Part I
Theory and Concepts
Chapter 1
Active Learning for Network Intrusion
Detection
Amir Ziai
Abstract Network operators are generally aware of common attack vectors that they
defend against. For most networks, the vast majority of traffic is legitimate.
However, new attack vectors are continually designed and attempted by bad actors,
which bypass detection and go unnoticed due to low volume. One strategy for finding
such activity is to look for anomalous behavior. Investigating anomalous behavior
requires significant time and resources. Collecting a large number of labeled
examples for training supervised models is both prohibitively expensive and subject
to obsolescence as new attacks surface. A purely unsupervised methodology is ideal;
however, research has shown that even a very small number of labeled examples can
significantly improve the quality of anomaly detection. A methodology that minimizes
the number of required labels while maximizing the quality of detection is desirable.
False positives in this context result in wasted effort or blockage of legitimate
traffic, and false negatives translate to undetected attacks. We propose a general
active learning framework and experiment with different choices of learners and
sampling strategies.
1.1 Introduction
Detecting anomalous activity is an active area of research in the security space. Tuor
et al. use an online anomaly detection method based on deep learning to detect
anomalies. This methodology is compared to traditional anomaly detection algorithms such
as isolation forest (IF) and a principal component analysis (PCA)-based approach
and found to be superior. However, no comparison is provided with semi-supervised
or active learning approaches which leverage a small amount of labeled data (Tuor
et al. 2017). The authors later propose another unsupervised methodology leveraging
a recurrent neural network (RNN) to ingest log-level event data as opposed to
aggregated data (Tuor et al. 2018). Pimentel et al. propose a generalized framework
for unsupervised anomaly detection. They argue that purely unsupervised anomaly
A. Ziai (B)
Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
e-mail: amirziai@stanford.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_1
Table 1.1 Prevalence and number of attacks for each of the 10 attack types

Label        Attacks   Prevalence   Prevalence (overall)   Records
smurf        280,790   0.742697     0.568377               378,068
neptune      107,201   0.524264     0.216997               204,479
back         2203      0.022145     0.004459               99,481
satan        1589      0.016072     0.003216               98,867
ipsweep      1247      0.012657     0.002524               98,525
portsweep    1040      0.010578     0.002105               98,318
warezclient  1020      0.010377     0.002065               98,298
teardrop     979       0.009964     0.001982               98,257
pod          264       0.002707     0.000534               97,542
nmap         231       0.002369     0.000468               97,509
1.2 Dataset
We have used the KDD Cup 1999 dataset which consists of about 500K records
representing network connections in a military environment. Each record is either
“normal” or one of 22 different types of intrusion such as smurf, IP sweep, and
teardrop. Out of these 22 categories, only 10 have at least 100 occurrences, and the
rest were removed. Each record has 41 features including duration, protocol, and
bytes exchanged. Prevalence of attack types varies substantially with smurf being
the most pervasive at about 57% of total records and Nmap at less than 0.05% of
total records (Table 1.1).
We generated 10 separate datasets consisting of normal traffic and each of the attack
vectors. This way we can study the proposed approach over 10 different attack vectors
with varying prevalence and ease of detection. Each dataset is then split into train,
development, and test partitions with 80%, 10%, and 10% proportions. All algorithms
are trained on the train set and evaluated on the development set. The winning strategy
is tested on the test set to generate an unbiased estimate of generalization. Categorical
features are one-hot encoded, and missing values are filled with zero.
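The preprocessing and partitioning steps above can be sketched as follows. The column names and label column are assumptions for illustration, since the chapter does not list the KDD schema explicitly:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare(df, label_col="label", seed=0):
    """Split a KDD-style frame into 80/10/10 train/dev/test partitions,
    one-hot encode categorical features, and fill missing values with zero."""
    X = pd.get_dummies(df.drop(columns=[label_col])).fillna(0)
    y = df[label_col]
    # First carve off 20%, then split that half-and-half into dev and test.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    X_dev, X_test, y_dev, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```

Stratifying both splits keeps the (often tiny) attack prevalence comparable across partitions.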
1.3 Approach
Since labeled data is very hard to come by in this space, we have decided to treat this
problem as an active learning one. Therefore, the machine learning model receives a
subset of the labeled data. We will use the F1 score to capture the trade-off between
precision and recall:
F1 = 2PR / (P + R)    (1.1)

A detector can achieve high precision simply by flagging only the most obviously
malicious records; however, this usually comes at the cost of being overly
conservative and not catching anomalous activity that is indeed an intrusion.
Labeling effort is a major factor in this analysis and a dimension along which we
will define the upper and lower bounds of the quality of our detection systems. A
purely unsupervised approach would be ideal as there is no labeling involved. We
will use an isolation forest (Zhou et al. 2004) to establish our baseline. Isolation
forests (IFs) are widely, and very successfully, used for anomaly detection. An IF
consists of a number of isolation trees, each of which are constructed by selecting
random features to split and then selecting a random value to split on (random value
in the range of continuous variables or random value for categorical variables). Only a
small random subset of the data is used for growing the trees, and usually a maximum
allowable depth is enforced to curb computational cost. We have used 10 trees for
each IF. Intuitively, anomalous data points are easier to isolate with a smaller average
number of splits and therefore tend to be closer to the root. The average closeness
to the root is proportional to the anomaly score (i.e., the lower this score, the more
anomalous the data point).
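A rough sketch of this baseline, on synthetic stand-in data rather than KDD records (an assumption for illustration): scikit-learn's IsolationForest with 10 trees assigns lower scores to more anomalous points, matching the convention above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-ins: a dense "benign" cluster plus a handful of far-away
# points playing the role of rare attacks.
rng = np.random.RandomState(0)
normal = rng.normal(0, 1, size=(500, 2))
attacks = rng.normal(6, 1, size=(10, 2))
X = np.vstack([normal, attacks])

# 10 trees, as in the text; score_samples is lower for more anomalous points.
iforest = IsolationForest(n_estimators=10, random_state=0).fit(X)
scores = iforest.score_samples(X)
most_anomalous = np.argsort(scores)[:10]  # indices of the 10 lowest scores
```

On this toy data the injected outliers dominate the lowest-score indices, which is exactly the behavior the baseline relies on.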
A completely supervised approach would incur maximum cost as we will have
to label every data point. We have used a random forest classifier with 10 estimators
trained on the entire training dataset to establish the upper bound (i.e., Oracle). In
Table 1.3, the F1 scores are reported for evaluation on the development set:
The proposed approach starts with training a classifier on a small random subset of
the data (i.e., 1000 samples) and then continually queries a security analyst for the
next record to label. There is a maximum budget of 100 queries (Fig. 1.1).
This approach is highly flexible. The choice of classifier can range from logistic
regression all the way up to deep networks as well as any ensemble of those models.
Moreover, the hyper-parameters for the classifier can be tuned on every round of
training to improve the quality of predictions. The sampling strategy can range from
simply picking random records to using classifier uncertainty or other elaborate
schemes. Once a record is labeled, it is removed from the pool of unlabeled data and
placed into the labeled record database. We are assuming that labels are trustworthy
which may not necessarily be true. In other words, the analyst might make a mistake
in labeling or there may be low consensus among analysts around labeling. In the
presence of those issues, we would need to extend this approach to query multiple
analysts and to build the consensus of labels into the framework.
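The loop described above can be sketched roughly as below. The uncertainty-based query rule and retraining from scratch on every round are simplifying assumptions, and the held-back pool labels stand in for the analyst:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def active_learning_loop(X_seed, y_seed, X_pool, y_pool, budget=100,
                         make_learner=lambda: RandomForestClassifier(
                             n_estimators=10, random_state=0)):
    """Train on a small labeled seed, then repeatedly 'query the analyst'
    (here, the held-back labels in y_pool) for the most uncertain unlabeled
    record, retraining after each new label, up to a fixed budget."""
    X_lab, y_lab = list(X_seed), list(y_seed)
    pool_idx = list(range(len(X_pool)))
    model = make_learner().fit(np.asarray(X_lab), np.asarray(y_lab))
    for _ in range(min(budget, len(pool_idx))):
        # Uncertainty sampling: pick the pool point whose positive-class
        # probability is closest to 0.5.
        proba = model.predict_proba(np.asarray(X_pool)[pool_idx])[:, 1]
        pick = pool_idx[int(np.argmin(np.abs(proba - 0.5)))]
        pool_idx.remove(pick)
        X_lab.append(X_pool[pick])
        y_lab.append(y_pool[pick])
        model = make_learner().fit(np.asarray(X_lab), np.asarray(y_lab))
    return model
```

In a production setting the retraining step would be replaced by whatever learner and hyper-parameter tuning latency permits, as the text notes.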
1.4 Experiments
We used a logistic regression (LR) classifier with L2 penalty as well as a random forest
(RF) classifier with 10 estimators, Gini impurity for splitting criteria, and unlimited
depth for our choice of learners. We also chose three sampling strategies. First is
a random strategy that randomly selects a data point from the unlabeled pool. The
second option is uncertainty sampling that scores the entire database of unlabeled
data and then selects the data point with the highest uncertainty. The third option
is entropy sampling, which calculates the entropy over the positive and negative
Table 1.4 Effects of learner and sampling strategy on detection quality and latency

Learner  Sampling     F1 initial  F1 after 10  F1 after 50  F1 after 100  Train time (s)  Query time (s)
LR       Random       0.76±0.32   0.76±0.32    0.79±0.31    0.86±0.17     0.05±0.01       0.09±0.08
LR       Uncertainty  0.76±0.32   0.83±0.26    0.85±0.31    0.88±0.20     0.05±0.01       0.10±0.08
LR       Entropy      0.76±0.32   0.83±0.26    0.85±0.31    0.88±0.20     0.05±0.01       0.08±0.08
RF       Random       0.90±0.14   0.91±0.12    0.84±0.31    0.95±0.07     0.11±0.00       0.09±0.07
RF       Uncertainty  0.90±0.14   0.98±0.03    0.99±0.03    0.99±0.03     0.11±0.00       0.16±0.06
RF       Entropy      0.90±0.14   0.98±0.04    0.98±0.03    0.99±0.03     0.11±0.00       0.12±0.08
classes and selects the highest entropy data point. Ties are broken randomly for both
uncertainty and entropy sampling.
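A minimal sketch of the entropy strategy just described; for binary predictions it is equivalent to uncertainty sampling up to tie-breaking, since entropy is monotone in the distance of the positive-class probability from 0.5:

```python
import numpy as np

def entropy_pick(model, X_unlabeled, rng=None):
    """Score the unlabeled pool with the current learner and return the
    index of the highest-entropy point, breaking ties at random."""
    rng = rng or np.random.RandomState(0)
    # Clip to avoid log(0) for points the model is certain about.
    p = np.clip(model.predict_proba(X_unlabeled)[:, 1], 1e-12, 1 - 1e-12)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    ties = np.flatnonzero(entropy == entropy.max())
    return int(rng.choice(ties))
```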
Table 1.4 shows the F1 score immediately after the initial training (F1 initial)
followed by the F1 score after 10, 50, and 100 queries to the analyst across different
learners and sampling strategies aggregated over the 10 attack types:
Random forests are strictly superior to logistic regression from a detection
perspective, regardless of the sampling strategy. It is also clear that uncertainty and
entropy sampling are superior to random sampling which suggests that judiciously
sampling the unlabeled dataset can have a significant impact on the detection quality,
especially in the earlier queries (F1 goes from 0.90 to 0.98 with just 10 queries). It is
important to notice that the query time might become a bottleneck. In our examples,
the unlabeled pool of data is not very large but as this set grows these sampling
strategies have to scale accordingly. The good news is that scoring is embarrassingly
parallelizable.
Figure 1.2 depicts the evolution of detection quality as the system makes queries
to the analyst for an attack with high prevalence (i.e., the majority of traffic is an
attack):
The random forest learner combined with an entropy sampler can get to perfect
detection within 5 queries which suggests high data efficiency (Mussmann and Liang
2018). We will compare this to the Nmap attack with significantly lower prevalence
(i.e., less than 0.05% of the dataset is an attack) (Fig. 1.3):
We know from our Oracle evaluations that a random forest model can achieve
perfect detection for this attack type; however, we see that an entropy sampler is not
guaranteed to query the optimal sequence of data points. The fact that the prevalence
of attacks is very low means that the initial training dataset probably does not have a
representative set of positive labels that can be exploited by the model to generalize.
The failure of uncertainty sampling has been documented (Zhu et al. 2008), and
more elaborate schemes can be designed to exploit other information about the
unlabeled dataset that the sampling strategy is ignoring. To gain some intuition into
these deficiencies, we will unpack a step of entropy sampling for the Nmap attack.
Figure 1.4 compares (a) the relative feature importance after the initial training to (b)
the Oracle (Fig. 1.5):
The Oracle graph suggests the “src_bytes” is a feature that the model is highly
reliant upon for prediction. However, our initial training is not reflecting this; we will
compute the z-score for each of the positive labels in our development set:
z_{f_i} = |μ_{R,f_i} − μ_{W,f_i}| / σ_{R,f_i}    (1.2)

where μ_{R,f_i} is the average value of the true positives for feature i (i.e., f_i),
μ_{W,f_i} is the average value of the false positives or false negatives, and
σ_{R,f_i} is the standard deviation of the values in the case of true positives.
The higher this value is for a feature, the more our learner needs to know about it
to correct the discrepancy. However, we see that the next query made by the strategy
does not involve a decision around this fact. The score for “src_bytes” is an order
of magnitude larger than other features. The model continues to make uncertainty
queries staying oblivious to information about specific features that it needs to correct
for.
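The diagnostic of Eq. (1.2) can be sketched as follows. The partition of the development set into true positives (R) and mispredicted points (W) follows the definitions above; the small epsilon guarding against zero variance is our addition:

```python
import numpy as np

def feature_z_scores(X_dev, y_true, y_pred, eps=1e-12):
    """Eq. (1.2): for each feature, the gap between the mean over true
    positives (R) and the mean over mispredicted points (W, the false
    positives and false negatives), scaled by the true-positive standard
    deviation. Large values flag features the learner has not yet exploited."""
    tp = (y_true == 1) & (y_pred == 1)
    wrong = y_true != y_pred
    mu_r = X_dev[tp].mean(axis=0)
    mu_w = X_dev[wrong].mean(axis=0)
    sigma_r = X_dev[tp].std(axis=0) + eps  # guard against zero variance
    return np.abs(mu_r - mu_w) / sigma_r
```

A feature like "src_bytes", whose z-score is an order of magnitude larger than the rest, would stand out immediately in this vector.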
Fig. 1.4 Random forest feature importance for (a) initial training and (b) Oracle
The ensemble prediction is a weighted majority vote over the individual classifiers:

Prediction_Ensemble = I[ Σ_{e∈E} w_e · Prediction_e > (Σ_{e∈E} w_e) / 2 ]    (1.3)

where Prediction_e ∈ {0, 1} is the binary prediction associated with classifier
e ∈ E = {RF, GB, LR, IF} and w_e is the weight of the classifier in the ensemble.
The weights are proportional to the level of confidence we have in each of the
learners. We have added a gradient boosting classifier with 10 estimators.
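Instantiating Eq. (1.3) for an ensemble such as E = {RF, GB, LR, IF} takes only a few lines; the vote fires only when the weighted sum of positive votes exceeds half the total weight:

```python
import numpy as np

def ensemble_predict(predictions, weights):
    """Weighted majority vote of Eq. (1.3): each classifier e in the ensemble
    emits a binary vote; return 1 only when the weighted sum of votes
    exceeds half the total weight."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return int(weights @ predictions > weights.sum() / 2)
```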
Unfortunately, the results of this experiment suggest that this particular ensemble
is not adding any additional value. Figure 1.6 shows that at best the results match
that of random forest (a) and in the worst case they can be significantly worse (b):
The majority of the error associated with this ensemble approach relative to only
using random forests can be attributed to a high false negative rate. The other four
algorithms are in most cases conspiring to generate a negative class prediction which
overrides the positive prediction of the random forest.
Finally, we explore whether we can use an unsupervised method for finding the
most anomalous data points to query. If this methodology is successful, the sampling
strategy is decoupled from active learning and we can simply precompute and cache
the most anomalous data points for the analyst to label.
We compared a sampling strategy based on isolation forest with entropy sampling
(Table 1.5):
In both cases, we are using a random forest learner. The results suggest that
entropy sampling is superior, since it samples the most uncertain data points in
the context of the current learner rather than by a global notion of anomaly, which
is what the isolation forest provides.
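The decoupled strategy compared here can be sketched as a one-off ranking of the pool by isolation-forest score, which could then be precomputed and cached for the analyst; the data below is an illustrative stand-in:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def anomaly_query_order(X_pool, n_estimators=10, random_state=0):
    """Rank the unlabeled pool by isolation-forest anomaly score once,
    most anomalous first, independently of the current learner."""
    forest = IsolationForest(n_estimators=n_estimators,
                             random_state=random_state).fit(X_pool)
    # score_samples is lower for more anomalous points, so ascending
    # argsort puts the most anomalous records first.
    return np.argsort(forest.score_samples(X_pool))
```

Because the ranking never changes as labels arrive, it cannot adapt to what the learner is still uncertain about, which is the disadvantage the comparison above exposes.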
1.5 Conclusion
We have proposed a general active learning framework for network intrusion
detection. We experimented with different learners and observed that more complex
learners can achieve higher detection quality with significantly less labeling effort for most
attack types. We did not explore other complex models such as deep neural networks
and did not attempt to tune the hyper-parameters of our model. Since the bottleneck
associated with this task is the labeling effort, we can add model tuning while staying
within the acceptable latency requirements.
We then explored a few sampling strategies and discovered that uncertainty and
entropy sampling can have a significant benefit over unsupervised or random
sampling. However, we also realized that these strategies are not optimal, and we
can extend them to incorporate available information about the distribution of the
features for mispredicted data points. We attempted a semi-supervised approach called
label spreading that builds the affinity matrix over the normalized graph Laplacian
which can be used to create pseudo-labels for unlabeled data points (Zhou et al. 2004).
However, this methodology is very memory-intensive, and we could not successfully
train and evaluate it on all of the attack types.
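The label spreading step can be sketched with scikit-learn, whose `LabelSpreading` implements the Zhou et al. (2004) formulation over the normalized graph Laplacian; the dataset and kernel settings here are illustrative assumptions:

```python
# Sketch of label spreading (Zhou et al. 2004) for pseudo-labeling.
# scikit-learn's LabelSpreading builds the normalized graph Laplacian
# internally; -1 marks unlabeled points. Data are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=200, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1                # treat most points as unlabeled

# The knn kernel keeps the affinity matrix sparse; the dense rbf kernel
# is what makes this approach memory-intensive at scale.
spreader = LabelSpreading(kernel="knn", n_neighbors=7)
spreader.fit(X, y_partial)
pseudo_labels = spreader.transduction_   # inferred labels for all points
```

The memory cost noted above comes from the n-by-n affinity matrix, so the sparse `knn` kernel is typically the only viable choice on large intrusion datasets.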
14 A. Ziai
References
Mussmann S, Liang P (2018) On the relationship between data efficiency and error for uncertainty
sampling. arXiv preprint arXiv:1806.06123
Pimentel T, Monteiro M, Viana J, Veloso A, Ziviani N (2018) A generalized active learning approach
for unsupervised anomaly detection. arXiv preprint arXiv:1805.09411
Tuor A, Kaplan S, Hutchinson B, Nichols N, Robinson S (2017) Deep learning for unsupervised
insider threat detection in structured cybersecurity data streams. arXiv preprint arXiv:1710.00811
Tuor A, Baerwolf R, Knowles N, Hutchinson B, Nichols N, Jasper R (2018) Recurrent neural
network language models for open vocabulary event-level cyber anomaly detection. Workshops
at the thirty-second AAAI conference on artificial intelligence
Veeramachaneni K, Arnaldo I, Korrapati V, Bassias C, Li K (2016) AI2: training a big data machine
to defend. In: IEEE 2nd international conference on big data security on cloud (BigDataSecurity),
IEEE international conference on high performance and smart computing (HPSC), and IEEE
international conference on intelligent data and security (IDS), pp 49–54
Zainal A, Maarof MA, Shamsuddin SM (2009) Ensemble classifiers for network intrusion detection
system. J Inf Assur Secur 4(3):217–225
Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global
consistency. In: Advances in neural information processing systems, pp 321–328
Zhu J, Wang H, Yao T, Tsou BK (2008) Active learning with sampling by uncertainty and density
for word sense disambiguation and text classification. In: Proceedings of the 22nd international
conference on computational linguistics, vol 1, pp 1137–1144
Chapter 2
Educational Data Mining Using Base
(Individual) and Ensemble Learning
Approaches to Predict the Performance
of Students
2.1 Introduction
M. Ashraf (B)
School of CS and IT, Jain University, Bangalore 190006, India
Y. K. Salal · S. M. Abdullaev
Department of System Programming, South Ural State University, Chelyabinsk, Russia
e-mail: yasskhudheirsalal@gmail.com
S. M. Abdullaev
e-mail: abdullaevsm@susu.ru
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 15
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_2
In this study, we primarily applied four learning classifiers, namely j48, random
tree, naïve Bayes, and kNN, to the academic dataset. Thereafter, the academic
dataset was subjected to oversampling and undersampling methods to establish
whether they improve the prediction of student outcomes. Correspondingly, the
same procedure was applied to ensemble methodologies, including bagging and
boosting, to substantiate which type of learning classifier, base or meta, demonstrates
the more compelling results.
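The base-classifier-plus-resampling pipeline described above can be sketched as follows. This sketch substitutes scikit-learn equivalents for the classifiers named in the text (a decision tree for j48, an extra tree for random tree) and a naive random oversampler in place of SMOTE; the synthetic dataset is an illustrative assumption:

```python
# Sketch of the experimental pipeline: four base classifiers evaluated
# on an imbalanced dataset, before and after a simple random
# oversampling of the minority class (a stand-in for SMOTE).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=0)

classifiers = {
    "j48 (decision tree)": DecisionTreeClassifier(random_state=0),
    "random tree": ExtraTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(),
}

# Naive random oversampling: duplicate minority samples until balanced.
minority = np.where(y == 1)[0]
extra = np.random.default_rng(0).choice(
    minority, size=(y == 0).sum() - len(minority))
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

for name, clf in classifiers.items():
    before = cross_val_score(clf, X, y, cv=5).mean()
    after = cross_val_score(clf, X_bal, y_bal, cv=5).mean()
    print(f"{name}: {before:.3f} -> {after:.3f}")
```

SMOTE interpolates synthetic minority points rather than duplicating them, so real results will differ from this naive duplication, but the evaluation structure is the same.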
Table 2.1 portrays the outcomes of the diverse classifiers after running these
machine learning classifiers on the educational dataset. It is clear that naïve Bayes
achieved a notable prediction accuracy of 95.50% in classifying the actual
occurrences, an incorrect classification error of 4.45%, and the minimum relative
absolute error of 7.94%, in contrast to the remaining classifiers. The supplementary
measures associated with the learning algorithm, such as TP rate, FP rate, precision,
recall, F-measure, and ROC area, were also found significant. Conversely, random
tree produced a substantial classification accuracy of 90.03%, with 9.69%
incorrectly classified instances, a relative absolute error (RAE) of 15.46%, and
the supplementary parameters connected with the algorithm were found
In this subsection, bagging has been utilized with the various classifiers
highlighted in Table 2.4. After employing bagging, the prediction accuracy
demonstrated paramount improvement over the base learning mechanism. The
correctly classified rates in Table 2.4, when contrasted with the initial prediction
rates of the different classifiers in Table 2.1, show substantial improvement for
three learning algorithms: j48 (92.20% to 94.87%), random tree (90.30% to 94.76%),
and kNN (91.80% to 93.81%).
In addition, the incorrectly classified instances came down to a considerable level
for these classifiers, and as a consequence, the supplementary parameters, viz. TP
rate, FP rate, precision, recall, ROC area, and F-measure, also rendered admirable
results. However, naïve Bayes did not reveal any significant gain in prediction
accuracy with the bagging approach; moreover, the relative absolute error associated
with each meta classifier increased when combining the different classifiers.
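The bagging comparison above can be sketched in the same spirit; the estimator count, dataset, and choice of base learners here are illustrative assumptions:

```python
# Sketch of wrapping base learners in bagging, as in Table 2.4, and
# comparing cross-validated accuracy with and without the wrapper.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

for name, base in [("j48 (decision tree)", DecisionTreeClassifier(random_state=0)),
                   ("naive Bayes", GaussianNB())]:
    plain = cross_val_score(base, X, y, cv=5).mean()
    bagged = cross_val_score(
        BaggingClassifier(base, n_estimators=10, random_state=0),
        X, y, cv=5).mean()
    print(f"{name}: base {plain:.3f}, bagged {bagged:.3f}")
```

Bagging mainly reduces variance, which is why a high-variance learner such as a decision tree tends to benefit while a stable learner such as naïve Bayes often does not, consistent with the observation above.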
2.4 Conclusion
In this research study, the central focus has been the early prediction of student
outcomes using various individual (base) and meta classifiers, in order to provide
timely guidance to weak students. The individual learning algorithms employed on
the pedagogical data, including j48, random tree, naïve Bayes, and kNN,
demonstrated phenomenal prediction accuracy for students' final outcomes. Among
the base learning algorithms, naïve Bayes attained the highest accuracy of 95.50%.
Because the dataset in this investigation was imbalanced, which could otherwise
have culminated in inaccurate and biased outcomes, the academic dataset was
subjected to filtering approaches, namely the synthetic minority oversampling
technique (SMOTE) and spread subsampling.
In this study, a comparative analysis was conducted with base and meta learning
algorithms, followed by oversampling (SMOTE) and undersampling (spread
subsampling) techniques, to establish which classifiers are more precise and decisive
in generating predictions. The above-mentioned base learning algorithms were
subjected to the oversampling and undersampling methods. Naïve Bayes again
demonstrated a noteworthy improvement, reaching 97.15% with the oversampling
technique. With the undersampling technique, kNN showed an exceptional
prediction accuracy of 93.94%, the best among the base learning algorithms. In the
case of ensemble learning, bagging with naïve Bayes accomplished a convincing
accuracy of 95.32% in predicting the exact instances.
When the bagging algorithm was combined with the oversampling and
undersampling techniques, the ensembles built on j48 and naïve Bayes showed
significant accuracy and the least classification error (95.21% for bagging with j48
and 96.07% for bagging with naïve Bayes, respectively).
References
Ahmed ABED, Elaraby IS (2014) Data mining: a prediction for student’s performance using clas-
sification method. World J Comput Appl Technol 2(2):43–47
Ali KM, Pazzani MJ (1996) Error reduction through learning multiple descriptions. Mach Learn
24(3):173–202
Ashraf M et al (2017) Knowledge discovery in academia: a survey on related literature. Int J Adv
Res Comput Sci 8(1)
Ashraf M, Zaman M (2017) Tools and techniques in knowledge discovery in academia: a theoretical
discourse. Int J Data Min Emerg Technol 7(1):1–9
Ashraf M, Zaman M, Ahmed Muheet (2018a) Using ensemble StackingC method and base classi-
fiers to ameliorate prediction accuracy of pedagogical data. Proc Comput Sci 132:1021–1040
Ashraf M, Zaman M, Ahmed M (2018b) Using predictive modeling system and ensemble method
to ameliorate classification accuracy in EDM. Asian J Comput Sci Technol 7(2):44–47
Ashraf M, Zaman M, Ahmed M (2020) An intelligent prediction system for educational data mining
based on ensemble and filtering approaches. Proc Comput Sci 167:1471–1483
Ashraf M, Zaman M, Ahmed M (2018c) Performance analysis and different subject combinations:
an empirical and analytical discourse of educational data mining. In: 8th international conference
on cloud computing. IEEE, data science & engineering (confluence), p 2018
Ashraf M, Zaman M, Ahmed M (2019) To ameliorate classification accuracy using ensemble vote
approach and base classifiers. Emerging technologies in data mining and information security.
Springer, Singapore, pp 321-334
Bartlett P, Shawe-Taylor J (1999) Generalization performance of support vector machines and other
pattern classifiers. Advances in Kernel methods—support vector learning, pp 43–54
Brazdil P, Gama J, Henery B (1994) Characterizing the applicability of classification algorithms
using meta-level learning. In: European conference on machine learning. Springer, Berlin,
Heidelberg, pp 83–102
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. ICML 96:148–156
Bruzzone L, Cossu R, Vernazza G (2004) Detection of land-cover transitions by combining multidate
classifiers. Pattern Recogn Lett 25(13):1491–1500
Bühlmann P, Yu B (2003) Boosting with the L2 loss: regression and classification. J Am Stat Assoc
98(462):324–339
Dimitriadou E, Weingessel A, Hornik K (2003) A cluster ensembles framework, design and appli-
cation of hybrid intelligent systems
Geman S, Bienenstock E, Doursat R (1992) Neural networks and the bias/variance dilemma. Neural
Comput 4(1):1–58
Leigh W, Purvis R, Ragusa JM (2002) Forecasting the NYSE composite index with technical
analysis, pattern recognizer, neural network, and genetic algorithm: a case study in romantic
decision support. Decision Support Syst 32(4):361–377
Maimon O, Rokach L (2004) Ensemble of decision trees for mining manufacturing data sets. Mach
Eng 4(1–2):32–57
Mangiameli P, West D, Rampal R (2004) Model selection for medical diagnosis decision support
systems. Decision Support Syst 36(3):247–259
Mukesh K, Salal YK (2019) Systematic review of predicting student’s performance in academics.
Int J Eng Adv Techno 8(3): 54–61
Opitz D, Maclin R (1999) Popular ensemble methods: an empirical study. J Artif Intell Res 11:169–
198
Pfahringer B, Bensusan H, Giraud-Carrier CG (2000) Meta-Learning by land-marking various
learning algorithms. In: ICML, pp 743–750
Salal YK, Abdullaev SM (2020, December) Deep learning based ensemble approach to predict
student academic performance: case study. In: 2020 3rd International conference on intelligent
sustainable systems (ICISS) (pp 191–198). IEEE
Salal YK, Hussain M, Paraskevi T (2021) Student next assignment submission prediction using a
machine learning approach. Adv Autom II 729:383
Salzberg SL (1994) C4.5: programs for machine learning by J. Ross Quinlan. Mach Learn
16(3):235–240
Sidiq SJ, Zaman M, Ashraf M, Ahmed M (2017) An empirical comparison of supervised classifiers
for diabetic diagnosis. Int J Adv Res Comput Sci 8(1)
Tan AC, Gilbert D, Deville Y (2003) Multi-class protein fold classification using a new ensemble
machine learning approach. Genome Inform 14:206–217
Chapter 3
Patient’s Medical Data Security via Bi
Chaos Bi Order Fourier Transform
3.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 25
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_3
26 B. Ahuja and R. Doriya
random value generator, bit-plane decomposition, and permutation. In paper (Ali and
Ali 2020), a new medical image signcryption scheme is introduced that fulfills the
necessary security criteria for confidential medical information during its exchange.
The design of this latest scheme for medical data signcryption is derived from a
hybrid cryptographic combination: it uses a form of elliptic curve cryptography
configured with public key cryptography for private key encoding.
A chaos coding scheme for clinical image data is introduced in paper (Belazi et al. 2019).
A blend of chaotic and DNA computation is proposed, followed by a secret key
generator along with the permutation combination.
Among all classes of encryption methods, chaotic map-based techniques are the
most efficient, especially for digital image encryption. Researchers and
mathematicians have developed various chaotic maps, such as the logistic map,
Arnold map, Hénon map, and Tinkerbell map. Chaotic maps give prominent results
in security because of their sensitivity to initial conditions.
In this paper, we combine two chaotic maps to increase the complexity level and
blend them with the fractional Fourier transform of order m. The rest of the paper is
organized as follows: Sect. 3.2 describes the proposed method along with definitions
of the terms used; Sect. 3.3 presents the results and analysis; finally, the conclusion
is drawn in Sect. 3.4.
The FRFT is derived from the classical Fourier transform and has an order 'm.'
The extra parameter 'm' is significant in that it makes the FRFT more robust than
the classical transform, enhancing its range of applications (Ozaktas et al. 2001;
Tao et al. 2006). It is represented as:
x(u) = ∫_{−∞}^{+∞} x(t) K_γ(t, u) dt    (3.1)

x(t) = ∫_{−∞}^{+∞} x(u) K_{−γ}(u, t) du    (3.2)

where

γ = mπ/2    (3.3)
Let F^γ denote the operator corresponding to the FRFT of angle γ. Under this
notation, some of the important properties of the FRFT operator are listed below,
with the time-frequency plane shown in Fig. 3.2.
1. For γ = 0, i.e., m = 0, we get the identity operator: F^0 = F^4 = I
2. For γ = π/2, i.e., m = 1, we get the Fourier operator: F^1 = F
3. For γ = π, i.e., m = 2, we get the reflection operator: F^2 = FF
4. For γ = 3π/2, i.e., m = 3, we get the inverse Fourier operator: F^3 = FF^2
The FRFT computation involves the following steps:
1. A product by a chirp.
2. A Fourier transform.
3. Another product by a chirp.
4. A product by a complex amplitude factor.
The properties of the fractional Fourier transform are explained in Table 3.1.
Different parameters have been used for the performance evaluation of various
classes of the discrete fractional Fourier transform (DFRFT):
• Direct form of DFRFT
• Improved Sampling type DFRFT
• Linear Combination type DFRFT
• Eigen Vector Decomposition type DFRFT
Y_{αβ}(p, q) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} k_{αβ}(p, q; r, s) y(r, s) dr ds    (3.5)

where
A secure encryption system must normally have the following basic features:
(1) Able to convert the message into a random encrypted text or cipher;
(2) Extreme sensitivity towards the secret key.
The chaos system has the common features stated above, including pseudo-random
behavior, sensitivity to the initial state, and parametric sensitivity (Avasare and
Kelkar 2015). Many studies in recent years have therefore applied discrete chaotic
maps to cryptography; still, the numerous chaotic systems each have specific fields
of study fitting different circumstances. However, because of certain intrinsic image
characteristics, such as large data volumes and high correlation among pixels,
existing cryptographic algorithms alone are not sufficient for the realistic encryption
of images.
Chaos in data security is an unpredictable and seemingly irregular mechanism that
arises within dynamic nonlinear systems. The chaotic component is sensitive to the
initial condition, unstable and yet deterministic. Numerous chaotic functions are
used in encryption; here we use the logistic and Arnold cat maps.
The logistic map is a common chaotic function used to obtain a long key space for
enhanced security, as it increases randomness (Ahuja and Lodhi 2014). It is stated as:

y_{n+1} = u · y_n · (1 − y_n)    (3.7)
3.2.3 Algorithm
The proposed algorithm contains two processes: the sender's side process
(encryption algorithm) and the receiver's side process (decryption algorithm). The
algorithms are shown in Figs. 3.4 and 3.5.
Encryption Algorithm:
Step 1 At sender’s side, take medical image and apply Arnold and logistic chaotic
map to the image.
Step 2 Apply discrete fractional Fourier transform (Simulation is done with order
of parameter a = 0.77 and b = 0.77) as a secret key.
Step 3 This transformed image is an encrypted image.
Decryption Algorithm:
Step 1 At receiver’s side, apply inverse discrete fractional Fourier transform (Simu-
lation is done with order of parameter a = 0.77 and b = 0.77) to the encrypted
image.
Step 2 Remove logistic and apply inverse Arnold Cat map to get decrypted image.
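The chaotic-map stages of the algorithm above can be sketched in Python. The Arnold cat map scrambling and the logistic map (with u = 3.9 and y0 = 0.1, as in the text) are shown; the DFRFT step with orders a = b = 0.77 is omitted here, since no standard NumPy/SciPy routine provides it, and the image size and iteration count are illustrative assumptions:

```python
# Sketch of the chaotic-map stages: Arnold cat map pixel scrambling
# followed by a logistic-map keystream XOR. The DFRFT secret-key step
# of the paper is omitted (no standard NumPy/SciPy implementation).
import numpy as np

def arnold_cat(img, iterations=1):
    """Scramble a square image with the Arnold cat map (x,y) -> (x+y, x+2y) mod n."""
    n = img.shape[0]
    out = img.copy()
    for _ in range(iterations):
        x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        out = out[(x + y) % n, (x + 2 * y) % n]
    return out

def logistic_keystream(length, u=3.9, y0=0.1):
    """Byte keystream from the logistic map y_{n+1} = u*y_n*(1-y_n)."""
    y, stream = y0, np.empty(length, dtype=np.uint8)
    for i in range(length):
        y = u * y * (1 - y)
        stream[i] = int(y * 256) % 256
    return stream

# Illustrative 64x64 stand-in for a 512x512 medical image.
img = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)
scrambled = arnold_cat(img, iterations=5)
cipher = scrambled ^ logistic_keystream(scrambled.size).reshape(scrambled.shape)
```

Both stages are invertible: the XOR undoes itself with the same keystream, and the cat map matrix has determinant 1, so its inverse exists modulo the image size, which is what the decryption steps above rely on.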
This section contains two images (medical image 1 and medical image 2) for testing
purposes, each with a resolution of 512 × 512.
Software version: MATLAB 2016a. The parameters used in the proposed system
for the simulation are as follows: a = 0.77 and b = 0.77 for the FRFT, and u = 3.9
and y0 = 0.1 for the logistic map.
Figures 3.6, 3.7, and 3.8 show the computational outcomes of the MATLAB
simulation for input medical image 1: the encrypted image, the decrypted image,
and their histograms. Figures 3.9, 3.10, and 3.11 show the corresponding outcomes
for input medical image 2.
To test the efficacy of the system, we evaluate popular metrics such as PSNR,
MSE, SSIM, and the correlation coefficient (CC).
MSE: The mean square error is defined by the difference between corresponding
pixel values in the real image and the encrypted image (Salunke and Salunke 2016).
For reliable encryption, the mean square error should be as small as possible.
MSE = (1 / (M·N)) Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} [f(i, j) − f′(i, j)]²    (3.10)
SSIM: The structural similarity index is used to calculate the relation between an
original image and a reconstructed one. The SSIM is defined as (Horé and
Ziou 2010):
CC: The correlation coefficient of two neighboring pixels is another significant char-
acteristic of the image. It is for evaluating the degree of linear correlation between
two random variables (Zhang and Zhang 2014).
The adjacent pixels of a real image are correlated in the horizontal, vertical, and
diagonal directions. A strong association between adjacent pixels is expected for a
plain image, and a weak association between adjacent pixels is expected for cipher
images.
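The metrics discussed above can be sketched with NumPy (SSIM is omitted, as it needs a dedicated implementation such as the one in scikit-image); the images here are illustrative stand-ins:

```python
# Sketch of MSE, PSNR, and the adjacent-pixel correlation coefficient.
# Images are random stand-ins for the original and encrypted images.
import numpy as np

def mse(f, g):
    """Mean square error between two images."""
    return np.mean((f.astype(float) - g.astype(float)) ** 2)

def psnr(f, g, peak=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    m = mse(f, g)
    return float("inf") if m == 0 else 10 * np.log10(peak ** 2 / m)

def adjacent_cc(img):
    """Correlation of horizontally adjacent pixel pairs."""
    a = img[:, :-1].ravel().astype(float)
    b = img[:, 1:].ravel().astype(float)
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(0)
original = rng.integers(0, 256, (64, 64))
cipher = rng.integers(0, 256, (64, 64))   # stands in for an encrypted image
print(psnr(original, original))            # identical images -> infinite PSNR
print(adjacent_cc(cipher))                 # near 0 for a noise-like cipher
```

A good cipher image should show an adjacent-pixel correlation close to zero, while the decrypted image should match the original (zero MSE, infinite PSNR), which is what Tables 3.3 and 3.4 report on.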
The simulation results of PSNR, MSE, SSIM, and CC are shown in Tables 3.3
and 3.4.
Figures 3.12 and 3.13 depict the correlation coefficient diagram for medical image
1 and 2, respectively.
3.4 Conclusion
In this paper, we have used three techniques for medical or clinical image encryption:
the FRFT, the logistic map, and the Arnold map. The results suggest that this
complex hybrid combination makes the system more robust and secure against
different cryptographic attacks than any of these methods alone. The use of a
Fourier transform-based approach with the logistic chaotic map and the Arnold map
makes this algorithm highly complex and nonlinear, and hence difficult to breach.
In future work, the method may be applied to medical data security with advanced
IoT and machine learning tools.
References
Ahuja B, Lodhi R (2014) Image encryption with discrete fractional Fourier transform and chaos.
Adv Commun Netw Comput (CNC)
Akkasaligar PT, Biradarand S (2016) Secure medical image encryption based on intensity level using
Chao’s theory and DNA cryptography. In: 2016 IEEE international conference on computational
intelligence and computing research (ICCIC). IEEE, Chennai, pp 1–6
Ali T, Ali RA (2020) Novel medical image signcryption scheme using TLTS and Henon chaotic
map. IEEE Access 8:71974–71992
Avasare MG, Kelkar VV (2015) Image encryption using chaos theory. In: 2015 international con-
ference on communication, information and computing technology (ICCICT). IEEE, Mumbai,
pp 1–6
Belazi A, Talha M, Kharbech S, Xiang W (2019) Novel medical image encryption scheme based
on chaos and DNA encoding. IEEE Access 7:36667–36681
Cao W, Zhou Y, Chen P, Xia L (2017) Medical image encryption using edge maps. Signal Process
132:96–109
Horé A, Ziou D (2010) Image quality metrics: PSNR versus SSIM. In: 20th international conference
on pattern recognition. IEEE, Istanbul, pp 2366–2369
Ozaktas M, Zalevsky Z, Kutay MA (2001) The fractional Fourier transform. Wiley, West Sussex,
UK
Priya S, Santhi B (2019) A novel visual medical image encryption for secure transmission of
authenticated watermarked medical images. Mobile networks and applications. Springer, Berlin
Roy M, Mali K, Chatterjee S, Chakraborty S, Debnath R, Sen S (2019) A study on the applications
of the biomedical image encryption methods for secured computer aided diagnostics. In: Amity
international conference on artificial intelligence (AICAI), Dubai, United Arab Emirates. IEEE,
pp 881–886
Rachmawanto E, De Rosal I, Sari C, Santoso H, Rafrastara F, Sugiarto E (2019) Block-based Arnold
chaotic map for image encryption. In: International conference on information and communica-
tions technology (ICOIACT). IEEE, Yogyakarta, Indonesia, pp 174–178
Salunke BA, Salunke S (2016) Analysis of encrypted images using discrete fractional transforms viz.
DFrFT, DFrST and DFrCT. In: International conference on communication and signal processing
(ICCSP). IEEE, Melmaruvathur, pp 1425–1429
Tao R, Deng B, Wang Y (2006) Research progress of the fractional Fourier transform in signal
processing. Sci China (Ser. F Inf Sci) 49:1–25
Wang C, Ding Q (2018) A new two-dimensional map with hidden attractors. Entropy 20:322
Zhang X (2011) Lossy compression and iterative reconstruction for encrypted image. IEEE Trans
Inf Forensics Secur 6:53–58
Zhang J, Zhang Y (2014) An image encryption algorithm based on balanced pixel and chaotic map.
Math Probl Eng
Zhang L, Zhu Z, Yang B, Liu W, Zhu H, Zou M (2015) Medical image encryption and compression
scheme using compressive sensing and pixel swapping based permutation approach. Math Probl
Eng 2015
Chapter 4
Nepali Word-Sense Disambiguation
Using Variants of Simplified Lesk
Measure
Abstract This paper evaluates the simplified Lesk algorithm for Nepali word-sense
disambiguation (WSD). Disambiguation is performed by computing the similarity
between sense definitions and the context of the ambiguous word. We compute the
similarity using three variants of the simplified Lesk algorithm: direct overlap,
frequency-based scoring, and frequency-based scoring after dropping the target word.
We further evaluate the effect of stop word elimination, the number of senses, and
the context window size on Nepali WSD. The evaluation was carried out on a
sense-annotated corpus comprising 20 polysemous Nepali nouns. For the baseline,
we observed overall average precision and recall of 38.87% and 26.23% using
frequency-based scoring, 32.23% and 21.78% using frequency-based scoring after
dropping the target word, and 30.04% and 20.30% using direct overlap.
4.1 Introduction
S. Singh (&)
BML Munjal University, Gurugram, Haryana, India
R. Rauniyar
Tredence Analytics, Bangalore, India
M. Manohar
Gramener, Bangalore, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 41
G. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_4
In this work, we evaluate a WSD algorithm for the Nepali language. The algorithm
used is based on Lesk (1986) and, following Vasilescu et al. (2004), is called
simplified Lesk. The algorithm uses the similarity between the context vector and
sense definitions for disambiguation. We further investigate the effects of context
window size, stop word elimination, and the number of senses on Nepali WSD. We
compare the results for Nepali WSD with those of a similar Lesk-like algorithm for
Hindi WSD (Singh et al. 2017). The article is organized as follows: Sect. 4.2 reviews
related work in WSD for the English, Hindi, and Nepali languages. The WSD
algorithm is discussed in Sect. 4.3. Section 4.4 details the construction of the
sense-annotated Nepali corpus used in this work. Section 4.5 presents the
experiments conducted and their results, Sect. 4.6 discusses the results, and in
Sect. 4.7, we present our conclusion.
WSD techniques are broadly grouped into two main categories: dictionary-based
(or knowledge-based) and corpus-based. Dictionary-based techniques (Baldwin
et al. 2010; Banerjee and Pederson 2002, 2003; Lesk 1986; Vasilescu et al. 2004)
utilize information from lexical resources and machine-readable dictionaries for
disambiguation. Corpus-based techniques utilize corpora, either sense-tagged
(supervised) (Gale et al. 1992; Lee et al. 2004; Ng and Lee 1996) or raw
(unsupervised) (Resnik 1997; Yarowsky 1995), for disambiguation.
Lesk (1986) was one of the early, pioneering works on dictionary-based WSD
for the English language. He represented dictionary definitions obtained from a
lexicon as bags of words, extracting the words in the sense definitions of words in
the context of the target ambiguous word. Disambiguation was performed via the
contextual overlap between the sense and context bags of words. The work in (Agirre
and Rigau 1996; Miller et al. 1994) is other early work utilizing dictionary definitions
for WSD. Since Lesk, several extensions of his work have been proposed (Baldwin
et al. 2010; Banerjee and Pederson 2002, 2003; Gaona et al. 2009; Vasilescu et al. 2004).
Baldwin et al. (2010) reinvestigated and extended the task of machine-readable
dictionary-based WSD. They extended the Lesk-based WSD approach with methods
of definition extension and by applying different tokenization schemes. Evaluation
was carried out on Hinoki Sensebank example sentences and the Senseval-2 Japanese
dictionary task. The WSD accuracy using their approach surpassed both
unsupervised and supervised baselines. Banerjee and Pedersen (2002) utilized the
glosses associated with synsets, semantic relations, and each word attribute in a pair
for disambiguation using the English WordNet. In (Banerjee and Pederson 2003),
they explored a novel measure of semantic relatedness based on the count of overlaps
in glosses. A comparative evaluation of the original Lesk algorithm was performed
by Vasilescu et al. (2004). They observed the performance of adapted Lesk
For the Nepali language, work on WSD includes (Dhungana and Shakya 2014;
Shrestha et al. 2008). Dhungana and Shakya (2014) investigated an adapted Lesk-like
algorithm for Nepali WSD. They included the synset, gloss, example sentences, and
hypernym of every sense of the target polysemous word when creating the sense bag.
The context bag was created by extracting all words from the whole sentence after
dropping prepositions, articles, and pronouns. The score was computed by the
contextual overlap of the sense bag and the context bag. Evaluation was done on 348
words, 59 of them polysemous, and they achieved an accuracy of 88.05%. Shrestha
et al. (2008) studied the role of a morphological analyzer and a machine-readable
dictionary for Nepali word-sense disambiguation using the Lesk algorithm.
Evaluation was performed on a small dataset comprising Nepali nouns, and they
achieved accuracy values ranging from 50 to 70%. For the Nepali language, work on
sentiment analysis includes (Gupta and Bal 2015; Piryani et al. 2020). Gupta and Bal
(2015) studied sentiment analysis of Nepali text. They developed a Nepali
SentiWordNet named Bhavanakos and employed it for detecting sentiment words in
Nepali text. They also trained a machine learning classifier on annotated Nepali text
for document classification. Piryani et al. (2020) performed sentiment analysis of
tweets in Nepali text, employing machine and deep learning models.
The simplified Lesk algorithm for WSD used in this work is adapted from (Singh
et al. 2017) and given in Fig. 4.1. In this algorithm, a score is computed from the
contextual overlap of two bags: the context bag and the sense definition bag. The
sense definition bag comprises the synsets, gloss, and example sentences of the target
word. The context bag is formed by extracting the neighboring words in a window of
size ±n around the target word. The winning sense is the one that maximizes the
overlap of the two bags. To study the effects of context window size, test runs were
computed with window sizes of 5, 10, 15, 20, and 25. To study the effects of stop
word elimination, we dropped stop words from the context vector before creating
the context window. We utilized three variants to compute the score: direct overlap,
frequency-based scoring, and frequency-based scoring after dropping the target
word. For direct overlap, we counted the number of matching words. For
frequency-based scoring, we computed the frequency of matching words between
the context and sense bags. For frequency-based scoring after dropping the target
word, we did the same after removing the target word.
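The three scoring variants described above can be sketched as follows. The tokenizer, window handling, and sense inventory are illustrative assumptions; the study itself used IndoWordNet definitions and Nepali text:

```python
# Sketch of simplified Lesk with the three scoring variants: direct
# overlap, frequency-based scoring, and frequency-based scoring after
# dropping the target word. Whitespace tokenization is illustrative.
from collections import Counter

def context_bag(tokens, target_idx, window, stop_words=frozenset()):
    """Return the ±window tokens around the target, after stop-word
    removal (the target itself is assumed not to be a stop word)."""
    kept = [(i, t) for i, t in enumerate(tokens) if t not in stop_words]
    pos = next(j for j, (i, _) in enumerate(kept) if i == target_idx)
    return [t for _, t in kept[max(0, pos - window):pos + window + 1]]

def simplified_lesk(tokens, target_idx, senses, window=5,
                    variant="frequency", stop_words=frozenset()):
    """Pick the sense whose definition bag best matches the context bag."""
    context = Counter(context_bag(tokens, target_idx, window, stop_words))
    target = tokens[target_idx]
    best_sense, best_score = None, -1
    for sense, definition in senses.items():
        bag = Counter(definition.split())
        if variant == "drop_target":
            bag.pop(target, None)          # drop the target word itself
        if variant == "overlap":
            score = len(set(context) & set(bag))   # distinct matches
        else:
            score = sum((context & bag).values())  # frequency of matches
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

For example, with a toy sense inventory `{"seed": "a tree seed from which oil is extracted", "mole": "a small black spot on the skin"}`, the sentence "the til oil comes from the seed of a tree" resolves the target "til" to the seed sense under all three variants.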
4.4 Dataset
For evaluating the WSD algorithm, a sense-annotated Nepali corpus was created
comprising 20 polysemous Nepali nouns. The sense-annotated Nepali corpus is
given in Table 4.1. The sense definitions were obtained from IndoWordNet (http://
tdil-dc.in/indowordnet/), an important lexical resource for Nepali and Indian
languages. IndoWordNet is maintained at the Centre for Indian Language Technology
(CFILT), Indian Institute of Technology (IIT) Bombay, India. Test instances were
obtained from the Nepali General Text Corpus (http://tdil-dc.in/index.php?option=
com_download&task=showresourceDetails&toolid=453&lang=en), a raw Nepali
corpus available at the Technology Development for Indian Languages (TDIL) portal.
Test instances were also collected by firing search queries at various sites on the
Web containing Nepali text. The sense-annotated Nepali corpus was built using
guidelines similar to those of the sense-annotated Hindi corpus (Singh and Siddiqui 2016).
The sense listings in IndoWordNet are fine-grained. Hence, a few fine-grained
senses have been merged in our dataset using subject-based evaluation. For
example, the Nepali noun "तिल" (til) has three senses as a noun in IndoWordNet, as
given below.
1. तिल, एउटा रूखको बीउ जसबाट तेल निस्कन्छ, “ऊ सधैँ नुहाएपछि तिलको तेल
लगाउँछ”
Til, euta rukhko biu jasbata tel niskancha, “u sadhai nuhayepachi tilko tel
lagaucha.”
Sesame, a tree seed that secretes oils, “he always puts on sesame oil after
bathing”
2. कोठी, थोप्लो, तिल, छालामा हुने कालो वा रातो रङ्गको धेरै सानो प्राकृतिक चिनो अथवा
दाग, उसका गालामा कालो कोठी छ
Kothi, thoplo, til, chaalama hune kalo wa rato rangko dherai sano prakritik chino
athawa daag, uska gaalama kalo kothi cha.
Mole, mole, mole, a very small black or red colored natural identity or spot
present in the skin. He has a black mole on his cheek.
3. कोठी, तिल, कालो वा रातो रङ्गको अलिक उठेको मासुको त्यो दानो जुन शरीरमा कतैकतै
निक्लिने गर्छ, उनको डड्याल्नामा एउटा कालो कोठी छ
Kothi, til, kalo wa rato rangko alik utheko masuko tyo dano jun sarirma katai-
katai niklane garcha, unko ḍaḍyalnma euta kalo kothi cha.
Mole, mole: a black or red, slightly raised fleshy spot that can appear anywhere
on the body. "He has a black mole on his back."
For "तिल" (til), sense 1 pertains to the small oval seeds of the sesame plant. Sense 2
pertains to a mole, a small congenital pigmented spot on the skin. Sense 3 pertains
to a mole, a firm, abnormal, elevated blemish on the skin. The instances of senses 2
and 3 were marked as similar by two subjects; hence, we merged senses 2 and 3. The
two subjects were native speakers of the Nepali language and undergraduate students
of BML Munjal University, Gurugram, India.
For some senses in our dataset, we could not find sufficient instances; hence, we
dropped them. For example, the Nepali noun "क्रिया" (kriya) has the following five
noun senses in IndoWordNet.
1. क्रिया, क्रियापद, व्याकरणमा त्यो शब्द जसद्वारा कुनै व्यापार हुनु या गरिनु सूचित
हुन्छ “यस अध्यायमा क्रियामाथि छलफल गरिन्छ”
Kriya, kriyapad, byakaranma tyo sabdha jasdwara kunai byapar hunu ya garinu
suchit huncha, “yas adhyayama kriyamathi chalfal garincha.”
48 S. Singh et al.
Verb, verb form: in grammar, the word by which an action or occurrence is
indicated. "Verbs are discussed in this chapter."
2. प्रक्रिया, क्रिया, प्रणाली, पद्धति, त्यो क्रिया या प्रणाली जसबाट कुनै वस्तु हुन्छ,
बन्छ या निक्लिन्छ “युरियाको निर्माण रासायनिक प्रक्रियाबाट हुन्छ”
Prakriya, kriya, pranali, paddhati, tyo kriya ya pranali jasbata kunai vastu
huncha, bancha ya niklincha “Ureako nirman Rasayanik prakriyabata huncha”
Process, action, system, method, the action or system from which an object is
made, formed, or derived “Urea is formed by a chemical process”
3. श्राद्ध, सराद्ध, क्रिया, किरिया, कुनै मुसलमान साधु वा पीरको मृत्यु दिवसको कृत्य
“सुफी फकिरको श्राद्धमा लाखौँ मान्छे भेला भए”
Shradh, Saradh, kriya, kiriya, kunai musalman sadhu wa pirko mrityu diwasko
krtiya “Sufi fakirko shradhma lakhau manche bhela bhaye”
Last rites/rituals, Death anniversary, rites, the death anniversary of a Muslim
sage “Millions of people gathered to pay their respects to the Sufi fakir”
4. श्राद्ध, सराद्ध, क्रिया, किरिया, मुसलमान पीरको निर्वाण तिथि “पीर बाबाको श्राद्ध
बडो धुमदामले मनाइयो”
Shradh, Saradh, kriya, kiriya, kunai musalman pirko nirwan tithi “pir babako
shradh badho dhumdhamle manaiyo”
Last rites/rituals, Death anniversary, rites, the last rites/funeral of a Muslim
monk “The death anniversary of Pir Baba was celebrated with great pomp”
5. क्रिया, कुनै कार्य भएको वा गरिएको भाव “दूधबाट दही बनिनु एउटा रासायनिक क्रिया
हो”
Kriya, kunai karya bhyeko wa gariyeko bhab “Dhudhbata dahi baninu euta
rasayanik kriya ho”
Action: the state of something being done or having been done. "Milk turning
into yogurt is a chemical action."
For "क्रिया" (kriya), sense 1 pertains to a content word that denotes an action or a
state, i.e., a verb in Nepali grammar. Sense 2 pertains to a particular course of action
intended to achieve a result. Sense 3 pertains to the death-anniversary rites of a
Muslim monk. Sense 4 pertains to the emancipation date of a Muslim monk. Sense 5
pertains to something that people do or cause to happen.
Senses 3 and 4 of "क्रिया" (kriya) thus both pertain to the death rites or death
anniversary of a Muslim monk. We could not obtain instances for these senses;
hence, we dropped them from the dataset. Senses 2 and 5 both pertain to a course
of action; hence, they were merged.
We further added a few senses that were not available in IndoWordNet. For
example, the Nepali noun "दर" (dar) has the following three noun senses in
IndoWordNet.
1. अनुपात, दर, मान,माप,उपयोगिता आदिको तुलनाको विचारले एउटा वस्तु अर्को वस्तुसित
रहने सम्बन्ध या अपेक्षा “पुस्तकका लागि लेखकले दुई प्रतिशतको अनुपातले रोयल्टी
भेटिरहेको छ”
Anupat, dar, maan, maap, upayogita adhiko tulanako bicharle euta vastu arko
vastustith rahane sambhandha ya apekchya “Pusktakko lagi lekheko dui pratisatko
anupatle royalty bhetiraheko cha.”
The relation or expectation of one object to be with another by comparing
proportions, rates, values, measurements, utility, etc. “The author is receiving a two
per cent royalty for the book.”
2. मूल्य, दर, दाम, मोल, कुनै वस्तु किन्दा वा बेच्दा त्यसको बदलामा दिइने धन “यस
कारको मूल्य कति हो”
Mulya, dar, daam, moal, kunai vastu kinda wa bechda tyasko badlama diyine
dhan. yash kaarko mulya kati ho
Price, rate, value: the money given in return when buying or selling an item.
"What is the rate of this car?"
3. मूल्य, मोल, दाम, भाउ, दर, कुनै वस्तुको गुण,योग्यता या उपयोगिता जसको आधारमा
उसको आर्थिक मूल्य जाँचिन्छ “हीराको मूल्य जौहारीले मात्रै जान्दछ”
Mulya, moal, daam, bhau, dar, kunai vastuko gun, yogyata ya upayogita jasko
aadharma usko arthik mulya jachincha “Hirako mulya jauharile matrai jandacha”
Price, value, rate: the quality, merit, or usefulness of a commodity on the basis
of which its economic value is judged. "Only a jeweler knows the value of a
diamond."
For “दर” (dar), sense 1 pertains to rate or value. Sense 2 pertains to the rate or
price. Sense 3 pertains to rate or value or price.
For the Nepali noun "दर" (dar), we added the following sense:
दर, तीजमा खईने विशेष खाना, दर खाने दिनबाट तीज सुरु भएको मानिन्छ ।
Dar, teejma khaine vishesh khana, dar khane dinbata teej suru bhayeko
manincha.
Dar, a special food eaten on the occasion of Teej; the Teej festival is considered
to begin from the day on which Dar is eaten.
This sense pertains to a special dish made on the occasion of the Teej festival,
a festival celebrated in India and Nepal.
Senses 1, 2, and 3 all pertain to rate; hence, they have been merged.
Precision and recall were computed for the performance evaluation of the WSD
algorithm (Singh et al. 2017). Precision is computed as the ratio of instances
disambiguated correctly to the total test instances answered for a word. Recall is
computed as the ratio of instances disambiguated correctly to the total test
instances to be answered for a word.
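These definitions translate directly into code; the counts below are hypothetical
and serve only to illustrate the two ratios:

```python
def precision_recall(correct, answered, total):
    """Precision and recall as defined above, for a single word.

    correct  -- instances disambiguated correctly
    answered -- test instances the algorithm answered
    total    -- test instances to be answered for the word
    """
    precision = correct / answered if answered else 0.0
    recall = correct / total if total else 0.0
    return precision, recall

# Hypothetical counts: 35 correct out of 90 answered, with 120 total instances.
p, r = precision_recall(35, 90, 120)
print(round(p, 4), round(r, 4))  # 0.3889 0.2917
```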
The sense-annotated Nepali corpus comprises 20 polysemous Nepali nouns. The
total number of words in the corpus is 231,830, of which 40,696 are unique. The
corpus contains 3525 instances covering 48 senses. The average number of
instances per word is 176.25, the average number of instances per sense is 73.44,
and the average number of senses per word is 2.4. The transliteration, translation,
and number of instances for every sense of each word of this corpus are provided
in Table 4.10 in the Appendix. The Nepali stop-word list used in this work is given
in Fig. 4.2.
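The averages above follow directly from the raw counts:

```python
# Raw counts of the sense-annotated Nepali corpus reported above.
words, instances, senses = 20, 3525, 48

print(instances / words)             # 176.25  instances per word
print(round(instances / senses, 2))  # 73.44   instances per sense
print(senses / words)                # 2.4     senses per word
```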
Two test runs were performed to evaluate the algorithm and to study the effect of
stop-word elimination. These test runs pertained to the following two cases:
without stop-word removal (Case 1), which is also our baseline case, and with
stop-word removal (Case 2). For each test run, results were computed for window
sizes 5, 10, 15, 20, and 25. Test run 1 (Case 1) corresponds to the baseline, i.e.,
the overlap between the context and the sense definitions. For test run 2 (Case 2),
stop words were removed from the sense definitions and the context vector before
the similarity was computed.
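The two cases can be sketched as follows. This is an illustrative reconstruction,
not the authors' implementation: the whitespace tokenizer, the placeholder
stop-word set (the actual list is given in Fig. 4.2), and the exact form of
frequency-based scoring (counting every occurrence of an overlapping word in the
context) are assumptions.

```python
STOP_WORDS = {"र", "छ", "हो", "मा"}  # placeholder; the actual list is in Fig. 4.2

def window(words, i, size):
    """Context window of roughly `size` words centred on the target at index i."""
    half = size // 2
    return words[max(0, i - half): i + half + 1]

def score(context, definition, drop_stops=False, frequency=False):
    """Overlap between a context window and one sense definition."""
    ctx, defn = list(context), definition.split()
    if drop_stops:                        # Case 2: stop-word removal
        ctx = [w for w in ctx if w not in STOP_WORDS]
        defn = [w for w in defn if w not in STOP_WORDS]
    if frequency:                         # frequency-based: count every occurrence
        return sum(ctx.count(w) for w in set(defn))
    return len(set(ctx) & set(defn))      # direct overlap: shared unique words

def disambiguate(context, sense_definitions, **kw):
    """Return the index of the sense definition with the highest overlap."""
    scores = [score(context, d, **kw) for d in sense_definitions]
    return scores.index(max(scores))
```

With a toy English context `["oil", "tree", "seed", "bathing"]` and the two til
definitions, the seed sense wins because three of its definition words occur in
the context.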
The overall average precision and recall for the 20 words for direct overlap,
frequency-based scoring after dropping the target word, and frequency-based
scoring, for both cases, averaged over context window sizes of 5–25, are given in
Table 4.2. The average precision and recall for the 20 words with regard to context
window size for direct overlap are given in Tables 4.3 and 4.4. Tables 4.5 and 4.6
provide the corresponding values for frequency-based scoring after dropping the
target word, and Tables 4.7 and 4.8 for frequency-based scoring. Table 4.9
provides the average precision for these words with regard to the number of senses
for both cases and all three variants.
4.6 Discussion
The maximum overall precision and recall of 38.87% and 26.23% were observed
for frequency-based scoring in the baseline case, as seen in Table 4.2. We observed
an overall precision and recall of 35.03% and 23.47% for the case with stop-word
elimination using frequency-based scoring. For direct overlap in the baseline case,
we observed an overall average precision and recall of 30.04% and 20.30%; with
stop-word removal, these dropped to 25.08% and 16.55%. For frequency-based
scoring after dropping the target word, we observed an overall average precision
and recall of 32.23% and 21.78% in the baseline case, and 27.59% and 18.24%
with stop-word removal.
A decrease in precision is thus observed with stop-word elimination (Case 2)
relative to the baseline (Case 1): 16.51% for direct overlap, 9.88% for
frequency-based scoring, and 14.40% for frequency-based scoring after dropping
the target word.
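These relative decreases follow from the overall precision values quoted above:

```python
def pct_decrease(baseline, after):
    """Relative decrease of `after` with respect to `baseline`, in percent."""
    return round(100 * (baseline - after) / baseline, 2)

print(pct_decrease(30.04, 25.08))  # 16.51  direct overlap
print(pct_decrease(38.87, 35.03))  # 9.88   frequency-based scoring
print(pct_decrease(32.23, 27.59))  # 14.4   after dropping the target word
```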
The results in Tables 4.3, 4.5, and 4.7 suggest that increasing the context window
size enhances the possibility of disambiguating the correct sense. On increasing
the window size, more content words are included in the context vector, some of
which may be strong indicators of a particular sense.
As the number of senses (classes) increases, the possibility of correct
disambiguation decreases in general, as seen in Table 4.9. There were 14 words
having 2 senses, 4 words with 3 senses, and 2 words with 4 senses. We observed
the maximum precision for words with 2 senses, followed by those with 3 and 4
senses.
Comparing the results of Nepali WSD with similar work on Hindi WSD (Singh
et al. 2017), we obtained an overall decrease in precision for the Nepali language.
Table 4.9 Average precision and recall with respect to number of senses
(columns give the number of senses)

                                           Precision                  Recall
                                      2       3       4        2       3       4
Direct overlap              Case 1  0.3321  0.2598  0.1591   0.2140  0.2004  0.1310
                            Case 2  0.2666  0.2419  0.1582   0.1645  0.1879  0.1277
Frequency-based scoring     Case 1  0.3575  0.2795  0.1619   0.2317  0.2147  0.1261
after dropping target word  Case 2  0.2973  0.2466  0.1848   0.1851  0.1915  0.1445
Frequency-based scoring     Case 1  0.4534  0.2759  0.1614   0.2942  0.2130  0.1371
                            Case 2  0.4134  0.2333  0.1421   0.2659  0.1832  0.1192
In the work reported on Hindi WSD (Singh et al. 2017), the maximum overall
average precision obtained with the simplified Lesk algorithm was 54.54%, using
frequency-based scoring excluding the target word after applying stemming and
stop-word removal. The overall average precision obtained for the baseline and for
stop-word removal was 49.40% and 50.64%, respectively, using frequency-based
scoring excluding the target word. Moreover, an increase in precision was
observed after stop-word removal over the baseline for Hindi WSD.
The Nepali language has a complex grammatical structure. Root words in Nepali
grammar are often suffixed with words such as "को" (ko), "का" (ka), "मा" (maa),
"द्वारा" (dwara), "ले" (le), "लाई" (lai), "बाट" (bata), etc. This set of words is
known as vibhaktis. Apart from these, other suffixes such as "हरू" (haru) denote
the plural form of a word. For example, "केटाहरू" (ketaharu), meaning boys, is the
plural form of "केटा" (keta), meaning boy. Moreover, different vibhaktis can be
suffixed to the same root word in different sentences, depending upon the context.
Separating these suffixes and vibhaktis from the root word results in a
grammatically incorrect sentence.
Given below is a context in Nepali.
सविधानसभा निर्वाचनमा मनाङबाट निर्वाचित टेकबहादुर गुरुङ श्रम तथा रोजगार
राज्यमन्त्री हुन्।
Sambhidhansabha nirvachanma Manangbata nirvachit tekbahadur gurung shram
tatha rojgar rajyamantri hun.
Tek Bahadur Gurung, elected from Manang in the Constituent Assembly elec-
tion, is the Minister of Labor and Employment.
The Nepali context has the word "निर्वाचनमा" (nirwachanma), the vibhakti "मा"
(maa) appended to "निर्वाचन" (nirwachan). "निर्वाचन" (nirwachan) can also be
appended with the vibhakti "को" (ko), forming the word "निर्वाचनको"
(nirwachanko). In the computational overlap, "निर्वाचनमा" (nirwachanma) and
"निर्वाचनको" (nirwachanko) are treated as different words. This accounts for the
decrease in precision of Nepali WSD relative to Hindi WSD. The overlap between
the context vector and the sense vector for Nepali may involve the same content
word suffixed with different vibhaktis; such content words are treated as different
for every suffixed vibhakti and are not counted in the computational overlap. After
stop-word removal in Nepali, stop words are dropped from the context vector,
even though they may have contributed to the contextual overlap in the baseline
case. The context vector thus formed comprises content words with different
vibhaktis appended, so the same content word in the sense vector and the context
vector fails to match whenever the appended vibhaktis differ.
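The mismatch can be seen directly. A naive suffix stripper, sketched below with
the vibhakti list taken from the examples above (this is illustrative only, not a
real Nepali morphological analyzer), would let the two forms match:

```python
# The same content word with different vibhaktis fails an exact-match overlap.
context_word = "निर्वाचनमा"   # nirwachan + "मा" (maa)
sense_word = "निर्वाचनको"     # nirwachan + "को" (ko)

print(context_word == sense_word)  # False: no overlap is counted

VIBHAKTIS = ["मा", "को", "का", "ले", "लाई", "बाट", "द्वारा"]

def strip_vibhakti(word):
    """Naive suffix stripping; a sketch only, real Nepali morphology is richer."""
    for v in VIBHAKTIS:
        if word.endswith(v):
            return word[:-len(v)]
    return word

# After stripping, both forms reduce to the same root and would overlap.
print(strip_vibhakti(context_word) == strip_vibhakti(sense_word))  # True
```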
Nepali is a morphologically very rich language.
4.7 Conclusion
Appendix
Table 4.10 Translation, transliteration, and details of the sense-annotated Nepali corpus

Word             Sense number: translation of senses in English (number of instances)
उत्तर (uttar)      Sense 1: Answer (224); Sense 2: North direction (131)
क्रिया (kriya)     Sense 1: Verb in Nepali grammar (107); Sense 2: A course of action (102)
गोली (goli)       Sense 1: A dose of medicine (22); Sense 2: Bullet, a projectile fired from a gun (88)
ग्रहण (grahan)     Sense 1: The act of receiving (145); Sense 2: One celestial body obscuring another (86)
टीका (tikaa)      Sense 1: A jewellery item worn by women in South Asian countries (12); Sense 2: A sign on the forehead made with sandalwood (27); Sense 3: Writing about something in detail (20); Sense 4: Name of a person (26)
ताल (taal)        Sense 1: A small lake (105); Sense 2: Rhythm as given by divisions (32)
तिल (til)         Sense 1: The small oval seeds of the sesame plant (38); Sense 2: A small congenital pigmented spot on the skin (20)
तुलसी (tulsi)      Sense 1: Basil, a holy and medicinal plant (167); Sense 2: A saint who wrote the Ramayana and was a follower of the god Ram (46); Sense 3: A common name used for a man (42)
दर (dar)         Sense 1: Rate (87); Sense 2: Special dish made on the occasion of the Teej festival (66)
धारा (dhaaraa)    Sense 1: A river's flow (57); Sense 2: Law charges for crimes/section (126); Sense 3: Flow of speech and thought (35)
पुतली (putali)     Sense 1: Toy (24); Sense 2: Contractile aperture in the iris of the eye (34); Sense 3: Butterfly (21)
फल (phal)        Sense 1: Fruit (155); Sense 2: Result (112)
बल (bal)         Sense 1: Strength, power (93); Sense 2: Emphasis on a statement or something said (31); Sense 3: Ball (41); Sense 4: Force relating to police, army, etc. (83)
बोली (boli)       Sense 1: Communication by word of mouth (164); Sense 2: Bid (34)
वचन (vachan)     Sense 1: What one speaks or says, a saying (62); Sense 2: Promise, commitment (59); Sense 3: Used in grammar as an agent to denote singular or plural (24)
शाखा (saakhaa)    Sense 1: Divisions of an organization (61); Sense 3: Community (21)
साँचो (sancho)     Sense 1: Truth (136); Sense 2: Keys (38)
साल (saal)       Sense 1: Year (150); Sense 2: Type of tree (49)
(continued)
References
Agirre E, Rigau G (1996) Word sense disambiguation using conceptual density. In: Proceedings of
the international conference on computational linguistics (COLING’96), pp 16–22
Baldwin T, Kim S, Bond F, Fujita S, Martinez D, Tanaka T (2010) A re-examination of
MRD-based word sense disambiguation. J ACM Trans Asian Lang Process 9(1):1–21
Banerjee S, Pedersen T (2002) An adapted Lesk algorithm for word sense disambiguation using
WordNet. In: Proceedings of the third international conference on computational linguistics
and intelligent text processing, pp 136–145
Banerjee S, Pedersen T (2003) Extended gloss overlaps as a measure of semantic relatedness. In:
Proceedings of the eighteenth international joint conference on artificial intelligence, Acapulco,
Mexico, pp 805–810
Bhingardive S, Bhattacharyya P (2017) Word sense disambiguation using IndoWordNet. In:
Dash N, Bhattacharyya P, Pawar J (eds) The WordNet in Indian Languages. Springer, pp 243–
260
Dhungana UR, Shakya S (2014) Word sense disambiguation in Nepali language. In: Fourth
international conference on digital information and communication technology and its
applications (DICTAP), Bangkok, Thailand, pp 46–50
Gale WA, Church K, Yarowsky D (1992) A method for disambiguating word senses in a large
corpus. Comput Humanit 26:415–439
Gaona MAR, Gelbukh A, Bandyopadhyay S (2009) Web-based variant of the Lesk approach to
word sense disambiguation. In: Mexican international conference on artificial intelligence,
pp 103–107
Gupta CP, Bal BK (2015) Detecting sentiments in Nepali text. In: Proceedings of international
conference on cognitive computing and information processing, Noida, India, pp 1–4
Ide N, Veronis J (1998) Word sense disambiguation: the state of the art. Comput Linguist 24(1):
1–40
IndoWordNet. http://tdil-dc.in/indowordnet/
Jain A, Lobiyal DK (2016) Fuzzy Hindi WordNet and word sense disambiguation using fuzzy
graph connectivity measures. ACM Trans Asian Low-Resource Lang Inf Process 15(2)
Lee YK, Ng HT, Chia TK (2004) Supervised word sense disambiguation with support vector
machines and multiple knowledge sources. In: SENSEVAL-3: Third international workshop
on the evaluation of systems for the semantic analysis of text, Barcelona, Spain, pp 137–140
Lesk M (1986) Automatic sense disambiguation using machine readable dictionaries: how to tell a
pine cone from an ice cream cone. In: Proceedings of the 5th annual international conference
on systems documentation SIGDOC, Toronto, Ontario, pp 24–26
Miller G, Chodorow M, Landes S, Leacock C, Robert T (1994) Using a semantic concordance for
sense identification. In: Proceedings of the 4th ARPA human language technology workshop,
pp 303–308
Mishra N, Yadav S, Siddiqui TJ (2009) An unsupervised approach to Hindi word sense
disambiguation. In: Proceedings of the first international conference on intelligent human
computer interaction, pp 327–335
5.1 Introduction
Heterogeneous multicore systems have become an attractive option for the
smartphone industry, whose primary objectives are power efficiency and high
performance. High performance requires fast processors, which makes it difficult
to fit within the required thermal or mobile power budget, and battery technology
fails to keep pace with the fast-evolving CPU technology. Today, smartphones with
high performance and long-lasting battery life are preferred. ARM big.LITTLE
architectures (ARM Limited 2014) are designed to satisfy these requirements.
This technology uses heterogeneous cores: the "big" processors provide maximum
computational performance, and the "LITTLE" cores provide maximum power
efficiency.
F. V. Rodrigues (B)
Dnyanprassarak Mandal's College and Research Centre, Assagao-Goa, India
N. B. Guinde
Goa College of Engineering, Ponda-Goa, India
e-mail: nitesh.guinde@gec.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_5
As reported in Butko et al. (2015) and Sandberg et al. (2017), both processor types
use the same instruction set architecture (ISA) and are coherent in operation.
Branch misprediction latency is one of the major causes of performance
degradation in processors, and it becomes even more critical as
micro-architectures become more deeply pipelined (Sprangle and Carmean 2002).
To mitigate this, an accurate prediction scheme is essential to boost parallelism.
Prefetching and executing the branch along the predicted direction avoids stalling
the pipeline, reducing the performance losses caused by branches. An accurate
branch prediction scheme exploits parallelism: predicting the branch outcome
correctly frees the functional units, which can be utilized for other tasks.
Earlier work on branch predictors compares them on the basis of performance
alone, that is, without taking into consideration the effects of the operating system
(OS) while running the workload. The novelty of this paper is the evaluation and
comparative analysis of various branch predictors by incorporating them in an
ARM big.LITTLE architecture with Linux running on it. The comparison covers
performance and power dissipation. Our contributions also include comparing the
branch predictors on an ARM big.LITTLE system in terms of the percentage of
conditional branch mispredictions, the overall percentage of branch mispredictions
(covering conditional and indirect branches), IPC, execution time, and power
consumption. Based on the detailed analysis, we report some useful insights about
the designed architecture with various branch prediction schemes and their
associated impact on performance and power.
The rest of the paper is organized as follows: Sect. 5.2 presents related research
work on branch prediction schemes commonly used. Section 5.3 includes the dis-
cussion about simulated branch predictors. In Sect. 5.4, the experimental setup is
described within architectural modeling of the processor. Also, the performance and
power models have been discussed. Section 5.5 includes the experimental results.
Section 5.6 gives concluding remarks and perspectives regarding the branch predic-
tion schemes.
Aparna presents a comparative study of various branch predictors (BPs), including
the bimodal, gshare, YAGS, and meta-predictor (Kotha 2007). The BPs are
evaluated for their performance using JPEG encoder, G.721 speech encoder,
MPEG decoder, and MPEG encoder applications. A new BP, named YAGS Neo,
is modeled, which outperforms the others for some of the applications. The paper
also shows that a meta-predictor built from various combinations of the predictors
improves performance over the individual ones.
5 Performance Analysis of Big.LITTLE System … 61
Branch prediction in the processor is dynamic: adaptive predictors observe the
pattern of the history of previous branches during execution and use this behavior
to predict whether the branch will be taken or not taken when it occurs again. If
multiple unrelated branches index the same entry in the predictor table, this leads
to the aliasing effect shown in Fig. 5.1, where interference between the branches
P and Q leads to a misprediction. Hence, it is necessary to have an accurate branch
prediction scheme.
Some of the commonly used branch prediction schemes for computer architectures
are bimodal, local, gshare, YAGS, Tournament, L-TAGE, perceptron, ISL-TAGE, and
TAGE-SC-L.
Bimodal BP: The bimodal predictor is the earliest form of branch prediction
scheme (Lee et al. 1997). The prediction is based upon the branch history of a
given branch instruction. A table of counters is indexed using the lower bits of the
corresponding branch address. When a branch is identified, future occurrences are
predicted as taken if the corresponding counter is in the ST (strongly taken) or WT
(weakly taken) state, and as not taken if it is in the WNT (weakly not taken) or
SNT (strongly not taken) state.
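A minimal sketch of such a 2-bit scheme follows. The table size is an illustrative
assumption, not the configuration used in the evaluation; states are encoded
SNT = 0, WNT = 1, WT = 2, ST = 3.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [2] * (1 << index_bits)     # initialise weakly taken (WT)

    def predict(self, pc):
        # ST/WT (>= 2) predict taken; WNT/SNT (< 2) predict not taken.
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at ST
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at SNT
```

The 2-bit hysteresis means a single anomalous outcome in a long run of taken
branches does not flip the prediction.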
Local BP: Yeh and Patt propose a branch prediction scheme that uses the local
history of the branch being predicted; this history helps in predicting the next
occurrence of the branch. Here, the branch address bits are XORed with the local
history to index a table of saturating counters, whose bias provides the prediction
(Yeh and Patt 1991).
Tournament BP: Xie et al. present a tournament branch predictor that uses local
and global predictors based on saturating counters per branch (Xie et al. 2013).
The local predictor is a two-level table that keeps track of the local history of
individual branches; the global predictor is a single-level table. Both provide a
prediction. The meta-predictor, which selects between the two, is a table of
saturating counters indexed by the branch address.
Gshare BP: McFarling proposes an index-sharing scheme aimed at higher
accuracy (McFarling 1993). The gshare scheme is similar to the bimodal scheme,
except that the global history register bits are XORed with the bits of the program
counter (PC) to select the pattern history table (PHT) entry, whose value gives the
prediction. However, aliasing remains a major factor reducing prediction accuracy.
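The index-sharing idea can be sketched as follows; the table size and 2-bit
counter layout are illustrative assumptions, not the evaluated configuration.

```python
class GsharePredictor:
    """PHT of 2-bit counters indexed by PC XOR global history."""

    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.table = [2] * (1 << index_bits)   # 2-bit counters, weakly taken
        self.ghr = 0                           # global history register

    def _index(self, pc):
        # Index sharing: branches with the same low PC bits but different
        # recent histories map to different entries, reducing aliasing.
        return (pc ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask  # shift outcome in
```

Aliasing is reduced rather than eliminated: distinct (PC, history) pairs can
still XOR to the same index.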
YAGS BP: Eden and Mudge present yet another global scheme (YAGS), a hybrid
of a bimodal predictor and direction PHTs. The bimodal scheme stores the bias,
and the direction PHTs store the traces of a branch only when it does not agree
with the bias (Eden and Mudge 1998). This reduces the information that would
otherwise be stored in the direction PHT tables.
TAGE, L-TAGE, ISL-TAGE, TAGE-LSC: Seznec and Michaud introduce the
TAgged GEometric length (TAGE) predictor (Seznec and Michaud 2006), which
improves on Michaud's PPM-like tag-based branch predictor. A prediction is made
with the longest hitting entry in the partially tagged tables, whose history lengths
increase according to the geometric series

L(i) = α^(i−1) × L(1)    (5.1)

so that the length increases geometrically with i. The table entries are allocated in
an optimized way, making the predictor very space efficient.
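With illustrative values L(1) = 4 and α = 1.8 over eight tagged tables (the actual
simulated parameters are not specified here), the history lengths of Eq. (5.1)
grow as:

```python
# Geometric history lengths for the tagged TAGE tables, per Eq. (5.1).
L1, alpha, n_tables = 4, 1.8, 8
lengths = [round(L1 * alpha ** i) for i in range(n_tables)]
print(lengths)  # [4, 7, 13, 23, 42, 76, 136, 245]
```

Short lengths capture tight local correlation while the longest tables reach
hundreds of branches of global history at modest storage cost.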
The update policy of this BP increments or decrements the useful ("U") counter
when the final prediction is correct or incorrect, respectively. The "U" counters are
reset periodically to avoid any entry being marked as useful forever. When the
prediction is made by a newly allocated entry, it is not used directly, as new entries
need some training time to predict correctly; instead, the alternate prediction is
taken as the final outcome. This branch predictor is better in terms of accuracy.
Also, partial tagging is cost efficient and hence can be used by predictors with
long global history lengths.
The L-TAGE predictor was presented at CBP-2 (Seznec 2007). It is a hybrid of
TAGE and a loop predictor, where the loop predictor identifies branches that are
regular loops with a fixed number of iterations. Once a loop has executed
successively three times with a constant number of iterations, the loop predictor
provides the prediction. Seznec also presents the ISL-TAGE and TAGE-LSC
predictors, which incorporate a statistical corrector (SC) and a loop predictor
(Seznec et al. 2016; Seznec 2011).
Perceptron: Jiménez and Lin implement perceptron predictors based on neural
networks (Jiménez and Lin 2001). The perceptron model is a vector of weights
(w), signed integers that give the amount of correlation between the branch and
the inputs (x). The inputs to the perceptron are the previous outcomes (1 = taken,
−1 = not taken) from the global history shift register. The outcome is calculated
as a dot product of the weight vector w0, w1, ..., wn and the input vector
x0, x1, ..., xn, where x0 is a bias input always set to 1. The outcome P is given by

P = w0 + Σ (i = 1 to n) wi × xi    (5.2)

A positive or negative value of P indicates that the branch is predicted as taken or
not taken, respectively.
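A minimal sketch of this scheme is given below. It is illustrative only: the table
size and history length are assumptions, and the training threshold θ = 1.93h + 14
follows the cited paper rather than the configuration evaluated here.

```python
class PerceptronPredictor:
    """One weight vector per PC index; inputs are +/-1 global-history bits."""

    def __init__(self, history_len=8, entries=64):
        self.history = [1] * history_len           # x1..xn: +1 = taken, -1 = not taken
        self.w = [[0] * (history_len + 1) for _ in range(entries)]
        self.entries = entries
        self.theta = int(1.93 * history_len + 14)  # training threshold from the paper

    def _output(self, pc):
        # Eq. (5.2): P = w0 + sum_i wi * xi, with the bias input x0 fixed at 1.
        w = self.w[pc % self.entries]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0               # non-negative P -> taken

    def update(self, pc, taken):
        y, t = self._output(pc), (1 if taken else -1)
        w = self.w[pc % self.entries]
        # Train on a misprediction, or while the output magnitude is small.
        if (y >= 0) != taken or abs(y) <= self.theta:
            w[0] += t                              # bias weight, since x0 = 1
            for i, xi in enumerate(self.history, start=1):
                w[i] += t * xi                     # reinforce agreeing history bits
        self.history = self.history[1:] + [t]      # shift the outcome into history
```

Because the weights are updated only while |P| is below θ or on a misprediction,
training stops once the perceptron is confidently correct.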
This section covers the performance and power modeling using gem5 and McPAT,
respectively, along with the system configuration, and further describes the
Rodinia benchmark suite.
The system is configured with Ubuntu 16.04 OS on a vmlinux kernel for the ARM
ISA. The Rodinia benchmark suite (Che et al. 2009) for heterogeneous platforms
is used to study the effect of branch prediction schemes; we use its OpenMP
workloads. The problem sizes of the workloads are given in Table 5.2.
The workloads of the Rodinia benchmark suite comprise:
Heart Wall removes speckles from an image without impairing its features. It uses
structured grids.
k-Nearest Neighbors consists of dense linear algebra.
Number of boxes 1D estimates the potential of particles and relocates them within
a large 3D space due to the mutual forces among them.
This section discusses the various parameters used for comparing the branch
predictors, along with the performance and power analysis.
The performance analysis of the branch predictors is carried out for a minimum
of 1 million branch instructions. The overall conditional branch mispredictions are
shown in Fig. 5.2. From Table 5.3, the local BP has the minimum accuracy of
94.7%, with a misprediction rate of 5.29%, while perceptron and TAGE-LSC have
misprediction rates of 3.43% and 3.4%, respectively. This gives TAGE-LSC and
perceptron the highest accuracy of 96.6% for conditional branch predictions at a
fixed history length of 16 kb. Mispredictions in the popular PC-indexed predictors
with 2-bit counters are mainly due to destructive interference, or aliasing. Another
reason is that a branch may require local history, global history, or both in order to
predict the outcome correctly. TAGE predictors, on the other hand, handle
branches with long histories by employing multiple tagged components with
various folded histories.
The TAGE-LSC predictor outperforms the L-TAGE and ISL-TAGE predictors, as
the latter cannot accurately predict branches that are statistically biased towards a
given direction. For certain branches, their performance is worse than that of a
simple PC-indexed table.
Fig. 5.2 Percentage of conditional branch mispredictions per application at a fixed history length
of 16 kb
Fig. 5.3 Percentage of overall branch mispredictions comprising the conditional and unconditional
branches per application at a fixed history length of 16 kb
Fig. 5.4 Instructions Per Cycle (IPC) per application at a fixed history length of 16 kb
Figure 5.5 shows the execution time in seconds per application of the Rodinia
benchmark suite for the simulated ARM big.LITTLE architecture as the branch
prediction scheme is varied. Perceptron and TAGE-LSC have the least execution
time for almost all applications in the suite, while the local BP has the maximum
execution time.
Figures 5.6 and 5.7 show the performance of the branch prediction schemes as the
history length is changed. The LavaMD benchmark is used to study the effect of
variations in history length for the popular branch predictors, as it has the highest
utilization of the branch predictor among the applications of the Rodinia suite. As
the history length of the branch predictors is increased, the misprediction
percentage drops; in the case of the L-TAGE predictor, however, the drop is not
significant, which shows that L-TAGE is robust to changes in the geometric
history lengths.
The overall branch misprediction rate, comprising the conditional and unconditional branches, for varying history lengths is found to be 2.7701%. Seznec, who implemented L-TAGE, also states that the overall misprediction rate is within 2% for any minimum history length in the interval [2, 5] and any maximum value in [300, 1000] (Seznec 2007). In other words, L-TAGE, just like the TAGE and OGEHL predictors, is not sensitive to the history length parameters (Seznec 2005; Seznec and Michaud 2006). The ISL-TAGE, perceptron, and TAGE-LSC predictors have overall branch misprediction rates of 2.51%, 2.37%, and 1.9%, respectively, for varying history lengths. The perceptron predictor provides the best performance for history lengths below 16 kb and eventually levels off to a constant misprediction rate as the history length is increased beyond 16 kb. Beyond this history length, TAGE-LSC provides higher accuracy than the perceptron and the other predictors. The reason for the lower performance of the L-TAGE, ISL-TAGE, and TAGE-LSC predictors at reduced history lengths is insufficient correlation from remote branches, resulting in negative interference. Also, reducing the local history bits causes the predictor to fail to detect branches that exhibit loop behavior. However, the computational complexity of the perceptron and TAGE predictors is high in comparison to the popular PC-indexed 2-bit predictors.
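To make this cost comparison concrete, a minimal perceptron predictor can be sketched in Python. This is a simplified illustration in the spirit of Jiménez and Lin (2001), not the gem5 implementation simulated in this chapter; the table size and training threshold below are hypothetical. Every prediction is a dot product over the entire history register, which is why its hardware cost grows with history length.

```python
class PerceptronPredictor:
    """Sketch of a perceptron branch predictor: one weight vector per
    PC-indexed table entry, trained online on branch outcomes."""

    def __init__(self, history_len=8, table_size=64, theta=None):
        self.h = history_len
        self.size = table_size
        # weight vector per entry: bias + one weight per history bit
        self.table = [[0] * (history_len + 1) for _ in range(table_size)]
        self.history = [1] * history_len  # +1 = taken, -1 = not taken
        # training threshold from Jimenez & Lin's heuristic
        self.theta = theta if theta is not None else int(1.93 * history_len + 14)

    def predict(self, pc):
        w = self.table[pc % self.size]
        # dot product of weights with the global history (plus bias)
        y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))
        return y >= 0, y  # (predicted taken?, confidence)

    def update(self, pc, taken):
        pred, y = self.predict(pc)
        t = 1 if taken else -1
        # train only on a misprediction or a low-confidence prediction
        if pred != taken or abs(y) <= self.theta:
            w = self.table[pc % self.size]
            w[0] += t
            for i in range(self.h):
                w[i + 1] += t * self.history[i]
        self.history = self.history[1:] + [t]  # shift in the newest outcome
```

After a few dozen outcomes of a strongly biased branch, the learned weights drive the dot product away from zero and the prediction locks in.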
Table 5.3 summarizes, for better readability, the results of the performance analysis for the parameters % conditional mispredictions, % overall mispredictions, and IPC.
5 Performance Analysis of Big.LITTLE System … 71
Figure 5.8 shows the power results of the branch predictors incorporated in the ARM big.LITTLE architecture. The average power consumed by the L-TAGE, ISL-TAGE, and TAGE-LSC branch predictors is 5.89%, 5.91%, and 5.98%, respectively, for the various workloads. This is high in comparison to the other branch predictors; the reason is that the design complexity grows with the increase in components, as a result of which the silicon area and power dissipation in the processor increase (Seznec 2007). The perceptron predictor consumes 4.7% of the processor power, while the minimum average power of 3.77% is dissipated by the local BP unit.
It should be noted that the power estimation is based on simulation, which can incur abstraction errors and therefore reflects only approximate levels of power utilization by the predictor.
5.6 Conclusion
An exhaustive analysis of various branch prediction schemes was performed for power and performance using the McPAT and gem5 frameworks, respectively. It is observed that TAGE-LSC and perceptron have the highest prediction accuracy among the simulated predictors. The perceptron predictor performs efficiently at a reduced resource budget and history length, while TAGE-LSC outperforms it at higher history lengths and an increased resource budget.
In the ARM big.LITTLE architecture, the big cores, where high performance is desired, can incorporate the TAGE-LSC predictor, while the LITTLE cores can be built with the perceptron predictor, which achieves high accuracy and power efficiency at a reduced resource budget and power requirement. Also, the local branch predictor dissipates the minimum power, but its accuracy is very low.
Acknowledgements We would like to thank all those who have dedicated their time in research
related to the branch predictors and have contributed to the gem5 and McPAT simulation frameworks.
References
ARM Limited big.LITTLE Technology (2014) The future of mobile. In: Low power-high perfor-
mance white paper
Butko A, Bruguier F, Gamatié A (2016) Full-system simulation of big.LITTLE multicore architecture for performance and energy exploration. In: 2016 IEEE 10th international symposium on embedded multicore/many-core systems-on-chip (MCSoC). IEEE, Lyon, pp 201–208
Butko A, Gamatié A, Sassatelli G (2015) Design exploration for next generation high-performance manycore on-chip systems: application to big.LITTLE architectures. In: 2015 IEEE computer society annual symposium on VLSI. IEEE, Montpellier, pp 551–556
Che S, Boyer M, Meng J (2009) Rodinia: a benchmark suite for heterogeneous computing. In: 2009 IEEE international symposium on workload characterization (IISWC). IEEE, Austin, pp 44–54
Eden AN, Mudge T (1998) The YAGS branch prediction scheme. In: Proceedings of the 31st annual
ACM/IEEE international symposium on microarchitecture 1998. ACM/IEEE Dallas, pp 169–177
Jiménez DA, Lin C (2001) Dynamic branch prediction with perceptrons. In: Proceedings of the 7th
international symposium on high-performance computer architecture HPCA ’01. ACM/IEEE
Mexico, p 197
Kotha A (2007) Electrical & Computer Engineering research works. Digital Repository at the University of Maryland (DRUM). https://drum.lib.umd.edu/bitstream/handle/1903/16376/branch_predictors_tech_report.pdf?sequence=3&isAllowed=y. Cited 10 Dec 2007
Lee CC, Chen ICK, Mudge TN (1997) The bi-mode branch predictor. In: Proceedings of 30th
annual international symposium on microarchitecture. ACM/IEEE, USA, pp 4–13
McFarling S (1993) Combining branch predictors. TR, Digital Western Research Laboratory, Cal-
ifornia, USA
Sandberg A, Diestelhorst S, Wang W (2017) Architectural exploration with gem5. In: ARM Res
Seznec A (2005) Analysis of the O-GEometric history length branch predictor. ACM SIGARCH Comput Archit News 33(2):394–405
Seznec A (2007) A 256 Kbits L-TAGE branch predictor. The second championship branch prediction competition (CBP-2). J Instruction-Level Parallelism (JILP) 9(1):1–6
Seznec A (2016) TAGE-SC-L branch predictors again. In: 5th JILP workshop on computer architecture competitions (JWAC-5) 5(1)
Seznec A, Michaud P (2006) A case for (partially) TAgged GEometric history length branch prediction. J Instruction-Level Parallelism 8(1):1–23
Seznec A (2011) A 64 Kbytes ISL-TAGE branch predictor. In: Workshop on computer architecture competitions (JWAC-2): championship branch prediction
Sparsh M (2018) A survey of techniques for dynamic branch prediction. CoRR abs/1804.00261
Sprangle E, Carmean D (2002) Increasing processor performance by implementing deeper pipelines. In: Proceedings of the 29th annual international symposium on computer architecture (ISCA). IEEE, USA, pp 25–34
The gem5 simulator. Homepage. http://gem5.org/Main_Page. Last accessed 2018/1/12
Xie Z, Tong D, Cheng X (2013) An energy-efficient branch prediction technique via global-history noise reduction. In: International symposium on low power electronics and design (ISLPED). ACM, Beijing, pp 211–216
Yeh T, Patt Y (1991) Two-level adaptive training branch prediction. In: Proceedings of the 24th
annual international symposium on microarchitecture 1991. ACM, New York, pp 51–61
Chapter 6
Global Feature Representation Using
Squeeze, Excite, and Aggregation
Networks (SEANet)
6.1 Introduction
With the advancement of deep learning models, many real-world problems that had remained open for decades are now being solved. Deep learning is used in a wide range of applications, from image detection and recognition to security and surveillance. One of the major advantages of deep learning models is that
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 73
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_6
74 A. Pandey et al.
they extract features on their own. Convolutional neural networks (CNNs) are used extensively to solve image recognition and image classification tasks. A convolutional layer learns a set of filters that help extract useful features; it learns powerful image-descriptive features by combining the spatial and channelwise relationships in the input.
To enhance the performance of CNNs, recent research has explored three different aspects of networks, namely width, depth, and cardinality. It was found that deeper models can fit complex input distributions much better than shallow models. With the availability of specialized hardware accelerators such as GPUs, it has become easy to train larger networks. Taking advantage of GPUs, continuous improvement in accuracy has been shown by recent models like VGGNet (Sercu et al. 2016), GoogLeNet (Szegedy et al. 2015), and Inception net (Szegedy et al. 2017). VGGNet showed that stacking blocks of the same shape gives better results, and GoogLeNet showed that width plays an important role in improving the performance of a model. Xception (Chollet 2016) and ResNeXt (Xie et al. 2017) introduced the idea of increasing the cardinality of a neural network; they empirically showed that, apart from saving parameters, cardinality also increases the representational power more than width and depth do. But it was observed that deep models built by stacking up layers suffer from the degradation problem (He et al. 2016). The degradation problem arises when, after some iterations, the training error refuses to decrease, thereby giving high training and test errors. The reason behind the degradation problem is the vanishing gradient: as the model becomes larger, the propagated gradient becomes very small by the time it reaches the earlier layers, making learning more difficult. The degradation problem was solved with the introduction of the ResNet (He et al. 2016) models, which stack residual blocks with skip connections to build very deep architectures and gave better accuracy than their predecessors.
ResNet performed very well and won the ILSVRC (Berg et al. 2010) challenge
in 2015. Subsequently, the architecture that won ILSVRC 2017 challenge is SENet
(Hu et al. 2017). Unlike other CNN architectures that considered all feature maps
to be equally important, SENet (Hu et al. 2017) quantifies the importance of each
feature map adaptively and weighs them accordingly. The main architecture of SENet
discussed in Hu et al. (2017) is built over base ResNet by incorporating SE blocks. SE
blocks can also be incorporated in other CNN architectures. Though SENet quantifies the importance of feature maps, it neither addresses redundancies across feature maps nor provides a global representation across them. In this work, we propose a novel architecture, namely SEANet, that is built over SENet. Following the SE block, we introduce an aggregate block that helps provide a global feature representation and also minimizes redundancies in the feature representation.
6 Global Feature Representation Using Squeeze, Excite, and Aggregation … 75
maps. In 2018, Hu et al. introduced the Squeeze and Excitation (SE) block in their work SENet (Hu et al. 2017) to compute channelwise attention, wherein (i) the squeeze operation compresses each feature map to a scalar representation using global average pooling that is subsequently mapped to weights, and (ii) the excite operation excites the feature maps using the obtained weights. This architecture won the ILSVRC1 2017 challenge.
In this work, we propose an architecture, namely SEANet, which is built over SENet (Hu et al. 2017). Following the SE block, we introduce an aggregate block that helps in global feature representation by downsampling the number of feature maps through a simple sum operation.
In Sect. 6.3, an overview of SENet is provided. The proposed architecture SEANet
is elucidated in detail in Sect. 6.4. Results and analysis are discussed in Sect. 6.5.
6.3 Preliminaries
Squeeze-and-Excitation-Networks (SENet)
There are two issues with the way existing models apply the convolution operation on inputs. Firstly, the receptive field has information only about the local region, because of which global information is lost. Secondly, all the feature maps are given equal weight, but some feature maps may be more useful to the next layers than others. SENet (Hu et al. 2017) proposes a novel technique to retain global information and also dynamically re-calibrate the feature map inter-dependencies. The following two subsections explain these two problems in detail and how SENet (Hu et al. 2017) addresses them using the squeeze and excitation operations.
Squeeze Operation: Since each of the learned filters operates within a local receptive field, each unit of the output is deprived of the contextual information outside the local region. The smaller the receptive field, the less contextual information is retained. The issue becomes more severe when the network under consideration is very large and the receptive field used is small.
SENet (Hu et al. 2017) solves this problem by finding a means to extract global information and then embed that global information into the feature maps. Let U = [u_1, u_2, ..., u_c] be the output obtained from the previous convolutional layer. The global information is obtained by applying global average pooling to each channel u_p of U to obtain the channelwise statistics Z = [z_1, z_2, ..., z_c], where z_k, the kth element of Z, is computed as:

$$ z_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_k(i, j) \qquad (6.1) $$
1 http://image-net.org/challenges/LSVRC/.
6 Global Feature Representation Using Squeeze, Excite, and Aggregation … 77
Equation (6.1) is the squeeze operation, denoted by F_sq. We see that Z obtained in this way captures the global information for each feature map. Hence, the first problem is addressed by the squeeze operation.
Excitation Operation: Once the global information is obtained from the squeeze step, the next step is to embed this global information into the output. The excitation step essentially multiplies the output feature maps by Z. But simply multiplying the feature maps by the statistics Z would not address the second problem. So, in order to re-calibrate the feature map inter-dependencies, the excitation step uses a gating mechanism consisting of a network (shown in Fig. 6.1) with two fully connected layers and a sigmoid output layer.
The excitation operation can be mathematically expressed as follows:
1. Let the network be represented as a function F_ex.
2. Let Z be the input to the network.
3. Let S be the output of F_ex.
4. Let FC_1 and FC_2 be fully connected layers with weights W_1 and W_2, respectively (biases are set to 0 for simplicity).
5. Let δ be the ReLU (Qiu et al. 2017) activation applied to the output of FC_1, and σ be the sigmoid activation applied after the FC_2 layer.
Then the output of the excitation operation can be expressed by the following equation:

$$ S = F_{ex}(Z) = \sigma\big(W_2\, \delta(W_1 Z)\big) \qquad (6.2) $$

where Z is the statistics obtained from the squeeze operation. The output S of the network F_ex is a vector of probabilities having the same shape as the input to the network. The final output of the SE block is obtained by scaling each element of U by the corresponding element of S, i.e., $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_c]$, where $\tilde{x}_p = s_p \ast u_p$. Thus, the feature maps are now dynamically re-calibrated.
2 https://github.com/moskomule/senet.pytorch.
$$ G_1 = [c_1, c_2, \ldots, c_k], \quad G_2 = [c_{k+1}, c_{k+2}, \ldots, c_{2k}], \; \ldots, \; G_i = [c_{(i-1)k+1}, c_{(i-1)k+2}, \ldots, c_{ik}], \; \ldots, \; G_{n/k} = [c_{n-k+1}, c_{n-k+2}, \ldots, c_n] \qquad (6.3) $$
are n/k mutually exclusive groups of feature maps. For example, in Fig. 6.2, after the first three residual blocks and the SE block, the number of incoming feature maps is 128. With an aggregate factor k = 4, these feature maps are partitioned into 32 groups, each having 4 feature maps.
Step 2: Each group is downsampled to a single feature map using the aggregate
operation sum. That is,
$$ S_i = \mathrm{aggregate}(G_i) = \sum_{j=(i-1)k+1}^{ik} c_j \qquad (6.4) $$
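The grouping and sum aggregation of Eqs. (6.3) and (6.4) can be sketched on nested lists as follows (a minimal illustration, not the authors' code; the function name is ours):

```python
def aggregate_block(feature_maps, k):
    """Partition C feature maps into C//k mutually exclusive groups of k
    consecutive maps (Eq. 6.3) and sum elementwise within each (Eq. 6.4)."""
    C = len(feature_maps)
    assert C % k == 0, "number of maps must be divisible by the aggregate factor"
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    out = []
    for g in range(0, C, k):                       # one group per iteration
        group = feature_maps[g:g + k]
        out.append([[sum(m[r][c] for m in group)   # elementwise sum over group
                     for c in range(W)] for r in range(H)])
    return out
```

With 128 incoming maps and k = 4, this yields the 32 aggregated maps of the example above.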
Figure 6.3 depicts the aggregate operation. The effect of downsampling by aggregation is to remove redundant representations and obtain a global feature representation across the feature maps in each group. To understand this, assume that we have an RGB image and combine/aggregate information from the R, G, and B maps to output a single grayscale feature map. In this way, we move away from the local color information in each individual map to a global grayscale map. Further, the aggregation downsampled three feature maps into one grayscale map, implicitly eliminating redundancies in representing a pixel. Just as batch normalization extends the idea of normalizing the input to normalizing all activations, our SEANet extends the aforementioned principle through the layers of a deep network. The advantages of such extension by aggregation in our SEANet are manifold:
1. Redundancy in the representation of features is minimized.
2. A global representation across feature maps is obtained.
3. With sum as the aggregate operation, significant gradient flows back during backpropagation, as sum distributes its incoming gradient to all its operands.
4. Significant improvement in performance.
It is to be noted that many other aggregation operations, including min and max, are available, but sum performed the best. Further, one may argue that the number of feature maps can be downsampled by keeping only the important ones, where importance is provided by the SE block; however, we observed that this idea drastically pulls down the performance.
6.5.1 Datasets
We chose two benchmark datasets, namely CIFAR-10 (Krizhevsky et al. 2014a) and
CIFAR-100 (Krizhevsky et al. 2014b), for our experiments.
The CIFAR-10 dataset3: The CIFAR-10 dataset consists of 60,000 color images, each of size 32 × 32. Each image belongs to one of ten mutually exclusive classes, namely airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset is divided into a training set and a test set. The training set consists of 50,000 images equally distributed across classes, i.e., 5,000 images randomly selected from each class. The test set consists of 10,000 images, with 1,000 randomly selected from each class.
The CIFAR-100 dataset4: The CIFAR-100 dataset is similar to CIFAR-10 except that it contains 100 classes instead of 10. The 100 classes are grouped into 20 superclasses, and each class consists of 600 images. The training set consists of 500 images from each of the 100 classes, and the test set consists of 100 images from each of the 100 classes.
Before we delve into the superior performance of our architecture SEANet against state-of-the-art architectures, we provide the implementation details.
Data is preprocessed using per-pixel mean subtraction and padding by four pixels on all sides. Subsequent to preprocessing, data augmentation is performed by random cropping and random horizontal flipping. Model weights are initialized by Kaiming initialization (He et al. 2015). The initial learning rate is set to 0.1 and is divided by 10 after every 80 epochs; we trained for 200 epochs. The other hyper-parameters, weight decay, momentum, and optimizer, are set to 0.0001, 0.9, and stochastic gradient descent (SGD), respectively. We fixed the batch size to 64. No dropout (Srivastava et al. 2014) is used in the implementation. This setting remains the same across the datasets. The hyper-parameters used in our implementation are summarized in
3 https://www.cs.toronto.edu/kriz/cifar.html.
4 https://www.cs.toronto.edu/kriz/cifar.html.
Table 6.1. Our implementation is done in PyTorch (Adam et al. 2019), and the code5 is made publicly available. We trained our model on a Tesla K-40 GPU, and training took around 22 h.
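The step learning-rate schedule stated above (initial rate 0.1, divided by 10 after every 80 epochs over 200 epochs) can be sketched as a small helper; the function name and parameter names are ours:

```python
def lr_at_epoch(epoch, base_lr=0.1, drop_every=80, factor=10.0):
    """Step decay: divide the base rate by `factor` every `drop_every` epochs."""
    return base_lr / (factor ** (epoch // drop_every))
```

Under this schedule, epochs 0 to 79 train at 0.1, epochs 80 to 159 at 0.01, and epochs 160 to 199 at 0.001.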
First, we compare the performance of our SEANet with ResNet and SENet. Table 6.2 enumerates the classification error rates on the CIFAR-10 and CIFAR-100 datasets for SEANet and variants of ResNet. It is clearly evident that SEANet outperforms all variants of ResNet on both datasets. On CIFAR-10, we achieve the smallest error rate of 4.3%, which is better by 2% than the best-performing ResNet-110. Similarly, on CIFAR-100, we achieve the smallest error rate of 22.24%, which is better by 4% than the best-performing ResNet-56. It is to be noted that 1% on the CIFAR-10 and CIFAR-100 test sets corresponds to 100 images. Therefore, we perform better than ResNet on an additional 200 and 400 images on CIFAR-10 and CIFAR-100, respectively.
Table 6.3 compares the performance of SEANet against SENet. Again, SEANet outperforms SENet, by 3% and 8% on CIFAR-10 and CIFAR-100, respectively. The remarkable improvement in performance can be attributed to the presence of the additional aggregate block in SEANet. Figures 6.4 and 6.5 display the validation accuracy over
5 https://github.com/akhilesh-pandey/SEANet-pytorch.
Fig. 6.4 Validation accuracy over epochs for ResNet, SENet, and SEANet on CIFAR-10 dataset
epochs for ResNet, SENet, and SEANet on CIFAR-10 and CIFAR-100 datasets,
respectively.
As mentioned earlier, SEANet uses 128, 256, and 512 feature maps in its blocks, unlike the standard ResNet-20 and SENet (with standard ResNet-20 as its base). For a fair comparison, we increased the feature maps in the blocks of standard ResNet-20 to 128, 256, and 512, respectively. Table 6.4 reports the performance of SEANet versus the modified ResNet-20 and modified SENet. SEANet performs better than or on par with the modified ResNet-20 and modified SENet on the CIFAR-10 dataset, while on CIFAR-100 it performs marginally lower. But it is to be noted that, due to the downsampling by sum aggregation in SEANet, the number of parameters in SEANet is smaller than the corresponding numbers in the modified ResNet-20 and modified SENet. Specifically, SEANet has around 8% fewer parameters than the modified
Fig. 6.5 Validation accuracy over epochs for ResNet, SENet, and SEANet on CIFAR-100 dataset
Table 6.6 Effect of the reduction factor used for downsampling in the aggregate operation on CIFAR-10 and CIFAR-100 using SEANet

Reduction factor | CIFAR-10 | CIFAR-100
2  | 95.50 | 77.12
4  | 95.53 | 77.16
6  | 95.57 | 77.50
8  | 95.70 | 78.67
10 | 95.53 | 77.19
12 | 95.56 | 77.38
Table 6.5 compares SEANet against other state-of-the-art architectures. Note that EncapNet (Li et al. 2018) is a very recent network architecture published in 2018, with two improvised variants, viz. EncapNet+ and EncapNet++. Our SEANet outperforms EncapNet and both its variants on the more complex CIFAR-100 dataset by around 2%. On CIFAR-10, SEANet reports 1% lower than the variants of EncapNet, though it outperforms EncapNet itself by 0.25%. Further, SEANet performs better than all other enumerated state-of-the-art architectures.
The aggregate operation used in the proposed SEANet downsamples the number of feature maps by a reduction factor of 8 on both CIFAR-10 and CIFAR-100. We did an ablation study to determine the effect of this reduction factor; the results are reported in Table 6.6 for reduction factors of 2, 4, 6, 8, 10, and 12. Clearly, downsampling by a factor of 8 gives the best performance on both datasets. If the reduction factor is too small, there is no elimination of redundant features, and if it is too large, useful features may be lost.
6.5.5 Discussion
The proposed SEANet is able to eliminate redundancies in feature maps, and thereby reduce the number of feature maps, by using the simple aggregate operation of sum. Other aggregate operations, like max and min, were also tried but did not give significant improvement compared to sum. One possible direction for future work is to explore why some aggregate functions perform better than others.
6.6 Conclusion
Acknowledgements We dedicate this work to our Guru and founder chancellor of Sri Sathya Sai
Institute of Higher Learning, Bhagawan Sri Sathya Sai Baba. We also thank DMACS for providing
us with all the necessary resources.
Abstract Single image super-resolution is one of the evolving areas in the field of image restoration. It involves reconstruction of a high-resolution image from an available low-resolution image. Although a lot of research is available in this field, many issues remain unresolved. This work focuses on two aspects of image super-resolution. The first is that the process of dictionary formation is improved by using a smaller number of images while preserving maximum structural variation. The second is that the pixel value estimation of the high-resolution image is improved by considering only those overlapping patches that are more relevant from the point of view of the image characteristics. For this, all overlapping pixels corresponding to a particular location are classified as part of either a smooth region or an edge. Simulation results clearly demonstrate the efficacy of the algorithm proposed in this paper.
7.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 89
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_7
90 G. Pandey and U. Ghanekar
L = AU H (7.1)
In the past, a lot of work has been done in the field of SISR. A classification of the existing techniques is discussed in detail in Pandey and Ghanekar (2018, 2020). In the spatial domain, techniques are categorized into interpolation-based, regularization-based, and learning-based. Among these three groups, researchers have emphasized learning-based SISR techniques, since these techniques are able to create new spectral information in the reconstructed HR image. In learning-based methods, in the first stage a dictionary is created for training the computer, and once the machine is trained, then in the second stage, i.e., the reconstruction stage, the HR image is estimated for the given input LR image.
In the training stage, the target is to train the machine in such a way that the same types of structural similarities as in the input LR image are available in the computer database for further processing. This is achieved by forming either internal (Egiazarian and Katkovnik 2015; Singh and Ahuja 2014; Pandey and Ghanekar 2020, 2021; Liu et al. 2017) or external dictionary/dictionaries (Freeman et al. 2002; Chang et al.
7 Improved Single Image Super-resolution Based on Compact … 91
[Fig. 7.1 block diagram: similarity score calculation → image selected having least SSIM → HR image in output]
2004; Zhang et al. 2016; Aharon et al. 2006). In the case of an external dictionary, the size of the dictionary is governed by the number of images used in its formation and is of major concern, since a greater number of images means a larger memory requirement, which results in more computations during the reconstruction phase. This can be reduced by forming the dictionary from images that are similar to the input image but differ from each other, so that fewer images are required to cover the whole structural variation present in the input image. In the reconstruction stage, the HR image is recreated through local processing (Chang et al. 2004; Bevilacqua et al. 2014; Choi and Kim 2017) or sparse processing (Yang et al. 2010). Either neighbor embedding (Chang et al. 2004) or direct mapping (Bevilacqua et al. 2014) is used in the case of local processing. In neighbor embedding, an HR patch is constructed with the help of a set of similar patches searched in the dictionary formed in the training phase. After constructing the corresponding HR patches for all input LR patches, all the HR patches are combined to form the output HR image. In existing techniques, simple averaging is used in the overlapping areas for combining the HR patches, which may result in a poor HR image estimate when one or more erroneous patches fall in the overlapping areas. It also blurs edges due to the simple averaging.
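The baseline scheme criticized here, plain averaging of overlapping HR patches, can be sketched as follows (a hypothetical minimal version for illustration, not any published implementation; patches are placed by their top-left corners):

```python
def combine_patches_average(patches, positions, H, W, p=5):
    """Average overlapping p x p HR patches into an H x W image.
    Every patch covering a pixel contributes equally, so an erroneous
    patch degrades all pixels it overlaps and edges get blurred."""
    acc = [[0.0] * W for _ in range(H)]   # running sum per pixel
    cnt = [[0] * W for _ in range(H)]     # number of covering patches
    for patch, (r0, c0) in zip(patches, positions):
        for dr in range(p):
            for dc in range(p):
                acc[r0 + dr][c0 + dc] += patch[dr][dc]
                cnt[r0 + dr][c0 + dc] += 1
    return [[acc[r][c] / cnt[r][c] for c in range(W)] for r in range(H)]
```

The proposed method replaces this uniform average with a selective combination based on the smooth/edge type of each pixel.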
To overcome this problem, in this paper only patches of similar type are considered when combining the HR patches in the overlapping areas. This is based on
Learning-based SISR has two important parts for HR image reconstruction. First, a dictionary is built; then, with the help of this dictionary and the input LR image, an HR image is reconstructed. The method proposed in this paper consists of forming an external dictionary for the training part and a neighbor embedding approach for the reconstruction part. Its generalized block diagram is provided in Fig. 7.1. All the LR images of the dictionary as well as the input image are upscaled by a factor U through bi-cubic interpolation. On the basis of the structural similarity score (Wang et al. 2004) given in Eq. (7.2), a set of ten images is selected from the database for dictionary formation. Once the images having the higher scores are selected from the database, two of them are chosen so as to cover the maximum structural variations present in the input LR image. For this, the image having the highest structural similarity score with the input image is first selected for dictionary formation; then, the structural similarity score is calculated between the selected image and the rest of the images chosen in the first step, and the image having the least structural similarity is taken as the second image for dictionary formation. The complete dictionary is formed with the help of these two selected LR images and their corresponding HR pairs by forming overlapping patches of size 5 × 5.
$$ SSIM(i, j) = \frac{2\mu_i \mu_j + v_1}{\mu_i^2 + \mu_j^2 + v_1} \qquad (7.2) $$

where µ is the mean, v_1 is a constant, and i and j represent the ith and jth images, respectively.
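Equation (7.2) keeps only the luminance (mean) comparison term of the full SSIM index of Wang et al. (2004). A direct sketch of this score on flattened pixel lists (function and parameter names are ours; the value of v_1 is an arbitrary small constant):

```python
def similarity_score(img_i, img_j, v1=1e-4):
    """Mean-only similarity of Eq. (7.2): equals 1.0 when the two image
    means coincide and decays toward 0 as the means diverge."""
    mu_i = sum(img_i) / len(img_i)
    mu_j = sum(img_j) / len(img_j)
    return (2 * mu_i * mu_j + v1) / (mu_i ** 2 + mu_j ** 2 + v1)
```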
In the second stage, i.e., the reconstruction part, the constructed dictionary is used for HR image generation. At first, overlapping blocks of 5 × 5 are obtained from the input LR image; then, for every individual block k, a number of nearest neighbors is searched in the dictionary. LLE (Chang et al. 2004) is used as the neighbor embedding technique to obtain optimal weights for the set of nearest neighbors, and these weights, along with the corresponding HR pairs, are used to construct HR patches to estimate the HR image. In the overlapping areas, simple averaging results in blurry edges; thus, a new technique to combine the patches is given here, in which every pixel of the output HR image is individually estimated with the help of the constructed HR patches. The steps are as follows:
i. Edge matrix for the input LR image is calculated by canny edge detector, and
then, overlapping patches are formed from the matrix just like that of patches
formed for the input LR image.
ii. At every pixel location of the HR image, check whether 0 or 1 is present in its
corresponding edge matrix. 0 represents smooth pixel, and 1 represents edge
pixel. To confirm the belongingness of each pixel to there respective group,
consider a 9 × 9 block of edge matrix having the pixel under consideration as
center pixel, and calculate the number of zeros or ones in all the four direction
given in Fig. 7.2.
iii. In any of the four directions, if number of zeros is 5 in case of smooth pixel and
number of ones in case of edge pixel, then assign that pixel as true smooth or
true edge pixel, respectively.
iv. For the pixels that cannot be categorized into true smooth or edge pixel, count
the number of zeros or ones in the 9 × 9 block of edge matrix with pixel under
consideration at center. If count of zero is 13, consider the pixel as smooth pixel,
and if count of one is 13, then pixel under consideration is edge pixel.
v. After categorizing the pixel into smooth or edge type, its value is estimated
with the help of HR patches that contains the overlapping pixel position and its
corresponding patch of edge matrix.
vi. Instead of considering all the overlapping pixels from their respective overlap-
ping HR patches, only pixels of patches which are of same type to that of the
pixel that is to be estimated are considered. Two cases are explained separately.
a. At a particular pixel location of the bi-cubic interpolated LR image, if by
following the above procedure the pixel is considered to be smooth type,
then to estimate the pixel value at same location in HR image, same type
of patches will be considered (out of selected 25 patches, number will be
less for boundary patches). For this, firstly, overlapping HR patch having all
zeros is chosen. If not available, then, the patches having maximum number
of zeros will be considered.
b. Similarly, for edge type, at first, overlapping HR patch having all ones is
chosen. If not available, then, the patches having maximum number of ones
will be considered to estimate the value instead of all the patches.
vii. This process of assigning values is performed individually for all the pixels of the HR image to obtain the HR image at the output.
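The classification rules of steps ii–iv can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes the four directions of Fig. 7.2 are horizontal, vertical, and the two diagonals through the center pixel (five pixels each, center included), and applies the count-of-13 rule of step iv as "at least 13".

```python
# Sketch of the smooth/edge pixel classification of steps ii-iv.
# Assumptions: the four directions of Fig. 7.2 are taken as horizontal,
# vertical and the two diagonals through the centre pixel (5 pixels each),
# and the count-of-13 rule in step iv is applied as "at least 13".

def classify_pixel(edge, r, c):
    """Classify pixel (r, c) of a binary edge matrix as 'smooth' or 'edge'."""
    v = edge[r][c]                      # 0 = smooth candidate, 1 = edge candidate
    h, w = len(edge), len(edge[0])

    def val(rr, cc):                    # out-of-range samples count as "no match"
        return edge[rr][cc] if 0 <= rr < h and 0 <= cc < w else 1 - v

    # Step iii: 5 matching labels along any of the four directions
    for dr, dc in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        line = [val(r + k * dr, c + k * dc) for k in range(-2, 3)]
        if line.count(v) == 5:
            return 'smooth' if v == 0 else 'edge'   # true smooth / true edge

    # Step iv: fall back to counting matches in the whole 9 x 9 block
    count = sum(val(r + dr, c + dc) == v
                for dr in range(-4, 5) for dc in range(-4, 5))
    if count >= 13:
        return 'smooth' if v == 0 else 'edge'
    return 'undecided'
```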
7.4 Experimental Results
In this section, experiments are performed to demonstrate the usefulness of the proposed algorithm for generating an HR image from an LR image. The images used for validation of the algorithm are given in Fig. 7.3. In all the experiments, LR images are formed by smoothing the HR image with a 5 × 5 averaging filter followed by downsampling by a factor of three.
94 G. Pandey and U. Ghanekar
[Fig. 7.3: test images (a)–(e) used for validation]
A set of 100 standard images is selected from the Berkeley dataset (Berkeley dataset 1868) to form the database of LR-HR images. All the LR images are upscaled by a factor of three using bi-cubic interpolation. For dictionary formation, two images are selected one by one.
7 Improved Single Image Super-resolution Based on Compact … 95
For the selection of the first image, the structural similarity score is calculated between the input image and the database LR images, and the image having the highest similarity score with the input is selected. The nine images with the next-highest scores are kept as candidates for the second image to be used for dictionary formation. For the selection of the second image, the structural similarity score between the first selected image and each of these nine candidates is taken, and the image having the lowest score with the first image is selected. This process of image selection helps in forming a dictionary with the maximum possible variation using only two images. Overlapping LR-HR patches are formed from these two images to build the dictionary for training purposes.
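The two-image selection procedure above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the similarity function `ssim` is passed in as a parameter (in practice a structural-similarity implementation such as `skimage.metrics.structural_similarity` would be used).

```python
# Sketch of the two-image dictionary selection: pick the database image
# most similar to the input, then the least similar (to the first pick)
# among the nine next-best matches, to maximise structural variation.

def select_dictionary_images(input_img, database, ssim):
    """Return the two images used to build the compact dictionary."""
    # Rank database images by similarity to the input LR image
    ranked = sorted(database, key=lambda img: ssim(input_img, img), reverse=True)
    first = ranked[0]                   # highest similarity with the input
    candidates = ranked[1:10]           # nine next-best matches
    # Second image: lowest similarity with the first pick
    second = min(candidates, key=lambda img: ssim(first, img))
    return first, second
```

With a toy "image" type (numbers) and similarity defined as negative absolute difference, the procedure picks the closest item first, then the most dissimilar of the next nine.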
Once the dictionary formation is completed, the procedure for converting the input LR image into an HR image starts. For this, the input LR image is first upscaled by a factor of three using bi-cubic interpolation, and then overlapping patches of size 5 × 5 are formed from it. For every patch, the six nearest LR patches are selected from the dictionary using Euclidean distance. With the help of these selected LR patches, optimal weights are calculated for the corresponding HR patches using LLE to obtain the final corresponding HR patch. All such constructed HR patches are combined to obtain the HR image by estimating each pixel individually with the technique proposed in this paper. Experimental results of the proposed algorithm are given in Tables 7.1 and 7.2 and compared with other existing techniques: bi-cubic interpolation, neighbor embedding (Chang et al. 2004), and sparse coding (Yang et al. 2010).
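The LLE weight computation used to combine the six nearest-neighbor patches can be sketched with the standard closed form: minimize ||x − Σ_k w_k n_k||² subject to Σ_k w_k = 1 via the local Gram matrix. This is a NumPy sketch, not the authors' code; the regularization term is an assumption added for numerical stability.

```python
# LLE reconstruction weights for a query LR patch from its K nearest
# neighbours (standard locally linear embedding closed form).
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """x: (d,) query LR patch; neighbors: (K, d) nearest LR patches.
    Returns K reconstruction weights summing to one."""
    diff = neighbors - x                                   # (K, d) local differences
    G = diff @ diff.T                                      # (K, K) local Gram matrix
    G = G + reg * np.trace(G) * np.eye(len(neighbors))     # regularise (assumption)
    w = np.linalg.solve(G, np.ones(len(neighbors)))
    return w / w.sum()                                     # sum-to-one constraint

# The HR patch is then estimated with the same weights:
#   hr_patch = lle_weights(x, lr_neighbors) @ hr_neighbors
```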
Table 7.1 Experimental results for the proposed method and other methods for comparison, in terms of PSNR (dB)

Sr. no. | Image    | Bi-cubic | NE (Chang et al. 2004) | ScSR (Yang et al. 2010) | Proposed algorithm
1.      | Starfish | 25.38    | 27.71 | 28.43 | 29.01
2.      | Bear     | 26.43    | 28.92 | 30.05 | 29.91
3.      | Flower   | 27.21    | 29.67 | 29.11 | 30.54
4.      | Tree     | 25.15    | 28.57 | 28.03 | 28.87
5.      | Building | 27.12    | 29.97 | 30.34 | 30.21
Table 7.2 Experimental results for the proposed method and other methods for comparison, in terms of SSIM

Sr. no. | Image    | Bi-cubic | NE (Chang et al. 2004) | ScSR (Yang et al. 2010) | Proposed algorithm
1.      | Starfish | 0.8034   | 0.8753 | 0.8764 | 0.9091
2.      | Bear     | 0.7423   | 0.8632 | 0.8725 | 0.8853
3.      | Flower   | 0.7932   | 0.8321 | 0.8453 | 0.8578
4.      | Tree     | 0.7223   | 0.8023 | 0.8223 | 0.8827
5.      | Building | 0.8059   | 0.8986 | 0.9025 | 0.9001
The tables and figures comparing the proposed algorithm with the other algorithms show that its results are better than those of the other algorithms used in the present study. The HR image constructed by our algorithm is also visually better when juxtaposed with the images obtained from the other algorithms (Fig. 7.4).
7.5 Conclusion
The research work presented in this paper focuses on generating an HR image from a single LR image with the help of an external dictionary. A novel way of building an external dictionary has been proposed that captures the maximum variety of structural variations with only a few images in the dictionary. To achieve this, images that are similar to the input LR image but differ from each other are selected for dictionary formation. This reduces the size of the dictionary and hence the number of computations during the search for nearest neighbors. To form the complete HR image, a new technique based on classifying pixels as part of a smooth or edge region is used for combining, in overlapping areas, the HR patches generated using LLE. The results obtained through experiments verify the effectiveness of the algorithm.
References
Aharon M, Elad M, Bruckstein A (2006) The K-SVD: an algorithm for designing over-complete dictionaries for sparse representation. IEEE Trans Signal Process 54(11):4311–4322
Berkeley dataset (1868) https://www2.eecs.berkeley.edu
Bevilacqua M, Roumy A, Guillemot C, Morel M-LA (2014) Single-image super-resolution via linear mapping of interpolated self-examples. IEEE Trans Image Process 23(12):5334–5347
Chang H, Yeung DY, Xiong Y (2004) Super-resolution through neighbor embedding. In: IEEE
conference on computer vision and pattern recognition, vol. 1, pp 275–282
Choi JS, Kim M (2017) Single image super-resolution using global regression based on multiple local linear mappings. IEEE Trans Image Process 26(3)
Dong C, Loy CC, He K, Tang X (2016) Image super-resolution using deep convolutional networks.
IEEE Trans Pattern Anal Mach Intell 38(2), 295–307
Egiazarian K, Katkovnik V (2015) Single image super-resolution via BM3D sparse coding. In: 23rd
European signal processing conference, pp 2899–2903
Freeman W, Jones T, Pasztor E (2002) Example-based super-resolution. IEEE Comput Graph Appl 22(2):56–65
Liu ZS, Siu WC, Huang JJ (2017) Image super-resolution via weighted random forest. In: 2017
IEEE international conference on industrial technology (ICIT). IEEE
Liu C, Chen Q, Li H (2017) Single image super-resolution reconstruction technique based on a
single hybrid dictionary. Multimedia Tools Appl 76(13), 14759–14779
Pandey G, Ghanekar U (2018) A compendious study of super-resolution techniques by single image.
Optik 166:147–160
Pandey G, Ghanekar U (2020) Variance based external dictionary for improved single image super-
resolution. Pattern Recognit Image Anal 30:70–75
Pandey G, Ghanekar U (2020) Classification of priors and regularization techniques appurtenant to
single image super-resolution. Visual Comput 36:1291–1304. doi: 10.1007/s00371-019-01729-z
Pandey G, Ghanekar U (2021) Input image-based dictionary formation in super-resolution for online
image streaming. In: Hura G, Singh A, Siong Hoe L (eds) Advances in communication and
computational technology. Lecture notes in electrical engineering, vol 668. Springer, Singapore
Park SC, Park MK, Kang MG (2003) Super-resolution image reconstruction: a technical overview.
IEEE Signal Process Magz 20(3), 21–36
Singh A, Ahuja N (2014) Super-resolution using sub-band self-similarity. In: Asian conference on computer vision, pp 552–568
Wang Z, Bovik A, Sheikh H, Simoncelli E (2004) Image quality assessment: from error visibility
to structural similarity. IEEE Trans Image Process 13(4), 600–612
Yang J, Wright J, Huang TS, Ma Y (2010) Image super-resolution via sparse representation. IEEE Trans Image Process 19(11):2861–2873
Yang W, Zhang X, Tian Y, Wang W, Xue J-H, Liao Q (2019) Deep learning for single image
super-resolution: a brief review. IEEE Trans Multimedia 21(12), 3106–3121
Zhang Z, Qi C, Hao Y (2016) Locality preserving partial least squares for neighbor embedding-
based face hallucination. In: IEEE conference on image processing, pp 409–413
Chapter 8
An End-to-End Framework for Image
Super Resolution and Denoising of SAR
Images
Abstract Single image super resolution (or upscaling) has become very efficient because of the powerful application of generative adversarial networks (GANs). However, the presence of noise in the input image often produces undesired artifacts in the resultant output image. Denoising an image and then upscaling it introduces more chances of such artifacts due to the accumulation of prediction errors. In this work, we propose single-shot upscaling and denoising of SAR images using GANs. We compare the quality of the output image with that of a two-step denoising and upscaling network. To evaluate our standing with respect to the state of the art, we compare our results with other denoising methods without super resolution. We also present a detailed analysis of experimental findings on the publicly available COWC dataset, which comes with context information for object classification.
8.1 Introduction
ods, including convolutional neural network (CNN) and generative adversarial net-
works (GANs) for removal of noise. Section 8.3 gives a detailed description of
such techniques. In applications involving images or videos, high-resolution data are usually desired for advanced computer vision tasks. The rationale behind the thirst for high image resolution can be either improving the pixel information for human perception or easing computer vision tasks. Image resolution describes the details in an image; the higher the resolution, the more image details. Among the various denotations of the word resolution, we focus on spatial resolution, which refers to the pixel density in an image and is measured in pixels per unit area.
In situations like satellite imagery, it is challenging to use high-resolution sensors due to physical constraints. To address this problem, the input image is captured at a low resolution and post-processed to obtain a high-resolution image. These techniques are commonly known as super resolution (SR) reconstruction. SR techniques construct high-resolution (HR) images from several observed low-resolution (LR) images. In this process, the high-frequency components are increased, and the degradations caused by the imaging process of the low-resolution camera are removed. Super resolution of images has proven to be very useful for better visual quality and for easing detection by other computer vision techniques. However, the presence of noise in the input image can be problematic, as the network enhances the noise when upscaling is done. We merge the two-step procedure of denoising an image and then upscaling it, and compare the quality and the performance benefit of running one network instead of two. The motivation is to create a single artificial neural network with both denoising and upscaling capabilities.
Contributions
The primary contributions of our work are:
1. We analyze the two possible approaches to denoising and super resolution of SAR images, single-step and two-step, and demonstrate a comparison of their performance on the compiled dataset.
2. Through empirical analysis, we demonstrate that the single-step approach better preserves the details in the noisy image. Even with higher PSNR values for the ID-CNN network, the RRDN images better maintain the high-level features present in the image, which will prove more useful in object recognition than PSNR-driven networks, which are unable to preserve these details.
Organization of the chapter
The remainder of this chapter is organized as follows. To give the reader a clear understanding of the problem, we define speckle noise in Sect. 8.2. In Sect. 8.3, we present a detailed description of the works proposed in the literature that are related to our framework. In Sect. 8.4, we describe the proposed framework in detail. Section 8.5 describes the experimental findings of our proposed framework. In Sect. 8.6, we present a detailed analysis of the results obtained using our framework. Finally, the chapter concludes with Sect. 8.7, along with some indications of the future scope of work.
8 An End-to-End Framework for Image Super Resolution … 101
Speckle noise arises due to the effect of environmental conditions on the imaging sensor during image acquisition. Speckle noise is primarily prevalent in application areas like medical images, active radar images, and synthetic aperture radar (SAR) images. The model commonly used for representing the multiplicative SAR speckle noise is defined as:
Y = N X    (8.1)

p(N) = (1 / Γ(L)) L^L N^(L−1) e^(−L N),  N ≥ 0, L ≥ 1    (8.2)

where Γ(·) is the Gamma function. L, the equivalent number of looks (ENL), is usually regarded as the quantitative evaluation index for de-speckling experiments on real SAR images in homogeneous areas, and is defined as:

ENL = X̄² / σ_X²    (8.3)
the increasing values of the hyperparameters used to define the noise. As a result, proposing a unified, end-to-end model for such a task is challenging.
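The speckle model of Eqs. (8.1)–(8.3) can be illustrated numerically: N is gamma-distributed with unit mean, Y = N·X, and the ENL of a homogeneous region is estimated as mean² / variance. This is a NumPy sketch under those definitions, not code from the chapter.

```python
# Multiplicative speckle model (Eqs. 8.1-8.3): sample gamma speckle with
# L looks and estimate the ENL on a homogeneous region.
import numpy as np

rng = np.random.default_rng(0)

def add_speckle(X, L):
    """Multiply a clean image X by gamma-distributed speckle N with E[N] = 1."""
    N = rng.gamma(shape=L, scale=1.0 / L, size=X.shape)
    return N * X

def enl(region):
    """Equivalent number of looks of a homogeneous image region."""
    return region.mean() ** 2 / region.var()

X = np.full((512, 512), 100.0)   # perfectly homogeneous clean image
Y = add_speckle(X, L=4)
# For a homogeneous region, the estimated ENL should be close to L.
```

Since var(N) = 1/L for this gamma parameterization, the estimated ENL recovers the number of looks L on homogeneous areas.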
As the importance of SAR denoising explained above suggests, several techniques have been proposed on this topic in the past years. In 1981, the Lee filter (Wang and Bovik 2002) was proposed, which uses statistical techniques to define a noise model; the probabilistic patch-based filter (PPB) (Deledalle et al. 2009) uses a similarity criterion based on the noise distribution; non-local means (NL-means) (Buades et al. 2005) uses all possible self-predictions to preserve texture and details; and block matching 3D (BM3D) (Kostadin et al. 2006) uses inter-patch correlation (NLM) and intra-patch correlation (wavelet shrinkage). The deep learning approach has received much attention in the area of image denoising. However, there are tangible differences among the various types of deep learning techniques dealing with image denoising. For example, discriminative learning based on deep learning tackles the issue of Gaussian noise, whereas optimization models based on deep learning are useful in estimating the actual noise. In Chunwei et al. (2020), a comparative study of deep learning techniques in image denoising is explained in detail.
There have been several approaches to speckle reduction in important application domains. Speckle reduction is an important step before the processing and analysis of medical ultrasound images. In Da et al. (2020), a new algorithm based on deep learning is proposed to reduce the speckle noise for coherent imaging without clean data. In Shan et al. (2018), a new speckle noise reduction algorithm for medical ultrasound images is proposed, employing the monogenic wavelet transform (MWT) and a Bayesian framework. Considering the search for an optimal threshold as exhaustive and the requirements as contradictory, in Sivaranjani et al. (2019) the problem is conceived as a multi-objective particle swarm optimization (MOPSO) task, and a MOPSO framework for de-speckling an SAR image using a dual-tree complex wavelet transform (DTCWT) in the frequency domain is proposed. Huang et al. (2009) proposed a novel method that includes the coherence reduction speckle noise (CRSN) algorithm and the coherence constant false-alarm ratio (CCFAR) detection algorithm to reduce speckle noise in SAR images and to improve the detection ratio for SAR ship targets from the SAR imaging mechanism. Techniques such as (Vikrant et al. 2015; Yu et al. 2020; Vikas Kumar and Suryanarayana 2019) proposed speckle denoising filters specifically designed for SAR images that have shown encouraging performance. For target recognition from SAR images, Wang et al. (2016) proposed a complementary spatial pyramid coding (CSPC) approach in the framework of spatial pyramid matching (SPM) (Fig. 8.2).
In Wang et al. (2017), a novel technique was proposed whose network has eight convolution layers with rectified linear units (ReLU); each convolution layer has 64 filters of 3 × 3 kernel size with a stride of one and no pooling. It applies a combination of batch normalization and a residual learning strategy with a skip connection, where the input image is divided by the estimated speckle noise pattern in the image. This method uses the L2 norm (Euclidean distance) as the loss function, which reduces the distance between the output and the target image; however, this can introduce artifacts and does not take neighboring pixels into account, so a total variation loss, balanced by the regularization factor λ_TV, is added to the overall loss function to overcome this shortcoming. The TV loss encourages smoother results. Assuming X̂ = φ(Y^(w,h)), where φ is the learned network (parameters) for generating the despeckled output:
L = L_E + λ_TV L_TV    (8.4)

L_E = (1 / (W H)) Σ_{w=1}^{W} Σ_{h=1}^{H} ||φ(Y^(w,h)) − X^(w,h)||²_2    (8.5)

L_TV = Σ_{w=1}^{W} Σ_{h=1}^{H} [ (X̂^(w+1,h) − X̂^(w,h))² + (X̂^(w,h+1) − X̂^(w,h))² ]    (8.6)
The method proposed in Wang et al. (2017) performs well compared to the traditional image processing methods mentioned above; hence, we compare our work with Wang et al. (2017).
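The loss of Eqs. (8.4)–(8.6) — per-pixel Euclidean loss plus an anisotropic total-variation term balanced by λ_TV — can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation; the weight value used below is an assumption.

```python
# Combined loss of Eqs. (8.4)-(8.6): Euclidean term plus TV regulariser.
import numpy as np

def euclidean_loss(pred, target):
    """L_E: mean squared difference over the W x H image (Eq. 8.5)."""
    return np.mean((pred - target) ** 2)

def tv_loss(pred):
    """L_TV: sum of squared horizontal and vertical differences (Eq. 8.6)."""
    dh = np.diff(pred, axis=0) ** 2     # (X[w+1,h] - X[w,h])^2
    dv = np.diff(pred, axis=1) ** 2     # (X[w,h+1] - X[w,h])^2
    return dh.sum() + dv.sum()

def total_loss(pred, target, lambda_tv=1e-4):
    """L = L_E + lambda_TV * L_TV (Eq. 8.4); lambda_tv is illustrative."""
    return euclidean_loss(pred, target) + lambda_tv * tv_loss(pred)
```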
8.4.1 Architecture
We propose a single-step artificial neural network model for the image super resolution and SAR denoising tasks, inspired by the RRDN GAN model (Wang et al. 2019). Figure 8.3 depicts the proposed modification of the original RRDN network. The salient features of the proposed model are:
• To compare it with the other noise removal techniques, we remove the upscaling part of the super-resolution GAN and train the network at 1×.
104 A. Pandey et al.
Fig. 8.3 The RRDN architecture (we adapted the network without the upscaling layer for compar-
ison)
Fig. 8.4 A schematic representation of the residual in residual dense block (RRDB) architecture
• We also train the network with upscaling layers for simultaneous super resolution
and noise removal.
The model was trained in various configurations; the best results were found with 10 RRDB blocks. Figure 8.4 depicts such an architecture: there are 3 RDB blocks in each RRDB block, 4 convolution layers in each RDB block, 64 feature maps in the RDB convolution layers, and 64 feature maps outside the RDB convolutions. Figure 8.5 shows the discriminator used in the adversarial network. As shown in Fig. 8.5, the discriminator has a series of convolution and ReLU layers, followed by a dense layer of dimension 1024 and, finally, a dense layer with a Sigmoid function to classify between the low- and high-resolution images. Next, we discuss the various loss functions used in the network.
In the perceptual loss (Wang et al. 2019), we measure the mean square error in the feature maps of a pre-trained network. For our training, we have taken layers 5 and 9 of the pre-trained VGG19 model. The perceptual loss function is an improved version of MSE that evaluates a solution based on perceptually relevant characteristics and is defined as follows:
l^SR = l^SR_X + 10⁻³ l^SR_Gen    (8.7)

where l^SR_X is the content loss and l^SR_Gen is the adversarial loss; their weighted sum forms the perceptual loss (for VGG-based content losses). Both terms are defined in the following section.
The content loss is defined on the VGG feature maps as:

l^SR_X = (1 / (W_{i,j} H_{i,j})) Σ_{x=1}^{W_{i,j}} Σ_{y=1}^{H_{i,j}} ( φ_{i,j}(I^HR)_{x,y} − φ_{i,j}(G_θG(I^LR))_{x,y} )²    (8.8)

where φ_{i,j} represents the feature map obtained by the jth convolution before the ith max-pooling layer, W_{i,j} and H_{i,j} describe the dimensions of the feature maps within the VGG network, G_θG(I^LR) is the reconstructed image, and I^HR is the ground truth image.
One of the most important strengths of adversarial networks is their ability to create natural-looking images once the generator has been trained for a sufficient amount of time. This component of the loss encourages the generator to favor outputs that are closer to realistic images. The adversarial loss is defined as follows:

l^SR_Gen = Σ_{n=1}^{N} − log D_θD(G_θG(I^LR))    (8.9)
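The generator term of Eq. (8.9) can be sketched directly: the loss is small when the discriminator scores the generated images close to 1 ("real") and large when it confidently flags them as fake. The discriminator scores below are stand-ins, not the actual networks.

```python
# Adversarial generator loss of Eq. (8.9): sum of -log D(G(I_LR)) over a
# batch of discriminator scores in (0, 1].
import numpy as np

def adversarial_loss(d_scores):
    """d_scores: discriminator outputs for a batch of generated images."""
    d_scores = np.clip(d_scores, 1e-8, 1.0)   # numerical safety
    return -np.sum(np.log(d_scores))

# Scores near 1 (discriminator fooled) give a small loss; scores near 0
# (discriminator confident the output is fake) give a large loss.
```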
8.5.1 Dataset
We recompile the dataset as outlined by the authors of ID-CNN (Wang et al. 2017) and ID-GAN (Wang Puyang et al. 2017) for analysis on synthetic SAR images. The dataset is a combination of images from UCID (Schaefer and Stich 2004), BSDS-500 (Arbeláez et al. 2011), and scraped Google Maps images (Isola et al. 2017). All these images are converted to grayscale using OpenCV to simulate intensity SAR images.
Grayscale images are then downscaled to 256 × 256 to serve as the noiseless high-resolution targets. Another set of grayscale images is downscaled from the original size to 256 × 256 and 64 × 64 images. For each input image, we generate three different noise levels. We randomly allot the images to the training, validation, and testing sets with a split ratio of 95 : 2.5 : 2.5, chosen to obtain a test set of similar size to that of ID-CNN (Wang et al. 2017).
We also use the Cars Overhead With Context (COWC) dataset (Mundhenk et al. 2016), provided by the Lawrence Livermore National Laboratory, for further investigation of the performance. We use this dataset because it contains target boxes for the classification and localization of cars, so the data can be further used for object detection for performance comparison. We then add speckle noise to the images in our dataset using np.random.gamma(L, 1/L) from NumPy, which samples from the gamma distribution equivalent to the probability density function above, as shown in Fig. 8.1a. Each image is normalized before adding noise and scaled back to its original range afterwards, to avoid clipping of values and loss of information.
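The noise-addition step described above can be sketched as follows: normalize the image, multiply by gamma speckle sampled with shape L and scale 1/L, then rescale to the original intensity range. This is an illustrative sketch of the described pipeline, not the authors' exact code; the min–max rescaling is an assumption about how "scaled to the original range" is implemented.

```python
# Speckle-noise addition with normalisation and rescaling to avoid clipping.
import numpy as np

def speckle_image(img, L, rng=np.random.default_rng(0)):
    """Return a speckled copy of img with the same intensity range."""
    lo, hi = img.min(), img.max()
    norm = (img - lo) / (hi - lo)                       # normalise to [0, 1]
    noisy = norm * rng.gamma(L, 1.0 / L, size=img.shape)
    # Rescale back to the original intensity range to avoid clipping
    noisy = (noisy - noisy.min()) / (noisy.max() - noisy.min())
    return noisy * (hi - lo) + lo
```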
8.5.2 Results
In this section, we describe the quantitative and qualitative results obtained from the experiments based on the proposed architecture.
Here, we report the quantitative comparison with super resolution. Table 8.2 shows the comparison for both approaches. For the two-step network, we train ID-CNN (Wang et al. 2017) on 256 × 256 noisy images with 256 × 256 clean target images, and then train the SR network on clean 256 × 256 input images with 1024 × 1024 high-resolution target images. For the single-shot network, we train the network on 256 × 256 noisy images with 1024 × 1024 clean high-resolution targets.
We compare the performance of the networks based on the above-mentioned strategy. PSNR is calculated at the output size, where the target and output image sizes match. The VGG16 loss, however, is calculated after downscaling the images from both cases to 224 × 224. It can be observed from Table 8.2 that the two-step approach produces better results for most of the metrics.
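The PSNR metric used in these comparisons can be sketched with its standard definition (an assumption, since the chapter does not give the formula; the peak value is taken as 255 for 8-bit images).

```python
# Standard PSNR between a predicted and a target image.
import numpy as np

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = np.mean((pred.astype(float) - target.astype(float)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(peak ** 2 / mse)
```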
Table 8.1 Comparison of RRDN without upscaling layer with ID-CNN (Wang et al. 2017)

Noise  | Metric | ID-CNN (Wang et al. 2017) | RRDN 1x
L = 1  | PSNR   | 19.34 | 19.55
       | SSIM   | 0.57  | 0.61
       | VGG16  | 1.00  | 0.81
L = 4  | PSNR   | 22.59 | 22.47
       | SSIM   | 0.77  | 0.79
       | VGG16  | 0.60  | 0.30
L = 10 | PSNR   | 24.74 | 24.58
       | SSIM   | 0.85  | 0.86
       | VGG16  | 0.33  | 0.16
Table 8.2 Comparison of RRDN with upscaling layer with ID-CNN (Wang et al. 2017)

Noise  | Metric | ID-CNN → ISR | Single shot
L = 1  | PSNR   | 19.35 | 18.77
       | SSIM   | 0.61  | 0.60
       | VGG16  | 0.91  | 1.06
L = 4  | PSNR   | 21.95 | 21.38
       | SSIM   | 0.72  | 0.71
       | VGG16  | 0.53  | 0.48
L = 10 | PSNR   | 23.32 | 23.00
       | SSIM   | 0.77  | 0.77
       | VGG16  | 0.29  | 0.25
However, the single-shot network still slightly outperforms the two-step network on the VGG16 metric, which again shows that the network preserves high-level details better when doing both tasks at once instead of denoising the image and then increasing its resolution with two independently trained networks. These higher-level details lead to better perceived image quality and better performance in object detection tasks, even though the pixel-wise MSE or PSNR values come out lower.
In this subsection, we present the results of super resolution and denoising in a single network on SAR images. We present the calculated results in Table 8.3 without comparison since, to the best of our knowledge, we were not able to find any other papers addressing both super resolution and denoising in the context of SAR images. The input images are 64 × 64 noisy images with almost no pixel information available. Figure 8.6 depicts an illustration of the proposed single-step denoising and super-resolution task.
Fig. 8.6 An illustration of the single-step denoising and super resolution task on an input image of size 64 × 64
The RRDN is able to generate a pattern of the primary objects, such as buildings,
from the input noisy image based on the high-level features of the image. It is also evident from the quantitative analysis that our proposed single-step method is able to produce HR images with considerably less noise. These claims are further clarified in the following section, where we analyze the performance in detail.
8.6 Analysis
So far, we have discussed the quantitative results obtained using the proposed method. In this section, we present the qualitative results and the analysis behind them in detail.
Figure 8.7 shows the denoising performance comparison of ID-CNN with RRDN on a 1L noise speckle image with no upscaling. Similarly, Figs. 8.8 and 8.9 show the comparison for 4L and 10L noise speckle images. In all three figures, the results are presented such that parts (a) and (d), the two diagonally opposite images, depict the input speckled image and the target image, respectively, while parts (b) and (c) show the despeckled images generated by our proposed model and by the method proposed in Wang et al. (2017). The denoised output of the proposed network shows better-preserved edges, less smudging, and a sharper image compared to ID-CNN, even though the PSNR difference between the images is not very high. The proposed method consistently gives better-quality images for all noise levels.
Figures 8.10, 8.11 and 8.12 show the comparison for 1L, 4L and 10L noise, respectively. The images are downscaled after super resolution for comparison. The original input size is 256 × 256, and the output size is 1024 × 1024. Starting with the 1L noisy image, we can see a stark difference in the output images produced by the networks. With the two-step network, denoising and then upscaling produces blurred and smeared images. The high-level details are better preserved in the single-step approach than in the two-step one. It can be seen that the two-step approach introduces distortion when higher noise is present, whereas the single-step approach preserves more high-level details, since the content loss has enabled the network to learn to extract details from the noisy image, which helps produce the building patterns even in the presence of very high noise.
Fig. 8.7 Qualitative comparison between the proposed denoising method and Wang et al. (2017) without upscaling (noise level = 1L)
8.6.3 Comparison
Fig. 8.8 Qualitative comparison between the proposed denoising method and Wang et al. (2017) without upscaling (noise level = 4L)
network, hence inducing distortions in the additional step. Also, the distortions left by the ID-CNN network are magnified in the upscaling network; these are reduced in the single-step network, since the features of the input noisy image are available to the network through the dense skip connections.
Fig. 8.9 Qualitative comparison between the proposed denoising method and Wang et al. (2017) without upscaling (noise level = 10L)
8.7 Conclusion
In this work, we presented the results of the proposed network with the upscaling layer. The results show a significant improvement in VGG16 loss because the system can remove noise from the images while preserving the image's relevant features. The single-step performance is comparable to the two-step one, while also avoiding a double pass and the need to train an additional network. Since the single-step system better preserves image features, it may perform better if used further in tasks that rely on such features, like object recognition.
Fig. 8.10 Qualitative comparison between the proposed denoising method and Wang et al. (2017) with upscaling (noise level = 1L)
Fig. 8.11 Qualitative comparison between the proposed denoising method and Wang et al. (2017) with upscaling (noise level = 4L)
Fig. 8.12 Qualitative comparison between the proposed denoising method and Wang et al. (2017) with upscaling (noise level = 10L)
Fig. 8.13 A qualitative comparison between the two-step and single-step approaches for denoising and super-resolving an input image
References
Bai YC, Zhang S, Chen M, Pu YF, Zhou JL (2018) A fractional total variational CNN approach for
SAR image despeckling. ISBN: 978-3-319-95956-6
Bhateja V, Tripathi A, Gupta A, Lay-Ekuakille A (2015) Speckle suppression in SAR images
employing modified anisotropic diffusion filtering in wavelet domain for environment monitoring.
Measurement 74:246–254
Buades A, Coll B, Morel J (2005) A non-local algorithm for image denoising. In: IEEE conference
on computer vision and pattern recognition (CVPR)
Deledalle C, Denis L, Tupin F (2009) Iterative weighted maximum likelihood denoising with prob-
abilistic patch-based weights. IEEE Trans Image Process 18(12):2661–2672
Francesco C et al (2018) ISR. https://github.com/idealo/image-super-resolution
Gai S, Zhang B, Yang C, Lei Y (2018) Speckle noise reduction in medical ultrasound image using
monogenic wavelet and Laplace mixture distribution. Digital Signal Process 72:192–207
Huang S, Liu D, Gao G, Guo X (2009) A novel method for speckle noise reduction and ship target
detection in SAR images. Patt Recogn 42(7):1533–1542
Isola P, Zhu J-Y, Zhou T, Efros A (2017) Image-to-image translation with conditional adversarial
networks. In: IEEE conference on computer vision and pattern recognition (CVPR)
Kostadin D, Alessandro F, Vladimir K, Karen E (2006) Image denoising with block-matching
and 3D filtering. In: Image processing: algorithms and systems, neural networks, and machine
learning
Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J,
Wang Z, Shi W (2017) Photo-realistic single image super-resolution using a generative adversarial
network. In: IEEE conference on computer vision and pattern recognition (CVPR)
Li Y, Wang S, Zhao Q, Wang G (2020) A new SAR image filter for preserving speckle statistical
distribution. Signal Process 196:125–132
Mundhenk TN, Konjevod G, Sakla WA, Boakye K (2016) A large contextual dataset for classifi-
cation, detection and counting of cars with deep learning. In: European conference on computer
vision
Puyang W, He Z, Patel Vishal M (2017) SAR image despeckling using a convolutional neural
network. IEEE Signal Process Lett 24(12):1763–1767
Rana VK, Suryanarayana TMV (2019) Evaluation of SAR speckle filter technique for inundation
mapping. Remote Sensing Appl Soc Environ 16:125–132
Schaefer G, Stich M (2004) UCID: an uncompressed color image database. In: Storage and retrieval
methods and applications for multimedia
Sivaranjani R, Roomi SMM, Senthilarasi M (2019) Speckle noise removal in SAR images using
multi-objective PSO (MOPSO) algorithm. Appl Soft Comput 76:671–681
Tian C, Fei L, Zheng W, Yong X, Zuo W, Lin C-W (2020) Deep learning on image denoising: an
overview. Neural Netw 131:251–275
Wang Z, Bovik AC (2002) A universal image quality index. IEEE Signal Process Lett 9(3):81–84
Wang S, Jiao L, Yang S, Liu H (2016) SAR image target recognition via Complementary Spatial
Pyramid Coding. Neurocomput 196:125–132
Wang X, Yu K, Wu S, Gu J, Liu Y, Dong C, Qiao Y, Change Loy C (2018) ESRGAN: enhanced super-resolution generative adversarial networks. ISBN: 978-3-030-11020-8
Wang P, Zhang H, Patel VM (2017) Generative adversarial network-based restoration of speckled
SAR images. In: IEEE 7th international workshop on computational advances in multi-sensor
adaptive processing
Yin D, Zhongzheng G, Zhang Y, Fengyan G, Nie S, Feng S, Ma J, Yuan C (2020) Speckle noise
reduction in coherent imaging based on deep learning without clean data. Optics Lasers Eng
72:192–207
Part II
Models and Algorithms
Chapter 9
Analysis and Deployment of an OCR—SSD Deep Learning Technique for Real-Time Active Car Tracking and Positioning on a Quadrotor
Abstract This work presents a deep learning solution for object tracking, object detection in images, and real-time license plate recognition, implemented on an F450 quadcopter in autonomous flight. The solution uses the Python programming language, the OpenCV library, remote PX4 control with MAVSDK, OpenALPR, and neural networks built with Caffe and TensorFlow.
9.1 Introduction
A drone can follow an object that updates its route all the time (Pinto et al. 2019). This is called active tracking and positioning, where an autonomous vehicle needs to follow a goal without human intervention. There are some approaches to this mission with drones (Amit and Felzenszwalb 2014; Mao et al. 2017; Patel and Patel 2012; Sawant and Chougule 2015), but it is rarely combined with object detection and OCR due to resource consumption (Lecun et al. 2015). State-of-the-art algorithms (Dai et al. 2016; Liu et al. 2016; Redmon and Farhadi 2017; Ren et al. 2017) can identify the class of a target object being followed. This work presents and analyzes a technique that grants control to a drone during an autonomous flight, using real-time tracking and positioning through an OCR system and deep learning models for plate detection and object detection (Bartak and Vykovsky 2015; Barton and Azhar 2017; Bendea 2008; Braga et al. 2017; Breedlove 2019; Brito et al. 2019; Cabebe 2012; Jesus et al. 2019; Lecun et al. 2015; Lee et al. 2010; Martins et al. 2018; Mitsch et al. 2013; Qadri and Asif 2009; Tavares et al. 2010).
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 121
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_9
122 L. G. M. Pinto et al.
The following sections present the concepts, techniques, models, materials, and methods used in the proposed system, as well as the structures and platforms used for its creation.
There are many kinds of autonomous vehicles (Bhadani et al. 2018; Chapman et al. 2016). This project used an F450 quadcopter drone for outdoor testing and a Typhoon H480 for the simulation. A quadcopter is an aircraft made up of four rotors, with the controller board in the middle and the rotors at the ends (Sabatino et al. 2015). It is controlled by changing the angular speeds of the rotors, which are driven by electromagnetic motors, giving six degrees of freedom, as seen in Fig. 9.1, with x, y and z as the translational movement and roll, pitch and yaw as the rotational movement (Luukkonen 2011; Martins et al. 2018; Strimel et al. 2017).
The altitude and position of the drone can be modified by adjusting the speeds of the motors (Sabatino et al. 2015). The same applies to pitch control, which is achieved by varying the speeds of the rear or front motors (Braga et al. 2017).
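PX4 performs this motor mixing internally; the sketch below (not the chapter's code, and with simplified sign conventions) illustrates how collective thrust and roll, pitch and yaw commands map to the four motor speeds of an X-configuration quadcopter:

```python
# Illustrative motor mixing for an X-configuration quadcopter. The sign
# conventions are assumptions for this sketch; real flight stacks also
# account for rotor spin direction, saturation and geometry.
def mix_quad_x(thrust, roll, pitch, yaw):
    """Return (front_right, rear_left, front_left, rear_right) motor commands.

    Raising `pitch` speeds up the rear motors and slows the front ones,
    tilting the nose down and moving the drone forward.
    """
    return (
        thrust - roll - pitch + yaw,  # front right
        thrust + roll + pitch + yaw,  # rear left
        thrust + roll - pitch - yaw,  # front left
        thrust - roll + pitch - yaw,  # rear right
    )

# Pure hover: all four motors receive the same command.
print(mix_quad_x(0.5, 0.0, 0.0, 0.0))  # (0.5, 0.5, 0.5, 0.5)
```

With a positive pitch command, the two rear motors end up faster than the two front motors, which is exactly the behavior described above.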
This project was implemented using the Pixhawk flight controller (Fig. 9.2), an independent open-hardware flight controller (Pixhawk 2019). Pixhawk supports manual and automatic operation and is suitable for research because it is inexpensive and compatible with most remote control transmitters and receivers (Ferdaus et al. 2017). Pixhawk is built with a dual-processor design, its main core being a 32-bit STM32F427 Cortex-M4 processor running at 168 MHz with 256 KB of RAM and 2 MB of flash (Feng and Fangchao 2016; Meier et al. 2012). The current version is the Pixhawk 4 (Kantue and Pedro 2019).
In this work, the PX4 flight control software (PX4 2019a) was used because it is Pixhawk's native software, avoiding compatibility problems. PX4 is open-source flight control software for drones and other unmanned vehicles (PX4 2019a) that provides a set of tools to create customized solutions.
There are several internal frame types, each with its own flight control parameters, including engine assignment and numbering (PX4 2020); these include the F450 used in this project. The parameters can be modified to refine behavior during a flight. PX4 uses proportional-integral-derivative (PID) controllers, which are among the most widespread control techniques (PX4 2019b).
In PID controllers, the P (proportional) gain is used to minimize the tracking error. It is responsible for a quick response and should therefore be as high as possible. The D (derivative) gain is used for damping; it is necessary, but should be set only as high as needed to avoid overshoot. The I (integral) gain keeps a memory of the error: the I term grows when the desired rate has not been reached for some time (PX4 2019b). The idea is to parameterize the in-flight board model using ground control station (GCS) software, where it is possible to change these parameters and check their effects on each of the drone's degrees of freedom (QGROUNDCONTROL 2019).
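The three roles described above can be sketched as a minimal PID controller. This is a didactic sketch, not PX4's implementation, which adds anti-windup, filtering and cascaded rate loops:

```python
# Minimal PID controller illustrating the P, I and D roles described above.
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # the "error memory" of the I term
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        self.integral += error * dt  # I: grows while the desired rate is not reached
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        # P reacts quickly, D damps the response, I removes steady-state error.
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=2.0, ki=0.0, kd=0.0)
print(pid.update(setpoint=1.0, measurement=0.0, dt=0.1))  # 2.0 (pure P response)
```

Tuning from a GCS amounts to changing `kp`, `ki` and `kd` while watching the response of each degree of freedom.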
A ground control station (GCS), described in the previous chapter, monitors the behavior of the drone during flight and was used to update the drone's firmware, adjust its parameters, and calibrate its sensors. Running on a base computer, it communicates with the UAV wirelessly, for example via telemetry radio or Wi-Fi, displays real-time data on the performance and position of the UAV, and shows data from many of the instruments present on a conventional plane or helicopter (ARDUPILOT 2019; Hong et al. 2008).
This work used the QGroundControl GCS due to its affinity with PX4 and its PID tools, which allowed changes to the PID gains while the drone was still in the air. QGroundControl can read telemetry data simultaneously from multiple aircraft if they use the same version of MAVLink, while also offering more common features such as telemetry logging, a visual GUI display, a mission planning tool, and the ability to adjust the PID gains during flight (QGROUNDCONTROL 2019), as shown in Fig. 9.3, as well as the display of vital flight data (Huang et al. 2017; Ramirez-Atencia and Camacho 2018; Shuck 2013; Songer 2013).
During the development of this project, several SDKs were used to obtain autonomous software control of the drone. They were used in different parts of the project, and the choice among them came down to their versatility. The idea was to choose the SDK with the best balance between robustness and simplicity. The selection included MAVROS (ROS 2019), MAVSDK (2019) and DroneKit (2015) because of their popularity. All of these SDKs build on the MAVLink protocol, which is responsible for controlling the drone.
DroneKit was the simplest, but it does not have the best support for PX4: its external control interface, which accepts remote commands programmatically, was developed for ArduPilot applications (PX4 2019c).
The MAVROS package running on ROS was the best system in terms of robustness, but it was complex to manage. MAVSDK presented the best overall result: it has full support for PX4 applications and is not complex to manage, making it the choice for this project.
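A minimal sketch of the kind of MAVSDK-Python calls involved is shown below. It is not the chapter's code; the connection address matches the PX4 SITL default, and the altitude is an illustrative value:

```python
import asyncio

async def takeoff_and_hold(address="udp://:14540", altitude=4.0):
    """Connect to a PX4 vehicle (SITL by default), arm it and take off.

    Sketch of the MAVSDK-Python control flow; address and altitude are
    assumptions, not values from the chapter.
    """
    from mavsdk import System  # local import: requires the mavsdk package

    drone = System()
    await drone.connect(system_address=address)
    # Wait until the autopilot reports a connection before sending commands.
    async for state in drone.core.connection_state():
        if state.is_connected:
            break
    await drone.action.set_takeoff_altitude(altitude)
    await drone.action.arm()
    await drone.action.takeoff()

# Usage (with a vehicle or SITL instance running):
# asyncio.run(takeoff_and_hold())
```

The same `drone.action` and `drone.offboard` plugins are what make MAVSDK simple to manage compared to a full MAVROS node graph.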
9 Analysis and Deployment of an OCR—SSD Deep Learning … 125
The Robot Operating System (ROS), also presented in the previous chapter, is a framework widely used in robotics (Cousins 2010; Martinez 2013). ROS offers features such as distributed computing, message passing and code reuse (Fairchild 2016; Joseph 2015).
ROS allows robot control to be divided into several tasks called nodes, which are processes that perform computation, allowing modularity (Quigley et al. 2009). The nodes exchange messages with each other in a publisher-subscriber system, where a topic acts as an intermediate store: some nodes publish their content to it while others subscribe to receive this content (Fairchild 2016; Kouba 2019; Quigley et al. 2009).
ROS has a general manager called the "master" (Hasan 2019). The ROS master, as seen in Fig. 9.4, is responsible for providing names and registrations to the services and nodes in the ROS system. It tracks and directs publishers and subscribers to topics and services. The role of the master is to enable individual ROS nodes to locate each other; once located, the nodes communicate with each other peer-to-peer (Fairchild 2016).
To support collaborative development, the ROS system is organized into packages, which are simply directories containing XML files that describe the package and its dependencies. One of these packages is MAVROS, an extensible MAVLink communication node with a proxy for the GCS. The package provides a driver for communication with a variety of autopilots speaking the MAVLink communication protocol, as well as a UDP MAVLink bridge for the GCS. The MAVROS package allows MAVLink communication between different computers running ROS and is currently the officially supported bridge between ROS and MAVLink (PX4 2019d).
9.2.3.3 DroneKit
DroneKit is an SDK built with development tools for UAVs (DRONEKIT 2015). It allows creating applications that run on a host computer and communicate with ArduPilot flight controllers. Applications can insert a level of intelligence into the vehicle's behavior and can perform computationally expensive or real-time tasks, such as computer vision or path planning (3D Robotics 2015, 2019).
Currently, PX4 is not yet fully compatible with DroneKit, which remains more suitable for ArduPilot applications (PX4 2019c).
9.2.4 Simulation
This work used the Gazebo platform (Nogueira 2014) with its PX4 simulator implementation, which provides various vehicle models with PixHawk-specific hardware and firmware simulation through PX4 SITL. It was not the only option, since PX4 SITL is also available for other platforms, such as jMAVSim (Hentati et al. 2018), AirSim (Shah et al. 2019) and X-Plane (Meyer 2020). jMAVSim does not make it easy to integrate obstacles or extra sensors, such as cameras (Hentati et al. 2018), and was discarded mainly for this reason. AirSim was also discarded because, while realistic, it requires a powerful graphics processing unit (GPU) (Shah et al. 2019), which could compete for resources during the object detection phase.
X-Plane, also discarded, is realistic and has a variety of UAV models and environments already implemented (Hentati et al. 2018); however, it requires licensing for its use. Thus, Gazebo was the option chosen, due to its simulation environment with a variety of online resource models, its ability to import meshes from other modeling software, such as SolidWorks or Inkscape (Koenig and Howard 2004), and its free license.
9.2.4.1 Gazebo
The PX4 firmware offers a complete hardware simulation (Hentati et al. 2018; Yan et al. 2002) using its own SITL (software in the loop), which feeds environment inputs to the flight stack. The simulation reacts to the input data exactly as it would in reality and evaluates the total output power required at each rotor (Cardamone 2017; Nguyen et al. 2018).
The PX4 SITL allows you to simulate the same software as on a real platform, rigorously replicating the behavior of the autopilot and its MAVLink protocol, which generalizes to direct use on a real drone (Fathian et al. 2018). The greatest advantage of the PX4 SITL is that a flight controller cannot distinguish whether it is running in simulation or inside a real vehicle, allowing the simulation code to be ported directly to commercially available UAV platforms (Allouch et al. 2019).
Deep learning is a type of machine learning, generally used for classification, regression and feature extraction tasks, with multiple layers of representation and abstraction (Deng 2014). Object detection requires feature extraction, which can be achieved using convolutional neural networks (CNNs), a class of deep neural networks that apply filters at various levels to extract and classify visual information from a source such as an image or video (O'Shea and Nash 2015). This project used a CNN (Holden et al. 2016; Jain 2020; Opala 2020; Sagar 2020; Tokui et al. 2019) to detect visual targets using a camera.
This project used the Caffe deep learning framework (Jia and Shelhamer 2020; Jia et al. 2019, 2014), although there are other options such as Keras (2020), scikit-learn (2020), PyTorch (2020) and TensorFlow (Abadi et al. 2020; Rampasek and Goldenberg 2016; TensorFlow 2016, 2019, 2020a, b). Caffe provides a complete toolkit for training, testing, fine-tuning and deploying models, with well-documented examples for these tasks. It is developed under a free BSD license, is built in C++, and maintains Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and many other deep models efficiently (Bahrampour et al. 2015; Bhatia 2020).
The most common way to express the location of a detected object is to draw a bounding box around it, as seen in Fig. 9.5.
Object detection, detailed in the previous chapter, was the first stage in this tracking
system, as it focuses on the object to be tracked (Hampapur et al. 2005; Papageorgiou
and Poggio 2000; Redmon et al. 2016).
SSD is an object detection algorithm that uses a deep learning model based on neural networks (Liu et al. 2011; Liu 2020; Liu et al. 2016). It was designed for real-time applications like this one. It is lighter than other models because it speeds up the inference of new bounding boxes by reusing pixels or feature maps, which are the result of convolutional blocks and represent the dominant characteristics of the image at different scales (Dalmia and Dalmia 2019; Forson 2019; Soviany and Ionescu 2019).
Its core is built around a technique called MultiBox, a method for fast class-agnostic bounding box coordinate proposals (Szegedy et al. 2014, 2015). Regarding performance and accuracy for object detection, it scores above 74% mAP at 59 frames per second on datasets like COCO and PascalVOC (Forson 2019).
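A Caffe-trained SSD model of this kind can be run through OpenCV's dnn module. The sketch below is illustrative rather than the chapter's code; the 300x300 input size and (104, 117, 123) mean values are the usual SSD/VGG defaults, and the model file paths are assumptions:

```python
def scale_box(rel_box, frame_w, frame_h):
    """Convert an SSD relative (x1, y1, x2, y2) box in [0, 1] to pixel coords."""
    x1, y1, x2, y2 = rel_box
    return (int(x1 * frame_w), int(y1 * frame_h),
            int(x2 * frame_w), int(y2 * frame_h))

def detect(frame, prototxt, caffemodel, conf_threshold=0.5):
    """Run a Caffe SSD model on one BGR frame and return (class, conf, box)."""
    import cv2  # local import: requires OpenCV with the dnn module

    net = cv2.dnn.readNetFromCaffe(prototxt, caffemodel)
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                                 (300, 300), (104.0, 117.0, 123.0))
    net.setInput(blob)
    detections = net.forward()  # shape (1, 1, N, 7): id, class, conf, box
    results = []
    for i in range(detections.shape[2]):
        confidence = float(detections[0, 0, i, 2])
        if confidence > conf_threshold:
            class_id = int(detections[0, 0, i, 1])
            results.append((class_id, confidence,
                            scale_box(detections[0, 0, i, 3:7], w, h)))
    return results

print(scale_box((0.5, 0.5, 1.0, 1.0), 200, 100))  # (100, 50, 200, 100)
```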
MultiBox
The MultiBox layers, as seen in Fig. 9.6, resize images across the network while maintaining the original width and height.
The magic behind the MultiBox technique is the interaction between two critical assessment factors: confidence loss (CL) and location loss (LL). CL measures how confident the network is in its class selection, i.e., whether the correct object class was chosen, using categorical cross-entropy (Forson 2019). Cross-entropy can be seen as the cost of a received response that is not optimal, while entropy represents the ideal answer. Therefore, knowing the entropy, any received response can be measured in terms of how far it is from the optimum (Dipietro 2019).
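The distance-from-optimum idea can be made concrete with a small numeric example (a toy illustration of categorical cross-entropy, not the SSD loss itself, which also weighs the location term):

```python
import math

def cross_entropy(p, q):
    """H(p, q): cost of encoding the true distribution p with predictions q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# The true class is the second of three (one-hot target). A confident correct
# prediction costs little; a poor one costs more. The unreachable optimum for
# a one-hot target is H(p, p) = 0.
p = [0.0, 1.0, 0.0]
print(round(cross_entropy(p, [0.1, 0.8, 0.1]), 3))  # 0.223
print(round(cross_entropy(p, [0.4, 0.2, 0.4]), 3))  # 1.609
```

The worse the predicted distribution, the larger the cross-entropy, which is exactly the penalty CL applies to an incorrect or uncertain class choice.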
SSD Architecture
The SSD architecture adds convolutional feature layers to predict class scores and bounding box coordinates (Dalmia and Dalmia 2019). The final composition of the SSD increases the chances of an object eventually being detected, localized and classified (Howard et al. 2017; Sambasivarao 2019; Simonyan and Zisserman 2019; Szegedy et al. 2015; Tompkin 2019).
In this work, identifying the vehicle's license plate was essential for the drone to be able to follow the car with the correct plate. This task was handled with optical character recognition (OCR), a technique responsible for recognizing optically drawn characters (Eikvil 1993). OCR is a complex problem due to the variety of languages, fonts, and styles in which characters and information can be written, including the complex rules of each of these languages (Islam et al. 2016; Mithe et al. 2013; Singh et al. 2010).
An example of the steps of the OCR technique is shown in Fig. 9.9 (Adams 1854). The steps are as follows: acquisition, preprocessing, segmentation, feature extraction, and recognition (Eikvil 1993; Kumar and Bhatia 2013; Mithe et al. 2013; Qadri and Asif 2009; Sobottka et al. 2000; Weerasinghe et al. 2020; Zhang et al. 2011).
a. Acquisition: a recorded image is fed into the system.
b. Preprocessing: eliminates color variations by smoothing and normalizing pixels. Smoothing applies convolution filters to the image to remove noise and smooth the edges; normalization finds a uniform size, slope and rotation for all characters in the image (Mithe et al. 2013).
c. Segmentation: finds the words written inside the image (Kumar and Bhatia 2013).
d. Feature extraction: extracts the characteristics of the symbols (Eikvil 1993).
e. Recognition: identifies the characters and classifies them, scanning the lines word by word and converting the images into character streams representing the letters of recognized words (Weerasinghe et al. 2020).
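Steps b and c above can be sketched in a deliberately simplified form: a global-threshold binarization followed by a one-dimensional segmentation that splits characters at empty columns. Real OCR engines use far more robust methods; this is only a didactic illustration:

```python
def binarize(gray, threshold=128):
    """Preprocessing (step b), simplified: reduce a grayscale image, given as
    a list of pixel rows, to black (0) and white (1) pixels."""
    return [[1 if px >= threshold else 0 for px in row] for row in gray]

def segment_columns(binary):
    """Segmentation (step c), simplified to 1-D: split a binary image into
    (start, end) column spans separated by empty (all-zero) columns."""
    cols = list(zip(*binary))
    segments, current = [], []
    for i, col in enumerate(cols):
        if any(col):
            current.append(i)
        elif current:
            segments.append((current[0], current[-1]))
            current = []
    if current:
        segments.append((current[0], current[-1]))
    return segments

# A one-row "image" with two bright character strokes separated by a gap:
binary = binarize([[0, 200, 200, 0, 0, 180, 0]])
print(segment_columns(binary))  # [(1, 2), (5, 5)]
```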
This work used a plate detection library called OpenALPR, which relies on Google Tesseract, an open-source OCR framework that can be trained for different languages and scripts. Tesseract converts the image into a binary image and identifies and extracts character outlines. It transforms the outlines into blobs, which are small isolated regions of the digital image, divides the text into words using techniques such as fuzzy spaces and definite spaces, and finally recognizes the text by classifying and storing each recognized word (Mishra et al. 2012; Patel and Patel 2012; Sarfraz et al. 2003; Shafait et al. 2008; Smith 2007; Smith et al. 2009).
ALPR is a way to detect the characters that make up a vehicle's license plate, and it uses OCR for most of the process. It combines object detection, image processing, and pattern recognition (Silva and Jung 2018). It is used in real-life applications such as automatic toll collection, traffic law enforcement, access control to parking lots and road traffic monitoring (Anagnostopoulos et al. 2008; Du et al. 2013; Kranthi et al. 2011; Liu et al. 2011). The four steps of ALPR are shown in Fig. 9.10 (Sarfraz et al. 2003).
OpenALPR
This project used OpenALPR, an open-source ALPR library built in C++ with bindings in C#, Java, Node.js, and Python. The library receives images and video streams for analysis, identifies license plates, and generates text representing the plate characters (OPENALPR 2017). It is based on OpenCV, an open-source computer vision library for image analysis (Bradski and Kaehler 2008), and on Tesseract OCR (Buhus et al. 2016; Rogowski 2018).
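The OpenALPR Python binding can be used roughly as follows. This is a sketch, not the chapter's code: the configuration and runtime-data paths are the library's common install locations, not values from the text, and the confidence threshold is an assumption:

```python
def best_plate(alpr_results, min_confidence=80.0):
    """Pick the highest-confidence plate string from OpenALPR's result dict."""
    best = None
    for plate in alpr_results.get("results", []):
        if plate["confidence"] >= min_confidence:
            if best is None or plate["confidence"] > best["confidence"]:
                best = plate
    return None if best is None else best["plate"]

def recognize(path):
    """Run OpenALPR on one image file ("br" matches Brazilian plates)."""
    from openalpr import Alpr  # local import: requires the OpenALPR bindings

    alpr = Alpr("br", "/etc/openalpr/openalpr.conf",
                "/usr/share/openalpr/runtime_data")
    try:
        if not alpr.is_loaded():
            raise RuntimeError("Error loading OpenALPR")
        return best_plate(alpr.recognize_file(path))
    finally:
        alpr.unload()
```

`recognize_file` returns a dictionary whose `results` list carries one entry per candidate plate, which `best_plate` filters down to a single string.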
This project used datasets, SSD training in the TensorFlow and Caffe frameworks, image preprocessing for OCR, and tracking and motion functions (Shobha and Rangaswamy 2018; Talabis et al. 2015). The LabelImg tool (Talin 2015) was used, with the PascalVOC format as the standard XML annotation (Fig. 9.11), which includes class labels, coordinates of the bounding boxes, image path, image size and name, and other tags (Everingham et al. 2010).
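A PascalVOC annotation of the kind LabelImg produces can be read with the standard library. The sample annotation below is invented for illustration; the tag names are the standard PascalVOC ones:

```python
import xml.etree.ElementTree as ET

# A minimal PascalVOC annotation like those produced by LabelImg
# (filename and coordinates are invented for this example).
VOC_XML = """
<annotation>
  <filename>car_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>420</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>
"""

def read_voc(xml_text):
    """Extract (label, xmin, ymin, xmax, ymax) tuples from a VOC annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append((obj.findtext("name"),
                      *(int(bb.findtext(tag))
                        for tag in ("xmin", "ymin", "xmax", "ymax"))))
    return boxes

print(read_voc(VOC_XML))  # [('car', 120, 80, 420, 300)]
```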
9.3.1.1 Datasets
Three datasets were used: INRIA Person, Stanford Cars and GRAZ-02.
The Stanford Cars dataset (Krause et al. 2013) allowed the SSD to identify cars along the quadcopter's trajectory. This dataset contains 16,185 images of 196 classes of cars, divided into 8,144 training images and 8,041 test images, already annotated with make, model and year.
The INRIA Person dataset (Dalal and Triggs 2005) is a collection of digital images highlighting people, taken over a long period of time, plus some Web images taken from Google Images (Dalal and Triggs 2005). About 2,500 images were collected from this dataset. The person class was added because it was the most common false positive found in single-class training, along with a background class, to help the model discriminate between different environments.
To help improve multi-class detection, the GRAZ-02 dataset (Opelt et al. 2006) was used, since it contains images with high-complexity objects and highly variable backgrounds, including 311 images with persons and 420 images with cars (Oza and Patel 2019).
Unlike TensorFlow (TENSORFLOW 2019, 2020b), Caffe (Bair 2019) does not have a complete object detection API (Huang et al. 2017), which makes starting a new training more complicated. Caffe also does not include a direct visualization tool like TensorBoard. However, its tools subdirectory includes the parse_log.py script, which extracts all relevant training information from the log file and makes it suitable for plotting. Using a Linux plotting tool called gnuplot (Woo and Broker 2020), in addition to an analysis script, it was possible to build a real-time plotting algorithm (Fig. 9.12).
The latest log messages indicated that the model reached an overall mAP of 95.28%, an accuracy gain of 3.55% compared to the first trained model (Aly 2005; Jia et al. 2014; Lin et al. 2020).
To help ALPR identify license plates, image quality was improved through two image preprocessing techniques: brightness variation and unsharp masking. Brightness was controlled in the RGB color system, where the color varies according to the levels of red, green and blue, each with 256 possible values ranging from 0 to 255. To change the brightness, a constant value is added to or subtracted from each level: for brighter images the value is added, while for darker images it is subtracted, as seen in Fig. 9.13, giving a total range of -255 to 255. The only care necessary is to check whether the resulting value would be greater than 255 or less than 0 and clamp it.
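The add-and-clamp rule described above can be expressed directly (a per-pixel sketch; in practice this would be applied to a whole image array at once):

```python
def adjust_brightness(pixel, delta):
    """Add `delta` (-255..255) to one RGB pixel, clamping each level to 0..255."""
    return tuple(min(255, max(0, level + delta)) for level in pixel)

print(adjust_brightness((10, 128, 250), 40))   # (50, 168, 255): 250+40 clamps to 255
print(adjust_brightness((10, 128, 250), -40))  # (0, 88, 210): 10-40 clamps to 0
```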
After adjusting the brightness, the unsharp mask, also called the sharpening filter, was applied to highlight the edges, thus improving the characters of the plate (Dogra and Bhalla 2014; Fisher et al. 2020). Figure 9.14 shows an example of the result of this process.
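The principle of unsharp masking is to add back the detail lost by blurring: sharpened = original + amount * (original - blurred). In practice this is done with a Gaussian blur and a weighted sum over the image; the sketch below shows the arithmetic on flat pixel lists, with the blurred values supplied directly:

```python
def unsharp(original, blurred, amount=1.0):
    """Unsharp masking on flat pixel lists: add back the detail (original
    minus its blurred copy), clamped to the 0..255 range."""
    return [min(255, max(0, round(o + amount * (o - b))))
            for o, b in zip(original, blurred)]

# An edge from 50 to 200: blurring softens it, the mask exaggerates it back.
edge = [50, 50, 200, 200]
soft = [50, 100, 150, 200]   # e.g. a box-blurred copy of `edge`
print(unsharp(edge, soft))   # [50, 0, 250, 200]: edge contrast increased
```

The overshoot on either side of the edge is what makes plate characters stand out more clearly to the OCR stage.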
The algorithm for controlling the drone was built with the help of Python libraries, including OpenALPR, Caffe, TensorFlow, MAVSDK, OpenCV (Bradski and Kaehler 2008) and others. Algorithm 1 is the main algorithm of the solution; it converts updated data from object detection and OCR into speeds along the x, y and z axes of the three-dimensional real-world environment.
Height centering is the only control function shared between views. It positions the
drone at a standard altitude (Fig. 9.15).
When the drone has its camera pointed at the ground, the captured image is analogous to a 2D Cartesian system, and centering is guided by the x and y coordinates, as shown in Fig. 9.16. The idea is to reduce the two values x and y, which represent the x and y distances, respectively, from the detection's central point (pink) to the central point of the frame (blue).
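A simple way to drive those two distances toward zero is a proportional correction (a sketch, not the chapter's Algorithm 1; the gain value is an illustrative assumption):

```python
def centering_velocities(detection_center, frame_size, gain=0.005):
    """Map the pixel offset between the detection's center (pink) and the
    frame's center (blue) to x/y correction speeds; `gain` is illustrative."""
    (cx, cy), (w, h) = detection_center, frame_size
    dx = cx - w / 2   # positive: target is right of the frame center
    dy = cy - h / 2   # positive: target is below the frame center
    return gain * dx, gain * dy

print(centering_velocities((400, 300), (640, 480)))  # (0.4, 0.3)
```

When the detection sits exactly on the frame center, both speeds are zero and the drone holds its position.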
In the rear_view approach, the alignment function is used to center the drone horizontally according to the position of the car. The idea is to keep the drone facing the car, as shown in Fig. 9.17. The yaw value is calculated from the distance between the frame's center and the central detection point along the y-axis.
In the rear_view approach, the zoom function was the most difficult to tune; it used the distance to the object in the camera, the speed of the car, and the balance between a safe distance from the car and the minimum distance at which the OCR still works. The distance to the object was calculated using the relationship between the camera's field of view (FOV) and its sensor dimensions (Fulton 2020), as seen in Fig. 9.18.
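Under a pinhole-camera assumption, the FOV relationship gives a distance estimate from the object's apparent size. This is a sketch of the idea, not the chapter's exact formula; the car width, frame size and FOV below are illustrative values:

```python
import math

def distance_m(real_width_m, pixel_width, frame_width_px, hfov_deg):
    """Estimate object distance from its apparent size.

    Pinhole model: the focal length in pixels follows from the horizontal
    FOV, then distance = real_width * focal_px / pixel_width.
    """
    focal_px = (frame_width_px / 2) / math.tan(math.radians(hfov_deg) / 2)
    return real_width_m * focal_px / pixel_width

# A 1.8 m-wide car spanning 320 px of a 640 px frame with a 90-degree FOV:
print(round(distance_m(1.8, 320, 640, 90.0), 2))  # 1.8
```

As the car recedes, its pixel width shrinks and the estimated distance grows, which is the signal the zoom function needs to keep a safe yet OCR-usable range.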
For the simulated experiment, a Typhoon H480 model was used in Gazebo, as shown in Fig. 9.19. It is available in the PX4 Firmware Tools directory on GitHub, at https://github.com/PX4/Firmware/tree/master/Tools, ready to use in the SITL environment. It was convenient, as it has a built-in gimbal and camera; the gimbal allowed the images to be captured in a very stable way, avoiding compromised detections.
In the simulated experiments, a customized city (Fig. 9.20) was created with pre-built models from the Gazebo model database, available at https://bitbucket.org/osrf/gazebo_models.
The CAT Vehicle is a simulated autonomous vehicle distributed as part of a ROS project to support research on autonomous driving technology. It has fully configurable simulated sensors and actuators imitating a real-world vehicle capable of autonomous driving, including steering and speed control implemented in real-time simulation.
In the experiments, the drone was positioned some distance behind the rear end of the car, as seen in Fig. 9.21, and followed it, capturing its complete image and allowing the algorithm to process the recognized plate. In addition, the simulation allowed the plate to be customized with a Brazilian model, making it very convenient for this project.
The experiments revealed that, in an urban scene, most cars could be detected within a range of 0.5–4.5 m from the camera, as shown by the green area in Fig. 9.22.
The best balance between the number of detections and the total of correct information extracted occurred in the 1.5–3 m detection range, represented by the green area in Fig. 9.23.
Figure 9.24 shows a collision hazard zone, represented by a red area.
The ideal height for safety and image quality varies between 4 and 6 m. Figure 9.25 shows in red the heights at which other objects, such as people and other vehicles, may appear, making it difficult to identify the moving vehicle as the object to be tracked.
A custom F450 (Fig. 9.26) was used for the outdoor experiments. The challenge was to use cheap materials while achieving reasonable results.
The numbers shown in Fig. 9.26 are reference indexes for each of the components listed in Table 9.1. For frame acquisition, an EasyCap capture device was connected to the computer, itself connected to the video receiver.
The Pixhawk 4, the most expensive component, was the flight controller board chosen, as the idea was to use the PX4 flight stack as the control board configuration.
For the outdoor experiment on campus (Fig. 9.27), the model was changed to detect people instead of cars. To avoid colliding with the person who served as the target, everything was coordinated very slowly.
Another experiment used a video as the image source (Fig. 9.28) to check how many detections, plates, and correct extractions the technique could achieve. The video was a recording from Marginal Pinheiros, one of the busiest highways in the city of São Paulo (Oliveira 2020).
The experiment produced 379 vehicle detections out of the 500 vehicles present in the video clip; 227 plates were found, 164 of them with the correct information extracted (Fig. 9.29).
The distance and the brightness level determined the limits in the tests performed, and they are aspects to address in future improvements. A high-definition camera should be used to prevent noise and vibration in the captured images.
The mathematical functions used to calculate the drone's speed were useful in understanding the drone's behavior.
A different approach to position estimation, or a PID controller, could be used to determine the object's route.
References
Buhus ER, Timis D, Apatean A (2016) Automatic parking access using openalpr on raspberry pi3. In:
Journal of ACTA TECHNICA NAPOCENSIS Electronics and Telecommunications. Technical
University of Cluj-Napoca, Cluj, Romania
Cabebe J (2012) Google Translate for Android adds OCR. Available in: ?? Cited November 15th, 2019
Cardamone A (2017) Implementation of a pilot in the loop simulation environment for UAV devel-
opment and testing. Doctoral Thesis (Graduation Project)|Scuola di Ingegneria Industriale e
dell’Informazione, Politecnico di Milano, Milano, Lombardia, Italia
Chapman A (2016) Types of drones: multi-rotor vs XEDwing vs single rotor vs hybrid VTOL.
DRONE Magz I(3):10
Cousins S (2010) ROS on the PR2 [ROS topics]. IEEE robotics & automation magazine, institute of
electrical and electronics engineers (IEEE), vol 17(3), 23-25. https://doi.org/10.1109/mra.2010.
938502
Dai J et al (2016) R-FCN: object detection via region-based fully convolutional networks. In: Proceedings of the 30th international conference on neural information processing systems (NIPS'16). Curran Associates Inc., Red Hook, NY, USA, pp 379–387. ISBN 9781510838819
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05). IEEE. https://doi.org/10.1109/cvpr.2005.177
Dalmia A, Dalmia A (2019) Real-time object detection: understanding SSD. Available
in: https://medium.com/inveterate-learner/real-time-object-detection-part-1-understanding-
ssd-65797a5e675b. Cited November 28th, 2019
Deng L (2014) Deep learning: methods and applications. Found Trends Signal Process, Now Publishers, 7(3–4):197–387. https://doi.org/10.1561/2000000039
Dipietro R (2019) A friendly introduction to cross-entropy loss. Available in: https://rdipietro.
github.io/friendly-intro-to-cross-entropy-loss/. Cited December 06th, 2019
Dogra A, Bhalla P (2014) Image sharpening by Gaussian and Butterworth high pass filter. Biomed Pharmacol J 7(2):707–713. https://doi.org/10.13005/bpj/545
DRONEKIT (2015) DRONEKIT: your aerial platform. Available in: https://dronekit.io/. Cited
December 21st, 2019
Du S et al (2013) Automatic license plate recognition (ALPR): a state-of-the-art review. IEEE Trans Circuits Syst Video Technol 23(2):311–325. https://doi.org/10.1109/tcsvt.2012.2203741
Eikvil L (1993) OCR—optical character recognition. Gaustadalleen 23, P.O. Box 114 Blindern,
N-0314 Oslo, Norway
Everingham M et al (2010) The pascal visual object classes (voc) challenge. Int J Comput Vision
88(2):303–338
Fairchild C (2016) Getting started with ROS. In: ROS robotics by example: bring life to your robot
using ROS robotic applications. Packt Publishing, Birmingham, England. ISBN 978-1-78217-
519-3
Fathian K et al (2018) Vision-based distributed formation control of unmanned aerial vehicles
Feng L, Fangchao Q (2016) Research on the hardware structure characteristics and EKF filtering algorithm of the autopilot PIXHAWK. In: Sixth international conference on instrumentation & measurement, computer, communication and control (IMCCC). IEEE. https://doi.org/10.1109/imccc.2016.128
Ferdaus MM (2017) ninth international conference on advanced computational intelligence
(ICACI). IEEE. https://doi.org/10.1109/icaci.2017.7974513
Fisher R et al (2020) Unsharp Filter. 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK:
The University of Edinburgh. Hypermedia Image Processing Reference (HIPR), School of Infor-
matics, The University of Edinburgh (2003). Available in: https://homepages.inf.ed.ac.uk/rbf/
HIPR2/unsharp.htm. Cited January 14th, 2020
150 L. G. M. Pinto et al.
Joseph L (2015) Why should we learn ros?: Introduction to ROS and its package management. In:
Mastering ROS for robotics programming : design, build, and simulate complex robots using
Robot Operating System and master its out-of-the-box functionalities. Packt Publishing, Birm-
ingham, England. ISBN 978-1-78355-179-8 (2015)
Kantue P, Pedro JO (2019) Real-time identification of faulty systems: development of an aerial
platform with emulated rotor faults. In: 4th conference on control and fault tolerant systems
(SysTol). IEEE. https://doi.org/10.1109/systol.2019.8864732
Karpathy A (2020) Layers used to build ConvNets (2019). Available in: http://cs231n.github.io/
convolutional-networks/. Cited January 02nd 2020
KERAS (2020) KERAS: The python deep learning library (2020). Available in: https://keras.io/.
Cited March 03rd 2020
Koenig N, Howard (2004) A Design and use paradigms for gazebo, an open-source multi-robot
simulator. In: 2004 IEEE/RSJ international conference on intelligent robots and systems (IROS)
(IEEE Cat. No.04CH37566). IEEE. https://doi.org/10.1109/iros.2004.1389727
Kouba A (2019) Services. Available in: http://wiki.ros.org/Services. Cited December 19th, 2019
Kranthi S, Pranathi K, Srisaila A (2011) Automatic number plate recognition. In: International
journal of advancements in technology (IJoAT). Ambala: Nupur Publishing House, Ambala,
India
KrauseJ et al (2013) 3d object representations for ne-grained categorization. In: 4th International
IEEE workshop on 3D representation and recognition (3dRR-13). Sydney, Australia
Kumar G, Bhatia PK (2013) Neural network based approach for recognition of text images. Int J
Comput Appl Foundation Comput Sci 62(14):8–13. https://doi.org/10.5120/10146-4963
Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
Lee T, Leok M, Mcclamroch NH (2010) Geometric tracking control of a quadrotor UAV on SE(3).
In: 49th IEEE conference on decision and control (CDC), IEEE. https://doi.org/10.1109/cdc.
2010.5717652
Lin T-Y et al (2014) Microsoft coco: Common objects in context. In: European conference on com-
puter vision (ECCV). Zurich: Oral. Available in: /se3/wp-content/uploads/2014/09/cocoeccv.pdf;
http://mscoco.org. Cited January 14th, 2020
Liu G et al (2011) The calculation method of road travel time based on license plate recognition tech-
nology. In: Communications in computer and information science. Springer, Berlin, Heidelberg,
385–389. https://doi.org/10.1007%2F978-3-642-22418-854
Liu W (2020) SSD: single shot multibox detector. Available in: https://github.com/weiliu89/caffe/
tree/ssd. Cited January 14th, 2020
Liu W et al (2016) Ssd: Single shot multibox detector. Lecture notes in computer science. Springer
International Publishing, 21–37. https://doi.org/10.1007/978-3-319-46448-0. ISSN 1611-3349
Luukkonen T (2011) Modelling and control of quadcopter. Master Thesis (Independent research
project in applied mathematics)|Department of Mathematics and Systems Analysis, Aalto Uni-
versity School of Science, Espoo, Finland
MAO W et al (2017) Indoor follow me drone. In: Proceedings of the 15th annual international
conference on mobile systems, applications, and services—MobiSys ’17. ACM Press. https://
doi.org/10.1145/3081333.3081362
Martinez A (2013) Getting started with ROS. In: Learning ROS for robotics programming: a prac-
tical, instructive, and comprehensive guide to introduce yourself to ROS, the top-notch, leading
robotics framework. Packt Publishing, Birmingham, England. ISBN 978-1-78216-144-8
Martins WM et al (2018) A computer vision based algorithm for obstacle avoidance. Information
Technology-New Generations. Springer 569–575. https://doi.org/10.1007/978-3-319-77028-4
MAVLINK (2019) MAVLink Developer Guide. Available in: https://mavlink.io/en/. Cited Decem-
ber 01st, 2019
MAVSDK (2019) MAVSDK (develop) (2019). Available in: https://mavsdk.mavlink.io/develop/
en/. Cited December 01st, 2019
152 L. G. M. Pinto et al.
Meier L et al (2012) PIXHAWK: a micro aerial vehicle design for autonomous hight using onboard
computer vision. In: Autonomous robots, vol 33(1-2). Springer Science and Business Media LLC,
21-39. https://doi.org/10.1007/s10514-012-9281-4
Meyer A (2020) X-Plane. Available in: https://www.x-plane.com/. Cited March 01st, 2020
Mishra N et al (2012) Shirorekha chopping integrated tesseract OCR engine for enhanced hindi
language recognition. Int J Comput Appl Foundation Comput Sci 39(6):19–23
Mithe R, Indalkar S, Divekar N (2013) Optical character recognition. In: International Journal of
Recent Technology and Engineering (IJRTE). G18-19-20, Block-B, Tirupati Abhinav Homes,
Damkheda, Bhopal (Madhya Pradesh)-462037, India: Blue Eyes Intelligence Engineering and
Sciences Publication (BEIESP), (1, v. 2), 72–75
Mitsch S, Ghorbal K, Platzer A (2013) On provably safe obstacle avoidance for autonomous robotic
ground vehicles. In: Robotics: science and systems IX. Robotics: Science and Systems Founda-
tion. https://doi.org/10.15607%2Frss.2013.ix.014
Nguyen KD, Ha C, Jang JT (2018) Development of a new hybrid drone and software-in-the-
loop simulation using PX4 code. In: Intelligent computing theories and application. Springer
International Publishing, 84–93. https://doi.org/10.1007/978-3-319-95930-6
Nogueira L (2014) Comparative analysis between Gazebo and V-REP robotic simulators. Master
Thesis (Independent research project in applied mathematics)|School of Electrical and Computer
Engineering, Campinas University, Campinas, São Paulo, Brazil
Oliveira na Estrada (2020) Marginal Pinheiros alterac oes no caminho para Castelo Branco. Avail-
able in: https://www.youtube.com/watch?v=VEpMwK0Zw1g. Cited January 05th, 2020
Opala M (2020) Top machine learning frameworks compared: SCIKIT-Learn, DLIB, MLIB, tensor
flow, and more. Available in: https://www.netguru.com/blog/top-machine-learning-frameworks-
compared. Cited March 04th, 2020
Opelt A et al (2006) Generic object recognition with boosting. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, Institute of Electrical and Electronics Engineers (IEEE), v. 28, n.
3, 416-431. https://doi.org/10.1109%2Ftpami.2006.54
OPENALPR (2020) OpenALPR Documentation. Available in: http://doc.openalpr.com/. Cited
January 07th, 2020
O’Shea K, Nash R (2015) An introduction to convolutional neural networks
Oza P, Patel VM (2019) One-class convolutional neural network. IEEE signal processing letters,
Institute of Electrical and Electronics Engineers (IEEE), v. 26, n. 2, 277-281. https://doi.org/10.
1109%2Flsp.2018.2889273
Papageorgiou C, Poggio T (2000) A trainable system for object detection. International Journal of
Computer Vision, Springer Science and Business Media LLC 38(1):15–33. https://doi.org/10.
1023/a:1008162616689
Patel C, Patel A, Patel D (2012) Optical character recognition by open source OCR tool tesseract: a
case study. In: International journal of computer applications, vol 55(10). Foundation of Computer
Science, 50-56. https://doi.org/10.5120/8794-2784
Pinto LGM et al (2019) A SSD–OCR approach for real-time active car tracking on quadrotors. In:
16th international conference on information technology-new generations (ITNG 2019). Springer,
471–476
PIXHAWK (2019) What is PIXHAWK? Available in https://pixhawk.org/. Cited December 16th,
2019
PX4 (2019) PX4 DEV, MAVROS. Available in: https://dev.px4.io/v1.9.0/en/ros/mavrosinstallation.
html. Cited December 19th, 2019
PX4 (2019) Simple multirotor simulator with MAVLink protocol support. Available in: https://
github.com/PX4/jMAVSim. Cited March 01st, 2020
PX4 (2019) What Is PX4? Available in: https://px4.io. Cited December 03rd 2019
PX4 DEV (2019) MAVROS. Available in: https://dev.px4.io/v1.9.0/en/ros/mavrosinstallation.html.
Cited December 19th, 2019
PX4 DEV (2019) Using DRONEKIT to communicate with PX4. Available in: https://dev.px4.io/
v1.9.0/en/robotics/dronekit.html. Cited December 21st, 2019
9 Analysis and Deployment of an OCR—SSD Deep Learning … 153
PYTORCH (2020) Tensors and dynamic neural networks in Python with strong GPU acceleration.
Available in: https://github.com/pytorch/pytorch. March 03rd, 2020
Qadri MT, ASIF M (2009) Automatic number plate recognition system for vehicle identification
using optical character recognition. In, (2009) international conference on education technology
and computer. IEEE. https://doi.org/10.1109/icetc.2009.54
QGROUNDCONTROL (2019) QGroundControl User Guide (2019). Available in: https://docs.
qgroundcontrol.com/en/. Cited December 06th, 2019
QGROUNDCONTROL (2019) QGROUNDCONTROL: intuitive and powerful ground control sta-
tion for the MAVLink protocol. Available in: http://qgroundcontrol.com/. Cited December 19th,
2019
Quigley M et al (2009) ROS: an open-source robot operating system, vol 3
Ramirez-Atencia C, Camacho D (2018) Extending QGroundControl for automated mission plan-
ning of UAVs. Sensors, MDPI AG 18(7):2339. https://doi.org/10.3390/s18072339
Rampasek L, Goldenberg A (2016) TensorFlow: Biology’s gateway to deep learning? Cell Systems,
Elsevier BV 2(1):12–14. https://doi.org/10.1016/j.cels.2016.01.009
Redmon J (2016) IEEE conference on computer vision and pattern recognition (CVPR). IEEE.
https://doi.org/10.1109/cvpr.2016.91
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, 7263–7271
Ren S et al (2017) Faster r-cnn: towards real-time object detection with region proposal networks. In:
IEEE transactions on pattern analysis and machine intelligence, vol 39(6). Institute of Electrical
and Electronics Engineers (IEEE), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031.
ISSN 2160-9292
Rezatofighi H, (2019) Generalized intersection over union: a metric and a loss for bounding box
regression. In, et al (2019) IEEE/CVF conference on computer vision and pattern recognition
(CVPR). IEEE. https://doi.org/10.1109/CVPR.2019.00075
Rogowski MV da S (2018) LiPRIF: Aplicativo para identificação de permissão de acesso de veículos
e condutores ao estacionamento do IFRS (in portuguese). Monography (Graduation Final Project)
| Instituto Federal de Educação, Ciência e Tecnologia do Rio Grande do Sul (IFRS), Campus Porto
Alegre, Av. Cel. Vicente, 281, Porto Alegre - RS - Brasil
ROS WIKI (2019) MAVROS. Available in: http://wiki.ros.org/mavros. Cited December 19th, 2019
Sabatino F (2015) Quadrotor control: modeling, nonlinear control design, and simulation. Master
Thesis (MSc)|School of Electrical Engineering and Computer Science, KTH Royal Institute of
Technology, Stockholm, Sweden
Sagar A (2020) 5 techniques to prevent obverting in neural networks. Available
in: https://towardsdatascience.com/5-techniques-to-prevent-overfitting-in-neural-networks-
e05e64f9f07. Cited March 03rd, 2020
Sambasivarao K (2019) Non-maximum suppression (NMS). Available in: https://
towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c. Cited December
10th, 2019
Sarfraz M, Ahmed M, Ghaz, S (2003) Saudi arabian license plate recognition system. In: 2003
international conference on geometric modeling and graphics, 2003. proceedings. IEEE Computer
Society. https://doi.org/10.1109%2Fgmag.2003.1219663
Sawant AS, Chougule D (2015) Script independent text pre-processing and segmentation for OCR.
In: International conference on electrical, electronics, signals, communication and optimization
(EESCO). IEEE
SCIKIT-LEARN (2020) sCIKIT-Learn: machine learning in Python. Available in: https://github.
com/scikit-learn/scikit-learn. Cited March 03rd, 2020
Shafait F, Keysers D, Breuel TM (2008) Efficient implementation of local adaptive thresholding
techniques using integral images. In: Yanikoglu BA, Berkner K (ed) Document Recognition and
Retrieval XV. SPIE. https://doi.org/10.1117%2F12.767755
154 L. G. M. Pinto et al.
SHAH S et al (2017) Airsim: High-fidelity visual and physical simulation for autonomous vehicles.
In: Field and service robotics. Available at: https://arxiv.org/abs/1705.05065 Cited December
21st, 2019
Shobha G, Rangaswamy S (2018) Machine learning. In: Handbook of statistics. Elsevier, 197-228.
https://doi.org/10.1016%2Fbs.host.2018.07.004
Shuck TJ (2013) Development of autonomous optimal cooperative control in relay rover config-
ured small unmanned aerial systems. Master Thesis (MSc)|Graduate School of Engineering and
Management, Air Force Institute of Technology, Air University, Wright-Patterson Air Force Base
(WPAFB), Ohio, US
Silva SM, Jung CR (2018) License plate detection and recognition in unconstrained scenarios.
In: Computer vision ECCV Springer International Publishing, 593–609. https://doi.org/10.1007
%2F978-3-030-01258-8 36
Simonyan K, Zisserman A (2019) Very deep convolutional networks for large-scale image recogni-
tion. In: Bengio Y, Lecun Y (ed) 3rd international conference on learning representations, ICLR
2015, San Diego, CA, USA, Conference Track Proceedings. Available in: http://arxiv.org/abs/
1409.1556. Cited December 10th, 2019
Singh R et al (2010) Optical character recognition (OCR) for printed Devnagari script using artificial
neural network. In: International journal of computer science & communication (IJCSC). (1, v.
1), 91–95
Smith R (2007) An overview of the tesseract OCR engine. In: Ninth international conference on
document analysis and recognition (ICDAR 2007) Vol 2. IEEE. https://doi.org/10.1109%2Ficdar.
2007.4376991
Smith R, Antonova D, Lee D-S (2009) Adapting the tesseract open source OCR engine for multi-
lingual OCR. In: Proceedings of the International workshop on Multilingual OCR—MOCR ’09.
ACM Press. https://doi.org/10.1145%2F1577802.1577804
Sobottka K et al (2000) Text extraction from colored book and journal covers. In: Kise Daniel
Lopresti SMK (ed) International journal on document analysis and recognition (IJDAR). Tier-
gartenstrasse 17 69121, Heidelberg, Germany: Springer-Verlag GmbH Germany, part of Springer
Nature, (4, v. 2), 163–176
Songer SA (2013) Aerial networking for the implementation of cooperative control on small
unmanned aerial systems. Master Thesis (MSc)|Graduate School of Engineering and Man-
agement, Air Force Institute of Technology, Air University, Wright-Patterson Air Force Base
(WPAFB), Ohio, US
Soviany P, Ionescu RT (2019) Frustratingly easy trade-o optimization between single-stage and
two-stage deep object detectors. In: Lecture notes in computer science. Springer International
Publishing, 366-378. https://doi.org/10.1007/978-3-030-11018-5
Strimel G, Bartholomew S, Kim E (2017) Engaging children in engineering design through the
world of quadcopters. Children’s Technol Eng J 21:7–11
Stutz D (2015) Understanding convolutional neural networks. In: Current Topics in Computer Vision
and Machine Learning. Visual Computing Institute, RWTH AACHEN University
Szegedy C et al (2014) Scalable, high-quality object detection
Szegedy C, (2015) Going deeper with convolutions. In, et al (2015) IEEE conference on computer
vision and pattern recognition (CVPR). IEEE. https://doi.org/10.1109/CVPR.2015.7298594
Talabis MRM et al (2015) Analytics de ned. In: Information security analytics. Elsevier, 1-12.
https://doi.org/10.1016%2Fb978-0-12-800207-0.00001-0
Talin T (2015) LabelImg. Available in: https://github.com/tzutalin/labelImg. Cited January 13th,
2020
Tavares DM, Caurin GAP, Gonzaga A (2010) Tesseract OCR: a case study for license plate recog-
nition in Brazil
Tensorflow (2016) A system for large-scale machine learning. In: Proceedings of the 12th USENIX
conference on operating systems design and implementation. USENIX Association USA, 265–
283. ISBN 9781931971331
9 Analysis and Deployment of an OCR—SSD Deep Learning … 155
10.1 Introduction
Security concerns arise because credential-based methods are no longer sufficient on their own, so biometrics-based measures are adopted and mapped onto rapidly growing technologies. Biometrics has evolved to the point where its use has become inevitable for gender classification and user identification (Gornale et al. 2015; Sanchez and Barea 2018; Shivanand et al. 2015). Likewise, for many years humans have been interested in the palm and its lines for fortune telling, and scientists have determined associations between palm lines and certain genetic disorders (Kumar and Zhang 2006) such as Down syndrome, Cohen syndrome, and Aarskog syndrome. The palmprint is an important biometric trait that has gained a lot of attention because of its high authentication potential (Charif et al. 2017). Only a few studies have been carried out on gender identification using palmprints. In this context, palmprint-based gender identification is likely to become a popular way of improving the accuracy of other biometric devices, and it can even roughly double the speed of a biometric system, since knowing the gender reduces the comparison problem to about half the database. Gender classification has several applications in the civil and commercial domains, in surveillance, and especially in forensic science for detecting criminals and narrowing down suspects.
Gender identification using palmprints is a binary classification problem: deciding whether a given palm image belongs to a male or to a female. Palmprints are permanent (Kumar and Zhang 2006) and unalterable (Krishan et al. 2014) by nature; although the shape and size of an individual's palm may vary with age, the basic patterns remain unchanged (Kanchan et al. 2013). This makes the palmprint notably distinctive and individualistic. Earlier studies observed that palmprint patterns are genotypically determined (Makinen and Raisamo 2008; Zhang et al. 2011) and that considerable differences exist between female and male palmprints. These properties can therefore be used to identify the gender of an individual. The palmprint contains both high- and low-resolution features such as geometrical features, delta points, principal lines, wrinkles, and minutiae (ridges) (Gornale et al. 2018). In the proposed method, the binarized statistical image features (BSIF) technique is used, and its performance is evaluated on the public CASIA palmprint database. The results outperform those reported in the literature. The remainder of the paper is organized as follows: Sect. 10.2 reviews the work related to palmprint-based gender classification, Sect. 10.3 describes the proposed methodology, Sect. 10.4 discusses the experimental results, Sect. 10.5 compares the proposed method with existing results, and finally Sect. 10.6 presents the conclusions.
10.2 Related Work
The research done earlier reveals that it is possible to authenticate an individual from the palmprint, but the work carried out in this domain is very scanty. In this section, we review studies on gender identification. Amayeh et al. (2008) investigated the possibility of obtaining gender information from the palmprint; they encoded palm and finger geometry using Fourier descriptors, collected data from 40 subjects, and obtained a result of 98% on this limited dataset. Later, Wo et al. (2014) classified geometrical properties of palmprints using a polynomial support vector machine classifier and attained 85% accuracy on a separate set of 180 palmar images collected from 30 subjects. Unfortunately, these datasets are not publicly available for further comparison. Gornale et al. (2018) fused Gabor wavelet features with local binary patterns on the public CASIA palmprint database using a simple nearest-neighbor classifier and observed an accuracy of 96.1%. Xie et al. (2018) explored the hyper-spectral CASIA palmprint dataset with a convolutional
10 Palmprint Biometric Data Analysis for Gender Classification Using BSIF … 159
neural network and fine-tuning of the visual geometry group network (VGG-Net), which achieved a considerable accuracy of 89.2% on the blue spectrum.
Fig. 10.1 Diagram representing the proposed methodology
10.3 Proposed Methodology
The proposed method comprises three steps. In the first step, the palmprint image is preprocessed: the input image is normalized and the region of interest (ROI) is cropped from the palm image. In the second step, features are computed using BSIF. In the last step, the computed features are classified. Figure 10.1 gives a representation of the proposed methodology.
10.3.1 Database
In the proposed work, the authors have utilized the publicly available CASIA palmprint database (CASIA) (http://biometrics.idealtest.org/). From the CASIA palmprint database, we have considered a subset of 4207 palmprints, of which 3078 palm images belong to male subjects and 1129 to female subjects. Sample images from the database are shown in Fig. 10.2.
10.3.2 Preprocessing
Step 3 Two key points are located: key point 1 is the gap between the forefinger and the middle finger, and key point 2 is the gap between the ring finger and the little finger (Shivanand et al. 2019).
Step 4 To determine the palmprint coordinate system, the tangents of the two previously located key points are computed.
Step 5 The line joining these two key points is taken as the y-axis; the centroid is detected along it, and the line passing through the centroid perpendicular to it is treated as the x-axis.
Step 6 Using the coordinates obtained in step 5, the sub-image delimited by these coordinates is taken as the region of interest. The process of region-of-interest extraction is illustrated in Fig. 10.3.
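The geometric steps above can be sketched in code. The following minimal NumPy sketch is an illustration only: the function name, the ROI size, the offset conventions, and the nearest-neighbour resampling are assumptions, not details taken from the chapter.

```python
import numpy as np

def extract_roi(palm, key1, key2, roi_size=64):
    # Sketch of steps 3-6: take the line joining the two finger-gap key
    # points as the y-axis and crop a square ROI along the perpendicular
    # x-axis.  Nearest-neighbour resampling keeps the sketch dependency-free.
    k1, k2 = np.asarray(key1, float), np.asarray(key2, float)
    mid = (k1 + k2) / 2.0                          # centroid of the key-point line
    axis = k2 - k1
    angle = np.arctan2(axis[1], axis[0]) - np.pi / 2
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])              # ROI frame -> image frame
    h, w = palm.shape[:2]
    ys, xs = np.mgrid[0:roi_size, 0:roi_size]
    # ROI pixels: centred on the x-axis, offset into the palm along y
    coords = np.stack([xs - roi_size / 2.0, ys + roi_size / 2.0], axis=-1)
    src = np.rint(coords @ rot.T + mid).astype(int)
    src[..., 0] = np.clip(src[..., 0], 0, w - 1)   # x indices
    src[..., 1] = np.clip(src[..., 1], 0, h - 1)   # y indices
    return palm[src[..., 1], src[..., 0]]

palm = np.arange(10000).reshape(100, 100)          # dummy grayscale palm image
roi = extract_roi(palm, (30, 20), (70, 20), roi_size=16)
```

In practice the key points would come from a finger-valley detector, and the crop offset away from the fingers would be tuned to the acquisition setup.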
Here, 'x' denotes the convolution operation, (b, c) indicates the position within the palmprint image patch, $W_i^{K \times K}$ ($i = 1, \ldots, L$) are the filters, $L$ is the filter length, and $K \times K$ is the filter size. Each filter response $r_i$ is binarized as

$$
d_i = \begin{cases} 1, & \text{if } r_i > 0 \\ 0, & \text{otherwise} \end{cases}
\qquad (10.2)
$$

Likewise, for each pixel (b, c), the $L$ binary responses are packed into a binary code, and the BSIF features are obtained by plotting the histogram of the obtained binary codes from each sub-region of the ROI:

$$
BSIF_{K \times K}(b, c) = \sum_{i=1}^{L} d_i(b, c) \times 2^{\,i-1}
\qquad (10.3)
$$
In this experiment, the filter size is varied from 3 × 3 to 13 × 13, so six different filter sizes are utilized, and the code length is fixed at the standard 8 bits. Consequently, a feature vector of 256 elements is extracted from each male and female palmprint ROI. Figure 10.4 shows a visualization of the application of these filters.
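The feature computation of Eqs. (10.2) and (10.3) can be sketched as follows. This is a hedged illustration, assuming grayscale ROIs as NumPy arrays; the random filters are stand-ins for the learned BSIF filters that ship with the original method, and with L = 8 the histogram has 2^8 = 256 bins, matching the 256-element feature vector described above.

```python
import numpy as np

def bsif_features(roi, filters):
    # Eqs. (10.2)-(10.3): filter the ROI with L filters of size K x K,
    # binarize each response r_i, pack the L bits into one code per pixel,
    # then histogram the codes to get the BSIF descriptor.
    L, K, _ = filters.shape
    patches = np.lib.stride_tricks.sliding_window_view(roi, (K, K))
    codes = np.zeros(patches.shape[:2], dtype=np.int64)
    for i in range(L):
        r = np.einsum('xyuv,uv->xy', patches, filters[i])  # response r_i
        codes += (r > 0).astype(np.int64) << i             # d_i weighted by 2^(i-1)
    hist = np.bincount(codes.ravel(), minlength=2 ** L)
    return hist / hist.sum()                               # normalized histogram

# usage: 8 random zero-mean filters (placeholders for learned BSIF filters)
rng = np.random.default_rng(0)
roi = rng.random((64, 64))
feat = bsif_features(roi, rng.standard_normal((8, 7, 7)))  # 256-dim vector
```

Varying the filter size K here corresponds directly to the 3 × 3 through 13 × 13 sweep reported in the experiment.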
10.3.4 Classifier
Linear discriminant analysis (LDA) is a primary classification technique with small computational complexity, commonly utilized for dimensionality reduction (Zhang et al. 2012). It works by separating the variance both between
162 S. Gornale et al.
and within the classes (Jing et al. 2005). LDA is a binary classifier, assigning the class label '0' or '1' to palmprint images based on the class variances. The nearest-neighbor classifier assigns class labels based on various distance measures: for a given k-value, it explores the immediate neighbors and labels the unlabeled sample by their majority.
$$
d_{\mathrm{Euclidean}}(M, N) = \sqrt{(M - N)^{T} (M - N)}
\qquad (10.4)
$$

$$
d_{\mathrm{Cityblock}}(M, N) = \sum_{j=1}^{n} \left| M_j - N_j \right|
\qquad (10.5)
$$
Here, $Y_i$ is predicted as either the male or the female class by the discriminative function $F(X)$. Geometrically, the support vectors are the training patterns that lie nearest to the decision boundary.
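As a concrete illustration of the nearest-neighbor rule with the two distances of Eqs. (10.4) and (10.5), here is a minimal sketch; the function name, toy vectors, and labels are placeholders, not the CASIA palmprint features.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3, metric="euclidean"):
    # k-NN: rank training samples by distance to the query, then take a
    # majority vote among the k nearest labels.
    diff = np.asarray(train_X, float) - np.asarray(query, float)
    if metric == "euclidean":
        d = np.sqrt((diff ** 2).sum(axis=1))   # Eq. (10.4)
    else:
        d = np.abs(diff).sum(axis=1)           # city-block, Eq. (10.5)
    nearest = np.argsort(d)[:k]                # indices of k closest samples
    return np.bincount(np.asarray(train_y)[nearest]).argmax()

# toy usage: label 0 / 1 stand in for the male / female classes
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
label = knn_predict(X, y, [5.5, 5.0], k=3, metric="cityblock")
```

In the chapter's experiments the inputs would be the 256-element BSIF histograms rather than these 2-D toy points.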
has been obtained by LDA. The support vector machine performed worse than K-NN, yielding 95.9% accuracy. The confusion matrix for this experiment is illustrated in Table 10.3. By varying the filter size with the code length fixed at 8 bits, it has been observed that higher accuracy is attained as the filter size increases. Thus, varying the filter size allows capturing varied information from the ROI palmprint images.
10.5 Comparison with Existing Results
To assess the effectiveness of the proposed method, the authors have compared it with similar works in the literature, presented in Table 10.4. Amayeh et al. (2008) made use of palm geometry, Zernike moments, and Fourier descriptors and obtained 98% accuracy on a relatively small dataset of just 40 images.
However, Wo et al. (2014) utilized very basic geometric properties such as length, height, and aspect ratio with a PSSVM and obtained 85% accuracy. Gornale et al. (2018) fused Gabor wavelet features with local binary patterns on the public CASIA palmprint database. Xie et al. (2018) explored gender classification on the hyper-spectral CASIA palmprint dataset with a convolutional neural network and fine-tuning of the visual geometry group network (VGG-Net). The drawback of the reported works built on self-created databases is that, because they require touch-based palm acquisition, they are unsuited to low-resolution and distant images captured by contact-free methods. The proposed method, in contrast, works with a public database and is suitable for both approaches. Using BSIF filters with a basic K-NN classifier on a relatively larger dataset of 4207 palmprint ROIs, the proposed method outperformed them, yielding an accuracy of 98.2%. A brief summary of the comparison is presented in Table 10.4.
10.6 Conclusion
In this paper, the authors explore the performance of binarized statistical image features (BSIF) on CASIA palmprint images by varying the filter size with the code length fixed at 8 bits; as the filter size is increased, a progressive result of 98.2% is observed for filter sizes of 11 × 11 and above. Thus, varying the filter size allows capturing information from the ROI palmprints. The proposed method is implemented on a contact-free palmprint acquisition process and is applicable to both contact and contact-less methods. Our basic objective in this work is to develop a standardized system that can efficiently distinguish between males and females on
the basis of palmprints. Likewise, with a basic K-NN classifier and BSIF features, the authors have managed to achieve a relatively better result on the larger database of 4207 palmprint images. In the near future, the plan is to devise a generic algorithm that identifies gender based on multimodal biometrics.
Acknowledgements The authors would like to thank the Chinese Academy of Sciences Institute of Automation for providing access to the (CASIA) Palmprint Database for conducting this experiment.
References
Abdenour H, Juha Y, Bordallo M (2014) Face and texture analysis using local descriptors: a comparative analysis. In: IEEE international conference on image processing theory, tools and applications (IPTA). https://doi.org/10.1109/IPTA.2014.7001944
Kong A, Zhang D, Kamel M (2009) A survey of palmprint recognition. Pattern Recognit 42(7):1408–1418
Amayeh G, Bebis G, Nicolescu M (2008) Gender classification from hand shapes. In: 2008 IEEE computer society conference on computer vision and pattern recognition workshops (CVPRW), Anchorage, AK, pp 1–7
Charif H, Trichili Adel M, Solaiman B (2017) Bimodal biometrics system for hand shape and palmprint recognition based on SIFT sparse representation. Multimedia Tools Appl 76(20):20457–20482. https://doi.org/10.1007/s11042-016-3987-9
Gornale SS, Malikarjun H, Rajmohan P, Kruthi R (2015) Haralick feature descriptors for gender
classification using fingerprints: a machine learning approach. Int J Adv Res Comput Sci Softw
Eng 5:72–78. ISSN: 2277 128X
Gornale SS, Patil A, Mallikarjun H, Rajmohan P (2019) Automatic human gender identification
using palmprint. In: Smart computational strategies: theoretical and practical aspects. Springer,
Singapore. Online ISBN 978-981-13-6295-8, Print ISBN 978-981-13-6294-1
Gornale SS (2015) Fingerprint based gender classification for biometrics security: a state-of-the-art technique. American Int J Res Sci Technol Eng Math (AIJRSTEM). ISSN 2328-3491
Gornale SS, Kruti R (2014) Fusion of fingerprint and age biometrics for gender classification using
frequency domain and texture analysis. Signal Image Process Int J (SIPIJ) 5(6):10
Gornale SS, Patil A, Veersheety C (2016) Fingerprint based gender identification using discrete
wavelet transformation and Gabor filters. Int J Comput Appl 152(4):34–37
Gornale S, Basavanna M, Kruti R (2017) Fingerprint based gender classification using local binary
pattern. Int J Comput Intell Res 13(2):261–272
Gornale SS, Patil A, Kruti R (2018) Fusion Of Gabor wavelet and local binary patterns features
sets for gender identification using palmprints. Int J Imaging Sci Eng 10(2):10
Jing X-Y, Tang Y-Y, Zhang D (2005) A Fourier-LDA approach for image recognition. Pattern Recognit 38(3):453–457
Kannala J, Rahtu E (2012) BSIF: binarized statistical image features. In: International conference on pattern recognition (ICPR), pp 1363–1366
Kanchan T, Krishan K, Aparna KR, Shredhar S (2013) Is there a sex difference in palmprint ridge density? Med Sci Law. https://doi.org/10.1258/msl.2012.011092
Krishan K, Kanchan T, Ruchika S, Annu P (2014) Viability of palmprint ridge density in North
Indian population and its use in inference of sex in forensic examination. HOMO-J Comparat
Hum Biol 65(6):476–488
Kumar A, Zhang D (2006) Personal recognition using hand shape. IEEE Trans. Image Process
15:2454–2461
Abstract The motivation behind this paper is to give a single-shot solution of the sudoku puzzle by using computer vision. The study's purpose is twofold: first, to recognise the puzzle by using a deep belief network, which is very useful for extracting high-level features; and second, to solve the puzzle by using a parallel rule-based technique and an efficient ant colony optimization method. Each of the two methods can solve this NP-complete puzzle on its own, but singularly they lack efficiency, so we serialised the two techniques to resolve any puzzle efficiently, in less time and with fewer iterations.
11.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 169
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_11
170 S. Sahoo et al.
Many papers have been published over the years that solve the puzzle in an efficient manner. Several authors proposed different types of computer algorithms to address the standard and larger-sized puzzles; among them, the backtracking algorithm and the genetic algorithm are the most famous. Abu Sayed Chowdhury and Suraiya Akhter even solved sudoku with the help of boolean algebra (Chowdhury and Akhter 2012). The work in this paper is divided into two main parts: first, sudoku image processing for printed digit and grid recognition, and second, finding an appropriate solution for that image.
In 2015, Kamal et al. wrote a comparative analysis paper on sudoku image processing and solved the puzzle by using backtracking, genetic algorithms, etc.; they used a camera-based OCR technique (Kamal et al. 2015). In the same year, Baptiste Wicht and Jean Hennebert proposed a work based on handwritten and printed digit recognition using a convolutional deep belief network (Wicht and Hennebert 2015), which extends the same authors' earlier work on deep belief networks; that work is handy for detecting the grid along with cell numbers (Wicht and Hennebert 2014).
Computer vision plays an active role in detecting and solving the puzzle (Nguyen et al. 2018). Several methods have been proposed over the years to solve the puzzle efficiently, such as the heuristic hybrid approach of Musliu and Winter (2017), the genetic algorithm of Gerges et al. (2018), and parallel processing by Saxena et al. (2018). Saxena et al. combined five rule-based methods with serial and parallel processing algorithms (Saxena et al. 2018).
The preprocessing of an image involves digit detection and edge detection. An
image acquired by camera or loaded from a pre-stored file needs elimination of
unnecessary background noise, correction of the image orientation, and handling of a
non-uniformly distributed illumination gradient. The image preprocessing detection steps are:
5. Deep Belief Networks A deep belief network consists of multi-layer RBMs, where
each layer comprises a set of binary units. A multi-layer convolutional RBM
consists of an input (visible) layer of an I_v × I_v array and N groups of hidden
layers, each an I_H × I_H array, where each of the N hidden groups is associated with
an I_W × I_W filter whose weights are shared within the group. The probabilistic
semantics P(v, h) for the CRBM is defined as:
$$P(v, h) = \frac{1}{Z} \exp\left( \sum_{n=1}^{N} \sum_{i,j=1}^{I_H} \sum_{r,s=1}^{I_W} v_{i+r-1,\,j+s-1}\, W_{rs}^{n}\, h_{ij}^{n} + \sum_{n=1}^{N} b_n \sum_{i,j=1}^{I_H} h_{ij}^{n} + c \sum_{i,j=1}^{I_v} v_{ij} \right)$$
where b_n is the bias of hidden group n and c is the single bias shared by the visible
input units. N groups of pooling units (P^n) shrink the corresponding hidden layers
(H^n) by a constant small integer factor C: each block α of detection units B_α, with
B_α = {(i, j) : h_{i,j} belongs to block α}, is connected to exactly one pooling unit.
11 Recognition of Sudoku with Deep Belief Network and Solving … 173
Each unit of the detection layer receives a signal $L(h_{i,j}^{n})$ from the visible layer below:

$$P(h_{ij}^{n} = 1 \mid v) = \frac{\exp(L(h_{ij}^{n}))}{1 + \sum_{(i',j') \in B_\alpha} \exp(L(h_{i'j'}^{n}))}$$

$$P(p_\alpha^{n} = 0 \mid v) = \frac{1}{1 + \sum_{(i',j') \in B_\alpha} \exp(L(h_{i'j'}^{n}))}$$

$$p(v, h) = \frac{1}{Z} \exp(-E(v, h))$$

where

$$E(v, h) = -\sum_{n} \sum_{i,j} \left( h_{i,j}^{n} (\bar{W}^{n} \ast v)_{i,j} + b_n h_{i,j}^{n} \right) - c \sum_{i,j} v_{i,j}$$
A convolutional deep belief network (CDBN) is made up of a stack of probabilistic
max-pooling CRBMs (Lee et al. 2009), so the total energy of the CDBN is the
sum of the energies of its CRBM layers. The CDBN is used not only to recognise the digits
inside the grid but also to act as feature extractor and classifier at the same time. Our
model is made up of three RBM layers, where the first layer (500 hidden units)
uses the rectified linear unit (ReLU) activation function, defined as:

f(x) = max(0, x)
It is followed by a second layer of the same 500 units, and the final visible
layer is labelled with the digits from 1 to 9 (9 units); a simple base-e exponential
is used in the final layer. Each CRBM except the last layer of the network is trained
in an unsupervised manner using contrastive divergence (CD), and stochastic gradient
descent (SGD) is used for "fine-tuning" the network. The classifier is trained on a
training set of 150 images, in batches of 15 images for ten epochs, and tested on 50
images, with an accuracy of 98.52% for printed digit recognition. Figure 11.3 shows
successful digit recognition by the DBN.
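The unsupervised CD step above can be sketched as follows. This is a minimal CD-1 update for a plain (non-convolutional) binary RBM in NumPy; the shapes, learning rate, and function name are illustrative assumptions, not details from the chapter:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: (batch, n_visible), W: (n_visible, n_hidden),
    b: hidden bias (n_hidden,), c: visible bias (n_visible,)."""
    # Positive phase: sample hidden units given the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to visible, then hidden again.
    pv1 = sigmoid(h0 @ W.T + c)
    ph1 = sigmoid(pv1 @ W + b)
    # Approximate gradients from positive/negative statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b += lr * (ph0 - ph1).mean(axis=0)
    c += lr * (v0 - pv1).mean(axis=0)
    return W, b, c
```

In the chapter's model this kind of update would be applied per CRBM layer (with convolutional weight sharing) before the supervised SGD fine-tuning.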
After successful recognition of the digits and their row and column numbers from 1
to 9, our algorithm is applied to the outcome of digit recognition on the basis of
row and column. The method is divided into two parts: in the first part, a handwritten
general rule-based algorithm is applied, followed by an ant colony optimisation
algorithm. Our handwritten algorithm can solve many puzzles. Newspaper sudoku is
partitioned into three basic categories (easy, medium, hard) or a 5-star rating
according to difficulty level and the number of empty cells. The handwritten general
rule-based algorithm can solve easy and most medium-level puzzles. Hard puzzles
are partially addressed by the general rule-based handwritten algorithm, and their
difficulty level decreases after it runs. If the problem remains unsolved after some
iterations of the general rule-based algorithm, it is handed over to an ACO algorithm,
as ACO is very efficient at solving NP-complete problems.
The general rule-based algorithm is subdivided into six different stages which run
in parallel to solve the problem. The CDBN is used to classify the digits and place
them according to row and column. Each cell is assigned an array that stores
either its probable digits or its recognised digit, and each row, column,
and grid (3 × 3) maintains its available and unavailable lists.
4. Unavail list[row no], Unavail list[col no], Unavail list[Block no] ← [1 … 9]
5. Loop (col no ≤ 9) up to step 13
6. Read the row element by convolution [1 × 1]
7. If (y > 3) then reset y = 1
8. Block no ← int((col no + 2) / 3) + (row no − y)
9. If (digit found and the digit is in Unavail list[row no], Unavail list[col no], Unavail list[Block no]) then
10. Assign the value to cell no[row no][col no][Block no]
11. Add the value to Avail list[row no], Avail list[col no], Avail list[Block no]
12. Eliminate the value from Unavail list[row no], Unavail list[col no], Unavail list[Block no]
13. Row no ← Row no + 1, y ← y + 1
Each empty cell is then assigned a probability array:
14. Row no, col no, y ← 1
15. Loop (col no ≤ 9) up to step 19
16. Read the row element by convolution [1 × 1]
17. Select the empty cell[row no][col no][Block no]
18. Probability array[cell no] ← common elements of Unavail list[row no], Unavail list[col no], and Unavail list[Block no]
19. End of loop
20. End
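The candidate-array construction in steps 14–19 can be sketched as follows. This is a minimal Python illustration, assuming the puzzle is a 9 × 9 list of lists with 0 for empty cells; the function names and data layout are illustrative, not from the chapter:

```python
def block_of(r, c):
    # 3x3 block index (0..8) for cell (r, c).
    return (r // 3) * 3 + c // 3

def candidates(grid):
    """For each empty cell, intersect the digits still unavailable
    (i.e. not yet placed) in its row, column, and block."""
    rows = [set(range(1, 10)) for _ in range(9)]
    cols = [set(range(1, 10)) for _ in range(9)]
    blocks = [set(range(1, 10)) for _ in range(9)]
    for r in range(9):
        for c in range(9):
            d = grid[r][c]
            if d:
                rows[r].discard(d)
                cols[c].discard(d)
                blocks[block_of(r, c)].discard(d)
    return {(r, c): rows[r] & cols[c] & blocks[block_of(r, c)]
            for r in range(9) for c in range(9) if grid[r][c] == 0}
```

These candidate sets are what the six eliminator stages below progressively shrink.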
11.4.2 Methods
and the other is the eliminator. The algorithm is divided into six steps which are
executed in a parallel manner.
As in Fig. 11.4, a hidden single is a candidate that occurs only once in the
probability lists of an entire row or column. The algorithm is expressed as:
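A hidden-single pass over one row can be sketched as follows; this is a minimal illustration assuming candidate sets keyed by (row, col), with illustrative names not taken from the chapter:

```python
def hidden_singles_row(cand, row):
    """Return assignments for digits that appear in exactly one cell's
    candidate set within the given row (a 'hidden single')."""
    placed = {}
    for d in range(1, 10):
        cells = [(row, c) for c in range(9) if d in cand.get((row, c), set())]
        if len(cells) == 1:
            placed[cells[0]] = d
    return placed
```

The column and block variants differ only in which unit's cells are scanned.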
The naked pair eliminator algorithm is useful when a pair of candidates occurs in
only two cells of a single row, column, or block. The possibility of those candidates
belonging to those two cells is then high, so the other candidates can be removed,
as with digits 4 and 8 in Fig. 11.5 (https://www.youtube.
com/watch?v=b123EURtu3It=97s). The same procedure is followed by the naked triple
eliminator; in this case, three cells are covered by either one set of three
candidates or three two-candidate combinations, searched across three different cells
in a single row, column, or block. The algorithm is represented
as follows:
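The naked pair elimination can be sketched as follows; a minimal illustration over one unit (row, column, or block), assuming candidates are stored as sets keyed by cell:

```python
from itertools import combinations

def naked_pairs_unit(cand, unit):
    """Remove a naked pair's two digits from the other cells of a unit.
    `unit` is a list of cell keys; `cand` maps cell -> candidate set."""
    for a, b in combinations(unit, 2):
        pair = cand.get(a, set())
        if len(pair) == 2 and cand.get(b) == pair:
            # Two cells hold exactly the same two candidates: purge the
            # pair from every other cell of this unit.
            for cell in unit:
                if cell not in (a, b):
                    cand.get(cell, set()).difference_update(pair)
    return cand
```

The naked triple variant would search three-cell combinations covered by at most three candidates.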
When a certain candidate appears in only two or three cells of a block and those
cells are aligned in a single column or row, they are called pointing pairs, as with
digit 9 of the 8th block and 5th column in Fig. 11.6. All other appearances of that
candidate outside that block in the same column or row can then be eliminated.
*Above, cell [R][ ][B] means the same row number and same block number but a
different column number.
Fig. 11.5 Naked pair: Digits 4 and 8 are a naked pair for the 4th block
(https://www.youtube.com/watch?v=b123EURtu3It=97s)
Fig. 11.6 Pointing pair: Digit 9 is the pointing pair in the figure; all its appearances outside
block 8 can be eliminated
When a certain candidate appears in only two or three cells of a row or column and
those cells lie in a single block, they are called a claiming pair. The algorithm for a
row (similar for a column) is represented by:
X-wing is most used by enigmatologists to solve highly rated, difficult puzzles by
minimising the candidate distribution. The X-wing technique applies when a candidate
appears in exactly two cells of each of two rows, and those four cells form the
corners of a rectangle or square. The candidate's other appearances in those
two columns can then be eliminated, as in Fig. 11.7. The same technique can also be
applied starting from columns.
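The row-based X-wing elimination can be sketched as follows; a minimal illustration under the same assumed candidate-set layout (names are illustrative, not from the chapter):

```python
def xwing_rows(cand, digit):
    """Eliminate `digit` from columns covered by a row X-wing.
    `cand` maps (row, col) -> candidate set for empty cells."""
    # Record, per row, the columns where the digit is a candidate,
    # keeping only rows with exactly two such columns.
    cols_per_row = {}
    for r in range(9):
        cols = [c for c in range(9) if digit in cand.get((r, c), set())]
        if len(cols) == 2:
            cols_per_row[r] = tuple(cols)
    rows = list(cols_per_row)
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            r1, r2 = rows[i], rows[j]
            if cols_per_row[r1] == cols_per_row[r2]:
                # Rectangle corners found: purge the digit from the
                # two shared columns outside rows r1 and r2.
                for c in cols_per_row[r1]:
                    for r in range(9):
                        if r not in (r1, r2):
                            cand.get((r, c), set()).discard(digit)
    return cand
```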
All the above algorithms are independent of each other. The central theme is to find
a possible candidate array for each empty cell with the help of the CDBN; after that,
the parallel algorithms help to minimise these arrays by eliminating elements from the
probable array of each cell. If all six methods were implemented serially one after
another, it would be more time-consuming and inefficient. The main aim of parallel
execution is to minimise the time cost and increase the efficiency of implementation.
Some steps are constituted for rows and columns separately, and these too are executed
in a parallel manner. In a single epoch, all six methods are applied exactly once, and
the results are fed as input to the next iteration.
The parallel algorithm is capable of solving most easy to medium-level
problems within 100–150 epochs. Many challenging puzzles are also
answered within 250–300 epochs. However, the rule-based parallel methods fail to
handle higher-difficulty problems efficiently: for some of these puzzles, the parallel
algorithm stops with more than one possible digit candidate remaining in the array
for a single cell, as the methods are unable to eliminate further candidates after a
certain epoch. But the rule-based parallel method efficiently minimises the candidate
arrays, so that another greedy-based algorithmic approach can then be applied with
fewer epochs and less time than a rule-based parallel algorithm alone. For this
reason, the ant colony optimization method is serialised with the parallel rule-based
method.
The ant colony optimisation method as a sudoku solver (Lloyd and Amos 2019) is
used together with a constraint propagation method (Musliu and Winter 2017). In our
ACO, each ant covers, in its local copy of the puzzle, only those cells whose
probability array holds multiple candidates. An ant adds a fixed amount of pheromone
when it picks a single element from the array of possible candidates, and it deletes
that element's other occurrences in the same row, column, and block. One pheromone
matrix (9 × 81) is created to keep track of the updating of each component of the
possible arrays. The best ant is the one that covers all cells with multiple candidates
in the puzzle.
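The 9 × 81 pheromone bookkeeping can be sketched as follows; a minimal illustration in which the evaporation rate and deposit amount are hypothetical values, not taken from the chapter:

```python
import numpy as np

def make_pheromone():
    # One row per digit (1..9), one column per cell (81 cells).
    return np.full((9, 81), 1.0 / 9.0)

def deposit(tau, cell, digit, amount=0.1):
    """Add a fixed amount of pheromone for choosing `digit` at `cell`."""
    tau[digit - 1, cell] += amount
    return tau

def evaporate(tau, rho=0.05):
    """Global pheromone evaporation."""
    tau *= (1.0 - rho)
    return tau

def pick_digit(tau, cell, candidates, rng):
    """Choose among a cell's remaining candidates with probability
    proportional to their pheromone."""
    weights = np.array([tau[d - 1, cell] for d in candidates], dtype=float)
    return rng.choice(list(candidates), p=weights / weights.sum())
```

Each ant would call `pick_digit` on every multi-candidate cell of its local copy, depositing pheromone for the choices made by the best ant of the iteration.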
For this paper, we experimented on various datasets available on the Internet,
such as https://github.com/wichtounet/sudokudataset used in Wicht's paper
(Wicht and Hennebert 2015) (for both recognition and solving) and
https://www.kaggle.com/bryanpark/sudoku (for testing the solver only). In the first
half, digit and grid recognition using the deep belief network achieved an accuracy
of 98.58% with an error rate of 1.42%; it thus works nearly perfectly at recognising
digits according to the grid, and we processed the puzzles that were recognised fully
(Figs. 11.8 and 11.9). The rule-based algorithm alone succeeds with a success rate of
96.63% within 304 epochs, while ant colony optimisation alone is capable of solving
98.87% of puzzles within 263 epochs; the serialisation of the parallel and ant colony
optimisation methods gives the highest success rate, 99.34%, within 218 epochs. Maximum
epochs for the three algorithms are calculated on puzzles with 62 blank cells.
11.6 Conclusion
We designed and implemented a new rule-based algorithm associated with an ant colony
optimization technique to solve puzzles detected by image processing using
convolutional deep belief network methods. The CDBN is efficient at recognising
printed digits properly, even in highly difficult puzzles. The system is designed so
that a player can stop at any iteration to see hints for the answer. In
the future, we will try to implement digit recognition and the solver using a deep
convolutional neural network alone.
References
Chowdhury AS, Akhter S (2012) Solving Sudoku with Boolean Algebra. Int J Comput Appl
52(21):0975–8887
Gerges F, Zouein G, Azar D (2018) Genetic algorithms with local optima handling to solve sudoku
puzzles. In: Proceedings of the 2018 international conference on computing and artificial intelli-
gence, pp 19–22
Kamal S, Chawla SS, Goel N (2015) Detection of Sudoku puzzle using image processing and
solving by backtracking, simulated annealing and genetic algorithms: a comparative analysis. In:
2015 third international conference on image information processing (ICIIP). IEEE, pp 179–184
Lee H, Grosse R, Ranganath R, Ng AY (2009) Convolutional deep belief networks for scalable unsu-
pervised learning of hierarchical representations. In: Proceedings of the 26th annual international
conference on machine learning, pp 609–616
Lloyd H, Amos M (2019) Solving Sudoku with ant colony optimization. IEEE Trans Games
Musliu N, Winter F (2017) A hybrid approach for the sudoku problem: using constraint programming
in iterated local search. IEEE Intell Syst 32(2):52–62
Nguyen TT, Nguyen ST, Nguyen LC (2018) Learning to solve Sudoku problems with computer
vision aided approaches. Information and decision sciences. Springer, Singapore, pp 539–548
Ronse C, Devijver PA (1984) Connected components in binary images: the detection problem
Saxena R, Jain M, Yaqub SM (2018) Sudoku game solving approach through parallel processing. In:
Proceedings of the second international conference on computational intelligence and informatics.
Springer, Singapore, pp 447–455
Wicht B, Hennebert J (2014) Camera-based sudoku recognition with deep belief network. In: 2014
6th international conference of soft computing and pattern recognition (SoCPaR). IEEE, pp 83–88
Wicht B, Hennebert J (2015) Mixed handwritten and printed digit recognition in Sudoku with
Convolutional deep belief network. In: 2015 13th international conference on document analysis
and recognition (ICDAR). IEEE, pp 861–865
Chapter 12
Novel DWT and PC-Based Profile
Generation Method for Human Action
Recognition
12.1 Introduction
Human action recognition (HAR) is the process of recognizing various actions that
people perform, either individually or in a group. These actions may be walking,
running, jumping, swimming, shaking hands, dancing, and many more. There are many
challenges in HAR, such as differences in the physiques of the humans performing
actions (shape, size, color, etc.), differences in the background scene (occlusion,
lighting, or other visual impairments), differences in recording settings (recording
speed, type of recording: 2D or 3D, gray-scale or colored video), and differences in
motion performance by different people (e.g., differences in walking speed or jumping
height). For an algorithm to succeed, the methods used for action representation and
classification are of utmost importance. This motivated research work in this field
and the development of a plethora of different techniques
T. Zaveri · R. Shah
Nirma University, Ahmedabad 382481, India
P. Prajapati (B) · R. Shah
Government Engineering College, Patna 384265, India
e-mail: payalprajapati2808@gmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 185
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_12
186 T. Zaveri et al.
which fall under local and global representation approaches, as classified by Weinland
et al. (2011) based on how actions are represented.
In global representation approaches, the human body has to be detected in the image,
usually with background subtraction techniques. While this step is a disadvantage, it
results in a reduction of image size and complexity. Silhouettes and contours are usually
used for representing the person. Contour-based features like Cartesian coordinate
feature (CCF), the Fourier descriptor feature (FDF) (H et al. 1995; RD and L 2000;
R et al. 2009), centroid-distance feature (CDF) (D and G 2004, 2003), and chord-
length feature (CLF) (D and G 2004, 2003; S et al. 2012) are extracted from contour
boundary of the person in the aligned silhouettes image (ASI). The region inside the
contour of the human object is the silhouette, and silhouette-based features are extracted
from the silhouette in the ASI image. Some common silhouette-based features are
histogram of gradient (HOG) (D and G 2003; N et al. 2006), histogram of optical
flow (HOOF) (R et al. 2012), and structural similarity index measure (SSIM) (Z et al.
2004).
In local representation approaches, videos are treated as a collection of small
unrelated patches that involve the regions of high variations in spatial and temporal
domains. Centers of these patches are called spatio-temporal interest points (STIPs).
STIPs are represented by the information related to the motion in their patches and
then clustered to form a dictionary of visual words. Each action is represented by
bag of words model (BOW) (Laptev et al. 2008). Several STIPs detectors have been
proposed recently. For example, Laptev (2005) applied Harris corner detector for
spatio-temporal case and proposed Harris3D detector, Dollar et al. (2005) applied
1D Gabor filters temporally and proposed the Cuboid detector, Willems et al. (2008)
proposed Hessian detector that measures the saliency with the determinant of 3D Hes-
sian matrix, and Wang et al. (2009) introduced dense sampling detector that finds
STIPs at regular points and scales, both spatially and temporally. Various descrip-
tors used for STIPs include histogram of oriented gradients (HOG) descriptor and
histogram of optical flow (HOF) descriptor (H et al. 1995), gradient descriptor (Dol-
lár et al. 2005), 3D scale-invariant feature transform (3D SIFT) (Scovanner et al.
2007), 3D gradients descriptor (HOG3D) (A et al. 2008), and the extended speeded
up robust features descriptor (ESURF) (Willems et al. 2008). Some limitations of
global representation such as sensitivity to noise and partial occlusion and the com-
plexity of accurate localization by object tracking and background subtraction can
be overcome by local representation approaches. Local representation also has some
drawbacks, such as ignoring the spatial and temporal connections between local
features and action parts that are necessary to preserve the intrinsic characteristics
of human actions.
In this paper, a novel energy-based approach is presented, based on the fact that
every action has a unique energy profile associated with it. We try to model some
physics theorems related to the kinetic energy generated while performing an action.
As is known, if you apply a force over a given distance, you have done work,
W = F × D. Through the work–energy theorem, this work can be related to the change
in kinetic energy (or gravitational potential energy) of an object: W = ΔK, where K
stands for kinetic energy.
12 Novel DWT and PC-Based Profile Generation Method for Human … 187
As per normal observation, the force required for performing various actions differs
(e.g., running requires more force than walking), which leads to differences in the
amount of work done and hence in the energy profiles of different actions. So, energy
difference can serve as a feature for distinguishing among various actions. We model
the different energy profiles associated with various actions using the fundamentals
of phase congruency and the discrete wavelet transform. The goal is to automate the
human activity recognition task by integrating seven energy-based features, calculated
in the frequency domain, with a few machine learning algorithms. All seven features
are based on exploring the differences in energy profiles for different actions.
Section 12.2 of the paper describes the background theory, the proposed methodology,
and the analysis of energy profiles. Section 12.3 presents the details of the dataset
and the results obtained, and compares them with an existing method. Section 12.4
concludes the paper, followed by the references.
In the frequency domain, we can easily identify the high-frequency components of the
original signal. This property of the frequency domain can be exploited to aid action
recognition. When an action is performed, the high-frequency content changes in the
part of the frame where the action occurs; moreover, it is observed that
the change is different for different actions. Other transforms like the Fourier
transform cannot be used here because, in the Fourier domain, there is no relation
to the spatial coordinates. DWT resolves this issue: in DWT, the frequency
information is obtained at its original location in the spatial domain. As a direct
consequence of this property, the frequency information obtained can be visualized as
an image in the spatial domain. Figure 12.3 shows the result of DWT for the wave
action, which gives approximation (low-pass) and detail (high-pass) information. The
detail information (horizontal, vertical, and diagonal) is used in the energy calculation.
A mathematical concept of the 2D wavelet transform is given by Eqs. (12.1) (scaling
function), (12.2) (wavelet functions), and (12.3), (12.4):

$$W_\varphi(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \varphi_{j_0,m,n}(x, y), \tag{12.3}$$

$$W_\psi^k(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \psi^k_{j_0,m,n}(x, y); \quad k = H, V, D. \tag{12.4}$$
Steps for extracting features using DWT and PC are given below:

1. The training video is divided into frames, which are stored in F. Take the
difference of every alternate frame stored in F; the result, called a frame difference
image, is obtained by:

$$F_{dx} = |F_p - F_{(p+2)}|, \tag{12.6}$$
where $F_{dx}$ is the absolute difference between two alternate frames and p varies
from 1 to 46.
2. Apply the analysis equations of DWT (Sect. 12.2.1.1) on each frame difference
image, which gives $c_A$, $c_H$, $c_V$, $c_D$:

$$W_\varphi(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} F_{dx}(x, y)\, \varphi_{j_0,m,n}(x, y) \tag{12.7}$$

$$W_\psi^k(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} F_{dx}(x, y)\, \psi^k_{j_0,m,n}(x, y); \quad k = H, V, D. \tag{12.8}$$
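Steps 1–2 can be sketched as follows; a minimal one-level 2D Haar DWT in NumPy (the Haar basis is used here as an assumption, since the chapter does not name its wavelet) applied to the frame difference images:

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar DWT: returns (cA, cH, cV, cD).
    img must have even height and width."""
    a = img[0::2, 0::2].astype(float)   # top-left of each 2x2 block
    b = img[0::2, 1::2].astype(float)   # top-right
    c = img[1::2, 0::2].astype(float)   # bottom-left
    d = img[1::2, 1::2].astype(float)   # bottom-right
    cA = (a + b + c + d) / 2.0          # approximation (low-low)
    cH = (a - b + c - d) / 2.0          # horizontal detail
    cV = (a + b - c - d) / 2.0          # vertical detail
    cD = (a - b - c + d) / 2.0          # diagonal detail
    return cA, cH, cV, cD

def detail_energies(frames):
    """Energy of the H, V, D subbands of each alternate-frame difference."""
    profiles = []
    for p in range(len(frames) - 2):
        fdx = np.abs(frames[p] - frames[p + 2])      # Eq. (12.6)
        _, cH, cV, cD = haar_dwt2(fdx)               # Eqs. (12.7)-(12.8)
        profiles.append((np.sum(cH**2), np.sum(cV**2), np.sum(cD**2)))
    return profiles
```

The three per-frame energy sequences are the horizontal, vertical, and diagonal profiles plotted in Fig. 12.3.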
Let UL, LL, UR, LR depict the parts, namely upper left, lower left, upper
right, and lower right, respectively, obtained after dividing $F_{PC}$ into four parts.
Let $TE_{PC}$ be the total energy of the matrix $F_{PC}$; then the energy contribution of
each individual component is calculated by

$$E_G = \frac{\sum_{i=1}^{n} (F_G(i))^2}{TE_{PC}}, \quad G = UL, LL, UR, LR \tag{12.11}$$
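Eq. (12.11) can be sketched as follows; a minimal NumPy illustration, assuming the phase-congruency map has even dimensions:

```python
import numpy as np

def quadrant_energies(fpc):
    """Fractional energy of the four quadrants of a phase-congruency
    map, per Eq. (12.11): E_G = sum(F_G**2) / TE_PC."""
    h, w = fpc.shape
    te = np.sum(fpc.astype(float) ** 2)              # total energy TE_PC
    quads = {
        "UL": fpc[: h // 2, : w // 2],
        "UR": fpc[: h // 2, w // 2:],
        "LL": fpc[h // 2:, : w // 2],
        "LR": fpc[h // 2:, w // 2:],
    }
    return {g: np.sum(q.astype(float) ** 2) / te for g, q in quads.items()}
```

By construction the four fractions sum to 1, so they describe how the action's energy is distributed across the frame.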
The proposed idea is based on the fact that the energy profile over an entire video
differs across action classes. We employ the above feature extraction algorithms to
generate feature vectors for videos of different classes. We also plot the results
obtained by the two methods, DWT and PC, individually, to show that the energy
profiles obtained match the underlying observation.
The values of the horizontal, vertical, and diagonal energies obtained from the frame
differences were calculated and plotted for analysis. The profiles for four bend-action
videos and their average profile are shown in Fig. 12.3.
Figure 12.3 represents the energy profiles of H , V , and D components for multiple
videos of bend action, and their average is also shown. As expected, the vertical
energy is the highest initially. This is because the video starts with a person standing
upright. Also, a dip is observed in the diagonal energies as the person bends more and
more. After about half the video, the horizontal energy increases, since the person's
posture is now almost horizontal. After that, the diagonal energies increase again as
the person begins to stand up. Similarly, the profiles created for the other actions
also behave as expected.
Fig. 12.3 Pattern analysis of horizontal, vertical, and diagonal energy profiles obtained using DWT: (a–c) horizontal, vertical, and diagonal energy profiles; (d–f) the corresponding average profiles
Fig. 12.4 Energy profiles for the upper-left, upper-right, lower-left, and lower-right parts for the wave action, with their corresponding average profiles
The result obtained after applying phase congruency is divided into four parts, namely
upper left, lower left, upper right, and lower right. The energy profiles of
these four parts are plotted and analyzed in Fig. 12.4 for the wave action.
Figure 12.4 shows that the energy distribution in all four parts is almost equal
for the wave action, which is consistent with the expected result. The profiles
obtained for the other actions also have definite, unique patterns, making these
energy profiles suitable for use in classifiers to recognize actions.
We used the Weizmann dataset in our experiment. This dataset contains ten day-to-day
actions: walk, run, bend, wave (one hand), wave2 (two hands), jack, jump, skip,
gallop sideways, and jump in place (pjump), performed by ten actors. Figure 12.5
shows these actions being performed by different actors.
In our experiment, fourteen action classes were used: run left, run right,
pjump, jump left, jump right, wave2, walk left, walk right, skip left, skip right,
side left, side right, jack, and bend. A total of 85 videos were used for the
experiment. The energy values obtained from the DWT and phase congruency methods were
used to construct the training matrix for the classifier. Since we had 14 action
classes with 85 videos in total, of 92 frames each, and 7 features (computed for each
of the 46 frame-difference images, giving 7 × 46 = 322 values per video), we get an
85 × 322 feature matrix. Each row in the feature matrix contains the extracted
features for a particular action, and each row is labeled by the type of action. The
dataset has been divided randomly into training and test datasets in a 60–40 proportion.
Ten such divisions are employed in a cross-validation approach. We used three
classifiers, SVM, Naive Bayes, and J48, to test our algorithm. To evaluate the
proposed algorithms, four parameters, sensitivity, specificity, precision, and
accuracy, are calculated from the confusion matrix. Sensitivity is the percentage of
positive-labeled instances that were predicted as positive, specificity is the
percentage of negative-labeled instances that were predicted as negative, and
precision is the percentage of positive predictions that are correct. Accuracy tells
what percentage of all predictions is correct. The equation for each parameter is
given in Kacprzyk (2015).
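The four parameters can be computed from a binary confusion matrix as follows; a minimal sketch, assuming the per-class counts tp, fp, fn, tn have already been tallied:

```python
def metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, precision, and accuracy
    from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, precision, accuracy
```

For the 14-class setting, each class would be scored one-vs-rest and the counts read off the class's row and column of the confusion matrix.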
We have analyzed the results of DWT and PC separately and together for the above
three classifiers, as shown in Tables 12.1, 12.2 and 12.3, respectively. They show
that DWT and PC together give better sensitivity, specificity, precision, and
accuracy for all three classifiers. Of the three classifiers, SVM gives the best
results in terms of all four parameters.
12.4 Conclusion
References
Abstract In this paper, a novel filter-based model for the classification of tobacco
leaves for the purpose of harvesting is proposed. The filter-based model relies on
estimating the degree of ripeness of a leaf using a combination of filters and color
models. The degree of ripeness of a leaf is computed using the density of maturity
spots on the leaf surface and the yellowness of the leaf. A new maturity spot
detection algorithm, based on the combination of a first-order edge extractor (Sobel
or Canny edge detector) and second-order high-pass filtering (Laplacian filter), is
proposed to compute the density of maturity spots per unit area of a leaf. Further, a
simple thresholding classifier is designed for the purpose of classification. The
superiority of the proposed model in terms of effectiveness and robustness is
established empirically through extensive experiments.
13.1 Introduction
The agriculture sector plays a vital role in the economy of any developing country
like India. The sources of employment, wealth, and security of a nation depend
directly on the qualitative and quantitative production of agriculture, which is an
outcome of a complex interaction of soil, seed, water, and agrochemicals. Enhancement
of productivity needs the proper type, quantity, and timely application of soil,
seed, water, and agrochemicals at specific sites. This demands precision agriculture
practices such as soil mapping, disease mapping at both seedling and plant level,
weed mapping,
P. B. Mallikarjuna (B)
JSS Academy of Technical Education, Bengaluru, Karnataka, India
e-mail: pbmalli2020@gmail.com
D. S. Guru
University of Mysore, Mysore, Karnataka, India
e-mail: dsg@compsci.uni-mysore.ac.in
C. Shadaksharaiah
Bapuji Institute of Engineering and Technology, Davangere, Karnataka, India
e-mail: shadaa@rediffmail.com
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 197
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_13
198 P. B. Mallikarjuna et al.
Properties of a crop such as color, texture, and shape can be exploited to evaluate
its ripeness for harvesting purposes using computer vision algorithmic models. To
show the importance of computer vision techniques in precision agriculture practices,
especially at the selective harvesting stage, we have taken the tobacco crop as a
case study. Sixty days after plantation of the tobacco crop, we can find three types
of leaves: unripe, ripe, and over-ripe. One should harvest ripe leaves to get quality
cured tobacco leaves from the curing process. As an indication of ripeness, small
areas called maturity spots appear non-uniformly on the top surface of a leaf. As
ripeness increases, the yellowness of the leaf also increases, and maturity spots are
more numerous in over-ripe leaves than in ripe leaves.
Though tobacco is a commercial crop, no attempt has been made on harvesting of
tobacco leaves using CV techniques. However, a few attempts can be traced on ripeness
evaluation of other commercial crops for automatic harvesting. A direct color mapping
approach was developed to evaluate the maturity levels of tomato and date fruits (Lee
et al. 2011). This color mapping method maps the RGB values of colors of interest
into a 1D color space using polynomial equations; it uses a single index value to
represent each color in the specified range for the purpose of maturity evaluation of
tomato and date fruits. A robotic system for harvesting ripe tomatoes in a greenhouse
(Yin et al. 2009) was designed based on the color features of tomatoes, with
morphological operations used to denoise the images and handle tomato overlapping and
shelter. Medjool date fruits were taken as a case study to demonstrate the performance
of a novel color quantization and color analysis technique for fruit maturity
evaluation and surface defect detection (Lee et al. 2008).
A novel and robust color space conversion and color index distribution analysis
technique for automated date maturity evaluation was proposed in Lee et al. (2008).
Applications of mechanical and automatic fruit grading were discussed by Gao et al.
(2009), who also compared the performance of CV-based automatic fruit grading with
mechanical fruit grading. A neural network system using a genetic algorithm was
implemented to evaluate the maturity levels of strawberry fruits (Xu 2008); in this
work, the H (hue) frequency of the HSI color model was used to distinguish maturity
levels of strawberry fruits under variable illumination conditions. An intelligent
algorithm based on a neural network was developed to classify coffee cherries into
under-ripe, ripe, and over-ripe (Furfaro et al. 2007). A coffee ripeness monitoring
system was proposed by Johnson et al. (2004); in this work, the reflectance spectrum
was recorded from four major components of the coffee field, viz. green leaf,
under-ripe fruit, ripe fruit, and over-ripe fruit, and ripeness evaluation of the
coffee field was performed based on the reflectance spectrum. A Bayesian classifier
was exploited for the classification of intact tomatoes based on their ripening
stages (Baltazar et al. 2008).
We made an initial attempt at ripeness evaluation of tobacco leaves for automatic harvesting in our previous work (Guru and Mallikarjuna 2010), where we exploited only the combination of the Sobel edge detector and the Laplacian filter with the CIELAB color model to estimate the degree of ripeness of a leaf and conducted experiments on our own small dataset of 244 sample images. In our current work, we exploited two combinations, (i) the combination of the Laplacian filter and the Sobel edge detector and (ii) the combination of the Laplacian filter and the Canny edge detector, with different color models
200 P. B. Mallikarjuna et al.
viz., RGB, HSV, MUNSELL, CIELAB, and CIELUV, and conducted experiments on our own large dataset of 1300 sample images. Indeed, the success of our previous attempt motivated us to take up the current work, wherein the previous model has been extended significantly.
Thus, the overall contributions of this work are:
• Creation of a relatively large dataset of harvesting tobacco leaves, due to the non-availability of a benchmark dataset.
• Introduction of the concept of fusing image filters of different orders for maturity spot detection on tobacco leaves.
• Development of a model which combines the density of maturity spots and color information to estimate the degree of ripeness of a leaf.
• Design of a simple threshold-based classifier for the classification of leaves.
• Conduct of experiments on the large tobacco harvesting dataset created.
The proposed model has four stages: leaf segmentation, detection of maturity spots,
estimation of degree of ripeness, and classification.
The CIELAB (Viscarra et al. 2006) color model was used to segment the leaf area from its background, which includes soil, stones, and noise. According to domain experts, the color of a tobacco leaf varies from green to yellow. Therefore, the chromaticity coordinate is used to segment the leaf from its background. For illustration,
we have shown three different samples (Figs. 13.2, 13.3, and 13.4) of tobacco leaves
and also the results of the segmentation.
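The excerpt does not give the exact chromaticity threshold used; the sketch below, assuming the CIELAB a* channel with an illustrative cutoff (`a_max` is our choice, not the paper's), shows how green-to-yellow leaf pixels can be separated from brownish soil and stone pixels:

```python
import numpy as np

def rgb_to_lab(rgb):
    """Convert an sRGB image (H x W x 3, values in [0, 1]) to CIELAB (D65)."""
    # Linearize sRGB
    c = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ (sRGB primaries, D65 white point)
    M = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = c @ M.T
    xyz = xyz / np.array([0.95047, 1.0, 1.08883])   # normalize by white point
    f = np.where(xyz > (6 / 29) ** 3, np.cbrt(xyz),
                 xyz / (3 * (6 / 29) ** 2) + 4 / 29)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def segment_leaf(rgb, a_max=15.0):
    """Keep pixels whose a* chromaticity is below a_max: green leaves have
    strongly negative a*, yellow leaves near-zero a*, while brownish soil
    and stones have clearly positive a*."""
    lab = rgb_to_lab(rgb)
    return lab[..., 1] < a_max
```

The mask returned by `segment_leaf` can then be applied to the RGB image before the maturity spot detection stage.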
The proposed maturity spot detection algorithm mainly consists of two stages. The first stage involves applying a second-order high-pass filter and a first-order edge extraction algorithm separately to a leaf image, the results of which are later subjected
13 Ripeness Evaluation of Tobacco Leaves for Automatic Harvesting … 201
Fig. 13.2 a A sample tobacco leaf with rare maturity spots, b segmented image
Fig. 13.3 a A sample tobacco leaf with moderate maturity spots, b segmented image
Fig. 13.4 a A sample tobacco leaf with rich maturity spots, b segmented image
Fig. 13.5 Block diagram of the proposed maturity spots detection algorithm.
to subtraction in the second stage. The block diagram of the proposed maturity spot detection algorithm is given in Fig. 13.5.
The maturity spots are highly visible in the R-channel grayscale image compared to the G-channel and B-channel grayscale images. Therefore, the RGB image of a tobacco leaf is transformed into its R-channel grayscale image. A second-order high-pass filter is exploited to enhance the maturity spots (fine details) present on the red-channel grayscale image of a tobacco leaf, as it highlights transitions in intensities in an image. Any high-pass filter in the frequency domain attenuates low-frequency components without disturbing high-frequency information. Therefore, to extract the finer details of small maturity spots, we recommend applying a second-order derivative high-pass filter (in our case, the Laplacian filter), which enhances them much better than first-order derivative high-pass filters (e.g., Sobel and Roberts). Then, we transform the second-order filtered image into a binary image using a suitable threshold. The resultant binary image contains veins and the leaf boundary in addition to maturity spots. Image subtraction is used to eliminate the vein and leaf boundary edge pixels from this binary image. Therefore, we recommend subtracting an edge image containing only vein and leaf boundary edge pixels from the resultant binary image.
A first-order edge extraction operator is used to extract the edge image from the red-channel grayscale image of the segmented original RGB color image, since, with suitably high thresholds, it responds mainly to strong edges such as veins and the leaf boundary. The obtained edge image is then subtracted from the binary image obtained from second-order high-pass filtering. The image subtraction results in an image containing only maturity spots. The number of connected components present in that image gives the number of maturity spots on the leaf.
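The two-stage procedure above can be sketched as follows, using the Sobel operator for the first-order stage (as in Method 1). The threshold values here are illustrative stand-ins, not the paper's tuned values:

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float)
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
SOBEL_Y = SOBEL_X.T

def conv2d(img, kernel):
    """'Same' 3x3 filtering (correlation) with zero padding."""
    p = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * p[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out

def count_spots(red, lap_thresh=50.0, edge_thresh=700.0):
    """Stage 1: binarize the |Laplacian| response (spots + veins + boundary)
    and extract strong first-order (Sobel) edges, i.e. veins and boundary only.
    Stage 2: subtract the edge pixels and count 8-connected components."""
    lap_bin = np.abs(conv2d(red, LAPLACIAN)) > lap_thresh
    gx, gy = conv2d(red, SOBEL_X), conv2d(red, SOBEL_Y)
    edge_bin = np.hypot(gx, gy) > edge_thresh
    spots = lap_bin & ~edge_bin                 # image subtraction
    # Count 8-connected components with an explicit stack (flood fill)
    seen = np.zeros_like(spots, bool)
    count = 0
    for i in range(spots.shape[0]):
        for j in range(spots.shape[1]):
            if spots[i, j] and not seen[i, j]:
                count += 1
                seen[i, j] = True
                stack = [(i, j)]
                while stack:
                    y, x = stack.pop()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < spots.shape[0] and 0 <= nx < spots.shape[1]
                                    and spots[ny, nx] and not seen[ny, nx]):
                                seen[ny, nx] = True
                                stack.append((ny, nx))
    return count
```

The high `edge_thresh` mimics the tuning described in the paper: strong vein and boundary edges survive the first-order stage and are removed, while weak spot edges do not and are retained.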
Let us consider a tobacco leaf (Fig. 13.6) to illustrate the proposed maturity spot detection algorithm. As discussed above, when we apply the transformation (RGB to R-channel grayscale) to the original RGB image of the segmented tobacco leaf (Fig. 13.6a), the maturity spots are highly noticeable in the R-channel grayscale image, as shown in Fig. 13.6b. The second-order high-pass filter (Laplacian filter) is used to enhance the maturity spots. The Laplacian-filtered image (Fig. 13.6c) is converted into a binary image (Fig. 13.6d) using a suitable predefined threshold. As stated above, this binary image contains maturity spots, vein, and boundary edge pixels. So, to remove vein and boundary pixels, we subtracted the edge image (Fig. 13.6e) obtained from first-order edge extraction (Canny edge detector) from the binary image (Fig. 13.6d). Finally, the image subtraction results in an image (Fig. 13.6f) containing only maturity spots.
13.2.4 Classification
During harvesting, we can find three types of leaves on a plant: unripe, ripe, and over-ripe. Unripe leaves have a low degree of ripeness, ripe leaves a moderate degree, and over-ripe leaves a high degree. Therefore, we have used a simple thresholding classifier based on two predefined thresholds T1 and T2 for the classification of tobacco leaves into three classes: unripe, ripe, and over-ripe. The threshold T1 is selected as the midpoint of the distributions of the degree of ripeness of samples of the unripe and ripe classes. The threshold T2 is selected as the midpoint of the distributions of the degree of ripeness of samples of the ripe and over-ripe classes.
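Reading "midpoint" as the midpoint between the mean degrees of ripeness of the two adjacent classes (one plausible interpretation; the paper later refines the thresholds using the class overlap), the threshold selection can be sketched as:

```python
from statistics import mean

def midpoint_threshold(scores_a, scores_b):
    """Midpoint between the mean degree of ripeness of two adjacent classes;
    an illustrative reading of the 'midpoint of distributions' rule."""
    return (mean(scores_a) + mean(scores_b)) / 2.0

# Toy degree-of-ripeness samples for the three classes (values are made up)
unripe, ripe, over_ripe = [0.10, 0.20], [0.40, 0.50], [0.70, 0.90]
t1 = midpoint_threshold(unripe, ripe)       # separates unripe from ripe
t2 = midpoint_threshold(ripe, over_ripe)    # separates ripe from over-ripe
```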
Fig. 13.6 a Segmented RGB image of a tobacco leaf, b red-channel grayscale image, e image after leaf vein and boundary extraction using the Canny edge detector, f image consisting of only maturity spots after subtraction of image (e)
Then, the class label for a given leaf is decided based on the two predefined thresholds T1 and T2, as given in Eq. 13.3.

$$\mathrm{DES} = \frac{\text{Number of maturity spots}}{\text{Leaf area}} \tag{13.2}$$

$$\text{Class label} = \begin{cases} C_1, & D < T_1 \\ C_2, & T_1 < D < T_2 \\ C_3, & D > T_2 \end{cases} \tag{13.3}$$
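Eqs. 13.2 and 13.3 can be sketched minimally as follows (the class-name strings are illustrative):

```python
def degree_of_spots(num_spots, leaf_area):
    """Eq. 13.2: maturity spot density (DES) = spot count / leaf area."""
    return num_spots / leaf_area

def classify_leaf(d, t1, t2):
    """Eq. 13.3: three-way thresholding into C1 (unripe), C2 (ripe),
    and C3 (over-ripe) based on the degree of ripeness d."""
    if d < t1:
        return "unripe"      # C1
    if d > t2:
        return "over-ripe"   # C3
    return "ripe"            # C2
```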
13.3.1 Dataset
13.3.2 Results
The proposed model estimates the degree of ripeness of a leaf using the proposed method of maturity spot detection and color models. The proposed maturity spot detection algorithm is a combination of first-order edge extraction and second-order filtering. We exploited first-order edge extractors such as the Sobel edge detector and the Canny edge detector, and we used the Laplacian filter for second-order filtering. Hence, we have two combinations: (i) the combination of the Laplacian filter and the Sobel edge detector and (ii) the combination of the Laplacian filter and the Canny edge detector. Henceforth, in this paper, we refer to these combinations as Method 1 and Method 2, respectively.
The Sobel edge detector works with a single threshold (Tr). Therefore, in Method 1, we fixed the threshold (Tr) value of the Sobel detector in the training phase.
Table 13.2 Average classification accuracy using Method 2 (combination of Laplacian filter and Canny edge detector) for varying thresholds Tr1 and Tr2 of the Canny edge detector
Tr1 ↓ \ Tr2 → 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0 - 71.65 72.96 72.9 72.97 73.73 73.93 73.9 74.23 74.43 74.65
0.1 – – 73.21 73.17 73.4 73.08 73.45 73.54 74.02 74.21 75.76
0.2 – – – 74.66 74.96 75.3 76.31 77.31 78.3 78.84 79.49
0.3 – – – – 86.59 85.6 85.2 85.51 85.48 85.73 84.51
0.4 – – – – – 81.59 81.49 82.3 81.71 82.03 82.04
0.5 – – – – – – 81.97 81.77 82.31 82.07 82.28
0.6 – – – – – – – 81 81.32 79.13 79.64
0.7 – – – – – – – – 77.08 76.98 76.17
0.8 – – – – – – – – – 75.19 76.47
0.9 – – – – – – – – – – 75.62
1 – – – – – – – – – – –
To fix it, we varied the threshold (Tr) value from 0 to 1 in steps of 0.1. Experimentally, it is found that the best average classification accuracy is achieved for Tr = 0.2.
On the other hand, the Canny edge detector works with two thresholds (Tr1 and Tr2). Therefore, in Method 2, we fixed the threshold values Tr1 and Tr2 of the Canny edge detector in the training phase. Experimentally, it is found that the best values of Tr1 and Tr2 are 0.3 and 0.4, respectively. The thresholds Tr1 and Tr2 of the Canny edge detector are tuned in such a way that the leaf boundary edge pixels and leaf vein edge pixels are extracted clearly. Therefore, there is very little probability of leaf vein edge pixels and leaf boundary edge pixels being counted as maturity spots while estimating the maturity spot density. However, selecting suitable values of Tr1 and Tr2 is a challenging task. Pixels with values between Tr1 and Tr2 are weak edge pixels; those that are 8-connected to strong edge pixels (pixel values greater than Tr2) are retained, which performs edge linking. Therefore, the values of Tr1 and Tr2 are set such that the probability of leaf boundary and vein weak edge pixels being missed is minimized. By varying the thresholds Tr1 and Tr2, it is found that the best average classification accuracy is achieved for Tr1 = 0.3 and Tr2 = 0.4, as given in Table 13.2.
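The tuning procedure amounts to an exhaustive grid search over threshold pairs with Tr1 < Tr2, as in Table 13.2; `evaluate` below is a stand-in for the training-phase accuracy measurement, not the paper's actual routine:

```python
import itertools

def tune_canny_thresholds(evaluate, step=0.1):
    """Exhaustive grid search over (tr1, tr2) pairs with tr1 < tr2 in [0, 1].
    `evaluate(tr1, tr2)` must return the average classification accuracy."""
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    best_pair, best_acc = None, -1.0
    for tr1, tr2 in itertools.combinations(grid, 2):
        acc = evaluate(tr1, tr2)
        if acc > best_acc:
            best_pair, best_acc = (tr1, tr2), acc
    return best_pair, best_acc
```

With an accuracy surface peaking near (0.3, 0.4), as reported in Table 13.2, the search recovers that pair.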
For the estimation of the degree of ripeness of a leaf, we vary the weights W1 and W2 (Eq. 13.1) such that the best average classification accuracy is achieved (W1 = 0.7 and W2 = 0.3) over all sets of training and testing samples. This is shown in Fig. 13.8 using Method 2 for 60% training.
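Eq. 13.1 itself is not reproduced in this excerpt; a plausible form consistent with the weights W1 and W2 described here is a convex combination of the maturity spot density and a color-model-based ripeness score (this form is our assumption, not the paper's statement):

```python
def degree_of_ripeness(spot_density, color_score, w1=0.7, w2=0.3):
    """Hypothesized form of Eq. 13.1: a convex combination (w1 + w2 = 1)
    of normalized maturity spot density and a color-based ripeness score."""
    assert abs(w1 + w2 - 1.0) < 1e-9
    return w1 * spot_density + w2 * color_score

d = degree_of_ripeness(0.8, 0.6)   # 0.7 * 0.8 + 0.3 * 0.6 = 0.74
```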
To fix T1 and T2 for classification, we considered 150 samples from each class and plotted the distribution of samples over the degree of ripeness (Fig. 13.9). Since there is a large overlap between the classes, as shown in Fig. 13.9, we recommend selecting the thresholds by studying the overlap between the unripe and ripe classes (for T1) and between the ripe and over-ripe classes (for T2), as follows.
Fig. 13.8 Average classification accuracy obtained by Method 2 (combination of Laplacian filter and Canny edge detector) under varying weights W1 and W2
Table 13.3 Classification accuracy using Method 1 (combination of Laplacian filter and Sobel edge detector) with different color models
Training examples  Color model  Minimum accuracy  Maximum accuracy  Average accuracy  Std. deviation
30% RGB 51.3055 57.1953 53.6121 1.6049
HSV 50.2696 54.8161 51.8439 1.1337
MUNSELL 60.5778 69.4947 64.6835 2.3205
CIELAB 78.3534 82.6594 80.43 1.0805
CIELUV 80.3467 83.1168 82.018 1.0023
40% RGB 50.713 55.1152 52.7922 1.1783
HSV 50.2438 54.9609 52.0307 1.2323
MUNSELL 60.5041 68.7531 63.8465 2.417
CIELAB 78.2631 82.7723 80.8273 1.2004
CIELUV 80.561 84.129 82.4455 1.268
50% RGB 50.2906 54.2789 52.1122 0.9674
HSV 49.6817 53.4766 51.0107 1.0167
MUNSELL 59.7798 68.2022 63.685 1.9417
CIELAB 79.0244 83.5759 80.9809 1.1023
CIELUV 80.8077 85.4418 83.3289 1.2991
60% RGB 50.4844 54.6317 52.4391 1.1576
HSV 49.5419 52.3252 50.7931 0.8702
MUNSELL 57.9345 68.3146 62.2161 2.1327
CIELAB 78.3068 85.4018 81.9757 1.4671
CIELUV 80.332 85.2349 83.5698 1.3244
precision, recall, and F-measure. The minimum, maximum, average, and standard deviation of classification accuracy over all 20 trials using the proposed simple thresholding classifier for both methods are given in Tables 13.3 and 13.4, respectively.
Classification accuracy using Method 1 with different color models, viz., RGB, HSV, MUNSELL, CIELAB, and CIELUV, is given in Table 13.3. Similarly, classification accuracy using Method 2 with the different color models is given in Table 13.4. The confusion matrix across leaf types using Method 1 for the best average classification accuracy is given in Table 13.5. Similarly, the confusion matrix across leaf types using Method 2 for the best average classification accuracy is given in Table 13.6. The corresponding precision, recall, and F-measure for individual classes are presented for both Method 1 and Method 2 in Fig. 13.10. From Tables 13.3 and 13.4, it is observed that the best average classification accuracy is achieved for Method 2 with the CIELUV color model.
Table 13.4 Classification accuracy using Method 2 (combination of Laplacian filter and Canny edge detector) with different color models
Training examples  Color model  Minimum accuracy  Maximum accuracy  Average accuracy  Std. deviation
30% RGB 66.6959 73.3427 70.5315 1.5919
HSV 54.9965 61.1442 58.7328 1.4712
MUNSELL 67.3114 74.1979 70.1425 1.6726
CIELAB 78.9152 82.9264 81.1872 0.9323
CIELUV 82.7643 88.7130 85.2472 1.8509
40% RGB 67.0316 73.2570 70.3530 1.7062
HSV 54.9682 62.1305 57.9434 1.8306
MUNSELL 67.0676 72.0768 69.7736 1.2273
CIELAB 78.9224 81.8946 80.2090 0.8748
CIELUV 81.0633 88.9688 86.3933 2.0416
50% RGB 65.8649 73.5737 70.3453 1.8429
HSV 52.9818 61.8088 58.6286 1.8589
MUNSELL 67.9882 72.8952 70.0468 1.1668
CIELAB 79.4480 83.5848 81.3107 1.1918
CIELUV 81.8075 89.7177 86.1646 2.3228
60% RGB 64.8270 73.8114 70.6765 1.9759
HSV 54.3457 62.1077 58.5386 2.1944
MUNSELL 66.3917 72.5029 70.8465 1.3572
CIELAB 78.2949 84.9594 81.8277 1.6287
CIELUV 80.7693 89.5043 86.5945 2.2025
Table 13.5 Confusion matrix across leaf types using Method 1 (combination of Laplacian filter and Sobel edge detector) for the best average classification accuracy
Unripe Ripe Over-ripe
Unripe 106 23 0
Ripe 20 222 25
Over-ripe 0 18 106
Table 13.6 Confusion matrix across leaf types using Method 2 (combination of Laplacian filter and Canny edge detector) for the best average classification accuracy
Unripe Ripe Over-ripe
Unripe 111 18 0
Ripe 10 228 29
Over-ripe 0 14 110
13.3.3 Discussion
When we applied our previous method, the combination of the Laplacian filter and the Sobel edge detector (Method 1) with CIELAB (Guru and Mallikarjuna 2010), to our large dataset, we achieved a classification accuracy of about 81% (see Table 13.3). To improve classification accuracy, our previous method (Guru and Mallikarjuna 2010) was extended with different color models, viz., RGB, HSV, MUNSELL, and CIELUV, and achieved a good classification accuracy of about 83% with the CIELUV color model (see Table 13.3) on our large dataset. To improve classification accuracy further, the current work was extended to the combination of the Laplacian filter and the Canny edge detector (Method 2) with different color models, viz., RGB, HSV, MUNSELL, CIELAB, and CIELUV. We achieved an improvement in classification accuracy of 3% using Method 2 with the CIELUV color model (see Table 13.4) on our large dataset.
13.4 Conclusion
In this work, we present a novel model based on filtering strategies for the classification of tobacco leaves for the purpose of harvesting. A method for the detection of maturity spots is proposed, and a method for finding the degree of ripeness of a leaf is presented. Further, we proposed a simple thresholding classifier for the effective classification of tobacco leaves. In order to investigate the effectiveness and robustness of the proposed model, we conducted experiments for both methods, (i) the combination of the Laplacian filter and the Sobel edge detector and (ii) the combination of the Laplacian filter and the Canny edge detector, on our own large dataset. Experimental results reveal that the combination of the Laplacian filter and the Canny edge detector is superior to the combination of the Laplacian filter and the Sobel edge detector.
References
Baltazar A, Aranda JI, Aguilar GG (2008) Bayesian classification of ripening stages of tomato fruit using acoustic impact and colorimeter sensor data. Comput Electron Agric 60(2):113–121
Furfaro R, Ganapol BD, Johnson LF, Herwitz SR (2007) Neural network algorithm for coffee
ripeness evaluation using airborne images. Appl Eng Agric 23(3):379–387
Gao H, Cai J, Liu X (2009) Automatic grading of the post-harvest fruit: a review. In: Third IFIP inter-
national conference on computer and computing technologies in agriculture. Springer, Beijing,
pp 141–146
Guru DS, Mallikarjuna PB (2010) Spots and color based ripeness evaluation of tobacco leaves for
automatic harvesting. In: First international conference on intelligent interactive technologies and
multimedia. ACM, IIIT Allahabad., India, pp 198–202
Jabro JD, Stevens WB, Evans RG, Iversen WM (2010) Spatial variability and correlation of selected
soil in the AP horizon of a CRP grassland. Appl Eng Agric 26(3):419–428
Johnson LF, Herwitz SR, Lobitz BM, Dunagan SE (2004) Feasibility of monitoring coffee field
ripeness with airborne multispectral imagery. Appl Eng Agric 20(6):845–849
Lee DJ, Chang Y, Archibald JK, Greco CJ (2008a) Color quantization and image analysis for
automated fruit quality evaluation. In: IEEE international conference on automation science and
engineering. IEEE, Trieste, Italy, pp 194–199
Lee DJ, Chang Y, Archibald JK, Greco CJ (2008b) Robust color space conversion and color distri-
bution analysis techniques for date maturity evaluation. J Food Eng 88:364–372
Lee D, Archibald JK, Xiong G (2011) Rapid color grading for fruit quality evaluation using direct
color mapping. IEEE Trans Autom Sci Eng 8:292–302
Manickavasagam A, Gunasekaran JJ, Doraisamy P (2007) Trends in Indian flue-cured Virginia tobacco (Nicotiana tabacum) processing: harvesting, curing and grading. Res J Agric Biol Sci 3(6):676–681
Patricio DI, Rieder R (2018) Computer vision and artificial intelligence in precision agriculture for grain crops: a systematic review. Comput Electron Agric 153:69–81
Viscarra RA, Minasny B, Roudier P, McBratney AB (2006) Colour space models for soil science.
Geoderma 133:320–337
Xu L (2008) Strawberry maturity neural network detecting system based on genetic algorithm. In:
Second IFIP international conference on computer and computing technologies in agriculture,
Beijing, China, pp 1201–1208
Yin H, Chai Y, Yang SX, Mitta GS (2009) Ripe tomato extraction for a harvesting robotic system.
In: IEEE international conference on systems, man and cybernetics. IEEE, San Antonio, USA,
pp 2984–2989
Chapter 14
Automatic Deep Learning Framework
for Breast Cancer Detection and
Classification from H&E Stained Breast
Histopathology Images
Abstract About half a million breast cancer patients succumb to the disease, and nearly 1.7 million new cases arise every year. These numbers are expected to rise significantly due to advances in social and medical engineering. Furthermore, histopathological images are the gold standard for identifying and classifying breast cancer compared with other medical imaging. Evidently, the decision on an optimal therapeutic schedule for breast cancer rests upon early detection. The primary motivation for a better breast cancer detection algorithm is to help doctors determine the molecular sub-types of breast cancer in order to control the metastasis of tumor cells early in disease prognosis and treatment planning. This paper proposes an automatic deep learning framework for breast cancer detection and classification from hematoxylin and eosin (H&E) stained breast histopathology images, with 80.4% accuracy, for supplementing the analysis of medical professionals to prevent false negatives. Experimental results show that the proposed architecture provides better classification results compared to benchmark methods.
14.1 Introduction
Proper diagnosis of breast cancer is a pressing need of our time, because it has become a major cancer-related issue among women worldwide. Manual analysis of microscopic slides leads to differences of opinion among pathologists and is a time-consuming process due to the complexity associated with such images. Breast cancer is a disease with distinctive histological attributes; its benign tumor sub-classes are Adenosis, Fibroadenoma, Phyllode Tumor, and Tubular Adenoma, and its malignant tumor sub-classes are Ductal Carcinoma, Lobular Carcinoma, Mucinous Carcinoma, and Papillary Carcinoma. Classical classification algorithms have their own merits and demerits. Logistic regression-based classification is easy to implement, but its accuracy depends on the nature of the dataset: if the dataset is linearly separable, the method works well, but real-world datasets are rarely linearly separable. A decision tree-based classification model is able to deal with datasets of a complex nature, but there is always a chance of overfitting with this method. The overfitting problem can be reduced by the random forest algorithm, a more sophisticated ensemble of decision tree-based classifiers. The support vector machine works with a hyperplane which acts as a decision boundary; appropriate selection of the kernel is the key to better performance with the support vector machine classification method. To improve the process of diagnosis, automatic detection and treatment is one of the leading research areas dealing with cancer-related issues. Over the last decade, the development of fast digital whole slide scanners (DWSS) that provide whole slide images (WSI) has led to a revival of interest in medical image processing, analysis, and their applications in digital pathology solutions. Segmentation of cells and nuclei proves to be an important first step towards automatic image analysis of digitized histopathology images. We therefore propose to develop an automated cell identification method that works with H&E stained breast cancer histopathology images. A deep learning framework is very effective for detecting and classifying breast cancer histopathology slides. A typical deep learning classification system consists of (a) a properly annotated dataset in which each class and sub-class is verified by experienced pathologists; (b) a robust architecture that is able to differentiate the class and sub-class of the tissue under diagnosis; (c) a good optimization algorithm and a proper loss function that are able to train the model effectively; and (d) in the case of supervised learning, ground truth prepared under the supervision of experienced pathologists, on which the performance of the model depends.
The organization of this chapter is as follows: Sect. 14.2 discusses related research work; Sect. 14.3 presents the proposed model architecture; Sect. 14.4 presents experimental results and discussion; and Sect. 14.5 concludes the chapter.
SVM was used as a benchmark model for breast cancer detection by Akay (2009), where the benefits of SVM for cancer detection were clearly presented; however, that work lacked classification of the type of breast cancer, which was one of our motivations for this research. The motivation was further strengthened by the findings presented by Karabatak and Ince (2009).
Veta and Diest presented automatic nuclei segmentation in H&E stained breast cancer histopathology images. In this paper, the authors explained the different nuances of breast cancer detection achieved through automated cell segmentation. The cell segmentation method, based on patched slide analysis for higher accuracy of cancer detection, is explained in depth (Veta and Diest 2013). The particular advantage of this paper is the detection accuracy it achieves with the cell segmentation method, as it is best in class with over 90.4% accuracy
14 Automatic Deep Learning Framework for Breast Cancer . . . 217
in positive detection. The disadvantage of this paper is that it fails to touch upon the different ways of achieving such detection accuracy with multiple deep learning algorithms.
Cruz-Roa et al. presented automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks (CNNs). In this paper, the authors explained the detection and visual analysis of IDC tissue in whole slide images (WSI). The framework explained in Cruz-Roa et al. (2014) extends to a number of CNNs. The CNN is trained over a large number of image patches, represented by tissue regions from the WSI, to learn a hierarchical part-based representation and classification. The resulting accuracy is stated as 71.80% and 84.23% for F-measure and balanced accuracy, respectively. The disadvantage of the published method stems from the inherent limitations in obtaining a highly granular annotation of the diseased area of interest by an expert pathologist.
In Spanhol et al. (2016) and Janowczyk and Madabhushi (2016), the work presented by the authors brings significance to the datasets used to elucidate deep learning techniques that produce results comparable, and in many cases superior, to those from benchmark hand-crafted feature-based classification algorithms.
Recently, advanced CNN models have achieved paramount success in the classification of natural images as well as in biomedical image processing. Han et al. (2017) designed a novel convolutional neural network, which includes a convolutional layer, a small SE-ResNet module, and a fully connected layer, and which was responsible for impeccable cancer detection outcomes.
Most of the state-of-the-art algorithms in the literature are based on learned features that extract high-level abstractions directly from the histopathological H&E stained images utilizing deep learning techniques. In You Only Look Once: Unified, Real-Time Object Detection (2016), Janowczyk and Madabhushi (2016), and Han et al. (2017), the authors discussed the various algorithms applied for nuclear pleomorphism scoring of breast cancer, examined the challenges to be dealt with, and outlined the importance of benchmark datasets in multi-level classification architectures.
The multiple-layer analysis of cancer detection and classification draws its roots from papers (Feng et al. 2018; Guo et al. 2019; Jiang et al. 2019; Liao et al. 2018; Liu et al. 2019) explaining feature extraction representing different types of breast cancer, with a prominent inclination towards invasive ductal carcinoma (IDC).
M. Z. Alom, T. Aspiras, et al. presented advanced deep convolutional neural network approaches for digital pathology image analysis (Alom et al. 2019). In this paper, the authors explained the process of cancer detection through a CNN approach, specifically IRRCNN. The process of detection using neural networks clarifies the multiple layers that go into making the model. The particular advantage of this paper is its detection approach, as it is an optimal way of utilizing CNNs for image recognition, in this case cancer detection. The disadvantage of this paper is that it addresses only the detection of cancer among the cells and does not allow classification of cancer types.
218 A. Verma et al.
In Ragab et al. (2019), the authors explain the significance of SVM as a benchmark model for the identification of breast cancer; although the analysis proves promising, it is performed on mammogram images instead of H&E stained images.
Li et al. (2019) present a deep learning model that classifies images into malignant and non-malignant and uses a classifier in such a way that it detects local patches.
The idea of multi-classifier development is shared by Kassani et al. (2019) as a problem tackled through ResNet and other prominent neural networks. The disadvantage encountered in these papers is the lack of specificity of the disease.
The paper by Lichtblau and Stoean (2019) suggests the different models that need to be studied to identify the optimum approach for the classification of different cancer types. Because this paper focuses primarily on the classification of breast cancer, our detection algorithm builds on transfer learning of benchmark algorithms, as presented in Shallu (2018), Ting et al. (2019), Vo et al. (2019), Khan et al. (2019), Ni et al. (2019), and Das et al. (2020).
For the BreaKHis dataset, Toğaçar et al. (2020) proposed a general framework for the diagnosis of breast cancer. Their architecture consists of attention modules, a convolution block, a dense block, a residual block, and a hypercolumn block to capture spatial information precisely. Categorical cross-entropy is the loss function, and Adam optimization is used to train the model.
In Sheikh et al. (2020), a densely connected CNN-based network for binary and multiclass classification is able to capture meaningful structure and texture by fusing multi-resolution features, evaluated on the ICIAR2018 and BreaKHis datasets.
For the classification of breast cancer into carcinoma and non-carcinoma, Hameed et al. (2020) utilized deep CNN-based pre-trained VGG-16 and VGG-19 models, which help in better initialization and convergence. Their final architecture is an ensemble of fine-tuned VGG-16 and fine-tuned VGG-19 models.
By utilizing a sliding window mechanism and class-wise clustering with image-wise feature pooling, Li et al. (2019) extract multi-layered features to train two parallel CNNs. Their final classification uses both larger-patch and smaller-patch features.
For the multiclass classification of breast cancer histopathology images, Xie et al. (2019) adopted transfer learning. Pre-trained Inception_ResNet_V2 and Inception_V3 models are utilized for the classification task. Their deep learning framework used four different magnification factors for training and testing to ensure the universality of the model.
Both CNN and SVM classifiers were used by Araújo et al. (2017) to achieve comparable results. The histology image is divided into patches, patch-based features are extracted using a CNN, and finally these features are fed to an SVM to classify the images.
Bejnordi et al. (2017) classify breast carcinomas in whole slide breast histology images by stacking high-resolution patches on top of a network that accepts large inputs, to obtain fine-grained detail as well as global tissue structure.
Spanhol et al. (2017) applied a CNN trained on natural images to the BreaKHis dataset to extract deep features, and they found these features to be better than
hand-crafted features. These features are fed to different classifiers trained on the specific dataset. Their patch-based classification with four different magnification factors achieves very good prediction accuracy.
Zhu et al. (2019) work on the BreaKHis dataset by merging local and global information in a multiple CNN, or hybrid CNN, that is able to classify effectively. To remove redundant information, they incorporated an SEP block in the hybrid model. Combining these two effects, their model obtained promising results.
For the BACH and BreaKHis datasets, Patil et al. (2019) used attention-based multiple instance learning, where features are aggregated into bag-level features. Their multiple instance-based learning is able to localize lesions and classify them as benign, malignant, or invasive.
The proposed architecture consists of two parts, namely detection and classification.
The detection network is influenced by IRRCNN (Alom et al. 2019), while the
classification network is influenced by WSI-Net (Ni et al. 2019).
The flow diagram of our proposed architecture is shown in Fig. 14.1. The archi-
tecture consists of two convolutional networks and three residual networks. The
H&E image of the breast tissue is pre-processed and sent into the first convolutional
network followed by a residual network, and this pair is then repeated once more.
The processed data is sent into the classification branch and the malignancy
detection branch. The malignancy detection branch decides whether the input is
malignant or non-malignant. The classification branch further processes the data
and classifies it as invasive ductal carcinoma (IDC) positive or negative. The data
from both branches are combined and passed through the final residual network.
The prediction is then given through the confusion matrix segmentation map.
The optimizer utilized in the proposed architecture is Adam (Kingma and Ba 2014),
short for adaptive moment estimation. In Adam, the exponential moving average
(EMA) of the first moment of the gradient, scaled by the learning rate and divided
by the square root of the EMA of the second moment, is subtracted from the
parameter vector, as presented in Eq. 14.1:

θt+1 = θt − η m̂t / (√v̂t + ε)    (14.1)

where θ is the parameter vector, v̂t is the exponential moving average of the second
moment of the gradient, m̂t is that of the first moment, η is the learning rate, and
ε is a very small hyper-parameter that prevents the algorithm from dividing by zero.
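The update in Eq. 14.1 can be sketched in NumPy. This is a minimal illustration, not the chapter's training code: the defaults for β1, β2, and ε are the standard Adam values, and the toy objective f(θ) = θ² is chosen only to show convergence.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: bias-corrected EMAs of the first and second moments (Eq. 14.1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for the EMAs
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(theta) = theta**2, whose gradient is 2*theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, eta=0.01)
```

After a couple of thousand steps the parameter settles near the minimum at zero; the division by √v̂t is what makes the step size roughly scale-free across parameters.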
All models used and the proposed model in this paper were implemented in a Jupyter
notebook in Google Colaboratory, through a virtual machine in the cloud, as well
as on a PC with an Intel(R) Core(TM) i7-8750H CPU @ 3.98 GHz, 16 GB RAM, and an
NVIDIA GTX 1070 Max-Q as its core specifications.
This research work uses the BHC dataset for detection and classification of
hematoxylin and eosin (H&E) stained breast cancer histopathology images. The Keras
data generator is used to load the data from the respective class folders into
Keras automatically; Keras provides convenient Python library functions for this
purpose.
The learning rate used by the proposed model is set to 0.0001. On top of it, a
global average pooling layer followed by 50% dropout is used to reduce over-fitting
bias. Adam is used as the optimizer and binary cross-entropy as the loss function.
A sequential model, along with a confusion matrix, is used for the implementation
of the classification branch of the proposed algorithm. It adds a convolutional
layer with 32 filters and a 3 × 3 kernel; 2 × 2 max pooling then combines four
units across both axes. Then, dense, flatten, and dropout (redundancy reduction)
operations are applied.
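The pipeline described in the two paragraphs above can be sketched in Keras. This is a hedged reconstruction, not the authors' exact code: the 50 × 50 patch size, the `data/train` folder layout, and the exact layer ordering are assumptions; the 32-filter 3 × 3 convolution, 2 × 2 pooling, global average pooling, 50% dropout, Adam at 0.0001, and binary cross-entropy come from the text.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Classification-branch sketch; patch size (50 x 50 x 3) is an assumption.
model = keras.Sequential([
    keras.Input(shape=(50, 50, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),  # 32 filters, 3 x 3 kernel
    layers.MaxPooling2D((2, 2)),                   # pools 2 x 2 = four units
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),                           # 50% dropout against over-fitting
    layers.Dense(1, activation="sigmoid"),         # IDC(+) vs IDC(-)
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# The Keras data generator pulls images from class subfolders automatically,
# e.g. (commented out because it needs the dataset on disk):
# datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)
# train_flow = datagen.flow_from_directory("data/train", target_size=(50, 50),
#                                          class_mode="binary", batch_size=32)
# model.fit(train_flow, epochs=10)
```

The sigmoid output with binary cross-entropy matches the two-class IDC(+)/IDC(−) setting described in the results section.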
The results and discussion of the proposed model for the cancer detection and
classification method are presented in this section. For validity, the results of
the proposed architecture are compared with existing methodologies from the
referenced literature, such as IRRCNN, DCNN, SVM, VGG-16, decision tree, etc.
Table 14.1 lists the different models that were tested for the cancer classification
branch of the proposed architecture and the results observed.
DenseNet201 was chosen for the malignancy detection branch of the proposed
architecture by weighing the size of the model against its top-5 accuracy, for
which DenseNet201 scored best. The accuracy and loss plots of the malignancy
detection branch of the proposed architecture are shown in Figs. 14.2 and 14.3,
respectively. The receiver operating characteristic (ROC) plot of the proposed
architecture is shown in Fig. 14.4.
Invasive ductal carcinoma (IDC) is the most common form of breast cancer. Through
this project, we implement a two-class classification for the preferred algorithm
in order to broaden our automated analysis, i.e., IDC(+) versus DCIS (IDC−). This
method uses a confusion matrix to understand the implications of errors in the
prediction of the type of cancer. Predicted IDC(+) and IDC(−) results of the
proposed model are shown in Fig. 14.5, and the confusion matrix of the proposed
model is presented in Fig. 14.6. The comparison of predicted versus actual results
of the proposed model is shown in Fig. 14.7.
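Reading error implications off a binary confusion matrix, as done above, can be sketched in plain Python. The labels here are toy values for illustration, not the paper's experimental data.

```python
def confusion_matrix_binary(y_true, y_pred):
    """Return (TP, FP, FN, TN) for IDC(+) = 1 versus IDC(-) = 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Toy labels, not from the chapter's experiments.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix_binary(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)  # fraction of all predictions that are correct
precision = tp / (tp + fp)          # how correct the IDC(+) calls are
```

A false negative (an IDC(+) case predicted IDC(−)) is the clinically costly cell of the matrix, which is why the chapter inspects the matrix rather than accuracy alone.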
The machine learning algorithms in Table 14.1 were contrasted with the proposed
algorithm. The idea is to design an optimal algorithm in which bias toward either
class is limited while achieving efficacy similar to the support vector machine
(SVM). The proposed algorithm provides the best approach in these terms and can be
used as an alternative to the existing SVM method for the classification of cancer.
CNNs such as WSI-Net were contrasted with the proposed algorithm on the weighted
parameters, and the results are summarized in the conclusion.
14.5 Conclusion
The proposed model is broadly a combination of cancer detection and classification
into IDC and non-IDC. Breast cancer detection is based on the IRRCNN algorithm,
with significant changes in the number of epochs and convolutional layers in order
to approach the desired results. It is then coalesced with the classification
algorithm, which gives a significant improvement over WSI-Net and other machine
learning classifiers. The observed accuracy stands at 95.25% for detection of
breast cancer and at 80.43% for classification of IDC versus DCIS, which is better
than WSI-Net.
Acknowledgements This research work was supported in part by the Science Engineering
and Research Board, Department of Science and Technology, Govt. of India under Grant No.
EEG/2018/000323, 2019.
References
Akay MF (2009) Support vector machines combined with feature selection for breast cancer diagnosis. Expert Syst Appl 36:3240–3247. https://doi.org/10.1016/j.eswa.2008.01.009
Alom M, Aspiras T, Taha MT, Asari K, Bowen V, Billiter D, Arkell S (2019) Advanced Deep
convolutional neural network approaches for digital pathology image analysis: a comprehensive
evaluation with different use cases
Araújo T, Aresta G, Castro E, Rouco J, Aguiar P, Eloy C, Polónia A, Campilho A (2017) Classifi-
cation of breast cancer histology images using convolutional neural networks. PloS One 12(6).
https://doi.org/10.1371/journal.pone.0177544
Bejnordi BE, Zuidhof G, Balkenhol M, Hermsen M, Bult P, van Ginneken B, Karssemeijer N, Litjens
G, van der Laak J (2017) Context-aware stacked convolutional neural networks for classification
of breast carcinomas in whole-slide histopathology images. J Med Imag (Bellingham, Wash)
4(4):044504. https://doi.org/10.1117/1.JMI.4.4.044504
Cruz-Roa A, et al (2014) In: Gurcan MN, Madabhushi A (eds) Automatic detection of invasive
ductal carcinoma in whole slide images with convolutional neural networks, p 904103. https://
doi.org/10.1117/12.2043872
Das A, Nair MS, Peter D (2020) Computer-aided histopathological image analysis techniques for
automated nuclear atypia scoring of breast cancer
Feng Y, Zhang L, Mo J (2018) Deep manifold preserving autoencoder for classifying breast cancer
histopathological images. IEEE/ACM Trans Comput Biol Bioinform 1. https://doi.org/10.1109/
TCBB.2018.2858763
Guo Y, Shang X, Li Z (2019) Identification of cancer subtypes by integrating multiple types of
transcriptomics data with deep learning in breast cancer. Neurocomputing 324:20–30. https://
doi.org/10.1016/j.neucom.2018.03.072
Hameed Z, Zahia S, Garcia-Zapirain B, Javier Aguirre J, María Vanegas A (2020) Breast cancer
histopathology image classification using an ensemble of deep learning models. Sensors 20:4373
Han Z, Wei B, Zheng Y, Yin Y, Li K, Li S (2017) Breast cancer multi-classification from histopatho-
logical images with structured deep learning model. Sci. Rep. 7:1–10. https://doi.org/10.1038/
s41598-017-04075-z
Janowczyk A, Madabhushi A (2016) Deep learning for digital pathology image analysis: a com-
prehensive tutorial with selected use cases. J Pathol Inform 7:29 (2016). PubMed https://doi.org/
10.4103/2153-3539.186902
Jiang Y et al (2019) Breast cancer histopathological image classification using convolutional neural
networks with small SE-ResNet module. PLOS ONE 14(3): e0214587. PLoS J. https://doi.org/
10.1371/journal.pone.0214587
Karabatak M, Ince MC (2009) An expert system for detection of breast cancer based on association rules and neural network. Expert Syst Appl 36:3465–3469. https://doi.org/10.1016/j.eswa.2008.02.064
Kassani SH, Kassani PH, Wesolowski M (2019) Classification of histopathological biopsy images
using ensemble of deep learning networks. SIGGRAPH 4(32). https://doi.org/10.1145/3306307.
3328180
Khan S, Islam N, Jan Z, Din IU, Rodrigues JJPC (2019) A novel deep learning based framework
for the detection and classification of breast cancer using transfer learning. Pattern Recognit Lett
125:1–6. https://doi.org/10.1016/j.patrec.2019.03.022
Kingma D, Ba J (2014). Adam: a method for stochastic optimization. In: International conference
on learning representations
Li Y, Wu J, Wu Q (2019) Classification of breast cancer histology images using multi-size and
discriminative patches based on deep learning. IEEE Access 7:21400–21408. https://doi.org/10.
1109/ACCESS.2019.2898044
Li S, Margolies LR, Rothstein JH, Eugene F, Russell MB, Weiva S (2019) Deep learning to improve
breast cancer detection on screening mammography. Sci Rep 9:12495. https://doi.org/10.1038/
s41598-019-48995-4
Liao Q, Ding Y, Jiang ZL, Wang X, Zhang C, Zhang Q (2018) Multi-task deep convolutional neural
network for cancer diagnosis. Neurocomputing. https://doi.org/10.1016/j.neucom.2018.06.084
Lichtblau D, Stoean C (2019) Cancer diagnosis through a tandem of classifiers for digitized
histopathological slides. PLoS One 14:1–20. https://doi.org/10.1371/journal.pone.0209274
Liu N, Qi E-S, Xu M, Gao B, Liu G-Q (2019) A novel intelligent classification model for breast
cancer diagnosis. Inf Process Manag 56:609–623. https://doi.org/10.1016/j.ipm.2018.10.014
Mehra SR (2018) Breast cancer histology images classification: training from scratch or transfer
learning? ICT Exp 4:247–254. https://doi.org/10.1016/j.icte.2018.10.007
Ni H, Liu H, Wang K, Wang X, Zhou X, Qian Y (2019) WSI-Net: branch-based and hierarchy-aware
network for segmentation and classification of breast histopathological whole-slide images. In:
International Workshop on Machine Learning in Medical Imaging, pp 36-44
Patil A, Tamboli D, Meena S, Anand D, Sethi A (2019) Breast cancer histopathology image clas-
sification and localization using multiple instance learning. In: 2019 IEEE international WIE
conference on electrical and computer engineering (WIECON-ECE), Bangalore, India, pp 1–4.
https://doi.org/10.1109/WIECON-ECE48653.2019.9019916
Ragab DA, Sharkas M, Marshall S, Ren J (2019) Breast cancer detection using deep convolutional
neural networks and support vector machines. Peer J 7:e6201
Redmon J (2016) You only look once: unified, real-time object detection. Retrieved from http://pjreddie.com/yolo/
Spanhol FA, Oliveira LS, Cavalin PR, Petitjean C, Heutte L (2017) Deep features for breast cancer histopathological image classification. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC), Banff, AB, pp 1868–1873. https://doi.org/10.1109/SMC.2017.8122889
Sheikh TS, Lee Y, Cho M (2020) Histopathological classification of breast cancer images using
a multi-scale input and multi-feature network. Cancers 12(8):2031. https://doi.org/10.3390/
cancers12082031
Spanhol F, Oliveira LS, Petitjean C, Heutte L (2016) A dataset for breast cancer histopathological
image classification. IEEE Trans Biomed Eng (TBME) 63(7):1455–1462
Ting F, Tan YJ, Sim KS (2019) Convolutional neural network improvement for breast cancer
classification. Exp Syst Appl 120:103–115. https://doi.org/10.1016/j.eswa.2018.11.008
Toğaçar M, Ergen B, Cömert Z (2020) Application of breast cancer diagnosis based on a combination
of convolutional neural networks, ridge regression and linear discriminant analysis using invasive
breast cancer images processed with autoencoders. Med Hypotheses
Vo DM, Nguyen N-Q, Lee S-W (2019) Classification of breast cancer histology images using
incremental boosting convolution networks. Inf Sci (Ny) 482:123–138. https://doi.org/10.1016/
j.ins.2018.12.089
Veta MJ, Diest PJ (2013) Automatic nuclei segmentation in H&E stained breast cancer histopathology images. PLoS One 8(7)
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2018) The marginal value of adaptive gradient
methods in machine learning, 2017. arXiv:1705.08292v2 [stat.ML] (22 May 2018)
Xie J, Liu R, Luttrell J, Zhang C (2019) Deep learning based analysis of histopathological images
of breast cancer. Front Genet 10. https://doi.org/10.3389/fgene.2019.00080
Zhu C, Song F, Wang Y et al (2019) Breast cancer histopathology image classification through
assembling multiple compact CNNs. BMC Med Inform Decis Mak 19:198. https://doi.org/10.
1186/s12911-019-0913-x
Chapter 15
An Analysis of Use of Image Processing
and Neural Networks for Window
Crossing in an Autonomous Drone
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 229
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_15
230 L. P. de Brito et al.
15.1 Introduction
The main objective of this study was to investigate the creation of a system
capable of identifying a passage, such as a door or window, using a monocular
camera, and guiding a small drone through it. This involves the construction of a
small aircraft capable of capturing images that are processed by an external
machine which, in turn, responds with the specific movement the aircraft must
follow. Figure 15.1 shows an outline of the system structure, where arrows indicate
the flow of information and the processes carried out.
The detection algorithm works by classifying pixels and locating the red object
within an image. From this position, an analysis can be made with reference to the
position of the camera, thus calculating a route for the aircraft to follow. Two
detection approaches were studied in this work: the first based only on simple
image processing techniques, and the other based on convolutional neural networks,
more specifically the SSD, single shot multibox detector (Falanga et al. 2018;
Ilie and Gheorghe 2016; Liu et al. 2016).
The work combined hardware and software to enable control of the aircraft. The
hardware was chosen for this research because of its low cost compared to
commercial models for sale, not to mention that PixHawk has a large open-source
development kit.
A small Q250 racing quadcopter (25 cm) was built with enough hardware to complete
the project, including a PixHawk 4 flight controller board (Meier et al. 2011;
Pixhawk 2019), 2300 KV motors, and 12 A electronic speed controllers (ESCs). This
controller board has a main flight management unit (FMU) processor and an
input/output (I/O) co-processor.
A ground control station (GCS) is software executed on a standalone platform that
performs the monitoring and configuration of the drone's sensors, such as sensor
calibration settings and the configuration of general-purpose boards. It supports
different types of vehicle models, like the PixHawk, which needs its firmware
configured before use.
In this work, the QGroundControl ground station software was used. This software
allows checking the status of the drone and programming missions in a simple way
with the global positioning system (GPS) and a map, and it is suitable for PixHawk
configuration. Figure 15.3 shows the interface of the GCS used (Damilano et al.
2013; Planner 2019a, b; QGROUNDCONTROL 2019; Ramirez-Atencia and Camacho 2018).
15.2.4 Simulation
This work used the Gazebo simulator with its PX4 implementation, which provides
various vehicle models with PixHawk-specific hardware and firmware simulation.
15.2.4.1 Gazebo
Gazebo provides realistic simulation with complex scenarios and robust environment
physics, including several sensors, to sketch a true real-world implementation.
Gazebo enables the simulation of multiple robots, which makes it possible to test
and train AI code and image processing with remarkable ease and agility. Gazebo can
create a scenario with various elements, such as houses, hospitals, cars, people,
etc. With this scenario, it is possible to evaluate the quality of the code and
tune its parameters before a test in the real environment (de Waard et al. 2013;
GAZEBOSIM 2019; Koenig and Howard 2004).
The Iris model is the simulated PX4 drone with the greatest fidelity to the real
Q250 model implemented (presented previously); both are based on PixHawk, which
means the autopilot and the simulated Iris firmware are compatible with the real
Q250. It is possible to connect the aircraft to the code and command it, and the
code is the same for both. Figure 15.4 shows this simulated model (Garcia and
Molina 2020; PX4SIM 2019).
This work used the TensorFlow framework, which has a large number of existing
implementations available for adaptation (Bahrampour et al. 2015; Kovalev et al.
2016).
TensorFlow was created by Google and integrates the Keras API to facilitate the
implementation of high-performance algorithms, especially for large servers. It
supports graphics processing units (GPUs) in addition to the central processing
unit (CPU). This tool is considered heavy compared to others on the market;
however, it is very powerful because it provides a large number of features, tools,
and implementations. TensorFlow has a GitHub repository where its main code and
other useful tools like deployed templates, TensorBoard, Project Magenta, etc., are
available. As a portable library, it is available in several languages, such as
Python, C++, Java, and Go, as well as other community-developed extensions (BAIR
2019; GOOGLE 2019; Sergeev and Balso 2018; Unruh 2019).
This work used convolutional neural networks (CNNs) and machine learning (ML) for
object detection (Cios et al. 2012; Kurt et al. 2008), in which classification is
combined with the calculation of the location of the object.
A CNN is a variation of the multilayer perceptron network (Vargas et al. 2016). A
perceptron is simply a neuron model capable of storing and organizing information
as in the brain (Rosenblatt 1958). The idea is to divide complex tasks into several
smaller and simpler tasks that act on different characteristics of the same problem
and eventually return the desired answer. Figure 15.5 illustrates this structure
(He et al. 2015; Szarvas et al. 2005; Vora et al. 2015).
The CNN applies filters to visual data to extract or highlight important features
while maintaining neighborhood relationships, much like convolution masks in
graphical processing, hence the origin of the name for this type of network
(Krizhevsky et al. 2012). When a convolution layer operates over an image, it
multiplies and adds the values of each pixel to the values of a convolution filter
or mask. After calculating an area following a defined pattern, the filter moves to
another region of the image until it completes the operation over it (Jeong 2019).
Figure 15.6 illustrates the structure of a CNN (Vargas et al. 2016). The single
shot multibox detector (SSD) neural network (Bodapati and Veeranjaneyulu 2019;
Huang et al. 2017; Yadav and Binay 2017), a convolutional neural network for
real-time object detection (Cai et al. 2016; Dalmia 2019; Hui 2019; Liu et al.
2016; Moray 2019; Ning et al. 2017; Tindall et al. 2015; Xia et al. 2017), was used
because it is considered the state-of-the-art in accuracy (Liu et al. 2016).
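The multiply-and-slide operation described above can be shown with a small NumPy sketch; the toy image and edge kernel are illustrative, not from the chapter.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: slide the kernel and multiply-and-add at each position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge kernel applied to a tiny image whose right half is bright:
# the response is large only where the dark-to-bright transition occurs.
img = np.array([[0, 0, 9, 9],
                [0, 0, 9, 9],
                [0, 0, 9, 9]], dtype=float)
edge = np.array([[-1.0, 1.0],
                 [-1.0, 1.0]])
response = conv2d(img, edge)
```

In a CNN the kernel values are not hand-chosen like this but learned during training; the sliding mechanics are the same.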
Thresholding splits the image by gray tone, where all pixels darker than the limit
go to one group and the rest to another. To find the edges, the Canny filter
(Accame and Natale 1997; OPENCV 2019) is applied, which walks over the image pixels
with a gradient vector that calculates the direction and intensity of pixel changes
(Boze 1995; Hoover et al. 2000; Simple Thresholding 2019). OpenCV's findContours()
function was used to detect polygons after proper treatment of the image with the
mentioned filters.
Figure 15.7 shows a flowchart of the overall system architecture, indicating its
processes. Three interconnected machines compute the implemented code, most of it
on the GCS. The drone receives speed commands to move, and captures images through
a camera and a receiver/transmitter pair for data transfer.
This is the general architecture of both the simulated and the real system. To
start the system, the three communicating machines must be started.
In the flowchart, the red arrows represent internal executions within the same
machine, and the blue arrows indicate the transfer of data between machines through
a communication protocol. The orange arrows indicate the creation of a
multiprocessing chain necessary for the implementation of the system. The machine
running the main code is the GCS, which runs three main threads: one for image
capture, one to shut down the software, and the main one that manages all the
processing and calculations.
The captured images are transferred to the detection algorithm, which calculates
the bounding box that best represents the desired passage. A speed is calculated
according to the detection and transferred to the drone. When the drone loses a
detection or finds none, the code slows the aircraft down to a certain speed until,
if no further detection occurs, the drone stops. When the drone receives a speed to
be set on an axis, it maintains that speed until another speed is received or some
other command, such as landing for safety, is executed.
The CNN used applies filters to highlight the desired object, plus a classifier and
a bounding box estimator to indicate the location of the object in the image. The
feature extractor used was the graph called MobileNet version 2, and the classifier
comes from the CNN features.
The CNN was trained with a set of images of windows and other objects with their
respective boundary box coordinates. This set of images was obtained from Google
Open Image version 4 containing about 60,000 window images (GOOGLE 2019).
RGB images are converted to grayscale and smoothed with the Gaussian filter. Then,
the Canny filter is applied to isolate the edges of the objects. The code looks for
lines that form a four-sided polygon to identify the passage within a window. The
center of this polygon is identified, and its side and diagonal measurements are
calculated.
An important detail is that the distances between points are calculated
geometrically as quadrilateral measurements. But these calculated distances are
values in pixel units, that is, the number of pixels between one point and another,
and that number varies with the resolution of the camera used. To correct this,
these pixel values were converted into percentage values, so every measure is
defined as a percentage of the maximum it could assume, the maximum usually being
the height, width, or diagonal of the image. For example, when measuring the height
of the bounding box, it is divided by the height of the image to find its
occupation percentage. The Cartesian image plane is best expressed using the y and
z axes, because the image then follows the same convention as the drone movements.
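The pixel-to-percentage conversion can be sketched as follows; the function name and the example resolutions are illustrative, not from the chapter.

```python
def to_percentage(box_w, box_h, img_w, img_h):
    """Convert pixel measurements into resolution-independent fractions of the image."""
    rel_w = box_w / img_w
    rel_h = box_h / img_h
    rel_diag = ((box_w ** 2 + box_h ** 2) ** 0.5
                / (img_w ** 2 + img_h ** 2) ** 0.5)
    return rel_w, rel_h, rel_diag

# The same physical box seen at two camera resolutions yields the same fractions,
# which is the point of the normalization.
full_hd = to_percentage(384, 216, 1920, 1080)  # box 20% of a 1920x1080 frame
vga = to_percentage(128, 96, 640, 480)         # box 20% of a 640x480 frame
```

This is what lets the same control thresholds (e.g., diagonals of 10–70% of the image) work for both cameras tested later in the chapter.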
It was relatively easy to detect the passage in the simulated experiment. However,
a real experiment with many polygons generated many unwanted detections. Another
challenge is that the current algorithm does not capture the slope of the object,
so it cannot align the aircraft with the found window. To solve both problems, a
segmentation network could be used, which has the capacity to capture the total
area of the sought object. Figure 15.8 shows an example of this type of network
(He et al. 2017).
The drone control algorithm uses three functions to position the drone in front of
the window to make an approximately linear crossing: Approximate, Center, and
Align. The algorithm defines the speed of the drone on the x, y, and z axes, as
shown in Fig. 15.9.
In the Approximate function, the inputs are the diagonals of the image and of the
bounding box, while the output is a speed on the x axis of the drone. This function
takes the detected bounding box and checks its size in relation to the image to
estimate the relative distance of the object, expressed as the percentage of the
image that the object occupies.
The mathematical function 15.1 was used to model the characteristic of this
movement:

f(p) = k · (1/p²)    (15.1)
It is the inverse of the square of the calculated diagonal size: the smaller the
diagonal, the farther the object is and the faster the movement speed. A quadratic
function was used for greater gain. In this function, p represents the input
measure, and k is a constant that controls the output value according to factors
such as distance and execution state, giving a greater or lesser gain depending on
the case.
The behavior of this function is shown in Fig. 15.10; only the positive domain is
used for the problem. When the size of the object tends to zero, the speed tends to
infinity, and when the size tends to infinity, the speed tends to zero. Due to the
high values this function can reach, only part of it is used, an interval defined
in code that respects the system conditions. This interval is p in [0.1, 0.7], that
is, detections whose diagonals occupy 10–70% of the image.
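One possible reading of Eq. 15.1 with the [0.1, 0.7] restriction in code; the constant k and the clamp-to-zero behavior outside the interval are illustrative assumptions, while the inverse-square form and the interval come from the text.

```python
def approach_speed(rel_diag, k=0.05):
    """Forward (x-axis) speed from Eq. 15.1, f(p) = k / p**2.

    rel_diag is the bounding-box diagonal as a fraction of the image diagonal.
    Outside the usable detection range [0.1, 0.7] no forward speed is commanded
    (an assumed policy; k = 0.05 is likewise only illustrative).
    """
    if rel_diag < 0.1 or rel_diag > 0.7:
        return 0.0
    return k / rel_diag ** 2

far = approach_speed(0.15)   # small box -> object is far -> move faster
near = approach_speed(0.6)   # large box -> object is near -> slow down
```

The inverse-square shape gives the drone a fast approach from far away and a progressively gentler one near the passage.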
The centralization function positions the drone at the center of the opening of the
identified window. It uses the distance on the y and z axes between the center of
the image and the center of the bounding box to set the speeds on the drone's y and
z axes and center the aircraft (Fig. 15.11).
Fig. 15.11 Measures of bounding box and image to perform centralization (author)
Figure 15.12 shows a side view with a distorted box, where the right side is
smaller than the left, as the drone's view is misaligned with the window. The
alignment function sets the speed of the drone's angular axis to produce a yaw that
aligns the aircraft with the window.
Fig. 15.12 Measures of bounding box and image to perform alignment (author)
When the algorithm detects a passage, the movement functions themselves indicate
the current execution state of the drone. Five states are performed, the last being
the crossing of the passage. Algorithm 1 presents pseudocode for the system state
control.
Each of the functions sets speeds that produce faster movements over a long
distance and slower ones over a short distance. When all functions return "true,"
the state changes.
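Since Algorithm 1 itself is not reproduced here, the state control described above can be sketched as follows. Only the five states and the all-functions-true advancement rule come from the text; the per-function conditions, thresholds, and variable names are illustrative assumptions.

```python
def run_state_machine(detections):
    """detections: per-frame tuples (rel_diag, offset_y, offset_z, skew).

    Stand-ins for Approximate, Center and Align each return True when their
    condition is met; when all three are True, the state advances. State 5
    represents the crossing itself. All thresholds are illustrative.
    """
    state = 1
    for rel_diag, offset_y, offset_z, skew in detections:
        approximated = rel_diag >= 0.5                        # close enough
        centered = abs(offset_y) < 0.05 and abs(offset_z) < 0.05
        aligned = abs(skew) < 0.05                            # box sides near equal
        if approximated and centered and aligned and state < 5:
            state += 1
    return state

# Three poor frames (far, off-center, skewed) followed by four good ones.
frames = [(0.2, 0.3, 0.1, 0.2)] * 3 + [(0.6, 0.0, 0.0, 0.0)] * 4
final_state = run_state_machine(frames)
```

Tying state advancement to all three conditions at once is what produces the long-distance-fast, short-distance-slow behavior: until every function is satisfied, the drone keeps correcting rather than crossing.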
We evaluate the results of the trained CNN on window and passageway detection,
performed on images that were not part of the group used during the training of the
network. To assess the quality of CNN's convolutional processes, we use the
percentages of false and true identifications: accuracy measures the percentage of
predictions that are correct, and precision measures how correct the positive
results are.
Fig. 15.14 Number of bounding boxes detected. (I) Mono-class training. (II) Multi-class training
(author)
Two trainings were applied: one called mono-class, which uses only one object class
(window), and another called multi-class, which uses several other objects (people,
cars, traffic signs) besides the window.
Mono-class training produced many false detections, where people and cars were
identified as windows; Figure 15.8 shows this clearly. To evaluate the false
positives, a test was carried out with about 700 images without windows, and the
result showed that the multi-class training effectively eliminated the false
positives. The results are shown in the graph in Fig. 15.14, which also shows that
the network lost 50% of its detection generality.
During the experiments, detection losses on distorted frames affected the system's
efficiency. When a detection failure occurs, the drone tends to execute the last
calculated speed until it stops, generating inappropriate movement. An experiment
was carried out to measure the number of detections in a video capture. Two cameras
were tested, a Full HD (1920 × 1080) digital camera and an FPV analog transmission
camera (600TVL), for two types of video, one with camera movement and the other
without. The objective was to assess what most affected the number of detections:
the camera or the movement. The camera used in the system was low cost, without a
gimbal to stabilize the video, so in the experiment both cameras filmed the same
object stationary and moving, to simulate a system with a gimbal and one without.
The influencing factors evaluated were (a) camera resolution and (b) presence of
movement, with two levels each. Each recorded video lasted one minute, and each
camera captures a certain number of frames per second, so the final analysis
variable was the percentage of frames in which detection succeeded (Jain 1990).
To assess the detection losses of the polygon detection algorithm, a passage
forming a well-defined polygon was used to facilitate detection in an isolated
environment. The results obtained are shown in Table 15.1.
The effect of each factor on the number of detections is shown in Fig. 15.15, where
"M" represents movement, "C" represents the camera, and "CM" means both together.
The camera was the factor that most affected the performance of the detector. The
motion did not have much influence on the result, but the union of these two
factors has a significant impact on the system.
The same test performed for the polygon detector was also performed for the neural
network: four experiments were carried out between the two types of cameras, for
one stationary recording and one in motion. The results obtained are shown in
Table 15.2. Figure 15.16 shows the result for the camera (C) and the movement (M).
The camera factor had an influence of 48.1%, responsible for almost half the
influence on the system, so the quality of the video makes a great difference for
neural networks. The movement factor had a smaller but still relevant effect of
29.6%; the neural network can work with distorted and flawed images as long as some
of the sought characteristics remain. The influence of both factors together (CM)
also had a relevant figure of 22.4%, almost as large as the movement itself.
15 An Analysis of Use of Image Processing and Neural . . . 245
This means that the low-resolution setup with movement had a lot to lose, as can
be seen in Table 15.1, where the experiment with these characteristics lost
detection in more than half of the total group of frames.
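The influence percentages quoted for C, M, and CM follow the 2^2 factorial analysis of Jain (1990), in which the variation of the response is allocated between the two factors and their interaction. A minimal sketch of that computation, using hypothetical observations rather than the chapter's raw data:

```python
# 2^2 factorial design (Jain 1990): allocate the variation between
# camera (C), movement (M), and their interaction (CM).
def factorial_2x2(y_ll, y_hl, y_lh, y_hh):
    """y_xy: observation with camera at level x and movement at level y
    (l = low, h = high). Returns each term's share of variation in %."""
    qC  = (-y_ll + y_hl - y_lh + y_hh) / 4   # camera effect
    qM  = (-y_ll - y_hl + y_lh + y_hh) / 4   # movement effect
    qCM = ( y_ll - y_hl - y_lh + y_hh) / 4   # interaction effect
    sst = 4 * (qC**2 + qM**2 + qCM**2)       # total variation
    return {name: 100 * 4 * q * q / sst
            for name, q in (("C", qC), ("M", qM), ("CM", qCM))}
```

The three shares always sum to 100%, which is how the chapter's 48.1%, 29.6%, and 22.4% figures should be read.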
In any case, the neural network-based object detector also suffers from
poor-quality image capture; better camera hardware produces a greater number of
detections.
The system was first tested in a simulation in the Gazebo environment.
Subsequently, it was implemented with Pixhawk and MAVSDK and tested in real
scenarios.
Figure 15.17 shows the passage used for testing in a real environment. Using this
disposable passageway and the Gazebo simulator, it was possible to find adequate
speed values for the movements. The highlighted border color makes the passage
easy to detect. Although the algorithm avoids unwanted detections, it still tends
to find squares in environments with very linear objects. Therefore, the use of a
neural network facilitated the application of the technique in the real world.
The tests performed on the simulated platform are important for control modeling,
as there are no problems with falls and accidents. A simple passage was simulated
with the edges highlighted by the color difference. Figure 15.18 illustrates these
experiments in which there are four distinct steps in the execution of the algorithm.
The experiments were carried out by crossing a simple 2 × 2 m passage in the
simulation environment shown in Fig. 15.18. In these tests, the drone always
managed to cross the passage.
Sometimes the drone has difficulty finding the equilibrium point of the last state
with respect to the centralization function and takes longer to perform the
crossing. With a more precise parameterization, it often spends a long time
searching for the balance point in a back-and-forth motion because of its precise
movements. Even so, with these parameters, it still ends up crossing.
After the implementation of the neural network-based object detector, a new
environment was modeled to simulate a city with cars and houses. In this
environment, it was possible to apply the algorithm that detects real windows to
control the drone during its crossing. Through the control functions studied and
the parameter adjustment made in the first environment, it was possible to carry
out the crossing in the desired way. Figure 15.19 illustrates these tests.
In all tests performed with the control algorithm crossing a 1.5 × 2 m window in
the simulation environment shown in Fig. 15.19, the drone always managed to cross.
The simulation of the system was successful, with a high number of detections per
second and correct speed commands for the drone to follow.
The validation of the system in a real environment was carried out on a disposable
passage, modeled to avoid accidents that would damage the aircraft. The detection
of this passage is shown in Fig. 15.17; it was made of paper, so that a propeller
could cut through it. Figure 15.20 shows a sequence of steps performed in the real
environment.
The validation in the real environment presented many difficulties due to the
quality of the image capture, which motivated the analysis carried out on this
aspect. The use of a better camera and of failure-handling filters is what made
the tests possible. In this case, detections were averaged and the generated speed
was reduced to avoid accidents; even so, the system still sometimes failed for
lack of detections. Still, it was enough to show that such detection methods can
be used in real environments.
15.5 Conclusion
The research carried out obtained satisfactory results. An autonomous image
processing and decision system depends on various components, such as the
airframe, the detection and data capture kit, control hardware, programming
languages, and the simulation environment, among other characteristics that
directly influence the final implementation. The bibliographic review of this work
provides extensive modeling and demonstration possibilities for a system similar
to the one implemented.
The use of a square detector based on image processing facilitated the research
tests. In environments with few square shapes, such as nature or open fields, the
algorithm obtained excellent movement results.
The main problem with a polygon detector based on image filters, like the one
implemented, is that it identifies many different polygons in an image. For
implementation in a real environment, it was therefore necessary to use the CNN
trained for window detection. Thus, while the CNN was being implemented, the
motion control algorithm was developed based on the quadrilateral detector.
The single-class network produced several false positives, labeling people and
cars as windows. The solution was to add further classes, such as people and cars,
to the training set. Thus, it was possible to perform window detection
efficiently, as desired.
The implemented system generates speeds according to the position of the bounding
box found, so it suffers from detection failures: it needs real-time detections to
maintain correct movement, owing to the state variations. The solution was to
implement median filters to circumvent the control failures, because in the real
environment, with a low-cost camera, many losses occurred depending on the
lighting and the distance of data transmission. An alternative solution would be
to implement a route-based control method instead of speed control: even when
detection is lost, the vehicle would maintain its route toward the static object
it wants to cross.
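A minimal sketch of the two ideas above, speed commands proportional to the bounding-box offset from the image center, with a median filter over recent detections to ride out failures. The gain, window size, and frame dimensions are illustrative assumptions, not the system's tuned parameters.

```python
# Proportional centering control with a median filter on the
# bounding-box stream to smooth out missed or false detections.
from collections import deque
from statistics import median

class CenteringController:
    def __init__(self, frame_w, frame_h, gain=0.002, window=5):
        self.frame_w, self.frame_h, self.gain = frame_w, frame_h, gain
        self.cx_hist = deque(maxlen=window)   # recent box centers (x)
        self.cy_hist = deque(maxlen=window)   # recent box centers (y)

    def update(self, bbox):
        """bbox = (x, y, w, h) in pixels; returns lateral and vertical
        speed commands proportional to the median-filtered offset."""
        x, y, w, h = bbox
        self.cx_hist.append(x + w / 2)
        self.cy_hist.append(y + h / 2)
        err_x = median(self.cx_hist) - self.frame_w / 2
        err_y = median(self.cy_hist) - self.frame_h / 2
        return self.gain * err_x, self.gain * err_y
```

A single spurious detection only shifts the median slightly, so the speed command degrades gracefully instead of jumping.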
The research focused on a solution based only on image processing and
convolutional neural networks. However, a system like this, implemented in a real
environment, needs sensors to assist in the decision, such as a proximity sensor,
to avoid accidents, identify closed windows, and stabilize the flight. This is
especially true during the crossing, which, in the current system, is done
blindly: the camera loses sight of the object once the drone is inside it, and the
drone follows only the route parameters already calculated up to that point.
An interesting extension would be a neural network control model: capture the
drone's position and the objects in the image (bounding boxes) while a human pilot
performs various crossings, and build a data set from them. With this data set, a
neural control network could be trained to perform the movements, where the
bounding box would be the input and the movement speeds the output.
References
Accame M, Natale FGD (1997) Edge detection by point classification of canny filtered images. Signal
Process 60(1):11–22
Bahrampour S et al (2015) Comparative study of deep learning software frameworks. arXiv preprint
1511.06435
BAIR (2019) Caffe. Available in https://caffe.berkeleyvision.org/. Cited 2019
Bodapati JD, Veeranjaneyulu N (2019) Feature extraction and classification using deep convolu-
tional neural networks. J Cyber Secur Mob 8(2):261–276
Bottou L (2012) Stochastic gradient descent tricks. In: Neural networks: tricks of the trade. Springer,
pp 421–436
Boze SE (1995) Multi-band, digital audio noise filter. Google Patents. US Patent 5,416,847
Bradski G, Kaehler A (2008) Learning OpenCV: computer vision with the OpenCV library. O’Reilly
Media, Inc
Cai Z et al (2016) A unified multi-scale deep convolutional neural network for fast object detection.
In: European conference on computer vision. Springer, pp 354–370
Cios KJ, Pedrycz W, Swiniarski RW (2012) Data mining methods for knowledge discovery. In:
Springer Science & Business Media. Springer, vol 458
Cork RC, Vaughan RW, Humphrey LS (1983) Precision and accuracy of intraoperative temperature
monitoring. Anesth Analg 62(2):211–214
Countours (2019) OpenCV. Available in http://host.robots.ox.ac.uk/pascal/VOC/voc2007/. Cited
2019
Dalmia A (2019) Real-time object detection: understanding SSD. Available in https://medium.
com/inveterate-learner/real-time-object-detection-part-1-understanding-ssd-65797a5e675b.
Cited 2019
Damilano L et al (2013) Ground control station embedded mission planning for UAVs. J Intel Rob
Syst 69(1–4):241–256
de Brito PL et al (2019) A technique about neural network for passageway detection. In: 16th
international conference on information technology-new generations (ITNG 2019). Springer, pp
465–470
de Jesus LD et al (2019) Greater autonomy for rpas using solar panels and taking advantage of rising
winds through the algorithm. In: 16th international conference on information technology-new
generations (ITNG 2019). Springer, pp 615–616
de Waard M, Inja M, Visser A (2013) Analysis of flat terrain for the atlas robot. In: 3rd joint
conference of AI & robotics and 5th RoboCup Iran open international symposium. IEEE, pp 1–6
Deng G, Cahill L (1993) An adaptive gaussian filter for noise reduction and edge detection. In: IEEE
conference record nuclear science symposium and medical imaging conference, pp 1615–1619
Ding J et al (2016) Convolutional neural network with data augmentation for sar target recognition.
IEEE Geosci Remote Sens Lett 13(3):364–368
DRONEKIT (2019) Available in https://dronekit.io/. Cited 2019
Falanga D et al (2018) The foldable drone: a morphing quadrotor that can squeeze and fly. IEEE
Rob Autom Lett 4(2):209–216
French R, Ranganathan P (2017) Cyber attacks and defense framework for unmanned aerial systems
(uas) environment. J Unmanned Aerial Syst 3:37–58
Garcia J, Molina JM (2020) Simulation in real conditions of navigation and obstacle avoidance with
px4/gazebo platform. In: Personal and ubiquitous computing. Springer, pp 1–21
GAZEBOSIM (2019) Available in http://gazebosim.org/. Cited 2019
GOOGLE (2019) Open images dataset. Available in https://opensource.google.com/projects/open-
images-dataset. Cited in 2019
He K et al (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer
vision, pp 2961–2969
He K et al (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition.
IEEE Trans Pattern Anal Machine Intell 37(9):1904–1916
Hoover A, Kouznetsova V, Goldbaum M (2000) Locating blood vessels in retinal images by piece-
wise threshold probing of a matched filter response. IEEE Trans Med Imag 19(3):203–210
Huang J et al (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7310–7311
Hui J (2019) SSD object detection: single shot MultiBox detector for real-time processing. Avail-
able in https://medium.com/@jonathanhui/ssd-object-detection-single-shot-multibox-detector-
for-real-time-processing-9bd8deac0e06. Cited 2019
Hussain Z et al (2017) Differential data augmentation techniques for medical imaging classification
tasks. In: AMIA annual symposium proceedings. American Medical Informatics Association, p
979
Ilie I, Gheorghe GI (2016) Embedded intelligent adaptronic and cyber-adaptronic systems in organic
agriculture concept for improving quality of life. Acta Technica Corviniensis-Bull Eng 9(3):119
Ito K, Xiong K (2000) Gaussian filters for nonlinear filtering problems. IEEE Trans Autom Control
45(5):910–927
Jain R (1990) The art of computer systems performance analysis: techniques for experimental
design, measurement, simulation, and modeling. Wiley, Hoboken
Jarrell TA et al (2012) The connectome of a decision-making neural network. Science
337(6093):437–444
Jeong J (2019) The most intuitive and easiest guide for convolutional neural net-
work. Available in: https://towardsdatascience.com/the-most-intuitive-and-easiest-guide-for-
convolutional-neural-network-3607be47480. Cited 2019
Koenig N, Howard A (2004) Design and use paradigms for gazebo, an open-source multi-robot
simulator. In: IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE
Cat. No. 04CH37566), vol 3, pp 2149–2154
Kovalev V, Kalinovsky A, Kovalev S (2016) Deep learning with theano, torch, caffe, tensorflow,
and deeplearning4j: which one is the best in speed and accuracy? Publishing Center of BSU,
Minsk
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks. In: Advances in neural information processing systems. pp 1097–1105
Kumar A (2019) Computer vision: Gaussian filter from scratch. Available in https://medium.com/
@akumar5/computer-vision-gaussian-filter-from-scratch-b485837b6e09. Cited 2019
Kurt I, Ture M, Kurum AT (2008) Comparing performances of logistic regression, classification
and regression tree, and neural networks for predicting coronary artery disease. Exp Syst Appl
34(1):366–374
Kyrkou C et al (2019) Drones: augmenting our quality of life. IEEE Potentials 38(1):30–36
Liu W et al (2016) Ssd: Single shot multibox detector. In: European conference on computer vision.
Springer, pp 21–37
Marengoni M, Stringhini S (2009) Tutorial: Introdução à visão computacional usando opencv (in
portuguese). Revista de Informática Teórica e Aplicada 16(1):125–160
Martins WM et al (2018) A computer vision based algorithm for obstacle avoidance. In: Information
technology-new generations. Springer, pp 569–575
MAVROS (2019) Available in http://wiki.ros.org/mavros. Cited 2019
MAVSDK (2019) Available in https://mavsdk.mavlink.io/. Cited 2019
Meier L et al (2011) Pixhawk: a system for autonomous flight using onboard computer vision. In:
IEEE international conference on robotics and automation, pp 2992–2997
Moray A (2007) Available in https://docs.opencv.org/. Cited 2019
Ning C et al (2017) Inception single shot multibox detector for object detection. In: IEEE interna-
tional conference on multimedia & expo workshops (ICMEW), pp 549–554
OPENCV (2019) Canny edge detector. Available in https://docs.opencv.org/2.4/doc/tutorials/
imgproc/imgtrans/cannydetector/cannydetector.html. Cited 2019
OPENCV (2019) Simple thresholding. Available in https://docs.opencv.org/master/d7/d4d/
tutorialpythresholding.html. Cited 2019
Pandey P (2019) Understanding the mathematics behind gradient descent. Available in https://
opensource.google.com/projects/open-images-dataset. Cited 2019
Pinto LGM et al (2019) A ssd–ocr approach for real-time active car tracking on quadrotors. In: 16th
international conference on information technology-new generations (ITNG 2019). Springer, pp
471–476
Pixhawk Available in https://pixhawk.org/. Cited 2019
Planner A (2019a) APM planner. Available in https://ardupilot.org/planner2/. Cited 2019
Planner M (2019b) Mission planner. Available in https://ardupilot.org/planner/. Cited 2019
Prescott JW (2013) Quantitative imaging biomarkers: the application of advanced image processing
and analysis to clinical and preclinical decision making. J Digit Imag 26(1):97–108
PX4SIM (2019) Available in https://dev.px4.io/. Cited 2019
QGROUNDCONTROL. Available in: http://qgroundcontrol.com/. Cited 2019
Ramirez-Atencia C, Camacho D (2018) Extending qgroundcontrol for automated mission planning
of UAVs. Sensors 18(7):2339
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, pp 7263–7271
Ren S et al (2015) Faster r-cnn: towards real-time object detection with region proposal networks.
In: Advances in neural information processing systems, pp 91–99
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization
in the brain. Psychol Rev 65(6):386
Sergeev A, Balso MD (2018) Horovod: fast and easy distributed deep learning in tensorflow. arXiv
preprint arXiv:1802.05799
Szarvas M et al (2005) Pedestrian detection with convolutional neural networks. In: Intelligent
vehicles symposium, pp 224–229
TensorFlow (2019). Available in https://www.tensorflow.org. Cited in 2019
Tindall L, Luong C, Saad A (2015) Plankton classification using vgg16 network
Unruh A (2019) What is the TensorFlow machine intelligence platform? Available in https://
opensource.com/article/17/11/intro-tensorflow. Cited 2019
Vargas ACG, Paes A, Vasconcelos CN (2016) Um estudo sobre redes neurais convolucionais e sua
aplicação em detecção de pedestres (in portuguese). In: Proceedings of the XXIX conference
on graphics, patterns and images, pp 1–4
Vora K, Yagnik S, Scholar M (2015) A survey on backpropagation algorithms for feedforward
neural networks. Citeseer
Xia X, Xu C, Nan B (2017) Inception-v3 for flower classification. In: 2nd international conference
on image, vision and computing (ICIVC), pp 783–787
Yadav N, Binay U (2017) Comparative study of object detection algorithms. Int Res J Eng Technol
(IRJET) 4(11):586–591
Chapter 16
Analysis of Features in SAR Imagery
Using GLCM Segmentation Algorithm
Abstract The Synthetic Aperture Radar (SAR) system is one of the most widely used
systems owing to its property of self-illumination. The use of SAR is gaining
interest in earth remote sensing due to its advantages over optical imaging
systems. Its ability to monitor consistently in changing weather conditions makes
SAR important for radar imaging. Feature detection in SAR images can be achieved
with accurate texture segmentation methods. This paper introduces the Grey Level
Co-occurrence Matrix (GLCM), which proves to be a good discriminator for the
identification of different textural features in SAR imagery. With this technique,
features present in SAR images, such as water, vegetation, and urban areas on
land, are detected using different orientations.
16.1 Introduction
Synthetic aperture radar (SAR) is an active microwave radar with the ability to
acquire high-resolution images independently of daylight. SAR imaging is an
important earth observation technique in remote sensing, used in a wide range of
applications. SAR satellites provide high-resolution images, yet it is difficult
to identify targets, such as rivers, present in these data. Hence it is necessary
to introduce segmentation techniques that help identify such targets in less time.
One of the simplest segmentation methods is thresholding, which falls under
intensity-based segmentation. Its major drawback is that only the intensity value
is considered, not the relationship between the pixels in an image. This leads
either to losing information from a particular region or to including too many
unnecessary background pixels.
GLCM texture segmentation is a statistical texture segmentation method based on
second-order characteristics, which considers the spatial relationship among
pixels (Payal Dilip Wankhade 2014; Gonzalez et al. 2009). Compared with previously
existing algorithms such as the watershed algorithm, the results obtained with the
GLCM method are much better (Kaur et al. 2014). This paper shows how this texture
segmentation method can be applied to SAR images to detect important features in
them. Section 16.2 gives an overview of the Gray Level Co-occurrence Matrix
approach to texture segmentation. In Sect. 16.3, the features based on the gray
level co-occurrence matrix are explained. Section 16.4 presents the algorithm and
the methodology implemented for texture segmentation. A comparison of the results
obtained from texture segmentation carried out on different SAR images is given in
Sect. 16.5.
One of the earliest and most widely used methods for texture feature extraction is
the Gray-Level Co-occurrence Matrix (GLCM), proposed by Haralick in 1973; since
then it has been used in many texture analysis applications (Pathak et al. 2013).
The GLCM has proved to be one of the most popular statistical methods for
extracting textural features from images, as it considers the spatial relationship
of pixels (Mohanaiah et al. 2013; Materka et al. 1998). The GLCM can be computed
in four directions, horizontal (0◦ or 180◦), vertical (90◦ or 270◦), right
diagonal (45◦ or 225◦), and left diagonal (135◦ or 315◦) (Hall-Beyer and Mryka
2017), denoted P0, P45, P90, and P135 (Girisha et al. 2013), as shown in Fig. 16.2
together with the GLCM created from the test image in all four directions. The
co-occurrence matrix directions used in the GLCM are shown in Fig. 16.1 (Pathak et
al. 2013).
The GLCM records how often a pixel with gray level value i occurs in a given
spatial relationship, horizontally, vertically, or diagonally, to an adjacent
pixel with value j (Singh and Inderpal 2014; Girisha et al. 2013). Two
neighbouring pixels are separated by a distance d, one with gray level i and the
other with gray level j. The co-occurrence matrix is computed through a window
which scans the image, so a matrix can be associated with each pixel, following
the directions shown in Fig. 16.1 (Pathak et al. 2013). The resulting matrices are
used to characterize textures in the image and contain information such as
contrast, energy, entropy, and variance (Hall-Beyer and Mryka 2017). After the
GLCM is created, various features can be computed from it, as discussed in the
next section (Singh and Inderpal 2014; Pathak et al. 2013).
Fig. 16.2 Creation of the GLCM from an image matrix: a a test image, b general form of the
GLCM along the four possible directions, c 0◦, d 45◦, e 90◦ and f 135◦; # represents the number
of occurrences at distance = 1 (Pathak et al. 2013)
Haralick extracted 14 features from the GLCM. To extract Haralick features, the
GLCM must be symmetric, which is achieved by adding its transpose to the original
GLCM. A normalized matrix is then created by dividing each element by the sum of
all elements in the GLCM (Girisha et al. 2013). The features are then extracted
from this normalized, symmetric GLCM.
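The construction just described can be sketched as follows: glcm counts co-occurrences for a single offset (under one common convention, 0◦ corresponds to the offset (dx, dy) = (1, 0); conventions vary), and normalize_glcm performs the symmetrization and normalization. Libraries such as scikit-image offer the same computation; this version is written out for clarity.

```python
import numpy as np

def glcm(img, dx, dy, levels):
    """Count co-occurrences of gray levels (i, j) at offset (dx, dy)."""
    m = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:   # neighbour inside image
                m[img[y, x], img[y2, x2]] += 1
    return m

def normalize_glcm(m):
    """Add the transpose, then divide by the sum, so the entries become
    co-occurrence probabilities of a symmetric GLCM."""
    s = m + m.T
    return s / s.sum()
```

For a small 3 × 3 image with 3 gray levels, the horizontal GLCM has six counts (two pairs per row), and the normalized matrix sums to 1.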
Image texture features can be classified into three groups depending on the
distribution of spatial variation of the pattern in an image:
(1) contrast
(2) orderliness
(3) statistics (Wen et al. 2011).
256 James et al.
Group one includes contrast, homogeneity, and dissimilarity; from this group we
selected contrast for texture segmentation. The second group measures orderliness
and includes energy and entropy; entropy is said to be inversely correlated with
energy (Wen et al. 2011). We use both energy and entropy for texture segmentation
in SAR imagery. Group three includes mean, variance, and correlation, where
variance is said to be correlated with contrast, and correlation is uncorrelated
with energy, entropy, and contrast (Wen et al. 2011). From this group, variance is
selected for texture segmentation. These features are described below.
16.3.1 Energy
The energy feature, also called uniformity or angular second moment, measures the
orderliness of the gray level distribution in an image. It is high when the image
is homogeneous, that is, when it contains many similar pixels. The expression for
energy is

$$\text{Energy} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} P(i,j)^2 \qquad (16.1)$$
16.3.2 Contrast
Contrast represents the amount of local variation in an image, acting as a good
edge detector, and measures the spatial frequency of an image (Girisha et al.
2013; Cevik et al. 2016). The general expression for this feature is

$$\text{Contrast} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i-j)^2 \, P(i,j) \qquad (16.2)$$
16.3.3 Homogeneity
The homogeneity feature takes high values for a low-contrast image (Girisha et
al. 2013). The general expression for homogeneity is

$$\text{Homogeneity} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{P(i,j)}{1+(i-j)^2} \qquad (16.3)$$
16.3.4 Correlation
The correlation feature measures gray-tone linear dependencies in the image and is
uncorrelated with energy and entropy (Girisha et al. 2013; Mohanaiah et al. 2013).
The correlation of each pixel with its neighbour is computed over the entire
image, measuring the linear dependency of gray levels between neighbouring pixels.
This feature can be expressed as follows (Cevik et al. 2016):

$$\text{Correlation} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{(i-\mu_i)(j-\mu_j)\,P(i,j)}{\sigma_i \sigma_j} \qquad (16.4)$$
16.3.5 Entropy
The entropy feature belongs to the orderliness group and indicates how irregular
the pixel values within the window are. Entropy gives the amount of information in
the image, which is relevant for image compression (Mohanaiah et al. 2013). The
expression for entropy is

$$\text{Entropy} = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} P(i,j) \, \log P(i,j) \qquad (16.5)$$
16.3.6 Variance
Variance is the average squared distance of each data point from the mean, also
called the mean squared deviation. The expression for variance is

$$\text{Variance} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i-\mu)^2 \, P(i,j) \qquad (16.6)$$
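The six features of Eqs. (16.1)–(16.6) can be computed directly from a normalized, symmetric GLCM P. The sketch below follows the formulas term by term; the natural logarithm and the skipping of zero entries in the entropy are implementation choices not fixed by the text.

```python
import numpy as np

def haralick_features(P):
    """Compute the six GLCM features above from a normalized GLCM P."""
    n = P.shape[0]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    mu_i, mu_j = (i * P).sum(), (j * P).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * P).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * P).sum())
    nz = P > 0                                  # skip log(0) terms
    return {
        "energy":      (P ** 2).sum(),                          # (16.1)
        "contrast":    ((i - j) ** 2 * P).sum(),                # (16.2)
        "homogeneity": (P / (1.0 + (i - j) ** 2)).sum(),        # (16.3)
        "correlation": ((i - mu_i) * (j - mu_j) * P).sum()      # (16.4)
                       / (sd_i * sd_j),
        "entropy":     -(P[nz] * np.log(P[nz])).sum(),          # (16.5)
        "variance":    ((i - mu_i) ** 2 * P).sum(),             # (16.6)
    }
```

Note that the correlation term assumes non-degenerate GLCMs (nonzero standard deviations), which holds for any window containing more than one gray level.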
The first step is to acquire SAR images; for this purpose we use Sentinel-1 data
from the European Space Agency (ESA). All the images used were collected from ESA.
Images are selected from the database, and the analysis uses a window size of
5 × 5, a distance d = 1, and orientations of 0◦, 45◦, 90◦ and 135◦. The Gray Level
Co-occurrence Matrix (GLCM) is created as follows:
• Consider the set of samples that fall within a window of the specified size,
centered on a given sample.
• Entry (i, j) of the GLCM is the number of times two samples of intensities i and
j occur in the specified spatial relationship.
The obtained GLCM is made symmetric by adding its transpose to itself. The result
is normalized by dividing each element by the sum of all elements in the matrix,
so that the entries of the GLCM express probabilities. Finally, texture feature
detection is carried out with features such as contrast, ASM, entropy and
variance, and the obtained results are compared and tabulated (Fig. 16.3).
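Putting the steps above together, here is a self-contained sketch of the sliding-window computation for a single feature (contrast, Eq. 16.2) and a single orientation, using the stated 5 × 5 window, d = 1, and a small number of gray levels; the edge padding is an implementation choice. The input image is assumed to be already quantized to values below `levels`.

```python
import numpy as np

def contrast_map(img, win=5, levels=3, dx=1, dy=0):
    """Per-pixel GLCM contrast over a sliding window (edge-padded)."""
    pad = win // 2
    padded = np.pad(img, pad, mode="edge")
    i, j = np.meshgrid(np.arange(levels), np.arange(levels), indexing="ij")
    out = np.zeros(img.shape, dtype=float)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            w = padded[y:y + win, x:x + win]
            m = np.zeros((levels, levels))
            for yy in range(win):
                for xx in range(win):
                    y2, x2 = yy + dy, xx + dx
                    if 0 <= y2 < win and 0 <= x2 < win:
                        m[w[yy, xx], w[y2, x2]] += 1    # count pair
            m = m + m.T                                 # symmetrize
            p = m / m.sum()                             # normalize
            out[y, x] = ((i - j) ** 2 * p).sum()        # Eq. (16.2)
    return out
```

A uniform region yields zero contrast everywhere, while a boundary between gray levels lights up in the feature map, which is the basis of the segmentation.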
16.5 Results
The SAR data considered for texture analysis are shown in Fig. 16.4. An area of
the Netherlands is considered for study, covered by the class A and B data; class
C covers the region between Denmark and the Netherlands. The sizes of the class A,
B and C images are 378 × 550, 376 × 550 and 376 × 549 pixels, respectively. The
results are observed by varying the window size over 3 × 3, 5 × 5, 7 × 7, 9 × 9
and 11 × 11. We observe that over-segmentation takes place with the 11 × 11
window, as shown in Figs. 16.17 and 16.18 for the class A and B images. It is
necessary to select the right window size so that the features present in the SAR
images are detected clearly. Varying the window size shows that 5 × 5 gives
clearer results than the other sizes, so we select a 5 × 5 window for all the
images. The distance considered is 1, and 3 gray levels are used to reduce
computation time and avoid complexity.
The results for the class A image show that the water details are clear for all
orientations; hence we compare the land details. We observe that the vegetation
regions on land are detectable using the GLCM variance feature, as shown in
Figs. 16.5, 16.6, 16.7 and 16.8.
The presence of urban areas on land, such as The Hague and Amsterdam, can be
detected using the energy feature, as shown in the figures for all orientations of
the class A results. The 0◦ energy results are clearer than those of the other
orientations, as shown in Figs. 16.5, 16.6, 16.7 and 16.8.
For the class B image, variance gave clearer results, and no change is observed in
the contrast results. The presence of water regions in the land can be detected
Fig. 16.4 SAR image data considered for GLCM segmentation. a ClassA, b ClassB, c ClassC
Fig. 16.5 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦
Fig. 16.6 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 45◦
Fig. 16.7 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 90◦
Fig. 16.8 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 135◦
Fig. 16.9 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦
using the variance feature, as observed in Figs. 16.9, 16.10, 16.11 and 16.12,
with the 5 × 5 window for the class B image.
Applying the algorithm to the class C image, the River Elbe can be detected from
the GLCM variance feature, as shown in Figs. 16.13, 16.14, 16.15 and 16.16. The
presence of urban areas such as Hamburg is also detectable using the energy
feature of the class C image data. The variance results are observed to be clearer
than those of the other features.
Since the water details are clear in all the images, the land details of all the
SAR images, for all possible orientations, are compared and tabulated in
Table 16.1. For the construction of any co-occurrence matrix, the distance (d) and
direction (θ) parameters are important; hence we compare the GLCM features
observed in the SAR images using results computed with d = 1.
Fig. 16.10 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 45◦
Fig. 16.11 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 90◦
Fig. 16.12 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 135◦
Fig. 16.13 Results obtained for class C data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦
Fig. 16.14 Results obtained for class C data for a Contrast, b Entropy, c Variance, d ASM for
orientation 45◦
Fig. 16.15 Results obtained for class C data for a Contrast, b Entropy, c Variance, d ASM for
orientation 90◦
Fig. 16.16 Results obtained for class C data for a Contrast, b Entropy, c Variance, d ASM for
orientation 135◦
Fig. 16.17 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦ with window size 11 × 11
Fig. 16.18 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦ with window size 11 × 11
16.6 Conclusion
In this paper, a comparative analysis of GLCM texture features is carried out to
detect different features in SAR images. MATLAB was used to differentiate the
various features present in SAR images using the different possible orientations.
This texture segmentation method can segment regions with stronger texture than
the rest of an image, which is useful for detecting important features in SAR
images. The comparative analysis of GLCM features on SAR images shows that energy
and variance gave more accurate results than the other features. Examples of SAR
applications where GLCM texture segmentation can be used include tropical forest
monitoring and land cover change detection after natural disasters such as
landslides and floods, among many others. This texture segmentation method also
helps obtain information that cannot be achieved through conventional segmentation
methods.
Acknowledgements The images used in this paper are Sentinel-1 satellite data from the ESA (European Space Agency). The authors would like to thank the ESA for providing the data along with the necessary information that helped us to analyse the data and apply the algorithm to it.
Part III
Applications and Issues
Chapter 17
Offline Signature Verification Using
Galois Field-Based Texture
Representation
Abstract Signature verification has been one of the popular research areas in biometric recognition, alongside physiological traits such as face recognition. Offline signatures can be treated as texture images, and thus texture representation methods can be applied to them. One such texture representation is based on Galois fields. In this work, after application of the Galois field operator, the cumulative histogram is built and normalized. The bin values thus obtained are considered as features of a signature image. A k-NN classifier is used for offline signature verification. Experiments conducted on a benchmark dataset, namely the GPDS synthetic signature database, support the applicability of the proposed method.
17.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 269
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_17
270 S. Shivashankar et al.
biometric features against the user's biometric features stored in the database (Jain et al. 2004). In a signature verification system, a query signature is classified as genuine or forged. Forgeries can be of three types: random, simple and skilled. In random forgeries, the forger knows neither the other person nor that person's signature; he unknowingly uses his own signature, which differs in shape and style from the original and therefore leads to a very different result. In simple forgeries, the forger deliberately tries to copy the person's signature; the forger is aware of the person's name but not the signature, so the forgery may resemble the original but is not exactly the same. In skilled forgeries, the forger is aware of both the person's name and signature and copies the signature as closely as possible, which makes this type of forgery hard to detect (Hafemann et al. 2017).
Offline and online are the two types of signature verification systems, depending on how the signatures are acquired. In the online type, signatures are acquired using a device such as a digitizing tablet and are collected as a sequence over time, including the coordinates of writing points, the angle and direction of the pen, and pen pressure. Online signatures are difficult to forge, since they contain dynamic information. In the offline type, the signature is acquired after the writing process is complete and is treated as a digital image (Hafemann et al. 2017). Hence, offline signature verification is a pattern recognition problem.
Some of the research works on the topic are presented as follows: Kalera, Srihari and Xu developed an offline signature verification method based on a quasi-multiresolution technique using structural, concavity and gradient features for feature extraction (Kalera et al. 2004). Fierrez-Aguilar, Alonso-Hermira, Moreno-Marquez and Ortega-Garcia proposed an offline signature verification system employing the fusion of global and local information (Fierrez-Aguilar et al. 2004). Global, directional and grid features of signatures were used for offline signature verification by Ozgunduz, Senturk and Karsligil (Ozgunduz et al. 2005). Kiani, Pourreza Shahri and Pourreza (Kiani et al. 2009) and Bharadi and Kekre (Bharadi and Kekre 2010) used the local Radon transform and cluster-based global features for feature extraction, respectively. Pun and Lee extracted features using the log-polar transform to eliminate rotation and scale effects in the input image (Pun and Lee 2003). Local binary patterns (LBP) used the grey level distribution to enhance statistical and structural analysis of textural patterns (Ojala et al. 2002). Vargas, Ferrer, Travieso and Alonso used the co-occurrence matrix and LBP to extract grey level statistical texture features at the global level (Vargas et al. 2011). Ferrer, Vargas, Morales and Ordonez proved the robustness of grey level features extracted from a distorted signature image (Ferrer et al. 2012). Wajid and Mansoor evaluated the performance of classifiers using a feature vector formed by a code matrix of LBPs, created from divided signature images (Wajid and Mansoor 2013). Serdouk, Nemmour and Chibani developed a descriptor called orthogonal combination local binary pattern (OC-LBP) based on the orthogonal combination of LBP (Serdouk et al.). Shekar, Bharathi, Kittler, Vizilter and Mestestskiy represented the grid structured morphological spectrum in the form of a histogram for offline signature verification (Shekar et al. 2015). Pal, Alaei, Pal and Blumenstein used LBP and ULBP in extracting features from offline signature images (Pal et al. 2016). Yilmaz and Yanikoglu presented an offline signature verification system that used histograms of LBP, oriented gradients and scale invariant feature transform descriptors to generate a score-level fusion of complementary classifiers (Yilmaz and Yanikoglu 2016). The more recent advancements in the field are summarized in the literature review presented by Hafemann, Sabourin and Oliveira (Hafemann et al. 2017).
This paper presents an offline signature verification method using a Galois field-based texture representation. The proposed method consists of two steps: feature extraction and signature verification (classification). In the feature extraction step, features are extracted after the Galois field operator has been applied to the signature image. During verification, the features of the signature image in question are compared with the features of the genuine signature images stored in the database. A k-Nearest Neighbour (k-NN) classifier is used in the present study. Experiments have also been conducted using the log-polar transform and rotation invariant LBP (RILBP) methods to show the efficacy of the proposed method.
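As a concrete sketch of the verification step, the fragment below accepts a query signature when its distance to the stored genuine reference signatures is small. The chapter names a k-NN classifier but does not give the exact decision rule, so the nearest-neighbour-mean rule, the function names and the threshold here are illustrative assumptions.

```python
import numpy as np

def verify(query_feat, genuine_feats, threshold, k=1):
    """Accept the query signature if the mean distance to its k nearest
    genuine reference signatures is below a threshold.

    query_feat:    1-D feature vector of the questioned signature
    genuine_feats: 2-D array, one row per stored genuine signature
    """
    # Euclidean distance from the query to every stored genuine sample
    dists = np.sqrt(((genuine_feats - query_feat) ** 2).sum(axis=1))
    nearest = np.sort(dists)[:k]
    return nearest.mean() < threshold
```

With the Chi Square distance in place of the Euclidean one, only the `dists` line changes; lowering the threshold trades a higher FRR for a lower FAR.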
In the next section, the Galois field-based texture representation and feature extraction from the signature image are presented in detail. Section 17.3 briefly describes the classification technique employed in the present study. The experimental settings and the results obtained are given in Sect. 17.4, followed by the conclusion of the present study in Sect. 17.5.
17.2 Galois Field-Based Texture Representation and Feature Extraction

Texture description based on Galois fields was implemented for scale and rotation invariant texture classification (Shivashankar et al. 2017, 2018). The same methodology has been applied here to signature images, which are grey scale images with the handwritten signature representing the texture in the image. A grey scale image has intensity values ranging from 0 to 255, with 0 representing black and 255 representing white. These values can be represented in the Galois field GF(2^8), which has 256 elements. The Galois field-based texture representation procedure is as given below.
Step 1: Consider a pixel I(i, j) and its first eight neighbours in an image I.
Step 2: Perform a bitwise XOR operation on all nine pixels (addition in GF(2^8)).
Step 3: Convert the binary number obtained into a decimal value.
Step 4: Repeat Steps 1, 2 and 3 for all the pixels in the image I, which results in the transformed image I′.
Step 1: The histogram h of the transformed image I′ is computed as

h(r_k) = n_k    (17.1)

where r_k is the kth intensity value and n_k is the number of pixels with value r_k. The histogram represents the number of pixels with a particular intensity, which lies within the range 0–255 in that image.

Step 2: The cumulative histogram C_k is calculated as

C_k = Σ_{j=0}^{k} h(r_j)    (17.2)

where k = 0, 1, 2, ..., 49.

Step 3: The normalization of C_k is given in (17.3):

NC_k = C_k / C    (17.3)

where

C = c_0^2 + c_1^2 + c_2^2 + · · · + c_k^2 + · · · + c_49^2
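The extraction procedure above (Galois field operator, histogram, cumulative histogram, normalization) can be sketched as follows. The border handling and the 50-bin quantization of the 0–255 range are assumptions, since the chapter uses k = 0..49 without stating the binning rule.

```python
import numpy as np

def galois_transform(img):
    """XOR each pixel with its eight neighbours (addition in GF(2^8)).

    Border pixels are left unchanged here; the chapter does not state
    how the image boundary is handled, so this is an assumption.
    """
    out = img.copy()
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            v = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    v ^= int(img[i + di, j + dj])  # includes the centre pixel
            out[i, j] = v
    return out

def gf_features(img, n_bins=50):
    """Normalised cumulative-histogram features, following Eqs. (17.1)-(17.3)."""
    t = galois_transform(np.asarray(img, dtype=np.uint8))
    h, _ = np.histogram(t, bins=n_bins, range=(0, 256))  # Eq. (17.1)
    ck = np.cumsum(h).astype(float)                      # Eq. (17.2)
    c = np.sum(ck ** 2)      # normaliser C as printed in Eq. (17.3)
    return ck / c                                        # Eq. (17.3)
```

The 50 resulting bin values form the feature vector of one signature image.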
17.3 Classification
17.4 Experiments
Fig. 17.1 Sample images of GPDS synthetic signature dataset. Genuine images are displayed in
the first row, and forged images are displayed in the second row
AER = (FAR + FRR) / 2    (17.6)
Table 17.1 displays the results obtained with the proposed method and the Euclidean distance for signatures of different numbers of people in the GPDS synthetic signature dataset. A FAR of 0.23, 0.04 and 0.02, an FRR of 0.68, 0.93 and 0.94, and an AER of 0.46, 0.49 and 0.48 were obtained with the Euclidean distance measure for the signatures of 10, 100 and 250 persons, respectively.
It is observed that FRR is consistently greater than FAR, indicating that the signature images are correctly classified by the method proposed in Sect. 17.2. The consistency of the AER values in the last column of Table 17.2 suggests that the trend will continue as more signatures are included for classification. Depending on the threshold used in the verification system, FRR and FAR can change by a significant amount. The performance of a biometric system may be expressed using the AER; a lower AER value indicates better performance.
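The three error measures can be computed from per-signature accept/reject decisions; the sketch below is minimal, and the input format is an assumption.

```python
def error_rates(genuine_accepted, forged_accepted):
    """FAR, FRR and AER (Eq. 17.6) from verification decisions.

    genuine_accepted: booleans, one per genuine test signature
    forged_accepted:  booleans, one per forged test signature
    """
    frr = 1.0 - sum(genuine_accepted) / len(genuine_accepted)  # genuine rejected
    far = sum(forged_accepted) / len(forged_accepted)          # forgeries accepted
    aer = (far + frr) / 2.0
    return far, frr, aer
```

For example, accepting 2 of 4 genuine signatures and 1 of 4 forgeries gives FAR 0.25, FRR 0.5 and AER 0.375.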
In what follows, an experiment is described to show how a different distance metric affects the system performance. When the Chi Square distance metric is applied
Table 17.1 FAR, FRR and AER for different numbers of people using the proposed method with Euclidean as distance measure

No. of people   FAR    FRR    AER
10 persons      0.23   0.68   0.46
100 persons     0.04   0.93   0.49
250 persons     0.02   0.94   0.48
Table 17.2 FAR, FRR and AER for different numbers of people using the proposed method with Chi Square as distance measure

No. of people   FAR    FRR    AER
10 persons      0.25   0.65   0.45
100 persons     0.05   0.91   0.48
250 persons     0.02   0.94   0.48
Table 17.3 Signature verification using RILBP, log-polar and the proposed method with Euclidean distance measure

Methods                            10 persons   100 persons   250 persons
FAR (false acceptance rate)
  RILBP (Ojala et al. 2002)           0.17         0.04          0.01
  Log-polar (Pun and Lee 2003)        0.20         0.03          0.01
  Proposed                            0.23         0.04          0.02
FRR (false rejection rate)
  RILBP (Ojala et al. 2002)           0.66         0.91          0.94
  Log-polar (Pun and Lee 2003)        0.70         0.94          0.96
  Proposed                            0.68         0.93          0.94
AER (average error rate)
  RILBP (Ojala et al. 2002)           0.41         0.48          0.48
  Log-polar (Pun and Lee 2003)        0.45         0.49          0.49
  Proposed                            0.46         0.49          0.48
in place of the Euclidean distance metric in the same experiment as described above, the results in Table 17.2 are obtained. A false acceptance rate of 0.25, 0.05 and 0.02 is observed for 10, 100 and 250 persons' signatures with Chi Square as the distance measure. A false rejection rate of 0.65, 0.91 and 0.94 was observed for the same. Average error rates of 0.45, 0.48 and 0.48 were recorded for the signatures of 10, 100 and 250 persons, respectively.
Experiments are performed on the GPDS synthetic signatures database with the RILBP method, with neighbour pixels P = 8 and radius R = 1, to evaluate it against the performance of the proposed method. A FAR of 0.17, 0.04 and 0.01, an FRR of 0.66, 0.91 and 0.94 and an AER of 0.41, 0.48 and 0.48 were obtained for RILBP with Euclidean distance for the signatures of 10, 100 and 250 people. The log-polar transform method is applied on the GPDS synthetic signatures database with Euclidean distance, and a FAR of 0.20, 0.03 and 0.01, an FRR of 0.70, 0.94 and 0.96 and an AER of 0.45, 0.49 and 0.49 were recorded for the signatures of 10, 100 and 250 people. The results are tabulated in Table 17.3.
Table 17.4 Signature verification using RILBP, log-polar and the proposed method with Chi Square distance measure

Methods                            10 persons   100 persons   250 persons
FAR (false acceptance rate)
  RILBP (Ojala et al. 2002)           0.17         0.05          0.02
  Log-polar (Pun and Lee 2003)        0.19         0.03          0.01
  Proposed                            0.25         0.05          0.02
FRR (false rejection rate)
  RILBP (Ojala et al. 2002)           0.46         0.88          0.93
  Log-polar (Pun and Lee 2003)        0.65         0.93          0.96
  Proposed                            0.65         0.91          0.94
AER (average error rate)
  RILBP (Ojala et al. 2002)           0.32         0.46          0.48
  Log-polar (Pun and Lee 2003)        0.42         0.48          0.48
  Proposed                            0.45         0.48          0.48
The experiments are repeated using RILBP and the log-polar transform with Chi Square as the distance measure. A FAR of 0.17, 0.05 and 0.02 for RILBP and a FAR of 0.19, 0.03 and 0.01 for the log-polar transform method were obtained for the signatures of 10, 100 and 250 people, respectively. An FRR of 0.46, 0.88 and 0.93 for RILBP and of 0.65, 0.93 and 0.96 for the log-polar transform method was observed for the same. An AER of 0.32, 0.46 and 0.48 for RILBP and of 0.42, 0.48 and 0.48 for the log-polar transform method was recorded for the signatures of 10, 100 and 250 people, respectively. These values are presented in Table 17.4 along with the values for the proposed method. The proposed method's performance is comparable with the existing methods.
17.5 Conclusion
petent and robust system. Comparing the Galois field operator method proposed in Sect. 17.2 with methods like the log-polar transform and RILBP confirms the efficiency of the offline signature verification system.
References
Bharadi VA, Kekre HB (2010) Off-line signature recognition systems. Int J Comput Appl 1(27):48–
56
Boccignone G, Chianese A, Cordella LP, Marcelli A (1993) Recovering dynamic information from static handwriting. Pattern Recogn 26(3):409–418
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley-Interscience, New York
Ferrer MA, Vargas JF, Morales A, Ordonez A (2012) Robustness of offline signature verification based on gray level features. IEEE Trans Inf Forensics Secur 7(3):966–977
Ferrer MA, Diaz-Cabrera M, Morales A et al (2013) Synthetic off-line signature image generation.
In: 6th IAPR international conference on biometrics (ICB), pp 1–7
Fierrez-Aguilar J, Alonso-Hermira N, Moreno-Marquez G, Ortega-Garcia J (2004) An off-line
signature verification system based on fusion of local and global information. In: International
workshop on biometric authentication. Springer, pp 295–306
Hafemann LG, Sabourin R, Oliveira LS (2017) Offline handwritten signature verification—literature
review. In: 2017 seventh international conference on image processing theory, tools and applica-
tions (IPTA), pp 1–8
Jain AK, Ross A, Prabhakar S (2004) An introduction to biometric recognition. IEEE Trans Circ Syst Video Technol 14:4–20
Kalera MK, Srihari S, Xu A (2004) Offline signature verification and identification using distance statistics. Int J Pattern Recogn Artif Intell 18(07):1339–1360
Kiani V, Pourreza Shahri R, Pourreza HR (2009) Int J Image Process 3
Ojala T, Pietikainen M, Maenpaa T (2002) Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans Pattern Anal Mach Intell 24(7):971–987
Ozgunduz E, Senturk T, Karsligil ME (2005) Off-line signature verification and recognition by
support vector machine. In: Signal processing conference, 2005 13th European IEEE, pp 1–4
Pal S, Alaei A, Pal U, Blumenstein M (2016) Performance of an off-line signature verification method based on texture features on a large indic-script signature dataset. In: 2016 12th IAPR workshop on document analysis systems (DAS), pp 72–77
Pun C-M, Lee M-C (2003) Log-polar wavelet energy signatures for rotation and scale invariant
texture classification. IEEE Trans Pattern Anal Mach Intell 25(5):590–603
Semwal VB, Raj M, Nandi GC (2015) Biometric gait identification based on a multilayer perceptron. Robot Autonom Syst 65:65–75
Serdouk Y, Nemmour H, Chibani Y Orthogonal combination and rotation invariant of local binary
patterns for off-line handwritten signature verification
Shekar BH, Bharathi RK, Kittler J, Vizilter YV, Mestestskiy L (2015) Grid structured morpho-
logical pattern spectrum for off-line signature verification. In: 2015 international conference on
biometrics (ICB), pp 430–435
Shivashankar S, Kudari M, Hiremath PS (2017) Texture representation using Galois field for rotation invariant classification. In: 2017 13th international conference on signal-image technology & internet-based systems (SITIS), pp 237–240
Shivashankar S, Kudari M, Hiremath PS (2018) Galois field-based approach for rotation and scale
invariant texture classification. Int J Image Graph Signal Process (IJIGSP) 10(9):56–64
Vargas JF, Ferrer MA, Travieso CM, Alonso JB (2011) Off-line signature verification based on grey
level information using texture features. Pattern Recogn 44(2):375–385
Wajid R, Mansoor AB (2013) Classifier performance evaluation for offline signature verification
using local binary patterns. In: 2013 4th European workshop on visual information processing
(EUVIP), pp 250–254
Yilmaz MB, Yanikoglu B (2016) Score level fusion of classifiers in off-line signature verification. Inf Fusion 32:109–119
Chapter 18
Face Recognition Using 3D CNNs
Abstract Face recognition is one of the most widely researched areas in the domain of computer vision and biometrics. This is because the non-intrusive nature of the face biometric makes it comparatively more suitable for surveillance applications at public places such as airports. The application of primitive methods to face recognition could not give very satisfactory performance. However, with the advent of machine and deep learning methods and their application to face recognition, several major breakthroughs were obtained. The use of 2D convolutional neural networks (2D CNNs) in face recognition surpassed human face recognition accuracy and reached 99%. Still, robust face recognition in the presence of real-world conditions such as variation in resolution, illumination and pose remains a major challenge for researchers in face recognition. In this work, we use video as input to 3D CNN architectures to capture both spatial and time domain information from the video for face recognition in a real-world environment. For the purpose of experimentation, we have developed our own video dataset, called the CVBL video dataset. The use of 3D CNNs for face recognition in videos shows promising results, with DenseNets performing best with an accuracy of 97% on the CVBL dataset.
18.1 Introduction
Face recognition research started long back in the 1990s, and since then the algorithms have become more efficient. Various algorithms were applied to detect a face in an image, and subsequently the recognition of the face was done using a recognition algorithm. Researchers developed various mathematical models and features to represent and recognize faces. The features were based on traits of the face such as geometry, texture, color and appearance (Brunelli and Poggio 1993; Chellappa et al. 1995; Jain et al. 2000; Turk and Pentland 1991; Viola and Jones 2004; Wiskott et al. 1997). No single feature was able to represent the face in all its complex dimensions. In addition, recognition of faces was made difficult by real-world challenges such as
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 279
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_18
280 N. K. Mishra and S. K. Singh
varying illumination, pose and resolution. Various image transformations and super-resolution methods have been proposed to deal with these challenges (Ahonen et al. 2004, 2008; Bilgazyev et al. 2011; Gunturk et al. 2003; Xie et al. 2011; Zhu et al. 2015). In spite of this, real-world applications are still not reliable and robust.
In the case of face recognition from video, the actual processing was done at the frame level, i.e. on images. The best frame among all the frames of the video was selected based on the quality of the face in the image, and the recognition algorithm was applied subsequently (Wibowo et al. 2012; Wibowo and Tjondronegoro 2011).
With the advent of deep learning architectures, 2D convolutional networks came to be applied on images or frames of videos to detect and recognize faces. The generation of features was no longer manual. Deep features, though undecipherable, were better than manually designed features for face recognition. This led to an increase in accuracy and robustness. However, these deep architectures did not treat the video as one input; rather, they generated spatial features from a series of frames.
In recent works, 3D CNNs have shown good results for activity recognition in videos (Tran et al. 2015). This is because, unlike 2D CNNs, 3D CNNs are capable of modeling the time dimension as well as the spatial dimensions. A 3D CNN accepts and treats a video as a single input unit and can generate a single compact feature that contains facial traits as well as body language, gait pattern and any other temporal and spatial patterns that may be relevant to classification.
The concept of residual networks (He et al. 2016) allowed for more depth in deep learning networks without the limitation of vanishing gradients. While 2D residual networks were successfully applied to image classification (He et al. 2016), 3D residual networks have been designed to extend the capability of residual networks into the third dimension as well and have been successfully applied to activity recognition using videos (Hara et al. 2017). The application of 3D CNNs and 3D residual networks to activity recognition motivated us to use 3D CNNs for face recognition using videos.
Apart from the YTF dataset (Wolf et al. 2011), the other video datasets, such as UCF (Khurram et al. 2012) and HMDB (Kuehne et al. 2011), are meant for activity recognition. We have therefore created a comprehensive biometric dataset with video, iris and fingerprint modalities. The video dataset has been used in our experiments for face recognition.
In this paper, we perform face recognition on the CVBL facial video dataset. This paper therefore makes the following contributions:
i. A comprehensive biometric dataset called the CVBL dataset, containing video, iris and fingerprint modalities, has been collected.
ii. It uses 3D residual networks to find the accuracy for face recognition in videos.
iii. It compares the accuracy of 3D residual networks at different depths for face recognition in videos.
iv. It also compares the accuracy of different genres of 3D residual networks for face recognition in videos.
Previous work on face recognition is discussed in Sect. 18.2. Section 18.3 discusses in detail our comprehensive biometric dataset, the CVBL dataset. The residual network architectures used in the experiment and the related configuration details are discussed in Sect. 18.4. The exact implementation details are discussed in Sect. 18.5, followed by a discussion of the results in Sect. 18.6. Finally, the conclusion and future scope are discussed in Sect. 18.7.
18.2 Related Work

A lot of work has been done in the field of face recognition using images. Convolutional neural networks (CNNs) are being used for face recognition these days. In (Hadsell et al. 2006), the authors introduced the concept of contrastive loss, which is based on the Euclidean distance between two points. In contrastive loss, points in a higher-dimensional input space are mapped to a manifold such that the Euclidean distance between the points on the manifold corresponds to the similarity between the same two points in the input space. With contrastive loss, the CNNs are trained using pairs of images: the loss pushes the network to generate highly discriminative features when the training images in the pair are dissimilar, and similar features when the images in the pair are of the same person.
In (Schroff et al. 2015), triplet loss was introduced. The authors trained a CNN using triplets of images containing an anchor image (the actual image), a positive image (an image of the same person as the anchor) and a negative image (an image of a different person). Training is done to obtain discriminative features that increase the distance between the anchor and negative faces and decrease the distance between the anchor and positive faces. For both contrastive loss and triplet loss, organizing the batches into pairs or triplets that satisfy a given condition is in itself a difficult and complex process.
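The two losses discussed above can be sketched for a single pair or triplet as follows; the margin values are illustrative, and real training averages these terms over batches.

```python
import numpy as np

def contrastive_loss(f1, f2, same, margin=1.0):
    """Contrastive loss for one pair (in the spirit of Hadsell et al. 2006):
    matching pairs are pulled together, non-matching pairs are pushed
    at least `margin` apart."""
    d = np.linalg.norm(f1 - f2)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for one triplet (in the spirit of Schroff et al. 2015):
    the anchor-positive distance should be smaller than the
    anchor-negative distance by at least `margin`."""
    d_ap = np.linalg.norm(anchor - positive) ** 2
    d_an = np.linalg.norm(anchor - negative) ** 2
    return max(0.0, d_ap - d_an + margin)
```

Both losses are zero exactly when the constraint they encode is already satisfied, which is why batch construction (finding pairs or triplets that still violate the constraint) is the hard part.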
In the sequence of improvements to loss functions to increase the discriminative power of features for face recognition, a new loss function was proposed by Liu et al. In (Liu et al. 2016), the authors proposed a generalized large-margin softmax (L-Softmax) loss which explicitly encourages intra-class compactness and inter-class separability between learned features. L-Softmax can not only adjust the desired margin but also avoid overfitting.
Liu et al. (2017) proposed a new loss function called A-Softmax as an extension and improvement of L-Softmax. A-Softmax loss can be viewed as imposing discriminative constraints on a hypersphere manifold, which intrinsically matches the prior that faces also lie on a manifold. The size of the angular margin can be quantitatively adjusted by a parameter m. This improves learning by increasing the angular margin between the classes, making feature discrimination better than with L-Softmax. The paper used two datasets for performance analysis: Labeled Faces in the Wild (LFW) and YouTube Faces (YTF). A-Softmax, also called SphereFace in the paper, achieves 99.42% and 95.0% accuracy on the LFW and YTF datasets, respectively. In an extension to the angular softmax loss, Deng et al. (2018) tried to increase the inter-class separability by introducing the concept of an additive angular margin.
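The angular-margin idea can be illustrated with a small numeric sketch. The monotonic ψ(θ) extension that SphereFace uses for mθ > π is omitted here, so the sketch only holds for small angles, and the feature and weight values below are illustrative.

```python
import numpy as np

def angular_margin_logit(feature, class_weight, m=1):
    """Target-class logit in the style of A-Softmax (Liu et al. 2017):
    the usual ||f||·cos(θ) is replaced by ||f||·cos(m·θ), where θ is the
    angle between the feature and the normalised class weight vector."""
    w = class_weight / np.linalg.norm(class_weight)
    cos_t = np.dot(feature, w) / np.linalg.norm(feature)
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
    return np.linalg.norm(feature) * np.cos(m * theta)
```

For the same angle, a larger m shrinks the target-class logit, so training must reduce θ itself to keep the class score high; this is what widens the angular margin between classes.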
A lot of work has been done on images for face recognition. However, all the works on images are prone to spoofing. This can be overcome only if we use videos for face recognition, which allows the system to check the liveness of the person through random body and face movements and thus avoid spoofing.
Activity recognition is one domain where deep learning has been successfully applied to process the temporal domain along with the spatial domain.
Karpathy et al. (2014) used a two-stream convolutional network for activity recognition in video, with one stream taking centrally cropped video frames and the other taking the full frame at half the original resolution. The two streams were concatenated later in the fully connected layer. The two-stream architecture made the processing of videos 2–4 times faster than a single-stream architecture. However, the problem of capturing the temporal dimension remained, because the use of 2D convolutions in the two-stream architecture prevented it from capturing the temporal dimension.
Yue-Hei et al. (2015) applied an array of long short-term memory (LSTM) cells to capture the temporal dimension in videos for activity recognition. Because of the use of LSTMs, the method was capable of handling full-length videos, meaning that the architecture could model temporal change across the entire length of the video. First, a layer of CNNs processed the frames of the video in sequence to produce spatial features; these spatial features were then passed to the LSTMs to extract temporal features. Donahue et al. (2015) also applied LSTMs in a different architecture but with the same objective of modeling the temporal dimension for activity recognition. However, LSTM-based architectures have not given better accuracy than two-stream-based architectures.
Tran et al. (2015) used 3D CNNs for activity recognition. Using their 3D CNN, they could capture both the spatial and temporal dimensions in the features. The features extracted from videos using the 3D CNN, which they called C3D, are highly efficient, compact and extremely simple to use. Tran et al. demonstrated that C3D features along with a linear classifier can outperform or approach the current best methods on different video analysis benchmarks. However, the only problem with 3D CNNs is that they cannot capture the entire length of a video sequence in one go. This limits the capture of the temporal dimension when the length of the temporal activity is longer than the number of frames captured by the 3D CNN.
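The spatio-temporal nature of 3D convolution can be illustrated with a minimal single-channel, valid-mode sketch; real C3D layers add channels, padding, stride and learned kernels.

```python
import numpy as np

def conv3d(clip, kernel):
    """Valid-mode single-channel 3-D convolution: the kernel slides over
    time as well as height and width, which is what lets a 3D CNN model
    motion across frames (a 2D kernel sees one frame at a time)."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i+t, j:j+h, k:k+w] * kernel)
    return out
```

A clip convolved with a 3 × 3 × 3 kernel shrinks in time as well as space, e.g. a (16, 8, 8) clip yields a (14, 6, 6) output, which also shows why a fixed temporal kernel cannot span an arbitrarily long activity.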
Tran et al. (2015) used a temporal depth of 16 frames. Laptev et al. (Varol et al. 2018) investigated what happens to activity recognition accuracy when the temporal depth of the video clip is changed, experimenting with temporal depths of 16, 20, 40, 60, 80 and 100. They found that increasing the temporal depth increased activity recognition accuracy, because the 3D CNN architecture could model the activity better when more frames were available. Thus, this experiment also confirmed that the temporal dimension plays an important role in activity recognition. However, greater temporal depth also requires more processing.
In 2016, He et al. (2016) came up with the idea of residual networks and won first place in several tracks of the ILSVRC & COCO 2015 competitions. In ILSVRC, however, the architectures are tested on images.
In 2017, Hara et al. (2017) extended the concept of residual networks from 2D
to 3D and applied 3D residual networks to activity recognition in videos. They
varied the depth of the 3D residual networks and studied its effect on accuracy,
finding that accuracy increases with depth up to a depth of 152, beyond which
the activity recognition accuracy saturates. This experiment made clear that
increasing depth allows better features to be captured and thus increases
activity recognition accuracy.
The work of Hara et al. (2017) motivated us to investigate whether residual
networks can be used to identify a person in a video. For this purpose, we
developed our own video dataset for face recognition, the CVBL dataset
(CVBL Dataset 2018), named after the lab creating it. It is an exhaustive
biometric dataset consisting of facial videos, fingerprints, and signatures
of each subject, covering 125 school-going children below the age of 15.
From the CVBL dataset, we used the face video subset for our face recognition
experiment. It consists of 320 × 240 videos taken at 30 frames per second.
Each video is at most one minute long, and at least five such videos were taken
of each subject. The videos show the subjects talking and expressing themselves
freely while seated in front of the camera, as shown in Fig. 18.1. There are
125 different subjects and thus 125 classes for face recognition. The subjects
face the camera but can move their faces in any direction while talking. The
videos have a static background and no camera motion. More subjects will be
added to the CVBL dataset (CVBL Dataset 2018) in the future.
284 N. K. Mishra and S. K. Singh
In our experiment, 415 of the total 675 videos were used for training and the
remaining 260 for testing, giving a training-to-testing split of approximately 60:40.
18.4.1 Summary
Our objective is to find the accuracy of 3D ResNets on a face recognition video
dataset. In addition, we want to know how face recognition accuracy changes with
the depth and genre of the residual network. For this purpose, we used the code
from Hara et al. (2017) for experimentation and modified it as per our
requirements and objectives. The code for the face recognition experiments
(Hara et al. 2017) uses the PyTorch library (PyTorch 2018). We begin our
analysis by checking whether the dataset is large enough to train residual
networks of such depths without overfitting. We therefore start with a depth
of 18, reasoning that if ResNet-18 overfits, the dataset is too small to train
architectures of such depth. We will experiment with deeper ResNets only once
we are convinced that the CVBL dataset is large enough to train ResNet-18
without overfitting.
In this section, we discuss the network architectures that we plan to implement
and analyze by training them on the CVBL dataset. ResNet architectures of
various depths have been experimented with. ResNet architectures have the
special property of allowing shortcut connections that bypass intermediate
layers on the way to the next layer; back-propagation still proceeds without
any problem.
Apart from ResNet (basic and bottleneck blocks) (He et al. 2016), the following
extensions of the ResNet architecture have also been used in the experiments:
pre-activation ResNet (He et al. 2016), Wide ResNet (WRN) (Zagoruyko and
Komodakis 2016), ResNeXt (Xie et al. 2017), and DenseNet (Huang et al. 2017).
A basic ResNet block (He et al. 2016) is the simplest ResNet block and consists
of only two convolution layers, each followed by a batch normalization layer
and a ReLU non-linearity. A shortcut connection links the top of the block to
the layer just before the last ReLU in the block. ResNet-18 and ResNet-34 adopt
the basic block structure.
A ResNet bottleneck block (He et al. 2016) differs from the basic block in that
it consists of three convolution layers instead of two. As in the basic block,
each convolution layer is followed by a batch normalization layer and a ReLU
layer. The first and third convolution layers use filters of size 1 × 1 × 1,
whereas the second convolution layer uses filters of size 3 × 3 × 3. The
networks that adopt the bottleneck block are ResNet-50, 101, 152, and 200.
The 1 × 1 × 1 convolutions (Lin et al. 2013) let the network go deeper while
staying computationally efficient and retaining more information than it
otherwise would.
Unlike the bottleneck ResNet, where each convolution layer is followed by batch
normalization and a ReLU, in pre-activation ResNets (He et al. 2016) the batch
normalization layer and the ReLU layer come before the convolution layer. He
et al. (2016) also confirmed in their studies that pre-activation ResNets
optimize better and are less prone to overfitting. The shortcut in
pre-activation ResNets connects the top of the block to the layer just after
the last convolution layer in the block. Pre-activation ResNet-200 is an
example of a network using pre-activation ResNet blocks.
Wide ResNets (Zagoruyko and Komodakis 2016) increase the width of the residual
network instead of its depth. Width here means the number of feature maps in a
layer: in a convolutional network it corresponds to the number of filters in a
convolution layer, and in a fully connected network to the number of neurons in
a layer. In Zagoruyko and Komodakis (2016), the authors increased the width
instead of the depth and showed that the same accuracy can be gained this way.
Some authors, however, feel that the increase in accuracy stems not from the
larger number of feature maps but from the larger number of parameters, and an
increase in the number of parameters can also cause overfitting.
DenseNets (Huang et al. 2017) are residual networks that exploit the concept of
feature reuse. In DenseNets, features from early layers are reused in later
layers through direct connections from every earlier layer to every later layer
in a feed-forward fashion, with the feature maps concatenated. These
interconnections make the network very dense, hence the name. The
pre-activation concept from pre-activation ResNets is also used in DenseNets
to reduce the number of parameters while achieving better accuracy than
ResNets. The number of feature maps added at each layer is called the growth
rate, because the feature maps at a particular layer grow after concatenation
with the feature maps of the previous layers. DenseNet-121 and DenseNet-201
with a growth rate of 32 are examples of DenseNets.
Xie et al. (2017) introduced a new term called cardinality which, per the
authors, refers to the size of the set of transformations.
18.5 Implementation
Training: For training, a 16-frame clip is generated from a temporal position
selected by uniform sampling of the video frames. If a video contains fewer
than 16 frames, a 16-frame clip is generated by looping over the existing
frames as many times as required. Multiscale cropping is done by first randomly
selecting a spatial position out of the 4 corners and the center. Then, for a
particular sample, a scale value is selected from
{1, 1/2^{1/4}, 1/√2, 1/2^{3/4}, 1/2} to perform the multiscale cropping.
The aspect ratio is kept at one, and scaling is done on the basis of the
shorter side of the frame. Frames are then resized to 112 × 112 pixels, so the
final input sample size is (3 channels × 16 frames × 112 pixels × 112 pixels).
Horizontal flipping with a probability of 50 percent is also performed. Mean
subtraction is performed to keep the pixel values zero-centered; in this
process, a mean value is subtracted from each color channel of the sample.
Cross-entropy loss is used for loss calculation and back-propagation. For
optimization using the calculated gradients, stochastic gradient descent (SGD)
with momentum is used, with a weight decay of 0.001 and a momentum of 0.9.
When training the networks from scratch, we start from a learning rate of 0.1
and divide it by 10 after the validation loss saturates.
Each video is split into non-overlapping 16-frame clips, and each clip is then
passed into the network for face recognition. In effect, we generate input
clips with a sliding window of length 16 that moves along the time dimension
in non-overlapping steps.
In discussing the results, note that the term ResNets here means ResNet-18, 34,
50, 101, and 152, and extensions of ResNets means pre-activation ResNets, wide
ResNets, and DenseNets. The results of the experiments on the CVBL dataset are summa-
Table 18.1 Accuracy of our proposed method using residual networks of different depth on CVBL
dataset for face recognition
Residual networks Accuracy (%)
ResNet-18 96
ResNet-34 93.7
ResNet-50 96.2
ResNet-101 93.4
ResNet-152 49.1
ResNeXt-101 78.5
Pre-activation ResNet-200 96.2
DenseNet-121 55
DenseNet-201 97
WideResNet-50 90.2
rized in Table 18.1. The training versus validation loss and the accuracy
comparison for all the ResNet architectures are shown in Figs. 18.3 and 18.4.
It can easily be observed that, for ResNets, accuracy does not increase with
the depth of the architecture; instead it follows a zig-zag path, as shown in
Fig. 18.2. For ResNet-18, the accuracy is 96%; it drops to 93.7% for ResNet-34,
rises back to 96.2% for ResNet-50, and then drops again to 93.4% for
ResNet-101. Table 18.1 and Fig. 18.2 also show that the performance of ResNets
drops very sharply after ResNet-101: ResNet-152 reaches only 49.1%. The
increased depth for the same training set may cause high bias and hence the
decrease in accuracy. Despite these variations, ResNets in general perform
well, with accuracies above 90% and a best ResNet accuracy of 96.2%.
Among the DenseNets, DenseNet-201 performed best with 97% accuracy. The reuse
of features from previous layers in the later layers of the DenseNet seems to
have contributed to this accuracy. The accuracy of DenseNet-121, however, was
far lower at just 55%. Figure 18.3e, f shows the training versus validation
loss for DenseNet-121 and DenseNet-201, respectively. After convergence, the
training loss of DenseNet-121 is higher than that of DenseNet-201, showing that
DenseNet-121 has a higher bias. Since a DenseNet accumulates features from
previous layers into later layers, we can say that DenseNet-201 is able to
exploit the accumulated features because of its greater depth, whereas
DenseNet-121, with its relatively lower depth, is not able to use them to
increase its accuracy.
We compare Tables 18.1 and 18.2 to set the results of our proposed method
against the state of the art on various image and video datasets. Table 18.2
shows the accuracy of different approaches on the LFW dataset for face
recognition in images and the YTF dataset for face recognition in video. From
Table 18.2, it is observed that for face recognition in images, an accuracy of
99% has been achieved by different architectures.
Fig. 18.3 Training loss versus validation loss for different ResNets
Fig. 18.4 a, b and c show the training versus validation loss for Wide
ResNet-50, pre-activation ResNet-200, and ResNeXt-101; d shows the accuracy
against the number of epochs for all residual networks
Table 18.2 State-of-the-art results for face recognition on image and video datasets
Architecture Image dataset Accuracy Video dataset Accuracy (%)
FaceNet (Schroff et al. 2015) LFW 99.63% YTF 95.12
DeepID2 (Sun et al. 2015) – – YTF 93.2
Center loss (Wen et al. 2016) LFW 99.28% YTF 94.9
Deep residual learning on images (He et al. 2016) ImageNet 3.57% error – –
L-softmax (Liu et al. 2016) LFW 98.71% – –
A-softmax (Liu et al. 2017) LFW 99.42% YTF 95.0
3D residual networks (ResNeXt-101, 64 frames) (Hara et al. 2017) – – UCF-101 94.5
3D residual networks (ResNeXt-101, 64 frames) (Hara et al. 2017) – – HMDB-51 70.2
18.7 Conclusion
ResNet architectures seem to be sensitive to face patterns in videos. Since the
CVBL dataset consists of videos with the same plain white background, we can
safely assume that the background is not contributing to the recognition
accuracy at all; only the spatial and temporal dimensions contribute
effectively to the classification and recognition accuracy. With a few
exceptions, the ResNets provide good results with accuracies above 90%, and
DenseNet-201 performed best with 97%.
Hence, we can conclude that ResNets are sensitive to face recognition patterns,
with accuracy near 96%. This is the first experiment of its kind on a face
video dataset, and the residual networks have in general given accuracies
above 90%, which is a very positive indication for the future of video
biometrics.
In the future, we plan to collect biometric samples from more subjects and
prepare a bigger, exhaustive biometric dataset. We then plan to experiment on
this bigger dataset to evaluate the effect of a large number of classes on face
recognition accuracy. We also plan to experiment on the existing YTF dataset
using residual networks.
References
Ahonen T, Hadid A, Pietikäinen M (2004) Face recognition with local binary patterns. In: Computer
vision-ECCV 2004. Springer, pp 469–481
Ahonen T, Rahtu E, Ojansivu V, Heikkila J (2008) Recognition of blurred faces using local phase
quantization. In: International conference on pattern recognition
Bilgazyev E, Efraty B, Shah SK, Kakadiaris IA (2011) Improved face recognition using super-
resolution. In: 2011 international joint conference on biometrics (IJCB). IEEE, pp 1–7
Brunelli R, Poggio T (1993) Face recognition: features versus templates. IEEE Trans Pattern Anal
Mach Intell 15(10):1042–1052
Chellappa R, Wilson CL, Sirohey S (1995) Human and machine recognition of faces: a survey. Proc
IEEE 83(5):705–740
CVBL Dataset. https://cvbl.iiita.ac.in/dataset.php. Last accessed 30 Dec 2018
Deng J, Guo J, Xue N, Zafeiriou S (2018) Arcface: Additive angular margin loss for deep face
recognition. arXiv preprint arXiv:1801.07698
Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T
(2015) Long-term recurrent convolutional networks for visual recognition and description. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634
Gunturk BK, Batur AU, Altunbasak Y, Hayes MH, Mersereau RM (2003) Eigenface-domain super-
resolution for face recognition. IEEE Trans Image Process 12(5):597–606
Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping.
In: Proceedings of the IEEE conference on computer vision and pattern
recognition (CVPR). IEEE, pp 1735–1742
Hara K, Kataoka H, Satoh Y (2017) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs
and ImageNet? arXiv preprint arXiv:1711.09577
He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. European
conference on computer vision. Springer, Cham, pp 630–645
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings
of the IEEE conference on computer vision and pattern recognition, pp 770–778
Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional
networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
(CVPR), pp 4700–4708
Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern
Anal Mach Intell 22(1):4–37
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video
classification with convolutional neural networks. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp 1725–1732
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human
motion recognition. In: 2011 IEEE international conference on computer vision (ICCV). IEEE,
pp 2556–2563
Lin M, Chen Q, Yan S (2013) Network in network. arXiv preprint arXiv:1312.4400
Liu W et al (2017) Sphereface: deep hypersphere embedding for face recognition. In: The IEEE
conference on computer vision and pattern recognition (CVPR), vol 1
Liu W, Wen Y, Yu Z, Yang M (2016) Large-margin softmax loss for convolutional neural networks.
In: ICML, pp 507–516
PyTorch. https://pytorch.org/. Last accessed 25 Dec 2018
Schroff F, Kalenichenko D, Philbin J (2015) Facenet: a unified embedding for face recognition and
clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 815–823
Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human action classes from videos
in the wild. CRCV-TR-12-01, Nov 2012
Sun Y, Wang X, Tang X (2015) Deeply learned face representations are sparse,
selective, and robust. In: Proceedings of the IEEE conference on computer
vision and pattern recognition
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with
3d convolutional networks. In: Proceedings of the IEEE international conference on computer
vision, pp 4489–4497
Turk M, Pentland A (1991) Eigenfaces for recognition. J Cogn Neurosci 3(1):71–86
Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. IEEE
Trans Pattern Anal Mach Intell 40(6):1510–1517
Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
Wen Y et al (2016) A discriminative feature learning approach for deep face recognition. In: Euro-
pean conference on computer vision. Springer, Cham
Wibowo ME, Tjondronegoro D, Chandran V (2012) Probabilistic matching of image sets for video-
based face recognition. In: International conference on digital image computing: techniques and
applications (DICTA)
Wibowo ME, Tjondronegoro D (2012) Face recognition across pose on video using eigen light-
fields. International conference on digital image computing: techniques and applications (DICTA)
2011:536–541
Wiskott L, Fellous J-M, Krüger N, von der Malsburg C (1997) Face recognition
by elastic bunch graph matching. IEEE Trans Pattern Anal Mach Intell
19(7):775–779
Wolf L, Hassner T, Maoz I (2011) Face recognition in unconstrained videos with matched back-
ground similarity. In: CVPR
Xie X, Zheng W-S, Lai J, Yuen PC, Suen CY (2011) Normalization of face illumination based on
large-and small-scale features. IEEE Trans Image Process 20(7):1807–1821
Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural
networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
(CVPR), pp 1492–1500
Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond
short snippets: deep networks for video classification. In: Proceedings of the IEEE conference
on computer vision and pattern recognition, pp 4694–4702
Zagoruyko S, Komodakis N (2016) Wide residual networks. In: Proceedings of the British machine
vision conference
Zhu X, Lei Z, Yan J, Yi D, Li SZ (2015) High-fidelity pose and expression
normalization for face recognition in the wild. In: Proceedings of the IEEE
conference on computer vision and pattern recognition, pp 787–796
Chapter 19
Fog Computing-Based Seed Sowing
Robots for Agriculture
Abstract Agriculture is the most important field and the backbone of any
country's economic system. After soil testing of the land, seed sowing is the
most important and time-consuming process. For large-scale farming, seed sowing
robots are proposed together with fog computing. These robots have
microcontroller units (MCUs) with firmware that communicates with the fog layer
through a smart edge node. Fog robotics provides services such as security,
distributed storage, minimized latency, and efficient bandwidth utilization.
A bot saves battery because it communicates with the fog layer instead of the
cloud layer. A typical seed sowing robot consists of a powered wheel, an MCU,
a plower, a seed hopper, a counter sensor, a UV sensor, an IR sensor, and a
preloaded map of the area. Different planting methods and the corresponding
seed sowing systems are shown, and the seed rate, row spacing, spacing between
seeds, hopper volume, and seed density are calculated at a standard velocity.
The robot uses simultaneous localization and mapping (SLAM) and other
path-finding algorithms to work the field, and IR sensors detect the end of
the field and obstacles for each robot. The fastai method and machine learning
techniques are used to classify the wheat dataset into different classes with
high accuracy.
19.1 Introduction
In farming, after soil testing, plant cropping is an important factor, and the
farmer must decide which crop to plant. This is an important and tedious job
for any farmer; in large-scale farming the area is very large, and performing
this activity needs many workers and proper planning. The old traditional
techniques need a lot of effort and time, which directly reduces production
and quality. Thus, agriculture robots and IoT devices play an important role
and were developed to simplify and reduce human effort. The traditional method
of seed planting gives low spacing efficiency and poor seed placement, and
causes serious backache for the farmer. Seeds can also be planted only in the
limited size of the
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 295
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_19
296 J. Lachure and R. Doriya
field. Hence, to achieve maximum performance, the limits of a seed planter
should be optimized. We therefore need to use Internet of Things devices and
robots to make farms automatic. For seed sowing, the robot consists of
different sensors mounted on a mechanical structure.
Agriculture is the main pillar of any economy around the world and the backbone
of sustainability; for the sustainable growth of any country, agricultural
development plays a vital role. The world population is around 7 billion, and
food security is an important concern: as the huge population increases day by
day, the demand for food security also increases. The pandemic caused by the
coronavirus affected economies all over the world and raised food security
issues. Around three-fourths of the Earth's surface is water and the rest is
land, and as the population increases, forest and agricultural land is
converted into residential land, which directly threatens food security. For a
long time, traditional methods have been used in agriculture, with various
machines and huge manpower. Manual planting on a large scale is difficult:
the farmer has to spend almost all available time planting, yet the time
available is short. Completing the task within the stipulated time therefore
needs more manpower, which is costlier. Another drawback is that manual
planting with improper spacing wastes more seeds. Hence, we need to develop a
mechanical robot with sensors that is connected to the Internet, so that the
effort of planting is reduced and the farmer can perform at their best. This
process of using machines with sensors and an Internet connection that can
obtain services from a server is called "cloud robotics." Such automation helps
to increase the efficiency of the process. Robots with sensors are typically
used to enhance farming methods such as seed sowing, cultivation of plowed
land, smart irrigation, weed detection, plant leaf disease detection, and
pesticide control. A smart seed sowing robot cultivates the farm at a fixed
distance, considering a particular column from the map for a particular crop.
Crop planting is the art of placing seeds at proper intervals in the soil to
obtain good germination. A few plants, like paddy rice, are first raised and
then transplanted after a period of dormancy. Perfect seed sowing gives:
• Correct ratio of seed
• Correct depth for sowing
• Correct amount of seeds per unit area
• Correct spacing from row to row and plant to plant.
The paper is organized as follows: an overview of fog robotics and its
services, covering the architecture with its edge layer, fog layer, and robotic
layer; the literature on seed sowing, followed by the mathematical calculation
of the robotic seed sowing process; the wheat dataset and machine learning
techniques; and finally the results and comparisons, followed by the
references.
19 Fog Computing-Based Seed Sowing Robots for Agriculture 297
Fog robotics (Ai et al. 2018; Chauhan and Vermani 2016; Gia et al. 2018;
Smith et al. 2013; Wang et al. 2018) consists of different layers: the cloud
layer, which controls all major functions and monitors the complete system;
the fog layer, which offers distributed storage and provides limited resources
with edge services for data processing and analysis; and the robotic layer,
which sends the data. The cloud layer enables administrators or end users to
control the system by providing general instructions, and it monitors
performance at the different layers. The fog robotics layer includes the edge
layer and the fog layer, which share the same physical resources and smart
dedicated gateways. By design, this architecture is easily scalable,
distributed, and modular.
Figure 19.1 shows how the fog layer interacts with the cloud layer and the
robotic layer for smart seed sowing robots. Each layer has a role in
distributing the computational load over the network. The main factors are
latency, energy consumption, security, and computational power at the
different layers. In general, a seed sowing robot acquires data continuously
and sends it to the nearest server, the fog layer, over the same network.
Latency is an important factor for real-time decisions; the edge layer reduces
it by providing services quickly and managing the computational load for the
different connected robots. The fog layer also provides services such as
distributed storage, security, and minimal-data-loss handover mechanisms when
robots switch from one gateway to another. Each robot is fed the map data
required to move over the field, and the robots interact through algorithms
such as SLAM and path-finding methods. The edge layer plays an important role
in reducing energy consumption at the robots or end nodes. Fog computing is
essential for providing robot-state knowledge and robust situation awareness
for handling safety-critical issues.
(i) Cloud services: The time-series data of robots or IoT devices is uploaded
for monitoring, control, and visualization by the administrator or user.
These data include the robot state, which covers position coordinates,
velocity, steering drift factor, acceleration, torque variation, current
consumption, and battery level. The fog layer, together with the edge
layer, helps to preprocess the raw sensor data generated by the robots or
devices; only critical or important information is sent to the cloud for
handling any issue or fault in the robots. This can be implemented
dynamically: each robot has a particular map installed and operates only
within that map region, and the data can be uploaded at different
frequencies depending on whether the end user is actively monitoring the
system. The cloud services provide general control instructions for stable
operation and better performance of the robots. On agricultural land with
a large area, the seed sowing bots are placed at different locations with
a pre-loaded map, and the positions of the robots change within the map.
Once a robot reaches the end point, an infrared sensor detects it and
stops the sowing operation. Tasks are assigned to the robots individually
or in groups, and the cloud allows global access for a user or
administrator to monitor and control the system cost-effectively.
However, latency between the robot layer and the cloud is very high, and
there are security challenges between the two layers, so the cloud should
not be used for primary control: network variations would affect
performance and create faults. The cloud periodically accesses pertinent
processed data from the robot layer through the edge node, and it provides
services such as machine learning, big data analysis, and better
prediction with the high-performance computing they need.
(ii) Fog with edge layer: The edge node receives the real-time sensor data from sin-
gle or multi-robots. This includes odometry data, i.e., wheel odometry and
visual odometry data and inertial, velocity, location coordinates, data from
range sensors like radar or lidars, onboard camera data, to analyze the gate-
way for path planning, obstacle detection, and avoidance. These all data are
transferred through the edge gateway that is collected at the fog layer for pro-
viding fast response over the data. Fog acts as a central element, ensuring low
latency through edge gateways. If more than one robot is connected in the cloud,
then latency time increases that affect the performance directly. So, multiple
robots are connected to a single gateway, and all data from different sensors are
aggregated and analyzed to obtain a more comprehensive understanding of the
19 Fog Computing-Based Seed Sowing Robots for Agriculture 299
environment. For instance, if multiple robots operate in the same environment and are connected to the same gateway, the gateway can obtain information from larger areas through the robots' sensors. The main role of the edge node in the fog layer is to prioritize safety-critical situations; communication between the robot and the fog layer is done through it within the same network. Compared with the traditional cloud practice of offloading complex tasks, which requires high bandwidth and incurs high latency, the edge node reduces latency and transfers data to a safe point, the fog layer, so that even if a network failure happens, data safety and safe operation are preserved. The edge node continuously sends data from multiple robots to the fog layer wirelessly for deciding task allocation and robot movement. The fog layer interconnects the different edge nodes and provides other services such as location, security, tracking, monitoring, and distributed database services. If a robot disconnects from one gateway and connects to another, the fog layer takes care of minimizing latency and data loss during the handover. Localization algorithms such as simultaneous localization and mapping (SLAM) running on the edge node use real-time sensor data to match against an existing map, which is crucial when the robot operates in a partially or completely unknown environment. The fog layer also offers services such as collaborative processing, external sensor management, and monitoring. Additional services are included in the fog layer to enhance overall system robustness and fault tolerance. For instance, if a gateway fails or disconnects abruptly, shared storage resources retain all the information and make it available to other gateways. This allows the robot to reconnect to an available gateway next time and continue its operation with low latency, minimal data loss, and minimal operational interruption.
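The gateway behaviour described above, aggregating readings from several robots and failing over to another gateway through shared storage, can be sketched in a few lines. This is an illustrative Python sketch only; `FogGateway` and its methods are invented names, not part of the chapter's system.

```python
import time

class FogGateway:
    """Illustrative edge gateway: aggregates sensor readings from
    several robots and hands robots over to a backup on failure."""

    def __init__(self, name, backup=None):
        self.name = name
        self.backup = backup   # another FogGateway used on failover
        self.alive = True
        self.store = []        # stand-in for shared storage resources

    def ingest(self, robot_id, reading):
        # If this gateway has failed, hand the robot over to the
        # backup gateway so data loss during handover stays minimal.
        if not self.alive and self.backup is not None:
            return self.backup.ingest(robot_id, reading)
        self.store.append((robot_id, time.time(), reading))
        return self.name

    def aggregate(self):
        # Merge readings from all connected robots into one view of
        # the environment (here: the latest reading per robot).
        latest = {}
        for robot_id, _ts, reading in self.store:
            latest[robot_id] = reading
        return latest

backup = FogGateway("gateway-B")
primary = FogGateway("gateway-A", backup=backup)

primary.ingest("robot-1", {"odometry": (1.0, 2.0)})
primary.ingest("robot-2", {"lidar": [0.5, 0.7]})
primary.alive = False                         # simulate gateway failure
served_by = primary.ingest("robot-1", {"odometry": (1.1, 2.0)})
```

After the simulated failure, the reading is served by `gateway-B`, illustrating the handover with minimal interruption.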
(iii) Robot layer: In this layer, raw sensor data is gathered and streamed in real time toward the fog layer using the smart edge node. All instructions are given in real time; the robot may or may not be aware of its current state, which depends on the fog server that provides the exact location and current state of the robots. Low-power wireless communication technologies such as Wi-Fi, Bluetooth, or nRF are used by the robot: Wi-Fi for bandwidth-intensive applications such as live video streaming, and Bluetooth or nRF when they meet the bandwidth requirements. The robot gains energy efficiency because the raw data is streamed directly to the fog layer through the edge node, saving onboard processing energy. A microcontroller unit (MCU) is used for this dedicated purpose with low power consumption, and it directly takes resources from the cloud server, making the system user-friendly, robust, ubiquitous, and dedicated. The robots use different sensors for sensing, path planning, obstacle detection, object detection, and localization. A traditional robotic system needs an onboard system with built-in firmware; under the cloud robotics concept, this is replaced with a low-power, low-cost MCU board that operates through a cloud or fog server. The analysis of basic information is important for the robot itself; it includes, for instance, the robot's location, velocity, and wheel odometry or inertial data for current
300 J. Lachure and R. Doriya
acceleration and orientation. This state estimation is performed online at the fog or cloud layer, which allows more accurate movement, and all instructions get simplified. In a smart seed sowing bot, each robot has specific tasks, including sowing, harvesting, irrigation, and fertilization, along with a UV sensor to improve germination power, and proper path planning to complete the task quickly.
Liu et al. proposed a cyber-physical system design for proper seeding, irrigation, and fertilizer management for the alfalfa medicinal plant. The model is comprehensive, comprising a water sub-model, a biophysical sub-model, and fertilizer regulation after seed sowing. The alfalfa growth sub-models interact with each other to improve the precise regulation of fertilizer and water. A simulation model was developed to measure values such as leaf area index and soil water content, improving the precise regulation of water and fertilizer application for alfalfa (Liu et al. 2020).
Praveena et al. proposed a robot with an AVR ATmega microcontroller capable of performing operations such as plowing, seed dispensing, picking fruits, and spraying pesticides. First, the robot tills the entire field; then it plows and, in parallel, sows the seed in rows. For navigation, the device uses an ultrasonic sensor, and raw data is sent continuously from the field to the microcontroller. On the field, the robot operates automatically; outside it, it is operated manually. They proposed a control application paired over Bluetooth for manual control. For continuous data collection, humidity sensors were placed at various spots. For proper crop growth, if the humidity level rises above a threshold, the system alerts the farmer via the GSM module that the water sprinklers should be started to bring the humidity level down (Praveena et al. 2015).
Naik et al. discussed the main reasons behind automating farming processes: saving the time and energy required for continuous farming tasks and increasing productivity using the precision farming method by treating every crop individually. They proposed a four-wheeled vehicle controlled by an LPC2148 microcontroller as an agricultural robot for seed sowing only. Efficient seed sowing at optimal distances between crops and their rows, and at an optimal, crop-specific depth, is achieved through precision agriculture (Naik et al. 2016).
Srinivasan et al. proposed a novel design for an autonomous mobile robot capable of sowing seeds over prepared land. The body of the proposed device is constructed from aluminum for efficient weight reduction with proper strength. The robot navigates over the land using inputs from a magnetometer, and a proportional-integral (PI) controller improves the heading accuracy. An ultrasonic sensor detects the end of the field. The robot sows seeds in evenly spaced rows, dropping each seed at equidistant points. The seed meter is proposed using a solenoid
actuator assembly. The device has a modular structure for ease of maintenance. Overall, the proposed device consumes power efficiently, making it suitable for agricultural fields (Srinivasan et al. 2016).
Ranjitha et al. proposed methods for designing a robot that sows seed, cuts grass, and sprays pesticide, with the whole system powered by solar energy. The designed robot draws its energy from solar panels and is operated through a Bluetooth/Android application, which sends signals to the robot for movement and the required mechanisms. This robot increases efficiency and reduces the problems encountered in manual planting (Ranjitha et al. 2019).
Saurabh Umarkar and Anil Karwankar discussed that, for agriculture, the most important component is seed and its sowing over the field. The wide range of seed sizes across crop varieties motivated the development of high-precision pneumatic planting, which requires uniform seed distribution with proper seed spacing along the travel path. They use Wi-Fi for receiving data. The main disadvantage of the system is that the robot moves in only one direction, and whenever the power supply is interrupted, the robot turns OFF automatically (Umarkar and Karwankar 2016).
Sujon et al. proposed a robot that uses ultrasonic detection to change its position each time. They studied the effects of various seeding machines, and a sowing application for oilseed with different rates was developed (Sujon et al. 2018).
Kareemulla et al. proposed a robot for minimizing seed wastage. The proposed robot needs less sowing time and energy compared with tractor and manual methods. It can operate in a simple mode to increase the total yield effectively. Its major disadvantage is that it consists of only one mechanism (Kareemulla et al. 2018).
The main objective of the seed sowing system is to place the seed, fertilizer, and water in rows at the desired depth, cover the seeds with soil, maintain seed-to-seed spacing, and provide proper compaction over the seed. Mechanical factors that affect seed germination include the uniformity of seed distribution along rows and the uniformity of seed placement depth. The system includes a power transmission mechanism, a seed counting sensor, a UV sensor, a water dripping sensor, and a harvesting mechanism. The recommended seed rate, seed placement depth (which varies from crop to crop), and row-to-row spacing depend on the agroclimatic conditions for achieving maximum yields; this is why the seed sowing robot plays a wide role in agriculture. A typical robotic system consists of sensors and a mechanical structure. The multi-purpose equipment consists of a cylindrical container for holding the seeds and a four-wheeled carrier assembly for carrying the container. It includes a seed counting sensor, a metering plate with a bevel gear mechanism, and two holes at the bottom sized to the seed. The robot works as follows: when the plate rotates in the container, the bottom hole of the container and the hole of the metering plate coincide, and seeds flow through the pipe to the soil (the seed metering sensor); at the same time, the counting sensor counts the seeds. UV light at 400 nm is continuously applied in the container so that the germination power of the seed increases. The water dripping sensor is activated once the robotic arm
plows the soil and sows the seed; then water from another container is dripped over the sown spot. This process runs continuously, row by row. Thus, it enables conservation of inputs through precision, reducing the quantity needed for a better response, improving distribution, and preventing losses or wastage of applied inputs. This directly reduces the unit cost of production, as inputs are conserved and productivity gets high. The most important purposes of the robot are to make it affordable to farmers, reduce labor cost, enable early prediction of seedlings, provide water irrigation at the initial stage, and increase the germination power of the seed.
The main objective is to make the robot affordable to farmers so that they can do their work without depending on laborers. The above-mentioned machine increases the efficiency of seed sowing, thereby reducing the wastage of seeds and improving overall yield. For precision agriculture, different types of innovation are going on in different areas, but the seed sowing robot is a key component of the agriculture field. This robot increases yield at low cost; the initial cost of the cloud, fog, and edge infrastructure is higher, but afterwards only a small maintenance cost is needed. Presently, different approaches are available to evaluate the performance of seed sowing machines.
The multi-purpose agriculture robot can be used for soil testing, seed sowing, fertilizer supply, weed detection, and plant leaf disease detection. The drilling arm completes the tasks of soil drilling, seed sowing, smart irrigation via the water dripping sensor, fertilizer spreading, and soil testing. Internet-connected robots are lightweight, which is their biggest advantage: they work fast, and every piece of data is sent to the nearest edge server for further analysis. At the same time, the server gives predictions for further work. Here, the main objective of the seed sowing robot is to make it simple and easy for farmers to use. The architecture is simple, and the robot is developed with lightweight materials along with sensors embedded with the Wi-Fi module. The main objective is to sow without the use of laborers, thus increasing the efficiency of seed sowing and reducing the wastage of seeds, which improves overall yield.
In-farm plant spacing and the optimum plant population are the primary objectives of any seeding or planting operation. The ultimate goal is to obtain the maximum net return per unit area. Spacing and population requirements are influenced by factors such as:
• Type of soil
• Type of crop
• Amount of moisture available
• Fertility of the soil
• Pollution level.
Planting may be done on the flat surfaces of the field, in furrows, or on beds as:
– Furrow or lister planting: In semiarid conditions, this technique is widely practiced for row crops such as cotton, corn, and grains. It places the seed down into moist soil and protects young plants from wind and blowing soil.
– Bed planting: In high rainfall areas, it is often practiced to improve surface
drainage.
– Flat planting: In favorable moisture conditions, this type of planting generally
predominates.
Figure 19.2 shows the three types of planting: on the left is the furrow planting system, on the right is the bed planting system, and flat planting is shown below both. These systems are used in different regions, under different conditions, to obtain more yield under favorable conditions.
In sowing seeds, there are a few problems concerning yield production:
• Irregular placing of seeds: In the manual process, seeds are thrown all over the field. This irregular way of placing seeds causes the seed to grow irregularly.
Fig. 19.2 Different types of planting systems: furrow (left), bed (right), and flat planting (bottom)
• Wastage of seeds: During the manual process, seeds scattered here and there result in irregular placement and irregular lines. The water and important nutrients are therefore not received properly, and plant growth will not be up to the mark.
• Time-consuming process: Sowing seed over the complete land is time consuming. If the area is small, it is not a burden, but if the area is big, it requires a lot of time and the process becomes difficult.
• Insufficient ground temperature: Nowadays, due to global warming, the surrounding temperature changes suddenly. If the ground temperature is cool, seeds will not rise and grow properly, as seeds need warm conditions to grow well.
• Sowing of seeds too deeply: The seed depth should be moderate. If the seeds are sown too deeply, they will not rise even if we water the plants.
• Lack of a quality seed-raising mix: The seed quality should be good for raising the plant, and this is a very important process after germination. If the quality is not good, growth will not be up to the mark.
The robot consists of different parts for seed sowing; it contains sensors such as a UV sensor, a counter, infrared (IR) sensors, a gyro-sensor for robot movement, and a water dripping sensor, and mechanical parts such as the seed hopper, plow, and small and big chain gears. These parts decide the exact amount of seed required, and at what rate, for a particular area, based on:
• Seed rate (Sr).
• Seed hopper volume (Vs).
• Row spacing (RS).
• Spacing between seeds (X).
• Bulk density of seeds (Pb).
• Rotations per minute (Rpm).
• Number of cells in the seed chamber (n).
The transmission ratio i of the driver and driven sprockets is given by the ratio of their speeds (equivalently, of their teeth counts).
Consider groundnut seeds; groundnut is widely used for making cooking oil and, apart from this, as a daily ingredient and in spices. The total seed required for germination in one hectare is computed as follows.
(i) Seed rate (Sr): The quantity of seeds sown per unit area is called the seed rate. It depends on spacing (plant-to-plant and row-to-row spacing, i.e., plant population), germination percentage, and test weight. Its unit is kg/ha (kilograms per hectare).
Seed rate = Plant population × test weight × 100 × 100 / (germination percentage × purity percentage × 1000 × 1000)
(ii) Plant population: From various articles on groundnut, it is concluded that:
Germination percentage = 90%
Test weight = 114 g per 100 seeds
Seed bulk density = 434.8 kg/m³
Normal seed rate = 100–110 kg/ha
Plant population = area to be planted / (plant-to-plant spacing × row-to-row spacing)
Consider a 1-ha area for plantation. Then,
Area to be planted = 1 ha = 2.471 acres, and 1 acre = 4046.856 m²; therefore, 1 ha = 2.471 × 4046.856 = 9999.78 m² ≈ 10,000 m²
Plant-to-plant spacing = 75 cm = 0.75 m
Row-to-row spacing = 10 cm = 0.1 m
Plant population = 10,000 / (0.75 × 0.1) = 133,333 plants. We took 1000 g, i.e., 1 kg of seeds, as the test weight to find the seed rate per kg.
Seed rate = 133,333 × 1000 × 100 × 100 / (90 × 90 × 1000 × 1000) = 164.61 kg/ha
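The plant-population and seed-rate arithmetic above can be checked with a few lines of Python (a sketch of the chapter's formulas; the function and variable names are ours):

```python
def plant_population(area_m2, plant_spacing_m, row_spacing_m):
    # plants = area / (plant-to-plant spacing * row-to-row spacing)
    return area_m2 / (plant_spacing_m * row_spacing_m)

def seed_rate(population, test_weight_g, germination_pct, purity_pct):
    # Seed rate (kg/ha) = population * test weight * 100 * 100
    #                     / (germination % * purity % * 1000 * 1000)
    return (population * test_weight_g * 100 * 100) / (
        germination_pct * purity_pct * 1000 * 1000)

pop = plant_population(10_000, 0.75, 0.10)   # ~133,333 plants per ha
rate = seed_rate(pop, 1000, 90, 90)          # ~164.6 kg/ha
```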
The value from standard data is between 100 and 160 kg/ha. To meet that standard, the velocity of the seed sowing robot depends on the diameter of the wheel that rotates inside it. This can be assumed, or taken from a standard book so as not to cause seed breakage: velocity v = 0.2 m/s and motor speed = 60 rpm.
The number of cells in the seed sowing gear is n = 3.14 × D / (i × X), where
n = number of cells
D = big wheel diameter
i = transmission ratio
X = space between seeds
n = 3.14 × 0.25 / (0.525 × 0.3) = 4.98 ≈ 5 cells
so five cells need to be present in the seed sowing gear.
(iii) Flow rate: The flow rate for sowing seed depends on the seed rate, velocity, seed bulk density, and space between rows:
Q = Rs × S × V / (10000 × Pb)
where
Q = flow rate (m³/s)
Rs = seed rate (kg/ha)
S = space between rows (m)
V = velocity of the seed sowing machine (m/s)
Pb = seed bulk density (kg/m³)
Q = 68.58 × 0.6 × 0.2 / (10000 × 714.86) = 1.1512 × 10⁻⁶ m³/s
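The cell-count and flow-rate formulas can likewise be sketched in Python, using the chapter's groundnut example values (the helper names are our own):

```python
import math

def cells_in_gear(wheel_diameter_m, transmission_ratio, seed_spacing_m):
    # n = pi * D / (i * X), rounded up to a whole number of cells
    return math.ceil(math.pi * wheel_diameter_m /
                     (transmission_ratio * seed_spacing_m))

def flow_rate(seed_rate_kg_ha, row_spacing_m, velocity_ms, bulk_density):
    # Q = Rs * S * V / (10000 * Pb), in m^3/s
    return seed_rate_kg_ha * row_spacing_m * velocity_ms / (
        10000 * bulk_density)

n = cells_in_gear(0.25, 0.525, 0.3)      # 5 cells
q = flow_rate(68.58, 0.6, 0.2, 714.86)   # ~1.15e-6 m^3/s
```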
(iv) Volume of seed hopper: It depends on the flow rate, the number of cells (n), and the speed of the machine gear (Nd): Vc = Q × 60 × 10⁶ / (n × Nd)
(v) Planting depth: The seed should be planted at the required depth without breaking. After the seeds are covered with the help of the V-shaped metal, the tires should pass over the rows where the plow dug, so the seed ends up at the required depth. Table 19.1 shows the diameter, required depth, and required gap between the plants.
For different seeds such as soybean, wheat, Bengal gram, and peanut, Table 19.2 shows the seed placement depth, the seed rate for the robot (so that the seed does not get damaged), the width of coverage (i.e., the plant-to-plant distance), the number of laborers required (showing how the labor is reduced), and the plant population per hectare.
19.5 Methodology
In this section, brief details are given about the machine learning techniques, the seed dataset, and the working process.
A brief introduction follows to the machine learning methods, decision tree, AdaBoost, and SVM, and the deep learning method FastAI:
(i) Decision tree: It splits the data on an attribute into successor nodes, and entropy is calculated for scoring each node. A node holds a set of items of two classes, positive and negative. The attribute that maximizes the information gain over the seed dataset is selected after calculating the gain:

E(p, n) = −(p/(p + n)) log₂(p/(p + n)) − (n/(p + n)) log₂(n/(p + n)) (19.1)

To calculate the quality of the entire split over the attribute At, the entropy of the system is used:

Gain(D, At) = E(D) − Σᵢ (|Dᵢ|/|D|) × E(Dᵢ) (19.2)
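Equations (19.1) and (19.2) can be implemented directly. This is a small illustrative sketch; the example counts come from the classic play-tennis data, not the seed dataset:

```python
import math

def entropy(p, n):
    # E(p, n) = -p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n))
    total = p + n
    e = 0.0
    for c in (p, n):
        if c:  # 0 * log2(0) is taken as 0
            e -= (c / total) * math.log2(c / total)
    return e

def info_gain(parent, splits):
    # Gain(D, At) = E(D) - sum_i |D_i|/|D| * E(D_i)
    p, n = parent
    size = p + n
    return entropy(p, n) - sum(
        (pi + ni) / size * entropy(pi, ni) for pi, ni in splits)

# 14 samples (9 positive, 5 negative) split three ways by an attribute
g = info_gain((9, 5), [(2, 3), (4, 0), (3, 2)])   # ~0.247
```

The attribute with the largest such gain is the one chosen for the split.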
(ii) AdaBoost: In AdaBoost, learners are of two types, weak and strong. A weak classifier is only slightly better than random guessing, while a strong learner gives an almost correct classification of the true value (Abd Rahman et al. 2015). Consider training data of the form (p₁, q₁), (p₂, q₂), ..., (pN, qN), where qᵢ ∈ {+1, −1} for all pᵢ ∈ P. For a learner h, the error is defined as

ε = (1/N) × Σᵢ [qᵢ ≠ h(pᵢ)] (19.3)

where [qᵢ ≠ h(pᵢ)] is 1 if qᵢ ≠ h(pᵢ) and 0 if qᵢ = h(pᵢ).
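Equation (19.3) is simply the misclassification fraction; a tiny sketch (the labels and predictions below are made up for illustration):

```python
def weak_learner_error(labels, predictions):
    # epsilon = (1/N) * sum of [q_i != h(p_i)]
    n = len(labels)
    return sum(1 for q, h in zip(labels, predictions) if q != h) / n

q = [+1, +1, -1, -1, +1]
h = [+1, -1, -1, +1, +1]        # two mistakes out of five
eps = weak_learner_error(q, h)  # 0.4
```

A learner is "weak" when this error is below 0.5, i.e., better than random guessing on binary labels.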
(iii) Support vector machine: It is a binary classifier (Abd Rahman et al. 2015) for a dataset Ds ⊆ Rᵈ that is linearly separable. The separating hyperplane is chosen to have the maximum distance (margin) from the closest points of the dataset.
The FastAI method is used with a linear transformation, and the data-bunching technique works quickly with the learning rate.
For classification of the given dataset, the data is first preprocessed and normalized: null values are removed, and missing data is filled using methods such as the average, minimum, or maximum. FastAI with a linear transformation is a fast deep learning technique that first finds the target layer; then the dataset is split into training and validation sets. For every learning cycle, the learning rate changes to fit the data. Once the data fits, it is easy to get the best learning rate. If the fit goes out of scope, overflow occurs, and the learning rate needs to be changed again. This learning-rate cycle changes continuously to optimize the result. The given algorithm finds the best learning rate with a batch process.
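The search described above resembles a learning-rate range test. Below is a minimal, framework-free sketch of that idea, not the actual FastAI API: a 1-D model y = w·x is fitted by gradient descent while the learning rate grows each batch, the loss is recorded, and the sweep stops on overflow.

```python
def lr_range_test(xs, ys, lr_start=1e-4, lr_mult=1.5, steps=40):
    """Grow the learning rate every batch, record the loss, and
    return a rate one order of magnitude below the loss minimum."""
    w, lr = 0.0, lr_start
    history = []                  # (lr, loss) pairs
    for _ in range(steps):
        # mean-squared-error gradient for the model y = w * x
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
        loss = sum((w * x - y) * (w * x - y)
                   for x, y in zip(xs, ys)) / len(xs)
        history.append((lr, loss))
        if loss > 1e6:            # overflow: stop the sweep
            break
        lr *= lr_mult
    best_lr, _ = min(history, key=lambda t: t[1])
    return best_lr / 10           # back off from the minimum

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]              # true slope w = 2
suggested = lr_range_test(xs, ys)
```

The suggested rate sits between the "underflow" region (too small to make progress) and the "overflow" region (divergence), matching the curve described for Fig. 19.3.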
Figure 19.3 shows the learning rate versus iteration; the curve shows that the best-optimized value lies between the underflow and overflow conditions.
Figure 19.4 shows accuracy versus the batch process as the training data is fitted linearly to increase the accuracy of the system.
The loss versus the batch process shows how much is lost and when overflow occurs in the batch process (Fig. 19.5).
For the given wheat dataset, FastAI, a deep learning approach, and different machine learning techniques are used to classify the seeds into different categories. It is shown that FastAI runs faster than all the other algorithms, as it works on the learning cycle and fits the dataset linearly (Fig. 19.6).
19.6 Conclusion
In agriculture, seed sowing robots play an important role in plowing, digging, seeding, and harvesting. These robots are connected with the edge node for communicating with the fog layer. This layer works in a homogeneous network with a decentralized server, giving better security as it is nearer to the robot or devices than the cloud layer. Fog robotics minimizes latency and makes proper use of the network bandwidth. As latency decreases, the battery energy spent on sending data is minimized, so power is saved and performance increases. Since the fog layer provides firmware that can be installed anytime within the network to perform other operations, indirectly increasing machine efficiency as required, the system becomes ubiquitous. The multi-robot scenario can be handled within the fog layer with the help of the edge node. Each robot has a predefined map loaded in its firmware. The path planning of each robot is done through SLAM or path-finding algorithms so that robots can work quickly and accurately within the field.
The seed sowing robotic machine consists of different sensors and parts, such as a UV sensor, an IR sensor, a hopper, a plow, chain gears, and a seed counting meter. The UV sensor is used for disinfecting the seed; the IR sensor detects obstacles and the end of the field. The seed rate, row-to-row distance, seed spacing, and sowing depth are calculated, which directly improves the growth of seed in the field.
The wheat dataset contains different species that need to be handled at sowing time. The standard parameter sizes of each species are recorded to classify them into the proper class, so the machine can separate them before sowing. FastAI methods along with different machine learning models were developed to separate these species from each other.
References
Abd Rahman HA, Wah YB, He H, Bulgiba A (2015) Comparisons of AdaBoost, KNN, SVM and logistic regression in classification of imbalanced dataset. In: International conference on soft computing in data science. Springer, Berlin, pp 54–64
Ai Y, Peng M, Zhang K (2018) Edge computing technologies for internet of things: a primer. Digital
Commun Netw 4(2):77–86
Chauhan S, Vermani S (2016) Cloud computing to fog computing: a paradigm shift. J Appl Comput
1(1):25–29
Charytanowicz M, Niewczas J, Kulczycki P, Kowalski PA, Łukasik S, Zak S (2010) Complete
gradient clustering algorithm for features analysis of x-ray images. Information technologies in
biomedicine. Springer, Berlin, pp 15–24
Gia TN, Rahmani AM, Westerlund T, Liljeberg P, Tenhunen H (2018) Fog computing approach for
mobility support in internet-of-things systems. IEEE Access 6:36064–36082
http://www.soiltillage.com
Howard J, Gugger S (2020) Fastai: a layered api for deep learning. Information 11(2):108
Kareemulla MS, Prajwal E, Sujeshkumar B, Mahesh B, Reddy BV (2018) Gps based autonomous
agricultural robot
Liu R, Zhang Y, Ge Y, Hu W, Sha B (2020) Precision regulation model of water and fertilizer for
alfalfa based on agriculture cyber-physical system. IEEE Access 8:38501–38516
Naik NS, Shete VV, Danve SR (2016) Precision agriculture robot for seeding function. In: 2016
international conference on inventive computation technologies (ICICT), vol 2, pp 1–3. IEEE
Praveena R, Srimeena R, et al (2015) Agricultural robot for automatic ploughing and seeding.
In: IEEE technological innovation in ICT for agriculture and rural development (TIAR). IEEE
2015:17–23
Ranjitha B, Nikhitha MN, Aruna Afreen K, Murthy BTV (2019) Solar powered autonomous mul-
tipurpose agricultural robot using bluetooth/android app. In: 3rd International conference on
electronics, communication and aerospace technology (ICECA), pp 872–877
Srinivasan N, Prabhu P, Smruthi SS, Sivaraman NV, Gladwin SJ, Rajavel R, Natarajan AR (2016)
Design of an autonomous seed planting robot. In: IEEE region 10 humanitarian technology
conference (R10-HTC), pp 1–4. IEEE
Smith CV, Doran MV, Daigle RJ, Thomas TG (2013) Enhanced situational awareness in autonomous
mobile robots using context-based mapping (october 2012). In: 2013 IEEE international multi-
disciplinary conference on cognitive methods in situation awareness and decision support
(CogSIMA)
Sujon MDI, Nasir R, Habib MMI, Nomaan MI, Baidya J, Islam MR (2018) Agribot: Arduino
controlled autonomous multi-purpose farm machinery robot for small to medium scale cultivation.
In: 2018 international conference on intelligent autonomous systems (ICoIAS), pp 155–159. IEEE
Umarkar S, Karwankar A (2016) Automated seed sowing agribot using arduino. In: 2016 interna-
tional conference on communication and signal processing (ICCSP), pp 1379–1383. IEEE
Wang X, Ning Z, Wang L (2018) Offloading in internet of vehicles: a fog-enabled real-time traffic
management system. IEEE Trans Indus Inform 14(10):4568–4578
Chapter 20
An Automatic Tumor Identification
Process to Classify MRI Brain Images
Abstract The mortality rate due to failure of brain tumor diagnosis and treatment is increasing extensively. Accurate and feasible interpretation of brain tumors is mandatory for subsequent prognosis as well as medication. Inspection of brain tumors can be done by expert physicians, but this makes the process labor demanding as well as time consuming. So, in this work, we propose an automatic tumor identification process to classify MRI brain images containing tumors of benign and malignant type using an advanced convolutional neural network (CNN) architecture. The proposed model is analyzed on performance metrics such as precision, recall, and F1 score; per the analysis, the proposed method gives better results compared with other state-of-the-art methods.
20.1 Introduction
The growth of abnormal tissues in human brain can lead to the appearance of tumor.
A primary brain tumor can be cancerous or benign in nature. Gliomas and Menin-
giomas are two most frequent types of primary brain tumor. The origin of gliomas
tumor is from glial cell. The other type of primary tumor, i.e., Meningiomas is more
tends to develop among women than men. These tumors are benign in nature but can
cause complication due to the location and size of the tumor.
Last year with an increasing trend in India, around 5–10 cases of brain tumor
per one lakh population were encountered. Among these 20% cases are from the
children under the age of 15 years. The symptoms of brain tumor include a early
morning headache, continuously vomiting or nausea, partially memory loss, sleep
problems, etc. The diagnosis of brain tumor can be classified into two categories as
benign and malignant; types of benign tumors are not required for surgical treatments
unless it gets extended in size and expresses some doubtful symptoms. Hence, early
and accurate diagnosis of malignant tumor became mandatory to reduce the rate of
mortality.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 315
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_20
316 A. Ghosh and B. Soni
The diagnosis of brain tumors can be done from different medical imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT) scan, ultrasonography, etc. MRI uses magnetic fields to construct a complete image of the body and can help measure the size of the tumor.
Recently, various automatic tumor diagnosis techniques such as the K-means clustering algorithm, the fuzzy C-means method (Abdel-Maksoud et al. 2015), the LinkNet convolutional network (Sobhaninia et al. 2018), ELM-LRF (Ari and Hanbay 2018), and SVM (Priya et al. 2016) have been used for automatic tumor detection. Much work has already been mentioned in the related-work section, but some gaps remain in achieving more promising results on the performance metrics.
In the current study, the proposed model consists of two separate sets of neural network layers, a convolutional part and fully connected dense layers: 3 convolutional layers are used to extract features from the input image, and 2 fully connected dense layers are used to classify. The classification report of the proposed model is given in Table 20.3, where the precision, recall, and F1 score of the model are calculated.
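As a back-of-the-envelope check on such a 3-convolution + 2-dense stack, the parameter count can be computed by hand. The 3×3 kernels, the 32/64/64 filter counts, and the 1024 flattened features below are our own assumptions for illustration, not the chapter's actual configuration:

```python
def conv_params(k_h, k_w, c_in, c_out):
    # each filter has k_h * k_w * c_in weights plus one bias
    return (k_h * k_w * c_in + 1) * c_out

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return (n_in + 1) * n_out

# Assumed stack: three 3x3 conv layers on a grayscale MRI slice,
# then two dense layers ending in a 2-way benign/malignant output.
total = (conv_params(3, 3, 1, 32)      # 320
         + conv_params(3, 3, 32, 64)   # 18,496
         + conv_params(3, 3, 64, 64)   # 36,928
         + dense_params(1024, 64)      # assumes 1024 flattened features
         + dense_params(64, 2))        # 2-class output
```

Counting parameters this way is a quick sanity check that a proposed architecture fits the available memory and training data.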
Related works are discussed in Sect. 20.2, and the concept of CNN is highlighted in Sect. 20.3. Section 20.4 presents the proposed architecture and its description. The data set description is given in Sect. 20.5. Section 20.6 covers the experimental analysis and results. Conclusions and future work are discussed in Sect. 20.7.
ELM-LRF is used for tumor classification and extraction of the tumor region, and the watershed algorithm is used for segmentation.
Goswami and Bhaiya (2013) describe an automatic system where edge detection, histogram equalization, noise removal, and thresholding are performed as preprocessing steps. The independent component analysis (ICA) method is used for feature extraction, and a self-organizing map is used for brain tumor diagnosis. Finally, the K-means clustering algorithm is performed for segmentation of the tumor into different cells.
Safaa E. Amin et al. proposed in their work (Amin and Megeed 2012) a perceptive neural network that can automatically classify the types of brain tumor present. The proposed system is divided into two parts. The first contains a hybrid neural network system with PCA for dimensionality reduction and feature extraction; the second includes segmentation of the MRI images using the wavelet multi-resolution expectation resolution (WMER) algorithm. Lastly, an MLP (multi-layer perceptron) is applied for classification of the features extracted from the first phase or the second one.
Mohana Priya K. et al. mention in their paper (Priya et al. 2016) a support vector machine (SVM) for classifying brain tumor images into different classes. The SVM is combined with statistical features: a first-order feature set, a second-order feature set, and the combination of both. The experimentation in the paper is based on different SVM kernel types with different gamma values. The experimental analysis is done using only different kernel types of the SVM classifier, so the supremacy of the approach is not compared with other state-of-the-art approaches.
Al-Ayyoub et al. (2012) described a machine learning approach for detecting whether a brain tumor is present in an MR image. Four classification algorithms, i.e., ANN, Tree J48, Naïve Bayes and Lazy IBk, were applied to 27 MR images, and the results were compared using the parameters recall, precision, F1 score and correctness, showing that the accuracy of the ANN is the best among them.
Sudha et al. (2014) used a feed-forward neural network, a multi-layer perceptron and a BP neural network for classification. Feature extraction was performed using the GLCM and GLRM approaches.
Pereira et al. (2016) used a CNN for segmentation of the BRATS MRI data set. Data augmentation was used to increase the size of the training set. Their proposed architecture was able to identify two tumor grades, i.e., HGG and LGG. The evaluation metrics DSC, PPV and sensitivity were used to measure the performance of the architecture.
Saba et al. (2020) used an accurate segmentation process called the Grab-cut method for tumor segmentation and a VGG-19 model for feature extraction on the BRATS data set. After extracting the features, they applied several classifiers, namely decision tree (DT), linear discriminant analysis (LDA), K-nearest neighbour (KNN), an ensemble classifier and support vector machine (SVM), and compared the results obtained from them; the classifiers were evaluated based on accuracy and DSC.
318 A. Ghosh and B. Soni
A convolutional neural network (CNN) comprises some basic layers that define the working principle of the network.
1. Convolution Layer: This layer extracts features from the input image while preserving the correlation between the image pixels. Using different types of filters, it can perform various operations such as edge detection, image sharpening and blurring. For a better understanding of the operation, the example below shows the convolution of a single-channel 5 × 5 image with a 3 × 3 kernel.
$$
\begin{bmatrix}
5 & 5 & 3 & 2 & 1\\
1 & 0 & 3 & 0 & 1\\
2 & 5 & 3 & 5 & 2\\
3 & 0 & 0 & 2 & 5\\
1 & 2 & 3 & 5 & 4
\end{bmatrix}
*
\begin{bmatrix}
1 & 0 & 1\\
0 & 1 & 0\\
1 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
13 & 20 & 9\\
12 & 5 & 14\\
9 & 17 & 14
\end{bmatrix}
\quad (20.1)
$$
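The convolution in Eq. (20.1) can be checked numerically with a short NumPy sketch of the "valid" sliding-window operation (since the kernel in the example is symmetric, convolution and cross-correlation coincide):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: slide the kernel over every position where
    it fits entirely inside the image and sum the element-wise products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow), dtype=image.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The 5 x 5 image and 3 x 3 kernel from Eq. (20.1):
image = np.array([[5, 5, 3, 2, 1],
                  [1, 0, 3, 0, 1],
                  [2, 5, 3, 5, 2],
                  [3, 0, 0, 2, 5],
                  [1, 2, 3, 5, 4]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(conv2d_valid(image, kernel))
# [[13 20  9]
#  [12  5 14]
#  [ 9 17 14]]
```

The 3 × 3 output matches the right-hand side of Eq. (20.1).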
2. Activation Layer: The activation layer converts the output of the conv layer into a nonlinear output. In this experiment, ReLU (rectified linear unit) is used as the activation function. The operation of ReLU is:

$$f(x) = \max(0, x) \quad (20.2)$$

Using this function, the network retains only the non-negative values.
3. Pooling Layer: This layer is important when the size of the input image is too large; in that case, reducing the number of trainable parameters is necessary, so this layer is used between subsequent convolution layers.
4. Fully Connected Layer (FC Layer): Before feeding the input matrix to the FC layer, it should be flattened into a vector. From the above-stated diagram, the matrix of the feature map is converted into a vector $i_1, i_2, i_3, \ldots, i_n$. The model is created by the FC layer combining the extracted features together, as in Fig. 20.1.
5. Output Layer: This layer comprises an activation function, softmax or sigmoid, to classify the outputs. In this work, we used softmax as the activation function for the output layer. The overall CNN architecture is illustrated in Fig. 20.2.
4. Train the CNN architecture: The network comprises 3 Conv2D layers with the ReLU activation function, max-pooling layers of size 2 × 2 and 2 fully connected hidden layers. The loss of the trained network is calculated using sparse categorical cross-entropy, and the network is trained for 25 epochs.
5. Test the network: In this phase, the network decides whether a given MR image contains a tumor or not.
In Table 20.1, the architecture of the proposed advanced CNN model is given. This CNN model consists of 3 conv layers (conv + activation + pooling), and 2 fully connected hidden layers are used before the last softmax output layer. The model was trained for 25 epochs, as this gives the most optimal result for the data set used. The architecture of the model is given in Fig. 20.3.
20 An Automatic Tumor Identification Process … 321
This architecture is built in such a manner that it gives the optimal result for the data set used. The Conv2D layer is used for feature extraction; the number of filters used here is 32, and the kernel size is 3 × 3. ReLU (rectified linear unit) is used as the activation function. The purpose of using ReLU is to introduce nonlinearity into the output of the Conv2D layer; with it, the network retains only the non-negative values. The pooling layer is used along with the activation layer; if the input image is too large, it minimizes the number of parameters. Here, max-pooling is used as the pooling function. All these layers are important for extracting the features from the input image.
Fully connected layers are used to classify the input image as with or without tumor. In this architecture, 2 fully connected hidden layers are used, followed by the last softmax output layer with 2 neurons for the 2 classes. The output layer of the CNN is responsible for producing the probability of each class.
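The architecture described above can be sketched in Keras; the input resolution and the widths of the two dense hidden layers are assumptions, since the chapter does not state them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 128, 1), hidden_units=64):
    """Sketch of the described CNN: 3 x (Conv2D 32@3x3 + ReLU + 2x2 max-pool),
    two fully connected hidden layers and a 2-neuron softmax output.
    input_shape and hidden_units are assumptions -- the chapter does not
    state the image resolution or the dense-layer widths."""
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation="relu",
                            input_shape=input_shape))
    model.add(layers.MaxPooling2D((2, 2)))
    for _ in range(2):  # two more conv + pool blocks
        model.add(layers.Conv2D(32, (3, 3), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(hidden_units, activation="relu"))
    model.add(layers.Dense(hidden_units, activation="relu"))
    model.add(layers.Dense(2, activation="softmax"))  # 2 classes
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then be `model.fit(x_train, y_train, epochs=25, ...)`, matching the 25 epochs reported as optimal in Table 20.4.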
The number of layers used here depends on the data set, in particular its variation and size. Using more layers would only help extract extra features up to a certain limit; beyond that, instead of extracting features, the network would overfit the data and produce erroneous results such as false positives.
Experimental work was carried out on an Intel® Core™ i5-8500 CPU @ 3.00 GHz with 8 GB RAM and the Windows 10 operating system. Python 3.6.7 and Keras with a TensorFlow backend were used to implement the CNN architecture. For data visualization, the scikit-learn, matplotlib and seaborn modules were used, and the pandas and numpy modules were used for reading the data. We used a confusion matrix to calculate the accuracy of the classifier; its format is given in Table 20.2.
The terms in the confusion matrix are associated with the performance of the proposed model. The mathematical definitions of the terms, i.e., TP, FP, FN and TN, are given here. The precision, recall and F1 score are calculated using the formulas given below:
$$\text{Precision} = \frac{TP}{TP + FP} \quad (20.3)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (20.4)$$

$$\text{F1 score} = \frac{2 \times (\text{precision} \times \text{recall})}{\text{precision} + \text{recall}} \quad (20.5)$$
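As a quick sanity check, Eqs. (20.3)–(20.5) can be computed directly from confusion-matrix counts; the counts in the example below are illustrative only, not taken from the chapter's experiment.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score from confusion-matrix counts,
    following Eqs. (20.3)-(20.5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 40 true positives, 10 false positives, 10 false negatives.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```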
The confusion matrix associated with the experimental work is given in Fig. 20.5. The above-mentioned performance measures associated with the experiment, i.e., precision, recall, F1 score and support values, are given in Table 20.3.
The tumor detection task for the proposed model is an imbalanced classification problem in which we need to identify two classes: with tumor and without tumor. In disease detection, this imbalance issue occurs when the rate of the disease is very low; in such conditions, the positive class tends to be enormously outnumbered by the negative class. Accuracy is not a good metric for evaluating model performance in that case, so recall can be a better statistic for evaluation. The definition of recall is given in Eq. 20.4: recall measures the ability of the model to identify the data points of most concern in a specific data set.
The formula for precision is given in Eq. 20.3, where FPs are the data points that the model incorrectly identifies as positive but are actually negative. In this problem, FP counts the images labeled as containing a tumor which actually do not. Precision expresses the proportion of data points labeled as relevant by the model that were actually relevant.
Table 20.4 Comparison of the accuracy and loss based on number of epochs
No. of epochs Loss Accuracy Val_loss Val_accuracy
20 0.1908 0.9048 0.3423 0.8077
21 0.1959 0.9031 0.3407 0.8077
22 0.1308 0.9404 0.2928 0.8692
23 0.1346 0.9427 0.2642 0.8846
24 0.1437 0.9604 0.3972 0.8462
25 0.0625 0.9824 0.9310 0.8846
26 0.0722 1.0000 0.9580 0.7692
27 0.0885 1.0000 0.9580 0.7692
28 0.0244 1.0000 0.9580 0.7692
29 0.0207 1.0000 0.7408 0.8846
30 0.0145 1.0000 0.9358 0.8077
Bold values in the table show the highest accuracy at the optimal number of epochs, that is, 25
In this work, a novel advanced CNN architecture is proposed for automatic brain tumor identification. Based on several MRI brain images, the current model is able to identify brain tumors correctly. The proposed model is composed of two parts: convolution layers for feature extraction and dense layers for classification. Manual inspection and the generation of diagnostic results are a real burden, which can be reduced by using this automated model. The effectiveness of the model can be determined from the experimental results.
In future work, we plan to examine the results using a larger data set, and the model can be constructed in such a way that it can identify the types of tumors present. This can reduce the pain and burden for expert physicians in determining the type of tumor present.
References
Abstract The intelligent vehicle system (IVS) is being designed to improve the safety, convenience and lifestyle of society. At the same time, it aims to enhance driving behavior to minimize traffic-related issues. Artificial intelligence assists such autonomous systems; it is now not restricted only to software data, but its functionality is utilized in decision making in various phases of the IVS in dynamic road environments. One such phase, lane detection, plays a significant role in IVS, especially through various sensors. Here, a vision-based sensor mechanism is employed which detects lane marking schemes on structured roads. For this purpose, traditional image processing techniques have been applied to keep the computation less complex, and the public KITTI dataset is utilized. The proposed scheme effectively identifies various lane markings on the road under normal driving conditions.
21.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 329
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_21
330 D. K. Dewangan and S. P. Sahu
A regular lane marking scheme is the most commonly seen in the country. It permits the vehicle to perform activities such as overtaking, U-turns and lane changes. However, to keep vehicle driving safe, it is expected that the road traffic is almost clear and safe before accomplishing the mentioned activities.
21 Lane Detection for Intelligent Vehicle System … 331
On the road, this lane marking scheme does not allow a vehicle to perform activities such as taking U-turns or overtaking other vehicles unless there is a situation where doing so avoids an accident. These schemes are commonly found on hilly roads to avoid any single chance of accident.
Under this scheme, overtaking a vehicle is permissible only from the current side; this marking is mostly found in areas where visibility is slightly low.
It indicates that crossing the lane marking is rigorously not tolerated; this marking is mostly found in areas where there is a high probability of random or constant hazards.
It allows the vehicle to take U-turns and to overtake automobiles, provided that it is completely safe to do so.
To make autonomous vehicles more intelligent in lane detection, it is required to understand the features of these marking schemes so that a proper driving decision can be made for such vehicles. The situation becomes complex when the traffic load is a bit high, and the involvement of pedestrians makes the detection task difficult. Moreover, computation cost is also a challenging issue when artificial intelligence and deep learning techniques are involved. The approach to finding lane markings on the road using traditional techniques can be visualized in Fig. 21.2. In this direction, a study of various approaches for lane detection is presented in Sect. 21.2. The employed methods with their working concepts are presented in Sect. 21.3. Experimental analysis and results are discussed in Sect. 21.4, whereas concluding remarks are given in Sect. 21.5.
In order to detect lane markings, various studies involve learning lane line features using image processing, computer vision, feature- and model-based, and convolutional neural network (CNN) techniques. A practical and reliable roadway vanishing-point tracking system based on the principles of v-disparity and visual odometry is discussed, in which v-disparity mapping may effectively reduce the state space toward the vanishing point; the visual odometry also benefits the detection of the vanishing point for both straight and curved roads (Su et al. 2018). To measure the lane equation from the lane candidate, a Kalman filter and RANSAC were applied and then used in a state-transfer approach to maintain lane tracking (Choi et al. 2012). With ROI implementation, lane identification can be performed through Hough space; it was also mentioned that this model could be improved with GIS or an electronic map (Song et al. 2018). Using the Hough transform in Hough space, a lane line can be identified, where all points with parallel characteristics, length and angle, and apprehended characteristics are considered in Hough space (Zheng et al. 2018). A collection of collinear fuzzy lines and line searching is able to handle vague data and enables the computational burden to be decreased compared to the Hough transform (Obradović et al. 2013). A B-snake algorithm is addressed in the lane identifier, and canny/Hough vanishing point estimation (CHEVP) is applied with minimal mean square error (MMSE) to classify the control points on the two sides of the lane (Wang et al. 2004). Recognition of lane markings using lane detection and Hough transformation in combination with a field programmable gate array (FPGA) and a digital signal processor (DSP) was used, and lane markings can be accurately detected by using gradient direction and gradient amplitude together (Xiao et al. 2016). Feature line selection (FLS), a method based on a linear-cubic road model, is incorporated for two-way lane detection and involves only correct lane positions and angles in close regions (Liu et al. 2012). Lane detection using vanishing points is based on a probabilistic technique built on the intersection of line sections from an image; the host lane is optimized using interframe similarity (Yoo et al. 2017).
Feature-based techniques that use visible features from an image, such as boundaries (Gaikwad and Lokhande 2015; Lotfy et al. 2017; Son et al. 2015), colors and intensity variations, are widely used. Edge-based detection involves edge information, lane recognition and departure estimation. The most popular edge detection operators are Canny (Gaikwad and Lokhande 2015; Kortli et al. 2016), Sobel (Dai et al. 2016) and Prewitt (Li et al. 2014), which have been shown to be better for robust pixel-wise edge detection (Son et al. 2015). Using a stretching feature, an additional intensity-based enhancement can be performed to correctly distinguish lanes with different colors. The research in Gaikwad and Lokhande (2015) employs a 5-PLSF feature for contrast enhancement, followed by a lane-width constraint applied to calculate missing lanes, which decreases the system's false alert rate. Dai et al. (2016) used separate day- and night-time identification and then a gamma-correction method to obtain efficient detection under poorly lit conditions. Similarly, the analysis in Lotfy et al. (2017) acquired an image's inverse perspective map (IPM) and then used a score-based lane detection tracking system. The Hough transform (Gaikwad and Lokhande 2015; Kortli et al. 2016) and RANSAC were also implemented for lane recognition from the obtained edges. A Hough transform inspired by RANSAC (Lotfy et al. 2017) was also used for lane detection to reduce the time per frame. Traditional lane detection strategies are not very accurate in the presence of stray edges found in urban surroundings (Kortli et al. 2016); edge selection greatly improves Hough transform performance, raising the average detection accuracy. A further alternative to this framework is learning based (Gurghian et al. 2016) or smartphone based (Murugesh et al. 2016). Learning-based methods, as in Jayanth Balaji et al. (2017), Nair et al. (2017) and Singh et al. (2016), have been widely utilized in other applications. This technique is free from the traditional approach, but such a method requires an enormous labeled dataset to train a CNN, and the performance of such a system greatly depends on the time-consuming, classified training dataset.
Considering the scenario, the image frame is captured from the vision sensor mounted on the vehicle. Basic preprocessing phases are required to enhance the quality of the image and to perform some corrective actions. Afterwards, these images are passed through a filtering mechanism to fetch the contour information and recognize those pixels which belong to the lane marking schemes. Finally, the fitness of these lane pixels to a model-based approach is represented. In this direction, the following significant stages are performed:
21.3.1 Preprocessing
After acquiring the images from the camera or from a raw dataset, they need to be processed. Images are pre-processed to simplify and extract lane marking features from road surfaces. Color image processing is computationally expensive, so in this step the input image is transformed to a grayscale image, for which processing is computationally simple. Grayscale images are fully adequate for multiple tasks, so rigid color images are not needed here. One such procedure is contrast spreading, which re-maps the pixels to use the maximum range of possible values; it can be described by:

$$\text{Gray}(i, j) = \text{Transformation}(\text{Image}(i, j)) \quad (21.1)$$

where Image(i, j) is the gray value of the (i, j)th pixel of the given input image, Gray(i, j) is the gray value of the (i, j)th pixel of the improved image, and Transformation is the remapping procedure for the image; the output gray level values depend on the feature used for remapping.
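A minimal sketch of the contrast-spreading step, assuming a linear remapping to the full 8-bit range (the chapter leaves the exact Transformation function open):

```python
import numpy as np

def contrast_stretch(gray, out_min=0, out_max=255):
    """Re-map pixel values so the image spans the full [out_min, out_max]
    range -- a common linear choice for the Transformation in Eq. (21.1)."""
    g = gray.astype(np.float64)
    lo, hi = g.min(), g.max()
    stretched = (g - lo) / (hi - lo) * (out_max - out_min) + out_min
    return np.rint(stretched).astype(np.uint8)

# A low-contrast patch is stretched to cover 0..255:
img = np.array([[100, 120], [140, 160]], dtype=np.uint8)
print(contrast_stretch(img))
# [[  0  85]
#  [170 255]]
```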
To make this procedure more robust, pixel intensity values are rescaled to yield an image in which the brightness values of the pixels are more uniformly dispersed. Let I represent the number of intensity values and Q describe the normalized histogram function of the image; equalization can be described by:

$$\text{ImageEqualization}(i, j) = (I - 1) \sum_{n=0}^{I(i,j)} Q_n \quad (21.2)$$

where $Q_n$ represents the ratio of the number of pixels with intensity n to the total number of available pixels.
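Eq. (21.2) can be realized with a normalized cumulative histogram; the sketch below assumes 8-bit images (I = 256 gray levels) and rounds the re-mapped values to the nearest integer.

```python
import numpy as np

def equalize(gray, levels=256):
    """Histogram equalization per Eq. (21.2): each pixel value v is
    re-mapped to (levels - 1) * sum_{n=0}^{v} Q_n, the scaled CDF."""
    hist = np.bincount(gray.ravel(), minlength=levels)
    q = hist / gray.size              # normalized histogram Q_n
    cdf = np.cumsum(q)                # cumulative sum of Q_n
    lut = np.rint((levels - 1) * cdf).astype(np.uint8)
    return lut[gray]

img = np.array([[0, 0], [128, 255]], dtype=np.uint8)
print(equalize(img))
# [[128 128]
#  [191 255]]
```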
To reduce the noise present in the image, the brightness of the pixels found within a mask is averaged in a filtration procedure over a local neighborhood about location (i, j). The final outcome is given by:

$$\text{Image}(i, j) = \sum_{x=-p}^{p} \sum_{y=-p}^{p} h(x, y)\, \text{Image}(i + x, j + y) \quad (21.3)$$

Not all of the image needs to be processed, as the lane portion is found only in the bottom region of the image. So, cropping one portion from the top side of the image provides a reduced region to be processed and is computationally beneficial.
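The neighbourhood averaging of Eq. (21.3) with a uniform mask h(x, y) = 1/(2p+1)² can be sketched as below; replicating the border pixels is an assumption, as the chapter does not specify a border policy.

```python
import numpy as np

def mean_filter(image, p=1):
    """Average brightness over a (2p+1) x (2p+1) neighbourhood of each
    pixel, i.e. Eq. (21.3) with a uniform mask; borders are edge-replicated."""
    k = 2 * p + 1
    padded = np.pad(image.astype(np.float64), p, mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for x in range(k):
        for y in range(k):
            out += padded[x:x + image.shape[0], y:y + image.shape[1]]
    return out / k**2

# A single bright pixel is spread over its 3 x 3 neighbourhood:
spike = np.zeros((3, 3)); spike[1, 1] = 9.0
print(mean_filter(spike)[1, 1])  # 1.0
```

Cropping the top of the frame before filtering (e.g. `roi = image[image.shape[0] // 2:, :]`) reduces the region to be processed, as described above.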
where the partial derivatives with respect to i and j are the gradient components in the i and j directions. Similarly, to determine the partial derivatives, a one-dimensional filter is utilized in the convolution procedure, and the gradient direction is then computed using the following equation:

$$\theta = \tan^{-1}\!\left(\frac{G_i}{G_j}\right) \quad (21.5)$$

The computed gradient value is compared with upper and lower threshold values to accept or reject values in the predefined range.
The mentioned partial derivatives $G_i$ and $G_j$ are estimated with the help of derivative operators to compute the changes in the horizontal as well as the vertical direction. The following are a few operators which are applied to extract lane features in the image: the Prewitt operator is described by Eqs. (21.7) and (21.8), the Roberts operator is represented by Eqs. (21.9) and (21.10), and Eqs. (21.11) and (21.12) are utilized for the Sobel operator. The Laplacian operator, which uses only one kernel, is represented in Eq. (21.13).
$$G_i = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \quad (21.7)$$

$$G_j = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} \quad (21.8)$$

$$G_i = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \quad (21.9)$$

$$G_j = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \quad (21.10)$$

$$G_i = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix} \quad (21.11)$$

$$G_j = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \quad (21.12)$$

$$\text{Laplacian}: \begin{bmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{bmatrix} \quad (21.13)$$
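The Sobel kernels of Eqs. (21.11) and (21.12) can be applied with a naive sliding window to obtain the gradient magnitude and, via Eq. (21.5), the gradient direction:

```python
import numpy as np

def sobel_gradients(gray):
    """Apply the Sobel kernels of Eqs. (21.11)-(21.12) and return the
    gradient magnitude and direction (Eq. 21.5) for the interior pixels."""
    gi = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=np.float64)
    gj = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=np.float64)

    h, w = gray.shape
    di = np.zeros((h - 2, w - 2))
    dj = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = gray[i:i + 3, j:j + 3]
            di[i, j] = np.sum(gi * patch)
            dj[i, j] = np.sum(gj * patch)

    magnitude = np.hypot(di, dj)
    direction = np.arctan2(di, dj)    # theta = tan^-1(G_i / G_j)
    return magnitude, direction

# A vertical step edge produces a strong horizontal gradient:
g = np.zeros((4, 4)); g[:, 2:] = 255.0
mag, ang = sobel_gradients(g)
print(mag[0, 0])  # 1020.0
```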
Though the mentioned operators are fair enough to extract edges from images, they have restricted performance in terms of accuracy and computation time. Another approach for edge detection is the Canny mechanism (Canny 1986), which initially minimizes the noise by smoothing the given image with a Gaussian filter. Afterwards, the gradient is computed using any of the mentioned operators, followed by fetching the edge points. Finally, hysteresis is performed by iterating a kernel over all the available pixels in the image, verifying whether the current pixel is an edge component.
$$r = x \cos\theta + y \sin\theta \quad (21.6)$$

where θ is the orientation of the normal r to the x axis for the target line component.
The locations of the edge fragment points $(x_n, y_n)$ in the image are known and thus act as constants in the parametric line equation, while the unknown variables are searched for. If we plot the distinct outcomes (r, θ) identified by each $(x_n, y_n)$, the points in the cartesian image space map to curves in the polar Hough space. For straight lines, this point-to-curve conversion is the Hough transformation.
The transformation is realized by quantizing the Hough parameter space into finite intervals. As the algorithm runs, each $(x_n, y_n)$ is converted into a discrete (r, θ) curve, incrementing the accumulator cells that lie along this curve. The corresponding spikes in the accumulator array provide concrete proof that the frame contains a matching straight line.
Curves created in the gradient image by the collinear points converge at peaks in the Hough transform space. These convergence points represent the straight-line fragments of the original image. Finally, a subjective baseline is applied to obtain the strong characteristics (r, θ) in the final processed image which relate to each of the straight-line borders. A typical example of the mentioned approach is represented in Fig. 21.6, and lane detection after applying the Hough transform can be visualized in Fig. 21.7.
In the proposed approach, all the implementations are carried out using Python and OpenCV. All the steps have been tested on data available in the KITTI dataset (Geiger et al. 2013). 289 images were used to test the proposed approach, which determined the lane markings successfully. Apart from these, some random video sequences were also tested under this scheme, and the obtained results are shown in Figs. 21.8, 21.9, 21.10, 21.11, 21.12 and 21.13.
Several techniques for the extraction of edges have been utilized in many studies and compared from a computational perspective. Here, the figure of merit (FOM) (Pratt 1978) has also been applied, which determines the error between the true gradient locations and the estimated gradient locations. The quantitative outcome for all the operators applied in this approach is given in Table 21.1.
21.5 Conclusion
References
Canny JF (1986) A computational approach to edge detection. IEEE Trans Pattern Analysis Mach
Intell 8(6):679–698
Castorena J, Agarwal S (2017) Ground-edge-based LIDAR localization without a reflectivity cali-
bration for autonomous driving. IEEE Robot Autom Lett 3(1):344–351
Changalvala R, Malik H (2019) LiDAR data integrity verification for autonomous vehicle. IEEE
Access 7:138018–138031
Choi HC, Park JM, Choi WS, Oh SY (2012) Vision-based fusion of robust lane tracking and forward
vehicle detection in a real driving environment. Int J Autom Technol 13(4):653–669
Conrad P, Foedisch M (2003) Performance evaluation of color based road detection using neural nets
and support vector machines. In: Proceedings of applied imagery pattern recognition workshop,
Washington, DC
Cui Y, Wu J, Xu H, Wang A (2020) Lane change identification and prediction with roadside LiDAR
data. Optics Laser Technol 123:
Dai J, Wu L, Lin H, Tai W (2016) A driving assistance system with vision based vehicle detection
techniques
Dewangan DK, Sahu SP (2020) Real time object tracking for intelligent vehicle. In: 2020 first
international conference on power, control and computing technologies (ICPC2T). IEEE, pp
134–138
Drews P, Williams G, Goldfain B, Theodorou EA, Rehg JM (2019) Vision-based high-speed driving
with a deep dynamic observer. IEEE Robot Autom Lett 4(2):1564–1571
Fernandez MG, Lopez YA, Arboleya AA, Valdes BG, Vaqueiro YR, Andres FLH, Garcia AP (2018)
Synthetic aperture radar imaging system for landmine detection using a ground penetrating radar
on board a unmanned aerial vehicle. IEEE Access 6:45100–45112
Foedisch M, Takeuchi A (2004) Adaptive real-time road detection using neural networks. In: Pro-
ceedings of the 7th international IEEE conference on intelligent transportation systems, Wash-
ington, DC, 3-6 Oct 2004
Gaikwad V, Lokhande S (2015) Lane departure identification for advanced driver assistance. IEEE
Trans Intell Transp Syst 16(2):910–918
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: the kitti dataset. Int J Robot
Res 32(11):1231–1237
GHO—by category—road traffic deaths—data by country. World Health Organization. https://apps.
who.int/gho/data/node.main.A997. Road traffic deaths. Cited 12 Sep 2020
Gupta A, Choudhary A (2018) A framework for camera-based real-time lane and road surface
marking detection and recognition. IEEE Trans Intell Vehicles 3(4):476–485
Gurghian A, Koduri T, Bailur SV, Carey KJ, Murali VN (2016) DeepLanes: end-to-end lane position
estimation using deep neural networks. In: 2016 IEEE conference on computer vision and pattern
recognition workshops, pp 38–45
He Y, Wang H, Zhang B (2004) Color-based road detection in urban traffic scenes. IEEE Trans
Intell Transp Syst 5(4):309–318
Jayanth Balaji A, Harish Ram DS, Nair BB (2017) Machine learning approaches to electricity
consumption forecasting in automated metering infrastructure (AMI) systems: an empirical study.
In: Silhavy R, Senkerik R, Kominkova Oplatkova Z, Prokopova Z, Silhavy P (eds) CSOC 2017.
AISC, vol 574. Springer, Cham, pp 254–263. https://doi.org/10.1007/978-3-319-57264-2_26
Jian W, Zhong J, Yuting S (2009) Unstructured road detection using hybrid features. In: International conference on machine learning and cybernetics, Baoding, China, pp 482–486
Kong H, Audibert J-Y (2010) General road detection from a single image. IEEE Trans Image
Process 19(8)
Kortli Y, Marzougui M, Atri M (2016) Efficient implementation of a real-time lane departure
warning system. In: 2016 international image processing, application system, pp 1–6
Li Q, Chen L, Li M, Shaw SL, Nüchter A (2014) A sensor-fusion drivable-region and lane detection
system for autonomous vehicle navigation in challenging road scenarios. IEEE Trans Veh Technol
63(2):540–555
Liu X, Xu X, Dai B (2012) Vision-based long-distance lane perception and front vehicle location for full autonomous vehicles on highway roads 19:1454–1465
Lotfy OG et al (2017) Lane departure warning tracking system based on score mechanism. In:
Midwest symposium circuits systems, pp 16–19
Murugesh R, Ramanadhan U, Vasudevan N, Devassy A, Krishnaswamy D, Ramachandran A (2016)
Smartphone based driver assistance system for coordinated lane change. In: 2015 international
conference on connected vehicles and expo, ICCVE 2015—proceedings, pp 385–386
Nair BB, Kumar PKS, Sakthivel NR, Vipin U (2017) Clustering stock price time series data to
generate stock trading recommendations: an empirical study. Expert Syst Appl 70:20–36
Obradović D, Konjović Z, Pap E, Rudas IJ (2013) Linear fuzzy space-based road lane model and
detection. Knowledge-Based Syst 38:37–47
Okamoto K, Itti L, Tsiotras P (2019) Vision-based autonomous path following using a human
driver control model with reliable input-feature value estimation. IEEE Trans Intell Vehicles
4(3):497–506
Pratt WK (1978) Digital image processing. Wiley-Interscience, New York
Rasmussen C (2004) Texture-based vanishing point voting for road shape estimation. In: British
machine vision conference
Sha Y, Zhang G-Y (2007) A road detection algorithm by boosting using feature combination. In:
2007 IEEE intelligent vehicles symposium, pp 364–368
Singh AK, John BP, Subramanian SV, Kumar AS, Nair BB (2016) A low-cost wearable Indian
sign language interpretation system. In: International conference on robotics & automation for
humanitarian applications
Son J, Yoo H, Kim S, Sohn K (2015) Real-time illumination invariant lane detection for lane
departure warning system. Expert Syst Appl 42(4):1816–1824
Song W, Yang Y, Fu M, Li Y, Wang M (2018) Lane detection and classification for forward collision
warning system based on stereo vision. IEEE Sensors J 18(12):5151–5162
Su Y, Zhang Y, Lu T, Yang J, Kong H (2018) Vanishing point constrained lane detection with a
stereo camera. IEEE Trans Intell Transp Syst 19(8):2739–2744
Types of roads and lane system in India explained. https://www.cars24.com/blog/types-of-roads-
lane-system-in-india/. Cited 12 Sep 2020
Wang Y, Chen D, Shi C (2008) Vision-based road detection by adaptive region segmentation and
edge constraint. In: Second international symposium on intelligent information technology appli-
cation, pp 342–346
Wang Y, Teoh Eam K, Shen D (2004) Lane detection and tracking using B-Snake. Image Vis Comput 22:269–280
Xiao J, Li S, Sun B (2016) A real-time system for lane detection based on FPGA and DSP. Sens
Imaging 17(6):1–13
Yoo JH, Lee S-W, Park S-K, Kim DH (2017) A robust lane detection method based on vanishing
point estimation using the relevance of line segments. IEEE Trans Intell Transp Syst 18(12):3254–
3266
Zhang J, Nagel HH (1994) Texture-based segmentation of road images. In: IEEE symposium on intelligent vehicles, Washington DC, pp 260–265
Zhang G, Zheng N, Cui C (2009) An efficient road detection method in noisy urban environment.
In: IEEE intelligent vehicles symposium. Xi’an, China, pp 556–561
Zheng F, Luo S, Song K, Yan C-W, Wang M-C (2018) Improved lane line detection algorithm based
on Hough transform. Pattern Recogn Image Analysis 28:254–260
Zhou S, Jiang Y (2010) A novel lane detection based on geometrical model and Gabor filter. In:
IEEE intelligent vehicles symposium. San Diego, CA, USA, pp 59–64
Chapter 22
An Improved DCNN Based Facial
Micro-expression Recognition System
22.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 349
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_22
350 D. Garg and G. K. Verma
1. Local Binary Pattern (Ahonen et al. 2006; Zhao and Pietikainen 2007),
2. Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) (Liu et al. 2016;
Wang et al. 2017), and
3. Histograms of Oriented Gradients (Kim and Cho 2014; Mishra et al. 2013; Shu
et al. 2011).
However, here our literature survey is limited to deep learning-based studies. Methodologies based on deep learning have demonstrated their practicality for various visual tasks and have attracted tremendous interest from the computer vision community. Lately, deep learning features have also been studied for analyzing spontaneous micro-expressions. Peng et al. (2017) proposed a two-stream network termed the dual temporal scale convolutional neural network (DTSCNN) to recognize unconstrained micro-expressions. The two streams of DTSCNN are used to adapt to the different frame rates of micro-expression video clips. Each stream of DTSCNN comprises an independent shallow network to avoid overfitting. The network is then fed with optical-flow sequences to ensure that the shallow networks can acquire higher-level features. They evaluated their approach on benchmark databases, i.e., CASME I/II, and attained accuracy about 10% higher than other state-of-the-art approaches.
Al-Shabi et al. (2017) developed an aggregator model based on the scale invariant feature transform (SIFT) and a CNN. They extracted both dense SIFT and regular SIFT features and merged them with CNN features to increase performance on small data. The accuracy they report is 73.4% on FER-2013 and 99.1% on the CK+ database.
Takalkar and Xu (2017) showed that it is possible to substantially improve accuracy over baseline results for classifying micro-expressions using CNNs pre-trained for face recognition tasks. Their experiments also conclude that the small sizes of the available datasets are insufficient for training CNNs from scratch. Nevertheless, CNNs trained on sufficiently large datasets of facial micro-expressions can obtain better results than the baseline without using the data augmentation procedure. They combined two datasets, CASME and CASME-II, to form a larger dataset, which was then used to fine-tune a satisfactory CNN-based micro-expression recognizer.
Zhang et al. (2018) designed a network termed SMEConvNet to analyze micro-expressions in long videos. They extracted 500 features per frame and built a feature matrix. They then proposed a technique for processing the feature matrix to locate the apex frame in a video, which utilizes a sliding window and considers the attributes of micro-expressions to search for the apex frame. Experimental results show that the proposed strategy achieves the highest apex spotting rate (0.8280) and the smallest mean absolute error (22.36) among the compared techniques.
Li et al. (2016) proposed a methodology merging a deep learning network and histograms of oriented optical flow (HOOF) to recognize micro-expressions. They utilized a CNN to localize facial regions and extracted region-based normalized HOOF features.
Hu et al. (2018) presented a framework to recognize micro-expressions by merging deep learning techniques and handcrafted features. They merged temporal and spatial features through local Gabor binary patterns from three orthogonal planes to capture the local facial movements and trained a CNN model on a micro-expression dataset. The results show that the proposed approach achieves better performance than other mainstream micro-expression recognition strategies.
Li et al. (2018) proposed an algorithm in which they employed a CNN for identifying the facial landmarks/regions. Moreover, they employed a fused CNN to extract the optical-flow features from the facial landmarks that capture the muscle movements occurring during micro-expressions. They applied the proposed algorithm on two databases and achieved better results in analyzing micro-expressions.
22.3 Methodology
In this section, we discuss the most revolutionary and potent technical field, i.e., deep learning, and the convolutional neural network (CNN) standard, which establishes a framework for investigating the deep convolutional neural network (DCNN) for micro-expression recognition.
Deep learning is a potent branch of artificial intelligence. It is built on well-defined architectures that imitate, or are motivated by, the structure of the human brain; such an architecture is termed an artificial neural network. It is capable of emulating the way a human brain handles information and recognizes patterns in decision making.
The structure of deep learning consists of multiple layers. Each layer has neural nodes that can communicate with other network nodes and has full capacity to learn various features and data representations. At present, numerous deep network architectures have been advanced, for example, LeNet (LeCun et al. 1998), AlexNet (Krizhevsky et al. 2012), VGG (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015), ResNet (He et al. 2016), etc. In this chapter, we predominantly discuss the DCNN for micro-expression recognition, given the extraordinary success of CNNs in emotion recognition and computer vision.
The CNN is a popular deep learning model and has shown tremendous success in various areas. Proposed by LeCun et al. (1998), it incorporates a series of building blocks or layers. The main objective is to process the given input, extract high- and low-level features, and classify it into specific classes.
CNN architecture has two phases: The first phase accomplishes the feature learn-
ing through the convolutional layer, activation function, and pooling layer. In contrast,
the second phase comprises a fully-connected and SoftMax layer, which performs
classification (as shown in Fig. 22.1). A complex CNN architecture involves repe-
titions of a few convolution layers and a pooling layer, trailed by at least one fully
connected layer.
The convolutional layer is connected to a section of the input (the normalized image) and performs convolution operations between the given input and a kernel. The convolution operation is a linear operation used for extracting features, where a kernel/filter (a small array of numbers) is applied over the given input, itself an array of numbers called a tensor. A dot product between every section of the kernel/filter and the input tensor is computed at every location of the tensor and summed to obtain the output at the corresponding location of the output tensor, termed a feature map. The convolution operation between an image (I) and a kernel (K) is given by Eq. 22.1:
$$(I * K)_{x,y} = \sum_{p=1}^{I_h} \sum_{q=1}^{I_w} \sum_{r=1}^{I_c} I_{x+p-1,\,y+q-1,\,r}\; K_{p,q,r} \qquad (22.1)$$
where I_h, I_w, and I_c are the height, width, and number of channels of the image, respectively, and x and y index the pixel position of the local receptive field. This procedure is repeated with multiple convolution filters to form several feature extractors, each signifying different attributes of the given input tensor. After that, an activation function is applied to the output tensor, introducing nonlinearity so that not all neurons are activated simultaneously. ReLU is extensively used as an activation function, as it converges several times faster than the tanh and sigmoid functions. It is described by Eq. 22.2:

$$f(x) = \max(0, x) \qquad (22.2)$$
This output tensor is then directed to the following layer of neurons as input.
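As an illustrative sketch (in plain NumPy, not the chapter's TensorFlow implementation), the convolution of Eq. 22.1 followed by the ReLU activation of Eq. 22.2 can be written as:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution per Eq. 22.1: dot product of the kernel with each
    local receptive field, summed over height, width, and channels."""
    ih, iw, ic = image.shape           # input height, width, channels
    kh, kw, kc = kernel.shape          # kernel height, width, channels
    assert ic == kc, "kernel must span all input channels"
    oh, ow = ih - kh + 1, iw - kw + 1
    fmap = np.zeros((oh, ow))
    for x in range(oh):
        for y in range(ow):
            patch = image[x:x + kh, y:y + kw, :]
            fmap[x, y] = np.sum(patch * kernel)  # dot product and summation
    return fmap

def relu(t):
    """ReLU activation per Eq. 22.2, applied element-wise."""
    return np.maximum(0.0, t)

img = np.random.randn(8, 8, 3)        # toy input tensor
k = np.random.randn(3, 3, 3)          # one 3x3 kernel over 3 channels
feature_map = relu(conv2d(img, k))    # 6x6 feature map, all entries >= 0
```

Stacking many such kernels produces the multiple feature maps described above; deep learning frameworks implement the same operation with vectorized kernels rather than explicit loops.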
The pooling layer down-samples the features of the input without affecting the number of channels:
$$I^{[l]}_{x,y,z} = \operatorname{pool}\left(I^{[l-1]}\right)_{x,y,z} \qquad (22.3)$$
where l is the current layer and l − 1 is the previous layer. The first convolutional and pooling layers obtain low-level information from the image, while stacking them enables high-level feature extraction. The pooling outcomes are fed into a fully connected layer, which classifies the given input into classes/labels. Considering the kth node of the lth layer, it can be defined by Eq. 22.4:
$$z_k^{[l]} = \sum_{i=1}^{n_{l-1}} w_{k,i}^{[l]} \, I_i^{[l-1]} + b_k^{[l]} \qquad (22.4)$$
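To make Eqs. 22.3 and 22.4 concrete, here is a minimal NumPy sketch (an illustration, not the chapter's implementation) of a max-pooling layer and a fully connected node:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Pooling layer per Eq. 22.3: down-sample each channel independently,
    leaving the number of channels unchanged."""
    h, w, c = fmap.shape
    out = np.zeros((h // size, w // size, c))
    for x in range(h // size):
        for y in range(w // size):
            window = fmap[x * size:(x + 1) * size, y * size:(y + 1) * size, :]
            out[x, y, :] = window.max(axis=(0, 1))  # max over each window
    return out

def dense(inputs, weights, bias):
    """Fully connected node per Eq. 22.4: z_k = sum_i w_ki * I_i + b_k,
    computed here for all output nodes at once."""
    return weights @ inputs + bias

pooled = max_pool(np.random.randn(6, 6, 4))          # -> shape (3, 3, 4)
flat = pooled.reshape(-1)                            # flatten before the FC layer
logits = dense(flat, np.random.randn(6, flat.size), np.zeros(6))  # 6 classes
```

Note that pooling halves the spatial resolution while the channel count stays at 4, exactly as Eq. 22.3 states.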
2. CLASSIFICATION PROCESS
INPUT: x̂, an image to be classified
f̂ ← CNN(x̂)
OUTPUT: class = f_output(i), where i = 1, 2, …, K   % the class that x̂ belongs to
All the experimentation was carried out on the benchmark CASME-II database to recognize micro-expressions, i.e., to evaluate algorithms fit for recognizing the subtle facial muscle motions that reveal affective states. CASME-II (Yan et al. 2014) contains 246 spontaneous facial micro-expressions recorded with a 200-fps camera. These samples were selected from over 2500 facial expressions. CASME-II improves on its predecessor in larger sample size, stable lighting, and higher resolution (both temporal and spatial). The selected micro-expressions in this dataset either had a total duration under 500 ms or an onset duration (time from onset frame to apex frame) under 250 ms. The samples are coded with onset and offset frames and categorized with action units (AUs) and emotions. The hyperparameters used in our experiments on CASME-II and their respective values are shown in Table 22.1.
CASME-II comprises mainly six classes of micro-expressions, namely 'Happiness,' 'Fear,' 'Sadness,' 'Surprise,' 'Repression,' and 'Disgust.' We have chosen 1841 samples of happiness, 274 samples of sadness, 1605 samples of surprise, 4204 samples of disgust, 2187 samples of repression, and other samples for analyzing micro-expressions. These samples are gathered from a number of different subjects of various genders. Some samples from this database are shown in Fig. 22.3.
22.4.2 Implementations
The implementation is done with TensorFlow and Python 3.5.2. TensorFlow is an open-source software library used for the experimentation in this study. TensorFlow computations are expressed as stateful dataflow graphs. The name TensorFlow derives from the operations that neural networks perform on multidimensional arrays, which are referred to as tensors. In our work, we utilized TensorBoard to construct complex charts and diagrams. We tracked and visualized model parameters such as accuracy and the loss function using TensorBoard. It also helps in viewing tensor histograms as they vary over time. We evaluated the model with varying numbers of training images and different numbers of classes, namely two, four, and six.
The spontaneous emotion recognition model identifies the micro-expressions of human beings. As discussed in the previous section, we utilized the standard CASME-II database comprising 12,000 images. We divided these images between the training and testing phases: approximately 80% of the images were utilized for training and the remaining images for testing. These images were then classified into several classes in the classification process.
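The roughly 80/20 split described above can be sketched as follows (a NumPy illustration; the chapter does not state how the split was performed):

```python
import numpy as np

def train_test_split(num_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into disjoint training and
    test sets, mirroring the approximate 80/20 division used here."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_samples)       # random order of all samples
    cut = int(train_fraction * num_samples)  # boundary between the two sets
    return idx[:cut], idx[cut:]

# 12,000 CASME-II images -> 9600 for training, 2400 for testing
train_idx, test_idx = train_test_split(12000)
```

Shuffling before splitting avoids any ordering bias (e.g., all samples of one class landing in the test set).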
The results are given in terms of cost and accuracy for two classes (Happiness and Sadness), four classes (Happiness, Sadness, Disgust, and Fear), and six classes, as shown in Table 22.2. The cost function quantifies the performance of the deep neural model on the given data. It evaluates the relationship between the predicted and expected values and is framed in terms of cost (an estimate of the error). It is minimized to guarantee improved working of the model. These two metrics are of prime significance because they directly govern the performance of the developed model and should be varied to improve it. The cost graphs for two, four, and six classes are given in Figs. 22.4, 22.6, and 22.8, respectively. The following figures portray the cost and accuracy for the various numbers of classes taken at a time (Figs. 22.6 and 22.8).
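The chapter does not specify the exact form of the cost function; with a SoftMax output layer, a common choice is categorical cross-entropy, sketched here in NumPy as an illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the class dimension."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_cost(logits, labels):
    """Average cost between the predicted class distributions and the
    expected (integer) labels; this is the quantity minimized in training."""
    probs = softmax(logits)
    n = labels.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

logits = np.array([[2.0, 0.1],      # two samples, two classes
                   [0.2, 1.5]])
labels = np.array([0, 1])           # expected classes
cost = cross_entropy_cost(logits, labels)  # small when predictions match labels
```

The cost shrinks as the predicted distribution concentrates on the expected class, which is why cost and accuracy move together in Table 22.2.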
Test results show that our framework accomplished the highest performance (recognition accuracy) against existing investigations reported in the literature. The test accuracy graphs for two, four, and six classes are shown in Figs. 22.5, 22.7, and 22.9, respectively. We have also compared the performance of our framework with other works, as shown in Table 22.3. To summarize, the DCNN can adequately learn low-level and high-level features from imbalanced data, decipher the subtle movements in the facial regions of micro-expressions, and achieve remarkable performance for recognizing micro-expressions.
22.6 Conclusion
References
Ahonen T, Hadid A, Pietikainen M (2006) Face description with local binary patterns: application
to face recognition. IEEE Trans Pattern Anal Mach Intell 28(12):2037–2041
Bartlett MS, Littlewort G, Frank M, Lainscsek C, Fasel I, Movellan J (2005, June) Recognizing
facial expression: machine learning and application to spontaneous behavior. 2005 IEEE Comput
Soc Conf Comput Vis Pattern Recogn (CVPR’05) 2:568–573
Chan CH, Goswami B, Kittler J, Christmas W (2011) Local ordinal contrast pattern histograms for
spatiotemporal, lip-based speaker authentication. IEEE Trans Inf Forensics Secur 7(2):602–612
Connie T, Al-Shabi M, Cheah WP, Goh M (2017) Facial expression recognition using a hybrid CNN-
SIFT aggregator. International workshop on multi-disciplinary trends in artificial intelligence.
Springer, Cham, pp 139–149
Dollár P, Rabaud V, Cottrell G, Belongie S (2005, Oct) Behavior recognition via sparse spatio-
temporal features. In: 2005 IEEE international workshop on visual surveillance and performance
evaluation of tracking and surveillance. IEEE, pp 65–72
Ekman P, Friesen WV (1969) Nonverbal leakage and clues to deception. Psychiatry 32(1):88–106
Haggard EA, Isaacs KS (1966) Micromomentary facial expressions as indicators of ego mechanisms
in psychotherapy. Methods of research in psychotherapy. Springer, Boston, MA, pp 154–165
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings
of the IEEE conference on computer vision and pattern recognition, pp 770–778
Huang X, Zhao G, Zheng W, Pietikäinen M (2012) Towards a dynamic expression recognition
system under facial occlusion. Pattern Recogn Lett 33(16):2181–2191
Huang X, Zhao G, Hong X, Zheng W, Pietikäinen M (2016) Spontaneous facial micro-expression
analysis using spatiotemporal completed local quantized patterns. Neurocomputing 175:564–578
Hu C, Jiang D, Zou H, Zuo X, Shu Y (2018, Aug) Multi-task micro-expression recognition combin-
ing deep and handcrafted features. In: 2018 24th international conference on pattern recognition
(ICPR). IEEE, pp 946–951
Jan B, Farman H, Khan M, Imran M, Islam IU, Ahmad A, Jeon G (2019) Deep learning in big data
analytics: a comparative study. Comput Electr Eng 75:275–287
Kim S, Cho K (2014) Fast calculation of histogram of oriented gradient feature by removing
redundancy in overlapping block. J Inf Sci Eng 30(6):1719–1731
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks. Adv Neural Inf Process Syst:1097–1105
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document
recognition. Proc IEEE 86(11):2278–2324
Liu L, Fieguth P, Wang X, Pietikäinen M, Hu D (2016, Oct) Evaluation of LBP and deep texture
descriptors with a new robustness benchmark. European conference on computer vision. Springer,
Cham, pp 69–86
Li Q, Yu J, Kurihara T, Zhan S (2018, Apr) Micro-expression analysis by fusing deep convolutional
neural network and optical flow. In: 2018 5th international conference on control, decision and
information technologies (CoDIT). IEEE, pp 265–270
Li X, Yu J, Zhan S (2016, Nov) Spontaneous facial micro-expression detection based on deep
learning. In: 2016 IEEE 13th international conference on signal processing (ICSP). IEEE, pp
1130–1134
Mishra G, Aung YL, Wu M, Lam SK, Srikanthan T (2013, Dec) Real-time image resizing hardware
accelerator for object detection algorithms. In: 2013 international symposium on electronic system
design. IEEE, pp 98–102
Pantic M, Rothkrantz LJ (2004) Facial action recognition for facial expression analysis from static
face images. IEEE Trans Syst Man Cybern Part B (Cybern) 34(3):1449–1461
Peng M, Wang C, Chen T, Liu G, Fu X (2017) Dual temporal scale convolutional neural network
for micro-expression recognition. Frontiers Psychol 8:1745
Pfister T, Li X, Zhao G, Pietikäinen M (2011, Nov) Differentiating spontaneous from posed facial
expressions within a generic facial expression recognition framework. In: 2011 IEEE international
conference on computer vision workshops (ICCV workshops). IEEE, pp 868–875
Shu C, Ding X, Fang C (2011) Histogram of the oriented gradient for face recognition. Tsinghua
Sci Technol 16(2):216–224
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recog-
nition. arXiv:1409.1556
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich
A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 1–9
Takalkar MA, Xu M (2017, Nov) Image based facial micro-expression recognition using deep learn-
ing on small datasets. In: 2017 international conference on digital image computing: techniques
and applications (DICTA). IEEE, pp 1–7
Verma GK (2017, Nov) Facial micro-expression recognition using discrete curvelet transform. In:
2017 conference on information and communication technology (CICT). IEEE, pp 1–6
Wang Y, See J, Phan RCW, Oh YH (2014, Nov) LBP with six intersection points: reducing redundant
information in LBP-TOP for micro-expression recognition. Asian conference on computer vision.
Springer, Cham, pp 525–537
Wang Y, See J, Oh YH, Phan RCW, Rahulamathavan Y, Ling HC, Li X (2017) Effective recog-
nition of facial micro-expressions with video motion magnification. Multimedia Tools Appl
76(20):21665–21690
Yan WJ, Wu Q, Liu YJ, Wang SJ, Fu X (2013, Apr) CASME database: a dataset of spontaneous
micro-expressions collected from neutralized faces. In: 2013 10th IEEE international conference
and workshops on automatic face and gesture recognition (FG). IEEE, pp 1–7
Yan WJ, Li X, Wang SJ, Zhao G, Liu YJ, Chen YH, Fu X (2014) CASME II: an improved sponta-
neous micro-expression database and the baseline evaluation. PloS One 9(1):e86041
Zhang Z, Chen T, Meng H, Liu G, Fu X (2018) SMEConvNet: a convolutional neural network for
spotting spontaneous facial micro-expression from long videos. IEEE Access 6:71143–71151
Zhao G, Pietikainen M (2007) Dynamic texture recognition using local binary patterns with an
application to facial expressions. IEEE Trans Pattern Anal Machine Intell 29(6):915–928
Chapter 23
Selective Deep Convolutional Framework
for Vehicle Detection in Aerial Imagery
23.1 Introduction
Growing vehicle counts on the roads have reinstated the necessity of automation in transportation systems. Numerous efforts are being made to render automated vehicle detection practically viable. Intelligent transportation, with advancements in driverless cars (Hosang et al. 2015; Moranduzzo and Melgani 2013), has stimulated conglomerated solutions using sensors and algorithms.
The multifold advantages of autonomous vehicles have made it a mandate for all leading organizations to incorporate such features. Improved mobility, reduction in traffic cost, and optimization of many other factors in transportation and road safety have attracted the majority of researchers' attention to evaluating novel object detection techniques for vehicle detection, ultimately resulting in autonomous vehicles (Lowe 2004; Moranduzzo and Melgani 2013). Radar-based vehicle detection has demonstrated the discriminative capability of object detection on vehicle datasets in diversified environments. However, the input images received from cameras have made it possible for researchers to target robust solutions for driverless vehicles. This has triggered a set of explorations into developing vision-based vehicle detection techniques for ADAS. With this objective, the paper examines some of the best methods for vehicle detection when objects appear as small instances within the image.
The remainder of this manuscript is structured as follows: Sect. 23.2 details the
different techniques such as HOG and SIFT used in vehicle detection on the VEDAI
database of vehicle images. Furthermore, the section lists the shortcomings of the
earlier mentioned handcrafted features and compares them to significant features like
object proposal methods. Section 23.3 discusses the feature ontology of histogram
of gradient combined with logistic regression. Feature aggregation of scale invariant
feature transform (SIFT) and bag of words (BoW) is elucidated in detail as a pream-
ble to the proposed model. The work builds a semantic network model using the
feature ontology of HOG and logistic regression and aggregated features made up of
SIFT and BoW. The ontological representation generated here yields the necessary
feature discriminability. Section 23.4 illustrates the proposed novel relevance feed-
back approach that uses the semantic network model with aggregated features and
feature ontology. The proposed semantic features are given as input to the CNN along
with object proposal methods. The feature engineering here is compared with the
DNN (CNN + HOG) proposed by the author in one of his earlier works. Database and
experimental results are discussed in Sect. 23.5. Conclusions are drawn in Sect. 23.6.
Shape detection is pursued effectively for real-time object detection, as shape conveys the unique exterior characteristics of objects. This distinguished representation of objects against the background is important in scene understanding. A good shape descriptor should be unaffected by rotational and scale transformations, which is why it should be chosen insightfully for object detection (Cheng and Han 2016; Moranduzzo and Melgani 2013). A large number of techniques have been proposed for describing shapes in object detection, wherein points of interest are discerned in the images and compared with those registered from dataset images to find the object of interest (Razakarivony and Jurie 2016). This part of interest inside the image is normally treated as a feature (Sommer et al. 2017; Villon et al. 2016). A detailed survey in this regard is elaborated in Table 23.1.
Feature extraction can be realized with numerous techniques, such as the scale invariant feature transform (SIFT) and histograms of oriented gradients (HOG) (Cheng and Han 2016; Moranduzzo and Melgani 2013; Xu et al. 2016).
Detection of the various classes of vehicles from aerial images is very useful. The
system, when applied to real-time scenarios, helps in solving various day-to-day
problems like screening of large areas, surveillance, traffic management, vehicle
density detection, urban planning, and many others. In real-time scenarios, the vehi-
cle density detection on the roads can be used in navigation systems to provide the
best possible path to travel and thus reduce travel time (Xu et al. 2016). Since manual
image analysis is a difficult task, providing an automated system for aerial image
analysis makes object localization, detection, and classification more efficient. Aerial
image analysis for vehicle detection has seen limitless applications such as land map-
ping, screening of large areas, etc. The enormous need for early detection of vehicles
has resulted in aerial imagery being used in a lot of vehicle detection applications.
The solutions that have been devised have become the most comprehensive and
sophisticated research in scene understanding, semantic analysis for traffic surveil-
lance, and defense applications. When compared with ground image-based object
368 K. V. Sakhare and V. Vyas
detection, aerial imagery-based techniques are faced with the exacting task of discerning very small-sized vehicles that are obscured in their backgrounds (Chen et al.
2014). Keeping these peculiar challenges in mind, a few object detectors have been
investigated (Cucchiara et al. 2000). This paper exposes the limited utilization of
these object detectors in aerial imagery, as these are found to be lacking in handling
multi-scale imagery and fast response times.
Object detection from aerial imagery has seen growing attention from researchers due to its ability to capture larger areas in a single image. At the same time, achieving the desired level of accuracy with real-time precision has always been a challenging scenario. Small-instance objects such as cars, pick-up vans, and vans occupy few pixels relative to the whole image area. Even state-of-the-art object detectors give limited efficiency in aerial image analysis. This has demanded continued research on object detection systems for small-instance objects with improved accuracy. Significant work has been carried out
in vehicle detection from the aerial imagery (Ajmal and Hussain 2010; Alexe et al.
2010; Dalal and Triggs 2005; Dhanaraj et al. 2020; Hosang et al. 2015; Lowe 2004;
Rabiu et al. 2013; Sakhare et al. 2020; Tayara et al. 2017; Tewari et al. 2019; Van de
Sande et al. 2011), still keeping scope for improvement in terms of accuracy, minimizing complexity, and achieving robust performance on dynamically changing backgrounds. The combination of handcrafted features and a classifier, the preferred duo of most studies, is always bound to feature engineering based on human ingenuity (Dalal and Triggs 2005; Ren et al. 2015; Tewari et al. 2019). With their two-stage approach of feature extraction and classification, those systems proved computationally complex and yet offered limited efficacy under occlusion, lighting variation, clutter, and rotation.
Given the existing shortfalls of the prevalent techniques, the work proposes a novel
vehicle detection system offering more efficiency and robustness to address small
instances of vehicle objects within the aerial images. This proposed method utilizes
an adaptive model to symbolize optimum features required for object representation.
The optimum features are obtained by employing logistic regression and histograms
of oriented gradients (HOG) (Cheng and Han 2016; Sommer et al. 2017). The feature
ontology uses a logistic loss between the test object and the training sets for feature
selection.
SIFT-based techniques perform matching of local features in the image using two stages, a feature detector and a descriptor (Zheng et al. 2017). SIFT methods function cohesively with bag of words models, as BoW was initially proposed for document parsing, where word responses are accumulated into a vector. The scale invariant feature transform (SIFT) (Uijlings et al. 2013) along with the BoW model yields a better performer for object detection (Fig. 23.1).
Figure 23.2 demonstrates the feature aggregation reducing the dimensionality of the input feature sets. SIFT is most popularly used for identifying prominent, stable features in an image. It generates rotation- and scale-invariant feature points, each describing a small image region around the point (Girshick 2015). A generic SIFT framework for object detection proceeds in the following steps:
1. Location and scale of salient feature points are determined. Intensity changes
are identified using difference of Gaussians at nearby scales.
$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \, e^{-(x^2 + y^2)/(2\sigma^2)} \qquad (23.1)$$
2. The DoG function is expanded around a key point (x_i, y_i, σ_i) using a second-order Taylor series:

$$D(x, y, \sigma) \approx D(x_i, y_i, \sigma_i) + \left.\frac{\partial D(x, y, \sigma)}{\partial (x, y, \sigma)}\right|_{x=x_i,\,y=y_i,\,\sigma=\sigma_i}^{T} \Delta + \frac{1}{2}\, \Delta^{T} \left.\frac{\partial^2 D(x, y, \sigma)}{\partial (x, y, \sigma)^2}\right|_{x=x_i,\,y=y_i,\,\sigma=\sigma_i} \Delta \qquad (23.4)$$

where

$$\Delta = \begin{bmatrix} x - x_i \\ y - y_i \\ \sigma - \sigma_i \end{bmatrix} \qquad (23.5)$$

For finding the extreme value of the DoG in this region, the derivative of D(·) with respect to Δ is set to 0, which gives:

$$\hat{\Delta} = \begin{bmatrix} \hat{x} \\ \hat{y} \\ \hat{\sigma} \end{bmatrix} = -\left.\frac{\partial^2 D(x, y, \sigma)}{\partial (x, y, \sigma)^2}\right|_{x=x_i,\,y=y_i,\,\sigma=\sigma_i}^{-1} \left.\frac{\partial D(x, y, \sigma)}{\partial (x, y, \sigma)}\right|_{x=x_i,\,y=y_i,\,\sigma=\sigma_i} \qquad (23.6)$$

$$D_{\text{extremal}} = D(x_i, y_i, \sigma_i) + \frac{1}{2} \left.\frac{\partial D(x, y, \sigma)}{\partial (x, y, \sigma)}\right|_{x=x_i,\,y=y_i,\,\sigma=\sigma_i}^{T} \hat{\Delta} \qquad (23.7)$$
The position of the key point is updated accordingly. Key points with |D_extremal| < 0.03 are discarded as "low contrast points."
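The refinement of Eqs. 23.6 and 23.7 can be sketched numerically as follows (a NumPy illustration with a toy gradient and Hessian; not the chapter's implementation):

```python
import numpy as np

def refine_keypoint(d_val, grad, hessian, contrast_thresh=0.03):
    """Sub-pixel key point refinement per Eqs. 23.6-23.7: the offset is
    Delta_hat = -H^{-1} g and the interpolated extremum value is
    D + 0.5 * g^T Delta_hat; low-contrast points are then discarded."""
    delta = -np.linalg.solve(hessian, grad)   # Eq. 23.6
    d_extremal = d_val + 0.5 * grad @ delta   # Eq. 23.7
    keep = abs(d_extremal) >= contrast_thresh # contrast test
    return delta, d_extremal, keep

# toy gradient and Hessian of D(x, y, sigma) at a candidate key point
g = np.array([0.01, -0.02, 0.005])
H = np.diag([0.4, 0.5, 0.3])
offset, value, keep = refine_keypoint(0.08, g, H)
```

In practice the gradient and Hessian are estimated by finite differences over the DoG pyramid; here they are given directly to keep the sketch self-contained.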
3. The gradient magnitudes and orientations observed over a small window around
the key point are computed.
4. A small region around the key point is considered. Further, it is divided into
n × n cells where each cell will be of size 4 × 4. A gradient orientation his-
togram is built in each cell (Cheng and Han 2016). Each histogram entry is
weighted by the gradient magnitude and a Gaussian weighting function with
σ = 0.5 times the window width. Each gradient orientation histogram is aligned with the dominant key point orientation obtained in step 3. Once the key points are extracted, an image can be represented as an unordered collection of visual words. Bags of visual words are scale, viewpoint, orientation, and illumination invariant, which makes them suitable for real-time applications.
Figure 23.6 presents the outline of SIFT and BoW for object detection. Bag of words combines the local descriptors into a codebook. With limited scope for variation in the objects, BoW entails K centroids. Each d-dimensional local descriptor is assigned to its nearest centroid. The bag of words is the histogram of these visual-word assignments, generating a K-dimensional vector. This K-dimensional vector is normalized at a later stage; the normalization of the histogram can be done using different distances, with Manhattan and Euclidean distances being frequent choices.
When K is maintained at 4096, a mean average precision of 68.9% is achieved, which surpasses the results obtained by conventional BoW (Qiang et al. 2006; Zheng et al. 2017) on the VEDAI dataset. As K increases, however, it becomes impractical to obtain dimensionality reduction through feature aggregation. A novel model evolved from the feature ontology of HOG with logistic regression and the feature aggregation of SIFT with BoW is proposed as a semantic network model, as shown in Fig. 23.2.
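The nearest-centroid assignment and normalized histogram described above can be sketched as follows (a NumPy illustration with random toy descriptors and codebook; not the chapter's implementation):

```python
import numpy as np

def bow_histogram(descriptors, centroids):
    """Assign each d-dimensional local descriptor to its nearest centroid
    (visual word) and build a normalized K-dimensional histogram."""
    # pairwise squared distances between descriptors and centroids: (n, K)
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)                     # nearest-centroid assignment
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / np.linalg.norm(hist, ord=1)     # L1 (Manhattan) normalization

rng = np.random.default_rng(0)
desc = rng.normal(size=(200, 128))      # 200 SIFT-like descriptors, d = 128
codebook = rng.normal(size=(16, 128))   # K = 16 visual words (toy codebook)
h = bow_histogram(desc, codebook)       # image represented as a 16-D vector
```

In a real pipeline the codebook comes from k-means clustering over training descriptors, and K is far larger (4096 in the experiments cited above); the representation itself is unchanged.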
Conventional methods may fail to give a commanding feature vector, particularly for moving objects. Sliding windows have certainly acknowledged those issues while generating candidate regions. Recently, however, object proposal methods have yielded higher objectness (Bharathi et al. 2012; Hsu et al. 2018) than sliding window methods. Among the different object proposal methods, selective
search (Konoplich et al. 2016; Pawar and Humbe 2015) has outperformed the others on benchmark datasets. The algorithm has shown comparable results on aerial images when applied with certain adaptations (Lowe 2004). This diversified grouping technique, combined with heuristic grouping, has resulted in higher objectness.
Selective search gives s(r1 , r2 ) similarity measure by grouping r1 and r2 as two
segmented regions. The grouping is done based on the weighted combination of
color, texture, and shape and size similarity.
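A hedged sketch of this merge score follows, using the standard selective-search components (colour and texture histogram intersection, size, and fill); the dictionary field names and toy region values are illustrative assumptions, not the chapter's code:

```python
import numpy as np

def enclosing_box(b1, b2):
    """Tight bounding box around two (x0, y0, x1, y1) boxes."""
    return (min(b1[0], b2[0]), min(b1[1], b2[1]),
            max(b1[2], b2[2]), max(b1[3], b2[3]))

def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def region_similarity(r1, r2, weights=(1.0, 1.0, 1.0, 1.0)):
    """Merge score s(r1, r2): weighted combination of colour, texture,
    size, and fill (shape) similarities between two segmented regions."""
    a, b, c, d = weights
    s_colour = np.minimum(r1["colour"], r2["colour"]).sum()   # histogram intersection
    s_texture = np.minimum(r1["texture"], r2["texture"]).sum()
    im_size = r1["im_size"]
    s_size = 1.0 - (r1["size"] + r2["size"]) / im_size        # favour merging small regions
    bbox = enclosing_box(r1["bbox"], r2["bbox"])
    s_fill = 1.0 - (box_area(bbox) - r1["size"] - r2["size"]) / im_size
    return a * s_colour + b * s_texture + c * s_size + d * s_fill

# two adjacent toy regions with identical normalized histograms
r1 = {"colour": np.full(4, 0.25), "texture": np.full(4, 0.25),
      "size": 100, "bbox": (0, 0, 10, 10), "im_size": 10000}
r2 = {"colour": np.full(4, 0.25), "texture": np.full(4, 0.25),
      "size": 100, "bbox": (10, 0, 20, 10), "im_size": 10000}
score = region_similarity(r1, r2)
```

At each iteration, selective search merges the pair with the highest score and recomputes similarities, which is what makes the weighted combination above drive the hierarchical grouping.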
Selective search yields a 99% recall rate on the PASCAL VOC dataset, with a mean average best overlap of 0.868 (Hsu et al. 2018). The performance of selective search drops drastically when applied to the VEDAI 1024 dataset. By tuning the proposal width and segmentation size, the recall value of selective search is marginally increased (Lowe 2004).
The adaptive selective search algorithm is chosen for generating well-articulated
proposals for small instances, as depicted in Fig. 23.3.
These adaptations have made it one of the best performers on non-aerial image
datasets. However, its behaviour on small-sized objects is debatable (Lowe 2004;
Lu et al. 2005). A semantic classification model, as shown in Fig. 23.4, is proposed
to address the shortcomings of the above-mentioned techniques. The output of the semantic
network model along with object proposals is given as input to the CNN.
In recent times, many CNN-based methods have been used in the field of object
detection (Moranduzzo and Melgani 2013; Xu et al. 2016). The objective of this paper
is to devise a framework best suited for databases having small instances of objects.
The level 2 data flow diagram of the proposed CNN-based semantic classifier is
shown in Fig. 23.4. The vehicle detection is boosted in a two-stage approach. In
stage I, the semantic network model (3.0) is derived as a combination of ontological
features (1.1) and aggregated features (0.1) as discussed thoroughly in Sects. 23.3.1
23 Selective Deep Convolutional Framework for Vehicle Detection … 375
and 23.3.2, respectively. Object proposals (2.0) are identified as the unique features
that are fed as input to the convolutional neural network.
The semantic model acts as the relevance feedback to the CNN. The semantic
classifier model is derived based on the CNN and object proposals (4.0). The performance
of the proposed semantic CNN model is compared with CNN-based vehicle
detection. The semantic network features are combined at the
classifier level, giving an effective semantic classifier. This novel semantic classifier
is obtained from ontology and feature aggregation.
The combined semantic network model, when feature-engineered with the object
proposal methods in a deep learning classifier, gives accuracy comparable to
the conventional techniques. The simple architecture does not need GPU computing.
The CNN-based semantic classifier model summary is presented for an input image
of 32×32 pixels (Table 23.3).
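Since Table 23.3 is not reproduced here, the shape of such a model summary can be sketched by tracing feature-map sizes and parameter counts for a 32×32 input. The layer widths below (two 3×3 conv + max-pool stages followed by a 128-unit dense layer) are assumptions for illustration, not the authors' exact architecture.

```python
# Trace feature-map size and parameter count of a small CNN over a 32x32 RGB
# input, the kind of model summarised in Table 23.3. Layer sizes are assumed.

def conv2d(h, w, c_in, c_out, k=3, pad=1, stride=1):
    params = (k * k * c_in + 1) * c_out            # weights + biases
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (w + 2 * pad - k) // stride + 1
    return (h_out, w_out, c_out), params

def maxpool(h, w, c, k=2):
    return (h // k, w // k, c), 0

shape, total = (32, 32, 3), 0
for c_out in (32, 64):                              # two conv + pool stages
    (h, w, c), p = conv2d(*shape, c_out)
    total += p
    shape, _ = maxpool(h, w, c)
(h, w, c) = shape                                   # 8 x 8 x 64 after two poolings
dense_params = h * w * c * 128 + 128                # flatten -> dense(128)
total += dense_params
print(shape, total)
```

Running the trace shows how quickly the dense layer dominates the parameter count, which is why such a small model trains without GPU computing.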
23.5 Database
The experiments use 70% of the database for training, while 20% is kept for
validation and 10% for testing.
The ground-truth objects are used to obtain the overlapping bounding boxes via
the selective search method (Table 23.4).
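The overlap between a selective-search proposal and a ground-truth box is conventionally measured with intersection-over-union (IoU); the box format (x1, y1, x2, y2) below is an assumed convention for illustration.

```python
# Minimal IoU (intersection-over-union) between two axis-aligned boxes,
# the standard criterion for matching proposals to ground-truth objects.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

gt = (10, 10, 50, 50)
proposal = (30, 30, 70, 70)
print(round(iou(gt, proposal), 3))   # overlap 20*20=400, union 3200-400=2800
```

A proposal is typically counted as covering a ground-truth object when its IoU exceeds a threshold such as 0.5.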
23.6 Results
The performance metric used for validation of the algorithms is accuracy.
Table 23.5 shows the accuracy of the semantic classifier using HOG and logistic
regression, of the convolutional neural network, and of the proposed CNN-based
semantic classifier model.
23.7 Conclusion
Conventional dominant object detection paradigms such as HOG and SIFT, with
classifiers such as logistic regression and bag of words, are analyzed. HOG with
logistic regression has created the optimal features required for vehicle representation.
Moreover, applying logistic regression creates ontological features to be used in
the semantic network model. Scale-invariant feature transform with bag of words
has reduced the dimensionality of the feature vector through feature aggregation. A
semantic network model of optimal features is proposed as a conglomeration of
feature ontology and feature aggregation. Object proposal methods are identified as
the suitable feature representation techniques for small-instance objects in the VEDAI
1024 database. Selective search, tuned for minimum and maximum proposal size and
minimum box width, facilitates the best feature representation for small-instance
objects in the VEDAI 1024 database. Object proposals along with CNN yield a
detection accuracy of 95.8%. A CNN-based semantic classifier model is proposed
which accepts selective search object proposals as input features, while semantic
network features act as relevance feedback, boosting the detection accuracy to
98.71% and keeping the average detection accuracy at 96.5%.
References
Ajmal A, Hussain IM (2010, March) Vehicle detection using morphological image processing tech-
nique. In: 2010 International conference on multimedia computing and information technology
(MCIT). IEEE, pp 65–68
Alexe B, Deselaers T, Ferrari V (2010, June) What is an object? In: 2010 IEEE computer society
conference on computer vision and pattern recognition. IEEE, pp 73–80
Bharathi TK, Yuvaraj S, Steffi DS, Perumal SK (2012, December) Vehicle detection in aerial surveil-
lance using morphological shared-pixels neural (MSPN) networks. In: 2012 Fourth international
conference on advanced computing (ICoAC). IEEE, pp 1–8
Chen X, Xiang S, Liu CL, Pan CH (2014) Vehicle detection in satellite images by hybrid deep
convolutional neural networks. IEEE Geosci Remote Sens Lett 11(10):1797–1801
Cheng G, Han J (2016) A survey on object detection in optical remote sensing images. ISPRS J
Photogrammetry Remote Sens 117:11–28
Cucchiara R, Piccardi M, Mello P (2000) Image analysis and rule-based reasoning for a traffic
monitoring system. IEEE Trans Intell Transp Syst 1:119–130
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings
of the 2005 IEEE computer society conference on computer vision and pattern recognition, San
Diego, CA, USA, vol 1, pp 886–893
Dhanaraj M, Sharma M, Sarkar T, Karnam S, Chachlakis D, Ptucha R, Markopoulos PP, Saber E
(2020, April) Vehicle detection from multi-modal aerial imagery using YOLOv3 with mid-level
fusion (conference presentation). In: Big data II: learning, analytics, and applications, vol 11395.
International Society for Optics and Photonics, p 1139506
Girshick R (2015) Fast R-CNN. In: Proceedings of the IEEE international conference on computer
vision, pp 1440–1448
Hosang J, Benenson R, Dollár P, Schiele B (2015) What makes for effective detection proposals?
IEEE Trans Pattern Anal Mach Intell 38(4):814–830
Hsu SC, Huang CL, Chuang CH (2018, January) Vehicle detection using simplified fast R-CNN.
In: 2018 International workshop on advanced image technology (IWAIT). IEEE, pp 1–3
Konoplich GV, Putin EO, Filchenkov AA (2016, May) Application of deep learning to the problem of
vehicle detection in UAV images. In: 2016 XIX IEEE International conference on soft computing
and measurements (SCM). IEEE, pp 4–6
Lowe DG (2004) Distinctive image features from scale-invariant key points. Int J Comput Vis 60:91.
https://doi.org/10.1023/B:VISI.0000029664.99615.94
Lu M, Wevers K, Van Der Heijden R (2005) Technical feasibility of advanced driver assistance
systems (ADAS) for road traffic safety. Transp Plan Technol 28(3):167–187
Moranduzzo T, Melgani F (2013, July) Comparison of different feature detectors and descriptors
for car classification in UAV images. In: 2013 IEEE International geoscience and remote sensing
symposium-IGARSS. IEEE, pp 204–207
Pawar BD, Humbe VT (2015) Morphology based composite method for vehicle detection from
high resolution aerial imagery. VNSGU J Sci Technol 4(1):50–56
Qiang Z, Mei-Chen Y, Cheng K-T (2006) Fast human detection using a cascade of histograms of
oriented gradients. Comput Vis Pattern Recognit 1491–1498
Rabiu H et al (2013) Vehicle detection and classification for cluttered urban intersection. Int J
Comput Sci Eng Appl (IJCSEA) 3(1)
Razakarivony S, Jurie F (2016) Vehicle detection in aerial imagery: a small target detection bench-
mark. J Vis Commun Image Representation 34:187–203
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with
region proposal networks. In: Advances in neural information processing systems, pp 91–99
Sakhare KV, Tewari T, Vyas V (2020) Review of vehicle detection systems in advanced driver
assistant systems. Arch Comput Methods Eng 27(2):591–610
Sommer LW, Schuchert T, Beyerer J (2017, March) Fast deep vehicle detection in aerial images. In:
2017 IEEE Winter conference on applications of computer vision (WACV). IEEE, pp 311–319
Tayara H, Soo KG, Chong KT (2017) Vehicle detection and counting in high-resolution aerial
images using convolutional regression neural network. IEEE Access 6:2220–2230
Tewari T, Sakhare KV, Vyas V (2019) Vehicle detection in aerial images using selective search with
a simple deep learning based combination classifier. In: Proceedings of the third international
conference on microelectronics, computing and communication systems. Springer, Singapore,
pp 221–233
Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recog-
nition. Int J Comput Vis 104(2):154–171
Van de Sande KE, Uijlings JR, Gevers T, Smeulders AW (2011, November) Segmentation as
selective search for object recognition. In: 2011 International conference on computer vision.
IEEE, pp 1879–1886
Villon S et al (2016) Coral reef fish detection and recognition in underwater videos by supervised
machine learning: comparison between deep learning and HOG + SVM methods. In: Advanced
concepts for intelligent vision systems, ACIVS
Xu Y, Yu G, Wang Y, Wu X, Ma Y (2016) A hybrid vehicle detection method based on Viola–Jones
and HOG + SVM from UAV images. Sensors 16(8):1325
Yang Z, Pun-Cheng LSC (2018) Vehicle detection in intelligent transportation systems and its
applications under varying environments: a review. J Image Vis Comput 143–154
Zhang L, Lin L, Liang X, He K (2016, October) Is faster R-CNN doing well for pedestrian detection?
In: European conference on computer vision. Springer, Cham, pp 443–457
Zheng H (2006, July) Automatic vehicles detection from high resolution satellite imagery using
morphological neural networks. In: Proceedings of the 10th WSEAS international conference on
computers, Vouliagmeni, Athens, Greece, vol 13, p 608
Zheng L, Yang Y, Tian Q (2017) SIFT meets CNN: a decade survey of instance retrieval. IEEE
Trans Pattern Anal Mach Intell 40(5):1224–1244
Chapter 24
Exploring Source Separation as
a Countermeasure for Voice Conversion
Spoofing Attack
24.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 383
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_24
384 R. Hemavathi et al.
ing is a major threat to verification systems, as it leads to an increase in the false
alarm rate (FAR), i.e., the imposter is falsely accepted as a genuine speaker.
Spoofing attacks for ASV systems include replay attack (Wu et al. 2014), imper-
sonation (Hautamaki et al. 2015), voice conversion (Chen et al. 2014), and speech
synthesis (Masuko et al. 1997). Impersonation refers to human voice mimicking.
It is observed that humans can efficiently mimic speakers with similar voice char-
acteristics, and impersonating an arbitrary speaker is more challenging (Lau et al.
2004). Replay attack is a scenario where the attacker has a digital copy of an original
target speaker's utterance and replays it using a playback device to claim a false
identity to the ASV system. Voice conversion refers to transforming the identity of
the source (imposter) speaker's speech to that of the target (intended) speaker's
without altering the linguistic content. In speech synthesis, unit selection (Masuko
et al. 1997) and statistical approaches (Masuko et al. 1996) are used to generate
natural-sounding speech with a specific speaker's voice characteristics.
Among all the spoofing attacks on ASV systems, voice conversion and speech
synthesis (SS) attacks gain more attention: impersonation cannot be applied at
large scale, and even though replay attacks can be accomplished easily, they pose a
threat mainly to text-dependent ASV systems. The availability of open-source
software and the lack of effective countermeasures make voice conversion and
speech synthesis genuine, high-risk attacks.
There are two approaches to overcoming spoofing attacks. The first is to build a
more robust ASV system; this does not seem to work well, as state-of-the-art ASV
systems based on i-vectors, Gaussian mixture models, and hidden Markov models
are also vulnerable to spoofing (Wu et al. 2016). The second is to build a
countermeasure, which decides whether the input speech is spoofed or natural.
This work aims to build an efficient countermeasure based on source separation for
voice conversion spoofing attack.
Voice conversion (VC) is a technique where the identity of the imposter speaker's
speech is transformed to that of the intended speaker's speech. Initially, the source
and filter parameters of the target and imposter speakers' speech are extracted. As
the durations of the imposter and target speech features may differ, dynamic time
warping is performed to align them. The imposter speaker's features are then
transformed to those of the target using VC algorithms such as parametric
approaches (Toda et al. 2007), frequency warping (Daniel et al. 2010), and artificial
neural network-based techniques (Desai et al. 2009; Wu et al. 2015). The converted
features are synthesized to obtain spoofed speech.
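The alignment step mentioned above can be illustrated with the textbook dynamic time warping recurrence. Real VC systems align multi-dimensional spectral frames (e.g. MFCCs); the 1-D version below is a simplification to show the idea.

```python
# Textbook DTW: accumulate the cheapest monotone alignment cost between two
# feature sequences of possibly different lengths (1-D features for clarity).

def dtw_cost(x, y):
    n, m = len(x), len(y)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the best of: insertion, deletion, or match.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))   # 0.0: the repeated frame aligns at no cost
```

Backtracking through the same table yields the frame-to-frame alignment used to pair imposter and target features before training the conversion function.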
The vulnerability of ASV systems to voice conversion spoofing is studied for the
Gaussian mixture model (GMM) system in Pellom and Hansen (1999) and the
GMM-universal background model (UBM) system in Bonastre et al. (2007). These
studies showed an increase in FAR from around 10 to 40%. The advanced ASV systems based on
24 Exploring Source Separation as a Countermeasure for Voice … 385
joint factor analysis (JFA), i-vectors, and probabilistic linear discriminant analysis
(PLDA) are also vulnerable to the VC spoofing attack, with FARs increasing from
3 to 17%.
To overcome spoofing attacks on ASV systems, various countermeasures have been
proposed. An efficient countermeasure should differentiate between natural and
synthetic (or VC) speech and hence reduce the FAR. To detect VC and SS attacks,
spectro-temporal features derived from local binary patterns were employed in
Alegre et al. (2013b). Phase and modified group delay features are exploited to
detect VC spoofing in Wu et al. (2012). Even though phase-based features
efficiently detect VC attacks, their performance on unknown attacks remains
challenging. A relative phase shift feature was proposed in Sanchez et al. (2015) to
detect SS spoofing using minimum phase. In Alegre et al. (2013a), the average
pair-wise distance (PWD) between consecutive feature vectors was employed to
detect VC speech.
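As a concrete illustration of the PWD idea, the score is simply the mean distance between consecutive feature vectors; over-smoothed converted speech tends to produce lower values than natural speech. The frame layout below is an assumed convention.

```python
# Average pair-wise distance (PWD) between consecutive feature vectors,
# in the spirit of the countermeasure of Alegre et al. (2013a).
import numpy as np

def avg_pairwise_distance(features):
    """features: (n_frames, dim) array of per-frame feature vectors."""
    diffs = np.diff(features, axis=0)            # consecutive frame differences
    return float(np.linalg.norm(diffs, axis=1).mean())

frames = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
print(avg_pairwise_distance(frames))             # (5.0 + 0.0) / 2 = 2.5
```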
In this paper, we propose a countermeasure which can detect voice conversion
spoofed speech. The countermeasure is built by combining an unsupervised
co-channel speech separation algorithm based on non-negative matrix factorization,
used as a front end, with a CNN-based binary classifier. We also propose to model
voice conversion spoofed speech as an instantaneous mixture of an estimate of the
target speech and the artifacts introduced by voice conversion. The efficiency of the
proposed countermeasure is evaluated using a CNN-based automatic speaker
verification system on the Voice Conversion Challenge 2016 dataset. The proposed
countermeasure is also validated on the noisy speech database NOIZEUS.
Rest of the paper is organized as follows: Sect. 24.3 gives the motivation for
exploring speech separation as a countermeasure, Sect. 24.4 gives the overview of
proposed system, Sect. 24.5 gives experimental results, and Sect. 24.6 concludes the
paper.
24.3 Motivation
This section gives the motivation for the present study. Voice conversion algorithms
mainly focus on transforming the spectral content of the source to that of the target;
hence, a lot of similarity is observed between target and spoofed speech. In this
study, instead of processing the signals directly, the speech is pre-processed by a
speech separation block in which the artifacts introduced during voice conversion
are separated out.
The main motivation for applying the source separation algorithm is shown in Fig.
24.1. Since voice conversion introduces artifacts in the resultant speech (Patel and
Patil 2017; Wu et al. 2016), processing the VC spoofed speech with source
separation yields estimates of the target speech and of the artifacts. The major issue
is that the artifact estimate should be distinct from other noises and generalizable
across all VC algorithms.
Hence, a study was conducted in which speech separation was applied to clean
speech, spoofed speech, and speech degraded with different noises. Figure 24.1
shows the cochleagram plots of artifact estimates obtained for clean speech from
Fig. 24.1 Cochleagram plots of artifact estimates obtained using the source separation stage for
a–c clean speech from the VCC-16, TIMIT, and ASV-15 databases, d–f spoofed speech from
participant submissions J, L, and M, respectively, from the VCC-16 database, and g–i speech
degraded with street noise, reverberation, and babble noise, respectively
the VCC-2016 dataset, TIMIT, and the ASV-15 database (Alegre et al. 2013a)
(Fig. 24.1a–c), and for voice conversion spoofed speech from VC algorithms J, L,
and M from the VCC-2016 dataset (Fig. 24.1d–f, respectively). The details of all
the VC algorithms in VCC-16 are discussed in Table 24.2. Further, to show the
difference between the artifacts introduced by voice conversion and other
background noises, the artifact estimates of speech signals degraded with street
noise, reverberation, and babble noise are shown in Fig. 24.1g–i, respectively. The
figure shows that the artifact estimates of VC spoofed speech are unique and
distinguishable from clean speech and from speech degraded with other background
noises and artifacts. This is the major motivation for the present study.
The schematic of the proposed system is given in Fig. 24.2. The automatic speaker
verification (ASV) system is built using a convolutional neural network (CNN) and
trained using Mel-spectrogram images of target speech. In the testing phase, the test
signal Stest is first given to the proposed countermeasure to classify it as spoofed or
natural speech. The countermeasure is built by combining a speech separation block
and a CNN-based binary classifier. The speech separation block separates the input
speech signal Stest into an estimate of the target Ŝtarget and the artifact αvc
introduced by voice conversion.
As voice conversion includes various stages of processing, artifacts are introduced
into the converted speech (Patel and Patil 2017; Wu et al. 2016). The
countermeasure we propose relies mainly on the artifacts introduced by the voice
conversion algorithm. Figure 24.1 shows that these artifacts are uniquely
characterized by discontinuous formants in the high-frequency region. Based on
this, we propose to model the voice-converted speech as an instantaneous mixture
of an estimate of the target Ŝtarget (t) and the artifact αvc introduced during VC.
Source separation algorithms separate a mixed speech signal into two independent
signals. Two state-of-the-art approaches to unsupervised co-channel speech
separation (USCSS) are computational auditory scene analysis (CASA) and
non-negative matrix factorization (NMF). A comparative study in Hemavathi
and Swamy (2018) showed that the NMF-based approach performs better, as
CASA-based speech separation itself introduces additional artifacts. Hence, in this
work an NMF-based source separation approach, 2D Itakura–Saito non-negative
matrix factorization (ISNMF) (Gao et al. 2013), is used.
Initially, S_TF, the time-frequency (TF) representation of the input speech signal, is
decomposed into two matrices: D, a set of spectral basis vectors, and H, an
encoding matrix that holds the amplitude of each basis vector at each time point.
Each element of |S_TF|^0.2 is given by
|S_TF(f, t_s)|^0.2 = Σ_{i=1}^{I} Σ_{τ=0}^{τ_max} Σ_{φ=0}^{φ_max} D^τ_{f−φ, i} H^φ_{i, t_s−τ}    (24.2)

where the index shifts f−φ and t_s−τ realize the downward shift by φ rows and the
right shift by τ columns (written with vertical and horizontal arrow operators in the
original notation). Estimates of D and H are obtained using the quasi-EM
IS-NMF2D algorithm (Gao et al. 2013). The objective is to find the estimates of the
target and the artifacts, i.e., {|X_i(f, t_s)|^0.2}, i = 1, …, I, where

|X̃_i(f, t_s)|^0.2 = Σ_{τ=0}^{τ_max} Σ_{φ=0}^{φ_max} D^τ_{f−φ, i} H^φ_{i, t_s−τ}    (24.3)
Fig. 24.3 Mel-spectrogram representation of a clean speech and b VC spoofed speech from the
baseline of the VCC-16 database; c–d speech estimates and e–f artifact estimates obtained from
natural and spoofed speech, respectively, using source separation
The binary classifier is built using the VGG-16 convolutional neural network (CNN)
model (Simonyan and Zisserman 2014) to classify the input as natural or spoofed
speech. The CNN is trained using the Mel-spectrogram images of αvc for natural
and spoofed speech. A Mel-spectrogram is obtained by applying the Mel scale to a
linear spectrogram and gives the magnitude of the TF bins. The Mel-frequency m
corresponding to frequency f (Hz) can be obtained using (24.4):

m = 2595 · log10(1 + f / 700)    (24.4)
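Equation (24.4) translates directly into code; at f = 700 Hz it gives 2595·log10(2) ≈ 781.2 mel.

```python
# Hz -> Mel conversion, a direct implementation of (24.4).
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

print(round(hz_to_mel(700.0), 1))
```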
24.5.1 Dataset
In this work, the ASV is built for five target classes; 500 files were used for training
and 310 for evaluation. The training and validation plot of the ASV system is shown
in Fig. 24.4b. To train the CNN-based binary classifier, 500 speech files of clean and
spoofed speech are taken. For the spoofed training set, 200 speech files from the
baseline (BL) with mean opinion score (MOS) 1.5, and 150 each from VC
algorithms M and L with MOS 1.9 and 2.9, respectively, from the VCC-16 dataset
were used. Mean opinion scores are taken from Toda et al. (2016); a high MOS
indicates more naturalness in the converted speech. The training and validation plot
of the proposed countermeasure is shown in Fig. 24.4a. Here, FAR indicates
spoofed speech classified as natural speech.
24.5.3 Results
Table 24.2 gives the experimental results: the performance of the proposed
countermeasure (CM), and of the ASV system with and without the proposed
countermeasure, in terms of false alarm rate (FAR) and equal error rate (EER). The
vulnerability of the ASV system is tested on 500 samples (100 from each class)
from the VCC-16 database.
The proposed system relies mainly on the artifact estimate; hence, the system's
performance in noisy environments is a major concern. To show the efficiency of
the proposed system in speaker-independent and noisy environments, validation is
done on the NOIZEUS database (Hu and Loizou 2007). This database consists of 30 IEEE
Fig. 24.4 Training and validation plots of a proposed countermeasure and b CNN-based automatic
speaker verification system
Fig. 24.5 Validation of the proposed countermeasure for airport, babble, car, exhibition, restaurant,
train-station, and street noises
sentences spoken by six speakers, corrupted by seven real-world noises (airport,
babble, car, exhibition, restaurant, train-station, and street, from the AURORA
database) at 0 dB. The validation plot of the proposed countermeasure for the
NOIZEUS database is shown in Fig. 24.5. For all seven noises at all dB levels, the
proposed countermeasure gives excellent performance. The reason is the uniqueness
of the artifact introduced by voice conversion algorithms, as indicated in Fig. 24.1.
24.5.5 Discussion
Table 24.2 gives a brief description of the voice conversion techniques used by all
17 participant submissions. The proposed countermeasure gives the best result for
the baseline, i.e., a FAR of 0.2%, as it has more artifacts and a lower MOS, and the
weakest result, a FAR of 3%, for dataset L, which has a high MOS. The ASV
system is vulnerable to all spoofing attacks; the lowest FAR, 18.80%, is observed
for the baseline, and algorithms with high similarity and MOS successfully raise the
FAR by more than 50%. By combining the countermeasure with the ASV, the FAR
is found to decrease from 60.8 to 2.8%. The efficiency of the proposed algorithm
can be seen for the three known attacks (used for training the countermeasure): for
the baseline and participant submissions L and M, the highest FAR reported is
0.4%, and for the 15 other unknown attacks the highest FAR observed is 2.8%, for
participant submission N, which has an MOS of 3. Based on these results, the
proposed countermeasure can be considered a reliable countermeasure against the
voice conversion spoofing attack. Even though noisy speech was not used while
training the binary classifier, the validation plots for seven different noises show that
the system performs well in noisy environments too.
24.6 Conclusion
Acknowledgements The first author would like to thank the Women Scientist Scheme-A, Department
of Science and Technology (WOS-A DST), Government of India for providing financial assistance
vide reference number SR/WOS-A/ET-69/2016.
References
Wu Z, Chng ES, Li H (2012) Detecting converted speech and natural speech for anti-spoofing attack
in speaker recognition. In: INTERSPEECH
Wu Z, Gao S, Cling ES, Li H (2014) A study on replay attack and anti-spoofing for text-dependent
speaker verification. In: Signal and information processing association annual summit and con-
ference (APSIPA), 2014 Asia-Pacific, Dec 2014, pp 1–5
Wu Z, Nicholas E, Tomi K, Junichi Y, Federico A, Haizhou L (2015) Spoofing and counter measures
for speaker verification: a survey. Speech Commun 66:130–153
Wu Z, De Leon PL, Demiroglu C, Khodabakhsh A, King S, Ling ZH, Saito D, Stewart B, Toda T,
Wester M, Yamagishi J (2016) Anti-spoofing for text-independent speaker verification: an initial
database, comparison of countermeasures, and human performance. IEEE/ACM Trans Audio
Speech Lang Process 24(4):768–783
Chapter 25
Statistical Prediction of Facial Emotions
Using Mini Xception CNN and Time
Series Analysis
Abstract The growing era of facial recognition has opened a large area of computational
study. Facial emotion recognition has always been a challenging task in the
field of deep learning. In this work, we propose an approach to not only study
human facial emotion, but also predict a person's emotions by collecting the same
personal data over time. One part of the article describes the usage of a CNN for
detecting facial emotions: it takes in real-time video frames and predicts the
probabilities of the seven basic emotion states. The output of the CNN model serves
as the input for the time-series analysis model, which accomplishes the task of
predicting future emotions. This two-step hierarchical structure helps in studying
human behaviour to predict future outcomes. Finally, the model can be used for
continuous monitoring and prediction of a person's behaviour given constant
emotional parameters. This work can be used in interrogatory procedures or as a
preventive measure by collecting a convict's facial data.
25.1 Introduction
Reading facial emotion is not a difficult task for a human being, but doing the same
using machine learning or neural networks is challenging. Facial expressions
represent more than 55% (Mehrabian 2017) of one's emotions. Advancements in
machine learning and neural networks have made applications of emotional
analysis available to the public. Possibly, the study of facial expressions started
from the work of Clark et al. (2020). A lot of work has been carried out in the field
of computer vision, and it is able to provide satisfactory results (Semwal et al. 2017).
According to Shaver et al. (1987), there are seven basic emotions into which one's
expression can be classified: happy, sad, neutral, surprise, anger, fear and disgust.
Studying these personal emotions and predicting future emotions can help predict
one's mindset. This work focuses not only on studying facial emotions, but also on
studying one's emotions over a period of time and predicting future behaviour
given the same emotional parameters. For the initial part of the work, Haar cascade
(Cuimei et al. 2017) and Mini Xception CNN models were used, and for the later
part the FB Prophet time-series analysis model is used for future prediction.
The Haar cascade model helps detect the face in a given video frame using Haar
features (Kaehler and Bradski 2016). By studying different regions of the picture
one by one, it creates a bounding box. Since in a Haar cascade model weights are
assigned manually, training is very fast, and results are good especially for still,
front-facing images. After the bounding box is created, the Mini Xception CNN
model is used for emotion recognition. The model predicts whether the emotion of
the detected face is happy, sad, neutral, surprise, anger, fear or disgust (Cuimei
et al. 2017). The data from this study is saved over a period of time and then passed
in as a CSV file, which serves as input for the FB Prophet model. Studying the data,
it forecasts the emotions of the subject for an upcoming time period with
parameters remaining constant.
This work presents a broad survey of the usage of these three models together. All
the major problems were handled properly to obtain the results, with proper
pre-processing of the data used. It also discusses the challenges that occurred and
how they were handled. The key focus is on the usage of a time-series model for
predicting future emotional behaviour.
While CNNs have revolutionised the study of image processing (Behera et al. 2020a, b),
commonly used CNN models rely on fully connected layers as the basis for feature
extraction (Arriaga et al. 2019). In recent models such as Inception V3 (Szegedy et al.
2015), rather than focussing on the fully connected layer towards the end, much of the
feature extraction is done in the global average pooling layer. This layer forces the
network to extract global features from the input image. The most recent of these
architectures is the Xception CNN (Chollet 2017) model, which is used in this work.
It combines two of the most successful experimental assumptions, i.e. residual
models (He et al. 2016) and depth-wise separable convolutions (Howard et al. 2017).
The depth-wise separable convolutions are the basis of the reduction in the number of
parameters used, achieved by separating the processes of feature extraction and
combination within the convolutional layers.
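The parameter saving can be checked with a back-of-the-envelope calculation: a k×k standard convolution needs k·k·Cin·Cout weights, while the separable version needs k·k·Cin (depthwise) plus Cin·Cout (pointwise). The channel counts below are arbitrary illustrative values.

```python
# Parameter comparison: standard vs. depth-wise separable convolution
# (biases omitted for clarity).

def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # depthwise k x k per input channel, then 1x1 pointwise mixing
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)       # 73728
sep = separable_conv_params(k, c_in, c_out)      # 8768
print(std, sep, round(std / sep, 1))             # roughly 8x fewer parameters
```

This is why architectures built from separable convolutions, such as Mini Xception, stay small enough for real-time frame-by-frame inference.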
The study of facial emotions was just the first part of this work. The innovation is
the usage of time-series analysis for predicting the trends of a particular emotion.
Predicting the future by collecting data over a period and finding the trend in its
variation can be well solved using the Prophet tool made available by Facebook
25 Statistical Prediction of Facial Emotions Using Mini Xception … 399
(Polusmak 2017). The Prophet tool was introduced for creating high-quality business
forecasts. This study of numbers over time is used here to predict the state of one's
emotions by studying them over a period of time. The input to the FB Prophet
model in this work is the statistical analysis of a particular emotion over a period of
time obtained using the Xception CNN model.
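As a sketch of the data hand-off described above, per-frame emotion probabilities can be aggregated into the two-column (ds, y) frame that Prophet expects. The column names, sampling interval, and daily aggregation below are assumptions about the pipeline, not the authors' code.

```python
# Aggregate toy per-frame CNN output (timestamp + probability of "happy")
# into the ds/y format required by FB Prophet.
import pandas as pd

raw = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=6, freq="12h"),
    "happy": [0.2, 0.4, 0.6, 0.8, 0.5, 0.7],
})

# Prophet requires columns named exactly 'ds' (datestamp) and 'y' (value).
daily = (raw.set_index("timestamp")["happy"]
            .resample("D").mean()
            .reset_index()
            .rename(columns={"timestamp": "ds", "happy": "y"}))
print(daily["y"].tolist())

# Then, assuming the prophet package is installed:
#   from prophet import Prophet
#   m = Prophet().fit(daily)
#   forecast = m.predict(m.make_future_dataframe(periods=7))
```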
Sun et al. (2020) proposed a robust vectorized convolutional neural network (CNN)
model for extracting features in the regions of interest (ROIs) of the face. The
attention concept was adopted in the first layer of the neural network to perform
ROI-related convolution calculations, and the contribution of specific fields in the
ROIs is increased by extracting more robust features. Comprehensive comparative
experiments and cross-database experiments are conducted to verify the validity
and robustness of the proposed model.
Choi and Song (2020) proposed a two-dimensional (2D) landmark feature map
(LFM) for effectively recognising facial micro-expressions (FMEs). The proposed
LFM is obtained by transforming conventional coordinate-based landmark
information into 2D image information, and is designed to have the advantageous
property of being independent of the intensity of facial expression change.
Alam et al. (2019) propose an IoMT-based emotion recognition system for affective
state mining. Human psychophysiological observations are collected through
electromyography (EMG), electro-dermal activity (EDA) and electrocardiogram
(ECG) medical sensors and analysed through a deep convolutional neural network
(CNN) to determine the covert affective state. They performed an experimental
study, and a benchmark dataset was used to analyse the performance of the
proposed method.
In this process, the face is first located in each frame/image before the emotion recognition algorithm is applied. For this, the Haar cascade classifier (Cuimei et al. 2017) is used, a machine learning (ML) algorithm for detecting objects in a given frame/image. This model is well suited to frontal face detection, and its speed allows a face to be detected in every frame of the video. Training the classifier requires a large number of positive and negative images (Tutorials 2020): positive images are the ones we want the classifier to identify, and negative images are images of everything else.
400 B. Behera et al.
Fig. 25.1 Haar features used to extract features from images (Kaehler and Bradski 2016)
Haar features, as shown in Fig. 25.1 (Cuimei et al. 2017), are used to extract features from the images. Each feature yields a single value, calculated as the difference between the sum of the pixels under the white rectangle and the sum of the pixels under the black rectangle
(Tutorials 2020). Rectangle features can be calculated using integral images (an intermediate representation of the image). The integral image at a point (x, y) is the sum of the pixel values to its left and above (https://www.researchgate.net/publication/3940582-Rapid-Object-Detection-using-Boosted-Cascade-of-Simple-Features).
ii(x, y) = Σ_{x′≤x, y′≤y} i(x′, y′) (25.1)
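As an illustration, the integral image of Eq. (25.1) and the constant-time rectangle sum it enables can be sketched in a few lines of NumPy (the image values and the rectangle chosen below are arbitrary):

```python
import numpy as np

# ii(x, y) = sum of all pixels above and to the left of (x, y), Eq. (25.1).
def integral_image(img):
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x0, y0, x1, y1):
    # Sum of pixels in the inclusive rectangle [x0..x1] x [y0..y1],
    # computed from at most four lookups regardless of rectangle size.
    total = ii[y1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 3, 2) == img[1:3, 1:4].sum()  # both give 48
```

This is why Haar feature values (white-rectangle sum minus black-rectangle sum) can be evaluated in constant time once the integral image is built.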
As reported in the literature, most of the area inside an image is non-face region. So the first check is whether a window is a non-face region; if it is, the window is discarded without further processing, and the search moves on to the next window. Instead of applying all features to a window at once, the features are grouped, and each group is applied in a separate stage. The initial stages contain few features, and the number of features increases in later stages (Tutorials 2020). A window is processed by a stage only if it has passed the previous stage; if the window passes all stages, it is a face region.
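The staged rejection described above can be sketched as follows; the stages and threshold tests here are illustrative placeholders, not real Haar features:

```python
# Toy sketch of the attentional cascade: each stage is a list of boolean
# feature tests, and a window reaches a later stage only if it passes every
# test of the current stage, so most non-face windows are rejected cheaply.
def is_face_region(window, stages):
    for stage in stages:
        if not all(test(window) for test in stage):
            return False  # rejected; remaining stages are never evaluated
    return True

# Example: three stages with 2, 5 and 10 tests of growing cost.
stages = [
    [lambda w: sum(w) > 10] * 2,
    [lambda w: max(w) < 100] * 5,
    [lambda w: len(w) == 4] * 10,
]
print(is_face_region([20, 30, 10, 5], stages))  # True: passes all stages
print(is_face_region([1, 2, 3, 4], stages))     # False: rejected at stage 1
```

Most windows therefore pay only the cost of the first, cheapest stage, which is what makes per-frame detection fast.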
Once the frontal face has been detected, the image's emotion must be classified into one of the seven classes considered. For finding the specific emotion, the Xception CNN model is used. This architecture is small and performs well in emotion classification. The Xception CNN architecture (Fig. 25.3) differs from the normal CNN model (Fig. 25.2) in that normal CNN architectures use fully connected layers at the end, where most of the parameters reside, together with standard convolutions. The Xception CNN architecture instead uses residual modules and depth-wise separable convolutions. Residual modules modify the expected mapping of subsequent layers: the learned features become the difference between the desired features and the original feature map.
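The residual idea can be illustrated with a minimal NumPy sketch, where the function f stands in for the convolutional layers of the block (an assumption made for brevity):

```python
import numpy as np

def residual_block(x, f):
    # The block outputs f(x) + x, so the layers only have to learn the
    # residual f(x) = desired(x) - x rather than the full mapping.
    return f(x) + x

x = np.array([1.0, 2.0, 3.0])
out = residual_block(x, lambda v: 0.1 * v)  # toy stand-in for conv layers
print(out)
```

The identity shortcut is what lets the network learn only the difference between the desired features and the original feature map, as stated above.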
The depth-wise separable convolutions are a combination of two different layers:
• Depth-wise convolutions.
• Pointwise convolutions.
These layers separate the channel cross-correlations from the spatial cross-correlations. To do this, a D × D filter is first applied to each of the M input channels. Then, N convolution filters of size 1 × 1 × M are applied to combine the M input channels into N output channels. Each value in the feature map is combined by applying the 1 × 1 × M convolutions without considering spatial relationships within the channel. Depth-wise separable convolutions thereby reduce the computation required. How efficient depth-wise separable convolutions are compared to standard convolutions can be seen from the number of calculations involved
in each of them. In a standard convolution, an input feature map of size

Df × Df × M (25.2)

is convolved with N filters of size Dk × Dk × M to produce an output of size Dp × Dp × N, which requires

N × Dp² × Dk² × M (25.5)

multiplications. In a depth-wise separable convolution, each of the M input channels is first convolved with its own depth-wise filter of size

Dk × Dk × 1 (25.6)

producing an intermediate feature map of size

Dp × Dp × M (25.7)

at a cost of

M × Dk² × Dp² (25.8)

multiplications. The point-wise 1 × 1 × M convolutions then combine these channels into an output of size

Dp × Dp × N (25.9)

at a cost of

M × Dp² × N (25.10)

multiplications, giving a total of

M × Dp² × (Dk² + N) (25.11)

multiplications. The ratio of the depth-wise separable cost to the standard cost is therefore

1/N + 1/Dk² (25.12)
From here, we can see that depth-wise separable convolutions require far fewer computations than standard convolutions. Figure 25.4 (Shaver et al. 1987) shows the difference between the architecture of standard convolutions and that of depth-wise separable convolutions.
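The saving can be checked numerically. The sketch below plugs illustrative layer sizes (not taken from the chapter) into the cost expressions above:

```python
# Dk: kernel size, Dp: output feature-map size, M/N: input/output channels.
def standard_cost(Dk, Dp, M, N):
    return N * Dp**2 * Dk**2 * M                # Eq. (25.5)

def separable_cost(Dk, Dp, M, N):
    depthwise = M * Dk**2 * Dp**2                # Eq. (25.8)
    pointwise = M * Dp**2 * N                    # Eq. (25.10)
    return depthwise + pointwise                 # Eq. (25.11)

Dk, Dp, M, N = 3, 56, 64, 128
ratio = separable_cost(Dk, Dp, M, N) / standard_cost(Dk, Dp, M, N)
assert abs(ratio - (1 / N + 1 / Dk**2)) < 1e-12  # matches Eq. (25.12)
print(round(ratio, 3))  # about 0.119: roughly 8x fewer multiplications
```

For a 3 × 3 kernel the ratio is dominated by the 1/Dk² term, so the separable form needs roughly one-ninth of the multiplications.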
Time-series analysis (TSA) is a way to analyse time-series data and extract useful information from it. Time-series data is simply a series of data points arranged by time periods or intervals (Adhikari and Agrawal 2013). Accurate prediction of future values can be extremely valuable. TSA is already used in many areas, such as economic and sales forecasting, stock market analysis, budgetary analysis and census analysis, and produces satisfactory results. In this article, TSA is applied in the field of medical and security systems. The aim of time-series analysis is to develop a mathematical model and then use that model to predict future patterns. The objective of TSA in our model is to identify the nature of the emotions and then forecast or predict their future values.
FB Prophet is used for predicting future emotions. It is an open-source forecasting tool provided by Facebook, available in both R and Python. Prophet uses three main components: trend, holidays and seasonality. The trend component fits a piecewise logistic or linear growth curve to non-periodic variations in the time series, as in our case; Prophet implements two trend models, a saturating growth model and a piecewise linear model (Taylor and Letham 2018). This work does not deal with the holiday and seasonality effects in our model. Prophet frames forecasting as a curve-fitting problem rather than modelling the time-based dependence of each reading in the input time series. It is robust to outliers and dramatic shifts in the trend and typically handles missing data well. We provide the CSV file obtained from the previous part as the input. The facial readings were taken around 4 times daily for 3 days, giving a total of 15,000 input data points. The input to Prophet must have exactly two columns: the date-time stamp as 'ds' and the recorded value (an emotion, in our case) as 'y' (Adhikari and Agrawal 2013). We predict at minute frequency for a period of 60 (i.e. 1 min × 60 = 60 min, or one hour). Based on the trends present in the input, the model predicts the emotion for the next hour.
The web camera served as the input for our model, which worked on real-time video. The video taken from the camera was converted into frame sequences, where each frame served as a single image input to the model. Figure 25.5 shows the output of the face detection model, with a bounding box built around the face. Use of the Haar cascade model enabled fast frontal face detection. The square frame around the face in Fig. 25.5 is the bounding box created (Table 25.1; Fig. 25.7).
The next part after detecting the face was to statistically analyse the seven basic facial emotions of the detected face. For this, we used the Mini Xception model as discussed. Figure 25.6 displays the output of the Mini Xception model: the percentage of each of the seven emotions for the detected face, along with the emotion having the maximum percentage. The main parameters for this are the widening of the lips and the contraction at the sides of the eyes and cheeks.
For the second and most important part of this work, the prediction of emotion, this calculated data serves as the input for the time-series model. The data collected over the period is sent as input in the form of a CSV file. The model takes each emotion as a separate input series and, from the data collected on an individual over the period, predicts the percentage of each emotion for the upcoming future. Here, we show the output for two of the seven emotions, sad and angry. Figures 25.8 and 25.9 show the plot of the input data and the predicted values for the next 1 hour for sad and angry, respectively. The dots represent the input data points, and the curve after the last black dot represents the predicted values.
Figures 25.10, 25.11, 25.12 and 25.13 show the trends of both emotions, sad and angry. Figures 25.10 and 25.12 show the predicted trend over the entire period of the tabulated data, while Figs. 25.11 and 25.13 show the predicted changes over a day. The daily trend curve is calculated by observing the changes within 24 hours. Similarly, the model predicts the percentage of each of the seven emotions.
Fig. 25.10 Trend for emotion—sad, trend over the period of time
Fig. 25.12 Trend for emotion—angry, trend over the period of time
As we do not provide data over months or years, monthly or yearly trends are not present in the output. This model can be used to gather information about the current emotion of a person, and as the amount of input data increases over time, the predicted values should become more accurate.
25.5 Conclusion
Hence, the experimental work demonstrates the two parts of this study, and various results were derived from the facial emotions. These behavioural studies of human emotions can serve as the basis for many interrogative and preventive cases, and the system can serve as a monitoring mechanism: a sudden change in emotional behaviour, or a deviation from the regular trend of a particular emotion, can be easily analysed. The work carried out in this article can be beneficial for tasks such as monitoring a patient's emotions by doctors or monitoring criminals' mental activities by the police.
References
Adhikari R, Agrawal RK (2013) An introductory study on time series modeling and forecasting
Alam MGR, Abedin SF, Moon S II, Talukder A, Hong CS (2019) Healthcare IoT-based affective
state mining using a deep convolutional neural network. IEEE Access 7:1–15. https://doi.org/10.
1109/ACCESS.2019.2919995
Arriaga O, Valdenegro-Toro M, Plöger PG (2019) Real-time convolutional neural networks for
emotion and gender classification. In: ESANN 2019—Proceedings, 27th European symposium
on artificial neural networks, computational intelligence and machine learning, pp 221–226
Available online. https://www.researchgate.net/publication/3940582-Rapid-Object-Detection-
using-Boosted-Cascade-of-Simple-Features. Accessed: 03-Sept-2020
Behera B, Kumar N, Mahato MR, Prasad BK, Semwal VB (2020a) Weather forecasting and
monitoring using machine learning. In: National conference on electronics, communication and
computation—NCECC 2020. MANTECH Publications, Jamshedpur, pp 1–6
Behera B, Kumar N, Mahato MR, Kumar A (2020b) COVID-19 detection using advanced CNN and
X-rays. In: Arpaci I et al (eds) Emerging technologies during the era of COVID-19 pandemic.
Springer Nature, Berlin, pp 1–11
Choi DY, Song BC (2020) Facial micro-expression recognition using two-dimensional landmark
feature maps. IEEE Access 8:121549–121563. https://doi.org/10.1109/ACCESS.2020.3006958
Chollet F (2017) Xception: deep learning with depthwise separable convolutions, pp 1–8. arXiv:1610.02357v3. https://doi.org/10.1109/CVPR.2017.195
Clark EA, Kessinger J, Duncan SE et al (2020) The facial action coding system for characterisation of
human affective response to consumer product-based stimuli: a systematic review. Front Psychol
11:1–21. https://doi.org/10.3389/fpsyg.2020.00920
Cuimei L, Zhiliang Q, Nan J, Jianhua W (2017) Human face detection algorithm via Haar cascade
classifier combined with three additional classifiers. In: IEEE 13th International conference on
electronic measurement & instruments. IEEE, pp 483–487
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE
Conference on computer vision and pattern recognition, pp 770–778
Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) MobileNets: efficient convolutional neural networks for mobile vision applications, pp 1–9. arXiv:1704.04861v1
Kaehler A, Bradski G (2016) Learning OpenCV 3: computer vision in C++ with the OpenCV
library, 1st edn. O’Reilly, Sebastopol
Mehrabian A (2017) Nonverbal communication. Taylor & Francis Group, New York, USA
Polusmak E (2017) Time series analysis in Python: predicting the future with Facebook Prophet.
In: mlcourse.ai. https://mlcourse.ai/articles/topic9-part2-prophet/. Accessed: 03-Sept-2020
Saha S (2018) A comprehensive guide to convolutional neural networks—the ELI5 way. In: Towards
Data Science. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-
networks-the-eli5-way-3bd2b1164a53. Accessed: 03-Sept-2020
Semwal VB, Singha J, Sharma PK, Chauhan A, Behera B (2017) An optimized feature selection
technique based on incremental feature analysis for bio-metric gait data classification. Multimed
Tools Appl 76:24457–24475. https://doi.org/10.1007/s11042-016-4110-y
Shaver P, Schwartz J, Kirson D, O’Connor C (1987) Emotion knowledge: further exploration of a
prototype approach. J Pers Soc Psychol 52:1061–1086
Sun X, Zheng S, Fu H (2020) ROI-attention vectorized CNN model for static facial expression
recognition. IEEE Access 8:7183–7194. https://doi.org/10.1109/ACCESS.2020.2964298
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2015) Rethinking the inception architecture for computer vision, pp 1–8. arXiv:1512.00567v3. https://doi.org/10.1109/CVPR.2016.308
Taylor SJ, Letham B (2018) Forecasting at scale. In: American Statistician. Available online. https://
facebook.github.io/prophet/. Accessed: 03-Sept-2020
Tutorials O-P (2020) Face detection using Haar cascades. In: OpenCV. https://opencv-python-
tutroals.readthedocs.io/en/latest/py-tutorials/py-objdetect/py-face-detection/py-face-detection.
html. Accessed: 03-Sept-2020
Chapter 26
Identification of Congestive Heart
Failure Patients Through Natural
Language Processing
26.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 411
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems
and Networks, https://doi.org/10.1007/978-981-16-1681-5_26
412 N. Baliyan et al.
similar allergies, etc., is one of the major tasks in the medical field, in order to utilize existing patient records for future research or to derive new useful insights from existing case studies (Gupta et al. 2018). Cohort identification may be carried out with the help of the enormous patient and biomedical data stored in institutional as well as public repositories. This data may be structured, unstructured, or semi-structured, and there is a need to apply efficient and appropriate techniques to extract and utilize it (Saini et al. 2017).
Nevertheless, the procedure of differentiating groups of patients on the basis of their records stored in EHRs can be exceedingly taxing and time consuming, owing to the complexity of the criteria on which the grouping has to be performed. This is because the texts reflecting these criteria are concealed across several documents and several data points in patients' EHRs (Malathi et al. 2019).
26.1.1 Background
EHR, also referred to as electronic medical record (EMR), is the systematized collection of patient and population health information, electronically stored in a digital format. These records can be shared across different healthcare settings through network-connected, enterprise-wide information systems or other information networks and exchanges (Shickel et al. 2017). EHRs may include a range of data, including demographics, medical history, medications and allergies, immunization status, laboratory test results, radiology images, vital signs, personal statistics such as age and weight, and billing information. The growing accessibility and mobility of EHRs have increased the ease with which they can be used by doctors, unlike paper-based medical records. EHRs provide researchers with unmatched phenotypic comprehensiveness and have the potential to advance precision-medicine techniques at scale. One of the chief EHR-based use cases is determining a patient cohort identification algorithm or framework which can discover disease status, onset, and severity (Fox et al. 2018). Phenotype algorithms that use EHR data to classify patients with specific diseases and outcomes are a foundation of EHR research.
Clinicians/doctors issue significant supplementary observations in the form of unstructured documents, such as patient progress notes, radiological reports, and clinical narratives. However, there are open difficulties inherent in parsing these heterogeneous and complex clinical narratives. Additional challenges, such as the presence of abbreviations, grammatical errors, spelling mistakes, and local dialectical phrases, make the job of processing this data even harder. Moreover, the data collected in unstructured and structured forms needs recombining strategies. The ability to extract meaningful data from EHRs and integrate it into a reasonable structure can bring prominent benefits for automated patient cohort identification (Grana and Jackwoski 2015).
26 Identification of Congestive Heart Failure Patients Through … 413
fact that the test results reveal otherwise, i.e., the patient does not have the condition; the diagnosis code stays in the patient's health record history. If the diagnosis code is then spotted without considering its context (i.e., without understanding the nuances of the patient's case as shown in his or her health records), this becomes a major concern, because it limits the ability of investigators to identify patient cohorts accurately and to fully use the statistical potential of the available populations (Critical Data 2016). It is extensively proven that clinical NLP systems perform well for information extraction from free text, i.e., unstructured data, in specific disease domains. After identification of patient groups (cohorts) for a study or research, further analytics has to be performed on these cohorts to interrelate the extracted data and find useful insights. For ease of analysis, this data needs to be in a standard structure. The main limitation of the present semantic web approaches is the lack of support for NLP methods for extraction of clinical markers from unstructured text.
This paper proposes a framework for patient cohort identification from unstructured clinical records. To demonstrate a use case, the i2b2 dataset (https://www.i2b2.org/NLP/DataSets/Main.php) of obese patients was chosen, and the cohort of patients having the congestive heart failure condition has been identified. The results were compared against the experts' annotations, which were considered the gold standard. Additionally, a manual review of clinical records was performed for validation.
This section describes the data used, the proposed framework, the detailed steps of the framework, and the tool used for the study.
The data for this study was extracted from the publicly available Informatics for Integrating Biology & the Bedside (i2b2) obesity challenge dataset (https://www.i2b2.org/NLP/DataSets/Main.php). i2b2 is dedicated to leveraging existing clinical information to yield insights that can directly impact healthcare improvement.
The data was randomly taken from the RPDR using a query that extracted records of patients who were either diabetic or obese. Each patient record in the dataset contains occurrences of the stem "obes" from zero to more than ten times. The extracted patient records were semi-automatically de-identified. An automatic pass, followed by two parallel manual passes, was run over each individual record, after which a third manual pass resolved all disagreements between the two manual passes. The data was made HIPAA compliant (Murphy et al. 2011) by replacing patient names, patient ages, patient family member names, nationalities, hospital names, phone numbers, doctor names, ID numbers, dates, locations, patients' occupations, and other potential identifiers with surrogates.
The annotation of the challenge data was done by two experts in the field of obesity (https://www.i2b2.org/NLP/Obesity/Documentation.php). If the patient has the co-morbidity, they marked it with a Y, which stands for YES; N stands for NO (does not have the co-morbidity); U stands for Unmentioned (the co-morbidity is not mentioned in the narrative); and Q stands for Questionable (it is questionable whether the patient has the disease).
For this study, the CHF co-morbidity was chosen, and patients belonging to this cohort were identified. Only two markers, Y and N, were used; the Unmentioned category is also treated as N, as such records show no instance of the disease. Thus, only records with a Y marker were considered to be in the cohort, i.e., the patient has or had CHF. The experts' annotations are considered the ground truth and are compared against our proposed system's output.
26.2.2 Methodology
As discussed in Sect. 26.1, cohort identification for patients with a common characteristic, for example a particular disease, similar symptoms, or similar allergies, is one of the major tasks in the medical field, in order to utilize existing patient records for future research or to derive new useful insights from existing case studies. As this task is complex and time consuming, there is a need to apply efficient and appropriate techniques to extract and use this data. NLP techniques that focus on information extraction from free text, i.e., unstructured data in specific disease domains, may be applied.
To allow identification of potential treatments, researchers must not only identify a proper patient cohort, but also collect and combine relevant data from a wide range of repositories, so that statistical significance can be obtained at the end. Semantic web technologies can do wonders here, as they can seamlessly combine heterogeneous data from multiple sources and also provide interoperability.
The main limitation of the present semantic web approaches is the lack of support for NLP methods for extracting clinical markers from unstructured text; this is not natively supported by any of the studied approaches. The current phenotype ontology can be extended by specifying lexicons and NLP rules for NLP engines, which can then help in ontology creation. Figure 26.2 shows an overview of the proposed framework (Johar and Baliyan, in press).
An Apache project tool called cTAKES was used to apply NLP techniques in order to identify patients belonging to a particular cohort. The main reason for choosing cTAKES is that it provides a dictionary creator feature, which can help to identify clinical terminology and map terms to their matching entries in the Unified Medical Language System (UMLS) Metathesaurus (https://ctakes.apache.org/).
As NLP can handle unstructured data very well, applying it to clinical narratives can overcome the limitations of billing-code algorithms for identifying patient cohorts, by identifying terms that describe the signs and symptoms used to build a diagnosis. Indeed, previous studies have shown that NLP techniques outperform billing-code algorithms for phenotype identification from clinical narratives in the EHR. In this work, we build a methodology which uses an NLP-based algorithm for text recognition and a semantic network, namely the Metathesaurus, for mapping predefined entities to the recognized text.
Figure 26.3 shows how individual sentences in the unstructured clinical notes are processed using the Apache project cTAKES. The first four steps use basic NLP processes, namely boundary detection, tokenization, part-of-speech tagging, and chunking, and were implemented in Python. The next two steps, which use a semantic network for entity recognition and for mapping entities to their properties (i.e., their corresponding UMLS codes), were carried out with the help of the cTAKES tool. These steps are explained in the following sub-sections.
26.2.4.1 Tokenization
26.2.4.3 Chunking
cTAKES was developed at the Mayo Clinic in 2006 and has since grown into a fundamental component of clinical data management infrastructure (https://en.wikipedia.org/wiki/Apache_cTAKES). It consists of multiple components used to produce linguistic and semantic annotations, which can further be used for research purposes and in decision support systems (DSS).
Each of these components has distinctive characteristics and capabilities. This study used the default fast pipeline from cTAKES, which annotates diseases/disorders, signs/symptoms, medications, anatomical sites, and procedures. Every annotation carries UMLS CUIs along with attributes for uncertainty, negation, and subject.
Figure 26.8 shows all the components used in this study; the components in the box are used for the named entity recognition process and were derived with the help of cTAKES.
To assess the performance of our model, we chose precision, recall, F1 score, and accuracy as the standard metrics. A confusion matrix is used: a tabular layout commonly utilized to describe the performance of a classification model on a set of test data for which the true values are known.
The true negatives and true positives are the instances that are correctly predicted; the false negatives and false positives need to be minimized. Each of these is explained next.
1. True Positive (TP): the system rightly predicts a positive value; that is, the predicted class and the actual class are both yes. For example, if our system predicts that a patient belongs to the cohort of patients having CHF and the actual class value also indicates that the patient has CHF, the case is labeled a true positive.
2. True Negative (TN): the system rightly predicts a negative value; that is, the predicted class and the actual class are both no. For example, if our system predicts that a patient does not belong to the cohort of patients having CHF and the actual class value also indicates that the patient does not have CHF, the case is labeled a true negative.
3. False Positive (FP): the predicted classification is yes but the actual classification is no. For example, if our system predicts that a patient belongs to the cohort of patients having CHF but the actual class value indicates that the patient does not have CHF, the case is labeled a false positive.
4. False Negative (FN): the predicted classification is no but the actual classification is yes. For example, if our system predicts that a patient does not belong to the cohort of patients having CHF but the actual class value indicates that the patient has CHF, the case is labeled a false negative.
These four quantities are used to calculate precision, recall, F1 score, and accuracy.
26.3.1 Precision
Precision = True Positive / (True Positive + False Positive) (26.1)
26.3.2 Recall
26.3.3 F1-Score
It is the harmonic mean of recall and precision; hence, this metric takes both false negatives and false positives into consideration. Intuitively, F1 is not as simple to interpret as accuracy, but it is generally more useful than accuracy, particularly when there is an unequal class distribution. Accuracy works best when false negatives and false positives have similar costs.
F1 = 2 × (Precision × Recall) / (Precision + Recall) (26.3)
26.3.4 Accuracy
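All four metrics can be computed directly from the confusion-matrix counts; the counts in the sketch below are illustrative only, not the study's results:

```python
def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)                          # Eq. (26.1)
    recall = tp / (tp + fn)                             # fraction of actual positives found
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (26.3)
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # fraction of all predictions correct
    return precision, recall, f1, accuracy

# Toy counts: 8 true positives, 80 true negatives, 2 false positives,
# 10 false negatives out of 100 records.
p, r, f1, acc = metrics(tp=8, tn=80, fp=2, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))
```

Note how accuracy (0.88) looks strong while recall (0.44) exposes the missed positives, which is why F1 is preferred under class imbalance.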
26.4.2 Implementation
1. Tokenization
One of the most basic and foremost steps of NLP is tokenization: the process of segregating text into its individual constituent words. For this step, the Apache tool cTAKES has been used, because it allows access to the UMLS library, which is required for the recognition of medical named entities.
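The basic sentence-boundary detection and tokenization steps can be illustrated in plain Python; the clinical sentence below is made up, and the study's actual pipeline relies on cTAKES for the UMLS-aware steps:

```python
import re

note = "Patient denies chest pain. History of congestive heart failure."

# Sentence boundary detection: split after terminal punctuation.
sentences = re.split(r"(?<=[.!?])\s+", note)

# Tokenization: keep words and punctuation as separate tokens.
tokens = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]
print(tokens[1])  # ['History', 'of', 'congestive', 'heart', 'failure', '.']
```

Each token list then feeds the downstream part-of-speech tagging and chunking steps.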
The clinical narratives of the dataset were processed using the collection processing engine (CPE) of cTAKES. The output text files of the above NLP steps were saved in a directory (in this case called the data directory), from where they were collectively processed using the bundled UIMA CPE, which in turn saves the output annotations in another directory (in this case called the output directory). Another option for processing documents is the UIMA CAS visual debugger (CVD), but it processes documents one at a time, and the results can be seen in the GUI itself or as XCAS files. Figure 26.12 shows a snapshot of the CPE configurator graphical user interface (GUI) with its three major divisions, namely the collection reader, analysis engines, and CAS consumers. Each line in the document is considered an entity to be analyzed by the CPE. The collection reader segment asks you to define the descriptor used, input directory, language, encoding, and any extensions. We used apache-ctakes-4.0.0/desc/ctakes-core/desc/collection_reader/FilesInDirectoryCollectionReader.xml. For the input directory, a directory was created which contained all the selected output files (this directory was named data). For the analysis engine, AggregatePlaintextFastUMLSProcessor was used. This analysis engine runs the complete pipeline and encompasses the SimpleSegmentAnnotator analysis engine, which makes a segment annotation that enfolds the entire plain-text document. This analysis engine uses the UMLS resources for NER, i.e., concept identification of medical terms.
For each CAS, a local file with the document text is written to a directory specified by a parameter. This CAS consumer does not use any annotation information in the CAS except for the document ID specified in the CommonTypeSystem.xml descriptor; the document ID becomes the name of the file written for each CAS. This CAS consumer may be useful for writing the results of a collection reader and/or CAS initializer to the local file system. For example, a JDBC collection reader may read XML documents from a database, and a specialized CAS initializer may convert the XML to plain text. The FilesInDirectoryCasConsumer was then used to write the plain text to local plain-text files. These annotated files were then searched for the CUIs related to CHF; for this search, a Python script was written which parsed the annotated XML files for the desired CUIs.
The dataset from i2b2 was used in this study to develop the proposed method. A total of 1130 patient records were considered from the dataset for the study. Disease mentions in the records were assigned CUI IDs using the UMLS vocabulary.
As discussed in Sect. 26.4.2, the cTAKES CPE was used to process the clinical
narratives, producing an annotated XML file for each record. These annotated
files were searched for three CUIs related to CHF: C0018802
(Congestive heart failure), C0018801 (Heart failure), and C0018800 (Cardiomegaly).
Cardiomegaly (C0018800) is used to find patients who are at risk of heart failure:
it is generally a symptom of a condition such as heart disease or a heart valve
problem, and it may precede a heart attack. Moreover, congestive
heart failure is commonly referred to simply as "heart failure" (https://www.mayoclinic.
org/diseases-conditions/heart-failure/symptoms-causes/syc-20373142), so adding
the latter two CUIs to our search can improve the identification of CHF
patients. First, each annotated patient record is checked for the presence of the
three CUIs (a record may contain all three or any combination of them). If none is
present, that patient does not have the condition and therefore does not belong to
the patient cohort having CHF. If a CUI is present, the corresponding polarity
and uncertainty annotations for that instance are checked. A polarity of –1 means
the sentence is negated, so that CUI is ignored and does not qualify the patient for
the desired cohort. An uncertainty of 1 means it is uncertain whether the patient
has the condition specified by the CUI, so it is likewise ignored. If polarity and
uncertainty have values other than these, the CUI is counted and, hence, the record
is assigned to the cohort having CHF.
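The inclusion rule above can be sketched as a small Python function. The function name and the tuple layout of the extracted mentions are illustrative, not taken from the chapter's Appendix C script:

```python
# Hypothetical sketch of the cohort-inclusion rule described above.
# A record is a list of (cui, polarity, uncertainty) tuples extracted
# from its cTAKES annotation file.
TARGET_CUIS = {"C0018802", "C0018801", "C0018800"}

def belongs_to_chf_cohort(mentions):
    """Return True if any target CUI appears affirmed and certain."""
    for cui, polarity, uncertainty in mentions:
        if cui not in TARGET_CUIS:
            continue            # not a CHF-related concept
        if polarity == -1:
            continue            # negated mention, ignore
        if uncertainty == 1:
            continue            # uncertain mention, ignore
        return True             # one affirmed, certain mention suffices
    return False

# Example: an affirmed CHF mention puts the record in the cohort,
# a negated one alone does not.
print(belongs_to_chf_cohort([("C0018802", 1, 0)]))   # → True
print(belongs_to_chf_cohort([("C0018802", -1, 0)]))  # → False
```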
Figure 26.13 conveys the performance statistics of CPE while executing a record
file.
Figures 26.14, 26.15, and 26.16 show partial snapshots of the resultant annotation
file from the CPE. Part 1 shows how a text string that mentions an anatomical
entity (a body part or area, corresponding to the UMLS semantic group of Anatomy)
is given an ID, and how its corresponding parameters are defined: history
of, generic, conditional, uncertainty, polarity, confidence, subject, and an
OntologyConceptArr number referencing its corresponding UMLS concept mention (shown
in part 1). Besides AnatomicalSiteMention, IDs are created for further entity
types: DiseaseDisorderMention (a text string that refers to a disease/disorder
event), DateAnnotation (a text string that refers to a date event), and
SignSymptomMention (a text string that refers to a sign or symptom event). As
discussed above, these annotated files were searched for three CUIs in order to
decide membership in the cohort having congestive heart failure. Figure 26.15 shows
an instance of the CUI C0018802, i.e., congestive heart failure (which is also
mentioned through the preferredText parameter), present in one of the annotated records
(highlighted text). This means that if this instance is not negated, then this record
will belong to the desired cohort. The FSARRAY_id is used to check whether this
instance was negated in the text, i.e., polarity = –1. As mentioned in Fig. 26.14,
this id maps to the cTAKES OntologyConceptArr id (shown in part 2) of
the DiseaseDisorderMention event annotation. The FSARRAY_id here
is "59273"; searching for this ID, we find an instance of DiseaseDisorderMention
in this annotated file, as shown in Fig. 26.16. Checking this instance's
parameters, we find that polarity is 1, meaning the text is not negated and hence
this record belongs to the CHF cohort. The other parameters can be checked as
well: the subject of the sentence is patient; history is 0, meaning the text is
not in the context of the patient's history; uncertainty is 0, meaning the instance
is not uncertain; and generic is false, meaning it is not used in a generic way.
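This cross-reference, from a UMLS concept's id to the mention annotation that carries polarity and uncertainty, can be illustrated with Python's ElementTree. The element and attribute names below are a simplified stand-in for the namespaced XMI that cTAKES actually emits:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in document; real cTAKES output uses XMI namespaces
# and the refsem/textsem type system.
SAMPLE = """<xmi>
  <UmlsConcept id="59273" cui="C0018802" preferredText="Congestive heart failure"/>
  <DiseaseDisorderMention id="812" ontologyConceptArr="59273"
                          polarity="1" uncertainty="0" subject="patient"/>
</xmi>"""

root = ET.fromstring(SAMPLE)

# 1. Locate the concept element carrying the CHF CUI.
concept = next(c for c in root.iter("UmlsConcept") if c.get("cui") == "C0018802")

# 2. Follow its id back to the mention annotation that references it.
mention = next(m for m in root.iter("DiseaseDisorderMention")
               if concept.get("id") in m.get("ontologyConceptArr").split())

# 3. Read the mention-level attributes the cohort rule depends on.
print(mention.get("polarity"), mention.get("uncertainty"))  # → 1 0
```

Because polarity is "1" and uncertainty is "0" here, this mention would count toward cohort membership under the rule described in the text.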
These annotation files were then searched for the CHF CUI codes with the help of the
Python script (shown in Appendix C). A few sample outputs of this search are shown
in Fig. 26.17. Table 26.2 summarizes the counts of correct and incorrect predictions
of the study through a confusion matrix. The numbers of true positives, true negatives,
false positives, and false negatives for N = 1130 (where N is the total number of
records) are 466, 631, 13, and 20, respectively. The accuracy, precision, recall, and
F-score for our study were 0.970, 0.972, 0.958, and 0.965. Extracting and using the two
additional UMLS CUIs C0018801 and C0018800 when searching the clinical
notes increased the number of patients included in the final cohort and
improved the results: Reátegui and Ratté (2018) used only
CUI C0018802 and reported an F1-score, recall, and precision of 0.89, 0.92, and 0.86,
respectively, whereas the proposed system achieved 0.96,
0.95, and 0.97. Figure 26.18 shows the results of our model graphically.
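The reported figures follow directly from the confusion matrix; the short check below reproduces them (note that the chapter truncates to three decimals rather than rounding, e.g. an accuracy of 0.9708 is reported as 0.970):

```python
# Recomputing the reported metrics from the confusion matrix (N = 1130).
tp, tn, fp, fn = 466, 631, 13, 20

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 1097 / 1130
precision = tp / (tp + fp)                    # 466 / 479
recall    = tp / (tp + fn)                    # 466 / 486
f_score   = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f_score={f_score:.4f}")
```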
26.5 Discussion
Previous studies have revealed that the use of "International Classification of
Diseases, Ninth Revision (ICD-9) codes" alone is no longer considered sufficient for
cohort identification, which has motivated the use of secondary data sources for
patient cohort identification (Fox et al. 2018).
Sohn et al. (2018) and Wi et al. (2018) proposed an NLP algorithm for extracting
descriptive patterns of asthma events and temporal information from free text.
These were manually annotated, then jointly associated, and rules were applied. One
of the limitations was the small sample size owing to the arduous manual annotation
required to construct a large data set. In our method, by contrast, the annotation
process is automated with the help of cTAKES, so it works fairly well on large datasets.
Wang et al. (2018) proposed a prediction model for new patients along the learned
graph structure for chronic kidney disease. They used k-nearest neighbor-based
prediction and showed an effective prediction rate of 87%, whereas our method
achieved 97%. Table 26.3 compares our model with related work; this comparison
can also be seen graphically in Fig. 26.19.
It is also seen that combining negation detection and UMLS synonyms
in a clinical NLP tool can help clinical researchers improve the performance
of cohort identification using data from various sources within a large
clinical database. Aggregating CUIs also improved the results, as
congestive heart failure is frequently mentioned simply as "heart
failure"; adding the C0018801 (Heart failure) and C0018800 (Cardiomegaly)
CUIs to our search therefore helped improve the identification of CHF patients.
26.6 Conclusion
A limitation of the implemented NLP technique is that not all patients may be
classified into their correct cohort. For instance, we identified the crucial term
as "heart failure" instead of the whole term "congestive heart failure" and used its
code for searching the annotation files. Capturing all the plausible ways in which
medical practitioners abbreviate is a tough task and can cause a few patients to be
misclassified. The absence of conventional practice in clinical narratives has been
identified as an obstacle to NLP analysis of clinical text. Another limitation is that
we applied our NLP methodology only to finding the CHF cohort; it can be scaled and
utilized for more diseases, or to find cohorts with other common characteristics. In
future, an approach that automatically selects related CUIs using the associative
links between concepts in the Metathesaurus could be developed.
Biomedical data needs to be analyzed for new research in the field. The ability to
integrate and connect data across a number of EHRs is required for better understanding.
References
Afzal N, Mallipeddi VP, Sohn S, Liu H, Chaudhry R, Scott CG, Arruda-Olson AM (2018) Natural
language processing of clinical notes for identification of critical limb ischemia. Int J Med Inform
111:83–89
Bodenreider O (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Suppl_1):D267–D270
Denecke K (2015) Health web science: social media data for healthcare. Springer, Berlin
Fox F, Aggarwal VR, Whelton H, Johnson O (2018, June) A data quality framework for process
mining of electronic health record data. In: 2018 IEEE International conference on healthcare
informatics (ICHI). IEEE, pp 12–21
Graña M, Jackowski K (2015, November) Electronic health record: a review. In: 2015 IEEE International conference on bioinformatics and biomedicine (BIBM). IEEE, pp 1375–1382
Gupta D, Sundaram S, Khanna A, Hassanien AE, De Albuquerque VHC (2018) Improved diagnosis
of Parkinson’s disease using optimized crow search algorithm. Comput Electr Eng 68:412–424
https://ctakes.apache.org/. Last accessed 11 Apr 2019
https://en.wikipedia.org/wiki/Apache_cTAKES. Last accessed 11 Apr 2019
https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb. Last accessed 10 Apr 2019
https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html. Last accessed 10 Apr 2019
https://nlp.stanford.edu/software/tagger.shtml. Last accessed 10 Apr 2019
https://www.i2b2.org/NLP/DataSets/Main.php. Last accessed 18 Apr 2019
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5