
Transactions on Computer Systems and Networks

Gyanendra K. Verma
Badal Soni
Salah Bourennane
Alexandre C. B. Ramos Editors

Data Science
Theory, Algorithms, and Applications
Transactions on Computer Systems
and Networks

Series Editor
Amlan Chakrabarti, Director and Professor, A. K. Choudhury School of
Information Technology, Kolkata, West Bengal, India
Transactions on Computer Systems and Networks is a unique series that aims
to capture advances in the evolution of computer hardware and software systems
and progress in computer networks. Computing systems in the present world span
from miniature IoT nodes and embedded computing systems to large-scale
cloud infrastructures, which necessitates developing systems architecture, storage
infrastructure and process management to work at various scales. Present-day
networking technologies provide pervasive global coverage at scale
and enable a multitude of transformative technologies. The new landscape of
computing comprises self-aware autonomous systems, which are built upon a
software-hardware collaborative framework. These systems are designed to execute
critical and non-critical tasks involving a variety of processing resources like
multi-core CPUs, reconfigurable hardware, GPUs and TPUs, which are managed
through virtualisation, real-time process management and fault-tolerance. While AI,
Machine Learning and Deep Learning tasks are predominantly increasing in the
application space, computing system research aims towards efficient means of
data processing, memory management, real-time task scheduling, and scalable, secured
and energy-aware computing. The paradigm of computer networks also extends its
support to this evolving application scenario through various advanced protocols,
architectures and services. This series aims to present leading works on advances
in theory, design, behaviour and applications in computing systems and networks.
The Series accepts research monographs, introductory and advanced textbooks,
professional books, reference works, and select conference proceedings.

More information about this series at http://www.springer.com/series/16657


Gyanendra K. Verma · Badal Soni ·
Salah Bourennane · Alexandre C. B. Ramos
Editors

Data Science
Theory, Algorithms, and Applications
Editors
Gyanendra K. Verma
Department of Computer Engineering
National Institute of Technology Kurukshetra
Kurukshetra, India

Badal Soni
Department of Computer Science and Engineering
National Institute of Technology Silchar
Silchar, India

Salah Bourennane
Multidimensional Signal Processing Group
Ecole Centrale Marseille
Marseille, France

Alexandre C. B. Ramos
Mathematics and Computing Institute
Universidade Federal de Itajuba
Itajuba, Brazil

ISSN 2730-7484 ISSN 2730-7492 (electronic)


Transactions on Computer Systems and Networks
ISBN 978-981-16-1680-8 ISBN 978-981-16-1681-5 (eBook)
https://doi.org/10.1007/978-981-16-1681-5

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
We dedicate this book to all those who directly or
indirectly contributed to the accomplishment
of this work.
Preface

Digital information influences our everyday lives in various ways. Data sciences
provides us with tools and techniques to comprehend and analyze data. Data sciences
is one of the fastest-growing multidisciplinary fields and deals with data acquisition,
analysis, integration, modeling, visualization, and interaction for large amounts of
data.
Currently, each sector of the economy produces a huge amount of data in an
unstructured format. A huge amount of data is available from various sources
like web services, databases, and online repositories; however, the major challenge
is to preprocess these data and extract meaningful information from them. Artificial
intelligence plays a pivotal role in this analysis.
With the evolution of artificial intelligence, it has become possible to analyze and
interpret information in real time. Deep learning models are widely used in the
analysis of big data for various applications, particularly in the area of image
processing.
This book aims to develop an understanding of data sciences theory and concepts,
and of data modeling using various machine learning algorithms for a wide range of
real-world applications. In addition to presenting the basic principles of data
processing, the book teaches standard models and algorithms for data analysis.

Kurukshetra, India Dr. Gyanendra K. Verma


Silchar, India Dr. Badal Soni
Marseille, France Prof. Dr. h. c. Salah Bourennane
Itajuba, Brazil Prof. Dr. h. c. Alexandre C. B. Ramos
October 2020

Acknowledgements

We are thankful to all the contributors who have generously given time and material
to this book. We would also like to extend our appreciation to those who have
continuously inspired us.
We are extremely thankful to the reviewers, who have carried out the most impor-
tant and critical part of any technical book, evaluation of each of the submitted
chapters assigned to them.
We also express our sincere gratitude toward our publication partner, Springer,
especially to Ms. Kamiya Khatter and the Springer book production team for
continuous support and guidance in completing this book project.
Thank you.

Introduction

Objective of the Book

This book aims to provide readers with an understanding of data sciences, their
architectures, and their applications in various domains. Data sciences is helpful
in the extraction of meaningful information from unstructured data. The major aspects
of data sciences are data modeling, analysis, and visualization. This book covers major
models, algorithms, and prominent applications of data sciences to solve real-world
problems. By the end of the book, we hope that our readers will have an understanding
of the concepts, different approaches, and models, and familiarity with the implementation
of data sciences tools and libraries.
Artificial intelligence has had a major impact on research and has raised the performance
bar substantially in many of the standard evaluations. Moreover, the new challenges
can be tackled using artificial intelligence in the decision-making process. However,
it is very difficult to comprehend, let alone guide, the process of learning in deep
learning. There is an air of uncertainty about exactly what and how these models
learn, and this book is an effort to fill those gaps.

Target Audience

The book is divided into three parts comprising a total of 26 chapters. Parts, distinct
groups of chapters, as well as single chapters are meant to be fairly independent
and also self-contained, and the reader is encouraged to study only relevant parts or
chapters. This book is intended for a broad readership. The first part provides the
theory and concepts of learning. Thus, this part addresses readers wishing to gain an
overview of learning frameworks. Subsequent parts delve deeper into research topics
and are aimed at the more advanced reader, in particular graduate and PhD students
as well as junior researchers. The target audience of this book will be academi-
cians, professionals, researchers, and students at engineering and medical institutions
working in the areas of data sciences and artificial intelligence.


Book Organization

This book is organized into three parts. Part I includes eight chapters that deal with
theory concepts of data sciences, Part II deals with data design and analysis, and
finally, Part III is based on the major applications of data sciences. This book contains
invited as well as contributed chapters.

Part I Theory and Concepts

The first part of the book exclusively focuses on the fundamentals of data sciences.
The book chapters under this part cover active learning, ensemble learning concepts
along with language processing concepts.
Chapter 1 describes a general active learning framework that has been proposed for
network intrusion detection. The authors have experimented with different learning
and sampling strategies on the KDD Cup 1999 dataset. The results show that complex
learning models outperform relatively simple learning models. Uncertainty and entropy sampling also outperform
random sampling. Chapter 2 describes a bagging classifier which is an ensemble
learning approach for student outcome prediction by employing base and meta-
classifiers. Additionally, performance analysis of various classifiers has been carried
out by an oversampling approach using SMOTE and an undersampling approach
using spread sampling. Chapter 3 presents the patient’s medical data security via bi-
chaos bi-order Fourier transform. In this work, authors have used three techniques
for medical or clinical image encryption, i.e., FRFT, logistic map, and Arnold map.
The results suggest that the complex hybrid combination makes the system more
robust and secure from the different cryptographic attacks than these methods alone.
In Chap. 4, word-sense disambiguation (WSD) for the Nepali language is performed
using variants of the Lesk algorithm such as direct overlap, frequency-based scoring,
and frequency-based scoring after dropping the target word. Performance anal-
ysis based on the elimination of stop words, the number of senses, and context
window size has been carried out. Chapter 5 presents a performance analysis of
different branch prediction schemes incorporated in ARM big.LITTLE architecture.
The performance comparison of these branch predictors has been carried out based
on performance, power dissipation, conditional branch mispredictions, IPC, execu-
tion time, power consumption, etc. The results show that TAGE-LSC and perceptron
achieve the highest accuracy among the simulated predictors. Chapter 6 presents a
global feature representation using a new architecture SEANet that has been built
over SENet. An aggregate block implemented after the SE block aids in global feature
representation and reducing the redundancies. SEANet has been found to outperform
ResNet and SENet on two benchmark datasets—CIFAR-10 and CIFAR-100.

The subsequent chapters in this part are devoted to analyzing images. Chapter 7
presents an improved super-resolution of a single image through an external dictio-
nary formation for training and a neighbor embedding technique for reconstruction.
The task of dictionary formation is carried out so as to contain maximum structural
variations and the minimal number of images. The reconstruction stage is carried
out by the selection of overlapping pixels of a particular location. In Chap. 8, single-
step image super-resolution and denoising of SAR images are proposed using the
generative adversarial networks (GANs) model. The model shows improvement in
VGG16 loss as it preserves relevant features and reduces noise from the image. The
quality of results produced by the proposed approach is compared with the two-step
upscaling and denoising model and the baseline method.

Part II Models and Algorithms

The second part of the book focuses on the models and algorithms for data sciences.
The deep learning models, discrete wavelet transforms, principal component anal-
ysis, SenLDA, color-based classification model, and gray-level co-occurrence matrix
(GLCM) are used to model real-world problems.
Chapter 9 explores a deep learning technique based on OCR-SSD for car detection
and tracking in images. It also presents a solution for real-time license plate recog-
nition on a quadcopter in autonomous flight. Chapter 10 describes an algorithm for
gender identification based on biometric palm print using binarized statistical image
features. The filter size is varied with a fixed length of 8 bits to capture information
from the ROI palm prints. The proposed method outperforms baseline approaches
with an accuracy of 98%. Chapter 11 describes a Sudoku puzzle recognition and solu-
tion study. Puzzle recognition is carried out using a deep belief network for feature
extraction. The puzzle solution is given by serialization of two approaches—parallel
rule-based methods and ant colony optimization. Chapter 12 describes a novel profile
generation approach for human action recognition. DWT & PC is proposed to detect
energy variation for feature extraction in video frames. The proposed method is
applied to various existing classifiers and tested on Weizmann’s dataset. The results
outperform baselines like the MACH filter.
The subsequent chapters in this part are devoted to more research-oriented models
and algorithms. Chapter 13 presents a novel filter and color-based classification
model to assess the ripeness of tobacco leaves for harvesting. The ripeness detection
is performed by a spot detection approach using a first-order edge extractor and a
second-order high-pass filtering. A simple thresholding classifier is then proposed
for the classification task. Chapter 14 proposes an automatic deep learning frame-
work for breast cancer detection and classification model from hematoxylin and
eosin (H&E)-stained breast histopathology images with 80.4% accuracy for supple-
menting analysis of medical professionals to prevent false negatives. Experimental
results yield that the proposed architecture provides better classification results as
compared to benchmark methods. Chapter 15 specifies a technique for indoor flying

of autonomous drones using image processing and neural networks. The route for
the drone is determined through the location of the detected object in the captured
image. The first detection technique relies on image-based filters, while the second
technique focuses on the use of CNN to replicate a real environment. Chapter 16
describes the use of a gray-level co-occurrence matrix (GLCM) for feature detection
in SAR images. The features detected in SAR images by GLCM find much applica-
tion as it identifies various orientations such as water, urban areas, and forests and
any changes in these areas.

Part III Applications and Issues

The third part of the book covers the major applications of data sciences in various
fields like biometrics, robotics, medical imaging, affective computing, security, etc.
Chapter 17 deals with signature verification using Galois field operator. The
features are obtained by building a normalized cumulative histogram. Offline signa-
ture verification is also implemented using the K-NN classifier. Chapter 18 details a
face recognition approach in videos using 3D residual networks and comparing the
accuracy for different depths of residual networks. A CVBL video dataset has been
developed for the purpose of experimentation. The proposed approach achieves the
highest accuracy of 97% with DenseNets on the CVBL dataset. Chapter 19 presents fog
computing-based seed sowing robots for agriculture, in which microcontroller units (MCU)
with auto firmware communicate with the fog layer through a smart edge
node. The robot employs approaches such as simultaneous localization and mapping
(SLAM) and other path-finding algorithms and IR sensors for obstacle detection. ML
techniques and FastAi aid in the classification of the dataset. Chapter 20 describes
an automatic tumor identification approach to classify MRI of brain. An advanced
CNN model consisting of convolution and a dense layer is employed to correctly
classify the brain tumors. The results exhibit the proposed model’s effectiveness in
brain tumor image classification. Chapter 21 presents a vision-based sensor mech-
anism for lane detection in IVS. The lane markings on a structured road are
detected using image processing techniques such as edge detection and Hough space
transformation on KITTI data. Qualitative and quantitative analysis shows satis-
factory results. In Chapter 22, an implementation of a deep convolutional
neural network (DCNN) is proposed for micro-expression recognition, as DCNN has established
its presence in different image processing applications. CASME-II, a benchmark
database for micro-expression recognition, has been used for the experiments. The
experimental results reveal that the CNN-based models give correct results
of 90% and 88% for four and six classes, respectively, which is beyond the regular
methods.
In Chapter 23, the proposed semantic classification model intends to employ
modern embedding and aggregating methods which considerably enhance feature
discriminability and boost the performance of CNN. The performance of this frame-
work is exhaustively tested across a wide dataset. The intuitive and robust systems
that use these techniques play a vital role in various sectors like security, military,

automation, industries, medical, and robotics. In Chap. 24, a countermeasure for


voice conversion spoofing attack has been proposed using source separation based
on nonnegative matrix factorization and CNN-based binary classifier. The voice
conversion spoofed speech is modeled as a combination of the target estimate and
the artifact used in the voice conversion. The proposed method shows a decrease in
the false alarm rate of automatic speaker verification. Chapter 25 proposes a facial
emotion recognition and prediction system that can serve as a useful monitoring mecha-
nism in various fields. The first stage utilizes CNN for facial emotion detection from
real-time video frames and assigns a probability to the various emotional states. The
second stage uses a time-series analysis that predicts future facial emotions from
the output of the first stage. The final Chap. 26 describes a methodology for the
identification and analysis of cohorts for heart failure patients using NLP tasks. The
proposed approach uses various NLP processes implemented in the cTAKES tool to
identify patients of a particular cohort group. The proposed system has been found
to outperform the manual extraction process in terms of accuracy, precision, recall,
and F-measure scores.

Kurukshetra, India Dr. Gyanendra K. Verma


Silchar, India Dr. Badal Soni
Marseille, France Prof. Dr. h. c. Salah Bourennane
Itajuba, Brazil Prof. Dr. h. c. Alexandre C. B. Ramos
October 2020
Contents

Part I Theory and Concepts


1 Active Learning for Network Intrusion Detection . . . . . . . . . . . . . . . . 3
Amir Ziai
2 Educational Data Mining Using Base (Individual)
and Ensemble Learning Approaches to Predict
the Performance of Students . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Mudasir Ashraf, Yass Khudheir Salal, and S. M. Abdullaev
3 Patient’s Medical Data Security via Bi Chaos Bi Order
Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Bharti Ahuja and Rajesh Doriya
4 Nepali Word-Sense Disambiguation Using Variants
of Simplified Lesk Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Satyendr Singh, Renish Rauniyar, and Murali Manohar
5 Performance Analysis of Big.LITTLE System with Various
Branch Prediction Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Froila V. Rodrigues and Nitesh B. Guinde
6 Global Feature Representation Using Squeeze, Excite,
and Aggregation Networks (SEANet) . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Akhilesh Pandey, Darshan Gera, D. Gunasekar, Karam Rai,
and S. Balasubramanian
7 Improved Single Image Super-resolution Based on Compact
Dictionary Formation and Neighbor Embedding
Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Garima Pandey and Umesh Ghanekar
8 An End-to-End Framework for Image Super Resolution
and Denoising of SAR Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Ashutosh Pandey, Jatav Ashutosh Kumar, and Chiranjoy Chattopadhyay


Part II Models and Algorithms


9 Analysis and Deployment of an OCR—SSD Deep Learning
Technique for Real-Time Active Car Tracking and Positioning
on a Quadrotor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Luiz G. M. Pinto, Wander M. Martins, Alexandre C. B. Ramos,
and Tales C. Pimenta
10 Palmprint Biometric Data Analysis for Gender Classification
Using Binarized Statistical Image Feature Set . . . . . . . . . . . . . . . . . . . . 157
Shivanand Gornale, Abhijit Patil, and Mallikarjun Hangarge
11 Recognition of Sudoku with Deep Belief Network and Solving
with Serialisation of Parallel Rule-Based Methods and Ant
Colony Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Satyasangram Sahoo, B. Prem Kumar, and R. Lakshmi
12 Novel DWT and PC-Based Profile Generation Method
for Human Action Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Tanish Zaveri, Payal Prajapati, and Rishabh Shah
13 Ripeness Evaluation of Tobacco Leaves for Automatic
Harvesting: An Approach Based on Combination of Filters
and Color Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
P. B. Mallikarjuna, D. S. Guru, and C. Shadaksharaiah
14 Automatic Deep Learning Framework for Breast Cancer
Detection and Classification from H&E Stained Breast
Histopathology Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Anmol Verma, Asish Panda, Amit Kumar Chanchal, Shyam Lal,
and B. S. Raghavendra
15 An Analysis of Use of Image Processing and Neural Networks
for Window Crossing in an Autonomous Drone . . . . . . . . . . . . . . . . . . 229
L. Pedro de Brito, Wander M. Martins, Alexandre C. B. Ramos,
and Tales C. Pimenta
16 Analysis of Features in SAR Imagery Using GLCM
Segmentation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Jasperine James, Arunkumar Heddallikar, Pranali Choudhari,
and Smita Chopde

Part III Applications and Issues


17 Offline Signature Verification Using Galois Field-Based
Texture Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
S. Shivashankar, Medha Kudari, and S. Prakash Hiremath
18 Face Recognition Using 3D CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Nayaneesh Kumar Mishra and Satish Kumar Singh

19 Fog Computing-Based Seed Sowing Robots for Agriculture . . . . . . . 295


Jaykumar Lachure and Rajesh Doriya
20 An Automatic Tumor Identification Process to Classify MRI
Brain Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Arpita Ghosh and Badal Soni
21 Lane Detection for Intelligent Vehicle System Using Image
Processing Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Deepak Kumar Dewangan and Satya Prakash Sahu
22 An Improved DCNN Based Facial Micro-expression
Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
Divya Garg and Gyanendra K. Verma
23 Selective Deep Convolutional Framework for Vehicle
Detection in Aerial Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Kaustubh V. Sakhare and Vibha Vyas
24 Exploring Source Separation as a Countermeasure for Voice
Conversion Spoofing Attack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
R. Hemavathi, S. Thoshith, and R. Kumaraswamy
25 Statistical Prediction of Facial Emotions Using Mini Xception
CNN and Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
Basudeba Behera, Amit Prakash, Ujjwal Gupta,
Vijay Bhaksar Semwal, and Arun Chauhan
26 Identification of Congestive Heart Failure Patients Through
Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
Niyati Baliyan, Aakriti Johar, and Priti Bhardwaj

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
Editors and Contributors

About the Editors

Gyanendra K. Verma is currently working as Assistant Professor at the Department
of Computer Engineering, National Institute of Technology Kurukshetra, India.
He has completed his B.Tech. from Harcourt Butler Technical University (formerly
HBTI) Kanpur, India, and M.Tech. & Ph.D. from Indian Institute of Information
Technology Allahabad (IIITA), India. All his degrees are in Information Technology.
He has teaching and research experience of over six years in the area of Computer
Science and Information Technology with a special interest in image processing,
speech and language processing, human-computer interaction. His research work on
affective computing and the application of wavelet transform in medical imaging and
computer vision problems have been cited extensively. He is a member of various
professional bodies like IEEE, ACM, IAENG & IACSIT.

Badal Soni is currently working as Assistant Professor at the Department of
Computer Engineering, National Institute of Technology Silchar, India. He has
completed his B.Tech. from Rajiv Gandhi Technical University (formerly RGPV)
Bhopal, India, and M.Tech from Indian Institute of Information Technology, Design,
and Manufacturing (IITDM), Jabalpur, India. He received Ph.D. from the National
Institute of Technology Silchar, India. All his degrees are in Computer Science and
Engineering. He has teaching and research experience of over seven years in the area
of computer science and information technology with a special interest in computer
graphics, image processing, speech and language processing. He has published more
than 35 papers in refereed Journals, contributed books, and international conference
proceedings. He is the Senior member of IEEE and professional members of various
bodies like IEEE, ACM, IAENG & IACSIT.

Salah Bourennane received his Ph.D. degree from Institut National Polytechnique
de Grenoble, France. Currently, he is a Full Professor at the Ecole Centrale Marseille,
France. He is the head of the Multidimensional Signal Processing Group of Fresnel
Institute. His research interests are in statistical signal processing, remote sensing,


telecommunications, array processing, image processing, multidimensional signal


processing, and performance analysis. He has published several papers in reputed
international journals.

Alexandre C. B. Ramos is Associate Professor at the Mathematics and Computing
Institute—IMC of the Federal University of Itajubá—UNIFEI (MG). His interest
areas are multimedia, artificial intelligence, human-computer interface, computer-
based training, and e-learning. Dr. Ramos has over 18 years of research and
teaching experience. He did his Post-doctorate at the Ecole Nationale de l’Aviation
Civile—ENAC (France, 2013–2014), Ph.D. and Master in Electronic and Computer
Engineering from Instituto Tecnológico de Aeronáutica - ITA (1996 and 1992). He
completed his graduation in Electronic Engineering from the University of Vale
do Paraíba—UNIVAP (1985) and sandwich doctorate at Laboratoire d’Analyse et
d’Architecture des Systèmes—LAAS (France, 1995–1996). He has professional
experience in the areas of Process Automation with an emphasis on chemical
and petrochemical processes (Petrobras 1983–1995); and Computer Science, with
emphasis on Information Systems (ITA/ Motorola 1997–2001), acting mainly on the
following themes: Development of Training Simulators with the support of Intelli-
gent Tutoring Systems, Hybrid Intelligent Systems, and Computer Based Training,
Neural Networks in Trajectory Control in Unmanned Vehicles, Pattern Matching and
Image Digital Processing.

Contributors

S. M. Abdullaev Department of System Programming, South Ural State University,


Chelyabinsk, Russia
Bharti Ahuja Department of Information Technology, National Institute of
Technology, Raipur, India
Mudasir Ashraf School of CS and IT, Jain University, Bangalore, India
Jatav Ashutosh Kumar Indian Institute of Technology Jodhpur, Jodhpur,
Rajasthan, India
S. Balasubramanian Department of Mathematics and Computer Science
(DMACS), Sri Sathya Sai Institute of Higher Learning (SSSIHL), Prasanthi
Nilayam, Anantapur District, India
Niyati Baliyan Department of Information Technology, IGDTUW, Delhi, India
Basudeba Behera Department of Electronics and Communication Engineering,
NIT Jamshedpur, Jamshedpur, Jharkhand, India
Priti Bhardwaj Department of Information Technology, IGDTUW, Delhi, India

Chiranjoy Chattopadhyay Indian Institute of Technology Jodhpur, Jodhpur,


Rajasthan, India
Arun Chauhan Department of Computer Science Engineering, IIIT Dharwad,
Dharwad, Karnataka, India
Smita Chopde FCRIT, Mumbai, India
Pranali Choudhari FCRIT, Mumbai, India
L. Pedro de Brito Federal University of Itajuba, Institute of Mathematics and
Computing, Itajubá, Brazil
Deepak Kumar Dewangan Department of Information Technology, National
Institute of Technology, Raipur, Chhattisgarh, India
Rajesh Doriya Department of Information Technology, National Institute of
Technology, Raipur, Chhattisgarh, India
Divya Garg Department of Computer Engineering, National Institute of
Technology Kurukshetra, Kurukshetra, India
Darshan Gera DMACS, SSSIHL, Bengaluru, India
Umesh Ghanekar National Institute of Technology Kurukshetra, Kurukshetra,
India
Arpita Ghosh National Institute of Technology Silchar, Silchar, Assam, India
Shivanand Gornale Department of Computer Science, Rani Channamma
University, Belagavi, Karnataka, India
Nitesh B. Guinde Goa College of Engineering, Ponda-Goa, India
D. Gunasekar Department of Mathematics and Computer Science (DMACS), Sri
Sathya Sai Institute of Higher Learning (SSSIHL), Prasanthi Nilayam, Anantapur
District, India
Ujjwal Gupta Department of Electronics and Communication Engineering, NIT
Jamshedpur, Jamshedpur, Jharkhand, India
D. S. Guru University of Mysore, Mysore, Karnataka, India
Mallikarjun Hangarge Department of Computer Science, Karnatak College,
Bidar, Karnataka, India
Arunkumar Heddallikar RADAR Division, Sameer, IIT Bombay, Mumbai, India
R. Hemavathi Department of Electronics and Communication Engineering,
Siddaganga Institute of Technology (Affiliated to Visveswaraya Technological
University, Belagavi), Tumakuru, India
S. Prakash Hiremath Department of Computer Science, KLE Technological
University, BVBCET, Hubballi, Karnataka, India

Jasperine James FCRIT, Mumbai, India


Aakriti Johar Department of Information Technology, IGDTUW, Delhi, India
Medha Kudari Department of Computer Science, Karnatak University, Dharwad,
India
B. Prem Kumar Pondicherry Central University, Pondicherry, India
Amit Kumar Chanchal National Institute of Technology Karnataka, Mangalore,
Karnataka, India
R. Kumaraswamy Department of Electronics and Communication Engineering,
Siddaganga Institute of Technology (Affiliated to Visveswaraya Technological
University, Belagavi), Tumakuru, India
Jaykumar Lachure National Institute of Technology Raipur, Raipur, Chhattisgarh,
India
R. Lakshmi Pondicherry Central University, Pondicherry, India
Shyam Lal National Institute of Technology Karnataka, Mangalore, Karnataka,
India
P. B. Mallikarjuna JSS Academy of Technical Education, Bengaluru, Karnataka,
India
Murali Manohar Gramener, Bangalore, India
Wander M. Martins Institute of Systems Engineering and Information
Technology, Itajuba, MG, Brazil
Nayaneesh Kumar Mishra Computer Vision and Biometric Lab, IIIT Allahabad,
Allahabad, India
Asish Panda National Institute of Technology Karnataka, Mangalore, Karnataka,
India
Akhilesh Pandey Department of Mathematics and Computer Science (DMACS),
Sri Sathya Sai Institute of Higher Learning (SSSIHL), Prasanthi Nilayam, Anantapur
District, India
Ashutosh Pandey Indian Institute of Technology Jodhpur, Jodhpur, Rajasthan,
India
Garima Pandey National Institute of Technology Kurukshetra, Kurukshetra, India
Abhijit Patil Department of Computer Science, Rani Channamma University,
Belagavi, Karnataka, India
Tales C. Pimenta Institute of Systems Engineering and Information Technology,
Itajuba, MG, Brazil

Luiz G. M. Pinto Institute of Mathematics and Computing, Federal University of


Itajuba, Itajuba, MG, Brazil
Payal Prajapati Government Engineering College, Patna, India
Amit Prakash Department of Electronics and Communication Engineering, NIT
Jamshedpur, Jamshedpur, Jharkhand, India
B. S. Raghavendra National Institute of Technology Karnataka, Mangalore,
Karnataka, India
Karam Rai Department of Mathematics and Computer Science (DMACS), Sri
Sathya Sai Institute of Higher Learning (SSSIHL), Prasanthi Nilayam, Anantapur
District, India
Alexandre C. B. Ramos Institute of Mathematics and Computing, Federal
University of Itajuba, Itajuba, MG, Brazil
Renish Rauniyar Tredence Analytics, Bangalore, India
Froila V. Rodrigues Dnyanprassarak Mandal’s College and Research Centre,
Assagao-Goa, India
Satyasangram Sahoo Pondicherry Central University, Pondicherry, India
Satya Prakash Sahu Department of Information Technology, National Institute of
Technology, Raipur, Chhattisgarh, India
Kaustubh V. Sakhare Department of Electronics and Telecommunication, College
of Engineering, Pune, India
Yass Khudheir Salal Department of System Programming, South Ural State
University, Chelyabinsk, Russia
Vijay Bhaksar Semwal Department of Computer Science Engineering, MANIT,
Bhopal, Madhya Pradesh, India
C. Shadaksharaiah Bapuji Institute of Engineering and Technology, Davangere,
Karnataka, India
Rishabh Shah Nirma University, Ahmedabad, India;
Government Engineering College, Patna, India
S. Shivashankar Department of Computer Science, Karnatak University, Dharwad,
India
Satish Kumar Singh Computer Vision and Biometric Lab, IIIT Allahabad,
Allahabad, India
Satyendr Singh BML Munjal University, Gurugram, Haryana, India
Badal Soni National Institute of Technology Silchar, Silchar, Assam, India

S. Thoshith Department of Electronics and Communication Engineering,


Siddaganga Institute of Technology (Affiliated to Visveswaraya Technological
University, Belagavi), Tumakuru, India
Anmol Verma National Institute of Technology Karnataka, Mangalore, Karnataka,
India
Gyanendra K. Verma Department of Computer Engineering, National Institute of
Technology Kurukshetra, Kurukshetra, India
Vibha Vyas Department of Electronics and Telecommunication, College of
Engineering, Pune, India
Tanish Zaveri Nirma University, Ahmedabad, India
Amir Ziai Stanford University, Stanford, CA, USA
Acronyms

BHC Bayesian Hierarchical Clustering


CNN Convolution Neural Network
DCIS Ductal Carcinoma In Situ
HE Hematoxylin and Eosin
IDC Invasive Ductal Carcinoma
IRRCNN Inception Recurrent Residual Convolutional Neural Network
SVM Support Vector Machine
VGG16 Visual Geometry Group—16
WSI Whole Slide Image

Part I
Theory and Concepts
Chapter 1
Active Learning for Network Intrusion
Detection

Amir Ziai

Abstract Network operators are generally aware of common attack vectors that they
defend against. For most networks, the vast majority of traffic is legitimate. How-
ever, new attack vectors are continually designed and attempted by bad actors which
bypass detection and go unnoticed due to low volume. One strategy for finding such
activity is to look for anomalous behavior. Investigating anomalous behavior requires
significant time and resources. Collecting a large number of labeled examples for
training supervised models is both prohibitively expensive and subject to obsolescence
as new attacks surface. A purely unsupervised methodology is ideal; however,
research has shown that even a very small number of labeled examples can signifi-
cantly improve the quality of anomaly detection. A methodology that minimizes the
number of required labels while maximizing the quality of detection is desirable.
False positives in this context result in wasted effort or blockage of legitimate traf-
fic, and false negatives translate to undetected attacks. We propose a general active
learning framework and experiment with different choices of learners and sampling
strategies.

1.1 Introduction

Detecting anomalous activity is an active area of research in the security space. Tuor
et al. use an online deep learning-based method to detect anomalies.
This methodology is compared to traditional anomaly detection algorithms such
as isolation forest (IF) and a principal component analysis (PCA)-based approach
and found to be superior. However, no comparison is provided with semi-supervised
or active learning approaches which leverage a small amount of labeled data (Tuor
et al. 2017). The authors later propose another unsupervised methodology leverag-
ing recurrent neural network (RNN) to ingest the log-level event data as opposed to
aggregated data (Tuor et al. 2018). Pimentel et al. propose a generalized framework
for unsupervised anomaly detection. They argue that purely unsupervised anomaly

A. Ziai (B)
Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
e-mail: amirziai@stanford.edu


Table 1.1 Prevalence and number of attacks for each of the 10 attack types
Label          Attacks    Prevalence    Prevalence (overall)    Records
smurf.         280,790    0.742697      0.568377                378,068
neptune.       107,201    0.524264      0.216997                204,479
back.            2,203    0.022145      0.004459                 99,481
satan.           1,589    0.016072      0.003216                 98,867
ipsweep.         1,247    0.012657      0.002524                 98,525
portsweep.       1,040    0.010578      0.002105                 98,318
warezclient.     1,020    0.010377      0.002065                 98,298
teardrop.          979    0.009964      0.001982                 98,257
pod.               264    0.002707      0.000534                 97,542
nmap.              231    0.002369      0.000468                 97,509

detection is undecidable without a prior on the distribution of anomalies, and learned


representations have simpler statistical structure which translate to better generaliza-
tion. They propose an active learning approach with a logistic regression classifier as
the learner (Pimentel et al. 2018). Veeramachaneni et al. propose a human-in-the-loop
machine learning system that both provides insights to the analyst and addresses
large data processing concerns. This system uses unsupervised methods to surface
anomalous data points for the analyst to label and a combination of supervised and
unsupervised methods to predict the attacks (Veeramachaneni et al. 2016). In this
work, we also propose an analyst-in-the-loop active learning approach. However, our
approach is not opinionated about the sampling strategy or the learner used in active
learning. We will explore trade-offs in that design space.

1.2 Dataset

We have used the KDD Cup 1999 dataset which consists of about 500K records
representing network connections in a military environment. Each record is either
“normal” or one of 22 different types of intrusion such as smurf, IP sweep, and
teardrop. Out of these 22 categories, only 10 have at least 100 occurrences, and the
rest were removed. Each record has 41 features including duration, protocol, and
bytes exchanged. Prevalence of attack types varies substantially with smurf being
the most pervasive at about 50% of total records and Nmap at less than 0.01% of
total records (Table 1.1).

Table 1.2 Snippet of input data


Duration  Protocol_type  Service  Flag  src_bytes  dst_bytes  Land  Wrong_fragment  Urgent  Hot  ...  dst_host_srv_count
0         tcp            http     SF    181        5450       0     0               0       0    ...  9
0         tcp            http     SF    239        4860       0     0               0       0    ...  19
0         tcp            http     SF    235        1337       0     0               0       0    ...  29

1.2.1 Input and Output Example

Table 1.2 depicts three rows of data (excluding the label):


The objective of the detection system is to label each row as either “normal” or
“anomalous.”

1.2.2 Processing Pipeline

We generated 10 separate datasets consisting of normal traffic and each of the attack
vectors. This way we can study the proposed approach over 10 different attack vectors
with varying prevalence and ease of detection. Each dataset is then split into train,
development, and test partitions with 80%, 10%, and 10% proportions. All algorithms
are trained on the train set and evaluated on the development set. The winning strategy
is tested on the test set to generate an unbiased estimate of generalization. Categorical
features are one-hot encoded, and missing values are filled with zero.
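As a rough illustration, the per-attack dataset construction described above could be sketched in Python with pandas and scikit-learn as follows; the column name "label", the "normal." label value, and the helper function itself are assumptions and not taken from the chapter.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def build_attack_dataset(df, attack_label, label_col="label"):
    # Keep normal traffic plus a single attack type and binarize the label
    subset = df[df[label_col].isin(["normal.", attack_label])].copy()
    y = (subset[label_col] == attack_label).astype(int)
    X = pd.get_dummies(subset.drop(columns=[label_col]))   # one-hot encode categoricals
    X = X.fillna(0)                                         # fill missing values with zero
    # 80% train, 10% development, 10% test
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.8, random_state=0)
    X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```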

1.3 Approach

1.3.1 Evaluation Metric

Since labeled data is very hard to come by in this space, we have decided to treat this
problem as an active learning one. Therefore, the machine learning model receives a
subset of the labeled data. We will use the F1 score to capture the trade-off between
precision and recall:
F1 = (2P R)/(P + R) (1.1)

where P = TP/(TP + FP), R = TP/(TP + FN), TP is the number of true positives, FP is the number of
false positives, and FN is the number of false negatives. A model that is highly precise
(does not produce false positives) is desirable as it will not waste the analyst’s time.

However, this usually comes at the cost of being overly conservative and not catching
anomalous activity that is indeed an intrusion.
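For reference, Eq. (1.1) can be computed directly from the precision and recall; a minimal sketch, equivalent to scikit-learn's f1_score in the binary case, is shown below.

```python
from sklearn.metrics import precision_score, recall_score

def f1(y_true, y_pred):
    p = precision_score(y_true, y_pred)   # P = TP / (TP + FP)
    r = recall_score(y_true, y_pred)      # R = TP / (TP + FN)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0
```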

1.3.2 Oracle and Baseline

Labeling effort is a major factor in this analysis and a dimension along which we
will define the upper and lower bounds of the quality of our detection systems. A
purely unsupervised approach would be ideal as there is no labeling involved. We
will use an isolation forest (Zhou et al. 2004) to establish our baseline. Isolation
forests (IFs) are widely, and very successfully, used for anomaly detection. An IF
consists of a number of isolation trees, each of which are constructed by selecting
random features to split and then selecting a random value to split on (random value
in the range of continuous variables or random value for categorical variables). Only a
small random subset of the data is used for growing the trees, and usually a maximum
allowable depth is enforced to curb computational cost. We have used 10 trees for
each IF. Intuitively, anomalous data points are easier to isolate with a smaller average
number of splits and therefore tend to be closer to the root. The average closeness
to the root is proportional to the anomaly score (i.e., the lower this score, the more
anomalous the data point).
A completely supervised approach would incur maximum cost as we will have
to label every data point. We have used a random forest classifier with 10 estimators
trained on the entire training dataset to establish the upper bound (i.e., Oracle). In
Table 1.3, the F1 scores are reported for evaluation on the development set:

Table 1.3 Oracle and baseline for different attack types


Label Baseline F1 Oracle F1
smurf 0.38 1.00
neptune 0.49 1.00
back 0.09 1.00
satan 0.91 1.00
ipsweep 0.07 1.00
portsweep 0.53 1.00
warezclient 0.01 1.00
teardrop 0.30 1.00
pod 0.00 1.00
nmap 0.51 1.00
Mean ± standard deviation 0.33±0.29 1.00±0.01
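A possible way to reproduce the two bounds in Table 1.3 is sketched below; the 10 trees of the isolation forest and the 10 estimators of the random forest follow the text, while the remaining settings and the helper names are assumptions.

```python
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import f1_score

def baseline_f1(X_train, X_dev, y_dev):
    # Unsupervised baseline: isolation forest with 10 trees, no labels used
    iforest = IsolationForest(n_estimators=10, random_state=0).fit(X_train)
    y_pred = (iforest.predict(X_dev) == -1).astype(int)   # -1 means anomalous
    return f1_score(y_dev, y_pred)

def oracle_f1(X_train, y_train, X_dev, y_dev):
    # Fully supervised upper bound: random forest trained on every label
    rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
    return f1_score(y_dev, rf.predict(X_dev))
```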

Fig. 1.1 Active learning scheme

1.3.3 Active Learning

The proposed approach starts with training a classifier on a small random subset of
the data (i.e., 1000 samples) and then continually queries a security analyst for the
next record to label. There is a maximum budget of 100 queries (Fig. 1.1).
This approach is highly flexible. The choice of classifier can range from logistic
regression all the way up to deep networks as well as any ensemble of those models.
Moreover, the hyper-parameters for the classifier can be tuned on every round of
training to improve the quality of predictions. The sampling strategy can range from
simply picking random records to using classifier uncertainty or other elaborate
schemes. Once a record is labeled, it is removed from the pool of labeled data and
placed into the labeled record database. We are assuming that labels are trustworthy
which may not necessarily be true. In other words, the analyst might make a mistake
in labeling or there may be low consensus among analysts around labeling. In the
presence of those issues, we would need to extend this approach to query multiple
analysts and to build the consensus of labels into the framework.
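A minimal sketch of the loop in Fig. 1.1 is given below, assuming pandas inputs; the query_analyst callback stands in for the human labeler and, like the other helper names, is hypothetical.

```python
import pandas as pd

def active_learning_loop(learner, X_seed, y_seed, X_pool, query_analyst,
                         sample_index_fn, budget=100):
    # Initial training on the small labeled seed (~1000 samples in the text)
    X_labeled = X_seed.copy()
    y_labeled = pd.Series(y_seed, index=X_seed.index)
    pool = X_pool.copy()
    learner.fit(X_labeled, y_labeled)
    for _ in range(budget):                        # at most 100 queries
        if len(pool) == 0:
            break
        pos = sample_index_fn(learner, pool)       # pick the next record to label
        record = pool.iloc[[pos]]
        label = query_analyst(record)              # analyst supplies the label
        X_labeled = pd.concat([X_labeled, record])
        y_labeled = pd.concat([y_labeled, pd.Series([label], index=record.index)])
        pool = pool.drop(index=record.index)       # move record out of the unlabeled pool
        learner.fit(X_labeled, y_labeled)          # retrain after every query
    return learner
```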

1.4 Experiments

1.4.1 Learners and Sampling Strategies

We used a logistic regression (LR) classifier with L2 penalty as well as a random forest
(RF) classifier with 10 estimators, Gini impurity for splitting criteria, and unlimited
depth for our choice of learners. We also chose three sampling strategies. First is
a random strategy that randomly selects a data point from the unlabeled pool. The
second option is uncertainty sampling that scores the entire database of unlabeled
data and then selects the data point with the highest uncertainty. The third option
is entropy sampling, which calculates the entropy over the positive and negative

Table 1.4 Effects of learner and sampling strategy on detection quality and latency
Learner Sampling F1 initial F1 after 10 F1 after 50 F1 after Train time Query time
strategy 100 (s) (s)
LR Random 0.76±0.32 0.76±0.32 0.79±0.31 0.86±0.17 0.05±0.01 0.09±0.08
LR Uncertainty 0.83±0.26 0.85±0.31 0.88±0.20 0.10±0.08
LR Entropy 0.83±0.26 0.85±0.31 0.88±0.20 0.08±0.08
RF Random 0.90±0.14 0.91±0.12 0.84±0.31 0.95±0.07 0.11±0.00 0.09±0.07
RF Uncertainty 0.98±0.03 0.99±0.03 0.99±0.03 0.16±0.06
RF Entropy 0.98±0.04 0.98±0.03 0.99±0.03 0.12±0.08

classes and selects the highest entropy data point. Ties are broken randomly for both
uncertainty and entropy sampling.
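The three sampling strategies could be implemented roughly as follows; each helper returns the row position of the next record to present to the analyst, and the exact scoring and tie-breaking details are assumptions.

```python
import numpy as np

def random_sampling(learner, pool, rng=np.random.default_rng(0)):
    return int(rng.integers(len(pool)))            # any record, uniformly at random

def uncertainty_sampling(learner, pool):
    proba = learner.predict_proba(pool)[:, 1]
    return int(np.argmin(np.abs(proba - 0.5)))     # closest to the decision boundary

def entropy_sampling(learner, pool):
    proba = learner.predict_proba(pool)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return int(np.argmax(entropy))                 # highest-entropy record
```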
Table 1.4 shows the F1 score immediately after the initial training (F1 initial)
followed by the F1 score after 10, 50, and 100 queries to the analyst across different
learners and sampling strategies aggregated over the 10 attack types:
Random forests are strictly superior to logistic regression from a detection per-
spective regardless of the sampling strategy. It is also clear that uncertainty and
entropy sampling are superior to random sampling which suggests that judiciously
sampling the unlabeled dataset can have a significant impact on the detection quality,
especially in the earlier queries (F1 goes from 0.90 to 0.98 with just 10 queries). It is
important to notice that the query time might become a bottleneck. In our examples,
the unlabeled pool of data is not very large but as this set grows these sampling
strategies have to scale accordingly. The good news is that scoring is embarrassingly
parallelizable.
Figure 1.2 depicts the evolution of detection quality as the system makes queries
to the analyst for an attack with high prevalence (i.e., the majority of traffic is an
attack):
The random forest learner combined with an entropy sampler can get to perfect
detection within 5 queries which suggests high data efficiency (Mussmann and Liang
2018). We will compare this to the Nmap attack with significantly lower prevalence
(i.e., less than 0.01% of the dataset is an attack) (Fig. 1.3):
We know from our Oracle evaluations that a random forest model can achieve
perfect detection for this attack type; however, we see that an entropy sampler is not
guaranteed to query the optimal sequence of data points. The fact that the prevalence
of attacks is very low means that the initial training dataset probably does not have a
representative set of positive labels that can be exploited by the model to generalize.
The failure of uncertainty sampling has been documented (Zhu et al. 2008), and
more elaborate schemes can be designed to exploit other information about the unla-
beled dataset that the sampling strategy is ignoring. To gain some intuition into
these deficiencies, we will unpack a step of entropy sampling for the Nmap attack.
Figure 1.4 compares (a) the relative feature importance after the initial training to (b)
the Oracle (Fig. 1.5):

Fig. 1.2 Detection quality for a high prevalence attack

The Oracle graph suggests that "src_bytes" is a feature that the model is highly
reliant upon for prediction. However, our initial training is not reflecting this; we will
compute the z-score for each of the positive labels in our development set:

z_fi = |μ_R,fi − μ_W,fi| / σ_R,fi    (1.2)

where μ_R,fi is the average value of the true positives for feature i (i.e., fi), μ_W,fi is
the average value of the false positives or false negatives, and σ_R,fi is the standard
deviation of the values in the case of true positives.

Fig. 1.3 Detection quality for a low prevalence attack

The higher this value is for a feature, the more our learner needs to know about it
to correct the discrepancy. However, we see that the next query made by the strategy
does not involve a decision around this fact. The score for “src_bytes” is an order
of magnitude larger than other features. The model continues to make uncertainty
queries staying oblivious to information about specific features that it needs to correct
for.
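A possible implementation of the per-feature score in Eq. (1.2) on the development set is sketched below, assuming y_dev and y_pred are aligned with the pandas DataFrame X_dev; the epsilon guard against zero variance is an addition not present in the text.

```python
def feature_z_scores(X_dev, y_dev, y_pred, eps=1e-12):
    tp = (y_dev == 1) & (y_pred == 1)              # true positives (R)
    wrong = (y_dev != y_pred)                      # false positives and false negatives (W)
    mu_r, mu_w = X_dev[tp].mean(), X_dev[wrong].mean()
    sigma_r = X_dev[tp].std()
    return (mu_r - mu_w).abs() / (sigma_r + eps)   # one z-score per feature
```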

1.4.2 Ensemble Learning

Creating an ensemble of classifiers is usually a very effective way to combine the


power of multiple learners (Zainal et al. 2009). This strategy is highly effective
when the errors made by classifiers in the ensemble tend to cancel out and are not
compounded. To explore this idea, we designed a weighted ensemble (Fig. 1.5).
The ensemble prediction is calculated as follows:

Fig. 1.4 Random forest feature importance for a initial training and b Oracle

Fig. 1.5 Ensemble learner



Fig. 1.6 Ensemble active learning results for warezclient and satan attacks

 
Prediction_Ensemble = I( Σ_{e∈E} w_e · Prediction_e > (Σ_{e∈E} w_e) / 2 )    (1.3)

where Prediction_e ∈ {0, 1} is the binary prediction associated with the classifier e ∈ E =
{RF, GB, LR, IF} and w_e is the weight of the classifier in the ensemble.
The weights are proportional to the level of confidence we have in each of the
learners. We have added a gradient boosting classifier with 10 estimators.
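The weighted vote of Eq. (1.3) could be realized roughly as follows; the member list and the weights are assumptions, and the isolation forest's −1/1 output would need to be mapped to {0, 1} before being passed in.

```python
import numpy as np

def ensemble_predict(members, weights, X):
    # members: fitted classifiers whose predict() returns binary {0, 1} labels
    votes = np.array([m.predict(X) for m in members])    # shape: (|E|, n_samples)
    w = np.asarray(weights, dtype=float).reshape(-1, 1)
    weighted_sum = (w * votes).sum(axis=0)
    return (weighted_sum > w.sum() / 2).astype(int)      # Eq. (1.3): weighted majority vote
```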
Unfortunately, the results of this experiment suggest that this particular ensemble
is not adding any additional value. Figure 1.6 shows that at best the results match
that of random forest (a) and in the worst case they can be significantly worse (b):
The majority of the error associated with this ensemble approach relative to only
using random forests can be attributed to a high false negative rate. The other three
algorithms are in most cases conspiring to generate a negative class prediction which
overrides the positive prediction of the random forest.

Table 1.5 Active learning an unsupervised sampling strategy


Sampling strategy Initial F1 F1 after 10 F1 after 50 F1 after 100
Isolation forest 0.94±0.07 0.94±0.05 0.95±0.09 0.93±0.09
Entropy 0.94±0.07 0.98±0.03 0.99±0.03 0.99±0.03

1.4.3 Sampling the Outliers Generated Using Unsupervised Learning

Finally, we explore whether we can use an unsupervised method for finding the
most anomalous data points to query. If this methodology is successful, the sampling
strategy is decoupled from active learning and we can simply precompute and cache
the most anomalous data points for the analyst to label.
We compared a sampling strategy based on isolation forest with entropy sampling
(Table 1.5):
In both cases, we are using a random forest learner. The results suggest that
entropy sampling is superior since it is sampling the most uncertain data points in
the context of the current learner and not a global notion of anomaly which isolation
forest provides.

1.5 Conclusion

We have proposed a general active learning framework for network intrusion detec-
tion. We experimented with different learners and observed that more complex learn-
ers can achieve higher detection quality with significantly less labeling effort for most
attack types. We did not explore other complex models such as deep neural networks
and did not attempt to tune the hyper-parameters of our model. Since the bottleneck
associated with this task is the labeling effort, we can add model tuning while staying
within the acceptable latency requirements.
We then explored a few sampling strategies and discovered that uncertainty and
entropy sampling can have a significant benefit over unsupervised or random sam-
pling. However, we also realized that these strategies are not optimal, and we can
extend them to incorporate available information about the distribution of the fea-
tures for mispredicted data points. We attempted a semi-supervised approach called
label spreading that builds the affinity matrix over the normalized graph Laplacian
which can be used to create pseudo-labels for unlabeled data points (Zhou et al. 2004).
However, this methodology is very memory-intensive, and we could not successfully
train and evaluate it on all of the attack types.

References

Mussmann S, Liang P (2018) On the relationship between data efficiency and error for uncertainty
sampling. arXiv preprint arXiv:1806.06123
Pimentel T, Monteiro M, Viana J, Veloso A, Ziviani N (2018) A generalized active learning approach
for unsupervised anomaly detection. arXiv preprint arXiv:1805.09411
Tuor A, Kaplan S, Hutchinson B, Nichols N, Robinson S (2017) Deep learning for unsupervised
insider threat detection in structured cybersecurity data streams. arXiv preprint arXiv:1710.00811
Tuor A, Baerwolf R, Knowles N, Hutchinson B, Nichols N, Jasper R (2018) Recurrent neural
network language models for open vocabulary event-level cyber anomaly detection. Workshops
at the thirty-second AAAI conference on artificial intelligence
Veeramachaneni K, Arnaldo I, Korrapati V, Bassias C, Li K (2016) AI: training a big data machine
to defend. Big Data Security on Cloud (BigDataSecurity), IEEE international conference on high
performance and smart computing (HPSC), and IEEE international conference on intelligent data
and security (IDS), IEEE 2nd international conference, pp 49–54
Zainal A, Maarof MA, Shamsuddin SM (2009) Ensemble classifiers for network intrusion detection
system. J Inf Assur Secur 4(3):217–225
Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global
consistency. In: Advances in neural information processing systems, pp 321–328
Zhu J, Wang H, Yao T, Tsou BK (2008) Active learning with sampling by uncertainty and density
for word sense disambiguation and text classification. In: Proceedings of the 22nd international
conference on computational linguistics, vol 1, pp 1137–1144
Chapter 2
Educational Data Mining Using Base
(Individual) and Ensemble Learning
Approaches to Predict the Performance
of Students

Mudasir Ashraf, Yass Khudheir Salal, and S. M. Abdullaev

Abstract Ensemble approaches, which amalgamate various learning classifiers, are
grounded on heuristic machine learning methods to devise prediction paradigms,
as these ensemble methods are commonly more precise than individual classifiers.
Therefore, among diverse ensemble techniques, the investigators have experimented
with a widespread ensemble learning method, viz. bagging, to forecast the performance
of students. As the exploitation of ensemble approaches is a remarkable phenomenon
in prediction and classification, and considering the striking character and originality
of such a method in educational data mining, the researchers have applied this specific
approach to an existing pedagogical dataset obtained from the University of Kashmir.
All results were estimated with 10-fold cross-validation once the pedagogical dataset
was subjected to base classifiers comprising J48, random tree, naïve Bayes, and kNN.
Consequently, based on the learning behavior of these miscellaneous classifiers,
prediction models have been proposed for each classifier, including base and
meta-learning algorithms. In addition, SMOTE (an oversampling method) and spread
subsampling (an undersampling method) were employed to further draw a relationship
between the ensemble classifier and the base learning classifiers. These methods were
exploited with the key objective of observing further enhancement in the prediction
accuracy for students.

2.1 Introduction

The fundamental concept behind ensemble method is to synthesize contrasting base


classifiers into a single classifier, which is more precise and consistent in terms

M. Ashraf (B)
School of CS and IT, Jain University, Bangalore 190006, India
Y. K. Salal · S. M. Abdullaev
Department of System Programming, South Ural State University, Chelyabinsk, Russia
e-mail: yasskhudheirsalal@gmail.com
S. M. Abdullaev
e-mail: abdullaevsm@susu.ru

of prediction accuracy produced by the composite model and in decision making.


This theory of hybridizing multiple models to develop a single predictive model
has been under study for decades. According to Bühlmann and Yu (Bühlmann
and Yu 2003), the narrative of ensemble techniques began in 1977 with Tukey's
twicing, which initiated ensemble research by integrating a
couple of linear regression paradigms (Bühlmann and Yu 2003). The application
of ensemble approaches can be prolific in enhancing the excellence and heftiness
of various clustering learning models (Dimitriadou et al. 2003, 2018a; Ashraf et al.
2019). Past empirical research undertaken by different machine learning
researchers acknowledges that there is considerable advancement in mitigating the
generalization error once the outputs of multiple classifiers are synthesized (Ashraf et al.
2020; Opitz and Maclin 1999; Salzberg 1994, 2018b).
Due to the inductive bias of individual classifiers involved in the phenomenon of
ensemble approach, it has been investigated that ensemble methods are very effective
in nature than deployment of individual base learning algorithms (Geman et al.
1992). In fact, distinct mechanisms of ensembles can be efficacious in squeezing the
variance error to some level (Ali and Pazzani 1996) without augmenting the bias
error associated with the classifier (Ali and Pazzani 1996). In certain cases, the bias
error can be curtailed using ensemble techniques, and identical approach has been
highlighted by the theory of large margin classifiers (Geman et al. 1992).
Moreover, ensemble learning methods have been applied in diverse areas includ-
ing bioinformatics (Bartlett and Shawe-Taylor 1999), economics (Tan et al. 2003),
health care (Leigh et al. 2002), topography (Mangiameli et al. 2004), production
(Ahmed and Elaraby 2014), and so on. There are several ensemble approaches that
have been deployed by various research communities to foretell the performance of
different classes pertaining to miscellaneous datasets. The preeminent and straight-
forward ensemble-based concepts are bagging (Bruzzone et al. 2004) and boosting
(Maimon and Rokach 2004), wherein predictions are based on combined output,
generally through subsamples of the training set on different learning algorithms.
The predictions are, however, geared up through the process of voting phenomenon
(Ashraf et al. 2018).
Another method of learning, viz. meta learning, targets selecting the precise
algorithm for making predictions while solving specific problems, which is based
on the inherent idiosyncrasy of the dataset (Breiman 1996). The performance in
meta learning can in addition be grounded on other effortless learning algorithms
(Brazdil et al. 1994). Another widespread practice employed by (Pfahringer et al.
2000) for making decisions via using ensemble technique is to generate subsamples
of comprehensive dataset and exploit them on each algorithm.
Researchers have made valid attempts and have applied various machine learning
algorithms to improve prediction accuracy in the field of academics (Ashraf and
Zaman 2017; Ashraf et al. 2017, 2018c; Sidiq et al. 2017; Salal et al. 2021; Salal and
Abdullaev 2020; Mukesh and Salal 2019). Contemporarily, realizing the potential
application of ensemble methods, several techniques are at the disposal to the research
community for ameliorating prediction accuracy as well as explore possible insights
that are obscure within large datasets. Therefore, in this study, primarily efforts

Table 2.1 Exhibits results of diverse classifiers


Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
J48 | 92.20 | 7.79 | 0.922 | 0.053 | 0.923 | 0.919 | 0.922 | 0.947 | 13.51
Random tree | 90.30 | 9.69 | 0.903 | 0.066 | 0.903 | 0.904 | 0.903 | 0.922 | 15.46
Naïve Bayes | 95.50 | 4.45 | 0.955 | 0.030 | 0.957 | 0.956 | 0.955 | 0.994 | 7.94
KNN | 91.80 | 8.18 | 0.918 | 0.056 | 0.919 | 0.918 | 0.917 | 0.934 | 13.19

would be propounded to categorize all significant methods employed in the realm of


ensemble approaches and procedures.
Moreover, to make further advancements in this direction, researchers would
be employing classification algorithms including naïve bayes, KNN, J48, random
tree, and an ensemble method viz. bagging on the pedagogical dataset attained from
the University of Kashmir, in order to improve the prediction accuracy of students. Fur-
thermore, in the past literature related to educational data mining hitherto, the
researchers have made only modest efforts to exploit ensemble methods. Therefore,
there is still a deficit of research conducted within this realm. Moreover, innovative
techniques are indispensable to be employed across pedagogical datasets, so as to
determine prolific and decisive knowledge from educational settings.

2.2 Performance of Diverse Individual Learning Classifiers

In this study, primarily, we have applied four learning classifiers such as j48, ran-
dom tree, naïve bayes, and knn across academic dataset. Thereafter, the academic
dataset was subjected to progression of oversampling and undersampling methods
to corroborate whether there is any improvement in prediction achievements of stu-
dent’s outcome. Correspondingly, the analogous procedure is practiced over ensem-
ble methodologies including bagging and boosting to substantiate which learning
classifier among base or meta has demonstrated compelling results.
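As a rough illustration (not the authors' WEKA setup), the experiment of training the four base learners with 10-fold cross validation can be sketched in Python with scikit-learn; the dataset below is a synthetic stand-in, and the decision-tree classifiers only approximate WEKA's j48 and random tree.

```python
# Rough scikit-learn analogue of the base-classifier experiment: four learners
# evaluated with 10-fold cross validation. The synthetic dataset and the
# classifier choices are illustrative assumptions, not the chapter's actual
# data or tools.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)

classifiers = {
    "j48-like tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "random tree": DecisionTreeClassifier(splitter="random", random_state=0),
    "naive bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean 10-fold accuracy = {scores.mean():.4f}")
```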
Table 2.1 portrays outcome of diverse classifiers accomplished subsequent to
running these machine learning classifiers across educational dataset. Moreover, it
is unequivocal that naïve bayes has achieved notable prediction precision of 95.50%
in classifying the actual occurrences, an incorrect classification error of 4.45%, and a
minimum relative absolute error of 7.94% in contrast to remaining classifiers. The
supplementary calculations related with the learning algorithm such as Tp rate, Fp
rate, precision, recall, f -measure, and ROC area have been also found significant.
Conversely, random tree produced a substantial classification accuracy of
90.30%, incorrectly classified instances of 9.69%, and a relative absolute error (RAE) of
15.46%, and the supplementary parameters connected with the algorithm were found

Table 2.2 Shows results with SMOTE process


Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
J48 | 92.98 | 7.01 | 0.930 | 0.038 | 0.927 | 0.925 | 0.930 | 0.959 | 11.43
Random tree | 90.84 | 9.15 | 0.908 | 0.049 | 0.909 | 0.908 | 0.908 | 0.932 | 13.92
Naïve Bayes | 97.15 | 2.84 | 0.972 | 0.019 | 0.973 | 0.972 | 0.974 | 0.978 | 4.60
KNN | 92.79 | 7.20 | 0.928 | 0.039 | 0.929 | 0.928 | 0.929 | 0.947 | 10.98

noteworthy as well; nevertheless, its acquired outcomes were the least considerable
among the remaining algorithms.

2.2.1 Empirical Results of Base Classifiers with Oversampling Method

Table 2.2 exemplifies results of diverse classifiers subsequent to the application of


oversampling technique, namely SMOTE across pedagogical dataset. As per the
results furnished in the below-mentioned table, all classifiers have shown exem-
plary improvement in prediction accuracy along with the additional performance met-
rics related with the classifiers after using the oversampling method. Therefore, Tables
2.1 and 2.2 disclose the improvement of miscellaneous algorithms viz. j48 (from 92.20
to 92.98%), random tree (90.30–90.84%), naïve bayes (95.50–97.15%), and knn
(91.80–92.79%).
Additionally, the relative absolute errors related with the individual algorithms
after SMOTE demonstrated further improvement from 13.51 to 11.43% (j48), 15.46–
13.92% (random tree), 7.94–4.60% (naïve bayes), and 13.19–10.98% (knn), than
other estimates. On the contrary, ROC (area under curve) has shown minute discrep-
ancy in case of naïve bayes algorithm with definite variation in its values from 0.994
to 0.978.
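As a sketch (assuming the third-party imbalanced-learn package, which the chapter does not mention), SMOTE can be chained with a classifier in a pipeline so that the oversampling happens only inside each training fold:

```python
# Sketch of SMOTE oversampling with imbalanced-learn: the sampler sits inside
# a pipeline so that synthetic minority samples are generated only on the
# training folds of the 10-fold cross validation, never on the test fold.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)

smote_nb = Pipeline([
    ("smote", SMOTE(random_state=42)),  # oversample every minority class
    ("clf", GaussianNB()),
])
print(cross_val_score(smote_nb, X, y, cv=10, scoring="accuracy").mean())
```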

2.2.2 Empirical Outcomes of Base Classifiers with Undersampling Method

After successfully deploying spread subsampling (undersampling technique) over


real pedagogical dataset, the underneath Table 2.3 puts forth the results. The under-
sampling method has depicted excellent forecast correctness in case of knn classifier
from 91.80% to 93.94% which has exemplified supremacy over knn using oversam-
pling technique (91.80–92.79%).

Table 2.3 Demonstrates results with undersampling method


Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
J48 | 92.67 | 7.32 | 0.927 | 0.037 | 0.925 | 0.926 | 0.924 | 0.955 | 11.68
Random tree | 88.95 | 11.04 | 0.890 | 0.055 | 0.888 | 0.889 | 0.896 | 0.918 | 16.65
Naïve Bayes | 95.85 | 4.14 | 0.959 | 0.021 | 0.960 | 0.959 | 0.959 | 0.996 | 7.01
KNN | 93.94 | 6.05 | 0.939 | 0.030 | 0.939 | 0.937 | 0.939 | 0.956 | 9.39

Entire performance estimates connected with knn learning algorithm such as Tp


rate (0.918–0.939), Fp rate (0.030–0.056), precision (0.919–0.939), recall (0.918–
0.937), f -measure (0.917–0.939), ROC area (0.934–0.956), and relative absolute
error (13.19–9.39%) have explained exceptional results, which have been demon-
strated in Table 2.1 (prior to application of undersampling) and Table 2.3 (post
undersampling). Nevertheless, undersampling procedure has demonstrated unpre-
dictability in results across random tree classifier whose performance has declined
from 90.30 to 88.95%. Although the forecast correctness of j48 and naïve bayes
has exemplified significant achievements (92.20–92.67% and 95.50–95.85% correspond-
ingly), the outcomes are not as noteworthy as the oversampling procedure has produced
(Table 2.2).
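WEKA's spread subsampling filter has no direct scikit-learn counterpart; a rough stand-in (an assumption, not the authors' tool) is random undersampling of the majority classes with imbalanced-learn:

```python
# Rough stand-in for spread subsampling: RandomUnderSampler shrinks the
# majority classes down to the minority-class size before the classifier is
# trained, again wrapped in a pipeline and scored with 10-fold CV.
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)

under_knn = Pipeline([
    ("under", RandomUnderSampler(random_state=42)),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])
print(cross_val_score(under_knn, X, y, cv=10, scoring="accuracy").mean())
```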

2.3 Bagging Approach

Under this subsection, bagging has been utilized using various classifiers that are
highlighted in Table 2.4. Nevertheless, after employing bagging, the prediction accu-
racy has demonstrated paramount success over base learning mechanism. The cor-
rectly classified rate in Table 2.4 when contrasted with initial prediction rate of
different classifiers in Table 2.1 have shown substantial improvement in three learn-
ing algorithms such as j48 (92.20–94.87%), random tree (90.30–94.76%), and knn
(91.80–93.81%).
In addition, the incorrectly classified instances have come down to considerable
level in these classifiers, and as a consequence, supplementary parameters viz. Tp
rate, Fp rate, precision, recall, ROC area, and f -measure related with these classifiers
have also rendered admirable results. However, naïve bayes has not revealed any
significant achievement in prediction accuracy with bagging approach, and moreover,
relative absolute error associated with each meta classifier has augmented while
synthesizing different classifiers.
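A minimal scikit-learn sketch of the bagging meta-learner (10 bootstrap replicates combined by majority voting), again on synthetic stand-in data rather than the chapter's pedagogical dataset:

```python
# Bagging a j48-like decision tree with scikit-learn's BaggingClassifier:
# each of the 10 estimators is trained on a bootstrap sample and the final
# prediction is the majority vote, evaluated here with 10-fold CV.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)

bagged_tree = BaggingClassifier(DecisionTreeClassifier(criterion="entropy"),
                                n_estimators=10, random_state=42)
print(cross_val_score(bagged_tree, X, y, cv=10, scoring="accuracy").mean())
```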

Table 2.4 Shows results using bagging approach


Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
Bag. with J48 | 94.87 | 5.12 | 0.949 | 0.035 | 0.949 | 0.947 | 0.948 | 0.992 | 14.55
Bag. with Random tree | 94.76 | 5.23 | 0.948 | 0.036 | 0.948 | 0.946 | 0.947 | 0.993 | 16.30
Bag. with Naïve Bayes | 95.32 | 4.67 | 0.953 | 0.031 | 0.954 | 0.953 | 0.952 | 0.993 | 8.89
Bag. with KNN | 93.81 | 6.18 | 0.938 | 0.042 | 0.939 | 0.937 | 0.938 | 0.983 | 11.63

Table 2.5 Displays results of bagging method with SMOTE


Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
Bag. with J48 | 95.21 | 4.78 | 0.952 | 0.026 | 0.953 | 0.951 | 0.952 | 0.994 | 11.91
Bag. with Random tree | 95.21 | 4.79 | 0.952 | 0.026 | 0.954 | 0.951 | 0.951 | 0.996 | 13.11
Bag. with Naïve Bayes | 95.15 | 3.84 | 0.962 | 0.020 | 0.963 | 0.962 | 0.961 | 0.996 | 7.01
Bag. with KNN | 94.68 | 5.31 | 0.947 | 0.028 | 0.948 | 0.947 | 0.946 | 0.988 | 9.49

2.3.1 Bagging After SMOTE

When the oversampling method (SMOTE) was applied on the ensemble of each algorithm,
viz. j48, random tree, naïve bayes, and knn, with the bagging system, the results attained
afterwards have explicated considerable accuracy in its prediction, and the statistical
figures are represented in Table 2.5.
The results have shown improvement not only in correctly and
wrongly classified instances, Tp rate, Fp rate, precision, and so on, but more notice-
ably in the relative absolute error, which had shown inconsistency earlier. However,
naïve bayes with bagging method again has not shown any development in its predic-
tion accuracy. Nevertheless, misclassified instances, relative absolute error, and other
performance estimates have demonstrated substantial growth. Furthermore, bagging
with j48 classifier delivered best forecasting results among other classifiers while
comparing entire set of parameters with ensemble technique (bagging).

Table 2.6 Explains outcomes of bagging method with undersampling method


Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
Bag. with J48 | 95.43 | 4.56 | 0.954 | 0.023 | 0.955 | 0.953 | 0.954 | 0.995 | 13.24
Bag. with Random tree | 94.79 | 5.20 | 0.948 | 0.026 | 0.949 | 0.947 | 0.948 | 0.995 | 13.98
Bag. with Naïve Bayes | 96.07 | 3.92 | 0.961 | 0.020 | 0.963 | 0.961 | 0.961 | 0.997 | 6.90
Bag. with KNN | 92.99 | 7.00 | 0.930 | 0.035 | 0.932 | 0.929 | 0.930 | 0.985 | 10.76

2.3.2 Bagging After Spread Subsampling

Bagging procedure, when deployed with undersampling technique (Spread subsam-


pling), has shown advancement in prediction accuracy with two classifiers, namely
j48 (95.43%) and naïve bayes (96.07%) that are referenced in Table 2.6. Using
undersampling method, naïve bayes has generated paramount growth from 95.15 to
96.07% in distinction to earlier results (Table 2.5) acquired with SMOTE technique,
and relative absolute error has reduced to statistical value of 6.90%. On the contrary,
bagging with random tree and knn have produced relatively significant results but
with less precision in comparison to bagging with oversampling approach.
Figure 2.1 summarizes the precision and relative absolute error of miscellaneous
learning algorithms under application of different approaches viz. bagging without
subject to filtering process, bagging with both SMOTE, and spread subsampling.
Among all classifiers with inclusion of bagging concept and without employment
of filtering procedures, naïve bayes performed outstanding with 95.32%. Further-
more, with oversampling technique (SMOTE), the ensemble of identical classifiers
produced results relatively with same significance. However, by means of undersam-
pling technique, naïve bayes once again achieved exceptional prediction accuracy of
96.07%. Moreover, the below-mentioned figure symbolizes relative absolute error of
entire bagging procedures, and consequently among classifiers, naïve bayes has gen-
erated admirable results with minimum relative absolute errors of 8.89% (without
filtering process), 7.01% (SMOTE), and 6.90% (spread subsampling).

2.4 Conclusion

In this research study, the central focus has been early prediction of student’s out-
come using various individual (base) and meta classifiers to provide timely guidance
for weak students. The individual learning algorithms employed across pedagogical

Fig. 2.1 Visualizes the results after deployment of different methods

data including j48, random tree, naïve bayes, and knn which have evidenced phe-
nomenal prediction accuracy of student’s final outcomes. Among each base learning
algorithms, naïve bayes attained paramount accuracy of 95.50%. As the dataset
in this investigation was imbalanced, which could have otherwise culminated in
inaccurate and biased outcomes, the academic dataset was subjected to filter-
ing approaches, namely the synthetic minority oversampling technique (SMOTE) and
spread subsampling.
In this contemporary study, a comparative analysis was conducted with base and
meta learning algorithms, followed by oversampling (SMOTE) and undersampling
(spread subsampling) techniques, to get comprehensive knowledge of which classifiers
can be more precise and decisive in generating predictions. The above-mentioned
base learning algorithms were subjected to the oversampling and under-
sampling methods. The naïve bayes yet again demonstrated a noteworthy improve-
ment of 97.15% after being practiced with the oversampling technique. With the undersampling
technique, knn showed exceptional improvement of 93.94% in prediction accuracy
over other base learning algorithms. However, in case of ensemble learning such as
bagging, among all classifiers bagging with naïve bayes accomplished convincing
correctness of 95.32% in predicting the exact instances.
When the bagging algorithm was put into effect with techniques such as oversam-
pling and undersampling, the ensembles generated from the classifiers viz. j48 and naïve
bayes demonstrated significant accuracy and the least classification error (95.21% for
bagging with j48 and 96.07% for bagging with naïve bayes, respectively).

References

Ahmed ABED, Elaraby IS (2014) Data mining: a prediction for student’s performance using clas-
sification method. World J Comput Appl Technol 2(2):43–47
Ali KM, Pazzani MJ (1996) Error reduction through learning multiple descriptions. Mach Learn
24(3):173–202

Ashraf M et al (2017) Knowledge discovery in academia: a survey on related literature. Int J Adv
Res Comput Sci 8(1)
Ashraf M, Zaman M (2017) Tools and techniques in knowledge discovery in academia: a theoretical
discourse. Int J Data Min Emerg Technol 7(1):1–9
Ashraf M, Zaman M, Ahmed Muheet (2018a) Using ensemble StackingC method and base classi-
fiers to ameliorate prediction accuracy of pedagogical data. Proc Comput Sci 132:1021–1040
Ashraf M, Zaman M, Ahmed M (2018b) Using predictive modeling system and ensemble method
to ameliorate classification accuracy in EDM. Asian J Comput Sci Technol 7(2):44–47
Ashraf M, Zaman M, Ahmed M (2020) An intelligent prediction system for educational data mining
based on ensemble and filtering approaches. Proc Comput Sci 167:1471–1483
Ashraf M, Zaman M, Ahmed M (2018c) Performance analysis and different subject combinations:
an empirical and analytical discourse of educational data mining. In: 8th international conference
on cloud computing. IEEE, data science & engineering (confluence), p 2018
Ashraf M, Zaman M, Ahmed M (2019) To ameliorate classification accuracy using ensemble vote
approach and base classifiers. Emerging technologies in data mining and information security.
Springer, Singapore, pp 321-334
Bartlett P, Shawe-Taylor J (1999) Generalization performance of support vector machines and other
pattern classifiers. Advances in Kernel methods—support vector learning, pp 43–54
Brazdil P, Gama J, Henery B (1994) Characterizing the applicability of classification algorithms
using meta-level learning. In: European conference on machine learning. Springer, Berlin, Heidelberg, pp 83–102
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. ICML 96:148–156
Bruzzone L, Cossu R, Vernazza G (2004) Detection of land-cover transitions by combining multidate
classifiers. Pattern Recogn Lett 25(13):1491–1500
Bühlmann P, Yu B (2003) Boosting with the L2 loss: regression and classification. J Am Stat Assoc
98(462):324–339
Dimitriadou E, Weingessel A, Hornik K (2003) A cluster ensembles framework, design and appli-
cation of hybrid intelligent systems
Geman S, Bienenstock E, Doursat R (1992) Neural networks and the bias/variance dilemma. Neural
Comput 4(1):1–58
Leigh W, Purvis R, Ragusa JM (2002) Forecasting the NYSE composite index with technical
analysis, pattern recognizer, neural network, and genetic algorithm: a case study in romantic
decision support. Decision Support Syst 32(4):361–377
Maimon O, Rokach L (2004) Ensemble of decision trees for mining manufacturing data sets. Mach
Eng 4(1–2):32–57
Mangiameli P, West D, Rampal R (2004) Model selection for medical diagnosis decision support
systems. Decision Support Syst 36(3):247–259
Mukesh K, Salal YK (2019) Systematic review of predicting student’s performance in academics.
Int J Eng Adv Techno 8(3): 54–61
Opitz D, Maclin R (1999) Popular ensemble methods: an empirical study. J Artif Intell Res 11:169–
198
Pfahringer B, Bensusan H, Giraud-Carrier CG (2000) Meta-Learning by land-marking various
learning algorithms. In: ICML, pp 743–750
Salal YK, Abdullaev SM (2020, December) Deep learning based ensemble approach to predict
student academic performance: case study. In: 2020 3rd International conference on intelligent
sustainable systems (ICISS) (pp 191–198). IEEE
Salal YK, Hussain M, Paraskevi T (2021) Student next assignment submission prediction using a
machine learning approach. Adv Autom II 729:383
Salzberg SL (1994) C4. 5: programs for machine learning by J. Rossquinlan. Mach Learn 16(3):235–
240
Sidiq SJ, Zaman M, Ashraf M, Ahmed M (2017) An empirical comparison of supervised classifiers
for diabetic diagnosis. Int J Adv Res Comput Sci 8(1)

Tan AC, Gilbert D, Deville Y (2003) Multi-class protein fold classification using a new ensemble
machine learning approach. Genome Inform 14:206–217
Chapter 3
Patient’s Medical Data Security via Bi
Chaos Bi Order Fourier Transform

Bharti Ahuja and Rajesh Doriya

Abstract Telemedicine is used in wireless communication networks to detect and


treat illnesses of patients in isolated areas. The transfer of electronic patient data is
also considered one of the most critical and confidential data in information sys-
tems, but because of the lack of protection standards in online communication, medical
information is often vulnerable to hackers and others. Therefore, sending medical
information over the network needs a powerful encryption algorithm such that it is
immune against various online cryptographic attacks. Among the three security objec-
tives for the security of data frameworks, to be specific privacy, trustworthiness,
and accessibility, privacy is the most significant perspective and should be taken
care of. It is also very important to ensure the protection of patients’ privacy. In
this paper, we have combined the two chaotic maps for increasing the complexity
level and blend it with the fractional Fourier transform with order m. The simulation
is done on MATLAB platform, and the analysis is done through the PSNR, MSE,
SSIM, and Correlation Coefficient.

3.1 Introduction

Encryption innovation is an ordinarily utilized strategy to encode the media’s com-


puterized content. With the guide of a key, the information is encoded before trans-
mission and unscrambled at the receiver's end. The content cannot be decoded
by anybody except the person who has the key. The message is known as plain text
and the scrambled content is known as cipher text. The data is protected at the time
of transmission. In any case, after unscrambling, the data gets unpro-
tected and it very well may be duplicated and conveyed. The schematic portrayal of
the cryptography is given in Fig. 3.1.
With reference to cryptography and data security, the medical data is always
in a need of utmost care and security because the medical data of the patient is
conventionally prone to external interference and little alteration inside the data may

B. Ahuja (B) · R. Doriya


Department of Information Technology, National Institute of Technology, Raipur, India
e-mail: rajeshdoriya.it@nitrr.ac.in


Fig. 3.1 Schematic representation of cryptography

cause an immensely large change in the final outcome. Therefore, diagnostic computing
is among the most productive fields and has an enormously large impact on the
healthcare trade (Roy et al. 2019). Biomedical data security is therefore one of the
main challenges and is mandatory for remote health care. This is useful in the transfer
of electronic images through telecommunications with technical developments in the
contact network and Internet. The image will simply be obtained by cybercriminals
due to the lack of security levels on the web.
Correspondingly, in this phase, image coding innovation is a primary concern.
The increasing dissemination of clinical images across networks has become a crit-
ical way of life in healthcare systems, benefiting from the exponential advances of
network technology and the excellent advantages of visual medical images of health
care (Zhang et al. 2015).
As medical pictures are the secret cognizance of patients, the way to ensure their
safe preservation and delivery across public networks has thus become a critical
problem for rational clinical applications. Electronic medical picture delivery is also
vulnerable to hackers and alternate data breaches (Akkasaligar and Biradarand 2016).
Medical evidence such as scanned medical photographs and safe transfer of elec-
tronic health information is the main necessity of telemedicine. There is an opportu-
nity during transmission for hackers to target the medical records for data manipulation.
As a consequence, an incorrect call (Priya and Santhi 2019) is given by the desig-
nated procedure. Exposure of the medical information can also hold back its
transmission.
Many ways and methods have been reported in this regard in the past few years.
The authors have given a method in (Zhang et al. 2015) which encrypts and com-
presses the medical image at the same time with a compressive sensing and permutation
approach.
In paper (Akkasaligar and Biradarand 2016), another procedure to safeguard the
electronic clinical pictures is intended to use chaos theory and a DNA cryp-
tographic plan. Here, the input picture is encoded into two DNA coded matrices; at
that point, Chen’s hyper chaotic map dependent on DNA encoded lattice and Lorenz
map are utilized to produce the chaotic arrangements freely for even pixel dependent
on DNA encoded matrices.
An encoding method utilizing edge maps generated from a source image is shown in (Cao et al. 2017).
There are three components of the algorithm:

random value generator, bit-plane decomposition, and permutation. In paper (Ali and
Ali 2020), a new medical image signcryption scheme is introduced that fulfills the
necessary safety criteria of confidential medical information during its contact. The
design of the latest theme for medical data signcryption is derived from the hybrid
cryptographic combination. It uses a type of elliptic curve cryptography configured
with public key cryptography for private key encoding.
A chaos coding for clinical image data is introduced in paper (Belazi et al. 2019).
A blend of chaotic and DNA computation is proposed, followed by a secret key
generator along with the permutation combination.
Among all types of encryption methods classification, the chaotic map-based
techniques are more efficient especially when we talk of the digital image encryption.
There are various chaotic maps developed by the researchers and mathematician
such as, Logistic map, Arnold map, Henon map, Tinkerbell map, etc. The chaotic
maps give prominent results in security because of the sensitivity towards the initial
condition.
In this paper, we have combined the two chaotic maps for increasing the com-
plexity level and blend it with the fractional Fourier transform with order m. The rest
of the paper is organized as follows: In part two, the proposed method is described
with the definitions of the term used. The results and analysis are done in the third
Sect. 3.3. Finally, the conclusion is made in the fourth part.

3.2 Proposed Methodology

3.2.1 Fractional Fourier Transform (FRFT)

The FRFT is derived from the classical Fourier transform and having an order ‘m.’
Here, the usage of an extra parameter ‘m’ is significant in the sense that it makes the
FRFT more robust than the classical in terms of enhancing the applications (Ozaktas
et al. 2001; Tao et al. 2006). It is represented as:

x(u) = \int_{-\infty}^{\infty} x(t)\, K_{\gamma}(t, u)\, dt    (3.1)

The inverse form of FRFT is further represented as:

x(t) = \int_{-\infty}^{\infty} x(u)\, K_{-\gamma}(u, t)\, du    (3.2)

Fig. 3.2 Time-frequency plane

where

\gamma = \frac{m\pi}{2}    (3.3)

Let Fγ denote the operator corresponding to the FRFT of angle γ . Under this
notation, some of the important properties of the FRFT operator are listed below
with time frequency plane in Fig. 3.2.
1. For γ = 0, i.e., m = 0, we get the identity operator: F^0 = F^4 = I
2. For γ = π/2, i.e., m = 1, we get the Fourier operator: F^1 = F
3. For γ = π, i.e., m = 2, we get the reflection (parity) operator: F^2 = FF
4. For γ = 3π/2, i.e., m = 3, we get the inverse Fourier operator: F^3 = FF^2 = F^{-1}
FRFT computation involves following steps:
1. A product by a Chirp.
2. A Fourier transforms.
3. Another product by a Chirp.
4. A product by a complex amplitude factor.
Properties of Fractional Fourier Transform explained in Table 3.1. Different
parameters have been used for the performance evaluation of various classes of
discrete fractional Fourier transform (DFRFT).
• Direct form of DFRFT
• Improved Sampling type DFRFT
• Linear Combination type DFRFT
• Eigen Vector Decomposition type DFRFT

Table 3.1 Properties of fractional fourier transform


Property | Relation
Integer order | F^j = (F)^j
Inverse | (F^α)^{-1} = F^{-α}
Unitary | (F^α)^{-1} = (F^α)^H
Index additivity | F^{α1} F^{α2} = F^{α1+α2}
Commutativity | F^{α2} F^{α1} = F^{α1} F^{α2}
Associativity | F^{α3} (F^{α2} F^{α1}) = (F^{α3} F^{α2}) F^{α1}

Table 3.2 Comparison for different types of DFRFT


Properties | Direct | Improved | Linear | Eigenfunction | Group | Impulse
Reversible | No | No | Yes | Yes | Yes | Yes
Closed form | Yes | Yes | Yes | No | Yes | Yes
Similarity with CFRFT | Yes | Yes | No | Yes | NA | Yes
FFT | NA | 2 FFT | 1 FFT | NA | 2 FFT | 2 FFT
Constraints | Less | Middle | Unable | Less | Much | Much
All orders | Yes | Yes | Yes | Yes | No | No
Properties | Less | Middle | Middle | Less | Many | Many
Additive | No | Convertible | Yes | Yes | Yes | Yes
DSP | No | Yes | Yes | Yes | Yes | Yes

• Group Theory type DFRFT


• Impulse Train type DFRFT.
The main disadvantage of the direct form of DFRFT and the linear combination type
DFRFT is that they are not reversible and additive. They are not similar to the continuous FRFT.
Both of these types lose various important characteristics of continuous FRFT. This
concludes that both these DFRFT are not useful for DSP applications.
The analysis of remaining four classes of DFRFT, i.e., improved sampling type,
group theory type, impulse train type, and eigenvector decomposition type is dis-
cussed here. Comparison for different types of DFRFT is discussed in Table 3.2.
The FRFT is an individual from a more broad class of changes that are at times
called linear canonical transformations. Individuals from this class of transforms can
be separated into a progression of less complex tasks, for example, chirp convolution,
chirp multiplication, scaling, and standard Fourier transforms.
For the FRFT of a signal x(t), the following steps could be used to calculate it.
We chose to break down the fractional transformation into a multiplication by a chirp,
followed by a convolution with a chirp, followed by yet another multiplication by a
chirp. First, multiply the function x(t) by a chirp function u(t) as below:

g(t) = u(t) \cdot x(t)    (3.4)

The two-dimensional FRFT is calculated in a simple manner for the M × N


matrices: The one-dimensional FRFT is applied for every matrix row and for every
corresponding column, so the generalization of the FRFT for the 2D image is defined
as (Ahuja and Lodhi 2014):

Y_{\alpha\beta}(p, q) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} k_{\alpha\beta}(p, q; r, s)\, y(r, s)\, dr\, ds    (3.5)

where

k_{\alpha\beta}(p, q; r, s) = k_{\alpha}(p, r)\, k_{\beta}(q, s)    (3.6)

3.2.2 Chaotic Functions

A secure encryption system must normally have the following basic features:
(1) Able to convert the message into a random encrypted text or cipher;
(2) Extreme sensitivity towards the secret key.
The chaos system has few common features as stated above, including the pseudo-
random sensitivity of the initial state, also the parametric sensitivity (Avasare and
Kelkar 2015). Many studies, therefore, were based on applications for mapping
discrete chaotic maps for the cryptography in recent years; still numerous chaotic
systems have their specific fields of study that are fitting for different circumstances.
However, because of certain intrinsic image characteristics, such as the ability for
mass data and high pixel associations, existing algorithms alone for cryptography
are not sufficient for the realistic encryption of photos.
Chaos in data security is an unpredictable and seemingly irregular mechanism that
arises inside dynamic nonlinear frameworks. The chaotic component is sensi-
tive to the initial condition, unstable yet deterministic. Numerous chaotic
functions are utilized in encryption. We are utilizing chaotic maps here, i.e., the logistic
and Arnold cat maps.
Logistic map is the general chaotic function which is utilized to have the long key
space for enhanced security as it increases the randomness (Ahuja and Lodhi 2014).
It is stated as:

y_{n+1} = u\, y_n (1 - y_n)    (3.7)

where yn ∈ (0, 1) is the iterative value. When parameter of operation u is in interval


(3.5499, 4), it signifies the system in an unpredictable state, and a small variance
leads to a spontaneous shift in iterative value.
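A small sketch of iterating Eq. (3.7) to produce a pseudo-random byte stream follows (using the parameter values u = 3.9 and y0 = 0.1 reported in Sect. 3.3; the quantisation to bytes is an illustrative choice, not the authors' exact construction):

```python
# Logistic-map keystream, Eq. (3.7): each iterate y in (0, 1) is quantised to
# one byte, which can then be combined with pixel values during encryption.
import numpy as np

def logistic_keystream(length, u=3.9, y0=0.1):
    y = y0
    stream = np.empty(length, dtype=np.uint8)
    for i in range(length):
        y = u * y * (1.0 - y)
        stream[i] = int(y * 256) % 256    # quantise the iterate to 8 bits
    return stream

ks = logistic_keystream(512 * 512)        # one byte per pixel of a 512 x 512 image
```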

Fig. 3.3 Diagram of Arnold cat map

Arnold Cat map is a picture encryption technique that is finished by permuting


the pixel positions of the picture without changing the size and picture histogram
(Rachmawanto et al. 2019). This strategy encodes a matrix with equal width
and length. The consequence of Arnold map encryption is affected by the quantity
of cycles and two positive integers as information sources. Arnold’s Cat
Map is a 2D chaotic map and is characterized as (Wang and Ding 2018):
    
\begin{bmatrix} x(n+1) \\ y(n+1) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} x(n) \\ y(n) \end{bmatrix} \pmod{1}    (3.8)

Graphical representation of Arnold map is displayed in Fig. 3.3.
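A sketch of Eq. (3.8) in its integer form on an N × N pixel grid is given below (the image array and iteration count are placeholders; decryption would iterate the inverse map):

```python
# One or more Arnold cat-map iterations, Eq. (3.8), on a square image:
# pixel (x, y) moves to ((x + y) mod N, (x + 2y) mod N), a bijection that
# scrambles positions without changing the histogram.
import numpy as np

def arnold_cat(img, iterations=1):
    n = img.shape[0]                      # image must be n x n
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    out = img.copy()
    for _ in range(iterations):
        scrambled = np.empty_like(out)
        scrambled[(x + y) % n, (x + 2 * y) % n] = out[x, y]
        out = scrambled
    return out

demo = arnold_cat(np.arange(64).reshape(8, 8), iterations=3)
```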

3.2.3 Algorithm

The proposed algorithm contains two processes: sender’s side process or encryption
algorithm and receiver’s side process or decryption algorithm. Algorithms are shown
in Fig. 3.4 and 3.5.

Encryption Algorithm:
Step 1 At sender’s side, take medical image and apply Arnold and logistic chaotic
map to the image.
Step 2 Apply discrete fractional Fourier transform (Simulation is done with order
of parameter a = 0.77 and b = 0.77) as a secret key.
Step 3 This transformed image is an encrypted image.

Fig. 3.4 Encryption algorithm

Fig. 3.5 Decryption algorithm



Fig. 3.6 Input medical image 1 and its histogram

Decryption Algorithm:
Step 1 At receiver’s side, apply inverse discrete fractional Fourier transform (Simu-
lation is done with order of parameter a = 0.77 and b = 0.77) to the encrypted
image.
Step 2 Remove logistic and apply inverse Arnold Cat map to get decrypted image.

3.3 Results and Facts

This segment contains two images (Medical image 1 and Medical image 2) for testing
purpose with a resolution of 512 × 512.
Software Version: MATLAB 2016a. The parameters used in the proposed system
for the simulation are as follows: a = 0.77 and b = 0.77 for the FRFT, and u = 3.9 and y0 = 0.1
for the logistic map, respectively.
Figures 3.6, 3.7, and 3.8 describe the computational outcomes after MATLAB
simulation. In these figures, input medical image 1, encrypted image, decrypted
image, and their histograms are shown.
Figures 3.9, 3.10 and 3.11 describe the computational outcomes after MATLAB
simulation. In these figures input medical image 2, encrypted image, decrypted image
and their histograms are shown.
For testing the efficacy of the system, some of the popular metrics such as PSNR,
MSE, SSIM, and correlation coefficient (CC) would be tested.

Fig. 3.7 Encrypted medical image 1 and its histogram

Fig. 3.8 Decrypted medical image 1 and its histogram

PSNR: A general understanding of the complexity of the encryption is provided by


peak signal to noise ratio. To have a reasonable encryption, PSNR ought to be as high
as could be expected under the circumstances (Zhang 2011). The PSNR in scientific
structure is expressed as;
 
\mathrm{PSNR} = 10 \log_{10}\left(\frac{256 \times 256}{\mathrm{MSE}}\right)    (3.9)

MSE: The distinction between the comparing pixel esteems in the real image and the
encrypted image well defines the mean square error (Salunke and Salunke 2016). In
order to get a reliable encryption, the mean square error will be as small as possible.

Fig. 3.9 Input medical image 2 and its histogram

Fig. 3.10 Encrypted medical image 2 and its histogram

\mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ f(i, j) - f'(i, j) \right]^{2}    (3.10)

SSIM: The structural similarity index is used to calculate the relation between an
original image and a reconstructed one. The SSIM should be described as (Horé and
Ziou 2010):

\mathrm{SSIM}(f, g) = l(f, g) \cdot c(f, g) \cdot s(f, g)    (3.11)



Fig. 3.11 Decrypted medical image 2 and its histogram

CC: The correlation coefficient of two neighboring pixels is another significant char-
acteristic of the image. It is for evaluating the degree of linear correlation between
two random variables (Zhang and Zhang 2014).
There is a simple correlation of neighboring pixels in horizontal, vertical, and
diagonal directions for a real image. A strong association between adjacent pixels is
predicted for plain image. And weak association between adjacent pixels is predicted
for cipher images.
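The first three quantities can be written directly in NumPy (a sketch; the peak value 256 × 256 follows Eq. (3.9), and SSIM is left to a library such as scikit-image):

```python
# NumPy sketches of MSE (Eq. 3.10), PSNR (Eq. 3.9) and the horizontal
# adjacent-pixel correlation coefficient reported in Table 3.4. The images
# are assumed to be 2-D uint8 arrays of identical size.
import numpy as np

def mse(f, g):
    return np.mean((f.astype(np.float64) - g.astype(np.float64)) ** 2)

def psnr(f, g):
    e = mse(f, g)
    return np.inf if e == 0 else 10.0 * np.log10((256.0 * 256.0) / e)

def horizontal_correlation(img):
    a = img[:, :-1].ravel().astype(np.float64)   # each pixel
    b = img[:, 1:].ravel().astype(np.float64)    # its right-hand neighbour
    return np.corrcoef(a, b)[0, 1]
```

With an identical decrypted image the MSE is 0 and the PSNR is reported as infinity, which matches the entries in Table 3.3.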
The simulation results of PSNR, MSE, SSIM, and CC are shown in Tables 3.3
and 3.4.

Table 3.3 Metrics values of medical image


Images | PSNR (dB) | MSE | SSIM
Medical image 1 | Inf | 0 | 1
Medical image 2 | Inf | 0 | 1

Table 3.4 Correlation coefficient values of medical images


Images | Horizontal | Vertical | Diagonal
Medical image 1 (original) | 0.8978 | 0.9049 | 0.8485
Medical image 1 (encrypted) | 0.0906 | 0.0929 | 0.0072
Medical image 2 (original) | 0.9980 | 0.9958 | 0.9937
Medical image 2 (encrypted) | 0.0904 | 0.0929 | 0.0072

Figures 3.12 and 3.13 depict the correlation coefficient diagram for medical image
1 and 2, respectively.

Fig. 3.12 Correlation coefficient diagram of medical image 1

Fig. 3.13 Correlation coefficient diagram of medical image 2



3.4 Conclusion

In this paper, we have used three techniques for medical or clinical image encryption,
i.e., FRFT, logistic map, and Arnold map. The results suggest that the complex
hybrid combination makes the system more robust and secure from the different
cryptographic attacks than these methods alone. The use of a Fourier transform-based
approach with the logistic chaotic map and Arnold map makes this algorithm more
complex and nonlinear and hence difficult to breach. In future
work, the method may be used for the medical data security with the advanced tools
of IoT and machine learning.

References

Ahuja B, Lodhi R (2014) Image encryption with discrete fractional Fourier transform and chaos.
Adv Commun Netw Comput (CNC)
Akkasaligar PT, Biradarand S (2016) Secure medical image encryption based on intensity level using
Chao’s theory and DNA cryptography. In: 2016 IEEE international conference on computational
intelligence and computing research (ICCIC). IEEE, Chennai, pp 1–6
Ali T, Ali RA (2020) Novel medical image signcryption scheme using TLTS and Henon chaotic
map. IEEE Access 8:71974–71992
Avasare MG, Kelkar VV (2015) Image encryption using chaos theory. In: 2015 international con-
ference on communication, information and computing technology (ICCICT). IEEE, Mumbai,
pp 1–6
Belazi A, Talha M, Kharbech S, Xiang W (2019) Novel medical image encryption scheme based
on chaos and DNA encoding. IEEE Access 7:36667–36681
Cao W, Zhou Y, Chen P, Xia L (2017) Medical image encryption using edge maps. Signal Process
132:96–109
Horé A, Ziou D (2010) Image quality metrics: PSNR versus SSIM. In:
20th international conference on pattern recognition. IEEE, Istanbul, pp 2366–2369
Ozaktas M, Zalevsky Z, Kutay MA (2001) The fractional Fourier transform. Wiley, West Sussex, U.K.
Priya S, Santhi B (2019) A novel visual medical image encryption for secure transmission of
authenticated watermarked medical images. Mobile networks and applications. Springer, Berlin
Roy M, Mali K, Chatterjee S, Chakraborty S, Debnath R, Sen S (2019) A study on the applications
of the biomedical image encryption methods for secured computer aided diagnostics. In: Amity
international conference on artificial intelligence (AICAI), Dubai, United Arab Emirates. IEEE,
pp 881–886
Rachmawanto E, De Rosal I, Sari C, Santoso H, Rafrastara F, Sugiarto E (2019) Block-based Arnold
chaotic map for image encryption. In: International conference on information and communica-
tions technology (ICOIACT). IEEE, Yogyakarta, Indonesia, pp 174–178
Salunke BA, Salunke S (2016) Analysis of encrypted images using discrete fractional transforms viz.
DFrFT, DFrST and DFrCT. In: International conference on communication and signal processing
(ICCSP). IEEE, Melmaruvathur, pp 1425–1429
Tao R, Deng B, Wang Y (2006) Research progress of the fractional Fourier transform in signal
processing. Sci China (Ser. F Inf Sci) 49:1–25
Wang C, Ding Q (2018) A new two-dimensional map with hidden attractors. Entropy 20:322
Zhang X (2011) Lossy compression and iterative reconstruction for encrypted image. IEEE Trans
Inf Forensics Secur 6:53–58

Zhang J, Zhang Y (2014) An image encryption algorithm based on balanced pixel and chaotic map.
Math Probl Eng
Zhang L, Zhu Z, Yang B, Liu W, Zhu H, Zou M (2015) Medical image encryption and compression
scheme using compressive sensing and pixel swapping based permutation approach. Math Probl
Eng 2015
Chapter 4
Nepali Word-Sense Disambiguation
Using Variants of Simplified Lesk
Measure

Satyendr Singh, Renish Rauniyar, and Murali Manohar

Abstract This paper evaluates simplified Lesk algorithm for Nepali word-sense
disambiguation (WSD). Disambiguation is performed by computing similarity
between sense definitions and context of ambiguous word. We compute the simi-
larity using three variants of simplified Lesk algorithm: direct overlap,
frequency-based scoring, and frequency-based scoring after dropping target word.
We further evaluate the effect of stop word elimination, number of senses and
context window size on Nepali WSD. The evaluation was carried out on a sense
annotated corpus comprising of 20 polysemous Nepali nouns. We observed overall
average precision and recall of 38.87% and 26.23% using frequency-based scoring
for baseline. We observed overall average precision and recall of 32.23% and
21.78% using frequency-based scoring after dropping target word for baseline. We
observed overall average precision and recall of 30.04% and 20.30% using direct
overlap for baseline.

4.1 Introduction

Polysemy exists in natural languages, as natural languages comprise words
bearing different senses in different contexts. The English noun cricket can mean a
game or insect, based on the context it is being used. For human beings, to interpret
the appropriate sense of the word in given context is easy, using nearby words in
the context of target ambiguous word. For machines, this is a challenging task.
Given a context, the task of identifying the correct meaning of an ambiguous word
computationally is called as word-sense disambiguation (WSD). It is considered as

S. Singh (&)
BML Munjal University, Gurugram, Haryana, India
R. Rauniyar
Tredence Analytics, Bangalore, India
M. Manohar
Gramener, Bangalore, India


an intermediate task in many natural language processing (NLP) applications (Ide


and Veronis 1998). It is one of the major components of machine translation task.
Nepali language is an Indo-Aryan language and is the official language of Nepal,
a south Asian country. It is also listed as official language in India in the state of
Sikkim. It is spoken in Nepal and parts of Bhutan and India. The WSD research for
Nepali as well as Indian languages is constrained due to lack of resources which
includes training and testing corpus. Nepali is similar to Hindi language. Both these
languages share lot of features in their vocabulary and grammar and have subtle
differences as well. As Nepali and Hindi are different languages, the results
obtained on Hindi or other Indian languages for WSD cannot be generalized on
Nepali without evaluation on Nepali Dataset.
Nepali language has words comprising multiple senses in different context. For
example, the Nepali noun “उत्तर” (uttar) has two senses as nouns listed in
IndoWordNet (http://tdil-dc.in/indowordnet/) as given below.
1. उत्तर, जवाब, कुनै प्रश्न या कुरा सुनेर त्यसको समाधानका लागि भनिएको कुरा,
“तपाईले मेरो प्रश्नको उत्तर दिनु भएन”
Uttar, javab, kunai prashna ya kura sunera tyasko samadhanka lagi bhaniyeko
kura, “Tapaile mero prashna uttar dinu bhayena”
Answer, any question or what is said to resolve it, “You did not answer my
question.”
2. उत्तर, उत्तर दिशा, दक्षिण दिशाको अघिल्तिरको दिशा, भारतको उत्तरमा हिमालय
पर्वत विराजमान छ
Uttar, Uttar disha, dakshin, dishako aghiltirako disha, bharatko uttarma
Himalaya parvat virajaman cha
North, north direction, direction opposite to south direction, “North of India is
surrounded by Himalayas”
Given below are two contexts of ‘उत्तर’ (uttar).
Context 1: बालकका विचित्र प्रश् नहरूको उत्तर दिँदा-दिँदै मातापिताहरू त दिक् क पनि
हुन थाल्छन् ।
Balakka vichitra prashnaharuko uttar dida-didai matapitaharu ta dikka pani huna
thalchan
Parents are also troubled by answering the bizarre questions of their child.
Context 2: यस परिषद्अन्तर्गत उत्तर पूर्वी क्षेत्रका सातवटा राज्यहरू आसाम, मणिपुर,
नागाल्यान्ड, मिजोराम, त्रिपुरा, मेघालय र अरुणाचल प्रदेश समावेश गरिए ।
Yash parishadantargat uttar poorvi kshetraka satavata rajyaharu Assam,
Manipur, Nagaland, Mizoram, Tripura, Meghalaya ra Arunanchal Pradesh sama-
vesh garie
Under this council, seven states of northeastern region Assam, Manipur,
Nagaland, Mizoram, Tripura, Meghalaya, and Arunachal Pradesh were included.
Sense 1 of “उत्तर” (uttar) pertains to answer and sense 2 pertains to north
direction. In context 1 the meaning of “उत्तर” (uttar) is answer and in context 2 the
meaning is north direction.

In this work, we evaluate a WSD algorithm for Nepali language. The algorithm
used is based on Lesk (1986) and following Vasilescu et al. (2004), is called
simplified Lesk. The algorithm uses the similarity between context vector and sense
definitions for disambiguation. We further investigate the effects of context window
size, stop word elimination, and number of senses for Nepali WSD. We compare
the results for Nepali WSD using a similar Lesk-like algorithm for Hindi WSD
(Singh et al. 2017). The article is organized as follows: Sect. 4.2 provides the
related work in WSD for English, Hindi, and Nepali languages. The WSD algo-
rithm is discussed in Sect. 4.3. Section 4.4 provides the details of construction of
sense annotated Nepali corpus used in this work. In Sect. 4.5, we provide the
experiments conducted and results, and Sect. 4.6 provides a discussion of the results. In
Sect. 4.7, we present our conclusion.

4.2 Related Work

There are two main categories into which WSD techniques are broadly grouped:
Dictionary-based or knowledge-based and corpus-based. Dictionary-based tech-
niques (Baldwin et al. 2010; Banerjee and Pederson 2002, 2003; Lesk 1986;
Vasilescu et al. 2004) utilize information from lexical resources and machine-
readable dictionaries for disambiguation. Corpus-based techniques utilize corpus,
either sense tagged (supervised) (Gale et al. 1992; Lee et al. 2004; Ng and Lee
1996) or raw corpus (unsupervised) (Resnik 1997; Yarowsky 1995) for
disambiguation.
Lesk (1986) was one of the early and pioneer works on dictionary-based WSD
for English language. He represented dictionary definition obtained from lexicon as
bag of words. He extracted words in sense definition of words, in context of target
ambiguous word. Disambiguation was performed by contextual overlap between
sense and context bag of words. The work in (Agirre and Rigau 1996; Miller et al.
1994) is some other early work utilizing dictionary definitions for WSD. Since
Lesk, several extensions to his work have been proposed (Baldwin et al. 2010;
Banerjee and Pederson 2002, 2003; Gaona et al. 2009; Vasilescu et al. 2004).
Baldwin et al. (2010) reinvestigated and extended the task of machine-readable
dictionary-based WSD. They extended the Lesk-based WSD approach by methods
of definition extension and by applying different tokenization schemes. Evaluation
was carried out on Hinoki Sense bank example sentences and Senseval-2 Japanese
dictionary task. The WSD accuracy uses their approach surpassed both unsuper-
vised and supervised baselines. Banerjee and Pedersen (2002) utilized glosses that
were associated with synset, semantic relations and each word attribute in pair for
disambiguation using English WordNet. In (Banerjee and Pederson 2003), they
explored a novel measure of semantic relatedness that was based on count of
overlap in glosses. Comparative evaluation of original Lesk algorithm was per-
formed by Vasilescu et al. (2004). They observed the performance of adapted Lesk

algorithm to be better in comparison of original Lesk algorithm. Gaona et al. (2009)


utilized word occurrences in gloss and context information for disambiguation.
For Hindi language work on WSD includes (Bhingardive and Bhattacharyya
2017; Jain and Lobiyal 2016; Mishra et al. 2009; Singh and Siddiqui 2016; Singh
and Siddiqui 2012, 2014, 2015; Singh et al. 2013, 2017; Sinha et al. 2004). Sinha
et al. (2004) utilized an algorithm for Hindi WSD based on Lesk. They created
context bag by utilizing neighboring words of target polysemous word and sense
bag by utilizing synonyms, glosses, example sentences, and semantic relations
including hypernyms, hyponyms, meronyms, their glosses, and example sentences.
The winner sense was one that maximized contextual overlap between the sense
and context bag. They evaluated on Hindi Corpora and reported accuracy ranging
from 40 to 70% in their experiment. Jain and Lobiyal (2016) utilized fuzzy graph
connectivity measures for Hindi WSD and proposed fuzzy Hindi WordNet, an
extension of Hindi WordNet. Singh et al. (2013) studied semantic relatedness
measure for Hindi WSD task. The semantic relatedness measure explored in their
work was Leacock Chodorow measure. They reported precision of 60.65%. Singh
and Siddiqui (2014) investigated role of semantic relations for Hindi WSD. The
semantic relations explored in their work were holonym, hypernym, meronym, and
hyponym, and they obtained maximum precision using hyponym as single semantic
relation. Singh and Siddiqui (2012) evaluated a Lesk-based algorithm for Hindi
language WSD. They studied the effect of stemming and stop word removal on
Hindi WSD task. They also investigated the effect of context window size on
Hindi WSD task. Evaluation was performed on a manually sense tagged dataset
comprising of 10 polysemous Hindi nouns. They reported maximum precision of
54.81% after applying stemming and stop word removal. They further observed an
improvement of 9.24% in precision in comparison to baseline performance. In
another work, Singh et al. (2017) explored three variants of simplified Lesk mea-
sure for Hindi word-sense disambiguation. They evaluated the effect of stop word
elimination, context window size, and stemming on Hindi word-sense disam-
biguation. They observed 54.54% as maximum overall precision after dropping
stop words and applying stemming using frequency-based scoring excluding the
target word. Mishra et al. (2009) explored an unsupervised approach for Hindi
word-sense disambiguation. Their approach utilized learning based on decision list
created from untagged instances, while providing some seed instances manually.
They applied stop word elimination and stemming in their work. Evaluation was
carried on 20 ambiguous Hindi nouns; sense inventory being derived from Hindi
WordNet. Singh and Siddiqui (2015) investigated the role of karaka relations on
Hindi WSD task. They utilized two supervised algorithms in their experiment.
Evaluation was obtained on sense annotated Hindi corpus (Singh and Siddiqui
2016). They observed that vibhaktis can be helpful for disambiguation of Hindi
nouns. Bhingardive and Bhattacharyya (2017) explored IndoWordNet for bilingual
word-sense disambiguation for obtaining sense distribution using expectation
maximization algorithm. They also explored IndoWordNet for obtaining most
frequent sense utilizing embeddings drawn from sense and word.

For Nepali language work on WSD includes (Dhungana and Shakya 2014;
Shrestha et al. 2008). Dhungana and Shakya (2014) investigated adapted Lesk-like
algorithm for Nepali WSD. They included synset, gloss, example sentences, and
hypernym of every sense of target polysemous word for creating sense bag. Context
bag was created by extracting all words from whole sentence after dropping
prepositions, articles, and pronouns. Score was computed by contextual overlap of
sense bag and context bag. Evaluation was done on 348 words, 59 being polyse-
mous and they achieved an accuracy of 88.05%. Shrestha et al. (2008) studied the
role of morphological analyzer and machine-readable dictionary for Nepali
word-sense disambiguation using Lesk algorithm. Evaluation was performed on a
small dataset comprising of Nepali nouns and they achieved accuracy values
ranging from 50 to 70%. For Nepali language work on sentiment analysis includes
(Gupta and Bal 2015; Piryani et al. 2020). Gupta and Bal (2015) studied sentiment
analysis of Nepali text. They developed Nepali SentiWordNet named as
Bhavanakos and employed it for detecting sentiment words in Nepali text. They
also trained machine learning classifier using annotated Nepali text for document
classification. Piryani et al. (2020) performed sentiment analysis of tweets in Nepali
text. They employed machine and deep learning models for sentiment analysis of
Nepali text.

4.3 WSD Algorithm

The simplified Lesk algorithm for WSD used in this work is adapted from Singh
et al. (2017) and given in Fig. 4.1. In this algorithm, the score is computed as the
contextual overlap of two bags: the context bag and the sense definition bag. The
sense definition bag comprises the synsets, gloss, and example sentences of the
target word. The context bag is formed by extracting the neighboring words in a
window of size ±n around the target word. The winner sense is the one that
maximizes the overlap of the two bags. To study the effect of context window size,
test runs were computed for window sizes of 5, 10, 15, 20, and 25. To study the
effect of stop word elimination, we dropped stop words from the context vector
before creating the context window. We used three variants to compute the score:
direct overlap, frequency-based scoring, and frequency-based scoring after dropping
the target word. For direct overlap, we counted the number of matching words. For
frequency-based scoring, we computed the frequency of matching words between
the context and sense bags. For frequency-based scoring after dropping the target
word, we computed the same frequency after removing the target word.

1. (a) Keeping the ambiguous word in the middle, create a context vector (CV)
       comprising words in a fixed window of size ±n
   (b) Perform stop word removal on sense definitions and instances and
       create the context vector as in 1(a)
2. for i = 1 to n do                 // n = number of senses
       Create sense definition vector (SVi) for sense i of the target word
       Score_i = Similarity-Overlap(CV, SVi)
3. return the SVi for which the score is maximum

Computing score (Direct Overlap):
   Similarity-Overlap(CV, SV)
       sense_score = 0
       for each word x in CV
           if x is in SV
               sense_score = sense_score + 1
       return sense_score

Computing score (Frequency-based scoring & Frequency-based scoring
after dropping the target word):
   Similarity(CV, SV)
       sense_score = 0
       for each word x in CV
           word_count = frequency of x in SV
           sense_score = sense_score + word_count
       return sense_score

Fig. 4.1 WSD algorithm
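For concreteness, a minimal Python sketch of the scoring variants described above is given below, assuming pre-tokenized text. The function names (direct_overlap, frequency_score, disambiguate) and the window handling are illustrative assumptions, not the authors' implementation; for the third variant, the scorer can be wrapped so that the target word itself is excluded from the frequency count.

from collections import Counter

def direct_overlap(context_bag, sense_bag):
    # Count each distinct context word that also appears in the sense bag
    return sum(1 for w in set(context_bag) if w in sense_bag)

def frequency_score(context_bag, sense_bag, drop_target=None):
    # Sum, over context words, their frequency in the sense bag;
    # optionally ignore the target word itself (third variant)
    sense_counts = Counter(sense_bag)
    return sum(sense_counts[w] for w in context_bag
               if w != drop_target and w in sense_counts)

def disambiguate(tokens, pos, senses, window=10, stop_words=frozenset(),
                 scorer=frequency_score):
    # tokens: tokenized test instance; pos: index of the ambiguous target word
    # senses: dict mapping sense id -> bag of words (synset words, gloss, examples)
    left = [t for t in tokens[:pos] if t not in stop_words]
    right = [t for t in tokens[pos + 1:] if t not in stop_words]
    context_bag = left[-window:] + right[:window]
    best_sense, best_score = None, -1
    for sense_id, sense_bag in senses.items():
        score = scorer(context_bag, sense_bag)
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense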

4.4 Dataset

For evaluating the WSD algorithm, a sense annotated Nepali corpus was created
comprising 20 polysemous Nepali nouns. The sense annotated Nepali corpus is
given in Table 4.1. The sense definitions were obtained from IndoWordNet
(http://tdil-dc.in/indowordnet/), an important lexical resource for Nepali and other
Indian languages. IndoWordNet is maintained at the Centre for Indian Language
Technology (CFILT), Indian Institute of Technology (IIT) Bombay, India. Test
instances were obtained from the Nepali General Text Corpus (http://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=453&lang=en),
a raw Nepali corpus available at the Technology Development for Indian Languages
(TDIL) portal. Test instances were also collected by issuing search queries to
various Web sites containing Nepali text. The sense annotated Nepali corpus was
built using guidelines similar to those of the sense annotated Hindi corpus (Singh
and Siddiqui 2016).
The sense listings in IndoWordNet are fine-grained. Hence, a few fine-grained
senses have been merged in our dataset based on subject evaluation. For example,
the Nepali noun “तिल” (til) has three noun senses in IndoWordNet, as given below.

Table 4.1 Sense annotated Nepali corpus

No. of senses   Nepali nouns
2               उत्तर (uttar), क्रिया (kriya), गोली (goli), ग्रहण (grahan), ताल (taal), तिल (til), दर (dar),
                फल (phal), बोली (boli), शाखा (saakhaa), साँचो (sancho), साल (saal), सीमा (seema), हार (haar)
3               तुलसी (tulsi), धारा (dhaaraa), पुतली (putali), वचन (vachan)
4               टीका (tikaa), बल (bal)

1. तिल, एउटा रूखको बीउ जसबाट तेल निस्कन्छ, “ऊ सधैँ नुहाएपछि तिलको तेल
लगाउँछ”
Til, euta rukhko biu jasbata tel niskancha, “u sadhai nuhayepachi tilko tel
lagaucha.”
Sesame, a tree seed that secretes oils, “he always puts on sesame oil after
bathing”
2. कोठी, थोप्लो, तिल, छालामा हुने कालो वा रातो रङ्गको धेरै सानो प्राकृतिक चिनो अथवा
दाग, उसका गालामा कालो कोठी छ
Kothi, thoplo, til, chaalama hune kalo wa rato rangko dherai sano prakritik chino
athawa daag, uska gaalama kalo kothi cha.
Mole, mole, mole, a very small black or red colored natural identity or spot
present in the skin. He has a black mole on his cheek.
3. कोठी, तिल, कालो वा रातो रङ्गको अलिक उठेको मासुको त्यो दानो जुन शरीरमा कतैकतै
निक्लिने गर्छ, उनको डड्याल्नामा एउटा कालो कोठी छ
Kothi, til, kalo wa rato rangko alik utheko masuko tyo dano jun sarirma katai-
katai niklane garcha, unko ḍaḍyalnma euta kalo kothi cha.
Mole, mole, black or red colored slightly elevated spot which can appear anywhere
on the body. He has a black mole on his back.
For “तिल” (til), sense 1 pertains to the small oval seeds of the sesame plant. Sense 2
pertains to a mole, a small congenital pigmented spot on the skin. Sense 3 pertains to
a mole, a firm abnormal elevated blemish on the skin. The instances of senses 2 and 3
were marked as similar by two subjects; hence, we merged senses 2 and 3. The two
subjects were native speakers of Nepali and undergraduate students of BML Munjal
University, Gurugram, India.
For some senses in our dataset, we could not find sufficient instances, hence we
dropped them. For example, the Nepali noun “क्रिया” (kriya) has five noun senses in
IndoWordNet, as given below.
1. क्रिया, क्रियापद, व्याकरणमा त्यो शब्द जसद्वारा कुनै व्यापार हुनु या गरिनु सूचित
हुन्छ “यस अध्यायमा क्रियामाथि छलफल गरिन्छ”
Kriya, kriyapad, byakaranma tyo sabdha jasdwara kunai byapar hunu ya garinu
suchit huncha, “yas adhyayama kriyamathi chalfal garincha.”

Verb, verb form, in grammar the word that indicates some action being done or
performed, “This chapter discusses verbs”
2. प्रक्रिया, क्रिया, प्रणाली, पद्धति, त्यो क्रिया या प्रणाली जसबाट कुनै वस्तु हुन्छ,
बन्छ या निक्लिन्छ “युरियाको निर्माण रासायनिक प्रक्रियाबाट हुन्छ”
Prakriya, kriya, pranali, paddhati, tyo kriya ya pranali jasbata kunai vastu
huncha, bancha ya niklincha “Ureako nirman Rasayanik prakriyabata huncha”
Process, action, system, method, the action or system from which an object is
made, formed, or derived “Urea is formed by a chemical process”
3. श्राद्ध, सराद्ध, क्रिया, किरिया, कुनै मुसलमान साधु वा पीरको मृत्यु दिवसको कृत्य
“सुफी फकिरको श्राद्धमा लाखौँ मान्छे भेला भए”
Shradh, Saradh, kriya, kiriya, kunai musalman sadhu wa pirko mrityu diwasko
krtiya “Sufi fakirko shradhma lakhau manche bhela bhaye”
Last rites/rituals, Death anniversary, rites, the death anniversary of a Muslim
sage “Millions of people gathered to pay their respects to the Sufi fakir”
4. श्राद्ध, सराद्ध, क्रिया, किरिया, मुसलमान पीरको निर्वाण तिथि “पीर बाबाको श्राद्ध
बडो धुमदामले मनाइयो”
Shradh, Saradh, kriya, kiriya, kunai musalman pirko nirwan tithi “pir babako
shradh badho dhumdhamle manaiyo”
Last rites/rituals, Death anniversary, rites, the last rites/funeral of a Muslim
monk “The death anniversary of Pir Baba was celebrated with great pomp”
5. क्रिया, कुनै कार्य भएको वा गरिएको भाव “दूधबाट दही बनिनु एउटा रासायनिक क्रिया
हो”
Kriya, kunai karya bhyeko wa gariyeko bhab “Dhudhbata dahi baninu euta
rasayanik kriya ho”
Action, the feeling of having or doing an action/the feeling that something has
been done “Getting Yogurt from milk is an action of chemical reaction”
For “क्रिया” (kriya), sense 1 pertains to a content word that denotes an action or a
state, i.e., a verb in Nepali grammar. Sense 2 pertains to a particular course of action
intended to achieve a result. Sense 3 pertains to the death anniversary rites of a
Muslim monk. Sense 4 pertains to the date of a Muslim monk's emancipation.
Sense 5 pertains to something that people do or cause to happen.
Senses 3 and 4 both pertain to the death rites or death anniversary of a Muslim
monk. We could not find instances for these senses of “क्रिया” (kriya), hence we
dropped them from the dataset. Senses 2 and 5 both pertain to a course of action,
hence they were merged.
We also added a few senses that were not available in IndoWordNet. For example,
the Nepali noun “दर” (dar) has three noun senses in IndoWordNet, as given below.

1. अनुपात, दर, मान,माप,उपयोगिता आदिको तुलनाको विचारले एउटा वस्तु अर्को वस्तुसित
रहने सम्बन्ध या अपेक्षा “पुस्तकका लागि लेखकले दुई प्रतिशतको अनुपातले रोयल्टी
भेटिरहेको छ”
Anupat, dar, maan, maap, upayogita adhiko tulanako bicharle euta vastu arko
vastustith rahane sambhandha ya apekchya “Pusktakko lagi lekheko dui pratisatko
anupatle royalty bhetiraheko cha.”
The relation or expectation of one object to be with another by comparing
proportions, rates, values, measurements, utility, etc. “The author is receiving a two
per cent royalty for the book.”
2. मूल्य, दर, दाम, मोल, कुनै वस्तु किन्दा वा बेच्दा त्यसको बदलामा दिइने धन “यस
कारको मूल्य कति हो”
Mulya, dar, daam, moal, kunai vastu kinda wa bechda tyasko badlama diyine
dhan. yash kaarko mulya kati ho
Price, rate, price, value, what is the value/money to be paid in return for buying
or selling an item. “What is the rate of this car?”
3. मूल्य, मोल, दाम, भाउ, दर, कुनै वस्तुको गुण,योग्यता या उपयोगिता जसको आधारमा
उसको आर्थिक मूल्य जाँचिन्छ “हीराको मूल्य जौहारीले मात्रै जान्दछ”
Mulya, moal, daam, bhau, dar, kunai vastuko gun, yogyata ya upayogita jasko
aadharma usko arthik mulya jachincha “Hirako mulya jauharile matrai jandacha”
Price, value, price, price, rate, quality, merit or usefulness of a commodity on the
basis of which its economic value is checked “Only a jeweler knows the value of a
diamond”
For “दर” (dar), sense 1 pertains to a rate or value, sense 2 to a rate or price, and
sense 3 to a rate, value, or price.
For the Nepali noun “दर” (dar), we added the following sense:
दर, तीजमा खईने विशेष खाना, दर खाने दिनबाट तीज सुरु भएको मानिन्छ ।
Dar, teejma khaine vishesh khana, dar khane dinbata teej suru bhayeko
manincha.
Dar, a special food on the occasion of Teej, The Teej festival is assumed to start
from the day after having Dar.
This sense pertains to the special dish prepared on the occasion of the Teej festival,
which is celebrated in India and Nepal.
Senses 1, 2, and 3 all pertain to rate; hence, they have been merged.
Precision and recall were computed for performance evaluation of the WSD
algorithm (Singh et al. 2017). Precision is the ratio of correctly disambiguated
instances to the total test instances answered for a word. Recall is the ratio of
correctly disambiguated instances to the total test instances to be answered
for a word.
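Expressed as formulas (our notation, not from the original paper), for a word w:

\[ \mathrm{Precision}(w) = \frac{\#\,\text{instances of } w \text{ disambiguated correctly}}{\#\,\text{test instances of } w \text{ answered}}, \qquad \mathrm{Recall}(w) = \frac{\#\,\text{instances of } w \text{ disambiguated correctly}}{\#\,\text{test instances of } w \text{ to be answered}} \]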
The sense annotated Nepali corpus comprises 20 polysemous Nepali nouns. The
total number of words in the corpus is 231,830, of which 40,696 are unique. The
corpus contains 3525 instances covering 48 senses. The average number of
instances per word is 176.25, the average number of instances per sense is 73.44,
and the average number of senses per word is 2.4. The transliteration, translation,
and number of instances for every sense of each word in this corpus are provided
in Table 4.10 in the Appendix. The Nepali stop word list used in this work is given
in Fig. 4.2.
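As a check, the stated averages follow directly from the corpus counts: 3525/20 = 176.25 instances per word, 3525/48 ≈ 73.44 instances per sense, and 48/20 = 2.4 senses per word.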

4.5 Experiments and Results

Two test runs were performed to evaluate the algorithm and to study the effect of
stop word elimination. These runs pertained to two cases: without stop word
removal (Case 1), which is also our baseline, and with stop word removal (Case 2).
For each run, results were computed for window sizes of 5, 10, 15, 20, and 25.
Test run 1 (Case 1) corresponds to the baseline, where the overlap is computed
between the context and the sense definitions. For test run 2 (Case 2), stop words
were removed from the sense definitions and the context vector before the
similarity was computed.
The overall average precision and recall for the 20 words, for direct overlap,
frequency-based scoring after dropping the target word, and frequency-based
scoring, for both cases and averaged over context window sizes of 5–25, are given
in Table 4.2. The average precision and recall for the 20 words with respect to
context window size for direct overlap are given in Tables 4.3 and 4.4. Tables 4.5
and 4.6 provide the average precision and recall with respect to context window
size for frequency-based scoring after dropping the target word, and Tables 4.7 and
4.8 give the corresponding values for frequency-based scoring. Table 4.9 provides
the average precision and recall with respect to the number of senses for both cases
and all three variants.

Fig. 4.2 Nepali stop words list

Table 4.2 Overall average precision and recall

Variant                                               Case     Precision   Recall
Direct overlap                                        Case 1   0.3004      0.2030
                                                      Case 2   0.2508      0.1655
Frequency-based scoring after dropping target word    Case 1   0.3223      0.2178
                                                      Case 2   0.2759      0.1824
Frequency-based scoring                               Case 1   0.3887      0.2623
                                                      Case 2   0.3503      0.2347

Table 4.3 Average precision with respect to context window size (direct overlap)

          Context window size
          5        10       15       20       25
Case 1    0.2111   0.2740   0.3148   0.3446   0.3574
Case 2    0.1724   0.2211   0.2663   0.2871   0.3072

Table 4.4 Average recall with respect to context window size (direct overlap)

          Context window size
          5        10       15       20       25
Case 1    0.1466   0.1872   0.2142   0.2308   0.2362
Case 2    0.1186   0.1484   0.1743   0.1868   0.1995

Table 4.5 Average precision with respect to context window size (frequency-based scoring after dropping target word)

          Context window size
          5        10       15       20       25
Case 1    0.2357   0.2977   0.3310   0.3621   0.3851
Case 2    0.1990   0.2447   0.2904   0.3143   0.3311

Table 4.6 Average recall with respect to context window size (frequency-based scoring after dropping target word)

          Context window size
          5        10       15       20       25
Case 1    0.1627   0.2039   0.2256   0.2427   0.2539
Case 2    0.1359   0.1638   0.1905   0.2055   0.2160

Table 4.7 Average precision with respect to context window size (frequency-based scoring)

          Context window size
          5        10       15       20       25
Case 1    0.3212   0.3776   0.3959   0.4151   0.4337
Case 2    0.2958   0.3319   0.3606   0.3777   0.3853

Table 4.8 Average recall with respect to context window size (frequency-based scoring)

          Context window size
          5        10       15       20       25
Case 1    0.2173   0.2543   0.2690   0.2812   0.2895
Case 2    0.2003   0.2228   0.2413   0.2523   0.2567

4.6 Discussion

The maximum overall precision and recall of 38.87% and 26.23% were observed for
frequency-based scoring in the baseline case, as seen in Table 4.2. With stop word
elimination, frequency-based scoring gave an overall precision and recall of 35.03%
and 23.47%. For direct overlap in the baseline case, we observed an overall average
precision and recall of 30.04% and 20.30%; with stop word removal, these dropped
to 25.08% and 16.55%. For frequency-based scoring after dropping the target word,
the overall average precision and recall were 32.23% and 21.78% for the baseline
case and 27.59% and 18.24% with stop word removal.
A decrease in precision is observed with stop word elimination (Case 2) over the
baseline (Case 1). For direct overlap, precision decreased by 16.51% after stop word
elimination relative to the baseline. For frequency-based scoring, the decrease was
9.88%, and for frequency-based scoring after dropping the target word, it was
14.40%.
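These figures are relative decreases computed from the Table 4.2 values; for direct overlap, for example,

\[ \frac{0.3004 - 0.2508}{0.3004} \approx 0.1651 = 16.51\% \]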
The results in Tables 4.3, 4.5, and 4.7 suggest that increasing the context window
size enhances the likelihood of disambiguating the correct sense. On increasing the
window size, more content words are included in the context vector, some of which
may be strong indicators of a particular sense.
As the number of senses (classes) increases, the possibility of correct
disambiguation decreases in general, as seen in Table 4.9. There were 14 words with
2 senses, 4 words with 3 senses, and 2 words with 4 senses. We observed the
maximum precision for words with 2 senses, followed by words with 3 and 4
senses.
Comparing the results of Nepali WSD with similar work on Hindi WSD (Singh
et al. 2017), we observe an overall decrease in precision for Nepali.

Table 4.9 Average precision and recall with respect to number of senses

                                                          Precision                  Recall
                                                          Number of senses           Number of senses
Variant                                         Case      2        3        4        2        3        4
Direct overlap                                  Case 1    0.3321   0.2598   0.1591   0.2140   0.2004   0.1310
                                                Case 2    0.2666   0.2419   0.1582   0.1645   0.1879   0.1277
Frequency-based scoring after dropping          Case 1    0.3575   0.2795   0.1619   0.2317   0.2147   0.1261
target word                                     Case 2    0.2973   0.2466   0.1848   0.1851   0.1915   0.1445
Frequency-based scoring                         Case 1    0.4534   0.2759   0.1614   0.2942   0.2130   0.1371
                                                Case 2    0.4134   0.2333   0.1421   0.2659   0.1832   0.1192

In the work reported on Hindi WSD (Singh et al. 2017), the maximum overall
average precision obtained with the simplified Lesk algorithm was 54.54%, using
frequency-based scoring excluding the target word after applying stemming and
stop word removal. The overall average precision for the baseline and for stop word
removal was 49.40% and 50.64%, respectively, using frequency-based scoring
excluding the target word. Moreover, for Hindi WSD an increase in precision was
observed after stop word removal over the baseline.
The Nepali language has a complex grammatical structure. Root words in Nepali
are often suffixed with markers such as “को” (ko), “का” (ka), “मा” (maa),
“द्वारा” (dwara), “ले” (le), “लाई” (lai), “बाट” (bata), etc. This set of markers is
known as vibhaktis. In addition, other suffixes such as “हरू” (haru) denote the plural
form of a word; for example, “केटाहरू” (ketaharu), meaning boys, is the plural of
“केटा” (keta), meaning boy. Moreover, different vibhaktis can be suffixed to the same
root word in different sentences, depending on the context. Separating these suffixes
and vibhaktis from the root word results in a grammatically incorrect sentence.
Given below is a context in Nepali.
सविधानसभा निर्वाचनमा मनाङबाट निर्वाचित टेकबहादुर गुरुङ श्रम तथा रोजगार
राज्यमन्त्री हुन्।
Sambhidhansabha nirvachanma Manangbata nirvachit tekbahadur gurung shram
tatha rojgar rajyamantri hun.
Tek Bahadur Gurung, elected from Manang in the Constituent Assembly elec-
tion, is the Minister of Labor and Employment.
The Nepali context contains the word “निर्वाचनमा” (nirwachanma), where the
vibhakti “मा” (maa) is appended to “निर्वाचन” (nirwachan). “निर्वाचन” (nirwachan) can
also be appended with the vibhakti “को” (ko), forming the word “निर्वाचनको”
(nirwachanko). In the computational overlap, “निर्वाचनमा” (nirwachanma) and
“निर्वाचनको” (nirwachanko) are treated as different words. This accounts for the
decrease in precision in Nepali WSD compared to Hindi WSD: the context and
sense vectors may contain the same content word suffixed with different vibhaktis,
so those content words are treated differently for every suffixed vibhakti and are not
counted in the computational overlap. After stop word removal in Nepali, stop
words are dropped from the context vector, even though they may have contributed
to the contextual overlap in the baseline case. The resulting context vector comprises
content words with different vibhaktis appended; hence, the same content word in
the sense and context vectors fails to match because the vibhaktis appended to it
differ.
Nepali is a morphologically very rich language. Given below are two Nepali
contexts.

Context 1: वास्तवमा यो बलद्वारा निर्मूल गर्न सकिने विषय होइन।
Actually, this topic is not something you can eliminate with force.
Context 2: कति कष्ट र वेदना सहेर बलैले धर्मवीर बन्न सफल भएको, आज अकस्मात् पुन: अधर्मी बन्नुपर्‍यो?
How has he become an atheist again today, despite facing so many difficulties and
sorrows and managing to become a theist with strong willpower?
In the first context, the word “बल” (bal) is appended with the vibhakti “द्वारा” (dwara),
forming “बलद्वारा” (baldwara). In the second context, “बल” is appended with the
vibhakti “ले” (le), forming “बलैले” (balele). This is how Nepali nouns are transformed
into different morphological forms. As noted above, after stop word removal the
context vector comprises content words with different vibhaktis appended, so the
same content word in the sense and context vectors fails to match. This is
responsible for the decrease in precision after stop word elimination (Case 2) over
the baseline (Case 1) in Nepali WSD.

4.7 Conclusion

In this paper, we investigated Nepali WSD using variants of a simplified Lesk-like
algorithm. We also evaluated the effects of stop word elimination, number of
senses, and context window size on the Nepali WSD task. The maximum precision
was observed for the baseline using frequency-based scoring. Stop word elimination
in Nepali results in a decrease in precision over the baseline. Increasing the context
window size increases precision, as more content words are added to the context
vector. Increasing the number of senses decreases precision in general.

Appendix

See Table 4.10.



Table 4.10 Translation, transliteration and details of sense annotated Nepali corpus

Word                Sense number: translation of senses in English (number of instances)
उत्तर (uttar)          Sense 1: Answer (224)
                    Sense 2: North direction (131)
क्रिया (kriya)         Sense 1: Verb in Nepali grammar (107)
                    Sense 2: A course of action (102)
गोली (goli)           Sense 1: A dose of medicine (22)
                    Sense 2: Bullet, a projectile fired from a gun (88)
ग्रहण (grahan)         Sense 1: The act of receiving (145)
                    Sense 2: One celestial body obscures another (86)
टीका (tikaa)          Sense 1: A jewellery item worn by women in South Asian countries (12)
                    Sense 2: A sign on the forehead made using sandalwood (27)
                    Sense 3: Writing about something in detail (20)
                    Sense 4: Name of a person (26)
ताल (taal)            Sense 1: A small lake (105)
                    Sense 2: Rhythm as given by divisions (32)
तिल (til)             Sense 1: Small oval seeds of the sesame plant (38)
                    Sense 2: A small congenital pigment spot on the skin (20)
तुलसी (tulsi)          Sense 1: Basil, holy and medicinal plant (167)
                    Sense 2: A saint who wrote the Ramayana and was a follower of God Ram (46)
                    Sense 3: A common name used for a man (42)
दर (dar)             Sense 1: Rate (87)
                    Sense 2: Special dish made on the occasion of the Teej festival (66)
धारा (dhaaraa)        Sense 1: River's flow (57)
                    Sense 2: Law charges for crimes/section (126)
                    Sense 3: Flow of speech and thought (35)
पुतली (putali)         Sense 1: Toy (24)
                    Sense 2: Contractile aperture in the iris of the eye (34)
                    Sense 3: Butterfly (21)
फल (phal)            Sense 1: Fruit (155)
                    Sense 2: Result (112)
बल (bal)             Sense 1: Strength, power (93)
                    Sense 2: Emphasis on a statement or something said (31)
                    Sense 3: Ball (41)
                    Sense 4: Force relating to police, army, etc. (83)
बोली (boli)           Sense 1: Communication by word of mouth (164)
                    Sense 2: Bid (34)
वचन (vachan)          Sense 1: What one speaks or says, saying (62)
                    Sense 2: Promise, commitment (59)
                    Sense 3: Used in grammar as an agent to denote singular or plural (24)
शाखा (saakhaa)        Sense 1: Divisions of an organization (61)
                    Sense 3: Community (21)
साँचो (sancho)         Sense 1: Truth (136)
                    Sense 2: Keys (38)
साल (saal)            Sense 1: Year (150)
                    Sense 2: Type of tree (49)
सीमा (seema)          Sense 1: Boundary between two things or places, border (72)
                    Sense 2: Extent, limit (97)
हार (haar)            Sense 1: Defeat (100)
                    Sense 2: Necklace, garland (53)

References

Agirre E, Rigau G (1996) Word sense disambiguation using conceptual density. In: Proceedings of
the international conference on computational linguistics (COLING’96), pp 16–22
Baldwin T, Kim S, Bond F, Fujita S, Martinez D, Tanaka T (2010) A re-examination of
MRD-based word sense disambiguation. J ACM Trans Asian Lang Process 9(1):1–21
Banerjee S, Pederson T (2002) An adapted Lesk algorithm for word sense disambiguation using
WordNet. In: Proceedings of the third international conference on computational linguistics
and intelligent text processing, pp 136–145
Banerjee S, Pederson T (2003) Extended gloss overlaps as a measure of semantic relatedness. In:
Proceedings of the eighteenth international joint conference on artificial intelligence, Acapulco,
Mexico, pp 805–810
Bhingardive S, Bhattacharyya P (2017) Word sense disambiguation using IndoWordNet. In:
Dash N, Bhattacharyya P, Pawar J (eds) The WordNet in Indian Languages. Springer, pp 243–
260
Dhungana UR, Shakya S (2014) Word sense disambiguation in Nepali language. In: Fourth
international conference on digital information and communication technology and its
applications (DICTAP), Bangkok, Thailand, pp 46–50
Gale WA, Church K, Yarowsky D (1992) A method for disambiguation word senses in a large
corpus. J Comput Hum 26:415–439
Gaona MAR, Gelbukh A, Bandyopadhyay S (2009) Web-based variant of the Lesk approach to
word sense disambiguation. In: Mexican international conference on artificial intelligence,
pp 103–107
Gupta CP, Bal BK (2015) Detecting sentiments in Nepali text. In: Proceedings of international
conference on cognitive computing and information processing, Noida, India, pp 1–4
Ide N, Veronis J (1998) Word sense disambiguation: the state of the art. Comput Linguist 24(1):
1–40
IndoWordNet, http://tdil-dc.in/indowordnet/
Jain A, Lobiyal DK (2016) Fuzzy Hindi WordNet and word sense disambiguation using fuzzy
graph connectivity measures. ACM Trans Asian Low-Resource Lang Inf Process 15(2)
Lee YK, Ng HT, Chia TK (2004) Supervised word sense disambiguation with support vector
machines and multiple knowledge sources. In: SENSEVAL-3: Third international workshop
on the evaluation of systems for the semantic analysis of text, Barcelona, Spain, pp 137–140
Lesk M (1986) Automatic sense disambiguation using machine readable dictionaries: how to tell a
pine cone from an ice cream cone. In: Proceedings of the 5th annual international conference
on systems documentation SIGDOC, Toronto, Ontario, pp 24–26
Miller G, Chodorow M, Landes S, Leacock C, Robert T (1994) Using a semantic concordance for
sense identification. In: Proceedings of the 4th ARPA human language technology workshop,
pp 303–308
Mishra N, Yadav S, Siddiqui TJ (2009) An unsupervised approach to hindi word sense
disambiguation. In: Proceedings of the first international conference on intelligent human
computer interaction, pp 327–335

Nepali General Text Corpus, http://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=453&lang=en
Ng HT, Lee HB (1996) Integrating multiple knowledge sources to disambiguation word sense: an
exemplar-based approach. In: Proceedings of the 34th annual meeting for the association for
computational linguistics, pp 40–47
Piryani R, Priyani B, Singh VK, David P (2020) Sentiment analysis in Nepali: exploring machine
learning and lexicon-based approaches. J Intell Fuzzy Syst 1–12
Resnik P (1997) Selectional preference and sense disambiguation. In: Proceedings of the
ACL SIGLEX workshop on tagging text with lexical semantics: why, what and how? pp 52–57
Shrestha N, Hall PAV, Bista SK (2008) Resources for Nepali word sense disambiguation. In:
International conference on natural language processing and knowledge engineering, Beijing,
China
Singh S, Siddiqui TJ (2016) Sense annotated Hindi corpus. In: The 20th international conference
on Asian language processing, Tainan, Taiwan, pp 22–25
Singh S, Siddiqui TJ (2012) Evaluating effect of context window size, stemming and stop word
removal on Hindi word sense disambiguation. In: Proceedings of the international conference
on information retrieval and knowledge management, Malaysia, pp 1–5
Singh S, Siddiqui TJ (2014) Role of semantic relations in Hindi word sense disambiguation. In:
Proceedings of international conference on information and communication technologies
Singh S, Siddiqui TJ (2015) Role of karaka relations in Hindi word sense disambiguation. J Inf
Technol Res 8(3):21–42
Singh S, Gabrani G, Siddiqui TJ (2017) Hindi word sense disambiguation using variants of
simplified lesk measure. J Intell Inform Smart Technol 2:1–6
Singh S, Singh VK, Siddiqui TJ (2013) Hindi word sense disambiguation using semantic
relatedness measure. In: Proceedings of 7th multi-disciplinary workshop on artificial
intelligence, Krabi, Thailand, pp 247–256
Sinha M, Kumar M, Pande P, Kashyap L, Bhattacharyya P (2004) Hindi word sense
disambiguation. In: International symposium on machine translation, natural language
processing and translation support systems, Delhi, India
Vasilescu F, Langlasi P, Lapalme G (2004) Evaluating variants of the lesk approach for
disambiguating words. In: Proceedings of the language resources and evaluation, pp 633–636
Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In:
Proceedings of the 33rd annual meeting of the association for computational linguistics,
pp 189–196
Chapter 5
Performance Analysis of Big.LITTLE System with Various Branch Prediction Schemes

Froila V. Rodrigues and Nitesh B. Guinde

F. V. Rodrigues, Dnyanprassarak Mandal's College and Research Centre, Assagao-Goa, India
N. B. Guinde, Goa College of Engineering, Ponda-Goa, India; e-mail: nitesh.guinde@gec.ac.in

Abstract With the rapid innovation in mobile technology, cell-phone processors are
nowadays designed and deployed to meet the demands of high performance and
low-power operation. The ARM big.LITTLE architecture for smart phones meets
these requirements, with “big” cores delivering high performance and “little” cores
being power efficient. High performance is achieved by making pipelines deeper,
which results in more processing time being lost on a branch misprediction. Hence,
an accurate branch predictor is required to mitigate branch delay latency in
processors and exploit parallelism. In this paper, we evaluate and compare various
branch prediction schemes by incorporating them in an ARM big.LITTLE
architecture with Linux running on it. The comparison is carried out for
performance and power utilization with the Rodinia benchmarks for heterogeneous
cores. Performance of the simulated system is reported in terms of execution time,
percentage of conditional branch mispredictions, overall percentage of branch
mispredictions (covering conditional and unconditional branches), and instructions
per cycle. It is observed that the TAGE-LSC and perceptron predictors perform best
among all the simulated predictors, achieving average accuracies of 99.03% and
99.00%, respectively, using the gem5 framework. The local branch predictor has
the lowest power dissipation when tested with the integrated multicore power, area,
and timing (McPAT) platform.

5.1 Introduction

Heterogeneous multicore systems have become an alternative for the smart phone
industry, whose primary objectives are power efficiency and high performance.
High performance requires fast processors, which makes it difficult to stay within
the required thermal budget or mobile power envelope. Battery power fails to cope
with the fast-evolving CPU technology. Today, smart phones with high performance and
long-lasting battery life are preferred. ARM big.LITTLE architectures (ARM Lim-
ited 2014) are designed to satisfy the above requirements. This technology uses
heterogeneous cores. The “big” processors provide maximum computational perfor-
mance, and the “LITTLE” cores provide maximum power efficiency.
As per the research in (Butko et al. 2015; Sandberg et al. 2017), both processors
use the same instruction set architecture (ISA) and are coherent in operation. Branch
misprediction latency is one of the severe reasons for performance degradation in
processors. This becomes even more critical as micro-architectures become more
deeply pipelined (Sprangle and Carmean 2002). To mitigate this, an accurate
prediction scheme is essential to boost parallelism. Prefetching and executing
instructions along the predicted direction avoids stalling the pipeline, reducing the
performance losses caused by branches. Predicting the branch outcome correctly
frees the functional units, which can then be utilized for other tasks.
Earlier work on branch predictors compares them on the basis of performance
alone, without considering the effects of the operating system (OS) while running
the workload. The novelty of this paper is the evaluation and comparative analysis
of various branch predictors by incorporating them in an ARM big.LITTLE
architecture with Linux running on it. The comparison is carried out for
performance and power dissipation. Our contributions include comparing the
branch predictors on the ARM big.LITTLE system in terms of the percentage of
conditional branch mispredictions, the overall percentage of branch mispredictions
(covering conditional and indirect branches), IPC, execution time, and power
consumption. Based on the detailed analysis, we report some useful insights about
the designed architecture with various branch prediction schemes and their impact
on performance and power.
The rest of the paper is organized as follows: Sect. 5.2 presents related research
on commonly used branch prediction schemes. Section 5.3 discusses the simulated
branch predictors. In Sect. 5.4, the experimental setup is described along with the
architectural modeling of the processor, and the performance and power models are
discussed. Section 5.5 presents the experimental results. Section 5.6 gives
concluding remarks and perspectives regarding the branch prediction schemes.

5.2 Related Research

Kotha (2007) presents a comparative study of various BPs including the bimodal,
gshare, YAGS, and meta-predictors. The BPs are evaluated for their performance
using JPEG encoder, G721 speech encoder, MPEG decoder, and MPEG encoder
applications. A new BP, namely YAGS Neo, is modeled and outperforms the others
for some of the applications. The paper also evaluates a meta-predictor with various
combinations of the predictors, which shows improved performance over the others.

Sparsh (2018) presents a survey of various dynamic BP schemes, including gshare,
two-level BPs, the Smith BP, and the perceptron. It is seen that the perceptron BP
has the highest precision for most of the applications considered.
Butko et al. (2016) explore the design of single-ISA heterogeneous multicore
processors based on the ARM big.LITTLE technology. They model the
heterogeneous processor in the gem5 and McPAT frameworks to evaluate its
performance and power. The big.LITTLE model is implemented in gem5 with
reference to the Exynos 5 Octa (5422) SoC specifications to configure the
simulated system. Micro-architectural parameters such as the execution stage
configuration, functional units, branch predictor, and physical registers of the
Cortex-A7 (in-order) and Cortex-A15 (out-of-order) cores are fine-tuned. They
validate the simulated model for accuracy, performance, and power with respect to
the Samsung Exynos Octa (5422).
The “LITTLE” cores use the Minor CPU model, whose pipeline comprises four
stages: fetch1, fetch2, decode, and execute. Branch data is latched in the
fetch2-to-fetch1 latch. The outcome of a conditional branch is known at execution
time and is carried to the fetch1 stage by the execute-to-fetch1 latch for updating the
branch predictor. Also, if the fetched instructions do not match the branch
predictor's decision, they are discarded from the pipeline. This is easily identified
because the sequence number of the predicted instruction in fetch2 will not match
that of the fetched instruction in the fetch1 pipeline stage (Butko et al. 2015;
Sandberg et al. 2017).
The “big” cores use the out-of-order (OoO) CPU model in gem5. The fetch, decode,
rename, and retirement stages of the OoO pipeline are performed in order, while the
instruction issue, dispatch, and execute stages are performed out of order (Butko
et al. 2015; Sandberg et al. 2017). The fetch stage of the OoO pipeline handles the
branch prediction unit. Unconditional branches whose branch target is not available
are handled in the decode stage of the pipeline, while conditional branch
mispredictions are determined in the execute stage. Whenever a branch
misprediction is identified, the entire pipeline is squashed, the entry is deleted from
the branch target buffer (BTB), and the program counter (PC) is updated
accordingly. Flushing the entire pipeline on a branch misprediction reduces the
performance of a multi-stage pipelined processor.

5.3 Simulated Branch Predictors

The branch prediction in the processor is dynamic. These adaptive predictors observe
the pattern of the history of previous branches during execution. This behavior is then
utilized to predict the outcome of the branch, whether taken or not taken when the
same branch occurs the next time. If multiple unrelated branches index the same
entry in the predictor table, it leads to the aliasing effect as shown in Fig. 5.1, where
there is an interference between the branches P and Q that leads to a misprediction.
Hence, it is necessary to have an accurate branch prediction scheme.
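As a concrete instance of aliasing: with a predictor table of 4096 entries indexed by the low 12 bits of the branch address, two unrelated branches at, say, addresses 0x0041A8 and 0x2F61A8 (made-up values for illustration) map to the same index 0x1A8 and update the same counter, so a branch that is strongly taken can be dragged toward not-taken by the interfering branch.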

Fig. 5.1 Example of aliasing effect

Some of the commonly used branch prediction schemes for computer architectures
are bimodal, local, gshare, YAGS, Tournament, L-TAGE, perceptron, ISL-TAGE, and
TAGE-SC-L.
Bimodal BP: The bimodal predictor is the earliest form of branch prediction
scheme (Lee et al. 1997). The prediction is based on the branch history of a given
branch instruction. A table of two-bit saturating counters is indexed using the lower
bits of the branch address. When a branch is identified, if the corresponding counter
is in the strongly taken (ST) or weakly taken (WT) state, future occurrences are
predicted as taken; when it is in the weakly not taken (WNT) or strongly not taken
(SNT) state, they are predicted as not taken.
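A minimal sketch of such a two-bit bimodal predictor is shown below; the table size and state encoding are illustrative assumptions, not the configuration used in the paper.

class BimodalPredictor:
    # States: 0 = SNT, 1 = WNT, 2 = WT, 3 = ST (two-bit saturating counter)
    def __init__(self, table_bits=12):
        self.mask = (1 << table_bits) - 1
        self.counters = [1] * (1 << table_bits)    # initialised to weakly not taken

    def predict(self, pc):
        # Index with the lower bits of the branch address
        return self.counters[pc & self.mask] >= 2  # True means "predict taken"

    def update(self, pc, taken):
        # Saturating increment/decrement on the actual outcome
        idx = pc & self.mask
        if taken:
            self.counters[idx] = min(3, self.counters[idx] + 1)
        else:
            self.counters[idx] = max(0, self.counters[idx] - 1)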
Local BP: Yeh and Patt propose a branch prediction scheme that uses the local
history of the branch being predicted. This history helps in predicting the next
outcome of that branch. Here, the branch address bits are XORed with the local
history to index a table of saturating counters, whose bias provides the prediction
(Yeh and Patt 1991).
Tournament BP: Xie et al. (2013) present a tournament branch predictor that uses
local and global predictors based on saturating counters per branch. The local
predictor is a two-level table that keeps track of the local history of individual
branches, while the global predictor is a single-level table. Both provide a
prediction. The meta-predictor, which selects between the two, is a table of
saturating counters indexed by the branch address.
Gshare BP: McFarling (1993) proposes an index-sharing scheme aimed at higher
accuracy. The gshare scheme is similar to the bimodal scheme, except that the
global history register bits are XORed with the program counter (PC) bits to select
the pattern history table (PHT) entry, whose value gives the prediction. However,
aliasing is the major factor reducing its prediction accuracy.
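A sketch of the gshare indexing idea, reusing the two-bit counter convention of the bimodal example above, is given next; history length and table size are again illustrative assumptions.

class GsharePredictor:
    def __init__(self, table_bits=12, history_bits=12):
        self.mask = (1 << table_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.counters = [1] * (1 << table_bits)    # two-bit counters, weakly not taken
        self.ghr = 0                               # global history register

    def _index(self, pc):
        # XOR the global history with the low PC bits to spread branches over the PHT
        return (pc ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        idx = self._index(pc)
        if taken:
            self.counters[idx] = min(3, self.counters[idx] + 1)
        else:
            self.counters[idx] = max(0, self.counters[idx] - 1)
        # Shift the actual outcome into the global history register
        self.ghr = ((self.ghr << 1) | int(taken)) & self.history_mask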
YAGS BP: Eden and Mudge present yet another global scheme (YAGS), which is a
hybrid of a bimodal predictor and direction PHTs. The bimodal component stores
the bias, and the direction PHTs store the history of a branch only when it does not
agree with that bias (Eden and Mudge 1998). This reduces the information that
would otherwise be stored in the direction PHT tables.
TAGE, L-TAGE, ISL-TAGE, TAGE-LSC: Seznec and Michaud implement the
TAgged GEometric history length (TAGE) predictor in (Seznec and Michaud 2006).
It improves on Michaud's PPM-like tag-based branch predictor. A prediction is
made by the longest hitting entry in the partially tagged tables, whose history
lengths increase according to the geometric series

Length(i) = (int)(α^(i−1) × Length(1) + 0.5)     (5.1)

Length(i) increases geometrically with i. The table entries are allocated in an
optimized way, making the predictor very space efficient.
The update policy increments/decrements the useful (“U”) counter when the final
prediction is correct/incorrect, respectively. The “U” counters are reset periodically
to avoid any entry being marked as useful forever. When the prediction is made by
a newly allocated entry, it is not taken as the final prediction, since new entries need
some training time to predict correctly; instead, the alternate prediction is used as
the final outcome. This branch predictor is better in terms of accuracy. Also, partial
tagging is cost efficient and hence can be used by predictors using long global
history lengths.
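For illustration, the geometric series of Eq. (5.1) can be tabulated as follows; the values of Length(1), α, and the number of tables are made-up examples, not the configuration of any particular TAGE variant.

def geometric_history_lengths(l1, alpha, num_tables):
    # Length(i) = int(alpha**(i-1) * Length(1) + 0.5), per Eq. (5.1)
    return [int(alpha ** (i - 1) * l1 + 0.5) for i in range(1, num_tables + 1)]

# Example: 4 tagged tables, shortest history of 4 bits, ratio alpha = 3
print(geometric_history_lengths(4, 3, 4))   # [4, 12, 36, 108]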
The L-TAGE predictor was presented in CBP-2 (Seznec 2007). It is a hybrid of
TAGE and a loop predictor; the loop predictor identifies branches that behave as
regular loops with a fixed number of iterations. Once a loop has executed
successively three times with the same iteration count, the loop predictor provides
the prediction. Seznec also presents the ISL-TAGE and TAGE-LSC predictors,
which incorporate a statistical corrector (SC) and a loop predictor (Seznec 2016;
Seznez 2011).
Perceptron: Jiménez and Lin implement perceptron predictors based on neural
networks (Jiménez and Lin 2001). The perceptron model is a vector of weights (w),
which are signed integers giving the amount of correlation between the branch
outcome and the inputs (x). The inputs to the perceptron are the previous outcomes
(1 = taken, −1 = not taken) from the global history shift register. The outcome is
calculated as the dot product of the weight vector w_0, w_1, ..., w_n and the input
vector x_0, x_1, ..., x_n, where x_0 is a bias input always set to 1. The output P is
given by

P = w_0 + \sum_{i=1}^{n} w_i x_i     (5.2)

A positive or negative value of P indicates that the branch is predicted as taken or
not taken, respectively.
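A compact sketch of a perceptron predictor following Eq. (5.2) is given below; the table size, history length, and training threshold are illustrative assumptions (the classic design uses a threshold of roughly 1.93*h + 14, but any reasonable value demonstrates the idea).

class PerceptronPredictor:
    def __init__(self, num_perceptrons=1024, history=16, threshold=45):
        self.threshold = threshold
        self.num = num_perceptrons
        # One weight vector per perceptron: bias weight w0 plus one weight per history bit
        self.weights = [[0] * (history + 1) for _ in range(num_perceptrons)]
        self.ghr = [-1] * history                 # global history as +1 / -1 values

    def _output(self, pc):
        w = self.weights[pc % self.num]
        # P = w0 + sum_i w_i * x_i  (Eq. 5.2), with x_0 = 1 as the bias input
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.ghr))

    def predict(self, pc):
        return self._output(pc) >= 0              # non-negative output means "taken"

    def update(self, pc, taken):
        y = 1 if taken else -1
        out = self._output(pc)
        w = self.weights[pc % self.num]
        # Train only on a misprediction or when the output magnitude is small
        if (out >= 0) != taken or abs(out) <= self.threshold:
            w[0] += y
            for i, xi in enumerate(self.ghr):
                w[i + 1] += y * xi
        # Shift the new outcome into the global history
        self.ghr = self.ghr[1:] + [y]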

5.4 Architectural Modeling

This section covers the performance and power modeling using gem5 and McPAT,
respectively, along with the system configuration and a description of the Rodinia
benchmark suite.

5.4.1 Performance and Power Modeling

The gem5 simulator (2018) is a cycle-approximate simulation framework that
supports multiple ISAs, CPU models, branch prediction schemes, and memory
models, including cache coherence protocols, interconnects, and memory
controllers. The architectural specifications are given in Table 5.1.
The statistics and configuration files produced by gem5 are used by the integrated
multicore power, area, and timing (McPAT) framework to estimate power
consumption. The power model version used is McPAT v1.0.

Table 5.1 Architectural specifications

Parameters                 Big           LITTLE
General specifications
Total cores                4
ISA                        Linux ARM system
CPU type                   Exynos
Cache line size            64 bytes
Per-cluster specifications
Type                       DerivO3       Minor
Total CPUs                 2             2
Fetch width                3             0
NumPhysFloatRegs           256           0
Pipeline                   19 stages     4 stages
I-cache size               32 kB         32 kB
I-cache associativity      2-way         2-way
D-cache size               32 kB         32 kB
D-cache associativity      2-way         2-way
L2 latency                 45 cycles     27 cycles
Branch mispred. penalty    13 cycles     3 cycles
BTBEntries                 4096          4096
BTBTagSize                 18 bits       16 bits

5.4.2 System Configuration

The system is configured using Ubuntu 16.04 OS on vmlinux kernel for ARM ISA.

5.4.3 Applications of Rodinia

The Rodinia benchmark suite (Che et al. 2009) for heterogeneous platforms is used
to study the effect of the branch prediction schemes. We have used the OpenMP
workloads of the suite; the problem sizes of the workloads are listed in Table 5.2.
The workloads comprise:
Heart Wall removes speckles from an image without impairing its features. It uses
structured grids.
k-Nearest Neighbors comprises dense linear algebra.
Number of boxes 1D estimates the potential of particles and relocates them within a
large 3D space due to the mutual forces among them.

Table 5.2 Benchmark parameters

Application             Acronym          Problem size
Heart wall              heartwall        test.avi, 2 frames
k-nearest neighbors     nn               filelist.txt, 13,500 data-points
Number of boxes 1D      lavaMD           2 1D boxes
Particle filter         particlefilter   10,000 data-points (x = 128, y = 128, z = 10)
HotSpot                 hotspot          512 x 512 data-points
Needleman-Wunsch        nw               2048 data-points
Pathfinder              pathfinder       100,000 x 100
HotSpot 3D              hotspot3D        512 x 512 x 512 data-points
SRAD                    srad             100 x 502 x 458 data-points
Breadth-first search    bfs              graph1M.txt (1,000,000 data-points)
Myocyte                 myocyte          256 data-points
Back-propagation        backprop         512 input nodes
Stream cluster          SC               512 data-points, 256 dimensions
K-means                 kmeans           kdd_cup
Btree                   b+tree           256 data-points

Particle Filter is a probabilistic estimator of a target position given noisy
measurements of the position.
HotSpot and HotSpot 3D evaluate the processor temperature.
Needleman-Wunsch estimates the location of a DNA sequence.
Pathfinder estimates the time to discover a path.
SRAD is a regular, grid-structured application used in ultrasonic and radar
image-processing domains.
Breadth-First Search is a graph algorithm used to traverse graphs with millions of
vertices.
Myocyte is used in medical research to model cardiac muscle cells and simulate
their behavior.
Back-propagation is a neural learning algorithm where the actual output is compared
with the requested value; the difference is then propagated back to the input, and the
weights are updated accordingly.
Stream Cluster finds a predetermined number of medians for a group of data-points,
so that every point is mapped to its closest center.
K-means identifies groups of points by relating each data point to its neighboring
cluster; new cluster centers are then estimated, and the iteration continues until
convergence.
Btree inserts and deletes nodes in a B+-tree by traversing it.
The disk image is mounted and loaded with the applications. A boot script is written
to execute the applications; it also instructs m5 to record the statistics of the
simulated system once each application has executed.

5.5 Experimental Results

This section discusses the various parameters used for comparing the performance
of branch predictors along with the performance and power analysis.

5.5.1 Parameters Used for Comparison

Percentage of conditional branch mispredictions: The percentage of conditional
branches predicted incorrectly out of the total conditional branches.
Overall misprediction percentage: The percentage of all mispredicted branches out
of the total branches executed.

Instructions per cycle (IPC): A major aspect of a processor's performance; the
average number of instructions executed per clock cycle.
Simulation time in seconds: The total simulated time for an application.
History length: The size of the history tables required to store the global or local
history of the branches.
Percentage of power dissipated: The percentage of processor power dissipated by
the branch predictor.
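These quantities can be derived from the gem5 statistics dump; a rough parsing sketch is shown below. The stat key names used here (e.g. branchPred.condPredicted) and the CPU path are assumptions for illustration only: the exact names differ across gem5 versions and CPU models and must be checked against the generated stats.txt.

import re

def load_stats(path):
    # Parse "name  value  # description" lines from a gem5 stats.txt dump
    stats = {}
    with open(path) as f:
        for line in f:
            m = re.match(r'(\S+)\s+([-+.\deE]+)', line)
            if m:
                try:
                    stats[m.group(1)] = float(m.group(2))
                except ValueError:
                    pass
    return stats

def branch_metrics(stats, cpu='system.bigCluster.cpus0'):
    # Key names are illustrative; verify them against your gem5 version
    cond_pred = stats[cpu + '.branchPred.condPredicted']
    cond_wrong = stats[cpu + '.branchPred.condIncorrect']
    return {
        'conditional_mispred_pct': 100.0 * cond_wrong / cond_pred,
        'ipc': stats.get(cpu + '.ipc'),
    }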

5.5.2 Performance Analysis

The performance analysis of the branch predictors is carried out for a minimum of
1 million branch instructions. The overall conditional branch mispredictions are
shown in Fig. 5.2. From Table 5.3, the local BP has the minimum accuracy of
94.7%, with a misprediction rate of 5.29%, while the perceptron and TAGE-LSC
have misprediction rates of 3.43% and 3.4%, respectively. This gives TAGE-LSC
and the perceptron the highest accuracy of about 96.6% for conditional branch
prediction at a fixed history length of 16 kb. Mispredictions in popular PC-indexed
predictors with 2-bit counters are mainly due to destructive interference, or aliasing.
Another reason is that a branch may require local history, global history, or both in
order to predict the outcome correctly. TAGE predictors, on the other hand, handle
branches with long histories by employing multiple tagged components with various
folded histories.
The TAGE-LSC predictor outperforms the L-TAGE and ISL-TAGE predictors, as
the latter cannot accurately predict branches that are statistically biased towards a
given direction.
Fig. 5.2 Percentage of conditional branch mispredictions per application at a fixed history length
of 16 kb
68 F. V. Rodrigues and N. B. Guinde

Table 5.3 Performance analysis

Branch predictor   % conditional mispredictions   % overall mispredictions   IPC
L-TAGE             3.56                           1.16                       0.99
Tournament         4.88                           2.63                       0.79
Bimodal            5.05                           3.12                       0.71
Local              5.29                           3.27                       0.68
YAGS               4.92                           2.66                       0.82
gshare             4.82                           2.68                       0.79
Perceptron         3.43                           0.98                       1.13
TAGE-LSC           3.40                           0.97                       1.14
ISL-TAGE           3.5                            1.12                       1.03

Fig. 5.3 Percentage of overall branch mispredictions comprising the conditional and unconditional
branches per application at a fixed history length of 16 kb

For certain branches, their performance is worse than that of a simple PC-indexed
table with two-bit saturating counters. The TAGE-LSC predictor incorporates a
multi-GEHL statistical corrector (SC) which handles this class of statistically
correlated branches. The multi-GEHL SC can handle various path and branch
histories, including local and global history, very accurately. The perceptron
predictor also achieves very high accuracy at a history length of 16 kb for the
various applications executed.
Figure 5.3 shows that TAGE-LSC, perceptron, ISL-TAGE, and L-TAGE have an
average of 0.97%, 0.98%, 1.12%, and 1.16% overall branch mispredictions,
respectively, when tested on applications having a minimum of 1 M branch
instructions. The tournament BP, YAGS, and gshare have misprediction percentages
of 2.63%, 2.66%, and 2.68%, respectively, and the local BP has the highest overall
misprediction percentage of 3.27%.
Mispredictions have an impact on the IPC: the more mispredictions, the more stalls
in the pipeline and the lower the IPC, i.e., more cycles are wasted per instruction. As
seen in Fig. 5.4, the TAGE-LSC and perceptron predictors have the lowest overall
misprediction percentages, and their overall IPCs of 1.14 and 1.13, respectively, are
higher than those of the other predictors.

Fig. 5.4 Instructions Per Cycle (IPC) per application at a fixed history length of 16 kb

Fig. 5.5 Simulation time per application at a fixed history length of 16 kb

Figure 5.5 shows the execution time in seconds per application of the Rodinia
bench suite for the simulated ARM big.LITTLE architecture, varying the branch
prediction scheme incorporated into it. It is observed that the perceptron and
TAGE-LSC have the least execution time for almost all applications in the suite,
while the local BP has the maximum execution time.
Figures 5.6 and 5.7 show the performance of the branch prediction schemes as the
history length is varied. The lavaMD benchmark is used to study the effect of
variations in history length for the popular branch predictors, since it has the
highest utilization of the branch predictor compared to the other applications of the
Rodinia suite. It is seen that as the history length of the branch predictors is
increased, the misprediction percentage drops. However, for the L-TAGE predictor
the drop is not significant, which shows that L-TAGE is robust to changes in the
geometric history lengths. For the L-TAGE predictor, the overall branch
misprediction rate, comprising the conditional and unconditional branches, is found
to be 2.7701% across the varying history lengths.

Fig. 5.6 Percentage of conditional branch mispredictions with the lavaMD application for varying history lengths

Fig. 5.7 Percentage of overall branch mispredictions comprising the conditional and unconditional branches with the lavaMD application for varying history lengths

Seznec, who implemented L-TAGE, also states that the overall misprediction rate is
within 2% for any minimum history length in the interval [2–5] and any maximum
value between [300 and 1000] (Seznec 2007). In other words, L-TAGE, just like the
TAGE and O-GEHL predictors, is not sensitive to the history length parameters
(Seznec 2005; Seznec and Michaud 2006). The ISL-TAGE, perceptron, and
TAGE-LSC predictors have overall branch misprediction rates of 2.51%, 2.37%,
and 1.9%, respectively, for varying history lengths. The perceptron predictor
provides the best performance for history lengths below 16 kb and eventually settles
at a constant misprediction rate as the history length increases beyond 16 kb; beyond
this point, TAGE-LSC provides higher accuracy than the perceptron and the other
predictors. The reason for the lower performance of the L-TAGE, ISL-TAGE, and
TAGE-LSC predictors at reduced history lengths is insufficient correlation from
remote branches, resulting in negative interference; reducing the local history bits
also fails to detect branches that exhibit loop behavior. However, the computational
complexity of the perceptron and TAGE predictors is high in comparison to the
popular PC-indexed 2-bit predictors.
Table 5.3 summarizes the results of the performance analysis for the parameters of
% conditional mispredictions, % overall mispredictions, and IPC for better
readability.

Fig. 5.8 Percentage of power dissipated by the predictors per application

5.5.3 Power Analysis

Figure 5.8 shows the power results of the branch predictors incorporated in the
ARM big.LITTLE architecture. The average fraction of processor power consumed
by the L-TAGE, ISL-TAGE, and TAGE-LSC branch predictors is 5.89%, 5.91%,
and 5.98%, respectively, across the workloads. This is high in comparison to the
other branch predictors; the reason is that design complexity grows with the number
of components, increasing the silicon area and power dissipation in the processor
(Seznec 2007). The perceptron predictor consumes 4.7% of the processor power,
and the minimum average power of 3.77% is dissipated by the local BP unit.
It should be noted that the power estimation is based on simulation, which can incur
abstraction errors and reflects only approximate levels of power utilization by the
predictor.

5.6 Conclusion

An exhaustive analysis of various branch prediction schemes is carried out for
performance and power using the gem5 and McPAT frameworks, respectively. It is
observed that TAGE-LSC and the perceptron have the highest prediction accuracy
among the simulated predictors. The perceptron predictor performs efficiently at a
reduced resource budget and history length, while TAGE-LSC outperforms it for
higher history lengths and an increased resource budget.
In the ARM big.LITTLE architecture, the big cores, where high performance is
desired, can be incorporated with the TAGE-LSC predictor, and the LITTLE cores
can be built with the perceptron predictor, which achieves high accuracy and power
efficiency at reduced budget and power requirements. Also, the local branch
predictor dissipates the minimum power, but its accuracy is the lowest.

Acknowledgements We would like to thank all those who have dedicated their time to research
related to branch predictors and have contributed to the gem5 and McPAT simulation frameworks.

References

ARM Limited big.LITTLE Technology (2014) The future of mobile. In: Low power-high perfor-
mance white paper
Butko A, Bruguier F, Gamatié A (2016) Full-system simulation of big.LITTLE multicore architecture for performance and energy exploration. In: 2016 IEEE 10th international symposium on
embedded multicore/many-core systems-on-chip (MCSOC), IEEE, Lyon, pp 201–208
Butko A, Gamatié A, Sassatelli G (2015) Design exploration for next generation high-performance
manycore on-chip systems: application to big.LITTLE Architectures. In:2015 IEEE computer
society annual symposium on VLSI. IEEE Montpellier, pp 551–556
Che S, Boyer M, Meng J (2009) Rodinia: A benchmark suite for heterogeneous computing. In:2009
IEEE international symposium on workload characterization (IISWC). IEEE Austin, pp 44–54
Eden AN, Mudge T (1998) The YAGS branch prediction scheme. In: Proceedings of the 31st annual
ACM/IEEE international symposium on microarchitecture 1998. ACM/IEEE Dallas, pp 169–177
Jiménez DA, Lin C (2001) Dynamic branch prediction with perceptrons. In: Proceedings of the 7th
international symposium on high-performance computer architecture HPCA ’01. ACM/IEEE
Mexico, p 197
Kotha A (2007) Electrical & computer engineering research works. Digital Repository at the University of Maryland (DRUM). https://drum.lib.umd.edu/bitstream/handle/1903/16376/branch_predictors_tech_report.pdf?sequence=3&isAllowed=y. Cited 10 Dec 2007
Lee CC, Chen ICK, Mudge TN (1997) The bi-mode branch predictor. In: Proceedings of 30th
annual international symposium on microarchitecture. ACM/IEEE, USA, pp 4–13
McFarling S (1993) Combining branch predictors. TR, Digital Western Research Laboratory, Cal-
ifornia, USA
Sandberg A, Diestelhorst S, Wang W (2017) Architectural exploration with gem5. In:ARM Res
Seznec A (2005) Analysis of the O-GEometric history length branch predictor. ACM SIGARCH Comput Archit News 33(2):394–405
Seznec A (2007) A 256 Kbits L-TAGE branch predictor. The second championship branch prediction competition (CBP-2). J Instruction-Level Parallelism (JILP) 9(1):1–6
Seznec A (2016) TAGE-SC-L branch predictors again. In: 5th JILP workshop on computer architecture competitions (JWAC-5) 5(1)
Seznec A, Michaud P (2006) A case for (partially) TAgged GEometric history length branch prediction. J Instruction-Level Parallelism 8(1):1–23
Seznec A (2011) A 64 Kbytes ISL-TAGE branch predictor. In: Workshop on computer architecture
competitions (JWAC-2): championship branch prediction
Sparsh M (2018) A survey of techniques for dynamic branch prediction. CoRR abs/1804.00261
Sprangle E, Carmean D (2002) Increasing processor performance by implementing deeper pipelines.
In: Proceedings of the 29th annual international symposium on computer architecture (ISCA).
IEEE USA, pp 25–34
The gem5 simulator. Homepage. http://gem5.org/Main_Page. Last accessed 2018/1/12
Xie Z, Tong D, Cheng X (2013) An energy-efficient branch prediction technique via global-history
noise reduction. In: International symposium on low power electronics and design (ISLPED).
ACM, Beijing, pp 211–216
Yeh T, Patt Y (1991) Two-level adaptive training branch prediction. In: Proceedings of the 24th
annual international symposium on microarchitecture 1991. ACM, New York, pp 51–61
Chapter 6
Global Feature Representation Using
Squeeze, Excite, and Aggregation
Networks (SEANet)

Akhilesh Pandey, Darshan Gera, D. Gunasekar, Karam Rai,


and S. Balasubramanian

Abstract Convolutional neural networks (CNNs) are workhorses of deep learning.


A popular CNN architecture is the Residual Network (ResNet), which emphasizes learning a residual mapping rather than directly fitting the input to the output. Subsequent to ResNet,
Squeeze and Excitation Network (SENet) introduced a squeeze and excitation block
(SE block) on every residual mapping of ResNet to improve its performance. The SE
block quantifies the importance of each feature map and weights them accordingly.
In this work, we propose a new architecture SEANet built over SENet by introducing
an aggregate block after SE block. We choose sum as the aggregate operation. The
aggregation helps in minimizing redundancies in feature representation and provide
a global feature representation across feature maps by downsampling their number.
We demonstrate the superior performance of our SEANet over ResNet and SENet on
benchmark CIFAR-10 and CIFAR-100 datasets. Specifically, SEANet reduces clas-
sification error rate on CIFAR-10 by around 2% and 3%, respectively, over ResNet
and SENet. On CIFAR-100, SEANet reduces error by around 5% and 9% when com-
pared against ResNet and SENet. Further, SEANet outperforms the latest EncapNet
and both its variants EncapNet+ and EncapNet++ on CIFAR-100 dataset by around
2%.

6.1 Introduction

With advancements in deep learning models, many real-world problems that had remained
unsolved for decades are now being addressed. Deep learning is being used
in a wide range of applications ranging from image detection and recognition to
security and surveillance. One of the major advantages of deep learning models is that

A. Pandey (B) · D. Gunasekar · K. Rai · S. Balasubramanian


Department of Mathematics and Computer Science (DMACS), Sri Sathya Sai Institute of Higher
Learning (SSSIHL), Prasanthi Nilayam, Anantapur District 515134, India
e-mail: sbalasubramanian@sssihl.edu.in
D. Gera
DMACS, SSSIHL, Brindavan Campus, Bengaluru 560067, India
e-mail: darshangera@sssihl.edu.in


they extract features on their own. Convolutional neural networks (CNNs) are used
extensively to solve image recognition and image classification tasks. Convolutional
layer basically learns a set of filters that help in extracting useful features. It learns
powerful image descriptive features by combining the spatial and the channelwise
relationship in the input.
To enhance the performance of CNNs, recent research has explored three dif-
ferent aspects of networks, namely width, depth, and cardinality. It was found that
deeper models could model complex input distribution much better than the shallow
models. With the availability of specialized hardware accelerators such as GPUs, it
has become easy to train larger networks. Taking advantage of GPUs, recent models like
VGGNet (Sercu et al. 2016), GoogLeNet (Szegedy et al. 2015), and Inception Net
(Szegedy et al. 2017) have shown continuous improvements in accuracy.
VGGNet showed that stacking blocks of same shape gives better results. GoogLeNet
shows that width plays an important role in improving the performance of a model.
Xception (Chollet 2016) and ResNeXt (Xie et al. 2017) came up with the idea of
increasing the cardinality of a neural network. They empirically showed that, apart
from saving parameters, cardinality also increases the representation power more
effectively than width and depth. However, it was observed that deep models built
by stacking up layers suffered from the degradation problem (He et al. 2016). The degrada-
tion problem arises when after some iterations the training error refuses to decrease
thereby giving high training error and test error. The reason behind the degradation
problem is vanishing gradient—as the model becomes larger, the propagated gradi-
ent becomes very small by the time it reaches the earlier layers, thereby making the
learning more difficult. The degradation problem was solved with the introduction
of the ResNet (He et al. 2016) models which stacked residual blocks along with
skip connections to build very deep architectures, which gave better accuracy than their
predecessors.
ResNet performed very well and won the ILSVRC (Berg et al. 2010) challenge
in 2015. Subsequently, the architecture that won ILSVRC 2017 challenge is SENet
(Hu et al. 2017). Unlike other CNN architectures that considered all feature maps
to be equally important, SENet (Hu et al. 2017) quantifies the importance of each
feature map adaptively and weighs them accordingly. The main architecture of SENet
discussed in Hu et al. (2017) is built over base ResNet by incorporating SE blocks. SE
blocks can also be incorporated in other CNN architectures. Though SENet quantifies
the importance of feature maps, it neither addresses redundancies across feature maps
nor provides a global representation across them. In this work, we propose a
novel architecture, namely SEANet, that is built over SENet. Following SE block, we
introduce an aggregate block that helps in providing a global feature representation
and also minimizes redundancies in feature representation.

6.2 Related Work

Network engineering is an important vision research area since well-designed net-


works improve performance for different applications. From the time LeNet (LeCun
et al. 1998) was introduced and since the renaissance of deep neural networks through
AlexNet (Krizhevsky et al. 2012), a plethora of CNN architectures (He et al. 2016;
Sercu et al. 2016; Szegedy et al. 2017) have come about to solve computer vision and
image processing problems. Each architecture either focused on a fundamental prob-
lem associated with learning or improved existing architectures in certain aspects.
For example, VGGNet (Sercu et al. 2016) eliminated the need to fine-tune certain
hyperparameters like filter parameters and activation function by fixing them. ResNet
(He et al. 2016) emphasized that learning residual mapping rather than directly fitting
input to output eases training in deep networks. Inception Net (Szegedy et al. 2017)
highlighted on sparsity of connections to add over the existing advantage given by
convolutions by proposing to use a set of filters of different sizes at different layers,
with lower layers having more of smaller size filters and higher layers having more
of larger size filters. On top of ResNet architecture, various models such as WideRes-
Net (Zagoruyko and Komodakis 2016) and Inception-ResNet (Szegedy et al. 2017)
have been proposed recently. WideResNet (Zagoruyko and Komodakis 2016) pro-
poses a residual network having larger number of convolutional filters with reduced
depth. PyramidNet (Han et al. 2017) builds on top of WideResNet by gradually
increasing the width of the network. ResNeXt (Xie et al. 2017) uses grouped con-
volutions and showed that the cardinality leads to improved classification accuracy.
Huang et al. (2017) proposed a new architecture, DenseNet. It concatenates the input
features along with the output features iteratively, thus enabling each convolution
block to receive raw information from all previous blocks. A gating mechanism
was introduced in highway networks (Srivastava et al. 2015) to regulate the flow of
information along shortcut connections. Since deep and wide architectures incur high
computational cost and memory requirements, lightweight models like MobileNet
(Howard et al. 2017) and ShuffleNet (Zhang et al. 2018) use depthwise separable
convolutions. Most of these network engineering methods focus primarily on depth,
width, cardinality, or making computationally efficient models.
Another important aspect in designing networks to improve CNNs performance
is attention mechanism inspired from human visual system. Humans focus on salient
features in order to capture visual structure. However, in all the above-mentioned
architectures, all the feature maps in a layer are considered equally important and
passed on to next layer. But none of these architectures emphasize on importance of
feature maps. Residual Attention Network (Wang et al. 2017) proposed by Wang et
al. uses an encoder–decoder style attention module. Instead of directly computing
the 3D attention map, they divided the process which learns channel attention and
spatial attention separately. This separate channel and spatial attention process is
less computationally expensive and has less parameter overhead, due to which
it can be used as a plug-and-play module with any CNN architecture. The network
not only performs well but is also robust to noisy inputs by refining the feature

maps. In 2018, Hu et al. introduced Squeeze and Excitation (SE) block in their work
SENet (Hu et al. 2017) to compute channelwise attention wherein (i) the squeeze
operation compresses each feature map to a scalar representation using global average
pooling that subsequently maps to weights and (ii) the excite operation excites the
feature maps using the obtained weights. This architecture won the ILSVRC1 2017
challenge.
In this work, we propose an architecture, namely SEANet, which is built over
SENet (Hu et al. 2017). Following SE block, we introduce an aggregate block that
helps in global feature representation by downsampling the number of feature maps
through a simple sum operation.
In Sect. 6.3, an overview of SENet is provided. The proposed architecture SEANet
is elucidated in detail in Sect. 6.4. Results and analysis are discussed in Sect. 6.5.

6.3 Preliminaries

Squeeze-and-Excitation-Networks (SENet)
There are two issues with the existing models in the way they apply convolution
operation on inputs. Firstly, the receptive field has the information only about the
local region because of which the global information is lost. Secondly, all the feature
maps are given equal weight but some feature maps may be more useful for the next
layers than others. SENet (Hu et al. 2017) proposes a novel technique to retain global
information and also dynamically re-calibrate the feature map inter-dependencies.
The following two subsections explain these two problems in detail and how SENet
(Hu et al. 2017) addresses them using the squeeze and excitation operations.
Squeeze Operation: Since each of the learned filters operates within a local receptive field,
each unit of the output is deprived of the contextual information outside the local
region. Smaller the receptive field, lesser is the contextual information retained. The
issue becomes more severe when the network under consideration is very large and
the receptive field used is small.
SENet (Hu et al. 2017) solves this problem by finding a means to extract the
global information and then embed it into the feature maps. Let U = [u_1, u_2, ..., u_c]
be the output obtained from the previous convolutional layer. The global information is
obtained by applying global average pooling to each channel u_p of U to obtain the
channelwise statistics Z = [z_1, z_2, ..., z_c], where z_k is the kth element of Z,
computed as:
$$ z_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_k(i, j) \qquad (6.1) $$

1 http://image-net.org/challenges/LSVRC/.

Fig. 6.1 A SE-ResNet module

Equation (6.1) is the squeeze operation, denoted by Fsq . We see that Z obtained
in this way captures the global information for each feature map. Hence, the first
problem is addressed by squeeze operation.
Excitation Operation: Once the global information is obtained from the squeeze
step, the next step is to embed the global information to the output. Excitation step
basically multiplies the output feature maps by Z . But by simply multiplying the
feature maps with the statistics Z would not answer the second question. So in order
to re-calibrate the feature map inter-dependency, the excitation step uses a gating
mechanism consisting of a network (shown in Fig. 6.1) with two fully connected
layers having sigmoid as the output layer.
The excitation operation can be mathematically expressed as follows:
1. Let the network be represented as a function Fex.
2. Let Z, the channelwise statistics obtained from the squeeze step, be the input to the network.
3. Let S be the output of Fex.
4. Let FC1 and FC2 be fully connected layers with weights W1 and W2, respectively
(biases are set to 0 for simplicity).

5. Let δ be the ReLU (Qiu et al. 2017) activation applied to the output of FC1 , and
σ be the sigmoid activation applied after FC2 layer.
Then the output of the excitation operation can be expressed by the following equation:

$$ S = F_{ex}(Z, W) = FC_2(FC_1(Z)) = \sigma\big(W_2\,\delta(W_1 Z)\big) \qquad (6.2) $$

where Z is the statistics obtained from squeeze operation. The output S of the network
Fex is a vector of probabilities having same shape as the input to the network. The final
output of the SE block is obtained by scaling each element of U by corresponding
elements of S, i.e., X̃ = [x̃_1, x̃_2, ..., x̃_c] where x̃_p = s_p ∗ u_p. Thus,
the feature maps are dynamically re-calibrated.
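To make the squeeze and excitation computations concrete, a minimal PyTorch sketch of an SE block is given below. It is illustrative only: the reduction ratio applied between FC1 and FC2 is an assumed hyper-parameter (the common SENet value of 16 is used here), and biases are omitted as in the description above.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Illustrative SE block: squeeze (Eq. 6.1), excite (Eq. 6.2), channelwise rescaling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction, bias=False)
        self.fc2 = nn.Linear(channels // reduction, channels, bias=False)

    def forward(self, u):                                      # u: (B, C, H, W)
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                                 # squeeze: global average pooling
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excite: sigmoid(W2 ReLU(W1 z))
        return u * s.view(b, c, 1, 1)                          # rescale each feature map u_p by s_p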

6.4 Proposed Architecture

Our proposed architecture is called SEANet as it is built on top of SENet using an
aggregate operation. A PyTorch implementation of SENet2 with ResNet-20 as its base was
taken and modified by adding the proposed aggregate block. The depth of the proposed
architecture is similar to that of SENet, but it gives better accuracy than SENet.
The SEANet model differs from the SENet model in two major ways:
1. The number of feature maps in each block is increased in SEANet. In SENet,
this number varies as 16, 32, and 64 while in SEANet we set it as 128, 256,
and 512. We increased the number of feature maps because the more feature maps
there are, the better the effect of aggregation (downsampling) on the global
representation of features. Since we rely on deep features, as hand-engineering
features is infeasible, it is better to have a large number of feature maps prior
to aggregation. Note that for fair comparison with SENet, we also increased the
number of feature maps in SENet to 128, 256, and 512 (see Sect. 6.5.2).
2. SE block is followed by an aggregate block.
The complete architecture of SEANet model is depicted in Fig. 6.2.
The aggregate block operates in two steps. The block is fed with a user-defined
parameter called the aggregate factor, denoted by k. The two steps are:
Step 1: Incoming feature maps from SE block are divided into multiple groups using
k. If C = [c1 , c2 , c3 , . . . , cn ] are the n incoming feature maps from SE block, then

2 https://github.com/moskomule/senet.pytorch.

Fig. 6.2 Architecture of SEANet based on SE-ResNet-20

$$
\begin{aligned}
G_1 &= [c_1, c_2, \ldots, c_k]\\
G_2 &= [c_{k+1}, c_{k+2}, \ldots, c_{2k}]\\
&\ \ \vdots\\
G_i &= [c_{(i-1)k+1}, c_{(i-1)k+2}, \ldots, c_{ik}]\\
&\ \ \vdots\\
G_{n/k} &= [c_{((n/k)-1)k+1}, c_{((n/k)-1)k+2}, \ldots, c_n]
\end{aligned}
\qquad (6.3)
$$

Fig. 6.3 Aggregate block

are n/k mutually exclusive groups of feature maps. For example, in Fig. 6.2, after
the first three residual blocks and SE block, the number of incoming feature maps is
128. With aggregate factor k = 4, these feature maps are partitioned into 32 groups
with each group having 4 feature maps.
Step 2: Each group is downsampled to a single feature map using the aggregate
operation sum. That is,


$$ S_i = \mathrm{aggregate}(G_i) = \sum_{j=(i-1)k+1}^{ik} c_j \qquad (6.4) $$
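A minimal PyTorch sketch of this aggregate block is shown below. It assumes the aggregate factor k divides the number of incoming feature maps, as in the configurations described in this chapter; the class name and interface are illustrative and are not taken from the released code.

import torch
import torch.nn as nn

class AggregateBlock(nn.Module):
    """Illustrative aggregate block: partition the C incoming maps into C/k mutually
    exclusive groups of k consecutive maps (Eq. 6.3) and sum each group (Eq. 6.4)."""
    def __init__(self, k):
        super().__init__()
        self.k = k

    def forward(self, x):                       # x: (B, C, H, W) -> (B, C // k, H, W)
        b, c, h, w = x.shape
        return x.view(b, c // self.k, self.k, h, w).sum(dim=2)

With 128 incoming feature maps and k = 4, as in the example above, this produces 32 output maps.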

Figure 6.3 depicts the aggregate operation. The effect of downsampling by aggre-
gation is to remove redundant representations and obtain a global feature representa-
tion across feature maps in each group. To understand this, let us assume that we have
an RGB image. We combine/aggregate information from each of R, G, and B maps
to output a single grayscale feature map. This way we move away from local color
information from each individual map to a global grayscale map. Further, the aggre-
gation downsampled three feature maps into one grayscale map, thereby eliminating
implicitly redundancies in representing a pixel. Our idea is to extend this principle
through the layers of a deep network. As batch normalization extends the idea of
normalizing the input to normalizing all activations, so our SEANet extends
the aforementioned principle through the layers of a deep network. The advantages
of such extension by aggregation in our SEANet are manifold:
1. Redundancy in representation of features is minimized.
2. A global representation across feature maps is obtained.
3. With sum as the aggregate operation, significant gradient flows back during backpropagation, as sum passes its incoming gradient to all its operands.
4. Significant improvement in performance.
It is to be noted that many other aggregation operations including min, max
are available but sum performed the best. Further, one may argue that the number

of feature maps can be downsampled by keeping only the important ones where
importance is provided by the SE block. However, we observed that this idea drastically
pulls down the performance.

6.5 Results and Analysis

6.5.1 Datasets

We chose two benchmark datasets, namely CIFAR-10 (Krizhevsky et al. 2014a) and
CIFAR-100 (Krizhevsky et al. 2014b), for our experiments.
The CIFAR-10 dataset3 : The CIFAR-10 dataset consists of 60,000 color images
each of size 32 × 32. Each image belongs to one of the ten mutually exclusive classes,
namely airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The
dataset is divided into training and test set. The training set consists of 50,000 images
equally distributed across classes, i.e., 5000 images are randomly selected from each
of the classes. The test set consists of 10,000 images, with 1,000 randomly selected from each
class.
The CIFAR-100 dataset4 : The CIFAR-100 dataset is similar to CIFAR-10 except
that it contains 100 classes instead of 10 classes. The 100 classes are grouped into
20 superclasses. Each class consists of 600 images. The training set consists of 500
images from each of the 100 classes, and the test set consists of 100 images from
each of the 100 classes.
Before we delve into the superior performance of our architecture SEANet
against state-of-the-art architectures, we provide the implementation details.

6.5.2 Implementation Details

Data is preprocessed using per-pixel mean subtraction and padding by four pixels on
all sides. Subsequent to preprocessing, data augmentation is performed by random
cropping and random horizontal flipping. Model weights are initialized by Kaiming
initialization (He et al. 2015). The initial learning rate is set to 0.1 and is divided by 10
after every 80 epochs. We trained for 200 epochs. The other hyper-parameters such as
weight decay, momentum, and optimizer are set to 0.0001, 0.9 and stochastic gradient
descent (SGD), respectively. We fixed the batch size to 64. No dropout (Srivastava
et al. 2014) is used in the implementation. This setting remains the same across
the datasets. The hyper-parameters used in our implementation are summarized in

3 https://www.cs.toronto.edu/kriz/cifar.html.
4 https://www.cs.toronto.edu/kriz/cifar.html.

Table 6.1 Hyper-parameters


Hyper-parameters Values
Optimizer SGD
Initial learning rate 0.1
Batch-size 64
Weight-decay 1e-4
Momentum 0.9
Number of epochs 200

Table 6.2 Classification error (%) compared to the ResNet


Architecture CIFAR-10 CIFAR-100
SEANet 4.3 21.33
Res-20 8.6 32.63
Res-32 7.68 30.2
Res-44 6.43 26.85
Res-56 6.84 26.2
Res-110 6.34 26.67

Table 6.1. Our implementation is done in PyTorch (Paszke et al. 2019), and the code5 is
made publicly available. We trained our model on a Tesla K-40 GPU, and training
took around 22 h.
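For reproducibility, the sketch below shows how the training setup of Table 6.1 and the preprocessing described above could be wired up in PyTorch. It is an illustrative approximation, not the released training script: the ResNet-18 stand-in for SEANet and the omission of per-pixel mean subtraction are assumptions made only to keep the example self-contained.

import torch
import torchvision
import torchvision.transforms as T

# Augmentation from Sect. 6.5.2: pad by four pixels, random crop, random horizontal flip.
# (Per-pixel mean subtraction is omitted here to keep the sketch short.)
train_tf = T.Compose([T.RandomCrop(32, padding=4), T.RandomHorizontalFlip(), T.ToTensor()])
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)    # placeholder for the SEANet model
for m in model.modules():                               # Kaiming initialization (He et al. 2015)
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.1)  # divide lr by 10 every 80 epochs

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()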

6.5.3 Comparison with State of the art

First, we compare the performance of our SEANet with ResNet and SENet. Table 6.2
enumerates the error rate in classification on CIFAR-10 and CIFAR-100 datasets
with respect to SEANet and variants of ResNet. It is clearly evident that SEANet
outperforms all variants of ResNet on both the datasets. In CIFAR-10, we achieve the
smallest error rate of 4.3% that is better by 2% than the best performing ResNet-110.
Similarly, on CIFAR-100, we achieve the smallest error rate of 21.33% (Table 6.2), which is
better by around 5% than the best performing ResNet-56. It is to be noted that 1% on the
CIFAR-10 and CIFAR-100 test sets corresponds to 100 images. Therefore, we perform better
than ResNet on an additional 200 and roughly 500 images with respect to CIFAR-10 and
CIFAR-100, respectively.
Table 6.3 compares the performance of SEANet against SENet. Again, SEANet outperforms
SENet by around 3% and 9% on CIFAR-10 and CIFAR-100, respectively. The
remarkable improvement in performance can be attributed to presence of additional
aggregate block in SEANet. Figures 6.4 and 6.5 display the validation accuracy over

5 https://github.com/akhilesh-pandey/SEANet-pytorch.

Table 6.3 Classification error (%) compared to the SENet.


Architecture CIFAR-10 CIFAR-100
SEANet 4.3 21.33
SENet 7.17 30.45

Table 6.4 SEANet versus modified ResNet and SENet


Validation accuracy # Parameters
Architecture Cifar-10 Cifar-100 Cifar-10 Cifar-100
SEANet 95.7 78.67 16188330 16199940
ResNet* 95.28 79.27 17159898 17206068
SENet* 95.91 79.76 17420970 17467140
*Original models were modified by increasing number of feature maps in each block

Fig. 6.4 Validation accuracy over epochs for ResNet, SENet, and SEANet on CIFAR-10 dataset

epochs for ResNet, SENet, and SEANet on CIFAR-10 and CIFAR-100 datasets,
respectively.
As mentioned earlier, SEANet uses 128, 256, and 512 feature maps in its blocks
unlike the standard ResNet-20 and SENet (with standard ResNet-20 as its base). For
fair comparison, we increased the feature maps in blocks of standard ResNet-20 to
128, 256, and 512, respectively. Table 6.4 reports the performance of SEANet versus
modified ResNet-20 and modified SENet. SEANet performs better or on par with
modified ResNet-20 and modified SENet on CIFAR-10 dataset while on CIFAR-
100, it performs marginally lower. But it is to be noted that due to downsampling by
sum aggregation in SEANet, the number of parameters in SEANet is smaller than
the corresponding numbers in modified ResNet-20 and modified SENet. Specifically,
SEANet has around 8% fewer parameters than modified ResNet-20 or modified SENet. This is
a significant advantage for our architecture SEANet over modified ResNet-20 and modified SENet.

Fig. 6.5 Validation accuracy over epochs for ResNet, SENet, and SEANet on CIFAR-100 dataset

Table 6.5 Classification error (%) compared to the state-of-the-art architectures.


Architecture CIFAR-10 CIFAR-100
SEANet 4.3 21.33
EncapNet (Li et al. 2018) 4.55 26.77
EncapNet+ (Li et al. 2018) 3.13 24.01
EncapNet++ (Li et al. 2018) 3.10 24.18
GoodInit (Mishkin and Matas 2016) 5.84 27.66
BayesNet (Snoek et al. 2015) 6.37 27.40
ELU (Clevert et al. 2015) 6.55 24.28
Batch NIN (Changa and Chen 2015) 6.75 28.86
Rec-CNN (Liang and Hu 2015) 7.09 31.75
Piecewise (Agostinelli et al. 2015) 7.51 30.83
DSN (Lee et al. 2014) 8.22 34.57
NIN (Lin et al. 2014) 8.80 35.68
dasNet (Stollenga et al. 2014) 9.22 33.78
Maxout (Goodfellow et al. 2013) 9.35 38.57
AlexNet (Krizhevsky et al. 2012) 11.00 -
+ stands for mild augmentation and ++ stands for strong augmentation

Table 6.6 Effect of reduction factor used for downsampling in aggregate operation on CIFAR-10
and CIFAR-100 using SEANet
Reduction factor CIFAR-10 CIFAR-100
2 95.50 77.12
4 95.53 77.16
6 95.57 77.50
8 95.70 78.67
10 95.53 77.19
12 95.56 77.38

Table 6.5 compares SEANet against other state-of-the-art architectures. Note that
EncapNet (Li et al. 2018) is a very recent network architecture published in 2018. It
has two improvised variants, viz. EncapNet+ and EncapNet++. Our SEANet outper-
forms EncapNet and both its variants on the complex CIFAR-100 dataset by around
2%. On CIFAR-10, SEANet is about 1% behind the augmented variants of EncapNet, though it
outperforms the base EncapNet by 0.25%. Further, SEANet performs better than all other
enumerated state-of-the-art architectures.

6.5.4 Ablation Study

The aggregate operation used in the proposed SEANet downsamples number of fea-
tures by reduction factor of 8 on both CIFAR-10 and CIFAR-100. We did an ablation
study to determine the effect of this reduction factor. The results are reported in
Table 6.6 for various reduction factors of 2, 4, 6, 8, 10, and 12. Clearly, downsam-
pling by a factor of 8 gives best performance on both datasets. If reduction factor is
too small, then there is no elimination of redundant features, and if it is too large,
then there may be loss of useful features.

6.5.5 Discussion

The proposed SEANet is able to eliminate redundancies in feature maps and thereby
reduce number of feature maps by using simple aggregate operation of sum. Other
aggregate operations like max and min were also tried but did not give significant
improvement compared to sum. One possible future work could be to explore why
some aggregate functions perform better than others.

6.6 Conclusion

A novel architecture named “SEANet” is proposed that emphasizes global representation of
features and elimination of redundancy in representing features. This is
achieved by introduction of an additional aggregate block over squeeze and excite.
The aggregation operation deployed is sum. SEANet outperforms the recent state-of-
the-art architectures on CIFAR-10 and CIFAR-100 datasets by a significant margin.
The proposed architecture also has fewer parameters in comparison with the
corresponding ResNet-20 and SENet.

Acknowledgements We dedicate this work to our Guru and founder chancellor of Sri Sathya Sai
Institute of Higher Learning, Bhagawan Sri Sathya Sai Baba. We also thank DMACS for providing
us with all the necessary resources.

References

Agostinelli F, Hoffman M, Sadowski P, Baldi P (2015) Learning activation functions to improve


deep neural networks. In: ICLR workshop
Berg A, Deng J, Fei-Fei L (2010) Large scale visual recognition challenge (ILSVRC), 2010. URL
http://www.image-net.org/challenges/LSVRC3
Changa J, Chen Y (2015) Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583
Chollet F (2016) Xception: deep learning with depthwise separable convolutions. arXiv preprint
arXiv:1610.02357
Clevert D, Unterthiner T, Hochreiter S (2015) Fast and accurate deep networks learning by expo-
nential linear units. arXiv preprint arXiv:1511.07289
Gao H, Zhuang L, Van Der Maaten Laurens, Weinberger Kilian Q (2017) Densely connected
convolutional networks. CVPR 1(2):3
Goodfellow IJ, Warde-Farley D, Mirza M, Courville A, Bengio Y (2013) Maxout networks. arXiv
preprint arXiv:1302.4389
Han D, Kim J, Kim J (2017) Deep pyramidal residual networks. In: Proceedings of computer vision
and pattern recognition (CVPR)
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level per-
formance on imagenet classification. In: Proceedings of the IEEE international conference on
computer vision
He K et al (2016a) Deep residual learning for image recognition. In: Proceedings of the IEEE
conference on computer vision and pattern recognition
He K, Zhang X, Ren S, Sun J (2016b) Identity mappings in deep residual networks. In: European
conference on computer vision. Springer, Cham, pp 630–645
Howard AG, Zhu M, Chen B et al (2017) MobileNets: efficient convolutional neural networks for
mobile vision applications. arXiv Preprint arXiv:170404861
Huang G, Sun Y, Liu Z, Sedra D, Weinberger KQ (2016) Deep networks with stochastic depth. In:
European conference on computer vision. Springer, Cham, pp 646–661
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE confer-
ence on computer vision and pattern recognition(CVPR), pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing
internal covariate shift. In: ICML

Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional


neural networks. In: NIPS, pp 1106–1114
Krizhevsky A, Nair V, Hinton G (2014a) The CIFAR-10 dataset. Online: https://www.cs.toronto.
edu/~kriz/cifar.html
Krizhevsky A, Nair V, Hinton G (2014b) The CIFAR-100 dataset. Online: https://www.cs.toronto.
edu/kriz/cifar.html
LeCun Y et al (1998) Gradient-based learning applied to document recognition. Proc IEEE
86(11):2278–2324
Lee C, Xie S, Gallagher P, Zhang Z, Tu Z (2014) Deeply-supervised nets. arXiv preprint
arXiv:1409.5185
Liang M, Hu X (2015) Recurrent convolutional neural network for object recognition. In: The IEEE
conference on computer vision and pattern recognition (CVPR)
Li H, Gou X, Dai B, Ouyang W, Wang X (2018) Neural network encapsulation. arXiv preprint
arXiv:1808.03749
Lin M, Chen Q, Yan S (2014) Network in network. In: ICLR
Mishkin D, Matas J (2016) All you need is a good init. In: ICLR
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N,
Antiga L, Desmaison A (2019) Pytorch: an imperative style, high-performance deep learning
library. In: Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, Garnett R (eds)
Advances in neural information processing systems, vol 32, pp 8024–8035. Curran Associates,
Inc., URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-
deep-learning-library.pdf
Qiu S, Xu X, Cai B (2017) FReLU: flexible rectified linear units for improving convolutional neural
networks. arXiv preprint arXiv:1706.08098
Sercu T et al (2016) Very deep multilingual convolutional neural networks for LVCSR. 2016 IEEE
international conference on acoustics, speech and signal processing (ICASSP). IEEE
Snoek J, Rippel O, Swersky K, Kiros R, Satish N, Sundaram N, Patwary M, Prabhat M, Adams R
(2015) Scalable Bayesian optimization using deep neural networks. In: ICML
Srivastava RK, Greff K, Schmidhuber J (2015) Training very deep networks. In: Conference on
neural information processing systems
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way
to prevent neural networks from overfitting. J Mach Learn Res
Stollenga MF et al (2014) Deep networks with internal selective attention through feedback con-
nections. In: Advances in neural information processing systems
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A
(2015) Going deeper with convolutions. In: Proceedings of computer vision and pattern recogni-
tion (CVPR)
Szegedy C et al (2017) Inception-v4, inception-resnet and the impact of residual connections on
learning. AAAI, vol 4
Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention
network for image classification. arXiv preprint arXiv:1704.06904
Xie S, Girshick R, DollÁr P, Tu Z, He K (2017) Aggregated residual transformations for deep neural
networks. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE,
pp 5987–5995
Zagoruyko S, Komodakis N (2016) Wide residual networks. arXiv preprint arXiv:1605.07146
Zhang X, Zhou X, Lin M, Sun J (2018) ShuffleNet: an extremely efficient convolutional neural
network for mobile devices. In: Proceedings of the IEEE conference on computer vision and
pattern recognition, June 2018
Chapter 7
Improved Single Image Super-resolution
Based on Compact Dictionary Formation
and Neighbor Embedding
Reconstruction

Garima Pandey and Umesh Ghanekar

Abstract Single image super-resolution is one of the evolving areas in the field
of image restoration. It involves reconstruction of a high-resolution image from
available low-resolution image. Although a lot of research is available in this field,
many issues related to existing problems are still unresolved. This research work
focuses on two aspects of image super-resolution. The first aspect is that the process
of dictionary formation is improved by using a smaller number of images while
preserving maximum structural variations. The second aspect is
that pixel value estimation of high-resolution image is improved by considering only
those overlapping patches which are more relevant from the characteristics of image
point of view. For this, all overlapping pixels corresponding to a particular location
are classified whether they are part of smooth region or an edge. Simulation results
clearly prove the efficacy of the algorithm proposed in this paper.

7.1 Introduction

Single image super-resolution (SISR) is a part of image restoration in digital image


processing. It is used to upgrade the quality of an image, which is generally degraded
due to different constraints such as hardware limitations and environmental distur-
bance during its acquisition process. SISR is used as an image enhancement tool and
can be used along with the existing imaging systems. It is used in different fields like
medical, forensic, military, TV industry, satellite and remote sensing, telescopic and
microscopic imaging, biometric and pattern recognition, Internet, etc.
SISR is defined as a software method of producing a high-resolution (HR) image
from a low-resolution (LR) image which is obtained by an imaging system. It is an
inverse problem (Park et al. 2003) as given in Eq. (7.1). It is notoriously difficult to

G. Pandey (B) · U. Ghanekar


National Institute of Technology Kurukshetra, Kurukshetra, India
e-mail: garima_6160006@nitkkr.ac.in
U. Ghanekar
e-mail: ugnitk@nitkkr.ac.in


solve due to its ill-posedness and ill-conditioned behavior.

L = AU H (7.1)

In the equation, ‘A’ is a blurring factor, ‘U ’ is a scaling factor, ‘H ’ is a HR image, and


‘L’ is a LR image, respectively. Being an ill-posed inverse problem, SISR involves
solving of this mathematical equation through different optimization techniques.
Over the years different techniques based on neighbor embedding, sparse coding
(Yang et al. 2010), random forest (Liu et al. 2017), deep learning (Yang et al. 2019;
Dong et al. 2016), etc., have been proposed. Among these, neighbor embedding is one of the
older and simpler machine learning approaches for obtaining the HR image. Due to its
considerable performance in the field of SISR, neighbor embedding still attracts researchers.
In this paper, too, an attempt has been made to address the existing issues in neighbor
embedding-based SISR techniques. An effort is made to reduce the size of the dictionary
without affecting the performance of the algorithm. The problem of removing irrelevant
patches during the HR image reconstruction process is also considered, and an attempt is
made to alleviate it.
The rest of the paper is organized as follows. Section 7.2 presents and analytically
discusses the literature related to single image super-resolution. Section 7.3 explains
the algorithm proposed in this paper in detail. Section 7.4 describes the experiments
performed and discusses the related results to prove the efficacy of the proposed
algorithm. Finally, the conclusion is drawn in Sect. 7.5.

7.2 Related Literature in Field of Single Image


Super-resolution

In the past, a lot of work has been done in the field of SISR. Classification of the
existing techniques is discussed in detail in (Pandey and Ghanekar 2018, 2020). In the
spatial domain, techniques are categorized into interpolation-based, regularization-based,
and learning-based. Among the three groups, researchers place more emphasis on
learning-based SISR techniques since these techniques are able to create new spectral
information in the reconstructed HR image.
In learning-based methods, in the first stage, a dictionary is created for training the
computer, and once the machine is trained, then in the second stage, i.e., the recon-
struction stage, the HR image is estimated for the given input LR image.
In training stage, the target is to train the machine in such a way that same types of
structural similarities as of the input LR image are available in the computer database
for further processing. This is achieved either by forming internal (Egiazarian and
Katkovnik 2015; Singh and Ahuja 2014; Pandey and Ghanekar 2020, 2021; Liu
et al. 2017) or external dictionary/dictionaries (Freeman et al. 2002; Chang et al.
2004; Zhang et al. 2016; Aharon et al. 2006).

Fig. 7.1 Block diagram of the proposed algorithm (similarity-based selection of two dictionary images from the database of LR-HR pairs, edge finding and patch formation, neighbor embedding reconstruction of HR patches, and combination into the output HR image)

In the case of an external dictionary, the
size of the dictionary is governed by the number of images used in its formation and is of
major concern, since a greater number of images means a larger memory requirement and
more computations during the reconstruction phase. This can be reduced by forming the
dictionary from images that are similar to the input image but differ from each other, so
that fewer images are required to cover the structural variations present in the input
image. In the reconstruction stage, through local
processing (Chang et al. 2004; Bevilacqua et al. 2014; Choi and Kim 2017) or sparse
processing (Yang et al. 2010), HR image is recreated. Either neighbor embedding
(Chang et al. 2004) or direct mapping (Bevilacqua et al. 2014) is used in the case of
local processing. In neighbor embedding, a HR patch is constructed with the help of
a set of similar patches, searched in the dictionary formed in training phase. After
constructing corresponding HR patches for all input LR patches, all the HR patches
are combined to form the HR image in the output. In existing techniques, simple
averaging is used in the overlapping areas for combining all the HR patches, which may
result in poor HR image estimation when one or more erroneous patches fall in the
overlapping areas. Simple averaging also blurs the edges.
To overcome this problem, in this paper, only similar types of patches are con-
sidered for combining the HR patches in the overlapping areas. This is based on

classifying the patches as well as pixels under consideration in smooth or edge


regions. This results in better visual quality of the output HR image. Also, an external
dictionary is formed from images which are similar to input image but differ from
one another. This helps in capturing the maximum structural properties present in
the input image with the help of minimum number of images.

7.3 Proposed Algorithm for Super-resolution

Learning-based SISR has two important parts for HR image reconstruction. At first,
a dictionary is built, and then, with the help of this dictionary and the input LR
image, a HR image is reconstructed. The method proposed in this paper consists of
forming an external dictionary for the training part and a neighbor embedding approach
for the reconstruction part. Its generalized block diagram is provided in Fig. 7.1. All
the LR images of the dictionary as well as the input image are upscaled by a factor
U through bi-cubic interpolation. On the basis of structural similarity score (Wang
et al. 2004) given in Eq. (7.2), a set of ten images is selected from the database
for dictionary formation. Once images having higher score are selected from the
database, two of them are selected to have the maximum structural variations that are
present in the input LR image. For this, image having the highest structural similarity
score with the input image will be first selected for the dictionary formation, and then,
structural similarity score is calculated between the selected image and rest of the
images chosen in the first step. Image having the least structural similarity will be
considered as the second image for the dictionary formation. The complete dictionary
is formed with the help of these two selected LR images and their corresponding HR
pairs by forming overlapping patches of 5 × 5.

$$ SSIM(i, j) = \frac{2\mu_i \mu_j + v_1}{\mu_i^2 + \mu_j^2 + v_1} \qquad (7.2) $$

where µ is the mean, v_1 is a constant, and i and j represent the ith and jth images, respectively.
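An illustrative sketch of this two-image selection is given below. It uses the full SSIM of scikit-image as a stand-in for the similarity score of Eq. (7.2), which keeps only the luminance term; the function name and arguments are assumptions for illustration, and 8-bit grayscale images of equal size are assumed.

import numpy as np
from skimage.metrics import structural_similarity as ssim

def select_dictionary_images(input_lr, database_lr):
    """Pick two dictionary images: the one most similar to the input, then the one
    (among the ten best matches) least similar to the first, so that together they
    cover more of the structural variation present in the input LR image."""
    scores = [ssim(input_lr, img, data_range=255) for img in database_lr]
    top10 = np.argsort(scores)[::-1][:10]           # ten most similar images, in order
    first = int(top10[0])
    rest = top10[1:]
    rest_scores = [ssim(database_lr[first], database_lr[int(i)], data_range=255) for i in rest]
    second = int(rest[int(np.argmin(rest_scores))])  # least similar to the first image
    return first, second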
In the second stage, i.e., the reconstruction part, the constructed dictionary is used for
HR image generation. At first, overlapping blocks of 5 × 5 are obtained from the
input LR image, then for every individual block ‘k’, number of nearest neighbors are
searched in the dictionary. LLE (Chang et al. 2004) is used as neighbor embedding
technique to obtain optimal weights for the set of nearest neighbors, and then, these
weights along with corresponding HR pairs are used to construct HR patches to
estimate HR image. In overlapping areas, simple averaging results in blurry edges.
Thus, a new technique has been given here to combine the patches. In this, every pixel
of the output HR image is individually estimated with the help of the constructed
HR patches. The steps are as follows:

i. Edge matrix for the input LR image is calculated by canny edge detector, and
then, overlapping patches are formed from the matrix just like that of patches
formed for the input LR image.
ii. At every pixel location of the HR image, check whether 0 or 1 is present in its
corresponding edge matrix. 0 represents smooth pixel, and 1 represents edge
pixel. To confirm the belongingness of each pixel to its respective group,
consider a 9 × 9 block of the edge matrix having the pixel under consideration as
the center pixel, and calculate the number of zeros or ones in all the four directions
given in Fig. 7.2.
iii. If, in any of the four directions, the number of zeros is 5 in the case of a smooth
pixel, or the number of ones is 5 in the case of an edge pixel, then assign that
pixel as a true smooth or true edge pixel, respectively.
iv. For the pixels that cannot be categorized into true smooth or edge pixel, count
the number of zeros or ones in the 9 × 9 block of edge matrix with pixel under
consideration at center. If count of zero is 13, consider the pixel as smooth pixel,
and if count of one is 13, then pixel under consideration is edge pixel.
v. After categorizing the pixel into smooth or edge type, its value is estimated
with the help of HR patches that contains the overlapping pixel position and its
corresponding patch of edge matrix.
vi. Instead of considering all the overlapping pixels from their respective overlapping
HR patches, only pixels from patches of the same type as the pixel being estimated
are considered (a sketch of this selective combination follows the list). The two
cases are explained separately.
a. At a particular pixel location of the bi-cubic interpolated LR image, if by
following the above procedure the pixel is considered to be smooth type,
then to estimate the pixel value at same location in HR image, same type
of patches will be considered (out of selected 25 patches, number will be
less for boundary patches). For this, firstly, overlapping HR patch having all
zeros is chosen. If not available, then, the patches having maximum number
of zeros will be considered.
b. Similarly, for edge type, at first, overlapping HR patch having all ones is
chosen. If not available, then, the patches having maximum number of ones
will be considered to estimate the value instead of all the patches.
vii. Process of assigning the values to all the pixels of HR image is performed
individually to obtain the HR image in the output.
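The patch-selective combination of step (vi) can be sketched as follows. The function name and its inputs are illustrative assumptions: for one HR pixel location it receives the values proposed by the overlapping HR patches and the corresponding 5 × 5 slices of the canny edge matrix, and averages only the contributions from patches matching the pixel type.

import numpy as np

def estimate_pixel(pixel_type, hr_values, edge_patches):
    """Combine overlapping HR patch contributions at one pixel location (step vi).
    hr_values[i] is the value proposed by the i-th overlapping HR patch and
    edge_patches[i] is that patch's slice of the binary canny edge matrix."""
    match = 0 if pixel_type == "smooth" else 1
    counts = np.array([np.sum(p == match) for p in edge_patches])
    full = counts == edge_patches[0].size             # patches that are entirely smooth (or edge)
    chosen = np.flatnonzero(full) if full.any() else np.flatnonzero(counts == counts.max())
    return float(np.mean([hr_values[i] for i in chosen]))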

7.4 Experiment and Discussion

In this section, experiments are performed to prove the usefulness of the proposed
algorithm for generating a HR image from LR image. Images that are used for
validation of the algorithm are given in Fig. 7.3. In all the experiments, LR images
are formed by smoothing the HR image with a 5 × 5 average filter and then reducing
its size by a factor of U. In this paper, U is taken as 3.

Fig. 7.2 Four different directions

Fig. 7.3 Images used for experimental simulations: Starfish, Bear, Flower, Tree, and Building

The experiment


is performed only on the luminance part of the image, which is obtained by converting
the image to the ‘YCbCr’ color space. To obtain the final HR image in the output,
the chrominance parts are upscaled by bi-cubic interpolation. Algorithms are compared
on the basis of the structural similarity index measure (SSIM) and the peak signal-to-noise
ratio (PSNR), given by Eqs. (7.2) and (7.3), respectively.
$$ \mathrm{PSNR} = 10 \log_{10}\!\left(\frac{(\text{Maximum Pixel Value})^2}{\text{Mean Squared Error}}\right) \qquad (7.3) $$
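Both measures are available in scikit-image; the short sketch below shows how the reported numbers could be computed, assuming 8-bit luminance images. The full SSIM of scikit-image is used here as a stand-in for the score of Eq. (7.2).

from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(hr_true, hr_estimated):
    """Compute the two quality measures used in Tables 7.1 and 7.2."""
    psnr = peak_signal_noise_ratio(hr_true, hr_estimated, data_range=255)
    ssim = structural_similarity(hr_true, hr_estimated, data_range=255)
    return psnr, ssim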

A set of 100 standard images is selected from (Berkeley dataset 1868) to form
database of LR-HR images. All the LR images are upscaled by a factor of three using
bi-cubic interpolation. For dictionary formation, two images are selected one by one.

For selection of the first image, structural similarity score is calculated between input
image and database LR images, and image having the highest similarity score with
the input is selected. Nine more images having higher score are selected for deciding
the second image that will be used for dictionary formation. For selection of the
second image, structural similarity score between the first selected image and rest
of nine images is taken. Then, image having the lowest score with the first image
will be selected. This process of image selection will help in forming dictionary with
maximum possible variations with only two images. Overlapping LR-HR patches are
formed with the help of these two images to form the dictionary for training purpose.
Once the dictionary formation is completed, procedure for conversion of input
LR image into HR image starts. For this, the input LR image is first upscaled by
a factor of three using bi-cubic interpolation, and then, overlapping patches of size
5 × 5 are formed from it. For every patch, six nearest neighbors of LR patches
are selected from the dictionary using Euclidean distance. Now, with the help of
these selected LR patches (from the dictionary), optimal weights are calculated for
corresponding HR patches using LLE to obtain the final corresponding HR patch.
All such constructed HR patches are combined to obtain the HR image by estimating
each pixel individually by the technique proposed in the paper. Experimental results
of the proposed algorithm are given in Tables 7.1 and 7.2 to compare it with other
existing techniques like bi-cubic interpolation, neighbor embedding given in (Chang
et al. 2004) and sparse coding given in (Yang et al. 2010).
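A compact sketch of this neighbor-embedding step is given below, following the standard LLE weight computation of Chang et al. (2004); the regularization constant and the function interface are assumptions for illustration, and dictionary patches are assumed to be stored as flattened row vectors.

import numpy as np

def lle_reconstruct(lr_patch, dict_lr, dict_hr, k=6, reg=1e-4):
    """Reconstruct one HR patch: find the k nearest LR dictionary patches, compute
    sum-to-one LLE weights by regularized least squares, and apply the same weights
    to the paired HR patches."""
    idx = np.argsort(np.linalg.norm(dict_lr - lr_patch, axis=1))[:k]
    N = dict_lr[idx] - lr_patch              # neighbors centred on the query patch
    G = N @ N.T                              # local Gram matrix
    G += reg * np.eye(k)                     # regularization for numerical stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                             # enforce the sum-to-one constraint
    return w @ dict_hr[idx]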

Table 7.1 Experimental results for proposed method and other methods for comparison, in terms
of PSNR
Sr. no. Name of image Bi-cubic NE (Chang et al. 2004) ScSR (Yang et al. 2010) Proposed algorithm
1. Starfish 25.38 27.71 28.43 29.01
2. Bear 26.43 28.92 30.05 29.91
3. Flower 27.21 29.67 29.11 30.54
4. Tree 25.15 28.57 28.03 28.87
5. Building 27.12 29.97 30.34 30.21

Table 7.2 Experimental results for proposed method and other methods for comparison, in terms
of SSIM
Sr. no. Name of image Bi-cubic NE (Chang et al. 2004) ScSR (Yang et al. 2010) Proposed algorithm
1. Starfish 0.8034 0.8753 0.8764 0.9091
2. Bear 0.7423 0.8632 0.8725 0.8853
3. Flower 0.7932 0.8321 0.8453 0.8578
4. Tree 0.7223 0.8023 0.8223 0.8827
5. Building 0.8059 0.8986 0.9025 0.9001

Fig. 7.4 Comparison of different algorithms for SISR for a scale factor of 3: LR input image, actual HR image, bi-cubic interpolated image, NE algorithm (Chang et al. 2004), ScSR algorithm (Yang et al. 2010), and the proposed algorithm

The tables and figures comparing the proposed algorithm with a few other algorithms show
that the results of our algorithm are better than those of the other algorithms used in the
present study. The HR image constructed by our algorithm also has better visual quality
when juxtaposed with the images obtained from the other algorithms (Fig. 7.4).

7.5 Conclusion

The research work presented in this paper focuses on generating a HR image from
a single LR image with the help of an external dictionary. A novel way of building
an external dictionary has been proposed which helps capture the maximum structural
variations with a smaller number of images in the dictionary. To achieve this, images
that are similar to the input LR image but differ from each other are selected for
dictionary formation. This helps to reduce the size of the dictionary and hence the
number of computations during the process of finding nearest neighbors.
To form the complete HR image, a new technique based on classifying the pixels as
the part of smooth or edge region is used for combining the HR patches in overlapping
areas that are generated using LLE. The results obtained through experiments verify
the effectiveness of the algorithm.

References

Aharon M, Elad M, Bruckstein A (2006) The K-SVD: an algorithm for designing of over-complete
dictionaries for sparse representation. IEEE Trans Signal Process 54(11), 4311–4322
Berkeley dataset (1868) https://www2.eecs.berkeley.edu
Bevilacqua M, Roumy A, Guillemot C, Morel M.-L. A (2014) Single-image super-resolution via
linear mapping of interpolated self-examples. In: IEEE Transactions on Image Processing, vol.
23(12), pp 5334–5347
Chang H, Yeung DY, Xiong Y (2004) Super-resolution through neighbor embedding. In: IEEE
conference on computer vision and pattern recognition, vol. 1, pp 275–282

Choi JS, Kim M (2017) Single image super-resolution using global regression based on multiple
local linear mappings. In: IEEE transactions on image processing, vol. 26(3)
Dong C, Loy CC, He K, Tang X (2016) Image super-resolution using deep convolutional networks.
IEEE Trans Pattern Anal Mach Intell 38(2), 295–307
Egiazarian K, Katkovnik V (2015) Single image super-resolution via BM3D sparse coding. In: 23rd
European signal processing conference, pp 2899–2903
Freeman W, Jones T, Pasztor E (2002) Example-based super-resolution. IEEE Comput Graph Appl 22(2):56–65
Liu ZS, Siu WC, Huang JJ (2017) Image super-resolution via weighted random forest. In: 2017
IEEE international conference on industrial technology (ICIT). IEEE
Liu C, Chen Q, Li H (2017) Single image super-resolution reconstruction technique based on a
single hybrid dictionary. Multimedia Tools Appl 76(13), 14759–14779
Pandey G, Ghanekar U (2018) A compendious study of super-resolution techniques by single image.
Optik 166:147–160
Pandey G, Ghanekar U (2020) Variance based external dictionary for improved single image super-
resolution. Pattern Recognit Image Anal 30:70–75
Pandey G, Ghanekar U (2020) Classification of priors and regularization techniques appurtenant to
single image super-resolution. Visual Comput 36:1291–1304. doi: 10.1007/s00371-019-01729-z
Pandey G, Ghanekar U (2021) Input image-based dictionary formation in super-resolution for online
image streaming. In: Hura G, Singh A, Siong Hoe L (eds) Advances in communication and
computational technology. Lecture notes in electrical engineering, vol 668. Springer, Singapore
Park SC, Park MK, Kang MG (2003) Super-resolution image reconstruction: a technical overview.
IEEE Signal Process Magz 20(3), 21–36
Singh A, Ahuja N (2014) Super-resolution using sub-band self-similarity. In: Asian conference on computer vision, pp 552–568
Wang Z, Bovik A, Sheikh H, Simoncelli E (2004) Image quality assessment: from error visibility
to structural similarity. IEEE Trans Image Process 13(4), 600–612
Yang J, Wright J, Huang TS, Ma Y (2010) Image Super-Resolution via Sparse Representation.
IEEE Trans Image Process 19(11), 2861–2873
Yang W, Zhang X, Tian Y, Wang W, Xue J-H, Liao Q (2019) Deep learning for single image
super-resolution: a brief review. IEEE Trans Multimedia 21(12), 3106–3121
Zhang Z, Qi C, Hao Y (2016) Locality preserving partial least squares for neighbor embedding-
based face hallucination. In: IEEE conference on image processing, pp 409–413
Chapter 8
An End-to-End Framework for Image
Super Resolution and Denoising of SAR
Images

Ashutosh Pandey, Jatav Ashutosh Kumar, and Chiranjoy Chattopadhyay

Abstract Single image super resolution (or upscaling) has become very efficient
because of the powerful application of generative adversarial networks (GANs).
However, the presence of noise in the input image often produces undesired artifacts
in the resultant output image. Denoising an image and then upscaling introduces more
chances of these artifacts due to the accumulation of errors in the prediction. In this
work, we propose a single shot upscaling and denoising of SAR images using GANs.
We have compared the quality of the output image with the two-step denoising and
upscaling network. To evaluate our standing with respect to the state-of-the-art, we
compare our results with other denoising methods without super resolution. We also
present a detailed analysis of experimental findings on the publicly available COWC
dataset, which comes with context information for object classification.

8.1 Introduction

Synthetic aperture radar (SAR) is an imaging method capable of capturing high-resolution
images of terrains in all weather conditions and at any time of the day. SAR is a coherent
imaging technology that overcomes the drawbacks of optical and infrared imaging. SAR
has proven to be very beneficial in ground observation and military applications.
However, being a coherent imaging technique, it suffers from multiplicative speckle
noise caused by the constructive and destructive interference of the returned signals.
The presence of speckle noise adversely affects the performance of computer vision
techniques and makes it difficult to derive useful information from the data.
A. Pandey (B) · J. Ashutosh Kumar · C. Chattopadhyay
Indian Institute of Technology Jodhpur, Jodhpur, Rajasthan 342037, India
e-mail: pandey.3@iitj.ac.in
J. Ashutosh Kumar
e-mail: jatav.1@iitj.ac.in
C. Chattopadhyay
e-mail: chiranjoy@iitj.ac.in

The research community has made several efforts to remove noise in the past, including filtering methods, wavelet-based methods, and deep learning-based methods such as convolutional neural networks (CNNs) and generative adversarial networks (GANs). Section 8.3 gives a detailed description of such techniques. In applications involving images or videos, high-resolution data are usually desired for advanced computer vision tasks. The rationale behind the demand for high image resolution is either to improve the pixel information for human perception or to ease computer vision tasks. Image resolution describes the level of detail in an image: the higher the resolution, the more detail. Among the various denotations of the word resolution, we focus on spatial resolution, which refers to the pixel density in an image and is measured in pixels per unit area.
In situations like satellite imagery, it is challenging to use high-resolution sensors due to physical constraints. To address this problem, the image is captured at a low resolution and post-processed to obtain a high-resolution image. These techniques are commonly known as super resolution (SR) reconstruction. SR techniques construct high-resolution (HR) images from one or more observed low-resolution (LR) images. In this process, the high-frequency components are enhanced, and the degradations caused by the imaging process of the low-resolution camera are removed. Super resolution has proven to be very useful both for better visual quality and for easing detection by other computer vision techniques. However, the presence of noise in the input image is problematic because the network amplifies the noise during upscaling. We merge the two-step procedure of denoising and upscaling into a single step and compare the output quality and the performance benefit of running one network instead of two. The motivation is to create a single artificial neural network with both denoising and upscaling capabilities.
Contributions
The primary contributions of our work are:
1. We analyze the two possible approaches to denoising and super resolution of SAR images, single-step and two-step, and compare their performance on the compiled dataset.
2. Through empirical analysis, we demonstrate that the single-step approach better preserves the details of the noisy image. Even though the ID-CNN network achieves higher PSNR values, the RRDN images better maintain the high-level features present in the image, which is more useful for object recognition than PSNR-driven networks that fail to preserve these details.
Organization of the chapter
The remainder of this chapter is organized as follows. To give the reader a clear understanding of the problem, Sect. 8.2 defines speckle noise. Section 8.3 presents a detailed description of the works in the literature related to our framework. Section 8.4 describes the proposed framework in detail, and Sect. 8.5 reports the experimental findings. In Sect. 8.6, we present a detailed analysis of the obtained results. Finally, Sect. 8.7 concludes the chapter and indicates directions for future work.

Fig. 8.1 Artificial noise added in the dataset

8.2 Speckle Noise

Speckle noise arises due to the effect of environmental conditions on the imaging sensor during image acquisition. Speckle noise is primarily prevalent in application areas like medical imaging, active radar imaging, and synthetic aperture radar (SAR) imaging. The model commonly used for representing the multiplicative speckle noise in SAR is defined as:

Y = N X    (8.1)

where Y \in \mathbb{R}^{W \times H} is the observed intensity SAR image, X \in \mathbb{R}^{W \times H} is the noise-free SAR image, and N \in \mathbb{R}^{W \times H} is the speckle noise random variable. Here, W and H denote the width and height of the image, respectively. Let the SAR image be an average of L looks; then N follows a gamma distribution with unit mean and variance 1/L, with the probability density function:

p(N) = \frac{1}{\Gamma(L)} L^{L} N^{L-1} e^{-LN}, \quad N \ge 0,\ L \ge 1    (8.2)

where \Gamma(\cdot) is the Gamma function. L, the equivalent number of looks (ENL), is usually regarded as the quantitative evaluation index for de-speckling experiments on real SAR images in homogeneous areas and is defined as:

ENL = \frac{\bar{X}^{2}}{\sigma_{X}^{2}}    (8.3)

where \bar{X} and \sigma_{X}^{2} are the image mean and variance.


Figure 8.1a shows the probability density function of Eq. (8.2) for L = 1, 4 and 10, along with the histogram of random samples drawn from the NumPy gamma distribution. It can be observed from Fig. 8.1 how the image quality changes with

the increasing values of the hyperparameter used to define the noise. As a result, proposing a unified, end-to-end model for such a task is challenging.
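As a concrete illustration of Eqs. (8.1)–(8.3), the following is a minimal NumPy sketch that draws unit-mean gamma speckle samples and estimates the ENL of a homogeneous patch; the patch values and the chosen looks are placeholders for illustration only, not the chapter's experimental settings.

```python
import numpy as np

def speckle(shape, looks):
    # N ~ Gamma(shape=L, scale=1/L): unit mean and variance 1/L, as in Eq. (8.2)
    return np.random.gamma(looks, 1.0 / looks, size=shape)

def add_speckle(clean, looks):
    # Multiplicative model of Eq. (8.1): Y = N * X
    return clean * speckle(clean.shape, looks)

def enl(region):
    # Equivalent number of looks of a homogeneous region, Eq. (8.3)
    return region.mean() ** 2 / region.var()

if __name__ == "__main__":
    x = np.full((256, 256), 0.5)        # hypothetical homogeneous "clean" patch
    for looks in (1, 4, 10):
        y = add_speckle(x, looks)
        print(looks, round(enl(y), 2))  # the estimated ENL should be close to L
```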

8.3 Literature Survey

Given the importance of SAR denoising explained above, several techniques have been proposed in the literature over the past years. In 1981, the Lee filter (Wang and Bovik 2002) was proposed, which uses statistical techniques to define a noise model; the probabilistic patch-based filter (PPB) (Deledalle et al. 2009) uses a similarity criterion based on the noise distribution; non-local means (NL-means) (Buades et al. 2005) uses all possible self-predictions to preserve texture and details; and block-matching 3D (BM3D) (Kostadin et al. 2006) exploits inter-patch correlation (NLM) and intra-patch correlation (wavelet shrinkage). The deep learning approach has received much attention in the area of image denoising. However, there are tangible differences among the various types of deep learning techniques dealing with image denoising. For example, discriminative learning based on deep learning tackles Gaussian noise, whereas optimization models based on deep learning are useful in estimating the actual noise. In Tian et al. (2020), a comparative study of deep learning techniques for image denoising is presented in detail.
There have been several approaches to speckle reduction in important application domains. Speckle reduction is an important step before the processing and analysis of medical ultrasound images. In Yin et al. (2020), a new algorithm based on deep learning is proposed to reduce speckle noise in coherent imaging without clean data. In Gai et al. (2018), a new speckle noise reduction algorithm for medical ultrasound images is proposed, employing the monogenic wavelet transform (MWT) and a Bayesian framework. Considering the search for an optimal threshold as exhaustive and the requirements as contradictory, Sivaranjani et al. (2019) conceive the problem as a multi-objective particle swarm optimization (MOPSO) task and propose a MOPSO framework for de-speckling an SAR image using a dual-tree complex wavelet transform (DTCWT) in the frequency domain. Huang et al. (2009) proposed a novel method that includes the coherence reduction speckle noise (CRSN) algorithm and the coherence constant false-alarm ratio (CCFAR) detection algorithm to reduce speckle noise in SAR images and to improve the detection ratio for SAR ship targets from the SAR imaging mechanism. Techniques such as (Vikrant et al. 2015; Yu et al. 2020; Vikas Kumar and Suryanarayana 2019) proposed speckle denoising filters specifically designed for SAR images and have shown encouraging performance. For target recognition from SAR images, Wang et al. (2016) proposed a complementary spatial pyramid coding (CSPC) approach in the framework of spatial pyramid matching (SPM) (Fig. 8.2).
Fig. 8.2 ID-CNN network architecture for image de-speckling

In Wang et al. (2017), a novel technique was proposed in which the network has eight convolution layers with rectified linear units (ReLU); each convolution layer has 64 filters of 3 × 3 kernel size with a stride of one and no pooling. The network applies a combination of batch normalization and a residual learning strategy with a skip connection, in which the input image is divided by the speckle noise pattern estimated in the image. The method uses the L2 norm (Euclidean distance) as the loss function, which reduces the distance between the output and the target image; however, this can introduce artifacts and does not take neighboring pixels into account. A total variation loss, balanced by the regularization factor λ_TV, is therefore added to the overall loss function to overcome this shortcoming; the TV loss encourages smoother results. Assuming \hat{X} = \phi(Y^{w,h}), where \phi is the learned network (parameters) that generates the despeckled output, the loss is defined as:

L = L_{E} + \lambda_{TV} L_{TV}    (8.4)

L_{E} = \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \left\| \phi(Y^{w,h}) - X^{w,h} \right\|_{2}^{2}    (8.5)

L_{TV} = \sum_{w=1}^{W} \sum_{h=1}^{H} \left( \hat{X}_{w+1,h} - \hat{X}_{w,h} \right)^{2} + \left( \hat{X}_{w,h+1} - \hat{X}_{w,h} \right)^{2}    (8.6)

The method proposed in Wang et al. (2017) performs well compared to the traditional image processing methods mentioned above; hence, we compare our work with Wang et al. (2017).
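For readers who want to see how the terms of Eqs. (8.4)–(8.6) fit together, the fragment below is a minimal NumPy sketch of the Euclidean and total variation losses. It is our own illustration of the published formulation, not code released with Wang et al. (2017), and the regularization weight is an assumed value.

```python
import numpy as np

def euclidean_loss(pred, target):
    # L_E of Eq. (8.5): mean squared error over the W x H image
    return np.mean((pred - target) ** 2)

def tv_loss(pred):
    # L_TV of Eq. (8.6): sum of squared differences between neighbouring pixels
    dw = pred[1:, :-1] - pred[:-1, :-1]
    dh = pred[:-1, 1:] - pred[:-1, :-1]
    return np.sum(dw ** 2 + dh ** 2)

def total_loss(pred, target, lambda_tv=0.002):  # lambda_tv is an assumed value
    # L = L_E + lambda_TV * L_TV, Eq. (8.4)
    return euclidean_loss(pred, target) + lambda_tv * tv_loss(pred)
```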

8.4 Proposed Model

8.4.1 Architecture

We propose a single-step artificial neural network model for the image super resolution and SAR denoising tasks, inspired by the RRDN GAN (Wang et al. 2019) model. Figure 8.3 illustrates the proposed modification of the original RRDN network. The salient features of the proposed model are:
• To compare it to the other noise removal techniques, we removed the upscaling part of the super-resolution GAN and trained the network for 1×.

Fig. 8.3 The RRDN architecture (we adapted the network without the upscaling layer for comparison)

Fig. 8.4 A schematic representation of the residual in residual dense block (RRDB) architecture

Fig. 8.5 A schematic diagram of the discriminator architecture

• We also train the network with upscaling layers for simultaneous super resolution
and noise removal.
The model was trained in various configurations; however, the best results were found for 10 RRDB blocks. Figure 8.4 depicts an illustration of this architecture. There are 3 RDB blocks in each RRDB block, 4 convolution layers in each RDB block, 64 feature maps in the RDB convolution layers, and 64 feature maps outside the RDB convolutions. Figure 8.5 shows the discriminator used in the adversarial network. As shown in Fig. 8.5, the discriminator has a series of convolution and ReLU layers, followed by a dense layer of dimension 1024 and, finally, a dense layer with a Sigmoid function to classify between the low- and high-resolution images. Next, we discuss the various loss functions used in the network.
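To make the block structure concrete, the following is a minimal Keras sketch of a generator in the spirit of the configuration just described (10 RRDB blocks, 3 RDBs per RRDB, 4 convolutions per RDB, 64 feature maps). The use of Keras, the exact layer arrangement inside each block, and the residual scaling factor are our assumptions and may differ from the network actually trained.

```python
import tensorflow as tf
from tensorflow.keras import layers

BETA = 0.2  # assumed residual scaling factor

def rdb(x, convs=4, filters=64):
    """Residual dense block: densely connected convolutions plus a local residual."""
    inputs = x
    for _ in range(convs):
        out = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, out])
    x = layers.Conv2D(filters, 1, padding="same")(x)   # local feature fusion
    return layers.Add()([inputs, x])

def rrdb(x, rdbs=3):
    """Residual-in-residual dense block: a stack of RDBs with a scaled skip connection."""
    inputs = x
    for _ in range(rdbs):
        x = rdb(x)
    x = layers.Lambda(lambda t: t * BETA)(x)
    return layers.Add()([inputs, x])

def build_generator(blocks=10, filters=64):
    inp = layers.Input(shape=(None, None, 1))           # grayscale SAR input
    x = layers.Conv2D(filters, 3, padding="same")(inp)
    for _ in range(blocks):
        x = rrdb(x)
    out = layers.Conv2D(1, 3, padding="same")(x)         # 1x variant: no upscaling layer
    return tf.keras.Model(inp, out)
```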

8.4.2 Loss Function

An artificially intelligent system learns through a loss function, which is a way of assessing how well a specific algorithm models the given data. If the predictions deviate too much from the actual results, the loss function produces a large value. Progressively, with the help of an optimization function, the model learns to reduce the prediction error. In this work, we use the perceptual loss L_percep proposed in ESRGAN (Wang et al. 2019) for training, along with the adversarial discriminator loss.

8.4.2.1 Perceptual Loss

In the perceptual loss (Wang et al. 2019), we measure the mean squared error between the feature maps of a pre-trained network. For our training, we use layers 5 and 9 of the pre-trained VGG19 model. The perceptual loss function is an improved version of the MSE that evaluates a solution based on perceptually relevant characteristics and is defined as follows:

l^{SR} = \underbrace{l_{X}^{SR}}_{\text{content loss}} + \underbrace{10^{-3}\, l_{Gen}^{SR}}_{\text{adversarial loss}}    (8.7)

where l_{X}^{SR} is the content loss and l_{Gen}^{SR} is the adversarial loss, which are defined in the following sections.

8.4.2.2 Content Loss

Content is defined as the information available in an image. When visualizing the learned components of a neural network, it has been shown in the literature that different feature maps in higher layers are activated in the presence of different objects. So, if two images have the same content, they should have similar activations in the higher layers. The content loss function ensures that the activations of the higher layers are similar between the content image and the generated image, i.e., that the content present in the content image is reproduced in the generated image. Researchers have shown that CNNs capture knowledge about content at the higher levels, whereas the lower levels concentrate on individual pixel values. The VGG loss is defined as follows:
l_{VGG/i,j}^{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y} \right)^{2}    (8.8)

where \phi_{i,j} represents the feature map obtained by the j-th convolution before the i-th max-pooling layer, W_{i,j} and H_{i,j} describe the dimensions of the feature maps within the VGG network, G_{\theta_G}(I^{LR}) is the reconstructed image, and I^{HR} is the ground truth image.

8.4.2.3 Adversarial Loss

One of the most important uses of adversarial networks is the ability to create natural
looking images after training the generator for a sufficient amount of time. This
component of the loss encourages the generator to favor outputs that are closer to
realistic images. The adversarial loss is defined as follows:


l_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}\big(G_{\theta_G}(I^{LR})\big)    (8.9)

where D_{\theta_D}\big(G_{\theta_G}(I^{LR})\big) is the probability of the reconstructed image G_{\theta_G}(I^{LR}) being a natural high-resolution image.
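The sketch below shows, under stated assumptions, how the content and adversarial terms of Eqs. (8.7)–(8.9) could be combined in TensorFlow. The VGG19 layer indices follow the description above, but the exact Keras indexing, the grayscale-to-RGB conversion, and the omission of VGG preprocessing are our simplifications, not the authors' exact training code.

```python
import tensorflow as tf

# Frozen VGG19 feature extractor; using layers 5 and 9 follows the text above,
# but this exact indexing into the Keras model is an assumption.
_vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
_features = tf.keras.Model(_vgg.input,
                           [_vgg.layers[5].output, _vgg.layers[9].output])
_features.trainable = False

def content_loss(hr, sr):
    # MSE between feature maps of the target and generated images, as in Eq. (8.8)
    hr3 = tf.image.grayscale_to_rgb(hr)   # VGG19 expects 3-channel input
    sr3 = tf.image.grayscale_to_rgb(sr)
    loss = 0.0
    for f_hr, f_sr in zip(_features(hr3), _features(sr3)):
        loss += tf.reduce_mean(tf.square(f_hr - f_sr))
    return loss

def adversarial_loss(d_fake):
    # -log D(G(I_LR)) summed over the batch, as in Eq. (8.9); d_fake in (0, 1)
    return tf.reduce_sum(-tf.math.log(d_fake + 1e-8))

def perceptual_loss(hr, sr, d_fake):
    # Eq. (8.7): content loss plus 1e-3 times the adversarial term
    return content_loss(hr, sr) + 1e-3 * adversarial_loss(d_fake)
```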

8.5 Experiments and Results

8.5.1 Dataset

We recompile the dataset as outlined by the authors of ID-CNN (Wang et al. 2017) and ID-GAN (Wang Puyang et al. 2017) for analysis on synthetic SAR images. The dataset is a combination of images from UCID (Schaefer and Stich 2004), BSDS-500 (Arbeláez et al. 2011) and scraped Google Maps images (Isola et al. 2017). All these images are converted to grayscale using OpenCV to simulate intensity SAR images.
The grayscale images are downscaled to 256 × 256 to serve as the noiseless high-resolution target. Another set of grayscale images is downscaled from the original size to 256 × 256 and 64 × 64. For each input image, we have three different noise levels. We randomly allot the images to the training, validation, and testing sets with a split ratio of 95 : 2.5 : 2.5, chosen to obtain a test set of similar size to that of ID-CNN (Wang et al. 2017).
We also use the cars overhead with context (COWC) dataset (Mundhenk et al. 2016), provided by the Lawrence Livermore National Laboratory, for further investigation of the performance. We use this dataset because it contains target boxes for classification and localization of cars, so the data can later be used for object detection performance comparison. We then add speckle noise to the images in our dataset using np.random.gamma(L, 1/L) from NumPy to randomly sample

from the gamma distribution, which is equivalent to the probability density function above, as shown in Fig. 8.1a. The image is normalized before adding noise and then scaled back to the original range afterwards, to avoid clipping of values and loss of information.
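A rough sketch of the preparation just described, assuming OpenCV and NumPy; the file path, interpolation choices, and the exact normalization details are placeholders rather than the authors' actual pipeline.

```python
import cv2
import numpy as np

def prepare_pair(path, looks, hr_size=256, lr_size=64):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    hr = cv2.resize(img, (hr_size, hr_size))   # clean high-resolution target
    lr = cv2.resize(img, (lr_size, lr_size))   # low-resolution input

    # Normalize, apply multiplicative speckle, then rescale to the original range
    # so that values are not clipped.
    lo, hi = lr.min(), lr.max()
    norm = (lr - lo) / (hi - lo + 1e-8)
    noisy = norm * np.random.gamma(looks, 1.0 / looks, size=norm.shape)
    noisy = noisy / (noisy.max() + 1e-8) * (hi - lo) + lo
    return noisy, hr
```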

8.5.2 Results

In this section, we describe the quantitative and qualitative results obtained from the experiments conducted with the proposed architecture.

8.5.2.1 No Super Resolution (No Upscaling Layer)

In this subsection, we report the comparison without super resolution, i.e., without the upscaling layer. Table 8.1 shows the comparison of the denoising performance of the proposed network with ID-CNN (Wang et al. 2017). We use the same dataset in both cases. The results are reported for three different noise levels, as indicated by three different values of L. Also, three different metrics are used to maintain the same experimental setup used by the other state-of-the-art method. Here, VGG16 loss refers to the MSE between the deep-layer VGG16 feature maps of the output denoised image and the target image.
From the obtained results, it is clear from Table 8.1 that the proposed framework outperforms ID-CNN for all noise levels when compared on denoising performance alone. The PSNR for L = 4 and 10 is better for ID-CNN because of its PSNR-driven loss function, whereas the RRDN performs better with respect to the VGG16 metric, implying that it better preserves the higher-level details in the denoised image, since the network is driven by a content loss instead of a pixel-wise MSE.
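The following is a hedged sketch of how the three metrics could be computed with scikit-image and a pre-trained VGG16; the chapter does not state which deep VGG16 layer was used, so block5_conv3 is our assumption, as is the resizing and channel handling.

```python
import numpy as np
import tensorflow as tf
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

_vgg16 = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
_deep = tf.keras.Model(_vgg16.input, _vgg16.get_layer("block5_conv3").output)

def vgg16_metric(a, b):
    # MSE between deep VGG16 feature maps, computed on 224 x 224 RGB copies
    def feats(img):
        x = np.repeat(img[..., None], 3, axis=-1).astype(np.float32)
        x = tf.image.resize(x[None], (224, 224))
        return _deep(x)
    return float(tf.reduce_mean(tf.square(feats(a) - feats(b))))

def report(denoised, target):
    rng = target.max() - target.min()
    return {
        "PSNR": peak_signal_noise_ratio(target, denoised, data_range=rng),
        "SSIM": structural_similarity(target, denoised, data_range=rng),
        "VGG16": vgg16_metric(denoised, target),
    }
```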

8.5.2.2 With Super Resolution (With Upscaling Layers)

Here, we report the quantitative comparison with super resolution. Table 8.2 shows the comparison of both approaches. For the two-step pipeline, we train ID-CNN (Wang et al. 2017) to map 256 × 256 noisy images to 256 × 256 clean target images and then train the SR network to map clean 256 × 256 inputs to 1024 × 1024 high-resolution targets. For the single shot network, we train the network to map 256 × 256 noisy images directly to 1024 × 1024 clean high-resolution images.
We compare the performance of the networks based on the above-mentioned strategy. The PSNR calculations are done at the same output size, since the target and output image sizes match. The VGG16 loss, however, is computed after downscaling the images from both cases to 224 × 224. It can be observed from Table 8.2 that the two-step approach produces better results for most of the metrics.

Table 8.1 Comparison of RRDN without upscaling layer with ID-CNN (Wang et al. 2017)

         Metric   ID-CNN (Wang et al. 2017)   RRDN 1x
L = 1    PSNR     19.34                       19.55
         SSIM     0.57                        0.61
         VGG16    1.00                        0.81
L = 4    PSNR     22.59                       22.47
         SSIM     0.77                        0.79
         VGG16    0.60                        0.30
L = 10   PSNR     24.74                       24.58
         SSIM     0.85                        0.86
         VGG16    0.33                        0.16

Table 8.2 Comparison of RRDN with upscaling layer with ID-CNN (Wang et al. 2017)

         Metric   ID-CNN → ISR   Single Shot
L = 1    PSNR     19.35          18.77
         SSIM     0.61           0.60
         VGG16    0.91           1.06
L = 4    PSNR     21.95          21.38
         SSIM     0.72           0.71
         VGG16    0.53           0.48
L = 10   PSNR     23.32          23.00
         SSIM     0.77           0.77
         VGG16    0.29           0.25

However, the single shot network still slightly outperforms the two-step network on the VGG16 metric, which again shows that the network preserves high-level details better when doing both tasks at once instead of denoising the image and then increasing its resolution with two independently trained networks. These higher-level details lead to better perceived image quality and better performance in object detection tasks, even though the pixel-wise MSE or PSNR values come out lower.

8.5.2.3 Additional Results

In this subsection, we present the results of super resolution and denoising in a single network on SAR images. We report the calculated results in Table 8.3 without comparison since, to the best of our knowledge, we could not find other papers addressing both super resolution and denoising in the context of SAR images. The input images are 64 × 64 noisy images with almost no pixel information available. Figure 8.6 depicts an illustration of the proposed single-step denoising and super-resolution

Table 8.3 Performance of the network while performing both operations

         Metric   RRDN-4x
L = 1    PSNR     16.10
         SSIM     0.36
         VGG16    1.20
L = 4    PSNR     17.05
         SSIM     0.43
         VGG16    0.99
L = 10   PSNR     17.60
         SSIM     0.48
         VGG16    0.87

Fig. 8.6 An illustration of the single-step denoising and super resolution task on an input image of size 64 × 64

task. The RRDN is able to generate a pattern of the primary objects, such as buildings, from the noisy input image based on the high-level features of the image. It is also evident from the quantitative analysis that our proposed single-step method produces HR images with considerably less noise. These claims are further clarified in the following section, where we analyze the performance in detail.

8.6 Analysis

So far, we have discussed the quantitative results obtained using the proposed method. In this section, we present the qualitative results and analyze them in detail.

8.6.1 No Super Resolution (No Upscaling Layer)

Figure 8.7 shows the denoising performance comparison of ID-CNN with RRDN on a 1L-noise speckle image with no upscaling. Similarly, Figs. 8.8 and 8.9 show the comparison for 4L- and 10L-noise speckle images. In all three figures, the results are presented such that parts (a) and (d), the two diagonally opposite images, depict the input speckled image and the target image, respectively, while parts (b) and (c) show the despeckled image generated by our proposed model and by the method proposed in Wang et al. (2017). The denoised output of the proposed network shows better preserved edges, less smudging, and a sharper image compared to ID-CNN, even though the PSNR difference between the images is not very large. The proposed method gives consistently better image quality for all noise levels.

8.6.2 With Super Resolution (With Upscaling Layer)

Figures 8.10, 8.11 and 8.12 show the comparison for 1L, 4L and 10L noise, respectively. The images are downscaled after super resolution for comparison. The original input image size is 256 × 256, and the output image size is 1024 × 1024. Starting with the 1L noisy image, we can see a stark difference in the output images produced by the networks. With the two-step network, denoising and then upscaling produces blurred and smeared images. The high-level details are better preserved in the single-step approach than in the two-step approach. It can be seen that the two-step approach induces distortion when higher noise is present, whereas the single-step approach preserves more high-level details, since the content loss has enabled the

Fig. 8.7 Qualitative comparison between the proposed denoising method and Wang et al. (2017) without upscaling (noise level = 1L)

network to learn to extract details from the noisy image, which helps to produce the building patterns even in the presence of very high noise.

8.6.3 Comparison

In this section, we present a qualitative comparison between the two-step method and the single-step method. Figure 8.13 shows the comparison between the outputs of the two approaches: Fig. 8.13a shows the output of the two-step approach of denoising and then upscaling, while Fig. 8.13b shows the output of the single-step approach. For both cases, we highlight one section of the image and zoom in on that region to show the difference more closely.
It can be seen in the cropped region of the high-resolution 1024 × 1024 output image that the single-step approach better preserves the edges between very close buildings, whereas in the two-step network the output images have smeared edges, as the input noisy image is not available to the super resolution

Fig. 8.8 Qualitative comparison between the proposed denoising method and Wang et al. (2017) without upscaling (noise level = 4L)

network, thus inducing distortions in the additional step. Also, the distortions left by the ID-CNN network are magnified in the upscaling network; these are reduced in the single-step network, since the features of the input noisy image are available to the network through the dense skip connections.

Fig. 8.9 Qualitative comparison between the proposed denoising method and Wang et al. (2017) without upscaling (noise level = 10L)

8.7 Conclusion and Future Scope

In this work, we presented the results of the proposed network with the upscaling layer. The results show a significant improvement in VGG16 loss because the system can remove noise from the images while preserving the image's relevant features. The single-step performance is comparable to the two-step approach, while removing the need for a double pass and for training an additional network. Since the single-step system better preserves features in the image, it may perform better when used in downstream tasks that rely on features, such as object recognition.

Fig. 8.10 Qualitative comparison between the proposed denoising method and Wang et al. (2017) with upscaling (noise level = 1L)

Fig. 8.11 Qualitative comparison between the proposed denoising method and Wang et al. (2017) with upscaling (noise level = 4L)

Fig. 8.12 Qualitative comparison between the proposed denoising method and Wang et al. (2017) with upscaling (noise level = 10L)

(a) 2 Step Denoising and Super Resolution

(b) Single Step Denoising and Super Resolution

Fig. 8.13 A qualitative comparison between the two-step and single-step approaches for denoising and super-resolving an input image

References

Bai YC, Zhang S, Chen M, Pu YF, Zhou JL (2018) A fractional total variational CNN approach for
SAR image despeckling. ISBN: 978-3-319-95956-6
Bhateja V, Tripathi A, Gupta A, Lay-Ekuakille A (2015) Speckle suppression in SAR images
employing modified anisotropic diffusion filtering in wavelet domain for environment monitoring.
Measurement 74:246–254
Buades A, Coll B, Morel J (2005) A non-local algorithm for image denoising. In: IEEE conference
on computer vision and pattern recognition (CVPR)
Deledalle C, Denis L, Tupin F (2009) Iterative weighted maximum likelihood denoising with prob-
abilistic patch-based weights. IEEE Trans Image Process 18(12):2661–2672
Francesco C et al (2018) ISR. https://github.com/idealo/image-super-resolution
Gai S, Zhang B, Yang C, Lei Y (2018) Speckle noise reduction in medical ultrasound image using
monogenic wavelet and Laplace mixture distribution. Digital Signal Process 72:192–207
Huang S, Liu D, Gao G, Guo X (2009) A novel method for speckle noise reduction and ship target
detection in SAR images. Patt Recogn 42(7):1533–1542
Isola P, Zhu J-Y, Zhou T, Efros A (2017) Image-to-image translation with conditional adversarial
networks. In: IEEE conference on computer vision and pattern recognition (CVPR)
Kostadin D, Alessandro F, Vladimir K, Karen E (2006) Image denoising with block-matching
and 3D filtering. In: Image processing: algorithms and systems, neural networks, and machine
learning
Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J,
Wang Z, Shi W (2017) Photo-realistic single image super-resolution using a generative adversarial
network. In: IEEE conference on computer vision and pattern recognition (CVPR)

Li Y, Wang S, Zhao Q, Wang G (2020) A new SAR image filter for preserving speckle statistical
distribution. Signal Process 196:125–132
Mundhenk TN, Konjevod G, Sakla WA, Boakye K (2016) A large contextual dataset for classifi-
cation, detection and counting of cars with deep learning. In: European conference on computer
vision
Puyang W, He Z, Patel Vishal M (2017) SAR image despeckling using a convolutional neural
network. IEEE Signal Process Lett 24(12):1763–1767
Rana VK, Suryanarayana TMV (2019) Evaluation of SAR speckle filter technique for inundation
mapping. Remote Sensing Appl Soc Environ 16:125–132
Schaefer G, Stich M (2004) UCID: an uncompressed color image database. In: Storage and retrieval
methods and applications for multimedia
Sivaranjani R, Roomi SMM, Senthilarasi M (2019) Speckle noise removal in SAR images using
multi-objective PSO (MOPSO) algorithm. Appl Soft Comput 76:671–681
Tian C, Fei L, Zheng W, Yong X, Zuo W, Lin C-W (2020) Deep learning on image denoising: an
overview. Neural Netw 131:251–275
Wang Z, Bovik AC (2002) A universal image quality index. IEEE Signal Process Lett 9(3):81–84
Wang S, Jiao L, Yang S, Liu H (2016) SAR image target recognition via Complementary Spatial
Pyramid Coding. Neurocomput 196:125–132
Wang X, Yu K, Wu S, Gu J, Liu Y, Dong C, Qiao Y, Change Loy C (2018) ESRGAN: enhanced super-resolution generative adversarial networks. ISBN: 978-3-030-11020-8
Wang P, Zhang H, Patel VM (2017) Generative adversarial network-based restoration of speckled
SAR images. In: IEEE 7th international workshop on computational advances in multi-sensor
adaptive processing
Yin D, Zhongzheng G, Zhang Y, Fengyan G, Nie S, Feng S, Ma J, Yuan C (2020) Speckle noise
reduction in coherent imaging based on deep learning without clean data. Optics Lasers Eng
72:192–207
Part II
Models and Algorithms
Chapter 9
Analysis and Deployment of an
OCR—SSD Deep Learning Technique
for Real-Time Active Car Tracking
and Positioning on a Quadrotor

Luiz G. M. Pinto, Wander M. Martins, Alexandre C. B. Ramos,


and Tales C. Pimenta

Abstract This work presents a deep learning solution for object tracking and object detection in images and real-time license plate recognition, implemented on an F450 quadcopter in autonomous flight. The solution uses the Python programming language, the OpenCV library, remote PX4 control with MAVSDK, OpenALPR, and neural networks built with Caffe and TensorFlow.

9.1 Introduction

A drone can follow an object that updates its route all the time (Pinto et al. 2019). This is called active tracking and positioning, where an autonomous vehicle needs to follow a target without human intervention. There are several approaches to this mission with drones (Amit and Felzenszwalb 2014; Mao et al. 2017; Patel and Patel 2012; Sawant and Chougule 2015), but they are rarely combined with object detection and OCR due to resource consumption (Lecun et al. 2015). State-of-the-art algorithms (Dai et al. 2016; Liu et al. 2016; Redmon and Farhadi 2017; Ren et al. 2017) can identify the class of a target object being followed. This work presents and analyzes a technique that grants control to a drone during an autonomous flight, using real-time tracking and positioning through an OCR system and deep learning models for plate detection and object detection.

L. G. M. Pinto (B) · A. C. B. Ramos


Institute of Mathematics and Computing, Federal University of Itajuba, IMC. Av. BPS, 1303,
Bairro Pinheirinho, Caixa, Itajuba, MG 37500 903, Brazil
e-mail: ramos@unifei.edu.br
W. M. Martins · T. C. Pimenta
Institute of Systems Engineering and Information Technology,
IESTI. Av. BPS, 1303, Bairro Pinheirinho, Caixa, Itajuba, MG 37500 903, Brazil
e-mail: wandermendes@unifei.edu.br
T. C. Pimenta
e-mail: tales@unifei.edu.br


Related work includes (Bartak and Vykovsky 2015; Barton and Azhar 2017; Bendea 2008; Braga et al. 2017; Breedlove 2019; Brito et al. 2019; Cabebe 2012; Jesus et al. 2019; Lecun et al. 2015; Lee et al. 2010; Martins et al. 2018; Mitsch et al. 2013; Qadri and Asif 2009; Tavares et al. 2010).

9.2 Materials and Methods

The following will present the concepts, techniques, models, materials, and methods
used in the proposed system, in addition to the structures and platforms used for its
creation.

9.2.1 Rotary Wing UAV

There are all kinds of autonomous vehicles (Bhadani et al. 2018; Chapman et al. 2016). This project used an F-450 quadcopter drone for outdoor testing and a Typhoon H480 multirotor for the simulation. A quadcopter is an aircraft made up of four rotors, with the controller board in the middle and the rotors at the ends (Sabatino et al. 2015). It is controlled by changing the angular speeds of the rotors, which are driven by electromagnetic motors, giving six degrees of freedom, as seen in Fig. 9.1, with x, y and z as the translational movement and roll, pitch and yaw as the rotational movement (Luukkonen 2011; Martins et al. 2018; Strimel et al. 2017).
The altitude and position of the drone can be modified by adjusting the speeds of the motors (Sabatino et al. 2015); the same applies to pitch control, by driving the rear or front motors (Braga et al. 2017).

Fig. 9.1 Degrees of freedom (Strimel et al. 2017)



9.2.1.1 Pixhawk Flight Control Hardware

This project was implemented using the Pixhawk flight controller (Fig. 9.2), an independent open hardware flight controller (Pixhawk 2019). Pixhawk supports manual and automatic operation and is suitable for research because it is inexpensive and compatible with most remote control transmitters and receivers (Ferdaus et al. 2017). Pixhawk is built with a dual-processor design whose main 32-bit STM32F427 Cortex-M4 processor runs at 168 MHz, with 256 KB of RAM and 2 MB of flash (Feng and Angchao 2016; Meier et al. 2012). The current version is the Pixhawk 4 (Kantue and Pedro 2019).

9.2.1.2 PX4 Flight Control Software

In this work, the PX4 flight control software (PX4 2019a) was used because it is Pixhawk's native software, avoiding compatibility problems. PX4 is open-source flight control software for drones and other unmanned vehicles (PX4 2019a) that provides a set of tools to create customized solutions.
There are several built-in frame types with their own flight control parameters, including motor assignment and numbering (PX4 2020), which include the F450 used in this project. These parameters can be modified to refine the behavior during a flight. PX4 uses proportional, integral, derivative (PID) controllers, which are among the most widespread control techniques (PX4 2019b).
In the PID controllers, the P (proportional) gain is used to minimize the tracking error. It is responsible for a quick response and therefore should be as high as possible. The D (derivative) gain is used for damping; it is necessary, but it should be set only as high as needed to avoid overshoot. The I (integral) gain maintains a memory of the error.

Fig. 9.2 PixHawk 2.4.8 (Kantue and Pedro 2019)

The I term increases when the desired rate has not been reached for some time (PX4 2019b). The idea is to tune the flight controller model using ground control station (GCS) software, where it is possible to change these parameters and check their effects on each of the drone's degrees of freedom (QGroundControl 2019).
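As a minimal illustration of the three terms described above, the sketch below implements a generic PID update in Python; the gains shown are hypothetical values, not the gains tuned for the F450.

```python
class PID:
    """Minimal PID controller in the spirit of the description above."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        self.integral += error * dt                   # I term: memory of past error
        derivative = (error - self.prev_error) / dt   # D term: damps fast changes
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical gains for one axis; real values come from tuning in QGroundControl.
roll_rate_pid = PID(kp=0.15, ki=0.05, kd=0.003)
```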

9.2.2 Ground Control Station

A ground control station (GCS), described in the previous chapter, is needed to check the behavior of the drone during flight and was used to update the drone's firmware, adjust its parameters, and calibrate its sensors. Running on a base computer, it communicates with the UAV wirelessly (e.g., via telemetry radio or Wi-Fi), displays real-time data on the performance and position of the UAV, and shows data from many of the instruments present on a conventional plane or helicopter (ARDUPILOT 2019; Hong et al. 2008).
This work used the QGroundControl GCS due to its affinity with the PX4 and its PID tools, which allowed changes to the PID gains while the drone was still in the air. QGroundControl can read telemetry data simultaneously from multiple aircraft if they use the same version of MAVLink, while also providing common features such as telemetry logging, a visual GUI display, a mission planning tool, the ability to adjust the PID gains during flight (QGroundControl 2019), as shown in Fig. 9.3, and the ability to display vital flight data (Huang et al. 2017; Ramirez-Atencia and Camacho 2018; Shuck 2013; Songer 2013).

9.2.3 Drone Software Development Kit (SDK)

During the development of this project, several SDKs were evaluated for autonomous software control of drones. They were considered for different parts of the project because of their different strengths; the idea was to choose the SDK with the best balance between robustness and simplicity. The candidates were MAVROS (ROS 2019), MAVSDK (2019) and DroneKit (2015), chosen because of their popularity. All of them use the MAVLink protocol, which is responsible for controlling the drone.
DroneKit was the simplest, but it did not have the best support for the PX4: its external control, which accepts remote commands via programming, was developed for ArduPilot applications (PX4 2019c).
The MAVROS package running on ROS was the most robust, but it was complex to manage. MAVSDK presented the best result: it has full support for PX4 applications and is not complex to manage, and so it was chosen for this project.

Fig. 9.3 QGroundControl PID tunning (QGROUNDCONTROL 2019)

9.2.3.1 MAVLink and MAVSDK

MAVLink is a messaging protocol for exchanging information with a drone and between the controller board components (MAVLINK 2019). Data streams, such as telemetry status, are published as topics, while subprotocols, like those used for mission or parameter control, are point-to-point with retransmission (MAVLINK 2019).
MAVSDK is a MAVLink library with APIs implemented in different languages, such as C++ and Python. MAVSDK can run on a computer embedded in the vehicle or on a ground or mobile station device, which has significantly more processing power than an ordinary flight controller and can therefore perform tasks such as computer vision, obstacle avoidance and route planning (MAVSDK 2019). MAVSDK was the chosen framework, due to its affinity with the PX4 and its simplicity.

9.2.3.2 ROS and MAVROS

The Robot Operating System (ROS), also presented in the previous chapter, is a framework widely used in robotics (Cousins 2010; Martinez 2013). ROS offers features such as distributed computing, message passing and code reuse (Fairchild 2016; Joseph 2015).

Fig. 9.4 ROS master (Hasan 2019)

ROS allows the robot control to be divided into several tasks called nodes, which are processes that perform computation, allowing modularity (Quigley et al. 2009). The nodes exchange messages with each other in a publisher–subscriber system, where a topic acts as an intermediate store: some nodes publish their content to it while others subscribe to receive this content (Fairchild 2016; Kouba 2019; Quigley et al. 2009).
ROS has a general manager called the "master" (Hasan 2019). The ROS master, as seen in Fig. 9.4, is responsible for providing names and registrations to services and nodes in the ROS system. It tracks and directs publishers and subscribers to topics and services. The role of the master is to help the individual ROS nodes locate each other; once the nodes are located, they communicate peer-to-peer (Fairchild 2016).
To support collaborative development, the ROS system is organized into packages, which are simply directories containing XML files that describe the package and list its dependencies. One of these packages is MAVROS, an extensible MAVLink communication node with a proxy for the GCS. The package provides a driver for communication with a variety of autopilots that use the MAVLink communication protocol and a UDP MAVLink bridge for the GCS. The MAVROS package allows MAVLink communication between different computers running ROS and is currently the officially supported bridge between ROS and MAVLink (PX4 2019d).

9.2.3.3 DroneKit

DroneKit is an SDK with development tools for UAVs (DRONEKIT 2015). It allows the creation of applications that run on a host computer and communicate with ArduPilot flight controllers. Applications can add a level of intelligence to the vehicle's behavior and can perform tasks with high computational

cost or real-time requirements, such as computer vision or path planning (3D Robotics 2015, 2019).
Currently, the PX4 is not yet fully compatible with DroneKit, which is more suitable for ArduPilot applications (PX4 2019c).

9.2.4 Simulation

This work used the Gazebo platform (Nogueira 2014) with its PX4 simulator implementation, which provides various vehicle models with Pixhawk-specific hardware and firmware simulation through the PX4 SITL. It was not the only option, since the PX4 SITL is available for other platforms, such as jMAVSim (Hentati et al. 2018), AirSim (Shah et al. 2019) and X-Plane (Meyer 2020). jMAVSim was discarded mainly because it is not easy to integrate obstacles or extra sensors, such as cameras (Hentati et al. 2018). AirSim was also discarded because, while realistic, it requires a powerful graphics processing unit (GPU) (Shah et al. 2019), which could compete for resources during the object detection phase.
X-Plane, also discarded, is realistic and has a variety of UAV models and environments already implemented (Hentati et al. 2018); however, it requires licensing for its use. Thus, Gazebo was the chosen option, due to its simulation environment with a variety of online resource models, its ability to import meshes from other modeling software, such as SolidWorks or Inkscape (Koenig and Howard 2004), and its free license.

9.2.4.1 Gazebo

Gazebo, already presented in the previous chapter, is an open-source simulation platform that allows integration with ROS. Gazebo is currently only supported on Linux, but there is some speculation about a Windows version.

9.2.4.2 PX4 SITL

The PX4 firmware offers a complete hardware simulation (Hentati et al. 2018; Yan et al. 2002) through its own SITL (software in the loop), which feeds environment inputs into the autopilot. The simulation reacts to the simulated data exactly as it would react in reality and evaluates the total power required by each rotor (Cardamone 2017; Nguyen et al. 2018).
The PX4 SITL allows the same software to be simulated as on a real platform, rigorously replicating the behavior of the autopilot; it can simulate the same autopilot used on a real drone and its MAVLink protocol, which generalizes to direct use on a

real drone (Fathian et al. 2018). The greatest advantage of the PX4 SITL is that the flight controller cannot distinguish whether it is running in simulation or inside a real vehicle, allowing the simulation code to be ported directly to commercially available UAV platforms (Allouch et al. 2019).

9.2.5 Deep Learning Frameworks

Deep learning is a type of machine learning, generally used for classification, regression, and feature extraction tasks, with multiple layers of representation and abstraction (Deng 2014). For object detection, feature extraction is required and can be achieved using convolutional neural networks (CNNs), a class of deep neural networks that apply filters at various levels to extract and classify visual information from a source, such as an image or video (O'Shea and Nash 2015). This project used a CNN (Holden et al. 2016; Jain 2020; Opala 2020; Sagar 2020; Tokui et al. 2019) to detect visual targets using a camera.

9.2.5.1 Caffe Framework

This project used the Caffe deep learning framework (Jia and Shelhamer 2020; Jia et al. 2019, 2014), although there are other options such as Keras (2020), scikit-learn (2020), PyTorch (2020) and TensorFlow (Abadi et al. 2020; Rampasek and Goldenberg 2016; Tensorflow 2016; TENSORFLOW 2019, 2020b, a). Caffe provides a complete toolkit for training, testing, fine-tuning, and deploying models, with well-documented examples for these tasks. It is developed under a free BSD license, is built with the C++ language, and maintains Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and many other deep models efficiently (Bahrampour et al. 2015; Bhatia 2020).

9.2.6 Object Detection

The most common way to interpret the location of the object is to create a bounding
box around the detected object, as seen in Fig. 9.5.
Object detection, detailed in the previous chapter, was the first stage in this tracking
system, as it focuses on the object to be tracked (Hampapur et al. 2005; Papageorgiou
and Poggio 2000; Redmon et al. 2016).

Fig. 9.5 Bounding boxes from object detection (Ganesh 2019)

9.2.6.1 Convolutional Neural Networks (CNN)

A convolutional neural network (CNN) (O’Shea and Nash 2015), as described in


the previous chapter, is a variation of the so-called multilayer perceptron networks
and was inspired by the biological process of data processing (Google Developers
2020; Hui 2019; Karpathy 2020; Rezatofighi et al. 2019; Stutz 2015; Vargas and
Vasconcelos 2019).

9.2.6.2 Single-Shot MultiBox Detector (SSD)

SSD is an object detection algorithm based on a deep neural network model (Liu et al. 2011; Liu 2020; Liu et al. 2016). It was designed for real-time applications like this one. It is lighter than other models because it speeds up the inference of new bounding boxes by reusing feature maps, which are the result of convolutional blocks and represent the dominant characteristics of the image at different scales (Dalmia and Dalmia 2019; Forson 2019; Soviany and Ionescu 2019).
Its core is built around a technique called MultiBox, a method for fast, class-agnostic bounding box coordinate proposals (Szegedy et al. 2014, 2015). Regarding performance and accuracy for object detection, it scores above 74% mAP at 59 frames per second on datasets like COCO and PascalVOC (Forson 2019).

MultiBox

MultiBox is a method for bounding box regression that achieves dimensionality


reduction, as it consists of branches of convolutional layers (Forson 2019), as seen

Fig. 9.6 MultiBox architecture (Forson 2019)

in Fig. 9.6 that resize images over the network, maintaining the original width and
height.
The key to the MultiBox technique is the interaction between two critical evaluation factors: confidence loss (CL) and location loss (LL). CL measures how confident the network is in the class selection, i.e., whether the correct class of the object was chosen, using categorical cross-entropy (Forson 2019). Cross-entropy can be seen as a received response that is not optimal, while entropy represents the ideal answer; therefore, knowing the entropy, every received response can be measured in terms of how far it is from the optimum (Dipietro 2019).

SSD Architecture

The SSD is composed of feature map extraction, performed by an intermediate neural network called the feature extractor, followed by the application of convolution filters to detect objects. The SSD architecture (Fig. 9.7) consists of three main components: the base network, extra layers for feature extraction, and prediction layers.
In the base network, extraction is performed using a convolutional neural network called VGG-16, the feature extractor, as seen in Fig. 9.8; it is made up of combinations of convolution layers with ReLU followed by fully connected layers. In addition, it has max-pooling layers and a final layer with a softmax activation function (Frossard 2019). The output of this network is a feature map with dimensions 19 × 19 × 1024 (Dalmia and Dalmia 2019). Right after the base network, four additional convolutional layers are added to continue reducing the size of the feature map until its final dimensions are 1 × 1 × 256. Finally, the prediction layers, a crucial element of the SSD, use a variety of feature maps representing various scales

Fig. 9.7 SSD architecture (Dalmia and Dalmia 2019)

Fig. 9.8 VGG-16 architecture (Frossard 2019)

to predict class scores and bounding box coordinates (Dalmia and Dalmia 2019).
The final composition of the SSD increases the chances of an object being even-
tually detected, localized, and classified (Howard et al. 2017; Sambasivarao 2019;
Simonyan and Zisserman 2019; Szegedy et al. 2015; Tompkin 2019).

9.2.7 Optical Character Recognition—OCR

In this work, the identification of the vehicle license plate was important so that the drone could follow the car with the correct license plate. This task was handled by the optical character recognition (OCR) system. OCR is a technique responsible for recognizing optically drawn characters (Eikvil 1993). OCR is a complex problem to deal with due to the variety of languages, fonts, and styles in which

characters and information can be written, including the complex rules of each of these languages (Islam et al. 2016; Mithe et al. 2013; Singh et al. 2010).

9.2.7.1 Inference Process

An example of the steps in the OCR technique is shown in Fig. 9.9 (Adams 1854). The steps are as follows: acquisition, preprocessing, segmentation, feature extraction, and recognition (Eikvil 1993; Kumar and Bhatia 2013; Mithe et al. 2013; Qadri and Asif 2009; Sobottka et al. 2000; Weerasinghe et al. 2020; Zhang et al. 2011).
a. Acquisition: a recorded image is fed into the system.
b. Preprocessing: eliminates color variations by smoothing and normalizing pixels. Smoothing applies convolution filters to the image to remove noise and smooth the edges. Normalization finds a uniform size, slope and rotation for all characters in the image (Mithe et al. 2013).
c. Segmentation: finds the words written inside the image (Kumar and Bhatia 2013).
d. Feature extraction: extracts the characteristics of the symbols (Eikvil 1993).
e. Recognition: identifies the characters and classifies them, scanning the lines word by word and converting the images into character streams representing the letters of the recognized words (Weerasinghe et al. 2020).

Fig. 9.9 OCR inference process steps (Adams 1854)



9.2.7.2 Tesseract OCR

This work used a plate detection library called OpenALPR that relies on Google Tesseract, an open-source OCR framework that can be trained for different languages and scripts. Tesseract converts the image into a binary image and identifies and extracts character outlines. It transforms the outlines into blobs, which are small isolated regions of the digital image, and divides the text into words using techniques such as fuzzy spaces and definite spaces. Finally, it recognizes the text by classifying and storing each recognized word (Mishra et al. 2012; Patel and Patel 2012; Sarfraz et al. 2003; Shafait et al. 2008; Smith 2007; Smith et al. 2009).

9.2.7.3 Automatic License Plate Recognition—ALPR

ALPR is a way to detect the characters that make up the license plate of a vehicle and uses OCR for most of the process. It combines object detection, image processing, and pattern recognition (Silva and Jung 2018). It is used in real-life applications such as automatic toll collection, traffic law enforcement, access control to parking lots and road traffic monitoring (Anagnostopoulos et al. 2008; Du et al. 2013; Kranthi et al. 2011; Liu et al. 2011). The four steps of the ALPR are shown in Fig. 9.10 (Sarfraz et al. 2003).

OpenALPR

This project used OpenALPR, an open-source ALPR library built with the C++ language, with bindings for C#, Java, Node.js, and Python. The library receives images and video streams for analysis in order to identify license plates and generates text representing the plate characters (OPENALPR 2017). It is based on OpenCV, an open-source computer vision library for image analysis (Bradski and Kaehler 2008), and Tesseract OCR (Buhus et al. 2016; Rogowski 2018).
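A short sketch of how OpenALPR can be called from its Python binding; the country code, configuration path, runtime-data path, and image file are placeholders that depend on the local installation, not values from this project.

```python
from openalpr import Alpr

# Placeholder country code and paths; adjust to the local OpenALPR install.
alpr = Alpr("br", "/etc/openalpr/openalpr.conf", "/usr/share/openalpr/runtime_data")
if alpr.is_loaded():
    results = alpr.recognize_file("car_frame.jpg")
    for plate in results["results"]:
        # Each result carries the best plate string plus ranked candidates with confidences
        print(plate["plate"], plate["confidence"])
alpr.unload()
```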

9.3 Implemented Model

This project involved dataset preparation, SSD training with the TensorFlow and Caffe frameworks, image preprocessing for OCR, and tracking and motion functions.

9.3.1 Supervised Learning

Supervised learning (SL) is a kind of machine learning (ML) training in which the model is provided with labeled data during its training phase (Google Developers 2019; Shobha and Rangaswamy 2018; Talabis et al. 2015).

Fig. 9.10 ALPR steps (Sarfraz et al. 2003)

The LabelImg tool (Talin 2015) was used, with the PascalVOC format as the standard XML annotation (Fig. 9.11), which includes class labels, bounding box coordinates, image path, image size and name, and other tags (Everingham et al. 2010).

9.3.1.1 Datasets

Three sets were used: INRIA Person, Stanford Cars and GRAZ-02.
The Stanford Cars dataset (Krause et al. 2013) allowed the SSD to identify cars along the quadcopter trajectory. This dataset contains 16,185 images from 196 classes of cars, divided into 8,144 training images and 8,041 test images, already annotated with make, model and year.

Fig. 9.11 PascalVOC's XML example (Everingham et al. 2010)

The INRIA person dataset (Dalal and Triggs 2005) is a collection of digital images highlighting people, taken over a long period of time, plus some Web images taken from Google Images (Dalal and Triggs 2005). About 2500 images were collected from that dataset. The person class was added because it was the most common false positive found in single-class training, along with a background class, to help the model discern between different environments.
To help improve multi-class detection, the GRAZ-02 dataset (Opelt et al. 2006) was also used, since it contains images with highly complex objects and high background variability, including 311 images with persons and 420 images with cars (Oza and Patel 2019).

9.3.1.2 Caffe and TensorFlow Training

Unlike TensorFlow (TENSORFLOW 2019, 2020b), Caffe (Bair 2019) does not have a complete object detection API (Huang et al. 2017), which makes starting a new training more complicated. Caffe also does not include a direct visualization tool like TensorBoard. However, its tools subdirectory includes the parse_log.py script, which extracts all relevant training information from the log file and makes it suitable for plotting. Using a Linux plotting tool called gnuplot (Woo and Broker 2020), in addition to a parsing script, it was possible to build a real-time plotting algorithm (Fig. 9.12).
The latest log messages indicated that the model reached an overall mAP of 95.28%, which is an accuracy gain of 3.55% compared to the first trained model (Aly 2005; Jia et al. 2014; Lin et al. 2020).

Fig. 9.12 Results from gnuplot custom script (Author)

9.3.2 Image Preprocessing for ALPR

To help ALPR identify license plates, the quality of the images was improved through two image preprocessing techniques: brightness variation and unsharp masking. Brightness is adjusted in the RGB color system, where each color is defined by its red, green and blue levels. There are 256 possible values for each level, ranging from 0 to 255. To change the brightness, a constant value is added to or subtracted from each level: values are added for brighter images and subtracted for darker images, as seen in Fig. 9.13, giving an adjustment range of −255 to 255. The only necessary care is to clip the result whenever the addition or subtraction would push a level above 255 or below 0.
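A minimal sketch of this brightness adjustment (NumPy only; widening to a signed type before clipping is simply the practical way to avoid 8-bit wrap-around):

import numpy as np

def adjust_brightness(image_rgb, value):
    """Add `value` (-255..255) to every RGB level, clipping to the valid 0-255 range."""
    shifted = image_rgb.astype(np.int16) + value   # widen to avoid uint8 wrap-around
    return np.clip(shifted, 0, 255).astype(np.uint8)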
After adjusting the brightness, the unsharp mask, also called the sharpening filter, is applied to highlight the edges, thus improving the legibility of the plate characters (Dogra and Bhalla 2014; Fisher et al. 2020). Figure 9.14 shows an example of the result of this process.
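A minimal sketch of unsharp masking with OpenCV (the kernel size, sigma and amount below are illustrative values, not the ones used in the project):

import cv2

def unsharp_mask(image, ksize=(5, 5), sigma=1.0, amount=1.5):
    """Sharpen by subtracting a Gaussian-blurred copy from the original (unsharp masking)."""
    blurred = cv2.GaussianBlur(image, ksize, sigma)
    # weighted difference: (1 + amount) * image - amount * blurred
    return cv2.addWeighted(image, 1.0 + amount, blurred, -amount, 0)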

Fig. 9.13 Effect of brightness on a collected image (author)

Fig. 9.14 Effect of sharpness filter application (author)

9.3.3 Drone Control Algorithm

The algorithm for controlling the drone was built with the assistance of Python libraries, which include OpenALPR, Caffe, TensorFlow, MAVSDK, OpenCV (Cartesian system) (Bradski and Kaehler 2008) and others. Algorithm 1 is the main algorithm of the solution: at each iteration it evaluates updated data from object detection and OCR and translates them into speed commands on the x, y and z axes of the three-dimensional real-world environment.
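A minimal sketch of how such velocity commands can be sent with MAVSDK-Python, based on the library's offboard velocity example (the connection address, the fixed 10 Hz rate and the placeholder speed values are assumptions; the detection/OCR logic and safety checks are omitted):

import asyncio
from mavsdk import System
from mavsdk.offboard import VelocityBodyYawspeed

async def run():
    drone = System()
    await drone.connect(system_address="udp://:14540")   # SITL default address
    await drone.action.arm()                             # arming/takeoff handling simplified

    # Offboard mode requires an initial setpoint before it can be started.
    await drone.offboard.set_velocity_body(VelocityBodyYawspeed(0.0, 0.0, 0.0, 0.0))
    await drone.offboard.start()

    while True:
        # vx, vy, vz, yaw_rate would come from the detection/OCR control functions.
        vx, vy, vz, yaw_rate = 0.0, 0.0, 0.0, 0.0
        await drone.offboard.set_velocity_body(
            VelocityBodyYawspeed(vx, vy, vz, yaw_rate))
        await asyncio.sleep(0.1)                          # ~10 Hz command rate

asyncio.run(run())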

9.3.3.1 Height Centralization Function

Height centering is the only control function shared between views. It positions the
drone at a standard altitude (Fig. 9.15).

9.3.3.2 2D Centralization Function

The drone has its camera pointed at the ground, so the captured image is analogous to a 2D Cartesian system, and the centering is driven by the x- and y-coordinates, as shown in Fig. 9.16. The idea is to reduce the two values x and y, which represent the distances, along the x- and y-axes respectively, from the detection's central point (pink) to the central point of the frame (blue).
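A minimal sketch of this centering rule as a simple proportional controller (the gain, speed limit and axis conventions are illustrative assumptions, not the authors' exact function):

def centering_velocities(detection_center, frame_size, gain=0.002, max_speed=1.0):
    """Velocity commands (m/s) that push the detection toward the frame center.

    detection_center: (px, py) center of the detected bounding box, in pixels.
    frame_size: (width, height) of the camera frame, in pixels.
    """
    cx, cy = frame_size[0] / 2.0, frame_size[1] / 2.0
    dx = detection_center[0] - cx   # horizontal error in pixels
    dy = detection_center[1] - cy   # vertical error in pixels
    vx = max(-max_speed, min(max_speed, gain * dx))
    vy = max(-max_speed, min(max_speed, gain * dy))
    return vx, vy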

9.3.3.3 Yaw Alignment Function

In the rear_view approach, the alignment function is used to center the drone horizontally according to the position of the car. The idea is to keep the drone facing the car, as shown in Fig. 9.17. The value is calculated using the yaw distance between the frame center and the central detection point on the y-axis.

Fig. 9.15 Height centralization (author)

Fig. 9.16 2D Centralization (author)



Fig. 9.17 Yaw Alignment (author)

9.3.3.4 Approximation Function

In the rear_view approach, the approximation (zoom) function was the most difficult to define: it uses the distance to the object seen by the camera, the speed of the car, and the balance between a safe distance from the car and the minimum distance at which the OCR still works. The distance to the object was calculated using the relationship between the camera field of view (FOV) and its sensor dimension (Fulton 2020), as seen in Fig. 9.18.
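One common way to express this relation, sketched below under a pinhole-camera assumption (using the car's known real width and the horizontal FOV; the function name and parameters are illustrative, and the authors' exact expression may differ):

import math

def distance_from_fov(real_width_m, hfov_deg, image_width_px, object_width_px):
    """Estimate distance (m) to an object of known real width from its apparent size.

    At distance d the camera sees a scene width of 2*d*tan(hfov/2); the object's pixel
    fraction of the frame equals its fraction of that scene width, which gives d.
    """
    fraction = object_width_px / image_width_px
    return real_width_m / (2.0 * math.tan(math.radians(hfov_deg) / 2.0) * fraction)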

9.4 Experiments and Results

To evaluate the algorithm, a notebook was used as the ground station, coordinating the acquisition of the frame, the processing of the frame by the SSD, the calculation of the values, and the drone positioning commands. Its specifications were Ubuntu 18.04 64-bit OS, an Intel Core i7 at 2.2 GHz, 16 GB of DDR3 SDRAM and an Nvidia GeForce GTX 1060 graphics card with 6 GB of GDDR5 memory.

Fig. 9.18 Object distance from camera (author)

Fig. 9.19 Typhoon H-480 in Gazebo (author)

9.4.1 The Simulated Drone

For the simulated experiment, a Typhoon H-480 model was used in Gazebo, as shown in Fig. 9.19. It is available in the PX4 Firmware Tools directory on GitHub, at https://github.com/PX4/Firmware/tree/master/Tools, ready to use in a SITL environment. It was very handy, as it has a built-in gimbal and camera. The gimbal allowed the images to be captured in a very stable way, avoiding compromised detections.

Fig. 9.20 Custom world in Gazebo (author)

9.4.2 Simulated World

In the simulated experiments, a customized city (Fig. 9.20) was created with pre-built models from the Gazebo model database, available at: https://bitbucket.org/osrf/gazebo_models.

9.4.3 CAT Vehicle

The CAT Vehicle is a simulated autonomous vehicle distributed as part of a ROS project to support research on autonomous driving technology. It has fully configurable simulated sensors and actuators imitating a real-world vehicle capable of autonomous driving, including a steering and speed control implemented in real-time simulation.
In the experiments, the drone was positioned some distance from the rear end of the car, as seen in Fig. 9.21, and followed it, capturing its complete image and allowing the algorithm to process the recognized plate. In addition, it allowed the plate to be customized with the Brazilian model, making it very convenient for this project.

9.4.4 Simulated Experiments

The experiments showed that, in an urban scene, most cars could be detected within a range of 0.5–4.5 m from the camera, as shown in the green area in Fig. 9.22.

Fig. 9.21 CAT Vehicle and Typhoon H-480 in Gazebo (author)

Fig. 9.22 Car detection simulation (author)

The plate detection range was 1.5–3 m. The balance between the number of detections and the total of correctly extracted information is represented in the green area of Fig. 9.23.
Figure 9.24 shows a collision hazard zone represented by a red area.
The ideal height for safety and image quality lies between 4 and 6 m. Figure 9.25 shows in red the height range where other objects, such as people and other vehicles, may be found, making it difficult to identify the moving vehicle as the object to be tracked.

Fig. 9.23 Plate detection and processing results (author)

Fig. 9.24 Car rear following results (author)

Fig. 9.25 Car top following results (author)



Fig. 9.26 Customized F450 built (author)

9.4.5 The Real Quadrotor

A custom F450 (Fig. 9.26) was used for the outdoor experiments. The challenge was to use cheap materials and still achieve reasonable results.
The numbers shown in Fig. 9.26 are reference indexes for the components listed in Table 9.1. For the acquisition of the frames, an EasyCap capture device was connected to the computer and to the video receiver.
The Pixhawk, the most expensive component, was chosen as the flight controller board, as the idea was to use the PX4 flight stack as the control configuration.

9.4.6 Outdoor Experiment

For the outdoor experiment on campus (Fig. 9.27), the model was changed to detect people instead of cars. To avoid colliding with the person who served as the target, everything was coordinated very slowly.
Another experiment used a video as the image source (Fig. 9.28), to check how many detections, plates and correct information extractions the technique could achieve. The video was a recording from Marginal Pinheiros, one of the busiest highways in the city of São Paulo (Oliveira 2020).
The experiment produced 379 vehicle detections out of the 500 vehicles present in the video clip; 227 plates were found, 164 of them with the correct information extracted (Fig. 9.29).

Table 9.1 Customized F450 specifications

Index  Part name         Specifications                   Quantity  Weight (g)     Price (R$) [Note 1]
1      Controller        Pixhawk PX4/2.4.8                1         38             183.40
2      GPS Module        NEO-8N                           1         30             58.16
3      Telem. Transc.    Readytosky 3DR 915 MHz           1         15.4           63.23
4      Video Transm.     JMT TS832                        1         22             59.15
5      FPV Camera        Cyclops DVR 3 4.2 V 700TVL       1         4.5            55.71
6      RC Receiver       Flysky RX FS-R9B                 1         18             19.69
7      PPM Encoder       JMT PWM to PPM                   1         3              22.40
8      ESC               Hobbypower 30A                   4         100            69.01
9      Motor / Prop      A2212 1000 KV / 6035 2-Blade     4         200.8          50.62
10     Battery (Li-po)   Tattu 11.1 V 35C 5200 mAh 3S     1         375            160.00
11     Video Receiver    JMT RS832                        1         85 [Note 2]    73.94
12     Frame             YoungRC F450 450 mm              1         280            21.31
13     Power Module      XT60 6S/12S                      1         28             16.08
       Total                                                        1141.7         882.73

Note 1 The price is equivalent to USD 190.77 on March 08th, 2020
Note 2 Not added to the weight sum, since it was used in the ground station

Fig. 9.27 Outdoor experiment in Campus (author)



Fig. 9.28 Experiment on record (Oliveira 2020)

Fig. 9.29 Experiment on record results (author)

9.5 Conclusions and Future Works

The distance and the brightness level determined the limits in the tests performed and are an aspect to work on in future improvements. A high-definition camera should be used to reduce noise and vibration effects in the captured images.
The mathematical functions used to calculate the drone's speed were useful in understanding the drone's behavior.
A different approach, such as position estimation or a PID controller, could be used to determine the object's route.

References

3D Robotics (2015) DroneKit. Available in: https://3drobotics.github.io/solodevguide/concept-dronekit.html. Cited December 21st, 2019
3D Robotics (2019) About DroneKit. Available in: https://dronekit-python.readthedocs.io/en/
latest/about/overview.html. Cited December 21st, 2019
Abadi M et al (2020) Tensorflow: large-scale machine learning on heterogeneous distributed sys-
tems. ArXiv, arXiv:1603.04467
Adams HG (1854) Nests and eggs of familiar birds, vol IV, 1st edn. Groombridge and Sons: 5
Paternoster Row, London, England
Allouch A et al (2019) MAVSec: securing the MAVLink protocol for Ardupilot/PX4 unmanned aerial systems. In: 2019 15th international wireless communications & mobile computing conference (IWCMC). IEEE. https://doi.org/10.1109/IWCMC.2019.8766667
Aly M (2005) Survey on multiclass classification methods. In: 1200 East California Boulevard
Pasadena, California, USA
Amit Y, Felzenszwalb P (2014) Object detection. In: Computer vision. Springer US, 537–542. https://doi.org/10.1007/978-0-387-31439-6
Anagnostopoulos C-N et al (2008) License plate recognition from still images and video sequences:
a survey. In: IEEE transactions on intelligent transportation systems, Institute of Electrical and
Electronics Engineers (IEEE). https://doi.org/10.1109%2Ftits.2008.922938
ARDUPILOT (2019) choosing a ground station: overview. Available in: https://ardupilot.org/plane/
docs/common-choosing-a-ground-station.html. Cited December 10th, 2019
Bahrampour, S et al (2015) Comparative study of caffe, neon, theano, and torch for deep learning.
ArXiv, arXiv:1511.06435
Bair (2019) Caffe. Available in: https://caffe.berkeleyvision.org/. Cited January 13th, 2019
Bartak R, Vykovsky A (2015) Any object tracking and following by a flying drone. In: 2015 fourteenth Mexican international conference on artificial intelligence (MICAI). IEEE. https://doi.org/10.1109/micai.2015.12
Barton TEA, Azhar MAHB (2017) Forensic analysis of popular UAV systems. In: Seventh interna-
tional conference on emerging security technologies (EST). IEEE. https://doi.org/10.1109/EST.
2017.8090405
BENDEA H et al (2008) Low cost UAV for post-disaster assessment. In: The international archives
of the photogrammetry, vol 37. Remote Sensing and Spatial Information Sciences
Bhadani RK, Sprinkle J, Bunting M (2018) The CAT vehicle testbed: a simulator with hardware in
the loop for autonomous vehicle applications. In: Electronic proceedings in theoretical computer
science, vol 269. Open Publishing Association, 32–47
Bhatia R (2018) Tensorflow vs caffe: which machine learning framework should you
opt for? In: Analytics India Magazine, Analytics India Magazine Pvt Ltd. Available
in: https://analyticsindiamag.com/tensorflow-vs-caffe-which-machine-learning-framework-
should-you-opt-for/. Cited January 14th, 2020
Bradski G, Kaehler A (2008) Learning OpenCV: computer vision with the OpenCV library. O’Reilly
Media, Inc
Braga RG. et al (2017) Collision avoidance based on reynolds rules: a case study using quadrotors.
In: Advances in intelligent systems and computing. Springer International Publishing, 773-780.
https://doi.org/10.1007/978-3-319-54978-1
Breedlove L (2019) An insider’s look at the rise of drones: industry veteran Lon Breedlove gives his
perspective on the evolution and future of the drone industry. Available in: https://medium.com/
hangartech/an-insiders-look-at-the-rise-of-drones-41280563f0dd. Cited November 15th, 2019
Brito PL de et al (2019) A technique about neural network for passageway detection. In: 16th inter-
national conference on information technology-new generations (ITNG 2019). Springer Interna-
tional Publishing, 465-470. https://doi.org/10.1007/978-3-030-14070-064

Buhus ER, Timis D, Apatean A (2016) Automatic parking access using openalpr on raspberry pi3. In:
Journal of ACTA TECHNICA NAPOCENSIS Electronics and Telecommunications. Technical
University of Cluj-Napoca, Cluj, Romania
Cabebe J (2019) Google translate for android adds OCR (2012). Available in ?? Cited November
15th, 2019
Cardamone A (2017) Implementation of a pilot in the loop simulation environment for UAV devel-
opment and testing. Doctoral Thesis (Graduation Project)|Scuola di Ingegneria Industriale e
dell’Informazione, Politecnico di Milano, Milano, Lombardia, Italia
Chapman A (2016) Types of drones: multi-rotor vs fixed-wing vs single rotor vs hybrid VTOL. DRONE Magz I(3):10
Cousins S (2010) ROS on the PR2 [ROS topics]. IEEE robotics & automation magazine, institute of
electrical and electronics engineers (IEEE), vol 17(3), 23-25. https://doi.org/10.1109/mra.2010.
938502
Dai J et al (2016) R-FCN: object detection via region-based fully convolutional networks. In: Proceedings of the 30th international conference on neural information processing systems (NIPS'16). Curran Associates Inc., Red Hook, NY, USA, 379–387. ISBN 9781510838819
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE
computer society conference on computer vision and pattern recognition (CVPR’05). IEEE.
https://doi.org/10.1109%2Fcvpr.2005.177
Dalmia A, Dalmia A (2019) Real-time object detection: understanding SSD. Available
in: https://medium.com/inveterate-learner/real-time-object-detection-part-1-understanding-
ssd-65797a5e675b. Cited November 28th, 2019
Deng L (2014) Deep learning: methods and applications. In: Foundations and trends R in signal
processing, Now Publishers, vol 7(3-4), 197-387. https://doi.org/10.1561%2F2000000039
Dipietro R (2019) A friendly introduction to cross-entropy loss. Available in: https://rdipietro.
github.io/friendly-intro-to-cross-entropy-loss/. Cited December 06th, 2019
Dogra A, Bhalla P (2014) Image sharpening by Gaussian and Butterworth high pass filter. Biomed Pharmacol J, Oriental Scientific Publishing Company 7(2):707–713. https://doi.org/10.13005%2Fbpj%2F545
DRONEKIT (2015) DRONEKIT: your aerial platform. Available in: https://dronekit.io/. Cited
December 21st, 2019
Du S et al (2013) Automatic license plate recognition (ALPR): a state-of-the-art review. In: IEEE
transactions on circuits and systems for video technology. Institute of Electrical and Electron-
ics Engineers (IEEE), v. 23, n. 2, 311–325 (2013) doi: https://doi.org/10.1109%2Ftcsvt.2012.
2203741
Eikvil L (1993) OCR—optical character recognition. Gaustadalleen 23, P.O. Box 114 Blindern,
N-0314 Oslo, Norway
Everingham M et al (2010) The pascal visual object classes (voc) challenge. Int J Comput Vision
88(2):303–338
Fairchild C (2016) Getting started with ROS. In: ROS robotics by example: bring life to your robot
using ROS robotic applications. Packt Publishing, Birmingham, England. ISBN 978-1-78217-
519-3
Fathian K et al (2018) Vision-based distributed formation control of unmanned aerial vehicles
Feng L, Fangchao Q (2016) Research on the hardware structure characteristics and EKF filtering algorithm of the autopilot PIXHAWK. In: 2016 sixth international conference on instrumentation & measurement, computer, communication and control (IMCCC). IEEE. https://doi.org/10.1109/imccc.2016.128
Ferdaus MM (2017) ninth international conference on advanced computational intelligence
(ICACI). IEEE. https://doi.org/10.1109/icaci.2017.7974513
Fisher R et al (2020) Unsharp Filter. 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK:
The University of Edinburgh. Hypermedia Image Processing Reference (HIPR), School of Infor-
matics, The University of Edinburgh (2003). Available in: https://homepages.inf.ed.ac.uk/rbf/
HIPR2/unsharp.htm. Cited January 14th, 2020

Forson E (2017) Understanding SSD MultiBox—real-time object detection in deep learning. Available in: https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab. Cited November 26th, 2019
Frossard D (2016) VGG in TensorFlow: model and pre-trained parameters for VGG16 in Tensor-
Flow. Available in: https://www.cs.toronto.edu/~frossard/post/vgg16/. Cited November 28th,
2019
Fulton W (2020) Math of field of view (FOV) for a camera and lens. Available in: https://www.
scantips.com/lights/eldofviewmath.html. Cited January 16th, 2020
Ganesh P (2019) Object detection : Simple. Available in: https://towardsdatascience.com/object-
detection-simplied-e07aa3830954. Cited January 02nd 2020
Google Developers (2019) Classification: true vs. false and positive vs. negative. Available in: https://developers.google.com/machine-learning/crash-course/classication/true-false-positive-negative. Cited December 08th, 2019
Google Developers (2019) What is supervised learning? Available in: https://developers.
google.com/machine-learning/crash-course/classication/true-false-positive-negative. Cited Jan-
uary 13th, 2020
Hampapur A et al (2005) Smart video surveillance: exploring the concept of multiscale spatiotempo-
ral tracking. IEEE Signal Processing Magazine, vol 22(2). Institute of Electrical and Electronics
Engineers (IEEE), 38–51 . https://doi.org/10.1109/msp.2005.1406476
Hasan KSB (2019) What, Why and How of ROS. Available in: https://towardsdatascience.com/
what-why-and-how-of-ros-b2f5ea8be0f3. Cited December 19th, 2019
Hentai AI (2018) 14th international wireless communications & mobile computing conference
(IWCMC). IEEE. https://doi.org/10.1109/iwcmc.2018.8450505
Holden D, Saito J, Komura T (2016) A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics, Association for Computing Machinery (ACM) 35(4):1–11. https://doi.org/10.1145/2897824.2925975
Hong Y, Fang J, Tao Y (2008) Ground control station development for autonomous UAV. In: Intelli-
gent Robotics and Applications. Springer Berlin Heidelberg, 36-44. https://doi.org/10.1007/978-
3-540-88518-4
Howard AG et al (2017) MobileNets: efficient convolutional neural networks for mobile vision applications
Huang J et al (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE. https://doi.org/10.1109%2Fcvpr.2017.351
Huang J et al (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, 7310–7311
Hui J (2019) SSD object detection: single shot multibox detector for real-time processing. Available
in: https://medium.com/@jonathanhui/ssd-object-detection-single-shot-multibox-detector-for-
real-time-processing-9bd8deac0e06. Cited November 27th, 2019
Islam N, Islam Z, Noor N (2016) A survey on optical character recognition system. In: Journal
of information & communication technology-JICT. 06010 UUM Sintok Kedah Darul Aman,
Malaysia: Universiti Utara Malaysia Press, (Issue. 2, v. 10) (2016)
Jain Y (2020) Tensorflow or PyTorch : the force is strong with which one? Avail-
able in: https://medium.com/@UdacityINDIA/tensorow-or-pytorch-the-force-is-strong-with-
which-one-68226bb7dab4. Cited March 04th, 2020
Jesus LD de et al (2019) Greater autonomy for RPAS using solar panels and taking advantage of
rising winds through the algorithm. In: 16th international conference on information technology-
new generations (ITNG 2019). Springer 615-616. https://doi.org/10.1007/978-3-030-14070-0
Jia Y, Shelhamer E (2000) Caffe Model Zoo. Available at: http://caffe.berkeleyvision.org/model_zoo.html. Cited January 14th, 2020
Jia Y, Shelhamer EC (2019). Available in: https://caffe.berkeleyvision.org/. Cited January 02nd,
2020
Jia Y, Shelhamer E, Jia Y et al (2014) Caffe. In: Proceedings of the ACM international conference
on multimedia—MM ’14. ACM Press. https://doi.org/10.1145/2647868.2654889

Joseph L (2015) Why should we learn ros?: Introduction to ROS and its package management. In:
Mastering ROS for robotics programming : design, build, and simulate complex robots using
Robot Operating System and master its out-of-the-box functionalities. Packt Publishing, Birm-
ingham, England. ISBN 978-1-78355-179-8 (2015)
Kantue P, Pedro JO (2019) Real-time identification of faulty systems: development of an aerial
platform with emulated rotor faults. In: 4th conference on control and fault tolerant systems
(SysTol). IEEE. https://doi.org/10.1109/systol.2019.8864732
Karpathy A (2020) Layers used to build ConvNets (2019). Available in: http://cs231n.github.io/
convolutional-networks/. Cited January 02nd 2020
KERAS (2020) KERAS: The python deep learning library (2020). Available in: https://keras.io/.
Cited March 03rd 2020
Koenig N, Howard (2004) A Design and use paradigms for gazebo, an open-source multi-robot
simulator. In: 2004 IEEE/RSJ international conference on intelligent robots and systems (IROS)
(IEEE Cat. No.04CH37566). IEEE. https://doi.org/10.1109/iros.2004.1389727
Kouba A (2019) Services. Available in: http://wiki.ros.org/Services. Cited December 19th, 2019
Kranthi S, Pranathi K, Srisaila A (2011) Automatic number plate recognition. In: International
journal of advancements in technology (IJoAT). Ambala: Nupur Publishing House, Ambala,
India
Krause J et al (2013) 3D object representations for fine-grained categorization. In: 4th international IEEE workshop on 3D representation and recognition (3dRR-13). Sydney, Australia
Kumar G, Bhatia PK (2013) Neural network based approach for recognition of text images. Int J
Comput Appl Foundation Comput Sci 62(14):8–13. https://doi.org/10.5120/10146-4963
Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
Lee T, Leok M, Mcclamroch NH (2010) Geometric tracking control of a quadrotor UAV on SE(3).
In: 49th IEEE conference on decision and control (CDC), IEEE. https://doi.org/10.1109/cdc.
2010.5717652
Lin T-Y et al (2014) Microsoft coco: Common objects in context. In: European conference on com-
puter vision (ECCV). Zurich: Oral. Available in: /se3/wp-content/uploads/2014/09/cocoeccv.pdf;
http://mscoco.org. Cited January 14th, 2020
Liu G et al (2011) The calculation method of road travel time based on license plate recognition tech-
nology. In: Communications in computer and information science. Springer, Berlin, Heidelberg,
385–389. https://doi.org/10.1007%2F978-3-642-22418-854
Liu W (2020) SSD: single shot multibox detector. Available in: https://github.com/weiliu89/caffe/
tree/ssd. Cited January 14th, 2020
Liu W et al (2016) Ssd: Single shot multibox detector. Lecture notes in computer science. Springer
International Publishing, 21–37. https://doi.org/10.1007/978-3-319-46448-0. ISSN 1611-3349
Luukkonen T (2011) Modelling and control of quadcopter. Master Thesis (Independent research
project in applied mathematics)|Department of Mathematics and Systems Analysis, Aalto Uni-
versity School of Science, Espoo, Finland
Mao W et al (2017) Indoor follow me drone. In: Proceedings of the 15th annual international conference on mobile systems, applications, and services—MobiSys '17. ACM Press. https://doi.org/10.1145/3081333.3081362
Martinez A (2013) Getting started with ROS. In: Learning ROS for robotics programming: a prac-
tical, instructive, and comprehensive guide to introduce yourself to ROS, the top-notch, leading
robotics framework. Packt Publishing, Birmingham, England. ISBN 978-1-78216-144-8
Martins WM et al (2018) A computer vision based algorithm for obstacle avoidance. Information
Technology-New Generations. Springer 569–575. https://doi.org/10.1007/978-3-319-77028-4
MAVLINK (2019) MAVLink Developer Guide. Available in: https://mavlink.io/en/. Cited Decem-
ber 01st, 2019
MAVSDK (2019) MAVSDK (develop) (2019). Available in: https://mavsdk.mavlink.io/develop/
en/. Cited December 01st, 2019

Meier L et al (2012) PIXHAWK: a micro aerial vehicle design for autonomous flight using onboard computer vision. Autonomous Robots 33(1–2). Springer Science and Business Media LLC, 21–39. https://doi.org/10.1007/s10514-012-9281-4
Meyer A (2020) X-Plane. Available in: https://www.x-plane.com/. Cited March 01st, 2020
Mishra N et al (2012) Shirorekha chopping integrated tesseract OCR engine for enhanced hindi
language recognition. Int J Comput Appl Foundation Comput Sci 39(6):19–23
Mithe R, Indalkar S, Divekar N (2013) Optical character recognition. In: International Journal of
Recent Technology and Engineering (IJRTE). G18-19-20, Block-B, Tirupati Abhinav Homes,
Damkheda, Bhopal (Madhya Pradesh)-462037, India: Blue Eyes Intelligence Engineering and
Sciences Publication (BEIESP), (1, v. 2), 72–75
Mitsch S, Ghorbal K, Platzer A (2013) On provably safe obstacle avoidance for autonomous robotic
ground vehicles. In: Robotics: science and systems IX. Robotics: Science and Systems Founda-
tion. https://doi.org/10.15607%2Frss.2013.ix.014
Nguyen KD, Ha C, Jang JT (2018) Development of a new hybrid drone and software-in-the-
loop simulation using PX4 code. In: Intelligent computing theories and application. Springer
International Publishing, 84–93. https://doi.org/10.1007/978-3-319-95930-6
Nogueira L (2014) Comparative analysis between Gazebo and V-REP robotic simulators. Master
Thesis (Independent research project in applied mathematics)|School of Electrical and Computer
Engineering, Campinas University, Campinas, São Paulo, Brazil
Oliveira na Estrada (2020) Marginal Pinheiros: alterações no caminho para Castelo Branco (in Portuguese). Available in: https://www.youtube.com/watch?v=VEpMwK0Zw1g. Cited January 05th, 2020
Opala M (2020) Top machine learning frameworks compared: SCIKIT-Learn, DLIB, MLIB, tensor
flow, and more. Available in: https://www.netguru.com/blog/top-machine-learning-frameworks-
compared. Cited March 04th, 2020
Opelt A et al (2006) Generic object recognition with boosting. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, Institute of Electrical and Electronics Engineers (IEEE), v. 28, n.
3, 416-431. https://doi.org/10.1109%2Ftpami.2006.54
OPENALPR (2020) OpenALPR Documentation. Available in: http://doc.openalpr.com/. Cited
January 07th, 2020
O’Shea K, Nash R (2015) An introduction to convolutional neural networks
Oza P, Patel VM (2019) One-class convolutional neural network. IEEE signal processing letters,
Institute of Electrical and Electronics Engineers (IEEE), v. 26, n. 2, 277-281. https://doi.org/10.
1109%2Flsp.2018.2889273
Papageorgiou C, Poggio T (2000) A trainable system for object detection. International Journal of
Computer Vision, Springer Science and Business Media LLC 38(1):15–33. https://doi.org/10.
1023/a:1008162616689
Patel C, Patel A, Patel D (2012) Optical character recognition by open source OCR tool tesseract: a
case study. In: International journal of computer applications, vol 55(10). Foundation of Computer
Science, 50-56. https://doi.org/10.5120/8794-2784
Pinto LGM et al (2019) A SSD–OCR approach for real-time active car tracking on quadrotors. In:
16th international conference on information technology-new generations (ITNG 2019). Springer,
471–476
PIXHAWK (2019) What is PIXHAWK? Available in https://pixhawk.org/. Cited December 16th,
2019
PX4 (2019) PX4 DEV, MAVROS. Available in: https://dev.px4.io/v1.9.0/en/ros/mavrosinstallation.
html. Cited December 19th, 2019
PX4 (2019) Simple multirotor simulator with MAVLink protocol support. Available in: https://
github.com/PX4/jMAVSim. Cited March 01st, 2020
PX4 (2019) What Is PX4? Available in: https://px4.io. Cited December 03rd 2019
PX4 DEV (2019) MAVROS. Available in: https://dev.px4.io/v1.9.0/en/ros/mavrosinstallation.html.
Cited December 19th, 2019
PX4 DEV (2019) Using DRONEKIT to communicate with PX4. Available in: https://dev.px4.io/
v1.9.0/en/robotics/dronekit.html. Cited December 21st, 2019

PYTORCH (2020) Tensors and dynamic neural networks in Python with strong GPU acceleration.
Available in: https://github.com/pytorch/pytorch. March 03rd, 2020
Qadri MT, Asif M (2009) Automatic number plate recognition system for vehicle identification using optical character recognition. In: 2009 international conference on education technology and computer. IEEE. https://doi.org/10.1109/icetc.2009.54
QGROUNDCONTROL (2019) QGroundControl User Guide (2019). Available in: https://docs.
qgroundcontrol.com/en/. Cited December 06th, 2019
QGROUNDCONTROL (2019) QGROUNDCONTROL: intuitive and powerful ground control sta-
tion for the MAVLink protocol. Available in: http://qgroundcontrol.com/. Cited December 19th,
2019
Quigley M et al (2009) ROS: an open-source robot operating system, vol 3
Ramirez-Atencia C, Camacho D (2018) Extending QGroundControl for automated mission plan-
ning of UAVs. Sensors, MDPI AG 18(7):2339. https://doi.org/10.3390/s18072339
Rampasek L, Goldenberg A (2016) TensorFlow: Biology’s gateway to deep learning? Cell Systems,
Elsevier BV 2(1):12–14. https://doi.org/10.1016/j.cels.2016.01.009
Redmon J et al (2016) You only look once: unified, real-time object detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). IEEE. https://doi.org/10.1109/cvpr.2016.91
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, 7263–7271
Ren S et al (2017) Faster r-cnn: towards real-time object detection with region proposal networks. In:
IEEE transactions on pattern analysis and machine intelligence, vol 39(6). Institute of Electrical
and Electronics Engineers (IEEE), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031.
ISSN 2160-9292
Rezatofighi H et al (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). IEEE. https://doi.org/10.1109/CVPR.2019.00075
Rogowski MV da S (2018) LiPRIF: Aplicativo para identificação de permissão de acesso de veículos
e condutores ao estacionamento do IFRS (in portuguese). Monography (Graduation Final Project)
| Instituto Federal de Educação, Ciência e Tecnologia do Rio Grande do Sul (IFRS), Campus Porto
Alegre, Av. Cel. Vicente, 281, Porto Alegre - RS - Brasil
ROS WIKI (2019) MAVROS. Available in: http://wiki.ros.org/mavros. Cited December 19th, 2019
Sabatino F (2015) Quadrotor control: modeling, nonlinear control design, and simulation. Master
Thesis (MSc)|School of Electrical Engineering and Computer Science, KTH Royal Institute of
Technology, Stockholm, Sweden
Sagar A (2020) 5 techniques to prevent overfitting in neural networks. Available in: https://towardsdatascience.com/5-techniques-to-prevent-overfitting-in-neural-networks-e05e64f9f07. Cited March 03rd, 2020
Sambasivarao K (2019) Non-maximum suppression (NMS). Available in: https://
towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c. Cited December
10th, 2019
Sarfraz M, Ahmed M, Ghaz, S (2003) Saudi arabian license plate recognition system. In: 2003
international conference on geometric modeling and graphics, 2003. proceedings. IEEE Computer
Society. https://doi.org/10.1109%2Fgmag.2003.1219663
Sawant AS, Chougule D (2015) Script independent text pre-processing and segmentation for OCR.
In: International conference on electrical, electronics, signals, communication and optimization
(EESCO). IEEE
SCIKIT-LEARN (2020) sCIKIT-Learn: machine learning in Python. Available in: https://github.
com/scikit-learn/scikit-learn. Cited March 03rd, 2020
Shafait F, Keysers D, Breuel TM (2008) Efficient implementation of local adaptive thresholding
techniques using integral images. In: Yanikoglu BA, Berkner K (ed) Document Recognition and
Retrieval XV. SPIE. https://doi.org/10.1117%2F12.767755

Shah S et al (2017) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In: Field and service robotics. Available at: https://arxiv.org/abs/1705.05065. Cited December 21st, 2019
Shobha G, Rangaswamy S (2018) Machine learning. In: Handbook of statistics. Elsevier, 197-228.
https://doi.org/10.1016%2Fbs.host.2018.07.004
Shuck TJ (2013) Development of autonomous optimal cooperative control in relay rover config-
ured small unmanned aerial systems. Master Thesis (MSc)|Graduate School of Engineering and
Management, Air Force Institute of Technology, Air University, Wright-Patterson Air Force Base
(WPAFB), Ohio, US
Silva SM, Jung CR (2018) License plate detection and recognition in unconstrained scenarios. In: Computer vision—ECCV 2018. Springer International Publishing, 593–609. https://doi.org/10.1007%2F978-3-030-01258-8_36
Simonyan K, Zisserman A (2019) Very deep convolutional networks for large-scale image recogni-
tion. In: Bengio Y, Lecun Y (ed) 3rd international conference on learning representations, ICLR
2015, San Diego, CA, USA, Conference Track Proceedings. Available in: http://arxiv.org/abs/
1409.1556. Cited December 10th, 2019
Singh R et al (2010) Optical character recognition (OCR) for printed Devnagari script using artificial
neural network. In: International journal of computer science & communication (IJCSC). (1, v.
1), 91–95
Smith R (2007) An overview of the tesseract OCR engine. In: Ninth international conference on
document analysis and recognition (ICDAR 2007) Vol 2. IEEE. https://doi.org/10.1109%2Ficdar.
2007.4376991
Smith R, Antonova D, Lee D-S (2009) Adapting the tesseract open source OCR engine for multi-
lingual OCR. In: Proceedings of the International workshop on Multilingual OCR—MOCR ’09.
ACM Press. https://doi.org/10.1145%2F1577802.1577804
Sobottka K et al (2000) Text extraction from colored book and journal covers. In: Kise Daniel
Lopresti SMK (ed) International journal on document analysis and recognition (IJDAR). Tier-
gartenstrasse 17 69121, Heidelberg, Germany: Springer-Verlag GmbH Germany, part of Springer
Nature, (4, v. 2), 163–176
Songer SA (2013) Aerial networking for the implementation of cooperative control on small
unmanned aerial systems. Master Thesis (MSc)|Graduate School of Engineering and Man-
agement, Air Force Institute of Technology, Air University, Wright-Patterson Air Force Base
(WPAFB), Ohio, US
Soviany P, Ionescu RT (2019) Frustratingly easy trade-off optimization between single-stage and two-stage deep object detectors. In: Lecture notes in computer science. Springer International Publishing, 366–378. https://doi.org/10.1007/978-3-030-11018-5
Strimel G, Bartholomew S, Kim E (2017) Engaging children in engineering design through the
world of quadcopters. Children’s Technol Eng J 21:7–11
Stutz D (2015) Understanding convolutional neural networks. In: Current Topics in Computer Vision
and Machine Learning. Visual Computing Institute, RWTH AACHEN University
Szegedy C et al (2014) Scalable, high-quality object detection
Szegedy C et al (2015) Going deeper with convolutions. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR). IEEE. https://doi.org/10.1109/CVPR.2015.7298594
Talabis MRM et al (2015) Analytics defined. In: Information security analytics. Elsevier, 1–12. https://doi.org/10.1016%2Fb978-0-12-800207-0.00001-0
Talin T (2015) LabelImg. Available in: https://github.com/tzutalin/labelImg. Cited January 13th,
2020
Tavares DM, Caurin GAP, Gonzaga A (2010) Tesseract OCR: a case study for license plate recog-
nition in Brazil
Tensorflow (2016) A system for large-scale machine learning. In: Proceedings of the 12th USENIX
conference on operating systems design and implementation. USENIX Association USA, 265–
283. ISBN 9781931971331

TENSORFLOW (2019) Why TensorFlow? Available in: https://www.tensorflow.org/. Cited December 22nd, 2019
TENSORFLOW (2020) Get started with TensorBoard. Available in: https://www.tensorflow.org/
tensorboard/get_started. Cited January 13th, 2020
TENSORFLOW (2020) Tensorflow detection model zoo. Available in: https://github.com/
tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md.
Cited January 13th, 2020
Tokui S et al (2019) Chainer: a next-generation open source framework for deep learning. In:
Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual
conference on neural information processing systems (NIPS). Available in: http://learningsys.
org/papers/LearningSys2015paper33.pdf. Cited December 22nd 2019
Tompkin J Deep Learning with TensorFlow: introduction to computer vision. Available in: http://
cs.brown.edu/courses/cs143/2017Fall/proj4a/. Cited December 10th, 2019
Vargas AMPCACG, Vasconcelos CN (2019) Um estudo sobre redes neurais convolucionais e sua
aplicação em detecção de pedestres (in portuguese). In: Cappabianco FAM et al (eds) Electronic
proceedings of the 29th conference on graphics, patterns and images (SIBGRAPI’16). São José
dos Campos, SP, Brazil. Available in: http://gibis.unifesp.br/sibgrapi16. Cited November 23rd,
2019
Weerasinghe R et al (2020) NLP applications of Sinhala: TTS & OCR. In: Proceedings of the third
international joint conference on natural language processing: volume-II. Available in: https://
www.aclweb.org/anthology/I08-2142. Cited January 03rd, 2020
Woo A, Broker H-B (2004) Gnuplot quick reference. Available in: http://www.gnuplot.info/docs4.
0/gpcard.pdf. Cited January 14th, 2020
Yan Q-Z, Williams JM, Li J (2002) Chassis control system development using simulation: software
in the loop, rapid prototyping, and hardware in the loop. SAE International, SAE Technical Paper
Series. https://doi.org/10.4271/2002-01-1565
Zhang H et al (2011) An improved scene text extraction method using conditional random field
and optical character recognition. In: 2011 international conference on document analysis and
Recognition. IEEE. https://doi.org/10.1109%2Ficdar.2011.148
Chapter 10
Palmprint Biometric Data Analysis for
Gender Classification Using Binarized
Statistical Image Feature Set

Shivanand Gornale, Abhijit Patil, and Mallikarjun Hangarge

Abstract Biometrics may be defined as a technological system that measures individuals based upon their physiological and behavioral traits. The performance of behaviometric systems is poor, as very few operational systems are deployed. In contrast, physiometric systems seem significant and are used more due to their individuality and permanence; traits such as iris, face, fingerprint, and palmprint are well-used physiometric modalities. In this paper, the authors have implemented an algorithm which identifies human gender from the palmprint using binarized statistical image features. Filters ranging from 3 × 3 to 13 × 13, with a fixed length of 8 bits, allow capturing detailed information from ROI palmprints. The proposed method achieved an accuracy of 98.2% on the CASIA palmprint database, an outperforming and competitive result.

10.1 Introduction

Security is a growing concern and, as credential-based methods are no longer prevailing and suitable for usage, biometrics-based measures are adopted and mapped onto rapidly growing technologies. The era of biometrics has evolved, and nowadays the usage of biometrics has become inevitable for gender classification and user identification (Gornale et al. 2015; Sanchez and Barea 2018; Shivanand et al. 2015). Likewise, for many years humans have been interested in the palm and palm lines for telling fortunes. Scientists have also determined the association of palm lines with certain genetic disorders (Kumar and Zhang 2006) like Down syndrome, Cohen syndrome, and Aarskog syndrome. Palmprint is an important biometric trait, which gains a lot of
S. Gornale (B) · A. Patil


Department of Computer Science, Rani Channamma University,
Belagavi, Karnataka 591156, India
e-mail: shivanand_gornale@yahoo.com
A. Patil
e-mail: abhijitpatil05@gmail.com
M. Hangarge
Department of Computer Science, Karnatak College, Bidar, Karnataka 585401, India
e-mail: mhangarge@yahoo.co.in

attention because of its high potential authentication capability (Charif et al. 2017). A few studies have been carried out related to gender identification using palmprints. In this context, palmprint-based gender identification will be among the next most popular tasks for improving the accuracy of other biometric devices and may even double the speed of a biometric system, since the comparison problem is reduced to half of the database relative to other methods. Gender classification has several applications in the civil and commercial domains, in surveillance, and especially in forensic science for criminal detection and nabbing suspects.
Gender identification using palmprint is a binary class problem of deciding whether a given palm image corresponds to a male or to a female. Palmprints are permanent (Kumar and Zhang 2006) and unalterable (Krishan et al. 2014) by nature; the shape and size of an individual's palm may vary with age, but the basic patterns remain unchanged (Kanchan et al. 2013). This makes the palmprint notably individualistic. Earlier studies observed that palmprint patterns are genotypically determined (Makinen and Raisamo 2008; Zhang et al. 2011) and that there exist clear differences between female and male palmprints. These are definite means that can be considered to identify the gender of an individual. Palmprint contains both high- and low-resolution features like geometrical, delta-point, principal-line, wrinkle and minutiae (ridge) features, etc. (Gornale et al. 2018).
In the proposed method, the binarized statistical image feature (BSIF) technique is used, and its performance is evaluated on the public CASIA palmprint database. The results outperform those reported in the literature. The remainder of the paper is organized as follows: Sect. 10.2 covers the work related to palmprint-based gender classification, Sect. 10.3 describes the proposed methodology, Sect. 10.4 discusses the experimental results, Sect. 10.5 compares the proposed method with existing results, and finally Sect. 10.6 presents the conclusions.

10.2 Related Work

The research done earlier reveals that it is possible to authenticate an individual from the palmprint, but the work carried out in this domain is very scanty. In this section, we review studies on gender identification. Amayeh et al. (2008) investigated the possibility of obtaining gender information from the palmprint; for this, they used the palm geometry and the fingers, which they encoded using Fourier descriptors for evaluation; data were collected from 40 subjects, and a result of 98% was obtained on this limited dataset. After that, Wo et al. (2014) classified palmprint geometrical properties using a polynomial support vector machine classifier; 85% accuracy was attained with a separate set of 180 palmar images collected from 30 subjects. Unfortunately, these datasets are not publicly available for further comparison. Gornale et al. (2018) fused Gabor wavelets with local binary patterns on the public CASIA palmprint database using a simple nearest neighbor classifier; an accuracy of 96.1% was observed. Xie et al. (2018) explored the hyper-spectral CASIA palmprint dataset with a convolutional

Fig. 10.1 Diagram representing the proposed methodology

neural network and fine-tuning of the visual geometry group net (VGG-Net), managing to achieve a considerable accuracy of 89.2% with the blue spectrum.

10.3 Proposed Methodology

The proposed method comprises the following steps. In the first step, the palmprint image is preprocessed, which normalizes the input image and crops the region of interest (ROI) from the palm image. In the second step, the features are computed using BSIF. In the last step, the computed features are classified. Figure 10.1 gives a representation of the proposed methodology.

10.3.1 Database

In the proposed work, the authors have utilized the CASIA palmprint database, which is publicly available (CASIA) (http://biometrics.idealtest.org/). From the CASIA palmprint database, we have considered a subset of 4207 palmprints, out of which 3078 palm images belong to male and 1129 to female subjects, respectively. Sample images from the database are shown in Fig. 10.2.

10.3.2 Preprocessing

Preprocessing enhances some important features by suppressing undesirable distortions. In this experiment, the preprocessing is performed to extract the region of interest (Zhenan et al. 2005).
The preprocessing steps are the following:
Step 1 First, the input image is smoothed with the help of a Gaussian filter, and after that it is binarized (Otsu 1979).
Step 2 The image is normalized to a size of 640 × 480.

Fig. 10.2 Samples of the database

Fig. 10.3 Region of interest (ROI) extraction

Step 3 Two key points are searched for; key point no. 1 is the gap between the forefinger and the middle finger, and key point no. 2 is the gap between the ring finger and the little finger (Shivanand et al. 2019).
Step 4 To determine the palmprint's co-ordinate system, the tangents of the two previously located key points are computed.
Step 5 The line which joins these two key points is taken as the y-axis; along it the centroid is detected, and the line passing perpendicular to it through the centroid is treated as the x-axis.
Step 6 After obtaining the co-ordinates in step 5, the sub-image defined by these co-ordinates is taken as the region of interest. The process of region of interest extraction can be understood from Fig. 10.3, and a minimal code sketch of the first two steps is given below.
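A minimal sketch of steps 1 and 2 with OpenCV (the kernel size and the order of resizing and smoothing are assumptions; the key-point search and rotation of steps 3–6 are omitted):

import cv2

def preprocess_palm(path):
    """Gaussian smoothing, Otsu binarization and size normalization of a palm image."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    gray = cv2.resize(gray, (640, 480))                      # step 2: normalize size
    smoothed = cv2.GaussianBlur(gray, (5, 5), 0)             # step 1: Gaussian smoothing
    _, binary = cv2.threshold(smoothed, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # Otsu threshold
    return binary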

10.3.3 Feature Extraction

Feature computation is performed on the extracted palmprint ROIs using binarized statistical image features, which are akin to local binary patterns and local phase quantization (Gornale et al. 2018; Juho and Rahtu 2012). The BSIF technique conventionally encodes textural information from the sub-regions of the image. The BSIF method (Patil et al. 2019) produces basis vectors by projecting the patch linearly onto sub-spaces obtained by independent component analysis (Abdenour et al. 2014; Juho and Rahtu 2012). Each pixel co-ordinate value is thresholded, and an equivalent binary code is generated; the value of the local descriptor of the image intensity pattern is built from the values in the neighborhood of the selected pixel. For a palmprint patch P(b, c) and a BSIF filter W_i^{K \times K}, the filter response is obtained as follows:

r_i = \sum_{b,c} P(b, c) \times W_i^{K \times K}(b, c)    (10.1)

where × denotes the convolution operation, b and c index the pixels of the palmprint image patch, W_i^{K \times K} with i = 1, \ldots, L are the filters with L being the bit-string length, and K × K is the filter size. Each response is thresholded to a binary digit:

d_i = \begin{cases} 1, & \text{if } r_i > 0 \\ 0, & \text{otherwise} \end{cases}    (10.2)

Likewise, for each pixel (b, c) the L binary responses are combined into a code, and the BSIF features are obtained by plotting the histogram of the resulting binary codes from each sub-region of the ROI:

\mathrm{BSIF}^{K \times K}(b, c) = \sum_{i=1}^{L} d_i(b, c) \times 2^{\,i-1}    (10.3)

In this experiment, the filter size is varied from 3 × 3 to 13 × 13, so that six different filter sizes are utilized, and the length is fixed to the standard 8-bit coding. Consequently, a feature vector of 256 elements is extracted from each male and female ROI of the palmprint images. Figure 10.4 represents a visualization of the application of these filters.

Fig. 10.4 Feature extraction
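A minimal sketch of Eqs. (10.1)–(10.3) in Python (real BSIF uses filters learned by independent component analysis on natural image patches; the random stand-in filters, the 11 × 11 size and the function names below are illustrative assumptions only):

import numpy as np
from scipy.signal import convolve2d

def bsif_histogram(roi, filters):
    """BSIF-style code image and 2^L-bin histogram for one ROI (L = number of filters)."""
    code = np.zeros(roi.shape, dtype=np.uint16)
    for i, w in enumerate(filters):
        response = convolve2d(roi, w, mode="same", boundary="symm")  # Eq. (10.1)
        bit = (response > 0).astype(np.uint16)                       # Eq. (10.2)
        code += bit << i                                             # Eq. (10.3): weight 2^i
    n_codes = 2 ** len(filters)
    hist, _ = np.histogram(code, bins=n_codes, range=(0, n_codes))
    return hist

# Example with stand-in filters (11 x 11, 8 bits), matching the best setting reported.
rng = np.random.default_rng(0)
filters = [rng.standard_normal((11, 11)) for _ in range(8)]
roi = rng.random((128, 128))
features = bsif_histogram(roi, filters)   # 256-element feature vector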

10.3.4 Classifier

Linear discriminant analysis (LDA) is a primary classification technique with small computational complexity, which is commonly utilized for reducing dimensionality (Zhang et al. 2012). It works by separating the variance both between and within the classes (Jing et al. 2005). LDA is a binary classifier, which assigns the class label '0' or '1' to the palmprint images based upon the class variances.
The nearest neighbor classifier assigns class labels based upon different kinds of distances. It classifies according to the k-value, which determines how many immediate neighbors are explored to provide the label for an unlabeled sample:

d_{\text{Euclidean}}(M, N) = \sqrt{(M - N)^{T}(M - N)}    (10.4)

d_{\text{Cityblock}}(M, N) = \sum_{j=1}^{n} |M_j - N_j|    (10.5)

Support vector machines (SVMs) embody a statistical learning technique which classifies the label based upon different kinds of learning functions (Shivanand et al. 2015). It is basically a binary classifier that endeavors to seek an optimal hyper-plane separating the labels of a set of n data vectors with labels Y_i:

F(X) = (W^{T} Y_i - b \geq 1)    (10.6)

Here, Y_i is predicted to belong either to the male or to the female class by using the discriminative function F(X). Geometrically, the support vectors are the training patterns that are nearest to the decision boundary.

10.4 Experimental Results and Discussion

In this work, gender classification using palmprint biometrics is explored by employing BSIF filters of varying sizes. The filter size is varied from 3 × 3 to 13 × 13, giving six different filter sizes, and the length is fixed to the standard 8-bit coding. The experimentation is carried out with 10-fold cross-validation over different binary classifiers, namely LDA, K-NN, and SVM, on the publicly available CASIA palmprint database. Precision (P), recall (R), F-measure (F), and accuracy (A) are calculated. The results of the exhaustive experiments are demonstrated in Tables 10.1 and 10.2.
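A minimal sketch of this evaluation protocol using scikit-learn as a stand-in for whatever tool the authors used (the placeholder data, the k = 3 Manhattan K-NN, the cubic-kernel SVM and the scoring choice are assumptions for illustration only):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

# X: 256-dimensional BSIF histograms, y: 0 = male, 1 = female (placeholder data here).
rng = np.random.default_rng(0)
X = rng.random((400, 256))
y = rng.integers(0, 2, 400)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "KNN cityblock (k=3)": KNeighborsClassifier(n_neighbors=3, metric="manhattan"),
    "SVM cubic": SVC(kernel="poly", degree=3),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")  # 10-fold CV
    print(f"{name}: mean accuracy = {scores.mean():.3f}")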
From Table 10.1, it is observed that with 3 × 3 8-bit length filters, the K-NN classifier with Euclidean distance and K = 3 (a value empirically fixed throughout the experiment) obtained an accuracy of 85.5%, while the lowest accuracy of 76.1% was obtained by LDA. The support vector machine performed worse than K-NN, at 80.6%. Similarly, for 5 × 5 8-bit length filters, the K-NN classifier with Euclidean distance obtained an accuracy of 93%, and the lowest accuracy of 76.3% was obtained by the LDA classifier. The support vector machine again performed worse than K-NN, at 85.9%. Further, with 7 × 7 8-bit length filters, the K-NN classifier yielded the highest accuracy of 96.7% with Euclidean distance, and the lowest accuracy of 79.7% was obtained using LDA. The support vector machine performed worse than K-NN, at 91%.
From Table 10.2, it is observed that with 9 × 9 8-bit length filters, the highest accuracy of 98.1% was noted with K-NN city block distance, and the lowest accuracy of 80.1% was obtained with LDA. The support vector machine classifier yielded a lower result than K-NN, at 93.8%. Similarly, with 11 × 11 8-bit length filters, a higher result of 98.2% was noticed with the K-NN city block distance classifier; the support vector machine classifier followed a similar trend with a lower result than K-NN, at 95.2% accuracy, and the lowest accuracy of 80.8% was attained using the LDA classifier. For the 13 × 13 8-bit length filters, we noted results similar to the 11 × 11 filters for the K-NN classifier, with an accuracy of 98.2% with Euclidean distance, while the lowest accuracy of 79.7%

Table 10.1 Results of 3 × 3, 5 × 5, and 7 × 7 filters size


Filter size 3×3 5×5 7×7
P R F A P R F A P R F A
LDA 0.89 0.80 0.42 76.1 0.89 0.80 0.42 76.3 0.90 0.83 0.43 79.7
SVM Quad 0.90 0.82 0.43 79.5 0.92 0.88 0.45 85.9 0.94 0.88 0.45 86.8
SVM Cubic 0.89 0.85 0.43 80.6 0.92 0.85 0.44 82.8 0.95 0.92 0.46 91
KNN CityBlock 0.92 0.87 0.45 85.2 0.95 0.94 0.47 93.0 0.98 0.97 0.48 96.7
KNN Euclidean 0.92 0.87 0.45 85.4 0.95 0.93 0.47 92.3 0.98 0.97 0.48 97.0

Table 10.2 Results of 9 × 9, 11 × 11, and 13 × 13 filters size


Filter size 9×9 11 × 11 13 × 13
P R F A P R F A P R F A
LDA 0.90 0.83 0.43 80.1 0.90 0.84 0.43 80.8 0.90 0.83 0.43 79.7
SVM Quad 0.96 0.90 0.46 89.5 0.96 0.91 0.46 90.9 0.97 0.91 0.47 91.6
SVM Cubic 0.97 0.94 0.47 93.8 0.97 0.95 0.48 95.2 0.98 0.96 0.48 95.9
KNN CityBlock 0.99 0.98 0.49 98.0 0.99 0.98 0.49 98.2 0.98 0.98 0.49 98.1
KNN Euclidean 0.99 0.98 0.49 97.9 0.99 0.98 0.49 98.1 0.98 0.98 0.49 98.2

Table 10.3 Detailed confusion matrices for all filter sizes (for each classifier, the first row is actual male and the second row actual female; within each filter size, columns are predicted male and predicted female)

                        3×3         5×5         7×7         9×9         11×11       13×13
                        M     F     M     F     M     F     M     F     M     F     M     F
LDA (male)              2753  325   2756  322   2792  286   2791  287   2784  294   2786  292
LDA (female)            682   447   673   456   567   562   550   579   515   614   561   568
SVM Quad (male)         2790  288   2850  228   2909  169   2963  115   2983  95    2987  91
SVM Quad (female)       576   553   366   763   387   742   326   803   288   841   263   866
SVM Cubic (male)        2745  333   2857  221   2935  143   2994  84    3014  64    3025  53
SVM Cubic (female)      482   647   501   628   235   894   175   954   137   992   118   1011
KNN CityBlock (male)    2856  222   2945  133   3022  56    3049  29    3055  23    3047  31
KNN CityBlock (female)  402   727   183   946   82    1047  57    1072  53    1076  48    1081
KNN Euclidean (male)    2760  318   2826  252   2964  114   3017  61    3040  38    3039  39
KNN Euclidean (female)  369   760   220   909   107   1022  64    1065  53    1076  40    1089

was obtained by LDA. The support vector machine performed worse than K-NN, yielding 95.9% accuracy.
The confusion matrices for these experiments are given in Table 10.3. By varying the filter size with a fixed length of 8 bits, it has been observed that as the filter size is increased, higher accuracy is attained. Thus, varying the size of the filters allows capturing varied information from the ROI palmprint images.

10.5 Comparative Analysis

To assess the effectiveness of the proposed method, the authors have compared it with similar works present in the literature, summarized in Table 10.4. In Amayeh et al. (2008), the authors made use of palm geometry, Zernike moments, and Fourier descriptors and obtained 98% accuracy on a relatively small dataset of just 40 images.

Table 10.4 Comparative analysis

Authors                 Features                                    Database                        Classifier                                        Results (%)
Amayeh et al. (2008)    Palm geometry, Fourier descriptors          20 males and 20 females         Score level fusion with linear discriminant        98
                        and Zernike moments                                                         analysis
Wo et al. (2014)        Palm geometry features                      90 males and 90 females         Polynomial support vector machine                  85
Gornale et al. (2018)   Fusion of Gabor wavelet & local binary      CASIA database                  K-nearest neighbor                                 96.1
                        patterns
Xie et al. (2018)       Convolution neural network                  Multi-spectral CASIA database   Fine-tuning visual geometry group net              89.2
Proposed method         Binary statistical image features           CASIA database                  K-nearest neighbor                                 98.2

However, Wo et al. (2014) utilized very basic geometric properties like length, height, and aspect ratio with a polynomial SVM and obtained 85% accuracy. Gornale et al. (2018) fused Gabor wavelets with local binary patterns on the public CASIA palmprint database. Xie et al. (2018) explored gender classification on the hyper-spectral CASIA palmprint dataset with a convolutional neural network and fine-tuning of the visual geometry group net (VGG-Net). The drawback of the reported works based on self-created databases is that they are unsuitable for low-resolution and far-distance images captured through non-contact methods, as they require touch-based palm acquisition. The proposed method, in contrast, works with a public database and is suitable for both approaches. Using BSIF filters with a basic K-NN classifier on a relatively larger dataset consisting of 4207 palmprint ROIs, the proposed method outperformed them with an accuracy of 98.2%. A brief summary of the comparison is presented in Table 10.4.

10.6 Conclusion

In this paper, the authors explore the performance of binary statistical image features (BSIF) on CASIA palmprint images by varying the filter size with a fixed bit length of 8 bits. As the filter size is increased, a progressive improvement is observed, reaching 98.2% for filter sizes of 11 × 11 and above. Thus, varying the size of the filters helps capture information from the ROI palmprints. The proposed method is implemented on a contact-free palmprint acquisition process and is applicable to both contact and contact-less methods. Our basic objective in this work is to develop a standardized system that can efficiently distinguish between males and females on
the basis of palmprints. With a basic K-NN classifier and BSIF features, the authors have managed to achieve a relatively better result on a larger database of 4207 palmprint images. In the near future, the plan is to devise a generic algorithm which identifies gender based on multimodal biometrics.

Acknowledgements The authors would like to thank the Chinese Academy of Sciences Institute of Automation for providing access to the CASIA Palmprint Database for conducting this experiment.

References

Abdenour H, Juha Y, Boradallo M (2014) Face and texture analysis using local descriptor: a compar-
ative analysis. In: IEEE international conference image processing theory, tools and application
IPTA https://doi.org/10.1109/IPTA.2014.7001944
Adam K, Zhang D, Kamel M (2009) A survey of palmprint recognition. Patt Recognit Lett
42(8):1408–1411
Amayeh G, Bebis G, Nicolescu M (2008) Gender classification from hand shapes. In: 2008
IEEE society conference on computer vision and pattern recognition workshop. AK, 1–7.
https://doi.org/10.11.09/CVPRW
Charif H, Trichili Adel M, Solaiman B (2017) Bimodal biometrics system for hand shape and palm-
print recognition based on SIFT sparse representation. Multimedia Tools Appl 76(20):20457–
20482. https://doi.org/10.1007/s11042-106-3987-9
Gornale SS, Malikarjun H, Rajmohan P, Kruthi R (2015) Haralick feature descriptors for gender
classification using fingerprints: a machine learning approach. Int J Adv Res Comput Sci Softw
Eng 5:72–78. ISSN: 2277 128X
Gornale SS, Patil A, Mallikarjun H, Rajmohan P (2019) Automatic human gender identification
using palmprint. In: Smart computational strategies: theoretical and practical aspects. Springer,
Singapore. Online ISBN 978-981-13-6295-8, Print ISBN 978-981-13-6294-1
Gornale SS (2015) Fingerprint based gender classification for biometrics security: a state-of -the-art
technique. American Int J Res Sci Technol Eng Mathe (AIJRSTEM). ISSN-2328-3491
Gornale SS, Kruti R (2014) Fusion of fingerprint and age biometrics for gender classification using
frequency domain and texture analysis. Signal Image Process Int J (SIPIJ) 5(6):10
Gornale SS, Patil A, Veersheety C (2016) Fingerprint based gender identification using discrete
wavelet transformation and Gabor filters. Int J Comput Appl 152(4):34–37
Gornale S, Basavanna M, Kruti R (2017) Fingerprint based gender classification using local binary
pattern. Int J Comput Intell Res 13(2):261–272
Gornale SS, Patil A, Kruti R (2018) Fusion Of Gabor wavelet and local binary patterns features
sets for gender identification using palmprints. Int J Imaging Sci Eng 10(2):10
Jing X-Y, Tang Y-Y, Zhan D (2005) A Fourier-LDA approaches for image recognition. Patt Recognit
38(3):453–457
Juho K, Rahtu E (2012) Bsif: Binarized statistical image features. In: IEEE international conference
on pattern recognition, in (ICPR), pp 1363–1366
Kanchan T, Kewal K, Aparna KR, Shredhar S (2013) Is there a sex difference in palmprint ridge
density. Med Sci law 15:10. https://doi.org/10.1258/msl.2012.011092
Krishan K, Kanchan T, Ruchika S, Annu P (2014) Viability of palmprint ridge density in North
Indian population and its use in inference of sex in forensic examination. HOMO-J Comparat
Hum Biol 65(6):476–488
Kumar A, Zhang D (2006) Personal recognition using hand shape. IEEE Trans. Image Process
15:2454–2461

Makinen E, Raisamo R (2008) An experimental comparison of gender classification methods. Patt Recognit Lett 29(6):1554–1556
Ming W, Yuan Y (2014) Gender classification based on geometrical features of palmprint images.
SW J Article Id 734564:7
Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man
Cyber 9(1):62–66
Patil A, Kruthi R, Gornale SS (2019) Analysis of multi-modal biometrics system for gender clas-
sification using face, iris and fingerprint images. Int J Image Graphics Signal Process (IJIGSP)
11(5). https://doi.org/10.5815/ijigsp.2019.05.04. ISSN: 2074-9082
Ragvendra R, Busch C (2015) Texture based features for robust palmprint recognition: a comparative
study. EURASIP J Inf Sec 5:10–15. https://doi.org/10.1186/s13635-015-0022
Sanchez A, Barea JA (2018) Impact of aging on fingerprint ridge density: anthropometry and
forensic implications in sex inference. Sci Justice 58(5):10
Shivanand G, Basavanna M, Kruthi R (2015) Gender classification using fingerprints based on
support vector machine with 10-Cross validation technique. Int J Sci Eng Res 6(7):10
Zhang D, Guo Z, Lu G, Zhang L, Zuo Liu YW, (2011) Online joint palmprint and palm vein
verification. Expert Syst Appl 38(3):2621–2631
Zhang D, Zuo W, Yue F (2012) A comparative analysis of palmprint recognition algorithm. ACM
Comput Surv 44(1):10
Zhenan ST, Wang Y, Li SZ (2005) Ordinal palmprint representation for personal identification. In: Proceedings of the international conference on computer vision and pattern recognition, vol 1. Orlando, USA, pp 279–284
Zhihuai X, Zhenhua G, Chengshan Q (2018) Palmprint gender classification by Convolution neural
network. IET Comput Vision 12(4):476–483
Chapter 11
Recognition of Sudoku with Deep Belief
Network and Solving with Serialisation
of Parallel Rule-Based Methods and Ant
Colony Optimisation

Satyasangram Sahoo, B. Prem Kumar, and R. Lakshmi

Abstract The motivation behind this paper is to give a single-shot solution of the sudoku puzzle by using computer vision. The study's purpose is twofold: first, to recognise the puzzle by using a deep belief network, which is very useful for extracting high-level features, and second, to solve the puzzle by using a parallel rule-based technique and an efficient ant colony optimization method. Each of the two methods can solve this NP-complete puzzle on its own, but singularly they lack efficiency, so we serialised the two techniques to resolve any puzzle efficiently with less time and a smaller number of iterations.

11.1 Introduction

Sudoku or “single number” is a well-posed, logic-based, single-solution combinatorial number placement puzzle. There are many variants of sudoku, varying in size, but standard sudoku contains a 9 × 9 grid which is further subdivided into nine 3 × 3 sub-grids. The primary objective is to fill the numbers from 1 to 9 in each column, row and sub-grid so that every digit is present without any duplication. Nowadays, sudoku is a popular daily puzzle in much of the printed news media. Generalised sudoku falls under the NP-complete complexity category, meaning it is as hard as any problem in NP, as an NP-complete problem belongs to both the class NP and the class NP-hard.
Puzzles like sudoku reflect the level of human intelligence and efficiency. Even in the era of artificial intelligence and augmented reality, solving a hard puzzle efficiently still requires combining several computer algorithms in a careful way. Sudoku becomes harder and harder as its size increases or as fewer hint digits are given in their proper positions. There are many rule-based methods to solve the sudoku puzzle efficiently, and in this paper we have implemented such rule-based methods. The main objective of the paper is to provide hints and a solution to any puzzle of this kind.

S. Sahoo (B) · B. P. Kumar · R. Lakshmi


Pondicherry Central University, Pondicherry, India
e-mail: rlakshmiselva.kcs@pondiuni.edu.in


As it is an epoch-based, two-stage solution algorithm, it can be stopped at any iteration to obtain possible hints towards the solution.
In this paper, we used a convolutional deep belief network for image processing to locate the digits within the puzzle, as a convolutional deep belief network is more efficient for preprocessing the digits than other OCR-based methods. A set of new and well-known rule-based methods is implemented to solve easy-level problems and to reduce the difficulty level of the puzzle by filling many cells with the appropriate digit for a fixed number of iterations. The partially solved puzzle is then processed with ant colony optimization methods to obtain the apparent solution.

11.2 Literature Reviews

There are many papers published over the years that solve the puzzle in an efficient manner. Several authors have proposed different types of computer algorithms to address standard and larger-sized puzzles. Among all the algorithms, the backtracking algorithm and the genetic algorithm are the most famous ones. Abu Sayed Chowdhury and Suraiya Akhter even solved sudoku with the help of Boolean algebra (Chowdhury and Akhter 2012). The work in this paper is divided into two main categories: the first is based on sudoku image processing for printed digit and grid recognition, and the second proceeds to an appropriate solution for that image.
In 2015, Kamal et al. published a comparative analysis of sudoku image processing and solved the puzzle using backtracking, a genetic algorithm, etc., with a camera-based OCR technique (Kamal et al. 2015). In the same year, Baptiste Wicht and Jean Hennebert proposed a work based on handwritten and printed digit recognition using a convolutional deep belief network (Wicht and Henneberty 2015), which extends the same authors' earlier work on deep belief networks that is handy for detecting the grid with cell numbers (Wicht and Hennebert 2014).
Computer vision plays an active role in detecting and solving the puzzle (Nguyen et al 2018). Several methods, such as the heuristic hybrid approach by Nysret Musliu et al. (Musliu and Winter 2017), the genetic algorithm by Gerges et al. (Gerges et al. 2018) and parallel processing by Saxena et al. (Saxena et al. 2018), have been proposed over the years to solve the puzzle in an efficient manner. Saxena et al. composed five rule-based methods with serial and parallel processing algorithms (Saxena et al. 2018).

11.3 Image Recognition and Feature Extraction

See Fig. 11.1.



Fig. 11.1 An image of sudoku from our dataset

11.3.1 Image Preprocessing

The preprocessing of an image involves digit detection and edge detection. The image acquired by camera or loaded from a pre-stored file needs to be cleaned of unnecessary background noise, corrected for orientation, and compensated for a non-uniformly distributed illumination gradient. The image preprocessing and detection steps are:

Fig. 11.2 Processed image after Canny edge detection

1. The captured image is converted to a greyscale image, as a greyscale image is sufficient for the processing and detection of digits and edges.
2. The greyscale image is then processed through a local thresholding function T of the form

   T = T[x, y, p(x, y), f(x, y)]

   where p(x, y) is a local property and f(x, y) is the grey level at (x, y).
3. The Canny edge detection multi-step algorithm is used to detect edges while suppressing noise at the same time. In Fig. 11.2, it is used to control the amount of detail that appears on the edges of the image. The Hough transform is a computer vision method designed to identify line segments. A connected-component analysis is performed over the segments to merge them into the complete grid of the sudoku image (Ronse and Devijver 1984). A convex hull detection algorithm is used to detect the corners of the grid. Then, each side of the grid is divided into nine equal parts, and each bounding quadrilateral is counted as a final cell.
4. Convolutional Restricted Boltzmann Machines (CRBM) The restricted Boltzmann machine consists of a set of binary hidden layer units (h) and a set of visible input layer units (v). A weight matrix (W) represents the symmetric connections between the hidden units and the visible units. The probabilistic semantics of a restricted Boltzmann machine with binary visible units are defined through the energy-based distribution

   P(v,h) = \frac{1}{Z}\exp\left(\sum_{i,j} v_i W_{ij} h_j + \sum_j b_j h_j + \sum_i c_i v_i\right)

5. Deep Belief Networks A deep belief network consists of multiple layers of RBMs, where each layer comprises a set of binary units. The multi-layer convolutional RBM consists of an input (visible) layer, an array of size I_v × I_v, and N groups of hidden layers, each an array of size I_h × I_h. Each of the N hidden groups is associated with an I_w × I_w filter, and the filter weights are shared within a group. The probabilistic semantics P(v, h) of the CRBM are defined as

   P(v,h) = \frac{1}{Z}\exp\left(\sum_{n=1}^{N}\sum_{i,j=1}^{I_H}\sum_{r,s=1}^{I_W} v_{i+r-1,\,j+s-1}\, W^{n}_{rs}\, h^{n}_{ij} + \sum_{n=1}^{N} b_n \sum_{i,j=1}^{I_H} h^{n}_{ij} + c \sum_{i,j=1}^{I_v} v_{ij}\right)

   where b_n is the bias of hidden group n and c is the single bias shared by the visible input units. N groups of pooling-layer units (P^n) shrink the corresponding hidden (detection) layers (H^n) by a constant small integer factor C. Each block α of detection units B_α, with B_α = {(i, j) : h_{ij} belongs to block α}, is connected to exactly one
binary unit of the pooling layer P^n_α, so that I_P = I_H / C. Sampling of each unit can then be done as follows:

   P(v_{i,j} = 1 \mid h) = \sigma\left(c + \sum_{n=1}^{N} (W^{n} *_f h^{n})_{i,j}\right)

   Each unit of the detection layer receives a signal from the visible layer below:

   L(h^{n}_{ij}) \triangleq b_n + (\tilde{W}^{n} * v)_{ij}

   and, for the pooling layer,

   L(p^{n}_{\alpha}) \triangleq \sum_{l} (\Gamma^{nl} * h'^{l})_{\alpha}

   So, from the above derivation, the conditional probabilities are

   P(h^{n}_{ij} = 1 \mid v) = \frac{\exp(L(h^{n}_{ij}))}{1 + \sum_{(i',j') \in B_\alpha} \exp(L(h^{n}_{i'j'}))}

   P(p^{n}_{\alpha} = 0 \mid v) = \frac{1}{1 + \sum_{(i',j') \in B_\alpha} \exp(L(h^{n}_{i'j'}))}

   Finally, the distribution of the convolutional RBM is defined through its energy function as

   P(v,h) = \frac{1}{Z}\exp(-E(v,h)), \qquad E(v,h) = -\sum_{n}\sum_{i,j} \left(h^{n}_{i,j}\,(\tilde{W}^{n} * v)_{i,j} + b_n h^{n}_{i,j}\right) - c \sum_{i,j} v_{i,j}
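To make the probabilistic max-pooling equations above concrete, the following NumPy sketch (our own illustration, not the authors' implementation) evaluates P(h^n_ij = 1 | v) and P(p^n_α = 0 | v) for one hidden group, given its bottom-up signals L(h) and a pooling factor C.

import numpy as np

def pooled_softmax(L, C=2):
    """Probabilistic max pooling for one hidden group.
    L : (IH, IH) array of bottom-up signals L(h_ij) = b_n + (W~ * v)_ij
    C : pooling factor; each CxC block B_alpha feeds one pooling unit.
    Returns P(h_ij = 1 | v) and P(p_alpha = 0 | v)."""
    IH = L.shape[0]
    p_h = np.zeros_like(L, dtype=float)
    p_pool_off = np.zeros((IH // C, IH // C))
    for a in range(0, IH, C):
        for b in range(0, IH, C):
            block = np.exp(L[a:a + C, b:b + C])
            denom = 1.0 + block.sum()            # the "1 +" lets all detection units be off
            p_h[a:a + C, b:b + C] = block / denom
            p_pool_off[a // C, b // C] = 1.0 / denom
    return p_h, p_pool_off

# Toy example: a 4x4 detection layer pooled with C = 2.
rng = np.random.default_rng(0)
ph, poff = pooled_softmax(rng.normal(size=(4, 4)), C=2)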

11.3.2 Feature Extraction

A convolutional deep belief network (CDBN) is made up of a stack of probabilistic max-pooling CRBMs (Lee et al. 2009), so the total energy of the CDBN is the sum of the energies of its CRBM layers. The CDBN is used not only to recognise the digit inside each grid cell but also to act as feature extractor and classifier at the same time. Our model is made up of three layers of RBMs, where the first layer (500 hidden units) uses the rectified linear unit (ReLU) activation function, which is defined as

   f(x) = max(0, x)

Fig. 11.3 Recognition of digits by the convolutional deep belief network

It is followed by a second layer of the same 500 units, and the final visible layer is labelled with the digits from 1 to 9 (9 units); a simple base-e exponential is used in this final layer. Each CRBM except the last layer of the network is trained in an unsupervised manner using contrastive divergence (CD), and stochastic gradient descent (SGD) is used for “fine-tuning” the network. The classifier is trained on a training set of 150 images, in batches of 15 images for ten epochs, and tested on 50 images with an accuracy of 98.52% for printed digit recognition. Figure 11.3 shows successful digit recognition by the DBN.
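A quick way to prototype a stacked-RBM digit classifier of this shape is scikit-learn's BernoulliRBM combined with a logistic-regression output layer, as in the sketch below. This is only a rough, non-convolutional stand-in for the CDBN described above; the layer sizes match the text, but the learning rates, iteration counts, and placeholder data are our assumptions.

import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# X: flattened cell images scaled to [0, 1]; y: digit labels 1..9 (placeholder data).
X = np.random.rand(200, 32 * 32)
y = np.random.randint(1, 10, size=200)

model = Pipeline([
    # Two unsupervised RBM layers of 500 units each, trained greedily with CD-1.
    ("rbm1", BernoulliRBM(n_components=500, learning_rate=0.05, n_iter=10)),
    ("rbm2", BernoulliRBM(n_components=500, learning_rate=0.05, n_iter=10)),
    # Supervised read-out over the 9 digit classes.
    ("out", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print("training accuracy:", model.score(X, y))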

11.4 Sudoku Solver

After successful recognition of the digits along with their column and row numbers from 1 to 9, our algorithm is applied to the outcome of digit recognition on the basis of row and column. The method is divided into two parts. In the first part, the handwritten general rule-based algorithm is applied, followed by an ant colony optimisation algorithm. Our handwritten algorithm alone can solve many puzzles. Newspaper sudoku is partitioned into three basic categories (easy, medium, hard), or given a 5-star rating, according to its difficulty level and the number of empty cells. The handwritten general rule-based algorithm can solve easy and most medium-level puzzles. Hard puzzles are partially addressed by the general rule-based handwritten algorithm, which decreases their difficulty level. If the problem remains unsolved after some iterations of the general rule-based algorithm, it is handed over to an ACO algorithm, as ACO is very efficient at solving an NP-complete problem.

11.4.1 General Rule-Based Algorithm

The general rule-based algorithm is subdivided into six different stages which run in parallel to solve the problem. The CDBN is used to classify the digits and place them according to row and column. Each cell is assigned an array to store either its probable digits or its recognised digit, and for each row, column and 3 × 3 grid an available and an unavailable list of digits is maintained.

Algorithm 1 Algorithm for Digit Assignment


1.Start
2.Read:: 1 ← Row No , Col No , y;
3.empty array ←Avial list [Row], Avail list[col] , Avail list [Block]

4.[1 ... 9] ←Unavail list [row no] , Unavail list [col no] ,Unavail list [Block no]
5.Loop (col no ≤ 9 ) upto step 13
6.Read the row element by Convolution [1 × 1]
7. if ( y > 3 ,then reset y = 1) then
8.Block no ← int ( (col no + 2) / 3 ) + ( row no - y))
9.If (Digit found and check digit is in Unavail list [row no] [col no][Block no]) Then
10.Assign value to cell no[row no] [col no] [ Block no]
11.Add the value ← Avial list [Row no], Avail list[col no] , Avail list [Block no],
12.Eliminate the value ← Unavail list [row no] , Unavail list [col no] , Unavail list [Block no];
13.Row no + 1 , y + 1 ←Row no , y ;
each empty cell has assigned to the probability array ::-
14.1 ← Row no, col no , y;
15.Loop (col no ≤ 9 ) upto step 19
16.Read the row element by Convolution [1 × 1]
17.Select the empty cell[row no][col no][Block no]
18.Probability array[cell no] ←common element of cell unavail list[row no],unavail list[col no]
and unavail list[Block no]
19.End of loop
20.End
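Algorithm 1 can be mirrored compactly in Python: given the 9 × 9 grid returned by the recogniser (with 0 marking an empty cell), the candidate ("probability") array of each empty cell is the set of digits still unavailable in its row, column, and 3 × 3 block. The helper below is our own sketch; the names are illustrative.

def candidate_arrays(grid):
    """grid: 9x9 list of ints, 0 = empty cell.
    Returns {(row, col): set of still-possible digits} for every empty cell."""
    rows   = [set(range(1, 10)) for _ in range(9)]
    cols   = [set(range(1, 10)) for _ in range(9)]
    blocks = [set(range(1, 10)) for _ in range(9)]
    for r in range(9):
        for c in range(9):
            d = grid[r][c]
            if d:
                b = (r // 3) * 3 + c // 3
                rows[r].discard(d); cols[c].discard(d); blocks[b].discard(d)
    cands = {}
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                b = (r // 3) * 3 + c // 3
                cands[(r, c)] = rows[r] & cols[c] & blocks[b]
    return cands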

11.4.2 Methods

11.4.2.1 Step 1: Common Eliminator

After initialisation of the probability arrays, the primary objective is to shrink each array in every iteration until it contains a single element. Our handwritten rule-based algorithm is made explicitly to mimic the way a human solves a puzzle; the model reasons in a human-like fashion, and one can pause the execution at any iteration to see hints for solving the problem. The handwritten algorithm is subdivided into two basic categories: one is the assigner,
and the other is the eliminator. The algorithm is divided into six steps, which are executed in a parallel manner.

Algorithm 2 Common eliminator Algorithm


Step 1: Common eliminator Algorithm
1.Start
2. 3 ← row no
3.Loop(((Row no / 3 ) =1 ) ≤ 12 ) upto step 11
4.If Avail list [row no] = Avail list [Row no -1 ] Then
5.Extract the Block no of the candidate as x, y
6.Row = Assign candidate [row no , row no -1 ]
7.Block = Assign candidate [ x , y]
8.If there is single vacancy in that row of BlockThen
9.Eliminate all the probability of that cell except that element
11.End of loop
12.Repeat the process for combination of (Row no , Row no - 2) and (Row no -1 , Row no -2)
13.End
Assign candidate algorithm is represented as :
1.Start
2.assign candidate ( x , y)
3.If (x + y) is not divisible by 2 Then
4.Return 3y - (x+ y)
5.Else
6.Return x+y / 2
7.End

11.4.2.2 Step 2: Hidden Single

As in Fig. 11.4, a hidden single is a candidate that occurs only once in the probability lists of an entire row or column. The algorithm is expressed as:

Algorithm 3 Hidden Single Algorithm


Step 2: Hidden Single Algorithm
1.Start
2. 1 ← Candidate ;
3.Loop : candidate ≤ 9
4.0 ← Found
5.Linear search the candidate in avail list [ ]
6.Increment value of found
7.BreakIF (found > 1)
8.Else
9. Assign that candidate to that cell;
10.Eliminate that candidate from corresponding Row , Col and Block
11.candidate ← Candidate + 1 ;
12.End of loop
13.End
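The hidden-single rule of Algorithm 3 translates directly onto the candidate dictionary built by the earlier sketch: a digit that appears in exactly one candidate set of a row can be assigned immediately and removed from its peers. A minimal row-wise version (our own illustration; the column and block cases are analogous) is given below.

def hidden_singles_in_rows(grid, cands):
    """Assign every digit that is a candidate of exactly one cell in its row.
    Returns the number of assignments made in this pass."""
    assigned = 0
    for r in range(9):
        for digit in range(1, 10):
            cells = [(rr, cc) for (rr, cc) in cands
                     if rr == r and digit in cands[(rr, cc)]]
            if len(cells) == 1:                      # hidden single found
                rr, cc = cells[0]
                grid[rr][cc] = digit
                del cands[(rr, cc)]
                # Remove the digit from the candidate sets of all peers.
                for (r2, c2) in list(cands):
                    same_block = (r2 // 3 == rr // 3) and (c2 // 3 == cc // 3)
                    if r2 == rr or c2 == cc or same_block:
                        cands[(r2, c2)].discard(digit)
                assigned += 1
    return assigned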

Fig. 11.4 Hidden single: Digit 7 is the hidden single for the 8th column

11.4.2.3 Step 3: Naked Pair and Triple Eliminator

The naked pair eliminator algorithm is useful when there are only two cells in a single row, column, or block whose probability arrays contain the same pair of candidates. Since those candidates must occupy those two cells, they can be eliminated from the probability arrays of all other cells in that row, column, or block, as with digits 4 and 8 in Fig. 11.5 (https://www.youtube.com/watch?v=b123EURtu3It=97s). The same procedure is followed by the naked triple eliminator; in this case, the three elements are subdivided into either one set of three candidates or three combinations of two candidates each, and occurrences are searched for across three different cells in a single row, column or block. The algorithm is represented as follows:

11.4.2.4 Step 4: Pointing Pair Eliminator

As in Fig. 11.6, which shows digit 9 in the 8th block and 5th column, when a certain candidate appears in only two or three cells of a block and those cells are aligned in a single column or row, they are called a pointing pair. All other appearances of that candidate outside that block, in the same column or row, can be eliminated.
*In the algorithm below, cell [R][ ][B] means the same row number and the same block number but a different column number.

Fig. 11.5 Naked pair: Digits 4 and 8 are the naked pair for the 4th block (https://www.youtube.com/watch?v=b123EURtu3It=97s)

Fig. 11.6 Pointing pair: Digit 9 is the pointing pair; all its appearances outside block 8 can be eliminated

Algorithm 4 Naked pair and triple eliminator Algorithm


Step 3: Naked pair and triple eliminator Algorithm
1.start
2.Create a combination of 2 elements in to 2 candidate set from unveil list of row /column /Block
3.Create a combination of 3 elements in to 2 and 3 candidate sets from unveil list of row /column
/Block
4. 0 ← Found;
5. Search the only same candidate combination set in row ,column and Block wise
6.If search successful
7.Found ← Found + 1;
8.For 2 element set
9.If (found = 2 )
10.Eliminate these two candidate from other cell probability array of that Row , Column and Block
11.For 3 element set
12.If (found = 3)
13.Eliminate these three candidate from other cell probability array in that Row, Col and Block
14.End

Algorithm 5 Pointing pair eliminator Algorithm


Step 4: Pointing pair eliminator Algorithm
1.Start
2.1 ←Block ,candidate :
3.Loop : block ≤ 9 upto step 12
4.Loop :candidate ≤ 9 upto step 10
5.If All the appearance of candidate in probability set of cell [ ] [C] [B] >2 Then
6.Eliminate that candidates from that column in other Block
7.If All the appearance of candidate in probability set of cell [R] [ ] [B] >2 Then
8.Eliminate that candidates from that row in other Block
9.candidate ← candidate + 1
10.End of loop
11.block ← block + 1
12.End of loop
11.End

11.4.2.5 Step 5: Claiming Pair Eliminator

When a certain candidate appears in only two or three cells of a row or column and those cells all lie in a single block, they are called a claiming pair. The algorithm for rows (and similarly for columns) is:

11.4.2.6 Step 6: X-Wings

X-wing is a technique often used by enigmatologists to solve highly rated, difficult puzzles by reducing the number of candidates. The X-wing technique applies when a candidate appears in exactly two cells in each of two rows, and these four cells form the corners of a rectangle or square. The candidate can then be eliminated from all other cells of the two columns involved, as in Fig. 11.7. The same technique can also be applied to columns.

Algorithm 6 Claiming pair eliminator Algorithm


Stap 5:- Claiming pair eliminator Algorithm:
1.Start
2.1 ← Block :
3.Loop block ≤ 9 upto step 15
4.1 ← row , candidate :
5.loop:row ≤ 9 upto step 11
6.loop: candidate ≤ 9 upto step 9
7.Search the candidate occurrence in the row
8.candidate ← candidate + 1
9. End of Loop
10.row ← row + 1
11.End of loop
12.If the appearance of candidate in candidate set [ R] [ ] [B] ≥ 3 Then
13.Eliminate that candidates from any other appearances in that Block
14.block ← block + 1
15.End of loop
16.End

Algorithm 7 Here X-wings techniques for rows is expressed as


Step 6 :- Here X-wings techniques for rows is expressed as:
1.Start
2.1 ←Row ,candidate ;
3.loop: row ≤ 8 upto step 17
4.loop: candidate ≤ 9 upto step 15
5.Search for the candidate
6.If the number of appearance of candidate in a row = 2 ; Then
7.assign the first candidate column to column 1;
8.Assign the second candidate column to column 2:
9.Search that candidate in column 1
10.If found
11.Search in that row
12.If (the number appearance of that candidate =2 and found candidate column 1 = column 2) Then
13.erase that candidate other appearance in column1 and column 2 except that cell
14.candidate ← candidate + 1
15.End of loop
16.row ← Row + 1
17.End of loop
18.End

11.4.3 Parallel Algorithm Execution

All the above algorithms are independent of each other. The central theme is to find a possible candidate array for every empty cell with the help of the CDBN; after that, the parallel algorithms help minimise these arrays by eliminating elements from the probable array of each cell. If all six methods are implemented serially, one after another, it will be more
Fig. 11.7 X-Wing: Digit 9 is in an X-wing

time consuming and inefficient. The main aim of parallel execution is to minimise the time cost and increase the efficiency of the implementation. Some steps are constituted for rows and columns separately, and these too are executed in a parallel manner. In a single epoch, each of the six methods is executed exactly once, and the results are updated as input for the next iteration.
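One simple way to realise this parallel execution is to run the six rule methods of an epoch concurrently on a snapshot of the candidate arrays and merge their proposed eliminations afterwards, so the updated arrays feed the next epoch. The sketch below uses a thread pool and assumes each rule function returns a set of (cell, digit) eliminations rather than mutating shared state; this design choice is ours and is not spelled out in the chapter.

from concurrent.futures import ThreadPoolExecutor
import copy

def run_epoch(cands, rule_methods):
    """cands: {(row, col): set(digits)}.  rule_methods: callables that take a
    read-only copy of cands and return a set of (cell, digit) eliminations."""
    snapshot = copy.deepcopy(cands)
    with ThreadPoolExecutor(max_workers=len(rule_methods)) as pool:
        results = list(pool.map(lambda rule: rule(snapshot), rule_methods))
    eliminations = set().union(*results)
    for cell, digit in eliminations:          # merge all eliminations for the next epoch
        if cell in cands:
            cands[cell].discard(digit)
    return len(eliminations) > 0              # True if this epoch made progress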

11.4.4 Ant Colony Optimisation

The parallel algorithm is capable of solving most easy to medium-level problems within 100–150 epochs, and many challenging puzzles are also answered within 250–300 epochs. However, the rule-based parallel methods fail to handle the most difficult problems efficiently: for some of these puzzles the parallel algorithm stops with more than one possible digit candidate in the array of a single cell, as the methods are unable to eliminate further candidates after a certain epoch. Nevertheless, the rule-based parallel method efficiently shrinks the candidate arrays, so that another greedy-style algorithm can then be applied with fewer epochs and less time than a purely rule-based parallel algorithm would need. For this reason, the ant colony optimization method is serialised with the parallel rule-based method.

The ant colony optimisation method as a sudoku solver (Lloyd and Amos 2019) is used together with a constraint propagation method (Musliu and Winter 2017). In our ACO, each ant covers only those cells whose probability arrays contain multiple candidates in its local copy of the puzzle. An ant adds a fixed amount of pheromone when it picks a single element from the array of possible candidates, and it deletes that element from every other occurrence in the same row, column, and block. One pheromone matrix (9 × 81) is created to keep track of the updates to each component of the possible arrays. The best ant covers all the cells that had multiple candidates in the puzzle.
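A bare-bones version of the pheromone bookkeeping described above might look as follows: each ant fills only the still-ambiguous cells of its local copy, guided by a 9 × 81 pheromone matrix, and the best ant of each iteration reinforces its choices. The parameter values (number of ants, evaporation rate, deposit amount) and the conflict-count scoring are our illustrative guesses, not values taken from the chapter.

import random
import numpy as np

def aco_solve_pass(cands, n_ants=20, n_iters=50, evaporation=0.1, deposit=1.0, seed=0):
    """cands: {(row, col): set(candidate digits)} for the still-ambiguous cells.
    A 9 x 81 pheromone matrix (one row per digit, one column per cell) biases the
    choices of later ants towards the digits picked by earlier best ants."""
    rng = random.Random(seed)
    pheromone = np.ones((9, 81))
    cells = list(cands)

    def conflicts(assign):
        # Count pairs of peer cells (same row/column/block) that received the same digit.
        bad = 0
        for i, (r, c) in enumerate(cells):
            for (r2, c2) in cells[i + 1:]:
                peers = r == r2 or c == c2 or (r // 3, c // 3) == (r2 // 3, c2 // 3)
                if peers and assign[(r, c)] == assign[(r2, c2)]:
                    bad += 1
        return bad

    best_assign, best_bad = {}, float("inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            assign = {}
            for (r, c) in cells:
                options = list(cands[(r, c)])
                weights = [pheromone[d - 1, r * 9 + c] for d in options]
                assign[(r, c)] = rng.choices(options, weights=weights)[0]
            bad = conflicts(assign)
            if bad < best_bad:
                best_assign, best_bad = assign, bad
        pheromone *= (1.0 - evaporation)              # evaporation
        for (r, c), d in best_assign.items():         # best-so-far ant deposits pheromone
            pheromone[d - 1, r * 9 + c] += deposit
        if best_bad == 0:                             # a conflict-free completion was found
            break
    return best_assign, best_bad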

11.5 Result and Discussion

For this paper, we experimented on various datasets available on the Internet, such as https://github.com/wichtounet/sudokudataset used in Wicht's paper (Wicht and Henneberty 2015) (for both recognition and solution) and https://www.kaggle.com/bryanpark/sudoku (for testing the solver only). In the first half, digit and grid recognition using the deep belief network, we obtained an accuracy of 98.58% with an error rate of 1.42%, so it works almost perfectly in recognising digits according to the grid, and we processed only those puzzles which were recognised fully (Figs. 11.8 and 11.9). The rule-based algorithm alone succeeds with a success rate of 96.63% of puzzles within 304 epochs, while ant colony optimisation alone is capable of solving 98.87% of puzzles within 263 epochs; the serialisation of the parallel rule-based method and ant colony optimisation gives the highest success rate of 99.34% within 218 epochs. The maximum epochs for the three algorithms are measured on puzzles with 62 blank cells.

11.6 Conclusion

We designed and implemented a new rule-based algorithm combined with an ant colony optimization technique to solve puzzles detected by image processing using convolutional deep belief network methods. The CDBN is efficient at recognising the printed digits properly, and the solver can be applied to even highly difficult puzzles. The system is designed so that a player can stop at any iteration to see hints for the answer. In the future, we will try to implement both the digit recognition and the solver using a deep convolutional neural network alone.

Fig. 11.8 The number of epochs used by the three different algorithms

Fig. 11.9 The percentage of puzzles solved

References

Chowdhury AS, Akhter S (2012) Solving Sudoku with Boolean Algebra. Int J Comput Appl
52(21):0975–8887
Gerges F, Zouein G, Azar D (2018) Genetic algorithms with local optima handling to solve sudoku
puzzles. In: Proceedings of the 2018 international conference on computing and artificial intelli-
gence, pp 19–22

Kamal S, Chawla SS, Goel N (2015) Detection of Sudoku puzzle using image processing and
solving by backtracking, simulated annealing and genetic algorithms: a comparative analysis. In:
2015 third international conference on image information processing (ICIIP). IEEE, pp 179–184
Lee H, Grosse R, Ranganath R, Ng AY (2009) Convolutional deep belief networks for scalable unsu-
pervised learning of hierarchical representations. In: Proceedings of the 26th annual international
conference on machine learning, pp 609–616
Lloyd H, Amos M (2019) Solving Sudoku with ant colony optimization. IEEE Trans Games
Musliu N, Winter F (2017) A hybrid approach for the sudoku problem: using constraint programming
in iterated local search. IEEE Intell Syst 32(2):52–62
Nguyen TT, Nguyen ST, Nguyen LC (2018) Learning to solve Sudoku problems with computer
vision aided approaches. Information and decision sciences. Springer, Singapore, pp 539–548
Ronse C, Devijver PA (1984) Connected components in binary images: the detection problem
Saxena R, Jain M, Yaqub SM (2018) Sudoku game solving approach through parallel processing. In:
Proceedings of the second international conference on computational intelligence and informatics.
Springer, Singapore, pp 447–455
Wicht B, Hennebert J (2014) Camera-based sudoku recognition with deep belief network. In: 2014
6th international conference of soft computing and pattern recognition (SoCPaR). IEEE, pp 83–88
Wicht B, Henneberty J (2015) Mixed handwritten and printed digit recognition in Sudoku with
Convolutional deep belief network. In: 2015 13th international conference on document analysis
and recognition (ICDAR). IEEE, pp 861–865
Chapter 12
Novel DWT and PC-Based Profile
Generation Method for Human Action
Recognition

Tanish Zaveri, Payal Prajapati, and Rishabh Shah

Abstract Human action recognition in recordings acquired from surveillance cameras finds application in fields like security, health care and medicine, sports, automatic sign language recognition, and so on. The task is challenging due to variations in motion, recording settings, and inter-personal differences. In this paper, a novel DWT and PC-based profile generation algorithm is proposed which incorporates the notion of energy in extracting features from video frames. Seven energy-based features are calculated using the unique energy profile of each action. The proposed algorithm is applied with three widely used classifiers, SVM, Naive Bayes, and J48, to classify video actions. The algorithm is tested on the Weizmann dataset, and performance is measured with evaluation metrics such as precision, sensitivity, specificity, and accuracy. Finally, it is compared with the existing method of template matching using the MACH filter. Simulation results give better accuracy than the existing method.

12.1 Introduction

Human action recognition (HAR) is the process of recognizing various actions that
people perform, either individually or in a group. These actions may be walking, run-
ning, jumping, swimming, shaking hands, dancing, and many more. There are many
challenges in HAR such as differences in physiques of humans performing actions
like shape, size, color, etc., differences in background scene like occlusion, light-
ing or any other visual impairments, differences in recording settings like recording
speed, types of recording (2D or 3D/gray-scale or colored video recording), differ-
ences in motion performance by different people like difference in speed of walking,
difference in height of jumping, etc. For an algorithm to succeed, the methods used
for action representation and classification are of utmost importance. This motivated
research work in this field and the development of a plethora of different techniques

T. Zaveri · R. Shah
Nirma University, Ahmedabad 382481, India
P. Prajapati (B) · R. Shah
Government Engineering College, Patna 384265, India
e-mail: payalprajapati2808@gmail.com

which fall under local and global representation approaches, as classified by Weinland et al. (2011) based on how actions are represented.
In global representation approaches, the human body is to be detected in the image,
usually with background subtraction techniques. While this step is a disadvantage, it
results in reduction of image size and complexity. Silhouettes and contours are usually
used for representing the person. Contour-based features like Cartesian coordinate
feature (CCF), the Fourier descriptor feature (FDF) (H et al. 1995; RD and L 2000;
R et al. 2009), centroid-distance feature (CDF) (D and G 2004, 2003), and chord-
length feature (CLF) (D and G 2004, 2003; S et al. 2012) are extracted from contour
boundary of the person in the aligned silhouettes image (ASI). The region inside
contour of human object is the silhouette. Silhouette-based features are extracted
from the silhouette in the ASI image. Some common silhouette-based features are
histogram of gradient (HOG) (D and G 2003; N et al. 2006), histogram of optical
flow (HOOF) (R et al. 2012), and structural similarity index measure (SSIM) (Z et al.
2004).
In local representation approaches, videos are treated as a collection of small
unrelated patches that involve the regions of high variations in spatial and temporal
domains. Centers of these patches are called spatio-temporal interest points (STIPs).
STIPs are represented by the information related to the motion in their patches and
then clustered to form a dictionary of visual words. Each action is represented by
bag of words model (BOW) (Laptev et al. 2008). Several STIPs detectors have been
proposed recently. For example, Laptev (2005) applied Harris corner detector for
spatio-temporal case and proposed Harris3D detector, Dollar et al. (2005) applied
1D Gabor filters temporally and proposed the Cuboid detector, Willems et al. (2008)
proposed Hessian detector that measures the saliency with the determinant of 3D Hes-
sian matrix, and Wang et al. (2009) introduced dense sampling detector that finds
STIPs at regular points and scales, both spatially and temporally. Various descrip-
tors used for STIPs include histogram of oriented gradients (HOG) descriptor and
histogram of optical flow (HOF) descriptor (H et al. 1995), gradient descriptor (Dol-
lár et al. 2005), 3D scale-invariant feature transform (3D SIFT) (Scovanner et al.
2007), 3D gradients descriptor (HOG3D) (A et al. 2008), and the extended speeded
up robust features descriptor (ESURF) (Willems et al. 2008). Some limitations of
global representation such as sensitivity to noise and partial occlusion and the com-
plexity of accurate localization by object tracking and background subtraction can
be overcome by local representation approaches. Local representation also has some
drawbacks, like the ignorance of spatial and temporal connections between local
features and action parts that are necessary to preserve intrinsic characteristics of
human actions.
In this paper, a novel energy-based approach is presented which is based on the fact that every action has a unique energy profile associated with it. We try to model simple physics relations concerning the kinetic energy generated while performing an action. As is well known, if a force is applied over a given distance, work is done according to the equation W = F × D. Through the work-energy theorem, this work can be related to the change in kinetic energy (or gravitational potential energy) of an object using W = ΔK, where ΔK denotes the change in kinetic energy.
As per normal observation, the force required to perform various actions is different (e.g., running requires more force than walking), which leads to a difference in the amount of work done and hence a different energy profile for different actions. So the energy difference can serve as a feature for distinguishing among various actions. We model the different energy profiles associated with various actions using the fundamentals of phase congruency and the discrete wavelet transform. The goal is to automate the human activity recognition task by integrating seven energy-based features, calculated in the frequency domain, with a few machine learning algorithms. All seven features are based on exploiting the differences in energy profiles of different actions. Section 12.2 of the paper describes the background theory, the proposed methodology, and the analysis of energy profiles. Section 12.3 presents details of the dataset and the results obtained, and compares the results with an existing method. Section 12.4 concludes the paper, followed by the references.

12.2 Proposed Methodology

Our method is based on the following observation: along with inter-personal differences and motion performance differences, there is also a noticeable difference in the energy profile generated over an entire video for different action classes. For example, in the case of walking the energy variation would be mostly in the lower parts of the frame, while in the case of waving there would be more variation in the upper parts. We employ phase congruency (PC) and discrete wavelet transform (DWT) methods in the computation of energy profiles, as the transform domain makes analysis easier than the spatial domain. Figure 12.1 shows the block diagram of the proposed methodology.
The first block involves extracting frames from the training video. The difference of every alternate frame is taken and given as input to the next two blocks, namely DWT decomposition and phase congruency. The DWT decomposition block applies level-2 DWT on the frame difference, returning the approximation (c_A) and detail coefficients (c_H, c_V, c_D). The percentage contribution of each detail coefficient to the total energy is calculated; these become the first three features, denoted E_H, E_V, and E_D, respectively. The phase congruency block detects the edges in the frame difference image. The result of PC is divided into four parts, and the percentage contributions of these four parts to the total energy are calculated; these become the next four features, denoted E_UL, E_LL, E_UR, and E_LR, respectively. So, for the difference of every alternate frame there is a feature vector f = [E_H, E_V, E_D, E_UL, E_LL, E_UR, E_LR], and for every video there is a vector V that contains the sequence of feature vectors over all alternate frames. The training matrix is created by taking these seven features for the videos of each action, and a feature vector for the test video is calculated likewise. The training matrix and the test matrix are passed as inputs to the classifier. The classifier, be it SVM, Naive Bayes or J48, recognizes the action being performed in the test video.

Fig. 12.1 Block diagram of proposed methodology

12.2.1 Background Theory

12.2.1.1 Discrete Wavelet Transform

In the frequency domain, we can easily identify the high-frequency components of the original signal. This property of the frequency domain can be exploited to aid action recognition. When an action is performed, the high-frequency component in the part of the frame where the action occurs changes and, moreover, the change is different for different actions. Other transforms like the Fourier transform cannot be used here, because in the Fourier transform domain there is no relation to the spatial coordinates; DWT resolves this issue. In DWT, the frequency information is obtained at its original location in the spatial domain. As a direct consequence of this property, the frequency information obtained can be visualized as an image in the spatial domain. Figure 12.2a shows the result of DWT for the wave action, which gives approximation (low pass) and detail (high pass) information. The horizontal, vertical, and diagonal detail information is used in the energy calculation.
The mathematical formulation of the 2D wavelet transform is given by Eqs. (12.1) (scaling function), (12.2) (wavelet functions), and (12.3)–(12.4) (analysis equations).

\phi_{j,m,n}(x, y) = 2^{j/2}\, \phi(2^{j} x - m,\, 2^{j} y - n), \qquad (12.1)


\psi^{k}_{j,m,n}(x, y) = 2^{j/2}\, \psi^{k}(2^{j} x - m,\, 2^{j} y - n), \quad k = H, V, D. \qquad (12.2)

where j, m, n are integers, j is a scaling parameter, and m and n are shifting parameters along the x and y directions.
The analysis equations are:

W_{\phi}(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \phi_{j_0,m,n}(x, y), \qquad (12.3)

W^{k}_{\psi}(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \psi^{k}_{j_0,m,n}(x, y); \quad k = H, V, D. \qquad (12.4)

Here, M and N are the dimensions of the video frame, f(x, y) is the intensity value at coordinate (x, y), and j_0 is a scaling parameter.
As the scaling and wavelet functions are separable, the 2D DWT can be decomposed into two 1D DWTs, one along the x-axis and one along the y-axis. As a result, it returns four bands: LL (top-left), HL (top-right), LH (bottom-left), and HH (bottom-right). The HL band and LH band give the variation along the x-axis and y-axis, respectively. LL is an approximation of the original image that retains only the low-pass information, while the LH, HL, and HH bands are used to detect changes in the motion directions. The LL band gives the approximation coefficient c_A, and the other three bands give the detail coefficients c_H, c_V, and c_D, respectively.

12.2.1.2 Phase congruency

Phase congruency reflects signal behavior in the frequency domain. Phase congruency is a method for edge detection which is robust against variations in illumination and contrast. It finds places where the frequency components are in the same phase, since edge-like features have similar phase in the frequency domain; this property is used to detect edge-like features using PC. In action recognition from videos, the illumination may not always be even over the entire scene, and phase congruency performs significantly better than other edge detection methods in such cases. The mathematical formulation of PC for 2D signals like images is given in (Battiato et al. 2014).
Figure 12.2 shows the result of PC applied on the frame difference of two consecutive frames of the walk action.
It is easy to show, using some simple trigonometric manipulations (Koves 2016), that PC is proportional to the local energy in a signal, which is given by the following relation:

E(x) = PC(x) \int a_{\omega}\, d\omega \qquad (12.5)

This relation is exploited to compute energy-based features for HAR.



Fig. 12.2 2D wavelet transform and phase congruency: a result of the 2D wavelet transform for the wave action, b result of phase congruency for the walk action

12.2.2 Feature Extraction

Steps for extracting features using DWT and PC are given below:
1. The training video is divided into frames, which are stored in F. The difference of every alternate frame stored in F is taken; the result is called the frame difference image and is obtained by

   F_{dx} = |F_{p} - F_{p+2}|, \qquad (12.6)
where F_{dx} is the absolute difference between two alternate frames and p varies from 1 to 46.
2. Apply the analysis equations of the DWT (Sect. 12.2.1.1) to each frame difference image, which gives c_A, c_H, c_V, c_D:

   W_{\phi}(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} F_{dx}(x, y)\, \phi_{j_0,m,n}(x, y) \qquad (12.7)

   W^{k}_{\psi}(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} F_{dx}(x, y)\, \psi^{k}_{j_0,m,n}(x, y); \quad k = H, V, D. \qquad (12.8)

   Let TE_{DWT} be the total energy of the decomposition vector C, which consists of c_A, c_H, c_V, c_D. The energy contribution of each individual detail component is then calculated by

   E_{k} = \frac{\sum_{i=1}^{n} (c_{k}(i))^{2}}{TE_{DWT}}, \quad k = H, V, D \qquad (12.9)

   where E_H, E_V and E_D are the energy contributions of c_H, c_V, c_D, respectively. E_H, E_V and E_D become the first three extracted features.
3. Apply PC (Sect. 12.2.1.2) to each frame difference, which gives an image matrix containing the detected edges:

   F_{PC} = PC(F_{dx}(m, n)) \qquad (12.10)

   Let UL, LL, UR, LR denote the upper-left, lower-left, upper-right, and lower-right parts, respectively, obtained after dividing F_{PC} into four parts. Let TE_{PC} be the total energy of the matrix F_{PC}. The energy contribution of each individual part is then calculated by

   E_{G} = \frac{\sum_{i=1}^{n} (F_{G}(i))^{2}}{TE_{PC}}, \quad G = UL, LL, UR, LR \qquad (12.11)

   where E_{UL}, E_{LL}, E_{UR}, and E_{LR} are the energies of the UL, LL, UR, and LR parts, respectively. Thus, E_{UL}, E_{LL}, E_{UR}, and E_{LR} become the next four features. For every frame difference image, we get a feature vector containing the above seven features; the same is repeated for every video instance available.
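A possible implementation of steps 1–3 with PyWavelets is sketched below. The DWT energies follow Eqs. (12.7)–(12.9), here computed from the finest-level detail sub-bands of a level-2 decomposition; since no standard phase congruency routine ships with common Python imaging libraries, a user-supplied phase_congruency() function (e.g. a port of Kovesi's code) is assumed and only its output is used. The names, wavelet choice, and level interpretation are our assumptions.

import numpy as np
import pywt

def dwt_energies(frame_diff, wavelet="haar", level=2):
    """E_H, E_V, E_D: share of the detail-coefficient energy (Eq. 12.9)."""
    coeffs = pywt.wavedec2(frame_diff.astype(float), wavelet, level=level)
    arr, _ = pywt.coeffs_to_array(coeffs)      # whole decomposition as one array
    total = np.sum(arr ** 2) + 1e-12
    cH, cV, cD = coeffs[-1]                    # finest-level detail sub-bands
    return [np.sum(c ** 2) / total for c in (cH, cV, cD)]

def pc_energies(pc_map):
    """E_UL, E_LL, E_UR, E_LR: quadrant shares of the PC edge-map energy (Eq. 12.11)."""
    h, w = pc_map.shape
    ul, ur = pc_map[:h // 2, :w // 2], pc_map[:h // 2, w // 2:]
    ll, lr = pc_map[h // 2:, :w // 2], pc_map[h // 2:, w // 2:]
    total = np.sum(pc_map ** 2) + 1e-12
    return [np.sum(q ** 2) / total for q in (ul, ll, ur, lr)]

def frame_features(frame_a, frame_b, phase_congruency):
    """Seven-element feature vector f = [E_H, E_V, E_D, E_UL, E_LL, E_UR, E_LR]
    for one pair of alternate frames (Eq. 12.6); phase_congruency is user-supplied."""
    diff = np.abs(frame_a.astype(float) - frame_b.astype(float))
    return dwt_energies(diff) + pc_energies(phase_congruency(diff))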

12.2.3 Analysis of Energy Profiles

The proposed idea is based on the fact that the energy profile over an entire video is different for different action classes. We employ the above feature extraction algorithms to generate feature vectors for videos of different classes. We also plot the results obtained by the two methods, DWT and PC, individually to show that the energy profiles obtained match the underlying observation.

12.2.3.1 Energy Profiles Obtained from Discrete Wavelet Transform

The values of the horizontal, vertical, and diagonal energy obtained from the frame differences were calculated and plotted for analysis. These profiles for four bend action videos, together with their average profile, are shown in Fig. 12.3.
Figure 12.3 represents the energy profiles of the H, V, and D components for multiple videos of the bend action, along with their average. As expected, the vertical energy is the highest initially, because the video starts with a person standing upright. A dip is also observed in the diagonal energies as the person bends more and more. After about half the video, the horizontal energy increases, since the person's gait is now almost horizontal. After that, the diagonal energies increase again as the person begins to stand up. Similarly, the profiles were created for the other actions too, and they behave as expected.

Fig. 12.3 Pattern analysis of horizontal, vertical, and diagonal energy profiles obtained using DWT: a horizontal energy profile, b vertical energy profile, c diagonal energy profile, d average horizontal energy profile, e average vertical energy profile, f average diagonal energy profile

Fig. 12.4 Pattern analysis using PC: a energy profile for upper left part, b energy profile for upper right part, c energy profile for lower left part, d average energy profile for upper left part, e average energy profile for upper right part, f average energy profile for lower left part, g energy profile for lower right part, h average energy profile for lower right part

12.2.3.2 Energy Profiles Obtained from Phase Congruency

The result obtained after applying phase congruency is divided into four parts, namely the upper left, lower left, upper right, and lower right parts. The energy profiles of these four parts are plotted and analyzed in Fig. 12.4 for the wave action.
Figure 12.4 shows that the energy distribution over all four parts for the wave action is almost equal, which is consistent with the expected result for this action. The profiles obtained for the other actions also have definite, unique patterns, making these energy profiles suitable for use in classifiers to recognize actions.

12.3 Simulation Result and Assessment

We used the Weizmann dataset in our experiment. This dataset contains ten day-to-day actions, namely walk, run, bend, wave (one hand), wave2 (two hands), jack, jump, skip, gallop sideways, and jump in place (pjump), performed by ten actors. Figure 12.5 shows these actions being performed by different actors.

Fig. 12.5 Dataset

In our experiment, fourteen action classes were used: run left, run right, pjump, jump left, jump right, wave2, walk left, walk right, skip left, skip right, side left, side right, jack, and bend. A total of 85 videos were used for the experiment. The energy values obtained from the DWT and phase congruency methods were used to construct the training matrix for the classifier. Since we had 14 action classes with a total of 85 videos of 92 frames each and 7 features per frame difference, we get an 85 × 322 feature matrix. Each row of the feature matrix contains the extracted features for a particular video and is labeled with the type of action. The dataset has been divided randomly into training and test sets in a 60–40 proportion, and ten such divisions are employed in a cross-validation approach. We used three classifiers, SVM, Naive Bayes, and J48, to test our algorithm. To evaluate the proposed algorithm, four parameters, sensitivity, specificity, precision, and accuracy, are calculated from the confusion matrix. Sensitivity is the percentage of positive labeled instances that were predicted as positive, specificity is the percentage of negative labeled instances that were predicted as negative, and precision is the percentage of positive predictions that are correct. Accuracy tells what percentage of predictions is correct. The equations for each parameter are given in (Kacprzyk 2015).
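To reproduce the classification stage, the per-video feature vectors can be fed to any of the three classifiers; the snippet below uses scikit-learn's SVM with a random 60–40 split and derives the four evaluation measures from the resulting confusion matrix. The feature matrix, labels, and kernel choice are placeholders, not the authors' actual data or settings.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# X: 85 x 322 matrix of per-video features, y: one of the 14 action labels (placeholders).
X = np.random.rand(85, 322)
y = np.random.randint(0, 14, size=85)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))

# Macro-averaged sensitivity, specificity, precision and overall accuracy.
tp = np.diag(cm).astype(float)
fn = cm.sum(axis=1) - tp
fp = cm.sum(axis=0) - tp
tn = cm.sum() - (tp + fn + fp)
sensitivity = np.mean(tp / np.maximum(tp + fn, 1))
specificity = np.mean(tn / np.maximum(tn + fp, 1))
precision   = np.mean(tp / np.maximum(tp + fp, 1))
accuracy    = tp.sum() / cm.sum()
print(sensitivity, specificity, precision, accuracy)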
We have analyzed the results of DWT and PC separately and together for the above three classifiers, as shown in Tables 12.1, 12.2 and 12.3, respectively. They show that DWT and PC together give better sensitivity, specificity, precision, and accuracy for all three classifiers. Among the three classifiers, SVM gives the best result in terms of all four parameters.

Table 12.1 Classifier results for DWT

Classifier    Sensitivity  Specificity  Precision  Accuracy (%)
SVM           0.838        0.977        0.849      83.75
J48           0.813        0.969        0.826      81.25
Naive Bayes   0.75         0.976        0.796      75

Table 12.2 Classifier results for PC

Classifier    Sensitivity  Specificity  Precision  Accuracy (%)
SVM           0.725        0.987        0.617      72.5
J48           0.558        0.986        0.5866     58.75
Naive Bayes   0.725        0.975        0.725      72.5

Table 12.3 Classifier results for DWT + PC

Classifier    Sensitivity  Specificity  Precision  Accuracy (%)
SVM           0.888        0.992        0.888      88.75
J48           0.813        0.985        0.822      81.25
Naive Bayes   0.763        0.981        0.801      77.25

The results obtained from the proposed methodology are compared with the existing Action MACH filter (Rodriguez et al. 2008). Action MACH is a template-based method which finds the correlation of the test video with a synthesized filter for every action; if the correlation is greater than some pre-decided threshold, the corresponding action is inferred to be occurring. This requires the tedious job of marking the start and end frame of every action video in order to generate a filter, whereas our method does not require such a preprocessing task. Also, we evaluated our method on 14 action classes, whereas the MACH filter is reported for six actions only. The proposed methodology gives 88.75% accuracy compared to the 80.9% accuracy of the MACH filter.

12.4 Conclusion

We have presented an energy-based approach that incorporates the notions of phase congruency and the discrete wavelet transform in the calculation of energy-based features. The method rests on the idea that the energy profiles over entire videos of the same action class show good correlation, while they vary greatly across different action classes. Good accuracy is obtained using this methodology for fourteen action classes. For now, we have tried only simple actions, but the approach can be extended to complex actions. Other energy-based features can be combined with the proposed features to achieve a more robust result, and features which exploit other aspects, such as shape, can also be combined in order to distinguish actions that generate similar energy profiles.

References

A K, C S, I G (2008) A spatio-temporal descriptor based on 3d-gradients. In: British machine vision


conference
Battiato S, Coquillart S, Laramee RS, Kerren A, Braz J (2014) Computer vision, imaging and
computer graphics—theory and applications. In: International joint conference, VISIGRAPP
2013. Springer Publishing Company, Barcelona, Spain
D Z, G L (2004) Review of shape representation and description techniques. Pattern Recogn 1:1–19
D Z, G L (2003) A comparative study on shape retrieval using fourier descriptors with different shape signatures. J Vis Commun Image Represent 1:41–60
D Z, G L (2003) A comparative study on shape retrieval using fourier descriptors with different
shape signatures. In: IEEE conference on computer vision and pattern recognition (CVPR2005),
vol 1, pp 886–893
Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatiotemporal
features. In: In VS-PETS, pp 65–72
H K, T S, P M (1995) An experimental comparison of autoregressive and fourier-based descriptors
in 2d shape classification. IEEE Trans Pattern Anal Mach Intell 2:201–207
Kacprzyk J (2015) Advances in intelligent and soft computing. Springer, Berlin
Koves P (2016) Feature detection via phase congruency. http://homepages.inf.ed.ac.uk/rbf/CVonline/. [Online; accessed 11 Nov 2016]
Laptev I (2005) On space-time interest points. Int J Comput Vision 64(2–3):107–123. https://doi.
org/10.1007/s11263-005-1838-7
Laptev I, Marszałek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from
movies. CVPR (2008)
N D, B T, C S (2006) Human detection using oriented histograms of flow and appearance. In:
European conferences on computer vision (ECCV 2006), pp 428–441
R C, A R, G H, R V (2012) Histograms of oriented optical flow and binet-cauchy kernels on
nonlinear dynamical systems for the recognition of human actions. In: IEEE conferences on
computer vision and pattern recognition (CVPR 2009), vol 4, pp 1932–1939
RD L, L S (2000) Human silhouette recognition with fourier descriptors. In: 15th international
conferences on pattern recognition (ICPR 2000), vol 3, pp 709–712
Rodriguez M, Ahmed J, Shah M (2008) Action mach a spatio-temporal maximum average corre-
lation height filter for action recognition. In: IEEE conference on computer vision and pattern
recognition, (CVPR 2008), pp 1–8
R G, R W, S E (2009) Digital image processing using Matlab, 2nd edn
S S, A AH, B M, U S (2012) Chord length shape features for human activity recognition. In: ISRN
machine vision
Scovanner P, Ali S, Shah M (2007) A 3-dimensional sift descriptor and its application to action
recognition. In: proceedings of the 15th ACM international conference on multimedia, MM –07.
ACM, New York, NY, USA, pp 357–360. https://doi.org/10.1145/1291233.1291311
Wang H, Ullah MM, KlÃser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal
features for action recognition. University of Central Florida, USA
Weinland D, Ronfard R, Boyer E (2011) A survey of vision-based methods for action representation,
segmentation and recognition, vol 115
Willems G, Tuytelaars T, Van Gool L (2008) An efficient dense and scale-invariant spatiotemporal interest point detector. Technical Report
Z W, AC B, HR S, EP S (2004) Image quality assessment: from error visibility to structure similarity.
IEEE Trans Image Process 4:600–612
Chapter 13
Ripeness Evaluation of Tobacco Leaves
for Automatic Harvesting: An Approach
Based on Combination of Filters and
Color Models

P. B. Mallikarjuna, D. S. Guru, and C. Shadaksharaiah

Abstract In this paper, a novel filter-based model for classification of tobacco leaves
for the purpose of harvesting is proposed. The filter-based model relies on estimation
of degree of ripeness of a leaf using combination of filters and color models. Degree
of ripeness of a leaf is computed using density of maturity spots on a leaf surface and
yellowness of a leaf. A new maturity spot detection algorithm based on combination
of first order edge extractor (sobel edge detector or canny edge detector) and second
order high-pass filtering (Laplacian filter) is proposed to compute the density of
maturity spots on a unit area of a leaf. Further, a simple thresholding classifier is
designed for the purpose of classification. Superiorities of the proposed model in
terms of effectiveness and robustness are established empirically through extensive
experiments.

13.1 Introduction

The agriculture sector plays a vital role in the economy of any developing country like India. The source of employment, wealth, and security of a nation directly depends on the qualitative and quantitative production of agriculture, which is an outcome of a complex interaction of soil, seed, water, and agrochemicals. Enhancement of productivity needs the proper type, quantity, and timely application of soil, seed, water, and agrochemicals at specific sites. This demands precision agriculture practices such as soil mapping, disease mapping at both seedling and plant level, weed mapping,

P. B. Mallikarjuna (B)
JSS Academy of Technical Education, Bengaluru, Karnataka, India
e-mail: pbmalli2020@gmail.com
D. S. Guru
University of Mysore, Mysore, Karnataka, India
e-mail: dsg@compsci.uni-mysore.ac.in
C. Shadaksharaiah
Bapuji Institute of Engineering and Technology, Davangere, Karnataka, India
e-mail: shadaa@rediffmail.com


Fig. 13.1 Block diagram of a general precision agriculture system
selective harvesting, and quality analysis of agricultural products (grading of agricultural products). Continuous assessment of these precision agriculture practices
requires skilled labor. The availability of skilled labor is very limited in most agro-based developing countries, and it is certain that in a continuous process humans cannot fulfill the above requirements precisely and accurately. The variations occurring in crop or soil properties within a field (Jabro et al. 2010) should be mapped and timely action needs to be taken. However, humans are not consistent and precise in recording spatial variability and its mapping. Hence, assessment may vary from one expert to another. This may lead to wrong assessment and result in poor quality agricultural products. Therefore, this demands automation of the assessment of precision agriculture practices. Since there is a requirement of precise and accurate
assessment of precision agriculture practices, researchers have proposed intelligent
models based on computer vision (CV) techniques to automate these practices for
various commercial crops. The advantages of a computer vision-based approach are
that the accuracy is comparable to that of human experts and the reduction of manpower
and time (Patricio and Riederb 2018). Therefore, devising effective and efficient
computer vision models to practice precision agriculture system for real time is the
current requirement. The stages involved in a general precision agriculture system
are shown in Fig. 13.1.
Harvesting is an important stage in any crop production. Selective harvesting is
required for quality production. Selective harvesting is to collect only ripe crops
from the field (Manickavasagam et al. 2007). Therefore, before harvesting a crop,
farmers should look into factors such as whether crops are unripe, ripe, or over-ripe. Judgment of crop ripeness by humans will not always be accurate and precise due to human sensory limitations, variable lighting conditions, and loss of efficiency in evaluating crop ripeness over time. Therefore, there is a need to develop a model robust against ecological conditions (sunny, cloudy, and rainy) to evaluate the ripeness of a crop. Visual

properties of crop such as color, texture, and shape could be exploited to evaluate the
ripeness of crop for harvesting purpose using computer vision algorithmic models.
To show the importance of the computer vision techniques in precision agricultural
practices especially for selective harvesting stage, we have taken the tobacco crop
as a case study. After 60 days of plantation of the tobacco crop, we can find three types of
leaves. They are unripe, ripe, and over-ripe leaves. One should harvest ripe leaves
to get quality cured tobacco leaves in a curing process. As an indication of ripeness,
small areas called maturity spots appear non-uniformly on the top surface of a leaf.
As ripeness increases, yellowness of a leaf also increases. The maturity spots are
more in over-ripe leaves when compared to ripe leaves.
Though tobacco is a commercial crop, no attempt has been made on harvesting of tobacco leaves using CV techniques. However, a few attempts could be traced on ripeness evaluation of other commercial crops for automatic harvesting. A direct color mapping approach was developed to evaluate maturity levels of tomato and
date fruits (Lee et al. 2011). This color mapping method maps the RGB values of
colors of interest into 1D color space using polynomial equations. It uses a single
index value to represent each color in the specified range for the purpose of maturity
evaluation of tomato and date fruits. A robotic system for harvesting ripe tomatoes
in greenhouse (Yin et al. 2009) was designed based on the color feature of tomatoes,
and morphological operations are used to denoise and handle the situations of tomato
overlapping and shelter. Medjool date fruits were taken as a case study to demon-
strate the performance of a novel color quantization and color analysis technique for
fruit maturity evaluation and surface defect detection (Lee et al. 2008).
A novel and robust color space conversion and color index distribution analysis
technique for automated date maturity evaluation (Lee et al. 2008) were proposed.
Applications of mechanical fruit grading and automatic fruit grading (Gao et al. 2009) were discussed, along with a comparison of the performance of CV-based automatic fruit grading with mechanical fruit grading. A neural network system using a genetic
algorithm was implemented to evaluate the maturity levels of strawberry fruits (Xu
2008). In this work, the H frequency of the HSI color model was used to distinguish maturity levels of strawberry fruits under variable illumination conditions. An intelligent
algorithm based on neural network was developed to classify coffee cherries into
under-ripe, ripe, and over-ripe (Furfaro et al. 2007). A coffee ripeness monitoring
system was proposed (Johnson et al. 2004). In this work, reflectance spectrum was
recorded from four major components of coffee field viz., green leaf, under-ripe fruit,
ripe fruit, and over-ripe fruit. Based on reflectance spectrum, ripeness evaluation of
coffee field was performed. A Bayesian classifier was exploited for the purpose of
classification of intact tomatoes based on their ripening stages (Baltazar et al. 2008).
We made an initial attempt on ripeness evaluation of tobacco leaves for automatic
harvesting in our previous work (Guru and Mallikarjuna 2010), where we exploited
only the combination of sobel edge detector and laplacian filter with CIELAB color
model to estimate degree of ripeness of a leaf and conducted experiments on our
own small dataset of 244 sample images. In our current work, we exploited two
combinations (i) combination of laplacian filter and sobel edge detector and (ii)
combination of laplacian filter and canny edge detector with different color models

viz., RGB, HSV, MUNSELL, CIELAB, and CIELUV and conducted experiments
on our own large dataset of 1300 sample images. Indeed, the success of our previous
attempt motivated us to take up the current work, wherein the previous model has been
extended significantly.
Thus, overall contributions of this work are,
• Creation of a relatively large dataset of harvesting tobacco leaves due to non-
availability of a benchmarking dataset.
• Introduction of concept of fusing image filters of different orders for maturity spot
detection on tobacco leaves.
• Development of a model which combines density of maturity spots and color
information for estimating degree of ripeness of a leaf.
• Design of simple threshold-based classifier for classification of leaves.
• Conduction of experimentations on the large tobacco harvesting dataset created.

This paper is organized as follows. Section 13.2 presents the proposed filter-based model to evaluate the ripeness of a leaf for classification. Section 13.3 provides details
on tobacco harvesting dataset. It also presents the experimental results obtained due
to exhaustive evaluation of the proposed model. The paper concludes in Sect. 13.4.

13.2 Proposed Model

The proposed model has four stages: leaf segmentation, detection of maturity spots,
estimation of degree of ripeness, and classification.

13.2.1 Leaf Segmentation

The CIELAB (Viscarra et al. 2006) color model was used to segment the leaf area from its background, which includes soil, stones, and noise. According to the domain experts, the color of a tobacco leaf varies from green to yellow. Therefore, the chromaticity coordinate is used to segment the leaf from its background. For an illustration,
we have shown three different samples (Figs. 13.2, 13.3, and 13.4) of tobacco leaves
and also the results of the segmentation.
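A minimal sketch of this segmentation step is given below, assuming Python with OpenCV and NumPy; the a* threshold value and the morphological clean-up are illustrative assumptions fixed empirically, not the authors' exact implementation. It keeps pixels whose chromaticity lies in the green-to-yellow range and suppresses the soil and stone background.

import cv2
import numpy as np

def segment_leaf(bgr_image, a_threshold=135):
    # Convert to CIELAB; OpenCV scales L, a, b to the 0-255 range.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    _, a, _ = cv2.split(lab)
    # Green-to-yellow leaf pixels have lower a* values than the brownish
    # soil/stone background (threshold value is an assumption, fixed empirically).
    mask = (a < a_threshold).astype(np.uint8) * 255
    # Morphological opening/closing removes small noise and fills holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    segmented = cv2.bitwise_and(bgr_image, bgr_image, mask=mask)
    return segmented, mask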

13.2.2 Detection of Maturity Spots

The proposed maturity spots detection algorithm mainly consists of two stages. The
first stage involves application of a second order high-pass filtering and a first order
edge extraction algorithm separately on a leaf, the results of which are later subjected


Fig. 13.2 a A sample tobacco leaf with rare maturity spots, b segmented image


Fig. 13.3 a A sample tobacco leaf with moderate maturity spots, b segmented image


Fig. 13.4 a A sample tobacco leaf with rich maturity spots, b segmented image

Fig. 13.5 Block diagram of the proposed maturity spots detection algorithm.

for subtraction in second stage. The block diagram of the proposed maturity spots
detection algorithm is given in Fig. 13.5.
The maturity spots are highly visible in an R-channel gray scale image compared to the G-channel and B-channel gray scale images. Therefore, the RGB
image of a tobacco leaf is transformed into its R-channel gray scale image. A second
order high-pass filter is exploited to enhance mature spots (fine details) present on a
red channel gray scale image of a tobacco leaf. It highlights transitions in intensities
in an image. Any high-pass filter in frequency domain attenuates low-frequency com-
ponents without disturbing high-frequency information. Therefore, to extract finer
details of small maturity spots, we recommend to apply any second order deriva-
tive high-pass filter (in our case laplacian filter) which enhances much better than
any first order derivative high-pass filters (Sobel and Roberts). Then, we transform
the second order filtered image into a binary image using a suitable threshold. The
resultant binary image contains veins and leaf boundary in addition to maturity spots.
An image subtraction is used to eliminate the vein and leaf boundary edge pixels
of resultant binary image. Therefore, we recommend to subtract the edge image
containing only vein and leaf boundary edge pixels from the resultant binary image.

Since first order edge extraction operator is susceptible to noise, it is used to extract
edge image from red channel gray scale image of the segmented original RGB
color image. The obtained edge image is then subtracted from the binary image
obtained due to second order high-pass filtering. Image subtraction results in an
image containing only maturity spots. Number of connected components present in
that image decides the number of maturity spots on the leaf.
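A minimal sketch of this two-stage detection is given below; it assumes OpenCV, 8-bit images, and illustrative threshold values (the actual values are fixed in the training phase, as described in Sect. 13.3.2), and it is not the authors' exact implementation. Note that OpenCV's Canny thresholds are gradient magnitudes rather than 0-1 values, so the rescaling shown is also an assumption.

import cv2
import numpy as np

def count_maturity_spots(segmented_bgr, bin_thresh=20, tr1=0.3, tr2=0.4):
    red = segmented_bgr[:, :, 2]                           # R-channel gray scale image
    # Second order high-pass filtering (Laplacian) enhances the small spots.
    lap = cv2.convertScaleAbs(cv2.Laplacian(red, cv2.CV_64F, ksize=3))
    _, binary = cv2.threshold(lap, bin_thresh, 255, cv2.THRESH_BINARY)
    # First order edge extraction (Canny) picks up veins and the leaf boundary.
    edges = cv2.Canny(red, int(tr1 * 255), int(tr2 * 255))
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))   # widen edge pixels slightly
    # Subtraction removes vein/boundary pixels; the remainder are maturity spots.
    spots = cv2.subtract(binary, edges)
    n_labels, _ = cv2.connectedComponents(spots)
    return n_labels - 1                                    # discard the background label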
Let us consider a tobacco leaf (Fig. 13.6) for the purpose of illustration of the
proposed maturity spots detection algorithm. As discussed above, when we apply
the transformation (RGB to R-channel gray scale) on the original RGB image of
segmented tobacco leaf (Fig. 13.6a), the maturity spots are highly noticeable in the
R-channel gray scale image as shown in Fig. 13.6b. The second order high-pass filter
(Laplacian filter) is used to enhance the maturity spots. The laplacian filtered image (Fig. 13.6c) is converted into a binary image (Fig. 13.6d) using a suitable predefined threshold. As stated above, this resultant binary image contains maturity spots, vein, and boundary edge pixels. So, to remove vein and boundary pixels, we subtracted the edge image (Fig. 13.6e) obtained after first order edge extraction (canny edge detector) from the binary image (Fig. 13.6d). Finally, the image subtraction has resulted in an image (Fig. 13.6f) containing only maturity spots.

13.2.3 Estimation of Degree of Ripeness

The degree of ripeness is estimated to evaluate the ripeness of a leaf. It is based on the density estimation of maturity spots present on the leaf and also the yellowness of the leaf. The degree of ripeness (D) of a leaf is defined as given in Eq. 13.1. The weights W1 and W2 are assigned to the density estimation of spots (DES) and the mean value of yellowness of a leaf (MYL), respectively. The yellowness of a leaf increases as the ripeness of a leaf increases. Therefore, we have included the parameter MYL while estimating the ripeness of a leaf. To support this, we used different color models such as RGB, HSV, MUNSELL, CIELAB, and CIELUV.

13.2.4 Classification

During harvesting, we can find three types of leaves on a plant: unripe, ripe, and
over-ripe. Unripe leaves have low degree of ripeness. Ripe leaves have moderate
degree of ripeness. Over-ripe leaves have high degree of ripeness. Therefore, we
have used a simple thresholding classifier based on two predefined thresholds T1 and
T2 for classification of tobacco leaves into three classes: unripe, ripe, and over-ripe.
The threshold T1 is selected as the midpoint of distribution of degree of ripeness of
samples of unripe class and ripe class. The threshold T2 is selected as the midpoint
of distribution of degree of ripeness of samples of ripe class and over-ripe class.

Fig. 13.6 Maturity spots detection: a segmented RGB image of a tobacco leaf, b red channel gray scale image, c Laplacian filtered image, d image after thresholding and binarization, e image after leaf vein and boundary extraction using canny edge detector, f image consisting of only maturity spots after subtraction of the image (e)



Then, the class label for a given leaf is decided based on two predefined thresholds
T1 and T2 , as given in Eq. 13.3.

$D = [W_1 \times DES] + [W_2 \times MYL]$  (13.1)

$DES = \dfrac{\text{Number of maturity spots}}{\text{Leaf area}}$  (13.2)

$\text{Class label} = \begin{cases} C_1, & D < T_1 \\ C_2, & T_1 < D < T_2 \\ C_3, & D > T_2 \end{cases}$  (13.3)

where D is the degree of ripeness, and C1, C2, and C3 are the class labels of unripe, ripe, and over-ripe, respectively.
The thresholds T1 and T2 are supposed to be fixed empirically. We follow super-
vised learning to fix up the values for T1 and T2 .
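As an illustration only, Eqs. 13.1-13.3 can be written compactly as below (W1 = 0.7 and W2 = 0.3 are the values reported in Sect. 13.3.2; the computation of MYL from the chosen color model is assumed to exist elsewhere and is not shown).

def degree_of_ripeness(num_spots, leaf_area, myl, w1=0.7, w2=0.3):
    des = num_spots / float(leaf_area)        # Eq. 13.2: density of maturity spots
    return w1 * des + w2 * myl                # Eq. 13.1

def classify_leaf(d, t1, t2):
    """Simple thresholding classifier of Eq. 13.3."""
    if d < t1:
        return "unripe"                       # C1: low degree of ripeness
    elif d < t2:
        return "ripe"                         # C2: moderate degree of ripeness
    else:
        return "over-ripe"                    # C3: high degree of ripeness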

13.3 Experimentation and Result Analysis

13.3.1 Dataset

In this section, we present details on the creation of a tobacco harvesting dataset to build an automatic harvesting system. One should harvest ripe leaves to get quality leaves
after curing. Harvesting of unripe or over-ripe leaves leads to poor quality leaves after
curing. Image samples of tobacco leaves (Unripe, Ripe, and Over-ripe) are collected
randomly from tobacco crop field at CTRI, HUNSUR. Number of collected image
samples of individual classes of tobacco harvesting leaves is tabulated in Table 13.1.
Image samples of each class are shown in Fig. 13.7.

Table 13.1 Number of samples of individual classes of tobacco harvesting leaves


Tobacco leaves Number of samples Total samples
Unripe leaves 323
Ripe leaves 667 1300
Over-ripe leaves 310

Fig. 13.7 A sample of a unripe leaf, b ripe leaf, c over-ripe tobacco leaf

13.3.2 Results

The proposed model estimates the degree of ripeness of a leaf using the proposed
method of maturity spots detection and color models. The proposed maturity spots
detection algorithm is a combination of first order edge extraction and second order
filtering. We exploited the first order edge extractors such as sobel edge detector
and canny edge detector. We have used the laplacian filter for second order filtering.
Hence, we have two combinations (i) combination of laplacian filter and sobel edge
detector and (ii) combination of laplacian filter and canny edge detector. Henceforth,
in this paper, we refer to these combinations, respectively, as Method 1 and Method 2.
The sobel edge detector works on one threshold (Tr ). Therefore, in the Method
1, we have fixed the threshold (Tr ) value of the sobel detector in the training phase.

Table 13.2 Average classification accuracy using the Method 2 (combination of laplacian filter
and canny edge detector) for varying the thresholds Tr 1 and Tr 2 of canny edge detector
Tr 2 → Tr 1 ↓ 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
0 - 71.65 72.96 72.9 72.97 73.73 73.93 73.9 74.23 74.43 74.65
0.1 – – 73.21 73.17 73.4 73.08 73.45 73.54 74.02 74.21 75.76
0.2 – – – 74.66 74.96 75.3 76.31 77.31 78.3 78.84 79.49
0.3 – – – – 86.59 85.6 85.2 85.51 85.48 85.73 84.51
0.4 – – – – – 81.59 81.49 82.3 81.71 82.03 82.04
0.5 – – – – – – 81.97 81.77 82.31 82.07 82.28
0.6 – – – – – – – 81 81.32 79.13 79.64
0.7 – – – – – – – – 77.08 76.98 76.17
0.8 – – – – – – – – – 75.19 76.47
0.9 – – – – – – – – – – 75.62
1 – – – – – – – – – – –

That is, we have varied the threshold (Tr ) value from 0 to 1 with a step of 0.1.
Experimentally, it is found that the best average classification accuracy has been
achieved for Tr = 0.2.
On other hand, canny edge detector works on two thresholds (Tr 1 and Tr 2 ). There-
fore, in the Method 2, we have fixed the threshold values of Tr 1 and Tr 2 of the canny
edge detector in the training phase. Experimentally, it is found that the values of
Tr 1 and Tr 2 are 0.3 and 0.4, respectively. The thresholds Tr 1 and Tr 2 of the canny
edge detector are tuned up in such way that the leaf boundary edge pixels and leaf
vein edge pixels are extracted clearly. Therefore, there is a very less probability of
leaf vein edge pixels and leaf boundary edge pixels to be counted as maturity spots
while estimating maturity spots density. However, selecting suitable values of Tr 1
and Tr 2 is a challenging task. Pixels with values between Tr 1 and Tr 2 are weak edge
pixels that are 8-connected to the strong edge pixels (pixel values greater than Tr 2 )
which perform edge linking. Therefore, the values of Tr 1 and Tr 2 are set such that the
probability of leaf boundary and veins weak edge pixels to be missing is minimum.
By varying the thresholds Tr 1 and Tr 2 , it is found that the best average classification
accuracy has been achieved for Tr 1 = 0.3 and Tr 2 = 0.4 and is given in Table 13.2.
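An illustrative sketch of this tuning procedure is shown below; evaluate_accuracy is a hypothetical function (not part of the original work) that runs the full pipeline for a given pair of Canny thresholds and returns the average classification accuracy.

import numpy as np

def tune_canny_thresholds(evaluate_accuracy, step=0.1):
    best_tr1, best_tr2, best_acc = None, None, -1.0
    for tr1 in np.arange(0.0, 1.0 + 1e-9, step):
        for tr2 in np.arange(tr1 + step, 1.0 + 1e-9, step):   # only Tr2 > Tr1 is meaningful
            acc = evaluate_accuracy(tr1, tr2)
            if acc > best_acc:
                best_tr1, best_tr2, best_acc = tr1, tr2, acc
    return best_tr1, best_tr2, best_acc   # Table 13.2 reports the optimum at (0.3, 0.4)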
For estimation of degree of ripeness of a leaf, we vary the weights W1 and W2
(Eq. 13.1) such that the best average classification accuracy has been achieved (W1 =
0.7 and W2 = 0.3) for all sets of training and testing samples. This is shown in Fig. 13.8
using the Method 2 for 60% training.
For the purpose of fixing up T1 and T2, during classification, we considered 150 samples from each class and plotted the distribution of samples over the degree of ripeness (Fig. 13.9). Since there is a large overlap between the classes, as shown in Fig. 13.9, we recommend selecting the thresholds by studying the overlap of the unripe class and the ripe class to select the threshold T1, and the overlap of the ripe class and the over-ripe class to select the threshold T2.
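One plausible way to realize this midpoint selection is sketched below; using the class means as the centers of the distributions is an assumption made here for illustration, while the authors fix the exact values empirically through supervised learning.

import numpy as np

def select_thresholds(d_unripe, d_ripe, d_overripe):
    # T1: midpoint between the unripe and ripe distributions of degree of ripeness.
    t1 = (np.mean(d_unripe) + np.mean(d_ripe)) / 2.0
    # T2: midpoint between the ripe and over-ripe distributions.
    t2 = (np.mean(d_ripe) + np.mean(d_overripe)) / 2.0
    return t1, t2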

Fig. 13.8 Average classification accuracy obtained by the Method 2 (combination of laplacian filter
and canny edge detector) under varying weights W1 and W2

Fig. 13.9 Distribution of tobacco samples over degree of ripeness

During experimentation, we conducted four different sets of experiments for both Method 1 and Method 2. In the first set of experiments, we used 30% of the samples
of each class of the harvesting tobacco dataset to create class representative vectors
(training), and the remaining 70% of the samples are used for testing purpose. In
second, third, and fourth sets of experiments, the numbers of training and testing samples are in the ratios 40:60, 50:50, and 60:40, respectively. In each set of experiments,
experiments are repeated 20 times by choosing the training samples randomly. As
measures of goodness of the proposed model, we computed classification accuracy,

Table 13.3 Classification accuracy using the Method 1 (combination of laplacian filter and sobel edge detector) with different color models
Training Color model Minimum Maximum Average Std. deviation
Examples accuracy accuracy accuracy
30% RGB 51.3055 57.1953 53.6121 1.6049
HSV 50.2696 54.8161 51.8439 1.1337
MUNSELL 60.5778 69.4947 64.6835 2.3205
CIELAB 78.3534 82.6594 80.43 1.0805
CIELUV 80.3467 83.1168 82.018 1.0023
40% RGB 50.713 55.1152 52.7922 1.1783
HSV 50.2438 54.9609 52.0307 1.2323
MUNSELL 60.5041 68.7531 63.8465 2.417
CIELAB 78.2631 82.7723 80.8273 1.2004
CIELUV 80.561 84.129 82.4455 1.268
50% RGB 50.2906 54.2789 52.1122 0.9674
HSV 49.6817 53.4766 51.0107 1.0167
MUNSELL 59.7798 68.2022 63.685 1.9417
CIELAB 79.0244 83.5759 80.9809 1.1023
CIELUV 80.8077 85.4418 83.3289 1.2991
60% RGB 50.4844 54.6317 52.4391 1.1576
HSV 49.5419 52.3252 50.7931 0.8702
MUNSELL 57.9345 68.3146 62.2161 2.1327
CIELAB 78.3068 85.4018 81.9757 1.4671
CIELUV 80.332 85.2349 83.5698 1.3244

precision, recall, and F-measure. The minimum, maximum, average, and standard
deviation of classification accuracy of all the 20 trials using the proposed simple
thresholding classifier for both methods are given in Tables 13.3 and 13.4 respectively.
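A sketch of this evaluation protocol is given below; run_trial is a hypothetical function that learns the thresholds on the training indices and returns the test accuracy, and the plain random split shown here is a simplification (the authors sample the training percentage from each class).

import numpy as np

def repeated_evaluation(n_samples, train_fraction, run_trial, n_trials=20, seed=0):
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_trials):
        idx = rng.permutation(n_samples)
        n_train = int(train_fraction * n_samples)
        accuracies.append(run_trial(idx[:n_train], idx[n_train:]))
    acc = np.asarray(accuracies)
    # Minimum, maximum, average, and standard deviation over the trials.
    return acc.min(), acc.max(), acc.mean(), acc.std()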
Classification accuracy using the Method 1 with different color models viz., RGB,
HSV, MUNSELL, CIELAB, and CIELUV is given in Table 13.3. Similarly, classi-
fication accuracy using the Method 2 with different color models viz., RGB, HSV,
MUNSELL, CIELAB, and CIELUV is given in Table 13.4. The confusion matrix
across leaf types using the Method 1 for best average classification accuracy is given
in Table 13.5. Similarly, the confusion matrix across leaf types using the Method 2
for best average classification accuracy is given in Table 13.6. The corresponding
precision, recall, and F-measure for individual classes are presented for both the
Method 1 and the Method 2 in Fig. 13.10. From Tables 13.3 and 13.4, it is observed
that the best average classification accuracy has been achieved for the Method 2 with
CIELUV color model.

Table 13.4 Classification accuracy using the Method 2 (combination of laplacian filter and canny
edge detector) with different color models
Training Color model Minimum Maximum Average Std. deviation
Examples accuracy accuracy accuracy
30% RGB 66.6959 73.3427 70.5315 1.5919
HSV 54.9965 61.1442 58.7328 1.4712
MUNSELL 67.3114 74.1979 70.1425 1.6726
CIELAB 78.9152 82.9264 81.1872 0.9323
CIELUV 82.7643 88.7130 85.2472 1.8509
40% RGB 67.0316 73.2570 70.3530 1.7062
HSV 54.9682 62.1305 57.9434 1.8306
MUNSELL 67.0676 72.0768 69.7736 1.2273
CIELAB 78.9224 81.8946 80.2090 0.8748
CIELUV 81.0633 88.9688 86.3933 2.0416
50% RGB 65.8649 73.5737 70.3453 1.8429
HSV 52.9818 61.8088 58.6286 1.8589
MUNSELL 67.9882 72.8952 70.0468 1.1668
CIELAB 79.4480 83.5848 81.3107 1.1918
CIELUV 81.8075 89.7177 86.1646 2.3228
60% RGB 64.8270 73.8114 70.6765 1.9759
HSV 54.3457 62.1077 58.5386 2.1944
MUNSELL 66.3917 72.5029 70.8465 1.3572
CIELAB 78.2949 84.9594 81.8277 1.6287
CIELUV 80.7693 89.5043 86.5945 2.2025

Table 13.5 Confusion matrix across leaf types using the Method 1 (combination of laplacian filter
and sobel edge detector) for best average classification accuracy
Unripe Ripe Over-ripe
Unripe 106 23 00
Ripe 20 222 25
Over-ripe 00 18 106

Table 13.6 Confusion matrix across leaf types using the Method 2 (combination of laplacian filter
and canny edge detector) for best average classification accuracy
Unripe Ripe Over-ripe
Unripe 111 18 00
Ripe 10 228 29
Over-ripe 00 14 110

Fig. 13.10 Classwise evaluation of the Method 1 (combination of laplacian filter and sobel edge detector) and the Method 2 (combination of laplacian filter and canny edge detector): a Precision, b Recall, c F-measure

13.3.3 Discussion

When we applied our previous method, the combination of laplacian filter and sobel edge detector (Method 1) with CIELAB (Guru and Mallikarjuna 2010), on our large
dataset, we have achieved classification accuracy of about 81% (see Table 13.3). To
improve classification accuracy, our previous method (Guru and Mallikarjuna 2010)
has been extended with different color models viz., RGB, HSV, MUNSELL, and
CIELUV and achieved a good classification accuracy of about 83% with CIELUV
color model (see Table 13.3) on our large dataset. To improve classification accuracy
further, the current work has been extended for combination of laplacian filter and
canny edge detector (Method 2) with different color models viz., RGB, HSV, MUN-
SELL, CIELAB, and CIELUV. We have achieved an improvement in classification
accuracy by 3% using Method 2 with CIELUV color model (see Table 13.4) on our
large dataset.

13.4 Conclusion

In this work, we present a novel model based on strategies of filtering for classification
of tobacco leaves for the purpose of harvesting. A method of detection of maturity
spots is proposed. A method of finding degree of ripeness of a leaf is presented.
Further, we proposed a simple thresholding classifier for effective classification of
tobacco leaves. In order to investigate the effectiveness and robustness of the proposed
model, we conducted experiments for both the methods (i) combination of laplacian
filter and sobel edge detector and (ii) combination of laplacian filter and canny edge
detector on our own large dataset. Experimental results reveal that combination of
laplacian filter and canny edge detector is superior to the combination of laplacian
filter and sobel edge detector.

References

Baltazar A, Aranda JI, Aguilar GG (2008) Bayesian classification of ripening stages of tomato fruit using acoustic impact and colorimeter sensor data. Comput Electron Agric 60(2):113–121
Furfaro R, Ganapol BD, Johnson LF, Herwitz SR (2007) Neural network algorithm for coffee
ripeness evaluation using airborne images. Appl Eng Agric 23(3):379–387
Gao H, Cai J, Liu X (2009) Automatic grading of the post-harvest fruit: a review. In: Third IFIP inter-
national conference on computer and computing technologies in agriculture. Springer, Beijing,
pp 141–146
Guru DS, Mallikarjuna PB (2010) Spots and color based ripeness evaluation of tobacco leaves for
automatic harvesting. In: First international conference on intelligent interactive technologies and
multimedia. ACM, IIIT Allahabad, India, pp 198–202
Jabro JD, Stevens WB, Evans RG, Iversen WM (2010) Spatial variability and correlation of selected
soil in the AP horizon of a CRP grassland. Appl Eng Agric 26(3):419–428

Johnson LF, Herwitz SR, Lobitz BM, Dunagan SE (2004) Feasibility of monitoring coffee field
ripeness with airborne multispectral imagery. Appl Eng Agric 20(6):845–849
Lee DJ, Chang Y, Archibald JK, Greco CJ (2008a) Color quantization and image analysis for
automated fruit quality evaluation. In: IEEE international conference on automation science and
engineering. IEEE, Trieste, Italy, pp 194–199
Lee DJ, Chang Y, Archibald JK, Greco CJ (2008b) Robust color space conversion and color distri-
bution analysis techniques for date maturity evaluation. J Food Eng 88:364–372
Lee D, Archibald JK, Xiong G (2011) Rapid color grading for fruit quality evaluation using direct
color mapping. IEEE Trans Autom Sci Eng 8:292–302
Manickavasagam A, Gunasekaran JJ, Doraisamy P (2007) Trends in Indian flue cured virgina
tobacco (Nictoina tobaccum) processing: harvesting, curing and grading. Res J Agric Biol Sci
3(6):676–681
Patricio DI, Riederb R (2018) Computer vision and artificial intelligence in precision agriculture
for grain crops: A systematic review. Comput Electron Agric 153:69–81
Viscarra RA, Minasny B, Roudier P, McBratney AB (2006) Colour space models for soil science.
Geoderma 133:320–337
Xu L (2008) Strawberry maturity neural network detecting system based on genetic algorithm. In:
Second IFIP international conference on computer and computing technologies in agriculture,
Beijing, China, pp 1201–1208
Yin H, Chai Y, Yang SX, Mitta GS (2009) Ripe tomato extraction for a harvesting robotic system.
In: IEEE international conference on systems, man and cybernetics. IEEE, San Antonio, USA,
pp 2984–2989
Chapter 14
Automatic Deep Learning Framework
for Breast Cancer Detection and
Classification from H&E Stained Breast
Histopathology Images

Anmol Verma, Asish Panda, Amit Kumar Chanchal, Shyam Lal, and B. S. Raghavendra

Abstract About half a million breast cancer patients succumb to the disease, and
nearly 1.7 million new cases arise every year. These numbers are expected to rise significantly due to advances in social and medical engineering. Furthermore, histopathological images are a gold standard for identifying and classifying breast cancer compared with other medical imaging. Evidently, the decision of an optimal therapeutic schedule for breast cancer rests upon early detection. The primary motive behind a better breast cancer detection algorithm is to help doctors identify the molecular sub-types of breast cancer in order to control the metastasis of tumor cells early in disease prognosis and treatment planning. This paper proposes an automatic deep learning framework for breast cancer detection and classification from hematoxylin and eosin (H&E) stained breast histopathology images with 80.4% accuracy, supplementing the analysis of medical professionals to prevent false negatives. Experimental results show that the proposed architecture provides better classification results as compared to benchmark methods.

14.1 Introduction

Proper diagnosis of breast cancer is the demand of today's time because, in women, it has become a major cancer-related issue worldwide. Manual analysis of microscopic slides leads to differences of opinion among pathologists and is a time consuming process due to the complexity associated with such images. Breast cancer is a disease having distinctive histological attributes, with benign tumor sub-classes such as Adenosis, Fibroadenoma, Phyllode Tumor, and Tubular Adenoma, and malignant tumor sub-classes such as Ductal Carcinoma, Lobular Carcinoma, Mucinous Carcinoma, and Papillary Carcinoma. Classical classification algorithms have their own merits and demerits. Logistic regression-based classification is easy to implement, but its accuracy depends on the nature of the dataset: if the dataset is linearly separable, it will work well, but real world datasets are rarely linearly separable. A decision tree-based classification model

A. Verma (B) · A. Panda · A. Kumar Chanchal · S. Lal · B. S. Raghavendra


National Institute of Technology Karnataka, Surathkal, Mangalore 575025, Karnataka, India

is able to deal with datasets of a complex nature, but there are always chances of overfitting in this method. The overfitting problem can be reduced by a random forest algorithm, which is a more sophisticated version of the decision tree-based classification model. The working method of the support vector machine is based on a hyperplane which acts as a decision boundary, and appropriate selection of the kernel is the key to better performance in the support vector machine classification method. To improve the process of diagnosis, automatic detection and treatment is one of the leading research areas dealing with cancer-related issues. Over the last decade, the development of fast digital whole slide scanners (DWSS) that provide whole slide images (WSI) has led to a revival of interest in medical image processing, analysis, and their applications in digital pathology solutions. Segmentation of cells and nuclei proves to be an important first step towards automatic image analysis of digitized histopathology images. We therefore propose to develop an automated cell identification method that works with H&E stained breast cancer histopathology images. A deep learning framework is very effective for detecting and classifying breast cancer histopathology slides. A typical deep learning classification system consists of (a) a properly annotated dataset where each class and sub-class is verified by experienced pathologists, (b) a robust architecture that is able to differentiate the class and sub-class of the tissue under diagnosis, (c) a good optimization algorithm and a proper loss function that are able to train the model effectively, and (d) in the case of supervised learning, ground truth prepared under the supervision of experienced pathologists, on which the performance of the model depends.
The organization of this chapter is as follows: Sect. 14.2 discusses related research work. Section 14.3 presents the proposed model architecture. Section 14.4 presents experimental results and discussion. Section 14.5 presents the conclusion of the manuscript.

14.2 Literature Survey

Breast cancer detection had SVM as a benchmark model, as presented by Akay (2009); the benefits of SVM in the detection of cancer were clearly presented, but it lacked the classification of the type of breast cancer, which was one of our motivations for this research. The motivation was further strengthened by the findings presented by Karabatak and Ince (2009).
Veta and Diest presented automatic nuclei segmentation in H&E stained breast
cancer histopathology images. In this paper, authors explained the different nuances
for breast cancer detection that have been achieved by automated cell segmentation.
Method of cell segmentation is explained deeply in this paper which is based on
patched slide analysis for higher accuracy of cancer detection (Veta and Diest 2013).
The advantage of this paper particularly is the accuracy of detection it achieves
with cell segmentation method, as it is the best in class with over 90.4% accuracy

in positive detection. The disadvantage of this paper is that it fails to touch upon
the different ways of achieving such detection accuracy with multiple deep learning
algorithms.
Cruz-Roa et al. presented automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks (CNNs). In this paper, the authors explained the detection and visual analysis of IDC tissues in whole slide images (WSI). The framework explained in Cruz-Roa et al. (2014) extends
to a number of CNNs. The CNN is trained over a large number of image patches
represented by tissue regions from the WSI to learn a hierarchical part-based rep-
resentation and classification. The resulting accuracy is stated as being 71.80% and
84.23% for F-measure and balanced accuracy, respectively. The disadvantage of the
method published is from the inherent limitations in obtaining a very highly granular
annotation of the diseased area of interest by an expert pathologist.
In Spanhol et al. (2016) and Janowczyk and Madabhushi (2016), the work presented by the authors brings significance to the datasets being used and elucidates the deep learning techniques needed to produce results comparable, and in many cases superior, to those from benchmark hand-crafted feature-based classification algorithm designs.
Recently, advanced CNN model has achieved paramount success in classification
of natural image as well as in biomedical image processing. In Han et al. (2017), Han
et al. designed a novel convolutional neural network, which includes a convolutional
layer, small SE-ResNet module, and fully connected layer and was responsible for
impeccable detection cancer detection outcomes.
Most of the state-of-the-art algorithms in the literature are based on learned fea-
tures that extract high-level abstractions directly from the histopathological H& E
stained images utilizing deep learning techniques. In Redmon (2016), Janowczyk and Madabhushi (2016), and Han et al. (2017), the authors discussed the various algorithms applied for the nuclear pleomorphism scoring of breast cancer, the challenges to be dealt with, and the importance of benchmark datasets in multi-level classification archi-
tectures.
The multiple layer analysis of cancer detection and classification draws its roots
from papers (Feng et al. 2018; Guo et al. 2019; Jiang et al. 2019; Liao et al. 2018;
Liu et al. 2019) explaining feature extraction representing different types of breast
cancer and giving a prominent inclination to invasive ductal carcinoma (IDC).
M Z Alom, T Aspiras et al. presented advanced deep convolutional neural net-
work approaches for digital pathology image analysis (Alom et al. 2019). In this
paper, authors explained the process of detection of cancer through a CNN approach
specifically IRRCNN. The process of detection using neural networks makes us
understand the multiple layers that go into making the model. The advantage of this
paper particularly is the approach of detection, as it is the optimum way of utilizing
CNN for image recognition in this case cancer detection . The disadvantage of this
paper is that it pegs only the detection of cancer among the cell and does not allow
abstract classification of cancer types.

In Ragab et al. (2019), the authors explain the significance of SVM as a benchmark model for identification of breast cancer; although the analysis proves to be promising, it is done on mammogram images instead of H&E stained images.
A deep learning model by Li et al. (2019) classifies images into malignant and non-malignant and uses a classifier in such a way that it detects local patches.
The idea of multi-classifier development can be shared by Kassani et al. (2019)
as a problem tackled through ResNet and other prominent neural networks. The
disadvantage encountered in these papers is the lack of specificity of the disease.
The paper by Lichtblau and Stoean (2019) suggests the different models that need
to be studied to identify the most optimum approach for classification of different
cancer types. Due to the focus of this paper primarily on the classification of breast
cancer, our detection algorithm consists of data procured through transfer learning
of benchmark algorithms as presented in Shallu (2018), Ting et al. (2019), Vo et al.
(2019), Khan et al. (2019), Ni et al. (2019), Das et al. (2020).
For the BreaKHis dataset, Toğaçar et al. (2020) proposed a general framework for diagnosis of breast cancer. Their architecture consists of attention modules, a convolution block, a dense block, a residual block, and a hyper column block to capture spatial information precisely. Categorical cross entropy is the loss function, and Adam optimization is used to train the model.
In Sheikh et al. (2020), a densely connected CNN-based network for binary and multiclass classification is able to capture meaningful structure and texture by fusing multi-resolution features for the ICIAR2018 and BreaKHis datasets.
For classification of breast cancer into carcinoma and non-carcinoma Hameed
et al. (2020) utilized deep CNN-based pre-trained model of VGG-16 and VGG-19
that is helpful in better initialization and convergence. Their final architecture is an
ensemble of fine-tuned VGG-16 and fine-tuned VGG-19 models.
By utilizing a sliding window mechanism and class-wise clustering with image-wise feature pooling, Li et al. (2019) extract multi-layered features to train two parallel CNNs. Their final classification uses both larger patch features and smaller patch features.
For multiclass classification of breast cancer histopathology images (Xie et al.
2019) adopted transfer learning. The pre-trained model of Inception_ResNet_V2 and
Inception_V3 is utilized for the classification purpose. Their deep learning frame-
work used four different magnification factor for training and testing to ensure the
universality of the model.
Both CNN and SVM classifiers were used by Araújo et al. (2017) to achieve comparable results. The histology image is divided into patches, patch-based features are extracted using a CNN, and finally these features are fed to the SVM to classify the images.
Classification of breast carcinomas in whole slide breast histology images was performed by Bejnordi et al. (2017) by stacking high resolution patches on top of a network that accepts large size input to obtain fine-grained details as well as global tissue structures.
Spanhol et al. (2017) apply a CNN trained on natural images to the BreaKHis dataset to extract deep features, and they find these features are better than

hand-crafted features. These features are fed to different classifiers trained on the specific dataset. Their patch-based classification with four different magnification factors achieves very good prediction accuracy.
Zhu et al. (2019) work on the BreaKHis dataset by merging local and global information in a multiple CNN or hybrid CNN that is able to classify effectively. To remove redundant information, they incorporated an SEP block in the hybrid model. Combining the above two effects, their model obtained promising results.
For the BACH and BreaKHis datasets, Patil et al. (2019) used attention based multiple instance learning, where they aggregated features into bag level features. Their multiple instance-based learning is able to localize and classify images into benign, malignant, and invasive.

14.3 Proposed Model Architecture

The proposed architecture consists of two parts, namely detection and classification.
The detection networks take influence from IRRCNN (Alom et al. 2019), while the
classification network takes influence from WSI-Net (Ni et al. 2019).
The flow diagram of our proposed architecture is shown in Fig. 14.1. The archi-
tecture consists of two convolutional blocks and three residual networks. The H&E image of the breast tissue is pre-processed and sent into the first convolutional
network followed by a residual network which is then repeated once more. The pro-
cessed data is sent into the classification branch and malignancy detection branch.
The malignancy detection branch decides whether the fed data is malignant or non-
malignant. The classification branch further processes the data and classifies it on
whether it is invasive ductal carcinoma (IDC) positive or negative. The data from

Fig. 14.1 Proposed model architecture



both branches are combined and passed through the final residual network. We then
give the prediction through the confusion matrix segmentation map.
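A hedged Keras sketch of this two-branch layout is given below; the filter counts, the sigmoid output heads, and the merging of the branch features with dense layers (instead of a full residual block on the merged features) are illustrative assumptions, not the authors' exact configuration.

from tensorflow.keras import layers, models

def residual_block(x, filters):
    # Standard two-convolution residual block with a 1x1 projection shortcut.
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))

def build_two_branch_model(input_shape=(224, 224, 3)):
    inputs = layers.Input(shape=input_shape)
    # Two convolutional stages, each followed by a residual block.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = residual_block(x, 32)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = residual_block(x, 64)
    # Malignancy detection branch: malignant vs. non-malignant.
    det_feat = layers.GlobalAveragePooling2D()(x)
    det_out = layers.Dense(1, activation="sigmoid", name="malignancy")(det_feat)
    # Classification branch: IDC positive vs. negative.
    cls = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    cls_feat = layers.GlobalAveragePooling2D()(cls)
    cls_out = layers.Dense(1, activation="sigmoid", name="idc")(cls_feat)
    # Combine the two branches before the final prediction.
    merged = layers.Concatenate()([det_feat, cls_feat])
    merged = layers.Dense(64, activation="relu")(merged)
    final_out = layers.Dense(1, activation="sigmoid", name="final")(merged)
    return models.Model(inputs, [det_out, cls_out, final_out])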

14.3.1 Loss Function

The optimization algorithm utilized in the proposed architecture is Adam (Kingma and Ba 2014). First and foremost, Adam means adaptive moment estimation. In Adam, the exponential moving average (EMA) of the first moment of the gradient, scaled by the square root of the EMA of the second moment, is subtracted from the parameter vector, as presented in Eq. 14.1:
 
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$  (14.1)

where θ is the parameter vector, m̂_t and v̂_t are the bias-corrected exponential moving averages of the first and second moments of the gradient, η is the learning rate, and ε is a very small hyper-parameter that prevents the algorithm from dividing by zero.
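A minimal NumPy sketch of one Adam update step is given below, written only to make Eq. 14.1 concrete; the β values follow the standard formulation, and the learning rate of 0.0001 matches Sect. 14.4.1.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # EMA of the first moment
    v = beta2 * v + (1 - beta2) * grad ** 2       # EMA of the second moment
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Eq. 14.1
    return theta, m, v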

14.3.2 Training Setup

A Jupyter notebook in Google Colaboratory was used to implement all models used and proposed in this paper through a virtual machine on the cloud, as well as a PC with an Intel(R) Core(TM) i7-8750H CPU @ 3.98 GHz, 16 GB RAM, and an NVIDIA GTX 1070 Max-Q as its core specifications.

14.3.3 Segmentation Overview

The segmentation is based on zero-phase component analysis (ZCA) whitening. We use this algorithm to identify certain key features from the breast dataset, which are then compared to our pre-existing training dataset as a part of unsupervised learning. The mathematical expression of ZCA is presented in Eq. 14.2:
 
$W_{ZCA} = E\,D^{-1/2}E^{T} = C^{-1/2}$  (14.2)
14 Automatic Deep Learning Framework for Breast Cancer . . . 221

where W_ZCA is the transformation matrix, E is the matrix whose columns are the eigenvectors of the covariance matrix C = X^T X/n, D is the diagonal matrix of the corresponding eigenvalues, and X is the data stored as an n × d matrix where n is the number of data points and d is the number of features.
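A minimal NumPy sketch of ZCA whitening for an n × d data matrix is given below; the small eps term added to the eigenvalues is a common numerical safeguard and an assumption here, not part of Eq. 14.2.

import numpy as np

def zca_whiten(X, eps=1e-5):
    X = X - X.mean(axis=0)                          # centre the data
    C = X.T @ X / X.shape[0]                        # covariance matrix C = X^T X / n
    eigvals, E = np.linalg.eigh(C)                  # eigenvectors in the columns of E
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals + eps))
    W_zca = E @ D_inv_sqrt @ E.T                    # W_ZCA = E D^(-1/2) E^T
    return X @ W_zca, W_zca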

14.4 Experimental Results and Discussion

14.4.1 Dataset and Pre-processing

This research work uses the BHC dataset for detection and classification of hematoxylin and eosin stained breast cancer histopathology images. The Keras data generator is used to load the data from the respective folders into Keras automatically. Keras provides convenient Python library functions for this purpose.
The learning rate used by the proposed model is adjusted to 0.0001. On top of it, a global average pooling layer followed by 50% dropout to reduce over-fitting bias is used. Adam is used as the optimizer and binary cross-entropy as the loss function. A sequential model along with a confusion matrix is used for implementation of the classification branch of the proposed algorithm. It adds a convolutional layer with 32 filters and kernel size 3 × 3. Four units are pooled together along both axes. Then, different operations like dense layers, flattening, and redundancy reduction are applied.
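A hedged Keras sketch of this classification-branch setup is shown below; the 50 × 50 patch size and the exact layer ordering are assumptions made for illustration, while the learning rate, optimizer, loss, filter count, kernel size, pooling size, dropout rate, and global average pooling follow the description above.

from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(50, 50, 3)),
    layers.MaxPooling2D(pool_size=(4, 4)),          # four units pooled along both axes
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),                            # 50% dropout to reduce over-fitting
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])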

14.4.2 Results Evaluation and Discussion

The results evaluation and discussion of the proposed model for the cancer detection and classification method are presented in this section. For validation, the results of the proposed architecture are compared with existing methodologies and compositions from the referenced literature, such as IRRCNN, DCNN, SVM, VGG-16, decision tree, etc.
Table 14.1 consists of the different models that were tested for the cancer classi-
fication branch of the proposed architecture, and the following results were observed.
The decision was made to implement DenseNet201 for the malignancy detection
branch of the proposed architecture by weighing in the size of the model and its
top-5 accuracy which weighed in best for the DenseNet201 model. The accuracy
and loss plots of malignancy detection branch of proposed architecture are shown in
Figs. 14.2 and 14.3 respectively. The receiver operating characteristics (ROC) plot
of proposed architecture is shown in Fig. 14.4.
Invasive ductal carcinoma (IDC) is the most common form of breast cancer.
Through the medium of this project, we are implementing a two-class classification for the preferred algorithm in order to broaden our automated analysis, i.e., IDC versus DCIS (IDC−). This particular method involves a confusion matrix in order to

Table 14.1 Model performance comparison


Algorithm Parameter Recall F1 Precision Accuracy

IDC(-) 0.65 0.7 0.74


KNN IDC(+) 0.78 0.74 0.7 71.71%
Avg. 0.72 0.72 0.72
IDC(+) 0.82 0.76 0.71
GaussianNB Avg. 0.74 0.735 0.745 73.78%
IDC(-) 0.97 0.70 0.55
IDC(+) 0.22 0.36 0.87
KerasANN Avg. 0.46 0.455 0.745 59.09%
IDC(-) 0.63 0.72 0.82
IDC(+) 0.86 0.78 0.71
DecTree Avg. 0.745 0.75 0.765 72.17%
IDC(-) 0.71 0.76 0.85
IDC(+) 0.88 0.80 0.74
SVM Avg. 0.795 0.78 0.795 77.65%
IDC(-) 0.52 0.36 0.53
IDC(+) 0.37 0.14 0.47
WSI-Net Avg. 0.445 0.25 0.50 63.21%
IDC(-) 0.71 0.76 0.85
IDC(+) 0.88 0.80 0.74
Proposed 80.43%
Avg. 0.795 0.78 0.795

Fig. 14.2 Accuracy plot for the detection branch



Fig. 14.3 Loss plot for the detection branch

Fig. 14.4 ROC curve

understand the implications of error in the prediction of the type of cancer. Predicted
IDC(+) and IDC(−) of the proposed model are shown in Fig. 14.5, and the confusion matrix of the proposed model is presented in Fig. 14.6. The comparison of predicted results of the proposed model versus actual is shown in Fig. 14.7.
The machine learning algorithms in Table 14.1 were brought in contrast with the
proposed algorithm. The idea is to design an optimal algorithm in which bias toward either class is limited while achieving similar efficacy to the support vector machine (SVM). The
proposed algorithm provides the best approach in terms of this and can be used as
an alternative to the existing SVM method for classification of cancer.

Fig. 14.5 Predicted IDC(+) and IDC(−)

Fig. 14.6 Proposed model confusion matrix



Fig. 14.7 Predicted versus actual

CNNs such as WSI-Net were brought in contrast with the proposed algorithm on the weighted parameters, and results were drawn as listed in the conclusion.

14.5 Conclusion

The proposed model was broadly the combination of cancer detection and classifica-
tion into IDC and non-IDC. Detecting breast cancer is based on IRRCNN algorithm
with significant improvements in the number of epochs and layers of convolution
network in order to get near the desired results. Then, it is coalesced with the classi-
fication algorithm which gives us a significant improvement on WSI-Net and other
machine learning classifiers for classification. The accuracy that was observed for
detection of breast cancer stands at 95.25% and that for classification of IDC versus
DCIS stands at 80.43%, which was better than WSI-Net.

Acknowledgements This research work was supported in part by the Science Engineering
and Research Board, Department of Science and Technology, Govt. of India under Grant No.
EEG/2018/000323, 2019.

References

Akay MF (2009) Support vector machines combined with feature selection for breast cancer diag-
nosis. Exp Syst Appl 36:3240–3247. https://doi.org/10.1016/j.eswa.2008.01.009
Alom M, Aspiras T, Taha MT, Asari K, Bowen V, Billiter D, Arkell S (2019) Advanced Deep
convolutional neural network approaches for digital pathology image analysis: a comprehensive
evaluation with different use cases
Araújo T, Aresta G, Castro E, Rouco J, Aguiar P, Eloy C, Polónia A, Campilho A (2017) Classifi-
cation of breast cancer histology images using convolutional neural networks. PloS One 12(6).
https://doi.org/10.1371/journal.pone.0177544
Bejnordi BE, Zuidhof G, Balkenhol M, Hermsen M, Bult P, van Ginneken B, Karssemeijer N, Litjens
G, van der Laak J (2017) Context-aware stacked convolutional neural networks for classification
of breast carcinomas in whole-slide histopathology images. J Med Imag (Bellingham, Wash)
4(4):044504. https://doi.org/10.1117/1.JMI.4.4.044504
Cruz-Roa A, et al (2014) In: Gurcan MN, Madabhushi A (eds) Automatic detection of invasive
ductal carcinoma in whole slide images with convolutional neural networks, p 904103. https://
doi.org/10.1117/12.2043872
Das A, Nair MS, Peter D (2020) Computer-aided histopathological image analysis techniques for
automated nuclear atypia scoring of breast cancer
Feng Y, Zhang L, Mo J (2018) Deep manifold preserving autoencoder for classifying breast cancer
histopathological images. IEEE/ACM Trans Comput Biol Bioinform 1. https://doi.org/10.1109/
TCBB.2018.2858763
Guo Y, Shang X, Li Z (2019) Identification of cancer subtypes by integrating multiple types of
transcriptomics data with deep learning in breast cancer. Neurocomputing 324:20–30. https://
doi.org/10.1016/j.neucom.2018.03.072
Hameed Z, Zahia S, Garcia-Zapirain B, Javier Aguirre J, María Vanegas A (2020) Breast cancer
histopathology image classification using an ensemble of deep learning models. Sensors 20:4373
Han Z, Wei B, Zheng Y, Yin Y, Li K, Li S (2017) Breast cancer multi-classification from histopatho-
logical images with structured deep learning model. Sci. Rep. 7:1–10. https://doi.org/10.1038/
s41598-017-04075-z
Janowczyk A, Madabhushi A (2016) Deep learning for digital pathology image analysis: a com-
prehensive tutorial with selected use cases. J Pathol Inform 7:29 (2016). PubMed https://doi.org/
10.4103/2153-3539.186902
Jiang Y et al (2019) Breast cancer histopathological image classification using convolutional neural
networks with small SE-ResNet module. PLOS ONE 14(3): e0214587. PLoS J. https://doi.org/
10.1371/journal.pone.0214587
Karabatak M, Ince MC (2009) An expert system for detection of breast cancer based on association
rules and neural network. Expe Syst Appl 36:346–3469. https://doi.org/10.1016/j.eswa.2008.02.
064
Kassani SH, Kassani PH, Wesolowski M (2019) Classification of histopathological biopsy images
using ensemble of deep learning networks. SIGGRAPH 4(32). https://doi.org/10.1145/3306307.
3328180
Khan S, Islam N, Jan Z, Din IU, Rodrigues JJPC (2019) A novel deep learning based framework
for the detection and classification of breast cancer using transfer learning. Pattern Recognit Lett
125:1–6. https://doi.org/10.1016/j.patrec.2019.03.022
Kingma D, Ba J (2014). Adam: a method for stochastic optimization. In: International conference
on learning representations
Li Y, Wu J, Wu Q (2019) Classification of breast cancer histology images using multi-size and
discriminative patches based on deep learning. IEEE Access 7:21400–21408. https://doi.org/10.
1109/ACCESS.2019.2898044
Li S, Margolies LR, Rothstein JH, Eugene F, Russell MB, Weiva S (2019) Deep learning to improve
breast cancer detection on screening mammography. Sci Rep 9:12495. https://doi.org/10.1038/
s41598-019-48995-4

Liao Q, Ding Y, Jiang ZL, Wang X, Zhang C, Zhang Q (2018) Multi-task deep convolutional neural
network for cancer diagnosis. Neurocomputing. https://doi.org/10.1016/j.neucom.2018.06.084
Lichtblau D, Stoean C (2019) Cancer diagnosis through a tandem of classifiers for digitized
histopathological slides. PLoS One 14:1–20. https://doi.org/10.1371/journal.pone.0209274
Liu N, Qi E-S, Xu M, Gao B, Liu G-Q (2019) A novel intelligent classification model for breast
cancer diagnosis. Inf Process Manag 56:609–623. https://doi.org/10.1016/j.ipm.2018.10.014
Mehra SR (2018) Breast cancer histology images classification: training from scratch or transfer
learning? ICT Exp 4:247–254. https://doi.org/10.1016/j.icte.2018.10.007
Ni H, Liu H, Wang K, Wang X, Zhou X, Qian Y (2019) WSI-Net: branch-based and hierarchy-aware
network for segmentation and classification of breast histopathological whole-slide images. In:
International Workshop on Machine Learning in Medical Imaging, pp 36-44
Patil A, Tamboli D, Meena S, Anand D, Sethi A (2019) Breast cancer histopathology image clas-
sification and localization using multiple instance learning. In: 2019 IEEE international WIE
conference on electrical and computer engineering (WIECON-ECE), Bangalore, India, pp 1–4.
https://doi.org/10.1109/WIECON-ECE48653.2019.9019916
Ragab DA, Sharkas M, Marshall S, Ren J (2019) Breast cancer detection using deep convolutional
neural networks and support vector machines. Peer J 7:e6201
Redmon J (2016) You only look once: unified, real-time object detection (2016) Retrieved from
http://pjreddie.com/yolo/
Spanhol FA, Oliveira LS, Cavalin PR, Petitjean C, Heutte L (2017) Deep features for breast cancer
histopathological image classification. In: 2017 IEEE international conference on systems, man,
and cybernetics (SMC), Banff, AB, pp 1868-1873 https://doi.org/10.1109/SMC.2017.8122889
Sheikh TS, Lee Y, Cho M (2020) Histopathological classification of breast cancer images using
a multi-scale input and multi-feature network. Cancers 12(8):2031. https://doi.org/10.3390/
cancers12082031
Spanhol F, Oliveira LS, Petitjean C, Heutte L (2016) A dataset for breast cancer histopathological
image classification. IEEE Trans Biomed Eng (TBME) 63(7):1455–1462
Ting F, Tan YJ, Sim KS (2019) Convolutional neural network improvement for breast cancer
classification. Exp Syst Appl 120:103–115. https://doi.org/10.1016/j.eswa.2018.11.008
Toğaçar M, Ergen B, Cömert Z (2020) Application of breast cancer diagnosis based on a combination
of convolutional neural networks, ridge regression and linear discriminant analysis using invasive
breast cancer images processed with autoencoders. Med Hypotheses
Vo DM, Nguyen N-Q, Lee S-W (2019) Classification of breast cancer histology images using
incremental boosting convolution networks. Inf Sci (Ny) 482:123–138. https://doi.org/10.1016/
j.ins.2018.12.089
Veta MJ, Diest PJ (2013) Automatic nuclei segmentation in HE stained. Breast cancer histopathol
images. PLOS One 8(7)
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2018) The marginal value of adaptive gradient
methods in machine learning, 2017. arXiv:1705.08292v2 [stat.ML] (22 May 2018)
Xie J, Liu R, Luttrell J, Zhang C (2019) Deep learning based analysis of histopathological images
of breast cancer. Front Genet 10. https://doi.org/10.3389/fgene.2019.00080
Zhu C, Song F, Wang Y et al (2019) Breast cancer histopathology image classification through
assembling multiple compact CNNs. BMC Med Inform Decis Mak 19:198. https://doi.org/10.
1186/s12911-019-0913-x
Chapter 15
An Analysis of Use of Image Processing and Neural Networks for Window Crossing in an Autonomous Drone

L. Pedro de Brito, Wander M. Martins, Alexandre C. B. Ramos, and Tales C. Pimenta

Abstract Applications with autonomous robots are becoming more popular (Kyrkou et al. 2019), and neural networks and image processing are increasingly linked to control and decision making (Jarrell et al. 2012; Prescott et al. 2013). This study seeks a technique that makes drones or robots fly more autonomously indoors. The work investigates the implementation of an autonomous control system for drones, capable of crossing windows during flights through closed places, based on image processing (de Brito et al. 2019; de Jesus et al. 2019; Martins et al. 2018; Pinto et al. 2019) and a convolutional neural network. An object detection strategy was adopted: from the location of the object in the captured image, it is possible to compute a programmable route for the drone. In this study, the location of the object was established by bounding boxes, which define the quadrilateral around the found object. The system is based on an open-source autopilot, Pixhawk, which has a control and simulation environment capable of doing the job. Two detection techniques were studied. The first one is based on image processing filters, which capture polygons that represent a passage inside a window. The other approach was studied for a more realistic environment and implemented with convolutional neural networks for object detection; with this type of network, it is possible to detect a large number of windows.

L. P. de Brito (B) · A. C. B. Ramos


Federal University of Itajuba, Institute of Mathematics and Computing, IMC. Av. BPS, 1303
Bairro Pinheirinho, MG, Caixa Postal 50 CEP: 37500-903, Itajubá, Brazil
e-mail: ramos@unifei.edu.br
W. M. Martins · T. C. Pimenta
Institute of Systems Engineering and Information Technology, IESTI. Av. BPS, 1303, Bairro
Pinheirinho, MG, Caixa Postal 50 CEP: 37500-903, Itajubá, Brazil
e-mail: wandermendes@unifei.edu.br
T. C. Pimenta
e-mail: tales@unifei.edu.br


15.1 Introduction

The main objective of this study was to investigate the creation of a system capable of identifying a passage, such as a door or window, using a monocular camera and guiding a small drone through it. The system involves the construction of a small aircraft capable of capturing images that are processed by an external machine which, in turn, responds with the specific movement the aircraft must follow. Figure 15.1 shows an outline of the system structure, where arrows indicate the flow of information and the processes carried out.
The detection algorithm classifies pixels and locates the object of interest within an image. From this position, it is possible to perform an analysis with reference to the position of the camera and thus calculate a route for the aircraft to follow. Two detection approaches were studied in this work, the first based only on simple image processing techniques and the other based on convolutional neural networks, more specifically the SSD, single shot multibox detector (Falanga et al. 2018; Ilie and Gheorghe 2016; Liu et al. 2016).

15.2 Materials and Methods

The work combined hardware and software to enable control of the aircraft. The hardware was chosen because of its low cost compared to commercial models for sale, and because PixHawk has a large open-source development kit.

15.2.1 The Aircraft Hardware

A small Q250 racing quadcopter (25 cm) was built with enough hardware to complete the project, including a PixHawk 4 flight controller board (Meier et al. 2011; Pixhawk 2019), 2300 kV motors, and 12 A electronic speed controllers (ESCs). This controller board has a main flight management unit (FMU) processor, an input/output (I/O) processor, and accelerometer, magnetometer, and barometer sensors. Figure 15.2 shows the drone used in this research.

Fig. 15.1 The implemented system (The author)

Fig. 15.2 Drone Q250 with PixHawk (The author)
In the implemented system, the data processing is not embedded, so wireless links were necessary between the ground base machine and the drone: one for transmitting commands and another for transmitting the images captured by the aircraft. Commands are sent through a 915 MHz telemetry link connecting the flight controller to the ground station that processes the data. Images captured by the first person view (FPV) high definition (HD) camera are transmitted by a 5.8 GHz transmitter/receiver pair.

15.2.2 The Ground Control Station

A ground control station (GCS) is software executed on a ground platform that monitors and configures the drone, for example sensor calibration settings and the configuration of general-purpose boards. It supports different types of vehicle models, like PixHawk, which needs its firmware configured before use.
In this work, the QGroundControl ground station software was used. This software allows checking the status of the drone and programming missions in a simple way with the global positioning system (GPS) and a map. It is suitable for the PixHawk

Fig. 15.3 QGroundControl (GCS)

configuration. Figure 15.3 shows the interface of the GCS used (Damilano et al. 2013; Planner 2019a, b; QGROUNDCONTROL 2019; Ramirez-Atencia and Camacho 2018).

15.2.3 MAVSDK Drone Control Framework

The MAVSDK drone control framework was used in this implementation. It is a library that can perform simple, stable motion functions such as takeoff and landing and control the speed of the airframe on its axes. It uses a coordinate system to command the aircraft, communicating with vehicles that support MAVLink, a communication protocol for drones and their internal components.
MAVSDK is a software development kit (SDK) made for PixHawk on various types of vehicles. The framework was originally implemented in C++, but this work used its Python binding. The written code runs on a ground machine and sends commands through the MAVLink protocol (DRONEKIT 2019; French and Ranganathan 2017; MAVROS 2019; MAVSDK 2019).
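As a rough illustration of how such velocity commands can be issued, the minimal sketch below uses MAVSDK-Python in offboard mode; the connection address, speeds, and timings are illustrative assumptions and not the values tuned in this project.

```python
import asyncio
from mavsdk import System
from mavsdk.offboard import VelocityBodyYawspeed

async def run():
    # Connect to the vehicle (a simulated PX4 usually listens on udp://:14540).
    drone = System()
    await drone.connect(system_address="udp://:14540")

    await drone.action.arm()
    await drone.action.takeoff()
    await asyncio.sleep(5)

    # Offboard mode requires an initial setpoint before it can be started.
    await drone.offboard.set_velocity_body(VelocityBodyYawspeed(0.0, 0.0, 0.0, 0.0))
    await drone.offboard.start()

    # Move forward at 0.5 m/s for a few seconds, then stop and land.
    await drone.offboard.set_velocity_body(VelocityBodyYawspeed(0.5, 0.0, 0.0, 0.0))
    await asyncio.sleep(4)
    await drone.offboard.stop()
    await drone.action.land()

if __name__ == "__main__":
    asyncio.run(run())
```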

15.2.4 Simulation

This work used the Gazebo simulator with its PX4 Simulator implementation, which brings various vehicle models with PixHawk-specific hardware and firmware simulation.

15.2.4.1 Gazebo

Gazebo provides realistic simulation with complex scenarios and robust environment physics, including several sensors, to sketch a true real-world implementation. Gazebo also enables the simulation of multiple robots. This makes it possible to test and train AI code and image processing with remarkable ease and agility. Gazebo can create a scenario with various elements, such as houses, hospitals, cars, and people. With this scenario, it is possible to evaluate the quality of the code and tune its parameters before a test in the real environment (de Waard et al. 2013; GAZEBOSIM 2019; Koenig and Howard 2004).

15.2.4.2 PX4 Simulator

The PX4 simulator is a simulated model of PixHawk within the Gazebo environment, reproducing all the main features of the autopilot for several aircraft models, land vehicles, and others. This implementation makes it possible to create and test code within the simulated environment and transfer it faithfully to the real environment. An important aspect of this simulation environment is the compatibility between PixHawk, Gazebo, the PX4 Simulator, and also MAVSDK.

15.2.4.3 The Simulated Drone

The Iris model is the PX4 simulated drone with the greatest fidelity to the real Q250 model implemented (already presented previously); both are based on PixHawk, which means that the autopilot and firmware of the simulated Iris are compatible with the real Q250. It is therefore possible to connect the aircraft to the code and command it, and the code is the same for both. Figure 15.4 shows this simulated model (Garcia and Molina 2020; PX4SIM 2019).

15.2.5 Object Detect Methods

This work used the TensorFlow framework, which has a large number of existing implementations available for adaptation (Bahrampour et al. 2015; Kovalev et al. 2016).

15.2.5.1 TensorFlow Framework

TensorFlow was created by Google and integrates the Keras API to facilitate the implementation of high-performance algorithms, especially for large servers. It supports graphics processing units (GPUs) in addition to central processing units (CPUs). The tool is considered heavy compared to others on the market; however, it is very powerful because it provides a large number of features, tools, and implementations. TensorFlow has a GitHub repository where its main code and other useful tools such as deployed templates, TensorBoard, Project Magenta, etc., are available. As a portable library, it is available in several languages, such as Python, C++, Java, and Go, as well as other community-developed extensions (BAIR 2019; GOOGLE 2019; Sergeev and Balso 2018; Unruh 2019).

Fig. 15.4 The simulated drone Iris

15.2.5.2 Neural Network Technique

This work used convolutional neural networks (CNN) and machine learning (ML) for object detection (Cios et al. 2012; Kurt et al. 2008). A classification method was combined with the calculation of the location of the object; this approach is called object detection.
A CNN is a variation of the multilayer perceptron network (Vargas et al. 2016). A perceptron is simply a neuron model capable of storing and organizing information as in the brain (Rosenblatt 1958). The idea is to divide complex tasks into several smaller and simpler tasks that, in turn, act on different characteristics of the same problem and eventually return the desired answer. Figure 15.5 illustrates this structure (He et al. 2015; Szarvas et al. 2005; Vora et al. 2015).
The CNN applies filters to visual data to extract or highlight important features while maintaining the neighborhood relationship, similarly to convolution-matrix graphical processing, hence the name of this type of network (Krizhevsky et al. 2012). When a convolution layer is applied over an image, it multiplies and adds the values of each pixel to the values of a convolution filter or mask. After calculating an area following a defined pattern, the filter moves to another region of the image until it completes the operation over it (Jeong 2019). Figure 15.6 illustrates the structure of a CNN (Vargas et al. 2016). The single shot multibox detector (SSD) neural network (Bodapati and Veeranjaneyulu 2019; Huang et al. 2017; Yadav and Binay 2017), a convolutional neural network for real-time object detection (Cai et al. 2016; Dalmia 2019; Hui 2019; Liu et al. 2016; Moray 2019; Ning et al. 2017; Tindall et al. 2015; Xia et al. 2017), was used because it is considered the state-of-the-art in accuracy (Liu et al. 2016).

Fig. 15.5 A neuron and its activation function

Fig. 15.6 CNN Kernel
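As a minimal, self-contained sketch of the convolution operation described above (this is not the network used in the work, only an illustration of how a filter slides over an image):

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive valid-mode 2D convolution: weight and sum the pixels under the kernel at each position.

    Note: like most CNN implementations, the kernel is not flipped (cross-correlation).
    """
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(region * kernel)
    return out

# Example: a vertical edge-detection kernel applied to a synthetic image.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])
print(convolve2d(image, kernel))
```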

15.2.5.3 Image Processing Technique

A detection algorithm was implemented based on pure image processing (Bottou 2012; Ding et al. 2016; Huang et al. 2017; Hussain et al. 2017; Liu et al. 2016; Pandey 2019; Redmon and Farhadi 2017; Ren et al. 2015), using the graphics library OpenCV (Bradski and Kaehler 2008; Countours 2019; Marengoni and Stringhini 2009), which has several image processing functions. First, a Gaussian filter, a convolution mask-based algorithm, is applied to smooth the edges of the image before detecting polygons (Deng and Cahill 1993). A threshold filter is then adopted (Ito and Xiong 2000; Kumar 2019; Simple Thresholding 2019), which divides the pixels into two groups around a limit value: all pixels darker than the limit go to one group and the remaining pixels to the other. To find the edges, the Canny filter (Accame and Natale 1997; OPENCV 2019) is applied, which walks over the image pixels with a gradient vector that calculates the direction and intensity of the pixels (Boze 1995; Hoover et al. 2000; Simple Thresholding 2019). OpenCV's findContours() function was used to detect polygons after proper treatment of the image with the mentioned filters.
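A minimal sketch of this filter pipeline is shown below, assuming OpenCV 4.x; the threshold values, the area limit, and the function name find_passage are illustrative choices, not the parameters tuned in the project.

```python
import cv2

def find_passage(frame):
    """Locate the largest four-sided contour in a BGR frame and return its bounding box."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)                       # smooth edges before detection
    _, thresh = cv2.threshold(blurred, 127, 255, cv2.THRESH_BINARY)   # split pixels around a limit
    edges = cv2.Canny(thresh, 50, 150)                                # gradient-based edge map
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    best = None
    for cnt in contours:
        approx = cv2.approxPolyDP(cnt, 0.02 * cv2.arcLength(cnt, True), True)
        if len(approx) == 4 and cv2.contourArea(approx) > 1000:       # keep only large quadrilaterals
            if best is None or cv2.contourArea(approx) > cv2.contourArea(best):
                best = approx
    return None if best is None else cv2.boundingRect(best)           # (x, y, w, h)

# Example usage with a single image (the file name is hypothetical):
# box = find_passage(cv2.imread("frame.png"))
```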

15.3 Implemented Model

Figure 15.7 shows a flowchart of the overall system architecture, indicating its processes. The system consists of three interconnected machines, with the ground control station (GCS) computing most of the implemented code. The drone receives speed commands to move, and image capture is performed through a camera and a receiver/transmitter pair for data transfer.
This is the general architecture of both the simulated and the real implemented systems. To start the system, it is necessary to start the three machines so that they can communicate.
In the flowchart, the red arrows represent internal executions within the same machine and the blue arrows indicate the transfer of data between one machine and another through a communication protocol. The orange arrows indicate the creation of the multiprocessing chain necessary for the implementation of the system. The machine running the main code is the GCS, which runs three main threads: one for image capture, another to shut down the software, and the main one that manages all the processing and calculations.
The captured images are transferred to the detection algorithm, which calculates the bounding box that best represents the desired passage. A speed is then calculated according to the detection and transferred to the drone. When the drone loses a detection or finds none, the code slows the aircraft down until, if no further detection occurs, the drone stops. When the drone receives a speed to be set on one of its axes, that speed is kept until another speed is received or some other type of command, such as landing for safety, is executed.

15.3.1 Object Detection by Neural Network

The CNN used applies filters to highlight the desired object, together with a classifier and a bounding box estimator that indicate the location of the object in the image. The feature extractor used was the MobileNet version 2 graph, and the classifier comes from the CNN features.
The CNN was trained with a set of images of windows and other objects with their respective bounding box coordinates. This set of images was obtained from Google Open Images version 4 and contains about 60,000 window images (GOOGLE 2019).
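A minimal sketch of how such a trained detector can be queried at run time, assuming the model was exported as a SavedModel with the TensorFlow Object Detection API; the model path and the score threshold are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

# Hypothetical path to the exported detector.
detect_fn = tf.saved_model.load("exported_ssd_mobilenet_v2/saved_model")

def detect_windows(image_rgb, score_threshold=0.5):
    """Run the detector on one RGB image (H x W x 3, uint8) and return the confident boxes.

    Boxes come back normalized as [ymin, xmin, ymax, xmax] in the 0-1 range.
    """
    input_tensor = tf.convert_to_tensor(image_rgb[np.newaxis, ...], dtype=tf.uint8)
    detections = detect_fn(input_tensor)
    boxes = detections["detection_boxes"][0].numpy()
    scores = detections["detection_scores"][0].numpy()
    return boxes[scores >= score_threshold]
```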

Fig. 15.7 General system flowchart (author)

RGB images are converted to grayscale and smoothed with the Gaussian filter. The Canny filter is then applied to isolate the edges of the objects. The code looks for lines that form a four-sided polygon to identify a passage within a window. The center of this polygon is identified, and its side and diagonal measurements are calculated.
An important detail is that the distances between points are calculated geometrically as quadrilateral measurements. These calculated distances are values in pixel units, that is, the number of pixels between one point and another, and that number of pixels varies according to the resolution of the camera used. To correct this, the values are converted into percentages, so every measure is defined as a percentage of the maximum it could assume, usually the height, width, or diagonal of the image. For example, when measuring the height of the bounding box, it is divided by the height of the image to find its occupation percentage. The Cartesian plane of the image is best described using the y and z axes, because the image follows the same pattern as the drone movements.

Fig. 15.8 Application of object segmentation in the real world (author)
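A small sketch of this normalization is given below; the bounding box is assumed to be given as (x, y, w, h) in pixels, and the field names are illustrative.

```python
import math

def normalize_bbox(x, y, w, h, img_w, img_h):
    """Convert pixel measurements into fractions of the image so they are resolution independent."""
    return {
        "width": w / img_w,                                        # fraction of the image width
        "height": h / img_h,                                       # fraction of the image height
        "diagonal": math.hypot(w, h) / math.hypot(img_w, img_h),   # fraction of the image diagonal
        "center_y": (x + w / 2) / img_w,                           # horizontal center (drone y axis)
        "center_z": (y + h / 2) / img_h,                           # vertical center (drone z axis)
    }
```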
It was relatively easy to detect the passage in the simulated experiment. However, in a real environment with many polygons, many unwanted detections were generated. To solve this, a segmentation network could be used, which has the capacity to capture the total area of the sought object. Figure 15.8 shows an example of this type of network (He et al. 2017).
Another challenge is that the current algorithm does not capture the slope of the object, which would be needed to align the aircraft with the found window. Here too, a segmentation network able to capture the total area of the searched object could be used.

15.3.2 Drone Control

The drone control algorithm uses three functions to position the drone in front of the window so that it can make an approximately linear crossing: Approximate, Centralize, and Align. The algorithm defines the speed of the drone on the x, y, and z axes, as shown in Fig. 15.9.

15.3.2.1 Approximate Function

In the Approximate function, the inputs are the diagonals of the image and of the bounding box, while the output is a speed on the x axis of the drone. The function takes the detected bounding box and checks its size in relation to the image to measure the relative distance of the object. The algorithm thus estimates, as a percentage, how much of the image the object occupies.

Fig. 15.9 Application of object segmentation in real world (author)
The mathematical function (15.1) was used to model the characteristic of this movement:

$$f(p) = k \cdot \frac{1}{p^{2}} \qquad (15.1)$$

It is the inverse function of the square of the calculated diagonal size: the smaller the diagonal, the farther away the object is and the higher the movement speed. A quadratic function was used to obtain a larger gain. In this function, "p" represents the input measure, and "k" is a constant that controls the output value according to factors such as distance and execution state, giving a greater or lesser gain depending on the case.
The behavior of this function is shown in Fig. 15.10; only the positive domain is used for the problem. When the size of the object tends to zero, the speed tends to infinity, and when the size tends to infinity, the speed tends to zero. Due to the high values this function can reach, only part of it is used, an interval defined in code that respects the system conditions. This interval is defined by p between [0.1, 0.7], that is, detections with diagonals occupying 10–70% of the image.
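In code, this gain law can be sketched as follows; the constant k is an illustrative value, while the clipping interval follows the 10–70% range stated above.

```python
def approximate_speed(diag_fraction, k=0.05):
    """Forward speed from the relative bounding-box diagonal: f(p) = k / p^2 (Eq. 15.1)."""
    p = min(max(diag_fraction, 0.1), 0.7)   # only use the interval [0.1, 0.7] of the function
    return k / (p ** 2)
```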

15.3.2.2 Centralize Function

The centralization function positions the drone at the center of the opening of the identified window. It uses the distance on the y and z axes between the center of the image and the center of the bounding box to set the speeds on the y and z axes of the drone and thus center the aircraft (Fig. 15.11).

Fig. 15.10 Inverse square function behavior (author)

Fig. 15.11 Measures of bounding box and image to perform centralization (author)

15.3.2.3 Align Function

Figure 15.12 shows a side view in which the bounding box is distorted, with the right side smaller than the left, because the view from the drone is misaligned in relation to the window. The alignment function sets the speed on the drone's angular axis to produce a yaw that aligns the aircraft with the window.

Fig. 15.12 Measures of bounding box and image to perform alignment (author)

15.3.2.4 Execution State

When the algorithm detects a passage, the movement functions themselves indicate the current execution state of the drone. Five states are performed, the last being the crossing of the passage. Algorithm 1 shows pseudocode for the state control of the system.
Algorithm 1: State Control

if currentState == state then
begin
    ret1 <- centralize(bbox, centf, erry, errz);
    ret2 <- approximate(bbox, aprf, err);
    ret3 <- align(bbox, alif, err);
end
if ret1 and ret2 and ret3 then
    currentState++;

Each of the functions sets the speed to produce faster movements over a long
distance and slower over a short distance. When all functions return "true," the state
changes.
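A runnable sketch of this state control is shown below. The three functions here are simplified stand-ins for the ones described above, the fields of the box dictionary are assumptions, all measurements are taken as normalized fractions of the image, and the gains and error tolerances are illustrative, not the values used in the project.

```python
def centralize(center_y, center_z, err=0.05):
    """Speeds on y/z proportional to the offset between the box center and the image center."""
    dy, dz = 0.5 - center_y, 0.5 - center_z
    return (dy, dz), abs(dy) < err and abs(dz) < err

def approximate(diag_fraction, target=0.3, k=0.05, err=0.02):
    """Forward speed from f(p) = k / p^2; equilibrium when the diagonal reaches the target size."""
    vx = k / max(diag_fraction, 0.1) ** 2
    return vx, abs(diag_fraction - target) < err

def align(left_height, right_height, err=0.03):
    """Yaw rate proportional to the difference between the two vertical sides of the box."""
    d = left_height - right_height
    return d, abs(d) < err

def control_step(state, box):
    """One control iteration: compute speeds and advance the state when all functions agree."""
    (vy, vz), centered = centralize(box["center_y"], box["center_z"])
    vx, close = approximate(box["diagonal"])
    yaw, aligned = align(box["left_height"], box["right_height"])
    if centered and close and aligned:
        state += 1            # equilibrium reached on all axes: go to the next state
    return state, (vx, vy, vz, yaw)
```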

Fig. 15.13 Error analysis from the approximation point of view (author)

Figure 15.13 shows a representation of the processed states from the viewpoint of the approximation function. In the first state, the target diagonal is 30% of the image; when the drone reaches this percentage, the approximation function returns "true". Having found its equilibrium point, the drone interrupts its movement on the x axis and waits for the alignment and centralization functions to also find their balance points, so that the aircraft can advance with precision.
Decision making in data analysis is never exact and always requires an error parameter to define whether a group of values lies within the set of acceptable ones. Correctly adjusting these parameters is extremely important to make the system more reliable. A detection can be identified as positive or negative, i.e., the object is assigned to a class or to no class, respectively. It can also be true or false, meaning it is judged as a correct detection or not (Cork et al. 1983).

15.4 Results and Analysis

We now evaluate the results of the trained CNN for window and passageway detection, obtained on images that were not part of the set used during the training of the network.
To assess the quality of the CNN, we use the percentages of false and true identifications, summarized by two measures (formalized below):
• Accuracy: the percentage of all predictions that are correct.
• Precision: how correct the positive results are.
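The chapter does not state these measures as formulas, but they correspond to the usual definitions in terms of true/false positives (TP, FP) and true/false negatives (TN, FN):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$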

Fig. 15.14 Number of bounding boxes detected. (I) Mono-class training. (II) Multi-class training (author)

15.4.1 Mono- and Multi-class Training Analysis

Two trainings were applied: one called mono-class, which uses only one object class (window), and another called multi-class, which uses several other object classes (people, cars, traffic signs) besides the window.
Mono-class training produced many false detections, in which people and cars were identified as windows. Figure 15.8 shows this clearly. To evaluate the false positives, a test was carried out with about 700 images without windows, and the result shows that the multi-class training effectively eliminated the false positives. The results are shown in the graph in Fig. 15.14, which also shows that the network lost 50% of its detection generality.

15.4.2 Object Detection Analysis

During the experiments, loss of detection on distorted frames affected the system's efficiency. When a detection failure occurs, the drone tends to keep executing the last calculated speed until it stops, generating an inappropriate movement. An experiment was therefore carried out to measure the number of detections obtained in a video capture. Two cameras were tested, a Full HD (1920 × 1080) digital camera and an FPV analog transmission camera (600TVL), with two types of video, one with camera movement and the other without. The objective was to assess what most affected the number of detections: the camera or the movement. The camera used in the system was low cost and had no gimbal to stabilize the video, so in the experiment both cameras filmed the same object, stationary and moving, to simulate a system with a gimbal and one without.
The influencing factors evaluated were (a) camera resolution and (b) presence of movement, with two levels each. Each recorded video lasted one minute, and each camera captures a certain number of frames per second, so the final analysis variable was the percentage of frames in which detection was successful (Jain 1990).

Table 15.1 Experiments analyzing the image processing technique

Camera     Motion    N. Detection (%)
Full HD    Yes       87.07
Full HD    No        84.16
FPV        Yes       87.94
FPV        No        93.01

Fig. 15.15 Camera (C), Motion (M), and Both (MC) effects on polygon detection (author)

15.4.2.1 Image Processing Technique Analysis

To assess the detection losses of the polygon detection algorithm, a passage forming a well-defined polygon was used to facilitate detection in an isolated environment. The results obtained are shown in Table 15.1.
The effect of each factor on the number of detections found is shown in Fig. 15.15, where "M" represents movement, "C" represents the camera, and "CM" means both together.
The camera was the factor that most affected the performance of the detector. The motion did not have much influence on the result, but the combination of the two factors has a significant impact on the system.

15.4.2.2 Neural Network Technique Analysis

The same test performed for the polygon detector was also performed for the neural network: four experiments were carried out with the two types of cameras, one recording stationary and one in motion. The results obtained are shown in Table 15.2.
Figure 15.16 shows the effect of the camera (C) and the movement (M). The camera factor had an influence of 48.1%, that is, it was responsible for almost half of the variation in the system, so the quality of the video makes a great difference for neural networks. The movement factor did not affect the result as much, with a lower value of 29.6%, but it is still relevant; the neural network can work with distorted and flawed images as long as they contain some of the sought characteristics. The influence of both factors together (CM) also had a relevant figure of 22.4%, almost as large as the movement itself.

Table 15.2 Experiments analyzing the neural network technique

Camera     Motion    N. Detection (%)
Full HD    Yes       79.62
Full HD    No        81.94
FPV        Yes       41.08
FPV        No        74.68

Fig. 15.16 Camera (C), Motion (M), and Both (MC) effects on CNN detection (author)

This means that the low-resolution setting combined with movement lost the most, as can be seen in Table 15.2, where the experiment with these characteristics lost detection in more than half of the frames.
In any case, the object detector based on neural networks also suffers from poor-quality image capture; a better hardware implementation produces a greater number of successful detections.

15.4.3 System Validation Tests

The system was first tested in a simulation in the Gazebo environment. Subsequently, the system was implemented with PixHawk and MAVSDK and tested in real scenarios.
Figure 15.17 shows the passage used for testing in a real environment. Using this disposable passageway and the Gazebo simulator, it was possible to find adequate speed values for the movements. The highlighted border color makes the passage easy to detect. Although the algorithm avoids unwanted detections, it still tends to find squares in environments with many straight-lined objects. Therefore, the use of a neural network facilitated the application of the technique in the real world.

15.4.3.1 Simulated Environment

The tests performed on the simulated platform are important for control modeling, as there are no problems with falls and accidents. A simple passage was simulated with the edges highlighted by a color difference. Figure 15.18 illustrates these experiments, in which there are four distinct steps in the execution of the algorithm. The experiments were carried out by crossing a simple 2 × 2 meter passage in the simulation environment shown in Fig. 15.18, and in these tests the drone always managed to cross the passage.

Fig. 15.17 Detection of the disposable passage in the real world (author)

Sometimes the drone has difficulty finding the equilibrium point of the last state, related to the centralization function, and ends up taking longer to perform the crossing; with a more precise parameterization, it often spends a lot of time looking for the balance point in a reciprocating motion due to its precise movement requirements. But in the end, even with these parameters, it ends up crossing.
After the implementation of the neural network-based object detector, a new environment was modeled to simulate a city with cars and houses. In this environment, it was possible to apply the algorithm that detects real windows to control the drone at the crossing. Through the studied control functions and the parameter adjustment made in the first environment, it was possible to carry out the crossing in the desired way. Figure 15.19 gives an overview of these tests.
In all tests performed with the control algorithm crossing a 1.5 × 2 meter window in the simulation environment shown in Fig. 15.19, the drone always managed to cross. The simulation of the system was successful, with a high number of detections per second, in addition to generating the correct speeds for the drone to follow.

15.4.3.2 Real Environment

The validation of the system in a real environment was fully applied on a disposable passage, built to avoid accidents that would cause damage to the aircraft. The detection of this passage is shown in Fig. 15.17; it was made of paper so that a propeller could cut through it. Figure 15.20 shows a sequence of the steps performed in the real environment.

Fig. 15.18 Polygon detector based system in four steps (author)

Fig. 15.19 CNN based object detector in two steps (author)
The validation in the real environment presented many difficulties due to the quality of the image capture, which motivated the analysis carried out on this aspect. The use of a better camera and of filters to handle detection failures is what made the tests possible. In this case, detections were averaged and the generated speed was decreased to avoid accidents; even so, the system still sometimes failed due to lack of detection. Nevertheless, it was enough to show that these detection methods can be used in real environments.

Fig. 15.20 System in real environment in four steps (author)

15.5 Conclusion

The research carried out obtained satisfactory results. An autonomous image processing and decision system depends on various components, such as the airframe, the detection and data capture kit, the control hardware, the programming languages, and the simulation environment, among other characteristics that directly influence the final implementation. The bibliographic review of this work provides extensive modeling and demonstration possibilities for a system similar to the one implemented.
The use of a square detector based on image processing facilitated the research tests. In environments with a low number of squares, as in nature or in open fields, the algorithm obtained excellent movement results.
The main problem with a polygon detector based on image filters, like the one implemented, is that it identifies many different polygons in an image. For an implementation in a real environment, it was necessary to use the trained CNN for window detection. Thus, while the CNN was being implemented, the motion control algorithm was developed based on the quadrilateral detector.
The mono-class network showed several false positives, indicating people and cars as windows. The solution was to add several classes, such as people and cars, to the training set. With this, it was possible to perform window detection efficiently, as desired.
The implemented system generates speeds according to the position of the bounding box found and suffers from detection failures, as it needs real-time detection to maintain correct movement through the state variations. The solution was to implement averaging filters to circumvent the control failures, because in the real environment, with a low-cost camera, many losses occurred depending on the lighting and the distance of data transmission. In that case, a possible solution would be to implement a route-based control method instead of a speed-based one. Thus, despite losing the detection, the vehicle would maintain its route toward the static object it wants to cross.
The research focused on a solution based only on image processing and convolutional neural networks. However, a system like this implemented in a real environment needs sensors to assist in the decisions, such as a proximity sensor, to avoid accidents, identify closed windows, and stabilize the flight. This is especially true during the crossing, which, in the current system, is done blindly, as the camera loses sight of the object once the drone is inside it, following only the route parameters already calculated up to that point.
An interesting future implementation would be a neural network control model: capturing data about the position of the drone and the objects in the image (bounding boxes) while a human pilot performs various crossings, thus creating a data set. With this data set, a control network could be trained to perform the movements, where the bounding box would be the input and the movement speeds would be the output.

References

Accame M, Natale FGD (1997) Edge detection by point classification of canny filtered images. Sig
Proces 60(1):11–22
Bahrampour S et al (2015) Comparative study of deep learning software frameworks. arXiv preprint
1511.06435
BAIR (2019) Caffe. Available in https://caffe.berkeleyvision.org/. Cited 2019
Bodapati JD, Veeranjaneyulu N (2019) Feature extraction and classification using deep convolu-
tional neural networks. J Cyber Secur Mob 8(2):261–276
Bottou L (2012) Stochastic gradient descent tricks. In: Neural networks: tricks of the trade. Springer,
pp 421–436
Boze SE (1995) Multi-band, digital audio noise filter. Google Patents. US Patent 5,416,847
Bradski G, Kaehler A (2008) Learning OpenCV: computer vision with the OpenCV library. O’Reilly
Media, Inc
Cai Z et al (2016) A unified multi-scale deep convolutional neural network for fast object detection.
In: European conference on computer vision. Springer, pp 354–370
Cios KJ, Pedrycz W, Swiniarski RW (2012) Data mining methods for knowledge discovery. In:
Springer Science & Business Media. Springer, vol 458
Cork RC, Vaughan RW, Humphrey LS (1983) Precision and accuracy of intraoperative temperature
monitoring. Anesth Analg 62(2):211–214
Countours (2019) OpenCV. Available in http://host.robots.ox.ac.uk/pascal/VOC/voc2007/. Cited
2019
Dalmia A (2019) Real-time object detection: understanding SSD. Available in https://medium.
com/inveterate-learner/real-time-object-detection-part-1-understanding-ssd-65797a5e675b.
Cited 2019
Damilano L et al (2013) Ground control station embedded mission planning for UAVs. J Intel Rob
Syst 69(1–4):241–256
de Brito PL et al (2019) A technique about neural network for passageway detection. In: 16th
international conference on information technology-new generations (ITNG 2019). Springer, pp
465–470

de Jesus LD et al (2019) Greater autonomy for rpas using solar panels and taking advantage of rising
winds through the algorithm. In: 16th international conference on information technology-new
generations (ITNG 2019). Springer, pp 615–616
de Waard M, Inja M, Visser A (2013) Analysis of flat terrain for the atlas robot. In: 3rd joint
conference of AI & robotics and 5th RoboCup Iran open international symposium. IEEE, pp 1–6
Deng G, Cahill L (1993) An adaptive gaussian filter for noise reduction and edge detection. In: IEEE
conference record nuclear science symposium and medical imaging conference, pp 1615–1619
Ding J et al (2016) Convolutional neural network with data augmentation for sar target recognition.
IEEE Geosci Remote Sens Lett 13(3):364–368
DRONEKIT (2019) Available in https://dronekit.io/. Cited 2019
Falanga D et al (2018) The foldable drone: a morphing quadrotor that can squeeze and fly. IEEE
Rob Autom Lett 4(2):209–216
French R, Ranganathan P (2017) Cyber attacks and defense framework for unmanned aerial systems
(uas) environment. J Unmanned Aerial Syst 3:37–58
Garcia J, Molina JM (2020) Simulation in real conditions of navigation and obstacle avoidance with
px4/gazebo platform. In: Personal and ubiquitous computing. Springer, pp 1–21
GAZEBOSIM (2019) Available in http://gazebosim.org/. Cited 2019
GOOGLE (2019) Open images dataset. Available in https://opensource.google.com/projects/open-
images-dataset. Cited in 2019
He K et al (2017) Mask r-cnn. In: Proceedings of the IEEE international conference on computer
vision, pp 2961–2969
He K et al (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition.
IEEE Trans Pattern Anal Machine Intell 37(9):1904–1916
Hoover A, Kouznetsova V, Goldbaum M (2000) Locating blood vessels in retinal images by piece-
wise threshold probing of a matched filter response. IEEE Trans Med Imag 19(3):203–210
Huang J et al (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7310–7311
Hui J (2019) SSD object detection: single shot MultiBox detector for real-time processing. Avail-
able in https://medium.com/@jonathanhui/ssd-object-detection-single-shot-multibox-detector-
for-real-time-processing-9bd8deac0e06. Cited 2019
Hussain Z et al (2017) Differential data augmentation techniques for medical imaging classification
tasks. In: AMIA annual symposium proceedings. American Medical Informatics Association, p
979
Ilie I, Gheorghe GI (2016) Embedded intelligent adaptronic and cyber-adaptronic systems in organic
agriculture concept for improving quality of life. Acta Technica Corviniensis-Bull Eng 9(3):119
Ito K, Xiong K (2000) Gaussian filters for nonlinear filtering problems. IEEE Trans Autom Control
45(5):910–927
Jain R (1990) The art of computer systems performance analysis: techniques for experimental
design, measurement, simulation, and modeling. Wiley, Hoboken
Jarrell TA et al (2012) The connectome of a decision-making neural network. Science
337(6093):437–444
Jeong J (2019) The most intuitive and easiest guide for convolutional neural net-
work. Available in: https://towardsdatascience.com/the-most-intuitive-and-easiest-guide-for-
convolutional-neural-network-3607be47480. Cited 2019
Koenig N, Howard A (2004) Design and use paradigms for gazebo, an open-source multi-robot
simulator. In: IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE
Cat. No. 04CH37566), vol 3, pp 2149–2154
Kovalev V, Kalinovsky A, Kovalev S (2016) Deep learning with theano, torch, caffe, tensorflow,
and deeplearning4j: which one is the best in speed and accuracy? Publishing Center of BSU,
Minsk
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks. In: Advances in neural information processing systems. pp 1097–1105

Kumar A (2019) Computer vision: Gaussian filter from scratch. Available in https://medium.com/
@akumar5/computer-vision-gaussian-filter-from-scratch-b485837b6e09. Cited 2019
Kurt I, Ture M, Kurum AT (2008) Comparing performances of logistic regression, classification
and regression tree, and neural networks for predicting coronary artery disease. Exp Syst Appl
34(1):366–374
Kyrkou C et al (2019) Drones: augmenting our quality of life. IEEE Potentials 38(1):30–36
Liu W et al (2016) Ssd: Single shot multibox detector. In: European conference on computer vision.
Springer, pp 21–37
Marengoni M, Stringhini S (2009) Tutorial: Introdução á visão computacional usando opencv (in
portuguese). Revista de Informática Teórica e Aplicada 16(1):125–160
Martins WM et al (2018) A computer vision based algorithm for obstacle avoidance. In: Information
technology-new generations. Springer, pp 569–575
MAVROS (2019) Available in http://wiki.ros.org/mavros. Cited 2019
MAVSDK (2019) Available in https://mavsdk.mavlink.io/. Cited 2019
Meier L et al (2011) Pixhawk: a system for autonomous flight using onboard computer vision. In:
IEEE international conference on robotics and automation, pp 2992–2997
Moray A (2007) Available in https://docs.opencv.org/. Cited 2019
Ning C et al (2017) Inception single shot multibox detector for object detection. In: IEEE interna-
tional conference on multimedia & expo workshops (ICMEW), pp 549–554
OPENCV (2019) Canny edge detector. Available in https://docs.opencv.org/2.4/doc/tutorials/
imgproc/imgtrans/cannydetector/cannydetector.html. Cited 2019
OPENCV (2019) Simple thresholding. Available in https://docs.opencv.org/master/d7/d4d/
tutorialpythresholding.html. Cited 2019
Pandey P (2019) Understanding the mathematics behind gradient descent. Available in https://
opensource.google.com/projects/open-images-dataset. Cited 2019
Pinto LGM et al (2019) A ssd–ocr approach for real-time active car tracking on quadrotors. In: 16th
international conference on information technology-new generations (ITNG 2019). Springer, pp
471–476
Pixhawk (2019) Available in https://pixhawk.org/. Cited 2019
Planner A (2019a) APM planner. Available in https://ardupilot.org/planner2/. Cited 2019
Planner M (2019b) Mission planner. Available in https://ardupilot.org/planner/. Cited 2019
Prescott JW (2013) Quantitative imaging biomarkers: the application of advanced image processing
and analysis to clinical and preclinical decision making. J Digit Imag 26(1):97–108
PX4SIM (2019) Available in https://dev.px4.io/. Cited 2019
QGROUNDCONTROL (2019) Available in http://qgroundcontrol.com/. Cited 2019
Ramirez-Atencia C, Camacho D (2018) Extending qgroundcontrol for automated mission planning
of UAVs. Sensors 18(7):2339
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, pp 7263–7271
Ren S et al (2015) Faster r-cnn: towards real-time object detection with region proposal networks.
In: Advances in neural information processing systems, pp 91–99
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization
in the brain. Psychol Rev 65(6):386
Sergeev A, Balso MD (2018) Horovod: fast and easy distributed deep learning in tensorflow. In
preprint arXiv:1802.05799
Szarvas M et al (2005) Pedestrian detection with convolutional neural networks. In: Intelligent
vehicles symposium, pp 224–229
TensorFlow (2019). Available in https://www.tensorflow.org. Cited in 2019
Tindall L, Luong C, Saad A (2015) Plankton classification using vgg16 network
Unruh A (2019) What is the TensorFlow machine intelligence platform? Available in https://
opensource.com/article/17/11/intro-tensorflow. Cited 2019

Vargas ACG, Paes A, Vasconcelos CN (2016) Um estudo sobre redes neurais convolucionais e sua
aplicaç ao em detecç ao de pedestres" (in portuguese). In: Proceedings of the XXIX conference
on graphics, patterns and images. pp 1–4
Vora K, Yagnik S, Scholar M (2015) A survey on backpropagation algorithms for feedforward
neural networks. Citeseer
Xia X, Xu C, Nan B (2017) Inception-v3 for flower classification. In: 2nd international conference
on image, vision and computing (ICIVC), pp 783–787
Yadav N, Binay U (2017) Comparative study of object detection algorithms. Int Res J Eng Technol
(IRJET) 4(11):586–591
Chapter 16
Analysis of Features in SAR Imagery Using GLCM Segmentation Algorithm

Jasperine James, Arunkumar Heddallikar, Pranali Choudhari, and Smita Chopde

Abstract The Synthetic Aperture Radar (SAR) system is one of the most widely used systems due to its property of self-illumination. The use of SAR is gaining interest in earth remote sensing due to its advantages over optical imaging systems. Its ability to monitor consistently and adapt to changing weather conditions makes the application of SAR for radar imaging important. Feature detection in SAR images can be achieved by using accurate texture segmentation methods. This paper introduces the Grey Level Co-occurrence Matrix (GLCM), which proves to be a good discriminator for identifying different textural features in SAR imagery. With this technique, features present in SAR images such as water, vegetation, and urban areas on land are detected using different orientations.

16.1 Introduction

Synthetic aperture radar (SAR) is an active microwave radar able to achieve high-resolution images independent of daylight. SAR imaging is an important earth observation technique in the remote sensing field, used in a wide range of applications. SAR satellites provide high-resolution images, but it is difficult to identify targets, such as rivers, that are present in these data. Hence it is necessary to introduce segmentation techniques that are useful to identify such targets in these images in less time. One of the simplest methods of segmentation is thresholding, which falls under the category of intensity-based segmentation. The major drawback of this technique is that only the intensity value is considered and not the relationship between the pixels in an image. This leads either to losing information of a particular region or to obtaining too many unnecessary background pixels.
GLCM texture segmentation is a statistical texture segmentation method based on second-order characteristics, which considers the spatial relationship among the pixels (Payal Dilip Wankhade 2014; Gonzalez et al. 2009). When compared with previously existing algorithms such as the watershed algorithm, the results obtained with the GLCM method are much better (Kaur et al. 2014). This paper shows how this texture segmentation method can be carried out on SAR images to detect important features in them. Section 16.2 gives an overview of the Gray Level Co-occurrence Matrix approach for texture segmentation. In Sect. 16.3, the features based on the gray level co-occurrence matrix are explained. Section 16.4 gives an overview of the algorithm and the methodology implemented for texture segmentation. A comparison of the results obtained from texture segmentation carried out on different SAR images is given in Sect. 16.5.

J. James (B) · P. Choudhari · S. Chopde
FCRIT, Mumbai, India
A. Heddallikar
RADAR Division, Sameer, IIT Bombay, Mumbai, India

16.2 Grey Level Co-occurrence Matrix

One of the earliest and most widely used methods for texture feature extraction is the Gray-Level Co-occurrence Matrix (GLCM), proposed by Haralick in 1973; since then it has been used in many texture analysis applications (Pathak et al. 2013). The GLCM has proved to be one of the popular statistical methods for extracting textural features from images, as it considers the spatial relationship of pixels (Mohanaiah et al. 2013; Materka et al. 1998). The GLCM can be computed in four directions, horizontal (0° or 180°), vertical (90° or 270°), right diagonal (45° or 225°), and left diagonal (135° or 315°) (Hall-Beyer and Mryka 2017), which are denoted as P0, P45, P90, and P135 (Girisha et al. 2013), as shown in Fig. 16.2, where the GLCM is created from a test image along with its four directional results. The co-occurrence matrix directions used in the GLCM are shown in Fig. 16.1 (Pathak et al. 2013).
The GLCM shows how often a pixel with grey level value i occurs horizontally, vertically, or diagonally adjacent, in a given spatial relationship, to a pixel with value j (Singh and Inderpal 2014; Girisha et al. 2013). Two neighbouring pixels can be separated by a distance d, where one of them has gray level i and the other j. The co-occurrence matrix can be calculated over an image through a window that scans the image, and a co-occurrence matrix can be associated with each pixel, as shown in Fig. 16.1, which shows the co-occurrence matrix directions (Pathak et al. 2013). After creation of the GLCM, various features can be computed from it (Singh and Inderpal 2014). The resulting matrices are used to characterize textures in images and contain information about the image such as contrast, energy, entropy, variance, etc. (Hall-Beyer and Mryka 2017). The features computed from the GLCM are discussed in the next section (Singh and Inderpal 2014; Pathak et al. 2013).
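In practice, a GLCM and its features can be computed with scikit-image, as in the minimal sketch below; the test values are illustrative, and the functions are named graycomatrix/graycoprops in recent scikit-image releases (greycomatrix/greycoprops in older ones).

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

# A small test image quantized to 3 grey levels (values must lie in [0, levels - 1]).
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 2, 2]], dtype=np.uint8)

# Distance d = 1 along the four directions 0, 45, 90 and 135 degrees.
glcm = graycomatrix(image, distances=[1],
                    angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                    levels=3, symmetric=True, normed=True)

print(graycoprops(glcm, "contrast"))      # one value per (distance, angle) pair
print(graycoprops(glcm, "homogeneity"))
print(graycoprops(glcm, "correlation"))
print(graycoprops(glcm, "ASM"))           # angular second moment, i.e. Eq. (16.1)
```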

Fig. 16.1 Co-occurrence matrix directions for extracting texture features (Pathak et al. 2013)

Fig. 16.2 Creation of GLCM from image matrix based on a a test image with b General form of
GLCM along four possible directions, c 0◦ , d 45◦ , e 90◦ and f 135◦ , # represents the number of
times with distance = 1 (Pathak et al. 2013)

16.3 GLCM Texture Feature Detection

Haralick extracted 14 features from the GLCM. In order to extract the Haralick features, the GLCM must be symmetric, which is achieved by taking the transpose and adding it to the original GLCM. A normalized matrix must then be created, which is obtained by summing all elements of the GLCM and dividing each element of the matrix by that sum (Girisha et al. 2013). The resulting normalized symmetric GLCM matrix can then be used to extract features.
Image texture features can be classified into three groups depending on the distribution of the spatial variation of the pattern in an image:
(1) contrast
(2) orderliness
(3) statistics (Wen et al. 2011).

Group one includes contrast, homogeneity, and dissimilarity; from this group we selected contrast for the purpose of texture segmentation. The second group measures orderliness and includes energy as well as entropy; entropy is said to be inversely correlated with energy (Wen et al. 2011). We use both energy and entropy for texture segmentation in SAR imagery. Group three includes mean, variance, and correlation, where variance is said to be correlated with contrast, and correlation is uncorrelated with energy, entropy, and contrast (Wen et al. 2011). From this group, variance is selected for the purpose of texture segmentation. Details of these features are described below.

16.3.1 Energy

The energy feature, also called uniformity or angular second moment, measures the orderliness of the gray level distribution in an image. This feature is high when the image has good homogeneity, that is, when it contains many similar sets of pixels. The expression for energy is shown below:

$$\text{Energy} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} P(i,j)^{2} \qquad (16.1)$$

where P(i, j) is the (i, j)th entry of the normalized co-occurrence matrix and N_g is the number of grey levels of the considered SAR image.

16.3.2 Contrast

Contrast represents the amount of local variation in the image, acting as a good edge detector, and measures the spatial frequency of an image (Girisha et al. 2013; Cevik et al. 2016). The general expression of this feature for the GLCM is shown below:

$$\text{Contrast} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i-j)^{2} \cdot P(i,j) \qquad (16.2)$$

16.3.3 Homogeneity

Homogeneity is the measure of the smoothness of the grey level distribution in the image. It is said to be inversely correlated with contrast: if contrast is small, then homogeneity will be large. The homogeneity feature of the GLCM takes high values for a low-contrast image (Girisha et al. 2013). The general expression for homogeneity is given as:

$$\text{Homogeneity} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{1}{1+(i-j)^{2}} \cdot P(i,j) \qquad (16.3)$$

16.3.4 Correlation

The correlation feature measures the grey tone linear dependencies in the image and is uncorrelated with energy and entropy (Girisha et al. 2013; Mohanaiah et al. 2013). The correlation of a pixel with its neighbour is calculated over the entire image, measuring the linear dependency of gray levels in neighbouring pixels. This feature can be expressed as follows (Cevik et al. 2016):

$$\text{Correlation} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{(i-\mu_i)(j-\mu_j)\, P(i,j)}{\sigma_i \sigma_j} \qquad (16.4)$$

16.3.5 Entropy

The entropy feature belongs to the orderliness group and shows how regular the pixel values within the window are. Entropy gives the amount of information in the image that is required for image compression (Mohanaiah et al. 2013). The expression for entropy is shown below:

$$\text{Entropy} = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} P(i,j) \cdot \log\bigl(P(i,j)\bigr) \qquad (16.5)$$

16.3.6 Variance

Variance is the average squared distance of each data point from the mean, also called the mean squared deviation. The expression for variance is given as:

$$\text{Variance} = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i-\mu)^{2} \cdot P(i,j) \qquad (16.6)$$
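Assuming a symmetric, normalized GLCM is already available as a NumPy array P, the features of Eqs. (16.1)–(16.6) can be computed directly as in the following sketch (here the mean μ of the variance is taken as the marginal mean over i, one common convention):

```python
import numpy as np

def glcm_features(P):
    """Compute the GLCM features of Eqs. (16.1)-(16.6) from a normalized, symmetric matrix P."""
    Ng = P.shape[0]
    i, j = np.meshgrid(np.arange(1, Ng + 1), np.arange(1, Ng + 1), indexing="ij")

    energy = np.sum(P ** 2)
    contrast = np.sum((i - j) ** 2 * P)
    homogeneity = np.sum(P / (1.0 + (i - j) ** 2))
    entropy = -np.sum(P[P > 0] * np.log(P[P > 0]))

    mu_i, mu_j = np.sum(i * P), np.sum(j * P)
    sigma_i = np.sqrt(np.sum((i - mu_i) ** 2 * P))
    sigma_j = np.sqrt(np.sum((j - mu_j) ** 2 * P))
    correlation = np.sum((i - mu_i) * (j - mu_j) * P) / (sigma_i * sigma_j)
    variance = np.sum((i - mu_i) ** 2 * P)

    return {"energy": energy, "contrast": contrast, "homogeneity": homogeneity,
            "entropy": entropy, "correlation": correlation, "variance": variance}
```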

16.4 Methodology Implemented

The first step is to acquire SAR images; for this purpose we use Sentinel-1 data from the European Space Agency (ESA). All the images used are collected from ESA. Images are selected from the database, and the GLCM is computed with a window size of 5 × 5, distance = 1, and orientations of 0°, 45°, 90°, and 135°. The Gray Level Co-occurrence Matrix (GLCM) is created as follows:
• Consider the set of samples surrounding a given sample that fall within the window centered upon that sample, with the window size specified.
• The (i, j) entry of the GLCM is the number of times two samples of intensities i and j occur in the specified spatial relationship.
The obtained GLCM is made symmetric by creating its transpose and adding it to the GLCM itself. The result is normalized by dividing each element by the sum of all the elements present in the matrix, so that the elements of the GLCM are expressed as probabilities. Finally, texture feature detection is done with features such as contrast, ASM, entropy, and variance, and the obtained results are compared and tabulated (Fig. 16.3).
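A sketch of the sliding-window segmentation step is shown below, using the window size, distance, and number of grey levels stated above (5 × 5, d = 1, 3 levels); the use of scikit-image here is an implementation choice for illustration, not necessarily the original code.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_map(image, prop="contrast", win=5, levels=3, angle_idx=0):
    """Slide a win x win window over a quantized image and compute one GLCM feature per pixel.

    angle_idx selects the orientation: 0, 1, 2, 3 -> 0, 45, 90, 135 degrees.
    """
    half = win // 2
    padded = np.pad(image, half, mode="edge")
    out = np.zeros(image.shape, dtype=float)
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + win, c:c + win]
            glcm = graycomatrix(window, distances=[1], angles=angles,
                                levels=levels, symmetric=True, normed=True)
            out[r, c] = graycoprops(glcm, prop)[0, angle_idx]
    return out

# Example: quantize a SAR amplitude image to 3 grey levels, then compute the 0-degree contrast map.
# quantized = np.digitize(sar_image, np.quantile(sar_image, [1/3, 2/3])).astype(np.uint8)
# contrast_map = texture_map(quantized, "contrast", angle_idx=0)
```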

16.5 Results

The SAR data considered for texture analysis is shown in Fig. 16.4. An area of the Nether-
lands is considered for study, which includes the data of classes A and B. Class C covers the
region between Denmark and the Netherlands. The sizes of the considered class A, B
and C images are 378 × 550, 376 × 550 and 376 × 549, respectively. The results are
observed by varying the window size over 3 × 3, 5 × 5, 7 × 7, 9 × 9 and 11 × 11. We observe
that over-segmentation takes place in the images with a window size of 11 × 11, as
shown in Figs. 16.17 and 16.18 for the class A and B images. It is necessary to select the
right window size so that the features present in the SAR images are detected clearly.
By varying window sizes, we observe that a window size of 5 × 5 gives clearer results
than the other window sizes, and hence we select a window size of 5 × 5 for all the images.
The distance considered is 1 and the number of grey levels used is 3, to reduce time and to avoid
complexity.
Results of the class A image show that the water details of all the orientations are clear; hence
we compare the land details. We observe that the vegetation region in the land is
detectable using the GLCM variance feature, as shown in Figs. 16.5, 16.6, 16.7 and
16.8.
The presence of urban areas in the land, such as The Hague and Amsterdam, can be
detected from the energy feature, as shown in the figures of all the orientations of the class
A results. The 0◦ results of the energy feature are clearer than the results of the
other orientations, as shown in Figs. 16.5, 16.6, 16.7 and 16.8.
With the results of the class B image, variance gave clearer results and no change is
observed in the results of contrast. The presence of water regions in the land can be detected

Fig. 16.3 Methodology implemented

Fig. 16.4 SAR image data considered for GLCM segmentation. a ClassA, b ClassB, c ClassC

Fig. 16.5 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦

Fig. 16.6 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 45◦

Fig. 16.7 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 90◦

Fig. 16.8 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 135◦

Fig. 16.9 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦

using the variance feature, as observed in Figs. 16.9, 16.10, 16.11 and 16.12, with the
window size 5 × 5 for the class B image.
On applying the algorithm to the class C image, the river Elbe can be detected from the
variance feature of the GLCM, as shown in Figs. 16.13, 16.14, 16.15 and 16.16. The
presence of urban areas in the land, such as Hamburg, is also detectable using the GLCM
energy feature of the class C image data. The results of the variance feature are observed
to be clearer than those of the other features.
Since the water details of all the images obtained are clear, the land details of all
the SAR images along with all possible orientations are compared and tabulated as
shown in Table 16.1. For the construction of any co-occurrence matrix, the parameters of
distance (d) and direction (θ) are important; hence we compare the results of the GLCM
features observed in the SAR images by obtaining results simulated with d = 1.

Fig. 16.10 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 45◦

Fig. 16.11 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 90◦

Fig. 16.12 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 135◦

Fig. 16.13 Results obtained for class C data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦

Fig. 16.14 Results obtained for class C data for a Contrast, b Entropy, c Variance, d ASM for
orientation 45◦

Fig. 16.15 Results obtained for class C data for a Contrast, b Entropy, c Variance, d ASM for
orientation 90◦

Fig. 16.16 Results obtained for class C data for a Contrast, b Entropy, c Variance, d ASM for
orientation 135◦

Fig. 16.17 Results obtained for class A data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦ with window size 11 × 11

Fig. 16.18 Results obtained for class B data for a Contrast, b Entropy, c Variance, d ASM for
orientation 0◦ with window size 11 × 11

Table 16.1 Comparative results of GLCM features


Orientation 0◦ 45◦ 90◦ 135◦
Class A Contrast Not clear Not clear Not clear Not clear
Entropy Not clear Not clear Not clear Not clear
Variance Clear Clear Clear Clear
ASM Clear Clear Clear Clear
Class B Contrast Not clear Not clear Not clear Not clear
Entropy Not clear Clear Clear Clear
Variance Clear Clear Clear Clear
ASM Clear Clear Clear Clear
Class C Contrast Not clear Not clear Not clear Not clear
Entropy Not clear Not clear Not clear Not clear
Variance Clear Clear Clear Clear
ASM Clear Clear Clear Clear

16.6 Conclusion

In this paper, a comparative analysis of GLCM texture features is carried out to detect
different features in SAR images. MATLAB software was used to differentiate the vari-
ous features present in SAR images using the different possible orientations. This texture
segmentation method has the ability to segment a region having higher texture than the
other regions present in an image, which is useful to detect important features in SAR
images. The comparative analysis of GLCM features on SAR images shows that the
energy and variance features gave accurate results as compared to the other features. Examples
of SAR applications where GLCM texture segmentation can be used are tropical
forest monitoring and land cover change detection due to natural disasters such as land-
slides and floods, among many other applications. This texture segmentation method is also
beneficial for obtaining information which cannot be achieved through conventional seg-
mentation methods.

Acknowledgements Images used in this paper are collected from Sentinel-1 satellite data from
ESA (European Space Agency). The authors would like to thank them for providing data along
with necessary information that helped us to analyse the data and apply algorithm on it.

References

Cevik T, Ali Mustafa AA, Cevik N (2016) Performance analysis of GLCM-based classification on
Wavelet Transform-compressed fingerprint images. In: 2016 sixth international conference on
digital information and communication technology and its applications (DICTAP). IEEE
Chauhan AS, Silakari S, Dixit M (2014) Image segmentation methods: a survey approach. In: 2014
Fourth International Conference on Commun Syst Netw Technol (CSNT). IEEE

Girisha AB, Chandrashekhar MC, Kurian MZ (2013) Texture feature extraction of video frames
using GLCM. Int J Eng Trends Technol 4(6):2718–2721
Gonzalez RC, Woods RE, Eddins SL (2009) Digital image processing using MATLAB, vol 2
Hall-Beyer M (2017) GLCM texture: a tutorial v. 3.0 March 2017
Kaur D, Kau Y (2014) Various image segmentation techniques: a review. Int J Comput Sci Mob
Comput 3(5):809–814
Materka A, Strzelecki M (1998) Texture analysis methods a review. Technical University of lodz,
Institute of Electronics, COST B11 report. Brussels 1998:9–11
Mohanaiah P, Sathyanarayana P, GuruKumar L (2013) Image texture feature extraction using GLCM
approach. Int J Sci Res Publ 3(5)
Pathak B, Barooah D (2013) Texture analysis based on the gray-level co-occurrence matrix consid-
ering possible orientations. Int J Adv Res Electric Electron Instrum Eng 2(9):4206–4212
Singh I (2014) Performance evaluation of texture based image segmentation using GLCM. Int J
Adv Image Process Tech IJIPT 1(3):2372–3998
Wen C, Zhang Y, Deng K (2011) Urban area classification in high resolution SAR based on texture
features. In: International conference on geo-spatial solutions for emergency management and
the 50th anniversary of the Chinese academy of surveying and mapping
Wankhade PD (2014) A review on Aspects of Texture analysis of images. Int J Appl Innova Eng
Manage (IJAIEM) 3(10)
Part III
Applications and Issues
Chapter 17
Offline Signature Verification Using
Galois Field-Based Texture
Representation

S. Shivashankar, Medha Kudari, and S. Prakash Hiremath

Abstract Signature verification has been one of the popular research areas in biometric
recognition, alongside physiological traits such as face recognition.
Offline signatures can be treated as texture images, and thus texture representation
methods can be applied to such images. One such texture representation is based on
Galois fields. In this work, after application of the Galois field operator, the cumulative
histogram is built and normalized; the obtained bin values are considered as the
features of a signature image. A k-NN classifier is used for offline signature verification.
Experiments conducted on a benchmark dataset, namely the GPDS synthetic signature
database, support the applicability of the proposed method.

17.1 Introduction

Biometric systems are employed in a huge range of security applications. Biometrics


is an ever-expanding research field with more physiological or behavioural traits
being considered to identify an individual. Behavioural traits like signature (Pal et al.
2016), gait (Semwal et al. 2015) and keystroke dynamics (Boccignone et al. 1993)
have been employed either individually or in conjunction with other physiological
traits to uniquely represent a person. The handwritten signature is an important
behavioural trait used in biometric systems because of its usage in daily life for
verifying a user's identity in administrative, financial and legal matters. The collection
of signatures in offline mode is simple and non-invasive, and is hence gaining popularity
(Jain et al. 2004).
A biometric system can be viewed as a pattern recognition system. A bio-
metric system is used for either identification or verification purposes. In ver-
ification applications, a user's identity is validated by comparing the user's captured
biometric features against the user’s biometric features stored in the database (Jain
et al. 2004). In signature verification system, a query signature can be classified
as genuine or forged. Forgeries can be of three types: simple, skilled and random
forgeries. In random forgeries, the person doing the forgery does not know the other
person’s signature or the other person. The forger unknowingly uses his signature
which is in different shape and style from the original signature of the person. This
leads to a very different semantic from the original signature. In simple forgeries,
the forger deliberately tries to copy the person’s signature. The forger is aware of
the person’s name but not the person’s signature. The forgery may be similar to the
person’s original signature but not exactly the same. In skilled forgeries, the forger
is aware of the person’s name and signature. The forger copies the person’s signa-
ture as closely as possible to the original signature, and this type of forgery is hard to
detect (Hafemann et al. 2017).
Offline and online are the two types of signature verification systems, depending
on the acquisition method of the signatures. User’s signatures are acquired using
an acquisition device, like a digitizing tablet, in the online type. Online signatures are
collected as a sequence over time, such as the co-ordinates of writing points, angle,
direction of the pen and pen pressure. Online signatures are difficult to forge, since
they contain dynamic information. In the case of offline type, after the completion
of the writing process, the signature is acquired. Here, the signature is treated as a
digital image (Hafemann et al. 2017). Hence, offline signature verification is a pattern
recognition problem.
Some of the research works on the topic are presented as follows: Kalera, Srihari
and Xu developed an offline signature verification method based on a quasi-
multiresolution technique using structural, concavity and gradient features for feature
extraction (Kalera et al. 2004). Fierrez-Aguilar, Alonso-Hermira, Moreno-Marquez
and Ortega Garcia proposed an offline signature verification system employing the
fusion of global and local information (Fierrez-Aguilar et al. 2004). Global, direc-
tional and grid features of signatures were used for offline signature verification by
Ozgunduz, Senturk and Karsligil (Ozgunduz et al. 2005). Kiani, Pourreza and Pour-
reza (Kiani et al. 2009) and Bharadi and Kekre (Bharadi et al. 2010) used local Radon
transform and cluster-based global features for feature extraction, respectively. Pun
and Lee extracted features using log-polar transform to eliminate rotation and scale
effects in the input image (Pun et al. 2003). Local binary patterns (LBP) used grey
level distribution to enhance statistical and structural analysis of textural patterns
(Ojala et al. 2002). Vargas, Ferrer, Travieso and Alonso used co-occurrence matrix
and LBP to extract the grey level statistical texture features at the global level (Vargas
et al 2011). Ferrer, Vargas, Morales and Ordonez proved the robustness of grey level
features extracted from a distorted signature image in (Ferrer et al. 2012). Wajid and
Mansoor evaluated the performance of classifiers using the feature vector formed by
a code matrix of LBPs, created from divided signature images (Wajid et al. 2013).
Serdouk, Nemmour and Chibani developed a descriptor called orthogonal combi-
nation local binary pattern (OC-LBP) based on orthogonal combination of LBP
(Serdouk et al. 0000). Shekar, Bharathi, Kittler, Vizilter and Mestestskiy represented
grid structured morphological spectrum in the form of a histogram for offline signa-
17 Offline Signature Verification Using Galois … 271

ture verification (Shekar et al. 2015). Pal, Alaei, Pal and Blumenstein used LBP and
ULBP in extracting features from offline signature images (Pal et al. 2016). Yilmaz
and Yanikoglu presented an offline signature verification system that used histogram
of LBP, oriented gradients and scale invariant feature transform descriptors to gener-
ate a score-level fusion of complementary classifiers (Yilmaz et al. 2016). The more
recent advancements in the field are summarized in the literature review presented
by Hafemann, Sabourin and Oliveira (Hafemann et al. 2017).
The paper presents an offline signature verification method using a Galois field-
based texture representation. The proposed method consists of two steps: extraction
of features and signature verification (classification). In the extraction of features
step, the features are extracted after the Galois field operator has been applied on the sig-
nature image. During verification, the features of the signature image in question
are compared with the features of the genuine signature images that are stored in the
database. A k-Nearest Neighbour (k-NN) classifier is used in the present study. Exper-
imentations have also been done using the log-polar transform and rotation invariant
LBP (RILBP) method to depict the efficacy of the proposed method.
In the next section, the Galois field-based texture representation and feature extrac-
tion from the signature image is presented in detail. Section 17.3 describes briefly
the classification technique employed in the present study. The experimental settings
and the realized results are given in Sect. 17.4 followed by conclusion of the present
study in Sect. 17.5.

17.2 Galois Field-Based Texture Representation


and Feature Extraction

To represent the characteristics in the signature of an individual person, Galois field-


based representation is employed. Further, the cumulative number of occurrences of
each grey level of the Galois field-operated image is normalized. These normalized
values are considered as the features for the problem of signature verification. For
comparison, the log-polar transform method and the RILBP method have also been
implemented.

17.2.1 Galois Field-Based Texture Representation

Texture description based on Galois fields was implemented for scale and rotation
invariant texture classification (Shivashankar et al. 2017, 2018). The same method-
ology has been applied to signature images which are grey scale images with the
handwritten signature representing texture in the image. A grey scale image has
intensity values ranging from 0 to 255, with 0 representing black and 255 indicating

white. This can be represented in a Galois field of 2^8, which has 256 values. The
Galois field-based texture representation procedure is as given below.
Step 1: Consider a pixel I(i, j) and its first eight neighbours in an image I
Step 2: Perform a bitwise XOR operation on all the nine pixels (addition in GF(2^8))
Step 3: Convert the binary number obtained into a decimal value
Step 4: Steps 1, 2 and 3 are repeated for all the pixels in the image I, which results
in the transformed image I′
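A minimal NumPy sketch of this operator is given below; it is our own illustration of the four steps (border pixels are simply left unchanged here), not the authors' implementation.

import numpy as np

def galois_field_operator(image):
    # Replace each pixel by the bitwise XOR (addition in GF(2^8)) of the 3 x 3
    # neighbourhood centred on it; the result is the transformed image I'
    img = image.astype(np.uint8)
    out = img.copy()
    for r in range(1, img.shape[0] - 1):
        for c in range(1, img.shape[1] - 1):
            block = img[r - 1:r + 2, c - 1:c + 2]             # pixel and its 8 neighbours
            out[r, c] = np.bitwise_xor.reduce(block.ravel())  # Steps 2 and 3
    return out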

17.2.2 Extraction of Features

The features are extracted from the transformed image I′ as follows:


Step 1: Construct the histogram with 50 bins for the transformed image I′.
The histogram is given by

h(r_k) = n_k    (17.1)

where r_k is the k-th grey-level value and n_k is the number of pixels with value r_k. The
histogram represents the number of pixels with a particular intensity, which lies within
the range of 0–255 in that image.
Step 2: The Ck (cumulative histogram) is calculated as


C_k = \sum_{j=0}^{k} h(r_j)    (17.2)

where k = 0, 1, 2, ..., 49
Step 3: The normalization of Ck is given in (17.3)

NC_k = C_k / \|C\|    (17.3)

where \|C\| = \sqrt{C_0^2 + C_1^2 + C_2^2 + \cdots + C_k^2 + \cdots + C_{49}^2}

The normalized cumulative number of occurrences of each grey level, NC_k, of
the Galois field-operated image forms the feature vector. These are found to be strong
features for texture representation (Shivashankar et al. 2017, 2018).
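The three steps map onto a short Python sketch (names are ours):

import numpy as np

def galois_features(transformed, bins=50):
    # Step 1: 50-bin histogram of the Galois-field-operated image (grey levels 0-255)
    hist, _ = np.histogram(transformed, bins=bins, range=(0, 256))
    # Step 2: cumulative histogram C_k (Eq. 17.2)
    cumulative = np.cumsum(hist).astype(float)
    # Step 3: normalize by the Euclidean norm ||C|| (Eq. 17.3)
    return cumulative / np.linalg.norm(cumulative)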

17.3 Classification

Classification of the extracted feature data is performed by supervised classification.
The classifier is trained with a known set of data, called the training set, during the train-
ing stage. The training set contains the features extracted using the method proposed
in Sect. 17.2. The unknown data, called the testing set, contains features extracted in
the same way. The testing set is classified based on the minimum distance to the
training set during the testing stage. Measuring the similarity between images is of
central importance in low-level computer vision; here, the distance between the trained
data and the tested data serves as the similarity measure.
In the experiments conducted, Galois field-based features are obtained from all
the signature images and classified using the k-nearest neighbour classifier with the
k value initialized to 1, i.e. the minimum distance between the training and test
sample is considered. The similarity between two histograms is measured with the
Euclidean and Chi Square distance measures (Duda et al. 1973).
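A minimal sketch of the 1-NN decision rule with the two distance measures follows; the epsilon guard in the Chi Square distance and all names are our additions.

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def chi_square(a, b, eps=1e-10):
    # Chi Square distance between two normalized histograms
    return np.sum((a - b) ** 2 / (a + b + eps))

def nn_classify(test_feature, train_features, train_labels, dist=euclidean):
    # k-NN with k = 1: return the label of the closest training feature vector
    distances = [dist(test_feature, f) for f in train_features]
    return train_labels[int(np.argmin(distances))]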

17.4 Experiments

17.4.1 Signature Dataset

The GPDS synthetic signature dataset is an offline benchmark signature database


(Ferrer et al. 2013). Data is collected from 4000 synthetic individuals. Details of the
database are available in (Ferrer et al. 2013).
Here, signatures of 250 individuals were considered for the experiments. Each
individual has 30 forgeries along with 24 genuine signatures. Differently modelled
pens were used to generate all the signatures. The training set is formed using 16
genuine and 16 forged signatures per individual, and the remaining 8 genuine and 14
forged signatures form the testing set. In total, 8000 signature images are used as the
training set and 5500 signature images form the testing set. Figure 17.1 shows some
sample genuine and forged signatures.

Fig. 17.1 Sample images of GPDS synthetic signature dataset. Genuine images are displayed in
the first row, and forged images are displayed in the second row

17.4.2 Evaluation Methodology

Evaluation of a signature verification system is done based on two types of errors:


false rejection rate (FRR) is Type I error and false acceptance rate (FAR) is Type II
error. Type I error occurs when an original signature is rejected. Type II error occurs
when a duplicate is accepted. Lastly, the average of FRR and FAR is the average
error rate (AER) (Vargas et al 2011).

FAR = (number of forgeries accepted) / (number of forgeries tested)    (17.4)

FRR = (number of genuines rejected) / (number of genuines tested)    (17.5)

AER = (FAR + FRR) / 2    (17.6)
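For completeness, the three error rates reduce to a few lines of Python (names are ours):

def error_rates(forgeries_accepted, forgeries_tested, genuines_rejected, genuines_tested):
    far = forgeries_accepted / forgeries_tested      # Eq. (17.4)
    frr = genuines_rejected / genuines_tested        # Eq. (17.5)
    aer = (far + frr) / 2.0                          # Eq. (17.6)
    return far, frr, aer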

17.4.3 Results and Discussion

Table 17.1 displays the results obtained with the proposed method and Euclidean
distance for signatures of different number of people in the GPDS synthetic signature
dataset. FAR of 0.23, 0.04 and 0.02, FRR of 0.68, 0.93 and 0.94, AER of 0.46, 0.49
and 0.48 were obtained with Euclidean distance measure for 10 persons, 100 persons
and 250 persons’ signatures, respectively.
It is observed that FRR is consistently greater than FAR, indicating that forged signature
images are rarely accepted by the method proposed in Sect. 17.2. The consistency
of the values of AER in the last column of Table 17.2 shows that the trend will
continue as more number of signatures are included for classification. Depending
on the threshold used in the verification system, FRR and FAR can be changed by a
significant amount. The performance of a biometric system may be expressed using
the AER. Better performance is indicated by a lower AER value.
In what follows, the experiment is described to show how a different distance
metric affects the system performance. When Chi Square distance metric is applied

Table 17.1 FAR, FRR and AER of different number of people using the proposed method with
Euclidean as distance measure
No. of people FAR FRR AER
10 persons 0.23 0.68 0.46
100 persons 0.04 0.93 0.49
250 persons 0.02 0.94 0.48

Table 17.2 FAR, FRR and AER of different number of people using the proposed method with
Chi Square as distance measure
No. of people FAR FRR AER
10 persons 0.25 0.65 0.45
100 persons 0.05 0.91 0.48
250 persons 0.02 0.94 0.48

Table 17.3 Signature verification using RILBP, log-polar and proposed method using Euclidean
distance measure

Methods                              10 persons   100 persons   250 persons
FAR (false acceptance rate)
  RILBP (Ojala et al. 2002)             0.17          0.04          0.01
  Log-polar (Pun et al. 2003)           0.20          0.03          0.01
  Proposed                              0.23          0.04          0.02
FRR (false rejection rate)
  RILBP (Ojala et al. 2002)             0.66          0.91          0.94
  Log-polar (Pun et al. 2003)           0.70          0.94          0.96
  Proposed                              0.68          0.93          0.94
AER (average error rate)
  RILBP (Ojala et al. 2002)             0.41          0.48          0.48
  Log-polar (Pun et al. 2003)           0.45          0.49          0.49
  Proposed                              0.46          0.49          0.48

in the place of Euclidean distance metric in the same experiment as described above,
the results in Table 17.2 are obtained. False acceptance rate of 0.25, 0.05 and 0.02
is observed for 10, 100 and 250 persons’ signatures with Chi Square as the distance
measure. False rejection rate of 0.65, 0.91 and 0.94 was observed for the same.
Average error rates of 0.45, 0.48 and 0.48 were recorded for signatures of 10 persons,
100 persons and 250 persons, respectively.
Experiments are performed on the GPDS synthetic signatures database with
RILBP method with neighbour pixels P = 8 and radius set to R = 1, to evaluate
against the performance of the proposed method. FAR of 0.17, 0.04 and 0.01, FRR
of 0.66, 0.91 and 0.94 and AER of 0.41, 0.48 and 0.48 were obtained for RILBP
with Euclidean distance for signatures of 10, 100 and 250 people. Log-polar trans-
form method is applied on the GPDS synthetic signatures database with Euclidean
distance, and a FAR of 0.20, 0.03 and 0.01, FRR of 0.70, 0.94 and 0.96 and AER
of 0.45, 0.49 and 0.49 were recorded for signatures of 10, 100 and 250 people.
The results are tabulated in Table 17.3.

Table 17.4 Signature verification using RILBP, log-polar and proposed method using Chi Square
distance measure

Methods                              10 persons   100 persons   250 persons
FAR (false acceptance rate)
  RILBP (Ojala et al. 2002)             0.17          0.05          0.02
  Log-polar (Pun et al. 2003)           0.19          0.03          0.01
  Proposed                              0.25          0.05          0.02
FRR (false rejection rate)
  RILBP (Ojala et al. 2002)             0.46          0.88          0.93
  Log-polar (Pun et al. 2003)           0.65          0.93          0.96
  Proposed                              0.65          0.91          0.94
AER (average error rate)
  RILBP (Ojala et al. 2002)             0.32          0.46          0.48
  Log-polar (Pun et al. 2003)           0.42          0.48          0.48
  Proposed                              0.45          0.48          0.48

The experiments are repeated using RILBP and log-polar transform with Chi
Square as distance measure. FAR of 0.17, 0.05 and 0.02 for RILBP and FAR of 0.19,
0.03, and 0.01 for log-polar transform method were obtained for signatures of 10,
100 and 250 people, respectively. FRR of 0.46, 0.88 and 0.93 for RILBP and 0.65,
0.93 and 0.96 for log-polar transform methods were observed for 10, 100 and 250
people’s signatures. AER of 0.32, 0.46 and 0.48 for RILBP and AER of 0.42, 0.48
and 0.48 for log-polar transforms methods are recorded for 10, 100 and 250 people
signatures, respectively, and are presented in Table 17.4 along with the values for
the proposed method. The proposed method’s performance is comparable with the
existing methods.

17.5 Conclusion

Signature images have been represented efficiently by the Galois field-based operator.
After application of the Galois field operator, the cumulative histogram is built and
normalized, and the obtained bin values are taken as the features of a signature
image. A k-NN classifier is employed for offline signature verification. Experiments
conducted on the GPDS synthetic signatures database illustrate that the low false
acceptance rate indicates a low tolerance for forged signatures; in other words, the
system is robust to forgery. The consistent values of the average error rate indicate a
competent and robust system. Comparing the Galois field operator method proposed
in Sect. 17.2 with methods like the log-polar transform and RILBP confirms the
efficiency of the offline signature verification system.

References

Bharadi VA, Kekre HB (2010) Off-line signature recognition systems. Int J Comput Appl 1(27):48–
56
Boccignone G, Chianese A, Cordella LP, Marcelli Angelo (1993) Recovering dynamic information
from static handwriting. Pattern Recogn 26(3):409–418
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, A Wiley-Interscience
Publication, New York
Ferrer Miguel A, Vargas JF, Morales A, Ordonez A (2012) Robustness of offline signature verifi-
cation based on gray level features. IEEE Trans Inf Forensics Secur 7(3):966–977
Ferrer MA, Diaz-Cabrera M, Morales A et al (2013) Synthetic off-line signature image generation.
In: 6th IAPR international conference on biometrics (ICB), pp 1–7
Fierrez-Aguilar J, Alonso-Hermira N, Moreno-Marquez G, Ortega-Garcia J (2004) An off-line
signature verification system based on fusion of local and global information. In: International
workshop on biometric authentication. Springer, pp 295–306
Hafemann LG, Sabourin R, Oliveira LS (2017) Offline handwritten signature verification—literature
review. In: 2017 seventh international conference on image processing theory, tools and applica-
tions (IPTA), pp 1–8
Jain AK, Ross A, Prabhakar Salil (2004) An introduction to biometric recognition. IEEE Trans Circ
Syst Video Technol 14:4–20
Kalera MK, Srihari S, Xu Aihua (2004) Offline signature verification and identification using
distance statistics. Int J Pattern Recogn Artif Intell 18(07):1339–1360
Kiani V, Pourreza Shahri R, Pourreza HR (2009) Int J Image Process 3
Ojala T, Pietikainen M, Maenpaa Topi (2002) Multiresolution gray-scale and rotation invariant
texture classification with local binary patterns. IEEE Trans Pattern Anal Mach Intell 24(7):971–
987
Ozgunduz E, Senturk T, Karsligil ME (2005) Off-line signature verification and recognition by
support vector machine. In: Signal processing conference, 2005 13th European IEEE, pp 1–4
Pal S, Alaei A, Pal U, Blumenstein M(2016) Performance of an off-line signature verification
method based on texture features on a large indic-script signature dataset. In: 2016 12th IAPR
workshop on document analysis systems (DAS), pp 72–77
Pun C-M, Lee M-C (2003) Log-polar wavelet energy signatures for rotation and scale invariant
texture classification. IEEE Trans Pattern Anal Mach Intell 25(5):590–603
Semwal VB, Raj M, Nandi Gora Chand (2015) Biometric gait identification based on a multilayer
perceptron. Robot Autonom Syst 65:65–75
Serdouk Y, Nemmour H, Chibani Y Orthogonal combination and rotation invariant of local binary
patterns for off-line handwritten signature verification
Shekar BH, Bharathi RK, Kittler J, Vizilter YV, Mestestskiy L (2015) Grid structured morpho-
logical pattern spectrum for off-line signature verification. In: 2015 international conference on
biometrics (ICB), pp 430–435
Shivashankar S, Kudari M, Hiremath PS (2017) Texture representation using galois field for rota-
tion invariant classification. 2017 13th international conference on signal-image technology &
internet-based systems (SITIS), pp 237–240
Shivashankar S, Kudari M, Hiremath PS (2018) Galois field-based approach for rotation and scale
invariant texture classification. Int J Image Graph Signal Process (IJIGSP) 10(9):56–64

Vargas JF, Ferrer MA, Travieso CM, Alonso JB (2011) Off-line signature verification based on grey
level information using texture features. Pattern Recogn 44(2):375–385
Wajid R, Mansoor AB (2013) Classifier performance evaluation for offline signature verification
using local binary patterns. In: 2013 4th European workshop on visual information processing
(EUVIP), pp 250–254
Yilmaz MB, Yanikouglu B (2016) Score level fusion of classifiers in off-line signature verification.
Inf Fusion 32:109–119
Chapter 18
Face Recognition Using 3D CNNs

Nayaneesh Kumar Mishra and Satish Kumar Singh

Abstract The area of face recognition is one of the most widely researched areas
in the domains of computer vision and biometrics. This is because the non-intrusive
nature of the face biometric makes it comparatively more suitable for application in
surveillance at public places such as airports. The application of primitive methods
to face recognition could not give very satisfactory performance. However, with the
advent of machine and deep learning methods and their application to face recogni-
tion, several major breakthroughs were obtained. The use of 2D convolutional neural
networks (2D CNN) in face recognition surpassed human face recognition accuracy
and reached 99%. Still, robust face recognition in the presence of real-world con-
ditions such as variation in resolution, illumination and pose is a major challenge for
researchers in face recognition. In this work, we use video as input to 3D CNN
architectures for capturing both spatial and time-domain information from the video
for face recognition in a real-world environment. For the purpose of experimentation,
we have developed our own video dataset called the CVBL video dataset. The use of
3D CNN for face recognition in videos shows promising results, with DenseNets
performing the best with an accuracy of 97% on the CVBL dataset.

18.1 Introduction

Face recognition started long back in the 1990s, and since then, the algorithms have
become more efficient. Various algorithms were applied to detect face in an image,
and subsequently, the recognition of face was done using a recognition algorithm.
Researchers developed various mathematical models and features to represent and
recognize faces. The features were based on traits of face such as geometry, texture,
color and appearance (Brunelli and Poggio 1993; Chellappa et al. 1995; Jain et al.
2000; Turk and Pentland 1991; Viola and Jones 2004; Wiskott et al. 1997) . No
feature was able to represent the face with all its complex dimensions. In addition
to this, recognition of face was made difficult by real-world challenges such as

varying illumination, pose and resolution. Various image transformations and super-
resolution methods have been proposed to deal with these challenges (Ahonen et al.
2004, 2008; Bilgazyev et al. 2011; Gunturk et al. 2003; Xie et al. 2011; Zhu et al.
2015). In spite of this, real-world applications are still not reliable and robust.
In the case of face recognition from video, the actual processing was done at the frame
level, which is actually an image. The best frame among all the frames of the video
was selected based on the quality of the face in the image, and a recognition algorithm
was applied subsequently (Wibowo et al. 2012; Wibowo and Tjondronegoro 2011).
With the advent of deep learning architectures, 2D convolution networks came to be
applied to images or frames of videos to detect and recognize faces. The generation of
features was no longer manual. Deep features, though undecipherable, were better than
manually developed features for face recognition. This led to an increase in accuracy
and robustness. However, deep architectures did not treat a video as one input; rather,
they generated spatial features based on a series of input frames.
In the recent works, 3D CNN showed good results for activity recognition in
videos (Tran et al. 2015). This is because unlike 2D CNNs, 3D CNNs are capable
of modeling the time dimension as well as the spatial dimension. The 3D CNNs
accepted and treated a video as a single-input unit. 3D CNNs could generate a single
compact feature that contained facial trait as well as body language, gait pattern and
any other temporal and spatial pattern that may be relevant to classification.
The concept of residual networks (He et al. 2016) allowed for more depth in deep
learning networks without the limitation of vanishing gradient. While 2D residual
networks were successfully applied for image classification (He et al. 2016), 3D
residual networks have been designed to extend the capability of residual networks
in third dimension as well (Hara et al. 2017). 3D residual networks have been success-
fully applied for activity recognition using videos (Hara et al. 2017). The application
of 3D CNNs and 3D residual networks for activity recognition motivated us to use
3D CNNs for face recognition using videos.
Apart from YTF (Wolf et al. 2011) dataset, all the other video datasets such
as UCF (Khurram et al. 2012) and HMDB (Kuehne et al. 2011) are available for
activity recognition. We have therefore created a comprehensive biometric dataset
with modalities video, iris and fingerprint. The video dataset has been used in our
experiments for face recognition.
In this paper, we perform face recognition on the CVBL facial video dataset. This
paper therefore makes the following contributions:
i. A comprehensive biometric dataset called the CVBL dataset containing video,
iris and fingerprint modalities has been collected.
ii. It uses 3D residual networks to find out the accuracy for face recognition in
videos.
iii. It compares the accuracy of 3D residual network for different depths of residual
networks in case of face recognition in videos.
iv. It also compares the accuracy of different genres of 3D residual networks in case
of face recognition in videos.

The previous work on face recognition has been discussed in Sect. 18.2. Section
18.3 discusses in detail about our comprehensive biometric dataset called the CVBL
dataset. The residual network architectures used in the experiment and configuration
detail related to the experiment have been discussed in Sect. 18.4. The exact imple-
mentation details are discussed in Sect. 18.5 followed by discussion over the result
in Sect. 18.6. Finally, the conclusion and future scope are discussed in Sect. 18.7.

18.2 Related Work

18.2.1 Deep Architectures for Face Recognition Approaches

A lot of work has been done in the field of face recognition using images. Convolution
neural networks (CNN) are being used for face recognition these days. In (Hadsell
et al. 2006), the authors introduced the concept of contrastive loss. The contrastive
loss is based on the Euclidean distance between two points. In contrastive loss, the
points in a higher dimension are mapped to a manifold such that the Euclidean distance
between the points on the manifold corresponds to the similarity between the same
two points in the higher-dimensional input space. In contrastive loss, the CNNs are
trained using pairs of images. The contrastive loss is such that it tries to generate
highly discriminative features when the training images in the pair are dissimilar to
each other. In case the images in the pair for training are same, the contrastive loss
tries to generate similar features for the images.
In (Schroff et al. 2015), triplet loss was introduced. The author trained a CNN
using triplets of images containing an anchor image which is the actual image, the
positive image which is the image of the same person as in anchor image and a
negative image which consists of an image of a person different from that in anchor
image. The training is done to obtain discriminative features such that it increases
the distance between the anchor face and negative face and decreases the distance
between positive and anchor face. In case of both contrastive loss and triplet loss,
organizing the batches into pairs or triplets such that they satisfy a given condition is in
itself a difficult and complex process.
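For orientation, the triplet objective described above can be written in a few lines of PyTorch; this is a generic sketch of the loss (the margin value is illustrative), not the exact implementation of Schroff et al. (2015).

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor, positive, negative: (batch, dim) embeddings
    pos_dist = (anchor - positive).pow(2).sum(dim=1)   # distance to the positive sample
    neg_dist = (anchor - negative).pow(2).sum(dim=1)   # distance to the negative sample
    # push the negative at least `margin` farther away than the positive
    return F.relu(pos_dist - neg_dist + margin).mean()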
In the sequence of improvement of loss function to increase the discriminative
power of features for face recognition, a new loss function was proposed by Liu et.
al. In his paper (Liu et al. 2016), Liu proposed a generalized large-margin softmax
(L-Softmax) loss which explicitly encourages intra-class compactness and inter-class
separability between learned features. L-softmax not only can adjust the desired
margin but also can avoid overfitting.
Liu et al. (2017) proposed a new loss function called A-softmax as an
extension and improvement of L-softmax. A-Softmax loss can be viewed as
imposing discriminative constraints on a hypersphere manifold, which intrinsically
matches the prior so that faces also lie on a manifold. The size of angular margin can be
quantitatively adjusted by a parameter m. This makes the learning better by increasing

the angular margin between the classes and making the feature discrimination better
than L-softmax. This paper has used two datasets for performance analysis. One is
labeled face in the wild (LFW), and the other is YouTube Faces (YTF). A-softmax,
which has also been called SphereFace in the paper, achieves 99.42% and
95.0% accuracies on LFW and YTF datasets, respectively. In an extension to the
angular softmax loss, Deng et. al. in his work (Deng et al. 2018) tried to increase the
inter-class separability by introducing the concept of additive angular margin.
A lot of work has been done on images for face recognition. However, all the works
on images in face recognition are prone to spoofing. This can be overcome only if we
can use videos for face recognition. This will allow the system to check the liveliness
of the person by the random body and face movements and thus avoid spoofing.
Activity recognition is one domain where deep learning has been successfully
applied for processing temporal domain along with the spatial domain.

18.2.2 Activity Recognition Approaches

Karpathy et al. (2014) used two-stream convolution network for activity recognition
in video. He used one stream to input centrally cropped video frames and the other stream
to input the full frame but at half the original resolution. The two streams got con-
catenated later in the fully connected layer. The use of two-stream architecture made
the processing of videos 2–4 times faster than in case of a single stream architecture.
However, the problem of capturing the temporal dimension still remained because
the use of 2D convolution in the two-stream architecture limited the architecture
from capturing the temporal dimension.
Yue-Hei et al. (2015) applied an array of Long short-term memory (LSTM) cells
to capture the temporal dimension in videos for activity recognition. Because of the
use of LSTMs, the method was capable of handling full-length videos. This meant
that the architecture using LSTM was able to model the temporal change across the
entire length of the video. Firstly, a layer of CNN processed frames of videos in
sequence to produce spatial features. These spatial features were passed to LSTM
for extracting temporal features. Jeff et. al. in his research work (Donahue et al.
2015) also applied LSTMs in a different architecture but with the same objective of
modeling the temporal dimension for activity recognition. However, the LSTM-based
architectures did not give better accuracy than two-stream-based architectures.
Tran et. al. in his work (Tran et al. 2015) used 3D CNNs for activity recognition.
Using his 3D CNN, he could capture both spatial and temporal dimension in his
features. The features extracted from videos using 3D CNN are highly efficient,
compact and extremely simple to use. He called these features C3D. Tran et. al.
demonstrated that C3D features along with a linear classifier can outperform or
approach current best methods on different video analysis benchmarks. However,
the only problem with 3D CNN is that the 3D CNNs cannot capture the entire length

of video sequence in one go. This causes a limitation in capturing the temporal
dimension if the length of the temporal activity is longer than the number of frames
captured by 3D CNN.
Tran in his work (Tran et al. 2015) used the temporal depth of 16 frames. Laptev
et. al. in his work (Varol et al. 2018) tried to figure out what happens to activity recog-
nition accuracy if we change the temporal depth of video clip. Laptev experimented
for temporal depth of 16, 20, 40, 60, 80 and 100. He found that on increasing the
temporal depth, the accuracy for activity recognition increased. This was because
the 3D CNN architecture could model the activity in a better way when the number
of frames were more. Thus, this experiment also confirmed that temporal dimension
played an important role in the activity recognition. However, greater temporal depth
also required more processing.
In 2016, He et al. (2016) came up with the idea of residual networks and won the
first place in several tracks in ILSVRC & COCO 2015 competitions. However, in
ILSVRC, the architectures are tested on images.
In 2017, Hara et al. (2017) extended the concept of residual networks from 2D
to 3D. Kensho applied 3D residual networks for activity recognition in videos. He
changed the depth of 3D residual networks and tried to experiment its effect on
the accuracy. He found that as we increase the depth of the residual networks, the
accuracy increases till it reached the depth of 151. Upon further increasing the depth,
the accuracy for activity recognition saturates. With this experiment, it was clear that
with the increasing depth, it was possible to capture better features and thus increase
the accuracy of the activity recognition.
The work of Hara et al. (2017) motivated us to experiment if the residual networks
can be used to identify a person in a video. To realize this purpose, we developed a
video dataset of our own for face recognition called the CVBL dataset.

18.3 Video Dataset

The CVBL dataset (CVBL Dataset 2018) is named after the lab that created it. The
CVBL biometric data (CVBL Dataset 2018) is an exhaustive biometric dataset
consisting of facial videos, fingerprints and signatures of each subject.
The dataset consists of biometric data of 125 school-going children below the age
of 15.
From the CVBL dataset, we used the face video dataset for our face recognition
experiment. The face video dataset consists of 320 × 240 videos taken at 30
frames per second. Each video is at most one minute long, and at least five such
videos of each subject have been taken. The videos show the subjects talking and
expressing themselves freely while being seated at a place in front of the camera as
shown in Fig. 18.1. There are 125 different subjects, and thus, there are 125 classes
for face recognition. The subjects are facing the camera; however, they can move
their face in any direction while talking. The videos include static background, and
there is no camera motion. More number of subjects will be included in the CVBL
dataset (CVBL Dataset 2018) in the future.

Fig. 18.1 Frames from CVBL dataset

In our experiment, out of a total of 675 videos, 415 videos have been taken for training
and the remaining 260 for testing. Thus, the training-to-testing split ratio is approximately 60:40.

18.4 Experimental Configuration

18.4.1 Summary

Our objective is to find out the accuracy of 3D ResNets on face recognition video
dataset. In addition, we also wanted to know how the accuracy of face recognition
changes with change in depth and genre of residual networks. For this purpose, we
used the code from (Hara et al. 2017) for experimentation and modified it as per
our requirements and objectives. The code for face recognition experiments (Hara
et al. 2017) uses the PyTorch library (PyTorch 2018). We begin our analysis by checking
whether the size of the dataset is large enough to train residual networks of
such large depths. We therefore start with a depth of 18, assuming that if ResNet-18
overfits, then we can conclude that the size of the dataset is too small to train
architectures of such depth. We will experiment with larger depths of ResNets only
if we are convinced that the CVBL dataset is large enough to train ResNet-18 without
overfitting.

18.4.2 Network Architectures

In this section, all those network architectures will be discussed which we plan to
implement and analyze them over training them on CVBL dataset. In this paper,
ResNet architectures of various depth have been experimented. The ResNet archi-
tectures have a special property that they allow shortcut connections to bypass layers
in between to move to the next layer. However, back-propagation still takes place
without any problem.
Apart from the ResNet (basic and bottleneck blocks) (He et al. 2016), follow-
ing extensions of the ResNet architecture have also been used for experiment: Pre-
activation ResNet (He et al. 2016), Wide ResNet (WRN) (Zagoruyko and Komodakis
2016), ResNeXt (Xie et al. 2017) and DenseNet (Huang et al. 2017).

A basic ResNet block (He et al. 2016) is the simplest ResNet block and consists of
only two convolution layers. Each of the convolution layers is followed by a batch
normalization layer and a non-linearization layer ReLU. A shortcut connection is
also provided between the top of the block and to the layer just before the last ReLU
in the block. ResNets-18 and ResNets-34 adopt the basic ResNet block structure.
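A minimal PyTorch sketch of such a 3D basic block is shown below (two 3 × 3 × 3 convolutions, each followed by batch normalization, with ReLU non-linearities and an identity shortcut); it is our simplified illustration, not the reference implementation of Hara et al. (2017), and it omits the stride and downsampling handling used between stages.

import torch.nn as nn

class BasicBlock3D(nn.Module):
    # Two 3x3x3 convolution layers with batch normalization and ReLU,
    # plus an identity shortcut from the block input to its output.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                  # x: (batch, channels, frames, height, width)
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)   # shortcut connection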
A ResNet bottleneck block (He et al. 2016) is different from the basic ResNets
block in the sense that it consists of three convolution layers instead of two. As in case
of basic ResNets block, each convolution layer is followed by batch normalization
layer and ReLU layer. The first and third convolution layers consist of the filters
of size 1 × 1 × 1, whereas the second convolution layer consists of filters of size
3 × 3 × 3. The networks which adopt ResNets bottleneck block are ResNet-50, 101,
152 and 200. The 1 × 1 × 1 convolutions (Lin et al. 2013) help the network to go
deeper by being computationally efficient as well as contains more information than
otherwise.
Unlike bottleneck ResNet, where each convolution layer is followed by batch
normalization and a ReLU, in case of pre-activation ResNet (He et al. 2016), the
batch normalization layer and the ReLU layer come before convolutional layer. He
et al. (2016) also confirmed in his studies on ResNet that pre-activation ResNets are
better in optimization and avoiding overfitting. The shortcut in case of pre-activation
ResNets connects the top of the block to the layer just after the last convolution layer
in the block. Pre-activation ResNet-200 is an example using pre-activation ResNet.
Wide ResNets (Zagoruyko and Komodakis 2016) increase the width of the residual
network instead of increasing the depth of the residual network. Width here means
the number of feature maps in one layer. If we talk of a convolution layer network,
the number of feature maps corresponds to the number of filters in a convolution
layer. In a neural network, the width corresponds to the number of neurons in a
layer. In (Zagoruyko and Komodakis 2016), the authors increased the width instead
of depth and showed that same accuracy can be gained by increasing width instead
of depth. Several other authors (Zagoruyko and Komodakis 2016) however feel that
the increase in accuracy was not because of increase in number of feature maps but
because of increase in number of parameters. The increase in number of parameters
can also cause overfitting.
DenseNets (Huang et al. 2017) are those residual network which exploit the con-
cept of feature reuse. In DenseNets, the features from early layers are used in the
later layers by the providing direct connections from every early layer to every later
layers in the feed-forward fashion and concatenating them. This process makes the
interconnections very dense and hence the name. The concept of pre-activations used
in pre-activation ResNets has also been used in DenseNets to reduce the number of
parameters and yet achieve better accuracy than ResNets. The number of feature
maps at each layer is called the growth rate in case of DenseNets. This is because the
features maps at a particular layer grow after concatenation with the feature maps
of the previous layer. DenseNet-121 and DenseNet-201 with growth rate of 32 are
examples of DenseNets.
Xie et al. (2017) introduced a new term called cardinality. As per the authors,
cardinality refers to the size of the set of transformations. In their paper, Xie et

al. experimented with 2D residual architectures for image processing. They showed


that increasing the cardinality of 2D architectures is more effective than using wider
or deeper ones. With this context, ResNeXt was introduced with the concept of
cardinality. Cardinality refers to the number of middle convolutional layer groups in
the bottleneck block. These groups divide the feature maps into small groups which
are later concatenated. ResNeXt performed the best among all the residual networks
in case of activity recognition (Hara et al. 2017). This showed that increasing the
cardinality is better than increasing the width or depth.

18.5 Implemenation

Training: For training purpose, a 16-frame clip is generated from the temporal
position selected by uniform sampling of the video frames. In case the video contains
less than 16 frames, then 16 frames are generated by looping around the existing
frames as many times as required. Multiscale cropping is done by first selecting
randomly a spatial position out of 4 corners and 1 center. Then, for a particular
sample, a scale value is selected from the set {1, 1/2^{1/4}, 1/\sqrt{2}, 1/2^{3/4}, 1/2} to perform multiscale cropping.
Aspect ratio is maintained to one, and scaling is done on the basis of shorter
side of the frame. Frames are then resized to 112 × 112 pixels. After all this, we
finally get the input sample size as (3 channels × 16 frames × 112 pixels × 112
pixels). Horizontal flipping with a probability of 50 percent is also performed. Mean
subtraction is performed to keep the pixel values zero centered. In the process of
mean subtraction, a mean value is subtracted from each color of the sample. Cross-
entropy loss is used for calculation of loss and back-propagation. For optimization
using the calculated gradients, stochastic gradient descent (SGD) with momentum
is used. Weight decay of 0.001 and momentum of 0.9 have been used in the training
process. When training the networks from scratch, we start from a learning rate of 0.1
and divide it by 10 after the validation loss saturates.
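These optimization settings map onto a few lines of PyTorch; the tiny stand-in model and random tensors below exist only to make the sketch self-contained, and ReduceLROnPlateau is one reasonable way to express "divide the learning rate by 10 when the validation loss saturates".

import torch
import torch.nn as nn

# Stand-in for one of the 3D residual networks discussed above (125 CVBL classes)
model = nn.Sequential(nn.Conv3d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 125))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

clips = torch.randn(4, 3, 16, 112, 112)          # (batch, channels, frames, H, W)
labels = torch.randint(0, 125, (4,))

optimizer.zero_grad()
loss = criterion(model(clips), labels)           # cross-entropy loss
loss.backward()
optimizer.step()                                 # SGD with momentum
scheduler.step(loss.item())                      # drop lr by 10x when the monitored loss plateaus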
Each video is split into non-overlapping 16-frame clips, and each clip is then
passed into the network for recognition of faces. Hence, in a way, we follow a
sliding-window approach to generate input clips, where the window moves along the
time dimension, the window length is 16, and the window is moved in a
non-overlapping fashion.
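A small helper along these lines (ours) splits a video into such clips, looping short videos as described for training:

import numpy as np

def split_into_clips(frames, clip_len=16):
    # frames: list/array of video frames; returns non-overlapping clips of clip_len frames
    frames = list(frames)
    while len(frames) < clip_len:                     # loop short videos
        frames.extend(frames[:clip_len - len(frames)])
    n_clips = len(frames) // clip_len                 # trailing partial clip is dropped
    return [np.stack(frames[k * clip_len:(k + 1) * clip_len]) for k in range(n_clips)]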

18.6 Results and Discussion

As we discuss the results, it must be noted here that the term ResNets means ResNet-
18, 34, 50, 101 and 152. Extension of ResNets means pre-activation ResNets, wide
ResNets and DenseNets. The results of experiments on CVBL dataset are summa-

Table 18.1 Accuracy of our proposed method using residual networks of different depth on CVBL
dataset for face recognition
Residual networks Accuracy (%)
ResNet-18 96
ResNet-34 93.7
ResNet-50 96.2
ResNet-101 93.4
ResNet-152 49.1
ResNeXt-101 78.5
Pre-activation ResNet-200 96.2
DenseNet-121 55
DenseNet-201 97
WideResNet-50 90.2

rized in Table 18.1. The training versus validation loss and comparison of accuracy
for all kinds of ResNet architectures are shown in Figs. 18.3 and 18.4.
It can be easily observed that in case of ResNets, the accuracy does not increase
with the increase in depth of the architecture. In fact, it follows a zig-zag kind of path
as shown in Fig. 18.2. For ResNet-18, the accuracy is 96%. It drops to 93.7% for
ResNet-34. The accuracy of ResNet-50 then comes back to 96.2%, before dropping
to 93.4% for ResNet-101. Thus, there is a zig-zag kind of variation in
accuracy as we increase the depth of the ResNets. It is also observed from the Table
18.1 and Fig. 18.2 that the performance of ResNets drops very sharply after ResNet-
101. ResNet-152 fails to perform with just 49.1%. The increase in depth for the same
training set may be the cause for the high bias and hence the decrease in accuracy. In
spite of all these variations of accuracy in ResNet architectures, it can be inferred that
ResNets in general are performing well with more than 90% accuracy. The highest
accuracy in ResNets has gone to 96.2%.
In case of DenseNets too, DenseNet-201 performed the best with 97% accuracy.
The reuse of features from previous layers in later layers of DenseNet seems to have
contributed to increase in the accuracy to 97% in case of DenseNet-201. However, the
accuracy of DenseNet-121 was way too low with just 55%. Figure 18.3e, f shows the
training versus validation loss for DenseNet-121 and DenseNet-201, respectively. It
can be seen that after convergence, the training loss in case of DenseNet-121 is more
than that in case of DenseNet-201. This shows that DenseNet-121 has a high bias
as compared to DenseNet-201. Also DenseNet accumulates features from previous
layers to later layers. Hence, we can say that DenseNet-201 is able to make use of
the accumulated features because of its high depth. Due to its relatively lower depth,
DenseNet-121 is not able to make use of the accumulated features to increase its
accuracy.

Fig. 18.2 Variation of accuracy of ResNets with depth

Wide ResNet-50 gave an accuracy of 90.2%. ResNet-50, which is of the same
depth as Wide ResNet-50, gave an accuracy of 96.2%. Wide ResNet-50 thus not only failed
to match the accuracy of 96.2% but lowered it. It is therefore evident
that in case of face recognition from videos, increasing the width of the architecture
only had a negative effect on the accuracy.
Pre-activation ResNet also performed equivalent to the best performers with an
accuracy of 96.2%. In pre-activation ResNets, the ReLU and batch normalization
layer are placed before the convolution layer unlike in ResNets. Pre-activation
ResNets are said to perform better than standard ResNets (Hara et al. 2017)
because of this change in the sequence of layers. Interestingly, in our case, the change in
sequence of layers had no effect on the accuracy. For the sake of comparison, the rise
in accuracy of all kinds of ResNets with respect to epochs is shown in Fig. 18.4d.
Our experiments, conducted to understand the behavior of residual networks for face recognition on the CVBL video dataset, found the results highly encouraging, with the highest accuracy reaching 97%. We can safely conclude that residual networks are well suited to face recognition and obtain good results on the CVBL video dataset; a minimal sketch of such a training setup is given below.
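The experiments in this chapter were implemented in PyTorch (see the PyTorch reference in the bibliography). The following is a minimal, hedged sketch of how a 3D ResNet can be fine-tuned for video-based face identification; the torchvision `r3d_18` backbone, the clip shape, and the number of identities are illustrative assumptions, not the authors' exact CVBL training code.

```python
# Hedged sketch: fine-tuning a 3D ResNet for video face identification.
# The backbone, clip dimensions, and class count are assumptions standing in
# for the CVBL training pipeline described in the text.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

num_identities = 100                        # assumption: number of subjects
clips = torch.randn(4, 3, 16, 112, 112)     # (batch, channels, frames, H, W)
labels = torch.randint(0, num_identities, (4,))

model = r3d_18()                            # 3D ResNet-18 backbone
model.fc = nn.Linear(model.fc.in_features, num_identities)  # new classifier head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

logits = model(clips)                        # forward pass over video clips
loss = criterion(logits, labels)
loss.backward()                              # one training step
optimizer.step()
```

Deeper variants (ResNet-50/101, DenseNet, WideResNet) would follow the same pattern with a different backbone and classifier head.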

18.6.1 Comparison with State-of-the-Art in Face Recognition

Tables 18.1 and 18.2 compare the results of our proposed method with state-of-the-art results on various image and video datasets. Table 18.2 shows the accuracy of different approaches on the LWF dataset for face recognition in images and on the YTF dataset for face recognition in videos. From Table 18.2, it is observed that, for face recognition in images, an accuracy of 99% has been achieved by different

Fig. 18.3 Training loss versus validation loss for different ResNets

Fig. 18.4 a, b and c shows training versus validation loss for Wideresnet-50, pre-activation ResNet-
200 and ResNeXt-101. d shows the graph of accuracy against the number of epochs for all residual
networks

approaches. However, for face recognition in videos, our approach using DenseNet-201 has achieved an accuracy of 97%. This is nearly 2% above the state-of-the-art accuracy of 95.12%.

Table 18.2 State-of-the-art results for face recognition on image and video datasets.
Architecture Image dataset Accuracy Video dataset Accuracy (%)
Facenet (Schroff et al. 2015) LWF 99.63% YTF 95.12
Deep ID 2 (Sun et al. 2015) – – YTF 93.2
Center Loss (Wen et al. 2016) LWF 99.28% YTF 94.9
Deep residual learning on images Imagenet 3.57% error – –
(He et al. 2016)
L-softmax (Liu et al. 2016) LWF 98.71% – –
A-softmax (Liu et al. 2017) LWF 99.42% YTF 95.0
3D residual networks (ResNeXt- – – UCF-101 94.5
101 64 frames) (Hara et al. 2017)
3D residual networks (ResNeXt- – – HMDB-51 70.2
101 64 frames) (Hara et al. 2017)

18.7 Conclusion

ResNet architectures seem to be sensitive to face patterns in videos. Since the CVBL dataset consists of videos with the same plain white background, it can safely be assumed that the background does not contribute to the recognition accuracy; only the spatial and temporal dimensions contribute to classification and recognition accuracy. Except for a few cases, the ResNets provide good results, with accuracies above 90%, and DenseNet-201 performed the best with 97%.
Hence, we can conclude that ResNets are sensitive to face recognition patterns, with accuracy near 96%. This is the first experiment of its kind on a face video dataset, and the residual networks have in general given an accuracy above 90%, which is a very positive indication for the future of video biometrics.
In the future, we plan to collect biometric samples from more subjects and prepare a bigger and more exhaustive biometric dataset. We then plan to experiment on this bigger dataset to evaluate the effect of a large number of classes on face recognition accuracy. We also plan to experiment on the existing YTF dataset using the residual networks.

References

Ahonen T, Hadid A, Pietikäinen M (2004) Face recognition with local binary patterns. In: Computer
vision-ECCV 2004. Springer, pp 469–481
Ahonen T, Rahtu E, Ojansivu V, Heikkila J (2008) Recognition of blurred faces using local phase
quantization. In: International conference on pattern recognition
Bilgazyev E, Efraty B, Shah SK, Kakadiaris IA (2011) Improved face recognition using super-
resolution. In: 2011 international joint conference on biometrics (IJCB). IEEE, pp 1–7
Brunelli R, Poggio T (1993) Face recognition: features versus templates. IEEE Trans Pattern Anal
Mach Intell 15(10):1042–1052

Chellappa R, Wilson CL, Sirohey S (1995) Human and machine recognition of faces: a survey. Proc
IEEE 83(5):705–740
CVBL Dataset. https://cvbl.iiita.ac.in/dataset.php. Last accessed 30 Dec 2018
Deng J, Guo J, Xue N, Zafeiriou S (2018) Arcface: Additive angular margin loss for deep face
recognition. arXiv preprint arXiv:1801.07698
Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T
(2015) Long-term recurrent convolutional networks for visual recognition and description. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634
Gunturk BK, Batur AU, Altunbasak Y, Hayes MH, Mersereau RM (2003) Eigenface-domain super-
resolution for face recognition. IEEE Trans Image Process 12(5):597–606
Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping.
In: Null. IEEE, pp 1735–1742
Hara K, Kataoka H, Satoh Y (2017) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs
and ImageNet? arXiv preprint arXiv:1711.09577
He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. European
conference on computer vision. Springer, Cham, pp 630–645
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings
of the IEEE conference on computer vision and pattern recognition, pp 770–778
Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional
networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
(CVPR), pp 4700–4708
Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern
Anal Mach Intell 22(1):4–37
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video
classification with convolutional neural networks. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp 1725–1732
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human
motion recognition. In: 2011 IEEE international conference on computer vision (ICCV). IEEE,
pp 2556–2563
Lin M, Chen Q, Yan S (2013) Network in network. arXiv preprint arXiv:1312.4400
Liu W et al (2017) Sphereface: deep hypersphere embedding for face recognition. In: The IEEE
conference on computer vision and pattern recognition (CVPR), vol 1
Liu W, Wen Y, Yu Z, Yang M (2016) Large-margin softmax loss for convolutional neural networks.
In: ICML, pp 507–516
PyTorch. https://pytorch.org/. Last accessed 25 Dec 2018
Schroff F, Kalenichenko D, Philbin J (2015) Facenet: a unified embedding for face recognition and
clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 815–823
Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human action classes from videos
in the wild. In: CRCV-TR-12-01, Nov (2012)
Sun Y, Wang X, Tang X (2015) Deeply learned face representations are sparse, selective, and robust. In:
Proceedings of the IEEE conference on computer vision and pattern recognition
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with
3d convolutional networks. In: Proceedings of the IEEE international conference on computer
vision, pp 4489–4497
Turk M, Pentland A (1991) Eigenfaces for recognition. J Cogn Neurosci 3(1):71–86
Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. IEEE
Trans Pattern Anal Mach Intell 40(6):1510–1517
Viola P, Jones MJ (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
Wen Y et al (2016) A discriminative feature learning approach for deep face recognition. In: Euro-
pean conference on computer vision. Springer, Cham

Wibowo ME, Tjondronegoro D, Chandran V (2012) Probabilistic matching of image sets for video-
based face recognition. In: International conference on digital image computing: techniques and
applications (DICTA)
Wibowo ME, Tjondronegoro D (2012) Face recognition across pose on video using eigen light-
fields. International conference on digital image computing: techniques and applications (DICTA)
2011:536–541
Wiskott L, Fellous J-M, Kruger N, Von Malsburg CD (1997) Face recognition by Elastic Bunch
graph matching. IEEE Trans Pattern Anal Mach Intell 19(7):775–779
Wolf L, Hassner T, Maoz I (2011) Face recognition in unconstrained videos with matched back-
ground similarity. In: CVPR
Xie X, Zheng W-S, Lai J, Yuen PC, Suen CY (2011) Normalization of face illumination based on
large-and small-scale features. IEEE Trans Image Process 20(7):1807–1821
Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural
networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
(CVPR), pp 1492–1500
Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond
short snippets: deep networks for video classification. In: Proceedings of the IEEE conference
on computer vision and pattern recognition, pp 4694–4702
Zagoruyko S, Komodakis N (2016) Wide residual networks. In: Proceedings of the British machine
vision conference
Zhu X, Lei Z, Yan J, Yi D, Li SZ (2015) High-fidelity pose and expression normalization for face
recognition in the wild. Proc IEEE Conf Comput Vis Pattern Recogn:787–796
Chapter 19
Fog Computing-Based Seed Sowing
Robots for Agriculture

Jaykumar Lachure and Rajesh Doriya

Abstract Agriculture is the most important field and the backbone of any country's economic system. After soil testing of the land, seed sowing is the most important and time-consuming process. For large-scale farming, seed sowing robots with fog computing are proposed. These robots have microcontroller units (MCUs) with auto firmware that communicate with the fog layer through a smart edge node. Fog robotics provides services such as security, distributed storage, minimized latency, and efficient bandwidth utilization. A bot saves battery power because it communicates with the fog layer instead of the cloud layer. A typical seed sowing robot consists of a powered wheel, MCU, plower, seed hopper, counter sensor, UV sensor, IR sensor, and a preloaded map of the area. Different methods of planting and systems for sowing seed are shown, and the seed rate, row spacing, spacing between seeds, hopper volume, and seed density are calculated at a standard velocity. The robot uses simultaneous localization and mapping (SLAM) and other path-finding algorithms for working in the field. IR sensors detect the end of the field and obstacles for each robot. The FastAi method and machine learning techniques are used to classify the wheat dataset into different classes with high accuracy.

19.1 Introduction

In farming, after soil testing, plant cropping is an important factor, and it must be decided which crop will be planted. This is an important and tedious job for any farmer; in large-scale farming the area is very large, and performing this activity requires many workers and proper planning. The old traditional techniques need a lot of effort and time, which directly reduces production and quality. Thus, agriculture robots and IoT devices play an important role and have been developed to simplify and reduce human effort. The traditional method of seed planting has low spacing efficiency and poor seed placement, and it causes serious backache for the farmer. Seeds can also be planted only in a limited size of the


field. Hence, for achieving maximum performance, the limits of a seed planter should be optimized. Thus, we need to use Internet of Things devices and robots to make farms automatic. For seed sowing, the robot consists of different sensors and a mechanical structure.
Agriculture is the main pillar of any economy around the world and the backbone of sustainability. For the sustainable growth of any country, agricultural development plays a vital role. The world population is around 7 billion, and food security is an important concern; as the population increases day by day, the demand for food security also increases. The pandemic caused by corona has affected economies all over the world and created food security issues. Around three-fourths of the Earth's surface is covered by water, and the rest is land. As the population increases, forest and agricultural land is converted into residential areas, which directly threatens food security. For a long time, traditional methods and various machines requiring huge manpower have been used in agriculture. Manual planting on a large scale is difficult. The farmer has to spend almost all of his time planting, yet the time available for planting is limited. Thus, completing the task within the stipulated time needs more manpower, which is costlier. Another drawback is that more seed is wasted during manual planting because of improper spacing. Hence, there is a need to develop a mechanical robot with sensors that is connected to the Internet, so that the effort of planting is reduced and the farmer can perform at his best. This process of using machines with sensors and the Internet that can obtain services from a server is called "cloud robotics."
This automation helps to increase the efficiency of the process. Typical robots with sensors are used to enhance farming operations such as seed sowing, cultivation of plowed land, smart irrigation, weed detection, plant leaf disease detection, and pesticide control. A smart seed sowing robot cultivates the farm at a fixed spacing by considering a particular column of the map for a particular crop. Crop planting is the art of placing seeds at proper intervals in the soil to obtain good germination. A few plants, such as rice paddy, are first raised and then transplanted after a period of dormancy. A perfect seed sowing gives:
• Correct ratio of seed
• Correct depth for sowing
• Correct amount of seeds per unit area
• Correct spacing in each column, row to row and plant to plant.
The paper is organized as follows: an overview of fog robotics and its services, with the architecture of the edge layer, fog layer, and robotic layer; a literature review of seed sowing, followed by the mathematical calculations of the robotic seed sowing process; the wheat dataset and the machine learning techniques; and finally the results, comparisons, and references.

19.2 Fog Robotics for Seed Sowing

Fog robotics (Ai et al. 2018; Chauhan and Vermani 2016; Gia et al. 2018; Smith et al. 2013; Wang et al. 2018) consists of different layers: the cloud layer for controlling all major functions and monitoring the complete system, the fog layer for distributed storage and for providing limited resources with edge-cutting services for data processing and analysis, and the robotic layer for sending data. The cloud layer enables administrators or end users to control the system by providing general instructions, and it monitors performance at the different layers. The fog robotics layer includes the edge layer and the fog layer, which share the same physical resources and smart dedicated gateways. By definition, this architecture is modular and can be distributed and easily scaled.
Figure 19.1 shows how the fog layer interacts with the cloud layer and the robotic layer for smart seed sowing robots. Each layer has a role in distributing the computation load over the network. The main factors are latency, energy consumption, security, and computational power at the different layers. In general, the seed sowing robot acquires data continuously and sends it toward the nearest server, called the fog layer, through the same network. For making real-time decisions, latency is an important factor.

Fig. 19.1 Fog robotics architecture



The edge layer reduces latency by providing services quickly and managing the computational load for the different connected robots. The fog layer also provides services such as distributed storage, security, and minimal-data-loss mechanisms when robots switch from one gateway to another. Each robot is fed with the map data required for moving over the field, and the robots interact through different algorithms such as SLAM and path-finding methods. The edge layer plays an important role in reducing energy consumption at the robot or end nodes. Fog computing is essential for providing knowledge of the robot state and robust situation awareness for handling critical issues safely.
(i) Cloud Services: The time-series data of the robots or IoT devices are uploaded for monitoring, control, and visualization by the administrator or user. These data include the robot state, which covers position coordinates, velocity, steering drift factor, acceleration, torque variation, current consumption, and battery percentage or level. The fog layer, together with the edge layer, helps to preprocess the raw sensor data generated by the robots or devices; only critical or important information is sent toward the cloud for handling any issue or fault in the robots. This can be implemented dynamically: each robot has a particular map installed and operates only within that map region, and data can be uploaded at different frequencies depending on whether the end user is actively monitoring the system. The cloud services provide general control instructions for stable operation and better performance from the robots. In agricultural land with a large area, the seed sowing bots are placed at different locations with a pre-loaded map, and the positions of the robots change within the map. Once a robot reaches the end point, the infrared sensor detects it and sowing stops. Specific or group tasks are assigned to each robot, and the cloud allows global access to a user or administrator for monitoring and controlling the system cost-effectively. However, latency between the robot layer and the cloud is very high, and there are security challenges between both layers. So the cloud should not be used for primary control, as variations in network performance create many problems or faults. The cloud periodically accesses pertinent processed data from the robot layer through the edge node, and it provides services such as machine learning, big data analysis, and better prediction with the high-performance computing systems that these require.
(ii) Fog with edge layer: The edge node receives real-time sensor data from single or multiple robots. This includes odometry data (wheel and visual odometry), inertial data, velocity, location coordinates, data from range sensors such as radars or lidars, and onboard camera data, which the gateway analyzes for path planning, obstacle detection, and avoidance. All these data are transferred through the edge gateway and collected at the fog layer to provide a fast response. Fog acts as a central element, ensuring low latency through the edge gateways. If more than one robot is connected directly to the cloud, latency increases, which directly affects performance. So multiple robots are connected to a single gateway, and all data from the different sensors are aggregated and analyzed to obtain a more comprehensive understanding of the

environment. For instance, when multiple robots operate in the same environment and are connected to the same gateway, information from larger areas can be obtained through the robots' sensors. The main role of the edge node in the fog layer is to act in safety-critical situations; communication between the robot and the fog layer is done through it within the same network. Compared to the traditional cloud practice of moving complex tasks, which needs high bandwidth and incurs high latency, the edge node reduces latency and transfers data to a safe point, the fog layer; even if a network failure happens, it keeps the data safe and the operation stable. The edge node continuously sends data from multiple robots to the fog layer wirelessly, for deciding on task allocation and robot movement. The fog layer interconnects the different edge nodes and provides other services such as location, security, tracking, monitoring, and distributed database services. If a robot disconnects from one gateway and connects to another, the fog layer takes care of minimizing latency and data loss during the handover. Localization algorithms such as simultaneous localization and mapping (SLAM) in the edge node use real-time sensor data to match an area of an existing map; this is crucial in situations where the robot operates in a partially or completely unknown environment. Other services in the fog layer are collaborative processing, external senior management, and monitoring. Additional services are included in the fog layer to enhance overall system robustness and fault tolerance. For instance, if a gateway fails or abruptly disconnects, the shared storage resources hold all the information and make it available to other gateways. This lets the robot reconnect to an available gateway and continue its operation with low latency, minimal data loss, and little operational interruption.
(iii) Robot layer: In this layer, raw sensor data are gathered and streamed in real time toward the fog layer using a smart edge node. All instructions are given in real time; the robot may or may not be aware of its current state, which depends on the fog server that provides the exact location and current state of the robots. Low-power wireless communication technologies such as Wi-Fi, Bluetooth, or nRF are used for the robot: Wi-Fi for more bandwidth-intensive applications such as live video streaming, and Bluetooth or nRF if their bandwidth meets the requirements. The robot becomes more energy efficient because the raw data are streamed directly to the fog layer through the edge node, which saves processing energy. The microcontroller unit (MCU) is used for a dedicated purpose with low power consumption, and it directly takes resources from the cloud server, making the system user-friendly, robust, ubiquitous, and dedicated. The robots use different sensors for sensing, path planning, obstacle detection, object detection, and localization. A traditional robotic system needs an onboard system with built-in firmware; with the cloud robotics concept, this is replaced by a low-power, low-cost MCU board that operates through a cloud or fog server. The analysis of basic information is important for the robot itself; it includes, for instance, the location of the robot, velocity, wheel odometry, and inertial data for current

acceleration and orientation. The state is estimated online at the fog or cloud layer, which allows more accurate movement, and all instructions are simplified. In a smart seed sowing system, each robot has specific tasks, including sowing, harvesting, irrigation, and fertilization, along with a UV sensor to improve germination power and proper path planning to complete the task quickly; a minimal sketch of this robot-to-edge telemetry path is given below.
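The following is a small, hedged sketch of the robot-to-edge telemetry path described above: a robot state sample is packaged as JSON and pushed to an assumed edge gateway, which would forward only critical records to the cloud. The host, port, field names, and the "critical" filter are illustrative assumptions, not part of the authors' system.

```python
# Hedged sketch of robot -> edge-gateway telemetry using only the standard library.
import json
import socket

EDGE_GATEWAY = ("192.168.1.10", 9000)   # assumed fog/edge node on the farm network

def publish_state(sock, state):
    """Send one JSON-encoded robot state sample to the edge gateway."""
    sock.sendall((json.dumps(state) + "\n").encode("utf-8"))

state = {
    "robot_id": "seedbot-01",
    "pose": {"x": 12.4, "y": 3.8, "heading_deg": 91.0},   # from wheel/visual odometry
    "velocity_mps": 0.2,
    "battery_pct": 78,
    "seeds_dropped": 1520,                                # from the counting sensor
    "obstacle_detected": False,                           # from the IR sensor
}

try:
    with socket.create_connection(EDGE_GATEWAY, timeout=2) as sock:
        publish_state(sock, state)
except OSError as exc:
    print("gateway unreachable, buffering locally:", exc)

def is_critical(sample):
    """Edge-side filter: only safety-critical records would be relayed to the cloud."""
    return sample["battery_pct"] < 15 or sample["obstacle_detected"]
```

In a deployment, the gateway would aggregate such samples from several robots before deciding on task allocation and forwarding summaries upstream.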

19.3 Literature Review

Liu et al. proposed a method to design a cyber-physical system for proper seeding, irrigation, and fertilizer management for the alfalfa medicinal plant. The model is comprehensive and includes a water sub-model, a biophysical system sub-model, and fertilizer regulation after seed sowing. The alfalfa growth sub-models interact with each other to improve the precise regulation of fertilizer and water. A simulation model was developed for measuring values such as leaf area index and soil water content, improving the precise regulation of water and fertilizer application for alfalfa (Liu et al. 2020).
Praveena et al. proposed a robot with an AVR ATmega microcontroller capable of performing operations such as plowing, seed dispensing, picking fruits, and spraying pesticides. First, the robot tills the entire field; then it plows and simultaneously sows the seed in rows. For navigation, the device uses an ultrasonic sensor, and raw data are sent continuously from the field to the microcontroller. On the field, the robot operates automatically; outside it is operated manually. They proposed a control application with Bluetooth pairing for manual control. For continuous data collection, humidity sensors were placed at various spots. For proper growth of the crop, if the humidity level is above the threshold, the farmer is alerted via the GSM module to start the water sprinklers and bring the humidity level down (Praveena et al. 2015).
Naik et al. discussed the main reasons for automating farming processes: saving the time and energy required for continuous farming tasks and increasing productivity through precision farming by treating every crop individually. They proposed a four-wheeled vehicle controlled by an LPC2148 microcontroller as an agriculture robot for seed sowing only. Efficient seed sowing at optimal distances between crops and rows, and at an optimal depth specific to each crop type, is achieved through precision agriculture (Naik et al. 2016).
Srinivasan et al. proposed a novel design for an autonomous mobile robot capable of sowing seeds over prepared land. Aluminum is used to construct the body of the device for efficient weight reduction and proper strength utilization. The robot navigates over the land using inputs from a magnetometer, and a proportional integral (PI) controller is used to improve the accuracy of its direction. An ultrasonic sensor detects the end of the field. The robot sows seeds in evenly spaced rows, with equidistant drop points. The seed meter is proposed using a solenoid

actuator mechanism, i.e., the seed metering mechanism is based upon a solenoid actuator assembly. The device has a modular structure that provides ease of maintenance. Overall, the proposed device consumes power efficiently, making it suitable for the field of agriculture (Srinivasan et al. 2016).
Ranjith et al. proposed methods for designing a robot that sows seed, cuts grass, and sprays pesticide, using a solar power system for the whole system. The designed robot gets its energy from solar panels and is operated through a Bluetooth/Android application. This app sends the signals for movement and the required mechanisms to the robot. The efficiency of this robot is higher, and it reduces the problems encountered in manual planting (Ranjitha et al. 2019).
Saurabh Umarkar and Anil Karwankar discussed that, for agriculture, the most important components are the seed and its sowing over the field. A wide range of seed sizes is available across crop varieties, so high-precision pneumatic planting was developed, which needs uniform seed distribution with proper seed spacing along the travel path. They use Wi-Fi for receiving data. The main disadvantage of the system is that the robot moves in only one direction, and whenever the power supply is obstructed, the robot turns OFF automatically (Umarkar and Karwankar 2016).
Sujon et al. proposed a robot that uses ultrasonic detection to change its position every time. They studied the effects of various seeding machines, and a sowing application for oilseed with different rates was developed (Sujon et al. 2018).
Kareemulla et al. proposed a robot for minimizing the wastage of seed. The proposed robot needs less sowing time and energy compared to tractor and manual methods. It can operate in a simple mode to increase the total yield effectively. The major disadvantage is that it consists of only one mechanism (Kareemulla et al. 2018).
The main objective of the seed sowing system is to put the seed, fertilizer, and water in rows at the desired depth, cover the seeds with soil with proper seed-to-seed spacing, and provide proper compaction over the seed. Some mechanical factors affect seed germination, such as the uniformity of seed distribution along the rows and the uniformity of the depth of seed placement. The system includes a power transmission mechanism, a seed counting sensor, a UV-ray sensor, a water dripping sensor, and a harvesting mechanism. The recommended seed rate and seed placement depth vary from crop to crop, as does row-to-row spacing, and they depend on different agroclimatic conditions for achieving maximum yields. That is why the seed sowing robot plays a wide role in the agriculture field. A typical robotic system consists of sensors and a mechanical structure: a multi-purpose piece of equipment with a cylindrical container for filling the seeds, a four-wheeled carrier assembly for carrying the container, a seed counting sensor, a metering plate with a bevel gear mechanism, and two holes at the bottom sized for the seed. The robot works as follows: when the plate rotates in the container and the bottom hole of the container coincides with the meter plate hole, seeds flow through the pipe to the soil (the seed metering sensor), while the counting sensor counts the seeds. UV light at 400 nm is continuously applied in the container so that the germination power of the seed increases. The water dripping sensor is activated once the robotic arm

plows the soil and sows the seed; then water from another container drops over the planted spot. This process runs continuously, row by row. Thus, it enables the conservation of inputs through precision, reducing the quantity needed, giving a better response and better distribution, and preventing losses or wastage of the inputs applied. This directly reduces the unit cost of production, as inputs are conserved and productivity is high. The most important purposes of the robot are to make it affordable to farmers, reduce labor cost, predict seedling emergence early, irrigate in the initial stage, and increase the germination power of the seed.
The main objective is to make the system affordable to farmers so that they can do their work without depending on laborers. The above-mentioned machine increases the efficiency of seed sowing, thereby reducing the wastage of seeds and improving overall yield. For precision agriculture, different types of innovation are going on in different areas, but the seed sowing robot is a key component in the agriculture field. This robot increases the yield at low cost; the initial cost of the infrastructure with cloud, fog, and edge nodes is higher, but after that only a small maintenance cost is needed. Presently, different approaches are available to evaluate the performance of seed sowing machines.
The multi-purpose agriculture robot can be used for soil testing, seed sowing, fertilizer supply, weed detection, and plant leaf disease detection. The drilling arm completes the tasks of soil drilling, seed sowing, smart irrigation via the water dripping sensor, fertilizer spreading, and soil testing. Internet-connected robots are lightweight, which is a big advantage: they work fast, and every data item is sent to the nearest edge server for further analysis. At the same time, the server gives predictions for further work. Here, the main objective of the seed sowing robot is to make it simple and easy for farmers to use. The architecture is simple, and the robot is built from lightweight materials along with sensors embedded with a Wi-Fi module. The main objective is to sow without the use of laborers; this increases the efficiency of seed sowing and reduces the wastage of seeds, improving overall yield.

19.3.1 Seed Spacing and Seed Rating

In-farm plant spacing and the optimum plant population are the primary objectives of any seeding or planting operation. The ultimate goal is to obtain the maximum net return per unit area. Spacing and population requirements are influenced by factors such as:
• Type of soil
• Type of crop
• Amount of moisture available
• Fertility of the soil
• Pollution level.

19.3.2 Methods of Planting

Different forms of planting arise as the area of land geographically changes;


• Broadcasting: Seeds are scattered randomly over the surface of the field.
• Hill dropping: Groups of seeds are placed at about equal intervals in rows.
• Drilling: The spacing between the seeds is not uniform. Drilling consists of dropping the seeds into furrow lines in a continuous stream and covering them with soil.
• Precision planting: Seeds are accurately placed at about equal intervals in rows.
In manual seeding, uniformity in the distribution of seeds is not possible to achieve. A farmer may sow seed at the desired rate, but the intra-row and inter-row distribution of seeds is likely to be uneven, resulting in bunching and gaps in the field. Besides, there will be poor control over the depth of seed placement; this results in poor emergence of the crop, which again leads to low productivity.

19.3.3 Planting Systems

Planting may be done on the flat surfaces of the field, in furrows, or on beds as:
– Furrow or shifter planting: In semiarid conditions, this technique is widely prac-
ticed for row crops such as cotton, corn, and grains. This system places the seed
down into the moist soil and protects young plants from wind and blowing soil.
– Bed planting: In high rainfall areas, it is often practiced to improve surface
drainage.
– Flat planting: In favorable moisture conditions, this type of planting generally
predominates.
Figure 19.2 shows the three types of planting: the left corner shows the furrow planting system, the right side shows a bed planting system, and flat planting is shown below both of them. These systems are used in different regions under different conditions to get more yield under favorable conditions.

19.3.4 General Problems Identified in Seed Sowing

In sowing seeds, there are few problems concerning the production of yield as:
• Irregular placing of seeds: In this process, seeds are thrown manually all over the field. This irregular way of placing seeds causes the seeds to grow irregularly.


Fig. 19.2 Different types of planting systems: furrow planting (left), bed planting (right), and flat planting (below)

• Wastage of seeds: During manual sowing, seeds scattered here and there result in irregular placement and irregular lines. As a result, water and important nutrition are not received properly, and the growth of the plant will not be up to the mark.
• Time-consuming process: Seed sowing over the complete land is a time-consuming process. If the area is small, it is not a burden, but if the area is big, it requires a lot of time and the process becomes difficult.
• Insufficient ground temperature: Nowadays, due to global warming, the surrounding temperature changes suddenly. If the ground temperature is cool, seeds will not germinate and grow properly, as seeds need warm conditions to grow well.
• Sowing of seeds too deeply: The depth of the seed should be moderate. If the seeds are sown too deeply, they will not emerge even if the plants are watered.
• Lack of a quality seed-raising mix: The seed quality should be good for raising the plant, and it is a very important factor after germination. If the quality is not good, growth will not be up to the mark.

19.3.5 Functions of Seed Sowing Machine

The role of the seed sowing machine is as follows:


• To carry the seed.

• Open the seed furrow to the required or proper depth.


• Meter the seed.
• Deposit the seed in the furrow in an acceptable pattern.
• Cover the seed and compact the soil around the seed to the proper degree for the
type of the crop involved.
• When accomplishing these functions, the planter should not damage the seeds
enough to appreciably affect germination.
• After studying all the types of planting and plantation methods, we are using
the precision planting method and bed planting system with our motorized seed
sowing machine (http://www.soiltillage.com).

19.4 Performance Matrix for Seed Sowing Robot

The robot consists of different parts for seed sowing; it contains sensors such as a UV sensor, a counter, infrared (IR) sensors, a gyro-sensor for robot movement, and a water dripping sensor, and mechanical parts such as the seed hopper, plow, and small and big chain gears. These parts decide the exact amount of seed required, and at what rate, for a particular area, in terms of:
• Seed rate (Sr).
• Seed hopper volume (Vs).
• Row spacing (RS).
• Spacing between seeds (X).
• Bulk density of seeds (Pb).
• Rotation per minute Rpm.
• Number of cells in seed chamber (n).
The transmission ratio between the driver sprocket and the driven sprocket is given as

Transmission ratio (i) = N1/N2
= (no. of teeth on driver sprocket)/(no. of teeth on driven sprocket)
= 21/40 = 0.525

Consider "GROUNDNUT" seeds; groundnut is widely used for making cooking oil and, apart from this, as a daily ingredient and in spices. The total seed required for germination in one hectare is calculated as follows.
(i) Seed rate (Rs): The quantity of seeds sown per unit area is called the seed rate. It depends on the plant-to-plant or row-to-row spacing (plant population), germination percentage, and test weight. Its unit is kg/ha (kg per hectare).

Seed rate = (Plant population × test weight × 100 × 100) / (germination percentage × purity percentage × 1000 × 1000)

(ii) Plant population: From various articles on the study of groundnut, it is concluded that:
Germination percentage = 90%
Test weight = 114 g per 100 seeds
Seed bulk density = 434.8 kg/m³
Normal seed rate = 100–110 kg/ha
Plant population = area to be planted / (space between plant to plant × space between rows)
Consider a 1-ha area for plantation. Then,
Area to be planted = 1 ha = 2.471 acres; 1 acre = 4046.856 m², therefore 1 ha = 2.471 × 4046.856 = 9999.78 m² ≈ 10000 m²
Space between plant to plant = 75 cm = 0.75 m
Space between row to row = 10 cm = 0.1 m
Plant population = 10000 / (0.75 × 0.1) = 133,333 plants. We took 1000 grams, i.e., 1 kg, of seeds as the test weight to know how much seed rate will come per kg.

Seed rate = 133,333 × 1000 × 100 × 100 / (90 × 90 × 1000 × 1000) = 164.61 kg/ha

The value from standard data is between 100 and 160 kg/ha. The velocity of the seed sowing robot needed to meet that standard depends on the diameter of the wheel that rotates inside it. It can be assumed, or taken from a standard book, such that it does not cause seed breakage; so velocity (v) = 0.2 m/s and motor speed = 60 rpm.
Number of cells in the seed sowing gear: n = π × D / (i × X)
n = number of cells
D = big wheel diameter
i = transmission ratio
X = space between seeds
n = 3.14 × 0.25 / (0.525 × 0.3) = 4.98 ≈ 5 cells
So 5 cells need to be present in the seed sowing gear.
(iii) Flow rate: The flow rate for sowing seed depends on the seed rate, velocity, seed bulk density, and space between rows as:
Q = Rs × S × V / (10000 × Pb)
Q = flow rate
Rs = seed rate (kg/ha)
S = space between rows (m)
V = velocity of seed sowing machine (m/s)
Pb = seed bulk density (kg/m³)
Q = 68.58 × 0.6 × 0.2 / (10000 × 714.86) = 1.1512 × 10⁻⁶ m³/s

Table 19.1 Seed sowing physical dimension


Seed Diameter (cm) Required distance Required depth (cm)
between plants (cm)
Soya beans 5–11 25–30 2–5
Ground nuts 6–9 20–25 2–4
Corn 6–7 45–55 2–5
Wheat 2–3 35–40 1–3
Peanuts 6–9 40–50 2–4
Cotton 6–8 55–60 2–3
Bengal gram 5–10 55–60 3–5
Kidney beans 9–11 45–50 2–4

(iv) Volume of seed hopper: It depends on the flow rate, the number of cells, and the rotational speed of the seed sowing gear as: Vc = Q × 60 × 10⁶ / (n × Nd)

Vc = volume of seed hopper (m³)
Q = flow rate
Nd = rotational speed of seed sowing gear (rpm), with Nd = N × i
n = number of cells in the seed sowing gear
Vc = 1.1512 × 10⁻⁶ × 60 × 10⁶ / (5 × 60 × 0.525) = 0.448 m³ ≈ 0.5 m³

(v) Planting depth: The seed should be planted to the required depth without breaking the seeds. After the seeds are covered with the help of the V-shaped metal, the tires should pass through the rows where the plow dug, so that the seed sits at the required depth. Table 19.1 shows the seed diameter, the required depth, and the required gap between the plants.
For different seeds such as soybean, wheat, Bengal gram, and peanut, Table 19.2 shows the depth of seed placement, the seed rate for the robot (so that the seed is not damaged), the width of coverage (plant-to-plant distance), the labor requirement (so that labor work is reduced), and the plant population per hectare; the design quantities above are reproduced in the numeric sketch below.
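The following short Python sketch reproduces the design calculations worked out above (transmission ratio, plant population, seed rate, number of gear cells, flow rate, and hopper volume) using the same input values as the text; it is only a convenience for checking the arithmetic, not an additional design method.

```python
# Hedged numeric sketch of the seed-sowing design calculations above.
import math

# Transmission ratio between driver and driven sprockets
i = 21 / 40                                        # = 0.525

# Plant population for 1 ha with 0.75 m x 0.1 m spacing
area_m2 = 10_000
plant_to_plant, row_to_row = 0.75, 0.10
population = area_m2 / (plant_to_plant * row_to_row)        # ~133,333 plants

# Seed rate (kg/ha) = population * test weight(g) * 100 * 100
#                     / (germination% * purity% * 1000 * 1000)
test_weight_g, germination, purity = 1000, 90, 90
seed_rate = population * test_weight_g * 100 * 100 / (germination * purity * 1e6)

# Number of cells in the seed-sowing gear: n = pi * D / (i * X)
D, X = 0.25, 0.30                                   # wheel diameter, seed spacing (m)
n_cells = math.ceil(math.pi * D / (i * X))          # ~4.98 -> 5 cells

# Flow rate Q = Rs * S * V / (10000 * Pb)
Rs, S, V, Pb = 68.58, 0.6, 0.2, 714.86
Q = Rs * S * V / (10_000 * Pb)                      # ~1.15e-6 m^3/s

# Hopper volume Vc = Q * 60 * 1e6 / (n * Nd), with Nd = N * i
N_rpm = 60
Nd = N_rpm * i
Vc = Q * 60 * 1e6 / (n_cells * Nd)                  # ~0.44, rounded up to 0.5 in the text

print(f"seed rate {seed_rate:.1f} kg/ha, cells {n_cells}, Q {Q:.3e}, Vc {Vc:.3f}")
```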

19.5 Methodology

In this section, a brief detail about machine learning techniques, seed dataset, and
working process is given.

Table 19.2 Different parameters for different seeds


Crop Peanuts Soya bean Wheat Bengal gram
Seed rate (kg/ha) 74 52 102 68.58
Width of 800 900 660 1000
coverage (mm)
Depth of 50 40 45 60
placement (mm)
Labor 3.4 4.1 10.4 4
requirement
(man-h/ha)
Plant 32 29 135 5
population/m2

19.5.1 Machine Learning Methods

In this section, a brief introduction is given to the machine learning methods decision tree, AdaBoost, and SVM, and to the deep learning method FastAi:
(i) Decision tree: It selects an attribute and splits the data into successor nodes, and entropy is calculated for each node. A node holds a set of items of two classes, positive and negative. The attribute that maximizes the information gain is selected for the seed dataset after calculating the gain.

E(p, n) = −[p/(p + n)] log2[p/(p + n)] − [n/(p + n)] log2[n/(p + n)]   (19.1)

The quality of the entire split over the attribute At is calculated using the entropy of the system as

Gain(D, At) = E(D) − Σi (Di/D) × E(Di)   (19.2)

(ii) Adaboost: In Adaboost the learners are of two types, weak and strong. A weak learner is a classifier that is only slightly better than random guessing, while a strong learner almost always provides the correct classification for the true value (Abd Rahman et al. 2015). Consider training data of the form (p1, q1), (p2, q2), ..., (pn, qn), where qi ∈ {+1, −1} ∀ pi ∈ P, and a learner h; the error ε is defined as

ε = (1/N) × Σi { 0 if qi = h(pi); 1 if qi ≠ h(pi) }   (19.3)

(iii) Support vector machine: It is a binary classifier (Abd Rahman et al. 2015) for a linearly separable dataset Ds ⊆ R^d. A separating hyperplane is chosen with the maximum distance (margin) from the points of the dataset.

The hyperplane H is given as

H(wt, con) = {x ∈ R^d | wt · x = con}   (19.4)

A dataset Ds = Ds⁺ ∪ Ds⁻ is linearly separable if there exists a hyperplane H(wt, con) such that

Ds⁺ = {x ∈ Ds | wt · x > con} and Ds⁻ = {x ∈ Ds | wt · x < con}.   (19.5)
(iv) FastAi: It is a deep learning method that works on the data-bunch technique (Howard and Gugger 2020) to fit a model it can learn. A learner is used for training the model, and the learning rate controls how fast the model parameters are updated. The learning-rate finder locates the best value and plots the result (x: learning rate; y: loss).
The method fit_one_cycle uses a default learning rate of 10⁻³, but according to the plot the learning rate is changed: if the plot gets worse, the model is unfrozen or reset, fit_one_cycle is applied again, and a slice (1e-6, 1e-4) is used. The learning rate is spread equally over the layers between the first and last layer, with 10⁻⁶ for the first layer and 10⁻⁴ for the last layer.

Algorithm: FastAi for Seed Classification

Dataset loading and preprocessing. Normalization: remove null and missing values from the dataset.
Apply the FastAi method with linear learning.
Find the target layer.
Split the dataset into training and validation sets.
Fit the data with a learning rate and optimize up to the best value.
Repeat the FastAi fit cycle:
    If learning rate > error value:
        repeat, stepping up to the best value
        save the best performance value with optimal time
    Else:
        freeze the final learning rate
End
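A minimal sketch of the above workflow using the fastai tabular API is shown below. The file name, column names, layer sizes, validation split, and learning-rate slice are assumptions for illustration; the chapter specifies only the lr_find / fit_one_cycle style of training.

```python
# Hedged sketch of the FastAi tabular workflow outlined in the algorithm above.
import pandas as pd
from fastai.tabular.all import (TabularDataLoaders, tabular_learner,
                                Normalize, CategoryBlock, accuracy)

# Assumed local copy of the UCI wheat seeds data: 7 features + variety label.
cols = ["area", "perimeter", "compactness", "kernel_length",
        "kernel_width", "asymmetry", "groove_length", "variety"]
df = pd.read_csv("seeds_dataset.txt", sep=r"\s+", header=None, names=cols)
df = df.dropna()                                     # remove null/missing rows

dls = TabularDataLoaders.from_df(
    df, y_names="variety", y_block=CategoryBlock(),
    cont_names=cols[:-1], procs=[Normalize],
    valid_idx=list(range(0, len(df), 5)))            # every 5th row for validation

learn = tabular_learner(dls, layers=[200, 100], metrics=accuracy)
learn.lr_find()                                      # plot loss vs. learning rate
learn.fit_one_cycle(10, lr_max=slice(1e-6, 1e-4))    # discriminative learning rates
```

If the loss plot worsens, the model can be re-created and refit with a different slice, which is the "repeat until the best value" loop of the algorithm.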

19.5.2 Seed Dataset

The wheat dataset (Charytanowicz et al. 2010) was donated by Charytanowicz and Niewczas from the Institute of Mathematics and Computer Science, The John Paul II Catholic University of Lublin. This dataset contains three different varieties of wheat: Kama, Rosa, and Canadian. Each variety contains 70 elements, selected randomly for the experiment. A soft X-ray technique was used to visualize the internal kernel structure with high quality; the images were recorded on 13 × 18 cm X-ray KODAK plates. The data are constructed from seven geometrical parameters of the wheat kernels. The attributes of this dataset are:
(i) Perimeter P.
(ii) Area A.
(iii) compactness C = 4 × π × A / P², where π = 3.14.
(iv) width of the kernel.
(v) length of the kernel.
(vi) length of kernel groove.
(vii) asymmetry coefficient.
All of these parameters were continuous and real-valued.
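For comparison with the FastAi results discussed later, the following hedged sketch loads the same seven geometrical features and fits the classical baselines named in Sect. 19.5.1 (a decision tree with the entropy criterion of Eqs. 19.1–19.2, AdaBoost, and SVM) using scikit-learn; the file name, split, and hyperparameters are assumptions rather than the authors' exact settings.

```python
# Hedged sketch: classical baselines on the wheat-seeds features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

cols = ["area", "perimeter", "compactness", "kernel_length",
        "kernel_width", "asymmetry", "groove_length", "variety"]
df = pd.read_csv("seeds_dataset.txt", sep=r"\s+", header=None, names=cols).dropna()

X_tr, X_te, y_tr, y_te = train_test_split(
    df[cols[:-1]], df["variety"], test_size=0.3, random_state=0,
    stratify=df["variety"])
scaler = StandardScaler().fit(X_tr)                  # scale using training data only
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "decision tree": DecisionTreeClassifier(criterion="entropy"),  # Eqs. 19.1-19.2
    "adaboost": AdaBoostClassifier(n_estimators=100),
    "svm": SVC(kernel="rbf", gamma="scale"),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```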

19.5.3 Working Process

The FastAi method is used with a linear transformation, and the data-bunching technique works quickly with the learning rate.
For classification of the given dataset, the data are first preprocessed and normalized: null values are removed, and missing data are filled using methods such as the average, minimum, or maximum value. FastAi with a linear transformation is a deep and fast learning technique that first finds the target layer; then the dataset is split into training and validation sets. For every learning cycle, the learning rate is changed to fit the data. Once the data fit, it is easy to get the best learning rate. If the fitting goes out of scope, an overflow occurs and the learning rate must be changed again. This learning-rate cycle changes continuously to optimize the best result. The given algorithm finds the best learning rate with the batch process.
Figure 19.3 shows the learning rate versus iteration; the curve shows that the best-optimized value lies between the underflow and overflow conditions. Figure 19.4 shows the accuracy versus batch process as the training data are fitted linearly to increase the accuracy of the system. Figure 19.5 shows the loss versus batch process, i.e., how much data are lost and when overflow occurs in the batch process.

Fig. 19.3 Learning rate versus iteration

Fig. 19.4 Accuracy versus number of batch processes

Fig. 19.5 Loss versus number of batch processes

Fig. 19.6 Comparison of result performance

19.5.4 Result and Discussion

For the given wheat dataset, the FastAi deep learning approach and different machine learning techniques are used for classification into different categories. It is shown that FastAi runs faster than all the other algorithms, as it works on the learning cycle and fits the dataset linearly (Fig. 19.6).

19.6 Conclusion

In agriculture, seed sowing robots play an important role in plowing, digging, seeding, and harvesting. These robots are connected with the edge node for communicating with the fog layer. This layer works in a homogeneous network with a decentralized server and gives better security because it is nearer to the robots or devices than the cloud layer. Fog robotics minimizes latency and makes proper use of the network bandwidth. As latency decreases, less battery is needed for sending data, so power is saved and performance increases. Since the fog layer provides firmware that can be installed at any time within the network to perform other operations, machine efficiency indirectly increases as per requirement, and the system becomes ubiquitous. The multi-robot scenario can be handled within the fog layer with the help of the edge node. Each robot has a predefined map loaded in its firmware. The path planning of each robot is done through SLAM or path-finding algorithms, so that the robots can work quickly and accurately within the field.
The seed sowing robotic machine consists of different sensors and parts such as a UV sensor, an IR sensor, a hopper, a plower, chain gears, and a seed counting meter. The UV sensor is used for disinfecting the seed; the IR sensor detects obstacles and the end of the field. The seed rate, row-to-row distance, seed spacing, and sowing depth are calculated, which directly improves the growth of the seed in the field.

The wheat dataset contains different species that must be handled at the time of sowing. The standard size parameters of each species are recorded to classify it into the proper class, so the machine can separate the species before sowing. The FastAi method, along with different machine learning models, was developed to separate such species from each other.

References

Abd Rahman HA, Wah YB, He H, Bulgiba A (2015) Comparisons of adaboost, knn, svm and
logistic regression in classification of imbalanced dataset. In: International conference on soft
computing in data science. Springer, Berlin pp 54–64
Ai Y, Peng M, Zhang K (2018) Edge computing technologies for internet of things: a primer. Digital
Commun Netw 4(2):77–86
Chauhan S, Vermani S (2016) Cloud computing to fog computing: a paradigm shift. J Appl Comput
1(1):25–29
Charytanowicz M, Niewczas J, Kulczycki P, Kowalski PA, Łukasik S, Zak S (2010) Complete
gradient clustering algorithm for features analysis of x-ray images. Information technologies in
biomedicine. Springer, Berlin, pp 15–24
Gia TN, Rahmani AM, Westerlund T, Liljeberg P, Tenhunen H (2018) Fog computing approach for
mobility support in internet-of-things systems. IEEE Access 6:36064–36082
http://www.soiltillage.com
Howard J, Gugger S (2020) Fastai: a layered api for deep learning. Information 11(2):108
Kareemulla MS, Prajwal E, Sujeshkumar B, Mahesh B, Reddy BV (2018) Gps based autonomous
agricultural robot
Liu R, Zhang Y, Ge Y, Hu W, Sha B (2020) Precision regulation model of water and fertilizer for
alfalfa based on agriculture cyber-physical system. IEEE Access 8:38501–38516
Naik NS, Shete VV, Danve SR (2016) Precision agriculture robot for seeding function. In: 2016
international conference on inventive computation technologies (ICICT), vol 2, pp 1–3. IEEE
Praveena R, Srimeena R, et al (2015) Agricultural robot for automatic ploughing and seeding.
In: IEEE technological innovation in ICT for agriculture and rural development (TIAR). IEEE
2015:17–23
Ranjitha B, Nikhitha MN, Aruna Afreen K, Murthy BTV (2019) Solar powered autonomous mul-
tipurpose agricultural robot using bluetooth/android app. In: 3rd International conference on
electronics, communication and aerospace technology (ICECA), pp 872–877
Srinivasan N, Prabhu P, Smruthi SS, Sivaraman NV, Gladwin SJ, Rajavel R, Natarajan AR (2016)
Design of an autonomous seed planting robot. In: IEEE region 10 humanitarian technology
conference (R10-HTC), pp 1–4. IEEE
Smith CV, Doran MV, Daigle RJ, Thomas TG (2013) Enhanced situational awareness in autonomous
mobile robots using context-based mapping (october 2012). In: 2013 IEEE international multi-
disciplinary conference on cognitive methods in situation awareness and decision support
(CogSIMA)
Sujon MDI, Nasir R, Habib MMI, Nomaan MI, Baidya J, Islam MR (2018) Agribot: Arduino
controlled autonomous multi-purpose farm machinery robot for small to medium scale cultivation.
In: 2018 international conference on intelligent autonomous systems (ICoIAS), pp 155–159. IEEE
Umarkar S, Karwankar A (2016) Automated seed sowing agribot using arduino. In: 2016 interna-
tional conference on communication and signal processing (ICCSP), pp 1379–1383. IEEE
Wang X, Ning Z, Wang L (2018) Offloading in internet of vehicles: a fog-enabled real-time traffic
management system. IEEE Trans Indus Inform 14(10):4568–4578
Chapter 20
An Automatic Tumor Identification
Process to Classify MRI Brain Images

Arpita Ghosh and Badal Soni

Abstract The mortality rate due to failure of brain tumor diagnosis and treatment is increasing rapidly. Accurate and feasible interpretation of a brain tumor is mandatory for subsequent prognosis as well as medication. Inspection of brain tumors can be done by expert physicians, but this makes the process labor intensive as well as time consuming. So, in this work, we propose an automatic tumor identification process to classify MRI brain images containing tumors of benign and malignant type using an advanced convolution neural network (CNN) architecture. The proposed model is analyzed using performance metrics such as precision, recall, and F1 score; as per the analysis, the proposed method gives better results compared with the other state-of-the-art methods.

20.1 Introduction

The growth of abnormal tissue in the human brain can lead to the appearance of a tumor. A primary brain tumor can be cancerous or benign in nature. Gliomas and meningiomas are the two most frequent types of primary brain tumor. Glioma tumors originate from glial cells. The other type of primary tumor, the meningioma, tends to develop more often among women than men. These tumors are benign in nature but can cause complications due to their location and size.
Last year, with an increasing trend in India, around 5–10 cases of brain tumor per one lakh population were encountered; among these, 20% of cases are children under the age of 15 years. The symptoms of a brain tumor include early morning headache, continuous vomiting or nausea, partial memory loss, sleep problems, etc. Brain tumors are diagnosed into two categories, benign and malignant; benign tumors do not require surgical treatment unless they grow in size and show doubtful symptoms. Hence, early and accurate diagnosis of malignant tumors becomes mandatory to reduce the rate of mortality.


A brain tumor can be diagnosed from different medical imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT) scans, ultrasonography, etc. MRI uses magnetic fields to construct a complete image of the body and can help to measure the size of the tumor.
Recently, various automatic tumor diagnosis techniques such as the K-means clustering algorithm, the fuzzy C-means method (Abdel-Maksoud et al. 2015), the LinkNet convolution network (Sobhaninia et al. 2018), ELM-LRF (Ari and Hanbay 2018), and SVM (Priya et al. 2016) have been used for automatic tumor detection. Much work has already been mentioned in the related work section, but there is still a gap in achieving more promising results on the performance metrics.
In the current study, the proposed model consists of two separate sets of neural network layers: a convolution network and fully connected dense layers. Three convolution layers are used to extract features from the input image, and two fully connected dense layers are used for classification; a minimal sketch of such a network is given below. The classification report of the proposed model is given in Table 20.3, where the precision, recall, and F1 score of the model are calculated.
Related work is discussed in Sect. 20.2, and the concept of CNN is highlighted in Sect. 20.3. Section 20.4 describes the proposed architecture. The data set description is given in Sect. 20.5. Section 20.6 covers the experimental analysis and results. Conclusion and future work are discussed in Sect. 20.7.

20.2 Related Work

Abdel-Maksoud et al. (2015) used a dynamic image segmentation method combining the K-means clustering algorithm and the fuzzy C-means method, followed by thresholding and level set segmentation for accurate identification. The experimental work was done using three benchmark data sets: digital imaging and communications in medicine (DICOM), BrainWeb and BRATS.
Kumar et al. (2017) presented a hybrid method in which the discrete wavelet transform (DWT) was used for feature extraction, a genetic algorithm was used for reducing the features, and classification was performed using SVM on data collected from the SICAS medical repository.
Mohsen et al. (2018) used a deep neural network as a classification tool on a data set containing 66 MRI images collected from the Harvard Medical School website; the DNN classifies the images into four classes: normal, sarcoma, glioblastoma and metastatic bronchogenic carcinoma.
Zahra Sobhaninia et al. (Sobhaninia et al. 2018) described the use of deep learning, i.e., the LinkNet convolution network, for image segmentation using brain MRI images taken from different angles.
Ari and Hanbay (2018) used a deep learning approach for tumor classification; their technique consists of pre-processing steps, ELM-LRF for tumor classification and extraction of the tumor region, and a watershed algorithm for segmentation.
Goswami and Bhaiya (2013) presented an automatic system where edge detection, histogram equalization, noise removal and thresholding were performed as pre-processing steps. The independent component analysis (ICA) method is used for feature extraction, and a self-organizing map is used for brain tumor diagnosis. Finally, for segmentation of the tumor into different cells, the K-means clustering algorithm is applied.
Safaa E. Amin et al. (Amin and Megeed 2012) proposed a perceptive neural network that can automatically classify the types of brain tumors present. The proposed system is divided into two parts: a hybrid neural network system with PCA for dimensionality reduction and feature extraction, and segmentation of the MRI images using the wavelet multi-resolution expectation resolution (WMER) algorithm. Lastly, a multi-layer perceptron (MLP) is applied to classify the features extracted from the first or second phase.
Mohana Priya K. et al. (Priya et al. 2016) used a support vector machine (SVM) to classify brain tumor images into different classes. The SVM is combined with statistical features such as a first-order feature set, a second-order feature set and the combination of both. The experimentation in the paper considers different SVM kernel types and different gamma values. Since the analysis only compares kernel types of the SVM classifier, the supremacy of the approach is not compared with other state-of-the-art approaches.
Al-Ayyoub et al. (2012) presented a machine learning approach for detecting whether a brain tumor is present in an MR image. Four different classification algorithms, i.e., ANN, Tree J48, Naïve Bayes and Lazy IBK, were applied to 27 MR images, and the results were compared using recall, precision, F1 score and correctness, showing that the accuracy of ANN is the best among them.
Sudha et al. (2014) used a feed-forward neural network, a multi-layer perceptron and a BP neural network for classification. Feature extraction was done using the GLCM and GLRM approaches.
Pereira et al. (2016) used a CNN for segmentation of the BRATS MRI data set. Data augmentation was used to increase the size of the training set. Their architecture was able to identify two tumor grades, i.e., HGG and LGG, and the evaluation metrics DSC, PPV and sensitivity were used to measure its performance.
Tanzila Saba et al. (Saba et al. 2020) used an accurate segmentation process called the Grab-cut method for tumor segmentation and a VGG-19 model for feature extraction on the BRATS data set. After extracting the features, they used several classifiers, namely decision tree (DT), linear discriminant analysis (LDA), K-nearest neighbour (KNN), an ensemble classifier and support vector machine (SVM), and compared the results obtained from them; the classifiers were evaluated based on accuracy and DSC.

20.3 Convolution Neural Network

A CNN or convolutional neural network comprises some basic layers that define the working principle of the network.
1. Convolution Layer: This layer extracts features from the input image while preserving the correlation between image pixels. Using different types of filters, it can perform various operations such as edge detection, image sharpening and blurring of an image. For better understanding, an example is given below (a code sketch reproducing this computation appears after this list), in which an image with 1 channel is convolved with a 3 × 3 kernel.
$$
\begin{bmatrix}
5 & 5 & 3 & 2 & 1\\
1 & 0 & 3 & 0 & 1\\
2 & 5 & 3 & 5 & 2\\
3 & 0 & 0 & 2 & 5\\
1 & 2 & 3 & 5 & 4
\end{bmatrix}
\ast
\begin{bmatrix}
1 & 0 & 1\\
0 & 1 & 0\\
1 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
13 & 20 & 9\\
12 & 5 & 14\\
9 & 17 & 14
\end{bmatrix}
\qquad (20.1)
$$

2. Activation Layer: The activation layer is used for converting the output of
the conv layer into a nonlinear output. In this experiment, “ReLU” is used as a
activation function. ReLU stands for rectified linear unit. The operation of ReLU
is mentioned below.
f (x) = max(0, x) (20.2)

Using this function, the network will learn the non-negative linear values.
3. Pooling Layer: This layer is important when size of the input image is too large;
in that case, reduction of number of trainable parameters is necessary. So usage
of this layer between subsequent convolution layers is important.
4. Fully connected layer (FC Layer): Before feeding the input matrix to the
FC layer, we should flatten the matrix into a vector. From the above-stated
diagram, the matrix of the feature map will be converted into vector such as
i 1 , i 2 , i 3 , . . . , i n . The creation of model will be done by combining the extracted
features together by the FC layer such as Fig. 20.1.
5. Output Layer: This layer comprises of a activation functions softmax or sigmoid
to classify the outputs. In this work, we used softmax as a activation function
for the output layer. The overall CNN architecture is illustrated in Fig. 20.2.
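As a quick check of Eq. (20.1), the following minimal NumPy/SciPy sketch reproduces the same valid cross-correlation of the 5 × 5 input with the 3 × 3 kernel:

```python
import numpy as np
from scipy.signal import correlate2d

# 5 x 5 single-channel input image from Eq. (20.1)
image = np.array([[5, 5, 3, 2, 1],
                  [1, 0, 3, 0, 1],
                  [2, 5, 3, 5, 2],
                  [3, 0, 0, 2, 5],
                  [1, 2, 3, 5, 4]])

# 3 x 3 kernel from Eq. (20.1)
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

# "valid" mode keeps only positions where the kernel fits entirely,
# which is what a convolution layer without padding computes
feature_map = correlate2d(image, kernel, mode="valid")
print(feature_map)   # [[13 20  9], [12  5 14], [ 9 17 14]]
```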

20.4 Proposed Architecture

The proposed methodology is based on an advanced convolutional neural architecture. The network is able to identify MR images containing a tumor and those without a tumor. The steps of the proposed work are described here.

Fig. 20.1 Fully connected layer after the pooling layer

Fig. 20.2 CNN architecture

1. Data acquisition: The MR image data is collected from the Kaggle data repository (brain-mri-images-for-brain-tumor-detection 2019). The data set contains a total of 253 MR brain images, of which 155 images contain a brain tumor and 98 images do not.
2. Pre-processing: The data pre-processing step includes reading and resizing the images. Storing the path of the images in a variable and creating a function to load the image folders into arrays of numbers are necessary while reading the images. Resizing is necessary because it provides a common input size for the neural network.
3. Create the training set: This step creates a training set with labels "yes" and "no", denoting MR brain images with and without a tumor, respectively (a data-loading sketch is given after this list).

Table 20.1 Model: sequential

Layer (type)                     Output shape           Param #
conv2d (Conv2D)                  (None, 48, 48, 32)     320
activation (Activation)          (None, 48, 48, 32)     0
max_pooling2d (MaxPooling2D)     (None, 24, 24, 32)     0
conv2d_1 (Conv2D)                (None, 22, 22, 64)     18496
activation_1 (Activation)        (None, 22, 22, 64)     0
max_pooling2d_1 (MaxPooling2D)   (None, 11, 11, 64)     0
conv2d_2 (Conv2D)                (None, 9, 9, 64)       36928
activation_2 (Activation)        (None, 9, 9, 64)       0
max_pooling2d_2 (MaxPooling2D)   (None, 4, 4, 64)       0
dropout (Dropout)                (None, 4, 4, 64)       0
flatten (Flatten)                (None, 1024)           0
dense (Dense)                    (None, 128)            131200
activation_3 (Activation)        (None, 128)            0
dense_1 (Dense)                  (None, 128)            16512
activation_4 (Activation)        (None, 128)            0
dense_2 (Dense)                  (None, 2)              258
activation_5 (Activation)        (None, 2)              0

Total params: 203,714
Trainable params: 203,714
Non-trainable params: 0

4. Train the CNN architecture: The network includes 3 Conv2D layers with the ReLU activation function; a max-pooling function of size 2 × 2 is used for the pooling layers, and 2 fully connected hidden layers are used. The loss of the trained network is calculated using sparse categorical cross-entropy, and the network is trained for up to 25 epochs.
5. Test the network: In this phase, the network decides whether a given MR image contains a tumor or not.
The architecture of the proposed advanced CNN model is given in Table 20.1.
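A minimal sketch of steps 1–3 is given below. The directory names (`yes/`, `no/`) and the 50 × 50 grayscale input size are assumptions inferred from the label names and from the output shapes in Table 20.1, not details reported in the chapter.

```python
import os
import cv2
import numpy as np

DATA_DIR = "brain_mri"   # assumed folder containing "yes" and "no" sub-folders
IMG_SIZE = 50            # assumed; Table 20.1 implies a 50 x 50 x 1 input

def load_images(data_dir=DATA_DIR, img_size=IMG_SIZE):
    """Read every image, convert to grayscale, resize, and attach its label."""
    images, labels = [], []
    for label, class_name in enumerate(["no", "yes"]):   # 0 = no tumor, 1 = tumor
        class_dir = os.path.join(data_dir, class_name)
        for file_name in os.listdir(class_dir):
            img = cv2.imread(os.path.join(class_dir, file_name), cv2.IMREAD_GRAYSCALE)
            if img is None:                               # skip unreadable files
                continue
            img = cv2.resize(img, (img_size, img_size))   # common base size
            images.append(img)
            labels.append(label)
    X = np.array(images, dtype="float32") / 255.0         # scale pixels to [0, 1]
    X = X.reshape(-1, img_size, img_size, 1)              # add channel dimension
    y = np.array(labels)
    return X, y

X, y = load_images()
print(X.shape, y.shape)   # e.g. (253, 50, 50, 1) (253,)
```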

20.4.1 Architecture Description

This CNN model consists of 3 convolution blocks (conv + activation + pooling), and 2 fully connected hidden layers are used before the final softmax output layer. The model was trained for 25 epochs, as this gives the most optimal result for the data set used. The architecture of the model is shown in Fig. 20.3.

Fig. 20.3 Proposed convolution neural architecture

This architecture is built in such a manner that it gives the optimal result for the data set used. The Conv2D layer is used for feature extraction; the number of filters used here is 32, and the kernel size is 3 × 3. ReLU (rectified linear unit) is used as the activation function. The purpose of using ReLU is to introduce nonlinearity into the output of the Conv2D layer; with this function, the network learns only non-negative values. A pooling layer is used along with the activation layer; if the input image is large, it reduces the number of parameters. Here, max-pooling is used as the pooling function. All these layers are important for extracting features from the input image.
Fully connected layers are used to classify the input image as with or without tumor. In this architecture, 2 fully connected hidden layers are used along with the last softmax output layer with 2 neurons for the 2 classes. The output layer of the CNN is responsible for producing the probability of each class.
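A Keras sketch that reproduces the layer-by-layer summary of Table 20.1 is shown below. It is an illustrative reconstruction rather than the authors' released code; the 50 × 50 × 1 input shape is inferred from the first Conv2D output (48 × 48 × 32), while the dropout rate and the optimizer are assumptions not reported in the chapter.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(50, 50, 1)):
    model = models.Sequential([
        # Block 1: 32 filters of size 3 x 3 -> (48, 48, 32), 320 parameters
        layers.Conv2D(32, (3, 3), input_shape=input_shape),
        layers.Activation("relu"),
        layers.MaxPooling2D((2, 2)),
        # Block 2: 64 filters -> (22, 22, 64), 18,496 parameters
        layers.Conv2D(64, (3, 3)),
        layers.Activation("relu"),
        layers.MaxPooling2D((2, 2)),
        # Block 3: 64 filters -> (9, 9, 64), 36,928 parameters
        layers.Conv2D(64, (3, 3)),
        layers.Activation("relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),          # rate assumed; not given in the chapter
        layers.Flatten(),              # 4 * 4 * 64 = 1024 features
        layers.Dense(128),
        layers.Activation("relu"),
        layers.Dense(128),
        layers.Activation("relu"),
        layers.Dense(2),               # two classes: tumor / no tumor
        layers.Activation("softmax"),
    ])
    model.compile(optimizer="adam",    # optimizer assumed
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()   # should reproduce the 203,714 trainable parameters of Table 20.1
```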

20.4.2 Relation of Number of Layers with the Result

The number of layers used here depends on the data set, specifically on its variation and size. Using more layers would extract additional features only up to a certain limit; beyond that, instead of extracting useful features, the network overfits the data and produces erroneous results such as false positives.

20.5 Data Set Description

The data set is collected from the Kaggle data repository (brain-mri-images-for-brain-tumor-detection 2019). The MR brain image data set contains JPEG images with and without brain tumors. Among the total of 253 images, 155 images contain a brain tumor and 98 images do not.
After collection of the data set, the first step was pre-processing, where reading and resizing the images were the main tasks. Reading the images was done by storing the path in a variable and creating a function to load the image folder into arrays of numbers. Resizing the images provides a common input size for the neural network. Samples of the data set are given in Fig. 20.4a, b, respectively.
Analysis of the collected brain images includes image segmentation as a key step. It sub-divides an image and separates the object-of-interest region from the background. This step was performed during feature extraction; the operation only separates the regions of the image and does not seek to analyze the segmented part further.

20.6 Experimental Results and Analysis

The experiments were run on an Intel® Core™ i5-8500 CPU at 3.00 GHz with 8 GB RAM and the Windows 10 operating system. For the simulation work, Python 3.6.7 and Keras with a TensorFlow backend were used to implement the CNN architecture. For data visualization, the scikit-learn, matplotlib and seaborn modules were used, and the pandas and numpy modules were used for reading the data. We used a confusion matrix to calculate the accuracy of the classifier; its format is given in Table 20.2.
The terms in the confusion matrix are associated with the performance of the proposed model. The definitions of the terms TP, FP, FN and TN are given here.

True positive (TP) = predicts tumor as tumor

False positive (FP) = predicts non-tumor as tumor

False negative (FN) = predicts tumor as non-tumor

True negative (TN) = predicts non-tumor as non-tumor

The precision, recall and F1 score are calculated using the formulas given below:

(a) MRI sample data set without brain tumor

(b) MRI sample data set with brain tumor

Fig. 20.4 Sample of collected MRI brain tumor data set



Table 20.2 Confusion matrix format

                 Predicted true          Predicted false
Actual true      True positive (TP)      False negative (FN)
Actual false     False positive (FP)     True negative (TN)

Fig. 20.5 Confusion matrix values

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (20.3)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (20.4)$$

$$\text{F1 score} = \frac{2 \times (\text{precision} \times \text{recall})}{\text{precision} + \text{recall}} \qquad (20.5)$$
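The same metrics can be computed directly from confusion-matrix counts, as in the following sketch; the counts shown are placeholders for illustration, not the results reported for the proposed model.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Eqs. (20.3)-(20.5) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Hypothetical counts used only to illustrate the formulas
p, r, f1 = precision_recall_f1(tp=150, fp=5, fn=5)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```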

The confusion matrix associated with the experimental work is given in Fig. 20.5. The above-mentioned performance measures associated with the experiment, i.e., precision, recall, F1 score and support values, are given in Table 20.3.

Table 20.3 Classification report


Precision Recall F1 score Support
No 0.97 0.98 0.95 98
Yes 0.96 0.98 0.97 155
Accuracy 0.96 253
Macro avg. 0.97 0.96 0.96 253
Weighted avg. 0.96 0.96 0.96 253

The tumor detection task for the proposed model is an imbalanced classification problem in which we need to identify two classes, with tumor and without tumor. In disease detection, this imbalance issue occurs when the prevalence of the disease is very low; in such a condition, the positive class tends to be enormously exceeded by the negative class. Accuracy is not a good metric for evaluating model performance in that case, so recall can be a better statistic for evaluation. The definition of recall is given in Eq. 20.4: recall measures the ability of the model to identify the samples of greatest concern in a specific data set.
The formula for precision is given in Eq. 20.3, where FPs are the samples that the model incorrectly identifies as positive although they are actually negative; in this problem, FP counts the images labeled as tumor that do not actually contain one. Precision expresses the proportion of samples labeled as relevant by the model that are actually relevant.

(a) accuracy and validation accuracy of the model

(b) loss and validation loss of the model

Fig. 20.6 Accuracy and loss of the model up to 30 epoch



Table 20.4 Comparison of the accuracy and loss based on number of epochs
No. of epochs Loss Accuracy Val_loss Val_accuracy
20 0.1908 0.9048 0.3423 0.8077
21 0.1959 0.9031 0.3407 0.8077
22 0.1308 0.9404 0.2928 0.8692
23 0.1346 0.9427 0.2642 0.8846
24 0.1437 0.9604 0.3972 0.8462
25 0.0625 0.9824 0.9310 0.8846
26 0.0722 1.0000 0.9580 0.7692
27 0.0885 1.0000 0.9580 0.7692
28 0.0244 1.0000 0.9580 0.7692
29 0.0207 1.0000 0.7408 0.8846
30 0.0145 1.0000 0.9358 0.8077
Bold values in the table show the highest accuracy at the optimal number of epochs, i.e., 25

In the case of preliminary tumor identification, we need an optimal fusion of recall and precision, so we merge the two metrics using the F1 score, whose formula is given in Eq. 20.5. The macro-average in the classification report is computed individually for each class and then averaged, so it treats all classes equally. In the weighted-average calculation, the ratio of occurrence of each class is considered. The accuracy of the model is interpreted after the parameters of the model are learned from the given data and fixed, and no further learning takes place; test samples are then fed to the model, and accuracy is calculated from its performance. The accuracy of the model is shown in Fig. 20.6a.
The loss function is calculated to optimize the model; the objective is to minimize the loss with respect to the model parameters. This value indicates how well or poorly the model behaves after every iteration. The loss of the model is shown in Fig. 20.6b.
A comparison based on the number of epochs is given in Table 20.4. We can observe that the accuracy is highest at 25 epochs; when we tried to increase the number of epochs further, the model started to overfit. For the given data set, the optimal number of epochs is 25; in general, the optimal number of epochs depends on the diversity of the data set.

20.7 Conclusions and Future Works

In this work, a novel advanced CNN architecture is proposed for automatic brain tumor identification. Based on several MRI brain images, the current model is able to identify brain tumors correctly. The model is composed of two parts: convolution layers for feature extraction and dense layers for classification. Manual inspection and generation of diagnostic results are a real burden, which can be reduced by using this automated model. The experimental results demonstrate the effectiveness of the model.
In future work, we plan to examine the results on a larger data set, and the model can be extended so that it can also identify the type of tumor present. This can reduce the effort and burden for expert physicians in determining the type of tumor present.

References

Abdel-Maksoud E, Elmogy M, Al-Awadi R (2015) Brain tumor segmentation based on a hybrid


clustering technique. Egyptian Inform J 16(1):71–81
Al-Ayyoub M, Husari G, Darwish O, Alabed-alaziz A (2012) Machine learning approach for brain
tumor detection. In: Proceedings of the 3rd international conference on information and commu-
nication systems, pp 1–4
Amin SE, Megeed M (2012) Brain tumor diagnosis systems based on artificial neural networks
and segmentation using mri. In: 2012 8th international conference on informatics and systems
(INFOS). IEEE, pp MM–119
Ari A, Hanbay D (2018) Deep learning based brain tumor classification and detection system.
Turkish J Electr Eng Comput Sci 26(5):2275–2286
Goswami S, Bhaiya LKP (2013) Brain tumour detection using unsupervised learning based neural
network. In: 2013 international conference on communication systems and network technologies.
IEEE, pp 573–577
Kumar S, Dabas C, Godara S (2017) Classification of brain mri tumor images: a hybrid approach.
Proc comput Sci 122:510–517
Mohsen H, El-Dahshan ESA, El-Horbaty ESM, Salem ABM (2018) Classification using deep
learning neural networks for brain tumors. Future Comput Inform J 3(1):68–71
Pereira S, Pinto A, Alves V, Silva CA (2016) Brain tumor segmentation using convolutional neural
networks in mri images. IEEE Trans Medical Imaging 35(5):1240–1251
Priya KM, Kavitha S, Bharathi B (2016) Brain tumor types and grades classification based on
statistical feature set using support vector machine. In: 2016 10th international conference on
intelligent systems and control (ISCO). IEEE, pp 1–8
Saba T, Mohamed AS, El-Affendi M, Amin J, Sharif M (2020) Brain tumor detection using fusion
of hand crafted and deep learning features. Cogn Syst Res 59:221–230
Sobhaninia Z, Rezaei S, Noroozi A, Ahmadi M, Zarrabi H, Karimi N, Emami A, Samavi S
(2018) Brain tumor segmentation using deep learning by type specific sorting of images.
arXiv:1809.07786
Sudha B, Gopikannan P, Shenbagarajan A, Balasubramanian C (2014) Classification of brain tumor
grades using neural network. In: Proceedings of the world congress on engineering 2014, vol 1.
WCE
https://www.kaggle.com/brain-mri-images-for-brain-tumor-detection. Apr 2019
Chapter 21
Lane Detection for Intelligent Vehicle
System Using Image Processing
Techniques

Deepak Kumar Dewangan and Satya Prakash Sahu

Abstract Intelligent vehicle systems (IVS) are being designed to improve the safety, convenience, and lifestyle of society. At the same time, they aim to enhance driving behavior to minimize traffic-related issues. Artificial intelligence assists such autonomous systems, and its functionality is no longer restricted to software data but is used for decision making in various phases of the IVS in dynamic road environments. One such phase, lane detection, plays a significant role in IVS, especially through various sensors. Here, a vision-based sensor mechanism is employed to detect lane markings on structured roads. For this purpose, traditional image processing techniques are applied to keep the computation less complex, and the public KITTI dataset is utilized. The proposed scheme effectively identifies various lane markings on the road under normal driving conditions.

21.1 Introduction

To facilitate society in terms of transportation, various automobile industries and ongoing research efforts are improving transportation modes through the integration of numerous technologies. These advanced transportation systems assist humans with a number of safety features and facilities. However, these advanced safety features do not ensure complete human safety, owing to random factors (Dewangan and Sahu 2020), casual driving, poor road structure, and varying illumination. The World Health Organization (WHO) reported a high road death toll of 299,091 in 2016 (World Health Organization 2020), and these death rates in road traffic remain a challenging problem.

D. K. Dewangan (B) · S. P. Sahu


Department of Information Technology,
National Institute of Technology, Raipur, Chhattisgarh, India
e-mail: dkdewangan.phd2018.it@nitrr.ac.in
S. P. Sahu
e-mail: spsahu.it@nitrr.ac.in


Therefore, a mechanism or driver assistive system (DAS) is required in such vehicles that can provide early warning to the driver or itself modify its driving decisions in order to maintain the safety of the vehicle and of humans. Numerous sensors exist for sensing the driving environment; lidar (Castorena and Agarwal 2017; Changalvala and Malik 2019; Cui et al. 2020) and radar (Fernandez et al. 2018) are operated in driverless vehicles to recognize the environment and real-time objects. Vision sensor approaches (Drews et al. 2019; Gupta and Choudhary 2018; Okamoto et al. 2019) have also attracted industry and researchers due to their comparatively low cost, operational stability, and compatibility with modern artificial intelligence techniques. In general, a self-driving car or IVS has a number of diverse and significant phases to handle in real time, mostly including detection and categorization of road surfaces and lane markings, and recognition of pedestrians, vehicles, and other elements of the road traffic system. Among these phases, recognizing lane boundaries on the road surface is a challenging and important issue for keeping the vehicle driving safely and adapting its driving behavior accordingly. To learn several distinct parts of such objects, well-known approaches including feature-based,
model-based, and knowledge-based techniques have been used in several studies
(Conrad and Foedisch 2003; Foedisch and Takeuchi 2004; He et al. 2004; Jian et al.
2009; Kong and Audiberta 2010; Rasmussen 2004; Sha and Zhang 2007; Wang
et al. 2008; Zhang and Nagel 1994; Zhang et al. 2009; Zhou and Jiang 2010). It is also evident that the mentioned studies and sensor-based approaches are not always applicable in real-time environments such as urban or unstructured roads, due to the high variation of driving conditions where there are no distinct road or lane markings. Understanding the different lane marking schemes is another major issue for an intelligent vehicle, because each distinct marking carries a different meaning in road traffic.
Lane markings differ from country to country. In the context of India, some lane marking schemes (Types of Roads and Lane System In India Explained 2020) are described here, where each marking on the road surface serves a purpose and permits specific vehicle maneuvers. Generalized visuals of the stated lane marking schemes are shown in Fig. 21.1.

21.1.1 White Line (Broken)

This is the regular lane marking scheme most commonly seen in the country. It permits the vehicle to perform activities such as overtaking, U-turns, and changing lanes, but to keep driving safe, the road traffic is expected to be mostly clear before these activities are attempted.

Fig. 21.1 A general representation of various lane marking schemes

21.1.2 White Line (Continuous)

On the road, this lane marking scheme does not allow a vehicle to perform activities like taking U-turns or overtaking other vehicles unless the situation demands it to avoid an accident. These markings are commonly found on hilly roads to avoid any chance of accidents.

21.1.3 Yellow Line (Continuous)

Under this scheme, overtaking is permissible only if the vehicle remains on its own side of the line; such markings are mostly found in areas where visibility is slightly low.

21.1.4 Double Yellow Line (Continuous)

It indicates that crossing the lane marking is strictly not tolerated; such markings are mostly found in areas where there is a high probability of frequent or constant hazards.

21.1.5 Yellow Line (Broken)

It allows the vehicle to take U-turns and to overtake other vehicles, provided that it is completely safe to do so.
To make an autonomous vehicle more intelligent in lane detection, it is required to understand the features of these marking schemes so that a proper driving decision can be made for such vehicles. The situation becomes complex when the traffic load is high, and the involvement of pedestrians makes the detection task difficult. Computation cost is also a challenging issue when artificial intelligence and deep learning techniques are involved. The approach for finding lane markings on the road using traditional techniques can be visualized in Fig. 21.2. In this direction, a study of various approaches for lane detection is presented in Sect. 21.2. The employed methods and their working concepts are presented in Sect. 21.3. Experimental analysis and results are discussed in Sect. 21.4, and concluding remarks are given in Sect. 21.5.

Fig. 21.2 Process flow for the proposed approach

21.2 Related Works

In order to detect lane markings, various studies learn lane line features using image processing, computer vision, feature-based and model-based methods, and convolutional neural network (CNN) techniques. A practical and reliable roadway vanishing point tracking system based on the principles of v-disparity and visual odometry has been discussed, in which v-disparity mapping can effectively reduce the state space toward the vanishing point; visual odometry also benefits vanishing point detection for both straight and curved roads (Su Yingna et al. 2018). To estimate the lane equation from lane candidates, a Kalman filter and RANSAC were applied, followed by a state-transfer approach to maintain lane tracking (Choi et al. 2012). With an ROI implementation, lane identification can be performed through the Hough space; it was also mentioned that this model could be improved with GIS or an electronic map (Song Wenjie et al. 2018).
GIS or electronic map (Song Wenjie et al. 2018). Using Hough transform in Hough
space, a lane line can be identified where all points with parallel characteristic, length
and angle, and apprehend characteristics are considered in Hough space (Zheng Fang
et al. 2018). A collection of fuzzy collinear fuzzy lines, and line searching is able to
handle vague data and enables computational burden to be decreased compared to
Hough transform (Obradović et al. 2013). B-snake algorithm is addressed in the lane
identifier, and canny/Hough vanishing point estimation (CHEVP) is applied with
minimal mean square error (MMSE) to classify the control points on two sides of the
lane (Wang Yue et al. 2004). Recognition of lane markings using lane detection and
Hough transformation in combination with field programmable gate array (FPGA)
21 Lane Detection for Intelligent Vehicle System … 333

and digital signal processor ( DSP) was used, and lane markings can be accurately
detected by using gradient direction and gradient amplitude together (Xiao Jing et al.
2016). Feature line selection (FLS) a method based on a linear-cubic road model is
incorporated for two-way lane detection and involves only correct lane positions and
angles in close regions (Xin LIU et al. 2012). Lane detection using vanishing points
is based on a probabilistic technique based on the intersection of line sections from
an image. The host lane is being optimized using similarity to the interframe (Yoo
Ju Han et al. 2017).
Feature-based techniques that use visible features from an image, such as boundaries (Gaikwad and Lokhande 2015; Lotfy et al. 2017; Son et al. 2015), colors, and intensity variations, are widely used. Edge-based feature detection involves edge information, lane recognition, and departure estimation. The most popular edge detection operators are Canny (Gaikwad and Lokhande 2015; Kortli et al. 2016), Sobel (Dai et al. 2016), and Prewitt (Li et al. 2014), which have been shown to be effective for robust pixel-wise edge detection (Son et al. 2015). Using a stretching function, an additional intensity-based enhancement can be performed to correctly distinguish lanes with different colors. The research in (Gaikwad and Lokhande 2015) employs a 5-PLSF feature for contrast enhancement followed by a lane-width constraint applied to estimate missing lanes, which decreases the system's false alarm rate. Dai et al. (2016) used separate day-time and night-time identification and then applied a gamma-correction method to obtain efficient detection under poorly lit conditions. Similarly, the work in (Lotfy et al. 2017) acquired an image's inverse perspective map (IPM) and then used a score-based lane detection and tracking system.
The Hough transform (Gaikwad and Lokhande 2015; Kortli et al. 2016) and RANSAC were also implemented for lane recognition from the obtained edges. A Hough transform inspired by RANSAC (Lotfy et al. 2017) was also used for lane detection to reduce the time per frame. Traditional lane detection strategies are not very accurate in the presence of stray edges found in urban surroundings (Kortli et al. 2016); careful edge selection greatly improves Hough transform performance, raising the average detection accuracy. Further solutions to this problem are learning-based (Gurghian et al. 2016) or smartphone-based (Murugesh et al. 2016). Learning-based methods, as in (Jayanth Balaji et al. 2017; Nair et al. 2017; Singh et al. 2016), have been widely utilized in other applications. This technique is free from the traditional approach, but it requires an enormous labeled dataset to train a CNN, and the performance of such a system greatly depends on the time-consuming, labeled training dataset.

21.3 Material and Methods

Considering this scenario, the image frame is captured from the vision sensor mounted on the vehicle. Basic preprocessing is required to enhance the quality of the image and to perform some corrective actions. Afterwards, the images are passed through a filtering mechanism to fetch contour information and recognize those pixels which belong to the lane marking schemes. Finally, these lane pixels are fitted with a model-based approach. In this direction, the following significant stages are performed:

21.3.1 Preprocessing

After acquiring the images from the camera or from a raw dataset, they need to be processed. Images are pre-processed to simplify them and to support extraction of lane marking features from the road surface. Color image processing is computationally expensive, so in this step the input image is transformed to a gray scale image, for which processing is computationally simpler. Gray scale images are fully adequate for many tasks, so the use of full color images is not needed here. One such procedure is contrast stretching, which re-maps the pixels to use the maximum range of possible values and can be described by:
Gray(i, j) = Transformation[Image(i, j)] (21.1)

where Image(i, j) is the gray value of the (i, j)th pixel of the given input image, Gray(i, j) is the gray value of the (i, j)th pixel of the improved image, and Transformation is the mapping procedure whose gray level values depend on the feature used for remapping.
To make this procedure more robust, the pixel intensity values are rescaled to yield an image in which the brightness values of the pixels are more uniformly distributed. Let I represent the number of intensity values and Q describe the normalized histogram of the image; the equalization can be described by:
$$\text{ImageEqualization}(i, j) = (I - 1) \sum_{n=0}^{I(i, j)} Q_n \qquad (21.2)$$

where Q_n represents the ratio of the number of pixels with intensity n to the total number of available pixels.
To reduce the noise present in the image, the brightness of the pixels found within a mask is averaged in a filtering procedure over a local neighborhood about location (i, j). The final outcome is given by:


$$\text{Image}(i, j) = \sum_{x=-p}^{p} \sum_{y=-p}^{p} h(x, y)\, \text{Image}(i + x, j + y) \qquad (21.3)$$

This smoothing simply assists in connecting discontinuous segments during lane line detection.
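A minimal OpenCV sketch of this preprocessing stage (grayscale conversion, histogram equalization as in Eq. (21.2), and neighborhood averaging as in Eq. (21.3)) is shown below; the 5 × 5 kernel size and the file name are illustrative assumptions, not values reported in the chapter.

```python
import cv2

def preprocess(frame_bgr, blur_ksize=5):
    """Grayscale -> histogram equalization -> smoothing, per Sect. 21.3.1."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)        # Eq. (21.1): gray transformation
    equalized = cv2.equalizeHist(gray)                         # Eq. (21.2): spread intensities
    smoothed = cv2.blur(equalized, (blur_ksize, blur_ksize))   # Eq. (21.3): mean filter
    return smoothed

# Example usage on one KITTI-style frame (path is illustrative)
frame = cv2.imread("kitti_frame.png")
if frame is not None:
    processed = preprocess(frame)
```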



21.3.2 Attention Region

Not all portions of an image need to be processed, as the lane portion is found only in the bottom region of the image. Cropping away the top portion of the image therefore yields a reduced region to be processed, which is computationally beneficial.
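One simple way to realize this attention region is to keep only the lower part of the frame; the fraction kept in the sketch below is an illustrative assumption.

```python
def attention_region(image, keep_fraction=0.5):
    """Return only the bottom part of the frame, where lane markings appear."""
    height = image.shape[0]
    top = int(height * (1.0 - keep_fraction))   # drop the upper portion
    return image[top:, :]
```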

21.3.3 Edge Detection

Edge-based segmentation looks for abrupt changes in certain pixel features, and several discontinuity operators employ derivative operations. Such operators are applied to the image or to some derived variable obtained through the application of an appropriate transformation. Subsequently, contour points are identified and fused in order to obtain closed contours which delimit different regions. The first and second order derivatives vary substantially at transitions from dark to light and vice versa. The method measures the variation of the blurred image in all regions and then follows the highest gradient as a sequence of white pixels.

$$\nabla t = \begin{bmatrix} G_i \\ G_j \end{bmatrix} = \begin{bmatrix} \partial t / \partial i \\ \partial t / \partial j \end{bmatrix} \qquad (21.4)$$

where the partial derivatives with respect to i and j are the gradient components in the i and j directions. To determine the partial derivatives, one-dimensional filters are applied in a convolution procedure, and the gradient direction is then computed using the following equation:

$$\theta = \tan^{-1}\left(\frac{G_i}{G_j}\right) \qquad (21.5)$$

Likewise, the gradient magnitude is determined using:

$$\sqrt{G_i^2 + G_j^2} \qquad (21.6)$$

The computed gradient value is compared with upper and lower threshold values to accept or reject values within the predefined range.
The partial derivatives G_i and G_j are estimated with the help of derivative operators to compute the changes in the horizontal as well as the vertical direction. The following are a few operators applied to extract lane features from the image: the Prewitt operator is described by Eqs. (21.7) and (21.8), the Roberts operator is represented by Eqs. (21.9) and (21.10), and Eqs. (21.11) and (21.12) are used for the Sobel operator. The Laplacian operator, which uses only one kernel, is represented by Eq. (21.13).
$$G_i = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix} \qquad (21.7)$$

$$G_j = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} \qquad (21.8)$$

$$G_i = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \qquad (21.9)$$

$$G_j = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \qquad (21.10)$$

$$G_i = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix} \qquad (21.11)$$

$$G_j = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \qquad (21.12)$$

$$\text{Laplacian:} \quad \begin{bmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{bmatrix} \qquad (21.13)$$

Though the mentioned operators are adequate for extracting edges from images, they have limited performance in terms of accuracy and computation time. Another approach for edge detection is the Canny mechanism (Canny 1986), which initially minimizes noise by smoothing the given image with a Gaussian filter. Afterwards, the gradient is computed using any of the mentioned operators, followed by extraction of edge points. Finally, hysteresis thresholding is performed by iterating a kernel over all the pixels in the image, verifying whether the current pixel is an edge component.
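The sketch below applies the Sobel kernels of Eqs. (21.11)–(21.12) and the Canny detector to a preprocessed frame; the Canny thresholds are assumed values chosen for illustration rather than the authors' tuned parameters.

```python
import cv2
import numpy as np

def edge_maps(gray):
    """Compute Sobel gradient magnitude (Eqs. 21.11-21.12, 21.6) and Canny edges."""
    gi = cv2.Sobel(gray, cv2.CV_64F, dx=1, dy=0, ksize=3)   # horizontal changes
    gj = cv2.Sobel(gray, cv2.CV_64F, dx=0, dy=1, ksize=3)   # vertical changes
    magnitude = np.sqrt(gi ** 2 + gj ** 2)                  # Eq. (21.6)
    magnitude = np.uint8(255 * magnitude / (magnitude.max() + 1e-6))

    # Canny: Gaussian smoothing + gradient + hysteresis (thresholds assumed)
    canny = cv2.Canny(gray, threshold1=50, threshold2=150)
    return magnitude, canny
```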

21.3.4 Lane Identification

To detect lane markings in the image, the Hough transform is employed to segregate significant features of a lane from the road surface. The chief benefit of this method is that it is tolerant of gaps in the feature map region, is comparatively unaffected by noise, and is useful for determining a global description of a feature map from the provided local measurements. Each coordinate point contributes its influence to a globally consistent result. For the proposed scenario, to fit a set of line segments, consider the visuals represented in Figs. 21.3 and 21.4.
An appropriate equation describing a set of lines in normal (parametric) form is given in Eq. (21.14) and illustrated in Fig. 21.5.

x cos θ + y sin θ = r (21.14)

where θ is the orientation of the normal of length r with respect to the x axis for the target line. The locations of the edge fragment points (x_n, y_n) in the image are known and thus act as constants in the parametric line equation, while the unknown parameters (r, θ) are searched for.
Fig. 21.3 Coordinate points

Fig. 21.4 Possible set of straight line fittings



Fig. 21.5 Parametric representation

Fig. 21.6 Line detection on camera image using Hough transform



If we plot the distinct outcomes (r, θ) identified by each (x_n, y_n), each point in the Cartesian image space maps to a curve in the polar Hough space. For straight lines, this point-to-curve conversion is the Hough transformation. The transformation is realized by quantizing the Hough parameter space into finite cells. As the algorithm runs, each (x_n, y_n) is converted into a discrete (r, θ) curve, incrementing the accumulator cells that lie along this curve. The corresponding spikes in the accumulator array provide concrete evidence that the frame contains a matching straight line.
Curves created in the gradient image by collinear points converge at peaks in the Hough transform space. These convergence points represent the straight-line fragments of the original image. Finally, a chosen baseline (threshold) is applied to obtain the strong (r, θ) candidates in the final processed image, which correspond to the straight-line borders. A typical example of this approach is represented in Fig. 21.6, and the lane detection after applying the Hough transform can be visualized in Fig. 21.7.

Fig. 21.7 Lane detection under Hough transform procedure
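A compact sketch of this stage using OpenCV's probabilistic Hough transform is given below; the accumulator resolution, vote threshold, and line-length parameters are illustrative assumptions rather than the authors' tuned values.

```python
import cv2
import numpy as np

def detect_lane_lines(canny_edges):
    """Accumulate (r, theta) votes per Eq. (21.14) and return strong line segments."""
    lines = cv2.HoughLinesP(canny_edges,
                            rho=1,                 # 1-pixel resolution for r
                            theta=np.pi / 180,     # 1-degree resolution for theta
                            threshold=40,          # minimum accumulator votes (assumed)
                            minLineLength=30,      # discard very short segments (assumed)
                            maxLineGap=20)         # bridge small gaps in a marking (assumed)
    return [] if lines is None else lines.reshape(-1, 4)

def draw_lanes(frame, segments):
    """Overlay the detected segments on the original frame."""
    out = frame.copy()
    for x1, y1, x2, y2 in segments:
        cv2.line(out, (x1, y1), (x2, y2), (0, 0, 255), thickness=3)
    return out
```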



21.4 Experiment and Result Analysis

In the proposed approach, all implementations are carried out using Python and OpenCV. All steps have been evaluated on data available in the KITTI dataset (Geiger et al. 2013); 289 images were used to test the proposed approach, and the lane markings were determined successfully. Apart from these, some random video sequences were also tested under this scheme, and the obtained results are shown in Figs. 21.8, 21.9, 21.10, 21.11, 21.12 and 21.13.

Fig. 21.8 Sample sequence containing lane marking



Fig. 21.9 Gray scale conversion

21.4.1 Performance Analysis of Edge Detection Approaches

Several techniques for the extraction of edges have been utilized in many studies and compared from a computational perspective. Here, the figure of merit (FOM) (Pratt 1978) has been applied, which quantifies the error between the true edge locations and the estimated edge locations. The quantitative outcome for all the operators applied in this approach is given in Table 21.1.
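For reference, Pratt's figure of merit is commonly defined as FOM = (1 / max(N_I, N_A)) Σ 1 / (1 + α d_i²), where N_I and N_A are the numbers of ideal and detected edge pixels and d_i is the distance from each detected edge pixel to the nearest ideal edge pixel. The sketch below is a generic implementation of this definition (with α = 1/9 as a conventional choice), not the exact scoring code used by the authors.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def pratt_fom(ideal_edges, detected_edges, alpha=1.0 / 9.0):
    """Pratt's figure of merit between two binary edge maps (nonzero = edge pixel)."""
    n_ideal = int(np.count_nonzero(ideal_edges))
    n_detected = int(np.count_nonzero(detected_edges))
    if n_ideal == 0 or n_detected == 0:
        return 0.0
    # Distance from every pixel to the nearest ideal edge pixel
    dist_to_ideal = distance_transform_edt(ideal_edges == 0)
    d = dist_to_ideal[detected_edges > 0]
    return float(np.sum(1.0 / (1.0 + alpha * d ** 2)) / max(n_ideal, n_detected))
```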

Fig. 21.10 Smoothening results



Fig. 21.11 Edge detection



Fig. 21.12 Lane marking detection



Fig. 21.13 Final output with marked lane line

Table 21.1 Quantitative analysis of edge detection operators.


Method FOM score (avg.) Computation time (s)
Prewitt 0.385 0.065
Roberts 0.311 0.087
Sobel 0.234 0.045
Laplacian 0.289 0.145
Canny 0.212 0.084

21.5 Conclusion

We present a drivable lane marking recognition approach which works on structured roads. The idea is to identify and extract the features of lane line markings from road images. For this, edge detection operators were applied and combined with the Hough space transformation. The qualitative results are satisfactory, along with the quantitative scores of the different methods. The Canny edge detection approach was found suitable, as it has a comparatively low computation time and the lowest error score. The performance of all the methods has been tested under standard conditions. For future study, an essential consideration would be to explore the association between the traditional approach and embedded hardware in the design of an intelligent vehicle system, so that such a vehicle can optimize its decision-making capability with fewer computing resources.

References

Canny JF (1986) A computational approach to edge detection. IEEE Trans Pattern Analysis Mach
Intell 8(6):679–698
Castorena J, Agarwal S (2017) Ground-edge-based LIDAR localization without a reflectivity cali-
bration for autonomous driving. IEEE Robot Autom Lett 3(1):344–351
Changalvala R, Malik H (2019) LiDAR data integrity verification for autonomous vehicle. IEEE
Access 7:138018–138031
Choi HC, Park JM, Choi WS, Oh SY (2012) Vision-based fusion of robust lane tracking and forward
vehicle detection in a real driving environment. Int J Autom Technol 13(4):653–669
Conrad P, Foedisch M (2003) Performance evaluation of color based road detection using neural nets
and support vector machines. In: Proceedings of applied imagery pattern recognition workshop,
Washington, DC
Cui Y, Wu J, Xu H, Wang A (2020) Lane change identification and prediction with roadside LiDAR
data. Optics Laser Technol 123:
Dai J, Wu L, Lin H, Tai W (2016) A driving assistance system with vision based vehicle detection
techniques
Dewangan DK, Sahu SP (2020) Real time object tracking for intelligent vehicle. In: 2020 first
international conference on power, control and computing technologies (ICPC2T). IEEE, pp
134–138
Drews P, Williams G, Goldfain B, Theodorou EA, Rehg JM (2019) Vision-based high-speed driving
with a deep dynamic observer. IEEE Robot Autom Lett 4(2):1564–1571
Fernandez MG, Lopez YA, Arboleya AA, Valdes BG, Vaqueiro YR, Andres FLH, Garcia AP (2018)
Synthetic aperture radar imaging system for landmine detection using a ground penetrating radar
on board a unmanned aerial vehicle. IEEE Access 6:45100–45112
Foedisch M, Takeuchi A (2004) Adaptive real-time road detection using neural networks. In: Pro-
ceedings of the 7th international IEEE conference on intelligent transportation systems, Wash-
ington, DC, 3-6 Oct 2004
Gaikwad V, Lokhande S (2015) Lane departure identification for advanced driver assistance. IEEE
Trans Intell Transp Syst 16(2):910–918
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: the kitti dataset. Int J Robot
Res 32(11):1231–1237
GHO—by category—road traffic deaths—data by country. World Health Organization. https://apps.
who.int/gho/data/node.main.A997. Road traffic deaths. Cited 12 Sep 2020

Gupta A, Choudhary A (2018) A framework for camera-based real-time lane and road surface
marking detection and recognition. IEEE Trans Intell Vehicles 3(4):476–485
Gurghian A, Koduri T, Bailur SV, Carey KJ, Murali VN (2016) DeepLanes: end-to-end lane position
estimation using deep neural networks. In: 2016 IEEE conference on computer vision and pattern
recognition workshops, pp 38–45
He Y, Wang H, Zhang B (2004) Color-based road detection in urban traffic scenes. IEEE Trans
Intell Transp Syst 5(4):309–318
Jayanth Balaji A, Harish Ram DS, Nair BB (2017) Machine learning approaches to electricity
consumption forecasting in automated metering infrastructure (AMI) systems: an empirical study.
In: Silhavy R, Senkerik R, Kominkova Oplatkova Z, Prokopova Z, Silhavy P (eds) CSOC 2017.
AISC, vol 574. Springer, Cham, pp 254–263. https://doi.org/10.1007/978-3-319-57264-2_26
Jian W, Zhong J, Yuting S (2009) Unstructured road detection using hybrid features. In: International conference on machine learning and cybernetics, Baoding, China, pp 482–486
Kong H, Audibert J-Y (2010) General road detection from a single image. IEEE Trans Image
Process 19(8)
Kortli Y, Marzougui M, Atri M (2016) Efficient implementation of a real-time lane departure
warning system. In: 2016 international image processing, application system, pp 1–6
Li Q, Chen L, Li M, Shaw SL, Nüchter A (2014) A sensor-fusion drivable-region and lane detection
system for autonomous vehicle navigation in challenging road scenarios. IEEE Trans Veh Technol
63(2):540–555
Liu X, Xu X, Dai B (2012) Vision-based long-distance lane perception and front vehicle location
for full autonomous vehicles on highway roads 19:1454–1465
Lotfy OG et al (2017) Lane departure warning tracking system based on score mechanism. In:
Midwest symposium circuits systems, pp 16–19
Murugesh R, Ramanadhan U, Vasudevan N, Devassy A, Krishnaswamy D, Ramachandran A (2016)
Smartphone based driver assistance system for coordinated lane change. In: 2015 international
conference on connected vehicles and expo, ICCVE 2015—proceedings, pp 385–386
Nair BB, Kumar PKS, Sakthivel NR, Vipin U (2017) Clustering stock price time series data to
generate stock trading recommendations: an empirical study. Expert Syst Appl 70:20–36
Obradović D, Konjović Z, Pap E, Rudas IJ (2013) Linear fuzzy space-based road lane model and
detection. Knowledge-Based Syst 38:37–47
Okamoto K, Itti L, Tsiotras P (2019) Vision-based autonomous path following using a human
driver control model with reliable input-feature value estimation. IEEE Trans Intell Vehicles
4(3):497–506
Pratt WK (1978) Digital image processing. Wiley-Interscience, New York
Rasmussen C (2004) Texture-based vanishing point voting for road shape estimation. In: British
machine vision conference
Sha Y, Zhang G-Y (2007) A road detection algorithm by boosting using feature combination. In:
2007 IEEE intelligent vehicles symposium, pp 364–368
Singh AK, John BP, Subramanian SV, Kumar AS, Nair BB (2016) A low-cost wearable Indian
sign language interpretation system. In: International conference on robotics & automation for
humanitarian applications
Son J, Yoo H, Kim S, Sohn K (2015) Real-time illumination invariant lane detection for lane
departure warning system. Expert Syst Appl 42(4):1816–1824
Song W, Yang Y, Fu M, Li Y, Wang M (2018) Lane detection and classification for forward collision
warning system based on stereo vision. IEEE Sensors J 18(12):5151–5162
Su Y, Zhang Y, Lu T, Yang J, Kong H (2018) Vanishing point constrained lane detection with a
stereo camera. IEEE Trans Intell Transp Syst 19(8):2739–2744
Types of roads and lane system in India explained. https://www.cars24.com/blog/types-of-roads-
lane-system-in-india/. Cited 12 Sep 2020
Wang Y, Chen D, Shi C (2008) Vision-based road detection by adaptive region segmentation and
edge constraint. In: Second international symposium on intelligent information technology appli-
cation, pp 342–346

Wang Y, Teoh Eam K, Shen D (2004) Lane detection and tracking using B-Snake. Image Vis Comput
22:269–280
Xiao J, Li S, Sun B (2016) A real-time system for lane detection based on FPGA and DSP. Sens
Imaging 17(6):1–13
Yoo JH, Lee S-W, Park S-K, Kim DH (2017) A robust lane detection method based on vanishing
point estimation using the relevance of line segments. IEEE Trans Intell Transp Syst 18(12):3254–
3266
Zhang J, Nagel HH (1994) Texture-based segmentation of road images. In: IEEE symposium on
intelligent vehicles., Washington DC, pp 260–265
Zhang G, Zheng N, Cui C (2009) An efficient road detection method in noisy urban environment.
In: IEEE intelligent vehicles symposium. Xi’an, China, pp 556–561
Zheng F, Luo S, Song K, Yan C-W, Wang M-C (2018) Improved lane line detection algorithm based
on Hough transform. Pattern Recogn Image Analysis 28:254–260
Zhou S, Jiang Y (2010) A novel lane detection based on geometrical model and Gabor filter. In:
IEEE intelligent vehicles symposium. San Diego, CA, USA, pp 59–64
Chapter 22
An Improved DCNN Based Facial
Micro-expression Recognition System

Divya Garg and Gyanendra K. Verma

Abstract Recently, researchers have focused their attention on micro-expression recognition due to its real-time applications in understanding human behavior, as a micro-expression indicates whether a person is knowingly or unknowingly manipulating their true emotion and mental state. Recognition of micro-expressions is a challenging task due to manipulated facial looks and the delicate characteristics of emotion. This study is an extension of our previous work on micro-expression recognition that was implemented using the discrete curvelet transform. This time, we investigate a deep convolutional neural network (DCNN) for micro-expression recognition, as DCNNs have established their presence in different image processing applications. CASME-II, a benchmark database for micro-expression recognition, has been used for the experiments. The experimental results reveal that the CNN-based approach gives correct recognition rates of 90 and 88% for four and six classes, respectively, which is beyond the conventional methods. A performance comparison of our approach with previous methods reported in the literature is provided.

22.1 Introduction

Facial expression plays a significant part in people's daily communication and in expressing emotions. Ordinarily, a full expression lasts from half a second to four seconds on the face and can be easily recognized by people. Over the past few decades, numerous researchers have attempted to train computers to recognize facial appearances and emotional communication among people. Notwithstanding, psychological investigations show that the analysis of facial expressions may be ambiguous; in other words, somebody may attempt to conceal their feeling by displaying a contrary expression.

D. Garg (B) · G. K. Verma


Department of Computer Engineering, National Institute of Technology Kurukshetra,
Kurukshetra, India
e-mail: divya_6180091@nitkkr.ac.in
G. K. Verma
e-mail: gyanendra@nitkkr.ac.in


A micro-expression is a special facial expression, characterized as a fast facial movement that is not subject to a person's conscious control and can reveal the genuine feeling. Micro-expressions were first described in the psychological literature of the 1960s and have been studied since then (Ekman and Friesen 1969; Haggard and Isaacs 1966). These expressions are considered a self-protection mechanism and can reveal suppressed feelings. Identifying micro-expressions is challenging because of (a) the unpretentious nature of the emotions and (b) suppressed facial expressions. Recognition of micro-expressions is one of the significant aspects of human-machine interaction, and in the present scenario, analysts focus on capable strategies to analyze micro-expressions. Individuals can perceive macro-expressions, which last longer, though they can also manipulate these expressions to cover their real sentiments. Micro-expressions are hard to perceive, mainly because of their short duration and unstable nature, and strategies that can effectively recognize and distinguish them vary. The significant applications of recognizing micro-expressions are (a) in law enforcement, for cross-verification to detect deception, and (b) in marketing, to analyze how individuals respond to commercials.
Past studies on facial micro-expression recognition revolved around detecting micro-expressions in individual images. Lately, recognition of spontaneous facial movements has received attention from different researchers (Bartlett et al. 2005; Pfister et al. 2011), since spontaneous micro-expressions can reveal genuine sentiments that people try to hide. Therefore, it is vital to examine spontaneous micro-expressions. The subtle nature of micro-expressions leads to human errors in observing them and poses a genuine challenge to computer vision; a suitable method is therefore required to extract the relevant information from micro-expressions with their inconspicuous changes.
The most commonly used methodologies to examine micro-expressions are geometry-based and appearance-based. Facial micro-features such as shapes and salient facial parameters can be represented using the geometry-based methodology, whereas facial skin texture can be described by the appearance-based methodology. More recently, the local binary pattern from three orthogonal planes (LBP-TOP) has demonstrated its effectiveness for micro-expression recognition (Huang et al. 2012; Zhao and Pietikainen 2007). Apart from this, LBP has been broadly used to recognize micro-expressions in (Huang et al. 2016; Liu et al. 2016). A few studies (Kim and Cho 2014; Mishra et al. 2013; Shu et al. 2011) depend on histograms of oriented gradients (HOG), as HOG provides features that are invariant to shading.
The rest of the paper is organized as follows: a review of related literature is given in Sect. 22.2, and the proposed methodology is presented in Sect. 22.3. Experiments and results are discussed in Sect. 22.5, and concluding remarks and future scope are given in Sect. 22.6.

22.2 Related Work

In our previous work (Verma 2017), we implemented micro-expression recognition using the discrete curvelet transform. To the best of the authors' knowledge, that was the first use of the curvelet transform for recognizing micro-expressions. We used CASME-II, a standard dataset for micro-expression recognition. The experimental outcomes showed that the curvelet-based technique gives better results (accuracy) than the existing methodologies stated in the literature, and we argued that curvelet analysis is a leading strategy (with the best performance) for recognizing micro-expressions. The present work is a continuation of that study; this time we implement the micro-expression recognition system with a deep convolutional neural network (DCNN). The existing literature for facial expression recognition is based on

1. Local Binary Pattern (Ahonen et al. 2006; Zhao and Pietikainen 2007),
2. Local Binary Pattern from Three Orthogonal Planes (LBP-TOP) (Liu et al. 2016;
Wang et al. 2017), and
3. Histograms of Oriented Gradients (Kim and Cho 2014; Mishra et al. 2013; Shu
et al. 2011).
However, here our literature survey is limited to deep learning-based studies.
Methodologies based on deep learning have demonstrated their practicality for
various visual tasks and have drawn tremendous interest from the computer vision
community. Lately, deep learning features have also been explored for analyzing
spontaneous micro-expressions. Peng et al. (2017) proposed a two-stream network
termed the dual temporal scale convolutional neural network (DTSCNN) to recognize
unconstrained micro-expressions. The two streams of DTSCNN are used to adapt to
the different frame rates of micro-expression video clips, and each stream comprises
an independent shallow network to avoid the overfitting problem. The network is then
fed with optical-flow sequences to ensure that the shallow networks can acquire
higher-level features. They evaluated their approach on the benchmark databases,
i.e., CASME I/II, and attained accuracy about 10% higher than other state-of-the-art approaches.
Al-Shabi et al. (2017) developed an aggregator model based on the scale invariant
feature transform (SIFT) and CNN. They extracted both dense SIFT and regular SIFT
features and merged them with CNN features to increase the performance on small
data. The accuracies they reported are 73.4% on FER-2013 and 99.1% on the CK+
database.
Takalkar and Xu (2017) showed that it is possible to substantially improve
accuracy over the expected baseline for classifying micro-expressions using CNNs
pre-trained for face recognition tasks. Their experiments also conclude that the small
sizes of the individual datasets are insufficient for training CNNs. Nevertheless,
CNNs trained on sufficiently large datasets of facial micro-expressions can obtain
better results than the baseline without using the data augmentation procedure.
They combined the CASME and CASME-II datasets to form a bigger dataset, which
was then used to fine-tune a satisfactory CNN-based micro-expression recognizer.
Zhang et al. (2018) designed a network termed as SMEConvNet to analyze micro-
expression from the long video. They extracted 500 features per frame and built a
feature matrix. Then, they proposed a technique for processing a feature matrix to
locate the apex frame from video, which utilizes a sliding window and considers the
attributes of micro-expressions to search for the apex frame. Experimental outcomes
show that the proposed strategy achieves the highest apex spotting rate
(0.8280) and the smallest mean absolute error (22.36) among the compared techniques.
Li et al. (2016) proposed a methodology by merging a deep learning network and
histograms of oriented optical flow (HOOF) to recognize micro-expression. They
utilized CNN to localize facial areas and region-based normalized HOOF features.
Hu et al. (2018) presented a framework to recognize micro-expressions by
merging deep learning techniques and handcrafted features. They merged temporal
and spatial features through the local Gabor binary pattern from three orthogonal planes
to capture the local facial movements and trained the CNN model on a micro-expression
dataset. The results show that the proposed approach achieves
better performance than other mainstream micro-expression recognition strategies.
Li et al. (2018) proposed an algorithm in which a CNN is employed to identify
the facial landmarks/regions. Moreover, they employed a fused CNN to extract
the optical-flow features from the facial landmarks that capture the muscle
movements occurring during micro-expressions. They applied the proposed algorithm
to two databases and achieved better results in analyzing micro-expressions.

22.3 Methodology

22.3.1 Deep Convolutional Neural Network

In this section, we discuss deep learning and the convolutional neural network (CNN),
the concepts that establish a framework for investigating the deep convolutional neural
network (DCNN) for recognition of micro-expressions.
Deep learning is a powerful part of artificial intelligence. It comprises well-defined
architectures that imitate, or are motivated by, the human brain; such a model, intended
to emulate the structure of the human brain, is termed an artificial neural network. It is
capable of emulating the way a human brain handles information and recognizes
patterns in decision making.

Fig. 22.1 A typical CNN architecture (Jan et al. 2019)

The structure of deep learning consists of multiple layers. Each layer has neural
nodes that can communicate with other network nodes and has full capacity to learn
various features and data representations. At present, numerous deep network
architectures have been developed, for example, LeNet (LeCun et al. 1998),
AlexNet (Krizhevsky et al. 2012), VGG (Simonyan and Zisserman 2014), GoogleNet
(Szegedy et al. 2015), ResNet (He et al. 2016), etc. In this paper, we predominantly
discuss the DCNN for micro-expression recognition, given the extraordinary success
of CNNs in emotion recognition and computer vision.
CNN is a popular deep learning model and has shown tremendous success in
various areas. It was proposed by LeCun et al. (1998) and incorporates a series of
building blocks or layers. The main objective is to process the given input, extract
high- and low-level features, and classify the input into specific classes.
CNN architecture has two phases: the first phase accomplishes feature learning
through the convolutional layer, activation function, and pooling layer, while the
second phase comprises a fully connected layer and a SoftMax layer, which perform
classification (as shown in Fig. 22.1). A complex CNN architecture involves repetitions
of a few convolution layers and a pooling layer, followed by at least one fully
connected layer.
The convolutional layer is connected to a section of the input (normalized image),
aiming to perform convolution operations between the given input and a kernel. The
convolution operation is a particular linear operation utilized for extracting features,
where a kernel/filter (a set of array numbers) is applied over the given input, which is
also a set of array numbers, called a tensor. A dot product between every section of the
kernel/filter and the input tensor is computed at every location of the tensor and
summed to produce the output at the corresponding location of the output tensor,
termed a feature map. The convolution operation between an image (I) and a kernel
(K) is shown in Eq. 22.1.
$$(I * K)_{x,y} = \sum_{p=1}^{I_h} \sum_{q=1}^{I_w} \sum_{r=1}^{I_c} I_{x+p-1,\; y+q-1,\; r}\, K_{p,q,r} \qquad (22.1)$$

where $I_h$, $I_w$, and $I_c$ are the height, width, and number of channels of the image,
respectively, and x and y index pixel positions of the local receptive field. This procedure
is repeated by applying multiple convolution filters, forming several feature extractors
that signify various attributes of the given input tensor. After that, the activation
function is applied to the resulting signal; it does not activate all neurons simultaneously
and introduces nonlinearity into the transformed output tensor. ReLU is extensively
used as an activation function, as it converges several times faster than the tanh and
sigmoid functions. It is described by Eq. 22.2

$$f(t) = \max(0, t) \qquad (22.2)$$

This output tensor is then directed to the following layer of neurons as input.
The pooling layer performs down-sampling of the input features without affecting
the number of channels.
$$I^{[l]}_{x,y,z} = \mathrm{pool}\big(I^{[l-1]}\big)_{x,y,z} \qquad (22.3)$$

where l is the current layer and l − 1 is the previous layer. The first convolutional and
pooling layers obtain low-level information from the picture, while stacking them
enables higher-level feature extraction. The pooling outcomes are fed into a fully
connected layer, which classifies the given input into classes/labels. Considering the
kth node of the lth layer, it can be defined by Eq. 22.4


$$z_k^{[l]} = \sum_{i=1}^{n_{l-1}} w_{k,i}^{[l]}\, I_i^{[l-1]} + b_k^{[l]} \qquad (22.4)$$

$$I_k^{[l]} = f^{[l]}\big(z_k^{[l]}\big) \qquad (22.5)$$

where the input tensor $I^{[l-1]}$ is the transformed outcome of the preceding
convolution/pooling layer with dimensions $I_h^{[l-1]}$, $I_w^{[l-1]}$, and $I_c^{[l-1]}$, and f is the
activation function. The key parameters of the lth layer are the weights $w_{k,i}^{[l]}$
(connecting the $n_{l-1}$ inputs to the $n_l$ nodes) and the bias $b_k^{[l]}$.
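To make Eqs. 22.1–22.3 concrete, the following is a minimal NumPy sketch of a single-filter valid convolution, the ReLU activation, and 2×2 max-pooling. The helper names (conv2d_valid, relu, max_pool_2x2) and the toy input sizes are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Eq. 22.1: valid convolution of image (H, W, C) with one kernel (kh, kw, C)."""
    H, W, C = image.shape
    kh, kw, _ = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # dot product between the kernel and the local receptive field
            out[x, y] = np.sum(image[x:x + kh, y:y + kw, :] * kernel)
    return out

def relu(t):
    """Eq. 22.2: f(t) = max(0, t)."""
    return np.maximum(0, t)

def max_pool_2x2(fmap):
    """Eq. 22.3: non-overlapping 2x2 max-pooling of a single feature map."""
    H, W = fmap.shape
    H2, W2 = H // 2, W // 2
    return fmap[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2).max(axis=(1, 3))

image = np.random.rand(32, 32, 3)      # toy input tensor
kernel = np.random.randn(3, 3, 3)       # one 3x3 filter
feature_map = max_pool_2x2(relu(conv2d_valid(image, kernel)))
print(feature_map.shape)                # (15, 15)
```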
A deep CNN comprises various hidden layers. The proposed structure comprises
a stack of hidden layers, where each hidden layer is a set of convolutional layers
followed by a max-pooling layer. The last hidden layer consists of a fully connected
layer (acting as a classifier) and a dropout layer for reducing overfitting. The sublayers
of a deep CNN are depicted in Fig. 22.2.

Algorithm 1 CNN training and classification algorithm

1. TRAINING PROCESS
INPUT: labeled training data X = {X(1), X(2), ..., X(K)}, where K is the total number of classes.
X_input ← reshape(X); % Reshape X to a 4-D tensor
CNN ← X_input; % reshaped training data are sent into the CNN to get extracted feature vectors f^

# Hidden-layer1: Conv1 → ReLU → Pool1
Initialize:
W(1); % weight tensor of Conv1 with network layer parameters
b(1); % bias of Conv1
Compute:
z(1) = conv2D(X_input, W(1)) + b(1); % apply convolutional operation
h_c1 = f(z(1)); % apply the ReLU activation function
h_p1 = maxpool2d(h_c1); % fed into max-pooling layer

# Hidden-layer2: Conv2 → ReLU → Pool2
Initialize:
W(2); % weight tensor of Conv2 with network layer parameters
b(2); % bias of Conv2
Compute:
z(2) = conv2D(h_p1, W(2)) + b(2)
h_c2 = f(z(2)); % apply the ReLU activation function
h_p2 = maxpool2d(h_c2); % fed into max-pooling layer

# Hidden-layer3: Fully Connected 1
Initialize:
W(3); % weight tensor of fully connected layer 1
b(3); % bias of fully connected layer 1
Compute:
z(3) = tf.matmul(h_p2.flatten, W(3)) + b(3)
h_f1 = f(z(3)); % apply the ReLU activation function
h_f1_drop = h_f1.dropout; % apply dropout to reduce overfitting

# Final Output layer
Initialize:
W(4); % weight tensor with fully connected size and number of classes K
b(4); % bias variable with number of classes K
Compute OUTPUT:
f_output = tf.matmul(h_f1_drop, W(4)) + b(4)

2. CLASSIFICATION PROCESS
INPUT: x^, an image to be classified
f^ ← CNN ← x^;
OUTPUT: class = argmax_i f_output(i), i = 1, 2, ..., K % the class that x^ belongs to
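For reference, a minimal runnable sketch of the pipeline that Algorithm 1 outlines (two convolution–ReLU–pooling blocks, a fully connected layer with dropout, and a K-class output) using the Keras API is given below. The kernel sizes, layer widths, dropout rate, input shape, and optimizer are illustrative assumptions rather than the exact values used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(64, 64, 1), num_classes=2):
    # Hidden-layer1 and Hidden-layer2 of Algorithm 1: Conv -> ReLU -> MaxPool
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        # Hidden-layer3: fully connected layer with dropout to reduce overfitting
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        # Final output layer: one logit per class
        layers.Dense(num_classes),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    return model

model = build_cnn(num_classes=2)
model.summary()
```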

22.4 Experimental Setup

22.4.1 CASME-II Database

All the experiments were carried out on the benchmark CASME-II database to
evaluate micro-expression recognition algorithms that are fit for detecting the subtle
facial muscle motions used in analyzing affective states.

Fig. 22.2 Deep CNN sublayers

Table 22.1 CASME-II micro-expression database


Attribute Value
Author and year Yan 2013
Micro-expression samples 195
Subjects 35
Spontaneous expressions Yes
Emotions Amusement, sadness, disgust, contempt, surprise, fear, repression, and tense
Spontaneous micro-expressions 247
Temporal resolution 200 fps
Spatial (face) resolution 280×340 pixels
Frame size 640×480
Labeling By emotions

The CASME-II database (Yan et al. 2014) consolidates 246 subtle facial micro-expression
samples recorded by a 200-fps camera. These samples were selected from nearly 2500
elicited facial expressions. CASME-II offers a larger sample size, stable lighting, and
higher resolution (both temporal and spatial) than its predecessor. The selected
micro-expressions in this dataset either had a total duration under 500 ms or an onset
duration (time from onset frame to apex frame) under 250 ms. The samples are coded
with onset and offset frames and labeled with action units (AUs) and emotions.
Attributes of CASME-II and their respective values are shown in Table 22.1.
CASME-II comprises mainly six classes of micro-expressions, namely
'Happiness,' 'Fear,' 'Sadness,' 'Surprise,' 'Repression,' and 'Disgust.' We
have chosen 1841 samples of happiness, 274 samples of sadness, 1605 samples of
surprise, 4204 samples of disgust, 2187 samples of repression, and further samples of
the remaining classes for analyzing micro-expressions. These samples are gathered
from a number of different people of both genders. Some samples of this database
are shown in Fig. 22.3.

Fig. 22.3 Sample images of CASME-II database

22.4.2 Implementations

The implementation is done with TensorFlow and Python 3.5.2. TensorFlow is the
open-source software library used for experimentation in this study. TensorFlow
computations are expressed as stateful data-flow graphs, and the library takes its name
from the operations that neural networks perform on multidimensional arrays, which
are referred to as tensors. In our work, we utilized TensorBoard to construct complex
charts and diagrams; we tracked and visualized model parameters such as accuracy
and the loss function using TensorBoard, which also helps in viewing tensor histograms
as they vary over time. For the evaluation of this model, we experimented with different
numbers of training images and different numbers of classes, namely two, four, and
six classes.
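As an illustration of the tracking described above, accuracy, loss, and weight histograms can be logged for TensorBoard through the Keras callback; the log directory, the model object, and the training arrays are assumed placeholders rather than the exact setup used in the study.

```python
import tensorflow as tf

# Log scalars (loss, accuracy) and weight histograms for visualization in TensorBoard
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/micro_expression",
                                             histogram_freq=1)

# model, x_train, y_train, x_val, y_val are assumed to be defined elsewhere
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=30, batch_size=32,
                    callbacks=[tb_callback])
# Launch the dashboard with:  tensorboard --logdir logs/micro_expression
```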
The spontaneous emotion recognition model identifies the micro-expressions of
human beings. As discussed in the previous section, we utilized the standard
CASME-II database comprising 12,000 images. We divided these images into training
and testing sets: approximately 80% of the images were used for the training phase
and the remaining images for testing. These images were then assigned to the
respective classes in the classification process.
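The roughly 80%/20% division described above could be performed, for instance, with scikit-learn's train_test_split; the arrays images and labels are hypothetical placeholders for the loaded CASME-II frames and their class indices.

```python
from sklearn.model_selection import train_test_split

# images: (N, H, W, 1) array of CASME-II frames, labels: (N,) class indices
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=42)
print(len(x_train), len(x_test))   # roughly 80% / 20% of the samples
```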

22.5 Results and Analysis

The results are given in terms of cost and accuracy for two classes (Happiness and
Sadness), four classes (Happiness, Sadness, Disgust, and Fear), and six classes, as
shown in Table 22.2. The cost function quantifies the performance of the deep neural
model on given data: it evaluates the relationship between the predicted and expected
values and is framed in terms of the estimated error. It is minimized to guarantee
improved working of the model. These two metrics are of prime significance because
they directly govern the performance of the developed model and should be varied
to improve it. The cost graphs for 2, 4, and 6 classes are given in Figs. 22.4, 22.6, and
22.8, respectively.
Test results show that our framework accomplished the highest performance
(recognition accuracy) against the existing investigations reported in the literature.
The test accuracy graphs for two, four, and six classes are shown in Figs. 22.5, 22.7,
and 22.9, respectively. We have also compared the performance of our framework
with other works, as shown in Table 22.3. To summarize, the DCNN can adequately
learn low-level and high-level features from imbalanced information, decipher the
subtle movement in facial regions of micro-expressions, and achieve remarkable
performance for recognizing micro-expressions.

Table 22.2 Accuracy and cost parameter for different numbers of classes


Sl. No. Parameter # Iterations # class Accuracy (%)
1. Cost 26 2 95
2. Cost 57 4 90
3. Cost 65 6 88

Fig. 22.4 Cost graph for two class of micro-emotion



Fig. 22.5 Test accuracy graph for two class

Fig. 22.6 Cost graph for four class

Fig. 22.7 Test accuracy graph for four class



Fig. 22.8 Cost graph for six class

Fig. 22.9 Test accuracy graph for six class

Table 22.3 Comparison of recognition accuracy (%) with CASME-II database


Method Year 4-class 5-class
Cuboids (Dollár et al. 2005 33.33 36.03
2005)
LBP-TOP (Zhao and 2007 37.43 46.46
Pietikainen 2007)
LBP-TOP (S+M) 2011 45.31 53.28
(Pfister et al. 2011)
STLMBP (Huang 2012 46.20 47.77
et al. 2012)
LOCP-TOP (Chan 2012 31.58 48.91
et al. 2011)
LBP-SIP (Wang et al. 2014 36.84 46.56
2014)
STCLQP (Huang et al. 2016 57.31 58.39
2016)
Curvelet (Verma 2017) 2017 62.99 61.39
CNN [this work] 2020 90.00 88.00 (six class)

22.6 Conclusion

This study has proposed a framework to recognize micro-expressions based on a
deep convolutional neural network (DCNN). The DCNN has been effectively utilized
to extract features from micro-expression images and classify them into classes.
Its architecture involves the repetition of a convolution layer and a pooling layer
followed by two fully connected layers. It works in two phases: the primary phase
accomplishes feature learning through the convolutional layer, activation function,
and pooling layer, while the second phase comprises a fully connected layer, which
performs classification.
All the tests were performed on the CASME-II database for micro-expression
recognition algorithms that are suitable for distinguishing the subtle facial muscle
movements associated with affective states. The highest two-class (95%), four-class
(90%), and six-class (88%) classification accuracies were achieved using the deep
convolutional neural network. Experimental outcomes show that our framework
accomplished the highest performance (recognition accuracy) against the existing
investigations reported in the literature. We have additionally compared the results of
our framework with other works. In the future, more research will be conducted with
various CNN models to improve our system's performance and validate the results of
our proposed system.

References

Ahonen T, Hadid A, Pietikainen M (2006) Face description with local binary patterns: application
to face recognition. IEEE Trans Pattern Anal Mach Intell 28(12):2037–2041
Bartlett MS, Littlewort G, Frank M, Lainscsek C, Fasel I, Movellan J (2005, June) Recognizing
facial expression: machine learning and application to spontaneous behavior. 2005 IEEE Comput
Soc Conf Comput Vis Pattern Recogn (CVPR’05) 2:568–573
Chan CH, Goswami B, Kittler J, Christmas W (2011) Local ordinal contrast pattern histograms for
spatiotemporal, lip-based speaker authentication. IEEE Trans Inf Forensics Secur 7(2):602–612
Connie T, Al-Shabi M, Cheah WP, Goh M (2017) Facial expression recognition using a hybrid CNN-
SIFT aggregator. International workshop on multi-disciplinary trends in artificial intelligence.
Springer, Cham, pp 139–149
Dollár P, Rabaud V, Cottrell G, Belongie S (2005, Oct) Behavior recognition via sparse spatio-
temporal features. In: 2005 IEEE international workshop on visual surveillance and performance
evaluation of tracking and surveillance. IEEE, pp 65–72
Ekman P, Friesen WV (1969) Nonverbal leakage and clues to deception. Psychiatry 32(1):88–106
Haggard EA, Isaacs KS (1966) Micromomentary facial expressions as indicators of ego mechanisms
in psychotherapy. Methods of research in psychotherapy. Springer, Boston, MA, pp 154–165
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings
of the IEEE conference on computer vision and pattern recognition, pp 770–778
Huang X, Zhao G, Zheng W, Pietikäinen M (2012) Towards a dynamic expression recognition
system under facial occlusion. Pattern Recogn Lett 33(16):2181–2191
Huang X, Zhao G, Hong X, Zheng W, Pietikäinen M (2016) Spontaneous facial micro-expression
analysis using spatiotemporal completed local quantized patterns. Neurocomputing 175:564–578

Hu C, Jiang D, Zou H, Zuo X, Shu Y (2018, Aug) Multi-task micro-expression recognition combin-
ing deep and handcrafted features. In: 2018 24th international conference on pattern recognition
(ICPR). IEEE, pp 946–951
Jan B, Farman H, Khan M, Imran M, Islam IU, Ahmad A, Jeon G (2019) Deep learning in big data
analytics: a comparative study. Comput Electr Eng 75:275–287
Kim S, Cho K (2014) Fast calculation of histogram of oriented gradient feature by removing
redundancy in overlapping block. J Inf Sci Eng 30(6):1719–1731
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks. Adv Neural Inf Process Syst:1097–1105
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document
recognition. Proc IEEE 86(11):2278–2324
Liu L, Fieguth P, Wang X, Pietikäinen M, Hu D (2016, Oct) Evaluation of LBP and deep texture
descriptors with a new robustness benchmark. In: European conference on computer vision. Springer,
Cham, pp 69–86
Li Q, Yu J, Kurihara T, Zhan S (2018, Apr) Micro-expression analysis by fusing deep convolutional
neural network and optical flow. In: 2018 5th international conference on control, decision and
information technologies (CoDIT). IEEE, pp 265–270
Li X, Yu J, Zhan S (2016, Nov) Spontaneous facial micro-expression detection based on deep
learning. In: 2016 IEEE 13th international conference on signal processing (ICSP). IEEE, pp
1130–1134
Mishra G, Aung YL, Wu M, Lam SK, Srikanthan T (2013, Dec) Real-time image resizing hardware
accelerator for object detection algorithms. In: 2013 international symposium on electronic system
design. IEEE, pp 98–102
Pantic M, Rothkrantz LJ (2004) Facial action recognition for facial expression analysis from static
face images. IEEE Trans Syst Man Cybern Part B (Cybern) 34(3):1449–1461
Peng M, Wang C, Chen T, Liu G, Fu X (2017) Dual temporal scale convolutional neural network
for micro-expression recognition. Frontiers Psychol 8:1745
Pfister T, Li X, Zhao G, Pietikäinen M (2011, Nov) Differentiating spontaneous from posed facial
expressions within a generic facial expression recognition framework. In: 2011 IEEE international
conference on computer vision workshops (ICCV workshops). IEEE, pp 868–875
Shu C, Ding X, Fang C (2011) Histogram of the oriented gradient for face recognition. Tsinghua
Sci Technol 16(2):216–224
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recog-
nition. arXiv:1409.1556
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich
A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 1–9
Takalkar MA, Xu M (2017, Nov) Image based facial micro-expression recognition using deep learn-
ing on small datasets. In: 2017 international conference on digital image computing: techniques
and applications (DICTA). IEEE, pp 1–7
Verma GK (2017, Nov) Facial micro-expression recognition using discrete curvelet transform.
In: 2017 conference on information and communication technology (CICT). IEEE, pp 1–6
Wang Y, See J, Phan RCW, Oh YH (2014, Nov) LBP with six intersection points: reducing redundant
information in LBP-TOP for micro-expression recognition. In: Asian conference on computer vision.
Springer, Cham, pp 525–537
Wang Y, See J, Oh YH, Phan RCW, Rahulamathavan Y, Ling HC, Li X (2017) Effective recog-
nition of facial micro-expressions with video motion magnification. Multimedia Tools Appl
76(20):21665–21690
Yan WJ, Wu Q, Liu YJ, Wang SJ, Fu X (2013, Apr) CASME database: a dataset of spontaneous
micro-expressions collected from neutralized faces. In: 2013 10th IEEE international conference
and workshops on automatic face and gesture recognition (FG). IEEE, pp 1–7

Yan WJ, Li X, Wang SJ, Zhao G, Liu YJ, Chen YH, Fu X (2014) CASME II: an improved sponta-
neous micro-expression database and the baseline evaluation. PloS One 9(1):e86041
Zhang Z, Chen T, Meng H, Liu G, Fu X (2018) SMEConvNet: a convolutional neural network for
spotting spontaneous facial micro-expression from long videos. IEEE Access 6:71143–71151
Zhao G, Pietikainen M (2007) Dynamic texture recognition using local binary patterns with an
application to facial expressions. IEEE Trans Pattern Anal Machine Intell 29(6):915–928
Chapter 23
Selective Deep Convolutional Framework
for Vehicle Detection in Aerial Imagery

Kaustubh V. Sakhare and Vibha Vyas

Abstract Advancements in computer vision techniques have played a significant


role in object detection. Vehicle detection from a moving platform is an exigent prob-
lem. Researchers of automated vehicle detection systems need to focus on designing
a system robust to the environmental variations and reducing the processing time.
Due to this complexity, there is a necessity of employing different strategies for struc-
tured object detection with requisite precision. The work carried out in this paper
compares handcrafted features such as SIFT and SURF with novel shape descrip-
tors like convolutional neural network (CNN). CNNs are proven to be powerful in
extracting sparse object information present in the images. Moreover, fine-tuning of
CNNs using feature ontology is adopted in this work to increase the discriminative
power of descriptors. A semantic classification model proposed in this paper intends
to employ modern embedding and aggregating methods which considerably enhance
feature discriminability and boost the performance of the CNN. The performance of
this framework is exhaustively tested across a wide dataset. The intuitive and robust
systems that use these techniques play a vital role in various sectors like security,
military, automation, industries, medical, and robotics.

23.1 Introduction

Growing vehicle counts on the roads have reinforced the need for automation in
transportation systems, and numerous efforts have been made to make automated
vehicle detection practically viable. Intelligent transportation, with advancements in
driverless cars (Hosang et al. 2015; Moranduzzo and Melgani 2013), has stimulated
combined solutions using sensors and algorithms.
The manifold advantages of autonomous vehicles have made it a mandate for all
leading organizations to incorporate such features. Improved mobility, reduction in the traffic

K. V. Sakhare (B) · V. Vyas


Department of Electronics and Telecommunication, College of Engineering, Pune, India
e-mail: kvsakhare@pict.edu
V. Vyas
e-mail: vsv.extc@coep.ac.in

cost, and optimization of many more factors in transportation and road safety have
attracted the majority of researchers' attention to evaluating their novel object detection
techniques for vehicle detection, ultimately targeting autonomous vehicles (Lowe
2004; Moranduzzo and Melgani 2013). Radar-based vehicle detection has demonstrated
discriminative capability on vehicle datasets in diversified environments. However,
the image input received from cameras has made it possible for researchers to target
robust solutions for driverless vehicles, which triggered a set of explorations in
developing vision-based vehicle detection techniques for ADAS. With this objective,
the paper examines some of the best methods of vehicle detection when objects appear
as small instances within the image.
The remainder of this manuscript is structured as follows: Sect. 23.2 details the
different techniques such as HOG and SIFT used in vehicle detection on the VEDAI
database of vehicle images. Furthermore, the section lists the shortcomings of the
earlier mentioned handcrafted features and compares them to significant features like
object proposal methods. Section 23.3 discusses the feature ontology of histogram
of gradient combined with logistic regression. Feature aggregation of scale invariant
feature transform (SIFT) and bag of words (BoW) is elucidated in detail as a pream-
ble to the proposed model. The work builds a semantic network model using the
feature ontology of HOG and logistic regression and aggregated features made up of
SIFT and BoW. The ontological representation generated here yields the necessary
feature discriminability. Section 23.4 illustrates the proposed novel relevance feed-
back approach that uses the semantic network model with aggregated features and
feature ontology. The proposed semantic features are given as input to the CNN along
with object proposal methods. The feature engineering here is compared with the
DNN (CNN + HOG) proposed by the author in one of his earlier works. Database and
experimental results are discussed in Sect. 23.5. Conclusions are drawn in Sect. 23.6.

23.1.1 Main Contribution of the Work

1. Conventional feature descriptors such as HOG and SIFT are experimented with on
the VEDAI database, and the superiority of SIFT over these existing techniques is
validated.
2. An ontology model is built using HOG with logistic regression for vehicle detection.
3. SIFT and BoW are combined to deliver aggregated features.
4. A semantic network model is built using the object ontology (HOG + LR) and aggregated
features (SIFT + BoW), which are used as relevance feedback to the proposed
semantic classifier.
5. Object proposal methods have been an effective representation for aerial images;
the performance of selective search on the VEDAI 1024 database is evaluated.
6. A semantic classifier is proposed using object proposal methods as predefined
descriptors and the semantic network features from the above step as relevance
feedback to the network.
7. Performance analysis of the proposed algorithms is done based on pre-existing
criteria such as accuracy and precision.

23.2 Related Work

Shape detection is pursued effectively for real-time object detection, as shape conveys
the unique exterior characteristics of objects. This distinguished representation of
objects against the background is important in scene understanding. The best shape
descriptors should be unaffected by rotational and scale transformations, which is why
they should be employed insightfully in object detection (Cheng and Han 2016;
Moranduzzo and Melgani 2013). A large number of techniques have been proposed for
describing shapes in object detection, wherein points of interest are extracted from the
images and compared with those registered from dataset images to find the object of
interest (Razakarivony and Jurie 2016). Such a part of interest inside the image is
normally treated as a feature (Sommer et al. 2017; Villon et al. 2016). A detailed survey
in this regard is elaborated in Table 23.1.
Feature extraction can be realized with numerous techniques, such as the scale
invariant feature transform (SIFT) and histograms of oriented gradients (HOG)
(Cheng and Han 2016; Moranduzzo and Melgani 2013; Xu et al. 2016).

23.2.1 Shape-Based Object Detection in Aerial Images

Detection of the various classes of vehicles from aerial images is very useful. The
system, when applied to real-time scenarios, helps in solving various day-to-day
problems like screening of large areas, surveillance, traffic management, vehicle
density detection, urban planning, and many others. In real-time scenarios, the vehi-
cle density detection on the roads can be used in navigation systems to provide the
best possible path to travel and thus reduce travel time (Xu et al. 2016). Since manual
image analysis is a difficult task, providing an automated system for aerial image
analysis makes object localization, detection, and classification more efficient. Aerial
image analysis for vehicle detection has seen limitless applications such as land map-
ping, screening of large areas, etc. The enormous need for early detection of vehicles
has resulted in aerial imagery being used in a lot of vehicle detection applications.
The solutions that have been devised have become the most comprehensive and
sophisticated research in scene understanding, semantic analysis for traffic surveil-
lance, and defense applications. When compared with ground image-based object
detection, aerial imagery-based techniques are faced with the exacting task of dis-

Table 23.1 Literature survey of vehicle detection


Sr. No. Research paper Author Algorithms and Shortcomings
databases used and outcomes
1 Vehicle detection Hilal Tayara, Kim FCRN MUNICH • Highest TP
and counting in Gil Soo, Kil To and overhead (true positive)
high-resolution Chong imagery research rate—329, false
aerial images data set (OIRDS) positive rate—17
using
convolutional
regression neural
networks
• Precision rate
95.09
• Recall rate
93.72%
• More time
taken during
interference
2 A hybrid vehicle Yongzheng Xu, HOG + SVM, • Methods are
detection method Guizhen Yu, Viola-Jones UAV sensitive to
based on Yunpeng Wang, image dataset objects in
Viola-Jones and Xinkai Wu, rotation,
HOG + SVM Yalong Ma precision
from UAV images rate—92.15%
• On UAV dataset
precision
rate—92.15%
3 Vehicle detection Shih-Chung Hsu, Fast R-CNN • Robust
using simplified Chung-Lin algorithm to
fast R-CNN 2018 Huang, detect vehicles
Cheng-Hung appearing in
Chuang various
orientations
SHRP 2 NDS • Partial views of
database the objects are
addressed in
object detection
• Localization of
the vehicles with
dynamic
background and
foreground
objects
• Detector gives
90.3% precision
and 94.3% recall

cerning very small-sized vehicles that are obscured in their backgrounds (Chen et al.
2014). Keeping these peculiar challenges in mind, a few object detectors have been
investigated (Cucchiara et al. 2000). This paper exposes the limited utilization of
these object detectors in aerial imagery, as these are found to be lacking in handling
multi-scale imagery and fast response times.

23.3 Semantic Network Model

Object detection from aerial imagery has seen growing attention from researchers
due to its ability to capture large areas in a single image. At the same time, achieving
the desired level of accuracy in real-time object detection has always been a challenging
scenario. Small-instance objects such as cars, pick-up vans, and vans occupy few
pixels relative to the whole image area.
Even state-of-the-art object detectors give limited efficiency on aerial image analysis.
This has demanded continued research on object detection systems for small-instance
objects with improved accuracy. Significant work has been carried out in vehicle
detection from aerial imagery (Ajmal and Hussain 2010; Alexe et al. 2010; Dalal and
Triggs 2005; Dhanaraj et al. 2020; Hosang et al. 2015; Lowe 2004; Rabiu et al. 2013;
Sakhare et al. 2020; Tayara et al. 2017; Tewari et al. 2019; Van de Sande et al. 2011),
still keeping scope for improvement in terms of accuracy, minimizing complexity,
and achieving robust performance on dynamically changing backgrounds. The
preferred duo of most research, a combination of handcrafted features and a classifier,
has always been bound to feature engineering based on human ingenuity (Dalal and
Triggs 2005; Ren et al. 2015; Tewari et al. 2019). With their two-stage approach of
feature extraction and classification, those systems proved to be computationally
complex and yet offered limited efficacy against occlusion, lighting variations, clutter,
and rotation variance.

23.3.1 Feature Ontology: HOG with Logistic Regression

Given the existing shortfalls of the prevalent techniques, the work proposes a novel
vehicle detection system offering more efficiency and robustness to address small
instances of vehicle objects within the aerial images. This proposed method utilizes
an adaptive model to symbolize optimum features required for object representation.
The optimum features are obtained by employing logistic regression and histograms
of oriented gradients (HOG) (Cheng and Han 2016; Sommer et al. 2017). The feature
ontology uses a logistic loss between the test object and the training sets for feature
selection.
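A minimal sketch of this HOG-with-logistic-regression ontology step is given below, assuming scikit-image for HOG extraction and scikit-learn for the logistic-loss classifier; the patch size, HOG parameters, and the train_images/train_labels arrays are illustrative assumptions rather than the exact configuration used in this work.

```python
import numpy as np
from skimage.feature import hog
from sklearn.linear_model import LogisticRegression

def hog_features(images):
    """Compute HOG descriptors for a batch of grayscale image patches."""
    return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in images])

# train_images: (N, 64, 64) grayscale patches, train_labels: 1 = vehicle, 0 = background
X_train = hog_features(train_images)
clf = LogisticRegression(max_iter=1000)   # fitted by minimizing the logistic loss
clf.fit(X_train, train_labels)

# Probability that a new patch contains a vehicle
scores = clf.predict_proba(hog_features(test_images))[:, 1]
```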

23.3.2 Feature Aggregation: SIFT with BoW

SIFT-based techniques perform matching of local features in the image using two
stages, a feature detector and a descriptor (Zheng et al. 2017). SIFT methods work
cohesively with bag-of-words (BoW) models, which were initially proposed for
document parsing, where word responses are accumulated into a vector form. The scale
invariant feature transform (SIFT) (Uijlings et al. 2013) along with the BoW model
yields better performance for object detection (Fig. 23.1).
Figure 23.2 illustrates how feature aggregation reduces the dimensionality
of the input feature sets. SIFT is most popularly used for identifying prominent,
stable features in an image. It generates rotation and scale invariant feature
points, each describing a small image region around the point (Girshick 2015).
A generic SIFT framework for object detection is set out in the following steps:
1. Location and scale of salient feature points are determined. Intensity changes
are identified using difference of Gaussians at nearby scales.

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/(2\sigma^2)} \qquad (23.1)$$

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \qquad (23.2)$$

$$D(x, y, \sigma) = \big(G(x, y, k\sigma) - G(x, y, \sigma)\big) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma) \qquad (23.3)$$
2. The locations of the key points are discrete; they can be interpolated to yield the
desired accuracy.

The DoG function is expanded around a key point $(x_i, y_i, \sigma_i)$ using a second-order
Taylor series.

$$D(x, y, \sigma) \approx D(x_i, y_i, \sigma_i) + \left.\frac{\partial D(x, y, \sigma)}{\partial (x, y, \sigma)}\right|_{x=x_i,\, y=y_i,\, \sigma=\sigma_i}^{T} \Delta + \frac{1}{2}\,\Delta^{T} \left.\frac{\partial^2 D(x, y, \sigma)}{\partial (x, y, \sigma)^2}\right|_{x=x_i,\, y=y_i,\, \sigma=\sigma_i} \Delta \qquad (23.4)$$

where

$$\Delta = \begin{bmatrix} x - x_i \\ y - y_i \\ \sigma - \sigma_i \end{bmatrix} \qquad (23.5)$$

For finding extreme values of the DoG in this region, the derivative of D(·) is
set to 0, which gives:

Fig. 23.1 Semantic network model



Fig. 23.2 SIFT + BoW

$$\begin{bmatrix} \hat{x} \\ \hat{y} \\ \hat{\sigma} \end{bmatrix} = -\left(\left.\frac{\partial^2 D(x, y, \sigma)}{\partial (x, y, \sigma)^2}\right|_{x=x_i,\, y=y_i,\, \sigma=\sigma_i}\right)^{-1} \left.\frac{\partial D(x, y, \sigma)}{\partial (x, y, \sigma)}\right|_{x=x_i,\, y=y_i,\, \sigma=\sigma_i} \qquad (23.6)$$

$$D_{\text{extremal}} = D(x_i, y_i, \sigma_i) + \frac{1}{2}\left.\frac{\partial D(x, y, \sigma)}{\partial (x, y, \sigma)}\right|_{x=x_i,\, y=y_i,\, \sigma=\sigma_i}^{T} \begin{bmatrix} \hat{x} \\ \hat{y} \\ \hat{\sigma} \end{bmatrix} \qquad (23.7)$$

The position of the key point is updated accordingly. Key points with $|D_{\text{extremal}}| < 0.03$
are discarded as "low contrast points."
3. The gradient magnitudes and orientations observed over a small window around
the key point are computed.

L(x, y, σ ) = G(x, y, σ ) ∗ I (x, y) (23.8)



$$m(x, y) = \sqrt{\big(L(x+1, y) - L(x-1, y)\big)^2 + \big(L(x, y+1) - L(x, y-1)\big)^2} \qquad (23.9)$$

$$\theta(x, y) = \tan^{-1}\!\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right) \qquad (23.10)$$

4. A small region around the key point is considered and divided into n × n cells,
where each cell is of size 4 × 4. A gradient orientation histogram is built in each
cell (Cheng and Han 2016). Each histogram entry is weighted by the gradient
magnitude and a Gaussian weighting function with σ = 0.5 times the window
width. Each gradient orientation histogram is sorted with respect to the prevailing
key point orientations obtained from step 3. Once the key points are extracted, an
image can be represented as an unordered collection of visual words. Bags of
visual words are scale, viewpoint, orientation, and illumination invariant, which
makes them suitable for real-time applications. Figure 23.2 presents the outline of
SIFT and BoW for object detection. Bag of words combines the local descriptors
into a codebook of K centroids; each d-dimensional local descriptor is assigned
to its nearest centroid. The bag of words is then the histogram of image descriptors
assigned to visual words, generating a K-dimensional vector, which is normalized
at a later stage. The normalization of the histogram can be done using different
distances, Manhattan and Euclidean distances being frequent choices (a minimal
coding sketch of this pipeline is given after this list). When K is maintained at
4096, a mean average precision of 68.9% is achieved, which exceeds the results
obtained by conventional BoW (Qiang et al. 2006; Zheng et al. 2017) for the
VEDAI dataset. When K increases further, dimensionality reduction through
feature aggregation is no longer obtained. A novel model evolved from the feature
ontology of HOG with logistic regression and the feature aggregation of
SIFT with BoW is proposed as the semantic network model shown in Fig. 23.1.
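The SIFT-plus-BoW aggregation described above can be illustrated with OpenCV and scikit-learn as follows; the use of cv2.SIFT_create (available in OpenCV ≥ 4.4), MiniBatchKMeans for the codebook, and the train_images array are assumptions for illustration, not the exact implementation used in this work.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()                     # requires OpenCV >= 4.4

def sift_descriptors(image):
    """Detect key points on a grayscale image and return their 128-D SIFT descriptors."""
    _, desc = sift.detectAndCompute(image, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

# 1. Build the visual codebook (K centroids) from descriptors of training images
all_desc = np.vstack([sift_descriptors(img) for img in train_images])
K = 4096
codebook = MiniBatchKMeans(n_clusters=K, random_state=0).fit(all_desc)

def bow_histogram(image):
    """Assign each descriptor to its nearest centroid and L2-normalize the K-bin histogram."""
    desc = sift_descriptors(image)
    words = codebook.predict(desc)
    hist, _ = np.histogram(words, bins=np.arange(K + 1))
    hist = hist.astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-8)
```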

23.3.3 Object Proposal Methods

The conventional methods may fail to give a commanding feature vector, particularly
for moving objects. Sliding windows have partly addressed those issues while
generating candidate regions. Recently, object proposal methods have given higher
objectness (Bharathi et al. 2012; Hsu et al. 2018) than sliding window methods.
Among the different object proposal methods, selective

Fig. 23.3 Selective search proposals on VEDAI1024 image

search (Konoplich et al. 2016; Pawar and Humbe 2015) outperformed the others on
benchmark datasets. The algorithm has shown comparable results on aerial images
when applied with certain adaptations (Lowe 2004). Its diversified grouping technique
combined with heuristic grouping has resulted in higher objectness.
Selective search gives a similarity measure s(r1, r2) for grouping two segmented
regions r1 and r2. The grouping is done based on a weighted combination of
color, texture, shape, and size similarity.
Selective search yields a 99% recall rate on the PASCAL VOC dataset, with a mean
average best overlap of 9.868 reported (Hsu et al. 2018). The performance of
selective search drops drastically when applied to the VEDAI 1024 dataset. By tuning
the proposal width and segmentation size, the recall value of selective search is
marginally increased (Lowe 2004).
The adaptive selective search algorithm is chosen for generating well-articulated
proposals for small instances, as depicted in Fig. 23.3.
These adaptations have made it one of the best performers on non-aerial image
datasets; however, its behavior on small-sized objects remains debatable (Lowe 2004;
Lu et al. 2005). A semantic classification model, as shown in Fig. 23.4, is proposed
to address the loopholes of all the above-mentioned techniques. The output of the
semantic network model along with the object proposals is given as input to the CNN.
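A minimal sketch of generating selective-search proposals with OpenCV's contrib module (opencv-contrib-python) is shown below; the input file name, the choice of the 'fast' strategy, and the proposal cap are illustrative assumptions.

```python
import cv2

def selective_search_proposals(image, max_proposals=2000):
    """Return candidate boxes (x, y, w, h) from selective search on an aerial tile."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()   # diversified grouping with the fast strategy
    rects = ss.process()
    return rects[:max_proposals]

image = cv2.imread("vedai_tile.png")   # hypothetical VEDAI image tile
boxes = selective_search_proposals(image)
print(len(boxes), "proposals")
```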

23.4 CNN Based Semantic Classifier Model

In recent times, a lot of CNN-based methods are being used in the field of object
detection (Moranduzzo and Melgani 2013; Xu et al. 2016). The objective of this paper
is to devise a framework best suited for databases having small instances of objects.
The level 2 data flow diagram of the proposed CNN-based semantic classifier is
shown in Fig. 23.4. The vehicle detection is boosted in a two-stage approach: in
stage I, the semantic network model (3.0) is derived as a combination of ontological
features (1.1) and aggregated features (0.1), as discussed thoroughly in Sects. 23.3.1

Fig. 23.4 CNN-based semantic classifier model

and 23.3.2, respectively. Object proposals (2.0) are identified as the unique features
that are fed as input to the convolutional neural network.
The semantic model acts as the relevance feedback to the CNN. The semantic
classifier model is derived based on CNN and object proposals (4.0). The performance
of the proposed semantic CNN model is compared with the CNN-based vehicle
detection.

23.4.1 Deep Learning Based Vehicle Detection Framework

The efficiency of machine learning algorithms depends a lot on handcrafted features,
which minimize the complexity of the data by achieving a sparser representation of the
feature space that is more visible to the classifier (Lowe 2004; Moranduzzo
and Melgani 2013; Sommer et al. 2017). Handcrafted features perform below
expectations when exposed to small-instance datasets, where objects occupy merely
100–2000 pixels overall. Deep learning-based techniques address this issue by
extracting high-level features and avoiding underfitting.
The proposed CNN-based semantic classifier model is applied on an image of
size 32 × 32 pixels (Table 23.2).
CNN can accept low-level features and generate high-level representations from
them. The proposed semantic classifier using CNN performs this low-level to high-level
feature translation. Advancements in CNNs such as R-CNN (Dhanaraj et al. 2020;
Van de Sande et al. 2011) and variants such as fast R-CNN (Rabiu et al. 2013) and
faster R-CNN (Sakhare et al. 2020) have shown significant performance in detecting
objects in common standard datasets; however, their performance is limited on
small-instance objects and can lead to missed detections (Lowe 2004; Lu et al. 2005).
Their computational complexity needs GPU processing to obtain the desired performance.
The work presented in this paper exploits a unique ensemble at the feature level and

Table 23.2 Proposed CNN-based semantic classifier model summary


Layer within the Kernel size Padding stride Output parameters
architecture
Conv layer 3 × 3 × 32 0, 1 (Parameters computed,
30, 30, 32)
Activation layer: – – (Parameters computed,
ReLU 30, 30, 32)
Conv layer 3 × 3 × 32 0, 1 (Parameters computed,
28, 28, 32)
Activation layer: – – (Parameters computed,
ReLU 28, 28, 32)
Pooling layer: max 2×2 0, None (Parameters computed,
14, 14, 32)
Dropout layer – – (Parameters computed,
14, 14, 32)
Conv layer 3 × 3 × 64 0, 1 (Parameters computed,
12, 12, 64)
Activation layer: – – (Parameters computed,
ReLU 12, 12, 64)
Conv layer 3 × 3 × 64 0, 1 (Parameters computed,
10, 10, 64)
Activation layer: – – (Parameters computed,
ReLU 10, 10, 64)
Pooling layer: max 2×2 0, None (Parameters computed,
5, 5, 64)
Dropout layer – – (Parameters computed,
5, 5, 64)
Flatten layer – – (Parameters computed,
1600)
Fully connected layer – – (Parameters computed,
256)
ReLU activation (Parameters computed,
256)
Dropout layer (Parameters computed,
256)
Fully connected layer – – (Parameters computed,
2)

classifier level, giving an effective semantic classifier. This novel semantic classifier
is obtained from ontology and feature aggregation.
The combined semantic network model, when feature-engineered with the object
proposal methods in a deep learning classifier, gives comparable accuracy with respect
to the conventional techniques. The simple architecture does not need GPU computing.
The CNN-based semantic classifier model summary for an input image of 32×32
pixels is represented in Table 23.3.

Table 23.3 CNN-based semantic classifier model summary


Layer within the architecture Output parameters
Fully connected layer (Parameters computed, 1024)
ReLU activation function (Parameters computed, 1024)
Fully connected layer (Parameters computed, 256)
ReLU activation function (Parameters computed, 256)
Fully connected layer (Parameters computed, 256)
ReLU activation function (Parameters computed, 256)
Fully connected layer (Parameters computed, 2)
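Under the assumption of a Keras implementation, the convolutional stack summarized in Table 23.2 could be expressed roughly as follows; the dropout rates, the three-channel input, and the softmax output are illustrative choices rather than values stated in the text.

```python
from tensorflow.keras import layers, models

def build_semantic_cnn(input_shape=(32, 32, 3), num_classes=2):
    """CNN backbone following the layer summary of Table 23.2."""
    return models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),  # 30x30x32
        layers.Conv2D(32, (3, 3), activation="relu"),                           # 28x28x32
        layers.MaxPooling2D((2, 2)),                                            # 14x14x32
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation="relu"),                           # 12x12x64
        layers.Conv2D(64, (3, 3), activation="relu"),                           # 10x10x64
        layers.MaxPooling2D((2, 2)),                                            # 5x5x64
        layers.Dropout(0.25),
        layers.Flatten(),                                                       # 1600
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),                        # vehicle / background
    ])

model = build_semantic_cnn()
model.summary()
```

The output shapes of this sketch follow the 30→28→14→12→10→5 progression and the 1600-unit flatten reported in Table 23.2.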

23.5 Database

Experimentation of any novel, proposed algorithm on benchmark datasets plays a
crucial role. Vehicle detection from aerial images has seen a limited number of
experiments compared with state-of-the-art object detectors because of the limited
availability of datasets. Most of the available datasets cover controlled environments,
and evaluation of algorithms on such structured settings fails when tested in real-time
scenarios. Another challenge is that none of the approaches targets automated vehicle
detection from small-instance images. VEDAI 1024 (Tewari et al. 2019) is completely
different from the earlier works, as it contains satellite images from Utah, USA.
Vehicle detection in it presents diversified challenges due to dynamic backgrounds
such as fields, woods, mountains, and cities, and the VEDAI 1024 dataset contains all
types of vehicles with different colors and orientations. The dataset has a total of 1210
images represented in color image format with three color planes (R, G, and B) and
one infrared plane, and comes in two resolution variants, 1024 × 1024 and 512 × 512.
The annotations provided with the VEDAI dataset make it a suitable choice for deep
learning-based detection systems. There are nine categories in total, including trucks,
camping cars, vans, pick-up vans, boats, planes, and an 'other' class (see Table 23.4).
The findings suggest that there are five vehicles per image present in the dataset.
Small-instance images with dynamic backgrounds make the VEDAI dataset a good
opportunity to devise a robust vehicle detector.
The experimentations in this work are performed on the VEDAI 1024 database,
which is explicitly built for detection of vehicles in unrestricted environments with
dynamic backgrounds. By containing vehicle objects as small-instance objects, the
database overcomes the drawbacks of the existing databases in this domain (Ren et al.
2015). The VEDAI 1024 database is collected over Utah, USA, and has dynamic
backgrounds of city, urban, forest, and grassland areas. The images are maintained at
a 1024 × 1024 resolution with a ground sampling distance of 12.5 cm per pixel.
Ground truth annotations make the database suitable for deep learning approaches.
The experimentations are performed on selected classes, namely pick-up vans, vans,
and cars, as a sufficient number of ground-truth instances is available for evaluation.

Table 23.4 VEDAI 1024 database


Sr. No. Parameter VEDAI 1024
1 File format tif
2 Resolution 1024 × 1024
3 Size of an image 200 KB
4 Number of images 605
5 Object classes (9 classes) car, truck, van,
pickup, tractor, boat, bus,
motorcycle, other
6 GSD per pixel in centimeter 12.5 cm
7 Total number of objects 2950
8 Per image number of objects 2.88
9 Average width per bounding 33.40 ± 11.33
box
10 Average height per bounding 33.47 ± 11.68
box

The execution considered 70% of the database for training, while 20% is kept for
validation and 10% for testing.
The ground-truth objects are used to obtain overlapping bounding boxes using the
selective search method (Table 23.4).
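Matching selective-search proposals against the ground-truth annotations is typically done with an intersection-over-union (IoU) overlap; a minimal sketch follows, assuming boxes in (x, y, w, h) format and hypothetical boxes/ground_truth_boxes lists.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A proposal is treated as a positive (vehicle) sample if it overlaps
# a ground-truth box with IoU above a chosen threshold, e.g. 0.5.
positives = [b for b in boxes if any(iou(b, gt) > 0.5 for gt in ground_truth_boxes)]
```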

23.6 Results

The performance metric used for the validation of the algorithms is accuracy.
Table 23.5 shows the accuracy of the semantic classifier using HOG and logistic
regression, of the convolutional neural network, and of the proposed CNN-based
semantic classifier model.

Table 23.5 Comparative result analysis


Experiments Semantic classifier CNN classifier Semantic classifier
HOG+ logistic using convolutional
regression neural network
Model training 0.783 0.958 0.9871
accuracy
Model validation 0.7627 0.936 0.9709
accuracy
Loss at validation loss 0.129 0.1168 0.0938
Average precision – – 0.965

Fig. 23.5 Vehicle detection in VEDAI images using proposed architecture

Fig. 23.6 Comparative


analysis of vehicle detection
in VEDAI 1024

The combination of the CNN with the semantic classifier model resulted in improved
performance. The average precision of 96.5% exceeds the accuracy achieved by the
other classifiers for vehicle detection in aerial imagery. The results of vehicle detection
using the proposed method are given in Fig. 23.5, and the comparative analysis of
vehicle detection on VEDAI 1024 is shown in Fig. 23.6.
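The accuracy and average-precision figures reported in Table 23.5 can be computed from model outputs with scikit-learn as sketched below; y_true, y_pred, and y_score are hypothetical arrays of ground-truth labels, predicted labels, and vehicle-class scores.

```python
from sklearn.metrics import accuracy_score, average_precision_score

acc = accuracy_score(y_true, y_pred)            # fraction of correctly classified patches
ap = average_precision_score(y_true, y_score)   # area under the precision-recall curve
print(f"accuracy = {acc:.4f}, average precision = {ap:.4f}")
```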

23.7 Conclusion

Conventional dominant object detection paradigms such as HOG and SIFT, with
classifiers such as logistic regression and bag of words, are analyzed. HOG with
logistic regression creates the optimal features required for vehicle representation;
moreover, applying logistic regression yields the ontological features used in the
semantic network model. The scale invariant feature transform with bag of words
reduces the dimensionality of the feature vector through feature aggregation. A
semantic network model of optimal features is proposed as a conglomeration of
feature ontology and feature aggregation. Object proposal methods are identified as
suitable feature representation techniques for small-instance objects in the VEDAI
1024 database; selective search with tuned minimum and maximum proposal sizes
and minimum box width facilitates the best feature representation for small-instance
objects in the VEDAI 1024 database. Object proposals along with the CNN yield a
detection accuracy of 95.8%. A CNN-based semantic classifier model is proposed
which accepts selective search object proposals as input features, while the semantic
network features act as relevance feedback, boosting the detection accuracy to 98.71%
and keeping the average precision at 96.5%.

References

Ajmal A, Hussain IM (2010, March) Vehicle detection using morphological image processing tech-
nique. In: 2010 International conference on multimedia computing and information technology
(MCIT). IEEE, pp 65–68
Alexe B, Deselaers T, Ferrari V (2010, June) What is an object? In: 2010 IEEE computer society
conference on computer vision and pattern recognition. IEEE, pp 73–80
Bharathi TK, Yuvaraj S, Steffi DS, Perumal SK (2012, December) Vehicle detection in aerial surveil-
lance using morphological shared-pixels neural (MSPN) networks. In: 2012 Fourth international
conference on advanced computing (ICoAC). IEEE, pp 1–8
Chen X, Xiang S, Liu CL, Pan CH (2014) Vehicle detection in satellite images by hybrid deep
convolutional neural networks. IEEE Geosci Remote Sens Lett 11(10):1797–1801
Cheng G, Han J (2016) A survey on object detection in optical remote sensing images. ISPRS J
Photogrammetry Remote Sens 117:11–28
Cucchiara R, Piccardi M, Mello P (2000) Image analysis and rule-based reasoning for a traffic
monitoring system. IEEE Trans Intell Transp Syst 1:119–130
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings
of the 2005 IEEE computer society conference on computer vision and pattern recognition, San
Diego, CA, USA, vol 1, pp 886–893
Dhanaraj M, Sharma M, Sarkar T, Karnam S, Chachlakis D, Ptucha R, Markopoulos PP, Saber E
(2020, April) Vehicle detection from multi-modal aerial imagery using YOLOv3 with mid-level
fusion (conference presentation). In: Big data II: learning, analytics, and applications, vol 11395.
International Society for Optics and Photonics, p 1139506
Girshick R (2015) Fast R-CNN. In: Proceedings of the IEEE international conference on computer
vision, pp 1440–1448
Hosang J, Benenson R, Dollár P, Schiele B (2015) What makes for effective detection proposals?
IEEE Trans Pattern Anal Mach Intell 38(4):814–830
Hsu SC, Huang CL, Chuang CH (2018, January) Vehicle detection using simplified fast R-CNN.
In: 2018 International workshop on advanced image technology (IWAIT). IEEE, pp 1–3
Konoplich GV, Putin EO, Filchenkov AA (2016, May) Application of deep learning to the problem of
vehicle detection in UAV images. In: 2016 XIX IEEE International conference on soft computing
and measurements (SCM). IEEE, pp 4–6
Lowe DG (2004) Distinctive image features from scale-invariant key points. Int J Comput Vis 60:91.
https://doi.org/10.1023/B:VISI.0000029664.99615.94
Lu M, Wevers K, Van Der Heijden R (2005) Technical feasibility of advanced driver assistance
systems (ADAS) for road traffic safety. Transp Plan Technol 28(3):167–187
Moranduzzo T, Melgani F (2013, July) Comparison of different feature detectors and descriptors
for car classification in UAV images. In: 2013 IEEE International geoscience and remote sensing
symposium-IGARSS. IEEE, pp 204–207
Pawar BD, Humbe VT (2015) Morphology based composite method for vehicle detection from
high resolution aerial imagery. VNSGU J Sci Technol 4(1):50–56
Qiang Z, Mei-Chen Y, Cheng K-T (2006) Fast human detection using a cascade of histograms of
oriented gradients. Comput Vis Pattern Recognit 1491–1498

Rabiu H et al (2013) Vehicle detection and classification for cluttered urban intersection. Int J
Comput Sci Eng Appl (IJCSEA) 3(1)
Razakarivony S, Jurie F (2016) Vehicle detection in aerial imagery: a small target detection bench-
mark. J Vis Commun Image Representation 34:187–203
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with
region proposal networks. In: Advances in neural information processing systems, pp 91–99
Sakhare KV, Tewari T, Vyas V (2020) Review of vehicle detection systems in advanced driver
assistant systems. Arch Comput Methods Eng 27(2):591–610
Sommer LW, Schuchert T, Beyerer J (2017, March) Fast deep vehicle detection in aerial images. In:
2017 IEEE Winter conference on applications of computer vision (WACV). IEEE, pp 311–319
Tayara H, Soo KG, Chong KT (2017) Vehicle detection and counting in high-resolution aerial
images using convolutional regression neural network. IEEE Access 6:2220–2230
Tewari T, Sakhare KV, Vyas V (2019) Vehicle detection in aerial images using selective search with
a simple deep learning based combination classifier. In: Proceedings of the third international
conference on microelectronics, computing and communication systems. Springer, Singapore,
pp 221–233
Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recog-
nition. Int J Comput Vis 104(2):154–171
Van de Sande KE, Uijlings JR, Gevers T, Smeulders AW (2011, November) Segmentation as
selective search for object recognition. In: 2011 International conference on computer vision.
IEEE, pp 1879–1886
Villon S et al (2016) Coral reef fish detection and recognition in underwater videos by supervised
machine learning: comparison between deep learning and HOG + SVM methods. In: Advanced
concepts for intelligent vision systems, ACIVS
Xu Y, Yu G, Wang Y, Wu X, Ma Y (2016) A hybrid vehicle detection method based on Viola–Jones
and HOG + SVM from UAV images. Sensors 16(8):1325
Yang Z, Pun-Cheng LSC (2018) Vehicle detection in intelligent transportation systems and its
applications under varying environments: a review. J Image Vis Comput 143–154
Zhang L, Lin L, Liang X, He K (2016, October) Is faster R-CNN doing well for pedestrian detection?
In: European conference on computer vision. Springer, Cham, pp 443–457
Zheng H (2006, July) Automatic vehicles detection from high resolution satellite imagery using
morphological neural networks. In: Proceedings of the 10th WSEAS international conference on
computers, Vouliagmeni, Athens, Greece, vol 13, p 608
Zheng L, Yang Y, Tian Q (2017) SIFT meets CNN: a decade survey of instance retrieval. IEEE
Trans Pattern Anal Mach Intell 40(5):1224–1244
Chapter 24
Exploring Source Separation as
a Countermeasure for Voice Conversion
Spoofing Attack

R. Hemavathi, S. Thoshith, and R. Kumaraswamy

Abstract Voice conversion is a high-risk spoofing technique for automatic speaker


verification systems where the identity of the imposter speaker's speech is transformed
to that of the genuine speaker's without altering the linguistic content. In this work, we
propose a countermeasure by combining an unsupervised co-channel source separa-
tion framework based on non-negative matrix factorization with a convolutional neural
network (CNN)-based binary classifier. We also propose to model voice conversion
(VC) spoofed speech as an instantaneous mixture of estimate of target speech and
the artifacts introduced during the process of VC. Further, the efficiency of proposed
countermeasure is evaluated using CNN-based automatic speaker verification sys-
tem. For evaluation, the voice conversion challenge 2016 dataset, which consists of
the results of 18 different voice conversion algorithms, is considered. The ASV sys-
tem's vulnerability to the above database was tested using the false alarm rate (FAR),
which was found to range from 18.8 to 62.4%. The proposed countermeasure gives
excellent performance for 3 known and 15 unknown voice conversion attacks,
reducing the false alarm rate of the speaker verification system to below 2.8% for all
18 VC algorithms.

24.1 Introduction

Automatic speaker verification (ASV) is a biometric authentication system used to


verify person’s claimed identity through his/her voice (Campbell et al. 2007; Kin-
nunen et al. 2006). As accessibility of ASV has been increased with its introduction in
smartphones, studying the spoofing threats to ASV and building proper countermea-
sure is gaining attention. Spoofing refers to the condition where the source (imposter)
claims the false identity by mimicking the target (genuine) speaker’s speech. Spoof-

R. Hemavathi (B) · S. Thoshith · R. Kumaraswamy


Department of Electronics and Communication Engineering, Siddaganga Institute of Technology
(Affiliated to Visveswaraya Technological University, Belagavi), Tumakuru, India
e-mail: hemavathir01@gmail.com
R. Kumaraswamy
e-mail: hyrkswamy@sit.ac.in


ing is a major threat to verification systems, as it leads to an increase in the false alarm
rate (FAR), i.e., the imposter is falsely accepted as the genuine speaker.
Spoofing attacks for ASV systems include replay attack (Wu et al. 2014), imper-
sonation (Hautamaki et al. 2015), voice conversion (Chen et al. 2014), and speech
synthesis (Masuko et al. 1997). Impersonation refers to human voice mimicking.
It is observed that humans can efficiently mimic speakers with similar voice char-
acteristics, and impersonating an arbitrary speaker is more challenging (Lau et al.
2004). Replay attack is a scenario where the attacker has a digital copy of an original
target speaker's utterance and replays it using a playback device to claim a false
identity to the ASV system. Voice conversion refers to transforming the identity of
source (imposter) speaker’s to that of target (intended) speaker’s without altering the
linguistic content. In speech synthesis, unit selection (Masuko et al. 1997) and sta-
tistical approaches (Masuko et al. 1996) are used to generate more natural sounding
speech with specific speaker’s voice characteristics.
Among all the spoofing attacks on ASV systems, voice conversion and speech
synthesis (SS) attacks gain more attention, as impersonation cannot be applied on a
large scale, and even though replay attacks can be accomplished easily, they mainly pose
a threat to text-dependent ASV systems. The availability of open-source software and the lack of
effective countermeasures make voice conversion and speech synthesis genuine,
high-risk attacks.
There are two approaches to overcome spoofing attacks. The first approach is to
build a more robust ASV system; this does not seem to work well, as state-of-the-art
ASV systems based on i-vectors, Gaussian mixture models, and hidden Markov models
are also vulnerable to spoofing (Wu et al. 2016). The second is to build a countermeasure,
which decides whether the input speech is spoofed or natural speech.
This work aims to build an efficient countermeasure based on source separation for
voice conversion spoofing attack.

24.2 Related Work

Voice conversion (VC) is a technique where the identity of the imposter speaker's speech
is transformed to that of the intended speaker's speech. Initially, the source and filter
parameters of target and imposter speaker’s speech are extracted. As the duration of
imposter and target speech features may differ, dynamic time warping is performed to
align them. Later imposter speaker’s speech features are transformed to that of target
by using VC algorithms like: parametric approaches (Toda et al. 2007), frequency
warping (Daniel et al. 2010), and artificial neural networks-based techniques (Desai
et al. 2009; Wu et al. 2015). The converted features are synthesized to obtain spoofed
speech.
The vulnerability of the ASV system to voice conversion spoofing is studied for the Gaus-
sian mixture model (GMM) system in Pellom and Hansen (1999) and the GMM-universal
background modeling (UBM) system in Bonastre et al. (2007). These studies showed
an increase in FAR from around 10 to 40%. The advanced ASV systems based on

joint factor analysis (JFA), i-vectors, and probabilistic linear discriminant analysis
(PLDA) are also vulnerable to the VC spoofing attack, resulting in FARs increased from
3 to 17%.
To overcome the spoofing attacks on the ASV system, various countermeasures are pro-
posed. An efficient countermeasure should differentiate between natural and syn-
thetic (or VC) speech and hence reduce the FAR. To detect VC and SS attacks, spectro-
temporal features derived from local binary patterns were employed in Alegre et al.
(2013b). Phase and modified group delay features are exploited to detect VC spoof-
ing in Wu et al. (2012). Even though the phase-based features efficiently detect VC
attacks, their performance for unknown attacks remains challenging. A relative phase
shift feature was proposed in Sanchez et al. (2015) to detect SS spoofing using min-
imum phase. In Alegre et al. (2013a), an average pair-wise distance (PWD) between
consecutive feature vectors was employed to detect VC speech.
In this paper, we propose a countermeasure which can detect voice conversion
spoofed speech. The countermeasure is built by combining an unsupervised co-channel
speech separation algorithm based on non-negative matrix factorization as the front end
of a CNN-based binary classifier. We also propose to model voice conversion spoofed
speech as an instantaneous mixture of an estimate of the target speech and the artifacts intro-
duced due to voice conversion. The efficiency of the proposed countermeasure is
evaluated using a CNN-based automatic speaker verification system on the voice con-
version challenge 2016 dataset. The proposed countermeasure is also validated on the
noisy speech database NOIZEUS.
The rest of the paper is organized as follows: Sect. 24.3 gives the motivation for
exploring speech separation as a countermeasure, Sect. 24.4 gives the overview of
proposed system, Sect. 24.5 gives experimental results, and Sect. 24.6 concludes the
paper.

24.3 Motivation

This section gives the motivation for the present study. Voice conversion algorithms
mainly focus on transforming the spectral content of the source to that of the target. Hence,
a lot of similarity is observed between the target and spoofed speech. In this study, instead of
processing the signals directly, the speech is pre-processed using a speech separation
block where the artifacts introduced during voice conversion are separated.
The main motivation to apply the source separation algorithm is shown in Fig.
24.1. Based on the fact that voice conversion introduces artifacts in the resultant
speech (Patel and Patil 2017; Wu et al. 2016), if the VC spoofed speech is processed
using source separation, an estimate of the target speech and the artifacts can be obtained.
The major issue, however, is whether the artifact estimate is distinct from other noises and
generalizable across all VC algorithms.
Hence, a study was conducted where speech separation was applied to clean
speech, spoofed speech, and speech degraded with different noises. Figure 24.1
shows the cochleagram plot of artifacts estimates obtained for clean speech from

[Fig. 24.1: nine cochleagram panels (a)–(i); y-axis: Center Frequency (Hz), 80–5000 Hz; x-axis: Time (s)]

Fig. 24.1 Cochleagram plots of artifact estimates obtained using source separation stage for a–
c clean speech from VCC-16 database, Timit and ASV-15 database d–f spoofed speech from
participant submission J, L and M, respectively, from VCC-16 database g–i speech degraded with
street noise, reverberation, and babble noise, respectively

VCC-2016 dataset, Timit and ASV-15 database (Alegre et al. 2013a) (Fig. 24.1a–c).
Voice conversion spoofed speech from VC algorithms J, L, and M from the VCC-2016
dataset is shown in Fig. 24.1d–f, respectively. The details regarding all the VC algorithms in
VCC-16 are discussed in Table 24.2. Further, to show the difference between the arti-
fact introduced by voice conversion and other background noises, the artifact
estimates of speech signals degraded with street noise, reverberation, and babble
noise are shown in Fig. 24.1g–i, respectively. The figure shows that the artifact esti-
mates of VC spoofed speech are unique and are distinguishable from clean speech
and from speech degraded with other background noises and artifacts. This is the major
motivation for the present study.

24.4 Proposed System

The schematic of the proposed system is given in Fig. 24.2. The automatic speaker verifica-
tion (ASV) system is built using a convolutional neural network (CNN) and trained
using Mel-spectrogram images of target speech. In the testing phase, the test signal
Stest is initially given to the proposed countermeasure to classify it as spoofed or nat-
ural speech. The countermeasure is built by combining a speech separation block and a
CNN-based binary classifier. The speech separation block separates the input speech
signal Stest into an estimate of the target Ŝtarget and the artifact αvc introduced due to the voice con-

Fig. 24.2 Schematic of proposed system

version process. The CNN-based binary classifier is trained using Mel-spectrogram


images of αvc of natural and spoofed speech signals. The countermeasure classifies
the input signal as either natural or spoofed speech. The speech signal that is
classified as natural is given to the ASV system; otherwise, it is rejected.
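The decision flow of Fig. 24.2 can be summarised in a short sketch. This is only an illustration of the control flow; the three callables are hypothetical placeholders for the source separation block, the CNN-based binary classifier, and the CNN-based ASV system described in the following subsections.

```python
def asv_with_countermeasure(s_test, separate_sources, is_natural, verify_claim):
    """Hypothetical control flow of the proposed system (Fig. 24.2).

    separate_sources : returns (target_estimate, artifact_estimate) for a signal
    is_natural       : binary classifier applied to the artifact estimate
    verify_claim     : ASV back end applied to the speech estimate
    """
    # Split the test utterance into a speech estimate and a VC-artifact estimate
    s_target_hat, alpha_vc = separate_sources(s_test)

    # Countermeasure: reject the trial if the artifact estimate looks spoofed
    if not is_natural(alpha_vc):
        return "rejected: spoofed speech detected"

    # Only speech classified as natural reaches the speaker verification system
    return "accepted" if verify_claim(s_target_hat) else "rejected: imposter"
```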

24.4.1 Modeling of Voice Conversion Spoofed Speech

As voice conversion includes various stages of processing, artifacts are introduced in the
converted speech (Patel and Patil 2017; Wu et al. 2016). The countermeasure that we
are proposing mainly relies on the artifacts introduced by the voice conversion algorithm.
Figure 24.1 shows that the artifacts introduced by the voice conversion algorithm are
uniquely characterized by discontinuous formants in the high-frequency region. Based
on this, we propose to model the voice-converted speech as an instantaneous mixture
of the estimate of the target Ŝtarget (t) and the artifact introduced during VC, αvc (t).

Sspoof (t) = Ŝtarget (t) + αvc (t) (24.1)

24.4.2 Source Separation

The source separation algorithms separate the mixed speech signal into two inde-
pendent signals. Two state-of-art approaches in unsupervised co-channel speech
separation (USCSS) are computational auditory scene analysis (CASA) and non-
negative matrix factorization (NMF) approaches. A comparative study in Hemavathi
and Swamy (2018) showed that the NMF-based approach performs better, as the CASA-
based speech separation system itself introduces additional artifacts. Hence, in this
work, the NMF-based source separation approach 2D Itakura-Saito non-negative matrix
factorization (ISNMF2D) (Gao et al. 2013) is used.
Initially, STF , the time-frequency (TF) representation of the input speech signal, is decom-
posed into two matrices: D, a set of spectral basis vectors, and H , an encoding matrix
which has the amplitude of each basis vector at each time point. Each element of |STF |0.2
is given by

$|S_{TF}(f, t_s)|^{0.2} = \sum_{i=1}^{I} \sum_{\tau=0}^{\tau_{\max}} \sum_{\phi=0}^{\phi_{\max}} D^{\tau}_{f-\phi,\,i}\, H^{\phi}_{i,\,t_s-\tau}$   (24.2)

where the index shifts f − φ and t_s − τ denote a downward shift by φ rows and a right shift by τ
columns, respectively. Estimates of D and H are obtained using the Quasi-EM IS-NMF2D algorithm
(Gao et al. 2013). The objective is to find the estimates of the target and the artifacts, i.e.,
$\{|X_i(f, t_s)|^{0.2}\}_{i=1}^{I}$, where

$|\tilde{X}_i(f, t_s)|^{0.2} = \sum_{\tau=0}^{\tau_{\max}} \sum_{\phi=0}^{\phi_{\max}} D^{\tau}_{f-\phi,\,i}\, H^{\phi}_{i,\,t_s-\tau}$   (24.3)

Using the above equations, the binary mask is generated as $\text{mask}_i(f, t_s) = 1$
if $|\tilde{X}_i(f, t_s)|^{0.2} > |\tilde{X}_j(f, t_s)|^{0.2}$, else 0.
Finally, the estimated time-domain sources are resynthesized by weighting the
cochleagram of the mixture by the mask as $\tilde{S}_{\text{sep}_i} = \text{Resynthesize}(\text{mask}_i \cdot Y)$.
To classify the outputs as the target estimate or the artifact signal, the correlation
between the input speech and the separated speech signals is considered. The separated
signal with higher correlation with the input is taken as the estimate of the target and the other
as the artifact. The Mel-spectrogram representations of clean and spoofed speech signals
are shown in Fig. 24.3. It is seen that the artifact estimate of spoofed speech is clearly
distinguishable from that of clean speech.
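A simplified sketch of this separation, masking, and correlation-based assignment is given below. It uses an STFT magnitude spectrogram and the standard (non-shifted) Itakura-Saito NMF from scikit-learn as a stand-in for the ISNMF2D algorithm and the gammatone/cochleagram front end used in the chapter; all parameter values are illustrative.

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

def separate_target_and_artifact(y, sr, n_fft=1024, hop=256):
    """Two-source sketch: plain IS-NMF on an STFT magnitude + binary masking."""
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(S) + 1e-10, np.angle(S)      # epsilon keeps IS-NMF valid

    # Factorise the magnitude spectrogram into two components: mag ~ D @ H
    nmf = NMF(n_components=2, beta_loss="itakura-saito", solver="mu",
              init="random", max_iter=400, random_state=0)
    D = nmf.fit_transform(mag)                        # spectral basis vectors
    H = nmf.components_                               # per-frame activations

    # Winner-takes-the-bin binary masks, then resynthesis with the mixture phase
    parts = [np.outer(D[:, i], H[i]) for i in range(2)]
    masks = [(parts[i] >= parts[1 - i]).astype(float) for i in range(2)]
    srcs = [librosa.istft(m * mag * np.exp(1j * phase), hop_length=hop,
                          length=len(y)) for m in masks]

    # The output correlating more strongly with the input is the target estimate
    corr = [abs(np.corrcoef(y, s)[0, 1]) for s in srcs]
    k = int(np.argmax(corr))
    return srcs[k], srcs[1 - k]                       # (target_est, artifact_est)
```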

Fig. 24.3 Mel-spectrogram representation of a clean speech b VC spoofed speech from baseline
from VCC-16 database, respectively. c–d speech estimate and e–f artifact estimate obtained from
natural and spoofed speech, respectively, using source separation

24.4.3 CNN-Based Binary Classifier

The binary classifier is built using the VGG-16 convolutional neural network (CNN)
model (Simonyan and Zisserman 2014) to classify the input as natural or spoofed
speech. The CNN is trained using the Mel-spectrogram images of αvc of natural and
spoofed speech. The Mel-spectrogram is obtained by applying the Mel scale to the linear
spectrogram and gives the magnitude of the TF bins. The Mel-frequency m
representation of frequency f (Hz) can be obtained using (24.4)
 
$m = 2595 \log_{10}\left(1 + \dfrac{f}{700}\right)$   (24.4)
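As a quick numerical check of (24.4), 1000 Hz maps to roughly 1000 mel:

```python
import numpy as np

def hz_to_mel(f_hz):
    # Mel mapping of Eq. (24.4)
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))   # ~1000.0 mel
print(hz_to_mel(4000.0))   # ~2146.1 mel
```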

A 16-layer VGG-16 convolutional neural network architecture is used in this study
(Simonyan and Zisserman 2014). The network is built using a stack of 3 × 3 convolutional
layers. The input to the CNN is a 224 × 224 RGB image. A convolutional layer
accepts the raw pixel values as input and computes its output by multiplying small
chunks of the image with weights; filters are used to localize these chunks and to compute the dot
product. Each convolutional layer is followed by a ReLU layer with activation
function f (x) = max(0, x). Further, a max pooling layer downsamples the dimension
of the matrix, resulting in a smaller output. The fully connected layer computes the class
scores, resulting in a 1 × N-dimensional vector, where N is the number of classes. As we
are designing VGG-16 as a binary model in this work, N = 2. A visualization of the VGG-16
neural network's layers is given in Table 24.1.

24.4.4 Convolutional Neural Network-Based Automatic Speaker Verification System

In this work, a text-independent automatic speaker verification system is built using
Mel-spectrogram images of target speech signals and the VGG-16 convolutional neural
network model. Both the modeling and feature extraction techniques are the same as for the binary
classifier. The difference is that the binary model is trained using Mel-spectrogram images of
αvc of both natural and VC spoofed speech with N = 2, whereas for the ASV VGG-16,
Mel-spectrograms of the target speech signals are given to the network for training,
and N is the number of speakers.

Table 24.1 Visualization of VGG-16 neural network's layers

Sl. No.  Layer        Description
1        'input'      Image input, 224 × 224 × 3 images
2        'conv1_1'    Convolution, 64 3 × 3 × 3 convolutions
3        'relu1_1'    ReLU
4        'conv1_2'    Convolution, 64 3 × 3 × 64 convolutions
5        'relu1_2'    ReLU
6        'pool1'      Max pooling, 2 × 2
7        'conv2_1'    Convolution, 128 3 × 3 × 64 convolutions
8        'relu2_1'    ReLU
9        'conv2_2'    Convolution, 128 3 × 3 × 128 convolutions
10       'relu2_2'    ReLU
11       'pool2'      Max pooling, 2 × 2
12       'conv3_1'    Convolution, 256 3 × 3 × 128 convolutions
13       'relu3_1'    ReLU
14       'conv3_2'    Convolution, 256 3 × 3 × 256 convolutions
15       'relu3_2'    ReLU
16       'conv3_3'    Convolution, 256 3 × 3 × 256 convolutions
17       'relu3_3'    ReLU
18       'pool3'      Max pooling, 2 × 2
19       'conv4_1'    Convolution, 512 3 × 3 × 256 convolutions
20       'relu4_1'    ReLU
21       'conv4_2'    Convolution, 512 3 × 3 × 512 convolutions
22       'relu4_2'    ReLU
23       'conv4_3'    Convolution, 512 3 × 3 × 512 convolutions
24       'relu4_3'    ReLU
25       'pool4'      Max pooling, 2 × 2
26       'conv5_1'    Convolution, 512 3 × 3 × 512 convolutions
27       'relu5_1'    ReLU
28       'conv5_2'    Convolution, 512 3 × 3 × 512 convolutions
29       'relu5_2'    ReLU
30       'conv5_3'    Convolution, 512 3 × 3 × 512 convolutions
31       'relu5_3'    ReLU
32       'pool5'      Max pooling, 2 × 2
33       'fc6'        Fully connected, 4096
34       'relu6'      ReLU
35       'drop6'      Dropout, 50%
36       'fc7'        Fully connected, 4096
37       'relu7'      ReLU
38       'drop7'      Dropout, 50%
39       'fc8'        Fully connected, 1000
40       'prob'       Softmax
41       'output'     Classification output, cross entropy

24.5 Experimental Results

24.5.1 Dataset

To test the efficiency of the proposed countermeasure, the voice conversion challenge 2016
(VCC-2016) dataset is used. In VCC-2016, 162 parallel utterances of 5 source (3
female and 2 male) and 5 target (2 female and 3 male) speakers' speech, along with a
baseline spoofing technique, were provided as training data, and research teams were
asked to submit voice conversion results by applying their VC algorithms to the
data. Seventeen research groups submitted voice conversion results for all combinations (25
combinations) of source and target speakers' speech, and the algorithms are named
in alphabetical order from A to Q; they are briefly described in Table 24.2.

24.5.2 Experimental Setup

In this work, the ASV system is built for five target classes; 500 files were used for training and
310 for evaluation. The training and validation plot of the ASV system is shown in Fig.
24.4b. To train the CNN-based binary classifier, 500 speech files of clean and spoofed
speech are taken. For the spoofed training set, 200 speech files from the baseline (BL) with
mean opinion score (MOS) 1.5 and 150 each from VC algorithms M and L, with MOS 1.9
and 2.9, respectively, from the VCC-16 dataset were used. Mean opinion scores are derived
from Toda et al. (2016), and a high MOS indicates more naturalness in the converted
speech. The training and validation plot of the proposed countermeasure is shown in Fig.
24.4a. Here, FAR indicates spoofed speech classified as natural speech.

24.5.3 Results

Table 24.2 gives the experimental results: the performance of the proposed countermea-
sure (CM) and of the ASV system with and without the proposed countermeasure, in terms of false
alarm rate (FAR) and equal error rate (EER). The vulnerability of the ASV system is
tested on 500 samples (100 from each class) from the VCC-16 database.
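FAR and EER are the evaluation measures used throughout Table 24.2. A small, purely illustrative sketch of how they can be obtained from raw verification scores (by sweeping a decision threshold over genuine and spoof trials) is given below; it is not the authors' evaluation code.

```python
import numpy as np

def far_frr_eer(genuine_scores, spoof_scores):
    """FAR, FRR and EER from ASV scores via a simple threshold sweep."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([np.mean(spoof_scores >= t) for t in thresholds])   # spoof accepted
    frr = np.array([np.mean(genuine_scores < t) for t in thresholds])  # genuine rejected
    i = np.argmin(np.abs(far - frr))                                   # FAR ~ FRR point
    return far, frr, (far[i] + frr[i]) / 2.0

# Toy example with synthetic scores
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 500)
spoof = rng.normal(0.0, 1.0, 500)
_, _, eer = far_frr_eer(genuine, spoof)
print(f"EER ~ {100 * eer:.1f}%")
```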

24.5.4 Validation of Countermeasure in Noisy Condition

The proposed system mainly relies on the artifact estimate; hence, the system's perfor-
mance in a noisy environment is a major issue of concern. To show the efficiency of the
proposed system in speaker-independent and noisy environments, validation is done
on the NOIZEUS database (Hu and Loizou 2007). This database consists of 30 IEEE

Table 24.2 Performance evaluation of the proposed countermeasure, the CNN-based automatic speaker
verification system, and the automatic speaker verification system with the proposed countermeasure in
terms of FAR and EER for 18 different voice conversion algorithms from the VCC-16 database

VC approach | Vocoder | Proposed CM FAR (%) | CNN-based ASV FAR (%) | CNN-based ASV EER (%) | ASV with proposed CM FAR (%) | ASV with proposed CM EER (%)
Baseline-VC | – | 0.2 | 18.8 | 11.1 | 0.4 | 1.9
A | Ahocoder | 12.8 | 52.8 | 28.1 | 0.4 | 1.9
B | Straight | 11.2 | 52.8 | 28.1 | 0.4 | 1.9
C | LPC | 10.2 | 42.4 | 22.9 | 0.2 | 1.8
D | Straight | 4.81 | 57.6 | 30.5 | 0.4 | 1.9
E | Ahocoder | 23 | 45 | 24.2 | 0.4 | 1.9
F | Straight | 1.6 | 43 | 23.2 | 0.2 | 1.8
G | Straight | 17.2 | 41.2 | 22.3 | 1.6 | 2.5
H | Straight | 14.8 | 28.8 | 16.1 | 0.4 | 1.9
I | HSM-based synthesis | 0.5 | 36.2 | 19.8 | 0.4 | 1.9
J | Straight for F0, MLSA for spectral conversion | 2.6 | 58.6 | 31 | 1.8 | 2.6
K | Differential spectral amplitude filtering applied on wide-band-based harmonic modeling | 22.1 | 56 | 29.7 | 2.8 | 3.1
L | Tandem straight | 30.1 | 50.4 | 26.9 | 0.4 | 1.9
M | No | 2.87 | 60.8 | 32.1 | 0.2 | 1.8
N | Scaling of linear prediction spectrum | 5.21 | 44.6 | 24 | 1.2 | 2.3
O | Straight | 7.8 | 57.6 | 30.5 | 0.4 | 1.9
P | Straight | 7.8 | 49.2 | 26.3 | 0.2 | 1.8
Q | Harmonic-noise model | 16.6 | 47 | 25.2 | 0.2 | 1.8

Fig. 24.4 Training and validation plots of a proposed countermeasure and b CNN-based automatic
speaker verification system

Fig. 24.5 Validation of proposed countermeasure for airport, babble, car, exhibition, restaurant,
train-station, and street noises

sentences spoken by six speakers, corrupted by seven real-world noises (airport, bab-
ble, car, exhibition, restaurant, train-station, and street; AURORA database) at 0 dB.
The validation plot of the proposed countermeasure for the NOIZEUS database is shown in
Fig. 24.5. It is seen that for all seven noises at all dB levels, the proposed countermeasure
gives excellent performance. The reason is the uniqueness of the artifact introduced by
voice conversion algorithms, as indicated in Fig. 24.1.

24.5.5 Discussion

Table 24.2 shows a brief description of the voice conversion techniques used in all
17 participant submissions. The proposed countermeasure gives the best result for the baseline,
i.e., an FAR of 0.2%, as it has more artifacts and a lower MOS, and the poorest result for dataset
L, which has a high MOS of 3. The ASV system is vulnerable to all spoofing attacks;
the lowest FAR is observed for the baseline, i.e., 18.80%, and algorithms which have high
similarity and MOS successfully increase the FAR to more than 50%. By combining
the countermeasure with the ASV system, the FAR is found to decrease from 60.8 to 2.8%.
The efficiency of the proposed algorithm can be seen for the three known attacks (used
for training the countermeasure), baseline and participant submissions L and M, for which the highest
FAR reported is 0.4%, and for the 15 other unknown attacks, the highest FAR observed is
2.8% for participant submission N, which has an MOS of 3. Based on the results, the
proposed countermeasure can be considered a reliable countermeasure for the
voice conversion spoofing attack. Even though noisy speech was not used while
training the binary classifier, the validation plot for seven different noises shows that the
system can perform well in noisy environments too.

24.6 Conclusion

In this work, we proposed a countermeasure for the ASV system by combining a source
separation framework with a CNN-based binary classifier for the voice conversion spoof-
ing attack. We conducted a study to show that the artifacts of spoofed speech are unique and
generalizable to various VC spoofing algorithms. We also proposed to model voice-
converted speech as an instantaneous mixture of the target estimate and the artifact.
To classify the artifact estimate as natural or spoofed speech, a CNN-based binary
classifier is proposed so that it can generalize to a wide range of unknown attacks.
In this work, the vulnerability of the CNN-based automatic speaker verification system
is also studied. Results show that after combining the proposed countermeasure
with the ASV system, the false alarm rate of the automatic speaker verification system is
decreased significantly for unknown attacks too. Further, the validation of the coun-
termeasure for various noises shows the efficiency of the proposed countermeasure
in noisy environments. Hence, this algorithm can be considered a reliable algorithm
to detect spoofed speech.

Acknowledgements The first author would like to thank Women Scientist Scheme-A, Department
of Science and Technology (WOS-A DST), Government of India for providing financial assistance
vide reference number SR/WOS-A/ET-69/2016.

References

Alegre F, Amehraye A, Evans N (2013a) Spoofing countermeasures to protect automatic speaker


verification from voice conversion. In: ICASSP 2013, IEEE International conference on acoustics,
speech, and signal processing, 26–31 May, 05
Alegre F, Vipperla R, Amehraye A, Evans N (2013b) A new speaker verification spoofing counter-
measure based on local binary patterns. In: INTERSPEECH
Bonastre J-F, Matrouf D, Fredouille C (2007) Artificial impostor voice transformation effects on
false acceptance rates. In: INTERSPEECH
Campbell WM, Campbell JP, Gleason TP, Reynolds DA, Shen W (2007) Speaker verification
using support vector machines and high-level features. IEEE Trans Audio Speech Lang Process
15(7):2085–2094
Chen LH, Ling ZH, Liu LJ, Dai LR (2014) Voice conversion using deep neural networks with layer-
wise generative training. IEEE/ACM Trans Audio Speech Lang Process 22(12):1859–1872
Daniel E, Asunción M, Antonio B (2010) Voice conversion based on weighted frequency warping.
IEEE Trans Audio Speech Lang Process 18(5):922–931
Desai S, Veera Raghavendra E, Yegnanarayana B, Black AW, Prahallad K (2009) Voice conversion
using artificial neural networks. In: IEEE International conference on acoustics, speech and signal
processing ICASSP, pp 3893–3896
Gao B, Woo WL, Dlay SS (2013) Unsupervised single-channel separation of nonstationary signals
using gammatone filterbank and itakura saito nonnegative matrix two-dimensional factorizations.
IEEE Trans Circ Syst I Regular Pap 60(3):662–675
Hautamäki RG, Kinnunen T, Hautamäki V, Laukkanen A-M (2015) Automatic versus human
speaker verification: the case of voice mimicry. Speech Commun 72:13–31
Hemavathi R, Swamy RK (2018) Unsupervised speech separation using statistical, auditory and
signal processing approaches. In: 2018 International conference on wireless communications,
signal processing and networking (WiSPNET), Mar 2018, pp 1–5
Hu Y, Loizou PC (2007) Subjective comparison and evaluation of speech enhancement algorithms.
Speech Commun 49(7–8):588–601
Kinnunen T, Karpov E, Franti P (2006) Real-time speaker identification and verification. IEEE
Trans Audio Speech Lang Process 14(1):277–288
Lau YW, Wagner M, Tran D (2004) Vulnerability of speaker verification to voice mimicking. In:
Proceedings of international symposium on intelligent multimedia, video and speech processing,
Oct 2004, pp 145–148
Masuko T, Tokuda K, Kobayashi T, Imai S (1996) Speech synthesis using HMMS with dynamic
features. In: Proceedings of ICASSP, vol 1, pp 389–392
Masuko T, Tokuda K, Kobayashi T, Imai S (1997) Voice characteristics conversion for hmm-based
speech synthesis system. In: IEEE International conference on acoustics, speech, and signal
processing, Apr 1997, vol 3, pp 1611–1614
Patel TB, Patil HA (2017) Cochlear filter and instantaneous frequency based features for spoofed
speech detection. IEEE J Sel Top Sig Process 11(4):618–631
Pellom BL, Hansen JHL (1999) An experimental study of speaker verification sensitivity to com-
puter voice-altered imposters. In: IEEE International conference on acoustics, speech, and signal
processing. Proceedings ICASSP99, Mar 1999, vol 2, pp 837–840
Sanchez J, Saratxaga I, Hernáez I, Navas E, Erro D, Raitio T (2015) Toward a universal synthetic
speech spoofing detection using phase information. IEEE Trans Inf Forensics Secur 10(4):810–
820
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recog-
nition. arXiv preprint arXiv:1409.1556. http://arxiv.org/abs/1409.1556
Toda T, Black AW, Tokuda K (2007) Voice conversion based on maximum-likelihood estimation
of spectral parameter trajectory. IEEE Trans Audio Speech Lang Process 15(8):2222–2235
Toda T, Chen L-H, Saito D, Villavicencio F, Wester M, Wu Z, Yamagishi J (2016) The voice
conversion challenge. In: Interspeech, pp 1632–1636

Wu Z, Chng ES, Li H (2012) Detecting converted speech and natural speech for anti-spoofing attack
in speaker recognition. In: INTERSPEECH
Wu Z, Gao S, Cling ES, Li H (2014) A study on replay attack and anti-spoofing for text-dependent
speaker verification. In: Signal and information processing association annual summit and con-
ference (APSIPA), 2014 Asia-Pacific, Dec 2014, pp 1–5
Wu Z, Nicholas E, Tomi K, Junichi Y, Federico A, Haizhou L (2015) Spoofing and counter measures
for speaker verification: a survey. Speech Commun 66:130–153
Wu Z, De Leon PL, Demiroglu C, Khodabakhsh A, King S, Ling ZH, Saito D, Stewart B, Toda T,
Wester M, Yamagishi J (2016) Anti-spoofing for text-independent speaker verification: an initial
database, comparison of countermeasures, and human performance. IEEE/ACM Trans Audio
Speech Lang Process 24(4):768–783
Chapter 25
Statistical Prediction of Facial Emotions
Using Mini Xception CNN and Time
Series Analysis

Basudeba Behera, Amit Prakash, Ujjwal Gupta, Vijay Bhaksar Semwal,


and Arun Chauhan

Abstract The growing era of facial recognition has opened a large area of compu-
tational study. Facial emotion recognition has always been a challenging task in the
field of deep learning. In this work, we have proposed an approach to not only
study human facial emotion but also predict one's emotions by collecting the same
person's data. One part of the article presents the usage of a CNN for detecting facial
emotions, in which it takes in real-time video frames and predicts the probabilities
of the seven basic emotion states. The output data of the CNN model serves as the input
for the time-series analysis model, and the task of predicting one's future emotions
has been accomplished. The two-step hierarchical structure helps in studying human
behaviour to predict future outcomes. Finally, the model can be used for continuous
monitoring and prediction of a person's behaviour, provided the emotional param-
eters remain constant. This work can be used in many interrogatory procedures or as a preventive
measure by collecting a convict's facial data.

25.1 Introduction

Reading facial emotion is not difficult work for a human being, but doing the same
thing using machine learning or a neural network is challenging. Facial expressions
represent more than 55% (Mehrabian 2017) of one's emotions. Advancements in
machine learning and neural networks have made the application of emotional anal-
ysis available to the public. Possibly the study of facial expressions started from the
facial action coding system (Clark et al. 2020). A lot of work has been carried out in the field of computer
vision, and it is able to provide satisfactory results (Semwal et al. 2017).

B. Behera (B) · A. Prakash · U. Gupta


Department of Electronics and Communication Engineering, NIT Jamshedpur, Jamshedpur,
Jharkhand, India
e-mail: basudeb.ece@nitjsr.ac.in
V. B. Semwal
Department of Computer Science Engineering, MANIT, Bhopal, Madhya Pradesh, India
A. Chauhan
Department of Computer Science Engineering, IIIT Dharwad, Dharwad, Karnataka, India

According to Shaver et al. (1987), there are seven basic emotions into which one’s
expression could be classified—happy, sad, neutral, surprise, anger, fear and disgust.
Studying these personal emotions and predicting future emotions can help predict
one’s mindset. This work focuses not only on studying facial emotions, but also
studying one’s emotions over a period of time and predicting future behaviour pro-
vided the same emotional parameters. To deal with the initial part of the work, Haar
cascade (Cuimei et al. 2017) and Mini Xception CNN models were used, and for the
later part FB Prophet for time-series analysis model is used for the future prediction.
The Haar cascade model helps in detecting the face in the given video frame using
Haar features (Kaehler and Bradski 2016). By studying different regions of the pic-
ture one by one, it creates a bounding box. Since in a Haar cascade model weights
are assigned manually, the training is done very fast, and results are perfect for espe-
cially still and front facing images. After the bounding box is created, Mini Xception
CNN model is used for emotion recognition. The model predicts the emotion of the
detected face whether happy, sad, neutral, surprise, anger, fear or disgust (Cuimei
et al. 2017). The data from this study is saved over a period of time and then passed
in as a csv file which serves an input for the FB Prophet model. Studying through
the data, it forecasts the emotions of the object for an upcoming time period with
parameters remaining constant.
This work serves as a broad survey of the usage of these three models together.
All the major problems were handled properly to obtain the results, with proper pre-
processing of the data used. It also discusses the challenges that occurred and how they
were handled. The key focus was on the usage of the time-series model for predicting
future emotional behaviour.

25.2 Previous Works

CNNs have revolutionised the study of image processing (Behera et al. 2020a, b);
commonly used CNN models use fully connected layers as the basis for feature extrac-
tion (Arriaga et al. 2019). In recent models such as Inception V3 (Szegedy et al.
2015), rather than focusing on the fully connected layer towards the end, much of the
feature extraction is done in the global average pooling layer. This layer forces
the network to extract global features from the input image. The most recent of these architec-
tures is the Xception CNN (Chollet 2017) model, which is used in this work.
It works on the combination of the two most successful experimental assumptions, i.e.
residual modules (He et al. 2016) and depth-wise separable convolutions (Howard
et al. 2017). The depth-wise separable convolutions serve as the basis of the reduc-
tion in the number of parameters used, by separating the processes of feature
extraction and combination within the convolutional layers.
The study of facial emotions was just the first part of this work. The innovation is the
usage of time-series analysis for predicting the trends of a particular emotion. Pre-
dicting the future by collecting data over a period and finding the trend in its
variation can be solved well by using the Prophet tool made available by Facebook

(Polusmak 2017). The Prophet tool was introduced for creating high-quality business
forecasts. This study of the number over the period of time is been used for predict-
ing the state of one’s emotions by studying the emotions over a period of time. The
input for the FB Prophet model provided by this work is the statistical analysis of a
particular emotion over a period of time using the Xception CNN model.
Sun et al. (2020) proposed a robust vectorized convolutional neural network (CNN)
model for extracting features in the regions of interest (ROIs) of the face. The atten-
tion concept was adopted in the first layer of the neural network to perform ROI-
related convolution calculations, and the convolution results of specific
fields in the ROIs are enhanced to extract more robust features. Com-
prehensive comparative experiments and cross-database experiments were conducted
to verify the validity and robustness of the proposed model.
Choi and Song (2020) proposed a two-dimensional (2D) landmark feature map for
effectively recognising facial micro-expressions (FMEs). The proposed 2D
landmark feature map (LFM) is obtained by transforming conventional coordinate-
based landmark information into 2D image information. LFM is designed to have
an advantageous property independent of the intensity of facial expression change.
Alam et al. (2019) proposed an IoMT-based emotion recognition system for affective
state mining. Human psychophysiological observations are collected through elec-
tromyography (EMG), electro-dermal activity (EDA) and electro-cardiogram (ECG)
medical sensors and analysed through a deep convolutional neural network (CNN)
to determine the covert affective state. They performed an experimental study, and a
benchmark dataset was used to analyse the performance of the proposed method.

25.3 Proposed Method

25.3.1 Facial Emotion Recognition

Facial emotion recognition consists of two processes:

25.3.1.1 Face Detection

In this process, efforts were made to find the face in the given frame/image on which
the emotion recognition algorithm is applied. For this, the Haar Cascade (Cuimei et al.
2017) is used, which is a machine learning (ML) algorithm used to identify objects in
a given frame/image. This model is well suited for frontal face detection. The fast
nature of the model helps in detecting the face at every frame of the video. For training
this classifier, a lot of positive and negative images are required (Tutorials 2020). Posi-
tive images are the ones which we want our classifier to identify, and negative images
are the images of everything else apart from the positive ones. Haar features, as shown

Fig. 25.1 Haar features used to extract features from images (Kaehler and Bradski 2016)

in Fig. 25.1 (Cuimei et al. 2017), are used to extract features from the images. A
single value is calculated as the difference between the sum of pixels under the white
rectangle and the sum of pixels under the black rectangle, and this value is the calculated feature
(Tutorials 2020). Rectangle features can be calculated using integral images (an inter-
mediate representation of the image). The integral image at a point x, y is the sum
of the pixel values to its left and above (https://www.researchgate.net/publication/
3940582-Rapid-Object-Detection-using-Boosted-Cascade-of-Simple-Features).

$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')$   (25.1)
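A small numpy sketch of the integral image of (25.1), and of the rectangle sums from which the Haar feature values are computed, is shown below; the zero-padded first row and column are an implementation convenience, not part of the equation.

```python
import numpy as np

def integral_image(img):
    """ii(x, y): sum of all pixels above and to the left of (x, y), Eq. (25.1)."""
    ii = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))          # zero row/column simplifies lookups

def rect_sum(ii, top, left, h, w):
    """Sum of any rectangle with only four lookups of the integral image."""
    return (ii[top + h, left + w] - ii[top, left + w]
            - ii[top + h, left] + ii[top, left])

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
# Two-rectangle Haar feature: white (left 4x2 block) minus black (right 4x2 block)
feature = rect_sum(ii, 0, 0, 4, 2) - rect_sum(ii, 0, 2, 4, 2)
print(feature)   # 52 - 68 = -16
```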

As per the literature, most of the region inside an image is a non-face region.
So, the first check is whether a window is a non-face region or not. If it is
a non-face region, it is discarded without further processing, and the search continues
over the remaining windows for the facial position. Instead of applying all features
on a window at once, the features are grouped, and each group is applied one by one
at different stages. The number of features in the initial stages is small and keeps
increasing in later stages (Tutorials 2020). A window is processed by the next
stage only if it passes the previous stage. If the window passes all
stages, then it is a face region.
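A minimal OpenCV sketch of this cascade-based frontal face detection on a live video stream is given below; the cascade file shipped with OpenCV and the detection parameters (scaleFactor, minNeighbors, minSize) are typical defaults, not values reported by the authors.

```python
import cv2

# Pre-trained frontal-face Haar cascade bundled with OpenCV
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

cap = cv2.VideoCapture(0)                    # webcam as the real-time video source
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5, minSize=(48, 48))
    for (x, y, w, h) in faces:               # bounding box around each detected face
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```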

25.3.1.2 Emotion Recognition

Now that the image with the frontal face has been detected, its emotion needs to be classified
into one of the seven classes considered. For
finding the specific emotion, the Xception CNN model is used. This architecture is
small and performs well in emotion classification. The Xception CNN architecture (Fig.
25.3) differs slightly from the standard CNN model (Fig. 25.2).

Fig. 25.2 Standard CNN model architecture (Saha 2018)

Fig. 25.3 Mini Xception model architecture (Arriaga et al. 2019)

In the standard CNN architecture, fully connected layers are used at the end; most of the
parameters reside in these layers, and standard convolutions are used. The Xception
CNN architecture instead uses residual modules and depth-wise separable convolutions.
Residual modules modify the expected mapping of subsequent layers. Thus, the
learned features become the difference between the desired features and the original
feature map.
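A sketch of one such residual block, built from the depth-wise separable convolutions discussed next and in the spirit of the Mini Xception blocks of Fig. 25.3, is shown below; the filter counts and the strided 1 × 1 projection on the shortcut are illustrative choices, not the exact published architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_separable_block(x, filters):
    """Residual module whose main path uses depth-wise separable convolutions."""
    # Shortcut: strided 1x1 convolution so shapes match the downsampled main path
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # Main path: two separable convolutions followed by max pooling
    y = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.BatchNormalization()(y)
    y = layers.SeparableConv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.MaxPooling2D(3, strides=2, padding="same")(y)

    # The block learns the difference between the desired mapping and the shortcut
    return layers.Add()([shortcut, y])

inputs = tf.keras.Input(shape=(64, 64, 1))
x = layers.Conv2D(8, 3, padding="same", activation="relu")(inputs)
x = residual_separable_block(x, 16)
x = residual_separable_block(x, 32)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(7, activation="softmax")(x)   # seven basic emotions
model = tf.keras.Model(inputs, outputs)
```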
The depth-wise separable convolutions are a combination of two different layers:
• Depth-wise convolutions.
• Pointwise convolutions.
These layers separate the channel cross-correlations from the spatial cross-
correlations. To do this, firstly a D × D filter is applied to each of the M input channels.
Then, N 1 × 1 × M convolution filters are applied to com-
bine the M input channels into N output channels. Each value in the feature map is
combined by applying the 1 × 1 × M convolutions without considering the spatial
relationships within the channel. The computation is reduced by the depth-wise sep-
arable convolutions. We can see how efficient the depth-wise separable convolutions
are as compared to standard convolutions from the number of calculations involved
in each of them. In normal convolutions, for an image of size

Df × Df × M (25.2)

and N filters of size


Dk × Dk × M (25.3)

the output size will be


Dp × Dp × N (25.4)

Total number of multiplications in normal convolutions

$N \times D_p^2 \times D_k^2 \times M$   (25.5)

In depth-wise separable convolution:


(i) for the depth-wise convolution operation, the filter size is

Dk × Dk × 1 (25.6)

and M such filters are required. So the output is of size

Dp × Dp × M (25.7)

Total number of multiplications in depth-wise convolution operation

$M \times D_k^2 \times D_p^2$   (25.8)

Fig. 25.4 Standard convolution (Arriaga et al. 2019)

(ii) for point-wise convolution, 1 × 1 convolution is applied to M channels. Filter


size is 1 × 1 × M. For N such filters, the output size will be

Dp × Dp × N (25.9)

Total number of multiplications in point-wise convolution operation

$M \times D_p^2 \times N$   (25.10)

Overall total number of operations in depth-wise separable convolution = Mul-


tiplications in depth-wise convolution + Multiplications in point-wise convolution.
Overall total number of operations

$M \times D_p^2 \times (D_k^2 + N)$   (25.11)

So, the ratio of number of operations in depth-wise separable convolution to the


number of operations in normal convolution

$\dfrac{1}{N} + \dfrac{1}{D_k^2}$   (25.12)

From here, we can see that depth-wise separable convolutions require far fewer
computations than standard convolutions. Figure 25.4 (Shaver et al. 1987) shows
the difference between the architecture of the standard convolutions and that of the
depth-wise separable convolutions.
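The saving expressed by (25.12) can be verified numerically. The sketch below counts the multiplications of (25.5) and (25.11) for an illustrative layer size and compares their ratio with 1/N + 1/Dk²; the dimensions are arbitrary examples.

```python
# Example layer: Dp = 56 output size, Dk = 3 kernel, M = 64 in, N = 128 out channels
Dp, Dk, M, N = 56, 3, 64, 128

standard = N * Dp**2 * Dk**2 * M        # Eq. (25.5): standard convolution
separable = M * Dp**2 * (Dk**2 + N)     # Eq. (25.11): depth-wise separable

print(standard, separable)              # 231211008 vs 27496448
print(separable / standard)             # ~0.1189
print(1 / N + 1 / Dk**2)                # ~0.1189, matching Eq. (25.12)
```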

25.3.2 Future Emotion Prediction Using Time-Series Analysis

Time-series analysis (TSA) is a way to analyse time-series data and extract some
useful information from it. Time-series data is just a series of data that are arranged
based on time periods or intervals (Adhikari and Agrawal 2013). It can be extremely
valuable if we are able to accurately predict the future. TSA is already being
used in many areas such as economic and sales forecasting, stock market analysis,

budgetary analysis, census analysis, etc., and produces satisfactory results. In this
article, TSA is applied in the field of medical and security systems. The aim of
time-series analysis is to develop a mathematical model and then estimate the model to
predict future patterns. The objective of TSA in our model is to identify the nature
of the emotions and then forecast or predict their future values.
FB Prophet is used for future emotion prediction. It is an open-source forecasting
tool provided by Facebook and available in both R and Python. Prophet uses three
main components: trend, holidays and seasonality. Trend deals with the piecewise
logistic or linear growth curve for non-periodic variations in the time series, as in our
case. It implements two trend models: the saturating growth model and the piecewise
linear model (Taylor and Letham 2018). This work does not deal with the holiday
and seasonality effects in our model. Prophet frames forecasting as a curve-fitting
problem rather than dealing with the time-based dependence of each reading in the
input time series. It is robust to outliers and dramatic shifts in the trend and typically
works fine even when handling missing data. We provide the CSV file
obtained from the previous part as the input. We have taken facial readings around
4 times daily for 3 days and fed these as input; in total, around 15,000 input
data points are fed. The input to Prophet must have exactly two columns, one the
date-time stamp as ‘ds’ and the other the recorded value (emotion in our case) as ‘y’
(Adhikari and Agrawal 2013). We predict with minute as the frequency and for
a period equal to 60 (i.e. 1 min × 60 = 60 min, or one hour). Based on the trends
present in the input, this model gives the prediction of emotion for the next hour.
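A minimal sketch of this forecasting step is shown below. The CSV file name and its column names are hypothetical placeholders for the output of the emotion recognition stage, but the Prophet calls (fit, make_future_dataframe with minute frequency and a 60-step horizon, predict) follow the setup described above.

```python
import pandas as pd
from prophet import Prophet        # older installations import this as `fbprophet`

# One emotion at a time: timestamps in 'ds', recorded emotion value in 'y'
df = pd.read_csv("sad_emotion_log.csv")                    # hypothetical file name
df = df.rename(columns={"timestamp": "ds", "sad": "y"})[["ds", "y"]]

model = Prophet(daily_seasonality=True)                    # trend + daily pattern
model.fit(df)

# Predict the next hour at one-minute resolution (60 future points)
future = model.make_future_dataframe(periods=60, freq="min")
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```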

25.4 Results and Experimental Study

The web camera served as the input for our model. It worked on real-time video.
The video taken from the camera was converted into frame sequences where each
frame served as a single image input for the model. Figure 25.5 given below shows
the output of the face detection model. The bounding box is built around the face.
Usage of the Haar Cascade model helped in fast frontal face detection. The square frame
around the face in Fig. 25.5 is the bounding box created (Table 25.1; Fig. 25.7).
The next part after detecting the face was to statistically analyse the seven basic facial
emotions of the detected face. For this, we have used the Mini Xception
model as discussed. Figure 25.6 displays the output of the Mini Xception model. It
displays the percentage of all seven emotions of the detected face and also displays
the one with the maximum percentage. The main parameters for this are the widening
of the lips and the contraction on the sides of the eyes and cheeks.
For the second and most important part of this work, involving the pre-
diction of emotion, this calculated data serves as the input for the time-series model.
The data calculated over the period is sent in as the input in the form of a csv file.
The model takes each emotion as a different set of input, and with the collected data
of an individual over the period, it predicts the percentage of each emotion for the

Fig. 25.5 Depth-wise separable convolutions (Arriaga et al. 2019)

Table 25.1 The percentage of various emotions

Sl. No.  Emotions    Percentage of prediction (%)
01       Angry       8.03
02       Disgust     0.19
03       Scared      12.46
04       Happy       32.73
05       Sad         6.73
06       Surprised   1.17
07       Neutral     38.68

Fig. 25.6 Face detection using Haar Cascade

Fig. 25.7 Statistical analysis of the seven emotional states

Fig. 25.8 Prediction of sad emotion

upcoming future. Here, we show the output for two of the seven emotions, sad
and angry. Figures 25.8 and 25.9 show the plot of the input data and the predicted
values for the next hour for sad and angry, respectively. The dots represent the input
data points, and the curve after the last black dot is that of the predicted values.
Figures 25.10–25.13 show the trends of both the emotions, sad and angry.
Figures 25.10 and 25.12 show the predicted trend over the entire period
of the data tabulated, while Figs. 25.11 and 25.13 show the predicted changes
over a day. The daily trend curve is calculated by observing the changes within
24 hours. Similarly, the model predicts the percentage of each of the seven emotions.
As we are not providing data over months or years, monthly or yearly trends are

Fig. 25.9 Prediction of angry emotion

Fig. 25.10 Trend for emotion—sad, trend over the period of time

Fig. 25.11 Trend for emotion—sad, trend changes over a day



Fig. 25.12 Trend for emotion—angry, trend over the period of time

Fig. 25.13 Trend for emotion—angry, trend changes over a day

not present in the output. This model can be used to gather information about the
current emotion of a person, and as the amount of input data increases over time, the
predicted values should also become more accurate.

25.5 Conclusion

Hence, the experimental work demonstrates the two-stage use of emotion recognition and
emotion prediction. Various results were derived from the facial emotion data. These behavioural studies of
human emotions serve as the basis for many interrogative and preventive cases, and the system
can serve as a monitoring mechanism. A sudden change in emotional behaviour or a
deviation from the regular trend of a particular emotion can be easily analysed here.
The work carried out in this article can be beneficial for tasks
such as doctors monitoring patients' emotions and police personnel monitoring
criminals' mental activities.

References

Adhikari R, Agrawal RK (2013) An introductory study on time series modeling and forecasting
Alam MGR, Abedin SF, Moon S II, Talukder A, Hong CS (2019) Healthcare IoT-based affective
state mining using a deep convolutional neural network. IEEE Access 7:1–15. https://doi.org/10.
1109/ACCESS.2019.2919995
Arriaga O, Valdenegro-Toro M, Plöger PG (2019) Real-time convolutional neural networks for
emotion and gender classification. In: ESANN 2019—Proceedings, 27th European symposium
on artificial neural networks, computational intelligence and machine learning, pp 221–226
Available online. https://www.researchgate.net/publication/3940582-Rapid-Object-Detection-
using-Boosted-Cascade-of-Simple-Features. Accessed: 03-Sept-2020
Behera B, Kumar N, Mahato MR, Prasad BK, Semwal VB (2020a) Weather forecasting and
monitoring using machine learning. In: National conference on electronics, communication and
computation—NCECC 2020. MANTECH Publications, Jamshedpur, pp 1–6
Behera B, Kumar N, Mahato MR, Kumar A (2020b) COVID-19 detection using advanced CNN and
X-rays. In: Arpaci I et al (eds) Emerging technologies during the era of COVID-19 pandemic.
Springer Nature, Berlin, pp 1–11
Choi DY, Song BC (2020) Facial micro-expression recognition using two-dimensional landmark
feature maps. IEEE Access 8:121549–121563. https://doi.org/10.1109/ACCESS.2020.3006958
Chollet F (2017) Xception: deep learning with depth wise separable convolutions, pp 1–8. http://
arxiv.org/abs/161002357v3. arXiv: 161002357v3. https://doi.org/10.1109/CVPR.2017.195
Clark EA, Kessinger J, Duncan SE et al (2020) The facial action coding system for characterisation of
human affective response to consumer product-based stimuli: a systematic review. Front Psychol
11:1–21. https://doi.org/10.3389/fpsyg.2020.00920
Cuimei L, Zhiliang Q, Nan J, Jianhua W (2017) Human face detection algorithm via Haar cascade
classifier combined with three additional classifiers. In: IEEE 13th International conference on
electronic measurement & instruments. IEEE, pp 483–487
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE
Conference on computer vision and pattern recognition, pp 770–778
Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017)
MobileNets: efficient convolutional neural networks for mobile vision applications, pp 1–9. http://
arxiv.org/abs/170404861v1. arXiv:170404861v1
Kaehler A, Bradski G (2016) Learning OpenCV 3: computer vision in C++ with the OpenCV
library, 1st edn. O’Reilly, Sebastopol
Mehrabian A (2017) Nonverbal communication. Taylor & Francis Group, New York, USA
Polusmak E (2017) Time series analysis in Python: predicting the future with Facebook Prophet.
In: mlcourse.ai. https://mlcourse.ai/articles/topic9-part2-prophet/. Accessed: 03-Sept-2020
Saha S (2018) A comprehensive guide to convolutional neural networks—the ELI5 way. In: Towards
Data Science. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-
networks-the-eli5-way-3bd2b1164a53. Accessed: 03-Sept-2020
Semwal VB, Singha J, Sharma PK, Chauhan A, Behera B (2017) An optimized feature selection
technique based on incremental feature analysis for bio-metric gait data classification. Multimed
Tools Appl 76:24457–24475. https://doi.org/10.1007/s11042-016-4110-y
Shaver P, Schwartz J, Kirson D, O’Connor C (1987) Emotion knowledge: further exploration of a
prototype approach. J Pers Soc Psychol 52:1061–1086
Sun X, Zheng S, Fu H (2020) ROI-attention vectorized CNN model for static facial expression
recognition. IEEE Access 8:7183–7194. https://doi.org/10.1109/ACCESS.2020.2964298

Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2015) Rethinking the inception architecture
for computer vision, pp 1–8. arXiv: 151200567v3. https://doi.org/10.1109/CVPR.2016.308
Taylor SJ, Letham B (2018) Forecasting at scale. In: American Statistician. Available online. https://
facebook.github.io/prophet/. Accessed: 03-Sept-2020
Tutorials O-P (2020) Face detection using Haar cascades. In: OpenCV. https://opencv-python-
tutroals.readthedocs.io/en/latest/py-tutorials/py-objdetect/py-face-detection/py-face-detection.
html. Accessed: 03-Sept-2020
Chapter 26
Identification of Congestive Heart
Failure Patients Through Natural
Language Processing

Niyati Baliyan, Aakriti Johar, and Priti Bhardwaj

Abstract Research in the biomedical field requires technical infrastructures to deal with
heterogeneous and multi-sourced big data. Biomedical informatics primarily uses data
in electronic health records to better understand how diseases spread or to gather new insights
from patient history. One of the prominent use cases of electronic health records
is the identification of a patient cohort (group) with a specific disease or some common
characteristics, so that useful inferences may be drawn from these records. This paper
proposes a methodology for the identification and analysis of cohorts of patients having
congestive heart failure among obese patients. This may help doctors and
medical researchers in predicting outcomes, survival analysis of patients, clinical
trials, and other types of retroactive studies. The cTAKES tool was used to apply natural
language processing techniques in order to identify patients belonging to a particular
cohort. All clinical terms were identified and mapped to their matching terms
in the UMLS Metathesaurus. Also, negated statements were detected and removed
from the final cohort. The method is reasonably automated and achieves accuracy,
precision, recall, and F-score values of 0.970, 0.972, 0.958, and 0.965, respectively.
Results were compared against the experts' annotations. Additionally, a manual review
of clinical records was performed for further validation.
Results were compared against the experts annotations. Additionally, manual review
of clinical records was performed for further validation.

26.1 Introduction

An ongoing interest of researchers in the biomedical informatics field demands strengthening the techniques for identifying patient cohorts (groups) for research studies and clinical trials which involve the secondary use of electronic health record (EHR) data (Pang et al. 2018). Identifying patients who satisfy a predefined criterion from among the large number of patients in an organization has innumerable use cases, including predicting outcomes, survival analysis of patients, clinical trials, and other types of retrospective studies.

N. Baliyan (B) · A. Johar · P. Bhardwaj


Department of Information Technology, IGDTUW, Delhi 110006, India
e-mail: niyatibaliyan@igdtuw.ac.in


Cohort identification for patients with a common characteristic, for example, a particular disease, similar symptoms, or similar allergies, is one of the major tasks in the medical field in order to utilize existing patient records for future research or to derive new useful insights from existing case studies (Gupta et al. 2018). Cohort identification may be carried out with the help of the enormous patient and biomedical data stored in institutional as well as public repositories. This data may be structured, unstructured, or semi-structured, and efficient and appropriate techniques are needed to extract and utilize it (Saini et al. 2017).
Nevertheless, the procedure of differentiating groups of patients on the basis of their records stored in the EHR can be exceedingly taxing and time consuming owing to the complexity of the criteria on which the grouping has to be performed. This is because the text reflecting these criteria is scattered across several documents and several data points in a patient's EHR (Malathi et al. 2019).

26.1.1 Background

An EHR, also referred to as an electronic medical record (EMR), is the systematized collection of patient and population health information, stored electronically in a digital format. These records can be shared across different healthcare settings. Records are shared through network-connected, enterprise-wide information systems or other information networks and exchanges (Shickel et al. 2017). EHRs may include a range of data, including demographics, medical history, medication and allergies, immunization status, laboratory test results, radiology images, vital signs, personal statistics like age and weight, and billing information. The growing accessibility and mobility of EHRs have increased the ease with which they can be used by doctors compared with paper-based medical records. EHRs provide researchers unmatched phenotypic comprehensiveness and have the potential to advance precision medicine techniques at scale. One of the chief EHR-based use cases is determining a patient cohort identification algorithm or framework which can discover disease status, onset, and severity (Fox et al. 2018). Phenotype algorithms using EHR data to classify patients with specific diseases and outcomes are a foundation of EHR research.
Clinicians provide significant supplementary observations in the form of unstructured documents, such as patient progress notes, radiological reports, and clinical narratives. However, there are open difficulties in parsing such heterogeneous and complex clinical narratives. Additional challenges, such as abbreviations, grammatical errors, spelling mistakes, and local dialectical phrases, make the job of processing this data even harder. Moreover, data collected in unstructured and structured forms needs strategies for recombination. The ability to extract meaningful data from EHRs and integrate it into a reasonable structure can bring prominent benefit for patient cohort identification in an automated manner (Grana and Jackwoski 2015).

26.1.2 Research Objectives

Nowadays, information technology is in use across a wide range of applications, one of them being biomedical science, where it is known as health information technology (HIT). Extensive studies and research have been published outlining automated cohort identification techniques employed by medical institutions. Much effort has been put towards improving approaches developed at one site so that they can be used across multiple sites. Yet, there is a lack of established standard tools that can be chosen by organizations and used without notable difficulties. There is little clarity concerning the essence of cohort identification or phenotyping solutions at present. Three major technologies were identified during the literature survey as being studied for cohort identification of patients: natural language processing (NLP), machine learning (ML), and semantic web (shown in Fig. 26.1) (Johar and Baliyan in press).
About 5 million patients have congestive heart failure (CHF), and over 550,000 patients are diagnosed with CHF for the first time every year. It is ranked fifth as a cause of hospitalization overall and is one of the leading causes of hospitalization in the elderly (Pang et al. 2018). A study reveals that India has the world's highest rate of heart failure deaths, at 23%, while it is just 7% for China.
There are numerous clinical research efforts and quality initiatives that rely on the EHR to identify patients with CHF. Usually, this involves identification of CHF diagnostic codes according to the International Classification of Diseases, Ninth Revision (ICD-9). While EHRs are widely available, their utility is limited by the fact that encounters coded as CHF may not reflect accepted epidemiologic criteria. To try to improve the utility of EHRs, many signatures utilizing combinations of specific codes, for example, multiple encounters or particular encounter types (e.g., hospitalizations), have been used (Huffman and Prabhakaran 2010). Moreover, previous studies have revealed that the usage of ICD-9 codes is no longer considered sufficient for cohort identification, which has motivated the utilization of secondary data sources to identify patient cohorts (Afzal et al. 2018). For instance, to justify ordering a specific radiology test or laboratory test, doctors usually allocate a diagnosis code to each patient for a disorder that the patient is suspected to have.

Fig. 26.1 Prevalence of cohort identification approaches

Yet, even when the test results reveal otherwise, i.e., the patient does not have the condition, the diagnosis code stays in the patient's health record history. If the diagnosis code is then spotted without considering its context (i.e., without understanding the nuances of the patient's case as shown in his or her health records), this becomes a major concern because it limits the ability of investigators to identify patient cohorts accurately and to fully use the statistical potential of the available populations (MIT Critical Data 2016). It has been extensively shown that clinical NLP systems perform well for information extraction from free text, i.e., unstructured data, in specific disease domains. After identification of patient groups (cohorts) for a study, further analytics has to be performed on these cohorts to interrelate the extracted data and find useful insights. For ease of analysis, this data needs to be in a standard structure. The main limitation of present semantic web approaches is the lack of support for NLP methods for extracting clinical markers from unstructured text.
This paper proposes a framework for patient cohort identification from unstructured clinical records. To demonstrate a use case, the i2b2 dataset (https://www.i2b2.org/NLP/DataSets/Main.php) of obese patients was chosen, and the cohort of patients having the congestive heart failure condition has been identified. The results were compared against the experts' annotations, which were considered as the gold standard. Additionally, manual review of clinical records was performed for validation.

26.2 Materials and Methods

This section describes the data used, the proposed framework, the detailed steps of the framework, and the tool used for the study.

26.2.1 Data Acquisition

The data for this study was extracted from the publicly available Informatics for Integrating Biology & the Bedside (i2b2) obesity challenge dataset (https://www.i2b2.org/NLP/DataSets/Main.php). i2b2 is dedicated to leveraging existing clinical information to yield insights which can directly impact healthcare improvement.
The data was randomly taken from the RPDR using a query that extracted records of patients who were either diabetic or obese. Each patient record in the dataset contains occurrences of the stem "obes" from zero to more than ten times. The extracted patient records were semi-automatically de-identified. An automatic pass, followed by two parallel manual passes, was run over each individual record, after which a third manual pass resolved all the disagreements between the two manual passes. The data was made HIPAA compliant (Murphy et al. 2011) by replacing

patient names, patient ages, patient family member names, nationalities, hospital names, phone numbers, doctor names, ID numbers, dates, locations, patients' occupations, and other potential identifiers with surrogates.
The annotation of the challenge data was done by two experts in the field of obesity (https://www.i2b2.org/NLP/Obesity/Documentation.php). Each co-morbidity is marked with Y (YES, the patient has the co-morbidity), N (NO, the patient does not have it), U (Unmentioned, the co-morbidity is not mentioned in the narrative), or Q (Questionable whether the patient has the disease or not).
For this study, the CHF co-morbidity was chosen, and patients belonging to this cohort were identified. Only two markers, Y and N, were used. The Unmentioned category is also treated as N, as such records do not show any instance of the disease. Thus, only records with a Y marker were considered to be in the cohort, i.e., the patient has or had CHF. The experts' annotations are considered as the ground truth and compared against our proposed system's output.
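As a small illustration of how the ground truth is derived from these markers, the sketch below collapses the expert labels to the binary target used in this study; the function name is illustrative and not part of the original code.

```python
def to_binary_label(marker):
    """Map an expert marker (Y/N/U/Q) to the binary ground truth used here."""
    # Only an explicit Y places the record in the CHF cohort;
    # N, U (unmentioned), and Q (questionable) are treated as 'not in cohort'.
    return 1 if marker == "Y" else 0
```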

26.2.2 Methodology

As discussed in Sect. 26.1, cohort identification for patients with a common characteristic, for example, a particular disease, similar symptoms, or similar allergies, is one of the major tasks in the medical field in order to utilize existing patient records for future research or to derive new useful insights from existing case studies. As this task is complex and time-consuming, there is a need to apply efficient and appropriate techniques to extract and use this data. NLP techniques that focus on information extraction from free text, i.e., unstructured data, in specific disease domains may be applied.
To allow identification of potential treatments, researchers need to not only identify a proper patient cohort, but also collect and combine relevant data from a wide range of repositories so that statistical significance can be obtained. Semantic web technologies are well suited to this, as they can seamlessly combine heterogeneous data from multiple sources and also provide interoperability.
The main limitation of the present semantic web approaches is the lack of support for NLP methods for extracting clinical markers from unstructured text; this is not natively supported by any of the studied approaches. The current phenotype ontology can be extended by specifying lexicons and NLP rules for NLP engines, which can then help in ontology creation. Figure 26.2 shows an overview of the proposed framework (Johar and Baliyan in press).
An Apache project tool called cTAKES was used to apply NLP techniques in order to identify patients belonging to a particular cohort. The main reason for choosing cTAKES is that it provides a dictionary creator feature which can help identify clinical terminology and map it to matching terms in the Unified Medical Language System (UMLS) Metathesaurus (https://ctakes.apache.org/).

Fig. 26.2 Overview of workflow

26.2.3 Unified Medical Language System

“The National Library of Medicine Unified Medical Language System is a large biomedical thesaurus that is organized by concept, or meaning, and it links similar names for the same concept from nearly 200 different vocabularies” (Bodenreider 2004). The UMLS incorporates the Metathesaurus, the semantic network, and the SPECIALIST Lexicon and Lexical Tools. UMLS provides terminology, coding standards, and resources for biomedical and electronic health systems, and it offers three knowledge sources: the semantic network, the Metathesaurus, and the SPECIALIST lexicon. With a UMLS license, cTAKES allows use of the Metathesaurus component of UMLS, which is an efficient way of identifying terms in unstructured clinical notes.
Every concept in the Metathesaurus has a unique permanent identifier called a concept unique identifier (CUI) and also has a preferred name (Unified Medical Language System (UMLS) 2019). A concept is essentially a meaning, and a meaning can have different names in different thesauruses or vocabularies. The semantic network provides:
1. A categorization (semantic type) of all concepts represented in the UMLS Metathesaurus;

2. A set of relationships (semantic relations) between these concepts. The semantic network accommodates 133 semantic types and 54 relationships.
UMLS is built from electronic thesauruses, code sets, classifications, and lists of controlled terms such as RxNorm and SNOMED_CT. In the USA, SNOMED_CT is used as a standard for the digital exchange of clinical health data, while RxNorm is the standard for clinical drug names and associates these names to additional available vocabularies used in drug interaction software and pharmacy. In this study, the medical entities that are recognized are the concepts represented by CUIs. Two vocabularies are used in this study, SNOMED_CT and RxNorm, with one semantic type (henceforth ST): disease or syndrome.

26.2.4 Steps Involved

NLP can handle unstructured data very well; when applied to clinical narratives, it can overcome the restrictions of billing code algorithms for identifying patient cohorts by recognizing terms that describe the signs and symptoms used to build a diagnosis. Indeed, previous studies have shown that NLP techniques outperform billing code algorithms for phenotype identification from clinical narratives in the EHR. In this work, we build a methodology which uses an NLP-based algorithm for text recognition and a semantic network, namely the Metathesaurus, for mapping predefined entities to the recognized text.
Figure 26.3 shows how individual sentences in the unstructured clinical notes are processed using the Apache project cTAKES. The first four steps use basic NLP processes, namely boundary detection, tokenization, part-of-speech tagging, and chunking, and were implemented in Python. The next two steps, entity recognition using the semantic network and mapping entities to their properties (i.e., their corresponding UMLS codes), were carried out with the help of the cTAKES tool. These steps are explained in the following sub-sections.

Fig. 26.3 Sample workflow



Fig. 26.4 Tokenization step

Fig. 26.5 Part-of-speech tagging step

26.2.4.1 Tokenization

Figure 26.4 shows how tokenization works.

26.2.4.2 Part-of-Speech Tagging

Figure 26.5 shows how part-of-speech tagging (POS-Tagging) works.

26.2.4.3 Chunking

Figure 26.6 shows how chunking works.

26.2.4.4 Named-Entity Recognition

Figure 26.7 shows how named-entity recognition (NER) works.



Fig. 26.6 Chunking step

Fig. 26.7 Named-entity recognition step

26.2.5 Tool Used

Apache cTAKES is an open-source natural language processing (NLP) tool which extracts clinical knowledge from unstructured EHR content. It operates on clinical narratives, identifying various kinds of clinical named entities: drugs, diseases/disorders, symptoms/signs, anatomical sites, and procedures. Each named entity has attributes for the text span, the ontology mapping code, context (family history of, current, unrelated to patient), and negated/not negated (Denecke 2015). This study used the latest version at the time, cTAKES 4.0.0.

26.2.5.1 cTAKES Components

cTAKES was developed at Mayo Clinic in 2006. After its full development, it grew into a fundamental component of clinical data management infrastructure (https://en.wikipedia.org/wiki/Apache_cTAKES). It consists of multiple components which produce linguistic and semantic annotations that can further be used for research purposes and in decision support systems (DSS).
Each of these components has distinctive characteristics and capabilities. This study used the default fast pipeline of cTAKES, which produces annotations of diseases/disorders, signs/symptoms, medications, anatomical sites, and procedures. Every annotation carries UMLS CUIs together with attributes for uncertainty, negation, and subject.

Fig. 26.8 Component dependencies used

Figure 26.8 shows all the components used in this study; the components in the box are used for the named-entity recognition process and are derived with the help of cTAKES.

26.3 Evaluation Approach

To assess the performance of our model, we choose precision, recall, F1 score, and accuracy as the standard metrics. A confusion matrix is used: a table that summarizes the performance of a classification model on a set of test data for which the true values are known.
True negatives and true positives are the instances which are correctly predicted, while false negatives and false positives should be minimized. Each of these is explained next.
1. True Positive (TP) is labeled when the system correctly predicts a positive value, that is, the predicted class and the actual class are both yes. For example, suppose our system predicts that a patient belongs to the cohort of patients having CHF and the actual class value also indicates that the patient has CHF; this case is labeled as a true positive.

2. True Negative (TN) is labeled when the system correctly predicts a negative value, that is, the predicted class and the actual class are both no. For example, suppose our system predicts that a patient does not belong to the cohort of patients having CHF and the actual class value also indicates that the patient does not have CHF; this case is labeled as a true negative.
3. False Positive (FP) is labeled when the predicted classification is yes but the actual classification is no. For example, suppose our system predicts that a patient belongs to the cohort of patients having CHF but the actual class value indicates that the patient does not have CHF; this case is labeled as a false positive.
4. False Negative (FN) is labeled when the predicted classification is no but the actual classification is yes. For example, suppose our system predicts that a patient does not belong to the cohort of patients having CHF but the actual class value indicates that the patient has CHF; this case is labeled as a false negative.
These four counts are used to calculate precision, recall, F1 score, and accuracy.

26.3.1 Precision

It is the ratio of correctly predicted positive instances to all predicted positive instances. This metric tells us: out of all patients marked as having the CHF condition, how many actually had the condition?

Precision = True Positive / (True Positive + False Positive)    (26.1)

26.3.2 Recall

It is commonly referred to as sensitivity. It is the ratio of correctly predicted positive instances to all instances in the actual positive class. This metric tells us: of all the patients that actually belonged to the cohort of patients with CHF, how many did we identify?

Recall = True Positive / (True Positive + False Negative)    (26.2)

26.3.3 F1-Score

It is the harmonic mean of recall and precision. Hence, this metric takes both false negatives and false positives into consideration. Intuitively, F1 is not as simple to interpret as accuracy, but it is generally more useful, particularly when there is an unequal class distribution, whereas accuracy works best when false negatives and false positives have similar costs.

F1 = 2 × (Precision × Recall) / (Precision + Recall)    (26.3)

26.3.4 Accuracy

Accuracy is the most intuitive evaluation metric. It is the proportion of correctly predicted instances to the total number of instances. It is often assumed that a model with high accuracy is a good model; however, accuracy is a meaningful metric mainly when the dataset is balanced, that is, when the costs of false negatives and false positives are similar.

Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)    (26.4)
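As a concrete illustration of Eqs. 26.1-26.4, the short Python sketch below computes all four metrics from raw TP, TN, FP, and FN counts; the function name and the example call are illustrative only and are not part of the original study code.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1, and accuracy (Eqs. 26.1-26.4)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


# Counts reported later in Sect. 26.4.3: TP = 466, TN = 631, FP = 13, FN = 20
p, r, f1, acc = classification_metrics(466, 631, 13, 20)
print(p, r, f1, acc)  # agrees with the reported 0.972, 0.958, 0.965, and 0.970 up to rounding
```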

26.4 Evaluation and Results

26.4.1 Implementation Setup

Table 26.1 lists the implementation setup used.

Table 26.1 Implementation environment
Programming language(s)         Python 3.7
Operating system                Windows 10
Library packages or API used    NLTK 3.4.4
Tool used                       cTAKES 4.0.0

26.4.2 Implementation

The implementation is shown step-wise below:

1. Tokenization
One of the most basic and foremost steps of NLP is tokenization. It is the process of segregating text into its individual constituent words, called "tokens." A token is an instance of a sequence of characters in a specific text that are grouped together as a functional semantic unit for processing. These tokens then serve as input to other processes for analytical tasks such as part-of-speech tagging (https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html). This step was performed using a Python script using NLTK, and an instance of its output is shown in Fig. 26.9.
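A minimal sketch of this step is shown below, using NLTK as listed in Table 26.1; the sample sentence is invented for illustration and is not taken from the i2b2 records.

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models; only needed once

from nltk.tokenize import word_tokenize

# Illustrative sentence only; the real input comes from the de-identified clinical notes
sentence = "The patient was admitted with worsening congestive heart failure."
tokens = word_tokenize(sentence)
print(tokens)
# ['The', 'patient', 'was', 'admitted', 'with', 'worsening', 'congestive', 'heart', 'failure', '.']
```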
2. Part-of-Speech Tagging
Humans understand the variations in natural language; that is why we can grasp the meaning of different phrases and respond to them accordingly, while for a machine, understanding the diversity of a natural language can be a difficult task. A machine can process a language by first identifying the part of speech of the text, for example, parsing a line of text and identifying which terms act as verbs, adverbs, nouns, etc. This is part-of-speech (POS) tagging.
It is a technique which scans unstructured narrative in any language and assigns an appropriate part of speech to each individual word (or token), for example, adjective, noun, adverb, or verb. It is done as a prerequisite to simplify many language processing tasks (https://nlp.stanford.edu/software/tagger.shtml).
This step was performed using a Python script using NLTK, and an instance of its output is shown in Fig. 26.10.
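A minimal NLTK sketch of this step follows; the example sentence and the exact tags in the comment are illustrative.

```python
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # default English POS tagger

from nltk import pos_tag
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The patient denies chest pain and shortness of breath.")
print(pos_tag(tokens))
# e.g. [('The', 'DT'), ('patient', 'NN'), ('denies', 'VBZ'), ('chest', 'NN'), ('pain', 'NN'), ...]
```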
3. Chunking
Chunking is the task of extracting phrases from unstructured data. The output of the POS tagger is used as the input for the chunking process, which in turn produces chunks as output. Chunking is an essential step in the extraction of knowledge about names, medical terms, etc.; this extraction in NLP is called named-entity extraction. For example, if we define a noun phrase (NP), chunking will find chunks corresponding to individual NPs. To produce NP chunks, we have to specify the chunk grammar using POS tags, which can be defined with a regular expression rule (https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb). This step was also performed using a Python script using NLTK, and an instance of its output is shown in Fig. 26.11.
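The sketch below illustrates NP chunking with an NLTK regular-expression grammar; the grammar is a simplified example, not the exact rule set used in the study.

```python
from nltk import RegexpParser, pos_tag
from nltk.tokenize import word_tokenize

# Simple illustrative NP grammar: optional determiner, any adjectives, one or more nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = RegexpParser(grammar)

tagged = pos_tag(word_tokenize("The elderly patient has severe congestive heart failure."))
tree = chunker.parse(tagged)
print(tree)  # NP subtrees mark the extracted noun-phrase chunks
```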
4. Named-Entity Recognition
In any text document, there are certain words which constitute particular entities that are informative and have a distinctive context; such entities are called named entities. Named-entity recognition detects and classifies named-entity mentions in unstructured data into predefined categories, for example, names, medical codes, locations, organizations, time expressions, percentages, and quantities. A natural approach is to detect them by observing the noun phrases that exist in the text. NER, commonly referred to as entity extraction, is used in knowledge extraction for the identification and segmentation of named entities and classifies them into several predefined classes.

For this step, the Apache tool cTAKES has been used because it provides access to the UMLS library, which is required for the recognition of medical named entities.
The clinical narratives of the dataset were processed using the collection processing engine (CPE) of cTAKES. The output text files of the above NLP steps were saved in a directory (in this case called the data directory), from where they were collectively processed using the bundled UIMA CPE, which in turn saves the output annotations in another directory (in this case called the output directory). Another option for processing documents is the UIMA CAS visual debugger (CVD), but it processes documents one at a time, and the results can be seen in the GUI itself or as XCAS files. Figure 26.12 shows a snapshot of the CPE configurator graphical user interface (GUI) with its three major divisions, namely collection reader, analysis engines, and CAS consumers. Each line in the document is considered an entity to be analyzed by the CPE. The collection reader segment asks for the descriptor used, input directory, language, encoding, and any extensions. We used apache-ctakes-4.0.0/desc/ctakes-core/desc/collection_reader/FilesInDirectoryCollectionReader.xml. For the input directory, a directory was created which contained all the selected output files (this directory was named data). For the analysis engine, AggregatePlaintextFastUMLSProcessor was used. This analysis engine runs the complete pipeline and encompasses the SimpleSegmentAnnotator analysis engine, which makes a segment annotation that wraps the entire plain-text document. It uses the UMLS resources for NER, i.e., concept identification of medical terms.
For each CAS, a local file with the document text is written to a directory specified by a parameter. This CAS consumer does not make use of any annotation information in the CAS except for the document id specified in the CommonTypeSystem.xml descriptor; the document id becomes the name of the file written for each CAS. This CAS consumer is useful for writing the results of a collection reader and/or CAS initializer to the local file system. For example, a JDBC collection reader may read XML documents from a database, and a specialized CAS initializer may convert the XML to plain text. The FilesInDirectoryCasConsumer was then used to write the plain text to local plain-text files. These annotated files were then searched for the CUIs related to CHF. For this search, a Python script was written which parsed the annotated XML files for the desired CUIs.
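A minimal sketch of the idea behind this search is given below. It assumes the CPE writes one XML file per record and that concept annotations carry a cui attribute; the exact element and attribute names depend on the cTAKES type system and output format, so they are assumptions here.

```python
import os
import xml.etree.ElementTree as ET

CHF_CUIS = {"C0018802", "C0018801", "C0018800"}  # CUIs searched for in this study


def cuis_in_file(path):
    """Collect every CUI attribute found in one annotated output file.

    The attribute name 'cui' is an assumption; adapt it to the concept
    elements actually produced by the cTAKES pipeline in use."""
    found = set()
    for elem in ET.parse(path).getroot().iter():
        cui = elem.attrib.get("cui")
        if cui:
            found.add(cui)
    return found


def records_mentioning_chf(output_dir):
    """Return the annotated files whose concepts include any CHF-related CUI."""
    return [name for name in os.listdir(output_dir)
            if name.endswith(".xml")
            and CHF_CUIS & cuis_in_file(os.path.join(output_dir, name))]
```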

26.4.3 Evaluation and Results

The dataset from i2b2 was used in this study to develop the proposed method. A total of 1130 patient records from the dataset were considered for the study. Disease mentions in the records were assigned CUIs using the UMLS vocabulary.

Fig. 26.9 Instance of tokenization step

Fig. 26.10 Instance of POS-tagging step



Fig. 26.11 Instance of chunking step

As discussed in Sect. 26.4.2, the cTAKES CPE was used to process the clinical narratives, which produced an annotated XML file for each record. These annotated files were searched for three CUIs related to CHF: C0018802 (Congestive heart failure), C0018801 (Heart failure), and C0018800 (Cardiomegaly). Cardiomegaly (C0018800) is used to find patients who are at risk of heart failure; it is generally a symptom of a condition such as heart disease or a heart valve issue and may also be an early indicator of a heart attack. Moreover, congestive heart failure is commonly referred to simply as "heart failure" (https://www.mayoclinic.org/diseases-conditions/heart-failure/symptoms-causes/syc-20373142). So, adding the latter two CUIs to our search helps improve the identification of CHF patients. First, the annotated file of a patient record is checked for the three CUIs (all three CUIs, or any combination of them, can be present in a record); if none of them is present, that particular patient does not have the condition and hence does not belong to the patient cohort having CHF. If present, then for each such instance the corresponding polarity and uncertainty annotations are checked. If polarity is -1, the sentence is negated, and that CUI is ignored and does not qualify the patient for the desired cohort. If uncertainty is 1, it is uncertain whether the patient has the condition specified by the CUI, and it is likewise ignored. If polarity and uncertainty have values other than those mentioned above, that CUI is considered, and hence the record belongs to the cohort having CHF.
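The decision rule just described can be summarized by the following sketch; the mention dictionaries are an illustrative data structure standing in for the CUI, polarity, and uncertainty attributes read from the annotation files, not the exact representation used in the study.

```python
CHF_CUIS = {"C0018802", "C0018801", "C0018800"}


def belongs_to_chf_cohort(mentions):
    """Apply the inclusion rule: at least one CHF-related CUI that is
    neither negated (polarity == -1) nor uncertain (uncertainty == 1)."""
    for m in mentions:  # each m is e.g. {"cui": "C0018802", "polarity": 1, "uncertainty": 0}
        if m["cui"] not in CHF_CUIS:
            continue
        if m["polarity"] == -1:    # negated mention, e.g. "no evidence of heart failure"
            continue
        if m["uncertainty"] == 1:  # uncertain mention is ignored as well
            continue
        return True
    return False
```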
Figure 26.13 shows the performance statistics of the CPE while processing a record file.
Fig. 26.12 GUI for collection processing engine configurator

Figures 26.14, 26.15, and 26.16 show partial snapshots of the resultant annotation file from the CPE. Part 1 shows how a text string that mentions an anatomical entity (a body part or area, corresponding to the UMLS semantic group Anatomy) is given an ID, and how its different corresponding parameters are defined, such as history of, generic, conditional, uncertainty, polarity, confidence, subject, and an OntologyConceptArr number for reference to its corresponding UMLS concept mention (shown in part 1). Just like AnatomicalSiteMention, IDs are created for further entities, namely DiseaseDisorderMention (a text string that refers to a disease/disorder event), DateAnnotation (a text string that refers to a date event), and SignSymptomMention (a text string that refers to a sign or symptom event). As discussed above, these annotated files were searched for the three CUIs in order for a record to be part of the cohort having a congestive heart failure problem.

Fig. 26.13 Performance report of CPE

Figure 26.15 shows an instance of the CUI C0018802, i.e., congestive heart failure (which is also mentioned through the preferredText parameter), present in one of the annotated records (highlighted text). This means that if this instance is not negated, then this record will belong to the desired cohort. The FSARRAY_id is used to check whether this instance was negated in the text, i.e., polarity = -1. As mentioned for Fig. 26.14, this id maps to the cTAKES OntologyConceptArr id (shown in part 2) of the DiseaseDisorderMention event annotation. The FSARRAY_id mentioned here is "59273"; if we search for this ID, we find an instance of DiseaseDisorderMention in this annotated file, as shown in Fig. 26.16. On checking this instance's parameters, it is found that the polarity is 1, which means the text is not negated, so this record belongs to the CHF cohort. We can also check the other parameters: the subject of the sentence is the patient, history is 0 which means the text is not in the context of the patient's history, uncertainty is 0 which means the instance is not uncertain, and generic is false which means the term is not used in a generic way.
These annotation files were then searched for the CHF CUI codes with the help of the Python script (shown in Appendix C). A few sample outputs of this search are shown in Fig. 26.17.

Fig. 26.14 Snapshot of annotation file (part-1)

Fig. 26.15 Snapshot of annotation file (part-2)



Fig. 26.16 Snapshot of annotation file (part-3)

Fig. 26.17 Instances of search results for annotation file

(a) Test_1 record file belongs to CHF cohort
(b) Test_603 record file does not belong to CHF cohort

Table 26.2 summarizes the counts of correct and incorrect predictions of the study through a confusion matrix. The numbers of true positives, true negatives, false positives, and false negatives for N = 1130 (where N is the total number of records) are 466, 631, 13, and 20, respectively. The accuracy, precision, recall, and F-score for our study were 0.970, 0.972, 0.958, and 0.965. Extracting and using the two additional UMLS CUIs "C0018801" and "C0018800" for searching the clinical notes increased the count of patients incorporated in the final cohort and actually improved the results. Reátegui and Ratté (2018), who used only CUI C0018802, reported an F1-score, recall, and precision of 0.89, 0.92, and 0.86, respectively, whereas the proposed system achieved 0.96, 0.95, and 0.97 for these metrics. Figure 26.18 shows the results of our model graphically.

Table 26.2 Confusion matrix

                    Predicted class
Actual class        Class = Yes    Class = No
Class = Yes         466            20
Class = No          13             631

Fig. 26.18 Result of our framework

Fig. 26.19 Comparison with related work

26.5 Discussion

Previous studies have revealed that the usage of "International Classification of Diseases, Ninth Revision (ICD-9)" codes is no longer considered sufficient for cohort identification, which has motivated the use of secondary data sources for patient cohort identification (Fox et al. 2018).
Sohn et al. (2018) and Wi et al. (2018) proposed an NLP algorithm for the extraction of descriptive patterns of asthma events and temporal information from free text. These were manually annotated, then linked together, and rules were applied. One of the limitations was the small sample size owing to the arduous manual annotation required to construct a large data set. In our method, however, the annotation process is automated with the help of cTAKES, so it works fairly well on large datasets.
Wang et al. (2018) proposed a prediction model for new patients along a learned graph structure for chronic kidney disease. They used k-nearest neighbor-based prediction and showed an effective prediction rate of 87%, whereas for our method it was 97%. Table 26.3 compares our model with related work.
This comparison can also be seen graphically in Fig. 26.19.

Table 26.3 Comparison with related work

Model                        Methodology               Cohort identified   Precision   F1-score   Recall
Our framework                NLP-based                 CHF                 0.97        0.96       0.95
Afzal et al. (2018)          NLP-based                 CLI                 0.96        0.90       0.88
Reátegui and Ratté (2018)    NLP-based                 CHF                 0.86        0.89       0.92
Kandula et al. (2011)        Bootstrapping algorithm   CHF                 0.55        0.66       0.84

CHF stands for congestive heart failure; CLI stands for critical limb ischemia

It is also seen that combining negation detection and UMLS synonyms in a clinical NLP tool can help clinical researchers improve the performance of cohort identification using data from various sources within a large clinical database. The aggregation of CUIs also improved the results: since congestive heart failure is frequently mentioned simply as "heart failure," adding the C0018801 (Heart failure) and C0018800 (Cardiomegaly) CUIs to our search helped improve the identification of CHF patients.

26.6 Conclusion

26.6.1 Limitations and Future Work

A limitation of the NLP technique implemented is that not all patients may be classified into their correct cohort. For instance, we identified a crucial term as "heart failure" instead of the whole term "congestive heart failure" and used its code for searching the annotation files. Capturing all the plausible ways in which medical practitioners abbreviate terms is a tough task and can cause a few patients to be misclassified. The absence of conventional practice in clinical narratives has been identified as an obstacle to NLP analysis of clinical text. Another limitation is that we used our NLP methodology only to find the CHF cohort; it can be scaled and utilized for more diseases or to find cohorts with other common characteristics. In future, an approach which performs automatic selection of related CUIs utilizing the association links between concepts in the Metathesaurus can also be developed.

26.6.2 Concluding Remarks

Biomedical data needs to be analyzed for new research in the field. The ability to integrate and connect data across a number of EHRs is required for better understanding and fruitful insights. Cohort identification in the medical field is significant for early detection of disease or disorder risks and for recruiting patients for clinical trials. Cohort identification generally requires scrutinizing a huge clinical database to identify a small group of patients; therefore, it is usually time consuming. The task becomes even more costly when manual chart reviews are required to confirm diagnoses or other clinical features using the natural-language-based unstructured clinical notes. Cohort identification is thus a laborious task which creates a crucial obstacle for time-critical decision making in clinical practice.
NLP consists of techniques that are extensively utilized for quick analysis of huge volumes of unstructured text with human support. Procedures like POS tagging, parsing, and NER can speed up the identification of methods, medications, and diagnoses in clinical narratives with adequate accuracy. Together with the flexibility of the Metathesaurus semantic network, NLP techniques can be used for what they are best suited to, i.e., extracting the possible structure from data. Simultaneously, this framework allows data consumption to evolve along with NLP capabilities, without the need to redesign applications that use those data. For example, if your NLP implementation presently cannot extract drug data from texts but can do so in a few months, a semantic web application can adopt this change without demanding much re-implementation.
Use of the proposed framework constitutes a sound approach to replacing the manual extraction of medical entities, with an F1-score, recall, and precision of 0.96, 0.95, and 0.97, respectively.

References

Afzal N, Mallipeddi VP, Sohn S, Liu H, Chaudhry R, Scott CG, Arruda-Olson AM (2018) Natural
language processing of clinical notes for identification of critical limb ischemia. Int J Med Inform
111:83–89
Bodenreider O (2004) The unified medical language system (UMLS): integrating biomedical ter-
minology. Nucleic Acids Res 32(Suppl_1):D267–D270
Denecke K (2015) Health web science: social media data for healthcare. Springer, Berlin
Fox F, Aggarwal VR, Whelton H, Johnson O (2018, June) A data quality framework for process
mining of electronic health record data. In: 2018 IEEE International conference on healthcare
informatics (ICHI). IEEE, pp 12–21
Grana M, Jackwoski K (2015, November) Electronic health record: a review. In: 2015 IEEE Inter-
national conference on bioinformatics and biomedicine (BIBM). IEEE, pp 1375–1382
Gupta D, Sundaram S, Khanna A, Hassanien AE, De Albuquerque VHC (2018) Improved diagnosis
of Parkinson’s disease using optimized crow search algorithm. Comput Electr Eng 68:412–424
https://ctakes.apache.org/ . Last accessed 11 Apr 2019
https://en.wikipedia.org/wiki/Apache_cTAKES . Last accessed 11 Apr 2019
https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb . Last
accessed: 10 Apr 2019
https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html . Last accessed 10 Apr
2019
https://nlp.stanford.edu/software/tagger.shtml . Last accessed: 10 Apr 2019
https://www.i2b2.org/NLP/DataSets/Main.php . Last accessed 18 Apr 2019

https://www.i2b2.org/NLP/Obesity/Documentation.php . Last accessed 18 Apr 2019


https://www.mayoclinic.org/diseases-conditions/heart-failure/symptoms-causes/syc-20373142 .
Last accessed: 18 May 2019
Huffman MD, Prabhakaran D (2010) Heart failure: epidemiology and prevention in India. Nat Med
J India 23(5):283
Johar A, Baliyan N (in press) Data science approaches to patient cohort identification: a use case in
biomedical field. In: 1st International conference on machine learning, image processing, network
security and data sciences. IETE Springer Series
Kandula S, Zeng-Treitler Q, Chen L, Salomon WL, Bray BE (2011) A bootstrapping algorithm to
improve cohort identification using structured data. J Biomed Inform 44:S63–S68
Malathi D, Logesh R, Subramaniyaswamy V, Vijayakumar V, Sangaiah AK (2019) Hybrid
reasoning-based privacy-aware disease prediction support system. Comput Electr Eng 73:114–
127
MIT Critical Data (2016) Secondary analysis of electronic health records. Springer Nature, Berlin,
p 427
Murphy SN, Gainer V, Mendis M, Churchill S, Kohane I (2011) Strategies for maintaining patient
privacy in i2b2. J Am Med Inform Assoc 18(Supplement_1):i103–i108
Pang Z, Yang G, Khedri R, Zhang YT (2018) Introduction to the special section: convergence of
automation technology, biomedical engineering, and health informatics toward the healthcare
4.0. IEEE Rev Biomed Eng 11:249–259
Reátegui R, Ratté S (2018) Comparison of MetaMap and cTAKES for entity extraction in clinical notes. BMC Med Inform Decis Making 18(3):74
Saini M, Baliyan N, Bassi V (2017, August) Prediction of heart disease severity with hybrid data
mining. In: 2017 2nd International conference on telecommunication and networks (TEL-NET).
IEEE, pp 1–6
Shickel B, Tighe PJ, Bihorac A, Rashidi P (2017) Deep EHR: a survey of recent advances in deep
learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform
22(5):1589–1604
Sohn S, Wi CI, Wu ST, Liu H, Ryu E, Krusemark E, Juhn YJ (2018) Ascertainment of asthma
prognosis using natural language processing from electronic medical records. J Allergy Clin
Immunol 141(6):2292–2294
Unified Medical Language System (UMLS): The Metathesaurus. https://www.nlm.nih.gov/
research/umls/new_users/online_learning/Meta_001.html. Last accessed 18 Apr 2019
Wang L, Zheng X, Huang LS, Xu J, Hsu FC, Chen SH, Ng MC, Bowden DW, Freedman BI, Su J
(2018) Progression of chronic kidney disease in African Americans with type 2 diabetes mellitus
using topology learning in electronic medical records. bioRxiv, 361956
Wi CI, Sohn S, Ali M, Krusemark E, Ryu E, Liu H, Juhn YJ (2018) Natural language processing for
asthma ascertainment in different practice settings. J Allergy Clin Immunol Pract 6(1):126–131
Glossary

API Application Programming Interface


APM ArduPilot Mission
BAIR Berkeley Artificial Intelligence Research
CNN Convolutional Neural Network
COCO Common Objects in Context
CPU Central Processing Unit
ESC Electronic Speed Controller
FMU Flight Management Unit
FN False Negative
FP False Positive
FPS Frame Per Second
FPV First Person View
GCS Ground Control Station
GPU Graphics Processing Unit
I/O Input/Output
IP Image Processing
mAP Mean Average Precision
MAV Micro Air Vehicle
ML Machine Learning
NN Neural Network
OS Operational System
PC Personal Computer
PX4 PixHawk 4
ROS Robot Operating System
SDK Software Development Kit
SSD Single-Shot MultiBox Detector
TCP Transmission Control Protocol
TPU Tensor Processing Unit
UDP User Datagram Protocol
VGG Visual Geometry Group


YOLO You Only Look Once


ALPR Automatic License Plate Recognition
AP Average Precision
API Application Programming Interface
APM ArduPilot Mission
AR Average Recall
ASICs Application Specific Integrated Circuit
BAIR Berkeley Artificial Intelligence Research
CAT Cognitive and Autonomous Test
CC Connected Component
CL Confidence Loss
CNN Convolutional Neural Network
COCO Common Objects in Context
CPU Central Processing Unit
CSV Comma-Separated Values
CUDA Compute Unified Device Architecture
cuDNN CUDA Deep Neural Network library
DDR Double Data Rate
ED Euclidean Distance
ESC Electronic Speed Controller
FMU Flight Management Unit
FN False Negative
FP False Positive
FPV First Person View
GCS Ground Control Station
GPU Graphics Processing Unit
GT Ground-Truth
GUI Graphic User Interface
I/O Input/Output
IoU Intersection Over Union
JSON JavaScript Object Notation
LL Location Loss
LMDB Lightning Memory Mapped Database
MAV Micro Air Vehicle
mAP Mean Average Precision
ML Machine Learning
NN Neural Network
OCR Optical Character Recognition
OVA One-Versus-All
PID Proportional, Integral, Derivative
PPM Pulse Position Modulation
PWM Pulse Width Modulation
ReLU Rectified Linear Unit
RGB Red-Green-Blue
ROS Robot Operating System
SAD Sum of Absolute Difference

SDK Software Development Kit


SDRAM Synchronous Dynamic Random-Access Memory
SITL Software in the Loop
SL Supervised Learning
SSD Single-Shot Multibox Detector
TN True Negative
TP True Positive
TPU Tensor Processing Units
UAV Unmanned Aerial Vehicle
VGG Visual Geometry Group
XML Extensible Markup Language
YOLO You Only Look Once
