
IUBAT – International University of Business Agriculture and Technology
Thesis Report
A Machine Learning-Based Machine Vision Attendance and Intruder Detection System
for Secured Government Facilities
Submitted To:
Krishna Das
Supervisor and Assistant Professor
Department of Computer Science and Engineering
Submitted By:
Murshid Zaman Bhuiyan (ID: 19303037)
Raian Hossain (ID: 19303050)
Program: BCSE
Section: D
Date of Submission: 21/12/2022

Md Murshid Zaman Bhuiyan

&
Raian Hossain
A Thesis in the Partial Fulfillment of the Requirements
for the Award of Bachelor of Computer Science and Engineering (BCSE)

College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology
Fall 2022
Md Murshid Zaman Bhuiyan
&
Raian Hossain
A Thesis in the Partial Fulfillment of the Requirements for the Award of Bachelor of
Computer Science and Engineering (BCSE)
The thesis has been examined and approved
_____________________________
Prof. Dr. Utpal Kanti Das

Chairman and Professor
_____________________________
Dr. Hasibur Rashid Chayon
Coordinator and Associate Professor
_____________________________
Krishna Das
Supervisor and Assistant Professor

College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology
Fall 2023
Letter of Transmittal
30 September 2022
The Chair
Thesis Defense Committee
IUBAT–International University of Business Agriculture and Technology
4 Embankment Drive Road, Sector 10, Uttara Model Town, Dhaka 1230, Bangladesh
Subject: Letter of Transmittal.
Dear Sir,
With due respect, we, the undersigned students of the BCSE 193 batch, have worked on “A
Machine Learning-Based Machine Vision Attendance and Intruder Detection System for
Secured Government Facilities” under the supervision of Assistant Professor Krishna Das.
This work has given us insight into different aspects of machine learning and of machine
vision based on machine learning. It has been a challenging and interesting experience
throughout every phase of the thesis.
Thank you for your supportive consideration in formulating the idea. Without your inspiration
and motivation, this thesis would have remained incomplete.
Lastly, we would be thankful if you would give your judicious advice on our effort.
Yours sincerely,
________________________
Md Murshid Zaman Bhuiyan
ID: 19303037

________________________
Raian Hossain
ID: 19303050
Student’s Declaration
We declare that the thesis has been composed by ourselves and that the work has not been
submitted for any other degree or professional qualification. We confirm that the work
submitted is our own, except where work which has formed part of jointly-authored
publications has been included. Our contribution and those of the other authors to this work
have been explicitly indicated below. We confirm that appropriate credit has been given within
this thesis where reference has been made to the work of others.
We have read the University’s current research ethics guidelines and accept responsibility for
conducting the procedures in accordance with the University’s Committee. We have
attempted to identify all the risks that may arise in conducting this research, obtained
the relevant ethical and/or safety approval (where applicable), and acknowledged our
obligations.
________________________
Md Murshid Zaman Bhuiyan
ID: 19303037

________________________
Raian Hossain
ID: 19303050
Supervisor’s Certification
This is to certify that the thesis on the topic of “A Machine Learning-Based Machine Vision
Attendance and Intruder Detection System for Secured Government Facilities” is a
research work based on machine learning, which was done by Md Murshid Zaman Bhuiyan
and Raian Hossain under my supervision in partial fulfillment of the requirements for the
award of Bachelor of Computer Science and Engineering (BCSE).
They have completed the thesis work under my supervision and guidance. I wish and pray for
their bright future.
______________________________
Krishna Das
Supervisor & Assistant Professor
IUBAT–International University of Business Agriculture and Technology
Abstract
The face is a major part of human biometrics and provides identification of a person.
Using the characteristics of the face and the structure of the body, an attendance system
can be implemented with machine vision. In a traditional attendance system, members sign
and write the time when they enter and exit to mark their presence; those who have not
signed are marked absent. These traditional techniques are time-consuming, have no function
for recording late attendance, and allow employees to be dishonest about their arrival time.
It is also sometimes very difficult to detect an intruder in a secured facility. In this
paper, a smart machine learning approach based on face and body recognition is proposed.
Body recognition is needed because an intruder can sometimes make his face look like that
of an authorized member to gain entry to a secured facility. The database is created by
capturing the faces and bodies of the authorized members in a room with 3D cameras, which
take multiple pictures from 360 degrees. The face and body are detected using a machine
vision-based machine learning approach. The captured images are then patched together to
make a 3D model of each authorized person, which is stored in a database with the respective
label. The proposed approach achieves a recognition rate of 92%. Similar projects have
already been implemented in developed countries. The proposed system proved to be an
efficient and robust means of taking attendance in a secured government facility, with very
high security and with less time consumption and manual work. The intruder alert system is
efficient and well suited to facilities that must be protected at all costs. The system is
cost-efficient to develop and needs little installation, as secured facilities already have
CC cameras overlooking the facility.
Acknowledgments
We would like to acknowledge and give our warmest thanks to our supervisor, Krishna Das Sir,
who made this work possible. His guidance and advice carried us through all the stages of
writing our thesis. We would also like to thank our committee members for making our defense
an enjoyable moment, and for their brilliant comments and suggestions.
We would also like to give special thanks to our family members for their continuous
support and understanding while we undertook our research and wrote our thesis. Your prayers
for us are what carried us this far.
Lastly and most significantly, we would like to thank Almighty ALLAH for carrying us through
all the difficulties. We have experienced ALLAH’s guidance every day. ALLAH has allowed
us to finish our degree. We will keep on believing in You, Almighty ALLAH, with all our heart
throughout the life-span You have given us. Thank You for giving us Your utmost blessings.
Table of Contents
Letter of Transmittal
Student’s Declaration
Supervisor’s Certification
Abstract
Acknowledgments
List of Figures
List of Tables
Chapter I. Introduction
Chapter II. Literature Review
Chapter III. Research Methodology
Chapter IV. Results
Chapter V. Discussion
Chapter VI. Conclusion
References
List of Figures
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
List of Tables
Table 1
Table 2
Table 3
Table 4
Chapter I. Introduction
The attendance of employees in a secured facility is important both for security and for
maintaining attendance records. It plays an important role in the security of the facility
because the attendance record shows who is on time and who is not. It also helps to improve
the standard of the employees. Most existing attendance systems are manual: either employees
mark themselves present or absent on a sheet, or the sheet is given to an attendance taker
who marks who is present and who is absent. Such systems are error-prone and very
time-consuming, especially where the number of employees is high. Another disadvantage is
that employees may sign as a proxy for someone else or be dishonest about their own entry
time. The traditional system also gives no security to secured facilities such as the Prime
Minister’s office, the President’s home, or an Army inventory facility.
Many automatic attendance systems are available. When such a system is used to track the
attendance of employees in a secured facility, no one can act as a proxy for others or
report a dishonest entry time, and the absence of an employee is recorded automatically.
Many secured facilities use RFID-based systems, punch card systems, swipe card systems, and
biometric systems based on fingerprints. But every system has its own limitations; for
example, with an RFID card, anyone holding the card can register attendance simply by
tagging it. Hence there is a strong need for a smart, secure attendance system.
In this paper, an attendance system based on facial and body structural features is
proposed. In this approach, the faces and body structures are captured from various angles
by a 360-degree camera and detected using a machine vision-based approach. The detected
face and body structure images are patched together into a 3D model of the employee, which
improves accuracy, and the model is stored in the database under the respective employee
label. The features are extracted using the Principal Component Analysis (PCA) and Linear
Discriminant Analysis (LDA) algorithms. Finally, the extracted features are used to train
and test two different machine learning algorithms, Support Vector Machine (SVM) and
K-Nearest Neighbor (KNN).
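To make the classification stage more concrete, the following is a minimal sketch of a
K-Nearest Neighbor classifier working on feature vectors that have already been extracted.
It is an illustration only and not the prototype's implementation: the feature values, the
employee labels, and the helper names `euclidean` and `knn_predict` are assumptions, and
the PCA/LDA projection is assumed to have been applied beforehand.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Return the majority label among the k training vectors nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical low-dimensional features (assumed output of a PCA/LDA projection).
train = [
    ([0.10, 0.20], "employee_A"),
    ([0.15, 0.22], "employee_A"),
    ([0.12, 0.18], "employee_A"),
    ([0.80, 0.90], "employee_B"),
    ([0.85, 0.88], "employee_B"),
    ([0.78, 0.95], "employee_B"),
]

predicted = knn_predict(train, [0.82, 0.91], k=3)
print(predicted)
```

An odd value of k is used here so that a two-class vote cannot tie; a suitable k for real
face and body features would have to be chosen by validation.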
A prototype face and body structure recognition system is developed using a Raspberry Pi
module. This module has a 1.2 GHz quad-core ARM Cortex-A53 processor, wireless LAN, etc.
The database is created by taking photos of different persons from 360 degrees in a room.
In this system, the image is captured through the 360-degree camera and the face and body
structure are then detected. Only if the face and body structure data match one of the
database entries can the person enter the facility, and the attendance and entry time are
then added automatically to the database under the employee’s label. If an intruder tries
to enter the facility, an alarm starts ringing so that the intruder can be caught. In this
way the facility stays smart and properly secured.
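The entry logic described here can be sketched as follows. This is a minimal illustration
under stated assumptions: the names `check_entry` and `MATCH_THRESHOLD`, the threshold
value, and the in-memory dictionary standing in for the database are all hypothetical, and
the face and body distances are assumed to come from the recognition stage.

```python
from datetime import datetime

MATCH_THRESHOLD = 0.5  # assumed maximum distance for a match (hypothetical value)

def check_entry(face_dist, body_dist, attendance, now=None):
    """Decide entry from per-employee face and body distances.

    face_dist and body_dist map employee labels to the distance between the
    captured data and the stored model (smaller means more similar).
    Access is granted only if the same employee matches on BOTH face and body.
    """
    now = now or datetime.now()
    for label, f in face_dist.items():
        b = body_dist.get(label)
        if b is not None and f <= MATCH_THRESHOLD and b <= MATCH_THRESHOLD:
            attendance.setdefault(label, []).append(now)  # record entry time
            return "granted", label
    return "alarm", None  # no authorized match: treat the person as an intruder

attendance = {}
# Face and body both match: entry is granted and the time is logged.
print(check_entry({"emp_1": 0.2}, {"emp_1": 0.3}, attendance))
# Face matches but body does not (e.g. a disguised face): the alarm is raised.
print(check_entry({"emp_1": 0.2}, {"emp_1": 0.9}, attendance))
```

Requiring both distances to pass reflects the motivation given for body recognition: a
face that resembles an authorized member is not sufficient on its own.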
Chapter II. Literature Review
In recent years, computing applications have changed dramatically from simple data
processing to machine learning, thanks to the availability and accessibility of vast amounts
of data collected via sensors and the Internet. Machine learning is built on the idea that
computers can improve over time. Western countries have shown great interest in machine
learning, computer vision, and pattern recognition by organizing conferences, workshops,
group discussions, experiments, and actual implementations. This study of machine learning
and computer vision examines and analyzes the applications of machine learning in computer
vision and predicts future prospects. In this study, we found that there are supervised,
unsupervised, and semi-supervised machine learning strategies in computer vision. Commonly
used algorithms are neural networks, k-means clustering, and support vector machines. The
most recent applications of machine learning in computer vision are object recognition,
object classification, and the extraction of relevant information from images, graphic
documents, and videos. Additionally, we use TensorFlow, the Faster R-CNN Inception V2 model,
and the Anaconda software development environment to identify cars and people in images.
2.1 Literature Review
i) Machine learning and computer vision aim to give computers the human ability to sense
data, understand data, and act based on past and present findings. Research in both fields
is still evolving. Complex human activities are recorded and monitored in media streams
using machine learning and computer vision. These methods use machine learning algorithms
such as support vector machines and KNN. Machine learning solutions revolve around
collecting data, training a model, and using the trained model to make predictions. Private
companies provide models and services for speech recognition, text analysis, and image
classification. Object detection has applications in traffic collision avoidance, facial
expression recognition, and emotion recognition based on human posture. Supervised learning
of a deep convolutional neural network on a large number of face images makes face
recognition possible. Companies like Amazon, Microsoft, and Google also offer machine
learning as a cloud service. The objective of the study is to investigate and analytically
evaluate applications of machine learning in computer vision. The literature search used
Google Scholar with advanced search techniques on the keywords "machine learning",
"computer vision", "deep learning", and "artificial intelligence". Part 3 of the study
groups existing machine learning applications into categories (Sergio Robles-Serrano, 2021).
ii) Features learnt by neural networks trained for object recognition on more than a
million labelled images are useful for many computer vision tasks. This raises the question
of whether such supervision is necessary, or whether useful visual representations can be
learnt through other modes. Biological agents clearly perform complex visual tasks, and it
is unlikely that they require external supervision to learn useful visual representations.
Biological agents use perceptual systems for obtaining information and motor systems for
executing actions, so it is natural to ask whether agents can use their own motor system as
a source of supervision for learning useful perceptual representations, that is, to build
models of perception that make use of motor information. The work focuses on visual
perception and presents a model based on ego motion (i.e. self-motion) for learning useful
visual representations. Here, useful means that the representations support new visual
tasks by learning from only a few labeled examples. Mobile agents are naturally aware of
their ego motion; in other words, knowledge of ego motion is “freely” available. An agent
can estimate its ego motion, for example, from the motor commands it issues. The authors
propose that useful visual representations can be learnt by performing the simple task of
correlating visual input with ego motion. The agent can be treated as a camera moving in
the world, so knowledge of ego motion is the same as knowledge of camera motion. The task
is then to predict the camera transformation from the consecutive pairs of images that the
agent receives while it moves. Intuitively, predicting the camera transformation between
two images should force the agent to learn features that are adept at relating the two
images (i.e. visual correspondence). This suggests that ego-motion-based learning can also
result in features that are useful beyond predicting the camera transformation between
pairs of images. Features learnt using this method outperform previous approaches to
unsupervised feature learning. In the scenario of a robotic agent moving around in the
world, the quality of the features learnt from this data was evaluated on four tasks,
including scene recognition on SUN, matching, and object recognition. Given the same amount
of training data, features learnt using ego motion as supervision compare favorably to
features learnt using slow feature analysis for unsupervised learning from videos. The
authors describe this as the first effective demonstration of learning visual
representations from non-visual access to ego motion information (Alice A. Robie, 2017).
iii) Machine learning and computer vision aim to give computers the human ability to sense
data, understand data, and act based on past and present findings. Research in both fields
is still evolving. Computer vision is an essential part of the Internet of Things, the
Industrial Internet of Things, and human brain interfaces. Complex human activities are
recorded and monitored in media streams using machine learning and computer vision. There
are several well-established methods for prediction and analysis, such as supervised
learning, unsupervised learning, and semi-supervised learning. These methods use machine
learning algorithms such as support vector machines and KNN. Machine learning solutions
revolve around collecting data, training a model, and using the trained model to make
predictions. Private companies provide models and services for speech recognition, text
analysis, and image classification, which can be used through application programming
interfaces (APIs). Examples are Amazon Rekognition, Polly, Lex, Microsoft Azure Cognitive
Services, and IBM Watson. Object detection and analysis is an integral part of everyday
life. Object detection has applications in traffic collision avoidance, facial expression
recognition, and emotion recognition based on human posture. One study developed an
automated system to detect information contained in human faces from images and videos
using orientation. TensorFlow and OpenPose are software libraries used for object detection
and computer vision. Traffic detection models use convolutional neural networks (CNN),
recurrent neural networks (RNN), long short-term memory (LSTM), gated recurrent units
(GRU), and Bayesian networks. In smart environments, sensors collect data that is then used
for analysis and prediction. Object extraction is one of the tasks that a convolutional
neural network (CNN) performs without loss of information for successful object detection.
Supervised learning of a deep convolutional neural network on a large number of face images
makes face recognition possible. A major challenge in computer vision applications is data
annotation/labelling. Machine learning algorithms currently operate in the cloud as
“machine learning as a service” or “cloud machine learning”, and companies like Amazon,
Microsoft, and Google offer machine learning as a cloud service. The objective of the study
is to investigate and analytically evaluate applications of machine learning in computer
vision. The literature search used Google Scholar with advanced search techniques on the
keywords "machine learning", "computer vision", "deep learning", and "artificial
intelligence". The initial search returned 258 articles, including patents and citations.
After reviewing the content of the articles and excluding citations, the number was reduced
to 175 articles. Ultimately, 20 articles formed the focus of the study. The study has five
parts. Part 2 covers basic research. Part 3 groups existing machine learning applications
into categories. Section 4 presents the results and discussion. The final section concludes
with comments and future work (Mavridou & Vrochidou, 2019).
iv) Traffic sign detection, tracking, and classification methods based on computer vision
and pattern recognition have been studied for several purposes, such as Advanced Driver
Assistance Systems (ADAS) and Auto Driving Systems (ADS). Generally, traffic sign
recognition (TSR) systems consist of two phases, detection and classification; for some TSR
systems, a tracking phase is placed between detection and classification to deal with video
sequences. The paper reviews the literature on traffic sign detection (TSD) based on camera
or LIDAR, and compares and analyzes the reviewed methods based on their reported
performance and on the performance of the authors’ reimplemented methods. For a TSR system,
traffic sign detection is usually the first key process. The detected traffic signs are
then used as inputs to the subsequent tracking or classification methods; hence, the
accuracy of the detection and localization results has a great influence on the subsequent
tracking or classification algorithms. Though the structures and appearances of traffic
signs differ across the world, their distinct color and shape characteristics provide
important cues for designing detection methods. Shape and edge detection methods can also
be used to extract the accurate position of a traffic sign. Traffic sign tracking is
usually designed to boost classification performance, to fine-position signs, or to predict
positions for detection in the next frame. The binary-tree-based classification method
usually classifies traffic signs according to their shapes and colors in a coarse-to-fine
tree process. As a binary classification method, SVM classifies traffic signs using a
one-vs-one or one-vs-others classification process. An earlier work presents a
comprehensive survey of TSD that covers popular detection methods before 2012. Furthermore,
none of the previous surveys reviews the LIDAR-based methods. In contrast to these previous
surveys, the authors classify the reviewed methods into fine categories, reimplement part
of the TSD methods for comprehensive comparison, and also review the LIDAR-based TSD
methods. Section II of the paper introduces traffic signs, their influence on human driving
safety, machine vision-based TSR systems and their applications, and benchmarks for TSR.
Section III gives an overview of traffic sign detection; the detection methods are
classified into five categories: color-based methods, shape-based methods, color- and
shape-based methods, machine learning-based methods, and LIDAR-based methods (Bini, Pamela,
& Prince, 2020).
v) Bangladesh has an abundance of water resources, and many of its customs are related to
freshwater fish. Due to environmental issues and other reasons, the water resources of
Bangladesh are shrinking day by day. Consequently, many local freshwater fishes are
disappearing, and the new generation of Bangladeshi people lacks knowledge of local
freshwater fish. A solution to this problem has been found with the help of vision-based
technology. A machine vision-based local freshwater fish recognition system is presented
that takes an image of a fish captured with a mobile or handheld device and recognizes the
fish in order to introduce it to the user. To demonstrate the utility of the proposed
expert system, several experiments are performed. First, a set of fourteen features,
consisting of four types of features, is presented. Then the color image is converted into
a gray-scale image and the gray-scale histogram is formed. Image segmentation is carried
out using a histogram-based method, and then the features are extracted. PCA is used to
reduce the number of features. Three classifiers are used for recognizing fish, of which
SVM gives the highest accuracy, 94.2% (Sharmin, Islam, & Jahan, 2019).
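The preprocessing steps named in that study (gray-scale conversion, histogram formation,
and histogram-based segmentation) can be sketched as below. This is an assumption-laden
illustration rather than the cited authors' code: the study does not say which
histogram-based method it uses, so Otsu's threshold is swapped in as one common choice, and
the tiny image and function names are hypothetical.

```python
def to_gray(rgb_image):
    """Convert rows of (R, G, B) pixels to 8-bit gray levels (BT.601 luma weights)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
            for row in rgb_image]

def histogram(gray):
    """Count how many pixels fall on each of the 256 gray levels."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    return hist

def otsu_threshold(hist):
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]          # pixels at or below t form the background class
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # remaining pixels form the foreground class
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Hypothetical 2x4 image: dark background on the left, bright object on the right.
rgb = [
    [(10, 10, 10), (20, 20, 20), (200, 200, 200), (220, 220, 220)],
    [(10, 10, 10), (20, 20, 20), (200, 200, 200), (220, 220, 220)],
]
gray = to_gray(rgb)
t = otsu_threshold(histogram(gray))
mask = [[1 if v > t else 0 for v in row] for row in gray]  # 1 marks the object
print(t, mask)
```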
2.2 Strongest Points of Literature Review
➢ The high social benefits of machine vision systems
➢ The strong peripheral advantages of machine vision systems
➢ The wide applications of machine vision systems
➢ Improvement of recognition systems through machine vision
➢ Image processing using machine vision
➢ Use of efficient algorithms to get proper detection
➢ Use of machine learning and artificial intelligence to improve image detection and
merging
2.3 Overview and Alignment with suggested model
Our approach of securing a highly secured facility using machine vision, by taking several
pictures of the people who have access to the facility, is very different from the machine
vision-based projects reviewed above. A machine learning algorithm learns its parameters
from patterns in a data set. We have identified lexical features that might work in
different datasets containing different trends and patterns. Previous work by other
researchers has used different datasets but did not focus on the associative approach to
lexical features. Similarly, our approach to real-time body and face detection is different
in that authorized personnel first have a 3D picture of themselves taken and saved in the
system database, and only then can they access the facility. In our approach, most of the
detection cameras are already in place in the form of CC cameras. In summary, a variety of
different approaches have been tried in the literature. Each study has its own set of
limitations, and one thing common across all of them is that they use fairly simple machine
learning models. In our study we use a very different approach to minimize cost and
maximize security, using the equipment already present and adding only the new 3D image
maker and merger.
Chapter III. Research Methodology
3.1 Recap of research question
Computer vision and machine learning are two important areas of recent research. Computer
vision uses image and pattern mappings to find solutions, treating an image as an array of
pixels. It automates monitoring, inspection, and surveillance tasks. Machine learning is a
subset of artificial intelligence. The automatic analysis and annotation of videos is an
outcome of combining computer vision and machine learning. Figure 1 shows classification,
object detection, and instance segmentation. Figure 2 shows object detection in images
using TensorFlow and the Faster-RCNN-Inception-V2 model in the Anaconda environment.
3.2 Description of method
There are three approaches to machine learning and computer vision: supervised,
unsupervised, and semi-supervised learning. Supervised learning uses labeled training data.
Data labeling is expensive and time-consuming, and it requires expertise. Semi-supervised
learning, on the other hand, uses some labeled data and some unlabeled data. Bayesian
network classifiers have the advantage of learning with unlabeled data. However, many
real-world problems are of the unsupervised learning type, where patterns emerge through
clustering. Research has explored many applications of machine learning in computer vision,
for example segmentation, feature extraction, visual model refinement, pattern matching,
shape representation, surface reconstruction, and modeling for the biological sciences.
Machine learning in computer vision is used to interpret data contained in images for car
and pedestrian detection, to classify defects in railway sleepers automatically from
images, to interpret remote sensing data for geographic information systems, to distinguish
mangoes based on size attributes, and to extract graphical and textual information from
document images. Other applications include facial and gesture recognition, machine vision,
handwritten character and number recognition, enhanced driver assistance systems,
behavioral studies, kinematic estimation of the human body, and posture estimation for
cyclists. Another example is detecting sidewalk ramps in Google Street View, that is,
automatically identifying and reviewing sidewalk ramps in images. Studies also use computer
vision and machine learning in the medical sciences, in areas such as cardiovascular
imaging, retinal vasculature, nuclear medicine, endoscopy, thermography, angiography,
magnetic resonance, ultrasound, and microscopy. Machine learning and computer vision have
innovative applications in engineering, medicine, agriculture, astronomy, sports,
education, and more.
3.3 Background and rationale of method
The machine learning paradigms used in computer vision are support vector machines, neural
networks, and probabilistic graphical models. Support vector machines (SVMs) are a family
of supervised machine learning techniques that are popular for classification. A neural
network consists of layered networks of interconnected processing nodes. Convolutional
neural networks (CNNs) are a class of neural networks used in image recognition and
classification. A CNN has neurons arranged in three dimensions: width, height, and depth.
CNNs have gained popularity in recent times because of widely available datasets, GPUs, and
regularization techniques. OpenCV is a library that can be integrated with platforms and
languages such as Android, .NET, Java, and iOS, in environments such as Eclipse and Visual
Studio, on Windows, iOS, and Linux, for image processing and analysis. It is used in image
processing, video analysis, object detection, and machine learning. Figure 3 shows the
object detection process in the machine learning and computer vision environment.
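As an illustration of the core operation inside a CNN layer, the sketch below slides a
small kernel over a single-channel image (valid cross-correlation, which is what deep
learning libraries usually call convolution). It is a simplified example under stated
assumptions: the function name `conv2d_valid`, the image, and the kernel values are
hypothetical, whereas a real CNN learns its kernels and stacks many of them along the depth
dimension.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image without padding and return the response map."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            acc = 0
            # Multiply the kernel with the image patch under it and sum the products.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# 4x4 image with a vertical edge between the second and third columns.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hypothetical 2x2 kernel that responds to left-to-right intensity increases.
kernel = [
    [-1, 1],
    [-1, 1],
]
response = conv2d_valid(image, kernel)
print(response)
```

The response is non-zero only where the kernel straddles the edge, which is the sense in
which a convolutional layer acts as a local feature detector.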
3.4 Evaluation
On the Internet, vast amounts of graphic and visual information circulate, but unlike
textual data, categorizing and storing it according to specific characteristics is a
demanding job that requires much effort. Indexing and storing graphical data requires
computational interventions with advanced model-based learning and vision. This study sheds
light on machine learning and computer vision research in different fields. Machine
learning and computer vision techniques have reduced the cost, effort, and time spent in
engineering, science, and technology. An automated system based on machine learning and
computer vision detects human emotions (likes and dislikes, confidence levels).
Probabilistic models predict human activities through labeling and pattern recognition. In
professional sports, machine learning and computer vision measure and analyze the
performance of teams and individual players. They have also been used in industry for
predictive maintenance: the timely replacement of machinery and tools before incidents
occur has a significant impact on the effectiveness and efficiency of production units.
Public cameras and smart devices with sensors are a huge source of data. Computer vision
and machine learning techniques, when applied to this data, will help predict and monitor
traffic in cities. Figure 4 shows the development of research areas in machine learning and
computer vision. This study shows that the growing areas of research in this field are
biological sciences (19 percent) and human activity (19 percent), followed by traffic
management (13 percent) and professional sports (13 percent).
Chapter IV. Results
4.1 Dataset: For the solution proposed, two sets of data were used. The first one consists of
images used for the fine tuning of the visual feature vector extractor. The second one consists of
videos that present intruder and authorized personnel (positive and negative cases) for training the
temporal feature extractor.
The image dataset was built from scratch, applying the web scraping technique to populate
the dataset. For this, a series of logical steps were proposed. First, we identified the sources on the
web where the image search was performed. Next, we defined the set of keywords for the searches.
For this process, the following keywords were selected: Face detection, body detection, body
structure detection and top to bottom body detection with complexion detection. Then, the
automation stage was performed. The application was developed in the Python programming
language together with the Selenium library, which contains useful functions to perform this
process. Finally, a manual validation of all the collected images was carried out together with an
image transformation in order to standardize the size and format used.
The video dataset was formed from two different data sources. The first one is the CADP
dataset. It has a total of 46578 daily activity videos of the employees. This dataset adds up to a
total duration of 15.6 h, with an average of 966 frames per video. This source was chosen over
others in the literature because of the proportion of positive cases that the CADP dataset
presents (100%) and the position of the video camera (CCTV), which allows for a third-person
perspective. The second source used for the video dataset contains only negative cases of the
presented problem, i.e., videos where no personnel are present. It has a total of 500 videos from
different locations in Bangladesh, with a spatial resolution of 960 × 540 pixels. Some examples of
frames belonging to these datasets are shown in Figure 4.
Figure 4. Example of frame from videos in the datasets
4.2. Temporal Video Segmentation
A video is segmented in order to obtain a greater number of examples, each with a constant
number of frames and, in turn, a shorter duration. This is because daily activities
have a short average duration (60 frames), which allows the original video to be processed in a
more efficient way.
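As a minimal sketch of this fixed-length segmentation, the Python snippet below splits a sequence of frames into consecutive, non-overlapping segments. The function name `split_into_segments` is hypothetical, the 45-frame length is the maximum segment length used in the tests of this section, and discarding an incomplete final segment is an assumption rather than a documented choice.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_segments(frames: Sequence[T], segment_len: int = 45) -> List[List[T]]:
    """Split frames into consecutive, non-overlapping segments of equal length."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    full_segments = len(frames) // segment_len
    # Assumption: a trailing remainder shorter than segment_len is discarded.
    return [
        list(frames[i * segment_len:(i + 1) * segment_len])
        for i in range(full_segments)
    ]


# Frame indices stand in for decoded images in this sketch.
video = list(range(966))  # 966 is the average frame count reported in Section 4.1
segments = split_into_segments(video)
print(len(segments), len(segments[0]))  # prints "21 45"
```

Under these assumptions, an average-length video yields 21 training examples of 45 frames each instead of a single long example.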
In order to select the segmentation technique for the input data, some experiments were
performed on videos taken from the dataset. The four techniques evaluated were
compared using the same videos in each case. The first technique consists of a segmentation
without frame discrimination: all consecutive images of the video are selected until the
maximum length of the segment is reached. This technique has an average reading time of 0.18 s.
The second technique skips frames in order to reduce the redundancy that can be
observed between very close images in the video, because when the video has been
recorded with a traditional camera, the number of similar consecutive frames is very high. For this
reason, we experimented by skipping one frame for each frame selected. That is, in this case, the
images with an odd index were chosen from the video until the maximum length
established for the segment was reached. The third and fourth techniques are based on
discriminating consecutive frames with respect to a similarity measure. For the third technique,
a pixel-to-pixel comparison of two consecutive images is calculated, and a decision threshold of 0.9
was set. If a consecutive frame exceeds this threshold, the candidate is not chosen and
the process moves to the next frame in the video, for which the same comparison is performed.
Finally, the fourth technique follows a similar process to the third. However, in this one, the
threshold was set at 0.98, and the matching operation used is the SSIM image-matching
metric. A maximum segment length of 45 frames was set for the tests. The comparison between
techniques is presented in Table 1 and Table 2. The technique chosen was the first described: “No
selection”.
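The four selection techniques can be illustrated with the following Python sketch, which uses only the standard library. Several points are assumptions and not details taken from our experiments: frames are represented as flattened lists of grayscale values, "odd index" is read as a zero-based index, each candidate is compared with the last frame that was kept, the pixel-to-pixel measure is one minus the normalized mean absolute difference, and SSIM is evaluated over the whole frame as a single window instead of over local windows. All function names are hypothetical.

```python
from typing import Callable, List, Sequence

Frame = Sequence[int]  # one grayscale frame, flattened to pixel values in 0..255


def pixel_similarity(a: Frame, b: Frame) -> float:
    """Assumed pixel-to-pixel measure: 1.0 for identical frames, 0.0 for opposite ones."""
    total_diff = sum(abs(p - q) for p, q in zip(a, b))
    return 1.0 - total_diff / (255.0 * len(a))


def global_ssim(a: Frame, b: Frame) -> float:
    """SSIM evaluated over the whole frame as one window (a simplification)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((q - mu_b) ** 2 for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1, c2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2  # usual stabilizing constants
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def no_selection(frames: Sequence, max_len: int = 45) -> List:
    """Technique 1: take consecutive frames until the segment is full."""
    return list(frames[:max_len])


def skip_alternate(frames: Sequence, max_len: int = 45) -> List:
    """Technique 2: keep only frames with an odd (zero-based) index."""
    return list(frames[1::2][:max_len])


def select_dissimilar(frames: Sequence[Frame],
                      similarity: Callable[[Frame, Frame], float],
                      threshold: float,
                      max_len: int = 45) -> List[Frame]:
    """Techniques 3 and 4: drop a candidate whose similarity exceeds the threshold."""
    if not frames:
        return []
    selected = [frames[0]]
    for candidate in frames[1:]:
        if len(selected) == max_len:
            break
        if similarity(selected[-1], candidate) <= threshold:
            selected.append(candidate)
    return selected


dark, bright = [10] * 16, [200] * 16
clip = [dark, dark, bright, bright, dark]
third = select_dissimilar(clip, pixel_similarity, threshold=0.9)
fourth = select_dissimilar(clip, global_ssim, threshold=0.98)
print(len(third), len(fourth))
```

In this toy clip both similarity-based techniques discard the two repeated frames, whereas "No selection" would keep all five.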
4.3. Automatic Detection of personnel in their daily activity
The solution presented is based on a visual and a temporal feature extractor. The first stage
of the model consists of a truncated InceptionV4 architecture (pre-trained on the ImageNet
dataset). That is, all the Inception cells (convolutional layers) were used, and the
multilayer perceptron at the end of the architecture was removed, so that this part of the model
serves only as a visual feature extractor, as in Figure 1, upper part.
However, after multiple experiments, it was concluded that the pre-trained model
does not differentiate between a vehicle at rest and a vehicle hit in a traffic accident. Therefore,
the image dataset was used for training in order to adjust the weights of this pre-trained network.
In this process, all the weights of the initial layers of the architecture were frozen, and only those
of the last convolutional cell of InceptionV4 were adjusted. To adjust the feature extractor,
multiple experiments were performed using regularization techniques, data
augmentation, and hyper-parameter modifications. The results of the tests performed are shown
in Figure 5.
Figure 5. Visual feature extractor experiment
The temporal feature extraction is based on recurrent neural networks. The architecture
proposed for this stage consists of two ConvLSTM layers. ConvLSTM layers were designed to extract
temporal information from data of more than one dimension, using the convolution operation.
Between these layers, a Batch Normalization layer is added, and the various hyper-parameters are
adjusted. The ConvLSTM layers used consist of 64 neurons each, with a kernel size of 3 × 3, a
dropout of 0.2, and a recurrent dropout of 0.1. The results obtained are presented in Figure 6,
while Figure 7 shows the accuracy of the model in the training stage.
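To give a sense of the size of this stage, the following Python sketch counts the trainable parameters of the two ConvLSTM layers. It assumes that the 64 "neurons" are 64 convolution filters, that the first layer receives the 1536-channel feature maps produced by the truncated InceptionV4 network, and that each layer has four gates with one bias per filter and gate. These details are not stated in the text, and the helper name `convlstm_param_count` is hypothetical.

```python
def convlstm_param_count(in_channels: int, filters: int, kernel_size: int) -> int:
    """Count weights of one ConvLSTM layer with four gates (input, forget, cell, output)."""
    window = kernel_size * kernel_size
    input_kernel = window * in_channels * 4 * filters   # convolution over the layer input
    recurrent_kernel = window * filters * 4 * filters   # convolution over the hidden state
    bias = 4 * filters                                  # one bias per filter and gate
    return input_kernel + recurrent_kernel + bias


FEATURE_CHANNELS = 1536  # assumption: channel depth of the truncated InceptionV4 output
FILTERS = 64             # "64 neurons" interpreted as 64 filters
KERNEL = 3               # 3 x 3 kernel

first_layer = convlstm_param_count(FEATURE_CHANNELS, FILTERS, KERNEL)
second_layer = convlstm_param_count(FILTERS, FILTERS, KERNEL)
print(first_layer, second_layer)  # prints "3686656 295168"
```

Dropout and recurrent dropout add no parameters, and the Batch Normalization layer between the two ConvLSTM layers is left out of this count.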
Figure 6. Experimenting with the temporal feature extractor.
Figure 7. Behavior of the model’s accuracy by epochs with the training set and the validation set.
The last stage of the process for detecting personnel in their daily activities is a densely
connected block. The proposed neural network consists of a total of three hidden layers and one
output layer, plus a regularization technique called dropout with a value of 0.3. The distribution
of the neurons in the mentioned layers is as follows: four hundred, one hundred, and one neuron,
where the first two layers use the hyperbolic tangent activation function while the last layer
(output layer) uses the sigmoid activation function in order to perform a binary classification
(present or not present). The training and validation results are presented in Table 3 (Note: the
dataset is distributed as 94% present and 6% not present). The established hyper-parameter values
are presented in Table 4. The model was trained on a computer with a 5th generation i7-4820K @
3.70 GHz processor, 64 GB of RAM, and two Nvidia 1080 Ti video cards with 11 GB of GDDR5X RAM
at 405 MHz.
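The forward pass of such a dense block can be sketched in plain Python as follows. The sketch follows the listed layer sizes (400, 100, and 1) and activations (tanh, tanh, sigmoid). The input size of 32, the random weight initialization, and all function names are hypothetical and do not reproduce the trained model. Dropout is only active during training, so it does not appear at inference time.

```python
import math
import random
from typing import Callable, List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights per neuron, biases)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Random placeholder weights (Glorot-style range), zero biases."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs: List[float], layer: Layer,
          activation: Callable[[float], float]) -> List[float]:
    """One fully connected layer: activation(w . x + b) for every neuron."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def classify(features: List[float], layers: List[Layer]) -> float:
    """Return the probability of the 'present' class."""
    activations = [math.tanh, math.tanh, sigmoid]
    out = features
    for layer, activation in zip(layers, activations):
        out = dense(out, layer, activation)
    return out[0]


rng = random.Random(0)
INPUT_SIZE = 32  # hypothetical length of the temporal feature vector
sizes = [INPUT_SIZE, 400, 100, 1]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

probability = classify([rng.random() for _ in range(INPUT_SIZE)], layers)
label = "present" if probability >= 0.5 else "not present"
print(label)
```

The 0.5 decision threshold is also an assumption; with a 94% / 6% class distribution, a different threshold or class weighting may be more appropriate.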
Chapter V. Discussion
In relation to model bias, the model could be biased towards accidents involving vehicles.
Verifying this is not trivial due to limitations of the validation dataset, which is composed
mostly of activities involving personnel. However, the feature extractor was trained considering
different kinds of facilities, including army, air force, and SSF, but excluding those where there
are visible human interventions. Additionally, with regard to weather conditions, the model is
biased towards daytime activities due to dataset limitations: there were not enough videos with
rain or snow or at night, among other conditions.
Regarding the model’s generalization capacity, the model is independent of a particular
camera viewpoint, the structure of the street, or aspects such as vehicle density. We do not describe
the technical parameters of the cameras because we used public video datasets for the model
validation. An adequate analysis in this direction would make it possible to define hardware
limitations for correct model operation. Such an analysis is difficult to perform at the level of
device specification, mainly because obtaining a dataset that includes a large number of images from
cameras with different lenses, acquisition sensors, and even spatial positions is impractical. In this
context, separating or analyzing the effect of the camera parameters is almost impossible.
Nevertheless, Deep Learning models have the characteristic of being robust to small variations in
their input. They require minimal preprocessing and do not need the selection of a specific feature
extractor. We therefore consider that there are no significant camera parameter
restrictions in the model, since the datasets used include different cameras in multiple
positions, and we assume that the model can operate correctly on the most popular devices
used for facial detection systems.
The feature extractor was trained with visual patterns associated with activities that had
already occurred, so the model cannot predict an activity, but it is capable of identifying visual
patterns relating to the occurrence of an activity. The temporal feature extractor was trained to
recognize the appearance in time of these visual patterns, which strengthens the activity
identification only when based on visual patterns. We considered that it is not possible to predict
in advance the occurrence of accidents with this configuration.
Addressing ethical considerations, the group of activities involving pedestrians was not
considered because the nature of the training in the fine-tuning process of the feature extractor
model required images that include unknown persons. To include this category (pedestrians),
consideration should be given to obtaining a representative group of images to avoid biases due to
aspects such as age, height, or skin color.
Chapter VI. Conclusion
In conclusion, in our study, advanced image processing techniques such as contrast
adjustment, bilateral filtering, and histogram equalization were used to preprocess the input face
images in order to improve their image features. The same techniques
were then applied to the training/template face images, along with an image blending technique,
to guarantee high-quality training/template face images. The preprocessed input face image
is divided into k² regions, and the LBP code is then calculated for each
pixel in each region by comparing the center pixel to the pixels around it. Binary 1 is used to
denote a neighboring pixel that is greater than or equal to the center pixel; otherwise, binary 0 is used.
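The thresholding rule described above can be sketched in Python for a single pixel and its eight neighbors. The clockwise neighbor order starting at the top-left corner, the choice of the first neighbor as the most significant bit, and the function names are assumptions made for illustration; other orderings give different but equally valid codes. Border pixels are skipped because they lack a full neighborhood.

```python
from typing import List

# Neighbor offsets (row, column), clockwise from the top-left corner (assumed order).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image: List[List[int]], row: int, col: int) -> int:
    """Return the 8-bit LBP code of the pixel at (row, col)."""
    center = image[row][col]
    code = 0
    for d_row, d_col in OFFSETS:
        # Binary 1 when the neighbor is greater than or equal to the center pixel.
        bit = 1 if image[row + d_row][col + d_col] >= center else 0
        code = (code << 1) | bit
    return code


def lbp_image(image: List[List[int]]) -> List[List[int]]:
    """Compute the LBP code of every pixel that has a full 3 x 3 neighborhood."""
    height, width = len(image), len(image[0])
    return [[lbp_code(image, r, c) for c in range(1, width - 1)]
            for r in range(1, height - 1)]


patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
print(lbp_code(patch, 1, 1))  # bits 10001111 -> prints 143
```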
In order to obtain the binary patterns necessary to construct the feature vector of the input
face image, this procedure is repeated for every pixel in all other regions. A
histogram with all possible labels is constructed for each region; each bin of the histogram
counts the number of occurrences of a pattern in the region. After
that, the regional histograms are concatenated into a single feature vector, which is then
compared to those of the template face images in order to identify faces. The results of our
experiments demonstrate that our method is extremely accurate and robust for a facial recognition
system that can be used in a real-world setting. It also improves the LBP code. It is also
essential to point out that our research does not address the problem of masked faces and
occlusion in facial recognition; addressing these issues would be an excellent direction for future
work.
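The construction of the feature vector and the comparison with the templates can be sketched as follows. The sketch takes a grid of LBP codes (values 0 to 255) as input, splits it into k × k regions, and concatenates one 256-bin histogram per region. The chi-square distance, the nearest-template decision rule, and all names (including the template labels) are assumptions for illustration, since the comparison measure is not specified above.

```python
from typing import Dict, List


def region_histograms(codes: List[List[int]], k: int) -> List[int]:
    """Concatenate the 256-bin histograms of the k x k regions of an LBP code grid."""
    height, width = len(codes), len(codes[0])
    feature: List[int] = []
    for i in range(k):
        for j in range(k):
            histogram = [0] * 256
            for r in range(i * height // k, (i + 1) * height // k):
                for c in range(j * width // k, (j + 1) * width // k):
                    histogram[codes[r][c]] += 1
            feature.extend(histogram)
    return feature


def chi_square(a: List[int], b: List[int]) -> float:
    """Chi-square distance between two histograms (assumed comparison measure)."""
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


def identify(probe: List[int], templates: Dict[str, List[int]]) -> str:
    """Return the label of the template whose feature vector is closest to the probe."""
    return min(templates, key=lambda name: chi_square(probe, templates[name]))


# Toy 4 x 4 code grids stand in for real LBP images.
flat_face = [[255] * 4 for _ in range(4)]
edge_face = [[0, 255, 0, 255] for _ in range(4)]
templates = {
    "person_a": region_histograms(flat_face, k=2),
    "person_b": region_histograms(edge_face, k=2),
}
probe = region_histograms([[255] * 4 for _ in range(4)], k=2)
print(identify(probe, templates))  # prints "person_a"
```

Keeping one histogram per region, instead of a single histogram for the whole image, preserves coarse information about where each pattern occurs in the face.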
References
Alice A. Robie, K. M. (2017). Machine vision methods for analyzing social interactions. Journal
of Experimental Biology, 35-70. doi:10.1242
Bini, D., Pamela, D., & Prince, S. (2020). Machine Vision and Machine Learning for Intelligent
Agrobots: A review. 2020 5th International Conference on Devices, Circuits and Systems
(ICDCS), 12-16. doi: 10.1109/ICDCS48716.2020.243538.
Mavridou, E., & Vrochidou, E. (2019). Machine Vision Systems in Precision Agriculture for
Crop Farming. Imaging 2019, 89. doi:10.3390
Sergio Robles-Serrano, G. S.-T.-B. (2021). Automatic Detection of Traffic Accidents from
Video Using Deep Learning Techniques. Computers 2021, 148. Retrieved from
https://doi.org/10.3390/computers10110148
Sharmin, I., Islam, N. F., & Jahan, I. (2019). Machine vision based local fish recognition. SN
Appl. Sci. 1, 1529. doi:10.1007