
IUBAT – International University of Business Agriculture and Technology

Thesis Report

A Machine Learning-Based Machine Vision


Attendance and Intruder Detection System for Secured
Government Facilities
Submitted To:

Krishna Das

Supervisor and Assistant Professor

Department of Computer Science and Engineering

Submitted By:

Murshid Zaman Bhuiyan (ID: 19303037)
Raian Hossain (ID: 19303050)
Program: BCSE
Section: D

Date of Submission: 21/12/2022


A Machine Learning-Based Machine Vision
Attendance and Intruder Detection System for Secured
Government Facilities

Md Murshid Zaman Bhuiyan


&
Raian Hossain

A Thesis in the Partial Fulfillment of the Requirements

for the Award of Bachelor of Computer Science and Engineering (BCSE)

Department of Computer Science and Engineering


College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology

Fall 2022

A Machine Learning-Based Machine Vision
Attendance and Intruder Detection System for Secured
Government Facilities
Md Murshid Zaman Bhuiyan
&
Raian Hossain

A Thesis in the Partial Fulfillment of the Requirements for the Award of Bachelor of
Computer Science and Engineering (BCSE)
The thesis has been examined and approved

_____________________________

Prof. Dr. Utpal Kanti Das


Chairman and Professor

_____________________________
Dr. Hasibur Rashid Chayon
Coordinator and Associate Professor

_____________________________
Krishna Das
Supervisor and Assistant Professor

Department of Computer Science and Engineering


College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology

Fall 2023
Letter of Transmittal

30 September 2022
The Chair
Thesis Defense Committee
Department of Computer Science and Engineering
IUBAT–International University of Business Agriculture and Technology
4 Embankment Drive Road, Sector 10, Uttara Model Town, Dhaka 1230, Bangladesh

Subject: Letter of Transmittal.

Dear Sir,

With due respect, we, the undersigned students of the BCSE 193 batch, have worked on “A
Machine Learning-Based Machine Vision Attendance and Intruder Detection System for
Secured Government Facilities” under the supervision of Assistant Professor Krishna Das.

This report has enabled us to gain insight into different aspects of machine learning as well
as machine vision based on machine learning. It has been a very challenging and interesting
experience throughout the whole phase of this thesis work.

Thank you for your supportive consideration in formulating the idea. Without your inspiration
and motivation, this thesis would have remained incomplete.

Lastly, we would be thankful if you would kindly give your judicious advice on this effort.

Yours sincerely,

________________________ ________________________

Md Murshid Zaman Bhuiyan Raian Hossain


ID:19303037 ID:19303050

Student’s Declaration

We declare that this thesis has been composed by ourselves and that the work has not been
submitted for any other degree or professional qualification. We confirm that the work
submitted is our own, except where work which has formed part of jointly-authored
publications has been included. Our contribution and those of the other authors to this work
have been explicitly indicated below. We confirm that appropriate credit has been given within
this thesis where reference has been made to the work of others.

We have read the University’s current research ethics guidelines and accept responsibility for
the conduct of the procedures in accordance with the University’s Committee. We have
attempted to identify all the risks that may arise in conducting this research, obtained the
relevant ethical and/or safety approval (where applicable), and acknowledged our obligations.

________________________ ________________________

Md Murshid Zaman Bhuiyan Raian Hossain


ID:19303037 ID:19303050

Supervisor’s Certification

This is to certify that the thesis on the topic of “A Machine Learning-Based Machine Vision
Attendance and Intruder Detection System for Secured Government Facilities” is a
research work based on Machine Learning, which was done by Md Murshid Zaman Bhuiyan
& Raian Hossain under my supervision as the Partial Fulfillment of the Requirements for the
Award of Bachelor of Computer Science and Engineering (BCSE).

They have completed the thesis work under my supervision and guidance. I wish and pray for
their bright future.

______________________________

Krishna Das
Supervisor & Assistant Professor
Department of Computer Science and Engineering
IUBAT–International University of Business Agriculture and Technology

Abstract
The face is a major part of human biometrics and provides identification of a person. Using
the characteristics of the face and the structure of the body, an attendance system can be
implemented with machine vision. In a traditional attendance system, members sign and write
down the time when they enter and exit to mark their presence; those who do not sign are
marked absent. These traditional techniques are time-consuming, have no provision for
recording late attendance, and allow employees to be dishonest about their arrival time. It is
also often difficult to detect an intruder in a secured facility. In this paper, a smart machine
learning approach based on face and body recognition is proposed. Body recognition is needed
because an intruder can sometimes make his face look like that of an authorized member to
gain entry to a secured facility. The database is created by capturing the faces and bodies of
the authorized members in a room with 3D cameras, which take multiple pictures covering 360
degrees. The face and body are detected using a machine vision-based machine learning
approach. The captured images are then patched together into a 3D model of each authorized
person and stored in a database with the respective labels. The proposed approach achieves a
recognition rate of 92%, and similar projects have already been implemented in developed
countries. The proposed system proved to be an efficient and robust means of taking
attendance in a secured government facility, with very high security and with less time
consumption and manual work. The intruder alert system is efficient and well suited to
facilities that must be protected at all costs. The system is cost-efficient to develop and needs
little installation, as secured facilities already have CC cameras overlooking the facility.

Acknowledgments

We would like to acknowledge and give our warmest thanks to our supervisor, Krishna Das Sir,
who made this work possible. His guidance and advice carried us through all the stages of
writing our thesis. We would also like to thank our committee members for letting our defense
be an enjoyable moment, and for their brilliant comments and suggestions.

We would also like to give special thanks to our family members for their continuous support
and understanding while we undertook our research and wrote our thesis. Your prayers for us
were what carried us this far.

Lastly and most significantly, we would like to thank Almighty ALLAH for seeing us through
all the difficulties. We have experienced ALLAH’s guidance every day. ALLAH has allowed
us to finish our degree. We will keep on believing in You, Almighty ALLAH, with all our heart
throughout the life-span You have given us. Thank You for giving us Your utmost blessings.

Table of Contents

Letter of Transmittal
Student’s Declaration
Supervisor’s Certification
Abstract
Acknowledgments
List of Figures
List of Tables
Chapter I. Introduction
Chapter II. Literature Review
Chapter III. Research Methodology
Chapter IV. Results
Chapter V. Discussion
Chapter VI. Conclusion
References

List of Figures

Figure 1 11

Figure 2 12

Figure 3 14

Figure 4 17

Figure 5 20

Figure 6 21

Figure 7 21

List of Tables

Table 1 19

Table 2 19

Table 3 22

Table 4 23

Chapter I. Introduction
Employee attendance in a secured facility is important both for security and for keeping an
attendance record. It plays an important role in the security of the facility because the
attendance record shows who is on time and who is not, and it also helps to improve the
standard of the employees. Most existing attendance systems are manual: the employees mark
themselves present or absent on a sheet, or the sheet is given to an attendance taker who marks
who is present and who is absent. Such a system can fail and is very time-consuming. Another
disadvantage is that employees may sign by proxy and be dishonest about their own entry time.
This traditional system also gives no security to secured facilities such as the Prime Minister’s
office, the President’s home, or an Army inventory facility. Manual systems are especially
time-consuming where the number of employees is very high, and security becomes an issue in
the traditional system of attendance.

Many automatic attendance systems are available. If such a system is used to track the
attendance of employees in a secured facility, no one can sign in by proxy for others or give a
dishonest entry time, and the absence of an employee is recorded automatically. Many secured
facilities use RFID-based systems, punch card systems, swipe card systems, and biometric
systems based on fingerprints. But every system has its own limitations; with an RFID card, for
example, anyone holding the card can record attendance simply by tagging it. Hence there is a
strong requirement for a smart, secure attendance system.

In this paper, an attendance system based on facial and body structural features is proposed.
In this approach, faces and body structures are detected using a machine vision-based
approach. The faces and body structures are captured from various angles by a 360-degree
camera, and the detected images are patched together into a 3D model of the employee to
improve accuracy; the model is stored in the database with the respective employee label. The
features are extracted using the Principal Component Analysis (PCA) and Linear Discriminant
Analysis (LDA) algorithms. Finally, the extracted features are used to train and test two
different machine learning algorithms: Support Vector Machine (SVM) and K-Nearest Neighbor
(KNN).
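As a rough illustration of the final classification step, the sketch below applies a K-Nearest Neighbor vote to feature vectors that are assumed to have already been reduced (for example by PCA or LDA). The feature values, the labels `employee_A` and `employee_B`, and the function name `knn_predict` are hypothetical and are not taken from the system described here.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training vectors nearest to query."""
    # train is a list of (feature_vector, label) pairs; distance is Euclidean.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical 3-dimensional feature vectors, assumed to come from PCA/LDA.
train = [
    ((0.10, 0.20, 0.10), "employee_A"),
    ((0.15, 0.22, 0.12), "employee_A"),
    ((0.12, 0.18, 0.09), "employee_A"),
    ((0.90, 0.80, 0.95), "employee_B"),
    ((0.88, 0.79, 0.91), "employee_B"),
    ((0.93, 0.85, 0.90), "employee_B"),
]

predicted = knn_predict(train, (0.11, 0.21, 0.10), k=3)
print(predicted)
```

An odd value of k with two classes avoids tied votes; with more enrolled employees a tie-breaking rule would have to be chosen.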

A prototype face and body structure recognition system is developed using a Raspberry Pi
module, which has a 1.2 GHz quad-core ARM Cortex-A53 processor, wireless LAN, and other
features. The database is created by taking photos of different persons from 360 degrees in a
room. In this system, the image is captured through the 360-degree camera and the face and
body structure are then detected. Only if the face and body structure data match an entry in the
database is the person authorized by the system to enter the facility, and the attendance and
entry time are automatically added to the database under the employee’s label. If an intruder
tries to enter the facility, an alarm starts ringing so that the intruder can be caught. In this way
the facility stays smart and properly secured.
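The entry decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's actual code: the enrolled templates in `DATABASE`, the distance `THRESHOLD`, the `SHIFT_START` time used to flag late arrival, and the function names are all hypothetical.

```python
import math
from datetime import datetime, time

# Hypothetical enrolled templates: label -> (face_features, body_features).
DATABASE = {
    "employee_A": ((0.10, 0.20), (0.50, 0.60)),
    "employee_B": ((0.90, 0.80), (0.30, 0.40)),
}
THRESHOLD = 0.15           # assumed maximum feature distance for a match
SHIFT_START = time(9, 0)   # assumed start of the working day

def identify(face, body):
    """Return the label whose face AND body templates both match, else None."""
    for label, (ref_face, ref_body) in DATABASE.items():
        face_ok = math.dist(face, ref_face) <= THRESHOLD
        body_ok = math.dist(body, ref_body) <= THRESHOLD
        if face_ok and body_ok:
            return label
    return None

def process_entry(face, body, now, attendance_log):
    """Grant entry and log attendance, or signal the alarm for an unknown person."""
    label = identify(face, body)
    if label is None:
        return "ALARM"
    status = "late" if now.time() > SHIFT_START else "on time"
    attendance_log.append((label, now, status))
    return "ENTRY GRANTED"

log = []
# Face and body both match employee_A.
first = process_entry((0.11, 0.21), (0.50, 0.61), datetime(2022, 12, 21, 8, 55), log)
# Face resembles employee_A but the body matches a different template.
second = process_entry((0.11, 0.21), (0.30, 0.40), datetime(2022, 12, 21, 9, 5), log)
print(first, second, log)
```

Requiring both the face and the body to match the same enrolled person reflects the motivation given in the abstract: a disguised face alone should not be enough to gain entry.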

Chapter II. Literature Review
In recent years, computing applications have changed dramatically from simple data processing
to machine learning, thanks to the availability and accessibility of vast amounts of data
collected via sensors and the Internet. The idea behind machine learning is that computers can
improve over time. Western countries have shown great interest in machine learning, computer
vision, and pattern recognition by organizing conferences, workshops, group discussions,
experiments, and actual implementations. This study examines and analyzes the applications of
machine learning in computer vision and predicts future prospects. In this study, we found that
supervised, unsupervised, and semi-supervised machine learning strategies are all used in
computer vision. Commonly used algorithms are neural networks, k-means clustering, and
support vector machines. The most recent applications of machine learning in computer vision
are object recognition, object classification, and the extraction of relevant information from
images, graphic documents, and videos. Additionally, we use TensorFlow, the Faster R-CNN
Inception V2 model, and the Anaconda software development environment to identify cars and
people in images.

2.1 Literature Review

i) Machine learning and computer vision aim to give computers the human ability to sense
data, understand data, and act based on past and present findings. Research in both fields is
still evolving. Complex human activities are recorded and monitored in media streams using
machine learning and computer vision. The methods involved use machine learning algorithms
such as support vector machines and KNN. Machine learning solutions revolve around
collecting data, training a model, and using the trained model to make predictions. Private
companies provide models and services for speech recognition, text analysis, and image
classification. Object detection has applications in traffic collision avoidance, facial expression
recognition, and emotion recognition based on human posture. A deep convolutional neural
network trained by supervised learning on a large number of face images can recognize faces.
Companies such as Amazon, Microsoft, and Google also offer machine learning as a cloud
service. The objective of the study is to investigate and analytically evaluate applications of
machine learning in computer vision. The literature was searched in Google Scholar using
advanced search techniques with the keywords "machine learning", "computer vision", "deep
learning", and "artificial intelligence". Part 3 of the study groups existing machine learning
applications into categories (Sergio Robles-Serrano, 2021).

ii) Features learnt by neural networks trained for object recognition on more than a million
labelled images are useful for many computer vision tasks. The authors ask whether such
supervision is necessary for the task of object recognition, or whether useful visual
representations can be learnt through other modes. Biological agents clearly perform complex
visual tasks, and it is unlikely that they require external supervision to do so. Biological agents
use perceptual systems to obtain information, and robotic agents employ their motor systems to
execute actions, which raises the question of whether these agents can use their own motor
system as a source of supervision for learning useful perceptual representations. The work
focuses on visual perception and presents a model based on ego motion (i.e. self-motion) for
learning useful visual representations, where useful means that visual tasks can be performed
by learning from only a few labeled examples. Mobile agents are naturally aware of their ego
motion; in other words, knowledge of ego motion is “freely” available, and an agent can
estimate it from its motor commands. The agent is modelled as a camera moving in the world,
so knowledge of ego motion is the same as knowledge of camera motion. The authors propose
that useful visual representations can be learnt by the simple task of predicting the camera
transformation from consecutive pairs of images that the agent receives while it moves.
Intuitively, predicting the camera transformation between two images should force the agent to
learn features that are adept at visual correspondence between the images. The quality of the
learnt features was evaluated on four tasks, including scene recognition on SUN, matching, and
object recognition. Given the same amount of training data, features learnt using ego motion as
supervision compare favorably to features learnt using slow feature analysis for unsupervised
learning from videos, and they are reported to outperform previous approaches to unsupervised
feature learning. The authors describe this as the first effective demonstration of learning
visual representations from non-visual access to ego motion information (Alice A. Robie,
2017).

iii) Machine learning and computer vision aim to give computers the human ability to sense
data, understand data, and act based on past and present findings. Research in both fields is
still evolving. Computer vision is an essential part of the Internet of Things, the Industrial
Internet of Things, and human brain interfaces. Complex human activities are recorded and
monitored in media streams using machine learning and computer vision. There are several
well-established methods for prediction and analysis, such as supervised learning,
unsupervised learning, and semi-supervised learning. These methods use machine learning
algorithms such as support vector machines and KNN. Machine learning solutions revolve
around collecting data, training a model, and using the trained model to make predictions.
Private companies provide models and services for speech recognition, text analysis, and image
classification, which can be used through application programming interfaces (APIs); examples
are Amazon Rekognition, Polly, and Lex, Microsoft Azure Cognitive Services, and IBM Watson.
Object detection and analysis is an integral part of everyday life. Object detection has
applications in traffic collision avoidance, facial expression recognition, and emotion
recognition based on human posture. One cited work developed an automated system to detect
information contained in human faces from images and videos using orientation. TensorFlow
and OpenPose are software libraries used for object detection and computer vision. Traffic
detection models use a convolutional neural network (CNN), a recurrent neural network (RNN),
long short-term memory (LSTM), a gated recurrent unit (GRU), and a Bayesian network. In
smart environments, sensors collect data that is then used for analysis and prediction. Object
extraction is one of the tasks that a convolutional neural network (CNN) performs without loss
of information for successful object detection. A deep convolutional neural network trained by
supervised learning on a large number of face images can recognize faces. A key challenge in
computer vision applications is data annotation and labelling. Machine learning algorithms
currently operate in the cloud as “machine learning as a service” or “cloud machine learning”,
and companies such as Amazon, Microsoft, and Google offer machine learning as a cloud
service. The objective of the study is to investigate and analytically evaluate applications of
machine learning in computer vision. The literature was searched in Google Scholar using
advanced search techniques with the keywords "machine learning", "computer vision", "deep
learning", and "artificial intelligence". The initial search returned 258 articles, including
patents and citations. After reviewing the content of the articles and excluding citations, the
number was reduced to 175 articles. Ultimately, 20 articles formed the focus of the study. The
study has five parts: Part 2 covers basic research, Part 3 groups existing machine learning
applications into categories, Part 4 presents the results and discussion, and the final part
concludes with comments and future work (Mavridou & Vrochidou, 2019).

iv) Traffic sign detection, tracking, and classification methods based on computer vision and
pattern recognition have been studied for several purposes, such as Advanced Driver Assistance
Systems (ADAS) and Auto Driving Systems (ADS). Generally, traffic sign recognition (TSR)
systems consist of two phases, detection and classification; for some TSR systems, a tracking
phase is placed between detection and classification to deal with video sequences. The authors
review the literature on traffic sign detection (TSD) based on camera or LIDAR, and compare
and analyze the reviewed methods using both the reported performance and the performance of
their own reimplementations. In a TSR system, TSD is usually the first key process. The
detected traffic signs are then used as inputs to the subsequent tracking or classification
methods; hence, the accuracy of detection and localization has a great influence on the
subsequent tracking or classification algorithms. Although the structures and appearances of
traffic signs differ across the world, their distinct color and shape characteristics provide
important cues for designing detection methods. Shape and edge detection methods can also be
used to extract the accurate position of a traffic sign. Traffic sign tracking is usually designed
to boost classification performance, to fine-position signs, or to predict positions for detection
in the next frame. Binary-tree-based classification methods usually classify traffic signs
according to their shapes and colors in a coarse-to-fine tree process. As a binary classification
method, SVM classifies traffic signs using a one-vs-one or one-vs-others classification process.
An earlier survey presents a comprehensive review of TSD covering popular detection methods
before 2012, and none of the previous surveys reviews LIDAR-based methods. In contrast to
these previous surveys, the authors classify the reviewed methods into fine categories,
reimplement part of the TSD methods for comprehensive comparison, and also review the
LIDAR-based TSD methods. Section II of the survey introduces traffic signs, their influence on
human driving safety, machine vision-based TSR systems and their applications, and
benchmarks for TSR. Section III gives an overview of traffic sign detection, in which the
methods are classified into five categories: color-based methods, shape-based methods, color-
and shape-based methods, machine learning-based methods, and LIDAR-based methods (Bini,
Pamela, & Prince, 2020).

v) Bangladesh has abundant water resources, and many of its customs are related to freshwater
fish. Owing to environmental issues and other reasons, the water resources of Bangladesh are
shrinking day by day. Consequently, many local freshwater fish species are disappearing, and
the new generation in Bangladesh lacks knowledge of local freshwater fish. As a solution, a
machine vision-based local freshwater fish recognition system is presented that takes an image
of a fish captured with a mobile or handheld device and recognizes the fish in order to introduce
it to the user. To demonstrate the utility of the proposed expert system, several experiments are
performed. First, a set of fourteen features, consisting of four types of features, is presented.
The color image is then converted into a gray-scale image and the gray-scale histogram is
formed. Image segmentation is performed using a histogram-based method, and the features
are then extracted. PCA is used to reduce the number of features. Three classifiers are used for
recognizing fish, of which SVM gives the highest accuracy at 94.2% (Sharmin, Islam, & Jahan,
2019).
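The gray-scale conversion, histogram, and histogram-based segmentation steps mentioned in this review item can be sketched in a few lines. The example below is illustrative only: the tiny image is made up, the luma weights are the common ITU-R BT.601 values, and thresholding at the mean gray level is a simple stand-in chosen here, not the segmentation method of the cited paper.

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel to an 8-bit gray level (BT.601 luma weights)."""
    r, g, b = pixel
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))

def gray_histogram(gray_image):
    """Count how many pixels take each of the 256 gray levels."""
    hist = [0] * 256
    for row in gray_image:
        for value in row:
            hist[value] += 1
    return hist

def mean_threshold(hist):
    """Use the mean gray level, computed from the histogram, as the threshold."""
    total = sum(hist)
    return sum(level * count for level, count in enumerate(hist)) / total

# A tiny hypothetical 2x3 RGB image: dark background, bright object.
image = [
    [(10, 10, 10), (12, 11, 10), (250, 240, 245)],
    [(9, 10, 12), (245, 250, 240), (255, 255, 255)],
]
gray = [[to_gray(p) for p in row] for row in image]
hist = gray_histogram(gray)
threshold = mean_threshold(hist)
# 1 marks pixels brighter than the threshold (the object), 0 the background.
mask = [[1 if value > threshold else 0 for value in row] for row in gray]
print(mask)
```

Features such as area or shape descriptors would then be computed from the segmented region before dimensionality reduction and classification.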

2.2 Strongest Points of Literature Review

➢ The high social benefits of machine vision system

➢ The strong peripheral advantages of machine vision system

➢ The wide applications of machine vision system


➢ Improve recognition system through machine vision

➢ Image processing using machine vision

➢ Use of efficient algorithm to get proper detection

➢ Use of machine learning and artificial intelligence to improve image detection and merging

2.3 Overview and Alignment with suggested model

Our approach of securing a highly secured facility using machine vision, by taking several
pictures of the people who have access to the facility, is very different from the machine
vision-based projects reviewed above. A machine learning algorithm learns its parameters by
learning patterns in a data set. We have identified lexical features that might work in different
datasets containing different trends and patterns. Previous work by other researchers has used
different datasets but did not focus on the associative approach to lexical features. Similarly,
our approach to real-time body and face detection is different, as it requires authorized
personnel first to have a 3D picture of themselves taken and saved in the system database; only
then can they have access to the facility. In our approach, most of the detection cameras are
already in place in the form of CC cameras. In summary, a variety of different approaches have
been tried in the literature. Each study has its own set of limitations, and one thing common to
all of them is that they use fairly simple machine learning models. In our study we use a very
different approach that minimizes the cost and maximizes the security by using the equipment
already present and adding only the new 3D image maker and merger.

Chapter III. Research Methodology

3.1 Recap of research question

Computer vision and machine learning are two important areas of recent research. Computer
vision uses image and pattern mappings to find solutions and treats an image as an array of
pixels. It automates monitoring, inspection, and surveillance tasks. Machine learning is a
subset of artificial intelligence. The automatic analysis and annotation of videos is an outcome
of combining computer vision and machine learning. Figure 1 shows classification, object
detection, and instance segmentation. Figure 2 shows object detection in images using
TensorFlow and the Faster-RCNN-Inception-V2 model in the Anaconda environment.

3.2 Description of method

There are three approaches to machine learning in computer vision: supervised, unsupervised,
and semi-supervised learning. Supervised learning uses labeled training data. Data labeling is
expensive and time-consuming and requires expertise. Semi-supervised learning, on the other
hand, uses data of which some is labeled and some is not. Bayesian network classifiers have
the advantage of being able to learn with unlabeled data. Many real-world problems, however,
are of the unsupervised learning type, where patterns emerge through clustering. Research has
explored many applications of machine learning in computer vision, for example segmentation,
feature extraction, visual model refinement, pattern matching, shape representation, surface
reconstruction, and modeling for the biological sciences. Machine learning in computer vision
is used to detect cars and pedestrians in images, to classify defects in railway sleepers
automatically from images, to interpret remote sensing data for geographic information
systems, to distinguish mangoes based on size attributes, and to extract graphical and textual
information from document images. Other applications include facial and gesture recognition,
machine vision, handwritten character and number recognition, advanced driver assistance
systems, behavioral studies, and kinematic and posture estimation of the human body, for
example for a cyclist. A further application is detecting sidewalk ramps in Google Street View,
that is, automatically identifying and reviewing sidewalk ramps in images. Studies also use
computer vision and machine learning in the medical sciences, in areas such as cardiovascular
imaging, retinal vasculature, nuclear medicine, endoscopy, thermography, angiography,
magnetic resonance, ultrasound, and microscopy. Machine learning and computer vision have
innovative applications in engineering, medicine, agriculture, astronomy, sports, education,
and more.

3.3 Background and rationale of method

The machine learning paradigms for computer vision are support vector machines, neural
networks, and probabilistic graphical models. Support vector machines (SVMs) are a subdomain
of supervised machine learning techniques and are popular in classification. A neural network
consists of layered networks of interconnected processing nodes. Convolutional neural
networks (CNNs) are a class of neural networks used in image recognition and classification;
their neurons are arranged in three dimensions: width, height, and depth. CNNs have gained
popularity in recent times because of largely accessible datasets, GPUs, and regularization
techniques. OpenCV is a library that can be integrated with platforms and languages such as
Android, .NET, Java, and iOS, in environments such as Eclipse and Visual Studio, on Windows,
iOS, and Linux for image processing and analysis. It is used in image processing, video
analysis, object detection, and machine learning. Figure 3 shows the object detection process
in the machine learning and computer vision environment.
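To make the role of convolution in a CNN more concrete, the following sketch slides a small kernel over a single-channel image and produces a feature map. It is a minimal illustration only: the image, the hand-picked edge kernel, and the function name `conv2d` are assumptions for this example, whereas a real CNN learns its kernels during training and stacks many such layers.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-picked 2x2 kernel that responds to left-to-right intensity increases.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
print(feature_map)
```

The feature map is largest where the kernel overlaps the edge and zero in flat regions, which is the kind of local pattern response that later layers combine into face or body features.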

3.4 Evaluation

In the world of the Internet, tons of graphic and visual information move around, but unlike

textual data, the ability to categorize and store it according to specific characteristics is a

demanding job. much effort'. Indexing and storing graphical data requires computational

interventions with advanced model-based learning and vision. This study sheds light on machine

learning and computer vision research in different fields. Machine learning and computer vision

techniques have reduced the cost, effort, and time spent in engineering, science, and technology.

An automated system based on machine learning and computer vision detects human emotions

(likes and dislikes, confidence levels). Probabilistic models predict human activities through

labeling and pattern recognition. In professional sports, machine learning and computer vision
are used to measure and analyze the performance of teams and individual players. Furthermore,
they have been used in industry for predictive maintenance. The timely replacement of machinery
and tools before incidents occur has a significant impact on the effectiveness and efficiency
of production units. Public cameras and smart devices with sensors are a huge source of data.
Computer vision and machine learning techniques, when applied to this data, can help predict
and monitor traffic in cities. Figure 4 shows the development of research areas in machine
learning and computer vision. This study shows that the growing areas of research are
biological sciences (19 percent) and human activity (19 percent), followed by traffic
management (13 percent) and professional sports (13 percent).

Chapter IV. Results

4.1. Dataset

For the proposed solution, two sets of data were used. The first consists of images used for
fine-tuning the visual feature vector extractor. The second consists of videos that show
intruders and authorized personnel (positive and negative cases) for training the temporal
feature extractor.

The image dataset was built from scratch, applying the web scraping technique to populate

the dataset. For this, a series of logical steps were proposed. First, we identified the sources on the

web where the image search was performed. Next, we defined the set of keywords for the searches.

For this process, the following keywords were selected: Face detection, body detection, body

structure detection and top to bottom body detection with complexion detection. Then, the

automation stage was performed. The application was developed in the Python programming

language together with the Selenium library, which contains useful functions to perform this

process. Finally, a manual validation of all the collected images was carried out together with an

image transformation in order to standardize the size and format used.

The videos dataset was formed from two different data sources. The first one is the CADP

dataset. It has a total of 46578 daily activity videos of the employees. This dataset adds up to a

total duration of 15.6 h, with an average number of frames of 966. This source was chosen instead

of others in the literature, such as, due to the number of positive cases that the CADP dataset

presents (100%) and the position of the video camera (CCTV), which allows for a third-person

perspective. The second source used for the video dataset only contains negative cases of the

presented problem, i.e., videos where no personnel are present. It has a total of 500 videos from

different locations in Bangladesh, with a spatial resolution of 960 × 540 pixels. Some examples of

frames belonging to these datasets are shown in Figure 4.

Figure 4. Examples of frames from videos in the datasets.

4.2. Temporal Video Segmentation

A video is segmented in order to obtain a greater number of examples, each with a constant
number of frames and, in turn, a shorter duration. This is because daily activities have a
short average duration (60 frames), which allows the original video to be processed in a more
efficient way.
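
As a minimal sketch of this segmentation step, the snippet below splits a sequence of frames into consecutive segments with a constant number of frames. The function name `split_into_segments`, the choice to drop a trailing incomplete segment, and the use of integers as stand-ins for frames are assumptions made for illustration; the 45-frame length is the maximum segment length reported for the tests in this section.

```python
def split_into_segments(frames, segment_length=45, keep_remainder=False):
    """Split a list of frames into consecutive, non-overlapping segments.

    Hypothetical helper: by default a trailing segment shorter than
    `segment_length` is dropped so that every example has the same size.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    segments = [
        frames[start:start + segment_length]
        for start in range(0, len(frames), segment_length)
    ]
    if segments and not keep_remainder and len(segments[-1]) < segment_length:
        segments.pop()
    return segments


# Stand-in video: integers play the role of frames.
video = list(range(966))
segments = split_into_segments(video)
print(len(segments), len(segments[0]))
```

With a 966-frame video (the average length reported for the CADP source), this yields 21 full segments of 45 frames, and the final 21 frames are discarded unless `keep_remainder` is set.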

In order to select the segmentation technique for the input data, some experiments were

performed on the videos taken from the dataset. The four techniques to be evaluated were

compared using the same videos in each case. The first technique consists of a segmentation

without frame discrimination. Therefore, all consecutive images of the video are selected until the

maximum time of the segment is reached. This technique has an average reading time of 0.18 s.

The second technique used seeks to skip frames in order to reduce the redundancy that can be

observed when using very close images in the video. This is because when the video has been

recorded with a traditional camera, the number of similar consecutive frames is very high. For this

reason, we experimented by skipping one frame for each frame selected. That is, in this case,
the images with an odd index were chosen from the video until the maximum segment length was
reached. The third and fourth techniques are based on discarding consecutive frames according
to a similarity measure. For the third technique, a pixel-to-pixel comparison of two
consecutive images is calculated, and a decision threshold of 0.9 was set. If the similarity of
a consecutive frame exceeds this threshold, the candidate frame is not chosen and the process
moves to the next frame in the video, for which the same comparison is performed. Finally, the
fourth technique follows a process similar to the third. However, in this one, the threshold
was set to 0.98, and the matching operation is the SSIM (structural similarity index measure)
image-matching metric. A maximum segment length of 45 frames was set for the tests. The
comparison between techniques is presented in Table 1 and Table 2. The technique chosen was the
first one described: “No selection”.
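
The four techniques can be sketched as follows. This is an illustrative reconstruction, not the code used in the experiments: the function names are hypothetical, frames are represented as flat lists of grayscale intensities, each candidate is compared with the last frame that was kept (the text does not say whether the reference is the last kept frame or the immediately preceding one), and SSIM is computed once over the whole frame instead of over local windows.

```python
from statistics import fmean


def pixel_similarity(a, b, max_value=255):
    """Pixel-to-pixel similarity in [0, 1]: 1 minus the normalized mean absolute difference."""
    return 1.0 - fmean(abs(x - y) for x, y in zip(a, b)) / max_value


def global_ssim(a, b, max_value=255):
    """SSIM computed once over the whole frame (a simplification of windowed SSIM)."""
    c1, c2 = (0.01 * max_value) ** 2, (0.03 * max_value) ** 2
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean((x - mu_a) ** 2 for x in a)
    var_b = fmean((y - mu_b) ** 2 for y in b)
    cov = fmean((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def select_all(frames, max_len=45):
    """Technique 1 ("No selection"): take consecutive frames up to max_len."""
    return frames[:max_len]


def select_every_other(frames, max_len=45):
    """Technique 2: keep the 1st, 3rd, 5th, ... frame, skipping one after each."""
    return frames[::2][:max_len]


def select_dissimilar(frames, similarity, threshold, max_len=45):
    """Techniques 3 and 4: drop a frame when it is too similar to the last kept one."""
    selected = []
    for frame in frames:
        if len(selected) == max_len:
            break
        if selected and similarity(selected[-1], frame) > threshold:
            continue  # near-duplicate of the last kept frame
        selected.append(frame)
    return selected


# Toy frames: two pairs of identical frames followed by a different one.
frames = [[0] * 4, [0] * 4, [200] * 4, [200] * 4, [10] * 4]
print(len(select_all(frames)))
print(len(select_every_other(frames)))
print(len(select_dissimilar(frames, pixel_similarity, 0.9)))   # technique 3
print(len(select_dissimilar(frames, global_ssim, 0.98)))       # technique 4
```

Passing the similarity function and threshold as arguments keeps techniques three and four in one routine, so only the metric and the cut-off differ between them.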

4.3. Automatic Detection of Personnel in Their Daily Activity

The solution presented is based on a visual and a temporal feature extractor. The first stage
of the model consists of a truncated InceptionV4 architecture (pre-trained on the ImageNet
dataset). That is, all the Inception cells (convolutional layers) were used, and the multilayer
perceptron at the end of the architecture was removed, so that this part of the model serves
only as a visual feature extractor, as shown in the upper part of Figure 1.

However, multiple experiments showed that the pre-trained model does not differentiate between
a vehicle at rest and a vehicle hit by a traffic accident. Therefore, the image dataset was
used for training in order to adjust the weights of this pre-trained network. In this process,
all the weights of the initial layers of the architecture were frozen, and only those of the
last convolutional cell of InceptionV4 were adjusted. To adjust the feature extractor, multiple
experiments were performed using regularization techniques, data augmentation, and
hyper-parameter modifications. The results of the tests performed are shown in Figure 5.

Figure 5. Visual feature extractor experiment

The temporal feature extraction is based on recurrent neural networks. The architecture
proposed for this stage consists of two ConvLSTM layers. ConvLSTM layers are designed to
extract temporal information from data of more than one dimension, using the convolution
operation. Between these layers, a Batch Normalization layer is added, and the various
hyper-parameters are adjusted. The ConvLSTM layers used consist of 64 neurons each, a kernel
size of 3 × 3, a dropout of 0.2, and a recurrent dropout of 0.1. The results obtained are
presented in Figure 6, while Figure 7 shows the accuracy of the model in the training stage.
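
To make the layer sizes concrete, the following sketch counts the trainable parameters of the two ConvLSTM layers and the Batch Normalization layer between them. It assumes the standard ConvLSTM formulation with four gates and one bias per filter, reads the 64 "neurons" as 64 convolution filters, and assumes a 1536-channel input feature map for the truncated InceptionV4 output; the function names are illustrative. Dropout and recurrent dropout add no parameters.

```python
def convlstm_params(in_channels, filters, kernel_size=3, use_bias=True):
    """Trainable parameters of one ConvLSTM layer.

    Each of the 4 gates convolves the input (in_channels) and the previous
    hidden state (filters channels) with a kernel_size x kernel_size kernel.
    """
    weights_per_gate = kernel_size * kernel_size * (in_channels + filters) * filters
    bias_per_gate = filters if use_bias else 0
    return 4 * (weights_per_gate + bias_per_gate)


def batchnorm_trainable_params(channels):
    """Batch Normalization learns one scale and one shift per channel."""
    return 2 * channels


FEATURE_CHANNELS = 1536  # assumed depth of the truncated InceptionV4 output
FILTERS = 64             # "64 neurons" read as 64 filters

first = convlstm_params(FEATURE_CHANNELS, FILTERS)
norm = batchnorm_trainable_params(FILTERS)
second = convlstm_params(FILTERS, FILTERS)
print(first, norm, second, first + norm + second)
```

Under these assumptions the first layer accounts for 3,686,656 of the 3,981,952 trainable parameters, because its cost scales with the depth of the incoming feature map.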

Figure 6. Experimenting with the temporal feature extractor.

Figure 7. Behavior of the model’s accuracy by epochs with the training set and the validation set.

The last stage of the detection process for the daily activities of personnel is a densely
connected block. The proposed neural network consists of a total of three hidden layers and one
output layer, plus a regularization technique called dropout with a value of 0.3. The
distribution of the neurons in the mentioned layers is as follows: 400, 100, and 1 neuron,
where the first two layers use the hyperbolic tangent activation function while the last layer
(the output layer) uses the sigmoid activation function in order to perform a binary
classification (present or not present). The training and validation results are presented in
Table 3 (note: the dataset is distributed as 94% present and 6% not present). The established
hyper-parameter values are presented in Table 4. The model was trained on a computer with a
5th generation Intel Core i7-4820K @ 3.70 GHz processor, 64 GB of RAM, and two Nvidia GTX
1080 Ti video cards with 11 GB of GDDR5X memory at 405 MHz.
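
The forward pass of such a dense block can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the trained model: the weights are random placeholders, the input length of 64 is hypothetical, dropout is assumed to follow each hyperbolic tangent layer and to be active only during training, and the layer widths follow the 400, 100, and 1 neuron distribution given above.

```python
import math
import random


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b) for one input vector."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dropout(values, rate, rng):
    """Inverted dropout, used only during training."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (placeholder initialization)."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(features, layers, rng=None, training=False, rate=0.3):
    """400 tanh -> 100 tanh -> 1 sigmoid; returns the probability of "present"."""
    activations = [math.tanh, math.tanh, sigmoid]
    x = features
    for (weights, biases), activation in zip(layers, activations):
        x = dense(x, weights, biases, activation)
        if training and activation is math.tanh:
            x = dropout(x, rate, rng)
    return x[0]


rng = random.Random(0)
INPUT_SIZE = 64  # hypothetical length of the flattened temporal features
sizes = [INPUT_SIZE, 400, 100, 1]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
features = [rng.uniform(-1.0, 1.0) for _ in range(INPUT_SIZE)]
probability = classify(features, layers)
print("present" if probability >= 0.5 else "not present", round(probability, 3))
```

The sigmoid output is read as the probability of the "present" class, and the 0.5 decision threshold used in the last line is a conventional default, not a value reported in this thesis.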

Chapter V. Discussion
In relation to model bias, the model could be biased towards accidents involving vehicles.

This verification is not trivial due to limitations of the validation dataset, which is composed of a

majority of activities involving personnel. However, the feature extractor was trained considering

different kinds of facilities, including army, air-force, and SSF but excluding where there are

visible human interventions. Additionally, regarding weather conditions, the model is biased
towards daytime activities because of dataset limitations: there were not enough videos
recorded at night or in rain, snow, or other weather conditions.

Regarding the model’s generalization capacity, the model is independent of a particular camera
viewpoint, the structure of the street, or aspects such as vehicle density. We do not describe
the technical parameters of the cameras because we used public video datasets for the model
validation. An adequate analysis in this direction would make it possible to define hardware
limitations for correct model operation. However, it is difficult to perform an analysis at the
level of device specification, mainly because obtaining a dataset that includes a large number
of images from cameras with different lenses, acquisition sensors, and even spatial positions
is impractical. In this context, separating or analyzing the effect of the camera parameters is
almost impossible. Deep learning models, on the other hand, are robust to small variations in
their input. They require minimal preprocessing and do not need a hand-selected extractor of
specific characteristics. We therefore consider that the model has no significant
camera-parameter restrictions, because the datasets used include different cameras in multiple
positions, and we assume that the model can operate correctly on the most popular devices used
for facial detection systems.

The feature extractor was trained with visual patterns associated with activities that had

already occurred, so the model cannot predict an activity, but it is capable of identifying visual

patterns relating to the occurrence of an activity. The temporal feature extractor was trained to

recognize the appearance in time of these visual patterns, which strengthens the activity

identification only when based on visual patterns. We consider that it is not possible to
predict the occurrence of accidents in advance with this configuration.

Addressing ethical considerations, the group of activities involving pedestrians was not
considered because the nature of the training in the fine-tuning process of the feature
extractor model required images that include unknown persons. To include this category
(pedestrians), consideration should be given to obtaining a representative group of images to
avoid biases due to aspects such as age, height, or skin color.

Chapter VI. Conclusion

In conclusion, in our study, advanced image processing techniques such as contrast adjustment,
bilateral filtering, and histogram equalization were used to preprocess the input face images
in order to improve their image features. The same techniques were then applied to the
training/template face images, along with an image blending technique, to ensure high-quality
training/template face images. The preprocessed input face image is divided into k² regions,
and the LBP (local binary pattern) code is then calculated for each pixel in each region by
comparing the center pixel with the pixels around it. Binary 1 denotes a neighboring pixel that
is greater than or equal to the center pixel; otherwise, binary 0 is used.

In order to obtain the binary patterns necessary to construct the feature vector of the input
face image, this procedure is repeated for every pixel in all other regions. A histogram with
all possible labels is constructed for each region; each bin of a histogram counts the number
of instances of one pattern in that region. After that, the regional histograms are
concatenated into a single feature vector, which is then compared to the template face images
in order to identify faces. The results of our experiments demonstrate that our method is
extremely accurate and robust for a facial recognition system that can be used in a real-world
setting. It also improves the LBP code. Our research does not address the problem of masked
faces and occlusion in facial recognition; addressing these issues could be an excellent
addition to future work.
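
The LBP feature extraction summarized above can be sketched as follows. This is a minimal illustration, not the implementation used in our experiments: the clockwise neighbor ordering, the skipping of border pixels, the way the code map is split into k × k regions, and the chi-square distance used to compare feature vectors are all assumptions, and the function names are hypothetical.

```python
# Neighbor offsets (row, col), clockwise from the top-left pixel; the i-th neighbor sets bit i.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_codes(image):
    """Return the 8-bit LBP code of every interior pixel of a grayscale image."""
    height, width = len(image), len(image[0])
    codes = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(NEIGHBORS):
                if image[r + dr][c + dc] >= center:  # binary 1 when neighbor >= center
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


def lbp_feature_vector(image, k):
    """Split the code map into k x k regions and concatenate their 256-bin histograms."""
    codes = lbp_codes(image)
    height, width = len(codes), len(codes[0])
    features = []
    for i in range(k):
        for j in range(k):
            histogram = [0] * 256
            for r in range(i * height // k, (i + 1) * height // k):
                for c in range(j * width // k, (j + 1) * width // k):
                    histogram[codes[r][c]] += 1
            features.extend(histogram)
    return features


def histogram_distance(a, b):
    """Chi-square distance between two feature vectors (smaller means more alike)."""
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(lbp_codes(image))  # [[120]]
```

In the 3 × 3 example, the neighbors 6, 9, 8, and 7 are greater than or equal to the center value 5, so bits 3 to 6 are set and the code is 8 + 16 + 32 + 64 = 120. An input face would be matched to the template whose feature vector gives the smallest distance.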

References

Alice A. Robie, K. M. (2017). Machine vision methods for analyzing social interactions. Journal

of Experimental Biology, 35-70. doi:10.1242

Bini, D., Pamela, D., & Prince, S. (2020). Machine Vision and Machine Learning for Intelligent

Agrobots: A review. 2020 5th International Conference on Devices, Circuits and Systems

(ICDCS), 12-16. doi: 10.1109/ICDCS48716.2020.243538.

Mavridou, E., & Vrochidou, E. (2019). Machine Vision Systems in Precision Agriculture for

Crop Farming. Imaging 2019, 89. doi:10.3390

Sergio Robles-Serrano, G. S.-T.-B. (2021). Automatic Detection of Traffic Accidents from

Video Using Deep Learning Techniques. Computers 2021, 148. Retrieved from

https://doi.org/10.3390/computers10110148

Sharmin, I., Islam, N. F., & Jahan, I. (2019). Machine vision based local fish recognition. SN

Appl. Sci. 1, 1529. doi:10.1007
