
YOLO NANO ARCHITECTURE FOR MOVING OBJECT DETECTION USING REAL-TIME VIDEOS

*Dhivya Ramasamy, Assistant Professor, Department of Information Technology


M.Kumaraswamy College of Engineering, Thalavapalayam, Karur, Tamilnadu, India, 639113

*Corresponding Email: dhivyaramasamy2@gmail.com

Abstract— Computer vision has been studied extensively over the past two decades, and the monitoring of visual objects is one of its most crucial components. Tracking involves following a moving object over time, and object-recognition-based tracking is used to identify or associate target objects across successive video frames. Most people in India pay little attention to safety regulations; for example, passengers are not permitted to cross the yellow line at metro stations, so an alarm system is needed. The focus of the proposed system is on the identification of moving objects from fixed security cameras. YOLO object detection models are employed for detection, and pre-trained models are used in this work to keep the pipeline straightforward. Motion detection for video surveillance involves several processing phases, including motion detection, classification and object detection, and behaviour interpretation. The system operates in real time and is more accurate than earlier models. This study describes a custom image dataset that was used to train YOLO Nano for three distinct classes; the trained model is then applied to videos for tracking. Identifying vehicles or pedestrians in live video is useful for traffic analysis.
Keywords— Moving Object Detection, YOLO Nano Architecture, Real-Time Videos, Video Dataset from Surveillance Cameras
I. INTRODUCTION
Content analysis has become increasingly important in smart surveillance, and new methods are being researched and tested in real settings. Data analysis engines built into intelligent devices will be crucial in future surveillance networks. Video object segmentation and tracking receive the most attention in embedded content analysis algorithms, since they are essential components of other smart surveillance activities. A number of video object segmentation methods have been developed under diverse assumptions about natural scenes, and several straightforward and effective techniques have been suggested. Nevertheless, because only a single background layer is used in their background image, these methods are unable to deal with dynamic backgrounds. Multidimensional, more complex background models can handle dynamic backgrounds; however, their large memory requirements impose implementation constraints, particularly in embedded systems such as smart cameras. The performance of most existing video object segmentation methods is compared on artificial datasets, and the results show that multilayer background model (multi-model) techniques perform better than single background model (uni-model) algorithms. To handle abrupt changes in illumination, some authors presented a more complicated method consisting of an eigen-background and a statistical illumination model; however, this algorithm is too complex to be integrated into current smart camera platforms. Since segmentation results depend heavily on the threshold value, a technique for choosing acceptable threshold values is equally crucial. The data association of segmented blobs for video object tracking depends heavily on the accuracy of the segmentation output. Gradient descent-based methods use gradient descent optimization to search the most likely object candidate regions; however, they struggle with local minima, and handling objects with large motion is challenging. The Kalman filter is used to track and predict object motion, but it may fail when an object moves erratically. Although the particle filter is a more reliable approach and can better handle large and erratic motions, the features used for object modelling and the distance measures used to compute the particle weights, which are crucial to the effectiveness of these methods, must be chosen carefully. Colour, gradient, edge, texture, and motion are commonly used features in object modelling. When these features are used as the object model, however, a number of weaknesses can appear. For instance, colour-based models may not account for variations in appearance caused by changes in scene illumination, and models based on gradient, boundary, or texture features are not reliable whenever those features are not visible or when the tracked target is non-rigid, because such features typically vary with the motion of a non-rigid object.
Video Surveillance Systems (VSS) are now an integral part of the infrastructure of smart cities. Because of their great advantages, such as making public and private spaces safe and enhancing community safety, these systems play a crucial part in our lives. Cameras are placed in the surveillance environment, people are tracked, and suspicious behaviour is detected by an intelligent system that makes decisions based on an analysis of the scene sequences. In computer vision, detecting moving objects is a key challenge. It is crucial for many applications, including surveillance, motion tracking of people, traffic control, and human-computer interaction. Tracking moving objects is, in essence, the automatic association between the objects visible in the current frame and those in the previous frame. A "self-driving car" is one that can perceive its surroundings and drive itself without the assistance of a driver.
The primary driving force behind the current discussion is the rapid development of applied artificial intelligence and the anticipated significance of autonomous driving in the future of humanity, from independent mobility for non-drivers to affordable transportation services for those with low incomes. Driverless vehicles are becoming more common, and their combination with electric vehicles holds the potential of reducing traffic deaths and air and fine-particle pollution, improving parking lot management, and relieving people of the tedious and boring process of operating a vehicle. Autonomous navigation has great potential because its uses go well beyond automated driving. Here, the major goal is to remove humans from the vehicle management system and free them from the responsibility of operating the vehicle. The essential components of self-driving cars are visual sensors (to gain insight into the traffic surrounding the vehicle), microcontrollers or computers (to process the sensor information and relay vehicle control commands), and motors (to receive those commands and carry out the longitudinal and lateral control of the car). Asteroid mining is one of the most intricate human-planned endeavours in which autonomous vehicles are anticipated to be used. The simultaneous development of such autonomous vehicles by numerous venture companies has been made possible by rapidly developing AI and deep learning (DL) technologies and frameworks.
Modern object detection methods such as Faster RCNN, YOLO, and SSD all use deep learning at their foundation. Deep learning-based object recognition systems now support practical applications thanks to GPUs, and the edge computing market now offers a choice of reasonably priced AI hardware. These tools are used to deploy computationally complex activities, including AI inference, on hardware with limited resources; they rely on GPUs and a variety of software optimizations. Real-time object detection on embedded systems still poses a significant challenge, since it requires fairly sophisticated deep learning systems. To adapt each approach to the intended scenario in practice, a trade-off between precision and latency is necessary.
The remainder of this paper is organized as follows. Section 2 gives an overview of related work on object tracking and identification in transport systems. Section 3 presents the framework of our system for detecting moving objects, along with a discussion of the capabilities and processes of the proposed automatic object detection system. Section 4 highlights the automatic object labelling method, YOLO Nano, together with the results, and Section 5 concludes the paper.
II. LITERATURE REVIEW
People who are visually impaired have trouble moving securely and freely, which makes it difficult for them to engage in typical indoor and outdoor work and social activities, and they also struggle to recognize the fundamental elements of their surroundings. One study proposes a model for recognizing common objects and performing face identification from a people dataset, as well as detecting the brightness and key colours of real-time images using the RGB technique with an external camera. Object detection, the computer vision subfield concerned here, locates and identifies objects in images and videos. The device continually captures frames using the digital camera of an ESP-32 Cam, which can later be transformed into audio segments. The work applies the You Only Look Once v3 (YOLO v3) algorithm, which uses OpenCV to run a highly sophisticated convolutional neural network. The image content is converted to text, and for the visually impaired person the text is then converted to speech with the help of Google Text-to-Speech, so the user hears the location of the objects in the camera's field of vision. An ultrasonic sensor facilitates distance calculation. With a user-friendly device that incorporates this object detection model, the proposed design succeeds in giving visually challenged users the ability to handle unexpected situations [21]. Unmanned aerial vehicles (UAVs) are being employed more often for remote sensing and video monitoring, and moving object detection is an essential algorithm for many of these applications, which involve real-time processing of object tracking and identification for decision-making. High-resolution UAV-sourced videos, however, are challenging to analyse in real time because they are computationally expensive to process. GPU vendors frequently introduce newer architectures with additional capabilities to accelerate specific applications, so it is crucial to investigate parallel versions of these techniques on the new GPU technologies. To identify moving objects in videos from UAVs, one study discusses parallel implementation options for algorithms such as feature detection, feature matching, image transformation, frame differencing, morphological processing, and connected component labelling. The solution is tested on several NVIDIA GPU microarchitectures (Fermi, Maxwell, and Pascal); according to the experimental findings, 1080p videos can be processed at rates of 43.1, 35.5, and 9.1 frames per second on the Pascal, Maxwell, and Fermi microarchitectures, respectively [22].
Object recognition is one of the challenging applications of computer vision and has been widely used in various fields, such as autonomous cars, robotics, security tracking, and guiding visually impaired individuals. As deep learning advanced rapidly, numerous algorithms strengthened the connection between video analysis and image interpretation; with varied network architectures, each of these techniques accomplishes the same task of detecting multiple objects in complicated images. Freedom of movement in an unknown environment is restricted for people with visual impairment, so it is crucial to use modern technologies and train them to assist blind people whenever necessary. One study presents a system that identifies everyday objects and prompts a voice to warn users of both nearby and faraway objects. To assess accuracy and performance, two alternative algorithms, YOLO and YOLO v3, were used in the development of that system; the YOLO TensorFlow variant uses the SSD-MobileNet model and YOLO v3 uses the DarkNet model. The gTTS (Google Text-to-Speech) Python module is used to translate statements into audio speech to provide audio feedback, and a Python program is employed to play the audio. Over 200K images from the MS-COCO dataset are used to test both techniques, and both methods are examined with webcams to assess their accuracy in all scenarios [23]. The purpose of police authorities is to prevent and detect crime before it occurs, because the crime rate and the number of criminals are rising daily, raising serious concerns about security. To reduce crime, recent technologies, particularly CCTV, are frequently used in both public and private spaces; however, they require human supervision to be monitored, and it is challenging for a person to keep track of multiple screens at once, which causes many mistakes. As a solution to these issues, a real-time crime detection technique was proposed that watches live video and notifies the local cybercrime administrator about a crime that has occurred, together with its current location. The authors adopt YOLO as the object detection technique in that work; their architecture processes images in real time at a rate of 45 frames per second [24].
Wrong-way driving is one of the primary causes of traffic jams and accidents globally; identifying vehicles that are driving against the flow of traffic helps to avoid accidents and congestion. Surveillance video has become an important source of data due to the availability of low-priced cameras and the expanding use of real-time traffic management systems. One article suggests an automated wrong-way vehicle recognition system using video from on-road security cameras. The You Only Look Once (YOLO) methodology is used to detect moving objects in each video frame, a centroid-based tracking method then tracks each moving object inside a defined zone of interest, and finally a wrong-way driving identification step flags offending vehicles. With the centroid tracking method, YOLO can accurately identify and track any moving object. Experiments with various traffic footage show that the suggested method can detect and identify wrong-way vehicles under a variety of lighting and weather conditions, and the setup and operation of the system are very simple [25]. T-RexNet, a hardware-aware neural network for real-time detection of small moving objects, is suitable for embedded systems, as demonstrated by its deployment on the NVIDIA Jetson Nano edge device. Real-world tests evaluated the method's performance in a variety of scenarios, including aerial surveillance using the WPAFB 2009 dataset, civil monitoring using the Chinese University of Hong Kong (CUHK) Square dataset, and fast tennis-ball tracking using a custom dataset, in order to demonstrate the effectiveness and general applicability of the approach. According to experimental tests, T-RexNet beats previous general object recognition algorithms on this task, demonstrating that it is a valid, general solution for recognizing small moving objects, and in terms of the accuracy vs. speed trade-off the method is competitive with existing implementations [26]. However, YOLO still needs high-end hardware to detect objects in real time. One study addresses the real-time object detection service of YOLO on AI embedded devices with resource limitations; the authors highlight the issues with real-time processing in YOLO object detection that are connected to network cameras before putting forward a new YOLO architecture with adaptive frame control (AFC) that can effectively address these issues. They demonstrate through numerous experiments that the proposed AFC can provide real-time object detection service by decreasing overall service delay, which remains a restriction of pure YOLO, while retaining YOLO's high precision and simplicity [27]. For wide-area video surveillance (VS), imaging devices with ever higher megapixel resolution and frame rates are widely used, so the demand for high-performance implementations of VS algorithms for the real-time processing of high-resolution videos has increased. By extracting data-level parallelism from such algorithms, multi-core architectures and graphics processing units (GPUs) offer an economical and energy-efficient platform to satisfy real-time processing demands; however, the main benefits of these systems can only be realised by creating innovative algorithms and fine-grained parallelization techniques. One study discusses a GPU-based scheme for video object recognition techniques, including connected component labelling (CCL) for blob labelling, segmentation post-processing methods, and the Gaussian mixture model (GMM) for background modelling.
To fully utilise the computing power of CUDA cores on GPUs, novel parallelization approaches are discussed together with fine-grained optimization methods. Compared with a sequential implementation running on an Intel Xeon CPU, experimental results show that the parallel GPU implementation achieves considerable speedups of 250x for binary morphology, 15x for GMM, and 2x for CCL, which results in processing 22.3 frames per second for HD videos [28]. A few years ago, classifying the many objects in an image was an extremely challenging undertaking. Finding the significant objects inside an image is crucial, and this can be done by drawing rectangular bounding boxes around those objects. Convolutional neural networks based on deep learning have become increasingly popular in computer vision, where each image is given a label based on its contents after the system has been trained. The speed with which systems can recognize the objects inside an image is still only at an intermediate level. One work uses a YOLO algorithm built with Python, OpenCV, and deep learning to accelerate and improve the object detection process. Vehicle speed, road signs, logos inside images, and front and rear views of a vehicle can all be determined with this method, and moving objects can be detected and tracked by the system in a real-time setting. The algorithm employs a localization approach, which divides the acquired video into numerous frames before applying localization to find the object inside each of the separated frames. The primary goal of that approach is to increase the precision and effectiveness of object recognition using a single-stage method on the various moving objects inside video frames [29].
The precise motion estimation and compensation techniques that are typically employed to detect precise movement from video streams are necessary for the detection and tracking of objects in large-scale surveillance. One study proposes a novel hardware-level architecture for practical implementation that includes motion estimation, compensation, and recognition. To create an optimized architecture, the motion vectors are derived using 16x16 sub-blocks with a new parallel D flip-flop design. Then, to increase the speed of the architecture, the sum of absolute differences (SAD) is calculated using optimized absolute-difference and adder blocks built with the Kogge-Stone adder. The controller block, which synchronizes all of the operations, was designed using a finite state machine paradigm. Additionally, the Kogge-Stone adder and fundamental logical constructs are used to optimize the comparator and compensation blocks. The proposed architecture is then implemented on the Zynq Z7-10 FPGA and simulated using the System Generator tool for real-time traffic signals [30].
III. PROPOSED MODEL
One of the key functions of our system is object detection. In order to retain only the information about the objects relevant to the surveillance task, this activity entails locating the objects in the frames captured by the camera in the surveillance area and classifying them (people, animals, vehicles, etc.). Several strategies have been outlined in the literature. Among these methods, we are specifically interested in using the real-time object detection neural network YOLO (You Only Look Once) as the object detector. Due to its superior results compared with other object detection approaches, its high speed, and its lower memory footprint, this method has gained popularity recently.

Fig.1. Flowchart for the proposed model: Input Video → Frame Separation → Frame Differencing → Current Frame Taken → Background Subtraction → Foreground Detection → Bounding Box Detection → Feature Extraction → Object Detection.
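To make the flow of Fig. 1 concrete, the following Python sketch wires the frame separation, frame differencing, background subtraction, and bounding-box stages together with OpenCV; the detector and feature-extraction calls are left as placeholders, and the parameter values (MOG2 history, contour-area filter) are illustrative assumptions rather than the exact configuration used in this work.

import cv2

def run_pipeline(video_path):
    # Input video and frame separation
    cap = cv2.VideoCapture(video_path)
    back_sub = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray, prev_gray)          # frame differencing
        prev_gray = gray
        fg_mask = back_sub.apply(frame)              # background subtraction
        fg_mask = cv2.medianBlur(fg_mask, 5)
        # Foreground detection and bounding-box extraction
        contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]
        # Feature extraction and object detection on each box would follow here,
        # e.g. cropping the region and passing it to the trained detector.
    cap.release()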


A. Key Frame Selection
Videos typically contain a large number of frames, but not all of them provide crucial information; considering every frame therefore produces redundant analysis. Frame selection is thus a crucial step in the object detection process. Every video consists of a series of shots recorded by a camera, collectively referred to as an unbroken frame sequence. The key frames must adhere to two basic rules:
• They must be similar to the frames in their own cluster
• They must be distinct from the frames of other clusters

The key frames are chosen mostly on the basis of colour, similarity, intensity, and other factors. They are selected using a gradient histogram technique, which determines how different the histograms are. The histogram of a frame represents the mass function of the frame intensities; it is computed from the number of pixels belonging to each colour. To minimize the number of colours in the histogram, and hence its size, colour quantization is typically applied to the original frame. Each video consists of n frames, from which the best key frames are chosen for subsequent operations. The frames are first extracted from the video, forming the frame sequence. Using the current key frame as a reference, the similarity between the colour features is determined; if the distance of the current frame exceeds the threshold, the current frame is considered a key frame. The threshold is established from the mean and deviation of the distance between two histograms. The following formula is used to determine how similar or different two colour histograms are:
d(H_1, H_2) = 1 - \sum_{j=0}^{63} \min\left(H_1(j), H_2(j)\right) \qquad (1)

where H_1 and H_2 denote the colour histograms of frames 1 and 2, respectively.
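As an illustration of this selection rule, the sketch below computes a 64-bin per-frame histogram with OpenCV, measures the histogram-intersection distance of Eq. (1), and flags a frame as a key frame when its distance to the previous frame exceeds a mean-plus-deviation threshold; the grayscale 64-bin quantization and the exact threshold rule are assumptions for illustration, not necessarily the configuration used in this work.

import cv2
import numpy as np

def colour_histogram(frame, bins=64):
    # 64-bin intensity histogram, normalised to sum to 1 (assumed quantisation)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
    return hist / (hist.sum() + 1e-12)

def histogram_distance(h1, h2):
    # Eq. (1): 1 minus the histogram intersection
    return 1.0 - np.minimum(h1, h2).sum()

def select_key_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    if len(frames) < 2:
        return frames

    hists = [colour_histogram(f) for f in frames]
    # Distance of every frame to its predecessor
    dists = [histogram_distance(hists[i - 1], hists[i]) for i in range(1, len(hists))]
    # Threshold from the mean and deviation of the distances (assumed rule)
    threshold = np.mean(dists) + np.std(dists)

    key_frames = [frames[0]]                      # first frame taken as initial key frame
    for i, d in enumerate(dists, start=1):
        if d > threshold:
            key_frames.append(frames[i])
    return key_frames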

Fig.2. Flow for frame selection: frame extraction → frame-wise histogram (HOG frequency and orientation) calculation → histogram difference between two frames → mean and deviation → threshold comparison → frame selected as a key frame (yes) or rejected (no).

B. Denoising
Denoising is the process of removing noise from the chosen 3D key frames in order to improve image quality, as well as the accuracy of the results, before matching with the specified query. The following factors contribute to noise in videos and frames: (i) recording devices (such as cell phones, digital cameras, and security cameras); (ii) object movement; (iii) channel noise; (iv) poor-resolution devices; and (v) compressed storage. Denoising is regarded as a pre-processing stage in frame processing since it keeps the frame's valuable information intact. The noise is reduced, and the wavelet-coefficient frames are used in the subsequent step to produce results with a higher degree of precision. This work uses a probabilistic filtering strategy to denoise the best key frames. The Taylor series serves as the foundation for this strategy since it offers high-resolution wavelet sub-bands. A Taylor series is an infinite sum of terms calculated from the derivatives of a function at varying pixel values. For a 3D frame, the Taylor series is described as follows:
t(X) = f(A) + (X - A)^{T} Df(A) + \frac{1}{2!} (X - A)^{T} \{D^{2} f(A)\} (X - A) + \cdots \qquad (2)
Because a frame's intensity is not uniform throughout, a frame often consists of a variety of pixel values; for denoising, these variations in pixel values therefore require greater consideration. The noise in the sub-bands is estimated from the Taylor series above and removed if it exceeds the Bayesian threshold. The following equation is used to set the threshold for all sub-bands:

t_B = \frac{\sigma_N^{2}}{\sigma_s} \qquad (3)

where \sigma_N denotes the noise estimate and t_B represents the threshold for all sub-bands. When \sigma_s \neq 0, it is expressed as

\sigma_s = \sqrt{\max\left(\sigma_y^{2} - \sigma_N^{2},\; 0\right)} \qquad (4)


Here, \sigma_y is

\sigma_y = \frac{1}{N} \sum_i sb_i \qquad (5)

where the sub-bands are SB_i = \{LLL, LLH, LHL, LHH, HLL, HLH, HHL, HHH\}, N is the total number of sub-bands, and \sigma_N is determined by

\sigma_N = \frac{\mathrm{median}(sb)}{0.6745}

Based on the value of the Bayesian threshold, a curve-fitting approach is used in the following step:

Q = \frac{p_1 \beta^{2} + p_2 \beta + p_3}{\beta + q} \qquad (6)

In the above equation, \beta represents the difference values for the noise standard deviation, and the fitted constants are

p_1 = 0.9592, \quad p_2 = 3.648, \quad p_3 = -0.138, \quad q = 0.1245

Next, the Bayesian filtering technique is applied; its mathematical formulation is stated as follows:

\hat{Y}_{SSB} = t_{3D}^{-1}\left(\alpha\left(t_{3D}(Z_{SB}),\; t_B\, Q \sqrt{2 \log(N^{2})}\right)\right) \qquad (7)

The terms involved above are:

• t_{3D} denotes the three-dimensional transform of the unitary structure
• \hat{Y}_{SSB} denotes the stacked sub-bands
• Z_{SB} denotes the sub-bands of the stated sizes
• \alpha is the thresholding operator

The sub-bands are reconstructed from this final formulation to obtain denoised frames for feature extraction. After denoising all user-provided 3D frames and the optimal key frames, feature extraction follows: all relevant, non-redundant information is extracted from each frame and 3D input frame.
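As a rough illustration of the thresholding in Eqs. (3)-(7), the sketch below applies a BayesShrink-style soft threshold to the wavelet sub-bands of a single 2D (grey-scale) frame using the PyWavelets package; the 2D simplification, the db1 wavelet, the two-level decomposition, and the use of the finest diagonal sub-band for the noise estimate are assumptions for illustration rather than the exact probabilistic filter proposed here.

import numpy as np
import pywt

def bayes_denoise(frame, wavelet="db1"):
    # Two-level 2D wavelet decomposition of the frame (assumed settings)
    coeffs = pywt.wavedec2(np.asarray(frame, dtype=float), wavelet, level=2)
    approx, detail_levels = coeffs[0], coeffs[1:]

    # Noise estimate sigma_N = median(|sb|)/0.6745 from the finest diagonal sub-band
    sigma_n = np.median(np.abs(detail_levels[-1][2])) / 0.6745

    denoised = [approx]
    for (ch, cv, cd) in detail_levels:
        new_level = []
        for sb in (ch, cv, cd):
            # sigma_s = sqrt(max(sigma_y^2 - sigma_N^2, 0)), Eq. (4)
            sigma_y2 = np.mean(sb ** 2)
            sigma_s = np.sqrt(max(sigma_y2 - sigma_n ** 2, 0.0))
            # Bayesian threshold t_B = sigma_N^2 / sigma_s, Eq. (3)
            t_b = sigma_n ** 2 / sigma_s if sigma_s > 0 else np.abs(sb).max()
            new_level.append(pywt.threshold(sb, t_b, mode="soft"))
        denoised.append(tuple(new_level))
    # Reconstruct the denoised frame from the thresholded sub-bands
    return pywt.waverec2(denoised, wavelet)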
C. Object Detection
Two distinct kinds of characteristics are considered in this work:
• Visual low-level characteristics
• Semantic visual characteristics
Colour, shape, and contour are examples of low-level visual features, while motion observed across frames is treated as a semantic visual feature. This section covers the three features that are extracted and the formulations used to do so. The colour feature is extracted by first transforming the image to grey scale and then estimating the intensity values. The conversion from RGB to grey scale is represented by the following formula:
i_Y = 0.333\, f_R + 0.5\, f_G + 0.1666\, f_B \qquad (8)

In the above equation, i_Y denotes the grey-scale intensity, and f_R, f_G, and f_B denote the R, G, and B components, respectively. To obtain the hue, saturation, brightness, and intensity variables, the frames and images are further transformed into the HSV and YCC colour spaces. Additional colour histograms are acquired and normalised, and the colour feature is then retrieved from the normalised colour histogram. The low-level visual features in the frame are then used to extract texture features. Texture features are extracted using the conventional co-occurrence matrix estimation method, which supports the identification of 14 statistical texture metrics. A co-occurrence matrix is computed between pairs of grey levels at a given angular orientation along an axis; the orientations considered are 0°, 45°, 90°, and 135°. The equation for the grey-level co-occurrence matrix is as follows:

P(I, J \mid D, \theta) = \{(X_1, Y_1), (X_2, Y_2)\} \qquad (9)

where I = i(X_1, Y_1) and J = i(X_2, Y_2), with |X_1 - X_2| = 0 and |Y_1 - Y_2| = D for the 0° orientation; \theta and D are the angular and distance values, respectively. Thus the GLCM is computed as P(I, J \mid D, 0°), and its normalised probability value is defined as follows:

P(I, J \mid D, \theta) = \frac{P(I, J \mid D, \theta)}{\sum_{i=1}^{256} \sum_{j=1}^{256} P(i, j \mid D, \theta)} \qquad (10)

The following texture characteristics are used in this work: angular second moment, entropy, correlation, contrast, and mean. The angular second moment (ASM), which is employed to evaluate frame homogeneity, is defined as the sum of the squares of the entries in the grey-level co-occurrence matrix:

ASM = \sum_{i} \sum_{j} P(i, j)^{2}
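As a brief illustration of this texture step, the sketch below converts a frame to grey scale with the weights of Eq. (8), builds a quantised GLCM for the 0° orientation at distance 1, normalises it as in Eq. (10), and derives ASM, contrast, and entropy; the 32-level quantisation and the pure-NumPy implementation are assumptions chosen for compactness.

import numpy as np

def to_gray(frame_rgb):
    # Grey-scale conversion using the weights from Eq. (8)
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return 0.333 * r + 0.5 * g + 0.1666 * b

def glcm_features(frame_rgb, levels=32, distance=1):
    # GLCM at 0 degrees and a few Haralick-style features (assumed settings)
    gray = to_gray(np.asarray(frame_rgb, dtype=float))
    q = np.clip((gray / 256.0 * levels).astype(int), 0, levels - 1)

    # Co-occurrence counts for pixel pairs (x, y) and (x, y + distance)
    glcm = np.zeros((levels, levels), dtype=float)
    left, right = q[:, :-distance].ravel(), q[:, distance:].ravel()
    np.add.at(glcm, (left, right), 1.0)

    p = glcm / glcm.sum()                       # normalisation as in Eq. (10)
    i, j = np.indices(p.shape)
    asm = np.sum(p ** 2)                        # angular second moment
    contrast = np.sum(((i - j) ** 2) * p)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return {"ASM": asm, "contrast": contrast, "entropy": entropy}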
The YOLO Nano method, a member of the YOLO family, balances accuracy and inference time to deliver effective results in a single pass. YOLO Nano is based on the YOLO v3 architecture and achieves real-time object detection, which makes it suitable for processing CCTV surveillance footage. YOLO Nano improves on YOLO v3 in two respects: instance regularization and a lighter architecture. YOLO Nano's structure is built from three components: expansion-projection (EP) modules, projection-expansion-projection (PEP) modules, and a fully connected attention (FCA) module. Compared with other YOLO architectures, the YOLO Nano design reduces the number of operations by roughly 50% and removes several PEP layers to make the network lighter. This section describes the instance regularization method, which enhances frame contrast and preserves lower-level characteristics such as lines and texture. The mathematical formulation of instance regularization is as follows:
y_{ncij} = \frac{x_{ncij} - \mu_{nc}}{\sqrt{\sigma_{nc}^{2} + \epsilon}} \qquad (11)

\mu_{nc} = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{nclm} \qquad (12)

\sigma_{nc}^{2} = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} \left(x_{nclm} - \mu_{nc}\right)^{2} \qquad (13)

Here, x_{ncij} is the layer feature map, where i and j index the spatial positions, c indexes the channel, and n indexes the n-th frame. \epsilon is a small numeric value that keeps the division numerically stable. \mu_{nc} and \sigma_{nc}^{2} are the mean and variance of the feature map computed over the current frame.
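The sketch below implements Eqs. (11)-(13) directly in NumPy for a batch of feature maps; the (N, C, H, W) tensor layout and the value of epsilon are assumptions used only for illustration.

import numpy as np

def instance_norm(x, eps=1e-5):
    # Instance regularization of Eqs. (11)-(13) for a tensor shaped (N, C, H, W)
    # Per-frame, per-channel mean and variance over the spatial dimensions, Eqs. (12)-(13)
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    # Normalised feature map, Eq. (11)
    return (x - mu) / np.sqrt(var + eps)

# Example: normalise a random batch of 2 frames with 16 channels of 52x52 features
features = np.random.randn(2, 16, 52, 52)
normalised = instance_norm(features)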
Fig.3. YOLO Nano architecture for object detection, taking key frames 1…n as input. The backbone takes a 416×416 input and stacks 3×3 and 1×1 convolutions with EP and PEP modules at 208×208, 104×104, 52×52, 26×26, and 13×13 feature-map resolutions. (B) EP module, built from a 1×1 convolution, a 3×3 depthwise convolution, and a 1×1 convolution. (C) PEP module, which places a further 1×1 convolution in front of the EP structure. (D) FCA module, built from fully connected (FC) layers.
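For a concrete picture of the EP and PEP blocks summarised in Fig. 3, the following PyTorch-style sketch assumes the usual expansion-projection pattern (1×1 expansion convolution, 3×3 depthwise convolution, 1×1 projection convolution, with PEP adding a leading 1×1 projection); the channel counts, the batch-normalization/ReLU placement, and the absence of residual connections are assumptions and not the exact YOLO Nano configuration.

import torch
import torch.nn as nn

class EP(nn.Module):
    # Expansion-projection block: 1x1 expand -> 3x3 depthwise -> 1x1 project (assumed layout)
    def __init__(self, in_ch, out_ch, expand_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, expand_ch, 1, bias=False),            # 1x1 expansion
            nn.BatchNorm2d(expand_ch), nn.ReLU(inplace=True),
            nn.Conv2d(expand_ch, expand_ch, 3, stride=stride,
                      padding=1, groups=expand_ch, bias=False),     # 3x3 depthwise conv
            nn.BatchNorm2d(expand_ch), nn.ReLU(inplace=True),
            nn.Conv2d(expand_ch, out_ch, 1, bias=False),            # 1x1 projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

class PEP(nn.Module):
    # Projection-expansion-projection block: a leading 1x1 projection before the EP structure
    def __init__(self, in_ch, out_ch, proj_ch, expand_ch):
        super().__init__()
        self.project = nn.Sequential(
            nn.Conv2d(in_ch, proj_ch, 1, bias=False),
            nn.BatchNorm2d(proj_ch), nn.ReLU(inplace=True),
        )
        self.ep = EP(proj_ch, out_ch, expand_ch)

    def forward(self, x):
        return self.ep(self.project(x))

# Example: a PEP block producing a 52x52x150 feature map as in Fig. 3 (channel counts assumed)
x = torch.randn(1, 122, 52, 52)
out = PEP(in_ch=122, out_ch=150, proj_ch=56, expand_ch=300)(x)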

IV. EXPERIMENTAL RESULTS AND DISCUSSION


To evaluate the effectiveness of the proposed YOLO Nano method in detail, experiments are conducted in this section and the results are analysed. The proposed YOLO Nano method is implemented in Matlab, which was chosen because of the abundance of functions it offers as a frame-processing tool. The suggested system is implemented on a Linux workstation with an Intel Core i5 processor and 4 GB of RAM. All experiments were run on real data from various distributions, using a video dataset to assess the efficacy of the system. This database includes three video clips shot with both stationary and moving cameras in natural settings (two for training and one for testing). All sequences have been meticulously annotated, strictly following a set procedure. The recommended system is evaluated over a number of videos, and the experiment is broken into two parts: object detection and tracking. The project is built on Python, assessed using three distinct video sequences, and runs at a high frame rate. Once training is finished, the weight files are used for object recognition on videos. The input video file is split into its individual frames, and each frame is sent to the trained object detector. After the object detector completes its work, the bounding-box information is passed to a SORT tracking algorithm, and object tracking is then carried out.
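A minimal sketch of this detect-then-track loop is given below; detect_objects stands in for the trained YOLO Nano detector, and the tracker is assumed to be the reference SORT implementation (a Sort class whose update method accepts an N×5 array of [x1, y1, x2, y2, score] detections), so the names and parameters are illustrative rather than the exact code used in this work.

import cv2
import numpy as np
from sort import Sort   # reference SORT tracker (assumed available)

def detect_objects(frame):
    # Placeholder for the trained detector; expected to return an
    # (N, 5) array of [x1, y1, x2, y2, confidence] detections.
    return np.empty((0, 5))

def track_video(video_path):
    cap = cv2.VideoCapture(video_path)
    tracker = Sort()                         # tracking-by-detection
    while True:
        ok, frame = cap.read()               # split the video into frames
        if not ok:
            break
        detections = detect_objects(frame)   # bounding boxes from the detector
        tracks = tracker.update(detections)  # rows of [x1, y1, x2, y2, track_id]
        for x1, y1, x2, y2, track_id in tracks:
            cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
            cv2.putText(frame, str(int(track_id)), (int(x1), int(y1) - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cap.release()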
Figures 4-18 compare Mask RCNN, YOLO, YOLOv3, and YOLO Nano over 2000-10000 instances.

Fig.4. Accuracy vs. Number of Instances
Fig.5. Precision vs. Number of Instances
Fig.6. Recall vs. Number of Instances
Fig.7. F-score vs. Number of Instances
Fig.8. Execution Time (ms) vs. Number of Instances
Fig.9. Accuracy vs. Number of Instances
Fig.10. Precision vs. Number of Instances
Fig.11. Recall vs. Number of Instances
Fig.12. F-score vs. Number of Instances
Fig.13. Execution Time (ms) vs. Number of Instances
Fig.14. Accuracy vs. Number of Instances
Fig.15. Precision vs. Number of Instances
Fig.16. Recall vs. Number of Instances
Fig.17. F-score vs. Number of Instances
Fig.18. Execution Time (ms) vs. Number of Instances


MSE is a crucial criterion used to assess an estimator's efficiency. Since MSE reflects the total grey-scale error over the entire frame, it is frequently used to gauge the amount of frame degradation. In mathematical notation, the MSE is computed as
MSE = \frac{1}{mn} \sum_{X=0}^{m-1} \sum_{Y=0}^{n-1} \left[f(X, Y) - \hat{f}(X, Y)\right]^{2} \qquad (14)

Here, m and n denote the dimensions of the input frame, and f and \hat{f} represent the original and distorted frames, respectively. A lower MSE value indicates less error, whereas a larger MSE value indicates more error. The main properties of MSE are as follows:
• The MSE value is employed to compare two statistical models.
• If the MSE is zero, the estimator is regarded as having perfect accuracy; in practice this criterion is rarely met.
• MSE, computed from a model over a particular observation set, is used to evaluate predictors.
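For reference, Eq. (14) can be computed directly in NumPy as in the short sketch below, assuming the original and distorted frames are arrays of identical shape.

import numpy as np

def mse(original, distorted):
    # Mean squared error of Eq. (14) between two frames of identical shape
    original = np.asarray(original, dtype=float)
    distorted = np.asarray(distorted, dtype=float)
    return np.mean((original - distorted) ** 2)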

Fig.19. MSE vs. Number of Instances (2000-10000; Mask RCNN, YOLO, YOLOv3, YOLO Nano)
Fig.20. MSE vs. Number of Instances (2000-10000; Mask RCNN, YOLO, YOLOv3, YOLO Nano)
Fig.21. MSE vs. Number of Instances (2000-10000; Mask RCNN, YOLO, YOLOv3, YOLO Nano)


V. CONCLUSION
In this study, visual object tracking is carried out on videos by training a detector on a customized dataset made up of 30,000 images. Moving objects are detected and tracked across successive frames using the YOLO detector and the SORT tracker. Accuracy and precision can be improved by fine-tuning the detector and training the system over more epochs. The SORT tracker follows a tracking-by-detection methodology, so its effectiveness depends entirely on the performance of the detector. Since the approach can be applied to several video domains and can identify and track various objects, the system can be trained for more classes (more types of objects) in future work. Our method is restricted to people and cars, but it can be extended to include other objects or reduced to a single object with a smaller dataset. For the proposed object detection and tracking, several types of extracted features and object trackers can be used, producing a variety of data for analysis.
Declaration:
Ethics Approval and Consent to Participate:
No human participants were involved in this implementation process.
Human and Animal Rights:
No violation of Human and Animal Rights is involved.
Funding:
No funding is involved in this work.
Data availability statement:
Data sharing not applicable to this article as no datasets were generated or analyzed during the
current study
Conflict of Interest:
Conflict of Interest is not applicable in this work.
Authorship contributions:
All authors contributed equally to this work.
Acknowledgement:
There is no acknowledgement involved in this work.

REFERENCES
1. Vasavi, S., Priyadarshini, N.K., & Harshavaradhan, K. (2021). Invariant Feature-Based Darknet
Architecture for Moving Object Classification. IEEE Sensors Journal, 21, 11417-11426.
2. Bourja, O., Derrouz, H., Abdelali, H.A., Maach, A., Thami, R.O., & Bourzeix, F. (2021). Real Time
Vehicle Detection, Tracking, and Inter-vehicle Distance Estimation based on Stereovision and Deep
Learning using YOLOv3. International Journal of Advanced Computer Science and Applications.
3. Abri, S., Abri, R., Yarıcı, A., & Çetin, S. (2020). Multi-Thread Frame Tiling Model in Concurrent
Real-Time Object Detection for Resources Optimization in YOLOv3. Proceedings of the 2020 6th
International Conference on Computer and Technology Applications.
4. Bahri, H., Chouchene, M., Sayadi, F., & Atri, M. (2019). Real-time moving human detection using
HOG and Fourier descriptor based on CUDA implementation. Journal of Real-Time Image
Processing, 1 - 16.
5. Pandiyan, P., Thangaraj, R., Subramanian, M., Rahul, R., Nishanth, M., & Palanisamy, I. (2022).
Real-time monitoring of social distancing with person marking and tracking system using YOLO V3
model. Int. J. Sens. Networks, 38, 154-165.
6. Shahzad Alam, M., Ashwin, T.S., & Ram Mohana Reddy, G. (2019). Optimized Object Detection
Technique in Video Surveillance System Using Depth Images.
7. Venkatesh, A., N, P.K., & Talawar, K. (2019). An Efficient FPGA based Reconfigurable
Architecture for Object Detection using Adaptive Threshold. 2019 4th International Conference on
Electrical, Electronics, Communication, Computer Technologies and Optimization Techniques
(ICEECCOT), 234-239.
8. Pudasaini, D., & Abhari, A. (2020). Scalable Object Detection, Tracking and Pattern Recognition
Model Using Edge Computing. 2020 Spring Simulation Conference (SpringSim), 1-11.
9. Dahirou, Z., & Zheng, M. (2021). Motion Detection and Object Detection: Yolo (You Only Look
Once). 2021 7th Annual International Conference on Network and Information Systems for
Computers (ICNISC), 250-257.
10. Zilkha, M., & Spanier, A.B. (2019). Real-time CNN-based object detection and classification for
outdoor surveillance images: daytime and thermal. Security + Defence.
11. Kumar, K., Nandan, D., & Mishra, R.K. (2019). Compact Hardware of Running Gaussian Average
Algorithm for Moving Object Detection Realized on FPGA and ASIC. Rev. d'Intelligence Artif., 33,
305-311.
12. Chenini, H. (2017). Pipelined architecture for real time detection and tracking of moving objects on
an hybrid device. 2017 International Conference on Advanced Technologies for Signal and Image
Processing (ATSIP), 1-6.
13. Philip, A. (2013). Subtraction Algorithm for Moving Object Detection Using Denoising Architecture
in FPGA.
14. Chung, W., Chakraborty, G., Chen, R.C., & Bornand, C. (2019). Object Dynamics from Video Clips
using YOLO Framework. 2019 IEEE 10th International Conference on Awareness Science and
Technology (iCAST), 1-6.
15. Kryjak, T., & Gorgon, M. (2011). Real-Time Implementation of Moving Object Detection in Video
Surveillance Systems using FPGA. Comput. Sci., 12, 149-162.
16. Chopda, R. (2022). Design and Development of an Autonomous Car using Object Detection with
YOLOv4.
17. Algorry, A.M., García, A.G., & Wofmann, A.G. (2017). Real-Time Object Detection and
Classification of Small and Similar Figures in Image Processing. 2017 International Conference on
Computational Science and Computational Intelligence (CSCI), 516-519.
18. Dagar, S., & Nijhawan, G. (2019). Modified Architecture for Detection of Moving Objects. 2019
International Conference on Machine Learning, Big Data, Cloud and Parallel Computing
(COMITCon), 47-50.
19. Wang, Y., Cheng, H., Zhou, X., Luo, W., & Zhang, H. (2020). Moving Ship Detection and Movement Prediction in Remote Sensing Videos. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 1303-1308.
20. Aishwarya, C., Mukherjee, R., & Mahato, D.K. (2018). Multilayer vehicle classification integrated
with single frame optimized object detection framework using CNN based deep learning
architecture. 2018 IEEE International Conference on Electronics, Computing and Communication
Technologies (CONECCT), 1-6.
21. Padigel, A. (2022). Real Time Object Detection Using Deep Learning. International Journal for
Research in Applied Science and Engineering Technology.
22. Jaiswal, D., & Kumar, P. (2019). Real-time implementation of moving object detection in UAV
videos using GPUs. Journal of Real-Time Image Processing, 1-17.
23. Mahendru, M., & Dubey, S.K. (2021). Real Time Object Detection with Audio Feedback using Yolo
vs. Yolo_v3. 2021 11th International Conference on Cloud Computing, Data Science & Engineering
(Confluence), 734-740.
24. Sivakumar, P., V, J., R, R., & S, K. (2021). Real Time Crime Detection Using Deep Learning
Algorithm. 2021 International Conference on System, Computation, Automation and Networking
(ICSCAN), 1-5.
25. Rahman, Z., Ami, A.M., & Ullah, M.A. (2020). A Real-Time Wrong-Way Vehicle Detection Based
on YOLO and Centroid Tracking. 2020 IEEE Region 10 Symposium (TENSYMP), 916-920.
26. Canepa, A., Ragusa, E., Zunino, R., & Gastaldo, P. (2021). T-RexNet—A Hardware-Aware Neural
Network for Real-Time Detection of Small Moving Objects. Sensors (Basel, Switzerland), 21.
27. Lee, J., & Hwang, K. (2021). YOLO with adaptive frame control for real-time object detection
applications. Multimedia Tools and Applications.
28. Kumar, P., Singhal, A., Mehta, S., & Mittal, A. (2012). Real-time moving object detection algorithm
on high-resolution videos using GPUs. Journal of Real-Time Image Processing, 11, 93-109.
29. Asish, S.S., Reddy, K.N., Chaithra, S., & Lakkoju, S.S. (2020). Dynamically Moving Object Detection by Using YOLO Algorithm.
30. N, S., & Meenakshi, M. (2021). Efficient reconfigurable architecture for moving object detection
with motion compensation.