
Guardians AI

Keeping an eye everywhere

By:
1. Bhagra Jyoti Behera
2. Sneha Bajpai
3. Aditya Priya
GUARDIANS AI
In this country, 1.54 million CCTV cameras are installed for crime prevention and property
protection. Unfortunately, many of them are not really utilised until after a crime has taken
place. They do help prevent crime, but it is the human eye that must watch every CCTV
screen continuously, round the clock, so we inevitably overlook crucial clues or evidence,
and there is a real limit to what human operators can cover. There is, however, technology to
help us overcome these human limitations, and our AI solution, Guardians AI, is that
technology. Guardians AI's network is designed to spot suspicious persons or activities and
immediately alert the organisations connected to the solution. Human recognition technology
is one of the keys to spotting suspicious activity in live CCTV footage, so our solution uses
image, face, and action recognition to detect what is going on and report it to you
immediately.

Features:
1. Reduced need for constant human monitoring of camera feeds
2. Cost effective
3. Better than a human alone (an AI can monitor footage continuously and more efficiently than a human)
4. Helps to prevent crime
5. Helps the police locate the exact clue
Motivation:
We learned of cases where a crime was taking place but no one could help the victim: the
police did not know, bystanders did not know what was happening, and yet a CCTV camera
was right there. This inspired us to build an AI tool that notifies the CCTV owner and the
nearest police station whenever suspicious activity is detected, helping them stop it before it
becomes a much bigger issue.

Key points:
Vision-based, Action recognition, Activity recognition, Real-world challenges,
Dataset, Depth camera sensor, Wearable inertial sensor, Sensor fusion
Abstract
In this work, the state of the art in human action recognition systems is reviewed. Because
interaction between humans and computers is necessary for the integrated use of vision in
safety devices and artificial intelligence, we bring forth a human action recognition system:
a real-time, sensor-fusion-based human action identification system that utilises a depth
camera and an inertial sensor at the same time. Recognising and classifying human actions is
a difficult problem for computer vision, as it requires the system to precisely identify and
categorise human behaviours in real situations. In this study we give an overview of the
many methods and procedures for recognising human action that have been put forward in
the literature, including hand-crafted features, deep learning-based methods, and hybrid
models. We also go through the datasets commonly used for benchmarking and evaluating
human action recognition algorithms, as well as the performance evaluation measures. We
then discuss some of the important issues and goals for the field's future research, such as
enhancing performance in complicated contexts, handling huge datasets, and adding temporal
information for improved recognition accuracy. Finally, we summarise the most recent
developments and trends in the study of human action recognition and provide some
suggestions for future research areas.
Introduction
A growing number of applications for human computer interfaces benefit from human action
recognition, which is making its way into commercial devices. Examples of applications
include gaming, smart assistive living, and hand gesture interaction. In order to conduct
human action recognition, various sensors have been used. These sensors include traditional
RGB cameras, like those in [1] through [3], depth cameras, like those in [4] through [7], and
inertial sensors, like those in [8] through [10]. The movement of the various body parts is
frequently the result of functional movements that conceal ideas and intentions. Depending
on the complexity of the action and the body parts involved, human activities are divided
into four groups:
 Gesture: A physical movement that clearly conveys a message. It is not spoken or vocal
communication but a motion made with the hands, face, or other body parts, such as the
okay sign or a thumbs-up.
 Action: A collection of bodily motions performed by a single individual, such as
walking or running.
 Interaction: A series of acts carried out by no more than two actors. The other
subject may be a human or an object (hand shaking, chit-chatting, etc.), but at least one
of the subjects must be a person.
 Group activities: A combination of gestures, actions, or interactions involving at
least two performers and one or more interactive activities (such as volleyball or
obstacle courses).
Automatically analysing and identifying the type of an action from unidentified video
sequences is the goal of human action recognition (HAR). The demand for automatic
interpretation of human behaviour is increasing, and HAR has attracted interest from both
academics and business. In fact, assessing and comprehending someone's actions is crucial
for a variety of applications, including video indexing, biometrics, surveillance, and security.
According to Lara and Labrador (2013), Pfeifer and Voelker (2015), Chen et al. (2017),
circuit miniaturisation in new submicron technologies has enabled embedded applications to
support a variety of sensors that provide information about human actions. The
aforementioned sensors vary in terms of cost, simplicity of installation, and data output type.
A person's activity can be detected using sensors like accelerometers, GPS, cameras, and
Leap Motion (Ameur et al., 2016; Mimouna et al., 2018). Each sensor is employed in
accordance with the needs of the application, depending on the type of information collected
(Valentin, 2010; Debes et al., 2016). Because a camera can provide such a vast amount of
information, the usage of vision sensors, for instance, is particularly intriguing (Berrached,
2014). In this research, we mainly concentrate on cameras as the first sensor for vision-based
action recognition systems.

The HAR field is a difficult topic because of the numerous problems it faces, including
anthropometric variation (Rahmani et al., 2018; Baradel et al., 2017), multiview variation
(Liu et al., 2017a; Wang et al., 2014), cluttered and dynamic backgrounds (Afsar et al., 2015;
Duckworth et al., 2016), and inter-class similarity and intra-class variability (Zhao and Ji,
2018; Liu et al., 2018).
The goal of HAR systems is to examine a real-world scenario and accurately identify human
behaviours. The overall design of any HAR system is depicted in Fig. 1. A feature extraction
technique extracts symbolic or numerical data from individual frames of pictures or videos.
A classifier will then apply labels in accordance with these derived features. The method is
made up of several steps that guarantee an effective description of actions. Recently, a
number of open datasets devoted to recognising human behaviour and actions were released.
Different types of sensors, including Kinect, accelerometers, Motion Capture (MoCap), and
infrared and thermal cameras, were used to collect this data. However, for a wide range of
applications, RGB (Red, Green, and Blue) data have drawn a lot of interest (Zhang et al.,
2017). The order in which the various human action datasets appeared corresponds to the
HAR issues that the scientific community has had to deal with.
It is fascinating to evaluate how well frame representation and classification approaches work
given the current datasets' near resemblance to reality and variety of action descriptions. We
have combed over surveys relating to HAR. None of them have provided an overview of the
approaches used to address all real-world problems found in publicly available datasets that
detail these problems.
However, some thorough studies only present the publicly available datasets from the
literature that describe human actions (Ahad et al., 2011; Chaquet et al., 2013; Zhang et al.,
2016; Liu et al., 2017b, 2019), while others present a thorough analysis of the various
methods put forth to identify human actions.
Numerous sensors have been employed to detect physical activities. The type of sensors used
may then be used to categorise the planned surveys. For instance, Aggrawal et al. (Aggarwal
and Xia, 2014) outlined the key methods for HAR from 3D data, paying special emphasis to
methods that made use of depth data. In their 2015 paper, Sunny et al. provided a summary of
the uses and difficulties of HAR using mobile phone sensors.
The bulk of researchers have focused their attention on vision sensors, particularly cameras,
given the abundance and value of the images and videos used in HAR.
For this reason, a plethora of surveys have reviewed the vision-based HAR methods
(Moeslund et al., 2006; Turaga et al., 2008; Weinland et al., 2011; Aggarwal and Ryoo, 2011;
Guo and Lai, 2014; Subetha and Chitrakala, 2016; Dhamsania and Ratanpara, 2016;
Dhulekar et al., 2017; Zhang et al., 2017, 2019; Herath et al., 2017; Wang et al., 2018; Singh
and Vishwakarma, 2019; Ji et al., 2019).
In 2010, Poppe (2010) offered a prior survey in which a thorough analysis of vision-based
HAR was given. Based on their capacity to address the bulk of HAR issues and their
potential to be generalised, Ramanathan et al. (2014) gave an overview of the available
techniques.
Given the growing quantity of HAR-related papers, existing surveys still fall short of covering all
the newly published studies and all the datasets in the literature. This survey's objective is to
evaluate the recommended approaches and available datasets in terms of how well they can
address HAR issues.
The following are this paper's key contributions: We thoroughly examine each issue HAR
systems face as well as the characterisation techniques suggested to address them. We
examine and categorise the various categorization methods. We explain and categorise the
current datasets that depict challenges in the actual world so that the effectiveness of various
recommended HAR techniques can be assessed.
The rest of the document is organised as follows: Section II describes the
challenges of HAR and image representation. Action categorisation methods and training
techniques are covered in Section III. Then, in Section IV, a summary of the primary datasets
utilised to test HAR approaches is presented. Finally, Section V serves as the paper's
conclusion.

Figure: General layout of a HAR (human action recognition) system.

2. Related work
Temporal templates were developed by Bobick and Davis [3] in their early work [14]. They
can identify numerous kinds of aerobic activities using Motion Energy Images (MEI) and
MHI. The Motion Gradient Orientation (MGO) was also suggested by them [4] as a way to
explicitly encode changes in an image brought on by motion events. A useful hierarchical
extension for computing a local motion field from the original MHI form was also provided
by Davis [6]. The MHI was converted into an image pyramid, making it possible to convolve
efficient fixed-size gradient masks at all levels of the pyramid and extract motion information
at a variety of speeds. The hierarchical MHI approach is a computationally efficient method
for representing, characterising, and recognising human movement in video. Utilising
localised space-time features and these models, Schuldt et al. [18] described a method for
recognising complex motion sequences in video using machine learning (SVM) techniques.
The "30-pixel man" is the objective of a research project by Efros et al. [8], which
concentrates on the issue of poor quality videos of human action. They propose an optical
flow measurement-based spatio-temporal descriptor in this regard and employ it to recognise
activities in databases for sports such as tennis, football, and ballet. Weinland and colleagues
introduced Motion History Volumes (MHV) as a free-viewpoint representation for human
motions in a variety of calibrated, background-subtracted movies in their article [19]. They
offered formulas for determining, comparing, and aligning the MHVs of various actions
taken by numerous persons from diverse points of view. Ke et al. [10] looked at the use of
volumetric features as an alternative to local descriptor techniques for event recognition in
video sequences. They generalised the concept of 2D box features to 3D spatio-temporal
volumetric features and developed a set of volumetric feature-based filters that efficiently
scan recordings in time as well as space for each action of interest, resulting in a real-time
event detector. To achieve high-speed recognition, Ogata et al. [16] suggested Modified
Motion History Images (MMHI) and used an eigenspace approach; six human motions were
identified in their experiment. Wong and Cipolla [20] proposed a novel method based on
MGO extraction to recognise basic motions,
and it was then applied to continuous gesture recognition [21]. The Histogram of Oriented
Gradients (HOG), an image appearance descriptor, and a detection method for detecting and
tracking people in video were both developed by Dalal et al. Dollar et al. [7] proposed a
method whereby researchers "use an innovative spatio-temporal interest point detector to get
a global measurement" within the framework of the local features in [8]. Furthermore,
Niebles et al. [15] use spatio-temporal interest points to derive spatio-temporal words as their
features. Yeo et al. [22] generate motion vectors from optical flow and estimate
frame-to-frame motion similarities to investigate human behaviour in video. According to
Blank et al. [2], human actions may be seen as three-dimensional
shadows of space-time volumes. They developed a method for evaluating 2D forms in order
to handle volumetric space-time action shapes. Oikonomopoulos et al. created a limited
depiction of optical patterns and an accumulation of spatial events that were concentrated at
key locations to reflect our understanding of human behaviour in both space and time [17].
We can observe that the motion features employed in a number of these approaches [8, 18,
19, 15, 5, 7, 2, 17, 10, 22] are rather complex, implying a high processing cost when the
features are computed. Others are not [3, 4, 6, 20, 21, 16, 2], as they do not require
segmentation, tracking, or other computationally intensive processes.
Figure 1. An MHI illustration. A frame from the original hand-waving action video clip
is shown in (a), and its MHI is shown in (b). The pixels on the vertical red line in (b) range
from (60, 11) to (60, 80).

We have previously presented an SVM-based system built on the fundamental motion
features MHI, MMHI, and MGO for real-time embedded vision applications in intelligent
environments [12, 11, 13]. Although they perform admirably in embedded artificial
intelligence applications, they fall short in general on challenging real-world datasets.
The goal of this work is to develop a solution that uses compact representations, is quick to
calculate, and still performs better on classification tasks than existing compact and quick
approaches. By adding additional motion features and combination methods with
significantly better performance, we expanded the work of [12, 11, 13].
3. Hierarchical Motion History Histogram (HMHH)
Before introducing the "Hierarchical Motion History Histogram" (HMHH) and explaining its
purpose, we first review the motion history image (MHI).
3.1. Motion History Image
Normally, an MHI HT(u, v, k) at time k and location (u, v) is defined by equation 1:

HT(u, v, k) = { T, if D(u, v, k) = 1
              max{0, HT(u, v, k-1) - 1}, otherwise (1)

where T is the maximum duration for which a motion is stored, and the motion mask
D(u, v, k) is a binary image obtained by differencing consecutive frames. T is frequently
assigned a value of 255, making it straightforward to display the MHI as a one-byte-deep
grayscale image. An MHI is hence a grayscale image whose pixels take a range of values,
while the MEI is its binary counterpart, which can easily be computed by thresholding
HT > 0.
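The per-frame MHI update in equation 1 is straightforward to implement. Below is a minimal
NumPy sketch of one update step; the frame-differencing threshold and the function name
update_mhi are our own illustrative choices, not taken from the paper.

import numpy as np

def update_mhi(mhi, prev_frame, frame, T=255, diff_threshold=25):
    # One step of equation 1: frame-difference the two grayscale frames to get the
    # binary motion mask D, then set moving pixels to T and decay the rest by 1.
    D = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16)) > diff_threshold
    return np.where(D, T, np.maximum(mhi.astype(np.int16) - 1, 0)).astype(np.uint8)

# Usage over a clip of shape (N+1, U, V), dtype uint8:
# mhi = np.zeros(frames[0].shape, dtype=np.uint8)
# for k in range(1, len(frames)):
#     mhi = update_mhi(mhi, frames[k - 1], frames[k])
# The MEI (binary counterpart) is then simply mhi > 0.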
Figure 2. D(:, :, :) on the red line of figure 1(b) is shown. Each row is D(u, v, :) for one fixed
pixel (u, v). A white block represents '1' and a black block '0'. The horizontal green line is
the 'binarised frame difference history' or 'motion mask' of pixel (60, 50) through time, i.e.,
D(60, 50, :).
A frame from the original hand-waving clip is shown in Figure 1(a), and the MHI for this
action is shown in Figure 1(b). It is clear that the MHI is a frame-sized image that also
encodes some motion information about the event.
In order to examine the MHI more closely, we have selected the pixels on the red line in
figure 1(b). If something occurred at pixel (u, v) in frame k, then D(u, v, k) = 1; otherwise,
D(u, v, k) = 0. The coordinates of these pixels are (60, 11), (60, 12), ..., (60, 80). The motion
mask D(u, v, :) for pixel (u, v) is represented by the binary sequence:
D (u, v, :) = (b1, b2, ···, bN), bi ∈ {0, 1} (2)
where N + 1 is the total number of frames.
All the motion masks on the red line are shown in Figure 2. Each row is D(u, v, :) for a single
fixed pixel (u, v); within the patterns, a white block represents '1' and a black block '0'. The
green line, which represents the motion mask D(60, 50, :), has the following sequence:
0000000001101000000000000000000000001010000 (3)
It is clear from the definition of MHI in equation 1 that the MHI only stores the most recent
action that occurred at each pixel (u, v). For example, in the MHI at pixel (60, 50), only the
final '1' of sequence 3 is kept. The earlier '1's in the sequence, during which some activity
took place, are obviously not reflected. It is also evident that nearly every pixel has multiple
'1's in its sequence. This inspired us to create a new representation (the HMHH, explained in
the next subsection), which makes use of all of the information in the sequence while still
remaining condensed and practical.

Figure 3. HMHH example. Four patterns P1, P2, P3 and P4 were selected. The results were
generated from the hand-waving action in figure 1. For each pattern Pi, HMHH(:, :, Pi) has
the same size as the original frame.

3.2. HMHH
According to equation 4, we define patterns Pi in the sequence D(u, v, :) based on the number
of connected '1's:

P1 = 010, P2 = 0110, P3 = 01110, ..., Pi = 0 1···1 0 (i consecutive '1's) (4)
Equation 5 is used to denote a subsequence Ci, and Ω{D(u, v, :)} is used to denote the set of
all subsequences of D(u, v, :). The number of occurrences of any particular pattern Pi in the
sequence D(u, v, :) can therefore be counted for each pixel (u, v), as stated in equation 6,
where 1{·} is the indicator function.
Ci = (bn1, bn2, ···, bni) (5)
HMHH(u, v, Pi) = ∑j 1{Cj = Pi | Cj ∈ Ω{D(u, v, :)}} (6)
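As a concrete illustration of equations 5 and 6, the sketch below counts the patterns Pi in a
single pixel's motion-mask sequence. The helper name count_patterns and the choice to clip
longer runs into the last bin are our own assumptions for illustration only.

def count_patterns(sequence, max_pattern=4):
    # Return h where h[i-1] counts occurrences of pattern P_i, i.e. runs of
    # exactly i consecutive '1's, for i = 1..max_pattern.
    counts = [0] * max_pattern
    run = 0
    for bit in list(sequence) + [0]:          # trailing 0 closes the final run
        if bit == 1:
            run += 1
        else:
            if 1 <= run <= max_pattern:
                counts[run - 1] += 1
            elif run > max_pattern:
                counts[max_pattern - 1] += 1  # clip long runs into the last bin
            run = 0
    return counts

# The sequence of pixel (60, 50) from equation 3:
seq = [int(c) for c in "0000000001101000000000000000000000001010000"]
print(count_patterns(seq))   # -> [3, 1, 0, 0]: three P1 runs and one P2 run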
We may create a greyscale image for each pattern Pi and refer to it as the Motion History
Histogram (MHH), since each bin value counts the number of occurrences of that pattern
type. We refer to the collection of patterns Pi, i = 1...M as the "Hierarchical Motion History
Histogram" (HMHH) representation.
HMHH(:, :, Pi) can be used to illustrate the pattern Pi. The four distinct patterns P1, P2, P3,
and P4 obtained from the hand-waving motion in Figure 1 are shown in Figure 3. When
comparing the HMHH in figure 3 against the MHI in figure 1, it is noteworthy that the
HMHH separates the area of the MHI into several segments based on patterns. Compared to
the hierarchical MHI described by Davis [6], where only small-sized MHIs were obtained,
the HMHH records the full information of each activity. The following procedure may be
used to calculate the HMHH at low computational cost.
Algorithm (HMHH)
Input: video clip f(u, v, k), u = 1,...,U, v = 1,...,V, frame k = 0, 1,...,N
Initialisation: patterns P1,...,PM, HMHH(1:U, 1:V, 1:M) = 0, I(1:U, 1:V) = 1
For k = 1 to N
    Compute: D(:, :, k)
    For u = 1 to U
        For v = 1 to V
            If subsequence Cj = {D(u, v, I(u, v)), ..., D(u, v, k)} = Pi
                Update: HMHH(u, v, Pi) = HMHH(u, v, Pi) + 1
            End If
            Update: I(u, v)
        End For
    End For
End For
Output: HMHH(1:U, 1:V, 1:M)
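For reference, here is a compact NumPy sketch of the procedure above. It is our own reading
of the algorithm, not the authors' code: the frame-differencing threshold and the per-pixel
run-length bookkeeping (used in place of the explicit start index I(u, v)) are implementation
choices.

import numpy as np

def hmhh(frames, num_patterns=4, diff_threshold=25):
    # frames: array of shape (N+1, U, V), grayscale, dtype uint8.
    # Returns HMHH of shape (U, V, num_patterns): bin i-1 counts occurrences of
    # pattern P_i (a run of exactly i consecutive '1's) in D(u, v, :).
    frames = frames.astype(np.int16)
    hist = np.zeros(frames.shape[1:] + (num_patterns,), dtype=np.int32)
    run = np.zeros(frames.shape[1:], dtype=np.int32)   # current run of '1's per pixel

    def close_runs(mask):
        # Add finished runs to the histogram (long runs fall into the last bin).
        lengths = np.clip(run[mask], 0, num_patterns)
        for i in range(1, num_patterns + 1):
            idx = np.zeros_like(mask)
            idx[mask] = (lengths == i)
            hist[idx, i - 1] += 1

    for k in range(1, frames.shape[0]):
        D = np.abs(frames[k] - frames[k - 1]) > diff_threshold   # motion mask D(:, :, k)
        close_runs((~D) & (run > 0))     # runs that end at this frame
        run = np.where(D, run + 1, 0)    # extend or reset runs
    close_runs(run > 0)                  # runs still open at the end are also counted
    return hist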
3.3. Motion Geometric Distribution (MGD)
The size of the HMHH representation makes it challenging to provide to a classifier, thus we
are searching for a smaller version that accurately depicts the geometrical nature of the
motion throughout the image. To achieve this, equation 7 shows how an MHH is represented
in binary as MHHb.
MHHb(u, v, Pi) = { 1, if MHH(u, v, Pi) > 0
0, otherwise (7)
Then, for a particular pattern Pi, we sum each row of MHHb to produce a vector of length V.
By summing the columns, we create another vector of length U. By utilising all M levels of
the binarised MHH hierarchy, we create a "Motion Geometric Distribution" (MGD) vector
that is quite small compared to the size of the original HMHH and MHI features. This MGD
vector has size M(U + V). Thus, equation 8 expresses the MGD vector:

MGD = { ∑u MHHb(u, v, Pi), ∑v MHHb(u, v, Pi) },   i = 1, 2, ..., M (8)
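A sketch of the MGD projection in equations 7 and 8 follows, assuming an HMHH array
shaped as in the hmhh() sketch above. The concatenation order of the two projections is our
own choice; the paper only fixes the overall size M(U + V).

import numpy as np

def mgd(hmhh_array):
    # hmhh_array: (U, V, M) pattern counts -> MGD vector of length M * (U + V).
    mhh_b = (hmhh_array > 0).astype(np.int32)      # binarisation, equation 7
    parts = []
    for i in range(mhh_b.shape[2]):                # one level per pattern P_i
        plane = mhh_b[:, :, i]
        parts.append(plane.sum(axis=0))            # sum over u -> vector of length V
        parts.append(plane.sum(axis=1))            # sum over v -> vector of length U
    return np.concatenate(parts)                   # equation 8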

Figure 4. SVM based human action recognition system.

4. SVM based human action recognition


Meng et al. [12] created a fast human action recognition system that used a linear SVM and
simple motion features. A linear SVM was chosen for this design because it has historically
shown outstanding results in many real-world classification problems and because it can
handle very high-dimensional feature vectors. The system examined the MHI, MMHI, and
MGD motion features, which offer varying degrees of performance.
In [12], the system is generalised so that it can handle additional motion features. The overall
design of the human action recognition system is shown in Figure 4. The training stage is
performed offline in order to obtain the SVM parameters. The classification stage then
computes the inner product between the motion features extracted from the test video and
the SVM parameters.
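As a hedged illustration of this pipeline, the sketch below trains a linear SVM on MGD
vectors, reusing the hmhh() and mgd() sketches above. The use of scikit-learn's LinearSVC,
the C value, and the helper names are assumptions for illustration, not the authors'
implementation.

import numpy as np
from sklearn.svm import LinearSVC

def train_action_classifier(clips, labels):
    # clips: list of (N+1, U, V) grayscale videos; labels: list of action ids.
    # Offline training stage: extract MGD features, then fit the linear SVM.
    X = np.stack([mgd(hmhh(clip)) for clip in clips])
    clf = LinearSVC(C=1.0)
    clf.fit(X, labels)
    return clf

def predict_action(clf, clip):
    # Classification reduces to inner products between the MGD feature vector
    # and the learned SVM weight vectors (clf.coef_), as described above.
    return clf.predict(mgd(hmhh(clip)).reshape(1, -1))[0]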
