
2022 International Conference on Futuristic Technologies (INCOFT)

Karnataka, India. Nov 25-27, 2022

Human Activity Detection Using Pose Net


Sudharshan Duth P
Department of Computer Science
Amrita Vishwa Vidyapeetham
Amrita School of Arts and Sciences
Mysuru, India
p_sudharshanduth@my.amrita.edu

Poojashree B S
Department of Computer Science
Amrita Vishwa Vidyapeetham
Amrita School of Arts and Sciences
Mysuru, India
Poojashreeni111@gmail.com
2022 International Conference on Futuristic Technologies (INCOFT) | 978-1-6654-5046-1/22/$31.00 ©2022 IEEE | DOI: 10.1109/INCOFT55651.2022.10094346

Abstract—Human detection and tracking is an important field of study that has attracted considerable attention recently. Although commercial technologies for human identification and counting are now available, further experimentation is required to address the remaining challenges. Abnormality detection, commonly associated with automated video surveillance, is the process of observing and describing human behaviour and interactions in a crowded setting. People must be detected and tracked in order to maintain reliability, welfare, and site management. Object identification is a critical stage in abnormality detection and automated video surveillance. Background subtraction is a common technique for distinguishing moving objects from static images and for detecting human actions in video segments: as the name implies, it separates foreground objects from the background in a sequence of video frames. In this scenario, the primary purpose of abnormality detection is to recognise and track a moving object using recorded video frames and PoseNet.

Keywords—Background subtraction, PoseNet, CNN

I. INTRODUCTION

Computer vision is a multidisciplinary field concerned with how computers can be made to gain a high-level understanding of digital images or videos. From an engineering perspective, it seeks to automate tasks that the human visual system can perform. Computer vision problems include techniques for acquiring, processing, analysing, and understanding digital images, as well as extracting high-dimensional data from the real world in order to produce numerical or symbolic information, for example in the form of decisions. Here, understanding refers to the conversion of visual representations into descriptions of the world that can interact with other cognitive processes and lead to appropriate action. Human activity detection and pose estimation have attracted attention in a variety of applications, including video-based recognition and human–computer interaction. However, considerable research is still under way to improve precision and speed. In most cases, activity recognition and pose estimation are performed separately.

Even though pose is strongly linked with activity detection, no solution that addresses both problems simultaneously for the benefit of activity recognition is being researched. One of machine learning's primary advantages is its capacity for end-to-end optimization, and a problem formulation that can be optimized end to end offers several advantages.

Convolutional Neural Networks (CNNs) have grown in popularity in recent years as a solution to image classification problems. Following their success in image recognition and description, researchers have begun to use CNNs more widely for visual classification. Owing to lighting conditions, occlusion, background clutter, deformation, scale, and inter-class variance, classifying real-life videos into unconstrained, free-form activities is a difficult task.

Human Action Recognition (HAR) analyses human behaviour and assigns a label to each action. It has a large number of applications and is becoming increasingly popular in computer vision. Skeleton, depth, infrared, point cloud, audio, acceleration, radar, and WiFi signals are some of the data modalities that can be used to represent human actions. Depending on the application environment, these modalities encode different sources of useful yet distinct information and offer different benefits. Accordingly, many previous works have investigated HAR techniques that employ various modalities. HAR is important for a range of real-world applications. It can be used to detect potentially dangerous human actions in visual surveillance and to support safe operation in autonomous navigation systems. It is also helpful for video retrieval, human–robot interaction, and entertainment. The aim of HAR is to automatically analyse and recognise the characteristics of an action in unknown video sequences. Academia and industry are both interested in HAR because of the rising demand for automated interpretation of human activity. Indeed, analysing and understanding a person's behaviour is important for a variety of applications, including video indexing, biometrics, surveillance, and security.

II. LITERATURE REVIEW

Fatemeh et al. [1] present a hierarchical approach that includes HOG with background subtraction, as well as the use of a deep neural network and skeleton modelling methods. For feature selection and storing past information, a CNN and an LSTM recurrent network are combined, and finally a Softmax-KNN classifier is used to identify human actions.

Fernando et al. [2] propose a deep neural network for HAR. The network processes the measurements from multiple body-worn devices separately. Three datasets were used: Opportunity, Pamap2, and an industrial dataset. The architecture was evaluated and found to outperform the state of the art.

H. Mei et al. [3] present a way of detecting a large number of objects whose count is unknown and fluctuates over time. The multiple object tracking approach uses a graph structure to keep track of a number of hypotheses regarding the



quantity and trajectories of the objects in the video, based on preliminary object detection results in each frame, which may or may not be correct.

Rashmi R. Koli et al. [4] aim to create a device that recognises gestures and converts them into speech. CNN training receives particular emphasis. The concept entails creating a device that takes a gesture as input and outputs recognisable text using deep learning techniques.

M. Leo et al. [5] divide the recognition procedure into two steps: the human body posture is first estimated frame by frame, and then the temporal sequences of the detected postures are statistically modelled. Body postures are estimated from the binary shapes associated with people by selecting horizontal and vertical histograms as features and feeding them into an unsupervised clustering technique.

Shi et al. [6] proposed a strategy for spectrogram-based human activity classification using deep learning methods, including a deep convolutional generative adversarial network for enlarging and enriching the training set and a transfer-learned deep convolutional network for feature extraction and classification, which relies on a DCNN pre-trained on a large-scale RGB image dataset, namely ImageNet.

Noor et al. [7] offer a new method for recognising and summarising multiple human actions in surveillance videos. By separating each person's sequence from the frames, the method builds a new representation of the data. Each sequence is then analysed with 3D convolutional neural networks (3DCNNs) to detect and recognise the relevant behaviours. An action-based video summary records each person's actions at various points throughout the video.

Tingtian Li et al. [8] present a novel human activity recognition technique for complex events based on group skeletons. Their approach uses multi-scale spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton information from a large number of people. In addition to the usual key point coordinates, the authors also supply key point speed values as input.

Amel Ben Mahjoub et al. [9] describe a method for recognising human actions. Spatio-temporal interest points (STIP) are used to detect significant changes in the image. Histogram of Oriented Gradients (HOG) and Histogram of Optical Flow (HOF) descriptors are then used to extract the appearance and motion properties of these interest points. Finally, to assign a label to each video sequence, a Bag of Words (BOW) representation of the STIP descriptors is combined with a Support Vector Machine (SVM).

V. Parameswari et al. [10] take footage of COVID-19 patients as input and search the stored images for a match. The method is based on extracting Gabor features using the Gabor filter. The Gabor filter is used to extract features from the input images, and then a personal sample generation formula is applied.

Kavya et al. [11] note that active research topics include ubiquitous and mobile computing, surveillance-based security, context-aware computing, and ambient assisted living. The purpose of activity recognition is to recognise one or more agents' actions and goals from a set of observations. The methodology used in that work is based on a finite state machine (FSM). The technique is less complex because it recognises complex activities using a succession of simple push, pop, enqueue, and dequeue operations. When tested on a synthetic dataset, the suggested strategy outperforms competing methods, with an accuracy of over 80.9 percent. For the prescribed actions, the system provides correct and accurate results.

Ashokan et al. [12] note that, owing to its numerous applications in healthcare, security systems, and other fields, human activity detection has become one of the most actively explored disciplines in recent decades. Customers can use ATMs to access financial services quickly and conveniently, so it is critical to create a security system that identifies any unusual activity in order to ensure their safety. Such a system immediately notifies the police of any unusual event, allowing the police to quickly apprehend the culprit. The work compares the performance of three classifiers, random forest, SVM, and KNN, on the same dataset. All three approaches produce satisfactory results on activity detection. However, based on the experimental findings, the random forest classifier outperforms the SVM and KNN classifiers in detecting anomalous events in ATMs. Random forest is quick and reliable, making it well suited to real-time applications, and it also works well with massive datasets.

Amrutha et al. [13] observe that intelligent video surveillance and shopping behaviour analysis are only two examples of real-world applications of human behaviour recognition. Training reaches 76 percent accuracy over the first 10 epochs, and the model's accuracy can be increased by increasing the number of iterations. For testing, frames from videos are collected in a single folder. Using the trained model, the algorithm determines whether the frames are suspicious (cell phone use on campus, fighting, or fainting) or normal (walking, running). In the event of suspicious behaviour, a notice with the predicted class is sent to the appropriate authorities. The accuracy was found to be 87.15 percent.

Chitturi et al. [14] state that human activity discovery (HAD) is a widely discussed topic in smart environments. Their paper uses an unsupervised approach to recognise human behaviour, for which three options are provided. As a first step, either the LRS method is used or segments are extracted. The substrings from this first step are then grouped either directly or via subsequences retrieved from them. The clusters are then identified using phrase frequency. Dunn's index and the confusion matrix were used to assess the project. According to the findings, the proposed systems performed better than the existing systems. In addition, WEKA 3.6 was used to compare clustering techniques, showing that hierarchical clustering outperforms other

clustering algorithms on the provided data files.

Vinayakumar R. et al. [15] investigate the usefulness of the IRNN and various RNN variants for intrusion detection (ID). The detection rates achieved by IRNN methods on the KDDCup-99 intrusion data are very similar to those achieved by other RNN variants. Through numerous experiments with IRNN and RNN variant designs, the reasoning behind the network topology and its parameters has been thoroughly studied.

Soman K. P. et al. [16] report that, compared to typical machine learning classifiers, experiments using members of the RNN family produced a lower false positive (FP) rate. RNN architectures are popular because they can retain information for long-term dependencies across time lags and combine it with subsequent information in sequence connections. The usefulness of RNN architectures is also demonstrated on the UNSW-NB15 data.

Jie Yin et al. [17] first use a one-class SVM, trained on commonly available normal activities, to filter out the activities that have a very high likelihood of being normal. To lower the false positive rate in an unsupervised way, models of abnormal behaviour are then derived from a general normal model using kernel nonlinear regression. The method offers a favourable trade-off between the abnormality detection rate and the false alarm rate, and it enables the automatic derivation of abnormal activity models without the need to explicitly label the rare abnormal training data. Its efficacy is shown using real data gathered from a sensor network set up in a practical environment.

Gurkirt Singh et al. [18] describe a method that is easily extended to detect and classify simultaneously without requiring classification scores at the video level, which creates the possibility of classifying, detecting, and predicting activity online. They used the precomputed features offered by the competition organisers and implemented the linear SVM and random forest using scikit-learn.

Natarajan Kumaran et al. [19] present an optical flow covariance matrix model. Optical flow, which is regular in the temporal domain, is a vital component of video streams. When several people are present, anomalous activity can be found using the logistic regression method. Finally, benchmark datasets such as UMN, UCSD, and BEHAVE can be used to predict the behaviour of human crowds. The experimental findings demonstrate that the suggested strategy can accurately identify anomalous events in surveillance videos.

Baris Erol et al. [20] suggest a multilinear subspace approach for human behaviour detection based on the radar data cube (RDC). Using the RDC provides a powerful way of combining motion data from different domains to capture cross-correlations and interdependency. A key benefit of the suggested subspace method is a single representation that exploits the interdependence between the joint-variable domains of fast time, slow time, and Doppler frequency.

Singh T. et al. [21] observe that, owing to its many applications in fields including security and surveillance, human–computer interaction, patient monitoring systems, and robotics, vision-based human activity detection is quickly becoming a popular topic of research. The datasets are split into two categories: two-dimensional (2D-RGB) datasets and three-dimensional (3D-RGB) datasets. The most accurate state-of-the-art algorithms for these datasets are also presented, and the benefits and drawbacks of using 2D and 3D datasets are briefly reviewed.

The rest of the paper is organised as follows. The literature survey is in the second section. The third section describes the methodology. The fourth section discusses the proposed system. The fifth section contains the results and discussion, and the last section concludes the study.

III. METHODOLOGY

A. CNN

Human pose estimation is the task of extracting the body's skeletal key points and joint locations corresponding to the human body parts. It uses these key points and joints to assemble a two-dimensional structure of the human body. In this project, we used the OpenPose system to estimate pose from an input image. The image is passed to a CNN in OpenPose to extract feature maps from the input. The feature map is then processed through multiple CNN layers to produce Part Affinity Fields (PAFs) and confidence maps. To capture the human pose in the image, the part affinity fields and confidence maps produced above go through a bipartite graph matching algorithm.

B. PoseNet

The part affinity fields are two-dimensional vector fields that encode the position and orientation of body parts in the image. They encode this information as pairwise connections between body parts:

$$L = (L_1, L_2, L_3, \ldots, L_C), \quad L_c \in \mathbb{R}^{w \times h \times 2}, \quad c \in \{1, \ldots, C\}$$

The confidence maps are a two-dimensional representation of the belief that a particular body part is located at a given pixel:

$$P = (P_1, P_2, P_3, \ldots, P_J), \quad P_j \in \mathbb{R}^{w \times h}, \quad j \in \{1, \ldots, J\}$$

where J is the total number of body parts, $\mathbb{R}$ denotes the real numbers, and P is the set of confidence maps.

The number of key points detected by OpenPose depends on the dataset on which it was trained. The UCF101 dataset is used in this study, with 18 body key points: R Ankle, R Knee, R Wrist, L Wrist, R Shoulder, L Shoulder, L Ankle, L Ear, R Ear, R Elbow, L Elbow, L Knee, L Eye, R Eye, R Hip, L Hip, Nose.

1) Algorithm

The algorithm workflow is as follows:
1. Get the joints with OpenPose.
2. Track each person, using the Euclidean distance between the joints of two skeletons.
3. Extract features of the body and the normalized joint positions.
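Step 2 of the workflow can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the greedy nearest-match strategy, the `max_dist` threshold, and the assumption that joints are (x, y) tuples in normalized image coordinates are all choices made here for illustration.

```python
import math


def skeleton_distance(a, b):
    """Mean Euclidean distance between corresponding joints of two skeletons.

    Each skeleton is a list of (x, y) tuples, with None for an undetected joint.
    """
    dists = [math.dist(p, q) for p, q in zip(a, b)
             if p is not None and q is not None]
    return sum(dists) / len(dists) if dists else math.inf


def match_skeletons(previous, current, max_dist=0.2):
    """Greedily assign each skeleton in the new frame to the closest tracked person.

    previous: dict mapping person id -> skeleton from the last frame
    current:  list of skeletons detected in the new frame
    max_dist: hypothetical threshold; pairs farther apart are never matched
    Returns a dict person id -> skeleton; unmatched skeletons get new ids.
    """
    pairs = sorted(
        (skeleton_distance(prev, cur), pid, idx)
        for pid, prev in previous.items()
        for idx, cur in enumerate(current)
    )
    assigned, used = {}, set()
    for dist, pid, idx in pairs:
        if dist > max_dist:
            break  # all remaining pairs are even farther apart
        if pid in assigned or idx in used:
            continue
        assigned[pid] = current[idx]
        used.add(idx)
    next_id = max(previous, default=-1) + 1
    for idx, skeleton in enumerate(current):
        if idx not in used:
            assigned[next_id] = skeleton
            next_id += 1
    return assigned
```

Calling `match_skeletons(tracks, detections)` once per frame keeps person ids stable as long as a person moves less than `max_dist` between frames. A greedy match is the simplest option; an optimal assignment method could be substituted when many people overlap.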

A video file or a video stream is fed into the system. Then, from each frame, the OpenPose algorithm [12] is used to detect the human skeleton (joint locations). The skeleton data from the first N frames is then aggregated using a sliding window of size N. This skeletal data is preprocessed and used for feature extraction, and the extracted features are fed into a classifier to obtain the final recognition result.
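The sliding-window aggregation can be sketched as follows. This is a minimal illustration under stated assumptions rather than the system's actual code: `SlidingWindow`, `normalize`, and `joint_angle` are hypothetical names, every skeleton is assumed to be a complete list of (x, y) joints, and normalization by skeleton height is one plausible choice among several.

```python
import math
from collections import deque


def normalize(skeleton):
    """Centre the joints on their mean and scale by the skeleton's height."""
    xs = [x for x, _ in skeleton]
    ys = [y for _, y in skeleton]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    height = (max(ys) - min(ys)) or 1.0  # avoid dividing by zero
    return [((x - cx) / height, (y - cy) / height) for x, y in skeleton]


def joint_angle(a, b, c):
    """Angle in radians at joint b between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.atan2(cross, dot))


class SlidingWindow:
    """Keeps the last n normalized skeletons and flattens them into one vector."""

    def __init__(self, n):
        self.frames = deque(maxlen=n)

    def push(self, skeleton):
        self.frames.append(normalize(skeleton))
        if len(self.frames) < self.frames.maxlen:
            return None  # window not full yet
        # Direct concatenation of all joint coordinates over the n frames.
        return [coord for frame in self.frames
                for joint in frame for coord in joint]
```

Each call to `push` returns `None` until N skeletons have been seen, after which it returns a vector of N × (number of joints) × 2 values that could be passed to a classifier. Values from `joint_angle`, or limb lengths from `math.dist`, could be appended to that vector as extra features.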
A feature of a person or object is a distinctive attribute that characterises it. In simpler words, it is something that is unique to that object. Once the preprocessing step is done, the joints or key points are ready for further processing. Some of the features computed from the input are: the direct concatenation of the joints of all N frames, the average height of the skeletons, the next position, the positions of all joints, the lengths of the limbs, and the joint angles computed from the joint positions.

IV. PROPOSED SYSTEM

In the proposed system, we take video footage from a camera to monitor human activities and send a message to the corresponding user when any anomalous activity occurs.

A. System Architecture

Video capture, video preprocessing, feature extraction, classification, and prediction are the phases of the architecture.

Fig. 1. System Architecture

B. Video capture

The first stage in a video surveillance system is to set up a camera and monitor the footage. Various types of recordings are captured from a variety of cameras across the surveillance area. The videos must be converted to frames because our system processes data frame by frame.

C. Dataset Description

The UCF101 dataset is a well-known dataset that has been used for training. UCF101 is a dataset of realistic action videos in 101 action categories, collected from YouTube for action recognition. It extends the UCF50 dataset, which includes 50 action categories. UCF101 is the most diverse dataset to date, with substantial variation in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, and other characteristics. It contains 13320 video clips covering 101 action categories. Because most action recognition datasets are unrealistic and staged by actors, UCF101 aims to support further action recognition research by learning and exploring new realistic action categories.

Fig. 2. UCF101 data set used for training

D. Video Pre-processing

The input video is given to the model, which extracts the frames from the video. The frames are then preprocessed to remove noise. Each frame is then processed with the CNN and PoseNet models to predict the activity of the user.

E. Performance Measure

The classifier's performance must be evaluated in both training and testing. Three performance parameters are used: accuracy, precision, and recall. The receiver operating characteristic (ROC) curve is used to assess the performance of the trained classifier. The parameters are calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) in the confusion matrix.

• $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
• $\text{Precision} = \dfrac{TP}{TP + FP}$
• $\text{Recall} = \dfrac{TP}{TP + FN}$

Precision is defined as the ratio of true positives to the total number of true and false positives. Recall is the ratio of true positives to the total number of true positives and false negatives.

V. RESULTS AND DISCUSSION

The aim of the project is to detect human activity using input video footage. If any abnormal activity is detected, a notification is sent to the user. Initially, we upload the video. Features are then extracted from the frames of the video. After the frames are processed using PoseNet, the human activity is detected.

Figs. 3 and 4 show the detected human activity. In Fig. 3, the human activity is detected to be walking. There is an anomaly detected. Fig. 4 shows that the detected human activity is fighting. Here an anomaly is detected and a notification is sent to the user.

In Table I below, human activity detection yields a result by considering the contours and edges of the image.
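The performance measures of Section IV-E can be checked with a short sketch. The function name and the confusion-matrix counts below are hypothetical: the counts are chosen only to be consistent with a two-class test set of 50 samples per class and are not taken from the paper's experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts: 50 positive and 50 negative samples, one error of each kind.
metrics = classification_metrics(tp=49, tn=49, fp=1, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative counts, all three measures evaluate to 0.98. Precision and recall diverge as soon as the numbers of false positives and false negatives differ.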

Fig. 3. Human activity detected is walking

Fig. 4. Human activity detected is fighting

Classification report:

TABLE I. CLASSIFICATION REPORT

              precision  recall  f1-score  support
1             0.98       0.98    0.98      50
0             0.98       0.98    0.98      50
accuracy                         0.98      100
Macro avg     0.98       0.98    0.98      100
Weighted avg  0.98       0.98    0.98      100

VI. CONCLUSION

The focus of this article is to discuss the most recent work in this field of study. It first discusses the objective of human pose estimation and then presents human pose estimation approaches and key point detection methods for pose representation. It also discusses some common datasets and different types of classification methods, providing a survey that supports the design and development of an automated human recognition system. In numerous computer vision applications, the need to interpret human actions has become unavoidable. Among the major application fields, anomaly recognition is becoming a prominent research topic because of techniques that allow a system to perform anomaly detection without the help of a trainer and to become a self-learner. We conclude that our study helps to design and develop an automated human recognition system for anomaly recognition with different poses, irrespective of the many challenges.

REFERENCES

[1] Fatemeh Serpush, Mahdi Rezaei, "Complex Human Action Recognition Using a Hierarchical Feature Reduction and Deep Learning-Based Method", SN Computer Science, 2021
[2] Fernando Moya Rueda, Gernot A. Fink, "Convolutional Neural Networks for Human Activity Recognition Using Body-Worn Sensors", 2018
[3] Mei Han, A. Sethi, "A detection-based multiple object tracking method", International Conference on Image Processing, 2004
[4] Rashmi R. Koli, Tanveer I. Bagban, "Human Action Recognition Using Deep Neural Networks", Fourth World Conference on Smart Trends in Systems, Security and Sustainability, 2020
[5] M. Leo, T. D'Orazio, I. Gnoni, "Complex human activity recognition for monitoring wide outdoor environments", 7th International Conference on Pattern Recognition, 2004
[6] Xiaoran Shi, Yaxin Li, Feng Zhou, "Human Activity Recognition Based on Deep Learning Method", International Conference on Radar (RADAR), 2018
[7] Noor Almaadeed, Omar Elharrouss, Somaya Al-Maadeed, "A Novel Approach for Robust Multi Human Action Recognition and Summarization based on 3D Convolutional Neural Networks", 2021
[8] Tingtian Li, Zixun Sun, Xiao Chen, "Group-Skeleton-Based Human Action Recognition in Complex Events", 2020
[9] Amel Ben Mahjoub, Mohamed Atri, "Human action recognition using RGB data", 11th International Design & Test Symposium, 2016
[10] V. Parameswari, S. Pushpalatha, "Human Activity Recognition using SVM and Deep Learning", European Journal of Molecular & Clinical Medicine, Volume 7, Issue 4, 2020
[11] J. Kavya, M. Geetha, "An FSM based methodology for interleaved and concurrent activity recognition", 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 994-999, IEEE, September 2016
[12] V. Ashokan, O. R. Murthy, "Comparative evaluation of classifiers for abnormal event detection in ATMs", 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), pp. 1330-1333, IEEE, July 2017
[13] C. V. Amrutha, C. Jyotsna, J. Amudha, "Deep learning approach for suspicious activity detection from surveillance video", 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), pp. 335-339, IEEE, March 2020
[14] B. Chitturi, J. Thomas, T. S. Indulekha, "New approaches for discovering unsupervised human activities by mining sensor data", 2015 International Conference on Computing and Network Communications (CoCoNet), pp. 118-123, IEEE, December 2015
[15] R. Vinayakumar, K. P. Soman, P. Poornachandran, "A comparative analysis of deep learning approaches for network intrusion detection systems (N-IDSs): deep learning for N-IDSs", International Journal of Digital Crime and Forensics (IJDCF), 11(3), pp. 65-89, 2019
[16] R. Vinayakumar, K. P. Soman, P. Poornachandran, "Evaluation of recurrent neural network and its variants for intrusion detection system (IDS)", International Journal of Information System Modeling and Design (IJISMD), 8(3), pp. 43-63, 2017
[17] J. Yin, Q. Yang, J. J. Pan, "Sensor-based abnormal human-activity detection", IEEE Transactions on Knowledge and Data Engineering, 20(8), pp. 1082-1090, 2008
[18] G. Singh, F. Cuzzolin, "Untrimmed video classification for activity detection: submission to ActivityNet challenge", arXiv preprint arXiv:1607.01979, 2016
[19] N. Kumaran, U. S. Reddy, "Classification of human activity detection based on an intelligent regression model in video sequences", IET Image Processing, 15(1), pp. 65-76, 2021
[20] B. Erol, M. G. Amin, "Radar data cube processing for human activity recognition using multisubspace learning", IEEE Transactions on Aerospace and Electronic Systems, 55(6), pp. 3617-3628, 2019
[21] T. Singh, D. K. Vishwakarma, "Human activity recognition in video benchmarks: A survey", Advances in Signal Processing and Communication, pp. 247-259, 2019
