A Thesis By
SIJIE SHANG
ORCID iD: 0009-0004-0308-282X
Department:
Department of Computer Science
Committee:
Rong Jin, Department of Computer Science, Chair
Doina Bein, Department of Computer Science
Kanika Sood, Department of Computer Science
DOI:
10.5281/zenodo.7903558
Keywords:
neural network, recurrent neural networks, long short-term memory, MoveNet
Abstract:
This study presents three models to predict human fitness poses. First, we use the MoveNet
model to obtain the human keypoints. The first model is a Feedforward Neural Network that predicts
the human pose in each frame. The second model is a Long Short-Term Memory (LSTM) network, and
the third is a Gated Recurrent Unit (GRU) network. The last two models take time series data as
input and, as a result, outperform the first model.
The accuracy of the LSTM model is 94.76% and the accuracy of the GRU model is 97.27%.
ACKNOWLEDGMENTS
Chapter
1. INTRODUCTION
Using YOLO and MLP Neural Networks to Recognize Tennis Players’ Poses
Using Detectron2 and Random Forest to Recognize Tennis Players’ Poses
Proposing Posture Recognition System Combining MobilenetV2 and LSTM for Medical Surveillance
Use of LSTM Regression and Rotation Classification to Improve Camera Pose Localization Estimation
4. DATASET
Data Description
Data Preprocess
Frame Extraction
Image Cropping
5. IMPLEMENTATION
6. RESULTS
7. CONCLUSION
REFERENCES
LIST OF TABLES
LIST OF FIGURES
6. 17 keypoints
7. MoveNet
ACKNOWLEDGMENTS
I would like to begin by expressing my sincere appreciation to my advisor, Dr. Rong Jin, for her
unwavering guidance, support, and encouragement throughout the thesis development process. Her
knowledge, comprehension, and zeal have been indispensable in guiding my intellectual and
emotional development. In addition, I would like to thank Drs. Doina Bein and Kanika Sood for
their valuable guidance. My parents, whose affection and support contributed to my success, also
CHAPTER 1
INTRODUCTION
checkers offer writing suggestions based on target audience and goals, and autopilot driving systems
can alert drivers to potential hazards. In this thesis, I propose a personalized training experience for
fitness enthusiasts, using machine learning models to recognize and classify human fitness poses. By
providing feedback, these models aim to help users exercise more efficiently. I present several
approaches to achieving this goal: the MoveNet framework combined with a Feedforward Neural
Network, with Gated Recurrent Units, and with Long Short-Term Memory.
CHAPTER 2
Neural networks, sometimes called artificial neural networks (ANNs) or simulated neural
networks (SNNs), are a subset of machine learning and form the core of deep learning methods
[1]-[5]. Their name and structure are inspired by the human brain, mimicking how biological
neurons communicate with each other [1]. The fundamental concept of a neural network is to
reproduce numerous closely connected brain cells in a computer, enabling it to acquire knowledge,
function, and a learning rule. Neurons will receive input from predecessor neurons that have an
A neural network is a series of nodes, or neurons. Each node has a set of inputs, a weight for
each input, and a bias value, as shown in Figure 1. As an input enters the node, it is multiplied by
its weight; the weighted inputs are summed together with the bias, and the resulting output is either
observed or passed to the next layer in the neural network [1]-[5].
The weights and bias are possibly the most important concepts of a neural network. When
inputs are transmitted between neurons, weights are applied to them and passed into an activation
During training on a training set, a neural network is initialized with a set of random weights.
These weights are then optimized during training to produce optimum weights. Equation 1 shows the
formula.
Wi: weights, which determine how much each input Xi influences the next layer
Xi: inputs
Bias: a constant number
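As a minimal sketch of the computation these definitions describe, assuming that Equation 1 is the conventional weighted sum, output = sum over i of Wi * Xi, plus Bias. The function name and the sample values are illustrative and are not taken from the thesis.

```python
def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias (no activation applied yet)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Illustrative values only: three inputs, three weights, one bias.
z = neuron_output(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 0.25], bias=0.25)
print(z)
```

The value returned here is what would be passed into an activation function in the next step.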
Activation Functions
should be activated or not [6]. It does this by calculating the weighted sum of inputs and further
The purpose of an activation function is to introduce non-linearity into the output of a neuron.
This means that instead of producing a simple linear output, an activation function can produce more
There are several types of activation functions including binary step functions which depend on
a threshold value that decides whether a neuron should be activated or not [7].
Sigmoid
The Sigmoid function [8, 9] is a logistic function that maps any real input to the range
between 0 and 1, as shown in Figure 2. It takes a real value as input and returns a value that is
always between 0 and 1, which can be read as a probability. Equation 2 shows the Sigmoid function.
$$y = \frac{1}{1 + e^{-x}} \tag{2}$$
Here, 'x' is the input value and 'e' is the base of the natural logarithm, which is about 2.718. A large
negative input value brings the Sigmoid function to values near 0, while a large positive input value
brings it close to 1. The Sigmoid function returns a probability of 0.5 when the input value is 0.
Tanh
The Tanh (hyperbolic tangent) function [9]-[11], shown in Figure 3, is another popular
activation function used in neural networks. It is similar to the Sigmoid function but maps
any input to the range between -1 and 1. The formula is shown in Equation 3 below.
$$y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{3}$$
The ReLU (Rectified Linear Unit) activation function [9, 12], shown in Figure 4, is a
widely used alternative to both the sigmoid and tanh activation functions. It is far less prone to the
vanishing gradient problem, because its gradient is 1 for every positive input, and it is
computationally inexpensive. Equation 4 shows the formula of this activation function.
$$y = \max(0, x) \tag{4}$$
The input value, in this case, is 'x'. If the input value is positive or equal to zero, the ReLU
function produces that value, and if it is negative, the result will be 0. Many deep learning models
have shown good performance with this straightforward function since it introduces non-linearity while
Due to its simplicity, computational efficiency, and capacity to address the vanishing gradient
problem, the ReLU activation function is widely used in machine learning and deep learning
applications such as image classification, object detection, and natural language processing [12]-[14].
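The three activation functions discussed above can be compared side by side with a short sketch. It uses only the Python standard library; the function names are illustrative, and the code is not hardened against overflow for inputs of very large magnitude.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: maps any real input into the range (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectified Linear Unit: passes positive inputs, clamps negatives to 0."""
    return max(0.0, x)


# Print each function at a negative, zero, and positive input.
for x in (-5.0, 0.0, 5.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4), relu(x))
```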
Backpropagation
artificial neural networks, particularly for supervised learning tasks. It plays a critical role in adjusting
the model's weights to minimize the error between predicted outputs and actual target values. The
primary objective of backpropagation is to reduce the loss function, a quantitative measure of the
The backpropagation algorithm operates by leveraging the chain rule of calculus to compute
the gradients of the loss function with respect to each weight within the network. These gradients
provide information on the direction and magnitude of the weight adjustments necessary to minimize
Forward Pass
During this phase, input data is propagated through the network to compute the predicted
output. This process entails calculating the weighted sum of inputs for each neuron, passing the
result through an activation function, and repeating this procedure for each layer in the network until
Backward Pass
In this stage, the error between the predicted output and the target values is determined, and
the gradients of the loss function with respect to each weight are calculated. Starting at the output
layer and moving backward through the network, the error gradients for each neuron are computed.
Using the chain rule, the gradients for each weight are then calculated. Once the gradients for all
weights have been obtained, the weights are updated using an optimization algorithm, such as
gradient descent or one of its variants (e.g., stochastic gradient descent, Adam).
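The forward pass, backward pass, and weight update can be illustrated on the smallest possible network: a single sigmoid neuron trained on one example with a squared-error loss. This is a sketch under those simplifying assumptions; the learning rate, the sample values, and the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, b, x, target, lr):
    # Forward pass: weighted sum, activation, then the loss.
    z = w * x + b
    y = sigmoid(z)
    loss = 0.5 * (y - target) ** 2
    # Backward pass (chain rule): dL/dw = dL/dy * dy/dz * dz/dw.
    dl_dy = y - target
    dy_dz = y * (1.0 - y)
    grad_w = dl_dy * dy_dz * x
    grad_b = dl_dy * dy_dz
    # Gradient descent update of both parameters.
    return w - lr * grad_w, b - lr * grad_b, loss


w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.0, target=1.0, lr=0.5)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss shrinks as training proceeds
```

In a multi-layer network the same chain rule is applied layer by layer, from the output back to the input.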
Backpropagation allows for the simultaneous and continuous update of all the weights in the
network, making it a computationally efficient method for training neural networks. Through iterative
weight adjustments aimed at minimizing the loss function, the model learns to make more accurate
predictions, ultimately enhancing its performance on the given task. The backpropagation algorithm
has been instrumental in the advancement of deep learning, enabling the development of complex
neural network architectures capable of solving a wide range of problems in various domains, such as
Feedforward neural networks [1, 16, 19, 20] are a type of artificial neural network in which
information flows in one direction: from the input layer, through the hidden layers, and finally to the
output layer. Figure 1 shows a simple Feedforward Neural Network. They are called "feedforward"
because there are no cycles or loops in the connections between neurons. This architecture allows
for the efficient processing of input data and is widely used in various applications, such as image
Convolutional Neural Networks (CNNs) [21] are a specialized type of deep learning model
designed for processing grid-like data, such as images [22]. They are particularly effective at tasks
like image classification, object detection, and semantic segmentation. CNNs consist of two primary
parts: feature extraction and classification with a fully connected layer as you can see in Figure 5 [23]-
[25].
Feature Extraction
In the feature extraction phase, a CNN employs a series of convolutional and pooling layers to
learn and extract meaningful features from the input data. The convolutional layers utilize kernels or
filters (e.g., 3x3 matrices) that are applied to the input data through a sliding window approach. By
performing element-wise multiplication between the input data and the kernel, followed by a
summation, the model generates a feature map that captures local patterns in the input data. This
process is repeated across the entire input, moving the kernel horizontally and vertically according to
a predefined stride.
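The sliding-window operation described above can be sketched in plain Python. As in most deep learning frameworks, the kernel is applied without flipping (strictly speaking, a cross-correlation), and no padding is used. The image and kernel values are illustrative.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and return the resulting feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiplication of the window and kernel, then summation.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]
print(conv2d(image, kernel, stride=1))  # 3x3 feature map
print(conv2d(image, kernel, stride=2))  # 2x2 feature map
```

A larger stride moves the window further at each step and therefore produces a smaller feature map.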
In this study, I utilized the pre-trained InceptionV3 model to extract features from input images.
InceptionV3 is a deep convolutional neural network architecture that has performed well on a number of
computer vision tasks. It was created by Google researchers as a successor to GoogLeNet (Inception
v1), which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014 [42]. The
architecture uses computational resources efficiently and includes parallel branches with different
convolutional filter sizes, which lets the model capture information at multiple scales.
The main benefit of adopting a pre-trained model is that it can take advantage of the
knowledge acquired through training on a large dataset, in this case the ImageNet dataset. ImageNet
is a large dataset that includes millions of images from tens of thousands of object
categories. As a result, the model has already learned a wide range of features and patterns, which
makes it a good starting point for extracting useful features from new input
images. Furthermore, adopting a pre-trained model frees us from having to train the entire model from
scratch, allowing us to concentrate on optimizing the subsequent layers of our architecture, such as
LSTM or GRU, to better suit our specific problem domain. On our particular dataset, this may result in
Pooling
After the convolutional layers, a pooling layer is used to reduce the spatial dimensions of the
feature maps. This step helps in decreasing computational complexity, reducing the risk of overfitting,
and improving translation invariance. Two common types of pooling are max pooling and average
pooling. Max pooling selects the maximum value within the pooling kernel's coverage, while average
pooling computes the average value of the covered region. The figure below shows the Convolutional
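Max pooling and average pooling can be sketched with one small function. The 2x2 window, the stride of 2, and the feature map values are illustrative assumptions.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce a 2D feature map with max pooling or average pooling."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            # Collect the values covered by the pooling window.
            window = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        pooled.append(row)
    return pooled


fm = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [3, 4, 5, 6],
]
print(pool2d(fm, mode="max"))
print(pool2d(fm, mode="avg"))
```

Both variants halve each spatial dimension here; they differ only in how the window is summarized.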
Following the feature extraction and pooling steps, the CNN architecture typically includes one
or more fully connected layers. These layers serve to classify the extracted features into predefined
categories, such as object classes in an image classification task. The fully connected layers connect
the output from the last pooling or convolutional layer to a final output layer, which often uses a
In summary, Convolutional Neural Networks (CNNs) are powerful models designed for
handling grid-like data structures, such as images. They consist of two main parts: feature extraction
using convolutional and pooling layers, and classification using fully connected layers. By employing
these components, CNNs can effectively learn and extract meaningful features from input data and
MoveNet framework
(Figure 6), which are critical points on the human body, such as joints or specific anatomical
landmarks [26]-[27]. MoveNet is particularly useful for tasks related to human pose estimation and
motion analysis. It can be employed in various applications, including physical therapy, sports
The MoveNet framework typically uses a deep learning architecture, such as a Convolutional
Neural Network (CNN), as its core component. The convolutional layers apply a series of filters to the
input data to learn and extract features, which are then used for tasks like classification or regression.
The difference between the MoveNet framework and a regular CNN is that MoveNet is
specifically tailored for detecting keypoints on the human body [28]. While a regular CNN can be
applied to a wide range of image-based tasks, MoveNet focuses on the particular challenges of
human pose estimation, including dealing with occlusions, variations in appearance, and different
Figure 7. MoveNet
Person Center Heatmap
This heatmap represents the likelihood of a person's center (usually the torso) being present at
each location in the image. It helps the model localize individuals in an image, even if they are
This component predicts a vector field for each keypoint, where the vectors point towards the
keypoint's location relative to the person's center. The model uses these vector fields to refine the
This is a set of heatmaps, one for each keypoint type (e.g., left elbow, right knee). Each
heatmap estimates the probability of a keypoint being present at each location in the image. The
model combines this information with the person center heatmap and keypoint regression field to
This component predicts a 2D offset for each detected keypoint, which helps refine the
Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber [30],
are a type of recurrent neural network (RNN) architecture specifically designed to overcome the
The LSTM architecture [31] introduces a memory cell, which is regulated by three gates: the
input gate, the forget gate, and the output gate. It is shown in Figure 8. These gates control the flow
of information within the LSTM, allowing it to retain, update, or remove information from the memory
cell as needed.
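The interaction of the three gates and the memory cell can be sketched for a single time step. This is a scalar version of the standard LSTM formulation, written for illustration: real layers use weight matrices, and the parameter layout `p` (gate name mapped to input weight, recurrent weight, and bias) is a hypothetical convention chosen for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""

    def pre_activation(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))         # how much new information to write
    f = sigmoid(pre_activation("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre_activation("output"))        # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))   # candidate values for the cell state
    c = f * c_prev + i * g                       # updated memory cell
    h = o * math.tanh(c)                         # new hidden state
    return h, c


# With all parameters at zero, every gate is half open.
zeros = {name: (0.0, 0.0, 0.0) for name in ("input", "forget", "output", "candidate")}
print(lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zeros))
```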
LSTMs perform well on many tasks, such as natural language understanding, speech recognition,
and time series prediction [32]-[34]. They can learn dependencies that span long
periods of time, which makes them especially useful for tasks that involve sequential data with
complex temporal patterns. On some tasks and datasets, LSTMs have been shown to do better than
However, LSTMs have some drawbacks [32]-[34]. They have a more complicated design than
GRUs, which means they have more parameters and need more processing power. This can make
training take longer and may require more powerful hardware. LSTMs may also be more
likely to overfit, especially on smaller datasets, so they may need regularization techniques to work at
their best.
Gated Recurrent Units (GRUs) are a type of recurrent neural network (RNN) architecture,
introduced by Cho et al. [35] as a simpler alternative to the Long Short-Term Memory (LSTM)
networks. GRUs are made to effectively learn long-term dependencies in sequential data by
addressing the vanishing gradient problem that plagues standard RNNs [36].
The GRU design consists of two gates, the update gate and the reset gate, as shown in
Figure 9. These gates control how information moves between the hidden states. This lets the
model learn and remember important information over longer time periods. The update gate controls
how much the old hidden state affects the new hidden state, while the reset gate controls how much
Compared to LSTMs, GRUs have a simpler design with fewer parameters. This makes them
easier and faster to train. It often leads to the same or better performance
on certain tasks, especially when resources are limited or the dataset is small. GRUs have been used
successfully in many of the same fields as LSTM networks.
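The update and reset gates can be sketched for a single time step in the same spirit. This is a scalar illustration, not the thesis implementation. Sign conventions for the update gate differ between references; here a value of z close to 1 favors the new candidate state. The parameter layout `p` is a hypothetical convention for this sketch.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One time step of a scalar GRU cell; p maps name -> (w_x, w_h, bias)."""
    w_x, w_h, bias = p["update"]
    z = sigmoid(w_x * x + w_h * h_prev + bias)   # update gate
    w_x, w_h, bias = p["reset"]
    r = sigmoid(w_x * x + w_h * h_prev + bias)   # reset gate
    w_x, w_h, bias = p["candidate"]
    # The reset gate scales how much of the old hidden state enters the candidate.
    h_candidate = math.tanh(w_x * x + w_h * (r * h_prev) + bias)
    # The update gate blends the old hidden state with the candidate.
    return (1.0 - z) * h_prev + z * h_candidate


zeros = {name: (0.0, 0.0, 0.0) for name in ("update", "reset", "candidate")}
print(gru_step(x=1.0, h_prev=0.8, p=zeros))
```

The cell needs three weight sets, whereas an LSTM cell needs four (three gates plus the candidate), which is where the smaller parameter count comes from.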
Even though GRUs have their benefits, they may not always be better than LSTMs. This is
especially true when working with very long sequences or large datasets. In some situations, the extra
complexity of LSTM networks can make a model more expressive and better able to capture the
dependencies in the data. The choice between GRUs and LSTMs often depends on the problem at hand, the
dataset, and the computational resources that are available [32]-[34].
RELATED WORKS
In this chapter, I introduce some related research by other authors about
Using YOLO and MLP Neural Networks to Recognize Tennis Players’ Poses [37]
Jhen-Min Hung et al. proposed using YOLO v5 multi-target bounding-box detection and an MLP
neural network to classify whether a player is performing a forehand or backhand swing. They used
data from the US Open Tennis Championships 2019 and 2020. First, they extracted frames from the
videos, then they used the YOLO v5 model to detect the players and balls. Next, they used the MLP to
Using Detectron2 and Random Forest to Recognize Tennis Players’ Poses [38]
Rajdeep Chatterjee et al. proposed an intelligent system to detect tennis players’ poses. They
used Detectron2 to detect human keypoints. Detectron2 is an object detection and
segmentation framework developed by Meta AI [3]. After that, they used those keypoints as
the input to train a Random Forest model. They also trained several fine-tuned Convolutional Neural
Network models and compared the results. As a result, their Random Forest model
Huu et al. present a posture recognition system that can be used for medical surveillance. The
system uses a deep learning model called MobileNetV2 to extract features from images of human
poses. Then, it uses an LSTM to analyze the temporal patterns of the features and classify them into
different postures. The system can recognize 10 postures, such as standing, sitting, lying down,
bending, and squatting. The system achieves high accuracy and efficiency on various datasets and
can be applied to monitor patients' health conditions and activities. With the LSTM, the accuracy
reaches 99%, compared with only 88% without it.
Use of LSTM Regression and Rotation Classification to Improve
Camera Pose Localization Estimation [40]
The article by Meng Xu et al. proposes a method for estimating camera pose (the position
and orientation of a camera) using deep learning. The method combines a long short-term memory
(LSTM) module with a rotation classification loss to regress the camera pose from image sequences.
The LSTM module can capture temporal information and learn features from images, while the
rotation classification loss can reduce errors in estimating camera orientation. The authors supervise
the pose estimation with dynamic, weighted multi-losses that balance different components of the
camera pose (such as yaw, pitch, roll, translation, and quaternion). The article evaluates the proposed
method on two public datasets (KITTI and EuRoC) and compares it with several state-of-the-art
methods. The results show that the proposed method achieves better performance than existing
DATASET
In this chapter, the focus will be on introducing the data utilized for training purposes and
Data Description
conduct my research. This dataset comprises a compilation of the most frequently performed
Arm Raises – a simple upper body exercise that targets the shoulder muscles, primarily
the deltoids, and helps improve overall shoulder stability and strength.
Pushups – an upper body strength training exercise that targets the chest, triceps, and
shoulders.
Curls – an isolation exercise that engages the biceps and forearms while strengthening
the upper arms.
Flys – a resistance exercise that works the shoulder muscles, specifically the deltoids.
Squats – a lower body compound exercise that targets the glutes, quads, hamstrings,
and calves.
Bird Dogs – a dynamic exercise that enhances core stability and strengthens the back
extensors.
Supermans – an isometric exercise that targets the lower back muscles and improves
spinal erector strength.
Bicycle Crunches – an abdominal exercise that engages the rectus abdominis and
obliques.
Leg Raises – a lower abdominal exercise that strengthens the hip flexors and lower abs.
Overhead Press – a compound exercise that strengthens the shoulder muscles, triceps,
and upper back.
Each exercise is accompanied by 100 videos that have been filmed under varying light settings
and backgrounds. This diverse dataset allows for a thorough analysis of exercise form and technique,
To preprocess the data from the comprehensive fitness video dataset provided by Infinity AI
[41], I followed a series of steps to extract relevant information and prepare the data for further
analysis:
Frame Extraction
Since the dataset consists of videos, the first step was to extract individual frames from each
video. This process involves breaking down the videos into a sequence of images, which can then be
processed and analyzed using computer vision techniques. Depending on the frame rate of the
videos and the desired temporal resolution, a specific number of frames per second were extracted to
Image Cropping
After extracting the frames, the next step was to crop the images to focus on the people
performing the exercises. To achieve this, I utilized the metadata provided with the dataset, which
includes information about the person's position in each frame. Based on this information, I cropped
the images so that the person was centered, ensuring that the model's attention would be
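The two preprocessing steps can be sketched without any video library. The code below only computes which frame indices to keep and which crop rectangle to use. The frame rates, the bounding-box format (x_min, y_min, x_max, y_max in pixels), and the margin are assumptions for illustration; the thesis does not state the actual metadata layout or sampling rate.

```python
def sample_frame_indices(total_frames, video_fps, target_fps):
    """Indices of the frames to keep when sampling a video at target_fps."""
    if target_fps <= 0 or target_fps > video_fps:
        raise ValueError("target_fps must be in the range (0, video_fps]")
    step = video_fps / target_fps
    indices = []
    k = 0
    while int(k * step) < total_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def centered_crop_box(bbox, image_width, image_height, margin=0.1):
    """Square crop around the person's bounding box, clamped to the image."""
    x_min, y_min, x_max, y_max = bbox
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    # Half the side of the square: the longer bbox side plus a relative margin.
    half = max(x_max - x_min, y_max - y_min) * (1 + margin) / 2
    left = max(0, int(cx - half))
    top = max(0, int(cy - half))
    right = min(image_width, int(cx + half))
    bottom = min(image_height, int(cy + half))
    return left, top, right, bottom


print(sample_frame_indices(total_frames=10, video_fps=30, target_fps=10))
print(centered_crop_box((100, 50, 200, 250), image_width=640, image_height=480, margin=0.0))
```

When the person stands close to an image border, the clamping keeps the crop inside the image, so the person is then only approximately centered.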
IMPLEMENTATION
In this chapter, I introduce each model I used and describe it in detail.
After using MoveNet framework to get the position of the human keypoints, Figure 10 shows
The code in Figure 10 defines a feedforward neural network model that takes as input a tensor
of shape (batch_size, 51), where 51 is the number of features in the input data: 17 keypoints, each
with x and y coordinates and a confidence score.
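import tensorflow as tf
# landmarks_to_embedding is the custom function described below (defined elsewhere)
inputs = tf.keras.Input(shape=(51,))
embedding = landmarks_to_embedding(inputs)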
The input tensor is passed through a custom function called landmarks_to_embedding, which
maps the input tensor to a higher-dimensional space, creating a tensor of shape (128).
The mapped tensor is then passed through two dense layers with different numbers of
neurons. The first dense layer has 128 neurons and applies the ReLU6 activation function, which is a
variant of the Rectified Linear Unit (ReLU) activation function that has a maximum output of 6. The
second dense layer has 64 neurons and applies the ReLU6 activation function as well. Both dense
layers are followed by a dropout layer, which randomly sets 50% of the input units to 0 at each update
len(class_names) neurons and a softmax activation function. This layer produces the output of the
neural network, which is a probability distribution over the different classes in the dataset.
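The two activation functions used in this classifier can be written out in a few lines. This is a standard-library sketch for illustration, not the Keras implementation; the sample logits are made up.

```python
import math


def relu6(x):
    """ReLU capped at 6: negative inputs become 0, inputs above 6 become 6."""
    return min(max(0.0, x), 6.0)


def softmax(logits):
    """Turn raw scores into a probability distribution over the classes."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


print([relu6(v) for v in (-2.0, 3.0, 10.0)])
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))
```

The predicted class is the index of the largest probability in the softmax output.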
In this section, I describe a new LSTM model that processes video sequences with the
MoveNet framework. MoveNet is a deep learning model created for real-time human pose
estimation. It locates keypoints on the human body to identify each person's pose in a video
frame.
The new LSTM models shown in Figures 11 and 12 use MoveNet to provide keypoints
for each body part, together with their accompanying confidence scores. By capturing the spatial
interactions between various body parts, this new representation offers a more organized and
Each row in the table that results from processing the data with MoveNet
corresponds to one frame of a video sequence. The x and y coordinates and the confidence score of
each keypoint identified by MoveNet are listed in separate columns. The keypoints include body parts
such as the nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. Additionally,
the table has columns for the class number and class name, which correspond to the different
The LSTM model can be adapted to handle this new type of input data by setting the input
shape to match the number of keypoint coordinates and confidence scores. The model can then learn and
predict the class labels (names of the exercises) using the spatial relationships between the keypoints,
as opposed to learning and predicting them from the raw pixel values of the video frames.
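How one frame of keypoints becomes one input row can be sketched as follows. The keypoint names follow MoveNet's 17-keypoint output. The (x, y, score) ordering inside each triple and the helper names are assumptions for illustration; the column order of the table used in the thesis is not shown.

```python
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def frame_to_features(keypoints):
    """Flatten 17 (x, y, score) triples into one row of 51 features."""
    if len(keypoints) != len(KEYPOINT_NAMES):
        raise ValueError("expected one triple per keypoint")
    features = []
    for x, y, score in keypoints:
        features.extend([x, y, score])
    return features


def sequence_to_rows(frames):
    """A video segment becomes a list of rows with shape (num_frames, 51)."""
    return [frame_to_features(frame) for frame in frames]


# Dummy data: a 20-frame segment in which every keypoint has the same values.
dummy_frame = [(0.5, 0.5, 0.9)] * 17
rows = sequence_to_rows([dummy_frame] * 20)
print(len(rows), len(rows[0]))
```

With this layout, the input shape passed to a recurrent model would be (number of frames, 51).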
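from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout

def create_lstm_model(input_shape, num_classes):
    model = Sequential()
    # Bidirectional LSTM reads each sequence forward and backward
    model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=input_shape))
    model.add(Dropout(0.4))
    model.add(LSTM(128, return_sequences=True))
    model.add(Dropout(0.4))
    # The last recurrent layer returns only its final hidden state
    model.add(LSTM(64))
    model.add(Dropout(0.4))
    model.add(Dense(32, activation="relu"))
    # Softmax output: one probability per exercise class
    model.add(Dense(num_classes, activation="softmax"))
    return model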
Model 1, shown in Figure 11, uses the first 100 frames of each video. The model starts
with a bidirectional LSTM layer with 128 units, followed by a dropout layer, an LSTM layer with 128
units, a second dropout layer, an LSTM layer with 64 units, a third dropout layer, a dense layer with
32 units, and finally a dense softmax layer to output the class probabilities. However, the prediction accuracy of
A second approach for segmenting the videos improves performance. Because each video contains
repetitive movements, it is split into 20-frame segments, and any leftover frames are discarded.
This yields Model 2, which is built by the create_lstm_model_2 function. In this model, the data are
first scaled to have zero mean and unit variance using the normalize_data function. The model
Model 2, shown in Figure 12, starts with a bidirectional LSTM layer with 256 units,
batch normalization, and dropout. The following layers are an LSTM layer with 128
units, batch normalization, and dropout, an LSTM layer with 64 units, batch normalization, and
dropout, a dense layer with 32 units, and a dense softmax layer for class probabilities.
The normalized data is used for both training and testing Model 2. Early stopping is included as
a callback during training to prevent overfitting and to save the best model based on the validation
loss. The model is trained with a batch size of 16 for a maximum of 100 epochs.
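The segmentation and scaling steps used for Model 2 can be sketched as follows. The thesis does not show its normalize_data function, so the `normalize` helper below is a hypothetical stand-in that scales a single list of values; in practice the scaling would more likely be applied per feature column.

```python
import math


def split_into_segments(frames, segment_length=20):
    """Split a frame sequence into full segments and drop the leftover frames."""
    count = len(frames) // segment_length
    return [frames[k * segment_length:(k + 1) * segment_length] for k in range(count)]


def normalize(values):
    """Scale values to zero mean and unit variance (hypothetical stand-in)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0  # avoid dividing by zero for constant data
    return [(v - mean) / std for v in values]


# A 45-frame video yields two 20-frame segments; the last 5 frames are dropped.
segments = split_into_segments(list(range(45)), segment_length=20)
print(len(segments), segments[1][0], segments[1][-1])
print(normalize([1.0, 2.0, 3.0]))
```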
Model 2 tries to improve upon Model 1's prediction accuracy by segmenting the videos into
20-frame chunks and using a more complex model with normalization, whereas Model 1 only used the
first 100 frames. This highlights how crucial it is to use the right sequence length and model architecture for
Here, I present a Gated Recurrent Unit (GRU) model that, like the LSTM model, processes
video sequences using the MoveNet framework. In contrast to the LSTM model, the GRU model is
The normalize_data function is used to scale the data to have zero mean and unit variance
before applying the GRU model. This is the same procedure I use before building the LSTM model.
The GRU model architecture shown in Figure 13 is specified by the
create_gru_model function. A bidirectional GRU layer with 256 units serves as the model's
foundation, followed by batch normalization and a dropout layer. A second GRU layer with 128
units, batch normalization, and dropout come next. The model then adds a third GRU layer with 64
units, again followed by batch normalization and dropout.
Finally, the model has a dense layer with 32 units and a dense softmax layer that outputs the class
probabilities.
This model is compiled with the sparse categorical crossentropy loss function and the
Adam optimizer, just like the LSTM model. The create_gru_model function receives the input shape
and the number of classes to initialize the model. During the training of the model, the normalized
Early stopping is included as a callback during training to avoid overfitting and save the best
model based on validation loss. A batch size of 16 and a maximum of 100 epochs are used to train
the model.
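The early-stopping rule can be sketched independently of any framework. The patience value and the validation losses below are illustrative assumptions; the thesis does not state the patience it used. In Keras, this behavior is typically obtained with the EarlyStopping callback monitoring the validation loss.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch as the best model so far.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch  # stop training here
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: no improvement after epoch 2.
history = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
print(early_stopping(history, patience=3))
```

Training stops before the maximum number of epochs once the validation loss has failed to improve for `patience` consecutive epochs, and the weights from the best epoch are the ones kept.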
In conclusion, the GRU model, which uses video sequences processed with the MoveNet framework,
provides an alternative to the LSTM model for human exercise recognition.
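from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, BatchNormalization, Bidirectional, Dense, Dropout

def create_gru_model(input_shape, num_classes):
    model = Sequential()
    # Bidirectional GRU reads each sequence forward and backward
    model.add(Bidirectional(GRU(256, return_sequences=True), input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(GRU(128, return_sequences=True))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    # The last recurrent layer returns only its final hidden state
    model.add(GRU(64))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(32, activation="relu"))
    # Softmax output: one probability per exercise class
    model.add(Dense(num_classes, activation="softmax"))
    return model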
RESULTS
Before analyzing the results, it is important to examine the model's training and validation
accuracy over time, as depicted in the accuracy plot. This graphical representation showcases how
the model's performance evolves as the number of training epochs increases. An epoch refers to a
complete iteration through the entire training dataset. In this case, the model has been trained for a
The accuracy plot reveals a positive trend in both training and validation accuracy as the
number of epochs increases. Specifically, as Figure 14 shows, the training accuracy reaches
approximately 80%, while the validation accuracy achieves a higher value of around 84%. It is
noteworthy that the validation accuracy surpasses the training accuracy, which may suggest that the
model generalizes well to unseen data, although part of this gap may come from dropout, which is
active only during training. Generalizing well is a desirable outcome, as it indicates that the model is less
likely to be overfitting on the training data and can maintain good performance when encountering
depiction of how well the neural network model classified the given tasks. The true class (actual
exercise) is represented by each row in the matrix, while the predicted class (exercise predicted by
the model) is represented by each column. The examples that were correctly classified for each
For instance, the value at the first row and first column (5312) indicates that 5312 instances of
the "armraise" exercise have been accurately identified by the model. The first row's non-diagonal
values, such as 1, 0, 3, etc., represent the number of 'armraise' instances that were mistakenly
labeled as other exercises. For example, one 'armraise' instance was mistakenly labeled as
'basic_legraise'.
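The way such a matrix is built from labels can be sketched as follows. This is a minimal plain-Python illustration with rows as true classes and columns as predicted classes; the labels and predictions are hypothetical and are not the thesis data.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_class, predicted_class in zip(y_true, y_pred):
        matrix[true_class][predicted_class] += 1
    return matrix


# Hypothetical integer-encoded labels for three classes (illustrative only).
labels = ["armraise", "basic_legraise", "curl"]
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, len(labels))
```

Here cm[0][0] counts the 'armraise' instances that were recognized correctly, while cm[0][1] counts the 'armraise' instances mistakenly labeled as 'basic_legraise'.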
According to the confusion matrix in Figure 15, the model does well when classifying exercises
such as "armraise", "fly", and "overheadpress". However, the model struggles to distinguish
between "basic_legraise", "legraise", and "pushup". The non-diagonal values in these rows,
which are high relative to the diagonal values, reveal the misclassifications.
The classification report, which comes after the confusion matrix, gives more details about how
As the classification report in Table 1 shows, the model performs well for several classes, such
as "curl", "fly", and "overheadpress", with precision scores of 0.86, 0.84, and 0.89,
respectively. These strong results show that the model can correctly categorize these
"bicyclecrunch", and "superman", which are 0.29, 0.17, and 0.28, respectively. The model has trouble
discriminating some workouts. These findings imply that in order to enhance the model's performance
for these particular classes, additional training data or model improvement may be necessary.
Overall, the model performs reasonably well, with a weighted average accuracy of 0.76.
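To make the quantities in a classification report concrete, the following sketch derives per-class precision, recall, and F1-score from a confusion matrix, together with a support-weighted average. The matrix is a small hypothetical example, and the assumption that a "weighted average" weights each class by its number of true instances (its support) is mine rather than something stated in the text.

```python
def classification_report(cm):
    """Per-class precision, recall, F1, and support from a confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    n = len(cm)
    report = []
    for k in range(n):
        true_positives = cm[k][k]
        predicted = sum(cm[r][k] for r in range(n))  # column sum
        support = sum(cm[k])                         # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report.append({"precision": precision, "recall": recall,
                       "f1": f1, "support": support})
    return report


def weighted_average(report, key):
    """Average a metric over classes, weighting each class by its support."""
    total = sum(row["support"] for row in report)
    return sum(row[key] * row["support"] for row in report) / total


# Hypothetical 3-class confusion matrix (illustrative values only).
cm = [[2, 1, 0],
      [0, 1, 1],
      [0, 0, 1]]
report = classification_report(cm)
weighted_precision = weighted_average(report, "precision")
weighted_recall = weighted_average(report, "recall")
```

Note that the support-weighted recall equals the overall accuracy, because the supports cancel and only the sum of the diagonal divided by the total number of instances remains.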
In this section, I go over the results of LSTM models in conjunction with the MoveNet
framework to recognize human exercise. The model was initially trained using 100-frame sequences
for each video. However, this strategy produced poor results: Figure 16 shows the model's
accuracy for both training and validation falling below 22%. To solve this problem, each video
was divided into 20-frame segments, with any extra frames being discarded. As a result, the
shape of x_train changed from (800, 100, 52) to (4809, 20, 52).
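The segmentation step can be sketched as follows. This is a minimal illustration that assumes non-overlapping segments and that each segment inherits the label of its source video; the function name and the 107-frame example video are hypothetical.

```python
def split_into_segments(frames, segment_len=20):
    """Cut one video's frame sequence into non-overlapping segments,
    discarding leftover frames that do not fill a whole segment."""
    num_segments = len(frames) // segment_len
    return [frames[i * segment_len:(i + 1) * segment_len]
            for i in range(num_segments)]


# Hypothetical video: 107 frames, each with 52 features per frame.
video = [[0.0] * 52 for _ in range(107)]
segments = split_into_segments(video)
# 107 // 20 = 5 segments; the last 7 frames are discarded.
```

Because the function relies only on len() and slicing, it should work unchanged on a NumPy array of shape (num_frames, 52). Stacking the segments of all videos then gives an array of shape (num_segments_total, 20, 52).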
As we can see in Figure 17, the training accuracy eventually rose to over 80% thanks to the
improved methodology. The validation accuracy, however, did not consistently improve with the
number of epochs, eventually declining to 49.39% at the end of the training. The complexity of the
model was increased, and the data was standardized, to further improve the model's performance.
As a result, the test accuracy reached 94.76%, as we can see in Figure 18.
The classification report for the final optimized LSTM model with MoveNet (see Table 2) shows
that the model works remarkably well for a number of exercise classes, including "armraise,"
"curl," "fly," "overheadpress," and "squat." Other exercises, such as the "basic_legraise," "bicyclecrunch,"
These results are supported by the confusion matrix (see Figure 19), which highlights the
model's shortcomings in classifying some activities while also displaying a high percentage of
In conclusion, the MoveNet framework and the LSTM model demonstrate strong performance
The GRU model performs well, reaching over 90% accuracy on both the training and
validation sets after just five epochs, as seen in the accuracy plot. The training and validation
accuracy curves, however, occasionally cross as they fluctuate. Despite these fluctuations,
early stopping ends training at epoch 29, before the 100-epoch maximum, with a weighted
accuracy of 98.63%, as Figure 20 shows.
According to the GRU model's classification report in Table 3, the model performs remarkably
well for the majority of exercise classes, with high precision, recall, and F1-scores. The model's precision
scores are at or close to 1.00 for classes 0, 4, 5, 7, and 9 ('armraise', 'curl', 'fly', 'overheadpress', and
precision scores of 0.00 and 0.25, respectively, provide challenges for the model when categorizing
them. These low results imply that the model might profit from additional improvement or training data
The classification performance of the model across the exercise classes is visualized in the
confusion matrix in Figure 21. It supports the classification report's earlier
observations about the strong performance on the majority of classes. The majority of examples are
accurately classified by the model, and misclassifications are rare. The confusion matrix, however,
also reveals the model's difficulty in correctly classifying 'basic_legraise' (class 1), where it wrongly
In conclusion, the high accuracy, F1-scores, and confusion matrix findings show that the GRU
model performs exceptionally well in identifying the majority of exercise classes. The classification of
some exercises, including "basic_legraise" and "bicyclecrunch", might be improved. Potential
remedies include expanding the amount of training data, changing the model's architecture, or
applying data augmentation techniques to improve the model's performance for these
classes.
CONCLUSION
In conclusion, the combination of MoveNet with different neural network architectures, such as
Feedforward Neural Networks, Gated Recurrent Units (GRUs), and Long Short-Term Memory
Networks (LSTMs), has shown promising results in classifying human exercises. The GRU model
achieved the highest accuracy of 97.27%, while the LSTM model also performed well with an
Improvements can be made by increasing the amount of training data, improving the quality of the
training data, adjusting the model's architecture, or implementing data augmentation techniques.
35
REFERENCES
[2] E. Kavlakoglu, “AI vs. Machine Learning vs. Deep Learning vs. Neural Networks: What’s the
Difference?,” IBM Blog, 27-May-2020. [Online]. Available: https://www.ibm.com/cloud/blog/ai-
vs-machine-learning-vs-deep-learning-vs-neural-networks. [Accessed: 07-Jan-2023].
[3] Analytics Vidhya, “Introduction to Artificial Neural Networks,” Analytics Vidhya. [Online].
Available: https://www.analyticsvidhya.com/blog/2021/09/introduction-to-artificial-neural-
networks/. [Accessed: 07-Jan-2023].
[4] Towards Data Science, “Artificial Neural Networks (ANN),” Towards Data Science, 18-May-
2020. [Online]. Available: https://towardsdatascience.com/artificial-neural-networks-ann-
21637869b306. [Accessed: 07-Jan-2023].
[5] Rene Y. Choi, Aaron S. Coyner, Jayashree Kalpathy-Cramer, Michael F. Chiang, J. Peter
Campbell; Introduction to Machine Learning, Neural Networks, and Deep Learning. Trans. Vis.
Sci. Tech. 2020;9(2):14. doi: https://doi.org/10.1167/tvst.9.2.14.
[6] Sharma, S., Sharma, S., & Athaiya, A. (2017). Activation functions in neural networks.
Towards Data Sci, 6(12), 310-316.
[8] J. Han and C. Moraga, “The influence of the sigmoid function parameters on the speed of
backpropagation learning,” Lecture Notes in Computer Science, pp. 195–201, Jun. 1995, doi:
10.1007/3-540-59497-3_175.
[10] K. Abdelouahab, M. Pelcat, and F. Berry, “Why TanH is a Hardware Friendly Activation
Function for CNNs,” HAL (Le Centre Pour La Communication Scientifique Directe), Sep. 2017,
doi: 10.1145/3131885.3131937.
[12] Dansbecker, “Rectified Linear Units (ReLU) in Deep Learning,” Kaggle, May 2018, [Online].
Available: https://www.kaggle.com/code/dansbecker/rectified-linear-units-relu-in-deep-learning
[13] D. Liu, “A Practical Guide to ReLU - Danqing Liu - Medium,” Medium, Jul. 12, 2019. [Online].
Available: https://medium.com/@danqing/a-practical-guide-to-relu-b83ca804f1f7
[14] L. S. Toth, “Phone recognition with deep sparse rectifier neural networks,” International
Conference on Acoustics, Speech, and Signal Processing, May 2013, doi:
10.1109/icassp.2013.6639016.
[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
36
[16] M. A. Nielsen, “Neural Networks and Deep Learning,” 2015.
http://neuralnetworksanddeeplearning.com/chap2.html
[18] A. Al-Masri, “How Does Backpropagation in a Neural Network Work?,” Built In, Oct. 2022,
[Online]. Available: https://builtin.com/machine-learning/backpropagation-neural-network
[19] DeepAI, “Feed Forward Neural Network,” DeepAI, Jun. 2020, [Online]. Available:
https://deepai.org/machine-learning-glossary-and-terms/feed-forward-neural-network
[20] V. Kurama, “Feedforward Neural Networks: A Quick Primer for Deep Learning,” Built In, Sep.
2019, [Online]. Available: https://builtin.com/data-science/feedforward-neural-network-intro
[21] K. O’Shea and R. R. Nash, “An Introduction to Convolutional Neural Networks,” arXiv (Cornell
University), Nov. 2015, doi: 10.48550/arxiv.1511.08458.
[22] J. Gu et al., “Recent advances in convolutional neural networks,” Pattern Recognition, vol. 77,
pp. 354–377, May 2018, doi: 10.1016/j.patcog.2017.10.013.
[24] “What Is a Convolutional Neural Network? | 3 things you need to know,” MATLAB & Simulink.
https://www.mathworks.com/discovery/convolutional-neural-network-matlab.html
[25] “A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way | Saturn Cloud
Blog,” Apr. 17, 2023. https://saturncloud.io/blog/a-comprehensive-guide-to-convolutional-
neural-networks-the-eli5-way/
[26] “MoveNet: Ultra fast and accurate pose detection model.,” TensorFlow, [Online]. Available:
https://www.tensorflow.org/hub/tutorials/movenet
[28] J. Jiang, S. Tao, D. Lian, Z. Huang, and E. Chen, “Predicting Human Mobility with Self-
attention and Feature Interaction,” Lecture Notes in Computer Science, pp. 117–131, Aug.
2020, doi: 10.1007/978-3-030-60290-1_9.
[30] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9,
no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[33] H. Pedamallu, “RNN vs GRUS vs LSTM - Analytics Vidhya - Medium,” Medium, Dec. 16, 2021.
[Online]. Available: https://medium.com/analytics-vidhya/rnn-vs-gru-vs-lstm-863b0b7b1573
[35] K. Cho et al., “Learning Phrase Representations using RNN Encoder–Decoder for Statistical
Machine Translation,” arXiv (Cornell University), Jan. 2014, doi: 10.3115/v1/d14-1179.
[36] “Recurrent Neural Network (RNN) – Part 5: Custom Cells,” The Neural Perspective, Nov. 20,
2016.
https://web.archive.org/web/20180416083110/https://theneuralperspective.com/2016/11/17/rec
urrent-neural-network-rnn-part-4-custom-cells/
[37] J. -M. Hung, J. -Y. Chiang and K. Wang, "Tennis Player Pose Classification using YOLO and
MLP Neural Networks," 2021 International Symposium on Intelligent Signal Processing and
Communication Systems (ISPACS), Hualien City, Taiwan, 2021, pp. 1-2, doi:
10.1109/ISPACS51563.2021.9650925.
[38] R. Chatterjee, S. Roy, S. H. Islam and D. Samanta, "An AI Approach to Pose-based Sports
Activity Classification," 2021 8th International Conference on Signal Processing and Integrated
Networks (SPIN), Noida, India, 2021, pp. 156-161, doi: 10.1109/SPIN52536.2021.9565996.
[39] P. Nguyen Huu, N. Nguyen Thi and T. P. Ngoc, "Proposing Posture Recognition System
Combining MobilenetV2 and LSTM for Medical Surveillance," in IEEE Access, vol. 10, pp.
1839-1849, 2022, doi: 10.1109/ACCESS.2021.3138778.
[40] M. Xu, L. Wang, J. Ren and S. Poslad, "Use of LSTM Regression and Rotation Classification
to Improve Camera Pose Localization Estimation," 2020 IEEE 14th International Conference
on Anti-counterfeiting, Security, and Identification (ASID), Xiamen, China, 2020, pp. 6-10, doi:
10.1109/ASID50160.2020.9271762.
[42] Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li
Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition
Challenge. IJCV, 2015.