
HUMAN FITNESS POSE CLASSIFICATION USING

ARTIFICIAL NEURAL NETWORK

A Thesis By

SIJIE SHANG
ORCID iD: 0009-0004-0308-282X

California State University, Fullerton


Spring, 2023
__________________________________________

In partial fulfillment of the degree:


Master of Science in Computer Science

Department:
Department of Computer Science

Committee:
Rong Jin, Department of Computer Science, Chair
Doina Bein, Department of Computer Science
Kanika Sood, Department of Computer Science

DOI:
10.5281/zenodo.7903558

Keywords:
neural network, recurrent neural networks, long short-term memory, MoveNet

Abstract:
This study presents three models for classifying human fitness poses. First, we use the MoveNet
model to extract the human keypoints. The first model is a Feedforward Neural Network that predicts
the human pose in each frame. The second model is a Long Short-Term Memory (LSTM) network, and
the third is a Gated Recurrent Unit (GRU) network. The last two models take time-series data as input
and achieve better results than the first: the accuracy of the LSTM is 94.76% and the accuracy of the
GRU model is 97.27%.

© 2023, SIJIE SHANG, CC-BY-NC-ND 4.0


TABLE OF CONTENTS

LIST OF TABLES ............................................................................................................................... iv

LIST OF FIGURES ............................................................................................................................. v

ACKNOWLEDGMENTS ..................................................................................................................... vi

Chapter
1. INTRODUCTION ........................................................................................................................... 1

2. NEURAL NETWORKS BASICS ................................................................................................... 2

Inputs and Weights ....................................................................................................................... 2


Activation Functions ...................................................................................................................... 3
Sigmoid .................................................................................................................................... 3
Tanh ......................................................................................................................................... 4
ReLU ........................................................................................................................................ 5
Backpropagation ........................................................................................................................... 6
Forward Pass........................................................................................................................... 6
Backward Pass ........................................................................................................................ 6
Feedforward Neural Network ........................................................................................................ 7
Convolutional Neural Networks ...................................................................................... 7
Feature Extraction ................................................................................................................... 8
Pooling ..................................................................................................................................... 8
Classification............................................................................................................................ 9
MoveNet framework ...................................................................................................................... 10
Recurrent Neural Networks – LSTM ............................................................................................ 12
Recurrent Neural Networks – Gated Recurrent Unit ................................................................... 13

3. RELATED WORKS ....................................................................................................................... 15

Using YOLO and MLP Neural Networks to Recognize Tennis Players’ Poses .......................... 15
Using Detectron2 and Random Forest to Recognize Tennis Players’ Poses ............................. 15
Proposing Posture Recognition System Combining MobilenetV2 and LSTM for Medical
Surveillance ................................................................................................................................ 15
Use of LSTM Regression and Rotation Classification to Improve Camera Pose Localization
Estimation ................................................................................................................................... 16

4. DATASET ...................................................................................................................................... 17

Data Description............................................................................................................................ 17
Data Preprocess ........................................................................................................................... 18
Frame Extraction ..................................................................................................................... 18
Image Cropping ....................................................................................................................... 18

5. IMPLEMENTATION ...................................................................................................................... 19

MoveNet and Feedforward Neural Network ................................................................................. 19


MoveNet and Long Short-Term Memory Network ....................................................................... 20
MoveNet and Gated Recurrent Unit ............................................................................................. 24

6. RESULTS ...................................................................................................................................... 26

MoveNet and Feedforward Neural Network ................................................................................. 26


MoveNet and Long Short-Term Memory Network ....................................................................... 29
MoveNet and Gated Recurrent Unit ............................................................................................. 32

7. CONCLUSION .............................................................................................................................. 36

REFERENCES ................................................................................................................................... 37

LIST OF TABLES

Table Page

1. Classification report, Feedforward neural network ................................................................. 28

2. Classification report, LSTM ..................................................................................................... 31

3. Classification report, GRUs ..................................................................................................... 33

LIST OF FIGURES

Figure Page

1. A Simple Neural Network ........................................................................................................ 3

2. Sigmoid function ...................................................................................................................... 4

3. Tanh function ........................................................................................................................... 5

4. ReLU function .......................................................................................................................... 6

5. Convolutional Neural Networks ............................................................................................... 9

6. 17 keypoints............................................................................................................................. 10

7. MoveNet................................................................................................................................... 11

8. LSTM architecture ................................................................................................................... 13

9. GRUs architecture ................................................................................................................... 14

10. Feedforward neural network model ........................................................................................ 19

11. LSTM model 1 ......................................................................................................................... 21

12. LSTM model 2 ......................................................................................................................... 21

13. GRUs model............................................................................................................ 25

14. Model accuracy, Feedforward neural network........................................................................ 26

15. Confusion matrix, Feedforward neural network ...................................................................... 27

16. Model accuracy, LSTM 1......................................................................................................... 29

17. Model accuracy, LSTM 2......................................................................................................... 30

18. Model accuracy, LSTM 3......................................................................................................... 30

19. Confusion matrix, LSTM .......................................................................................................... 32

20. Model accuracy, GRU ............................................................................................................. 33

21. Confusion matrix, GRU ........................................................................................................... 35

ACKNOWLEDGMENTS

I would like to begin by expressing my sincere appreciation to my advisor, Dr. Rong Jin, for her

unwavering guidance, support, and encouragement throughout the thesis development process. Her

knowledge, comprehension, and zeal have been indispensable in guiding my intellectual and

emotional development. In addition, I would like to thank Drs. Doina Bein and Kanika Sood for

their valuable guidance. My parents, whose affection and support contributed to my success, also

deserve special recognition.

CHAPTER 1

INTRODUCTION

As AI technologies continue to advance, there is an increasing demand for personalized

experiences. Shopping websites recommend products based on customer preferences, grammar

checkers offer writing suggestions based on target audience and goals, and autopilot driving systems

can alert drivers to potential hazards. In this thesis, I propose a personalized training experience for

fitness enthusiasts, using machine learning models to recognize and classify human fitness poses. By

providing feedback, our models aim to help users exercise more efficiently. I present different

approaches to achieving this goal, combining the MoveNet framework with a Feedforward Neural

Network, with Gated Recurrent Units, and with Long Short-Term Memory networks.
CHAPTER 2

NEURAL NETWORKS BASICS

Neural networks, sometimes called artificial neural networks (ANNs) or simulated neural

networks (SNNs), are a subset of machine learning and form the core of deep learning methods

[1]-[5]. Their name and methodology take inspiration from the human brain, copying how natural

neurons communicate with each other [1]. The fundamental concept of a neural network is to

reproduce numerous closely connected brain cells in a computer, enabling it to acquire knowledge,

identify patterns, and reach decisions similar to humans [3].

Components of a typical neural network involve neurons, weights, biases, propagation

function, and a learning rule. Each neuron receives input from its predecessor neurons and has an

activation threshold and an activation function.

Inputs and Weights

A neural network is a series of nodes, or neurons. Within each node are a set of inputs, weights,

and a bias value, as you can see in Figure 1. As an input enters the node, it gets multiplied by a

weight value, and the resulting output is either observed or passed to the next layer in the neural

network [1]-[5].

The weights and bias are possibly the most important concepts of a neural network. When

inputs are transmitted between neurons, weights are applied to them and passed into an activation

function along with bias.

A neural network is initialized with a set of random weights, which are then optimized during

training to minimize the prediction error. Equation 1 shows the formula for a single neuron's output.

Y = Bias + W1X1 + W2X2 + … + WnXn (1)

Y: the neuron in the next layer, as shown in Figure 1

Wi: weights, which decide how much Xi influences the next layer

Xi: inputs

Bias: a constant number

Figure 1. A Simple Neural Network
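
To make Equation 1 concrete, the following short Python sketch computes one neuron's output; the
inputs, weights, and bias values are illustrative examples, not values from this thesis.

import numpy as np

# Equation 1: one neuron computing a weighted sum plus bias.
x = np.array([0.5, -1.2, 3.0])   # inputs X1..Xn
w = np.array([0.8, 0.1, -0.4])   # weights W1..Wn
bias = 0.2

y = bias + np.dot(w, x)          # Y = Bias + W1*X1 + W2*X2 + W3*X3
print(y)                         # -0.72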

Activation Functions

An activation function is used in artificial neural networks to determine whether a neuron

should be activated or not [6]. It does this by calculating the weighted sum of inputs and further

adding bias to it.

The purpose of an activation function is to introduce non-linearity into the output of a neuron.

This means that instead of producing a simple linear output, an activation function can produce more

complex outputs that can better model real-world data.

There are several types of activation functions including binary step functions which depend on

a threshold value that decides whether a neuron should be activated or not [7].

Sigmoid

The Sigmoid function [8, 9] is a logistic function that helps normalize the output of any input in

the range between 0 and 1, as you can see from Figure 2. It takes a real value as input and gives a

probability that’s always between 0 and 1. Equation 2 shows an example of the Sigmoid function.

y = 1 / (1 + e^(-x)) (2)

Here, 'x' is the input value and 'e' is the natural logarithm's base, which is about 2.718. A large

negative input value will bring the Sigmoid function to values near 0, while large positive input values
will bring the values close to 1. The Sigmoid function returns a probability of 0.5 when the input value

is 0, as we can see in Figure 2.

Figure 2. Sigmoid function

Tanh

The Tanh (hyperbolic tangent) function [9]-[11], as you can see in Figure 3, is another popular

activation function used in neural networks, which is similar to the Sigmoid function but normalizes

the output of any input in the range between -1 and 1. The formula is shown in Equation 3 below.

Simplifying Equation 3 yields Equation 4.

y = (e^x − e^(-x)) / (e^x + e^(-x)) (3)

y = 2 / (1 + e^(-2x)) − 1 (4)

Figure 3. Tanh function


ReLU

The ReLU (Rectified Linear Unit) activation function [9, 12], as you can see in Figure 4, is a

great alternative to both sigmoid and tanh activation functions. It does not have the vanishing gradient

problem and is computationally inexpensive. Equation 5 shows the formula of this activation function.

y = max(0, x) (5)

The input value, in this case, is 'x'. If the input value is positive or equal to zero, the ReLU

function produces that value, and if it is negative, the result will be 0. Many deep learning models

have shown good performance with this straightforward function since it introduces non-linearity while

requiring little processing power.

Due to its simplicity, computational efficiency, and capacity to address the vanishing gradient

problem, the ReLU activation function is widely used in machine learning and deep learning

applications such as image classification, object detection, and natural language processing [12]-[14].

Figure 4. ReLU function
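
As a quick illustration, the sketch below implements the Sigmoid, Tanh, and ReLU functions from
Equations 2, 3, and 5 using NumPy; the sample inputs are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # output in (0, 1), Equation 2

def tanh(x):
    return np.tanh(x)                  # output in (-1, 1), Equation 3

def relu(x):
    return np.maximum(0.0, x)          # 0 for negative inputs, Equation 5

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))   # [0.119 0.5   0.881]
print(tanh(x))      # [-0.964  0.     0.964]
print(relu(x))      # [0. 0. 2.]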

Backpropagation

Backpropagation [15]-[18] is a fundamental optimization algorithm employed in the training of

artificial neural networks, particularly for supervised learning tasks. It plays a critical role in adjusting

the model's weights to minimize the error between predicted outputs and actual target values. The
primary objective of backpropagation is to reduce the loss function, a quantitative measure of the

discrepancy between the model's predictions and the ground truth.

The backpropagation algorithm operates by leveraging the chain rule of calculus to compute

the gradients of the loss function with respect to each weight within the network. These gradients

provide information on the direction and magnitude of the weight adjustments necessary to minimize

the loss function. The algorithm consists of two primary stages:

Forward Pass

During this phase, input data is propagated through the network to compute the predicted

output. This process entails calculating the weighted sum of inputs for each neuron, passing the

result through an activation function, and repeating this procedure for each layer in the network until

the output layer is reached.

Backward Pass

In this stage, the error between the predicted output and the target values is determined, and

the gradients of the loss function with respect to each weight are calculated. Starting at the output

layer and moving backward through the network, the error gradients for each neuron are computed.

Using the chain rule, the gradients for each weight are then calculated. Once the gradients for all

weights have been obtained, the weights are updated using an optimization algorithm, such as

gradient descent or one of its variants (e.g., stochastic gradient descent, Adam).

Backpropagation allows for the simultaneous and continuous update of all the weights in the

network, making it a computationally efficient method for training neural networks. Through iterative

weight adjustments aimed at minimizing the loss function, the model learns to make more accurate

predictions, ultimately enhancing its performance on the given task. The backpropagation algorithm

has been instrumental in the advancement of deep learning, enabling the development of complex

neural network architectures capable of solving a wide range of problems in various domains, such as

computer vision, natural language processing, and speech recognition.
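
To make the two stages concrete, the following minimal sketch performs one forward and backward
pass for a single sigmoid neuron with a squared-error loss; the inputs, initial weights, and learning
rate are illustrative values, not part of the models in this thesis.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])    # inputs
w = np.array([0.5, -0.3])   # initial weights
b = 0.1                     # bias
target = 1.0                # ground-truth value
lr = 0.5                    # learning rate

# Forward pass: weighted sum, activation, then squared-error loss.
z = np.dot(w, x) + b
y = sigmoid(z)
loss = 0.5 * (y - target) ** 2

# Backward pass: the chain rule gives dL/dz = (y - target) * y * (1 - y),
# and dL/dw_i = dL/dz * x_i. Update the weights by gradient descent.
dz = (y - target) * y * (1 - y)
w -= lr * dz * x
b -= lr * dz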


Feedforward Neural Network

Feedforward neural networks [1, 16, 19, 20] are a type of artificial neural network in which

information flows in one direction: from the input layer, through the hidden layers, and finally to the

output layer. Figure 1 shows a simple Feedforward Neural Network. They are called "feedforward"

because there are no cycles or loops in the connections between neurons. This architecture allows

for the efficient processing of input data and is widely used in various applications, such as image

recognition and natural language processing.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) [21] are a specialized type of deep learning model

designed for processing grid-like data, such as images [22]. They are particularly effective at tasks

like image classification, object detection, and semantic segmentation. CNNs consist of two primary

parts: feature extraction and classification with a fully connected layer as you can see in Figure 5 [23]-

[25].

Figure 5. Convolutional Neural Networks

Feature Extraction

In the feature extraction phase, a CNN employs a series of convolutional and pooling layers to

learn and extract meaningful features from the input data. The convolutional layers utilize kernels or

filters (e.g., 3x3 matrices) that are applied to the input data through a sliding window approach. By

performing element-wise multiplication between the input data and the kernel, followed by a
summation, the model generates a feature map that captures local patterns in the input data. This

process is repeated across the entire input, moving the kernel horizontally and vertically according to

a predefined stride.
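
A minimal sketch of this sliding-window operation follows, using an illustrative 5x5 input and a
3x3 averaging kernel; it is a plain-Python demonstration of the idea, not the layers used later in
this thesis.

import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel over the image, computing an element-wise
    # multiplication and summation at each position (Chapter 2 text).
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0     # simple averaging filter
print(conv2d(image, kernel))       # 3x3 feature map of local averages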

In this study, I utilized the InceptionV3 pre-trained model to extract features from input images.

InceptionV3 is a deep convolutional neural network architecture developed by Google researchers

that has excelled in a number of computer vision tasks; its predecessor, GoogLeNet (Inception v1),

won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014 [42]. The architecture

makes effective use of computational resources and includes numerous parallel branches with

different convolutional filter sizes, which lets the model capture a wide variety of information at

various scales.

The main benefit of adopting a pre-trained model is that it may take advantage of the

knowledge acquired through training on a large dataset, in this case, the ImageNet dataset. ImageNet

is a sizable dataset that includes millions of images from tens of thousands of object categories. As a

result, the model has already acquired a vast array of features and patterns, which makes it a strong

starting point for extracting useful characteristics from new input images. Furthermore, adopting a

pre-trained model frees us from having to train the entire model from scratch, allowing us to

concentrate on optimizing the subsequent layers of our architecture, such as the LSTM or GRU, to

better suit our specific problem domain. On our dataset, this may result in faster convergence and

better generalization.

Pooling

After the convolutional layers, a pooling layer is used to reduce the spatial dimensions of the

feature maps. This step helps in decreasing computational complexity, reducing the risk of overfitting,

and improving translation invariance. Two common types of pooling are max pooling and average

pooling. Max pooling selects the maximum value within the pooling kernel's coverage, while average

pooling computes the average value of the covered region. Figure 5 shows a convolutional neural

network with max pooling.
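
A minimal sketch of 2x2 max pooling with stride 2 follows; the feature map values are illustrative.

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep the largest value inside each pooling window.
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            pooled[i, j] = window.max()
    return pooled

fm = np.array([[1, 3, 2, 4],
               [5, 6, 7, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]], dtype=float)
print(max_pool(fm))   # [[6. 8.] [9. 6.]]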


Classification

Following the feature extraction and pooling steps, the CNN architecture typically includes one

or more fully connected layers. These layers serve to classify the extracted features into predefined

categories, such as object classes in an image classification task. The fully connected layers connect

the output from the last pooling or convolutional layer to a final output layer, which often uses a

softmax activation function to produce class probabilities.

In summary, Convolutional Neural Networks (CNNs) are powerful models designed for

handling grid-like data structures, such as images. They consist of two main parts: feature extraction

using convolutional and pooling layers, and classification using fully connected layers. By employing

these components, CNNs can effectively learn and extract meaningful features from input data and

subsequently classify them into specific categories.

MoveNet framework

MoveNet is a machine learning framework designed to detect 17 human body keypoints

(Figure 6), which are critical points on the human body, such as joints or specific anatomical

landmarks [26]-[27]. MoveNet is particularly useful for tasks related to human pose estimation and

motion analysis. It can be employed in various applications, including physical therapy, sports

analysis, video game development, and gesture recognition.

The MoveNet framework typically uses a deep learning architecture, such as a Convolutional

Neural Network (CNN), as its core component. These layers apply a series of filters to the input data

to learn and extract features, which are then used for various tasks like classification or regression.

The difference between the MoveNet framework and a regular CNN is that MoveNet is

specifically tailored for detecting keypoints on the human body [28]. While a regular CNN can be

applied to a wide range of image-based tasks, MoveNet focuses on the particular challenges of

human pose estimation, including dealing with occlusions, variations in appearance, and different

body shapes and sizes.



Figure 6. 17 keypoints [29]

To detect keypoints, MoveNet typically employs a combination of the following components as

you can see in Figure 7.

Figure 7. MoveNet
Person Center Heatmap

This heatmap represents the likelihood of a person's center (usually the torso) being present at

each location in the image. It helps the model localize individuals in an image, even if they are

partially occluded or overlapping.

Keypoint Regression Field

This component predicts a vector field for each keypoint, where the vectors point towards the

keypoint's location relative to the person's center. The model uses these vector fields to refine the

positions of the detected keypoints.

Person Keypoint Heatmap

This is a set of heatmaps, one for each keypoint type (e.g., left elbow, right knee). Each

heatmap estimates the probability of a keypoint being present at each location in the image. The

model combines this information with the person center heatmap and keypoint regression field to

improve keypoint detection.

2D Per-Keypoint Offset Field

This component predicts a 2D offset for each detected keypoint, which helps refine the

keypoint's location further.
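
As a hedged sketch of how MoveNet can be run in practice, the snippet below loads the single-pose
Lightning model from TensorFlow Hub following the public MoveNet tutorial [26]; the model URL, the
192x192 input size, and the output layout come from that tutorial rather than from this thesis, and the
input filename is hypothetical.

import tensorflow as tf
import tensorflow_hub as hub

# Load the MoveNet single-pose Lightning model (URL per the TF Hub tutorial [26]).
model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures["serving_default"]

# "frame.jpg" is a hypothetical input frame from one of the exercise videos.
image = tf.io.read_file("frame.jpg")
image = tf.image.decode_jpeg(image)
image = tf.image.resize_with_pad(tf.expand_dims(image, axis=0), 192, 192)
image = tf.cast(image, dtype=tf.int32)

outputs = movenet(image)
# Shape (1, 1, 17, 3): 17 keypoints, each as (y, x, confidence score).
keypoints = outputs["output_0"]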

Recurrent Neural Networks – LSTM

Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber [30],

are a type of recurrent neural network (RNN) architecture specifically designed to overcome the

vanishing gradient problem encountered in standard RNNs.

The LSTM architecture [31] introduces a memory cell, which is regulated by three gates: the

input gate, the forget gate, and the output gate. It is shown in Figure 8. These gates control the flow

of information within the LSTM, allowing it to retain, update, or remove information from the memory

cell as needed.

LSTMs perform well on many tasks, such as natural language understanding, speech

recognition, and time-series prediction [32]-[34]. They can learn how elements of a sequence

depend on each other over long periods of time, which makes them especially useful for tasks that

involve sequential data with complex temporal patterns. On some tasks and datasets, LSTMs have

been shown to do better than standard RNNs and, sometimes, GRUs.

Figure 8. LSTM architecture [31]

However, LSTMs have some drawbacks [32]-[34]. They have a more complicated design than

GRUs, which means they have more parameters and need more processing power. This can make

training take longer and may require more powerful hardware. LSTMs may also be more likely to

overfit, especially on smaller datasets, so they may need regularization techniques to work at

their best.

Recurrent Neural Networks – Gated Recurrent Units

Gated Recurrent Units (GRUs) are a type of recurrent neural network (RNN) architecture,

introduced by Cho et al. [35] as a simpler alternative to the Long Short-Term Memory (LSTM)

networks. GRUs are made to effectively learn long-term dependencies in sequential data by

addressing the vanishing gradient problem that plagues standard RNNs [36].

A GRU has two gates, the update gate and the reset gate, as shown in Figure 9. These gates

control how information moves between hidden states, letting the model learn and remember

important information over longer time periods. The update gate controls how much the previous

hidden state affects the new hidden state, while the reset gate controls how much the previous

hidden state affects the candidate hidden state.

Compared to LSTMs, GRUs have a simpler design with fewer parameters, which makes them

easier and computationally cheaper to train. This often leads to equal or better performance on

certain tasks, especially when resources are limited or the dataset is small. GRUs have been used

successfully in many of the same fields as LSTM networks.

Even though GRUs have their benefits, they may not always be better than LSTMs. This is

especially true when working with very long sequences or large datasets. In some situations, the extra

complexity of LSTM networks can make a model more expressive and better able to capture how the

data is connected. The choice between GRUs and LSTMs often depends on the problem at hand, the

dataset, and the computational resources that are available [32]-[34].

Figure 9. GRUs architecture [36]


CHAPTER 3

RELATED WORKS

In this chapter, I introduce related research from other authors on using machine learning to

detect and classify human poses.

Using YOLO and MLP Neural Networks to Recognize Tennis Players’ Poses [37]

Jhen-Min Hung et al. proposed using YOLO v5 multi-target bounding-box detection and an MLP

neural network to classify whether a player's stroke is a forehand or backhand swing. They use

data from the 2019 and 2020 US Open Tennis Championships. First, they extract frames from the

videos; then they use the YOLO v5 model to detect the players and balls. Next, they use the MLP to

classify the poses. The accuracy on the validation dataset is 93%.

Using Detectron2 and Random Forest to Recognize Tennis Players’ Poses [38]

Rajdeep Chatterjee et al. proposed an intelligent system to detect tennis players’ poses. They

use Detectron2, an object detection and segmentation framework developed by Meta AI, to detect

human keypoints. They then use those keypoints as the input to train a Random Forest model. They

also trained different fine-tuned Convolutional Neural Network models and compared the results.

Their Random Forest model achieved the highest accuracy and the shortest training time.

Proposing Posture Recognition System Combining MobilenetV2 and


LSTM for Medical Surveillance [39]

Huu et al. present a posture recognition system that can be used for medical surveillance. The

system uses a deep learning model called MobileNetV2 to extract features from images of human

poses. It then uses an LSTM to analyze the temporal patterns of the features and classify them into

different postures. The system can recognize 10 postures, such as standing, sitting, lying down,

bending, and squatting. It achieves high accuracy and efficiency on various datasets and can be

applied to monitor patients' health conditions and activities. By using the LSTM, they improve the

accuracy to 99%, compared with 88% without it.
Use of LSTM Regression and Rotation Classification to Improve
Camera Pose Localization Estimation [40]

Meng Xu et al. propose a method for estimating camera pose (the position and orientation of a

camera) using deep learning. The authors combine a long short-term memory (LSTM) module with a

rotation classification loss to regress the camera pose from image sequences. The LSTM module

captures temporal information and learns features from images, while the rotation classification loss

reduces errors in estimating camera orientation. The authors supervise the pose estimation with

dynamic, weighted multi-losses that balance different components of camera pose (such as yaw,

pitch, roll, translation, and quaternion). They evaluate the proposed method on two public datasets

(KITTI and EuRoC) and compare it with several state-of-the-art

methods. The results show that the proposed method achieves better performance than existing

methods in terms of accuracy and robustness.


CHAPTER 4

DATASET

In this chapter, the focus will be on introducing the data utilized for training purposes and

discussing the data preprocessing steps in detail.

Data Description

Currently, I am utilizing a comprehensive fitness video dataset provided by Infinity AI [41] to

conduct my research. This dataset comprises a compilation of the most frequently performed

exercises, which include:

Arm Raises – a simple upper body exercise that targets the shoulder muscles, primarily
the deltoids, and helps improve overall shoulder stability and strength.

Pushups – an upper body strength training exercise that targets the chest, triceps, and
shoulders.

Curls – an isolation exercise that engages the biceps and forearms while strengthening
the upper arms.

Flys – a resistance exercise that works the shoulder muscles, specifically the deltoids.

Squats – a lower body compound exercise that targets the glutes, quads, hamstrings,
and calves.

Bird Dogs – a dynamic exercise that enhances core stability and strengthens the back
extensors.

Supermans – an isometric exercise that targets the lower back muscles and improves
spinal erector strength.

Bicycle Crunches – an abdominal exercise that engages the rectus abdominis and
obliques.

Leg Raises – a lower abdominal exercise that strengthens the hip flexors and lower abs.

Overhead Press – a compound exercise that strengthens the shoulder muscles, triceps,
and upper back.

Each exercise is accompanied by 100 videos that have been filmed under varying light settings

and backgrounds. This diverse dataset allows for a thorough analysis of exercise form and technique,

as well as the impact of external factors on video analysis.


Data Preprocess

To preprocess the data from the comprehensive fitness video dataset provided by Infinity AI

[41], I followed a series of steps to extract relevant information and prepare the data for further

analysis:

Frame Extraction

Since the dataset consists of videos, the first step was to extract individual frames from each

video. This process involves breaking down the videos into a sequence of images, which can then be

processed and analyzed using computer vision techniques. Depending on the frame rate of the

videos and the desired temporal resolution, a specific number of frames per second were extracted to

capture the critical moments of each exercise.
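
A minimal sketch of this step, assuming OpenCV is used; the video filename and sampling rate
below are illustrative.

import cv2

def extract_frames(video_path, every_n=1):
    # Read the video frame by frame, keeping one frame every `every_n`.
    cap = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break                    # end of video
        if index % every_n == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

frames = extract_frames("armraise_001.mp4", every_n=2)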

Image Cropping

After extracting the frames, the next step was to crop the images to focus on the people

performing the exercises. To achieve this, I utilized the metadata provided with the dataset, which

includes information about the person's position in each frame. Based on this information, I cropped

the images so that the person was centered, ensuring that the model's attention would be

concentrated on the individual and their movements during the exercises.
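
A minimal sketch of this cropping step; the bounding-box field names (x, y, width, height) are
hypothetical stand-ins for the dataset's actual per-frame metadata.

def crop_person(frame, bbox, margin=20):
    # Crop the frame to the annotated person, with a small margin,
    # clamped to the image boundaries.
    h, w = frame.shape[:2]
    x0 = max(bbox["x"] - margin, 0)
    y0 = max(bbox["y"] - margin, 0)
    x1 = min(bbox["x"] + bbox["width"] + margin, w)
    y1 = min(bbox["y"] + bbox["height"] + margin, h)
    return frame[y0:y1, x0:x1]      # person-centered crop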


CHAPTER 5

IMPLEMENTATION

In this chapter, I will introduce each model I used and show the models in detail.

MoveNet and Feedforward Neural Network

After using the MoveNet framework to get the positions of the human keypoints, I build the

model as Figure 10 shows.

The code in Figure 10 defines a feedforward neural network model that takes as input a tensor

of shape (batch_size, 51), where 51 is the number of features in the input data (17 keypoints, each

with an x coordinate, a y coordinate, and a confidence score).

import tensorflow as tf
from tensorflow import keras

# Input: 51 values per sample (17 keypoints x 3 values each).
inputs = tf.keras.Input(shape=(51,))
embedding = landmarks_to_embedding(inputs)  # custom keypoint-embedding function

# Two ReLU6 dense layers, each followed by dropout to reduce overfitting.
layer = keras.layers.Dense(128, activation=tf.nn.relu6)(embedding)
layer = keras.layers.Dropout(0.5)(layer)
layer = keras.layers.Dense(64, activation=tf.nn.relu6)(layer)
layer = keras.layers.Dropout(0.5)(layer)

# Softmax output layer: one probability per exercise class.
outputs = keras.layers.Dense(len(class_names), activation="softmax")(layer)

model = keras.Model(inputs, outputs)
model.summary()

Figure 10. Feedforward neural network model

The input tensor is passed through a custom function called landmarks_to_embedding, which

maps the input tensor to a higher-dimensional space, creating a tensor of shape (128).
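
The thesis does not show the body of landmarks_to_embedding. A minimal sketch consistent with
the description above (51 inputs projected to a 128-dimensional embedding), following the imports in
Figure 10, might look like this; the layer choices are assumptions, not the author's implementation.

def landmarks_to_embedding(inputs):
    # Reshape the flat 51-value vector into 17 keypoints x (y, x, score).
    reshaped = keras.layers.Reshape((17, 3))(inputs)
    flattened = keras.layers.Flatten()(reshaped)
    # Project into the 128-dimensional embedding space described above.
    return keras.layers.Dense(128, activation=tf.nn.relu6)(flattened)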

The mapped tensor is then passed through two dense layers with different numbers of

neurons. The first dense layer has 128 neurons and applies the ReLU6 activation function, which is a

variant of the Rectified Linear Unit (ReLU) activation function that has a maximum output of 6. The

second dense layer has 64 neurons and applies the ReLU6 activation function as well. Both dense

layers are followed by a dropout layer, which randomly sets 50% of the input units to 0 at each update

during training, to prevent overfitting.


Finally, the output of the second dropout layer is passed through a dense layer with

len(class_names) neurons and a softmax activation function. This layer produces the output of the

neural network, which is a probability distribution over the different classes in the dataset.
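
A minimal sketch of compiling and training this model follows; the loss function, optimizer, and
batch size are illustrative choices, while the 128-epoch count matches the training run reported in
Chapter 6.

model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # assumes one-hot encoded labels
              metrics=["accuracy"])
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=128, batch_size=16)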

MoveNet and Long Short-Term Memory Network

In this section, I go over a new LSTM model that processes video sequences with the

MoveNet framework. MoveNet is a powerful deep-learning model created for real-time human posture

estimation. It finds important spots on the human body to identify each person's pose in a video

frame.

The new LSTM models shown in Figures 11 and 12 use MoveNet to provide keypoints

for each body part, together with their accompanying confidence scores. By capturing the spatial

interactions between various body parts, this new representation offers a more organized and

meaningful portrayal of human poses.

Each row in the tabular structure that results from MoveNet processing the new data

corresponds to one frame of a video sequence. The x and y coordinates and confidence score of

each MoveNet-identified keypoint are listed in separate columns. Body parts like the nose, eyes,

ears, shoulders, elbows, wrists, hips, knees, and ankles are among the tracked keypoints. Additionally,

the table has columns for the class number and class name, which correspond to the different

exercises shown in the video.

The LSTM model can be altered to handle this new type of input data by altering the input

shape to match the quantity of keypoints and confidence ratings. The model can now learn and

predict the class labels (names of the workouts) using the spatial correlations between the keypoints

as opposed to learning and predicting them from the raw pixel values of the video frames.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (LSTM, Bidirectional, Dense, Dropout,
                                     BatchNormalization)

def create_lstm_model(input_shape, num_classes):
    model = Sequential()
    # Bidirectional LSTM reads each frame sequence forwards and backwards.
    model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=input_shape))
    model.add(Dropout(0.4))
    model.add(LSTM(128, return_sequences=True))
    model.add(Dropout(0.4))
    model.add(LSTM(64))  # final LSTM layer returns a single vector per sequence
    model.add(Dropout(0.4))
    model.add(Dense(32, activation="relu"))
    model.add(Dense(num_classes, activation="softmax"))

    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model

Figure 11. LSTM model 1

def create_lstm_model_2(input_shape, num_classes):
    model = Sequential()
    model.add(Bidirectional(LSTM(256, return_sequences=True), input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(LSTM(128, return_sequences=True))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(LSTM(64))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(32, activation="relu"))
    model.add(Dense(num_classes, activation="softmax"))

    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model

Figure 12. LSTM model 2


I'll go over how to choose the LSTM model's sequence length.

Model 1, shown in Figure 11, utilizes the first 100 frames of each video. The model starts

with a bidirectional LSTM layer with 128 units, then a dropout layer, an LSTM layer with 128 units, a

second dropout layer, an LSTM layer with 64 units, a dropout layer, a dense layer with 32 units, and

finally a dense softmax layer to output the class probabilities. However, the prediction accuracy of

this model is subpar.

A second approach, segmenting the videos, improves performance. Because each video

contains repetitive movements, it is split into 20-frame segments, with any leftover frames discarded.

The result is Model 2, named create_lstm_model_2. For this model, the data are first scaled to

zero mean and unit variance using the normalize_data function. The model also becomes more

sophisticated as additional layers are added.

Model 2, shown in Figure 12, starts with a bidirectional LSTM layer with 256 units, followed by

batch normalization and dropout. Next come an LSTM layer with 128 units and an LSTM layer with

64 units, each followed by batch normalization and dropout, then a dense layer with 32 units and

finally a dense softmax layer for the class probabilities.

The normalized data is used for both training and testing of Model 2. In order to prevent

overfitting and save the best model based on the validation loss, early stopping is included as a

callback during training. The model is trained with a batch size of 16 for a maximum of 100

epochs.

Model 2 tries to improve upon Model 1's prediction accuracy by segmenting the videos into 20-

frame chunks and using a more intricate model with normalization, whereas Model 1 only used the

first 100 frames. This highlights how crucial it is to choose the right sequence length and model

architecture for accurate LSTM-based human activity recognition.
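
A minimal sketch of the segmentation and normalization steps described above; the helper names
follow the thesis, but the implementation details are assumptions.

import numpy as np

def segment_video(frames, seg_len=20):
    # Split one video's per-frame keypoint rows into fixed-length segments,
    # discarding any leftover frames at the end.
    n_segments = len(frames) // seg_len
    return [frames[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

def normalize_data(x, mean=None, std=None):
    # Scale features to zero mean and unit variance; the statistics should
    # come from the training split and be reused on the test split.
    if mean is None:
        mean, std = x.mean(axis=(0, 1)), x.std(axis=(0, 1))
    return (x - mean) / (std + 1e-8), mean, std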


MoveNet and Gated Recurrent Unit

Here, I provide a Gated Recurrent Unit (GRU) model that, like the LSTM model, processes

video sequences using the MoveNet framework. In contrast to the LSTM model, the GRU model is

computationally more efficient.

The normalize_data function is used to scale the data to have a zero mean and unit variance

before applying the GRU model, the same procedure used before building the LSTM model.

The GRU model architecture shown in Figure 13 is specified by the create_gru_model

function. A bidirectional GRU layer with 256 units serves as the model's foundation; batch

normalization and a dropout layer are then applied. A further GRU layer with 128 units, with batch

normalization and dropout, is then included in the model. A third GRU layer with 64 units is

subsequently added, followed again by batch normalization and dropout. Finally, the model has a

dense layer with 32 units and a dense softmax layer that outputs the class probabilities.

Like the LSTM model, this model is compiled with the sparse categorical crossentropy loss

function and the Adam optimizer. The create_gru_model function receives the input shape and

the number of classes to initialize the model. During training, the normalized data is used for both

training and testing.

Early stopping is included as a callback during training to avoid overfitting and save the best

model based on validation loss. A batch size of 16 and a maximum of 100 epochs are used to train

the model.
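
A minimal sketch of this training setup follows; the batch size of 16 and the 100-epoch limit come
from the thesis, while the patience value, restore_best_weights choice, and variable names are
assumptions about how the best model is kept.

from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation loss stops improving and keep the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
model = create_gru_model(input_shape=(20, 52), num_classes=len(class_names))
history = model.fit(x_train_norm, y_train,
                    validation_data=(x_test_norm, y_test),
                    epochs=100, batch_size=16,
                    callbacks=[early_stop])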

In conclusion, utilizing video sequences processed with the MoveNet framework, the GRU

model provides an alternative strategy to the LSTM model for human exercise recognition.
from tensorflow.keras.layers import GRU

def create_gru_model(input_shape, num_classes):
    model = Sequential()
    # Same layout as LSTM model 2, with GRU cells in place of LSTM cells.
    model.add(Bidirectional(GRU(256, return_sequences=True), input_shape=input_shape))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(GRU(128, return_sequences=True))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(GRU(64))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(32, activation="relu"))
    model.add(Dense(num_classes, activation="softmax"))

    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model

Figure 13. GRUs model


CHAPTER 6

RESULTS

MoveNet and Feedforward Neural Network

Before analyzing the results, it is important to examine the model's training and validation

accuracy over time, as depicted in the accuracy plot. This graphical representation showcases how

the model's performance evolves as the number of training epochs increases. An epoch refers to a

complete iteration through the entire training dataset. In this case, the model has been trained for a

total of 128 epochs.

The accuracy plot reveals a positive trend in both training and validation accuracy as the

number of epochs increases. Specifically, the training accuracy reaches approximately 80% as Figure

14 shows, while the validation accuracy achieves a higher value of around 84%. It is noteworthy that

the validation accuracy surpasses the training accuracy, which may suggest that the model

generalizes well to the unseen data. This is a desirable outcome, as it indicates that the model is less

likely to be overfitting on the training data and can maintain good performance when encountering

new data samples.

Figure 14. Model accuracy, Feedforward neural network


Following this analysis, a more detailed evaluation of the model's classification capabilities can

be conducted using the confusion matrix.

The confusion matrix shown in Figure 15 provides an easy-to-understand visual depiction of

how well the neural network model classified the given tasks. The true class (actual

exercise) is represented by each row in the matrix, while the predicted class (exercise predicted by

the model) is represented by each column. The examples that were correctly classified for each

exercise are represented by the diagonal values in bold.

Figure 15. Confusion matrix, Feedforward neural network

For instance, the value at the first row and first column (5312) indicates that 5312 instances of

the "armraise" exercise have been accurately identified by the model. The first row's non-diagonal

values, such as 1, 0, 3, etc., represent the number of 'armraise' instances that were mistakenly
26
labeled as other exercises. One 'armraise' instance, for instance, was mistakenly labeled as a

'basic_legraise'.

According to the provided confusion matrix in Figure 15, the model appears to do well when

classifying activities like "armraise", "fly", and "overhead press". However, the model struggles to

distinguish between the terms "basic_legraise", "legraise", and "pushup", The non-diagonal values of

these rows, which are substantially high in comparison to the diagonal values, reveal the

misclassifications.

The classification report, which comes after the confusion matrix, gives more details about how

well the model performs across various workout classes.

As we can see in Table 1, the model performs well for several classes, such as "curl", "fly", and

"overheadpress", with precision scores of 0.86, 0.84, and 0.89, respectively, according to the

classification report. These excellent results show that the model can correctly categorize these

workouts while balancing precision and recall.

Table 1. Classification report, Feedforward neural network

precision recall f1-score support


armraise 0.67 0.86 0.75 6197
basic_legraise 0.29 0.04 0.07 436
bicyclecrunch 0.17 0.24 0.20 167
birddog 0.59 0.85 0.70 713
curl 0.86 0.81 0.83 5929
fly 0.84 0.90 0.87 2738
legraise 0.00 0.00 0.00 436
overheadpress 0.89 0.81 0.85 5128
pushup 0.74 0.63 0.68 922
squat 0.75 0.65 0.70 5177
superman 0.28 0.43 0.34 202
accuracy 0.76 28045
macro avg 0.55 0.57 0.54 28045
weighted avg 0.76 0.76 0.76 28045
However, the model has trouble discriminating some exercises, as seen in the low precision

scores for "basic_legraise", "bicyclecrunch", and "superman", which are 0.29, 0.17, and 0.28,

respectively. These findings imply that additional training data or model improvement may be

necessary to enhance the model's performance for these particular classes.

Overall, the model performs well, with a weighted average accuracy of 0.76.
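
For reference, a confusion matrix and classification report like the ones above can be produced
with scikit-learn; the variable names below are illustrative, with y_true holding the true class
indices for the test set.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_prob = model.predict(x_test)        # softmax probabilities per class
y_pred = np.argmax(y_prob, axis=1)    # predicted class index per sample
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=class_names))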

MoveNet and Long Short-Term Memory Network

In this section, I go over the results of LSTM models in conjunction with the MoveNet

framework to recognize human exercise. The model was initially trained using 100-frame sequences

for each video. However, this strategy produced subpar results: as Figure 16 shows, the model's

accuracy for both training and validation stayed below 22%. To solve this problem, each video was

divided into 20-frame segments, with any extra frames discarded. As a result, x_train's shape

changed from (800, 100, 52) to (4809, 20, 52).

Figure 16. Model accuracy, LSTM 1

As we can see in Figure 17, the training accuracy eventually rose to over 80% thanks to the

improved methodology. The validation accuracy, however, did not consistently improve with the

number of epochs, eventually declining to 49.39% at the end of training. To further improve the

model's performance, the complexity of the model was increased and the data was standardized. As a

result, the test accuracy was 94.76%, as we can see in Figure 18.

Figure 17. Model accuracy, LSTM 2

Figure 18. Model accuracy, LSTM 3

The classification report for the final optimized LSTM model with MoveNet (see Table 2) shows

that the model works remarkably well for a number of exercise classes, including "armraise," "curl,"

"fly," "overheadpress," and "squat." Other exercises, such as "basic_legraise," "bicyclecrunch,"

"legraise," "pushup," and "superman," are harder to categorize correctly.


Table 2. Classification report, LSTM

precision recall f1-score support


armraise 0.97 1.00 0.98 299
basic_legraise 0.00 0.00 0.00 15
bicyclecrunch 0.00 0.00 0.00 3
birddog 0.77 1.00 0.87 27
curl 1.00 1.00 1.00 289
fly 0.99 1.00 1.00 128
legraise 0.00 0.00 0.00 15
overheadpress 0.96 0.99 0.98 250
pushup 0.53 0.22 0.31 41
squat 1.00 1.00 1.00 248
superman 0.08 1.00 0.15 3
accuracy 0.95 1318
macro avg 0.57 0.65 0.57 1318
weighted avg 0.94 0.95 0.94 1318

These results are supported by the confusion matrix (see Figure 19), which highlights the

model's shortcomings in classifying some activities while also displaying a high percentage of

accurate predictions for the classes on which it excels.

In conclusion, the MoveNet framework and the LSTM model demonstrate strong performance

in identifying particular human workouts.



Figure 19. Confusion matrix, LSTM

MoveNet and Gated Recurrent Unit

The GRUs model performs admirably, obtaining over 90% accuracy for both the training and

validation sets after just five epochs, as seen in the accuracy plot. The training and validation

accuracies, however, occasionally cross because of fluctuations in the curves. Despite this, early

stopping halts training at epoch 29, with a weighted accuracy of 98.63%, as Figure 20 shows.

Figure 20. Model accuracy, GRUs

According to the GRUs model's classification report in Table 3, it performs remarkably well for

the majority of workout classes, with high precision, recall, and F1-scores. The model's precision

scores are at or close to 1.00 for classes 0, 4, 5, 7, and 9 ('armraise', 'curl', 'fly', 'overheadpress', and

'squat'), suggesting superb categorization performance.

Table 3. Classification report, GRUs

precision recall f1-score support


armraise 0.97 1.00 0.98 299
basic_legraise 0.00 0.00 0.00 15
bicyclecrunch 0.25 0.67 0.36 3
birddog 0.93 0.96 0.95 27
curl 0.99 1.00 0.99 289
fly 0.99 0.99 0.99 128
legraise 0.00 0.00 0.00 15
overheadpress 1.00 1.00 1.00 250
pushup 0.91 1.00 0.95 41
squat 0.96 1.00 0.98 248
superman 0.00 0.00 0.00 3
accuracy 0.97 1318
macro avg 0.64 0.69 0.66 1318
weighted avg 0.95 0.97 0.96 1318
Certain exercises, such as "basic_legraise" (class 1) and "bicyclecrunch" (class 2), with

precision scores of 0.00 and 0.25, respectively, remain challenging for the model to categorize.

These low scores imply that the model might benefit from additional refinement or training data

for these particular classes.

The confusion matrix in Figure 21 visually represents the model's classification performance

across the various workout classes. It supports the classification report's earlier observations about

the exceptional performance on the majority of classes: most examples are accurately classified,

and misclassifications are rare. The confusion matrix, however, also reveals the model's difficulty

with 'basic_legraise' (class 1), where it misclassifies every instance.

In conclusion, the high accuracy, F1-scores, and confusion matrix findings show that the GRUs

model performs exceptionally well in identifying the majority of exercise classes. The classification of

some exercises, including "basic_legraise" and "bicyclecrunch," might be improved. Potential

remedies include expanding the amount of training data, changing the model's architecture, or

applying data augmentation techniques to improve the model's performance for these classes.

Figure 21. Confusion matrix, GRUs


CHAPTER 7

CONCLUSION

In conclusion, the combination of MoveNet with different neural network architectures, such as

Feedforward Neural Networks, Gated Recurrent Units (GRUs), and Long Short-Term Memory

Networks (LSTMs), has shown promising results in classifying human exercises. The GRUs model

achieved the highest accuracy of 97.27%, while the LSTM model also performed well with an

accuracy of 94.76%. However, challenges remain in correctly classifying some exercises.

Improvements can be made by increasing the amount of training data, improving the quality of the

training data, adjusting the model's architecture, or implementing data augmentation techniques.
REFERENCES

[1] IBM, “What are Neural Networks?,” IBM. [Online]. Available:


https://www.ibm.com/topics/neural-networks. [Accessed: 07-Jan-2023].

[2] E. Kavlakoglu, “AI vs. Machine Learning vs. Deep Learning vs. Neural Networks: What’s the
Difference?,” IBM Blog, 27-May-2020. [Online]. Available: https://www.ibm.com/cloud/blog/ai-
vs-machine-learning-vs-deep-learning-vs-neural-networks. [Accessed: 07-Jan-2023].

[3] Analytics Vidhya, “Introduction to Artificial Neural Networks,” Analytics Vidhya. [Online].
Available: https://www.analyticsvidhya.com/blog/2021/09/introduction-to-artificial-neural-
networks/. [Accessed: 07-Jan-2023].

[4] Towards Data Science, “Artificial Neural Networks (ANN),” Towards Data Science, 18-May-
2020. [Online]. Available: https://towardsdatascience.com/artificial-neural-networks-ann-
21637869b306. [Accessed: 07-Jan-2023].

[5] Rene Y. Choi, Aaron S. Coyner, Jayashree Kalpathy-Cramer, Michael F. Chiang, J. Peter
Campbell; Introduction to Machine Learning, Neural Networks, and Deep Learning. Trans. Vis.
Sci. Tech. 2020;9(2):14. doi: https://doi.org/10.1167/tvst.9.2.14.

[6] Sharma, S., Sharma, S., & Athaiya, A. (2017). Activation functions in neural networks.
Towards Data Sci, 6(12), 310-316.

[7] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learning Activation Functions to


Improve Deep Neural Networks,” arXiv (Cornell University), Dec. 2014, doi:
10.48550/arxiv.1412.6830.

[8] J. Han and C. Moraga, “The influence of the sigmoid function parameters on the speed of
backpropagation learning,” Lecture Notes in Computer Science, pp. 195–201, Jun. 1995, doi:
10.1007/3-540-59497-3_175.

[9] “Desmos | Graphing Calculator,” Desmos. https://www.desmos.com/calculator

[10] K. Abdelouahab, M. Pelcat, and F. Berry, “Why TanH is a Hardware Friendly Activation
Function for CNNs,” HAL (Le Centre Pour La Communication Scientifique Directe), Sep. 2017,
doi: 10.1145/3131885.3131937.

[11] P. Antoniadis and P. Antoniadis, “Activation Functions: Sigmoid vs Tanh | Baeldung on


Computer Science,” Baeldung on Computer Science, Mar. 2023, [Online]. Available:
https://www.baeldung.com/cs/sigmoid-vs-tanh-functions

[12] Dansbecker, “Rectified Linear Units (ReLU) in Deep Learning,” Kaggle, May 2018, [Online].
Available: https://www.kaggle.com/code/dansbecker/rectified-linear-units-relu-in-deep-learning

[13] D. Liu, “A Practical Guide to ReLU - Danqing Liu - Medium,” Medium, Jul. 12, 2019. [Online].
Available: https://medium.com/@danqing/a-practical-guide-to-relu-b83ca804f1f7

[14] L. S. Toth, “Phone recognition with deep sparse rectifier neural networks,” International
Conference on Acoustics, Speech, and Signal Processing, May 2013, doi:
10.1109/icassp.2013.6639016.

[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[16] M. A. Nielsen, “Neural Networks and Deep Learning,” 2015.
http://neuralnetworksanddeeplearning.com/chap2.html

[17] S. Kostadinov, “Understanding Backpropagation Algorithm - Towards Data Science,” Medium,


Dec. 11, 2021. [Online]. Available: https://towardsdatascience.com/understanding-
backpropagation-algorithm-7bb3aa2f95fd

[18] A. Al-Masri, “How Does Backpropagation in a Neural Network Work?,” Built In, Oct. 2022,
[Online]. Available: https://builtin.com/machine-learning/backpropagation-neural-network

[19] DeepAI, “Feed Forward Neural Network,” DeepAI, Jun. 2020, [Online]. Available:
https://deepai.org/machine-learning-glossary-and-terms/feed-forward-neural-network

[20] V. Kurama, “Feedforward Neural Networks: A Quick Primer for Deep Learning,” Built In, Sep.
2019, [Online]. Available: https://builtin.com/data-science/feedforward-neural-network-intro

[21] K. O’Shea and R. R. Nash, “An Introduction to Convolutional Neural Networks,” arXiv (Cornell
University), Nov. 2015, doi: 10.48550/arxiv.1511.08458.

[22] J. Gu et al., “Recent advances in convolutional neural networks,” Pattern Recognition, vol. 77,
pp. 354–377, May 2018, doi: 10.1016/j.patcog.2017.10.013.

[23] “Unsupervised Feature Learning and Deep Learning Tutorial.”


http://deeplearning.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/

[24] “What Is a Convolutional Neural Network? | 3 things you need to know,” MATLAB & Simulink.
https://www.mathworks.com/discovery/convolutional-neural-network-matlab.html

[25] “A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way | Saturn Cloud
Blog,” Apr. 17, 2023. https://saturncloud.io/blog/a-comprehensive-guide-to-convolutional-
neural-networks-the-eli5-way/

[26] “MoveNet: Ultra fast and accurate pose detection model.,” TensorFlow, [Online]. Available:
https://www.tensorflow.org/hub/tutorials/movenet

[27] “Next-Generation Pose Detection with MoveNet and TensorFlow.js.”


https://blog.tensorflow.org/2021/05/next-generation-pose-detection-with-movenet-and-
tensorflowjs.html

[28] J. Jiang, S. Tao, D. Lian, Z. Huang, and E. Chen, “Predicting Human Mobility with Self-
attention and Feature Interaction,” Lecture Notes in Computer Science, pp. 117–131, Aug.
2020, doi: 10.1007/978-3-030-60290-1_9.

[29] Tensorflow, “tfjs-models/pose-detection at master · tensorflow/tfjs-models,” GitHub.


https://github.com/tensorflow/tfjs-models/tree/master/pose-detection

[30] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9,
no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.

[31] “Understanding LSTM Networks -- colah’s blog.” https://colah.github.io/posts/2015-08-


Understanding-LSTMs/
[32] M. Phi, “Illustrated Guide to LSTM’s and GRU’s: A step by step explanation,” Medium, Jun. 28,
2020. [Online]. Available: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-
a-step-by-step-explanation-44e9eb85bf21

[33] H. Pedamallu, “RNN vs GRUS vs LSTM - Analytics Vidhya - Medium,” Medium, Dec. 16, 2021.
[Online]. Available: https://medium.com/analytics-vidhya/rnn-vs-gru-vs-lstm-863b0b7b1573

[34] “Long Short-Term Memory (LSTM) Networks,” MATLAB & Simulink.


https://www.mathworks.com/discovery/lstm.html#:~:text=LSTMs%20are%20predominantly%2
0used%20to,speech%20recognition%2C%20and%20video%20analysis.

[35] K. Cho et al., “Learning Phrase Representations using RNN Encoder–Decoder for Statistical
Machine Translation,” arXiv (Cornell University), Jan. 2014, doi: 10.3115/v1/d14-1179.

[36] “Recurrent Neural Network (RNN) – Part 5: Custom Cells,” The Neural Perspective, Nov. 20,
2016.
https://web.archive.org/web/20180416083110/https://theneuralperspective.com/2016/11/17/rec
urrent-neural-network-rnn-part-4-custom-cells/

[37] J. -M. Hung, J. -Y. Chiang and K. Wang, "Tennis Player Pose Classification using YOLO and
MLP Neural Networks," 2021 International Symposium on Intelligent Signal Processing and
Communication Systems (ISPACS), Hualien City, Taiwan, 2021, pp. 1-2, doi:
10.1109/ISPACS51563.2021.9650925.

[38] R. Chatterjee, S. Roy, S. H. Islam and D. Samanta, "An AI Approach to Pose-based Sports
Activity Classification," 2021 8th International Conference on Signal Processing and Integrated
Networks (SPIN), Noida, India, 2021, pp. 156-161, doi: 10.1109/SPIN52536.2021.9565996.

[39] P. Nguyen Huu, N. Nguyen Thi and T. P. Ngoc, "Proposing Posture Recognition System
Combining MobilenetV2 and LSTM for Medical Surveillance," in IEEE Access, vol. 10, pp.
1839-1849, 2022, doi: 10.1109/ACCESS.2021.3138778.

[40] M. Xu, L. Wang, J. Ren and S. Poslad, "Use of LSTM Regression and Rotation Classification
to Improve Camera Pose Localization Estimation," 2020 IEEE 14th International Conference
on Anti-counterfeiting, Security, and Identification (ASID), Xiamen, China, 2020, pp. 6-10, doi:
10.1109/ASID50160.2020.9271762.

[41] “Papers with Code - InfiniteRep Dataset.” https://paperswithcode.com/dataset/infiniterep

[42] Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li
Fei-Fei. (* = equal contribution) ImageNet Large Scale Visual Recognition
Challenge. IJCV, 2015.
