INTERNATIONAL UNIVERSITY
SCHOOL OF ELECTRICAL ENGINEERING
BY
NGO CHON PHUC
Under the guidance of the committee and approved by its members, this senior
project has been accepted in partial fulfillment of the requirements for the
degree.
Approved:
________________________________
Chairperson
_______________________________
Committee member
________________________________
Committee member
________________________________
Committee member
________________________________
Committee member
HONESTY DECLARATION
I, Ngo Chon Phuc, declare that, apart from the acknowledged references, this
thesis does not use language, ideas, or other original material from anyone
without attribution, and has not been previously submitted to any other
educational or research program or institution. I fully understand that any
writing in this thesis that contradicts the above statement will automatically
lead to the rejection of this thesis.
Date:
Student’s Signature
(Full name)
TURNITIN DECLARATION
Date: 22/6/2022
ACKNOWLEDGMENT
Although I have put enormous effort into this project, it would not have been
possible without the support of the faculty members, and I would like to express
my sincere thanks to them. It is a great honor and a stroke of luck to be trained
by Mr. Vo Tan Phuoc. Not only is he the one who guides this senior project, but he
is also the one who inspires me and shares his knowledge and experience.
Besides, I would also like to sincerely thank Mr. Ton That Long for
providing me with the necessary equipment to carry out this senior research.
TABLE OF CONTENTS
HONESTY DECLARATION
TURNITIN DECLARATION
ACKNOWLEDGMENT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS AND NOTATIONS
ABSTRACT
CHAPTER I INTRODUCTION
1.1. Motivation
1.2. Objective
CHAPTER II SPECIFICATIONS AND STANDARDS
CHAPTER III PROJECT MANAGEMENT
CHAPTER IV LITERATURE REVIEW
CHAPTER V METHODOLOGY
CHAPTER VI EXPECTED RESULTS
CHAPTER VII CONCLUSION AND FUTURE WORK
REFERENCES
LIST OF TABLES
Table 4.1 LRCN methods and accuracy score
LIST OF FIGURES
Figure 4.1 Human action recognition methods categorization
Figure 4.2 Basic network architectures [17]
Figure 4.3 LRCN architecture for High Jump action and general architecture of LRCN [18]
Figure 4.4 TwoStreamFusion architecture [21]
Figure 4.5 Temporal Segment Networks architecture for High-Jump action [22]
Figure 6.2 Accuracy of 3D ResNet-34 compared with other architectures [29]
ABBREVIATIONS AND NOTATIONS
C3D: Convolutional 3D
CNN: Convolutional Neural Network
HAR: Human Activity Recognition
LRCN: Long-term Recurrent Convolutional Network
LSTM: Long Short-Term Memory
ReLU: Rectified Linear Unit
SGD: Stochastic Gradient Descent
SVM: Support Vector Machine
TSN: Temporal Segment Network
ABSTRACT
In the modern day, human activity recognition (HAR) has attracted the interest
of researchers in many fields, from healthcare and surveillance to autonomous
vehicles. Companies commonly use this technology to train and monitor their new
employees to correctly perform their assigned tasks. Moreover, a HAR system can
replace the multiple fall detection sensors that are usually attached to patients.
In some recent studies, this task has been achieved by applying a pre-trained
model. In this senior project, a new HAR system based on state-of-the-art methods
is proposed.
CHAPTER I
INTRODUCTION
1.1. Motivation
Human activity recognition has great potential in applications such as the
eldercare and healthcare fields. However, due to the intricacy and variability of
human activities, a HAR system must extract feature information from the input
activities and then classify the input into activity classes. Recently, the rapid
growth of computer vision algorithms and applications, with the support of
pre-trained networks, has been utilized in the human action recognition field to
simplify the mathematical procedure of feature extraction.
Nowadays, HAR can also be achieved using signals from sensors attached to the
body, but this hardware can be costly in some distinct scenarios such as nursing
homes. Video input, by contrast, can be processed with surveillance systems that
have computer vision tools.
1.2. Objective
The objective of this senior project is to review HAR methods built with Deep
Learning and Computer Vision and then to suggest a new HAR system that is
theoretically faster or more accurate than the previous ones. Firstly, cutting-edge
techniques in Human Activity Recognition are reviewed. Then, the best aspects of
advanced convolutional network architectures will be considered as references for
the proposed system.
This report is organized as follows:
Chapter I: The motivation and objective of the project are presented
in this chapter.
Chapter II: The specifications and standards of the project will be stated in
this chapter.
Chapter III: The management, planning, project schedule, and resource planning
of the project are presented in this chapter.
Chapter IV: The literature review in this chapter covers some Human Activity
Recognition methods and some recent research.
Chapter V: The proposed method and the algorithms needed to create the HAR
system are described in this chapter.
Chapter VI: The expected theoretical outcome of the project will be shown in this
chapter.
Chapter VII: The conclusion of the project and the future work needed to realize
a HAR program are presented in this chapter.
CHAPTER II
SPECIFICATIONS AND STANDARDS
This senior project was conducted on an MSI GS63 7RD laptop with an i7-7700HQ
processor and 16 GB of RAM. It also has a GTX 1050 GPU with 2 GB of VRAM and 640
CUDA cores.
The reference systems used in this senior project were tested using Python and
OpenCV.
CHAPTER III
PROJECT MANAGEMENT
Software costs: Microsoft Office (N/A); Python (free).
CHAPTER IV
LITERATURE REVIEW
4.1.1. Background
Over the years, many surveys of human action recognition were given. As early as
the late 20th century, Prof. Gavrila split the research into 2D
and 3D approaches [1]. At that time, a new taxonomy was also presented by Aggarwal
and Cai which specialized in human motion analysis [2]. This taxonomy was later
refined in subsequent surveys [3] [4], which reviewed recognition methods and put
forward a greater classification system organized around
activity complexity [5]. In 2010, R. Poppe broke HAR methods down into two main
categories [6].
3D modeling was first extended in the studies of Ye et al. and Chen et al. (2013)
[7]. By using depth cameras, a 3D representation of the human body could be
obtained instead of one restricted to the 2-dimensional plane. Aggarwal and Xia
(2014) later introduced a categorization of HAR methods based on 3D stereo and
motion capture systems [8] and sorted them into two main types corresponding to
the abstraction level and feature types.
Since 2003, many surveys about human affective interaction have been
proposed and conducted through different cases. Jaimes and Sebe (2007) focused on
postures, facial expressions, and speech [10], while Pantic and Rothkrantz (2003)
focused on non-verbal signals, such as facial and vocal expressions [11]. For
spontaneous actions and movement, Zeng et al. (2009) developed this method by
using visual and audio cues, while Bousmalis et al. (2013) presented an analysis of
nonverbal behavior recognition methods for agreements and disagreements [12] [13].
Another line of work studies how people share personal affective states with other
people through their emotions or body movements. It can be approached by both
early and late fusion. Overall, the recognition techniques are classified into
three main groups, as shown in the following categorization.
Figure 4.1 Human action recognition methods categorization.
Much recent research in HAR fails to express human actions comprehensively in a
compact and instructive way. Although interest in the field has increased rapidly
in recent years, problems such as the modeling of human poses and the labeling of
data still remain unsolved. This project will focus on space-time activity
recognition methods.
Human activities range from simple movements to interactions between people (e.g.,
talking, kissing). From human activities, a system can tell the characteristics of
a person or even predict the mental and physical state of patients and the
elderly, and it can be implemented with commodity depth sensors (e.g., Kinect).
With these kinds of benefits, HAR can be applied to surveillance systems,
healthcare, and similar applications. A typical deep-learning HAR pipeline
consists of:
- Dividing the human activity video clips in the video datasets into frames
- Training the network with the dataset
- Based on previous inputs and their labels, updating the network weights (e.g.,
with a gradient descent algorithm)
- Classifying each input into an activity class
A minimal sketch of this pipeline follows.
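The sketch below assumes PyTorch is available; the dataset loader, model, and hyperparameters are illustrative placeholders rather than this project's actual configuration.

import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Run one pass over (clip, label) batches with gradient descent."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for clips, labels in loader:          # clips: (N, C, T, H, W)
        clips, labels = clips.to(device), labels.to(device)
        optimizer.zero_grad()             # reset accumulated gradients
        loss = criterion(model(clips), labels)
        loss.backward()                   # back-propagate the loss
        optimizer.step()                  # gradient-descent weight update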
Two breakthrough methods released in 2014 form the backbone of all later HAR
methods: the Single Stream network and the Two Stream network.
4.2.2.1. Overview
Single Stream network: In June 2014, Karpathy et al. figured out numerous
ways to fuse temporal information from video frames using 2-dimensional
pre-trained convolutional networks [15].
Two Stream network: Simonyan together with Zisserman then continued to
improve Karpathy et al.'s work. The authors shaped motion features into stacked
optical flow vectors so that a deep architecture could effectively learn the human
motion [16]. Hence, this architecture has two separate networks serving spatial
and motion information. The spatial network is a 2D pre-trained model whose input
is a single video frame. For the motion network, the authors experimented with
many kinds of input and found that bi-directional optical flow stacked across 10
successive frames has the best performance. Both networks were trained separately
and combined with a Support Vector Machine (SVM).
With the backbone of the two mentioned papers, these networks can be included:
LRCN, C3D, TwoStreamFusion, TSN, HiddenTwoStream, I3D, and T3D.
All the methods above are improvisations on top of five basic ideas: LSTM,
3D-ConvNet, Two-Stream, 3D-Fused Two-Stream, and Two-Stream 3D-ConvNet
(Figure 4.2) [17].
Figure 4.2 Basic network architectures [17]
4.2.2.2. LRCN
LRCN was first introduced by Donahue et al. on 17 November 2014
[18]. The authors used an LSTM decoder after the convolutional layers and trained
the network end-to-end. Optical flow features and RGB were also compared as
input choices, and the result that had the best weighted score of predictions was
the input that combined both RGB and optical flow. Figure 4.3 describes the LRCN
architecture.
Figure 4.3 LRCN architecture for High Jump action and general architecture of
LRCN [18]
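As an illustration of the idea, a heavily simplified LRCN-style model, assuming PyTorch, might look like the sketch below: a small 2D CNN encodes each frame, an LSTM aggregates the per-frame features, and the last hidden state is classified. The layer sizes are placeholders, not those of [18].

import torch
import torch.nn as nn

class TinyLRCN(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, num_classes=101):
        super().__init__()
        self.cnn = nn.Sequential(          # per-frame feature extractor
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                  # x: (N, T, C, H, W)
        n, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(n, t, -1)
        out, _ = self.lstm(feats)          # temporal aggregation
        return self.fc(out[:, -1])         # classify the last time step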
Benchmarks (UCF101-split1):
Table 4.1 LRCN methods and accuracy score
METHOD    SCORE
RGB       71.1
Drawbacks of LRCN include wrong label selection and an inability to capture
long-range temporal information. Later work solved the short temporal range
problem by using longer clips (about 60 frames).
4.2.2.3. C3D
C3D was built based on the Single Stream architecture from Karpathy et al., but
its authors used 3D convolutions to capture temporal information
across images [19]. Their idea was to train these networks on the Sports1M
dataset and then use them as 3D feature extractors for other datasets. The authors
also showed in their work that a simple linear classifier such as an SVM was more
suitable for the extracted features than the latest algorithms. The performance
can be increased with hand-crafted features like iDT. In 2018, Hara et al. applied
this idea to 2D convolutional networks like ResNet by extending them with 3D
kernels [20].
METHOD                    SCORE
C3D (single net) + SVM    82.3
4.2.2.4. TwoStreamFusion
In 2016, Feichtenhofer et al. proposed fusing the spatial and temporal streams and
combining the temporal net output across time [21]. First, they
described that spatial features can be captured by the spatial (space) stream and
the motion at every spatial location can be captured by the temporal (time)
stream in a video. Second, long-term dependency can also be modeled by combining
the temporal net output across time.
Figure 4.4 TwoStreamFusion architecture [21]
Benchmarks (using UCF101-split1 dataset):
METHOD SCORE
4.2.2.5. TSN
TSN samples snippets sparsely across the whole video to produce the best result.
This technique was proposed by Wang et al. in August 2016 [22]. In their paper,
sampling clips sparsely throughout the video gives an improved model of
long-range temporal data. Next, for the video-level final prediction, the authors
experimented with many strategies, but the best scoring one was combining the
spatial-temporal scores using a weighted average and then applying SoftMax
classification.
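A hedged sketch of this fusion step is shown below: per-segment scores are averaged, the two streams are combined with a weighted average, and SoftMax is applied. The 1:1.5 weighting is illustrative, not taken from [22].

import numpy as np

def fuse_scores(spatial, temporal, w_spatial=1.0, w_temporal=1.5):
    """spatial, temporal: (num_segments, num_classes) score arrays."""
    video_spatial = spatial.mean(axis=0)     # consensus over segments
    video_temporal = temporal.mean(axis=0)
    fused = w_spatial * video_spatial + w_temporal * video_temporal
    exp = np.exp(fused - fused.max())        # numerically stable SoftMax
    return exp / exp.sum()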
Figure 4.5 Temporal Segment Networks architecture for High-Jump action [22]
Benchmarks (using UCF101-split1 dataset):
METHOD SCORE
4.2.2.6. HiddenTwoStream
In 2017, Zhu et al. identified a major factor: the usage of optical flow in the
two-stream architecture makes the network rely on pre-computed optical flow for
each sampled frame, which adversely affects storage and speed [23]. The authors
then experimented with multiple strategies and network architectures to get the
highest fps and the lowest number of parameters possible with generated optical
flow, without affecting the accuracy of the system. Their architecture is based on
the two-stream architecture; the differences, as the authors mention, are an
optical flow generation network that takes raw frames as input data instead of
preprocessed optical flow, and an added computation cost for generating the flow.
METHOD             SCORE
HiddenTwoStream    89.8
4.2.2.7. I3D
In 2017, Carreira and Zisserman proposed I3D, a two-stream architecture containing
two different 3-dimensional networks for both streams [17]. Also, the authors
inflated pre-trained 2D models into 3D models. Moreover, the input of the spatial
stream includes frames stacked in time. The authors also confirmed that using the
Kinetics dataset increases the accuracy of their network.
METHOD    SCORE
I3D, pre-trained on Kinetics and ImageNet datasets
4.2.2.8. T3D
T3D (Temporal 3D ConvNets) was proposed by Diba et al. in 2017 [24]. The authors
proposed a method to capture different temporal depths by using a 3-dimensional
single-stream DenseNet-based structure with a multi-depth temporal transition
layer (or temporal down-sampling layer) placed after the dense blocks. This
multi-depth pooling technique is achieved by down-sampling over variable temporal
depths in the sampling layers.
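In the spirit of this layer, a rough sketch of a multi-depth temporal transition block is given below, assuming PyTorch; parallel 3D convolutions with different temporal kernel depths are concatenated and then pooled. The depths and channel counts are placeholders, not the configuration in [24].

import torch
import torch.nn as nn

class TemporalTransition(nn.Module):
    def __init__(self, in_ch, out_ch, depths=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(in_ch, out_ch, kernel_size=(d, 1, 1),
                      padding=(d // 2, 0, 0))   # preserves T for odd d
            for d in depths
        )
        self.pool = nn.AvgPool3d(kernel_size=2, stride=2)  # down-sample

    def forward(self, x):                    # x: (N, C, T, H, W)
        # variable temporal receptive fields, fused along channels
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return self.pool(y)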
The authors also introduced a supervision transfer method between a pre-trained
2-dimensional convolutional network and T3D. The architecture is trained to
perform binary classification based on the similarity between the outputs of the
two networks.
METHOD    SCORE
T3D       90.3
4.2.2.9. Datasets
In the action recognition field, the most successful recent datasets are HMDB-51
and UCF-101 [25] [26]. These datasets were commonly used as benchmarks and became
very popular in the early state of the field. However, their size was not large
enough, and some bigger datasets were created. One of them is ActivityNet, which
consists of 200 different types of activities with 849 hours of YouTube videos
[27]. In 2017, Kay et al. released the Kinetics dataset as an effort to create a
successfully trained model. At that time, the Kinetics dataset was the
state-of-the-art dataset, with over 300,000 video clips divided into 400 classes
[28]. Hara et al. found that 3D CNNs work best with this dataset, achieving 73.7%
average accuracy with the ResNet-152 and ResNet-200 models [29]. Aside from
Kinetics, other large-scale datasets such as
Sports-1M and YouTube-8M have also been introduced [30] [31]. Both datasets are
larger than Kinetics; however, their videos include many frames unrelated to the
target action, and their annotations are quite noisy. These aspects can affect the
accuracy of the training process. Moreover, their large file sizes can prevent
them from being effectively utilized.
CHAPTER V
METHODOLOGY
The proposed HAR system is given in this chapter. Specifically, the proposed
method and the algorithms needed to create a new Human Action Recognition system
are described, starting from the frames of the input video clip. After that, the
pre-processed data will be fed to state-of-the-art 2-dimensional classification
CNNs such as VGG, ResNet, DenseNet, etc. to test the performance. During the
process, some modifications to each CNN will be tested, such as changing the
number and type of filters in the convolutional layers. Some specific features of
state-of-the-art CNNs will also be added, such as the identity connection from
ResNet.
For the training process, the Stochastic Gradient Descent (SGD) algorithm with
momentum will be used. Throughout the process, the Kinetics dataset will be used
for training and evaluation.
Stage 1: Pre-processing
This stage is an elementary step for the whole process, which can substantially
increase the performance of the network if done carefully. More specifically,
short input video clips from the Kinetics dataset will be divided into frames
using a MATLAB tool.
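Since the reference systems were tested with Python and OpenCV, an equivalent frame-splitting step could also be sketched as follows; the output naming pattern is an arbitrary choice, and the MATLAB tool mentioned above remains the method actually planned.

import cv2

def split_into_frames(video_path, out_pattern="frame_%05d.jpg"):
    """Decode a short clip and write each frame to an image file."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()      # ok becomes False at end of stream
        if not ok:
            break
        cv2.imwrite(out_pattern % index, frame)
        index += 1
    cap.release()
    return index                    # number of frames written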
Stage 2: Spatio-temporal feature extraction
This stage performs spatio-temporal extraction by using 3 × 3 × 3 convolution
kernels with a stride of 1 and 3D pooling with a stride of 2. Figure 5.1
demonstrates this stage. The output volume size of a convolution along one
dimension is

O = (W - F + 2P) / S + 1

where W is the input volume size, F is the kernel size, S is the stride, and P is
the padding.
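A small helper makes it easy to sanity-check this formula; for instance, a 112-pixel input with a 3-pixel kernel, stride 1, and padding 1 keeps its size, since (112 - 3 + 2*1)/1 + 1 = 112.

def conv_output_size(w, f, s, p):
    """Output size O = (W - F + 2P) / S + 1 along one dimension."""
    return (w - f + 2 * p) // s + 1

assert conv_output_size(112, 3, 1, 1) == 112   # 3x3x3 conv, stride 1
assert conv_output_size(112, 2, 2, 0) == 56    # pooling with stride 2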
Stage 3: Training and classification
This stage performs the training of the CNNs using Stochastic Gradient Descent
with momentum, as Hara et al. did in their papers, and then classifies the output
into action classes.
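In a framework such as PyTorch, this optimizer setup might be sketched as follows; the learning rate, momentum, and weight decay shown are common defaults, not the values reported by Hara et al.

import torch
import torch.nn as nn

model = nn.Linear(8, 2)   # stand-in for the 3D CNN being trained
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)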
Different from 2D CNNs, which can only learn from 2D images, 3D CNNs can learn
from both the spatial and temporal dimensions, whereas a 2D convolutional network
can only operate spatially. Du Tran et al. already proved the performance of this
method with their C3D network [19]. Applying a 2D convolution to a multi-channel
image results in a single image, which leads to the loss of temporal information.
A 3-dimensional convolution, by contrast, holds the temporal information of the
input, which helps the network perform better with human action clips. Also, the
same phenomenon holds for 2D and 3D pooling. With the loss of temporal
information, a 2D convolutional network cannot effectively recognize human actions
from video clips due to the noise coming from other classes of human action. This
drawback was proved by Du Tran et al. in their paper [19]. Hence, the HAR network
should use 3D convolutions throughout. Figure 5.2 describes an example of the
general architecture of the CNNs.
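A quick shape check, assuming PyTorch, illustrates the point: a 2D convolution collapses a stack of frames into a single 2D map, while a 3D convolution keeps a temporal dimension in its output.

import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)       # (N, C, T, H, W)

conv3d = nn.Conv3d(3, 8, kernel_size=3, padding=1)
print(conv3d(clip).shape)                    # (1, 8, 16, 112, 112): T kept

frames_as_channels = clip.flatten(1, 2)      # (1, 48, 112, 112)
conv2d = nn.Conv2d(48, 8, kernel_size=3, padding=1)
print(conv2d(frames_as_channels).shape)      # (1, 8, 112, 112): T lost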
For example, an input image of 150 × 150 × 3 is followed by some basic layers of a
CNN, which are convolution and pooling layers. State-of-the-art 2D CNNs have
proved their performance on the ImageNet dataset. However, in 2017, Hara et al.
proved that with three-dimensional (3D) convolutional kernels added to ResNet, its
performance can be comparable with a deeper 2D convolutional network. Hence,
state-of-the-art 2D CNNs are believed to have the ability to recognize human
activity when extended with 3D kernels: they use convolutional kernels with a size
of 3 × 3 × 3 and a temporal stride of 1, and they down-sample the input with a
stride of 2, which is similar to C3D [19].
The ReLU activation function is applied elementwise to the input matrix: it
outputs zero across the negative half and matches the identity function on the
positive half, i.e., f(x) = max(0, x). It was proven to be able to increase the
training speed of deep networks [37].
For most CNNs, the more layers using certain activation functions are added,
the harder the neural network is to train. This problem is known as the
Vanishing Gradient problem, which occurs because the gradients of the loss
function approach zero. Thus, He et al. introduced the residual network concept to
solve the problem [32]. The most important feature of the residual network
(ResNet) is the Residual Block. Figure 5.4 describes the Residual Block used in
the network.
Consider x to be the output of any activation function (e.g., Sigmoid, SoftMax)
and assume that the input and output have the same dimension. The output of each
residual block is

H(x) = F(x) + x

The formulation H(x) = F(x) + x can be comprehended as a feed-forward neural
network with "skip connections". One or more layers are skipped by skip
connections (or shortcut connections [33]). In He et al.'s study, ResNet uses skip
connections throughout the whole network. This can inherently restrain the
vanishing (or exploding) gradient problem.
With skip connections, the authors stated that the whole network can be trained
end-to-end using the SGD method with back-propagation and can be easily
implemented with common libraries (e.g., Caffe) without modifying the solvers.
When the number of feature maps is increased, a type-A identity shortcut with
zero-padding is applied to avert an increase in the number of parameters [33].
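A minimal residual block in the sense of H(x) = F(x) + x might be sketched as follows, assuming PyTorch; it follows the general pattern of [32] but is not He et al.'s exact configuration.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(              # F(x): two conv layers
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)      # skip connection: F(x) + x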
5.5. Datasets
UCF101 and Sports1M are known as the most famous datasets for human action
recognition. Training on Sports1M can be extraordinarily difficult due to the
noise coming from its messy annotations, and the labels are only at the video
level (i.e., all the irrelevant frames are labeled as well). Moreover, the high
spatial correlation among the videos strongly decreases the actual
diversification in training. For the purpose of re-tracing the ImageNet success of
2D CNNs, the Kinetics human action video dataset has been created. The latest
Kinetics dataset contains approximately 650,000 video clips that cover 700 human
action classes. Each action class has at least 600 video clips, and each of them
is annotated with a single action class. Also, videos in the Kinetics dataset were
temporally trimmed around the action. The dataset is slightly smaller than
Sports1M, whereas its annotation quality is extremely high. Hence, the Kinetics
dataset is the most suitable training dataset for the proposed human action
recognition system.
CHAPTER VI
EXPECTED RESULTS
Due to the limitation of both knowledge and time, only the theoretical background
of the network was built. Hara et al. already proved the idea of using known 2D
CNN architectures extended with 3D kernels for an action recognition system in
their work. With the latest version of the Kinetics dataset and
ReLU (α = 5.5) for better performance, both the speed and the accuracy of the
system can be expected to improve.
The human action recognition task requires information about the existence of
the activities to be performed throughout the entire duration of the sequence. A
naive approach is the continuation of image classification tasks to video clips,
combining the prediction results from each frame. Because of that problem, human
action recognition systems require both time and space information from video
clips to recognize the action. In Du Tran et al.'s and Hara et al.'s papers, the
ability of 3D convolutions to fulfill the task has been proved [19] [30] [20]. A
3D convolution can produce a 3D activation map for analyzing data where temporal
or volumetric context matters. For the input, the authors use clips with a
dimension of 3 × 16 × 112 × 112 (channels × frames × height × width). Figure 6.1
describes the architecture of Hara et al.'s network.
Down-sampling is performed by max pooling layers in conv3_1, conv4_1, and conv5_1
with a stride of 2. Then, after each convolution layer there are a batch
normalization layer [36] and a ReLU activation layer [37]. Finally, a fully
connected output layer with 400 classes is set for the Kinetics-400 dataset.
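A hedged sketch of this kind of building block is given below, assuming PyTorch: 3 × 3 × 3 convolutions, each followed by batch normalization and ReLU, with down-sampling by stride 2 and a 400-way output layer for Kinetics-400. The channel widths are placeholders, not Hara et al.'s exact layer plan.

import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm3d(out_ch),    # batch normalization [36]
        nn.ReLU(),                 # ReLU activation [37]
    )

net = nn.Sequential(
    conv_bn_relu(3, 64),
    conv_bn_relu(64, 128, stride=2),    # down-sampling with stride 2
    conv_bn_relu(128, 256, stride=2),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(256, 400),                # 400 classes for Kinetics-400
)

clip = torch.randn(1, 3, 16, 112, 112)  # (N, C, T, H, W) as in Figure 6.1
print(net(clip).shape)                  # torch.Size([1, 400])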
This architecture showed a better result than the C3D architecture on the Kinetics
dataset. The model can withstand overfitting despite its large number of
parameters. Figure 6.2 shows the accuracy of the authors' network on the Kinetics
dataset compared with other architectures.
Figure 6.2 Accuracy of 3D ResNet-34 compared with other architectures [29]
CHAPTER VII
CONCLUSION AND FUTURE WORK
The research process in this senior project mainly focuses on the human action
recognition task and clarifies its difficulties. Some advanced human action
recognition systems and their performances have also been discussed in this
report, together with a comparison of methods and their accuracy scores. Then, the
suitable dataset for the modern human action recognition task is stated. Finally,
a human action recognition architecture based on a successfully working system has
also been proposed in terms of theoretical architecture.
In the future, several steps are required to obtain a practical HAR system.
REFERENCES
[1] D. M. Gavrila, “The Visual Analysis of Human Movement: A Survey,” Computer Vision and Image
Understanding, vol. 73, no. 1, pp. 82–98, 1999.
[2] J. K. Aggarwal and Q. Cai, “Human Motion Analysis: A Review.,” Computer Vision and Image
Understanding, vol. 73, no. 3, pp. 428–440, 1999.
[3] L. Wang, W. Hu, and T. Tan, “Recent developments in human motion analysis,” Pattern
Recognition, vol. 36, no. 3, pp. 585–601, Mar. 2003.
[4] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in vision-based human motion
capture and analysis.,” Computer Vision and Image Understanding, vol. 104, no. 2–3, pp. 90–126,
2006.
[5] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine Recognition of Human
Activities: A Survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18,
no. 11, pp. 1473–1488, 2008.
[6] R. Poppe, “A survey on vision-based human action recognition,” Image and Vision Computing,
vol. 28, no. 6, pp. 976–990, 2010.
[7] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall, “A Survey on Human Motion Analysis
from Depth Data.,” in Time-of-Flight and Depth Imaging, 2013, vol. 8200, pp. 149–187.
[8] J. K. Aggarwal and L. Xia, “Human activity recognition from 3D data: A review,” Pattern
Recognition Letters, vol. 48, pp. 70–80, 2014.
[9] G. Guo and A. Lai, “A survey on still image based human action recognition,” Pattern Recognition,
vol. 47, no. 10, pp. 3343–3361, 2014.
[10] A. Jaimes and N. Sebe, “Multimodal Human Computer Interaction: A Survey,” Computer Vision
and Image Understanding, vol. 108 (Special Issue on Vision for Human-Computer Interaction), pp.
116–134, 2007.
[11] M. Pantic and L. Rothkrantz, “Towards an affect-sensitive multimodal human-computer
interaction,” in IEEE, Special Issue on Multimodal Human-Computer Interaction, Invited Paper,
2003, vol. 91 (IEEE), pp. 1370–1390.
[12] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A Survey of Affect Recognition Methods:
Audio, Visual, and Spontaneous Expressions,” in IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2009, vol. 31, no. 1, pp. 39–58.
[13] K. Bousmalis, M. Mehu, and M. Pantic, “Towards the automatic detection of spontaneous
agreement and disagreement based on nonverbal behaviour: A survey of related cues, databases, and
tools,” Image and Vision Computing, vol. 31, no. 2, pp. 203–221, 2013.
[14] N. D. Rodríguez, M. P. Cuéllar, J. Lilius, and M. D. Calvo-Flores, “A survey on ontologies for
human behavior recognition,” ACM Computing Surveys, vol. 46, no. 4, pp. 1–33, 2014.
[15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-Scale Video
Classification with Convolutional Neural Networks,” in 2014 IEEE Conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[16] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in
Videos,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C.
Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. 2014, pp. 568–576.
[17] J. Carreira and A. Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics
Dataset,” 2017.
[18] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T.
Darrell, “Long-term Recurrent Convolutional Networks for Visual Recognition and Description,”
2014.
[19] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features
with 3D Convolutional Networks,” 2014.
[20] K. Hara, H. Kataoka, and Y. Satoh, “Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs
and ImageNet?,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2018, pp. 6546–6555.
[21] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional Two-Stream Network Fusion for
Video Action Recognition,” 2016.
[22] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool, “Temporal Segment
Networks: Towards Good Practices for Deep Action Recognition,” 2016.
[23] Y. Zhu, Z.-Z. Lan, S. D. Newsam, and A. G. Hauptmann, “Hidden Two-Stream Convolutional
Networks for Action Recognition,” 2017.
[24] A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. M. Arzani, R. Yousefzadeh, and L. V. Gool,
“Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification,” 2017.
[25] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB51: A Large Video Database for
Human Motion Recognition,” in Proceedings of the IEEE International Conference on Computer
Vision, 2011, pp. 2556–2563.
[26] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A Dataset of 101 Human Actions Classes From
Videos in The Wild,” 2012.
[27] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, “ActivityNet: A large-scale video
benchmark for human activity understanding,” in 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2015, pp. 961–970.
[28] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T.
Back, P. Natsev, M. Suleyman, and A. Zisserman, “The Kinetics Human Action Video Dataset,”
2017.
[29] K. Hara, H. Kataoka, and Y. Satoh, “Learning Spatio-Temporal Features with 3D Residual
Networks for Action Recognition,” 2017 IEEE International Conference on Computer Vision
Workshops (ICCVW), pp. 3154–3160, 2017.
[30] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-Scale Video
Classification with Convolutional Neural Networks,” in 2014 IEEE Conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[31] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S.
Vijayanarasimhan, “YouTube-8M: A Large-Scale Video Classification Benchmark,” 2016.
[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2015.
[33] S. Verma, “Understanding 1D and 3D Convolution Neural Network | Keras.” [Online]. Available:
https://towardsdatascience.com/understanding-1d-and-3d-convolution-neural-network-keras-9d8f76e29610
[Accessed: 13-Jan-2020].
[34] N. B. D. Huy, “Using Deep Learning To Design And Implement An American
Fingerspelling Alphabet Recognition System,” 2019.
[35] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, “A Short Note on the Kinetics-700 Human
Action Dataset,” 2019.
[36] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift,” 2015.
[37] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in
Proceedings of the 27th International Conference on International Conference on Machine
Learning, 2010, pp. 807–814.