VIETNAM NATIONAL UNIVERSITY – HO CHI MINH CITY

INTERNATIONAL UNIVERSITY
SCHOOL OF ELECTRICAL ENGINEERING

VFD AND MOTION CONTROL WITH


OPEN MODBUS

BY
NGO CHON PHUC

A SENIOR PROJECT SUBMITTED TO THE SCHOOL OF ELECTRICAL


ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF BACHELOR OF ELECTRICAL ENGINEERING

HO CHI MINH CITY, VIET NAM


2022
VFD AND MOTION CONTROL WITH OPEN MODBUS

BY

NGO CHON PHUC

Under the guidance and approval of the committee, and approved by its members, this

senior project has been accepted in partial fulfillment of the requirements for the

degree.

Approved:

________________________________
Chairperson

_______________________________
Committee member

________________________________
Committee member

________________________________
Committee member

________________________________
Committee member

HONESTY DECLARATION

My name is Ngo Chon Phuc. I would like to declare that, apart from the acknowledged references, this thesis does not use language, ideas, or other original material from anyone, and has not been previously submitted to any other educational or research programs or institutions. I fully understand that any writing in this thesis that contradicts the above statement will automatically lead to rejection from the EE program at the International University – Vietnam National University Ho Chi Minh City.

Date:

Student’s Signature

(Full name)

TURNITIN DECLARATION

Name of Student: NGO CHON PHUC

Date: 22/6/2022

Advisor Signature Student Signature

ACKNOWLEDGMENT

Although I have put a huge effort into this project, it would not have been possible without the support of the faculty members, and I would like to express my sincere thanks to them. It has been a great honor and a stroke of luck to be trained by Mr. Vo Tan Phuoc. Not only did he guide this senior project, but he also inspired me and shared his experience to help me perfect it in the best possible way.

Besides, I would also like to give a million thanks to Mr. Ton That Long for providing me with the necessary equipment to carry out this senior research.

TABLE OF CONTENTS

HONESTY DECLARATION..............................................................................................

TURNITIN DECLARATION............................................................................................

ACKNOWLEDGMENT....................................................................................................

TABLE OF CONTENTS.....................................................................................................

LIST OF TABLES............................................................................................................

LIST OF FIGURES............................................................................................................

ABBREVIATIONS AND NOTATIONS.............................................................................

ABSTRACT.......................................................................................................................

CHAPTER I INTRODUCTION..........................................................................................

1.1. Motivation.......................................................................................................

1.2. Objective.........................................................................................................

1.3. Report Organization........................................................................................

CHAPTER II DESIGN SPECIFICATIONS AND STANDARDS....................................

2.1. Hardware description......................................................................................

2.2. Software description.......................................................................................

CHAPTER III PROJECT MANAGEMENT......................................................................

3.1. Budget and Cost Management........................................................................

3.2. Project Schedule.............................................................................................

CHAPTER IV LITERATURE REVIEW............................................................................

4.1. Human Activities...............................................................................................

4.1.1. Background..................................................................................................

4.1.2. Human Activities Concept...........................................................................

4.2. Human Activities Recognition concept....................................................................

4.2.1. The overview................................................................................................

4.2.2. Human action recognition methods..............................................................

CHAPTER V METHODOLOGY.....................................................................................

5.1.Simulink continuous State Space Models of the IM in Startor-Fixed ...................

5.2.Simulinkcontinuous State Space Models of the IM in Field-Synchronous

Coordinate Systems......................................................................................

5.3.Control the speed and torque of the IM in TIA PORTAL.........................................

5.4.Control the motion of the IM in TIA PORTAL.........................................................

CHAPTER VI EXPECTED RESULTS............................................................................

6.1. The usage of 3D convolution........................................................................

6.2. 3D Convolutional Neural Networks..............................................................

CHAPTER VII CONCLUSION AND FUTURE WORK................................................

7.1. The conclusion..............................................................................................

7.2. Future work...................................................................................................

REFERENCES..................................................................................................................

LIST OF TABLES

Table 4. 1 LRCN methods and accuracy score..................................................................

Table 4. 2 C3D methods and accuracy scoring.................................................................

Table 4. 3 Two Stream Fusion methods and scoring.........................................................

Table 4. 4 TSN methods and score....................................................................................

Table 4. 5 HiddenTwoStream methods and scoring..........................................................

Table 4. 6 I3D methods and score.....................................................................................

Table 4. 7 T3D methods and score....................................................................................

LIST OF FIGURES

Figure 3. 1 Project schedule.................................................................................................

Figure 4. 1 Human action recognition methods categorization...........................................

Figure 4. 2 Basic network architectures [17].....................................................................

Figure 4. 3 LRCN architecture for High Jump action and general architecture of

LRCN [18].........................................................................................................................

Figure 4. 4 TwoStreamFusion architecture [21]................................................................

Figure 4. 5 Temporal Segment Networks architecture for High-Jump action [22]...........

Figure 4. 6 The Hidden Two Stream architecture [23]......................................................

Figure 4. 7 Temporal 3D Convolutional Network architecture [24].................................

Figure 4. 8 3D Temporal Transition Layer structure [24].................................................

Figure 5. 1 The 3D convolution kernel [33]......................................................................

Figure 5. 2 2D convolution and 3D convolution...............................................................

Figure 5. 3 An example of CNNs architecture..................................................................

Figure 5. 4 Residual block structure..................................................................................

Figure 6. 1 3D ResNet architecture [29]............................................................................

Figure 6. 2 Accuracy of 3D ResNet-34 compared with other methods [29].....................

ABBREVIATIONS AND NOTATIONS

CNNs: Convolutional Neural Networks

HAR: Human Action Recognition

LSTM: Long Short-Term Memory

LRCN: Long-term Recurrent Convolutional Networks

C3D: Convolutional 3D

TSN: Temporal Segment Networks

SVM: Support Vector Machine

SGD: Stochastic Gradient Descent


ABSTRACT

In the modern day, human activity recognition (HAR) has attracted the interest of many researchers in areas such as surveillance systems and autonomous vehicles. Companies commonly use this technology to train and monitor their new employees so that they correctly perform their assigned tasks. Moreover, a HAR system can replace the multiple fall-detection sensors that are usually attached to patients. In some recent studies, this task has been achieved by applying a pre-trained model. In this senior project, a new HAR system based on state-of-the-art methods will be proposed for the thesis project.

CHAPTER I

INTRODUCTION

1.1. Motivation

In the era of automation, human action recognition (HAR) is an essential technology which can be used to solve various real-life, human-centric issues in the eldercare and healthcare fields. However, due to the intricacy and variability of human activities, HAR can be fiendishly difficult.

Human activity recognition is a process that uses specific algorithms to extract feature information from input human activities and then classify the input into activity classes. Recently, the rapid growth of computer vision algorithms and applications, supported by pre-trained networks, has been utilized in the human action recognition field to simplify the mathematical procedure of the process and increase the accuracy of the output.

At present, HAR can be achieved by using signals from a gyroscope and an accelerometer as input. This method requires a considerable amount of hardware, which can be costly in certain scenarios such as nursing homes or hospitals. Alternatively, a huge amount of data can be collected and processed with a few surveillance systems that have computer vision tools such as OpenCV, MATLAB, TensorFlow, etc. installed.

1.2. Objective

The objective of this senior project is to study Human Activity Recognition with Deep Learning and Computer Vision and then propose a new HAR system that is theoretically faster or more accurate than previous ones. Firstly, cutting-edge techniques in Human Activity Recognition are reviewed. Then, the best aspects of advanced convolutional network architectures are taken as references to create a new architecture based on them.

1.3. Report Organization

The organization of this senior report is described below:

 Chapter I: An introduction to the objectives and motivation of this senior project is stated in this chapter.

 Chapter II: The design specifications and standards of the project are stated in this chapter.

 Chapter III: The management, planning, project schedule and resource planning involved in the senior project are explained in this chapter.

 Chapter IV: The framework of Human Activity Recognition, together with information about some Human Activity Recognition methods and recent research, is provided in this chapter.

 Chapter V: A thorough demonstration of the proposed methods applied in this project is stated in this chapter.

 Chapter VI: The expected theoretical outcome of the project is shown in this chapter.

 Chapter VII: The conclusion of the project and the future work needed to systemize a HAR program.
CHAPTER II

DESIGN SPECIFICATIONS AND STANDARDS

2.1. Hardware description

This senior project was conducted on an MSI GS63 7RD laptop with an Intel Core i7-7700HQ processor and 16 GB of RAM. It also has a GTX 1050 GPU with 2 GB of VRAM and 640 CUDA cores.

2.2. Software description

The reference systems used in this senior project were tested using Python and OpenCV.

CHAPTER III

PROJECT MANAGEMENT

3.1. Budget and Cost Management

 Laptop: 24,000,000 VNĐ

 MATLAB and toolbox: NA

 Microsoft Office: NA

 Python: Free

3.2. Project Schedule

The senior project is planned to be completed within 12 weeks. The details are described in Figure 3.1.

Figure 3. 1 Project schedule

CHAPTER IV

LITERATURE REVIEW

4.1. Human Activities

This chapter discusses the background information and historical development of the human action recognition (HAR) process. Furthermore, a synopsis of some advanced HAR architectures and datasets is also presented.

4.1.1. Background

Human activity is a crucial part of human-human interaction and social relationships, as it conveys information about the psychological condition and personality of a person. As early as the late 20th century, Prof. Gavrila split the research into 2D and 3D approaches [1]. At that time, a new taxonomy was also presented by Aggarwal and Cai which specialized in human motion analysis [2]. This taxonomy was later developed by Wang et al., who introduced an action categorization system [3]. Moeslund et al. (2006) conducted an analysis concentrating on posture-based action recognition methods and put forward a broader classification system, which consists of human tracking, motion, posture estimation and recognition methods [4].

Categories of human activities were also analyzed in other proposals. In 2008, Turaga et al. classified action recognition methods according to their level of activity complexity [5]. In 2010, R. Poppe broke HAR methods down into two primary types: top-down and bottom-up [6].

3D modeling was first extended in the studies of Ye et al. and Chen et al. (2013) [7]. By using depth cameras, a 3D representation of the human body was constructed from the 2-dimensional plane. Aggarwal and Xia (2014) later introduced a categorization of HAR methods based on 3D stereo and motion capture systems, with the primary concentration on techniques utilizing 3-dimensional depth data [8].


In 2014, Guo and Lai compiled the techniques for HAR from still images and sorted them into two main types according to the abstraction level and the feature types used in those methods [9].

Since 2003, many surveys about human affective interaction have been proposed and conducted on different cases. Jaimes and Sebe (2007) focused on postures, facial expressions and speech [10], while Pantic and Rothkrantz (2003) focused on non-verbal signals, such as facial and vocal expressions [11]. For spontaneous actions and movement, Zeng et al. (2009) developed this line of work using visual and audio cues, while Bousmalis et al. (2013) presented an analysis of nonverbal behavior recognition methods for agreement and disagreement [12] [13].

Human activity categorization methods are [14]:

+ Unimodal methods: used to recognize human actions based on motion characteristics. There are 4 types: space-time, stochastic, rule-based and shape-based.

+ Multimodal methods: specify atomic actions or interactions that identify personal affective states with other people, their emotions or their body movements. They can be approached by both early and late fusion. The technique is classified into 3 categories: affective methods, behavioral methods and methods based on social networking. Figure 4.1 below describes the categorization of human action recognition methods.

Figure 4. 1 Human action recognition methods categorization.
Much recent research in HAR fails to express human actions comprehensively in a compact and instructive way, citing barriers related to computational issues. More specifically, although human understanding methods have improved rapidly in recent years, problems such as the modeling of human poses and the labeling of data still remain unsolved. This project will focus on space-time activity recognition methods.

4.1.2. Human Activities Concept

Human activity is a combination of different human gestures and movements performed in order to interact with the surrounding environment (e.g. dancing, skateboarding) or with other people (e.g. talking, kissing). From human activities, one can tell the characteristics of a person or even predict the mental and physical state of patients and the elderly, and features can be extracted for human-computer interaction systems (e.g. Xbox Kinect). With these benefits, HAR can be applied to surveillance systems, healthcare systems and human-computer interaction systems.

4.2. Human Activities Recognition concept

4.2.1. The overview

Most Human Activity Recognition processes include the following steps:

 Dividing the human activity video clips in video datasets into frames - using video datasets such as Kinetics 700 and UCF101

 Extracting human action features from the input images/frames - using various algorithms to extract the unique features of each action (e.g. Gaussian spatial-temporal filter, 3D Wiener filter)

 Training the network with a dataset - based on the previous inputs and their labels, using an iterative method for training (e.g. the stochastic gradient descent algorithm)

 Executing recognition - the trained model is then used to recognize the corresponding class of human activity of the input data using 2D CNNs

 Classifying the output - using a classifier such as SVM or SoftMax to classify the output (a small sketch of this step follows this list).
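
As a tiny illustration of the final classification step, the NumPy snippet below turns a vector of raw class scores (assumed to come from a trained network) into SoftMax probabilities and picks the most likely activity class; the scores and class names are made-up examples, not outputs of any real model.

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

class_names = ["walking", "running", "jumping"]   # hypothetical activity classes
raw_scores = np.array([1.2, 3.4, 0.3])            # assumed network outputs
probs = softmax(raw_scores)
print(class_names[int(np.argmax(probs))])         # -> "running"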

Human action is a necessary type of real-life information. Automatic human action recognition in videos is used in practice in applications such as video indexing, human-computer interaction and camera surveillance.

4.2.2. Human action recognition methods

Two breakthrough methods that form the backbone of all later HAR methods were released in 2014: the Single Stream network and the Two Stream network.

4.2.2.1. Overview

Single Stream network: in June 2014, Karpathy et al. explored numerous ways to fuse temporal information from video frames using a 2-dimensional pre-trained convolutional network [15].

Two Stream network: Simonyan and Zisserman then continued to improve on Karpathy et al.'s work. The authors shape motion features as stacked optical flow vectors so that a deep architecture can effectively learn human motion [16]. Hence, this architecture has two separate networks to serve spatial and motion information. The spatial network is a 2D pre-trained model whose input is a single frame of the video. For the motion network, the authors experimented with many kinds of input and found that bi-directional optical flow stacked across 10 successive frames has the best performance. Both networks were trained separately and combined with a Support Vector Machine classifier.

With these two papers as a backbone, the following networks can be included: LRCN, C3D, TwoStreamFusion, TSN, HiddenTwoStream, I3D, T3D.

All the methods above are improvisations on top of five basic ideas: LSTM, 3D-ConvNet, Two-Stream, 3D-Fused Two-Stream and Two-Stream 3D-ConvNet [17]. Figure 4.2 displays the five basic ideas.

Figure 4. 2 Basic network architectures [17]
4.2.2.2. LRCN

LRCN was first introduced by Donahue et al. on 17 November 2014 [18]. The authors use an LSTM decoder after the convolutional layers and train the network end-to-end. Optical flow features and RGB were also compared as input choices, and the input that combined both RGB and optical flow gave the best weighted prediction score. Figure 4.3 describes the LRCN architecture for action recognition and the general LRCN architecture.

Figure 4. 3 LRCN architecture for High Jump action and general architecture of
LRCN [18]

Benchmarks (UCF101-split1):

Table 4. 1 LRCN methods and accuracy score

METHOD SCORE

OPTICAL FLOW + RGB 82.92

RGB 71.1

The end-to-end training network has a few drawbacks, such as wrong label selection, an inability to capture long-range temporal information, and the need to pre-compute flow features separately when using optical flow information. Varol et al. addressed the short temporal range problem by using longer clips (about 60 frames) and lowering the spatial resolution of the video, which led to better performance.
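
The core of the LRCN idea described above is a per-frame 2D CNN whose features are passed through an LSTM over time. Below is a minimal, hypothetical sketch of that structure using tf.keras (TensorFlow is one of the tools mentioned in Chapter I); the layer sizes, frame count and 101-class output are illustrative assumptions, not the configuration used by Donahue et al.

import tensorflow as tf

num_frames, height, width, n_classes = 16, 112, 112, 101   # assumed toy sizes

frames = tf.keras.Input(shape=(num_frames, height, width, 3))
# Small per-frame CNN applied to every time step
per_frame_cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
])
x = tf.keras.layers.TimeDistributed(per_frame_cnn)(frames)   # (batch, frames, 64)
x = tf.keras.layers.LSTM(256)(x)                              # temporal aggregation
outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
lrcn_like = tf.keras.Model(frames, outputs)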

4.2.2.3. C3D

In 2014, Du Tran et al. introduced 3D Convolutional Networks. This network was built on the Single Stream architecture of Karpathy et al., but uses 3D convolution kernels on the video volume instead of 2-dimensional convolutions across images [19]. Their idea was to train these networks on the Sports1M dataset and then use them as 3D feature extractors for other datasets. The authors also showed in their work that a simple linear classifier such as an SVM is more suitable for the extracted features than the latest algorithms. The performance can be increased further with hand-crafted features like iDT. In 2018, Hara et al. applied this idea to 2D convolutional networks such as ResNet to recognize human actions with 3D kernels [20].

Benchmarks (using UCF101-split1 dataset):

Table 4. 2 C3D methods and accuracy scoring

METHOD SCORE

C3D (single net) + SVM 82.3

C3D (three net) + SVM 85.2

C3D (three nets) + iDT + SVM 90.4

Although the C3D method significantly improves accuracy, long-range temporal modeling remains a problem. Furthermore, training such a huge network is computationally expensive.

4.2.2.4. Two Stream Fusion

This network is based on the two stream architecture. It is made up of a fusion of the spatial and temporal streams and a combination of the temporal net output across time frames. It was first introduced by Feichtenhofer et al. in April 2016 [21]. Feichtenhofer described that spatial features can be captured by the spatial (space) stream and the motion at every spatial location can be captured by the temporal (time) stream in a video. Second, long-term dependency can also be modeled by combining the temporal net output across time frames.

Figure 4. 4 TwoStreamFusion architecture [21]
Benchmarks (using UCF101-split1 dataset):

Table 4. 3 Two Stream Fusion methods and scoring.

METHOD SCORE

Two stream fusion 92.5

Two stream fusion + iDT 94.2

4.2.2.5. TSN

Temporal Segment Networks (TSN) are an improvement on the two-stream architecture that produces better results. This technique was proposed by Wang et al. in August 2016 [22]. In their paper, sampling clips sparsely throughout the video improves long-range temporal modeling. Next, for the video-level final prediction, the authors experimented with many strategies; the best score was obtained by averaging the spatial and temporal streams individually across segments, combining the final spatial-temporal scores using a weighted average, and then applying SoftMax classification.

Figure 4. 5 Temporal Segment Networks architecture for High-Jump action [22]
Benchmarks (using UCF101-split1 dataset):

Table 4. 4 TSN methods and score

METHOD SCORE

TSN (RGB + optical flow) 94.0

TSN (RGB + optical flow + warped flow) 94.2

4.2.2.6. HiddenTwoStream

In 2017, Zhu et al. identified a major issue: the use of optical flow in the two stream architecture makes the network rely on pre-computed optical flow for each sampled frame, which adversely affects storage and speed [23]. The authors then experimented with multiple strategies and network architectures to get the highest fps and the lowest number of parameters possible with the generated optical flow, without affecting the accuracy of the system. Their architecture is based on the two stream architecture; the differences, as the authors mention, are an optical flow generation net (MotionNet) located at the beginning of the temporal stream, consecutive frames used as input data instead of pre-processed optical flow, and an added multi-stage computation of multiple losses by MotionNet.

Figure 4. 6 The Hidden Two Stream architecture [23]


Benchmarks (using UCF101-split1 dataset):

Table 4. 5 HiddenTwoStream methods and scoring.

METHOD SCORE

HiddenTwoStream 89.8

HiddenTwoStream + TSN 92.5

4.2.2.7. Two-Stream Inflated 3D ConvNet (I3D)

In 2017, Carreira et al. introduced I3D, a two stream architecture containing two different 3-dimensional networks, one for each stream [17]. The authors also inflate pre-trained 2D weights into a third dimension to exploit pre-trained 2D models. Moreover, the input of the spatial stream consists of frames stacked in the time dimension instead of the single frames used in the original Two-Stream architecture. The authors also confirmed that pre-training on the Kinetics dataset increases the accuracy of their network.

Benchmarks (using UCF101-split1 dataset):

Table 4. 6 I3D methods and score

METHOD SCORE

Two Stream I3D 93.4

Two Stream I3D with ImageNet and Kinetics pre-training 98.0

4.2.2.8. T3D

Temporal 3D ConvNets (T3D) is an extension of the work done on I3D, by Diba et al. in 2017 [24]. The authors proposed a method to capture different temporal depths by using a 3-dimensional single-stream DenseNet-based structure with a multi-depth temporal transition layer (or temporal down-sampling layer) placed after the dense blocks. This multi-depth pooling is achieved by down-sampling with kernels of variable temporal sizes. Overall, this architecture is essentially a 3D modification of the original DenseNet with additional variable temporal down-sampling layers.

Figure 4. 7 Temporal 3D Convolutional Network architecture [24]

Figure 4. 8 3D Temporal Transition Layer structure [24]


Moreover, the authors also proposed a new supervised transfer learning method between the pre-trained 2-dimensional convolutional network and T3D. The architecture is trained as a binary classifier based on similarity, and the prediction error is back-propagated throughout the network to adequately transfer knowledge.

Benchmarks (using UCF101-split1 dataset):

Table 4. 7 T3D methods and score

METHOD SCORE

T3D 90.3

T3D + TLL 91.7

T3D + TSN 93.2

4.2.2.9. Datasets

In the action recognition field, the most successful recent datasets are HMDB-51 and UCF-101 [25] [26]. These datasets are commonly used as benchmarks and became very popular in the early stages of the field. However, their size is not enough for the training process.

Some bigger datasets were therefore created. One of them is ActivityNet, which consists of 200 different types of activities with 849 hours of YouTube videos [27]. In 2017, Kay et al. released the Kinetics dataset as an effort to create a successfully trained model. At that time, the Kinetics dataset was the state-of-the-art dataset, with over 300,000 video clips divided into 400 classes [28]. Hara et al. found that CNNs work best with this dataset, with 73.7% average accuracy for the ResNet-152 and ResNet-200 models [29]. Aside from Kinetics, other large-scale datasets such as Sports-1M and YouTube-8M have also been introduced [30] [31]. Both datasets are larger than Kinetics; however, their videos include many frames unrelated to the target action and their annotations are quite noisy. These aspects can affect the accuracy of the training process. Moreover, their large file sizes can prevent them from being utilized effectively.

CHAPTER V

METHODOLOGY

The proposed method for the HAR system is given in this chapter. Specifically, the proposed method and the algorithms needed to create a new Human Action Recognition neural network are reported.

5.1. Network Design

My convolutional neural network will be built based on C3D, which concatenates 3D information, namely spatial and temporal information, to every frame of the input video clip. This data will then be fed to state-of-the-art 2-dimensional classification CNNs such as VGG, ResNet, DenseNet, etc. to test the performance. During the process, some modifications to each CNN will be tested, such as changing the number and type of filters in the convolutional layers. Some specific features of state-of-the-art CNNs, such as the identity connection from ResNet [32], will also be added to get better performance.

For the training process, the Stochastic Gradient Descent (SGD) algorithm with momentum will be used to update the weights.

Throughout the process, the Kinetics dataset will be used for training and for testing the CNNs' performance.

Stage 1: Pre-processing

This stage is an elementary step for the whole process, which can substantially increase the performance of the network if carried out carefully. More specifically, short input video clips from the Kinetics dataset will be divided into frames using a MATLAB tool.
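
As a sketch of this pre-processing step, the same frame splitting can also be done with Python and OpenCV (the tools listed in Chapter II) instead of MATLAB; the clip file name and the 112 × 112 resize target below are only illustrative assumptions.

import cv2

def split_into_frames(video_path, every_n=1):
    """Read a short clip and return a list of frames (every n-th frame)."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of clip or read error
            break
        if index % every_n == 0:
            frames.append(cv2.resize(frame, (112, 112)))  # match the network input size
        index += 1
    cap.release()
    return frames

# Example with a hypothetical Kinetics clip:
# clip_frames = split_into_frames("kinetics_clip_0001.mp4", every_n=2)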

Stage 2: Spatio-temporal Extraction

This stage performs spatio-temporal extraction by using 3 × 3 × 3 convolution kernels with a stride of 1 and 3D pooling with a stride of 2. Figure 5.1 demonstrates the idea of 3D convolution kernels.

Figure 5. 1 The 3D convolution kernel [33]


The dimension of the output volume can be calculated by:

W_out = (W − F + 2P) / S + 1

where W is the input volume size, F is the kernel size, S is the stride, and P is the padding.
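
As a quick numeric check of this formula, using the kernel and stride values stated in Stage 2 (F = 3, S = 1) together with an assumed 112-pixel input and padding P = 1:

def conv_output_size(W, F, S, P):
    return (W - F + 2 * P) // S + 1

print(conv_output_size(112, 3, 1, 1))  # 112 -> padding of 1 preserves the spatial size
print(conv_output_size(112, 3, 2, 1))  # 56  -> a stride of 2 halves it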

Stage 3: Training and Classification

This stage performs the training of the CNNs using Stochastic Gradient Descent with momentum, as Hara et al. did in their papers, and then classifies the output using a Support Vector Machine or SoftMax classifier.
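
A minimal NumPy sketch of the SGD-with-momentum weight update used in this stage is shown below; the learning rate and momentum values are illustrative assumptions, not the exact settings of Hara et al.

import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v - lr * grad ; w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy usage on a single 3 x 3 x 3 convolution kernel
w = np.zeros((3, 3, 3))
v = np.zeros_like(w)
grad = np.random.randn(*w.shape)   # stands in for a back-propagated gradient
w, v = sgd_momentum_step(w, grad, v)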

5.2. Extract 3D feature

Different from 2D CNNs, which can only learn from 2D images, 3D CNNs can learn 3D objects using 3D convolution and pooling. In a 3D convolutional network, spatio-temporal extraction can be performed by convolution and pooling, whereas a 2D convolutional network can only operate spatially. Du Tran et al. already proved the performance of this method with their C3D network [19]. Figure 5.2 shows a visualization of the 2D and 3D convolution operations.

Figure 5. 2 2D convolution and 3D convolution


Figure 5.2 illustrates that 2D convolution whose input is multiple images or a multi-channel image results in a single image, which loses the temporal information of the input signal right after every convolution operation. In contrast, 3-dimensional convolution holds the temporal information of the input, which helps the network perform better with human action clips. The same phenomenon holds for 2D and 3D pooling. With the loss of temporal information, a 2D convolutional network cannot effectively recognize human actions from video clips due to the noise coming from other classes of human action. This drawback was proved by Du Tran et al. in their paper [19]. Hence, the HAR network should use 3D convolution and pooling to get better accuracy.
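
The loss of temporal information described above can be seen directly from the output shapes of the two operations. The sketch below uses SciPy's N-dimensional convolution (assumed available) on an assumed single-channel clip of 16 frames of 112 × 112 pixels, matching the input size described later in Section 6.2; the kernels are random placeholders.

import numpy as np
from scipy.signal import convolve

clip = np.random.randn(16, 112, 112)     # frames x height x width, one channel

kernel_2d = np.random.randn(16, 3, 3)    # spans ALL frames at once, like a 2D conv over a multi-channel image
kernel_3d = np.random.randn(3, 3, 3)     # spans only 3 frames at a time

out_2d = convolve(clip, kernel_2d, mode="valid")
out_3d = convolve(clip, kernel_3d, mode="valid")

print(out_2d.shape)   # (1, 110, 110)  -> temporal axis collapsed to a single map
print(out_3d.shape)   # (14, 110, 110) -> temporal axis preserved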

5.3. Convolutional Neural Networks (CNNs)

Among the branches of neural networks, one special type is the Convolutional Neural Network. Instead of relying on plain matrix operations, it uses convolution throughout the network. Figure 5.3 shows an example of the general architecture of CNNs.

Figure 5. 3 An example of CNNs architecture. [34]


The figure above shows a chain structure with an RGB input of size 150 × 150 × 3, followed by the basic layers of a CNN: a convolution layer followed by batch normalization and a ReLU/Leaky ReLU activation function, a down-sampling layer, dropout, and a softmax classification layer.
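
A hedged tf.keras sketch of the chain just described is given below; the filter count, dropout rate and the 10-class output are illustrative assumptions, since the figure does not list exact values.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(150, 150, 3)),               # RGB input as in the figure
    tf.keras.layers.Conv2D(32, 3, padding="same"),     # convolution layer
    tf.keras.layers.BatchNormalization(),              # batch normalization
    tf.keras.layers.LeakyReLU(),                       # ReLU / Leaky ReLU activation
    tf.keras.layers.MaxPooling2D(),                    # down-sampling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                      # dropout
    tf.keras.layers.Dense(10, activation="softmax"),   # softmax classification layer
])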

5.3.1. 2D CNNs for Human Action Recognition

Famous 2D convolutional networks such as ResNet or VGGNet are known for their performance on the ImageNet dataset. In 2017, however, Hara et al. proved that with three-dimensional (3D) convolutional kernels added to ResNet, its performance can be comparable with a deeper 2D convolutional network. Hence, state-of-the-art 2D CNNs are believed to have the ability to recognize human activity when additional 3D information is concatenated to the input video frames. In Hara et al.'s work, they use convolutional kernels of size 3 × 3 × 3 with a temporal stride of 1 and down-sample the input with a stride of 2, which is similar to C3D [19].

5.3.2. ReLU and Leaky ReLU

ReLU, or Rectified Linear Unit, is a nonlinear activation function applied elementwise to the input matrix, which outputs zero over the negative half of its input while matching the input linearly over the positive half. It has been shown to increase the training rate of CNNs and to mitigate the Vanishing Gradient problem.
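
Both activations are one-line element-wise functions; a small NumPy illustration is given below (the leaky slope of 0.01 is just a common default, not a value fixed by this project).

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # [0.    0.    0.    1.5]
print(leaky_relu(x))   # [-0.02  -0.005  0.     1.5]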

5.4. Identity Connection

For most CNNs, the more layers using certain activation functions are added, the harder the neural network is to train. This problem is known as the Vanishing Gradient problem, which occurs because the gradients of the loss function approach zero. To solve this problem, He et al. introduced the residual network concept [32]. The most important feature of the residual network (ResNet) is the Residual Block. Figure 5.4 describes the Residual Block used in the network.

Figure 5. 4 Residual block structure

Consider x to be the output of any activation function (e.g. Sigmoid, SoftMax) and assume that the input and output have the same dimension. The output of each residual block can then be described as:

H(x) = F(x) + x

where F(x) is the residual mapping learned by the stacked layers.
The formulation H(x) = F(x) + x can be realized by feed-forward neural networks with "skip connections". A skip connection (or shortcut connection [33]) skips one or more layers. In He et al.'s study, ResNet uses skip connections to perform identity mapping, which allows information to flow freely throughout the whole network. This inherently mitigates the vanishing gradient and curse-of-dimensionality problems seen in other CNNs such as VGG-19 or AlexNet. With skip connections, the authors stated that the whole network can be trained end-to-end using the SGD method with back-propagation and can be easily implemented using common libraries (e.g. Caffe [36]) without modifying the solver. When the number of feature maps is increased, a type-A identity shortcut with zero-padding is applied to avoid increasing the number of parameters [33].
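
To make the identity mapping concrete, the following NumPy sketch computes H(x) = F(x) + x with two plain weight layers standing in for the convolutional layers of a real residual block; the dimensions and random weights are illustrative assumptions.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Return H(x) = F(x) + x, where F is two weight layers with a ReLU in between."""
    f_x = W2 @ relu(W1 @ x)   # the residual mapping F(x)
    return relu(f_x + x)      # add the identity shortcut, then activate

dim = 8
x = np.random.randn(dim)
W1, W2 = np.random.randn(dim, dim), np.random.randn(dim, dim)
y = residual_block(x, W1, W2)  # same dimension as x, as assumed in the text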

5.5. Datasets

UCF101 and Sports1M are among the most famous datasets for the human action recognition task. However, searching for a rational network architecture on Sports1M can be extraordinarily difficult due to the noise coming from its messy annotations, and the labels are only at video level (i.e. all the irrelevant frames are included). Although the number of frames in UCF101 is appreciable, as in ImageNet, the high spatial correlation among the videos strongly decreases the actual diversity available for training. To retrace the success of 2D CNNs trained on ImageNet, the Kinetics human action video dataset was created. The latest Kinetics dataset contains approximately 650,000 video clips that cover 700 human action classes. Each action class has at least 600 video clips, and each clip is annotated with a single action class. Also, videos in the Kinetics dataset were temporally trimmed to eliminate non-action frames [35]. In summary, the size of Kinetics is slightly smaller than Sports1M, whereas the annotation quality is extremely high. Hence, the Kinetics dataset is the most suitable human action recognition training dataset for improving the precision of the process, compared to other, older ones.

CHAPTER VI

EXPECTED RESULTS

Due to limitations of both knowledge and time, only the theoretical background of the network has been built. Hara et al. have already proved the idea of using a known 2D convolutional network such as the Residual Network with 3D convolutional kernels for a human action recognition system. With the latest version of the Kinetics dataset and modifications to the 2D CNNs compared to ResNet, such as replacing ReLU with Leaky ReLU (α = 5.5) for better performance, both the speed and accuracy of the system can, in theory, be increased substantially.

6.1. The usage of 3D convolution

The human action recognition task requires information about the presence of distinct actions in a sequence of 2D frames. However, there is no guarantee that those activities are performed throughout the entire duration of the sequence (e.g. there may be movement from the background or by the camera). Treating the problem as an extension of image classification to video clips, by classifying each frame and then combining the per-frame prediction results, therefore runs into trouble. Because of this, human action recognition systems require both time and space information from video clips to recognize the action. In Du Tran et al.'s and Hara et al.'s papers, the ability of 3D convolutions to fulfill this task has been proved [19] [30] [20]. 3D convolution can produce a 3D activation map for analyzing data where temporal or volumetric context is important, such as in the human action recognition task.

6.2. 3D Convolutional Neural Networks

In 2017, Hara et al. experimented with 3D convolution and 3D pooling on the Residual Network (ResNet). The dimension of the 3D convolutional kernels is 3 × 3 × 3, and the temporal stride of the first convolution layer is 1, which is identical to C3D. For the input, the authors use clips with the dimension 3 × 16 × 112 × 112 (channels × frames × x × y). Figure 6.1 describes the architecture of Hara et al.'s network.

Figure 6. 1 3D ResNet architecture [29]


The Residual blocks are shown in brackets. Pooling is performed by 3 × 3 × 3 max pooling layers in conv3_1, conv4_1 and conv5_1 with a stride of 2. Each convolution layer is then followed by a batch normalization layer [36] and a ReLU activation layer [37]. Finally, a fully connected output layer with 400 classes is set for the Kinetics 400 dataset.
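
As a hedged illustration of one building block of this architecture, the tf.keras sketch below applies a 3 × 3 × 3 convolution, batch normalization, ReLU and 3 × 3 × 3 max pooling with stride 2 to the 3 × 16 × 112 × 112 clip described above (reordered to channels-last as Keras expects); the filter count of 64 is an assumption, and this is not Hara et al.'s exact network definition.

import tensorflow as tf

clip = tf.keras.Input(shape=(16, 112, 112, 3))   # frames, height, width, channels
x = tf.keras.layers.Conv3D(64, kernel_size=3, strides=1, padding="same")(clip)
x = tf.keras.layers.BatchNormalization()(x)      # batch normalization [36]
x = tf.keras.layers.ReLU()(x)                    # ReLU activation [37]
x = tf.keras.layers.MaxPooling3D(pool_size=3, strides=2, padding="same")(x)
block = tf.keras.Model(clip, x)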

This architecture showed a better result than the C3D architecture on the Kinetics dataset. The model can resist overfitting despite its large number of parameters. Figure 6.2 shows the accuracy on the Kinetics dataset of the authors' network compared with the other methods in Carreira and Zisserman's paper [17].

Figure 6. 2 Accuracy of 3D ResNet-34 compared with other methods [29]

CHAPTER VII

CONCLUSION AND FUTURE WORK

7.1. The conclusion

The research process in this senior project mainly focuses on the human action recognition task and clarifies its difficulties. Some advanced human action recognition systems and their performance have also been discussed in this report. Then, a suitable dataset for the modern human action recognition task is identified.

This senior project has successfully categorized human action recognition methods and their accuracy scores. Finally, a human action recognition architecture based on a proven working system has also been proposed at the level of theoretical architecture.

7.2. Future work

In the future, several steps are required to obtain a practical HAR system with the proposed architecture:

 Create the practical architecture with Python-OpenCV or MATLAB.

 Consecutively test the performance of the CNNs with various modifications and datasets to get the best accuracy or speed score.

 Create a pre-trained network to apply to real-life situations with a Python-OpenCV or MATLAB GUI.

REFERENCES

[1] D. M. Gavrila, “The Visual Analysis of Human Movement: A Survey,” Computer Vision and Image
Understanding, vol. 73, no. 1, pp. 82–98, 1999.
[2] J. K. Aggarwal and Q. Cai, “Human Motion Analysis: A Review.,” Computer Vision and Image
Understanding, vol. 73, no. 3, pp. 428–440, 1999.
[3] L. Wang, W. Hu, and T. Tan, “Recent developments in human motion analysis,” Pattern
Recognition, vol. 36, no. 3, pp. 428–440, Mar. 2003.
[4] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in vision-based human motion
capture and analysis,” Computer Vision and Image Understanding, vol. 104, no. 2–3, pp. 90–126,
2006.
[5] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine Recognition of Human
Activities: A Survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no.
11, pp. 1473–1488, 2008.
[6] R. Poppe, “A survey on vision-based human action recognition,” Image and Vision Computing,
vol. 28, no. 6, pp. 976–990, 2010.
[7] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall, “A Survey on Human Motion Analysis
from Depth Data.,” in Time-of-Flight and Depth Imaging, 2013, vol. 8200, pp. 149–187.
[8] J. K. Aggarwal and L. Xia, “Human activity recognition from 3D data: A review,” Pattern
Recognition Letters, vol. 48, pp. 70–80, 2014.
[9] G. Guo and A. Lai, “A survey on still image based human action recognition,” Pattern Recognition,
vol. 47, no. 10, pp. 3343–3361, 2014.
[10] A. Jaimes and N. Sebe, “Multimodal Human Computer Interaction: A Survey,” Computer Vision
and Image Understanding, vol. 108 (Special Issue on Vision for Human-Computer Interaction), pp.
116–134, 2007.
[11] M. Pantic and L. Rothkrantz, “Towards an affect-sensitive multimodal human-computer
interaction,” in IEEE, Special Issue on Multimodal Human-Computer Interaction, Invited Paper,
2003, vol. 91 (IEEE), pp. 1370–1390.
[12] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A Survey of Affect Recognition Methods:
Audio, Visual, and Spontaneous Expressions,” in IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2009, vol. 31, no. 1, pp. 39–58.
[13] K. Bousmalis, M. Mehu, and M. Pantic, “Towards the automatic detection of spontaneous
agreement and disagreement based on nonverbal behaviour: A survey of related cues, databases, and
tools,” Image and Vision Computing, vol. 31, no. 2, pp. 203–221, 2013.
[14] N. D. Rodríguez, M. P. Cuéllar, J. Lilius, and M. D. Calvo-Flores, “A survey on ontologies for
human behavior recognition,” ACM Computing Surveys, vol. 46, no. 4, pp. 1–33, 2014.
[15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-Scale Video
Classification with Convolutional Neural Networks,” in 2014 IEEE Conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[16] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in
Videos,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C.
Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. 2014, pp. 568–576.
[17] J. Carreira and A. Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics
Dataset,” 2017.
[18] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T.
Darrell, “Long-term Recurrent Convolutional Networks for Visual Recognition and Description,”
2014.
[19] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features
with 3D Convolutional Networks,” 2014.
[20] K. Hara, H. Kataoka, and Y. Satoh, “Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs
and ImageNet?,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2018, pp. 6546–6555.

[21] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional Two-Stream Network Fusion for
Video Action Recognition,” 2016.
[22] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool, “Temporal Segment
Networks: Towards Good Practices for Deep Action Recognition,” 2016.
[23] Y. Zhu, Z.-Z. Lan, S. D. Newsam, and A. G. Hauptmann, “Hidden Two-Stream Convolutional
Networks for Action Recognition,” 2017.
[24] A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. M. Arzani, R. Yousefzadeh, and L. V. Gool,
“Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification,” 2017.
[25] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB51: A Large Video Database for
Human Motion Recognition,” in Proceedings of the IEEE International Conference on Computer
Vision, 2011, pp. 2556–2563.
[26] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A Dataset of 101 Human Actions Classes From
Videos in The Wild,” 2012.
[27] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, “ActivityNet: A large-scale video
benchmark for human activity understanding,” in 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2015, pp. 961–970.
[28] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T.
Back, P. Natsev, M. Suleyman, and A. Zisserman, “The Kinetics Human Action Video Dataset,”
2017.
[29] K. Hara, H. Kataoka, and Y. Satoh, “Learning Spatio-Temporal Features with 3D Residual
Networks for Action Recognition,” 2017 IEEE International Conference on Computer Vision
Workshops (ICCVW), pp. 3154–3160, 2017.
[30] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-Scale Video
Classification with Convolutional Neural Networks,” in 2014 IEEE Conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[31] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S.
Vijayanarasimhan, “YouTube-8M: A Large-Scale Video Classification Benchmark,” 2016.
[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2015.
[33] S. Verma, “Understanding 1D and 3D Convolution Neural Network | Keras.” [Online]. Available:
https://towardsdatascience.com/understanding-1d-and-3d-convolution-neural-network-keras-9d8f76e29610
[Accessed: 13-Jan-2020].
[34] N. B. D. Huy, “Using Deep Learning To Design And Implement An American
FingerspellingAlphabet Recognition System,” 2019.
[35] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, “A Short Note on the Kinetics-700 Human
Action Dataset,” 2019.
[36] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift,” 2015.
[37] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in
Proceedings of the 27th International Conference on International Conference on Machine
Learning, 2010, pp. 807–814.
