VIETNAM NATIONAL UNIVERSITY – HO CHI MINH CITY

INTERNATIONAL UNIVERSITY
SCHOOL OF ELECTRICAL ENGINEERING

VFD AND MOTION CONTROL WITH


OPEN MODBUS

BY
NGO CHON PHUC

A SENIOR PROJECT SUBMITTED TO THE SCHOOL OF ELECTRICAL


ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF BACHELOR OF ELECTRICAL ENGINEERING

HO CHI MINH CITY, VIET NAM


2022
VFD AND MOTION CONTROL WITH OPEN MODBUS

BY

NGO CHON PHUC

Under the guidance and approval of the committee, and approved by its members, this

senior project has been accepted in partial fulfillment of the requirements for the

degree.

Approved:

________________________________
Chairperson

_______________________________
Committee member

________________________________
Committee member

________________________________
Committee member

________________________________
Committee member

HONESTY DECLARATION

My name is Ngo Chon Phuc. I would like to declare that, apart from the acknowledged references, this thesis does not use language, ideas, or other original material from anyone, and has not been previously submitted to any other educational or research programs or institutions. I fully understand that any writing in this thesis that contradicts the above statement will automatically lead to rejection from the EE program at the International University – Vietnam National University Ho Chi Minh City.

Date:

Student’s Signature

(Full name)

TURNITIN DECLARATION

Name of Student: NGO CHON PHUC

Date: 22/6/2022

Advisor Signature Student Signature

ACKNOWLEDGMENT

Although I have put a huge effort into this project, it would not have been possible without the support of the faculty members, and I would like to express my sincere thanks to them. It has been a great honor and a stroke of luck to be trained by Mr. Vo Tan Phuoc. Not only did he guide this senior project, but he also inspired me and shared his experience to help me perfect it in the best possible way.

Besides, I would also like to give a million thanks to Mr. Ton That Long for providing me with the necessary equipment to carry out this senior research.

TABLE OF CONTENTS

HONESTY DECLARATION..............................................................................................

TURNITIN DECLARATION............................................................................................

ACKNOWLEDGMENT....................................................................................................

TABLE OF CONTENTS.....................................................................................................

LIST OF TABLES............................................................................................................

LIST OF FIGURES............................................................................................................

ABBREVIATIONS AND NOTATIONS.............................................................................

ABSTRACT.......................................................................................................................

CHAPTER I INTRODUCTION..........................................................................................

1.1. Motivation.......................................................................................................

1.2. Objective.........................................................................................................

1.3. Report Organization........................................................................................

CHAPTER II DESIGN SPECIFICATIONS AND STANDARDS....................................

2.1. Hardware description......................................................................................

2.2. Software description.......................................................................................

CHAPTER III PROJECT MANAGEMENT......................................................................

3.1. Budget and Cost Management........................................................................

3.2. Project Schedule.............................................................................................

CHAPTER IV LITERATURE REVIEW............................................................................

4.1. Human Activities...............................................................................................

4.1.1. Background..................................................................................................

4.1.2. Human Activities Concept...........................................................................

4.2. Human Activities Recognition concept....................................................................

4.2.1. The overview................................................................................................

4.2.2. Human action recognition methods..............................................................

CHAPTER V METHODOLOGY.....................................................................................

5.1.Simulink continuous State Space Models of the IM in Startor-Fixed ...................

5.2.Simulinkcontinuous State Space Models of the IM in Field-Synchronous

Coordinate Systems......................................................................................

5.3.Control the speed and torque of the IM in TIA PORTAL.........................................

5.4.Control the motion of the IM in TIA PORTAL.........................................................

CHAPTER VI EXPECTED RESULTS............................................................................

6.1. The usage of 3D convolution........................................................................

6.2. 3D Convolutional Neural Networks..............................................................

CHAPTER VII CONCLUSION AND FUTURE WORK................................................

7.1. The conclusion..............................................................................................

7.2. Future work...................................................................................................

REFERENCES..................................................................................................................

LIST OF TABLES

Table 4. 1 LRCN methods and accuracy score..................................................................

Table 4. 2 C3D methods and accuracy scoring.................................................................

Table 4. 3 Two Stream Fusion methods and scoring.........................................................

Table 4. 4 TSN methods and score....................................................................................

Table 4. 5 HiddenTwoStream methods and scoring..........................................................

Table 4. 6 I3D methods and score.....................................................................................

Table 4. 7 T3D methods and score....................................................................................

LIST OF FIGURES

Figure 3. 1 Project schedule.................................................................................................

Figure 4. 1 Human action recognition methods categorization...........................................

Figure 4. 2 Basic network architectures [17].....................................................................

Figure 4. 3 LRCN architecture for High Jump action and general architecture of

LRCN [18].........................................................................................................................

Figure 4. 4 TwoStreamFusion architecture [21]................................................................

Figure 4. 5 Temporal Segment Networks architecture for High-Jump action [22]...........

Figure 4. 6 The Hidden Two Stream architecture [23]......................................................

Figure 4. 7 Temporal 3D Convolutional Network architecture [24].................................

Figure 4. 8 3D Temporal Transition Layer structure [24].................................................

Figure 5. 1 The 3D convolution kernel [33]......................................................................

Figure 5. 2 2D convolution and 3D convolution...............................................................

Figure 5. 3 An example of CNNs architecture..................................................................

Figure 5. 4 Residual block structure..................................................................................

Figure 6. 1 3D ResNet architecture [29]............................................................................

Figure 6. 2 Accuracy of 3D ResNet-34 compared with other methods [29].....................

ABBREVIATIONS AND NOTATIONS

CNNs: Convolutional Neural Networks

HAR: Human Action Recognition

LSTM: Long Short-Term Memory

LRCN: Long-term Recurrent Convolutional Networks

C3D: Convolutional 3D

TSN: Temporal Segment Networks

SVM: Support Vector Machine

SGD: Stochastic Gradient Descent


ABSTRACT

In the modern day, human activity recognition (HAR) has attracted the interest of many researchers in areas such as surveillance systems and autonomous vehicles. Companies commonly use this technology to train and monitor their new employees so that they correctly perform their assigned tasks. Moreover, a HAR system can replace the multiple fall-detection sensors that are usually attached to patients. In some recent studies, this task has been achieved by applying a pre-trained model. In this senior project, a new HAR system based on state-of-the-art methods will be proposed for the thesis project.

CHAPTER I

INTRODUCTION

1.1. Motivation

In the era of automation, human action recognition (HAR) is an essential technology which can be used to solve various real-life, human-centric issues in the eldercare and healthcare fields. However, due to the intricacy and variability of human activities, HAR can be fiendishly difficult.

Human activity recognition is a process that uses specific algorithms to extract feature information from input human activities and then classify the input into activity classes. Recently, the rapid growth of computer vision algorithms and applications, supported by pre-trained networks, has been utilized in the human action recognition field to simplify the mathematical procedure of the process and increase the accuracy of the output.

At present, HAR can be achieved by using signals from a gyroscope and an accelerometer as input. This method requires a considerable amount of hardware, which can be costly in certain scenarios such as nursing homes or hospitals. Alternatively, a huge amount of data can be collected and processed with a few surveillance systems that have computer vision tools such as OpenCV, MATLAB, TensorFlow, etc. installed.

1.2. Objective

The objective of this senior project is to study Human Activity Recognition with Deep Learning and Computer Vision and then propose a new HAR system that is theoretically faster or more accurate than previous ones. Firstly, cutting-edge techniques in Human Activity Recognition are reviewed. Then, the best aspects of advanced convolutional network architectures are taken as references to create a new architecture based on them.

1.3. Report Organization

The organization of this senior report is described below:

 Chapter I: An introduction to the objectives and motivation of this senior project is stated in this chapter.

 Chapter II: The design specifications and standards of the project are stated in this chapter.

 Chapter III: The management, planning, project schedule and resource planning involved in the senior project are explained in this chapter.

 Chapter IV: The framework of Human Activity Recognition, together with information about some Human Activity Recognition methods and recent research, is provided in this chapter.

 Chapter V: A thorough demonstration of the proposed methods applied in this project is stated in this chapter.

 Chapter VI: The expected theoretical outcome of the project is shown in this chapter.

 Chapter VII: The conclusion of the project and the future work needed to systemize a HAR program.
CHAPTER II

DESIGN SPECIFICATIONS AND STANDARDS

2.1. Hardware description

This senior project was conducted on an MSI GS63 7RD laptop with an Intel Core i7-7700HQ processor and 16 GB of RAM. It also has a GTX 1050 GPU with 2 GB of VRAM and 640 CUDA cores.

2.2. Software description

The reference systems used in this senior project were tested using Python and OpenCV.

CHAPTER III

PROJECT MANAGEMENT

3.1. Budget and Cost Management

 Laptop: 24,000,000 VNĐ

 MATLAB and toolbox: NA

 Microsoft Office: NA

 Python: Free

3.2. Project Schedule

The senior project is planned to be completed within 12 weeks. The details are described in Figure 3.1.

Figure 3. 1 Project schedule

CHAPTER IV

LITERATURE REVIEW

4.1. Human Activities

This chapter discusses the background information and historical development of the human action recognition (HAR) process. Furthermore, a synopsis of some advanced HAR architectures and datasets is also presented.

4.1.1. Background

Human activity is a crucial part of human-human interaction and social relationships, as it conveys information about the psychological condition and personality of a person. As early as the late 20th century, Prof. Gavrila split the research into 2D and 3D approaches [1]. At that time, a new taxonomy was also presented by Aggarwal and Cai which specialized in human motion analysis [2]. This taxonomy was later developed by Wang et al., who introduced an action categorization system [3]. Moeslund et al. (2006) conducted an analysis concentrating on posture-based action recognition methods and put forward a broader classification system, which consists of human tracking, motion, posture estimation and recognition methods [4].

Categories of human activities were also analyzed in other proposals. In 2008, Turaga et al. classified action recognition methods according to their level of activity complexity [5]. In 2010, R. Poppe broke HAR methods down into two primary types: top-down and bottom-up [6].

3D modeling was first extended in the studies of Ye et al. and Chen et al. (2013) [7]. By using depth cameras, a 3D representation of the human body was constructed from the 2-dimensional plane. Aggarwal and Xia (2014) later introduced a categorization of HAR methods based on 3D stereo and motion capture systems, with the primary concentration on techniques utilizing 3-dimensional depth data [8].


In 2014, Guo and Lai compiled the techniques for HAR from still images and sorted them into two main types according to the abstraction level and the feature types used in those methods [9].

Since 2003, many surveys about human affective interaction have been proposed and conducted on different cases. Jaimes and Sebe (2007) focused on postures, facial expressions and speech [10], while Pantic and Rothkrantz (2003) focused on non-verbal signals, such as facial and vocal expressions [11]. For spontaneous actions and movement, Zeng et al. (2009) developed this line of work using visual and audio cues, while Bousmalis et al. (2013) presented an analysis of nonverbal behavior recognition methods for agreement and disagreement [12] [13].

Human activity categorization methods are [14]:

+ Unimodal methods: used to recognize human actions based on motion characteristics. There are 4 types: space-time, stochastic, rule-based and shape-based.

+ Multimodal methods: specify atomic actions or interactions that identify personal affective states with other people, their emotions or their body movements. They can be approached by both early and late fusion. The technique is classified into 3 categories: affective methods, behavioral methods and methods based on social networking. Figure 4.1 below describes the categorization of human action recognition methods.

Figure 4. 1 Human action recognition methods categorization.
Much recent research in HAR fails to express human actions comprehensively in a compact and instructive way, citing barriers related to computational issues. More specifically, although human understanding methods have improved rapidly in recent years, problems such as the modeling of human poses and the labeling of data still remain unsolved. This project will focus on space-time activity recognition methods.

4.1.2. Human Activities Concept

Human activity is a combination of different human gestures and movements performed in order to interact with the surrounding environment (e.g. dancing, skateboarding) or with other people (e.g. talking, kissing). From human activities, one can tell the characteristics of a person or even predict the mental and physical state of patients and the elderly, and features can be extracted for human-computer interaction systems (e.g. Xbox Kinect). With these benefits, HAR can be applied to surveillance systems, healthcare systems and human-computer interaction systems.

4.2. Human Activities Recognition concept

4.2.1. The overview

Most Human Activity Recognition processes include the following steps:

 Dividing the human activity video clips in video datasets into frames - using video datasets such as Kinetics 700 and UCF101

 Extracting human action features from the input images/frames - using various algorithms to extract the unique features of each action (e.g. Gaussian spatial-temporal filter, 3D Wiener filter)

 Training the network with a dataset - based on the previous inputs and their labels, using an iterative method for training (e.g. the stochastic gradient descent algorithm)

 Executing recognition - the trained model is then used to recognize the corresponding class of human activity of the input data using 2D CNNs

 Classifying the output - using a classifier such as SVM or SoftMax to classify the output (a small sketch of this step follows this list).
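
As a tiny illustration of the final classification step, the NumPy snippet below turns a vector of raw class scores (assumed to come from a trained network) into SoftMax probabilities and picks the most likely activity class; the scores and class names are made-up examples, not outputs of any real model.

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

class_names = ["walking", "running", "jumping"]   # hypothetical activity classes
raw_scores = np.array([1.2, 3.4, 0.3])            # assumed network outputs
probs = softmax(raw_scores)
print(class_names[int(np.argmax(probs))])         # -> "running"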

Human action is a necessary type of real-life information. Automatic human action recognition in videos is used in practice in applications such as video indexing, human-computer interaction and camera surveillance.

4.2.2. Human action recognition methods

Two breakthrough methods that form the backbone of all later HAR methods were released in 2014: the Single Stream network and the Two Stream network.

4.2.2.1. Overview

Single Stream network: in June 2014, Karpathy et al. explored numerous ways to fuse temporal information from video frames using a 2-dimensional pre-trained convolutional network [15].

Two Stream network: Simonyan and Zisserman then continued to improve on Karpathy et al.'s work. The authors shape motion features as stacked optical flow vectors so that a deep architecture can effectively learn human motion [16]. Hence, this architecture has two separate networks to serve spatial and motion information. The spatial network is a 2D pre-trained model whose input is a single frame of the video. For the motion network, the authors experimented with many kinds of input and found that bi-directional optical flow stacked across 10 successive frames has the best performance. Both networks were trained separately and combined with a Support Vector Machine classifier.

With these two papers as a backbone, the following networks can be included: LRCN, C3D, TwoStreamFusion, TSN, HiddenTwoStream, I3D, T3D.

All the methods above are improvisations on top of five basic ideas: LSTM, 3D-ConvNet, Two-Stream, 3D-Fused Two-Stream and Two-Stream 3D-ConvNet [17]. Figure 4.2 displays the five basic ideas.

Figure 4. 2 Basic network architectures [17]
4.2.2.2. LRCN

LRCN was first introduced by Donahue et al. on 17 November 2014 [18]. The authors use an LSTM decoder after the convolutional layers and train the network end-to-end. Optical flow features and RGB were also compared as input choices, and the input that combined both RGB and optical flow gave the best weighted prediction score. Figure 4.3 describes the LRCN architecture for action recognition and the general LRCN architecture.

Figure 4. 3 LRCN architecture for High Jump action and general architecture of
LRCN [18]

Benchmarks (UCF101-split1):

Table 4. 1 LRCN methods and accuracy score

METHOD SCORE

OPTICAL FLOW + RGB 82.92

RGB 71.1

The end-to-end training network has a few drawbacks, such as wrong label selection, an inability to capture long-range temporal information, and the need to pre-compute flow features separately when using optical flow information. Varol et al. addressed the short temporal range problem by using longer clips (about 60 frames) and lowering the spatial resolution of the video, which led to better performance.
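
The core of the LRCN idea described above is a per-frame 2D CNN whose features are passed through an LSTM over time. Below is a minimal, hypothetical sketch of that structure using tf.keras (TensorFlow is one of the tools mentioned in Chapter I); the layer sizes, frame count and 101-class output are illustrative assumptions, not the configuration used by Donahue et al.

import tensorflow as tf

num_frames, height, width, n_classes = 16, 112, 112, 101   # assumed toy sizes

frames = tf.keras.Input(shape=(num_frames, height, width, 3))
# Small per-frame CNN applied to every time step
per_frame_cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
])
x = tf.keras.layers.TimeDistributed(per_frame_cnn)(frames)   # (batch, frames, 64)
x = tf.keras.layers.LSTM(256)(x)                              # temporal aggregation
outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
lrcn_like = tf.keras.Model(frames, outputs)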

4.2.2.3. C3D

In 2014, Du Tran et al. introduced 3D Convolutional Networks. This network was built on the Single Stream architecture of Karpathy et al., but uses 3D convolution kernels on the video volume instead of 2-dimensional convolutions across images [19]. Their idea was to train these networks on the Sports1M dataset and then use them as 3D feature extractors for other datasets. The authors also showed in their work that a simple linear classifier such as an SVM is more suitable for the extracted features than the latest algorithms. The performance can be increased further with hand-crafted features like iDT. In 2018, Hara et al. applied this idea to 2D convolutional networks such as ResNet to recognize human actions with 3D kernels [20].

Benchmarks (using UCF101-split1 dataset):

Table 4. 2 C3D methods and accuracy scoring

METHOD SCORE

C3D (single net) + SVM 82.3

C3D (three net) + SVM 85.2

C3D (three nets) + iDT + SVM 90.4

Although the C3D method significantly improves accuracy, long-range temporal modeling remains a problem. Furthermore, training such a huge network is computationally expensive.

4.2.2.4. Two Stream Fusion

This network is based on the two stream architecture. It is made up of a fusion of the spatial and temporal streams and a combination of the temporal net output across time frames. It was first introduced by Feichtenhofer et al. in April 2016 [21]. Feichtenhofer described that spatial features can be captured by the spatial (space) stream and the motion at every spatial location can be captured by the temporal (time) stream in a video. Second, long-term dependency can also be modeled by combining the temporal net output across time frames.

Figure 4. 4 TwoStreamFusion architecture [21]
Benchmarks (using UCF101-split1 dataset):

Table 4. 3 Two Stream Fusion methods and scoring.

METHOD SCORE

Two stream fusion 92.5

Two stream fusion + iDT 94.2

4.2.2.5. TSN

Temporal Segment Networks (TSN) are an improvement on the two-stream architecture that produces better results. This technique was proposed by Wang et al. in August 2016 [22]. In their paper, sampling clips sparsely throughout the video improves long-range temporal modeling. Next, for the video-level final prediction, the authors experimented with many strategies; the best score was obtained by averaging the spatial and temporal streams individually across segments, combining the final spatial-temporal scores using a weighted average, and then applying SoftMax classification.

Figure 4. 5 Temporal Segment Networks architecture for High-Jump action [22]
Benchmarks (using UCF101-split1 dataset):

Table 4. 4 TSN methods and score

METHOD SCORE

TSN (RGB + optical flow) 94.0

TSN (RGB + optical flow + warped flow) 94.2

4.2.2.6. HiddenTwoStream

In 2017, Zhu et al. identified a major issue: the use of optical flow in the two stream architecture makes the network rely on pre-computed optical flow for each sampled frame, which adversely affects storage and speed [23]. The authors then experimented with multiple strategies and network architectures to get the highest fps and the lowest number of parameters possible with the generated optical flow, without affecting the accuracy of the system. Their architecture is based on the two stream architecture; the differences, as the authors mention, are an optical flow generation net (MotionNet) located at the beginning of the temporal stream, consecutive frames used as input data instead of pre-processed optical flow, and an added multi-stage computation of multiple losses by MotionNet.

Figure 4. 6 The Hidden Two Stream architecture [23]


Benchmarks (using UCF101-split1 dataset):

Table 4. 5 HiddenTwoStream methods and scoring.

METHOD SCORE

HiddenTwoStream 89.8

HiddenTwoStream + TSN 92.5

4.2.2.7. Two-Stream Inflated 3D ConvNet (I3D)

In 2017, Carreira et al. introduced I3D, a two stream architecture containing two different 3-dimensional networks, one for each stream [17]. The authors also inflate pre-trained 2D weights into a third dimension to exploit pre-trained 2D models. Moreover, the input of the spatial stream consists of frames stacked in the time dimension instead of the single frames used in the original Two-Stream architecture. The authors also confirmed that pre-training on the Kinetics dataset increases the accuracy of their network.

Benchmarks (using UCF101-split1 dataset):

Table 4. 6 I3D methods and score

METHOD SCORE

Two Stream I3D 93.4

Two Stream I3D with ImageNet and Kinetics pre-training 98.0

4.2.2.8. T3D

Temporal 3D ConvNets (T3D) is an extension of the work done on I3D, by Diba et al. in 2017 [24]. The authors proposed a method to capture different temporal depths by using a 3-dimensional single-stream DenseNet-based structure with a multi-depth temporal transition layer (or temporal down-sampling layer) placed after the dense blocks. This multi-depth pooling is achieved by down-sampling with kernels of variable temporal sizes. Overall, this architecture is essentially a 3D modification of the original DenseNet with additional variable temporal down-sampling layers.

Figure 4. 7 Temporal 3D Convolutional Network architecture [24]

Figure 4. 8 3D Temporal Transition Layer structure [24]


Moreover, the authors also proposed a new supervised transfer learning method between the pre-trained 2-dimensional convolutional network and T3D. The architecture is trained as a binary classifier based on similarity, and the prediction error is back-propagated throughout the network to adequately transfer knowledge.

Benchmarks (using UCF101-split1 dataset):

Table 4. 7 T3D methods and score

METHOD SCORE

T3D 90.3

T3D + TLL 91.7

T3D + TSN 93.2

4.2.2.9. Datasets

In the action recognition field, the most successful recent datasets are HMDB-51 and UCF-101 [25] [26]. These datasets are commonly used as benchmarks and became very popular in the early stages of the field. However, their size is not enough for the training process.

Some bigger datasets were therefore created. One of them is ActivityNet, which consists of 200 different types of activities with 849 hours of YouTube videos [27]. In 2017, Kay et al. released the Kinetics dataset as an effort to create a successfully trained model. At that time, the Kinetics dataset was the state-of-the-art dataset, with over 300,000 video clips divided into 400 classes [28]. Hara et al. found that CNNs work best with this dataset, with 73.7% average accuracy for the ResNet-152 and ResNet-200 models [29]. Aside from Kinetics, other large-scale datasets such as Sports-1M and YouTube-8M have also been introduced [30] [31]. Both datasets are larger than Kinetics; however, their videos include many frames unrelated to the target action and their annotations are quite noisy. These aspects can affect the accuracy of the training process. Moreover, their large file sizes can prevent them from being utilized effectively.

CHAPTER V

METHODOLOGY

The proposed method for the HAR system is given in this chapter. Specifically, the proposed method and the algorithms needed to create a new Human Action Recognition neural network are reported.

5.1. Network Design

My convolutional neural network will be built based on C3D, which concatenates 3D information, namely spatial and temporal information, to every frame of the input video clip. This data will then be fed to state-of-the-art 2-dimensional classification CNNs such as VGG, ResNet, DenseNet, etc. to test the performance. During the process, some modifications to each CNN will be tested, such as changing the number and type of filters in the convolutional layers. Some specific features of state-of-the-art CNNs, such as the identity connection from ResNet [32], will also be added to get better performance.

For the training process, the Stochastic Gradient Descent (SGD) algorithm with momentum will be used to update the weights.

Throughout the process, the Kinetics dataset will be used for training and for testing the CNNs' performance.

Stage 1: Pre-processing

This stage is an elementary step for the whole process, which can substantially increase the performance of the network if carried out carefully. More specifically, short input video clips from the Kinetics dataset will be divided into frames using a MATLAB tool.
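
As a sketch of this pre-processing step, the same frame splitting can also be done with Python and OpenCV (the tools listed in Chapter II) instead of MATLAB; the clip file name and the 112 × 112 resize target below are only illustrative assumptions.

import cv2

def split_into_frames(video_path, every_n=1):
    """Read a short clip and return a list of frames (every n-th frame)."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of clip or read error
            break
        if index % every_n == 0:
            frames.append(cv2.resize(frame, (112, 112)))  # match the network input size
        index += 1
    cap.release()
    return frames

# Example with a hypothetical Kinetics clip:
# clip_frames = split_into_frames("kinetics_clip_0001.mp4", every_n=2)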

Stage 2: Spatio-temporal Extraction

This stage performs spatio-temporal extraction by using 3 × 3 × 3 convolution kernels with a stride of 1 and 3D pooling with a stride of 2. Figure 5.1 demonstrates the idea of 3D convolution kernels.

Figure 5. 1 The 3D convolution kernel [33]


The dimension of the output volume can be calculated by:

W_out = (W − F + 2P) / S + 1

where W is the input volume size, F is the kernel size, S is the stride, and P is the padding.
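
As a quick numeric check of this formula, using the kernel and stride values stated in Stage 2 (F = 3, S = 1) together with an assumed 112-pixel input and padding P = 1:

def conv_output_size(W, F, S, P):
    return (W - F + 2 * P) // S + 1

print(conv_output_size(112, 3, 1, 1))  # 112 -> padding of 1 preserves the spatial size
print(conv_output_size(112, 3, 2, 1))  # 56  -> a stride of 2 halves it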

Stage 3: Training and Classification

This stage performs the training of the CNNs using Stochastic Gradient Descent with momentum, as Hara et al. did in their papers, and then classifies the output using a Support Vector Machine or SoftMax classifier.
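
A minimal NumPy sketch of the SGD-with-momentum weight update used in this stage is shown below; the learning rate and momentum values are illustrative assumptions, not the exact settings of Hara et al.

import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v - lr * grad ; w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy usage on a single 3 x 3 x 3 convolution kernel
w = np.zeros((3, 3, 3))
v = np.zeros_like(w)
grad = np.random.randn(*w.shape)   # stands in for a back-propagated gradient
w, v = sgd_momentum_step(w, grad, v)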

5.2. Extract 3D feature

Different from 2D CNNs, which can only learn from 2D images, 3D CNNs can learn 3D objects using 3D convolution and pooling. In a 3D convolutional network, spatio-temporal extraction can be performed by convolution and pooling, whereas a 2D convolutional network can only operate spatially. Du Tran et al. already proved the performance of this method with their C3D network [19]. Figure 5.2 shows a visualization of the 2D and 3D convolution operations.

Figure 5. 2 2D convolution and 3D convolution


Figure 5.2 illustrates that 2D convolution whose input is multiple images or a multi-channel image results in a single image, which loses the temporal information of the input signal right after every convolution operation. In contrast, 3-dimensional convolution holds the temporal information of the input, which helps the network perform better with human action clips. The same phenomenon holds for 2D and 3D pooling. With the loss of temporal information, a 2D convolutional network cannot effectively recognize human actions from video clips due to the noise coming from other classes of human action. This drawback was proved by Du Tran et al. in their paper [19]. Hence, the HAR network should use 3D convolution and pooling to get better accuracy.
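
The loss of temporal information described above can be seen directly from the output shapes of the two operations. The sketch below uses SciPy's N-dimensional convolution (assumed available) on an assumed single-channel clip of 16 frames of 112 × 112 pixels, matching the input size described later in Section 6.2; the kernels are random placeholders.

import numpy as np
from scipy.signal import convolve

clip = np.random.randn(16, 112, 112)     # frames x height x width, one channel

kernel_2d = np.random.randn(16, 3, 3)    # spans ALL frames at once, like a 2D conv over a multi-channel image
kernel_3d = np.random.randn(3, 3, 3)     # spans only 3 frames at a time

out_2d = convolve(clip, kernel_2d, mode="valid")
out_3d = convolve(clip, kernel_3d, mode="valid")

print(out_2d.shape)   # (1, 110, 110)  -> temporal axis collapsed to a single map
print(out_3d.shape)   # (14, 110, 110) -> temporal axis preserved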

5.3. Convolutional Neural Networks (CNNs)

Among the branches of neural networks, one special type is the Convolutional Neural Network. Instead of relying on plain matrix operations, it uses convolution throughout the network. Figure 5.3 shows an example of the general architecture of CNNs.

Figure 5. 3 An example of CNNs architecture. [34]


The figure above shows a chain structure with an RGB input of size 150 × 150 × 3, followed by the basic layers of a CNN: a convolution layer followed by batch normalization and a ReLU/Leaky ReLU activation function, a down-sampling layer, dropout, and a softmax classification layer.
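
A hedged tf.keras sketch of the chain just described is given below; the filter count, dropout rate and the 10-class output are illustrative assumptions, since the figure does not list exact values.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(150, 150, 3)),               # RGB input as in the figure
    tf.keras.layers.Conv2D(32, 3, padding="same"),     # convolution layer
    tf.keras.layers.BatchNormalization(),              # batch normalization
    tf.keras.layers.LeakyReLU(),                       # ReLU / Leaky ReLU activation
    tf.keras.layers.MaxPooling2D(),                    # down-sampling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                      # dropout
    tf.keras.layers.Dense(10, activation="softmax"),   # softmax classification layer
])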

5.3.1. 2D CNNs for Human Action Recognition

Famous 2D convolutional networks such as ResNet or VGGNet are known for their performance on the ImageNet dataset. In 2017, however, Hara et al. proved that with three-dimensional (3D) convolutional kernels added to ResNet, its performance can be comparable with a deeper 2D convolutional network. Hence, state-of-the-art 2D CNNs are believed to have the ability to recognize human activity when additional 3D information is concatenated to the input video frames. In Hara et al.'s work, they use convolutional kernels of size 3 × 3 × 3 with a temporal stride of 1 and down-sample the input with a stride of 2, which is similar to C3D [19].

5.3.2. ReLU and Leaky ReLU

ReLU, or Rectified Linear Unit, is a nonlinear activation function applied elementwise to the input matrix, which outputs zero over the negative half of its input while matching the input linearly over the positive half. It has been shown to increase the training rate of CNNs and to mitigate the Vanishing Gradient problem.
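
Both activations are one-line element-wise functions; a small NumPy illustration is given below (the leaky slope of 0.01 is just a common default, not a value fixed by this project).

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # [0.    0.    0.    1.5]
print(leaky_relu(x))   # [-0.02  -0.005  0.     1.5]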

5.4. Identity Connection

For most CNNs, the more layers using certain activation functions are added, the harder the neural network is to train. This problem is known as the Vanishing Gradient problem, which occurs because the gradients of the loss function approach zero. To solve this problem, He et al. introduced the residual network concept [32]. The most important feature of the residual network (ResNet) is the Residual Block. Figure 5.4 describes the Residual Block used in the network.

Figure 5. 4 Residual block structure

Consider x to be the output of any activation function (e.g. Sigmoid, SoftMax) and assume that the input and output have the same dimension. The output of each residual block can then be described as:

H(x) = F(x) + x

where F(x) is the residual mapping learned by the stacked layers.
The formulation H(x) = F(x) + x can be realized by feed-forward neural networks with "skip connections". A skip connection (or shortcut connection [33]) skips one or more layers. In He et al.'s study, ResNet uses skip connections to perform identity mapping, which allows information to flow freely throughout the whole network. This inherently mitigates the vanishing gradient and curse-of-dimensionality problems seen in other CNNs such as VGG-19 or AlexNet. With skip connections, the authors stated that the whole network can be trained end-to-end using the SGD method with back-propagation and can be easily implemented using common libraries (e.g. Caffe [36]) without modifying the solver. When the number of feature maps is increased, a type-A identity shortcut with zero-padding is applied to avoid increasing the number of parameters [33].
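
To make the identity mapping concrete, the following NumPy sketch computes H(x) = F(x) + x with two plain weight layers standing in for the convolutional layers of a real residual block; the dimensions and random weights are illustrative assumptions.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Return H(x) = F(x) + x, where F is two weight layers with a ReLU in between."""
    f_x = W2 @ relu(W1 @ x)   # the residual mapping F(x)
    return relu(f_x + x)      # add the identity shortcut, then activate

dim = 8
x = np.random.randn(dim)
W1, W2 = np.random.randn(dim, dim), np.random.randn(dim, dim)
y = residual_block(x, W1, W2)  # same dimension as x, as assumed in the text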

5.5. Datasets

UCF101 and Sports1M are among the most famous datasets for the human action recognition task. However, searching for a rational network architecture on Sports1M can be extraordinarily difficult due to the noise coming from its messy annotations, and the labels are only at video level (i.e. all the irrelevant frames are included). Although the number of frames in UCF101 is appreciable, as in ImageNet, the high spatial correlation among the videos strongly decreases the actual diversity available for training. To retrace the success of 2D CNNs trained on ImageNet, the Kinetics human action video dataset was created. The latest Kinetics dataset contains approximately 650,000 video clips that cover 700 human action classes. Each action class has at least 600 video clips, and each clip is annotated with a single action class. Also, videos in the Kinetics dataset were temporally trimmed to eliminate non-action frames [35]. In summary, the size of Kinetics is slightly smaller than Sports1M, whereas the annotation quality is extremely high. Hence, the Kinetics dataset is the most suitable human action recognition training dataset for improving the precision of the process, compared to other, older ones.

CHAPTER VI

EXPECTED RESULTS

Due to limitations of both knowledge and time, only the theoretical background of the network has been built. Hara et al. have already proved the idea of using a known 2D convolutional network such as the Residual Network with 3D convolutional kernels for a human action recognition system. With the latest version of the Kinetics dataset and modifications to the 2D CNNs compared to ResNet, such as replacing ReLU with Leaky ReLU (α = 5.5) for better performance, both the speed and accuracy of the system can, in theory, be increased substantially.

6.1. The usage of 3D convolution

The human action recognition task requires information about the presence of distinct actions in a sequence of 2D frames. However, there is no guarantee that those activities are performed throughout the entire duration of the sequence (e.g. there may be movement from the background or by the camera). Treating the problem as an extension of image classification to video clips, by classifying each frame and then combining the per-frame prediction results, therefore runs into trouble. Because of this, human action recognition systems require both time and space information from video clips to recognize the action. In Du Tran et al.'s and Hara et al.'s papers, the ability of 3D convolutions to fulfill this task has been proved [19] [30] [20]. 3D convolution can produce a 3D activation map for analyzing data where temporal or volumetric context is important, such as in the human action recognition task.

6.2. 3D Convolutional Neural Networks

In 2017, Hara et al. experimented with 3D convolution and 3D pooling on the Residual Network (ResNet). The dimension of the 3D convolutional kernels is 3 × 3 × 3, and the temporal stride of the first convolution layer is 1, which is identical to C3D. For the input, the authors use clips with the dimension 3 × 16 × 112 × 112 (channels × frames × x × y). Figure 6.1 describes the architecture of Hara et al.'s network.

Figure 6. 1 3D ResNet architecture [29]


The Residual blocks are shown in brackets. Pooling is performed by 3 × 3 × 3 max pooling layers in conv3_1, conv4_1 and conv5_1 with a stride of 2. Each convolution layer is then followed by a batch normalization layer [36] and a ReLU activation layer [37]. Finally, a fully connected output layer with 400 classes is set for the Kinetics 400 dataset.
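
As a hedged illustration of one building block of this architecture, the tf.keras sketch below applies a 3 × 3 × 3 convolution, batch normalization, ReLU and 3 × 3 × 3 max pooling with stride 2 to the 3 × 16 × 112 × 112 clip described above (reordered to channels-last as Keras expects); the filter count of 64 is an assumption, and this is not Hara et al.'s exact network definition.

import tensorflow as tf

clip = tf.keras.Input(shape=(16, 112, 112, 3))   # frames, height, width, channels
x = tf.keras.layers.Conv3D(64, kernel_size=3, strides=1, padding="same")(clip)
x = tf.keras.layers.BatchNormalization()(x)      # batch normalization [36]
x = tf.keras.layers.ReLU()(x)                    # ReLU activation [37]
x = tf.keras.layers.MaxPooling3D(pool_size=3, strides=2, padding="same")(x)
block = tf.keras.Model(clip, x)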

This architecture showed a better result than the C3D architecture on the Kinetics dataset. The model can resist overfitting despite its large number of parameters. Figure 6.2 shows the accuracy on the Kinetics dataset of the authors' network compared with the other methods in Carreira and Zisserman's paper [17].

Figure 6. 2 Accuracy of 3D ResNet-34 compared with other methods [29]

CHAPTER VII

CONCLUSION AND FUTURE WORK

7.1. The conclusion

The research process in this senior project mainly focuses on the human action recognition task and clarifies its difficulties. Some advanced human action recognition systems and their performance have also been discussed in this report. Then, a suitable dataset for the modern human action recognition task is identified.

This senior project has successfully categorized human action recognition methods and their accuracy scores. Finally, a human action recognition architecture based on a proven working system has also been proposed at the level of theoretical architecture.

7.2. Future work

In the future, several steps are required to obtain a practical HAR system with the proposed architecture:

 Create the practical architecture with Python-OpenCV or MATLAB.

 Consecutively test the performance of the CNNs with various modifications and datasets to get the best accuracy or speed score.

 Create a pre-trained network to apply to real-life situations with a Python-OpenCV or MATLAB GUI.

REFERENCES

[1] D. M. Gavrila, “The Visual Analysis of Human Movement: A Survey,” Computer Vision and Image
Understanding, vol. 73, no. 1, pp. 82–98, 1999.
[2] J. K. Aggarwal and Q. Cai, “Human Motion Analysis: A Review.,” Computer Vision and Image
Understanding, vol. 73, no. 3, pp. 428–440, 1999.
[3] L. Wang, W. Hu, and T. Tan, “Recent developments in human motion analysis,” Pattern
Recognition, vol. 36, no. 3, pp. 428–440, Mar. 2003.
[4] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in vision-based human motion
capture and analysis,” Computer Vision and Image Understanding, vol. 104, no. 2–3, pp. 90–126,
2006.
[5] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine Recognition of Human
Activities: A Survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no.
11, pp. 1473–1488, 2008.
[6] R. Poppe, “A survey on vision-based human action recognition,” Image and Vision Computing,
vol. 28, no. 6, pp. 976–990, 2010.
[7] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall, “A Survey on Human Motion Analysis
from Depth Data.,” in Time-of-Flight and Depth Imaging, 2013, vol. 8200, pp. 149–187.
[8] J. K. Aggarwal and L. Xia, “Human activity recognition from 3D data: A review,” Pattern
Recognition Letters, vol. 48, pp. 70–80, 2014.
[9] G. Guo and A. Lai, “A survey on still image based human action recognition,” Pattern Recognition,
vol. 47, no. 10, pp. 3343–3361, 2014.
[10] A. Jaimes and N. Sebe, “Multimodal Human Computer Interaction: A Survey,” Computer Vision
and Image Understanding, vol. 108 (Special Issue on Vision for Human-Computer Interaction), pp.
116–134, 2007.
[11] M. Pantic and L. Rothkrantz, “Towards an affect-sensitive multimodal human-computer
interaction,” in IEEE, Special Issue on Multimodal Human-Computer Interaction, Invited Paper,
2003, vol. 91 (IEEE), pp. 1370–1390.
[12] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A Survey of Affect Recognition Methods:
Audio, Visual, and Spontaneous Expressions,” in IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2009, vol. 31, no. 1, pp. 39–58.
[13] K. Bousmalis, M. Mehu, and M. Pantic, “Towards the automatic detection of spontaneous
agreement and disagreement based on nonverbal behaviour: A survey of related cues, databases, and
tools,” Image and Vision Computing, vol. 31, no. 2, pp. 203–221, 2013.
[14] N. D. Rodríguez, M. P. Cuéllar, J. Lilius, and M. D. Calvo-Flores, “A survey on ontologies for
human behavior recognition,” ACM Computing Surveys, vol. 46, no. 4, pp. 1–33, 2014.
[15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-Scale Video
Classification with Convolutional Neural Networks,” in 2014 IEEE Conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[16] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in
Videos,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C.
Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. 2014, pp. 568–576.
[17] J. Carreira and A. Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics
Dataset,” 2017.
[18] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T.
Darrell, “Long-term Recurrent Convolutional Networks for Visual Recognition and Description,”
2014.
[19] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features
with 3D Convolutional Networks,” 2014.
[20] K. Hara, H. Kataoka, and Y. Satoh, “Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs
and ImageNet?,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2018, pp. 6546–6555.

[21] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional Two-Stream Network Fusion for
Video Action Recognition,” 2016.
[22] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool, “Temporal Segment
Networks: Towards Good Practices for Deep Action Recognition,” 2016.
[23] Y. Zhu, Z.-Z. Lan, S. D. Newsam, and A. G. Hauptmann, “Hidden Two-Stream Convolutional
Networks for Action Recognition,” 2017.
[24] A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. M. Arzani, R. Yousefzadeh, and L. V. Gool,
“Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification,” 2017.
[25] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB51: A Large Video Database for
Human Motion Recognition,” in Proceedings of the IEEE International Conference on Computer
Vision, 2011, pp. 2556–2563.
[26] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A Dataset of 101 Human Actions Classes From
Videos in The Wild,” 2012.
[27] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, “ActivityNet: A large-scale video
benchmark for human activity understanding,” in 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2015, pp. 961–970.
[28] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T.
Back, P. Natsev, M. Suleyman, and A. Zisserman, “The Kinetics Human Action Video Dataset,”
2017.
[29] K. Hara, H. Kataoka, and Y. Satoh, “Learning Spatio-Temporal Features with 3D Residual
Networks for Action Recognition,” 2017 IEEE International Conference on Computer Vision
Workshops (ICCVW), pp. 3154–3160, 2017.
[30] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-Scale Video
Classification with Convolutional Neural Networks,” in 2014 IEEE Conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[31] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S.
Vijayanarasimhan, “YouTube-8M: A Large-Scale Video Classification Benchmark,” 2016.
[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2015.
[33] S. Verma, “Understanding 1D and 3D Convolution Neural Network | Keras.” [Online]. Available:
https://towardsdatascience.com/understanding-1d-and-3d-convolution-neural-network-keras-9d8f76e29610
[Accessed: 13-Jan-2020].
[34] N. B. D. Huy, “Using Deep Learning To Design And Implement An American
FingerspellingAlphabet Recognition System,” 2019.
[35] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, “A Short Note on the Kinetics-700 Human
Action Dataset,” 2019.
[36] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift,” 2015.
[37] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in
Proceedings of the 27th International Conference on International Conference on Machine
Learning, 2010, pp. 807–814.
