By
Maverick S. Bustos
Josemaria A. Sebastian
TABLE OF CONTENTS
LIST OF FIGURES
ABSTRACT
CHAPTER 1
INTRODUCTION
CHAPTER 2
CHAPTER 3
METHODOLOGY
CHAPTER 4
EXPERIMENTS
REFERENCES
LIST OF TABLES
Table 1 : The action subsets and tests of the Microsoft Action3D Dataset
Table 3 : State of the art of precision and recall values (%) on CAD-60 dataset.
Table 4 : A comparative summary of the performance of prior methods and Shan (2014) method on the MSR
Action3D dataset.
Table 5 : Recognition rate using the other subjects training data and tested with those of other
subjects
LIST OF FIGURES
Figure 1 : Similarity of the Roundhouse kick (top) to Mawashi geri kick (bottom).
Figure 6 : The orientation of the skeleton data with respect to the Kinect
Figure 7 : Spherical coordinates: radial distance r, azimuthal angle θ, and polar angle φ
Figure 8 : Key poses identified in the “drinking water” action sample from the Cornell dataset,
indicated by the dashed vertical lines (top). A set of the identified key poses is shown in the
bottom half.
ABSTRACT
Sports at the professional level attract substantial resources, whether for injury prevention
and analysis or for improving athletes’ performance, yet studies of Human Action Recognition in
sports remain limited. Sports moves are executed faster than daily human actions and have low
inter-class variability, which produces noisy features and ambiguity. In this study, we propose
an approach that uses skeletal data from the Kinect, preprocesses the data so that it is
invariant to the sensor and to variation in human height and limb length, and removes skeleton
joints that are irrelevant to the action being performed. The proponents will also use key poses
and atomic actions for segmentation. Lastly, the actions will be classified using Hidden Markov
Models (HMMs). The evaluation will compare a model that uses the full set of joints against a
model that uses a reduced set of joints, focusing on taekwondo kicks, which can reach speeds of
26 m/s.
Keywords: Human Action Recognition, Kinect, skeleton algorithm, key poses, atomic
CHAPTER 1
INTRODUCTION
Human action recognition is the task of recognizing human activities from videos or
images. The computer vision community has conducted multiple studies that try to solve the many
problems in recognizing human activities such as occlusion, changes in scale, viewpoint, lighting,
appearance, background clutter, and the complexity of human movements. The study of human
action recognition in the field of computer vision has been an active area of research due to the
and medical.
Videos have numerous applications, specifically in sports, wherein coaches and athletes
increasingly use the medium to gauge and correct strategy and technique, and to break down group
and individual performances. In competitive combat sports such as karate, muay thai, boxing,
wrestling, and taekwondo, to name a few, statistics on an athlete’s performance can be a valuable
source of data for feedback and improvement. The standard method of analyzing an athlete’s
performance relies on the play and pause function of video
The Microsoft Kinect sensor is able to distinguish the kind of activity being performed by
the individual, for example, standing, strolling, punching, sitting, waving and so forth. Various
studies have implemented the Microsoft Kinect because of its wide availability and low cost.
Another reason why Kinect is used in action recognition is that it can produce depth maps. A depth
map is an image or image channel that contains information relating to the distance of the surfaces
Taekwondo is a martial art of Korean origin that has evolved from a traditional martial
art into an Olympic combat sport. A combatant wins either by scoring more points than the
opponent through kicks and punches to the scoring areas or by a technical knockout (Bridge,
Jones, & Drust, 2011). Matches are played in a 10 x 10 m area and comprise 3 rounds of 2 min,
with a 1 min rest interval separating each round. Although the sport has received both
international and Olympic recognition, research into its physiological demands is in its infancy
(Bridge C. J., 2009). Bridge (2011) studied the activity profile in taekwondo and reported an
average fighting to non-fighting ratio of 1:6, wherein 1 represents the fighting periods, which
usually lasted less than 2 seconds, and 6 represents the preparatory, non-preparatory, and
general stoppage periods, which lasted less than 16 seconds.
According to De Souza Vicente (2016), sport actions differ from daily human actions in
that they have low inter-class variability, which produces noisier features and ambiguity. The
limited research on sports mostly focuses on execution or correct technique, while some focuses
on statistics. In 2012, S. Bianco and F. Tisato conducted a study that automatically recognizes
sequences of complex karate movements and evaluates the quality of the move performed. In terms
of statistics in a boxing
boxer’s punch with the use of overhead depth imagery to reduce challenges associated with
occlusions, and build a robust body-part tracking algorithm for the time-of-flight sensors.
Action recognition has also drawn on other sensors for data, for example wearable
sensors such as accelerometers. However, the Microsoft Kinect has been used for data acquisition
in many human action recognition studies, in part because it is invariant to lighting. The user
must face the sensor, or stand directly in front of it, because the Kinect was initially designed
to recognize users facing the sensor. Capturing a subject from the side can occlude joints and
degrade the generated skeleton; robustness of skeleton extraction and tracking is therefore a
main issue for future work (L. Miranda, 2012). Ahmed Taha (2014), S. Bianco (2012), and Shan &
Akella (2014) each proposed a method for preprocessing the skeleton data to be invariant to
sensor orientation and to scale, with their methodologies yielding average accuracies of 95.2%
(MSR Action3D dataset), 97% (Dataset of Karate Moves), and 80% (MSR Action3D dataset),
respectively.
automatic action recognition and segmentation of training sequences for high performance sports
specifically taekwondo. They acquired data with the Kinect by capturing the athlete’s
performance from the side. De Souza Vicente’s work had difficulty classifying similar kicks
(back leg kick/back leg kick to the head and front leg kick/front leg kick to the head) and
segmenting kicks (back leg kicks or back leg kicks to the head). Taekwondo moves that have high
body rotation and are similar to one another hurt recognition, resulting in the dissertation’s
average recognition accuracy of 74.72%. The author attributed the problem to limitations of the
Kinect’s video capture and body estimation technology: the complexity of high-speed moves with
large joint displacements makes Kinect tracking imprecise, harming the extracted feature vector.
However, S. Bianco and F. Tisato (2012) created a framework for evaluating the quality of moves
in karate, which also involves moves with high body rotation and the use of the Kinect; for
example, the move “Mawashi geri” is similar to the “Roundhouse Kick”, or back leg kick, in
taekwondo, and did not
Figure 1 : Similarity of the Roundhouse kick (top) to Mawashi geri kick (bottom).
Ahmed Taha (2014) built a feature vector of 13 pairs of polar (zenith) angle 𝜑 and
azimuth angle 𝜃. The study achieved this by discarding seven joints considered unnecessary or
irrelevant and, since the joint coordinates are normalized, ignoring the radial distance r. The
result is a low-dimensional representation that requires less computational effort. Using a
multi-class Support Vector Machine (SVM) for action classification, the study achieved an average
Reducing the set of joints produces better classification performance (Manzi, 2017).
Manzi proposed a feature vector using 7 joints as reference, specifically the head, neck, torso,
hands, and feet, since this set of joints is the most discriminative for activity recognition,
thus reducing the complexity of computation. In total, this yields a pose feature vector of 18
attributes, calculated as 3(𝑁 − 1), wherein N is the number of joints, which is 7 in Manzi’s work.
Moves in sports are executed faster than daily actions and have inter-class similarities,
which produce noisy features and ambiguity. Complex movements such as taekwondo kicks are
fast-paced, involve high body rotation and, to some extent, are similar in form. This study aims
to address the problem by identifying the proper form and execution of taekwondo kicks by
focusing on the important joints.
What is the effect of shifting the origin point from Kinect’s coordinate data point
The significance of our work lies in showing that prioritizing the skeleton joints that
are significant to the action can benefit human action recognition studies and sports video
analysis.
The objective of the study is to develop a model that can distinguish different taekwondo
actions. The specific objectives of the study are to eliminate skeleton joints that are irrelevant to
This study covers the implementation of a skeleton algorithm that removes irrelevant
skeleton joints, achieving dimensionality reduction. The proponents limited this research to the
sport of taekwondo because taekwondo moves have high body rotation and can reach peak linear
speeds of up to 26 m/s, and because there is limited research on action recognition in sports.
To clarify the terms used in the study, the following are hereby defined:
Dimensionality Reduction – Reduction of features that are involved in the classification process
characteristics.
Hidden Markov Model (HMM) – is a statistical Markov model in which the system being
Intra-class variations – the same activity may vary from subject to subject.
Inter-class similarity – activities that are fundamentally different, but that show very similar
Kinect – a line of motion sensing input devices by Microsoft for Xbox 360 and Xbox One video
Time-of-flight sensors (ToF) – a range imaging camera system that resolves distance based on
the known speed of light, measuring the time-of-flight of a light signal between the camera and
CHAPTER 2
In 2016, Mona Alzahrani & Salma Kammoun performed a literature review on Human
Action Recognition. Alzahrani & Kammoun focused on the latest and most important works in Human
Action Recognition and presented five generic process stages: data acquisition from the sensor,
preprocessing of the data, segmentation of the activity, extraction of the feature vectors, and
training and classification. They concluded that the sensor, depending on the chosen application
and approach, changes the performance and results.
Microsoft Kinect
One of the most recent advancements in Computer Vision is the Microsoft Kinect. It
retrieves 3D information about a scene and recognizes the action being performed by the human
body using depth image information and real-time skeletal tracking. The Microsoft Kinect is
attractive because of its wide availability and low cost, and it has wide application in
computer science, engineering, medicine, robotics, and other fields. The Kinect contains a depth
sensor, a color (RGB) camera, and a four-microphone array. The sensor produces a depth image/map
from the IR image. The depth value is encoded with gray values: the darker the pixel, the closer
the point is to the camera. Black pixels indicate that no depth value is available, in which
case the point may be too far from or too near to the sensor to be registered (Gahlot, Agarwal,
Agarwal, Singh, & Gautam, 2016). The Kinect can capture both an RGB image and a depth image. The
resulting RGB-D data is used to produce a skeleton model of an object in the scene. This data
can now be used to attain and
A Depth Image/Map is an image that contains information relating to the distance of the
surfaces of scene objects from a viewpoint (Vennila Megavannan, 2013). Depth images offer a
number of advantages: they work in low light levels, give a calibrated scale estimate, are
invariant to color and texture, resolve silhouette ambiguities in pose, and simplify some tasks
such as
from depth maps can be done by two methods. The two methods are Microsoft SDK by Microsoft
models, the only difference is in the number of skeleton joints. Microsoft SDK produces a
skeleton model of 20 joints, while OpenNi provides a subset of these, a 15-joint skeleton model.
Figure shows the skeleton joints detected by both methods. Both the solid and hollow circles are
joints detected by Microsoft SDK, while only the solid circles are joints detected by OpenNi.
Numerous studies have utilized the skeleton data from the Kinect. Nonetheless, raw
coordinates cannot be used directly because of noise from the sensor or from the user. Missing 3D
information for some joint positions is a common problem, which requires the user to face the
sensor; the Kinect was initially designed to capture the subject from the front. Capturing the
user from the side can occlude joints, degrading the skeleton. Another issue is that different
individuals performing the same pose can generate different skeleton models. In summary,
skeleton tracking algorithms need to be robust, as numerous studies have shown. Ahmed Taha’s
(2014) processing of the skeleton data is invariant to the scale of the subjects/objects and to
the camera orientation while maintaining the relationship among different body
Figure 5 : The block diagram of Ahmed Taha’s Work
Ahmed Taha (2014) used the Microsoft SDK to track and obtain the 20 skeleton joints in
the scene. The positions of the skeleton joints are provided as Cartesian coordinates (X, Y, Z)
with the origin centered at the Kinect. The positive Y axis points up, the positive Z axis points
where the Kinect is pointing, and the positive X axis is to the left. Ideally, the subject should
be in front of the sensor, but this is not always the case. Therefore, all the skeleton points
are rotated counterclockwise around the Y-axis by an angle α in order to make the subject face
the sensor. The angle α is the angle between the line connecting both shoulders and the positive
direction of the X-axis of the Kinect coordinate system. Assuming the coordinates of the two
shoulder joints are (𝑥𝐿 , 𝑦𝐿 , 𝑧𝐿 ) for the left shoulder and (𝑥𝑅 , 𝑦𝑅 , 𝑧𝑅 ) for the right
shoulder, the angle α is computed as follows:
$\alpha = \tan^{-1}\left(\frac{z_R - z_L}{x_R - x_L}\right)$.
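As an illustration of the angle computation above, the following minimal Python sketch computes α from the two shoulder joints. The function name `shoulder_angle` and the sample coordinates are hypothetical. As an assumption of this sketch, `math.atan2` is used in place of a plain arctangent of the ratio so that the case x_R = x_L does not divide by zero; the two agree whenever x_R − x_L is positive.

```python
import math


def shoulder_angle(left_shoulder, right_shoulder):
    """Angle (radians) of the shoulder line against the positive X axis,
    measured in the X-Z plane. Inputs are (x, y, z) tuples."""
    x_l, _, z_l = left_shoulder
    x_r, _, z_r = right_shoulder
    # atan2 (an assumption of this sketch) avoids division by zero
    # when both shoulders share the same x coordinate.
    return math.atan2(z_r - z_l, x_r - x_l)


# Hypothetical shoulder positions in metres, subject turned 45 degrees.
alpha = shoulder_angle((0.0, 1.4, 2.0), (1.0, 1.4, 3.0))
print(math.degrees(alpha))
```

A subject whose shoulders have equal z coordinates (and x_R greater than x_L) yields α = 0 under this convention, so no rotation would be applied.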
Figure 6 : The orientation of the skeleton data with respect to the Kinect
Then follows a counterclockwise rotation about the Y-axis of all the skeleton joints by
the angle α. For each skeleton joint 𝑖 with coordinates (𝑥𝑖 , 𝑦𝑖 , 𝑧𝑖 ), the rotated coordinates
are calculated:
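One common form of a rotation about the Y-axis can be sketched in Python as below. This is an assumption of the sketch, not necessarily the sign convention of the cited work: it applies the standard right-handed rotation matrix about Y, which leaves the y coordinate unchanged. The function name `rotate_about_y` and the sample values are hypothetical.

```python
import math


def rotate_about_y(joint, alpha):
    """Rotate one (x, y, z) joint about the Y axis by alpha radians,
    using the standard right-handed rotation matrix (assumed convention)."""
    x, y, z = joint
    cos_a, sin_a = math.cos(alpha), math.sin(alpha)
    x_rot = x * cos_a + z * sin_a
    z_rot = -x * sin_a + z * cos_a
    return (x_rot, y, z_rot)


# Hypothetical shoulder-line vector at 45 degrees to the X axis.
alpha = math.pi / 4
print(rotate_about_y((1.0, 0.0, 1.0), alpha))
```

Under this convention, a shoulder line that makes the angle α with the X axis in the X-Z plane ends up with a zero z component after the rotation, which is the sense in which the skeleton is made to face the sensor.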
Moreover, to achieve invariance to translation and rotation of the body with respect to
the sensor reference system, Ahmed Taha’s method uses the shoulder center joint as the origin of
the new system. Assuming the shoulder center joint coordinate is (𝑥, 𝑦, 𝑧), then for each
skeleton joint 𝑖 with coordinates (𝑥𝑖 , 𝑦𝑖 , 𝑧𝑖 ), e.g. the torso joint, the translated
coordinates (𝑥𝑖′ , 𝑦𝑖′ , 𝑧𝑖′ ) are calculated as follows: (𝑥𝑖′ , 𝑦𝑖′ , 𝑧𝑖′ ) = (𝑥𝑖 − 𝑥, 𝑦𝑖 − 𝑦, 𝑧𝑖 − 𝑧).
Finally, the normalization that eliminates the variation among people in height and dimension
is done by converting the translated joints from the Cartesian coordinate system to the
spherical coordinate system. Normalization in the Cartesian coordinate system would change the
X, Y, and Z components. When a point is normalized in spherical coordinates, however, the radial
distance r becomes equal to one while
Figure 7 : Spherical coordinates: radial distance r, azimuthal angle θ, and polar angle φ
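The translation to the shoulder center and the conversion to spherical angles described above can be sketched as follows. This is a minimal illustration under stated assumptions: the choice of Z as the polar axis, the convention returned for a joint that coincides with the origin, and all function names (`translate`, `to_angles`, `angle_features`) are hypothetical and may differ from the cited work.

```python
import math


def translate(joint, origin):
    """Shift an (x, y, z) joint so that `origin` becomes (0, 0, 0)."""
    return tuple(c - o for c, o in zip(joint, origin))


def to_angles(joint):
    """Return (theta, phi) for a translated joint and discard r.
    theta: azimuth in the X-Y plane; phi: polar angle from +Z (assumed)."""
    x, y, z = joint
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return (0.0, 0.0)  # joint at the origin: direction undefined, convention assumed
    theta = math.atan2(y, x)
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    phi = math.acos(max(-1.0, min(1.0, z / r)))
    return (theta, phi)


def angle_features(joints, origin):
    """Flatten the (theta, phi) pairs of all joints into one feature list."""
    features = []
    for joint in joints:
        features.extend(to_angles(translate(joint, origin)))
    return features
```

Because r is dropped, a joint at (1, 1, 1) and a joint at (2, 2, 2) relative to the origin map to the same pair of angles, which is the scale invariance the text relies on. Thirteen joints give a feature list of 26 values.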
Ahmed Taha’s work is evaluated using the benchmark Microsoft Action3D dataset. Twenty
actions are performed in front of the camera. The dataset is
Table 1 : The action subsets and tests of the Microsoft Action3D Dataset
The results show that reducing the dimension of the feature vector reduces the
computational cost of the recognition process, making it feasible for real-time tracking
applications.
Normalized relative orientation (NRO), which makes full use of the dynamics models of
human joints, is used to represent human pose. The NRO of a human joint is computed relative to
the joint that it rotates around, rather than relative to the torso or the hip center (G. Zhu,
2016). For example, the NRO of the left elbow joint is computed relative to the left shoulder
joint, since the left elbow rotates around the left shoulder in the human body. Normalizing the
joints makes the relative orientation insensitive to the subject’s height, limb length, and
position relative to the camera. Shan and Akella (2014) presented a method that recognizes human
actions from normalized skeleton data, which is then segmented using the pose sequence to
extract key poses, from which atomic action templates are established. This reduces intra-class
variation because the method eliminates the influence of nonlinear stretching and random pauses
in the temporal domain.
Ahmed Taha (2014) built a feature vector of 13 pairs of polar (zenith) angle 𝜑 and
azimuth angle 𝜃. The study achieved this by discarding seven joints considered unnecessary or
irrelevant and, since the joint coordinates are normalized, ignoring the radial distance r. The
result is a low-dimensional representation that requires less computational effort. With the help
Reducing the set of joints produces better classification performance (Manzi, 2017).
Manzi proposed a feature vector using 7 joints as reference, specifically the head, neck, torso,
hands, and feet, since this set of joints is the most discriminative for activity recognition,
thus reducing the complexity of computation. In total, this yields a pose feature vector of 18
attributes, calculated as 3(𝑁 − 1), wherein N is the number of joints, which is 7 in Manzi’s work.
Manzi’s work was evaluated using the Cornell Activity Dataset (CAD-60).
Table 3 : State of the art of precision and recall values (%) on CAD-60 dataset.
In studies by Shan (2014) and G. Zhu (2015), key poses and atomic actions are used to
segment the video. The concept is that one human action is composed of a sequence of key poses
and atomic actions. Shan (2014) defined key poses as special poses that have minimal kinetic
energy and atomic actions as the action itself. An example is the action of drinking water; the
key poses are the human hand moving from upward to downward
Figure 8 : Key poses identified in the “drinking water” action sample from the Cornell dataset, indicated
by the dashed vertical lines (top). A set of the identified key poses is shown in the bottom half.
The study achieved an average accuracy of 84% in the MSR Action3D dataset, proving to
extract quality features, multiple steps are taken to preprocess the noisy skeleton
data, since skeleton data can sometimes contain corrupted or heavily occluded poses.
Table 4 : A comparative summary of the performance of prior methods and Shan (2014) method on the MSR
Action3D dataset.
Hidden Markov Models (HMMs), which have had success in speech recognition, deal with
time-sequential data and can provide time-scale invariability in recognition. For each action
type, an HMM has to be trained. Given a set of observation sequences, the parameters of the HMMs
can be calculated using the Expectation Maximization (EM) algorithm.
calculation for the probability of the action $A_j$ for the $i$-th action type $T_i$, that is
$P(A_j \mid T_i)$. In general, an HMM is defined by three elements: the prior distribution for
initial states π, the transition matrix $M(S_{t+1} \mid S_t)$, and the emission matrix
$P(o_j \mid S_t)$, where $S_t$ is the hidden state at time t. A Gaussian distribution is used to
model the coordinate values. Hence, the goal of training the
Yamato (1994) used Hidden Markov Models to distinguish different tennis strokes.
Yamato approached the problem with a feature-based bottom-up approach using HMMs, which are
characterized by their learning capability and time-scale invariability, and an algorithm of a mesh
The results of their experiment show that, to improve the recognition rate, the training
data should be mixed across subjects, because each person performs the actions in a somewhat
unique way. HMMs trained with the data of two subjects achieved a recognition rate of 70.8%.
Table 5 : Recognition rate using the other subjects training data and tested with those of other subjects
In 2008, Rodriguez et al. created the UCF sports actions dataset, consisting of ten
different types of human actions, all collected from broadcast television channels such as ESPN
and BBC:
ball, swinging on the pommel horse and on the floor, and swinging at the high bar. The dataset
consists of 150 video samples with a resolution of 720 x 480 and shows large intra-class
variability. Rodriguez reported 69.2% accuracy, which is the only known published result for the
UCF dataset.
MSR-Action3D Dataset
Wanqing Li created this dataset during his stay at Microsoft Research Redmond.
MSR-Action3D is an action dataset of depth sequences captured with a depth camera. The dataset
contains twenty actions: high arm wave, hammer, horizontal arm wave, hand catch, forward punch,
high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward
kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pick up & throw.
The dataset was acquired with the sensor positioned to capture the athlete’s body from
the side. The taekwondo moves were suggested by the head coach. The seven taekwondo moves are
front leg kick, back leg kick, front leg kick to head, back leg kick to head, front arm punch,
back arm punch, and spinning kick. The dataset is composed of 70 files of 10 athletes performing
each move for 10 seconds, totaling 11 minutes and 40 seconds of recording time, or 1 minute and
40 seconds of recording time for each move. Each recording was stored as a .CSV file.
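The recording totals quoted above follow directly from the counts in the text (10 athletes, 7 moves, 10 seconds per recording). A short worked check in Python, where the variable names and the helper `as_min_sec` are hypothetical:

```python
athletes = 10
moves = 7
seconds_per_recording = 10

files = athletes * moves                               # one file per athlete per move
total_seconds = files * seconds_per_recording          # whole dataset
per_move_seconds = athletes * seconds_per_recording    # all athletes, one move


def as_min_sec(seconds):
    """Split a duration in seconds into (minutes, seconds)."""
    return divmod(seconds, 60)


print(files)                          # 70
print(as_min_sec(total_seconds))      # (11, 40)
print(as_min_sec(per_move_seconds))   # (1, 40)
```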
Figure 10 : The seven taekwondo moves
The dataset focuses on realistic actions from daily life. It was collected with a depth
camera and contains actions performed by four different human subjects: two males and
two females. It contains 12 types of actions: “talking on the phone”, “writing on whiteboard”,
“drinking water”, “rinsing mouth with water”, “brushing teeth”, “wearing contact lenses”, “talking
container”, and “working on computer”. The dataset contains RGB, depth, and skeleton data, with
15 joints available. Each subject performs the activity twice, so one sample contains two
CHAPTER 3
METHODOLOGY
Theoretical Framework
The Kinect helps in extracting the skeleton data that it tracks; two popular methods for
extracting the data are the OpenNi SDK and the Microsoft SDK. The proponents will use the
Microsoft SDK as it provides more advantages than OpenNi. Table 1 shows the advantages of using
Skeleton data extracted from the Kinect is used to recognize the gesture or action of
the user. Skeleton data can be noisy because of the sensor or the user. Therefore, studies have
proposed several algorithms that make the skeleton data invariant to factors such as sensor
orientation and scale. Ahmed Taha (2014), S. Bianco (2012), and Shan & Akella (2014) each
proposed a method that preprocesses the skeleton data to be invariant to sensor orientation and
to scale, with their methodologies yielding average accuracies of 95.2% (MSR Action3D dataset),
97% (Dataset of Karate Moves), and 80% (MSR Action3D dataset), respectively.
Ahmed Taha’s (2014) skeleton algorithm first rotates the skeleton data counterclockwise
about the Y-axis by an angle that makes the skeleton face the sensor, then changes the origin
from the Kinect to a human joint. Finally, converting the 3D Cartesian plane
Manzi (2017) proposed an approach to processing the data that makes it invariant to
sensor orientation and to a person’s specific size. The original reference frame is moved from
the camera to the torso joint and then scaled with respect to the distance between the neck and
torso joints. Manzi presented a feature vector of 7 joints, focusing on the head, neck, torso,
hands, and feet, which are the most discriminative in action recognition. The normalization step
is as follows. With N joints in the skeleton data, the feature vector f is
$f = [j_1, j_2, \ldots, j_{N-1}]$, wherein each $j_i$ is the vector containing the normalized
coordinates of the $i$-th joint $J_i$. Thus $j_i$ is defined as
$j_i = \frac{J_i - J_0}{\lVert J_1 - J_0 \rVert}, \quad i = 1, 2, \ldots, N-1,$
where $J_0$ and $J_1$ are the coordinates of the torso and neck joints, respectively. The
total number of attributes of the feature vector f can be calculated as $3(N - 1)$.
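The normalization above can be sketched in a few lines of Python. This is a minimal illustration, not the cited implementation: the function name `normalize_skeleton`, the joint ordering (torso at index 0, neck at index 1), and the sample coordinates are assumptions made for the example.

```python
import math


def normalize_skeleton(joints):
    """Torso-centred, neck-torso-scaled feature vector.

    joints: list of (x, y, z) tuples with joints[0] = torso (J_0) and
    joints[1] = neck (J_1); this ordering is assumed for the sketch.
    Returns a flat list of 3 * (N - 1) values.
    """
    torso, neck = joints[0], joints[1]
    scale = math.dist(neck, torso)  # ||J_1 - J_0||
    if scale == 0.0:
        raise ValueError("neck and torso coincide; cannot scale")
    feature = []
    for joint in joints[1:]:  # i = 1 .. N-1, the torso itself is skipped
        feature.extend((c - t) / scale for c, t in zip(joint, torso))
    return feature


# Hypothetical 7-joint skeleton: torso, neck, head, two hands, two feet.
skeleton = [
    (0.0, 0.0, 2.0), (0.0, 0.5, 2.0), (0.0, 0.8, 2.0),
    (-0.4, 0.2, 2.0), (0.4, 0.2, 2.1), (-0.2, -0.9, 2.0), (0.2, -0.9, 2.0),
]
print(len(normalize_skeleton(skeleton)))  # 18, i.e. 3 * (7 - 1)
```

Scaling every coordinate of the skeleton by the same factor leaves the feature vector unchanged, which is the size invariance the normalization is meant to provide.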
G. Zhu (2016) processes skeleton data using normalized relative orientation (NRO), which
makes use of the dynamics models of human joints: the NRO of a joint is computed relative to
the joint it rotates around. The use of NRO makes the data insensitive to human height, limb
length, and distance to the camera. After a simple moving average is applied to smooth the data,
a joint’s NRO is calculated as follows. Let $P_i = (x_i, y_i, z_i)$ represent the position of
joint i in the coordinate system; the NRO of joint i relative to joint j can be computed as
$F_{NRO} = \frac{P_i - P_j}{\lVert P_i - P_j \rVert}$,
where $\lVert \cdot \rVert$ denotes the Euclidean norm, so the denominator is the Euclidean
distance between the two joints.
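A minimal Python sketch of the NRO formula above is given below. The function name `nro` and the sample elbow and shoulder positions are hypothetical, and the smoothing step mentioned in the text is left out.

```python
import math


def nro(p_i, p_j):
    """Normalized relative orientation of joint i with respect to joint j:
    the unit vector pointing from p_j to p_i."""
    diff = [a - b for a, b in zip(p_i, p_j)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("joints coincide; orientation is undefined")
    return tuple(d / norm for d in diff)


# Hypothetical left elbow relative to left shoulder, for two arm lengths.
shoulder = (0.0, 0.0, 0.0)
print(nro((1.0, 2.0, 2.0), shoulder))
print(nro((2.0, 4.0, 4.0), shoulder))
```

Both calls return the same unit vector, which illustrates why the representation is insensitive to limb length.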
Figure 11 : An example of relative orientation of the skeleton consisting of 20 joints
Feature vectors can be defined as the conversion of large input data into a reduced
representation set of features that are used to distinguish different activities and then used as
inputs for classification algorithms. The number of features affects the recognition performance
a sub stage of increasing accuracy and reducing the computational complexity to achieve
real-time recognition. Ahmed Taha (2014) built a feature vector composed of thirteen pairs of
azimuth and zenith angles. The study achieves this by converting the reduced skeleton points to
spherical coordinates and normalizing them. The original twenty skeleton joints are reduced to
thirteen on the premise that not all joints contribute or are important to an action; the study
therefore removes the torso area of the data, as the torso area did not show motion independent
of the whole body. Anjum (2014) also reduced the size of the feature vector by tracking the
joint undergoing the most distinct motion during an activity and used a joint relative to
another joint to account for different positions of the user performing the action in front of
the camera. Shan & Akella (2014) approached the segmentation process of human action recognition
by identifying key poses using pose kinetic energy. G. Zhu (2015) & (2016) also approached the
segmentation process by identifying key poses and atomic actions, resulting in robustness to
inter-class similarity and intra-class dissimilarity. Both studies have a similar approach in
which a feature sequence is defined as $S = (F_1, F_2, \ldots, F_N)$, with N the number of frames
in the feature sequence and $F_i$ the feature vector in the $i$-th frame. The kinetic energy of
the $i$-th feature vector can be computed as
$E_p(i) = \sum_{j=1}^{L} \lVert F_i^j - F_1^j \rVert^2$,
that is, the squared Euclidean distance between the $i$-th feature vector and the first feature
vector, summed over the L components $F_i^j$ of the feature vector. Then the kinetic energy
between two adjacent feature vectors can be calculated by $E_d(i) = E_p(i) - E_p(i-1)$. A
threshold-based segmentation is used to divide the feature sequence into static segments and
dynamic segments, wherein frames with $|E_d(i)| < E_{min}$ are labeled as static segments while
the others are dynamic segments. Here $|\cdot|$ is the absolute value operator and $E_{min}$ is
an empirical parameter. Finally, a key pose codebook and an atomic action codebook are
constructed by extracting key poses from the static segments with clustering algorithms, while
the atomic actions between any 2 key poses are clustered with the associated dynamic segments.
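The kinetic energy and threshold-based labeling described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each frame is a flat list of numbers, the energy difference of the first frame is taken to be zero, and the function names and the threshold value are hypothetical. The clustering into codebooks is not shown.

```python
def pose_energy(sequence):
    """E_p(i): squared Euclidean distance of every frame to the first frame."""
    first = sequence[0]
    return [sum((a - b) ** 2 for a, b in zip(frame, first)) for frame in sequence]


def energy_difference(energy):
    """E_d(i) = E_p(i) - E_p(i - 1); the first frame gets 0.0 (assumption)."""
    return [0.0] + [energy[i] - energy[i - 1] for i in range(1, len(energy))]


def label_frames(sequence, e_min):
    """Label each frame 'static' if |E_d(i)| < e_min, otherwise 'dynamic'."""
    differences = energy_difference(pose_energy(sequence))
    return ["static" if abs(d) < e_min else "dynamic" for d in differences]


# Hypothetical one-dimensional feature sequence: rest, movement, rest.
frames = [[0.0], [0.0], [1.0], [2.0], [2.0]]
print(label_frames(frames, e_min=0.5))
# ['static', 'static', 'dynamic', 'dynamic', 'static']
```

Runs of consecutive static frames would be the candidates for key poses, and the dynamic runs between them the candidates for atomic actions; the value of `e_min` would have to be tuned empirically, as the text notes.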
The Hidden Markov Model (HMM) has had particular success in speech recognition and has
been applied in many studies in the Computer Vision field. HMMs are characterized by their
learning ability and by the time-scale invariability they provide in recognition, which makes it
possible to deal with time-sequential data and to optimize the model automatically from the
data. Variation in how each person performs an action affects the recognition rate of HMMs. To
overcome this issue, the recognition rate can be improved by using mixed training data of the
actions (Yamato, 1992). Collecting more training patterns that suitably represent the different
actions can further improve the recognition rate of HMMs, since the test-pattern and
training-pattern subjects are different.
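One way to set up an evaluation in which the test subject never appears in the training data, while the training data mixes the remaining subjects, is a leave-one-subject-out split. The sketch below is only an illustration of that idea; the record layout `(subject_id, label, sequence)` and the function name are hypothetical and are not taken from the cited studies.

```python
def leave_one_subject_out(recordings):
    """Yield (held_out_subject, train, test) splits.

    recordings: list of (subject_id, label, sequence) tuples.
    Each split trains on all subjects except one and tests on that one.
    """
    subjects = sorted({subject for subject, _, _ in recordings})
    for held_out in subjects:
        train = [r for r in recordings if r[0] != held_out]
        test = [r for r in recordings if r[0] == held_out]
        yield held_out, train, test


# Hypothetical recordings: three subjects, two kicks each.
recordings = [
    (subject, kick, [0.0])
    for subject in ("a", "b", "c")
    for kick in ("front leg kick", "back leg kick")
]
for held_out, train, test in leave_one_subject_out(recordings):
    print(held_out, len(train), len(test))
```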
Conceptual Framework
The input of the study will be the training sequences of the taekwondo moves, acquired
primarily with the Kinect sensor and extracted using the Microsoft SDK, as it provides 20
joints.
The next process will be the preprocessing of the raw 3D skeleton coordinates provided
by the Microsoft SDK. The preprocessing is as follows: the origin of the data, which is
originally the sensor, is moved from the camera to the torso joint, and then the joints are
scaled with respect to the distance between the neck and torso joints. Next is the elimination of
the joints irrelevant to the action, reducing the feature vector size. For the segmentation
process, key pose and atomic action codebooks are constructed for the training sequences.
Finally, using
The output will be the correct recognition of the taekwondo kicks, specifically, kicks that
have high rotation (back leg) and high similarity (back leg/back leg to head and front leg/front leg
to head).
CHAPTER 4
EXPERIMENTS
The data gathering preparation as well as the data gathering itself will be performed in a sports
training center or gym with the guidance of a Taekwondo head coach. The coach will recommend
kicks for our experiments, specifically kicks that are commonly used by the athletes in their
daily training sessions.
Participants
The participants of the study will be athletes of the sport, with guidance from the
Taekwondo head coach. Following De Souza (2016), we will also use 10 athletes of different
sexes, ages, heights, weights, and belt colors to perform the recommended moves.
All data will be acquired through the use of a single Microsoft Kinect. The Kinect is a
iii) Multiple skeleton tracking (at most 6 persons) in the same scene
The use of skeleton tracking and Microsoft SDK will allow the tracking of the athlete’s
Data Gathering
The Kinect sensor will be set in a position where the camera and sensor can capture the
athlete’s body from the side. The dataset assembly will be as follows: each participant will
perform each move for 10 seconds, take a short break, and then perform another move until all
the moves are performed. The end result will be a number of files. The selected kicks are:
vii. Turning side kick (kick to the body)
The positions of the skeleton joints are provided as Cartesian coordinates (X, Y, Z) with
the origin centered at the Kinect. The positive Y axis points up, the positive Z axis points
where the Kinect is pointing, and the positive X axis is to the left. To achieve invariance to
sensor orientation and to a person’s specific size, the original reference frame is moved from
the camera to the torso joint and then scaled with respect to the distance between the neck and
torso joints. The normalization step is as follows: with N joints in the skeleton data, the
feature vector f is $f = [j_1, j_2, \ldots, j_{N-1}]$, wherein each $j_i$ is the vector
containing the normalized coordinates of the $i$-th joint $J_i$. Thus $j_i$ is defined as
$j_i = \frac{J_i - J_0}{\lVert J_1 - J_0 \rVert}, \quad i = 1, 2, \ldots, N-1,$
where $J_0$ and $J_1$ are the coordinates of the torso and neck joints, respectively. The
reduction of the set of joints will be done through the
frame number of the feature sequence and $F_i$ as the feature vector in the $i$-th frame. The kinetic
energy of the $i$-th feature vector can be computed as $E_p(i) = \sum_{j=1}^{L} \lVert F_i^j - F_1^j \rVert^2$, that is, the
squared Euclidean distance between the $i$-th feature vector and the first feature vector, summed over the index $j$.
Then the change in kinetic energy between two adjacent feature vectors can be calculated by
$E_d(i) = E_p(i) - E_p(i-1)$. A threshold-based segmentation is used to split the feature
sequence into static segments and dynamic segments: frames with $|E_d(i)| < E_{min}$ are labeled as static
segments while the others are dynamic segments. Here $|\cdot|$ is the absolute value operator and $E_{min}$ is
an empirical parameter. Finally, a key pose codebook and an atomic action codebook are constructed:
key poses are extracted from the static segments by clustering algorithms, while
the atomic actions between any two key poses are clustered from the associated dynamic segments.
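The normalization and threshold-based segmentation described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the proponents' implementation: the joint ordering (row 0 is the torso, row 1 is the neck), the function names, the toy joint coordinates, and the value of E_min are all hypothetical, and E_d of the first frame is set to 0 by convention.

```python
import numpy as np

def normalize_skeleton(joints):
    """Move the origin to the torso and scale by the neck-torso distance.

    joints: (N, 3) array; row 0 is assumed to be the torso (J_0) and
    row 1 the neck (J_1). Returns the (N-1, 3) vectors j_1 .. j_{N-1}.
    """
    joints = np.asarray(joints, dtype=float)
    torso, neck = joints[0], joints[1]
    scale = np.linalg.norm(neck - torso)
    if scale == 0.0:
        raise ValueError("neck and torso coincide; cannot scale")
    return (joints[1:] - torso) / scale

def pose_energy(features):
    """E_p(i) = sum_j ||F_i^j - F_1^j||^2, measured against the first frame.

    features: (T, L, 3) array of normalized joints, one row per frame.
    """
    features = np.asarray(features, dtype=float)
    diff = features - features[0]
    return np.sum(diff ** 2, axis=(1, 2))

def static_mask(features, e_min):
    """Label frame i as static when |E_d(i)| < e_min."""
    e_p = pose_energy(features)
    # E_d(i) = E_p(i) - E_p(i-1); the first frame gets E_d = 0 (assumption).
    e_d = np.diff(e_p, prepend=e_p[0])
    return np.abs(e_d) < e_min

# Toy sequence with three joints (torso, neck, foot); only the foot moves.
foot_x = [1.0, 1.0, 2.0, 3.0, 3.0]
frames = [[[0, 0, 0], [0, 2, 0], [x, 0, 0]] for x in foot_x]
features = np.stack([normalize_skeleton(f) for f in frames])
static = static_mask(features, e_min=0.1)  # e_min is a made-up threshold
print(static)
```

In this toy sequence the frames where the foot is at rest come out as static and the frames where it moves come out as dynamic; in practice E_min would have to be tuned on the recorded kicks.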
Hidden Markov Models (HMMs) are characterized by their learning ability and can provide time-
scale invariance in recognition. For each taekwondo kick, an HMM has to be trained so that it
generates the symbol pattern of its corresponding move. Given a set of observation sequences, the
parameters of the HMMs can be calculated using the Expectation Maximization (EM) algorithm.
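To make the role of the per-kick HMMs concrete, the sketch below scores a symbol sequence against several discrete HMMs with the scaled forward algorithm and picks the model with the highest log-likelihood. It is an illustration only: the model names (kick_a, kick_b), the two-state left-to-right structure, and all probabilities are hypothetical toy values rather than trained parameters, and EM training itself is not shown.

```python
import numpy as np

def log_likelihood(symbols, start, trans, emit):
    """Scaled forward algorithm: returns log P(symbols | HMM).

    start: (S,) initial state probabilities
    trans: (S, S) with trans[i, j] = P(next state j | state i)
    emit:  (S, K) with emit[i, k] = P(symbol k | state i)
    """
    start, trans, emit = (np.asarray(a, dtype=float) for a in (start, trans, emit))
    alpha = start * emit[:, symbols[0]]
    log_p = 0.0
    for t in range(len(symbols)):
        if t > 0:
            alpha = (alpha @ trans) * emit[:, symbols[t]]
        norm = alpha.sum()
        if norm == 0.0:
            return -np.inf  # sequence is impossible under this model
        log_p += np.log(norm)
        alpha = alpha / norm  # rescale to avoid numerical underflow
    return log_p

def recognize(symbols, models):
    """Return the name of the best-scoring HMM and all scores."""
    scores = {name: log_likelihood(symbols, *params) for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy models: (start, trans, emit) per kick type.
left_to_right = [[0.7, 0.3], [0.0, 1.0]]
models = {
    "kick_a": ([1.0, 0.0], left_to_right, [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "kick_b": ([1.0, 0.0], left_to_right, [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]),
}
best, scores = recognize([0, 0, 1, 1], models)
print(best, scores)
```

Working in log space with per-step rescaling keeps the scores comparable across long sequences, which matters when kicks last many frames.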
calculation for the probability of the action 𝐴𝑗 for the 𝑖 𝑡ℎ action type 𝑇𝑖 , that is 𝑃(𝐴𝑗 |𝑇𝑖 ).
Recognition of the actions is done by vector quantization of each frame's
feature vector to obtain a symbol sequence. The recognition result will be the HMM that
The proponents will use leave-one-out cross validation and a confusion matrix to evaluate
the recognition performance of the model. In leave-one-out cross validation, 9 out of the 10
participants will be used to train the classifier while the remaining participant will be used for
testing. A confusion matrix describes the performance of a classifier evaluated on a set of test
data for which the true values are known. The aim of the first experiment is to assess the
shift of origin from the Kinect's coordinate origin to a point in the subject's lower body.
The second experiment aims to find the minimum set of joints required to recognize
the action.
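The evaluation protocol can be sketched as below. This is a minimal illustration under assumptions: the helper names, the two-recordings-per-participant layout, and the example labels are hypothetical, and no classifier is trained here; the sketch only shows how the participant-wise splits and the confusion matrix could be built.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per participant."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test = np.flatnonzero(subject_ids == held_out)
        train = np.flatnonzero(subject_ids != held_out)
        yield held_out, train, test

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """cm[t, p] counts samples of true class t predicted as class p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm

# Hypothetical layout: 10 participants with 2 recordings each.
subjects = np.repeat(np.arange(10), 2)
splits = list(leave_one_subject_out(subjects))

# Hypothetical labels for three kick classes, only to show the matrix.
cm = confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], n_classes=3)
accuracy = np.trace(cm) / cm.sum()
print(len(splits), cm, accuracy)
```

Splitting by participant rather than by recording keeps every sample of the test athlete out of training, which matches the 9-train, 1-test scheme described above.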
REFERENCES
Ahmed Taha, H. H.-S.-H. (2014, July). Human Action Recognition based on MSVM and Depth
Alzahrani, M., & Kammoun, S. (2016, May). Human Activity Recognition: Challenges and
Bridge, C., Jones, M., & Drust, B. (2011). The Activity Profile in International Taekwondo
de Souza Vicente, C. M. (2016). High performance moves recognition and sequence segmentation
based on key poses filtering. IEEE Winter Conference on Applications of Computer Vision
(WACV), 1-8.
G. Zhu, L. Z. (2016). Human Action Recognition using Multi-layer Codebooks of Key Poses and
Gahlot, A., Agarwal, P., Agarwal, A., Singh, V., & Gautam, A. K. (2016). Skeleton based Human
Kasiri-Bidhendi, S. (2015). Combat sports analytics: Boxing punch classification using overhead
L. Miranda, T. V. (2012). Real-time gesture recognition from depth data through key poses
learning and decision forests. 25th SIBGRAPI Conference on Graphics, Patterns and
Images, 268-275.
Reily, B. J. (2016). Human activity recognition and gymnastics analysis through depth imagery.
S. Bianco, F. T. (2012). Karate moves recognition from skeletal motion. 3D Image Processing and
Applications.
Shan, J., & Akella, S. (2014, September). 3D human action segmentation and recognition using
pose kinetic energy. In Advanced Robotics and its Social Impacts (ARSO), 69-75.
Vennila Megavannan, B. A. (2013). Human Action Recognition using Depth Maps. In proceedings
5.