
Taekwondo Kicks Detection Using RGB-D Sensors

By

Maverick S. Bustos
Josemaria A. Sebastian

A Thesis Proposal Paper Submitted to the School of Information Technology


In Partial Fulfillment of the Requirements for CS200
Bachelor of Science in Computer Science

Mapua University - Makati


November 2017
TABLE OF CONTENTS

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER 1

INTRODUCTION

CHAPTER 2

REVIEW OF RELATED LITERATURE AND STUDIES

CHAPTER 3

METHODOLOGY

CHAPTER 4

EXPERIMENTS

REFERENCES

LIST OF TABLES

Table 1 : The action subsets and tests of the Microsoft Action3D Dataset

Table 2 : Ahmed Taha's proposed method performance compared to other methods

Table 3 : State-of-the-art precision and recall values (%) on the CAD-60 dataset

Table 4 : A comparative summary of the performance of prior methods and Shan's (2014) method on the MSR Action3D dataset

Table 5 : Recognition rate using other subjects' training data and tested with those of other subjects

Table 6 : Comparison of OpenNi SDK and Microsoft SDK

LIST OF FIGURES

Figure 1 : Similarity of the Roundhouse kick (top) to the Mawashi geri kick (bottom)

Figure 2 : Complexity and Similarity of the Taekwondo Kicks

Figure 3 : Hardware of the Kinect

Figure 4 : Skeleton model detected by Microsoft SDK and OpenNI

Figure 5 : The block diagram of Ahmed Taha's work

Figure 6 : The orientation of the skeleton data with respect to the Kinect

Figure 7 : Spherical coordinates: radial distance r, azimuthal angle θ, and polar angle φ

Figure 8 : The "drinking water" action sample from the Cornell dataset; identified key poses are indicated by the dashed vertical lines (top) and shown in the bottom half

Figure 9 : Example of an extracted tennis action and symbol sequence

Figure 10 : The seven taekwondo moves

Figure 11 : An example of relative orientation of a skeleton consisting of 20 joints

Figure 12 : The proposed method

Figure 13 : Recording setup

ABSTRACT

Today, considerable resources are invested in sports to make them competitive and entertaining, especially at the professional level, whether for injury prevention and analysis or for improving an athlete's performance. Yet within Human Action Recognition, studies on sports remain limited. Sports moves are executed quickly and have low inter-class variability, which produces noisy features and ambiguity compared to daily human actions. In this study, we propose an approach that uses skeletal data from the Kinect, preprocesses the data so that it is invariant to the sensor and to variations in human height and limb length, and removes skeleton joints irrelevant to the action being performed. The proponents also use key poses and atomic actions for the segmentation process. Lastly, the actions are classified with Hidden Markov Models (HMM). The evaluation will compare a model that uses the full set of joints against a model that uses a reduced set of joints, focusing on taekwondo kicks, which can reach speeds of 26 m/s.

Keywords: Human Action Recognition, Kinect, skeleton algorithm, key poses, atomic

actions, Hidden Markov Model.

CHAPTER 1

INTRODUCTION

Human action recognition is the task of recognizing human activities from videos or

images. The computer vision community has conducted multiple studies that try to solve the many

problems in recognizing human activities such as occlusion, changes in scale, viewpoint, lighting,

appearance, background clutter, and the complexity of human movements. The study of human

action recognition in the field of computer vision has been an active area of research due to the

various applications in which it can be used, namely security, robotics, surveillance, entertainment, and medicine.

Videos have numerous applications, specifically in sports, where coaches and athletes increasingly use the medium to assess and correct strategy and technique, and to break down team and individual performances. In a competitive combat sports environment such as karate, muay

thai, boxing, wrestling, and taekwondo, just to name a few, statistics of an athlete’s performance

can be a valuable source of data for feedback and improvement purposes. The standard method of

analyzing an athlete’s performance is performed by the use of the play and pause function of video

recordings to manually collect raw statistics.

The Microsoft Kinect sensor is able to distinguish the kind of activity being performed by an individual, for example standing, walking, punching, sitting, waving, and so forth. Various studies have used the Microsoft Kinect because of its wide availability and low cost. Another reason the Kinect is used in action recognition is that it can produce depth maps. A depth

map is an image or image channel that contains information relating to the distance of the surfaces of scene objects from a viewpoint. Depth maps are invariant to illumination.

Taekwondo is a martial art of Korean origin, which has evolved from a traditional martial art into an Olympic combat sport. The primary focus of combat is to win either by obtaining more points than the opponent through kicks and punches to the scoring areas or by a technical knockout (Bridge, Jones, & Drust, 2011). Matches are played in a 10 x 10 m area and comprise three rounds of 2 min, with a 1 min rest interval separating each round. Despite the sport's international and Olympic recognition, research into its physiological demands is in its infancy (Bridge C. J., 2009). Bridge (2011) studied the activity profile of taekwondo and found an average fighting to non-fighting ratio of 1:6, where fighting periods usually lasted less than 2 seconds and the non-fighting periods (preparatory, non-preparatory, and general stoppages) lasted less than 16 seconds.

According to De Souza Vicente (2016), sport actions differ from daily human actions in that they have low inter-class variability, which produces noisier features and more ambiguity. The limited research on sports mostly focuses on execution or correct technique, while some focuses on statistics. In 2012, S. Bianco and F. Tisato conducted a study that automatically recognizes sequences of complex karate movements and evaluates the quality of the performed move. For statistics in a boxing match, Kasiri-Bidhendi, S. (2015) presented a robust framework for automatically classifying a boxer's punches, using overhead depth imagery to reduce the challenges associated with occlusions and building a robust body-part tracking algorithm for time-of-flight sensors.

Action recognition has also used other sensors to acquire data, for example wearable sensors such as accelerometers. However, the Microsoft Kinect has proven itself many times in different human action recognition studies as a data acquisition device, being invariant to lighting. A user must face the sensor, or stand directly in front of it, because the Kinect was initially designed to recognize users facing the sensor. Capturing a subject from the side can occlude joints and degrade the generated skeleton, so robustness of skeleton extraction and tracking remains a main issue for future work (L. Miranda, 2012). Ahmed Taha (2014), S. Bianco (2012), and Shan & Akella (2014) all proposed methods that preprocess the skeleton data to be invariant to sensor orientation and to scale; their methodologies yielded average accuracies of 95.2% (MSR Action3D dataset), 97% (Dataset of Karate Moves), and 80% (MSR Action3D dataset), respectively.

De Souza Vicente (2016) presented a discriminative key pose-based approach for automatic action recognition and segmentation of training sequences for high-performance sports, specifically taekwondo. They acquired data with the Kinect by capturing the athlete's performance sideways. De Souza Vicente's work had difficulty in classifying similar kicks (back leg kick/back leg kick to the head and front leg kick/front leg kick to the head) and segmentation

of kicks (back leg kicks or back leg kicks to the head). Taekwondo moves that are similar and involve high body rotation hurt recognition, resulting in the dissertation's average recognition accuracy of 74.72%. They attributed the problem to limitations of the Kinect's video capture and body estimation technology: the complexity of high-speed moves with large joint displacement causes imprecision in Kinect tracking, harming the extracted feature vector. However, S. Bianco and F. Tisato (2012) created a framework for evaluating the quality of karate moves, which also involve high body rotation, using the Kinect; for example, the move "Mawashi geri" is similar to the "Roundhouse Kick" or back leg kick in taekwondo, and they did not encounter any problem pertaining to high body rotation.

Figure 1 : Similarity of the Roundhouse kick (top) to the Mawashi geri kick (bottom).

Ahmed Taha (2014) achieved a feature vector of 13 pairs of polar/zenith angle φ and azimuth angle θ. The study achieved this by discarding unnecessary or irrelevant joints, specifically seven of them, and, since the joint coordinates are normalized, by ignoring the radial distance r, thus achieving a low-dimensional representation that requires less computational effort. Using a Multi-class Support Vector Machine (SVM) for action classification, the method achieved an average recognition accuracy of 95.2% on the MSR Action3D dataset.

Reducing the set of joints produces better classification performance (Manzi, 2017). Manzi proposed a feature vector based on 7 joints, specifically the head, neck, torso, hands, and feet, since this set of joints is the most discriminative for activity recognition, thus reducing computational complexity. In total, this yields a pose feature vector of 18 attributes through the calculation of 3(N − 1), where N is the number of joints, which in Manzi's work is 7.

Figure 2 : Complexity and Similarity of the Taekwondo Kicks

Moves in sports are executed quickly and have inter-class similarities, which produce noisy features and ambiguity. Complex movements such as taekwondo kicks are fast-paced, involve high body rotation, and are, to some extent, similar in form. We aim to address this problem by identifying proper form and execution of taekwondo kicks while focusing on the important joints. Specifically, this study will answer the following questions:

• What is the effect of shifting the origin from the Kinect's coordinate frame to a point on the subject's lower body?

• What is the minimum set of joints needed to classify a taekwondo kick?

The significance of our work is to show that prioritizing the skeleton joints most relevant to the action can benefit human action recognition studies and sports video analysis.

The objective of the study is to develop a model that can distinguish different taekwondo actions. The specific objective of the study is to eliminate skeleton joints that are irrelevant to the action, reducing the feature vector used for classification.

The main focus of this study is the implementation of a skeleton algorithm that removes irrelevant skeleton joints, achieving dimensionality reduction. The proponents limited this research to the sport of taekwondo since taekwondo moves involve high body rotation and can reach peak linear speeds of up to 26 m/s, and there is limited action recognition research on sports.

To understand and clarify the terms used in the study, the following are hereby defined:

Dimensionality Reduction – Reduction of features that are involved in the classification process

thus increasing accuracy and reducing computational effort.

Feature vector – a vector that contains information describing an object's important

characteristics.

Hidden Markov Model (HMM) – is a statistical Markov model in which the system being

modeled is assumed to be a Markov process with unobserved (hidden) states.

Intra-class variations – the same activity may vary from subject to subject.

Inter-class similarity - activities that are fundamentally different, but that show very similar

characteristics in the sensor data.

Kinect – a line of motion sensing input devices by Microsoft for Xbox 360 and Xbox One video

game consoles and Microsoft Windows PCs.

Time-of-flight sensors (ToF) – a range imaging camera system that resolves distance based on

the known speed of light, measuring the time-of-flight of a light signal between the camera and

the subject for each point of the image.

CHAPTER 2

REVIEW OF RELATED LITERATURE AND STUDIES

In 2016, Mona Alzahrani & Salma Kammoun performed a literature review on Human Action Recognition. Alzahrani & Kammoun focused on the latest and most important works in Human Action Recognition and presented the five generic process stages: data acquisition from the sensor, preprocessing of the data, segmentation of the activity, extraction of the feature vectors, and training and classification. They concluded that the choice of sensor, depending on the chosen application and approach, changes the performance and results.

Microsoft Kinect

One of the most recent advancements in Computer Vision is the Microsoft Kinect. It retrieves 3D information about a scene and recognizes the action performed by the human body through depth image information and real-time skeletal tracking. What makes the Microsoft Kinect beneficial is its wide availability and low cost. The Kinect has wide application in computer science, engineering, medicine, robotics, and other fields. The Kinect hardware contains a depth sensor, a color (RGB) camera, and a four-microphone array. The sensor produces a depth image/map from the IR image. The depth value is encoded in gray values: the darker the pixel, the closer the point is to the camera. If no depth value is available, indicated by black pixels, the point may be too far from or too near to the sensor to be registered (Gahlot, Agarwal, Agarwal, Singh, & Gautam, 2016). The Kinect can capture an RGB image and a depth image simultaneously. The resulting RGB-D data is used to produce a skeleton model of the subject in the scene, which can then be used to detect and distinguish human poses.

Figure 3 : Hardware of Kinect.

A Depth Image/Map is an image that contains information relating to the distance of the

surfaces of scene objects from a viewpoint (Vennila Megavannan, 2013). Depth images offer a number of advantages: they work in low light levels, provide a calibrated scale estimate, are invariant to color and texture, resolve silhouette ambiguities in pose, and simplify tasks such as background subtraction and object segmentation (Vennila Megavannan, 2013).

Microsoft SDK and OpenNi SDK

According to Alexandros Andre Chaaraouia (2014), skeletal information can be obtained from depth maps by two methods: the Microsoft SDK by Microsoft (http://www.microsoft.com/enus/kinectforwindows/) and OpenNI by PrimeSense (http://www.primesense.com/open-ni/). The Microsoft SDK and OpenNI produce the same skeleton model; the only difference is in the number of skeleton joints. The Microsoft SDK produces a skeleton model of 20 joints, while OpenNi provides a subset of the Microsoft SDK, a 15-joint skeleton model. Figure 4 shows the skeleton joints detected by both methods. All the solid and unsolid circles are joints detected by the Microsoft SDK, while only the solid circles are joints detected by OpenNi.

Figure 4 : Skeleton Model detected by Microsoft SDK and OpenNI

Processing of the skeleton data

Numerous studies have utilized the skeleton data from the Kinect. Nonetheless, raw coordinates cannot be used directly due to noise from the sensor or from the user. A lack of 3D information for some joint positions is a common problem, demanding that the user face the sensor, since the Kinect was initially designed to capture the subject from the front. Capturing the user from the side can occlude joints, degrading the skeleton. Another issue is that different individuals performing the same pose can generate different skeleton models. In summary, there is a need for robust skeleton tracking algorithms, and numerous studies have shown this. Ahmed Taha's (2014) processing of the skeleton data is invariant to the scale of the subjects and the camera orientation while maintaining the relationships among different body parts. Figure 5 shows the block diagram of their proposed method.

Figure 5 : The block diagram of Ahmed Taha’s Work

Ahmed Taha (2014) used the Microsoft SDK to track and obtain the 20 skeleton joints in the scene. The positions of the skeleton joints are provided as Cartesian coordinates (X, Y, Z) with the origin centered at the Kinect. The positive Y axis points up, the positive Z axis points where the Kinect is pointing, and the positive X axis is to the left. Ideally, the subject should face the sensor, but this is not always the case. Therefore, all skeleton points are rotated counterclockwise around the Y-axis by an angle α in order to make the subject face the sensor. The angle α is the angle between the line connecting both shoulders and the positive X-axis of the Kinect coordinate system. Assuming the coordinates of the two shoulder joints are (x_L, y_L, z_L) for the left shoulder and (x_R, y_R, z_R) for the right shoulder, the angle α is computed as α = tan⁻¹((z_R − z_L) / (x_R − x_L)).

Figure 6 : The orientaion of the skeleton data with respect to the Kinect

Then a counterclockwise rotation about the Y-axis by the angle α is applied to all skeleton joints. For each skeleton joint i with coordinates (x_i, y_i, z_i), the rotated coordinates are calculated as x_i′ = x_i cos α + z_i sin α, y_i′ = y_i, and z_i′ = −x_i sin α + z_i cos α.

Moreover, to achieve invariance to translation and rotation of the body with respect to the sensor reference system, Ahmed Taha's method uses the shoulder center joint as the origin of the new system. Assuming the shoulder center joint coordinate is (x, y, z), then for each skeleton joint i with coordinates (x_i, y_i, z_i), e.g. the torso joint, the translated coordinates (x_i′, y_i′, z_i′) are calculated as follows: (x_i′, y_i′, z_i′) = (x_i − x, y_i − y, z_i − z). Finally, a normalization process eliminates the variation among people in height and dimension by converting the translated joints from the Cartesian coordinate system to the spherical coordinate system. Normalization in the Cartesian coordinate system would change the X, Y, and Z components, whereas normalizing a point in spherical coordinates sets the radial distance r to one while leaving both the polar/zenith angle φ and the azimuth angle θ unchanged.
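The two alignment steps, the α rotation about the Y-axis and the translation of the origin to the shoulder center, can be sketched in a few lines of numpy. This is a minimal illustration under the Kinect axis convention described above; the function and variable names are our own, not Taha's.

```python
import numpy as np

def face_sensor(joints, left_shoulder, right_shoulder):
    """Rotate all skeleton joints counterclockwise about the Y-axis by the
    angle alpha between the shoulder line and the sensor's positive X-axis,
    so the subject faces the sensor. joints is an N x 3 array."""
    xL, _, zL = left_shoulder
    xR, _, zR = right_shoulder
    alpha = np.arctan2(zR - zL, xR - xL)
    c, s = np.cos(alpha), np.sin(alpha)
    # Rotation about Y: x' = x cos a + z sin a, y' = y, z' = -x sin a + z cos a
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return joints @ rot.T

def translate_to_origin(joints, origin_joint):
    """Move the coordinate origin from the Kinect to a body joint
    (Taha uses the shoulder center)."""
    return joints - origin_joint
```

After these two steps the joints can be converted to spherical coordinates, where normalization only fixes r = 1 and leaves the two angles untouched.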

Figure 7 : Spherical coordinates: radial distance r, azimuthal angle θ, and polar angle φ

Ahmed Taha's work was evaluated on the benchmark Microsoft Action3D dataset, in which twenty actions are performed in front of the camera. The dataset is divided into three subsets to reduce computational cost.

Table 1 : The action subsets and tests of the Microsoft Action3D Dataset

The results show that the dimensionality reduction of the feature vector lowers the computational cost of the recognition process, making it feasible for real-time tracking applications.

Table 2 : Ahmed Taha's proposed method performance compared to other methods

Normalized relative orientation (NRO)

Normalized relative orientation (NRO) takes full advantage of the dynamics of human joints to represent the human pose. The NRO of a human joint is computed relative to the joint that it rotates around, not relative to the torso or the hip center (G. Zhu, 2016). For example, the left elbow joint's NRO is computed relative to the left shoulder joint, since the left elbow rotates around the left shoulder in the human body. Normalizing the joints makes the relative orientation insensitive to the subject's height, limb length, and position relative to the camera. Shan and Akella (2014) presented a method to recognize human actions from skeleton data that is normalized and then segmented using the pose sequence to extract key poses, from which the atomic action templates are established. This reduces intra-class variation because the method eliminates the influence of nonlinear stretching and random pauses in the temporal domain.

Reduction of Skeleton Features

Ahmed Taha (2014) achieved a feature vector of 13 pairs of polar/zenith angle φ and azimuth angle θ. The study achieved this by discarding unnecessary or irrelevant joints, specifically seven of them, and, since the joint coordinates are normalized, by ignoring the radial distance r, thus achieving a low-dimensional representation that requires less computational effort. With the help of a Multi-class Support Vector Machine for action classification, the method achieved an average recognition accuracy of 95.2% on the MSR Action3D dataset.
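As a sketch of this angle-pair feature: each translated joint is mapped to its azimuth θ and zenith φ, and the radial distance r is dropped since normalization fixes it to one. The names and the axis convention (zenith measured from the Y "up" axis, azimuth in the X-Z plane) are our assumptions, not taken from Taha's paper.

```python
import numpy as np

def angle_pairs(joints):
    """Convert translated joints (N x 3) to (azimuth theta, zenith phi)
    pairs, discarding the radial distance r. Assumed convention: zenith
    from the Y (up) axis, azimuth in the X-Z plane."""
    x, y, z = joints[:, 0], joints[:, 1], joints[:, 2]
    r = np.linalg.norm(joints, axis=1)
    theta = np.arctan2(z, x)                                          # azimuth
    phi = np.arccos(np.clip(y / np.where(r > 0, r, 1.0), -1.0, 1.0))  # zenith
    return np.column_stack([theta, phi])  # 13 joints -> 26 feature values
```

With the 13 retained joints this yields the 26-value feature vector (13 angle pairs) used for classification.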

Reducing the set of joints produces better classification performance (Manzi, 2017). Manzi proposed a feature vector based on 7 joints, specifically the head, neck, torso, hands, and feet, since this set of joints is the most discriminative for activity recognition, thus reducing computational complexity. In total, this yields a pose feature vector of 18 attributes through the calculation of 3(N − 1), where N is the number of joints, which in Manzi's work is 7. Manzi's work was evaluated using the Cornell Activity Dataset (CAD-60).

Table 3 : State-of-the-art precision and recall values (%) on the CAD-60 dataset.

Key Poses and Atomic Actions

According to studies by Shan (2014) and G. Zhu (2015), key poses and atomic actions are used for the segmentation of the video. The concept is that one human action is composed of a sequence of key poses and atomic actions. Shan (2014) defined key poses as special poses that have minimal kinetic energy and atomic actions as the movements between them. In the action of drinking water, for example, the key poses are the moments where the hand pauses along its upward and downward path, while the atomic actions are the movements that connect them.

Figure 8 : The "drinking water" action sample from the Cornell dataset; identified key poses are indicated by the dashed vertical lines (top). A set of the identified key poses is shown in the bottom half.

The study achieved an average accuracy of 84% on the MSR Action3D dataset. To extract quality features, multiple preprocessing steps are taken to handle the noisy skeleton data, since skeleton data can sometimes contain corrupted or heavily occluded poses.

Table 4 : A comparative summary of the performance of prior methods and Shan's (2014) method on the MSR Action3D dataset.

Hidden Markov Model

Hidden Markov Models (HMMs), which have recently had success in speech recognition, deal with time-sequential data and can provide time-scale invariance in recognition. One HMM has to be trained for each action type. Given a set of observation sequences, the parameters of an HMM can be estimated using the Expectation Maximization (EM) algorithm. Given an action sample A_j, which consists of a series of L observed poses o_1, o_2, …, o_L, we calculate the probability of the action A_j under the i-th action type T_i, that is, P(A_j | T_i). In general, an HMM is defined by three elements: the prior distribution over initial states π, the transition matrix M(S_{t+1} | S_t), and the emission matrix P(o_j | S_t), where S_t is the hidden state at time t. A Gaussian distribution is used to model the coordinate values. Hence, the goal of training the HMM is to learn all of these parameters.
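Classification then scores each sample against every per-action HMM and picks the most likely one, with P(A_j | T_i) evaluated by the forward algorithm. Below is our own minimal sketch for a discrete-observation HMM (the study itself models emissions with Gaussians rather than a discrete table):

```python
import numpy as np

def forward_likelihood(pi, M, E, obs):
    """P(o_1..o_L | model) for a discrete HMM via the forward algorithm.
    pi: initial state distribution, M[s, s']: transition matrix,
    E[s, o]: emission matrix, obs: observed symbol indices."""
    alpha = pi * E[:, obs[0]]          # joint prob. of first symbol and state
    for o in obs[1:]:
        alpha = (alpha @ M) * E[:, o]  # propagate one step, then emit
    return alpha.sum()

def classify(models, obs):
    """Pick the action type whose trained HMM assigns the highest likelihood."""
    scores = [forward_likelihood(pi, M, E, obs) for (pi, M, E) in models]
    return int(np.argmax(scores))
```

In practice the per-action parameters (π, M, emission model) would come from EM training on that action's observation sequences.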

Yamato (1994) used Hidden Markov Models to distinguish different tennis strokes. Yamato approached the problem with a feature-based, bottom-up approach using HMMs, which are characterized by their learning capability and time-scale invariance, and an algorithm based on a mesh feature vector sequence extracted from time-sequential images.

Figure 9 : Example of extracted tennis action and symbol sequence

The results of their experiment show that, to improve the recognition rate, the training data should be mixed, because each person performs the actions with some uniqueness. HMMs trained with the data of two subjects achieved a recognition rate of 70.8%.

Table 5 : Recognition rate using other subjects' training data and tested with those of other subjects

UCF Sports Actions Dataset

In 2008, Rodriguez et al. created the UCF sports actions dataset, consisting of ten different types of human actions, all collected from broadcast television channels such as ESPN and the BBC: diving, weight-lifting, horse-riding, running, skateboarding, golf-swinging, walking, kicking a ball, swinging on the pommel horse and on the floor, and swinging at the high bar. The dataset consists of 150 video samples at a resolution of 720 x 480 that show large intra-class variability. Rodriguez reported a 69.2% accuracy, which is the only known published result for the UCF dataset.

MSR-Action3D Dataset

Wanqing Li created this dataset during his stay at Microsoft Research Redmond. A depth camera was used to capture the dataset of depth sequences, known as MSR-Action3D. The dataset contains twenty actions: high arm wave, hammer, horizontal arm wave, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pick up & throw.

Taekwondo Moves Dataset

The dataset was acquired with the sensor capturing the athlete's body sideways. The taekwondo moves were suggested by the head coach. The seven taekwondo moves are front leg kick, back leg kick, front leg kick to the head, back leg kick to the head, front arm punch, back arm punch, and spinning kick. The dataset is composed of 70 files of 10 athletes performing each move for 10 seconds, totaling 11 minutes and 40 seconds of recording time, with 1 minute and 40 seconds for each move. Each recording was stored in .CSV files.

Figure 10 : The seven taekwondo moves

Cornell Activity Dataset (CAD-60)

The dataset focuses on realistic actions from daily life. It was collected with a depth camera and contains actions performed by four different human subjects: two males and two females. It contains 12 types of actions: "talking on the phone", "writing on whiteboard", "drinking water", "rinsing mouth with water", "brushing teeth", "wearing contact lenses", "talking on couch", "relaxing on couch", "cooking (chopping)", "cooking (stirring)", "opening pill container", and "working on computer". The dataset contains RGB, depth, and skeleton data, with 15 joints available. Each subject performs each activity twice, so one sample contains two occurrences of the same activity.

CHAPTER 3

METHODOLOGY

Theoretical Framework

The Kinect helps by extracting skeleton data from the subjects it tracks; two popular methods for extracting the data are the OpenNi SDK and the Microsoft SDK. The proponents will use the Microsoft SDK as it provides more advantages than OpenNi. Table 6 shows the advantages of using the Microsoft SDK over OpenNi.

Table 6 : Comparison of OpenNi SDK and Microsoft SDK

Skeleton data extracted from the Kinect are used to recognize the gesture or action of the user. Skeleton data can be noisy due to the sensor or to the user. Therefore, studies have proposed several algorithms that make the skeleton data invariant to factors such as sensor orientation and scale. Ahmed Taha (2014), S. Bianco (2012), and Shan & Akella (2014) all proposed methods that preprocess the skeleton data to be invariant to sensor orientation and to scale; their methodologies yielded average accuracies of 95.2% (MSR Action3D dataset), 97% (Dataset of Karate Moves), and 80% (MSR Action3D dataset), respectively.

Ahmed Taha's (2014) skeleton algorithm first rotates the skeleton data counterclockwise about the Y-axis by an angle that makes the skeleton face the sensor, then moves the origin from the Kinect to a human joint, and finally converts the skeleton joints from 3D Cartesian coordinates to spherical coordinates and normalizes the points.

Manzi (2017) proposed an approach to processing the data that makes it invariant to sensor orientation and to a person's specific size. The original reference frame is moved from the camera to the torso joint and then scaled with respect to the distance between the neck and torso joints. Manzi presented a feature vector based on 7 joints, focusing on the head, neck, torso, hands, and feet, which are the most discriminative for action recognition. The normalization step is as follows: for N joints in the skeleton data, the feature vector f is f = [j_1, j_2, …, j_{N−1}], where each j_i is the vector containing the coordinates of the i-th joint J_i. Thus j_i is defined as j_i = (J_i − J_0) / ‖J_1 − J_0‖, i = 1, 2, …, N − 1, where J_0 and J_1 are the coordinates of the torso and neck joints, respectively. The total number of attributes of the feature vector f is 3(N − 1).
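This normalization reduces to a few lines of numpy. The sketch below follows the formula's definitions; the function name and index arguments are ours, introduced only for illustration.

```python
import numpy as np

def manzi_feature_vector(joints, torso_idx=0, neck_idx=1):
    """f = [j_1 .. j_{N-1}] with j_i = (J_i - J_0) / ||J_1 - J_0||, where
    J_0 is the torso and J_1 the neck; returns 3*(N-1) attributes."""
    J0 = joints[torso_idx]
    scale = np.linalg.norm(joints[neck_idx] - J0)   # torso-neck distance
    rel = np.delete(joints, torso_idx, axis=0) - J0  # drop the torso itself
    return (rel / scale).ravel()
```

With Manzi's 7 joints this gives exactly the 18 attributes of 3(N − 1), and dividing by the torso-neck distance removes dependence on the subject's size.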

G. Zhu (2016) processes skeleton data with normalized relative orientation (NRO), which makes use of the dynamics of human joints: the NRO of one joint is computed relative to the joint it rotates around. The use of NRO makes the data insensitive to human height, limb length, and distance to the camera. After applying a simple moving average to smooth the data, a joint's NRO is calculated as follows. Let P_i = (x_i, y_i, z_i) represent the position of joint i in the coordinate system; the NRO of joint i relative to joint j is F_NRO = (P_i − P_j) / ‖P_i − P_j‖, where ‖·‖ is the Euclidean distance.

Figure 11 : An example of relative orientation of the skeleton consisting of 20 joints

A feature vector is a reduced representation of a large input: a set of features used to distinguish different activities, which is then fed as input to classification algorithms. The number of features affects both the recognition performance and the computational complexity of distinguishing the activities. Dimensionality reduction is a sub-stage that increases accuracy and reduces computational complexity, enabling real-time recognition. Ahmed Taha (2014) arrived at a feature vector composed of thirteen pairs of azimuth and zenith angles by converting the reduced set of skeleton points to spherical coordinates and normalizing them. The original twenty skeleton joints were reduced to thirteen on the premise that not all joints contribute to, or are important for, an action; the study therefore removed the torso-area joints, since the torso did not show motion independent of the whole body. Anjum (2014) also reduced the size of the feature vector by tracking the joint undergoing the most distinct motion during an activity, and used joint positions relative to another joint to account for the different positions a user may take in front of the camera. Shan & Akella (2014) approached the segmentation stage of human action recognition by identifying key poses using pose kinetic energy. G. Zhu (2015, 2016) also approached segmentation by

identifying key poses and atomic actions, which makes the approach robust to inter-class similarity and intra-class dissimilarity. Both studies use a similar formulation: a feature sequence is defined as S = (F_1, F_2, …, F_N), where N is the number of frames in the sequence and F_i is the feature vector of the i-th frame. The pose kinetic energy of the i-th frame is computed as E_p(i) = Σ_{j=1}^{L} ‖F_i^j − F_1^j‖², i.e., the sum over the L joints of the squared Euclidean distances between the i-th feature vector and the first feature vector. The kinetic energy change between two adjacent frames is then E_d(i) = E_p(i) − E_p(i−1). A threshold-based segmentation splits the feature sequence into static and dynamic segments: frames with |E_d(i)| < E_min are labeled static, while the others are dynamic; |·| is the absolute value operator and E_min is an empirical parameter. Finally, a key pose codebook and an atomic action codebook are constructed: key poses are extracted from the static segments by clustering algorithms, while the atomic actions between any two key poses are clustered from the associated dynamic segments.
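The threshold-based labeling step can be sketched as a few lines of numpy. This is a minimal sketch under our own assumptions (the flattened per-frame feature layout and function name are ours; the cited studies do not give an implementation):

```python
import numpy as np

def segment_static_frames(features, e_min):
    """Label each frame of a feature sequence S = (F_1, ..., F_N) as static
    or dynamic using pose kinetic energy.

    features: (N, D) array, one flattened D-dimensional feature vector per frame.
    Returns a boolean mask: True where |E_d(i)| < E_min (static frame).
    """
    features = np.asarray(features, dtype=float)
    # E_p(i): squared Euclidean distance of frame i from the first frame,
    # summed over the feature dimensions (i.e., over the joints).
    e_p = np.sum((features - features[0]) ** 2, axis=1)
    # E_d(i) = E_p(i) - E_p(i-1): energy change between adjacent frames.
    e_d = np.diff(e_p, prepend=e_p[0])
    return np.abs(e_d) < e_min
```

Runs of True frames would then form the static segments fed to the key pose clustering, and the gaps between them the dynamic segments fed to the atomic action clustering.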

The Hidden Markov Model (HMM) has had particular success in speech recognition and has been applied in many Computer Vision studies. HMMs are characterized by their learning ability: they provide time-scale invariance in recognition, which makes it possible to handle time-sequential data and to optimize the model automatically from the data. Variation in how each person performs an action affects the recognition rate of HMMs. To overcome this issue, the recognition rate can be improved by using mixed training data of the actions (Yamato, 1992). Since the test and training subjects are different, collecting more training patterns that represent the different actions well can further improve the recognition rate of HMMs.
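Scoring a symbol sequence under a trained discrete HMM is typically done with the forward algorithm; a minimal scaled-form sketch follows. The scaling recurrence itself is standard, but the function signature and parameter names are our own, and Baum-Welch (EM) training of pi, A, and B is not shown:

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm to avoid numeric underflow.

    obs: sequence of symbol indices; pi: (S,) initial state probabilities;
    A: (S, S) state transition matrix; B: (S, M) symbol emission matrix.
    """
    alpha = pi * B[:, obs[0]]          # forward variable at t = 0
    c = alpha.sum()                    # scaling factor c_0
    log_lik = np.log(c)
    alpha = alpha / c
    for t in range(1, len(obs)):
        alpha = (alpha @ A) * B[:, obs[t]]  # propagate, then emit
        c = alpha.sum()
        log_lik += np.log(c)           # accumulate log of scaling factors
        alpha = alpha / c
    return log_lik
```

With one trained model per action, recognition reduces to evaluating this score for each model and taking the maximum.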

Conceptual Framework

Figure 12 : The proposed method

The input of the study will be training sequences of the taekwondo moves, captured with the Kinect sensor as the primary data acquisition device and extracted using the Microsoft SDK, which provides 20 joints.

The next step is the processing of the raw 3D skeleton coordinates provided by the Microsoft SDK. The preprocessing is as follows: the origin of the data, originally at the sensor, is moved from the camera to the torso joint, and the joints are then scaled with respect to the distance between the neck and torso joints. Next, joints irrelevant to the action are eliminated, reducing the feature vector size. For the segmentation stage, key pose and atomic action codebooks are constructed from the training sequences. Finally, an HMM is used as the action classifier.

The output will be the correct recognition of the taekwondo kicks, specifically kicks with high rotation (back leg) and high similarity (back leg vs. back leg to head, and front leg vs. front leg to head).

CHAPTER 4

EXPERIMENTS

Locale of the Study

The data gathering preparation, as well as the data gathering itself, will be performed in a sports training center or gym under the guidance of a Taekwondo head coach. The coach will recommend kicks for the experiments, specifically kicks that are commonly used by the athletes in their daily training sessions.

Participants

The participants of the study will be athletes of the sport, guided by the Taekwondo head coach. Following de Souza (2016), 10 athletes of different sexes, ages, heights, weights, and belt colors will perform the recommended moves.

Data Gathering Tools

All data will be acquired through a single Microsoft Kinect. The Kinect is a device that provides its user with the following resources:

i) Depth map images with color gradients

ii) RGB images at 30 FPS

iii) Multiple skeleton tracking (at most 6 persons) in the same scene

iv) 3D coordinates of 20 joints, as shown in the figure

v) A precise microphone array

Skeleton tracking with the Microsoft SDK will allow the athlete's body to be tracked and the feature vector to be built.

Figure 13 : Recording setup

Data Gathering

The Kinect sensor will be positioned so that its camera and depth sensor capture the athlete's body from the side. The dataset will be assembled as follows: each participant will perform one move for 10 seconds, take a short break, then perform another move until all the moves have been performed. The end result will be a set of recorded files. The selected kicks are:

i. 45 degree kick (kick to the body)

ii. Front 45 degree kick

iii. Roundhouse kick (kick to the side of the head)

iv. Front Roundhouse kick

v. Axe kick (kick to the front of the face)

vi. Front Axe kick

vii. Turning side kick (kick to the body)

viii. Turning long kick (kick to the head)

Preprocessing of the skeleton data

The position of the skeleton joints are provided as Cartesian coordinates (X, Y, Z) with the

origin centered at the Kinect. The positive Y axis points up, the positive Z axis points where the

Kinect is pointing, and the positive X axis is to the left. To achieve invariance to sensor orientation

and person’s specific size. The original reference frame is moved from the camera to the torso

joint and then scaled with respect to the distance of the neck and torso joint. The normalization

step is of follows: the N number of joints in the skeleton data will be the feature vector f, where

𝑓 = [𝑗1 , 𝑗2 , … , 𝑗𝑁−1 ], wherein each 𝑗𝑖 is the vector containing the coordinates of the 𝑖 𝑡ℎ joint 𝐽𝑖 .

𝐽 −𝐽
Thus 𝑗𝑖 is defined as 𝑗𝑖 = ‖𝐽 𝑖 −𝐽0 ‖ , 𝑖 = 1,2, … , 𝑁 − 1, where 𝐽0 and 𝐽1 are the coordinates of the
1 0

torso and neck joint, respectively. The reduction of the set of joints will be done through the

calculation of 3(𝑁 − 1), wherein N is the number of joints.
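The torso-centered normalization above can be sketched as follows. The joint indices for torso and neck are assumptions for illustration; the Microsoft SDK's actual joint ordering may differ:

```python
import numpy as np

TORSO, NECK = 0, 1  # assumed joint indices; not the SDK's actual enumeration

def normalize_skeleton(joints):
    """Move the origin from the Kinect to the torso joint, scale by the
    neck-torso distance, and flatten to a 3*(N-1)-dimensional feature vector.

    joints: (N, 3) array of Cartesian joint positions from the sensor.
    """
    joints = np.asarray(joints, dtype=float)
    torso, neck = joints[TORSO], joints[NECK]
    scale = np.linalg.norm(neck - torso)      # ||J_1 - J_0||
    normalized = (joints - torso) / scale     # j_i = (J_i - J_0) / ||J_1 - J_0||
    # J_0 becomes the origin and carries no information, so drop it,
    # leaving N-1 joints and hence 3*(N-1) components.
    return np.delete(normalized, TORSO, axis=0).ravel()
```

For N = 20 joints this yields a 57-dimensional vector before the irrelevant-joint elimination step shrinks it further.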

For the segmentation process, a feature sequence is defined as S = (F_1, F_2, …, F_N), where N is the number of frames in the sequence and F_i is the feature vector of the i-th frame. The pose kinetic energy of the i-th frame is computed as E_p(i) = Σ_{j=1}^{L} ‖F_i^j − F_1^j‖², i.e., the sum over the L joints of the squared Euclidean distances between the i-th feature vector and the first feature vector. The kinetic energy change between two adjacent frames is then E_d(i) = E_p(i) − E_p(i−1). A threshold-based segmentation splits the feature sequence into static and dynamic segments: frames with |E_d(i)| < E_min are labeled static, while the others are dynamic; |·| is the absolute value operator and E_min is an empirical parameter. Finally, a key pose codebook and an atomic action codebook are constructed: key poses are extracted from the static segments by clustering algorithms, while the atomic actions between any two key poses are clustered from the associated dynamic segments.

The Hidden Markov Model is characterized by its learning ability, providing time-scale invariance in recognition. For each taekwondo kick, an HMM has to be trained so that it generates the symbol pattern of its corresponding move. Given a set of observation sequences, the parameters of the HMMs can be estimated using the Expectation Maximization (EM) algorithm. Given an action sample A_j consisting of a series of L observed poses o_1, o_2, …, o_L, the probability of the action A_j under the i-th action type T_i, that is P(A_j | T_i), is calculated.

Actions are recognized by vector-quantizing each frame's feature vector to obtain a symbol sequence. The recognition result is the action whose HMM maximizes P(A_j | T_i).
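The quantization and maximum-likelihood decision can be sketched as below. The codebook layout, model container, and scoring callback are our own hypothetical names; the scoring function would be, for example, a forward-algorithm log-likelihood:

```python
import numpy as np

def quantize(frames, codebook):
    """Vector-quantize each frame's feature vector to the index of the
    nearest codebook entry, producing the symbol sequence for the HMMs."""
    frames = np.asarray(frames, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Pairwise Euclidean distances: shape (num_frames, num_codewords).
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

def classify(symbols, models, score):
    """Return the action whose HMM maximizes the sequence likelihood.
    `models` maps action name -> HMM parameters; `score` is a scoring
    function such as a forward-algorithm log-likelihood (assumption)."""
    return max(models, key=lambda name: score(symbols, models[name]))
```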

The proponents will use leave-one-out cross-validation and a confusion matrix to evaluate the recognition performance of the model. In leave-one-out cross-validation, 9 of the 10 participants will be used to train the classifier while the remaining participant will be used for testing. A confusion matrix describes the performance of a classifier evaluated on a set of test data for which the true values are known. The aim of the first experiment is to assess the effect of shifting the origin from the Kinect's coordinate frame to a point on the subject's body. The second experiment aims to determine the minimum set of joints required to recognize the actions.
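The evaluation protocol can be sketched as follows (a minimal sketch; the subject identifiers and integer label encoding are assumptions):

```python
import numpy as np

def leave_one_subject_out(subjects):
    """Yield (train, test) splits: each subject is held out exactly once
    while the remaining subjects form the training set."""
    for held_out in subjects:
        yield [s for s in subjects if s != held_out], held_out

def confusion_matrix(true_labels, pred_labels, num_classes):
    """Rows are true kick classes, columns are predicted kick classes;
    the diagonal counts correct recognitions."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    return cm
```

For 10 participants this produces 10 train/test splits; aggregating the per-split predictions into one confusion matrix then shows which kicks are confused with one another.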

REFERENCES

Ahmed Taha, H. H.-S.-H. (2014, July). Human Action Recognition based on MSVM and Depth Images. The International Journal of Computer Science Issues, 11(4), 42-51.

Alzahrani, M., & Kammoun, S. (2016, May). Human Activity Recognition: Challenges and Process Stages. International Journal of Innovative Research in Computer and Communication Engineering, 4(5).

Bridge, C. J. (2009). Physiological responses and perceived exertion during international taekwondo competition. Int J Sports Physiol Perform, 485-493.

Bridge, C., Jones, M., & Drust, B. (2011). The Activity Profile in International Taekwondo Competition Is Modulated by Weight Category. International Journal of Sports Physiology and Performance, 344-357.

de Souza Vicente, C. M. (2016). High performance moves recognition and sequence segmentation based on key poses filtering. IEEE Winter Conference on Applications of Computer Vision (WACV), 1-8.

G. Zhu, L. Z. (2016). Human Action Recognition using Multi-layer Codebooks of Key Poses and Atomic Motions. Signal Processing: Image Communication.

Gahlot, A., Agarwal, P., Agarwal, A., Singh, V., & Gautam, A. K. (2016). Skeleton based Human Action Recognition using Kinect. IJCA Proceedings on Recent Trends in Future Prospective in Engineering and Management Technology.

Kasiri-Bidhendi, S. (2015). Combat sports analytics: Boxing punch classification using overhead depth imagery. IEEE ICIP.

L. Miranda, T. V. (2012). Real-time gesture recognition from depth data through key poses learning and decision forests. 25th SIBGRAPI Conference on Graphics, Patterns and Images, 268-275.

Manzi, A. D. (2017). A Human Activity Recognition System Based on Dynamic Clustering of Skeleton Data. Sensors.

Reily, B. J. (2016). Human activity recognition and gymnastics analysis through depth imagery.

S. Bianco, F. T. (2012). Karate moves recognition from skeletal motion. 3D Image Processing and Applications.

Shan, J., & Akella, S. (2014, September). 3D human action segmentation and recognition using pose kinetic energy. Advanced Robotics and its Social Impacts (ARSO), 69-75.

Tejero-de-Pablos, A. N. (2016, July). Human action recognition-based video summarization for RGB-D personal sports video. Multimedia and Expo (ICME), 1-6.

Vennila Megavannan, B. A. (2013). Human Action Recognition using Depth Maps. Proceedings of the International Conference on Signal Processing and Communications (SPCOM), 1-5.
