Action Recognition


ACTION RECOGNITION BASED ON MULTI-LEVEL

REPRESENTATION OF 3D SHAPE

by

Binu M Nair

Cochin University Of Science and Technology

Old Dominion University in Partial Fulfillment of the

Requirement for the Degree of

MASTER OF SCIENCE

ELECTRICAL AND COMPUTER ENGINEERING

OLD DOMINION UNIVERSITY

August 2010

Approved by:

Jiang Li (Member)

ABSTRACT

ACTION RECOGNITION BASED ON MULTI-LEVEL

REPRESENTATION OF 3D SHAPE

Binu M Nair

Old Dominion University, 2010

Director: Dr. Vijayan K. Asari

A novel algorithm is proposed in this thesis for recognizing human actions using

a combination of two shape descriptors, one of which is a 3D Euclidean distance

transform and the other based on the Radon transform. This combination captures

the necessary variations from the space time shape for recognizing actions. The space

time shapes are created by the concatenation of human body silhouettes across time.

The comparisons are done against some common shape descriptors such as the Zernike
moments and Radon transform. This is also compared with an algorithm which uses
the same concept of a space time shape and uses another shape descriptor based
on the Poisson equation. The proposed algorithm uses a 3D Euclidean distance
transform to represent the space time shape and this shape descriptor in comparison
to the Poisson equation based shape descriptor is less complex. By taking the

gradient of this distance transform, the space time shape can be divided into different

levels with each level representing a coarser version of itself. Then, at each level,

specific features such as the R-Transform feature set and the R-Translation vector

set are extracted and concatenated to form the action features. These action features

extracted from a space time shape of a test sequence are compared with the action

features of space time shapes of the training sequences using the minimum Euclidean

distance metric and they are classified using the nearest neighbour approach. The

algorithm is tested on the Weizmann action database which consists of 90 video

work is being done to improve the recognition accuracy by extracting features which

are more localized and classifying them using a more sophisticated technique.

© Copyright, 2010, by Binu M Nair, All Rights Reserved


ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Vijayan K. Asari for all his support and

constant guidance for my thesis as well as for my coursework and for providing me an

excellent opportunity to work in the Vision Lab. It has been a wonderful experience

during the last two years and I feel that I have reached a new level of technical

expertise by just being in the Vision Lab. For that, I have a lot of gratitude to my

advisor.

I wish to thank Dr. Jiang Li for his support not only for my thesis but also for

the machine learning course from which I have gained a lot of information regarding

the theoretical concepts of various classifiers and their implementation, which in

turn, helped me in my thesis work. I also wish to thank Dr. Frederic D. McKenzie

for being my committee member and helping me to finalize my thesis work. I also

wish to thank Dr. Zia-ur Rahman for teaching the image processing course where I

learned the basics and the implementation of the image processing algorithms in C.

This helped me a great deal in my thesis.

I want to thank my parents for their continuous encouragement, especially my

father who supported me not only in financial matters but also in harsh times which

allowed me to focus on my research work. I wish to thank all of my labmates for

their support. Finally, I wish to thank my good friend Ann Mary for continuously

pushing me to finish the thesis in time and motivating me to present the thesis with

full confidence.

Once again, thank you all for giving me the encouragement, motivation, and

support.


TABLE OF CONTENTS

List of Tables
List of Figures

CHAPTERS

1 Introduction
2 Literature Survey
2.1 Action Recognition Algorithms Based on Motion Capture Fields/Trajectories
2.3 Summary
3.1.6 R-Transform
3.4 Summary
4.2.2 Segmentation of a 3D Shape
4.5 Summary
5 Experimental Results
5.3 Summary

BIBLIOGRAPHY
VITA

LIST OF TABLES

... Poisson Equation Shape Descriptor.

LIST OF FIGURES

3.5 R-transform of a Silhouette.
3.6 Properties of R-Transform.
4.2 Silhouette Extraction.
4.5 ... with Various Aspect Ratios.
4.6 ... with Various Aspect Ratios.
4.7 ... Jumping Jack Action.
5.1 Bar Graph Showing the Overall Accuracy Obtained with the Proposed Algorithm for Different Lengths of the Space Time Shape. The Overlap is just Half the Length of the Space Time Shape.

CHAPTER 1

INTRODUCTION

In recent years, human action recognition has been a widely researched area in computer vision as it holds a lot of applications relating to security and surveillance.

One such application is the human action detection and recognition algorithm implemented in systems using CCTV cameras where the system can discriminate suspicious actions from normal human behavior and alert authorities accordingly [1].

Another area is object recognition where, by evaluating the human interaction with
the object and recognizing the human action, the object with which the interaction
takes place is recognized [2]. Thus arises the need for faster and more robust algorithms
for human action recognition. Action recognition involves extraction of features
which represent the variations caused by the action with respect to time, and
these variations can be extracted by using basically two types of approaches. One
approach involves the use of 2D or 3D shape descriptors which give an internal representation of the space time shape and thereby facilitate the extraction of suitable

action features. The other approach is by using motion fields such as motion history

images and optical flow vectors and by using trajectories of specific body parts. The

various invariant properties extracted from these motion fields and trajectories are

considered action feature vectors. In the algorithm presented in this thesis, a combination of two different types of shape descriptors is used for the representation of

a space time shape for feature extraction.

An overview of the algorithm is shown as a block diagram in Figure 1.1. The
steps of the algorithm are:

- Extraction of the human silhouette from a video sequence by foreground segmentation.
- Formation of the space time shape by concatenating a predefined number of silhouettes and applying the 3D Euclidean distance transform.
- Using the gradient of the distance transform to segment the space time shape into multiple levels.
- Extracting the R-Transform feature at every level and at every frame of the space time shape and the R-Translation vector at the coarsest level at every frame.
- Concatenating the R-Transform feature set and the R-Translation vector set to form the action features.
- Comparing these features using the minimum Euclidean distance metric and classifying them using the nearest-neighbor approach.

This thesis follows IEEE journal format.

To extract a silhouette of a human body in a video, foreground segmentation must

be performed on every frame of a video sequence by learning the background model.

Since the video sequences in the database used in the testing of the algorithm contain

a static background with uniform lighting, a simple background model based on the

median of the pixels of the frames is sufficient. The median background model is

easier to implement and less complex. This is preferred over the mean background

due to the fact that, unlike the mean, the median statistic of the pixels across the

frames is not affected by sudden changes in the pixel values. More details on the

background segmentation will be given in the coming chapters.
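As an illustration of the median background model described above, the following minimal sketch estimates the background as the per-pixel median over the frames and marks as foreground every pixel that differs from it by more than a threshold. The function names, the nested-list image layout and the threshold value of 30 are assumptions made for this example only; they are not taken from the thesis implementation.

```python
from statistics import median

def median_background(frames):
    """Per-pixel median over a list of equally sized grayscale frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(cols)]
            for r in range(rows)]

def silhouette(frame, background, threshold=30):
    """Binary mask: 1 where the frame differs from the background by more
    than the (assumed) threshold, 0 elsewhere."""
    return [[1 if abs(p - b) > threshold else 0
             for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]

# Toy 1x3 "video": a bright object (200) passes over a dark background (10).
frames = [
    [[200, 10, 10]],
    [[10, 200, 10]],
    [[10, 10, 200]],
    [[10, 10, 10]],
    [[10, 10, 10]],
]
background = median_background(frames)
mask = silhouette(frames[1], background)
```

In this toy sequence the passing object occupies each pixel in only one of five frames, so the median recovers the true background value at every pixel, whereas a mean would be pulled towards the object intensity.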

Next, the silhouettes obtained from the video sequence must be concatenated

across a predetermined number of frames so as to form space time shapes. Each

video sequence in the database will contain a certain number of space time shapes

with each one having a certain overlap with the previous one. Once the space time

shape is obtained, a 3-D shape descriptor should be used. The purpose of the 3D


shape descriptor in the proposed algorithm is to give an internal representation of the

space time shape by segmenting it into different levels with each level representing

its coarseness. The 3D shape descriptor based on the Euclidean distance transform

is selected for this type of representation. By using the gradient of the distance

transform, the space time shape is divided into multiple levels.
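The Euclidean distance transform itself can be sketched with a brute-force computation on a small binary volume, where every foreground voxel receives its distance to the nearest background voxel. This is only a sketch under stated assumptions: the volume is indexed as volume[t][y][x], the function name is hypothetical, and the exhaustive search stands in for whatever efficient distance transform algorithm is used in practice. The gradient-based segmentation into levels is not reproduced here.

```python
import math

def distance_transform_3d(volume):
    """Brute-force Euclidean distance transform of a binary volume[t][y][x].

    Foreground voxels (value 1) receive the distance to the nearest
    background voxel (value 0); background voxels receive 0.
    Assumes the volume contains at least one background voxel.
    """
    background = [(t, y, x)
                  for t, frame in enumerate(volume)
                  for y, row in enumerate(frame)
                  for x, value in enumerate(row) if not value]
    return [[[min(math.dist((t, y, x), b) for b in background) if value else 0.0
              for x, value in enumerate(row)]
             for y, row in enumerate(frame)]
            for t, frame in enumerate(volume)]

# Toy space time shape: one frame, one row, a silhouette three pixels wide.
volume = [[[0, 1, 1, 1, 0]]]
dt = distance_transform_3d(volume)
```

The values are largest in the interior of the shape and fall to zero at its boundary, which is the property that allows the shape to be separated into progressively coarser levels.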

Once the multi-level representation of the space time shape is obtained, each
level is analyzed separately and suitable features are extracted. In this algorithm, the

analysis at each level is done by using another shape descriptor. This shape descriptor

is applied to a 2D human silhouette in a frame of the space time shape at every level

and the variation in the properties extracted from this shape descriptor across the

frames is further analyzed to give the action features. The shape descriptor known as
the R-Transform is used to capture the variations of the 2D shape in a frame
[22]. This R-Transform is in fact derived from the 2D Radon transform by integrating
it over one of the variables [18]. It can be made scale-invariant by using a suitable
scaling factor. The R-Transform possesses the other two properties of translation and
rotation variance which again makes it much more suited to bring out the variations
in a space time shape. So, a collection of R-Transforms is obtained at every level of
the space time shape. This collection is called the R-Transform feature set. Another

set of features can be extracted from the 2D Radon transform by integrating it over

the other variable. This set of features, known as R-Translation vector set, can be

used to determine how much the 2D shape has moved from its initial position within

a space time shape.
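The derivation of the R-Transform from the Radon transform can be sketched on a binary silhouette as follows. The sketch assumes the commonly cited form in which the squared Radon transform is summed over the radial variable for each angle; the precise definition used in this thesis is given in Chapter 3 and may differ. The nearest-bin projection and the function names are likewise assumptions made for illustration.

```python
import math

def radon_projection(silhouette, theta):
    """Discrete Radon projection of a binary image at angle theta (radians).

    Each foreground pixel (x, y) votes for the bin
    rho = x*cos(theta) + y*sin(theta), rounded to the nearest integer
    (a crude nearest-bin approximation of the line integral).
    """
    bins = {}
    for y, row in enumerate(silhouette):
        for x, value in enumerate(row):
            if value:
                rho = round(x * math.cos(theta) + y * math.sin(theta))
                bins[rho] = bins.get(rho, 0) + 1
    return bins

def r_transform(silhouette, angles):
    """Assumed form: R(theta) = sum over rho of T(rho, theta)^2."""
    return [sum(count ** 2 for count in radon_projection(silhouette, theta).values())
            for theta in angles]

# A 3-pixel-wide, 2-pixel-high rectangle, projected at 0 and 90 degrees.
rect = [[1, 1, 1],
        [1, 1, 1]]
r_values = r_transform(rect, [0.0, math.pi / 2])
```

For the rectangle above the projection at 0 degrees has three bins of two pixels each and the projection at 90 degrees has two bins of three pixels each, so the two values differ and reflect the aspect ratio of the shape.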

A complete action feature representation is formed by concatenating the R-Transform feature set and the R-Translation vector set, and this concatenated set of
features represents the particular action of an individual in the space time shape.
These action features extracted from test space time shapes containing the same

type of action are compared with the action features extracted from the training set


of space time shapes using the Euclidean distance metric and they are categorized

using the nearest neighbor classifier [27]. The evaluation of the algorithm is done

by finding the recognition rate for each action type for different lengths of the space

time shape. Moreover, this algorithm is compared with methods which
use only a single shape descriptor for shape representation. Further research work

is being done to localize these features of the space time shapes so as to improve the

recognition rate.
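The classification step described above can be sketched as a nearest neighbor search under the Euclidean distance. The feature vectors and action labels below are made-up toy values, and the function names are assumptions; real action features would be the concatenated R-Transform and R-Translation sets.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor(test_feature, training_set):
    """training_set is a list of (feature_vector, label) pairs.
    Returns the label of the training vector closest to test_feature."""
    _, label = min(training_set,
                   key=lambda item: euclidean(test_feature, item[0]))
    return label

# Hypothetical three-dimensional action features with their labels.
training = [
    ([0.0, 0.1, 0.9], "walk"),
    ([0.8, 0.7, 0.1], "run"),
    ([0.4, 0.9, 0.5], "jump"),
]
predicted = nearest_neighbor([0.7, 0.6, 0.2], training)
```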

The specific objectives of the proposed algorithm are:

- To extract the silhouettes at every frame of all the video sequences in the database by background segmentation with suitable thresholding and suitable morphological operations.
- To organize the silhouette frames of every video sequence in the database into space time shapes of a fixed length with a predefined overlap.
- To apply the 3D Euclidean distance transform at every space time cube of every video sequence and take its gradient in order to segment it into a predefined number of levels.
- To apply the 2D Radon transform on every silhouette frame of the space time shape at every level.
- To extract the R-Transform feature set as well as the R-Translation vector set and concatenate them to form the action feature set.
- To store these action features in an action feature database and use this database to create test and training sets based on a variant of the leave-one-out procedure.
- To classify the test action features by the nearest neighbor approach, evaluate the algorithm by determining the recognition rate for each action and compare the results with a known algorithm.
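The organization of silhouette frames into overlapping space time shapes, listed among the objectives above, can be sketched as a sliding window over frame indices. The function name and the example values (30 frames, length 10, overlap 5) are assumptions chosen for illustration.

```python
def space_time_windows(num_frames, length, overlap):
    """Return (start, end) frame index pairs, end exclusive, for space time
    shapes of a fixed length in which consecutive shapes share `overlap`
    frames. Trailing frames that do not fill a whole window are dropped."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    step = length - overlap
    return [(start, start + length)
            for start in range(0, num_frames - length + 1, step)]

# A 30-frame sequence cut into shapes of 10 frames with an overlap of 5.
windows = space_time_windows(num_frames=30, length=10, overlap=5)
```

With an overlap of half the length, every frame except those at the two ends of the sequence contributes to two space time shapes.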

The thesis is organised as follows:

Chapter 2 gives a brief overview of the current algorithms that have been proposed over the last few years for action recognition and looks closely at the analytical

tools or transforms used. First, the algorithms based on motion fields and trajectories are discussed. Then, the various shape descriptors and the action recognition

algorithms based on these are explained briefly. Finally, this Chapter will introduce

the shape descriptor on which the proposed algorithm is based.

Chapter 3 describes in detail the shape descriptor based on the Radon transform

and its purpose in the action recognition framework. The 2D Radon transform is first

explained mathematically along with the geometrical interpretation and illustrations.

Then, the actual shape descriptor known as the R-Transform and its properties are

explained. Finally, the concept of a multi-level representation of a 2D shape using

the R-Transform and the Chamfer distance transform is discussed.

Chapter 4 describes the proposed algorithm which is based on a combination of two shape descriptors, namely the Euclidean distance transform and the

R-Transform. It explains in detail how the gradient of the distance transform is used

in segmenting a space time shape into multiple levels with suitable illustrations and
describes how the action features are extracted from each level. Finally, the classifier

used in the algorithm is briefly described.

Chapter 5 gives the experimental results which are obtained by simulation of

the algorithm on a known database and compares these results with other existing

algorithms. The advantages and disadvantages of using the proposed algorithm are

discussed and the results are analysed to find the reason behind some of the low and

high recognition rates achieved.

Chapter 6 gives a complete summary of the proposed algorithm and states the

conclusions made by the analysis of the results. It also suggests improvements that

can be made to extract much more robust features required for action recognition.

Future work is also discussed in this Chapter which involves better preprocessing

stages for better feature extraction.

CHAPTER 2

LITERATURE SURVEY

Some of the earlier works in action recognition were based on capturing the motion

features by computing the optical flow or motion history images at each frame and

then using the variation in these motion features for action recognition. Some others

required tracking of certain body points such as the limbs, torso and head across the

frames of the video sequence. The trajectories of these points are analyzed and the

properties extracted from these trajectories are then used as action features. However, the algorithm presented in this thesis is based on frameworks which capture

human silhouette variations across the video sequence using some shape descriptors.

Both 2D as well as 3D shape descriptors can be used. In the former, the shape

descriptor is applied to each frame of the video sequence and the variations in the

silhouette shape description occurring across the video sequence are considered action features. The latter case involves extraction of properties of the 3D shape after

applying a suitable 3D shape descriptor. The algorithm presented in this thesis is

a combination of both approaches. This Chapter can be divided into two main sections. The first half of the Chapter reviews the work done in the action recognition

field which directly computes the motion fields. The second half discusses algorithms

that capture the motion variations using shape descriptors.

2.1 ACTION RECOGNITION ALGORITHMS BASED ON MOTION CAPTURE FIELDS/TRAJECTORIES

In this section, various action recognition frameworks are discussed where some use

motion flow fields such as optical flow, motion history images, and trajectories of specific silhouette points. The advantages and disadvantages of using these algorithms


are also discussed.

2.1.1

Some of the work in human action or movement recognition has been done using

temporal templates which give the motion characteristics at every spatial location of

a frame in an image sequence. The motion characteristics at a pixel of the current

frame depend on the motion characteristics of the corresponding pixel at previous
frames. In [3], the temporal template extracted from a frame is an image where
each pixel is a vector, with one component obtained from the
binary motion energy image (MEI) and the other component from a motion history
image (MHI). The motion energy image gives the location of the occurrence of motion

in the image while the motion history image has pixel values corresponding to the

recency of the motion. The binary motion energy image E (x, y, t) is formed by

accumulating the silhouettes for a specific number of frames and can be defined as

$$E(x, y, t) = \bigcup_{i=0}^{\tau-1} D(x, y, t-i) \tag{2-1}$$

where D(x, y, t) is a binary image representing the masked human silhouette at time
t and $\tau$ is the number of frames over which the silhouettes are accumulated. The motion energy images can be used for view invariant feature extraction by including the motion energy images computed at
different viewpoints for the same action in the training data. For the motion history
image H(x, y, t), the pixel intensity gives the temporal history of motion at that
pixel and it is defined as

$$H(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\bigl(0,\; H(x, y, t-1) - 1\bigr) & \text{otherwise} \end{cases} \tag{2-2}$$

A pixel value in a motion history image gives the recency of motion at that pixel. In

other words, the brighter the pixel, the more recent the motion is at the corresponding location. The computation of the vector template image involves computing the

motion history image first and then thresholding it to get the motion energy image.

Statistical matching using Hu moments is applied to the vector templates and the

Mahalanobis distance measure is used to compare and classify these moment based

features.
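The relationship between the two templates can be sketched with a frame-by-frame update. The update rule used here (a pixel is set to a maximum value tau when motion is present and otherwise decays by one towards zero) is a commonly used formulation of the motion history image and is an assumption of this sketch, as are the function names. The motion energy image is then obtained by thresholding the motion history image above zero, as described in the text.

```python
def update_mhi(prev_mhi, motion_mask, tau):
    """One motion history update: tau where motion occurs in the current
    frame, otherwise the previous value decayed by 1 (never below 0)."""
    return [[tau if d else max(0, h - 1) for h, d in zip(h_row, d_row)]
            for h_row, d_row in zip(prev_mhi, motion_mask)]

def mei_from_mhi(mhi):
    """Binary motion energy image obtained by thresholding the MHI."""
    return [[1 if h > 0 else 0 for h in row] for row in mhi]

# Toy 1x3 sequence: motion sweeps from the left pixel to the right pixel.
tau = 3
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
mhi = [[0, 0, 0]]
for mask in masks:
    mhi = update_mhi(mhi, mask, tau)
mei = mei_from_mhi(mhi)
```

After the three frames the rightmost pixel, where motion occurred last, holds the largest value, while the motion energy image marks all three pixels as having moved.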

Another variation of the motion history image termed as the timed MHI or timed

motion history image is used for silhouette pose estimation and for segmentation of

moving parts [4]. The difference between the timed MHI and the previously defined MHI
is that the pixel values are stored according to the current timestamp, which is in
floating point format. By taking the gradient of the timed MHI, optical flow vectors
can be extracted. These vectors provide the local orientation at each pixel. Features
such as the radial histogram and the global orientation can also be extracted from

this optical flow representation and these along with the segmented body parts can

be used in the moment-based statistical matching for motion classification purposes.

A hierarchical MHI representation according to the speed of motion can also be
derived from the MHI computed at the current frame in order to compute local motion fields [6]. The MHI pyramid is computed from the image pyramid where each
frame is subsampled successively to a certain number of levels. The highest level in
the representation has the smallest image size, i.e. the coarsest image in the pyramid.

The idea behind the image pyramid is that each level represents a particular range

of the speed of motion. In other words, faster motion displacements in the lower

levels of the pyramid are represented by smaller displacements in the higher levels.

Thus, from each pyramid level, an MHI is computed from which the motion fields

are extracted. These motion fields from each level are then resampled to the original

size of the image and combined together to form the final motion field. Many types


of features can be extracted from this motion field such as the polar orientation histograms which are linear invariant in nature. These histograms are the final action

feature vectors and the classification is done by directly comparing the test histogram

with the ones in the database using the Euclidean distance measure.

One drawback of motion history images is that they are not very robust to

partial occlusions. The motion history images computed from a partially occluded

silhouette frame will be different from the ones computed from non-occluded silhouettes. These may give rise to distorted motion fields which may result in inaccurate

classification, especially when statistical matching with Hu moment features is used, since these features are very sensitive to changes in shape.

2.1.2

A motion descriptor based on optical flow measurements has been used to describe

an action of an individual at a far off distance from the camera [7]. Here, a figure-centric spatio-temporal volume is extracted where the individual at each frame is
centered to stabilize the motion and to discard motions due to a shaky camera.
Then, the optical flow vector at each frame is computed using the Lucas-Kanade
algorithm. The optical flow vector field $F_{xy}$ is separated into two different components,
namely the horizontal component $F_x$ and the vertical component $F_y$. Each of these components is
then half-wave rectified to get two more components, thereby giving a total of four
components, namely $F_x^-$, $F_x^+$, $F_y^-$ and $F_y^+$. But since these four components are
noisy measurements, they are smoothed using a Gaussian kernel and normalized to
get the four components $\hat{F}_x^-$, $\hat{F}_x^+$, $\hat{F}_y^-$ and $\hat{F}_y^+$ at each frame. The set of these four

vectors at each frame across the figure-centric spatio-temporal volume is considered

as a motion descriptor. In other words, the variation in the four components of

the optical vector field across the frames of the video sequence gives rise to action

features suitable for recognition. For comparison of these action features, a similarity


measure based on normalized correlation is used. Here, to test two video sequences

A and B, a similarity measure between motion descriptor of A centered at frame i

and that of B at frame j is computed by

$$S(i, j) = \sum_{t \in T} \sum_{c=1}^{4} \sum_{(x,y) \in I} a_c^{i+t}(x, y)\, b_c^{j+t}(x, y) \tag{2-3}$$

where T and I are the temporal and spatial extents of the motion descriptors $a_c$ and $b_c$
and c refers to any one of the four components. Therefore, from every frame of video

sequences A and B, a similarity measure is obtained which can be organized as a

matrix known as the motion-to-motion similarity matrix S. Then, for classifying the

action present in a frame of a test sequence, the k-nearest neighbor rule is applied to

the motion descriptors and the action in the frame is appropriately labelled. Similar

to the case of MHI, this approach is not so robust to partial occlusions as the optical
flow motion fields get distorted to a greater degree thereby increasing the misclassification rate. On the other hand, this approach is fast and can be implemented in

real time as the action is recognized at every frame of a video sequence even at very

low resolutions.
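The half-wave rectification and the frame-to-frame similarity measure can be sketched as follows. The Gaussian smoothing and normalization steps are omitted, the temporal extent is modelled as a symmetric window of `half_window` frames on either side, and all function names are assumptions made for this example rather than details taken from [7].

```python
def half_wave_rectify(component):
    """Split a flow component into non-negative positive and negative parts."""
    positive = [[max(v, 0.0) for v in row] for row in component]
    negative = [[max(-v, 0.0) for v in row] for row in component]
    return positive, negative

def frame_channels(fx, fy):
    """Four non-negative channels (Fx+, Fx-, Fy+, Fy-) for one frame."""
    fx_pos, fx_neg = half_wave_rectify(fx)
    fy_pos, fy_neg = half_wave_rectify(fy)
    return [fx_pos, fx_neg, fy_pos, fy_neg]

def similarity(seq_a, seq_b, i, j, half_window):
    """Sum over temporal offsets, channels and pixels of the products of
    corresponding channel values of sequence A around frame i and
    sequence B around frame j."""
    total = 0.0
    for t in range(-half_window, half_window + 1):
        if not (0 <= i + t < len(seq_a) and 0 <= j + t < len(seq_b)):
            continue  # skip offsets that fall outside either sequence
        for chan_a, chan_b in zip(seq_a[i + t], seq_b[j + t]):
            for row_a, row_b in zip(chan_a, chan_b):
                total += sum(a * b for a, b in zip(row_a, row_b))
    return total

# Toy single-frame sequence with a 1x2 flow field.
channels = frame_channels(fx=[[1.0, -2.0]], fy=[[0.0, 3.0]])
sequence = [channels]
score = similarity(sequence, sequence, i=0, j=0, half_window=0)
```

Because every channel is non-negative after rectification, opposite flow directions fall into different channels and contribute nothing to the score, rather than cancelling each other.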

2.1.3

Spatio-temporal features can be extracted from the video sequences containing different types of actions and these features can be considered like video words in a

codebook [8]. By interpreting each video sequence in the training set as a set of

video words, a model for each action category can be learned. The features extracted from the video sequence are space time interest points in a space time shape.

They are obtained by finding the gradient or optical flow from the region of interest

in each frame of the video sequence. The regions of interest are actually the local

maxima regions of the response function R computed at every frame and this is given by

$$R = (I * g * h_{ev})^2 + (I * g * h_{od})^2 \tag{2-4}$$

where $*$ denotes convolution, $g(x, y; \sigma)$ is the 2D Gaussian filter applied in the xy domain and $(h_{ev}, h_{od})$ are
the quadrature pair of Gabor filters applied temporally. From each of the space time

regions where these regions are in the form of spatio-temporal cubes centered at the

interest points, descriptors such as the brightness gradient or the windowed optical

flow field are computed and all of the computed descriptors are concatenated to form

a large feature vector. These vectors are then projected into the lower dimension

using principal component analysis and the lower dimensional representation of the

feature vector is considered as a video word. Therefore, each action category is

modelled probabilistically by using the video words as the input data. By noting the

number of occurrences $n(w_i, d_j)$ of a word $w_i$ from the codebook in a video sequence $d_j$,
the joint probability $P(w_i, d_j)$ can be computed as $P(w_i, d_j) = P(d_j) P(w_i \mid d_j)$. From
this joint probability density, the conditional probability density can be computed as

$$P(w_i \mid d_j) = \sum_{k=1}^{K} P(z_k \mid d_j)\, P(w_i \mid z_k)$$

where $z_k$ refers to an action category and K is the number of action categories. The estimation of
$P(w_i \mid z_k)$, which gives the probability of the video word $w_i$ belonging to a category $z_k$,
is done by maximizing the objective function using the expectation maximization
or EM algorithm. The objective function is given by

$$\prod_{i=1}^{M} \prod_{j=1}^{N} P(w_i \mid d_j)^{n(w_i, d_j)}$$

Once $P(w_i \mid z_k)$ has been estimated, the posterior probability of the
action category, $P(z_k \mid w_i, d_j)$, can be known. Therefore, from the test video sequence
containing a set of video words, this posterior probability for each video
word is computed and the maximum of these probabilities gives the action category
to which this video sequence belongs.
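The mixture form of the word probability and the objective function can be sketched numerically as follows. The log of the objective is evaluated instead of the product itself, which is a standard way of avoiding numerical underflow and leaves the maximizer unchanged. The EM updates are not shown, and the probability tables, counts and function names are made-up values and assumptions for this example.

```python
import math

def word_given_doc(p_z_given_d, p_w_given_z):
    """P(w|d) = sum_k P(z_k|d) * P(w|z_k) for one video sequence d.
    p_w_given_z[k][w] is the probability of word w under category k."""
    num_words = len(p_w_given_z[0])
    return [sum(pz * p_w_given_z[k][w] for k, pz in enumerate(p_z_given_d))
            for w in range(num_words)]

def log_likelihood(counts, p_z_given_docs, p_w_given_z):
    """Log of the product over words and sequences of P(w|d)^n(w,d).
    counts[j][w] is the number of occurrences of word w in sequence j."""
    total = 0.0
    for doc_counts, p_z_given_d in zip(counts, p_z_given_docs):
        p_w = word_given_doc(p_z_given_d, p_w_given_z)
        total += sum(n * math.log(p) for n, p in zip(doc_counts, p_w) if n > 0)
    return total

# Two action categories, two video words, one video sequence.
p_w_given_z = [[0.9, 0.1],
               [0.2, 0.8]]
p_z_given_d = [0.5, 0.5]
p_w = word_given_doc(p_z_given_d, p_w_given_z)
ll = log_likelihood([[3, 1]], [p_z_given_d], p_w_given_z)
```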

The advantage in this algorithm is that no foreground segmentation is necessary

as the interest points are extracted from the response function defined directly on

the image pixel intensity. These interest points are generated due to the sudden

change in the spatial characteristics of local regions where these regions are part

of a complex action. However, these interest points can also be created by noise in the
video sequence, which produces unwanted video words that may affect the

classification accuracy.

2.1.4

Similar to the algorithm based on the bag of words model, this algorithm uses 3D

SIFT descriptors in place of the descriptors which were based on the gradient magnitude. Also, the interest regions are chosen at random rather than choosing the

local maxima regions of the response function based on Gabor space. The features

extracted from the 3D SIFT descriptors are the sub-histograms which are then considered as video words for the bag of words model [9]. Moreover, once the video

words are obtained, the video words are grouped according to their relationships and

these discovered groupings are used for the classification task.

For computation of the 3D SIFT descriptor, first the overall orientation for the

3D neighborhood surrounding an interest point should be calculated. By taking the

spatio-temporal gradient in that neighborhood, the orientation at each pixel or the

local orientation can be computed. An orientation histogram for that neighborhood

is then calculated, from which the dominant peak or the overall orientation is computed. This is used to rotate the 3D neighborhood about its interest point so that the
dominant peaks of all the 3D neighborhoods are in the same direction, which in turn

would make the features rotation invariant. After rotation, these 3D neighborhoods

are divided into sub-regions of a fixed size and the magnitude and orientation at each

pixel in a sub-region is computed. Using the gradient information in the sub-region,

a histogram is computed and these histograms are called sub-histograms since these


are associated with only a sub-region. The final descriptor is then obtained by vectorizing these sub-histograms and concatenating them to form the final vector termed
the 3D SIFT descriptor.

The selection of the interest regions is done by random sampling of the pixels

at different locations, times and scales. Then, the feature vectors or the SIFT descriptors obtained from the neighborhood of every interest point are quantized by using the
K-means clustering algorithm. This leads to a predefined number of groups whose centers make up the video word vocabulary. The 3D SIFT descriptors from the videos
are matched to each video word in the vocabulary and for a space time cube, the
frequency of each word in the vocabulary, known as a signature, is computed. Support vector machines (SVM) are used to train each action category using a modified

version of the signatures and the classification is done on the basis of the largest

distance from the action category.
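The computation of a signature from a given vocabulary can be sketched as follows. The sketch assumes that the cluster centers have already been obtained by K-means, uses two-dimensional toy descriptors in place of 3D SIFT descriptors, and leaves out the modification of the signatures and the SVM training. The function names are hypothetical.

```python
from collections import Counter

def squared_distance(u, v):
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(descriptor, vocabulary):
    """Index of the nearest video word (cluster center) in the vocabulary."""
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))

def signature(descriptors, vocabulary):
    """Frequency of each video word among the descriptors of one cube."""
    counts = Counter(quantize(d, vocabulary) for d in descriptors)
    return [counts[k] for k in range(len(vocabulary))]

# Toy vocabulary of three video words and four descriptors from one cube.
vocabulary = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1], [0.1, 0.8]]
sig = signature(descriptors, vocabulary)
```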

The one drawback that this algorithm has is that the interest points are selected

at random. If the interest points are selected based on another set of features which

are directly correlated with the action, then, complexity in the classification can be

reduced by using the signatures directly in the classifier without any modification.

Moreover, the features used to detect the interest points can be used as additional

features for classification, provided these features are directly related to the motion.

2.1.5

Another algorithm for action recognition considers the human action to be generated by a non-linear dynamical system where the state variables are defined by the

reference joints in the human body silhouette and their functions are defined by the

trajectories of these joints in the time domain [10] . Action features are derived from

the properties of these trajectories by considering them as time series data. There

are many methods available for studying time series data but the one used in this algorithm analyzes the non-linear dynamics of human actions using concepts from

the theory of chaotic systems [11].

The idea behind chaos theory is that there is some form of determinism in otherwise random data and that the determinism is due to some underlying non-linear

dynamics. In other words, a chaotic time series is one which is apparently random

in nature but has been generated by a deterministic process. Dynamical systems are

represented by state space models with state variables $X(t) = [x_1(t)\ x_2(t)\ \ldots\ x_n(t)] \in \mathbb{R}^n$

defining the status at time t. Attractors are regions of phase space or the space

spanned by the state variables where the collection of all the trajectories or the

path of the variables, settle down as time t approaches infinity. These attractors are

termed strange if they are not stable. So, the invariants of the dynamical system's attractor represent the non-linear nature of the system and their properties can be used

in a classification problem. Here, the non-linear dynamical system which represents

the human action generates the chaotic time series data which, in this case, are the

trajectories of the reference joints. Therefore, by extracting the strange attractors,

the human actions can be distinguished.

The first step is to obtain the trajectories of the reference joints, namely the

head, two hands, two legs, and the belly in the video sequence. The scale and translation invariance of the trajectories are obtained by normalizing the trajectories with

respect to the belly point. Each single dimensional time series data is converted to

a multi-dimensional signal thereby modifying the state of the system. The modified

state space and the original state space have the same properties according to chaos

theory and the modified state space brings out the deterministic features. The modified phase space invariants are then extracted to distinguish the different attractors

generated by different human actions. The invariants extracted in this algorithm

are the Maximal Lyapunov Exponent, the Correlation Integral, and the Correlation

Dimension. The Lyapunov exponent is considered a dynamical invariant which measures exponential divergence of the trajectories in the phase space. The correlation

integral quantifies the density of points while the correlation dimension measures the

change in density. Along with these three features, the variance of the time series

data is also included. So, a 4D feature is obtained from each time series data and

if there are $K$ time series obtained from $K$ reference joints, then a total of $K \times 4$

feature vectors are available for human action. Identification of the action is done by

first comparing the test feature vector with those in the database using some distance

metric and then classifying the test feature vector using the K-Nearest Neighbor rule.
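
The conversion of a one-dimensional series into a multi-dimensional signal, and two of the four features, can be sketched as below. The text does not state how the conversion is done; time-delay embedding is assumed here, and the embedding dimension (3), delay (5), radii and the sine-wave input are arbitrary choices for illustration. The Lyapunov exponent and the correlation dimension are omitted, and `delay_embed` and `correlation_sum` are hypothetical helper names.

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Map a 1D series to points [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau: i * tau + n] for i in range(dim)], axis=1)

def correlation_sum(points, radius):
    """Fraction of point pairs closer than `radius` (estimate of the correlation integral)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    iu = np.triu_indices(len(points), k=1)     # each unordered pair once
    return float(np.mean(d[iu] < radius))

t = np.linspace(0, 20 * np.pi, 500)
series = np.sin(t)                             # stand-in for one joint coordinate
embedded = delay_embed(series, dim=3, tau=5)   # reconstructed phase space points

c_small = correlation_sum(embedded, 0.1)
c_large = correlation_sum(embedded, 0.5)
partial_feature = np.array([c_large, series.var()])   # correlation integral and variance
```

How the correlation sum grows between `c_small` and `c_large` is what a correlation dimension estimate would be based on.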

This algorithm gives good recognition accuracies but the one drawback in this

algorithm is that the features are extracted from the trajectories of joints and to get

a noise free trajectory, a good background segmentation and tracking is required. If

the human silhouette is partially occluded in such a way that one of the joints is not

visible, then, the trajectory corresponding to that joint will be distorted which may

affect the properties extracted and hence, may increase the misclassification rate.

2.1.6

The action features extracted here [12] are the Cartesian form of the optical flow

velocity and the vectorized form of the human body silhouette. Since the vectorized silhouette is of a higher dimension, it is reduced to a lower dimensional feature

space using PCA. Each action category is then modeled using hidden Markov models (HMMs) and the combination of the reduced silhouette feature vector and the

optical flow vector is used in the training of these models for every possible viewing direction.

The silhouettes are extracted from the video by foreground segmentation where

the algorithm models the background pixel color value as Gaussian. If that pixel

color value is beyond a certain threshold, then the pixel belongs to the foreground.

These silhouettes obtained at every frame of the video are normalized in size by

bi-cubic interpolation. Then, PCA analysis is performed on these normalized silhouettes by considering them as N -dimensional data points. The analysis is done

by computing the mean and the co-variance matrix of these data points and then,

computing the eigen vectors that span the variation in the silhouettes. So, at every

frame, the silhouette is projected onto the eigen space to get the lower dimensional

feature vector representing that silhouette. To estimate the non-rigid motion of the

human body at every pixel, the optical flow velocity is calculated. The action region

is divided into K number of blocks and the average value of the optical flow motion

field is extracted from each. The average values from all the blocks of an action

region are concatenated to form the optical flow feature vector. For each frame in

every action, the reduced silhouette feature vector and the optical flow feature vector

are combined to form the final feature vector. Hidden Markov models are used in

the modeling of the actions as they are useful in capturing the variations of a time

series data and classification is done using the maximum likelihood approach.
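
The per-frame feature construction described above can be sketched as follows. The silhouettes and the flow field are random stand-ins, the image size (16 x 16), the number of principal components (5) and the 4 x 4 block grid are assumptions, and the function names are hypothetical. The eigenvectors of the covariance matrix are obtained through an SVD of the centered data, which is equivalent to the eigen-decomposition mentioned in the text.

```python
import numpy as np

def fit_pca(X, n_components):
    """Rows of X are vectorized silhouettes. Returns the mean and the leading eigenvectors."""
    mean = X.mean(axis=0)
    # right singular vectors of the centered data = eigenvectors of its covariance matrix
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(x, mean, basis):
    """Project one vectorized silhouette onto the eigen space."""
    return (x - mean) @ basis.T

def block_mean_flow(flow, blocks_y, blocks_x):
    """Average an (H, W, 2) flow field over a grid of blocks and flatten the result."""
    h, w, _ = flow.shape
    grid = flow.reshape(blocks_y, h // blocks_y, blocks_x, w // blocks_x, 2)
    return grid.mean(axis=(1, 3)).ravel()

rng = np.random.default_rng(0)
silhouettes = (rng.random((30, 16 * 16)) > 0.5).astype(float)   # 30 fake 16x16 silhouettes
mean, basis = fit_pca(silhouettes, n_components=5)
flow = rng.normal(size=(16, 16, 2))                             # fake optical flow field

# reduced silhouette vector concatenated with the block-averaged flow vector
frame_feature = np.concatenate([project(silhouettes[0], mean, basis),
                                block_mean_flow(flow, 4, 4)])
```

A sequence of such `frame_feature` vectors, one per frame, is what would be used to train an HMM per action category.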

2.1.7

Here, the concept of mid-level features known as space time shapelets is introduced,

where these shapelets are local volumetric objects or local 3D shapes extracted from

a space time shape. In other words, these shapelets characterize the local motion

patterns formed by the action [13]. Thus, an action is represented by a combination

of such shapelets and because these shapelets represent the local parts of the entire

space time shape, they are more robust to partial occlusions. Extracting all the

possible local volumes from each space time shape of the database and clustering

these sub-volumes using K-means clustering provides the cluster centers which are

then considered as space time shapelets.

The action feature vector extracted using the shapelets is the probability distribution of these shapelets given a particular voxel of a space time shape. Then,

every voxel in the space time cube created by an action is represented by a feature

vector given by $f_D(x) = [\,p_1\ p_2\ \ldots\ p_n\,]^T$ where $p_i$ is the probability of occurrence of

the shapelet $i$ given the voxel $x$. Using the bag of words model where each word is

a shapelet, the histogram for the dictionary of shapelets $D$ is computed and this

histogram is the final action feature vector given by $h_D(V) = \frac{1}{n} \sum_{x \in \text{voxels}} f_D(x)$, where $n$

is the number of shapelets. Two classifiers were used for comparison, one is based

on the nearest neighbor rule and the other is by using logistic regression where the

latter was found to give better results.

2.2

DESCRIPTORS

This section explains some of the descriptors used to discriminate between 2D shapes.

Later on, some of the action recognition frameworks are discussed which are based

on directly using these descriptors individually or their combination. Some of the

frameworks discussed use a combination of a shape descriptor and a motion field.

2.2.1

This section provides an overview of the various shape descriptors which have been

used for recognizing human action patterns. The basic idea is to capture the variations of a 2D shape descriptor across time or capturing the variations extracted from

the space time shape using a 3D shape descriptor and use these variations as action

features.

Shape Descriptors Using Hu Moments

One of the most common moment based 2D shape descriptors is the set of Hu moment

invariants [14] which have been widely used in recognition of visual patterns and

characters. The Hu moments of a geometrical shape are a set of 7 moment values

which are derived from central moments of that shape and normalized by a scaling

factor in such a way as to achieve the property of translation, scale, and rotation

invariance. The central moments $\mu_{p,q}$ of a shape, which are obtained from the centroid

$(x_{cent}, y_{cent})$ and regular moments $m_{p,q}$ of that shape, are translation invariant and so

are used in the computation of Hu moments. The central moments of a shape are

defined as

\[ \mu_{p,q} = \iint (x - x_{cent})^p \, (y - y_{cent})^q \, \rho(x, y) \, dx \, dy \tag{2-5} \]

where $\rho(x, y)$ is the probability density function of the shape under consideration.

The scale invariance property is incorporated by using the theory of algebraic invariants, which normalizes, through $\eta_{p,q} = \mu_{p,q} / \mu_{0,0}^{(p+q)/2+1}$,

the central moments of the shape with respect to its size. To obtain the orientation

invariance, the central moments of different orders are combined in accordance to the

theory of orthogonal invariants to produce the Hu moments. Therefore, according to

[16], the Hu moments with the property of translation, scale, and rotation invariance

can be used as region based shape descriptor of a 2D shape with the probability

density function as the 2D binary image.
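
As a small worked example of the steps above, the sketch below computes central moments, scale-normalized moments and the first two of the seven Hu invariants for a toy binary shape, then checks that they do not change under translation or a 90 degree rotation. The L-shaped test image and the function names are assumptions made for illustration; only two of the seven invariants are shown.

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_{p,q} of a 2D image treated as the density rho(x, y)."""
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    x_cent = (xs * img).sum() / m00
    y_cent = (ys * img).sum() / m00
    return (((xs - x_cent) ** p) * ((ys - y_cent) ** q) * img).sum()

def normalized_moment(img, p, q):
    """Scale-normalized moment eta_{p,q} = mu_{p,q} / mu_{0,0}^((p+q)/2 + 1)."""
    return central_moment(img, p, q) / central_moment(img, 0, 0) ** ((p + q) / 2 + 1)

def hu_first_two(img):
    """First two Hu invariants, built from second order normalized moments."""
    n20 = normalized_moment(img, 2, 0)
    n02 = normalized_moment(img, 0, 2)
    n11 = normalized_moment(img, 1, 1)
    return np.array([n20 + n02, (n20 - n02) ** 2 + 4 * n11 ** 2])

# Toy binary shape: an L-shaped blob
shape = np.zeros((40, 40))
shape[5:25, 5:10] = 1
shape[20:25, 5:20] = 1

shifted = np.roll(shape, (7, 11), axis=(0, 1))   # translated copy (no wrap-around here)
rotated = np.rot90(shape)                        # copy rotated by 90 degrees

hu_original = hu_first_two(shape)
hu_shifted = hu_first_two(shifted)
hu_rotated = hu_first_two(rotated)
```

On a pixel grid, scale invariance holds only approximately, so it is not checked here.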

Unlike the Hu moments, Fourier based shape descriptors are boundary based descriptors which take into account only the pixels on the outer contour of the 2D binary shape [15],[16], and these are translation, scale and rotation invariant. Here, the

boundary of the shape can be described by four different shape signatures, namely the

complex number representation, centroid distance, curvature function, and curvature angular function. For the boundary coordinate point $(x, y)$ and centroid $(x_c, y_c)$, the complex shape

signature is given by

\[ s = (x - x_c) + i\,(y - y_c) \tag{2-6} \]

and the centroid distance signature by

\[ r = \sqrt{(x - x_c)^2 + (y - y_c)^2} \tag{2-7} \]

By normalizing the co-ordinates of the boundary of the shape by the centroid, the

shape signatures can be made translation invariant. The curvature and curvature

angular functions are already translation and rotation invariant. The Discrete Fourier

Transform(DFT) of these shape signatures are then used as shape descriptors with

appropriate modifications to achieve scale invariance. For instance, in the case of the

complex number shape signature, to have scale invariance, the average or the DC

value is ignored and all the other coefficients are scaled down by a factor equal to

the first coefficient of the DFT.
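
The complex-signature descriptor described above can be sketched as below. The contour is a synthetic curve, the centroid is taken as the mean of the contour samples (an assumption; a region centroid could be used instead), and taking magnitudes of the DFT coefficients is assumed as the way rotation and starting-point dependence are removed. The function name and the number of coefficients are hypothetical.

```python
import numpy as np

def fourier_descriptor(boundary, n_coeffs=8):
    """boundary: (N, 2) array of (x, y) contour points in traversal order."""
    xc, yc = boundary.mean(axis=0)                 # centroid of the contour samples
    s = (boundary[:, 0] - xc) + 1j * (boundary[:, 1] - yc)   # complex signature (2-6)
    mags = np.abs(np.fft.fft(s))
    # skip the DC term (index 0) and divide the rest by the first coefficient
    return mags[2:2 + n_coeffs] / mags[1]

# Synthetic closed contour: a circle with a three-lobed perturbation
t = np.linspace(0, 2 * np.pi, 256, endpoint=False)
r = 1 + 0.3 * np.cos(3 * t)
contour = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)

# The same contour rotated by 0.7 rad, scaled by 2.5 and translated
phi = 0.7
rot = np.array([[np.cos(phi), -np.sin(phi)],
                [np.sin(phi),  np.cos(phi)]])
moved = 2.5 * contour @ rot.T + np.array([10.0, -4.0])

desc_original = fourier_descriptor(contour)
desc_moved = fourier_descriptor(moved)
```

Both descriptors agree up to floating point error, which illustrates the translation, scale and rotation invariance claimed for this signature.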

Zernike moments are another set of moments which are used for shape description.

These are in fact related to the normalized central moments of a shape and, so, are

translation and scale invariant. The main difference between Zernike moments and

Hu moments is that Zernike moments are generated with a rotation invariant property

while Hu moments are generated by combining different orders of normalized central

moments. In other words, the Zernike moments are obtained by the projection of the

binary shape onto a set of orthogonal functions with simple rotational properties and

these functions are known as the Zernike polynomials [17]. The Zernike polynomials

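are given by

\[ V_{nl} = V_{nl}(\rho \sin\theta, \rho \cos\theta) = R_{nl}(\rho) \exp(il\theta) \tag{2-8} \]

\[ f(x, y) = \sum_{n} \sum_{l} A_{nl} V_{nl}(\rho, \theta) \tag{2-9} \]

\[ Z_{nl} = \frac{(n + 1)}{\pi} \iint f(x, y) \, [V_{nl}(\rho, \theta)]^{*} \, dx \, dy \tag{2-10} \]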

From these Zernike moments, shape descriptors can be built which are related to

much higher order moments with the advantage of translation, scale and rotation

invariance and, hence, can represent more variations in the binary shape than just

regular moments or Hu moments.

The R-Transform is a 2D shape descriptor which is derived from the Radon transform

and which is invariant to translation [18]. It is made scale invariant by using a suitable

scaling factor. The Radon transform is in fact a projection of the 2D binary shape onto

a set of straight lines which are at a distance of $s$ from the centroid of the shape and

oriented at an angle of $\theta$ with respect to the x axis, with $s$ varying over $(-\infty, \infty)$

and $\theta$ varying over $[0, \pi)$. The R-transform is computed by integrating the Radon

transform over the $s$ axis. This R-transform is not exactly scale invariant but it can

be made so by scaling each value of the R-transform by the area enclosed between it and

the $\theta$ axis. However, unlike the Hu moments and Fourier descriptors, these

R-Transforms are not rotation invariant, but the R-transform of a rotated version

of an image is only a shifted version of the R-transform of the original image.

Therefore, rotation invariance can be achieved by taking only the magnitude

of the Discrete Fourier Transform of the R-transform and scaling each coefficient

by the DC or average coefficient value. The final shape descriptor is obtained by

first applying a Chamfer distance transform to the binary image and then using the

transform values to segment the shape into different levels. The R-Transform is then

applied at each level and this set of R-Transforms applied is taken as a 2D shape

descriptor. This shape descriptor is an example of combining two different shape

descriptors which brings out different aspects of the shape into a single descriptor.
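
A rough sketch of the R-transform step on a single binary image is given below; the Chamfer distance levels are omitted. Two assumptions are made that the text does not state: the Radon transform is approximated by binning foreground pixels along each projection direction, and the integrand over $s$ is the squared Radon transform, as is common for the R-transform in the literature. The function names, the one-pixel bin width and the 180 angles are choices made for this example.

```python
import numpy as np

def radon_binary(img, n_angles=180):
    """Approximate Radon transform of a binary image about its centroid.
    Returns an array of shape (number of s bins, n_angles)."""
    ys, xs = np.nonzero(img)
    x = xs - xs.mean()                         # coordinates relative to the centroid
    y = ys - ys.mean()
    half = int(np.ceil(np.hypot(*img.shape))) + 1
    edges = np.arange(-half, half + 1)         # unit-width bins along the s axis
    thetas = np.deg2rad(np.arange(n_angles) * 180.0 / n_angles)
    sino = np.empty((len(edges) - 1, n_angles))
    for j, th in enumerate(thetas):
        s = x * np.cos(th) + y * np.sin(th)    # signed distance of each pixel's line
        counts, _ = np.histogram(s, bins=edges)
        sino[:, j] = counts
    return sino

def r_transform(img, n_angles=180):
    """Integrate the squared Radon transform over s (assumed form), then
    divide by the area under the resulting curve for scale normalization."""
    sino = radon_binary(img, n_angles)
    r = (sino ** 2).sum(axis=0)
    return r / r.sum()

img = np.zeros((40, 40), dtype=int)
img[5:25, 5:10] = 1
img[20:25, 10:20] = 1
shifted = np.roll(img, (7, 11), axis=(0, 1))   # translated copy (no wrap-around here)

r_original = r_transform(img)
r_shifted = r_transform(shifted)
```

Because the projection is taken about the centroid, the two curves coincide, which illustrates the translation invariance of the descriptor.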

The shape descriptor based on Poisson's equation is similar to the Euclidean distance

transform of a binary image where every internal point in the silhouette is assigned a

value based on its distance from a set of boundary points [19]. Here, the value of an

internal point is determined by the mean time taken by a set of particles at that point

to undergo a random walk process and hit the boundaries. These values at every

internal point of the silhouette $S$ can be determined from the solution $U$ of the Poisson

equation $\Delta U(x, y) = -1$ under the condition $U(x, y) = 0$ at the boundary

$\partial S$, where $\Delta U = U_{xx} + U_{yy}$ is the Laplacian of $U$. The one difference between this

shape descriptor and the distance transform is that while the distance transform

considers only the shortest distance boundary point to find the value of an interior

point, the Poisson equation based shape descriptor takes into account not only the

shortest distance boundary point but also its neighboring boundary points. Hence, it

brings out more global properties of the silhouette than the distance transform. The

other difference is that this representation enables the segmentation of the silhouette

into different parts by taking the gradient $\Phi = U + U_x^2 + U_y^2$. The gradient gives

higher values near concavities which are often projections in a silhouette and thus,

segmentation of shapes into different parts is feasible. Next, by taking the second

derivatives of $U$, the local orientation and aspect ratio can be computed. Finally,

the features for this shape descriptor are the moments of the binary shape but are

weighted by functions which depend on the gradient, local orientation, and aspect

ratio.
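
The Poisson-based internal representation can be sketched numerically as below. Plain Jacobi iteration on a unit grid is assumed here purely for simplicity; it is not the solver used in the cited work. The square silhouette, the iteration count and the function name are also assumptions made for illustration.

```python
import numpy as np

def poisson_descriptor(mask, iters=3000):
    """Solve Laplacian(U) = -1 inside the silhouette with U = 0 outside,
    using Jacobi iteration on a grid with unit spacing."""
    m = np.pad(mask.astype(bool), 1)           # guarantee a zero-valued border
    U = np.zeros(m.shape)
    for _ in range(iters):
        neighbours = (np.roll(U, 1, 0) + np.roll(U, -1, 0) +
                      np.roll(U, 1, 1) + np.roll(U, -1, 1))
        # discrete equation: (sum of 4 neighbours - 4 U) = -1 inside the shape
        U = np.where(m, (neighbours + 1.0) / 4.0, 0.0)
    return U[1:-1, 1:-1]

mask = np.zeros((31, 31), dtype=bool)
mask[5:26, 5:26] = True                        # toy silhouette: a 21x21 square

U = poisson_descriptor(mask)

# gradient-based quantity used for segmentation: U plus squared gradient norm
gy, gx = np.gradient(U)
phi = U + gx ** 2 + gy ** 2

# residual check: discrete Laplacian inside the shape should be close to -1
laplacian = (np.roll(U, 1, 0) + np.roll(U, -1, 0) +
             np.roll(U, 1, 1) + np.roll(U, -1, 1) - 4 * U)
```

For this square, `U` peaks at the center and falls to zero at the boundary, so each interior value reflects the whole boundary rather than only the nearest boundary point.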

All the shape descriptors described above can be used to extract the variation of

the silhouettes across the video frames either in the 2D or 3D case. By far the most

sophisticated ones are the shape descriptors based on Poisson's equation and the Radon

transform as they not only give the boundary representation but also an internal

representation of the silhouette. This enables the extraction of more localized features

which would not have been possible with other shape descriptors like Hu moments,

Zernike moments and Fourier descriptors. In fact, a 3D version of the Poisson

equation based shape descriptor is used to describe a space time shape and global

as well as local properties are extracted from this descriptor. They are then used as

weighting functions in the computation of moments which are further used as action

features [20]. The R-Transform is used in an action recognition framework where

this is computed for silhouettes in key frames of the video sequence. This set of

R-Transforms are later trained by HMMs and the trained models are then used to

compute each action model similarity with the input test sequence. In short, the

Poisson shape descriptor and the Radon transform are the two shape descriptors

which are recommended for action feature extraction.

2.2.2

In this framework, the concept of a space time shape is introduced where a space time

shape is formed by the concatenation of silhouettes in the video frames. These space

time shapes contain the human action and, thus, actions are considered as 3D shapes. An

extension of the Poisson shape descriptor to the 3D domain is used to extract the

various properties pertaining to the 3D shape such as local space-time saliency, action

dynamics, shape structure, and local orientation [20]. These properties are used as

action features and the results demonstrate the robustness of the method towards partial occlusions, non-rigid deformations, large changes in scale and viewpoint, and low-quality video.

The only constraint is that the background should be known beforehand and in this

case, the median background computed from the video sequence is used to segment

out the background for silhouette extraction.

A 3D version of this descriptor is used for the internal representation of the

3D space time shape in which a value at an internal point reflects a much more

global aspect of the silhouette than the Euclidean distance transform. As mentioned

before, this is because the value is calculated not from just the nearest boundary point

but also from the set of neighboring boundary points. This descriptor is computed

by solving the equation $\Delta U(x, y, t) = -1$ subject to both the Dirichlet boundary

condition $U(x, y, t) = 0$ at the bounding surface $\partial S$ and the Neumann boundary

condition $U_t = 0$ applied only at the first and last frames of the video sequence.

The numerical solutions are obtained by a simple w-cycle of a geometric multigrid

solver [21] and the solution obtained can be used to extract global as well as local

features. The local features are obtained by further processing of the solution such

as by taking the gradient and Hessian of the same. The global features are actually

the regular moments but weighted by the local features.

Since human actions are described as a collection of moving parts, a particular

descriptor is defined which emphasizes the fast-moving parts with some importance

to parts which move at a moderate speed. This descriptor is a variant of the gradient

of U defined by

\[ w_\Phi(x, y, t) = 1 - \frac{\log(1 + \Phi(x, y, t))}{\max_{(x,y,t) \in S} \log(1 + \Phi(x, y, t))} \tag{2-11} \]

where the gradient $\Phi(x, y, t) = U + \frac{3}{2}\|\nabla U\|^2$. This variant $w_\Phi(x, y, t)$ will be one

of the weights used for the computation of the global features. The other set of

local features such as local orientation and local aspect ratio are extracted by first

defining three types of local space time structures named as stick, plate, and ball.

A $3 \times 3$ Hessian matrix $H$ is computed from the solution $U$ of the shape descriptor,

with the matrix centered at each voxel location. Then, the eigen values of this matrix are

extracted and based on the ratio of these eigen values, the stick and plate structures

are defined, where the stick and plate structures $S_{st}(x, y, t)$, $S_{pl}(x, y, t)$ are exponential functions. The informative direction $D_j(x, y, t)$ is given by the projection of the

three eigen vectors corresponding to the three largest eigen values onto

the $x$, $y$, and $t$ axes, so that this informative direction gives the deviations of these

eigen vectors from the axes $x$, $y$ and $t$. Then, the local orientation features are the

combination of the informative direction and the stick and plate structures given by

$w_{i,j}(x, y, t) = S_i(x, y, t) \cdot D_j(x, y, t)$ where $i \in \{st, pl\}$ and $j \in \{1, 2, 3\}$.

The global features are the weighted moments of the 3D shape using local features as the weights. For classification, the video sequences are first broken down

into space time cubes of predefined length with a predefined overlap between them.

The features extracted from the space time shapes of the test sequence are then compared with the features of the space time shapes in the database using the Euclidean

distance metric and they are appropriately classified using the nearest neighbor rule.

Although this algorithm is robust to partial occlusions, the memory required to store

the features is large. Moreover, speed is not an issue when dealing with videos of

smaller frame size, since the computation time required for solving the

Poisson equation on a small area is comparable to other methods. But when the

video frame size is as large as $256 \times 256$, computing the features using

the Poisson equation takes considerable time.

2.2.3

in a single frame and these features extracted from a video sequence are used to

train a set of HMMs to recognize the action [22]. The results provided justify

the robustness of the algorithm to frame loss and disjoint silhouettes. HMMs are

used in the analysis of time series data and since the low-level features are extracted at every frame, the action features can be described as time series data with

the data being the low-level features varying across the video sequence. The R-Transform

is generally translation invariant but not scale and rotation invariant. The rotation

variance is ignored as human actions rarely include the silhouettes being rotated and

the scale invariance is achieved by resizing the image to a normalized scale before

the R-Transform is applied. The R-Transform is a 180 dimensional vector and so,

a feature matrix of size $60 \times 3$ is formed. By applying PCA, the feature matrix is

reduced in size to $2 \times 3$, whose entries are concatenated to form a $6 \times 1$ vector. This vector

obtained from a silhouette from a single frame in a video sequence is used to train

the HMM to get a model for each action category.

The main advantage of using this algorithm is that the computation of the R-transform is linear and so, the computation cost of the features is low. The only

disadvantage is that the silhouettes need to be normalized with respect to the scale

and this often requires the use of interpolation methods.

2.2.4

In this framework, both the motion flow feature vectors as well as the global shape

flow feature vectors are extracted from image sequences and the combined feature

vectors are used to model each category by HMMs for multiple views [23]. The

motion flow feature vectors are given by the optical flow and the shape flow feature

vector is obtained by using suitable shape descriptors.

To capture the shape flow, a concatenation of the different features is used

such as Hu's invariant moments, Zernike moments, flow deviations over the image

sequences in a space time shape, and global anthropometric variations. The flow

deviations are the mean absolute deviation of the silhouette image from the center

of mass of the silhouette in the x and y directions and the mean intensity of the

shape distribution of the space time shape. The mean absolute deviations help

in discriminating between actions involving large body motions and actions having

small body movements. The global anthropometric variants are the projection of the

silhouette onto the x and y axes. These four feature vectors are combined to form the

shape flow feature vector. The combined local-global(CLG) optic flow feature vector

is the combination of the global optical flow extracted by considering the silhouette

as a whole and the optical flow vectors extracted from the various blocks of the

silhouette image. Here, the silhouette action boundary is divided into four quadrants

or blocks with the origin at the center of mass of the silhouette and from each block,

the optical flow vectors are extracted. Therefore, the key features extracted from a

single frame in a space time cube are the combined vectors of both the shape flow and

the CLG optic flow vectors. Modeling of the various action categories is done by

multi-dimensional hidden Markov models where the HMM algorithms are modified to

include the combined shape-CLG flow vectors. Classification of the action features

is done using the hidden Markov models, where the model which gives the highest

conditional probability is selected.

The use of optical flow vectors and shape descriptors such as Hu moments reduces

the robustness of the algorithm to partial occlusions. Moreover, this algorithm uses a

combination of different shape descriptors such as Hu moments and Zernike moments

which is not very efficient. Although it is stated that the different shape descriptors

bring out different aspects of the shape, it is not very clear as to what those aspects

are.

2.2.5

The algorithm proposed uses a combination of the local features in the form of

SIFT descriptors and the global features in the form of Zernike moments. Both the

local and global features emphasize the different aspects of actions [24]. For the

local features, both the 2D as well as the 3D SIFT descriptors are extracted from

neighborhood regions which have 2D SIFT interest points. The 2D SIFT descriptor

emphasizes the 2D silhouette shape and the 3D SIFT descriptor emphasizes the motion. The global or holistic features are the Zernike moments extracted from single

frames and from motion energy images.

First, consecutive frames of the video sequence are subtracted from each other to

remove the background and what remains are the difference images. It is from these

difference images that the interest points are obtained. Each of the difference images

is projected into scale space using a Gaussian kernel. This creates a pyramid-like

structure with each level representing the scale of the image. The whole scale space

is divided into octaves with each octave having a certain number of levels. At each

octave, the difference images projected into consecutive levels are subtracted to form

a difference of Gaussians (DOG). Each pixel in the DOG image is then compared to

its 8-neighbors in the same level and the 9-neighbors in the higher and lower levels.

If there is a significant difference, then, that pixel is considered as an interest point

referred to as a 2D SIFT interest point. It is from these interest points that the

2D SIFT descriptor feature vector and the 3D SIFT descriptor feature vector are

extracted. In both the extractions, the gradient magnitude and the orientation are

calculated at each scale or each level and at the interest points, the orientation histogram in that interest region is computed. As mentioned in the previous subsection,

the scale invariance is achieved by scaling the histogram of the interest region with

the magnitude of the gradient at the respective interest point. The interest regions

are rotated so that the dominant orientation is aligned in the same direction as the

dominant orientations of other interest regions. The whole region is then divided

into sub-regions on which the sub-histograms of the orientations are calculated. The

final 2D or 3D descriptor feature vector of that interest region is the concatenation of

these sub-histograms in that interest region.
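
The interest point detection step (difference of Gaussians followed by a comparison with 8 neighbors in the same level and 9 in each adjacent level) can be sketched for a single octave as below. The input is a synthetic blob rather than a frame difference image, and the list of scales, the threshold and the function names are assumptions made for this example.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with zero padding at the borders."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    rows = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

def dog_extrema(img, sigmas, thresh):
    """Return (level, y, x) positions that are extrema of the DoG stack
    among their 26 neighbours and exceed `thresh` in magnitude."""
    blurred = np.stack([gaussian_blur(img, s) for s in sigmas])
    dog = blurred[1:] - blurred[:-1]           # difference of consecutive levels
    points = []
    for l in range(1, dog.shape[0] - 1):
        for y in range(1, dog.shape[1] - 1):
            for x in range(1, dog.shape[2] - 1):
                v = dog[l, y, x]
                cube = dog[l - 1:l + 2, y - 1:y + 2, x - 1:x + 2]
                if abs(v) > thresh and (v == cube.max() or v == cube.min()):
                    points.append((l, y, x))
    return points

# Synthetic input: one Gaussian blob of width 2 centered at (32, 32)
yy, xx = np.mgrid[:64, :64]
image = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2 * 2.0 ** 2))

interest_points = dog_extrema(image, sigmas=[1.0, 1.6, 2.56, 4.1], thresh=0.05)
```

For this input the blob center is reported as an interest point at the middle DoG level, the one whose scale best matches the blob size.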

The holistic features are the Zernike moments extracted from the space time

shape. There are two variations which are extracted. One is the set of Zernike moments extracted from the single frame thereby providing the spatial variation. The

bag of words approach along with K-means clustering algorithm is applied to these

moments to get the necessary holistic feature vector. The other is the set of Zernike

moments extracted from the motion energy image computed from the single frames.

The holistic features and the 2D/3D SIFT descriptors are applied to SVMs for classification. The one drawback this algorithm faces is that it may not be robust to

partial occlusions. This is because of the use of Zernike moments in describing the

global features which can vary a lot due to occlusions.

2.3

SUMMARY

Various action recognition algorithms have been discussed with some using motion

fields such as optical flow and local features such as SIFT descriptors while some

others using shape flow features extracted using shape descriptors. The emphasis has

been on the algorithms using the shape descriptors. In the next Chapter, the shape

descriptor based on the Radon transform is explained in detail with illustrations.

CHAPTER 3

MULTI-LEVEL SHAPE REPRESENTATION

The algorithm proposed in this thesis is a small extension of the shape descriptor

based on the multi-level Radon transform [18] from which action features are extracted. This Chapter describes the Radon transform in detail and how it can be used to

describe the spatial variations of the 2D silhouette of a frame in a video sequence.

3.1

This type of shape descriptor does not belong to the category of region based shape

descriptor where all the pixels within the boundary of the shape are considered or

the contour-based descriptor which takes only the boundary pixels into account.

The descriptor is based on the Radon transform where this transform computes the

projection of a 2D shape on to a set of lines with each line being at a distance of

$s$ from the centroid of that shape and oriented at an angle of $\theta$ from one of the

co-ordinate axes. The Radon transform is a 2D image having co-ordinates $s$ and $\theta$.

A slight variation of the Radon transform known as the R-Transform is extracted

from it and this is used as one of the components of the shape descriptor. In fact,

each component is obtained by projecting different levels of the binary shape into

the Radon space with each level obtained from the 2D Chamfer distance transform.

Thus the link between the internal structure and the boundary is emphasized while

computing the shape descriptor.

3.1.1

As defined before, the Radon transform projects a 2D binary shape onto a set of lines

oriented at different angles and at different distances from the centroid of that shape.

In other words, the Radon transform is an integral transform where the function is

integrated over a set of lines. According to [32], the Radon transform of a density

distribution $\rho(x, y)$ is defined as

\[ \mathfrak{R}\rho(x, y) = \iint \rho(x, y) \, \delta(R - x\cos\theta - y\sin\theta) \, dx \, dy \tag{3-1} \]

where $\theta$ is the scanning direction or the direction of the projection line and $R$ is the

distance of the projection line from the origin. Here, the density distribution $\rho(x, y)$

is the 2D binary image $f(x, y)$ and the origin is the centroid of that binary shape. In

vector notation form where $\mathbf{x}$ is a vector with two components $(x, y)$ and $\mathbf{t}$ is a unit

vector in the scanning direction, the Radon transform is given by

\[ \mathfrak{R}\rho(x, y) = \int \rho(\mathbf{x}) \, \delta(R - \mathbf{x} \cdot \mathbf{t}) \, d\mathbf{x} \tag{3-2} \]

3.1.2

Consider a unit point mass at P (a, b) as shown in Figure 3.1a and Figure 3.1b.

Then, the density function $\rho(x, y)$ given in terms of this point mass is $\rho(x, y) = {}^{2}\delta(x - a, y - b)$.

Let the Radon transform of this density function be $I(a, b, R, \theta)$, which

will in turn be the impulse response of the operator $\mathfrak{R}$. Let a line $L$ be scanning in

a fixed direction given by $\theta$ across the $x$-$y$ plane. At a particular distance $R$ from

the origin, the line $L$ passes through the point $P(a, b)$ and let $g_\theta(R)$ be the Radon

transform computed for that line $L$ for a fixed $\theta$ at a fixed distance $R$. Since this line

$L$ passes through the point $P(a, b)$, $g_\theta(R)$ will be a non-zero value. Then, by fixing

this distance $R$, the line is rotated with $\theta$ varying from 0 to $\pi$ so that the contribution

of this point $P(a, b)$ to the Radon transform $\mathfrak{R}\rho(x, y)$ is accumulated for all directions.

In other words, $g_\theta(R)$ is calculated for all values of $\theta$ for a fixed $R$. By plotting the

points of the intersection of the line L and the perpendicular lines from the origin,

a circular locus is obtained. This locus of points defines the nature of the impulse

33

(a) PointMass1

(b) PointMass1

response I(a, b, R, ) as a uniform ring delta function of unit line density. The above

interpretation is shown in Figure 3.1a. Now, consider a unit mass spread over a small

circular area of diameter surrounding the unit point mass P . As shown in Figure

3.1b, the line integral g (R) for a constant R and constant will be a narrow hump

of unit area and width . For a fixed R, the line L is rotated so that the contribution

of the unit mass centered at P to the Radon transform <(x, y) is accumulated for

all values of . The boundary of the Radon transform will then be a circular strip

of non-uniform width. The maximum width occurs at the position where the line

L passes through both the point P and the origin O when extended as it is at this

position the maximum area can be seen from the line L. At all other positions, the

area of the point mass as seen from the line L is less than this maximum. Thus, it can

be stated that the mass at P (a, b) is distributed non-uniformly along the perimeter

of the circle with diameter OP . This gives the true nature of the impulse response

which is a ring delta with non-uniform line density.

3.1.3

co-ordinate system, then, the mass at P (a, b) is non-uniformly distributed along

the Radon transform boundary and the density at point Q is proportional to OQ.

Therefore, the Radon transform is computed by first sub-dividing this mass into smaller point masses and accumulating their contributions. The mass can be divided in two ways: with uniform spacing between the point masses on the Radon boundary but a different mass for each, or with non-uniform spacing on the Radon boundary but the same mass for each. The line L is divided into smaller parts and each point mass is allocated to the nearest part of the line. The allocations from all the smaller point masses of all the points are accumulated to form the Radon transform of the density distribution. The Radon transform can also be denoted by $T(s, \theta)$, where $\theta$ is the scanning direction and s corresponds to R in the previous representation.

For computing the Radon transform of a binary image, it is taken as the density

distribution with the pixels considered as the point masses. The computation of

the Radon transform is done by applying the analogy of smaller point masses where

each pixel is divided into sub-pixels and then, the value of each subpixel is projected

into a line which is divided into a certain number of bins. If the projection of the

sub-pixel falls on to the center of the bin, the full value of the sub-pixel is used in the

computation. If the projection falls on the border of a bin, the value of the sub-pixel

is split between the current bin and the adjacent bin. This type of projection for a

single pixel is computed for all sets of lines oriented at different angles with the x

axis and at different distances from the centroid of the shape. The accumulation of all these projections over all the pixels of the image gives the Radon transform of the image. The definition of the Radon transform and its computation for a pixel are illustrated in Figure 3.2a and Figure 3.2b, where the 2D binary shape is projected onto the line AA' which is at a distance of s from the centroid of the shape and oriented at an angle of $\theta$. Since the 2D binary image is discrete, the maximum and minimum values that s can take are finite and depend on the size of the image.

Figure 3.2: Definition of Radon Transform and its Computation for a Pixel. (a) Definition. (b) Computation.
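The projection-and-accumulate procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this thesis: each pixel is treated as a single point mass (no sub-pixel division), bins along s are one pixel wide, and a value that falls between two bin centers is split linearly between them. The function name `radon_transform` and the parameter `n_angles` are hypothetical.

```python
import math

def radon_transform(image, n_angles=180):
    """Return a sinogram T[s_bin][angle_index] for a 2D list of pixel values.

    Assumptions of this sketch: one point mass per pixel, unit-width bins,
    linear splitting of a mass between the two nearest bins, and angles
    sampled uniformly over [0, pi).
    """
    rows, cols = len(image), len(image[0])
    # Non-zero pixels act as point masses (x, y, value).
    masses = [(c, r, image[r][c])
              for r in range(rows) for c in range(cols) if image[r][c]]
    if not masses:
        raise ValueError("image contains no non-zero pixels")
    total = sum(v for _, _, v in masses)
    # The origin of the projection is the centroid of the shape.
    cx = sum(x * v for x, _, v in masses) / total
    cy = sum(y * v for _, y, v in masses) / total
    # s is bounded by the image diagonal, so the number of bins is finite.
    half = int(math.ceil(math.hypot(rows, cols))) + 1
    sinogram = [[0.0] * n_angles for _ in range(2 * half + 1)]
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for x, y, v in masses:
            # Signed distance of the projection line through this pixel.
            pos = (x - cx) * cos_t + (y - cy) * sin_t + half
            lo = int(math.floor(pos))
            frac = pos - lo
            sinogram[lo][k] += v * (1.0 - frac)   # share of the current bin
            sinogram[lo + 1][k] += v * frac       # share of the adjacent bin
    return sinogram
```

Because every point mass is fully distributed over the bins of each angle, each column of the sinogram sums to the total mass of the shape, which gives a quick sanity check on any implementation of this kind.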

3.1.4

The Radon transform is mainly used for the detection of lines and curves in an

image where this transformation will emphasize these straight lines or curves. In

other words, the Radon transform will contain high and low value pixels where the

high value gives the intensity of the straight line in the original image whose location

in the original is determined by its co-ordinates in the Radon transformed image.

As mentioned before, the Radon transform co-ordinates are $(s, \theta)$ where s gives the perpendicular distance of that straight line from the origin and $\theta$ gives the inclination of the line with the x axis of the original image. The Radon transform for a square binary image is shown in Figure 3.3a. The maximum values denote that the maximum projection of the square occurs along the diagonals. The line detection property is illustrated in Figure 3.3b where the outline of the square binary image is Radon transformed.

As shown in Figure 3.3a, the bright pixels in the Radon transform of the square image occurring at co-ordinates (-25, 0°), (+25, 0°), (-25, 90°) and (+25, 90°) correspond to the horizontal bottom edge, the horizontal upper edge, the left vertical edge and the right vertical edge respectively. By comparing the Radon transform of the square image to that of its outline, a significant difference can be noticed. In Figure 3.3a, the highest values occur at co-ordinates (0, 45°) and (0, 135°), which represent the diagonals of the square image, and in Figure 3.3b, the highest values correspond to the edges of the square outline. This shows that the computation of the Radon transform considers not only the boundary pixels of the shape but also the interior pixels. Further analysis of the Radon transform is shown in Figure 3.3c, where the interior values of the square are represented by a 2D distance transform based on the Euclidean distance. The Radon transform of the distance transform is not very different from that of the original, the difference being in the intensity of the sinusoidal variations present in both images. The co-ordinates of the bright pixels are the same in the Radon transforms of both the square and the distance transformed image, but the intensity is greater at those locations in the case of the distance transformed one. This shows that the transform changes in accordance with the interior values of the shape. The analysis can also be applied to human binary silhouette images and is illustrated in Figure 3.4.

Similar to the case of the square image, the Radon transform of the silhouette image shown in Figure 3.4a differs from that of its outline and its distance transform shown in Figure 3.4b and Figure 3.4c. This is due to the different values of the interior pixels. The Radon transform of the silhouette image has bright pixels which correspond to the main axis. For the outline silhouette, the Radon transform consists of bright regions which correspond to the line segments that make up the boundary. In the case of the distance transform, where the interior pixels have non-zero values, the Radon transform has other regions which are brighter, and these regions correspond not only to the line segments that make up the boundary but also to the interior structure of the silhouette. Another observation is that the sinusoidal variations become more emphasized in the Radon transform domain when the interior pixels of the input image have larger values. This can be seen in the case of the silhouette outline images, where the corresponding Radon transform has sinusoidal variations which are less bright than those present in the Radon transform of the original image. In the case of the distance transformed image, the sinusoidal variations, even though brighter than those found in the outline, are darker when compared to those of the original.

3.1.5

are discussed in this section. These properties are periodicity, symmetry, translation,

rotation, and scaling [18].

Periodicity :- The Radon transform is periodic in $\theta$ with a period being a multiple of $2\pi$. This property is given by $T(s, \theta) = T(s, \theta + 2k\pi)$ where $k \in \mathbb{Z}$ and $T(s, \theta)$ is the Radon transform. So, if the Radon transform is shifted by a multiple of $2\pi$ along the $\theta$ variable, the inverse of the shifted Radon transform would be the same as the original image. Shifting in $\theta$ rotates the set of lines onto which the original image is projected by a certain offset, and if this offset is a multiple of $2\pi$, the lines are back to their original positions.

Symmetry :- The next property is symmetry, which is given by $T(s, \theta) = T(-s, \theta \pm \pi)$. As seen in Figure 3.3 and Figure 3.4, the Radon transform is symmetric along the line s = 0 and about $\theta = \pi$. This property can be used to reduce the number of computations and the memory required for storing the transform. In other words, only the upper or lower half is needed to reconstruct the original image.

Translation :- Let the image $f(x, y)$ be translated by a vector $\vec{p} = (x_0, y_0)$. The Radon transform of the translated image is then given by $T(s - x_0\cos\theta - y_0\sin\theta, \theta)$. This tells us that the transform is translated along the s axis by a value equal to the projection of the translation vector $\vec{p}$ onto the line $x\cos\theta + y\sin\theta$. So, an image and its translated version will have different Radon transforms and hence, to use these transforms as a suitable shape descriptor, the translation vector should be determined from the image and its translated version, and this predetermined vector should be used as the normalization factor.

Rotation :- Another property is that of rotation, where if the image is rotated by $\theta_0$, the Radon transform is shifted by the same amount along the $\theta$ axis. In other words, the rotated Radon transform is given by $T(s, \theta + \theta_0)$. Again, as in the case of translation, the rotation angle between the original and the rotated image should be determined and this angle should be used for normalization.

Scaling :- The scaling of the image by a value k results in a Radon transform $\frac{1}{|k|} T(ks, \theta)$ which is scaled not only in the amplitude but also in the s co-ordinate, so the scaling factor must also be determined for normalization purposes.

In short, if the original image is translated, rotated, and scaled, the corresponding Radon transform is also translated, shifted, and scaled, and extracting the translation vector, the rotation angle and the scaling coefficient at the same time for normalization is extremely difficult. Therefore, the 2D Radon transform cannot be used directly as a suitable shape descriptor. This led to a modified version of the transform known as the R-Transform [18].

3.1.6

R-Transform

The R-Transform is formed by integrating the square of the 2D Radon transform $T(s, \theta)$ over the variable s. The R-Transform, which is one-dimensional, is given by

$$R_f(\theta) = \int T^2(s, \theta)\,ds \tag{3-3}$$

where $T(s, \theta)$ is the 2D Radon transform of the image. A single value of the R-Transform can be interpreted as the summation of the squared projections onto a set of lines at different distances from the centroid but all oriented at a specific angle $\theta$ to the x co-ordinate. In other words, the R-Transform is formed by keeping the variable $\theta$ constant and accumulating the projections on the set of lines all oriented in the same direction. The R-Transform is shown in Figure 3.5. The R-Transform has the following properties, which make it a suitable descriptor for 2D shapes [18].

Periodicity :- The R-Transform is periodic in nature with a period $\pi$, unlike the 2D Radon transform which has the period $2\pi$. This is given by the equation: $R_f(\theta \pm \pi) = R_f(\theta)$.

Rotation :- The R-Transform is not rotation invariant, but the angle by which the shape has been rotated can easily be extracted and used to normalize the R-Transform to make it rotation invariant. In fact, the R-Transform of an image rotated by an angle $\theta_0$ is a translated version, shifted by the same amount. This is given by the equation: $R_f(\theta + \theta_0) = \int T^2(s, \theta + \theta_0)\,ds$.

Translation :- The R-Transform is translation invariant although the 2D Radon transform is not. This removes the need to extract a translation factor for normalization, and thus satisfies one of the criteria for a shape descriptor. The R-Transform of an image translated by a vector $\vec{p} = (x_0, y_0)$ is given by the equation: $\int T^2(s - x_0\cos\theta - y_0\sin\theta, \theta)\,ds = R_f(\theta)$. This is illustrated in Figure 3.6a.

Scaling :- The R-Transform is not scale invariant but can be made so by normalizing with a scaling factor. This transform extracted from a scaled version

of the image will also be scaled but only in amplitude unlike the 2D Radon

transform where the amplitude as well as the variable s was scaled. This

makes it easier to use a scaling factor for normalization which depends on the

R-Transform itself. Here, the scaling factor used in the normalization is the

area of the R-Transform and by using this scaling factor, scale invariance can

be achieved. The property is illustrated in Figure 3.6b.
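The translation and scaling properties listed above can be checked numerically with a small sketch. It assumes the Radon transform is already available as a sinogram `T[s][k]` (a list of rows indexed by the s bin, each holding one value per angle); the function names `r_transform` and `normalize_by_area` are hypothetical and chosen only for this illustration.

```python
def r_transform(sinogram):
    """Discrete form of equation (3-3): sum T(s, theta)^2 over s for each angle."""
    n_angles = len(sinogram[0])
    return [sum(row[k] ** 2 for row in sinogram) for k in range(n_angles)]

def normalize_by_area(r_values):
    """Divide by the area under the R-Transform to remove the amplitude scaling."""
    area = sum(r_values)
    return [v / area for v in r_values]

# Toy sinogram with 4 bins along s and 3 angles (values are arbitrary).
toy = [[0, 1, 0],
       [2, 1, 0],
       [0, 1, 3],
       [0, 0, 0]]

# A translation of the image only shifts the sinogram along s, so the
# sum over s, and hence the R-Transform, is unchanged.
shifted_along_s = [toy[-1]] + toy[:-1]

# An amplitude scaling of the sinogram scales the R-Transform, but the
# area-normalized R-Transform stays the same.
doubled = [[2 * v for v in row] for row in toy]
```

Here `r_transform(toy)` and `r_transform(shifted_along_s)` agree, while `r_transform(doubled)` differs from `r_transform(toy)` only by a constant factor that `normalize_by_area` removes.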

Thus, the properties of translation and scale invariance make the R-Transform a suitable shape descriptor. Rotation invariance can be achieved by extracting the amount by which the image is rotated from its original, which can be computed from the two R-Transforms by a correlation based technique. Usually, however, the rotational dependence is removed by normalizing the Discrete Fourier transform coefficients of the R-Transform by the average value and retaining only the magnitude. As shown in Figure 3.6, the scaled images and the translated images have the same R-Transform while the image rotated by 30° has its R-Transform shifted to the right.
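The effect of keeping only normalized Discrete Fourier transform magnitudes can be illustrated with a short sketch. It assumes the R-Transform is sampled uniformly over one period, so that a rotation of the shape becomes a circular shift of the samples, and it uses the zero-frequency coefficient (proportional to the average value) as the normalizer. The function names are hypothetical, and the direct O(n²) DFT is used only to keep the example self-contained.

```python
import cmath

def dft_magnitudes(signal):
    """Magnitudes of the Discrete Fourier transform coefficients of a real sequence."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def rotation_invariant_descriptor(r_values):
    """Non-DC DFT magnitudes divided by the DC magnitude.

    A circular shift of r_values changes only the phase of each
    coefficient, and a constant gain cancels in the ratio.
    """
    mags = dft_magnitudes(r_values)
    return [m / mags[0] for m in mags[1:]]

# Arbitrary sample values standing in for an R-Transform.
samples = [4.0, 3.0, 9.0, 1.0, 2.0, 6.0]
rotated = samples[2:] + samples[:2]        # circular shift = rotation of the shape
rescaled = [5.0 * v for v in samples]      # amplitude scaling
```

Up to floating-point error, `rotation_invariant_descriptor` returns the same values for `samples`, `rotated` and `rescaled`.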

3.2

The distance transform is another type of shape descriptor which gives an internal

representation of the shape based on the Euclidean distance of a pixel from the nearest

boundary pixel. The interior pixel values depend on the type of the approximation of

the Euclidean distance metric used and this leads to different families of the distance

transform [29]. The approximation of the Euclidean distance between a particular

interior pixel and the neighboring pixel depends on the weights given to its neighbors.

Let the weights given to the diagonal neighbors be d2 and to the horizontal/vertical

neighbors be d1. The distance between two arbitrary pixels p and q is then given by

(3-4)

where m1 is the number of horizontal steps and m2 is the number of vertical steps

between the two pixels. Depending on the value of the (d1, d2) pair, different approximations are possible and they are known by different names. For (d1, d2) = (1, $\infty$), the approximation is known as the 4-neighbor distance or city-block distance and when (d1, d2) = (1, 1), it is called the 8-neighbor distance or the chessboard distance. The Euclidean distance approximations with d1 or d2 being an integer other than 1 or $\infty$ are

known as Chamfer distances. One example of the Chamfer distance approximation

3.2.1

The distance transform of an image is computed using local operations applied sequentially to pixels using certain types of masks [30]. The operational mask starts

its processing from the top-left most pixel and the image is scanned by this mask

in a forward raster scan manner. Each shift of the operational mask in the scan

involves the processing of the currently centered pixel using the neighboring pixels

where the current values of the neighborhood pixels are obtained from the processing

of the previous shift of the operational mask. If a pixel in an image is represented by $a(i, j)$ in the $i$th row and $j$th column and its new value by $a^{*}(i, j)$, the sequential operation on a centered pixel can be summarized as

$$a^{*}(i, j) = f(a_{i-1,j-1}, a_{i-1,j}, a_{i-1,j+1}, a_{i,j-1}, a_{i,j}, a_{i,j+1}, a_{i+1,j-1}, a_{i+1,j}, a_{i+1,j+1}) \tag{3-5}$$

where $f(\cdot)$ is the operation performed by the mask. The image on which the distance transform is to be applied is usually a binary image. The two local operations used are

$$f_1(a_{i,j}) = \begin{cases} 0 & \text{if } a_{i,j} = 0 \\ \min(a_{i-1,j} + 1,\ a_{i,j-1} + 1) & \text{if } (i, j) \neq (1, 1) \text{ and } a_{i,j} = 1 \\ M + N & \text{if } (i, j) = (1, 1) \text{ and } a_{1,1} = 1 \end{cases} \tag{3-6}$$

$$f_2(a_{i,j}) = \min(a_{i,j},\ a_{i+1,j} + 1,\ a_{i,j+1} + 1) \tag{3-7}$$

where the size of the image is $M \times N$. Two scans of the image are done to compute

the distance transform, the first is the forward raster scan by the operation f1 and

the second is the reverse raster scan by the operation f2 . These operations mark the

pixels with values equal to their distance from the set of zero-valued pixels. The first

operator increments the values of the top vertical and left horizontal neighbors and

assigns the minimum value to the center pixel. The extreme conditions are when the

centered pixel has a value 0 or when the centered pixel is the top-left most pixel. The

image obtained after the first scan is a partial distance transform where the interior

pixel values give the distance from the nearest top or left boundary pixel. The

second operation in the reverse raster scan mode is applied to the output image of

the first scan and by centering the mask at every pixel, the bottom vertical and right

horizontal neighbors are incremented. The value of the centered pixel is then updated

by the minimum of the current value of the centered pixel and the updated values

of the corresponding neighbors. By varying the scale of the increments applied to

the vertical and horizontal neighbors, the approximations to the Euclidean distance

transform can be implemented. The horizontal increments are scaled by d1 and the

vertical increments are scaled by (d2 - d1) during the forward and reverse scans. The

different types of approximations for the Euclidean distance transform are mentioned

in [31].
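The two-scan procedure described above can be sketched as follows for the simplest case of unit increments (the city-block approximation). This is an illustrative sketch rather than the implementation used in this thesis: the function name is hypothetical, indices are 0-based, neighbors outside the image are simply skipped, and the initial value M + N serves as an upper bound for pixels not yet reached.

```python
def city_block_distance_transform(image):
    """Two-pass sequential distance transform of a binary image (2D list of 0/1).

    Forward raster scan: propagate distances from the top and left neighbors.
    Reverse raster scan: propagate distances from the bottom and right neighbors.
    """
    rows, cols = len(image), len(image[0])
    far = rows + cols  # M + N, larger than any city-block distance in the image
    dist = [[0 if image[r][c] == 0 else far for c in range(cols)]
            for r in range(rows)]

    # Forward scan, starting from the top-left pixel.
    for r in range(rows):
        for c in range(cols):
            if dist[r][c] == 0:
                continue
            if r > 0:
                dist[r][c] = min(dist[r][c], dist[r - 1][c] + 1)
            if c > 0:
                dist[r][c] = min(dist[r][c], dist[r][c - 1] + 1)

    # Reverse scan, starting from the bottom-right pixel.
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            if r < rows - 1:
                dist[r][c] = min(dist[r][c], dist[r + 1][c] + 1)
            if c < cols - 1:
                dist[r][c] = min(dist[r][c], dist[r][c + 1] + 1)
    return dist

# A 3 x 3 square of ones surrounded by zeros.
square = [[0, 0, 0, 0, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 1, 1, 0],
          [0, 0, 0, 0, 0]]
```

For `square`, the forward scan alone leaves the bottom-right interior pixels with values that are too large, since only the top and left boundaries have been seen; the reverse scan corrects them.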

3.3

CHAMFER DISTANCE TRANSFORM AND R-TRANSFORM

The R-Transform gives a very compact representation of a shape and so, to include the variation of the interior structure, a distance transform along with the

R-Transform is used to give a multi-level representation of the shape. First, a distance transform is applied to the 2D shape and using the interior values obtained,

the shape is segmented into different levels as shown in Figure 3.7. Then, each level is extracted from a single shape and each extracted shape gives a coarser representation of the original. Finally, the R-Transform is extracted from each level of the 2D

shape to get the multi-level representation. The distance transform used here is the

Chamfer distance measure with (d1, d2) = (3, 4).

The segmentation of the shape into different levels is done by a simple thresholding scheme. At a particular level, the pixels whose distance transform values are

greater than a predefined threshold are selected and the rest are discarded. The

selected values at every level of the shape are then given a constant value to get the

binary shape. The combination of all the R-Transforms extracted at each level gives

a complete description of the shape including the internal structure. An illustration

of 8-level segmentation applied to binary images of a dog and a human silhouette is shown in Figure 3.7a and Figure 3.7b along with the operational masks used to compute the distance transform. However, this set of R-Transforms is not yet suitable for shape representation as it is still rotation variant, as explained in the previous section. A rotation of a shape by a certain angle produces a translatory shift by the same angle in the R-Transform. Therefore, the Discrete Fourier transform $F$ is taken of the R-Transform of every level of the shape and only the magnitude of the coefficients is retained so as to remove the rotation variance. Let $R_l$ be the discrete R-Transform of the 2D shape at level $l$, where $l = 1, \ldots, L$, and let $F_l$ denote the magnitudes of its Discrete Fourier transform coefficients. Then, the final multi-level representation of the shape is given by

$$X = \left( \frac{F_1(1)}{F_1(0)}, \ldots, \frac{F_1(\pi)}{F_1(0)}, \ldots, \frac{F_L(1)}{F_L(0)}, \ldots, \frac{F_L(\pi)}{F_L(0)} \right)$$

This multi-level representation of the shape obtained after the Radon transform and the Discrete Fourier transform is the final shape descriptor, which is translation, rotation and scale invariant.
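The thresholding scheme used to obtain the levels can be sketched as below. The sketch assumes the distance transform is given as a 2D list and that the thresholds are supplied by the caller; the function name `segment_levels` and the example thresholds are hypothetical and do not reflect the values used in this thesis.

```python
def segment_levels(distance_map, thresholds):
    """Return one binary shape per threshold.

    At each level, pixels whose distance transform value exceeds the
    threshold are kept and set to the constant value 1; the rest are 0.
    Larger thresholds give coarser shapes closer to the interior.
    """
    return [[[1 if value > t else 0 for value in row] for row in distance_map]
            for t in thresholds]

# Distance transform of a 3 x 3 square (city-block values).
distance_map = [[0, 0, 0, 0, 0],
                [0, 1, 1, 1, 0],
                [0, 1, 2, 1, 0],
                [0, 1, 1, 1, 0],
                [0, 0, 0, 0, 0]]

levels = segment_levels(distance_map, thresholds=[0, 1])
```

Here `levels[0]` is the full square and `levels[1]` keeps only its center pixel. In the scheme described above, an R-Transform would then be extracted from each of these binary shapes.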

3.4

SUMMARY

This Chapter gives a detailed explanation of the Radon transform and a brief overview of the algorithm used in its computation. The various properties relating to shape description and the drawbacks which render the original Radon transform unfit for shape representation are discussed. Then, the R-Transform is introduced, which is a modified form of the original Radon transform, and this modification fulfills the necessary properties such as translation and scale invariance.

Finally, the concept of a multi-level representation of a shape is introduced where

this representation is derived from a combination of the Chamfer distance transform

and the R-Transform. The final shape descriptor is the Discrete Fourier transform

extracted from this multi-level representation which is not only translation and scale

invariant but also rotation invariant. In the next Chapter, an extension of the concept of multi-level representation is used to describe 3D space time shapes obtained

from the actions and this representation is used for action feature extraction.

CHAPTER 4

ACTION RECOGNITION FRAMEWORK

The action recognition algorithm proposed in this thesis is an extension of the 2D

multi-level Radon transform based shape descriptor. As previously indicated, the

distance transform is used to segment the silhouette but the difference is that instead of segmenting a 2D silhouette, a space time shape formed from an action is

segmented into different levels. The segmentation is done by first computing the distance transform of the 3D space time shape and then, computing the gradient. This

gradient serves as the benchmark for multi-level space time shape representation.

The 3D distance transform used is based on the Euclidean distance transform which

can be approximated by different metrics such as the Chamfer distance transform

or the Quasi-Euclidean distance transform discussed in the previous Chapter. The

algorithm is as follows:

1. Extract the silhouette by separating the foreground from the video frame using median background subtraction.

2. Concatenate a predefined number of silhouettes extracted in the previous step into space time shapes.

3. Apply the 3D distance transform with an anisotropic aspect ratio, giving more weight to the time axis.

4. Compute the normalized gradient of the distance transform.

5. Segment the space time shape into multiple levels using the normalized gradient.

6. Apply the R-Transform to each frame at each level of the space time cube to get a 3D feature vector.

7. Extract the R-Translation vector from the coarsest level of the space time shape and concatenate it with the 3D feature vector obtained from the previous step to form the final feature vector.

8. Classify the feature vector by comparing it with the features in the database using the nearest neighbor approach.

4.1

TIME SHAPE

Extraction of the binary human silhouette from the video sequence is done by a

simple foreground segmentation where it is assumed that the video sequence has

a stationary background and the only object which moves in the video sequence

is the individual performing a certain action. Using a background model obtained

from the video sequence, every frame is compared with the background to segment

out the foreground pixels. Using morphological operations of dilation and erosion,

the holes in the foreground silhouette and noisy background pixels are removed.

The silhouettes extracted from a predefined number of consecutive frames are then

concatenated to form the space time shape.

Here, the median background image is used as the background model. In a median background, each pixel value is the median of the corresponding pixel in all the frames of the video sequence. The median background is preferred over the mean background for two reasons. The first is that the median background can be extracted directly from a video sequence containing an action which involves movement of the entire silhouette across the frame and thus does not require a separate background video sequence. For actions which do not contain a large movement of the entire silhouette and where the torso of the silhouette is almost stationary, however, a separate background video sequence should be used. The second reason is that the median of the pixel values is not much affected by outliers when compared to their mean, where the outliers are caused by the movement of the person across the frame. The outlier pixel values in fact reduce the mean by a significant amount but not the median. This is illustrated in Figure 4.1 where the mean background computed from a video sequence is a little darker than the median background. Also, some portions of the mean background image have slightly dark patches caused by the movement of the person across the same region in the video sequence. It can be seen that the median background image does not have such dark patches and, moreover, its overall brightness is almost the same as that of each frame. It should be noted that this background segmentation is not implemented in real time as the background model image is learned after making a single pass across the video. Silhouette extraction is done from the second pass onwards. Once the background image is learned, every frame in the video sequence is compared with the background image by taking the absolute difference between the two. The difference image is given by

$$\mathit{diffImage}_k(i, j) = |\mathit{frame}_k(i, j) - \mathit{backgnd}(i, j)| \tag{4-1}$$

where $(i, j)$ refers to the pixel location at the $i$th row and $j$th column, $\mathit{backgnd}$ is the background image, $\mathit{frame}_k$ is the $k$th frame and $\mathit{diffImage}_k$ is the difference image computed from the $k$th frame. To get the complete human silhouette, the difference image is then thresholded to retain those pixels having a value greater than a certain threshold. The thresholding operation is given by

$$S_k(i, j) = \begin{cases} 0 & \text{if } \mathit{diffImage}_k(i, j) < T \\ 255 & \text{otherwise} \end{cases} \tag{4-2}$$

where $S_k$ is the binary silhouette image and T is the threshold used. Since the background image and the video frames are color images, the difference image computed is also a color image and the thresholding operation is performed for each color band separately using a different threshold for each band. After thresholding, the silhouette may contain holes and unwanted protrusions and so, to remove these anomalies, morphological operations such as dilation and erosion are used. The number of dilations performed on the silhouette is greater than the number of erosions. This is to ensure that no holes are present inside the silhouette, as dilation helps in closing the holes. But, due to dilation, the silhouette becomes very large and so erosions are required to bring the silhouette back to its original size. Usually, a rectangular mask is used for both dilation and erosion. An illustration of the silhouette extraction operation is shown in Figure 4.2. The silhouette extracted from the frame is slightly larger than the actual body of the person. The use of morphological operations is illustrated in Figure 4.4, which shows how varying the number of dilations and erosions affects the output. In Figure 4.4b, it is seen that without morphological operations, the silhouette has some holes and imperfections. The morphological operations reduce the holes and imperfections to some extent. The greater the number of dilations and erosions, the more the holes are reduced, but the more the shape of the silhouette gets distorted. So, a trade-off exists between the extent of the holes and the distortion of the silhouette.
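The silhouette extraction steps above can be sketched in plain Python. The sketch makes several simplifying assumptions that differ from the setup described in the text: frames are single-band (grayscale) 2D lists rather than color images, a single threshold is used, the mask is a 3 × 3 rectangle, and pixels outside the image are ignored during dilation and erosion. All function names are hypothetical.

```python
from statistics import median

def median_background(frames):
    """Per-pixel median over all frames of the sequence."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(cols)]
            for r in range(rows)]

def extract_silhouette(frame, background, threshold):
    """Absolute difference with the background, then 0/255 thresholding."""
    return [[0 if abs(p - b) < threshold else 255
             for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]

def _morph(mask, combine):
    """Apply max (dilation) or min (erosion) over each 3 x 3 neighborhood."""
    rows, cols = len(mask), len(mask[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [mask[rr][cc]
                      for rr in range(max(r - 1, 0), min(r + 2, rows))
                      for cc in range(max(c - 1, 0), min(c + 2, cols))]
            out_row.append(combine(window))
        out.append(out_row)
    return out

def dilate(mask):
    return _morph(mask, max)

def erode(mask):
    return _morph(mask, min)

def clean_silhouette(mask, n_dilations, n_erosions):
    """Dilate to close holes, then erode to shrink the silhouette back."""
    for _ in range(n_dilations):
        mask = dilate(mask)
    for _ in range(n_erosions):
        mask = erode(mask)
    return mask
```

Calling `clean_silhouette` with more dilations than erosions, as described above, leaves a silhouette slightly larger than the body; equal counts correspond to a morphological closing.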

Every space time shape has a certain overlap with the adjacent space time shapes of a video sequence, where this overlap is usually less than its length. The length of a space time shape is the number of silhouette frames used in the concatenation. So, for every video sequence in the training set, a certain number of space time shapes of length L and overlap N are formed and action features are then extracted from each of these space time shapes. A space time shape is illustrated in Figure 4.3.

[Figure residue: 3 × 3 rectangular mask of ones; panel (e) Dilation and Erosion done twice.]
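Forming space time shapes of length L with overlap N amounts to a sliding window over the sequence of silhouettes. A minimal sketch, assuming the silhouettes are held in a Python list and that trailing frames which do not fill a complete window are dropped (the text does not say how they are handled); the function name is hypothetical.

```python
def space_time_shapes(silhouettes, length, overlap):
    """Split a list of silhouette frames into windows of `length` frames,
    where consecutive windows share `overlap` frames."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    step = length - overlap
    return [silhouettes[start:start + length]
            for start in range(0, len(silhouettes) - length + 1, step)]

# Frame indices stand in for silhouette images in this example.
windows = space_time_shapes(list(range(10)), length=4, overlap=2)
```

With ten frames, a length of 4 and an overlap of 2, the windows start at frames 0, 2, 4 and 6.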

4.2

LEVELS

A space time shape can be considered as a concatenation of human body silhouettes extracted over a predefined number of frames with axes x, y and t where (x, y)

represent the frame axes and t represents the time axis. In the previous Chapter,

to describe a 2D binary image, more specifically a human silhouette, a 2D distance

transform was used to mark the interior pixels with values which are proportional

to its distance from the nearest boundary pixel. Here, an extension of the distance

transform to 3D is used where a voxel (a pixel in 3D) inside the volume spanned

by the space time shape is assigned a value which is proportional to the distance

between this voxel and the nearest boundary. The algorithm to compute the approximation of a 3D Euclidean distance is the 3-pass algorithm which is very similar

to the 2-pass used for 2D distance transforms. The difference between the 2D and

3D distance transform algorithm is that the minimum distance calculation is done

by finding the local minima of the lower envelope of the set of parabolas where each

parabola is defined on the basis of the Euclidean distance between the two points.

Also, when computing the 3D distance transform of a space time shape, the aspect

ratio is selected such that the time t axis gets more emphasis. A normalized gradient

of this distance transform is used to segment the space time shapes into different

levels. The number of levels and the interval between the levels is chosen so that

the most coarse level will be the concatenation of only the silhouette torsos without

the limbs obtained from each frame. In other words, the emphasis on the limbs is

reduced as the level becomes coarser and coarser.

4.2.1

The 3D distance transform is computed in three passes where each pass is associated with a raster scan in a single dimension. In two of the passes, the distance transforms computed are bounded by the parabolas defined on the boundary voxels in the respective dimension. This type of distance transform measure is given by

$$D_f(p) = \min_{q \in B} \left( (p - q)^2 + f(q) \right) \tag{4-3}$$

where B is the set of boundary points and f(q) is the value of the distance measure between points p and q. It is seen that for every $q \in B$, the distance transform is bounded by the parabola rooted at (q, f(q)). In short, the distance transform value at point p is the minimum of the lower envelope of the parabolas formed from every boundary point q. This idea can be used for the 3D distance transform computation, where the parabolas along a dimension are defined from the boundary points along that dimension.

Let U (x, y, t) be the 3D distance transform computed for a space time shape

S(x, y, t). The 3D distance transform U (x, y, t) is computed by using the following

lemmas : Consider F (X, k) as the family of parabolas where each parabola is defined

on the boundary voxel at one of the dimensions X. This family is given by

i

(4-4)

where the scanning is done in (X, M ) plane along the X dimension, ki is the value of

the boundary voxel i along the dimension X and at fixed dimension M , and Gi (X)

is an arbitrary distance measure computed along the dimension X from each of the

57

boundary voxel i. Let Hk (X, M ) be one of the parabolas in the family of parabolas

F (X, M ). The scanning is done in three passes. In the first pass, the scanning

is done along the y axis in a raster scan manner to get an intermediate distance

transform values. The algorithm used in this pass uses a 1D mask oriented along

the y axis for distance transform computation and its value reflects the shortest

distance of a voxel to the nearest boundary voxel which are along the y-axis or the

row axis. The boundary pixels on the left or right columns are not considered for

the computation during this pass. The second pass includes scanning along the x

axis direction in the raster scan manner. The distance transform computation at an

arbitrary voxel depends on the family of parabolas where each parabola is defined

at the boundary voxels occurring along the x axis direction in the x-y plane. In

fact, this approximation to the distance transform is actually the minimum distance

of that particular voxel to the lower envelope of the family of parabolas F (x, y). In

the third pass, the scanning is done in the t-axis direction where again just like in

the second pass, the distance transform values depend on the family of parabolas

but, here, each parabola is defined on the boundary voxels which are occurring along

the t-axis direction in the y-t plane. The final 3D distance transform of the 3D

space time shape is obtained after these three passes. It should be noted that

weights on the computation of the distance transform can be included in each of the

passes. This provides the flexibility to incorporate non-isotropic aspect ratio so as

to emphasize the variation in a particular dimension.
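The three passes and the per-axis weights can be sketched as below. This is an illustrative brute-force version under stated assumptions, not the thesis implementation: the names `envelope_1d` and `dt_3d`, the `shape[t][y][x]` indexing and the weight parameters `wy`, `wx`, `wt` are hypothetical, and the result is a weighted squared distance to the nearest voxel outside the shape.

```python
INF = float("inf")

def envelope_1d(f, w):
    # D(p) = min over q of w*(p - q)**2 + f[q]  (brute force, O(n^2))
    n = len(f)
    return [min(w * (p - q) ** 2 + f[q] for q in range(n)) for p in range(n)]

def dt_3d(shape, wy=1.0, wx=1.0, wt=1.0):
    """shape[t][y][x] is 1 inside the space time shape and 0 outside."""
    T, Y, X = len(shape), len(shape[0]), len(shape[0][0])
    d = [[[INF if shape[t][y][x] else 0.0 for x in range(X)]
          for y in range(Y)] for t in range(T)]
    # Pass 1: scan along the y axis.
    for t in range(T):
        for x in range(X):
            col = envelope_1d([d[t][y][x] for y in range(Y)], wy)
            for y in range(Y):
                d[t][y][x] = col[y]
    # Pass 2: scan along the x axis, using the pass-1 values as f(q).
    for t in range(T):
        for y in range(Y):
            d[t][y] = envelope_1d(d[t][y], wx)
    # Pass 3: scan along the t axis.
    for y in range(Y):
        for x in range(X):
            line = envelope_1d([d[t][y][x] for t in range(T)], wt)
            for t in range(T):
                d[t][y][x] = line[t]
    return d
```

Choosing `wt` different from `wx` and `wy` changes how strongly separation along the time axis contributes, which is one way to realise the non-isotropic aspect ratio mentioned above.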

4.2.2

Segmentation of a 3D Shape

Once the distance transform has been computed for the space time shape, the gradient of this distance transform is taken and this is used as the bookmark for segmenting

it into different levels. Some sample frames of the distance transformed image for

various aspect ratios are shown in Figure 4.5a, Figure 4.5b, and Figure 4.5c. It is

seen that the area covered by the torso part of the body has higher values than the

area covered by the limbs. By varying the aspect ratio, the different axes x,y and t

will have different emphasis on the computed distance transform. In other words, the

distance transform is tuned to the variation in the axes non-uniformly. Therefore, to

emphasize the time variation of a space time cube, the aspect ratio is selected such

that time axis t is scaled more than the other axes. This is expected from the distance transformed space time shape where the torso being the interior part of shape

will have higher values. But, human actions are distinguished from the variation of

the silhouette and these variations are more along the limbs than in the torso. So,

a better representation of the space time shape is required which emphasizes fast

moving parts so that features extracted gives the necessary variation to represent

the action. Thus, a normalized gradient of the distance transform is used and, as

shown in Figure 4.6a, Figure 4.6b and Figure 4.6c, the fast moving parts such as the

limbs have higher values compared to the torso region. The gradient of the space

time shape Φ(x, y, t) is defined as

$$\Phi(x, y, t) = U(x, y, t) + K_1 \frac{\partial^2 U}{\partial x^2} + K_2 \frac{\partial^2 U}{\partial y^2} + K_3 \frac{\partial^2 U}{\partial t^2} \tag{4-5}$$

where U (x, y, t) is the distance transformed space time shape, Ki is the weight added

to the derivative taken along the ith axis. The weights associated with the gradients

along each of the axes are usually kept the same. It is seen that the proper variation

occurs in Figure 4.6c where the time axis has more emphasis. The fast moving

parts, in this case the hands and legs, have high values; the region surrounding

the torso, which is not so fast moving, has moderate values; and the torso region,

which moves very slowly with respect to the limbs, has very low values. Moreover,

this representation also contains the concatenation of silhouettes from the previous frame

onto the current frame due to the gradient and so, the time nature is emphasized

Figure 4.5: Sample Frames of the 3D Distance Transformed Space-time Shapes with Various Aspect Ratios.

Figure 4.6: Sample Frames of the Normalized Gradient of the Space-time Shape with Various Aspect Ratios.

Jumping Jack Action.

in a single frame of the space time shape. In short, this representation of the space

time shape is tuned more towards the time variation where this variation is directly

related to the action being performed. The normalized log gradient is given by

$$L(x, y, t) = \frac{\log(\Phi(x, y, t))}{\max_{(x, y, t) \in S} \Phi(x, y, t)} \tag{4-6}$$

This normalized gradient is used to segment the space time shape into multiple

levels where, at each level, the features corresponding to the silhouette variations

in a frame are extracted. The statistics of L(x, y, t) such as the minimum value

and the standard deviation are computed and these statistics are used to define the

interval between adjacent levels. An illustration of 8-level segmentation of a space

time shape for different frames is shown in Figure 4.7a, Figure 4.7b, and Figure

4.7c. The segmentation is done on each frame using the values of the normalized

gradient and, from each level, a particular set of features is extracted. In the next

section, the type of features extracted from the space time shape will be discussed.
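One plausible way to turn the minimum and the standard deviation into level boundaries is sketched below. The exact rule is not spelled out in the text, so everything here is an assumption for illustration: the names `level_thresholds` and `level_masks`, the `scale` factor, the equally spaced thresholds starting at the minimum, and the choice that a lower threshold keeps only the low-gradient (torso) region.

```python
import statistics

def level_thresholds(values, n_levels, scale=1.0):
    """Thresholds th_i = min + i * scale * std for i = 1..n_levels (assumed rule)."""
    lo = min(values)
    step = scale * statistics.pstdev(values)
    return [lo + i * step for i in range(1, n_levels + 1)]

def level_masks(frame, thresholds):
    """frame: 2D list of normalized gradient values, None outside the silhouette.

    Returns one binary mask per threshold; a pixel belongs to a level
    when its value does not exceed that level's threshold.
    """
    return [[[1 if v is not None and v <= th else 0 for v in row]
             for row in frame]
            for th in thresholds]

# Hypothetical values gathered over the whole space time shape.
values = [1, 1, 3, 3]
print(level_thresholds(values, 3))
```

Under this assumed rule the masks are nested, so the mask with the smallest threshold plays the role of the coarsest level.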

4.3

There are two sets of features which are extracted from the segmented space time

shape. One is the set of translation invariant R-Transform features extracted at each

level and the other is the R-Translation vectors which are extracted from the coarsest

level. The R-Transform, mentioned in Chapter 3, describes the 2D shape variations

present in a frame at a certain level. The set of R-Transforms taken across the

frames of the space time shape at each level gives the variation which corresponds to

a particular action. The R-Translation vector taken at the coarsest level emphasizes

the translatory variation of the entire silhouette across the frames of the space time

shape while reducing the emphasis on the 2D shape variation to a large extent.

4.3.1

The R-Transform feature set is the set consisting of elements where each element is

given by

$$R_{k,l}(\theta) = \int_{-\infty}^{\infty} T_{k,l}^2(s, \theta)\, ds \tag{4-7}$$

where T_{k,l}(s, θ) is the 2D Radon transform of the frame k of the space time shape at

level l and θ ∈ [0, π[ is the angle of inclination of the line on which the silhouette in the

frame is projected. For a space time shape containing K frames and for L number

of levels, the R-Transform feature set is a 3D matrix of size L × M × K where M is

the number of angles on which the projection is taken. Typically, M is taken as 180.
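The definition above can be illustrated with a crude discrete Radon transform of a single binary frame. This is only a sketch under stated assumptions: the nearest-bin projection in `radon`, the function names and the list-of-rows image format are hypothetical, and the normalization that gives scale invariance (Chapter 3) is left out.

```python
import math

def radon(img, n_angles=180):
    """Crude discrete Radon transform T[angle][s] of a binary image."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    half = int(math.ceil(math.hypot(h, w) / 2.0))
    T = [[0.0] * (2 * half + 1) for _ in range(n_angles)]
    for a in range(n_angles):
        theta = math.pi * a / n_angles
        c, s = math.cos(theta), math.sin(theta)
        for y in range(h):
            for x in range(w):
                if img[y][x]:
                    # Signed distance of the pixel from the image centre
                    # along the projection direction, rounded to a bin.
                    rho = (x - cx) * c + (y - cy) * s
                    T[a][int(math.floor(rho + 0.5)) + half] += 1.0
    return T

def r_transform(T):
    """R(theta) = sum over s of T(s, theta)**2, one value per angle."""
    return [sum(v * v for v in row) for row in T]

# Two horizontally adjacent pixels: they project to different bins at
# 0 degrees and to the same bin at 90 degrees.
R = r_transform(radon([[1, 1]]))
print(R[0], R[90])
```

Stacking `r_transform` outputs over the K frames and L levels would give the L × M × K feature set described above, with M equal to `n_angles`.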

(b) walk

Figure 4.8: R-Transform Set for Single Level of a Space Time Shape.

The surface plot for the R-Transform feature set for a single level is shown in Figure

4.8a and Figure 4.8b. This gives the variations of the silhouette body shape across

the frames. These variations differ from action to action and are independent of the

person performing that particular action. The reason is that these R-Transforms are

scale invariant, as explained in Chapter 3, and thus the variation of the silhouette

shape with person is removed to a great extent. A person who has a large silhouette

can be considered as a normalized silhouette shape which is scaled up and a person

who has small silhouette can be considered as a normalized silhouette which is scaled

down. Scale invariance property of the R-Transform removes these scaling variations

in the silhouette and only captures the variation in the silhouette due to the change

of shape with respect to time. Thus, these variations correspond only to the action,

independent of the person who performed it. Moreover, the R-Transform is also

translation invariant, which means that the silhouette shape variations which

correspond to an action are captured irrespective of the position of the silhouette in

the frames. In other words, the silhouette is not required to be centered in the frame.

When the silhouette is rotated, the R-Transform gets shifted by a certain amount.

But human actions seldom contain variations caused by the rotation of the human

body and so rotation invariance does not need to be incorporated in the feature vector.

The human silhouettes either stay in place or move in a translatory manner when an

action is performed.

4.3.2

The R-Transform feature set gives the variations of the silhouette shape across the

frames but removes the variation caused due to translation of the shape. Therefore,

to distinguish between actions which have large translatory motions such as walk and

run actions from those which have very little translatory motion such as single hand

wave action, another set of features should be extracted which gives the translatory

variation while minimizing the time variation of the shape. This type of feature is

known as the R-Translation vector. This feature vector extracted from a frame k of

the space time shape at the coarsest level, is given by

$$RT_k(s) = \int_{0}^{\pi} T_{k,1}^2(s, \theta)\, d\theta \tag{4-8}$$

where T_{k,1} is the 2D Radon transform of the centered silhouette present at the frame

k. The R-Translation vector is obtained by integrating the squared 2D Radon transform over

the variable θ. Before the extraction of the R-Translation vector, the silhouette in

every frame of the space time shape is shifted with respect to the position of the

silhouette in the first frame. The position of the silhouette in the first frame is given

by its centroid and the distance from this centroid to the center of the frame is

calculated. Then, the silhouette in every frame is shifted by this distance so that the

centroid of the silhouette in the first frame coincides with the center of the frame. The

R-Translation vector is then extracted from the modified silhouettes and the variation

in this vector across the frames gives the translation of the silhouette. The set of

R-Translation vectors extracted from the space time shape is a matrix of size K × M

where K is the number of frames and M refers to twice the maximum distance of the

projection line from the centroid of the silhouette where this projection line is used

in the Radon transform computation. The set of R-Translation vectors extracted

is illustrated in Figure 4.9a and Figure 4.9b. At every frame k, the figure shows a

Gaussian-like function having a peak at s = M/2 and these Gaussian-like functions

do not vary much across the frames for the jumping jack action but for the walk

action, there is a considerable variation. This shows that the jumping jack action

has less translatory motion than the walk action. The small variations that occur in

the R-Translation vectors of the jumping jack action are due to the time variations in

the silhouette shape but, unlike in the R-Transform feature set, those

types of variations are given less emphasis in the R-Translation vector.

(b) walk
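The alignment to the first frame and the integration over the angle can be sketched as follows. The helper names (`centroid`, `shift_frame`, `align_to_first`, `r_translation`), the integer rounding of the shift and the `T[angle][s]` layout of the Radon transform are assumptions made for this example; the Radon transform itself is taken as given.

```python
def centroid(frame):
    """Mean (row, column) of the foreground pixels of a binary frame."""
    pts = [(y, x) for y, row in enumerate(frame) for x, v in enumerate(row) if v]
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def shift_frame(frame, dy, dx):
    """Shift a binary frame by (dy, dx); pixels leaving the frame are dropped."""
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if frame[y][x] and 0 <= y + dy < h and 0 <= x + dx < w:
                out[y + dy][x + dx] = 1
    return out

def align_to_first(frames):
    """Apply to every frame the shift that centres the first silhouette."""
    h, w = len(frames[0]), len(frames[0][0])
    cy, cx = centroid(frames[0])
    dy = int(round((h - 1) / 2.0 - cy))
    dx = int(round((w - 1) / 2.0 - cx))
    return [shift_frame(f, dy, dx) for f in frames]

def r_translation(T):
    """RT(s) = sum over angles of T[angle][s]**2, one value per offset s."""
    return [sum(row[s] ** 2 for row in T) for s in range(len(T[0]))]
```

Because every frame receives the same shift, any later displacement of the silhouette relative to the first frame is preserved, and it is this displacement that shows up as variation of `r_translation` across the frames.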

4.4

Classification of the action features is done using the nearest neighbor approach

where a data point in the classifier is the action feature set extracted from a space

time shape. The extracted action features are the concatenation of the R-Transform

feature set and the R-Translation vector set, which is done by first converting the

multi-dimensional R-Transform feature set and the R-Translation vector set into

single dimensional vectors and then appending one to the other to form the

complete single dimensional action feature vector. This action feature vector, used

as a test vector, is then compared with the action feature vectors of the space time

shapes of each training action class using the Euclidean distance metric. The class

of the action feature vector which has the minimum distance from the test vector

is noted and the space time shape corresponding to this test vector is then grouped

under this class.
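The classification step described above can be sketched in a few lines. The function names and the nested-list representation of the feature sets are assumptions for illustration; only the flattening, the concatenation and the Euclidean nearest neighbor rule come from the text.

```python
import math

def flatten(a):
    """Flatten an arbitrarily nested list into a single list of numbers."""
    if isinstance(a, (list, tuple)):
        return [v for item in a for v in flatten(item)]
    return [a]

def action_vector(r_transform_set, r_translation_set):
    """Concatenate the two flattened feature sets into one feature vector."""
    return flatten(r_transform_set) + flatten(r_translation_set)

def nearest_neighbor(test_vec, train_vecs, train_labels):
    """Return the label and Euclidean distance of the closest training vector."""
    best_label, best_dist = None, float("inf")
    for vec, label in zip(train_vecs, train_labels):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(test_vec, vec)))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

# Hypothetical two-dimensional feature vectors, for illustration only.
print(nearest_neighbor([0, 0], [[3, 4], [1, 0]], ["walk", "bend"]))
```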

4.5

SUMMARY

This Chapter explains the action recognition algorithm proposed in this thesis. The

algorithm used the concept of the space time shape and an extension of the idea of

multi-level representation using the modified R-Transform. The extension was to use

a 3D distance transform to represent a space time shape and then, use the normalized

gradient of this transform to segment it into different levels. Once segmented, the R-Transform is extracted from every frame at every level to get an R-Transform feature

set. To get the translatory variance of the silhouette across the frames, another

feature known as the R-Translation vector was defined and this vector was extracted

from the coarsest level. These two features are concatenated to form the complete

feature vector which is then used in a nearest neighbor classifier.

CHAPTER 5

EXPERIMENTAL RESULTS

The algorithm proposed in this thesis is implemented using MATLAB 7.4, OpenCV version 1.0, and C++ on a computer with 3.0

gigabytes of RAM and a 2.00 GHz processor with Ubuntu 9.10 as the operating system. The

background segmentation using the median background model and the silhouette extraction by the morphological operations of dilation and erosion are implemented in C++

using the OpenCV vision library. The silhouettes are then loaded into the MATLAB environment and the extraction of features is implemented using the functions

available in MATLAB. Testing of the action features extracted and the computation

of the recognition accuracy are performed in MATLAB using the nearest neighbor

classifier which is also implemented in the same environment.

The algorithm is tested on the Weizmann dataset which consists of 90 low-resolution video sequences each having a single person performing a particular action. Each video sequence is taken at 50 fps with a frame size of 180 × 144. This

dataset contains 10 different action classes with each action class containing 9 sample video sequences, each performed by a different person. The various

action classes are bend, jump-in-place, jumping-jack, jump-forward, run,

gallop-sideways, wave-one-hand, skip, wave-two-hands and walk. Space

time shapes are extracted from each of the video sequences by using a window along

the time axis where this window is of a pre-defined length. Each shift of the window

by a pre-determined amount gives rise to a space time shape. If the shift is less

than the length of the window, the space time shape will have an overlap with the

previously extracted space time shape, both belonging to the same video sequence.

The training data set thus consists of the space time shapes extracted from each of

the video sequences in the database. For evaluation of the algorithm, a variation of

the leave-one-out procedure is used where, to test a video sequence, the space time

shapes corresponding to that particular video sequence are taken as the testing data

and these same space time shapes are left out of the training data. Classification

is done by comparing the features extracted from the test space time shape with

those extracted from the training space time shapes and, then, using the nearest

neighbor rule to classify the same. The evaluation of the algorithm is conducted in

two phases. The first is that the window size and overlap are varied and the recognition rates are plotted for each combination. The second involves the comparison of

this algorithm with other methods or existing algorithms. The results are provided

in the form of confusion matrices and bar graphs where the confusion matrix gives

the recognition rates for each set of parameters and the bar graphs give the overall

accuracy achieved by the algorithm. The confusion matrix is a matrix which gives

the classification rates of each of the action classes obtained by the algorithm. Each

row of the matrix refers to the actual action class of the space time shapes and each

column refers to the action class under which the space time shapes are classified.

The quantity given in each entry gives the percentage of the space time shapes classified under various classes. The notation for each of the actions is given in Table 1.
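The window-based extraction and the leave-one-out variation described above can be sketched as follows. The function names and the use of sequence identifiers are assumptions for this example; the window `length` and `overlap` correspond to the (Length,Overlap) pairs used in the experiments.

```python
def window_starts(n_frames, length, overlap):
    """Start indices of the space time shapes cut from one video sequence."""
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than length")
    return list(range(0, n_frames - length + 1, step))

def leave_one_sequence_out(shape_seq_ids, test_seq):
    """Split space time shape indices by the sequence they were cut from.

    All shapes from test_seq form the test set and are removed from
    the training set.
    """
    test = [i for i, s in enumerate(shape_seq_ids) if s == test_seq]
    train = [i for i, s in enumerate(shape_seq_ids) if s != test_seq]
    return train, test

# A hypothetical 20-frame sequence with (Length,Overlap) = (10,5).
print(window_starts(20, 10, 5))
```

Holding out whole sequences rather than single space time shapes matters here because overlapping windows from the same sequence share frames.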

TABLE 1: Action Notation

Notation  Action
a1        Bend
a2        Jump-In-Place
a3        Jumping-Jack
a4        Jump-Forward
a5        Run
a6        Gallop-Sideways
a7        Wave-One-Hand
a8        Skip
a9        Two-Hand-Wave
a10       Walk

5.1

In this experiment, the lengths of the space time shape and the overlap are varied.

Six combinations of (Length,Overlap) are taken: (6,3), (8,4), (10,5), (12,6), (14,7)

and (16,8). The experiment shows that the bend, jump-in-place and single-hand-wave actions have very good accuracies of over 95% while actions such

as jumping-jack, two-hand-wave and walk have an accuracy of over

92%. Some actions such as jump-forward, run and gallop-sideways have a

fair accuracy of over 88%. But one action, skip, has a poor accuracy of

around 60% for all lengths of the space time shapes. Here, the skip action

is being confused with the walk action and the run action. This is because

of the similarities in the variation of the silhouette shape for the skip with the

walk and run actions and hence, the features extracted using the algorithm

are not good enough to discriminate them. But the algorithm, in spite of the low accuracy for the skip action, obtains a fairly good overall classification accuracy of over

90%. The average accuracies obtained with different space time shape lengths are

shown in Figure 5.1, inclusive and exclusive of the skip action. It is seen that the

overall accuracy obtained without the skip action is higher than the one obtained

with the skip action and that accuracy peaks around 95%. Moreover, even with

the skip action, the overall accuracy still peaks around 92%, which is still good.

Further analysis shows that the accuracy achieved is better for the space time shapes

having larger length. This is because more time variations are included in the longer

space time shapes, which provides better discrimination between action features.

The optimal length of the space time shape is found to be around 12 or 14 where

the overall accuracy is around 95% without the skip action and 92% with it. The

Figure 5.1: Bar Graph Showing the Overall Accuracy Obtained with the Proposed

Algorithm for Different Lengths of the Space Time Shape. The Overlap is just Half

the Length of the Space Time Shape.

confusion matrices for the (12, 6) and (14, 7) are given in Table 2. The variation in the

accuracy of each type of action with the increase in length and overlap of the space

time shape is illustrated in Figure 5.2. It is seen that the jump-in-place action has

100% accuracy for all lengths of the space time shape while the bend action, except

for the (6,3) length-overlap, also has 100% accuracy. This is because the action

features extracted from these actions do not share any similarities with the action

features extracted from other actions. The single-hand-wave and the two-hand-wave actions have a good accuracy of above 92% for all lengths. The jump-forward

action has a fairly consistent accuracy of 88% for all lengths and this action often

gets confused with skip action due to feature similarity. The jumping jack shows

a good consistent accuracy of around 92% and this action is sometimes confused with

the two-hand-wave action. Actions such as the walk and run action show a

consistent accuracy of about 95% and 85% respectively and as mentioned before,

these sometimes get confused with the skip action. The gallop-sideways action

accuracy is not very consistent as shown and it oscillates wildly with the lengths of

the space time shape. This is because this action gets classified under the skip

action again due to the features being similar. The action which shows a consistent

but poor accuracy with the proposed algorithm is the skip action as very often it

gets classified under the walk, gallop-sideways or run actions. The error rate

of the actions with the increase in length further explains the points mentioned and

this is illustrated in Figure 5.3.
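The overall accuracies quoted above can be checked against the diagonal of the (12,6) confusion matrix in Table 2a. The sketch below assumes that the overall accuracy is the unweighted mean of the per-action recognition rates; the thesis may instead weight each action by its number of space time shapes, so the figures are indicative only. The dictionary name and the `exclude` parameter are hypothetical.

```python
# Diagonal entries of Table 2a, (Length,Overlap) = (12,6), in percent.
diag_12_6 = {
    "bend": 100.0, "jump-in-place": 100.0, "jumping-jack": 96.8,
    "jump-forward": 88.1, "run": 86.0, "gallop-sideways": 96.4,
    "wave-one-hand": 97.7, "skip": 65.5, "two-hand-wave": 94.1,
    "walk": 93.4,
}

def mean_accuracy(diag, exclude=()):
    """Unweighted mean of the per-action recognition rates."""
    vals = [v for k, v in diag.items() if k not in exclude]
    return sum(vals) / len(vals)

print(round(mean_accuracy(diag_12_6), 1))                     # all ten actions
print(round(mean_accuracy(diag_12_6, exclude=("skip",)), 1))  # without skip
```

Under this assumption the mean is about 91.8% with the skip action and about 94.7% without it, in line with the roughly 92% and 95% reported for this length.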

5.2

METHODS

In this section, the comparison of the proposed algorithm is made with the other

methods where each method uses a different shape descriptor. In fact, the comparisons are actually between the shape descriptors used for action recognition. The

TABLE 2: Confusion Matrices for the (Length,Overlap) Combinations (12,6) and (14,7).

(a) (Length,Overlap) = (12,6)

         a1     a2     a3     a4     a5     a6     a7     a8     a9    a10
a1    100.0      0      0      0      0      0      0      0      0      0
a2        0  100.0      0      0      0      0      0      0      0      0
a3        0      0  96.80      0      0      0      0      0   3.22      0
a4        0      0      0  88.10      0      0      0   11.9      0      0
a5        0      0      0      0  86.00      0      0     10      0      4
a6        0      0      0      0      0  96.40      0      0      0   3.63
a7        0   2.27      0      0      0      0  97.70      0      0      0
a8        0      0      0   3.45   15.5      0      0  65.50      0   15.5
a9        0   4.76    1.2      0      0      0      0      0  94.10      0
a10       0      0    2.2    1.1    1.1      0      0    2.2      0  93.40

(b) (Length,Overlap) = (14,7)

         a1     a2     a3     a4     a5     a6     a7     a8     a9    a10
a1    100.0      0      0      0      0      0      0      0      0      0
a2        0  100.0      0      0      0      0      0      0      0      0
a3        0      0  94.90      0      0      0      0      0   5.13      0
a4        0      0      0  89.60      0      0      0   10.4      0      0
a5        0      0      0      0  85.40      0      0   9.76      0    4.9
a6        0      0      0      0      0  91.10      0      0      0   8.89
a7        0      0      0      0      0      0  98.60      0    1.4      0
a8        0      0      0   4.17  18.75      0      0  62.50      0   14.6
a9        0      0      0      0      0      0      0      0  100.0      0
a10       0      0   3.89      0   1.29      0      0      0      0  94.80

Shape.

Figure 5.3: Error Rate for Different Combinations of (length,overlap) of Space Time

Shape.

recognition accuracies and their error rates of the algorithm have been compared

with methods where the R-Transform and the Zernike moments are being used as

shape descriptors. Each of these descriptors describes a 2D silhouette shape and the variation

of this descriptor across the space time shape is taken as the action features. The

comparison of accuracies is shown as a bar graph in Figure 5.4. As seen, the lowest

accuracy obtained is by using the R-Transform directly on the 2D silhouette and

considering its variation as action features. This is because the R-Transform only

describes the variation of the silhouette along the x y plane. No time variation is

being captured by the R-Transform and the variation in the 2D silhouette description

across the space time shape emphasizes the time variation to a minimal extent. This

time variation does not represent the action performed by the person completely

and this can be seen in the low accuracy obtained. The same applies to the

2D Zernike moments where again the time variation captured does not completely

pertain to the action, although it does achieve better accuracy than the R-Transform.

The better accuracy is because Zernike moment shape descriptors are region based

descriptors while the R-Transform is a boundary-based descriptor and so, the time

variation in the Zernike moment descriptor across the frames of the space time shape

captures the action with more emphasis than the time variation in the R-Transform

shape descriptor. In the algorithm, one major difference from the above mentioned

methods is that the time variation is captured before applying the shape descriptor

and this is done by taking the gradient of the distance transform across the x,y, and

t axes and then, using that to segment the space time shape into different levels.

This captures the action much more efficiently than the simple R-Transform method

and the Zernike moment method and hence, achieves much better accuracy.

The individual recognition rates achieved with the Zernike moments and the R-Transform are shown in Table 3. As seen in the table, the individual recognition

rates achieved with the proposed algorithm are far better than the ones achieved with

Shape for Different Shape Descriptors.

TABLE 3: Confusion Matrices Obtained with Zernike Moments, R-Transform and Poisson's Equation Shape Descriptors.

(a) Zernike Moments

         a1     a2     a3     a4     a5     a6     a7     a8     a9    a10
a1    98.10      0      0      0      0      0   1.85      0      0      0
a2        0  83.10      0      0      0      0   16.9      0      0      0
a3     1.07      0  91.40      0      0      0      0   2.15   1.08    4.3
a4        0      0      0  88.10      0      0      0   11.9      0      0
a5        0      4      4      2  50.00      4      2     10      0     24
a6        0   9.09      0      0      0  70.90   16.4   1.81      0   1.81
a7        0   7.96      0      0      0      0  50.00      0   42.1      0
a8        0      0   10.3   17.2   12.1      0  22.40   18.9   5.17   13.8
a9        0      0      0      0      0      0   26.2      0  73.80      0
a10       0      0      0      0   10.9   3.29   2.19   9.89   2.19  71.40

(b) R-Transform

         a1     a2     a3     a4     a5     a6     a7     a8     a9    a10
a1    87.00      0      0   5.56   1.85      0      0   5.56     50      0
a2        0  66.20   9.85      0      0   11.3   2.81   4.23   1.41   4.23
a3        0   9.68  71.00      0      0   2.15      0      0     14   3.23
a4     10.2   1.69      0  57.60   1.69    5.1   3.39   11.9      0    8.5
a5        4      4      0      2     32     12  22.00     20      0      4
a6        0   12.7   5.45   3.63   5.45  47.30   3.63      0      0   21.8
a7        0   2.27      0      0   7.95   2.27  86.40   1.13      0      0
a8     8.62   12.1   1.72   6.89   32.8   5.17   10.4  13.80   1.72   6.89
a9        0   4.76   13.1      0      0   1.19      0      0  78.60   2.38
a10       0    6.6    7.7    7.7    2.2   17.6    1.1    3.3    5.5  48.40

(c) Poisson Shape Descriptor

         a1     a2     a3     a4     a5     a6     a7     a8     a9    a10
a1    99.10      0      0      0      0      0      0      0    0.9
a2        0  100.0      0      0      0      0      0      0      0      0
a3        0      0  100.0      0      0      0      0      0      0      0
a4        0      0      0  89.20      0      0      0   10.8      0      0
a5        0      0      0      0  98.00      0      0    0.2      0      0
a6        0      0      0      0      0  100.0      0      0      0      0
a7        0    0.9    0.9      0      0      0  94.80      0    3.5      0
a8        0      0      0      0    2.9      0      0  97.10      0      0
a9        0      0    0.9      0      0      0    1.9      0  97.20      0
a10       0      0      0      0      0      0      0      0      0  100.0

the Zernike moments and R-Transform especially in the skip action where the

recognition rate achieved with Zernike moments and R-Transform is just under 20%

while that achieved with the current algorithm is above 60%. Another comparison

can be made with the current state of the art action recognition framework which

uses a Poisson-shape descriptor to capture motion variations. The confusion matrix

for this framework is provided in Table 3c. It can be seen that the proposed algorithm achieves individual action accuracies close to the state of the art, with some

action accuracies exceeding it. The only action which has lower

accuracy when compared to the state of the art is that of the skip action.

5.3

SUMMARY

This Chapter evaluates the algorithm in two ways. The first way is to vary the

length and overlap of the space time shape and plot their average accuracies as

a bar graph and plot the individual action accuracies and error rates. The plots

are analyzed and it is found that the best combinations of (length, overlap) are

(12, 6) and (14, 7). The second way of evaluation is by comparing the algorithm with

other methods such as using moment-based shape descriptors and just the Radon

transform based shape descriptor as shape representation. The proposed algorithm

shows comparatively higher accuracy. The comparison is also done with another

action recognition framework which used a Poisson's equation based shape descriptor

for shape representation. The accuracies obtained with the proposed algorithm are

comparable, with some actions achieving better accuracy while one particular action

has a lower accuracy.

CHAPTER 6

CONCLUSIONS AND FUTURE WORK

This thesis has presented an algorithm which uses the concept of multi-level

representation of a 3D shape for action classification. An action has been considered

as a space time shape or a 3D shape, and a multi-level representation using the gradient of the 3D distance transform and the Radon transform has been used, from

which the action features have been extracted. Silhouettes from a video sequence

containing a particular action have been concatenated to form the space time shapes

representing that action. Action features were extracted from each level of the representation and these features, concatenated into a single feature vector, were used in a nearest

neighbor classifier for recognition.

The evaluation of the algorithm was performed by comparison of the recognition

accuracies with those which used shape descriptors such as Zernike moments and the R-Transform to represent the space time shape. The results obtained showed higher

accuracy rates for the proposed algorithm. Further comparison has been done with

the action recognition framework which used a shape descriptor based on Poisson's

equation for space time shape representation. The results obtained showed comparable accuracy rates for the majority of the actions, with some of them being greater

with the proposed algorithm. Another evaluation was performed by computation

of the accuracies and error rates obtained with different lengths and overlaps of the

space time shape. It was found that the best space time shape parameters

(length, overlap) were (12, 6) and (14, 7). The database used for these evaluations

was the Weizmann Action database which contains 90 video sequences with 10 action

classes.

Although the average accuracies are high, the accuracy for one particular action

obtained by the proposed algorithm is low as the features extracted from the space

time shape corresponding to this action cannot be discriminated from those of other

similar actions. Future work will involve extraction of more localized features so that

the average accuracy as well as the accuracy of each action is high. The use of a

more sophisticated classification technique is also considered for action recognition.

82

BIBLIOGRAPHY

[1] C.J.Cohen, K.A.Scott, M.J.Huber, S.C.Rowe and F.Morelli, Behavior recognition architecture for surveillance applications, International Workshop on

Applied Imagery and Pattern Recognition - AIPR 2008, pp. 1-8, 15-17 October,

2008.

[2] M.Mitani, M.Takaya, A.Kojima and K.Fukunaga, Environment recognition

based on analysis of human actions for mobile robot, The 18th International

Conference on Pattern Recognition - ICPR 2006 , vol. 4, pp. 782-786, 2006.

[3] J.W.Davis and A.F.Bobick, The representation and recognition of human

movement using temporal templates, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 928-934, 17-19

June 1997.

[4] G.R.Bradski and J.W.Davis, Motion segmentation and pose recognition with

motion history gradients, Fifth IEEE Workshop on Applications of Computer

Vision, pp. 238-244, 4-6 December 2000.

[5] A.F.Bobick and J.W.Davis, The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, March 2001.

[6] J.W.Davis, Hierarchical motion history images for recognizing human motion,

Proceedings of the IEEE Workshop on Detection and Recognition of Events in

Video, pp. 39-46, 8 July 2001.

[7] A.A.Efros, A.C.Berg, G.Mori and J.Malik, Recognizing action at a distance,Proceedings of Ninth IEEE International Conference on Computer Vision, vol. 2, pp. 726-733, 13-16 October 2003.

[8] J.C.Niebles, H.Wang and L.Fei-Fei, Unsupervised Learning of Human Action

Categories Using Spatial-Temporal Words,Proceedings of the British Machine

Vision Conference - BMVC 2006, 4-7 September 2006.

[9] P.Scovanner, S.Ali and M.Shah, A 3-Dimensional SIFT descriptor and its application to action recognition,Proceedings of the 15th ACM International Conference on Multimedia, 25-29 September 2007.

83

[10] S.Ali, A.Basharat and M.Shah, Chaotic invariants for human action recognition, IEEE 11th International Conference on Computer Vision - ICCV 2007,

pp. 1-8, 14-21 October 2007.

[11] J.Zhang, K.F.Man and J.Y.Ke, Time series prediction using Lyapunov exponents in embedding phase space, International Conference on Systems, Man

and Cybernetics, vol. 2, pp. 1750-1755, 11-14 October 1998.

[12] M.Ahmad and S.W.Lee, HMM-based human action recognition using multiview image sequences, The 18th International Conference on Pattern Recognition - ICPR 2006, vol. 1, pp. 263-266, September 2006.

[13] D.Batra, T.Chen and R.Sukthankar, Space-time shapelets for action recognition,IEEE Workshop on Motion and Video Computing - WMVC 2008, pp. 1-6,

8-9 January 2008.

[14] M.K.Hu, Visual pattern recognition by moment invariants, IRE Transactions

on Information Theory, vol. 8, no. 2, pp. 179-187, February 1962.

[15] D.Zhang and G.Lu, A comparative study on shape retrieval using Fourier descriptors with different shape signatures, Journal of Visual Communication and Image Representation, vol. 14, pp. 41-60, 2003.

[16] Q.Chen, E.Petriu and X.Yang, A comparative study of Fourier descriptors and Hu's seven moment invariants for image recognition, Canadian Conference on Electrical and Computer Engineering, vol. 1, pp. 103-106, 2-5 May 2004.

[17] A.Khotanzad and Y.H.Hong, Invariant image recognition by Zernike moments, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 5, pp. 489-497, May 1990.

[18] S.Tabbone, L.Wendling and J.P.Salmon, A new shape descriptor defined on the Radon transform, Computer Vision and Image Understanding, vol. 102, pp. 42-51, April 2006.

[19] L.Gorelick, M.Galun, E.Sharon, R.Basri and A.Brandt, Shape representation and classification using the Poisson equation, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - CVPR 2004, vol. 2, pp. 61-67, June 2004.

[20] M.Blank, L.Gorelick, E.Shechtman, M.Irani and R.Basri, Actions as space-time shapes, Tenth IEEE International Conference on Computer Vision - ICCV 2005, vol. 2, pp. 1395-1402, 17-21 October 2005.

[21] U.Trottenberg, C.W.Oosterlee and A.Schuller, Multigrid, Academic Press, 2001.

[22] Y.Wang, K.Huang and T.Tan, Human activity recognition based on R Transform, IEEE Conference on Computer Vision and Pattern Recognition - CVPR 2007, pp. 1-8, 17-22 June 2007.

[23] A.Mohiuddin and S.W.Lee, Human action recognition using shape and CLG-motion flow from multi-view image sequences, Pattern Recognition, vol. 41, pp. 2237-2252, July 2008.

[24] X.Sun, M.Chen and A.Hauptmann, Action recognition via local descriptors and holistic features, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 58-65, 20-25 June 2009.

[25] R.Fabbri, O.M.Bruno, J.C.Torelli and L.F.Costa, 2D Euclidean Distance Transform Algorithms: A Comparative Survey, ACM Computing Surveys, February 2008.

[26] C.R.Maurer, R.Qi and V.Raghavan, A linear time algorithm for computing exact Euclidean distance transforms of binary images in arbitrary dimensions, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 265-270, February 2003.

[27] C.M.Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[28] G.Bradski and A.Kaehler, Learning OpenCV, O'Reilly Media Inc., 2008.

[29] G.Borgefors, Distance transformations in arbitrary dimensions, Computer Vision, Graphics, and Image Processing, vol. 27, pp. 321-345, September 1984.

[30] A.Rosenfeld and J.Pfaltz, Sequential operations in digital picture processing, Journal of the Association for Computing Machinery, vol. 13, pp. 471-494, 1966.

[31] D.W.Paglieroni, Distance transforms: Properties and machine vision applications, Graphical Models and Image Processing - CVGIP 1992, vol. 54, pp. 56-74, January 1992.

[32] R.N.Bracewell, Two-Dimensional Imaging, Englewood Cliffs, NJ, Prentice Hall, 1995, pp. 505-537.

VITA

Personal Information:

Binu M Nair

Department of Electrical and Computer Engineering

Old Dominion University

Norfolk, VA 23529

Phone No : 757-288-3522

Email : binuq8@gmail.com

Education:

Master of Science in Electrical and Computer Engineering

G.P.A 3.81

August 2010

Old Dominion University, Norfolk, VA

Bachelor of Technology in Electronics and Communication

Percentage 70 %

April 2007

Cochin University of Science and Technology, Cochin, India

Work Experience:

Research Assistant

Old Dominion University, Norfolk, VA

Fall 2008 - Spring 2010

Programmer Analyst

Mindtree Consulting Ltd., Bangalore, India

July 2007-April 2008
