UNIVERSITY OF CALIFORNIA,
IRVINE
A Joint Model for Tracking and Recognizing
Human Actions in Video Sequences
THESIS
submitted in partial satisfaction of the requirements
for the degree of
MASTER OF SCIENCE
in Computer Science
by
Goutham Patnaikuni
Thesis Committee:
Professor Deva Ramanan, Chair
Professor Alexander Ihler
Professor Charless Fowlkes
2009
© 2009 Goutham Patnaikuni
The thesis of Goutham Patnaikuni
is approved and is acceptable in quality and form for
publication on microfilm and in digital formats:
Committee Chair
University of California, Irvine
2009
DEDICATION
I dedicate this thesis to Professor Paul Utgoff, who lost his battle to appendiceal cancer in
October 2008. I learnt of his passing only recently and was deeply saddened to hear about
it. It seems like it was just yesterday when I was in his office discussing homework
assignments for his Artificial Intelligence course. Professor, this is for you, with love and
resolution.
TABLE OF CONTENTS
Page
LIST OF FIGURES vi
LIST OF TABLES vii
ACKNOWLEDGMENTS viii
ABSTRACT OF THE THESIS ix
1 Introduction 1
1.1 Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Related Work 4
2.1 Local features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Global features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Other HMM based approaches . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Theory 8
3.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 Intuitions behind Margins . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.3 Functional and geometrical margins . . . . . . . . . . . . . . . . . . 10
3.1.4 The optimal margin classifier . . . . . . . . . . . . . . . . . . . . . . 12
3.1.5 Optimal margin classifiers . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.6 Multiclass SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Transition and Emission probabilities . . . . . . . . . . . . . . . . . 18
3.2.2 Maximum Likelihood for the HMM . . . . . . . . . . . . . . . . . . . 19
3.2.3 The forward-backward algorithm . . . . . . . . . . . . . . . . . . . . 23
3.2.4 Scaling Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.5 The Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Approach 37
4.1 The Support Vector Machine Approach . . . . . . . . . . . . . . . . . . . . 37
4.1.1 Optical Flow based HOG descriptors . . . . . . . . . . . . . . . . . . 37
4.1.2 A multiclass SVM framework for action recognition . . . . . . . . . 39
4.1.2.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2.2 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 The Hidden Markov Model Approach . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Building a vocabulary of visual words . . . . . . . . . . . . . . . . . 41
4.2.2 A hidden Markov model framework for video sequences . . . . . . . 43
4.2.2.1 Dimensionality reduction of optical flow features . . . . . . 43
4.2.2.2 Visual word SVM based features . . . . . . . . . . . . . . . 44
4.2.2.3 Classifying a new pre-tracked video sequence . . . . . . . . 44
4.3 A Unified model for joint tracking and recognition . . . . . . . . . . . . . . 46
4.3.1 Cross product space of location and visual words . . . . . . . . . . . 46
4.3.2 Joint model for tracking and recognition . . . . . . . . . . . . . . . . 46
4.3.2.1 Bounded velocity motion model . . . . . . . . . . . . . . . 47
4.3.2.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Motivation for joint tracking and recognition . . . . . . . . . . . . . . . . . 49
4.4.1 The Tracking problem . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4.2 Experiment and baseline . . . . . . . . . . . . . . . . . . . . . . . . . 50
5 Results 53
5.1 Action Classification on KTH Dataset . . . . . . . . . . . . . . . . . . . . . 53
5.2 Action Classification on UCF action dataset . . . . . . . . . . . . . . . . . . 58
6 Conclusion and Future Work 62
Bibliography 64
Appendices 67
A Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.1 KTH Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.2 UCF Action Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
LIST OF FIGURES
Page
3.1 Linear decision boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Maximum margin separating hyperplane and support vectors . . . . . . . . 14
3.3 The transition state diagram unfolded over time . . . . . . . . . . . . . . . 20
3.4 Forward recursion for the evaluation of the α variables . . . . . . . . . . . . 26
3.5 Backward recursion for the evaluation of the β variables . . . . . . . . . . . 28
3.6 A graphical depiction of the Viterbi algorithm . . . . . . . . . . . . . . . . . 34
4.1 Visualization of features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Cluster centers from the UCF action dataset . . . . . . . . . . . . . . . . . 42
4.3 An illustration of the joint model . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 An illustration of the bounded velocity motion model . . . . . . . . . . . . 48
4.5 Case for joint tracking and recognition . . . . . . . . . . . . . . . . . . . . . 51
5.1 Representative frames from the KTH action dataset . . . . . . . . . . . . . 54
5.2 Similarity between running and jogging actions . . . . . . . . . . . . . . . . 56
5.3 KTH sequences with tracks . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Representative frames from the UCF Sports Action Dataset . . . . . . . . . 58
5.5 UCF sequences with tracks . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
LIST OF TABLES
Page
5.1 KTH confusion matrix with SVM approach . . . . . . . . . . . . . . . . . . 54
5.2 KTH confusion matrix using the joint model and optical flow features . . . 55
5.3 KTH confusion matrix using the joint model and visual word SVM based
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Comparison of different methods on KTH . . . . . . . . . . . . . . . . . . . 55
5.5 UCF confusion matrix using SVM approach . . . . . . . . . . . . . . . . . . 59
5.6 UCF confusion matrix using the joint model with optical flow based features 59
5.7 UCF confusion matrix using the joint model and visual word SVM based
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
ACKNOWLEDGMENTS
I would like to start off by thanking Professor Deva Ramanan. Working with him for the
past year has been an absolute pleasure. He has been a great source of knowledge and
support and I cannot thank him enough for it. He is, without a doubt, the best advisor I
have ever had. My thanks also go to Professors Ihler and Fowlkes, for their time and help.
This work would not have been possible without the support of my friends here in Irvine. I
would particularly like to thank my good friend Kristian Hermansen, whom I have known
since my days as an undergraduate at the University of Massachusetts, for his constant
support and for being a source of humor at times when I needed it.
I have to thank my family, including my parents, my sister and my brother in law for being
there for me and believing in me. I particularly appreciate all the intellectual conversations
I have had with my brother-in-law about my research. It is good to have someone with a
PhD from Stanford in your family.
ABSTRACT
A Joint Model for Tracking and Recognizing
Human Actions in Video Sequences
By
Goutham Patnaikuni
Master of Science in Computer Science
University of California, Irvine, 2009
Professor Deva Ramanan, Chair
In this thesis, we propose a new method for human activity recognition from video sequences
using a combination of discriminative and generative models. We present two different
approaches - one based solely on Support Vector Machines and another in which the video
sequence is modelled as a Markov process. The major difference between our methods and
previous work done in action recognition is that we do away with the assumption that
the persons in a video sequence are tracked prior to the recognition process, and instead
combine the tracking and recognition problems into one. We believe that this is not only
a much more reasonable approach to action recognition, but also that combining the two
problems increases the accuracy of tracking.
Chapter 1
Introduction
Recognizing human actions in video sequences is a challenging problem in computer vision.
It has applications in areas such as human computer interaction, surveillance, searching
video databases and automatic tagging of videos on sites such as YouTube. Various visual
cues, based on motion and shape, have been used for action recognition. In this thesis, the
focus will be on recognizing the actions in video sequences using motion cues. Specifically,
optical flow will be used as a feature set. We discuss two different methods for solving the
action recognition problem; one of them extends the SVM based framework introduced in
Dalal et al. [1]. In this setting, a video sequence, broken down into a sequence of frames, is
essentially treated as a “bag of words”, i.e. the order of the frames is inconsequential. The
second method models the sequence of frames in a video as a Markov process meaning that
the sequence in which frames occur in a video is taken into consideration.
1.1 Contributions of this thesis
Many interesting approaches have been proposed for solving the action recognition problem.
Although some of these methods report impressive results, most of them suffer from the
same weakness: they assume that tracking of the human figure is done
prior to recognition. In most cases, an external module is used to localize the motion of the
human figure in each frame of a video sequence. A common technique used is background
subtraction. In background subtraction, the goal is to identify moving objects from the
portion of a video frame that differs significantly from a background model. Although it
works well in cases where the background is static, background subtraction is not known to
perform well when the scene consists of complex, non-static backgrounds.
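The idea can be sketched in a few lines (a minimal illustration with made-up data, not the actual module used in this thesis): model the background as a fixed reference image and flag pixels whose difference from it exceeds a threshold.

```python
import numpy as np

def foreground_mask(frame, background, threshold=25):
    """Flag pixels that differ significantly from the background model."""
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return diff > threshold

# Toy example: a flat background with a bright 2x2 "moving object".
background = np.full((4, 4), 100, dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200  # the moving object
mask = foreground_mask(frame, background)
print(mask.sum())  # 4 pixels flagged as foreground
```

As the text notes, such a sketch only works for a static background; a non-static scene would require the background model itself to adapt over time.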
Another popular out-of-the-box module is the Histogram of Oriented Gradient (HOG)
based classifier, developed by Navneet Dalal and Bill Triggs and commonly known as the
Dalal-Triggs detector. The detector has a single fixed template which determines whether
a given image pattern corresponds to a person. The Dalal-Triggs detector was originally
trained to detect pedestrians in images and is not meant to detect a wide variety of human
poses. This means that while it may accurately locate a person standing in an upright pose
in an image, it may not be able to detect a person in a crouching or sitting position.
While the above two methods can be used for localizing motion in simple scenarios, com-
plex ones require a more intuitive approach. In such situations, the kind of action being
performed, and not global composition of a frame, should provide the primary cue for lo-
calization and recognition of motion patterns. In the case of action recognition, the ability
to perform robust tracking prior to recognition cannot be assumed in general.
The main contribution of this thesis is to address the issues described above. In this thesis,
this problem is tackled by building a family of templates, one for each motion pattern,
instead of working with a single template for localizing motion. A motion pattern could
either be related to an action class or a “word” from a visual vocabulary. We believe that
having a family of templates will enable us to detect a wide array of human poses and will
therefore lead to a better track, which in turn will increase the accuracy with which an action
can be predicted. We later describe a method in which these templates are incorporated
into a hidden Markov model based approach for solving the action recognition problem. We
introduce a novel joint model for tracking and recognizing actions in video sequences.
1.2 Organization of this thesis
The thesis is organized as follows. Chapter 2 provides a brief overview of related work in
human activity recognition. Chapter 3 goes into detail about the machinery of Support
Vector Machines and Hidden Markov Models, which form the basis for the methods used in
this thesis for recognizing human actions. Chapter 4 discusses these methods in detail,
specifically describing how the tracking and recognition problems are combined into one.
Chapter 5 discusses the results of testing the methods described in Chapter 4 on two datasets.
Chapter 6 concludes this thesis with a summary and a discussion of possible extensions.
Chapter 2
Related Work
The literature on action recognition is quite rich. We avoid an in-depth review of all methods
and instead refer the reader to [30], focusing on the related work that is most applicable
to the approach we pursue. As mentioned before, different visual cues are used to detect
human actions. This section will concentrate on methods that use motion based cues. The
methods we discuss fall into two broad categories - ones that consider global level features
and others that try to capture and interpret local features.
2.1 Local features
Several recent approaches have concentrated on capturing local level features in video and
using them to understand the underlying motion in video sequences. The motivation for
doing so was to overcome some of the limitations of the methods that only considered motion
on a global scale, such as the inability to deal with multiple moving objects and variations
in background. Several local features for video have been proposed recently; one of which is
Space-time interest points[4]. In images, points with significant variation in local intensities
are considered to be of interest and are called spatial interest points. Space-time interest
points are an extension of spatial interest points and are meant to correspond to interesting
events in video data. Niebles et al. [3] model a video sequence as a collection of spatial-
temporal words by extracting space-time interest points from video sequences. Probability
distributions of the spatial-temporal words and intermediate topics corresponding to human
action categories are automatically learnt using a probabilistic Latent Semantic Analysis
(pLSA) [15] model. Using this model, a new video sequence is categorized and the motions
in it localized. Lindeberg et al. [6] introduce several local space-time descriptors associated
with space-time interest points and use them for recognition of spatio-temporal events and
activities. Caputo et al. [7] combine space-time features and SVMs and apply the resulting
approach to the classification of human actions.
Part based models are also increasingly being used in action recognition. This trend is
partly inspired by the success of these models in object detection. In [32], a discriminatively
trained, multi-scale, deformable part model is used to build models for people and objects
such as cars, bottles, and couches. [33, 34, 35] also use part based models for both human
and object recognition. In [18], a discriminative part-based approach based on hidden
conditional random fields is used for human action recognition.
2.2 Global features
Several global features have been used for action recognition in the past.
One commonly used feature is optical flow, which is an approximation of motion between
temporally adjacent scenes. Efros et al. [9] build motion descriptors based on optical flow
and use these in a k-nearest neighbor framework to classify actions. Wang et al. [2] use the
same descriptor to build a visual vocabulary and represent a video as a bag of visual words.
They later use this representation to build a model based on latent Dirichlet allocation
(LDA)[5]. [18] also uses optical flow to model human actions as a flexible constellation of
parts. In order to avoid explicit computation of optical flow, a number of template-based
methods attempt to capture the underlying motion similarity amongst videos of a given
action class. Shechtman and Irani [19] avoid explicit flow computations by employing a
rank-based constraint directly on the intensity information of spatio-temporal cuboids to
enforce consistency between a template and a target.
Rodriguez and Ahmed[11] introduce a template-based method for recognizing human ac-
tions called Action MACH based on a Maximum Average Correlation Height (MACH)
filter. They address the common limitations of template-based methods in generating a
single template for an action by synthesizing a single Action MACH filter for a given action
class. These Action MACH filters combine the training sequences of an action class into a
single composite template. These templates are then correlated with testing sequences in
the frequency domain via an FFT. Once an Action MACH filter is synthesized,
similar actions in a testing video sequence are detected by applying the action MACH filter
to the video.
In this thesis, we use optical flow as a feature set. Although it does not explicitly capture
local interest points in video, localization is offered to some degree by using optical flow in
conjunction with the framework introduced in Dalal et al. [1].
2.3 Other HMM based approaches
Using HMMs for action recognition is very common. Typically, the hidden state is an
activity to be inferred, and observations are image measurements. Yamato et al.[20] describe
recognizing tennis strokes with HMMs. Wilson and Bobick[21] describe the use of HMMs
for recognizing gestures such as pushes. Yang et al.[22] use HMMs to recognize handwriting
gestures.
In order to simplify the training process of learning the state transition matrix, there has
been a great deal of interest in models obtained by modifying a basic activity-state HMM.
Variations include a coupled HMM (CHMM) [21, 22], a layered HMM (LHMM) [23, 24,
25], a parametric HMM (PHMM) [26], an entropic HMM (EHMM) [27], and variable length
Markov models (VLMM) [28, 29].
In this thesis, we use HMMs to infer activities using optical flow based feature sets as
observations. Later on, we build a joint model for tracking and recognizing actions in
video.
Chapter 3
Theory
This chapter discusses support vector machines and hidden Markov models in detail. Both
of these constitute the machinery used to classify video sequences.
3.1 Support Vector Machines
Support Vector Machines (SVMs for short) are known to be among the best “off-the-shelf”
supervised learning algorithms. SVMs are used in solving problems such as text catego-
rization, hand-written character recognition, image classification and in this case, action
recognition. A Support Vector Machine is typically a binary classifier (although it can be
extended to a multiclass classifier) which performs classification by constructing an (N−1)-
dimensional hyperplane that optimally separates N-dimensional data into two categories.
Input data fed into an SVM can be viewed as two sets of vectors in an N-dimensional space.
An SVM constructs a separating hyperplane in that space that maximizes the margin
between the two data sets.
3.1.1 Intuitions behind Margins
The intuition behind margins can be best explained by considering logistic regression. In
logistic regression, the probability p(y = 1 | x; θ) is modeled by h_θ(x) = g(θ^T x). When
h_θ(x) ≥ 0.5, or equivalently θ^T x ≥ 0, the label “1” is predicted. For a positive training
example (y = 1), the larger θ^T x is, the larger h_θ(x) is, and thus the higher the degree of
“confidence” that the label is 1. The prediction can be thought of as a confident one that
y = 1 if θ^T x ≫ 0. Similarly, logistic regression can be thought of as making a confident
prediction that y = 0 if θ^T x ≪ 0. Given a training set, a good fit can be found if a θ can
be found such that θ^T x^(i) ≫ 0 whenever y^(i) = 1, and θ^T x^(i) ≪ 0 whenever y^(i) = 0,
since this would reflect a very confident set of classifications for all the training examples.
For points that are very far away from the separating hyperplane, a prediction can be made
rather confidently. On the other hand, for a point that is very close to the hyperplane, a
confident prediction may not be possible, because even a small change in the separating
hyperplane could easily change the prediction.
3.1.2 Notation
Consider a linear classifier for a binary classification problem with labels y and features x,
with y ∈ {−1, 1} (instead of {0, 1}). Using parameters w, b, the classifier can be written as

    h_{w,b}(x) = g(w^T x + b)    (3.1)

Here, g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise. The “w, b” notation allows the intercept
term b to be treated separately from the other parameters.
3.1.3 Functional and geometrical margins
Given a training example (x^(i), y^(i)), the functional margin of (w, b) with respect to the
training example can be defined as

    γ̂^(i) = y^(i) (w^T x^(i) + b)    (3.2)

If y^(i) = 1, then for the functional margin to be large (for the prediction to be confident
and correct), w^T x^(i) + b needs to be a large positive number. Conversely, if y^(i) = −1,
then for the functional margin to be large, w^T x^(i) + b needs to be a large negative number.
Moreover, if y^(i) (w^T x^(i) + b) > 0, then the prediction on this example is correct. A large
functional margin represents a confident and correct prediction.

For a linear classifier with this choice of g, replacing w and b with 2w and 2b would not
change h_{w,b}(x) at all, since g(w^T x + b) = g(2w^T x + 2b). However, replacing (w, b)
with (2w, 2b) multiplies the functional margin by a factor of 2. This means that by scaling
w and b, the functional margin can be made arbitrarily large without changing anything
meaningful. One fix is to impose a normalization condition such as ||w|| = 1, i.e. to replace
(w, b) by (w/||w||, b/||w||). Given a training set S = {(x^(i), y^(i)); i = 1, ..., m}, the
functional margin of (w, b) with respect to S is defined as the smallest of the functional
margins of the individual training examples. Denoted by γ̂, this can be written as:

    γ̂ = min_{i=1,...,m} γ̂^(i)    (3.3)
In Fig 3.1, the decision boundary corresponding to (w, b) is shown, along with the vector
w. It should be noted that w is orthogonal to the separating hyperplane. In the figure, the
distance of point A from the decision boundary, γ^(i), is given by the line segment AB.

Figure 3.1: Linear decision boundary

The vector w/||w|| is a unit-length vector pointing in the same direction as w. Since A
represents x^(i), the point B is given by x^(i) − γ^(i) · w/||w||. But this point lies on the
decision boundary, and all points x on the decision boundary satisfy the equation
w^T x + b = 0. Hence,

    w^T (x^(i) − γ^(i) w/||w||) + b = 0    (3.4)

Solving for γ^(i) yields

    γ^(i) = (w^T x^(i) + b) / ||w|| = (w/||w||)^T x^(i) + b/||w||    (3.5)
More generally, the geometric margin of (w, b) with respect to a training example
(x^(i), y^(i)) is defined, for both negative and positive examples, as

    γ^(i) = y^(i) ( (w/||w||)^T x^(i) + b/||w|| )    (3.6)

It should be noted that if ||w|| = 1, the functional margin equals the geometric margin.
The geometric margin is invariant to rescaling of the parameters; i.e. if w and b are
replaced by 2w and 2b, the geometric margin does not change. Because of this invariance to
the scaling of the parameters, when trying to fit w and b to the training data, an arbitrary
scaling constraint can be imposed on w without changing anything important. Given a
training set S = {(x^(i), y^(i)); i = 1, ..., m}, the geometric margin of (w, b) with respect
to S can be defined as the smallest of the geometric margins on the individual training
examples:

    γ = min_{i=1,...,m} γ^(i)    (3.7)
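The scaling behaviour of the two margins can be checked numerically. The following sketch uses a made-up toy dataset and hyperplane (illustrative values only): rescaling (w, b) to (2w, 2b) doubles the functional margin but leaves the geometric margin unchanged.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0]])  # training points x^(i)
y = np.array([1, 1, -1])                               # labels y^(i)
w, b = np.array([1.0, 1.0]), -1.0                      # a separating hyperplane

def functional_margin(w, b, X, y):
    return np.min(y * (X @ w + b))       # smallest y^(i)(w^T x^(i) + b), Eq. (3.3)

def geometric_margin(w, b, X, y):
    return functional_margin(w, b, X, y) / np.linalg.norm(w)  # Eq. (3.7)

# (w, b) -> (2w, 2b): functional margin doubles, geometric margin is invariant.
print(functional_margin(2 * w, 2 * b, X, y) / functional_margin(w, b, X, y))  # 2.0
print(np.isclose(geometric_margin(w, b, X, y),
                 geometric_margin(2 * w, 2 * b, X, y)))                        # True
```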
3.1.4 The optimal margin classifier
Given a training set, it seems natural to try and find a decision boundary that maximizes
the geometric margin, since this would reflect a very confident set of predictions on the
training set and a good fit to the training data. This will result in a classifier that separates
the positive training examples from the negative training examples with a large geometric
margin.

Assuming that the training data is linearly separable, i.e. it is possible to separate the
positive and negative examples using a hyperplane, the question is how to find one that
achieves the maximum geometric margin. The following optimization problem can be posed:

    max_{γ,w,b}  γ
    s.t.  y^(i) (w^T x^(i) + b) ≥ γ,  i = 1, ..., m
          ||w|| = 1

The objective is to maximize γ, subject to every training example having functional margin
at least γ. The ||w|| = 1 constraint ensures that the functional margin equals the geometric
margin, so it is guaranteed that all geometric margins are at least γ. Thus, solving this
problem will result in the (w, b) with the largest possible geometric margin with respect to
the training set.

The ||w|| = 1 constraint makes the problem a hard one to solve, because it cannot directly
be plugged into a standard optimization algorithm; the answer is to transform the problem:

    max_{γ̂,w,b}  γ̂/||w||
    s.t.  y^(i) (w^T x^(i) + b) ≥ γ̂,  i = 1, ..., m
Here, γ̂/||w|| is maximized, subject to the functional margins all being at least γ̂. Since
the functional and geometric margins are related by γ = γ̂/||w||, this gives the desired
answer, and the difficult constraint ||w|| = 1 no longer has to be dealt with. On the other
hand, there is no off-the-shelf software that optimizes an objective of the form γ̂/||w||.

Using the freedom to add an arbitrary scaling constraint on w and b without changing
anything, the scaling constraint that the functional margin of (w, b) with respect to the
training set must equal 1 can be introduced, i.e. γ̂ = 1. Plugging this in, and noting that
maximizing γ̂/||w|| = 1/||w|| is the same as minimizing ||w||^2, results in the following
optimization problem:

    min_{w,b}  (1/2) ||w||^2
    s.t.  y^(i) (w^T x^(i) + b) ≥ 1,  i = 1, ..., m

The problem can now be solved efficiently. This is an optimization problem with a convex
quadratic objective and linear constraints, and its solution gives the optimal margin
classifier. It can be solved using quadratic programming and Lagrange multipliers.
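As a concrete illustration of this optimization problem (with a hand-picked toy dataset and candidate (w, b), not the output of an actual QP solver), one can verify that the constraints y^(i)(w^T x^(i) + b) ≥ 1 all hold, that the closest points meet them with equality, and that the resulting geometric margin is 1/||w||:

```python
import numpy as np

X = np.array([[2.0, 0.0], [3.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])  # toy points
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), -1.0    # candidate max-margin separator: x1 = 1

margins = y * (X @ w + b)
print(margins)                       # [1. 2. 1. 2.]: all constraints >= 1 hold
print(1 / np.linalg.norm(w))         # geometric margin 1/||w|| = 1.0
print(np.isclose(margins.min(), 1))  # closest points achieve the bound with equality
```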
3.1.5 Optimal margin classifiers
The constraint for the optimization problem above can be written as

    g_i(w) = −y^(i) (w^T x^(i) + b) + 1 ≤ 0

There is one such constraint for each training example. From the KKT conditions, the
only training examples with α_i > 0 are the ones whose functional margin is exactly one
(the ones corresponding to constraints that hold with equality, g_i(w) = 0). In Fig 3.2, the
maximum margin separating hyperplane is shown by the solid line.

The points with the smallest margins are exactly the ones closest to the decision boundary.
In this case, only three points (one negative and two positive examples) lie on the dashed
Figure 3.2: Maximum margin separating hyperplane and support vectors
lines parallel to the decision boundary. This means that only three α_i's will be non-zero at
the optimal solution. These three points are called the support vectors. The number of
support vectors is usually much smaller than the size of the training set. Constructing the
Lagrangian for the optimization problem:

    L(w, b, α) = (1/2) ||w||^2 − Σ_{i=1}^m α_i [y^(i) (w^T x^(i) + b) − 1]    (3.8)

It should be noted that there are only α_i and no β_i Lagrange multipliers, since the problem
has only inequality constraints.

To find the dual form of the problem, L(w, b, α) has to be minimized with respect to w
and b (for fixed α) to get θ_D. This can be done by setting the derivatives of L with respect
to w and b to zero:

    ∇_w L(w, b, α) = w − Σ_{i=1}^m α_i y^(i) x^(i) = 0    (3.9)

This implies that

    w = Σ_{i=1}^m α_i y^(i) x^(i)    (3.10)
The derivative with respect to b gives

    (∂/∂b) L(w, b, α) = Σ_{i=1}^m α_i y^(i) = 0    (3.11)

Taking the definition of w in Equation (3.10) and plugging it back into the Lagrangian in
Equation (3.8):

    L(w, b, α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y^(i) y^(j) α_i α_j (x^(i))^T x^(j) − b Σ_{i=1}^m α_i y^(i)    (3.12)

The above expression was obtained by minimizing L with respect to w and b. Putting it
together with the constraints α_i ≥ 0 and the constraint (3.11) leads to the following dual
problem:
    max_α  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩
    s.t.  α_i ≥ 0,  i = 1, ..., m
          Σ_{i=1}^m α_i y^(i) = 0
In the dual problem above, the parameters of the maximization problem are the α_i's. Given
an algorithm to solve the dual problem, the optimal w can be found as a function of the α's
using Equation (3.10). Having found w*, the optimal value of the intercept term b is

    b* = − ( max_{i: y^(i) = −1} w*^T x^(i) + min_{i: y^(i) = 1} w*^T x^(i) ) / 2    (3.13)
Equation (3.10) also gives the optimal value of w in terms of the optimal value of α. To
make a prediction at a new point x, the quantity w^T x + b is calculated, and y = 1 is
predicted if this quantity is greater than zero. Using Equation (3.10), this quantity can also
be written as:

    w^T x + b = ( Σ_{i=1}^m α_i y^(i) x^(i) )^T x + b    (3.14)
              = Σ_{i=1}^m α_i y^(i) ⟨x^(i), x⟩ + b    (3.15)

Once the α_i's are found, making a prediction only requires calculating a quantity that
depends on the inner products between x and the points in the training set. Moreover, many
of the terms in the sum above will be zero, because the α_i's are zero for all points except
the support vectors; only the inner products between x and the support vectors need to be
calculated.
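The equivalence between the primal score and the support-vector expansion in Equation (3.15) can be checked numerically. The α values, data, and b below are made up (chosen only to satisfy Σ_i α_i y^(i) = 0), not the solution of an actual dual problem:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])  # training points x^(i)
y = np.array([1, 1, -1])
alpha = np.array([0.5, 0.0, 0.5])  # sum_i alpha_i y^(i) = 0; alpha_2 = 0 (non-support vector)
b = 0.3
x_new = np.array([0.5, -0.5])

w = (alpha * y) @ X                  # Equation (3.10): w = sum_i alpha_i y^(i) x^(i)
primal_score = w @ x_new + b         # w^T x + b
dual_score = np.sum(alpha * y * (X @ x_new)) + b  # Equation (3.15)
print(np.isclose(primal_score, dual_score))  # True
```

Note that the middle point contributes nothing to the dual score, exactly as the text says: terms with α_i = 0 drop out of the sum.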
3.1.6 Multiclass SVMs
The support vector machine is fundamentally a binary classifier. In practice, however, one
may have to tackle problems involving more than two classes. In this project, for example,
SVMs are being used in a multiclass scenario. Various methods have been proposed to
combine multiple binary SVMs to build a multiclass classifier.
One commonly used approach is to construct K separate SVMs, in which the kth SVM
y_k(x) is trained using the data from class C_k as the positive examples and the data from
the remaining K − 1 classes as the negative examples. This is known as the one-versus-the-rest
approach. A new point x is classified using

    y(x) = max_k y_k(x)    (3.16)

This heuristic approach suffers from the problem that the different classifiers are trained
on different tasks, and there is no guarantee that the real-valued quantities y_k(x) for the
different classifiers will be on comparable scales.

Another approach is to train K(K − 1)/2 different two-class SVMs on all possible pairs of
classes, and then to classify test points according to which class receives the highest number
of 'votes', an approach known as one-versus-one. The problem with this approach is that
it requires more training time than the one-versus-the-rest approach. Similarly, evaluating
test points requires significantly more computation.
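The one-versus-the-rest decision rule of Equation (3.16) amounts to an argmax over K linear scorers. A minimal sketch with hand-set weights (the weights and test points are illustrative only; a real system would train each y_k(x) as described above):

```python
import numpy as np

# One linear scorer y_k(x) = w_k^T x + b_k per class, stacked as rows of W.
W = np.array([[1.0, 0.0],     # class 0 scorer
              [0.0, 1.0],     # class 1 scorer
              [-1.0, -1.0]])  # class 2 scorer
b = np.array([0.0, 0.0, 0.0])

def classify(x):
    """Equation (3.16): pick the class whose SVM gives the largest score."""
    scores = W @ x + b
    return int(np.argmax(scores))

print(classify(np.array([2.0, 0.5])))    # 0
print(classify(np.array([-1.0, -2.0])))  # 2
```

The caveat from the text applies directly here: the argmax is only meaningful if the K real-valued scores are on comparable scales.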
3.2 Hidden Markov Models
A Hidden Markov Model (HMM) is a statistical model in which the system being modeled
is assumed to be a Markov process with unknown parameters; the challenge is to determine
the hidden parameters from the observable data. The extracted model parameters can then
be used to perform further analysis, for example for pattern recognition applications. In a
regular Markov model, the state is directly visible to the observer, and therefore the state
transition probabilities are the only parameters. In a hidden Markov model, the state is not
directly visible, but variables influenced by the state are visible. Each state has a probability
distribution over the possible output tokens. Therefore the sequence of tokens generated by
an HMM gives some information about the sequence of states. The HMM is widely used
in speech recognition, natural language modelling, on-line handwriting recognition and the
analysis of biological sequences such as proteins and DNA.
3.2.1 Transition and Emission probabilities
As in a standard mixture model, the latent variables are discrete multinomial variables z_n describing which component of the mixture is responsible for generating the corresponding observation x_n. The probability distribution of z_n is allowed to depend on the state of the previous latent variable z_{n-1} through a conditional distribution p(z_n | z_{n-1}). Because the latent variables are K-dimensional binary variables, this conditional distribution corresponds to a table of numbers A, the elements of which are known as transition probabilities. They are given by A_{jk} = p(z_{nk} = 1 | z_{n-1,j} = 1), and because they are probabilities, they satisfy 0 ≤ A_{jk} ≤ 1 with ∑_k A_{jk} = 1, so the matrix A has K(K-1) independent parameters. The conditional distribution can be written in the form:

p(z_n | z_{n-1}, A) = ∏_{k=1}^{K} ∏_{j=1}^{K} A_{jk}^{z_{n-1,j} z_{nk}}    (3.17)
The initial latent node z_1 is special in that it does not have a parent node, and so it has a marginal distribution p(z_1) represented by a vector of probabilities π with elements π_k ≡ p(z_{1k} = 1), so that:

p(z_1 | π) = ∏_{k=1}^{K} π_k^{z_{1k}}    (3.18)

where ∑_k π_k = 1.
The specification of the probabilistic model is completed by defining the conditional distributions of the observed variables p(x_n | z_n, φ), where φ is the set of parameters governing the distribution. These are known as emission probabilities. They could be given by a Gaussian (or any other continuous probability distribution) if the elements of x are continuous variables, or by conditional probability tables if x is discrete. Because x_n is observed, the distribution p(x_n | z_n, φ) consists, for a given value of φ, of a vector of K numbers corresponding to the K possible states of the binary vector z_n. The emission probabilities can be represented in the form:

p(x_n | z_n, φ) = ∏_{k=1}^{K} p(x_n | φ_k)^{z_{nk}}    (3.19)

The joint probability distribution over both latent and observed variables is then given by:

p(X, Z | θ) = p(z_1 | π) [∏_{n=2}^{N} p(z_n | z_{n-1}, A)] ∏_{m=1}^{N} p(x_m | z_m, φ)    (3.20)

where X = {x_1, ..., x_N}, Z = {z_1, ..., z_N}, and θ = {π, A, φ} denotes the set of parameters governing the model. The model is tractable for a wide range of emission distributions, including discrete tables, Gaussians and mixtures of Gaussians.
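The generative process defined by Eq (3.20) can be simulated directly by ancestral sampling: draw z_1 from π, then alternately draw each z_n from the row of A selected by z_{n-1} and each x_n from the emission distribution of the current state. A minimal Python sketch (assuming numpy and a discrete emission table; all numbers are illustrative):

```python
import numpy as np

def sample_hmm(pi, A, phi, N, rng):
    """Ancestral sampling from the joint distribution of Eq (3.20).
    pi: (K,) initial distribution; A: (K, K) transition matrix with rows
    summing to one; phi: (K, V) discrete emission table p(x | z = k).
    Returns the latent state sequence and the observed symbol sequence."""
    K, V = phi.shape
    z = np.empty(N, dtype=int)
    x = np.empty(N, dtype=int)
    z[0] = rng.choice(K, p=pi)                # initial state from pi
    x[0] = rng.choice(V, p=phi[z[0]])
    for n in range(1, N):
        z[n] = rng.choice(K, p=A[z[n - 1]])   # p(z_n | z_{n-1}), Eq (3.17)
        x[n] = rng.choice(V, p=phi[z[n]])     # emission p(x_n | z_n)
    return z, x

pi = np.array([1.0, 0.0])                     # illustrative two-state model
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
phi = np.array([[0.8, 0.2],                   # state 0 mostly emits symbol 0
                [0.1, 0.9]])                  # state 1 mostly emits symbol 1
z, x = sample_hmm(pi, A, phi, N=50, rng=np.random.default_rng(0))
```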
3.2.2 Maximum Likelihood for the HMM
For an observed data set X = {x_1, ..., x_N}, one can determine the parameters of the HMM using maximum likelihood. The likelihood function is obtained from the joint distribution in Equation (3.20) by marginalizing over the latent variables:

p(X | θ) = ∑_Z p(X, Z | θ)    (3.21)

Because the joint distribution p(X, Z | θ) does not factorize over n, the summations over the individual z_n cannot be treated independently. Nor can the summation be performed explicitly, because there are N variables to be summed over, each of which has K states, resulting in a total of K^N terms. Thus the number of terms in the summation grows exponentially with the length of the chain. In fact, the summation in Equation (3.21) corresponds to summing over exponentially many paths through the lattice in Fig 3.3 (b).
Figure 3.3: (a) A state transition diagram. (b) A lattice representing the transition diagram unfolded over time.
To find an efficient framework for maximizing the likelihood function in the hidden Markov model, one can use the expectation maximization (EM) algorithm. The EM algorithm starts with some initial selection for the model parameters, denoted by θ^old. In the E step, these parameters are used to find the posterior distribution of the latent variables p(Z | X, θ^old). This posterior distribution is then used to evaluate the expectation of the logarithm of the complete-data likelihood function, as a function of the parameters θ, to give the function Q(θ, θ^old) defined by:

Q(θ, θ^old) = ∑_Z p(Z | X, θ^old) ln p(X, Z | θ)    (3.22)
Introducing some notation, let γ(z_n) denote the marginal posterior distribution of a latent variable z_n, and ξ(z_{n-1}, z_n) the joint posterior distribution of two successive latent variables, so that:

γ(z_n) = p(z_n | X, θ^old)    (3.23)

ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, θ^old)    (3.24)

For each value of n, γ(z_n) can be stored using a set of K non-negative numbers that sum to unity, and similarly, ξ(z_{n-1}, z_n) can be stored in a K × K matrix of non-negative numbers that sum to unity. Let γ(z_{nk}) denote the conditional probability of z_{nk} = 1, with a similar notation for ξ(z_{n-1,j}, z_{nk}) and for other probabilistic variables introduced earlier. Because the expectation of a binary random variable is just the probability that it takes the value 1:

γ(z_{nk}) = E[z_{nk}] = ∑_z γ(z) z_{nk}    (3.25)

ξ(z_{n-1,j}, z_{nk}) = E[z_{n-1,j} z_{nk}] = ∑_z γ(z) z_{n-1,j} z_{nk}    (3.26)
Substituting the joint distribution p(X, Z | θ) from Equation (3.20) into (3.22), and making use of the definitions of γ and ξ:

Q(θ, θ^old) = ∑_{k=1}^{K} γ(z_{1k}) ln π_k + ∑_{n=2}^{N} ∑_{j=1}^{K} ∑_{k=1}^{K} ξ(z_{n-1,j}, z_{nk}) ln A_{jk} + ∑_{n=1}^{N} ∑_{k=1}^{K} γ(z_{nk}) ln p(x_n | φ_k)    (3.27)
The goal of the E step will be to evaluate the quantities γ(z_n) and ξ(z_{n-1}, z_n) efficiently. In the M step, the quantity Q(θ, θ^old) is maximized with respect to the parameters θ = {π, A, φ}, in which γ(z_n) and ξ(z_{n-1}, z_n) are treated as constant. Maximization with respect to π and A is easily achieved using appropriate Lagrange multipliers, with the results:

π_k = γ(z_{1k}) / ∑_{j=1}^{K} γ(z_{1j})    (3.28)

A_{jk} = ∑_{n=2}^{N} ξ(z_{n-1,j}, z_{nk}) / ∑_{l=1}^{K} ∑_{n=2}^{N} ξ(z_{n-1,j}, z_{nl})    (3.29)

The EM algorithm must be initialized by choosing starting values for π and A, which should respect the summation constraints associated with their probabilistic interpretation. Any elements of π and A that are initially set to zero will remain zero in subsequent EM updates. A typical initialization procedure would involve selecting random starting values for these parameters subject to the summation and non-negativity constraints.
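Given the posterior marginals γ and ξ from the E step, the updates (3.28) and (3.29) are simple normalized sums. A minimal Python sketch (assuming numpy, with γ stored as an N × K array and ξ as an (N-1) × K × K array; the toy values are illustrative):

```python
import numpy as np

def m_step_pi_A(gamma, xi):
    """M-step updates (3.28) and (3.29) from the posterior marginals.
    gamma: (N, K) with gamma[n, k] = p(z_nk = 1 | X); xi: (N-1, K, K)
    with xi[n-1, j, k] = p(z_{n-1,j} = 1, z_nk = 1 | X)."""
    pi = gamma[0] / gamma[0].sum()                  # Eq (3.28)
    counts = xi.sum(axis=0)                         # sum over n = 2..N
    A = counts / counts.sum(axis=1, keepdims=True)  # Eq (3.29), rows sum to 1
    return pi, A

# Illustrative marginals for a chain of length N = 3 with K = 2 states.
gamma = np.array([[0.6, 0.4],
                  [0.5, 0.5],
                  [0.3, 0.7]])
xi = np.array([[[0.3, 0.3], [0.2, 0.2]],
               [[0.1, 0.4], [0.2, 0.3]]])
pi, A = m_step_pi_A(gamma, xi)
```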
To maximize Q(θ, θ^old) with respect to φ_k, it should be noted that only the final term in Equation (3.27) depends on φ_k. If the parameters φ_k are different for the different components, then this term decouples into a sum of terms, one for each value of k, each of which can be maximized independently. This reduces to simply maximizing the weighted log likelihood function for the emission density p(x | φ_k) with the weights γ(z_{nk}). For example, in the case of Gaussian emission densities, p(x | φ_k) = N(x | µ_k, Σ_k), and maximization of the function Q(θ, θ^old) then gives:

µ_k = ∑_{n=1}^{N} γ(z_{nk}) x_n / ∑_{n=1}^{N} γ(z_{nk})    (3.30)

Σ_k = ∑_{n=1}^{N} γ(z_{nk}) (x_n - µ_k)(x_n - µ_k)^T / ∑_{n=1}^{N} γ(z_{nk})    (3.31)

For the case of discrete multinomial observed variables, the conditional distribution of the observed variables takes the form:

p(x | z) = ∏_{i=1}^{D} ∏_{k=1}^{K} µ_{ik}^{x_i z_k}    (3.32)

and the corresponding M step equations are given by:

µ_{ik} = ∑_{n=1}^{N} γ(z_{nk}) x_{in} / ∑_{n=1}^{N} γ(z_{nk})    (3.33)
The EM algorithm requires the initial values for the parameters of the emission distribution.
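For Gaussian emission densities, the weighted updates (3.30) and (3.31) can be written compactly. A Python sketch (assuming numpy; the toy responsibilities and data are illustrative):

```python
import numpy as np

def m_step_gaussian(gamma, X):
    """Weighted mean and covariance updates, Eqs (3.30) and (3.31).
    gamma: (N, K) posterior responsibilities; X: (N, D) observations.
    Returns mu: (K, D) and Sigma: (K, D, D)."""
    N, K = gamma.shape
    D = X.shape[1]
    Nk = gamma.sum(axis=0)                  # effective count per state
    mu = (gamma.T @ X) / Nk[:, None]        # Eq (3.30)
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]  # Eq (3.31)
    return mu, Sigma

# Hard responsibilities so the answer is easy to check by hand.
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [6.0, 0.0]])
mu, Sigma = m_step_gaussian(gamma, X)
```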
3.2.3 The forward-backward algorithm
An efficient procedure is needed to evaluate the quantities γ(z_{nk}) and ξ(z_{n-1,j}, z_{nk}), corresponding to the E step of the EM algorithm. The graph for the hidden Markov model is a tree, which means that the posterior distribution of the latent variables can be obtained efficiently using a two-stage message passing algorithm. In the particular context of the hidden Markov model, this is known as the forward-backward (Rabiner, 1989) or Baum-Welch algorithm (Baum, 1972). There are several variants of the basic algorithm, all of which lead to the exact marginals, differing in the precise form of the messages that are propagated along the chain. The most widely used of these, called the alpha-beta algorithm, is discussed below.

The evaluation of the posterior distributions of the latent variables is independent of the form of the emission density p(x | z) and of whether the observed variables are discrete or continuous. All that is required are the values of the quantities p(x_n | z_n) for each value of z_n for every n. Also, the explicit dependence on the model parameters θ^old shall be omitted because these are fixed throughout.
The following conditional independencies hold:

p(X | z_n) = p(x_1, ..., x_n | z_n) p(x_{n+1}, ..., x_N | z_n)    (3.34)

p(x_1, ..., x_{n-1} | x_n, z_n) = p(x_1, ..., x_{n-1} | z_n)    (3.35)

p(x_1, ..., x_{n-1} | z_{n-1}, z_n) = p(x_1, ..., x_{n-1} | z_{n-1})    (3.36)

p(x_{n+1}, ..., x_N | z_n, z_{n+1}) = p(x_{n+1}, ..., x_N | z_{n+1})    (3.37)

p(x_{n+2}, ..., x_N | z_{n+1}, x_{n+1}) = p(x_{n+2}, ..., x_N | z_{n+1})    (3.38)

p(X | z_{n-1}, z_n) = p(x_1, ..., x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1}, ..., x_N | z_n)    (3.39)

p(x_{N+1} | X, z_{N+1}) = p(x_{N+1} | z_{N+1})    (3.40)

p(z_{N+1} | z_N, X) = p(z_{N+1} | z_N)    (3.41)

where X = {x_1, ..., x_N}. These relations are easily proved using d-separation. For instance, for the first of these results, every path from any one of the nodes x_1, ..., x_{n-1} to the node x_n passes through the node z_n, which is observed. Because all such paths are head-to-tail, it follows that the conditional independence property must hold. These relations can also be proved directly from the joint distribution of the hidden Markov model using the sum and product rules of probability.
To evaluate γ(z_{nk}), recall that for a discrete multinomial random variable the expected value of one of its components is just the probability of that component taking the value 1. Given this fact, the goal is to find the posterior distribution p(z_n | x_1, ..., x_N) of z_n given the observed data set x_1, ..., x_N. This represents a vector of length K whose entries correspond to the expected values of z_{nk}. Using Bayes' theorem:

γ(z_n) = p(z_n | X) = p(X | z_n) p(z_n) / p(X)    (3.42)

The denominator p(X) is implicitly conditioned on the parameters θ^old of the HMM and hence represents the likelihood function. Using the conditional independence property (3.34), together with the product rule of probability:
γ(z_n) = p(x_1, ..., x_n, z_n) p(x_{n+1}, ..., x_N | z_n) / p(X) = α(z_n) β(z_n) / p(X)    (3.43)

where

α(z_n) ≡ p(x_1, ..., x_n, z_n)    (3.44)

β(z_n) ≡ p(x_{n+1}, ..., x_N | z_n)    (3.45)
The quantity α(z_n) represents the joint probability of observing all of the given data up to time n together with the value of z_n, whereas β(z_n) represents the conditional probability of all the future data from time n + 1 up to N given the value of z_n. Again, α(z_n) and β(z_n) each represent a set of K numbers, one for each of the possible settings of the 1-of-K coded binary vector z_n. From now on, the notation α(z_{nk}) shall be used to denote the value of α(z_n) when z_{nk} = 1, with an analogous interpretation of β(z_{nk}).

The recursion relations that allow α(z_n) and β(z_n) to be evaluated efficiently can now be derived. Making use of the conditional independence properties, in particular Eqs (3.35) and (3.36), together with the sum and product rules, allows α(z_n) to be expressed in terms of α(z_{n-1})
Figure 3.4: Forward recursion for the evaluation of the α variables
as follows
α(z_n) = p(x_1, ..., x_n, z_n)
       = p(x_1, ..., x_n | z_n) p(z_n)
       = p(x_n | z_n) p(x_1, ..., x_{n-1} | z_n) p(z_n)
       = p(x_n | z_n) p(x_1, ..., x_{n-1}, z_n)
       = p(x_n | z_n) ∑_{z_{n-1}} p(x_1, ..., x_{n-1}, z_{n-1}, z_n)
       = p(x_n | z_n) ∑_{z_{n-1}} p(x_1, ..., x_{n-1}, z_n | z_{n-1}) p(z_{n-1})
       = p(x_n | z_n) ∑_{z_{n-1}} p(x_1, ..., x_{n-1} | z_{n-1}) p(z_n | z_{n-1}) p(z_{n-1})
       = p(x_n | z_n) ∑_{z_{n-1}} p(x_1, ..., x_{n-1}, z_{n-1}) p(z_n | z_{n-1})    (3.46)

Making use of definition (3.44) for α(z_n):

α(z_n) = p(x_n | z_n) ∑_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})    (3.47)
It should be noted that there are K terms in the summation, and the right hand side has to be evaluated for each of the K values of z_n, so each step of the α recursion has a computational cost that scales like O(K^2). The forward recursion is illustrated by the lattice diagram in Fig 3.4.
In order to start this recursion, an initial condition is required. This is given by:

α(z_1) = p(x_1, z_1) = p(z_1) p(x_1 | z_1) = ∏_{k=1}^{K} {π_k p(x_1 | φ_k)}^{z_{1k}}    (3.48)

which says that α(z_{1k}), for k = 1, ..., K, takes the value π_k p(x_1 | φ_k). Starting at the first node of the chain, one can work along the chain and evaluate α(z_n) for every latent node. Because each step of the recursion involves multiplying by a K × K matrix, the overall cost of evaluating these quantities for the whole chain is O(K^2 N).
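The forward pass of Eqs (3.47) and (3.48) maps directly onto a loop over the chain. A Python sketch of the unscaled recursion (assuming numpy and pre-computed emission values; the toy two-state model is illustrative, and for long chains the rescaled version of Section 3.2.4 should be used instead):

```python
import numpy as np

def forward_alpha(pi, A, B):
    """Unscaled forward recursion, Eqs (3.47) and (3.48).
    B: (N, K) matrix of emission values B[n, k] = p(x_n | z_nk = 1),
    pre-computed for the observed sequence. Each step costs O(K^2),
    so the whole pass is O(K^2 N)."""
    N, K = B.shape
    alpha = np.empty((N, K))
    alpha[0] = pi * B[0]                      # initial condition, Eq (3.48)
    for n in range(1, N):
        alpha[n] = B[n] * (alpha[n - 1] @ A)  # Eq (3.47)
    return alpha

# Illustrative two-state model with three observations.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]])
alpha = forward_alpha(pi, A, B)
```

Summing the final row gives the likelihood p(X) of Eq (3.54).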
The recursion relation for β(z_n) can be found similarly, by making use of the conditional independence properties (3.37) and (3.38):

β(z_n) = p(x_{n+1}, ..., x_N | z_n)
       = ∑_{z_{n+1}} p(x_{n+1}, ..., x_N, z_{n+1} | z_n)
       = ∑_{z_{n+1}} p(x_{n+1}, ..., x_N | z_n, z_{n+1}) p(z_{n+1} | z_n)
       = ∑_{z_{n+1}} p(x_{n+1}, ..., x_N | z_{n+1}) p(z_{n+1} | z_n)
       = ∑_{z_{n+1}} p(x_{n+2}, ..., x_N | z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)    (3.49)
Making use of the definition (3.45) for β(z_n):

β(z_n) = ∑_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)    (3.50)
It should be noted that in this case the algorithm runs backward, evaluating β(z_n) in terms of β(z_{n+1}). At each step, the effect of observation x_{n+1} is absorbed through the emission probability p(x_{n+1} | z_{n+1}) and multiplied by the transition matrix p(z_{n+1} | z_n), and then z_{n+1} is marginalized out. This is illustrated in Fig 3.5.
Figure 3.5: Backward recursion for the evaluation of the β variables
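The backward pass of Eq (3.50) runs the same kind of loop in reverse, starting from β(z_N) = 1. A Python sketch (assuming numpy; the toy model is illustrative), with a small inline forward pass so that the consistency of Eq (3.53), ∑_{z_n} α(z_n) β(z_n) = p(X) for every n, can be checked:

```python
import numpy as np

def backward_beta(A, B):
    """Unscaled backward recursion, Eq (3.50), with beta(z_N) = 1.
    B: (N, K) pre-computed emission values B[n, k] = p(x_n | z_nk = 1)."""
    N, K = B.shape
    beta = np.ones((N, K))
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[n + 1] * beta[n + 1])  # Eq (3.50)
    return beta

# A small two-state model (all numbers illustrative).
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]])
beta = backward_beta(A, B)

alpha = np.empty_like(B)                  # inline forward pass, Eqs (3.47)-(3.48)
alpha[0] = pi * B[0]
for n in range(1, len(B)):
    alpha[n] = B[n] * (alpha[n - 1] @ A)

likelihoods = (alpha * beta).sum(axis=1)  # Eq (3.53): same value for every n
```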
Again, a starting condition for the recursion, namely a value for β(z_N), is required. This can be obtained by setting n = N in Eq (3.43) and replacing α(z_N) with its definition (3.44) to give:

p(z_N | X) = p(X, z_N) β(z_N) / p(X)    (3.51)

which will be correct provided β(z_N) = 1 for all settings of z_N.
In the M step equations, the quantity p(X) will cancel out, as can be seen in the M step equation for µ_k given by (3.30), which takes the form:

µ_k = ∑_{n=1}^{N} γ(z_{nk}) x_n / ∑_{n=1}^{N} γ(z_{nk}) = ∑_{n=1}^{N} α(z_{nk}) β(z_{nk}) x_n / ∑_{n=1}^{N} α(z_{nk}) β(z_{nk})    (3.52)
However, the quantity p(X) represents the likelihood function, whose value is typically monitored during the EM optimization, and so it is useful to be able to evaluate it. Summing both sides of (3.43) over z_n, and using the fact that the left hand side is a normalized distribution:

p(X) = ∑_{z_n} α(z_n) β(z_n)    (3.53)

Thus the likelihood function can be evaluated by computing this sum for any convenient choice of n. For instance, if only the likelihood function has to be evaluated, then this can be done by running the α recursion from the start to the end of the chain and using this result for n = N, making use of the fact that β(z_N) is a vector of 1's. In this case, no β recursion is required, and the following is obtained:

p(X) = ∑_{z_N} α(z_N)    (3.54)
To compute the likelihood directly, the joint distribution p(X, Z) would have to be summed over all possible values of Z. Each such choice represents a particular choice of hidden state for every time step; in other words, every term in the summation is a path through the lattice diagram. The number of such paths is exponential. By expressing the likelihood function in the form (3.54), the computational cost has been reduced from being exponential in the length of the chain to being linear, by swapping the order of the summations and multiplications, so that at each time step n the contributions from all paths passing through each of the states z_{nk} can be summed to give the intermediate quantities α(z_n).
Next, the quantities ξ(z_{n-1}, z_n), which correspond to the values of the conditional probabilities p(z_{n-1}, z_n | X), must be evaluated. Applying Bayes' theorem:

ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n | X)
               = p(X | z_{n-1}, z_n) p(z_{n-1}, z_n) / p(X)
               = p(x_1, ..., x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1}, ..., x_N | z_n) p(z_n | z_{n-1}) p(z_{n-1}) / p(X)
               = α(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β(z_n) / p(X)    (3.55)
Here, the conditional independence property (3.39) was used together with the definitions of α(z_n) and β(z_n) given by (3.44) and (3.45). Thus ξ(z_{n-1}, z_n) can be computed directly using the results of the α and β recursions.

To summarize the steps required to train a hidden Markov model using the EM algorithm: first make an initial selection of the parameters θ^old, where θ ≡ (π, A, φ). The A and π parameters are often initialized either uniformly or randomly from a uniform distribution (respecting their non-negativity and summation constraints), while the initialization of the parameters φ will depend on the form of the distribution. Then run both the forward α recursion and the backward β recursion and use the results to evaluate γ(z_n) and ξ(z_{n-1}, z_n). At this stage, the likelihood function can also be evaluated. This completes the E step, and these results are then used to find a revised set of parameters θ^new using the M step equations of the previous section. The E and M steps are alternated until convergence, for instance when the change in the log likelihood falls below some threshold.
It should be noted that in the recursion relations the observations enter only through conditional distributions of the form p(x_n | z_n). The recursions are therefore independent of the form and dimensionality of the observed variables and of the form of this conditional distribution, so long as its value can be computed for each of the K possible states of z_n. Since the observed variables {x_n} are fixed, the quantities p(x_n | z_n) can be pre-computed as functions of z_n at the start of the EM algorithm and remain fixed throughout.
The maximum likelihood approach is most effective when the number of data points is large in relation to the number of parameters. Here, the hidden Markov model can be trained effectively using maximum likelihood provided the training sequence is sufficiently long. Alternatively, one can use multiple shorter sequences, which requires a straightforward modification of the hidden Markov model EM algorithm. In the case of left-to-right models, this is particularly important because, in a given observation sequence, a given state transition corresponding to a non-diagonal element of A will be seen at most once.
Another quantity of interest is the predictive distribution, in which the observed data is X = {x_1, ..., x_N} and one wishes to predict x_{N+1}. Again, one can make use of the sum and product rules together with the conditional independence properties (3.40) and (3.41) to derive:

p(x_{N+1} | X) = ∑_{z_{N+1}} p(x_{N+1}, z_{N+1} | X)
              = ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) p(z_{N+1} | X)
              = ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) ∑_{z_N} p(z_{N+1}, z_N | X)
              = ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) ∑_{z_N} p(z_{N+1} | z_N) p(z_N | X)
              = ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) ∑_{z_N} p(z_{N+1} | z_N) p(z_N, X) / p(X)
              = (1 / p(X)) ∑_{z_{N+1}} p(x_{N+1} | z_{N+1}) ∑_{z_N} p(z_{N+1} | z_N) α(z_N)    (3.56)

which can be evaluated by first running the forward α recursion and then computing the final summations over z_N and z_{N+1}. The result can be stored and used once the value of x_{N+1} is observed, in order to run the α recursion forward to the next step and predict the subsequent value x_{N+2}. In the equation above, the influence of all the data from x_1 to x_N is summarized in the K values of α(z_N).
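For a discrete emission table, Eq (3.56) reduces to two matrix-vector products once α(z_N) is available. A Python sketch (assuming numpy; the α values, transition matrix and emission table are all illustrative):

```python
import numpy as np

def predictive(alpha_N, A, phi):
    """Predictive distribution p(x_{N+1} | X), Eq (3.56), for a discrete
    emission table phi[k, v] = p(x = v | z_k = 1).
    alpha_N: the K values alpha(z_N), which summarize all of x_1..x_N;
    note p(X) = sum(alpha_N) by Eq (3.54)."""
    p_zN_given_X = alpha_N / alpha_N.sum()  # p(z_N | X)
    p_zN1 = p_zN_given_X @ A                # propagate: p(z_{N+1} | X)
    return p_zN1 @ phi                      # mix the emission distributions

A = np.array([[0.7, 0.3], [0.4, 0.6]])      # illustrative model
phi = np.array([[0.8, 0.2], [0.1, 0.9]])
pred = predictive(np.array([0.3, 0.1]), A, phi)
```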
3.2.4 Scaling Factors
There is an important issue to be addressed before making use of the forward-backward algorithm in practice. In the algorithm, at each step, the value α(z_n) is obtained from the previous value α(z_{n-1}) by multiplying by the quantities p(z_n | z_{n-1}) and p(x_n | z_n). Because these probabilities are often significantly less than unity, as the recursion moves forward along the chain the values α(z_n) can go to zero exponentially quickly.
In the case of i.i.d. data, this problem was circumvented by working with the logarithm of the likelihood function. That will not work here, because sums of products of small numbers are being formed. Therefore, rescaled versions of α(z_n) and β(z_n) whose values remain of order unity are used. The corresponding scaling factors cancel out when these rescaled quantities are used in the EM algorithm.
In (3.44), α(z_n) = p(x_1, ..., x_n, z_n), representing the joint distribution of all the observations up to x_n and the latent variable z_n. Now introduce a normalized version of α:

α̂(z_n) = p(z_n | x_1, ..., x_n) = α(z_n) / p(x_1, ..., x_n)    (3.57)
which is expected to be well behaved numerically because it is a probability distribution over K variables for any value of n. In order to relate the scaled and original alpha variables, scaling factors defined by conditional distributions over the observed variables are introduced:

c_n = p(x_n | x_1, ..., x_{n-1})    (3.58)

From the product rule:

p(x_1, ..., x_n) = ∏_{m=1}^{n} c_m    (3.59)

and so

α(z_n) = p(z_n | x_1, ..., x_n) p(x_1, ..., x_n) = (∏_{m=1}^{n} c_m) α̂(z_n)    (3.60)
The recursion equation (3.47) for α can then be turned into one for α̂, given by:

c_n α̂(z_n) = p(x_n | z_n) ∑_{z_{n-1}} α̂(z_{n-1}) p(z_n | z_{n-1})    (3.61)

At each stage of the forward message passing phase, c_n has to be evaluated and stored, which is easily done because it is the coefficient that normalizes the right hand side of the equation above to give α̂(z_n).
Rescaled variables β̂(z_n) can be defined similarly using:

β(z_n) = (∏_{m=n+1}^{N} c_m) β̂(z_n)    (3.62)

which will again remain within machine precision because, from (3.45), the quantities β̂(z_n) are simply the ratio of two conditional probabilities:

β̂(z_n) = p(x_{n+1}, ..., x_N | z_n) / p(x_{n+1}, ..., x_N | x_1, ..., x_n)    (3.63)
The recursion result (3.50) for β then gives the following recursion for the rescaled variables:

c_{n+1} β̂(z_n) = ∑_{z_{n+1}} β̂(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)    (3.64)

In applying this recursion relation, the scaling factors c_n that were computed in the α phase are used.

From (3.59), the likelihood function can be found using:

p(X) = ∏_{n=1}^{N} c_n    (3.65)
Similarly, using (3.43) and (3.55) together with (3.65), the required marginals are given by:

γ(z_n) = α̂(z_n) β̂(z_n)    (3.66)

ξ(z_{n-1}, z_n) = c_n^{-1} α̂(z_{n-1}) p(x_n | z_n) p(z_n | z_{n-1}) β̂(z_n)    (3.67)
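The scaled recursions (3.61) and (3.64) fit together as follows: the log likelihood comes out as ∑_n ln c_n, the logarithm of Eq (3.65), and the marginals follow from (3.66). A Python sketch (assuming numpy; the toy two-state model is illustrative):

```python
import numpy as np

def scaled_forward_backward(pi, A, B):
    """Scaled alpha-beta recursions, Eqs (3.61) and (3.64).
    B[n, k] = p(x_n | z_nk = 1). Returns alpha_hat, beta_hat, the
    scaling factors c, and ln p(X) = sum_n ln c_n (log of Eq (3.65))."""
    N, K = B.shape
    alpha_hat = np.empty((N, K))
    beta_hat = np.ones((N, K))
    c = np.empty(N)
    a = pi * B[0]                            # Eq (3.48), before normalization
    c[0] = a.sum()
    alpha_hat[0] = a / c[0]
    for n in range(1, N):
        a = B[n] * (alpha_hat[n - 1] @ A)    # right hand side of Eq (3.61)
        c[n] = a.sum()                       # c_n is the normalizer
        alpha_hat[n] = a / c[n]
    for n in range(N - 2, -1, -1):           # Eq (3.64), run backward
        beta_hat[n] = A @ (B[n + 1] * beta_hat[n + 1]) / c[n + 1]
    return alpha_hat, beta_hat, c, float(np.log(c).sum())

pi = np.array([0.5, 0.5])                    # illustrative toy model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]])
alpha_hat, beta_hat, c, loglik = scaled_forward_backward(pi, A, B)
gamma = alpha_hat * beta_hat                 # Eq (3.66)
```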
3.2.5 The Viterbi Algorithm

In many applications of hidden Markov models, the latent variables have some meaningful interpretation, and so it is often of interest to find the most probable sequence of hidden states for a given observation sequence. Because the graph for a hidden Markov model is a directed tree, this problem can be solved exactly using the max-sum algorithm. The problem of finding the most probable sequence of latent states is not the same as that of finding the set of states that are individually the most probable. The latter problem can be solved by running the forward-backward (sum-product) algorithm to find the latent marginals γ(z_n) and then maximizing each of these individually. However, the set of such states will not, in general, correspond to the most probable sequence of states. In fact, this set of states might even represent a sequence having zero probability, if it so happens that two successive states, each individually the most probable in isolation, are connected by a transition matrix element that is zero.

Figure 3.6: A graphical depiction of the Viterbi algorithm

Finding the most probable sequence of states can be done efficiently using the max-sum algorithm, which in this setting is known as the Viterbi algorithm. Fig 3.6 shows a fragment of the hidden Markov model expanded as a lattice diagram. The number of possible paths through the lattice grows exponentially with the length of the chain. The Viterbi algorithm searches this space of paths efficiently to find the most probable path, with a computational cost that grows only linearly with the length of the chain.
The variable z_N is treated as the root, and messages are passed to the root starting from the leaf nodes. The messages passed in the max-sum algorithm are given by:

µ_{z_n → f_{n+1}}(z_n) = µ_{f_n → z_n}(z_n)    (3.68)

µ_{f_{n+1} → z_{n+1}}(z_{n+1}) = max_{z_n} {ln f_{n+1}(z_n, z_{n+1}) + µ_{z_n → f_{n+1}}(z_n)}    (3.69)
If µ_{z_n → f_{n+1}}(z_n) is eliminated between these two equations, a recursion for the f → z messages can be obtained:

ω(z_{n+1}) = ln p(x_{n+1} | z_{n+1}) + max_{z_n} {ln p(z_{n+1} | z_n) + ω(z_n)}    (3.70)

where ω(z_n) ≡ µ_{f_n → z_n}(z_n).

The messages are initialized using:

ω(z_1) = ln p(z_1) + ln p(x_1 | z_1)    (3.71)

A simple algorithm that keeps track of the path to every possible latent variable is used to find the sequence of latent variables corresponding to the most likely path.
Intuitively, the Viterbi algorithm can be understood as follows. Naively, one could consider all of the exponentially many paths through the lattice, evaluate the probability for each, and then select the path having the highest probability. However, a dramatic saving in computational cost can be made as follows. Suppose the probability of each path is evaluated by accumulating products of transition and emission probabilities while moving forward along the path through the lattice. Consider a particular time step n and a particular state k at that time step: there will be many possible paths converging on the corresponding node in the lattice diagram, but only the path that has the highest probability so far needs to be retained. Because there are K states at time step n, K such paths need to be kept track of at step n. At time step n+1, there will be K^2 possible paths to consider, comprising K possible paths leading out of each of the K current states, but again only the K of these corresponding to the best path into each state at time n+1 have to be retained. When the final time step N is reached, it will be known which state corresponds to the overall most probable path. Because there is a unique best path coming into that state, the path can be traced back to step N-1 to see what state it occupied at that time, and so on back through the lattice to the state at n = 1.
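The bookkeeping described above — the ω recursion (3.70), initialization (3.71), and the back-pointers used to trace the best path — can be sketched in Python as follows (assuming numpy; the toy two-state model is illustrative):

```python
import numpy as np

def viterbi(pi, A, B):
    """Viterbi algorithm: the omega recursion (3.70) with initial
    condition (3.71), plus back-pointers to recover the best path.
    B[n, k] = p(x_n | z_nk = 1). Cost is O(K^2 N) rather than the
    exponential cost of enumerating every path through the lattice."""
    N, K = B.shape
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    omega = np.empty((N, K))
    back = np.zeros((N, K), dtype=int)
    omega[0] = log_pi + log_B[0]                  # Eq (3.71)
    for n in range(1, N):
        scores = omega[n - 1][:, None] + log_A    # K x K candidate scores
        back[n] = scores.argmax(axis=0)           # best predecessor per state
        omega[n] = log_B[n] + scores.max(axis=0)  # Eq (3.70)
    path = [int(omega[-1].argmax())]
    for n in range(N - 1, 0, -1):                 # trace back from step N
        path.append(int(back[n, path[-1]]))
    return path[::-1], float(omega[-1].max())

pi = np.array([0.5, 0.5])                         # illustrative toy model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.2], [0.1, 0.8], [0.9, 0.2]])
path, logp = viterbi(pi, A, B)
```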
Chapter 4
Approach
Two different approaches will be discussed in this section. The first approach is purely SVM
(Support Vector Machine) based and is derived from Dalal et al.[1]. The machinery behind
the second approach is based on Hidden Markov Models. We describe our joint model for
tracking and recognition in which we track and classify actions simultaneously. Both of
these are discussed in detail in the following sections.
4.1 The Support Vector Machine Approach
The SVM approach is an extension of [1]. Whereas in [1], the focus is on building a binary
classifier to make person/no-person detections in images, the SVM approach uses an optical
flow based feature set and learns an SVM for each class of actions.
4.1.1 Optical Flow based HOG descriptors
Dalal et al. [1] discuss how locally normalized Histogram of Oriented Gradient (HOG) descriptors provide better performance at human detection relative to other feature sets. These descriptors are computed over a grid of uniformly spaced cells and use overlapping local contrast normalizations for improved performance. The intuition is that local object appearance and shape in static images can be characterized well by the distribution of local intensity gradients.

Here, local intensity gradients are replaced by optical flow to characterize local motion patterns in video sequences. This is implemented by dividing an image into regions called cells and accumulating a 1-D histogram of the optical flow over the pixels of each cell. To achieve better invariance to effects such as illumination and shadowing, local responses from cells are normalized by accumulating cells over larger spatial regions called blocks and normalizing over all cells in a block. Just as in [1], we use 8x8 cells with each block containing 4 cells. Each pixel casts a weighted "vote" based on the orientation and magnitude of the optical flow vector centered at it, and the votes are accumulated into orientation bins over cells. There are 9 orientation bins from 0 - 180 (degrees) and 4 normalizations for every cell. A detection window is tiled with a dense grid of overlapping HOG descriptors. Let i ∈ {1, . . . , L}, where L is the number of discrete locations in an image; L is proportional to the number of pixels in the image. For efficiency reasons, we only score windows centered at every 8th pixel. If the dimensionality of the detection window is n × m, and p_i is the n × m dimensional patch extracted from location i in the image, we can write x_i = ψ(p_i) for the HOG feature vector computed at the patch. This feature vector is (n/8) × (m/8) × 36 dimensional. Fig. 4.1 shows an image, the optical flow and the HOG descriptor computed at the image.

Figure 4.1: (a) The original image. (b) The optical flow computed at the image. (c) The HOG descriptor computed for the image.
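The per-cell voting scheme can be sketched as follows (a simplified Python fragment assuming numpy; block normalization is omitted, and the function name and bin handling are illustrative choices, not the thesis's implementation):

```python
import numpy as np

def cell_flow_histogram(fx, fy, n_bins=9):
    """Accumulate the flow vectors of one cell into orientation bins.
    Each pixel votes for the bin of its flow orientation, folded into
    0-180 degrees as described above, weighted by the flow magnitude.
    fx, fy: arrays of the flow components over the pixels of the cell."""
    mag = np.hypot(fx, fy)                          # vote weight
    ang = np.degrees(np.arctan2(fy, fx)) % 180.0    # fold into [0, 180)
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins, mag)                      # accumulate weighted votes
    return hist

# Two pixels: one flow vector pointing right, one pointing up.
h = cell_flow_histogram(np.array([1.0, 0.0]), np.array([0.0, 2.0]))
```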
4.1.2 A multiclass SVM framework for action recognition
Typically, an SVM is a binary classifier but can be extended to a multiclass framework as
described in Sec 3.1.6. This is the version of the SVM that is used here. Given a training
dataset of video sequences, an SVM is built for every action class.
4.1.2.1 Training
Given a training set of video sequences, we collect training pairs {x_i, y_i}, where x_i is as described above and y_i ∈ {1, . . . , C}, where C is the number of action classes. Positive features for the SVM for action class C are obtained from bounding boxes around the actions corresponding to class C. Negative features are obtained from random patches in images containing actions from the remaining classes. We train a model w_C that minimizes the hinge loss, and which takes the form:

w_C = ∑_{i=1}^{d} α_i y_i x_i    (4.1)

where d is the size of the training data and the α_i are Lagrange multipliers, whose values can be found by solving a dual optimization problem.

The training is typically done iteratively - after learning an initial SVM classifier, all the negative training examples are searched exhaustively for false positives (hard examples). These false positives are then appended to the negative training set and the SVM is retrained using the augmented negative training set, which gives rise to a new classifier.
4.1.2.2 Testing
To classify a new video sequence broken down into a sequence of optical flows, we convert each image into a large (framewidth/8) × (frameheight/8) × 36 dimensional HOG descriptor. Then, for a given C, we scan the n × m dimensional detection window across all locations and scales, scoring the classifier for C at every detection window by convolving the image HOG descriptor with 36 templates representing different subsets of weights from w_C.

If X = {x_1, . . . , x_N} represents the sequence of optical flows in an N-image video, and X_j = {x_1, . . . , x_m}, where m is the number of windows in image j across locations and scales, then frame x_j can be classified as follows:

C(x_j) = arg max_C max_i w_C · x_i    (4.2)

The entire video sequence X can be classified by taking a majority vote across frames:

C(X) = C(x_1, . . . , x_N) = arg max_C ∑_{i=1}^{N} I(C(x_i) = C)    (4.3)

where I is the indicator function.
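Equations (4.2) and (4.3) amount to a max over windows and classes per frame, followed by a vote over frames. A Python sketch (assuming numpy; the toy weight vectors and window features are illustrative, not learned templates):

```python
import numpy as np

def classify_frame(W, windows):
    """Eq (4.2): score every window under every class model and return
    the class whose best-scoring window is highest.
    W: (C, D) class weight vectors; windows: (m, D) window features."""
    scores = W @ windows.T                 # (C, m) scores w_C . x_i
    return int(scores.max(axis=1).argmax())

def classify_video(W, video):
    """Eq (4.3): majority vote over the per-frame labels."""
    labels = [classify_frame(W, frame) for frame in video]
    return int(np.bincount(labels).argmax())

W = np.array([[1.0, 0.0],                  # class 0 template (illustrative)
              [0.0, 1.0]])                 # class 1 template
video = [np.array([[2.0, 0.0], [0.0, 1.0]]),   # frame votes for class 0
         np.array([[0.0, 3.0], [1.0, 0.0]]),   # frame votes for class 1
         np.array([[5.0, 0.0]])]               # frame votes for class 0
```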
4.2 The Hidden Markov Model Approach
The SVM approach is similar to a bag-of-words approach; even if it were fed a sequence of
frames constituting a video out of order, it would still classify the video sequence just as it
would if the frames were fed to it in order. In reality, the optical flow in a particular frame
very much depends on the optical flows of the frames that precede it. This scenario can be
modeled as a Markov process. We use the training set of videos to build Hidden Markov
Models (HMMs) for each action class. Given a new, previously unseen video sequence which
broken down into a sequence of optical flows, the HMMs learnt in the training phase are
then used to classify this observed sequence.
4.2.1 Building a vocabulary of visual words
As mentioned in section 3.2, a Hidden Markov Model (HMM) consists of a transition model A, an emission model φ and an initial model π. The first step towards building an HMM is to build a vocabulary of “visual words” from the frames of the training videos. These visual words will represent the hidden variables in the HMM.
To build a visual vocabulary, the motion descriptor of Efros et al. [9] is computed on bounding boxes around the person in each frame. This motion descriptor has been shown to perform reliably on noisy image sequences. Given a video sequence in which the person appears in the center of the field of view, the optical flow is computed at each frame using the Lucas-Kanade algorithm [10]. The optical flow vector field F is then split into two scalar fields F_x and F_y, corresponding to the x and y components of the optical flow. F_x and F_y are further half-wave rectified into four non-negative channels F_x^+, F_x^-, F_y^+ and F_y^-, such that F_x = F_x^+ - F_x^- and F_y = F_y^+ - F_y^-. These four non-negative channels are then blurred with a Gaussian kernel and normalized to obtain the final four channels Fb_x^+, Fb_x^-, Fb_y^+ and Fb_y^-.
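The rectification and blurring steps can be sketched as follows. This is a minimal stand-in, not the thesis code: the Gaussian blur is a simple separable convolution (channels are assumed larger than the kernel), and the final normalization step is omitted:

```python
import numpy as np

def half_wave_rectify(F):
    """Split a scalar flow field into its positive and negative parts,
    so that F = F_plus - F_minus and both channels are non-negative."""
    return np.maximum(F, 0.0), np.maximum(-F, 0.0)

def gaussian_blur(channel, sigma=1.0):
    """Separable Gaussian blur of a 2-D channel via 'same'-mode
    convolution along each axis (assumes the channel is larger than
    the kernel)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    blurred = np.apply_along_axis(np.convolve, 0, channel, k, mode="same")
    return np.apply_along_axis(np.convolve, 1, blurred, k, mode="same")

def motion_channels(Fx, Fy, sigma=1.0):
    """The four blurred, half-wave-rectified channels of a flow field
    (normalization omitted here)."""
    chans = [*half_wave_rectify(Fx), *half_wave_rectify(Fy)]
    return [gaussian_blur(c, sigma) for c in chans]
```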
Figure 4.2: Cluster centers from the UCF action dataset

The similarity of the motion descriptors of two different frames is computed as follows. Suppose the four channels for frame A are a_1, a_2, a_3 and a_4, and similarly the four channels for frame B are b_1, b_2, b_3 and b_4. Then the similarity between frame A and frame B is:

S(A, B) = \sum_{c=1}^{4} \sum_{x,y \in I} a_c(x, y)\, b_c(x, y)    (4.4)

where I is the spatial extent of the motion descriptors.
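Equation (4.4) is just a pixel-wise product summed over the four channels; as a small illustrative sketch:

```python
import numpy as np

def frame_similarity(chans_a, chans_b):
    """Eq. (4.4): sum over the four motion channels of the pixel-wise
    product of the two frames' channels (a cross-correlation score).
    Each argument is a list of four equally sized 2-D arrays."""
    return float(sum(np.sum(a * b) for a, b in zip(chans_a, chans_b)))
```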
To construct the codebook, an affinity matrix A is computed over all frames in the training set, where A(i, j) is the similarity between frame i and frame j calculated using the equation above. K-medoid clustering is then run on this affinity matrix to obtain K clusters. Each frame in the training set belongs to one of these K clusters, and each cluster represents a visual word. Fig. 4.2 shows 15 of the 45 “cluster centers” from the UCF action dataset [11]. Each one of the images in the figure represents a visual word.

From now on, the visual vocabulary V will be represented as the K-element set {w_1, . . . , w_K}. For the KTH dataset, which contains 2391 video sequences, we used a vocabulary of size 255.
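K-medoid clustering on an affinity matrix can be sketched as a simple alternating scheme: assign each frame to its most similar medoid, then re-pick each cluster's medoid as the member most similar to the rest of its cluster. This is an illustrative variant, not necessarily the exact algorithm used in the thesis:

```python
import numpy as np

def k_medoids(A, K, iters=20, seed=0):
    """Cluster items given an affinity matrix A (A[i, j] = similarity of
    frames i and j, higher = more similar). Returns (medoid indices,
    cluster label per frame)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    medoids = rng.choice(n, size=K, replace=False)
    for _ in range(iters):
        # assign each frame to the medoid it is most similar to
        assign = np.argmax(A[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.flatnonzero(assign == k)
            if len(members):
                # new medoid = member most similar to the rest of its cluster
                sub = A[np.ix_(members, members)]
                new_medoids[k] = members[np.argmax(sub.sum(axis=0))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmax(A[:, medoids], axis=1)
```

Unlike k-means, the cluster centers are actual frames, which is what makes the "cluster center" images of Fig. 4.2 possible.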
4.2.2 A hidden Markov model framework for video sequences
For an N-frame video sequence X = {x_1, . . . , x_N}, where x_j is the optical flow computed at frame j, each x_j can represent an observed state in an HMM, and the video sequence X can be represented as a sequence of these observed states. The visual words computed in the section above represent the hidden variables z in the HMM. Given a labeled set of videos as training data, we can then build a hidden Markov model θ_C = {A_C, φ_C, π_C} for every action class C in the training set.
4.2.2.1 Dimensionality reduction of optical flow features
We assume that our observations are continuous vectors and therefore use a Gaussian emission density. We explore two different feature vectors. The first is the optical flow based HOG feature set described in Sec. 4.1.1. For an n × m image patch, the length of the feature vector is (n/8) × (m/8) × 36. The problem with directly modelling this feature vector as a Gaussian is that it may be too large to fit a full covariance matrix: we may end up with a singular covariance matrix, in which case Σ^{-1} does not exist. One solution is to restrict the space of matrices Σ under consideration; instead of fitting a full covariance matrix, we could fit a covariance matrix Σ that is diagonal. Although restricting the covariance matrix to be diagonal is an acceptable solution in some cases, we found both an efficiency and a performance gain by explicitly reducing the dimensionality of the data with Principal Component Analysis (PCA). We project the data down to a smaller dimension D such that

x_i = V^T \psi(p_i) = \begin{bmatrix} V_1^T \\ \vdots \\ V_D^T \end{bmatrix} \psi(p_i)    (4.5)

where V is an ((n/8) × (m/8) × 36) × D matrix of principal component directions.
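The projection of Eq. (4.5) is standard PCA; a minimal numpy sketch (with mean-centering added, which the equation leaves implicit):

```python
import numpy as np

def fit_pca(features, D):
    """Top-D principal directions of a data matrix whose rows are
    feature vectors. Returns (V, mean), where V is d x D with
    orthonormal columns (the V_1..V_D of Eq. 4.5)."""
    mean = features.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(features - mean, full_matrices=False)
    return Vt[:D].T, mean

def project(V, mean, psi_p):
    """x_i = V^T psi(p_i): project one feature vector to D dimensions."""
    return V.T @ (psi_p - mean)
```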
4.2.2.2 Visual word SVM based features
Here, we train an SVM w_i, i ∈ {1, . . . , K}, to detect each visual word. Positive instances come from bounding boxes from images in the cluster for which i is the center. Negative instances are randomly sampled patches from images belonging to other clusters, along with image patches with no person in them. We found that adding the latter helps improve performance, since the SVM is trained to discriminate a particular action word from both other action words and background patches that do not correspond to any action class. We can now define the feature extracted from the i-th patch as

x_i = w^T \psi(p_i) = \begin{bmatrix} w_1^T \\ \vdots \\ w_K^T \end{bmatrix} \psi(p_i)    (4.6)
This is also a linear dimensionality reduction scheme, but unlike PCA it is discriminative in that it exploits the training labels. One could also employ other discriminative techniques such as linear discriminant analysis (LDA). We chose an SVM based scheme in anticipation of moving our model toward a fully discriminative structured prediction model [20], which we briefly describe in our future work.

Given the definitions of our observed feature vectors x and hidden states z, we use the standard EM algorithm for HMMs to learn the parameters θ_C for every action class C.
4.2.2.3 Classifying a new pre-tracked video sequence
A new, previously unseen video sequence can be classified using the hidden Markov models constructed in the training phase. In the pre-tracked scenario, the location of the person in each frame is known, i.e. each frame is clipped to only include the person of interest. The video sequence essentially reduces to a sequence of poses, from which optical flow can be computed at each pose. This sequence of flows constitutes the observed states. A sum-product algorithm can then be used to calculate the probability of the sequence of observed states under the HMM for each action class built in the training phase. If

\alpha_n(i) = p(x_1, \ldots, x_n, z_n = i \mid \theta_C)

represents the probability of the partial observed sequence x_1, . . . , x_n produced by all possible state sequences that end at the i-th visual word given θ_C, it can be defined recursively as:
\alpha_n(i) = \left[ \sum_{j=1}^{K} \alpha_{n-1}(j)\, A^C_{ji} \right] \phi^C_i(x_n)    (4.7)
where K is the size of the visual vocabulary, A^C is the matrix of transition probabilities for action class C, and A^C_{ji} is the probability of transitioning from state j to state i in class C. φ^C is the emission model, and φ^C_i(x_n) represents the probability p(x_n | z_n = i). For an N-frame video sequence, the probability of the entire sequence X under the HMM for class C can be computed as follows:

p(X \mid \theta_C) = \sum_{j=1}^{K} \alpha_N(j)    (4.8)
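The forward recursion of (4.7)-(4.8) can be sketched as below. This is an illustrative reconstruction, not the thesis code: it rescales α at every step to avoid underflow on long sequences and therefore returns the log of the probability in (4.8), and it assumes the emission probabilities have been precomputed from the Gaussian emission model:

```python
import numpy as np

def forward_score(emissions, A, pi):
    """Forward pass (Eqs. 4.7-4.8) with per-step rescaling for numerical
    stability; returns log p(x_1..x_N | theta).

    emissions -- (N, K) array, emissions[n, i] = phi_i(x_{n+1})
    A         -- (K, K) transitions, A[j, i] = p(z_n = i | z_{n-1} = j)
    pi        -- (K,) initial state distribution
    """
    alpha = pi * emissions[0]
    log_p = 0.0
    for n in range(1, len(emissions)):
        scale = alpha.sum()
        log_p += np.log(scale)
        alpha = ((alpha / scale) @ A) * emissions[n]   # Eq. (4.7)
    return float(log_p + np.log(alpha.sum()))          # Eq. (4.8), in log

def classify_sequence(emissions_per_class, models):
    """Eq. (4.9): pick the class whose HMM scores the sequence highest;
    models maps class -> (A, pi)."""
    return max(models, key=lambda c: forward_score(
        emissions_per_class[c], *models[c]))
```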
After scoring the sequence under the HMM for every class, the final classifier for a video sequence X can be written as

C^* = \arg\max_C \; p(X \mid \theta_C)    (4.9)

As mentioned earlier, the ability to perform robust tracking prior to classification cannot be assumed in general. In the next section, this assumption is relaxed and a model for jointly tracking and predicting actions is presented.
4.3 A Unified model for joint tracking and recognition
Here, we use HMM machinery to build a joint model for tracking and recognizing actions.
Given a video sequence, we do not assume that the person in it is tracked; instead we try
to infer the person’s location in each frame along with his/her action.
4.3.1 Cross product space of location and visual words
In the pre-tracked scenario, z ∈ {1, . . . , K}, where K is the number of visual words. In the current scenario, where pre-tracking is not assumed and the location of the person in a frame is “hidden”, z ∈ {1, . . . , L} × {1, . . . , K}, where L is the number of locations and is proportional to the number of pixels in a frame. Given that the hidden variables z now lie in the cross product space of locations and visual words, an HMM for tracking can be given by

p(X, Z \mid \theta_G) = p(z_1 \mid \pi_G) \prod_{n=1}^{N} p(x_n \mid z_n, \phi_G) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A_G)    (4.10)

where θ_G is a global hidden Markov model trained on the entire dataset.
4.3.2 Joint model for tracking and recognition
We present a model for joint tracking and recognition by building action specific HMMs for every action class. This is given by

p(X, Z, C \mid \theta_C) = p(C)\, p(z_1 \mid \pi_C) \prod_{n=1}^{N} p(x_n \mid z_n, \phi_C) \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A_C)    (4.11)

where θ_C is the hidden Markov model for action class C. Our model enforces that all the visual words z_n are consistent with a single action class, whereas (4.10) cannot guarantee this property. Fig. 4.3 shows a graphical representation of our joint model; in it, the transition and emission probabilities are conditioned on the action class variable C.
Figure 4.3: An illustration of the joint model. The variable C determines the transition
and emission probabilities.
4.3.2.1 Bounded velocity motion model
As mentioned earlier, in the present scenario z ∈ {1, . . . , L} × {1, . . . , K}. This state space can be very large for large values of L. For the KTH dataset [7], where the spatial resolution is 160×120, L ≈ 19200. In order to reduce this state space, we place a prior on the movement of a person between consecutive frames: we assume that a person cannot be displaced by more than δ pixels from frame to frame. The probability of a transition from state z_{n-1} = (l_{n-1}, w_{n-1}) to z_n = (l_n, w_n), which we will write as T(z_{n-1}, z_n), is

T(z_{n-1}, z_n) = \begin{cases} 0 & \text{if } |l_n - l_{n-1}| > \delta \\ c & \text{if } |l_n - l_{n-1}| \le \delta \end{cases}    (4.12)
Fig. 4.4 below illustrates the bounded velocity motion model.
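A minimal sketch of a transition matrix over the (location, word) cross-product space with the bounded velocity constraint of (4.12). For brevity, locations are treated as 1-D here (a 2-D grid works the same way, with |·| as image distance), and the word-to-word transition matrix is an assumed input, uniform by default, rather than the learned A_C of the thesis:

```python
import numpy as np

def bounded_velocity_transitions(L, K, delta, word_T=None):
    """Transition matrix over (location, word) states, indexed s = l*K + w.
    Location moves of more than delta pixels get probability zero
    (Eq. 4.12); word transitions follow word_T (uniform if None).
    Rows are normalized to sum to one."""
    if word_T is None:
        word_T = np.full((K, K), 1.0 / K)
    locs = np.arange(L)
    loc_ok = (np.abs(locs[:, None] - locs[None, :]) <= delta).astype(float)
    T = np.kron(loc_ok, word_T)                 # (L*K, L*K)
    return T / T.sum(axis=1, keepdims=True)
```

Note that each row has at most (2δ+1)·K nonzero entries, which is what makes the dynamic programming of the next section tractable.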
As an alternative to enforcing the bounded velocity motion model directly, the distance transform techniques of Felzenszwalb and Huttenlocher [14] can be used to speed up the Viterbi algorithm by rewriting the Viterbi recursion as a distance transform. Each step of the standard Viterbi algorithm is O(K^2) (where K is the number of states), whereas using distance transform techniques it takes O(K) time, a factor-K speedup for large values of K.
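For 1-D locations and the flat transition costs of (4.12), the inner max of the Viterbi recursion over {j : |l(i) − l(j)| ≤ δ} for every i is exactly a sliding-window maximum, which a monotonic deque computes in O(K) total instead of O(Kδ). This illustrative sketch captures the flavor of the speedup; the Felzenszwalb-Huttenlocher distance transform additionally handles non-flat (e.g. quadratic) motion costs:

```python
from collections import deque

def sliding_window_max(scores, delta):
    """out[i] = max of scores[j] over |i - j| <= delta, computed in
    O(K) total with a deque of indices whose scores are decreasing."""
    K = len(scores)
    out = [0.0] * K
    dq = deque()
    for r in range(K + delta):
        if r < K:
            # drop dominated indices, then admit r
            while dq and scores[dq[-1]] <= scores[r]:
                dq.pop()
            dq.append(r)
        i = r - delta                    # window [i-delta, i+delta] complete
        if 0 <= i < K:
            while dq[0] < i - delta:     # expire indices left of the window
                dq.popleft()
            out[i] = scores[dq[0]]
    return out
```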
Figure 4.4: An illustration of the bounded velocity motion model. Each node represents a (location, word) hidden state. The only transitions considered are those shown by the lines connecting nodes at n−1 and n; the rest are not taken into consideration.
4.3.2.2 Inference
Given our joint model for tracking and recognition, we can now run inference on a new video sequence to find both the best path of hidden states and its action class. The inference process is as follows:

\max_{C,Z} p(Z, C \mid X) = \max_i \max_Z p(Z, C = i \mid X)
\propto \max_i p(C = i) \max_Z \left[ p(z_1 \mid \pi_i)\, p(x_1 \mid z_1, \phi_i) \prod_{n=2}^{N} p(x_n \mid z_n, \phi_i)\, p(z_n \mid z_{n-1}, A_i) \right]    (4.13)
For a specific C, we employ dynamic programming to find the best sequence of hidden states. If S_C(z_n = i) is defined as the score of the best path that ends in state i under θ_C, it can be computed recursively as follows:

S_C(z_n = i) = p(x_n \mid z_n = i) \max_j S_C(z_{n-1} = j)\, p(z_n = i \mid z_{n-1} = j)    (4.14)
where

S_C(z_1 = i) = p(z_1 = i)\, p(x_1 \mid z_1 = i)

Recalling our bounded velocity motion model and that z_n = (l_n, w_n), if we define l(z_n) = l_n, the recursion above can be rewritten as

S_C(z_n = i) = p(x_n \mid z_n = i) \max_{\{j : |l(i) - l(j)| \le \delta\}} S_C(z_{n-1} = j)\, p(z_n = i \mid z_{n-1} = j)    (4.15)
Finally, if S_C is the score of the highest scoring path,

S_C = \max_{z_N} S_C(z_N)    (4.16)

The final classifier can then be written as

C^* = \arg\max_C S_C    (4.17)
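Equations (4.14)-(4.17) amount to a Viterbi pass per action class followed by an argmax. A log-space sketch, illustrative rather than the thesis implementation: the bounded velocity constraint of (4.15) is encoded as zeros in the transition matrix (which become −inf in log space and are never chosen), and per-state emission probabilities are assumed precomputed:

```python
import numpy as np

def viterbi_score(emissions, T, pi):
    """Best-path score of Eqs. (4.14)-(4.16) in log space.

    emissions -- (N, K) array, emissions[n, i] = p(x_n | z_n = i)
    T         -- (K, K) transitions, zeros encode forbidden moves
    pi        -- (K,) initial state distribution
    """
    with np.errstate(divide="ignore"):          # log(0) -> -inf is intended
        log_T = np.log(T)
        S = np.log(pi) + np.log(emissions[0])   # base case S_C(z_1)
        for n in range(1, len(emissions)):      # Eq. (4.14)
            S = np.log(emissions[n]) + np.max(S[:, None] + log_T, axis=0)
    return float(S.max())                       # Eq. (4.16)

def classify_action(emissions_per_class, models):
    """Eq. (4.17): argmax over classes; models maps class -> (T, pi)."""
    return max(models, key=lambda c: viterbi_score(
        emissions_per_class[c], *models[c]))
```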
4.4 Motivation for joint tracking and recognition
In the section above, we presented a joint model for tracking and recognition. Here, we present an informal motivation for such a model: we run several baseline algorithms on a video sequence and compare their tracking performance to that of our joint tracking and recognition model.
4.4.1 The Tracking problem
There does not exist an out-of-the-box technology for tracking people in video sequences. Tracking people is difficult because people deform due to articulation, can move quickly from frame to frame, and often appear against clutter. Many previous approaches have relied on Kalman filtering or particle filtering [30]. Although it may not seem intuitive, recent approaches have advocated a track-by-detection philosophy, in which a track is stitched together by linking object detections across frames [31]. Such an approach has the advantage of not needing to be hand initialized, and it is robust to drift because the tracker essentially re-initializes itself at each frame. For example, Forsyth [30] explicitly advocates this approach for face tracking.
4.4.2 Experiment and baseline
Given that tracking by detection has been advocated by recent approaches, for our baseline we propose to link detections across frames using the detector of Dalal and Triggs [1]. Commonly known as the Dalal-Triggs detector, it is the state of the art for human detection in images and is trained specifically to find pedestrians. We also propose two other baselines. All our baselines are outlined below:

A: Linking detections across frames using the Dalal-Triggs detector
B: Employing the Dalal-Triggs detector in a dynamic programming framework
C: Reducing the joint model for tracking+recognition to a simple location HMM

In B and C, the hidden variable z ∈ {1, . . . , L}, where L is the number of discrete locations in a frame. Both use the bounded velocity motion model for transition probabilities. In B, the emission model is given by

p(x_i \mid i) \propto e^{w \cdot x_i}

where x_i is the HOG feature vector extracted from location i and w is the model. It should be noted that in its typical implementation, the Dalal-Triggs detector scans a 64×128 detection window across locations as well as scales. Here, the version of the detector used does not search over scales; images are rescaled such that the person of interest appears in the center of the detection window.
(a) Kicking sequence tracked by baseline A
(b) Kicking sequence tracked by baseline B
(c) Kicking sequence tracked by baseline C
(d) Kicking sequence tracked by the joint model
Figure 4.5: The sequence in (a), tracked using baseline A, is not consistent in the person it tracks. The sequence in (b), tracked using baseline B, is consistent but does not track the person of interest (the person kicking the ball). (c) and (d) do equally well in tracking the person of interest.
Baseline C is similar to the joint model except that there is no model for transitions between visual words; these transitions are simply not taken into consideration. We run all our baselines and our joint model on a kicking sequence from the UCF dataset [11]. We run baseline C and our joint model given that the classification of the sequence is already known to us. The joint model has a Gaussian emission density and uses the feature vector of Sec. 4.2.2.2. Fig. 4.5 shows the tracks output by the baselines and our joint model.

From the figure, it is clear that the Dalal-Triggs baseline (baseline A) is inconsistent in its tracking. One would expect the bounding box in each frame to be around the same person, but this is clearly not the case here. Baseline B enforces this consistency by using the
Dalal-Triggs detector in a dynamic programming framework, using the bounded velocity
motion model for transition probabilities. Both A and B are unable to track the person of interest, i.e. the person kicking the ball; instead, they seem to be tracking the person whose pose most resembles that of a pedestrian. The performance of baselines A and B on this video sequence demonstrates why pre-tracking using an external module cannot be assumed in general.

Baseline C outputs a track that is identical to the one output by the joint model. Although baseline C outputs a reasonable track without considering transitions between visual words, we believe that these transitions convey important information that will lead to more accurate tracks in more complex scenarios.

We note that baseline C is itself a novel contribution to tracking people in videos. It estimates a spatio-temporal track by linking together responses from a family of HOG templates, where each template encodes a visual action word. We are not aware of similar work in this vein.

Even though we do not see any major improvement of our joint model over baseline C, we note that in our model, jointly tracking and recognizing activities is no more expensive than separately pre-tracking and labelling actions. Since both can be done efficiently together, we argue that the joint model will be more robust in more difficult scenarios because it can enforce the constraint that all visual word templates are consistent with a single action class.
Chapter 5
Results
The algorithms described in Chapter 4 are tested on two datasets: the KTH human motion dataset [7] and the UCF Sports Action dataset [11].
5.1 Action Classification on KTH Dataset
The KTH human motion dataset is one of the largest video datasets of human actions. It contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects, both outdoors and indoors and in different clothes. All sequences were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. Representative frames from the dataset are shown in Fig. 5.1. The dataset was divided roughly equally into training and testing sets of video sequences. The training sequences were tracked and stabilized so that the figures appear in the center of the field of view; this was done using a combination of the Dalal-Triggs detector, background subtraction, and the method of Sabzmeydani et al. [12].
The SVM approach was tested on the dataset first. Once the optical flow was computed for every frame in the training set, six linear SVMs were built, one for each class, as described in Sec. 4.1. In building the negative training set for a particular action class, feature vectors corresponding to frames with no person in them were included, along with feature vectors from frames in the other five action classes. Given all six SVMs, a new video sequence was converted to a sequence of optical flows and classified by a majority vote taken across all frames. The confusion matrix for the KTH dataset using the SVM approach is shown in Table 5.1.

Figure 5.1: Representative frames from the KTH action dataset
box clap wave jog run walk
box 0.95 0.02 0.01 0.01 0.01 0.0
clap 0.03 0.88 0.08 0.0 0.0 0.01
wave 0.03 0.06 0.89 0.02 0.00 0.0
jog 0.01 0.0 0.0 0.59 0.32 0.08
run 0.0 0.0 0.0 0.26 0.66 0.08
walk 0.0 0.0 0.0 0.09 0.05 0.86
Table 5.1: Confusion matrix for the KTH dataset using the SVM approach (overall accuracy = 80.5%)
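Since the KTH test set has roughly the same number of sequences per class, the overall accuracy reported in Table 5.1 is, up to rounding, simply the mean of the diagonal of the row-normalized confusion matrix:

```python
import numpy as np

# Row-normalized confusion matrix of Table 5.1 (rows: true class).
M = np.array([
    [0.95, 0.02, 0.01, 0.01, 0.01, 0.00],  # box
    [0.03, 0.88, 0.08, 0.00, 0.00, 0.01],  # clap
    [0.03, 0.06, 0.89, 0.02, 0.00, 0.00],  # wave
    [0.01, 0.00, 0.00, 0.59, 0.32, 0.08],  # jog
    [0.00, 0.00, 0.00, 0.26, 0.66, 0.08],  # run
    [0.00, 0.00, 0.00, 0.09, 0.05, 0.86],  # walk
])

# With equally many test sequences per class, overall accuracy is the
# mean of the per-class (diagonal) accuracies.
overall = float(np.mean(np.diag(M)))  # 0.805, matching the reported 80.5%
```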
We also tested our joint model on the KTH dataset. A visual vocabulary was constructed as discussed in Sec. 4.2.1. We built an HMM for each of the six action classes and ran inference on every video sequence in the test set. As described in Sec. 4.3.2.2, each sequence was assigned the action corresponding to the highest scoring HMM. The confusion matrix using 255 codewords and optical flow features (Sec. 4.2.2.1) with a Gaussian emission model is shown in Table 5.2. The confusion matrix using 255 codewords and visual word SVM features (Sec. 4.2.2.2) with a Gaussian emission model is shown in Table 5.3.
box clap wave jog run walk
box 0.98 0.01 0.01 0.0 0.0 0.00
clap 0.01 0.96 0.03 0.0 0.0 0.0
wave 0.02 0.03 0.95 0.0 0.0 0.0
jog 0.0 0.0 0.0 0.72 0.25 0.03
run 0.0 0.0 0.0 0.23 0.74 0.03
walk 0.0 0.0 0.0 0.04 0.03 0.93
Table 5.2: Confusion matrix for the KTH dataset using our joint model and optical flow features (overall accuracy = 88%)
box clap wave jog run walk
box 0.98 0.01 0.01 0.0 0.0 0.0
clap 0.01 0.97 0.02 0.0 0.0 0.0
wave 0.02 0.02 0.96 0.0 0.0 0.0
jog 0.0 0.0 0.0 0.75 0.22 0.03
run 0.0 0.0 0.0 0.21 0.77 0.02
walk 0.0 0.0 0.0 0.02 0.03 0.95
Table 5.3: Confusion matrix for the KTH dataset using the joint model and visual word SVM based features (overall accuracy = 89.67%)
Table 5.4 shows how the methods introduced in this thesis perform on the KTH dataset compared to other action recognition methods. It should be noted that it is not entirely clear how the other methods split the training and testing sets. The methods introduced in this thesis perform reasonably well on the KTH dataset given that pretracking is not assumed.

method accuracy pretracking
Mori et al. [18] 97.2% yes
Wang et al. [2] 92.43% yes
HMM and visual word SVM feature vector 89.67% no
HMM and optical flow feature vector 88% no
Niebles et al. [3] 81.5% no
SVM approach 80.5% no
Table 5.4: Comparison of different methods on the KTH dataset
Fig. 5.3 shows consecutive frames from KTH running sequences, with bounding boxes representing the estimated location of the person. As seen in the figure, both the SVM approach and the joint model do rather well in tracking a wide variety of poses and localizing motion in a sequence of frames.

From the confusion matrices, it is apparent that the “boxing” action has the highest accuracy. This is because it is unlike any of the other five actions. For example, “waving” and “clapping” can be quite similar to one another because in many cases both actions start out the same way, with the person's hands stretched out at his/her side. Similarly, the “walking”, “running” and “jogging” actions are also similar, since they all involve movement of the legs. In particular, the running and jogging actions are often misclassified as one another. This is reasonable, since running and jogging are similar actions that can be hard to tell apart. Fig. 5.2 shows a few consecutive frames from a running and a jogging sequence.

Figure 5.2: The frames in the top row are from a jogging sequence; the frames in the bottom row are from a running sequence. These actions are similar to one another.
(a) boxing sequence
(b) jogging sequence
(c) hand waving sequence
(d) hand clapping sequence
(e) running sequence
(f) walking sequence
Figure 5.3: The persons in (a) and (b) were tracked (and their actions classified) using the SVM approach. The persons in (c) and (d) were tracked using the joint model and optical flow features. The persons in (e) and (f) were tracked using the joint model and visual word SVM based features.
Figure 5.4: Representative frames from the UCF Sports Action Dataset
5.2 Action Classification on UCF action dataset
The UCF (University of Central Florida) vision lab collected a set of eight actions from various sports featured on channels such as the BBC and ESPN. Actions in this dataset include golf swings, diving, kicking a soccer ball, weight lifting, horse riding, running, skating and swingbenching. Representative frames from this dataset are shown in Fig. 5.4. Unlike the KTH dataset, which has only one person per frame, the original dataset contains some frames in which more than one person appears. The dataset also includes cropped versions of the frames containing only the person of interest in the center. The dataset was divided roughly equally into training and testing sets.
To test the SVM approach on this dataset, eight SVMs, one for each action, were built in the training phase. The negative dataset for a particular action consisted of feature vectors from frames in the other seven actions, whereas the positive dataset consisted of feature vectors from frames in the action class itself. Classification of a new sequence of frames was done as described in Sec. 4.1. The confusion matrix for the UCF dataset using the SVM approach is shown in Table 5.5.

To test our joint model for tracking and recognition on the UCF dataset, we built eight HMMs, one for each class. The confusion matrix for the UCF dataset using a vocabulary of 45 visual words, optical flow features and a Gaussian emission model is shown in Table 5.6. Finally, the confusion matrix for the UCF dataset using the joint model, a vocabulary of 45 words and visual word SVM based feature vectors is shown in Table 5.7.
golf swing dive kick lift horse ride run skate swingbench
golf swing 0.8 0.0 0.01 0.0 0.06 0.02 0.08 0.03
dive 0.01 0.81 0.0 0.06 0.0 0.02 0.02 0.08
kick 0.0 0.01 0.67 0.01 0.03 0.2 0.06 0.02
lift 0.0 0.06 0.02 0.76 0.05 0.0 0.0 0.11
horse ride 0.05 0.02 0.02 0.03 0.85 0.01 0.01 0.01
run 0.01 0.01 0.22 0.0 0.01 0.68 0.05 0.02
skate 0.06 0.01 0.08 0.01 0.0 0.06 0.77 0.01
swingbench 0.02 0.06 0.02 0.01 0.0 0.0 0.0 0.8
Table 5.5: Confusion matrix for the UCF dataset using the SVM approach (overall accuracy
= 76.75%)
golf swing dive kick lift horse ride run skate swingbench
golf swing 0.89 0.0 0.0 0.0 0.03 0.01 0.06 0.01
dive 0.01 0.88 0.0 0.04 0.0 0.02 0.01 0.04
kick 0.0 0.0 0.72 0.01 0.02 0.18 0.05 0.02
lift 0.0 0.04 0.02 0.85 0.02 0.0 0.0 0.07
horse ride 0.02 0.0 0.01 0.01 0.93 0.01 0.01 0.01
run 0.0 0.01 0.19 0.01 0.01 0.74 0.03 0.01
skate 0.06 0.0 0.07 0.01 0.0 0.05 0.81 0.01
swingbench 0.0 0.03 0.01 0.06 0.0 0.0 0.0 0.9
Table 5.6: Confusion matrix for the UCF dataset using the joint model with optical flow
based features (overall accuracy = 84%)
golf swing dive kick lift horse ride run skate swingbench
golf swing 0.91 0.0 0.0 0.0 0.02 0.01 0.05 0.01
dive 0.01 0.9 0.0 0.03 0.0 0.02 0.0 0.04
kick 0.0 0.0 0.76 0.0 0.02 0.17 0.03 0.02
lift 0.0 0.04 0.01 0.88 0.01 0.0 0.0 0.06
horse ride 0.01 0.0 0.01 0.01 0.94 0.01 0.01 0.01
run 0.0 0.0 0.16 0.01 0.01 0.78 0.03 0.01
skate 0.04 0.0 0.04 0.0 0.0 0.05 0.86 0.01
swingbench 0.0 0.02 0.0 0.05 0.0 0.0 0.0 0.93
Table 5.7: Confusion matrix for the UCF dataset using the joint model and visual word SVM based features (overall accuracy = 87.1%)
From the confusion matrices, it seems there is some confusion between the “kick” and “run” actions. Given that these actions can be similar (for example, they both involve a wide stance), this is quite reasonable. Otherwise, both the SVM and HMM approaches seem to be doing reasonably well, even on complex actions like diving. The joint model in combination with the visual word SVM based feature set, in particular, does well on this dataset. Fig. 5.5 shows several sequences from the UCF action dataset with bounding boxes around detected motion patterns. Again, both the SVM approach and the joint model seem to track the person in the image sequence accurately.

We are aware of only one other method that has been tested on the UCF action dataset: [11] achieves an accuracy of 69.2% on it. Although our method appears to do considerably better than [11], it should be noted that a few actions mentioned in [11], such as pole vaulting and swing-baseball, do not appear in our version of the dataset. It may be that the publicly available dataset is a simpler subset of the collected dataset.
(a) lifting sequence
(b) golf swing sequence
(c) running sequence
(d) swingbench sequence
(e) diving sequence
(f) kicking sequence
(g) riding sequence
(h) skateboarding sequence
Figure 5.5: The persons in (a) and (b) were tracked (and their actions classified) using the SVM approach. (c) and (d) were tracked using the joint model with an optical flow based feature set. (e), (f), (g) and (h) were tracked using the joint model with visual word SVM based features.
Chapter 6
Conclusion and Future Work
In this thesis, we explored two different methods for tackling the action recognition problem. From the results on the KTH and UCF action datasets, it is apparent that the hidden Markov model approach outperformed the Support Vector Machine based approach. Intuitively, this makes sense, because the pose of a person in a frame certainly depends on the person's pose in previous frames. This suggests that the order in which frames appear in a video sequence should be taken into consideration when classifying it.
This thesis introduced a new method by which tracking and recognition can be done simultaneously. It tested rather well on the KTH and UCF action datasets; the next natural step would be to test it on video sequences with more complex motion patterns, such as YouTube videos. So far, only optical flow has been used as a feature set, but as motion patterns get more obscure, it may be worth exploring other feature sets and using them in conjunction with optical flow.

Throughout our work, we assumed that there is only one person of interest in a video sequence whose actions need to be tracked and recognized. The next step would be to consider how to extend the framework introduced here to track and recognize the actions of multiple people. Another simplifying assumption made in this thesis is that a person performs only a single action in a video sequence, whereas in reality this is hardly ever the case. It would be worth exploring an extension of the methods introduced here that considers transitions from one action class to another along with transitions among visual words.
Finally, we believe that a fully discriminative model such as a structural SVM [20] may perform better, since the transition model and location emission models would be trained jointly so as to produce correct tracks on the training data.
Bibliography
[1] N. Dalal B. Triggs. Histograms of Oriented Gradients for Human Detection. In IEEE
Conference Computer Vision and Pattern Recognition 2005, San Diego, USA, pages:886
to 893
[2] Y. Wang P. Sabzmeydani G. Mori. Semi-Latent Dirichlet Allocation: A Hierarchical
Model for Human Action In International Conference on Computer Vision Recognition
2007
[3] J. C. Niebles, H. Wang L. Fei-Fei. Unsupervised Learning of Human Action Categories
Using Spatial-Temporal Words. In International Journal of Computer Vision Volume 79,
Issue 3 (September 2008), pages:299-318
[4] I. Laptev T. Lindeberg. Space-Time Interest Points. In Conference on Computer
Vision 2003, Nice, France, pages:432 to 439
[5] D. M. Blei, A. Y. Ng, M. I. Jordan J. Lafferty. Latent dirichlet allocation. In Journal
of Machine Learning Research Jan 2003, pages:993 to 1022
[6] T. Lindeberg I. Laptev. Local Descriptors for Spatio-Temporal Recognition. In In
ECCV Workshop 2004 ”Spatial Coherence for Visual Motion Analysis”, Springer LNCS
Vol.3667, pages:91 to 103
[7] B. Caputo, C. Schuldt I. Laptev. Recognizing Human Actions: A Local SVM
Approach. In . ICPR 2004, Cambridge, UK. pages:32 to 36.
[8] I. Laptev. Local Spatio-Temporal Image Features for Motion Interpretation.
[9] A. Efros, A. C Berg, G. Mori, J. Malik Recognizing action at a distance. In EEE
International Conference on Computer Vision 2003. Volume 2. pages:726 to 733.
[10] B. D. Lucas T. Kanade An iterative image registration technique with an application
to stereo vision In Proceedings of the DARPA Image Understanding Workshop (April
1981). pages:121 to 130.
[11] D. Rodriguez, J. Ahmed M. Shah. Action MACH: A Spatio-temporal Maximum
Average Correlation Height Filter for Action Recognition In Vision and Pattern Recog-
nition, 2008
[12] P. Sabzmeydani G. Mori. Detecting pedestrians by learning shapelet features. In
2003.
64
[13] K. Murphy. Hidden Markov Model (HMM) Toolbox for Matlab. Software retrieved
from http://www.cs.ubc.ca/ murphyk/Software/HMM/hmm.html
[14] P. Felzenszwalb D. P. Huttenlocher. Distance Transforms of Sampled Functions. In
ornell Computing and Information Science Technical Report TR2004-1963, September
2004.
[15] T. Hofmann. Probabilistic Latent Semantic Indexing In of the Twenty-Second Annual
International SIGIR Conference on Research and Development in Information Retrieval
(SIGIR-99), 1999.
[16] C. Bishop. Pattern Recognition and Machine Learning.
[17] A. Ng. Support Vector Machines. Retrieved from
http://www.stanford.edu/class/cs229/notes/cs229-notes3.pdf.
[18] G. Mori, Y. Wang. Learning a Discriminative Hidden Part Model for Human Action
Recognition. In 2008.
[19] E. Shechtman and M. Irani. Space-time behavior based correlation. In 2005.
[20] I. Tsochantaridis, T. Joachims, T. Hofmann and Y. Altun. Large margin methods
for structured and interdependent output variables. In , 6, 14531484 2005.
[21] M. Brand. Coupled hidden markov models for complex action recognition. In lab
vision and modelling tr-407, MIT, 1997
[22] M. Brand, N. Oliver, and A.P. Pentland. Coupled hidden markov models for complex
action recognition. In Conference on Computer Vision and Pattern Recognition 1997,
pages:994 to 999
[23] N. Oliver, A. Garg, and E. Horvitz. Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding, vol. 96, no. 2, pages 163-180, November 2004.
[24] N. Oliver, E. Horvitz, and A. Garg. Layered representations for human activity recognition. In IEEE International Conference on Multimodal Interfaces, 2002, pages 3-8.
[25] T. Mori, Y. Segawa, M. Shimosaka, and T. Sato. Hierarchical recognition of daily human actions based on continuous hidden Markov models. In Face and Gesture Recognition, 2004, pages 779-784.
[26] A. D. Wilson and A. F. Bobick. Parametric hidden Markov models for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pages 884-900, September 1999.
[27] M. Brand and V. Kettnaker. Discovery and segmentation of activities in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pages 844-851, August 2000.
[28] A. Galata, N. Johnson, and D. Hogg. Learning structured behavior models using variable length Markov models. In International Workshop on Modelling People, 1999.
[29] A. Galata, N. Johnson, and D. Hogg. Learning behavior models of human activities. In British Machine Vision Conference, 1999.
[30] D. A. Forsyth, O. Arikan, L. Ikemoto, J. O'Brien, and D. Ramanan. Computational studies of human motion: part 1, tracking and motion synthesis. Foundations and Trends in Computer Graphics and Vision, July 2005.
[31] D. Ramanan and D. A. Forsyth. Automatic annotation of everyday movements. In Neural Information Processing Systems, December 2003.
[32] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008.
[33] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In Computer Vision and Pattern Recognition, 2005, pages 10-17.
[34] B. Epshtein and S. Ullman. Semantic hierarchies for recognizing objects and parts. In Computer Vision and Pattern Recognition, 2007.
[35] S. Ioffe and D. Forsyth. Probabilistic methods for finding people. International Journal of Computer Vision, pages 45-69, June 2001.
Appendices
A Data Sets
In this appendix, we give a brief introduction to the data sets used in this thesis.
A.1 KTH Dataset
The KTH human motion dataset is one of the largest video datasets of human actions. It
contains six types of human actions (walking, jogging, running, boxing, hand waving and
hand clapping) performed several times by 25 subjects both outdoors and indoors, with
different clothes. All sequences were taken over homogeneous backgrounds with a static camera at a frame rate of 25 fps. The dataset is fairly synthetic and does not represent real-world scenarios.
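The KTH sequences follow a regular file naming scheme, e.g. person01_boxing_d1_uncomp.avi, encoding the subject, action, and recording scenario. The following sketch, written for this appendix rather than taken from the thesis code, parses such a name into its three fields:

```python
import re

# The six KTH action classes listed above.
KTH_ACTIONS = {"walking", "jogging", "running",
               "boxing", "handwaving", "handclapping"}

def parse_kth_filename(name):
    """Split a KTH sequence filename into (subject, action, scenario).

    Scenarios d1-d4 correspond, per the dataset's documentation, to
    outdoors, outdoors with scale variation, outdoors with different
    clothes, and indoors.
    """
    m = re.match(r"person(\d{2})_([a-z]+)_d([1-4])_uncomp\.avi$", name)
    if m is None:
        raise ValueError("not a KTH sequence filename: %s" % name)
    subject, action, scenario = int(m.group(1)), m.group(2), int(m.group(3))
    if action not in KTH_ACTIONS:
        raise ValueError("unknown KTH action: %s" % action)
    return subject, action, scenario
```

For example, parse_kth_filename("person01_boxing_d1_uncomp.avi") yields (1, "boxing", 1), which makes it easy to group the 25 subjects into train and test splits.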
A.2 UCF Action Dataset
The UCF (University of Central Florida) vision lab collected a set of eight actions from
various sports featured on channels such as BBC and ESPN. Actions in this dataset include golf swings, diving, kicking a soccer ball, weight lifting, horse riding, running, skating, and swing-benching. Unlike the KTH dataset, which contains only one person per frame, some frames in the original dataset contain more than one person. This relatively new dataset contains close to 200 video sequences at a resolution of 720x480.
The collection represents a natural pool of actions featured in a wide range of scenes and viewpoints.
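For training a classifier, the eight UCF classes must be mapped to integer labels. A minimal sketch of such a mapping follows; the identifiers and their ordering are illustrative choices for this appendix, not part of the dataset's distribution:

```python
# The eight UCF sports actions described above, in an arbitrary but
# fixed order so that each class receives a stable integer label.
UCF_ACTIONS = ["golf_swing", "diving", "kicking", "weight_lifting",
               "horse_riding", "running", "skating", "swing_bench"]
UCF_LABELS = {name: i for i, name in enumerate(UCF_ACTIONS)}

# Native resolution of the UCF sequences, as (width, height).
FRAME_SIZE = (720, 480)

def label_of(action):
    """Return the integer label for a UCF action name."""
    return UCF_LABELS[action]
```

Keeping the ordering fixed in one place ensures that labels remain consistent between training and evaluation runs.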