
Beyond Controllers

Human Segmentation, Pose, and Depth Estimation
as Game Input Mechanisms
Glenn Sheasby
Thesis submitted in partial fulfilment of the requirements of the award of
Doctor of Philosophy
Oxford Brookes University
in collaboration with Sony Computer Entertainment Europe
December 2012
Abstract
Over the past few years, video game developers have begun moving away from the tradi-
tional methods of user input through physical hardware interactions such as controllers or
joysticks, and towards acquiring input via an optical interface, such as an infrared depth
camera (e.g. the Microsoft Kinect) or a standard RGB camera (e.g. the PlayStation Eye).
Computer vision techniques form the backbone of both input devices, and in this thesis,
the latter method of input will be the main focus.
In this thesis, the problem of human understanding is considered, combining segment-
ation and pose estimation. While focussing on these tasks, we examine the stringent
challenges associated with the implementation of these techniques in games, noting par-
ticularly the speed required for any algorithm to be usable in computer games. We also
keep in mind the desire to retain information wherever possible: algorithms which put
segmentation and pose estimation into a pipeline, where the results of one task are used
to help solve the other, are prone to discarding potentially useful information at an early
stage. By sharing information between these two problems and depth estimation, we
show that the results of each individual problem can be improved.
We adapt Wang and Koller’s dual decomposition technique to take stereo information
into account, and tackle the problems of stereo, segmentation and human pose estimation
simultaneously. In order to evaluate this approach, we introduce a novel, large dataset
featuring nearly 9,000 frames of fully annotated humans in stereo.
Our approach is extended by the addition of a robust stereo prior for segmenta-
tion, which improves information sharing between the stereo correspondence and human
segmentation parts of the framework. This produces an improvement in segmentation
results. Finally, we increase the speed of our framework by a factor of 20, using a highly
efficient filter-based mean field inference approach. The results of this approach compare
favourably to the state of the art in segmentation and pose estimation, improving on the
best results in these tasks by 6.5% and 7% respectively.
Acknowledgements
“Okay... now what?”
(Mike Slackenerny, PhD comic #844)
It is finished. Although the PhD thesis is a beast that must be tamed in solitude, I
don’t believe it’s something that can be done entirely alone, and there are many people
to whom I owe a debt of gratitude.
My supervisor, Phil Torr, made it possible for me to get started in the first place,
and gave me immeasurable help along the way. While we’re talking about how I came to
be doing a PhD, I should also thank my old boss, Andrew Stoddart, who recommended
that I apply, and the recession for costing me the software job I was doing after leaving
student life for the first time. I guess my escape velocity wasn’t high enough, moving
only two miles from my first alma mater. I’m about 770 miles away now, so that should
be enough!
My colleagues at Brookes also helped immensely, from those who helped me settle
in: David Jarzebowski, Jon Rihan, Chris Russell, Lubor Ladický, Karteek Alahari, Sam
Hare, Greg Rogez, and Paul Sturgess; to those who saw me off at the end of it: Paul
Sturgess, Sunando Sengupta, Michael Sapienza, Ziming Zhang, Kyle Zheng, and Ming-
Ming Cheng. Special thanks are due to Morten Lindegaard, who proof-read large chunks
of this thesis, and to my co-authors: Julien Valentin, Vibhav Vineet, Jonathan Warrell,
and my second supervisor, Nigel Crook.
Financial support from the EPSRC partnership with Sony is gratefully acknowledged,
and weekly meetings and regular feedback from Diarmid Campbell helped to guide and
focus my research. Furthermore, Amir Saffari and the rest of the crew at SCEE London
Studio provided a dataset, as well as feedback from a professional perspective.
I’d also like to thank my examiners, Teo de Campos, Mark Bishop, and David Duce,
for taking the time to read my thesis, and for providing useful feedback and engaging
discussion during the viva.
While struggling through my PhD years, I was kept sane in Oxford by a variety of
groups, including the prayer group at St. Mary Magdalene’s, Brookes Ultimate Frisbee,
and of course, the Oxford University Bridge Club, where I spent many Monday even-
ings exercising my mind (and liver), and where I met my wonderful fiancée, the future
Dr. Mrs. Dr. Sheasby, Aleksandra: all our successes are shared, but this one I owe to you
alone. You believed in me even when I did not believe in myself, and for that I will be
grateful to you for the rest of my life.
Lastly, but most importantly, I’d like to thank my parents, for raising me, for sup-
porting me in all of my endeavours, and for teaching me to question everything.
Contents
List of Figures 7
List of Tables 9
List of Algorithms 11
1 Introduction 13
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 Vision in Computer Games: A Brief History 19
2.1 Motion Sensors: Nintendo Wii . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 RGB Cameras: EyeToy and Playstation Eye . . . . . . . . . . . . . . . . 22
2.2.1 Early Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 Antigrav . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.3 Eye of Judgment . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.4 EyePet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.5 Wonderbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Depth Sensors: Microsoft Kinect . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.3 Vision Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Overall Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 State of the Art in Selected Vision Algorithms 33
3.1 Inference on Graphs: Energy Minimisation . . . . . . . . . . . . . . . . . 34
3.1.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . 34
3.1.2 Submodular Terms . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.3 The st-Mincut Problem . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1.4 Application to Image Segmentation . . . . . . . . . . . . . . . . . 38
3.2 Inference on Trees: Belief Propagation . . . . . . . . . . . . . . . . . . . 40
3.2.1 Message Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.2 Belief Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Human Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1 Pictorial Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.2 Flexible Mixtures of Parts . . . . . . . . . . . . . . . . . . . . . . 47
3.3.3 Unifying Segmentation and Pose Estimation . . . . . . . . . . . . 50
3.4 Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Humans in Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4 3D Human Pose Estimation in a Stereo Pair of Images 57
4.1 Joint Inference via Dual Decomposition . . . . . . . . . . . . . . . . . . . 58
4.1.1 Introduction to Dual Decomposition . . . . . . . . . . . . . . . . 59
4.1.2 Related Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Humans in Two Views (H2view) Dataset . . . . . . . . . . . . . . . . . . 67
4.2.1 Evaluation Metrics Used . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Inference Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Segmentation Term . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Pose Estimation Term . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.3 Stereo Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.4 Joint Estimation of Pose and Segmentation . . . . . . . . . . . . . 81
4.3.5 Joint Estimation of Segmentation and Stereo . . . . . . . . . . . . 82
4.4 Dual Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Binarisation of Energy Functions . . . . . . . . . . . . . . . . . . 84
4.4.2 Optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 Solving Sub-Problem L_1 . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.4 Solving Sub-Problem L_2 . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.5 Solving Sub-Problem L_3 . . . . . . . . . . . . . . . . . . . . . . . 89
4.5 Weight Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.1 Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.2 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5 A Robust Stereo Prior for Human Segmentation 103
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.1.1 Range Move Formulation . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Flood Fill Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 Application: Human Segmentation . . . . . . . . . . . . . . . . . . . . . 112
5.3.1 Original Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3.2 Stereo Term f_D . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.3 Segmentation Terms f_S and f_SD . . . . . . . . . . . . . . . . . . . 114
5.3.4 Pose Estimation Terms f_P and f_PS . . . . . . . . . . . . . . . . . 115
5.3.5 Energy Minimisation . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.6 Modifications to λ_D Vector . . . . . . . . . . . . . . . . . . . . . . 118
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.2 Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.4.3 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6 An Efficient Mean Field Based Method for Joint Estimation of Human
Pose, Segmentation, and Depth 125
6.1 Mean Field Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.1.1 Introduction to Mean-Field Inference . . . . . . . . . . . . . . . . 128
6.1.2 Simple Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.1.3 Performance Comparison: Mean Field vs Graph Cuts . . . . . . . 131
6.2 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2.1 Joint Energy Function . . . . . . . . . . . . . . . . . . . . . . . . 132
6.3 Inference in the Joint Model . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.4.1 Segmentation Performance . . . . . . . . . . . . . . . . . . . . . . 137
6.4.2 Pose Estimation Performance . . . . . . . . . . . . . . . . . . . . 137
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7 Conclusions and Future Work 143
7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . 144
Bibliography 147
List of Figures
2.1 Duck Hunt screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Wii Sensor Bar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 ‘Putting’ action from Wii Sports . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 EyeToy and Playstation Eye cameras. . . . . . . . . . . . . . . . . . . . . 22
2.5 EyeToy: Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Antigrav . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Eye of Judgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.8 EyePet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9 Wonderbook design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.10 A Wonderbook scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.11 Kinect games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.12 Furniture removal guidelines in Kinect instruction manual . . . . . . . . 30
3.1 Image for our toy example. . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Segmentation results on the toy image . . . . . . . . . . . . . . . . . . . 40
3.3 Skeleton Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Part models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Yang-Ramanan skeleton model . . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Stereo example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Subgradient example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Dual functions versus the cost variable λ. . . . . . . . . . . . . . . . . . . 66
4.3 Values of the dual function g(λ) . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Accuracy of Part Proposals . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 CRF Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Foreground weightings on a cluttered image from the Parse dataset . . . 76
4.7 Results using just f_S . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.8 Part selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.9 Limb recovery due to J_1 term . . . . . . . . . . . . . . . . . . . . . . . 83
4.10 Master-slave update process . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.11 Decision tree: parameter optimisation . . . . . . . . . . . . . . . . . . . . 94
4.12 Sample stereo and segmentation results . . . . . . . . . . . . . . . . . . . 97
4.13 Segmentation results on H2view . . . . . . . . . . . . . . . . . . . . . . . 98
4.14 Results from H2view dataset . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 Flood fill example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.2 Three successive range expansion iterations . . . . . . . . . . . . . . . . . 109
5.3 The new master-slave update process . . . . . . . . . . . . . . . . . . . . 117
5.4 Segmentation results on H2View . . . . . . . . . . . . . . . . . . . . . . . 119
5.5 Comparison of segmentation results on H2View . . . . . . . . . . . . . . 120
5.6 Failure cases of segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.1 Segmentation of the Tree image . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Basic 6-part skeleton model . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.3 Segmentation results on H2view compared to other methods . . . . . . . 138
6.4 Further segmentation results . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.5 Qualitative results on H2View dataset . . . . . . . . . . . . . . . . . . . 141
List of Tables
4.1 Table of Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Evaluation of f_S only . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Evaluation of f_S combined with f_PS . . . . . . . . . . . . . . . . . . . 82
4.4 List of weights learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.5 Evaluation of segmentation performance . . . . . . . . . . . . . . . . . . 96
4.6 Dual Decomposition results . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1 Segmentation results on the H2View dataset . . . . . . . . . . . . . . . . 121
5.2 Results (given in % PCP) on the H2view test sequence. . . . . . . . . . . 122
6.1 Evaluation of mean field on the MSRC-21 dataset . . . . . . . . . . . . . 131
6.2 Quantitative segmentation results on the H2View dataset . . . . . . . . . 137
6.3 Pose estimation results on the H2View dataset . . . . . . . . . . . . . . . 140
List of Algorithms
4.1 Parameter optimisation algorithm for dual decomposition framework. . . 93
5.1 Generic flood fill algorithm for an image I of size W ×H. . . . . . . . . 110
5.2 doLinearFill: perform a linear fill from the seed point (s_x, s_y). . . . . . 111
6.1 Naïve mean field algorithm for fully connected CRFs . . . . . . . . . . . 130
Chapter 1
Introduction
Over the past several years, a wide range of commercial applications of computer vision
have begun to emerge, such as face detection in cameras, augmented reality (AR) in shop
displays, and the automatic construction of image panoramas. Another key application
of computer vision that has become popular recently is computer games, with commercial
products such as Sony’s Playstation Eye and Microsoft’s Kinect selling millions of units
[98].
In creating these products, video game developers have been able to partially expand
the demographic of players. They have done this by moving away from the traditional
controller pad method of user input, and enabling the player to control the game using
other objects, such as books or AR markers, and even their own bodies. Some of the
most popular games that are either partially or completely driven using human motion
include sports games such as Wii Sports and Kinect Sports, and party games such as
the EyeToy: Play series. More recent games, such as EyePet and Wonderbook: Book of
Spells, combine motion information with object detection. A more thorough description
of these games can be found in Chapter 2.
Three main computer vision techniques are used to obtain input instructions for these
games: motion detection, object detection, and human pose estimation. The first of these,
motion detection, involves detecting changes in image intensity across several frames; in
video games, motion detection is used in particular areas of the screen as the player
attempts to complete tasks. Secondly, object detection involves determining the presence,
position and orientation of particular objects in the frame. The object can be a simple
shape (e.g. a quadrilateral) or a complex articulated object, such as a cat. In certain
video games, the detection of AR markers is used to add computer graphics to an image
of the player’s surroundings. Finally, the goal of human pose estimation is to determine
the position and orientation of each of a person’s body parts. Using images obtained via
an infrared depth sensor, Kinect games can track human poses over several frames, in
order to detect actions [110].
Theoretically, an image contains a lot more information than a controller can sup-
ply. However, the player can only provide information via a relatively limited set of
actions, either with their own body, or using some kind of peripheral object which can
be recognised.
The main aim of this thesis is to explore and expand the applicability of human
pose estimation to video games. After an analysis of the techniques that have already
been used, and of the current state of these techniques in research, our main application
will be presented. Using a stereo pair of cameras, we will develop a system that unifies
human segmentation, pose estimation, and depth estimation, solving the three tasks
simultaneously. In order to evaluate this system, we will present a large dataset containing
stereo images of humans in indoor environments where video games might be played.
1.1 Contributions
In summary, the principal contributions of this thesis are as follows:
• A system for the simultaneous segmentation and pose estimation of humans, as
well as depth estimation of the entire scene. This system is further developed by
the introduction of a stereo-based prior; the speed of the system is subsequently
improved by applying a state-of-the-art approximate inference technique.
• The introduction of a novel, 9,000 image dataset of humans in two views.
Throughout the thesis, the pronoun ‘we’ is used instead of ‘I’. This is done to follow
scientific convention; the contents of this thesis are the work of the author. Where others
have contributed towards the work, their collaborations will be attributed in a short
section at the end of each chapter.
1.2 Outline of the Thesis
Chapter 2 contains a description of some of the various attempts that games developers
have made to provide alternatives to controllers, and the impact that these games have
had on the video games community. Starting with the accelerometer and infrared detec-
tion based solutions provided by the Nintendo Wii, we observe the increasing amount
of integration of vision techniques, with this trend demonstrated by the methods used
by Sony’s EyeToy and PlayStation Eye-based games over the past several years. Finally,
we consider the impact that depth information can have in enabling the software to
determine the pose of the player’s body, as shown by the Microsoft Kinect.
Following on from that, Chapter 3 contains an appraisal of related work in computer
vision that might be applied in computer games. We consider the different approaches
commonly used to solve the problems of segmentation and human pose estimation, and
give an overview of some of the approaches that have been used to provide 3D information
given a pair of images from a stereo camera.
Chapter 4 describes a novel framework for the simultaneous depth estimation of a
scene, and segmentation and pose estimation of the humans within that scene. Using a
stereo pair of images as input provides us with the ability to compute the distance of each
pixel from the camera; additionally, we can use standard approaches to find the pixels
occupied by the human, and predict its pose. In order to share information between
these three approaches, we employ a dual decomposition framework [62, 127]. Finally, to
evaluate the results obtained by our method, we introduce a new dataset, called ‘Humans
in Two Views’, which contains almost 9,000 stereo pairs of images of humans.
In Chapter 5, we extend this approach to improve the quality of information shared
between the segmentation and depth estimation parts of the algorithm. Observing that
the human occupies a continuous region of the camera’s field of view, we infer that the
distance of human pixels from the camera will vary only in certain ways, without sharp
boundaries (we say that the depth is smooth). Therefore, starting from pixels that we are
very confident lie within the human, we can extract a reliable initial segmentation from
the depth map, significantly improving the overall segmentation results.
The drawback of the dual decomposition-based approach, however, is that it is much
too slow to be used in computer games. In Chapter 6, we adapt our framework in order
to apply an approximate, but very fast, inference approach based on mean field [64]. Our
use of this new inference approach enables us to improve the information sharing between
the three parts of the framework, providing an improvement in accuracy, as well as an
order-of-magnitude speed improvement.
While the mean-field inference approach is much quicker than the dual decomposition-
based approach, its speed (close to 1 fps) is still not fast enough for real-time applications
such as computer games. In Chapter 7, the thesis concludes with some suggestions for
how to further improve the speed, as well as some other promising possible directions for
future research. The concluding chapter also contains a summary of the work presented
and contributions made.
1.3 Publications
Several chapters of this thesis first appeared as conference publications, as follows:
• G. Sheasby, J. Warrell, Y. Zhang, N. Crook, and P.H.S. Torr. Simultaneous hu-
man segmentation, depth and pose estimation via dual decomposition. In British
Machine Vision Conference, Student Workshop, 2012. (Chapter 4, [108])
• G. Sheasby, J. Valentin, N. Crook, and P.H.S. Torr. A robust stereo prior for human
segmentation. In Asian Conference on Computer Vision (ACCV), 2012. (Chapter
5, [107])
The contributions of co-authors are acknowledged in the corresponding chapters. The first
paper [108] received the “best student paper” award at the BMVC workshop. Addition-
ally, some sections of Chapter 6 form part of a paper that is currently under submission
at a major computer vision conference.
Chapter 2
Vision in Computer Games: A Brief History
Figure 2.1: A screenshot [84] from Duck Hunt, an early example of a game that used
sensing technology.
The purpose of a game controller is to convey the user’s intentions to the game. A wide
variety of input methods, for instance a mouse and keyboard, a handheld controller,
or a joystick, have been employed for this purpose. Video games using some sort of
sensing technology (instead of, or in addition to, those listed above) have been available
for several decades. In 1984, Nintendo released a light gun, which detects light emitted by
CRT monitors; this release was made popular by the game Duck Hunt for the Nintendo
Entertainment System (NES), in which the player aimed the gun at ducks that appeared
on the screen (Figure 2.1). When the trigger is pulled, the screen is turned black for one
frame, and then the target area is turned white in the next frame. If the gun is pointed
at the correct place, it detects this change in intensity, and registers a hit.
Figure 2.2: The sensor bar, which emits infrared light that is detected by Wii remotes.
The picture [81] was taken with a camera sensitive to infrared light; the LEDs are not
visible to the human eye.
Over the past few years, technological developments have made it easier for video
game developers to incorporate sensing devices to augment, or in some cases replace, the
traditional controller pad method of user input. These devices include motion sensors,
RGB cameras, and depth sensors. The following sections give a brief summary of the
applications of each in turn.
2.1 Motion Sensors: Nintendo Wii
The Wii is a seventh-generation games console that was released by Nintendo in late 2006.
Unlike previous consoles, the unique selling point of the Wii was a new form of player
interaction, rather than greater power or graphics capability. This new form of interaction
was the Wii Remote, a wireless controller with motion sensing capabilities. The controller
contains an accelerometer, enabling it to sense acceleration in three dimensions, and an
infrared sensor, which is used to determine where the remote is pointing [49].
Unlike light guns, which sense light from CRT screens, the remote detects light from
the console’s sensor bar, which features ten infrared LEDs (Figure 2.2). The light from
each end of the bar is detected by the remote’s optical sensor as two bright lights. Trian-
gulation is used to determine the distance between the remote and the sensor bar, given
the observed distance between the two bright lights and the known distance between the
LED arrays.
Figure 2.3: An example of the use of the Wii Remote's motion sensing capabilities to
control game input. Here, the player moves the remote as he would move a putter when
playing golf. The power of the putt is determined by the magnitude of the swing [74].
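The triangulation just described can be sketched with a simple pinhole-camera model.
In the following Python sketch, the focal length, LED baseline, and measured pixel
separation are purely illustrative values, not the Wii's actual calibration.

FOCAL_LENGTH_PX = 1300.0   # assumed focal length of the remote's IR sensor, in pixels
LED_BASELINE_MM = 200.0    # assumed spacing between the two LED clusters on the sensor bar

def remote_distance_mm(observed_separation_px):
    # Similar triangles: baseline / distance = observed separation / focal length,
    # so distance = focal length * baseline / observed separation.
    return FOCAL_LENGTH_PX * LED_BASELINE_MM / observed_separation_px

print(remote_distance_mm(130.0))   # roughly 2 metres for this made-up measurement

The closer the remote is to the bar, the further apart the two lights appear on its sensor,
so the estimated distance falls as the observed separation grows.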
The capability of the Wii to track position and motion enables the player to mimic
actual game actions, such as swinging a sword or tennis racket. This capability is demon-
strated by games such as Wii Sports, which was included with the games console in the
first few years after its release. The remote can be used to mimic the action of bowling
a ball, or swung like a tennis racket, a baseball bat, or a golf club (Figure 2.3).
2.1.1 Impact
The player still uses a controller, although in some games, like Wii Sports, it is now
the position and movement of the remote that is used to influence events in-game. This
makes playing the games more tiring than before, especially if they are played with
vigour. However, a positive effect is that the control system is more intuitive, meaning
that people who don't normally play traditional video games might still be interested in
owning a Wii console [103].
(a) EyeToy [114] (b) PlayStation Eye [32]
Figure 2.4: The two webcam peripherals released by Sony.
2.2 RGB Cameras: EyeToy and Playstation Eye
While Nintendo’s approach uses the position and motion of the controller to enhance
gameplay, other games developers have made use of the RGB images provided by cameras.
The first camera released as a games console peripheral and used as an input device for a
computer game was the EyeToy, which was released for the PlayStation 2 (PS2) in 2003.
This was followed in 2007 by the PlayStation Eye (PS Eye) for the PlayStation 3 (PS3).
Some of Sony’s recent games have used the PlayStation Move (PS Move) in addition
to the PS Eye. The Move is a handheld plastic controller which has a large, bright ball
on the top; the hue of this ball can be altered by the software. During gameplay, the ball
is easily detectable by the software, and is used as a basis for determining the position,
orientation and motion of the PS Move [113].
The degree to which vision techniques have been applied to EyeToy games has varied
widely. Some games only use the camera to allow the user to see themselves, whereas
others require significant levels of image processing. The following sections contain de-
scriptions of some of the games that have used image processing to enhance gameplay.
(a) Ghost Catcher [48] (b) Keep Up [117] (c) Kung Foo [42]
Figure 2.5: Screenshots of three mini-games from EyeToy: Play.
2.2.1 Early Games
In its original release, the EyeToy was released in a bundle with EyeToy: Play, which
features twelve mini-games. The game play is simplistic, as is common with party-oriented
video games. Many of them rely on motion detection; for instance, the object of “Ghost
Catcher” is to fill ghosts with air and then pop them, and this is done by repeatedly
waving your hands over them. Others, such as “Keep Up”, use human detection; the
player is required to keep a ball in the air. Therefore, the game needs to determine
whether there is a person in the area where the ball is.
A third use of vision in this game occurs in “Kung Foo”; in this mini-game, the
player stands in the middle of the camera’s field of view, and is instructed to hit ninjas
that fly onto the screen from various directions. Again, motion detection can be used to
determine whether a hit has been registered, as it doesn’t matter which body part was
used to perform the hit.
Impact
As the mini-games in EyeToy: Play only require simplistic image understanding tech-
niques, specifically the detection of motion within a small portion of the camera’s field
of view, the underlying techniques seemed to work well. As with Wii Sports, the game
was aimed at casual gamers rather than traditional, or “hardcore”, gamers.
(a) Antigrav screenshot (b) Close-up of user display
Figure 2.6: A screenshot [91] from Antigrav, where the player has extended their right
arm to grab an object (the first of three) and thus score some points. The user display
shows where the game has detected the player’s hands to be.
2.2.2 Antigrav
Antigrav, a PS2 game that utilises the EyeToy, is a futuristic trick-based snowboarding
game, and was brought out by Harmonix in late 2004. The player takes control of a
character in the game, and guides them down a linear track. The game uses face tracking
to control the character’s movements, enabling the player to increase the character’s speed
by ducking, and change direction by leaning. In addition, the player’s hands are tracked,
and their hand position is used to infer a pose, enabling the player to literally grab for
collectible objects on-screen. The player can see what the computer calculates their head
and hand positions to be in the form of a small diagram in the corner of the screen, as
shown in Figure 2.6. A GameSpot review [24] points out:
“this is good for letting you know when the EyeToy is misreading your move-
ments, which takes place more often than it ought to.”
The review, like other reviews of PS2 EyeToy releases, hints at further technological
limitations impairing the enjoyment of the game:
“Harmonix pushes the limits of what you should expect from an EyeToy
entry... unfortunately, EyeToy pushes back, and its occasional inconsistency
hobbles an otherwise bold and enjoyable experience.”
Impact
The reviews above imply that the head and hand detection techniques employed by
the game were not completely effective, meaning that users are often frustrated by their
actions not being recognised by the game due to failure of the tracking system. This high-
lights the importance of accuracy when developing vision algorithms for video games: if
your tracking algorithm fails around 5% of the time, then the 95% accuracy is, quantit-
atively, extremely good. However, during a 3-minute run down a track in Antigrav, this
could still result in failures of the tracking system that take several seconds to recover
from, and which would be clearly noticeable to gamers.
2.2.3 Eye of Judgment
Figure 2.7: An image [3] showing the set-up of Eye of Judgment. The camera is pointed
at a cloth, on which several cards are placed. These cards are recognised by the game,
and the on-screen display shows the objects or creatures that the cards represent.
In 2007, Sony released Eye of Judgment, a role-playing card-game simulation that can be
compared to the popular card game Magic: The Gathering. The PS3 game comes with
a cloth, and a set of cards with patterns on them that the computer can easily recognise
(Figure 2.7). It can recognise the orientation as well as the identity of the cards, enabling
them to have different functions when oriented differently. A review reported “very
few hardware-related issues”, principally because of the pattern-based card recognition
system [122].
Since then, the PS3 saw very little PS Eye-related development before the release of
EyePet in October 2009; in the two years between the releases of Eye of Judgment and
EyePet, the use of the PS Eye was generally limited to uploading personalised images for
game characters.
2.2.4 EyePet
(a) EyePet’s AR marker (b) EyePet with trampoline
Figure 2.8: An example of augmented reality being used in EyePet [4].
EyePet features a virtual pet, which interacts with people and objects in the real world
using fairly crude motion sensing. For example, if the player rolls a ball towards the pet,
it will jump out of the way. Another major feature of the game is the use of augmented
reality: a card with a specific pattern is detected in the camera’s field of view, and a
“magic toy” (a virtual object that the pet can interact with, such as a trampoline, a
bubble-blowing monkey, or a tennis player) is shown on top of the card (see Figure 2.8).
Impact
Again, EyePet uses fairly simplistic vision techniques, with marker detection and a motion
buffer being used throughout the game. This prevents it from receiving the sort of
criticism that was associated with Antigrav.
Although it was generally well-received, even EyePet did not escape criticism for the
limitations of its technology, which “can’t help but creak at times” according to a review
published in Eurogamer [129]. The review goes on to say that performance is robust
under strong natural light, but patchy under electric light in the evening.
This sort of comment shows the unforgiving nature of video gamers, or at least of video
game reviewers: for a vision technique to be useful in a game, it needs to work reliably
under a very wide variety of environments and lighting conditions.
2.2.5 Wonderbook
(a) (b)
Figure 2.9: The Wonderbook (a) [82] is used with the PlayStation Move controller. The
interior (b) [115] features AR markers, as well as markings on the border to identify the
edge of the book, and ones near the edge of the page, which help to identify the page
quickly.
Wonderbook: Book of Spells, released by Sony in November 2012, is the first in an up-
coming series of games that will use computer vision methods to enhance gameplay. The
games will be centred upon a book whose pages contain augmented reality markers and
other patterns (Figure 2.9). These are detected by various pattern recognition techniques,
in order to determine where the book is, and which pages are currently visible. Once
this is known, augmented reality can be used to replace the image of the book with, for
example, a burning stage (Figure 2.10).
Figure 2.10: After the Wonderbook is detected, gameplay objects can be overlaid on-
screen. In this image [116], a 3D stage is superimposed onto the book.
In Book of Spells, the book becomes a spell book, and through the gameplay, spells
from the Harry Potter series are introduced [53]. At various points in the game, the
player must interact with the book, for example to put out fires by patting the book.
Skin detection algorithms are used to ensure that the player’s hands appear to occlude
the spellbook, rather than going through it.
The generality of the book enables it to be used in multiple different kinds of games.
BBC’s Walking with Dinosaurs will be made into an interactive documentary, with the
player first excavating and completing dinosaur skeletons, and then feeding the dinosaurs
using the PS Move [55]. It remains to be seen how the final versions of these games will
be appraised by reviewers and customers, and thus whether the Wonderbook franchise
will have a significant impact on the video gaming market.
2.3 Depth Sensors: Microsoft Kinect
While RGB cameras can be useful in enhancing gameplay with vision techniques, the
extra information provided by depth cameras makes it significantly easier to determine
the structure of a scene. This enables games developers to provide a new way of playing.
With Kinect, which was released in November 2010, Microsoft offer a system where “you
are the controller”. Using an infrared depth sensor to track 3D movement, they generate
a detailed map of the scene, significantly simplifying the task of, for example, tracking
the movement of a person.
(a) (b) (c)
Figure 2.11: A selection of different games available for Kinect. (a) Kinect Sports [78]: two
players compete against each other at football. (b) Dance Central [7]: players perform
dance moves, which are tracked and judged by the game. (c) Kinect Star Wars [99]:
players swing a lightsabre by making sweeping movements with their arm.
2.3.1 Technical Details
The Kinect provides a 320 × 240 16-bit depth image, and a 640 × 480 32-bit RGB image,
both running at 30 frames per second (fps); the depth sensor has an active range of 1.2 to
3.5 metres [93]. The skeletal detection system, used for detecting the human body in each
frame, is based on random forest classifiers [110], and is capable of tracking a twenty-
link skeleton of up to two active players in real-time.¹ The software also provides an
object-specific segmentation of the people in the scene (i.e. different people are segmented
separately), and further enhances the player's experience by using person recognition to
provide greetings and content.
¹In an articulated body, a link is defined as an inflexible part of the body. For example, if the
flexibility of fingers and thumbs is ignored, each arm could be treated as three links, with one link each
for hand, forearm and upper arm.
2.3.2 Games
As with the Nintendo Wii, the Kinect was launched along with a sports game, namely
Kinect Sports. The controls are intuitive: the player makes a kicking motion in order
to kick a football, or runs on the spot in the athletics mini-games. No controllers or
buttons are required, which makes the games very easy to adapt to, although some of
the movements need to be exaggerated in order for the game to recognise them [18].
Figure 2.12: The furniture removal guidelines in the Kinect instruction manual [79] advise
the player to move tables etc. that might block the camera's view, which may cause a
problem for some users.
Another intuitive game is Dance Central, which uses the Kinect’s full body tracking
capabilities to compare the player’s dance moves to those shown by an on-screen in-
structor. The object of the game is to imitate these moves in time with the music. This
can be compared to classic games like Dance Dance Revolution, with the difference that
the player’s whole body is now used, enabling a greater variety of moves and adding an
element of realism [128].
Up until now, games developers have struggled to produce a game that uses the
Kinect’s capabilities, yet still appeals to the serious gamer. One attempt was made
in the 2011 release Kinect Star Wars, in which the player uses their arms to control
a lightsaber, making sweeping or chopping motions to remove obstacles, and to defeat
enemies. However, this game was criticised for its inability to keep up with fast and
frantic arm motions [126].
A common problem with the Kinect model of gaming is that it is necessary to stand
a reasonable distance away from the camera (2 to 3 metres is the recommended range),
which makes gaming very difficult in small rooms, especially as any furniture will need
to be moved away (Figure 2.12).
2.3.3 Vision Applications
Since its release, and the subsequent release of an open-source software development
kit [83], the Kinect has been used in a wide variety of non-gaming related work by
computer vision researchers. Oikonomidis et al. [87] developed a hand-tracking system
capable of running at 15 fps, while Izadi et al. [54] perform real-time 3D reconstruction of
indoor scenes by slowly moving the Kinect camera around the room. The Kinect has also
been shown to be a useful tool for easily collecting large amounts of training data [47].
However, due to IR interference, the depth sensor does not work in direct sunlight,
making it unsuitable for outdoor applications such as pedestrian detection [39].
2.3.4 Overall Impact
The Kinect has had a huge impact, selling 19 million units worldwide in its first
eighteen months. This has helped Microsoft improve sales of the Xbox 360 year-on-year,
despite the console now being in its seventh year. This is the reverse of the trend shown
by competing consoles [98]. The method of controlling games using the human body
rather than a controller is revolutionary, and the technology has also had a significant
effect on vision research, as mentioned in Section 2.3.3 above.
2.4 Discussion
To date, a number of vision methods that use RGB cameras have been introduced to
the video gaming community. However, these tend to be low-level (motion detection or
marker detection) rather than high-level: if the only information given is an RGB signal,
unconstrained object detection and human pose estimation are neither accurate nor fast
enough to be useful in video games.
The depth camera used in the Microsoft Kinect has provided a huge leap forward in
this area, although the cost of this peripheral (which had a recommended retail price of
£129.99 at release, around four times more than the PS Eye) means that an improvement
in the RGB-based techniques would be desirable. The next chapter contains an appraisal
of related work in computer vision that might be of interest to games developers, and
provides background for this thesis.
Chapter 3
State of the Art in Selected Vision
Algorithms
While we have seen in Chapter 2 that computer vision techniques are beginning to have a
profound effect on computer games, there are a number of research areas which could be
applied to further transform the gaming industry. Accurate object segmentation would
allow actual objects, or even people, to be taken directly from the player’s surroundings
and put into the virtual environment of the game. Human motion tracking could be
used to allow the player to navigate a virtual world, for instance by steering a vehicle.
Finally, human pose estimation could be used to allow the player to control an avatar in
a platform or role-playing game. In this chapter, we will discuss the current state of the
art in energy minimisation, human pose estimation, segmentation, and stereo vision.
In order for computer vision techniques like localisation and pose estimation to be
suitable for use in computer games, the algorithm that applies the technique needs to
respond in real time as well as being accurate. A fast algorithm is necessary because
the results (e.g. pose estimates) need to be used in real-time so that they can affect
the game in-play; very high accuracy is a requirement because mistakes made by the
game will undoubtedly frustrate the user (see [24] and Section 2.2.2). The problem is
to find a suitable balance between these two requirements (a faster algorithm might
involve approximate solutions, and hence could be less accurate). This may involve
tweaking existing algorithms to produce significant speed increases without any loss in
accuracy, or developing novel and significantly more accurate algorithms that still have
speed comparable to current state-of-the-art algorithms.
3.1 Inference on Graphs: Energy Minimisation
Many of the most popular problems in computer vision can be framed as energy minim-
isation problems. This requires the definition of a function, known as an energy function,
which expresses the suitability of a particular solution to the problem. Solutions that are
more probable should give the energy function a lower value; hence, we wish to find the
solution that gives the lowest value.
3.1.1 Conditional Random Fields
Suppose we have a finite set V of random variables, to which we wish to assign labels from
a label set L. If all the variables are independent, then this problem is easily solvable
- just find the best label for each variable. However, in general we have relationships
between variables. Let E be the set of pairs {v_1, v_2} of variables in V which are related
to one another.
We can then construct a graph G = (V, E) which specifies both the set of variables,
and the relationships between those variables. G is a directed graph if the pairs in E are
ordered; this enables us to construct graphs where, for some v_1, v_2, we have (v_1, v_2) ∈ E
but (v_2, v_1) ∉ E.
Given some observed data X, we can assign a set {y_i : v_i ∈ V} of values to the
variables in V. Let f denote a function that assigns a label f(v_i) = y_i to each v_i ∈ V.
Now, suppose that we also have a probability function p that gives us the probability of
a particular labelling {f(v_i) : v_i ∈ V} given observed data X. Then:

Definition 3.1
(X, V) is a conditional random field if, when conditioned on X, the variables
V obey the Markov property with respect to G:

p(f(v_i) = y_i | X, {f(v_j) : j ≠ i}) = p(f(v_i) = y_i | X, {f(v_j) : (v_i, v_j) ∈ E}).   (3.1)

In other words, each output variable y_i only depends on its neighbours [72].
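As a concrete instance of (3.1), consider a chain of three variables with
E = {{v_1, v_2}, {v_2, v_3}}. Conditioned on X, the Markov property gives

p(f(v_1) = y_1 | X, f(v_2), f(v_3)) = p(f(v_1) = y_1 | X, f(v_2)),

so once f(v_2) is known, knowing f(v_3) tells us nothing further about v_1; the middle
variable v_2, by contrast, depends on both of its neighbours.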
3.1.2 Submodular Terms
Now we consider set functions, which are functions whose input is a set. For example,
suppose we have a set Y of possible variable values, and a set V, with size Ω = |V|,
of variables v_i which each take a value y_i ∈ Y. A function f which takes as input an
assignment of these variables {y_i : v_i ∈ V} is a set function.
Energy functions are set functions f : Y^Ω → R^+ ∪ {0}, which take as input the variable
values {y_i : v_i ∈ V}, and output some non-negative real number. If the variable values
are binary, then this f is a binary set function f : 2^Ω → R^+ ∪ {0}.

Definition 3.2
A binary set function f : 2^Ω → R^+ ∪ {0} is submodular if and only if for every
S, T ⊆ V we have that:

f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T).   (3.2)
For example, if Ω = 2, S = [1, 0] and T = [0, 1], a submodular function will satisfy the
following inequality [104]:
f([1, 0]) + f([0, 1]) ≥ f([1, 1]) + f([0, 0]).   (3.3)
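As a small numerical illustration of this inequality, the sketch below checks (3.3) for a
pairwise cost of the form used later in Section 3.1.4, 1(z_i ≠ z_j) · exp(−|p_i − p_j|); the
pixel values are hypothetical.

import math

def pairwise_cost(z_i, z_j, p_i=120, p_j=135):
    # A penalty is paid only when the two labels differ, weighted by how similar
    # the corresponding pixel values are.
    return float(z_i != z_j) * math.exp(-abs(p_i - p_j))

lhs = pairwise_cost(1, 0) + pairwise_cost(0, 1)   # f([1, 0]) + f([0, 1])
rhs = pairwise_cost(1, 1) + pairwise_cost(0, 0)   # f([1, 1]) + f([0, 0])
assert lhs >= rhs                                  # inequality (3.3) holds
print(lhs, ">=", rhs)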
From Schrijver [104], we also have the following proposition:

Proposition 3.1 The sum of submodular functions is submodular.

Proof It is sufficient to prove that, given two submodular functions f : A → R^+ ∪ {0}
and g : B → R^+ ∪ {0}, h = f + g : A ∪ B → R^+ ∪ {0} is submodular.

h(S) + h(T) = (f + g)(S) + (f + g)(T)
            = f(S|_A) + g(S|_B) + f(T|_A) + g(T|_B)
            = (f(S|_A) + f(T|_A)) + (g(S|_B) + g(T|_B))
            ≥ (f((S ∪ T)|_A) + f((S ∩ T)|_A)) + (g((S ∪ T)|_B) + g((S ∩ T)|_B))
            = f((S ∪ T)|_A) + g((S ∪ T)|_B) + f((S ∩ T)|_A) + g((S ∩ T)|_B)
            = h(S ∪ T) + h(S ∩ T).
As shown by Kolmogorov and Zabih [61], one way of minimising energy functions, par-
ticularly submodular energy functions, is via graph cuts, which we will now introduce.
3.1.3 The st-Mincut Problem
In this section, we will consider directed graphs G = (V, E) that have special nodes
s, t ∈ V such that for all v_i ∈ V \ {s, t}, we have (s, v_i) ∈ E, (v_i, t) ∈ E, (v_i, s) ∉ E, and
(t, v_i) ∉ E. We say that s is the source node and t is the sink node of the graph. Such a
graph is also known as a flow network. Let c be a function c : E → R^+ ∪ {0}, where for
each (v_1, v_2) ∈ E, c(v_1, v_2) represents the capacity, or maximum amount of flow, of the
edge.
Max Flow
Definition 3.3
A flow function is a function f : E → R^+ ∪ {0} which satisfies the following
constraints:

1. f(v_1, v_2) ≤ c(v_1, v_2)   ∀ (v_1, v_2) ∈ E;

2. Σ_{v_1 : (v_1, v) ∈ E} f(v_1, v) = Σ_{v_2 : (v, v_2) ∈ E} f(v, v_2)   ∀ v ∈ V \ {s, t}.
The definition given above gives us two guarantees: first, that the flow passing along a
particular edge does not exceed that edge’s capacity; and second, that the flow entering
a vertex other than the source and sink is equal to the flow leaving that vertex. From
this second constraint, we can
derive the following:
Definition 3.4
The flow of a flow function is the total amount passing from the source to the
sink, and is equal to Σ_{(s, v) ∈ E} f(s, v).
The objective of the max flow problem is to maximise the flow of a network, i.e. to find
a flow function f with the highest flow.
Min Cut
Definition 3.5
An s-t cut C = (S, T ) is a partition of the variables v ∈ V into two disjoint
sets S and T , with s ∈ S and t ∈ T .
Let E* be the set of edges that connect a variable v_1 ∈ S to a variable v_2 ∈ T. Formally:

E* = {(v_1, v_2) ∈ E : v_1 ∈ S, v_2 ∈ T}   (3.4)

Note that there are at least |V| − 2 edges in E*, as if v ∈ S \ {s}, then (v, t) ∈ E*, and
if v ∈ T \ {t}, then (s, v) ∈ E*. Depending on the connectivity of G, there may be up to
(|S| − 1) × (|T| − 1) additional edges.

Definition 3.6
The capacity of an s-t cut is the sum of the capacities of the edges connecting
S to T, and is equal to Σ_{(v_1, v_2) ∈ E*} c(v_1, v_2).
The objective of the min cut problem is to find an s-t cut which has minimal capacity
(there may be more than one solution).
In 1956, it was shown independently by Ford and Fulkerson [41] and by Elias et al. [30]
that the two problems above are equivalent. Therefore, to find a flow function that has
maximal flow, one needs only to find an s-t cut with minimal capacity. Algorithms that
seek to obtain such an s-t cut are known as graph cut algorithms. Submodular functions
can be efficiently minimised via graph cuts [15, 61]; C++ code is available that performs
this minimisation using an augmenting path algorithm [14, 58, 61]. This code is often used
as a basis for image segmentation algorithms, for example [9, 16, 71, 100, 101, 130].
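The equivalence can be illustrated on a toy flow network. The sketch below uses the
networkx library rather than the C++ code of [58], and the network and its capacities
are arbitrary.

import networkx as nx

G = nx.DiGraph()
G.add_edge("s", "a", capacity=3.0)
G.add_edge("s", "b", capacity=2.0)
G.add_edge("a", "b", capacity=1.0)
G.add_edge("a", "t", capacity=2.0)
G.add_edge("b", "t", capacity=3.0)

flow_value, flow_dict = nx.maximum_flow(G, "s", "t")
cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
assert flow_value == cut_value   # maximal flow equals minimal cut capacity [30, 41]
print(flow_value, source_side, sink_side)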
3.1.4 Application to Image Segmentation
To illustrate the use of energy minimisation in image segmentation, consider the following
example. We have an image, shown in Figure 3.1, with just 9 pixels (3 × 3). To construct
a graph, we create a set of vertices V = {s, t, v_1, v_2, . . . , v_9}, and a set of edges E, with
(s, v_i) and (v_i, t) ∈ E for i = 1 to 9, and (v_i, v_j) ∈ E if v_i and v_j are adjacent in the
image, as shown in Figure 3.1. The vertices v_i have pixel values p_i between 0 and 255
inclusive (i.e. the image is 8-bit greyscale, with 0 corresponding to black, and 255 to
white). Our objective is to separate the pixels into foreground and background sets, i.e.
to define a labelling z = {z_1, z_2, . . . , z_9}, where z_i = 1 if and only if v_i is assigned to the
foreground set.
Figure 3.1: Image for our toy example.
We wish to separate the light pixels in the image from the dark ones, with the light
pixels in the foreground, so we create foreground and background penalties θ^F and θ^B
respectively for pixels v_i as follows:

θ^F(v_i) = 255 − p_i;   (3.5)
θ^B(v_i) = p_i.   (3.6)
These are known as unary pixel costs. The total unary cost of a labelling z is:

θ(z) = Σ_{i=1}^{9} ( z_i · θ^F(v_i) + (1 − z_i) · θ^B(v_i) ).   (3.7)

We also want the boundary of the foreground set to align with edges in the image. There-
fore, we wish to penalise cases where adjacent pixels have similar values, but different
labels. This is done by including a pairwise cost φ:

φ(z) = Σ_{(v_i, v_j) ∈ E} 1(z_i ≠ z_j) · exp(−|p_i − p_j|),   (3.8)

where 1 is the indicator function, which has a value of 1 if the statement within the
brackets is true, and zero otherwise. The overall energy function is:

f(z) = θ(z) + γ · φ(z),   (3.9)

where γ is a weight parameter; higher values of γ will make it more likely that adjacent
pixels have similar labels.
The energy function in (3.9) is submodular, and can therefore be minimised efficiently
using the max flow code available at [58]. The segmentation results obtained for different
values of γ are shown in Figure 3.2. The ratio between the unary and pairwise weights
influences the segmentation result produced.
(a) γ = 0.1 (b) γ = 0.2 (c) γ = 1
Figure 3.2: Segmentation results for different values of γ. A higher value punishes seg-
mentations with large boundaries; a high enough value (as in (c)) will make the result
either all foreground or all background.
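To make the toy example concrete, the sketch below builds the same 3 × 3 graph and
minimises (3.9) with an off-the-shelf min-cut solver (networkx) in place of the code
of [58]; the pixel values and the value of γ are hypothetical stand-ins for the image in
Figure 3.1.

import math
import networkx as nx

pixels = [230, 225, 40,
          220, 210, 35,
           50,  45, 30]   # assumed 8-bit greyscale values for the 3x3 image, row-major
gamma = 50.0              # pairwise weight; purely illustrative

G = nx.DiGraph()
for i, p in enumerate(pixels):
    G.add_edge("s", i, capacity=float(p))         # cut if z_i = 0: pays theta_B(v_i) = p_i
    G.add_edge(i, "t", capacity=float(255 - p))   # cut if z_i = 1: pays theta_F(v_i) = 255 - p_i

def neighbours(i):
    # 4-connected neighbours to the right and below, so each pair is added once.
    row, col = divmod(i, 3)
    if col + 1 < 3:
        yield i + 1
    if row + 1 < 3:
        yield i + 3

for i in range(9):
    for j in neighbours(i):
        w = gamma * math.exp(-abs(pixels[i] - pixels[j]))   # pairwise cost, as in (3.8)
        G.add_edge(i, j, capacity=w)
        G.add_edge(j, i, capacity=w)

energy, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
foreground = sorted(v for v in source_side if v != "s")     # pixels with z_i = 1
print("minimum energy:", energy, "foreground pixels:", foreground)

The capacity of the minimum cut equals the energy (3.9) of the returned labelling: each
pixel pays exactly one of its two unary costs, and each differently-labelled adjacent pair
pays its pairwise cost once.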
3.2 Inference on Trees: Belief Propagation
While vision problems such as segmentation require a large number of variables (one per
image pixel), others, such as pose estimation, only require a smaller number of variables,
and hence a smaller graph. An important type of graph that is useful for this problem is
the tree structure.
Definition 3.7
Let G = (V, E) be any graph. We define a path to be an ordered set of
vertices {v_1, . . . , v_n} ⊆ V (where n ≥ 2), such that (v_i, v_{i+1}) ∈ E for all
i ∈ {1, . . . , n − 1}.
Definition 3.8
A tree is an undirected graph T = (V, E) in which any two vertices are con-
nected by exactly one path.
In order to perform inference on this tree, we require a set of labels L_i that each vertex
v_i can take a value z_i from. Note that we allow each variable to have a different label set.
Let z be a labelling of T, i.e. an assignment of a label z_i ∈ L_i to each vertex v_i ∈ V.
As with inference in graphs, we may define an energy function f(z), and this energy
function may consist of unary and pairwise terms. Throughout this section, we shall
assume the existence of a unary potential θ(v_i = z_i), which provides the cost associated
with v_i taking a label z_i, and a pairwise potential φ(v_i = z_i, v_j = z_j), providing the cost
of the variables v_i and v_j such that (v_i, v_j) ∈ E taking labels z_i and z_j.
As described in the remainder of this section, inference on trees may be performed
using belief propagation, a message-passing algorithm which was first proposed by Pearl
in 1982 [90]. Belief propagation proves to be computationally efficient, as it exploits the
structure of the tree.
3.2.1 Message Passing
A message passing method is an inference method where vertices are considered separately, and each vertex v_i receives information from connected vertices v_j (where (v_i, v_j) ∈ E) in the form of a message. Here, a message can be as simple as a scalar value, or a matrix of values. This message is then combined with information relevant to the vertex itself, to form a new message for the next vertex.
Messages are passed between vertices in a series of pre-defined updates. The number
of updates required to find an overall solution depends on the complexity of the graph.
If the graph has a simple structure, such as a chain (where each vertex is connected to
at most two other vertices, and the graph is not a cycle), then only one set of updates is
required to find the optimal set of values. This set can be found by an algorithm such as
the Viterbi algorithm [125].
3.2.2 Belief Propagation
Belief propagation can be viewed as a variation of the Viterbi algorithm that is applicable to trees. To use this process to perform inference on a tree T, we must choose a vertex of the tree to be the root vertex, denoted v_0.
Since T is a tree, v_0 is connected to each of the other vertices by exactly one path. We can therefore re-order the vertices such that, for any vertex v_i, the path from v_0 to v_i proceeds via vertices with indices in ascending order (there will typically be multiple ways to do this). Once we have done this, we can
introduce the notions of parent-child relations between vertices, defined here for clarity.
Definition 3.9
We say that a vertex v_i is the parent of v_j if (v_i, v_j) ∈ E and i < j. If this is the case, we say that v_j is a child of v_i.
Note that the root node has no parents, and each other vertex has exactly one parent, since if a vertex v_j had two parents, then there would be more than one path from v_0 to v_j, which contradicts the definition of a tree. However, a vertex may have multiple children, or none at all.
Definition 3.10
A vertex v_i with no children is known as a leaf vertex.
We now describe the general form of belief propagation on our tree T . The vertices
in V are considered in two passes: a “down” pass, where the vertices are processed in
descending order, so that each vertex is processed after its children, but before its parent,
and an “up” pass, where the order is reversed.
For each leaf vertex v_i, we have a set L_i = {l^1_i, l^2_i, . . . , l^{K_i}_i} of possible labels, where K_i is the number of labels in the set L_i. Then the score associated with assigning a particular label l^p_i to vertex v_i is:

score_i(l^p_i) = θ(v_i = l^p_i).   (3.10)

This score is the message that is passed to the parent of v_i.
For a vertex v_j with at least one child, we need to combine these messages with the unary and pairwise energies, in order to produce a message for the parent of v_j. Again, we have a finite set L_j = {l^1_j, l^2_j, . . . , l^{K_j}_j} of possible labels for v_j. The score associated with assigning a particular label l^q_j is:

score_j(l^q_j) = θ(v_j = l^q_j) + Σ_{i > j : (v_i, v_j) ∈ E} m_i(l^q_j),   (3.11)

where:
m_i(l^q_j) = max_{l^p_i} [ φ(v_i = l^p_i, v_j = l^q_j) + score_i(l^p_i) ].   (3.12)
When the root vertex is reached, the optimal label for v_0 can be found by maximising score_0, defined in (3.11). Finally, the globally optimal configuration can be found by keeping track of the arg max indices, and then tracing back through the tree on the "up" pass to collect them. The "up" pass can be avoided if the arg max indices are recorded along with the messages during the "down" pass.
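The following sketch implements (3.10)-(3.12) for a small tree with random scores; the tree structure, label sets and potentials are hypothetical, and the scores are maximised as in the equations above:

import numpy as np

# Vertices 0..3, with vertex 0 as the root; parent[i] gives the parent of i.
parent = {1: 0, 2: 0, 3: 1}
labels = {0: 3, 1: 2, 2: 2, 3: 3}            # K_i: number of labels of each vertex

rng = np.random.default_rng(0)
theta = {i: rng.random(k) for i, k in labels.items()}                          # unary scores
phi = {(c, p): rng.random((labels[c], labels[p])) for c, p in parent.items()}  # pairwise scores

score, backptr = {}, {}
# "Down" pass: process vertices in descending order (children before parents).
for i in sorted(labels, reverse=True):
    s = theta[i].copy()
    for c in (c for c, p in parent.items() if p == i):
        table = phi[(c, i)] + score[c][:, None]   # phi(v_c = l_c, v_i = l_i) + score_c
        s += table.max(axis=0)                    # message (3.12), added as in (3.11)
        backptr[(c, i)] = table.argmax(axis=0)
    score[i] = s

# Choose the best label at the root, then trace the arg max indices back down.
assignment = {0: int(score[0].argmax())}
for c in sorted(parent):                          # each parent index is smaller than c
    assignment[c] = int(backptr[(c, parent[c])][assignment[parent[c]]])
print(assignment)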
One of the vision problems that is both interesting for computer games and suitable
for the application of belief propagation is human pose estimation, and it is this problem
that is described in the next section.
3.3 Human Pose Estimation
In layman’s terms, the problem of human pose estimation can be stated as follows: given
an image containing a person, the objective is to correctly classify the person’s pose. This
pose can either be in the form of selection from a constrained list, or freely estimating
the locations of a person’s limbs, and the angles of their joints (for example, the location
of their left arm, and the angle at which their elbow is bent). This is often formalised
by defining a skeleton model, which is to be fitted to the image. It is quite common to
describe the human body as an articulated object, i.e. one formed of a connected set of
rigid parts. Such a formalisation gives rise to a family of parts-based models known as
pictorial structure models.
These models typically consist of six parts (if the objective is restricted to upper body
pose estimation) or ten parts (full body) [10, 35–37]. The upper body model consists of
head, torso, and upper and lower arms; to extend this to the full body, upper and lower
leg parts are added. Having divided the human body into parts, one can then learn a
separate detector for each part, taking advantage of the fact that the parts have both a
simpler shape (not being articulated), and a simpler colour distribution.
Indeed, pose estimation can be formulated as an energy minimisation problem. In
contrast to segmentation problems, which require a different variable for each pixel, the
number of variables required is equal to the number of parts in the skeleton model.
However, the number of possible values that the variables can take is large (a part can,
in theory, occupy any position in the image, with any orientation and extent).
Figure 3.3: A depiction of the ten-part skeleton model used by Felzenszwalb and Huttenlocher [35].
3.3.1 Pictorial Structures
A pictorial structure model can be expressed as a graph G = (V, E), with the vertices V = {v_1, v_2, . . . , v_n} corresponding to the parts, and the edges E specifying which pairs of parts (v_i, v_j) are connected. A typical graph is shown in Figure 3.3.
In Felzenszwalb and Huttenlocher's pictorial structure model [35], a particular labelling of the graph is given by a configuration L = {l_1, l_2, . . . , l_n}, where each l_i specifies the location (x_i, y_i) of part v_i, together with its orientation, and degree of foreshortening (i.e. the degree to which the limb appears to be shorter than it actually is, due to its angle relative to the camera). The energy of this labelling is then given by:

E(L) = Σ_{i=1}^{n} θ(l_i) + Σ_{(v_i, v_j) ∈ E} φ(l_i, l_j),   (3.13)

where, as in the previous section, θ represents the unary energy on part configuration, and φ the pairwise energy. These energies relate to the likelihood of a configuration, given
the image data, and given prior knowledge of the parts; more “realistic” configurations
will have lower energy. Despite the large number of possible configurations, a globally
optimal configuration can be found efficiently. This can be done by using simple appear-
ance models for each part, explained in the following section, and then applying belief
propagation.
Appearance Model
For each part, appearance models can be learned from training data, and can be based on
edges [95], colour-invariant features such as HOG [22,37], or the position of the part within
the image [38]. Another approach for video sequences is to apply background subtraction,
and define a unary potential based on the number of foreground pixels around the object
location [35].
Given an image, this appearance model can be evaluated over a dense grid [1]; to
speed this process up, a feature pyramid can be defined, so that a number of promising
locations are found from a coarse grid, and then higher resolution part filters produce
more precise matching scores [34, 36]. In order to reduce the time taken by the inference
process, it might be desirable to reduce the set of possible part locations. Two ways to
do this are:
1. Thresholding, where part locations are removed if their score is worse than some predefined value, or falls outside the top N values for some N.
2. Non-maximal suppression, which involves the removal of part locations that are
similar, but inferior, to other part locations.
Optimisation
After applying these techniques, we now have a small set of possible locations {l^1_i, l^2_i, . . . , l^k_i} for each vertex v_i. For a leaf vertex v_i, the score of each location l^p_i is:

score_i(l^p_i) = θ(l^p_i).   (3.14)
Now, for a vertex v_j with at least one child, the score is defined in terms of the children:

score_j(l^p_j) = θ(l^p_j) + Σ_{v_i : (v_i, v_j) ∈ E} m_i(l^p_j),   (3.15)

where:

m_i(l^p_j) = max_{l^q_i} [ φ(l^q_i, l^p_j) + score_i(l^q_i) ].   (3.16)
Finally, the top-scoring part configuration is found by finding the root location with the
highest score, and then tracing back through the tree, keeping track of the arg max indices.
Multiple detections can be generated by thresholding this score and using non-maximal
suppression.
3.3.2 Flexible Mixtures of Parts
Yang and Ramanan [131] extend these approaches by introducing a flexible mixture of
parts model, allowing for greater intra-limb variation.
Rather than using a classical articulated limb model such as that of Marr and Nishi-
hara [75], they introduce a new representation: a mixture of non-orientable pictorial
structures. Instead of having ten rigid parts, as the methods described in Section 3.3.1
do, their model has twenty-six rigid parts, which can be combined to form limbs and
produce an estimate for the ten parts, as shown in Figure 3.4. Each part has a number
T of possible types, learned from training data. Types may include orientations of a part
(e.g. horizontal or vertical hand), and may also span semantic classes (e.g. open versus
closed hand).
Model
Let us denote an image by I, the location of part i by p_i = (x, y), and its mixture component by t_i, with i ∈ {1, . . . , K}, p_i ∈ {1, . . . , L}, and t_i ∈ {1, . . . , T}, where K is the number of parts, L is the number of possible part locations, and T is the number of mixture components per part.
Figure 3.4: Evolution of part models, from the basic 10-part model to Yang and Ramanan's 26-part model. (a) The basic model features ten parts, representing limbs. (b) Instead, [131] uses the endpoints of limbs (orange), giving 14 parts. (c) The midpoints (white) of the eight limb parts are added, giving a total of 22 parts. (d) Four parts are added to model the torso, producing a final total of 26 parts.

Figure 3.5: Graph showing the relations between parts in Yang and Ramanan's part model.
Certain configurations of parts, such as those where adjacent parts have the same orientation, are more likely to occur than others. To model this, a compatibility function S is defined on mixture choices t = {t_1, . . . , t_K}:

S(t) = Σ_{i ∈ V} b^{t_i}_i + Σ_{(i,j) ∈ E} b^{(t_i, t_j)}_{(i,j)},   (3.17)

where V is the set of vertices and E is the set of edges in the K-vertex relational graph specifying which pairs of parts are adjacent, represented graphically in Figure 3.5.
This score is added to energy terms including unary potentials θ on parts, and pairwise potentials φ between parts:

S(I, p, t) = S(t) + Σ_{i ∈ V} w^{t_i}_i · θ(I, p_i) + Σ_{(i,j) ∈ E} w^{(t_i, t_j)}_{(i,j)} · φ(p_i − p_j).   (3.18)

Here, the appearance model computes the local score at p_i of placing a template w^{t_i}_i for part i, tuned for type t_i. In the final term, we have φ(p_i − p_j) = [dx, dx², dy, dy²]ᵀ,
where dx and dy represent the relative location of part i with respect to j. The parameter w^{(t_i, t_j)}_{(i,j)} encodes the expected values for dx and dy, tailored for types t_i and t_j. So if i is the elbow and j is the forearm, with t_i and t_j specifying vertically-oriented parts (i.e. the arm is at the person's side), we would expect p_j to be below p_i on the image.
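As a small illustration of this final term, the sketch below evaluates the deformation score w · φ(p_i − p_j) for a single pair of parts; the weight vector and locations are made-up values, not learned parameters of [131]:

import numpy as np

def deformation_score(p_i, p_j, w_pair):
    dx, dy = p_i[0] - p_j[0], p_i[1] - p_j[1]
    phi = np.array([dx, dx ** 2, dy, dy ** 2])   # phi(p_i - p_j) = [dx, dx^2, dy, dy^2]^T
    return float(w_pair @ phi)

# A made-up weight vector for a type pair that expects part j to lie roughly
# 20 pixels below part i in the image (i.e. dy = y_i - y_j of about -20), with
# the quadratic terms penalising larger deviations from that offset.
w_pair = np.array([0.0, -0.01, -0.4, -0.01])
print(deformation_score((100, 60), (102, 80), w_pair))   # close to the maximum score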
Inference
To perform inference on this model, Yang and Ramanan maximise S(I, p, t) over p and t. Since the graph G in Figure 3.5 is a tree, belief propagation (see Section 3.2.2) can again be used. The score of a particular leaf node p_i with mixture t_i is:

score_i(t_i, p_i) = b^{t_i}_i + w^{t_i}_i · θ(I, p_i),   (3.19)
and for all other nodes, we take into account the messages passed from the node's children:

score_i(t_i, p_i) = b^{t_i}_i + w^{t_i}_i · θ(I, p_i) + Σ_{c ∈ children(i)} m_c(t_i, p_i),   (3.20)

where:

m_c(t_i, p_i) = max_{t_c} [ b^{(t_i, t_c)}_{(i,c)} + max_{p_c} ( score_c(t_c, p_c) + w^{(t_i, t_c)}_{(i,c)} · φ(p_i − p_c) ) ].   (3.21)
Once the messages passed reach the root part (i = 1), score_1(t_1, p_1) contains the best-scoring skeleton model given the root part location p_1. Multiple detections can be generated by thresholding this score and using non-maximal suppression.
3.3.3 Unifying Segmentation and Pose Estimation
So far, a number of methods for solving either human segmentation or pose estimation
have been discussed. Some recent work has also been done that attempts to solve both
tasks together. In this section, we discuss PoseCut [16].
PoseCut
Bray et al. [16] tackle the segmentation problem by introducing a pose-specific Markov
random field (MRF), which encourages the segmentation result to look “human-like”.
This prior differs from image to image, as it depends on which pose the human is in.
Given an image, they find the best pose prior Θ_opt by solving:

Θ_opt = arg min_Θ [ min_x Ψ_3(x, Θ) ],   (3.22)

where x specifies the segmentation result, and Ψ_3 is the Object Category Specific MRF from [65], which defines how well a pose prior Θ fits a segmentation result x. It is defined as follows:
Ψ_3(x, Θ) = Σ_i [ φ(D|x_i) + φ(x_i|Θ) + Σ_j ( φ(I|x_i, x_j) + ψ(x_i, x_j) ) ],   (3.23)
where I is the observed (image) data, φ(D|x_i) is the unary segmentation energy, φ(x_i|Θ) is the cost of the segmentation given the pose prior (penalising pixels near to the shape being background, and pixels far from the shape being foreground), and the ψ term is a pairwise energy. Finally, φ(I|x_i, x_j) is a contrast-sensitive term, defined as:

φ(I|x_i, x_j) = γ(i, j) if x_i ≠ x_j;  0 if x_i = x_j,   (3.24)

where γ(i, j) is a decreasing function of the difference in RGB values of pixels i and j; pixels with similar values will therefore have a high value for γ(i, j), since we wish to encourage these pixels to have the same label.
Given a particular pose prior Θ, the optimal configuration x* = arg min_x Ψ_3(x, Θ) can be found using a single graph cut. The final solution arg min_x Ψ_3(x, Θ_opt) is found using the Powell minimisation algorithm [94].
3.4 Stereo Vision
Stereo correspondence algorithms typically denote one image as the reference image and
the other as the target image. A dense set of patches is extracted from the reference
image, and for each of these patches, the best match is found in the target image. The
displacement between the two patches is known as the disparity; the disparity for each
pixel in the reference image is stored in a disparity map. It can easily be shown that
the disparity of a pixel is inversely proportional to its distance from the camera, or its
depth [105]. A typical disparity map is shown in Figure 3.6.
A plethora of stereo correspondence algorithms have been developed over the years.
Scharstein and Szeliski [102] note that earlier methods can typically be divided into four
stages: (i) matching cost computation; (ii) cost aggregation; (iii) disparity computation;
and (iv) disparity refinement; later methods can be described in a similar fashion [20, 60,
76, 92, 97].
It is quite common to use the sum of absolute differences measure when finding the
matching cost for each pixel. A patch with a height and width of 2n + 1 pixels for some
n ≥ 0 is extracted from the reference image. Then, for each disparity value d, a patch
is extracted from the target image, and the pixelwise intensity values for each patch are
compared.
With L and R representing the reference (left) and target (right) images respectively,
the cost of assigning disparity d to a pixel (x, y) in L is as follows:
θ(x, y, d) = Σ_{δx=−n}^{n} Σ_{δy=−n}^{n} |L(x + δx, y + δy) − R(x + δx − d, y + δy)|   (3.25)
Evaluating this cost over all pixels and disparities provides a cost volume, on which
aggregation methods such as smoothing can be applied in order to reduce noise. Disparity
values for each pixel can then be computed. The simplest method for doing this is just
to find for each pixel (x, y) the disparity value d which minimises θ(x, y, d).
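A minimal winner-take-all implementation of this scheme is sketched below, assuming two rectified greyscale images supplied as floating-point NumPy arrays of equal size; the window size and disparity range are arbitrary choices:

import numpy as np
from scipy.ndimage import uniform_filter

def disparity_map(left, right, n=4, max_disp=32):
    """For each pixel of the reference (left) image, pick the disparity d that
    minimises the window-aggregated absolute difference of (3.25)."""
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # Pixelwise |L(x, y) - R(x - d, y)|; columns x < d have no valid match.
        diff = np.abs(left[:, d:] - right[:, :w - d])
        # Aggregate over a (2n+1) x (2n+1) window; using the mean rather than
        # the sum does not change which disparity is cheapest.
        cost[d, :, d:] = uniform_filter(diff, size=2 * n + 1, mode='nearest')
    return cost.argmin(axis=0)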
Figure 3.6: An example disparity map produced by a stereo correspondence algorithm: (a) left image, (b) right image, (c) disparity map. Pixels that are closer to the camera appear brighter in colour. Note the dark gaps produced in areas with little or no texture.

However, such a method is likely to result in a high degree of noise. Additionally, pixels immediately outside a foreground object are often given disparities that are higher than the actual disparity value, producing what are known as halo artefacts in the result [20].
To overcome these problems, many methods either perform a post-processing step
to refine the output, or use more complex methods of computing the disparity map.
For example, Criminisi et al. [20] use an approach based on dynamic programming to
reduce halo artefacts. Rhemann et al. [97] smooth the cost volume with a weighted box
filter, with weights chosen so that edges in the input image correspond with edges in the
disparity image, while in a similar vein, Hirschmüller et al. [51] use patches of multiple
sizes to correct errors at object boundaries.
3.4.1 Humans in Stereo
A number of methods have been developed that use stereo correspondence as a pre-
processing step for a wider task, for example foreground-background segmentation. Kol-
mogorov et al. [60] achieve real-time segmentation of stereo video sequences. Stereo
correspondence is used together with colour and contrast information to provide an im-
proved result.
Often in these works, the foreground object of interest is a human. This is the case
in [20], where the final objective is to improve video conferencing by using cameras either
side of the monitor to generate a view from the centre of the monitor, therefore increasing
eye contact. More ambitiously, Pellegrini and Iocchi [92] use stereo as a basis for posture
classification, although they omit arms from their body model and only choose from a
limited set of postures.
3.5 Discussion
In this chapter, several existing techniques in computer vision have been introduced;
we have considered general problems such as energy minimisation on graphs and belief
propagation on trees, and shown how they have been applied to segmentation and human
pose estimation. These tasks are of particular interest to video game developers, especially
after the advent of the Kinect (see Section 2.3) gave the general public a showcase of
the possibility of controlling video games with the human body. To this end, in the next
chapter, we will begin exploring the possibility of providing a human scene understanding
framework, combining pose estimation with segmentation and depth estimation.
Chapter 4
3D Human Pose Estimation in a Stereo Pair
of Images
The problem of human pose estimation has been widely studied in the computer vision
literature; a survey of recent work is provided in Section 3.3. Despite the large body of
research focussing on 2D human pose estimation, relatively little work has been done to
estimate pose in 3D, and in particular, annotated datasets featuring frontoparallel stereo
views of humans are non-existent.
In recent years, some research has focussed on combining segmentation and pose
estimation to produce a richer understanding of a scene [10, 16, 66, 89]. Many of these
approaches simply put the algorithms into a pipeline, where the result of one algorithm
is used to drive the other [10, 16, 89]. The problem with this is that it often proves
impossible to recover from errors made in the early stages of the process. Therefore,
a joint inference framework, as proposed by Wang and Koller [127] for 2D human pose
estimation, is desired.
This chapter describes a new algorithm for estimating human pose in 3D, while sim-
ultaneously solving the problems of stereo matching and human segmentation. The al-
gorithm uses an optimisation method known as dual decomposition, of which we give an
overview in Section 4.1.
Following that, a new dataset for two-view human segmentation and pose-estimation,
H2view, is presented in Section 4.2, and our inference framework is described in Section
4.3. Our application of dual decomposition is discussed in Section 4.4, while our learning
approach can be found in Section 4.5, and we evaluate our algorithm in Section 4.6.
Finally, we discuss the outcomes of this chapter in Section 4.7.
The contributions of this chapter can be summarised as follows: we present a novel
dual decomposition framework to combine stereo algorithms with pose estimation and
segmentation. Our system is fully automated, and is applicable to more general tasks
involving object segmentation, stereo and pose estimation. Drawing these together, we
demonstrate a proof of concept that the achievements of the Kinect, covered in Section
2.3, can be matched using a stereo pair of images, instead of using infrared depth data.
We also provide an extensive new dataset of humans in stereo, featuring nearly 9,000
annotated images.
An earlier version of part of this chapter previously appeared as [108]. Graph cuts
are performed using the max-flow code of Boykov and Kolmogorov [14, 58].
4.1 Joint Inference via Dual Decomposition
A decomposition method is an optimisation method that minimises a complex function
by splitting it into easier subfunctions. One ‘easy’ class of functions are convex functions,
of which we give a definition below:
Definition 4.1
A function f : R^N → R is convex if it satisfies the following inequality for all x, y ∈ R^N and all α, β ∈ R with α ≥ 0, β ≥ 0, and α + β = 1:

f(αx + βy) ≤ αf(x) + βf(y).
Equivalently, the line segment between (x, f(x)) and (y, f(y)) lies above the graph of
f [13]. An important property of convex functions that makes them easier to optimise is
that they only have one local minimum, and no local maxima. The converse of a convex
function is a concave function, which is simply a function f : R^N → R such that −f is convex. Concave functions have only one local maximum.
4.1.1 Introduction to Dual Decomposition
Let x be a vector in R^N, which can be split into three subvectors, x_1, x_2, and y. Consider the following optimisation problem:

minimise_{x ∈ R^N}  f(x) = f_1(x_1, y) + f_2(x_2, y).   (4.1)
If the shared (or complicating) variable y were fixed, we could use a divide-and-conquer type approach, and solve f_1 and f_2 separately. Suppose that for fixed y, we have solutions φ_1 and φ_2 as follows:

φ_1(y) = inf_{x_1} f_1(x_1, y);   (4.2)
φ_2(y) = inf_{x_2} f_2(x_2, y).   (4.3)
Now, the original problem can be restated as:

minimise_y  φ_1(y) + φ_2(y),   (4.4)

which is convex provided the original function f is convex. This problem is referred to as the master problem.
A decomposition method is a method that solves the master problem in order to solve
the problem in (4.1); it is so called because the problem is decomposed into smaller
subproblems. If the original problem is decomposed, then the method is known as primal
decomposition, since the shared variable is being updated directly. The master problem
(4.4) can be solved via an iterative process; one such method, which is described next, is
the subgradient method.
Figure 4.1: Example showing a subgradient (red) of the function f(x) = x + 0.5 × |x − 2| (blue) at x = 2.
Subgradient Method
Definition 4.2
A subgradient of a function f at x is any vector g such that for all y, the inequality f(y) ≥ f(x) + gᵀ(y − x) is satisfied.
An example of a subgradient is shown in Figure 4.1: the function g(x) = x is a subgradient
of f(x) = x + 0.5 ×|x −2| at the point x = 2.
Definition 4.3
A sequence {α_k}_{k=0}^{∞} is summable if:

Σ_{k=0}^{∞} α_k < ∞.

The sequence is square summable if:

Σ_{k=0}^{∞} α_k² < ∞.
Non-differentiable convex functions can be minimised using a method known as the sub-
gradient method. This can be thought of as similar to Newton’s method for differentiable
functions; one difference that warrants a mention is the speed: the subgradient method
is generally far slower than Newton’s method. The subgradient method, originally de-
veloped by Shor [109], uses predetermined step sizes that can be set by a variety of
methods, discussed in more detail in [11]. In this chapter, a square summable but not
summable step size is used: α_k = a/(b + k), with parameters a, b > 0.
To minimise a given convex function f : R^n → R, the subgradient method uses the following iteration:

x_{k+1} = x_k − α_k g_k,   (4.5)

where x_k ∈ R^n is the k-th iterate, α_k is the step size, and g_k is any subgradient of f at x_k. By keeping track of the best point x_i found so far, this method can be used to find a decreasing sequence f_k of objective values. It was shown in [11] that if the step size α_k is square summable, but not summable, then the sequence f_k of objective values converges to the optimal value f*.
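A few lines suffice to sketch this iteration. The convex, non-differentiable objective below is a hypothetical example (not the function of Figure 4.1), chosen so that its minimiser is x = 2, and the step size is the square summable but not summable choice a/(b + k):

def f(x):
    return abs(x - 2) + 0.5 * abs(x + 1)

def subgradient(x):
    # A valid subgradient of each |.| term is the sign of its argument (0 at a kink).
    g1 = 1.0 if x > 2 else (-1.0 if x < 2 else 0.0)
    g2 = 1.0 if x > -1 else (-1.0 if x < -1 else 0.0)
    return g1 + 0.5 * g2

a, b = 1.0, 1.0
x = 10.0
best_x, best_f = x, f(x)
for k in range(500):
    x = x - (a / (b + k)) * subgradient(x)      # iteration (4.5)
    if f(x) < best_f:                           # keep the best point found so far
        best_x, best_f = x, f(x)
print(best_x, best_f)                           # approaches x = 2, f(x) = 1.5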
Primal Decomposition
To apply the subgradient method to primal decomposition, we apply an iterative process, whereby the complicating variable y is fixed, so that we can solve the subproblems f_1 and f_2 as in (4.2) and (4.3). At iteration k, we find the x_1 and x_2 that minimise f_1 and f_2 respectively, given y_k. The next step is to find a subgradient g_1 of f_1 at x_1 (and analogously, a subgradient g_2). Once these have been found, we can update the complicating variable by the following rule:

y_{k+1} = y_k − α_k (g_1 + g_2)   (4.6)

This process can be repeated until convergence of the y_k to some fixed value y*. It is quite common to instead decompose the problem's Lagrangian dual (dual decomposition); this is the method that is followed in this chapter.
Lagrangian Duality
An alternative approach to solving the problem in (4.1) is to introduce a copy of the shared variable y, and add an equality constraint:

minimise_{x ∈ R^N}  f(x) = f_1(x_1, y_1) + f_2(x_2, y_2)   (4.7)
subject to: y_1 = y_2.

Note that the minimisation problem is now separable, although we must take the equality constraint into account. We do this by forming the Lagrangian dual problem.
In Lagrangian duality, the constraints in (4.7) are accounted for by augmenting the objective functions f_1 and f_2 by constraint functions [13]. Constraint functions have the property that they are equal to zero when the constraint is satisfied, and non-zero otherwise. The Lagrangian L associated with the problem in (4.7) is as follows:

L(x_1, y_1, x_2, y_2, λ) = f_1(x_1, y_1) + f_2(x_2, y_2) + λ(y_1 − y_2),   (4.8)
where λ is a cost vector of the same dimensionality as y, and is known as the Lagrange multiplier associated with the constraint y_1 = y_2. This problem is separable, and for a given λ, we can find solutions for x_1, y_1, x_2 and y_2 in parallel.
Optimisation via Dual Decomposition
The Lagrangian (4.8) can be separated into two sub-problems:

L_1(x_1, y_1, λ) = f_1(x_1, y_1) + λy_1;   (4.9)
L_2(x_2, y_2, λ) = f_2(x_2, y_2) − λy_2.   (4.10)
The dual problem is then as follows:

maximise_{λ ∈ R}  g(λ) = g_1(λ) + g_2(λ),   (4.11)

where:

g_1(λ) = inf_{x_1, y_1} ( f_1(x_1, y_1) + λy_1 );   (4.12)
g_2(λ) = inf_{x_2, y_2} ( f_2(x_2, y_2) − λy_2 ).   (4.13)
While the master problem in (4.11) is not typically differentiable, it is known to be convex
over R, and therefore we can solve it using an iterative method.
To apply the subgradient method to our problem (4.11), suppose that after the k-th iteration, we have the current value λ_k for λ. We denote by (x̄_1(λ_k), ȳ_1(λ_k)) the values of x_1 and y_1 that give us the minimum value for g_1(λ_k); the solution for g_2(λ_k) is denoted analogously. Then:
Proposition 4.1 A subgradient with respect to λ of the master problem (4.11) at λ_k is given by:

∇g(λ_k) = ȳ_1(λ_k) − ȳ_2(λ_k).   (4.14)
Proof For any λ, let x̄_1(λ) and ȳ_1(λ) be the values of x_1 and y_1 that give the minimum value for g_1(λ), and define x̄_2(λ) and ȳ_2(λ) analogously. Then we have:

g(λ_k) = f_1(x̄_1(λ_k), ȳ_1(λ_k)) + λ_k ȳ_1(λ_k) + f_2(x̄_2(λ_k), ȳ_2(λ_k)) − λ_k ȳ_2(λ_k)
       = inf_{x_1, y_1} ( f_1(x_1, y_1) + λ_k y_1 ) + inf_{x_2, y_2} ( f_2(x_2, y_2) − λ_k y_2 )
       ≤ f_1(x̄_1(λ), ȳ_1(λ)) + λ_k ȳ_1(λ) + f_2(x̄_2(λ), ȳ_2(λ)) − λ_k ȳ_2(λ)
       = f_1(x̄_1(λ), ȳ_1(λ)) + f_2(x̄_2(λ), ȳ_2(λ)) + λ_k (ȳ_1(λ) − ȳ_2(λ))
       = f_1(x̄_1(λ), ȳ_1(λ)) + f_2(x̄_2(λ), ȳ_2(λ)) + λ(ȳ_1(λ) − ȳ_2(λ)) + (λ_k − λ)(ȳ_1(λ) − ȳ_2(λ))
       = g(λ) + (λ_k − λ)(ȳ_1(λ) − ȳ_2(λ)).

We have shown that for any λ, g(λ_k) ≤ g(λ) + (λ_k − λ)(ȳ_1(λ) − ȳ_2(λ)), which is equivalent to g(λ) ≥ g(λ_k) + (λ − λ_k)(ȳ_1(λ) − ȳ_2(λ)), as required.
The subgradient in (4.14) is a vector of the same dimensionality as λ. We then update λ according to the following rule:

λ_{k+1} = λ_k − α_k ∇g(λ_k),   (4.15)

and by keeping track of the smallest values g_best, we can find the optimal λ-value.
It is perhaps helpful to consider the subgradient method for updating λ from an economic viewpoint. There are two copies of the y variables: the values of the y_1 copy can be thought of as the amounts of a set of resources generated by g_1 – that is, the supply of y. Then, the y_2 copy can be thought of as the demand of the variables by g_2. At iteration k, the λ_k can be taken to be the costs of these resources.
If for some index i the supply exceeds the demand (i.e. y^i_1 > y^i_2), then the subgradient of index i will be positive. Hence, by the update rule (4.15), we will decrease the cost λ_k. Note that this makes sense from an economic perspective, as if supply exceeds demand, then we would like to decrease the cost of the resource, in order to stimulate interest in it. Conversely, if the demand of a resource exceeds its supply (i.e. y^i_1 < y^i_2), we would like to increase the cost. The increased cost causes a reduced demand, but while the demand exceeds the supply, the revenue is increased. The optimal point comes where supply and demand are equal. In this way, the update rule encourages the two copies y_1 and y_2 to
Example
As a simple example, suppose that x_1 and x_2 are each vectors in R^10, y ∈ R, g_1, g_2, . . . , g_50 are affine functions of x_1 and y, and h_1, h_2, . . . , h_50 are affine functions of x_2 and y, for example:

g_1(x_1, y) = A_(1,0) x_1(0) + . . . + A_(1,9) x_1(9) + A_(1,10) y + A_(1,11);   (4.16)
h_1(x_2, y) = B_(1,0) x_2(0) + . . . + B_(1,9) x_2(9) + B_(1,10) y + B_(1,11),   (4.17)

where A_(i,j), B_(i,j) ∈ R for all (i, j) (i.e. A and B are real-valued 50-by-12 matrices). Now, the functions g_i and h_i are convex, and the maximum of a set of convex functions is convex, so if we set f_1 and f_2 as:

f_1(x_1, y) = max_{i=1,...,50} g_i(x_1, y);   (4.18)
f_2(x_2, y) = max_{i=1,...,50} h_i(x_2, y),   (4.19)

then f_1 and f_2 are also convex. Thus, using the procedure outlined above and CVX, a package for specifying and solving convex programs [45, 46], we can iteratively solve (4.11) for a given λ, and then update λ using the update rule (4.15). Figure 4.2 shows a plot of the dual functions in (4.11) with respect to the λ used during the optimisation.

In Figure 4.3, the progress of a bisection method for maximising g(λ) is shown. Also shown are two bounds: the larger (worse) bound is found by evaluating f(x), using the current solutions x̄_1, x̄_2 and ȳ, and (4.1). The smaller (better) bound is found by using these solutions to evaluate the dual function g(λ) from (4.11). Note the high magnitude of the deviations in g(λ) for the first few iterations, where λ is changing by a relatively large amount. After a few iterations, the dual function converges to the "better" bound.

Figure 4.2: Dual functions versus the cost variable λ.

Figure 4.3: Value of the dual function g(λ) after each iteration k, compared to the bounds obtained by equations 4.1 and 4.11.
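The same two-stage iteration can also be written down without a convex-programming package when the slave minimisations have closed forms. The sketch below uses two hypothetical quadratic subproblems, f_1(x_1, y_1) = (x_1 − 1)² + (y_1 − 2)² and f_2(x_2, y_2) = (x_2 + 1)² + (y_2 − 4)², coupled by y_1 = y_2; λ is moved along the subgradient (4.14) so that the concave dual increases (the same sign convention as the later update (4.50)):

a, b = 1.0, 1.0                        # step-size parameters, alpha_k = a / (b + k)
lam = 0.0
for k in range(100):
    # Slave 1 (4.12): minimise (y1 - 2)^2 + lam*y1  =>  y1 = 2 - lam/2 (and x1 = 1).
    y1 = 2.0 - lam / 2.0
    # Slave 2 (4.13): minimise (y2 - 4)^2 - lam*y2  =>  y2 = 4 + lam/2 (and x2 = -1).
    y2 = 4.0 + lam / 2.0
    # Subgradient (4.14) of the dual, followed by the update of lam.
    lam = lam + (a / (b + k)) * (y1 - y2)
print(lam, y1, y2)                     # lam tends to -2; both copies agree at y = 3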
4.1.2 Related Approaches
The principle of dual decomposition was originally developed by the optimisation com-
munity, and can be traced back to work on linear programs in 1960 [23]. More than four
decades later, the principle was combined with Markov random field (MRF) optimisation
by Komodakis et al. [62, 63], who decompose an NP-hard MRF problem on a general
graph G into a set of easier subproblems defined on trees T ⊂ G.
Dual decomposition was applied by Wang and Koller [127] to jointly solve the problems
of human segmentation and pose estimation. They defined two slave problems: one
provides a pose estimate by using belief propagation to select from a finite set of estimates;
the other finds a foreground segmentation via graph cuts.
In the remainder of this chapter, we extend the formulation of [127] to include stereo,
so that we can provide segmentation and pose estimation in 3D, not just 2D. While
existing stand-alone stereo correspondence algorithms are not sufficiently accurate to
compensate for the lack of an infrared sensor, our multi-level inference framework aids
us in segmenting objects despite errors in the disparity map.
4.2 Humans in Two Views (H2view) Dataset
While many datasets [22, 36, 37] exist for evaluation of 2D human pose estimation, there
is a lack of multi-view human pose datasets – the only well-known such dataset is the
HumanEva dataset [112], which consists of four image sequences obtained from seven
cameras, of which three are in colour. However, no two of the cameras are frontoparallel,
so rectified images cannot be obtained.
Another related dataset is the H3D ('Humans in 3D') dataset created by Bourdev et al. [10], which gives 3D ground-truth human pose for a set of 1000 images (the dataset size is doubled by mirroring these 1000 images), but these images are only available in one view.

In order to evaluate the algorithms in this chapter, a new dataset, called 'Humans in Two Views' ('H2view' for short), was created; it is available at http://cms.brookes.ac.uk/research/visiongroup/h2view/h2view.html. This dataset contains a total of 8,741 images of humans standing, walking, crouching or gesticulating in front of a stereo camera, divided up into 25 video sequences (video files are not included, but the frames were captured at a rate of 15 fps) of between 149 and 430 images, with eight subjects and two locations.
The images are annotated with ground-truth human segmentation, and 3D positions
of fifteen joints, from which a ten-part skeleton is obtained. Ground truth data was
collected using a Microsoft Kinect, which was synchronised with the stereo camera for
each sequence. Since the Kinect’s pose estimation algorithm is not 100% accurate, the
pose estimates were corrected manually. Camera calibration was also performed, and
python code was used to translate the data obtained using the Kinect into the image
plane of the left image from the stereo camera.
Each annotated frame has six files associated with it: the RGB and depth images
from the Kinect, the interlaced image from the stereo camera, the left and right rectified
images from the stereo camera, and the ground truth skeleton.
The dataset is split into training and testing portions. The training set contains 7,143
images, while the test set features 1,598 images, with a different location and different
subjects from the training set. The subjects were not given particular instructions regard-
ing how to act in front of the camera; they mostly perform simple gestures at random in
front of the camera, and occasionally interact with inanimate objects. This means that
the difficulty of the test set is varied: some sequences contain simple, slow actions, while
others feature more energetic movement. A downside to this is that the dataset is not
readily suitable for use with action recognition algorithms.
4.2.1 Evaluation Metrics Used
Segmentation
In order to evaluate segmentation, we give results on four metrics. These are calculated based on the counts of three classes of pixel. The first, true positives (TP), are pixels that are classed as human in both the segmentation result and the ground truth. If a pixel is wrongly classed as human, it is a false positive (FP), and if a pixel is wrongly classed as background, it is a false negative (FN). The metrics used are as follows, with a short sketch of their computation given after the list:
• Precision: P = TP/(TP +FP) ,
• Recall: R = TP/(TP +FN) ,
• F-Score: F = 2PR/(P +R) ,
• Overlap: O = TP/(TP +FP +FN) .
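A small sketch of these four metrics, assuming binary NumPy masks (1 = human, 0 = background) for the segmentation result and the ground truth:

import numpy as np

def segmentation_metrics(result, truth):
    tp = np.sum((result == 1) & (truth == 1))
    fp = np.sum((result == 1) & (truth == 0))
    fn = np.sum((result == 0) & (truth == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    overlap = tp / (tp + fp + fn)
    return precision, recall, f_score, overlap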
Pose Estimation
To evaluate pose estimation, we follow the standard criterion of probability of correct
pose (PCP) [37] to measure the percentage of correctly localised body parts.
In this criterion, the detected endpoints of each limb are compared to the annotated
ground truth endpoints. If both detected endpoints are within 50% of the ground truth
limb’s length from the ground truth endpoint, then the part detection is considered to
be correct. The threshold of 50% is consistent with previous works that evaluate pose
estimation using PCP [1, 37, 131].
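The PCP test for a single limb can be sketched as follows, with each endpoint given as a 2D (or 3D) coordinate; the endpoint correspondence is assumed to be known:

import numpy as np

def limb_is_correct(det_a, det_b, gt_a, gt_b, threshold=0.5):
    det_a, det_b, gt_a, gt_b = map(np.asarray, (det_a, det_b, gt_a, gt_b))
    limb_length = np.linalg.norm(gt_a - gt_b)
    # Both detected endpoints must lie within threshold * limb length of the
    # corresponding ground truth endpoints.
    return (np.linalg.norm(det_a - gt_a) <= threshold * limb_length and
            np.linalg.norm(det_b - gt_b) <= threshold * limb_length)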
4.3 Inference Framework
Table 4.1: Notation used in this chapter.

Symbol              Meaning
L                   Left input image
R                   Right input image
D                   Set of disparity labels
S                   Set of segmentation labels
B                   Set of body part labels
Z                   Set of pixels
Z_m                 Individual pixel
B                   Set of body parts
B_i                 Individual part
C                   Set of neighbouring pixels
T                   Set of neighbouring parts
N                   Number of pixels in image
N_E                 Number of proposals per body part
N_P                 Number of body parts
K                   Maximum disparity label
i                   Part index
j                   Proposal index
k                   Disparity value
m = (x, y)          Pixel location
z                   CRF labelling
z̄                   Binarised CRF labelling
z_m = [d_m, s_m]    Label for one pixel Z_m
d_m                 Disparity label for Z_m
s_m                 Segmentation label for Z_m
b_i                 Label for one body part B_i
E(z)                Energy function
L                   Lagrangian function
f_•                 Individual term in an energy function
f′_•                Binarised term in an energy function
J_•                 Joint energy term (combining two of segmentation, pose and depth)
γ_•                 Weight of a term or sub-term
θ_•                 Unary term
φ_•                 Pairwise term
δx, δy, etc.        Difference in x, y, etc.
∆L, ∆R              Gradient image of L and R (in x-direction only)
w^m_ij              Weight of a pixel m given by proposal j for part i
W_F                 Overall weight of a pixel
p_m                 Posterior foreground probability of a pixel m
λ                   Dual variable
τ                   Threshold
ξ                   Slack variable
F                   Feasible set of labels
t                   Iteration index
α                   Step size
For ease of reference, a list of symbols used in this chapter can be found in Table 4.1. The problem which we wish to optimise consists of three main elements: stereo, segmentation, and human pose estimation. Each of these is represented by one term in the energy function. We introduce two additional terms, hereafter referred to as joining terms, which combine information from two of the elements, encouraging them to be consistent with one another. We take as input a stereo pair of images L and R, and as a preprocessing step, we use the algorithm of Yang and Ramanan [131] to obtain a number N_E of proposals for N_P different body parts. In this chapter, we use N_P = 10, with two parts for each of the four limbs, plus one each for the head and torso. Each proposal j for each part i comes with a pair of endpoints corresponding to a line segment in the image, representing the limb (or skull, or spine).
Needless to say, these proposals are not guaranteed to be accurate. As might be expected, the top-ranked proposal returned is the most likely to be correct; however, the maximum possible accuracy (that is, the accuracy that an oracle could obtain by picking the most accurate proposal for each part) does increase as the number of proposals is increased, as shown in Figure 4.4.
Our approach is formulated as a conditional random field (CRF) with two sets of random variables: one set covering the image pixels Z = {Z_1, Z_2, . . . , Z_N}, and one covering the body parts B = {B_1, B_2, . . . , B_{N_P}}. Any possible assignment z of labels to the random variables will be called a labelling.
The label sets for the pixels are defined as follows:

D = {0, 1, . . . , K − 1};   (4.20)
S = {0, 1}.   (4.21)
In a particular labelling z, each pixel variable Z_m takes a label z_m = [d_m, s_m] from the product space of disparity and segmentation labels D × S, where d_m represents the disparity assigned to the pixel, and s_m is set to 1 if and only if the pixel is assigned to foreground. Additionally, each part B_i takes a label b_i ∈ B = {1, 2, . . . , N_E}, denoting which proposal for part i has been selected. In general, the energy of z can be written as:

E(z) = f_D(z) + f_S(z) + f_P(z) + f_PS(z) + f_SD(z).   (4.22)
The structure of the variable sets and labellings is shown graphically in Figure 4.5. It is important to note that we have two separate sets of variables, with three labellings obtained, and the five terms in the equation above each give an energy score for a different aspect of this labelling structure:

1. f_D gives the cost of the disparity label assignment {d_m}_{m=1}^{N} (Figure 4.5(c), centre);
2. f_S gives the cost of the segmentation label assignment {s_m}_{m=1}^{N} (Figure 4.5(c), left);
3. f_P gives the cost of the part proposal selection {b_i}_{i=1}^{N_P} (Figure 4.5(c), right);
4. f_SD gives the joint cost of the disparity and segmentation labellings; and
5. f_PS gives the joint cost of the segmentation labelling and the part proposal selection.

Figure 4.4: (a): Part proposal accuracy decreases as the proposal rank is lowered, with the fourth proposal onwards less than half as accurate as the first proposal. However, there is still information of value in these lower-ranked proposals, as the best possible accuracy, shown in (b), increases slightly. Note that a different scale is used on the two graphs.

In the following sections, we describe in turn each of the five terms in (4.22). Where appropriate, terms are weighted by parameters γ_•; the process of setting the values of these parameters is given in Section 4.5.
4.3.1 Segmentation Term
In order to build unary potentials for the segmentation term, we create a foreground weight map based on the pose detections obtained from Yang and Ramanan's algorithm [131]. For each pixel Z_m, each part proposal (i, j) contributes a weight w^m_ij, where w^m_ij = 1 if Z_m lies directly on the line segment representing the limb, and decreases exponentially as we move away from it. We then have a foreground weight W_F = Σ_{i,j} β_j w^m_ij and a background weight Σ_{i,j} β_j · (1 − w^m_ij) for each pixel. The β_j term represents our confidence in the detections, which decreases as the proposal ranking lowers. An example of why this weighting can be important is shown in Figure 4.6.
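The sketch below computes such a weight map for a single proposal, assuming the proposal is given as a line segment between two endpoints; the exponential decay rate sigma is a hypothetical parameter rather than a value taken from this chapter:

import numpy as np

def proposal_weight_map(shape, end_a, end_b, sigma=10.0):
    """w = 1 on the line segment representing the limb, decaying exponentially
    with the distance of each pixel from the segment."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    p = np.stack([xs, ys], axis=-1).astype(float)            # pixel coordinates (x, y)
    a, b = np.asarray(end_a, float), np.asarray(end_b, float)
    ab = b - a
    # Project every pixel onto the segment and clamp the projection to it.
    t = np.clip(((p - a) @ ab) / (ab @ ab), 0.0, 1.0)
    closest = a + t[..., None] * ab
    dist = np.linalg.norm(p - closest, axis=-1)
    return np.exp(-dist / sigma)

# W_F is then the beta_j-weighted sum of these maps over all proposals (i, j).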
These weights are then used to fit Gaussian mixture models (GMMs) for the foreground and background regions, which together give us a probability p_m of each pixel Z_m being a foreground pixel. From this, we obtain unary costs θ_F and θ_B, which store the costs of assigning each pixel m to foreground and background respectively:

θ_F(m) = −log(p_m);   (4.23)
θ_B(m) = −log(1 − p_m).   (4.24)

The unary segmentation cost for Z_m given the binary segmentation label s_m is then the following:

θ_S(m) = θ_F(m), if s_m = 1;   (4.25a)
         θ_B(m), if s_m = 0.   (4.25b)

Figure 4.5: A graphical representation of the CRF formulation given in this chapter. (a): the inputs are a stereo pair of images (left), and a skeleton model (right). (b): we have two variable sets; one covering pixels, and one covering body parts. (c): these variables are assigned labellings. The pixels get segmentation (binary) and disparity (integer) labels, while the parts are given integer labels that represent the index of the proposal selected for that part.

Figure 4.6: In this cluttered image ((a) original image, (b) W_F with averaging, (c) W_F with relative weighting), there are many pose estimates that are extremely inaccurate. However, weighting the GMMs according to our confidence in the detections results in the incorrect detections, which had a lower rank, being assigned a less significant weight.

Table 4.2: Results on the H2view test set, using just the segmentation term.

Method             Precision   Recall    F-Score   Overlap
Unary only         27.91%      83.53%    41.84%    27.45%
Unary + Pairwise   55.11%      87.28%    67.56%    56.59%
We also have pairwise costs φ_S, which store the cost of assigning adjacent pixels to different labels. Denoting the set of neighbouring pixels by C, we follow equation (11) in Rother et al. [100], and write the pairwise energy for each pair of pixels (m_1, m_2) ∈ C as:

φ_S(m_1, m_2) = exp(−γ_3 ‖L(m_1) − L(m_2)‖), if s(m_1) ≠ s(m_2);  0, otherwise.   (4.26)
The energy we have to minimise is then the following:

f_S(z) = γ_1 Σ_{Z_m ∈ Z} θ_S(m) + γ_2 Σ_{(m_1, m_2) ∈ C} φ_S(m_1, m_2).   (4.27)
Results
When only the unary segmentation energy is used, a high recall value is obtained, although the precision is very low, as nothing is done to exclude pixels with a similar colour to those in the foreground. Adding in the pairwise term increases both the precision and the recall, although there are still a large number of false positives. Full results are given in Table 4.2, while a sample result is shown in Figure 4.7.

Figure 4.7: Sample result using just f_S: (a) original image, (b) W_F (relative), (c) result with unary terms only (γ_1 = 1, γ_2 = 0), (d) result with unary and pairwise terms (γ_1 = γ_2 = 1). There are many false positives in both results. The unary-only result in (c) also contains many speckles, which the pairwise term in (d) removes; however, the head is also lost.
4.3.2 Pose Estimation Term
Recall that Yang and Ramanan's algorithm provides us with a discrete set of part proposals. An example set of proposals is shown in Figure 4.8.

Figure 4.8: Several complete pose estimations are obtained from Yang and Ramanan's algorithm ((a) top-ranked estimates, (b) second-ranked estimates, (c) 10th-ranked estimates, (d) result), which we split into ten body parts. We then select each part individually (one head, one torso, etc.). In this example, the parts highlighted in white are selected, enabling us to recover from errors such as the left forearm being misplaced in the first estimate.

Each proposal j for each
part i has a unary cost θ_P(i, j) associated with it, whose value is based on the weights w^m_ij defined in the previous section. This cost function penalises the selection of a part where the pixels Z_m close to the limb (so w^m_ij is close to 1) have low foreground probability (so p_m is close to 0). We write:

θ_P(i, j) = Σ_{Z_m ∈ Z} w^m_ij (1 − p_m).   (4.28)
A pairwise term φ_{i_1,i_2} is introduced to penalise the case where, for two parts that should be connected (e.g. upper and lower left leg), two proposals are selected that are distant from one another in image space. We define a tree-structured set of edges T over the set of parts, where (i_1, i_2) ∈ T if and only if parts i_1 and i_2 are connected. For each connected pair of parts (i_1, i_2) ∈ T, we model the joint by a three-dimensional Gaussian distribution over the relative position and angle between the two parts. This term is computed using the same method as in [1]:
φ_{i_1,i_2}(b_{i_1}, b_{i_2}) = exp( (δx − µ_x)² / σ²_x ) + exp( (δy − µ_y)² / σ²_y ) + exp( (δθ − µ_θ)² / σ²_θ ),   (4.29)

where δx is the difference in x, and µ_x and σ²_x are respectively the mean and variance of the difference in x. The terms for y and θ are analogous. Multiplying the unary and pairwise terms by weights γ_4 and γ_5, we minimise the following cost function:
f_P(z) = γ_4 Σ_{i=1}^{10} θ_P(i, b_i) + γ_5 Σ_{(i_1, i_2) ∈ T} φ_{i_1,i_2}(b_{i_1}, b_{i_2}).   (4.30)
4.3.3 Stereo Term
A particular disparity label d corresponds to matching the pixel (x, y) in L to the pixel
(x −d, y) in R. Note that the disparity will always be non-negative, due to the fact that
the two cameras are aligned, with R coming from a point directly to the right of L. In
other words, points always appear in R at a position closer to the left of the image than the corresponding position in L. These disparity labels are integer-valued, and range from 0 to K − 1.
We define a cost volume θ_D, which for each pixel m specifies the cost of assigning a disparity label d_m. These costs incorporate the gradient in the x-direction (in ∆L and ∆R), which means that we don't need to adopt a pairwise cost. The following energy function then needs to be minimised over labellings z:

f_D(z) = γ_6 Σ_m θ_D(m, d_m),   (4.31)

where the cost θ_D is calculated for each pixel m as follows:

θ_D(m, d_m) = Σ_{δx=−4}^{4} Σ_{δy=−4}^{4} [ |L(x + δx, y + δy) − R(x + δx − d_m, y + δy)| + |∆L(x + δx, y + δy) − ∆R(x + δx − d_m, y + δy)| ].   (4.32)
4.3.4 Joint Estimation of Pose and Segmentation
Here, we encode the concept that foreground pixels should be explained by some body part; conversely, each selected body part should explain some part of the foreground. We use the same weights w^m_ij as defined in Section 4.3.1, and incur a cost if the part candidate (i, j) is selected and the pixel m is labelled as background. For a particular labelling z, the cost is:

J_1(z) = Σ_{i,j} Σ_m [ 1(b_i = j) · (1 − s_m) · w^m_ij ].   (4.33)

Secondly, we formulate the cost for the case where a pixel m is assigned to foreground, but not explained by any body part. We set a threshold τ = 0.1 (value determined empirically), and a cost is accrued for m when for all parts (i, j), w^m_ij < τ:

J_2(z) = Σ_m 1(max_{i,j} w^m_ij < τ) · s_m.   (4.34)
Table 4.3: Results on the H2view test set, after combining the segmentation term with the joint segmentation and pose term f_PS.

Method             Precision   Recall    F-Score   Overlap
Unary only         27.91%      83.53%    41.84%    27.45%
Unary + Pairwise   55.11%      87.28%    67.56%    56.59%
f_S + f_PS         63.47%      90.41%    74.59%    62.65%
With weighting factors γ_7 and γ_8, the overall cost f_PS can be written as:

f_PS(z) = γ_7 J_1(z) + γ_8 J_2(z).   (4.35)
Results
Combining this term with the segmentation term f_S provides an improvement in results, as the J_1 and J_2 terms make it possible for information obtained by the pose estimation algorithm to be used to aid segmentation. Results on the H2view test set are given in Table 4.3. Note that adding the joint segmentation and pose term f_PS yields a large increase in precision, as pixels far from any part are excluded by the J_2 term above. Additionally, limbs that were missing from the original segmentation can be recovered, as shown in Figure 4.9.
4.3.5 Joint Estimation of Segmentation and Stereo
Here, we encode the intuition that, assuming that the objects closest to the camera are body parts, foreground pixels should have a higher disparity than background pixels. To do this, we use the foreground weights W_F obtained in Section 4.3.1 to obtain an expected value E_F for the foreground disparity:

E_F = ( Σ_m w_m · d_m ) / ( Σ_m w_m ).   (4.36)

Using a hinge loss with a non-negative slack variable ξ = 2 to allow small deviations to occur, we then have the following cost measure to penalise pixels with a high disparity being assigned to the background:

f_SD(z) = γ_9 Σ_m (1 − s_m) · max(d_m − E_F − ξ, 0).   (4.37)

Figure 4.9: In this example, the person's lower legs are missing from the initial segmentation estimate ((a) pose estimation result, (b) segmentation: first iteration, (c) segmentation: second iteration), although the pose estimation is correct. Once the two results are combined using the f_PS energy term above, the lower legs are correctly segmented.
While the individual functions f_S, f_P and f_D do reasonably well by themselves at solving each of their tasks, the functions f_PS and f_SD can improve the results further. However, the addition of these functions means that the problems are no longer separable, and therefore an optimal labelling z* is harder to find. In order to minimise the energy function (4.22) efficiently, we apply the dual decomposition technique that was introduced in Section 4.1.
4.4 Dual Decomposition
Many of the minimisation problems defined in Section 4.3 are multiclass problems, and
are therefore intractable to solve in their current forms. However, we can binarise the
multiclass label sets D and B.
4.4.1 Binarisation of Energy Functions
For pixels, we extend the labelling space so that each pixel takes a vector of binary labels z_m = [d_(m,0), d_(m,1), . . . , d_(m,K−1), s_m], with each d_(m,k) equal to 1 if and only if disparity value k is selected for pixel m. For parts, we extend the labelling space so that each part takes a vector of binary labels b_i = [b_(i,1), b_(i,2), . . . , b_(i,N_E)], where each b_(i,j) is equal to 1 if and only if the j-th proposal for part i is selected.
A particular solution to this binary labelling problem is denoted by z̄. Only a subset of the possible binary labellings will correspond directly to multiclass labellings z; these are those such that each pixel has exactly one disparity turned on, and each part has exactly one proposal selected. The set of solutions for which all pixels satisfy this constraint is called the feasible set F. We can write:

F = { z̄ : Σ_{k=0}^{K−1} d_(m,k) = 1 ∀ Z_m ∈ Z;  Σ_{j=1}^{N_E} b_(i,j) = 1 ∀ B_i ∈ B }.   (4.38)
We rewrite the cost functions from (4.30) and (4.31) as follows:

f′_P(z̄) = γ_4 Σ_{(i,j)} b_(i,j) · θ_P(i, j) + γ_5 Σ_{(i_1,i_2) ∈ T} Σ_{j_1,j_2} b_(i_1,j_1) · b_(i_2,j_2) · φ_{i_1,i_2}(j_1, j_2);   (4.39)

f′_D(z̄) = γ_6 Σ_{m=1}^{N} Σ_{k=0}^{K−1} d_(m,k) · θ_D(m, k).   (4.40)
The joining functions given in Sections 4.3.4 and 4.3.5 can be binarised in a similar fashion. The energy minimisation problem (4.22) can be restated in terms of these binary functions, giving us:

E(z̄) = f′_D(z̄) + f_S(z̄) + f′_P(z̄) + f′_PS(z̄) + f′_SD(z̄)   (4.41)
subject to: z̄ ∈ F.
4.4.2 Optimisation
Minimising this energy function across all labellings z̄ simultaneously is intractable, so in order to simplify the problem, we use dual decomposition, which was described in Section 4.1.
We introduce duplicate variables z̄_1 and z̄_2, and only enforce the feasibility constraints on these duplicates. Our energy function thus becomes:
E(z̄, z̄_1, z̄_2) = f′_D(z̄_1) + f_S(z̄) + f′_P(z̄_2) + f′_PS(z̄) + f′_SD(z̄)   (4.42)
subject to: z̄_1, z̄_2 ∈ F, z̄_1 = z̄, z̄_2 = z̄.
We remove the equality constraints by adding Lagrangian multipliers:

L(z̄, z̄_1, z̄_2) = f′_D(z̄_1) + f_S(z̄) + f′_P(z̄_2) + f′_PS(z̄) + f′_SD(z̄) + λ_D(z̄ − z̄_1) + λ_P(z̄ − z̄_2).   (4.43)
Hence, the original problem, min_z E(z), is converted to:

max_{λ_D, λ_P} [ min_{z̄, z̄_1, z̄_2} L(z̄, z̄_1, z̄_2) ],   (4.44)

which is the dual problem with dual variables λ_D and λ_P. λ_D is a vector with N × K elements, one for each binary variable d_(m,k); λ_P is a vector with N_P × N_E elements, one for each binary variable b_(i,j).
Note that a labelling z̄ gives us an upper bound for the solution to the primal problem E(z̄); on the other hand, the dual problem (4.44) gives us a lower bound, as we have relaxed the feasibility constraints by adding λ_D and λ_P. We can decompose this dual problem into three subproblems L_1, L_2 and L_3, as follows:

L(z̄, z̄_1, z̄_2) = L_1(z̄_1, λ_D) + L_2(z̄_2, λ_P) + L_3(z̄, λ_D, λ_P),   (4.45)

where
L
1
(¯z
1
, λ
D
) = f

D
(¯z
1
) −λ
D
¯z
1
; (4.46)
L
2
(¯z
2
, λ
P
) = f

P
(¯z
2
) −λ
P
¯z
2
; (4.47)
L
3
(¯z, λ
D
, λ
P
) = f
S
(¯z) +f

SD
(¯z) +f

PS
(¯z) +λ
D
¯z +λ
P
¯z; (4.48)
are the three slave problems, which can be optimised independently and efficiently, while
treating the dual variables λ_D and λ_P as constant. This process is shown graphically
in Figure 4.10. Intuitively, the role of the dual variables is to encourage the labellings
z̄, z̄_1, and z̄_2 to agree with each other. Since λ_D and λ_P remain constant during
each iteration (only changing between iterations), we do not actually need to explicitly
minimise them.
Figure 4.10: Diagram showing the two-stage update process. (a): the slaves find labellings z, z_1, z_2 and pass them to the master; (b): the master updates the dual variables λ_D and λ_P and passes them to the slave.
Given the current values of λ_D and λ_P, we solve the slave problems L_1, L_2, and L_3,
denoting the solutions by L̄_1(λ_D), L̄_2(λ_P), and L̄_3(λ_D, λ_P) respectively. We concatenate
L̄_1(λ_D) and L̄_2(λ_P) to form a vector of the same dimensionality as L̄_3(λ_D, λ_P). The
master then calculates the subgradient of the relaxed dual function at (λ_D, λ_P), given by

∇L(λ_D, λ_P) = L̄_3(λ_D, λ_P) − [L̄_1(λ_D), L̄_2(λ_P)]. (4.49)
The master problem can then update the dual variables using the subgradient method,
similar to that of [127]:

[λ_D, λ_P] ← [λ_D, λ_P] + α_t ∇L(λ_D, λ_P), (4.50)
and pass the new λ vectors back to the slaves. Here, α_t is the step size indexed by
iteration t, which we set as follows:

α_t = γ_10 / (1 + t), (4.51)

ensuring that the α_t form a decreasing sequence, so that we make progressively finer
refinements with each iteration. Consider a particular proposal j for a part i. At each
iteration, two solutions b^1_{ij} and b^3_{ij} are generated for B_{ij}, from L_1 and L_3 respectively. If
these candidates are the same, then no penalty is suffered, and the corresponding element
of λ_P will not change. If, however, there was a disagreement, for example b^1_{ij} = 1 but
b^3_{ij} = 0, then for problem L_1, we wish to increase the penalty of selecting b_{ij}, while doing
the reverse for problem L_3.
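To make the update schedule concrete, the following sketch shows the master–slave loop in Python. It is a minimal illustration only: solve_L1, solve_L2 and solve_L3 are hypothetical placeholders for the slave solvers described in the following subsections, and the subgradient and step size mirror (4.49)–(4.51).

    import numpy as np

    def dual_decomposition(theta_D, theta_P, gamma_10, max_iters=100):
        # lambda_D: one entry per (pixel, disparity); lambda_P: one per (part, proposal)
        lambda_D = np.zeros(theta_D.shape)            # N x K
        lambda_P = np.zeros(theta_P.shape)            # N_P x N_E

        for t in range(max_iters):
            # Each slave is solved independently with the dual variables held fixed.
            z1 = solve_L1(theta_D, lambda_D)          # labelling from the stereo slave
            z2 = solve_L2(theta_P, lambda_P)          # labelling from the pose slave
            z3_d, z3_b = solve_L3(lambda_D, lambda_P) # labelling from the joint slave

            # Subgradient of the relaxed dual at (lambda_D, lambda_P), cf. (4.49).
            grad_D = z3_d - z1
            grad_P = z3_b - z2

            # Decreasing step size, cf. (4.51), and the update of (4.50).
            alpha_t = gamma_10 / (1.0 + t)
            lambda_D += alpha_t * grad_D
            lambda_P += alpha_t * grad_P

            if np.all(grad_D == 0) and np.all(grad_P == 0):
                break                                  # slaves agree: feasible solution found
        return lambda_D, lambda_P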
4.4.3 Solving Sub-Problem L_1
Since problem L_1 contains terms that only depend on the disparity variables, we can
relax the feasibility constraint in (4.38) to only depend on these variables. We call this
expanded feasible set F_D:

F_D = { z̄ : Σ_{k=0}^{K−1} d_(m,k) = 1 ∀ Z_m ∈ Z }. (4.52)
Then, we can write L_1 in terms of the binary function f'_D as:

L_1(z̄_1, λ_D) = f'_D(z̄_1) − λ_D z̄_1 (4.53)

subject to: z̄_1 ∈ F_D.
Since f'_D includes only unary terms, this equation can be solved independently for each
pixel. As we have the feasibility constraint attached to this equation, we solve the following
for each pixel m, where λ_D(m, k) is the element of the λ_D vector corresponding
to pixel m and disparity k:

min_{k=0,...,K−1} ( max(γ_6 · θ_D(m, k) − λ_D(m, k), 0) ). (4.54)
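Because (4.54) decomposes over pixels, the stereo slave reduces to an independent arg min per pixel. A minimal sketch, assuming the cost volume theta_D and the dual vector lambda_D are stored as N × K arrays:

    import numpy as np

    def solve_L1(theta_D, lambda_D, gamma_6=1.0):
        # Per-pixel cost of switching each disparity indicator on, cf. (4.54).
        costs = np.maximum(gamma_6 * theta_D - lambda_D, 0.0)   # shape (N, K)
        best_k = np.argmin(costs, axis=1)                       # one disparity per pixel

        # Binary labelling: exactly one indicator set per pixel (feasible in F_D).
        z1 = np.zeros_like(theta_D)
        z1[np.arange(theta_D.shape[0]), best_k] = 1.0
        return z1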
4.4.4 Solving Sub-Problem L_2
L_2 contains functions that depend only on the pose variables b_{ij}, so we can again relax
the feasibility constraint. This expanded feasible set is called F_P:

F_P = { z̄ : Σ_{j=1}^{N_E} b_(i,j) = 1 ∀ B_i ∈ B }. (4.55)
L_2 can be written in terms of f'_P as follows:

L_2(z̄_2, λ_P) = f'_P(z̄_2) − λ_P z̄_2 (4.56)

subject to: z̄_2 ∈ F_P,

with f'_P as in (4.39). Ordering the parts B_i such that (i_1, i_2) ∈ T only if i_1 < i_2, we find
the optimal solution using belief propagation, as introduced in Section 3.2.2. The score
of each leaf vertex is the following, calculated for each estimate j:

score_i(j) = θ_P(i, j) − λ_P(i, j). (4.57)
For a vertex i with children, we can compute the following:

score_i(j) = θ_P(i, j) − λ_P(i, j) + Σ_{(i_1,i)∈T} min_{j_1} ( φ_{i_1,i}(j_1, j) + score_{i_1}(j_1) ), (4.58)

and the globally optimal solution is found by keeping track of the arg min indices, and
then selecting the root (torso) estimate with minimal score.
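Since the part graph T is a tree rooted at the torso, (4.57) and (4.58) can be evaluated in a single bottom-up pass followed by backtracking. The sketch below is one way to implement that pass; children[i] and phi[(i1, i)] (an N_E × N_E matrix of pairwise costs) are illustrative names, not taken from the actual implementation.

    import numpy as np

    def solve_L2(theta_P, lambda_P, children, phi, root=0):
        n_parts, n_est = theta_P.shape
        score = np.zeros((n_parts, n_est))
        best_child_est = {}

        def compute_score(i):
            # Unary part of (4.57)/(4.58), shared by leaves and internal vertices.
            score[i] = theta_P[i] - lambda_P[i]
            for c in children.get(i, []):
                compute_score(c)
                # Min over the child's estimates of pairwise cost plus child score, cf. (4.58).
                totals = phi[(c, i)] + score[c][:, None]   # (child estimates) x (parent estimates)
                best_child_est[(c, i)] = np.argmin(totals, axis=0)
                score[i] += totals.min(axis=0)

        compute_score(root)
        # Select the root (torso) estimate with minimal score, then backtrack.
        selection = {root: int(np.argmin(score[root]))}
        stack = [root]
        while stack:
            i = stack.pop()
            for c in children.get(i, []):
                selection[c] = int(best_child_est[(c, i)][selection[i]])
                stack.append(c)
        return selection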
4.4.5 Solving Sub-Problem L_3
Sub-problem L_3 is significantly more complex, as it contains the joining terms f'_PS and
f'_SD. Since we have rewritten L_1 and L_2 in terms of the binary variables z̄_1 and z̄_2,
we need to do the same to the joining terms. J_1(z) penalised background pixels being
assigned a body part. For parts i, estimates j, and pixels m, this becomes:

J'_1(z̄) = Σ_{i,j} Σ_m [ b_(i,j) · (1 − s_m) · w^m_{ij} ]. (4.59)
The function J_2(z), penalising foreground pixels not being explained by any body part,
becomes the following for pixels m:

J'_2(z̄) = Σ_m 1[ τ − max_{i,j} w^m_{ij} > 0 ] · s_m. (4.60)
Bringing these costs together, we get:

f'_PS(z̄) = γ_7 · J'_1(z̄) + γ_8 · J'_2(z̄). (4.61)
The function f_SD(z), which penalises background pixels with a higher disparity than the
foreground region, becomes the following for pixels m and disparities k:

f'_SD(z̄) = γ_9 Σ_m Σ_k [ (1 − s_m) · d_(m,k) · max(k − E_F − ξ, 0) ]. (4.62)
Since all the terms in the energy function L_3 are submodular, we can efficiently minimise
the function via a graph cut.
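For reference, the binarised joining energies (4.59)–(4.61) can be evaluated directly for a candidate labelling; the sketch below assumes w holds the part–proposal responses w^m_{ij} and illustrates the cost being minimised rather than the graph construction itself.

    import numpy as np

    def joining_cost(b, s, w, tau, gamma_7, gamma_8):
        # b: (n_parts, n_proposals) binary part selection; s: (n_pixels,) binary segmentation
        # w: (n_parts, n_proposals, n_pixels) weights linking each proposal to each pixel

        # J'_1: background pixels (s = 0) that lie under a selected body part, cf. (4.59).
        j1 = np.sum(b[:, :, None] * (1.0 - s)[None, None, :] * w)

        # J'_2: foreground pixels (s = 1) not explained by any part, cf. (4.60).
        unexplained = (tau - w.max(axis=(0, 1))) > 0
        j2 = np.sum(unexplained * s)

        # f'_PS combines the two penalties, cf. (4.61).
        return gamma_7 * j1 + gamma_8 * j2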
4.5 Weight Learning
The formulation described in the previous section contains a large number of parameters,
denoted by γ_i, which dictate how much influence the various energy terms have on the
final result. A particular term can be made hugely important by setting its γ value,
hereafter its weight, to be very high. Alternatively, setting the weight to zero will mean
that the term is ignored completely.
There are too many parameters to make it feasible to select the optimal
values by hand; additionally, it is not clear a priori which terms should have a higher weight.
This section contains a description of an algorithm to determine the best choices from a
finite set of weights.
4.5.1 Assumption
In order to improve the efficiency of the algorithm, the following assumption is made:
Proposition 4.2 When a variable is varied, the result of the algorithm will vary in a
convex fashion – in other words, there is only one local optimum.
Assuming this makes it possible to reduce the runtime of the algorithm by up to 50%:
without it, we would have to run the test for each possible value in each iteration, rather
than performing a binary search for the best value. To test this assumption empirically,
we fixed all but one of the variables, and evaluated the performance of the framework for
each of a sequence of values. In each case, the above proposition held.
This optimum point is hereafter referred to as γ_opt. Define the score of a given
parameter value x ∈ P to be s_x. Formally,

Definition 4.4 For a given set P of parameter values, the optimal value γ_opt is the one such that:

∀ x ∈ P, s_x ≥ s_{γ_opt} ⇒ x = γ_opt.
Further, the following two propositions directly follow from the assumed convexity of the
result:
Proposition 4.3 If γ_opt < x < y, then s_x > s_y.

Proposition 4.4 If x < y < γ_opt, then s_x < s_y.
4.5.2 Method
The parameters are optimised based on their performance on the training set. To avoid
overfitting, it might be better to use the entire training set, but since the learning time
is linear in relation to the number of images, a subset of 100 images is used.
In each iteration, we fix all but one of the parameters, and find the best value by a
method that is itself iterative. The process for optimising one variable is described next,
and this subsection is concluded by a discussion of the overall algorithm for finding the
best parameter set.
Optimisation of One Variable
Suppose we have fixed the parameters γ_j for j ≠ i. Our goal is to find the best value γ_opt
for γ_i from the following set of values: P = {0.001, 0.01, 0.1, 1, 10, 100}.
We find this optimal value by performing a binary search. At each step, we choose a
value from this set, and evaluate the performance of our dual decomposition algorithm
from Section 4.4 on 100 images from the training set. Since there are only six values to
choose from, the algorithm is written in full, although a more general algorithm can be
extrapolated from the procedure below.
First, we set γ_i = 0.1, and find the score s_{0.1} of the algorithm for this parameter set.
At this stage we can't eliminate any parameter values, as we only have one score. So, we
choose the middle value from the longest set of unchecked values, i.e. γ_i = 10, and find
the score s_{10}.
Now, we can eliminate some values from the parameter set based on the scores found
so far:
Lemma 4.5 Denote by γ_opt the (as yet unknown) optimal parameter value. Then:

1. if s_{0.1} > s_{10}, then γ_opt < 10.
2. if s_{0.1} ≤ s_{10}, then γ_opt ≥ 0.1.
Proof 1. For a contradiction, suppose that s_{0.1} > s_{10}, but γ_opt > 10 (by definition, we
can't have γ_opt = 10, since s_{10} < s_{0.1}). Then, since γ_opt > 10 > 0.1, by Proposition
4.4, we have s_{10} > s_{0.1}, which is a contradiction.

2. Again seeking a contradiction, suppose that s_{0.1} ≤ s_{10}, but γ_opt < 0.1. Then
Proposition 4.3 applies, and we have that s_{0.1} > s_{10}, a contradiction.
We now have a reduced set of possible values for γ_opt: either {0.001, 0.01, 0.1, 1}, or
{0.1, 1, 10, 100}. In both cases, we check the second possible value. After this point,
there are four possible cases:
1. s_{0.1} > s_{10} and s_{0.01} > s_{0.1}: the optimum value must be either 0.001 or 0.01, so we
find s_{0.001}, and thus determine γ_opt.

2. s_{0.1} > s_{10}, but s_{0.01} < s_{0.1}: then we know that 0.1 ≤ γ_opt < 10, so we find s_1, and
thus determine γ_opt.

3. s_{0.1} < s_{10} and s_1 > s_{10}: then we've already found a local maximum, so by Proposition
4.2, it must be the only local maximum.

4. s_{0.1} < s_{10} and s_1 < s_{10}: the optimum value must be either 10 or 100, so we find
s_{100}, and thus determine γ_opt.
This is the set of cases for the first set of possible values for γ_opt; the cases for the second
set are analogous. The whole process is represented by a decision tree in Figure 4.11.
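The same procedure can also be phrased as a generic binary search over the ordered parameter set, relying on the unimodality of Proposition 4.2 (via Propositions 4.3 and 4.4) to discard part of the remaining candidates after each comparison. The sketch below is an illustration only, assuming a hypothetical evaluate(value) routine that runs the framework on the 100 training images and returns its score.

    def optimise_one_parameter(values, evaluate):
        # values must be sorted in increasing order, e.g. [0.001, 0.01, 0.1, 1, 10, 100]
        scores = {}

        def score(v):
            if v not in scores:
                scores[v] = evaluate(v)   # one full run of the dual decomposition framework
            return scores[v]

        lo, hi = 0, len(values) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            # Compare two adjacent candidates; unimodality tells us on which side
            # of the pair the optimum must lie.
            if score(values[mid]) < score(values[mid + 1]):
                lo = mid + 1      # optimum is to the right (Proposition 4.4)
            else:
                hi = mid          # optimum is at mid or to its left (Proposition 4.3)
        return values[lo], score(values[lo])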
Optimisation of All Variables
To optimise all variables, we use Algorithm 4.1:
Algorithm 4.1 Parameter optimisation algorithm for dual decomposition framework.
1: Input: Initial variable values γ_1, . . . , γ_10
2: Initialise 'has been optimised' vector: opt(i) ← FALSE for all i.
3: while ∃ i : opt(i) = FALSE do
4:   Choose a variable γ_i s.t. opt(i) = FALSE, and store the previous value γ_prev.
5:   Find the optimal value γ_opt and score s_opt.
6:   if γ_opt ≠ γ_prev then
7:     for all j ≠ i do
8:       opt(j) ← FALSE
9:     end for
10:  end if
11:  opt(i) ← TRUE
12: end while
In theory, this algorithm will converge in a finite number of iterations (as there are a
finite number of possible values for the set of parameters), but in practice, we limit the
number of iterations to 100. It is possible to do fine-tuning by repeating the process with
additional parameter sets, e.g. if the best value for a variable turns out to be 0.1, we
could use the set {0.025, 0.05, 0.1, 0.2, 0.4, 0.8}.
Figure 4.11: The optimisation process for one parameter, represented as a decision tree.
Table 4.4: The various weights learned by the weight learning process and used in the
experiments in this chapter.

Symbol   Usage                                                            Weight
γ_1      Unary segmentation weight                                        1
γ_2      Pairwise segmentation weight                                     1
γ_3      Factor in (4.26)                                                 0.1
γ_4      Unary pose weight                                                10
γ_5      Pairwise pose weight                                             10
γ_6      Stereo weight                                                    1
γ_7      Joint weight penalising background pixels close to body parts    10
γ_8      Joint weight penalising foreground pixels far from body parts    10
γ_9      Joint stereo-segmentation weight                                 0.1
Table 4.4 specifies the weights learned for our framework. In general, the terms
relating to pose were found to be most important, and the joint stereo-segmentation
term least important, perhaps to compensate for the fact that this term is used in far
more places than other terms (W ∗ H ∗ D pairwise connections).
4.6 Experiments
The dual decomposition formulation described in this chapter is evaluated using the
H2view dataset, which was described in Section 4.2. For each experiment, the training
set is used for two tasks: to train the pose estimation algorithm we use to generate
proposals, and to learn the weights that we attach to the energy terms. These weights
are learned by coordinate ascent, as detailed in Section 4.5.
4.6.1 Performance
Some qualitative stereo results are given in Figure 4.12. Due to the difference between
the fields of view of the stereo and Kinect cameras, ground truth disparity values are
only available for the foreground objects, some background objects, and surrounding
floor space. Comparing our stereo results (Figure 4.12(b)) to the ground truth disparity
Table 4.5: Results on the H2view test set. The entire framework achieves a significant
performance improvement compared to the joint segmentation and pose terms.

Method             Precision   Recall    F-Score   Overlap
GrabCut [100]      43.03%      93.64%    58.96%    43.91%
ALE [71]           83.19%      73.58%    78.09%    64.29%
f_S                55.11%      87.28%    67.56%    56.59%
f_S + f_PS         63.47%      90.41%    74.59%    62.65%
Entire framework   69.33%      89.61%    78.18%    67.62%
values for these objects (Figure 4.12(c)) shows a good correspondence between the two.
Segmentation
We compare the segmentation performance of our framework against the methods of
GrabCut [100] and ALE [71], and quantitative results are given in Table 4.5. While our
segmentation term alone does not perform as well as ALE does, the inclusion of pose
and depth information allows us to surpass the result of ALE. In Figure 4.13, we give
examples of how our performance improves as more terms are added to the framework.
Pose Estimation
To evaluate pose estimation, we measure the percentage of correctly localised body parts
(PCP) [37]. Quantitative results are given in Table 4.6, while we include qualitative
results in Figure 4.14.
Our model exhibits a significant performance increase for the upper body, where the
segmentation cues are the strongest. However, there is a slight reduction in perform-
ance for the upper and lower legs. The baseline performance from Yang and Ramanan’s
algorithm (taking the highest scoring pose for each image) is 69.85% PCP; our joint infer-
ence model allows us to improve this performance to 75.37%. Our model can successfully
correct mistakes made by Yann and Ramanan’s algorithm; an example of this is shown
in Figure 4.14(b). It should be noted that the other algorithms do not use stereo inform-
ation, so it is likely that some of the improvement obtained by our algorithm is due to
Figure 4.12: Sample stereo and segmentation results. (a) RGB image; (b) disparity map; (c) ground truth depth; (d) segmentation result; (e) ground truth segmentation. The Kinect depth data used to generate the ground truth depth in (c) is only available for some pixels, due to the slightly different field of view of the camera.
Figure 4.13: Segmentation results comparing the three stages of the algorithm: (a) original image; (b) segmentation term only; (c) joint segmentation and pose; (d) joint segmentation, pose, and depth estimation; (e) ground truth segmentation. Left: some head pixels are recovered, although the arms are still missing. Middle: the new terms remove progressively more false positives. Right: the segmentation term already achieves a good result, which is maintained when more terms are added.
Figure 4.14: Some sample results from our new dataset. (a) Pose, segmentation and stereo results together. (b) Error correction: the first estimate (first image) misclassifies the left leg (red), while the second estimate (second image) gets it right; our segmentation (third image) and stereo (fourth image) cues enable us to recover (fifth image).
Table 4.6: Results (given in % PCP) on the H2view test sequence, compared with Yang
and Ramanan's algorithm.

Method          Torso   Head   Upper arm   Forearm   Upper leg   Lower leg   Total
Ours            94.9    88.7   74.4        43.4      88.4        78.8        75.37
Yang [131]      72.0    87.3   61.5        36.6      88.5        83.0        69.85
Andriluka [1]   80.5    69.2   60.2        35.2      83.9        76.0        66.03
this extra information.
4.6.2 Runtime
Our entire framework requires around 2.75 minutes per frame, using an unoptimised
single-threaded algorithm on a 2.67GHz processor. The long time required is due to the
graph construction for problem L_3, which requires millions of vertices to be added to
capture the relationship between segmentation and disparity values. This runtime does
not include the observed runtime of Yang and Ramanan’s pose estimation algorithm
(about 10 seconds per frame using the Linux implementation provided on their website
[132]; a faster run-time was quoted in their paper, albeit for smaller images) [131]. This
is much slower than the speed of 15 fps required for real-time applications. However, the
entire algorithm (including using Yang and Ramanan’s method to obtain pose estimates)
is quicker than that of Andriluka et al. [1], which requires around 3 minutes per frame.
Nevertheless, the framework is still slower than ALE and GrabCut, which require around 1.5
and 3 seconds per frame, respectively [71, 100].
4.7 Discussion
In this chapter, we have described a novel formulation for solving the problems of human
segmentation, pose estimation and depth estimation, using a single energy function. The
algorithm we have presented is self-contained, and exhibits high performance in all three
tasks. In addition, we have introduced an extensive, fully annotated dataset for 3D
human pose estimation.
The algorithm is modular in design, which means that it would be straightforward
to substitute alternative approaches for each slave problem; a thorough survey of the
efficacy of these combinations would be a promising direction for future research. Further,
the formulation could be extended to incorporate cues from video, for example motion
tracks. This data could be processed sequentially, or alternatively, entire sequences could
be optimised offline.
The most significant drawback of this approach is the speed; at an average of 2.75
minutes per frame, it takes just over three days for a single-threaded test run. One of
the main aims of the next two chapters is to improve the speed of the framework. We
have noted the large number of vertices and edges needed to allow the disparity map to
influence the segmentation result. In the next chapter, we will restructure our framework
in order to bypass this issue.
Acknowledgments
The basis for the H2view dataset was provided by SCEE London Studio. This included
camera calibration, data capture, and python code for translating the segmentation
masks, joint locations and depth data obtained via Kinect into the image co-ordinates
corresponding to the (left) stereo camera. The correction of the pose estimates in the
dataset was performed by myself, using an annotation tool in MATLAB which I created
for the task.
The main codebase was written by myself, although the segmentation algorithm used
was (heavily) adapted from code released by Talbot and Xu [118, 119].
Credit is also due to the co-authors of the BMVC student workshop publication:
• Jonathan Warrell contributed significantly to the discussions in which the for-
mulation followed in this chapter was refined;
• Yuhang Zhang, a visiting student at the time of these discussions, also made some
contribution;
• Phil Torr, my PhD supervisor, and Nigel Crook, my second supervisor, provided
feedback based on results and paper drafts throughout the project.
Chapter 5
A Robust Stereo Prior for Human
Segmentation
We demonstrated in the previous chapter that using a dual decomposition framework
can achieve good results on human pose estimation, segmentation and depth estimation.
However, the main problem with the framework was the time required to run the al-
gorithm. The most significant bottleneck in the framework of the previous chapter was
found to be caused by the addition of millions of vertices in the graph to unify the stereo
and segmentation tasks. In this chapter, we show how a high-quality segmentation result
can be obtained directly from the disparity map.
In human segmentation, the goal is to obtain a binary labelling of an image, showing
which pixels belong to the person of interest. The human will generally occupy a continu-
ous region of the scene, and the depth of the human will be smooth within this region.
In contrast, the background region tends to have a different depth to that of the human.
This property motivates our method of generating a segmentation prior: after finding
a small set of seed pixels which have high probability of belonging to the foreground
object, we use a flood fill algorithm to find a continuous region of the image which has a
smooth disparity.
As shown in Figure 5.1, a flood fill prior can generate promising human segmentations
by itself. We show how to integrate this prior into our inference framework from Section
4.3, improving the speed, and also providing a slight increase in performance.
The chapter is structured as follows: related work is summarised in the next section,
describing the range expansion formulation. Then, the flood fill prior is introduced in
Section 5.2. Our modified inference framework is described in Section 5.3. Experimental
results follow in Section 5.4, and concluding remarks are given in Section 5.5.
An earlier version of this chapter previously appeared as [107].
5.1 Related Work
The use of priors for segmentation has made a significant difference for a variety of
computer vision problems in recent years. Many of these approaches can be traced back
to the introduction of GrabCut [100], an interactive method in which the user specifies
a rectangular region of interest, and then the foreground region is refined based on that
region using graph cuts. Larlus and Jurie [73] used a similar approach, using object
detection to supply bounding boxes and thus automate the process. More sophisticated
priors have been used in works such as ObjCut and PoseCut, which use the output of
object detection and pose estimation algorithms respectively to provide a shape prior
term [16, 66].
Lower-level priors are used by Gulshan et al. [47], who formulate the task of segment-
ing humans as a learning problem, using linear classifiers to predict segmentation masks
from HOG descriptors. Ladický et al. [70] showed how multiple such features and priors
can be combined, avoiding the need to make an a priori decision of which is most appro-
priate. Their Associative Hierarchical Random Field (AHRF) model has demonstrated
state of the art results for several semantic segmentation problems.
In this chapter, we use the intuition that foreground objects will occupy a continuous
region of 3D space, and will therefore exist within a smooth range of depth values. Given
the bijective relation between depth and disparity, we show how disparity can be used as
a discriminative tool to aid segmentation, and propose a novel segmentation prior based
on the disparity map produced by a stereo correspondence algorithm.
Figure 5.1: The seeds (in red) on the disparity map in (a) are used to generate the flood fill prior in (b), which is a good approximation of the ground truth segmentation (c).
Many approaches exist for solving the stereo correspondence problem, and a review
of recent methods can be found in Section 3.4. In this chapter, we follow the approach
of Kumar et al. [67], who proposed an efficient way to solve the stereo correspondence
problem using graph cuts. They note that the disparity labels form a discrete range of
possible values, and that therefore the stereo correspondence problem is suitable for the
application of range moves. A range move is a move making approach where rather than
only considering one or two labels at each iteration, a range of several possible values is
considered. Kumar et al. define two different range move formulations; the one we use in
this chapter is range expansion. In a single range expansion move, each pixel can either
keep its old label, or choose from a range of consecutive labels – in this case, disparity
values.
5.1.1 Range Move Formulation
Given a stereo pair of images L and R, our aim is to solve the stereo correspondence problem,
and find a disparity map. In order to solve this problem via graph cuts, we construct
a set of pixels Z = {Z_1, Z_2, . . . , Z_N}, and a set of disparity labels D = {0, 1, . . . , M − 1}.
A solution obtained via graph cuts is referred to as a labelling z, where each pixel variable
Z_m is assigned a label d_m ∈ D. The set of possible labellings is referred to as the label
space.
Move-making algorithms find a local minimum solution to a graph cut problem by
making a series of “moves” in the label space, where each move is governed by a set of
specific rules. One of the more popular moves is α-expansion [15], where each variable
can either retain its current label, or change its label to α. When the set of labels is
ordered (as D is), it is possible to efficiently obtain a local minimum by considering a
range of labels rather than just one. This can be done by using the range expansion
algorithm, as proposed by Kumar et al. [67].
At each iteration k of the range expansion algorithm, let z^(k) be the current labelling
of pixels; each pixel Z_m has a disparity label d^(k)_m. We consider an interval D^(k) of labels,
where D^(k) = {α^(k), α^(k) + 1, . . . , β^(k)}. The option is provided for each pixel to either
keep its current label d^(k+1)_m = d^(k)_m, or choose a new label d^(k+1)_m ∈ D^(k). Formally, we find
a labelling z^(k+1) satisfying:

z^(k+1) = arg min_z f_D(z), (5.1)

such that ∀ Z_m ∈ Z : d^(k+1)_m = d^(k)_m OR d^(k+1)_m ∈ D^(k),
where f_D is the energy function of the stereo correspondence problem, with unary potentials
θ_D, pairwise potentials φ_D, and weight parameters γ_1 and γ_2:

f_D(z) = γ_1 Σ_{Z_m} θ_D(Z_m, d_m) + γ_2 Σ_{(Z_m1,Z_m2)∈C} φ_D(d_m1, d_m2), (5.2)

where C is the set of neighbouring pairs of pixels in Z, φ_D is the L_1 norm on the depth
difference, and d_m1 and d_m2 are the disparity values of the pixels Z_m1 and Z_m2 respectively.
To minimise this energy function, we follow the formulation of [67]. A graph G^(k) =
{V^(k), E^(k)} with source s and sink t is created, and for each pixel Z_m, nodes v_(m,α^(k)), . . . ,
v_(m,β^(k)) are added.
The edges (s, v_(m,α^(k))) and (v_(m,j), v_(m,j+1)) ∈ E^(k), where α^(k) ≤ j < β^(k), have as
their capacities the unary potentials in f_D. If the current disparity d^(k)_m is outside the
interval D^(k), then the unary potential is used as the capacity of the edge (s, v_(m,α^(k))):

c^(k)(s, v_(m,α^(k))) = θ_D(Z_m, d^(k)_m),  if d^(k)_m ∉ D^(k); (5.3a)
                        ∞,  otherwise, (5.3b)
while the capacity of edges (v_(m,j), v_(m,j+1)) is given by:

c^(k)(v_(m,j), v_(m,j+1)) = θ_D(Z_m, j). (5.4)

Finally, the unary potential for disparity β^(k) is used as the capacity of the edge (v_(m,β^(k)), t):

c^(k)(v_(m,β^(k)), t) = θ_D(Z_m, β^(k)). (5.5)
For neighbouring pixels Z_m and Z_m', edges are added between all nodes v_(m,j) and v_(m',j')
for j and j' in the range {α^(k), . . . , β^(k)}, except for the case where j = j' = α^(k). These
edges store the pairwise potentials:

c^(k)(v_(m,j), v_(m',j')) = φ_(Z_m,Z_m')(j, j'). (5.6)
If at least one of d^(k)_m and d^(k)_m' is not in D^(k), then an additional edge is added. If
d^(k)_m ∈ D^(k) and d^(k)_m' ∉ D^(k), the edge (v_(m,α^(k)), v_(m',α^(k))) is added, with capacity L + κ/2,
where L = β^(k) − α^(k) is the length of the interval, and κ is a constant. If d^(k)_m ∉ D^(k)
and d^(k)_m' ∈ D^(k), then the edge (v_(m',α^(k)), v_(m,α^(k))) is added with the same capacity. If
d^(k)_m ∉ D^(k) and d^(k)_m' ∉ D^(k), then an additional node µ_(m,m') is added, with edges:

c^(k)(v_(m,α^(k)), µ_(m,m')) = c^(k)(µ_(m,m'), v_(m,α^(k))) = L + κ/2; (5.7)
c^(k)(v_(m',α^(k)), µ_(m,m')) = c^(k)(µ_(m,m'), v_(m',α^(k))) = L + κ/2; (5.8)
c^(k)(s, µ_(m,m')) = φ_(Z_m,Z_m')(d^(k)_m, d^(k)_m') + κ. (5.9)
The formulation is given in full in [67], and an example of a sequence of range move
iterations is shown in Figure 5.2.
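At a high level, the range expansion algorithm simply sweeps over overlapping intervals of disparity labels and performs one graph cut per interval. The following sketch conveys that outer structure only; build_range_move_graph, min_cut and accept_if_lower_energy are hypothetical placeholders for the construction of (5.3)–(5.9), a standard max-flow solver, and the usual move-acceptance test.

    def range_expansion(cost_volume, num_labels, interval_length, num_sweeps=3):
        # Initialise each pixel to its cheapest unary disparity.
        disparity = cost_volume.argmin(axis=-1)

        for _ in range(num_sweeps):
            for alpha in range(0, num_labels, interval_length):
                beta = min(alpha + interval_length, num_labels - 1)
                # Each pixel may keep its label or move to a label in [alpha, beta], cf. (5.1).
                graph = build_range_move_graph(cost_volume, disparity, alpha, beta)
                proposal = min_cut(graph)        # new labels decoded from the cut
                disparity = accept_if_lower_energy(disparity, proposal, cost_volume)
        return disparity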
5.2 Flood Fill Prior
After using range expansions to find the disparity map, we apply a flood fill to generate
a prior for human segmentation. Flood fill is an algorithm that, given a list of seeds, fills
connected parts of a multidimensional array, e.g. an image, with a given label. Starting
with two pixels which have a high probability of belonging to the human, we find a
continuous region of the image which has a smooth disparity. This region is specified by
a binary image F.
The flood fill algorithm is a standard method that has many applications: for example,
it can be used for bucket filling in graphics editors such as GIMP or Microsoft Paint;
Figure 5.2: The results of three successive range expansion iterations: (a) the disparity map after the first range move iteration; (b) after the second; (c) after the third. At first, there are very few disparities to choose from, but with further iterations, the person becomes more visible against the background.
Algorithm 5.1 Generic flood fill algorithm for an image I of size W × H.
Input: non-empty list S of seed pixels
initialise a zero-valued matrix F of size W × H; empty queue of new ranges R
for all seeds s = (s_x, s_y) ∈ S do
  v ← I(s_x, s_y)
  (F, x_min, x_max) ← doLinearFill(F, I, s_x, s_y, v)
  if x_min < x_max then
    add (x_min, x_max, s_y, v) to back of queue R
  end if
  while R ≠ ∅ do
    remove (x_min, x_max, y, v) from front of R
    if y > 0 then
      (R, F) ← multiLinearFill(F, I, x_min, x_max, y − 1, v)
    end if
    if y < H − 1 then
      (R, F) ← multiLinearFill(F, I, x_min, x_max, y + 1, v)
    end if
  end while
end for
return F
a memory-efficient version is given in general form in Algorithm 5.1.¹ The function
doLinearFill(F, I, s_x, s_y, v), given in Algorithm 5.2, finds the largest range (x_min, x_max),
where x_min ≤ s_x ≤ x_max, such that for all x, x + 1 within the range, |I(x, s_y) − v| < ε, where
v is the intensity value of the seed pixel. In other words, it finds the largest horizontal
region of the image containing (s_x, s_y) that has an intensity value close to that of the
seed pixel.
The function multiLinearFill(F, I, x_min, x_max, y, v) executes doLinearFill starting
from each pixel (x, y) where x_min ≤ x ≤ x_max, provided |I(x, y) − v| < ε and F(x, y) = 0,
i.e. (x, y) has not already been filled by the algorithm. Each execution of doLinearFill
generates a new range (x'_min, x'_max); if x'_min < x'_max, then (x'_min, x'_max, y, v) is added to R.
We use the flood fill algorithm to generate a segmentation of the human from the
disparity map. We run the pose estimation algorithm of Yang and Ramanan [131] on
the original RGB image, and then use the endpoints of the top-ranked torso estimate as
seeds.

¹ The algorithm was written by Julien Valentin, a member of the Brookes vision group, and was based on the implementation of Dunlap [29].
Algorithm 5.2 doLinearFill: perform a linear fill from the seed point (s_x, s_y).
Input: Image I, current flood fill map F, starting points s_x and s_y, and initial seed value v.
Require: 0 ≤ s_x < W, 0 ≤ s_y < H
if |I(s_x, s_y) − v| > ε then
  return (F, s_x, s_x)
end if
F(s_x, s_y) ← 1
x ← s_x
while x < W − 1 and |I(x + 1, s_y) − v| < ε do
  F(x + 1, s_y) ← 1
  increment x
end while
x_max ← x, x ← s_x
while x > 0 and |I(x − 1, s_y) − v| < ε do
  F(x − 1, s_y) ← 1
  decrement x
end while
x_min ← x
return (F, x_min, x_max)
To determine whether a value I(x, y) matches its neighbour I(x + δx, y + δy) (where
|δx| + |δy| = 1), we consider the difference between the two values; the pair of values
match if the difference is below a preset threshold.
If the person is fully visible, then we will also be able to see the point where their feet
touch the floor. At this point, the disparity value of the floor is often very similar to the
disparity value of the foot, and so it is possible for the segmentation prior to ‘leak’ on to
the floor. In order to prevent this, we estimate the position of the floor plane.
The disparity d_m of a pixel Z_m has an inverse relation to its depth z_m,

z_m = b(f_x + f_y) / (2 d_m), (5.10)

where b is the baseline (the distance between the camera centres) of the stereo camera,
and f_x and f_y are the focal lengths in x and y respectively (diagonal elements of the
camera matrix).
Once we have the depth z_m, the real-world height can be obtained via a projective
transformation. We also assume that the camera is parallel to the floor plane, and we
know its height above the ground at capture time, so by thresholding the height values,
we can obtain an estimate for the floor plane. A typical flood fill prior is shown in Figure 5.1.
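A minimal sketch of this floor check, assuming a calibrated camera held parallel to the floor at a known height camera_height, and using (5.10) to convert disparity to depth; all variable names are illustrative.

    import numpy as np

    def floor_mask(disparity, fx, fy, cy, baseline, camera_height, tol=0.05):
        h, w = disparity.shape
        ys = np.arange(h)[:, None].repeat(w, axis=1)

        # Depth from disparity, cf. (5.10); guard against zero disparity.
        depth = baseline * (fx + fy) / (2.0 * np.maximum(disparity, 1e-6))

        # Back-project the vertical image coordinate to a height above the ground,
        # assuming the camera axis is parallel to the floor plane.
        height_below_camera = (ys - cy) * depth / fy
        world_height = camera_height - height_below_camera

        # Pixels whose estimated height is close to zero are treated as floor,
        # and so are kept out of the flood fill prior.
        return np.abs(world_height) < tol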
5.3 Application: Human Segmentation
Here, we discuss our modifications to the inference framework of Section 4.3. To recap, the
objective is to minimise an energy function which consists of five terms: three representing
the distinct problems that the function unifies (i.e. stereo correspondence, segmentation,
and pose estimation), and two joining terms, which encourage consistency between the
solutions of these problems. As input, we have a stereo pair of images L and R. The parts-based
pose estimation algorithm of Yang and Ramanan [131] is used as a preprocessing
step to obtain a number N_E of proposals for K = 10 body parts: head, torso, and two
for each of the four limbs. For each part i, each proposal j consists of a pair of image
co-ordinates, representing the endpoints of the limb (or skull, or spine).
5.3.1 Original Formulation
The approach is formulated as a conditional random field (CRF), with two sets of random
variables. The set Z = {Z_1, Z_2, . . . , Z_N} represents the image pixels; in addition to a
disparity label from the multi-class label set D defined in Section 5.1.1, each pixel Z_m is
given a binary segmentation label s_m from S = {0, 1}. The set B = {B_1, B_2, . . . , B_K}
represents the body parts; a label b_i is assigned to each part B_i from the
multi-class label set B. Any possible assignment of labels to these variables is called a
labelling, and denoted by z. As it was in (4.22), the energy of a labelling is written as:

E(z) = f_D(z) + f_S(z) + f_P(z) + f_PS(z) + f_SD(z), (5.11)
with each term containing weights γ_i ∈ R^+.

In order to utilise range moves and apply the new flood fill prior, we need to make some
amendments to the formulation given in Section 4.3. We modify three of the terms: f_D,
which gives the cost of the disparity label assignment {d_m}_{m=1}^N; f_S, which gives the cost
of the segmentation label assignment {s_m}_{m=1}^N; and f_SD, which unifies the two. We now
describe in turn each of the terms in (5.11), along with our modifications to these terms.
For clarity, the terms as they appeared in Chapter 4 will be denoted by f, while our new
functions will be denoted by g. Weights carried over from the original formulation will
be denoted by γ (with the same index), and new weights will be denoted by w.
5.3.2 Stereo Term f_D
The energy f_D(z) represented the disparity map. In (4.31), the energy was simply given
in terms of a cost volume θ_D, which for each pixel Z_m specified the cost of assigning a
disparity label d_m. This cost incorporated the gradient in the x-direction, so a pairwise
term was not added:

f_D(z) = γ_1 Σ_{Z_m} θ_D(Z_m, d_m). (5.12)
In order to apply range moves, we introduce a truncated pairwise term to formalise the
notion that adjacent pixels should have similar disparities. After the k-th iteration, we
denote by d^(k)_m the labelling of each pixel Z_m. We define C to be the set of pairs of
neighbouring pixels (Z_m, Z_m'). Given a labelling z, the pairwise cost associated with a
pair of pixels (Z_m, Z_m') ∈ C is:

φ_(Z_m,Z_m')(d^(k)_m, d^(k)_m') = |d^(k)_m − d^(k)_m'| ( λ_1 + λ_2 exp( −ψ_D(Z_m, Z_m') / λ_3 ) ), (5.13)

where

ψ_D(Z_m, Z_m') = min( ‖L(m) − L(m')‖, λ_4 ), (5.14)
for some given maximum distance λ_4. The overall energy of the labelling is:

g_D(z) = γ_1 Σ_{Z_m} θ_D(Z_m, d^(k)_m) + w_1 Σ_{(Z_m,Z_m')∈C} φ_(Z_m,Z_m')(d^(k)_m, d^(k)_m'), (5.15)

where γ_1, w_1, λ_1, λ_2 and λ_3 are preset weight parameters.
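As an illustration, the contrast-sensitive pairwise weight of (5.13)–(5.14) can be computed for a single pair of neighbouring pixels as follows (a sketch only; the λ values are the preset parameters of the text):

    import numpy as np

    def pairwise_disparity_cost(d_m, d_n, colour_m, colour_n,
                                lam1, lam2, lam3, lam4):
        # Truncated colour difference between the two left-image pixels, cf. (5.14).
        psi = min(np.linalg.norm(np.asarray(colour_m, float) -
                                 np.asarray(colour_n, float)), lam4)

        # Disparity difference modulated by image contrast, cf. (5.13):
        # smooth regions (small psi) are penalised more for disagreeing disparities.
        return abs(d_m - d_n) * (lam1 + lam2 * np.exp(-psi / lam3))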
5.3.3 Segmentation Terms f_S and f_SD
As introduced in (4.27), f_S(z) gave the segmentation energy. The part estimates obtained
from Yang and Ramanan's algorithm were used to create a foreground weight map W_F,
from which Gaussian mixture models based on RGB values were fitted for the foreground
and background respectively, built using the Orchard-Bouman clustering algorithm [88].²
These were used to obtain unary costs θ_F and θ_B of assigning each pixel to foreground
and background respectively. Additionally, a pairwise cost φ_S was included to penalise
the case where adjacent pixels are assigned to different labels:

f_S(z) = γ_2 · θ_S(z) + γ_3 · φ_S(z), (5.16)
where:

θ_S(z) = Σ_{Z_m∈Z} s_m · θ_F(Z_m) + (1 − s_m) · θ_B(Z_m), (5.17)

φ_S(z) = Σ_{(Z_m1,Z_m2)∈C} 1(s_m1 ≠ s_m2) exp( −β ‖L(m_1) − L(m_2)‖² ). (5.18)
f_SD(z), detailed in (4.37), was a cost function that penalised the case where pixels with
high disparity were assigned to the background, contrary to the notion that the foreground
object is in front of the background. Using the foreground weight map W_F defined above,
a set F of pixels with a high probability of being foreground was obtained, from which
the expected foreground disparity E_F was calculated.

² This algorithm was used in the GrabCut implementation on which the code used in this work was based [119], and was found empirically to give the best separation between foreground and background.
Background pixels with disparity greater than E_F + ξ, where ξ is a non-negative slack variable, were then penalised:

f_SD(z) = γ_4 Σ_{Z_m∈Z} (1 − s_m) · max(d_m − E_F − ξ, 0). (5.19)
We find that a flood fill prior proves to be much more reliable in discriminating between
high-disparity pixels that belong to the object, and those far from the object. Therefore,
we replace f_SD with a flood fill-based term. Define F = {ρ^(k)_m : Z_m ∈ Z} to be the flood
fill prior at iteration k. We can penalise disagreements between F and the segmentation
s using a Hamming distance between this prior and the current segmentation mask:
f_SD(z) = γ_SD Σ_{Z_m∈Z} 1(s_m ≠ ρ_m), where γ_SD is again a preset weighting term. f_SD now
depends only on the flood fill prior and the segmentation. Since the flood fill prior at
each iteration is fixed, we can merge the two terms f_SD and f_S, with the former being
subsumed by the unary segmentation potential. This becomes:
θ_S(z, ρ) = Σ_{Z_m∈Z} [ γ_2 ( s_m · θ_F(Z_m) + (1 − s_m) · θ_B(Z_m) ) + w_2 · 1[s_m ≠ ρ_m] ], (5.20)
where θ_F and θ_B are as defined in (5.16). The overall segmentation energy is:

g_S(z, ρ) = θ_S(z, ρ) + γ_3 · φ_S(z). (5.21)
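In other words, the flood fill prior enters the model as a simple per-pixel penalty added to the colour-model unaries. A minimal sketch of the merged unary of (5.20), assuming theta_F and theta_B are per-pixel foreground and background costs and rho is the binary flood fill prior:

    import numpy as np

    def segmentation_unaries(theta_F, theta_B, rho, gamma_2, w_2):
        # Cost of labelling each pixel foreground (s_m = 1) or background (s_m = 0).
        cost_fg = gamma_2 * theta_F + w_2 * (rho == 0)  # disagrees with prior where rho says background
        cost_bg = gamma_2 * theta_B + w_2 * (rho == 1)  # disagrees with prior where rho says foreground
        return cost_fg, cost_bg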
5.3.4 Pose Estimation Terms f_P and f_PS
The terms relating to pose estimation are unchanged from the framework described in
Section 4.3, and appear here for ease of reference. f_P(z), given in (4.30), denotes the
cost of a particular selection of parts {b_i}_{i=1}^{10}. Each proposal j for each part i has an
associated unary cost θ_P(i, j); connected parts i_1 and i_2 have associated pairwise costs,
to penalise the case where parts that should be connected are distant from one another
in image space. Defining T as the set of connected parts (i_1, i_2), the cost function is:

g_P(z) = f_P(z) = γ_5 Σ_{i=1}^{10} θ_P(i, b_i) + γ_6 Σ_{(i_1,i_2)∈T} φ_{i_1,i_2}(b_{i_1}, b_{i_2}). (5.22)
f_PS(z), given in (4.35), encodes the relation between segmentation and pose estimation:
we expect foreground pixels to be close to some body part, and we expect pixels close to
some body part to be foreground. The energy is written as:

g_PS(z) = f_PS(z) = γ_7 Σ_{i,j} Σ_{Z_m} [ 1(b_i = j) · (1 − s_m) · w^m_{ij} ] + γ_8 Σ_{Z_m} 1( max_{i,j} w^m_{ij} < τ ) · s_m. (5.23)
5.3.5 Energy Minimisation
Our new energy function, with the terms defined in Sections 5.3.2 and 5.3.3, is as follows:

E(z) = g_D(z) + g_S(z) + g_P(z) + g_PS(z). (5.24)
In order to efficiently minimise this energy function, the label set B is binarised. Each
body part B_i takes a vector of binary labels b_i = [b_(i,1), b_(i,2), · · · , b_(i,N_E)], where each b_(i,j)
is equal to 1 if and only if the j-th proposal for part i is selected. Note that since we solve
g_D efficiently using range moves, and the term linking stereo and segmentation is now
included in g_S, we no longer need to binarise D, as we can efficiently find a multi-class
solution. This has an effect on the cost vector λ_D, which we discuss in Section 5.3.6.

An assignment of binary labels to all pixels and parts is denoted by z̄. We introduce
duplicate variables z̄_1 and z̄_2, and add Lagrangian multipliers to penalise disagreement
between z̄ and its two copies, forming the Lagrangian dual of the energy function (5.24).
Figure 5.3: Diagram showing the two-stage update process. (a) Message-passing structure from Chapter 4; (b) new message-passing structure. Left: the slaves find labellings z, z_1, z_2 and pass them to the master; note that this is the same for both the old and new formulations. Right: the master updates the cost vectors λ_D and λ_P and passes them to the slave. In the new formulation, the flood fill prior ρ is passed to the slave L_3 instead of the cost vector λ_D.
This is then divided into three slave problems using dual decomposition:

L(z̄, z̄_1, z̄_2, ρ) = g_D(z̄_1) + g_S(z̄, ρ) + g_P(z̄_2) + g_PS(z̄) + λ_D(z̄ − z̄_1) + λ_P(z̄ − z̄_2)
                 = L_1(z̄_1, λ_D) + L_2(z̄_2, λ_P) + L_3(z̄, λ_D, λ_P, ρ), (5.25)

where:

L_1(z̄_1, λ_D) = g_D(z̄_1) − λ_D z̄_1; (5.26)
L_2(z̄_2, λ_P) = g_P(z̄_2) − λ_P z̄_2; (5.27)
L_3(z̄, λ_D, λ_P, ρ) = g_S(z̄, ρ) + g_PS(z̄) + λ_D z̄ + λ_P z̄. (5.28)
Figure 5.3 shows the decomposition structure used in Section 4.4, and how it relates to
our new decomposition structure. The change in L_3 means that the messages passed
between the slaves and the master also change. The new message passing structure sees
the flood fill prior passed from the master to the slave, instead of λ_D.
5.3.6 Modifications to the λ_D Vector
Removing the explicit statement of f_SD means that the sub-problem L_3 no longer depends
on finding values for stereo. Recall that an expected foreground disparity E_F, and a non-negative
slack variable ξ, were defined in Section 5.3.3. This gives us a range [E_F − ξ, · · · , E_F + ξ] of disparity values which we associate with the foreground region.
When a segmentation result has been obtained, the λ_D cost vector can be adjusted
by comparing the flood fill prior ρ^(k) with the segmentation result s^(k+1), and altering
the cost for those pixels for which the segmentation result disagrees with the flood fill
prior. If a pixel is segmented as foreground despite the flood fill prior encouraging it to be
segmented as background, this implies that the disparity costs within this range should
be reduced, and the costs outside the range increased; conversely, if a pixel is background
despite the flood fill insisting it is foreground, then the costs for disparities within the
range should be increased. Therefore, we set the cost update vector λ_D as follows:

λ_D(x, y, d) = ρ^(k)(x, y) − s^(k+1)(x, y),  if |d − E_F| < ξ; (5.29)
               s^(k+1)(x, y) − ρ^(k)(x, y),  otherwise. (5.30)
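A direct sketch of this update, assuming rho and seg are binary H × W arrays (the flood fill prior and the new segmentation) and that lambda_D is stored as an H × W × K volume; the names are illustrative.

    import numpy as np

    def update_lambda_D(rho, seg, E_F, xi, num_disparities):
        disparities = np.arange(num_disparities)
        in_range = np.abs(disparities - E_F) < xi       # (K,) mask of "foreground" disparities

        diff = rho.astype(float) - seg.astype(float)    # (H, W): +1, 0 or -1 per pixel

        # Within the foreground disparity range use (rho - s), outside it use (s - rho),
        # cf. (5.29) and (5.30).
        lambda_D = np.where(in_range[None, None, :],
                            diff[:, :, None],
                            -diff[:, :, None])
        return lambda_D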
5.4 Experiments
To test the effects that our modifications to the formulation of Section 4.3 had on the
results, we evaluate our algorithm on the H2view dataset. To evaluate pose estimation,
we use the standard probability of correct pose (PCP) criterion; in this section, we also
present quantitative segmentation results, using the overlap (intersection over union)
metric.
5.4.1 Segmentation
We used the H2view test sequence of 1598 images to evaluate our approach; for compar-
ison, we also tested GrabCut [100], ALE [71], and our original algorithm. We achieve a
segmentation accuracy of 69.23%; some qualitative results are shown in Figure 5.4. The
Figure 5.4: Sample segmentation results. (a) RGB image; (b) result from Chapter 4; (c) our result; (d) ground truth segmentation. Note that the flood fill prior correctly segments the outstretched arm in the third example. The prior also leads to a less blocky appearance in the segmentation results. In the first two examples, part of the window is segmented as foreground by the framework of Chapter 4, but the flood fill prior prevents this.
Figure 5.5: Sample segmentation results, with overlap scores for this image. (a) Original image; (b) disparity map; (c) our result (overlap: 83.46%); (d) ground truth segmentation; (e) ALE [71] (51.51%); (f) [100] on RGB (64.18%); (g) [100] on depth (51.84%); (h) [100]: two-stage (69.61%). Note that ALE and GrabCut [100] on RGB both classify parts of the window and other background areas as foreground, while GrabCut based on the disparity map fails to capture thin structures such as the arm and head.
Table 5.1: Segmentation results on the H2view dataset, compared with previous results.
Using the RGB image to refine the result obtained by GrabCut on depth improves the
result slightly, but our joint approach is superior.

Method              Precision   Recall    F-Score   Overlap
Two-stage GrabCut   60.00%      73.53%    66.08%    49.34%
GrabCut on depth    69.96%      53.24%    60.47%    45.14%
GrabCut on RGB      43.03%      93.64%    58.96%    43.91%
ALE [71]            83.19%      73.58%    78.09%    64.29%
Chapter 4           69.33%      89.61%    78.18%    67.62%
Ours                79.59%      83.23%    81.37%    69.23%
accuracy is reasonably consistent over the sequence, with most images achieving accur-
acy close to the average, but there is room for improvement in temporal consistency via
video-based cues. It should be noted that the segmentation results are quite sensitive
to initialisation; if the torso is classified incorrectly, the flood fill prior is likely to be
incorrect, leading to a bad segmentation result (Figure 5.6). Developing a resistance to
this pitfall is a promising direction for future research.
These results compare favourably to those obtained by the other approaches we eval-
uated (a qualitative comparison is shown in Figure 5.5). The addition of the prior has
improved the precision of our algorithm by 10% compared to the framework of Chapter
4, meaning that although the recall has dropped, the overall performance (overlap) in-
creased by 1.6%. Although the precision achieved is still not quite as good as ALE, the
recall is much better, meaning the overlap is almost 5% better.
In order to test GrabCut, we obtained initial bounding boxes by thresholding the
foreground weight map W
F
defined in Section 5.3.3. We ran the GrabCut algorithm
on three passes through our test set: first, using solely the RGB images; second, on
disparity values; and finally, running GrabCut on the disparity map and then refining
the result with the RGB information. The results, given in Table 5.1, show that combining
information from depth and RGB (as we do in “two-stage GrabCut”) can improve the
segmentation results, but the accuracy can be improved by a large margin by using a
flood fill prior.
Table 5.2: Results (given in % PCP) on the H2view test sequence.

Method          Torso   Head   Upper arm   Forearm   Upper leg   Lower leg   Total
Ours            96.3    92.4   77.3        42.0      89.3        81.4        76.86
Chapter 4       94.9    88.7   74.4        43.4      88.4        78.8        75.37
Yang [131]      72.0    87.3   61.5        36.6      88.5        83.0        69.85
Andriluka [1]   80.5    69.2   60.2        35.2      83.9        76.0        66.03
5.4.2 Pose Estimation
While we did not directly alter the pose estimation algorithm used in Chapter 4, our im-
proved segmentation results also lead to a slight improvement in pose estimation results.
Full quantitative results are given in Table 5.2.
5.4.3 Runtime
The framework presented has been implemented in a single-threaded C++ implement-
ation, and runs at 25 seconds per frame on a 2.67GHz processor. This represents a
significant speed improvement over the previous framework – the computational time per
image has been decreased by a factor of 6.6. Again, this does not include the overhead
caused by needing to run Yang and Ramanan’s pose estimation to get the initial pose
estimates [131].
5.5 Conclusion
In this chapter, we have demonstrated the applicability of the flood fill algorithm for
generating segmentation priors. Disparity maps generated by stereo correspondence al-
gorithms can be used as input, and given a set of seeds with high probability of belonging
to the foreground object, accurate segmentation results can be obtained. We have also
incorporated our prior into the dual decomposition framework of Chapter 4, greatly im-
proving upon the segmentation results. Our segmentation results show that combining
information from depth and RGB can improve the segmentation results, but the accuracy
can be improved by a large margin by using a flood fill prior.
The results presented in this chapter, both for segmentation and for human pose
estimation, can be further improved by adding video-based cues, such as motion tracks.
This should improve temporal consistency, in particular giving the algorithm a chance to
recover in the rare cases where the seeds used in flood fill were incorrect (shown in Figure
5.6). Indeed, the centroid of the segmented object could be used as a seed for the next
frame.
Finally, we note that while the speed of the updated framework is significantly faster
than that of Chapter 4, the dual decomposition algorithm is still much too slow for
real-time applications such as video games. Recently, a powerful new approximation
technique, mean field inference [57], has emerged, and has proven to be very powerful in
combining object segmentation and disparity computation [124]. In the next chapter, we
show that our framework is also suitable for the application of mean field inference.
Acknowledgements
As with the previous chapter, this section contains attributions to the other co-authors
of the paper on which this chapter was based, as follows:
• Julien Valentin coded an initial version of the flood fill algorithm used in this
chapter, and provided valuable feedback and comments while the formulation was
being adapted.
• Phil Torr, my PhD supervisor, and Nigel Crook, my second supervisor, provided
feedback based on results and paper drafts throughout the project.
Figure 5.6: Failure cases of segmentation. (a) The original image; (b) result of Chapter 4; (c) our result; (d) ground truth. In (c), incorrect torso detection means that the flood fill prior misleads the segmentation algorithm, resulting in the wrong region being segmented.
Chapter 6
An Efficient Mean Field Based Method for
Joint Estimation of Human Pose,
Segmentation, and Depth
While the methods developed in Chapters 4 and 5 show state-of-the-art results for seg-
mentation and pose estimation, the slow speed (some 25 seconds per frame) makes them
infeasible for development in video game applications, where real-time performance is
required. Therefore, it is desirable to find an efficiently solvable approximation of the
original problem.
One such method that can be applied here is mean field inference [57]. For a cer-
tain class of pairwise terms, mean-field inference has been shown to be very powerful in
solving the object class segmentation and object-stereo correspondence problems in CRF
frameworks, providing an order-of-magnitude speedup [124].
The main contribution of this chapter is the proposition of a highly efficient filter-
based mean field approach to perform joint estimation of human segmentation, pose and
disparity in the product label space. The application of mean-field inference produces a
significant improvement in speed, as well as noticeable qualitative improvements.
As with the previous two chapters, the efficiency and accuracy of the new model
is evaluated using the H2view dataset, introduced in Section 4.2. We show results for
segmentation and pose estimation; disparity computation is used to improve these results,
but is not quantitatively evaluated as it is not feasible to obtain dense ground truth data.
We achieve a 20 times speedup compared to the current state-of-the-art methods [1, 70, 131],
as well as achieving better accuracy in all cases.
The remainder of this chapter is structured as follows: in the next section, we give an
introduction to mean-field inference. Our body part formulation is discussed in Section
6.2, while we describe our joint inference framework in Section 6.3. Results follow in
Section 6.4, and the chapter concludes with a discussion in Section 6.5.
An earlier version of this chapter forms part of a paper that is under submission at a
major computer vision conference.
6.1 Mean Field Inference
While graph cut-based inference methods are fairly efficient for the task of object class
segmentation, they fail to capture long-range interactions, which as shown in Figure
6.1(b), can result in oversmoothed boundaries between object classes. An alternative
approach is to use a CRF that is fully connected, i.e. one that has pairwise connections
between every pair of pixels in the image, regardless of the distance between them.
Despite the improvement in results that this gives, this is not a feasible way to improve
results on large sets of images, since the time taken by a graph cut based inference method
rises from 1 second (for the Robust P^n CRF [56]) to around 36 hours (for the fully
connected CRF) [64]. An alternative approach is to use mean-field inference to solve this
problem; the inference algorithm of Krähenbühl and Koltun [64] successfully recovers fine
boundaries between object classes (as shown in Figure 6.1(c)), and only takes around 0.2
seconds to converge. The image which these timings are based on is 237 by 356 pixels;
the performance of the algorithms on images of other sizes was not specified in [64].
Figure 6.1: The Robust P^n CRF [56] produces an oversmoothed result on this image of a tree, from the MSRC-21 dataset [77, 111]. (a) Image; (b) Robust P^n CRF; (c) mean field inference for fully connected pairwise CRF; (d) ground truth. Using a fully connected CRF recovers the fine detail, but is prohibitively slow. Mean field inference is much quicker, and still achieves similar performance. (Images taken from [64].)
6.1.1 Introduction to Mean-Field Inference
Define the energy of a CRF with fully connected pairwise terms as:

E(x) = Σ_i θ(x_i) + Σ_{i<j} φ(x_i, x_j), (6.1)
where i and j range from 1 to N, and N is the number of vertices in the graph. In the
formulation of Krähenbühl and Koltun [64], the pairwise potential is defined using a linear
combination k of Gaussian kernels:

φ(x_i, x_j) = µ(x_i, x_j) k(f_i, f_j), (6.2)
where the label compatibility function µ can be, for example, a Potts model:

µ(x_i, x_j) = 1(x_i ≠ x_j); (6.3)
and k is a linear combination of bilateral and spatial Gaussian kernels:

k(f_i, f_j) = w^(1) exp( −|p_i − p_j|² / (2λ_α²) − |I_i − I_j|² / (2λ_β²) ) + w^(2) exp( −|p_i − p_j|² / (2λ_γ²) ), (6.4)

where λ_α, λ_β and λ_γ are truncation parameters. We can use the Gibbs distribution to
transform the energy function (6.1) into the probability of a given labelling x:

P(x|I) = P̃(x|I) / Z = (1/Z) exp(−E(x, I)), (6.5)
where Z is a normalising constant.
In the mean field approach, given the true probability distribution P(x|I) defined
above, we wish to find an approximate distribution Q(x) which is tractable, and which
closely resembles the original distribution. We measure this closeness in terms of the
KL-divergence between P and Q:
D_KL(Q(x)||P(x|I)) = Σ_{x∈X} Q(x) log( Q(x) / P(x|I) ). (6.6)
Our goal is to find the closest such Q from the tractable family of probability distributions Q.
This is defined as:

Q* = arg min_{Q∈Q} D_KL(Q(x)||P(x|I)). (6.7)
One way of approximating the true distribution is to make independence assumptions,
i.e. to partition the set of variables into a collection of subsets, which are then assumed
to be independent from each other. The simplest possible such assumption is to assume
that all of the variables are independent:
Q(x) = \prod_{i \in V} Q_i(x_i).  (6.8)
In this case, (6.6) can be evaluated as follows:
D_{KL}(Q(x) \| P(x|I))
= \sum_{x \in \mathcal{X}} Q(x) \log \frac{Q(x)}{P(x|I)}  (6.9)
= \sum_{x \in \mathcal{X}} Q(x) \log Q(x) - \sum_{x \in \mathcal{X}} Q(x) \log P(x|I)  (6.10)
= \sum_{i \in V} \sum_{x_i} Q_i(x_i) \log Q_i(x_i) - \sum_{x \in \mathcal{X}} Q(x) \log \tilde{P}(x|I) + \log Z  (6.11)
= \sum_{i \in V} \sum_{x_i} Q_i(x_i) \log Q_i(x_i) + \sum_{x \in \mathcal{X}} Q(x) E(x, I) + \log Z,  (6.12)
where the final step is due to (6.5).
The marginal Q_i(x_i) that minimises (6.6) can then be found by analytically minimising a Lagrangian consisting of all terms in D_{KL}(Q(x)||P(x|I)) (for a detailed derivation, the reader is referred to Koller and Friedman, Chapter 11.5 [57]). From Krähenbühl and Koltun [64], the marginal update equation is:
Q_i(x_i = l) = \frac{1}{Z_i} \exp\left\{ -\theta(x_i) - \sum_{j \neq i} \sum_{x_j} Q_j(x_j)\, \phi(x_i, x_j) \right\},  (6.13)
where the normalising constant Z_i is:

Z_i = \sum_{x_i} \exp\left\{ -\theta(x_i) - \sum_{j \neq i} \sum_{x_j} Q_j(x_j)\, \phi(x_i, x_j) \right\}.  (6.14)
The inference algorithm from [64] is given in Algorithm 6.1.
Algorithm 6.1 Naïve mean field algorithm for fully connected CRFs
  Initialise Q: Q_i(x_i = l) ← (1/Z_i) exp(−θ(x_i))
  while not converged do
    Message passing: Q̃_i^(m)(l) ← Σ_{j≠i} k^(m)(f_i, f_j) Q_j(l) for all m
    Compatibility transform: Q̂_i(x_i) ← Σ_{l∈L} µ^(m)(x_i, l) Σ_m w^(m) Q̃_i^(m)(l)
    Local update: Q_i(x_i) ← exp(−θ(x_i) − Q̂_i(x_i))
    Normalise Q_i(x_i)
  end while
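The following C++ sketch mirrors the structure of Algorithm 6.1 on a toy problem with N variables and L labels, a Potts compatibility, and a single precomputed kernel matrix; all sizes and potentials are invented, and a fixed number of iterations stands in for a convergence test. A practical implementation would replace the quadratic message passing with the filter-based approach of [64].

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const int N = 4, L = 2, ITERS = 5;
    // Illustrative unary potentials theta[i][l] and kernel weights K[i][j].
    std::vector<std::vector<double>> theta = {{0.1, 1.0}, {0.2, 0.9},
                                              {1.1, 0.1}, {0.9, 0.2}};
    std::vector<std::vector<double>> K(N, std::vector<double>(N, 0.5));
    std::vector<std::vector<double>> Q(N, std::vector<double>(L));

    auto normalise = [](std::vector<double>& q) {
        double Z = 0.0;
        for (double v : q) Z += v;
        for (double& v : q) v /= Z;
    };

    // Initialise Q_i(l) proportional to exp(-theta_i(l)).
    for (int i = 0; i < N; ++i) {
        for (int l = 0; l < L; ++l) Q[i][l] = std::exp(-theta[i][l]);
        normalise(Q[i]);
    }

    for (int it = 0; it < ITERS; ++it) {
        std::vector<std::vector<double>> Qnew(N, std::vector<double>(L));
        for (int i = 0; i < N; ++i) {
            // Message passing: Qtilde(l) = sum_{j != i} K(i,j) Q_j(l).
            std::vector<double> Qtilde(L, 0.0);
            for (int j = 0; j < N; ++j)
                if (j != i)
                    for (int l = 0; l < L; ++l) Qtilde[l] += K[i][j] * Q[j][l];
            // Compatibility transform (Potts): Qhat(l) = sum_{l' != l} Qtilde(l').
            for (int l = 0; l < L; ++l) {
                double Qhat = 0.0;
                for (int lp = 0; lp < L; ++lp)
                    if (lp != l) Qhat += Qtilde[lp];
                // Local update, followed below by normalisation.
                Qnew[i][l] = std::exp(-theta[i][l] - Qhat);
            }
            normalise(Qnew[i]);
        }
        Q = Qnew;
    }

    for (int i = 0; i < N; ++i)
        std::printf("Q_%d = (%.3f, %.3f)\n", i, Q[i][0], Q[i][1]);
    return 0;
}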
6.1.2 Simple Illustration
As a simple example, consider the simplified human skeleton model shown in Figure 6.2. This skeleton model is represented by a graph G = (V, E) with six vertices (V = {x_1, ..., x_6}, with x_1 representing the torso), and five edges (E = {(x_1, x_2), ..., (x_1, x_6)}). That is, each of the variables x_2 to x_6 is only connected to x_1.
Applying the independence assumption and the derivation above, the mean field update for x_1 is:

Q_1(x_1) = \frac{1}{Z_1} \exp\left\{ -\sum_{i=2}^{6} Q_i(x_i)\, \psi(x_1, x_i) \right\},  (6.15)

where ψ is a kernel on the distance between the parts x_1 and x_i.
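As a concrete, purely illustrative version of this example, the sketch below updates the torso marginal Q_1 given fixed marginals for the five limb variables. Each variable is assumed to range over a small set of candidate image positions, ψ is taken to be a scaled squared-distance kernel, and the pairwise penalty is averaged over each limb's candidates under its current marginal; all of these choices are assumptions made only for this sketch.

#include <cmath>
#include <cstdio>
#include <vector>

struct Pos { double x, y; };

// Illustrative pairwise kernel psi: a scaled squared distance between parts.
double psi(const Pos& a, const Pos& b)
{
    double dx = a.x - b.x, dy = a.y - b.y;
    return 0.01 * (dx * dx + dy * dy);
}

int main()
{
    // Three candidate positions per variable (assumed for this sketch).
    std::vector<Pos> cand = {{50, 40}, {52, 60}, {80, 90}};
    // Fixed marginals Q_2..Q_6 over the three candidates (rows sum to one).
    double Qlimb[5][3] = {{0.7, 0.2, 0.1}, {0.6, 0.3, 0.1}, {0.5, 0.4, 0.1},
                          {0.8, 0.1, 0.1}, {0.6, 0.2, 0.2}};

    // Mean field update (6.15) for the torso x_1: for each candidate position,
    // accumulate the expected pairwise penalty to every limb, then normalise.
    std::vector<double> Q1(cand.size());
    double Z1 = 0.0;
    for (size_t a = 0; a < cand.size(); ++a) {
        double penalty = 0.0;
        for (int limb = 0; limb < 5; ++limb)
            for (size_t b = 0; b < cand.size(); ++b)
                penalty += Qlimb[limb][b] * psi(cand[a], cand[b]);
        Q1[a] = std::exp(-penalty);
        Z1 += Q1[a];
    }
    for (size_t a = 0; a < cand.size(); ++a)
        std::printf("Q_1(candidate %zu) = %.3f\n", a, Q1[a] / Z1);
    return 0;
}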
Figure 6.2: Our basic six-part skeleton model, for the simple example.
Table 6.1: Comparing the performance of the Robust P^n CRF [56] and mean field inference [64] on the MSRC-21 dataset. In this table, ‘Global’ refers to the percentage of all pixels that are correctly labelled, and ‘Average’ is the average of the per-class accuracy figures.

Method        Runtime (s/f)   Global (%)   Average (%)
Robust P^n    30              84.9         77.5
Mean field    0.2             86.0         78.3
6.1.3 Performance Comparison: Mean Field vs Graph Cuts
Krähenbühl and Koltun evaluated their mean field inference framework on the MSRC-
21 dataset [77]. Compared to the robust CRF method of Kohli et al. [56], the runtime
decreases from 30 seconds per image to 0.2 seconds per image (a 150× speed-up factor);
at the same time, the global accuracy also improves. A brief summary of results is given
in Table 6.1; a more thorough evaluation can be found in [64].
6.2 Model Formulation
As has been the case in the previous two chapters, the goal of our joint optimisation
framework is to estimate human segmentation and pose, and perform stereo reconstruction.
We formulate the problem of joint labelling in a conditional random field (CRF) framework in a product label space. We first define two sets of random variables, X = [X^S, X^D], covering the segmentation and disparity variables, and Y, to represent the part variables. X takes a label from the product label space L = {(L^S × L^D)^N}, and Y takes a label from (L^P)^M. Here, X^S = {X^S_1, ..., X^S_N} and X^D = {X^D_1, ..., X^D_N} are the per-pixel human segmentation and disparity variables. We assume that each of these random variables is associated with a pixel x_i in the image, with i ∈ {1, ..., N}. Further, each X^S_i takes a label from the segmentation label set, L^S ∈ {0, 1}, and X^D_i takes a label from the disparity label set, L^D.

The body parts are represented by the set of latent variables Y = {Y_1, Y_2, ..., Y_M}, corresponding to the M body parts, each taking labels from L^P ∈ {0, ..., K}, where 1, 2, ..., K correspond to the K part proposals generated for each body part, and zero represents the background class. We generate K part proposals using the model of Yang and Ramanan [131].
6.2.1 Joint Energy Function
Given the above model, our joint energy function takes the following form:
E(x^S, x^D, y) = E_S(x^S) + E_D(x^D) + E_P(y) + E_{PS}(x^S, y) + E_{SD}(x^S, x^D) + E_{PD}(x^D, y)  (6.16)
We define each of these terms individually below.
Per-Pixel Terms
We first define E_S and E_D, which take the following forms:
E_S(x^S) = \sum_{i \in V} \theta_S(x_i) + \sum_{i \in V, j \in N_i} \phi_S(x_i, x_j) + \sum_{c \in C} \psi^S_c(x_c),  (6.17)

E_D(x^D) = \sum_{i \in V} \theta_D(x_i) + \sum_{i \in V, j \in N_i} \phi_D(x_i, x_j),  (6.18)
where N_i represents the neighbourhood of the variable i, θ_S(x_i) and θ_D(x_i) represent unary terms corresponding to human segmentation class and depth labels respectively, and φ_S(x_i, x_j) and φ_D(x_i, x_j) are pairwise terms capturing the interaction between a pair of segmentation and depth variables respectively. The human object specific unary cost θ_S(x_i) is computed using a boosted unary classifier on image-specific appearance. The cost θ_D(x_i) is computed based on the sum of absolute differences, as in Section 4.3.3. The higher order term ψ^S_c(x_c) describes a cost defined over cliques containing more than two pixels, as introduced by Ladický et al. [70].
The pairwise term between human object variables, φ_S, takes the form of a Potts model, weighted by edge-preserving Gaussian kernels [64]. The depth consistency term φ_D(x_i, x_j) encourages neighbouring pixels to take the same disparity level, and takes a similar form to that of φ_S.
Per-Part Term
Similar to the energy function g_P(x) defined in (5.22) (Section 5.3.4), the energy term E_P(y) covers the human part variables Y. This energy function involves a per-part unary cost θ_P(y_j = k) for associating the j-th part with the k-th proposal or with the background, and a pairwise term φ_P(y_i, y_j), which penalises the case where parts that should be connected are distant from one another in image space. These terms are the same as in (5.22).
Defining the set of connected parts as E, the pose estimation energy is as follows:
E_P(y) = \sum_{j \in Y} \theta_P(y_j = k) + \sum_{(i,j) \in E} \phi_P(y_i, y_j).  (6.19)
Joint Terms
We now give details of our joint energy terms, E_PS, E_SD, and E_PD.
The joint human segmentation and part proposal term, E_PS, encodes the relation between segmentation and part proposals. Specifically, we expect pixels close to a selected part proposal to belong to the foreground class, and pixels that are far from any body part to belong to the background class. We pay a cost of C_PS for violation of this constraint, incorporated through a pairwise interaction between the segmentation and part proposal variables; this interaction takes the following form:
E_{PS} = \psi^{PS}_p(x^S, y) = \sum_{i=1}^{N} C_{PS} \cdot [(x^S_i = 1) \wedge (\max_{j,k}(\mathrm{dist}(x_i, y_{(j,k)})) \geq \tau)]
  + \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} C_{PS} \cdot [(x^S_i = 0) \wedge (\mathrm{dist}(x_i, y_{(j,k)}) < \tau) \wedge (y_j = k)].  (6.20)
Here, dist(x_i, y_{(j,k)}) gives the distance from the pixel x_i to the k-th proposal for the j-th part; this distance is measured by modelling the part proposal as a line segment between its two endpoints, and finding the Euclidean distance from the point to the line segment.
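A minimal C++ sketch of this point-to-line-segment distance is given below; the endpoint coordinates are invented, and the function name is hypothetical rather than taken from our implementation.

#include <algorithm>
#include <cmath>
#include <cstdio>

struct Pt { double x, y; };

// Euclidean distance from point p to the segment with endpoints a and b,
// obtained by projecting p onto the segment and clamping the projection.
double pointToSegment(const Pt& p, const Pt& a, const Pt& b)
{
    double vx = b.x - a.x, vy = b.y - a.y;
    double len2 = vx * vx + vy * vy;
    double t = 0.0;
    if (len2 > 0.0)
        t = std::max(0.0, std::min(1.0, ((p.x - a.x) * vx + (p.y - a.y) * vy) / len2));
    double cx = a.x + t * vx, cy = a.y + t * vy;   // closest point on the segment
    return std::sqrt((p.x - cx) * (p.x - cx) + (p.y - cy) * (p.y - cy));
}

int main()
{
    Pt pixel{120, 85};
    Pt top{100, 60}, bottom{110, 140};             // endpoints of a part proposal
    std::printf("dist = %.2f pixels\n", pointToSegment(pixel, top, bottom));
    return 0;
}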
The threshold τ is set to half of the average width of each part, as determined by
typical images from the training set where the parts are well-separated. The values range
from around 10 pixels for the lower arms, to around 25 pixels for the torso. However, it
should be noted that the torso width in particular is heavily dependent on the orientation
of the person, and indeed their clothing and body shape. In future research, it would be
interesting to consider finding the orientation of the person, and adjusting the expected
torso width accordingly. Adjusting the width based on the length of the part could also
improve results, although the possibility of foreshortening should be taken into account.
Our joint object-depth cost E_SD encourages pixels with a high disparity to be classed as foreground, and pixels with a low disparity to be classified as background. We penalise the violation of this constraint by a cost C_SD. Using the flood fill method introduced in Chapter 5, we first generate a segmentation map F = {ρ_1, ρ_2, ..., ρ_N} by thresholding the disparity map, so that each ρ_i takes a label from L^S. We would expect the prior map F to agree with the segmentation result, so that pixels labelled as human by the flood fill prior (ρ_i = 1) are classified as human, and vice versa; otherwise we pay a cost C_SD for violation of this constraint:
E_{SD} = \psi^{SD}_p(x^S, x^D) = \sum_{i=1}^{N} C_{SD} \cdot [(x^S_i = 1) \wedge (\rho_i = 0)]
  + \sum_{i=1}^{N} C_{SD} \cdot [(x^S_i = 0) \wedge (\rho_i = 1)]  (6.21)
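The sketch below illustrates the structure of (6.21) on a small example. Here the prior map ρ is obtained by simply thresholding the disparity values, which is only a stand-in for the flood fill procedure of Chapter 5, and the threshold, disparity values, and C_SD are all invented for the example.

#include <cstdio>
#include <vector>

int main()
{
    const int N = 8;
    const double C_SD = 2.0, DISP_THRESHOLD = 20.0;       // illustrative values
    std::vector<double> disparity = {5, 7, 25, 30, 28, 6, 26, 4};
    std::vector<int>    seg       = {0, 0, 1, 1, 0, 0, 1, 1};   // candidate x^S

    // Prior map rho: pixels with a high disparity are assumed to be human.
    // (In the thesis this comes from a flood fill seeded on the torso proposal.)
    std::vector<int> rho(N);
    for (int i = 0; i < N; ++i) rho[i] = disparity[i] > DISP_THRESHOLD ? 1 : 0;

    // E_SD as in (6.21): pay C_SD wherever segmentation and prior disagree.
    double E_SD = 0.0;
    for (int i = 0; i < N; ++i) {
        if (seg[i] == 1 && rho[i] == 0) E_SD += C_SD;
        if (seg[i] == 0 && rho[i] == 1) E_SD += C_SD;
    }
    std::printf("E_SD = %.1f\n", E_SD);
    return 0;
}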
Finally, the joint energy term E_PD encodes the relationship between the part proposals and the disparity variables. Again, we use the flood fill prior F defined above. We expect pixels classed as human by this prior (so ρ_i = 1) to be close to a selected body part, so that for some part j and proposal index k, y_j = k and dist(x_i, y_{(j,k)}) < τ. Conversely, pixels classed as background (ρ_i = 0) should not be close to selected body parts, so we pay a cost if for some j, k, y_j = k and dist(x_i, y_{(j,k)}) < τ. Therefore, the energy term has the following form:
E_{PD} = \psi^{PD}_p(y, x^D) = \sum_{i=1}^{N} C_{PD} \cdot [(\max_{j,k}(\mathrm{dist}(x_i, y_{(j,k)})) \geq \tau) \wedge (\rho_i = 1)]
  + \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{k=1}^{K} C_{PD} \cdot [(y_j = k) \wedge (\mathrm{dist}(x_i, y_{(j,k)}) < \tau) \wedge (\rho_i = 0)]  (6.22)
The weights C_PS, C_SD, and C_PD capturing the relationships between the different sets of variables are set through cross-validation.
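As an indication of how weights of this kind can be chosen, the sketch below performs a simple grid search over candidate values of C_PS, C_SD, and C_PD, keeping the combination that scores best on a validation measure. The candidate grid and the validationScore stub are placeholders for illustration only, not the procedure actually used in our experiments.

#include <cstdio>
#include <vector>

// Placeholder: in practice this would run inference on a held-out validation
// set with the given weights and return, e.g., the segmentation overlap score.
double validationScore(double cps, double csd, double cpd)
{
    return -(cps - 2.0) * (cps - 2.0) - (csd - 1.0) * (csd - 1.0)
           - (cpd - 1.5) * (cpd - 1.5);            // dummy score, peaks at (2, 1, 1.5)
}

int main()
{
    std::vector<double> grid = {0.5, 1.0, 1.5, 2.0, 2.5};   // illustrative candidates
    double best = -1e9, bCps = 0, bCsd = 0, bCpd = 0;
    for (double cps : grid)
        for (double csd : grid)
            for (double cpd : grid) {
                double s = validationScore(cps, csd, cpd);
                if (s > best) { best = s; bCps = cps; bCsd = csd; bCpd = cpd; }
            }
    std::printf("Chosen weights: C_PS=%.1f C_SD=%.1f C_PD=%.1f\n", bCps, bCsd, bCpd);
    return 0;
}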
6.3 Inference in the Joint Model
We now provide details of the mean-field update for segmentation variables X^S, depth variables X^D, and part variables Y^P.
Given the energy function detailed in Section 6.2.1, the marginal update for human
segmentation variable X^S_i takes the following form:

Q^S_i(x^S_{[i,l]}) = \frac{1}{Z^S_i} \exp\Big\{ -\theta_S(x_i) - \sum_{l' \in L^S} \sum_{j \neq i} Q^S_j(x_{[j,l']})\, \phi_S(x_i, x_j)
  - \sum_{l' \in L^D} Q^D_i(x^D_{[i,l']})\, \psi(x^S_i, x^D_i) - \sum_{j=1}^{M} \sum_{l' \in L^P} Q^P_j(y^P_{[j,l']})\, \psi(x^S_i, y^P_j) \Big\}  (6.23)
Similarly, the marginal update for the per-pixel depth variables X^D_i takes the following form:

Q^D_i(x^D_{[i,l]}) = \frac{1}{Z^D_i} \exp\Big\{ -\theta_D(x_i) - \sum_{l' \in L^D} \sum_{j \neq i} Q^D_j(x_{[j,l']})\, \phi_D(x_i, x_j)
  - \sum_{l' \in L^S} Q^S_i(x^S_{[i,l']})\, \psi(x^D_i, x^S_i) - \sum_{j=1}^{M} \sum_{l' \in L^P} Q^P_j(y^P_{[j,l']})\, \psi(y^P_j, x^D_i) \Big\}.  (6.24)
Finally, the marginal update for the part variables Y^P_j is as follows:

Q^P_j(y^P_{[j,l]}) = \frac{1}{Z^P_j} \exp\Big\{ -\theta_P(y_j) - \sum_{l' \in L^P} \sum_{j' \neq j} Q^P_{j'}(y_{[j',l']})\, \phi_P(y_j, y_{j'})
  - \sum_{i=1}^{N} \sum_{l' \in L^S} Q^S_i(x_{[i,l']})\, \phi(x^S_i, y^P_j) - \sum_{i=1}^{N} \sum_{l' \in L^D} Q^D_i(x_{[i,l']})\, \phi(y^P_j, x^D_i) \Big\}.  (6.25)
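To make the structure of (6.23) more concrete, the sketch below updates the segmentation marginal for a single pixel given fixed depth and part marginals; the depth and part updates (6.24) and (6.25) follow the same pattern. All label-set sizes, potentials, and cross-term functions here are invented placeholders rather than the terms used in our experiments, and the pairwise sum over the other pixels is collapsed into a single precomputed message.

#include <cmath>
#include <cstdio>

int main()
{
    const int LS = 2, LD = 3, M = 2;   // segmentation labels, disparity levels, parts
    // Illustrative unary costs for this pixel and current marginals.
    double thetaS[LS]      = {0.4, 0.9};
    double pairwiseMsg[LS] = {0.3, 0.7};       // stands in for sum_j Q^S_j * phi_S
    double QD[LD]          = {0.1, 0.3, 0.6};  // current depth marginal at this pixel
    double QP[M]           = {0.8, 0.2};       // probability each part is selected nearby
    // Placeholder cross potentials psi(x^S = l, depth level d) and psi(x^S = l, part j).
    auto psiSD = [](int l, int d) { return (l == 1) == (d >= 1) ? 0.0 : 1.0; };
    auto psiSP = [](int l, int j) { (void)j; return l == 1 ? 0.0 : 0.5; };

    double Q[LS], Z = 0.0;
    for (int l = 0; l < LS; ++l) {
        double cross = 0.0;
        for (int d = 0; d < LD; ++d) cross += QD[d] * psiSD(l, d);   // depth term
        for (int j = 0; j < M; ++j)  cross += QP[j] * psiSP(l, j);   // part term
        Q[l] = std::exp(-thetaS[l] - pairwiseMsg[l] - cross);
        Z += Q[l];
    }
    std::printf("Q^S(background) = %.3f, Q^S(human) = %.3f\n", Q[0] / Z, Q[1] / Z);
    return 0;
}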
6.4 Experiments
In this section, we demonstrate the efficiency and accuracy provided by our approach on
the challenging H2View dataset, introduced in Section 4.2. In all experiments, timings are
based on single-threaded code run on an Intel® Xeon® 3.33 GHz processor, and we fix the
number of full mean-field update iterations to 5 for all models. As a baseline, we compare
our approach for the human segmentation problem against the graph-cuts based AHRF
method of Ladický et al. [70], the dual-decomposition based model of Chapter 5, and the
mean-field model of Krähenbühl and Koltun [64]. We assess the overall percentage of pixels
correctly labelled, and the average recall and intersection/union scores per class. For pose
estimation, we compare our results with those of Chapter 5, Yang and Ramanan [131], and Andriluka et al. [1], using the probability of correct pose (PCP) criterion.
Table 6.2: Quantitative results for human segmentation on the H2View dataset. The table
compares timing and accuracy of our approach (last line) against three baselines. Note
the significant improvement in inference time, recall, F-score and overlap performance of
our approach against the baselines.
Method Time (s) Precision Recall F-Score Overlap
Unary 0.36 79.84% 73.55% 76.57% 62.21%
ALE [70] 1.5 83.19% 73.58% 78.09% 64.29%
MF [64] 0.48 84.22% 73.50% 78.50% 64.61%
Chapter 5 25 79.59% 83.23% 81.37% 69.23%
Ours 1.07 79.89% 87.05% 83.32% 71.17%
6.4.1 Segmentation Performance
We first evaluate the performance of our method in the human segmentation problem. A
quantitative comparison can be found in Table 6.2. Our method shows a clear improve-
ment in recall, compared to the other approaches, with an increase of 3.82% compared
to Chapter 5, and around a 13.5% increase compared to the other methods. A similar
improvement is shown in the overlap (intersection over union) score, with our method
showing an improvement of almost 2%. Significantly, we observe an order of magnitude
speed up (close to 25×) over the model of Chapter 5, and a slight speed up over ALE.
Sample results comparing the performance of our mean field formulation with that of previous chapters are given in Figure 6.3, while further results are shown in Figure 6.4.
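For reference, the sketch below computes the measures reported in Table 6.2 (precision, recall, F-score, and overlap) from a predicted and a ground-truth binary mask; the two tiny masks are invented for illustration.

#include <cstdio>
#include <vector>

int main()
{
    // Illustrative predicted and ground-truth foreground masks (1 = human).
    std::vector<int> pred = {0, 1, 1, 1, 0, 0, 1, 0};
    std::vector<int> gt   = {0, 1, 1, 0, 0, 1, 1, 0};

    int tp = 0, fp = 0, fn = 0;
    for (size_t i = 0; i < pred.size(); ++i) {
        if (pred[i] == 1 && gt[i] == 1) ++tp;
        if (pred[i] == 1 && gt[i] == 0) ++fp;
        if (pred[i] == 0 && gt[i] == 1) ++fn;
    }
    double precision = double(tp) / (tp + fp);
    double recall    = double(tp) / (tp + fn);
    double fscore    = 2.0 * precision * recall / (precision + recall);
    double overlap   = double(tp) / (tp + fp + fn);   // intersection over union
    std::printf("precision %.3f, recall %.3f, F-score %.3f, overlap %.3f\n",
                precision, recall, fscore, overlap);
    return 0;
}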
6.4.2 Pose Estimation Performance
Further, we observe an improvement of about 3.3% over Yang and Ramanan, and 7% over Andriluka et al. in the PCP scores for the pose estimation problem. Although we do slightly worse than Chapter 5 in the PCP score, we observe a speed-up of 20×, as well as a speed-up of 8× over the model of Yang and Ramanan, and of almost 30× over the model of Andriluka et al., as shown in Table 6.3. However, in some cases a qualitative improvement can be observed, as shown in Figure 6.5.
Figure 6.3: Segmentation results: (a) original image; (b) result of Chapter 4; (c) result of Chapter 5; (d) our segmentation result; (e) ground truth segmentation. Outstretched arms are difficult for segmentation methods to capture, but the dense CRF approach used in this chapter successfully retrieves the arms.
Figure 6.4: Results showing consistency of segmentation performance: (a) original image; (b) segmentation result; (c) ground truth. Shown here are five successive frames from a difficult part of the test set. Qualitatively, the segmentation results are of a consistently high standard.
Table 6.3: The table compares timing and accuracy of our approach (last line) against the baselines for the pose estimation problem on the H2View dataset. Observe that our approach achieves a 20× speed-up, and performs better than the baseline in estimating the limbs. U/LL represents the average of upper and lower legs, and U/FA represents the average of upper arms and forearms.
Method T(s) U/LL U/FA TO Head Overall
Andriluka [1] 35 80.0 47.7 80.5 69.2 66.03
Yang [131] 10 85.8 49.0 87.4 72.0 69.85
Chapter 5 25 85.4 59.7 96.3 92.4 76.86
Ours 1.2 82.9 55.2 89.1 86.2 73.12
6.5 Discussion
In this chapter, we proposed an efficient mean-field based method for joint estimation of
human segmentation, pose, and disparity. Our new inference approach yields excellent
results, with a substantial improvement in inference speed with respect to the current
state-of-the-art methods, while also achieving a good improvement in both human segmentation and pose estimation accuracy.
Mean-field inference produces faster performance than the dual decomposition-based
method due to the simplifications made in the pairwise part of the equations: for in-
stance, the segmentation update equation (6.23) can be evaluated separately for each
pixel, whereas a graph cut-based approach with a complexity depending on the number
of pairwise edges was used for the corresponding step in the dual decomposition-based
method.
The proposed method runs at 1.07 seconds per frame, meaning that a speed up factor
of 15 is still needed for real-time applications such as computer games. Directions for fu-
ture research include investigating ways to further improve the efficiency of the approach.
For instance, some parts of the algorithm are parallelisable. It would also be extremely
interesting to adapt the algorithm to a GPU framework. While the results given are state
Figure 6.5: Qualitative results on the H2View dataset: (a) pose estimation from Chapter 5; (b) our pose estimation result. Our method is able to correct some mistakes from the previous chapter (left side), but some of the same mistakes still remain (right side).
of the art, they could perhaps be further improved by adopting a hierarchical approach.
This approach could capture image-level attributes, such as the person’s orientation, or
a high-level understanding of the person’s pose, e.g. standing, sitting, gesticulating.
Acknowledgements
The paper that this chapter is based on was jointly authored by myself and Vibhav
Vineet, a PhD student in Brookes Vision Group. I extended the formulation to include
higher-order terms, and adapted it to mean-field inference, also putting the terms into
the C++ code used to run the experiments. Vibhav provided contributions to the paper
which are not included in this chapter.
Chapter 7
Conclusions and Future Work
This thesis has explored the applicability of human segmentation and pose estimation to video games. However, owing to the high levels of articulation that the human body can
exhibit, together with the variability in shape, size, colour, and appearance, it is necessary
to use a complex approach to find accurate pose estimates, and the result unfortunately
falls short of the real-time aim. In Section 7.2, however, we identify a few methods which
could help further speed up our system.
7.1 Summary of Contributions
The major contributions of this thesis are as follows:
• In Chapter 4, we presented a unified framework for human segmentation, pose
estimation, and depth estimation. These three problems can be solved separately,
but the results can be improved by sharing information between the three solutions.
To do this, we defined a five-term energy function, with three terms to give the
costs of the solution for each individual problem, and two ‘joining terms’, which
facilitated information sharing. For example, we encoded the notion that if a human
body part is located in a particular region of the image, then that region should
be segmented as foreground. In order to minimise the energy function, we applied
dual decomposition [62].
• In order to evaluate our dual decomposition framework, we created a novel dataset,
which was presented in Section 4.2. This dataset contains almost 9,000 stereo pairs
of images, which each feature a human standing, walking, crouching or gesticulating.
To encourage future research on solving multi-view human understanding problems,
the dataset has been released at http://cms.brookes.ac.uk/research/visiongroup/h2view/.
• In Chapter 5, we introduced a stereo-based prior for human segmentation. Using
the endpoints of the top-ranked torso estimate as starting points, we run a flood fill
algorithm on the disparity map, obtaining a flood fill prior, ρ. We then incorporated
this prior into our framework from the previous chapter, yielding an improvement
in segmentation performance, and reducing the runtime by a factor of 6.
• The main drawback of the dual decomposition-based approach is the speed: the
algorithms presented took around 20-30 seconds per frame to converge. Our final
contribution, in Chapter 6, was to reformulate the problem into a highly efficient
inference framework based on mean field [64]. The use of mean field inference allows
us to use dense pairwise interactions, which provides a further improvement in
segmentation performance. The resulting algorithm only requires around 1 second
per frame.
7.2 Directions for Future Research
The final version of our unified human segmentation, depth, and pose estimation frame-
work, presented in Chapter 6, requires around 1 second per frame on a desktop computer,
running in a single thread. It would be tremendously interesting to extend this framework
to run on multiple threads, which would bring the “holy grail” of real-time human under-
standing within sight. With this goal in mind, it would also be worthwhile to consider
adapting the framework to run on architectures such as the GPU, or the internal SPU
processor on the PlayStation 3. With the latter solution, care would have to be taken to
ensure that the memory limitations of such processors are taken into account.
Besides such solutions, there are additional methods that could also speed up the
performance; for instance, using integral images (via which the colour values of rectan-
gular regions in an image can be summed quickly and efficiently), removing low-weighted
components of the energy function, and using a faster algorithm to generate the pose es-
timates. Indeed, it would be very useful to remove the constraint that our pose estimates
depend on those of another algorithm, and instead to generate new pose candidates at
each iteration.
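As an illustration of the integral-image idea mentioned above, the sketch below builds a summed-area table for a small single-channel image and then sums an arbitrary rectangle in constant time; the image size and values are illustrative.

#include <cstdio>
#include <vector>

int main()
{
    const int W = 4, H = 3;
    double img[H][W] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}};

    // Summed-area table with a one-pixel zero border: S[r][c] holds the sum of
    // all pixels above and to the left of (r, c), exclusive.
    std::vector<std::vector<double>> S(H + 1, std::vector<double>(W + 1, 0.0));
    for (int r = 0; r < H; ++r)
        for (int c = 0; c < W; ++c)
            S[r + 1][c + 1] = img[r][c] + S[r][c + 1] + S[r + 1][c] - S[r][c];

    // Sum of the rectangle with rows [r0, r1) and columns [c0, c1), in O(1).
    auto rectSum = [&](int r0, int c0, int r1, int c1) {
        return S[r1][c1] - S[r0][c1] - S[r1][c0] + S[r0][c0];
    };
    std::printf("sum of rows 1-2, cols 1-3 = %.0f\n", rectSum(1, 1, 3, 4));
    return 0;
}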
We conclude this thesis with the following observations: while the controller is likely
to remain the standard input method for many regular players of computer games, the
market for casual gamers is growing. There is clear demand in this demographic for
affordable systems where games can be controlled using the human body instead of a
physical controller, as shown by the sales of the Kinect [98]; we note that if such a system
relied on a stereo pair of cameras rather than a depth sensor, the cost to the consumer
would be lower.
Finally, it is worth noting that a human understanding framework based on a stereo
pair of images would have applications beyond computer games. In particular, a major
advantage that stereo camera systems have over depth sensors is their ability to work
outdoors during the daytime, allowing them to be used in, for example, pedestrian de-
tection systems; however, it should be noted that infra-red based methods would work
better at night than RGB stereo camera-based methods. With vision research, including
this thesis, bringing the goal of real-time human understanding closer to reality, the next
few years promise exciting developments.
Bibliography
[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People de-
tection and articulated pose estimation. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 1014–1021, 2009.
[2] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape:
shape completion and animation of people. ACM Transactions on Graphics (TOG),
24(3):408–416, 2005.
[3] B. Ashcraft. Eye of Judgment. http://kotaku.com/343541/more-of-eye-of-
judgment-coming-to-japan?tag=the-eye-of-judgment, 2008. This photo, re-
trieved on 22nd August 2012, contains an image from a copyrighted game (Eye of
Judgment, Sony, 2007), whose appearance here for research purposes qualifies as
fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[4] Press Association. EyePet bounce. http://www.telegraph.co.uk/technology/
6410150/The-Sony-EyePet-an-electronic-pet-for-people-too-lazy-for-
the-real-thing.html, 2009. This photo, retrieved on 22nd August 2012, is a
screenshot from a copyrighted game (EyePet, SCEE, 2009) provided by the Press
Association, and appears here for illustrative purposes, qualifying as fair dealing
under Section 29 of the Copyright, Designs and Patents Act 1988.
[5] A.O. Balan, L. Sigal, M.J. Black, J.E. Davis, and H.W. Haussecker. Detailed human
shape and pose from images. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1–8, 2007.
[6] D.P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
[7] J. Best. Kinect Dance demonstration. http://www.techrepublic.com/
blog/cio-insights/why-microsofts-kinect-gaming-tech-matters-to-
business/39746574, 2010. This photo, retrieved on 22nd August 2012, appears
here for illustrative purposes, qualifying as fair dealing under Section 29 of the
Copyright, Designs and Patents Act 1988.
[8] C.M. Bishop. Pattern recognition and machine learning, volume 4. Springer New
York, 2006.
[9] A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segment-
ation using an adaptive GMMRF model. In European Conference of Computer
Vision (ECCV), pages 428–441. Springer, 2004.
[10] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human
pose annotations. In IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 1365–1372, 2009.
[11] S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods. Lecture notes of EE392o,
Stanford University, Autumn Quarter, 2003.
[12] S. Boyd, L. Xiao, A. Mutapcic, and J. Mattingley. Notes on decomposition methods.
Notes for EE364B, Stanford University, 2007.
[13] S.P. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press,
2004.
[14] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-
flow algorithms for energy minimization in vision. IEEE Transactions on Pattern
Analysis and Machine Intelligence (PAMI), 26(9):1124–1137, 2004.
[15] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via
graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence
(PAMI), 23(11):1222–1239, 2001.
[16] M. Bray, P. Kohli, and P. Torr. Posecut: Simultaneous segmentation and 3D
pose estimation of humans using dynamic graph-cuts. In European Conference of
Computer Vision (ECCV), pages 642–655, 2006.
[17] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[18] J. Calvert. Kinect Sports review. http://www.gamespot.com/kinect-sports/
reviews/kinect-sports-review-6283473/, 2010. Retrieved 22nd August 2012.
[19] J. Canny. A computational approach to edge detection. IEEE Transactions on
Pattern Analysis and Machine Intelligence (PAMI), 6:679–698, 1986.
[20] A. Criminisi, A. Blake, C. Rother, J. Shotton, and P.H.S. Torr. Efficient dense
stereo with occlusions for new view-synthesis by four-state dynamic programming.
International Journal of Computer Vision (IJCV), 71(1):89–110, 2007.
[21] N. Dalal. Inria object detection and localization toolkit, 2008. Software available
at http://pascal.inrialpes.fr/soft/olt.
[22] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
886–893, 2005.
[23] G.B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations
research, pages 101–111, 1960.
[24] R. Davis. EyeToy: Antigrav review. http://uk.gamespot.com/eyetoy-
antigrav/reviews/eyetoy-antigrav-review-6112714/, 2004. Retrieved 26th
July 2012.
[25] D. DeMenthon and L.S. Davis. Exact and approximate solutions of the perspective-
three-point problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 14(11):1100–1105, 1992.
[26] R. Deriche. Using Canny’s criteria to derive a recursively implemented optimal edge
detector. International Journal of Computer Vision (IJCV), 1(2):167–187, 1987.
[27] P. Dollar, Z. Tu, and S. Belongie. Supervised learning of edges and object bound-
aries. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
volume 2, pages 1964–1971, 2006.
[28] R.O. Duda and P.E. Hart. Use of the hough transformation to detect lines and
curves in pictures. Communications of the ACM, 15(1):11–15, 1972.
[29] J. Dunlap. Queue-linear flood fill: A fast flood fill algorithm. http:
//www.codeproject.com/Articles/16405/Queue-Linear-Flood-Fill-A-
Fast-Flood-Fill-Algorith. Retrieved 1st April 2013.
[30] P. Elias, A. Feinstein, and C. Shannon. A note on the maximum flow through a
network. IEEE Transactions on Information Theory, 2(4):117–119, 1956.
[31] M. Enzweiler, A. Eigenstetter, B. Schiele, and D.M. Gavrila. Multi-cue pedestrian
classification with partial occlusion handling. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 990–997, 2010.
[32] PlayStation Europe. PSEye. http://www.flickr.com/photos/
playstationblogeurope/4989764343/, 2007. This photo, retrieved on 22nd Au-
gust 2012, is used under a Creative Commons Attribution-NonCommercial 2.0 Gen-
eric licence: http://creativecommons.org/licenses/by-nc/2.0/deed.en_GB.
[33] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman.
The PASCAL Visual Object Classes (VOC) challenge. International Journal of
Computer Vision, 88(2):303–338, June 2010.
150
Bibliography
[34] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection
with discriminatively trained part based models. IEEE Transactions on Pattern
Analysis and Machine Intelligence (PAMI), 32(9):1627–1645, 2010.
[35] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition.
International Journal of Computer Vision (IJCV), 61(1), 2005.
[36] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained,
multiscale, deformable part model. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 1–8, 2008.
[37] V. Ferrari, M. Marín-Jiménez, and A. Zisserman. Progressive search space reduction
for human pose estimation. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 1–8, 2008.
[38] V. Ferrari, M. Marín-Jiménez, and A. Zisserman. Pose search: retrieving people
using their pose. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2009.
[39] A. Ferworn, J. Tran, A. Ufkes, and A. D’Souza. Initial experiments on 3D mod-
eling of complex disaster environments using unmanned aerial vehicles. In IEEE
International Symposium on Safety, Security, and Rescue Robotics (SSRR), pages
167–171. IEEE, 2011.
[40] M.A. Fischler and R.C. Bolles. Random sample consensus: a paradigm for model
fitting with applications to image analysis and automated cartography. Commu-
nications of the ACM, 24(6):381–395, 1981.
[41] L.R. Ford and D.R. Fulkerson. Maximal flow through a network. Canadian Journal
of Mathematics, 8(3):399–404, 1956.
[42] J. Fraser. Kung Foo. http://www.thunderboltgames.com/reviews/article/
eye-toy-play-review-for-ps2.html, 2003. This photo, retrieved on 22nd Au-
gust 2012, is a screenshot from a copyrighted game (EyeToy: Play, SCEE, 2003),
and appears here for illustrative purposes, qualifying as fair dealing under Section
29 of the Copyright, Designs and Patents Act 1988.
[43] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical
view of boosting. Annals of Statistics, 28(2):337–407, 2000.
[44] K. Golder. Why isn’t Kinect “kinecting” with hardcore gamers? http:
//www.hardcoregamer.com/2012/04/08/gears-of-war-exile-in-permanent-
diskinect/, 2012. Retrieved 17th September 2012.
[45] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In
Recent Advances in Learning and Control, Lecture Notes in Control and Information
Sciences, pages 95–110. Springer-Verlag Limited, 2008.
[46] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming,
version 1.21. http://cvxr.com/cvx/, April 2011.
[47] V. Gulshan, V. Lempitsky, and A. Zisserman. Humanising grabcut: Learning to
segment humans using the kinect. In International Conference on Computer Vision
(ICCV) Workshops, pages 1127–1133, 2011.
[48] G. Gupta. Ghost Catcher. http://archive.techtree.com/techtree/jsp/
article.jsp?print=1&article_id=51303&cat_id=544, 2004. This photo, re-
trieved on 22nd August 2012, is a screenshot from a copyrighted game (EyeToy:
Play, SCEE, 2003), and appears here for illustrative purposes, qualifying as fair
dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[49] S. Hay, J. Newman, and R. Harle. Optical tracking using commodity hardware. In
IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR),
pages 159–160, 2008.
[50] K. He, J. Sun, and X. Tang. Guided image filtering. In European Conference of
Computer Vision (ECCV), pages 1–14. Springer, 2010.
[51] H. Hirschmüller, P.R. Innocent, and J. Garibaldi. Real-time correlation-based ste-
reo vision with reduced border errors. International Journal of Computer Vision
(IJCV), 47(1):229–246, 2002.
[52] P.V.C. Hough. Machine analysis of bubble chamber pictures. In International
Conference on High Energy Accelerators and Instrumentation, volume 73. CERN,
1959.
[53] G. Howitt. Wonderbook - hands-on preview. http://www.guardian.co.uk/
technology/gamesblog/2012/aug/16/wonderbook-hands-on-preview-ps3,
2012. Retrieved 22nd August 2012.
[54] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton,
S. Hodges, D. Freeman, A. Davison, et al. Kinectfusion: real-time 3D reconstruction
and interaction using a moving depth camera. In Proceedings of the 24th annual
ACM symposium on User interface software and technology, pages 559–568. ACM,
2011.
[55] M. Jackson. BBC’s Walking with Dinosaurs coming to PS3 Wonder-
book. http://www.computerandvideogames.com/363100/bbcs-walking-with-
dinosaurs-coming-to-ps3-wonderbook/, 2012. Retrieved 22nd August 2012.
[56] P. Kohli, L. Ladický, and P.H.S. Torr. Robust higher order potentials for enforcing
label consistency. International Journal of Computer Vision (IJCV), 82(3):302–324,
2009.
[57] D. Koller and N. Friedman. Probabilistic graphical models: principles and tech-
niques. MIT press, 2009.
[58] V. Kolmogorov. maxflow-v3.01 C++ code library for solving the max-flow/min-
cut problem. http://vision.csd.uwo.ca/code/maxflow-v3.01.zip, 2010. Re-
trieved 9th August 2012.
[59] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Probabilistic fusion of stereo with color and contrast for bilayer segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 28(9):1480–1492, 2006.
[60] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Bi-layer seg-
mentation of binocular stereo video. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), volume 2, pages 407–414, 2005.
[61] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph
cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI),
26(2):147–159, 2004.
[62] N. Komodakis, N. Paragios, and G. Tziritas. MRF optimization via dual decompos-
ition: Message-passing revisited. In IEEE International Conference on Computer
Vision (ICCV), pages 1–8, 2007.
[63] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond
via dual decomposition. IEEE Transactions on Pattern Analysis and Machine
Intelligence (PAMI), 33(3):531–552, 2011.
[64] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with
Gaussian edge potentials. In NIPS, pages 109–117, 2011.
[65] M.P. Kumar, P.H.S. Torr, and A. Zisserman. Obj cut. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 18–25, 2005.
[66] M.P. Kumar, P.H.S. Torr, and A. Zisserman. Objcut: Efficient segmentation us-
ing top-down and bottom-up cues. IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI), 32(3):530–545, 2010.
[67] M.P. Kumar, O. Veksler, and P.H.S. Torr. Improved moves for truncated convex
models. The Journal of Machine Learning Research, 12:31–67, 2011.
[68] M.P. Kumar, A. Zisserman, and P.H.S. Torr. Efficient discriminative learning of
parts-based models. In IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 552–559, 2009.
[69] L. Ladický. Global structured models towards scene understanding. PhD thesis,
Oxford Brookes University, 2011.
[70] L. Ladický, C. Russell, P. Kohli, and P.H.S. Torr. Associative hierarchical CRFs for
object class image segmentation. In International Conference on Computer Vision
(ICCV), pages 739–746, 2009.
[71] L. Ladický and P.H.S. Torr. The automatic labelling environment. http://cms.
brookes.ac.uk/staff/PhilipTorr/ale.htm. Retrieved 1st November 2012.
[72] J.D. Lafferty, A. McCallum, and F.C.N. Pereira. Conditional random fields: Prob-
abilistic models for segmenting and labeling sequence data. In Proceedings of the
Eighteenth International Conference on Machine Learning, pages 282–289. Morgan
Kaufmann Publishers Inc., 2001.
[73] D. Larlus and F. Jurie. Combining appearance models and markov random fields
for category level object segmentation. In Computer Vision and Pattern Recognition
(CVPR), pages 1–7, 2008.
[74] Nintendo Power magazine. Wii Sports: WiiRemote example. http://upload.
wikimedia.org/wikipedia/en/f/f6/WS-WiiRemote_Example.jpg, 2008. This
photo, retrieved on 22nd August 2012, is an annotated screenshot from a copy-
righted game (Wii Sports, Nintendo, 2006), whose appearance here for research
purposes qualifies as fair dealing under Section 29 of the Copyright, Designs and
Patents Act 1988.
[75] D. Marr and H.K. Nishihara. Representation and recognition of the spatial organ-
ization of three-dimensional shapes. Proceedings of the Royal Society of London.
Series B. Biological Sciences, 200(1140):269–294, 1978.
[76] Y. Matsumoto and A. Zelinsky. An algorithm for real-time stereo vision implement-
ation of head pose and gaze direction measurement. In Fourth IEEE International
Conference on Automatic Face and Gesture Recognition, pages 499–504, 2000.
[77] Microsoft. Image understanding. http://research.microsoft.com/en-us/
projects/objectclassrecognition/, 2009. Retrieved 22nd November 2012.
[78] Microsoft. Kinect Sports. http://g-ecx.images-amazon.com/images/G/01/
videogames/detail-page/kinectsports.04.lg.jpg, 2010. This photo, retrieved
on 22nd August 2012, is promotional material for a copyrighted game (Kinect
Sports, Microsoft, 2010), and appears here for illustrative purposes, qualifying as
fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[79] Microsoft. Microsoft Kinect instruction manual. http://download.
microsoft.com/download/f/6/6/f6636beb-a352-48ee-86a3-abd9c0d4492a/
kinectmanual.pdf, 2010. Retrieved 22nd August 2012. This photo appears for
illustrative purposes, qualifying as fair dealing under Section 29 of the Copyright,
Designs and Patents Act 1988.
[80] Microsoft. Xbox Kinect. full body game controller. http://www.xbox.com/kinect,
2010. Retrieved 16th June 2012.
[81] J. Mikesell. Nintendo Wii sensor bar. http://upload.wikimedia.org/wikipedia/
commons/2/20/Nintendo_Wii_Sensor_Bar.jpg, 2007. This photo, retrieved on
22nd August 2012, is used under a Creative Commons Attribution-Share Alike 3.0
Unported license: http://creativecommons.org/licenses/by-sa/3.0/deed.
en.
[82] G. Miller. Wonderbook and PlayStation Move. http://uk.ign.com/articles/
2012/06/06/e3-2012-can-wonderbook-be-successful, 2012. This photo, re-
trieved on 22nd August 2012, appears here for illustrative purposes, qualifying
as fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[83] Open NI. Kinect SDK. http://openni.org/Downloads/OpenSources.aspx, 2010.
Retrieved 22nd August 2012.
[84] Nintendo. Duck Hunt screenshot. http://pressthebuttons.typepad.com/
photos/uncategorized/duckhunt.png, 2006. Retrieved 14th September 2012.
The image is a screenshot from a copyrighted game (Nintendo, 1984), whose ap-
pearance here for research purposes qualifies as fair dealing under Section 29 of the
Copyright, Designs and Patents Act 1988.
[85] W. Niu, J. Long, D. Han, and Y.F. Wang. Human activity detection and recognition
for video surveillance. In IEEE International Conference on Multimedia and Expo
(ICME), volume 1, pages 719–722, 2004.
[86] F. O’Gorman and M.B. Clowes. Finding picture edges through collinearity of feature
points. IEEE Transactions on Computers, 100(4):449–456, 1976.
[87] I. Oikonomidis, N. Kyriazis, and A. Argyros. Efficient model-based 3D tracking of
hand articulations using kinect. In British Machine Vision Conference (BMVC),
pages 101–111, 2011.
[88] M.T. Orchard and C.A. Bouman. Color quantization of images. IEEE Transactions
on Signal Processing, 39(12):2677–2690, 1991.
[89] B. Packer, S. Gould, and D. Koller. A unified contour-pixel model for figure-ground
segmentation. In European Conference of Computer Vision (ECCV), pages 338–
351, 2010.
[90] J. Pearl. Reverend Bayes on inference engines: a distributed hierarchical approach.
Cognitive Systems Laboratory, School of Engineering and Applied Science, Univer-
sity of California, Los Angeles, 1982.
[91] D. Pelfrey. EyeToy: Antigrav screenshot. http://www.dignews.com/platforms/
ps2/ps2-reviews/eye-toy-antigrav-review/, 2005. This photo, retrieved on
18th September 2012, is a screenshot from a copyrighted game (EyeToy: Antigrav,
Sony/Harmonix, 2004), whose appearance here for research purposes qualifies as
fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[92] S. Pellegrini and L. Iocchi. Human posture tracking and classification through
stereo vision and 3D model matching. Journal on Image and Video Processing,
2008:1–12, 2008.
[93] L. Plunkett. Report: Here are Kinect’s technical specs. http://kotaku.com/
5576002/here-are-kinects-technical-specs, 2010. Retrieved 22nd August
2012.
[94] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical recipes in C.
Cambridge University Press, 1988.
[95] D. Ramanan. Learning to parse images of articulated bodies. Advances in neural
information processing systems (NIPS), 19:1129–1136, 2007.
[96] C. Rhemann. Fast cost-volume filtering code for stereo matching, 2011. Software
available at http://www.ims.tuwien.ac.at/research/costFilter/.
[97] C. Rhemann, A. Hosni, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume
filtering for visual correspondence and beyond. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 3017–3024, 2011.
[98] B. Rigby. Microsoft Kinect: CEO releases stunning sales figures during CES key-
note. http://www.huffingtonpost.com/2012/01/09/microsoft-kinect-ces-
keynote_n_1195735.html, 2012. Retrieved 17th September 2012.
[99] C. Rodgers. Kinect Star Wars demonstration. http://www.guardian.co.uk/
technology/2011/jun/08/kinect-star-wars-game-preview, 2011. This photo,
retrieved on 22nd August 2012, was released by the Associated Press, and appears
here for illustrative purposes, qualifying as fair dealing under Section 29 of the
Copyright, Designs and Patents Act 1988.
[100] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extrac-
tion using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–
314, 2004.
[101] C. Rother, V. Kolmogorov, Y. Boykov, and A. Blake. Interactive foreground extrac-
tion using graph cut. In Markov Random Fields for Vision and Image Processing,
pages 111–126. MIT Press, 2011.
[102] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame ste-
reo correspondence algorithms. International Journal of Computer Vision (IJCV),
47(1):7–42, 2002.
[103] S. Schiesel. Getting everybody back in the game. http://www.nytimes.com/2006/
11/24/arts/24wii.html?_r=1, 2006. Retrieved 17th September 2012.
[104] A. Schrijver. Combinatorial optimization: polyhedra and efficiency. Springer, 2003.
[105] L. Shapiro and G.C. Stockman. Computer Vision. Prentice Hall, 2001.
[106] Shcha. R theta line. http://en.wikipedia.org/wiki/File:R_theta_line.GIF,
2011. This photo, retrieved on 13th May 2010, was released into the public domain
by Wikipedia user Shcha, who created it.
[107] G. Sheasby, J. Valentin, N. Crook, and P.H.S. Torr. A robust stereo prior for human
segmentation. Asian Conference on Computer Vision (ACCV), 2012.
[108] G. Sheasby, J. Warrell, Y. Zhang, N. Crook, and P.H.S. Torr. Simultaneous human
segmentation, depth and pose estimation via dual decomposition. British Machine
Vision Conference, Student Workshop (BMVW), 2012.
[109] N.Z. Shor, K.C. Kiwiel, and A. Ruszczynski. Minimization methods for non-
differentiable functions. Springer-Verlag Berlin, 1985.
[110] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman,
and A. Blake. Real-time human pose recognition in parts from single depth images.
Communications of the ACM, 56(1):116–124, 2013.
[111] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image un-
derstanding: Multi-class object recognition and segmentation by jointly modeling
texture, layout, and context. International Journal of Computer Vision (IJCV),
81(1):2–23, 2009.
[112] L. Sigal and M.J. Black. Humaneva: Synchronized video and motion capture data-
set for evaluation of articulated human motion. Brown University TR, 120, 2006.
[113] B. Sinclair. Sony reveals what makes PlayStation Move tick. http://uk.gamespot.
com/news/sony-reveals-what-makes-playstation-move-tick-6253435, 2010.
Retrieved 29th November 2012.
[114] Sony. Eye Toy. http://ecx.images-amazon.com/images/I/31PQsJIk2VL.jpg,
2003. This photo, retrieved on 22nd August 2012, appears here for illustrative
purposes, qualifying as fair dealing under Section 29 of the Copyright, Designs and
Patents Act 1988.
[115] Sony. Wonderbook. http://blog.problematicgamer.com/2012/06/e3-
wonderbook-announced.html, 2012. This photo, retrieved on 22nd August 2012,
is promotional material for an upcoming video game (Wonderbook, Sony, 2012),
and appears here for illustrative purposes, qualifying as fair dealing under Section
29 of the Copyright, Designs and Patents Act 1988.
[116] Sony. Wonderbook: Fire. http://www.digitalspy.co.uk/gaming/news/
a400867/wonderbook-pricing-revealed-hands-on-event-coming-to-
oxford-street.html, 2012. This photo, retrieved on 22nd August 2012, is
promotional material for an upcoming video game (Wonderbook, Sony, 2012), and
appears here for illustrative purposes, qualifying as fair dealing under Section 29
of the Copyright, Designs and Patents Act 1988.
[117] Spong. Keep Up. http://spong.com/asset/117573/2/11034546, 2004. This
photo, retrieved on 22nd August 2012, is a screenshot from a copyrighted game
(EyeToy: Play, SCEE, 2003), and appears here for illustrative purposes, qualifying
as fair dealing under Section 29 of the Copyright, Designs and Patents Act 1988.
[118] J. Talbot. Implementing GrabCut. http://www.justintalbot.com/course-
work/. Retrieved 1st April 2013.
[119] J. Talbot and X. Xu. Implementing GrabCut. Brigham Young University, 2006.
[120] A. Torralba, K.P. Murphy, and W.T. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(5):854–869, 2007.
[121] Z. Tu. Probabilistic boosting-tree: Learning discriminative models for classification,
recognition, and clustering. In IEEE International Conference on Computer Vision
(ICCV), volume 2, pages 1589–1596, 2005.
[122] K. VanOrd. Eye of Judgment review. http://www.gamespot.com/the-eye-of-
judgment-legends/reviews/the-eye-of-judgment-review-6181426/, 2007.
Retrieved 21st August 2012.
[123] V. Vineet, J. Warrell, P. Sturgess, and P.H.S. Torr. Improved initialization and
gaussian mixture pairwise terms for dense random fields with mean-field inference.
In British Machine Vision Conference (BMVC), pages 1–14, 2012.
[124] V. Vineet, J. Warrell, and P.H.S. Torr. Filter-based mean-field inference for random
fields with higher-order terms and product label-spaces. In European Conference
of Computer Vision (ECCV), pages 31–44, 2012.
[125] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269,
1967.
[126] M. Walton. Kinect Star Wars review. http://www.gamespot.com/kinect-star-
wars/reviews/kinect-star-wars-review-6369636/, 2011. Retrieved 22nd Au-
gust 2012.
[127] H. Wang and D. Koller. Multi-level inference by relaxed dual decomposition for
human pose segmentation. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 2433–2440, 2011.
[128] C. Watters. Dance Central review. http://www.gamespot.com/dance-central/
reviews/dance-central-review-6283598/, 2010. Retrieved 22nd August 2012.
[129] D. Whitehead. EyePet review. http://www.eurogamer.net/articles/eyepet-
review, 2009. Retrieved 21st August 2012.
[130] J. Winn and J. Shotton. The layout consistent random field for recognizing and
segmenting partially occluded objects. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 37–44, 2006.
[131] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-
parts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 1385–1392, 2011.
[132] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-
parts: pose-release version 1.2. http://phoenix.ics.uci.edu/software/pose/,
2011. Retrieved 1st April 2013. Version 1.2 of the code was used throughout this
thesis.