LEARNING STRUCTURED PREDICTION MODELS:
A LARGE MARGIN APPROACH
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Ben Taskar
December 2004
© Copyright by Ben Taskar 2005
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it is
fully adequate in scope and quality as a dissertation for the degree of
Doctor of Philosophy.
Daphne Koller
Computer Science Department
Stanford University
(Principal Advisor)
I certify that I have read this dissertation and that, in my opinion, it is
fully adequate in scope and quality as a dissertation for the degree of
Doctor of Philosophy.
Andrew Y. Ng
Computer Science Department
Stanford University
I certify that I have read this dissertation and that, in my opinion, it is
fully adequate in scope and quality as a dissertation for the degree of
Doctor of Philosophy.
Fernando Pereira
Computer Science Department
University of Pennsylvania
Approved for the University Committee on Graduate Studies:
To my parents, Mark and Tsilya, and my love, Anat.
Abstract
Most questions require more than just true/false or multiple-choice answers. Yet
supervised learning, like standardized testing, has placed the heaviest emphasis
on complex questions with simple answers. The acquired expertise must now be used
to address tasks that demand answers as complex as the questions. Such complex
answers may consist of multiple interrelated decisions that must be weighed
against each other to arrive at a globally satisfactory and consistent solution
to the question. In natural language processing, we often need to construct a
global, coherent analysis of a sentence, such as its corresponding part-of-speech
sequence, parse tree, or translation into another language. In computational
biology, we analyze genetic sequences to predict 3D structure of proteins, find
global alignment of related DNA strings, and recognize functional portions of a
genome. In computer vision, we segment complex objects in cluttered scenes,
reconstruct 3D shapes from stereo and video, and track motion of articulated
bodies.

We typically handle the exponential explosion of possible answers by building
models that compactly capture the structural properties of the problem:
sequential, grammatical, chemical, temporal, and spatial constraints and
correlations. Such structured models include graphical models such as Markov
networks (Markov random fields), recursive language models such as context-free
grammars, and combinatorial optimization problems such as weighted matchings and
graph cuts. This thesis presents a discriminative estimation framework for
structured models based on the large-margin principle underlying support vector
machines. Intuitively, the large-margin criterion provides an alternative to
probabilistic, likelihood-based estimation methods by concentrating directly on
the robustness of the decision boundary of a model. Our framework defines a
suite of efficient learning algorithms that rely on the expressive power of
convex optimization to compactly capture inference or solution optimality in
structured models. For some of these models, alternative estimation methods are
intractable.

The largest portion of the thesis is devoted to Markov networks, which are
undirected probabilistic graphical models widely used to efficiently represent
and reason about joint multivariate distributions. We use graph decomposition to
derive an exact, compact, convex formulation for large-margin estimation of
Markov networks with sequence and other low-treewidth structure. Seamless
integration of kernels with graphical models allows efficient, accurate
prediction in real-world tasks. We analyze the theoretical generalization
properties of max-margin estimation in Markov networks and derive a novel type
of bound on structured error. Using an efficient online-style algorithm that
exploits inference in the model and analytic updates, we solve very large
estimation problems.

We define an important subclass of Markov networks, associative Markov networks
(AMNs), which captures positive correlations between variables and permits exact
inference that scales up to tens of millions of nodes and edges. While
likelihood-based methods are believed to be intractable for AMNs over binary
variables, our framework allows exact estimation of such networks of arbitrary
connectivity and topology. We also introduce relational Markov networks (RMNs),
which compactly define templates for Markov networks for domains with relational
structure: objects, attributes, relations.

In addition to graphical models, our framework applies to a wide range of other
models. We exploit context-free grammar structure to derive a compact max-margin
formulation that allows high-accuracy parsing in cubic time by using novel kinds
of lexical information. We use combinatorial properties of weighted matchings to
develop an exact, efficient formulation for learning to match, and apply it to
prediction of disulfide connectivity in proteins. Finally, we derive a
max-margin formulation for learning the scoring metric for clustering from
clustered training data, which tightly integrates metric learning with the
clustering algorithm, tuning one to the other in a joint optimization.

We describe experimental applications to a diverse range of tasks, including
handwriting recognition, 3D terrain classification, disulfide connectivity
prediction in proteins, hypertext categorization, natural language parsing,
email organization and image segmentation. These empirical evaluations show
significant improvements over state-of-the-art methods and promise wide
practical use for our framework.
Acknowledgements
I am profoundly grateful to my advisor, Daphne Koller. Her tireless pursuit of
excellence in research, teaching, advising, and every other aspect of her
academic work is truly inspirational. I am indebted to Daphne for priceless and
copious advice about selecting interesting problems, making progress on
difficult ones, pushing ideas to their full development, and writing and
presenting results in an engaging manner.

I would like to thank my thesis committee, Andrew Ng and Fernando Pereira, as
well as my defense committee members, Stephen Boyd and Sebastian Thrun, for
their excellent suggestions and thought-provoking questions. I have learned a
great deal from their work, and their influence on this thesis is immense. In
particular, Fernando's work on Markov networks has inspired my focus on this
subject. Andrew's research on clustering and classification has informed many of
the practical problems addressed in the thesis. Sebastian introduced me to the
fascinating problems in 3D vision and robotics. Finally, Stephen's book on
convex optimization is the primary source of many of the insights and
derivations in this thesis.

Daphne's research group is an institution in itself. I am very lucky to have
been a part of it and shared the company, the ideas, the questions and the
expertise of Pieter Abbeel, Drago Anguelov, Alexis Battle, Luke Biewald, Xavier
Boyen, Vassil Chatalbashev, Gal Chechik, Lise Getoor, Carlos Guestrin, Uri
Lerner, Uri Nodelman, Dirk Ormoneit, Ron Parr, Eran Segal, Christian Shelton,
Simon Tong, David Vickrey, Haidong Wang and Ming Fai Wong.

Jean-Claude Latombe was my first research advisor, when I was still an
undergraduate at Stanford. His thoughtful guidance and quick wit were a great
welcome and inspiration to continue my studies. When I started my research in
machine learning with Daphne, I had the privilege and pleasure to work with Nir
Friedman, Lise Getoor and Eran Segal. They taught me to remain steadfast when
elegant theories meet the ragged edge of real problems. Lise has been a great
friend and my other big sister throughout my graduate life. I could not wish for
more caring advice and encouragement than she has given me all these years.

I was lucky to meet Michael Collins, Michael Littman and David McAllester all in
one place, while I was a summer intern at AT&T. Collins' work on learning
large-margin tagging and parsing models motivated me to look at such methods for
other structured models. I am very glad I had the chance to work with him on the
max-margin parsing project. Michael Littman, with his endearing combination of
humor and warmth, lightness and depth, has been a constant source of
encouragement and inspiration. David's wide-ranging research on probability and
logic, languages and generalization bounds, to mention a few, enlightened much
of my work.

(Almost) everything I know about natural language processing I learned from
Chris Manning and Dan Klein. Chris' healthy skepticism and emphasis on the
practical have kept me from going down many dead-end paths. Dan has an
incredible gift for making ideas work, and, better yet, for explaining why other
ideas do not.

I would like to express my appreciation and gratitude to all my collaborators
for their excellent ideas, hard work and dedication: Pieter Abbeel, Drago
Anguelov, Peter Bartlett, Luke Biewald, Vassil Chatalbashev, Michael Collins,
Nir Friedman, Audrey Gasch, Lise Getoor, Carlos Guestrin, Geremy Heitz, Dan
Klein, Daphne Koller, Chris Manning, David McAllester, Eran Segal, David Vickrey
and Ming Fai Wong.

Merci beaucoup to Pieter Abbeel, my sounding board for many ideas in convex
optimization and learning, for numerous enlightening discussions that we have
had anytime, anywhere: at the gym, on the tennis court, biking to EV, over cheap
Thai food. Ogromnoye spasibo to my officemate, sportsmate, always late,
ever-ready to checkmate Drago Anguelov, who is always willing to share with me
everything from research problems to music and backpacking treks to Russian
books and Belgian beer. Blagodarya vi mnogo for another great Bulgarian I have
the good luck to know, Vasco Chatalbashev, for his company and ideas, good cheer
and a great knack for always coming through. Muchas gracias to Carlos Guestrin,
my personal climbing instructor, fellow photography fanatic, avid grillmaster
and exploiter-of-structure extraordinaire, for opening his ear and heart to me
when it matters most. Mhgoisaai to Ming Fai Wong, for his sense of humor, kind
heart and excellent work.

Many thanks to my friends who have had nothing to do with work in this thesis,
but worked hard to keep my relative sanity throughout. I will not list all of
you here, but my gratitude to you is immense.

My parents, Mark and Tsilya, have given me unbending support and constant
encouragement. I thank them for all the sacrifices they have made to ensure that
their children would have freedom and opportunities they never had in the Soviet
Union. To my sister, Anna, my brother-in-law, Ilya, and my sweet niece, Talia, I
am grateful for bringing me so much joy and love. To my tireless partner in work
and play, tears and laughter, hardship and love, Anat Caspi, thank you for
making me happy next to you.
Contents
Abstract vii
Acknowledgements ix
1 Introduction 1
1.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Complex prediction problems . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Structured models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Previously published work . . . . . . . . . . . . . . . . . . . . . . . . . . 13
I Models and methods 15
2 Supervised learning 16
2.1 Classiﬁcation with generalized linear models . . . . . . . . . . . . . . . . 19
2.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Logistic dual and maximum entropy . . . . . . . . . . . . . . . . . . . . . 21
2.4 Support vector machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 SVM dual and kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Structured models 24
3.1 Probabilistic models: generative and conditional . . . . . . . . . . . . . . . 25
3.2 Prediction models: normalized and unnormalized . . . . . . . . . . . . . . 27
3.3 Markov networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 Linear programming MAP inference . . . . . . . . . . . . . . . . . 34
3.4 Context free grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Combinatorial problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Structured maximum margin estimation 42
4.1 Max-margin estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Min-max formulation . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.2 Certiﬁcate formulation . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Approximations: upper and lower bounds . . . . . . . . . . . . . . . . . . 50
4.2.1 Constraint generation . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 Constraint strengthening . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
II Markov networks 57
5 Markov networks 58
5.1 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Maximum margin estimation . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3 M³N dual and kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4 Untriangulated models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5 Generalization bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6 M³N algorithms and experiments 75
6.1 Solving the M³N QP . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1.1 SMO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.1.2 Selecting SMO pairs . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.1.3 Structured SMO . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7 Associative Markov networks 89
7.1 Associative networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.2 LP Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.3 Min-cut inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.3.1 Graph construction . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.3.2 Multiclass case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.4 Max-margin estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8 Relational Markov networks 109
8.1 Relational classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.2 Relational Markov networks . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.3 Approximate inference and learning . . . . . . . . . . . . . . . . . . . . . 116
8.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.4.1 Flat models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.4.2 Link model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
8.4.3 Co-cite model . . . . . . . . . . . . . . . . . . . . . . . 121
8.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
III Broader applications: parsing, matching, clustering 127
9 Context free grammars 128
9.1 Context free grammar model . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.2 Context free parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
9.3 Discriminative parsing models . . . . . . . . . . . . . . . . . . . . . . . . 133
9.3.1 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . 134
9.3.2 Maximum margin estimation . . . . . . . . . . . . . . . . . . . . . 135
9.4 Structured SMO for CFGs . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
9.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
10 Matchings 144
10.1 Disulﬁde connectivity prediction . . . . . . . . . . . . . . . . . . . . . . . 145
10.2 Learning to match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
10.3 Min-max formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
10.4 Certiﬁcate formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
10.5 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
10.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
10.7 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
10.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
11 Correlation clustering 160
11.1 Clustering formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
11.1.1 Linear programming relaxation . . . . . . . . . . . . . . . . . . . 162
11.1.2 Semideﬁnite programming relaxation . . . . . . . . . . . . . . . . 163
11.2 Learning formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
11.3 Dual formulation and kernels . . . . . . . . . . . . . . . . . . . . . . . . . 167
11.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
11.4.1 Irrelevant features . . . . . . . . . . . . . . . . . . . . . . . . . . 168
11.4.2 Email clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
11.4.3 Image segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 170
11.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
11.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
IV Conclusions and future directions 175
12 Conclusions and future directions 176
12.1 Summary of contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 176
12.1.1 Structured maximum margin estimation . . . . . . . . . . . . . . . 177
12.1.2 Markov networks: max-margin, associative, relational . . . . . . . 180
12.1.3 Broader applications: parsing, matching, clustering . . . . . . . . . 182
12.2 Extensions and open problems . . . . . . . . . . . . . . . . . . . . . . . . 183
12.2.1 Theoretical analysis and optimization algorithms . . . . . . . . . . 183
12.2.2 Novel prediction tasks . . . . . . . . . . . . . . . . . . . . . . . . 184
12.2.3 More general learning settings . . . . . . . . . . . . . . . . . . . . 185
12.3 Future . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A Proofs and derivations 188
A.1 Proof of Theorem 5.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
A.1.1 Binary classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . 189
A.1.2 Structured classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . 190
A.2 AMN proofs and derivations . . . . . . . . . . . . . . . . . . . . . . . . . 195
A.2.1 Binary AMNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
A.2.2 Multiclass AMNs . . . . . . . . . . . . . . . . . . . . . . . . . . 197
A.2.3 Derivation of the factored primal and dual max-margin QP . . . . . 199
Bibliography 202
List of Tables
9.1 Parsing results on development set . . . . . . . . . . . . . . . . . . . . . . 140
9.2 Parsing results on test set . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
10.1 Bond connectivity prediction results . . . . . . . . . . . . . . . . . . . . . 156
List of Figures
1.1 Supervised learning setting . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Examples of complex prediction problems . . . . . . . . . . . . . . . . . . 4
2.1 Handwritten character recognition . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Classiﬁcation loss and upper bounds . . . . . . . . . . . . . . . . . . . . . 20
3.1 Handwritten word recognition . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Chain Markov network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Diamond Markov network . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Diamond network junction tree . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Marginal agreement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Example parse tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 Exact and approximate constraints for max-margin estimation . . . . . . 51
4.2 A constraint generation algorithm. . . . . . . . . . . . . . . . . . . . . . . 52
5.1 Chain M³N example . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Diamond Markov network . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1 Block-coordinate ascent . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2 SMO subproblem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3 SMO pair selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.4 Structured SMO diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.5 Structured SMO pair selection . . . . . . . . . . . . . . . . . . . . . . . . 85
6.6 OCR results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1 Min-cut graph construction . . . . . . . . . . . . . . . . . . . . . . . 94
7.2 α-expansion algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.3 Segbot: roving robot equipped with SICK2 laser sensors. . . . . . . . . . . 100
7.4 3D laser scan range map of the Stanford Quad. . . . . . . . . . . . . . . . 101
7.5 Comparison of terrain classiﬁcation models (detail) . . . . . . . . . . . . . 102
7.6 Min-cut inference running times . . . . . . . . . . . . . . . . . . . . . 103
7.7 Comparison of terrain classification models (zoomed out) . . . . . . . . 106
7.8 Labeled portion of the test terrain dataset (ground truth and SVM predictions) 107
7.9 Labeled portion of the test terrain dataset (Voted-SVM and AMN predictions) 108
8.1 Link model for document classiﬁcation . . . . . . . . . . . . . . . . . . . 113
8.2 WebKB results: ﬂat models . . . . . . . . . . . . . . . . . . . . . . . . . . 120
8.3 WebKB results: link model . . . . . . . . . . . . . . . . . . . . . . . . . . 121
8.4 WebKB results: co-cite model . . . . . . . . . . . . . . . . . . . . . . 122
9.1 Two representations of a binary parse tree . . . . . . . . . . . . . . . . . . 130
10.1 PDB protein 1ANS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
10.2 Number of constraints vs. number of bonds . . . . . . . . . . . . . . . . . 150
10.3 Learning curve for bond connectivity prediction . . . . . . . . . . . . . . . 157
11.1 Learning to cluster with irrelevant features . . . . . . . . . . . . . . . . . . 169
11.2 Learning to organize email . . . . . . . . . . . . . . . . . . . . . . . . . . 170
11.3 Two segmentations by different users . . . . . . . . . . . . . . . . . . . . . 171
11.4 Two segmentations by different models . . . . . . . . . . . . . . . . . . . 172
Chapter 1
Introduction
The breadth of tasks addressed by machine learning is rapidly expanding. Major
applications include medical diagnosis, scientific discovery, financial
analysis, fraud detection, DNA sequence analysis, speech and handwriting
recognition, game playing, image analysis, robot locomotion and many more. Of
course, the list of things we would like a computer to learn to do is much, much
longer. As we work our way down that list, we encounter the need for very
sophisticated decision making from our programs.
Some tasks, for example handwriting recognition, are performed almost
effortlessly by a person, but remain difficult and error-prone for computers.
The complex synthesis of many levels of signal processing a person executes when
confronted by a line of handwritten text is daunting. The reconstruction of an
entire sentence from the photons hitting the retina off of each tiny patch of an
image undoubtedly requires an elaborate interplay of recognition and
representation of the pen strokes, the individual letters, whole words and
constituent phrases.
Computer scientists, as opposed to, say, neuroscientists, are primarily
concerned with achieving acceptable speed and accuracy of recognition rather
than with modeling this complicated process with any biological verity.
Computational models for handwriting recognition aim to capture the salient
properties of the problem: typical shapes of the letters, likely letter
combinations that make up words, common ways to combine words into phrases,
frequent grammatical constructions of the phrases, etc. Machine learning offers
an alternative to encoding all the intricate details of such a model from
scratch. One of its primary goals
is to devise efficient algorithms for training computers to automatically
acquire effective and accurate models from experience.
In this thesis, we present a discriminative learning framework and a novel
family of efficient models and algorithms for complex recognition tasks in
several disciplines, including natural language processing, computer vision and
computational biology. We develop theoretical foundations for our approach and
show a wide range of experimental applications, including handwriting
recognition, 3-dimensional terrain classification, disulfide connectivity in
protein structure prediction, hypertext categorization, natural language
parsing, email organization and image segmentation.
1.1 Supervised learning
The most basic supervised learning task is classification. Suppose we wish to
learn to recognize a handwritten character from a scanned image. This is a
classification task, because we must assign a class (an English letter from 'a'
through 'z') to an observation of an object (an image). Essentially, a
classifier is a function that maps an input (an image) to an output (a letter).
In the supervised learning setting, we construct a classifier by observing
labeled training examples, in our case, sample images paired with appropriate
letters. The main problem addressed by supervised learning is generalization:
the learning program is allowed to observe only a small sample of labeled
images, yet must produce a classifier that is accurate on unseen images of
letters.
More formally, let $x$ denote an input. For example, a black-and-white image $x$
can be represented as a vector of pixel intensities. We use $\mathcal{X}$ to
denote the space of all possible inputs. Let $y$ denote the output, and
$\mathcal{Y}$ be the discrete space of possible outcomes (e.g., the 26 letters
'a'-'z'). A classifier (or hypothesis) $h$ is a function from $\mathcal{X}$ to
$\mathcal{Y}$, $h : \mathcal{X} \rightarrow \mathcal{Y}$. We denote the set of
all classifiers that our learning program can produce as $\mathcal{H}$ (the
hypothesis class). Then, given a set of labeled examples $\{x^{(i)}, y^{(i)}\}$,
$i = 1, \ldots, m$, a learning program seeks to produce a classifier
$h \in \mathcal{H}$ that will work well on unseen examples $x$, usually by
finding $h$ that accurately classifies the training data. The diagram in
Fig. 1.1 summarizes the supervised learning setting.
Figure 1.1: Supervised learning setting (labeled data → learning → hypotheses;
new data → prediction)
The problem of classification has a long history and a highly developed theory
and practice (see, for example, Mitchell [1997]; Vapnik [1995]; Duda et al.
[2000]; Hastie et al. [2001]). The two most important dimensions of variation of
classification algorithms are the hypothesis class $\mathcal{H}$ and the
criterion for selecting a hypothesis $h$ from $\mathcal{H}$ given the training
data. In this thesis, we build upon the generalized linear model family, which
underlies standard classifiers such as logistic regression and support vector
machines. Through the use of kernels to implicitly define high-dimensional and
even infinite-dimensional input representations, generalized linear models can
approximate arbitrarily complex decision boundaries.

The task of selecting a hypothesis $h$ reduces to estimating model parameters.
Broadly speaking, probabilistic estimation methods associate a joint
distribution $p(x, y)$ or conditional distribution $p(y \mid x)$ with $h$ and
select a model based on the likelihood of the data [Hastie et al., 2001]. Joint
distribution models are often called generative, while conditional models are
called discriminative. Large-margin methods, by contrast, select a model based
on a more direct measure of confidence of its predictions on the training data,
called the margin [Vapnik, 1995]. The difference between these two methods is
one of the key
Figure 1.2: Examples of complex prediction problems (inputs: top; outputs: bottom):
(a) handwriting recognition [image → word, e.g., "brace"];
(b) natural language parsing [sentence → parse tree, e.g., "The screen was a sea of red"];
(c) disulfide bond prediction in proteins [amino-acid sequence, e.g., RSCCPCYWGGCPWGQNCYPEGCSGPKV → bond structure (shown in yellow)];
(d) terrain segmentation [3D image → segmented objects (trees, bushes, buildings, ground)]
themes in this thesis.
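The contrast between the two criteria can be made concrete with the loss functions they minimize for a binary linear classifier with score $s = w \cdot x$ and label $y \in \{-1, +1\}$: likelihood-based estimation minimizes the logistic (negative log-likelihood) loss, while large-margin estimation minimizes the hinge loss. The sketch below only tabulates a few values; the scores are made-up numbers for illustration:

```python
# Comparing the logistic loss (likelihood-based estimation) and the
# hinge loss (large-margin estimation) for a binary linear classifier.
import math

def logistic_loss(score, y):
    # Negative conditional log-likelihood of a logistic regression model.
    return math.log(1.0 + math.exp(-y * score))

def hinge_loss(score, y):
    # SVM loss: zero once the example is classified with margin >= 1.
    return max(0.0, 1.0 - y * score)

for s in [-1.0, 0.0, 1.0, 3.0]:
    print(f"score={s:+.1f}  logistic={logistic_loss(s, +1):.3f}  "
          f"hinge={hinge_loss(s, +1):.3f}")
# The hinge loss is exactly zero beyond the margin (score >= 1),
# while the logistic loss only decays toward zero.
```

Both losses upper-bound the 0/1 classification error, but the hinge loss cares only about whether the decision is made with sufficient margin, which is the robustness notion this thesis builds on.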
Most of the research has focused on analysis and classification algorithms for
the case of binary outcomes, $|\mathcal{Y}| = 2$, or a small number of classes.
In this work, we focus on prediction tasks that involve not a single decision
with a small set of outcomes, but a complex, interrelated collection of
decisions.
1.2 Complex prediction problems
Consider once more the problem of character recognition. In fact, a more natural and useful
task is recognizing words and entire sentences. Fig. 1.2(a) shows an example handwritten
word “brace.” Distinguishing between the second letter and fourth letter (‘r’ and ‘c’) in iso
lation is actually far from trivial, but in the context of the surrounding letters that together
form a word, this task is much less errorprone for humans and should be for computers
as well. It is also more complicated, as different decisions must be weighed against each
other to arrive at the globally satisfactory prediction. The space of all possible outcomes
$\mathcal{Y}$ is immense, usually exponential in the number of individual
decisions; for example, the number of 5-letter sequences is $26^5$. However,
most of these outcomes are unlikely given the observed input. By capturing the
most salient structure of the problem, for example the strong local correlations
between consecutive letters, we will construct compact models that efficiently
deal with this complexity. Below we list several examples from different fields.
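To make the benefit of such structure concrete, the sketch below compares brute-force enumeration of all letter sequences with a Viterbi-style dynamic program that exploits chain structure, finding the best sequence in time linear in its length. The tiny alphabet and the node/edge scoring functions are made-up stand-ins for this sketch, not a trained model:

```python
# Chain structure tames an exponential output space: a Viterbi-style
# dynamic program finds the highest-scoring letter sequence without
# enumerating all |alphabet|^length candidates. Scores are illustrative.
import itertools

ALPHABET = "abc"          # small alphabet so brute force is feasible here
LENGTH = 4

def node_score(pos, letter):      # stand-in for per-letter image evidence
    return 1.0 if letter == "abca"[pos % 4] else 0.0

def edge_score(a, b):             # stand-in for letter-pair correlations
    return 0.5 if (a, b) in {("a", "b"), ("b", "c"), ("c", "a")} else 0.0

def viterbi():
    # best[l] = (score, sequence) of the best sequence ending in letter l.
    best = {l: (node_score(0, l), l) for l in ALPHABET}
    for pos in range(1, LENGTH):
        best = {l: max((s + edge_score(seq[-1], l) + node_score(pos, l), seq + l)
                       for s, seq in best.values())
                for l in ALPHABET}
    return max(best.values())

def brute_force():
    # Exhaustive search over all |ALPHABET|^LENGTH sequences.
    return max((sum(node_score(i, l) for i, l in enumerate(seq))
                + sum(edge_score(a, b) for a, b in zip(seq, seq[1:])),
                "".join(seq))
               for seq in itertools.product(ALPHABET, repeat=LENGTH))

print(viterbi(), brute_force())  # both recover the same best sequence, "abca"
```

The dynamic program touches only (length × alphabet²) combinations, so the same idea scales to the $26^5$ space above, where exhaustive enumeration does not.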
• Natural language processing
Vast amounts of electronically available text have spurred a tremendous amount
of research into automatic analysis and processing of natural language. We
mention some of the lower-level tasks that have received a lot of recent
attention [Charniak, 1993; Manning & Schütze, 1999]. Part-of-speech tagging
involves assigning each word in a sentence a part-of-speech tag, such as noun,
verb, pronoun, etc. As with handwriting recognition, capturing the sequential
structure of correlations between consecutive tags is key. In parsing, the goal
is to recognize the recursive phrase structure of a sentence, such as verbal,
noun and prepositional phrases and their nesting in relation to each other.
Fig. 1.2(b) shows a parse tree corresponding to the sentence: "The screen was a
sea of red" (more on this in Ch. 9). Many other problems, such as named-entity
and relation extraction, text summarization, and translation, involve complex
global decision making.
• Computational biology
The last two decades have yielded a wealth of high-throughput experimental data,
including complete sequencing of many genomes, precise measurements of protein
3D structure, and genome-wide assays of mRNA levels and protein-protein interactions.
Major research has been devoted to gene finding, alignment of sequences, protein
structure prediction, and molecular pathway discovery [Gusfield, 1997; Durbin et al.,
1998]. Fig. 1.2(c) shows the disulfide bond structure (shown in yellow) we would like to
predict from the amino-acid sequence of the protein (more on this in Ch. 10).
• Computer vision
As digital cameras and optical scanners become commonplace accessories, medical
imaging technology produces detailed physiological measurements, laser scanners
capture 3D environments, and satellites and telescopes bring pictures of Earth and distant
stars, we are flooded with images we would like our computers to analyze. Example
tasks include object detection and segmentation, motion tracking, 3D reconstruction
from stereo and video, and much more [Forsyth & Ponce, 2002]. Fig. 1.2(d) shows a
3D laser range data image of the Stanford campus collected by a roving robot, which
we would like to segment into objects such as trees, bushes, buildings, ground, etc.
(more on this in Ch. 7).
1.3 Structured models
This wide range of problems has been tackled using various models and methods. We
focus on models that compactly capture the correlation and constraint structure inherent to
many tasks. Abstractly, a model assigns a score (or likelihood in probabilistic models) to
each possible input/output pair $(x, y)$, typically through a compact, parameterized scoring
function. Inference in these models refers to computing the highest scoring output given
the input and usually involves dynamic programming or combinatorial optimization.
• Markov networks
Markov networks (a.k.a. Markov random fields) are extensively used to model complex
sequential, spatial, and relational interactions in prediction problems arising in
many fields. These problems involve labeling a set of related objects that exhibit
local consistency. Markov networks compactly represent complex joint distributions
of the label variables by modeling their local interactions. Such models are encoded
by a graph, whose nodes represent the different object labels, and whose edges represent
and quantify direct dependencies between them. The graphical structure of
the model encodes the qualitative aspects of the distribution: direct dependencies as
well as conditional independencies. The quantitative aspect of the model is defined
by the potentials that are associated with nodes and cliques of the graph. The graphical
structure of the network (more precisely, the treewidth of the graph, which we
formally define in Ch. 3) is critical to efficient inference and learning in the model.
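To make the inference step concrete, here is a minimal Python sketch (not from the thesis) of a chain-structured Markov network over letter labels: node potentials score each position's label given its observation, edge potentials score adjacent label pairs, and Viterbi-style dynamic programming recovers the highest-scoring joint labeling. All potentials and the two-letter alphabet are invented.

```python
# Hypothetical chain Markov network: node potentials score each position's
# label; edge potentials score adjacent label pairs. Dynamic programming
# finds the best joint labeling in time linear in the sequence length,
# avoiding the exponential enumeration of all label sequences.
def viterbi(node_scores, edge_score):
    """node_scores: list (one dict per position) of {label: score}."""
    best = {y: (s, [y]) for y, s in node_scores[0].items()}
    for scores in node_scores[1:]:
        new_best = {}
        for y, s in scores.items():
            # best predecessor for label y at this position
            prev, (ps, seq) = max(best.items(),
                                  key=lambda kv: kv[1][0] + edge_score(kv[0], y))
            new_best[y] = (ps + edge_score(prev, y) + s, seq + [y])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]

# Invented example: node scores weakly favor 'b' in the middle, but an edge
# potential rewarding repeated letters makes "aaa" the best joint labeling.
nodes = [{"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 0.2}, {"a": 1.0, "b": 0.0}]
same = lambda u, v: 0.5 if u == v else 0.0
```

With the invented potentials above, the edge reward for repeated letters overrides the weak node preference for ‘b’ in the middle position, illustrating how local interactions shape the joint prediction.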
• Context-free grammars
Context-free grammars are one of the primary formalisms for capturing the recursive
structure of syntactic constructions [Manning & Schütze, 1999]. For example,
in Fig. 1.2, the nonterminal symbols (labels of internal nodes) correspond to syntactic
categories such as noun phrase (NP), verbal phrase (VP) or prepositional phrase
(PP) and part-of-speech tags like nouns (NN), verbs (VBD), determiners (DT) and
prepositions (IN). The terminal symbols (leaves) are the words of the sentence. A
CFG consists of recursive productions (e.g. $VP \rightarrow VP\ PP$, $DT \rightarrow \textit{The}$) that
can be applied to derive a sentence of the language. The productions define the set
of syntactically allowed phrase structures (derivations). By compactly defining a
probability distribution over individual productions, probabilistic CFGs construct a
distribution over parse trees and sentences, and the prediction task reduces to finding
the most likely tree given the sentence. The context-free restriction allows efficient
inference and learning in such models.
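The prediction task for probabilistic CFGs can be sketched with a toy grammar in Chomsky normal form; the rules and probabilities below are invented, and a CKY-style dynamic program computes the score of the best parse in time cubic in sentence length.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (all rule probabilities invented).
binary = {("S", ("NP", "VP")): 1.0,
          ("VP", ("V", "NP")): 1.0,
          ("NP", ("DT", "NN")): 0.8}
lexical = {("DT", "the"): 0.5, ("DT", "a"): 0.5,
           ("NN", "screen"): 0.5, ("NN", "sea"): 0.5,
           ("V", "was"): 1.0}

def best_parse_score(words):
    n = len(words)
    chart = defaultdict(float)        # (i, j, symbol) -> best inside score
    for i, w in enumerate(words):     # fill one-word spans from the lexicon
        for (sym, word), p in lexical.items():
            if word == w:
                chart[i, i + 1, sym] = p
    for span in range(2, n + 1):      # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # split point
                for (parent, (left, right)), p in binary.items():
                    score = p * chart[i, k, left] * chart[k, j, right]
                    chart[i, j, parent] = max(chart[i, j, parent], score)
    return chart[0, n, "S"]           # best score of a full parse rooted at S
```

The chart entry for the whole sentence under the start symbol gives the probability of the most likely derivation; an unparsable string scores zero.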
• Combinatorial structures
Many important computational tasks are formulated as combinatorial optimization
problems such as the maximum weight bipartite and perfect matching, spanning
tree, graph cut, edge cover, and many others [Lawler, 1976; Papadimitriou & Steiglitz,
1982; Cormen et al., 2001]. Although the term ‘model’ is often reserved for
probabilistic models, we use the term very broadly, to include any scheme
that assigns scores to the output space $\mathcal{Y}$ and has a procedure for finding the optimal
scoring $y$. For example, the disulfide connectivity prediction in Fig. 1.2(c) can
be modeled by maximum weight perfect matchings, where the weights define potential
bond strength based on local amino-acid sequence properties. The other
combinatorial structures we consider and apply in this thesis include graph cuts and
partitions, bipartite matchings, and spanning trees.
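The matching view can be sketched as follows, with invented bond scores and brute-force enumeration standing in for a real matching algorithm:

```python
from itertools import permutations

# Disulfide connectivity as maximum-weight perfect matching: pair up an even
# number of cysteines so that the total bond score is maximal. Weights are
# invented; brute force stands in for a polynomial matching algorithm.
def best_perfect_matching(weights, n):
    """weights[(i, j)] with i < j; returns (best score, list of pairs)."""
    best_score, best_pairs = float("-inf"), None
    for perm in permutations(range(n)):
        pairs = sorted(tuple(sorted(perm[k:k + 2])) for k in range(0, n, 2))
        score = sum(weights[p] for p in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs

# Four hypothetical cysteines; w[(i, j)] scores a potential bond between them.
w = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): 2.0,
     (1, 3): 2.0, (0, 3): 0.5, (1, 2): 0.5}
```

Here the greedy-looking pairing {(0, 1), (2, 3)} is beaten by {(0, 2), (1, 3)}, which is exactly the kind of global trade-off the matching formulation captures.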
The standard methods of estimation for Markov networks and context-free grammars
are based on maximum likelihood, both joint and conditional. However, maximum likelihood
estimation of scoring function parameters for combinatorial structures is often intractable
because of the problem of defining a normalized distribution over an exponential
set of combinatorial structures.
1.4 Contributions
This thesis addresses the problem of efficient learning of high-accuracy models for complex
prediction problems. We consider a very large class of structured models, from Markov
networks to context-free grammars to combinatorial graph structures such as matchings
and cuts. We focus on those models where exact inference is tractable, or can be efficiently
approximated.
◦ Learning framework for structured models
We propose a general framework for efficient estimation of models for structured
prediction. An alternative to likelihood-based methods, this framework builds upon
the large margin estimation principle. Intuitively, we find parameters such that inference
in the model (dynamic programming, combinatorial optimization) predicts
the correct answers on the training data with maximum confidence. We develop general
conditions under which exact large margin estimation is tractable and present
two formulations for structured max-margin estimation that define compact convex
optimization problems, taking advantage of prediction task structure. The first formulation
relies on the ability to express inference in the model as a compact convex
optimization problem. The second one only requires compactly expressing optimality
of a given assignment according to the model and applies to a broader range of
combinatorial problems. These two formulations form the foundation which the rest
of the thesis develops.
◦ Markov networks
The largest portion of the thesis is devoted to novel estimation algorithms, representational
extensions, generalization analysis and experimental validation for Markov
networks, a model class of choice in many structured prediction tasks in language,
vision and biology.
Low-treewidth Markov networks
We use graph decomposition to derive an exact, compact, convex learning formulation
for Markov networks with sequence and other low-treewidth structure.
The seamless integration of kernels with graphical models allows us to create
very rich models that leverage the immense amount of research in kernel design
and graphical model decompositions for efficient, accurate prediction in
real-world tasks. We also use approximate graph decomposition to derive a
compact approximate formulation for Markov networks in which inference is
intractable.
Scalable online algorithm
We present an efficient algorithm for solving the estimation problem called
Structured SMO. Our online-style algorithm uses inference in the model and
analytic updates to solve extremely large estimation problems.
Generalization analysis
We analyze the theoretical generalization properties of max-margin estimation
in Markov networks and derive a novel margin-based bound for structured prediction.
This bound is the first to address structured error (e.g. the proportion
of mislabeled pixels in an image) and uses a proof that exploits the graphical
model structure.
Learning associative Markov networks (AMNs)
We define an important subclass of Markov networks that captures positive correlations
present in many domains. We show that for AMNs over binary variables,
our framework allows exact estimation of networks of arbitrary connectivity
and topology, for which likelihood methods are believed to be intractable.
For the non-binary case, we provide an approximation that works well in practice.
We present an AMN-based method for object segmentation from 3D range
data. By constraining the class of Markov networks to AMNs, our models are
learned efficiently and, at run time, scale up to tens of millions of nodes and
edges.
Representation and learning of relational Markov networks
We introduce relational Markov networks (RMNs), which compactly define
templates for Markov networks for domains with relational structure: objects,
attributes, relations. The graphical structure of an RMN is based on the relational
structure of the domain, and can easily model complex interaction patterns
over related entities. We use approximate inference in these complex models,
in which exact inference is intractable, to derive an approximate learning
formulation. We apply this class of models to classification of hypertext using
hyperlink structure to define relations between webpages.
◦ Broader applications: parsing, matching, clustering
The other large portion of the thesis addresses a range of prediction tasks with very diverse
models: context-free grammars for natural language parsing, perfect matchings
for disulfide connectivity in protein structure prediction, and graph partitions for clustering
documents and segmenting images.
Learning to parse
We exploit context-free grammar structure to derive a compact max-margin
formulation and show high-accuracy parsing in cubic time by exploiting novel
kinds of lexical information. We show experimental evidence of the model’s
improved performance over several baseline models.
Learning to match
We use combinatorial properties of weighted matchings to develop an exact,
efficient algorithm for learning to match. We apply our framework to prediction
of disulfide connectivity in proteins using perfect non-bipartite matchings.
The algorithm we propose uses kernels, which makes it possible to efficiently
embed the features in very high-dimensional spaces and achieve state-of-the-art
accuracy.
Learning to cluster
We derive a max-margin formulation for learning the affinity metric for clustering
from clustered training data. In contrast to algorithms that learn a metric
independently of the algorithm that will be used to cluster the data, we describe
a formulation that tightly integrates metric learning with the clustering algorithm,
tuning one to the other in a joint optimization. Experiments on synthetic
and real-world data show the ability of the algorithm to learn an appropriate
clustering metric for a variety of desired clusterings, including email folder organization
and image segmentation.
1.5 Thesis outline
Below is a summary of the rest of the chapters in the thesis:
Chapter 2. Supervised learning: We review basic definitions and the statistical framework
for classification. We define hypothesis classes, loss functions, and risk. We consider
generalized linear models, including logistic regression and support vector machines,
and review estimation methods based on maximizing likelihood, conditional likelihood
and margin. We describe the relationship between the dual estimation problems
and kernels.
Chapter 3. Structured models: In this chapter, we define the abstract class of structured
prediction problems and models addressed by the thesis. We compare generative and
discriminative probabilistic models, as well as unnormalized models. We describe representation
and inference for Markov networks, including dynamic and linear programming
inference. We also briefly describe context-free grammars and combinatorial
structures as models.
Chapter 4. Structured maximum margin estimation: This chapter outlines the main principles
of maximum margin estimation for structured models. We address the exponential
blow-up of the naive problem formulation by deriving two general equivalent
convex formulations. These formulations, min-max and certificate, allow us to exploit
the decomposition and combinatorial structure of the prediction task. They lead
to polynomial size programs for estimation of models where the prediction problem
is tractable. We also discuss approximations, in particular using upper and lower
bounds, for solving intractable or very large problems.
Chapter 5. Markov networks: We review maximum conditional likelihood estimation
and present maximum margin estimation for Markov networks. We use graphical
model decomposition to derive a convex, compact formulation that seamlessly integrates
kernels with graphical models. We analyze the theoretical generalization
properties of max-margin estimation and derive a novel margin-based bound for
structured classification.
Chapter 6. $M^3N$ algorithms and experiments: We present an efficient algorithm for solving
the estimation problem in graphical models, called Structured SMO. Our online-style
algorithm uses inference in the model and analytic updates to solve extremely
large quadratic problems. We present experiments with handwriting recognition,
where our models significantly outperform other approaches by effectively capturing
correlation between adjacent letters and incorporating high-dimensional input representation
via kernels.
Chapter 7. Associative Markov networks: We define an important subclass of Markov
networks, associative Markov networks (AMNs), that captures positive interactions
present in many domains. We show that for associative Markov networks over binary
variables, max-margin estimation allows exact training of networks of arbitrary
connectivity and topology, for which maximum likelihood methods are believed to
be intractable. For the non-binary case, we provide an approximation that works
well in practice. We present an AMN-based method for object segmentation from
3D range data that scales to very large prediction tasks involving tens of millions of
points.
Chapter 8. Relational Markov networks: We introduce the framework of relational Markov
networks (RMNs), which compactly defines templates for Markov networks in
domains with rich structure modeled by objects, attributes and relations. The graphical
structure of an RMN is based on the relational structure of the domain, and can
easily model complex patterns over related entities. As we show, the use of an undirected,
discriminative graphical model avoids the difficulties of defining a coherent
generative model for graph structures in directed models and allows us tremendous
flexibility in representing complex patterns. We provide experimental results on a
webpage classification task, showing that accuracy can be significantly improved by
modeling relational dependencies.
Chapter 9. Context-free grammars: We present max-margin estimation for natural language
parsing based on the decomposition properties of context-free grammars. We show
that this framework allows high-accuracy parsing in cubic time by exploiting novel
kinds of lexical information. We show experimental evidence of the model’s improved
performance over several baseline models.
Chapter 10. Perfect matchings: We apply our framework to learning to predict disulfide
connectivity in proteins using perfect matchings. We use combinatorial properties of
weighted matchings to develop an exact, efficient algorithm for learning the parameters
of the model. The algorithm we propose uses kernels, which makes it possible
to efficiently embed the features in very high-dimensional spaces and achieve state-of-the-art
accuracy.
Chapter 11. Correlation clustering: In this chapter, we derive a max-margin formulation
for learning affinity scores for correlation clustering from clustered training data.
We formulate the approximate learning problem as a compact convex program with a
quadratic objective and linear or positive-semidefinite constraints. Experiments on
synthetic and real-world data show the ability of the algorithm to learn an appropriate
clustering metric for a variety of desired clusterings, including email folder
organization and image segmentation.
Chapter 12. Conclusions and future directions: We review the main contributions of the
thesis and summarize their significance, applicability and limitations. We discuss extensions
and future research directions not addressed in the thesis.
1.6 Previously published work
Some of the work described in this thesis has been published in conference proceedings.
The min-max and certificate formulations for structured max-margin estimation have not
been published in their general form outlined in Ch. 4, although they underlie several papers
mentioned below. The polynomial formulation of maximum margin Markov networks
presented in Ch. 5 was published for a less general case, using a dual decomposition technique
[Taskar et al., 2003a]. Work on associative Markov networks (Ch. 7) was published
with experiments on hypertext and newswire classification [Taskar et al., 2004a]. A paper
on 3D object segmentation using AMNs, which presents experiments on terrain classification
and other tasks, is currently under review (joint work with Drago Anguelov, Vassil
Chatalbashev, Dinkar Gupta, Geremy Heitz, Daphne Koller and Andrew Ng). Taskar et al.
[2002] and Taskar et al. [2003b] defined and applied relational Markov networks
(Ch. 8), using maximum (conditional) likelihood estimation. Natural language parsing
in Ch. 9 was published in Taskar et al. [2004b]. Disulfide connectivity prediction using
perfect matchings in Ch. 10 (joint work with Vassil Chatalbashev and Daphne Koller) is
currently under review. Finally, work on correlation clustering in Ch. 11, done jointly with
Pieter Abbeel and Andrew Ng, has not been published.
Part I
Models and methods
Chapter 2
Supervised learning
In supervised learning, we seek a function $h : \mathcal{X} \to \mathcal{Y}$ that maps inputs $x \in \mathcal{X}$ to outputs
$y \in \mathcal{Y}$. The input space $\mathcal{X}$ is an arbitrary set (often $\mathcal{X} = \mathbb{R}^n$), while the output space $\mathcal{Y}$
we consider in this chapter is discrete. A supervised learning problem with discrete outputs,
$\mathcal{Y} = \{y_1, \ldots, y_k\}$, where $k$ is the number of classes, is called classification. In handwritten
character recognition, for example, $\mathcal{X}$ is the set of images of letters and $\mathcal{Y}$ is the alphabet
(see Fig. 2.1).
The input to an algorithm is training data, a set of $m$ i.i.d. (independent and identically
distributed) samples $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ drawn from a fixed but unknown distribution $D$
over $\mathcal{X} \times \mathcal{Y}$. The goal of a learning algorithm is to output a hypothesis $h$ such that $h(x)$
will approximate $y$ on new samples from the distribution: $(x, y) \sim D$.
Learning algorithms can be distinguished along several dimensions, chief among them
the hypothesis class $H$ of functions $h$ the algorithm outputs. Numerous classes of functions
have been well studied, including decision trees, neural networks, nearest-neighbors,
generalized log-linear models and kernel methods (see Quinlan [2001]; Bishop [1995];
Hastie et al. [2001]; Duda et al. [2000] for in-depth discussion of these and many other
models). We will concentrate on the last two classes, for several reasons we discuss below,
including accuracy, efficiency, and extensibility to the more complex structured prediction
tasks we will consider in the next chapter.
The second crucial dimension of a learning algorithm is the criterion for selecting $h$
from $H$. We arrive at such a criterion by quantifying what it means for $h(x)$ to approximate
$y$. The risk functional $\mathcal{R}_D[h]$ measures the expected error of the approximation:
$$\mathcal{R}_D[h] = \mathbf{E}_{(x,y)\sim D}[\ell(x, y, h(x))], \qquad (2.1)$$
where the loss function $\ell : \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ measures the penalty for predicting $h(x)$
on the sample $(x, y)$. In general, we assume that $\ell(x, y, \hat{y}) = 0$ if $y = \hat{y}$.
A common loss function for classification is the 0/1 loss, $\ell^{0/1}(x, y, h(x)) \equiv \mathbb{1}(y \neq h(x))$,
where $\mathbb{1}(\cdot)$ denotes the indicator function, that is, $\mathbb{1}(\text{true}) = 1$ and $\mathbb{1}(\text{false}) = 0$.
Since we do not generally know the distribution $D$, we estimate the risk of $h$ using its
empirical risk $\mathcal{R}_S$, computed on the training sample $S$:
$$\mathcal{R}_S[h] = \frac{1}{m}\sum_{i=1}^{m} \ell(x^{(i)}, y^{(i)}, h(x^{(i)})) = \frac{1}{m}\sum_{i=1}^{m} \ell_i(h(x^{(i)})), \qquad (2.2)$$
where we abbreviate $\ell(x^{(i)}, y^{(i)}, h(x^{(i)})) = \ell_i(h(x^{(i)}))$. For 0/1 loss, $\mathcal{R}_S[h]$ is simply the
proportion of training examples that $h$ misclassifies. $\mathcal{R}_S[h]$ is often called the training
error or training loss.
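A minimal sketch of Eq. (2.2) under 0/1 loss, with an invented threshold hypothesis and training sample:

```python
# Empirical risk under 0/1 loss, as in Eq. (2.2): the fraction of training
# examples the hypothesis misclassifies. Hypothesis and sample are invented.
def empirical_risk(h, sample):
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

h = lambda x: "a" if x < 0 else "b"        # toy threshold classifier
S = [(-1.0, "a"), (-2.0, "a"), (0.5, "b"), (1.5, "a")]   # last one is misclassified
```

The classifier gets three of the four examples right, so its training error is 1/4.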
If our set of hypotheses $H$ is large enough, we will be able to find an $h$ that has zero or
very small empirical risk. However, simply selecting the hypothesis with lowest empirical risk,
$$h^* = \arg\min_{h \in H} \mathcal{R}_S[h],$$
is generally not a good idea. For example, if $\mathcal{X} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$ and $H$ includes all polynomials
of degree $m - 1$, we can always find a polynomial $h$ that passes through all the sample
points $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, m$, assuming that all the $x^{(i)}$ are unique. This polynomial is
very likely to overfit the training data; that is, it will have zero empirical risk, but high actual
risk. The key to selecting a good hypothesis is to trade off the complexity of the class $H$ (e.g.
the degree of the polynomial) with the error on the training data as measured by the empirical
risk $\mathcal{R}_S$. For the vast majority of supervised learning algorithms, this fundamental balance is
achieved by minimizing a weighted combination of the two criteria:
$$h^* = \arg\min_{h \in H} \; \big(T[h] + C\,\mathcal{R}_S[h]\big), \qquad (2.3)$$
where $T[h]$ measures the inherent dimension or complexity of $h$, and $C \geq 0$ is a trade-off
parameter. We will not go into the derivation of various complexity measures $T[h]$ here,
but simply adopt standard measures as needed and refer the reader to Vapnik [1995];
Devroye et al. [1996]; Hastie et al. [2001] for details. The term $T[h]$ is often called
regularization.
Depending on the complexity of the class $H$, the search for the optimal $h^*$ in (2.3)
may be a daunting task¹. For many classes, for example decision trees and multi-layer
neural networks, it is intractable [Bishop, 1995; Quinlan, 2001], and we must resort to
approximate, greedy optimization methods. For these intractable classes, the search procedure
used by the learning algorithm is crucial. Below, however, we will concentrate on
models where the optimal $h^*$ can be found efficiently using convex optimization in polynomial
time. Hence, the learning algorithms we consider are completely characterized by
the hypothesis class $H$, the loss function $\ell$, and the regularization $T[h]$.
In general, we consider hypothesis classes of the following parametric form:
$$h_w(x) = \arg\max_{y \in \mathcal{Y}} f(w, x, y), \qquad (2.4)$$
where $f : \mathcal{W} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a scoring function and $w \in \mathcal{W}$ is a set of parameters,
usually with $\mathcal{W} \subseteq \mathbb{R}^n$. We assume that ties in the $\arg\max$ are broken using some arbitrary
but fixed rule. As we discuss below, this class of hypotheses is very rich and includes
many standard models. The formulation in (2.4) of the hypothesis class in terms of an
optimization procedure will become crucial to extending supervised learning techniques to
cases where the output space $\mathcal{Y}$ is more complex.
¹For classification, minimizing the objective with the usual 0/1 training error is generally a very difficult
problem with multiple local optima for most realistic $H$. See the discussion in the next section about approaches
to dealing with 0/1 loss.
Figure 2.1: Handwritten character recognition: sample letters (‘a’ through ‘e’) from the Kassel [1995] data set.
2.1 Classification with generalized linear models
For classification, we consider the generalized linear family of hypotheses $H$. Given $n$
real-valued basis functions $f_j : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, a hypothesis $h_w \in H$ is defined by a set of $n$
coefficients $w_j \in \mathbb{R}$ such that:
$$h_w(x) = \arg\max_{y \in \mathcal{Y}} \sum_{j=1}^{n} w_j f_j(x, y) = \arg\max_{y \in \mathcal{Y}} w^\top f(x, y). \qquad (2.5)$$
Consider the character recognition example in Fig. 2.1. Our input $x$ is a vector of
pixel values of the image and $\mathcal{Y}$ is the alphabet $\{a, \ldots, z\}$. We might have a basis function
$f_j(x, y) = \mathbb{1}(x_{row,col} = \text{on} \wedge y = char)$ for each possible $(row, col)$ and $char \in \mathcal{Y}$,
where $x_{row,col}$ denotes the value of pixel $(row, col)$. Since different letters tend to have
different pixels turned on, this very simple model captures enough information to perform
reasonably well.
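The pixel-indicator model just described can be sketched as follows; the 2x2 “images”, two-letter alphabet, and weights are all invented:

```python
import numpy as np

# Sketch of the generalized linear hypothesis of Eq. (2.5),
# h_w(x) = argmax_y w·f(x, y), with one indicator basis function per
# (pixel, letter) pair over a hypothetical 2x2 image.
ALPHABET = ["a", "b"]
PIXELS = [(r, c) for r in range(2) for c in range(2)]

def f(x, y):
    # indicator: pixel p is on AND the candidate label is the given letter
    return np.array([1.0 if x[p] and y == ch else 0.0
                     for p in PIXELS for ch in ALPHABET])

def h(w, x):
    return max(ALPHABET, key=lambda y: w @ f(x, y))

# Invented weights rewarding pixel (0,0) for 'a' and pixel (1,1) for 'b'.
w = np.zeros(len(PIXELS) * len(ALPHABET))
w[0] = 1.0   # basis function for pixel (0, 0) with letter 'a'
w[7] = 1.0   # basis function for pixel (1, 1) with letter 'b'
```

An image with the top-left pixel on scores highest for ‘a’; one with the bottom-right pixel on scores highest for ‘b’.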
The most common loss for classification is the 0/1 loss. Minimizing the 0/1 risk is generally
a very difficult problem with multiple local optima for any large class $H$. The standard solution
is to minimize a surrogate loss $\bar{\ell}$ that upper-bounds the 0/1 loss, $\bar{\ell}(x, y, h(x)) \geq \ell^{0/1}(x, y, h(x))$. (In addition
to the computational advantages of this approach, there are statistical benefits to minimizing a
convex upper bound [Bartlett et al., 2003].) Two of the primary classification methods we
consider, logistic regression and support vector machines, differ primarily in their choice of
the upper bound on the training 0/1 loss. The regularization $T[h_w]$ for the linear family is
typically a norm of the parameters, $\|w\|_p$ for $p = 1, 2$. Intuitively, a zero or small weight
$w_j$ implies that the hypothesis $h_w$ does not depend on the value of $f_j(x, y)$ and hence is
simpler than an $h_w$ with a large weight $w_j$.
Figure 2.2: 0/1 loss upper-bounded by the log-loss and hinge loss. The horizontal axis shows
$w^\top f(x, y) - \max_{y' \neq y} w^\top f(x, y')$, where $y$ is the correct label for $x$, while the vertical axis
shows the value of the associated loss. The log-loss is shown up to an additive constant for
illustration purposes.
2.2 Logistic regression
In logistic regression, we assign a probabilistic interpretation to the hypothesis $h_w$ as defining
a conditional distribution:
$$P_w(y \mid x) = \frac{1}{Z_w(x)} \exp\{w^\top f(x, y)\}, \qquad (2.6)$$
where $Z_w(x) = \sum_{y \in \mathcal{Y}} \exp\{w^\top f(x, y)\}$. The optimal weights are selected by maximizing
the conditional likelihood of the data (minimizing the log-loss) with some regularization.
This approach is called (regularized) maximum likelihood estimation. Common
choices for regularization are 1-norm or 2-norm regularization on the weights; we use the 2-norm
below:
$$\min \; \frac{1}{2}\|w\|^2 + C\sum_i \big[\log Z_w(x^{(i)}) - w^\top f(x^{(i)}, y^{(i)})\big], \qquad (2.7)$$
where $C$ is a user-specified constant that determines the trade-off between regularization
and likelihood of the data. The log-loss $\log Z_w(x) - w^\top f(x, y)$ is an upper bound (up to a
constant) on the 0/1 loss $\ell^{0/1}$ (see Fig. 2.2).
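A small numerical sketch of the conditional model and log-loss, with invented scores standing in for $w^\top f(x, y)$:

```python
import math

# Sketch of the conditional model of Eq. (2.6): P_w(y|x) ∝ exp(w·f(x, y)).
# The scores below are invented stand-ins for w·f(x, y) on a 3-class example.
scores = {"a": 2.0, "b": 0.5, "c": -1.0}
Z = sum(math.exp(s) for s in scores.values())        # partition function Z_w(x)
P = {y: math.exp(s) / Z for y, s in scores.items()}

# log-loss of Eq. (2.7) for true label 'a': log Z_w(x) - w·f(x, y)
log_loss = math.log(Z) - scores["a"]
```

The probabilities sum to one, the highest-scoring label gets the largest probability, and the log-loss equals the negative log-probability of the true label.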
2.3 Logistic dual and maximum entropy
The objective function is convex in the parameters $w$, so we have an unconstrained (differentiable)
convex optimization problem. The gradient with respect to $w$ is given by:
$$w + C\sum_i \big[\mathbf{E}_{i,w}[f(x^{(i)}, y)] - f(x^{(i)}, y^{(i)})\big] = w - C\sum_i \mathbf{E}_{i,w}[\Delta f_i(y)],$$
where $\mathbf{E}_{i,w}[f(y)] = \sum_y f(y) P_w(y \mid x^{(i)})$ is the expectation under the conditional distribution
$P_w(y \mid x^{(i)})$ and $\Delta f_i(y) = f(x^{(i)}, y^{(i)}) - f(x^{(i)}, y)$. Ignoring the regularization term,
the gradient is zero when the basis function expectations are equal to the basis functions
evaluated on the labels $y^{(i)}$. It can be shown [Cover & Thomas, 1991] that the dual of the
maximum likelihood problem (without regularization) is the maximum entropy problem:
$$\max \; -\sum_{i,y} P_w(y \mid x^{(i)}) \log P_w(y \mid x^{(i)}) \qquad (2.8)$$
$$\text{s.t.} \quad \mathbf{E}_{i,w}[\Delta f_i(y)] = 0, \;\; \forall i.$$
We can interpret logistic regression as trying to match the empirical basis function expectations
while maintaining a high-entropy conditional distribution $P_w(y \mid x)$.
2.4 Support vector machines
Support vector machines [Vapnik, 1995] select the weights based on the “margin” of confidence
of $h_w$. In the multi-class SVM formulation [Weston & Watkins, 1998; Crammer &
Singer, 2001], the margin on example $i$ quantifies by how much the true label “wins” over
the wrong ones:
$$\gamma_i = \frac{1}{\|w\|} \min_{y \neq y^{(i)}} \big[w^\top f(x^{(i)}, y^{(i)}) - w^\top f(x^{(i)}, y)\big] = \frac{1}{\|w\|} \min_{y \neq y^{(i)}} w^\top \Delta f_i(y),$$
where $\Delta f_i(y) = f(x^{(i)}, y^{(i)}) - f(x^{(i)}, y)$. Maximizing the smallest such margin (and allowing
for negative margins) is equivalent to solving the following quadratic program:
$$\min \; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \qquad (2.9)$$
$$\text{s.t.} \quad w^\top \Delta f_i(y) \geq \ell_i^{0/1}(y) - \xi_i, \;\; \forall i, \; \forall y \in \mathcal{Y}.$$
Note that the slack variable $\xi_i$ is effectively constrained to be non-negative in the above program, since
$w^\top \Delta f_i(y^{(i)}) = 0$ and $\ell_i^{0/1}(y^{(i)}) = 0$. We can also express $\xi_i$ as $\max_y [\ell_i^{0/1}(y) - w^\top \Delta f_i(y)]$,
and rewrite the optimization problem Eq. (2.9) in a form similar to Eq. (2.7):
$$\min \; \frac{1}{2}\|w\|^2 + C\sum_i \max_y \big[\ell_i^{0/1}(y) - w^\top \Delta f_i(y)\big]. \qquad (2.10)$$
The hinge loss $\max_y [\ell_i^{0/1}(y) - w^\top \Delta f_i(y)]$ is also an upper bound on the 0/1 loss $\ell^{0/1}$
(see Fig. 2.2).
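A minimal sketch of the hinge loss of Eq. (2.10) on one invented three-class example, verifying that it upper-bounds the 0/1 loss:

```python
import numpy as np

# Multi-class hinge loss of Eq. (2.10): max_y [ℓ_i(y) - w·Δf_i(y)].
# Weights and per-label feature vectors are invented.
w = np.array([1.0, -0.5])
feats = {0: np.array([1.0, 0.0]),        # stands in for f(x^(i), y)
         1: np.array([0.0, 1.0]),
         2: np.array([0.5, 0.5])}
y_true = 0

def hinge(w, feats, y_true):
    values = []
    for y, fy in feats.items():
        zero_one = 0.0 if y == y_true else 1.0
        delta_f = feats[y_true] - fy      # Δf_i(y) = f(x, y^(i)) - f(x, y)
        values.append(zero_one - w @ delta_f)
    return max(values)
```

Here the prediction is correct (0/1 loss is zero), yet the hinge loss is positive because label 2 scores within the loss-scaled margin of the true label.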
2.5 SVM dual and kernels
The form of the dual of Eq. (2.9) is crucial to the efficient solution of the SVM and to the ability to
use a high- or even infinite-dimensional set of basis functions via kernels:
$$\max \; \sum_{i,y} \alpha_i(y)\,\ell_i^{0/1}(y) - \frac{1}{2}\Big\|\sum_{i,y} \alpha_i(y)\,\Delta f_i(y)\Big\|^2 \qquad (2.11)$$
$$\text{s.t.} \quad \sum_y \alpha_i(y) = C, \;\; \forall i; \qquad \alpha_i(y) \geq 0, \;\; \forall i, y.$$
In the dual, the $\alpha_i(y)$ variables correspond to the constraints $w^\top \Delta f_i(y) \geq \ell_i^{0/1}(y) - \xi_i$ in the
primal Eq. (2.9). The solution $\alpha^*$ of the dual gives the solution to the primal as a weighted
combination of basis functions of examples:
$$w^* = \sum_{i,y} \alpha_i^*(y)\,\Delta f_i(y).$$
The pairings of examples and incorrect labels, $(i, y)$, that have non-zero $\alpha_i^*(y)$ are called
support vectors.
An important feature of the dual formulation is that the basis functions $f$ appear only as
dot products. Expanding the quadratic term, we have:
$$\Big\|\sum_{i,y} \alpha_i(y)\,\Delta f_i(y)\Big\|^2 = \sum_{i,y}\sum_{j,\bar{y}} \alpha_i(y)\,\alpha_j(\bar{y})\,\Delta f_i(y)^\top \Delta f_j(\bar{y}).$$
Hence, as long as the dot product $f(x, y)^\top f(\bar{x}, \bar{y})$ can be computed efficiently, we can
solve Eq. (2.11) independently of the actual dimension of $f$. Note that at classification
time, we also do not need to worry about the dimension of $f$, since:
$$w^\top f(x, \bar{y}) = \sum_{i,y} \alpha_i(y)\,\Delta f_i(y)^\top f(x, \bar{y}) = \sum_{i,y} \alpha_i(y)\big[f(x^{(i)}, y^{(i)})^\top f(x, \bar{y}) - f(x^{(i)}, y)^\top f(x, \bar{y})\big].$$
For example, we might have basis functions that are polynomials of degree $d$ in the
image pixels, $f_j(x, y) = \mathbb{1}(x_{row_1,col_1} = \text{on} \wedge \ldots \wedge x_{row_d,col_d} = \text{on} \wedge y = char)$ for each
possible $(row_1, col_1), \ldots, (row_d, col_d)$ and $char \in \mathcal{Y}$. This polynomial kernel
can be computed in time independent of the number of basis functions, even though that number
grows exponentially with $d$ [Vapnik, 1995].
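The kernel trick for a quadratic kernel can be sketched directly: for $k(x, z) = (x^\top z)^2$, the implicit feature map lists all pairwise products $x_a x_b$, and the explicit dot product matches the kernel value. The vectors below are invented.

```python
import numpy as np
from itertools import product

# Quadratic kernel k(x, z) = (x·z)^2 and its explicit feature map: the
# features are all pairwise products x_a * x_b, so an n-dimensional input
# has n^2 explicit features, yet the kernel costs only O(n) to evaluate.
def phi(x):
    return np.array([x[a] * x[b]
                     for a, b in product(range(len(x)), repeat=2)])

x = np.array([1.0, 2.0, -1.0])
z = np.array([0.5, 1.0, 3.0])
k_implicit = (x @ z) ** 2        # O(n) work
k_explicit = phi(x) @ phi(z)     # O(n^2) features, same value
```

The two computations agree, which is why the dual in Eq. (2.11) never needs the explicit feature vectors.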
In fact, logistic regression can also be kernelized. However, the hinge loss formulation
usually produces sparse solutions in terms of the number of support vectors, while solutions
to the corresponding kernelized log-loss problem are generally non-sparse (all examples
are support vectors) and require approximations for even relatively small datasets [Wahba
et al., 1993; Zhu & Hastie, 2001].
Chapter 3
Structured models
Consider once more the problem of character recognition. In fact, a more natural and useful
task is recognizing words and entire sentences. Fig. 3.1 shows an example handwritten
word “brace.” Distinguishing between the second letter and fourth letter (‘r’ and ‘c’) in
isolation is far from trivial, but in the context of the surrounding letters that together form
a word, this task is much less error-prone for humans and should be for computers as well.
In this chapter, we consider prediction problems in which the output is not a single discrete value y, but a set of values y = (y_1, …, y_L), for example an entire sequence of L characters. For concreteness, let the number of variables L be fixed. The output space 𝒴 ⊆ 𝒴_1 × … × 𝒴_L we consider is a subset of the product of the output spaces of the single variables. In word recognition, each 𝒴_j is the alphabet, while 𝒴 is the dictionary. This joint output space is often a proper subset of the product of singleton output spaces, 𝒴 ⊂ 𝒴_1 × … × 𝒴_L. In word recognition, we might restrict that the letter ‘q’ never follows the letter ‘z’ in English. In addition to such “hard” constraints, the output variables are often highly correlated, e.g. consecutive letters in a word. We refer to joint spaces with constraints and correlations as structured. We call problems with discrete output spaces structured classification or structured prediction. The structured models we consider in this chapter (and thesis) predict the outputs jointly, respecting the constraints and exploiting the correlations in the output space.
The range of prediction problems these broad definitions encompass is immense, arising in fields as diverse as natural language analysis, machine vision, and computational

Figure 3.1: Handwritten word recognition: sample from Kassel [1995] data set.
biology, to name a few. The class of structured models H we consider is essentially of the same form as in the previous chapter, except that y has been replaced by y:

    h_w(x) = arg max_{y : g(x,y) ≤ 0} w^T f(x, y),    (3.1)

where, as before, f(x, y) is a vector of functions f : 𝒳 × 𝒴 → ℝ^n. The output space 𝒴 = {y : g(x, y) ≤ 0} is defined using a vector of functions g(x, y) that define the constraints, where g : 𝒳 × 𝒴 → ℝ^k. This formulation is very general. Clearly, for many f, g pairs, finding the optimal y is intractable. For the most part, we will restrict our attention to models where this optimization problem can be solved in polynomial time. This includes, for example, probabilistic models like Markov networks (in certain cases) and context-free grammars; combinatorial optimization problems like min-cut and matching; and convex optimization such as linear, quadratic and semidefinite programming. In other cases, like intractable Markov networks (Ch. 8) and correlation clustering (Ch. 11), we use an approximate polynomial-time optimization procedure.
3.1 Probabilistic models: generative and conditional
The term model is often reserved for probabilistic models, which can be subdivided into generative and conditional with respect to the prediction task. A generative model assigns a normalized joint density p(x, y) to the input and output space 𝒳 × 𝒴 with

    p(x, y) ≥ 0,    Σ_{y∈𝒴} ∫_{x∈𝒳} p(x, y) = 1.

A conditional model assigns a normalized density p(y | x) only over the output space 𝒴 with

    p(y | x) ≥ 0,    Σ_{y∈𝒴} p(y | x) = 1,  ∀x ∈ 𝒳.
A probabilistic interpretation of the model offers well-understood semantics and an immense toolbox of methods for inference and learning. It also provides an intuitive measure of confidence in the predictions of a model in terms of conditional probabilities. In addition, generative models are typically structured to allow very efficient maximum likelihood learning. A very common class of generative models is the exponential family:

    p(x, y) ∝ exp{w^T f(x, y)}.

For exponential families, the maximum likelihood parameters w with respect to the joint distribution can be computed in closed form using the empirical basis function expectations E_S[f(x, y)] [DeGroot, 1970; Hastie et al., 2001].
Of course, this efficiency comes at a price. Any model is an approximation to the true distribution underlying the data. A generative model must make simplifying assumptions (more precisely, independence assumptions) about the entire p(x, y), while a conditional model makes many fewer assumptions by focusing on p(y | x). Because of this, by optimizing the model to fit the joint distribution p(x, y), we may be tuning the approximation away from the optimal conditional distribution p(y | x), which we use to make the predictions. Given sufficient data, the conditional model will learn the best approximation to p(y | x) possible using w, while the generative model p(x, y) will not necessarily do so. Typically, however, generative models need fewer samples to converge to a good estimate of the joint distribution than conditional models need to accurately represent the conditional distribution. In a regime with very few training samples (relative to the number of parameters w), generative models may actually outperform conditional models [Ng & Jordan, 2001].
3.2 Prediction models: normalized and unnormalized
Probabilistic semantics are certainly not necessary for a good predictive model if we are simply interested in the optimal prediction (the arg max in Eq. (3.1)). As we discussed in the previous chapter, support vector machines, which do not represent a conditional distribution, typically perform as well as or better than logistic regression [Vapnik, 1995; Cristianini & Shawe-Taylor, 2000].
In general, we can often achieve higher-accuracy models when we do not learn a normalized distribution over the outputs, but concentrate on the margin or decision boundary, the difference between the optimal y and the rest. Even more importantly, in many cases we discuss below, normalizing the model (summing over the entire 𝒴) is intractable, while the optimal y can be found in polynomial time. This fact makes standard maximum likelihood estimation infeasible. The learning methods we advocate in this thesis circumvent this problem by requiring only the maximization problem to be tractable. We still heavily rely on the representation and inference tools familiar from probabilistic models for the construction of and prediction in unnormalized models, but largely dispense with the probabilistic interpretation when needed. Essentially, we use the term model very broadly, to include any scheme that assigns scores to the output space 𝒴 and has a procedure for finding the optimal scoring y.
In this chapter, we review basic concepts in probabilistic graphical models called Markov networks or Markov random fields. We also briefly touch upon examples of context-free grammars and combinatorial problems, which will be explained in greater detail in Part III, to illustrate the range of prediction problems we address.
3.3 Markov networks
Markov networks provide a framework for a rich family of models for both discrete and continuous prediction [Pearl, 1988; Cowell et al., 1999]. These models treat the inputs and outputs as random variables X with domain 𝒳 and Y with domain 𝒴 and compactly define a conditional density p(Y | X) or distribution P(Y | X) (we concentrate here on conditional Markov networks, or CRFs [Lafferty et al., 2001]). The advantage of a graphical framework is that it can exploit sparseness in the correlations between the outputs Y. The graphical structure of the model encodes the qualitative aspects of the distribution: direct dependencies as well as conditional independencies. The quantitative aspect of the model is defined by the potentials that are associated with the nodes and cliques of the graph. Before giving a formal definition, consider a first-order Markov chain as a model for the word recognition task. In Fig. 3.2, the nodes are associated with output variables Y_i and the edges correspond to direct dependencies or correlations. We do not explicitly represent the inputs X in the figure. For example, the model encodes that Y_j is conditionally independent of the rest of the variables given Y_{j−1}, Y_{j+1}. Intuitively, adjacent letters in a word are highly correlated, but the first-order model is making the assertion (which is certainly an approximation) that once the value of a letter Y_j is known, the correlation between a letter Y_b before j and a letter Y_a after j is negligible. More precisely, we use a model where
    P(Y_b | Y_j, Y_a, x) = P(Y_b | Y_j, x),    P(Y_a | Y_j, Y_b, x) = P(Y_a | Y_j, x),    b < j < a.
For the purposes of finding the most likely y, this conditional independence property means that the optimization problem is decomposable: given that Y_j = y_j, it suffices to separately find the optimal subsequence from 1 to j ending with y_j, and the optimal subsequence from j to L starting with y_j.
3.3.1 Representation
The structure of a Markov network is defined by an undirected graph G = (V, E), where the nodes are associated with variables V = {Y_1, …, Y_L}. A clique is a set of nodes c ⊆ V that form a fully connected subgraph (every two nodes are connected by an edge). Note that each subclique of a clique is also a clique, and we consider each node a singleton clique. In the chain network in Fig. 3.2, the cliques are simply the nodes and the edges: C(G) = {{Y_1}, …, {Y_5}, {Y_1, Y_2}, …, {Y_4, Y_5}}. We denote the set of variables in a clique c as Y_c, an assignment to the variables in the clique as y_c, and the space of all assignments to the clique as 𝒴_c. We focus on discrete output spaces 𝒴 below, but many of the same representation and inference concepts translate to continuous domains. No assumption is made about 𝒳.

Figure 3.2: First-order Markov chain: φ_i(Y_i) are node potentials, φ_{i,i+1}(Y_i, Y_{i+1}) are edge potentials (dependence on x is not shown).
Definition 3.3.1 A Markov network is defined by an undirected graph G = (V, E) and a set of potentials Φ = {φ_c}. The nodes are associated with variables V = {Y_1, …, Y_L}. Each clique c ∈ C(G) is associated with a potential φ_c(x, y_c) with φ_c : 𝒳 × 𝒴_c → ℝ_+, which specifies a nonnegative value for each assignment y_c to the variables in Y_c and any input x. The Markov network (G, Φ) defines a conditional distribution:

    P(y | x) = (1 / Z(x)) ∏_{c∈C(G)} φ_c(x, y_c),

where C(G) is the set of all the cliques of the graph and Z(x) is the partition function given by Z(x) = Σ_{y∈𝒴} ∏_{c∈C(G)} φ_c(x, y_c).
In our example in Fig. 3.2, we have node and edge potentials. Intuitively, the node potentials quantify the correlation between the input x and the value of the node, while the edge potentials quantify the correlation between a pair of adjacent output variables as well as the input x. Potentials do not have a local probabilistic interpretation, but can be thought of as defining an unnormalized score for each assignment in the clique. Conditioned on the image input, appropriate node potentials in our network should give high scores to the correct letters (‘b’, ‘r’, ‘a’, ‘c’, ‘e’), though perhaps there would be some ambiguity with the second and fourth letters. For simplicity, assume that the edge potentials do not depend on the images, but simply give high scores to pairs of letters that tend to appear often consecutively. Multiplied together, these scores should favor the correct output “brace”.
In fact, a Markov network is a generalized log-linear model, since the potentials φ_c(x_c, y_c) can be represented (in log-space) as a sum of basis functions over x, y_c:

    φ_c(x_c, y_c) = exp{ Σ_{k=1}^{n_c} w_{c,k} f_{c,k}(x, y_c) } = exp{ w_c^T f_c(x, y_c) },

where n_c is the number of basis functions for the clique c. Hence the log of the conditional probability is given by:

    log P(y | x) = Σ_{c∈C(G)} w_c^T f_c(x, y_c) − log Z_w(x).
In the case of node potentials for word recognition, we could use the same basis functions as for individual character recognition: f_{j,k}(x, y_j) = 𝟙(x_{j,row,col} = on ∧ y_j = char), for each possible (row, col) in x_j, the window of the image that corresponds to letter j, and each char ∈ 𝒴_j (we assume the input has been segmented into images x_j that correspond to letters). In general, we condition a clique only on a portion of the input x, which we denote as x_c. For the edge potentials, we can define basis functions for each combination of letters (assuming for simplicity no dependence on x): f_{j,j+1,k}(x, y_j, y_{j+1}) = 𝟙(y_j = char_1 ∧ y_{j+1} = char_2), for each char_1 ∈ 𝒴_j and char_2 ∈ 𝒴_{j+1}. In this problem (as well as many others), we are likely to “tie” or “share” the parameters of the model w_c across cliques. Usually, all single-node potentials share the same weights and basis functions (albeit the relevant portion of the input x_c differs), and similarly for the pairwise cliques, no matter in what position they appear in the sequence.¹
With slight abuse of notation, we stack all basis functions into one vector f. For the sequence model, f has node functions and edge functions, so when c is a node, the edge functions in f(x_c, y_c) are defined to evaluate to zero. Similarly, when c is an edge, the node functions in f(x_c, y_c) are also defined to evaluate to zero. Now we can write:

    f(x, y) = Σ_{c∈C(G)} f(x_c, y_c).

¹ Sometimes we might actually want some dependence on the position in the sequence, which can be accomplished by adding more basis functions that condition on the position of the clique.
We stack the weights in the corresponding manner, so the most likely assignment according to the model is given by:

    arg max_{y∈𝒴} log P_w(y | x) = arg max_{y∈𝒴} w^T f(x, y),

in the same form as Eq. (3.1).
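The tied node/edge parameterization and the decomposition f(x, y) = Σ_c f(x_c, y_c) can be sketched in Python as follows (an illustration with invented weight layouts and helper names, not code from the thesis).

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def node_score(w_node, x_img, char):
    # Tied node weights: one weight vector per character, shared across positions.
    # x_img is the (binarized) pixel window for one letter.
    w = w_node[ALPHABET.index(char)]
    return sum(wi * xi for wi, xi in zip(w, x_img))

def edge_score(w_edge, c1, c2):
    # Tied edge weights: one weight per character pair, shared across positions.
    return w_edge[(c1, c2)]

def sequence_score(w_node, w_edge, images, word):
    # w^T f(x, y) decomposes into a sum of clique scores: node cliques plus edge cliques.
    s = sum(node_score(w_node, img, ch) for img, ch in zip(images, word))
    s += sum(edge_score(w_edge, a, b) for a, b in zip(word, word[1:]))
    return s

# Toy instance: 2-pixel "images", uniform made-up weights.
w_node = [[1.0, 0.0] for _ in ALPHABET]
w_edge = {(a, b): 0.5 for a in ALPHABET for b in ALPHABET}
images = [[1.0, 0.0], [0.0, 1.0]]
score = sequence_score(w_node, w_edge, images, "ab")
```

Because the score is a plain sum over cliques, any decoder (such as Viterbi for chains) only ever needs these local scores.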
3.3.2 Inference
There are several important questions that can be answered by probabilistic models. The task of finding the most likely assignment, known as maximum a-posteriori (MAP) or most probable explanation (MPE), is just one such question, but the one most relevant to our discussion. The Viterbi dynamic programming algorithm solves this problem for chain networks in O(L) time. Let the highest score of any subsequence from 1 to k > 1 ending with value y_k be defined as

    φ*_k(y_k) = max_{y_{1..k−1}} ∏_j φ_j(x, y_j) φ_j(x, y_{j−1}, y_j).
The algorithm computes the highest scores recursively:

    φ*_1(y_1) = φ_1(x, y_1),  ∀y_1 ∈ 𝒴_1;
    φ*_k(y_k) = max_{y_{k−1}∈𝒴_{k−1}} φ*_{k−1}(y_{k−1}) φ_k(x, y_k) φ_k(x, y_{k−1}, y_k),  1 < k ≤ L, ∀y_k ∈ 𝒴_k.

The highest scoring sequence has score max_{y_L} φ*_L(y_L). Using the arg max's of the max's in the computation of φ*, we can backtrace the highest scoring sequence itself. We assume that score ties are broken in a predetermined way, say according to some lexicographic order of the symbols.
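The recursion above can be sketched directly in Python (a hedged illustration, not the thesis implementation). Here the potentials are given as log-potentials, so the products in the recursion become sums.

```python
def viterbi(node_pot, edge_pot):
    # node_pot[k]: {label: log-potential} for position k;
    # edge_pot[k]: {(l1, l2): log-potential} for the edge between positions k and k+1.
    # Working in log-space turns the products of the recursion into sums.
    L = len(node_pot)
    best = [dict(node_pot[0])]   # best[k][y] = highest score of a prefix ending in y
    back = []                    # backpointers for recovering the arg max
    for k in range(1, L):
        cur, ptr = {}, {}
        for y in node_pot[k]:
            prev = max(node_pot[k - 1],
                       key=lambda yp: best[-1][yp] + edge_pot[k - 1][(yp, y)])
            cur[y] = best[-1][prev] + edge_pot[k - 1][(prev, y)] + node_pot[k][y]
            ptr[y] = prev
        best.append(cur)
        back.append(ptr)
    y_last = max(best[-1], key=best[-1].get)
    seq = [y_last]
    for ptr in reversed(back):   # backtrace from the last position
        seq.append(ptr[seq[-1]])
    return list(reversed(seq)), best[-1][y_last]
```

On a chain with |𝒴_j| labels per position, this runs in O(L · |𝒴_j|²) time, matching the O(L) claim for fixed alphabet size.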
Figure 3.3: Diamond Markov network (added triangulation edge is dashed).
In general Markov networks, MAP inference is NP-hard [Cowell et al., 1999]. However, there are several important subclasses of networks that allow polynomial-time inference. The most important of these is the class of networks with low treewidth. We need the concept of triangulation (or chordality) to formally define treewidth. Recall that a cycle of length l in an undirected graph G is a sequence of nodes (v_0, v_1, …, v_l), distinct except that v_0 = v_l, which are connected by edges (v_i, v_{i+1}) ∈ E. A chord of this cycle is an edge (v_i, v_j) ∈ E between non-consecutive nodes.
Definition 3.3.2 (Triangulated graph) An undirected graph G is triangulated if every one of its cycles of length ≥ 4 possesses a chord.
Singly-connected graphs, like chains and trees, are triangulated since they contain no cycles. The simplest untriangulated network is the diamond in Fig. 3.3. To triangulate it, we can add the edge (Y_1, Y_3) or (Y_2, Y_4). In general, there are many possible sets of edges that can be added to triangulate a graph. The inference procedure creates a tree of cliques using the graph augmented by triangulation. The critical property of a triangulation for the inference procedure is the size of the largest clique.

Definition 3.3.3 (Treewidth of a graph) The treewidth of a triangulated graph G is the size of its largest clique minus 1. The treewidth of an untriangulated graph G is the minimum treewidth over all triangulations of G.
The treewidth of a chain or a tree is 1, and the treewidth of the network in Fig. 3.3 is 2. Finding the minimum-treewidth triangulation of a general graph is NP-hard, but good heuristic algorithms exist [Cowell et al., 1999].

The inference procedure is based on a data structure called a junction tree that can be constructed for a triangulated graph. The junction tree is an alternative representation of the same distribution that allows simple dynamic programming inference similar to the Viterbi algorithm for chains.
Definition 3.3.4 (Junction tree) A junction tree T = (V, E) for a triangulated graph G is a tree in which the nodes are a subset of the cliques of the graph, V ⊆ C(G), and the edges E satisfy the running intersection property: for any two cliques c and c′, the variables in the intersection c ∩ c′ are contained in the clique of every node of the tree on the (unique) path between c and c′.
Fig. 3.4 shows a junction tree for the diamond network. Each of the original clique potentials must be associated with exactly one node in the junction tree. For example, the potentials for the {Y_1, Y_3, Y_4} and {Y_1, Y_2, Y_3} nodes are the products of the associated clique potentials:

    φ_134(Y_1, Y_3, Y_4) = φ_1(Y_1) φ_4(Y_4) φ_14(Y_1, Y_4) φ_34(Y_3, Y_4),
    φ_123(Y_1, Y_2, Y_3) = φ_2(Y_2) φ_3(Y_3) φ_12(Y_1, Y_2) φ_23(Y_2, Y_3).

Algorithms for constructing junction trees from triangulated graphs are described in detail in Cowell et al. [1999].
The Viterbi algorithm for junction trees picks an arbitrary root r for the tree T and proceeds recursively from the leaves, computing the highest scoring subtree at a node by combining the highest scoring subtrees of its children. We denote the leaves of the tree as Lv(T) and the children of node c (relative to the root r) as Ch_r(c):

    φ*_l(y_l) = φ_l(x, y_l),  ∀l ∈ Lv(T), ∀y_l ∈ 𝒴_l;
    φ*_c(y_c) = φ_c(x, y_c) ∏_{c′∈Ch_r(c)} max_{y_{c′} ∼ y_c} φ*_{c′}(y_{c′}),  ∀c ∈ V(T) \ Lv(T), ∀y_c ∈ 𝒴_c,
Figure 3.4: Diamond network junction tree. Each of the original potentials is associated
with a node in the tree.
where y_{c′} ∼ y_c denotes that the partial assignment y_{c′} is consistent with the partial assignment y_c on the variables in the intersection of c and c′. The highest score is given by max_{y_r} φ*_r(y_r). Using the arg max's of the max's in the computation of φ*, we can backtrace the highest scoring assignment itself. Note that this algorithm is exponential in the size of the largest clique, i.e., in the treewidth. Similar computations using the junction tree can be used to compute the partition function Z_w(x) (by simply replacing max by Σ) as well as the marginal probabilities P(y_c | x) for the cliques of the graph [Cowell et al., 1999].
3.3.3 Linear programming MAP inference
In this section, we present an alternative inference method based on linear programming. Although solving MAP inference using a general LP solver is less efficient than the dynamic programming algorithms above, this formulation is crucial for viewing Markov networks in the unified framework of the structured models we consider, and for our development of common estimation methods in later chapters. Let us begin with a linear integer program to compute the optimal assignment y. We represent an assignment as a set of binary variables μ_c(y_c), one for each clique c and each value y_c of the clique, that denote whether the assignment has that value, such that:

    log ∏_c φ_c(x, y_c) = Σ_{c,y_c} μ_c(y_c) log φ_c(x, y_c).
Figure 3.5: Example of marginal agreement: row sums of μ_12(y_1, y_2) agree with μ_1(y_1), column sums agree with μ_2(y_2).
We call these variables marginals, as they correspond to the marginals of a distribution that has all of its mass centered on the MAP instantiation (assuming it is unique). There are several elementary constraints that such marginals satisfy. First, they must sum to one for each clique. Second, the marginals for cliques that share variables must be consistent: for any clique c ∈ C(G) and subclique s ⊂ c, the assignment of the subclique, μ_s(y_s), must be consistent with the assignment of the clique, μ_c(y_c). Together, these constraints define a linear integer program:

    max Σ_{c,y_c} μ_c(y_c) log φ_c(x, y_c)    (3.2)
    s.t. Σ_{y_c} μ_c(y_c) = 1, ∀c ∈ C(G);    μ_c(y_c) ∈ {0, 1}, ∀c ∈ C(G), ∀y_c;
         μ_s(y_s) = Σ_{y′_c ∼ y_s} μ_c(y′_c), ∀s, c ∈ C(G), s ⊂ c, ∀y_s.

For example, when the network is a chain or a tree, we will have node and edge marginals that sum to 1 and agree with each other as in Fig. 3.5.
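To see why the linear objective reproduces the log score, the following self-contained sketch (with made-up potentials; brute-force enumeration stands in for an actual LP solver) builds the indicator marginals for every assignment of a 3-node binary chain and checks that the objective of Eq. (3.2) equals log ∏_c φ_c(x, y_c).

```python
import math
from itertools import product

# Toy chain over 3 binary variables; node and edge potentials are hypothetical values.
node = [{0: 1.0, 1: 2.0}, {0: 1.5, 1: 0.5}, {0: 1.0, 1: 3.0}]
edge = [{(a, b): 1.0 + (a == b) for a in (0, 1) for b in (0, 1)} for _ in range(2)]

def log_score(y):
    # log prod_c phi_c(x, y_c) for the chain cliques (nodes and edges).
    s = sum(math.log(node[j][y[j]]) for j in range(3))
    s += sum(math.log(edge[j][(y[j], y[j + 1])]) for j in range(2))
    return s

def linear_objective(y):
    # Indicator marginals mu_c(y_c) = 1 iff assignment y takes value y_c on clique c;
    # the LP objective sum_{c,y_c} mu_c(y_c) log phi_c(x, y_c) then equals log_score(y).
    obj = 0.0
    for j in range(3):
        for v, p in node[j].items():
            obj += (v == y[j]) * math.log(p)
    for j in range(2):
        for (a, b), p in edge[j].items():
            obj += (a == y[j] and b == y[j + 1]) * math.log(p)
    return obj

for y in product((0, 1), repeat=3):
    assert abs(log_score(y) - linear_objective(y)) < 1e-12
map_y = max(product((0, 1), repeat=3), key=log_score)
```

Maximizing the linear objective over integer marginals is therefore the same problem as MAP; the LP relaxation of Lemma 3.3.5 drops only the integrality constraint.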
Clearly, for any assignment y′, we can define μ_c(y_c) variables that satisfy the above constraints by setting μ_c(y_c) = 𝟙(y′_c = y_c). We can also show that the converse is true: any valid setting of the μ_c(y_c) corresponds to a valid assignment y. In fact,
Lemma 3.3.5 For a triangulated network with a unique MAP assignment, the integrality constraint in the integer program in Eq. (3.2) can be relaxed, and the resulting LP is guaranteed to have integer solutions.

A proof of this lemma appears in Wainwright et al. [2002]. Intuitively, the constraints force the marginals μ_c(y_c) to correspond to some valid joint distribution over the assignments. The optimal distribution with respect to the objective puts all its mass on the MAP assignment. If the MAP assignment is not unique, the value of the LP is the same as the value of the integer program, and any convex combination of the MAP assignments maximizes the LP.
When the network is not triangulated, the set of marginals is not guaranteed to represent a valid distribution. Consider, for example, the diamond network in Fig. 3.3 with binary variables, with the following edge marginals, which are consistent with the constraints:

    μ_12(0, 0) = μ_12(1, 1) = 0.5,  μ_12(1, 0) = μ_12(0, 1) = 0;
    μ_23(0, 0) = μ_23(1, 1) = 0.5,  μ_23(1, 0) = μ_23(0, 1) = 0;
    μ_34(0, 0) = μ_34(1, 1) = 0.5,  μ_34(1, 0) = μ_34(0, 1) = 0;
    μ_14(0, 0) = μ_14(1, 1) = 0,    μ_14(1, 0) = μ_14(0, 1) = 0.5.

The corresponding node marginals must all be set to 0.5. Note that the edge marginals for (1, 2), (2, 3), (3, 4) disallow any assignment other than 0000 or 1111, but the edge marginal for (1, 4) disallows any assignment that has Y_1 = Y_4. Hence this set of marginals disallows all assignments. If we triangulate the graph and add the cliques {Y_1, Y_2, Y_3} and {Y_1, Y_3, Y_4} with their corresponding constraints, the above marginals will be disallowed.
In graphs where triangulation produces very large cliques, exact inference is intractable.
We can resort to the above LP without triangulation as an approximate inference procedure
(augmented with some procedure for rounding possibly fractional solutions). In Ch. 7, we
discuss another subclass of networks where MAP inference using LPs is tractable for any
network topology, but with a restricted type of potentials.
Figure 3.6: Example parse tree from Penn Treebank [Marcus et al., 1993].
3.4 Context free grammars
Context-free grammars are one of the primary formalisms for capturing the recursive structure of syntactic constructions [Manning & Schütze, 1999]. For example, Fig. 3.6 shows a parse tree for the sentence The screen was a sea of red. This tree is from the Penn Treebank [Marcus et al., 1993], a primary linguistic resource of expert-annotated English text. The non-terminal symbols (labels of internal nodes) correspond to syntactic categories such as noun phrase (NP), verb phrase (VP) or prepositional phrase (PP) and part-of-speech tags like nouns (NN), verbs (VBD), determiners (DT) and prepositions (IN). The terminal symbols (leaves) are the words of the sentence.

For clarity of presentation, we restrict our grammars to be in Chomsky normal form (CNF)², where all rules in the grammar are of the form A → B C or A → D, where A, B and C are non-terminal symbols, and D is a terminal symbol.
Definition 3.4.1 (CFG) A CFG G consists of:
◦ A set of non-terminal symbols, 𝒩
◦ A designated set of start symbols, 𝒩_S ⊆ 𝒩

² Any CFG can be represented by another CFG in CNF that generates the same set of sentences.
◦ A set of terminal symbols, 𝒯
◦ A set of productions, 𝒫 = {𝒫_B, 𝒫_U}, divided into:
  Binary productions, 𝒫_B = {A → B C : A, B, C ∈ 𝒩} and
  Unary productions, 𝒫_U = {A → D : A ∈ 𝒩, D ∈ 𝒯}.
Consider a very simple grammar:
◦ 𝒩 = {S, NP, VP, PP, NN, VBD, DT, IN}
◦ 𝒩_S = {S}
◦ 𝒯 = {The, the, cat, dog, tree, saw, from}
◦ 𝒫_B = {S → NP VP, NP → DT NN, NP → NP PP, VP → VBD NP, VP → VP PP, PP → IN NP}
◦ 𝒫_U = {DT → The, DT → the, NN → cat, NN → dog, NN → tree, VBD → saw, IN → from}
A grammar generates a sentence by starting with a symbol in 𝒩_S and applying the productions in 𝒫 to rewrite non-terminal symbols. For example, we can generate The cat saw the dog by starting with S → NP VP, rewriting the NP as NP → DT NN with DT → The and NN → cat, then rewriting the VP as VP → VBD NP with VBD → saw, again using NP → DT NN, but now with DT → the and NN → dog. We can represent such derivations using trees like the one in Fig. 3.6 or (more compactly) using bracketed expressions like the one below:

    [[The_DT cat_NN]_NP [saw_VBD [the_DT dog_NN]_NP]_VP]_S.
The simple grammar above can generate sentences of arbitrary length, since it has several recursive productions. It can also generate the same sentence in several ways. In general, there are exponentially many parse trees for a sentence of length l. Consider the sentence The cat saw the dog from the tree. The likely analysis of the sentence is that the cat, sitting in the tree, saw the dog. An unlikely but possible alternative is that the cat actually saw the dog who lived near the tree or was tied to it in the past. Our grammar allows both interpretations, with the difference being in the analysis of the top-level VP:

    [[saw_VBD [the_DT dog_NN]_NP]_VP [from_IN [the_DT tree_NN]_NP]_PP]_VP,
    [saw_VBD [[the_DT dog_NN]_NP [from_IN [the_DT tree_NN]_NP]_PP]_NP]_VP.
This kind of ambiguity, called prepositional attachment ambiguity, is very common in realistic grammars. A standard approach to resolving ambiguity is to use a PCFG to define a joint probability distribution over the space of parse trees 𝒴 and sentences 𝒳. Standard PCFG parsers use a Viterbi-style algorithm to compute arg max_y P(x, y) as the most likely parse tree for a sentence x. The distribution P(x, y) is defined by assigning a probability to each production and making sure that the probabilities of the productions starting with each symbol sum to 1:

    Σ_{B,C : A→B C ∈ 𝒫_B} P(A → B C) = 1,    Σ_{D : A→D ∈ 𝒫_U} P(A → D) = 1,    ∀A ∈ 𝒩.

We also need to assign a probability P(A) to each starting symbol A ∈ 𝒩_S such that Σ_{A∈𝒩_S} P(A) = 1. The probability of a tree is simply the product of the probabilities of the productions used in the tree (times the probability of the starting symbol). Hence the log-probability of a tree is a sum of the log-probabilities of its productions. By letting our basis functions f(x, y) consist of the counts of the productions and w be their log-probabilities, we can cast a PCFG as a structured linear model (in log space). In Ch. 9, we will show how to represent a parse tree as an assignment to variables Y with appropriate constraints to express PCFGs (and more generally weighted CFGs) in the form of Eq. (3.1) as

    h_w(x) = arg max_{y : g(x,y) ≤ 0} w^T f(x, y),

and describe the associated algorithm to compute the highest scoring parse tree y given a sentence x.
3.5 Combinatorial problems
Many important computational tasks are formulated as combinatorial optimization prob
lems such as the maximum weight bipartite and perfect matching, spanning tree, graphcut,
edgecover, binpacking, and many others [Lawler, 1976; Papadimitriou & Steiglitz, 1982;
Cormen et al., 2001]. These problems arise in applications such as resource allocation,
job assignment, routing, scheduling, network design and many more. In some domains,
the weights of the objective function in the optimization problem are simple and natural
to deﬁne (for example, Euclidian distance or temporal latency), but in many others, con
structing the weights is an important and laborintensive design task. Treated abstractly, a
combinatorial space of structures, such as matchings or graphcuts or trees), together with
a scoring scheme that assigns weights to candidate outputs is a kind of a model.
As a particularly simple and relevant example, consider modeling the task of assigning
reviewers to papers as a maximum weight bipartite matching problem, where the weights
represent the “expertise” of each reviewer for each paper. More speciﬁcally, suppose we
would like to have R reviewers per paper, and that each reviewer be assigned at most P pa
pers. For each paper and reviewer, we have an a weight q
jk
indicating the qualiﬁcation level
of reviewer j for evaluating paper k. Our objective is to ﬁnd an assignment for reviewers
to papers that maximizes the total weight. We represent a matching with a set of binary
variables y
jk
that take the value 1 if reviewer j is assigned to paper k, and 0 otherwise. The
bipartite matching problem can be solved using a combinatorial algorithm or the following
linear program:
    max Σ_{j,k} μ_jk q_jk    (3.3)
    s.t. Σ_j μ_jk = R, ∀k;    Σ_k μ_jk ≤ P, ∀j;    0 ≤ μ_jk ≤ 1.

This LP is guaranteed to produce integer solutions (as long as P and R are integers) for any weights q(y) [Nemhauser & Wolsey, 1999].
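For intuition, the following stand-alone sketch (toy data; brute-force search stands in for an LP or combinatorial solver) enumerates the feasible 0/1 matchings for a tiny reviewer-assignment instance and picks the best one. The scores q are invented for the example.

```python
from itertools import product

# 3 reviewers, 2 papers; R = 1 reviewer per paper, each reviewer gets at most P = 1 paper.
q = [[3.0, 1.0],   # q[j][k]: qualification of reviewer j for paper k (made-up)
     [2.0, 4.0],
     [5.0, 2.0]]
R, P = 1, 1

def feasible(mu):
    # Column constraints: exactly R reviewers per paper.
    papers_ok = all(sum(mu[j][k] for j in range(3)) == R for k in range(2))
    # Row constraints: at most P papers per reviewer.
    reviewers_ok = all(sum(mu[j][k] for k in range(2)) <= P for j in range(3))
    return papers_ok and reviewers_ok

def total(mu):
    # The linear objective sum_{j,k} mu_jk * q_jk of Eq. (3.3).
    return sum(mu[j][k] * q[j][k] for j in range(3) for k in range(2))

candidates = ([list(bits[2 * j: 2 * j + 2]) for j in range(3)]
              for bits in product((0, 1), repeat=6))
best = max((m for m in candidates if feasible(m)), key=total)
```

For realistic sizes one would of course hand Eq. (3.3) to an LP solver rather than enumerate; the point here is only that the matching objective is linear in the μ variables.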
The quality of the solution found depends critically on the choice of weights that define the objective. A simple scheme could measure the “expertise” as the percentage of word overlap between the reviewer's home page and the paper's abstract. However, we would want to weight certain words much more heavily (words that are relevant to the subject and infrequent). Constructing and tuning the weights for a problem is a difficult and time-consuming process, just as it is for Markov networks for handwriting recognition.
As usual, we will represent the objective q(y) as a weighted combination of a set of basis functions, w^T f(x, y). Let x_jk = webpage(j) ∩ abstract(k) denote the intersection of the set of words occurring in the web page of reviewer j and the abstract of paper k. We can define f_d(x, y) = Σ_{jk} y_jk 𝟙(word_d ∈ x_jk), the number of times word d appears in both the web page of a reviewer and the abstract of the paper that are matched in y. Then the score q_jk is simply q_jk = Σ_d w_d 𝟙(word_d ∈ x_jk), a weighted combination of overlapping words. In the next chapter, we will show how to learn the parameters w in much the same way we learn the parameters w of a Markov network.
The space of bipartite matchings illustrates an important property of many structured spaces: the maximization problem arg max_{y∈𝒴} w^T f(x, y) is easier than the normalization problem Σ_{y∈𝒴} exp{w^T f(x, y)}. The maximum weight bipartite matching can be found in time polynomial (cubic) in the number of nodes in the graph using a combinatorial algorithm. However, even simply counting the number of matchings is #P-complete [Valiant, 1979; Garey & Johnson, 1979]. Note that counting is easier than normalization, which is essentially weighted counting. This fact makes a probabilistic interpretation of the model as a distribution over matchings intractable to compute. Similarly, exact maximum likelihood estimation is intractable, since it requires computing the normalization.
Chapter 4
Structured maximum margin estimation
In the previous chapter, we described several important types of structured models of the form:

    h_w(x) = arg max_{y : g(x,y) ≤ 0} w^T f(x, y),    (4.1)

where we assume that the optimization problem max_{y : g(x,y) ≤ 0} w^T f(x, y) can be solved or approximated by a compact convex optimization problem for some convex subset of parameters w ∈ 𝒲. A compact problem formulation is polynomial in the description length of the objective and the constraints.
Given a sample S = ¦(x
(i)
, y
(i)
)¦
m
i=1
, we develop methods for ﬁnding parameters w
such that:
arg max
y∈¸
(i)
w
¯
f (x
(i)
, y) ≈ y
(i)
, ∀i,
where ¸
(i)
= ¦y : g(x
(i)
, y) ≤ 0¦. In this chapter, we describe at an abstract level two
general approaches to structured estimation that we apply in the rest of the thesis. Both of
these approaches deﬁne a convex optimization problem for ﬁnding such parameters w.
There are several reasons to derive compact convex formulations. First and foremost, we can find globally optimal parameters (with fixed precision) in polynomial time. Second, we can use standard optimization software to solve the problem. Although special-purpose algorithms that exploit the structure of a particular problem are often much faster (see Ch. 6), the availability of off-the-shelf software is very important for quick development and testing of such models. Third, we can analyze the generalization performance of the framework without worrying about the actual algorithms used to carry out the optimization and the associated woes of intractable optimization problems: local minima, greedy and heuristic methods, etc.
Our framework applies not only to the standard models typically estimated by probabilistic methods, such as Markov networks and context-free grammars, but also to a wide range of "unconventional" predictive models. Such models include graph cuts and weighted matchings, where maximum likelihood estimation is intractable. We provide exact maximum margin solutions for several of these problems (Ch. 7 and Ch. 10).

In prediction problems where the maximization in Eq. (4.1) is intractable, we consider convex programs that provide only an upper or lower bound on the true solution. We discuss how to use these approximate solutions for approximate learning of parameters.
4.1 Max-margin estimation

As in univariate prediction, we measure the error of approximation using a loss function $\ell$. In structured problems, where we are jointly predicting multiple variables, the loss is often not just the simple 0/1 loss or squared error. For structured classification, a natural loss function is a kind of Hamming distance between $y^{(i)}$ and $h(x^{(i)})$: the number of variables predicted incorrectly. We will explore these and more general loss functions in the following chapters.
4.1.1 Min-max formulation

Throughout, we will adopt the hinge upper bound $\tilde\ell_i(h(x^{(i)}))$ on the loss function for structured classification, inspired by the max-margin criterion:
$$\tilde\ell_i(h(x^{(i)})) = \max_{y\in\mathcal{Y}^{(i)}}\big[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)\big] - \mathbf{w}^\top\mathbf{f}_i(y^{(i)}) \;\ge\; \ell_i(h(x^{(i)})),$$
where as before, $\ell_i(h(x^{(i)})) = \ell(x^{(i)}, y^{(i)}, h(x^{(i)}))$, $\tilde\ell_i(h(x^{(i)})) = \tilde\ell(x^{(i)}, y^{(i)}, h(x^{(i)}))$, and $\mathbf{f}_i(y) = \mathbf{f}(x^{(i)}, y)$. With this upper bound, the min-max formulation for the structured classification problem is analogous to the multi-class SVM formulation in Eq. (2.9) and Eq. (2.10):
$$\min \;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \qquad (4.2)$$
$$\text{s.t.}\;\; \mathbf{w}^\top\mathbf{f}_i(y^{(i)}) + \xi_i \;\ge\; \max_{y\in\mathcal{Y}^{(i)}}\big[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)\big], \quad \forall i.$$
The above formulation is a convex quadratic program in $\mathbf{w}$, since $\max_{y\in\mathcal{Y}^{(i)}}[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)]$ is convex in $\mathbf{w}$ (a maximum of affine functions is a convex function). For brevity, we did not explicitly include the constraint that the parameters are in some legal convex set ($\mathbf{w}\in\mathcal{W}$, most often $\mathbb{R}^n$), but assume this throughout this chapter.

The problem with Eq. (4.2) is that the constraints have a very unwieldy form. Another way to express this problem is using $\sum_i |\mathcal{Y}^{(i)}|$ linear constraints, a number that is generally exponential in $L_i$, the number of variables in $y^{(i)}$:
$$\min \;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \qquad (4.3)$$
$$\text{s.t.}\;\; \mathbf{w}^\top\mathbf{f}_i(y^{(i)}) + \xi_i \;\ge\; \mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y), \quad \forall i,\; \forall y\in\mathcal{Y}^{(i)}.$$
This form reveals the "maximum margin" nature of the formulation. We can interpret $\frac{1}{\|\mathbf{w}\|}\mathbf{w}^\top[\mathbf{f}_i(y^{(i)}) - \mathbf{f}_i(y)]$ as the margin of $y^{(i)}$ over another $y\in\mathcal{Y}^{(i)}$. Assuming the $\xi_i$ are all zero (say because $C$ is very large), the constraints enforce
$$\mathbf{w}^\top\mathbf{f}_i(y^{(i)}) - \mathbf{w}^\top\mathbf{f}_i(y) \;\ge\; \ell_i(y),$$
so minimizing $\|\mathbf{w}\|$ maximizes the smallest such margin, scaled by the loss $\ell_i(y)$. The slack variables $\xi_i$ allow for violations of the constraints at a cost $C\xi_i$. If the loss function is not uniform over all the mistakes $y\ne y^{(i)}$, then the constraints make costly mistakes (those with high $\ell_i(y)$) less likely. In Ch. 5 we analyze the effect of a non-uniform loss function (Hamming-distance-type loss) on generalization, and show a strong connection between the loss-scaled margin and the expected risk of the learned model.
The formulation in Eq. (4.3) is a standard QP with linear constraints, but its exponential size is in general prohibitive. We now return to Eq. (4.2) and transform it into a more manageable problem. The key to solving Eq. (4.2) efficiently is the loss-augmented inference
$$\max_{y\in\mathcal{Y}^{(i)}}\big[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)\big]. \qquad (4.4)$$
Even if $\max_{y\in\mathcal{Y}^{(i)}} \mathbf{w}^\top\mathbf{f}_i(y)$ can be solved in polynomial time using convex optimization, the form of the loss term $\ell_i(y)$ is crucial for the loss-augmented inference to remain tractable. The range of tractable losses will depend strongly on the problem itself ($\mathbf{f}$ and $\mathcal{Y}$). Even within the range of tractable losses, some are more efficiently computable than others. A large part of the development of structured estimation methods in the following chapters is identifying appropriate loss functions for the application and designing convex formulations for the loss-augmented inference.
Assume that we find such a formulation in terms of a set of variables $\mu_i$, with an objective $\bar f_i(\mathbf{w}, \mu_i)$ that is concave in $\mu_i$, subject to convex constraints $\bar g_i(\mu_i)$:
$$\max_{y\in\mathcal{Y}^{(i)}}\big[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)\big] \;=\; \max_{\mu_i:\, \bar g_i(\mu_i)\le 0} \bar f_i(\mathbf{w}, \mu_i). \qquad (4.5)$$
We call such a formulation compact if the number of variables $\mu_i$ and constraints $\bar g_i(\mu_i)$ is polynomial in $L_i$, the number of variables in $y^{(i)}$.
Note that $\max_{\mu_i:\, \bar g_i(\mu_i)\le 0} \bar f_i(\mathbf{w}, \mu_i)$ must be convex in $\mathbf{w}$, since Eq. (4.4) is. Likewise, we can assume that it is feasible and bounded if Eq. (4.4) is. In the next section, we develop a max-margin formulation that uses Lagrangian duality (see [Boyd & Vandenberghe, 2004] for an excellent review) to define a joint, compact convex problem for estimating the parameters $\mathbf{w}$.
To make the symbols concrete, consider the example of the reviewer-assignment problem we discussed in the previous chapter: we would like a bipartite matching with $R$ reviewers per paper and at most $P$ papers per reviewer. Each training sample $i$ consists of a matching of $N_p^{(i)}$ papers and $N_r^{(i)}$ reviewers from some previous year. Let $x_{jk}$ denote the intersection of the sets of words occurring in the web page of reviewer $j$ and the abstract of paper $k$. Let $y_{jk}$ indicate whether reviewer $j$ is matched to paper $k$. We can define a basis function $f_d(x_{jk}, y_{jk}) = y_{jk}\,\mathbb{1}(\mathrm{word}_d \in x_{jk})$, which indicates whether word $d$ is in both the web page of a reviewer and the abstract of the paper that are matched in $y$. We abbreviate the vector of all the basis functions for each edge $jk$ as $y_{jk}\mathbf{f}^{(i)}_{jk} = \mathbf{f}(x^{(i)}_{jk}, y_{jk})$.
We assume that the loss function decomposes over the variables $y_{jk}$. For example, the Hamming loss simply counts the number of different edges in the matchings $y$ and $y^{(i)}$:
$$\ell^H_i(y) \;=\; \sum_{jk}\ell^{0/1}_{i,jk}(y_{jk}) \;=\; \sum_{jk}\mathbb{1}(y_{jk}\ne y^{(i)}_{jk}) \;=\; RN^{(i)}_p - \sum_{jk} y_{jk}\,y^{(i)}_{jk}.$$
The last equality follows from the fact that any valid matching for example $i$ has $R$ reviewers for each of its $N^{(i)}_p$ papers; hence $RN^{(i)}_p - \sum_{jk} y_{jk}\,y^{(i)}_{jk}$ represents exactly the number of edges that are different between $y$ and $y^{(i)}$. Combining the two pieces, we have
$$\mathbf{w}^\top\mathbf{f}(x^{(i)}, y) + \ell^H_i(y) \;=\; \sum_{jk}\big[\mathbf{w}^\top\mathbf{f}(x_{jk}, y_{jk}) + \ell^{0/1}_{i,jk}(y_{jk})\big] \;=\; RN^{(i)}_p + \sum_{jk} y_{jk}\big[\mathbf{w}^\top\mathbf{f}_{jk} - y^{(i)}_{jk}\big].$$
The loss-augmented inference problem can then be written as an LP in $\mu_i$ similar to Eq. (3.3) (without the constant term $RN^{(i)}_p$):
$$\max \;\sum_{jk}\mu_{i,jk}\big[\mathbf{w}^\top\mathbf{f}_{jk} - y^{(i)}_{jk}\big] \quad\text{s.t.}\;\; \sum_j \mu_{i,jk} = R\;\;\forall k, \quad \sum_k \mu_{i,jk} \le P\;\;\forall j, \quad 0\le\mu_{i,jk}\le 1.$$
In terms of Eq. (4.5), $\bar f_i$ and $\bar g_i$ are affine in $\mu_i$:
$$\bar f_i(\mathbf{w}, \mu_i) = RN^{(i)}_p + \sum_{jk}\mu_{i,jk}\big[\mathbf{w}^\top\mathbf{f}_{jk} - y^{(i)}_{jk}\big]$$
and
$$\bar g_i(\mu_i)\le 0 \;\Leftrightarrow\; \sum_j \mu_{i,jk} = R\;\;\forall k, \quad \sum_k \mu_{i,jk}\le P\;\;\forall j, \quad 0\le\mu_{i,jk}\le 1.$$
In general, when we can express $\max_{y\in\mathcal{Y}^{(i)}} \mathbf{w}^\top\mathbf{f}(x^{(i)}, y)$ as an LP and we use a loss function that is linear in the number of mistakes, we have a linear program of this form for the loss-augmented inference:
$$d_i + \max\;(F_i\mathbf{w} + \mathbf{c}_i)^\top\mu_i \quad\text{s.t.}\;\; A_i\mu_i\le \mathbf{b}_i,\;\; \mu_i\ge 0, \qquad (4.6)$$
for appropriately defined $d_i$, $F_i$, $\mathbf{c}_i$, $A_i$, $\mathbf{b}_i$, which depend only on $x^{(i)}$, $y^{(i)}$, $\mathbf{f}(x,y)$ and $g(x,y)$. Note that the dependence on $\mathbf{w}$ is linear and appears only in the objective of the LP. If this LP is compact (the number of variables and constraints is polynomial in the number of label variables), then we can use it to solve the max-margin estimation problem efficiently by using convex duality.
The Lagrangian associated with Eq. (4.4) is given by
$$L_{i,\mathbf{w}}(\mu_i, \lambda_i) = \bar f_i(\mathbf{w}, \mu_i) - \lambda_i^\top\bar g_i(\mu_i), \qquad (4.7)$$
where $\lambda_i\ge 0$ is a vector of Lagrange multipliers, one for each constraint function in $\bar g_i(\mu_i)$. Since we assume that $\bar f_i(\mathbf{w}, \mu_i)$ is concave in $\mu_i$ and bounded on the nonempty set $\{\mu_i : \bar g_i(\mu_i)\le 0\}$, we have strong duality:
$$\max_{\mu_i:\, \bar g_i(\mu_i)\le 0}\bar f_i(\mathbf{w}, \mu_i) \;=\; \min_{\lambda_i\ge 0}\,\max_{\mu_i}\, L_{i,\mathbf{w}}(\mu_i, \lambda_i).$$
For many forms of $\bar f$ and $\bar g$, we can write the Lagrangian dual $\min_{\lambda_i\ge 0}\max_{\mu_i} L_{i,\mathbf{w}}(\mu_i, \lambda_i)$ explicitly as:
$$\min\; h_i(\mathbf{w}, \lambda_i) \quad\text{s.t.}\;\; q_i(\mathbf{w}, \lambda_i)\le 0, \qquad (4.8)$$
where $h_i(\mathbf{w}, \lambda_i)$ and $q_i(\mathbf{w}, \lambda_i)$ are convex in both $\mathbf{w}$ and $\lambda_i$. (We folded $\lambda_i\ge 0$ into $q_i(\mathbf{w}, \lambda_i)$ for brevity.) Since the original problem has polynomial size, the dual is polynomial size as well. For example, the dual of the LP in Eq. (4.6) is
$$d_i + \min\; \mathbf{b}_i^\top\lambda_i \quad\text{s.t.}\;\; A_i^\top\lambda_i \ge F_i\mathbf{w} + \mathbf{c}_i,\;\; \lambda_i\ge 0, \qquad (4.9)$$
where $h_i(\mathbf{w}, \lambda_i) = d_i + \mathbf{b}_i^\top\lambda_i$ and $q_i(\mathbf{w}, \lambda_i)\le 0$ is $\{F_i\mathbf{w} + \mathbf{c}_i - A_i^\top\lambda_i\le 0,\; -\lambda_i\le 0\}$.
Plugging Eq. (4.8) into Eq. (4.2), we get
$$\min\;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i \qquad (4.10)$$
$$\text{s.t.}\;\; \mathbf{w}^\top\mathbf{f}_i(y^{(i)}) + \xi_i \;\ge\; \min_{q_i(\mathbf{w},\lambda_i)\le 0} h_i(\mathbf{w}, \lambda_i), \quad\forall i.$$
Moreover, we can combine the minimization over $\lambda$ with the minimization over $\{\mathbf{w}, \xi\}$. The reason is that if the right-hand side is not at its minimum, the constraint is tighter than necessary, leading to a suboptimal solution $\mathbf{w}$. Optimizing jointly over $\lambda$ as well will produce a solution for $\{\mathbf{w}, \xi\}$ that is optimal.
$$\min\;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i \qquad (4.11)$$
$$\text{s.t.}\;\; \mathbf{w}^\top\mathbf{f}_i(y^{(i)}) + \xi_i \ge h_i(\mathbf{w}, \lambda_i),\;\forall i; \qquad q_i(\mathbf{w}, \lambda_i)\le 0,\;\forall i.$$
Hence we have a joint and compact convex optimization program for estimating $\mathbf{w}$. The exact form of this program depends strongly on $\bar f$ and $\bar g$. For our LP-based example, we have a QP with linear constraints:
$$\min\;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i \qquad (4.12)$$
$$\text{s.t.}\;\; \mathbf{w}^\top\mathbf{f}_i(y^{(i)}) + \xi_i \ge d_i + \mathbf{b}_i^\top\lambda_i,\;\forall i; \qquad A_i^\top\lambda_i \ge F_i\mathbf{w} + \mathbf{c}_i,\;\forall i; \qquad \lambda_i\ge 0,\;\forall i.$$
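To see Eq. (4.12) in its simplest instance, take each $\mathcal{Y}^{(i)}$ to be a small label set $\{1,\dots,K\}$: the loss-augmented LP is then over the probability simplex, its dual reduces to a single scalar $\lambda_i$ bounding every label's augmented score, and the joint QP is the familiar multi-class SVM of Eq. (2.9). The sketch below uses invented toy data and SciPy's generic SLSQP solver rather than the special-purpose algorithms of Ch. 6:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: 2-D inputs, 2 classes; f(x, y) stacks x into the slot for class y.
X = np.array([[2.0, 1.0], [1.5, 0.5], [-1.0, -2.0], [-0.5, -1.5]])
Y = np.array([0, 0, 1, 1])
m, d, K = len(X), X.shape[1], 2
n = d * K
C = 10.0

def feat(x, y):
    f = np.zeros(n)
    f[y * d:(y + 1) * d] = x
    return f

loss = lambda i, y: 0.0 if y == Y[i] else 1.0     # 0/1 loss

# Variables z = [w (n), xi (m), lambda (m)]; objective penalizes w and xi only.
def objective(z):
    w, xi = z[:n], z[n:n + m]
    return 0.5 * w @ w + C * xi.sum()

cons = []
for i in range(m):
    # w.f_i(y_i) + xi_i >= lambda_i   (i.e. >= h_i(w, lambda_i) = lambda_i)
    cons.append({"type": "ineq", "fun":
                 lambda z, i=i: z[:n] @ feat(X[i], Y[i]) + z[n + i] - z[n + m + i]})
    for y in range(K):
        # lambda_i >= w.f_i(y) + loss_i(y)   (the dual LP constraints)
        cons.append({"type": "ineq", "fun":
                     lambda z, i=i, y=y: z[n + m + i] - z[:n] @ feat(X[i], y) - loss(i, y)})

res = minimize(objective, np.zeros(n + 2 * m), constraints=cons, method="SLSQP")
w = res.x[:n]
pred = [int(np.argmax([w @ feat(x, y) for y in range(K)])) for x in X]
print(pred)
```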
4.1.2 Certificate formulation

In the previous section, we assumed a compact convex formulation of the loss-augmented max in Eq. (4.4). There are several important combinatorial problems which allow polynomial time solution yet do not have a compact convex optimization formulation. For example, the maximum weight perfect (non-bipartite) matching and spanning tree problems can be expressed as linear programs with exponentially many constraints, but no polynomial formulation is known [Bertsimas & Tsitsiklis, 1997; Schrijver, 2003]. Both of these problems, however, can be solved in polynomial time using combinatorial algorithms. In some cases, though, we can find a compact certificate of optimality that guarantees that $y^{(i)} = \arg\max_y[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)]$ without expressing loss-augmented inference as a compact convex program. Intuitively, just verifying that a given assignment is optimal is sometimes easier than actually finding it.
Consider the maximum weight spanning tree problem. A basic property of a spanning tree is that cutting any edge $(j,k)$ in the tree creates two disconnected sets of nodes $(\mathcal{V}_j[jk], \mathcal{V}_k[jk])$, where $j\in\mathcal{V}_j[jk]$ and $k\in\mathcal{V}_k[jk]$. A spanning tree is optimal with respect to a set of edge weights if and only if for every edge $(j,k)$ in the tree connecting $\mathcal{V}_j[jk]$ and $\mathcal{V}_k[jk]$, the weight of $(j,k)$ is larger than (or equal to) the weight of any other edge $(j',k')$ in the graph with $j'\in\mathcal{V}_j[jk]$, $k'\in\mathcal{V}_k[jk]$ [Cormen et al., 2001]. We discuss the conditions for optimality of perfect matchings in Ch. 10. Suppose that we can find a compact convex formulation of these conditions via a polynomial (in $L_i$) set of functions $q_i(\mathbf{w}, \nu_i)$, jointly convex in $\mathbf{w}$ and auxiliary variables $\nu_i$:
$$\exists\,\nu_i \;\text{s.t.}\;\; q_i(\mathbf{w}, \nu_i)\le 0 \;\Leftrightarrow\; \mathbf{w}^\top\mathbf{f}_i(y^{(i)}) \ge \mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y), \;\forall y\in\mathcal{Y}^{(i)}.$$
Then the following joint convex program in $\mathbf{w}$ and $\nu$ computes the max-margin parameters:
$$\min\;\frac{1}{2}\|\mathbf{w}\|^2 \quad\text{s.t.}\;\; q_i(\mathbf{w}, \nu_i)\le 0,\;\forall i. \qquad (4.13)$$
Expressing spanning tree optimality does not require additional variables $\nu_i$, but in other problems, such as perfect matching optimality in Ch. 10, such auxiliary variables are needed. In the spanning tree problem, suppose $y_{jk}$ encodes whether edge $(j,k)$ is in the tree and the score of the edge is given by $\mathbf{w}^\top\mathbf{f}_{i,jk}$ for some basis functions $\mathbf{f}(x^{(i)}_{jk}, y_{jk})$. We also assume that the loss function decomposes into a sum of losses over the edges, with the loss for each wrong edge given by $\ell_{i,jk}$. Then the optimality conditions are:
$$\mathbf{w}^\top\mathbf{f}_{i,jk} \;\ge\; \mathbf{w}^\top\mathbf{f}_{i,j'k'} + \ell_{i,j'k'}, \quad \forall jk,\, j'k' \;\text{s.t.}\;\; y^{(i)}_{jk}=1,\; j'\in\mathcal{V}_j[jk],\; k'\in\mathcal{V}_k[jk].$$
For a full graph, we have $O(|\mathcal{V}^{(i)}|^3)$ constraints for each example $i$, where $|\mathcal{V}^{(i)}|$ is the number of nodes in the graph for example $i$.
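The cut condition above translates directly into a checker: remove each tree edge, compute the two resulting components, and verify that no crossing edge outweighs it. A plain-Python sketch on a hypothetical 4-node graph (weights invented); the loss-augmented certificate is the same check with each crossing score increased by its edge loss $\ell_{i,j'k'}$:

```python
from collections import defaultdict

def split_on(tree, cut):
    # Connected components of the tree's nodes after removing edge `cut`.
    adj = defaultdict(set)
    for a, b in tree:
        if (a, b) != cut:
            adj[a].add(b)
            adj[b].add(a)
    def reach(s):
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj[u] - seen:
                seen.add(v)
                stack.append(v)
        return seen
    return reach(cut[0]), reach(cut[1])

def is_max_spanning_tree(weights, tree):
    # Certificate: every tree edge (j, k) outweighs all edges crossing the
    # cut (V_j[jk], V_k[jk]) that removing it induces.
    for e in tree:
        Vj, Vk = split_on(tree, e)
        for (a, b), w in weights.items():
            if ((a in Vj and b in Vk) or (a in Vk and b in Vj)) and w > weights[e]:
                return False
    return True

W = {(0, 1): 5, (1, 2): 4, (2, 3): 6, (0, 2): 3, (1, 3): 2}
print(is_max_spanning_tree(W, [(0, 1), (1, 2), (2, 3)]))  # → True
print(is_max_spanning_tree(W, [(0, 2), (1, 2), (2, 3)]))  # → False
```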
Note that this formulation does not allow for violations of the margin constraints (it has no slack variables $\xi_i$). If the basis functions are not sufficiently rich to ensure that each $y^{(i)}$ is optimal, then Eq. (4.13) may be infeasible. Essentially, this formulation requires that the upper bound on the empirical risk be zero, $\mathcal{R}_S[h_{\mathbf{w}}] = 0$, and minimizes the complexity of the hypothesis $h_{\mathbf{w}}$ as measured by the norm of the weights.

If the problem is infeasible, the designer could add more basis functions $\mathbf{f}(x, y)$ that take into account additional information about $x$. One could also add slack variables for each example and each constraint that would allow violations of the optimality conditions with some penalty. However, these slack variables would not represent upper bounds on the loss as they do in the min-max formulation, and are therefore less justified.
4.2 Approximations: upper and lower bounds

There are structured prediction tasks for which we might not be able to solve the estimation problem exactly. Often, we cannot compute $\max_{y\in\mathcal{Y}^{(i)}}[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)]$ exactly or explicitly, but can only upper or lower bound it. Fig. 4.1 shows schematically how approximating the max subproblem reduces or extends the feasible space of $\mathbf{w}$ and $\xi$ and leads to approximate solutions. The nature of these lower and upper bounds depends on the problem, but we consider two general cases below.
4.2.1 Constraint generation

When neither a compact maximization nor an optimality formulation is possible, but the maximization problem can be solved or approximated by a combinatorial algorithm, we can resort to constraint generation or cutting plane methods. Consider Eq. (4.3), where we have an exponential number of linear constraints, one for each $i$ and $y\in\mathcal{Y}^{(i)}$. Only a subset of those constraints will be active at the optimal solution $\mathbf{w}$. In fact, no more than the number of parameters $n$ plus the number of examples $m$ can be active in general, since that is the number of variables. If we can identify a small number of constraints that are critical to the solution, we do not have to include all of them. Of course, identifying these constraints is in general as difficult as solving the problem, but a greedy approach of adding the most violated constraints often achieves good approximate solutions after adding a small (polynomial) number of constraints. If we continue adding constraints until there are no more violated ones, the resulting solution is optimal.
Figure 4.1: Exact and approximate constraints on the max-margin quadratic program. The solid red line represents the constraints imposed by the assignments $y\in\mathcal{Y}^{(i)}$, whereas the dashed and dotted lines represent approximate constraints. The approximate constraints may coincide with the exact constraints in some cases, and be more stringent or relaxed in others. The parabolic contours represent the value of the objective function, and '+', 'x' and 'o' mark the different optima.

We assume that we have an algorithm that produces $y = \arg\max_{y\in\mathcal{Y}^{(i)}}[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)]$. The algorithm is described in Fig. 4.2. We maintain, for each example $i$, a small but growing set of assignments $\bar{\mathcal{Y}}^{(i)}\subset\mathcal{Y}^{(i)}$. At each iteration, we solve the problem with a subset of constraints:
$$\min\;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i \qquad (4.14)$$
$$\text{s.t.}\;\; \mathbf{w}^\top\mathbf{f}_i(y^{(i)}) + \xi_i \ge \mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y), \quad\forall i,\;\forall y\in\bar{\mathcal{Y}}^{(i)}.$$
The only difference between Eq. (4.3) and Eq. (4.14) is that $\mathcal{Y}^{(i)}$ has been replaced by $\bar{\mathcal{Y}}^{(i)}$.
We then compute $y = \arg\max_{y\in\mathcal{Y}^{(i)}}[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)]$ for each $i$ and check whether the constraint $\mathbf{w}^\top\mathbf{f}_i(y^{(i)}) + \xi_i + \epsilon \ge \mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)$ is violated, where $\epsilon$ is a user-defined precision parameter. If it is violated, we set $\bar{\mathcal{Y}}^{(i)} = \bar{\mathcal{Y}}^{(i)}\cup y$. The algorithm terminates when no constraints are violated. In Fig. 4.1, the lower bound on the constraints provided by $\bar{\mathcal{Y}}^{(i)}\cup y$ keeps tightening with each iteration, terminating when the desired precision is reached.

Input: precision parameter $\epsilon$.
1. Initialize: $\bar{\mathcal{Y}}^{(i)} = \{\},\;\forall i$.
2. Set violation $= 0$ and solve for $\mathbf{w}$ and $\xi$ by optimizing
$$\min\;\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i\xi_i \quad\text{s.t.}\;\; \mathbf{w}^\top\mathbf{f}_i(y^{(i)}) + \xi_i \ge \mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y), \;\forall i,\;\forall y\in\bar{\mathcal{Y}}^{(i)}.$$
3. For each $i$, compute $y = \arg\max_{y\in\mathcal{Y}^{(i)}}[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)]$; if $\mathbf{w}^\top\mathbf{f}_i(y^{(i)}) + \xi_i + \epsilon \le \mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)$, then set $\bar{\mathcal{Y}}^{(i)} = \bar{\mathcal{Y}}^{(i)}\cup y$ and violation $= 1$.
4. If violation $= 1$, go to 2. Return $\mathbf{w}$.

Figure 4.2: A constraint generation algorithm.

We note that if the algorithm that produces $y = \arg\max_{y\in\mathcal{Y}^{(i)}}[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)]$ is suboptimal, the approximation error of the solution we achieve might be much greater than $\epsilon$. The number of constraints that must be added before the algorithm terminates depends on the precision $\epsilon$ and problem-specific characteristics. See [Bertsimas & Tsitsiklis, 1997; Boyd & Vandenberghe, 2004] for a more in-depth discussion of cutting plane methods. This approach may also be computationally faster in providing a very good approximation in practice if the explicit convex programming formulation is polynomial in size, but very large, while the maximization algorithm is comparatively fast.
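The loop of Fig. 4.2 can be sketched end-to-end on a toy sequence problem where the loss-augmented argmax decomposes over positions, so a fast exact oracle is available. The data, sizes, and the use of SciPy's generic SLSQP solver for the restricted QP are all invented for illustration; a bound $\xi_i \ge 0$ is added so the first, unconstrained solve is well-posed:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
L, d, C, eps = 3, 2, 10.0, 1e-3
Xs = [rng.normal(size=(L, d)) for _ in range(2)]   # per-position inputs x_t
Ys = [np.array([1, 0, 1]), np.array([0, 1, 1])]    # true 0/1 labelings y^(i)

feat = lambda X, y: (np.asarray(y)[:, None] * X).sum(axis=0)  # f = sum_t y_t x_t
ham = lambda y, yi: float((np.asarray(y) != yi).sum())        # Hamming loss

def oracle(w, X, yi):
    # argmax_y [w.f(x, y) + loss(y)] decomposes over positions t.
    return np.array([1 if w @ X[t] + (yi[t] != 1) > (yi[t] != 0) else 0
                     for t in range(L)])

active = [[] for _ in Xs]                          # the growing sets Ybar^(i)
m = len(Xs)
for it in range(50):
    # Solve the restricted QP over the current constraint sets.
    cons = [{"type": "ineq", "fun": lambda z, i=i, y=y:
             z[:d] @ (feat(Xs[i], Ys[i]) - feat(Xs[i], y)) + z[d + i] - ham(y, Ys[i])}
            for i in range(m) for y in active[i]]
    res = minimize(lambda z: 0.5 * z[:d] @ z[:d] + C * z[d:].sum(),
                   np.zeros(d + m), constraints=cons, method="SLSQP",
                   bounds=[(None, None)] * d + [(0, None)] * m)
    w, xi = res.x[:d], res.x[d:]
    # Add the most violated constraint for each example, if any.
    violation = False
    for i in range(m):
        y = oracle(w, Xs[i], Ys[i])
        if w @ feat(Xs[i], Ys[i]) + xi[i] + eps < w @ feat(Xs[i], y) + ham(y, Ys[i]):
            active[i].append(y)
            violation = True
    if not violation:
        break
print(it, w.round(3))
```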
4.2.2 Constraint strengthening

In many problems, the maximization problem we are interested in may be very expensive or intractable. For example, we consider MAP inference in large treewidth Markov networks in Ch. 8, multiway cut in Ch. 7, and graph partitioning in Ch. 11. Many such problems can be written as integer programs. Relaxations of such integer programs into LPs, QPs or SDPs often provide excellent approximation algorithms [Hochbaum, 1997; Nemhauser & Wolsey, 1999]. The relaxation usually defines a larger feasible space $\bar{\mathcal{Y}}^{(i)}\supset\mathcal{Y}^{(i)}$ over which the maximization is done, where $y\in\bar{\mathcal{Y}}^{(i)}$ may correspond to a "fractional" assignment. For example, a solution to the MAP LP in Eq. (3.2) for an untriangulated network may not correspond to any valid assignment. In such a case, the approximation is an overestimate of the constraints:
$$\max_{y\in\bar{\mathcal{Y}}^{(i)}}\big[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)\big] \;\ge\; \max_{y\in\mathcal{Y}^{(i)}}\big[\mathbf{w}^\top\mathbf{f}_i(y) + \ell_i(y)\big].$$
Hence the constraint set is tightened with such invalid assignments. Fig. 4.1 shows how the overestimate reduces the feasible space of $\mathbf{w}$ and $\xi$.

Note that for every setting of the weights $\mathbf{w}$ that produces fractional solutions for the relaxation, the approximate constraints are tightened because of the additional invalid assignments. In this case, the approximate MAP solution has higher value than any integer solution, including the true assignment $y^{(i)}$, thereby driving up the corresponding slack $\xi_i$. By contrast, for weights $\mathbf{w}$ for which the MAP approximation is integer-valued, the margin has the standard interpretation as the difference between the score of $y^{(i)}$ and the MAP $y$ (according to $\mathbf{w}$). As the objective includes a penalty for the slack variables, minimizing the objective intuitively tends to drive the weights $\mathbf{w}$ away from the regions where the solutions to the approximation are fractional. In essence, the estimation algorithm is finding weights that are not necessarily optimal for an exact maximization algorithm, but are (close to) optimal for the particular approximate maximization algorithm used. In practice, we will show experimentally that such approximations often work very well.
4.3 Related work

Our max-margin formulation is related to a body of work called inverse combinatorial and convex optimization [Burton & Toint, 1992; Zhang & Ma, 1996; Ahuja & Orlin, 2001; Heuberger, 2004]. An inverse optimization problem is defined by an instance of an optimization problem $\max_{y\in\mathcal{Y}} \mathbf{w}^\top\mathbf{f}(y)$, a set of nominal weights $\mathbf{w}_0$, and a target solution $y^t$. The goal is to find the weights $\mathbf{w}$ closest to the nominal $\mathbf{w}_0$ in some norm that make the target solution optimal:
$$\min\;\|\mathbf{w} - \mathbf{w}_0\|_p \quad\text{s.t.}\;\; \mathbf{w}^\top\mathbf{f}(y^t)\ge\mathbf{w}^\top\mathbf{f}(y),\;\forall y\in\mathcal{Y}.$$
Most of the attention has been on the $L_1$ and $L_\infty$ norms, but the $L_2$ norm is also used.
The study of inverse problems began with geophysical scientists (see [Tarantola, 1987] for an in-depth discussion of a wide range of applications). Modeling a complex physical system often involves a large number of parameters which scientists find hard or impossible to set correctly. Provided educated guesses for the parameters $\mathbf{w}_0$ and the behavior of the system as a target, the inverse optimization problem attempts to match the behavior while not perturbing the "guesstimate" too much.

Although there is a strong connection between inverse optimization problems and our formulations, the goals are very different from ours. In our framework, we are learning a parameterized objective function that depends on the input $x$ and will generalize well in prediction on new instances. Moreover, we do not assume a nominal set of weights is given. Note that if we set $\mathbf{w}_0 = 0$, then $\mathbf{w} = 0$ is trivially the optimal solution. The solution $\mathbf{w}$ depends critically on the choice of nominal weights, which is not appropriate in the learning setting.
The inverse reinforcement learning problem [Ng & Russell, 2000; Abbeel & Ng, 2004] is much closer to our setting. The goal is to learn a reward function that will cause a rational agent to act similarly to the observed behavior of an expert. A full description of the problem is beyond our scope, but we briefly describe the Markov decision process (MDP) model commonly used for sequential decision making problems in which an agent interacts with its environment. The environment is modeled as a system that can be in one of a set of discrete states. At every time step, the agent chooses an action from a discrete set of actions, and the system transitions to a next state with a probability that depends on the current state and the action taken. The agent collects a reward at each step, which generally depends on the current and next states and the action taken. A rational agent executes a policy (essentially, a state-to-action mapping) that maximizes its expected reward. To map this problem (approximately) to our setting, note that a policy roughly corresponds to the labels $y$, the state sequence corresponds to the input $x$, and the reward for a state/action sequence is assumed to be $\mathbf{w}^\top\mathbf{f}(x, y)$ for some basis functions $\mathbf{f}(x, y)$. The goal is to learn $\mathbf{w}$ from a set of state/action sequences $(x^{(i)}, y^{(i)})$ of the expert such that maximizing the expected reward according to the system model makes the agent imitate the expert. This and related problems are formulated as convex programs in Ng and Russell [2000] and Abbeel and Ng [2004].
4.4 Conclusion

In this chapter, we presented two formulations of structured max-margin estimation that define a compact convex optimization problem. The first formulation, min-max, relies on the ability to express inference in the model as a compact convex optimization problem. The second one, certificate, only requires expressing optimality of a given assignment according to the model. Our framework applies to a wide range of prediction problems that we explore in the rest of the thesis, including Markov networks, context-free grammars, and many combinatorial structures such as matchings and graph cuts. The estimation problem is tractable and exact whenever the prediction problem can be formulated as a compact convex optimization problem or solved by a polynomial time combinatorial algorithm with compact convex optimality conditions. When the prediction problem is intractable or very expensive to solve exactly, we resort to approximations that only provide upper/lower bounds on the predictions. The estimated parameters are then approximate, but produce accurate approximate prediction models in practice.

Because our approach relies only on using the maximum in the model for prediction, and does not require a normalized distribution $P(y \mid x)$ over all outputs, maximum margin estimation can be tractable when maximum likelihood is not. For example, learning a probabilistic model $P(y \mid x)$ over bipartite matchings using maximum likelihood requires computing the normalizing partition function, which is #P-complete [Valiant, 1979; Garey & Johnson, 1979]. By contrast, maximum margin estimation can be formulated as a compact QP with linear constraints. Similar results hold for non-bipartite matchings and min-cuts.

In models that are tractable for both maximum likelihood and maximum margin (such as low-treewidth Markov networks, context-free grammars, and many other problems in which inference is solvable by dynamic programming), our approach has an additional advantage. Because of the hinge loss, the solutions to the estimation problem are relatively sparse in the dual space (as in SVMs), which makes the use of kernels much more efficient. Maximum likelihood estimation with kernels results in models that are generally non-sparse and require pruning or greedy support vector selection methods [Lafferty et al., 2004; Altun et al., 2004].

The forthcoming formulations in the thesis follow the principles laid out in this chapter. The range of applications of these principles is very broad and leads to estimation problems with very interesting structure in each particular problem, from Markov networks and context-free grammars to graph cuts and perfect matchings.
Part II
Markov networks
Chapter 5
Markov networks
Markov networks are extensively used to model complex sequential, spatial, and relational interactions in prediction problems arising in many fields. These problems involve labeling a set of related objects that exhibit local consistency. In sequential labeling problems (such as handwriting recognition), the labels (letters) of adjacent inputs (images) are highly correlated. Sequential prediction problems arise in natural language processing (part-of-speech tagging, speech recognition, information extraction [Manning & Schütze, 1999]), computational biology (gene finding, protein structure prediction, sequence alignment [Durbin et al., 1998]), and many other fields. In image processing, neighboring pixels exhibit spatial label coherence in denoising, segmentation and stereo correspondence [Besag, 1986; Boykov et al., 1999a]. In hypertext or bibliographic classification, labels of linked and co-cited documents tend to be similar [Chakrabarti et al., 1998; Taskar et al., 2002]. In proteomic analysis, location and function of proteins that interact are often highly correlated [Vazquez et al., 2003]. Markov networks compactly represent complex joint distributions of the label variables by modeling their local interactions. Such models are encoded by a graph, whose nodes represent the different object labels, and whose edges represent and quantify direct dependencies between them. For example, a Markov network for the hypertext domain would include a node for each webpage, encoding its label, and an edge between any pair of webpages whose labels are directly correlated (e.g., because one links to the other).
We address the problem of max-margin estimation of the parameters of Markov networks for such structured classification problems. We show a compact convex formulation that seamlessly integrates kernels with graphical models. We analyze the theoretical generalization properties of max-margin estimation and derive a novel margin-based bound for structured classification.
We are given a labeled training sample $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$, drawn from a fixed distribution $D$ over $\mathcal{X}\times\mathcal{Y}$. We assume the structure of the network is given: we have a mapping from an input $x$ to the corresponding Markov network graph $\mathcal{G}(x) = \{\mathcal{V}, \mathcal{E}\}$, where the nodes $\mathcal{V}$ map to the variables in $y$. We abbreviate $\mathcal{G}(x^{(i)})$ as $\mathcal{G}^{(i)}$ below. In handwriting recognition, this mapping depends on the segmentation algorithm that determines how many letters the sample image contains and splits the image into individual images for each letter. It also depends on the basis functions we use to model the dependencies of the problem, for example, a first-order Markov chain or a higher-order model. Note that the topology and size of the graph $\mathcal{G}^{(i)}$ might be different for each example $i$. For instance, the training sequences might have different lengths.

We focus on conditional Markov networks (or CRFs [Lafferty et al., 2001]), which represent $P(y \mid x)$ instead of generative models $P(x, y)$. The log-linear representation we have described in Sec. 3.3.1 is defined via a vector of $n$ basis functions $\mathbf{f}(x, y)$:
$$\log P_{\mathbf{w}}(y \mid x) = \mathbf{w}^\top\mathbf{f}(x, y) - \log Z_{\mathbf{w}}(x),$$
where $Z_{\mathbf{w}}(x) = \sum_y \exp\{\mathbf{w}^\top\mathbf{f}(x, y)\}$ and $\mathbf{w}\in\mathbb{R}^n$. Before we present maximum margin estimation, we review the standard maximum likelihood method.
5.1 Maximum likelihood estimation

The regularized maximum likelihood approach to learning the weights $\mathbf{w}$ of a Markov network is similar to the logistic regression we described in Sec. 2.2. The objective is to minimize the training log-loss with an additional regularization term, usually the squared norm of the weights $\mathbf{w}$ [Lafferty et al., 2001]:
$$\frac{1}{2}\|\mathbf{w}\|^2 - C\sum_i \log P_{\mathbf{w}}(y^{(i)} \mid x^{(i)}) \;=\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \big[\log Z_{\mathbf{w}}(x^{(i)}) - \mathbf{w}^\top\mathbf{f}_i(y^{(i)})\big],$$
where $\mathbf{f}_i(y) = \mathbf{f}(x^{(i)}, y)$.
This objective function is convex in the parameters $\mathbf{w}$, so we have an unconstrained convex optimization problem. The gradient with respect to $\mathbf{w}$ is given by:
$$\mathbf{w} + C\sum_i \big[\mathbb{E}_{i,\mathbf{w}}[\mathbf{f}_i(y)] - \mathbf{f}_i(y^{(i)})\big] \;=\; \mathbf{w} - C\sum_i \mathbb{E}_{i,\mathbf{w}}[\Delta\mathbf{f}_i(y)],$$
where $\mathbb{E}_{i,\mathbf{w}}[\mathbf{f}_i(y)] = \sum_{y\in\mathcal{Y}} \mathbf{f}_i(y) P_{\mathbf{w}}(y \mid x^{(i)})$ is the expectation under the conditional distribution $P_{\mathbf{w}}(y \mid x^{(i)})$ and $\Delta\mathbf{f}_i(y) = \mathbf{f}(x^{(i)}, y^{(i)}) - \mathbf{f}(x^{(i)}, y)$, as before. To compute the expectations, we can use inference in the Markov network to calculate the marginals $P_{\mathbf{w}}(y_c \mid x^{(i)})$ for each clique $c$ in the network (Sec. 3.3.2). Since the basis functions decompose over the cliques of the network, the expectation decomposes as well:
$$\mathbb{E}_{i,\mathbf{w}}[\mathbf{f}_i(y)] \;=\; \sum_{c\in\mathcal{G}^{(i)}} \sum_{y_c\in\mathcal{Y}^{(i)}_c} \mathbf{f}_{i,c}(y_c)\, P_{\mathbf{w}}(y_c \mid x^{(i)}).$$
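The gradient expression can be checked numerically on a tiny conditional model where $Z_{\mathbf{w}}(x)$ and the expectations are computed by exhaustively enumerating $\mathcal{Y}$ (a sanity-check sketch with invented data; a real Markov network would use the clique-wise inference of Sec. 3.3.2 instead of enumeration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
L, d, C = 3, 4, 1.0
Xs = [rng.normal(size=(L, d)) for _ in range(2)]   # per-position inputs
Ys = [np.array([1, 0, 1]), np.array([0, 1, 0])]    # observed labelings
ally = [np.array(y) for y in itertools.product([0, 1], repeat=L)]

feat = lambda X, y: (np.asarray(y)[:, None] * X).sum(axis=0)

def objective(w):
    # Regularized negative conditional log-likelihood, Z by enumeration.
    val = 0.5 * w @ w
    for X, yi in zip(Xs, Ys):
        scores = np.array([w @ feat(X, y) for y in ally])
        val += C * (np.log(np.exp(scores).sum()) - w @ feat(X, yi))
    return val

def gradient(w):
    # w + C * sum_i (E_{i,w}[f_i(y)] - f_i(y^(i))), expectations enumerated.
    g = w.copy()
    for X, yi in zip(Xs, Ys):
        scores = np.array([w @ feat(X, y) for y in ally])
        p = np.exp(scores - scores.max())
        p /= p.sum()
        Ef = sum(pi * feat(X, y) for pi, y in zip(p, ally))
        g += C * (Ef - feat(X, yi))
    return g

w = rng.normal(size=d)
num = np.array([(objective(w + 1e-5 * e) - objective(w - 1e-5 * e)) / 2e-5
                for e in np.eye(d)])
print(np.abs(num - gradient(w)).max() < 1e-5)
```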
Second-order methods for solving unconstrained convex optimization problems, such
as Newton's method, require the second derivatives as well as the gradient. Let δf_i(y) =
f_i(y) − E_{i,w}[f_i(y)]. The Hessian of the objective depends on the covariances of the basis
functions:

    I + C Σ_i E_{i,w}[δf_i(y) δf_i(y)^T],

where I is an n × n identity matrix. Computing the Hessian is more expensive than the
gradient, since we need to calculate the joint marginals P_w(y_{c∪c'} | x^(i)) of every pair of
cliques c and c', as well as the covariances of all basis functions, which is quadratic in the
number of cliques and the number of functions. A standard approach is to use an approximate
second-order method that does not need to compute the Hessian, but uses only the gradient
information [Nocedal & Wright, 1999; Boyd & Vandenberghe, 2004]. Conjugate Gradients
and L-BFGS methods have been shown to work very well on large estimation problems [Sha
& Pereira, 2003; Pinto et al., 2003], even with millions of parameters w.
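To make the gradient concrete, the following sketch computes it by brute-force enumeration for a tiny three-node chain (the basis functions, data, and sizes are invented for illustration; a real implementation would obtain the clique marginals from inference rather than enumerating every y):

```python
import itertools
import numpy as np

def features(x, y):
    """Hypothetical basis functions f(x, y) for a 3-node chain over binary labels:
    one node indicator per position (label agrees with the observation) plus one
    shared edge feature counting equal neighboring labels."""
    f = np.zeros(4)
    for j in range(3):
        f[j] = 1.0 if y[j] == x[j] else 0.0
    f[3] = sum(y[j] == y[j + 1] for j in range(2))
    return f

def logloss_grad(w, data, C):
    """Gradient of (1/2)||w||^2 + C * sum_i [log Z_w(x_i) - w.f_i(y_i)], i.e.
    w + C * sum_i (E_{i,w}[f_i(y)] - f_i(y_i)), by enumerating all labelings."""
    grad = w.copy()
    labelings = list(itertools.product([0, 1], repeat=3))
    for x, y_obs in data:
        scores = np.array([w @ features(x, y) for y in labelings])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                       # P_w(y | x)
        expected_f = sum(p * features(x, y) for p, y in zip(probs, labelings))
        grad += C * (expected_f - features(x, y_obs))
    return grad

w = np.zeros(4)
g = logloss_grad(w, [((0, 1, 1), (0, 1, 1))], C=1.0)
```

With w = 0 the conditional distribution is uniform, so the gradient reduces to the gap between the uniform feature expectations and the observed features.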
5.2 Maximum margin estimation
For maximum-margin estimation, we begin with the min-max formulation from Sec. 4.1:

    min  (1/2)||w||² + C Σ_i ξ_i   (5.1)
    s.t. w^T f_i(y^(i)) + ξ_i ≥ max_y [w^T f_i(y) + ℓ_i(y)], ∀i.
We know from Sec. 3.3.3 how to express max_y w^T f_i(y) as an LP, but the important
difference is the loss function ℓ_i. The simplest loss is the 0/1 loss ℓ_i(y) ≡ 1I(y^(i) ≠ y). In fact,
this loss for sequence models was used by Collins [2001] and Altun et al. [2003]. However,
in structured problems, where we are predicting multiple labels, the loss is often not just
the simple 0/1 loss, but may depend on the number and type of labels predicted
incorrectly, or perhaps on the number of cliques of labels predicted incorrectly. In general, we
assume that the loss, like the basis functions, decomposes over the cliques of labels.
Assumption 5.2.1 The loss function ℓ_i(y) is decomposable:

    ℓ_i(y) = Σ_{c∈C(G^(i))} ℓ(x_c^(i), y_c^(i), y_c) = Σ_{c∈C(G^(i))} ℓ_{i,c}(y_c).
We will focus on decomposable loss functions below. A natural choice that we use in our
experiments is the Hamming distance:

    ℓ^H(x^(i), y^(i), y) = Σ_{v∈V^(i)} 1I(y_v^(i) ≠ y_v).
With this assumption, we can express this inference problem for a triangulated graph
as a linear program for each example i, as in Sec. 3.3.3:

    max  Σ_{c,y_c} µ_{i,c}(y_c) [w^T f_{i,c}(y_c) + ℓ_{i,c}(y_c)]   (5.2)
    s.t. Σ_{y_c} µ_{i,c}(y_c) = 1, ∀i, ∀c ∈ C^(i);  µ_{i,c}(y_c) ≥ 0, ∀c ∈ C^(i), ∀y_c;
         µ_{i,s}(y_s) = Σ_{y'_c∼y_s} µ_{i,c}(y'_c), ∀s, c ∈ C^(i), s ⊂ c, ∀y_s,
where C^(i) = C(G^(i)) are the cliques of the Markov network for example i.
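For a triangulated model such as a chain, the LP above has an integral optimum, so the loss-augmented inference it encodes can equivalently be computed by a max-sum (Viterbi-style) recursion. A minimal sketch, with hypothetical node scores playing the role of w^T f_{i,c}, hypothetical edge scores, and a per-node Hamming loss:

```python
def loss_augmented_viterbi(node_scores, edge_scores, y_true):
    """max over y of sum_j [node_scores[j][y_j] + 1I(y_j != y_true[j])]
    + sum_j edge_scores[j][y_j][y_{j+1}], by max-sum dynamic programming."""
    n, k = len(node_scores), len(node_scores[0])
    # loss-augmented node potentials
    aug = [[node_scores[j][v] + (1.0 if v != y_true[j] else 0.0) for v in range(k)]
           for j in range(n)]
    best, back = list(aug[0]), []
    for j in range(1, n):
        ptr, new_best = [], []
        for v in range(k):
            u_best = max(range(k), key=lambda u: best[u] + edge_scores[j - 1][u][v])
            ptr.append(u_best)
            new_best.append(aug[j][v] + best[u_best] + edge_scores[j - 1][u_best][v])
        back.append(ptr)
        best = new_best
    v = max(range(k), key=lambda u: best[u])
    score, path = best[v], [v]
    for ptr in reversed(back):          # follow back-pointers to recover y
        v = ptr[v]
        path.append(v)
    return score, path[::-1]
```

The returned assignment is the y achieving max_y [w^T f_i(y) + ℓ_i(y)], the maximization appearing on the right-hand side of Eq. (5.1).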
As we showed before, the constraints ensure that the µ_i's form a proper distribution. If
the most likely assignment is unique, then the distribution that maximizes the objective puts
all its weight on that assignment. (If the arg max is not unique, any convex combination of
the maximizing assignments is a valid solution.) The dual of Eq. (5.2) is given by:

    min  Σ_c λ_{i,c}   (5.3)
    s.t. λ_{i,c} + Σ_{s⊃c} m_{i,s,c}(y_c) − Σ_{s⊂c, y_s∼y_c} m_{i,c,s}(y_s) ≥ w^T f_{i,c}(y_c) + ℓ_{i,c}(y_c), ∀c ∈ C^(i), ∀y_c.
In this dual, the λ_{i,c} variables correspond to the normalization constraints, while the m_{i,c,s}(y_c)
variables correspond to the agreement constraints in the primal in Eq. (5.2).
Plugging this dual into Eq. (5.1) for each example i and minimizing jointly over all the
variables (w, ξ, λ and m), we have:
    min  (1/2)||w||² + C Σ_i ξ_i   (5.4)
    s.t. w^T f_i(y^(i)) + ξ_i ≥ Σ_c λ_{i,c}, ∀i;
         λ_{i,c} + Σ_{s⊃c} m_{i,s,c}(y_c) − Σ_{s⊂c, y_s∼y_c} m_{i,c,s}(y_s) ≥ w^T f_{i,c}(y_c) + ℓ_{i,c}(y_c), ∀i, ∀c ∈ C^(i), ∀y_c.
In order to gain some intuition about this formulation, we make a change of variables from
λ_{i,c} to ξ_{i,c}:

    λ_{i,c} = w^T f_{i,c}(y_c^(i)) + ξ_{i,c}, ∀i, ∀c ∈ C^(i).
The reason for naming the new variables using the letter ξ will be clear in the following. For
readability, we also introduce variables that capture the effect of all the agreement variables
m:

    M_{i,c}(y_c) = Σ_{s⊂c, y_s∼y_c} m_{i,c,s}(y_s) − Σ_{s⊃c} m_{i,s,c}(y_c), ∀i, ∀c ∈ C^(i), ∀y_c.
With these new variables, we have:

    min  (1/2)||w||² + C Σ_i ξ_i   (5.5)
    s.t. ξ_i ≥ Σ_c ξ_{i,c}, ∀i;
         w^T f_{i,c}(y_c^(i)) + ξ_{i,c} ≥ w^T f_{i,c}(y_c) + ℓ_{i,c}(y_c) + M_{i,c}(y_c), ∀i, ∀c ∈ C^(i), ∀y_c;
         M_{i,c}(y_c) = Σ_{s⊂c, y_s∼y_c} m_{i,c,s}(y_s) − Σ_{s⊃c} m_{i,s,c}(y_c), ∀i, ∀c ∈ C^(i), ∀y_c.
Note that ξ_i = Σ_c ξ_{i,c} at the optimum, since the slack variable ξ_i appears only in the
constraint ξ_i ≥ Σ_c ξ_{i,c} and the objective minimizes Cξ_i. Hence we can simply eliminate
this set of variables:
    min  (1/2)||w||² + C Σ_{i,c} ξ_{i,c}   (5.6)
    s.t. w^T f_{i,c}(y_c^(i)) + ξ_{i,c} ≥ w^T f_{i,c}(y_c) + ℓ_{i,c}(y_c) + M_{i,c}(y_c), ∀i, ∀c ∈ C^(i), ∀y_c;
         M_{i,c}(y_c) = Σ_{s⊂c, y_s∼y_c} m_{i,c,s}(y_s) − Σ_{s⊃c} m_{i,s,c}(y_c), ∀i, ∀c ∈ C^(i), ∀y_c.
Finally, we can write this in a form that resembles our original formulation Eq. (5.1), but
defined at a local level, for each clique:

    min  (1/2)||w||² + C Σ_{i,c} ξ_{i,c}   (5.7)
    s.t. w^T f_{i,c}(y_c^(i)) + ξ_{i,c} ≥ max_{y_c} [w^T f_{i,c}(y_c) + ℓ_{i,c}(y_c) + M_{i,c}(y_c)], ∀i, ∀c ∈ C^(i);
         M_{i,c}(y_c) = Σ_{s⊂c, y_s∼y_c} m_{i,c,s}(y_s) − Σ_{s⊃c} m_{i,s,c}(y_c), ∀i, ∀c ∈ C^(i), ∀y_c.
Note that without the M_{i,c} and m_{i,c,s} variables, we would essentially treat each clique as an
independent classification problem: for each clique we have a hinge upper bound on the local loss,
or a margin requirement. The m_{i,c,s}(y_s) variables correspond to a certain kind of messages
between cliques that distribute "credit" to cliques, helping them fulfill this margin requirement
using slack from other cliques which have sufficient margin.
Figure 5.1: First-order chain shown as a set of cliques (nodes and edges). Also shown are
the corresponding local slack variables ξ for each clique and messages m between cliques.
As an example, consider the first-order Markov chain in Fig. 5.1. The set of cliques
consists of the five nodes and the four edges. Suppose for the sake of this example that
our training data consists of only one training sample. The figure shows the local slack
variables ξ and the messages m between cliques for this sample. For brevity of notation in this
example, we drop the dependence on the sample index i in the indexing of the variables
(we also write y_j^(∗) instead of y_j^(i) below). For concreteness, below we use the Hamming
loss ℓ^H, which decomposes into local terms ℓ_j(y_j) = 1I(y_j ≠ y_j^(∗)) for each node and is
zero for the edges.
The constraints associated with the node cliques in this sequence are:

    w^T f_1(y_1^(∗)) + ξ_1 ≥ w^T f_1(y_1) + 1I(y_1 ≠ y_1^(∗)) − m_{1,12}(y_1), ∀y_1;
    w^T f_2(y_2^(∗)) + ξ_2 ≥ w^T f_2(y_2) + 1I(y_2 ≠ y_2^(∗)) − m_{2,12}(y_2) − m_{2,23}(y_2), ∀y_2;
    w^T f_3(y_3^(∗)) + ξ_3 ≥ w^T f_3(y_3) + 1I(y_3 ≠ y_3^(∗)) − m_{3,23}(y_3) − m_{3,34}(y_3), ∀y_3;
    w^T f_4(y_4^(∗)) + ξ_4 ≥ w^T f_4(y_4) + 1I(y_4 ≠ y_4^(∗)) − m_{4,34}(y_4) − m_{4,45}(y_4), ∀y_4;
    w^T f_5(y_5^(∗)) + ξ_5 ≥ w^T f_5(y_5) + 1I(y_5 ≠ y_5^(∗)) − m_{5,45}(y_5), ∀y_5.
The edge constraints are:

    w^T f_{12}(y_1^(∗), y_2^(∗)) + ξ_{12} ≥ w^T f_{12}(y_1, y_2) + m_{1,12}(y_1) + m_{2,12}(y_2), ∀y_1, y_2;
    w^T f_{23}(y_2^(∗), y_3^(∗)) + ξ_{23} ≥ w^T f_{23}(y_2, y_3) + m_{2,23}(y_2) + m_{3,23}(y_3), ∀y_2, y_3;
    w^T f_{34}(y_3^(∗), y_4^(∗)) + ξ_{34} ≥ w^T f_{34}(y_3, y_4) + m_{3,34}(y_3) + m_{4,34}(y_4), ∀y_3, y_4;
    w^T f_{45}(y_4^(∗), y_5^(∗)) + ξ_{45} ≥ w^T f_{45}(y_4, y_5) + m_{4,45}(y_4) + m_{5,45}(y_5), ∀y_4, y_5.
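The telescoping role of the messages can be checked numerically: each m_{v,e}(y_v) enters exactly one node constraint with a minus sign and exactly one edge constraint with a plus sign, so when the per-clique constraints are summed at any single assignment y the message terms cancel, recovering the global constraint of Eq. (5.1) with ξ = Σ_c ξ_c. A small sketch with arbitrary (invented) message values:

```python
import random

random.seed(0)
K = 3                                    # number of labels per node (hypothetical)
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
# arbitrary message values m_{v,e}(y_v) for each node v incident to edge e
m = {(v, e): [random.random() for _ in range(K)] for e in edges for v in e}

y = {v: random.randrange(K) for v in range(1, 6)}   # any single assignment

# node constraints subtract each incident message; edge constraints add it back
node_side = sum(-m[(v, e)][y[v]] for e in edges for v in e)
edge_side = sum(+m[(v, e)][y[v]] for e in edges for v in e)
cancellation = node_side + edge_side
```

Because the message terms cancel exactly, summing the per-clique margin constraints at a fixed y yields the global margin constraint, whatever values the messages take.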
5.3 M³N dual and kernels
In the previous section, we showed a derivation of a compact formulation based on LP
inference. In this section, we develop an alternative dual derivation that provides a very
interesting interpretation of the problem and is a point of departure for the special-purpose
algorithms we develop. We begin with the formulation as in Eq. (4.3):

    min  (1/2)||w||² + C Σ_i ξ_i   (5.8)
    s.t. w^T ∆f_i(y) ≥ ℓ_i(y) − ξ_i, ∀i, y,

where ∆f_i(y) ≡ f(x^(i), y^(i)) − f(x^(i), y). The dual is given by:
    max  Σ_{i,y} α_i(y) ℓ_i(y) − (1/2) ||Σ_{i,y} α_i(y) ∆f_i(y)||²   (5.9)
    s.t. Σ_y α_i(y) = C, ∀i;  α_i(y) ≥ 0, ∀i, y.
In the dual, the exponential number of α_i(y) variables corresponds to the exponential number
of constraints in the primal. We make two small transformations to the dual that do not
change the problem: we normalize the α's by C (by letting α_i(y) = Cα'_i(y)), so that they sum
to 1, and divide the objective by C. The resulting dual is given by:
    max  Σ_{i,y} α_i(y) ℓ_i(y) − (C/2) ||Σ_{i,y} α_i(y) ∆f_i(y)||²   (5.10)
    s.t. Σ_y α_i(y) = 1, ∀i;  α_i(y) ≥ 0, ∀i, y.
As in multiclass SVMs, the solution α to the dual gives the solution to the primal as a
weighted combination: w∗ = C Σ_{i,y} α∗_i(y) ∆f_i(y).
Our main insight is that the variables α_i(y) in the dual formulation Eq. (5.10) can be
interpreted as a kind of distribution over y, since they lie in the simplex:

    Σ_y α_i(y) = 1;  α_i(y) ≥ 0, ∀y.

This dual distribution does not represent the probability that the model assigns to an
instantiation, but rather the importance of the constraint associated with the instantiation to the
solution. The dual objective is a function of expectations of ℓ_i(y) and ∆f_i(y) with respect to α_i(y).
Since ℓ_i(y) = Σ_c ℓ_{i,c}(y_c) and ∆f_i(y) = Σ_c ∆f_{i,c}(y_c) decompose over the cliques of the
Markov network, we only need clique marginals of the distribution α_i(y) to compute their
expectations. We define the marginal dual variables as follows:

    µ_{i,c}(y_c) = Σ_{y'∼y_c} α_i(y'), ∀i, ∀c ∈ C^(i), ∀y_c,   (5.11)
where y' ∼ y_c denotes that the partial assignment y_c is consistent with the full assignment
y'. Note that the number of µ_{i,c}(y_c) variables is small (polynomial) compared to the
number of α_i(y) variables (exponential), provided the size of the largest clique is constant with
respect to the size of the network.
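To illustrate the difference in scale, the following sketch counts both sets of dual variables for a first-order chain (an OCR-style setting with invented sizes):

```python
def num_alpha_vars(L, V):
    """Number of alpha_i(y) dual variables for one example: one per joint
    labeling of an L-node chain with V values per node."""
    return V ** L

def num_mu_vars(L, V):
    """Number of marginal dual variables mu_{i,c}(y_c) for a first-order chain:
    L node cliques with V values each, plus L-1 edge cliques with V^2 values each."""
    return L * V + (L - 1) * V ** 2

# e.g. a hypothetical length-10 word over a 26-letter alphabet:
exponential = num_alpha_vars(10, 26)
polynomial = num_mu_vars(10, 26)
```

The exponential count is over 10¹⁴, while the marginal count is a few thousand, which is the compactness the structured dual exploits.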
Now we can reformulate our entire QP (5.10) in terms of these marginal dual variables.
Consider, for example, the first term in the objective function (fixing a particular i):

    Σ_y α_i(y) ℓ_i(y) = Σ_y α_i(y) Σ_c ℓ_{i,c}(y_c) = Σ_{c,y_c} ℓ_{i,c}(y_c) Σ_{y'∼y_c} α_i(y') = Σ_{c,y_c} µ_{i,c}(y_c) ℓ_{i,c}(y_c).
The decomposition of the second term in the objective is analogous:

    Σ_y α_i(y) ∆f_i(y) = Σ_{c,y_c} ∆f_{i,c}(y_c) Σ_{y'∼y_c} α_i(y') = Σ_{c,y_c} µ_{i,c}(y_c) ∆f_{i,c}(y_c).
Let us denote the objective of Eq. (5.10) as Q(α). Note that it depends on
α_i(y) only through its marginals µ_{i,c}(y_c), that is, Q(α) = Q'(M(α)), where M denotes the
marginalization operator defined by Eq. (5.11). The domain of this operator, D[M], is
the product of simplices for all the m examples. What is its range, R[M], the set of legal
marginals? Characterizing this set (also known as the marginal polytope) compactly will allow
us to work in the space of µ's:

    max_{α∈D[M]} Q(α) ⇔ max_{µ∈R[M]} Q'(µ).
Hence we must ensure that µ_i corresponds to some distribution α_i, which is exactly
what the constraints in the LP for MAP inference enforce (see the discussion of Lemma 3.3.5).
Therefore, when all the networks G^(i) are triangulated, the following structured dual QP has the same
primal solution (w∗) as the original exponential dual QP in Eq. (5.10):
    max  Σ_{i,c,y_c} µ_{i,c}(y_c) ℓ_{i,c}(y_c) − (C/2) ||Σ_{i,c,y_c} µ_{i,c}(y_c) ∆f_{i,c}(y_c)||²   (5.12)
    s.t. Σ_{y_c} µ_{i,c}(y_c) = 1, ∀i, ∀c ∈ C^(i);  µ_{i,c}(y_c) ≥ 0, ∀i, ∀c ∈ C^(i), ∀y_c;
         µ_{i,s}(y_s) = Σ_{y'_c∼y_s} µ_{i,c}(y'_c), ∀i, ∀s, c ∈ C^(i), s ⊂ c, ∀y_s.
The solution µ∗ of the structured dual gives us the primal solution:

    w∗ = C Σ_{i,c,y_c} µ∗_{i,c}(y_c) ∆f_{i,c}(y_c).
In this structured dual, we only enforce that there exists an α_i consistent with µ_i, but do
not make a commitment about what it is. In general, the α distribution is not unique;
there is a continuum of distributions consistent with a given set of marginals. The objective of
the QP Eq. (5.10) does not distinguish between these distributions, since it depends only on
their marginals. The maximum-entropy distribution α_i consistent with a set of marginals
µ_i, however, is unique for a triangulated model and can be computed using the junction tree
T^(i) for the network [Cowell et al., 1999].
Specifically, associated with each edge (c, c') in the tree T^(i) is a set of variables called
the separator, s = c ∩ c'. Note that each separator s (and each complement of a separator, c \ s) is
also a clique of the original graph, since it is a subclique of a larger clique. We denote the
set of separators by S^(i). Now we can define the maximum-entropy distribution α_i(y) as
follows:

    α_i(y) = [∏_{c∈T^(i)} µ_{i,c}(y_c)] / [∏_{s∈S^(i)} µ_{i,s}(y_s)].   (5.13)

Again, by convention 0/0 ≡ 0.
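For a chain 1–2–3 with edge cliques {1,2}, {2,3} and separator {2}, Eq. (5.13) reads α_i(y) = µ_12(y_1, y_2) µ_23(y_2, y_3) / µ_2(y_2). The sketch below starts from an arbitrary (invented) dual distribution, computes its clique marginals, and reconstructs the maximum-entropy distribution, which is again normalized and has the same clique marginals:

```python
import itertools
import random

random.seed(1)
ys = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in ys]
total = sum(weights)
alpha = {y: p / total for y, p in zip(ys, weights)}   # an arbitrary dual distribution

# clique marginals for the chain 1-2-3: edges {1,2}, {2,3}; separator {2}
mu12 = {(a, b): sum(p for y, p in alpha.items() if (y[0], y[1]) == (a, b))
        for a in (0, 1) for b in (0, 1)}
mu23 = {(b, c): sum(p for y, p in alpha.items() if (y[1], y[2]) == (b, c))
        for b in (0, 1) for c in (0, 1)}
mu2 = {b: sum(p for y, p in alpha.items() if y[1] == b) for b in (0, 1)}

# Eq. (5.13): product of clique marginals over separator marginals
alpha_me = {y: mu12[y[0], y[1]] * mu23[y[1], y[2]] / mu2[y[1]] for y in ys}
```

In general alpha_me differs from the original alpha (unless alpha was already Markov with respect to the chain), but by the junction tree property it is a valid distribution with the same clique marginals.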
Kernels
Note that the solution is a weighted combination of local basis functions, and the objective
of Eq. (5.12) can be expressed in terms of dot products between local basis functions:

    ∆f_{i,c}(y_c)^T ∆f_{j,c̄}(y_c̄) = [f(x_c^(i), y_c^(i)) − f(x_c^(i), y_c)]^T [f(x_c̄^(j), y_c̄^(j)) − f(x_c̄^(j), y_c̄)].
Hence, we can locally kernelize our models and solve Eq. (5.12) efficiently. Kernels are
typically defined on the input, e.g. k(x_c^(i), x_c̄^(j)). In our handwriting example, we use a
polynomial kernel on the pixel values for the node cliques. We usually extend the kernel
over the input space to the joint input and output space by simply defining

    f(x_c, y_c)^T f(x_c̄, y_c̄) ≡ 1I(y_c = y_c̄) k(x_c, x_c̄).

Of course, other definitions are possible and may be useful when the assignments in each
clique y_c have interesting structure. In Sec. 6.2 we experiment with several kernels for
the handwriting example. As in SVMs, the solutions to the max-margin QP are typically
sparse in the µ variables. Hence, each log-potential in the network "remembers" only a
small proportion of the relevant training data inputs.
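A minimal sketch of this joint-kernel construction (the polynomial kernel and the inputs are illustrative, not the actual handwriting features):

```python
def poly_kernel(x, z, degree=2):
    """Hypothetical polynomial kernel on raw input vectors (e.g. pixel values)."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree

def joint_kernel(xc, yc, xz, yz, degree=2):
    """Kernel on the joint input-output space:
    f(x_c, y_c)^T f(x_z, y_z) = 1I(y_c == y_z) * k(x_c, x_z)."""
    return poly_kernel(xc, xz, degree) if yc == yz else 0.0
```

The indicator makes basis functions for different clique labelings orthogonal, so a kernel on inputs alone suffices to evaluate the structured dual objective.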
Figure 5.2: Diamond Markov network (the added triangulation edge is dashed and the three-node
marginals are in dashed rectangles).
5.4 Untriangulated models
If the underlying Markov net is not chordal, we must address the problem by triangulating
the graph, that is, adding fill-in edges to ensure triangulation. For example, if our graph is
the 4-cycle Y_1—Y_2—Y_3—Y_4—Y_1 in Fig. 5.2, we can triangulate the graph by adding the
arc Y_1—Y_3. This will introduce the new cliques {Y_1, Y_2, Y_3} and {Y_1, Y_3, Y_4} and the corresponding
marginals, µ_123(y_1, y_2, y_3) and µ_134(y_1, y_3, y_4). We can then use this new graph to produce
the constraints on the marginals:
    Σ_{y_1} µ_123(y_1, y_2, y_3) = µ_23(y_2, y_3), ∀y_2, y_3;
    Σ_{y_3} µ_123(y_1, y_2, y_3) = µ_12(y_1, y_2), ∀y_1, y_2;
    Σ_{y_1} µ_134(y_1, y_3, y_4) = µ_34(y_3, y_4), ∀y_3, y_4;
    Σ_{y_4} µ_134(y_1, y_3, y_4) = µ_13(y_1, y_3), ∀y_1, y_3.
The new marginal variables appear only in the constraints; they do not add any new basis
functions nor change the objective function.
In general, the number of constraints introduced is exponential in the number of variables
in the new cliques, that is, in the treewidth of the graph. Unfortunately, even sparsely connected
networks, for example the 2D grids often used in image analysis, have large treewidth.
However, we can still solve the QP in the structured primal Eq. (5.6) or the structured
dual Eq. (5.12) defined by an untriangulated graph. Such a formulation, which enforces
only local consistency of marginals, optimizes our objective only over a relaxation of the
marginal polytope. However, the learned parameters produce very accurate approximate
models in practice, as the experiments in Ch. 8 demonstrate.

Note that we could also strengthen the untriangulated relaxation without introducing
an exponential number of constraints. For example, we can add the positive semidefinite
constraints on the marginals µ used by Wainwright and Jordan [2003], which tend to improve
the approximation of the marginal polytope. Although this and other more complex relaxations
are a very interesting area of future development, they are often much more expensive.
The approximate QP does not guarantee that the learned model, under exact inference,
minimizes the true objective: the (upper bound on) empirical risk plus regularization. But do
we really need these optimal parameters if we cannot perform exact inference? A more
useful goal is to make sure that training error is minimized using the approximate inference
procedure via the untriangulated LP. We conjecture that the parameters learned by
the approximate QP in fact do that to some degree. For instance, consider the separable
case, where 100% accuracy is achievable on the training data by some parameter setting w
such that approximate inference (using the untriangulated LP) produces integral solutions.
Solving the problem as C → ∞ will find this solution even though it may not be optimal
(in terms of the norm of w) under exact inference. For C in an intermediate range, the
formulation trades off fractionality of the untriangulated LP solutions with complexity of
the weights ||w||².
5.5 Generalization bound
In this section, we show a generalization bound for the task of structured classification that
allows us to relate the error rate on the training set to the generalization error. To the best
of our knowledge, this bound is the first to deal with structured error, such as the Hamming
distance. Our analysis of Hamming loss allows us to prove a significantly stronger result than
previous bounds for the 0/1 loss, as we detail below.
Our goal in structured classification is often to minimize the number of misclassified
labels, or the Hamming distance between y and h(x). An appropriate error function is the
average per-label loss

    ℒ(w, x, y) = (1/L) ℓ^H(y, arg max_{y'} w^T f(x, y')),

where L is the number of label variables in y. As in other generalization bounds for margin-based
classifiers, we relate the generalization error to the margin of the classifier. Consider
an upper bound on the above loss:

    ℒ(w, x, y) ≤ ℒ̄(w, x, y) = max_{y': w^T f(y) ≤ w^T f(y')} (1/L) ℓ^H(y, y').
This upper bound is tight if y = arg max_{y'} w^T f(x, y'); otherwise, it is adversarial: it
picks, from all y' which are better (w^T f(y) ≤ w^T f(y')), one that maximizes the Hamming
distance from y. We can now define a γ-margin per-label loss:

    ℒ(w, x, y) ≤ ℒ̄(w, x, y) ≤ ℒ̄^γ(w, x, y) = max_{y': w^T f(y) ≤ w^T f(y') + γ ℓ^H(y, y')} (1/L) ℓ^H(y, y').
This upper bound is even more adversarial: it is tight if y = arg max_{y'} [w^T f(x, y') +
γ ℓ^H(y, y')]; otherwise, it picks, from all y' which are better when helped by γ ℓ^H(y, y'), one
that maximizes the Hamming distance from y. Note that the loss we minimize in the max-margin
formulation is very closely related (although not identical) to this upper bound.
We can now prove that the generalization accuracy of any hypothesis w is bounded by
its empirical γ-margin per-label loss, plus a term that grows inversely with the margin. To
state the bound, we need to define several other factors it depends upon. Let N_c be the
maximum number of cliques in C(x), V_c be the maximum number of values in a clique
|Y_c|, q be the maximum number of cliques that have a variable in common, and R_c be
an upper bound on the 2-norm of the clique basis functions. Consider a first-order sequence
model as an example, with L as the maximum length, and V the number of values a variable
takes. Then N_c = 2L − 1, since we have L node cliques and L − 1 edge cliques; V_c = V²,
because of the edge cliques; and q = 3, since nodes in the middle of the sequence participate
in 3 cliques: the previous-current edge clique, the node clique, and the current-next edge clique.
Theorem 5.5.1 For the family of hypotheses parameterized by w, and any δ > 0, there
exists a constant K such that for any γ > 0 per-label margin, and m > 1 samples, the
expected per-label loss is bounded by:

    E_D[ℒ(w, x, y)] ≤ E_S[ℒ̄^γ(w, x, y)] + √( (K/m) [ (R_c² ||w||² q² / γ²) (ln m + ln N_c + ln V_c) + ln(1/δ) ] ),

with probability at least 1 − δ.
Proof: See Appendix A.1 for the proof details and the exact value of the constant K.
The first term upper-bounds the training error of w. Low loss E_S[ℒ̄^γ(w, x, y)] at a high
margin γ quantifies the confidence of the prediction model. The second term depends on
||w||/γ, which corresponds to the complexity of the classifier (normalized by the margin
level). Thus, the result provides a bound on the generalization error that trades off the
effective complexity of the hypothesis space with the training error.
The proof uses a covering number argument analogous to previous results in SVMs [Zhang,
2002]. However, we propose a novel method for covering the space of structured prediction
models by using a cover of the individual clique basis function differences ∆f_{i,c}(y_c). This
new type of cover is polynomial in the number of cliques, yielding significant improvements
in the bound. Specifically, our bound has a logarithmic dependence on the number
of cliques (ln N_c) and depends only on the 2-norm of the basis functions per clique (R_c).
This is a significant gain over the previous result of Collins [2001] for 0/1 loss, which has
a linear dependence (inside the square root) on the number of nodes (L), and depends on
the joint 2-norm of all of the basis functions for an example (which is ∼ N_c R_c). Such a
result was, until now, an open problem for margin-based sequence classification [Collins,
2001]. Finally, for sequences, note that if L/m = O(1) (for example, in OCR, if the number
of instances is at least a constant times the length of a word), then our bound is independent
of the number of labels L.
5.6 Related work
The application of margin-based estimation methods to parsing and sequence modeling was
pioneered by Collins [2001] using the Voted-Perceptron algorithm [Freund & Schapire,
1998]. He provides generalization guarantees (for 0/1 loss) that hold for the separable case and
depend on the number of mistakes the perceptron makes before convergence. Remarkably,
the bound does not explicitly depend on the length of the sequence, although undoubtedly
the number of mistakes does.
Collins [2004] also suggested an SVM-like formulation (with exponentially many constraints)
and a constraint generation method for solving it. His generalization bound (for
0/1 loss), based on the SVM-like margin, however, has a linear dependence (inside the square
root) on the number of nodes (L). It also depends on the joint 2-norm of all of the basis
functions for an example (which is ∼ N_c R_c). By considering the more natural Hamming
loss, we achieve a much tighter analysis.
Altun et al. [2003] have applied the exponential-size formulation with constraint generation
we described in Sec. 4.2.1 to problems in natural language processing. In a follow-up
paper, Tsochantaridis et al. [2004] show that only a polynomial number of constraints need
to be generated to guarantee a fixed level of precision of the solution. However,
the number of constraints in many important cases is several orders higher (in L) than in
the approach we present. In addition, the corresponding problem needs to be re-solved
(or at least approximately re-solved) after each additional constraint is added, which is
prohibitively expensive for a large number of examples and label variables.
The work of Guestrin et al. [2003] presents LP decompositions based on graphical
model structure for the value function approximation problem in factored MDPs (Markov
decision processes with structure). Describing the exact setting is beyond our scope, but it
suffices to say that our original decomposition of the max-margin QP was inspired by the
proposed technique to transform an exponential set of constraints into a polynomial one
using a triangulated graph.
There has been a recent explosion of work in maximum conditional likelihood estimation
of Markov networks. The work of Lafferty et al. [2001] has inspired many applications
in natural language, computational biology, computer vision and relational modeling [Sha
& Pereira, 2003; Pinto et al., 2003; Kumar & Hebert, 2003; Sutton et al., 2004; Taskar
et al., 2002; Taskar et al., 2003b]. As in the case of logistic regression, maximum conditional
likelihood estimation for Markov networks can also be kernelized [Altun et al., 2004;
Lafferty et al., 2004]. However, the solutions are non-sparse and the proposed algorithms
are forced to use greedy selection of support vectors or heuristic pruning methods.
5.7 Conclusion
We use graph decomposition to derive an exact, compact, convex max-margin formulation
for Markov networks with sequence and other low-treewidth structure. Our formulation
avoids the exponential blow-up in the number of constraints in the max-margin QP that
plagued previous approaches. The seamless integration of kernels with graphical models
allows us to create very rich models that leverage the immense amount of research in kernel
design and graphical model decompositions. We also use approximate graph decomposition
to derive a compact approximate formulation for Markov networks in which inference
is intractable.

We provide theoretical guarantees on the average per-label generalization error of our
models in terms of the training set margin. Our generalization bound significantly tightens
previous results of Collins [2001] and suggests possibilities for analyzing per-label
generalization properties of graphical models.

In the next chapter, we present an efficient algorithm that exploits graphical model
inference, and show experiments on a large handwriting recognition task that utilize the
powerful representational capability of kernels.
Chapter 6

M³N algorithms and experiments
Although the number of variables and constraints in the structured dual in Eq. (5.12) is
polynomial in the size of the data, the problem is, unfortunately, often too large for standard
QP solvers, even for small training sets. Instead, we use a coordinate dual ascent method
analogous to the sequential minimal optimization (SMO) used for SVMs [Platt, 1999].

We apply our M³N framework and structured SMO algorithm to a handwriting recognition
task. We show that our models significantly outperform other approaches by incorporating
high-dimensional decision boundaries of polynomial kernels over character images
while capturing correlations between consecutive characters.
6.1 Solving the M³N QP
Let us begin by considering the primal and dual QPs for multiclass SVMs:

    Primal:  min  (1/2)||w||² + C Σ_i ξ_i
             s.t. w^T ∆f_i(y) ≥ ℓ_i(y) − ξ_i, ∀i, y.

    Dual:    max  Σ_{i,y} α_i(y) ℓ_i(y) − (C/2) ||Σ_{i,y} α_i(y) ∆f_i(y)||²
             s.t. Σ_y α_i(y) = 1, ∀i;  α_i(y) ≥ 0, ∀i, y.
The KKT conditions [Bertsekas, 1999; Boyd & Vandenberghe, 2004] provide sufficient
and necessary criteria for optimality of a dual solution α. As we describe below, these
conditions have a certain locality with respect to each example i, which allows us to perform
the search for an optimal α by repeatedly considering one example at a time.
A feasible dual solution α and a primal solution defined by:

    w = C Σ_{i,y} α_i(y) ∆f_i(y),   (6.1)
    ξ_i = max_y [ℓ_i(y) − w^T ∆f_i(y)] = max_y [ℓ_i(y) + w^T f_i(y)] − w^T f_i(y^(i)),

are optimal if they satisfy the following two types of constraints:

    α_i(y) = 0 ⇒ w^T ∆f_i(y) > ℓ_i(y) − ξ_i;   (KKT1)
    α_i(y) > 0 ⇒ w^T ∆f_i(y) = ℓ_i(y) − ξ_i.   (KKT2)
We can express these conditions as:

    α_i(y) = 0 ⇒ w^T f_i(y) + ℓ_i(y) < max_{y'} [w^T f_i(y') + ℓ_i(y')];   (KKT1)
    α_i(y) > 0 ⇒ w^T f_i(y) + ℓ_i(y) = max_{y'} [w^T f_i(y') + ℓ_i(y')].   (KKT2)
To simplify the notation, we define

    v_i(y) = w^T f_i(y) + ℓ_i(y);    v̄_i(y) = max_{y'≠y} [w^T f_i(y') + ℓ_i(y')].
With these definitions, we have:

    α_i(y) = 0 ⇒ v_i(y) < v̄_i(y);   (KKT1)      α_i(y) > 0 ⇒ v_i(y) ≥ v̄_i(y).   (KKT2)
In practice, however, we will enforce the KKT conditions up to a given tolerance 0 < ε ≪ 1:

    α_i(y) = 0 ⇒ v_i(y) ≤ v̄_i(y) + ε;    α_i(y) > 0 ⇒ v_i(y) ≥ v̄_i(y) − ε.   (6.2)

Essentially, α_i(y) can be zero only if v_i(y) is at most ε larger than all the others. Conversely,
α_i(y) can be non-zero only if v_i(y) is at most ε smaller than all the others.
Note that the normalization constraints on the dual variables α are local to each example
i. This allows us to perform dual block-coordinate ascent, where a block corresponds to
the vector of dual variables α_i for a single example i.

    1. Initialize: α_i(y) = 1I(y = y^(i)), ∀i, y.
    2. Set violation = 0.
    3. For each i:
    4.   If α_i violates (KKT1) or (KKT2):
    5.     Set violation = 1.
    6.     Find a feasible α'_i such that Q(α'_i, α_{−i}) > Q(α_i, α_{−i}) and set α_i = α'_i.
    7. If violation = 1, go to 2.

Figure 6.1: Block-coordinate dual ascent.

The general form of the block-coordinate ascent algorithm, shown in Fig. 6.1, is essentially
coordinate ascent on the blocks α_i, maintaining the feasibility of the dual. When optimizing
with respect to a single block i, the objective function can be split into two terms:

    Q(α) = Q(α_{−i}) + Q(α_i, α_{−i}),

where α_{−i} denotes all dual variables α_k for k other than i. Only the second part of the
objective, Q(α_i, α_{−i}), matters for optimizing with respect to α_i. The algorithm starts with
a feasible dual solution α and improves the objective block-wise until all KKT conditions
are satisfied (up to the tolerance ε). Checking the constraints requires computing w and ξ
from α according to Eq. (6.1).
As long as the local ascent step over α_i is guaranteed to improve the objective when the
KKT conditions are violated, the algorithm will converge to the global maximum in a finite
number of steps (within the ε precision). This allows us to focus on efficient updates to a
single block of α_i at a time.
Let α'_i(y) = α_i(y) + λ(y), where Σ_y λ(y) = 0 and α_i(y) + λ(y) ≥ 0, so that α'_i is
feasible. We can write the objective Q(α_{−i}) + Q(α'_i, α_{−i}) in terms of λ and α:

    Σ_{j,y} α_j(y) ℓ_j(y) + Σ_y λ(y) ℓ_i(y) − (C/2) ||Σ_y λ(y) ∆f_i(y) + Σ_{j,y} α_j(y) ∆f_j(y)||².
By dropping all terms that do not involve λ, and making the substitution
w = C Σ_{j,y} α_j(y) ∆f_j(y), we get:

    Σ_y λ(y) ℓ_i(y) − w^T [Σ_y λ(y) ∆f_i(y)] − (C/2) ||Σ_y λ(y) ∆f_i(y)||².
Since Σ_y λ(y) = 0,

    Σ_y λ(y) ∆f_i(y) = Σ_y λ(y) f_i(y^(i)) − Σ_y λ(y) f_i(y) = −Σ_y λ(y) f_i(y).
Below we also make the substitution v_i(y) = w^T f_i(y) + ℓ_i(y) to get the optimization
problem for λ:

    max  Σ_y λ(y) v_i(y) − (C/2) ||Σ_y λ(y) f_i(y)||²
    s.t. Σ_y λ(y) = 0;  α_i(y) + λ(y) ≥ 0, ∀y.
6.1.1 SMO
We do not need to solve the optimization subproblem above to optimality at each pass
through the data. All that is required is an ascent step, not a full optimization. The Sequential
Minimal Optimization (SMO) approach takes an ascent step that modifies the least number
of variables. In our case, we have the simplex constraint, so we must change at least two
variables in order to respect the normalization constraint (by moving weight from one dual
variable to another). We address a strategy for selecting the two variables in the next section;
for now, assume we have picked λ(y') and λ(y''). Then we must have δ = λ(y') = −λ(y'')
so that the λ's sum to 0 (keeping α_i normalized). The optimization problem becomes a
single-variable quadratic program in δ:
    max  [v_i(y') − v_i(y'')] δ − (C/2) ||f_i(y') − f_i(y'')||² δ²   (6.3)
    s.t. α_i(y') + δ ≥ 0;  α_i(y'') − δ ≥ 0.
Figure 6.2: Representative examples of the SMO subproblem. The horizontal axis represents δ,
with two vertical lines depicting the lower and upper bounds c and d. The vertical axis
represents the objective. The optimum occurs either at the maximum of the parabola, if it is
feasible, or at the upper or lower bound otherwise.
With $a = v_i(y') - v_i(y'')$, $b = C\,\|f_i(y') - f_i(y'')\|^2$, $c = -\alpha_i(y')$, $d = \alpha_i(y'')$, we have:
$$\max\;\; \Big[a\delta - \frac{b}{2}\delta^2\Big] \quad \text{s.t.}\;\; c \leq \delta \leq d, \qquad (6.4)$$
where the optimum is achieved at the maximum of the parabola, $a/b$, if $c \leq a/b \leq d$, or at the boundary $c$ or $d$ otherwise (see Fig. 6.2). Hence the solution is given by simply clipping $a/b$:
$$\delta^* = \max(c, \min(d, a/b)).$$
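The clipping rule amounts to one line of code. Below is a minimal sketch with illustrative coefficients (the numbers are made up), checked against a dense grid of feasible values:

```python
def smo_step(a, b, c, d):
    """Maximize a*delta - (b/2)*delta**2 over c <= delta <= d.

    b > 0 is assumed (it is C times a squared norm), so the unconstrained
    maximum of the parabola sits at a/b and we simply clip it to [c, d].
    """
    return max(c, min(d, a / b))

def objective(a, b, delta):
    return a * delta - 0.5 * b * delta ** 2

# Illustrative coefficients: a = v_i(y') - v_i(y''), b = C*||f_i(y')-f_i(y'')||^2,
# c = -alpha_i(y'), d = alpha_i(y'').
a, b, c, d = 0.7, 2.0, -0.1, 0.5
delta_star = smo_step(a, b, c, d)

# Compare against a dense grid of feasible deltas.
grid = [c + (d - c) * t / 10000 for t in range(10001)]
best_on_grid = max(objective(a, b, g) for g in grid)
```

Because the objective is a downward parabola in $\delta$, clipping its vertex to $[c, d]$ is exact; no iterative search is needed.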
The key advantage of SMO is the simplicity of this update. Computing the coefficients involves dot products (or kernel evaluations) to compute $w^\top f_i(y')$ and $w^\top f_i(y'')$, as well as $(f_i(y') - f_i(y''))^\top (f_i(y') - f_i(y''))$.
6.1.2 Selecting SMO pairs
How do we actually select such a pair to guarantee that we make progress in optimizing the objective? Note that at least one of the assignments $y$ must violate (KKT1) or (KKT2), because otherwise $\alpha_i$ is optimal with respect to the current $\alpha_{-i}$. The selection algorithm is outlined in Fig. 6.3.

CHAPTER 6. M³N ALGORITHMS AND EXPERIMENTS

1. Set violation = 0.
2. For each $y$:
3. KKT1: If $\alpha_i(y) = 0$ and $v_i(y) > \tilde v_i(y) + \epsilon$,
4. set $y' = y$ and violation = 1 and go to 7.
5. KKT2: If $\alpha_i(y) > 0$ and $v_i(y) < \tilde v_i(y) - \epsilon$,
6. set $y' = y$ and violation = 2 and go to 7.
7. If violation > 0,
8. for each $y \neq y'$:
9. if violation = 1 and $\alpha_i(y) > 0$,
10. set $y'' = y$ and go to 13.
11. if violation = 2 and $v_i(y) > v_i(y')$,
12. set $y'' = y$ and go to 13.
13. Return $y'$ and $y''$.

Figure 6.3: SMO pair selection.
The first variable in the pair, $y'$, corresponds to a violated condition, while the second variable, $y''$, is chosen to guarantee that solving Eq. (6.3) will result in improving the objective. There are two cases, corresponding to violation of KKT1 and violation of KKT2.
Case KKT1. $\alpha_i(y') = 0$ but $v_i(y') > \tilde v_i(y') + \epsilon$. This is the case where $(i, y')$ is not a support vector but should be. We would like to increase $\alpha_i(y')$, so we need some $\alpha_i(y'') > 0$ to borrow from. There will always be such a $y''$, since $\sum_y \alpha_i(y) = 1$ and $\alpha_i(y') = 0$. Since $v_i(y') > \tilde v_i(y') + \epsilon$, where $\tilde v_i(y')$ is the largest value of $v_i$ over assignments other than $y'$, we have $v_i(y') > v_i(y'') + \epsilon$, so the linear coefficient in Eq. (6.4) is $a = v_i(y') - v_i(y'') > \epsilon$. Hence the unconstrained maximum is positive, $a/b > 0$. Since the upper bound $d = \alpha_i(y'') > 0$, we have enough freedom to improve the objective.
Case KKT2. $\alpha_i(y') > 0$ but $v_i(y') < \tilde v_i(y') - \epsilon$. This is the case where $(i, y')$ is a support vector but should not be. We would like to decrease $\alpha_i(y')$, so we need some $y''$ with $v_i(y'') > v_i(y')$ so that $a/b < 0$. There will always be such a $y''$, since $v_i(y') < \tilde v_i(y') - \epsilon$. Since the lower bound $c = -\alpha_i(y') < 0$, again we have enough freedom to improve the objective.
Since at each iteration we are guaranteed to improve the objective if the KKT conditions are violated, and the objective is bounded, we can use SMO in the block-coordinate ascent algorithm to converge in a finite number of steps. To the best of our knowledge, there are no upper bounds on the speed of convergence of SMO, but experimental evidence has shown it to be a very effective algorithm for SVMs [Platt, 1999]. Of course, we can improve the speed of convergence by adding heuristics in the selection of the pair, as long as we guarantee that improvement is possible when the KKT conditions are violated.
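The whole select-and-step cycle can be illustrated on a toy, fully flat multiclass dual. Everything below is made up for illustration (the values $v_i(y)$, the features, and the starting $\alpha_i$); the selection mirrors Fig. 6.3, with $\tilde v_i(y)$ taken as the best value among assignments other than $y$:

```python
import random

random.seed(1)
EPS = 1e-3  # KKT tolerance

# Hypothetical single-example multiclass dual: assignments y = 0..3 with
# values v(y) and feature vectors f(y); alpha lies on the simplex.
ys = list(range(4))
v = {0: 1.0, 1: 0.2, 2: 0.1, 3: 0.05}
f = {y: [random.uniform(-1, 1) for _ in range(3)] for y in ys}
C = 1.0
alpha = {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0}  # mass sits on a suboptimal assignment

def v_runner_up(y):
    # best value of v over assignments other than y
    return max(v[yy] for yy in ys if yy != y)

def select_pair(alpha):
    """Pair selection in the spirit of Fig. 6.3."""
    for y in ys:
        if alpha[y] == 0 and v[y] > v_runner_up(y) + EPS:      # KKT1 violated
            y1 = y
            # borrow from some assignment with positive dual mass
            # (one must exist because the alphas sum to 1 and alpha[y1] = 0)
            y2 = max((yy for yy in ys if yy != y1 and alpha[yy] > 0),
                     key=lambda yy: alpha[yy])
            return y1, y2
        if alpha[y] > 0 and v[y] < v_runner_up(y) - EPS:        # KKT2 violated
            y1 = y
            y2 = max((yy for yy in ys if yy != y1), key=lambda yy: v[yy])
            return y1, y2
    return None

pair = select_pair(alpha)
y1, y2 = pair

# Single-variable QP coefficients (Eq. 6.4) and the clipped solution.
a = v[y1] - v[y2]
b = C * sum((p - q) ** 2 for p, q in zip(f[y1], f[y2]))
c, d = -alpha[y1], alpha[y2]
delta = max(c, min(d, a / b))

improvement = a * delta - 0.5 * b * delta ** 2
alpha[y1] += delta
alpha[y2] -= delta
```

After the step, the dual mass has moved toward the violating assignment while the simplex constraints stay satisfied, and the subproblem objective strictly improves.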
6.1.3 Structured SMO
Clearly, we cannot perform the above SMO updates in the space of $\alpha$ directly for structured problems, since the number of $\alpha$ variables is exponential. The constraints on the $\mu$ variables are much more complicated, since each $\mu$ participates not only in non-negativity and normalization constraints, but also in clique-agreement constraints. We cannot limit our ascent steps to changing only two $\mu$ variables at a time, because in order to make a change in one clique and stay feasible, we need to modify variables in overlapping cliques. Fortunately, we can perform SMO updates on $\alpha$ variables implicitly, in terms of the marginal dual variables $\mu$.

The diagram in Fig. 6.4 shows the abstract outline of the algorithm. The key steps in the SMO algorithm are checking for violations of the KKT conditions, selecting the pair $y'$ and $y''$, computing the corresponding coefficients $a, b, c, d$, and updating the dual. We will show how to do these operations by doing all the hard work in terms of the polynomially many marginal variables $\mu_i$ and auxiliary "max-marginal" variables.
Structured KKT conditions
As before, we define $v_i(y) = w^\top f_i(y) + \ell_i(y)$. The KKT conditions are, for all $y$:
$$\alpha_i(y) = 0 \;\Rightarrow\; v_i(y) \leq \tilde v_i(y); \qquad \alpha_i(y) > 0 \;\Rightarrow\; v_i(y) \geq \tilde v_i(y). \qquad (6.5)$$
[Diagram: marginals $\mu$ → select & lift → SMO update → project → back to marginals]
Figure 6.4: Structured SMO diagram. We use the marginals $\mu$ to select an appropriate pair of instantiations $y'$ and $y''$ and reconstruct their $\alpha$ values. We then perform the simple SMO update and project the result back onto the marginals.
Of course, we cannot check these explicitly. Instead, we define max-marginals for each clique in the junction tree, $c \in \mathcal{T}^{(i)}$, and each of its values $y_c$, as:
$$\hat v_{i,c}(y_c) = \max_{y \sim y_c}\,\big[w^\top f_i(y) + \ell_i(y)\big], \qquad \hat\alpha_{i,c}(y_c) = \max_{y \sim y_c}\, \alpha_i(y).$$
We also define $\tilde v_{i,c}(y_c) = \max_{y'_c \neq y_c} \hat v_{i,c}(y'_c) = \max_{y \nsim y_c}\,\big[w^\top f_i(y) + \ell_i(y)\big]$. Note that we do not explicitly represent $\alpha_i(y)$, but we can reconstruct the maximum-entropy one from the marginals $\mu_i$ by using Eq. (5.13). Both $\hat v_{i,c}(y_c)$ and $\hat\alpha_{i,c}(y_c)$ can be computed using the Viterbi algorithm (one pass of propagation towards the root and one outwards from the root [Cowell et al., 1999]). We can now express the KKT conditions in terms of the max-marginals, for each clique $c \in \mathcal{T}^{(i)}$ and its values $y_c$:
$$\hat\alpha_{i,c}(y_c) = 0 \;\Rightarrow\; \hat v_{i,c}(y_c) \leq \tilde v_{i,c}(y_c); \qquad \hat\alpha_{i,c}(y_c) > 0 \;\Rightarrow\; \hat v_{i,c}(y_c) \geq \tilde v_{i,c}(y_c). \qquad (6.6)$$
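For a chain, the simplest junction tree, the max-marginals $\hat v$ reduce to a forward and a backward Viterbi-style pass. The sketch below uses made-up scores and checks the two-pass computation against brute-force enumeration; it illustrates the idea on node cliques only, and is not the thesis's general junction-tree implementation:

```python
import itertools
import random

random.seed(2)

# hat_v[j][k] = max over all assignments y with y_j = k of score(y),
# computed by forward and backward Viterbi-style passes on a chain.
N, K = 5, 3
node = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(N)]
edge = [[[random.uniform(-1, 1) for _ in range(K)] for _ in range(K)]
        for _ in range(N - 1)]

def score(y):
    s = sum(node[j][y[j]] for j in range(N))
    return s + sum(edge[j][y[j]][y[j + 1]] for j in range(N - 1))

# Forward: fwd[j][k] = best score of y_0..y_j with y_j = k (node[j][k] included).
fwd = [node[0][:]]
for j in range(1, N):
    fwd.append([node[j][k] + max(fwd[j - 1][kp] + edge[j - 1][kp][k]
                                 for kp in range(K)) for k in range(K)])

# Backward: bwd[j][k] = best score of the suffix after j given y_j = k
# (excludes node[j][k], so fwd + bwd counts it exactly once).
bwd = [[0.0] * K for _ in range(N)]
for j in range(N - 2, -1, -1):
    bwd[j] = [max(edge[j][k][kn] + node[j + 1][kn] + bwd[j + 1][kn]
                  for kn in range(K)) for k in range(K)]

hat_v = [[fwd[j][k] + bwd[j][k] for k in range(K)] for j in range(N)]

# Brute-force reference.
brute = [[max(score(y) for y in itertools.product(range(K), repeat=N)
              if y[j] == k) for k in range(K)] for j in range(N)]
max_err = max(abs(hat_v[j][k] - brute[j][k])
              for j in range(N) for k in range(K))
```

The runner-up quantity $\tilde v$ is then just the largest $\hat v$ over the other values of the same clique.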
Theorem 6.1.1 The KKT conditions in Eq. (6.5) and Eq. (6.6) are equivalent.
Proof:
Eq. (6.5) ⇒ Eq. (6.6). Assume Eq. (6.5). Suppose we have a violation of KKT1: for some $c, y_c$, $\hat\alpha_{i,c}(y_c) = 0$ but $\hat v_{i,c}(y_c) > \tilde v_{i,c}(y_c)$. Since $\hat\alpha_{i,c}(y_c) = \max_{y \sim y_c} \alpha_i(y) = 0$, we have $\alpha_i(y) = 0$ for all $y \sim y_c$. Hence, by Eq. (6.5), $v_i(y) \leq \tilde v_i(y)$ for all $y \sim y_c$. But $\hat v_{i,c}(y_c) > \tilde v_{i,c}(y_c)$ implies the opposite: there exists $y \sim y_c$ such that $v_i(y) > \tilde v_{i,c}(y_c)$, which also implies $v_i(y) > \tilde v_i(y)$, a contradiction.
Now suppose we have a violation of KKT2: for some $c, y_c$, $\hat\alpha_{i,c}(y_c) > 0$ but $\hat v_{i,c}(y_c) < \tilde v_{i,c}(y_c)$. Then $v_i(y) < \tilde v_i(y)$ for all $y \sim y_c$. But $\hat\alpha_{i,c}(y_c) > 0$ implies there exists $y \sim y_c$ such that $\alpha_i(y) > 0$. For that $y$, by Eq. (6.5), $v_i(y) \geq \tilde v_i(y)$, a contradiction.
Eq. (6.6) ⇒ Eq. (6.5). Assume Eq. (6.6). Suppose we have a violation of KKT1: for some $y$, $\alpha_i(y) = 0$ but $v_i(y) > \tilde v_i(y)$. This means that $y$ is the optimum of $v_i(\cdot)$; hence $\hat v_{i,c}(y_c) = v_i(y) > \tilde v_i(y) \geq \tilde v_{i,c}(y_c)$ for all $c \in \mathcal{T}^{(i)}$, $y_c \sim y$. But by Eq. (6.6), if $\hat v_{i,c}(y_c) > \tilde v_{i,c}(y_c)$, then we cannot have $\hat\alpha_{i,c}(y_c) = 0$. Hence all the $y$-consistent $\alpha_i$ max-marginals are positive, $\hat\alpha_{i,c}(y_c) > 0$ for all $c \in \mathcal{T}^{(i)}$, and it follows that all the $y$-consistent marginals $\mu_i$ are positive as well, $\mu_{i,c}(y_c) > 0$ for all $c \in \mathcal{T}^{(i)}$ (since the sum upper-bounds the max). But
$$\alpha_i(y) = \frac{\prod_{c \in \mathcal{T}^{(i)}} \mu_{i,c}(y_c)}{\prod_{s \in \mathcal{S}^{(i)}} \mu_{i,s}(y_s)},$$
so if all the $y$-consistent marginals are positive, then $\alpha_i(y) > 0$, a contradiction.
Now suppose we have a violation of KKT2: for some $y$, $\alpha_i(y) > 0$ but $v_i(y) < \tilde v_i(y)$. Since $\alpha_i(y) > 0$, we know that all the $y$-consistent $\alpha_i$ max-marginals are positive: $\hat\alpha_{i,c}(y_c) > 0$ for all $c \in \mathcal{T}^{(i)}$. By Eq. (6.6), $\hat v_{i,c}(y_c) \geq \tilde v_{i,c}(y_c)$ for all $c \in \mathcal{T}^{(i)}$. Note that, trivially, $\max_{y'} v_i(y') = \max\big(\hat v_{i,c}(y'_c),\, \tilde v_{i,c}(y'_c)\big)$ for any clique $c$ and clique assignment $y'_c$. Since $\hat v_{i,c}(y_c) \geq \tilde v_{i,c}(y_c)$ for all $c \in \mathcal{T}^{(i)}$, we have $\max_{y'} v_i(y') = \hat v_{i,c}(y_c)$ for all $c \in \mathcal{T}^{(i)}$; that is, $\hat v_{i,c}(y_c)$ is the optimal value. We will show that $v_i(y) = \hat v_{i,c}(y_c)$, a contradiction. To show this, we consider any two adjacent nodes in the tree $\mathcal{T}^{(i)}$, cliques $a$ and $b$ with separator $s$, and show that $\hat v_{i,a \cup b}(y_{a \cup b}) = \hat v_{i,a}(y_a) = \hat v_{i,b}(y_b)$. By chaining this equality from the root of the tree to all the leaves, we get $v_i(y) = \hat v_{i,c}(y_c)$ for any $c$.
We need to introduce some more notation to deal with the two parts of the tree induced by cutting the edge between $a$ and $b$. Let $\{A, B\}$ be a partition of the nodes of $\mathcal{T}^{(i)}$ (the cliques of $\mathcal{G}^{(i)}$) resulting from removing the edge between $a$ and $b$, such that $a \in A$ and $b \in B$. We denote the two subsets of an assignment $y$ as $y_A$ and $y_B$ (with overlap at $y_s$). The value of an assignment, $v_i(y)$, can be decomposed into two parts: $v_i(y) = v_{i,A}(y_A) + v_{i,B}(y_B)$, where $v_{i,A}(y_A)$ and $v_{i,B}(y_B)$ only count the contributions of their constituent cliques. Take any maximizer $y^{(a)} \sim y_a$ with $v_i(y^{(a)}) = \hat v_{i,a}(y_a) \geq \tilde v_{i,a}(y_a)$ and any maximizer $y^{(b)} \sim y_b$ with $v_i(y^{(b)}) = \hat v_{i,b}(y_b) \geq \tilde v_{i,b}(y_b)$, which by definition agree with $y$
on the intersection $s$. We decompose the two associated values into their corresponding parts: $v_i(y^{(a)}) = v_{i,A}(y^{(a)}_A) + v_{i,B}(y^{(a)}_B)$ and $v_i(y^{(b)}) = v_{i,A}(y^{(b)}_A) + v_{i,B}(y^{(b)}_B)$. We create a new assignment that combines the best of the two: $y^{(s)} = y^{(b)}_A \cup y^{(a)}_B$. Note that $v_i(y^{(s)}) = v_{i,A}(y^{(b)}_A) + v_{i,B}(y^{(a)}_B) = \hat v_{i,s}(y_s)$, since we essentially fixed the intersection $s$ and maximized over the rest of the variables in $A$ and $B$ separately. Now $\hat v_{i,a}(y_a) = \hat v_{i,b}(y_b) \geq \hat v_{i,s}(y_s)$, since they are optimal as we said above. Hence we have $v_{i,A}(y^{(a)}_A) + v_{i,B}(y^{(a)}_B) = v_{i,A}(y^{(b)}_A) + v_{i,B}(y^{(b)}_B) \geq v_{i,A}(y^{(b)}_A) + v_{i,B}(y^{(a)}_B)$, which implies that $v_{i,A}(y^{(a)}_A) \geq v_{i,A}(y^{(b)}_A)$ and $v_{i,B}(y^{(b)}_B) \geq v_{i,B}(y^{(a)}_B)$. Now we create another assignment that clamps the values of both $a$ and $b$: $y^{(a \cup b)} = y^{(a)}_A \cup y^{(b)}_B$. The value of this assignment is optimal: $v_i(y^{(a \cup b)}) = v_{i,A}(y^{(a)}_A) + v_{i,B}(y^{(b)}_B) = v_i(y^{(a)}) = v_i(y^{(b)})$.
Structured SMO pair selection and update
As in the multiclass case, we select the first variable in the pair, $y'$, to correspond to a violated condition, while the second variable, $y''$, is chosen to guarantee that solving Eq. (6.3) will result in improving the objective. Having selected $y'$ and $y''$, the coefficients for the single-variable QP in Eq. (6.4) are $a = v_i(y') - v_i(y'')$, $b = C\,\|f_i(y') - f_i(y'')\|^2$, $c = -\alpha_i(y')$, $d = \alpha_i(y'')$. As before, we enforce approximate KKT conditions in the algorithm in Fig. 6.5.
We have two cases, corresponding to violation of KKT1 and violation of KKT2.
Case KKT1. $\hat\alpha_{i,c}(y'_c) = 0$ but $\hat v_{i,c}(y'_c) > \tilde v_{i,c}(y'_c) + \epsilon$. We have set $y' = \arg\max_{y \sim y'_c} v_i(y)$, so $v_i(y') = \hat v_{i,c}(y'_c) > \tilde v_{i,c}(y'_c) + \epsilon$ and $\alpha_i(y') = 0$. This is the case where $(i, y')$ is not a support vector but should be. We would like to increase $\alpha_i(y')$, so we need some $\alpha_i(y'') > 0$ to borrow from. There will always be such a $y''$ (with $y''_c \neq y'_c$), since $\sum_y \alpha_i(y) = 1$ and $\alpha_i(y') = 0$. We can find one by choosing a $y_c$ for which $\hat\alpha_{i,c}(y_c) > 0$, which guarantees that $y'' = \arg\max_{y \sim y_c} \alpha_i(y)$ satisfies $\alpha_i(y'') > 0$. Since $y'' \nsim y'_c$, we have $v_i(y'') \leq \tilde v_{i,c}(y'_c)$, and hence $v_i(y') > v_i(y'') + \epsilon$, so the linear coefficient in Eq. (6.4) is $a = v_i(y') - v_i(y'') > \epsilon$. Hence the unconstrained maximum is positive, $a/b > 0$. Since the upper bound $d = \alpha_i(y'') > 0$, we have enough freedom to improve the objective.
Case KKT2. $\hat\alpha_{i,c}(y'_c) > 0$ but $\hat v_{i,c}(y'_c) < \tilde v_{i,c}(y'_c) - \epsilon$. We have set $y' = \arg\max_{y \sim y'_c} \alpha_i(y)$, so $\alpha_i(y') = \hat\alpha_{i,c}(y'_c) > 0$ and $v_i(y') \leq \hat v_{i,c}(y'_c) < \tilde v_{i,c}(y'_c) - \epsilon \leq \tilde v_i(y') - \epsilon$. This is the case where $(i, y')$ is a support vector but should not be. We would like to decrease $\alpha_i(y')$, so we need some $y''$ with $v_i(y'') > v_i(y')$ so that $a/b < 0$. There will always be such a $y''$, since
1. Set violation = 0.
2. For each $c \in \mathcal{T}^{(i)}$ and each $y_c$:
3. KKT1: If $\hat\alpha_{i,c}(y_c) = 0$ and $\hat v_{i,c}(y_c) > \tilde v_{i,c}(y_c) + \epsilon$,
4. set $y'_c = y_c$, $y' = \arg\max_{y \sim y_c} v_i(y)$, violation = 1, and go to 7.
5. KKT2: If $\hat\alpha_{i,c}(y_c) > 0$ and $\hat v_{i,c}(y_c) < \tilde v_{i,c}(y_c) - \epsilon$,
6. set $y'_c = y_c$, $y' = \arg\max_{y \sim y_c} \alpha_i(y)$, violation = 2, and go to 7.
7. If violation > 0,
8. for each $y_c \neq y'_c$:
9. if violation = 1 and $\hat\alpha_{i,c}(y_c) > 0$,
10. set $y'' = \arg\max_{y \sim y_c} \alpha_i(y)$ and go to 13.
11. if violation = 2 and $\hat v_{i,c}(y_c) > \hat v_{i,c}(y'_c)$,
12. set $y'' = \arg\max_{y \sim y_c} v_i(y)$ and go to 13.
13. Return $y'$ and $y''$.

Figure 6.5: Structured SMO pair selection.
$v_i(y') < \tilde v_i(y') - \epsilon$. We can find one by choosing a $y_c$ for which $\hat v_{i,c}(y_c) > \hat v_{i,c}(y'_c)$, which guarantees that $y'' = \arg\max_{y \sim y_c} v_i(y)$ satisfies $v_i(y'') > v_i(y')$. Since the lower bound $c = -\alpha_i(y') < 0$, again we have enough freedom to improve the objective.
Having computed the new values $\alpha'_i(y') = \alpha_i(y') + \delta$ and $\alpha'_i(y'') = \alpha_i(y'') - \delta$, we need to project this change onto the marginal dual variables $\mu_i$. The only marginals affected are the ones consistent with $y'$ and/or $y''$, and the change is very simple:
$$\mu'_{i,c}(y_c) = \mu_{i,c}(y_c) + \delta\,\mathbb{1}(y_c \sim y') - \delta\,\mathbb{1}(y_c \sim y'').$$
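The projection touches only marginals consistent with $y'$ or $y''$. Here is a minimal sketch on a 3-node chain; the pair of assignments, the step size $\delta$, and the starting marginals are all made up for illustration:

```python
# y' and y'' are the two assignments whose alpha values were just updated.
y1 = (0, 1, 1)   # y'
y2 = (1, 1, 0)   # y''
delta = 0.25

# Cliques of a 3-node chain: the nodes and the edges.
cliques = [(0,), (1,), (2,), (0, 1), (1, 2)]

# Start each clique marginal as a point mass on y'' (so each sums to 1).
mu = {c: {tuple(y2[v] for v in c): 1.0} for c in cliques}

def indicator(c, vals, y):
    # 1 if clique c taking values vals is consistent with assignment y
    return all(y[v] == vals[i] for i, v in enumerate(c))

for c in cliques:
    for vals in {tuple(y1[v] for v in c), tuple(y2[v] for v in c)}:
        mu[c].setdefault(vals, 0.0)
        mu[c][vals] += delta * indicator(c, vals, y1)
        mu[c][vals] -= delta * indicator(c, vals, y2)

sums = {c: sum(mu[c].values()) for c in cliques}
```

Normalization of each clique marginal is preserved automatically: whatever mass the $y''$-consistent values lose, the $y'$-consistent values gain, and values shared by both assignments are untouched.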
6.2 Experiments
We selected a subset of ∼6100 handwritten words, with an average length of ∼8 characters, from 150 human subjects, from the data set collected by Kassel [1995]. Each word was divided into characters, and each character was rasterized into an image of 16 by 8 binary
[Figure: (a) three sample handwritten words; (b) bar chart of test error (average per-character) for LogReg, CRF, mSVM, and M³N with linear, quadratic, and cubic kernels]
Figure 6.6: (a) Three example words from the OCR data set; (b) OCR: average per-character test error for logistic regression, CRFs, multiclass SVMs, and M³Ns, using linear, quadratic, and cubic kernels.
pixels. (See Fig. 6.6(a).) In our framework, the image for each word corresponds to $x$, a label of an individual character to $y_j$, and a labeling for a complete word to $y$. Each label $y_j$ takes values from one of 26 classes, $\{a, \ldots, z\}$.

The data set is divided into 10 folds of ∼600 training and ∼5500 testing examples. The accuracy results, summarized in Fig. 6.6(b), are averages over the 10 folds. We implemented a selection of state-of-the-art classification algorithms: independent label approaches, which do not consider the correlation between neighboring characters (logistic regression, multiclass SVMs as described in Eq. (2.9), and one-against-all SVMs, whose performance was slightly lower than that of multiclass SVMs), and sequence approaches (CRFs and our proposed M³ networks). Logistic regression and CRFs are both trained by maximizing the conditional likelihood of the labels given the features, using a zero-mean diagonal Gaussian prior over the parameters, with a standard deviation between 0.1 and 1. The other methods are trained by margin maximization. Our features for each label $y_j$ are the corresponding image of the $j$th character. For the sequence approaches (CRFs and M³N), we used an indicator basis function to represent the correlation between $y_j$ and $y_{j+1}$.
For margin-based methods (SVMs and M³N), we were able to use kernels (both quadratic and cubic were evaluated) to increase the dimensionality of the feature space. We used the structured SMO algorithm with about 30-40 iterations through the data. Using these high-dimensional feature spaces in CRFs is not feasible because of the enormous number of parameters.

Fig. 6.6(b) shows two types of gains in accuracy. First, by using kernels, margin-based methods achieve a very significant gain over the respective likelihood-maximizing methods. Second, by using sequences, we obtain another significant gain in accuracy. Interestingly, the error rate of our method using linear features is 16% lower than that of CRFs, and about the same as that of multiclass SVMs with cubic kernels. Once we use cubic kernels, our error rate is 45% lower than that of CRFs and about 33% lower than the best previous approach. For comparison, the previously published results, although using a different setup (e.g., a larger training set), are about comparable to those of multiclass SVMs.
6.3 Related work
The kernel-adatron [Friess et al., 1998] and voted-perceptron [Freund & Schapire, 1998] algorithms for large-margin classifiers have a similar online optimization scheme. Collins [2001] applied the voted perceptron to structured problems in natural language. Although head-to-head comparisons have not been performed, it seems that, empirically, fewer passes (about 30-40) are needed for our algorithm than in the perceptron literature.

Recently, the Exponentiated Gradient (EG) algorithm [Kivinen & Warmuth, 1997] has been adapted to solve our structured QP for max-margin estimation [Bartlett et al., 2004]. Although the EG algorithm has attractive convergence properties, it has yet to be shown to learn faster than structured SMO, particularly in the early iterations through the dataset.
6.4 Conclusion
In this chapter, we address the large (though polynomial) size of our quadratic program using an effective optimization procedure inspired by SMO. In our experiments with the OCR task, our sequence model significantly outperforms other approaches by incorporating high-dimensional decision boundaries of polynomial kernels over character images while capturing correlations between consecutive characters. Overall, we believe that M³ networks will significantly further the applicability of high-accuracy margin-based methods to real-world structured data. In the next two chapters, we apply this framework to important classes of Markov networks for spatial and relational data.
Chapter 7
Associative Markov networks
In the previous chapter, we considered applications of sequence-structured Markov networks, which allow very efficient inference and learning. The chief computational bottleneck in applying Markov networks to other large-scale prediction problems is inference, which is NP-hard not only in general networks but also in a broad range of practical Markov network structures, including grid-topology networks [Besag, 1986].

One can address the tractability issue by limiting the structure of the underlying network. In some cases, such as the quadtree model used for image segmentation [Bouman & Shapiro, 1994], a tractable structure is determined in advance. In other cases (e.g., [Bach & Jordan, 2001]), the network structure is learned, subject to the constraint that inference on these networks is tractable. In many cases, however, the topology of the Markov network does not allow tractable inference. For example, in hypertext, the network structure can mirror the hyperlink graph, which is usually highly interconnected, leading to computationally intractable networks.
In this chapter, we show that optimal learning is feasible for an important subclass of Markov networks: networks with attractive potentials. This subclass, called associative Markov networks (AMNs), contains networks of discrete variables with $K$ labels and arbitrary-size clique potentials with $K$ parameters that favor the same labels for all variables in the clique. Such positive interactions capture the "guilt by association" pattern of reasoning present in many domains, in which connected ("associated") variables tend to have the same label. AMNs are a natural fit for object recognition and segmentation, webpage classification, and many other applications.
In the max-margin estimation framework, the inference subtask is one of finding the best joint (MAP) assignment to all of the variables in a Markov network. By contrast, other learning tasks (e.g., maximizing the conditional likelihood of the target labels given the features) require that we compute the posterior probabilities of different label assignments, rather than just the MAP.

The MAP problem can naturally be expressed as an integer programming problem. We use a linear program relaxation of this integer program in the min-max formulation. We show that, for associative Markov networks over binary variables ($K = 2$), this linear program provides exact answers. To our knowledge, our method is the first to allow training Markov networks of arbitrary connectivity and topology. For the non-binary case ($K > 2$), the approximate linear program is not guaranteed to be optimal, but we can bound its relative error. Our empirical results suggest that the solutions of the resulting approximate max-margin formulation work well in practice.
We present an AMN-based method for segmentation of complex objects from 3D range data. By constraining the class of Markov networks to AMNs, our models can be learned efficiently and, at run-time, scale up to tens of millions of nodes and edges. The proposed learning formulation effectively and directly learns to exploit a large set of complex surface and volumetric features, while balancing the spatial coherence modeled by the AMN.
7.1 Associative networks
Associative interactions arise naturally in the context of image processing, where nearby pixels are likely to have the same label [Besag, 1986; Boykov et al., 1999b]. In this setting, a common approach is to use a generalized Potts model [Potts, 1952], which penalizes assignments that do not have the same label across an edge: $\phi_{ij}(k, l) = \lambda_{ij}, \;\forall k \neq l$, and $\phi_{ij}(k, k) = 1$, where $\lambda_{ij} \leq 1$.
For binary-valued Potts models, Greig et al. [1989] show that the MAP problem can be formulated as a min-cut in an appropriately constructed graph. Thus, the MAP problem can be solved exactly for this class of models in polynomial time. For $K > 2$, the MAP problem is NP-hard, but a procedure based on a relaxed linear program guarantees a factor 2 approximation of the optimal solution [Boykov et al., 1999b; Kleinberg & Tardos, 1999]. Our associative potentials extend the Potts model in several ways. Importantly, AMNs allow different labels to have different attraction strengths: $\phi_{ij}(k, k) = \lambda_{ij}(k)$, where $\lambda_{ij}(k) \geq 1$, and $\phi_{ij}(k, l) = 1, \;\forall k \neq l$. This additional flexibility is important in many domains, as different labels can have very diverse affinities. For example, foreground pixels tend to have locally coherent values, while background is much more varied.
In a second important extension, AMNs admit non-pairwise interactions between variables, with potentials over cliques involving $m$ variables, $\phi(y_{i_1}, \ldots, y_{i_m})$. In this case, the clique potentials are constrained to have the same type of structure as the edge potentials: there are $K$ parameters $\phi_c(k, \ldots, k) = \lambda_c(k) \geq 1$, and the rest of the entries are set to 1. In particular, using this additional expressive power, AMNs allow us to encode the pattern of (soft) transitivity present in many domains. For example, consider the problem of predicting whether two proteins interact [Vazquez et al., 2003]; this probability may increase if they both interact with another protein. This type of transitivity could be modeled by a ternary clique that has a high $\lambda$ for the assignment with all interactions present.
More formally, we define associative functions and potentials as follows.

Definition 7.1.1 A function $g : \mathcal{Y} \to \mathbb{R}$ is associative for a graph $\mathcal{G}$ over $K$-ary variables if it can be written as:
$$g(y) = \sum_{v \in \mathcal{V}} \sum_{k=1}^{K} g_v(k)\,\mathbb{1}(y_v = k) \;+\; \sum_{c \in \mathcal{C} \setminus \mathcal{V}} \sum_{k=1}^{K} g_c(k)\,\mathbb{1}(y_c = (k, \ldots, k)); \qquad g_c(k) \geq 0, \;\forall c \in \mathcal{C} \setminus \mathcal{V},$$
where $\mathcal{V}$ are the nodes and $\mathcal{C}$ are the cliques of the graph $\mathcal{G}$. A set of potentials $\phi(y)$ is associative if $\phi(y) = e^{g(y)}$ and $g(y)$ is associative.
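A direct transcription of Definition 7.1.1 for a small made-up graph (four variables, $K = 3$, one pairwise clique, and one ternary clique; all scores are hypothetical):

```python
# Evaluate an associative function g(y) in the form of Definition 7.1.1.
# For non-singleton cliques the requirement is g_c(k) >= 0.
K = 3
node_g = {0: [0.5, 0.1, 0.0], 1: [0.0, 0.7, 0.2],
          2: [0.3, 0.3, 0.4], 3: [0.2, 0.0, 0.6]}
clique_g = {(0, 1): [1.0, 0.5, 0.0], (1, 2, 3): [0.0, 2.0, 0.3]}

def g(y):
    total = sum(node_g[v][y[v]] for v in node_g)
    for c, weights in clique_g.items():
        labels = {y[v] for v in c}
        if len(labels) == 1:            # all variables in c share a label k
            total += weights[labels.pop()]
    return total

all_ones = g((1, 1, 1, 1))   # every clique collects its bonus for k = 1
mixed = g((0, 1, 2, 0))      # no clique bonus: labels disagree on each clique
```

Only fully agreeing cliques contribute their (non-negative) bonus, which is exactly the "guilt by association" structure described above.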
7.2 LP Inference
We can write an integer linear program for the problem of finding the maximum of an associative function $g(y)$, where we have a "marginal" variable $\mu_v(k)$ for each node $v \in \mathcal{V}$ and each label $k$, which indicates whether node $v$ has value $k$, and $\mu_c(k)$ for each clique $c$ (containing more than one variable) and label $k$, which represents the event that all nodes in the clique $c$ have label $k$:
$$\max \;\; \sum_{v \in \mathcal{V}} \sum_{k=1}^{K} \mu_v(k)\,g_v(k) + \sum_{c \in \mathcal{C} \setminus \mathcal{V}} \sum_{k=1}^{K} \mu_c(k)\,g_c(k) \qquad (7.1)$$
$$\text{s.t.} \;\; \mu_c(k) \in \{0, 1\}, \;\forall c \in \mathcal{C}, k; \qquad \sum_{k=1}^{K} \mu_v(k) = 1, \;\forall v \in \mathcal{V}; \qquad \mu_c(k) \leq \mu_v(k), \;\forall c \in \mathcal{C} \setminus \mathcal{V},\, v \in c,\, k.$$
Note that we substitute the constraint $\mu_c(k) = \prod_{v \in c} \mu_v(k)$ by the linear inequality constraints $\mu_c(k) \leq \mu_v(k)$. This works because the coefficient $g_c(k)$ is non-negative and we are maximizing the objective function. Hence, at the optimum, $\mu_c(k) = \min_{v \in c} \mu_v(k)$, which is equivalent to $\mu_c(k) = \prod_{v \in c} \mu_v(k)$ when the $\mu_v(k)$ are binary.
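This argument can be checked exhaustively on a toy instance (all scores made up): over integer assignments, the largest feasible $\mu_c(k)$ under the inequality constraints, $\min_{v \in c} \mu_v(k)$, coincides with the product form, so the inequality relaxation loses nothing:

```python
import itertools

# Toy associative instance: 3 binary-label nodes, two pairwise cliques.
K = 2
node_g = {0: [0.2, 0.6], 1: [0.5, 0.1], 2: [0.4, 0.4]}
clique_g = {(0, 1): [0.0, 0.9], (1, 2): [0.3, 0.2]}  # g_c(k) >= 0

best_ip = float("-inf")
for y in itertools.product(range(K), repeat=3):
    # Integer node marginals induced by the labeling y.
    mu_v = {v: [1.0 if y[v] == k else 0.0 for k in range(K)] for v in node_g}
    obj = sum(mu_v[v][k] * node_g[v][k] for v in node_g for k in range(K))
    for c, gc in clique_g.items():
        for k in range(K):
            upper = min(mu_v[v][k] for v in c)   # largest feasible mu_c(k)
            prod = 1.0
            for v in c:
                prod *= mu_v[v][k]
            assert upper == prod                  # min == product on {0, 1}
            obj += upper * gc[k]
    best_ip = max(best_ip, obj)
```

With $g_c(k) \geq 0$, a maximizer always pushes $\mu_c(k)$ up to its bound, which is why the substitution is safe.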
It can be shown that in the binary case, the linear relaxation of Eq. (7.1) (where the constraints $\mu_c(k) \in \{0, 1\}$ are replaced by $\mu_c(k) \geq 0$) is guaranteed to produce an integer solution when a unique solution exists.

Theorem 7.2.1 If $K = 2$, for any associative function $g$, the linear relaxation of Eq. (7.1) has an integral optimal solution.

See Appendix A.2.1 for the proof. This result states that the MAP problem in binary AMNs is tractable, regardless of network topology or clique size. In the non-binary case ($K > 2$), these LPs can produce fractional solutions, and we use a rounding procedure to get an integral solution.

Theorem 7.2.2 If $K > 2$, for any associative function $g$, the linear relaxation of Eq. (7.1) has a solution that is larger than the solution of the integer program by at most a factor of the number of variables in the largest clique.

In the appendix, we also show that the approximation ratio of the rounding procedure is the inverse of the size of the largest clique (e.g., $\frac{1}{2}$ for pairwise networks). Although artificial examples with fractional solutions can easily be constructed by using symmetry, it seems that in real data such symmetries are often broken. In fact, in all our experiments with $K > 2$ on real data, we never encountered fractional solutions.
7.3 Min-cut inference

We can also use efficient min-cut algorithms to perform exact inference in the learned models for $K = 2$, and approximate inference for $K > 2$. For simplicity, we focus on the pairwise AMN case. We first consider the case of binary AMNs, and later show how to use the local search algorithm developed by Boykov et al. [1999a] to perform (approximate) inference in the general multiclass case. For pairwise, binary AMNs, the objective of the integer program in Eq. (7.1) is:
$$\max \;\; \sum_{v \in \mathcal{V}} \big[\mu_v(1)g_v(1) + \mu_v(2)g_v(2)\big] + \sum_{uv \in \mathcal{E}} \big[\mu_{uv}(1)g_{uv}(1) + \mu_{uv}(2)g_{uv}(2)\big]. \qquad (7.2)$$
7.3.1 Graph construction
We construct a graph in which the min-cut will correspond to the optimal MAP labeling for the above objective. First, we recast the objective as a minimization by simply reversing the sign of each coefficient:
$$\min \;\; -\sum_{v \in \mathcal{V}} \big[\mu_v(1)g_v(1) + \mu_v(2)g_v(2)\big] - \sum_{uv \in \mathcal{E}} \big[\mu_{uv}(1)g_{uv}(1) + \mu_{uv}(2)g_{uv}(2)\big]. \qquad (7.3)$$
The graph will consist of a vertex for each node in the AMN, along with two terminal vertices, 1 and 2. In the final $(\mathcal{V}_1, \mathcal{V}_2)$ cut, the $\mathcal{V}_1$ set will correspond to label 1, and the $\mathcal{V}_2$ set will correspond to label 2. We will show how to deal with the node terms (those depending only on a single variable) and the edge terms (those depending on a pair of variables), and then how to combine the two.
Node terms
Consider a node term $-\mu_v(1)g_v(1) - \mu_v(2)g_v(2)$. Such a term corresponds to the node potential contribution to our objective function for node $v$. For each node term corresponding to node $v$, we add a vertex $v$ to the min-cut graph. We then look at $\Delta_v = g_v(1) - g_v(2)$ and create an edge of weight $|\Delta_v|$ from $v$ to either terminal 1 or terminal 2, depending on the sign of $\Delta_v$. The reason for this is that the final min-cut graph must consist of only positive edge weights. An example is presented in Fig. 7.1.

[Figure: gadgets connecting a vertex v (node term, left) and a vertex pair u, v (edge term, right) to terminals 1 and 2]
Figure 7.1: Min-cut graph construction of node (left) and edge (right) terms.

From Fig. 7.1, we see that if the AMN consisted of only node potentials, the graph construction above would add an edge from each node to its more likely label. Thus, if we ran min-cut, we would simply get a cut with cost 0: for each introduced vertex we have only one edge, of positive weight, to either terminal 1 or terminal 2, and we would always choose not to cut any edges.
Edge terms
Now consider an edge term of the form $-\mu_{uv}(1)g_{uv}(1) - \mu_{uv}(2)g_{uv}(2)$. To construct a min-cut graph for the edge term, we introduce two vertices, $u$ and $v$. We connect vertex $u$ to terminal 1 with an edge of weight $g_{uv}(1)$, connect $v$ to terminal 2 with an edge of weight $g_{uv}(2)$, and connect $u$ to $v$ with an edge of weight $g_{uv}(1) + g_{uv}(2)$. Fig. 7.1 shows an example. Observe what happens when both nodes are on the $\mathcal{V}_2$ side of the cut: the value of the min-cut is $g_{uv}(1)$, which must be less than $g_{uv}(2)$, or the min-cut would have placed them both on the $\mathcal{V}_1$ side. When looking at edge terms in isolation, a cut that places the two nodes in different sets will never occur, but when we combine the graphs for node terms and edge terms, such cuts become possible.
We can take the individual graphs we created for node and edge terms and merge them by adding edge weights together (treating missing edges as edges with weight 0). It can be shown that the resulting graph represents the same objective (in the sense that running min-cut on it will optimize the same objective) as the sum of the objectives of the individual graphs. Since our MAP-inference objective is simply a sum of node and edge terms, merging the node and edge term graphs results in a graph in which the min-cut corresponds to the MAP labeling.
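To make the construction concrete, here is a self-contained sketch that builds the merged graph for a small binary pairwise AMN and extracts the MAP labeling from a min-cut. The AMN scores, the graph, and the plain Edmonds-Karp max-flow routine below are all made up for illustration; this is not the thesis's implementation:

```python
from collections import defaultdict, deque
import itertools
import random

random.seed(3)

# Made-up binary pairwise AMN: labels {1, 2}, arbitrary node scores g_v(k),
# non-negative edge scores g_uv(k) (the associativity requirement).
V = list(range(5))
E = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]
gv = {v: {1: random.uniform(-1, 1), 2: random.uniform(-1, 1)} for v in V}
ge = {e: {1: random.uniform(0, 1), 2: random.uniform(0, 1)} for e in E}

def g(y):  # y maps each node to a label in {1, 2}
    s = sum(gv[v][y[v]] for v in V)
    return s + sum(ge[(u, w)][y[u]] for (u, w) in E if y[u] == y[w])

# Directed min-cut graph: source S plays terminal 1, sink T plays terminal 2.
S, T = "s", "t"
cap = defaultdict(lambda: defaultdict(float))
for v in V:                          # node-term gadget
    dv = gv[v][1] - gv[v][2]
    if dv > 0:
        cap[S][v] += dv              # paid if v ends up with label 2
    else:
        cap[v][T] += -dv             # paid if v ends up with label 1
for (u, w) in E:                     # edge-term gadget from the text
    g1, g2 = ge[(u, w)][1], ge[(u, w)][2]
    cap[S][u] += g1
    cap[w][T] += g2
    cap[u][w] += g1 + g2

for x in list(cap):                  # create reverse residual entries
    for ynode in list(cap[x]):
        cap[ynode][x] += 0.0

def augment():                       # one Edmonds-Karp augmentation, if any
    parent = {S: None}
    q = deque([S])
    while q and T not in parent:
        x = q.popleft()
        for ynode, c in cap[x].items():
            if c > 1e-12 and ynode not in parent:
                parent[ynode] = x
                q.append(ynode)
    if T not in parent:
        return False
    bott, x = float("inf"), T
    while parent[x] is not None:
        bott = min(bott, cap[parent[x]][x])
        x = parent[x]
    x = T
    while parent[x] is not None:
        cap[parent[x]][x] -= bott
        cap[x][parent[x]] += bott
        x = parent[x]
    return True

while augment():
    pass

side = {S}                           # source side of the cut -> label 1
q = deque([S])
while q:
    x = q.popleft()
    for ynode, c in cap[x].items():
        if c > 1e-12 and ynode not in side:
            side.add(ynode)
            q.append(ynode)
map_labels = {v: (1 if v in side else 2) for v in V}

best_val = max(g(dict(zip(V, yt)))
               for yt in itertools.product([1, 2], repeat=len(V)))
```

Using directed capacities makes each gadget charge exactly the configurations derived above, so the min-cut labeling matches the brute-force MAP value.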
7.3.2 Multiclass case
The graph construction above finds the best MAP labeling for the binary case, but in practice we would often like to handle multiple classes in AMNs. One of the most effective algorithms for minimizing energy functions like ours is the α-expansion algorithm proposed by Boykov et al. [1999a]. The algorithm performs a series of "expansion" moves, each of which involves optimization over two labels, and it can be shown that it converges to within a factor of 2 of the global minimum.
Expansion algorithm
Consider a current labeling $\mu$ and a particular label $k \in \{1, \ldots, K\}$. Another labeling $\mu'$ is called a "$k$-expansion" move from $\mu$ (following Boykov et al. [1999a]) if $\mu'_v \neq k$ implies $\mu'_v = \mu_v$ (where $\mu_v$ is the label of node $v$ in the AMN). In other words, a $k$-expansion from a current labeling allows each node to either keep its label or change it to $k$.

The α-expansion algorithm cycles through all labels $k$, in either a fixed or a random order, and finds the new labeling whose objective has the lowest value. It terminates when there is no expansion move for any label $k$ that has a lower objective than the current labeling (Fig. 7.2).

The key part of the algorithm is computing the best expansion labeling for a fixed $k$ and a fixed current labeling $\mu$. The min-cut construction from earlier allows us to do exactly that, since an expansion move essentially minimizes a MAP objective over two labels: each node either retains its current label or switches to the label $k$. In this new binary problem, we let label 1 represent a node keeping its current label, and label 2 denote a node taking on the new label $k$. In order to construct the right coefficients
1. Begin with an arbitrary labeling $\mu$.
2. Set success := 0.
3. For each label $k \in \{1, \ldots, K\}$:
3.1. Compute $\hat\mu = \arg\min E(\mu') = \arg\min [-g(\mu')]$ among all $\mu'$ within one $k$-expansion of $\mu$.
3.2. If $E(\hat\mu) < E(\mu)$, set $\mu := \hat\mu$ and success := 1.
4. If success = 1, go to 2.
5. Return $\mu$.

Figure 7.2: The α-expansion algorithm.
for the new binary objective, we need to consider several factors. Below, let $\theta'^{k}_i$ and $\theta'^{k,k}_{ij}$ denote the node and edge coefficients associated with the new binary objective:
◦ Node Potentials For each node i in the current labeling whose current label is not α,
we let θ
t0
i
= θ
y
i
i
, and θ
t1
i
= θ
α
i
, where y
i
denotes the current label of node i, and θ
y
i
denotes the coefﬁcient in the multiclass AMN MAP objective. Note that we ignore
nodes with label α altogether since an αexpansion move cannot change their label.
◦ Edge Potentials For each edge (i, j) ∈ E whose nodes have labels different from α,
we add a new edge potential, with weights θ
t1
ij
= θ
α,α
ij
. If the two nodes of the edge
currently have the same label, we set θ
t0
ij
= θ
y
i
,y
j
ij
, and if the two nodes currently have
different labels we let θ
t0
ij
= 0. For each edge (i, j) ∈ E in which exactly one of the
nodes has label α in the current labeling, we add θ
α,α
ij
, to the node potential θ
t1
i
of the
node whose label is different from α.
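The construction above can be sketched directly in code; the data layout (dicts of coefficients keyed by nodes, edges and label pairs) is an illustrative assumption, not the thesis implementation:

```python
def binary_subproblem(theta_node, theta_edge, labels, alpha):
    """Build coefficients for the binary alpha-expansion subproblem:
    label 0 = keep the current label, label 1 = switch to alpha.

    theta_node[i][k]         -- multi-class node coefficient for node i, label k
    theta_edge[(i, j)][k, l] -- multi-class edge coefficients (dict per edge)
    labels[i]                -- current label of node i
    """
    node0, node1, edge0, edge1 = {}, {}, {}, {}
    active = [i for i in range(len(labels)) if labels[i] != alpha]
    for i in active:
        node0[i] = theta_node[i][labels[i]]      # keep current label y_i
        node1[i] = theta_node[i][alpha]          # switch to alpha
    for (i, j), th in theta_edge.items():
        if labels[i] != alpha and labels[j] != alpha:
            edge1[(i, j)] = th[(alpha, alpha)]
            edge0[(i, j)] = (th[(labels[i], labels[j])]
                             if labels[i] == labels[j] else 0.0)
        elif labels[i] != alpha or labels[j] != alpha:
            # exactly one endpoint is alpha: fold the edge term into the
            # other endpoint's "switch" coefficient
            v = i if labels[i] != alpha else j
            node1[v] += th[(alpha, alpha)]
        # edges with both endpoints already labeled alpha cannot change
    return node0, node1, edge0, edge1
```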
After we have constructed the new binary MAP objective as above, we can apply the
min-cut construction from before to get the optimal labeling within one α-expansion from
the current one. Veksler [1999] shows that the α-expansion algorithm converges in O(N)
iterations, where N is the number of nodes. As noted in Boykov et al. [1999a] and as we
have observed in our experiments, the algorithm terminates after only a few iterations, with
most of the improvement occurring in the first 2-3 expansion moves.
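For concreteness, here is one standard way to solve a binary subproblem of this kind via an s-t min-cut, sketched with a plain Edmonds-Karp max-flow. It minimizes a two-label energy with unary costs and non-negative Potts edge weights, which is a simplification of the associative-potential construction above; all function names are ours:

```python
from collections import deque

def _max_flow(cap, s, t):
    """Edmonds-Karp max-flow; cap is a nested dict of residual capacities.
    Returns the flow value and the source side of a minimum cut."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:      # BFS for an augmenting path
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:                   # no augmenting path: done
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:          # recover the path s -> t
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:                     # update residual capacities
            cap[u][v] -= push
            cap[v][u] = cap[v].get(u, 0) + push
        flow += push

def mincut_binary(unary, pairwise):
    """Minimize E(y) = sum_i unary[i][y_i] + sum_{ij} w_ij * [y_i != y_j]
    over y in {0, 1}^n, with w_ij >= 0 (the associative/Potts case)."""
    s, t = "s", "t"
    cap = {s: {}, t: {}}
    for i, (c0, c1) in enumerate(unary):
        cap[i] = {t: c0}                      # cutting i -> t assigns y_i = 0
        cap[s][i] = c1                        # cutting s -> i assigns y_i = 1
    for (i, j), w in pairwise.items():
        cap[i][j] = cap[i].get(j, 0) + w
        cap[j][i] = cap[j].get(i, 0) + w
    flow, source_side = _max_flow(cap, s, t)
    y = [0 if i in source_side else 1 for i in range(len(unary))]
    return y, flow
```

The min-cut value equals the minimum energy, and the side of the cut a node falls on gives its label.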
7.4 Max-margin estimation
The potentials of the AMN are once again log-linear combinations of basis functions. We
will need the following assumption to ensure that $\mathbf{w}^\top \mathbf{f}(x, y)$ is associative:

Assumption 7.4.1 Basis functions $\mathbf{f}$ are component-wise associative for $\mathcal{C}(x)$ for any $(x, y)$.

Recall that this implies that for cliques larger than one, all basis functions evaluate to 0
for assignments where the values of the nodes are not equal, and are non-negative for the
assignments where the values of the nodes are equal. To ensure that $\mathbf{w}^\top \mathbf{f}(x, y)$ is associative,
it is useful to separate the basis functions with support only on nodes from those with
support on larger cliques.
Definition 7.4.2 Let $\dot{\mathbf{f}}$ be the subset of basis functions $\mathbf{f}$ with support only on singleton
cliques:

$$\dot{\mathbf{f}} = \{f \in \mathbf{f} \;:\; \forall\, x \in \mathcal{X},\; y \in \mathcal{Y},\; c \in \mathcal{C}(\mathcal{G}(x)),\; |c| > 1,\;\; f_c(x_c, y_c) = 0\}.$$

Let $\ddot{\mathbf{f}} = \mathbf{f} \setminus \dot{\mathbf{f}}$ be the rest of the basis functions. Let $\{\dot{\mathbf{w}}, \ddot{\mathbf{w}}\} = \mathbf{w}$ be the corresponding
subsets of parameters.
It is easy to verify that any non-negative combination of associative functions is associative,
and any combination of basis functions with support only on singleton cliques is
also associative, so we have:

Lemma 7.4.3 $\mathbf{w}^\top \mathbf{f}(x, y)$ is associative for $\mathcal{C}(x)$ for any $(x, y)$ whenever Assumption 7.4.1
holds and $\ddot{\mathbf{w}} \geq 0$.
We must make a similar associativity assumption on the loss function in order to guarantee
that the LP inference can handle it.

Assumption 7.4.4 The loss function $\ell(x^{(i)}, y^{(i)}, y)$ is associative for $\mathcal{C}^{(i)}$ for all $i$.

In practice, this restriction is fairly mild, and the Hamming loss, which we use in generalization
bounds and experiments, is associative.
Using the above Assumptions 7.4.1 and 7.4.4 and some algebra (see Appendix A.2.3
for the derivation), we have the following max-margin QP for AMNs:

$$\min \quad \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i,\, v \in \mathcal{V}^{(i)}} \xi_{i,v} \qquad\qquad (7.4)$$

$$\begin{aligned}
\text{s.t.}\quad & \mathbf{w}^\top \Delta \mathbf{f}_{i,v}(k) - \sum_{c \supset v} m_{i,c,v}(k) \;\geq\; \ell_{i,v}(k) - \xi_{i,v}, && \forall i,\; v \in \mathcal{V}^{(i)},\; k; \\
& \ddot{\mathbf{w}}^\top \Delta \ddot{\mathbf{f}}_{i,c}(k) + \sum_{v \in c} m_{i,c,v}(k) \;\geq\; \ell_{i,c}(k), && \forall i,\; c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)},\; k; \\
& m_{i,c,v}(k) \;\geq\; -\ddot{\mathbf{w}}^\top \ddot{\mathbf{f}}_{i,c}(y^{(i)}_c)/|c|, && \forall i,\; c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)},\; v \in c,\; k; \\
& \ddot{\mathbf{w}} \geq 0;
\end{aligned}$$

where $\mathbf{f}_{i,c}(k) = \mathbf{f}_{i,c}(k, \ldots, k)$ and $\ell_{i,c}(k) = \ell_{i,c}(k, \ldots, k)$.
While this primal is more complex than the regular M$^3$N factored primal in Eq. (5.4),
the basic structure of the first two sets of constraints remains the same: we have local
margin requirements and "credit" passed around through the messages $m_{i,c,v}(k)$. The extra
constraints are due to the associativity requirements on the resulting model.
The dual of Eq. (7.4) (see the derivation in Sec. A.2.3) is given by:

$$\max \quad \sum_{i,\, c \in \mathcal{C}^{(i)},\, k} \mu_{i,c}(k)\, \ell_{i,c}(k)
\;-\; \frac{C}{2} \Bigl\| \sum_{i,\, v \in \mathcal{V}^{(i)},\, k} \mu_{i,v}(k)\, \Delta \dot{\mathbf{f}}_{i,v}(k) \Bigr\|^2
\;-\; \frac{C}{2} \Bigl\| \ddot{\boldsymbol{\nu}} + \sum_{i,\, c \in \mathcal{C}^{(i)},\, k} \mu_{i,c}(k)\, \Delta \ddot{\mathbf{f}}_{i,c}(k) \Bigr\|^2$$

$$\begin{aligned}
\text{s.t.}\quad & \mu_{i,c}(k) \geq 0, && \forall i,\; \forall c \in \mathcal{C}^{(i)},\; k; \\
& \sum_{k=1}^{K} \mu_{i,v}(k) = 1, && \forall i,\; \forall v \in \mathcal{V}^{(i)}; \\
& \mu_{i,c}(k) \leq \mu_{i,v}(k), && \forall i,\; \forall c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)},\; v \in c,\; k; \\
& \ddot{\boldsymbol{\nu}} \geq 0.
\end{aligned}$$
In the dual, there are marginals µ for each node and clique, for each value k, similar
to Eq. (5.12). However, the constraints are different, and not surprisingly, are essentially
the constraints from the inference LP relaxation in Eq. (7.1).
The dual and primal solutions are related by

$$\dot{\mathbf{w}} = \sum_{i,\, v \in \mathcal{V}^{(i)},\, k} \mu_{i,v}(k)\, \Delta \dot{\mathbf{f}}_{i,v}(k); \qquad
\ddot{\mathbf{w}} = \ddot{\boldsymbol{\nu}} + \sum_{i,\, c \in \mathcal{C}^{(i)},\, k} \mu_{i,c}(k)\, \Delta \ddot{\mathbf{f}}_{i,c}(k).$$
The $\ddot{\boldsymbol{\nu}}$ variables simply ensure that $\ddot{\mathbf{w}}$ is non-negative (if any component of
$\sum_{i,\, c \in \mathcal{C}^{(i)},\, k} \mu_{i,c}(k)\, \Delta \ddot{\mathbf{f}}_{i,c}(k)$
is negative, maximizing the objective will force the corresponding component of $\ddot{\boldsymbol{\nu}}$ to cancel
it out). Note that the objective can be written in terms of dot products of the node basis functions,
$\Delta \dot{\mathbf{f}}_{i,v}(k)^\top \Delta \dot{\mathbf{f}}_{j,\bar{v}}(\bar{k})$, so they can be kernelized. Unfortunately, the edge basis functions
cannot be kernelized because of the non-negativity constraint.
For K = 2, the LP inference is exact, so that Eq. (7.4) learns exact max-margin weights
for Markov networks of arbitrary topology. For K > 2, the linear relaxation leads to a
strengthening of the constraints on w by potentially adding constraints corresponding to
fractional assignments, as in the case of untriangulated networks. Thus, the optimal choice
w, ξ for the original QP may no longer be feasible, leading to a different choice of weights.
However, as our experiments show, these weights tend to do well in practice.
7.5 Experiments
We applied associative Markov networks to the task of terrain classification. Terrain classification
is very useful for autonomous mobile robots in real-world environments for path
planning, target detection, and as a preprocessing step for other perceptual tasks. The
Stanford Segbot Project¹ has provided us with laser range maps of the Stanford campus,
collected by a moving robot equipped with SICK2 laser sensors (Fig. 7.3). The data consists
of around 35 million points, each represented by its (x, y, z) coordinates in an absolute
frame of reference (Fig. 7.4). Thus, the only available information is the location of points,
which was fairly noisy because of localization errors.
Our task is to classify the laser range points into four classes: ground, building, tree,
and shrubbery.

¹Many thanks to Michael Montemerlo and Sebastian Thrun for sharing the data.

Figure 7.3: Segbot: roving robot equipped with SICK2 laser sensors.

Since classifying ground points is trivial given their absolute z-coordinate
(height), we classify them deterministically by thresholding the z coordinate at a value
close to 0. After we do that, we are left with approximately 20 million non-ground points.
Each point is represented simply as a location in an absolute 3D coordinate system. The
features we use require preprocessing to infer properties of the local neighborhood of a
point, such as how planar the neighborhood is, or how many of the neighbors are close to
the ground. The features we use are invariant to rotation in the xy plane, as well as the
density of the range scan, since scans tend to be sparser in regions farther from the robot.
Our first type of feature is based on the principal plane around each point. For each point we
sample 100 points in a cube of radius 0.5 meters. We run PCA on these points to get the
plane of maximum variance (spanned by the first two principal components). We then partition
the cube into 3 × 3 × 3 bins around the point, oriented with respect to the principal
plane, and compute the percentage of points lying in the various sub-cubes. We use a number
of features derived from the cube, such as the percentage of points in the central column,
the outside corners, the central plane, etc. These features capture the local distribution well
and are especially useful in finding planes. Our second type of feature is based on a column
Figure 7.4: 3D laser scan range map of the Stanford Quad.
around each point. We take a cylinder of radius 0.25 meters, which extends vertically to
include all the points in a “column”. We then compute what percentage of the points lie in
various segments of this vertical column (e.g., between 2m and 2.5m). Finally, we also use
an indicator feature of whether or not a point lies within 2m of the ground. This feature is
especially useful in classifying shrubbery.
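To make the feature construction concrete, here is a simplified NumPy sketch of both feature types; the exact bin boundaries and function signatures are illustrative assumptions, not the code used in the thesis:

```python
import numpy as np

def plane_features(point, neighbors, bins=3, radius=0.5):
    """Fractions of neighbors in a 3x3x3 grid oriented by the principal plane."""
    centered = neighbors - point
    # PCA via SVD: the first two right singular vectors span the
    # plane of maximum variance through the neighborhood.
    _, _, vt = np.linalg.svd(centered - centered.mean(axis=0),
                             full_matrices=False)
    coords = centered @ vt.T                    # principal-axis frame
    edges = np.linspace(-radius, radius, bins + 1)
    hist, _ = np.histogramdd(coords, bins=[edges] * 3)
    return hist.flatten() / max(len(neighbors), 1)

def column_features(point, cloud, radius=0.25,
                    z_bins=(0.0, 0.5, 1.0, 2.0, 2.5)):
    """Fractions of points in vertical segments of a cylinder around `point`,
    plus an indicator of lying within 2m of the ground."""
    horiz = np.linalg.norm(cloud[:, :2] - point[:2], axis=1)
    column = cloud[horiz <= radius]
    hist, _ = np.histogram(column[:, 2], bins=z_bins)
    fractions = hist / max(len(column), 1)
    return np.concatenate([fractions, [float(point[2] <= 2.0)]])
```

Both feature vectors are normalized fractions, which keeps them invariant to scan density, as the text requires.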
For training we select roughly 30 thousand points that represent the classes well: a
segment of a wall, a tree, some bushes. We considered three different models: SVM,
Voted-SVM and AMNs. All methods use the same set of features, augmented with a
quadratic kernel.
The first model is a multi-class SVM with a quadratic kernel over the above features.
This model (Fig. 7.5, right panel and Fig. 7.7, top panel) achieves reasonable performance
Figure 7.5: Terrain classification results showing Stanford Memorial Church, obtained
with the SVM, Voted-SVM and AMN models. (Color legend: buildings/red, trees/green,
shrubs/blue, ground/gray.)
in many places, but fails to enforce local consistency of the classification predictions. For
example, arches on buildings and other less planar regions are consistently confused for
trees, even though they are surrounded entirely by buildings.

We improved upon the SVM by smoothing its predictions using voting. For each point
we took its local neighborhood (we varied the radius to get the best possible results) and
assigned the point the label of the majority of its 100 neighbors. The Voted-SVM model
(Fig. 7.5, middle panel and Fig. 7.7, middle panel) performs slightly better than the SVM: for
example, it smooths out trees and some parts of the buildings. Yet it still fails in areas like
arches of buildings where the SVM classifier has a locally consistent wrong prediction.
The final model is a pairwise AMN over laser scan points, with associative potentials
to ensure smoothness. Each point is connected to 6 of its neighbors: 3 of them are sampled
randomly from the local neighborhood in a sphere of radius 0.5m, and the other 3 are
sampled at random from the vertical cylinder column of radius 0.25m. It is important to
ensure vertical consistency, since the SVM classifier is wrong in areas that are higher off the
Figure 7.6: The running time (in seconds) of the min-cut-based inference algorithm for
different problem sizes. The problem size is the sum of the number of nodes and the
number of edges. Note the near-linear performance of the algorithm and its efficiency even
for large models.
ground (due to the decrease in point density) or because objects tend to look different as we
vary their z-coordinate (for example, tree trunks and tree crowns look different). While we
experimented with a variety of edge features, including various distances between points,
we found that even using only a constant feature performs well.
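The neighborhood sampling can be sketched as follows (a quadratic scan for clarity; at the scale of the real data a spatial index would be used, and the function name is ours):

```python
import numpy as np

def sample_edges(points, rng, sphere_r=0.5, cyl_r=0.25, k=3):
    """Connect each point to up to k random neighbors within a sphere of
    radius sphere_r and up to k within a vertical cylinder of radius cyl_r."""
    n = len(points)
    edges = set()
    idx = np.arange(n)
    for i in range(n):
        d = points - points[i]
        in_sphere = idx[(np.linalg.norm(d, axis=1) <= sphere_r) & (idx != i)]
        in_cyl = idx[(np.linalg.norm(d[:, :2], axis=1) <= cyl_r) & (idx != i)]
        for pool in (in_sphere, in_cyl):
            if len(pool):
                picks = rng.choice(pool, size=min(k, len(pool)), replace=False)
                edges.update((min(i, int(j)), max(i, int(j))) for j in picks)
    return edges
```

Storing undirected edges as sorted index pairs in a set avoids duplicates when both endpoints sample each other.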
We trained the AMN model using CPLEX to solve the quadratic program; the training
took about an hour on a Pentium 3 desktop. The inference over each segment was
performed using min-cut with α-expansion moves as described above. We used a publicly
available implementation of the min-cut algorithm, which uses bidirectional search
trees for augmenting paths (see Boykov and Kolmogorov [2004]). The running time is
largely dominated by I/O, with the actual min-cut taking less than two minutes even
for the largest segment. The performance is summarized in Fig. 7.6; as we can see, it
is roughly linear in the size of the problem (number of nodes and number of edges).
We can see that the predictions of the AMN (Fig. 7.5, left panel and Fig. 7.7, bottom
panel) are much smoother: for example, building arches and tree trunks are predicted
correctly. We also hand-labeled around 180 thousand points of the test set (Fig. 7.8) and
computed accuracies of the predictions shown in Fig. 7.9 (excluding ground, which was
classified by preprocessing). The differences are dramatic: SVM: 68%, Voted-SVM: 73%
and AMN: 93%. See more results, including a fly-through movie of the data, at
http://ai.stanford.edu/~btaskar/3Dmap/.
7.6 Related work
Several authors have considered extensions to the Potts model. Kleinberg and Tardos
[1999] extend the multi-class Potts model to have more general edge potentials, under the
constraint that the negative logs of the edge potentials form a metric on the set of labels. They
also provide a solution based on a relaxed LP that has certain approximation guarantees.
More recently, Kolmogorov and Zabih [2002] showed how to optimize energy functions
containing binary and ternary interactions using graph cuts, as long as the parameters
satisfy a certain regularity condition. Our definition of associative potentials also
satisfies the Kolmogorov and Zabih regularity condition for K = 2. However, the structure
of our potentials is simpler to describe and extend for the multi-class case. In fact, we can
extend our max-margin framework to estimate their more general potentials by expressing
inference as a linear program.
Our terrain classification approach is most closely related to work in vision applying
conditional random fields (CRFs) to 2D images. Kumar and Hebert [2003] train CRFs
using a pseudo-likelihood approximation to the distribution P(Y | X), since estimating
the true conditional distribution is intractable. Unlike their work, our learning formulation
provides an exact and tractable optimization algorithm, as well as formal guarantees for
binary classification problems. Moreover, our approach can also handle
multi-class problems in a straightforward manner.
7.7 Conclusion
In this chapter, we provide an algorithm for max-margin training of associative Markov
networks, a subclass of Markov networks that allows only positive interactions between
related variables. Our approach relies on a linear programming relaxation of the MAP
problem, which is the key component in the quadratic program associated with the max-margin
formulation. We thus provide a polynomial time algorithm which approximately
solves the maximum margin estimation problem for any associative Markov network. Importantly,
our method is guaranteed to find the optimal (margin-maximizing) solution for all
binary-valued AMNs, regardless of the clique size or the connectivity. To our knowledge,
this algorithm is the first to provide an effective learning procedure for Markov networks
of such general structure.
Our results in the binary case rely on the fact that the LP relaxation of the MAP problem
provides exact solutions. In the non-binary case, we are not guaranteed exact solutions, but
we can prove constant-factor approximation bounds on the MAP solution returned by the
relaxed LP. It would be interesting to see whether these bounds provide us with guarantees
on the quality (e.g., the margin) of our learned model.
We present large-scale experiments with terrain segmentation and classification from
3D range data, involving AMNs with tens of millions of nodes and edges. The class of
associative Markov networks appears to cover a large number of interesting applications.
We have explored only a computer vision application in this chapter, and consider another
one (hypertext classification) in the next. It would be very interesting to consider other
applications, such as extracting protein complexes from protein-protein interaction data, or
predicting links in relational data. The min-cut based inference is able to handle very large
networks, and it is an interesting challenge to apply the algorithm to even larger models
and develop efficient distributed implementations.

However, despite the prevalence of fully associative Markov networks, it is clear that
many applications call for repulsive potentials. While we clearly cannot introduce fully
general potentials into AMNs without running up against the NP-hardness of the general
problem, it would be interesting to see whether we can extend the class of networks we can learn
effectively.
Figure 7.7: Results from the SVM, Voted-SVM and AMN models.
Figure 7.8: Labeled part of the test set: ground truth (top) and SVM predictions (bottom).
Figure 7.9: Predictions of the Voted-SVM (top) and AMN (bottom) models.
Chapter 8
Relational Markov networks
In the previous chapters, we have seen how sequential and spatial correlation between
labels can be exploited for tremendous accuracy gains. In many other supervised learning
tasks, the entities to be labeled are related to each other in very complex ways, not just
sequentially or spatially. For example, in hypertext classification, the labels of linked pages
are highly correlated. A standard approach is to classify each entity independently, ignoring
the correlations between them. In this chapter, we present a framework that builds on
Markov networks and provides a ﬂexible language for modeling rich interaction patterns in
structured data. We provide experimental results on a webpage classiﬁcation task, showing
that accuracy can be signiﬁcantly improved by modeling relational dependencies.
Many real-world data sets are innately relational: hyperlinked webpages, cross-citations
in patents and scientific papers, social networks, medical records, and more. Such data consist
of entities of different types, where each entity type is characterized by a different set
of attributes. Entities are related to each other via different types of links, and the link
structure is an important source of information.
Consider a collection of hypertext documents that we want to classify using some set
of labels. Most naively, we can use a bag of words model, classifying each webpage solely
using the words that appear on the page. However, hypertext has a very rich structure that
this approach loses entirely. One document has hyperlinks to others, typically indicating
that their topics are related. Each document also has internal structure, such as a partition
into sections; hyperlinks that emanate from the same section of the document are even
110 CHAPTER 8. RELATIONAL MARKOV NETWORKS
more likely to point to similar documents. When classifying a collection of documents,
these are important cues that can potentially help us achieve better classification accuracy.
Therefore, rather than classifying each document separately, we want to provide a form of
collective classiﬁcation, where we simultaneously decide on the class labels of all of the
entities together, and thereby can explicitly take advantage of the correlations between the
labels of related entities.
We propose the use of a joint probabilistic model for an entire collection of related entities.
We introduce the framework of relational Markov networks (RMNs), which compactly
deﬁnes a Markov network over a relational data set. The graphical structure of an RMN is
based on the relational structure of the domain, and can easily model complex patterns over
related entities. For example, we can represent a pattern where two linked documents are
likely to have the same topic. We can also capture patterns that involve groups of links: for
example, consecutive links in a document tend to refer to documents with the same label.
As we show, the use of an undirected graphical model avoids the difﬁculties of deﬁning
a coherent generative model for graph structures in directed models. It thereby allows us
tremendous ﬂexibility in representing complex patterns.
8.1 Relational classiﬁcation
Consider hypertext as a simple example of a relational domain. A relational domain is
deﬁned by a schema, which describes entities, their attributes and relations between them.
In our domain, there are two entity types: Doc and Link. If a webpage is represented as a
bag of words, Doc would have a set of boolean attributes Doc.HasWord$_k$ indicating whether
the word $k$ occurs on the page. It would also have the label attribute Doc.Label, indicating
the topic of the page, which takes on a set of categorical values. The Link entity type has
two attributes: Link.From and Link.To, both of which refer to Doc entities.
In general, a schema specifies a set of entity types $\mathcal{E} = \{E_1, \ldots, E_n\}$. Each type $E$ is
associated with three sets of attributes: content attributes $E.X$ (for example, Doc.HasWord$_k$),
label attributes $E.Y$ (for example, Doc.Label), and reference attributes $E.R$ (for example,
Link.To). For simplicity, we restrict label and content attributes to take on categorical values.
Reference attributes include a special unique key attribute $E.K$ that identifies each
entity. Other reference attributes $E.R$ refer to entities of a single type $E' = \text{Range}(E.R)$
and take values in $\text{Domain}(E'.K)$.
An instantiation $\mathcal{I}$ of a schema $\mathcal{E}$ specifies the set of entities $\mathcal{I}(E)$ of each entity type
$E \in \mathcal{E}$ and the values of all attributes for all of the entities. For example, an instantiation
of the hypertext schema is a collection of webpages, specifying their labels, the words they
contain and the links between them. We will use $\mathcal{I}.X$, $\mathcal{I}.Y$ and $\mathcal{I}.R$ to denote the content,
label and reference attributes in the instantiation $\mathcal{I}$, and $\mathcal{I}.x$, $\mathcal{I}.y$ and $\mathcal{I}.r$ to denote the values
of those attributes. The component $\mathcal{I}.r$, which we call an instantiation skeleton or instantiation
graph, specifies the set of entities (nodes) and their reference attributes (edges). A
hypertext instantiation graph specifies a set of webpages and links between them, but not
their words or labels.

Taskar et al. [2001] suggest the use of probabilistic relational models
(PRMs) for the collective classification task. PRMs [Koller & Pfeffer, 1998; Friedman
et al., 1999; Getoor et al., 2002] are a relational extension of Bayesian networks [Pearl,
1988]. A PRM specifies a probability distribution over instantiations consistent with a
given instantiation graph by specifying a Bayesian-network-like template-level probabilistic
model for each entity type. Given a particular instantiation graph, the PRM induces
a large Bayesian network over that instantiation that specifies a joint probability distribution
over all attributes of all of the entities. This network reflects the interactions between
related instances by allowing us to represent correlations between their attributes.
In our hypertext example, a PRM might use a naive Bayes model for words, with a directed
edge between Doc.Label and each attribute Doc.HasWord$_k$; each of these attributes
would have a conditional probability distribution $P(\text{Doc.HasWord}_k \mid \text{Doc.Label})$ associated
with it, indicating the probability that word $k$ appears in the document given each of
the possible topic labels. More importantly, a PRM can represent the interdependencies
between topics of linked documents by introducing an edge from Doc.Label to Doc.Label
of two documents if there is a link between them. Given a particular instantiation graph
containing some set of documents and links, the PRM specifies a Bayesian network over all
of the documents in the collection. We would have a probabilistic dependency from each
document's label to the words on the document, and a dependency from each document's
label to the labels of all of the documents to which it points. Taskar et al. show that this
approach works well for classifying scientific documents, using both the words in the title
and abstract and the citation-link structure.
However, the application of this idea to other domains, such as webpages, is problematic,
since there are many cycles in the link graph, leading to cycles in the induced "Bayesian
network", which is therefore not a coherent probabilistic model. Getoor et al. [2001] suggest
an approach where we do not include direct dependencies between the labels of linked
webpages, but rather treat links themselves as random variables. Each two pages have a
"potential link", which may or may not exist in the data. The model defines the probability
of the link existence as a function of the labels of the two endpoints. In this link existence
model, labels have no incoming edges from other labels, and the cyclicity problem
disappears. This model, however, has other fundamental limitations. In particular, the resulting
Bayesian network has a random variable for each potential link ($N^2$ variables for
collections containing N pages). This quadratic blowup occurs even when the actual link
graph is very sparse. When N is large (e.g., the set of all webpages), a quadratic growth is
intractable. Even more problematic are the inherent limitations on the expressive power imposed
by the constraint that the directed graph must represent a coherent generative model
over graph structures. The link existence model assumes that the presence of different
edges are conditionally independent events. Representing more complex patterns involving
correlations between multiple edges is very difficult. For example, if two pages point to the
same page, it is more likely that they point to each other as well. Such interactions between
many overlapping triples of links do not fit well into the generative framework.
Furthermore, directed models such as Bayesian networks and PRMs are usually trained
to optimize the joint probability of the labels and other attributes, while the goal of classification
is a discriminative model of labels given the other attributes. The advantage
of training a model only to discriminate between labels is that it does not have to trade
off between classification accuracy and modeling the joint distribution over non-label attributes.
In many cases, discriminatively trained models are more robust to violations of
independence assumptions and achieve higher classification accuracy than their generative
counterparts.
Figure 8.1: An unrolled Markov net over linked documents. The links follow a common
pattern: documents with the same label tend to link to each other more often.
8.2 Relational Markov networks
We now extend the framework of Markov networks to the relational setting. A relational
Markov network (RMN) speciﬁes a conditional distribution over all of the labels of all
of the entities in an instantiation given the relational structure and the content attributes.
(We provide the deﬁnitions directly for the conditional case, as the unconditional case is a
special case where the set of content attributes is empty.) Roughly speaking, it speciﬁes the
cliques and potentials between attributes of related entities at a template level, so a single
model provides a coherent distribution for any collection of instances from the schema.
For example, suppose that pages with the same label tend to link to each other, as
in Fig. 8.1. We can capture this correlation between labels by introducing, for each link, a
clique between the labels of the source and the target page. The potential on the clique will
have higher values for assignments that give a common label to the linked pages.
To specify which cliques should be constructed in an instantiation, we will define a notion
of a relational clique template. A relational clique template specifies tuples of variables
in the instantiation by using a relational query language. For our link example, we can write
the template as a kind of SQL query:
SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link
WHERE link.From = doc1.Key and link.To = doc2.Key
Note the three clauses that define a query: the FROM clause specifies the cross product
of entities to be filtered by the WHERE clause, and the SELECT clause picks out the
attributes of interest. Our definition of clique templates contains the corresponding three
parts.
A relational clique template $C = (\mathbf{F}, W, S)$ consists of three components:

◦ $\mathbf{F} = \{F_i\}$: a set of entity variables, where an entity variable $F_i$ is of type $E(F_i)$.

◦ $W(\mathbf{F}.R)$: a boolean formula using conditions of the form $F_i.R_j = F_k.R_l$.

◦ $\mathbf{F}.S \subseteq \mathbf{F}.X \cup \mathbf{F}.Y$: a selected subset of content and label attributes in $\mathbf{F}$.

For the clique template corresponding to the SQL query above, $\mathbf{F}$ consists of doc1, doc2
and link, of types Doc, Doc and Link, respectively. $W(\mathbf{F}.R)$ is link.From = doc1.Key ∧
link.To = doc2.Key, and $\mathbf{F}.S$ is doc1.Category and doc2.Category.
A clique template specifies a set of cliques in an instantiation $\mathcal{I}$:

$$C(\mathcal{I}) \equiv \{c = \mathbf{f}.S \;:\; \mathbf{f} \in \mathcal{I}(\mathbf{F}) \wedge W(\mathbf{f}.r)\},$$

where $\mathbf{f}$ is a tuple of entities $\{f_i\}$ in which each $f_i$ is of type $E(F_i)$; $\mathcal{I}(\mathbf{F}) = \mathcal{I}(E(F_1)) \times
\ldots \times \mathcal{I}(E(F_n))$ denotes the cross-product of entities in the instantiation; the clause $W(\mathbf{f}.r)$
ensures that the entities are related to each other in specified ways; and finally, $\mathbf{f}.S$ selects
the appropriate attributes of the entities. Note that the clique template does not specify the
nature of the interaction between the attributes; that is determined by the clique potentials,
which will be associated with the template.
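The semantics of $C(\mathcal{I})$ can be mimicked directly in code. A toy sketch for the link template above, over a hypothetical dict-based instantiation:

```python
from itertools import product

def link_template_cliques(docs, links):
    """Enumerate the cliques selected by the link template:
    SELECT doc1.Category, doc2.Category
    FROM Doc doc1, Doc doc2, Link link
    WHERE link.From = doc1.Key and link.To = doc2.Key

    docs  -- dict mapping document key -> category (the label attribute)
    links -- list of (from_key, to_key) pairs
    """
    cliques = []
    # FROM: cross product of the entity variables
    for (d1, d2), (frm, to) in product(product(docs, docs), links):
        # WHERE: filter by the reference-attribute conditions
        if frm == d1 and to == d2:
            # SELECT: keep the label variables of the matched tuple
            cliques.append((d1, d2))
    return cliques
```

Each returned pair names the two label variables that share a clique potential in the unrolled network.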
This definition of a clique template is very flexible, as the WHERE clause of a template
can be an arbitrary predicate. It allows modeling complex relational patterns on the
instantiation graphs. To continue our webpage example, consider another common pattern
in hypertext: links in a webpage tend to point to pages of the same category. This pattern
can be expressed by the following template:
SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link1, Link link2
WHERE link1.From = link2.From and link1.To = doc1.Key
and link2.To = doc2.Key and not doc1.Key = doc2.Key
Depending on the expressive power of our template definition language, we may be able
to construct very complex templates that select entire subgraph structures of an instantiation.
We can easily represent patterns involving three (or more) interconnected documents
without worrying about the acyclicity constraint imposed by directed models. Since the
clique templates do not explicitly depend on the identities of entities, the same template can
select subgraphs whose structure is fairly different. The RMN allows us to associate the
same clique potential parameters with all of the subgraphs satisfying the template, thereby
allowing generalization over a wide range of different structures.
A Relational Markov network (RMN) $\mathcal{M} = (\mathbf{C}, \Phi)$ specifies a set of clique templates
$\mathbf{C}$ and corresponding potentials $\Phi = \{\phi_C\}_{C \in \mathbf{C}}$ to define a conditional distribution:

$$P(\mathcal{I}.y \mid \mathcal{I}.x, \mathcal{I}.r) = \frac{1}{Z(\mathcal{I}.x, \mathcal{I}.r)} \prod_{C \in \mathbf{C}} \prod_{c \in C(\mathcal{I})} \phi_C(\mathcal{I}.x_c, \mathcal{I}.y_c)$$

where $Z(\mathcal{I}.x, \mathcal{I}.r)$ is the normalizing partition function:

$$Z(\mathcal{I}.x, \mathcal{I}.r) = \sum_{\mathcal{I}.y'} \prod_{C \in \mathbf{C}} \prod_{c \in C(\mathcal{I})} \phi_C(\mathcal{I}.x_c, \mathcal{I}.y'_c).$$
Using the log-linear representation of potentials, $\phi_C(\mathbf{V}_C) = \exp\{\mathbf{w}_C^\top \mathbf{f}_C(\mathbf{V}_C)\}$, we can
write

$$\log P(\mathcal{I}.y \mid \mathcal{I}.x, \mathcal{I}.r) = \mathbf{w}^\top \mathbf{f}(\mathcal{I}.x, \mathcal{I}.y, \mathcal{I}.r) - \log Z(\mathcal{I}.x, \mathcal{I}.r)$$

where

$$\mathbf{f}_C(\mathcal{I}.x, \mathcal{I}.y, \mathcal{I}.r) = \sum_{c \in C(\mathcal{I})} \mathbf{f}_C(\mathcal{I}.x_c, \mathcal{I}.y_c)$$

is the sum over all appearances of the template $C(\mathcal{I})$ in the instantiation, and $\mathbf{f}$ is the vector
of all $\mathbf{f}_C$.
Given a particular instantiation $\mathcal{I}$ of the schema, the RMN $\mathcal{M}$ produces an unrolled
Markov network over the attributes of entities in $\mathcal{I}$. The cliques in the unrolled network
are determined by the clique templates $\mathbf{C}$. We have one clique for each $c \in C(\mathcal{I})$, and
all of these cliques are associated with the same clique potential $\phi_C$. In our webpage
example, an RMN with the link basis function described above would define a Markov net
in which, for every link between two pages, there is an edge between the labels of these
pages. Fig. 8.1 illustrates a simple instance of this unrolled Markov network.
8.3 Approximate inference and learning
Applying both maximum likelihood and maximum margin learning in the relational setting
requires inference in very large and complicated networks, where exact inference is
typically intractable. We therefore resort to approximate methods.
Maximum likelihood estimation
For maximum likelihood learning, we need to compute basis function expectations, not
just the most likely assignment. There is a wide variety of approximation schemes for this
problem, including MCMC and variational methods. We chose to use belief propagation
for its simplicity and relative efﬁciency and accuracy. Belief Propagation (BP) is a local
message passing algorithm introduced by Pearl [1988]. It is guaranteed to converge to the
correct marginal probabilities for each node only for singly connected Markov networks.
However, recent analysis [Yedidia et al., 2000] provides some theoretical justification. Empirical
results [Murphy et al., 1999] show that it often converges in general networks, and
when it does, the marginals are a good approximation to the correct posteriors. As our
results in Sec. 8.4 show, this approach works well in our domain. We refer the reader to
Yedidia et al. for a detailed description of the BP algorithm.
We provide a brief outline of one variant of BP, referring to [Murphy et al., 1999]
for more details. For simplicity, we assume a pairwise network where all potentials are
associated only with nodes and edges, given by:
P(Y_1, \ldots, Y_n) = \frac{1}{Z} \prod_{ij} ψ_{ij}(Y_i, Y_j) \prod_i ψ_i(Y_i)

where ij ranges over the edges of the network and ψ_{ij}(Y_i, Y_j) = φ(x_{ij}, Y_i, Y_j), ψ_i(Y_i) = φ(x_i, Y_i).
The belief propagation algorithm is very simple. At each iteration, each node Y_i sends
the following messages to all its neighbors N(i):

m_{ij}(Y_j) ← α \sum_{y_i} ψ_{ij}(y_i, Y_j) ψ_i(y_i) \prod_{k ∈ N(i) − j} m_{ki}(y_i)
where α is a (different) normalizing constant. This process is repeated until the messages
converge. At any point in the algorithm, the marginal distribution of any node Y_i is approximated by

b_i(Y_i) = α ψ_i(Y_i) \prod_{k ∈ N(i)} m_{ki}(Y_i)
and the marginal distribution of a pair of nodes connected by an edge is approximated by

b_{ij}(Y_i, Y_j) = α ψ_{ij}(Y_i, Y_j) ψ_i(Y_i) ψ_j(Y_j) \prod_{k ∈ N(i) − j} m_{ki}(Y_i) \prod_{l ∈ N(j) − i} m_{lj}(Y_j)
These approximate marginals are precisely what we need for computing the basis
function expectations and performing classification. Computing the basis function
expectations involves summing their expected values for each clique using the
approximate marginals b_i(Y_i) and b_{ij}(Y_i, Y_j). Similarly, we use \arg\max_{y_i} b_i(y_i) at prediction
time. Note that we can also use the max-product variant of loopy BP, with

m_{ij}(Y_j) ← α \max_{y_i} ψ_{ij}(y_i, Y_j) ψ_i(y_i) \prod_{k ∈ N(i) − j} m_{ki}(y_i)

to compute approximate posterior "max"-marginals and use those for prediction. In our
experiments, this results in less accurate classification, so we use posterior marginal prediction.
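The sum-product updates above translate directly into code. The following is a minimal illustrative sketch (dense numpy tables, synchronous "flooding" schedule), not the optimized implementation used in the experiments; on tree-structured networks it recovers the exact marginals, as the test below checks.

```python
import numpy as np

def loopy_bp(n_nodes, n_states, edges, node_pot, edge_pot, n_iters=50):
    """Sum-product loopy BP on a pairwise Markov network.

    node_pot[i]    : psi_i, array of shape (n_states,)
    edge_pot[(i,j)]: psi_ij, array of shape (n_states, n_states)
    Returns the approximate node marginals b_i.
    """
    nbr = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        nbr[i].add(j); nbr[j].add(i)
    # initialize messages uniformly for both directions of every edge
    m = {(i, j): np.ones(n_states) / n_states
         for (a, b) in edges for (i, j) in [(a, b), (b, a)]}

    def psi(i, j):  # edge potential oriented so rows index node i
        return edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T

    for _ in range(n_iters):
        new = {}
        for (i, j) in m:
            # psi_i(y_i) times incoming messages to i, excluding the one from j
            prod = node_pot[i].copy()
            for k in nbr[i] - {j}:
                prod *= m[(k, i)]
            msg = psi(i, j).T @ prod          # sum over y_i
            new[(i, j)] = msg / msg.sum()     # alpha normalization
        m = new

    beliefs = []
    for i in range(n_nodes):
        b = node_pot[i].copy()
        for k in nbr[i]:
            b *= m[(k, i)]
        beliefs.append(b / b.sum())
    return beliefs
```

Replacing the sum in the message computation with a max gives the max-product variant mentioned above.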
Maximum margin estimation
For maximum-margin estimation, we used approximate LP inference inside the max-margin
QP, using commercial ILOG CPLEX software to solve it. For networks with general potentials,
we used the untriangulated LP we described in Sec. 5.4. The untriangulated LP
produced fractional solutions for inference on the test data in several settings, which we
rounded independently for each label. For networks with attractive potentials (AMNs), we
used the LP in Sec. 7.2, which always produced integral solutions on test data.
8.4 Experiments
We tried out our framework on the WebKB dataset [Craven et al., 1998], which is an instance
of our hypertext example. The data set contains webpages from four different Computer
Science departments: Cornell, Texas, Washington and Wisconsin. Each page has a
label attribute, representing the type of webpage, which is one of course, faculty, student,
project or other. The data set is problematic in that the category other is a grab-bag of
pages of many different types. The number of pages classified as other is quite large,
so that a baseline algorithm that simply always selected other as the label would get an
average accuracy of 75%. We could restrict attention to just the pages with the four other
labels, but in a relational classification setting, the deleted webpages might be useful in
terms of their interactions with other webpages. Hence, we compromised by eliminating
all other pages with fewer than three outlinks, making the number of other pages commensurate
with the other categories. The resulting category distribution is: course (237),
faculty (148), other (332), research-project (82) and student (542). The numbers of remaining
pages for the schools are: Cornell (280), Texas (292), Washington (315) and Wisconsin
(454). The numbers of links are: Cornell (574), Texas (574), Washington
(728) and Wisconsin (1614).
For each page, we have access to the entire HTML of the page and the links to other
pages. Our goal is to collectively classify webpages into one of these five categories. In all
of our experiments, we learn a model from three schools and test the performance of the
learned model on the remaining school, thus evaluating the generalization performance of
the different models. We used C ∈ [0.1, 10] and took the best setting for all models.
Unfortunately, we cannot directly compare our accuracy results with previous work
because different papers use different subsets of the data and different training/test splits.
However, we compare to standard text classiﬁers such as Naive Bayes, Logistic Regression,
and Support Vector Machines, which have been demonstrated to be successful on this data
set [Joachims, 1999].
8.4.1 Flat models
The simplest approach we tried predicts the categories based on just the text content of
the webpage. The text of the webpage is represented using a set of binary attributes that
indicate the presence of different words on the page. We found that stemming and feature
selection did not provide much benefit, and simply pruned words that appeared in fewer
than three documents in each of the three schools in the training data. We also experimented
with incorporating metadata: words appearing in the title of the page, in anchors
of links to the page, and in the last header before a link to the page [Yang et al., 2002].
Note that metadata, although mostly originating from pages linking into the considered
page, are easily incorporated as features, i.e., the resulting classification task is still flat
feature-based classification. Our first experimental setup compares three well-known text
classifiers — Naive Bayes, linear support vector machines (Svm), and logistic regression
(Logistic) — using words and meta-words. The results, shown in Fig. 8.2, show that the
two discriminative approaches outperform Naive Bayes. Logistic and Svm give very similar
results. The average error over the 4 schools was reduced by around 4% by introducing
the metadata attributes.
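The word-presence representation with the pruning threshold described above can be sketched as follows. This is a simplified version under our own assumptions: it prunes by overall training document frequency rather than by per-school counts.

```python
from collections import Counter

def bow_features(train_docs, min_df=3):
    """Binary word-presence features, pruning words that appear in fewer
    than `min_df` training documents.
    train_docs: list of token lists. Returns (vocab, featurize).
    """
    df = Counter()
    for doc in train_docs:
        df.update(set(doc))                       # document frequency
    vocab = sorted(w for w, c in df.items() if c >= min_df)
    index = {w: i for i, w in enumerate(vocab)}

    def featurize(doc):
        x = [0] * len(vocab)
        for w in set(doc):
            if w in index:
                x[index[w]] = 1                   # presence, not count
        return x

    return vocab, featurize
```

Metadata attributes (title words, anchor words, header words) can be added the same way, simply by prefixing tokens with their source before counting.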
Incorporating metadata gives a significant improvement, but we can take additional
advantage of the correlation in labels of related pages by classifying them collectively. We
want to capture these correlations in our model and use them for transmitting information
between linked pages to provide more accurate classification. We experimented with
several relational models.
Figure 8.2: Comparison of Naive Bayes, Svm, and Logistic on WebKB, with and without
metadata features. (Only averages over the 4 schools are shown here.)
8.4.2 Link model
Our first model captures direct correlations between labels of linked pages. These correlations
are very common in our data: courses and research projects almost never link to
each other; faculty rarely link to each other; students have links to all categories, but mostly
courses. The Link model, shown in Fig. 8.1, captures this correlation through links: in
addition to the local bag-of-words and metadata attributes, we introduce a relational clique
template over the labels of two pages that are linked. We train this model using maximum
conditional likelihood (labels given the words and the links) and maximum margin.
We also compare to a directed graphical model to contrast discriminative and generative
models of relational structure. The ExistsML model is a (partially) generative model
proposed by Getoor et al. [2001]. For each page, a logistic regression model predicts
the page label given the words and meta-features. Then a simple generative model specifies
a probability distribution over the existence of links between pages conditioned on both
pages' labels. Concretely, we learn the probability of existence of a link between two pages
given their labels. Note that this model does not require inference during learning. Maximum
likelihood estimation (with regularization) of the generative component is closed-form
given appropriate co-occurrence counts of linked pages' labels. However, the prediction
phase is much more expensive, since the resulting graphical model includes edges not
only for the existing hyperlinks, but also for those that do not exist. Intuitively, observing the
link structure directly correlates all page labels in a website, linked or not. By contrast,
the Link model avoids this problem by only modeling the conditional distribution given the
existing links.

Figure 8.3: Comparison of flat versus collective classification on WebKB: SVM, the Exists
model with logistic regression, and the Link model estimated using the maximum likelihood
(ML) and the maximum margin (MM) criteria.
Fig. 8.3 shows a gain in accuracy from SVMs to the Link model obtained by using the correlations
between labels of linked web pages. There is also a very significant additional gain
from using maximum margin training: the error rate of LinkMM is 40% lower than that of
LinkML, and 51% lower than multiclass SVMs. The Exists model does not perform very
well in comparison. This can be attributed to the simplicity of the generative model and the
difficulty of the resulting inference problem.
8.4.3 Cocite model
The second relational model uses the insight that a webpage often has internal structure
that allows it to be broken up into sections. For example, a faculty webpage might have
Figure 8.4: Comparison of flat versus collective classification on WebKB: SVM and the
Cocite model estimated using the maximum likelihood (ML) and maximum margin (MM)
criteria.
one section that discusses research, with a list of links to all of the projects of the faculty
member, a second section might contain links to the courses taught by the faculty member,
and a third to his advisees. We can view a section of a webpage as a fine-grained version of
Kleinberg's hub [Kleinberg, 1999] (a page that contains many links to pages of a particular
category). Intuitively, if two pages are co-cited, or linked to from the same section, they are
likely to be on similar topics. Note that we expect the correlation between the labels in this
case to be positive, so we can use AMN-type potentials in the max-margin estimation. The
Cocite model captures this type of correlation.
To take advantage of this trend, we need to enrich our schema by adding an attribute
Section to Link that refers to the number of the section in which the link appears. We defined a section as a
sequence of three or more links that have the same path to the root in the HTML parse tree.
In the RMN, we have a relational clique template deﬁned by:
SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link1, Link link2
WHERE link1.From = link2.From AND link1.Section = link2.Section AND
link1.To = doc1.Key AND link2.To = doc2.Key AND NOT doc1.Key = doc2.Key
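In procedural terms, the template above pairs up distinct documents that are linked from the same section of the same source page. A small sketch of the same join, assuming (for illustration) that links are given as (From, Section, To) tuples:

```python
from collections import defaultdict
from itertools import combinations

def cocited_pairs(links):
    """Pairs of distinct documents co-cited from the same section,
    mirroring the SQL clique template above.
    links: iterable of (from_page, section, to_page) tuples.
    """
    by_section = defaultdict(set)
    for frm, section, to in links:
        by_section[(frm, section)].add(to)   # group links by (From, Section)
    pairs = set()
    for targets in by_section.values():
        for d1, d2 in combinations(sorted(targets), 2):
            pairs.add((d1, d2))              # doc1.Key != doc2.Key by construction
    return pairs
```

Each returned pair instantiates one clique over the two documents' Category variables, all sharing the Cocite template's potential.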
We compared the performance of SVM, CociteML and CociteMM. The results,
shown in Fig. 8.4, also demonstrate significant improvements of the relational models over
the SVM. The improvement is present when testing on each of the schools. Again, the maximum-likelihood-trained
CociteML model achieves a worse test error than the maximum-margin
CociteMM model, which shows a 30% relative reduction in test error over SVM.
We note that, in our experiments, the learned CociteMM weights never produced fractional
solutions when used for inference, which suggests that the optimization successfully
avoided problematic parameterizations of the network, even in the case of the non-optimal
multiclass relaxation.
8.5 Related work
Our RMN representation is most closely related to the work on PRMs [Koller & Pfeffer,
1998]. Later work showed how to efficiently learn model parameters and structure (the equivalent
of clique selection in Markov networks) from data [Friedman et al., 1999]. Getoor
et al. [2002] propose several generative models of relational structure. Their approach
easily captures the dependence of link existence on attributes of entities. However, there
are many patterns that are difficult to model in PRMs, in particular those that involve
several links at a time. We give some examples here.
One useful type of pattern is a similarity template, where objects that share a certain
graph-based property are more likely to have the same label. Consider, for example,
a professor X and two other entities Y and Z. If X's webpage mentions Y and Z in the
same context, it is likely that the X-Y relation and the X-Z relation are of the same type; for
example, if Y is Professor X's advisee, then probably so is Z. Our framework accommodates
these patterns easily, by introducing pairwise cliques between the appropriate relation
variables.
Another useful type of subgraph template involves transitivity patterns, where the presence
of an A-B link and of a B-C link increases (or decreases) the likelihood of an A-C link.
For example, students often assist in courses taught by their advisor. Note that this type
of interaction cannot be accounted for using just pairwise cliques. By introducing cliques
over triples of relations, we can capture such patterns as well. We can incorporate even
more complicated patterns, but of course we are limited by the ability of belief propagation
to scale up as we introduce larger cliques and tighter loops in the Markov network.
We describe and exploit these patterns in our work on RMNs using maximum-likelihood
estimation [Taskar et al., 2003b]. Attempts to model such patterns in PRMs run into the
constraint that the probabilistic dependency graph (Bayesian network) must be a directed
acyclic graph. For example, for the transitivity pattern, we might consider simply directing
the correlation edges between link existence variables arbitrarily. However, it is not clear
how to parameterize a link existence variable for a link that is involved in multiple triangles.
The structure of the relational graph has been used extensively to infer importance in
scientific publications [Egghe & Rousseau, 1990] and hypertext [Kleinberg, 1999]. Several
recent papers have proposed algorithms that use the link graph to aid classification.
Chakrabarti et al. [1998] use system-predicted labels of linked documents to iteratively
relabel each document in the test set, achieving a significant improvement compared to a
baseline of using the text in each document alone. A similar approach was used by Neville
and Jensen [2000] in a different domain. Slattery and Mitchell [2000] tried to identify directory
(or hub) pages that commonly list pages of the same topic, and used these pages to
improve classification of university webpages. However, none of these approaches provides
a coherent model for the correlations between linked webpages, applying combinations of
classifiers in a procedural way, with no formal justification.
8.6 Conclusion
In this chapter, we propose a new approach to classification in relational domains. Our approach
provides a coherent foundation for the process of collective classification, where we
want to classify multiple entities, exploiting the interactions between their labels. We have
shown that we can exploit a very rich set of relational patterns in classification, significantly
improving the classification accuracy over standard flat classification.
In some cases, we can incorporate relational features into standard ﬂat classiﬁcation.
For example, when classifying papers into topics, it is possible to simply view the presence
of particular citations as atomic features. However, this approach is limited in cases where
some or even all of the relational features that occur in the test data are not observed in
the training data. In our WebKB example, there is no overlap between the webpages in the
different schools, so we cannot learn anything from the training data about the signiﬁcance
of a hyperlink to/from a particular webpage in the test data. Incorporating basic features
(e.g., words) from the related entities can aid in classiﬁcation, but cannot exploit the strong
correlation between the labels of related entities that RMNs capture.
Hypertext is the most easily available source of structured data; however, RMNs are
generally applicable to any relational domain. The results in this chapter represent only
a subset of the domains we have worked on (see [Taskar et al., 2003b]). In particular,
social networks provide extensive information about interactions among people and organizations.
RMNs offer a principled method for learning to predict communities of, and
hierarchical structure between, people and organizations based on both the local attributes
and the patterns of static and dynamic interaction. Given the wealth of possible patterns, it
is particularly interesting to explore the problem of inducing them automatically.
Part III
Broader applications: parsing,
matching, clustering
Chapter 9
Context free grammars
We present a novel discriminative approach to parsing using a structured max-margin criterion
based on the decomposition properties of context-free grammars. We show that this
framework allows high-accuracy parsing in cubic time by exploiting novel kinds of lexical
information. Our models can condition on arbitrary features of input sentences, thus incorporating
an important kind of lexical information not usually used by conventional parsers.
We show experimental evidence of the model's improved performance over a natural baseline
model and a lexicalized probabilistic context-free grammar.
9.1 Context free grammar model
CFGs are one of the primary formalisms for capturing the recursive structure of syntactic
constructions, although many others have also been proposed [Manning & Schütze, 1999].
For clarity of presentation, we restrict our grammars to be in Chomsky normal form as
in Sec. 3.4. The non-terminal symbols correspond to syntactic categories such as noun
phrase (NP) or verb phrase (VP). The terminal symbols are usually words of the sentence.
However, in the discriminative framework that we adopt, we are not concerned with
defining a distribution over sequences of words (a language model). Instead, we condition
on the words in a sentence to produce a model of the syntactic structure. Terminal symbols
for our purposes are part-of-speech tags like nouns (NN), verbs (VBD), and determiners
(DT). For example, Fig. 9.1(a) shows a parse tree for the sentence The screen was a sea of
red. The set of symbols we use is based on the Penn Treebank [Marcus et al., 1993]. The
non-terminal symbols with bars (for example, DT, NN, VBD) are added to conform to the
CNF restrictions. For convenience, we repeat our definition of a CFG from Sec. 3.4 here:
Definition 9.1.1 (CFG) A CFG \mathcal{G} consists of:

◦ A set of non-terminal symbols, \mathcal{N}

◦ A designated set of start symbols, \mathcal{N}_S ⊆ \mathcal{N}

◦ A set of terminal symbols, \mathcal{T}

◦ A set of productions, \mathcal{P} = {\mathcal{P}_B, \mathcal{P}_U}, divided into
binary productions, \mathcal{P}_B = {A → B C : A, B, C ∈ \mathcal{N}}, and
unary productions, \mathcal{P}_U = {A → D : A ∈ \mathcal{N}, D ∈ \mathcal{T}}.
A CFG defines a set of valid parse trees in a natural manner:

Definition 9.1.2 (CFG tree) A CFG tree is a labeled directed tree, where the set of valid
labels of the internal nodes other than the root is \mathcal{N} and the set of valid labels for the leaves
is \mathcal{T}. The root's label set is \mathcal{N}_S. Additionally, each pre-leaf node has a single child, and
this pair of nodes can be labeled as A and D, respectively, if and only if there is a unary
production A → D ∈ \mathcal{P}_U. All other internal nodes have two children, left and right, and
this triple of nodes can be labeled as A, B and C, respectively, if and only if there is a
binary production A → B C ∈ \mathcal{P}_B.
In general, there are exponentially many parse trees that produce a sentence of length n.
This tree representation seems quite different from the graphical models we have been
considering thus far. However, we can use an equivalent representation that essentially
encodes a tree as an assignment to a set of appropriate variables. For each span starting
at s and ending at e, we introduce a variable Y_{s,e} taking values in \mathcal{N} ∪ {⊥} to represent
the label of the subtree that exactly covers, or dominates, the words of the sentence from
s to e. The value ⊥ is assigned if no subtree dominates the span. Indices s and e refer to
positions between words, rather than to words themselves, hence 0 ≤ s < e ≤ n for a
sentence of length n. The "top" symbol Y_{0,n} is constrained to be in \mathcal{N}_S, since it represents
Figure 9.1: Two representations of a binary parse tree: (a) nested tree structure, and (b)
grid of labeled spans. The row and column number are the beginning and end of the span,
respectively. Empty squares correspond to non-constituent spans. The gray squares on the
diagonal represent part-of-speech tags.
the starting symbol of the sentence. We also introduce variables Y_{s,s} taking values in \mathcal{T} to
represent the terminal symbol (part-of-speech) between s and s+1. If Y_{s,e} ≠ ⊥, it is often
called a constituent. Fig. 9.1(b) shows the representation of the tree in Fig. 9.1(a) as a grid
where each square corresponds to Y_{s,e}. The row and column number in the grid correspond
to the beginning and end of the span, respectively. Empty squares correspond to ⊥ values.
The gray squares on the diagonal represent the terminal variables. For example, the figure
shows Y_{0,0} = DT, Y_{6,6} = NN, Y_{3,5} = NP and Y_{1,4} = ⊥.
While any parse tree corresponds to an assignment to this set of variables in a straightforward
manner, the converse is not true: there are assignments that do not correspond
to valid parse trees. In order to characterize the set of valid assignments \mathcal{Y}, consider the
constraints that hold for a valid assignment y:
1I(y_{s,e} = A) = \sum_{A → B C ∈ \mathcal{P}_B} \sum_{s < m < e} 1I(y_{s,m,e} = (A, B, C)), 0 ≤ s < e ≤ n, ∀A ∈ \mathcal{N}; (9.1)

1I(y_{s,e} = A) = \sum_{B → C A ∈ \mathcal{P}_B} \sum_{0 ≤ s' < s} 1I(y_{s',s,e} = (B, C, A))
  + \sum_{B → A C ∈ \mathcal{P}_B} \sum_{e < e' ≤ n} 1I(y_{s,e,e'} = (B, A, C)), 0 ≤ s < e ≤ n, ∀A ∈ \mathcal{N}; (9.2)

1I(y_{s,s+1} = A) = \sum_{A → D ∈ \mathcal{P}_U} 1I(y_{s,s,s+1} = (A, D)), 0 ≤ s < n, ∀A ∈ \mathcal{N}; (9.3)

1I(y_{s,s} = D) = \sum_{A → D ∈ \mathcal{P}_U} 1I(y_{s,s,s+1} = (A, D)), 0 ≤ s < n, ∀D ∈ \mathcal{T}. (9.4)
The notation y_{s,m,e} = (A, B, C) abbreviates y_{s,e} = A ∧ y_{s,m} = B ∧ y_{m,e} = C, and
y_{s,s,s+1} = (A, D) abbreviates y_{s,s+1} = A ∧ y_{s,s} = D. The first set of constraints (9.1)
holds because if the span from s to e is dominated by a subtree starting with A (that is,
y_{s,e} = A), then there must be a unique production starting with A and some split point m,
s < m < e, that produces that subtree. Conversely, if y_{s,e} ≠ A, no productions start with
A and cover s to e. The second set of constraints (9.2) holds because if the span from
s to e is dominated by a subtree starting with A (y_{s,e} = A), then there must be a (unique)
production that generated it, whose span either starts before s or ends after e. Similarly, the third and
fourth sets of constraints (9.3 and 9.4) hold since the terminals are generated using valid
unary productions. We denote the set of assignments y satisfying (9.1-9.4) as \mathcal{Y}. In fact,
the converse is true as well:
Theorem 9.1.3 If y ∈ \mathcal{Y}, then y represents a valid CFG tree.
Proof sketch: It is straightforward to construct a parse tree from y ∈ \mathcal{Y} in a top-down
manner. Starting from the root symbol, y_{0,n}, the first set of constraints (9.1) ensures that a
unique production spans 0 to n, say splitting at m and specifying the values for y_{0,m} and
y_{m,n}. The second set of constraints (9.2) ensures that all other spans y_{0,m'} and y_{m',n}, for
m' ≠ m, are labeled by ⊥. Recursing on the two subtrees, y_{0,m} and y_{m,n}, will produce the
rest of the tree down to the pre-terminals. The last two sets of constraints (9.3 and 9.4)
ensure that the terminals are generated by appropriate unary productions from the pre-terminals.
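The construction in the opposite direction, from a tree to the span variables, is straightforward as well. A small sketch, assuming (our own convention, for illustration) that CNF trees are given as nested tuples:

```python
def tree_to_spans(tree, s=0):
    """Convert a CNF parse tree to the span-variable assignment y.

    tree: (A, D) for a pre-leaf with terminal D, or (A, left, right).
    Returns (assignment dict {(s, e): label}, end position); spans absent
    from the dict take the value "bottom" (no dominating subtree).
    """
    if len(tree) == 2 and not isinstance(tree[1], tuple):
        A, D = tree                       # unary production A -> D
        return {(s, s): D, (s, s + 1): A}, s + 1
    A, left, right = tree                 # binary production A -> B C
    y_l, m = tree_to_spans(left, s)
    y_r, e = tree_to_spans(right, m)
    y = {**y_l, **y_r, (s, e): A}
    return y, e
```

By construction the resulting assignment satisfies constraints (9.1)-(9.4), with every unlisted span implicitly set to ⊥.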
9.2 Context free parsing
A standard approach to parsing is to use a CFG to define a probability distribution over
parse trees. This can be done simply by assigning a probability to each production and
making sure that the sum of probabilities of all productions starting with each symbol is 1:

\sum_{B,C : A → B C ∈ \mathcal{P}_B} P(A → B C) = 1, \sum_{D : A → D ∈ \mathcal{P}_U} P(A → D) = 1, ∀A ∈ \mathcal{N}. (9.5)
The probability of a tree is simply the product of probabilities of the productions used in
the tree. More generally, a weighted CFG assigns a score to each production (this score
may depend on the position of the production, s, m, e) such that the total score of a tree is
the sum of the scores of all the productions used:

S(y) = \sum_{0 ≤ s < m < e ≤ n} S_{s,m,e}(y_{s,m,e}) + \sum_{0 ≤ s < n} S_{s,s+1}(y_{s,s+1}),

where S_{s,m,e}(y_{s,m,e}) = 0 if (y_{s,e} = ⊥ ∨ y_{s,m} = ⊥ ∨ y_{m,e} = ⊥). If the production scores
are production log-probabilities, then the tree score is the tree log-probability. However,
weighted CFGs do not have the local normalization constraints of Eq. (9.5).
We can use a Viterbi-style dynamic programming algorithm called CKY to compute
the highest scoring parse tree in O(|\mathcal{P}| n^3) time [Younger, 1967; Manning & Schütze, 1999].
The algorithm computes the highest score of any subtree starting with a symbol over each
span 0 ≤ s < e ≤ n recursively:
S^*_{s,s+1}(A) = \max_{A → D ∈ \mathcal{P}_U} S_{s,s+1}(A, D), 0 ≤ s < n, ∀A ∈ \mathcal{N};

S^*_{s,e}(A) = \max_{A → B C ∈ \mathcal{P}_B, s < m < e} [S_{s,m,e}(A, B, C) + S^*_{s,m}(B) + S^*_{m,e}(C)], 0 ≤ s < e ≤ n, ∀A ∈ \mathcal{N}.

The highest scoring tree has score \max_{A ∈ \mathcal{N}_S} S^*_{0,n}(A). Using the arg max's of the max's in
the computation of S^*, we can backtrace the highest scoring tree itself. We assume that
score ties are broken in a predetermined way, say according to some lexicographic order of
the symbols.
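The recursions above translate directly into a dynamic program. The following is a minimal sketch that returns only the best score, without the backtrace; the `score_u` and `score_b` callbacks are our own illustrative interface for the position-dependent production scores S_{s,s+1}(A, D) and S_{s,m,e}(A, B, C).

```python
def cky(n, binary, unary, score_b, score_u, start_symbols):
    """CKY for a weighted CFG in CNF.

    binary: list of (A, B, C) productions; unary: list of (A, D).
    score_u(s, A, D) and score_b(s, m, e, A, B, C) give production scores
    (using log-probabilities recovers the most likely PCFG tree).
    Returns the best total score over start symbols for the span (0, n).
    """
    # best[(s, e, A)] = highest score of a subtree rooted at A over span (s, e)
    best = {}
    for s in range(n):                  # spans of length 1: unary productions
        for A, D in unary:
            v = score_u(s, A, D)
            if v > best.get((s, s + 1, A), float('-inf')):
                best[(s, s + 1, A)] = v
    for length in range(2, n + 1):      # longer spans: binary productions
        for s in range(n - length + 1):
            e = s + length
            for A, B, C in binary:
                for m in range(s + 1, e):
                    if (s, m, B) in best and (m, e, C) in best:
                        v = (score_b(s, m, e, A, B, C)
                             + best[(s, m, B)] + best[(m, e, C)])
                        if v > best.get((s, e, A), float('-inf')):
                            best[(s, e, A)] = v
    return max(best.get((0, n, A), float('-inf')) for A in start_symbols)
```

The loops visit each (s, m, e) triple once per binary production, giving the O(|\mathcal{P}| n^3) bound cited above; keeping arg max pointers alongside `best` yields the tree itself.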
9.3 Discriminative parsing models
We cast parsing as a structured classification task, where we want to learn a function h :
\mathcal{X} → \mathcal{Y}, where \mathcal{X} is a set of sentences and \mathcal{Y} is the set of valid parse trees according to a
fixed CFG grammar.
The functions we consider take the following linear discriminant form:

h_w(x) = \arg\max_y w^\top f(x, y),

where w ∈ \mathbb{R}^d and f is a basis function representation of a sentence and parse tree pair,
f : \mathcal{X} × \mathcal{Y} → \mathbb{R}^d. We assume that the basis functions decompose with the CFG structure:

f(x, y) = \sum_{0 ≤ s ≤ e ≤ n} f(x_{s,e}, y_{s,e}) + \sum_{0 ≤ s ≤ m < e ≤ n} f(x_{s,m,e}, y_{s,m,e}),

where n is the length of the sentence x, and x_{s,e} and x_{s,m,e} are the relevant subsets of
the sentence the basis functions depend on. To simplify notation, we introduce the set of
indices, \mathcal{C}, which includes both spans and span triplets:

\mathcal{C} = {(s, e) : 0 ≤ s ≤ e ≤ n} ∪ {(s, m, e) : 0 ≤ s ≤ m < e ≤ n}.

Hence, f(x, y) = \sum_{c ∈ \mathcal{C}} f(x_c, y_c).
Note that this class of discriminants includes PCFG models, where the basis functions
consist of the counts of the productions used in the parse, and the parameters w are
the log-probabilities of those productions. For example, f could include functions which
identify the production used together with features of the words at positions s, m, e, and
neighboring positions in the sentence x (e.g. f(x_{s,m,e}, y_{s,m,e}) = 1I(y_{s,m,e} = (S, NP, VP) ∧
word_m(x) = "was")). We could also include functions that identify the label of the span
from s to e together with features of the words (e.g. f(x_{s,e}, y_{s,e}) = 1I(y_{s,e} = NP ∧
word_s(x) = "the")).
9.3.1 Maximum likelihood estimation
The traditional method of estimating the parameters of PCFGs assumes a generative model
that defines P(x, y) by assigning normalized probabilities to CFG productions. We then
maximize the joint log-likelihood \sum_i \log P(x^{(i)}, y^{(i)}) (with some regularization). We compare
to such a generative grammar of Collins [1999] in our experiments.
An alternative probabilistic approach is to estimate the parameters discriminatively by
maximizing the conditional log-likelihood. For example, the maximum-entropy approach [Johnson,
2001] defines a conditional log-linear model:

P_w(y | x) = \frac{1}{Z_w(x)} \exp{w^\top f(x, y)},

where Z_w(x) = \sum_y \exp{w^\top f(x, y)}, and maximizes the conditional log-likelihood of
the sample, \sum_i \log P(y^{(i)} | x^{(i)}) (with some regularization). The same assumption that
the basis functions decompose as sums of local functions over spans and productions is
typically made in such models. Hence, as in Markov networks, the gradient depends
on the expectations of the basis functions, which can be computed in O(|\mathcal{P}| n^3) time by a
dynamic programming algorithm called inside-outside, which is similar to the CKY algorithm.
However, computing the expectations over trees is actually more expensive in
practice than finding the best tree, for several reasons: CKY works entirely in log-space,
while inside-outside needs to compute actual probabilities, and branch-and-prune techniques,
which save a lot of useless computation, are only applicable in CKY.
A typical method for finding the parameters is to use Conjugate Gradients or L-BFGS
methods [Nocedal & Wright, 1999; Boyd & Vandenberghe, 2004], which repeatedly compute
these expectations to calculate the gradient. Clark and Curran [2004] report experiments
involving 479 iterations of training for one model, and 1550 iterations for another,
using similar methods.
9.3.2 Maximum margin estimation
We assume that the loss function also decomposes with the CFG structure:

ℓ(x, y, ŷ) = \sum_{0 ≤ s ≤ e ≤ n} ℓ(x_{s,e}, y_{s,e}, ŷ_{s,e}) + \sum_{0 ≤ s ≤ m < e ≤ n} ℓ(x_{s,m,e}, y_{s,m,e}, ŷ_{s,m,e}) = \sum_{c ∈ \mathcal{C}} ℓ(x_c, y_c, ŷ_c).
One approach would be to define ℓ(x_{s,e}, y_{s,e}, ŷ_{s,e}) = 1I(y_{s,e} ≠ ŷ_{s,e}). This would lead to
ℓ(x, y, ŷ) tracking the number of "constituent errors" in ŷ. Another, more strict definition
would be ℓ(x_{s,m,e}, y_{s,m,e}, ŷ_{s,m,e}) = 1I(y_{s,m,e} ≠ ŷ_{s,m,e}). This definition would lead
to ℓ(x, y, ŷ) being the number of productions in ŷ which are not seen in y. The constituent
loss function does not exactly correspond to the standard scoring metrics, such as F_1 or
crossing brackets, but shares their sensitivity to the number of differences between trees. We
have not thoroughly investigated the exact interplay between the various loss choices and
the various parsing metrics. We used the constituent loss in our experiments.
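Under the grid representation of Sec. 9.1, the constituent loss is just a count of disagreeing spans. A minimal sketch, assuming (as in the tree-to-grid sketch convention) that assignments are dicts from spans to labels, with absent spans meaning ⊥:

```python
def constituent_loss(y_true, y_hat, n):
    """Constituent loss: the number of spans (s, e), 0 <= s <= e <= n,
    on which the two assignments disagree. Assignments map (s, e) to a
    label; spans missing from a dict are treated as "bottom".
    """
    loss = 0
    for s in range(n + 1):
        for e in range(s, n + 1):
            if y_true.get((s, e)) != y_hat.get((s, e)):
                loss += 1
    return loss
```

The stricter production-level loss is obtained the same way by comparing span triples (s, m, e) instead of spans.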
As in the max-margin estimation for Markov networks, we can formulate an exponential-size QP:

\min \frac{1}{2} ||w||^2 + C \sum_i ξ_i (9.6)
s.t. w^\top Δf_i(y) ≥ ℓ_i(y) − ξ_i, ∀i, y,

where Δf_i(y) = f(x^{(i)}, y^{(i)}) − f(x^{(i)}, y) and ℓ_i(y) = ℓ(x^{(i)}, y^{(i)}, y).
The dual of Eq. (9.6) (after normalizing by C) is given by:

\max \sum_{i,y} α_i(y) ℓ_i(y) − \frac{1}{2} C ||\sum_{i,y} α_i(y) Δf_i(y)||^2 (9.7)
s.t. \sum_y α_i(y) = 1, ∀i; α_i(y) ≥ 0, ∀i, y.
Both of the above formulations are exponential (in the number of variables or constraints)
in the lengths (the n_i's) of the sentences. But we can exploit the context-free structure
of the basis functions and the loss to define a polynomial-size dual formulation in terms of
marginal variables μ_i(y):
μ_{i,s,e}(A) ≡ \sum_{y : y_{s,e} = A} α_i(y), 0 ≤ s < e ≤ n, ∀A ∈ \mathcal{N};

μ_{i,s,s}(D) ≡ \sum_{y : y_{s,s} = D} α_i(y), 0 ≤ s < n, ∀D ∈ \mathcal{T};

μ_{i,s,m,e}(A, B, C) ≡ \sum_{y : y_{s,m,e} = (A,B,C)} α_i(y), 0 ≤ s < m < e ≤ n, ∀A → B C ∈ \mathcal{P}_B;

μ_{i,s,s,s+1}(A, D) ≡ \sum_{y : y_{s,s,s+1} = (A,D)} α_i(y), 0 ≤ s < n, ∀A → D ∈ \mathcal{P}_U.
There are O([T
B
[n
3
i
+ [T
U
[n
i
) such variables for each sentence of length n
i
, instead of
exponentially many α
i
variables. We can now express the objective function in terms of
the marginals. Using these variables, the ﬁrst set of terms in the objective becomes:
i,y
α
i
(y)
i
(y) =
i,y
α
i
(y)
c∈(
(i)
i,c
(y
c
) =
i,c∈(
(i)
,y
c
µ
i,c
(y
c
)
i,c
(y
c
).
Similarly, the second set of terms (inside the 2-norm) becomes:
$$\sum_{i,y} \alpha_i(y)\,\Delta \mathbf{f}_i(y) = \sum_{i,y} \alpha_i(y) \sum_{c \in \mathcal{C}^{(i)}} \Delta \mathbf{f}_{i,c}(y_c) = \sum_{i,\, c \in \mathcal{C}^{(i)},\, y_c} \mu_{i,c}(y_c)\,\Delta \mathbf{f}_{i,c}(y_c).$$
As in $M^3$Ns, we must characterize the set of marginals $\mu$ that corresponds to a valid $\alpha$. The constraints on $\mu$ are essentially based on those that define $\mathcal{Y}$ in Eqs. (9.1–9.4). In addition, we require that the marginals over the root nodes, $\mu_{i,0,n_i}(y_{0,n_i})$, sum to 1 over the possible start symbols $\mathcal{N}_S$.
Putting the pieces together, the factored dual is:
$$\max \ \sum_{i,\, c \in \mathcal{C}^{(i)}} \mu_{i,c}(y_c)\,\ell_{i,c}(y_c) - \frac{C}{2} \Bigg\| \sum_{i,\, c \in \mathcal{C}^{(i)}} \mu_{i,c}(y_c)\,\Delta \mathbf{f}_{i,c}(y_c) \Bigg\|^2 \qquad (9.8)$$
$$\text{s.t.} \ \sum_{A \in \mathcal{N}_S} \mu_{i,0,n_i}(A) = 1, \ \forall i; \qquad \mu_{i,c}(y_c) \ge 0, \ \forall i, \ \forall c \in \mathcal{C}^{(i)};$$
$$\mu_{i,s,e}(A) = \sum_{A \to B\,C \in \mathcal{P}_B} \ \sum_{s < m < e} \mu_{i,s,m,e}(A, B, C), \quad \forall i, \ 0 \le s < e \le n_i, \ \forall A \in \mathcal{N};$$
$$\mu_{i,s,e}(A) = \sum_{B \to C\,A \in \mathcal{P}_B} \ \sum_{0 \le s' < s} \mu_{i,s',s,e}(B, C, A) + \sum_{B \to A\,C \in \mathcal{P}_B} \ \sum_{e < e' \le n_i} \mu_{i,s,e,e'}(B, A, C), \quad \forall i, \ 0 \le s < e \le n_i, \ \forall A \in \mathcal{N};$$
$$\mu_{i,s,s+1}(A) = \sum_{A \to D \in \mathcal{P}_U} \mu_{i,s,s,s+1}(A, D), \quad \forall i, \ 0 \le s < n_i, \ \forall A \in \mathcal{N};$$
$$\mu_{i,s,s}(D) = \sum_{A \to D \in \mathcal{P}_U} \mu_{i,s,s,s+1}(A, D), \quad \forall i, \ 0 \le s < n_i, \ \forall D \in \mathcal{T}.$$
The constraints on $\mu$ are necessary, since the marginals must correspond to a distribution over trees. They are also sufficient:

Theorem 9.3.1 A set of marginals $\mu_i(y)$ satisfying the constraints in Eq. (9.8) corresponds to a valid distribution over the legal parse trees $y \in \mathcal{Y}^{(i)}$. A consistent distribution $\alpha_i(y)$ is given by
$$\alpha_i(y) = \mu_{i,0,n_i}(y_{0,n_i}) \prod_{0 \le s \le m < e \le n_i} \frac{\mu_{i,s,m,e}(y_{s,m,e})}{\mu_{i,s,e}(y_{s,e})},$$
where $0/0 = 0$ by convention.
Proof sketch: The proof follows from inside-outside probability relations [Manning & Schütze, 1999]. The first term is a valid distribution over starting symbols. Each $\frac{\mu_{i,s,m,e}(y_{s,m,e})}{\mu_{i,s,e}(y_{s,e})}$ term for $m > s$ corresponds to a conditional distribution over binary productions $(y_{s,e} \to y_{s,m}\, y_{m,e})$ that is guaranteed to sum to 1 over split points $m$ and possible productions. Similarly, each $\frac{\mu_{i,s,s,s+1}(y_{s,s,s+1})}{\mu_{i,s,s+1}(y_{s,s+1})}$ term corresponds to a conditional distribution over unary productions $(y_{s,s+1} \to y_{s,s})$ that is guaranteed to sum to 1 over possible productions. Hence, we have defined a kind of PCFG (in which production probabilities depend on the location of the symbol), which induces a valid distribution $\alpha_i$ over trees. It is straightforward to verify that this distribution has marginals $\mu_i$.
9.4 Structured SMO for CFGs

We trained our max-margin models using the Structured SMO algorithm with block-coordinate descent adapted from graphical models (see Sec. 6.1). The CKY algorithm computes similar max-marginals in the course of computing the best tree as Viterbi does in Markov networks:
$$\hat{v}_{i,c}(y_c) = \max_{y \sim y_c} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)], \qquad \hat{\alpha}_{i,c}(y_c) = \max_{y \sim y_c} \alpha_i(y).$$
We also define $\tilde{v}_{i,c}(y_c) = \max_{y'_c \ne y_c} \hat{v}_{i,c}(y'_c) = \max_{y \not\sim y_c} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)]$. Note that we do not explicitly represent $\alpha_i(y)$, but we can reconstruct the maximum-entropy one from the marginals $\mu_i$ as in Theorem 9.3.1.
We again express the KKT conditions in terms of the max-marginals for each span and span triple $c \in \mathcal{C}^{(i)}$ and its values $y_c$:
$$\hat{\alpha}_{i,c}(y_c) = 0 \ \Rightarrow\ \hat{v}_{i,c}(y_c) \le \tilde{v}_{i,c}(y_c); \qquad \hat{\alpha}_{i,c}(y_c) > 0 \ \Rightarrow\ \hat{v}_{i,c}(y_c) \ge \tilde{v}_{i,c}(y_c). \qquad (9.9)$$
The algorithm cycles through the training sentences, runs CKY to compute the max-marginals, and performs an SMO update on the violated constraints. We typically find that 20–40 iterations through the data are sufficient for convergence in terms of objective function improvements.
9.5 Experiments
We used the Penn English Treebank for all of our experiments. We report results here for each model and setting trained and tested on only the sentences of length ≤ 15 words. Aside from the length restriction, we used the standard splits: sections 2–21 for training (9753 sentences), 22 for development (603 sentences), and 23 for final testing (421 sentences).
As a baseline, we trained a CNF transformation of the unlexicalized model of Klein and Manning [2003] on this data. The resulting grammar had 3975 non-terminal symbols and contained two kinds of productions: binary non-terminal rewrites and tag-word rewrites. Unary rewrites were compiled into a single compound symbol, so, for example, a subject-gapped sentence would have a label like S+VP. These symbols were expanded back into their source unary chain before parses were evaluated. The scores for the binary rewrites were estimated using unsmoothed relative frequency estimators. The tagging rewrites were estimated with a smoothed model of $P(w \mid t)$, also using the model from Klein and Manning [2003]. In particular, Table 9.2 shows the performance of this model (GENERATIVE): 87.99 $F_1$ on the test set.
For the BASIC max-margin model, we used exactly the same set of allowed rewrites (and therefore the same set of candidate parses) as in the generative case, but estimated their weights using the max-margin formulation with a loss that counts the number of wrong spans. Tag-word production weights were fixed to be the log of the generative $P(w \mid t)$ model. That is, the only change between GENERATIVE and BASIC is the use of the discriminative max-margin criterion in place of the generative maximum-likelihood one for learning production weights. This change alone results in a small improvement (88.20 vs. 87.99 $F_1$).
On top of the basic model, we first added lexical features of each span; this gave a LEXICAL model. For a span $\langle s, e \rangle$ of a sentence $x$, the base lexical features were:

◦ $x_s$, the first word in the span
◦ $x_{s-1}$, the preceding adjacent word
◦ $x_{e-1}$, the last word in the span
◦ $x_e$, the following adjacent word
◦ $\langle x_{s-1}, x_s \rangle$
◦ $\langle x_{e-1}, x_e \rangle$
◦ $x_{s+1}$ for spans of length 3
Model         P      R      F1
GENERATIVE    87.70  88.06  87.88
BASIC         87.51  88.44  87.98
LEXICAL       88.15  88.62  88.39
LEXICAL+AUX   89.74  90.22  89.98

Table 9.1: Development set results of the various models when trained and tested on Penn treebank sentences of length ≤ 15.

Model         P      R      F1
GENERATIVE    88.25  87.73  87.99
BASIC         88.08  88.31  88.20
LEXICAL       88.55  88.34  88.44
LEXICAL+AUX   89.14  89.10  89.12
COLLINS 99    89.18  88.20  88.69

Table 9.2: Test set results of the various models when trained and tested on Penn treebank sentences of length ≤ 15.
These base features were conjoined with the span length for spans of length 3 and below, since short spans have highly distinct behaviors (see the examples below). The features are lexical in the sense that they allow specific words and word pairs to influence the parse scores, but they are distinct from traditional lexical features in several ways. First, there is no notion of headword here, nor is there any modeling of word-to-word attachment. Rather, these features pick up on lexical trends in constituent boundaries: for example, the trend that in the sentence The screen was a sea of red., the (length 2) span between the word was and the word of is unlikely to be a constituent. These non-head lexical features capture a potentially very different source of constraint on tree structures than head-argument pairs, one having to do more with linear syntactic preferences than with lexical selection. Regardless of the relative merit of the two kinds of information, one clear advantage of the present approach is that inference in the resulting model remains cubic (as opposed to $O(n^5)$), since the dynamic program need not track items with distinguished headwords. With the addition of these features, the accuracy moved past the generative baseline, to 88.44.
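The span-boundary features above can be sketched in code. This is our own illustration, not the thesis implementation: the feature names, the padding token for out-of-range positions, and the choice to keep the unconjoined base features alongside the length-conjoined copies are all assumptions the thesis does not pin down.

```python
def span_lexical_features(words, s, e):
    """Base lexical features for the span <s, e) of a tokenized sentence."""
    def w(i):
        # Padding convention for positions outside the sentence is our choice.
        return words[i] if 0 <= i < len(words) else "<PAD>"
    feats = [
        ("first", w(s)),                  # x_s, first word in the span
        ("prev", w(s - 1)),               # x_{s-1}, preceding adjacent word
        ("last", w(e - 1)),               # x_{e-1}, last word in the span
        ("next", w(e)),                   # x_e, following adjacent word
        ("prev+first", (w(s - 1), w(s))),
        ("last+next", (w(e - 1), w(e))),
    ]
    if e - s == 3:
        feats.append(("second", w(s + 1)))  # x_{s+1} for spans of length 3
    if e - s <= 3:
        # Conjoin every base feature with the span length for short spans.
        feats += [(name + "&len=%d" % (e - s), val) for name, val in feats]
    return feats

sent = "The screen was a sea of red .".split()
print(len(span_lexical_features(sent, 2, 4)))  # 12 features for "was a"
```

A sparse indicator vector over such (name, value) pairs would then feed the max-margin learner.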
As a concrete (and particularly clean) example of how these features can sway a decision, consider the sentence The Egyptian president said he would visit Libya today to resume the talks. The generative model incorrectly considers Libya today to be a base NP. However, this analysis runs counter to the trend of today being a one-word constituent. Two features relevant to this trend are: (CONSTITUENT ∧ first-word = today ∧ length = 1) and (CONSTITUENT ∧ last-word = today ∧ length = 1). These features represent the preference of the word today for being the first and last word in constituent spans of length 1.[1] In the LEXICAL model, these features have quite large positive weights: 0.62 each. As a result, this model makes this parse decision correctly.
Another kind of feature that can usefully be incorporated into the classification process is the output of other, auxiliary classifiers. For this kind of feature, one must take care that its reliability on the training set not be vastly greater than its reliability on the test set; otherwise, its weight will be artificially (and detrimentally) high. To ensure that such features are as noisy on the training data as on the test data, we split the training set into two folds. We then trained the auxiliary classifiers on each fold and used their predictions as features on the other fold. The auxiliary classifiers were then retrained on the entire training set, and their predictions were used as features on the development and test sets.
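The two-fold scheme for training-time auxiliary features can be sketched as follows. This is a minimal sketch of the procedure described above, assuming placeholder callables `train_clf` and `predict` and an even split; none of these names come from the thesis.

```python
def twofold_aux_features(examples, train_clf, predict):
    """Auxiliary-feature values for each training example, in input order."""
    mid = len(examples) // 2
    fold_a, fold_b = examples[:mid], examples[mid:]
    clf_a, clf_b = train_clf(fold_a), train_clf(fold_b)
    # Each fold is scored by the classifier trained on the *other* fold,
    # so the feature is as noisy at training time as at test time.
    return [predict(clf_b, x) for x in fold_a] + \
           [predict(clf_a, x) for x in fold_b]
```

At test time, a classifier retrained on the full training set would supply the same feature, as the text describes.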
We used two such auxiliary classifiers, giving a prediction feature for each span (these classifiers predicted only the presence or absence of a bracket over that span, not bracket labels). The first feature was the prediction of the generative baseline; this feature added little information, but made the learning phase faster. The second feature was the output of a flat classifier which was trained to predict whether single spans, in isolation, were constituents or not, based on a bundle of features including the list above, but also the following: the preceding, first, last, and following tag in the span; pairs of tags such as preceding-first, last-following, preceding-following, and first-last; and the entire tag sequence. Tag features on the test sets were taken from a pre-tagging of the sentence by the tagger described in [Toutanova et al., 2003]. While the flat classifier alone was quite poor (P 78.77 / R 63.94 / $F_1$ 70.58), the resulting max-margin model (LEXICAL+AUX) scored 89.12 $F_1$.
To situate these numbers with respect to other models, the parser in [Collins, 1999], which is generative, lexicalized, and intricately smoothed, scores 88.69 over the same train/test configuration.

[1] In this length-1 case, these are the same feature. Note also that the features are conjoined with only one generic label class, "constituent", rather than specific constituent types.
9.6 Related work
A number of recent papers have considered discriminative approaches for natural language
parsing [Johnson et al., 1999; Collins, 2000; Johnson, 2001; Geman & Johnson, 2002;
Miyao & Tsujii, 2002; Clark & Curran, 2004; Kaplan et al., 2004; Collins, 2004]. Broadly
speaking, these approaches fall into two categories, reranking and dynamic programming
approaches. In reranking methods [Johnson et al., 1999; Collins, 2000; Shen et al., 2003],
an initial parser is used to generate a number of candidate parses. A discriminative model
is then used to choose between these candidates. In dynamic programming methods, a
large number of candidate parse trees are represented compactly in a parse tree forest or
chart. Given sufﬁciently “local” features, the decoding and parameter estimation problems
can be solved using dynamic programming algorithms. For example, several approaches
[Johnson, 2001; Geman & Johnson, 2002; Miyao & Tsujii, 2002; Clark & Curran, 2004;
Kaplan et al., 2004] are based on conditional log-linear (maximum entropy) models, where variants of the inside-outside algorithm can be used to efficiently calculate gradients of the log-likelihood function, despite the exponential number of trees represented by the parse
forest.
The method we presented has several compelling advantages. Unlike reranking methods, which consider only a pre-pruned selection of "good" parses, our method is an end-to-end discriminative model over the full space of parses. This distinction can be very significant, as the set of n-best parses often does not contain the true parse. For example, in the work of Collins [2000], 41% of the correct parses were not in the candidate pool of ∼30-best parses. Unlike previous dynamic programming approaches, which were based on maximum entropy estimation, our method incorporates an articulated loss function which penalizes larger tree discrepancies more severely than smaller ones.
Moreover, the structured SMO we use requires only the calculation of Viterbi trees, rather than expectations over all trees (for example, using the inside-outside algorithm). This allows a range of optimizations that prune the space of parses (without making approximations) that are not possible for maximum-likelihood approaches, which must extract basis function expectations from the entire set of parses. In our experiments, 20–40 iterations were generally required for convergence (except for the BASIC model, which took about 100 iterations).
9.7 Conclusion
We have presented a maximum-margin approach to parsing, which allows a discriminative SVM-like objective to be applied to the parsing problem. Our framework permits the use
of a rich variety of input features, while still decomposing in a way that exploits the shared
substructure of parse trees in the standard way.
It is worth considering the cost of this kind of method. At training time, discriminative
methods are inherently expensive, since they all involve iteratively checking current model
performance on the training set, which means parsing the training set (usually many times).
Generative approaches are vastly cheaper to train, since they must only collect counts from
the training set.
On the other hand, the max-margin approach does have the potential to incorporate many new kinds of features over the input, and the current feature set allows limited lexicalization in cubic time, unlike other lexicalized models (including the Collins model which
it outperforms in the present limited experiments). This tradeoff between the complexity,
accuracy and efﬁciency of a parsing model is an important area of future research.
Chapter 10
Matchings
We address the problem of learning to match: given a set of input graphs and corresponding matchings, find a parameterized edge scoring function such that the correct matchings have the highest score. Bipartite matchings are used in many fields, for example, to find marker correspondences in vision problems, to map words of a sentence in one language to another, and to identify functional genetic analogues in different organisms. We have shown a compact max-margin formulation for bipartite matchings in Ch. 4. In this chapter, we focus on the more complex problem of non-bipartite matchings. We motivate this problem using an application in computational biology, disulfide connectivity prediction, but non-bipartite matchings can be used for many other tasks.
Identifying disulfide bridges formed by cysteine residues is critical in determining the structure of proteins. Recently proposed models have formulated this prediction task as a maximum weight perfect matching problem in a graph containing cysteines as nodes, with edge weights measuring the attraction strength of the potential bridges. We exploit combinatorial properties of the perfect matching problem to define a compact, convex quadratic program. We use kernels to efficiently learn very rich (in fact, infinite-dimensional) models and present experiments on standard protein databases, showing that our framework achieves state-of-the-art performance on the task.
Throughout this chapter, we use the problem of disulﬁde connectivity prediction as an
example. We provide some background on this problem.
10.1 Disulﬁde connectivity prediction
Proteins containing cysteine residues form intrachain covalent bonds known as disulﬁde
bridges. Such bonds are a very important feature of protein structure since they enhance
conformational stability by reducing the number of conﬁgurational states and decreasing
the entropic cost of folding a protein into its native state [Matsumura et al., 1989]. They do
so mostly by imposing strict structural constraints due to the resulting links between distant
regions of the protein sequence [Harrison & Sternberg, 1994].
Knowledge of the exact disulﬁde bonding pattern in a protein provides information
about protein structure and possibly its function and evolution. Furthermore, since the
disulﬁde connectivity pattern imposes structure constraints, it can be used to reduce the
search space in both protein folding prediction as well as protein 3D structure prediction.
Thus, the development of efﬁcient, scalable and accurate methods for the prediction of
disulﬁde bonds has numerous practical applications.
Recently, there has been increased interest in applying computational techniques to
the task of predicting the intrachain disulﬁde connectivity [Fariselli & Casadio, 2001;
Fariselli et al., 2002; Vullo & Frasconi, 2004; Klepeis & Floudas, 2003; Baldi et al., 2004].
Since a sequence may contain any number of cysteine residues, which may or may not participate in disulfide bonds, the task of predicting the connectivity pattern is typically decomposed into two subproblems: predicting the bonding state of each cysteine in the sequence, and predicting the exact connectivity among bonded cysteines. Alternatively, there are methods [Baldi et al., 2004] that predict the connectivity pattern without knowing the bonding state of each cysteine.[1]

We predict the connectivity pattern by finding the maximum weighted matching in a graph in which each vertex represents a cysteine residue and each edge represents the "attraction strength" between the cysteines it connects [Fariselli & Casadio, 2001]. We parameterize this attraction strength via a linear combination of features, which can include the protein sequence around the two residues, evolutionary information in the form of multiple alignment profiles, secondary structure or solvent accessibility information, etc.

[1] We thank Pierre Baldi and Jianlin Cheng for introducing us to the problem of disulfide connectivity prediction and providing us with a preliminary draft of their paper and the results of their model, as well as the protein datasets.
10.2 Learning to match
Formally, we seek a function $h : \mathcal{X} \to \mathcal{Y}$ that maps inputs $x \in \mathcal{X}$ to output matchings $y \in \mathcal{Y}$; for example, $\mathcal{X}$ is the space of protein sequences and $\mathcal{Y}$ is the space of matchings of their cysteines. The space of matchings $\mathcal{Y}$ is very large, in fact, super-exponential in the number of nodes in a graph. However, $\mathcal{Y}$ has interesting and complex combinatorial structure which we exploit to learn $h$ efficiently.

The training data consists of $m$ examples $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$ of input graphs and output matchings. We assume that the input $x$ defines the space of possible matchings using some deterministic procedure. For example, given a protein sequence, we construct a complete graph where each node corresponds to a cysteine. We represent each possible edge between nodes $j$ and $k$ ($j < k$) in example $i$ using a binary variable $y^{(i)}_{jk}$. For simplicity, we assume complete graphs, but very little needs to change to handle sparse graphs.

If example $i$ has $L_i$ nodes, then there are $L_i(L_i - 1)/2$ edge variables, so $y^{(i)}$ is a binary vector of dimension $L_i(L_i - 1)/2$. In a perfect matching, each node is connected to exactly one other node. In non-perfect matchings, each node is connected to at most one other node. Let $n_i = L_i/2$; then, for complete graphs with an even number of vertices $L_i$, the number of possible perfect matchings is $\frac{(2n_i)!}{2^{n_i}\, n_i!}$ (which is $\Omega((\frac{n_i}{2})^{n_i})$, super-exponential in $n_i$). For example, the 1ANS protein in Fig. 10.1 has 6 cysteines (nodes), 15 potential bonds (edges), and 15 possible perfect matchings.
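The count $(2n_i)!/(2^{n_i} n_i!)$ can be checked directly; the small sketch below is our own illustration, with illustrative names.

```python
from math import factorial

def num_perfect_matchings(num_nodes):
    """(2n)! / (2^n n!) perfect matchings of a complete graph on 2n nodes."""
    if num_nodes % 2:
        return 0  # a graph with an odd number of nodes has no perfect matching
    n = num_nodes // 2
    return factorial(2 * n) // (2 ** n * factorial(n))

print(num_perfect_matchings(6))   # 15, matching the 1ANS example
print(num_perfect_matchings(20))  # grows super-exponentially
```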
Our hypothesis class is maximum weight matchings:
$$h_s(x) = \arg\max_{y \in \mathcal{Y}} \sum_{jk} s_{jk}(x)\, y_{jk}. \qquad (10.1)$$
For disulfide connectivity prediction, this model was used by Fariselli and Casadio [2001]. Their model assigns an attraction strength $s_{jk}(x)$ to each pair of cysteines, calculated by assuming that all residues in the local neighborhoods of the two cysteines make contact, and summing contact potentials for pairs of residues. We consider a simple but very general class of attraction scoring functions defined by a weighted combination of features or basis functions:
$$s_{jk}(x) = \sum_d w_d f_d(x_{jk}) = \mathbf{w}^\top \mathbf{f}(x_{jk}), \qquad (10.2)$$
Figure 10.1: PDB protein 1ANS: amino acid sequence, 3D structure, and graph of potential disulfide bonds. Actual disulfide connectivity is shown in yellow in the 3D model and in the graph of potential bonds.
where $x_{jk}$ is the portion of the input $x$ that directly relates to nodes $j$ and $k$, $f_d(x_{jk})$ is a real-valued basis function, and $w_d \in \mathbb{R}$. For example, the basis functions can represent arbitrary information about the two cysteine neighborhoods: the identity of the residues at specific positions around the two cysteines, or the predicted secondary structure in the neighborhood of each cysteine. We assume that the user provides the basis functions, and that our goal is to learn the weights $\mathbf{w}$ for the model:
$$h_{\mathbf{w}}(x) = \arg\max_{y \in \mathcal{Y}} \sum_{jk} \mathbf{w}^\top \mathbf{f}(x_{jk})\, y_{jk}. \qquad (10.3)$$
Below, we will abbreviate $\mathbf{w}^\top \mathbf{f}(x, y) \equiv \sum_{jk} \mathbf{w}^\top \mathbf{f}(x_{jk})\, y_{jk}$ and $\mathbf{w}^\top \mathbf{f}_i(y) \equiv \mathbf{w}^\top \mathbf{f}(x^{(i)}, y)$.

The naive formulation of max-margin estimation, which enumerates all perfect matchings for each example $i$, is:
$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.} \ \mathbf{w}^\top \mathbf{f}_i(y^{(i)}) \ge \mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y), \quad \forall i, \ \forall y \in \mathcal{Y}^{(i)}. \qquad (10.4)$$
The number of constraints in this formulation is super-exponential in the number of nodes in each example. In the following sections we present two max-margin formulations, first with an exponential set of constraints (Sec. 10.3), and then with a polynomial one (Sec. 10.4).
10.3 Min-max formulation

Using the min-max formulation from Ch. 4, we have a single max constraint for each $i$:
$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.} \ \mathbf{w}^\top \mathbf{f}_i(y^{(i)}) \ge \max_{y \in \mathcal{Y}^{(i)}} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)], \quad \forall i. \qquad (10.5)$$
The key to solving this problem efficiently is the loss-augmented inference $\max_{y \in \mathcal{Y}^{(i)}} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)]$. Under the assumption of Hamming distance loss (or any loss function that can be written as a sum of terms corresponding to edges), this maximization is equivalent (up to a constant term) to a maximum weighted matching problem. Note that since the $y$ variables are binary, the Hamming distance between $y^{(i)}$ and $y$ can be written as $(1 - y)^\top y^{(i)} + (1 - y^{(i)})^\top y = \mathbf{1}^\top y^{(i)} + (1 - 2y^{(i)})^\top y$. Hence, the maximum weight matching where edge $jk$ has weight $\mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk}) + (1 - 2y^{(i)}_{jk})$ (plus the constant $\mathbf{1}^\top y^{(i)}$) gives the value of $\max_{y \in \mathcal{Y}^{(i)}} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)]$.
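This reduction can be sanity-checked on a toy graph by brute force. The sketch below is ours and uses exponential enumeration purely for illustration (the whole point of the chapter is that efficient matching algorithms avoid this); it scores each perfect matching $y$ by $\mathbf{1}^\top y^{(i)} + \sum_{jk \in y} [\mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk}) + (1 - 2y^{(i)}_{jk})]$, where the toy `scores` dict stands in for $\mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk})$.

```python
def perfect_matchings(nodes):
    """Yield all perfect matchings of `nodes` as frozensets of (j, k) pairs."""
    if not nodes:
        yield frozenset()
        return
    j, rest = nodes[0], nodes[1:]
    for idx, k in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for m in perfect_matchings(remaining):
            yield m | {(j, k)}

def loss_augmented_argmax(scores, gold):
    """argmax over perfect matchings of score plus Hamming distance to gold.

    scores[(j, k)] plays the role of w^T f(x_jk); `gold` is the set of
    edges in y^(i).
    """
    nodes = sorted({v for e in scores for v in e})
    const = len(gold)  # 1^T y^(i)
    def value(m):
        return const + sum(scores[e] + (1 - 2 * (e in gold)) for e in m)
    return max(perfect_matchings(nodes), key=value)

scores = {(0, 1): 2.0, (0, 2): 1.0, (0, 3): 0.0,
          (1, 2): 0.0, (1, 3): 1.0, (2, 3): 2.0}
gold = {(0, 1), (2, 3)}
print(sorted(loss_augmented_argmax(scores, gold)))  # [(0, 2), (1, 3)]
```

Here the loss-augmented maximizer differs from the gold matching: a competing matching with moderate score but large Hamming distance wins, which is exactly the kind of margin violation the learning constraints penalize.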
This problem can be solved in $O(L^3)$ time [Gabow, 1973; Lawler, 1976]. It can also be solved as a linear program, where we introduce continuous variables $\mu_{i,jk}$ instead of the binary variables $y^{(i)}_{jk}$:
$$\max \ \sum_{jk} \mu_{i,jk}\, [\mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk}) + (1 - 2y^{(i)}_{jk})] \qquad (10.6)$$
$$\text{s.t.} \ \mu_{i,jk} \ge 0, \ 1 \le j < k \le L_i; \qquad \sum_k \mu_{i,jk} \le 1, \ 1 \le j \le L_i;$$
$$\sum_{j,k \in V} \mu_{i,jk} \le \frac{1}{2}(|V| - 1), \quad V \subseteq \{1, \ldots, L_i\}, \ |V| \ge 3 \text{ and odd}.$$
The constraints $\sum_k \mu_{i,jk} \le 1$ require that the number of bonds incident on a node be less than or equal to one. For perfect matchings, these constraints are changed to $\sum_k \mu_{i,jk} = 1$ to ensure exactly one bond. The subset constraints (in the last line of Eq. (10.6)) ensure that solutions to the LP are integral [Edmonds, 1965]. Note that we have an exponential number of constraints ($O(2^{L_i - 1})$), but this number is asymptotically smaller than the number of possible matchings. It is an open problem to derive a polynomial-sized LP formulation for perfect matchings [Schrijver, 2003].
We can write the loss-augmented inference problem in terms of the LP in Eq. (10.6):
$$\max_{y \in \mathcal{Y}^{(i)}} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)] = d_i + \max_{A_i \mu_i \le b_i,\ \mu_i \ge 0} \ \mu_i^\top [F_i \mathbf{w} + c_i],$$
where $d_i = \mathbf{1}^\top y^{(i)}$; $\mu_i$ is a vector of length $L_i(L_i - 1)/2$ indexed by bond $jk$; $A_i$ and $b_i$ are the appropriate constraint coefficient matrix and right-hand-side vector, respectively; $F_i$ is a matrix of basis function coefficients such that component $jk$ of the vector $F_i \mathbf{w}$ is $\mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk})$; and $c_i = (1 - 2y^{(i)})$. Note that the dependence on $\mathbf{w}$ is linear and occurs only in the objective of the LP.
The dual of the LP in Eq. (10.6) is
$$\min \ \lambda_i^\top b_i \qquad \text{s.t.} \ A_i^\top \lambda_i \ge F_i \mathbf{w} + c_i; \quad \lambda_i \ge 0. \qquad (10.7)$$
We plug it into Eq. (10.5) and combine the minimization over $\lambda$ with the minimization over $\mathbf{w}$:
$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 \qquad (10.8)$$
$$\text{s.t.} \ \mathbf{w}^\top \mathbf{f}_i(y^{(i)}) \ge d_i + \lambda_i^\top b_i, \ \forall i; \qquad A_i^\top \lambda_i \ge F_i \mathbf{w} + c_i, \ \forall i; \qquad \lambda_i \ge 0, \ \forall i.$$
In case our basis functions are not rich enough to predict the training data perfectly, we can introduce a slack variable $\xi_i$ for each example $i$ to allow violations of the constraints and minimize the sum of the violations:
$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i \qquad (10.9)$$
$$\text{s.t.} \ \mathbf{w}^\top \mathbf{f}_i(y^{(i)}) + \xi_i \ge d_i + \lambda_i^\top b_i, \ \forall i; \qquad A_i^\top \lambda_i \ge F_i \mathbf{w} + c_i, \ \forall i; \qquad \lambda_i \ge 0, \ \xi_i \ge 0, \ \forall i.$$
The parameter $C$ allows the user to trade off violations of the constraints with fit to the data.

Figure 10.2: Log of the number of QP constraints (y-axis) vs. number of bonds (x-axis) in the three formulations (perfect matching enumeration, min-max, and certificate).
Our formulation is a linearly-constrained quadratic program, albeit with an exponential number of constraints. In the next section, we develop an equivalent polynomial-size formulation.
10.4 Certificate formulation

Rather than solving the loss-augmented inference problem explicitly, we can focus on finding a compact certificate of optimality that guarantees that $y^{(i)} = \arg\max_y [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)]$. We consider perfect matchings first and then provide a reduction for the non-perfect case. Let $M$ be a perfect matching for a complete undirected graph $G = (V, E)$. In an alternating cycle/path in $G$ with respect to $M$, the edges alternate between those that belong to $M$ and those that do not. An alternating cycle is augmenting with respect to $M$ if the score of the edges in the matching $M$ is smaller than the score of the edges not in the matching $M$.

Theorem 10.4.1 [Edmonds, 1965] A perfect matching $M$ is a maximum weight perfect matching if and only if there are no augmenting alternating cycles.
The number of alternating cycles is exponential in the number of vertices, so simply enumerating all of them will not do. Instead, we can rule out such cycles by considering shortest paths.

We begin by negating the score of those edges not in $M$. In the discussion below we assume that each edge score $s_{jk}$ has been modified this way. We also refer to the score $s_{jk}$ as the length of the edge $jk$. An alternating cycle is augmenting if and only if its length is negative. A condition ruling out negative-length alternating cycles can be stated succinctly using a kind of distance function. Pick an arbitrary root node $r$. Let $d^e_j$, with $j \in V$ and $e \in \{0, 1\}$, denote the length of the shortest alternating path from $r$ to $j$, where $e = 1$ if the last edge of the path is in $M$, and $e = 0$ otherwise. These shortest distances are well defined if and only if there are no negative alternating cycles. The following constraints capture this distance function:
$$s_{jk} \ge d^0_k - d^1_j, \quad s_{jk} \ge d^0_j - d^1_k, \quad \forall\, jk \notin M; \qquad (10.10)$$
$$s_{jk} \ge d^1_k - d^0_j, \quad s_{jk} \ge d^1_j - d^0_k, \quad \forall\, jk \in M.$$

Theorem 10.4.2 There exists a distance function $\{d^e_j\}$ satisfying the constraints in Eq. (10.10) if and only if no augmenting alternating cycles exist.
Proof.
(If) Suppose there are no augmenting alternating cycles. Since any alternating path from $r$ to $j$ can be shortened (or left the same length) by removing the cycles it contains, the two shortest paths to $j$ (one ending with an $M$-edge and one not) contain no cycles. Then let $d^0_j$ and $d^1_j$ be the lengths of those paths, for all $j$ (for $j = r$, set $d^0_r = d^1_r = 0$). Then for any $jk$ (or $kj$) in $M$, the shortest path to $j$ ending with an edge not in $M$, plus the edge $jk$ (or $kj$), is an alternating path to $k$ ending with an edge in $M$. This path is longer than or the same length as the shortest path to $k$ ending with an edge in $M$: $s_{jk} + d^0_j \ge d^1_k$ (or $s_{kj} + d^0_j \ge d^1_k$), so the constraint is satisfied. Similarly for $jk, kj \notin M$.
(Only if) Suppose a distance function $\{d^e_j\}$ satisfies the constraints in Eq. (10.10). Consider an alternating cycle $C$. We renumber the nodes such that the cycle passes through nodes $1, 2, \ldots, l$ and the first edge, $(1, 2)$, is in $M$. The length of the cycle is $s(C) = s_{1,l} + \sum_{j=1}^{l-1} s_{j,j+1}$. For each odd $j$, the edge $(j, j+1)$ is in $M$, so $s_{j,j+1} \ge d^1_{j+1} - d^0_j$. For each even $j$, the edge $(j, j+1)$ is not in $M$, so $s_{j,j+1} \ge d^0_{j+1} - d^1_j$. Finally, the last edge, $(1, l)$, is not in $M$, so $s_{1,l} \ge d^0_1 - d^1_l$. Summing over the edges, we have:
$$s(C) \ge d^0_1 - d^1_l + \sum_{j=1,\ \text{odd}}^{l-1} [d^1_{j+1} - d^0_j] + \sum_{j=2,\ \text{even}}^{l-1} [d^0_{j+1} - d^1_j] = 0.$$
Hence all alternating cycles have non-negative length.
In our learning formulation we have the loss-augmented edge weights $s^{(i)}_{jk} = (2y^{(i)}_{jk} - 1)(\mathbf{w}^\top \mathbf{f}(x_{jk}) + 1 - 2y^{(i)}_{jk})$. Let $d_i$ be a vector of distance variables $d^e_j$, let $H_i$ and $G_i$ be matrices of coefficients, and let $q_i$ be a vector such that $H_i \mathbf{w} + G_i d_i \ge q_i$ represents the constraints in Eq. (10.10) for example $i$. Then the following joint convex program in $\mathbf{w}$ and $d$ computes the max-margin parameters:
$$\min \ \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.} \ H_i \mathbf{w} + G_i d_i \ge q_i, \ \forall i. \qquad (10.11)$$
Once again, in case our basis functions are not rich enough to predict the training data perfectly, we can introduce a slack variable vector $\xi_i$ to allow violations of the constraints.
Once again, in case that our basis functions are not rich enough to predict the training data
perfectly, we can introduce a slack variable vector ξ
i
to allow violations of the constraints.
The case of nonperfect matchings can be handled by a reduction to perfect matchings
as follows [Schrijver, 2003]. We create a new graph by making a copy of the nodes and
the edges and adding edges between each node and the corresponding node in the copy.
We extend the matching by replicating its edges in the copy and for each unmatched node,
introduce an edge to its copy. We deﬁne f (x
jk
) ≡ 0 for edges between the original and
the copy. Perfect matchings in this graph projected onto the original graph correspond to
nonperfect matchings in the original graph.
The comparison between the lognumber of constraints for our three equivalent QP
formulations (enumeration of all perfect matchings, minmax and certiﬁcate) is shown
in Fig. 10.2. The xaxis is the number of edges in the matching (number of nodes divided
by two).
10.5 Kernels
Instead of directly optimizing the primal problem in Eq. (10.8), we can work with its dual. Each training example $i$ has $L_i(L_i - 1)/2$ dual variables, and $\alpha^{(i)}_{jk}$ is the dual variable associated with the features corresponding to the edge $jk$. Let $\alpha^{(i)}$ be the vector of dual variables for example $i$. The dual quadratic optimization problem has the form:
$$\max \ \sum_i c_i^\top \alpha^{(i)} - \frac{1}{2} \Bigg\| \sum_i \sum_{jk \in E^{(i)}} \big(C y^{(i)}_{jk} - \alpha^{(i)}_{jk}\big)\, \mathbf{f}(x^{(i)}_{jk}) \Bigg\|^2 \qquad (10.12)$$
$$\text{s.t.} \ A_i \alpha^{(i)} \le C b_i, \ \forall i; \qquad \alpha^{(i)} \ge 0, \ \forall i.$$
The only occurrence of feature vectors is in the expansion of the squared-norm term in the objective:
$$\sum_{i,j} \sum_{kl \in E^{(i)}} \sum_{mn \in E^{(j)}} \big(C y^{(i)}_{kl} - \alpha^{(i)}_{kl}\big)\, \mathbf{f}(x^{(i)}_{kl})^\top \mathbf{f}(x^{(j)}_{mn})\, \big(C y^{(j)}_{mn} - \alpha^{(j)}_{mn}\big). \qquad (10.13)$$
Therefore, we can apply the kernel trick and let $\mathbf{f}(x^{(i)}_{kl})^\top \mathbf{f}(x^{(j)}_{mn}) = K(x^{(i)}_{kl}, x^{(j)}_{mn})$. Thus, we can efficiently map the original features $\mathbf{f}(x_{jk})$ to a high-dimensional space. The primal and dual solutions are related by:
can efﬁciently map the original features f (x
jk
) to a highdimensional space. The primal
and dual solutions are related by:
$$\mathbf{w} = \sum_i \sum_{jk} \big( C y^{(i)}_{jk} - \alpha^{(i)}_{jk} \big)\, \mathbf{f}(x^{(i)}_{jk}). \tag{10.14}$$
Eq. (10.14) can be used to compute the attraction strength $s_{jk}(x)$ in a kernelized manner at prediction time. The polynomial-sized representation in Eq. (10.11) is similarly kernelizable.
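As an illustration, a kernelized scorer following the support-vector expansion of Eq. (10.14) might look like the sketch below; the triple-based `support` structure and the RBF form are our illustrative assumptions, not code from the thesis:

```python
import math

def rbf_kernel(gamma):
    """K(u, v) = exp(-||u - v||^2 / gamma), the kernel form used in Sec. 10.6."""
    def k(u, v):
        return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / gamma)
    return k

def attraction_strength(x_query, support, C, kernel):
    """s(x) = sum over training edges of (C y_jk - alpha_jk) K(x_jk, x),
    following Eq. (10.14).  `support` holds triples (x_jk, y_jk, alpha_jk)
    gathered from the training set."""
    return sum((C * y - a) * kernel(x, x_query) for x, y, a in support)
```

At prediction time, the learned dual values replace the primal weight vector entirely: only kernel evaluations against the training edges are needed.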
10.6 Experiments
We assess the performance of our method on two datasets containing sequences with experimentally verified bonding patterns: DIPRO2 and SP39. The DIPRO2 dataset was compiled and made publicly available by Baldi et al. [2004]. It consists of all proteins from PDB [Berman et al., 2000], as of May 2004, which contain intra-chain disulfide bonds. After redundancy reduction, there are a total of 1018 sequences. In addition, the sequences are annotated with secondary structure and solvent accessibility information derived from the DSSP database [Kabsch & Sander, 1983]. The SP39 dataset is extracted from the Swiss-Prot database of proteins [Bairoch & Apweiler, 2000], release 39. It contains only sequences with experimentally verified disulfide bridges, and has a total of 726 proteins. The same dataset was used in earlier work [Baldi et al., 2004; Vullo & Frasconi, 2004; Fariselli & Casadio, 2001], and we have followed the same procedure for extracting sequences from the database.
Even though our method is applicable both to sequences with a high number of bonds and to sequences in which the bonding state of cysteine residues is unknown, we report results for the case where the bonding state is known and the number of bonds is between 2 and 5 (the case of 1 bond is trivial). DIPRO2 contains 567 such sequences, and only 53 sequences with a higher number of bonds, so we are able to perform learning on over 90% of all proteins. There are 430 proteins with 2 and 3 bonds and 137 with 4 and 5 bonds. SP39 contains 446 sequences containing between 2 and 5 bonds.
In order to avoid biases during testing, we adopt the same dataset splitting procedure as the one used in previous work [Fariselli & Casadio, 2001; Vullo & Frasconi, 2004; Baldi et al., 2004]. We split SP39 into 4 different subsets, with the constraint that no two proteins with sequence similarity of more than 30% belong to different subsets. Sequence similarity was derived using an all-against-all rigorous Smith-Waterman local pairwise alignment [Smith & Waterman, 1981] (with the BLOSUM65 scoring matrix, gap penalty 12, and gap extension 4). Pairs of chains whose alignment is less than 30 residues
were considered unrelated. The DIPRO2 dataset was split similarly into 5 folds, although the procedure had less effect due to the redundancy reduction applied by the authors of the dataset (DIPRO2 is available at http://contact.ics.uci.edu/bridge.html).
Models
The experimental results we report use the dual formulation of Sec. 10.5 and an RBF kernel $K(x_{jk}, x_{lm}) = \exp(-\|x_{jk} - x_{lm}\|^2 / \gamma)$, with $\gamma \in [0.1, 10]$. We use the exponential-sized representation of Sec. 10.3, since for proteins containing between two and five bonds it is more efficient, due to the low constants in the exponential problem size. We used commercial QP software (CPLEX) to train our models. Training took around 70 minutes for 450 examples, using a sequential optimization procedure which solves QP subproblems associated with blocks of training examples. We are currently working on an implementation of the certificate formulation of Sec. 10.4 to handle longer sequences and non-perfect matchings (when bonding state is unknown). Below, we describe several models we used.
The features we experimented with were all based on the local regions around candidate cysteine pairs. For each pair of candidate cysteines $\{j, k\}$, where $j < k$, we extract the amino-acid sequence in windows of size $n$ centered at $j$ and $k$. As in Baldi et al. [2004], we augment the features of each model with the number of residues between $j$ and $k$. The models below use windows of size $n = 9$.
The first model, SEQUENCE, uses the features described above: for each window, the actual sequence is expanded into a $20n$-dimensional binary vector, in which the entries denote whether or not a particular amino acid occurs at a particular position. For example, the 21st entry in the vector represents whether or not the amino acid Alanine occurs at position 2 of the local window, counting from the left end of the window. The final set of features for each pair $\{j, k\}$ of cysteines is simply the two local windows concatenated together, augmented with the linear distance between the cysteine residues.
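A sketch of this encoding in Python; the alphabet ordering (index 0 = Alanine, matching the 21st-entry example above) and the padding character for windows that run off the sequence ends are our assumptions, since the thesis does not specify them:

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # 20 residues, index 0 = Alanine (assumed order)

def window(seq, center, n):
    """Length-n residue window centered at `center`, padded with '-'."""
    half = n // 2
    return "".join(seq[i] if 0 <= i < len(seq) else "-"
                   for i in range(center - half, center + half + 1))

def encode_pair(seq, j, k, n=9):
    """SEQUENCE-model features for a candidate cysteine pair {j, k}, j < k:
    two one-hot encoded windows of size n (20*n entries each) concatenated,
    plus the linear distance k - j."""
    feats = []
    for center in (j, k):
        for res in window(seq, center, n):
            one_hot = [0] * 20
            if res in AMINO_ACIDS:
                one_hot[AMINO_ACIDS.index(res)] = 1
            feats.extend(one_hot)
    feats.append(k - j)
    return feats
```

With $n = 9$ this yields $2 \cdot 20 \cdot 9 + 1 = 361$ entries per candidate pair.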
The second model, PROFILE, is the same as SEQUENCE, except that instead of using the actual protein sequence, we use multiple sequence alignment profile information. Multiple alignments were computed by running PSI-BLAST with default settings to align the sequence with all sequences in the NR database [Altschul et al., 1997]. Thus, the input at each position of a local window is the frequency of occurrence of each of the 20 amino acids in the alignments.
(a)
K   PROFILE       DAG-RNN
2   0.75 / 0.75   0.74 / 0.74
3   0.60 / 0.48   0.61 / 0.51
4   0.46 / 0.24   0.44 / 0.27
5   0.43 / 0.16   0.41 / 0.11

(b)
K   SEQUENCE      PROFILE       PROFILE-SS
2   0.70 / 0.70   0.73 / 0.73   0.79 / 0.79
3   0.62 / 0.52   0.67 / 0.59   0.74 / 0.69
4   0.44 / 0.21   0.59 / 0.44   0.70 / 0.56
5   0.29 / 0.06   0.43 / 0.17   0.62 / 0.27

Table 10.1: Numbers indicate Precision / Accuracy. (a) Performance of the PROFILE model on SP39 vs. preliminary results of the DAG-RNN model [Baldi et al., 2004], which represent the best currently published results. In each row, the best performance is in bold. (b) Performance of the SEQUENCE, PROFILE, and PROFILE-SS models on the DIPRO2 dataset.
The third model, PROFILE-SS, augments the PROFILE model with secondary structure and solvent-accessibility information. The DSSP program produces 8 types of secondary structure, so we augment each local window of size $n$ with an additional $8n$-dimensional binary vector, as well as a length-$n$ binary vector representing the solvent accessibility at each position.
Results and discussion
We evaluate our algorithm using two metrics: accuracy and precision. The accuracy measure counts how many full connectivity patterns were predicted correctly, whereas precision measures the number of correctly predicted bonds as a fraction of the total number of possible bonds.
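Concretely, with the bonding state known (so every predicted pattern has the same number of bonds as the true one), the two metrics can be computed as in this sketch; the data layout is our own choice for illustration:

```python
def connectivity_metrics(predicted, true):
    """predicted/true: one bonding pattern per protein, each a set of
    frozenset cysteine-index pairs.  Returns (precision, accuracy):
    precision = correctly predicted bonds / total true bonds,
    accuracy  = fraction of proteins whose whole pattern is correct."""
    correct_bonds = sum(len(p & t) for p, t in zip(predicted, true))
    total_bonds = sum(len(t) for t in true)
    exact = sum(p == t for p, t in zip(predicted, true))
    return correct_bonds / total_bonds, exact / len(true)
```

Note that accuracy is always at most precision: a fully correct pattern contributes all of its bonds, while a partially correct one still contributes some.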
The first set of experiments compares our model to preliminary results reported in Baldi et al. [2004], which represent the current top-performing system. We perform 4-fold cross-validation on SP39 in order to replicate their setup. As Table 10.1 shows, the PROFILE model achieves comparable results, with similar or better levels of precision for all bond numbers, and slightly lower accuracies for the case of 2 and 3 bonds.
In another experiment, we show the performance gained by using multiple alignment information by comparing the results of the SEQUENCE model with those of PROFILE. As we can see from Table 10.1(b), the evolutionary information captured by the amino-acid alignment frequencies plays an important role in increasing the performance of the algorithm.
[Figure 10.3: two panels plotting precision and accuracy against training set size (100-450) for (a) proteins with 2 and 3 bonds and (b) proteins with 4 and 5 bonds.]

Figure 10.3: Performance of the PROFILE model as training set size changes, for proteins with (a) 2 and 3 bonds and (b) 4 and 5 bonds.
The same phenomenon is observed by Vullo and Frasconi [2004] in their comparison of sequence- and profile-based models.

As a final experiment, we examine the role that secondary structure and solvent-accessibility information plays in the PROFILE-SS model. Table 10.1(b) shows that the gains are significant, especially for sequences with 3 and 4 bonds. This highlights the importance of developing even richer features, perhaps through more complex kernels.
Fig. 10.3 shows the performance of the PROFILE model as the training set size grows. We can see that for sequences of all bond numbers, both accuracy and precision increase as the amount of data grows. The trend is more pronounced for sequences with 4 and 5 bonds because they are sparsely distributed in the dataset. Such behavior is very promising, since it validates the applicability of our algorithm as the availability of high-quality disulfide bridge annotations increases over time.
10.7 Related work
The problem of inverse perfect matching has been studied by Liu and Zhang [2003] in the inverse combinatorial optimization framework we describe in Sec. 4.3: given a set of nominal weights $\mathbf{w}_0$ and a perfect matching $M$, which is not a maximum one with respect to $\mathbf{w}_0$, find a new weight vector $\mathbf{w}$ that makes $M$ optimal and minimizes $\|\mathbf{w}_0 - \mathbf{w}\|_p$ for $p = 1, \infty$. They do not provide a compact optimization problem for this related but different task, relying instead on the ellipsoid method with constraint generation.
The problem of disulfide bond prediction first received comprehensive computational treatment in Fariselli and Casadio [2001]. They modeled the prediction problem as finding a perfect matching in a weighted graph where vertices represent bonded cysteine residues and edge weights correspond to attraction strength. The problem of learning the edge weights was addressed using a simulated annealing procedure. Their method is only applicable to the case when bonding state is known. In Fariselli et al. [2002], the authors switch to using a neural network for learning edge weights and achieve better performance, especially for the case of 2 and 3 disulfide bonds.
The method in Vullo and Frasconi [2004] takes a different approach to the problem. It scores candidate connectivity patterns according to their similarity to the correct pattern, using a recursive neural network architecture [Frasconi et al., 1998]. At prediction time, the pattern scores are used to perform an exhaustive search over the space of all matchings. The method is computationally limited to sequences of 2 to 5 bonds. It also uses multiple alignment profile information and demonstrates its benefits over sequence information.
In Baldi et al. [2004], the authors achieve the current state-of-the-art performance on the task. Their method uses Directed Acyclic Graph Recursive Neural Networks (DAG-RNNs) [Baldi & Pollastri, 2003] to predict bonding probabilities between cysteine pairs. The prediction problem is solved using weighted graph matching based on these probabilities. Their method performs better than the one in Vullo and Frasconi [2004] and is also the only one which can cope with sequences of more than 5 bonds. It also improves on previous methods by not assuming knowledge of bonding state.
A different approach to predicting disulfide bridges is reported in Klepeis and Floudas [2003], where bond prediction occurs as part of predicting β-sheet topology in proteins. Residue-to-residue contacts (which include disulfide bridges) are predicted by solving a series of constrained integer programming problems. Interestingly, the approach can be used to predict disulfide bonds with no knowledge of bonding state, but the results are not comparable with those in other publications.
The task of predicting whether or not a cysteine is bonded has also been addressed using a variety of machine learning techniques, including neural networks, SVMs, and HMMs [Fariselli et al., 1999; Fiser & Simon, 2000; Martelli et al., 2002; Frasconi et al., 2002; Ceroni et al., 2003]. Currently, the top-performing systems have accuracies around 85%.
10.8 Conclusion
In this chapter, we derive a compact convex quadratic program for the problem of learning to match. Our approach learns a parameterized scoring function that reproduces the observed matchings in a training set. We present two formulations: one based on a linear programming approach to matching, requiring an exponential number of constraints, and one that develops a certificate of matching optimality for a compact polynomial-sized representation. We apply our framework to the task of disulfide connectivity prediction, formulated as a weighted matching problem. Our experimental results show that the method can achieve performance comparable to current top-performing systems. Furthermore, the use of kernels makes it easy to incorporate rich sets of features, such as secondary structure information or extended local neighborhoods of the protein sequence. In the future, it will be worthwhile to examine how other kernels, such as convolution kernels for protein sequences, affect performance. We also hope to explore the more challenging problem of disulfide connectivity prediction when the bonding state of cysteines is unknown. While we have developed the framework to handle that task, it remains to experimentally determine how well the method performs, especially in comparison to existing methods [Baldi et al., 2004], which have already addressed this more challenging setting.
Chapter 11
Correlation clustering
Data can often be grouped in many different reasonable clusterings. For example, one user
may organize her email messages by project and time, another by sender and topic. Images
can be segmented by hue or object boundaries. For a given application, there might be
only one of these clusterings that is desirable. Learning to cluster considers the problem of
ﬁnding desirable clusterings on new data, given example desirable clusterings on training
data.
We focus on correlation clustering, a novel clustering method that has recently enjoyed significant attention from the theoretical computer science community [Bansal et al., 2002; Demaine & Immorlica, 2003; Emanuel & Fiat, 2003]. It is formulated as a vertex partitioning problem: given a graph with real-valued edge scores (both positive and negative), partition the vertices into clusters to maximize the score of intra-cluster edges and minimize the weight of inter-cluster edges. Positive edge weights represent correlations between vertices, encouraging those vertices to belong to a common cluster; negative weights encourage the vertices to belong to different clusters. Unlike most clustering formulations, correlation clustering requires the user to specify neither the number of clusters nor a distance threshold; both of these parameters are effectively chosen to be the best possible by the problem definition. These properties make correlation clustering a promising approach to many clustering problems; in machine learning, it has been successfully applied to coreference resolution for proper nouns [McCallum & Wellner, 2003].
Recently, several algorithms based on linear programming and positive-semidefinite programming relaxations have been proposed to approximately solve this problem. In this chapter, we employ these relaxations to derive a max-margin formulation for learning the edge scores for correlation clustering from clustered training data. We formulate the approximate learning problem as a compact convex program with a quadratic objective and linear or positive-semidefinite constraints. Experiments on synthetic and real-world data show the ability of the algorithm to learn an appropriate clustering metric for a variety of desired clusterings.
11.1 Clustering formulation
An instance of correlation clustering is specified by an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes and an edge score $s_{jk}$ for each $jk \in \mathcal{E}$ $(j < k)$. We assume that the graph is fully connected (if it is not, we can make it fully connected by adding appropriate edges $jk$ with $s_{jk} = 0$). We define binary variables $y_{jk}$, one for each edge $jk$, that represent whether nodes $j$ and $k$ belong to the same cluster. Let $\mathcal{Y}$ be the space of assignments $\mathbf{y}$ that define legal partitions. For notational convenience, we introduce both $y_{jk}$ and $y_{kj}$ variables, which will be constrained to have the same value. We also introduce $y_{jj}$ variables, fix them to have value 1, and set $s_{jj} = 0$.
Bansal et al. [2002] consider two related problems:
$$\max_{\mathbf{y} \in \mathcal{Y}} \ \sum_{jk : s_{jk} > 0} s_{jk}\, y_{jk} \;-\; \sum_{jk : s_{jk} < 0} s_{jk}\, (1 - y_{jk}); \tag{MAXAGREE}$$
$$\min_{\mathbf{y} \in \mathcal{Y}} \ \sum_{jk : s_{jk} > 0} s_{jk}\, (1 - y_{jk}) \;-\; \sum_{jk : s_{jk} < 0} s_{jk}\, y_{jk}. \tag{MINDISAGREE}$$
The motivation for the names of the two problems comes from separating the set of edges
into positive weight edges and negative weight edges. The best score is obviously achieved
by including all the positive and excluding all the negative edges, but this will not generally
produce a valid partition. In MAXAGREE, we maximize the “agreement” of the partition
with the positive/negative designations: the weight of the positive included edges minus
the weight of negative excluded edges. In MINDISAGREE, we minimize the disagreement:
the weight of positive excluded edges minus the weight of negative included edges. In particular, let
$$s^* = \max_{\mathbf{y} \in \mathcal{Y}} \sum_{jk} s_{jk}\, y_{jk}; \qquad s^- = \sum_{jk : s_{jk} < 0} s_{jk}; \qquad s^+ = \sum_{jk : s_{jk} > 0} s_{jk}.$$
Then the value of MAXAGREE is $s^* - s^-$ and the value of MINDISAGREE is $s^+ - s^*$. The optimal partition for the two problems is of course the same (if it is unique). Bansal et al. [2002] show that both of these problems are NP-hard (but they have different approximation hardness). We will concentrate on the maximization version, MAXAGREE. Several approximation algorithms have been developed based on linear and semidefinite programming [Charikar et al., 2003; Demaine & Immorlica, 2003; Emanuel & Fiat, 2003], which we consider in the next sections.
11.1.1 Linear programming relaxation
In order to ensure that $\mathbf{y}$ defines a partition, it is sufficient to enforce a kind of triangle inequality for each triple of nodes $j < k < l$:
$$y_{jk} + y_{kl} \le y_{jl} + 1; \qquad y_{jk} + y_{jl} \le y_{kl} + 1; \qquad y_{jl} + y_{kl} \le y_{jk} + 1. \tag{11.1}$$
The triangle inequality enforces transitivity: if $j$ and $k$ are in the same cluster ($y_{jk} = 1$) and $k$ and $l$ are in the same cluster ($y_{kl} = 1$), then $j$ and $l$ will be forced to be in this cluster ($y_{jl} = 1$). The other two cases are similar. Any symmetric, transitive binary relation induces a partition of the objects.
With these definitions, we can express the MAXAGREE problem as an integer linear program (ignoring the constant $-s^-$):
$$\max \ \sum_{jk} s_{jk}\, y_{jk} \tag{11.2}$$
$$\text{s.t. } y_{jk} + y_{kl} \le y_{lj} + 1, \ \forall j,k,l; \qquad y_{jj} = 1, \ \forall j; \qquad y_{jk} \in \{0, 1\}, \ \forall j,k.$$
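For intuition, MAXAGREE can be solved exactly on tiny graphs by enumerating all set partitions. The brute-force sketch below (our own illustration; the number of partitions grows as the Bell numbers, so this is viable only for a handful of nodes) also lets one check the identity MAXAGREE $= s^* - s^-$:

```python
from itertools import combinations

def partitions(items):
    """Enumerate all set partitions of `items` (Bell-number many)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):           # put `first` into an existing cluster
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        yield part + [[first]]               # or into its own cluster

def max_agree(nodes, s):
    """Exhaustive MAXAGREE: weight of included positive edges minus weight
    of excluded negative edges.  `s` maps frozenset pairs to scores."""
    best, best_part = float("-inf"), None
    for part in partitions(list(nodes)):
        cluster_of = {v: i for i, c in enumerate(part) for v in c}
        agree = 0.0
        for j, k in combinations(nodes, 2):
            w = s.get(frozenset((j, k)), 0.0)
            same = cluster_of[j] == cluster_of[k]
            if w > 0 and same:
                agree += w                   # positive edge included
            elif w < 0 and not same:
                agree -= w                   # negative edge excluded
        if agree > best:
            best, best_part = agree, part
    return best, best_part
```

Because the LP and SDP relaxations below upper-bound this exact value, a solver of this kind is a handy reference when sanity-checking relaxation code.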
Note that the constraints imply that $y_{ab} = y_{ba}$ for any two nodes $a$ and $b$. To see this, consider the inequalities involving nodes $a$ and $b$ with $j = a, k = b, l = b$ and $j = b, k = a, l = a$:
$$y_{ab} + y_{bb} \le y_{ba} + 1; \qquad y_{ba} + y_{aa} \le y_{ab} + 1.$$
Since $y_{aa} = y_{bb} = 1$, we have $y_{ab} = y_{ba}$.
The LP relaxation is obtained by replacing the binary variables $y_{jk} \in \{0, 1\}$ in Eq. (11.2) with continuous variables $0 \le \mu_{jk} \le 1$:
$$\mu_{jk} + \mu_{kl} \le \mu_{lj} + 1, \ \forall j,k,l; \qquad \mu_{jj} = 1, \ \forall j; \qquad \mu_{jk} \ge 0, \ \forall j,k. \tag{11.3}$$
Note that $\mu_{jk} \le 1$ is implied, since the triangle inequality with $j = l$ gives $\mu_{jk} + \mu_{kj} \le \mu_{jj} + 1$, and since $\mu_{jj} = 1$ and $\mu_{jk} = \mu_{kj}$, we have $\mu_{jk} \le 1$.
We are not guaranteed that this relaxation will produce integral solutions. The LP solution, $s_{LP}$, is an upper bound on $s^*$. Charikar et al. [2003] show that the integrality gap of this upper bound is at least $2/3$:
$$\frac{s^* - s^-}{s_{LP} - s^-} < \frac{2}{3}.$$
11.1.2 Semideﬁnite programming relaxation
An alternative formulation [Charikar et al., 2003] is the SDP relaxation. Let $\mathrm{mat}(\mu)$ denote the variables $\mu_{jk}$ arranged into a matrix:
$$\max \ \sum_{jk} s_{jk}\, \mu_{jk} \tag{11.4}$$
$$\text{s.t. } \mathrm{mat}(\mu) \succeq 0; \qquad \mu_{jj} = 1, \ \forall j; \qquad \mu_{jk} \ge 0, \ \forall j,k.$$
In effect, we have substituted the triangle inequalities with the semidefinite constraint. To motivate this relaxation, consider any clustering solution. Choose a collection of orthogonal unit vectors $\{v_1, \ldots, v_K\}$, one for each cluster in the solution. Every vertex $j$ in a cluster is assigned the unit vector $v_j$ corresponding to the cluster it is in. If vertices $j$ and $k$ are in the same cluster, then $v_j^\top v_k = 1$; if not, $v_j^\top v_k = 0$. The score of the clustering solution can now be expressed in terms of the dot products $v_j^\top v_k$. In the SDP relaxation in Eq. (11.4), we have $\mathrm{mat}(\mu) \succeq 0$, which can be decomposed into a sum of outer products:
$$\mathrm{mat}(\mu) = \sum_j v_j v_j^\top.$$
The entries $\mu_{jk}$ correspond to inner products $v_j^\top v_k$. The vectors generating the inner products are unit vectors by the requirement $\mu_{jj} = 1$. However, they do not necessarily form a set of orthogonal vectors.
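The integral points of this relaxation behave exactly as described: for any clustering, the 0/1 matrix $\mathrm{mat}(\mu)$ equals a sum of outer products of cluster-membership vectors and is therefore positive semidefinite. A small pure-Python sketch (ours, for illustration) verifying the decomposition:

```python
def clustering_matrix(clusters, n):
    """mu[j][k] = 1 if j and k share a cluster, else 0: the integral point
    of the SDP relaxation corresponding to a clustering."""
    mu = [[0] * n for _ in range(n)]
    for c in clusters:
        for j in c:
            for k in c:
                mu[j][k] = 1
    return mu

def outer_sum(clusters, n):
    """Sum of outer products u_c u_c^T of 0/1 cluster-membership vectors:
    an explicit PSD decomposition of the clustering matrix."""
    m = [[0] * n for _ in range(n)]
    for c in clusters:
        u = [1 if j in c else 0 for j in range(n)]
        for j in range(n):
            for k in range(n):
                m[j][k] += u[j] * u[k]
    return m
```

Equality of the two matrices shows each integral clustering is feasible for Eq. (11.4); fractional solutions correspond to unit vectors that need not be orthogonal.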
The SDP solution, $s_{SDP}$, is also an upper bound on $s^*$, and Charikar et al. [2003] show that the integrality gap of this upper bound is at least 0.82843:
$$\frac{s^* - s^-}{s_{SDP} - s^-} \le 0.82843.$$
11.2 Learning formulation
The score for an edge is commonly derived from the characteristics of the pair of nodes. Specifically, we parameterize the score as a weighted combination of basis functions, $s_{jk} = \mathbf{w}^\top \mathbf{f}(x_{jk})$, with $\mathbf{w}, \mathbf{f}(x_{jk}) \in \mathbb{R}^n$, where $x_{jk}$ is a set of features associated with nodes $j$ and $k$. In document clustering, the entries of $\mathbf{f}(x_{jk})$ might be the words shared by documents $j$ and $k$, while if one is clustering points in $\mathbb{R}^n$, the features might be distances along different dimensions. We assume $\mathbf{f}(x_{jj}) = 0$, so that $s_{jj} = 0$. Hence we write
$$\mathbf{w}^\top \mathbf{f}(x, \mathbf{y}) = \sum_{jk} y_{jk}\, \mathbf{w}^\top \mathbf{f}(x_{jk}).$$
Furthermore, we assume that the loss function decomposes over the edges into a sum of edge losses $\ell_{i,jk}(y_{jk})$:
$$\ell_i(\mathbf{y}) = \sum_{jk} \ell_{i,jk}(y_{jk}) = \sum_{jk} \big[ y_{jk}\, \ell_{i,jk}(1) + (1 - y_{jk})\, \ell_{i,jk}(0) \big] = \ell_i(\mathbf{0}) + \sum_{jk} y_{jk}\, \ell_{i,jk},$$
where $\ell_{i,jk} = \ell_{i,jk}(1) - \ell_{i,jk}(0)$. For example, the Hamming loss counts the number of edges incorrectly cut or uncut by a partition:
$$\ell^H_i(\mathbf{y}) = \sum_{jk} \mathbf{1}\big(y_{jk} \ne y^{(i)}_{jk}\big) = \sum_{jk} y^{(i)}_{jk} + \sum_{jk} y_{jk}\,\big(1 - 2 y^{(i)}_{jk}\big) = \ell^H_i(\mathbf{0}) + \sum_{jk} y_{jk}\, \ell_{i,jk},$$
where $\ell_{i,jk} = 1 - 2 y^{(i)}_{jk}$.
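The Hamming-loss decomposition can be checked mechanically. The sketch below (our own notation, with edges keyed by name) verifies $\ell(\mathbf{y}) = \ell(\mathbf{0}) + \sum_{jk} y_{jk}(1 - 2y^{(i)}_{jk})$ over all labelings of a small edge set:

```python
from itertools import product

def hamming_loss(y, y_true):
    """Number of edges incorrectly cut or uncut by the partition y."""
    return sum(y[e] != y_true[e] for e in y_true)

def decomposed_loss(y, y_true):
    """ell(0) + sum_jk y_jk * (1 - 2 y_jk^(i)): the edge-decomposed form."""
    ell0 = sum(y_true.values())   # loss of the all-zeros edge assignment
    return ell0 + sum(y[e] * (1 - 2 * y_true[e]) for e in y_true)
```

The two functions agree on every 0/1 assignment, which is exactly what lets the loss be folded into the linear objective of the loss-augmented maximization below.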
With this assumption, the loss-augmented maximization is
$$\ell_i(\mathbf{0}) + \max_{\mathbf{y} \in \mathcal{Y}} \sum_{jk} y_{jk}\, \big[ \mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk}) + \ell_{i,jk} \big]. \tag{11.5}$$
We can now use the LP relaxation in Eq. (11.3) and the SDP relaxation in Eq. (11.4) as upper bounds on Eq. (11.5). We use these upper bounds in the min-max formulation to achieve approximate max-margin estimation.
The dual of the LP-based upper bound for example $i$ is
$$\ell_i(\mathbf{0}) + \min \ \sum_{jkl} \lambda_{i,jkl} + \sum_j z_{i,j} \tag{11.6}$$
$$\text{s.t. } \sum_l \big[ \lambda_{i,jkl} + \lambda_{i,ljk} - \lambda_{i,klj} \big] \ge \mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk}) + \ell_{i,jk}, \ \forall j \ne k;$$
$$z_{i,j} + \sum_l \big[ \lambda_{i,jjl} + \lambda_{i,ljj} - \lambda_{i,jlj} \big] \ge \ell_{i,jj}, \ \forall j;$$
$$\lambda_{i,jkl} \ge 0, \ \forall j,k,l.$$
Above, we introduced a dual variable $\lambda_{i,jkl}$ for each triangle inequality and $z_{i,j}$ for the identity on the diagonal. Note that the right-hand side of the second set of inequalities follows from the assumption $\mathbf{f}(x_{jj}) = 0$.
Plugging this dual into the min-max formulation, we have:
$$\min \ \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \tag{11.7}$$
$$\text{s.t. } \mathbf{w}^\top \mathbf{f}(x^{(i)}, \mathbf{y}^{(i)}) + \xi_i \ge \ell_i(\mathbf{0}) + \sum_{jkl} \lambda_{i,jkl} + \sum_j z_{i,j}, \ \forall i;$$
$$\sum_l \big[ \lambda_{i,jkl} + \lambda_{i,ljk} - \lambda_{i,klj} \big] \ge \mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk}) + \ell_{i,jk}, \ \forall i, \ \forall j \ne k;$$
$$z_{i,j} + \sum_l \big[ \lambda_{i,jjl} + \lambda_{i,ljj} - \lambda_{i,jlj} \big] \ge \ell_{i,jj}, \ \forall i, \ \forall j;$$
$$\lambda_{i,jkl} \ge 0, \ \forall i, \ \forall j,k,l.$$
Similarly, the dual of the SDP-based upper bound is
$$\ell_i(\mathbf{0}) + \min \ \sum_j z_{i,j} \tag{11.8}$$
$$\text{s.t. } -\lambda_{i,jk} \ge \mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk}) + \ell_{i,jk}, \ \forall j \ne k;$$
$$z_{i,j} - \lambda_{i,jj} \ge \ell_{i,jj}, \ \forall j;$$
$$\mathrm{mat}(\lambda_i) \succeq 0.$$
Plugging the SDP dual into the min-max formulation, we have:
$$\min \ \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \tag{11.9}$$
$$\text{s.t. } \mathbf{w}^\top \mathbf{f}(x^{(i)}, \mathbf{y}^{(i)}) + \xi_i \ge \ell_i(\mathbf{0}) + \sum_j z_{i,j}, \ \forall i;$$
$$-\lambda_{i,jk} \ge \mathbf{w}^\top \mathbf{f}(x^{(i)}_{jk}) + \ell_{i,jk}, \ \forall i, \ \forall j \ne k;$$
$$z_{i,j} - \lambda_{i,jj} \ge \ell_{i,jj}, \ \forall i, \ \forall j;$$
$$\mathrm{mat}(\lambda_i) \succeq 0, \ \forall i.$$
11.3 Dual formulation and kernels
The duals of Eq. (11.7) and Eq. (11.9) provide some insight into the structure of the problem and enable efficient use of kernels. Here we give the dual of Eq. (11.7):
$$\max \ \sum_{i,jk} \mu_{i,jk}\, \ell_{i,jk} - \frac{1}{2} C \Big\| \sum_{i,jk} \big( y^{(i)}_{jk} - \mu_{i,jk} \big)\, \mathbf{f}(x^{(i)}_{jk}) \Big\|^2$$
$$\text{s.t. } \mu_{i,jk} + \mu_{i,kl} \le \mu_{i,lj} + 1, \ \forall i, \ \forall j,k,l; \qquad \mu_{i,jj} = 1, \ \forall i, \ \forall j; \qquad \mu_{i,jk} \ge 0, \ \forall i, \ \forall j,k.$$
The dual of Eq. (11.9) is very similar, except that the linear transitivity constraints $\mu_{i,jk} + \mu_{i,kl} \le \mu_{i,lj} + 1, \ \forall i, \forall j,k,l$ are replaced by the corresponding $\mathrm{mat}(\mu_i) \succeq 0$:
$$\max \ \sum_{i,jk} \mu_{i,jk}\, \ell_{i,jk} - \frac{1}{2} C \Big\| \sum_{i,jk} \big( y^{(i)}_{jk} - \mu_{i,jk} \big)\, \mathbf{f}(x^{(i)}_{jk}) \Big\|^2$$
$$\text{s.t. } \mathrm{mat}(\mu_i) \succeq 0, \ \forall i; \qquad \mu_{i,jj} = 1, \ \forall i, \ \forall j; \qquad \mu_{i,jk} \ge 0, \ \forall i, \ \forall j,k.$$
The relation between the primal and dual solutions is
$$\mathbf{w} = C \sum_{i,jk} \big( y^{(i)}_{jk} - \mu_{i,jk} \big)\, \mathbf{f}(x^{(i)}_{jk}). \tag{11.10}$$
One important consequence of this relationship is that the edge parameters are support vector expansions. The dual objective can be expressed in terms of dot products $\mathbf{f}(x_{jk})^\top \mathbf{f}(x_{lm})$. Therefore, we can use kernels $K(x_{jk}, x_{lm})$ to define the space of basis functions. This kernel looks at two pairs of nodes, $(j, k)$ and $(l, m)$, and measures the similarity between the relations within each pair. If we are clustering points in Euclidean space, the kernel could be a function of the two segments corresponding to the pairs of points, for example, a polynomial kernel over their lengths and the angle between them.
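For example, a kernel over pairs of points in the plane might compare segment descriptors. The polynomial form below is one illustrative choice of $K(x_{jk}, x_{lm})$, not the kernel used in the thesis experiments:

```python
import math

def segment_features(p, q):
    """Length and direction angle of the segment between points p and q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)

def pair_kernel(x_jk, x_lm, degree=2):
    """A kernel over two *pairs* of points: a polynomial kernel combining the
    product of the segment lengths with the cosine of the angle between the
    segments (the cosine keeps the kernel symmetric in its arguments)."""
    (pj, pk), (pl, pm) = x_jk, x_lm
    u, v = segment_features(pj, pk), segment_features(pl, pm)
    return (1.0 + u[0] * v[0] + math.cos(u[1] - v[1])) ** degree
```

Any symmetric positive-definite function of the two segments would serve the same role in the dual objective.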
11.4 Experiments
We present experiments on a synthetic problem exploring the effect of irrelevant basis
functions (features), and two real data sets, email clustering and image segmentation.
11.4.1 Irrelevant features
In this section, we explore on a synthetic example how our algorithm deals with irrelevant features. In particular, we generate data (100 points) from a mixture of two one-dimensional Gaussians, where each mixture component corresponds to a cluster. This first dimension is thus the relevant feature. We then add noise components in $D$ additional (irrelevant) dimensions. The noise is generated independently for each dimension, from a mixture of Gaussians with the same difference in means and the same variance as the relevant dimension. Figure 11.1(a) shows the projection of a data sample onto the first two dimensions and onto two irrelevant dimensions.
Let $x_j$ denote each point and $x_j[d]$ denote the $d$-th dimension of the point. We used a basis function for each dimension, $f_d(x_{jk}) = e^{-(x_j[d] - x_k[d])^2}$, plus an additional constant basis function. The training and test data each consist of 100 samples from the model. The results in Fig. 11.1(b) illustrate the capability of our algorithm to learn to ignore irrelevant dimensions. The accuracy is the fraction of edges correctly predicted to be between/within cluster. Random clustering gives an accuracy of 50%. The comparison with k-means is simply a baseline to illustrate the effect of the noise on the data.
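The data-generating process and the per-dimension basis functions can be sketched as follows; the separation value and seeding are our own choices, since the text does not give exact parameters:

```python
import math
import random

def sample_points(n, num_noise_dims, sep=3.0, seed=0):
    """Two-cluster data: dimension 0 is a two-Gaussian mixture separated by
    `sep` (the relevant feature); every other dimension is independent noise
    drawn from the same kind of mixture."""
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(sep * label, 1.0)]               # relevant dimension
        x += [rng.gauss(sep * rng.randint(0, 1), 1.0)   # irrelevant dimensions
              for _ in range(num_noise_dims)]
        points.append(x)
        labels.append(label)
    return points, labels

def basis(xj, xk):
    """Per-dimension basis functions f_d(x_jk) = exp(-(x_j[d] - x_k[d])^2),
    plus a constant feature."""
    return [math.exp(-(a - b) ** 2) for a, b in zip(xj, xk)] + [1.0]
```

Learning then amounts to weighting these $D + 2$ basis functions so that the noise dimensions receive weights near zero.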
11.4.2 Email clustering
We also test our approach on the task of clustering email into folders. We gathered the
data from the SRI CALO project.
1
Our dataset consisted of email from seven users (ap
proximately 100 consecutive messages per user), which the users had manually ﬁled into
different folders. The number of folders for each users varied from two to six, with an
average of 34 folders. We are interested in the problem of learning to cluster emails.
1
http://www.ai.sri.com/project/CALO
[Figure 11.1: top and bottom scatter panels of the data projections; right panel, test accuracy (0.45-0.95) against the number of noise dimensions (1-128) for lcc and k-means.]

Figure 11.1: (a) Projection onto the first two dimensions (top) and two noise dimensions (bottom); (b) performance on the 2-cluster problem as a function of the number of irrelevant noise dimensions. Learning to cluster is the solid line, k-means the dashed line. Error bars denote one standard deviation; averages are over 20 runs. Accuracy is the fraction of edges correctly predicted to be between/within cluster.
Specifically, what score function $s_{jk}$ causes correlation clustering to give clusters similar to those that human users had chosen?

To test our learning algorithm, we use each user as a training set in turn, learning the parameters from the partition of a single user's mailbox into folders. We then use the learned parameters to cluster the other users' mail. The basis functions $\mathbf{f}(x_{jk})$ measured the similarity between the text of the messages and between the "From:", "To:", and "Cc:" fields. One feature was used for each common word in the pair of emails (except words that appeared in more than half the messages, which were deemed "stop words" and omitted). Additional features captured the proportion of shared tokens for each email field, including the from, to, Cc, subject, and body fields. The algorithm is therefore able to automatically learn the relative importance of certain email fields for filing two messages together, as well as the importance of meaningful words versus common, irrelevant ones.
We compare our method to the k-means clustering algorithm using the same word features, and took the best clustering out of five tries.

[Figure 11.2: bar chart of pair F1 (0-0.8) for users 1-7, comparing lcc and k-means.]

Figure 11.2: Average pair F1 measure for clustering user mailboxes.

We made this comparison somewhat
easy for k-means by giving it the correct number of clusters $k$. We also informed our algorithm of the number of clusters by uniformly adding a positive weight to the edge weights to cause it to give the correct number of clusters. We performed a simple binary search on this additional bias weight parameter to find a number of clusters comparable to $k$. The results in Fig. 11.2 show the average F1 measure (the harmonic mean of precision and recall, a standard metric in information retrieval [Baeza-Yates & Ribeiro-Neto, 1999]) computed on the pairs of messages that belonged to the same cluster. Our algorithm significantly outperforms k-means on several users and does worse only for one of the users.
11.4.3 Image segmentation
We also test our approach on the task of image segmentation. We selected images from the Berkeley Image Segmentation Dataset [Martin et al., 2001] for which two users had significantly different segmentations. For example, Fig. 11.3(a) and (b) show two distinct segmentations: one very coarse, based mostly on overall hue, and one much finer, based on hue and intensity. Depending on the task at hand, we may prefer the first over the second or vice versa. It is precisely this kind of variability in the similarity judgements that
we want our algorithm to capture.

Figure 11.3: Two segmentations by different users: training image with (a) coarse segmentation and (b) fine segmentation.
In order to segment the image, we first divided it into contiguous regions of approximately the same color by running connected components on the pixels. We connected two adjacent pixels by an edge if their RGB values were the same at a coarse resolution (4 bits for each of the R, G, B channels). We then selected the hundred or so largest regions, which covered 80-90% of the pixels. These regions are the objects that our algorithm learns to cluster. (We then use the learned metric to greedily assign the remaining small regions to the large adjoining regions.)
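A minimal version of this region-extraction step (coarse 4-bit quantization followed by connected components; BFS and 4-connectivity are our implementation choices):

```python
from collections import deque

def quantize(pixel, bits=4):
    """Keep the top `bits` bits of each RGB channel (coarse color resolution)."""
    shift = 8 - bits
    return tuple(c >> shift for c in pixel)

def color_regions(image):
    """Connected components of same-coarse-color pixels; `image` is a 2D list
    of (r, g, b) tuples.  Returns a list of regions (lists of (row, col))."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if seen[r][c]:
                continue
            color = quantize(image[r][c])
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and quantize(image[ny][nx]) == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions
```

Sorting the returned regions by size and keeping the largest hundred or so reproduces the preprocessing described above.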
There is a rich space of possible features we can use in our models: for each pair of regions, we can consider their shape, color distribution, distance, the presence of edges between them, etc. In our experiments, we used a fairly simple set of features that are very easy to compute. For each region, we calculated the bounding box, the area, and the average color (averaging the pixels in RGB space). For each pair of regions, we then computed three color distances (one for each HSV channel), as well as the distance in pixels between the bounding boxes and the area of the smaller of the two regions. All features were normalized to have zero mean and unit variance.
We trained two models, one using Fig. 11.3(a) and the other using Fig. 11.3(b), and tested on the image in Fig. 11.4(a). The results are shown in Fig. 11.4(b) and (c), respectively. Note that the mountains, rocks, and grass are segmented very coarsely, based on hue, in (b), while the segmentation in (c) is more detailed and sensitive to the saturation and value of the colors.

Figure 11.4: Test image: (a) input; (b) segmentation based on coarse training data; (c) segmentation based on fine training data.
11.5 Related work
The performance of most clustering algorithms depends critically on the distance metric
that they are given for measuring the similarity or dissimilarity between different data
points. Recently, a number of algorithms have been proposed for automatically learning
distance metrics as a preprocessing step for clustering [Xing et al., 2002; BarHillel et al.,
2003]. In contrast to algorithms that learn a metric independently of the algorithm that will
be used to cluster the data, we describe a formulation that tightly integrates metric learning
with the clustering algorithm, tuning one to the other in a joint optimization. Thus, instead
of using an externallydeﬁned criterion for choosing the metric, we will instead seek to
learn a good metric for the clustering algorithm. An example of work in a similar vein is
the algorithm for learning a distance metric for spectral clustering [Bach & Jordan, 2003].
The clustering algorithm essentially uses an eigenvector decomposition of an appropriate
matrix derived from the pairwise afﬁnity matrix, which is more efﬁcient than correlation
clustering, for which we use LP or SDP formulations. However the objective in the learning
11.6. CONCLUSION 173
formulation proposed in Bach and Jordan [2003] is not convex and difﬁcult to optimize.
11.6 Conclusion
We looked at correlation clustering, and how to learn the edge weights from example clus
terings. Our approach ties together the inference and learning algorithm, and attempts
to learn a good metric speciﬁcally for the clustering algorithm.. We showed results on a
synthetic dataset, showcasing robustness to noise dimensions. Experiments on the CALO
email and image segmentation experiments show the potential of the algorithm on real
world data. The main limitation of the correlation clustering is scalability: the number of
constraints ([1[
3
) in the LP relaxation and the size of the positivesemideﬁnite constraint
in the SDP relaxation. It would be very interesting to explore constraint generation or sim
ilar approaches to speed up learning and inference. On the theoretical side, it would be
interesting to work out a PAClike bound for generalization of the learned score metric.
174 CHAPTER 11. CORRELATION CLUSTERING
Part IV
Conclusions and future directions
175
Chapter 12
Conclusions and future directions
This thesis presents a novel statistical estimation framework for structured models based
on the large margin principle underlying support vector machines. The framework results
in several efﬁcient learning formulations for complex prediction tasks. Fundamentally,
we rely on the expressive power of convex optimization problems to compactly capture
inference or solution optimality in structured models. Directly embedding this structure
within the learning formulation produces compact convex problems for efﬁcient estimation
of very complex models. For some of these models, alternative estimation methods are
intractable. We develop theoretical foundations for our approach and show a wide range
of experimental applications, including handwriting recognition, 3D terrain classiﬁcation,
disulﬁde connectivity in protein structure prediction, hypertext categorization, natural lan
guage parsing, email organization and image segmentation.
12.1 Summary of contributions
We view a structured prediction model as a mapping from the space of inputs x ∈ A to
a discrete vector output y ∈ ¸. Essentially, a model deﬁnes a compact, parameterized
scoring function w
¯
f (x, y) and prediction using the model reduces to ﬁnding the highest
scoring output y given the input x. Our class of models has the following linear form:
h
w
(x) = arg max
y: g(x,y)≤0
w
¯
f (x, y),
176
12.1. SUMMARY OF CONTRIBUTIONS 177
where w ∈ IR
n
is the vector of parameters of the model, constraints g(x, y) ∈ IR
k
deﬁne
the space of feasible outputs y given the input x and basis functions f (x, y) ∈ IR
n
represent
salient features of the input/output pair. Although the space of outputs ¦y : g(x, y) ≤ 0¦ is
usually immense, we assume that the inference problem arg max
y : g(x,y)≤0
w
¯
f (x, y) can
be solved (or closely approximated) by an efﬁcient algorithm that exploits the structure of
the constraints g and basis functions f . This deﬁnition covers a broad range of models,
from probabilistic models such as Markov networks and context free grammars to more
unconventional models like weighted graphcuts and matchings.
12.1.1 Structured maximum margin estimation
Given a sample S = ¦(x
(i)
, y
(i)
)¦
m
i=1
, we develop methods for ﬁnding parameters w such
that:
arg max
y∈¸
(i)
w
¯
f (x
(i)
, y) ≈ y
(i)
, ∀i,
where ¸
(i)
= ¦y : g(x
(i)
, y) ≤ 0¦.
The naive formulation
1
uses
i
[¸
(i)
[ linear constraints, which is generally exponential
in the number of variables in each y
(i)
.
min
1
2
[[w[[
2
s.t. w
¯
f
i
(y
(i)
) ≥ w
¯
f
i
(y) +
i
(y), ∀i, ∀y ∈ ¸
(i)
.
We propose two general approaches that transform the above exponential size QP to an
exactly equivalent polynomial size QP in many important classes of models. These formu
lations allow us to ﬁnd globally optimal parameters (with ﬁxed precision) in polynomial
time using standard optimization software. In many models where maximum likelihood
estimation is intractable, we provide exact maximum margin solutions (Ch. 7 and Ch. 10).
1
For simplicity, we omit the slack variables in this summary.
178 CHAPTER 12. CONCLUSIONS AND FUTURE DIRECTIONS
Minmax formulation
We can turn the above problem into an equivalent minmax formulation with i nonlinear
maxconstraints:
min
1
2
[[w[[
2
s.t. w
¯
f
i
(y
(i)
) ≥ max
y∈¸
(i)
[w
¯
f
i
(y) +
i
(y)], ∀i.
The key to solving the estimation problem above efﬁciently is the lossaugmented infer
ence problem max
y∈¸
(i) [w
¯
f
i
(y) +
i
(y)]. Even if max
y∈¸
(i) w
¯
f
i
(y) can be solved in
polynomial time using convex optimization, the form of the loss term
i
(y) is crucial for
the lossaugmented inference to remain tractable. We typically use a natural loss func
tion which is essentially the Hamming distance between y
(i)
and h(x
(i)
): the number of
variables predicted incorrectly.
We show that if we can express the (lossaugmented) inference as a compact convex
optimization problem (e.g., LP, QP, SDP, etc.), we can embed the maximization inside the
minmax formulation to get a compact convex program equivalent to the naive exponential
formulation. We show that this approach leads to exact polynomialsize formulations for
estimation of lowtreewidth Markov networks, associative Markov networks over binary
variables, contextfree grammars, bipartite matchings, and many other models.
Certiﬁcate formulation
There are several important combinatorial problems which allow polynomial time solu
tion yet do not have a compact convex optimization formulation. For example, maximum
weight perfect (nonbipartite) matching and spanning tree problems can be expressed as
linear programs with exponentially many constraints, but no polynomial formulation is
known [Bertsimas & Tsitsiklis, 1997; Schrijver, 2003]. Both of these problems, however,
can be solved in polynomial time using combinatorial algorithms. In some cases, though,
we can ﬁnd a compact certiﬁcate of optimality that guarantees that
y
(i)
= arg max
y
[w
¯
f
i
(y) +
i
(y)].
12.1. SUMMARY OF CONTRIBUTIONS 179
For perfect (nonbipartite) matchings, this certiﬁcate is a condition that ensures there are
no augmenting alternating cycles (see Ch. 10). We can express this condition by deﬁning
an auxiliary distance function on the nodes an a set of linear constraints that are satisﬁed if
and only if there are no negative cycles. This simple set of linear constraints scales linearly
with the number of edges in the graph. Similarly, we can derive a compact certiﬁcate for
the spanning tree problem.
The certiﬁcate formulation relies on the fact that verifying optimality of a solution is
often easier than actually ﬁnding one. This observation allows us to apply our framework
to an even broader range of models with combinatorial structure than the minmax formu
lation.
Maximum margin vs. maximum likelihood
There are several theoretical advantages to our approach in addition to the empirical accu
racy improvements we have shown experimentally. Because our approach only relies on
using the maximum in the model for prediction, and does not require a normalized dis
tribution P(y [ x) over all outputs, maximum margin estimation can be tractable when
maximum likelihood is not. For example, to learn a probabilistic model P(y [ x) over
bipartite matchings using maximum likelihood requires computing the normalizing parti
tion function, which is #Pcomplete [Valiant, 1979; Garey & Johnson, 1979]. By contrast,
maximum margin estimation can be formulated as a compact QP with linear constraints.
Similar results hold for an important subclass of Markov networks and nonbipartite match
ings.
In models that are tractable for both maximum likelihood and maximum margin (such
as lowtreewidth Markov networks, context free grammars, many other problems in which
inference is solvable by dynamic programming), our approach has an additional advantage.
Because of the hingeloss, the solutions to the estimation are relatively sparse in the dual
space (as in SVMs), which makes the use of kernels much more efﬁcient. Maximum like
lihood models with kernels are generally nonsparse and require pruning or greedy support
vector selection methods [Wahba et al., 1993; Zhu & Hastie, 2001; Lafferty et al., 2004;
Altun et al., 2004].
180 CHAPTER 12. CONCLUSIONS AND FUTURE DIRECTIONS
There are, of course, several advantages to maximum likelihood estimation. In appli
cations where probabilistic conﬁdence information is a must, maximum likelihood is much
more appropriate. Also, in training settings with missing data and hidden variables, proba
bilistic interpretation permits the use of wellunderstood algorithms such as EM [Dempster
et al., 1977].
Approximations
In many problems, the maximization problem we are interested in may be NPhard, for
example, we consider MAP inference in large treewidth Markov networks in Ch. 8, multi
way cuts in Ch. 7, graphpartitioning in Ch. 11. Many such problems can be written as
integer programs. Relaxations of such integer programs into LPs, QPs or SDPs often pro
vide excellent approximation algorithms and ﬁt well within our framework, particularly the
minmax formulation. We show empirically that these approximations are very effective in
many applications.
12.1.2 Markov networks: maxmargin, associative, relational
The largest portion of the thesis is devoted to novel estimation algorithms, representational
extensions, generalization analysis and experimental validation for Markov networks.
◦ Lowtreewidth Markov networks
We use a compact LP for MAP inference in Markov networks with sequence and
other lowtreewidth structure to derive an exact, compact, convex learning formu
lation. The dual formulation allows efﬁcient integration of kernels with graphical
models that leverages rich highdimensional representations for accurate prediction
in realworld tasks.
◦ Scalable online algorithm
Although our convex formulation is a QP with linear number of variables and con
straints in the size of the data, for large datasets (millions of examples), it is very
difﬁcult to solve using standard software. We present an efﬁcient algorithm for solv
ing the estimation problem called Structured SMO. Our onlinestyle algorithm uses
12.1. SUMMARY OF CONTRIBUTIONS 181
inference in the model and analytic updates to solve extremely large estimation prob
lems.
◦ Generalization analysis
We analyze the theoretical generalization properties of maxmargin estimation in
Markov networks and derive a novel marginbased bound for structured prediction.
This is the ﬁrst bound to address structured error (e.g., proportion of mislabeled
pixels in an image).
◦ Learning associative Markov networks (AMNs)
We deﬁne an important subclass of Markov networks that captures positive correla
tions present in many domains. This class of networks extends the Potts model [Potts,
1952] often used in computer vision and allows exact MAP inference in the case of
binary variables. We show how to express the inference problem using an LP which
is exact for binary networks. As a result, for associative Markov networks over bi
nary variables, our framework allows exact estimation of networks of arbitrary con
nectivity and topology, for which likelihood methods are believed to be intractable.
For the nonbinary case, we provide an approximation that works well in practice.
We present an AMNbased method for object segmentation from 3D range data. By
constraining the class of Markov networks to AMNs, our models are learned efﬁ
ciently and, at runtime, can scale up to tens of millions of nodes and edges by using
graphcut based inference [Kolmogorov & Zabih, 2002].
◦ Representation and learning of relational Markov networks
We introduce relational Markov networks (RMNs), which compactly deﬁne tem
plates for Markov networks for domains with relational structure: objects, attributes,
relations. The graphical structure of an RMN is based on the relational structure of
the domain, and can easily model complex interaction patterns over related entities.
We apply this class of models to classiﬁcation of hypertext using hyperlink structure
to deﬁne relations between webpages. We use a compact approximate MAP LP in
these complex Markov networks, in which exact inference is intractable, to derive an
approximate maxmargin formulation.
182 CHAPTER 12. CONCLUSIONS AND FUTURE DIRECTIONS
12.1.3 Broader applications: parsing, matching, clustering
The other large portion of the thesis addresses a range of prediction tasks with very diverse
models: context free grammars for natural language parsing, perfect matchings for disulﬁde
connectivity in protein structure prediction, graph partitions for clustering documents and
segmenting images.
◦ Learning to parse
We exploit dynamic programming decomposition of context free grammars to derive
a compact maxmargin formulation. We build on a recently proposed “unlexicalized”
grammar that allows cubic time parsing and we show how to achieve highaccuracy
parsing (still in cubic time) by exploiting novel kinds of lexical information. We show
experimental evidence of the model’s improved performance over several baseline
models.
◦ Learning to match
We use a combinatorial optimality condition, namely the absence of augmenting al
ternating cycles, to derive an exact, efﬁcient certiﬁcate formulation for learning to
match. We apply our framework to prediction of disulﬁde connectivity in proteins
using perfect matchings. The algorithmwe propose uses kernels, which makes it pos
sible to efﬁciently embed input features in very highdimensional spaces and achieve
stateoftheart accuracy.
◦ Learning to cluster
By expressing the correlation clustering problem as a compact LP and SDP, we use
the minmax formulation to learn a parameterized scoring function for clustering. In
contrast to algorithms that learn a metric independently of the algorithm that will
be used to cluster the data, we describe a formulation that tightly integrates metric
learning with the clustering algorithm, tuning one to the other in a joint optimization.
We formulate the approximate learning problem as a compact convex program. Ex
periments on synthetic and realworld data show the ability of the algorithm to learn
an appropriate clustering metric for a variety of desired clusterings, including email
folder organization and image segmentation.
12.2. EXTENSIONS AND OPEN PROBLEMS 183
12.2 Extensions and open problems
There are several immediate applications, less immediate extensions and open problems for
our estimation framework. We organize these ideas into several sections below, including
further theoretical analysis and new optimization algorithms, novel prediction tasks, and
more general learning settings.
12.2.1 Theoretical analysis and optimization algorithms
◦ Approximation bounds
In several of the intractable models, like multiclass AMNs in Ch. 7 and correlation
clustering in Ch. 11, we used approximate convex programs within the minmax for
mulation. These approximate inference programs have strong relative error bounds.
An open question is to translate these error bounds on inference into error bounds on
the resulting maxmargin formulations.
◦ Generalization bounds with distributional assumptions
In Ch. 5, we presented a bound on the structured error in Markov networks, with
out any assumption about the distribution of P(y [ x), relying only on the samples
(x
(i)
, y
(i)
) being i.i.d. This distributionfree assumption leads to a worst case analy
sis, while some assumptions about the approximate decomposition P(y [ x) may be
warranted. For example, for sequential prediction problems, the Markov assumption
of some ﬁnite order is reasonable (i.e., given the input and previous k labels, the next
label is independent of the labels more than k in the past). In spatial prediction tasks,
a label variable is independent of the rest given a large enough ball of labels around
it. Similar assumptions may be made for some “degree of separation” in relational
domains. More generally, it would be interesting to exploit such conditional indepen
dence assumptions or asymptotic bounds on entropy of P(y [ x) to get generalization
guarantees even from a single structured example (x, y).
◦ Problemspeciﬁc optimization methods
Although our convex formulations are polynomial in the size of the data, scaling
184 CHAPTER 12. CONCLUSIONS AND FUTURE DIRECTIONS
up to larger datasets will require problemspeciﬁc optimization methods. For low
treewidth Markov networks and context free grammars, we have presented the Struc
tured SMO algorithm. Another algorithm useful for such models is Exponentiated
Gradient [Bartlett et al., 2004]. Both algorithms rely on dynamic programming de
compositions. However, models which do not permit such decompositions, such as
graphcuts, matchings, and many others, create a need for new algorithms that can ef
fectively use combinatorial optimization as a subroutine to eliminate the dependence
on generalpurpose convex solvers.
12.2.2 Novel prediction tasks
◦ Bipartite matchings
Maximum weight bipartite matchings are used in a variety of problems to predict
mappings between sets of items. In machine translation, matchings are used to
map words of the two languages in aligned sentences [Matusov et al., 2004]. In 2D
shape matching, points on two shapes are matched based on their local contour fea
tures [Belongie et al., 2002]. Our framework provides an exact, efﬁcient alternative
to the maximum likelihood estimation for learning the matching scoring function.
◦ Sequence alignment
In standard pairwise alignment of biological sequences, a string edit distance is used
to determine which portions of the sequences align to each other [Needleman &
Wunsch, 1970; Durbin et al., 1998]. Finding the best alignment involves a dynamic
program that generalizes the longest common subsequence algorithm. Our frame
work can be applied (just as in context free grammar estimation) to efﬁciently learn
a more complex edit function that depends on the contextual string features, perhaps
using novel string kernels [Haussler, 1999; Leslie et al., 2002; Lodhi et al., 2000].
◦ Continuous prediction problems
We have addressed estimation of models with discrete output spaces, generalizing
classiﬁcation models to multivariate, structured classiﬁcation. Similarly, we can
consider a whole range of problems where the prediction variables are continuous.
12.2. EXTENSIONS AND OPEN PROBLEMS 185
Such problems are a natural generalizations of regression, involving correlated, inter
constrained realvalued outputs. For example, several recent models of metabolic
ﬂux in yeast use linear programming formulations involving quantities of various
enzymes, with stoichiometric constraints [Varma & Palsson, 1994]. It would be
interesting to use observed equilibria data under different conditions to learn what
“objective” the cell is maximizing. In ﬁnancial modeling, convex programs are often
used to model portfolio management; for example, Markowitz portfolio optimization
is formulated as a quadratic program which minimizes risk and maximizes expected
return under budget constraints [Markowitz, 1991; Luenberger, 1997]. In this setting,
one could learn a user’s return projection and risk assessment function from observed
portfolio allocations by the user.
These problems are similar to the discrete structured prediction models we have con
sidered: inference in the model can formulated as a convex optimization problem.
However, there are obstacles to directly applying the minmax or certiﬁcate formu
lations. Details of this are beyond the scope of this thesis, but it sufﬁces to say that
lossaugmented inference using, Hamming distance equivalent, L
1
loss (or L
2
loss),
no longer produces a maximization of a concave objective with convex constraints
since L
1
, L
2
are convex, not concave (it turns out that it is actually possible to use
L
∞
loss). Developing effective loss functions and maxmargin formulations for the
continuous setting could provide a novel set of effective models for structured multi
variate realvalued prediction problems.
12.2.3 More general learning settings
◦ Structure learning
We have focused on the problem of learning parameters of the model (even though
our kernelized models can be considered nonparametric). In the case of Markov
networks, especially in spatial and relational domains, there is a wealth of possible
structures (cliques in the network) one can use to model a problem. It is particularly
interesting to explore the problem of inducing these cliques automatically from data.
The standard method of greedy stepwise selection followed by reestimation is very
186 CHAPTER 12. CONCLUSIONS AND FUTURE DIRECTIONS
expensive in general networks [Della Pietra et al., 1997; Bach & Jordan, 2001].
Recent work on selecting input features in Markov networks (or CRFs) uses several
approximations to learn efﬁciently with millions of candidate features [McCallum,
2003]. However, clique selection is still relatively unexplored. It is possible that
AMNs, by restricting the network to be tractable under any structure, may permit
more efﬁcient clique selection methods.
◦ Semisupervised learning
Throughout the thesis we have assumed completely labeled data. This assumption
often limits us to relatively small training sets where data has been carefully anno
tated, while much of the easily accessible data is not at all or suitably labeled. There
are several more general settings we would like to extend our framework.
The simplest setting is a mix of labeled and unlabeled examples, where a small su
pervised dataset is augmented by a large unsupervised one. There has been much
research in this setting for classiﬁcation [Blum & Mitchell, 1998; Nigam et al., 2000;
Chapelle et al., 2002; Szummer & Jaakkola, 2001; Zhu et al., 2003; Corduneanu &
Jaakkola, 2003]. Although most of this work has been done in a probabilistic set
ting, the principle of regularizing (discouraging) decision boundaries near densely
clustered inputs could be applicable to our structured setting.
A more complex and rich setting involves presence of hidden variables in each ex
ample. For example, in machine translation, word correspondences between pairs of
sentences are usually not manually annotated (at least not on a large scale). These
correspondence variables can be treated as hidden variables. Similarly, in handwrit
ing recognition, we may not have each letter segmented out but instead just get a
word or sentence as a label for the entire image. This setting has been studied mainly
in the probabilistic, generative models often using the EMalgorithm[Dempster et al.,
1977; Cowell et al., 1999]. Discriminative methods have been explored far less. Es
pecially in the case of combinatorial structures, extensions of our framework allow
opportunities for problemspeciﬁc convex approximations to be exploited.
12.3. FUTURE 187
12.3 Future
We have presented a supervised learning framework for a large class of prediction mod
els with rich and interesting structure. Our approach has several theoretical and practical
advantages over standard probabilistic models and estimation methods for structured pre
diction. We hope that continued research in this framework will help tackle evermore
sophisticated prediction problems in the future.
Appendix A
Proofs and derivations
A.1 Proof of Theorem 5.5.1
The proof of Theorem 5.5.1 uses the covering number bounds of Zhang [2002] (in the
DataDependent Structural Risk Minimization framework [ShaweTaylor et al., 1998].)
Zhang provides generalization guarantees for linear binary classiﬁers of the form h
w
(x) =
sgn(w
¯
x). His analysis is based on the upper bounds on the covering number for the class
of linear functions T
L
(w, z) = w
¯
z where the norms of the vectors w and z are bounded.
We reproduce the relevant deﬁnitions and theorems from Zhang [2002] here to highlight
the necessary extensions for structured classiﬁcation.
The covering number is a key quantity in measuring function complexity. Intuitively,
the covering number of an inﬁnite class of functions (e.g. parameterized by a set of weights
w) is the number of vectors necessary to approximate the values of any function in the class
on a sample. Marginbased analysis of generalization error uses the margin achieved by a
classiﬁer on the training set to approximate the original function class of the classiﬁer by
a ﬁnite covering with precision that depends on the margin. Here, we will only deﬁne the
∞norm covering number.
188
A.1. PROOF OF THEOREM 5.5.1 189
A.1.1 Binary classiﬁcation
In binary classiﬁcation, we are given a sample S = ¦x
(i)
, y
(i)
¦
m
i=1
, from distribution D over
A ¸, where A = IR
n
and ¸ is mapped to ±1, so we can fold x and y into z = yx.
Deﬁnition A.1.1 (Covering Number) Let 1 = ¦v
(1)
, . . . , v
(r)
¦, where v
(j)
∈ IR
m
, be a
covering of a function class T(w, S) with precision under the metric ρ, if for all w there
exists a v
(j)
such that for each data sample z
(i)
∈ S:
ρ(v
(j)
i
, T(w, z
(i)
)) ≤ .
The covering number of a sample S is the size of the smallest covering: ^
∞
(T, ρ, , S) =
inf [1[ s.t. 1 is a covering of T(w, S). We also deﬁne the covering number for any sample
of size m: ^
∞
(T, ρ, , m) = sup
S: [S[=m
^
∞
(T, ρ, , S).
When the norms of w and z are bounded, we have the following upper bound on the
covering number of linear functions under the linear metric ρ
L
(v, v
t
) = [v −v
t
[.
Theorem A.1.2 (Theorem 4 from Zhang [2002]) If w
2
≤ a and z
2
≤ b, then ∀
> 0,
log
2
^
∞
(T
L
, ρ
L
, , m) ≤ 36
a
2
b
2
2
log
2
(2 ¸4ab/ + 2 m + 1) .
In order to use the classiﬁer’s margin to bound its expected loss, the bounds below use
a stricter, marginbased loss on the training sample that measures the worst loss achieved
by the approximate covering based on this margin. Let f : IR → [0, 1] be a loss function.
In binary classiﬁcation, we let f(v) = 1I(v ≤ 0) be the step function, so that 01 loss of
sgn(w
¯
x) is f(T
L
(w, z)). The next theorem bounds the expected f loss in terms of the
γmargin loss, f
γ
(v) = sup
ρ(v,v
)<2γ
f(v
t
), on the training sample. For 01 loss and linear
metric ρ
L
, the corresponding γmargin loss is f
γ
(v) = 1I(v ≤ 2γ).
Theorem A.1.3 (Corollary 1 from Zhang [2002]) Let f : IR → [0, 1] be a loss function
and f
γ
(v) = sup
ρ(v,v
)<2γ
f(v
t
) be the γmargin loss for a metric ρ. Let γ
1
> γ
2
> . . . be
190 APPENDIX A. PROOFS AND DERIVATIONS
a decreasing sequence of parameters, and p
i
be a sequence of positive numbers such that
∞
i=1
p
i
= 1, then for all δ > 0, with probability of at least 1 −δ over data:
E
D
[f(T(w, z))] ≤ E
S
[f
γ
(T(w, z))] +
¸
32
m
_
ln 4^
∞
(T, ρ, γ
i
, S) + ln
1
p
i
δ
_
for all w and γ, where for each ﬁxed γ, we use i to denote the smallest index s.t. γ
i
≤ γ.
A.1.2 Structured classiﬁcation
We will extend this framework to bound the average perlabel loss
H
(y)/L for structured
classiﬁcation by deﬁning an appropriate loss f and a function class T (as well as a metric
ρ) such that f(T) computes average perlabel loss and f
γ
(T) provides a suitable γmargin
loss. We will bound the corresponding covering number by building on the bound in The
orem A.1.2.
We can no longer simply fold x and y, since y is a vector, so we let z = (x, y). In
order for our loss function to compute average perlabel loss, it is convenient to make our
function class vectorvalued (instead of scalarvalued as above). We deﬁne a new function
class T
M
(w, z), which is a vector of minimum values of w
¯
∆f
i
(y) for each error level
H
(y) from 1 to L as described below.
Deﬁnition A.1.4 (dtherrorlevel function) The dtherrorlevel function M
d
(w, z) for d ∈
¦1, . . . , L¦ is given by:
M
d
(w, z) = min
y:
H
(y)=d
w
¯
∆f
i
(y).
Deﬁnition A.1.5 (Multierrorlevel function class) The multierrorlevel function class T
M
(w, z)
is given by:
T
M
(w, z) = (M
1
(w, z), . . . , M
d
(w, z), . . . , M
L
(w, z)) .
We can now compute the average perlabel loss from T
M
(w, z) by deﬁning an appropriate
loss function f
M
.
A.1. PROOF OF THEOREM 5.5.1 191
Deﬁnition A.1.6 (Average perlabel loss) The average perlabel loss f
M
: IR
L
→[0, 1] is
given by:
f
M
(v) =
1
L
arg min
d:v
d
≤0
v
d
,
where in case ∀d, v
d
> 0, we deﬁne arg min
d:v
d
≤0
v
d
≡ 0.
With the above deﬁnitions, we have an upper bound on the average perlabel loss
f
M
(T
M
(w, z)) =
1
L
arg min
d:M
d
(w,z)≤0
M
d
(w, z) ≥
1
L
H
_
arg max
y
w
¯
f
i
(y)
_
.
Note that the case ∀d, M
d
(w, z) > 0 corresponds to the classiﬁer making no mistakes:
arg max
y
w
¯
f
i
(y) = y. This upper bound is tight if y = arg max
y
w
¯
f (x, y
t
), Other
wise, it is adversarial: it picks from all y
t
which are better (w
¯
f (y) ≤ w
¯
f (y
t
)), one that
maximizes the Hamming distance from y.
We now need to deﬁne an appropriate metric ρ that in turn deﬁnes γmargin loss for
structured classiﬁcation. Since the margin of the hypothesis grows with the number of
mistakes, our metric can become “looser” with the number of mistakes, as there is more
room for error.
Deﬁnition A.1.7 (Multierrorlevel metric) Let the multierrorlevel metric ρ
M
: IR
L
IR
L
→IR for a vector in IR
L
be given by:
ρ
M
(v, v
t
) = max
d
[v
d
−v
t
d
[
d
.
We now deﬁne the corresponding γmargin loss using the new metric:
Deﬁnition A.1.8 (γmargin average perlabel loss) The γmargin average perlabel loss
f
γ
M
: IR
L
→[0, 1] is given by:
f
γ
M
(v) = sup
ρ
M
(v,v
)≤2γ
f
M
(v
t
).
Combining the two deﬁnitions, we get:
f
γ
M
(T
M
(w, z)) = sup
v:[v
d
−M
d
(w,z)[≤2dγ
1
L
arg min
d:v
d
≤0
v
d
.
192 APPENDIX A. PROOFS AND DERIVATIONS
We also deﬁne the corresponding covering number for our vectorvalued function class:
Deﬁnition A.1.9 (Multierrorlevel covering number) Let 1 = ¦V
(1)
, . . . , V
(r)
¦, where
V
(j)
= (V
(j)
1
, . . . , V
(j)
i
, . . . , V
(j)
m
) and V
(j)
i
∈ IR
L
, be a covering of T
M
(w, S), with 
precision under the metric ρ
M
, if for all w there exists a V
(j)
such that for each data
sample z
(i)
∈ S:
ρ
M
(V
(j)
i
, T
M
(w, z
(i)
)) ≤ .
The covering number of a sample S is the size of the smallest covering: ^
∞
(T
M
, ρ
M
, , S) =
inf [1[ s.t. 1 is a covering of T
M
(w, S). We also deﬁne
^
∞
(T
M
, ρ
M
, , m) = sup
S: [S[=m
^
∞
(T
M
, ρ
M
, , S).
We provide a bound on the covering number of our new function class in terms of a covering number for the linear function class. Recall that $N_c$ is the maximum number of cliques in $\mathcal{C}(x)$, $V_c$ is the maximum number of values in a clique $|\mathcal{Y}_c|$, $q$ is the maximum number of cliques that have a variable in common, and $R_c$ is an upper bound on the 2-norm of clique basis functions. Consider a first-order sequence model as an example, with $L$ as the maximum length and $V$ the number of values a variable takes. Then $N_c = 2L - 1$, since we have $L$ node cliques and $L - 1$ edge cliques; $V_c = V^2$, because of the edge cliques; and $q = 3$, since nodes in the middle of the sequence participate in 3 cliques: the previous-current edge clique, the node clique, and the current-next edge clique.
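These counts are easy to sanity-check in Python; `chain_counts` is a hypothetical helper written for this example only:

```python
def chain_counts(L, V):
    """Clique counts for a first-order chain with L nodes over V values:
    L node cliques plus L - 1 edge cliques, V**2 joint values per edge
    clique, and 3 cliques touching any interior node (left edge, the
    node itself, right edge)."""
    N_c = L + (L - 1)   # = 2L - 1 total cliques
    V_c = V ** 2        # maximum number of values in a clique
    q = 3               # maximum cliques sharing a variable
    return N_c, V_c, q

print(chain_counts(10, 5))  # (19, 25, 3)
```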
Lemma A.1.10 (Bound on multi-error-level covering number)
\[ \mathcal{N}_\infty(T_M, \rho_M, q\epsilon, m) \le \mathcal{N}_\infty(T_L, \rho_L, \epsilon, mN_c(V_c - 1)). \]
Proof: We will show that $\mathcal{N}_\infty(T_M, \rho_M, q\epsilon, S) \le \mathcal{N}_\infty(T_L, \rho_L, \epsilon, S')$ for any sample $S$ of size $m$, where we construct the sample $S'$ of size $mN_c(V_c - 1)$ in order to cover the clique potentials as described below. Note that this is sufficient since $\mathcal{N}_\infty(T_L, \rho_L, \epsilon, S') \le \mathcal{N}_\infty(T_L, \rho_L, \epsilon, mN_c(V_c - 1))$, by definition, so
\[ \mathcal{N}_\infty(T_M, \rho_M, q\epsilon, m) = \sup_{S :\, |S| = m} \mathcal{N}_\infty(T_M, \rho_M, q\epsilon, S) \le \mathcal{N}_\infty(T_L, \rho_L, \epsilon, mN_c(V_c - 1)). \]
The construction of $S'$ below is inspired by the proof technique in Collins [2001], but the key difference is that our construction is linear in the number of cliques $N_c$ and exponential in the number of label variables per clique, while his is exponential in the total number of label variables per example. This reduction in size comes about because our covering approximates the values of clique potentials $\mathbf{w}^\top \Delta\mathbf{f}_{i,c}(y_c)$ for each clique $c$ and clique assignment $y_c$, as opposed to the values of entire assignments $\mathbf{w}^\top \Delta\mathbf{f}_i(y)$.
For each sample $z \in S$, we create $N_c(V_c - 1)$ samples $\Delta\mathbf{f}_{i,c}(y_c)$, one for each clique $c$ and each assignment $y_c \ne y^{(i)}_c$. We construct a set of vectors $\bar{\mathcal{V}} = \{v^{(1)}, \ldots, v^{(r)}\}$, where $v^{(j)} \in \mathbb{R}^{mN_c(V_c - 1)}$. The component of $v^{(j)}$ corresponding to the sample $z^{(i)}$ and the assignment $y_c$ to the labels of the clique $c$ will be denoted by $v^{(j)}_{i,c}(y_c)$. For convenience, we define $v^{(j)}_{i,c}(y^{(i)}_c) = 0$ for correct label assignments, as $\Delta\mathbf{f}_{i,c}(y^{(i)}_c) = 0$. To make $\bar{\mathcal{V}}$ an $\infty$-norm covering of $T_L(\mathbf{w}, S')$ under $\rho_L$, we require that for any $\mathbf{w}$ there exists a $v^{(j)} \in \bar{\mathcal{V}}$ such that for each sample $z^{(i)}$:
\[ |v^{(j)}_{i,c}(y_c) - \mathbf{w}^\top \Delta\mathbf{f}_{i,c}(y_c)| \le \epsilon; \quad \forall c \in \mathcal{C}^{(i)}, \forall y_c. \quad (A.1) \]
By Definition A.1.1, the number of vectors in $\bar{\mathcal{V}}$ is given by $r = \mathcal{N}_\infty(T_L, \rho_L, \epsilon, mN_c(V_c - 1))$. We can now use $\bar{\mathcal{V}}$ to construct a covering $\mathcal{V} = \{V^{(1)}, \ldots, V^{(r)}\}$, where
\[ V^{(j)} = (V^{(j)}_1, \ldots, V^{(j)}_i, \ldots, V^{(j)}_m) \]
and $V^{(j)}_i \in \mathbb{R}^L$, for our multi-error-level function $T_M$. Let $v^{(j)}_i(y) = \sum_c v^{(j)}_{i,c}(y_c)$ and $M_d(v^{(j)}_i, z^{(i)}) = \min_{y :\, H_i(y) = d} v^{(j)}_i(y)$; then
\[ V^{(j)}_i = (M_1(v^{(j)}, z^{(i)}), \ldots, M_d(v^{(j)}, z^{(i)}), \ldots, M_L(v^{(j)}, z^{(i)})). \quad (A.2) \]
Note that $v^{(j)}_{i,c}(y_c)$ is zero for all cliques $c$ for which the assignment is correct: $y_c = y^{(i)}_c$. Thus for an assignment $y$ with $d$ mistakes, at most $dq$ of the $v^{(j)}_{i,c}(y_c)$ will be non-zero, as each label can appear in at most $q$ cliques. By combining this fact with Eq. (A.1), we obtain:
\[ \Big| v^{(j)}_i(y) - \mathbf{w}^\top \Delta\mathbf{f}_i(y) \Big| \le dq\epsilon, \quad \forall i, \forall y :\, H_i(y) = d. \quad (A.3) \]
We conclude the proof by showing that $\mathcal{V}$ is a covering of $T_M$ under $\rho_M$: For each $\mathbf{w}$, pick $V^{(j)} \in \mathcal{V}$ such that the corresponding $v^{(j)} \in \bar{\mathcal{V}}$ satisfies the condition in Eq. (A.1). We must now bound:
\[ \rho_M(V^{(j)}_i, T_M(\mathbf{w}, z^{(i)})) = \max_d \frac{1}{d} \Big| \min_{y :\, H_i(y) = d} v^{(j)}_i(y) - \min_{y :\, H_i(y) = d} \mathbf{w}^\top \Delta\mathbf{f}_i(y) \Big| . \]
Fix any $i$. Let $y^v_d = \arg\min_{y :\, H_i(y) = d} v^{(j)}_i(y)$ and $y^w_d = \arg\min_{y :\, H_i(y) = d} \mathbf{w}^\top \Delta\mathbf{f}_i(y)$. Consider the case where $v^{(j)}_i(y^v_d) \ge \mathbf{w}^\top \Delta\mathbf{f}_i(y^w_d)$ (the reverse case is analogous); we must prove that:
\[ v^{(j)}_i(y^v_d) - \mathbf{w}^\top \Delta\mathbf{f}_i(y^w_d) \le v^{(j)}_i(y^w_d) - \mathbf{w}^\top \Delta\mathbf{f}_i(y^w_d) \le dq\epsilon; \quad (A.4) \]
where the first step follows from the definition of $y^v_d$, since $v^{(j)}_i(y^v_d) \le v^{(j)}_i(y^w_d)$. The last step is a direct consequence of Eq. (A.3). Hence $\rho_M(V^{(j)}_i, T_M(\mathbf{w}, z^{(i)})) \le q\epsilon$.
Lemma A.1.11 (Numeric bound on multi-error-level covering number)
\[ \log_2 \mathcal{N}_\infty(T_M, \rho_M, \epsilon, m) \le 36 \frac{R_c^2 \|\mathbf{w}\|_2^2 q^2}{\epsilon^2} \log_2 \left( 1 + 2 \left\lceil \frac{4 R_c \|\mathbf{w}\|_2 q}{\epsilon} + 2 \right\rceil m N_c (V_c - 1) \right). \]
Proof: Substitute Theorem A.1.2 into Lemma A.1.10.
Theorem A.1.12 (Multi-label analog of Theorem A.1.3) Let $f_M$ and $f^\gamma_M$ be as defined above. Let $\gamma_1 > \gamma_2 > \ldots$ be a decreasing sequence of parameters, and $p_i$ be a sequence of positive numbers such that $\sum_{i=1}^\infty p_i = 1$. Then for all $\delta > 0$, with probability at least $1 - \delta$ over the data:
\[ \mathbf{E}_z\, f_M(T_M(\mathbf{w}, z)) \le \mathbf{E}_S\, f^{\gamma}_M(T_M(\mathbf{w}, z)) + \sqrt{\frac{32}{m} \left( \ln 4\mathcal{N}_\infty(T_M, \rho_M, \gamma_i, S) + \ln \frac{1}{p_i \delta} \right)} \]
for all $\mathbf{w}$ and $\gamma$, where for each fixed $\gamma$ we use $i$ to denote the smallest index s.t. $\gamma_i \le \gamma$.
Proof: Similar to the proof of Zhang's Theorem 2 and Corollary 1 [Zhang, 2002], where in Step 3 (de-randomization) we substitute the vector-valued $T_M$ and the metric $\rho_M$.
Theorem 5.5.1 follows from the above theorem with $\gamma_i = R_c \|\mathbf{w}\|_2 / 2^i$ and $p_i = 1/2^i$, using an argument identical to the proof of Theorem 6 in Zhang [2002].
A.2 AMN proofs and derivations
In this appendix, we present proofs of the LP inference properties and derivations of the factored primal and dual max-margin formulation from Ch. 7. Recall that the LP relaxation for finding the optimal $\max_y g(y)$ is:
\[ \max_\mu \; \sum_{v \in \mathcal{V}} \sum_{k=1}^K \mu_v(k) g_v(k) + \sum_{c \in \mathcal{C} \setminus \mathcal{V}} \sum_{k=1}^K \mu_c(k) g_c(k) \quad (A.5) \]
\[ \text{s.t.} \quad \mu_c(k) \ge 0, \; \forall c \in \mathcal{C}, k; \qquad \sum_{k=1}^K \mu_v(k) = 1, \; \forall v \in \mathcal{V}; \qquad \mu_c(k) \le \mu_v(k), \; \forall c \in \mathcal{C} \setminus \mathcal{V}, v \in c, k. \]
A.2.1 Binary AMNs
Proof (For Theorem 7.2.1) Consider any fractional, feasible $\mu$. We show that we can construct a new feasible assignment $\mu'$ which increases the objective (or leaves it unchanged) and furthermore has fewer fractional entries.
Since $g_c(k) \ge 0$, we can assume that $\mu_c(k) = \min_{v \in c} \mu_v(k)$; otherwise we could increase the objective by increasing $\mu_c(k)$. We construct an assignment $\mu'$ from $\mu$ by leaving integral values unchanged and uniformly shifting fractional values by $\lambda$:
\[ \mu'_v(1) = \mu_v(1) - \lambda \mathbb{1}(0 < \mu_v(1) < 1), \qquad \mu'_v(2) = \mu_v(2) + \lambda \mathbb{1}(0 < \mu_v(2) < 1), \]
\[ \mu'_c(1) = \mu_c(1) - \lambda \mathbb{1}(0 < \mu_c(1) < 1), \qquad \mu'_c(2) = \mu_c(2) + \lambda \mathbb{1}(0 < \mu_c(2) < 1). \]
Now consider the smallest fractional $\mu_v(k)$: let $\lambda(k) = \min_{v :\, \mu_v(k) > 0} \mu_v(k)$ for $k = 1, 2$. Note that if $\lambda = \lambda(1)$ or $\lambda = -\lambda(2)$, then $\mu'$ will have at least one more integral $\mu'_v(k)$ than $\mu$. Thus if we can show that the update results in a feasible and better-scoring assignment, we can apply it repeatedly to get an optimal integer solution. To show that $\mu'$ is feasible, we need $\mu'_v(1) + \mu'_v(2) = 1$, $\mu'_v(k) \ge 0$, and $\mu'_c(k) = \min_{v \in c} \mu'_v(k)$.
First, we show that $\mu'_v(1) + \mu'_v(2) = 1$:
\[ \mu'_v(1) + \mu'_v(2) = \mu_v(1) - \lambda \mathbb{1}(0 < \mu_v(1) < 1) + \mu_v(2) + \lambda \mathbb{1}(0 < \mu_v(2) < 1) = \mu_v(1) + \mu_v(2) = 1. \]
Above we used the fact that if $\mu_v(1)$ is fractional, so is $\mu_v(2)$, since $\mu_v(1) + \mu_v(2) = 1$.
To show that $\mu'_v(k) \ge 0$, we prove $\min_v \mu'_v(k) = 0$:
\[ \min_v \mu'_v(k) = \min_v \left[ \mu_v(k) - \Big( \min_{v :\, \mu_v(k) > 0} \mu_v(k) \Big) \mathbb{1}(0 < \mu_v(k) < 1) \right] = \min \left( \min_v \mu_v(k),\; \min_{v :\, \mu_v(k) > 0} \Big[ \mu_v(k) - \min_{v :\, \mu_v(k) > 0} \mu_v(k) \Big] \right) = 0. \]
Lastly, we show $\mu'_c(k) = \min_{v \in c} \mu'_v(k)$:
\[ \mu'_c(1) = \mu_c(1) - \lambda \mathbb{1}(0 < \mu_c(1) < 1) = \Big( \min_{v \in c} \mu_v(1) \Big) - \lambda \mathbb{1}\Big(0 < \min_{v \in c} \mu_v(1) < 1\Big) = \min_{v \in c} \mu'_v(1); \]
\[ \mu'_c(2) = \mu_c(2) + \lambda \mathbb{1}(0 < \mu_c(2) < 1) = \Big( \min_{v \in c} \mu_v(2) \Big) + \lambda \mathbb{1}\Big(0 < \min_{v \in c} \mu_v(2) < 1\Big) = \min_{v \in c} \mu'_v(2). \]
We have established that the new $\mu'$ is feasible, and it remains to show that we can improve the objective. We can show that the change in the objective is always $\lambda D$ for some constant $D$ that depends only on $\mu$ and $g$. This implies that one of the two cases, $\lambda = \lambda(1)$ or $\lambda = -\lambda(2)$, will necessarily increase the objective (or leave it unchanged). The change in the objective is:
\[ \sum_{v \in \mathcal{V}} \sum_{k=1,2} [\mu'_v(k) - \mu_v(k)] g_v(k) + \sum_{c \in \mathcal{C} \setminus \mathcal{V}} \sum_{k=1,2} [\mu'_c(k) - \mu_c(k)] g_c(k) = \lambda \left( \sum_{v \in \mathcal{V}} [D_v(2) - D_v(1)] + \sum_{c \in \mathcal{C} \setminus \mathcal{V}} [D_c(2) - D_c(1)] \right) = \lambda D, \]
where $D_v(k) = g_v(k) \mathbb{1}(0 < \mu_v(k) < 1)$ and $D_c(k) = g_c(k) \mathbb{1}(0 < \mu_c(k) < 1)$.
Hence the new assignment $\mu'$ is feasible, does not decrease the objective function, and has strictly fewer fractional entries.
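The proof above is constructive, and can be sketched in Python: repeatedly shift all fractional entries by a common $\lambda$ chosen from the two endpoints $\{\lambda(1), -\lambda(2)\}$, keeping whichever endpoint does not decrease the objective (the objective is linear in $\lambda$ on that interval, so one endpoint is always at least as good as $\lambda = 0$). This is an illustrative sketch under the proof's assumptions ($g \ge 0$, binary labels); the helper names `amn_objective` and `round_binary_amn` are invented here:

```python
import numpy as np

def amn_objective(mu, cliques, g_node, g_clique):
    """LP objective of Eq. (A.5) for the binary case, parametrized by
    mu_v(2) = mu[v], mu_v(1) = 1 - mu[v], mu_c(k) = min_{v in c} mu_v(k)."""
    val = g_node[:, 0] @ (1 - mu) + g_node[:, 1] @ mu
    for c, gc in zip(cliques, g_clique):
        idx = list(c)
        val += gc[0] * np.min(1 - mu[idx]) + gc[1] * np.min(mu[idx])
    return val

def round_binary_amn(mu, cliques, g_node, g_clique, tol=1e-9):
    """Shift all fractional entries by a common lambda, chosen from the two
    endpoints; each step makes at least one more entry integral, and the
    better endpoint never decreases the (piecewise-linear) objective."""
    mu = np.asarray(mu, float).copy()
    while True:
        f = (mu > tol) & (mu < 1 - tol)   # fractional entries
        if not f.any():
            return mu
        lam1 = np.min(1 - mu[f])          # drives some mu_v(1) down to 0
        lam2 = np.min(mu[f])              # drives some mu_v(2) down to 0
        cands = []
        for lam in (lam1, -lam2):
            cand = mu.copy()
            cand[f] = np.clip(cand[f] + lam, 0.0, 1.0)
            cands.append(cand)
        mu = max(cands, key=lambda c: amn_objective(c, cliques, g_node, g_clique))
```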
A.2.2 Multiclass AMNs
For $K > 2$, we use the randomized rounding procedure of Kleinberg and Tardos [1999] to produce an integer solution from the linear relaxation, losing at most a factor of $m = \max_{c \in \mathcal{C}} |c|$ in the objective function. The basic idea of the rounding procedure is to treat the $\mu_v(k)$ as probabilities and assign labels according to these probabilities in phases. In each phase, we pick a label $k$ uniformly at random and a threshold $\alpha \in [0, 1]$ uniformly at random. For each node $v$ which has not yet been assigned a label, we assign the label $k$ if $\mu_v(k) \ge \alpha$. The procedure terminates when all nodes have been assigned a label. Our analysis closely follows that of Kleinberg and Tardos [1999].
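A minimal Python sketch of the rounding procedure (the function name `kt_round` and the array layout are our own):

```python
import numpy as np

def kt_round(mu, rng=None):
    """Randomized rounding in the style of Kleinberg & Tardos [1999].
    mu: (n, K) array of node marginals, each row summing to 1.
    In each phase pick a label k and a threshold alpha uniformly at
    random, and give label k to every still-unlabeled node v with
    mu[v, k] >= alpha."""
    rng = np.random.default_rng(rng)
    n, K = mu.shape
    labels = np.full(n, -1)
    while (labels == -1).any():
        k = rng.integers(K)
        alpha = rng.uniform()
        hit = (labels == -1) & (mu[:, k] >= alpha)
        labels[hit] = k
    return labels
```

By Lemma A.2.1, each node $v$ should receive label $k$ with probability exactly $\mu_v(k)$, which can be checked empirically by repeated runs.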
Lemma A.2.1 The probability that a node $v$ is assigned label $k$ by the randomized procedure is $\mu_v(k)$.
Proof The probability that an unassigned node is assigned label $k$ during one phase is $\frac{1}{K}\mu_v(k)$, which is proportional to $\mu_v(k)$. By symmetry, the probability that a node is assigned label $k$ over all phases is exactly $\mu_v(k)$.
Lemma A.2.2 The probability that all nodes in a clique $c$ are assigned label $k$ by the procedure is at least $\frac{1}{|c|}\mu_c(k)$.
Proof For a single phase, the probability that all nodes in a clique $c$ are assigned label $k$, if none of the nodes were previously assigned, is $\frac{1}{K}\min_{v \in c} \mu_v(k) = \frac{1}{K}\mu_c(k)$. The probability that at least one of the nodes will be assigned label $k$ in a phase is $\frac{1}{K}\max_{v \in c} \mu_v(k)$. The probability that none of the nodes in the clique will be assigned any label in one phase is $1 - \frac{1}{K}\sum_{k=1}^K \max_{v \in c} \mu_v(k)$.
Nodes in the clique $c$ will be assigned label $k$ by the procedure if they are all assigned label $k$ in one phase. (They can also be assigned label $k$ as a result of several phases, but we can ignore this possibility for the purposes of the lower bound.) The probability that all the nodes in $c$ will be assigned label $k$ by the procedure in a single phase is:
\[ \sum_{j=1}^\infty \frac{1}{K} \mu_c(k) \left( 1 - \frac{1}{K} \sum_{k=1}^K \max_{v \in c} \mu_v(k) \right)^{j-1} = \frac{\mu_c(k)}{\sum_{k=1}^K \max_{v \in c} \mu_v(k)} \ge \frac{\mu_c(k)}{\sum_{k=1}^K \sum_{v \in c} \mu_v(k)} = \frac{\mu_c(k)}{\sum_{v \in c} \sum_{k=1}^K \mu_v(k)} = \frac{\mu_c(k)}{|c|}. \]
Above, we first used the fact that for $d < 1$, $\sum_{i=0}^\infty d^i = \frac{1}{1-d}$, and then upper-bounded the max of the set of positive $\mu_v(k)$'s by their sum.
Theorem A.2.3 The expected cost of the assignment found by the randomized procedure given a solution $\mu$ to the linear program in Eq. (A.5) is at least
\[ \sum_{v \in \mathcal{V}} \sum_{k=1}^K g_v(k) \mu_v(k) + \sum_{c \in \mathcal{C} \setminus \mathcal{V}} \frac{1}{|c|} \sum_{k=1}^K g_c(k) \mu_c(k). \]
Proof This is immediate from the previous two lemmas.
Proof This is immediate from the previous two lemmas.
The only difference between the expected cost of the rounded solution and the (non
integer) optimal solution is the
1
[c[
factor in the second term. By picking m = max
c∈(
[c[, we
have that the rounded solution is at most m times worse than the optimal solution produced
by the LP of Eq. (A.5).
We can also derandomize this procedure to get a deterministic algorithm with the same
guarantees, using the method of conditional probabilities, similar in spirit to the approach
of Kleinberg and Tardos [1999].
Note that the approximation factor of m applies, in fact, only to the clique poten
tials. Thus, if we compare the logprobability of the optimal MAP solution and the log
probability of the assignment produced by this randomized rounding procedure, the terms
corresponding to the logpartitionfunction and the node potentials are identical. We obtain
an additive error (in logprobability space) only for the clique potentials. As node poten
tials are often larger in magnitude than clique potentials, the fact that we incur no loss
proportional to node potentials is likely to lead to smaller errors in practice. Along similar
lines, we note that the constant factor approximation is smaller for smaller cliques; again,
we observe, the potentials associated with large cliques are typically smaller in magnitude,
reducing further the actual error in practice.
A.2.3 Derivation of the factored primal and dual max-margin QP
Using Assumptions 7.4.1 and 7.4.4, we have the dual of the LP used to represent the interior max subproblem $\max_y \mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)$ in Eq. (3.2):
\[ \min \; \sum_{v \in \mathcal{V}^{(i)}} \xi_{i,v} \quad (A.6) \]
\[ \text{s.t.} \quad -\mathbf{w}^\top \mathbf{f}_{i,v}(k) - \sum_{c \supset v} m_{i,c,v}(k) \ge \ell_{i,v}(k) - \xi_{i,v}, \; \forall i, v \in \mathcal{V}^{(i)}, k; \]
\[ -\ddot{\mathbf{w}}^\top \ddot{\mathbf{f}}_{i,c}(k) + \sum_{v \in c} m_{i,c,v}(k) \ge \ell_{i,c}(k), \; \forall i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, k; \]
\[ m_{i,c,v}(k) \ge 0, \; \forall i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, v \in c, k; \]
where $\mathbf{f}_{i,c}(k) = \mathbf{f}_{i,c}(k, \ldots, k)$ and $\ell_{i,c}(k) = \ell_{i,c}(k, \ldots, k)$. In the dual, we have a variable $\xi_{i,v}$ for each normalization constraint in Eq. (7.1) and variables $m_{i,c,v}(k)$ for each of the inequality constraints.
Substituting this dual into Eq. (5.1), we obtain:
\[ \min \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \quad (A.7) \]
\[ \text{s.t.} \quad \mathbf{w}^\top \mathbf{f}_i(y^{(i)}) + \xi_i \ge \sum_{v \in \mathcal{V}^{(i)}} \xi_{i,v}, \; \forall i; \]
\[ -\mathbf{w}^\top \mathbf{f}_{i,v}(k) - \sum_{c \supset v} m_{i,c,v}(k) \ge \ell_{i,v}(k) - \xi_{i,v}, \; \forall i, v \in \mathcal{V}^{(i)}, k; \]
\[ -\ddot{\mathbf{w}}^\top \ddot{\mathbf{f}}_{i,c}(k) + \sum_{v \in c} m_{i,c,v}(k) \ge \ell_{i,c}(k), \; \forall i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, k; \]
\[ m_{i,c,v}(k) \ge 0, \; \forall i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, v \in c, k; \qquad \ddot{\mathbf{w}} \ge 0. \]
Now let $\xi_{i,v} = \xi'_{i,v} + \mathbf{w}^\top \mathbf{f}_{i,v}(y^{(i)}_v) + \sum_{c \supset v} \ddot{\mathbf{w}}^\top \ddot{\mathbf{f}}_{i,c}(y^{(i)}_c)/|c|$ and $m_{i,c,v}(k) = m'_{i,c,v}(k) + \ddot{\mathbf{w}}^\top \ddot{\mathbf{f}}_{i,c}(y^{(i)}_c)/|c|$. Re-expressing the above QP in terms of these new variables, we get:
\[ \min \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_i \xi_i \quad (A.8) \]
\[ \text{s.t.} \quad \xi_i \ge \sum_{v \in \mathcal{V}^{(i)}} \xi'_{i,v}, \; \forall i; \]
\[ \mathbf{w}^\top \Delta\mathbf{f}_{i,v}(k) - \sum_{c \supset v} m'_{i,c,v}(k) \ge \ell_{i,v}(k) - \xi'_{i,v}, \; \forall i, v \in \mathcal{V}^{(i)}, k; \]
\[ \ddot{\mathbf{w}}^\top \Delta\ddot{\mathbf{f}}_{i,c}(k) + \sum_{v \in c} m'_{i,c,v}(k) \ge \ell_{i,c}(k), \; \forall i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, k; \]
\[ m'_{i,c,v}(k) \ge -\ddot{\mathbf{w}}^\top \ddot{\mathbf{f}}_{i,c}(y^{(i)}_c)/|c|, \; \forall i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, v \in c, k; \qquad \ddot{\mathbf{w}} \ge 0. \]
Since $\xi_i = \sum_{v \in \mathcal{V}^{(i)}} \xi'_{i,v}$ at the optimum, we can eliminate $\xi_i$ and the corresponding set of constraints to get the formulation in Eq. (7.4), repeated here for reference:
\[ \min \; \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i, v \in \mathcal{V}^{(i)}} \xi_{i,v} \quad (A.9) \]
\[ \text{s.t.} \quad \mathbf{w}^\top \Delta\mathbf{f}_{i,v}(k) - \sum_{c \supset v} m_{i,c,v}(k) \ge \ell_{i,v}(k) - \xi_{i,v}, \; \forall i, v \in \mathcal{V}^{(i)}, k; \]
\[ \ddot{\mathbf{w}}^\top \Delta\ddot{\mathbf{f}}_{i,c}(k) + \sum_{v \in c} m_{i,c,v}(k) \ge \ell_{i,c}(k), \; \forall i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, k; \]
\[ m_{i,c,v}(k) \ge -\ddot{\mathbf{w}}^\top \ddot{\mathbf{f}}_{i,c}(y^{(i)}_c)/|c|, \; \forall i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, v \in c, k; \qquad \ddot{\mathbf{w}} \ge 0. \]
Now the dual of Eq. (A.9) is given by:
\[ \max \; \sum_{i, c \in \mathcal{C}^{(i)}, k} \mu_{i,c}(k) \ell_{i,c}(k) - \frac{1}{2} \bigg\| \sum_{i, v \in \mathcal{V}^{(i)}, k} \mu_{i,v}(k) \Delta\dot{\mathbf{f}}_{i,v}(k) \bigg\|^2 - \frac{1}{2} \bigg\| \ddot{\tau} + \sum_{i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, v \in c, k} \lambda_{i,c,v}(k) \ddot{\mathbf{f}}_{i,c}(y^{(i)}_c)/|c| + \sum_{i, c \in \mathcal{C}^{(i)}, k} \mu_{i,c}(k) \Delta\ddot{\mathbf{f}}_{i,c}(k) \bigg\|^2 \quad (A.10) \]
\[ \text{s.t.} \quad \mu_{i,c}(k) \ge 0, \; \forall i, \forall c \in \mathcal{C}^{(i)}, k; \qquad \sum_{k=1}^K \mu_{i,v}(k) = C, \; \forall i, \forall v \in \mathcal{V}^{(i)}; \]
\[ \mu_{i,v}(k) - \mu_{i,c}(k) = \lambda_{i,c,v}(k), \; \forall i, \forall c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, v \in c, k; \]
\[ \lambda_{i,c,v}(k) \ge 0, \; \forall i, \forall c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, v \in c, k; \qquad \ddot{\tau} \ge 0. \]
In this dual, the $\mu$'s correspond to the first two sets of constraints, while $\lambda$ and $\ddot{\tau}$ correspond to the third and fourth sets of constraints. Using the substitution
\[ \ddot{\nu} = \ddot{\tau} + \sum_{i, c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, v \in c, k} \lambda_{i,c,v}(k) \ddot{\mathbf{f}}_{i,c}(y^{(i)}_c)/|c| \]
and the fact that $\lambda_{i,c,v}(k) \ge 0$ and $\ddot{\mathbf{f}}_{i,c}(y^{(i)}_c) \ge 0$, we can eliminate $\lambda$ and $\ddot{\tau}$, as well as divide the $\mu$'s by $C$, and re-express the above QP as:
\[ \max \; \sum_{i, c \in \mathcal{C}^{(i)}, k} \mu_{i,c}(k) \ell_{i,c}(k) - \frac{C}{2} \bigg\| \sum_{i, v \in \mathcal{V}^{(i)}, k} \mu_{i,v}(k) \Delta\dot{\mathbf{f}}_{i,v}(k) \bigg\|^2 - \frac{C}{2} \bigg\| \ddot{\nu} + \sum_{i, c \in \mathcal{C}^{(i)}, k} \mu_{i,c}(k) \Delta\ddot{\mathbf{f}}_{i,c}(k) \bigg\|^2 \]
\[ \text{s.t.} \quad \mu_{i,c}(k) \ge 0, \; \forall i, \forall c \in \mathcal{C}^{(i)}, k; \qquad \sum_{k=1}^K \mu_{i,v}(k) = 1, \; \forall i, \forall v \in \mathcal{V}^{(i)}; \]
\[ \mu_{i,c}(k) \le \mu_{i,v}(k), \; \forall i, \forall c \in \mathcal{C}^{(i)} \setminus \mathcal{V}^{(i)}, v \in c, k; \qquad \ddot{\nu} \ge 0. \]
Bibliography
Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning.
Proc. ICML.
Ahuja, R., & Orlin, J. (2001). Inverse optimization, Part I: Linear programming and general
problem. Operations Research, 35, 771–783.
Altschul, S., Madden, T., Schaffer, A., Zhang, A., Miller, W., & Lipman (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Altun, Y., Smola, A., & Hofmann, T. (2004). Exponential families for conditional random
ﬁelds. Proc. UAI.
Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden markov support vector ma
chines. Proc. ICML.
Bach, F., & Jordan, M. (2001). Thin junction trees. NIPS.
Bach, F., & Jordan, M. (2003). Learning spectral clustering. NIPS.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Addison-Wesley Longman.
Bairoch, A., & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL. Nucleic Acids Res., 28, 45–48.
Baldi, P., Cheng, J., & Vullo, A. (2004). Large-scale prediction of disulphide bond connectivity. Proc. NIPS.
Baldi, P., & Pollastri, G. (2003). The principled design of large-scale recursive neural network architectures: DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning Research, 4, 575–602.
Bansal, N., Blum, A., & Chawla, S. (2002). Correlation clustering. FOCS.
BarHillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions
using equivalence relations. Proc. ICML.
Bartlett, P., Collins, M., Taskar, B., & McAllester, D. (2004). Exponentiated gradient algorithms for large-margin structured classification. NIPS.
Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2003). Convexity, classiﬁcation, and risk
bounds (Technical Report 638). Department of Statistics, U.C. Berkeley.
Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using
shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24.
Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., &
Bourne, P. (2000). The protein data bank.
Bertsekas, D. (1999). Nonlinear programming. Belmont, MA: Athena Scientific.
Bertsimas, D., & Tsitsiklis, J. (1997). Introduction to linear programming. Athena Scientific.
Besag, J. E. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B, 48.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford, UK: Oxford University Press.
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. Proc. COLT.
Bouman, C., & Shapiro, M. (1994). A multiscale random field model for Bayesian image segmentation. IP, 3(2).
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Boykov, Y., & Kolmogorov, V. (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in computer vision. PAMI.
Boykov, Y., Veksler, O., & Zabih, R. (1999a). Fast approximate energy minimization via graph cuts. ICCV.
Boykov, Y., Veksler, O., & Zabih, R. (1999b). Markov random fields with efficient approximations. CVPR.
Burton, D., & Toint, P. L. (1992). On an instance of the inverse shortest paths problem.
Mathematical Programming, 53, 45–61.
Ceroni, A., Frasconi, P., Passerini, A., & Vullo, A. (2003). Predicting the disulfide bonding state of cysteines with combinations of kernel machines. Journal of VLSI Signal Processing, 35, 287–295.
Chakrabarti, S., Dom, B., & Indyk, P. (1998). Enhanced hypertext categorization using hyperlinks. SIGMOD.
Chapelle, O., Weston, J., & Schoelkopf, B. (2002). Cluster kernels for semi-supervised learning. Proc. NIPS.
Charikar, M., Guruswami, V., & Wirth, A. (2003). Clustering with qualitative information.
FOCS.
Charniak, E. (1993). Statistical language learning. MIT Press.
Clark, S., & Curran, J. R. (2004). Parsing the WSJ using CCG and log-linear models. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04).
Collins, M. (1999). Head-driven statistical models for natural language parsing. Doctoral dissertation, University of Pennsylvania.
Collins, M. (2000). Discriminative reranking for natural language parsing. ICML 17 (pp.
175–182).
Collins, M. (2001). Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. IWPT.
Collins, M. (2004). Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In H. Bunt, J. Carroll and G. Satta (Eds.), New developments in parsing technology. Kluwer.
Corduneanu, A., & Jaakkola, T. (2003). On information regularization. Proc. UAI.
Cormen, T., Leiserson, C., Rivest, R., & Stein, C. (2001). Introduction to algorithms. MIT
Press. 2nd edition.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cowell, R., Dawid, A., Lauritzen, S., & Spiegelhalter, D. (1999). Probabilistic networks
and expert systems. New York: Springer.
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel
based vector machines. Journal of Machine Learning Research, 2(5), 265–292.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., & Slattery, S. (1998). Learning to extract symbolic knowledge from the world wide web. Proc. AAAI-98 (pp. 509–516).
Cristianini, N., & ShaweTaylor, J. (2000). An introduction to support vector machines and
other kernelbased learning methods. Cambridge University Press.
DeGroot, M. H. (1970). Optimal statistical decisions. New York: McGraw-Hill.
Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random ﬁelds.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(4), 380–393.
Demaine, E. D., & Immorlica, N. (2003). Correlation clustering with partial information.
APPROX.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39.
Devroye, L., Györfi, L., & Lugosi, G. (1996). Probabilistic theory of pattern recognition. New York: Springer-Verlag.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. New York: Wiley-Interscience. 2nd edition.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis. Cambridge University Press.
Edmonds, J. (1965). Maximum matching and a polyhedron with 0-1 vertices. Journal of Research of the National Bureau of Standards, 69B, 125–130.
Egghe, L., & Rousseau, R. (1990). Introduction to informetrics. Elsevier.
Emanuel, D., & Fiat, A. (2003). Correlation clustering: minimizing disagreements on arbitrary weighted graphs. ESA.
Fariselli, P., & Casadio, R. (2001). Prediction of disulfide connectivity in proteins. Bioinformatics, 17, 957–964.
Fariselli, P., Martelli, P., & Casadio, R. (2002). A neural network-based method for predicting the disulfide connectivity in proteins. Knowledge based intelligent information engineering systems and allied technologies (KES 2002), 1, 464–468.
Fariselli, P., Ricobelli, P., & Casadio, R. (1999). Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins, 36, 340–346.
Fiser, A., & Simon, I. (2000). Predicting the oxidation state of cysteines by multiple sequence alignment. Bioinformatics, 3, 251–256.
Forsyth, D. A., & Ponce, J. (2002). Computer vision: A modern approach. Prentice Hall.
Frasconi, P., Gori, M., & Sperduti, A. (1998). A general framework for adaptive structures.
IEEE Transactions on Neural Networks.
Frasconi, P., Passerini, A., & Vullo, A. (2002). A two stage svm architecture for predicting
the disulﬁde bonding state of cysteines. Proceedings of IEEE Neural Network for signal
processing conference.
Freund, Y., & Schapire, R. (1998). Large margin classification using the perceptron algorithm. Computational Learning Theory.
Friedman, N., Getoor, L., Koller, D., & Pfeffer, A. (1999). Learning probabilistic relational
models. Proc. IJCAI99 (pp. 1300–1309). Stockholm, Sweden.
Friess, T., Cristianini, N., & Campbell, C. (1998). The kernel adatron algorithm: a fast and
simple learning procedure for support vector machine. Proc. ICML.
Gabow, H. (1973). Implementation of algorithms for maximum matching on non-bipartite graphs. Doctoral dissertation, Stanford University.
Garey, M. R., & Johnson, D. S. (1979). Computers and intractability. Freeman.
Geman, S., & Johnson, M. (2002). Dynamic programming for parsing and estimation of
stochastic uniﬁcationbased grammars. Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics.
Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2002). Learning probabilistic models
of link structure. Journal of Machine Learning Research, 8.
Getoor, L., Segal, E., Taskar, B., & Koller, D. (2001). Probabilistic models of text and link
structure for hypertext classiﬁcation. Proc. IJCAI01 Workshop on Text Learning: Beyond
Supervision. Seattle, Wash.
Greig, D. M., Porteous, B. T., & Seheult, A. H. (1989). Exact maximum a posteriori estimation for binary images. J. R. Statist. Soc. B, 51.
Guestrin, C., Koller, D., Parr, R., & Venkataraman, S. (2003). Efﬁcient solution algorithms
for factored mdps. Journal of Artiﬁcial Intelligence Research, 19.
Gusﬁeld, D. (1997). Algorithms on strings, trees, and sequences : Computer science and
computational biology. Cambridge University Press.
Harrison, P., & Sternberg, M. (1994). Analysis and classification of disulphide connectivity in proteins: The entropic effect of cross-linkage. Journal of Molecular Biology, 244.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer-Verlag.
Haussler, D. (1999). Convolution kernels on discrete structures (Technical Report). UC Santa Cruz.
Heuberger, C. (2004). Inverse combinatorial optimization: A survey on problems, methods, and results. Journal of Combinatorial Optimization, 8.
Hochbaum, D. S. (Ed.). (1997). Approximation algorithms for NP-hard problems. PWS Publishing Company.
Joachims, T. (1999). Transductive inference for text classiﬁcation using support vector
machines. Proc. ICML99 (pp. 200–209). Morgan Kaufmann Publishers, San Francisco,
US.
Johnson, M. (2001). Joint and conditional estimation of tagging and parsing models. ACL
39.
Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic
“uniﬁcationbased” grammars. Proceedings of ACL 1999.
Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features.
Kaplan, R., Riezler, S., King, T., Maxwell, J., Vasserman, A., & Crouch, R. (2004). Speed and accuracy in shallow and deep stochastic parsing. Proceedings of HLT-NAACL '04.
Kassel, R. (1995). A comparison of approaches to on-line handwritten character recognition. Doctoral dissertation, MIT Spoken Language Systems Group.
Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for
linear predictors. Information and Computation, 132(1), 1–63.
Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. ACL 41 (pp. 423–
430).
Kleinberg, J., & Tardos, E. (1999). Approximation algorithms for classiﬁcation problems
with pairwise relationships: Metric labeling and Markov random ﬁelds. FOCS.
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of
the ACM, 46(5), 604–632.
Klepeis, J., & Floudas, C. (2003). Prediction of β-sheet topology and disulfide bridges in polypeptides. Journal of Computational Chemistry, 24, 191–208.
Koller, D., & Pfeffer, A. (1998). Probabilistic frame-based systems. Proc. AAAI-98 (pp. 580–587). Madison, Wisc.
Kolmogorov, V., & Zabih, R. (2002). What energy functions can be minimized using graph
cuts? PAMI.
Kumar, S., & Hebert, M. (2003). Discriminative ﬁelds for modeling spatial dependencies
in natural images. NIPS.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random ﬁelds: Probabilistic
models for segmenting and labeling sequence data. ICML.
Lafferty, J., Zhu, X., & Liu, Y. (2004). Kernel conditional random ﬁelds: Representation
and clique selection. ICML.
Lawler, E. (1976). Combinatorial optimization: Networks and matroids. New York: Holt,
Rinehart and Winston.
Leslie, C., Eskin, E., & Noble, W. (2002). The spectrum kernel: a string kernel for SVM protein classification. Proc. Pacific Symposium on Biocomputing.
Liu, Z., & Zhang, J. (2003). On inverse problems of optimum perfect matching. Journal
of Combinatorial Optimization, 7.
Lodhi, H., ShaweTaylor, J., Cristianini, N., & Watkins, C. (2000). Text classiﬁcation using
string kernels. NIPS.
Luenberger, D. (1997). Investment science. Oxford University Press.
Manning, C., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press.
Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated
corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Markowitz, H. (1991). Portfolio selection: Efﬁcient diversiﬁcation of investments. Basil
Blackwell.
Martelli, P., Fariselli, P., Malaguti, L., & Casadio, R. (2002). Prediction of the disulﬁde
bonding state of cysteines in proteins at 88% accuracy. Protein Science, 11, 27359.
Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented
natural images and its application to evaluating segmentation algorithms and measuring
ecological statistics. Proc. ICCV.
Matsumura, M., Signor, G., & Matthews, B. (1989). Substantial increase of protein stability by multiple disulfide bonds. Nature, 342, 291–293.
Matusov, E., Zens, R., & Ney, H. (2004). Symmetric word alignments for statistical machine translation. Proc. COLING.
McCallum, A. (2003). Efﬁciently inducing features of conditional random ﬁelds. Proc.
UAI.
McCallum, A., & Wellner, B. (2003). Toward conditional models of identity uncertainty
with application to proper noun coreference. IJCAI Workshop on Information Integration
on the Web.
Mitchell, T. (1997). Machine learning. McGraw-Hill.
Miyao, Y., & Tsujii, J. (2002). Maximum entropy estimation for feature forests. Proceedings of Human Language Technology Conference (HLT 2002).
Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for approximate
inference: an empirical study. Proc. UAI99 (pp. 467–475).
Needleman, S., & Wunsch, C. (1970). A general method applicable to the search for
similarities in the amino acid sequences of two proteins. Journal of Molecular Biology,
48.
Nemhauser, G. L., & Wolsey, L. A. (1999). Integer and combinatorial optimization. New
York: J. Wiley.
Neville, J., & Jensen, D. (2000). Iterative classification in relational data. Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data (pp. 13–20).
Ng, A., & Jordan, M. (2001). On discriminative vs. generative classiﬁers: A comparison
of logistic regression and naive Bayes. NIPS.
Ng, A. Y., & Russell, S. (2000). Algorithms for inverse reinforcement learning. Proc.
ICML.
Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.
Papadimitriou, C., & Steiglitz, K. (1982). Combinatorial optimization: Algorithms and
complexity. Englewood Cliffs, NJ: PrenticeHall.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Francisco: Morgan
Kaufmann.
Pinto, D., McCallum, A., Wei, X., & Croft, W. B. (2003). Table extraction using conditional random fields. Proc. ACM SIGIR.
Platt, J. (1999). Using sparseness and analytic QP to speed training of support vector
machines. NIPS.
Potts, R. B. (1952). Some generalized orderdisorder transformations. Proc. Cambridge
Phil. Soc., 48.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Schrijver, A. (2003). Combinatorial optimization: Polyhedra and efﬁciency. Springer.
Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. Proc. HLT-NAACL.
ShaweTaylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural
risk minimization over datadependent hierarchies. IEEE Trans. on Information Theory,
44(5), 1926–1940.
Shen, L., Sarkar, A., & Joshi, A. K. (2003). Using LTAG based features in parse reranking. Proc. EMNLP.
Slattery, S., & Mitchell, T. (2000). Discovering test set regularities in relational domains.
Proc. ICML00 (pp. 895–902).
Smith, T., & Waterman, M. (1981). Identiﬁcation of common molecular subsequences. J.
Mol. Biol., 147, 195–197.
Sutton, C., Rohanimanesh, K., & McCallum, A. (2004). Dynamic conditional random
ﬁelds: Factorized probabilistic models for labeling and segmenting sequence data. Proc.
ICML.
Szummer, M., & Jaakkola, T. (2001). Partially labeled classiﬁcation with Markov random
walks. Proc. NIPS.
Tarantola, A. (1987). Inverse problem theory: Methods for data ﬁtting and model parame
ter estimation. Amsterdam: Elsevier.
Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminative probabilistic models for relational data. UAI.
Taskar, B., Chatalbashev, V., & Koller, D. (2004a). Learning associative Markov networks. Proc. ICML.
Taskar, B., Guestrin, C., & Koller, D. (2003a). Max-margin Markov networks. NIPS.
Taskar, B., Klein, D., Collins, M., Koller, D., & Manning, C. (2004b). Max-margin parsing. EMNLP.
Taskar, B., Segal, E., & Koller, D. (2001). Probabilistic classification and clustering in relational data. Proc. IJCAI-01 (pp. 870–876). Seattle, Wash.
Taskar, B., Wong, M., Abbeel, P., & Koller, D. (2003b). Link prediction in relational data. Proc. NIPS.
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. NAACL 3 (pp. 252–259).
Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. Twenty-first International Conference on Machine Learning.
Valiant, L. G. (1979). The complexity of computing the permanent. Theoretical Computer Science, 8, 189–201.
Vapnik, V. (1995). The nature of statistical learning theory. New York, New York: Springer-Verlag.
Varma, A., & Palsson, B. (1994). Metabolic flux balancing: Basic concepts, scientific and practical use. Bio/Technology, 12.
Vazquez, A., Flammini, A., Maritan, A., & Vespignani, A. (2003). Global protein function prediction from protein-protein interaction networks. Nature Biotechnology, 6.
Veksler, O. (1999). Efficient graph-based energy minimization methods in computer vision. Doctoral dissertation, Cornell University.
Vullo, A., & Frasconi, P. (2004). Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20, 653–659.
Wahba, G., Gu, C., Wang, Y., & Chappell, R. (1993). Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance. Computational Learning Theory and Natural Learning Systems.
Wainwright, M., Jaakkola, T., & Willsky, A. (2002). MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches. Proc. Allerton Conference on Communication, Control and Computing.
Wainwright, M., & Jordan, M. I. (2003). Variational inference in graphical models: The view from the marginal polytope. Proc. Allerton Conference on Communication, Control and Computing.
Weston, J., & Watkins, C. (1998). Multi-class support vector machines (Technical Report CSD-TR-98-04). Department of Computer Science, Royal Holloway, University of London.
Xing, E., Ng, A., Jordan, M., & Russell, S. (2002). Distance metric learning, with application to clustering with side-information. NIPS.
Yang, Y., Slattery, S., & Ghani, R. (2002). A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2).
Yedidia, J., Freeman, W., & Weiss, Y. (2000). Generalized belief propagation. NIPS.
Younger, D. H. (1967). Recognition and parsing of context-free languages in time n^3. Information and Control, 10, 189–208.
Zhang, J., & Ma, Z. (1996). A network flow method for solving inverse combinatorial optimization problems. Optimization, 37, 59–72.
Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2, 527–550.
Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. Proc. NIPS.
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. Proc. ICML.
2
CHAPTER 1. INTRODUCTION
is to devise efﬁcient algorithms for training computers to automatically acquire effective and accurate models from experience. In this thesis, we present a discriminative learning framework and a novel family of efﬁcient models and algorithms for complex recognition tasks in several disciplines, including natural language processing, computer vision and computational biology. We develop theoretical foundations for our approach and show a wide range of experimental applications, including handwriting recognition, 3dimensional terrain classiﬁcation, disulﬁde connectivity in protein structure prediction, hypertext categorization, natural language parsing, email organization and image segmentation.
1.1 Supervised learning
The most basic supervised learning task is classiﬁcation. Suppose we wish to learn to recognize a handwritten character from a scanned image. This is a classiﬁcation task, because we must assigns a class (an English letter from ‘a’ through ‘z’) to an observation of an object (an image). Essentially, a classiﬁer is a function that maps an input (an image) to an output (a letter). In the supervised learning setting, we construct a classiﬁer by observing labeled training examples, in our case, sample images paired with appropriate letters. The main problem addressed by supervised learning is generalization. The learning program is allowed to observe only a small sample of labeled images to produce an accurate classiﬁer on unseen images of letters. More formally, let x denote an input. For example, a blackandwhite image x can be represented as a vector of pixel intensities. We use X to denote the space of all possible inputs. Let y denote the output, and Y be the discrete space of possible outcomes (e.g., 26 letters ‘a’‘z’). A classiﬁer (or hypothesis) h is a function from X to Y, h : X → Y. We denote the set of all classiﬁers that our learning program can produce as H (hypothesis class). Then given a set of labeled examples {x(i) , y (i) }, i = 1, . . . , m, a learning program seeks to produce a classiﬁer h ∈ H that will work well on unseen examples x, usually by ﬁnding h that accurately classiﬁes training data. The diagram in Fig. 1.1 summarizes the supervised learning setting.
1.1. SUPERVISED LEARNING
3
Hypotheses
Labeled data
Learning Prediction New data
Figure 1.1: Supervised learning setting
The problem of classiﬁcation has a long history and highly developed theory and practice (see for example, Mitchell [1997]; Vapnik [1995]; Duda et al. [2000]; Hastie et al. [2001]). The two most important dimensions of variation of classiﬁcation algorithms is the hypothesis class H and the criterion for selection of a hypothesis h from H given the training data. In this thesis, we build upon the generalized linear model family, which underlies standard classiﬁers such as logistic regression and support vector machines. Through the use of kernels to implicitly deﬁne highdimensional and even inﬁnitedimensional input representations, generalized linear models can approximate arbitrarily complex decision boundaries. The task of selecting a hypothesis h reduces to estimating model parameters. Broadly speaking, probabilistic estimation methods associate a joint distribution p(x, y) or conditional distribution p(y  x) with h and select a model based on the likelihood of the data [Hastie et al., 2001]. Joint distribution models are often called generative, while conditional models are called discriminative. Large margin methods, by contrast, select a model based on a more direct measure of conﬁdence of its predictions on the training data called the margin [Vapnik, 1995]. The difference between these two methods is one of the key
Figure 1.2: Examples of complex prediction problems (inputs on top, outputs on bottom): (a) handwriting recognition [image → word]; (b) natural language parsing [sentence → parse tree]; (c) disulfide bond prediction in proteins [amino-acid sequence → bond structure (shown in yellow)]; (d) terrain segmentation [3D image → segmented objects (trees, bushes, buildings, ground)]. The inputs pictured include the handwritten word "brace," the sentence "The screen was a sea of red," and the amino-acid sequence RSCCPCYWGGCPW GQNCYPEGCSGPKV.
themes in this thesis. Most of the research has focused on analysis and classification algorithms for the case of binary outcomes, |Y| = 2, or a small number of classes. In this work, we focus on prediction tasks that involve not a single decision with a small set of outcomes, but a complex, interrelated collection of decisions.
1.2 Complex prediction problems
Consider once more the problem of character recognition. In fact, a more natural and useful task is recognizing words and entire sentences. Fig. 1.2(a) shows an example handwritten word "brace." Distinguishing between the second letter and fourth letter ('r' and 'c') in isolation is actually far from trivial, but in the context of the surrounding letters that together form a word, this task is much less error-prone for humans and should be for computers as well. It is also more complicated, as different decisions must be weighed against each other to arrive at a globally satisfactory prediction. The space of all possible outcomes
Y is immense, usually exponential in the number of individual decisions; for example, the number of 5-letter sequences is 26^5. However, most of these outcomes are unlikely given the observed input. By capturing the most salient structure of the problem, for example the strong local correlations between consecutive letters, we will construct compact models that efficiently deal with this complexity. Below we list several examples from different fields.

• Natural language processing
Vast amounts of electronically available text have spurred a tremendous amount of research into automatic analysis and processing of natural language. We mention some of the lower-level tasks that have received a lot of recent attention [Charniak, 1993; Manning & Schütze, 1999]. Part-of-speech tagging involves assigning each word in a sentence a part-of-speech tag, such as noun, verb, pronoun, etc. As with handwriting recognition, capturing the sequential structure of correlations between consecutive tags is key. In parsing, the goal is to recognize the recursive phrase structure of a sentence, such as verbal, noun and prepositional phrases and their nesting in relation to each other. Fig. 1.2(b) shows a parse tree corresponding to the sentence: "The screen was a sea of red" (more on this in Ch. 9). Many other problems, such as named-entity and relation extraction, text summarization, and translation, involve complex global decision making.

• Computational biology
The last two decades have yielded a wealth of high-throughput experimental data, including complete sequencing of many genomes, precise measurements of protein 3D structure, genome-wide assays of mRNA levels and protein-protein interactions. Major research has been devoted to gene finding, alignment of sequences, protein structure prediction, and molecular pathway discovery [Gusfield, 1997; Durbin et al., 1998]. Fig.
1.2(c) shows the disulfide bond structure (shown in yellow) that we would like to predict from the amino-acid sequence of the protein (more on this in Ch. 10).

• Computer vision
As digital cameras and optical scanners become commonplace accessories, medical imaging technology produces detailed physiological measurements, laser scanners
capture 3D environments, satellites and telescopes bring pictures of Earth and distant stars, etc., and we are flooded with images we would like our computer to analyze. Example tasks include object detection and segmentation, motion tracking, 3D reconstruction from stereo and video, and much more [Forsyth & Ponce, 2002]. Fig. 1.2(d) shows a 3D laser range data image of the Stanford campus collected by a roving robot, which we would like to segment into objects such as trees, bushes, buildings, and ground (more on this in Ch. 7). These problems involve labeling a set of related objects that exhibit local consistency.

1.3 Structured models

This wide range of problems has been tackled using various models and methods. Abstractly, a model assigns a score (or likelihood in probabilistic models) to each possible input/output pair (x, y), typically through a compact, parameterized scoring function. Although the term 'model' is often reserved for probabilistic models, we use the term very broadly, to include any scheme that assigns scores to the output space Y and has a procedure for finding the optimal scoring y. Inference in these models refers to computing the highest scoring output given the input and usually involves dynamic programming or combinatorial optimization. We focus on the models that compactly capture the correlation and constraint structure inherent to many tasks.

• Markov networks
Markov networks (a.k.a. Markov random fields) are extensively used to model complex sequential, spatial, and relational interactions in prediction problems arising in many fields. Such models are encoded by a graph, whose nodes represent the different object labels, and whose edges represent and quantify direct dependencies between them. Markov networks compactly represent complex joint distributions of the label variables by modeling their local interactions. The graphical structure of the model encodes the qualitative aspects of the distribution: direct dependencies as well as conditional independencies. The quantitative aspect of the model is defined by the potentials that are associated with the nodes and cliques of the graph. The graphical structure of the network (more precisely, the treewidth of the graph, which we formally define in Ch. 3) is critical to efficient inference and learning in the model.

• Context free grammars
Context-free grammars are one of the primary formalisms for capturing the recursive structure of syntactic constructions [Manning & Schütze, 1999]. A CFG consists of recursive productions (e.g., VP → VP PP, DT → The) that can be applied to derive a sentence of the language. The productions define the set of syntactically allowed phrase structures (derivations). In Fig. 1.2(b), the non-terminal symbols (labels of internal nodes) correspond to syntactic categories such as noun phrase (NP), verbal phrase (VP) or prepositional phrase (PP) and part-of-speech tags like nouns (NN), verbs (VBD), determiners (DT) and prepositions (IN). The terminal symbols (leaves) are the words of the sentence. By compactly defining a probability distribution over individual productions, probabilistic CFGs construct a distribution over parse trees and sentences, and the prediction task reduces to finding the most likely tree given the sentence. The context-free restriction allows efficient inference and learning in such models.

• Combinatorial structures
Many important computational tasks are formulated as combinatorial optimization problems such as the maximum weight bipartite and perfect matching, graph-cut, edge-cover, spanning tree, and many others [Lawler, 1976; Papadimitriou & Steiglitz, 1982; Cormen et al., 2001]. For example, the disulfide connectivity prediction in Fig. 1.2(c) can be modeled by maximum weight perfect matchings, where the weights define potential bond strength based on the local amino-acid sequence properties. The other combinatorial structures we consider and apply in this thesis include graph cuts and partitions, bipartite matchings, and spanning trees.

The standard methods of estimation for Markov networks and context free grammars are based on maximum likelihood, both joint and conditional. However, maximum likelihood estimation of scoring function parameters for combinatorial structures is often intractable because of the problem of defining a normalized distribution over an exponential set of combinatorial structures.
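To make the inference step described above concrete, the sketch below scores a letter sequence as a sum of per-position (node) terms and adjacent-pair (edge) terms — a chain-structured model — and finds the highest-scoring sequence by dynamic programming (the Viterbi algorithm) instead of enumerating all |alphabet|^L outputs. All scores and the tiny alphabet are made up for illustration:

```python
# Dynamic-programming inference in a chain model: the score of a sequence is
# the sum of node scores plus edge scores between adjacent labels; Viterbi
# keeps, for each label at each position, the best-scoring path ending there.

def viterbi(node_scores, edge_score):
    """node_scores: list over positions of {label: score};
    edge_score: function (prev_label, label) -> score.
    Returns (best_total_score, best_label_sequence)."""
    best = {y: (s, [y]) for y, s in node_scores[0].items()}
    for scores in node_scores[1:]:
        new_best = {}
        for y, s in scores.items():
            prev, (ps, path) = max(
                best.items(), key=lambda kv: kv[1][0] + edge_score(kv[0], y))
            new_best[y] = (ps + edge_score(prev, y) + s, path + [y])
        best = new_best
    return max(best.values())

# Hypothetical per-position scores for a 3-letter word over a tiny alphabet;
# the edge score rewards letter pairs that co-occur often.
nodes = [{'b': 2.0, 'h': 1.9}, {'r': 1.0, 'n': 1.1}, {'a': 1.0, 'o': 1.0}]
pairs = {('b', 'r'): 0.5, ('r', 'a'): 0.5}
edge = lambda u, v: pairs.get((u, v), 0.0)
score, word = viterbi(nodes, edge)
print(''.join(word), score)
```

Note how the pairwise terms let a globally coherent sequence beat locally tempting choices (here 'n' narrowly wins position 2 in isolation, but 'r' wins in context).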
1.4 Contributions

This thesis addresses the problem of efficient learning of high-accuracy models for complex prediction problems. We consider a very large class of structured models, from Markov networks to context free grammars to combinatorial graph structures such as matchings and cuts.

◦ Learning framework for structured models
We propose a general framework for efficient estimation of models for structured prediction. An alternative to likelihood-based methods, this framework builds upon the large margin estimation principle. Intuitively, we find parameters such that inference in the model (dynamic programming, combinatorial optimization) predicts the correct answers on the training data with maximum confidence. We develop general conditions under which exact large margin estimation is tractable and present two formulations for structured max-margin estimation that define compact convex optimization problems. The first formulation relies on the ability to express inference in the model as a compact convex optimization problem. The second one only requires compactly expressing optimality of a given assignment according to the model and applies to a broader range of combinatorial problems. These two formulations form the foundation which the rest of the thesis develops.

◦ Markov networks
The largest portion of the thesis is devoted to novel estimation algorithms, representational extensions, generalization analysis and experimental validation for Markov networks, a model class of choice in many structured prediction tasks in language, vision and biology. We focus on those models where exact inference is tractable, or can be efficiently approximated.

Low-treewidth Markov networks: We use graph decomposition to derive an exact, compact, convex learning formulation for Markov networks with sequence and other low-treewidth structure. The seamless integration of kernels with graphical models allows us to create very rich models that leverage the immense amount of research in kernel design and graphical model decompositions for efficient, accurate prediction in real-world tasks. We also use approximate graph decomposition to derive a compact approximate formulation for Markov networks in which inference is intractable.

Learning associative Markov networks (AMNs): We define an important subclass of Markov networks that captures positive correlations present in many domains. By constraining the class of Markov networks to AMNs, our models are learned efficiently and, at run-time, scale up to tens of millions of nodes and edges. We show that for AMNs over binary variables, our framework allows exact estimation of networks of arbitrary connectivity and topology, for which likelihood methods are believed to be intractable. For the non-binary case, we provide an approximation that works well in practice. We present an AMN-based method for object segmentation from 3D range data.

Generalization analysis: We analyze the theoretical generalization properties of max-margin estimation in Markov networks and derive a novel margin-based bound for structured prediction. This bound is the first to address structured error (e.g., proportion of mislabeled pixels in an image) and uses a proof that exploits the graphical model structure.

Scalable online algorithm: We present an efficient algorithm for solving the estimation problem called Structured SMO. Our online-style algorithm uses inference in the model and analytic updates to solve extremely large estimation problems.

Representation and learning of relational Markov networks: We introduce relational Markov networks (RMNs), which compactly define templates for Markov networks for domains with relational structure: objects, attributes, relations. The graphical structure of an RMN is based on the relational structure of the domain, and can easily model complex interaction patterns over related entities. We apply this class of models to classification of hypertext using hyperlink structure to define relations between webpages.

◦ Broader applications: parsing, matching, clustering
The other large portion of the thesis addresses a range of prediction tasks with very diverse models: context free grammars for natural language parsing, perfect matchings for disulfide connectivity in protein structure prediction, and graph partitions for clustering documents and segmenting images.

Learning to parse: We exploit context free grammar structure to derive a compact max-margin formulation and show high-accuracy parsing in cubic time by exploiting novel kinds of lexical information. The algorithm we propose uses kernels, which makes it possible to efficiently embed the features in very high-dimensional spaces and achieve state-of-the-art accuracy.

Learning to match: We use combinatorial properties of weighted matchings to develop an exact, efficient algorithm for learning to match. We apply our framework to prediction of disulfide connectivity in proteins using perfect non-bipartite matchings.

Learning to cluster: We derive a max-margin formulation for learning the affinity metric for clustering from clustered training data. In contrast to algorithms that learn a metric independently of the algorithm that will be used to cluster the data, we describe a formulation that tightly integrates metric learning with the clustering algorithm, tuning one to the other in a joint optimization. We use approximate inference in these complex models, in which exact inference is intractable, to derive an approximate learning formulation. We show experimental evidence of the model's improved performance over several baseline models. Experiments on synthetic and real-world data show the ability of the algorithm to learn an appropriate
clustering metric for a variety of desired clusterings, including email folder organization and image segmentation.

1.5 Thesis outline

Below is a summary of the rest of the chapters in the thesis:

Chapter 2. Supervised learning: We review basic definitions and the statistical framework for classification. We define hypothesis classes, loss functions, and risk. We consider generalized linear models, including logistic regression and support vector machines. We compare probabilistic models, generative and discriminative, and unnormalized models, and review estimation methods based on maximizing likelihood, conditional likelihood and margin. We describe the relationship between the dual estimation problems and kernels.

Chapter 3. Structured models: In this chapter, we define the abstract class of structured prediction problems and models addressed by the thesis. We describe representation and inference for Markov networks, including dynamic and linear programming inference. We also briefly describe context free grammars and combinatorial structures as models.

Chapter 4. Structured maximum margin estimation: This chapter outlines the main principles of maximum margin estimation for structured models. We address the exponential blow-up of the naive problem formulation by deriving two general equivalent convex formulations, min-max and certificate. These formulations allow us to exploit the decomposition and combinatorial structure of the prediction task. They lead to polynomial size programs for estimation of models where the prediction problem is tractable. We also discuss approximations, in particular using upper and lower bounds, for solving intractable or very large problems.

Chapter 5. Markov networks: We review maximum conditional likelihood estimation and present maximum margin estimation for Markov networks. We use graphical model decomposition to derive a convex, compact formulation that seamlessly integrates kernels with graphical models. We analyze the theoretical generalization
properties of max-margin estimation and derive a novel margin-based bound for structured classification.

Chapter 6. M3N algorithms and experiments: We present an efficient algorithm for solving the estimation problem in graphical models, called Structured SMO. Our online-style algorithm uses inference in the model and analytic updates to solve extremely large quadratic problems. We present experiments with handwriting recognition, where our models significantly outperform other approaches by effectively capturing correlation between adjacent letters and incorporating high-dimensional input representation via kernels.

Chapter 7. Associative Markov networks: We define an important subclass of Markov networks, associative Markov networks (AMNs), that captures positive interactions present in many domains. We show that for associative Markov networks over binary variables, max-margin estimation allows exact training of networks of arbitrary connectivity and topology, for which maximum likelihood methods are believed to be intractable. For the non-binary case, we provide an approximation that works well in practice. We present an AMN-based method for object segmentation from 3D range data that scales to very large prediction tasks involving tens of millions of points.

Chapter 8. Relational Markov networks: We introduce the framework of relational Markov networks (RMNs), which compactly defines templates for Markov networks in domains with rich structure modeled by objects, attributes and relations. As we show, the use of an undirected, discriminative graphical model avoids the difficulties of defining a coherent generative model for graph structures in directed models and allows us tremendous flexibility in representing complex patterns. We provide experimental results on a webpage classification task, showing that accuracy can be significantly improved by modeling relational dependencies.

Chapter 9. Context free grammars: We present max-margin estimation for natural language parsing based on the decomposition properties of context free grammars. We show
that this framework allows high-accuracy parsing in cubic time by exploiting novel kinds of lexical information. The algorithm we propose uses kernels, which makes it possible to efficiently embed the features in very high-dimensional spaces and achieve state-of-the-art accuracy.

Chapter 10. Perfect matchings: We apply our framework to learning to predict disulfide connectivity in proteins using perfect matchings. We use combinatorial properties of weighted matchings to develop an exact, efficient algorithm for learning the parameters of the model.

Chapter 11. Correlation clustering: In this chapter, we derive a max-margin formulation for learning affinity scores for correlation clustering from clustered training data. We formulate the approximate learning problem as a compact convex program with quadratic objective and linear or positive-semidefinite constraints. Experiments on synthetic and real-world data show the ability of the algorithm to learn an appropriate clustering metric for a variety of desired clusterings, including email folder organization and image segmentation. We show experimental evidence of the model's improved performance over several baseline models.

Chapter 12. Conclusions and future directions: We review the main contributions of the thesis and summarize their significance, applicability and limitations. We discuss extensions and future research directions not addressed in the thesis.

1.6 Previously published work

Some of the work described in this thesis has been published in conference proceedings. The min-max and certificate formulations for structured max-margin estimation have not been published in their general form outlined in Ch. 4, although they underlie several papers mentioned below. The polynomial formulation of maximum margin Markov networks presented in Ch. 5 was published for a less general case, using a dual decomposition technique [Taskar et al., 2003a]. Work on associative Markov networks (Ch. 7) was published with experiments on hypertext and newswire classification [Taskar et al., 2004a]. A paper on 3D object segmentation using AMNs, which presents experiments on terrain classification and other tasks, is currently under review (joint work with Drago Anguelov, Vassil Chatalbashev, Dinkar Gupta, Geremy Heitz, Daphne Koller and Andrew Ng). Taskar et al. [2002] and Taskar et al. [2003b] defined and applied the Relational Markov networks (Ch. 8), using maximum (conditional) likelihood estimation. Natural language parsing in Ch. 9 was published in Taskar et al. [2004b]. Disulfide connectivity prediction using perfect matchings in Ch. 10 (joint work with Vassil Chatalbashev and Daphne Koller) is currently under review. Finally, work on correlation clustering in Ch. 11, done jointly with Pieter Abbeel and Andrew Ng, has not been published.
Part I

Models and methods
Chapter 2

Supervised learning

In supervised learning, we seek a function h : X → Y that maps inputs x ∈ X to outputs y ∈ Y. The input space X is an arbitrary set (often X = IR^n), while the output space Y we consider in this chapter is discrete, Y = {y_1, ..., y_k}, where k is the number of classes. A supervised learning problem with discrete outputs is called classification. In handwritten character recognition, X is the set of images of letters and Y is the alphabet (see Fig. 2.1). The input to an algorithm is training data, a set of m i.i.d. (independent and identically distributed) samples S = {(x^(i), y^(i))}, i = 1, ..., m, drawn from a fixed but unknown distribution D over X × Y. The goal of a learning algorithm is to output a hypothesis h such that h(x) will approximate y on new samples from the distribution (x, y) ∼ D.

Learning algorithms can be distinguished along several dimensions; chief among them is the hypothesis class H of functions h the algorithm outputs. Numerous classes of functions have been well studied, including decision trees, nearest-neighbors, neural networks, generalized log-linear models and kernel methods (see Quinlan [2001]; Bishop [1995]; Duda et al. [2000]; Hastie et al. [2001] for in-depth discussion of these and many other models). We will concentrate on the last two classes, for several reasons we discuss below, including accuracy, efficiency, and extensibility to the more complex structured prediction tasks we will consider in the next chapter.

The second crucial dimension of a learning algorithm is the criterion for selection of h from H. We arrive at such a criterion by quantifying what it means for h(x) to approximate y, using a loss function ℓ : X × Y × Y → IR+, where ℓ(x, y, h(x)) measures the penalty for predicting h(x) on the sample (x, y). The risk functional R_D[h] measures the expected error of the approximation:

    R_D[h] = E_(x,y)∼D [ℓ(x, y, h(x))].    (2.1)

Since we do not generally know the distribution D, we estimate the risk of h using its empirical risk R_S, computed on the training sample S:

    R_S[h] = (1/m) Σ_{i=1..m} ℓ(x^(i), y^(i), h(x^(i))) = (1/m) Σ_i ℓ_i(h(x^(i))),    (2.2)

where we abbreviate ℓ(x^(i), y^(i), ·) as ℓ_i(·). R_S[h] is often called the training error or training loss. A common loss function for classification is the 0/1 loss,

    ℓ^(0/1)(x, y, h(x)) ≡ 1I(y ≠ h(x)),

where 1I denotes the indicator function: 1I(true) = 1 and 1I(false) = 0. For the 0/1 loss, R_S[h] is simply the proportion of training examples that h misclassifies.

However, simply selecting a hypothesis with the lowest empirical risk, h* = arg min_{h∈H} R_S[h], is generally not a good idea. If our set of hypotheses H is large enough, we will be able to find h that has zero or very small empirical risk. For example, if X = IR, Y = IR and H includes all polynomials of degree m − 1, we can always find a polynomial h that passes through all the sample points (x^(i), y^(i)), i = 1, ..., m (assuming that all the x^(i) are unique), so it will have zero empirical risk. This polynomial is very likely to overfit the training data: it has zero training error but high actual risk. The key to selecting a good hypothesis is to trade off the complexity of the class H (e.g., the degree of the polynomial) against the error on the training data as measured by the empirical risk R_S. In general, this fundamental balance is achieved by minimizing a weighted combination of the two criteria:

    h* = arg min_{h∈H} D[h] + C R_S[h],    (2.3)

where D[h] measures the inherent dimension or complexity of h, and C ≥ 0 is a trade-off parameter. The term D[h] is often called regularization. We will not go into the derivation of various complexity measures D[h] here, but simply adopt the standard measures as needed and refer the reader to Vapnik [1995]; Devroye et al. [1996]; Hastie et al. [2001] for details.

Depending on the complexity of the class H, the search for the optimal h* in (2.3) may be a daunting task. For many classes, for example decision trees and multilayer neural networks, it is intractable [Bishop, 1995; Quinlan, 2001], and we must resort to approximate, greedy optimization methods; moreover, minimizing the objective with the usual 0/1 training error is generally a very difficult problem with multiple local optima for most realistic H (see the discussion in the next section about approaches to dealing with the 0/1 loss). Hence, the search procedure used by the learning algorithm is crucial. Below, however, we will concentrate on models where the optimal h* can be found efficiently using convex optimization in polynomial time.

We consider hypothesis classes of the following parametric form:

    h_w(x) = arg max_{y∈Y} f(w, x, y),    (2.4)

where f : W × X × Y → IR is a function of a set of parameters w ∈ W, usually with W ⊆ IR^n. We assume that ties in the arg max are broken using some arbitrary but fixed rule. This class of hypotheses is very rich and includes many standard models. Hence, the learning algorithms we consider are completely characterized by the hypothesis class H, the loss function ℓ, and the regularization D[h]. The formulation in (2.4) of the hypothesis class in terms of an optimization procedure will become crucial to extending supervised learning techniques to cases where the output space Y is more complex.
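The definitions above translate directly into code. The sketch below computes the empirical risk of Eq. (2.2) under the 0/1 loss; the one-dimensional data and the threshold hypothesis are invented for illustration:

```python
# Empirical risk R_S[h] = (1/m) * sum of per-example losses; with the 0/1
# loss this is just the fraction of training examples that h misclassifies.

def empirical_risk(h, S, loss=lambda x, y, y_hat: int(y != y_hat)):
    return sum(loss(x, y, h(x)) for x, y in S) / len(S)

S = [(0.2, 'a'), (0.4, 'a'), (0.6, 'b'), (0.9, 'b'), (0.55, 'a')]
h = lambda x: 'a' if x < 0.5 else 'b'   # a simple threshold hypothesis
print(empirical_risk(h, S))             # misclassifies only x = 0.55
```

A richer hypothesis class could drive this quantity to zero on S, which — as the polynomial example above shows — is exactly when regularization is needed.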
2.1 Classification with generalized linear models

Figure 2.1: Handwritten character recognition: sample letters ('a' through 'e') from the Kassel [1995] data set.

For classification, we consider the generalized linear family of hypotheses H. Given n real-valued basis functions f_j : X × Y → IR, a hypothesis h_w ∈ H is defined by a set of n coefficients w_j ∈ IR such that:

    h_w(x) = arg max_{y∈Y} Σ_{j=1..n} w_j f_j(x, y) = arg max_{y∈Y} w⊤f(x, y).    (2.5)

Consider the character recognition example in Fig. 2.1. Our input x is a vector of pixel values of the image and the output y is a letter of the alphabet {a, ..., z}. We might have a basis function f_j(x, y) = 1I(x_{row,col} = on ∧ y = char) for each possible (row, col) and char ∈ Y, where x_{row,col} denotes the value of pixel (row, col). Intuitively, a zero or small weight w_j implies that the hypothesis h_w does not depend on the value of f_j(x, y) and hence is simpler than an h_w with a large weight w_j. Since different letters tend to have different pixels turned on, this very simple model captures enough information to perform reasonably well.

The regularization D[h_w] for the linear family is typically the norm of the parameters, ||w||_p for p = 1, 2. Minimizing the 0/1 risk directly is generally a very difficult problem with multiple local optima for any large class H. The standard solution is minimizing an upper bound on the 0/1 loss, ℓ(x, y, h(x)) ≥ ℓ^(0/1)(x, y, h(x)). (In addition to the computational advantages of this approach, there are statistical benefits of minimizing a convex upper bound [Bartlett et al., 2003].) The two primary classification methods we consider, logistic regression and support vector machines, differ primarily in their choice of the upper bound on the training 0/1 loss.
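A minimal sketch of the generalized linear hypothesis of Eq. (2.5), using the pixel-indicator basis functions described above. The 2x2 "images", the two-letter label set and the weights are all invented for illustration:

```python
# h_w(x) = argmax_y w'f(x, y), with one indicator basis function per
# (pixel, char) pair: f(x, y)[pixel, char] = 1 iff the pixel is on and y == char.

LABELS = ['a', 'b']
PIXELS = [(r, c) for r in range(2) for c in range(2)]

def f(x, y):
    """Joint feature vector indexed by (pixel, char) pairs."""
    return [int(x[r][c] == 1 and y == ch) for (r, c) in PIXELS for ch in LABELS]

def h(w, x):
    return max(LABELS, key=lambda y: sum(wj * fj for wj, fj in zip(w, f(x, y))))

# Weights rewarding pixel (0,0) for 'a' and pixel (1,1) for 'b'.
w = [2.0, 0.0,   # pixel (0,0): weight for 'a', weight for 'b'
     0.0, 0.0,   # pixel (0,1)
     0.0, 0.0,   # pixel (1,0)
     0.0, 2.0]   # pixel (1,1)
print(h(w, [[1, 0], [0, 0]]))  # top-left pixel on
print(h(w, [[0, 0], [0, 1]]))  # bottom-right pixel on
```

The learning problem of the following sections is precisely how to choose w from labeled data.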
2.2 Logistic regression

In logistic regression, we assign a probabilistic interpretation to the hypothesis h_w as defining a conditional distribution:

    P_w(y | x) = (1 / Z_w(x)) exp{w⊤f(x, y)},    (2.6)

where Z_w(x) = Σ_{y∈Y} exp{w⊤f(x, y)}. The optimal weights are selected by maximizing the conditional likelihood of the data (minimizing the log-loss) with some regularization; this approach is called (regularized) maximum likelihood estimation. Common choices for regularization are 1-norm or 2-norm regularization on the weights; we use the 2-norm below:

    min (1/2)||w||^2 + C Σ_i [log Z_w(x^(i)) − w⊤f(x^(i), y^(i))],    (2.7)

where C is a user-specified constant that determines the trade-off between regularization and likelihood of the data. The log-loss, log Z_w(x) − w⊤f(x, y), is an upper bound (up to a constant) on the 0/1 loss ℓ^(0/1) (see Fig. 2.2).

Figure 2.2: 0/1-loss upper bounded by log-loss and hinge-loss. The horizontal axis shows w⊤f(x, y) − max_{y'≠y} w⊤f(x, y'), where y is the correct label for x, while the vertical axis shows the value of the associated loss. The log-loss is shown up to an additive constant for illustration purposes.
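The quantities in Eqs. (2.6) and (2.7) are easy to compute once the scores w⊤f(x, y) are known. In the sketch below the scores are supplied directly as a dictionary, and all numbers are made up for illustration:

```python
import math

# P_w(y|x) is a softmax over the scores; the per-example log-loss is
# log Z_w(x) - w'f(x, y) for the correct label y.

def p_w(scores, y):
    """scores: {label: w'f(x, label)}."""
    Z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[y]) / Z

def log_loss(scores, y):
    log_Z = math.log(sum(math.exp(s) for s in scores.values()))
    return log_Z - scores[y]

scores = {'a': 2.0, 'b': 0.5, 'c': -1.0}
print(p_w(scores, 'a'), log_loss(scores, 'a'))
# The loss is positive (though small) even when the prediction is correct,
# consistent with the "upper bound up to a constant" remark in the text.
assert log_loss(scores, 'a') > 0
```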
2.3 Logistic dual and maximum entropy

The objective function is convex in the parameters w, so we have an unconstrained (differentiable) convex optimization problem. The gradient with respect to w is given by:

    w + C Σ_i (E_{i,w}[f(x^(i), y)] − f(x^(i), y^(i))) = w − C Σ_i E_{i,w}[∆f_i(y)],

where E_{i,w}[f(y)] = Σ_y f(y) P_w(y | x^(i)) is the expectation under the conditional distribution P_w(y | x^(i)) and ∆f_i(y) = f(x^(i), y^(i)) − f(x^(i), y). Ignoring the regularization term, the gradient is zero when the basis function expectations are equal to the basis functions evaluated on the labels y^(i), that is, Σ_i E_{i,w}[∆f_i(y)] = 0. It can be shown [Cover & Thomas, 1991] that the dual of the maximum likelihood problem (without regularization) is the maximum entropy problem:

    max  − Σ_{i,y} P_w(y | x^(i)) log P_w(y | x^(i))    (2.8)
    s.t.  Σ_i E_{i,w}[∆f_i(y)] = 0, ∀i.

We can interpret logistic regression as trying to match the empirical basis function expectations while maintaining a high-entropy conditional distribution P_w(y | x).

2.4 Support vector machines

Support vector machines [Vapnik, 1995] select the weights based on the "margin" of confidence of h_w: the margin on example i quantifies by how much the true label "wins" over the wrong ones:

    γ_i = (1/||w||) min_{y≠y^(i)} [w⊤f(x^(i), y^(i)) − w⊤f(x^(i), y)],
where ∆f_i(y) = f(x^(i), y^(i)) − f(x^(i), y). Maximizing the smallest such margin (and allowing for negative margins) is equivalent to solving the following quadratic program:

    min  (1/2)||w||^2 + C Σ_i ξ_i    (2.9)
    s.t.  w⊤∆f_i(y) ≥ ℓ_i^(0/1)(y) − ξ_i,  ∀i, ∀y ∈ Y.

Note that the slack variable ξ_i is constrained to be positive in the above program, since w⊤∆f_i(y^(i)) = 0 and ℓ_i^(0/1)(y^(i)) = 0. We can also express ξ_i as max_y [ℓ_i^(0/1)(y) − w⊤∆f_i(y)], and write the optimization problem Eq. (2.9) in a form similar to Eq. (2.7):

    min  (1/2)||w||^2 + C Σ_i max_y [ℓ_i^(0/1)(y) − w⊤∆f_i(y)].    (2.10)

The hinge-loss max_y [ℓ_i^(0/1)(y) − w⊤∆f_i(y)] is also an upper bound on the 0/1 loss ℓ^(0/1) (see Fig. 2.2).

2.5 SVM dual and kernels

The form of the dual of Eq. (2.9) is crucial to the efficient solution of SVMs and to the ability to use a high or even infinite dimensional set of basis functions via kernels. In the dual, the α_i(y) variables correspond to the constraints w⊤∆f_i(y) ≥ ℓ_i^(0/1)(y) − ξ_i in the primal Eq. (2.9):

    max  Σ_{i,y} α_i(y) ℓ_i^(0/1)(y) − (1/2) ||Σ_{i,y} α_i(y) ∆f_i(y)||^2    (2.11)
    s.t.  Σ_y α_i(y) = C, ∀i;  α_i(y) ≥ 0, ∀i, y.

The solution α* to the dual gives the solution to the primal as a weighted combination of basis functions of the examples:

    w* = Σ_{i,y} α_i*(y) ∆f_i(y).
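The per-example hinge loss above (the max over y of the 0/1 loss minus the margin term) can be computed directly from the scores w⊤f(x, y). The scores below are invented for illustration; note that the loss is zero when the true label beats every competitor by at least 1, and otherwise upper-bounds the 0/1 loss:

```python
# Multiclass hinge loss: max over y of [0/1-loss(y) - w'delta_f(y)], where
# w'delta_f(y) = score(y_true) - score(y). Scores stand in for w'f(x, y).

def hinge_loss(scores, y_true):
    return max(int(y != y_true) - (scores[y_true] - scores[y]) for y in scores)

# Confidently correct: the true label wins every other label by more than 1.
print(hinge_loss({'a': 3.0, 'b': 1.0, 'c': 0.0}, 'a'))
# Misclassified: the hinge loss exceeds the 0/1 loss of 1, as an upper
# bound must.
print(hinge_loss({'a': 0.0, 'b': 2.0, 'c': 0.0}, 'a'))
```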
2.5. SVM DUAL AND KERNELS 23

The pairings of examples and incorrect labels, (i, y), that have nonzero α_i*(y) are called support vectors. An important feature of the dual formulation is that the basis functions f appear only as dot products. Expanding the quadratic term, we have:

    || Σ_{i,y} α_i(y) Δf_i(y) ||² = Σ_{i,y} Σ_{j,ȳ} α_i(y) α_j(ȳ) Δf_i(y)⊤ Δf_j(ȳ).

Hence, we can solve Eq. (2.11) independently of the actual dimension of f, as long as the dot product f(x, y)⊤f(x̄, ȳ) can be computed efficiently. For example, we might have basis functions that are polynomial of degree d in terms of image pixels: f(x, y) = I(x_{row1,col1} = on ∧ ... ∧ x_{rowd,cold} = on ∧ y = char) for each possible set of pixels (row1, col1), ..., (rowd, cold) and each char ∈ Y. Computing this polynomial kernel can be done independently of the dimension d, even though the number of basis functions grows exponentially with d [Vapnik, 1995]. Note that at classification time, we also do not need to worry about the dimension of f, since:

    w*⊤ f(x, y) = Σ_{i,ȳ} α_i(ȳ) [ f(x^(i), y^(i))⊤ f(x, y) − f(x^(i), ȳ)⊤ f(x, y) ].

In fact, logistic regression can also be kernelized. However, the hinge loss formulation usually produces sparse solutions in terms of the number of support vectors, while solutions to the corresponding kernelized log-loss problem are generally non-sparse (all examples are support vectors) and require approximations for even relatively small datasets [Wahba et al., 1993; Zhu & Hastie, 2001].
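For binary pixel vectors, the degree-d polynomial kernel above reduces to a single dot product raised to the d-th power: (x⊤x̄)^d counts the ordered d-tuples of pixels that are "on" in both images, without enumerating the exponentially many conjunction features. A minimal sketch (our own illustration):

```python
import numpy as np

def poly_kernel(x1, x2, d):
    """Degree-d polynomial kernel over binary pixel vectors.

    Equals the number of ordered d-tuples of coordinates that are 1 in both
    x1 and x2, i.e. an implicit dot product over all degree-d conjunctions.
    """
    return float(np.dot(x1, x2)) ** d
```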
Chapter 3

Structured models

Consider once more the problem of character recognition. In fact, a more natural and useful task is recognizing words and entire sentences. Fig. 3.1 shows an example handwritten word "brace." Distinguishing between the second and fourth letters ('r' and 'c') in isolation is far from trivial, but in the context of the surrounding letters that together form a word, this task is much less error-prone for humans, and should be for computers as well.

Figure 3.1: Handwritten word recognition: sample from Kassel [1995] data set.

In this chapter, we consider prediction problems in which the output is not a single discrete value y, but a set of values y = (y1, ..., yL), for example an entire sequence of L characters. For concreteness, let the number of variables L be fixed. The output space Y ⊆ Y1 × ... × YL we consider is a subset of the product of output spaces of single variables. This joint output space is often a proper subset of the product of the singleton output spaces, Y ⊂ Y1 × ... × YL. In word recognition, each Yj is the alphabet, while Y is the dictionary; for example, we might restrict that the letter 'q' is never followed by 'z' in English. In addition to such "hard" constraints, the output variables are often highly correlated, e.g. consecutive letters in a word. We refer to joint spaces with constraints and correlations as structured. The structured models we consider in this chapter (and thesis) predict the outputs jointly, respecting the constraints and exploiting the correlations in the output space. We call problems with discrete output spaces structured classification or structured prediction.

The range of prediction problems these broad definitions encompass is immense, arising in fields as diverse as natural language analysis, machine vision, and computational biology. The class of structured models H we consider is essentially of the same form as in the previous chapter, except that y has been replaced by y:

    h_w(x) = arg max_{y : g(x,y) ≤ 0} w⊤f(x, y),   (3.1)

where as before f(x, y) is a vector of functions f : X × Y → IR^n. The output space Y = {y : g(x, y) ≤ 0} is defined using a vector of functions g(x, y) that define the constraints, where g : X × Y → IR^k. This formulation is very general. It includes, for example, probabilistic models like Markov networks (in certain cases) and context-free grammars; combinatorial optimization problems like min-cut and matching; and convex optimization such as linear, quadratic and semidefinite programming, to name a few. Clearly, for many f, g pairs, finding the optimal y is intractable. For the most part, we will restrict our attention to models where this optimization problem can be solved in polynomial time. In other cases, like intractable Markov networks (Ch. 8) and correlation clustering (Ch. 11), we use an approximate polynomial time optimization procedure.

3.1 Probabilistic models: generative and conditional

The term model is often reserved for probabilistic models, which can be subdivided into generative and conditional with respect to the prediction task. A generative model assigns a normalized joint density p(x, y) to the input and output space X × Y, with p(x, y) ≥ 0 and Σ_{x∈X} Σ_{y∈Y} p(x, y) = 1.
26 CHAPTER 3. STRUCTURED MODELS

A conditional model assigns a normalized density p(y | x) only over the output space Y, with p(y | x) ≥ 0 and Σ_{y∈Y} p(y | x) = 1 for all x ∈ X.

Probabilistic interpretation of the model offers well-understood semantics and an immense toolbox of methods for inference and learning. It also provides an intuitive measure of confidence in the predictions of a model in terms of conditional probabilities. In addition, generative models are typically structured to allow very efficient maximum likelihood learning. A very common class of generative models is the exponential family:

    p(x, y) ∝ exp{w⊤f(x, y)}.

For exponential families, the maximum likelihood parameters w with respect to the joint distribution can be computed in closed form using the empirical basis function expectations E_S[f(x, y)] [DeGroot, 1970; Hastie et al., 2001]. Of course, this efficiency comes at a price, however. Any model is an approximation to the true distribution underlying the data. A generative model must make simplifying assumptions (more precisely, independence assumptions) about the entire p(x, y), while a conditional model makes many fewer assumptions by focusing on p(y | x), which we use to make the predictions. Because of this, by optimizing the model to fit the joint distribution p(x, y), we may be tuning the approximation away from the optimal conditional distribution p(y | x). Given sufficient data, the conditional model will learn the best approximation to p(y | x) possible using w, while the generative model p(x, y) will not necessarily do so. In a regime with very few training samples (relative to the number of parameters w), however, generative models may actually outperform conditional models [Ng & Jordan, 2001]: generative models typically need fewer samples to converge to a good estimate of the joint distribution than conditional models need to accurately represent the conditional distribution.
3.2. PREDICTION MODELS: NORMALIZED AND UNNORMALIZED 27

3.2 Prediction models: normalized and unnormalized

Probabilistic semantics are certainly not necessary for a good predictive model if we are simply interested in the optimal prediction (the arg max in Eq. (3.1)). As we discussed in the previous chapter, support vector machines, which do not represent a conditional distribution but concentrate on the margin or decision boundary, typically perform as well as or better than logistic regression [Vapnik, 1995; Cristianini & Shawe-Taylor, 2000]. Even more importantly, in many cases we discuss below, while the optimal y can be found in polynomial time, normalizing the model (summing over the entire Y) is intractable. This fact makes standard maximum likelihood estimation infeasible. The learning methods we advocate in this thesis circumvent this problem by requiring only the maximization problem to be tractable. Essentially, we can often achieve higher accuracy models when we do not learn a normalized distribution over the outputs, but concentrate on the difference between the optimal y and the rest.

In this chapter, we use the term model very broadly, to include any scheme that assigns scores to the output space Y and has a procedure for finding the optimal scoring y. We still heavily rely on the representation and inference tools familiar from probabilistic models for the construction of and prediction in unnormalized models, but largely dispense with the probabilistic interpretation when needed.

3.3 Markov networks

Markov networks provide a framework for a rich family of models for both discrete and continuous prediction [Pearl, 1988; Cowell et al., 1999]. In this chapter, we review basic concepts in probabilistic graphical models called Markov networks or Markov random fields. The models treat the inputs and outputs as random variables X with domain X and Y with domain Y, and compactly define a conditional density p(Y | X) or distribution P(Y | X) (we concentrate here on conditional Markov networks, or CRFs [Lafferty et al., 2001]). We focus on discrete output spaces Y below, but many of the same representation and inference concepts translate to continuous domains. We also briefly touch upon examples of context-free grammars and combinatorial problems, which will be explained in greater detail in Part III, to illustrate the range of prediction problems we address.

28 CHAPTER 3. STRUCTURED MODELS

3.3.1 Representation

The structure of a Markov network is defined by an undirected graph G = (V, E), where the nodes are associated with variables V = {Y1, ..., YL}. Intuitively, the nodes are associated with output variables Yi and the edges correspond to direct dependencies or correlations. The advantage of a graphical framework is that it can exploit sparseness in the correlations between the outputs Y. The graphical structure of the model encodes the qualitative aspects of the distribution: direct dependencies as well as conditional independencies. The quantitative aspect of the model is defined by the potentials that are associated with the nodes and cliques of the graph.

Before a formal definition, consider a first-order Markov chain as a model for the word recognition task. Intuitively, adjacent letters in a word are highly correlated, but the first-order model makes the assertion (which is certainly an approximation) that once the value of a letter Yj is known, the correlation between a letter Yb before j and a letter Ya after j is negligible. More precisely, for b < j < a, we use a model where P(Ya | Yj, Yb, x) = P(Ya | Yj, x) and P(Yb | Yj, Ya, x) = P(Yb | Yj, x). In the chain network in Fig. 3.2, the model encodes that Yj is conditionally independent of Y1, ..., Yj−2 given Yj−1. For the purposes of finding the most likely y, this conditional independence property means that the optimization problem is decomposable: given that Yj = yj, it suffices to separately find the optimal subsequence from 1 to j ending with yj, and the optimal subsequence from j to L starting with yj.

A clique is a set of nodes c ⊆ V that form a fully connected subgraph (every two nodes are connected by an edge). Note that each subclique of a clique is also a clique, and we consider each node a singleton clique. In the chain network in Fig. 3.2, the cliques are simply the nodes and the edges: C(G) = {{Y1}, ..., {Y5}, {Y1, Y2}, ..., {Y4, Y5}}. We denote the set of variables in a clique c as Yc, an assignment of variables in the clique as yc, and the space of all assignments to the clique as Yc. We do not explicitly represent the inputs X in the figure; no assumption is made about X.
3.3. MARKOV NETWORKS 29

Figure 3.2: First-order Markov chain: φi(Yi) are node potentials, φi,i+1(Yi, Yi+1) are edge potentials (dependence on x is not shown).

Definition 3.3.1 A Markov network is defined by an undirected graph G = (V, E) and a set of potentials Φ = {φc}. The nodes are associated with variables V = {Y1, ..., YL}. Each clique c ∈ C(G) is associated with a potential φc(x, yc), with φc : X × Yc → IR+, which specifies a non-negative value for each assignment yc to the variables in Yc and any input x. The Markov network (G, Φ) defines a conditional distribution:

    P(y | x) = (1/Z(x)) ∏_{c∈C(G)} φc(x, yc),

where C(G) is the set of all the cliques of the graph and Z(x) is the partition function given by Z(x) = Σ_{y∈Y} ∏_{c∈C(G)} φc(x, yc).

Potentials do not have a local probabilistic interpretation, but can be thought of as defining an unnormalized score for each assignment in the clique. In our example Fig. 3.2, we have node and edge potentials. Intuitively, the node potentials quantify the correlation between the input x and the value of the node, while the edge potentials quantify the correlation between the pair of adjacent output variables as well as the input x. Conditioned on the image input, appropriate node potentials in our network should give high scores to the correct letters ('b', 'r', 'a', 'c', 'e'), though perhaps there would be some ambiguity with the second and fourth letter. For simplicity, assume that the edge potentials do not depend on the images, but simply give high scores to pairs of letters that tend to appear often consecutively. Multiplied together, these scores should favor the correct output "brace".

In general, a Markov network is a generalized log-linear model, since the potentials φc(xc, yc) can be represented (in log-space) as a sum of basis functions over x, yc:

    φc(xc, yc) = exp( Σ_{k=1}^{nc} w_{c,k} f_{c,k}(x, yc) ) = exp( w_c⊤ f_c(x, yc) ),

where nc is the number of basis functions for the clique c. Hence the log of the conditional probability is given by:

    log P(y | x) = Σ_{c∈C(G)} w_c⊤ f_c(x, yc) − log Z_w(x).

Usually, we condition a clique only on a portion of the input x, which we denote as xc. In the case of node potentials for word recognition, we could use the same basis functions as for individual character recognition: f_{j,k}(x, yj) = I(x_{j,row,col} = on ∧ yj = char) for each possible (row, col) in xj, the window of the image that corresponds to letter j, and each char ∈ Yj (we assume the input has been segmented into images xj that correspond to letters). For the edge potentials, we can define basis functions for each combination of letters (assume for simplicity no dependence on x): f_{j,j+1,k}(x, yj, yj+1) = I(yj = char1 ∧ yj+1 = char2) for each char1 ∈ Yj and char2 ∈ Yj+1.

In this problem (as well as many others), we are likely to "tie" or "share" the parameters of the model wc across cliques, no matter in what position they appear in the sequence:¹ all single-node potentials would share the same weights and basis functions (albeit the relevant portion of the input xc is different), and similarly for the pairwise cliques. With slight abuse of notation, we stack all basis functions into one vector f, so when c is a node, the edge functions in f(xc, yc) are defined to evaluate to zero; similarly, when c is an edge, the node functions in f(xc, yc) are also defined to evaluate to zero.

¹Sometimes we might actually want some dependence on the position in the sequence, which can be accomplished by adding more basis functions that condition on the position of the clique.
3.3. MARKOV NETWORKS 31

We stack the weights in the corresponding manner. Now we can write:

    f(x, y) = Σ_{c∈C(G)} f(xc, yc),

so the most likely assignment according to the model is given by:

    arg max_y log P_w(y | x) = arg max_y w⊤f(x, y),

in the same form as Eq. (3.1).

3.3.2 Inference

There are several important questions that can be answered by probabilistic models. The task of finding the most likely assignment, known as maximum a posteriori (MAP) or most probable explanation (MPE) inference, is just one of such questions, but the most relevant to our discussion. The Viterbi dynamic programming algorithm solves this problem for chain networks in O(L) time. Let the highest score of any subsequence from 1 to k > 1 ending with value yk be defined as:

    φ*_k(yk) = max_{y_{1..k−1}} ∏_{j≤k} φ_j(x, y_j) ∏_{1<j≤k} φ_{j−1,j}(x, y_{j−1}, y_j).

The algorithm computes the highest scores recursively:

    φ*_1(y1) = φ1(x, y1),   ∀y1 ∈ Y1;
    φ*_k(yk) = φk(x, yk) · max_{y_{k−1}∈Y_{k−1}} φ*_{k−1}(y_{k−1}) φ_{k−1,k}(x, y_{k−1}, yk),   ∀yk ∈ Yk, 1 < k ≤ L.

The highest scoring sequence has score max_{yL} φ*_L(yL). Using the arg max's of the max's in the computation of φ*, we can backtrace the highest scoring sequence itself. We assume that score ties are broken in a predetermined way, say according to some lexicographic order of the symbols.
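The Viterbi recursion for chain networks can be sketched as follows (our own illustration; potentials are passed as arrays in log-space, so products become sums, and ties are broken by argmax toward the lowest index, a fixed predetermined order):

```python
import numpy as np

def viterbi(node_lp, edge_lp):
    """MAP assignment for a chain Markov network.

    node_lp[k] is a length-|Y_k| array of log node potentials log phi_k(x, y_k);
    edge_lp[k] is a |Y_k| x |Y_{k+1}| array of log edge potentials.
    Returns (best_log_score, best_assignment).
    """
    L = len(node_lp)
    score = node_lp[0].copy()                 # phi*_1 in log-space
    back = []
    for k in range(1, L):
        # cand[i, j] = phi*_{k-1}(i) + log phi_{k-1,k}(i, j)
        cand = score[:, None] + edge_lp[k - 1]
        back.append(cand.argmax(axis=0))
        score = node_lp[k] + cand.max(axis=0)
    y = [int(score.argmax())]
    for bp in reversed(back):                 # backtrace the arg max's
        y.append(int(bp[y[-1]]))
    y.reverse()
    return float(score.max()), y
```

The runtime is O(L·|Y|²) for alphabet size |Y|, matching the O(L) dependence on sequence length in the text.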
32 CHAPTER 3. STRUCTURED MODELS

Figure 3.3: Diamond Markov network (added triangulation edge is dashed).

In general Markov networks, MAP inference is NP-hard [Cowell et al., 1999]. However, there are several important subclasses of networks that allow polynomial time inference. The most important of these is the class of networks with low treewidth. We need the concept of triangulation (or chordality) to formally define treewidth. Recall that a cycle of length l in an undirected graph G is a sequence of nodes (v0, v1, ..., vl), distinct except that v0 = vl, which are connected by edges (vi, vi+1) ∈ G. A chord of this cycle is an edge (vi, vj) ∈ G between non-consecutive nodes.

Definition 3.3.2 (Triangulated graph) An undirected graph G is triangulated if every one of its cycles of length ≥ 4 possesses a chord.

Singly-connected graphs, like chains and trees, are triangulated since they contain no cycles. The simplest untriangulated network is the diamond in Fig. 3.3. To triangulate it, we can add the edge (Y1, Y3) or (Y2, Y4). In general, there are many possible sets of edges that can be added to triangulate a graph. The critical property of a triangulation for the inference procedure is the size of the largest clique.

Definition 3.3.3 (Treewidth of a graph) The treewidth of a triangulated graph G is the size of its largest clique minus 1. The treewidth of an untriangulated graph G is the minimum treewidth over all triangulations of G.

The inference procedure creates a tree of cliques using the graph augmented by triangulation.
3.3. MARKOV NETWORKS 33

The treewidth of a chain or a tree is 1, and the treewidth of Fig. 3.3 is 2. Finding the minimum treewidth triangulation of a general graph is NP-hard, but good heuristic algorithms exist [Cowell et al., 1999]. The inference procedure is based on a data structure called a junction tree that can be constructed for a triangulated graph. The junction tree is an alternative representation of the same distribution that allows simple dynamic programming inference similar to the Viterbi algorithm for chains.

Definition 3.3.4 (Junction tree) A junction tree T = (V, E) for a triangulated graph G is a tree in which the nodes are a subset of the cliques of the graph, V ⊆ C(G), and the edges E satisfy the running intersection property: for any two cliques c and c', the variables in the intersection c ∩ c' are contained in the clique of every node of the tree on the (unique) path between c and c'.

Fig. 3.4 shows a junction tree for the diamond network. Algorithms for constructing junction trees from triangulated graphs are described in detail in Cowell et al. [1999]. Each of the original clique potentials must be associated with exactly one node in the junction tree. For example, the potentials for the {Y1, Y3, Y4} and {Y1, Y2, Y3} nodes are the products of the associated clique potentials:

    φ134(Y1, Y3, Y4) = φ1(Y1) φ4(Y4) φ14(Y1, Y4) φ34(Y3, Y4);
    φ123(Y1, Y2, Y3) = φ2(Y2) φ3(Y3) φ12(Y1, Y2) φ23(Y2, Y3).

The Viterbi algorithm for junction trees picks an arbitrary root r for the tree T and proceeds recursively from the leaves to compute the highest scoring subtree at a node by combining the subtrees with highest score from its children. We denote the leaves of the tree as Lv(T) and the children of node c (relative to the root r) as Chr(c):

    φ*_l(yl) = φl(x, yl),   ∀l ∈ Lv(T), ∀yl ∈ Yl;
    φ*_c(yc) = φc(x, yc) ∏_{c'∈Chr(c)} max_{yc' ∼ yc} φ*_{c'}(yc'),   ∀c ∈ V(T) \ Lv(T), ∀yc ∈ Yc,

where yc' ∼ yc denotes that the partial assignment yc' is consistent with the partial assignment yc on the variables in the intersection of c and c'.

34 CHAPTER 3. STRUCTURED MODELS

Figure 3.4: Diamond network junction tree.

The highest score is given by max_{yr} φ*_r(yr). Using the arg max's of the max's in the computation of φ*, we can backtrace the highest scoring assignment itself. Note that this algorithm is exponential in the treewidth, the size of the largest clique. Similar computations using the junction tree can be used to compute the partition function Z_w(x) (by simply replacing max with Σ) as well as the marginal probabilities P(yc | x) for the cliques of the graph [Cowell et al., 1999].

3.3.3 Linear programming MAP inference

In this section, we present an alternative inference method based on linear programming. Although solving MAP inference using a general LP solver is less efficient than the dynamic programming algorithms above, this formulation is crucial to viewing Markov networks in a unified framework of the structured models we consider, and to our development of common estimation methods in later chapters.

Let us begin with a linear integer program to compute the optimal assignment y. We represent an assignment as a set of binary variables µc(yc), one for each clique c and each value of the clique yc, that denote whether the assignment has that value, such that:

    log ∏_c φc(x, yc) = Σ_{c,yc} µc(yc) log φc(x, yc).
3.3. MARKOV NETWORKS 35

Figure 3.5: Example of marginal agreement: row sums of µ12(y1, y2) agree with µ1(y1), column sums agree with µ2(y2).

We call these variables marginals, as they correspond to the marginals of a distribution that has all of its mass centered on the MAP instantiation (assuming it is unique). For example, in case the network is a chain or a tree, we will have node and edge marginals that sum to 1 and agree with each other as in Fig. 3.5. There are several elementary constraints that such marginals satisfy. First, they must sum to one for each clique. Second, the marginals for cliques that share variables must be consistent: for any clique c ∈ C and a subclique s ⊂ c, the marginal µs(ys) of the subclique must be consistent with the marginal µc(yc) of the clique. Together, these constraints define a linear integer program:

    max   Σ_{c,yc} µc(yc) log φc(x, yc)
    s.t.  µc(yc) ∈ {0, 1},  ∀c ∈ C, ∀yc;
          Σ_{yc} µc(yc) = 1,  ∀c ∈ C;
          µs(ys) = Σ_{yc ∼ ys} µc(yc),  ∀c ∈ C, ∀s ⊂ c, ∀ys.   (3.2)

Clearly, for any assignment y', we can define µc(yc) variables that satisfy the above constraints by setting µc(yc) = I(yc = y'c). We can also show that the converse is true: any valid setting of µc(yc) corresponds to a valid assignment y.
36 CHAPTER 3. STRUCTURED MODELS

Lemma 3.3.5 For a triangulated network with a unique MAP assignment, the integrality constraint in the integer program in Eq. (3.2) can be relaxed and the resulting LP is guaranteed to have integer solutions.

A proof of this lemma appears in Wainwright et al. [2002]. Intuitively, in a triangulated graph, the constraints force the marginals µc(yc) to correspond to some valid joint distribution over the assignments. The optimal distribution with respect to the objective puts all its mass on the MAP assignment, so the value of the LP is the same as the value of the integer program. If the MAP assignment is not unique, any convex combination of the MAP assignments maximizes the LP.

In case the network is not triangulated, the set of marginals is not guaranteed to represent a valid distribution. Consider, for example, the diamond network in Fig. 3.3 with binary variables, with the following edge marginals, which are consistent with the constraints:

    µ12(0, 0) = µ12(1, 1) = 0.5,   µ12(0, 1) = µ12(1, 0) = 0;
    µ23(0, 0) = µ23(1, 1) = 0.5,   µ23(0, 1) = µ23(1, 0) = 0;
    µ34(0, 0) = µ34(1, 1) = 0.5,   µ34(0, 1) = µ34(1, 0) = 0;
    µ14(0, 1) = µ14(1, 0) = 0.5,   µ14(0, 0) = µ14(1, 1) = 0.

The corresponding node marginals must all be set to 0.5. Note that the edge marginals for (1, 2), (2, 3), (3, 4) disallow any assignment other than 0000 or 1111, but the edge marginal for (1, 4) disallows any assignment that has Y1 = Y4. Hence this set of marginals disallows all assignments. If we triangulate the graph and add the cliques {Y1, Y2, Y3} and {Y1, Y3, Y4} with their corresponding constraints, the above marginals will be disallowed.

We can resort to the above LP without triangulation as an approximate inference procedure (augmented with some procedure for rounding possibly fractional solutions). In graphs where triangulation produces very large cliques, exact inference is intractable. In Ch. 7, we discuss another subclass of networks where MAP inference using LPs is tractable for any network topology, but with a restricted type of potentials.
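The fractional diamond example can be checked mechanically: the edge marginals below satisfy every local constraint of Eq. (3.2) for the untriangulated graph (normalization plus agreement of implied node marginals), yet no joint assignment receives nonzero mass under all four edges. A small sketch, entirely our own illustration:

```python
import numpy as np
from itertools import product

# Edge marginals for the binary diamond with edges (Y1,Y2),(Y2,Y3),(Y3,Y4),(Y1,Y4).
eq = np.array([[0.5, 0.0], [0.0, 0.5]])   # "must agree" edges
ne = np.array([[0.0, 0.5], [0.5, 0.0]])   # "must disagree" edge
edges = {(0, 1): eq, (1, 2): eq, (2, 3): eq, (0, 3): ne}

def locally_consistent(edges, n_vars=4):
    """Relaxed constraints of Eq. (3.2): each edge marginal sums to 1,
    and every incident edge implies the same node marginal."""
    if any(not np.isclose(m.sum(), 1.0) for m in edges.values()):
        return False
    for v in range(n_vars):
        margs = [m.sum(axis=1) for (i, j), m in edges.items() if v == i]
        margs += [m.sum(axis=0) for (i, j), m in edges.items() if v == j]
        if any(not np.allclose(margs[0], m) for m in margs[1:]):
            return False
    return True

def supports_some_assignment(edges, n_vars=4):
    """Does any joint assignment get nonzero mass under every edge?"""
    return any(all(edges[e][y[e[0]], y[e[1]]] > 0 for e in edges)
               for y in product([0, 1], repeat=n_vars))
```

Running both checks on `edges` gives locally consistent marginals that nevertheless correspond to no valid joint distribution, which is exactly why the relaxation can be fractional on untriangulated graphs.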
3.4. CONTEXT FREE GRAMMARS 37

Figure 3.6: Example parse tree from Penn Treebank [Marcus et al., 1993].

3.4 Context free grammars

Context-free grammars are one of the primary formalisms for capturing the recursive structure of syntactic constructions [Manning & Schütze, 1999]. Fig. 3.6 shows a parse tree for the sentence The screen was a sea of red. This tree is from the Penn Treebank [Marcus et al., 1993], a primary linguistic resource for expert-annotated English text. The non-terminal symbols (labels of internal nodes) correspond to syntactic categories such as noun phrase (NP), verbal phrase (VP) or prepositional phrase (PP), and part-of-speech tags like nouns (NN), verbs (VBD), determiners (DT) and prepositions (IN). The terminal symbols (leaves) are the words of the sentence. For clarity of presentation, we restrict our grammars to be in Chomsky normal form² (CNF), where all rules in the grammar are of the form A → B C and A → D, where A, B and C are non-terminal symbols, and D is a terminal symbol.

Definition 3.4.1 (CFG) A CFG G consists of:
◦ A set of non-terminal symbols, N
◦ A designated set of start symbols, NS ⊆ N
◦ A set of terminal symbols, T
◦ A set of productions, P = {PB, PU}, divided into binary productions, PB = {A → B C : A, B, C ∈ N}, and unary productions, PU = {A → D : A ∈ N, D ∈ T}

²Any CFG can be represented by another CFG in CNF that generates the same set of sentences.

38 CHAPTER 3. STRUCTURED MODELS

Consider a very simple grammar:
◦ N = {S, NP, VP, PP, DT, NN, VBD, IN}
◦ NS = {S}
◦ T = {The, the, cat, saw, dog, tree, from}
◦ PB = {S → NP VP, NP → DT NN, NP → NP PP, VP → VBD NP, VP → VP PP, PP → IN NP}
◦ PU = {DT → The, DT → the, NN → cat, NN → dog, NN → tree, VBD → saw, IN → from}

A grammar generates a sentence by starting with a symbol in NS and applying the productions in P to rewrite non-terminal symbols. For example, we can generate The cat saw the dog by starting with S → NP VP, rewriting the NP as NP → DT NN with DT → The and NN → cat, then rewriting the VP as VP → VBD NP with VBD → saw, again using NP → DT NN, but now with DT → the and NN → dog. We can represent such derivations using trees like the one in Fig. 3.6, or (more compactly) using bracketed expressions like the one below:

    [[TheDT catNN]NP [sawVBD [theDT dogNN]NP]VP]S.

The simple grammar above can generate sentences of arbitrary length, since it has several recursive productions. It can also generate the same sentence several ways. Consider the sentence The cat saw the dog from the tree. The likely analysis of the sentence is that the cat, sitting in the tree, saw the dog. An unlikely but possible alternative is that the cat actually saw the dog who lived near the tree or was tied to it in the past. Our grammar allows both interpretations, with the difference being in the analysis of the top-level VP:

    [[sawVBD [theDT dogNN]NP]VP [fromIN [theDT treeNN]NP]PP]VP   versus
    [sawVBD [[theDT dogNN]NP [fromIN [theDT treeNN]NP]PP]NP]VP.

This kind of ambiguity, called prepositional attachment, is very common in many realistic grammars. In general, there are exponentially many parse trees that produce a sentence of length l.

3.4. CONTEXT FREE GRAMMARS 39

A standard approach to resolving ambiguity is to use a PCFG to define a joint probability distribution over the space of parse trees Y and sentences X. The distribution P(x, y) is defined by assigning a probability to each production and making sure that the probabilities of all productions starting with each symbol sum to 1:

    Σ_{B,C : A→B C ∈ PB} P(A → B C) + Σ_{D : A→D ∈ PU} P(A → D) = 1,   ∀A ∈ N.

We also need to assign a probability to the different start symbols: P(A) for A ∈ NS, such that Σ_{A∈NS} P(A) = 1. The probability of a tree is simply the product of the probabilities of the productions used in the tree (times the probability of the starting symbol); hence the log-probability of a tree is a sum of the log-probabilities of its productions. Standard PCFG parsers use a Viterbi-style algorithm to compute arg max_y P(x, y), the most likely parse tree for a sentence x.

By letting our basis functions f(x, y) consist of the counts of the productions and w be their log-probabilities, we can cast a PCFG as a structured linear model (in log space):

    h_w(x) = arg max_{y : g(x,y) ≤ 0} w⊤f(x, y).

In Ch. 9, we will show how to represent a parse tree as an assignment of variables Y with appropriate constraints, to express PCFGs (and more generally weighted CFGs) in the form of Eq. (3.1), and describe the associated algorithm to compute the highest scoring parse tree y given a sentence x.
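The Viterbi-style algorithm for PCFGs in CNF is the CKY dynamic program: it fills a table of best log-probabilities for every span and non-terminal. A compact sketch (our own illustration; the grammar encoding is assumed, and for simplicity the start-symbol probability is folded into the S entry):

```python
import math
from collections import defaultdict

def cky(words, unary, binary, start="S"):
    """Score of the most likely parse. unary[(A, word)] and binary[(A, B, C)]
    hold log-probabilities of the productions A -> word and A -> B C.
    Returns -inf when no parse exists."""
    n = len(words)
    best = defaultdict(lambda: -math.inf)   # best[(i, j, A)] over span [i, j)
    for i, w in enumerate(words):
        for (A, ww), lp in unary.items():
            if ww == w:
                best[(i, i + 1, A)] = max(best[(i, i + 1, A)], lp)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):       # split point
                for (A, B, C), lp in binary.items():
                    s = lp + best[(i, k, B)] + best[(k, j, C)]
                    if s > best[(i, j, A)]:
                        best[(i, j, A)] = s
    return best[(0, n, start)]
```

With backpointers stored alongside `best`, the tree itself can be recovered, exactly as in the Viterbi backtrace for chains. The toy test below uses a degenerate PCFG in which each listed rule has probability 1 (log-probability 0), so a parseable sentence scores 0.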
40 CHAPTER 3. STRUCTURED MODELS

3.5 Combinatorial problems

Many important computational tasks are formulated as combinatorial optimization problems such as maximum weight bipartite and perfect matching, spanning tree, graph cut, edge cover, bin packing, and many others [Lawler, 1976; Papadimitriou & Steiglitz, 1982; Cormen et al., 2001]. These problems arise in applications such as resource allocation, job assignment, routing, scheduling, network design and many more. Treated abstractly, a combinatorial space of structures, together with a scoring scheme that assigns weights to candidate outputs, is a kind of a model. In some domains, the weights of the objective function in the optimization problem are simple and natural to define (for example, Euclidean distance or temporal latency), but in many others, constructing the weights is an important and labor-intensive design task. The quality of the solution found depends critically on the choice of weights that define the objective.

As a particularly simple and relevant example, consider modeling the task of assigning reviewers to papers as a maximum weight bipartite matching problem, where the weights represent the "expertise" of each reviewer for each paper. For each paper and reviewer, we have a weight qjk indicating the qualification level of reviewer j for evaluating paper k. A simple scheme could measure the "expertise" as the percent of word overlap between the reviewer's home page and the paper's abstract. Our objective is to find an assignment of reviewers to papers that maximizes the total weight. More specifically, suppose we would like to have R reviewers per paper, and that each reviewer be assigned at most P papers. We represent a matching with a set of binary variables yjk that take the value 1 if reviewer j is assigned to paper k, and 0 otherwise. The bipartite matching problem can be solved using a combinatorial algorithm or the following linear program:

    max   Σ_{j,k} µjk qjk
    s.t.  Σ_j µjk = R, ∀k;   Σ_k µjk ≤ P, ∀j;   0 ≤ µjk ≤ 1, ∀j, k.   (3.3)

This LP is guaranteed to produce integer solutions (as long as P and R are integers) for any weights q(y) [Nemhauser & Wolsey, 1999].
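For intuition, the matching itself can be computed by brute force on a tiny instance; real solvers use the LP above or cubic-time combinatorial algorithms such as the Hungarian method. The sketch below is entirely our own illustration, restricted to R = 1 reviewer per paper with reviewer capacity P:

```python
from itertools import product

def best_matching(q, P):
    """Max-weight assignment of one reviewer per paper (R = 1), each
    reviewer taking at most P papers. q[j][k] scores reviewer j on paper k.
    Brute force over reviewer choices: fine only for tiny instances."""
    n_rev, n_pap = len(q), len(q[0])
    best_w, best_y = float("-inf"), None
    for choice in product(range(n_rev), repeat=n_pap):
        if max(choice.count(j) for j in range(n_rev)) > P:
            continue                                  # capacity violated
        w = sum(q[j][k] for k, j in enumerate(choice))
        if w > best_w:
            best_w, best_y = w, choice
    return best_w, best_y
```

`choice[k]` plays the role of the variables yjk: it names the single reviewer j with yjk = 1 for paper k.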
3.5. COMBINATORIAL PROBLEMS 41

The maximum weight bipartite matching can be found in polynomial (cubic) time in the number of nodes in the graph using a combinatorial algorithm. However, even simply counting the number of matchings is #P-complete [Valiant, 1979; Garey & Johnson, 1979]. This fact makes a probabilistic interpretation of the model as a distribution over matchings intractable to compute, since it requires computing the normalization, which is essentially weighted counting (note that counting is easier than normalization). Hence, exact maximum likelihood estimation is intractable, just as it is for Markov networks for handwriting recognition. The space of bipartite matchings illustrates an important property of many structured spaces: the maximization problem arg max_{y∈Y} w⊤f(x, y) is easier than the normalization problem Σ_{y∈Y} exp{w⊤f(x, y)}.

As usual, we will represent the objective q(y) as a weighted combination of a set of basis functions, w⊤f(x, y). Let xjk denote the intersection of the set of words occurring in the web page of reviewer j and the abstract of paper k: webpage(j) ∩ abstract(k). We can define fd(x, y) = Σ_{jk} yjk I(wordd ∈ xjk), the number of times word d was in both the web page of a reviewer and the abstract of the paper that were matched in y. Then the score qjk is simply qjk = Σ_d wd I(wordd ∈ xjk). Rather than weighting all words equally, we would want to weight certain words much more (words that are relevant to the subject and infrequent). Constructing and tuning the weights for a problem is a difficult and time-consuming process. In the next chapter, we will show how to learn the parameters w in much the same way we learn the parameters w of a Markov network.
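These basis functions are trivial to compute from bags of words; the minimal sketch below (our own illustration, with toy vocabularies) builds qjk = Σ_d wd·I(wordd ∈ xjk) from per-word weights:

```python
def pair_scores(webpages, abstracts, w):
    """q[j][k] = sum of w[word] over words in webpage(j) ∩ abstract(k).

    webpages/abstracts are lists of word sets; w maps word -> weight,
    with unseen words getting weight 0.
    """
    return [[sum((w.get(d, 0.0) for d in wp & ab), 0.0) for ab in abstracts]
            for wp in webpages]
```

With w set uniformly this recovers plain word-overlap counts; learning w instead is exactly the estimation problem taken up in the next chapter.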
Chapter 4

Structured maximum margin estimation

In the previous chapter, we described several important types of structured models of the form:

    h_w(x) = arg max_{y : g(x,y) ≤ 0} w⊤f(x, y),        (4.1)

where we assume that the optimization problem max_{y : g(x,y) ≤ 0} w⊤f(x, y) can be solved or approximated by a compact convex optimization problem for some convex subset of parameters w ∈ W. A compact problem formulation is polynomial in the description length of the objective and the constraints. Given a sample S = {(x(i), y(i))}_{i=1}^m, we develop methods for finding parameters w such that:

    arg max_{y ∈ Y(i)} w⊤f(x(i), y) ≈ y(i), ∀i,

where Y(i) = {y : g(x(i), y) ≤ 0}. In this chapter, we describe at an abstract level two general approaches to structured estimation that we apply in the rest of the thesis. Both of these approaches define a convex optimization problem for finding such parameters w.

There are several reasons to derive compact convex formulations. First and foremost, we can find globally optimal parameters (with fixed precision) in polynomial time. Second, we can use standard optimization software to solve the problem; although special-purpose algorithms that exploit the structure of a particular problem are often much faster, the availability of off-the-shelf software is very important for quick development and testing of such models. Third, we can analyze the generalization performance of the framework without worrying about the actual algorithms used to carry out the optimization and the associated woes of intractable optimization problems: local minima, greedy and heuristic methods, etc.

Our framework applies not only to the standard models typically estimated by probabilistic methods, such as Markov networks and context-free grammars, but also to a wide range of "unconventional" predictive models where maximum likelihood estimation is intractable. Such models include graph cuts and weighted matchings. We provide exact maximum margin solutions for several of these problems (Ch. 7 and Ch. 10). In prediction problems where the maximization in Eq. (4.1) is intractable, we consider convex programs that provide only an upper or lower bound on the true solution. We discuss how to use these approximate solutions for approximate learning of parameters.

4.1 Max-margin estimation

As in univariate prediction, we measure the error of approximation using a loss function ℓ. In structured problems, where we are jointly predicting multiple variables, the loss is often not just the simple 0/1 loss or squared error. For structured classification, a natural loss function is a kind of Hamming distance between y(i) and h(x(i)): the number of variables predicted incorrectly. We will explore these and more general loss functions in the following chapters. For brevity, we abbreviate ℓ_i(h(x(i))) = ℓ(x(i), y(i), h(x(i))) and f_i(y) = f(x(i), y).

4.1.1 Min-max formulation

Throughout, we will adopt a hinge upper bound ℓ̃_i(h(x(i))) on the loss function for structured classification, inspired by the max-margin criterion:

    ℓ̃_i(h(x(i))) = max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)] − w⊤f_i(y(i)) ≥ ℓ_i(h(x(i))).

With this upper bound, the min-max formulation for the structured classification problem is analogous to the multiclass SVM formulation in Eq. (2.9) and Eq. (2.10):

    min  (1/2)||w||² + C Σ_i ξ_i
    s.t. w⊤f_i(y(i)) + ξ_i ≥ max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)], ∀i.        (4.2)

The slack variables ξ_i allow for violations of the constraints at a cost Cξ_i. The above formulation is a convex quadratic program in w, since max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)] is convex in w (a maximum of affine functions is a convex function). Note that we did not explicitly include the constraint that the parameters are in some legal convex set (w ∈ W, most often IR^n), but we assume this throughout this chapter.

Another way to express this problem is using Σ_i |Y(i)| linear constraints:

    min  (1/2)||w||² + C Σ_i ξ_i
    s.t. w⊤f_i(y(i)) + ξ_i ≥ w⊤f_i(y) + ℓ_i(y), ∀i, ∀y ∈ Y(i).        (4.3)

Assuming the ξ_i are all zero (say because C is very large), the constraints enforce w⊤f_i(y(i)) − w⊤f_i(y) ≥ ℓ_i(y). We can interpret (1/||w||) w⊤[f_i(y(i)) − f_i(y)] as the margin of y(i) over another y ∈ Y(i), scaled by the loss ℓ_i(y), so minimizing ||w|| maximizes the smallest such margin. This form reveals the "maximum margin" nature of the formulation. If the loss function is not uniform over all the mistakes y ≠ y(i), then the constraints make costly mistakes (those with high ℓ_i(y)) less likely. In Ch. 5 we analyze the effect of a non-uniform loss function (Hamming-distance-type loss) on generalization, and show a strong connection between the loss-scaled margin and the expected risk of the learned model.

Eq. (4.3) is a standard QP with linear constraints, but its exponential size is in general prohibitive: the number of constraints per example, |Y(i)|, is generally exponential in L_i, the number of variables in y(i). The problem with Eq. (4.2), on the other hand, is that the constraints have a very unwieldy form. We therefore transform Eq. (4.2) into a more manageable problem. The key to solving Eq. (4.2) efficiently is the loss-augmented inference

    max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)].        (4.4)

Even if max_{y ∈ Y(i)} w⊤f_i(y) can be solved in polynomial time using convex optimization, the form of the loss term ℓ_i(y) is crucial for the loss-augmented inference to remain tractable. The range of tractable losses will depend strongly on the problem itself (f and Y); even within the range of tractable losses, some are more efficiently computable than others. A large part of the development of structured estimation methods in the following chapters is identifying appropriate loss functions for the application and designing convex formulations for the loss-augmented inference.

Assume that we find such a formulation in terms of a set of variables µ_i, with a concave (in µ_i) objective f̄_i(w, µ_i), subject to convex constraints g_i(µ_i) ≤ 0:

    max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)] = max_{µ_i : g_i(µ_i) ≤ 0} f̄_i(w, µ_i).        (4.5)

We call such a formulation compact if the number of variables µ_i and constraints g_i(µ_i) is polynomial in L_i, the number of variables in y(i). Note that max_{µ_i : g_i(µ_i) ≤ 0} f̄_i(w, µ_i) must be convex in w, since Eq. (4.4) is. Likewise, we can assume that it is feasible and bounded if Eq. (4.4) is. In the next section, we develop a max-margin formulation that uses Lagrangian duality (see [Boyd & Vandenberghe, 2004] for an excellent review) to define a joint, compact convex problem for estimating the parameters w.

To make the symbols concrete, consider the example of the reviewer-assignment problem we discussed in the previous chapter: we would like a bipartite matching with R reviewers per paper and at most P papers per reviewer. Each training sample i consists of a matching of N_p papers and N_r reviewers from some previous year. Let y_jk indicate whether reviewer j is matched to paper k, and let x_jk denote the intersection of the set of words occurring in the web page of reviewer j and the abstract of paper k. We can define a basis function f_d(x_jk, y_jk) = y_jk 1I(word_d ∈ x_jk), which indicates whether word d appears in both the web page of the reviewer and the abstract of the paper that are matched in y.
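Returning to the hinge upper bound ℓ̃_i defined above: when the output space is tiny it can be computed by brute-force enumeration, which makes the definition concrete even though the whole point of this chapter is to avoid such enumeration for real structured spaces. The sketch below uses an invented two-bit toy problem; all names and data are illustrative only.

```python
import itertools
import numpy as np

def hinge_loss(w, feats, loss, y_true, label_space):
    """Structured hinge: max_y [w.f(y) + loss(y)] - w.f(y_true).

    feats maps a label vector y to its feature vector f(y); loss is the
    task loss (here Hamming distance to y_true).  Enumerates the label
    space, which is only viable for tiny problems.
    """
    best = max(np.dot(w, feats(y)) + loss(y) for y in label_space)
    return best - np.dot(w, feats(y_true))

# Toy problem: y is a pair of binary labels, f(y) = y itself.
y_true = (1, 0)
feats = lambda y: np.array(y, dtype=float)
hamming = lambda y: sum(a != b for a, b in zip(y, y_true))
space = list(itertools.product([0, 1], repeat=2))

w = np.array([2.0, -2.0])   # scores y_true above every y by at least loss(y)
val = hinge_loss(w, feats, hamming, y_true, space)   # hinge is 0 here
```

With these weights the margin of y_true over every other y exceeds its loss, so the hinge bound is zero; shrinking w (e.g. to [0.5, -0.5]) makes the bound positive, illustrating how the bound penalizes insufficient loss-scaled margins.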
We abbreviate the vector of all the basis functions for each edge jk as f_jk = f(x_jk, y_jk), which depends only on x(i). We assume that the loss function decomposes over the variables y_jk. For example, the Hamming loss simply counts the number of different edges in the matchings y and y(i):

    ℓ^H_i(y) = Σ_jk ℓ^{0/1}_{i,jk}(y_jk) = Σ_jk 1I(y_jk ≠ y(i)_jk) = R N_p − Σ_jk y_jk y(i)_jk.

The last equality follows from the fact that any valid matching for example i has R reviewers for each of the N_p papers; hence R N_p − Σ_jk y_jk y(i)_jk represents exactly the number of edges that are different between y and y(i). Combining the two pieces, we have

    w⊤f(x(i), y) + ℓ^H_i(y) = Σ_jk y_jk w⊤f_jk + R N_p − Σ_jk y_jk y(i)_jk
                            = R N_p + Σ_jk y_jk [w⊤f_jk − y(i)_jk].

The loss-augmented inference problem can then be written as an LP in µ_i similar to Eq. (3.3) (without the constant term R N_p):

    max  Σ_jk µ_{i,jk} [w⊤f_jk − y(i)_jk]
    s.t. Σ_j µ_{i,jk} = R, ∀k;  Σ_k µ_{i,jk} ≤ P, ∀j;  0 ≤ µ_{i,jk} ≤ 1.

Note that the dependence on w is linear and appears only in the objective of the LP. In terms of Eq. (4.5), f̄_i and g_i are affine in µ_i:

    f̄_i(w, µ_i) = R N_p + Σ_jk µ_{i,jk} [w⊤f_jk − y(i)_jk],
    g_i(µ_i) ≤ 0  ⇔  { Σ_j µ_{i,jk} = R;  Σ_k µ_{i,jk} ≤ P;  0 ≤ µ_{i,jk} ≤ 1 }.

In general, when we can express max_{y ∈ Y(i)} w⊤f(x(i), y) as an LP and we use a loss function that is linear in the number of mistakes, we have a linear program of the form

    d_i + max_{µ_i} (F_i w + c_i)⊤µ_i   s.t.  A_i µ_i ≤ b_i, µ_i ≥ 0,        (4.6)

for appropriately defined d_i, c_i, F_i, A_i, b_i, which depend only on x(i), y(i), f(x, y) and g(x, y).
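The loss-augmented matching LP above can be handed directly to an off-the-shelf LP solver. Below is a minimal sketch using `scipy.optimize.linprog` on an invented 2×2 instance; the variable names (`scores`, `y_obs`) are mine, and the constant R·N_p is added back after solving, as in the text.

```python
import numpy as np
from scipy.optimize import linprog

def loss_augmented_lp(scores, y_obs, R, P):
    """Solve max_mu sum_jk mu_jk*(score_jk - y_obs_jk) + R*Np over the
    matching polytope (R reviewers per paper, at most P papers per
    reviewer), i.e. the LP form of loss-augmented inference with the
    Hamming loss.  scores[j, k] = w.f_jk; y_obs is the observed matching.
    linprog minimizes, so we negate the objective.
    """
    Nr, Np = scores.shape
    c = -(scores - y_obs).ravel()          # variable order: mu[j*Np + k]
    # Equality constraints: each paper k gets exactly R reviewers.
    A_eq = np.zeros((Np, Nr * Np))
    for k in range(Np):
        A_eq[k, k::Np] = 1.0               # picks mu_{j,k} for every j
    b_eq = np.full(Np, float(R))
    # Inequality constraints: each reviewer j takes at most P papers.
    A_ub = np.zeros((Nr, Nr * Np))
    for j in range(Nr):
        A_ub[j, j * Np:(j + 1) * Np] = 1.0
    b_ub = np.full(Nr, float(P))
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0))
    mu = res.x.reshape(Nr, Np)
    value = -res.fun + R * Np              # add back the constant term
    return mu, value

# Invented 2x2 instance: scores favor the observed (identity) matching.
mu, value = loss_augmented_lp(np.array([[2.0, 0.0], [0.0, 2.0]]),
                              np.eye(2), R=1, P=1)
```

Since the vertices of this polytope are integral, the LP optimum coincides with the best matching, so the same routine could serve as the inner maximization of Eq. (4.4).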
If this LP is compact (the number of variables and constraints is polynomial in the number of label variables), then we can use it to solve the max-margin estimation problem efficiently by using convex duality. The Lagrangian associated with Eq. (4.4) is given by

    L_{i,w}(µ_i, λ_i) = f̄_i(w, µ_i) − λ_i⊤g_i(µ_i),        (4.7)

where λ_i ≥ 0 is a vector of Lagrange multipliers, one for each constraint function in g_i(µ_i). Since we assume that f̄_i(w, µ_i) is concave in µ_i and bounded on the nonempty set {µ_i : g_i(µ_i) ≤ 0}, we have strong duality:

    max_{µ_i : g_i(µ_i) ≤ 0} f̄_i(w, µ_i) = min_{λ_i ≥ 0} max_{µ_i} L_{i,w}(µ_i, λ_i).

For many forms of f̄ and g, we can write the Lagrangian dual explicitly as

    min h_i(w, λ_i)   s.t.  q_i(w, λ_i) ≤ 0,        (4.8)

where h_i(w, λ_i) and q_i(w, λ_i) are convex in both w and λ_i. For example, the dual of the LP in Eq. (4.6) is

    d_i + min b_i⊤λ_i   s.t.  A_i⊤λ_i ≥ F_i w + c_i, λ_i ≥ 0,

so h_i(w, λ_i) = d_i + b_i⊤λ_i and q_i(w, λ_i) ≤ 0 is {F_i w + c_i − A_i⊤λ_i ≤ 0, −λ_i ≤ 0}. (We folded λ_i ≥ 0 into q_i(w, λ_i) for brevity.) Since the original problem had polynomial size, the dual is polynomial size as well.

Plugging Eq. (4.8) into Eq. (4.2), we get

    min  (1/2)||w||² + C Σ_i ξ_i
    s.t. w⊤f_i(y(i)) + ξ_i ≥ min_{λ_i : q_i(w, λ_i) ≤ 0} h_i(w, λ_i), ∀i.        (4.9)

Moreover, we can combine the minimization over λ with the minimization over {w, ξ}. The reason is that if the right-hand side is not at the minimum over λ_i, the constraint is tighter than necessary, leading to a suboptimal solution w; optimizing jointly over λ as well will therefore produce a solution {w, ξ} that is optimal. Hence we have a joint and compact convex optimization program for estimating w:

    min  (1/2)||w||² + C Σ_i ξ_i
    s.t. w⊤f_i(y(i)) + ξ_i ≥ h_i(w, λ_i), q_i(w, λ_i) ≤ 0, ∀i.        (4.11)

The exact form of this program depends strongly on f̄ and g. For our LP-based example, we have a QP with linear constraints:

    min  (1/2)||w||² + C Σ_i ξ_i
    s.t. w⊤f_i(y(i)) + ξ_i ≥ d_i + b_i⊤λ_i, A_i⊤λ_i ≥ F_i w + c_i, λ_i ≥ 0, ∀i.        (4.12)

4.1.2 Certificate formulation

In the previous section, we assumed a compact convex formulation of the loss-augmented max in Eq. (4.4). In some cases, we can instead find a compact certificate of optimality that guarantees that y(i) = arg max_y [w⊤f_i(y) + ℓ_i(y)] without expressing the loss-augmented inference as a compact convex program. Intuitively, just verifying that a given assignment is optimal is sometimes easier than actually finding it. There are several important combinatorial problems which allow a polynomial time solution yet do not have a compact convex optimization formulation. For example, maximum weight perfect (non-bipartite) matching and spanning tree problems can be expressed as linear programs with exponentially many constraints, but no polynomial formulation is known [Bertsimas & Tsitsiklis, 1997; Schrijver, 2003]. Both of these problems, however, can be solved in polynomial time using combinatorial algorithms.
Consider the maximum weight spanning tree problem, where y_jk encodes whether edge (j, k) is in the tree and the score of the edge is given by w⊤f_{i,jk} for some basis functions f(x_jk, y_jk). A basic property of a spanning tree is that cutting any edge (j, k) in the tree creates two disconnected sets of nodes (V_j[jk], V_k[jk]), where j ∈ V_j[jk] and k ∈ V_k[jk]. A spanning tree is optimal with respect to a set of edge weights if and only if, for every edge (j, k) in the tree, the weight of (j, k) is larger than (or equal to) the weight of any other edge (j', k') in the graph connecting V_j[jk] and V_k[jk], i.e., with j' ∈ V_j[jk] and k' ∈ V_k[jk] [Cormen et al., 2001]. We also assume that the loss function decomposes into a sum of losses over the edges, with a loss ℓ_{i,j'k'} for each wrong edge. Then the optimality conditions are:

    w⊤f_{i,jk} ≥ w⊤f_{i,j'k'} + ℓ_{i,j'k'},
    ∀(j, k) s.t. y(i)_jk = 1, ∀(j', k') s.t. j' ∈ V_j[jk], k' ∈ V_k[jk].

For a full graph, we have O(|V(i)|³) constraints for each example i, where |V(i)| is the number of nodes in the graph for example i. Expressing spanning tree optimality does not require additional variables ν_i, but in other problems, such as perfect matching optimality in Ch. 10, such auxiliary variables are needed. We discuss the conditions for optimality of perfect matchings in Ch. 10.

More generally, suppose that we can find a compact convex formulation of these conditions via a polynomial (in L_i) set of functions q_i(w, ν_i), jointly convex in w and auxiliary variables ν_i:

    ∃ν_i s.t. q_i(w, ν_i) ≤ 0  ⇔  w⊤f_i(y(i)) ≥ w⊤f_i(y) + ℓ_i(y), ∀y ∈ Y(i).

Then the following joint convex program in w and ν computes the max-margin parameters:

    min  (1/2)||w||²   s.t.  q_i(w, ν_i) ≤ 0, ∀i.        (4.13)

Note that this formulation does not allow for violations of the margin constraints (it has no slack variables ξ_i). Essentially, it requires that the upper bound on the empirical risk be zero, R̃_S[h_w] = 0, and minimizes the complexity of the hypothesis h_w as measured by the norm of the weights.
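The cut-based optimality conditions for spanning trees can be checked directly, one tree edge at a time, which is exactly what a certificate amounts to. The following sketch is a verifier for a small graph, with roughly the cubic number of checks mentioned in the text; the representation (a weight dict keyed by node pairs) is an assumption of mine, not from the text.

```python
def is_max_spanning_tree(weights, tree):
    """Certificate check: a spanning tree is maximum-weight iff, for every
    tree edge (j, k), removing it splits the nodes into two sides and
    weights[(j, k)] >= weights[(j', k')] for every graph edge crossing
    that cut.  weights: dict {(j, k): w}, undirected; tree: set of edges.
    """
    def component(start, banned):
        # Nodes reachable from `start` in the tree without crossing `banned`.
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for (a, b) in tree:
                if (a, b) == banned:
                    continue
                for u2, v2 in ((a, b), (b, a)):
                    if u2 == u and v2 not in seen:
                        seen.add(v2)
                        stack.append(v2)
        return seen

    for (j, k) in tree:
        side_j = component(j, (j, k))      # V_j[jk]; the rest is V_k[jk]
        for (a, b), w in weights.items():
            crosses = (a in side_j) != (b in side_j)
            if crosses and w > weights[(j, k)]:
                return False               # a heavier edge crosses the cut
    return True
```

On a weighted triangle, the tree keeping the two heaviest edges passes the check and any other spanning tree fails it, matching the if-and-only-if characterization above.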
If the basis functions are not sufficiently rich to ensure that each y(i) is optimal, then Eq. (4.13) may be infeasible. If the problem is infeasible, the designer could add more basis functions f(x, y) that take into account additional information about x. One could also add slack variables for each example and each constraint that would allow violations of the optimality conditions with some penalty; however, these slack variables would not represent upper bounds on the loss ℓ as they do in the min-max formulation, and are therefore less justified.

4.2 Approximations: upper and lower bounds

There are structured prediction tasks for which we might not be able to solve the estimation problem exactly: we cannot compute max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)] exactly or explicitly, but can only upper or lower bound it. The nature of these lower and upper bounds depends on the problem, but we consider two general cases below. Fig. 4.1 shows schematically how approximating the max subproblem reduces or extends the feasible space of w and ξ and leads to approximate solutions.

Figure 4.1: Exact and approximate (upper-bound and lower-bound) constraints on the max-margin quadratic program. The parabolic contours represent the value of the objective function, and '+', 'x' and 'o' mark the different optima. The solid line represents the constraints imposed by the assignments y ∈ Y(i), whereas the dashed and dotted lines represent approximate constraints. The approximate constraints may coincide with the exact constraints in some cases, and be more stringent or relaxed in others.

4.2.1 Constraint generation

When neither a compact maximization nor an optimality formulation is possible, we can resort to constraint generation or cutting plane methods. Consider Eq. (4.3), where we have an exponential number of linear constraints, one for each i and y ∈ Y(i). Only a subset of those constraints will be active at the optimal solution w; in fact, not more than the number of parameters n plus the number of examples m can be active in general, since that is the number of variables. If we can identify a small number of constraints that are critical to the solution, we do not have to include all of them. Of course, identifying these constraints is in general as difficult as solving the problem, but a greedy approach of adding the most violated constraints often achieves good approximate solutions after adding a small (polynomial) number of constraints.

We assume that we have an algorithm that produces ȳ = arg max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)]. We maintain, for each example i, a small but growing set of assignments Ȳ(i) ⊆ Y(i). At each iteration, we solve the problem with a subset of constraints:

    min  (1/2)||w||² + C Σ_i ξ_i
    s.t. w⊤f_i(y(i)) + ξ_i ≥ w⊤f_i(y) + ℓ_i(y), ∀i, ∀y ∈ Ȳ(i).        (4.14)

The only difference between Eq. (4.3) and Eq. (4.14) is that Y(i) has been replaced by Ȳ(i). We then compute ȳ = arg max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)] for each i and check whether the constraint w⊤f_i(y(i)) + ξ_i + ε ≥ w⊤f_i(ȳ) + ℓ_i(ȳ) is violated, where ε is a user-defined precision parameter. If it is violated, we set Ȳ(i) = Ȳ(i) ∪ {ȳ}. The lower bound on the constraints provided by Ȳ(i) keeps tightening with each iteration, and the algorithm terminates when no constraints are violated, that is, when the desired precision is reached. The algorithm is summarized in Fig. 4.2.

Figure 4.2: A constraint generation algorithm.
Input: precision parameter ε.
1. Initialize: Ȳ(i) = {}, ∀i.
2. Set violation = 0 and solve for w and ξ by optimizing
   min (1/2)||w||² + C Σ_i ξ_i  s.t. w⊤f_i(y(i)) + ξ_i ≥ w⊤f_i(y) + ℓ_i(y), ∀i, ∀y ∈ Ȳ(i).
3. For each i, compute ȳ = arg max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)];
   if w⊤f_i(y(i)) + ξ_i + ε ≤ w⊤f_i(ȳ) + ℓ_i(ȳ), then set Ȳ(i) = Ȳ(i) ∪ {ȳ} and violation = 1.
4. If violation = 1, go to 2.
5. Return w.

If we continue adding constraints until there are no more violated ones, the resulting solution is optimal up to the precision ε. The number of constraints that must be added before the algorithm terminates depends on the precision ε and problem-specific characteristics. See [Bertsimas & Tsitsiklis, 1997; Nemhauser & Wolsey, 1999; Boyd & Vandenberghe, 2004] for a more in-depth discussion of cutting plane methods. This approach may also be computationally faster in providing a very good approximation in practice if the explicit convex programming formulation is polynomial in size, but very large, while the maximization algorithm is comparatively fast. We note that if the algorithm that produces ȳ = arg max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)] is suboptimal, the approximation error of the solution we achieve might be much greater than ε.

4.2.2 Constraint strengthening

In many problems, the maximization problem we are interested in may be very expensive or intractable. For example, we consider MAP inference in large-treewidth Markov networks in Ch. 8, multiway cut in Ch. 7, and graph partitioning in Ch. 11. Many such problems can be written as integer programs. Relaxations of such integer programs into LPs, QPs or SDPs often provide excellent approximation algorithms [Hochbaum, 1997; Nemhauser & Wolsey, 1999]. The relaxation usually defines a larger feasible space Ỹ(i) ⊇ Y(i) over which the maximization is done.
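The constraint-generation loop of Fig. 4.2 can be sketched for a single example with an explicitly enumerable label space. In the sketch below the inner QP is solved with SciPy's SLSQP rather than a dedicated QP solver, and the "oracle" arg max is brute-force enumeration; both are simplifications of mine for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def cutting_plane(feats, losses, y_true, C=1.0, eps=1e-4, max_iter=50):
    """Constraint generation for the max-margin QP, for one example.

    feats[y]: feature vector f(y); losses[y]: task loss l(y).
    The oracle below enumerates labels; a real structured problem would
    use a combinatorial algorithm instead.
    """
    n = feats.shape[1]
    active = []                                  # generated constraints
    w, xi = np.zeros(n), 0.0
    for _ in range(max_iter):
        scores = feats @ w + losses              # loss-augmented scores
        y_hat = int(np.argmax(scores))           # most violated candidate
        if feats[y_true] @ w + xi + eps >= scores[y_hat]:
            break                                # no violation > eps: done
        active.append(y_hat)
        # Re-solve the QP over z = (w, xi) with the current constraint set.
        def obj(z):
            return 0.5 * z[:n] @ z[:n] + C * z[n]
        cons = [{"type": "ineq", "fun": lambda z: z[n]}]      # xi >= 0
        for y in active:
            d = feats[y_true] - feats[y]
            cons.append({"type": "ineq",
                         "fun": lambda z, d=d, y=y: z[:n] @ d + z[n] - losses[y]})
        res = minimize(obj, np.append(w, xi), constraints=cons)
        w, xi = res.x[:n], res.x[n]
    return w, xi

# Invented toy multiclass instance: one-hot features, 0/1 loss, true label 0.
feats = np.eye(3)
losses = np.array([0.0, 1.0, 1.0])
w, xi = cutting_plane(feats, losses, y_true=0, C=10.0)
```

On this instance only two constraints are ever generated before the loop terminates, illustrating the point that far fewer than |Y| constraints are typically active.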
In this case, y ∈ Ỹ(i) may correspond to a "fractional" assignment; for example, a solution to the MAP LP in Eq. (3.2) for an untriangulated network may not correspond to any valid assignment. The approximation is then an overestimate of the constraints:

    max_{y ∈ Ỹ(i)} [w⊤f_i(y) + ℓ_i(y)] ≥ max_{y ∈ Y(i)} [w⊤f_i(y) + ℓ_i(y)].

Hence the constraint set is tightened by such invalid assignments; Fig. 4.1 shows how the overestimate reduces the feasible space of w and ξ. Note that for every setting of the weights w that produces fractional solutions for the relaxation, the approximate MAP solution has higher value than any integer solution, including the true assignment y(i), thereby driving up the corresponding slack ξ_i. As the objective includes a penalty for the slack variables, minimizing the objective intuitively tends to drive the weights w away from the regions where the solutions to the approximation are fractional. By contrast, for weights w for which the MAP approximation is integer-valued, the margin has the standard interpretation as the difference between the score of y(i) and the MAP ȳ (according to w). In essence, the estimation algorithm is finding weights that are not necessarily optimal for an exact maximization algorithm, but (close to) optimal for the particular approximate maximization algorithm used. We will show experimentally that such approximations often work very well in practice.

4.3 Related work

Our max-margin formulation is related to a body of work called inverse combinatorial and convex optimization [Burton & Toint, 1992; Zhang & Ma, 1996; Ahuja & Orlin, 2001; Heuberger, 2004]. An inverse optimization problem is defined by an instance of an optimization problem max_{y ∈ Y} w⊤f(y), a set of nominal weights w0, and a target solution yt. The goal is to find the weights w closest to the nominal w0 in some norm.
Formally, the inverse problem seeks weights which make the target solution optimal:

    min ||w − w0||_p   s.t.  w⊤f(yt) ≥ w⊤f(y), ∀y ∈ Y.

Most of the attention has been on the L1 and L∞ norms, but the L2 norm is also used. Note that if we set w0 = 0, then w = 0 is trivially the optimal solution: the solution w depends critically on the choice of nominal weights, which is not appropriate in the learning setting. Although there is a strong connection between inverse optimization problems and our formulations, the goals are very different from ours. In our framework, we do not assume as given a nominal set of weights; moreover, we are learning a parameterized objective function that depends on the input x and will generalize well in prediction on new instances.

The study of inverse problems began with geophysical scientists (see [Tarantola, 1987] for an in-depth discussion of a wide range of applications). Modeling a complex physical system often involves a large number of parameters which scientists find hard or impossible to set correctly. Provided educated guesses for the parameters w0 and the behavior of the system as a target, the inverse optimization problem attempts to match the behavior while not perturbing the "guesstimate" too much.

The inverse reinforcement learning problem [Ng & Russell, 2000; Abbeel & Ng, 2004] is much closer to our setting. The goal is to learn a reward function that will cause a rational agent to act similarly to the observed behavior of an expert. A full description of the problem is beyond our scope, but we briefly describe the Markov decision process (MDP) model commonly used for sequential decision-making problems where an agent interacts with its environment. The environment is modeled as a system that can be in one of a set of discrete states. At every time step, the agent chooses an action from a discrete set of actions, and the system transitions to a next state with a probability that depends on the current state and the action taken. The agent collects a reward at each step, which generally depends on the current and the next state and the action taken. A rational agent executes a policy (essentially, a state-to-action mapping) that maximizes its expected reward.
To map this problem (approximately) to our setting, note that a policy roughly corresponds to the labels y: the state sequence corresponds to the input x, and the reward for a state/action sequence is assumed to be w⊤f(x, y) for some basis functions f(x, y). The goal is to learn w from a set of state/action sequences (x(i), y(i)) of the expert such that maximizing the expected reward according to the system model makes the agent imitate the expert. This and related problems are formulated as a convex program in Ng and Russell [2000] and Abbeel and Ng [2004].

4.4 Conclusion

In this chapter, we presented two formulations of structured max-margin estimation that define a compact convex optimization problem. The first formulation, min-max, relies on the ability to express inference in the model as a compact convex optimization problem. The second one, certificate, only requires expressing optimality of a given assignment according to the model. The estimation problem is tractable and exact whenever the prediction problem can be formulated as a compact convex optimization problem, or can be solved by a polynomial time combinatorial algorithm with compact convex optimality conditions. When the prediction problem is intractable or very expensive to solve exactly, we resort to approximations that provide only upper or lower bounds on the predictions. The estimated parameters are then approximate, but produce accurate approximate prediction models in practice.

Because our approach relies only on using the maximum in the model for prediction, and does not require a normalized distribution P(y | x) over all outputs, maximum margin estimation can be tractable when maximum likelihood is not. For example, learning a probabilistic model P(y | x) over bipartite matchings using maximum likelihood requires computing the normalizing partition function, which is #P-complete [Valiant, 1979; Garey & Johnson, 1979]. By contrast, maximum margin estimation can be formulated as a compact QP with linear constraints. Similar results hold for non-bipartite matchings and min-cuts.

Our framework applies to a wide range of prediction problems that we explore in the rest of the thesis, including Markov networks, context-free grammars, and many combinatorial structures such as matchings and graph cuts.
In models that are tractable for both maximum likelihood and maximum margin (such as low-treewidth Markov networks, context-free grammars, and many other problems in which inference is solvable by dynamic programming), our approach has an additional advantage. Because of the hinge loss, the solutions to the estimation problem are relatively sparse in the dual space (as in SVMs), which makes the use of kernels much more efficient. Maximum likelihood estimation with kernels, by contrast, results in models that are generally non-sparse and require pruning or greedy support vector selection methods [Lafferty et al., 2004; Altun et al., 2004].

The forthcoming formulations in the thesis follow the principles laid out in this chapter. The range of applications of these principles is very broad and leads to estimation problems with very interesting structure in each particular problem, from Markov networks and context-free grammars to graph cuts and perfect matchings.
Part II

Markov networks
Chapter 5

Markov networks

Markov networks are extensively used to model complex sequential, spatial, and relational interactions in prediction problems arising in many fields. These problems involve labeling a set of related objects that exhibit local consistency. Sequential prediction problems arise in natural language processing (part-of-speech tagging, speech recognition, information extraction [Manning & Schütze, 1999]), in computational biology (gene finding, protein structure prediction, sequence alignment [Durbin et al., 1998]), and in many other fields. In sequential labeling problems (such as handwriting recognition), the labels (letters) of adjacent inputs (images) are highly correlated. In image processing, neighboring pixels exhibit spatial label coherence in denoising, segmentation and stereo correspondence [Besag, 1986; Boykov et al., 1999a]. In hypertext or bibliographic classification, labels of linked and co-cited documents tend to be similar [Chakrabarti et al., 1998; Taskar et al., 2002]. In proteomic analysis, the location and function of proteins that interact are often highly correlated [Vazquez et al., 2003].

Markov networks compactly represent complex joint distributions of the label variables by modeling their local interactions. Such models are encoded by a graph, whose nodes represent the different label variables, and whose edges represent and quantify direct dependencies between them. For example, a Markov network for the hypertext domain would include a node for each webpage, encoding its label, and an edge between any pair of webpages whose labels are directly correlated (e.g., because one links to the other).

We address the problem of max-margin estimation of the parameters of Markov networks for such structured classification problems. We show a compact convex formulation that seamlessly integrates kernels with graphical models, analyze the theoretical generalization properties of max-margin estimation, and derive a novel margin-based bound for structured classification.

5.1 Maximum likelihood estimation

Before we present maximum margin estimation, we review the standard maximum likelihood method. The regularized maximum likelihood approach to learning the weights w of a Markov network is similar to the one for logistic regression we described earlier. We focus on conditional Markov networks (or CRFs [Lafferty et al., 2001]), which represent P(y | x) instead of a generative model P(x, y). We are given a labeled training sample S = {(x(i), y(i))}_{i=1}^m, drawn from a fixed distribution D over X × Y. We assume the structure of the network is given: we have a mapping from an input x to the corresponding Markov network graph G(x) = {V, E}, where the nodes V map to the variables in y. We abbreviate G(x(i)) as G(i) below. Note that the topology and size of the graph G(i) might be different for each example i; for instance, the training sequences might have different lengths. The mapping depends on the structure of the model, for example, a first-order Markov chain or a higher-order model, and on the basis functions we use to model the dependencies of the problem. In handwriting recognition, this mapping depends on the segmentation algorithm that determines how many letters the sample image contains and splits the image into individual images for each letter.
The log-linear representation we described in Chapter 3 is defined via a vector of n basis functions f(x, y), with w ∈ IR^n:

    log P_w(y | x) = w⊤f(x, y) − log Z_w(x),

where Z_w(x) = Σ_y exp{w⊤f(x, y)} is the partition function. The objective is to minimize the training log-loss with an additional regularization term, usually the squared norm of the weights w [Lafferty et al., 2001]:

    min  (1/2)||w||² − C Σ_i log P_w(y(i) | x(i))
       = (1/2)||w||² + C Σ_i [log Z_w(x(i)) − w⊤f_i(y(i))],

where, as before, f_i(y) = f(x(i), y). This objective function is convex in the parameters w, so we have an unconstrained convex optimization problem. The gradient with respect to w is given by:

    w + C Σ_i [E_{i,w}[f_i(y)] − f_i(y(i))],

where E_{i,w}[f_i(y)] = Σ_{y∈Y} f_i(y) P_w(y | x(i)) is the expectation under the conditional distribution P_w(y | x(i)). Since the basis functions decompose over the cliques of the network, the expectation decomposes as well:

    E_{i,w}[f_i(y)] = Σ_{c∈C(i)} Σ_{yc∈Yc} f_{i,c}(yc) P_w(yc | x(i)).

To compute the expectations, we can therefore use inference in the Markov network to calculate the marginals P_w(yc | x(i)) for each clique c in the network (Sec. 3.2).

Second-order methods for solving unconstrained convex optimization problems, such as Newton's method, require the second derivatives as well as the gradient [Nocedal & Wright, 1999; Boyd & Vandenberghe, 2004]. Let δf_i(y) = f_i(y) − E_{i,w}[f_i(y)]. The Hessian of the objective depends on the covariances of the basis functions:

    I + C Σ_i E_{i,w}[δf_i(y) δf_i(y)⊤],

where I is the n × n identity matrix. Computing the Hessian is more expensive than the gradient, since we need to calculate joint marginals P_w(y_{c∪c'} | x(i)) of every pair of cliques c and c', as well as covariances of all basis functions, which is quadratic in the number of cliques and the number of functions. A standard approach is to use an approximate second-order method that does not need to compute the Hessian, but uses only gradient information. Conjugate gradient or L-BFGS methods have been shown to work very well on large estimation problems, even with millions of parameters w [Sha & Pereira, 2003; Pinto et al., 2003].
5.2 Maximum margin estimation

For maximum margin estimation, we begin with the min-max formulation from Sec. 4.3:

    min (1/2)‖w‖² + C Σ_i ξ_i                                                  (5.1)
    s.t. w⊤f_i(y(i)) + ξ_i ≥ max_y [w⊤f_i(y) + ℓ_i(y)], ∀i.

We know from Sec. 3.3 how to express max_y w⊤f_i(y) as an LP, but the important difference here is the loss function ℓ_i. The simplest loss is the 0/1 loss, ℓ_i(y) ≡ 1(y(i) ≠ y); in fact, this loss for sequence models was used by Collins [2001] and Altun et al. [2003]. In general, in structured problems, where we are predicting multiple labels, the loss is often not just the simple 0/1 loss, but may depend on the number of labels and the type of labels predicted incorrectly, or perhaps the number of cliques of labels predicted incorrectly. We will focus on decomposable loss functions below.

Assumption 5.2.1 The loss function ℓ_i(y) is decomposable: ℓ_i(y) = Σ_{c∈C(G(i))} ℓ_{i,c}(y_c).

A natural choice that we use in our experiments is the Hamming distance, ℓ_H(x(i), y) = Σ_{v∈V(i)} 1(y_v(i) ≠ y_v), which decomposes over the node cliques. With this assumption, the loss, like the basis functions, decomposes over the cliques of labels, and we can express the inner inference problem for a triangulated graph as a linear program for each example i, as in Sec. 3.3:

    max Σ_{c∈C(i), y_c} μ_{i,c}(y_c) [w⊤f_{i,c}(y_c) + ℓ_{i,c}(y_c)]          (5.2)
    s.t. Σ_{y_c} μ_{i,c}(y_c) = 1, ∀i, ∀c ∈ C(i);
         μ_{i,c}(y_c) ≥ 0, ∀i, ∀c ∈ C(i), ∀y_c;
         μ_{i,s}(y_s) = Σ_{y_c∼y_s} μ_{i,c}(y_c), ∀i, ∀c ∈ C(i), ∀s ⊂ c, ∀y_s,
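For intuition, the slack ξ_i in the min-max formulation is exactly the structured hinge loss, obtained by loss-augmented inference. The sketch below computes it by brute-force enumeration over all assignments (the binary label set and the feature map passed in are placeholders; an actual implementation would use the LP relaxation above or dynamic programming):

```python
import itertools

LABELS = (0, 1)  # placeholder label set for the sketch

def hamming(y, y_true):
    # Decomposable Hamming loss: a sum of per-node indicators (Assumption 5.2.1).
    return sum(a != b for a, b in zip(y, y_true))

def slack(w, feats, x, y_true):
    """xi_i = max_y [w.f_i(y) + loss_i(y)] - w.f_i(y_true), by enumeration."""
    score = lambda y: sum(wj * fj for wj, fj in zip(w, feats(x, y)))
    best = max(score(y) + hamming(y, y_true)
               for y in itertools.product(LABELS, repeat=len(y_true)))
    return best - score(y_true)
```

With zero weights the slack equals the worst-case Hamming loss, while weights that separate the true labeling from every competitor by at least its loss drive the slack to zero.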
where C(i) = C(G(i)) are the cliques of the Markov network for example i. As we showed before, the constraints ensure that the μ_i's form a proper distribution, and the distribution that maximizes the objective puts all its weight on the most likely assignment, provided that assignment is unique. (If the arg max is not unique, any convex combination of the maximizing assignments is a valid solution.) The dual of Eq. (5.2) is given by:

    min Σ_{c∈C(i)} λ_{i,c}                                                    (5.3)
    s.t. λ_{i,c} + Σ_{c'⊃c} m_{i,c,c'}(y_c) − Σ_{s⊂c: y_s∼y_c} m_{i,s,c}(y_s) ≥ w⊤f_{i,c}(y_c) + ℓ_{i,c}(y_c), ∀c ∈ C(i), ∀y_c.

In this dual, the m_{i,s,c}(y_s) variables correspond to the agreement constraints in the primal Eq. (5.2), while the λ_{i,c} variables correspond to the normalization constraints. Plugging this dual into Eq. (5.1) for each example i and minimizing jointly over all the variables (w, ξ, λ and m), we have:

    min (1/2)‖w‖² + C Σ_i ξ_i                                                 (5.4)
    s.t. w⊤f_i(y(i)) + ξ_i ≥ Σ_{c∈C(i)} λ_{i,c}, ∀i,

together with the constraints of Eq. (5.3) for every i. For readability, we make a change of variables from λ_{i,c} to ξ_{i,c}:

    λ_{i,c} = w⊤f_{i,c}(y_c(i)) + ξ_{i,c}, ∀i, ∀c ∈ C(i).

The reason for naming the new variables using the letter ξ will be clear shortly. We also introduce variables that capture the combined effect of all the agreement variables m:

    M_{i,c}(y_c) = Σ_{s⊂c: y_s∼y_c} m_{i,s,c}(y_s) − Σ_{c'⊃c} m_{i,c,c'}(y_c), ∀i, ∀c ∈ C(i), ∀y_c.
With this change of variables, Σ_c λ_{i,c} = w⊤f_i(y(i)) + Σ_c ξ_{i,c}, so the constraint of Eq. (5.4) becomes ξ_i ≥ Σ_{c∈C(i)} ξ_{i,c}. Since the slack variable ξ_i appears only in this constraint and the objective minimizes C ξ_i, we have ξ_i = Σ_c ξ_{i,c} at the optimum, and we can simply eliminate ξ_i:

    min (1/2)‖w‖² + C Σ_i Σ_{c∈C(i)} ξ_{i,c}                                  (5.5)
    s.t. w⊤f_{i,c}(y_c(i)) + ξ_{i,c} ≥ w⊤f_{i,c}(y_c) + ℓ_{i,c}(y_c) + M_{i,c}(y_c), ∀i, ∀c ∈ C(i), ∀y_c,
         M_{i,c}(y_c) = Σ_{s⊂c: y_s∼y_c} m_{i,s,c}(y_s) − Σ_{c'⊃c} m_{i,c,c'}(y_c).

Finally, we can write this in a form that resembles our original formulation Eq. (5.1), but defined at a local level, with a hinge upper bound for each clique:

    min (1/2)‖w‖² + C Σ_{i,c} ξ_{i,c}                                         (5.6)
    s.t. w⊤f_{i,c}(y_c(i)) + ξ_{i,c} ≥ max_{y_c} [w⊤f_{i,c}(y_c) + ℓ_{i,c}(y_c) + M_{i,c}(y_c)], ∀i, ∀c ∈ C(i).

Note that without the M_{i,c} variables, we would essentially treat each clique as an independent classification problem: for each clique we would have a hinge upper bound on the local loss, a local margin requirement. The m_{i,s,c}(y_s) variables correspond to a certain kind of messages between cliques that distribute "credit": they allow cliques which have sufficient margin to help fulfill the margin requirements of other cliques.
As an example, consider the first-order Markov chain in Fig. 5.1. The set of cliques consists of the five nodes and the four edges. Suppose, for the sake of this example, that our training data consist of only one training sample, y(∗). For brevity of notation in this example, we drop the dependence on the sample index i in the indexing of the variables, and we use the Hamming loss ℓ_H, which decomposes into local terms ℓ_j(y_j) = 1(y_j ≠ y_j(∗)) for each node and is zero for the edges.

Figure 5.1: First-order chain shown as a set of cliques (nodes and edges). Also shown are the corresponding local slack variables ξ for each clique and the messages m between cliques.

The constraints associated with the node cliques in this sequence are:

    w⊤f_1(y_1(∗)) + ξ_1 ≥ w⊤f_1(y_1) + 1(y_1 ≠ y_1(∗)) − m_{1,12}(y_1), ∀y_1;
    w⊤f_2(y_2(∗)) + ξ_2 ≥ w⊤f_2(y_2) + 1(y_2 ≠ y_2(∗)) − m_{2,12}(y_2) − m_{2,23}(y_2), ∀y_2;
    w⊤f_3(y_3(∗)) + ξ_3 ≥ w⊤f_3(y_3) + 1(y_3 ≠ y_3(∗)) − m_{3,23}(y_3) − m_{3,34}(y_3), ∀y_3;
    w⊤f_4(y_4(∗)) + ξ_4 ≥ w⊤f_4(y_4) + 1(y_4 ≠ y_4(∗)) − m_{4,34}(y_4) − m_{4,45}(y_4), ∀y_4;
    w⊤f_5(y_5(∗)) + ξ_5 ≥ w⊤f_5(y_5) + 1(y_5 ≠ y_5(∗)) − m_{5,45}(y_5), ∀y_5.
The edge constraints are:

    w⊤f_12(y_1(∗), y_2(∗)) + ξ_12 ≥ w⊤f_12(y_1, y_2) + m_{1,12}(y_1) + m_{2,12}(y_2), ∀y_1, y_2;
    w⊤f_23(y_2(∗), y_3(∗)) + ξ_23 ≥ w⊤f_23(y_2, y_3) + m_{2,23}(y_2) + m_{3,23}(y_3), ∀y_2, y_3;
    w⊤f_34(y_3(∗), y_4(∗)) + ξ_34 ≥ w⊤f_34(y_3, y_4) + m_{3,34}(y_3) + m_{4,34}(y_4), ∀y_3, y_4;
    w⊤f_45(y_4(∗), y_5(∗)) + ξ_45 ≥ w⊤f_45(y_4, y_5) + m_{4,45}(y_4) + m_{5,45}(y_5), ∀y_4, y_5.

5.3 M3N dual and kernels

In the previous section, we showed a derivation of a compact formulation based on LP inference. In this section, we develop an alternative dual derivation that provides a very interesting interpretation of the problem and is a point of departure for the special-purpose algorithms we develop. We begin with the formulation of Eq. (4.3):

    min (1/2)‖w‖² + C Σ_i ξ_i                                                 (5.8)
    s.t. w⊤Δf_i(y) ≥ ℓ_i(y) − ξ_i, ∀i, ∀y,

where Δf_i(y) ≡ f(x(i), y(i)) − f(x(i), y). In the dual, the exponential number of α_i(y) variables corresponds to the exponential number of constraints in the primal. The dual is given by:

    max Σ_{i,y} α_i(y) ℓ_i(y) − (1/2) ‖Σ_{i,y} α_i(y) Δf_i(y)‖²               (5.9)
    s.t. Σ_y α_i(y) = C, ∀i;  α_i(y) ≥ 0, ∀i, ∀y.

We make two small transformations to the dual that do not change the problem: we normalize the α's by C (by letting α_i(y) = C α'_i(y)), so that they sum
t.10) can be interpreted as a kind of distribution over y. but the importance of the constraint associated with the instantiation to the solution. ∀y.y (5. (5.c (yc ) = y ∼yc αi (y ). ∀c ∈ C (i) . . Consider. Our main insight is that the variables αi (y) in the dual formulation Eq. y. ∀i. Now we can reformulate our entire QP (5. The resulting dual is given by: max i. (5.10) in terms of these marginal dual variables.c (yc ) decompose over the cliques of the Markov network. ∀i. Since i (y) = c i. We deﬁne the marginal dual variables as follows: µi. the solution to the dual α gives the solution to the primal as a weighted combination: w∗ = C i.yc µi. for example.y 1 αi (y) i (y) − C 2 αi (y) = 1. The dual objective is a function of expectations of i (y) and ∆fi (y) with respect to αi (y). the ﬁrst term in the objective function (ﬁxing a particular i): αi (y) i (y) = y y αi (y) c i.yc i.c (yc ) = c. This dual distribution does not represent the probability that the model assigns to an instantiation. 2 αi (y)∆fi (y) i.y ∗ αi (y)∆fi (y). ∀yc .10) s. MARKOV NETWORKS to 1 and divide the objective by C.c (yc ) y ∼yc αi (y ) = c. Note that the number of µi.66 CHAPTER 5.c (yc ) and ∆fi (y) = c ∆fi. y αi (y) ≥ 0. ∀i. since they lie in the simplex αi (y) = 1. we only need clique marginals of the distribution αi (y) to compute their expectations.c (yc ).c (yc ) i.c (yc ) variables is small (polynomial) compared to the number of αi (y) variables (exponential) if the size of the largest clique is constant with respect to the size of the network.11) where y ∼ yc denotes whether the partial assignment yc is consistent with the full assignment y . As in multiclass SVMs. y αi (y) ≥ 0.
The decomposition of the second term in the objective is analogous:

    Σ_y α_i(y) Δf_i(y) = Σ_{c,y_c} Δf_{i,c}(y_c) Σ_{y'∼y_c} α_i(y') = Σ_{c,y_c} μ_{i,c}(y_c) Δf_{i,c}(y_c).

Let us denote the objective of Eq. (5.10) as Q(α). Note that Q depends on α_i(y) only through its marginals μ_{i,c}(y_c); that is, Q(α) = Q'(M(α)), where M denotes the marginalization operator defined by Eq. (5.11). The domain of this operator, D[M], is the product of simplices for all the m examples. What is its range, R[M], the set of legal marginals? Characterizing this set (also known as the marginal polytope) compactly will allow us to work in the space of μ's:

    max_{α∈D[M]} Q(α) ⇔ max_{μ∈R[M]} Q'(μ).

Hence we must ensure that μ_i corresponds to some distribution α_i, but we do not make a commitment about what that distribution is; we only enforce that there exists an α_i consistent with μ_i. When all the G(i) are triangulated, this is exactly what the constraints in the LP for MAP inference enforce (see the discussion of Lemma 3.5). Therefore, the following structured dual QP has the same primal solution w* as the original exponential dual QP in Eq. (5.10):

    max Σ_{i,c,y_c} μ_{i,c}(y_c) ℓ_{i,c}(y_c) − (C/2) ‖Σ_{i,c,y_c} μ_{i,c}(y_c) Δf_{i,c}(y_c)‖²      (5.12)
    s.t. Σ_{y_c} μ_{i,c}(y_c) = 1, ∀i, ∀c ∈ C(i);
         μ_{i,c}(y_c) ≥ 0, ∀i, ∀c ∈ C(i), ∀y_c;
         μ_{i,s}(y_s) = Σ_{y_c∼y_s} μ_{i,c}(y_c), ∀i, ∀c ∈ C(i), ∀s ⊂ c, ∀y_s.

The solution μ* to the structured dual gives us the primal solution:

    w* = C Σ_{i,c,y_c} μ*_{i,c}(y_c) Δf_{i,c}(y_c).

In general, the α distribution consistent with μ* is not unique: there is a continuum of distributions consistent with a given set of marginals.
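For a chain, the marginalization operator M of Eq. (5.11) can be written out directly. The brute-force sketch below (binary labels and all names are illustrative, not from the thesis) computes node and edge marginals of a dual distribution α and exhibits the agreement property that the constraints of the structured dual enforce:

```python
import itertools

LABELS = (0, 1)

def marginals(alpha, length):
    """mu_c(y_c) = sum over full assignments y' consistent with y_c of
    alpha(y'), for node cliques {t} and edge cliques {t, t+1} of a chain."""
    node = [{v: 0.0 for v in LABELS} for _ in range(length)]
    edge = [{(u, v): 0.0 for u in LABELS for v in LABELS}
            for _ in range(length - 1)]
    for y in itertools.product(LABELS, repeat=length):
        a = alpha.get(y, 0.0)
        for t in range(length):
            node[t][y[t]] += a
        for t in range(length - 1):
            edge[t][(y[t], y[t + 1])] += a
    return node, edge
```

Because α lies on the simplex, each node and edge marginal also sums to one, and summing an edge marginal over one endpoint recovers the node marginal of the other endpoint, which is exactly the agreement constraint.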
The objective of the QP in Eq. (5.12) does not distinguish between these distributions, since it depends only on their marginals. The maximum-entropy distribution α_i consistent with a set of marginals μ_i, however, is unique for a triangulated model and can be computed using the junction tree T(i) for the network [Cowell et al., 1999]. Each edge (c, c') in the tree T(i) is associated with a set of variables called the separator, s = c ∩ c'; we denote the set of separators as S(i). Note that each separator s, and likewise each complement c \ s, is also a clique of the original graph, since it is a subclique of a larger clique. We can then define the maximum-entropy distribution α_i(y) as follows:

    α_i(y) = Π_{c∈T(i)} μ_{i,c}(y_c) / Π_{s∈S(i)} μ_{i,s}(y_s),               (5.13)

where, by convention, 0/0 ≡ 0.

Kernels. Note that the solution is a weighted combination of local basis functions, and the objective of Eq. (5.12) can be expressed in terms of dot products between local basis functions:

    Δf_{i,c}(y_c)⊤ Δf_{j,c̄}(ȳ_c̄) = [f(x_c(i), y_c(i)) − f(x_c(i), y_c)]⊤ [f(x_c̄(j), y_c̄(j)) − f(x_c̄(j), ȳ_c̄)].

Hence, we can locally kernelize our models and solve Eq. (5.12) efficiently. Kernels are typically defined on the inputs, e.g. k(x_c, x̄_c̄). We usually extend a kernel over the input space to the joint input and output space by simply defining

    f(x_c, y_c)⊤ f(x̄_c̄, ȳ_c̄) ≡ 1(y_c = ȳ_c̄) k(x_c, x̄_c̄).

Of course, other definitions are possible and may be useful when the assignments y_c in each clique have interesting structure. In our handwriting example, we use a polynomial kernel on the pixel values for the node cliques; in Sec. 6.2 we experiment with several kernels for this task. As in SVMs, the solutions to the max-margin QP are typically sparse in the μ variables. Hence, each log-potential in the network "remembers" only a small proportion of the relevant training data inputs.
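The joint input/output kernel above is a one-liner. In the sketch below, poly2 stands in for a polynomial kernel on pixel values of the kind mentioned in the text; both function names are illustrative:

```python
def joint_kernel(k, xc, yc, xc2, yc2):
    """f(x_c, y_c).f(x'_c, y'_c) = 1(y_c = y'_c) * k(x_c, x'_c)."""
    return k(xc, xc2) if yc == yc2 else 0.0

def poly2(u, v):
    """Degree-2 polynomial kernel on raw feature (e.g. pixel) vectors."""
    return (1.0 + sum(a * b for a, b in zip(u, v))) ** 2
```

Assignments with mismatched clique labels contribute nothing, so the kernel matrix is block-structured by output value.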
y3 . y3 The new marginal variables appear only in the constraints. y2 . y3 µ134 (y1 . y4 . Y3 . ∀y1 .2: Diamond Markov network (added triangulation edge is dashed and threenode marginals are in dashed rectangles). y3 . Y2 . y3 ) = µ12 (y1 . y2 ). y3 ) and µ134 (y1 . they do not add any new basis functions nor change the objective function. . if our graph is a 4cycle Y1 —Y2 —Y3 —Y4 —Y1 as in Fig. We can then use this new graph to produce the constraints on the marginals: µ123 (y1 . y1 µ134 (y1 . y3 ) = µ23 (y2 . UNTRIANGULATED MODELS 69 Figure 5. Y3 and Y1 . This will introduce new cliques Y1 . y2 . y4 ). y3 . y1 ∀y2 . we must address the problem by triangulating the graph. y3 ).4. y4 ) = µ34 (y3 .2. 5.5. that is. 5. For example. y3 . µ123 (y1 . Y4 and the corresponding marginals. we can triangulate the graph by adding an arc Y1 —Y3 . y4 ). adding ﬁllin edges to ensure triangulation.4 Untriangulated models If the underlying Markov net is not chordal. y2 . y2 . y3 ). y3 . ∀y3 . µ123 (y1 . y4 ) = µ13 (y1 . ∀y1 .
In general, the number of constraints introduced is exponential in the number of variables in the new cliques, that is, in the treewidth of the graph. Unfortunately, even sparsely connected networks, for example the 2D grids often used in image analysis, have large treewidth. However, we can still solve the QP in the structured primal Eq. (5.6) or the structured dual Eq. (5.12) defined by an untriangulated graph, which enforces only local consistency of marginals. Such a formulation optimizes our objective only over a relaxation of the marginal polytope.

But do we really need the optimal parameters if we cannot perform exact inference? A more useful goal is to make sure that training error is minimized using the approximate inference procedure via the untriangulated LP. For instance, consider the separable case, where 100% accuracy is achievable on the training data by some parameter setting w such that approximate inference (using the untriangulated LP) produces integral solutions. Solving the problem as C → ∞ will find this solution, even though it may not be optimal (in terms of the norm of w) under exact inference. For C in an intermediate range, the formulation trades off fractionality of the untriangulated LP solutions against the complexity of the weights ‖w‖². The learned parameters produce very accurate approximate models in practice, as the experiments in Ch. 8 demonstrate.

Note that we could also strengthen the untriangulated relaxation without introducing an exponential number of constraints. For example, we can add the positive semidefinite constraints on the marginals μ used by Wainwright and Jordan [2003], which tend to improve the approximation of the marginal polytope; however, they are often much more expensive. This and other more complex relaxations are a very interesting area for future development.
The approximate QP does not guarantee that the learned model, under exact inference, minimizes the true objective: the (upper bound on) empirical risk plus regularization. We conjecture, however, that the parameters learned by the approximate QP do minimize training error under the approximate inference procedure to some degree.

5.5 Generalization bound

In this section, we show a generalization bound for the task of structured classification that allows us to relate the error rate on the training set to the generalization error.
To the best of our knowledge, this bound is the first to deal with structured error, such as the Hamming distance. Our analysis of the Hamming loss allows us to prove a significantly stronger result than previous bounds for the 0/1 loss, as we detail below.

Our goal in structured classification is often to minimize the number of misclassified labels, or the Hamming distance between y and h(x). An appropriate error function is the average per-label loss:

    L(w, x, y) = (1/L) ℓ_H(y, arg max_{y'} w⊤f(x, y')),

where L is the number of label variables in y. As in other generalization bounds for margin-based classifiers, we relate the generalization error to the margin of the classifier. To state the bound, we need to define several other factors it depends upon. Let N_c be the maximum number of cliques in G(x), V_c the maximum number of values in a clique Y_c, q the maximum number of cliques that have a variable in common, and R_c an upper bound on the 2-norm of the clique basis functions.

Consider an upper bound on the above loss:

    L(w, x, y) ≤ L̃(w, x, y) = (1/L) max_{y': w⊤f(y) ≤ w⊤f(y')} ℓ_H(y, y').

This upper bound is tight if y = arg max_{y'} w⊤f(x, y'); otherwise, it is adversarial: it picks, from all y' which are better (w⊤f(y) ≤ w⊤f(y')), one that maximizes the Hamming distance from y. We can now define a γ-margin per-label loss:

    Lγ(w, x, y) = (1/L) max_{y': w⊤f(y) ≤ w⊤f(y') + γ ℓ_H(y, y')} ℓ_H(y, y'),

so that L(w, x, y) ≤ L̃(w, x, y) ≤ Lγ(w, x, y). This upper bound is even more adversarial: it is tight if y = arg max_{y'} w⊤f(x, y'); otherwise, it picks, from all y' which are better when helped by γ ℓ_H(y, y'), one that maximizes the Hamming distance from y. Note that the loss we minimize in the max-margin formulation is very closely related (although not identical) to this upper bound. We can now prove that the generalization accuracy of any hypothesis w is bounded by its empirical γ-margin per-label loss, plus a term that grows inversely with the margin. Consider a first-order sequence
model as an example, with L as the maximum length and V as the number of values a variable takes. Then N_c = 2L − 1, since we have L node cliques and L − 1 edge cliques; V_c = V², because of the edge cliques; and q = 3, since a node in the middle of the sequence participates in 3 cliques: the previous-current edge clique, the node clique itself, and the current-next edge clique. Thus, for sequences, our bound has a logarithmic dependence on the number of cliques (ln N_c) and depends only on the 2-norm of the basis functions per clique (R_c). This is a significant gain over the previous result of Collins [2001] for 0/1 loss, which has a linear dependence (inside the square root) on the number of nodes (L) and depends on the joint 2-norm of all of the basis functions for an example (which is ∼ N_c R_c). Finally, note that if L/m = O(1) (for example, in OCR, if the number of instances is at least a constant times the length of a word), then our bound is independent of the number of labels L. Such a result was, until now, an open problem for margin-based sequence classification [Collins, 2001].
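The γ-margin per-label loss defined above can be computed exactly for tiny models by enumeration, which is useful for building intuition about how it upper-bounds the per-label loss (all names and the toy feature map in the test are illustrative, not from the thesis):

```python
import itertools

LABELS = (0, 1)

def per_label_losses(w, feats, x, y_true, gamma):
    """Return (upper bound on average per-label loss, gamma-margin per-label
    loss), both by brute-force enumeration of competing assignments y'."""
    n = len(y_true)
    ys = list(itertools.product(LABELS, repeat=n))
    score = lambda y: sum(wj * fj for wj, fj in zip(w, feats(x, y)))
    ham = lambda y: sum(a != b for a, b in zip(y, y_true))
    s_true = score(y_true)
    # Worst Hamming distance among y' scoring at least as high as y_true.
    l_tilde = max(ham(y) for y in ys if s_true <= score(y))
    # Same, but competitors are "helped" by gamma times their Hamming distance.
    l_gamma = max(ham(y) for y in ys if s_true <= score(y) + gamma * ham(y))
    return l_tilde / n, l_gamma / n
```

Raising γ admits more competitors, so the γ-margin loss is monotonically non-decreasing in γ and always dominates the 0-margin upper bound.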
Theorem 5.5.1 For the family of hypotheses parameterized by w and any δ > 0, there exists a constant K such that for any per-label margin γ > 0 and m > 1 samples, with probability at least 1 − δ, the expected per-label loss is bounded by:

    E_D[L(w, x, y)] ≤ E_S[Lγ(w, x, y)] + sqrt{ (K/m) [ (R_c² ‖w‖² q² / γ²) (ln m + ln N_c + ln V_c) + ln(1/δ) ] }.

Proof: See Appendix A.1 for the proof details and the exact value of the constant K.

The first term upper-bounds the training error of w; low loss E_S[Lγ(w, x, y)] at a high margin γ quantifies the confidence of the prediction model. The second term depends on ‖w‖/γ, which corresponds to the complexity of the classifier (normalized by the margin level). Thus, the result provides a bound on the generalization error that trades off the effective complexity of the hypothesis space with the training error. The proof uses a covering-number argument analogous to previous results for SVMs [Zhang, 2002]. However, we propose a novel method for covering the space of structured prediction models by using a cover of the individual clique basis function differences Δf_{i,c}(y_c). This new type of cover is polynomial in the number of cliques, yielding significant improvements in the bound.
5.6 Related work

The application of margin-based estimation methods to parsing and sequence modeling was pioneered by Collins [2001] using the Voted-Perceptron algorithm [Freund & Schapire, 1998]. He provides generalization guarantees (for 0/1 loss) that hold for the separable case and depend on the number of mistakes the perceptron makes before convergence. His generalization bound (for 0/1 loss), based on the SVM-like margin, has a linear dependence (inside the square root) on the number of nodes (L); it also depends on the joint 2-norm of all of the basis functions for an example (which is ∼ N_c R_c). By considering the more natural Hamming loss, we achieve a much tighter analysis. Remarkably, our bound does not explicitly depend on the length of the sequence, although undoubtedly the number of mistakes does. In a follow-up paper, Collins [2004] also suggested an SVM-like formulation (with exponentially many constraints) and a constraint generation method for solving it. With constraint generation, however, the corresponding problem needs to be re-solved (or at least approximately re-solved) after each additional constraint is added, which is prohibitively expensive for large numbers of examples and label variables.

Altun et al. [2003] have applied the exponential-size formulation with constraint generation we described in Ch. 4 to problems in natural language processing. In a follow-up paper, Tsochantaridis et al. [2004] show that only a polynomial number of constraints need to be generated to guarantee a fixed level of precision of the solution. However, the number of constraints in many important cases is several orders higher (in L) than in the approach we present.

The work of Guestrin et al. [2003] presents LP decompositions based on graphical model structure for the value function approximation problem in factored MDPs (Markov decision processes with structure). Describing the exact setting is beyond our scope, but it suffices to say that our original decomposition of the max-margin QP was inspired by the proposed technique for transforming an exponential set of constraints into a polynomial one using a triangulated graph.

There has been a recent explosion of work in maximum conditional likelihood estimation of Markov networks.
The work of Lafferty et al. [2001] has inspired many applications in natural language processing, computational biology, computer vision and relational modeling [Sha
& Pereira, 2003; Pinto et al., 2003; Sutton et al., 2004; Kumar & Hebert, 2003; Taskar et al., 2002; Taskar et al., 2003b]. As in the case of logistic regression, maximum conditional likelihood estimation of Markov networks can also be kernelized [Altun et al., 2004; Lafferty et al., 2004]. However, the solutions are non-sparse, and the proposed algorithms are forced to use greedy selection of support vectors or heuristic pruning methods.

5.7 Conclusion

We use graph decomposition to derive an exact, compact, convex max-margin formulation for Markov networks with sequence and other low-treewidth structure. Our formulation avoids the exponential blow-up in the number of constraints in the max-margin QP that plagued previous approaches. The seamless integration of kernels with graphical models allows us to create very rich models that leverage the immense amount of research in kernel design and graphical model decompositions. We also use approximate graph decomposition to derive a compact approximate formulation for Markov networks in which inference is intractable. We provide theoretical guarantees on the average per-label generalization error of our models in terms of the training set margin. Our generalization bound significantly tightens previous results of Collins [2001] and suggests possibilities for analyzing per-label generalization properties of graphical models. In the next chapter, we present an efficient algorithm that exploits graphical model inference, and we show experiments on a large handwriting recognition task that utilize the powerful representational capability of kernels.
Chapter 6

M3N algorithms and experiments

Although the number of variables and constraints in the structured dual in Eq. (5.12) is polynomial in the size of the data, the problem is, unfortunately, often too large for standard QP solvers, even for small training sets. Instead, we use a coordinate dual ascent method analogous to the sequential minimal optimization (SMO) used for SVMs [Platt, 1999]. We apply our M3N framework and structured SMO algorithm to a handwriting recognition task, and we show that our models significantly outperform other approaches by incorporating high-dimensional decision boundaries of polynomial kernels over character images while capturing correlations between consecutive characters.

6.1 Solving the M3N QP

Let us begin by considering the primal and dual QPs for multiclass SVMs:

    min (1/2)‖w‖² + C Σ_i ξ_i                    max Σ_{i,y} α_i(y) ℓ_i(y) − (C/2) ‖Σ_{i,y} α_i(y) Δf_i(y)‖²
    s.t. w⊤Δf_i(y) ≥ ℓ_i(y) − ξ_i, ∀i, ∀y;       s.t. Σ_y α_i(y) = 1, ∀i;  α_i(y) ≥ 0, ∀i, ∀y.

The KKT conditions [Bertsekas, 1999; Boyd & Vandenberghe, 2004] provide sufficient and necessary criteria for optimality of a dual solution α. As we describe below, these conditions have a certain locality with respect to each example i, which allows us to perform
the search for the optimal α by repeatedly considering one example at a time. A feasible dual solution α, and the primal solution defined by

    w = C Σ_{i,y} α_i(y) Δf_i(y),   ξ_i = max_y [ℓ_i(y) − w⊤Δf_i(y)] = max_y [ℓ_i(y) + w⊤f_i(y)] − w⊤f_i(y(i)),     (6.1)

are optimal if they satisfy the following two types of constraints:

    α_i(y) = 0 ⇒ w⊤Δf_i(y) > ℓ_i(y) − ξ_i;      (KKT1)
    α_i(y) > 0 ⇒ w⊤Δf_i(y) = ℓ_i(y) − ξ_i.      (KKT2)

We can express these conditions as

    α_i(y) = 0 ⇒ w⊤f_i(y) + ℓ_i(y) < max_{y'≠y} [w⊤f_i(y') + ℓ_i(y')];      (KKT1)
    α_i(y) > 0 ⇒ w⊤f_i(y) + ℓ_i(y) ≥ max_{y'≠y} [w⊤f_i(y') + ℓ_i(y')].      (KKT2)

To simplify the notation, we define

    v_i(y) = w⊤f_i(y) + ℓ_i(y),   v̄_i(y) = max_{y'≠y} [w⊤f_i(y') + ℓ_i(y')].     (6.2)

With these definitions, we have:

    α_i(y) = 0 ⇒ v_i(y) < v̄_i(y);      (KKT1)
    α_i(y) > 0 ⇒ v_i(y) ≥ v̄_i(y).      (KKT2)

In practice, we will enforce the KKT conditions up to a given tolerance ε > 0:

    α_i(y) = 0 ⇒ v_i(y) ≤ v̄_i(y) + ε;      (KKT1)
    α_i(y) > 0 ⇒ v_i(y) ≥ v̄_i(y) − ε.      (KKT2)

Essentially, α_i(y) can be zero only if v_i(y) is at most ε larger than all the others; conversely, α_i(y) can be nonzero only if v_i(y) is at most ε smaller than all the others. Note that the normalization constraints on the dual variables α are local to each example i. This allows us to perform dual block-coordinate ascent, where a block corresponds to
the vector of dual variables α_i for a single example i. The general form of the block-coordinate ascent algorithm is shown in Fig. 6.1. The algorithm starts with a feasible dual solution α and improves the objective block-wise until all KKT conditions are satisfied.

    1. Initialize: α_i(y) = 1(y = y(i)), ∀i, ∀y.
    2. Set violation = 0.
    3. For each i:
    4.   If α_i violates (KKT1) or (KKT2):
    5.     Set violation = 1.
    6.     Find a feasible α'_i such that Q(α'_i, α_−i) > Q(α_i, α_−i) and set α_i = α'_i.
    7. If violation = 1, goto 2.

Figure 6.1: Block-coordinate dual ascent.

When optimizing with respect to a single block i, the objective function can be split into two terms, Q(α) = Q(α_−i) + Q(α_i, α_−i), where α_−i denotes all dual variables α_k for k other than i. Only the second term, Q(α_i, α_−i), matters for optimizing with respect to α_i. This allows us to focus on efficient updates to a single block α_i at a time, maintaining the feasibility of the dual. As long as the local ascent step over α_i is guaranteed to improve the objective when the KKT conditions are violated, the algorithm will converge to the global maximum in a finite number of steps (within the precision ε). Checking the conditions requires computing w and ξ from α according to Eq. (6.1).

Let α'_i(y) = α_i(y) + λ(y), where Σ_y λ(y) = 0 and α_i(y) + λ(y) ≥ 0, so that α'_i is feasible. We can write the objective Q(α_−i) + Q(α'_i, α_−i) in terms of λ and α:

    Σ_{j,y} α_j(y) ℓ_j(y) + Σ_y λ(y) ℓ_i(y) − (C/2) ‖Σ_{j,y} α_j(y) Δf_j(y) + Σ_y λ(y) Δf_i(y)‖².
By dropping all terms that do not involve λ and making the substitution w = C Σ_{j,y} α_j(y) Δf_j(y), we get:

    Σ_y λ(y) ℓ_i(y) − w⊤ Σ_y λ(y) Δf_i(y) − (C/2) ‖Σ_y λ(y) Δf_i(y)‖².

Since Σ_y λ(y) = 0, we have

    Σ_y λ(y) Δf_i(y) = Σ_y λ(y) f_i(y(i)) − Σ_y λ(y) f_i(y) = −Σ_y λ(y) f_i(y).

Below we also make the substitution v_i(y) = w⊤f_i(y) + ℓ_i(y), to get the optimization problem for λ:

    max Σ_y λ(y) v_i(y) − (C/2) ‖Σ_y λ(y) f_i(y)‖²                             (6.3)
    s.t. Σ_y λ(y) = 0;  α_i(y) + λ(y) ≥ 0, ∀y.

6.1.1 SMO

We do not need to solve the optimization subproblem above in full at each pass through the data; all that is required is an ascent step, not a full optimization. The sequential minimal optimization (SMO) approach takes an ascent step that modifies the least number of variables. In our case, each α_i lies on the simplex, so we must change at least two variables in order to respect the normalization constraint (by moving weight from one dual variable to another). We address a strategy for selecting the two variables in the next section; for now, assume we have picked λ(y') and λ(y''). Then δ = λ(y') = −λ(y''), in order for λ to sum to zero. The optimization problem becomes a single-variable quadratic program in δ:

    max [v_i(y') − v_i(y'')] δ − (C/2) ‖f_i(y') − f_i(y'')‖² δ²
    s.t. α_i(y') + δ ≥ 0;  α_i(y'') − δ ≥ 0.
Figure 6.2: Representative examples of the SMO subproblem. The horizontal axis represents δ, with two vertical lines depicting the lower and upper bounds c and d; the vertical axis represents the objective. The optimum occurs at the maximum of the parabola if that point is feasible, and at the upper or lower bound otherwise.
With a = v_i(y') − v_i(y''), b = C ‖f_i(y') − f_i(y'')‖², c = −α_i(y'), and d = α_i(y''), we have:

    max [a δ − (b/2) δ²]  s.t.  c ≤ δ ≤ d,                                      (6.4)
where the optimum is achieved at the maximum of the parabola, a/b, if c ≤ a/b ≤ d, and at the boundary c or d otherwise (see Fig. 6.2). Hence the solution is given by simply clipping a/b:

    δ* = max(c, min(d, a/b)).

The key advantage of SMO is the simplicity of this update. Computing the coefficients involves dot products (or kernel evaluations) to compute w⊤f_i(y') and w⊤f_i(y''), as well as (f_i(y') − f_i(y''))⊤(f_i(y') − f_i(y'')).
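The clipped update is a direct transcription of the formula above into a helper (the function name is illustrative):

```python
def smo_step(a, b, c, d):
    """Maximize a*delta - (b/2)*delta**2 over c <= delta <= d, with b > 0:
    clip the unconstrained maximizer a/b to the feasible interval."""
    return max(c, min(d, a / b))
```

When a/b falls outside [c, d], the optimum sits at the violated bound, matching the cases illustrated in Fig. 6.2.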
6.1.2 Selecting SMO pairs
How do we actually select such a pair so as to guarantee that we make progress in optimizing the objective? Note that at least one of the assignments, y', must violate (KKT1) or (KKT2),
because otherwise α_i is already optimal with respect to the current α_−i. The selection algorithm is outlined in Fig. 6.3.

    1. Set violation = 0.
    2. For each y:
    3.   KKT1: If α_i(y) = 0 and v_i(y) > v̄_i(y) + ε,
    4.     set y' = y and violation = 1 and goto 7.
    5.   KKT2: If α_i(y) > 0 and v_i(y) < v̄_i(y) − ε,
    6.     set y' = y and violation = 2 and goto 7.
    7. If violation > 0:
    8.   For each y ≠ y':
    9.     If violation = 1 and α_i(y) > 0,
    10.      set y'' = y and goto 13.
    11.    If violation = 2 and v_i(y) > v_i(y'),
    12.      set y'' = y and goto 13.
    13. Return y' and y''.

Figure 6.3: SMO pair selection.

The first variable in the pair, y', corresponds to a violated condition, while the second variable, y'', is chosen to guarantee that solving Eq. (6.3) restricted to the pair will result in improving the objective. There are two cases, corresponding to violations of KKT1 and KKT2.

Case KKT1: α_i(y') = 0 but v_i(y') > v̄_i(y') + ε. This is the case where (i, y') is not a support vector but should be. We would like to increase α_i(y'), so we need some α_i(y'') > 0 to borrow from. There will always be such a y'', since Σ_y α_i(y) = 1 and α_i(y') = 0.
Since v_i(y') > v̄_i(y') + ε, we have v_i(y') > v_i(y'') + ε, so the linear coefficient in Eq. (6.4) is a = v_i(y') − v_i(y'') > ε. Hence the unconstrained maximum is positive, a/b > 0. Since the upper bound d = α_i(y'') > 0, we have enough freedom to improve the objective.

Case KKT2: α_i(y') > 0 but v_i(y') < v̄_i(y') − ε. This is the case where (i, y') is a support vector but should not be. We would like to decrease α_i(y'), so we need v_i(y'') > v_i(y'), so that a/b < 0. There will always be such a y'', since v_i(y') < v̄_i(y') − ε. Since the
lower bound c = −αi(y′) < 0, again we have enough freedom to improve the objective. Since at each iteration we are guaranteed to improve the objective if the KKT conditions are violated, and the objective is bounded, we can use SMO in the block-coordinate ascent algorithm to converge in a finite number of steps. To the best of our knowledge, there are no upper bounds on the speed of convergence of SMO, but experimental evidence has shown it to be a very effective algorithm for SVMs [Platt, 1999]. Of course, we can improve the speed of convergence by adding heuristics in the selection of the pair, as long as we guarantee that improvement is possible when the KKT conditions are violated.
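The pair-selection scheme of Fig. 6.3 translates almost line by line into code. In this sketch (a dictionary-based illustration with names of our choosing), alpha maps each assignment y to αi(y) and v maps it to vi(y):

```python
def select_smo_pair(alpha, v, eps=1e-6):
    """Pick (y1, y2) per Fig. 6.3: y1 violates a KKT condition, y2
    guarantees the one-variable QP can improve the objective.

    alpha[y] = current dual value; v[y] = w.f_i(y) + loss_i(y).
    """
    ys = list(v)
    v_bar = {y: max(v[z] for z in ys if z != y) for y in ys}
    y1, violation = None, 0
    for y in ys:
        if alpha[y] == 0 and v[y] > v_bar[y] + eps:
            y1, violation = y, 1    # KKT1: should be a support vector
            break
        if alpha[y] > 0 and v[y] < v_bar[y] - eps:
            y1, violation = y, 2    # KKT2: should not be a support vector
            break
    if violation == 0:
        return None                 # KKT conditions hold within eps
    for y in ys:
        if y == y1:
            continue
        if violation == 1 and alpha[y] > 0:
            return y1, y            # borrow dual mass from y
        if violation == 2 and v[y] > v[y1]:
            return y1, y            # shift mass toward higher v
    return None
```

Returning None signals that αi is (approximately) optimal for the current α−i.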
6.1.3 Structured SMO
Clearly, we cannot perform the above SMO updates in the space of α directly for structured problems, since the number of α variables is exponential. The constraints on the µ variables are much more complicated, since each µ participates not only in non-negativity and normalization constraints, but also in clique-agreement constraints. We cannot limit our ascent steps to changing only two µ variables at a time, because in order to make a change in one clique and stay feasible, we need to modify variables in overlapping cliques. Fortunately, we can perform SMO updates on the α variables implicitly, in terms of the marginal dual variables µ. The diagram in Fig. 6.4 shows the abstract outline of the algorithm. The key steps in the SMO algorithm are checking for violations of the KKT conditions, selecting the pair y′ and y″, computing the corresponding coefficients a, b, c, d, and updating the dual. We will show how to do these operations by doing all the hard work in terms of the polynomially many marginal µi variables and auxiliary “max-marginal” variables.
Structured KKT conditions

As before, we define vi(y) = w·fi(y) + ℓi(y), and let v̄i(y) = maxy′≠y vi(y′). The KKT conditions are, for all y:

  αi(y) = 0 ⇒ vi(y) ≤ v̄i(y);    αi(y) > 0 ⇒ vi(y) ≥ v̄i(y).    (6.5)
[Figure 6.4 here: three stages labeled “select & lift”, “SMO update”, and “project”.]

Figure 6.4: Structured SMO diagram. We use marginals µ to select an appropriate pair of instantiations y′ and y″ and reconstruct their α values. We then perform the simple SMO update and project the result back onto the marginals.
Of course, we cannot check these conditions explicitly. Instead, we define max-marginals for each clique in the junction tree c ∈ T(i) and each of its values yc, as:

  vi,c(yc) = max_{y∼yc} [w·fi(y) + ℓi(y)],
  αi,c(yc) = max_{y∼yc} αi(y).

We also define v̄i,c(yc) = max_{y′c≠yc} vi,c(y′c) = max_{y≁yc} [w·fi(y) + ℓi(y)]. Note that we do not explicitly represent αi(y), but we can reconstruct the maximum-entropy one from the marginals µi by using Eq. (5.13). Both vi,c(yc) and αi,c(yc) can be computed by using the Viterbi algorithm (one pass of propagation towards the root and one outwards from the root [Cowell et al., 1999]). We can now express the KKT conditions in terms of the max-marginals, for each clique c ∈ T(i) and its values yc:

  αi,c(yc) = 0 ⇒ vi,c(yc) ≤ v̄i,c(yc);    αi,c(yc) > 0 ⇒ vi,c(yc) ≥ v̄i,c(yc).    (6.6)
Theorem 6.1.1 The KKT conditions in Eq. (6.5) and Eq. (6.6) are equivalent.

Proof: Eq. (6.5) ⇒ Eq. (6.6). Assume Eq. (6.5). Suppose we have a violation of KKT1: for some c, yc, αi,c(yc) = 0, but vi,c(yc) > v̄i,c(yc). Since αi,c(yc) = max_{y∼yc} αi(y) = 0,
then αi(y) = 0 for all y ∼ yc. Since vi,c(yc) > v̄i,c(yc), the maximizer y ∼ yc with vi(y) = vi,c(yc) satisfies vi(y) > vi(y′) for every y′ ≁ yc, so vi,c(yc) is the optimal value: maxy′ vi(y′) = vi,c(yc). Since Σy′ αi(y′) = 1, there is some ŷ with αi(ŷ) > 0, and ŷ ≁ yc because all assignments consistent with yc have zero weight. By Eq. (6.5), αi(ŷ) > 0 implies vi(ŷ) ≥ v̄i(ŷ) ≥ vi(y). But vi(y) > vi(ŷ) since ŷ ≁ yc, a contradiction.

Now suppose we have a violation of KKT2: for some c, yc, αi,c(yc) > 0, but vi,c(yc) < v̄i,c(yc). Since αi,c(yc) > 0, there exists y ∼ yc with αi(y) > 0, and by Eq. (6.5), vi(y) ≥ v̄i(y). But vi(y) ≤ vi,c(yc) < v̄i,c(yc), and v̄i(y) upper-bounds the value of the best assignment inconsistent with yc, so v̄i(y) ≥ v̄i,c(yc) > vi(y), a contradiction.

Eq. (6.6) ⇒ Eq. (6.5). Assume Eq. (6.6). Suppose we have a violation of KKT1: for some y, αi(y) = 0, but vi(y) > v̄i(y), so that y is the unique optimum of vi(·). Then for every clique c ∈ T(i) and yc ∼ y, vi,c(yc) = vi(y) > v̄i,c(yc), and by Eq. (6.6) we must have αi,c(yc) > 0, ∀c ∈ T(i). It follows that all the y-consistent marginals µi are positive as well: µi,c(yc) ≥ αi,c(yc) > 0, ∀c ∈ T(i) (since the sum upper-bounds the max). But αi(y) = Π_{c∈T(i)} µi,c(yc) / Π_{s∈S(i)} µi,s(ys), so if all the y-consistent marginals are positive, then αi(y) > 0, a contradiction.

Now suppose we have a violation of KKT2: for some y, αi(y) > 0, but vi(y) < v̄i(y). Since αi(y) > 0, all the y-consistent αi max-marginals are positive: αi,c(yc) ≥ αi(y) > 0, ∀c ∈ T(i), yc ∼ y. By Eq. (6.6), vi,c(yc) ≥ v̄i,c(yc), ∀c ∈ T(i), yc ∼ y; since trivially maxy′ vi(y′) = max(vi,c(yc), v̄i,c(yc)) for any clique c and clique assignment yc, this means that vi,c(yc) is the optimal value: maxy′ vi(y′) = vi,c(yc) for every clique c with yc ∼ y. We will show that this implies vi(y) = maxy′ vi(y′), which contradicts vi(y) < v̄i(y).

To show this, we consider any two adjacent nodes in the tree T(i), cliques a and b, with separator s. We need some notation for the two parts of the tree induced by cutting the edge between a and b. Let {A, B} be a partition of the nodes of T(i) (cliques of C(i)) resulting from removing the edge between a and b, such that a ∈ A and b ∈ B. We denote the two subsets of an assignment y′ as y′A and y′B (with overlap at y′s), and the value of an assignment decomposes into two parts, vi(y′) = vi,A(y′A) + vi,B(y′B), where vi,A(y′A) and vi,B(y′B) only count the contributions of their constituent cliques. Take any maximizer y(a) ∼ ya with vi(y(a)) = vi,a(ya) and any maximizer y(b) ∼ yb with vi(y(b)) = vi,b(yb), where ya, yb ∼ y; both achieve the optimal value, as shown above. We decompose the two associated values into the corresponding parts: vi(y(a)) = vi,A(y(a)A) + vi,B(y(a)B) and vi(y(b)) = vi,A(y(b)A) + vi,B(y(b)B). We create a new assignment that combines the best of the two: y(a∪b) = y(a)A ∪ y(b)B. This is a consistent assignment, since y(a) and y(b) agree on the separator s, which by definition agrees with y. Since vi(y(a)) is optimal, its B-part maximizes vi,B among assignments consistent with ys, so vi,B(y(a)B) ≥ vi,B(y(b)B); symmetrically, vi,A(y(b)A) ≥ vi,A(y(a)A), since we essentially fixed the separator s and maximized over the rest of the variables in A and B separately. Together with vi(y(a)) = vi(y(b)), these inequalities force equalities, so the combined assignment is also optimal: vi(y(a∪b)) = vi,A(y(a)A) + vi,B(y(b)B) = vi(y(a)) = maxy′ vi(y′). That is, there is an optimal assignment that clamps the values of both a and b to agree with y. By chaining this argument over the edges of T(i) from the root to all the leaves, we obtain an optimal assignment that agrees with y on every clique, namely y itself: vi(y) = maxy′ vi(y′), a contradiction.

Structured SMO pair selection and update

As in the multiclass problems, we select the first variable in the pair, y′, to correspond to a violated condition, while the second variable, y″, is chosen to guarantee that solving Eq. (6.3) will result in improving the objective. As before, we enforce approximate KKT conditions (within ε) in the algorithm in Fig. 6.5. Having selected y′ and y″, the coefficients for the one-variable QP in Eq. (6.4) are a = vi(y′) − vi(y″), b = C‖fi(y′) − fi(y″)‖², c = −αi(y′), d = αi(y″). We have two cases, corresponding to violation of KKT1 and violation of KKT2.

Case KKT1: αi,c(yc) = 0 but vi,c(yc) > v̄i,c(yc) + ε. This is the case where (i, y′) is not a support vector but should be. We set y′ = arg max_{y∼yc} vi(y), which guarantees that vi(y′) = vi,c(yc) > v̄i,c(yc) + ε and αi(y′) = 0. We would like to increase αi(y′), so we need αi(y″) > 0 to borrow from. There will always be such a y″ (with y″c ≠ yc), since Σy αi(y) = 1 and αi(y) = 0 for all y ∼ yc. We can find one by choosing y′c for which αi,c(y′c) > 0, which guarantees that for y″ = arg max_{y∼y′c} αi(y), we have αi(y″) = αi,c(y′c) > 0. Since vi(y″) ≤ vi,c(y′c) ≤ v̄i,c(yc) < vi(y′) − ε, the linear coefficient in Eq. (6.4) is a = vi(y′) − vi(y″) > ε. Hence the unconstrained maximum is positive, a/b > 0, and since the upper bound d = αi(y″) > 0, we have enough freedom to improve the objective.

Case KKT2: αi,c(yc) > 0 but vi,c(yc) < v̄i,c(yc) − ε. This is the case where (i, y′) is a support vector but should not be. We set y′ = arg max_{y∼yc} αi(y), which guarantees that αi(y′) = αi,c(yc) > 0 and vi(y′) ≤ vi,c(yc) < v̄i,c(yc) − ε. We would like to decrease αi(y′), so we need vi(y″) > vi(y′), so that a/b < 0. We can find one by choosing y′c for which vi,c(y′c) > vi,c(yc), and setting y″ = arg max_{y∼y′c} vi(y), so that vi(y″) = vi,c(y′c) > vi(y′) + ε. Since the lower bound c = −αi(y′) < 0, again we have enough freedom to improve the objective.
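The eventual update δ∗ on the selected pair is mirrored on the marginals: only the marginals consistent with y′ or y″ change. A minimal sketch for a chain (the list/dict marginal layout is our own illustration):

```python
def project_update(mu_node, mu_edge, y1, y2, delta):
    """Apply alpha(y1) += delta, alpha(y2) -= delta to chain marginals:
    mu_c(y_c) changes by +delta if y_c agrees with y1, -delta if with y2."""
    n = len(mu_node)
    for t in range(n):
        mu_node[t][y1[t]] += delta
        mu_node[t][y2[t]] -= delta          # cancels where y1, y2 agree
    for t in range(n - 1):
        mu_edge[t][(y1[t], y1[t + 1])] += delta
        mu_edge[t][(y2[t], y2[t + 1])] -= delta
```

Because the same δ is added along y1 and subtracted along y2, the normalization and clique-agreement constraints on the marginals are preserved.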
Having computed new values αi(y′) = αi(y′) + δ∗ and αi(y″) = αi(y″) − δ∗, we need to project this change onto the marginal dual variables µi. The only marginals affected are the ones consistent with y′ and/or y″, and the change is very simple:

  µi,c(yc) = µi,c(yc) + δ∗ 1(yc ∼ y′) − δ∗ 1(yc ∼ y″),  ∀c ∈ T(i), yc.

 1. Set violation = 0.
 2. For each c ∈ T(i), yc,
 3.   KKT1: If αi,c(yc) = 0 and vi,c(yc) > v̄i,c(yc) + ε,
 4.     Set y′ = arg max_{y∼yc} vi(y) and violation = 1 and goto 7.
 5.   KKT2: If αi,c(yc) > 0 and vi,c(yc) < v̄i,c(yc) − ε,
 6.     Set y′ = arg max_{y∼yc} αi(y) and violation = 2 and goto 7.
 7. If violation > 0,
 8.   For each y′c ≠ yc,
 9.     If violation = 1 and αi,c(y′c) > 0,
10.      Set y″ = arg max_{y∼y′c} αi(y) and goto 13.
11.    If violation = 2 and vi,c(y′c) > vi,c(yc),
12.      Set y″ = arg max_{y∼y′c} vi(y) and goto 13.
13. Return y′ and y″.

Figure 6.5: Structured SMO pair selection.

6.2 Experiments

We selected a subset of ∼ 6100 handwritten words, with average length of ∼ 8 characters, from the data set collected by Kassel [1995] from 150 human subjects. Each word was divided into characters, and each character was rasterized into an image of 16 by 8 binary
pixels. (See Fig. 6.6(a).)

Figure 6.6: (a) 3 example words from the OCR data set. (b) OCR: Average per-character test error for logistic regression, CRFs, multiclass SVMs, and M3Ns, using linear, quadratic, and cubic kernels.

In our framework, the image for each word corresponds to x, a label of an individual character to Yj, and a labeling for a complete word to Y. Each label Yj takes values from one of 26 classes {a, . . . , z}. The data set is divided into 10 folds of ∼ 600 training and ∼ 5500 testing examples. The accuracy results, summarized in Fig. 6.6(b), are averages over the 10 folds, with a standard deviation between 0.1 and 1.

We implemented a selection of state-of-the-art classification algorithms: independent label approaches, which do not consider the correlation between neighboring characters — logistic regression, multiclass SVMs as described in Eq. (2.9), and one-against-all SVMs (whose performance was slightly lower than that of multiclass SVMs) — and sequence approaches — CRFs and our proposed M3 networks. Logistic regression and CRFs are both trained by maximizing the conditional likelihood of the labels given the features, using a zero-mean diagonal Gaussian prior over the parameters. The other methods are trained by margin maximization. Our features for each label Yj are the corresponding image of the jth character. For the sequence approaches (CRFs and M3), we used an indicator basis function to represent the correlation between Yj and Yj+1.
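For concreteness, one way such pair-indicator basis functions can be realized (the encoding below is a hypothetical sketch, not the exact feature set used in the experiments):

```python
def pair_indicator_features(labels, alphabet):
    """One indicator basis function per ordered label pair (Y_j, Y_{j+1}),
    accumulated over all adjacent positions of a word labeling."""
    idx = {pair: i for i, pair in enumerate(
        (a, b) for a in alphabet for b in alphabet)}
    f = [0] * len(idx)
    for pair in zip(labels, labels[1:]):
        f[idx[pair]] += 1
    return f
```

With the 26-letter alphabet this yields 676 pairwise indicator features per word.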
For margin-based methods (SVMs and M3Ns), we were able to use kernels (both quadratic and cubic were evaluated) to increase the dimensionality of the feature space. Using these high-dimensional feature spaces in CRFs is not feasible because of the enormous number of parameters. Fig. 6.6(b) shows two types of gains in accuracy: First, by using sequences, the error rate of our method using linear features is 16% lower than that of CRFs, and about the same as multiclass SVMs with cubic kernels. Second, by using kernels, we obtain another significant gain in accuracy: once we use cubic kernels, our error rate is 45% lower than that of CRFs and about 33% lower than the best previous approach. For comparison, the previously published results, although using a different setup (e.g., a larger training set), are about comparable to those of multiclass SVMs. Interestingly, it seems that, empirically, margin-based methods achieve a very significant gain over the respective likelihood-maximizing methods. We used the structured SMO algorithm with about 30–40 iterations through the data.

6.3 Related work

The kernel-adatron [Friess et al., 1998] and voted-perceptron [Freund & Schapire, 1998] algorithms for large-margin classifiers have a similar online optimization scheme. Collins [2001] has applied the voted perceptron to structured problems in natural language. Although head-to-head comparisons have not been performed, in our experiments with the OCR task, fewer passes (about 30–40) are needed for our algorithm than in the perceptron literature, particularly in the early iterations through the dataset. Recently, the Exponentiated Gradient [Kivinen & Warmuth, 1997] algorithm has been adopted to solve our structured QP for max-margin estimation [Bartlett et al., 2004]. Although the EG algorithm has attractive convergence properties, it has yet to be shown to learn faster than structured SMO.

6.4 Conclusion

In this chapter, we address the large (though polynomial) size of our quadratic program using an effective optimization procedure inspired by SMO. In our experiments with the OCR task, our sequence model significantly outperforms other approaches by incorporating high-dimensional decision boundaries of polynomial kernels over character images while capturing correlations between consecutive characters. In the next two chapters, we apply this framework to important classes of Markov networks for spatial and relational data. Overall, we believe that M3 networks will significantly further the applicability of high-accuracy margin-based methods to real-world structured data.
This subclass. we considered applications of sequencestructured Markov networks. 1986]. webpage 89 . a tractable structure is determined in advance. the network structure is learned. called associative Markov networks (AMNs). In some cases. we show that optimal learning is feasible for an important subclass of Markov networks — networks with attractive potentials. the network structure can mirror the hyperlink graph.. Such positive interactions capture the “guilt by association” pattern of reasoning present in many domains. the topology of the Markov network does not allow tractable inference. The chief computational bottleneck in applying Markov networks for other largescale prediction problems is inference. 1994].Chapter 7 Associative Markov networks In the previous chapter. subject to the constraint that inference on these networks is tractable. which is NPhard in general networks suitable in a broad range of practical Markov network structures. however. including gridtopology networks [Besag. In this chapter. which is usually highly interconnected. [Bach & Jordan. AMNs are a natural ﬁt object recognition and segmentation. in hypertext. in which connected (“associated”) variables tend to have the same label. such as the quadtree model used for image segmentation [Bouman & Shapiro. which allow very efﬁcient inference and learning. leading to computationally intractable networks. For example. In many cases. In other cases (e. contains networks of discrete variables with K labels and arbitrarysize clique potentials with K parameters that favor the same labels for all variables in the clique. One can address the tractability issue by limiting the structure of the underlying network.g. 2001]).
classification, and many other applications.

In the max-margin estimation framework, the inference subtask is one of finding the best joint (MAP) assignment to all of the variables in a Markov network. By contrast, other learning tasks (e.g., maximizing the conditional likelihood of the target labels given the features) require that we compute the posterior probabilities of different label assignments, rather than just the MAP. The MAP problem can naturally be expressed as an integer programming problem, and we use a linear program relaxation of this integer program in the min-max formulation. We show that, for associative Markov networks over binary variables (K = 2), this linear program provides exact answers. For K > 2, the approximate linear program is not guaranteed to be optimal, but we can bound its relative error. Our empirical results suggest that the solutions of the resulting approximate max-margin formulation work well in practice. By constraining the class of Markov networks to AMNs, our method is the first to allow training Markov networks of arbitrary connectivity and topology. We present an AMN-based method for segmentation of complex objects from 3D range data. The proposed learning formulation effectively and directly learns to exploit a large set of complex surface and volumetric features, while balancing the spatial coherence modeled by the AMN. Our models can be learned efficiently and, at run-time, scale up to tens of millions of nodes and edges.

7.1 Associative networks

Associative interactions arise naturally in the context of image processing, where nearby pixels are likely to have the same label [Besag, 1986; Boykov et al., 1999b]. In this setting, a common approach is to use a generalized Potts model [Potts, 1952], which penalizes assignments that do not have the same label across an edge: φij(k, l) = λij, ∀k ≠ l, and φij(k, k) = 1, where λij ≤ 1. For binary-valued Potts models, Greig et al. [1989] show that the MAP problem can be formulated as a min-cut in an appropriately constructed graph. Thus, the MAP problem can be solved exactly for this class of models in polynomial time. For the non-binary case (K > 2), the MAP problem
is NP-hard, but a procedure based on a relaxed linear program guarantees a factor 2 approximation of the optimal solution [Boykov et al., 1999b; Kleinberg & Tardos, 1999].

Our associative potentials extend the Potts model in several ways. Importantly, AMNs allow different labels to have different attraction strength: φij(k, k) = λij(k), where λij(k) ≥ 1, and φij(k, l) = 1, ∀k ≠ l. This additional flexibility is important in many domains, as different labels can have very diverse affinities. For example, foreground pixels tend to have locally coherent values, while background is much more varied. In a second important extension, AMNs admit non-pairwise interactions between variables, with potentials over cliques involving m variables, φ(yi1, . . . , yim). In this case, the clique potentials are constrained to have the same type of structure as the edge potentials: there are K parameters φc(k, . . . , k) = λc(k) ≥ 1, and the rest of the entries are set to 1. Using this additional expressive power, AMNs allow us to encode the pattern of (soft) transitivity present in many domains. For example, consider the problem of predicting whether two proteins interact [Vazquez et al., 2003]; this probability may increase if they both interact with another protein. This type of transitivity could be modeled by a ternary clique that has high λ for the assignment with all interactions present.

More formally, we define associative functions and potentials as follows.

Definition 7.1.1 A function g : Y → IR is associative for a graph G over K-ary variables if it can be written as:

  g(y) = Σ_{v∈V} Σ_{k=1}^K gv(k) 1(yv = k) + Σ_{c∈C\V} Σ_{k=1}^K gc(k) 1(yc = (k, . . . , k)),    gc(k) ≥ 0, ∀c ∈ C\V,

where V are the nodes and C are the cliques of the graph G. A set of potentials φ(y) is associative if φ(y) = e^{g(y)} and g(y) is associative.

7.2 LP Inference

We can write an integer linear program for the problem of finding the maximum of an associative function g(y).
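The associative form just defined, node terms plus same-label clique bonuses, can be evaluated directly; a small sketch (the dictionary layout is our own illustration):

```python
def associative_value(y, node_g, clique_g):
    """Evaluate g(y) = sum_v g_v(y_v) + sum_c g_c(k)*[all of c labeled k].

    y: dict node -> label; node_g: dict node -> list of K node scores;
    clique_g: dict (tuple of nodes) -> list of K nonnegative clique scores.
    """
    total = sum(node_g[v][y[v]] for v in node_g)
    for clique, g in clique_g.items():
        labels = {y[v] for v in clique}
        if len(labels) == 1:                # clique members agree on a label
            total += g[labels.pop()]
    return total
```

Note that the clique term contributes only when every variable in the clique takes the same label, and its coefficient is required to be nonnegative.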
We introduce a “marginal” variable µv(k) for each node v ∈ V and label k, which indicates whether node v has value k, and a variable µc(k) for each clique c (containing more than one variable) and label k, which represents the event that all nodes in the clique c have label k:

  max  Σ_{v∈V} Σ_{k=1}^K µv(k) gv(k) + Σ_{c∈C\V} Σ_{k=1}^K µc(k) gc(k)    (7.1)
  s.t.  µv(k) ≥ 0, ∀v ∈ V, k;    Σ_{k=1}^K µv(k) = 1, ∀v ∈ V;
        µc(k) ≤ µv(k), µc(k) ∈ {0, 1}, ∀c ∈ C \ V, v ∈ c, k.

Note that we substitute the constraint µc(k) = min_{v∈c} µv(k) by the linear inequality constraints µc(k) ≤ µv(k). This works because the coefficient gc(k) is non-negative and we are maximizing the objective function; hence at the optimum, µc(k) = min_{v∈c} µv(k), which, when the µv(k) are binary, is equivalent to µc(k) = Π_{v∈c} µv(k).

Theorem 7.2.1 If K = 2, for any associative function g, the linear relaxation of Eq. (7.1) (where the constraints µc(k) ∈ {0, 1} are replaced by µc(k) ≥ 0) has an integral optimal solution.

See Appendix A.1 for the proof. This result states that the MAP problem in binary AMNs is tractable, regardless of network topology or clique size. It can be shown that in the binary case, the linear relaxation of Eq. (7.1) is guaranteed to produce an integer solution when a unique solution exists. In the non-binary case (K > 2), these LPs can produce fractional solutions, and we use a rounding procedure to get an integral solution.

Theorem 7.2.2 If K > 2, for any associative function g, the linear relaxation of Eq. (7.1) has a solution that is larger than the solution of the integer program by at most the number of variables in the largest clique.

In the appendix, we also show that the approximation ratio of the rounding procedure is the inverse of the size of the largest clique (e.g., 1/2 for pairwise networks). Although artificial examples with fractional solutions can be easily constructed by using symmetry, it seems that in real data such symmetries are often broken; in fact, in all our experiments with K > 2 on real data, we never encountered fractional solutions.
First. For each node term corresponding to node v we add a vertex v to the mincut graph. and create an edge of weight ∆v  from v to either 1 or 2. along with the 1 and 2 terminals.3. The reason for that is that the ﬁnal mincut graph must consist of only positive weights. Such a term corresponds to the node potential contribution to our objective function for node v. and then how to combine the two. V2 ) cut.3 Mincut inference We can also use efﬁcient mincut algorithms to perform exact inference on the learned models for K = 2 and approximate inference for K > 2. An . For simplicity. MINCUT INFERENCE 93 7. we focus on the pairwise AMN case. (7. We will show how to deal with the node terms (those depending only on a single variable) and the edge terms (those depending on a pair of variables). we recast the objective as minimization by simply reversing the signs on the value of each θ. depending on the sign of ∆v .1) is: max v∈V [µv (1)gv (1) + µv (2)gv (2)] + uv∈E [µuv (1)guv (1) + µuv (2)guv (2)]. min − v∈V [µv (1)gv (1) + µv (2)gv (2)] − uv∈E [µuv (1)guv (1) + µuv (2)guv (2)]. the objective of the integer program in Eq. For pairwise. (7. binary AMNs. In the ﬁnal (V1 . [1999a] to perform (approximate) inference in the general multiclass case.2) 7. Node terms Consider a node term −µv (1)gv (1) − µv (2)gv (2).3. We then look at ∆v = gv (1) − gv (2). (7.7. and later show how to use the local search algorithm developed by Boykov et al. and the V2 set will correspond to label 2. We ﬁrst consider the case of binary AMNs.3) The graph will consist of a vertex for each node in the AMN.1 Graph construction We construct a graph in which the mincut will correspond to the optimal MAP labeling for the above objective. the V1 set will correspond to label 1.
example is presented in Fig. 7.1. From Fig. 7.1, we see that if the AMN consisted of only node potentials, the graph construction above would add an edge from each node to its more likely label. When looking at node terms in isolation, we would simply get a cut with cost 0, and we would always choose not to cut any edges; but when we combine the graphs for node terms and edge terms, such cuts will become necessary.

Edge terms. Now consider an edge term of the form −µuv(1)guv(1) − µuv(2)guv(2). To construct a min-cut graph for the edge term, we introduce two vertices u and v. We connect vertex u to 1 with an edge of weight guv(1), connect v to 2 with an edge of weight guv(2), and connect u to v with an edge of weight guv(1) + guv(2). Fig. 7.1 shows an example. Observe what happens when both nodes are on the V2 side of the cut: the value of the min-cut is guv(1), which must be less than guv(2), or the min-cut would have placed them both on the 1 side. A cut that places the two nodes in different sets must cut the u–v edge, at cost guv(1) + guv(2), so such a cut will not occur. The reason for constructing the weights in this way is that the final min-cut graph must consist of only positive weights.

Figure 7.1: Min-cut graph construction of node (left) and edge (right) terms.

We can take the individual graphs we created for node and edge terms and merge them
by adding edge weights together (and treating missing edges as edges with weight 0). It can be shown that the resulting graph will represent the same objective (in the sense that running min-cut on it will optimize the same objective) as the sum of the objectives of each graph. Since our MAP-inference objective is simply a sum of node and edge terms, merging the node and edge term graphs will result in a graph in which the min-cut will correspond to the MAP labeling.

7.3.2 Multiclass case

The graph construction above finds the best MAP labeling for the binary case, but in practice we would often like to handle multiple classes in AMNs. One of the most effective algorithms for minimizing energy functions like ours is the α-expansion algorithm proposed by Boykov et al. [1999a]. The algorithm performs a series of “expansion” moves, each of which involves optimization over two labels, and it can be shown that it converges to within a factor of 2 of the global minimum.

Expansion algorithm. Consider a current labeling µ and a particular label k ∈ 1, . . . , K. Another labeling µ′ is called an “α-expansion” move from µ (following Boykov et al. [1999a]) if, for every node v, µ′v ≠ α implies µ′v = µv (where µv is the label of the node v in the AMN). In other words, a k-expansion from a current labeling allows each node to either keep its label or change it to k. The α-expansion algorithm cycles through all labels k in either a fixed or random order, finds the best labeling within one α-expansion of the current one, and moves to it if it improves the objective. It terminates when there is no α-expansion move for any label k that has a lower objective than the current labeling (Fig. 7.2).

The key part of the algorithm is computing the best α-expansion labeling for a fixed k and a fixed current labeling µ. In this new binary problem, we will let label 1 represent a node keeping its current label, and label 2 will denote a node taking on the new label k. The min-cut construction from earlier allows us to do exactly that, since an α-expansion move essentially minimizes a MAP objective over two labels: it either allows a node to retain its current label, or to switch to the label α. In order to construct the right coefficients
for the new binary objective, we need to consider several factors. Below, let θ̄i^k and θ̄ij^k denote the node and edge coefficients associated with the new binary objective, and θi^k, θij^{k,l} the coefficients in the multiclass AMN MAP objective:

◦ Node potentials. For each node i in the current labeling whose current label is not α, we let θ̄i^0 = θi^{yi} and θ̄i^1 = θi^α, where yi denotes the current label of node i. Note that we ignore nodes with label α altogether, since an α-expansion move cannot change their label.

◦ Edge potentials. For each edge (i, j) ∈ E whose nodes both have labels different from α, we add a new edge potential with weight θ̄ij^1 = θij^{α,α}. If the two nodes currently have the same label, we set θ̄ij^0 = θij^{yi,yj}, and if the two nodes currently have different labels, we let θ̄ij^0 = 0. For each edge (i, j) ∈ E in which exactly one of the nodes has label α in the current labeling, we add θij^{α,α} to the node potential θ̄^1 of the node whose label is different from α.

After we have constructed the new binary MAP objective as above, we can apply the min-cut construction from before to get the optimal labeling within one α-expansion of the current one.

 1. Begin with an arbitrary labeling µ.
 2. Set success := 0.
 3. For each label k ∈ {1, . . . , K}:
 3.1   Compute µ̂ = arg min E(µ′) among labelings µ′ within one α-expansion of µ, where E = −g.
 3.2   If E(µ̂) < E(µ), set µ := µ̂ and success := 1.
 4. If success = 1 goto 2.
 5. Return µ.

Figure 7.2: α-expansion algorithm.

Veksler [1999] shows that the α-expansion algorithm converges in O(N) iterations, where N is the number of nodes. As noted in Boykov et al. [1999a], and as we have observed in our experiments, the algorithm terminates after only a few iterations, with most of the improvement occurring in the first 2–3 expansion moves.
and the Hamming loss. Let {w. y). . y ∈ Y. all basis functions evaluate to 0 for assignments where the values of the nodes are not equal and are nonnegative for the assignments where the values of the nodes are equal.3 w f (x. y(i) . We must make similar associative assumption on the loss function in order to guarantee that the LP inference can handle it.4.1 Basis functions f are componentwise associative for G(x) for any (x.4. y) is associative for G (i) for all i. ˙ ˙ ¨ Let ¨ = f \ f be the rest of the basis functions.7.4 The loss function (x(i) .4. It is easy to verify that any nonnegative combination of associative functions is associative.4. y) whenever Assumption 7. and any combination of basis functions with support only on singleton cliques is also associative. yc ) = 0}. this restriction is fairly mild.1 ¨ holds and w ≥ 0. Assumption 7. is associative. which we use in generalization bounds and experiments. c ∈ C(G(x)). In practice. ˙ Deﬁnition 7. y) is associative.2 Let f be the subset of basis functions f with support only on singleton cliques: ˙ f = {f ∈ f : ∀ x ∈ X . MAXMARGIN ESTIMATION 97 7. y) is associative: Assumption 7.4 Maxmargin estimation The potentials of the AMN are once again loglinear combinations of basis functions. Recall that this implies that for cliques larger than one.4. w} = w be the corresponding f subsets of parameters. c > 1. y) is associative for G(x) for any (x. To ensure that w f (x.4. it is useful to separate the basis functions with support only on nodes from those with support on larger cliques. fc (xc . so we have: Lemma 7. We will need the following assumption to ensure that w f (x.
k. ∀i.98 CHAPTER 7. similar to Eq. ∀i. k).c∈C (i) . ¨ w ∆¨i. . µi.c. there are marginals µ for each node and clique.1 and 7.c (k)∆¨i. \ V . ¨ w ≥ 0. w ∆fi.c (yc )/c.v (k) − 2 K ν+ ¨ µi.c (k) + f ∀i. k. k.v∈V (i) .v (k) ≥ mi. 1 w2 + C 2 ξi. The extra constraints are due to the associativity constraints on the resulting model.v (k) i. k (i) C ˙ µi. However.v (k)∆fi.c (k) i.v . v ∈ V (i) . ∀i. c ∈ C (i) \ V (i) .3) is given by: 2 2 max i.t.c (k).c. . k s.2. While this primal is more complex than the regular M3 N factored primal in Eq. The dual of Eq.c (k) ≤ µi.12).4 and some algebra (see Appendix A.v (k) ≥ v∈c − ξi.v∈V (i) (7. (5. In the dual. ASSOCIATIVE MARKOV NETWORKS Using the above Assumptions 7. . ν ≥ 0. and not surprisingly. v ∈ c.c (k) f i.4. ∀c ∈ C . . ¨ ∀i.c. ∀c ∈ C (i) k=1 (i) µi.c (k) = fi.t. µi.c.v (k) ≥ −w ¨i.v (k) − c⊃v mi.1). k. the constraints are different.v (k) = 1.4. .c (k.3 for derivation). k.4) i.c (k) ≥ 0. ∀v ∈ V (i) . k C µi.c∈C (i) .c (k. (i) ¨ f mi. .v (k).c (k) = i.2. for each value k.v (k). k) and i. (7.4). . (5. where fi. we have the following maxmargin QP for AMNs: min s. .4) (see derivation in Sec. are essentially the constraints from the inference LP relaxation in Eq. (7. . v ∈ c. A. c ∈ C (i) \ V (i) .c (k) − 2 i. the basic structure of the ﬁrst two sets of constraints remains the same: we have local margin requirements and “credit” passed around through messages mi. ∀i.v i.
The dual and primal solutions are related by

    ẇ = Σ_{i, v∈V^(i), k} µ_{i,v}(k) Δḟ_{i,v}(k),
    ẅ = ν + Σ_{i, c∈C^(i)\V^(i), k} µ_{i,c}(k) Δf̈_{i,c}(k).

The ν variables simply ensure that ẅ is non-negative: if any component of Σ µ_{i,c}(k) Δf̈_{i,c}(k) is negative, maximizing the objective will force the corresponding component of ν to cancel it out. Note that the objective can be written in terms of dot products of node basis functions Δḟ_{i,v}(k)⊤Δḟ_{j,v̄}(k̄), so they can be kernelized. Unfortunately, the edge basis functions cannot be kernelized because of the non-negativity constraint.

For K = 2, the LP inference is exact, so Eq. (7.4) learns exact max-margin weights for Markov networks of arbitrary topology. For K > 2, the linear relaxation leads to a strengthening of the constraints on w by potentially adding constraints corresponding to fractional assignments, as in the case of untriangulated networks. Thus, the optimal choice w, ξ for the original QP may no longer be feasible, leading to a different choice of weights. However, as our experiments show, these weights tend to do well in practice.

7.5 Experiments

We applied associative Markov networks to the task of terrain classification. Terrain classification is very useful for autonomous mobile robots in real-world environments, for path planning, target detection, and as a preprocessing step for other perceptual tasks. The Stanford Segbot Project¹ has provided us with laser range maps of the Stanford campus (Fig. 7.4) collected by a moving robot equipped with SICK2 laser sensors (Fig. 7.3).

¹Many thanks to Michael Montemerlo and Sebastian Thrun for sharing the data.

The data consists of around 35 million points. Each reading was a point in 3D space, represented by its (x, y, z) coordinates in an absolute frame of reference, which was fairly noisy because of localization errors. Thus, the only available information is the location of points. Our task is to classify the laser range points into four classes: ground, building, tree,
and shrubbery. Each point is represented simply as a location in an absolute 3D coordinate system. The features we use require preprocessing to infer properties of the local neighborhood of a point, such as how planar the neighborhood is, or how many of the neighbors are close to the ground. The features we use are invariant to rotation in the xy plane, as well as to the density of the range scan, since scans tend to be sparser in regions farther from the robot.

Since classifying ground points is trivial given their absolute z-coordinate (height), we classify them deterministically by thresholding the z coordinate at a value close to 0. After we do that, we are left with approximately 20 million non-ground points.

Figure 7.3: Segbot: roving robot equipped with SICK2 laser sensors.

Our first type of feature is based on the principal plane around each point. For each point we sample 100 points in a cube of radius 0.5 meters. We run PCA on these points to get the plane of maximum variance (spanned by the first two principal components). We then partition the cube into 3 × 3 × 3 bins around the point, oriented with respect to the principal plane, and compute the percentage of points lying in the various sub-cubes. We use a number of features derived from the cube, such as the percentage of points in the central column, the outside corners, the central plane, etc. These features capture the local distribution well and are especially useful in finding planes.
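As a concrete illustration, the first feature type can be sketched as follows. The sampling, binning and normalization details here are our assumptions for the sketch, not the thesis code, and the function name is ours.

```python
import numpy as np

def principal_plane_features(points, center, radius=0.5, n_samples=100, rng=None):
    """Sketch of the PCA-based feature: sample up to `n_samples` neighbors in a
    cube around `center`, find the plane of maximum variance, and histogram the
    neighbors into 3x3x3 bins oriented with respect to that plane."""
    rng = np.random.default_rng(rng)
    # Keep points inside the cube of the given radius around the center.
    in_cube = np.all(np.abs(points - center) <= radius, axis=1)
    nbrs = points[in_cube]
    if len(nbrs) > n_samples:
        nbrs = nbrs[rng.choice(len(nbrs), n_samples, replace=False)]
    # PCA: the first two right singular vectors span the principal plane.
    centered = nbrs - nbrs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    local = centered @ vt.T               # rotate into the principal axes
    # 3x3x3 occupancy histogram, as a fraction of the sampled points.
    bins = np.linspace(-radius, radius, 4)
    idx = np.clip(np.digitize(local, bins) - 1, 0, 2)
    hist = np.zeros((3, 3, 3))
    for i, j, k in idx:
        hist[i, j, k] += 1
    return hist.ravel() / len(nbrs)       # 27 plane-oriented bin fractions

# Toy usage: a noisy horizontal plane concentrates mass in the central slab
# along the least-variance axis.
pts = np.random.default_rng(0).uniform(-0.5, 0.5, (500, 3))
pts[:, 2] *= 0.05                         # squash into a plane
f = principal_plane_features(pts, np.zeros(3))
print(f.shape)  # (27,)
```

Derived quantities such as the fraction of points in the central column or the central plane are then simple sums over entries of this histogram.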
Our second type of feature is based on a column around each point. We take a cylinder of radius 0.25 meters, which extends vertically to include all the points in a "column". We then compute what percentage of the points lie in various segments of this vertical column (e.g., between 2m and 2.5m). This feature is especially useful in classifying shrubbery. Finally, we also use an indicator feature of whether or not a point lies within 2m of the ground.

Figure 7.4: 3D laser scan range map of the Stanford Quad.

For training we select roughly 30 thousand points that represent the classes well: a segment of a wall, a tree, some bushes. We considered three different models: SVM, Voted-SVM and AMNs. All methods use the same set of features, augmented with a quadratic kernel.

The first model is a multiclass SVM with a quadratic kernel over the above features. This model (Fig. 7.5, top panel, and Fig. 7.7, right panel) achieves reasonable performance
Figure 7.5: Terrain classification results showing Stanford Memorial Church obtained with SVM, Voted-SVM and AMN models. (Color legend: buildings/red, trees/green, shrubs/blue, ground/gray.)

in many places, but fails to enforce local consistency of the classification predictions. For example, arches on buildings and other less planar regions are consistently confused for trees, even though they are surrounded entirely by buildings.

We improved upon the SVM by smoothing its predictions using voting. For each point we took its local neighborhood (we varied the radius to get the best possible results) and assigned the point the label of the majority of its 100 neighbors. The Voted-SVM model (Fig. 7.5, middle panel, and Fig. 7.9, top panel) performs slightly better than SVM: for example, in many places it smooths out trees and some parts of the buildings. Yet it still fails in areas like arches of buildings where the SVM classifier has a locally consistent wrong prediction.

The final model is a pairwise AMN over laser scan points, with associative potentials to ensure smoothness. Each point is connected to 6 of its neighbors: 3 of them are sampled randomly from the local neighborhood in a sphere of radius 0.5m, and the other 3 are sampled at random from the vertical cylinder column of radius 0.25m. It is important to ensure vertical consistency, since the SVM classifier is wrong in areas that are higher off the ground.
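The voting step can be sketched as follows; a brute-force nearest-neighbor search stands in for whatever spatial index an efficient implementation would use, and the function name and toy data are ours.

```python
import numpy as np

def voted_smoothing(points, labels, k=100):
    """Sketch of the Voted-SVM post-processing step: each point is reassigned
    the majority label among its k nearest neighbors (including itself).
    Brute-force O(n^2) search, for illustration only."""
    points = np.asarray(points)
    labels = np.asarray(labels)
    smoothed = np.empty_like(labels)
    for i in range(len(points)):
        d = np.linalg.norm(points - points[i], axis=1)
        nn = np.argsort(d)[:k]                      # k nearest, self included
        vals, counts = np.unique(labels[nn], return_counts=True)
        smoothed[i] = vals[np.argmax(counts)]
    return smoothed

# Toy usage: a single mislabeled point inside a uniform cluster is corrected.
rng = np.random.default_rng(1)
pts = rng.normal(size=(50, 3))
lab = np.zeros(50, dtype=int)
lab[7] = 1                                          # isolated error
print(voted_smoothing(pts, lab, k=10))              # all zeros
```

As the section notes, this kind of smoothing cannot fix regions where the base classifier is wrong in a locally consistent way, since the majority vote there agrees with the error.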
This is due to the decrease in point density higher off the ground, and because objects tend to look different as we vary their z-coordinate (for example, tree trunks and tree crowns look different).

While we experimented with a variety of edge features, including various distances between points, we found that even using only a constant feature performs well.

We trained the AMN model using CPLEX to solve the quadratic program; the training took about an hour on a Pentium 3 desktop. The inference over each segment was performed using min-cut with α-expansion moves as described above. We used a publicly available implementation of the min-cut algorithm, which uses bidirectional search trees for augmenting paths (see Boykov and Kolmogorov [2004]). The performance is summarized in Fig. 7.6. As we can see, the running time is roughly linear in the size of the problem (number of nodes and number of edges); note the near-linear performance of the algorithm and its efficiency even for large models. The implementation is largely dominated by I/O time, with the actual min-cut taking less than two minutes even for the largest segment.

Figure 7.6: The running time (in seconds) of the min-cut-based inference algorithm for different problem sizes. The problem size is the sum of the number of nodes and the number of edges (up to roughly 2.5 × 10⁷).
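In the binary case (K = 2), MAP inference in an associative pairwise model reduces exactly to a minimum s-t cut, which is what makes this inference step fast. The following is a small illustrative sketch of that reduction, using a toy Edmonds-Karp max-flow solver written here for self-containment; the thesis uses the Boykov-Kolmogorov implementation, with α-expansion moves for K > 2, and all names below are ours, not from that code.

```python
from collections import defaultdict, deque

def _min_cut_source_side(cap, s, t):
    """Edmonds-Karp max-flow; returns the set of nodes reachable from s in the
    final residual graph, i.e. the source side of a minimum cut."""
    flow = defaultdict(float)
    nodes = {u for u, _ in cap} | {v for _, v in cap}
    def residual(u, v):
        return cap.get((u, v), 0.0) - flow[(u, v)]
    while True:
        parent = {s: None}                       # BFS for an augmenting path
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in nodes:
                if v not in parent and residual(u, v) > 1e-12:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return set(parent)                   # residual-reachable nodes
        path, v = [], t                          # recover path, push flow
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual(u, v) for u, v in path)
        for u, v in path:
            flow[(u, v)] += push
            flow[(v, u)] -= push

def binary_amn_map(unary, edges):
    """MAP for a binary associative pairwise energy via min-cut.
    unary[i] = (cost of label 0, cost of label 1); edges = {(i, j): lam} with
    lam >= 0 (agreement costs 0, disagreement costs lam)."""
    cap = {}
    def add(u, v, c):
        cap[(u, v)] = cap.get((u, v), 0.0) + c
    for i, (c0, c1) in enumerate(unary):
        add('s', i, c0)      # cut iff i lands on the sink side   (y_i = 0)
        add(i, 't', c1)      # cut iff i lands on the source side (y_i = 1)
    for (i, j), lam in edges.items():
        add(i, j, lam)       # paid when y_i = 1, y_j = 0
        add(j, i, lam)       # paid when y_i = 0, y_j = 1
    source_side = _min_cut_source_side(cap, 's', 't')
    return [1 if i in source_side else 0 for i in range(len(unary))]

# Toy usage: nodes 0 and 1 strongly prefer label 1; node 2 weakly prefers
# label 0 but is pulled to 1 by the smoothing potential on the chain 0-1-2.
print(binary_amn_map([(5, 0), (5, 0), (0, 1)],
                     {(0, 1): 2.0, (1, 2): 2.0}))  # [1, 1, 1]
```

The capacity construction mirrors the associativity restriction: only disagreements across an edge incur cost, so every pairwise term can be encoded as a pair of directed arcs with non-negative capacity.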
We can see that the predictions of the AMN (Fig. 7.7, left panel, and Fig. 7.9, bottom panel) are much smoother: for example, building arches and tree trunks are predicted correctly. We also hand-labeled around 180 thousand points of the test set (Fig. 7.8) and computed accuracies of the predictions shown in Fig. 7.9 (excluding ground, which was classified by preprocessing). The differences are dramatic: SVM: 68%, Voted-SVM: 73% and AMN: 93%. See more results, including a fly-through movie of the data, at http://ai.stanford.edu/~btaskar/3Dmap/.

7.6 Related work

Several authors have considered extensions to the Potts model. Kleinberg and Tardos [1999] extend the multiclass Potts model to have more general edge potentials, under the constraint that the negative logs of the edge potentials form a metric on the set of labels. They also provide a solution based on a relaxed LP that has certain approximation guarantees. In fact, we can extend our max-margin framework to estimate their more general potentials by expressing inference as a linear program. More recently, Kolmogorov and Zabih [2002] showed how to optimize energy functions containing binary and ternary interactions using graph cuts, as long as the parameters satisfy a certain regularity condition. Our definition of associative potentials also satisfies the Kolmogorov and Zabih regularity condition for K = 2. However, the structure of our potentials is simpler to describe and extend for the multiclass case.

Our terrain classification approach is most closely related to work in vision applying conditional random fields (CRFs) to 2D images. Kumar and Hebert [2003] train CRFs using a pseudo-likelihood approximation to the distribution P(Y | X), since estimating the true conditional distribution is intractable. Unlike their work, our learning formulation provides an exact and tractable optimization algorithm, as well as formal guarantees for binary classification problems. Moreover, unlike their work, our approach can also handle multiclass problems in a straightforward manner.

7.7 Conclusion

In this chapter, we provide an algorithm for max-margin training of associative Markov networks, a subclass of Markov networks that allows only positive interactions between
related variables. To our knowledge, this algorithm is the first to provide an effective learning procedure for Markov networks of such general structure. Our approach relies on a linear programming relaxation of the MAP problem, which is the key component in the quadratic program associated with the max-margin formulation. Importantly, our method is guaranteed to find the optimal (margin-maximizing) solution for all binary-valued AMNs, regardless of the clique size or the connectivity; our results in the binary case rely on the fact that the LP relaxation of the MAP problem provides exact solutions. In the non-binary case, we are not guaranteed exact solutions, but we can prove constant-factor approximation bounds on the MAP solution returned by the relaxed LP. We thus provide a polynomial time algorithm which approximately solves the maximum margin estimation problem for any associative Markov network. It would be interesting to see whether these bounds provide us with guarantees on the quality (e.g., the margin) of our learned model.

We present large-scale experiments with terrain segmentation and classification from 3D range data, involving AMNs with tens of millions of nodes and edges. The min-cut based inference is able to handle very large networks, and it is an interesting challenge to apply the algorithm to even larger models and develop efficient distributed implementations.

The class of associative Markov networks appears to cover a large number of interesting applications. We have explored only a computer vision application in this chapter, and consider another one (hypertext classification) in the next. It would be very interesting to consider other applications, such as extracting protein complexes from protein-protein interaction data, or predicting links in relational data. However, despite the prevalence of fully associative Markov networks, it is clear that many applications call for repulsive potentials. While clearly we cannot introduce fully general potentials into AMNs without running against the NP-hardness of the general problem, it would be interesting to see whether we can extend the class of networks we can learn effectively.
Figure 7.7: Results from the SVM, Voted-SVM and AMN models.
Figure 7.8: Labeled part of the test set: ground truth (top) and SVM predictions (bottom).
Figure 7.9: Predictions of the Voted-SVM (top) and AMN (bottom) models.
Chapter 8

Relational Markov networks

In the previous chapters, we have seen how sequential and spatial correlation between labels can be exploited for tremendous accuracy gains. In many other supervised learning tasks, the entities to be labeled are related with each other in very complex ways, not just sequentially or spatially. Many real-world data sets are innately relational: hyperlinked webpages, cross-citations in patents and scientific papers, social networks, medical records, and more. Such data consist of entities of different types, where each entity type is characterized by a different set of attributes. Entities are related to each other via different types of links, and the link structure is an important source of information.

Consider a collection of hypertext documents that we want to classify using some set of labels. Most naively, we can use a bag of words model, classifying each webpage solely using the words that appear on the page. However, hypertext has a very rich structure that this approach loses entirely. One document has hyperlinks to others, typically indicating that their topics are related. Each document also has internal structure, such as a partition into sections; hyperlinks that emanate from the same section of the document are even more likely to point to similar documents. When classifying a collection of documents, these are important cues that can potentially help us achieve better classification accuracy. For example, in hypertext classification, the labels of linked pages are highly correlated. A standard approach is to classify each entity independently, ignoring the correlations between them. We propose instead the use of a joint probabilistic model for an entire collection of related entities. This provides a form of collective classification, where we simultaneously decide on the class labels of all of the entities together, rather than classifying each document separately, and thereby can explicitly take advantage of the correlations between the labels of related entities.

In this chapter, we present a framework that builds on Markov networks and provides a flexible language for modeling rich interaction patterns in structured data. We introduce the framework of relational Markov networks (RMNs), which compactly defines a Markov network over a relational data set. The graphical structure of an RMN is based on the relational structure of the domain, and can easily model complex patterns over related entities. As we show, the use of an undirected graphical model avoids the difficulties of defining a coherent generative model for graph structures in directed models, and thereby allows us tremendous flexibility in representing complex patterns. We provide experimental results on a webpage classification task, showing that accuracy can be significantly improved by modeling relational dependencies.

8.1 Relational classification

Consider hypertext as a simple example of a relational domain. A relational domain is defined by a schema, which describes entities, their attributes and the relations between them. In general, a schema specifies a set of entity types E = {E1, ..., En}. Each type E is associated with three sets of attributes: content attributes E.X (for example, Doc.HasWordk), label attributes E.Y (for example, Doc.Label), and reference attributes E.R (for example, Link.To). For simplicity, we restrict label and content attributes to take on categorical values. Reference attributes include a special unique key attribute E.K that identifies each entity. Other reference attributes E.R refer to entities of a single type E' = Range(E.R) and take values in Domain(E'.K).

In our domain, there are two entity types: Doc and Link. If a webpage is represented as a bag of words, Doc would have a set of boolean attributes Doc.HasWordk indicating whether the word k occurs on the page. It would also have the label attribute Doc.Label, which takes on a set of categorical values indicating the topic of the page. The Link entity type has two attributes: Link.From and Link.To, both of which refer to Doc entities.
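To make the schema concrete, here is a minimal sketch of the Doc and Link entity types as plain Python records; the class and field names mirror the text, but the code itself is illustrative, not from the thesis.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    key: int                                  # Doc.K: unique key attribute
    has_word: dict = field(default_factory=dict)  # content attrs Doc.HasWordk
    label: str = "unknown"                    # label attribute Doc.Label

@dataclass
class Link:
    key: int                                  # Link.K
    frm: int                                  # Link.From: key of a Doc entity
    to: int                                   # Link.To:   key of a Doc entity

# An instantiation specifies the entities of each type and their attributes.
docs = [Doc(0, {"course": True}, "course"),
        Doc(1, {"faculty": True}, "faculty")]
links = [Link(0, frm=0, to=1)]
print(links[0].to)  # 1
```

The reference attributes (`frm`, `to`) take values in the key domain of Doc, exactly as Range(Link.To) = Doc requires.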
An instantiation I of a schema E specifies the set of entities I(E) of each entity type E ∈ E and the values of all attributes for all of the entities. For example, an instantiation of the hypertext schema is a collection of webpages, specifying their labels, the words they contain and the links between them. We will use I.X, I.Y and I.R to denote the content, label and reference attributes in the instantiation I, and I.x, I.y and I.r to denote the values of those attributes. The component I.r, which we call an instantiation skeleton or instantiation graph, specifies the set of entities (nodes) and their reference attributes (edges). A hypertext instantiation graph specifies a set of webpages and links between them, but not their words or labels.

Taskar et al. [2001] suggest the use of probabilistic relational models (PRMs) for the collective classification task. PRMs [Koller & Pfeffer, 1998; Friedman et al., 1999; Getoor et al., 2002] are a relational extension of Bayesian networks [Pearl, 1988]. A PRM specifies a probability distribution over instantiations consistent with a given instantiation graph by specifying a Bayesian-network-like template-level probabilistic model for each entity type. Given a particular instantiation graph, the PRM induces a large Bayesian network over that instantiation that specifies a joint probability distribution over all attributes of all of the entities. This network reflects the interactions between related instances by allowing us to represent correlations between their attributes.

In our hypertext example, a PRM might use a naive Bayes model for words, with a directed edge between Doc.Label and each attribute Doc.HasWordk; each of these attributes would have a conditional probability distribution P(Doc.HasWordk | Doc.Label) associated with it, indicating the probability that word k appears in the document given each of the possible topic labels. More importantly, a PRM can represent the interdependencies between topics of linked documents by introducing an edge from Doc.Label to Doc.Label of two documents if there is a link between them. Given a particular instantiation graph containing some set of documents and links, the PRM specifies a Bayesian network over all of the documents in the collection: we would have a probabilistic dependency from each document's label to the words on the document, and a dependency from each document's label to the labels of all of the documents to which it points. Taskar et al. show that this approach works well for classifying scientific documents, using both the words in the title and abstract and the citation-link structure.

However, the application of this idea to other domains, such as webpages, is problematic since there are many cycles in the link graph, leading to cycles in the induced "Bayesian network", which is therefore not a coherent probabilistic model. Getoor et al. [2001] suggest an approach where we do not include direct dependencies between the labels of linked webpages, but rather treat links themselves as random variables. Each two pages have a "potential link", which may or may not exist in the data. The model defines the probability of the link existence as a function of the labels of the two endpoints. In this link existence model, labels have no incoming edges from other labels, and the cyclicity problem disappears. This model, however, has other fundamental limitations. In particular, the resulting Bayesian network has a random variable for each potential link: N² variables for collections containing N pages. This quadratic blowup occurs even when the actual link graph is very sparse, and when N is large (e.g., the set of all webpages), a quadratic growth is intractable. Furthermore, the link existence model assumes that the presence of different edges is a conditionally independent event. Representing more complex patterns involving correlations between multiple edges is very difficult. For example, if two pages point to the same page, it is more likely that they point to each other as well. Such interactions between many overlapping triples of links do not fit well into the generative framework. Even more problematic are the inherent limitations on the expressive power imposed by the constraint that the directed graph must represent a coherent generative model over graph structures.

Finally, directed models such as Bayesian networks and PRMs are usually trained to optimize the joint probability of the labels and other attributes, while the goal of classification is a discriminative model of labels given the other attributes. The advantage of training a model only to discriminate between labels is that it does not have to trade off between classification accuracy and modeling the joint distribution over non-label attributes. In many cases, discriminatively trained models are more robust to violations of independence assumptions and achieve higher classification accuracy than their generative counterparts.
8.2 Relational Markov networks

We now extend the framework of Markov networks to the relational setting. A relational Markov network (RMN) specifies a conditional distribution over all of the labels of all of the entities in an instantiation given the relational structure and the content attributes. (We provide the definitions directly for the conditional case, as the unconditional case is a special case where the set of content attributes is empty.) Roughly speaking, it specifies the cliques and potentials between attributes of related entities at a template level, so a single model provides a coherent distribution for any collection of instances from the schema.

To specify what cliques should be constructed in an instantiation, we will define a notion of a relational clique template. A relational clique template specifies tuples of variables in the instantiation by using a relational query language. For example, suppose that pages with the same label tend to link to each other, as in Fig. 8.1. We can capture this correlation between labels by introducing, for each link, a clique between the labels of the source and the target page. The potential on the clique will have higher values for assignments that give a common label to the linked pages. For our link example, we can write the template as a kind of SQL query:

SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link
WHERE link.From = doc1.Key AND link.To = doc2.Key

Figure 8.1: An unrolled Markov net over linked documents.

Note the three clauses that define a query: the FROM clause specifies the cross product of entities to be filtered by the WHERE clause, and the SELECT clause picks out the attributes of interest. Our definition of clique templates contains the corresponding three parts. A relational clique template C = (F, W, S) consists of three components:

◦ F = {Fi}: a set of entity variables, where an entity variable Fi is of type E(Fi).
◦ W(F.R): a boolean formula using conditions of the form Fi.Rj = Fk.Rl.
◦ F.S ⊆ F.X ∪ F.Y: a selected subset of content and label attributes in F.

For the clique template corresponding to the SQL query above, F consists of doc1, doc2 and link, of types Doc, Doc and Link, respectively; W(F.R) is link.From = doc1.Key ∧ link.To = doc2.Key; and F.S is doc1.Category and doc2.Category.

A clique template specifies a set of cliques in an instantiation I:

    C(I) ≡ {c = f.S : f ∈ I(F) ∧ W(f.r)},

where f is a tuple of entities {fi} in which each fi is of type E(Fi); I(F) = I(E(F1)) × ... × I(E(Fn)) denotes the cross-product of entities in the instantiation; the clause W(f.r) ensures that the entities are related to each other in the specified ways; and finally, f.S selects the appropriate attributes of the entities. Note that the clique template does not specify the nature of the interaction between the attributes; that is determined by the clique potentials, which will be associated with the template.

This definition of a clique template is very flexible, as the WHERE clause of a template can be an arbitrary predicate. It allows modeling complex relational patterns on the instantiation graphs. To continue our webpage example, consider another common pattern in hypertext: links in a webpage tend to point to pages of the same category. This pattern can be expressed by the following template:

SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link1, Link link2
WHERE link1.From = link2.From AND link1.To = doc1.Key
  AND link2.To = doc2.Key AND NOT doc1.Key = doc2.Key

Depending on the expressive power of our template definition language, we may be able to construct very complex templates that select entire subgraph structures of an instantiation. We can easily represent patterns involving three (or more) interconnected documents without worrying about the acyclicity constraint imposed by directed models. Since the clique templates do not explicitly depend on the identities of entities, the same template can select subgraphs whose structure is fairly different. The RMN allows us to associate the same clique potential parameters with all of the subgraphs satisfying the template, thereby allowing generalization over a wide range of different structures.

A Relational Markov network (RMN) M = (C, Φ) specifies a set of clique templates C and corresponding potentials Φ = {φC}C∈C to define a conditional distribution:

    P(I.y | I.x, I.r) = (1 / Z(I.x, I.r)) ∏_{C∈C} ∏_{c∈C(I)} φ_C(I.x_c, I.y_c),

where Z(I.x, I.r) is the normalizing partition function:

    Z(I.x, I.r) = Σ_{I.y'} ∏_{C∈C} ∏_{c∈C(I)} φ_C(I.x_c, I.y'_c).

Using the log-linear representation of potentials, φ_C(V_C) = exp{w_C⊤ f_C(V_C)}, we can write

    log P(I.y | I.x, I.r) = Σ_{C∈C} Σ_{c∈C(I)} w_C⊤ f_C(I.x_c, I.y_c) − log Z(I.x, I.r)
                          = w⊤ f(I.x, I.y, I.r) − log Z(I.x, I.r),

where f_C(I.x, I.y, I.r) = Σ_{c∈C(I)} f_C(I.x_c, I.y_c) is the sum over all appearances of the template C(I) in the instantiation, and f is the vector of all f_C.
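Because the clique templates are literally SQL-style queries, unrolling one over a toy instantiation can be sketched by executing the first template with Python's built-in sqlite3. Column names below carry a trailing underscore to avoid SQL keywords, and the toy data are ours; this is illustrative, not the thesis system.

```python
import sqlite3

# Build a toy instantiation graph: Doc and Link tables mirroring the schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Doc (Key_ INTEGER, Category TEXT)")
con.execute("CREATE TABLE Link (From_ INTEGER, To_ INTEGER)")
con.executemany("INSERT INTO Doc VALUES (?, ?)",
                [(0, "course"), (1, "faculty"), (2, "course")])
con.executemany("INSERT INTO Link VALUES (?, ?)", [(0, 1), (0, 2), (2, 0)])

# The link clique template from the text:
#   SELECT doc1.Category, doc2.Category FROM Doc doc1, Doc doc2, Link link
#   WHERE link.From = doc1.Key AND link.To = doc2.Key
cliques = con.execute(
    """SELECT doc1.Category, doc2.Category
       FROM Doc doc1, Doc doc2, Link link
       WHERE link.From_ = doc1.Key_ AND link.To_ = doc2.Key_""").fetchall()
print(cliques)  # one (source label, target label) clique per link: 3 cliques
```

Each returned tuple is one clique c ∈ C(I); all of them would share the same potential φ_C over the pair of category values.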
Given a particular instantiation I of the schema, the RMN M produces an unrolled Markov network over the attributes of entities in I. The cliques in the unrolled network are determined by the clique templates C: we have one clique for each c ∈ C(I), and all of these cliques are associated with the same clique potential φ_C. In our webpage example, an RMN with the link basis function described above would define a Markov net in which, for every link between two pages, there is an edge between the labels of these pages. Fig. 8.1 illustrates a simple instance of this unrolled Markov network.

8.3 Approximate inference and learning

Applying both maximum likelihood and maximum margin learning in the relational setting requires inference in very large and complicated networks, where exact inference is typically intractable. We therefore resort to approximate methods. There is a wide variety of approximation schemes for this problem, including MCMC and variational methods. We chose to use belief propagation for its simplicity and relative efficiency and accuracy. As our results in Sec. 8.4 show, this approach works well in our domain.

Belief Propagation (BP) is a local message passing algorithm introduced by Pearl [1988]. It is guaranteed to converge to the correct marginal probabilities for each node only for singly connected Markov networks. However, empirical results [Murphy et al., 1999] show that it often converges in general networks, and when it does, the marginals are a good approximation to the correct posteriors; recent analysis [Yedidia et al., 2000] provides some theoretical justification. We provide a brief outline of one variant of BP, referring the reader to Murphy et al. [1999] and Yedidia et al. [2000] for a detailed description of the BP algorithm.

For simplicity, we assume a pairwise network where all potentials are associated only with nodes and edges, given by:

    P(Y1, ..., Yn) = (1/Z) ∏_{ij} ψ_{ij}(Y_i, Y_j) ∏_i ψ_i(Y_i),

where ij ranges over the edges of the network, ψ_{ij}(Y_i, Y_j) = φ(x_{ij}, Y_i, Y_j) and ψ_i(Y_i) = φ(x_i, Y_i).

The belief propagation algorithm is very simple. At each iteration, each node Y_i sends the following messages to all its neighbors N(i):

    m_{ij}(Y_j) ← α Σ_{y_i} ψ_{ij}(y_i, Y_j) ψ_i(y_i) ∏_{k∈N(i)−j} m_{ki}(y_i),

where α is a (different) normalizing constant. This process is repeated until the messages converge. At any point in the algorithm, the marginal distribution of any node Y_i is approximated by

    b_i(Y_i) = α ψ_i(Y_i) ∏_{k∈N(i)} m_{ki}(Y_i),

and the marginal distribution of a pair of nodes connected by an edge is approximated by

    b_{ij}(Y_i, Y_j) = α ψ_{ij}(Y_i, Y_j) ψ_i(Y_i) ψ_j(Y_j) ∏_{k∈N(i)−j} m_{ki}(Y_i) ∏_{l∈N(j)−i} m_{lj}(Y_j).

These approximate marginals are precisely what we need for the computation of the basis function expectations and for performing classification.

Maximum likelihood estimation. For maximum likelihood learning, we need to compute basis function expectations. Computing the expected basis function values involves summing their expected values for each clique using the approximate marginals b_i(Y_i) and b_{ij}(Y_i, Y_j). At prediction time, we use max_{y_i} b_i(y_i). Note that we can also use the max-product variant of loopy BP, with

    m_{ij}(Y_j) ← α max_{y_i} ψ_{ij}(y_i, Y_j) ψ_i(y_i) ∏_{k∈N(i)−j} m_{ki}(y_i),

to compute approximate posterior "max"-marginals and use those for prediction. In our experiments, this results in less accurate classification, so we use posterior marginal prediction.
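The message updates above are straightforward to implement. Below is a minimal sketch of synchronous sum-product loopy BP on a pairwise model in NumPy; the function and variable names are ours, and scheduling and damping details are simplified relative to any production implementation.

```python
import numpy as np

def loopy_bp(node_pot, edge_pot, edges, iters=50):
    """Sum-product loopy BP, as outlined above.
    node_pot[i]: psi_i over K states; edge_pot[(i, j)]: K x K matrix psi_ij
    indexed [state_i, state_j]. Returns approximate node marginals b_i."""
    n, K = node_pot.shape
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    msgs = {(i, j): np.ones(K) / K for i, j in edges}
    msgs.update({(j, i): np.ones(K) / K for i, j in edges})
    def pot(i, j):  # psi_ij indexed [state_i, state_j], either orientation
        return edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            # m_ij(y_j) = sum_{y_i} psi_ij(y_i, y_j) psi_i(y_i)
            #             prod_{k in N(i)\j} m_ki(y_i)
            prod = node_pot[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod *= msgs[(k, i)]
            m = pot(i, j).T @ prod
            new[(i, j)] = m / m.sum()         # normalize (the alpha above)
        msgs = new
    beliefs = node_pot.copy()
    for i in range(n):
        for k in nbrs[i]:
            beliefs[i] *= msgs[(k, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Toy usage: an attractive edge pulls an uncertain node toward its neighbor.
node_pot = np.array([[0.9, 0.1], [0.5, 0.5]])
edge_pot = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]])}
b = loopy_bp(node_pot, edge_pot, [(0, 1)])
print(b[1])  # node 1 now leans toward state 0
```

On this single-edge (tree) example BP is exact; on loopy unrolled RMNs the same updates give the approximate marginals used for the expectations and for prediction.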
Maximum margin estimation

For maximum margin estimation, we used approximate LP inference inside the max-margin QP, using commercial Ilog CPLEX software to solve it. For networks with attractive potentials (AMNs), we used the LP in Sec. 7, which always produced integral solutions on test data. For networks with general potentials, we used the untriangulated LP we described in Sec. 5. The untriangulated LP produced fractional solutions for inference on the test data in several settings, which we rounded independently for each label.

8.4 Experiments

We tried out our framework on the WebKB dataset [Craven et al., 1998], which is an instance of our hypertext example. The data set contains webpages from four different Computer Science departments: Cornell, Texas, Washington and Wisconsin. Each page has a label attribute, representing the type of webpage, which is one of course, faculty, student, project or other. The data set is problematic in that the category other is a grab-bag of pages of many different types. The number of pages classified as other is quite large, so that a baseline algorithm that simply always selected other as the label would get an average accuracy of 75%. We could restrict attention to just the pages with the four other labels, but in a relational classification setting, the deleted webpages might be useful in terms of their interactions with other webpages. Hence, we compromised by eliminating all other pages with fewer than three outlinks, making the number of other pages commensurate with the other categories. The resulting category distribution is: course (237), faculty (148), other (332), research-project (82) and student (542). The number of remaining pages for each school are: Cornell (280), Texas (292), Washington (315) and Wisconsin (454). The number of links for each school are: Cornell (574), Texas (574), Washington (728) and Wisconsin (1614). For each page, we have access to the entire html of the page and the links to other pages.

Our goal is to collectively classify webpages into one of these five categories. In all of our experiments, we learn a model from three schools and test the performance of the learned model on the remaining school, thus evaluating the generalization performance of the different models.
8.4.1 Flat models

The simplest approach we tried predicts the categories based on just the text content on the webpage. The text of the webpage is represented using a set of binary attributes that indicate the presence of different words on the page. We found that stemming and feature selection did not provide much benefit and simply pruned words that appeared in fewer than three documents in each of the three schools in the training data. We also experimented with incorporating metadata: words appearing in the title of the page, in anchors of links to the page and in the last header before a link to the page [Yang et al., 2002]. Note that metadata, although mostly originating from pages linking into the considered page, are easily incorporated as features, i.e., the resulting classification task is still flat feature-based classification. Unfortunately, we cannot directly compare our accuracy results with previous work because different papers use different subsets of the data and different training/test splits. However, we compare to standard text classifiers such as Naive Bayes, Logistic Regression, and Support Vector Machines, which have been demonstrated to be successful on this data set [Joachims, 1999].

Our first experimental setup compares three well-known text classifiers — Naive Bayes, linear support vector machines (Svm), and logistic regression (Logistic) — using words and metawords. We used C ∈ [0. . . 10] and took the best setting for all models. The results, shown in Fig. 8.2, show that the two discriminative approaches outperform Naive Bayes. Logistic and Svm give very similar results. Incorporating metadata gives a significant improvement: the average error over the 4 schools was reduced by around 4% by introducing the metadata attributes.

These classifiers treat each page in isolation, but we can take additional advantage of the correlation in labels of related pages by classifying them collectively. We experimented with several relational models.
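The flat text representation just described — binary word-presence attributes, pruning words that appear in fewer than three training documents — can be sketched as follows. The toy documents and the `min_df` helper name are illustrative assumptions, not the thesis code.

```python
# Sketch of the binary bag-of-words representation with the pruning rule
# from the text (keep only words appearing in at least 3 training docs).
from collections import Counter

def build_vocab(docs, min_df=3):
    # document frequency: count each word once per document
    df = Counter(w for doc in docs for w in set(doc))
    return sorted(w for w, c in df.items() if c >= min_df)

def binarize(doc, vocab):
    present = set(doc)
    return [1 if w in present else 0 for w in vocab]

train_docs = [
    ["course", "homework", "syllabus"],
    ["course", "exam", "grading"],
    ["course", "homework", "lecture"],
]
vocab = build_vocab(train_docs)   # only "course" appears in all 3 documents
```

Metawords (title words, anchor text, and so on) would simply be added to each document's token list with a distinguishing prefix before binarization.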
[Figure 8.2: Comparison of Naive Bayes, Svm, and Logistic on WebKB, with and without metadata features; test error. Only averages over the 4 schools are shown here.]

8.4.2 Link model

Our first model captures direct correlations between labels of linked pages. These correlations are very common in our data: courses and research projects almost never link to each other; faculty rarely link to each other; students have links to all categories but mostly courses. We want to capture these correlations in our model and use them for transmitting information between linked pages to provide more accurate classification. The Link model captures this correlation through links: in addition to the local bag of words and metadata attributes, we introduce a relational clique template over the labels of two pages that are linked. We train this model using maximum conditional likelihood (labels given the words and the links) and maximum margin.

We also compare to a directed graphical model to contrast discriminative and generative models of relational structure. The ExistsML model is a (partially) generative model proposed by Getoor et al. [2001]. Concretely, a logistic regression model predicts the page label given the words and metafeatures. Then a simple generative model specifies a probability distribution over the existence of links between pages conditioned on both pages' labels: we learn the probability of existence of a link between two pages given their labels. Maximum likelihood estimation (with regularization) of the generative component is closed form given appropriate cooccurrence counts of linked pages' labels. Note that this model does not require inference during learning. However, the prediction phase is much more expensive, since the resulting graphical model includes edges not only for the existing hyperlinks, but also those that do not exist: intuitively, observing the link structure directly correlates all page labels in a website, linked or not. By contrast, the Link model avoids this problem by only modeling the conditional distribution given the existing links.

[Figure 8.3: Comparison of flat versus collective classification on WebKB: SVM, the Exists model with logistic regression (ExistsML), and the Link model estimated using the maximum likelihood (LinkML) and the maximum margin (LinkMM) criteria; error per school (Cor, Tex, Was, Wis) and average.]

Fig. 8.3 shows a gain in accuracy from SVMs to the Link model by using the correlations between labels of linked web pages. There is also a very significant additional gain by using maximum margin training: the error rate of LinkMM is 40% lower than that of LinkML, and 51% lower than multiclass SVMs. The Exists model doesn't perform very well in comparison. This can be attributed to the simplicity of the generative model and the difficulty of the resulting inference problem.

8.4.3 Cocite model

The second relational model uses the insight that a webpage often has internal structure that allows it to be broken up into sections. For example, a faculty webpage might have
one section that discusses research, with a list of links to all of the projects of the faculty member; a second section might contain links to the courses taught by the faculty member; and a third to his advisees. We can view a section of a webpage as a fine-grained version of Kleinberg's hub [Kleinberg, 1999] (a page that contains a lot of links to pages of a particular category). Intuitively, if two pages are cocited, i.e., linked to from the same section, they are likely to be on similar topics. The Cocite model captures this type of correlation. To take advantage of this trend, we need to enrich our schema by adding the attribute Section to Link to refer to the section number it appears in. We defined a section as a sequence of three or more links that have the same path to the root in the html parse tree. In the RMN, we have a relational clique template defined by:

SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link1, Link link2
WHERE link1.From = link2.From and link1.Section = link2.Section and link1.To = doc1.Key and link2.To = doc2.Key and not doc1.Key = doc2.Key

Note that we expect the correlation between the labels in this case to be positive, so we can use AMN-type potentials in the max-margin estimation.

[Figure 8.4: Comparison of SVM, CociteML and CociteMM on WebKB; error per school (Cor, Tex, Was, Wis) and average. Only averages over the 4 schools are shown here.]

We compared the performance of SVM, CociteML and CociteMM. The results, shown in Fig. 8.4, show a 30% relative reduction in test error over SVM. The improvement is present when testing on each of the schools. Again, the maximum likelihood trained model CociteML achieves a worse test error than the maximum margin CociteMM model. We note that, in our experiments, the learned CociteMM weights never produced fractional solutions when used for inference, even in the case of the non-optimal multiclass relaxation, which suggests that the optimization successfully avoided problematic parameterizations of the network.

8.5 Related work

Our RMN representation is most closely related to the work on PRMs [Koller & Pfeffer, 1998]. Later work showed how to efficiently learn model parameters and structure (the equivalent of clique selection in Markov networks) from data [Friedman et al., 1999]. Getoor et al. [2002] propose several generative models of relational structure. Their approach easily captures the dependence of link existence on attributes of entities. However, there are many patterns that are difficult to model in PRMs, in particular those that involve several links at a time. We give some examples here.

One useful type of pattern is a similarity template, where objects that share a certain graph-based property are more likely to have the same label. Consider, for example, a professor X and two other entities Y and Z. If X's webpage mentions Y and Z in the same context, it is likely that the X-Y relation and the Y-Z relation are of the same type; for example, if Y is Professor X's advisee, then probably so is Z. Our framework accommodates these patterns easily, by introducing pairwise cliques between the appropriate relation variables.

Another useful type of subgraph template involves transitivity patterns, where the presence of an A-B link and of a B-C link increases (or decreases) the likelihood of an A-C link; for example, students often assist in courses taught by their advisor. Note that this type of interaction cannot be accounted for just using pairwise cliques. By introducing cliques over triples of relations, we can capture such patterns as well. We can incorporate even more complicated patterns, but of course we are limited by the ability of belief propagation to scale up as we introduce larger cliques and tighter loops in the Markov network.
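Instantiating the cocitation clique template amounts to enumerating pairs of distinct documents linked to from the same section of the same page. A sketch, with hypothetical field names standing in for the Doc/Link schema in the text:

```python
# Illustrative sketch of instantiating the Cocite clique template from
# Link(From, To, Section) tuples.  Field names and data are made up.
from collections import defaultdict
from itertools import combinations

links = [                         # (From, To, Section)
    ("faculty1", "courseA", 2),
    ("faculty1", "courseB", 2),
    ("faculty1", "projX", 3),
    ("faculty1", "studentY", 3),
]

# group link targets by (source page, section) — i.e. by the hub section
targets = defaultdict(set)
for frm, to, section in links:
    targets[(frm, section)].add(to)

# every unordered pair co-cited from one section becomes one clique
cocited = sorted({tuple(sorted(pair))
                  for docs in targets.values()
                  for pair in combinations(docs, 2)})
```

Each resulting pair would then contribute one relational clique over the two pages' Category labels, with an attractive (AMN-type) potential as discussed above.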
We describe and exploit these patterns in our work on RMNs using maximum likelihood estimation [Taskar et al., 2003b]. Attempts to model such patterns in PRMs run into the constraint that the probabilistic dependency graph (Bayesian network) must be a directed acyclic graph. For example, for the transitivity pattern, we might consider simply directing the correlation edges between link existence variables arbitrarily. However, it is not clear how to parameterize a link existence variable for a link that is involved in multiple triangles.

The structure of the relational graph has been used extensively to infer importance in scientific publications [Egghe & Rousseau, 1990] and hypertext [Kleinberg, 1999]. Several recent papers have proposed algorithms that use the link graph to aid classification. Chakrabarti et al. [1998] use system-predicted labels of linked documents to iteratively relabel each document in the test set, achieving a significant improvement compared to a baseline of using the text in each document alone. A similar approach was used by Neville and Jensen [2000] in a different domain. Slattery and Mitchell [2000] tried to identify directory (or hub) pages that commonly list pages of the same topic, and used these pages to improve classification of university webpages. However, none of these approaches provide a coherent model for the correlations between linked webpages: they apply combinations of classifiers in a procedural way, with no formal justification. Our approach provides a coherent foundation for the process of collective classification, where we want to classify multiple entities, exploiting the interactions between their labels.

In some cases, we can incorporate relational features into standard flat classification. For example, when classifying papers into topics, it is possible to simply view the presence of particular citations as atomic features. However, this approach is limited in cases where some or even all of the relational features that occur in the test data are not observed in the training data. In our WebKB example, there is no overlap between the webpages in the different schools, so we cannot learn anything from the training data about the significance of a hyperlink to/from a particular webpage in the test data. Incorporating basic features (e.g., words) from the related entities can aid in classification, but cannot exploit the strong correlation between the labels of related entities that RMNs capture.

8.6 Conclusion

In this chapter, we propose a new approach for classification in relational domains. We have shown that we can exploit a very rich set of relational patterns in classification, significantly improving the classification accuracy over standard flat classification. Given the wealth of possible patterns, it is particularly interesting to explore the problem of inducing them automatically.

The results in this chapter represent only a subset of the domains we have worked on (see [Taskar et al., 2003b]). RMNs are generally applicable to any relational domain. Hypertext is the most easily available source of structured data; social networks, however, provide extensive information about interactions among people and organizations. RMNs offer a principled method for learning to predict communities of and hierarchical structure between people and organizations based on both the local attributes and the patterns of static and dynamic interaction.
Part III

Broader applications: parsing, matching, clustering
Chapter 9

Context free grammars

We present a novel discriminative approach to parsing using a structured max-margin criterion based on the decomposition properties of context free grammars. We show that this framework allows high-accuracy parsing in cubic time by exploiting novel kinds of lexical information. Our models can condition on arbitrary features of input sentences, thus incorporating an important kind of lexical information not usually used by conventional parsers. We show experimental evidence of the model's improved performance over a natural baseline model and a lexicalized probabilistic context-free grammar.

9.1 Context free grammar model

CFGs are one of the primary formalisms for capturing the recursive structure of syntactic constructions, although many others have also been proposed [Manning & Schütze, 1999]. The nonterminal symbols correspond to syntactic categories such as noun phrase (NP) or verbal phrase (VP). The terminal symbols are usually words of the sentence. However, in the discriminative framework that we adopt, we are not concerned with defining a distribution over sequences of words (language model). Instead, we condition on the words in a sentence to produce a model of the syntactic structure. Terminal symbols for our purposes are part-of-speech tags like nouns (NN), verbs (VBD), determiners (DT). For clarity of presentation, we restrict our grammars to be in Chomsky normal form as in Sec. 3.4. Fig. 9.1(a) shows a parse tree for the sentence The screen was a sea of red.
The set of symbols we use is based on the Penn Treebank [Marcus et al., 1993]. The nonterminal symbols with bars (for example, NN, DT, VBD) are added to conform to the CNF restrictions. For convenience, we repeat our definition of a CFG from Sec. 3.4 here:

Definition 9.1.1 (CFG) A CFG G consists of:
◦ A set of nonterminal symbols, N
◦ A designated set of start symbols, NS ⊆ N
◦ A set of terminal symbols, T
◦ A set of productions, P = {PB, PU}, divided into Binary productions, PB = {A → B C : A, B, C ∈ N} and Unary productions, PU = {A → D : A ∈ N, D ∈ T}.

A CFG defines a set of valid parse trees in a natural manner:

Definition 9.1.2 (CFG tree) A CFG tree is a labeled directed tree, where the set of valid labels of the internal nodes other than the root is N and the set of valid labels for the leaves is T. The root's label set is NS. Each preleaf node has a single child, and this pair of nodes can be labeled as A and D, respectively, if and only if there is a unary production A → D ∈ PU. All other internal nodes have two children, left and right, and this triple of nodes can be labeled as A, B and C, respectively, if and only if there is a binary production A → B C ∈ PB.

In general, there are exponentially many parse trees that produce a sentence of length n. This tree representation seems quite different from the graphical models we have been considering thus far. However, we can use an equivalent representation that essentially encodes a tree as an assignment to a set of appropriate variables. For each span starting with s and ending with e, we introduce a variable Ys,e taking values in N ∪ {⊥} to represent the label of the subtree that exactly covers, or dominates, the words of the sentence from s to e. The value ⊥ is assigned if no subtree dominates the span; if Ys,e ≠ ⊥, the span is often called a constituent. Indices s and e refer to positions between words, rather than to words themselves, hence 0 ≤ s < e ≤ n for a sentence of length n. The "top" symbol Y0,n is constrained to be in NS. Additionally, we introduce variables Ys,s taking values in T to represent the terminal symbol (part-of-speech) between s and s + 1.

[Figure 9.1: Two representations of a binary parse tree: (a) nested tree structure, and (b) grid of labeled spans. The row and column number in the grid correspond to the beginning and end of the span, respectively. The gray squares on the diagonal represent the terminal variables (part-of-speech tags); empty squares correspond to non-constituent spans (⊥ values).]

Fig. 9.1(b) shows the representation of the tree in Fig. 9.1(a) as a grid where each square corresponds to Ys,e. For example, the figure shows Y0,0 = DT, Y6,6 = NN, Y3,5 = NP and Y1,4 = ⊥, and Y0,7 = S, the starting symbol of the sentence.

While any parse tree corresponds to an assignment to this set of variables in a straightforward manner, the converse is not true: there are assignments that do not correspond to valid parse trees. In order to characterize the set of valid assignments Y, consider the constraints that hold for a valid assignment y:

I(ys,e = A) = Σ_{s<m<e} Σ_{A→B C∈PB} I(ys,m,e = (A, B, C)),   0 ≤ s < e ≤ n with e > s + 1, ∀A ∈ N;   (9.1)

I(ys,e = A) = Σ_{0≤s′<s} Σ_{B→C A∈PB} I(ys′,s,e = (B, C, A)) + Σ_{e<e′≤n} Σ_{B→A C∈PB} I(ys,e,e′ = (B, A, C)),   0 ≤ s < e ≤ n with (s, e) ≠ (0, n), ∀A ∈ N;   (9.2)

I(ys,s+1 = A) = Σ_{A→D∈PU} I(ys,s,s+1 = (A, D)),   0 ≤ s < n, ∀A ∈ N;   (9.3)

I(ys,s = D) = Σ_{A→D∈PU} I(ys,s,s+1 = (A, D)),   0 ≤ s < n, ∀D ∈ T.   (9.4)

The notation ys,m,e = (A, B, C) abbreviates ys,e = A ∧ ys,m = B ∧ ym,e = C, and ys,s,s+1 = (A, D) abbreviates ys,s+1 = A ∧ ys,s = D. The first set of constraints (9.1) holds because if the span from s to e is dominated by a subtree starting with A (that is, ys,e = A), then there must be a unique production A → B C and split point m, s < m < e, with ys,m = B and ym,e = C. Similarly, the second set of constraints (9.2) holds because if ys,e = A, then there must be a unique production that generated it, either starting before s or ending after e; and the third and fourth sets of constraints (9.3 and 9.4) hold since the terminals are generated using valid unary productions. We denote the set of assignments y satisfying (9.1–9.4) as Y. In fact, the converse is true as well:

Theorem 9.1.3 If y ∈ Y, then y represents a valid CFG tree.

Proof sketch: It is straightforward to construct a parse tree from y ∈ Y in a top-down manner. Starting from the root symbol y0,n, the first set of constraints (9.1) ensures that a unique production spans 0 to n, say splitting at m and specifying the values for y0,m and ym,n. The second set of constraints (9.2) ensures that all other spans y0,m′ and ym′,n, for m′ ≠ m, are labeled by ⊥. Recursing on the two subtrees will produce the rest of the tree down to the preterminals. The last two sets of constraints (9.3 and 9.4) ensure that the terminals are generated by appropriate unary productions from the preterminals.
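The span-variable encoding can be sketched concretely. The following illustrative (not thesis) code flattens a parse tree into span assignments Y[s, e]; for simplicity it records one label per span, placing pre-terminal tags on length-1 spans, and every span not in the result implicitly takes the value ⊥.

```python
# Flatten a parse tree into span assignments Y[s, e] = label.  Trees are
# nested tuples: (label, left, right) for binary nodes, (tag, word) at
# pre-leaves.  Simplified sketch: pre-terminal tags label length-1 spans.
def span_labels(tree, s=0, out=None):
    if out is None:
        out = {}
    label = tree[0]
    if isinstance(tree[1], str):            # pre-leaf: (tag, word)
        out[(s, s + 1)] = label
        return out, s + 1
    e = s
    for child in tree[1:]:                  # recurse left-to-right
        out, e = span_labels(child, e, out)
    out[(s, e)] = label                     # this node dominates (s, e)
    return out, e

tree = ("S", ("NP", ("DT", "the"), ("NN", "screen")), ("VBD", "was"))
grid, n = span_labels(tree)
```

Any span missing from `grid` (e.g. (1, 3) here) is a non-constituent, i.e. ⊥ in the grid representation of Fig. 9.1(b).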
9.2 Context free parsing

A standard approach to parsing is to use a CFG to define a probability distribution over parse trees. This can be done simply by assigning a probability to each production and making sure that the sum of probabilities of all productions starting with each symbol is 1:

Σ_{B,C : A→B C∈PB} P(A → B C) = 1, ∀A ∈ N;   Σ_{D : A→D∈PU} P(A → D) = 1, ∀A ∈ N.   (9.5)

The probability of a tree is simply the product of probabilities of the productions used in the tree. If the production scores are production log probabilities, then the tree score is the tree log probability. More generally, a weighted CFG assigns a score to each production (this score may depend on the position of the production s, m, e) such that the total score of a tree is the sum of the scores of all the productions used:

S(y) = Σ_{0≤s<m<e≤n} Ss,m,e(ys,m,e) + Σ_{0≤s<n} Ss,s,s+1(ys,s,s+1),

where Ss,m,e(ys,m,e) = 0 if (ys,e = ⊥ ∨ ys,m = ⊥ ∨ ym,e = ⊥), and similarly for the unary terms. However, weighted CFGs do not have the local normalization constraints Eq. (9.5).

We can use a Viterbi-style dynamic programming algorithm called CKY to compute the highest score parse tree in O(Pn³) time [Younger, 1967; Manning & Schütze, 1999]. The algorithm computes the highest score of any subtree starting with a symbol over each span 0 ≤ s < e ≤ n recursively:

S*s,s+1(A) = max_{A→D∈PU} Ss,s,s+1(A, D),   ∀A ∈ N, 0 ≤ s < n;

S*s,e(A) = max_{A→B C∈PB} max_{s<m<e} [Ss,m,e(A, B, C) + S*s,m(B) + S*m,e(C)],   ∀A ∈ N, 0 ≤ s < e ≤ n.

The highest scoring tree has score max_{A∈NS} S*0,n(A). Using the arg max's of the max's in the computation of S*, we can backtrace the highest scoring tree itself.

9.3 Discriminative parsing models

We cast parsing as a structured classification task, where we want to learn a function h : X → Y, where X is a set of sentences, and Y is a set of valid parse trees according to a fixed CFG grammar. The functions we consider take the following linear discriminant form:

hw(x) = arg max_y w⊤f(x, y),

where w ∈ IRd and f is a basis function representation of a sentence and parse tree pair, f : X × Y → IRd. Note that this class of discriminants includes PCFG models, where the basis functions consist of the counts of the productions used in the parse, and the parameters w are the log-probabilities of those productions.

To simplify notation, we introduce the set of indices, C, which includes both spans and span triplets: C = {(s, e) : 0 ≤ s ≤ e ≤ n} ∪ {(s, m, e) : 0 ≤ s ≤ m < e ≤ n}. We assume that the basis functions decompose with the CFG structure:

f(x, y) = Σ_{0≤s≤e≤n} f(xs,e, ys,e) + Σ_{0≤s≤m<e≤n} f(xs,m,e, ys,m,e) = Σ_{c∈C} f(xc, yc),

where n is the length of the sentence x and xs,e and xs,m,e are the relevant subsets of the sentence the basis functions depend on. For example, f could include functions which identify the production used together with features of the words at positions s, m, e, and neighboring positions in the sentence x (e.g. f(xs,m,e, ys,m,e) = I(ys,m,e = (S, NP, VP) ∧ mth word(x) = was)). We could also include functions that identify the label of the span
from s to e together with features of the word (e.g. f(xs,m, ys,m) = I(ys,m = NP ∧ sth word(x) = the)).
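The CKY recursion of Sec. 9.2 can be sketched directly. The grammar, production scores, and example sentence below are toy assumptions used only to illustrate the cubic dynamic program; real models would attach position-dependent scores as described above.

```python
# Compact sketch of the CKY dynamic program for a weighted CFG in CNF.
# Scores are additive; higher is better.  Toy grammar and sentence.
NEG = float("-inf")
binary = {("S", "NP", "VP"): 0.0, ("NP", "DT", "NN"): 0.0, ("VP", "VBD", "NP"): 0.0}
unary = {("DT", "the"): 0.0, ("NN", "dog"): -0.1, ("NN", "cat"): -0.1, ("VBD", "saw"): 0.0}

def cky(words, start="S"):
    n = len(words)
    best, back = {}, {}
    for s, w in enumerate(words):                      # unary layer: S*_{s,s+1}(A)
        for (A, D), sc in unary.items():
            if D == w and sc > best.get((s, s + 1, A), NEG):
                best[(s, s + 1, A)] = sc
                back[(s, s + 1, A)] = D
    for width in range(2, n + 1):                      # binary layer: S*_{s,e}(A)
        for s in range(n - width + 1):
            e = s + width
            for (A, B, C), sc in binary.items():
                for m in range(s + 1, e):
                    total = sc + best.get((s, m, B), NEG) + best.get((m, e, C), NEG)
                    if total > best.get((s, e, A), NEG):
                        best[(s, e, A)] = total
                        back[(s, e, A)] = (m, B, C)    # arg max for backtracing
    return best.get((0, n, start), NEG), back

score, back = cky("the dog saw the cat".split())
```

The `back` table records the arg max split and child labels at each cell, so the highest scoring tree can be reconstructed top-down from (0, n, start), exactly as the backtrace described in Sec. 9.2.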
9.3.1 Maximum likelihood estimation
The traditional method of estimating the parameters of PCFGs assumes a generative model that defines P(x, y) by assigning normalized probabilities to CFG productions. We then maximize the joint log-likelihood Σi log P(x(i), y(i)) (with some regularization). We compare to such a generative grammar of Collins [1999] in our experiments.

An alternative probabilistic approach is to estimate the parameters discriminatively by maximizing conditional log-likelihood. For example, the maximum entropy approach [Johnson, 2001] defines a conditional log-linear model:

Pw(y | x) = (1/Zw(x)) exp{w⊤f(x, y)},

where Zw(x) = Σy exp{w⊤f(x, y)}, and maximizes the conditional log-likelihood of the sample, Σi log Pw(y(i) | x(i)) (with some regularization). The same assumption that
the basis functions decompose as sums of local functions over spans and productions is typically made in such models. Hence, as in Markov networks, the gradient depends on the expectations of the basis functions, which can be computed in O(Pn³) time by a dynamic programming algorithm called inside-outside, which is similar to the CKY algorithm. However, computing the expectations over trees is actually more expensive in practice than finding the best tree, for several reasons: CKY works entirely in log-space, while inside-outside needs to compute actual probabilities, and branch-and-prune techniques, which save a lot of useless computation, are only applicable in CKY. A typical method for finding the parameters is to use Conjugate Gradients or L-BFGS methods [Nocedal & Wright, 1999; Boyd & Vandenberghe, 2004], which repeatedly compute these expectations to calculate the gradient. Clark and Curran [2004] report experiments involving 479 iterations of training for one model, and 1550 iterations for another using similar methods.
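Numerically, the conditional log-linear model above reduces to a softmax over scores w⊤f(x, y). A toy illustration over a small, explicitly enumerated candidate set (the weights and feature vectors are invented; real models enumerate Y implicitly via inside-outside):

```python
# Toy numeric illustration of P_w(y | x) = exp(w . f(x, y)) / Z_w(x)
# over an explicit candidate set.  Weights/features are made up.
import math

def conditional_probs(w, feats_by_y):
    """feats_by_y: {candidate y: feature vector f(x, y)} for one sentence x."""
    scores = {y: sum(wi * fi for wi, fi in zip(w, f)) for y, f in feats_by_y.items()}
    z = sum(math.exp(s) for s in scores.values())      # partition function Z_w(x)
    return {y: math.exp(s) / z for y, s in scores.items()}

probs = conditional_probs([1.0, -0.5], {"tree1": [1.0, 0.0], "tree2": [0.0, 2.0]})
```

The gradient of the conditional log-likelihood is the difference between empirical and expected feature counts under these probabilities, which is what inside-outside computes without enumerating trees.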
9.3.2 Maximum margin estimation
We assume that the loss function also decomposes with the CFG structure:

ℓ(x, y, ŷ) = Σ_{0≤s≤e≤n} ℓ(xs,e, ys,e, ŷs,e) + Σ_{0≤s≤m<e≤n} ℓ(xs,m,e, ys,m,e, ŷs,m,e) = Σ_{c∈C} ℓ(xc, yc, ŷc).

One approach would be to define ℓ(xs,e, ys,e, ŷs,e) = I(ŷs,e ≠ ys,e). This would lead to ℓ(x, y, ŷ) tracking the number of "constituent errors" in ŷ. Another, more strict definition would be ℓ(xs,m,e, ys,m,e, ŷs,m,e) = I(ŷs,m,e ≠ ys,m,e). This definition would lead to ℓ(x, y, ŷ) being the number of productions in ŷ which are not seen in y. The constituent loss function does not exactly correspond to the standard scoring metrics, such as F1 or crossing brackets, but shares the sensitivity to the number of differences between trees. We have not thoroughly investigated the exact interplay between the various loss choices and the various parsing metrics. We used the constituent loss in our experiments.

As in the max-margin estimation for Markov networks, we can formulate an exponential size QP:

min  (1/2)‖w‖² + C Σi ξi    (9.6)
s.t.  w⊤∆fi(y) ≥ ℓi(y) − ξi,  ∀i, y,
where ∆fi(y) = f(x(i), y(i)) − f(x(i), y), and ℓi(y) = ℓ(x(i), y(i), y). The dual of Eq. (9.6) (after normalizing by C) is given by:

max  Σ_{i,y} αi(y) ℓi(y) − (C/2) ‖ Σ_{i,y} αi(y) ∆fi(y) ‖²    (9.7)
s.t.  Σy αi(y) = 1, ∀i;  αi(y) ≥ 0, ∀i, y.
Both of the above formulations are exponential (in the number of variables or constraints) in the lengths (ni's) of the sentences. But we can exploit the context-free structure of the basis functions and the loss to define a polynomial-size dual formulation in terms of
marginal variables µi(y):

µi,s,e(A) ≡ Σ_{y : ys,e = A} αi(y),   0 ≤ s ≤ e ≤ n, ∀A ∈ N;
µi,s,s(D) ≡ Σ_{y : ys,s = D} αi(y),   0 ≤ s < n, ∀D ∈ T;
µi,s,m,e(A, B, C) ≡ Σ_{y : ys,m,e = (A,B,C)} αi(y),   0 ≤ s < m < e ≤ n, ∀A → B C ∈ PB;
µi,s,s,s+1(A, D) ≡ Σ_{y : ys,s,s+1 = (A,D)} αi(y),   0 ≤ s < n, ∀A → D ∈ PU.
There are O(PB ni³ + PU ni) such variables for each sentence of length ni, instead of exponentially many αi variables. We can now express the objective function in terms of the marginals. Using these variables, the first set of terms in the objective becomes:

Σ_{i,y} αi(y) ℓi(y) = Σ_{i,y} αi(y) Σ_{c∈C(i)} ℓi,c(yc) = Σ_{i, c∈C(i), yc} µi,c(yc) ℓi,c(yc).

Similarly, the second set of terms (inside the 2-norm) becomes:

Σ_{i,y} αi(y) ∆fi(y) = Σ_{i,y} αi(y) Σ_{c∈C(i)} ∆fi,c(yc) = Σ_{i, c∈C(i), yc} µi,c(yc) ∆fi,c(yc).
As in M³Ns, we must characterize the set of marginals µ that corresponds to a valid α. The constraints on µ are essentially based on those that define Y in (9.1–9.4). In addition, we require that the marginals over the root nodes, µi,0,ni(y0,ni), sum to 1 over the possible start symbols NS.
Putting the pieces together, the factored dual is:

max  Σ_{i, c∈C(i), yc} µi,c(yc) ℓi,c(yc) − (C/2) ‖ Σ_{i, c∈C(i), yc} µi,c(yc) ∆fi,c(yc) ‖²    (9.8)

s.t.  Σ_{A∈NS} µi,0,ni(A) = 1,   ∀i;
      µi,c(yc) ≥ 0,   ∀i, ∀c ∈ C(i);
      µi,s,e(A) = Σ_{s<m<e} Σ_{A→B C∈PB} µi,s,m,e(A, B, C),   ∀i, 0 ≤ s < e ≤ ni with e > s + 1, ∀A ∈ N;
      µi,s,e(A) = Σ_{0≤s′<s} Σ_{B→C A∈PB} µi,s′,s,e(B, C, A) + Σ_{e<e′≤ni} Σ_{B→A C∈PB} µi,s,e,e′(B, A, C),   ∀i, 0 ≤ s < e ≤ ni with (s, e) ≠ (0, ni), ∀A ∈ N;
      µi,s,s+1(A) = Σ_{A→D∈PU} µi,s,s,s+1(A, D),   ∀i, 0 ≤ s < ni, ∀A ∈ N;
      µi,s,s(D) = Σ_{A→D∈PU} µi,s,s,s+1(A, D),   ∀i, 0 ≤ s < ni, ∀D ∈ T.
The constraints on µ are necessary, since the µ must correspond to marginals of a distribution over trees. They are also sufficient:

Theorem 9.3.1 A set of marginals µi(y) satisfying the constraints in Eq. (9.8) corresponds to a valid distribution over the legal parse trees y ∈ Y(i). A consistent distribution αi(y) is given by

αi(y) = µi,0,ni(y0,ni) ∏_{0≤s≤m<e≤ni} [µi,s,m,e(ys,m,e) / µi,s,e(ys,e)],

where 0/0 = 0 by convention.

Proof sketch: The proof follows from inside-outside probability relations [Manning & Schütze, 1999]. The first term is a valid distribution over starting symbols. Each µi,s,m,e(ys,m,e)/µi,s,e(ys,e) term for m > s corresponds to a conditional distribution over binary productions (ys,e → ys,m ym,e) that is guaranteed to sum to 1 over split points m and possible productions. Similarly, each µi,s,s,s+1(ys,s,s+1)/µi,s,s+1(ys,s+1) term corresponds to a conditional distribution over unary productions (ys,s+1 → ys,s) that is guaranteed to sum to 1 over possible productions. Hence, we have defined a kind of PCFG (where production probabilities depend on the location of the symbol), which induces a valid distribution αi over trees. It is straightforward to verify that this distribution has marginals µi.
9.4 Structured SMO for CFGs
We trained our max-margin models using the Structured SMO algorithm with block-coordinate descent adopted from graphical models (see Sec. 6.1). The CKY algorithm computes similar max-marginals in the course of computing the best tree as does Viterbi in Markov networks:

vi,c(yc) = max_{y∼yc} [w⊤fi(y) + ℓi(y)],    αi,c(yc) = max_{y∼yc} αi(y).

We also define v̄i,c(yc) = max_{y′c≠yc} vi,c(y′c), the best value over assignments that disagree with yc at c. Note that we do not explicitly represent αi(y), but we can reconstruct the maximum-entropy one from the marginals µi as in Theorem 9.3.1. We again express the KKT conditions in terms of the max-marginals for each span and span triple c ∈ C(i) and its values yc:

αi,c(yc) = 0 ⇒ vi,c(yc) ≤ v̄i,c(yc);    αi,c(yc) > 0 ⇒ vi,c(yc) ≥ v̄i,c(yc).    (9.9)
The algorithm cycles through the training sentences, runs CKY to compute the max-marginals and performs an SMO update on the violated constraints. We typically find that 20–40 iterations through the data are sufficient for convergence in terms of the objective function improvements.
9.5 Experiments
We used the Penn English Treebank for all of our experiments. We report results here for each model and setting trained and tested on only the sentences of length ≤ 15 words. Aside
from the length restriction, we used the standard splits: sections 2–21 for training (9753 sentences), 22 for development (603 sentences), and 23 for final testing (421 sentences). As a baseline, we trained a CNF transformation of the unlexicalized model of Klein and Manning [2003] on this data. The resulting grammar had 3975 non-terminal symbols and contained two kinds of productions: binary non-terminal rewrites and tag-word rewrites. Unary rewrites were compiled into a single compound symbol, so for example a subject-gapped sentence would have a label like S+VP. These symbols were expanded back into their source unary chain before parses were evaluated. The scores for the binary rewrites were estimated using unsmoothed relative frequency estimators. The tagging rewrites were estimated with a smoothed model of P(w | t), also using the model from Klein and Manning [2003]. In particular, Table 9.2 shows the performance of this model (GENERATIVE): 87.99 F1 on the test set. For the
BASIC
maxmargin model, we used exactly the same set of allowed rewrites
(and therefore the same set of candidate parses) as in the generative case, but estimated their weights using the maxmargin formulation with a loss that counts the number of wrong spans. Tagword production weights were ﬁxed to be the log of the generative P (wt) model. That is, the only change between GENERATIVE and BASIC is the use of the discriminative maximummargin criterion in place of the generative maximum likelihood one for learning production weights. This change alone results in a small improvement (88.20 vs. 87.99 F1 ). On top of the basic model, we ﬁrst added lexical features of each span; this gave a
LEXICAL
model. For a span s, e of a sentence x, the base lexical features were:
◦ xs , the ﬁrst word in the span ◦ xs−1 , the preceding adjacent word ◦ xe−1 , the last word in the span ◦ xe , the following adjacent word ◦ xs−1 , xs ◦ xe−1 , xe ◦ xs+1 for spans of length 3
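As an illustration, extracting the base lexical features for a span might look as follows. This is a hedged sketch: the feature names, the `None` padding for out-of-range positions, and the dictionary encoding are our own choices, not the thesis implementation.

```python
def span_lexical_features(words, s, e):
    """Illustrative sketch of the base lexical features for span (s, e).

    `words` is the token list of the sentence; positions outside the
    sentence are padded with None. Feature names here are invented.
    """
    get = lambda i: words[i] if 0 <= i < len(words) else None
    feats = {
        "first": get(s),                   # x_s
        "prev": get(s - 1),                # x_{s-1}
        "last": get(e - 1),                # x_{e-1}
        "next": get(e),                    # x_e
        "prev+first": (get(s - 1), get(s)),
        "last+next": (get(e - 1), get(e)),
    }
    if e - s == 3:
        feats["middle"] = get(s + 1)       # x_{s+1} for spans of length 3
    if e - s <= 3:                         # conjoin with length for short spans
        feats = {f"{k}&len={e - s}": v for k, v in feats.items()}
    return feats

words = "The screen was a sea of red .".split()
print(span_lexical_features(words, 2, 4))  # the length-2 span "was a"
```

The length conjunction mirrors the remark below that short spans have highly distinct behaviors, so their features are kept separate from those of longer spans.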
Model          P      R      F1
GENERATIVE     87.70  88.06  87.88
BASIC          87.51  88.44  87.98
LEXICAL        88.15  88.62  88.39
LEXICAL+AUX    89.74  90.22  89.98

Table 9.1: Development set results of the various models when trained and tested on Penn treebank sentences of length ≤ 15.
Model          P      R      F1
GENERATIVE     88.25  87.73  87.99
BASIC          88.08  88.31  88.20
LEXICAL        88.55  88.34  88.44
LEXICAL+AUX    89.14  89.10  89.12
COLLINS 99     89.18  88.20  88.69

Table 9.2: Test set results of the various models when trained and tested on Penn treebank sentences of length ≤ 15.
These base features were conjoined with the span length for spans of length 3 and below, since short spans have highly distinct behaviors (see the examples below). The features are lexical in the sense that they allow specific words and word pairs to influence the parse scores, but are distinct from traditional lexical features in several ways. First, there is no notion of headword here, nor is there any modeling of word-to-word attachment. Rather, these features pick up on lexical trends in constituent boundaries, for example the trend that in the sentence The screen was a sea of red., the (length 2) span between the word was and the word of is unlikely to be a constituent. These non-head lexical features capture a potentially very different source of constraint on tree structures than head-argument pairs, one having to do more with linear syntactic preferences than lexical selection. Regardless of the relative merit of the two kinds of information, one clear advantage of the present approach is that inference in the resulting model remains cubic (as opposed to O(n^5)), since the dynamic program need not track items with distinguished headwords. With the addition of these features, the accuracy moved past the generative baseline, to 88.44.
As a concrete (and particularly clean) example of how these features can sway a decision, consider the sentence The Egyptian president said he would visit Libya today to resume the talks. The generative model incorrectly considers Libya today to be a base NP. However, this analysis is counter to the trend of today being a one-word constituent. Two features relevant to this trend are: (CONSTITUENT ∧ first-word = today ∧ length = 1) and (CONSTITUENT ∧ last-word = today ∧ length = 1). These features represent the preference of the word today for being the first and last word in constituent spans of length 1; in this length 1 case, these are the same feature. Note also that the features are conjoined with only one generic label class "constituent" rather than specific constituent types. In the LEXICAL model, these features have quite large positive weights: 0.62 each. As a result, this model makes this parse decision correctly.

Another kind of feature that can usefully be incorporated into the classification process is the output of other, auxiliary classifiers. For this kind of feature, one must take care that its reliability on the training set not be vastly greater than its reliability on the test set; otherwise, its weight will be artificially (and detrimentally) high. To ensure that such features are as noisy on the training data as on the test data, we split the training set into two folds. We then trained the auxiliary classifiers on each fold, and used their predictions as features on the other fold. The auxiliary classifiers were then retrained on the entire training set, and their predictions used as features on the development and test sets.

We used two such auxiliary classifiers, giving a prediction feature for each span (these classifiers predicted only the presence or absence of a bracket over that span, not bracket labels). The first feature was the prediction of the generative baseline; this feature added little information, but made the learning phase faster. The second feature was the output of a flat classifier which was trained to predict whether single spans, in isolation, were constituents or not, based on a bundle of features including the list above, but also the following: the preceding, first, last, and following tag in the span, pairs of tags such as preceding-first, last-following, preceding-following, first-last, and the entire tag sequence.¹ While the flat classifier alone was quite poor (P 78.77 / R 63.94 / F1 70.58), the resulting max-margin model (LEXICAL+AUX) scored 89.12 F1.

To situate these numbers with respect to other models, the parser in [Collins, 1999], which is generative, lexicalized, and intricately smoothed, scores 88.69 over the same train/test configuration.

¹Tag features on the test sets were taken from a pre-tagging of the sentence by the tagger described in [Toutanova et al., 2003].

9.6 Related work

A number of recent papers have considered discriminative approaches for natural language parsing [Johnson et al., 1999; Collins, 2000; Johnson, 2001; Geman & Johnson, 2002; Miyao & Tsujii, 2002; Clark & Curran, 2004; Kaplan et al., 2004; Collins, 2004]. Broadly speaking, these approaches fall into two categories: reranking and dynamic programming approaches. In reranking methods [Johnson et al., 1999; Collins, 2000; Shen et al., 2004], an initial parser is used to generate a number of candidate parses. A discriminative model is then used to choose between these candidates. In dynamic programming methods, a large number of candidate parse trees are represented compactly in a parse tree forest or chart. Given sufficiently "local" features, the decoding and parameter estimation problems can be solved using dynamic programming algorithms. For example, several approaches [Johnson, 2001; Geman & Johnson, 2002; Miyao & Tsujii, 2002; Clark & Curran, 2004; Kaplan et al., 2004] are based on conditional log-linear (maximum entropy) models, where variants of the inside-outside algorithm can be used to efficiently calculate gradients of the log-likelihood function, despite the exponential number of trees represented by the parse forest.

The method we presented has several compelling advantages. Unlike reranking methods, which consider only a pre-pruned selection of "good" parses, our method is an end-to-end discriminative model over the full space of parses. This distinction can be very significant, as the set of n-best parses often does not contain the true parse. For example, in the work of Collins [2000], 41% of the correct parses were not in the candidate pool of ∼30-best parses. Unlike previous dynamic programming approaches, which were based on maximum entropy estimation, our method incorporates an articulated loss function which penalizes larger tree discrepancies more severely than smaller ones. Moreover, the structured SMO we use requires only the calculation of Viterbi trees, rather than expectations over all trees (for example using the inside-outside algorithm). This allows a range of optimizations that prune the space of parses (without making approximations), not possible for maximum likelihood approaches, which must extract basis function expectations from the entire set of parses.

9.7 Conclusion

We have presented a maximum-margin approach to parsing, which allows a discriminative SVM-like objective to be applied to the parsing problem. Our framework permits the use of a rich variety of input features, while still decomposing in a way that exploits the shared substructure of parse trees in the standard way. The current feature set allows limited lexicalization in cubic time, unlike other lexicalized models (including the Collins model, which it outperforms in the present limited experiments).

It is worth considering the cost of this kind of method. At training time, discriminative methods are inherently expensive, since they all involve iteratively checking current model performance on the training set, which means parsing the training set (usually many times). In our experiments, 20-40 iterations were generally required for convergence (except the BASIC model, which took about 100 iterations). Generative approaches are vastly cheaper to train, since they must only collect counts from the training set. On the other hand, the max-margin approach does have the potential to incorporate many new kinds of features over the input. This tradeoff between the complexity, accuracy and efficiency of a parsing model is an important area of future research.
Chapter 10

Matchings

We address the problem of learning to match: given a set of input graphs and corresponding matchings, find a parameterized edge scoring function such that the correct matchings have the highest score. We have shown a compact max-margin formulation for bipartite matchings in Ch. 4. Bipartite matchings are used in many fields, for example, to find marker correspondences in vision problems, to map words of a sentence in one language to another, or to identify functional genetic analogues in different organisms. In this chapter, we focus on the more complex problem of non-bipartite matchings. We motivate this problem using an application in computational biology, disulfide connectivity prediction; throughout this chapter we use it as an example, but non-bipartite matchings can be used for many other tasks. Identifying disulfide bridges formed by cysteine residues is critical in determining the structure of proteins. Recently proposed models have formulated this prediction task as a maximum weight perfect matching problem in a graph containing cysteines as nodes, with edge weights measuring the attraction strength of the potential bridges. We provide some background on this problem, and exploit combinatorial properties of the perfect matching problem to define a compact, convex, quadratic program. We use kernels to efficiently learn very rich (in fact, infinite-dimensional) models and present experiments on standard protein databases, showing that our framework achieves state-of-the-art performance on the task.
2004. and predicting the exact connectivity among bonded cysteines.1 Disulﬁde connectivity prediction Proteins containing cysteine residues form intrachain covalent bonds known as disulﬁde bridges. Recently. Klepeis & Floudas. 2004].. We parameterize the this attraction strength via a linear combination of features.. Knowledge of the exact disulﬁde bonding pattern in a protein provides information about protein structure and possibly its function and evolution. Vullo & Frasconi. which can include the protein sequence around the two residues. 2003. 2001]. 1 .1. secondary structure or solvent accessibility information.. there has been increased interest in applying computational techniques to the task of predicting the intrachain disulﬁde connectivity [Fariselli & Casadio. scalable and accurate methods for the prediction of disulﬁde bonds has numerous practical applications. DISULFIDE CONNECTIVITY PREDICTION 145 10. Alternatively. 2002. the task of predicting the connectivity pattern is typically decomposed into two subproblems: predicting the bonding state of each cysteine in the sequence. Furthermore. 2004] that predict the connectivity pattern without knowing the bonding state of each cysteine1 . there are methods [Baldi et al. Baldi et al. since the disulﬁde connectivity pattern imposes structure constraints. Such bonds are a very important feature of protein structure since they enhance conformational stability by reducing the number of conﬁgurational states and decreasing the entropic cost of folding a protein into its native state [Matsumura et al.. etc. 1989]. evolutionary information in the form of multiple alignment proﬁles. We predict the connectivity pattern by ﬁnding the maximum weighted matching in a graph in which each vertex represents a cysteine residue. the development of efﬁcient. it can be used to reduce the search space in both protein folding prediction as well as protein 3D structure prediction. 
We thank Pierre Baldi and Jianlin Cheng for introducing us to the problem of disulﬁde connectivity prediction and providing us with preliminary draft of their paper and results of their model. Since a sequence may contain any number of cysteine residues. Thus. 2001. Fariselli et al. as well as the protein datasets. 1994]. They do so mostly by imposing strict structural constraints due to the resulting links between distant regions of the protein sequence [Harrison & Sternberg. and each edge represents the “attraction strength” between the cysteines it connects [Fariselli & Casadio.10. which may or may not participate in disulﬁde bonds.
10.2 Learning to match

Formally, we seek a function h : X → Y that maps inputs x ∈ X to output matchings y ∈ Y. For disulfide connectivity prediction, X is the space of protein sequences and Y is the space of matchings of their cysteines. The training data consists of m examples S = \{(x^{(i)}, y^{(i)})\}_{i=1}^m of input graphs and output matchings. We assume that the input x defines the space of possible matchings using some deterministic procedure. For example, given a protein sequence, we construct a complete graph where each node corresponds to a cysteine. The 1ANS protein in Fig. 10.1 has 6 cysteines (nodes), 15 potential bonds (edges) and 15 possible perfect matchings. For simplicity, we assume complete graphs, but very little needs to be changed to handle sparse graphs.

In a perfect matching, each node is connected to exactly one other node; in non-perfect matchings, each node is connected to at most one other node. We represent each possible edge between nodes j and k (j < k) in example i using a binary variable y_{jk}. If example i has L_i nodes, then there are L_i(L_i - 1)/2 edge variables, so y^{(i)} is a binary vector of dimension L_i(L_i - 1)/2. The space of matchings Y is very large: letting n_i = L_i/2, for complete graphs with an even number of vertices L_i the number of possible perfect matchings is (2n_i)!/(2^{n_i} n_i!), super-exponential in the number of nodes in the graph. However, Y has interesting and complex combinatorial structure which we exploit to learn h efficiently.

Our hypothesis class is maximum weight matchings:

h_s(x) = \arg\max_{y \in Y} \sum_{jk} s_{jk}(x) y_{jk}.    (10.1)

For disulfide connectivity prediction, this model was used by Fariselli and Casadio [2001]. Their model assigns an attraction strength s_{jk}(x) to each pair of cysteines, calculated by assuming that all residues in the local neighborhoods of the two cysteines make contact, and summing contact potentials for pairs of residues. We consider a simple but very general class of attraction scoring functions defined by a weighted combination of features or basis functions:

s_{jk}(x) = \sum_d w_d f_d(x_{jk}) = \mathbf{w}^\top \mathbf{f}(x_{jk}),    (10.2)
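To ground the hypothesis class of Eq. (10.1), the following sketch enumerates perfect matchings by brute force and returns the highest-scoring one. This is our own illustrative code with made-up scores; real instances use polynomial-time combinatorial matching solvers rather than enumeration, which is only viable for tiny graphs.

```python
from itertools import combinations

def perfect_matchings(nodes):
    """Enumerate all perfect matchings of an even-sized node list."""
    if not nodes:
        yield []
        return
    j, rest = nodes[0], nodes[1:]
    for k in rest:
        remaining = [n for n in rest if n != k]
        for m in perfect_matchings(remaining):
            yield [(j, k)] + m

def predict(score, nodes):
    """h_s(x): the perfect matching maximizing the sum of edge scores.

    score[(j, k)] plays the role of s_jk(x). There are
    (2n)!/(2^n n!) matchings for 2n nodes, so this is brute force only.
    """
    return max(perfect_matchings(nodes),
               key=lambda m: sum(score[e] for e in m))

# 6 cysteines as in protein 1ANS: 15 candidate bonds, 15 perfect matchings.
nodes = [1, 2, 3, 4, 5, 6]
score = {e: 0.0 for e in combinations(nodes, 2)}
for e in [(1, 4), (2, 5), (3, 6)]:     # make one matching clearly best
    score[e] = 1.0
print(predict(score, nodes))           # -> [(1, 4), (2, 5), (3, 6)]
```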
where x_{jk} is the portion of the input x that directly relates to nodes j and k, f_d(x_{jk}) is a real-valued basis function and w_d ∈ IR. For example, the basis functions can represent arbitrary information about the two cysteine neighborhoods: the identity of the residues at specific positions around the two cysteines, or the predicted secondary structure in the neighborhood of each cysteine. We assume that the user provides the basis functions, and that our goal is to learn the weights w for the model:

h_w(x) = \arg\max_{y \in Y} \sum_{jk} \mathbf{w}^\top \mathbf{f}(x_{jk}) y_{jk}.    (10.3)

Below, we abbreviate \mathbf{w}^\top \mathbf{f}(x, y) \equiv \sum_{jk} \mathbf{w}^\top \mathbf{f}(x_{jk}) y_{jk}, and \mathbf{w}^\top \mathbf{f}_i(y) \equiv \mathbf{w}^\top \mathbf{f}(x^{(i)}, y).

Figure 10.1: PDB protein 1ANS: amino acid sequence (RSCCPCYWGGCPWGQNCYPEGCSGPKV), 3D structure, and graph of potential disulfide bonds. Actual disulfide connectivity is shown in yellow in the 3D model and the graph of potential bonds.

The naive formulation of the max-margin estimation, which enumerates all perfect matchings for each example i, is:

\min \tfrac{1}{2}\|\mathbf{w}\|^2
s.t. \mathbf{w}^\top \mathbf{f}_i(y^{(i)}) \geq \mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y), \forall i, \forall y \in Y^{(i)}.    (10.4)

The number of constraints in this formulation is super-exponential in the number of nodes in each example. In the following sections we present two max-margin formulations, first with an exponential set of constraints (Sec. 10.3), and then with a polynomial one (Sec. 10.4).

10.3 Min-max formulation

Using the min-max formulation from Ch. 4, we have a single max constraint for each i:

\min \tfrac{1}{2}\|\mathbf{w}\|^2  s.t.  \mathbf{w}^\top \mathbf{f}_i(y^{(i)}) \geq \max_{y \in Y^{(i)}} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)], \forall i.    (10.5)

The key to solving this problem efficiently is the loss-augmented inference \max_{y \in Y^{(i)}} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)]. Under the assumption of Hamming distance loss (or any loss function that can be written as a sum of terms corresponding to edges), this maximization is equivalent (up to a constant term) to a maximum weighted matching problem. Since the y variables are binary, the Hamming distance between y^{(i)} and y can be written as (\mathbf{1} - y)^\top y^{(i)} + (\mathbf{1} - y^{(i)})^\top y = \mathbf{1}^\top y^{(i)} + (\mathbf{1} - 2y^{(i)})^\top y. Hence, the maximum weight matching where edge jk has weight \mathbf{w}^\top \mathbf{f}(x_{jk}) + (1 - 2y^{(i)}_{jk}) (plus the constant \mathbf{1}^\top y^{(i)}) gives the value of \max_{y \in Y^{(i)}} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)]. This problem can be solved in O(L^3) time [Gabow, 1973; Lawler, 1976]. It can also be solved as a linear program, where we introduce continuous variables \mu_{i,jk} instead of binary variables y_{jk}:

\max \sum_{jk} \mu_{i,jk} [\mathbf{w}^\top \mathbf{f}(x_{jk}) + (1 - 2y^{(i)}_{jk})]
s.t. \sum_k \mu_{i,jk} \leq 1, 1 \leq j \leq L_i;  \mu_{i,jk} \geq 0, 1 \leq j < k \leq L_i;
\sum_{j,k \in V} \mu_{i,jk} \leq \tfrac{1}{2}(|V| - 1), \forall V \subseteq \{1, \ldots, L_i\}, |V| \geq 3 and odd.    (10.6)

The constraints \sum_k \mu_{i,jk} \leq 1 require that the number of bonds incident on a node is less than or equal to one. For perfect matchings, these constraints are changed to \sum_k \mu_{i,jk} = 1 to ensure exactly one bond. The subset constraints (in the last line of Eq. (10.6)) ensure that solutions to the LP are integral [Edmonds, 1965]. Note that we have an exponential number of constraints (O(2^{L_i - 1})), but this number is asymptotically smaller than the number of possible matchings. It is an open problem to derive a polynomial sized LP formulation for perfect matchings [Schrijver, 2003].
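The linear-in-y rewriting of the Hamming loss that drives the loss-augmented weights can be checked numerically. This is a small self-contained sketch with made-up numbers, not part of the learning code:

```python
def hamming_as_linear(y_true, y):
    """Check that the Hamming distance between binary edge vectors equals
    1'y_true + (1 - 2*y_true)'y, the identity used for loss augmentation."""
    linear = sum(y_true) + sum((1 - 2 * t) * v for t, v in zip(y_true, y))
    hamming = sum(t != v for t, v in zip(y_true, y))
    assert linear == hamming
    return hamming

def loss_augmented_weights(w_f, y_true):
    """Edge weight for the loss-augmented matching: w'f(x_jk) + (1 - 2*y_jk)."""
    return [s + (1 - 2 * t) for s, t in zip(w_f, y_true)]

y_true = [1, 0, 0, 0, 1, 0]            # correct bonds among 6 candidate edges
print(hamming_as_linear(y_true, [0, 1, 0, 0, 1, 0]))       # distance 2
print(loss_augmented_weights([0.5, 0.2, 0.1, 0.0, 0.9, 0.3], y_true))
```

Solving a maximum weight matching with these augmented weights then yields the value of the inner maximization in Eq. (10.5), up to the constant term.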
We can write the loss-augmented inference problem in terms of the LP in Eq. (10.6):

\max_{y \in Y^{(i)}} [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)] = d_i + \max_{A_i \mu_i \leq b_i, \mu_i \geq 0} \mu_i^\top [F_i \mathbf{w} + c_i],    (10.7)

where d_i = \mathbf{1}^\top y^{(i)}; \mu_i is a vector of length L_i(L_i - 1)/2 indexed by bond jk; A_i and b_i are the appropriate constraint coefficient matrix and right hand side vector; F_i is a matrix of basis function coefficients such that component jk of the vector F_i \mathbf{w} is \mathbf{w}^\top \mathbf{f}(x_{jk}); and c_i = (\mathbf{1} - 2y^{(i)}). Note that the dependence on \mathbf{w} is linear and occurs only in the objective of the LP. The dual of the LP in Eq. (10.7) is:

\min \lambda_i^\top b_i  s.t.  A_i \lambda_i \geq F_i \mathbf{w} + c_i, \lambda_i \geq 0.

We plug it into Eq. (10.5) and combine the minimization over \lambda with the minimization over \mathbf{w}:

\min \tfrac{1}{2}\|\mathbf{w}\|^2
s.t. \mathbf{w}^\top \mathbf{f}_i(y^{(i)}) \geq d_i + \lambda_i^\top b_i, \forall i;  A_i \lambda_i \geq F_i \mathbf{w} + c_i, \lambda_i \geq 0, \forall i.    (10.8)

In case our basis functions are not rich enough to predict the training data perfectly, we can introduce a slack variable \xi_i for each example i to allow violations of the constraints and minimize the sum of the violations:

\min \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i
s.t. \mathbf{w}^\top \mathbf{f}_i(y^{(i)}) + \xi_i \geq d_i + \lambda_i^\top b_i, \forall i;  A_i \lambda_i \geq F_i \mathbf{w} + c_i, \lambda_i \geq 0, \xi_i \geq 0, \forall i.    (10.9)

The parameter C allows the user to trade off violations of the constraints with fit to the data.
Our formulation is a linearly-constrained quadratic program, albeit with an exponential number of constraints. In the next section, we develop an equivalent polynomial size formulation.

Figure 10.2: Log of the number of QP constraints (y-axis) vs. number of bonds (x-axis) in the three formulations (perfect matching enumeration, min-max and certificate).

10.4 Certificate formulation

Rather than solving the loss-augmented inference problem explicitly, we can focus on finding a compact certificate of optimality that guarantees that y^{(i)} = \arg\max_y [\mathbf{w}^\top \mathbf{f}_i(y) + \ell_i(y)]. We consider perfect matchings and then provide a reduction for the non-perfect case.

Let M be a perfect matching for a complete undirected graph G = (V, E). In an alternating cycle/path in G with respect to M, the edges alternate between those that belong to M and those that do not. An alternating cycle is augmenting with respect to M if the score of the edges in the matching M is smaller than the score of the edges not in the matching M.

Theorem 10.4.1 [Edmonds, 1965] A perfect matching M is a maximum weight perfect matching if and only if there are no augmenting alternating cycles.
The number of alternating cycles is exponential in the number of vertices, so simply enumerating all of them will not do. Instead, we can rule out such cycles by considering shortest paths. We begin by negating the score of those edges not in M; in the discussion below we assume that each edge score s_{jk} has been modified this way. We also refer to the score s_{jk} as the length of the edge jk. An alternating cycle is augmenting if and only if its length is negative.

A condition ruling out negative length alternating cycles can be stated succinctly using a kind of distance function. Pick an arbitrary root node r. Let d^e_j, with j ∈ V, e ∈ \{0, 1\}, denote the length of the shortest alternating path from r to j, where e = 1 if the last edge of the path is in M, 0 otherwise. These shortest distances are well-defined if and only if there are no negative alternating cycles. The following constraints capture this distance function:

s_{jk} \geq d^1_k - d^0_j,  s_{jk} \geq d^1_j - d^0_k,  \forall jk \in M;
s_{jk} \geq d^0_k - d^1_j,  s_{jk} \geq d^0_j - d^1_k,  \forall jk \notin M.    (10.10)

Theorem 10.4.2 There exists a distance function \{d^e_j\} satisfying the constraints in Eq. (10.10) if and only if no augmenting alternating cycles exist.

Proof: (If) Suppose there are no augmenting alternating cycles. Since any alternating path from r to j can be shortened (or left the same length) by removing the cycles it contains, the two shortest paths to j (one ending with an M edge and one not) contain no cycles. Let d^0_j and d^1_j be the lengths of those paths, for all j (for j = r, set d^0_r = d^1_r = 0). Then for any jk (or kj) in M, the shortest path to j ending with an edge not in M plus the edge jk (or kj) is an alternating path to k ending with an edge in M. This path is longer than or the same length as the shortest path to k ending with an edge in M: s_{jk} + d^0_j \geq d^1_k (or s_{kj} + d^0_k \geq d^1_j), so the constraint is satisfied. Similarly for jk \notin M.

(Only if) Suppose a distance function \{d^e_j\} satisfies the constraints in Eq. (10.10). Consider an alternating cycle C. We renumber the nodes such that the cycle passes through nodes 1, 2, \ldots, l and the first edge, (1, 2), is in M. The length of the cycle is s(C) = s_{1,l} + \sum_{j=1}^{l-1} s_{j,j+1}. For each odd j, the edge (j, j+1) is in M, so s_{j,j+1} \geq d^1_{j+1} - d^0_j.
For even j, the edge (j, j+1) is not in M, so s_{j,j+1} \geq d^0_{j+1} - d^1_j. Finally, the last edge, (1, l), is not in M, so s_{1,l} \geq d^0_1 - d^1_l. Summing the edges, we have:

s(C) \geq d^0_1 - d^1_l + \sum_{j=1, j odd}^{l-1} [d^1_{j+1} - d^0_j] + \sum_{j=2, j even}^{l-1} [d^0_{j+1} - d^1_j] = 0.

Hence all alternating cycles have non-negative length.

In our learning formulation we have the loss-augmented edge weights s_{jk} = (2y_{jk} - 1)(\mathbf{w}^\top \mathbf{f}(x_{jk}) + 1 - 2y_{jk}). Let d_i be a vector of distance variables d^e_j, and let H_i and G_i be matrices of coefficients and q_i be a vector such that H_i \mathbf{w} + G_i d_i \geq q_i represents the constraints in Eq. (10.10) for example i. Then the following joint convex program in \mathbf{w} and d computes the max-margin parameters:

\min \tfrac{1}{2}\|\mathbf{w}\|^2  s.t.  H_i \mathbf{w} + G_i d_i \geq q_i, \forall i.    (10.11)

Once again, in case our basis functions are not rich enough to predict the training data perfectly, we can introduce a slack variable vector \xi_i to allow violations of the constraints.

The case of non-perfect matchings can be handled by a reduction to perfect matchings as follows [Schrijver, 2003]. We create a new graph by making a copy of the nodes and the edges, and adding edges between each node and the corresponding node in the copy. We extend the matching by replicating its edges in the copy and, for each unmatched node, introducing an edge to its copy. Perfect matchings in this graph projected onto the original graph correspond to non-perfect matchings in the original graph. We define \mathbf{f}(x_{jk}) \equiv 0 for edges between the original graph and the copy.

The comparison between the log-number of constraints for our three equivalent QP formulations (enumeration of all perfect matchings, min-max and certificate) is shown in Fig. 10.2. The x-axis is the number of edges in the matching (number of nodes divided by two).
(10. Ai α(i) ≤ Cbi .5. The primal and dual solutions are related by: w= i jk (Cyjk − αjk )f (xjk ) (i) (i) (i) (10.12) s. we can efﬁciently map the original features f (xjk ) to a highdimensional space. xmn ). α(i) ≥ 0.14) can be used to compute the attraction strength sjk (x) in a kernelized manner at prediction time. KERNELS 153 10. (10.t. The dual quadratic optimization problem has the form: 2 (i) max i ci α(i) − 1 2 Cyjk − αjk f (xjk ) i jk∈E (i) (i) (i) (i) (10.13) Therefore. and αjk is the dual variable associated with the features corresponding to the edge jk. .8). The only occurrence of feature vectors is in the expansion of the squarednorm term in the objective: Cyjk − αjk f (xkl ) f (x(j) ) Cyjk − αjk mn i. ∀i. Let α(i) be the vector of dual variables for example i. Thus. (10. we can apply the kernel trick and let f (xkl ) f (xmn ) = K(xkl . we can work with its dual. The polynomialsized representation in Eq.5 Kernels Instead of directly optimizing the primal problem in Eq. ∀i.j kl∈E (i) mn∈E (j) (i) (j) (i) (j) (i) (i) (i) (j) (j) (10.10. Each training example i has Li (Li − 1)/2 dual variables.11) is similarly kernelizable.14) Eq.
10.6 Experiments

We assess the performance of our method on two datasets containing sequences with experimentally verified bonding patterns: DIPRO2 and SP39. Even though our method is applicable both to sequences with a high number of bonds and to sequences in which the bonding state of cysteine residues is unknown, we report results for the case where the bonding state is known.

The DIPRO2 dataset² was compiled and made publicly available by Baldi et al. [2004]. It consists of all proteins from PDB [Berman et al., 2000], as of May 2004, which contain intra-chain disulfide bonds. After redundancy reduction there are a total of 1018 sequences. Of these, DIPRO2 contains 567 sequences in which the number of bonds is between 2 and 5 (since the case of 1 bond is trivial): 430 proteins with 2 and 3 bonds and 137 with 4 and 5 bonds, with only 53 sequences having a higher number of bonds, so we are able to perform learning on over 90% of all proteins. In addition, the sequences are annotated with secondary structure and solvent accessibility information derived from the DSSP database [Kabsch & Sander, 1983].

The SP39 dataset is extracted from the Swiss-Prot database of proteins [Bairoch & Apweiler, 2000], release 39. It contains only sequences with experimentally verified disulfide bridges; SP39 contains 446 sequences containing between 2 and 5 bonds. The same dataset was used in earlier work [Fariselli & Casadio, 2001; Vullo & Frasconi, 2004; Baldi et al., 2004], and we have followed the same procedure for extracting sequences from the database.

In order to avoid biases during testing, we adopt the same dataset splitting procedure as the one used in previous work [Fariselli & Casadio, 2001; Vullo & Frasconi, 2004]. We split SP39 into 4 different subsets, with the constraint that no proteins with sequence similarity of more than 30% belong to different subsets. Sequence similarity was derived using an all-against-all rigorous Smith-Waterman local pairwise alignment [Smith & Waterman, 1981] (with the BLOSUM65 scoring matrix, gap penalty 12 and gap extension 4). Pairs of chains whose alignment is less than 30 residues were considered unrelated. The DIPRO2 dataset was split similarly into 5 folds, although the procedure had less effect due to the redundancy reduction applied by the authors of the dataset.

²http://contact.ics.uci.edu/bridge.html
The features we experimented with were all based on the local regions around candidate cysteine pairs. For each pair of candidate cysteines \{j, k\}, where j < k, we extract the amino-acid sequence in windows of size n centered at j and k. The final set of features for each \{j, k\} pair of cysteines is simply the two local windows concatenated together, augmented with the linear distance between the cysteine residues. As in Baldi et al. [2004], we augment the features of each model with the number of residues between j and k. The models below use windows of size n = 9. Below, we describe the several models we used.

Models

The experimental results we report use the dual formulation of Sec. 10.5 and an RBF kernel K(x_{jk}, x_{lm}) = \exp(-\|x_{jk} - x_{lm}\|^2/\gamma), with \gamma \in [0, 10]. We use the exponential sized representation of Sec. 10.3, since for the case of proteins containing between two and five bonds it is more efficient due to the low constants in the exponential problem size. We are currently working on an implementation of the certificate formulation of Sec. 10.4 to handle longer sequences and non-perfect matchings (when bonding state is unknown). We used commercial QP software (CPLEX) to train our models, using a sequential optimization procedure which solves QP subproblems associated with blocks of training examples. Training time took around 70 minutes for 450 examples.

The first model, SEQUENCE, uses the features described above: for each window, the actual sequence is expanded to a 20 × n binary vector, in which the entries denote whether or not a particular amino acid occurs at the particular position. For example, the 21st entry in the vector represents whether or not the amino-acid Alanine occurs at position 2 of the local window, counting from the left end of the window.

The second model, PROFILE, is the same as SEQUENCE, except that instead of using the actual protein sequence, we use multiple sequence alignment profile information. Multiple alignments were computed by running PSI-BLAST using default settings to align the sequence with all sequences in the NR database [Altschul et al., 1997]. Thus, the input at each position of a local window is the frequency of occurrence of each of the 20 amino-acids in the alignments.
CHAPTER 10. MATCHINGS

Results and discussion. We evaluate our algorithm using two metrics: accuracy and precision. The accuracy measure counts how many full connectivity patterns were predicted correctly, whereas precision measures the number of correctly predicted bonds as a fraction of the total number of possible bonds.

The first set of experiments compares our model to preliminary results of the DAG-RNN model [Baldi et al., 2004], which represent the best currently published results. We perform 4-fold cross-validation on SP39 in order to replicate their setup. As Table 10.1 shows, the PROFILE model achieves comparable results, with similar or better levels of precision for all bond numbers, and slightly lower accuracies for the case of 2 and 3 bonds.

In another experiment, we show the performance gained by using multiple alignment information by comparing the results of the SEQUENCE model with the PROFILE model. As we can see from Table 10.1(b), the evolutionary information captured by the amino-acid alignment frequencies plays an important role in increasing the performance of the algorithm.

The third model, PROFILE-SS, augments the PROFILE model with secondary structure and solvent-accessibility information. The DSSP program produces 8 types of secondary structure, so we augment each local window of size n with an additional length 8 × n binary vector, as well as a length n binary vector representing the solvent accessibility at each position.

[Table 10.1: Numbers indicate Precision / Accuracy for K = 2, 3, 4, 5 bonds; in each row, the best performance is in bold. (a) Performance of the PROFILE model on SP39 vs. preliminary results of the DAG-RNN model [Baldi et al., 2004]. (b) Performance of the SEQUENCE, PROFILE, and PROFILE-SS models on the DIPRO2 dataset.]
As a final experiment, we examine the role that secondary structure and solvent-accessibility information plays in the model PROFILE-SS. Table 10.1(b) shows that the gains are significant, especially for sequences with 3 and 4 bonds. This highlights the importance of developing even richer features, perhaps through more complex kernels.

Fig. 10.3 shows the performance of the PROFILE model as the training set size grows. We can see that for sequences of all bond numbers, both accuracy and precision increase as the amount of data grows. The trend is more pronounced for sequences with 4 and 5 bonds because they are sparsely distributed in the dataset. Such behavior is very promising, since it validates the applicability of our algorithm as the availability of high-quality disulfide bridge annotations increases with time. The same phenomenon is observed by Vullo and Frasconi [2004] in their comparison of sequence and profile-based models.

[Figure 10.3: Performance of the PROFILE model as training set size changes, for proteins with (a) 2 and 3 bonds and (b) 4 and 5 bonds.]

10.7 Related work

The problem of inverse perfect matching has been studied by Liu and Zhang [2003] in the inverse combinatorial optimization framework we describe earlier: given a set of nominal weights w0 and a perfect matching M which is not a maximum one with respect to w0, find a new weight vector w that makes M optimal and minimizes ||w0 − w||_p for p = 1, 2, ∞. They do not provide a compact optimization problem for this related but different task, relying instead on the ellipsoid method with constraint generation.
The problem of disulfide bond prediction first received comprehensive computational treatment in Fariselli and Casadio [2001]. They modeled the prediction problem as finding a perfect matching in a weighted graph where vertices represent bonded cysteine residues, and edge weights correspond to attraction strength. The problem of learning the edge weights was addressed using a simulated annealing procedure. In Fariselli et al. [2002], the authors switch to using a neural network for learning edge weights and achieve better performance. Their method is only applicable to the case when bonding state is known.

The method of Vullo and Frasconi [2004] takes a different approach to the problem. It scores candidate connectivity patterns according to their similarity with respect to the correct pattern, and uses a recursive neural network architecture [Frasconi et al., 1998] to score candidate patterns. At prediction time, the pattern scores are used to perform an exhaustive search on the space of all matchings. The method is computationally limited to sequences of 2 to 5 bonds. It also uses multiple alignment profile information and demonstrates its benefits over sequence information.

In Baldi et al. [2004], the authors achieve the current state-of-the-art performance on the task. Their method uses Directed Acyclic Graph Recursive Neural Networks [Baldi & Pollastri, 2003] to predict bonding probabilities between cysteine pairs. The prediction problem is solved using a weighted graph matching based on these probabilities. Their method performs better than the one in Vullo and Frasconi [2004] and is also the only one which can cope with sequences with more than 5 bonds. It also improves on previous methods by not assuming knowledge of bonding state.

A different approach to predicting disulfide bridges is reported in Klepeis and Floudas [2003], where bond prediction occurs as part of predicting β-sheet topology in proteins. Residue-to-residue contacts (which include disulphide bridges) are predicted by solving a series of constrained integer programming problems. Interestingly, the approach can be used to predict disulfide bonds with no knowledge of bonding state, but the results are not comparable with those in other publications.

The task of predicting whether or not a cysteine is bonded has also been addressed using a variety of machine learning techniques, including neural networks, SVMs, and HMMs [Fariselli et al., 1999; Fiser & Simon, 2000; Frasconi et al., 2002; Martelli et al., 2002; Ceroni et al., 2003].
Currently the top-performing systems have accuracies around 85%.

10.8 Conclusion

In this chapter, we derive a compact convex quadratic program for the problem of learning to match. We present two formulations: one which is based on a linear programming approach to matching, requiring an exponential number of constraints, and one which develops a certificate of matching optimality for a compact polynomial-sized representation. Our approach learns a parameterized scoring function that reproduces the observed matchings in a training set. Furthermore, the use of kernels makes it easy to incorporate rich sets of features such as secondary structure information.

We apply our framework to the task of disulfide connectivity prediction, formulated as a weighted matching problem. Our experimental results show that the method can achieve performance comparable to current top-performing systems [Baldi et al., 2004]. In the future, it will be worthwhile to examine how other kernels, such as convolution kernels for protein sequences, or extended local neighborhoods of the protein sequence, will affect performance. We also hope to explore the more challenging problem of disulfide connectivity prediction when the bonding state of cysteines is unknown. While we have developed the framework to handle that task, it remains to experimentally determine how well the method performs, especially in comparison to existing methods, which have already addressed the more challenging setting.
Chapter 11

Correlation clustering

Data can often be grouped in many different reasonable clusterings. For example, one user may organize her email messages by project and time, another by sender and topic. Images can be segmented by hue or object boundaries. For a given application, there might be only one of these clusterings that is desirable. Learning to cluster considers the problem of finding desirable clusterings on new data, given example desirable clusterings on training data.

We focus on correlation clustering, a novel clustering method that has recently enjoyed significant attention from the theoretical computer science community [Bansal et al., 2002; Demaine & Immorlica, 2003; Emanuel & Fiat, 2003]. It is formulated as a vertex partitioning problem: given a graph with real-valued edge scores (both positive and negative), partition the vertices into clusters to maximize the score of intra-cluster edges and minimize the weight of inter-cluster edges. Positive edge weights represent correlations between vertices, encouraging those vertices to belong to a common cluster; negative weights encourage the vertices to belong to different clusters. Unlike most clustering formulations, correlation clustering does not require the user to specify the number of clusters nor a distance threshold for clustering; both of these parameters are effectively chosen to be the best possible by the problem definition. These properties make correlation clustering a promising approach to many clustering problems. For example, in machine learning, it has been successfully applied to coreference resolution for proper nouns [McCallum & Wellner, 2003]. Recently, several algorithms based on linear programming and positive-semidefinite
programming relaxations have been proposed to approximately solve this problem. In this chapter, we employ these relaxations to derive a max-margin formulation for learning the edge scores for correlation clustering from clustered training data. We formulate the approximate learning problem as a compact convex program with quadratic objective and linear or positive-semidefinite constraints. Experiments on synthetic and real-world data show the ability of the algorithm to learn an appropriate clustering metric for a variety of desired clusterings.

11.1 Clustering formulation

An instance of correlation clustering is specified by an undirected graph G = (V, E) with N nodes and an edge score s_jk for each jk in E. We assume that the graph is fully connected (if it is not, we can make it fully connected by adding appropriate edges jk with s_jk = 0). We define binary variables y_jk, one for each edge jk (j < k), that represent whether nodes j and k belong to the same cluster. For notational convenience, we introduce both y_jk and y_kj variables, which will be constrained to have the same value. We also introduce y_jj variables, and fix them to have value 1 and set s_jj = 0. Let Y be the space of assignments y that define legal partitions. Bansal et al. [2002] consider two related problems:

(MAX AGREE)      max_{y ∈ Y}  Σ_{jk: s_jk > 0} s_jk y_jk  −  Σ_{jk: s_jk < 0} s_jk (1 − y_jk),

(MIN DISAGREE)   min_{y ∈ Y}  Σ_{jk: s_jk > 0} s_jk (1 − y_jk)  −  Σ_{jk: s_jk < 0} s_jk y_jk.

The motivation for the names of the two problems comes from separating the set of edges into positive weight edges and negative weight edges. In MAX AGREE, we maximize the "agreement" of the partition with the positive/negative designations: the weight of the positive included edges minus the weight of negative excluded edges. In MIN DISAGREE, we minimize the disagreement: the weight of positive excluded edges minus the weight of negative included edges. The best score is obviously achieved by including all the positive and excluding all the negative edges, but this will not generally produce a valid partition.
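To make the MAX AGREE objective and the role of the partition constraint concrete, here is a brute-force sketch that scores every assignment of cluster labels on a tiny graph (illustrative only; the number of labelings grows exponentially, which is exactly why the relaxations below are needed):

```python
from itertools import product

def max_agree_brute_force(n, s):
    """MAX AGREE by enumeration: maximize the weight of positive edges kept
    inside clusters plus the (absolute) weight of negative edges that are cut,
    i.e. sum_{s_jk>0, same} s_jk + sum_{s_jk<0, split} (-s_jk)."""
    best = float("-inf")
    for labels in product(range(n), repeat=n):   # a cluster label per node
        agree = 0.0
        for j in range(n):
            for k in range(j + 1, n):
                same = labels[j] == labels[k]
                if s[j][k] > 0 and same:
                    agree += s[j][k]             # positive edge included
                elif s[j][k] < 0 and not same:
                    agree -= s[j][k]             # negative edge excluded
        best = max(best, agree)
    return best

# Edge scores s01 = 2, s02 = -1, s12 = -3 on three nodes.
s = [[0, 2, -1],
     [0, 0, -3],
     [0, 0, 0]]
```

Putting nodes 0 and 1 together while splitting off node 2 keeps the positive edge and cuts both negative ones, for an agreement of 2 + 1 + 3 = 6; no other partition does better.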
CHAPTER 11. CORRELATION CLUSTERING

Bansal et al. [2002] show that both of these problems are NP-hard (but have different approximation hardness). The optimal partition for the two problems is of course the same (if it is unique). To see this, let

s* = max_{y ∈ Y} Σ_jk s_jk y_jk,    s+ = Σ_{jk: s_jk > 0} s_jk,    s− = Σ_{jk: s_jk < 0} s_jk.

Then the value of MAX AGREE is s* − s− and the value of MIN DISAGREE is s+ − s*. We will concentrate on the maximization version, MAX AGREE. Several approximation algorithms have been developed based on Linear and Semidefinite Programming [Charikar et al., 2003; Demaine & Immorlica, 2003; Emanuel & Fiat, 2003], which we consider in the next sections.

11.1.1 Linear programming relaxation

In order to ensure that y defines a partition, it is sufficient to enforce a kind of triangle inequality for each triple of nodes j < k < l:

y_jk + y_kl ≤ y_jl + 1,    y_jl + y_kl ≤ y_jk + 1,    y_jk + y_jl ≤ y_kl + 1.    (11.1)

The triangle inequality enforces transitivity: if j and k are in the same cluster (y_jk = 1) and k and l are in the same cluster (y_kl = 1), then j and l will be forced to be in this cluster (y_jl = 1). The other two cases are similar. Any symmetric, transitive binary relation induces a partition of the objects. With these definitions, we can express the MAX AGREE problem as an integer linear program (ignoring the constant −s−):

max Σ_jk s_jk y_jk
s.t.  y_jk + y_kl ≤ y_lj + 1, ∀j, k, l;    y_jj = 1, ∀j;    y_jk ∈ {0, 1}, ∀j, k.    (11.2)

Note that the constraints imply that y_ab = y_ba for any two nodes a and b. To see this, consider the inequalities involving nodes a and b with j = a, k = b, l = b and with j = b, k = a, l = a:

y_ab + y_bb ≤ y_ba + 1,    y_ba + y_aa ≤ y_ab + 1.

Since y_aa = y_bb = 1, we have y_ab = y_ba.

The LP relaxation is obtained by replacing the binary variables y_jk ∈ {0, 1} in Eq. (11.2) with continuous variables 0 ≤ μ_jk ≤ 1:

max Σ_jk s_jk μ_jk
s.t.  μ_jk + μ_kl ≤ μ_lj + 1, ∀j, k, l;    μ_jj = 1, ∀j;    μ_jk ≥ 0, ∀j, k.    (11.3)

Note that μ_jk ≤ 1 is implied, since the triangle inequality with j = l gives μ_jk + μ_kj ≤ μ_jj + 1, and since μ_jj = 1 and μ_jk = μ_kj, we have μ_jk ≤ 1. We are not guaranteed that this relaxation will produce integral solutions. The LP solution, s_LP, is an upper bound on s*. Charikar et al. [2003] show that the integrality gap of this upper bound is at least 2/3:

(s* − s−) / (s_LP − s−) < 2/3.

11.1.2 Semidefinite programming relaxation

An alternative formulation [Charikar et al., 2003] is the SDP relaxation. To motivate this relaxation, consider any clustering solution. Choose a collection of orthogonal unit vectors {v_1, . . . , v_K}, one for each cluster in the solution. Every vertex j is assigned the unit vector v_j corresponding to the cluster it is in. If vertices j and k are in the same cluster, then v_j · v_k = 1; if not, v_j · v_k = 0. Let mat(μ) denote the variables μ_jk arranged into a matrix:

max Σ_jk s_jk μ_jk
s.t.  mat(μ) ⪰ 0;    μ_jj = 1, ∀j;    μ_jk ≥ 0, ∀j, k.    (11.4)

In effect, we have substituted the triangle inequalities by the semidefinite constraint.
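Returning to the LP relaxation in Eq. (11.3): it can be solved with any off-the-shelf LP solver. The snippet below is an illustrative sketch (not the thesis implementation) using scipy.optimize.linprog on a 3-node instance, collapsing the symmetric variables to μ01, μ02, μ12:

```python
import numpy as np
from scipy.optimize import linprog

# Edge scores s01, s02, s12 for a 3-node correlation clustering instance.
s = np.array([2.0, -1.0, -3.0])

# Triangle inequalities mu_jk + mu_kl <= mu_jl + 1, written as A @ mu <= 1.
A = np.array([[1.0, -1.0, 1.0],    # mu01 + mu12 - mu02 <= 1
              [1.0, 1.0, -1.0],    # mu01 + mu02 - mu12 <= 1
              [-1.0, 1.0, 1.0]])   # mu02 + mu12 - mu01 <= 1
b = np.ones(3)

# linprog minimizes, so negate the scores to maximize sum_jk s_jk mu_jk.
res = linprog(c=-s, A_ub=A, b_ub=b, bounds=[(0.0, 1.0)] * 3, method="highs")
mu = res.x
```

On this instance the relaxation happens to be integral (μ = (1, 0, 0): keep only the positive edge), but in general the relaxed μ can be fractional and only upper-bounds s*.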
The score of the clustering solution can now be expressed in terms of the dot products v_j · v_k, and since the matrix of such a solution decomposes into a sum of outer products (one per cluster), it satisfies the constraints of Eq. (11.4). Conversely, in the SDP relaxation the entries μ_jk correspond to inner products v_j · v_k, because mat(μ) ⪰ 0 can be written as a Gram matrix of vectors v_j; these vectors are unit vectors by the requirement μ_jj = 1, but they do not necessarily form a set of orthogonal vectors. The SDP solution, s_SDP, is also an upper bound on s*, and Charikar et al. [2003] show that the integrality gap of this upper bound is at least 0.82843:

(s* − s−) / (s_SDP − s−) ≤ 0.82843.

11.2 Learning formulation

The score for an edge is commonly derived from the characteristics of the pair of nodes. Specifically, we parameterize the score as a weighted combination of basis functions, s_jk = w⊤f(x_jk), where x_jk is a set of features associated with nodes j and k, and f(x_jk) ∈ IR^n. In document clustering, for example, the entries of f(x_jk) might be the words shared by documents j and k, while if one is clustering points in IR^n, features might be distances along different dimensions. We assume f(x_jj) = 0, so that s_jj = 0. Hence we write

w⊤f(x, y) = Σ_jk y_jk w⊤f(x_jk).

Furthermore, we assume that the loss function decomposes over the edges into a sum of edge losses ℓ_{i,jk}(y_jk):

ℓ_i(y) = Σ_jk ℓ_{i,jk}(y_jk) = Σ_jk [y_jk ℓ_{i,jk}(1) + (1 − y_jk) ℓ_{i,jk}(0)] = ℓ_i(0) + Σ_jk y_jk ℓ_{i,jk},

where ℓ_{i,jk} = ℓ_{i,jk}(1) − ℓ_{i,jk}(0) and ℓ_i(0) = Σ_jk ℓ_{i,jk}(0). For example, the Hamming loss counts the number of edges incorrectly cut or uncut by a partition:

ℓ^H_i(y) = Σ_jk I(y_jk ≠ y^(i)_jk) = Σ_jk y^(i)_jk + Σ_jk y_jk (1 − 2 y^(i)_jk) = ℓ^H_i(0) + Σ_jk y_jk ℓ_{i,jk},

where ℓ_{i,jk} = 1 − 2 y^(i)_jk.

With this assumption, the loss-augmented maximization is

ℓ_i(0) + max_{y ∈ Y} Σ_jk y_jk [w⊤f(x_jk) + ℓ_{i,jk}].    (11.5)

We can now use the LP relaxation in Eq. (11.3) and the SDP relaxation in Eq. (11.4) as upper bounds on Eq. (11.5). We use these upper bounds in the min-max formulation to achieve approximate max-margin estimation. The dual of the LP-based upper bound for example i is

ℓ_i(0) + min Σ_jkl λ_{i,jkl} + Σ_j z_{i,j}
s.t.  Σ_l [λ_{i,jkl} + λ_{i,ljk} − λ_{i,klj}] ≥ w⊤f(x_jk) + ℓ_{i,jk}, ∀j ≠ k;
      z_{i,j} + Σ_l [λ_{i,jjl} + λ_{i,ljj} − λ_{i,jlj}] ≥ ℓ_{i,jj}, ∀j;
      λ_{i,jkl} ≥ 0, ∀j, k, l.    (11.6)

Above, we introduced a dual variable λ_{i,jkl} for each triangle inequality and z_{i,j} for the identity on the diagonal. Note that the right-hand side of the second set of inequalities follows from the assumption f(x_jj) = 0.
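The edge-wise decomposition of the Hamming loss used above, I(y_jk ≠ y^(i)_jk) = y^(i)_jk + y_jk(1 − 2y^(i)_jk), is easy to sanity-check numerically (an illustrative sketch):

```python
import random

def hamming_direct(y, y_true):
    """Count edge variables on which the two labelings disagree."""
    return sum(1 for a, b in zip(y, y_true) if a != b)

def hamming_decomposed(y, y_true):
    """Same quantity via l(0) + sum_jk y_jk * (1 - 2 * y_true_jk),
    where l(0) = sum_jk y_true_jk is the loss of the all-zeros labeling."""
    l_zero = sum(y_true)
    return l_zero + sum(yjk * (1 - 2 * tjk) for yjk, tjk in zip(y, y_true))

random.seed(0)
for _ in range(100):
    y_true = [random.randint(0, 1) for _ in range(12)]
    y = [random.randint(0, 1) for _ in range(12)]
    assert hamming_direct(y, y_true) == hamming_decomposed(y, y_true)
```

The identity holds because for binary values I(y ≠ t) equals y when t = 0 and 1 − y when t = 1, which is exactly t + y(1 − 2t).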
Plugging this dual into the min-max formulation, we have:

min (1/2)||w||² + C Σ_i ξ_i
s.t.  w⊤f(x^(i), y^(i)) + ξ_i ≥ ℓ_i(0) + Σ_jkl λ_{i,jkl} + Σ_j z_{i,j}, ∀i;
      Σ_l [λ_{i,jkl} + λ_{i,ljk} − λ_{i,klj}] ≥ w⊤f(x_jk) + ℓ_{i,jk}, ∀i, ∀j ≠ k;
      z_{i,j} + Σ_l [λ_{i,jjl} + λ_{i,ljj} − λ_{i,jlj}] ≥ ℓ_{i,jj}, ∀i, ∀j;
      λ_{i,jkl} ≥ 0, ∀i, ∀j, k, l.    (11.7)

Similarly, the dual of the SDP-based upper bound is

ℓ_i(0) + min Σ_j z_{i,j}
s.t.  −λ_{i,jk} ≥ w⊤f(x_jk) + ℓ_{i,jk}, ∀j ≠ k;
      z_{i,j} − λ_{i,jj} ≥ ℓ_{i,jj}, ∀j;
      mat(λ_i) ⪰ 0.    (11.8)

Plugging the SDP dual into the min-max formulation, we have:

min (1/2)||w||² + C Σ_i ξ_i
s.t.  w⊤f(x^(i), y^(i)) + ξ_i ≥ ℓ_i(0) + Σ_j z_{i,j}, ∀i;
      −λ_{i,jk} ≥ w⊤f(x_jk) + ℓ_{i,jk}, ∀i, ∀j ≠ k;
      z_{i,j} − λ_{i,jj} ≥ ℓ_{i,jj}, ∀i, ∀j;
      mat(λ_i) ⪰ 0, ∀i.    (11.9)

11.3 Dual formulation and kernels

The duals of Eq. (11.7) and Eq. (11.9) provide some insight into the structure of the problem and enable efficient use of kernels. Here we give the dual of Eq. (11.7):

max Σ_{i,jk} ℓ_{i,jk} μ_{i,jk} − (C/2) ||Σ_{i,jk} (y^(i)_jk − μ_{i,jk}) f(x_jk)||²
s.t.  μ_{i,jk} + μ_{i,kl} ≤ μ_{i,lj} + 1, ∀i, ∀j, k, l;    μ_{i,jj} = 1, ∀i, ∀j;    μ_{i,jk} ≥ 0, ∀i, ∀j, k.    (11.10)

The dual of Eq. (11.9) is very similar, except that the linear transitivity constraints μ_{i,jk} + μ_{i,kl} ≤ μ_{i,lj} + 1, ∀j, k, l, are replaced by the corresponding mat(μ_i) ⪰ 0, ∀i. The relation between the primal and dual solutions is

w = C Σ_{i,jk} (y^(i)_jk − μ_{i,jk}) f(x_jk).

One important consequence of this relationship is that the edge parameters are all support vector expansions. The dual objective can be expressed in terms of dot products f(x_jk)⊤f(x_lm). Therefore, we can use kernels K(x_jk, x_lm) to define the space of basis functions. This kernel looks at two pairs of nodes, (j, k) and (l, m), and measures the similarity between the relation between the nodes of each pair. If we are clustering points in Euclidean space, the kernel could be a function of the two segments corresponding to the pairs of points, for example, a polynomial kernel over their lengths and the angle between them.
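As a concrete instance of such a pair-of-pairs kernel for points in the plane: each edge is a segment, and a polynomial kernel combines the two segment lengths with the angle between them. The specific combination below is a hypothetical illustration, not the kernel used in the thesis:

```python
import math

def segment_kernel(p1, p2, q1, q2, degree=2, c=1.0):
    """Compare edge (p1, p2) with edge (q1, q2).
    K = (c + |v||u| + cos(angle(v, u)))^degree, where v and u are the two
    segment vectors. Each summand is itself a kernel (product of norms, and
    the normalized linear kernel), so the polynomial combination is PSD."""
    v = (p2[0] - p1[0], p2[1] - p1[1])
    u = (q2[0] - q1[0], q2[1] - q1[1])
    lv, lu = math.hypot(*v), math.hypot(*u)
    cos_angle = (v[0] * u[0] + v[1] * u[1]) / (lv * lu)
    return (c + lv * lu + cos_angle) ** degree
```

For two unit segments at a right angle this gives (1 + 1 + 0)² = 4, while identical segments score higher, capturing "similar relations between similar pairs."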
11.4 Experiments

We present experiments on a synthetic problem exploring the effect of irrelevant basis functions (features), and on two real data sets: email clustering and image segmentation.

11.4.1 Irrelevant features

In this section, we explore on a synthetic example how our algorithm deals with irrelevant features. We generate data (100 points) from a mixture of two one-dimensional Gaussians, where each mixture component corresponds to a cluster. This first dimension is thus the relevant feature. Then we add noise components in D additional (irrelevant) dimensions. The noise is independently generated for each dimension, from a mixture of Gaussians with the same difference in means and variance as for the relevant dimension. Let x_j denote each point and x_j[d] denote the dth dimension of the point. We used a basis function for each dimension, f_d(x_jk) = e^{−(x_j[d] − x_k[d])²}, plus an additional constant basis function. The training and test data consist of 100 samples from the model. Fig. 11.1(a) shows the projection of a data sample onto the first two dimensions and onto two irrelevant dimensions. The results in Fig. 11.1(b) illustrate the capability of our algorithm to learn to ignore irrelevant dimensions. The accuracy is the fraction of edges correctly predicted to be between/within cluster; random clustering will give an accuracy of 50%. The comparison with k-means is simply a baseline to illustrate the effect of the noise on the data.

11.4.2 Email clustering

We also test our approach on the task of clustering email into folders: we are interested in the problem of learning to cluster emails. We gathered the data from the SRI CALO project.¹ Our dataset consisted of email from seven users (approximately 100 consecutive messages per user), which the users had manually filed into different folders. The number of folders for each user varied from two to six, with an average of 3–4 folders.

¹ http://www.ai.sri.com/project/CALO
[Figure 11.1: (a) Projection onto the first two dimensions (top) and two noise dimensions (bottom). (b) Performance on the 2-cluster problem as a function of the number of irrelevant noise dimensions; learning to cluster is the solid line, k-means the dashed line. Accuracy is the fraction of edges correctly predicted to be between/within cluster; error bars denote one standard deviation, averaged over 20 runs.]

Specifically, we ask: what score function s_jk causes correlation clustering to give clusters similar to those that the human users had chosen? To test our learning algorithm, we use each user as a training set in turn, learning the parameters from the partition of a single user's mailbox into folders. We then use the learned parameters to cluster the other users' mail.

The basis functions f(x_jk) measured the similarity between the text of the messages. One feature was used for each common word in the pair of emails (except words that appeared in more than half the messages, which were deemed "stop words" and omitted). Also, additional features captured the proportion of shared tokens for each email field, including the from, to, cc, subject and body fields: the similarity between the "From:" fields, the "To:" fields, and the "Cc:" fields. The algorithm is therefore able to automatically learn the relative importance of certain email fields for filing two messages together, as well as the importance of meaningful words versus common, irrelevant ones. We compare our method to the k-means clustering algorithm using the same word features.
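A minimal sketch of such pair features; the field names, sample messages, tokenization, and dictionary representation here are invented simplifications for illustration:

```python
def email_pair_features(msg_a, msg_b, stop_words):
    """Features for an email pair: one indicator per shared non-stop body word,
    plus the fraction of shared 'From:' tokens (other fields handled alike)."""
    words_a = set(msg_a["body"].lower().split())
    words_b = set(msg_b["body"].lower().split())
    feats = {"word:" + w: 1.0 for w in (words_a & words_b) - stop_words}
    from_a, from_b = set(msg_a["from"].split()), set(msg_b["from"].split())
    union = from_a | from_b
    feats["shared_from"] = len(from_a & from_b) / len(union) if union else 0.0
    return feats

# Hypothetical messages: one shared content word ("budget"), same sender.
a = {"body": "Budget meeting tomorrow", "from": "alice@example.org"}
b = {"body": "the budget notes", "from": "alice@example.org"}
feats = email_pair_features(a, b, stop_words={"the"})
```

A learned weight vector over such features then yields the edge score s_jk = w⊤f(x_jk) for each message pair.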
We made this comparison somewhat easy for k-means by giving it the correct number of clusters k, and took the best clustering out of five tries. We also informed our algorithm of the number of clusters, by uniformly adding a positive weight to the edge weights to cause it to give the correct number of clusters; we performed a simple binary search on this additional bias weight parameter to find a number of clusters comparable to k. The results in Fig. 11.2 show the average pair F1 measure (the harmonic mean of precision and recall, a standard metric in information retrieval [Baeza-Yates & Ribeiro-Neto, 1999]) computed on the pairs of messages that belonged to the same cluster. Our algorithm significantly outperforms k-means on several users and does worse only for one of the users.

[Figure 11.2: Average pair F1 measure for clustering user mailboxes.]

11.4.3 Image segmentation

We also test our approach on the task of image segmentation. We selected images from the Berkeley Image Segmentation Dataset [Martin et al., 2001] for which two users had significantly different segmentations. Fig. 11.3(a) and (b) show two distinct segmentations: one very coarse, mostly based on overall hue, and one much finer, based on the hue and intensity. Depending on the task at hand, we may prefer the first over the second or vice versa. It is precisely this kind of variability in the similarity judgements that we want our algorithm to capture.

[Figure 11.3: Two segmentations by different users: training image with (a) coarse segmentation and (b) fine segmentation.]

In order to segment an image, we first divided it into contiguous regions of approximately the same color by running connected components on the pixels: we connected two adjacent pixels by an edge if their RGB values were the same at a coarse resolution (4 bits for each of the R, G, B channels). We then selected about a hundred largest regions, which covered 80–90% of the pixels. These regions are the objects that our algorithm learns to cluster. (We then use the learned metric to greedily assign the remaining small regions to the large adjoining regions.)

There is a rich space of possible features we can use in our models: for each pair of regions, we can consider their shape, color distribution, distance, the presence of edges between them, etc. In our experiments, we used a fairly simple set of features that are very easy to compute. For each region, we calculated the bounding box, area and average color (averaging the pixels in RGB space). For each pair of regions, we then computed three distances (one for each HSV channel), as well as the distance in pixels between the bounding boxes and the area of the smaller of the two regions. All features were normalized to have zero mean and variance 1. We trained two models, one using Fig. 11.3(a) and the other using Fig. 11.3(b).
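The region-extraction step can be sketched as connected components over pixels whose colors match at 4-bit-per-channel resolution (an illustrative reconstruction using a BFS flood fill, not the thesis code):

```python
from collections import deque

def coarse(rgb):
    """Quantize an (r, g, b) pixel to 4 bits per channel."""
    return tuple(c >> 4 for c in rgb)

def color_regions(img):
    """Connected components of 4-adjacent pixels with equal coarse color.
    Returns a per-pixel region label grid and the number of regions."""
    h, w = len(img), len(img[0])
    label = [[-1] * w for _ in range(h)]
    regions = 0
    for i in range(h):
        for j in range(w):
            if label[i][j] != -1:
                continue
            q = deque([(i, j)])
            label[i][j] = regions
            while q:
                a, b = q.popleft()
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    na, nb = a + da, b + db
                    if (0 <= na < h and 0 <= nb < w and label[na][nb] == -1
                            and coarse(img[na][nb]) == coarse(img[a][b])):
                        label[na][nb] = regions
                        q.append((na, nb))
            regions += 1
    return label, regions

# Tiny 2x2 example: two reddish pixels on top, two bluish below -> 2 regions.
img = [[(255, 0, 0), (250, 0, 0)],
       [(0, 0, 255), (0, 0, 250)]]
labels, n_regions = color_regions(img)
```

The largest of the resulting regions would then become the objects whose pairwise features feed the clustering model.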
The results, tested on the image in Fig. 11.4(a), are shown in Fig. 11.4(b) and (c), respectively. Note that the mountains, rocks and grass are segmented very coarsely based on hue in (b), while the segmentation in (c) is more detailed and sensitive to the saturation and value of the colors.

[Figure 11.4: Test image: (a) input, (b) segmentation based on the coarse training data, (c) segmentation based on the fine training data.]

11.5 Related work

The performance of most clustering algorithms depends critically on the distance metric that they are given for measuring the similarity or dissimilarity between different datapoints. Recently, a number of algorithms have been proposed for automatically learning distance metrics as a preprocessing step for clustering [Xing et al., 2002; Bar-Hillel et al., 2003]. In contrast to algorithms that learn a metric independently of the algorithm that will be used to cluster the data, we describe a formulation that tightly integrates metric learning with the clustering algorithm, tuning one to the other in a joint optimization: instead of using an externally-defined criterion for choosing the metric, we seek to learn a good metric specifically for the clustering algorithm. An example of work in a similar vein is the algorithm for learning a distance metric for spectral clustering [Bach & Jordan, 2003]. The spectral clustering algorithm essentially uses an eigenvector decomposition of an appropriate matrix derived from the pairwise affinity matrix, which is more efficient than correlation clustering, for which we use LP or SDP formulations. However, the objective in the learning formulation proposed in Bach and Jordan [2003] is not convex and is difficult to optimize.

11.6 Conclusion

We looked at correlation clustering, and showed how to learn the edge weights from example clusterings. Our approach ties together the inference and learning algorithms, and attempts to learn a good metric specifically for the clustering algorithm. We showed results on a synthetic dataset, showcasing robustness to noise dimensions, while experiments on CALO email data and on image segmentation show the potential of the algorithm on real-world data.

The main limitation of correlation clustering is scalability: the number of transitivity constraints, O(V³), in the LP relaxation, and the size of the positive-semidefinite constraint in the SDP relaxation. It would be very interesting to explore constraint generation or similar approaches to speed up learning and inference. On the theoretical side, it would be interesting to work out a PAC-like bound for generalization of the learned score metric.
Part IV

Conclusions and future directions
Chapter 12

Conclusions and future directions

This thesis presents a novel statistical estimation framework for structured models based on the large margin principle underlying support vector machines. The framework results in several efficient learning formulations for complex prediction tasks. Fundamentally, we rely on the expressive power of convex optimization problems to compactly capture inference or solution optimality in structured models. Directly embedding this structure within the learning formulation produces compact convex problems for efficient estimation of very complex models. For some of these models, alternative estimation methods are intractable. We develop theoretical foundations for our approach and show a wide range of experimental applications, including handwriting recognition, 3D terrain classification, disulfide connectivity in protein structure prediction, hypertext categorization, natural language parsing, email organization and image segmentation.

12.1 Summary of contributions

We view a structured prediction model as a mapping from the space of inputs x ∈ X to a discrete vector output y ∈ Y. Essentially, a model defines a compact, parameterized scoring function w⊤f(x, y), and prediction using the model reduces to finding the highest scoring output y given the input x. Our class of models has the following linear form:

h_w(x) = arg max_{y: g(x,y) ≤ 0} w⊤f(x, y),
where w ∈ IR^n is the vector of parameters of the model, the constraints g(x, y) ∈ IR^k define the space of feasible outputs y given the input x, and the basis functions f(x, y) ∈ IR^n represent salient features of the input/output pair. This definition covers a broad range of models, from probabilistic models such as Markov networks and context-free grammars to more unconventional models like weighted graph-cuts and matchings.

12.1.1 Structured maximum margin estimation

Given a sample S = {(x^(i), y^(i))}_{i=1}^m, we develop methods for finding parameters w such that

arg max_{y ∈ Y^(i)} w⊤f(x^(i), y) ≈ y^(i), ∀i,

where Y^(i) = {y : g(x^(i), y) ≤ 0}. Although the space of outputs Y^(i) is usually immense, generally exponential in the number of variables in each y^(i), we assume that the inference problem arg max_{y: g(x,y) ≤ 0} w⊤f(x, y) can be solved (or closely approximated) by an efficient algorithm that exploits the structure of the constraints g and basis functions f. The naive formulation¹ uses Σ_i |Y^(i)| linear constraints:

min (1/2)||w||²    s.t. w⊤f_i(y^(i)) ≥ w⊤f_i(y) + ℓ_i(y), ∀i, ∀y ∈ Y^(i).

We propose two general approaches that transform the above exponential-size QP into an exactly equivalent polynomial-size QP for many important classes of models. These formulations allow us to find globally optimal parameters (with fixed precision) in polynomial time using standard optimization software.

¹ For simplicity, we omit the slack variables in this summary.
Min-max formulation. We can turn the above problem into an equivalent min-max formulation with i nonlinear max-constraints:

min (1/2)||w||²    s.t. w⊤f_i(y^(i)) ≥ max_{y ∈ Y^(i)} [w⊤f_i(y) + ℓ_i(y)], ∀i.

The key to solving the estimation problem above efficiently is the loss-augmented inference problem, max_{y ∈ Y^(i)} [w⊤f_i(y) + ℓ_i(y)]. Even if max_{y ∈ Y^(i)} w⊤f_i(y) can be solved in polynomial time using convex optimization, the form of the loss term ℓ_i(y) is crucial for the loss-augmented inference to remain tractable. We typically use a natural loss function which is essentially the Hamming distance between y^(i) and h(x^(i)): the number of variables predicted incorrectly. We show that if we can express the (loss-augmented) inference as a compact convex optimization problem (e.g., LP, QP, SDP, etc.), we can embed the maximization inside the min-max formulation to get a compact convex program equivalent to the naive exponential formulation. We show that this approach leads to exact polynomial-size formulations for estimation of low-treewidth Markov networks, context-free grammars, associative Markov networks over binary variables, bipartite matchings, and many other models.

Certificate formulation. There are several important combinatorial problems which allow polynomial-time solution yet do not have a compact convex optimization formulation. For example, the maximum weight perfect (non-bipartite) matching and spanning tree problems can be expressed as linear programs with exponentially many constraints; both of these problems can be solved in polynomial time using combinatorial algorithms, but no compact polynomial formulation is known [Bertsimas & Tsitsiklis, 1997; Schrijver, 2003]. In some cases, we can instead find a compact certificate of optimality that guarantees that

y^(i) = arg max_y [w⊤f_i(y) + ℓ_i(y)].
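On a toy output space, the max-constraint of the min-max formulation can be checked by direct enumeration (a sketch; in the thesis the whole point is to avoid this enumeration via compact convex programs or certificates):

```python
def hinge_violation(w, f, y_true, outputs, loss):
    """Structured hinge: max_y [w.f(y) + loss(y, y_true)] - w.f(y_true).
    A nonpositive value means the margin constraint holds for this example."""
    def score(y):
        return sum(wi * fi for wi, fi in zip(w, f(y)))
    best = max(score(y) + loss(y, y_true) for y in outputs)
    return best - score(y_true)

# Toy model: y is a pair of bits, the features are the bits themselves,
# and the loss is the Hamming distance to the true output.
outputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
f = lambda y: [float(y[0]), float(y[1])]
hamming = lambda y, t: sum(a != b for a, b in zip(y, t))
```

With w = (2, −1) and true output (1, 0), the true labeling scores as well as any loss-augmented competitor, so the constraint holds with zero slack; with w = 0 the violation equals the worst-case loss.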
For perfect (non-bipartite) matchings, this certificate is a condition that ensures there are no augmenting alternating cycles (see Ch. 10). We can express this condition by defining an auxiliary distance function on the nodes and a set of linear constraints that are satisfied if and only if there are no negative cycles. This simple set of linear constraints scales linearly with the number of edges in the graph. Similarly, we can derive a compact certificate for the spanning tree problem. The certificate formulation relies on the fact that verifying optimality of a solution is often easier than actually finding one. This observation allows us to apply our framework to an even broader range of models with combinatorial structure than the min-max formulation.

Maximum margin vs. maximum likelihood

There are several theoretical advantages to our approach in addition to the empirical accuracy improvements we have shown experimentally. Because our approach only relies on using the maximum in the model for prediction, and does not require a normalized distribution P(y | x) over all outputs, maximum margin estimation can be tractable when maximum likelihood is not. For example, to learn a probabilistic model P(y | x) over bipartite matchings using maximum likelihood requires computing the normalizing partition function, which is #P-complete [Valiant, 1979; Garey & Johnson, 1979]. By contrast, maximum margin estimation can be formulated as a compact QP with linear constraints. Similar results hold for an important subclass of Markov networks and for non-bipartite matchings.

In models that are tractable for both maximum likelihood and maximum margin (such as low-treewidth Markov networks, context-free grammars, and many other problems in which inference is solvable by dynamic programming), our approach has an additional advantage. Because of the hinge-loss, the solutions to the estimation problem are relatively sparse in the dual space (as in SVMs), which makes the use of kernels much more efficient. Maximum likelihood models with kernels are generally non-sparse and require pruning or greedy support vector selection methods [Wahba et al., 1993; Zhu & Hastie, 2001; Lafferty et al., 2004; Altun et al., 2004].
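The contrast for bipartite matchings can be made concrete: the maximum-weight matching (all that max-margin prediction needs) is polynomial-time, while the partition function over matchings is a matrix permanent, #P-complete in general and computed below only by brute force on a tiny instance. A sketch using SciPy's Hungarian-algorithm solver (the score matrix is made up):

```python
import itertools
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))       # edge scores w·f for a 4x4 bipartite graph

# Maximum-weight matching: polynomial time (Hungarian algorithm).
rows, cols = linear_sum_assignment(W, maximize=True)
max_score = W[rows, cols].sum()

# Partition function: sum of exp(score) over all n! matchings, i.e. the
# permanent of exp(W); feasible here only because n = 4.
Z = sum(np.exp(sum(W[i, p[i]] for i in range(4)))
        for p in itertools.permutations(range(4)))

print(max_score, np.log(Z))       # log Z always upper-bounds the max score
```

The `for p in permutations` loop is exactly the exponential sum that has no known polynomial counterpart, whereas `linear_sum_assignment` replaces the exponential max.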
There are, of course, several advantages to maximum likelihood estimation. In applications where probabilistic confidence information is a must, the probabilistic interpretation permits the use of well-understood algorithms such as EM [Dempster et al., 1977]. Also, in training settings with missing data and hidden variables, maximum likelihood is much more appropriate.

Approximations

In many problems, the maximization problem we are interested in may be NP-hard; for example, we consider MAP inference in large-treewidth Markov networks in Ch. 8, multi-way cuts in Ch. 7, and graph partitioning in Ch. 11. Many such problems can be written as integer programs. Relaxations of such integer programs into LPs, QPs or SDPs often provide excellent approximation algorithms and fit well within our framework. We show empirically that these approximations are very effective in many applications.

12.1.2 Markov networks: max-margin, associative, relational

The largest portion of the thesis is devoted to novel estimation algorithms, representational extensions, generalization analysis and experimental validation for Markov networks.

◦ Low-treewidth Markov networks. We use a compact LP for MAP inference in Markov networks with sequence and other low-treewidth structure to derive an exact, compact, convex learning formulation. The dual formulation allows efficient integration of kernels with graphical models, which leverages rich high-dimensional representations for accurate prediction in real-world tasks.

◦ Scalable online algorithm. Although our convex formulation is a QP with a linear number of variables and constraints in the size of the data, for large datasets (millions of examples) it is very difficult to solve using standard software. We present an efficient algorithm for solving the estimation problem, called Structured SMO. Our online-style algorithm uses inference in the model and analytic updates to solve extremely large estimation problems.

◦ Learning associative Markov networks (AMNs). We define an important subclass of Markov networks that captures positive correlations present in many domains. This class of networks extends the Potts model [Potts, 1952] often used in computer vision, and allows exact MAP inference in the case of binary variables. We show how to express the inference problem using an LP which is exact for binary networks, yielding an exact max-margin formulation in the binary case; for the non-binary case, we provide an approximation that works well in practice. By constraining the class of Markov networks to AMNs, our framework allows exact estimation of networks of arbitrary connectivity and topology, for which likelihood methods are believed to be intractable. As a result, our models are learned efficiently and, at run-time, can scale up to tens of millions of nodes and edges by using graph-cut based inference [Kolmogorov & Zabih, 2002]. We present an AMN-based method for object segmentation from 3D range data.

◦ Representation and learning of relational Markov networks. We introduce relational Markov networks (RMNs), which compactly define templates for Markov networks for domains with relational structure: objects, attributes, relations. The graphical structure of an RMN is based on the relational structure of the domain, and can easily model complex interaction patterns over related entities. We use a compact approximate MAP LP in these complex Markov networks, in which exact inference is intractable, to derive an approximate max-margin formulation. We apply this class of models to classification of hypertext, using hyperlink structure to define relations between webpages, and show experimental evidence of the model's improved performance over several baseline models.

◦ Generalization analysis. We analyze the theoretical generalization properties of max-margin estimation in Markov networks and derive a novel margin-based bound for structured prediction. This is the first bound to address structured error (e.g., the proportion of mislabeled pixels in an image).
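The reason binary AMN inference scales so well is that MAP reduces to a minimum s-t cut. Below is a small self-contained sketch (a hand-rolled Edmonds-Karp max-flow on a toy energy, not the thesis's implementation or the graph construction of Kolmogorov & Zabih): each node gets an edge from the source paid if it takes label 1 and an edge to the sink paid if it takes label 0, and each attractive pairwise term becomes a pair of edges cut when the endpoints disagree.

```python
from collections import defaultdict, deque

def max_flow_source_side(cap, s, t):
    """Edmonds-Karp; returns the source side of a minimum s-t cut."""
    flow = defaultdict(float)
    nodes = {u for e in cap for u in e}
    residual = lambda u, v: cap.get((u, v), 0.0) - flow[(u, v)] + flow[(v, u)]
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:                 # BFS for an augmenting path
            u = q.popleft()
            for v in nodes:
                if v not in parent and residual(u, v) > 1e-12:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(residual(u, v) for u, v in path)
        for u, v in path:
            flow[(u, v)] += b
    side, q = {s}, deque([s])                        # residual reachability = min cut
    while q:
        u = q.popleft()
        for v in nodes:
            if v not in side and residual(u, v) > 1e-12:
                side.add(v)
                q.append(v)
    return side

def amn_binary_map(unary, pairwise):
    """unary[i] = (cost of label 0, cost of label 1); pairwise[(i, j)] = penalty if labels differ."""
    cap = {}
    for i, (c0, c1) in enumerate(unary):
        cap[('s', i)] = c1                           # cut if i ends on the label-1 side
        cap[(i, 't')] = c0                           # cut if i ends on the label-0 side
    for (i, j), lam in pairwise.items():
        cap[(i, j)] = cap.get((i, j), 0.0) + lam
        cap[(j, i)] = cap.get((j, i), 0.0) + lam
    src = max_flow_source_side(cap, 's', 't')
    return [0 if i in src else 1 for i in range(len(unary))]

unary = [(0.0, 2.0), (1.5, 0.5), (1.0, 1.0), (0.2, 1.8)]
pairwise = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}
print(amn_binary_map(unary, pairwise))
```

On a four-node chain like this the result can be checked against brute force over all 2^4 labelings; the same construction is what allows run-time inference on millions of nodes.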
12.1.3 Broader applications: parsing, matching, clustering

The other large portion of the thesis addresses a range of prediction tasks with very diverse models: context-free grammars for natural language parsing, perfect matchings for disulfide connectivity in protein structure prediction, and graph partitions for clustering documents and segmenting images.

◦ Learning to parse. We exploit the dynamic programming decomposition of context-free grammars to derive a compact max-margin formulation. The algorithm we propose uses kernels, which make it possible to efficiently embed input features in very high-dimensional spaces and achieve state-of-the-art accuracy. We build on a recently proposed "unlexicalized" grammar that allows cubic-time parsing, and we show how to achieve high-accuracy parsing (still in cubic time) by exploiting novel kinds of lexical information.

◦ Learning to match. We use a combinatorial optimality condition, namely the absence of augmenting alternating cycles, to derive an exact, efficient certificate formulation for learning to match. We apply our framework to prediction of disulfide connectivity in proteins using perfect matchings.

◦ Learning to cluster. By expressing the correlation clustering problem as a compact LP and SDP, we use the min-max formulation to learn a parameterized scoring function for clustering. In contrast to algorithms that learn a metric independently of the algorithm that will be used to cluster the data, we describe a formulation that tightly integrates metric learning with the clustering algorithm, tuning one to the other in a joint optimization, and formulate the approximate learning problem as a compact convex program. Experiments on synthetic and real-world data show the ability of the algorithm to learn an appropriate clustering metric for a variety of desired clusterings, including email folder organization and image segmentation.
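On the small graphs typical of disulfide connectivity (a handful of bonded cysteines), a maximum-weight perfect matching can even be enumerated directly, which is handy for sanity-checking any learned scoring function. A toy sketch (node names and scores are hypothetical, standing in for learned w·f edge scores):

```python
def perfect_matchings(nodes):
    """Yield all perfect matchings of an even-sized list of nodes."""
    if not nodes:
        yield []
        return
    a, rest = nodes[0], nodes[1:]
    for k, b in enumerate(rest):
        for m in perfect_matchings(rest[:k] + rest[k + 1:]):
            yield [(a, b)] + m

def best_matching(score, nodes):
    """Maximum-weight perfect matching by exhaustive enumeration."""
    return max(perfect_matchings(nodes),
               key=lambda m: sum(score[frozenset(e)] for e in m))

# 4 cysteines, pairwise bond scores.
score = {frozenset({0, 1}): 1.0, frozenset({0, 2}): 0.3, frozenset({0, 3}): 0.4,
         frozenset({1, 2}): 0.2, frozenset({1, 3}): 0.5, frozenset({2, 3}): 2.0}
print(best_matching(score, [0, 1, 2, 3]))   # -> [(0, 1), (2, 3)]
```

The enumeration has (n-1)!! matchings, which is why the certificate formulation (no augmenting alternating cycles) matters for anything beyond toy sizes.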
12.2 Extensions and open problems

There are several immediate applications, novel prediction tasks, and more general learning settings for our estimation framework, as well as less immediate extensions and open problems, including further theoretical analysis and new optimization algorithms. We organize these ideas into several sections below.

12.2.1 Theoretical analysis and optimization algorithms

◦ Approximation bounds. In several of the intractable models, like multi-class AMNs in Ch. 7 and correlation clustering in Ch. 11, we used approximate convex programs within the min-max formulation. These approximate inference programs have strong relative error bounds. An open question is to translate these error bounds on inference into error bounds on the resulting max-margin formulations.

◦ Generalization bounds with distributional assumptions. In Ch. 5, we presented a bound on the structured error in Markov networks, relying only on the samples (x^(i), y^(i)) being i.i.d., without any assumption about the distribution P(y | x). This distribution-free assumption leads to a worst-case analysis, while some assumptions about the approximate decomposition of P(y | x) may be warranted. For example, for sequential prediction problems, the Markov assumption of some finite order is reasonable (i.e., given the input and the previous k labels, the next label is independent of the labels more than k in the past). In spatial prediction tasks, a label variable is independent of the rest given a large enough ball of labels around it. Similar assumptions may be made for some "degree of separation" in relational domains. More generally, it would be interesting to exploit such conditional independence assumptions, or asymptotic bounds on the entropy of P(y | x), to get generalization guarantees even from a single structured example (x, y).

◦ Problem-specific optimization methods. Although our convex formulations are polynomial in the size of the data, scaling
up to larger datasets will require problem-specific optimization methods. For low-treewidth Markov networks and context-free grammars, we have presented the Structured SMO algorithm. Another algorithm useful for such models is Exponentiated Gradient [Bartlett et al., 2004]. Both algorithms rely on dynamic programming decompositions. However, models which do not permit such decompositions, such as graph-cuts and matchings, create a need for new algorithms that can effectively use combinatorial optimization as a subroutine to eliminate the dependence on general-purpose convex solvers.

12.2.2 Novel prediction tasks

◦ Bipartite matchings. Maximum weight bipartite matchings are used in a variety of problems to predict mappings between sets of items. In 2D shape matching, points on two shapes are matched based on their local contour features [Belongie et al., 2002]. In machine translation, matchings are used to map words of the two languages in aligned sentences [Matusov et al., 2004]. Our framework provides an exact, efficient alternative to maximum likelihood estimation for learning the matching scoring function.

◦ Sequence alignment. In standard pairwise alignment of biological sequences, a string edit distance is used to determine which portions of the sequences align to each other [Needleman & Wunsch, 1970; Durbin et al., 1998]. Finding the best alignment involves a dynamic program that generalizes the longest common subsequence algorithm. Our framework can be applied (just as in context-free grammar estimation) to efficiently learn a more complex edit function that depends on contextual string features, perhaps using novel string kernels [Haussler, 1999; Lodhi et al., 2000; Leslie et al., 2002].

◦ Continuous prediction problems. We have addressed estimation of models with discrete output spaces, generalizing classification models to multivariate, structured classification. Similarly, we can consider a whole range of problems where the prediction variables are continuous.
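The alignment dynamic program is the part that stays fixed while the edit costs become learnable. A minimal global-alignment (Needleman-Wunsch-style) sketch, where the substitution and gap scores stand in for a learned w·f over edit operations (all parameter values are made up):

```python
def align_score(a, b, match=2.0, mismatch=-1.0, gap=-1.5):
    """Global alignment score; in a learned model these costs would be w·f(edit op, context)."""
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                           # align prefix of a against nothing
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # substitute / match
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[n][m]

print(align_score("GATTACA", "GCATGCU"))
```

Because the recurrence decomposes over edit operations, the same table supports Hamming-style loss augmentation and min-max estimation, exactly as for context-free grammars.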
Such problems are natural generalizations of regression, involving correlated, inter-constrained real-valued outputs, and they are similar to the discrete structured prediction models we have considered in that inference in the model can be formulated as a convex optimization problem. In financial modeling, for example, convex programs are often used to model portfolio management: Markowitz portfolio optimization is formulated as a quadratic program which minimizes risk and maximizes expected return under budget constraints [Markowitz, 1991; Luenberger, 1997]. In this setting, one could learn a user's return projection and risk assessment function from observed portfolio allocations by the user. Similarly, several recent models of metabolic flux in yeast use linear programming formulations involving quantities of various enzymes, with stoichiometric constraints [Varma & Palsson, 1994]. It would be interesting to use observed equilibria data under different conditions to learn what "objective" the cell is maximizing.

However, there are obstacles to directly applying the min-max or certificate formulations. Details of this are beyond the scope of this thesis, but it suffices to say that loss-augmented inference using, for example, L1 loss (or L2 loss) no longer produces a maximization of a concave objective with convex constraints, since L1 and L2 are convex, not concave (it turns out that it is actually possible to use the L∞ loss, the Hamming distance equivalent). Developing effective loss functions and max-margin formulations for the continuous setting could provide a novel set of effective models for structured multivariate real-valued prediction problems.

12.2.3 More general learning settings

◦ Structure learning. We have focused on the problem of learning the parameters of the model (even though our kernelized models can be considered non-parametric). In the case of Markov networks, there is a wealth of possible structures (cliques in the network) one can use to model a problem, especially in spatial and relational domains. It is particularly interesting to explore the problem of inducing these cliques automatically from data. The standard method of greedy stepwise selection followed by re-estimation is very
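The forward problem that such an "inverse optimization" setting would invert can be stated in a few lines: the minimum-variance portfolio under a fully-invested budget constraint has a closed-form KKT solution. A sketch with made-up covariance numbers (this illustrates the convex forward model only, not a learning algorithm):

```python
import numpy as np

def min_variance_portfolio(cov):
    """Solve min_w w'Σw  s.t. sum(w) = 1 via the KKT linear system [[2Σ, 1],[1', 0]]."""
    n = cov.shape[0]
    kkt = np.zeros((n + 1, n + 1))
    kkt[:n, :n] = 2 * cov
    kkt[:n, n] = 1.0            # column for the budget-constraint multiplier
    kkt[n, :n] = 1.0            # row enforcing sum(w) = 1
    rhs = np.zeros(n + 1)
    rhs[n] = 1.0
    return np.linalg.solve(kkt, rhs)[:n]

cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])
w = min_variance_portfolio(cov)
print(w, w.sum())               # weights sum to 1 by construction
```

Observing such w under varying covariances is the data from which one would hope to recover the user's objective, which is precisely where the concave-loss obstacle discussed above arises.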
expensive in general networks [Della Pietra et al., 1997]. Recent work on selecting input features in Markov networks (or CRFs) uses several approximations to learn efficiently with millions of candidate features [McCallum, 2003]. However, clique selection is still relatively unexplored. It is possible that AMNs, by restricting the network to be tractable under any structure, may permit more efficient clique selection methods.

◦ Semi-supervised learning. Throughout the thesis we have assumed completely labeled data. This assumption often limits us to relatively small training sets where the data has been carefully annotated, while much of the easily accessible data is not labeled at all, or not suitably labeled. There are several more general settings to which we would like to extend our framework. The simplest is a mix of labeled and unlabeled examples, where a small supervised dataset is augmented by a large unsupervised one. There has been much research in this setting for classification [Blum & Mitchell, 1998; Nigam et al., 2000; Szummer & Jaakkola, 2001; Chapelle et al., 2002; Corduneanu & Jaakkola, 2003; Zhu et al., 2003]. Although most of this work has been done in a probabilistic setting, the principle of regularizing (discouraging) decision boundaries near densely clustered inputs could be applicable to our structured setting.

A more complex and rich setting involves the presence of hidden variables in each example. For example, in handwriting recognition, we may not have each letter segmented out, but instead just get a word or sentence as a label for the entire image. Similarly, in machine translation, word correspondences between pairs of sentences are usually not manually annotated (at least not on a large scale); these correspondence variables can be treated as hidden variables. This setting has been studied mainly in probabilistic, generative models, often using the EM algorithm [Dempster et al., 1977; Cowell et al., 1999; Bach & Jordan, 2002; Matusov et al., 2004]. Discriminative methods have been explored far less. Especially in the case of combinatorial structures, extensions of our framework allow opportunities for problem-specific convex approximations to be exploited.
12.3 Future

We have presented a supervised learning framework for a large class of prediction models with rich and interesting structure. Our approach has several theoretical and practical advantages over standard probabilistic models and estimation methods for structured prediction. We hope that continued research in this framework will help tackle ever more sophisticated prediction problems in the future.
Appendix A

Proofs and derivations

A.1 Proof of Theorem 5.1

The proof of Theorem 5.1 uses the covering number bounds of Zhang [2002], in the Data-Dependent Structural Risk Minimization framework [Shawe-Taylor et al., 1998]. We reproduce the relevant definitions and theorems from Zhang [2002] here to highlight the necessary extensions for structured classification.

The covering number is a key quantity in measuring function complexity. Intuitively, the covering number of an infinite class of functions (e.g., those parameterized by a set of weights w) is the number of vectors necessary to approximate the values of any function in the class on a sample. Margin-based analysis of generalization error uses the margin achieved by a classifier on the training set to approximate the original function class of the classifier by a finite covering, with precision that depends on the margin. Zhang provides generalization guarantees for linear binary classifiers of the form h_w(x) = sgn(w^T x); his analysis is based on upper bounds on the covering number for the class of linear functions F_L(w, z) = w^T z, where the norms of the vectors w and z are bounded. Here, we will only define the ∞-norm covering number.
A.1.1 Binary classification

In binary classification, we are given a sample S = {x^(i), y^(i)}_{i=1..m} from a distribution D over X × Y, where X = IR^n and Y is mapped to ±1, so we can fold x and y into z = yx.

Definition A.1.1 (Covering Number) Let V = {v^(1), ..., v^(r)}, where v^(j) ∈ IR^m, be a covering of a function class F(w, S) with precision ε under a metric ρ: for all w there exists a v^(j) such that for each data sample z^(i) ∈ S, ρ(v^(j)_i, F(w, z^(i))) ≤ ε. The covering number of a sample S is the size of the smallest covering: N∞(F, ρ, ε, S) = inf |V|. We also define the covering number for any sample of size m: N∞(F, ρ, ε, m) = sup_{S: |S|=m} N∞(F, ρ, ε, S).

Theorem A.1.2 (Theorem 4 from Zhang [2002]) If ||w||_2 ≤ a and ||z||_2 ≤ b, then for the linear metric ρ_L(v, v') = |v − v'| we have, ∀ε > 0, the following upper bound on the covering number of linear functions:

    log_2 N∞(F_L, ρ_L, ε, m) ≤ 36 (a^2 b^2 / ε^2) log_2( 2⌈4ab/ε + 2⌉ m + 1 ).

In binary classification, we let f(v) = 1(v ≤ 0) be the step function, so that the 0-1 loss of sgn(w^T x) is f(F_L(w, z)). In order to use the classifier's margin to bound its expected loss, the bounds below use a stricter, margin-based loss on the training sample that measures the worst loss achieved by the approximate covering based on this margin. Let f : IR → [0, 1] be a loss function; the corresponding γ-margin loss for a metric ρ is f^γ(v) = sup_{ρ(v,v') < 2γ} f(v'). For 0-1 loss and the linear metric ρ_L, f^γ(v) = 1(v ≤ 2γ). The next theorem bounds the expected f loss in terms of the γ-margin loss on the training sample.

Theorem A.1.3 (Corollary 1 from Zhang [2002]) Let f : IR → [0, 1] be a loss function and f^γ(v) = sup_{ρ(v,v') < 2γ} f(v') be the γ-margin loss for a metric ρ. Let γ_1 > γ_2 > ... be a decreasing sequence of parameters, and p_i be a sequence of positive numbers such that Σ_{i=1}^∞ p_i = 1. Then for all δ > 0, with probability at least 1 − δ over the data:

    E_D[f(F(w, z))] ≤ E_S[f^γ(F(w, z))] + sqrt( (32/m) ( ln 4 N∞(F, ρ, γ_i, S) + ln(1/(p_i δ)) ) )

for all w and γ, where for each fixed γ we use i to denote the smallest index such that γ_i ≤ γ.

A.1.2 Structured classification

We extend this framework to bound the average per-label loss ℓ_H(y)/L for structured classification by defining an appropriate loss f and a function class F (as well as a metric ρ) such that f(F) computes the average per-label loss and f^γ(F) provides a suitable γ-margin loss. We can no longer simply fold x and y, since y is a vector, so we let z = (x, y). In order for our loss function to compute the average per-label loss, it is convenient to make our function class vector-valued (instead of scalar-valued as above): a vector of the minimum values of w^T ∆f_i(y) for each error level ℓ_H(y) from 1 to L, as described below. We will bound the corresponding covering number by building on the bound in Theorem A.1.2.

Definition A.1.4 (dth-error-level function) The dth-error-level function M_d(w, z), for d ∈ {1, ..., L}, is given by:

    M_d(w, z) = min_{y: ℓ_H(y) = d} w^T ∆f_i(y).

Definition A.1.5 (Multi-error-level function class) The multi-error-level function class F_M(w, z) is given by:

    F_M(w, z) = ( M_1(w, z), ..., M_d(w, z), ..., M_L(w, z) ).

We can now compute the average per-label loss from F_M(w, z) by defining an appropriate loss function f_M.

Definition A.1.6 (Average per-label loss) The average per-label loss f_M : IR^L → [0, 1] is given by:

    f_M(v) = (1/L) arg min_{d: v_d ≤ 0} v_d,

where in case ∀d, v_d > 0, we define arg min_{d: v_d ≤ 0} v_d ≡ 0. Note that the case ∀d, M_d(w, z) > 0 corresponds to the classifier making no mistakes: arg max_y w^T f_i(y) = y^(i). With the above definitions, we have an upper bound on the average per-label loss:

    f_M(F_M(w, z)) = (1/L) arg min_{d: M_d(w,z) ≤ 0} M_d(w, z) ≥ (1/L) ℓ_H( arg max_y w^T f_i(y) ).

This upper bound is tight if y' = arg max_y w^T f(x, y) is one that maximizes the Hamming distance from y; otherwise, it is adversarial: among all y' which are better (w^T f_i(y) ≤ w^T f_i(y')), it picks one that maximizes the Hamming distance from y.

We now need to define an appropriate metric ρ that in turn defines the γ-margin loss for structured classification. Since the margin of the hypothesis grows with the number of mistakes, our metric can become "looser" with the number of mistakes, as there is more room for error.

Definition A.1.7 (Multi-error-level metric) The multi-error-level metric ρ_M : IR^L × IR^L → IR is given by:

    ρ_M(v, v') = max_d |v_d − v'_d| / d.

We now define the corresponding γ-margin loss using the new metric:

Definition A.1.8 (γ-margin average per-label loss) The γ-margin average per-label loss f^γ_M : IR^L → [0, 1] is given by:

    f^γ_M(v) = sup_{ρ_M(v,v') ≤ 2γ} f_M(v').

Combining the two definitions, we get:

    f^γ_M(F_M(w, z)) = sup_{v: |v_d − M_d(w,z)| ≤ 2dγ} (1/L) arg min_{d: v_d ≤ 0} v_d ≥ (1/L) arg min_{d: M_d(w,z) ≤ 2dγ} M_d(w, z).
We also define the corresponding covering number for our vector-valued function class:

Definition A.1.9 (Multi-error-level covering number) Let V = {V^(1), ..., V^(r)}, where V^(j) = (V^(j)_1, ..., V^(j)_m) and V^(j)_i ∈ IR^L, be a covering of F_M(w, S) with precision ε under the metric ρ_M: for all w there exists a V^(j) such that for each data sample z^(i) ∈ S, ρ_M(V^(j)_i, F_M(w, z^(i))) ≤ ε. The covering number of a sample S is the size of the smallest covering: N∞(F_M, ρ_M, ε, S) = inf |V|. We also define N∞(F_M, ρ_M, ε, m) = sup_{S: |S|=m} N∞(F_M, ρ_M, ε, S).

We provide a bound on the covering number of our new function class in terms of a covering number for the linear function class. Recall that N_c is the maximum number of cliques in G(x), V_c is the maximum number of values of a clique Y_c, q is the maximum number of cliques that have a variable in common, and R_c is an upper bound on the 2-norm of the clique basis functions. Consider a first-order sequence model as an example, with L as the maximum length and V the number of values a variable takes. Then N_c = 2L − 1, since we have L node cliques and L − 1 edge cliques; V_c = V^2 because of the edge cliques; and q = 3, since nodes in the middle of the sequence participate in 3 cliques: the previous-current edge clique, the node clique, and the current-next edge clique.

Lemma A.1.10 (Bound on multi-error-level covering number)

    N∞(F_M, ρ_M, qε, m) ≤ N∞(F_L, ρ_L, ε, m N_c (V_c − 1)).

Proof: We will show that N∞(F_M, ρ_M, qε, S) ≤ N∞(F_L, ρ_L, ε, S̄), where S̄ is a sample of size m N_c (V_c − 1) constructed below in order to cover the clique potentials. Note that this is sufficient, since by definition N∞(F_L, ρ_L, ε, S̄) ≤ N∞(F_L, ρ_L, ε, m N_c (V_c − 1)).

The construction of S̄ is inspired by the proof technique in Collins [2001], but with a key difference: our construction is linear in the number of cliques N_c and exponential in the number of label variables per clique, while his is exponential in the total number of label variables per example. This reduction in size comes about because our covering approximates the values of the clique potentials w^T ∆f_{i,c}(y_c) for each clique c and clique assignment y_c, as opposed to the values of entire assignments w^T ∆f_i(y). For each sample z^(i) ∈ S, we create N_c (V_c − 1) samples ∆f_{i,c}(y_c), one for each clique c ∈ C^(i) and each assignment y_c ≠ y^(i)_c.

Let V̄ = {v^(1), ..., v^(r)}, where v^(j) ∈ IR^{m N_c (V_c − 1)}, be an ∞-norm covering of F_L(w, S̄): for any w there exists a v^(j) ∈ V̄ such that for each sample ∆f_{i,c}(y_c) ∈ S̄:

    |v^(j)_{i,c}(y_c) − w^T ∆f_{i,c}(y_c)| ≤ ε,   (A.1)

where v^(j)_{i,c}(y_c) denotes the component of v^(j) corresponding to the sample z^(i) and the assignment y_c to the labels of clique c. The number of vectors in V̄ is then r = N∞(F_L, ρ_L, ε, m N_c (V_c − 1)). For convenience, we define v^(j)_{i,c}(y_c) = 0 for correct label assignments y_c = y^(i)_c, as ∆f_{i,c}(y_c) = 0, and we let v^(j)_i(y) = Σ_c v^(j)_{i,c}(y_c). Since v^(j)_{i,c}(y_c) is zero for all cliques c for which the assignment is correct, and each label can appear in at most q cliques, for an assignment y with d mistakes at most dq of the v^(j)_{i,c}(y_c) will be non-zero. By combining this fact with Eq. (A.1), we obtain:

    |v^(j)_i(y) − w^T ∆f_i(y)| ≤ dqε,   ∀y: ℓ_H(y) = d.   (A.2)

We can now use V̄ to construct a covering V = {V^(1), ..., V^(r)} for our multi-error-level function F_M, where V^(j) = (V^(j)_1, ..., V^(j)_m) and V^(j)_i ∈ IR^L:

    V^(j)_i = ( M_1(v^(j), z^(i)), ..., M_d(v^(j), z^(i)), ..., M_L(v^(j), z^(i)) ),   (A.3)

where M_d(v^(j), z^(i)) = min_{y: ℓ_H(y) = d} v^(j)_i(y). We conclude the proof by showing that V is a covering of F_M under ρ_M.
For each w, pick V^(j) ∈ V such that the corresponding v^(j) ∈ V̄ satisfies the condition in Eq. (A.1). We must now bound:

    ρ_M(V^(j)_i, F_M(w, z^(i))) = max_d | min_{y: ℓ_H(y)=d} v^(j)_i(y) − min_{y: ℓ_H(y)=d} w^T ∆f_i(y) | / d.

Fix any d, and let y^v_d = arg min_{y: ℓ_H(y)=d} v^(j)_i(y) and y^w_d = arg min_{y: ℓ_H(y)=d} w^T ∆f_i(y). Consider the case where v^(j)_i(y^v_d) ≥ w^T ∆f_i(y^w_d) (the reverse case is analogous). We must prove that:

    v^(j)_i(y^v_d) − w^T ∆f_i(y^w_d) ≤ v^(j)_i(y^w_d) − w^T ∆f_i(y^w_d) ≤ dqε,   (A.4)

where the first step follows from the definition of y^v_d, since v^(j)_i(y^v_d) ≤ v^(j)_i(y^w_d), and the last step is a direct consequence of Eq. (A.2). Hence ρ_M(V^(j)_i, F_M(w, z^(i))) ≤ qε, which concludes the proof of Lemma A.1.10.

Lemma A.1.11 (Numeric bound on multi-error-level covering number)

    log_2 N∞(F_M, ρ_M, ε, m) ≤ 36 (R_c^2 ||w||_2^2 q^2 / ε^2) log_2( 2⌈4 R_c ||w||_2 q / ε + 2⌉ m N_c (V_c − 1) + 1 ).

Proof: Substitute Theorem A.1.2 into Lemma A.1.10.

Theorem A.1.12 (Multi-label analog of Theorem A.1.3) Let f_M and f^γ_M be as defined above. Let γ_1 > γ_2 > ... be a decreasing sequence of parameters, and p_i be a sequence of positive numbers such that Σ_{i=1}^∞ p_i = 1, where for each fixed γ we use i to denote the smallest index such that γ_i ≤ γ. Then for all δ > 0, with probability at least 1 − δ over the data:

    E_z[f_M(F_M(w, z))] ≤ E_S[f^γ_M(F_M(w, z))] + sqrt( (32/m) ( ln 4 N∞(F_M, ρ_M, γ_i, S) + ln(1/(p_i δ)) ) )

for all w and γ.

Proof: Similar to the proof of Zhang's Theorem 2 and Corollary 1 [Zhang, 2002], where in Step 3 (de-randomization) we substitute the vector-valued F_M and the metric ρ_M.

Theorem 5.1 follows from the above theorem, with γ_i = R_c ||w||_2 / 2^i and p_i = 1/2^i, using an
argument identical to the proof of Theorem 6 in Zhang [2002].

A.2 AMN proofs and derivations

In this appendix, we present proofs of the LP inference properties and derivations of the factored primal and dual max-margin formulations from Ch. 7.

A.2.1 Binary AMNs

Recall that the LP relaxation for finding the optimal max_y g(y) is:

    max  Σ_{v∈V} Σ_{k=1}^K µ_v(k) g_v(k) + Σ_{c∈C\V} Σ_{k=1}^K µ_c(k) g_c(k)   (A.5)
    s.t. µ_v(k) ≥ 0, ∀v ∈ V, k;        Σ_{k=1}^K µ_v(k) = 1, ∀v ∈ V;
         µ_c(k) ≥ 0, ∀c ∈ C\V, k;      µ_c(k) ≤ µ_v(k), ∀c ∈ C\V, v ∈ c, k.

Proof (of Theorem 7.1): Consider any fractional, feasible µ. We show that we can construct a new feasible assignment µ' which increases the objective (or leaves it unchanged) and furthermore has fewer fractional entries. Thus, if we can show that the update results in a feasible and better-scoring assignment, we can apply it repeatedly to get an optimal integer solution. Since g_c(k) ≥ 0, we can assume that µ_c(k) = min_{v∈c} µ_v(k); otherwise we could increase the objective by increasing µ_c(k).

Now consider the smallest fractional node values for each label: let λ(k) = min_{v: 0 < µ_v(k) < 1} µ_v(k) for k = 1, 2. We construct an assignment µ' from µ by leaving integral values unchanged and uniformly shifting fractional values by λ:

    µ'_v(1) = µ_v(1) − λ 1(0 < µ_v(1) < 1),      µ'_v(2) = µ_v(2) + λ 1(0 < µ_v(2) < 1),
    µ'_c(1) = µ_c(1) − λ 1(0 < µ_c(1) < 1),      µ'_c(2) = µ_c(2) + λ 1(0 < µ_c(2) < 1).

Note that if λ = λ(1) or λ = −λ(2), µ' will have at least one more integral µ'_v(k) than µ.

To show that µ' is feasible, we first show that µ'_v(1) + µ'_v(2) = 1. Since µ_v(1) + µ_v(2) = 1, if µ_v(1) is fractional, so is µ_v(2), and hence:

    µ'_v(1) + µ'_v(2) = µ_v(1) − λ 1(0 < µ_v(1) < 1) + µ_v(2) + λ 1(0 < µ_v(2) < 1) = µ_v(1) + µ_v(2) = 1.

To show that µ'_v(k) ≥ 0 for the label whose fractional values are decreased (the other label's values only increase), we prove min_v µ'_v(k) = 0:

    min_v µ'_v(k) = min_v [ µ_v(k) − ( min_{u: µ_u(k)>0} µ_u(k) ) 1(0 < µ_v(k) < 1) ]
                  = min( 0, min_{v: µ_v(k)>0} µ_v(k) − min_{v: µ_v(k)>0} µ_v(k) ) = 0.

Lastly, we show that µ'_c(k) = min_{v∈c} µ'_v(k):

    µ'_c(1) = µ_c(1) − λ 1(0 < µ_c(1) < 1) = ( min_{v∈c} µ_v(1) ) − λ 1(0 < min_{v∈c} µ_v(1) < 1) = min_{v∈c} µ'_v(1),

and similarly for k = 2. Hence the new assignment µ' is feasible, and it remains to show that we can improve the objective. The change in the objective is:

    Σ_{v∈V} Σ_{k=1,2} [µ'_v(k) − µ_v(k)] g_v(k) + Σ_{c∈C\V} Σ_{k=1,2} [µ'_c(k) − µ_c(k)] g_c(k)
      = λ ( Σ_{v∈V} [D_v(2) − D_v(1)] + Σ_{c∈C\V} [D_c(2) − D_c(1)] ) = λD,

where D_v(k) = g_v(k) 1(0 < µ_v(k) < 1) and D_c(k) = g_c(k) 1(0 < µ_c(k) < 1), so that D is a constant depending only on µ and g. This implies that one of the two cases, λ = λ(1) > 0 or λ = −λ(2) < 0, necessarily increases the objective (or leaves it unchanged). We have established that the new assignment µ' is feasible, does not decrease the objective, and has strictly fewer fractional entries, which completes the proof.
For each node i which has not yet been assigned a label. losing at most a factor of m = maxc∈C c in the objective function.A.2 Multiclass AMNs For K > 2.2. Lemma A. we use the randomized rounding procedure of Kleinberg and Tardos [1999] to produce an integer solution for the linear relaxation. the probability that all nodes in a clique c are assigned label k if none of the nodes were previously assigned is 1 K mini∈c µv (k) = that at least one of the nodes will be assigned label k in a phase 1− 1 K K k=1 1 µ (k). Our analysis closely follows that of Kleinberg and Tardos [1999]. we assign the label k if µv (k) ≥ α. c c Proof For a single phase. The probability K c 1 is K (maxi∈c µv (k)). The basic idea of the rounding procedure is to treat µv (k) as probabilities and assign labels according to these probabilities in phases. K v which is proportional to µv (k). and a threshold α ∈ [0. The procedure terminates when all nodes have been assigned a label. 1] uniformly at random.1 The probability that a node i is assigned label k by the randomized procedure is µv (k).2. Nodes in the clique c will be assigned label k by the procedure if they are assigned label k in one phase.2 The probability that all nodes in a clique c are assigned label k by the procedure is at least 1 µ (k). By symmetry. AMN PROOFS AND DERIVATIONS 197 has strictly fewer fractional entries. (They can also be assigned label k as a result of several phases. Lemma A.2. Proof The probability that an unassigned node is assigned label k during one phase is 1 µ (k). but we can ignore this possibility for the purposes of the lower bound. we pick a label k. In each phase. uniformly at random.2. The probability that none of the nodes in the clique will be assigned any label in one phase is maxi∈c µv (k). A.) The probability that all the . the probability that a node is as signed label k over all phases is exactly µv (k).
The probability that all the nodes in c are assigned label k by the procedure in a single phase is:

  Σ_{j=1}^∞ (1/K) µ_c(k) (1 − (1/K) Σ_{k'=1}^K max_{v∈c} µ_v(k'))^{j−1}
    = µ_c(k) / (Σ_{k'=1}^K max_{v∈c} µ_v(k'))
    ≥ µ_c(k) / (Σ_{k'=1}^K Σ_{v∈c} µ_v(k'))
    = µ_c(k) / |c|.

Above, we first used the fact that for d < 1, Σ_{i=0}^∞ d^i = 1/(1−d), and then upper-bounded the max of the set of positive µ_v(k)'s by their sum; the last equality holds because Σ_{k'} µ_v(k') = 1 for each node v.

Theorem A.2.3 The expected cost of the assignment found by the randomized procedure given a solution µ to the linear program in Eq. (A.5) is at least

  Σ_{v∈V} Σ_{k=1}^K g_v(k) µ_v(k) + Σ_{c∈C\V} (1/|c|) Σ_{k=1}^K g_c(k) µ_c(k).

Proof This is immediate from the previous two lemmas.

By picking m = max_{c∈C} |c|, we have that the rounded solution is at most m times worse than the optimal solution produced by the LP of Eq. (A.5). We can also derandomize this procedure to get a deterministic algorithm with the same guarantees, using the method of conditional probabilities, similar in spirit to the approach of Kleinberg and Tardos [1999]. Note that the approximation factor of m applies, in fact, only to the clique potentials: if we compare the log-probability of the optimal MAP solution and the log-probability of the assignment produced by this randomized rounding procedure, the terms corresponding to the log-partition function and the node potentials are identical. We obtain an additive error (in log-probability space) only for the clique potentials. As node potentials are often larger in magnitude than clique potentials, the fact that we incur no loss proportional to node potentials is likely to lead to smaller errors in practice. Along similar lines, we note that the constant factor approximation is smaller for smaller cliques, and we observe that the potentials associated with large cliques are typically smaller in magnitude, reducing further the actual error in practice.
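The geometric-series step in the proof can also be verified numerically. This sketch (illustrative values only, not code from the thesis) evaluates the truncated series and compares it with the closed form µ_c(k) / Σ_{k'} max_{v∈c} µ_v(k'), which in turn is at least µ_c(k)/|c|.

```python
def single_phase_success(mu_clique, k, terms=10000):
    """Truncated series: sum over phases j of P(clique untouched for j-1
    phases) * P(all clique nodes get label k in phase j)."""
    K = len(mu_clique[0])
    mu_c = min(dist[k] for dist in mu_clique)            # mu_c(k) = min marginal
    q = sum(max(dist[kk] for dist in mu_clique) for kk in range(K)) / K
    return sum((mu_c / K) * (1.0 - q) ** (j - 1) for j in range(1, terms + 1))

mu_clique = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.6, 0.2, 0.2]]
series = single_phase_success(mu_clique, 0)
closed = min(d[0] for d in mu_clique) / sum(max(d[k] for d in mu_clique)
                                            for k in range(3))
```

Here µ_c(0) = 0.4 and Σ_k max_v µ_v(k) = 1.2, so both expressions come out to 1/3, comfortably above the lower bound 0.4/3 from Lemma A.2.2.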
The only difference between the expected cost of the rounded solution and the (non-integer) optimal solution is the 1/|c| factor in the second term.
A.2.3 Derivation of the factored primal and dual max-margin QP

Using Assumptions 7.1 and 7.2, the interior max subproblem max_y [w f_i(y) + ℓ_i(y)] in the max-margin QP can be written as an LP over node marginals µ_{i,v}(k) and clique marginals µ_{i,c}(k). Taking the dual of that LP, we have a variable ξ_{i,v} for each normalization constraint Σ_k µ_{i,v}(k) = 1 and a variable m_{i,c,v}(k) for each inequality constraint µ_{i,c}(k) ≤ µ_{i,v}(k):

  min Σ_{v∈V^(i)} ξ_{i,v}                                                       (A.6)
  s.t. ξ_{i,v} ≥ ẇ Δḟ_{i,v}(k) + ℓ_{i,v}(k) − Σ_{c⊃v} m_{i,c,v}(k),   ∀k, v ∈ V^(i);
       Σ_{v∈c} m_{i,c,v}(k) ≥ w̄ Δf̄_{i,c}(k),   ∀k, c ∈ C^(i)\V^(i);
       m_{i,c,v}(k) ≥ 0,   ∀k, c ∈ C^(i)\V^(i), v ∈ c,

where Δḟ_{i,v}(k) = ḟ_{i,v}(k) − ḟ_{i,v}(y_v^(i)) and Δf̄_{i,c}(k) = f̄_{i,c}(k) − f̄_{i,c}(y_c^(i)) are the node and clique feature differences, and w = (ẇ, w̄) collects the node and clique weights. Substituting this dual into the max-margin QP in place of the exponential set of constraints, we obtain:

  min ½‖w‖² + C Σ_i ξ_i                                                          (A.8)
  s.t. ξ_i ≥ Σ_{v∈V^(i)} ξ_{i,v},   ∀i;
       ξ_{i,v} ≥ ẇ Δḟ_{i,v}(k) + ℓ_{i,v}(k) − Σ_{c⊃v} m_{i,c,v}(k),   ∀i, k, v ∈ V^(i);
       Σ_{v∈c} m_{i,c,v}(k) ≥ w̄ Δf̄_{i,c}(k),   ∀i, k, c ∈ C^(i)\V^(i);
       m_{i,c,v}(k) ≥ 0;   w̄ ≥ 0.

Since ξ_i = Σ_{v∈V^(i)} ξ_{i,v} at the optimum, we can eliminate ξ_i and the corresponding set of constraints to get the factored formulation, repeated here for reference:

  min ½‖w‖² + C Σ_i Σ_{v∈V^(i)} ξ_{i,v}                                          (A.9)
  s.t. ξ_{i,v} ≥ ẇ Δḟ_{i,v}(k) + ℓ_{i,v}(k) − Σ_{c⊃v} m_{i,c,v}(k),   ∀i, k, v ∈ V^(i);
       Σ_{v∈c} m_{i,c,v}(k) ≥ w̄ Δf̄_{i,c}(k),   ∀i, k, c ∈ C^(i)\V^(i);
       m_{i,c,v}(k) ≥ 0;   w̄ ≥ 0.

Now the dual of Eq. (A.9) is given by:

  max Σ_{i,v,k} µ_{i,v}(k) ℓ_{i,v}(k) − ½ ‖Σ_{i,v,k} µ_{i,v}(k) Δḟ_{i,v}(k)‖²
      − ½ ‖τ̄ − Σ_{i,c,k} µ_{i,c}(k) Δf̄_{i,c}(k)‖²                               (A.10)
  s.t. Σ_k µ_{i,v}(k) = C,   ∀i, v ∈ V^(i);
       µ_{i,c}(k) ≤ µ_{i,v}(k),   ∀i, k, c ∈ C^(i)\V^(i), v ∈ c;
       µ_{i,v}(k) ≥ 0;   µ_{i,c}(k) ≥ 0;   τ̄ ≥ 0.

In this dual, the variables µ_{i,v}(k) and µ_{i,c}(k) correspond to the first two sets of primal constraints, while λ_{i,c,v}(k) and τ̄ correspond to the third and fourth sets; the condition λ_{i,c,v}(k) = µ_{i,v}(k) − µ_{i,c}(k) ≥ 0 appears above simply as µ_{i,c}(k) ≤ µ_{i,v}(k), so λ can be eliminated. Using the substitution ν̄ = τ̄ − Σ_{i,c,k} µ_{i,c}(k) Δf̄_{i,c}(k) (at the optimum, ν̄ = w̄ ≥ 0), eliminating τ̄, and dividing the µ's by C, we can re-express the above QP as:

  max C Σ_{i,v,k} µ_{i,v}(k) ℓ_{i,v}(k) − (C²/2) ‖Σ_{i,v,k} µ_{i,v}(k) Δḟ_{i,v}(k)‖² − ½ ‖ν̄‖²
  s.t. Σ_k µ_{i,v}(k) = 1,   ∀i, v ∈ V^(i);
       0 ≤ µ_{i,c}(k) ≤ µ_{i,v}(k),   ∀i, k, c ∈ C^(i)\V^(i), v ∈ c;
       ν̄ ≥ −C Σ_{i,c,k} µ_{i,c}(k) Δf̄_{i,c}(k).
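The derivation above hinges on the loss-augmented inference problem max_y [w Δf_i(y) + ℓ_i(y)] decomposing over nodes and cliques. For a chain-structured example with Hamming loss, the inner max can be computed by a Viterbi-style dynamic program; the sketch below (a toy instance with made-up scores, not code from the thesis) checks the DP against brute-force enumeration.

```python
import itertools
import random

def loss_augmented_max(node_score, edge_score, y_true):
    """Viterbi-style DP for max over y of
    sum_v [node_score[v][y_v] + 1(y_v != y_true[v])] + sum_v edge_score[v][y_v][y_{v+1}]."""
    n, K = len(node_score), len(node_score[0])
    # fold the per-node Hamming loss into the node scores
    theta = [[node_score[v][k] + (k != y_true[v]) for k in range(K)] for v in range(n)]
    best = list(theta[0])
    for v in range(1, n):
        best = [theta[v][k] + max(best[j] + edge_score[v - 1][j][k] for j in range(K))
                for k in range(K)]
    return max(best)

def brute_force_max(node_score, edge_score, y_true):
    """Exhaustive check over all K**n assignments."""
    n, K = len(node_score), len(node_score[0])
    def score(y):
        s = sum(node_score[v][y[v]] + (y[v] != y_true[v]) for v in range(n))
        return s + sum(edge_score[v][y[v]][y[v + 1]] for v in range(n - 1))
    return max(score(y) for y in itertools.product(range(K), repeat=n))
```

Because the DP touches each edge once, it scales as O(nK²) while enumeration is O(K^n); this decomposability is exactly what lets the exponential constraint set of the original QP collapse into the polynomially many ξ_{i,v} and m_{i,c,v}(k) constraints above.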
Bibliography

Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. Proc. ICML.
Ahuja, R., & Orlin, J. (2001). Inverse optimization, Part I: Linear programming and general problem. Operations Research, 49, 771–783.
Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov support vector machines. Proc. ICML.
Altun, Y., Smola, A., & Hofmann, T. (2004). Exponential families for conditional random fields. Proc. UAI.
Bach, F., & Jordan, M. (2001). Thin junction trees. NIPS.
Bach, F., & Jordan, M. (2003). Learning spectral clustering. NIPS.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Addison-Wesley-Longman.
Bairoch, A., & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL. Nucleic Acids Res., 28, 45–48.
Baldi, P., Cheng, J., & Vullo, A. (2004). Large-scale prediction of disulphide bond connectivity. NIPS.
Baldi, P., & Pollastri, G. (2003). The principled design of large-scale recursive neural network architectures: DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning Research, 4, 575–602.
Bansal, N., Blum, A., & Chawla, S. (2002). Correlation clustering. Proc. FOCS.
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. Proc. ICML.
Bartlett, P., Collins, M., Taskar, B., & McAllester, D. (2004). Exponentiated gradient algorithms for large-margin structured classification. NIPS.
Bartlett, P., Jordan, M., & McAuliffe, J. (2003). Convexity, classification, and risk bounds (Technical Report 638). Department of Statistics, U.C. Berkeley.
Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24.
Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., & Bourne, P. (2000). The Protein Data Bank. Nucleic Acids Research, 28.
Bertsekas, D. (1999). Nonlinear programming. Belmont, MA: Athena Scientific.
Bertsimas, D., & Tsitsiklis, J. (1997). Introduction to linear optimization. Athena Scientific.
Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B, 48.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford, UK: Oxford University Press.
Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. Proc. COLT.
Bouman, C., & Shapiro, M. (1994). A multiscale random field model for Bayesian image segmentation. IEEE Trans. Image Processing, 3(2).
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Boykov, Y., & Kolmogorov, V. (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in computer vision. PAMI.
Boykov, Y., Veksler, O., & Zabih, R. (1999a). Fast approximate energy minimization via graph cuts. Proc. ICCV.
Boykov, Y., Veksler, O., & Zabih, R. (1999b). Markov random fields with efficient approximations. Proc. CVPR.
Burton, D., & Toint, P. (1992). On an instance of the inverse shortest paths problem. Mathematical Programming, 53, 45–61.
Ceroni, A., Frasconi, P., Passerini, A., & Vullo, A. (2003). Predicting the disulfide bonding state of cysteines with combinations of kernel machines. Journal of VLSI Signal Processing, 35, 287–295.
Chakrabarti, S., Dom, B., & Indyk, P. (1998). Enhanced hypertext categorization using hyperlinks. Proc. SIGMOD.
Chapelle, O., Weston, J., & Schoelkopf, B. (2002). Cluster kernels for semi-supervised learning. NIPS.
Charikar, M., Guruswami, V., & Wirth, A. (2003). Clustering with qualitative information. Proc. FOCS.
Charniak, E. (1993). Statistical language learning. MIT Press.
Clark, S., & Curran, J. (2004). Parsing the WSJ using CCG and log-linear models. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04).
Collins, M. (1999). Head-driven statistical models for natural language parsing. Doctoral dissertation, University of Pennsylvania.
Collins, M. (2000). Discriminative reranking for natural language parsing. ICML 17 (pp. 175–182).
Collins, M. (2001). Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. IWPT.
Collins, M. (2004). Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In H. Bunt, J. Carroll and G. Satta (Eds.), New developments in parsing technology. Kluwer.
Corduneanu, A., & Jaakkola, T. (2003). On information regularization. Proc. UAI.
Cormen, T., Leiserson, C., Rivest, R., & Stein, C. (2001). Introduction to algorithms. MIT Press. 2nd edition.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Cowell, R., Dawid, A., Lauritzen, S., & Spiegelhalter, D. (1999). Probabilistic networks and expert systems. New York: Springer.
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5), 265–292.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., & Slattery, S. (1998). Learning to extract symbolic knowledge from the world wide web. Proc. AAAI-98 (pp. 509–516).
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
DeGroot, M. (1970). Optimal statistical decisions. New York: McGraw-Hill.
Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(4), 380–393.
Demaine, E., & Immorlica, N. (2003). Correlation clustering with partial information. Proc. APPROX.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1–38.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer-Verlag.
Duda, R., Hart, P., & Stork, D. (2000). Pattern classification. New York: Wiley Interscience. 2nd edition.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis. Cambridge University Press.
Edmonds, J. (1965). Maximum matching and a polyhedron with 0–1 vertices. Journal of Research of the National Bureau of Standards, 69B, 125–130.
Egghe, L., & Rousseau, R. (1990). Introduction to informetrics. Elsevier.
Emanuel, D., & Fiat, A. (2003). Correlation clustering: minimizing disagreements on arbitrary weighted graphs. Proc. ESA.
Fariselli, P., Riccobelli, P., & Casadio, R. (1999). Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins, 36, 340–346.
Fariselli, P., Martelli, P., & Casadio, R. (2002). A neural network-based method for predicting the disulfide connectivity in proteins. Knowledge Based Intelligent Information Engineering Systems and Allied Technologies (KES 2002).
Fariselli, P., & Casadio, R. (2001). Prediction of disulfide connectivity in proteins. Bioinformatics, 17, 957–964.
Fiser, A., & Simon, I. (2000). Predicting the oxidation state of cysteines by multiple sequence alignment. Bioinformatics, 16, 251–256.
Forsyth, D., & Ponce, J. (2002). Computer vision: A modern approach. Prentice Hall.
Frasconi, P., Gori, M., & Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9.
Frasconi, P., Passerini, A., & Vullo, A. (2002). A two stage SVM architecture for predicting the disulfide bonding state of cysteines. Proceedings of the IEEE Neural Networks for Signal Processing Conference.
Freund, Y., & Schapire, R. (1998). Large margin classification using the perceptron algorithm. Proc. Computational Learning Theory.
Friedman, N., Getoor, L., Koller, D., & Pfeffer, A. (1999). Learning probabilistic relational models. Proc. IJCAI-99 (pp. 1300–1309). Stockholm, Sweden.
Friess, T., Cristianini, N., & Campbell, C. (1998). The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. Proc. ICML.
Gabow, H. (1973). Implementation of algorithms for maximum matching on nonbipartite graphs. Doctoral dissertation, Stanford University.
Garey, M., & Johnson, D. (1979). Computers and intractability. Freeman.
Geman, S., & Johnson, M. (2002). Dynamic programming for parsing and estimation of stochastic unification-based grammars. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2002). Learning probabilistic models of link structure. Journal of Machine Learning Research.
Getoor, L., Segal, E., Taskar, B., & Koller, D. (2001). Probabilistic models of text and link structure for hypertext classification. IJCAI-01 Workshop on Text Learning: Beyond Supervision. Seattle, Wash.
Greig, D., Porteous, B., & Seheult, A. (1989). Exact maximum a posteriori estimation for binary images. J. Royal Statist. Soc. B, 51.
Guestrin, C., Koller, D., Parr, R., & Venkataraman, S. (2003). Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19.
Gusfield, D. (1997). Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press.
Harrison, P., & Sternberg, M. (1994). Analysis and classification of disulphide connectivity in proteins: the entropic effect of cross-linkage. Journal of Molecular Biology, 244.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer-Verlag.
Haussler, D. (1999). Convolution kernels on discrete structures (Technical Report). UC Santa Cruz.
Heuberger, C. (2004). Inverse combinatorial optimization: A survey on problems, methods, and results. Journal of Combinatorial Optimization, 8.
Hochbaum, D. (Ed.). (1997). Approximation algorithms for NP-hard problems. PWS Publishing Company.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. ICML-99 (pp. 200–209). San Francisco, US: Morgan Kaufmann Publishers.
Johnson, M. (2001). Joint and conditional estimation of tagging and parsing models. ACL 39.
Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic "unification-based" grammars. Proceedings of ACL 1999.
Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22.
Kaplan, R., Riezler, S., King, T., Maxwell, J., Vasserman, A., & Crouch, R. (2004). Speed and accuracy in shallow and deep stochastic parsing. Proceedings of HLT-NAACL'04.
Kassel, R. (1995). A comparison of approaches to on-line handwritten character recognition. Doctoral dissertation, MIT Spoken Language Systems Group.
Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1), 1–63.
Klein, D., & Manning, C. (2003). Accurate unlexicalized parsing. ACL 41 (pp. 423–430).
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
Kleinberg, J., & Tardos, E. (1999). Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. Proc. FOCS.
Klepeis, J., & Floudas, C. (2003). Prediction of β-sheet topology and disulfide bridges in polypeptides. Journal of Computational Chemistry, 24, 191–208.
Koller, D., & Pfeffer, A. (1998). Probabilistic frame-based systems. Proc. AAAI-98 (pp. 580–587). Madison, Wisc.
Kolmogorov, V., & Zabih, R. (2002). What energy functions can be minimized using graph cuts? PAMI, 24.
Kumar, S., & Hebert, M. (2003). Discriminative fields for modeling spatial dependencies in natural images. NIPS.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. ICML.
Lafferty, J., Zhu, X., & Liu, Y. (2004). Kernel conditional random fields: Representation and clique selection. Proc. ICML.
Lawler, E. (1976). Combinatorial optimization: Networks and matroids. New York: Holt, Rinehart and Winston.
Leslie, C., Eskin, E., & Noble, W. (2002). The spectrum kernel: a string kernel for SVM protein classification. Pacific Symposium on Biocomputing.
Liu, Z., & Zhang, J. (2003). On inverse problems of optimum perfect matching. Journal of Combinatorial Optimization, 7.
Lodhi, H., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2000). Text classification using string kernels. NIPS.
Luenberger, D. (1997). Investment science. Oxford University Press.
Manning, C., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press.
Marcus, M., Santorini, B., & Marcinkiewicz, M. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Markowitz, H. (1991). Portfolio selection: Efficient diversification of investments. Basil Blackwell.
Martelli, P., Fariselli, P., Malaguti, L., & Casadio, R. (2002). Prediction of the disulfide-bonding state of cysteines in proteins at 88% accuracy. Protein Science, 11, 2735–2739.
Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proc. ICCV.
Matsumura, M., Signor, G., & Matthews, B. (1989). Substantial increase of protein stability by multiple disulfide bonds. Nature, 342, 291–293.
Matusov, E., Zens, R., & Ney, H. (2004). Symmetric word alignments for statistical machine translation. Proc. COLING.
McCallum, A. (2003). Efficiently inducing features of conditional random fields. Proc. UAI.
McCallum, A., & Wellner, B. (2003). Toward conditional models of identity uncertainty with application to proper noun coreference. IJCAI Workshop on Information Integration on the Web.
Mitchell, T. (1997). Machine learning. McGraw-Hill.
Miyao, Y., & Tsujii, J. (2002). Maximum entropy estimation for feature forests. Proceedings of Human Language Technology Conference (HLT 2002).
Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: an empirical study. Proc. UAI-99 (pp. 467–475).
Needleman, S., & Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48.
Nemhauser, G., & Wolsey, L. (1999). Integer and combinatorial optimization. New York: J. Wiley.
Neville, J., & Jensen, D. (2000). Iterative classification in relational data. Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data (pp. 13–20).
Ng, A., & Jordan, M. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. NIPS.
Ng, A., & Russell, S. (2000). Algorithms for inverse reinforcement learning. Proc. ICML.
Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39.
Nocedal, J., & Wright, S. (1999). Numerical optimization. New York: Springer.
Papadimitriou, C., & Steiglitz, K. (1982). Combinatorial optimization: Algorithms and complexity. Englewood Cliffs, NJ: Prentice-Hall.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Francisco: Morgan Kaufmann.
Pinto, D., McCallum, A., Wei, X., & Croft, W. B. (2003). Table extraction using conditional random fields. Proc. ACM SIGIR.
Platt, J. (1999). Using sparseness and analytic QP to speed training of support vector machines. NIPS.
Potts, R. (1952). Some generalized order-disorder transformations. Proc. Cambridge Phil. Soc., 48.
Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Schrijver, A. (2003). Combinatorial optimization: Polyhedra and efficiency. Springer.
Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. Proc. HLT-NAACL.
Shawe-Taylor, J., Bartlett, P., Williamson, R., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Trans. on Information Theory, 44(5), 1926–1940.
Shen, L., Sarkar, A., & Joshi, A. (2003). Using LTAG based features in parse reranking. Proc. EMNLP.
Slattery, S., & Mitchell, T. (2000). Discovering test set regularities in relational domains. Proc. ICML-00 (pp. 895–902).
Smith, T., & Waterman, M. (1981). Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
Sutton, C., Rohanimanesh, K., & McCallum, A. (2004). Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. Proc. ICML.
Szummer, M., & Jaakkola, T. (2001). Partially labeled classification with Markov random walks. NIPS.
Tarantola, A. (1987). Inverse problem theory: Methods for data fitting and model parameter estimation. Amsterdam: Elsevier.
B. Abbeel. (2002).. Collins. Chatalbashev. E. Taskar.. C. (2001).. New York: SpringerVerlag. L. A. Guestrin. ICML. Vapnik. D. Klein. Theoretical Computer Science.. . A. Hofmann. Support vector machine learning for interdependent and structured output spaces. T. EMNLP. 252–259). & Vespignani.. & Palsson. Metabolic ﬂux balancing: Basic concepts... New York. D. (2003b). D. Max margin parsing. T. C... NAACL 3 (pp. B. B. Twentyﬁrst international conference on Machine learning... K... Toutanova. & Koller. Bio/Technology. & Koller.. Proc. C. Discriminative probabilistic models for relational data. P. Taskar. V. Seattle. M. G. 870–876). Max margin Markov networks. B. 189–201. Probabilistic classiﬁcation and clustering in relational data.. M. (2003). Varma. UAI. (2004a).BIBLIOGRAPHY 213 Taskar. Maritan. & Singer. B. Wong. scientiﬁc and practical use. Taskar. Valiant. (2004b).. Koller. Nature Biotechnology. B. D. V. & Altun. (2003a). Joachims. & Koller.. (1994). D. IJCAI01 (pp. (1995). (1979).. D. Taskar. 8. (2004). Wash. P. I. Link prediction in relational data. Learning associative Markov networks. & Koller. A. Taskar. & Manning. Flammini... Proc. Klein. 6. The complexity of computing the permanent. NIPS. Abbeel... Y.. Global protein function prediction from proteinprotein interaction networksh. (2003). & Koller.. 12. D. NIPS. D. A. D. Featurerich partofspeech tagging with a cyclic dependency network. Vazquez. Tsochantaridis.. Proc. A. Manning. The nature of statistical learning theory. B. Y. Segal.
Veksler, O. (1999). Efficient graph-based energy minimization methods in computer vision. Doctoral dissertation, Department of Computer Science, Cornell University.
Vullo, A., & Frasconi, P. (2004). Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20, 653–659.
Wahba, G., Gu, C., Wang, Y., & Chappell, R. (1993). Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance. Computational Learning Theory and Natural Learning Systems.
Wainwright, M., Jaakkola, T., & Willsky, A. (2002). MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches. Allerton Conference on Communication, Control and Computing.
Wainwright, M., & Jordan, M. (2003). Variational inference in graphical models: The view from the marginal polytope. Allerton Conference on Communication, Control and Computing.
Weston, J., & Watkins, C. (1998). Multi-class support vector machines (Technical Report CSD-TR-98-04). Royal Holloway, University of London.
Xing, E., Ng, A., Jordan, M., & Russell, S. (2002). Distance metric learning, with application to clustering with side-information. NIPS.
Yang, Y., Slattery, S., & Ghani, R. (2002). A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2).
Yedidia, J., Freeman, W., & Weiss, Y. (2000). Generalized belief propagation. NIPS.
Younger, D. (1967). Recognition and parsing of context-free languages in time n³. Information and Control, 10, 189–208.
Zhang, J., & Ma, Z. (1996). A network flow method for solving inverse combinatorial optimization problems. Optimization, 37, 59–72.
Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2, 527–550.
Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. NIPS.
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. Proc. ICML.