
Uncovering, Understanding, and Predicting Links
Jonathan Chang
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Electrical Engineering
Adviser: David M. Blei
November 2011
© Copyright by Jonathan Chang, 2011.
All Rights Reserved
Abstract
Network data, such as citation networks of documents, hyperlinked networks of web
pages, and social networks of friends, are pervasive in applied statistics and machine
learning. The statistical analysis of network data can provide both useful predictive
models and descriptive statistics. Predictive models can point social network mem-
bers towards new friends, scientific papers towards relevant citations, and web pages
towards other related pages. Descriptive statistics can uncover the hidden community
structure underlying a network data set.
In this work we develop new models of network data that account for both links
and attributes. We also develop the inferential and predictive tools around these
models to make them widely applicable to large, real-world data sets. One such model,
the Relational Topic Model, can predict links using only a new node's attributes. Thus,
we can suggest citations of newly written papers, predict the likely hyperlinks of a
web page in development, or suggest friendships in a social network based only on a
new user's profile of interests. Moreover, given a new node and its links, the model
provides a predictive distribution of node attributes. This mechanism can be used to
predict keywords from citations or a user's interests from his or her social connections.
While explicit network data (network data in which the connections between
people, places, genes, corporations, etc. are explicitly encoded) are already ubiquitous,
most of these data can only annotate connections in a limited fashion. Although
relationships between entities are rich, it is impractical to manually devise complete
characterizations of these relationships for every pair of entities in large, real-world
corpora. To resolve this we present a probabilistic topic model to analyze text corpora
and infer descriptions of their entities and of the relationships between those entities.
We show qualitatively and quantitatively that our model can construct and annotate
graphs of relationships and make useful predictions.
Acknowledgements
A graduate career is an endeavor which requires support from all those around you.
Friends, family, you know who you are and what I owe you (at least $2016). To all
the people in the EE and CS departments, especially the Liberty and SL@P labs, it's
been a ball.
I want to call some special attention (in temporal order) to the faculty who have
helped me on my peripatetic journey through grad school. First off, thanks to David
August who took a chance on a clown with green hair. I made some good friends and
research contributions during my sojourn at the Liberty lab. Next I'd like to thank
Moses Charikar, Christiane D. Fellbaum, and Dan Osherson for giving me my second
chance by including me on the WordNet project when I had all but given up on
graduate school. Special thanks also go out to the members of my FPO committee:
Rob Schapire, Paul Cuff, Sanjeev Kulkarni, and Matt Salganik. Thanks for helping
me make sure my thesis is well-written and relevant.
Finally, the bulk of my thanks must be given to David Blei, a consummate advisor,
teacher, and all-around stand-up guy. Thanks for teaching me about variational
inference, schooling me on strange and wonderful music, and never giving up on me
and making sure I finished.
To Rory Gilmore, for being a hell of a lot smarter than me.
Contents
Abstract
Acknowledgements
1 Introduction
2 Modeling, Inference and Prediction
2.1 Probabilistic Models
2.2 Inference
2.2.1 Exponential family distributions
2.3 Example
2.4 Prediction
3 Exponential Family Models of Links
3.1 Background
3.2 Pairwise Ising model
3.2.1 Approximate inference of marginals
3.2.2 Parameter estimation
3.3 Evaluation
3.3.1 Estimating marginal probabilities
3.3.2 Making predictions
3.4 Discussion
4 Relational Topic Models
4.1 Relational Topic Models
4.1.1 Modeling assumptions
4.1.2 Latent Dirichlet allocation
4.1.3 Relational topic model
4.1.4 Link probability function
4.2 Inference, Estimation and Prediction
4.2.1 Inference
4.2.2 Parameter estimation
4.2.3 Prediction
4.3 Empirical Results
4.3.1 Evaluating the predictive distribution
4.3.2 Automatic link suggestion
4.3.3 Modeling spatial data
4.3.4 Modeling social networks
4.4 Discussion
5 Discovering Link Information
5.1 Background
5.2 Model
5.3 Computation with NUBBI
5.3.1 Inference
5.3.2 Parameter estimation
5.3.3 Prediction
5.4 Experiments
5.4.1 Learning networks
5.4.2 Evaluating the predictive distribution
5.4.3 Application to New York Times
5.5 Discussion and related work
6 Conclusion
A Derivation of RTM Coordinate Ascent Updates
B Derivation of RTM Parameter Estimates
C Derivation of NUBBI coordinate-ascent updates
D Derivation of Gibbs sampling equations
D.1 Latent Dirichlet allocation (LDA)
D.2 Mixed-membership stochastic blockmodel (MMSB)
D.3 Relational topic model (RTM)
D.4 Supervised latent Dirichlet allocation (sLDA)
D.5 Networks uncovered by Bayesian inference (NUBBI) model
Chapter 1
Introduction
In this work our aim is to apply the tools of probabilistic modeling to network data,
that is, a collection of nodes with identifiable properties, each one (possibly) connected
to other nodes. In the parlance of graph theory, we are concerned with graphs,
collections of vertices and edges, whose vertices may contain additional information.
In modeling these networks we aim to gain insight into the structure underpinning
them and to make predictions about them.
Much of the pioneering work on the study of networks was done under the auspices
of sociological studies, i.e., the networks under consideration were social networks.
Zachary's data on the members of a university karate club (Zachary 1977) and
Sampson's study of social interactions among monks at a monastery (Sampson 1969)
are some early iconic works. The number and variety of data sets have grown
considerably since, from networks of dolphins (Lusseau et al. 2003) to co-authorship
networks (Newman 2006a); however, the underlying structure of the data remains the
same: a collection of nodes (people / animals / organisms / etc.) connected to one
another through some relationship (friendship / hatred / co-authorship / etc.).
In recent years, with the increasing digital representation of entities and the
relationships between them, the amount of data available to researchers has increased
Figure 1.1: A depiction of a subset of an online social network. Nodes represent
individuals and edges represent friendships between them.
and the impact of network understanding and prediction has magnified enormously.
Online social networks such as Facebook (http://www.facebook.com), LinkedIn
(http://www.linkedin.com), and Twitter (http://www.twitter.com) have made creating
and leveraging these networks their primary product. Consequently these online social
networks operate on a scale unimaginable to the early researchers of social networks;
the aforementioned early works have social networks on the order of tens of nodes
whereas Facebook alone has over 500 million users
(https://www.facebook.com/press/info.php?statistics, retrieved June 2011).
Figure 1.1 shows what a subset of an online social network might look like. The
nodes in the graph represent people and the edges represent self-reported friendship
between members. Even in this simple example, a rich structure emerges with some
individuals belonging to tightly connected clusters while others exist on the periphery.
Characterizing this structure has been one major thrust of network research (Newman
et al. 2006b).
Figure 1.2 shows a screen capture from the online social network Facebook. In
this view, the screenshot shows some of the other nodes connected to the node that
the profile represents. The screenshot shows the variety of nodes and the large number
of edges associated with a single user. For example, in this small portion of the
profile alone there are connections to nodes representing friends and family, nodes
representing workplaces, nodes representing schools, and nodes representing interests.
Again, there is a rich structure to be explored.
Thus far we have treated social networks as simple graphs;
however, these networks are often richer than traditional graphs can express. In
particular, both the nodes and edges may have some content associated with them.
Even in Figure 1.2 it is clear that a single node / edge type cannot capture the
structure associated with friends vs. family or musical interest vs. sport interest.
The nodes may also have other attributes such as age or gender that can make for a
more expressive probabilistic model. Additionally, users on online social networks may
produce textual data associated with status updates or biographical prose. Figure 1.3
shows an example of a status update. The user generates some snippet of text which
is then posted online; other users may respond with comments. A collection of status
updates (and comments, etc.) comprises a corpus. Outside of social networks, one may
also consider citation networks (Figure 4.1), gene regulatory networks and many other
instances as networks whose nodes and their attendant attributes comprise a corpus.
Thus instead of referring to nodes / vertices we may refer to documents, and instead of
Figure 1.2: A screenshot from a typical Facebook profile with sections annotated. In
this subset of the profile alone there are network connections to workplaces, schools,
interests, family and friends. Understanding the nature of these connections is of
immense practical and theoretical interest.
Figure 1.3: A screenshot of a typical Facebook status update, a small user-generated
snippet of text. Other users can react to status updates by posting comments in
response.
referring to node attributes we refer to words. Throughout this work we will be using
the language of textual analysis and the language of graph theory interchangeably.
The study of natural language has a long and rich history (see Jurafsky and Martin
(2008) for a description of many modern techniques for analyzing language). One
modern technique to analyze language that we shall leverage throughout this work is
topic modeling (Blei et al. 2003b). Topic modeling, to be described in more detail
in Section 4.1, is a latent mixed-membership model. It presupposes the existence
of latent themes or topics which characterize how words tend to occur with one
another. Documents are then merely specific realizations of ensembles of these themes.
Figure 1.4 depicts how the approach assumes topics on the left and
an ensemble of these themes for each document (right). It is this ensemble that
determines the words in the document that we observe.
We have thus far described two incomplete perspectives of data; one which is
centered around documents and another centered around graphs. What we propose in
this thesis is a set of techniques for modeling these data with a complete perspective
that takes both of these aspects of the data into account. We also develop methods
for determining the unknowns in these models. We show that once so determined,
these models provide useful insights into the structure underpinning the data and are
Figure 1.4: A depiction of the assumptions underlying topic models. Topic models
presuppose latent themes (left) and documents (right). Documents are a composition
of latent themes; this composition determines the words in the document that we
observe.
able to make predictions about unseen nodes, edges, and attributes.
In Chapter 2 we lay the groundwork for our technique by first describing a general
framework in which to define and speak about probabilistic models. We follow up
by describing a set of tools for using data to uncover the unknown aspects of these
models. Then we describe how this can be used to analyze and make predictions about
data. In Chapter 3 we dive into a specific model which has wide applicability not
only to networks but also to a variety of other data. The challenge with this model
has always been the computational complexity of uncovering likely values for the
latent parameters. We introduce a technique which is able to vastly improve on the
state-of-the-art in terms of computation speed, while sacrificing very little accuracy,
thus making these models much more applicable to the large networks in which we
are interested.
In Chapter 4, we introduce the Relational Topic Model, a model specifically
designed to analyze collections of documents with connections between them (or
alternatively graphs with both edges and attributes). It leverages the aforementioned
topic modeling infrastructure but extends it so that the model can offer a unified view
of both links and content. We show that the model can make statements about new
nodes, for example predicting the content of a document based solely on its citations
or predicting additional citations based on its content. Further, it can be used to find
hidden community structure, and we analyze these features of the model on several
data sets.
The work in Chapter 4 presupposes a network in which most links have already
been observed. However, it is often the case that we have only textual content and we
would like to build out this network. Chapter 5 explores the construction of networks
based purely on text. By looking at the content associated with each node, as well as
content appearing around pairs of nodes we are able to infer descriptions of individual
entities and of the relationship between those entities. With the inference machinery
we develop we can apply the model to large corpora such as Wikipedia and show that
the model can construct and annotate graphs and make useful predictions.
Chapter 2
Modeling, Inference and Prediction
Throughout this work our approach will be to
1. define a probabilistic model with certain unknown parameters for data of a
particular character;
2. perform inference, that is, find values of the unknown parameters of the model
that best explain observations;
3. make predictions using a model whose parameters have been determined.
In this chapter we describe the framework in which we execute these steps. A more
detailed treatment can be found in Wainwright and Jordan (2008).
2.1 Probabilistic Models
Our approach uses the language of directed graphical models to describe probabilistic
models. Directed graphical models have been described as a synthesis of graph theory
and probability. In this framework, distributions are represented as directed, acyclic
graphs. Nodes in this graph represent variables and arrows indicate, informally, a
possible dependence between variables. (The dependence between variables can be
formally described by d-separation, which is outside the scope of this text.)

Figure 2.1: The language of graphical models. (a) An unobserved variable named Z.
(b) An observed variable named X (indicated by shading). (c) A variable V possibly
dependent on a variable U (indicated by an arrow). (d) A variable Y replicated N
times (indicated by a box).
The constituents of directed graphical models are
1. unshaded nodes indicating unobserved variables whose names are enclosed in
the circle;
2. shaded nodes indicating observed variables;
3. arrows between nodes indicating a possible dependence between variables;
4. boxes indicating replication.
These are shown in Figure 2.1.
Associated with each node is a conditional probability distribution over the variable
represented by that node. That probability distribution is conditioned on the variables
represented by that node's parents. That is, letting $x_i$ represent the variable associated
with the $i$th node,
$$p_i(x_i \mid x_{\mathrm{parents}(i)}) \qquad (2.1)$$
describes the distribution of $x_i$. The full joint distribution of the entire graphical
model can thus be written as
$$p(x) = \prod_i p_i(x_i \mid x_{\mathrm{parents}(i)}). \qquad (2.2)$$
Note that it is straightforward to evaluate the probability of a state in this formalism;
one need only take the product of the evaluation of each $p_i$. This formalism also makes
it convenient to simulate draws from this distribution by drawing each constituent
variable in topological order. Because each of the variables on which $x_i$ is conditioned is a
parent, and all parent variables are guaranteed to have fixed values by dint of the
topological sort, $x_i$ can be simulated by doing a single draw from $p_i$.
This also means it is straightforward to describe each probability distribution as a
generative process, that is, a sequence of probabilistic steps by which the data were
hypothetically generated. The intermediate steps of the generative process create
unobserved variables while the final step generates the observed data, i.e., the leaves
of the graph. This construction will be of particular interest in the sequel.
2.2 Inference
With a probability distribution thus defined, our goal is to find values of unobserved
variables which explain observed variables. More formally, we are interested in finding
the posterior distribution of hidden variables ($z$) conditioned on observed variables ($x$),
$$p(z \mid x). \qquad (2.3)$$
For all but a few special cases, it is computationally prohibitive to compute this
exactly. To see why, let us recall the definition of marginalization,
$$p(z \mid x) = \frac{p(x, z)}{p(x)} = \frac{p(x, z)}{\sum_{z'} p(x, z')}.$$
As mentioned in the previous section, evaluating the joint distribution $p(x, z)$ is
straightforward. However, to compute the posterior probability we must evaluate the
joint probability across all possible values of $z'$. Since the number of possible values
of $z'$ increases exponentially with the number of variables comprising $z'$, this quickly
becomes prohibitive.
Thus we turn to approximate methods. There are many approaches to approx-
imating the posterior such as Markov Chain Monte Carlo (MCMC) (Neal 1993).
However, we will use variational approximations in this work because they do not
rely on stochasticity, they are amenable to various optimization approaches, and have
been empirically shown to achieve good approximations.
Variational methods approximate the true posterior, $p(z \mid x)$, with an approximate
posterior, $q(z)$. The approximation chosen is the distribution which is in some sense
closest to the true distribution. The definition of closeness used is Kullback-Leibler
(KL) divergence,
\begin{align*}
\mathrm{KL}(q(z) \,\|\, p(z \mid x)) &= \sum_z q(z) \log \frac{q(z)}{p(z \mid x)} \qquad (2.4) \\
&= -\sum_z q(z) \log \frac{p(z \mid x)}{q(z)} \\
&\geq -\log \sum_z q(z) \frac{p(z \mid x)}{q(z)} \\
&= -\log \sum_z p(z \mid x) \\
&= -\log 1 = 0, \qquad (2.5)
\end{align*}
where the inequality follows from Jensen's inequality. This choice of distance can be
intuitively justified in several ways. One is to rewrite the KL-divergence as
$$\mathrm{KL}(q(z) \,\|\, p(z \mid x)) = -\mathrm{E}_q[\log p(z \mid x)] - \mathrm{H}(q), \qquad (2.6)$$
where $\mathrm{H}(\cdot)$ denotes entropy. Thus KL-divergence promotes distributions $q$ which
look similar to $p$ while adding an entropy regularization. Another justification of
KL-divergence arises from its relationship to the likelihood of observed data,
\begin{align*}
\mathrm{KL}(q(z) \,\|\, p(z \mid x)) &= -\mathrm{E}_q[\log p(z \mid x)] - \mathrm{H}(q) \\
&= -\mathrm{E}_q\left[\log \frac{p(z, x)}{p(x)}\right] - \mathrm{H}(q) \\
&= -\mathrm{E}_q[\log p(z, x)] + \mathrm{E}_q[\log p(x)] - \mathrm{H}(q) \\
&= \log p(x) - \mathrm{E}_q[\log p(z, x)] - \mathrm{H}(q). \qquad (2.7)
\end{align*}
This representation first implies that the problem can be expressed as finding the
distance between the variational distribution and the joint distribution rather than
the posterior distribution. The second implication is that this distance can be used to
form an evidence lower bound (ELBO); as the distance decreases, this lower bound on
the likelihood of our data increases.
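The decomposition in Equation 2.7 can be checked numerically on a toy model. The sketch below (illustrative, not from the thesis) uses a single binary hidden variable and verifies that the ELBO plus the KL divergence equals the log evidence for an arbitrary choice of $q$.

```python
import numpy as np

# Toy check of Equation 2.7: log p(x) = (E_q[log p(z, x)] + H(q)) + KL(q || p(z | x)).
p_joint = np.array([0.1, 0.3])     # p(z = 0, x) and p(z = 1, x) for a fixed observation x
p_x = p_joint.sum()                # p(x)
p_post = p_joint / p_x             # p(z | x)

q = np.array([0.6, 0.4])           # an arbitrary variational distribution over z
elbo = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))   # E_q[log p(z, x)] + H(q)
kl = np.sum(q * np.log(q / p_post))                          # KL(q || p(z | x))

assert np.isclose(elbo + kl, np.log(p_x))
```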
Our objective function now is to find $q^*$ such that
$$q^*(z) = \operatorname*{argmin}_{q \in \mathcal{Q}} \mathrm{KL}(q(z) \,\|\, p(z \mid x)). \qquad (2.8)$$
Note that this is trivially minimized when $q^*(z) = p(z \mid x)$, the true posterior.
Therefore, the optimization problem as formulated is equivalent to posterior inference.
But since this is intractable, a tractable approximation is made by restricting the
search space $\mathcal{Q}$. A common choice is the family of factorized distributions,
$$q(z) = \prod_i q_i(z_i). \qquad (2.9)$$
This choice of $\mathcal{Q}$ is often termed a naïve variational approximation. This expression
is convenient since
\begin{align*}
\mathrm{H}(q) &= -\mathrm{E}_q\left[\log \prod_i q_i(z_i)\right] \\
&= -\mathrm{E}_q\left[\sum_i \log q_i(z_i)\right] \\
&= -\sum_i \mathrm{E}_q[\log q_i(z_i)] \\
&= \sum_i \mathrm{H}(q_i). \qquad (2.10)
\end{align*}
Further, recall from the discussion above that in a generative process all of the
observations ($x$) appear as leaves of the graph. Therefore the expected log joint
probability can be expressed as
\begin{align*}
\mathrm{E}_q[\log p(z, x)] &= \mathrm{E}_q\left[\log \prod_i p_i(z_i \mid z_{\mathrm{parents}(i)}) \prod_{i'} p_{i'}(x_{i'} \mid z_{\mathrm{parents}(i')})\right] \\
&= \sum_i \mathrm{E}_q\left[\log p_i(z_i \mid z_{\mathrm{parents}(i)})\right] + \sum_{i'} \mathrm{E}_q\left[\log p_{i'}(x_{i'} \mid z_{\mathrm{parents}(i')})\right].
\end{align*}
Note that because of marginalization the expectation of the term involving $p_i$ depends only on
$\{q_j(z_j) : j \in \mathrm{parents}(i)\}$ if $i$ is a leaf node, and on $\{q_j(z_j) : j \in \mathrm{parents}(i) \cup \{i\}\}$ otherwise.
Optimizing this with respect to a common choice for $p_i$ warrants further elucidation
below.
2.2.1 Exponential family distributions
Exponential family distributions are a class of distributions which take a particular
form. This form encompasses many common distributions and is convenient to optimize
with respect to the objective described in the previous section. Exponential family
distributions take the following form:
$$p(x \mid \eta) = \frac{\exp(\eta^T \phi(x))}{Z(\eta)}. \qquad (2.11)$$
The normalization constant $Z(\eta)$ is chosen so that the distribution sums to one.
The vector $\eta$ is termed the natural parameters while $\phi(x)$ are the sufficient statistics.
Figure 2.2 helps illustrate how common distributions such as the Gaussian and
the Beta can be expressed in this representation.
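As a concrete instance of this representation, the sketch below (not from the thesis; the names follow the notation introduced above) writes the Gaussian in its natural parameterization and checks the resulting density against the standard form.

```python
import numpy as np
from scipy.stats import norm

# The Gaussian as an exponential family (Equation 2.11) with phi(x) = (x^2, x).

def gaussian_natural_params(mu, sigma2):
    # eta_1 = -1 / (2 sigma^2), eta_2 = mu / sigma^2
    return np.array([-0.5 / sigma2, mu / sigma2])

def log_partition(eta):
    eta1, eta2 = eta
    return 0.5 * np.log(-np.pi / eta1) - eta2 ** 2 / (4.0 * eta1)

def log_density(x, eta):
    return eta @ np.array([x ** 2, x]) - log_partition(eta)

mu, sigma2, x = 1.5, 0.7, 0.3
eta = gaussian_natural_params(mu, sigma2)
assert np.isclose(log_density(x, eta), norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))
```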
The structure of the exponential family representation allows for these distributions
to be easily manipulated in the variational optimization above.
Figure 2.2: Two exponential family distributions; the title of each panel shows the value
of the natural parameters for the depicted distribution. (a) The Gaussian distribution
has sufficient statistics $\phi(x) = \langle x^2, x \rangle$; the natural parameters are related to the
common parameterization by $\eta = \langle -\tfrac{1}{2\sigma^2}, \tfrac{\mu}{\sigma^2} \rangle$, and the normalization constant is
$Z = \sqrt{-\pi/\eta_1}\, \exp(-\eta_2^2/(4\eta_1))$. (b) The Beta distribution has sufficient statistics
$\phi(x) = \langle \log(x), \log(1-x) \rangle$; the natural parameters are related to the common
parameterization by $\eta = \langle \alpha - 1, \beta - 1 \rangle$, and the normalization constant is
$Z = \frac{\Gamma(\eta_1+1)\Gamma(\eta_2+1)}{\Gamma(\eta_1+\eta_2+2)}$.
Figure 2.3: A directed graphical model representation of a Gaussian mixture model.
In particular,
\begin{align*}
\mathrm{E}_q[\log p(x, z)] &= \mathrm{E}_q\left[\log \frac{\exp(\eta^T \phi(x, z))}{Z(\eta)}\right] \\
&= \mathrm{E}_q\left[\eta^T \phi(x, z)\right] - \mathrm{E}_q[\log Z(\eta)] \\
&= \mathrm{E}_q[\eta]^T\, \mathrm{E}_q[\phi(x, z)] - \mathrm{E}_q[\log Z(\eta)], \qquad (2.12)
\end{align*}
where the last line follows by independence under a fully-factorized variational
distribution. (Note that $q$ is a distribution over both sets of latent variables in the model,
$z$ and $\eta$.)
2.3 Example
To illustrate the procedure described in the previous sections, we perform it on a
simple Gaussian mixture model. Figure 2.3 shows a directed graphical model for this
example. We describe the generative process as
1. For $i \in \{0, 1\}$,
(a) Draw $\mu_i \sim \mathrm{Uniform}(-\infty, \infty)$. (We set aside here the issue of drawing from an improper probability distribution.)
2. For $n \in [N]$,
(a) Draw mixture indicator $z_n \sim \mathrm{Bernoulli}(0.5)$;
(b) Draw observation $x_n \sim \mathcal{N}(\mu_{z_n}, 1)$.
Our goal now is to approximate the posterior distribution of the hidden variables,
$p(z, \mu \mid x)$, conditioned on observations $x$. To do so we use the factorized distribution,
$$q(\mu, z) = r(\mu_0 \mid m_0)\, r(\mu_1 \mid m_1) \prod_n q_n(z_n \mid \nu_n), \qquad (2.13)$$
where $q_n(z_n \mid \nu_n)$ is a binomial distribution with parameter $\nu_n$, and $r(\mu_i \mid m_i)$ is a
Gaussian distribution with mean $m_i$ and unit variance. With the variational family
thus parameterized, the optimization problem becomes
$$\operatorname*{argmin}_{\nu, m}\; -\mathrm{E}_q[\log p(x, \mu, z)] - \mathrm{H}(q). \qquad (2.14)$$
To do so we first appeal to Equation 2.12 for the expected log probability of an
exponential family with our choice of parameters,
\begin{align*}
\mathrm{E}_q[\log p(x_n \mid \mu_i)] &= -\tfrac{1}{2} x_n^2 + \mathrm{E}_q[\mu_i]\, x_n - \tfrac{1}{2} \mathrm{E}_q\!\left[\mu_i^2\right] - \tfrac{1}{2} \log 2\pi \\
&= -\tfrac{1}{2} x_n^2 + m_i x_n - \tfrac{1}{2}(1 + m_i^2) - \tfrac{1}{2} \log 2\pi.
\end{align*}
Since we have chosen uniform distributions for $z$ and $\mu$, we can express the
expected log probability of the joint as
\begin{align*}
\mathrm{E}_q[\log p(x, \mu, z)] &= \mathrm{E}_q\left[\log \prod_n p(x_n \mid \mu_0)^{1 - z_n}\, p(x_n \mid \mu_1)^{z_n}\right] \\
&= \sum_n \mathrm{E}_q[(1 - z_n) \log p(x_n \mid \mu_0)] + \mathrm{E}_q[z_n \log p(x_n \mid \mu_1)] \\
&= \sum_n (1 - \nu_n)\, \mathrm{E}_q[\log p(x_n \mid \mu_0)] + \nu_n\, \mathrm{E}_q[\log p(x_n \mid \mu_1)] \\
&= \sum_n (1 - \nu_n)\left(m_0 x_n - \tfrac{1}{2} m_0^2\right) + \nu_n\left(m_1 x_n - \tfrac{1}{2} m_1^2\right) + C,
\end{align*}
where $C$ contains terms which do not depend on either $\nu_n$ or $m_i$. We also compute
the entropy terms,
\begin{align*}
\mathrm{H}(q_n(z_n \mid \nu_n)) &= -(1 - \nu_n) \log(1 - \nu_n) - \nu_n \log \nu_n \\
\mathrm{H}(r_i(\mu_i \mid m_i)) &= \tfrac{1}{2} \log(2\pi e).
\end{align*}
To optimize these expressions we take the derivative of the objective (writing $\mathcal{L}$ for
the expected log joint probability plus the entropy) with respect to each variable,
\begin{align*}
\partial \mathcal{L} / \partial \nu_n &= \tfrac{1}{2}(m_1 - m_0)(2 x_n - m_1 - m_0) + \log \frac{1 - \nu_n}{\nu_n} \\
\partial \mathcal{L} / \partial m_0 &= \sum_n (1 - \nu_n)(x_n - m_0) \\
\partial \mathcal{L} / \partial m_1 &= \sum_n \nu_n (x_n - m_1).
\end{align*}
Figure 2.4: 100 points drawn from the mixture model depicted in Figure 2.3 with
$\mu_0 = -3$ and $\mu_1 = 3$. The horizontal axis denotes observed values while the vertical
axis and coloring denote the latent mixture indicator values.
Setting these equal to zero yields the following optimality conditions,
\begin{align*}
\nu_n &= \sigma\!\left(\tfrac{1}{2}(m_1 - m_0)(2 x_n - m_1 - m_0)\right) \\
m_0 &= \frac{\sum_n (1 - \nu_n)\, x_n}{\sum_n (1 - \nu_n)} \\
m_1 &= \frac{\sum_n \nu_n\, x_n}{\sum_n \nu_n},
\end{align*}
where $\sigma(x)$ denotes the sigmoid function $\frac{1}{1 + \exp(-x)}$. This is a system of
transcendental equations which cannot be solved analytically. However, we may apply coordinate
ascent; we initialize each variable to some guess and repeatedly cycle through the variables,
optimizing them one at a time while holding the others fixed.
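A minimal implementation of this coordinate ascent is sketched below (illustrative, not the thesis code). Each iteration applies the optimality conditions above in turn, with `nu[n]` playing the role of the variational parameter for $z_n$ and `m` holding the two variational means.

```python
import numpy as np
from scipy.special import expit   # numerically stable sigmoid

# Coordinate-ascent updates for the two-component, unit-variance Gaussian mixture.

def cavi_mixture(x, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    m = rng.normal(size=2)                       # initial guesses for the variational means
    nu = np.full(len(x), 0.5)
    for _ in range(iters):
        nu = expit(0.5 * (m[1] - m[0]) * (2 * x - m[1] - m[0]))
        m = np.array([np.sum((1 - nu) * x) / np.sum(1 - nu),
                      np.sum(nu * x) / np.sum(nu)])
    return m, nu

# Simulate data as in Figure 2.4: true means -3 and 3, N = 100.
rng = np.random.default_rng(1)
z = rng.binomial(1, 0.5, size=100)
x = rng.normal(np.where(z == 1, 3.0, -3.0), 1.0)
m, nu = cavi_mixture(x)
print(m)   # typically close to (-3, 3), up to a label swap
```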
Figure 2.5: Estimated values of $m_0$ and $m_1$ as a function of iteration using coordinate
ascent. The variational method is able to quickly recover the true values of these
parameters (shown as dashed lines).
Figure 2.6: The mixture model of Figure 2.3 augmented with an additional unobserved
datum to be predicted.
Figure 2.4 shows the result of simulating 100 draws from the distribution to be
estimated. The distribution has $\mu_0 = -3$ and $\mu_1 = 3$. The horizontal axis denotes observed
values while the vertical axis and coloring denote the latent mixture indicator values.
Figure 2.5 shows the result of applying the variational method with coordinate ascent
estimation. The series show the estimated values of $m_i$ as a function of iteration. The
approach is able to quickly find the parameters of the true generating distributions
(dashed lines).
2.4 Prediction
With an approximate posterior in hand, our goal is often to make predictions about
data we have not yet seen. That is, given some observed data $x_{1:N}$ we wish to evaluate
the probability of an additional datum $x_{N+1}$,
$$p(x_{N+1} \mid x_{1:N}). \qquad (2.15)$$
This desideratum is illustrated in Figure 2.6 for the case of the Gaussian mixture
of the previous section. On the right hand side another unobserved instance of a
draw from the mixture model has been added as the datum to be predicted. One way
of approaching the problem is to expand the predictive distribution as
\begin{align*}
p(x_{N+1} \mid x_{1:N}) &= \sum_{z_{N+1}} \sum_{z_{1:N}} p(x_{N+1}, z_{N+1} \mid z_{1:N})\, p(z_{1:N} \mid x_{1:N}) \\
&= \sum_{z_{N+1}} \mathrm{E}_p[p(x_{N+1}, z_{N+1} \mid z_{1:N})] \\
&\approx \sum_{z_{N+1}} \mathrm{E}_q[p(x_{N+1}, z_{N+1} \mid z_{1:N})], \qquad (2.16)
\end{align*}
where the expectation on the second line is taken with respect to the true posterior
of the observed data, $p(z_{1:N} \mid x_{1:N})$, and the expectation on the third line is taken with
respect to the variational approximation to the posterior, $q(z_{1:N})$.
In the case of the Gaussian mixture, this expression is
\begin{align*}
p(x_{N+1} \mid x_{1:N}) &\approx \tfrac{1}{2}\, \mathrm{E}_q[p(x_{N+1} \mid \mu_1)] + \tfrac{1}{2}\, \mathrm{E}_q[p(x_{N+1} \mid \mu_0)] \\
&\approx \tfrac{1}{2}\, p(x_{N+1} \mid m_1) + \tfrac{1}{2}\, p(x_{N+1} \mid m_0). \qquad (2.17)
\end{align*}
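The plug-in predictive density of Equation 2.17 is simple to evaluate once the variational means are available; the sketch below (illustrative only) does so with unit-variance Gaussian components.

```python
import numpy as np
from scipy.stats import norm

# Approximate predictive density: an equal-weight mixture of unit-variance
# Gaussians centered at the fitted variational means m_0 and m_1.

def predictive_density(x_new, m):
    return 0.5 * norm.pdf(x_new, loc=m[0], scale=1.0) + \
           0.5 * norm.pdf(x_new, loc=m[1], scale=1.0)

print(predictive_density(2.5, m=np.array([-3.0, 3.0])))
```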
The efficacy of this approach is demonstrated in Figure 2.7, wherein we empirically
estimate the expected value of $p(x_{N+1} \mid x_{1:N})$ by drawing an additional $M$ values and
taking their average. The dashed line shows the expectation estimated using the
variational approximation.
We have now described a framework for defining probabilistic models, inferring
the values of their unknowns using data, and taking the model and inferred values
to provide predictions about unseen data. In the following chapters we leverage this
framework to model, understand, and make predictions about networked data.
Figure 2.7: Estimated expected value of $p(x_{N+1} \mid x_{1:N})$ taken by averaging $M$ random
draws from this function. The dashed line shows the value of this expectation estimated
by the variational approximation.
Chapter 3
Exponential Family Models of Links
The first model of networks we explore is the binary Markov random field. These
models are widely used to model correlations between binary random variables. While
generally useful for a wide variety of applications, in this chapter we focus on applying
these models to collections of documents which contain words and/or links. In a binary
Markov random field, each document is treated as a collection of binary variables;
these binary variables may correspond to the presence of words in a document or the
presence of a citation to another document. Modeling the correlations between these
variables allows us to predict new words or new connections for documents.
However, their application to large-scale data sets has been hindered by their
intractability; both parameter estimation and inference are prohibitively expensive
on many large real-world data sets. In this chapter we present a new method to
perform both of these tasks. Leveraging a novel variational approximation to compute
approximate gradients, our technique is accurate yet computationally simple. We
evaluate our technique on both synthetic and real-world data and demonstrate that
we are able to learn models comparable to the state-of-the-art in a fraction of the
time.
3.1 Background
Large-scale models of co-occurrence are increasingly in demand. They can be used
to model the words in documents, connections between members of social networks,
or the structure of the human brain; these models can then lead to new insights into
brain function, suggest new friendships, or discover latent patterns of language usage.
The Ising model (Ising 1925) is a model of co-occurrence for binary vectors which
has been successfully applied to a variety of domains such as signal processing (Besag
1986), natural language processing (Takamura et al. 2005), genetics (Majewski et al.
2001), biological sensing (Shi and Duke 1998; Besag 1975), and computer vision (Blake
et al. 2004). Practitioners of the Ising model are limited, however, in the size of the
data sets to which the model can be applied. Many modern data sets and applications
require models with millions of parameters. Unfortunately, estimating the model's
probabilities and optimizing its parameters are both #P-complete problems (Welsh
1990).
In response to its intractability, there has been a rich body of work on approximate
inference and optimization for the Ising model. The most common approaches have
been sampling-based (Geman and Geman 1984) of which contrastive divergence is
the most recent incarnation (Carreira-Perpinan and Hinton 2005; Welling and Hinton
2002). Other approaches include max-margin (Taskar et al. 2004a) and exponentiated
gradient (Globerson et al. 2007), expectation propagation (Minka and Qi 2003),
various relaxations (Fisher 1966; Globerson and Jaakkola 2007; Wainwright and
Jordan 2006; Kolar and Xing 2008; Sontag and Jaakkola 2007), as well as loopy belief
propagation (Pearl 1988; Murphy et al. 1999; Yedidia et al. 2003; Szeliski et al. 2008)
and its extensions (Wainwright et al. 2003; Welling and Teh 2001; Kolmogorov 2006).
In this chapter we present a new approach which is substantially faster and has
accuracy comparable to state-of-the-art methods. Our approach employs iterative
scaling (Dudík et al. 2007) and a new technique for approximating the gradients of the
log partition function of the Ising model. This approximation technique is inspired
by variational mean field methods (Jordan et al. 1999; Wainwright and Jordan 2003).
While these methods have been applied to a variety of models (Jaakkola and Jordan
1999; Saul and Jordan 1999; Bishop et al. 2002) including the Ising model, we will
show that our technique produces more accurate estimates of marginals and that this
in turn produces models with higher predictive accuracy. Further, our approximation
has a simple mathematical form which can be computed much more quickly. This
allows us to apply the Ising model to large models with millions of parameters.
Because of the large parameter space, our model also employs $\ell_1 + \ell_2^2$ feature
selection penalties to achieve sparse parameter estimates. This penalty is used in
linear models under the name elastic nets (Zou and Hastie 2005). Feature selection
penalties have an extensive history (Lafferty and Wasserman 2008; Malouf 2002). The
$\ell_1$ penalty, in particular, has been a popular approach to obtaining sparse parameter
vectors (Friedman et al. 2007; Meinshausen and Bühlmann 2006; Wainwright et al.
2006). However, the theory of regularized maximum likelihood estimation also indicates
that it is often beneficial to use $\ell_2^2$ regularization (Dudík et al. 2007). Regularizations
of this form have been extensively applied (Chen and Rosenfeld 2000; Goodman 2004;
Riezler and Vasserman 2004; Haffner et al. 2006; Andrew and Gao 2007; Kazama and
Tsujii 2003; Gao et al. 2006).
This chapter is organized as follows. In Section 3.2, we describe the Ising model
and our procedure for approximating the marginals of the model and fitting its
parameters by approximate maximum a posteriori point estimation. In Section 3.3,
we compare the accuracy/speed trade-o of our model with several others on synthetic
and large real-world corpora. We show that our method provides parameter estimates
comparable with those of state-of-the-art techniques, but in much less time. This
enables the application of the Ising model to new data sets and application areas
which were previously out of reach. We summarize these findings in Section 3.4.
3.2 Pairwise Ising model
We study the exponential family known as the pairwise Ising model or binary Markov
random field which has long been used in physics to model ensembles of particles with
pairwise interactions. Our motivation is to characterize the co-occurrence of items
within unordered bags such as the co-occurrence of citations or keywords in research
papers. Such bags are represented by a binary vector $x \in \{0, 1\}^n$ with components
$x_i$ indicating the presence of each item. The pairwise Ising model is parameterized by
$\theta \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}^{n(n-1)}$, controlling frequencies of individual items and frequencies of
their co-occurrence, as
$$p_{\theta,\lambda}(x) = \frac{1}{Z_{\theta,\lambda}} \exp\left( \sum_{i=1}^n \theta_i x_i + \frac{1}{2} \sum_{i=1}^n \sum_{j \neq i} \lambda_{ij}\, x_i x_j \right).$$
We assume throughout that $\lambda_{ij} = \lambda_{ji}$. Here, $Z_{\theta,\lambda}$ denotes the normalization constant
ensuring that probabilities sum to one. For general settings of $\theta$ and $\lambda$, the exact
calculation of the normalization constant $Z_{\theta,\lambda}$ requires summation over $2^n$ possible
values of $x$, which becomes intractable for even moderate sizes of $n$. Since the normalization
constant $Z_{\theta,\lambda}$ is required to calculate expectations and evaluate likelihoods,
basic tasks such as inference of marginals and parameter estimation cannot be carried
out exactly and require approximation. We propose a novel technique to approximate
marginals of the Ising model and a new procedure to learn its parameters. Since
learning of parameters relies on inference of marginals as a subroutine, we first present
the marginal approximation.
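To make the source of the intractability concrete, the sketch below (illustrative, using the notation of this section) computes the unnormalized log probability and the exact normalization constant by enumerating all $2^n$ configurations; this brute-force computation is only feasible for very small $n$.

```python
import itertools
import numpy as np

# Brute-force evaluation of the pairwise Ising model for small n.
# theta: (n,) singleton parameters; lam: symmetric (n, n) pairwise matrix, zero diagonal.

def log_unnormalized(x, theta, lam):
    return theta @ x + 0.5 * x @ lam @ x

def log_partition(theta, lam):
    n = len(theta)
    scores = [log_unnormalized(np.array(x), theta, lam)
              for x in itertools.product([0, 1], repeat=n)]
    return np.log(np.sum(np.exp(scores)))    # 2^n terms; fine for tiny n only

rng = np.random.default_rng(0)
n = 10
theta = rng.normal(size=n)
lam = 0.1 * rng.normal(size=(n, n))
lam = (lam + lam.T) / 2
np.fill_diagonal(lam, 0.0)
print(log_partition(theta, lam))
```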
3.2.1 Approximate inference of marginals
Our approach begins with the naïve mean field approximation (Wainwright and Jordan
2005b; Jordan et al. 1999). While naïve mean field approximations may provide good
estimates of singleton marginals $p_{\theta,\lambda}(x_i)$, they often provide poor estimates of pairwise
marginals $p_{\theta,\lambda}(x_i, x_j)$. Our technique corrects these estimates using an augmented
variational family. By combining the richness of the augmented variational family with
the computational simplicity of naïve mean field, our technique yields accurate
estimates that can be computed efficiently.
In the sequel we first present naïve mean field and then our improved approximation.
Naïve mean field

Naïve mean field approximates the Ising model $p_{\theta,\lambda}$ by a distribution $q^{\mathrm{MF}}$ with a
factored representation across components $x_i$,
$$q^{\mathrm{MF}}(x) = \prod_i q_i^{\mathrm{MF}}(x_i).$$
Among all distributions of the form above, naïve mean field algorithms seek the
distribution $q^{\mathrm{MF}}$ which minimizes the KL divergence from the true distribution $p_{\theta,\lambda}$,
$$q^{\mathrm{MF}} = \operatorname*{argmin}_{q^{\mathrm{MF}}} D(q^{\mathrm{MF}} \,\|\, p_{\theta,\lambda}). \qquad (3.1)$$
Here $D(q \,\|\, p) = \mathrm{E}_q[\ln(q/p)]$ denotes the KL divergence, which measures the
information-theoretic discrepancy between densities $q$ and $p$. Since Equation (3.1) is not convex,
it is usually solved by alternating minimization in each coordinate, a procedure
which only yields a local minimum. In each individual coordinate, the objective of
Equation (3.1) can be minimized exactly by setting the derivatives to zero, yielding
the update
$$q_i^{\mathrm{MF}}(x_i) \propto \exp\left( \theta_i x_i + \sum_{j \neq i} \lambda_{ij}\, x_i\, q_j^{\mathrm{MF}}(x_j = 1) \right). \qquad (3.2)$$
For the derivation see, for example, Wainwright and Jordan (2005b).
A chief advantage of naïve mean field is its simplicity and the speed of convergence.
However, compared with other approximation techniques such as loopy belief
propagation, the naïve mean field solution $q^{\mathrm{MF}}$ may yield poor approximations to the
pairwise marginals $p_{\theta,\lambda}(x_i, x_j)$ (in Section 3.3 we demonstrate this empirically). Since
pairwise marginals are needed for parameter estimation, this is a major drawback.
Our approach
Our approach, Fast Learning of Ising Models (FLIM), takes advantage of the rapid
convergence of naïve mean field while correcting its estimates of pairwise marginals.
When estimating the marginal $p_{\theta,\lambda}(x_i, x_j)$ for a fixed pair $i, j$, we propose replacing
the product density $q^{\mathrm{MF}}$ in Equation (3.1) by a richer family
$$q^{(ij)}(x) = q_{ij}^{(ij)}(x_i, x_j) \prod_{k \neq i, j} q_k^{(ij)}(x_k).$$
This is similar to the approach known as structured mean field (Saul and Jordan
1996). However, we take advantage of the approximate singleton marginals $q_k^{\mathrm{MF}}(x_k)$
provided by naïve mean field which, unlike the pairwise marginals, provide sufficiently
good approximations of the true singleton marginals $p_{\theta,\lambda}(x_k)$. We minimize the KL
divergence from $p_{\theta,\lambda}$ under the constraint that $q_k^{(ij)}(x_k)$ equal $q_k^{\mathrm{MF}}(x_k)$:
\begin{align*}
q^{(ij)} = \operatorname*{argmin}_{q^{(ij)}}\;\; & D(q^{(ij)} \,\|\, p_{\theta,\lambda}) \\
\text{s.t.}\;\; & q_k^{(ij)}(x_k) = q_k^{\mathrm{MF}}(x_k) \text{ for all } k \neq i, j. \qquad (3.3)
\end{align*}
Note that the only undetermined portion of $q^{(ij)}$ is $q_{ij}^{(ij)}$. This can be solved explicitly
by setting derivatives equal to zero, yielding
$$q_{ij}^{(ij)}(x_i, x_j) \propto \exp\left( \theta_i x_i + \theta_j x_j + \lambda_{ij}\, x_i x_j + \sum_{k \neq i, j} (\lambda_{ik}\, x_i + \lambda_{jk}\, x_j)\, q_k^{\mathrm{MF}}(x_k = 1) \right). \qquad (3.4)$$
Given the naïve mean field solution $q^{\mathrm{MF}}$, it is possible to calculate all corrected pairwise
marginals $q_{ij}^{(ij)}$ in time $O(n^2)$ by using the auxiliary values
$$\mathrm{rowsum}_i = \sum_{k \neq i} \lambda_{ik}\, q_k^{\mathrm{MF}}(x_k = 1).$$
Thus, each $q_{ij}^{(ij)}$ is calculated in constant amortized time.
Note that if we have access to estimates of the marginals $p_{\theta,\lambda}(x_k)$ other than those
given by naïve mean field, we can use them instead of $q_k^{\mathrm{MF}}$ in Equations (3.3) and
(3.4).
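The two pieces of the approximation fit together in only a few lines. The sketch below (illustrative, using the notation of this section and assuming a zero-diagonal pairwise matrix) runs the naïve mean field updates of Equation (3.2) and then applies the pairwise correction of Equation (3.4) for a chosen pair.

```python
import numpy as np
from scipy.special import expit

# Naive mean field followed by the FLIM pairwise correction.
# theta: (n,) singleton parameters; lam: symmetric (n, n) matrix with zero diagonal;
# mf[i] approximates p(x_i = 1).

def naive_mean_field(theta, lam, iters=50):
    mf = np.full(len(theta), 0.5)
    for _ in range(iters):
        for i in range(len(theta)):
            # Equation 3.2; with a zero diagonal, lam[i] @ mf sums over j != i.
            mf[i] = expit(theta[i] + lam[i] @ mf)
    return mf

def flim_pairwise(i, j, theta, lam, mf):
    # Equation 3.4: corrected joint over (x_i, x_j); rowsum gives the O(1) amortized cost.
    rowsum = lam @ mf
    a_i = theta[i] + rowsum[i] - lam[i, j] * mf[j]
    a_j = theta[j] + rowsum[j] - lam[j, i] * mf[i]
    logits = np.array([[0.0, a_j],
                       [a_i, a_i + a_j + lam[i, j]]])
    q = np.exp(logits)
    return q / q.sum()     # q[xi, xj] approximates p(x_i = xi, x_j = xj)
```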
3.2.2 Parameter estimation
The main task we study is the problem of estimating the parameters $\theta$ and $\lambda$ from data.
As we will see, this necessitates the calculation of pairwise marginals, which we derived in
the previous section.
The data consist of a set of observations $x^1, x^2, \ldots, x^D$ generated by an Ising
model $p(x \mid \theta, \lambda) = p_{\theta,\lambda}(x)$. We posit a prior $p(\theta, \lambda)$, and estimate $\theta$ and $\lambda$ as
maximizers of the posterior
$$p(\theta, \lambda \mid \{x^d\}) \propto p(\theta, \lambda) \prod_{d=1}^D p(x^d \mid \theta, \lambda). \qquad (3.5)$$
We consider the factored prior
$$p(\theta, \lambda) = \left( \prod_i p(\theta_i) \right)\left( \prod_{i,j} p(\lambda_{ij}) \right),$$
with
$$p(\theta_i) \propto \exp(\alpha\, \theta_i), \qquad p(\lambda_{ij}) \propto \exp\left( -\beta_1 |\lambda_{ij}| - \beta_2\, \lambda_{ij}^2 \right), \qquad (3.6)$$
where $\alpha$, $\beta_1$, and $\beta_2$ are hyperparameters. The prior over $\theta_i$ corresponds to Laplace
smoothing of empirical counts (however, note that it is improper). The prior over $\lambda_{ij}$
corresponds to regularization with an $\ell_1$-norm term and an $\ell_2^2$-norm term, used in linear
models under the name elastic nets (Zou and Hastie 2005). This prior encourages
parameter vectors which exhibit both sparsity and grouping.
Combining Equation (3.5) and Equation (3.6), we obtain the following expression
for the log posterior:
\begin{align*}
\ln p(\theta, \lambda \mid \{x^d\}) = \; & \alpha \sum_{i=1}^{n} \theta_i - \beta_1 \|\lambda\|_1 - \beta_2 \|\lambda\|_2^2 \\
& + \sum_{d=1}^{D} \left( \sum_{i=1}^{n} \theta_i x_i^d + \frac{1}{2} \sum_{i=1}^{n} \sum_{j \neq i} \lambda_{ij}\, x_i^d x_j^d - \ln Z_{\theta,\lambda} \right) + \text{const.} \qquad (3.7)
\end{align*}
We optimize Equation (3.7) by a version of the algorithm PLUMMET (Dudík et al.
2007). This algorithm in each iteration updates $\theta$ and $\lambda$ to new values $\theta'$ and $\lambda'$ that
optimize a lower bound on Equation (3.7). More precisely, $\lambda'_{ij} = \lambda_{ij} + \delta_{ij}$, where
$$\delta_{ij} = \operatorname*{argmax}_{\delta} \left\{ \hat{\mu}\, \delta - \tilde{\mu}\left(e^{\delta} - 1\right) - \beta_1 \left|\lambda_{ij} + \delta\right| - \beta_2 (\lambda_{ij} + \delta)^2 \right\}, \qquad (3.8)$$
where $\hat{\mu}$ denotes the empirical co-occurrence count
$$\hat{\mu} = \sum_d x_i^d x_j^d,$$
while $\tilde{\mu}$ is the model's estimate of this count, $\tilde{\mu} = D\, \mathrm{E}_{\theta,\lambda}[x_i x_j]$. We approximate the expectation
$\mathrm{E}_{\theta,\lambda}[x_i x_j]$ using the technique of the previous section.
The objective of Equation (3.8) is concave in $\delta$ and therefore we can find its
maximizer by setting its derivative to zero,
$$\tilde{\mu}\, e^{\delta} - \hat{\mu} + \beta_1\, \mathrm{sign}(\lambda_{ij} + \delta) + 2 \beta_2 (\lambda_{ij} + \delta) = 0. \qquad (3.9)$$
This can be solved explicitly using the Lambert W function, denoted $W(z)$, which for
a given $z \geq -e^{-1}$ represents the unique value $W(z) \geq -1$ such that $W(z)\, e^{W(z)} = z$.
Using this definition it is straightforward to prove the following lemma, which can then
be used to solve Equation (3.9).
Lemma 3.2.1. For $b > 0$, the identity $x = a - b e^{x}$ holds if and only if $x = a - W(b e^{a})$.
Rearranging Equation (3.9) to match the lemma, we now just need to carry out
the case analysis according to the sign of $\lambda_{ij} + \delta$ and consider the possibilities
\begin{align*}
\delta_{+} &= \frac{\hat{\mu} - \beta_1}{2\beta_2} - \lambda_{ij} - W\!\left( \frac{\tilde{\mu}}{2\beta_2} \exp\!\left( \frac{\hat{\mu} - \beta_1}{2\beta_2} - \lambda_{ij} \right) \right) \\
\delta_{-} &= \frac{\hat{\mu} + \beta_1}{2\beta_2} - \lambda_{ij} - W\!\left( \frac{\tilde{\mu}}{2\beta_2} \exp\!\left( \frac{\hat{\mu} + \beta_1}{2\beta_2} - \lambda_{ij} \right) \right) \\
\delta_{0} &= -\lambda_{ij}.
\end{align*}
We choose $\delta_{+}$ if $\lambda_{ij} + \delta_{+} > 0$, $\delta_{-}$ if $\lambda_{ij} + \delta_{-} < 0$, and $\delta_{0}$ otherwise.
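Under the reconstruction above, the per-coordinate update has the closed form sketched below (illustrative only; `mu_hat`, `mu_tilde`, `beta1`, and `beta2` denote the empirical count, the model's estimated count, and the elastic-net hyperparameters).

```python
import numpy as np
from scipy.special import lambertw

def delta_update(lam_ij, mu_hat, mu_tilde, beta1, beta2):
    # Solve the stationarity condition via the Lambert W function, then pick the
    # branch whose sign assumption is self-consistent; otherwise shrink to zero.
    def branch(sign):
        a = (mu_hat + sign * beta1) / (2.0 * beta2) - lam_ij
        return a - np.real(lambertw(mu_tilde / (2.0 * beta2) * np.exp(a)))

    delta_plus, delta_minus = branch(-1.0), branch(+1.0)
    if lam_ij + delta_plus > 0:
        return delta_plus
    if lam_ij + delta_minus < 0:
        return delta_minus
    return -lam_ij
```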
3.3 Evaluation
In this section we first apply our technique for performing marginal inference to a
synthetic test case. We compare our technique with several competing techniques
on both accuracy and speed. We then evaluate our entire parameter estimation
procedure on two large-scale, real-world data sets and show that models trained using
our procedure perform comparably with the state-of-the-art at making predictions
about unseen data.
Throughout this section we will compare the following five approaches:
Baseline No training is done for the parameters which govern pairwise correlations
$\lambda$, i.e., $\lambda$ is set to 0.
NMF This method uses a naïve mean field to approximate pairwise expectations. As
described in Section 3.2, this method approximates the true model with one in
which all variables are decoupled. Because the implied Markov random field has
no edges, it cannot capture pairwise behavior.
BP Loopy belief propagation (Yedidia et al. 2003) is a message passing algorithm
that optimizes an approximation to the log partition function based on Bethe
energies. Because it must compute $O(n^2)$ messages each iteration, it can be
comparatively slow.
FLIM-NMF FLIM-NMF (Fast Learning of Ising Models) is our proposal for esti-
mating pairwise and singleton marginals described in Section 3.2. The estimates
are the solutions to a variational approximation where singleton marginals are
constrained to be equal to the marginals adduced by naïve mean field.
FLIM-Z FLIM-Z is similar to FLIM-NMF except that the singleton marginals are
constrained to be equal to the marginals obtained when the pairwise correlations $\lambda = 0$,
i.e., $\sigma(\theta)$. This is an effective approximation to FLIM-NMF when $\lambda$ is close to
zero. FLIM-Z is faster than FLIM-NMF since it does not require first solving
the naïve mean field variational problem.
3.3.1 Estimating marginal probabilities
To evaluate how well each of the approaches approximates the singleton marginals
$p(x_i)$ and pairwise marginals $p(x_i, x_j)$, we generated a model with 24 nodes. Because
the number of nodes in this model is small, it is possible to compute the singleton
marginals and the pairwise marginals exactly through enumeration. By comparing
these true marginals with those estimated by each of the approximation techniques,
we can evaluate their accuracy/speed trade-off.
The following procedure was used to generate the parameters of the model. The
parameters which control the frequencies of components, $\theta$, form a vector of length 24
generated from a Beta distribution, $\sigma(\theta_i) \sim \mathrm{Beta}(1, 100)$. The parameters which
control the correlations of components, $\lambda$, form a vector of length 276. 10% of the elements of
$\lambda$ are randomly chosen to be non-zero; those elements are generated from a zero-mean
Gaussian, $\lambda_{ij} \sim \mathcal{N}(0, 1)$. The parameters generated by this process resemble those
found in the real-world corpora described in the next section.
The metric we use to compare the estimated marginals to the true marginals is
the mean relative error,
\begin{align*}
\epsilon_{\mathrm{singleton}} &= \frac{1}{n} \sum_i \frac{|q(x_i = 1) - p(x_i = 1)|}{p(x_i = 1)} \\
\epsilon_{\mathrm{pairwise}} &= \frac{1}{n^2 - n} \sum_i \sum_{j \neq i} \frac{|q(x_i x_j = 1) - p(x_i x_j = 1)|}{p(x_i x_j = 1)},
\end{align*}
where $q$ describes the approximate marginals computed by the approach under test.
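These metrics are straightforward to compute; a small sketch (illustrative only) is given below, with singleton marginals stored as length-$n$ vectors and pairwise marginals as $n \times n$ matrices whose diagonals are ignored.

```python
import numpy as np

def mean_relative_error_singleton(q_single, p_single):
    return np.mean(np.abs(q_single - p_single) / p_single)

def mean_relative_error_pairwise(q_pair, p_pair):
    n = p_pair.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    err = np.abs(q_pair - p_pair) / p_pair
    return err[off_diag].sum() / (n * n - n)
```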
To measure the approximation error as a function of computation time, we compute
the mean relative error after every full round of message passing for BP, every full
iteration of coordinate ascent updates in Equation (3.2) for FLIM-NMF and NMF,
and once at the end for FLIM-Z, since FLIM-Z is not iterative. We also compute
the time elapsed since the start of the program every time the mean relative error is
computed.
The approximation error versus time for BP, FLIM-Z, FLIM-NMF, and NMF
is shown in Figure 3.1. Loopy belief propagation is the most accurate of all the
techniques at estimating both the singleton marginals and the pairwise marginals.
Further, it converges to its final estimate after very few iterations. Unfortunately, it is
also the slowest. In contrast, naïve mean field and our proposals, FLIM-NMF and
FLIM-Z are much faster. They too converge in very few iterations. However, their
errors are higher than those of BP.
On singleton marginals, all of the approximations are quite accurate: mean
relative errors are always less than 1%. NMF and FLIM-
NMF have the same relative errors since the singleton marginals for FLIM-NMF are
constrained to be equal to the solutions of NMF. FLIM-Z has a larger error than
either of these since its marginals assume that there are no pairwise correlations, an
assumption that is violated.
On pairwise marginals, BP once again achieves the lowest error, with FLIM-NMF
and FLIM-Z following closely behind. However, here NMF deviates from the other
three, having a much larger error (note that the y-axis is logarithmic). Because the
naïve mean field removes all dependencies between variables, it poorly characterizes
the rich correlation structure implied by $\lambda$. As the next section shows, this large error
leads to poorer MAP estimates of $\lambda$. FLIM-NMF, FLIM-Z, and BP, however, have
errors of around 1%; consequently they all have better MAP estimates of $\lambda$ than NMF. But
our proposals, FLIM-NMF and FLIM-Z are able to run in a fraction of the execution
time of BP.
3.3.2 Making predictions
With the parameters of the model optimized using the procedure described in Sec-
tion 3.2, the model can then be used to make predictions on unseen data. The
predictive problem we evaluate here is that of predicting one of the binary random
variables $x_i$ given all of the other variables $x_{-i}$. This question can be answered by computing
the conditional likelihood
$$p(x_i \mid x_{-i}, \theta, \lambda) \propto \exp\left( \theta_i x_i + \sum_{j \neq i} \lambda_{ij}\, x_i x_j \right).$$
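In practice this conditional reduces to a logistic function of the parameters, as in the sketch below (illustrative, using the notation of this chapter and assuming a zero-diagonal pairwise matrix).

```python
import numpy as np

# Probability that item i is present given the rest of the document's items.
# theta: (n,) singleton parameters; lam: symmetric (n, n) matrix with zero diagonal;
# x: a binary vector of observed items for one document.

def conditional_prob(i, x, theta, lam):
    a = theta[i] + lam[i] @ x            # with a zero diagonal this sums lam_ij * x_j over j != i
    return 1.0 / (1.0 + np.exp(-a))      # p(x_i = 1 | x_{-i}, theta, lam)
```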
We apply this predictive procedure to two data sets:
Cora Cora (McCallum et al. 2000) is a set of 2708 abstracts from the Cora research
paper search engine, with links between documents that cite each other. For
the evaluation in this section, we ignore the textual content of the corpus and
concern ourselves with the links alone. The set of observed tokens associated
with each document is the set of cited and citing documents, yielding 2708
unique tokens. The model has a total of 3,667,986 parameters.
Metafilter Metafilter (http://www.metafilter.com) is an internet community weblog where users share links.
Users can then annotate links with tags which describe them. We consider each
link to be a document and each link's attendant tags to be its observed token
set. We culled a subset of these links to create a corpus of 18,609 documents
with 3,096 unique tokens. The model has a total of 4,794,156 parameters.
For Cora, this predictive problem amounts to estimating the probability of a document
in Cora citing a particular paper given our knowledge of the document's other citations.
For Metafilter, we are estimating the probability that a link has a certain tag given
its other tags.
We used five-fold cross-validation to compute the predictive perplexity of unseen
data. All experiments were run with the Dirichlet prior parameter set to 2 (equivalent
to Laplace smoothing); the Laplacian and Gaussian prior parameters were set to
$\beta_1 = \beta_2 = D e^{-8}$, where $D$ is the size of the corpus (cross-validation can be used
to find good values of $\beta_1$ and $\beta_2$). The results of these experiments are shown in Figure 3.2.
On both data sets, learning the covariance structure improves the predictive
perplexity over the baseline. Thus the correlation structure captured by the Ising
model provides increased predictive power when applied to these data sets.
The predictive perplexity of the model when trained using our proposals, FLIM-Z
and FLIM-NMF, is nearly identical to that of loopy belief propagation (BP) on both
data sets. Naïve mean field (NMF), on the other hand, does substantially worse, but
still better than Baseline. While FLIM-Z and FLIM-NMF are close to BP with respect
to predictive power, the previous section showed that their speed was closer to that
of NMF. Thus, our procedure provides a way to train models as accurately as loopy
belief propagation, but in a fraction of the time.
3.4 Discussion
We introduced a procedure to estimate the parameters of large-scale Ising models. This
procedure makes use of a novel constrained variational approximation for estimating
the pairwise marginals of the Ising model. This approximation has a simple mathemat-
ical form and can be computed more efficiently than other techniques. We also showed
empirically that this approximation is accurate for real-world data sets. Our approxi-
mation yields a procedure which can tractably be applied to models with millions of
parameters that can make predictions comparable with the state-of-the-art.
Figure 3.1: Mean relative error of singleton marginals (left) and pairwise marginals
(right) on a synthetic model, plotted against execution time (ms) for BP, FLIM-Z,
FLIM-NMF, and NMF. Execution times are on a logarithmic scale. The errors in (b)
are also on a logarithmic scale. Loopy belief propagation (BP) is accurate but slow.
Naïve mean field (NMF) is grossly inaccurate at estimating pairwise marginals.
FLIM-NMF offers a compromise: accuracy not much worse than BP at speed not much
worse than NMF. (a) Relative error of singleton marginals. (b) Relative error of
pairwise marginals.
Figure 3.2: A comparison of the predictive perplexity of the Ising model using the
different procedures for parameter optimization, on (a) Cora and (b) Metafilter. Lower
is better. All approaches perform better than the baseline. Our proposals (FLIM-Z
and FLIM-NMF) achieve better predictive perplexity than naïve mean field (NMF),
as does loopy belief propagation (BP). But our proposals are able to run in a fraction
of the time of BP (Figure 3.1).
Chapter 4
Relational Topic Models

Portions of this chapter appear in Chang and Blei (2010, 2009).

In the previous chapter, we described a model of documents and links and inferential
tools for the model. While these models are able to successfully make predictions
about documents, they often miss salient patterns of the corpus better captured by
latent variable models of link structure.
Recent research in this field has focused on latent variable models of link structure
because of their ability to decompose a network according to hidden patterns of
connections between its nodes (Kemp et al. 2004; Hofman and Wiggins 2007; Airoldi
et al. 2008). These models represent a significant departure from statistical models of
networks, which explain network data in terms of observed sufficient statistics (Wasser-
man and Pattison 1996; Newman 2002; Fienberg et al. 1985; Getoor et al. 2001; Taskar
et al. 2004b).
While powerful, current latent variable models account only for the structure
of the network, ignoring additional attributes of the nodes that might be available.
For example, a citation network of articles also contains text and abstracts of the
documents, a linked set of web-pages also contains the text for those pages, and an
on-line social network also contains profile descriptions and other information about
its members. This type of information about the nodes, along with the links between
them, should be used for uncovering, understanding and exploiting the latent structure
in the data.
To this end, we develop a new model of network data that accounts for both links
and attributes. While a traditional network model requires some observed links to
provide a predictive distribution of links for a node, our model can predict links using
only a new node's attributes. Thus, we can suggest citations of newly written papers,
predict the likely hyperlinks of a web page in development, or suggest friendships in a
social network based only on a new user's profile of interests. Moreover, given a new
node and its links, our model provides a predictive distribution of node attributes.
This mechanism can be used to predict keywords from citations or a users interests
from his or her social connections. Such prediction problems are out of reach for
traditional network models.
Here we focus on document networks. The attributes of each document are its
text, i.e., discrete observations taken from a xed vocabulary, and the links between
documents are connections such as friendships, hyperlinks, citations, or adjacency.
To model the text, we build on previous research in mixed-membership document
models, where each document exhibits a latent mixture of multinomial distributions
or topics (Blei et al. 2003b; Erosheva et al. 2004; Steyvers and Griths 2007). The
links are then modeled dependent on this latent representation. We call our model,
which explicitly ties the content of the documents with the connections between them,
the relational topic model (RTM).
The RTM aords a signicant improvement over previously developed models
of document networks. Because the RTM jointly models node attributes and link
structure, it can be used to make predictions about one given the other. Previous work
tends to explore one or the other of these two prediction problems. Some previous work
uses link structure to make attribute predictions (Chakrabarti et al. 1998; Kleinberg
1999), including several topic models (Dietz et al. 2007; McCallum et al. 2005; Wang
42
et al. 2005). However, none of these methods can make predictions about links given
words.
Other models use node attributes to predict links (Ho et al. 2002). However,
these models condition on the attributes but do not model them. While this may be
eective for small numbers of attributes of low dimension, these models cannot make
meaningful predictions about or using high-dimensional attributes such as text data.
As our empirical study in Section 4.3 illustates, the mixed-membership component
provides dimensionality reduction that is essential for eective prediction.
In addition to being able to make predictions about links given words and words
given links, the RTM is able to do so for new documentsdocuments outside of
training data. Approaches which generate document links through topic models treat
links as discrete terms from a separate vocabulary that essentially indexes the
observed documents (Nallapati and Cohen 2008; Cohn and Hofmann 2001; Sinkkonen
et al. 2008; Gruber et al. 2008; Erosheva et al. 2004; Xu et al. 2006, 2008). Through
this index, such approaches encode the observed training data into the model and
thus cannot generalize to observations outside of them. Link and word predictions for
new documents, of the kind we evaluate in Section 4.3.1, are ill-dened.
Recent work from Nallapati et al. (2008) has jointly modeled links and document
content so as to avoid these problems. We elucidate the subtle but important dier-
ences between their model and the RTM in Section 4.1.4. We then demonstrate in
Section 4.3.1 that the RTM makes modeling assumptions that lead to signicantly
better predictive performance.
The remainder of this chapter is organized as follows. First, we describe the
statistical assumptions behind the relational topic model. Then, we derive ecient
algorithms based on variational methods for approximate posterior inference, parameter
estimation, and prediction. Finally, we study the performance of the RTM on scientic
citation networks, hyperlinked web pages, geographically tagged news articles, and
43
52
478
430
2487
75
288
1123
2122
2299
1354
1854
1855
89
635
92
2438
136
479
109
640
119
686
120
1959
1539
147
172
177
965
911
2192
1489
885
178
378
286
208
1569
2343
1270
218
1290
223
227
236
1617
254
1176
256
634
264
1963
2195
1377
303
426
2091
313
1642
534
801
335
344
585
1244
2291
2617
1627
2290
1275
375
1027
396
1678
2447
2583
1061 692
1207
960
1238
2012
1644
2042
381
418
1792
1284
651
524
1165
2197
1568
2593
1698
547 683
2137 1637
2557
2033
632
1020
436
442
449
474
649
2636
2300
539
541
603
1047
722
660
806
1121
1138
831
837
1335
902
964
966
981
1673
1140
1481
1432
1253
1590
1060
992
994
1001
1010
1651
1578
1039
1040
1344
1345
1348
1355
1420
1089
1483
1188
1674
1680
2272
1285
1592
1234
1304
1317
1426
1695
1465
1743
1944
2259
2213
We address the problem of
finding a subset of features
that allows a supervised
induction algorithm to...
Irrelevant features and
the subset selection
problem
In many domains, an
appropriate inductive bias
is the MIN-FEATURES
bias, which prefers ...
Learning with many
irrelevant features
In this introduction, we
define the term bias as it is
used in machine learning
systems. We motivate ...
Evaluation and selection
of biases in machine
learning
The inductive learning
problem consists of
learning a concept given
examples ...
Utilizing prior concepts
for learning
The problem of learning
decision rules for
sequential tasks is
addressed, focusing on ...
Improving tactical plans
with genetic algorithms
Evolutionary learning
methods have been found
to be useful in several
areas in ...
An evolutionary
approach to learning in
robots
...
...
...
...
...
...
...
...
...
...
Figure 4.1: Example data appropriate for the relational topic model. Each document
is represented as a bag of words and linked to other documents via citation. The RTM
denes a joint distribution over the words in each document and the citation links
between them.
social networks. The RTM provides better word prediction and link prediction than
natural alternatives and the current state of the art.
4.1 Relational Topic Models
The relational topic model (RTM) is a hierarchical probabilistic model of networks,
where each node is endowed with attribute information. We will focus on text data,
where the attributes are the words of the documents (see Figure 4.1). The RTM
embeds this data in a latent space that explains both the words of the documents and
how they are connected.
4.1.1 Modeling assumptions
The RTM builds on previous work in mixed-membership document models. Mixed-
membership models are latent variable models of heterogeneous data, where each data
point can exhibit multiple latent components. Mixed-membership models have been
44
successfully applied in many domains, including survey data (Erosheva et al. 2007),
image data (Fei-Fei and Perona 2005; Barnard et al. 2003), network data (Airoldi
et al. 2008), and document modeling (Steyvers and Griths 2007; Blei et al. 2003b).
Mixed-membership models were independently developed in the eld of population
genetics (Pritchard et al. 2000).
To model node attributes, the RTM reuses the statistical assumptions behind
latent Dirichlet allocation (LDA) (Blei et al. 2003b), a mixed-membership model of
documents.
1
Specically, LDA is a hierarchical probabilistic model that uses a set
of topics, distributions over a xed vocabulary, to describe a corpus of documents.
In its generative process, each document is endowed with a Dirichlet-distributed
vector of topic proportions, and each word of the document is assumed drawn by rst
drawing a topic assignment from those proportions and then drawing the word from
the corresponding topic distribution. While a traditional mixture model of documents
assumes that every word of a document arises from a single mixture component, LDA
allows each document to exhibit multiple components via the latent topic proportions
vector. Below we describe this model in more detail before introducing our contribution,
the RTM.
4.1.2 Latent Dirichlet allocation
Latent Dirichlet allocation takes as input a collection of documents which are rep-
resented as bags-of-words, that is, an unordered collections of terms from a xed
vocabulary. A collection of documents is imbued with a xed number of topics, multi-
nomial distributions over those terms. Intuitively, a topic captures themes by putting
high weights on words which are connected to that theme, and small weights otherwise.
This representation is captured in Figure 1.4 (reproduced here for convenience). On
the left are three topics,
1
,
2
,
3
; we have depicted each by selecting words with
1
A general mixed-membership model can accommodate any kind of grouped data paired with an
appropriate observation model (Erosheva et al. 2004).
45
high probability mass in that topic. For example, the blue topic,
2
puts high mass
on terms related to jurisprudence while the red topic,
3
puts high mass on terms
related to sports.
The congressman threw the opening
pitch at the Yankees game yesterday
evening, despite being under
investigation by a house committee.
Both Democrats and Republicans on
the committee condemned...
lawyer
justice
judge
investigate
prosecutor
game
coach
player
play
match
republican
democrat
senate
campaign
mayor

1
w
d,1:N

d
z
d,1
z
d,2
Figure 4.2: A depiction of the assumptions underlying topic models. Topic models
presuppose latent themes (left) and documents (right). Documents are a composition
of latent themes; this composition determines the words in the document that we
observe.
Additionally, LDA associates with each document a multinomial distribution over
topics. Intuitively, this captures what the document is about in broad thematic
terms. This is captured by
d
in Figure 1.4 also depicted graphically as a bar graph
over topics (colors). In the example text, the document is mostly about politics with
a smattering of sports and law. Finally, LDA associates a single topic assignment
with each word in the document. The topic proportions
d
govern the frequency with
which each topic appears in an assignment; the topic vectors
k
govern which words
are likely to appear for a given assignment. This is graphically depicte in Figure 1.4
46
D
N K
z
w

Figure 4.3: A graphical model representation of latent Dirichlet allocation. The words
are observed (shaded) while the the topic assignments (z), topic proportions (), and
topics () are latent. Plates indicate replication.
by coloring words according to their topic assignment.
This intuitive description of LDA can be formalized by the following generative
process:
1. For each document d:
(a) Draw topic proportions
d
[ Dir().
(b) For each word w
d,n
:
i. Draw assignment z
d,n
[
d
Mult(
d
).
ii. Draw word w
d,n
[z
d,n
,
1:K
Mult(
z
d,n
).
The notation x[z F(z) means that x is drawn conditional on z from the
distribution F(z). We use Dir and Mult as shorthand for the Dirichlet and Multinomial
distributions.
This generative process is depicted in Figure 4.3. The words w are the only
observed variables. The parameters for the model are K, the number of topics in the
model, , a K-dimensional Dirichlet parameter controlling the topic proportions ,
and
1:K
K multinomial parameters representing the topic distributions over terms.
It is worth emphasizing that the words are the only observed data in this model.
47
The topics, the rate at which topics appear in each document, and the topic associated
with each word are all inferred solely based on the way words co-occur in the data.
4.1.3 Relational topic model
In the RTM, each document is rst generated from topics as in LDA. The links between
documents are then modeled as binary variables, one for each pair of documents.
These binary variables are distributed according to a distribution that depends on the
topics used to generate each of the constituent documents. Because of this dependence,
the content of the documents are statistically connected to the link structure between
them. Thus each documents mixed-membership depends both on the content of the
document as well as the pattern of its links. In turn, documents whose memberships
are similar will be more likely to be connected under the model.
The parameters of the RTM are
1:K
, K topic distributions over terms, a K-
dimensional Dirichlet parameter , and a function that provides binary probabilities.
(This function is explained in detail below.) We denote a set of observed documents
by w
1:D,1:N
, where w
i,1:N
are the words of the ith document. (Words are assumed to
be discrete observations from a xed vocabulary.) We denote the links between the
documents as binary variables y
1:D,1:D
, where y
i,j
is 1 if there is a link between the ith
and jth document. The RTM assumes that a set of observed documents w
1:D,1:N
and
binary links between them y
1:D,1:D
are generated by the following process.
1. For each document d:
(a) Draw topic proportions
d
[ Dir().
(b) For each word w
d,n
:
i. Draw assignment z
d,n
[
d
Mult(
d
).
ii. Draw word w
d,n
[z
d,n
,
1:K
Mult(
z
d,n
).
2. For each pair of documents d, d

:
48

N
d

d
w
d,n
z
d,n
K

k
y
d,d'

N
d'

d'
w
d',n
z
d',n
Figure 4.4: A two-document segment of the RTM. The variable y
d,d
indicates whether
the two documents are linked. The complete model contains this variable for each pair
of documents. This binary variable is generated contingent on the topic assignments
for the participating documents, z
d
and z
d
, and global regression parameters . The
plates indicate replication. This model captures both the words and the link structure
of the data shown in Figure 4.1.
(a) Draw binary link indicator
y
d,d
[z
d
, z
d
([z
d
, z
d
, ),
where z
d
= 'z
d,1
, z
d,2
, . . . , z
d,n
`.
Figure 4.4 illustrates the graphical model for this process for a single pair of documents.
The full model, which is dicult to illustrate in a small graphical model, contains
the observed words from all D documents, and D
2
link variables for each possible
connection between them.
4.1.4 Link probability function
The function is the link probability function that denes a distribution over the
link between two documents. This function is dependent on the two vectors of topic
assignments that generated their words, z
d
and z
d
.
This modeling decision is important. A natural alternative is to model links as a
49
function of the topic proportions vectors
d
and
d
. One such model is that of Nallapati
et al. (2008), which extends the mixed-membership stochastic blockmodel (Airoldi
et al. 2008) to generate node attributes. Similar in spirit is the non-generative model
of Mei et al. (2008) which regularizes topic models with graph information. The
issue with these formulations is that the links and words of a single document are
possibly explained by disparate sets of topics, thereby hindering their ability to make
predictions about words from links and vice versa.
For example, such a model with ten topics may use the rst ve topics to describe
the language of the corpus and the latter ve to describe its connectivity. Each
document would participate in topics from the rst set which account for its language
and the second set which account for its links. However, given a new document without
link information it is impossible in such a model to make predictions about links since
the document does not participate in the latter ve topics. Similarly, a new document
without word information does not participate in the rst ve topics and hence no
predictions can be made.
In enforcing that the link probability function depends on the latent topic as-
signments z
d
and z
d
, we enforce that the specic topics used to generate the links
are those used to generate the words. A similar mechanism is employed in Blei and
McAulie (2007) for non pair-wise response variables. In estimating parameters, this
means that the same topic indices describe both patterns of recurring words and
patterns in the links. The results in Section 4.3.1 show that this provides a superior
prediction mechanism.
We explore four specic possibilities for the link probability function. First, we
consider

(y = 1) = (
T
(z
d
z
d
) + ), (4.1)
where z
d
=
1
N
d

n
z
d,n
, the notation denotes the Hadamard (element-wise) product,
and the function is the sigmoid. This link function models each per-pair binary
50
variable as a logistic regression with hidden covariates. It is parameterized by coe-
cients and intercept . The covariates are constructed by the Hadamard product of
z
d
and z
d
, which captures similarity between the hidden topic representations of the
two documents.
Second, we consider

e
(y = 1) = exp(
T
(z
d
z
d
) + ). (4.2)
Here,
e
uses the same covariates as

, but has an exponential mean function instead.


Rather than tapering o when z
d
z
d
are close, the probabilities returned by this
function continue to increases exponentially. With some algebraic manipulation, the
function
e
can be viewed as an approximate variant of the modeling methodology
presented in Blei and Jordan (2003).
Third, we consider

(y = 1) = (
T
(z
d
z
d
) + ), (4.3)
where represents the cumulative distribution function of the Normal distribution.
Like

, this link function models the link response as a regression parameterized by


coecients and intercept . The covariates are also constructed by the Hadamard
product of z
d
and z
d
, but instead of the logit model hypothesized by

models
the link probability with a probit model.
Finally, we consider

N
(y = 1) = exp

T
(z
d
z
d
) (z
d
z
d
)

. (4.4)
Note that
N
is the only one of the link probability functions which is not a function
of z
d
z
d
. Instead, it depends on a weighted squared Euclidean dierence between the
51
0.0 0.2 0.4 0.6 0.8 1.0
0
.
1
0
.
3
0
.
5
0
.
7
z
d
z
d
L
i
n
k

p
r
o
b
a
b
i
l
i
t
y

N
Figure 4.5: A comparison of dierent link probability functions. The plot shows
the probability of two documents being linked as a function of their similarity (as
measured by the inner product of the two documents latent topic assignments). All
link probability functions were parameterized so as to have the same endpoints.
two latent topic assignment distributions. Specically, it is the multivariate Gaussian
density function, with mean 0 and diagonal covariance characterized by , applied to
z
d
z
d
. Because the range of z
d
z
d
is nite, the probability of a link,
N
(y = 1),
is also nite. We constrain the parameters and to ensure that it is between zero
and one.
All four of the functions we consider are plotted in Figure 4.5. The link likelihoods
suggested by the link probability functions are plotted against the inner product of z
d
and z
d
. The parameters of the link probability functions were chosen to ensure that
all curves have the same endpoints. Both

and

have similar sigmoidal shapes.


In contrast, the
e
is exponential in shape and its slope remains large at the right
limit. The one-sided Gaussian form of
N
is also apparent.
52
4.2 Inference, Estimation and Prediction
With the model dened, we turn to approximate posterior inference, parameter estima-
tion, and prediction. We develop a variational inference procedure for approximating
the posterior. We use this procedure in a variational expectation-maximization (EM)
algorithm for parameter estimation. Finally, we show how a model whose parameters
have been estimated can be used as a predictive model of words and links.
4.2.1 Inference
The goal of posterior inference is to compute the posterior distribution of the latent
variables conditioned on the observations. As with many hierarchical Bayesian models
of interest, exact posterior inference is intractable and we appeal to approximate
inference methods. Most previous work on latent variable network modeling has
employed Markov Chain Monte Carlo (MCMC) sampling methods to approximate the
posterior of interest (Ho et al. 2002; Kemp et al. 2004). Here, we employ variational
inference (Jordan et al. 1999; Wainwright and Jordan 2005a) a deterministic alternative
to MCMC sampling that has been shown to give comparative accuracy to MCMC with
improved computational eciency (Braun and McAulie 2007; Blei and Jordan 2006).
Wainwright and Jordan (2008) investigate the properties of variational approximations
in detail. Recently, variational methods have been employed in other latent variable
network models (Airoldi et al. 2008; Hofman and Wiggins 2007).
In variational methods, we posit a family of distributions over the latent variables,
indexed by free variational parameters. Those parameters are then t to be close to
the true posterior, where closeness is measured by relative entropy. For the RTM, we
use the fully-factorized family, where the topic proportions and all topic assignments
53
are considered independent,
q(, Z[, ) =

(
d
[
d
)

n
q
z
(z
d,n
[
d,n
)

. (4.5)
The parameters are variational Dirichlet parameters, one for each document, and
are variational multinomial parameters, one for each word in each document. Note
that E
q
[z
d,n
] =
d,n
.
Minimizing the relative entropy is equivalent to maximizing the Jensens lower
bound on the marginal probability of the observations, i.e., the evidence lower bound
(ELBO),
L =

(d
1
,d
2
)
E
q
[log p(y
d
1
,d
2
[z
d
1
, z
d
2
, , )] +

n
E
q
[log p(z
d,n
[
d
)] +

n
E
q
[log p(w
d,n
[
1:K
, z
d,n
)] +

d
E
q
[log p(
d
[)] + H(q) , (4.6)
where (d
1
, d
2
) denotes all document pairs and H(q) denotes the entropy of the dis-
tribution q. The rst term of the ELBO dierentiates the RTM from LDA (Blei
et al. 2003b). The connections between documents aect the objective in approximate
posterior inference (and, below, in parameter estimation).
We develop the inference procedure below under the assumption that only observed
links will be modeled (i.e., y
d
1
,d
2
is either 1 or unobserved).
2
We do this for both
methodological and computational reasons.
First, while one can x y
d
1
,d
2
= 1 whenever a link is observed between d
1
and
d
2
and set y
d
1
,d
2
= 0 otherwise, this approach is inappropriate in corpora where the
absence of a link cannot be construed as evidence for y
d
1
,d
2
= 0. In these cases, treating
these links as unobserved variables is more faithful to the underlying semantics of the
data. For example, in large social networks such as Facebook the absence of a link
2
Sums over document pairs (d
1
, d
2
) are understood to range over pairs for which a link has been
observed.
54
between two people does not necessarily mean that they are not friends; they may
be real friends who are unaware of each others existence in the network. Treating
this link as unobserved better respects our lack of knowledge about the status of their
relationship.
Second, treating non-links as hidden decreases the computational cost of inference;
since the link variables are leaves in the graphical model they can be removed whenever
they are unobserved. Thus the complexity of computation scales with the number
of observed links rather than the number of document pairs. When the number of
true observations is sparse relative to the number of document pairs, as is typical,
this provides a signicant computational advantage. For example, on the Cora data
set described in Section 4.3, there are 3,665,278 unique document pairs but only
5,278 observed links. Treating non-links as hidden in this case leads to an inference
procedure which is nearly 700 times faster.
Our aim now is to compute each term of the objective function given in Equation 4.6.
The rst term,

(d
1
,d
2
)
L
d
1
,d
2

(d
1
,d
2
)
E
q
[log p(y
d
1
,d
2
[z
d
1
, z
d
2
, , )] , (4.7)
depends on our choice of link probability function. For many link probability func-
tions, this term cannot be expanded analytically. However, if the link probability
function depends only on z
d
1
z
d
2
we can expand the expectation using the following
approximation arising from a rst-order Taylor expansion of the term (Braun and
McAulie 2007)
3
,
L
(d
1
,d
2
)
= E
q
[log (z
d
1
z
d
2
)] log (E
q
[z
d
1
z
d
2
]) = log (
d
1
,d
2
),
3
While we do not give a detailed proof here, the error of a rst-order approximation is closely
related to the probability mass in the tails of the distribution on z
d1
and z
d2
. Because the number
words in a document is typically large, the variance of z
d1
and z
d2
tends to be small, making the
rst-order approximation a good one.
55
where
d
1
,d
2
=
d
1

d
2
and
d
= E
q
[z
d
] =
1
N
d

n

d,n
. In this work, we explore
three functions which can be written in this form,
E
q
[log

(z
d
1
z
d
2
)] log (
T

d
1
,d
2
+ )
E
q
[log

(z
d
1
z
d
2
)] log (
T

d
1
,d
2
+ )
E
q
[log
e
(z
d
1
z
d
2
)] =
T

d
1
,d
2
+ . (4.8)
Note that for
e
the expression is exact. The likelihood when
N
is chosen as the link
probability function can also be computed exactly,
E
q
[log
N
(z
d
1
, z
d
2
)] =

i
(
d
1
,i

d
2
,i
)
2
+ Var(z
d
1
,i
) + Var(z
d
2
,i
)),
where Var(z
d,i
) =
1
N
2
d

n

d,n,i
(1
d,n,i
). (See Appendix A.)
Leveraging these expanded expectations, we then use coordinate ascent to op-
timize the ELBO with respect to the variational parameters , . This yields an
approximation to the true posterior. The update for the variational multinomial
d,j
is

d,j
exp

=d

d,n
L
d,d
+E
q
[log
d
[
d
] + log
,w
d,j

. (4.9)
The contribution to the update from link information,

d,n
L
d,d
, depends on the
choice of link probability function. For the link probability functions expanded in
Equation 4.8, this term can be written as

d,n
L
d,d
= (

d
1
,d
2
L
d,d
)

d

N
d
. (4.10)
Intuitively, Equation 4.10 will cause a documents latent topic assignments to be
nudged in the direction of neighboring documents latent topic assignments. The
56
magnitude of this pull depends only on
d,d
, i.e., some measure of how close they are
already. The corresponding gradients for the functions in Equation 4.8 are

d,d

d,d
(1 (
T

d,d
+ ))

d,d

d,d

(
T

d,d
+ )
(
T

d,d
+ )

d,d

L
e
d,d
= .
The gradient when
N
is the link probability function is

d,n
L
N
d,d
=
2
N
d
(
d

d,n

1
N
d
), (4.11)
where
d,n
=
d

1
N
d

d,n
. Similar in spirit to Equation 4.10, Equation 4.11 will cause
a documents latent topic assignments to be drawn towards those of its neighbors.
This draw is tempered by
d,n
, a measure of how similar the current document is to
its neighbors.
The contribution to the update in Equation 4.9 from the word evidence log
,w
d,j
can be computed by taking the element-wise logarithm of the w
d,j
th column of the
topic matrix . The contribution to the update from the documents latent topic
proportions is given by
E
q
[log
d
[
d
] = (
d
) (

d,i
),
where is the digamma function
4
. (A digamma of a vector is the vector of digammas.)
The update for is identical to that in variational inference for LDA (Blei et al.
4
The digamma function is dened as the logarithmic derivative of the gamma function.
57
2003b),

d
+

d,n
.
These updates are fully derived in Appendix A.
4.2.2 Parameter estimation
We t the model by nding maximum likelihood estimates for each of the parameters:
multinomial topic vectors
1:K
and link function parameters , . Once again, this
is intractable so we turn to an approximation. We employ variational expectation-
maximization, where we iterate between optimizing the ELBO of Equation 4.6 with
respect to the variational distribution and with respect to the model parameters. This
is equivalent to the usual expectation-maximization algorithm (Dempster et al. 1977),
except that the computation of the posterior is replaced by variational inference.
Optimizing with respect to the variational distribution is described in Section 4.2.1.
Optimizing with respect to the model parameters is equivalent to maximum likelihood
estimation with expected sucient statistics, where the expectation is taken with
respect to the variational distribution.
The update for the topics matrix is

k,w

n
1(w
d,n
= w)
d,n,k
. (4.12)
This is the same as the variational EM update for LDA (Blei et al. 2003b). In practice,
we smooth our estimates of
k,w
using pseudocount smoothing (Jurafsky and Martin
2008) which helps to prevent overtting by positing a Dirichlet prior on
k
.
In order to t the parameters , of the logistic function of Equation 4.1, we employ
gradient-based optimization. Using the approximation described in Equation 4.8, we
compute the gradient of the objective given in Equation 4.6 with respect to these
58
parameters,

(d
1
,d
2
)

y
d
1
,d
2

d
1
,d
2
+

d
1
,d
2
,

(d
1
,d
2
)

y
d
1
,d
2

d
1
,d
2
+

.
Note that these gradients cannot be used to directly optimize the parameters
of the link probability function without negative observations (i.e., y
d
1
,d
2
= 0). We
address this by applying a regularization penalty. This regularization penalty along
with parameter update procedures for the other link probability functions are given in
Appendix B.
4.2.3 Prediction
With a tted model, our ultimate goal is to make predictions about new data. We
describe two kinds of prediction: link prediction from words and word prediction from
links.
In link prediction, we are given a new document (i.e. a document which is not
in the training set) and its words. We are asked to predict its links to the other
documents. This requires computing
p(y
d,d
[w
d
, w
d
) =

z
d
,z
d

p(y
d,d
[z
d
, z
d
)p(z
d
, z
d
[w
d
, w
d
),
an expectation with respect to a posterior that we cannot compute. Using the inference
algorithm from Section 4.2.1, we nd variational parameters which optimize the ELBO
for the given evidence, i.e., the words and links for the training documents and the
words in the test document. Replacing the posterior with this approximation q(, Z),
59
the predictive probability is approximated with
p(y
d,d
[w
d
, w
d
) E
q
[p(y
d,d
[z
d
, z
d
)] . (4.13)
In a variant of link prediction, we are given a new set of documents (documents not
in the training set) along with their words and asked to select the links most likely to
exist. The predictive probability for this task is proportional to Equation 4.13.
The second predictive task is word prediction, where we predict the words of a
new document based only on its links. As with link prediction, p(w
d,i
[y
d
) cannot be
computed. Using the same technique, a variational distribution can approximate this
posterior. This yields the predictive probability
p(w
d,i
[y
d
) E
q
[p(w
d,i
[z
d,i
)] .
Note that models which treat the endpoints of links as discrete observations of
data indices cannot participate in the two tasks presented here. They cannot make
meaningful predictions for documents that do not appear in the training set (Nallapati
and Cohen 2008; Cohn and Hofmann 2001; Sinkkonen et al. 2008; Erosheva et al.
2004). By modeling both documents and links generatively, our model is able to
give predictive distributions for words given links, links given words, or any mixture
thereof.
4.3 Empirical Results
We examined the RTM on four data sets
5
. Words were stemmed; stop words, i.e.,
words like and of or but, and infrequently occurring words were removed.
5
An R package implementing these models and more are available online at http://cran.r-
project.org/web/packages/lda/. Detailed derivations for some of the models included in the package
are given in Appendix D.
60
Table 4.1: Summary statistics for the four data sets after processing.
Data Set # of Documents # of Words Number of Links Lexicon size
Cora 2708 49216 5278 1433
WebKB 877 79365 1388 1703
PNAS 2218 119162 1577 2239
LocalNews 51 93765 107 1242
Directed links were converted to undirected links
6
and documents with no links were
removed. The Cora data (McCallum et al. 2000) contains abstracts from the Cora
computer science research paper search engine, with links between documents that
cite each other. The WebKB data (Craven et al. 1998) contains web pages from the
computer science departments of dierent universities, with links determined from
the hyperlinks on each page. The PNAS data contains recent abstracts from the
Proceedings of the National Academy of Sciences. The links between documents are
intra-PNAS citations. The LocalNews data set is a corpus of local news culled from
various media markets throughout the United States. We create one bag-of-words
document associated with each state (including the District of Columbia); each states
document consists of headlines and summaries from local news in that states media
markets. Links between states were determined by geographical adjacency. Summary
statistics for these data sets are given in Table 4.1.
4.3.1 Evaluating the predictive distribution
As with any probabilistic model, the RTM denes a probability distribution over unseen
data. After inferring the latent variables from data (as described in Section 4.2.1), we
ask how well the model predicts the links and words of unseen nodes. Models that
give higher probability to the unseen documents better capture the joint structure of
words and links.
We study the RTM with three link probability functions discussed above: the
6
The RTM can be extended to accommodate directed connections. Here we modeled undirected
links.
61
5 10 15 20 25
6
0
0
7
0
0
8
0
0
9
0
0
Predictive Link Rank
C
o
r
a
5 10 15 20 25
2
7
5
2
8
5
2
9
5
Predictive Word Rank
5 10 15 20 25
1
8
0
2
2
0
2
6
0
W
e
b
K
B
5 10 15 20 25
3
0
0
3
0
5
3
1
0
3
1
5
5 10 15 20 25
4
4
0
4
8
0
5
2
0
Number of topics
P
N
A
S
RTM,,

RTM,,
e
RTM,,

LDA + Regression
Pairwise LinkLDA
5 10 15 20 25
4
3
0
4
4
0
4
5
0
4
6
0
Number of topics
Figure 4.6: Average held-out predictive link rank (left) and word rank (right) as
a function of the number of topics. Lower is better. For all three corpora, RTMs
outperform baseline unigram, LDA, and Pairwise Link-LDA Nallapati et al. (2008).
logistic link probability function,

, of Equation 4.1; the exponential link proba-


bility function,
e
of Equation 4.2; and the probit link probability function,

of
Equation 4.3. We compare these models against two alternative approaches.
62
The rst (Pairwise Link-LDA) is the model proposed by Nallapati et al. (2008),
which is an extension of the mixed membership stochastic block model (Airoldi et al.
2008) to model network structure and node attributes. This model posits that each link
is generated as a function of two individual topics, drawn from the topic proportions
vectors associated with the endpoints of the link. Because latent topics for words and
links are drawn independently in this model, it cannot ensure that the discovered topics
are representative of both words and links simultaneously. Additionally, this model
introduces additional variational parameters for every link which adds computational
complexity.
The second (LDA + Regression) rst ts an LDA model to the documents and
then ts a logistic regression model to the observed links, with input given by the
Hadamard product of the latent class distributions of each pair of documents. Rather
than performing dimensionality reduction and regression simultaneously, this method
performs unsupervised dimensionality reduction rst, and then regresses to understand
the relationship between the latent space and underlying link structure. All models
were t such that the total mass of the Dirichlet hyperparameter was 1.0. (While
we omit a full sensitivity study here, we observed that the performance of the models
was similar for within a factor of 2 above and below the value we chose.)
We measured the performance of these models on link prediction and word pre-
diction (see Section 4.2.3). We divided the Cora, WebKB and PNAS data sets each
into ve folds. For each fold and for each model, we ask two predictive queries: given
the words of a new document, how probable are its links; and given the links of a
new document, how probable are its words? Again, the predictive queries are for
completely new test documents that are not observed in training. During training the
test documents are removed along with their attendant links. We show the results
for both tasks in terms of predictive rank as a function of the number of topics in
Figure 4.6. (See Section 4.4 for a discussion on potential approaches for selecting the
63
number of topics and the Dirichlet hyperparameter .) Here we follow the convention
that lower predictive rank is better.
In predicting links, the three variants of the RTM perform better than all of the
alternative models for all of the data sets (see Figure 4.6, left column). Cora is
paradigmatic, showing a nearly 40% improvement in predictive rank over baseline and
25% improvement over LDA + Regression. The performance for the RTM on this
task is similar for all three link probability functions. We emphasize that the links are
predicted to documents seen in the training set from documents which were held out.
By incorporating link and node information in a joint fashion, the model is able to
generalize to new documents for which no link information was previously known.
Note that the performance of the RTM on link prediction generally increases as the
number of topics is increased (there is a slight decrease on WebKB). In contrast, the
performance of the Pairwise Link-LDA worsens as the number of topics is increased.
This is most evident on Cora, where Pairwise Link-LDA is competitive with RTM
at ve topics, but the predictive link rank monotonically increases after that despite
its increased dimensionality (and commensurate increase in computational diculty).
We hypothesize that Pairwise Link-LDA exhibits this behavior because it uses some
topics to explain the words observed in the training set, and other topics to explain
the links observed in the training set. This problem is exacerbated as the number of
topics is increased, making it less eective at predicting links from word observations.
In predicting words, the three variants of the RTM again outperform all of the
alternative models (see Figure 4.6, right column). This is because the RTM uses
link information to inuence the predictive distribution of words. In contrast, the
predictions of LDA + Regression and Pairwise Link-LDA barely use link information;
thus they give predictions independent of the number of topics similar to those made
by a simple unigram model.
64
4.3.2 Automatic link suggestion
Table 4.2: Top eight link predictions made by RTM (
e
) and LDA + Regression for
two documents (italicized) from Cora. The models were t with 10 topics. Boldfaced
titles indicate actual documents cited by or citing each document. Over the whole
corpus, RTM improves precision over LDA + Regression by 80% when evaluated on
the rst 20 documents retrieved.
Markov chain Monte Carlo convergence diagnostics: A comparative review
Minorization conditions and convergence rates for Markov chain Monte Carlo
R
T
M
(

e
)
Rates of convergence of the Hastings and Metropolis algorithms
Possible biases induced by MCMC convergence diagnostics
Bounding convergence time of the Gibbs sampler in Bayesian image restoration
Self regenerative Markov chain Monte Carlo
Auxiliary variable methods for Markov chain Monte Carlo with applications
Rate of Convergence of the Gibbs Sampler by Gaussian Approximation
Diagnosing convergence of Markov chain Monte Carlo algorithms
Exact Bound for the Convergence of Metropolis Chains L
D
A
+
R
e
g
r
e
s
s
i
o
n
Self regenerative Markov chain Monte Carlo
Minorization conditions and convergence rates for Markov chain Monte Carlo
Gibbs-markov models
Auxiliary variable methods for Markov chain Monte Carlo with applications
Markov Chain Monte Carlo Model Determination for Hierarchical and Graphical Models
Mediating instrumental variables
A qualitative framework for probabilistic inference
Adaptation for Self Regenerative MCMC
Competitive environments evolve better solutions for complex tasks
Coevolving High Level Representations
R
T
M
(

e
)
A Survey of Evolutionary Strategies
Genetic Algorithms in Search, Optimization and Machine Learning
Strongly typed genetic programming in evolving cooperation strategies
Solving combinatorial problems using evolutionary algorithms
A promising genetic algorithm approach to job-shop scheduling. . .
Evolutionary Module Acquisition
An Empirical Investigation of Multi-Parent Recombination Operators. . .
A New Algorithm for DNA Sequence Assembly
L
D
A
+
R
e
g
r
e
s
s
i
o
n
Identication of protein coding regions in genomic DNA
Solving combinatorial problems using evolutionary algorithms
A promising genetic algorithm approach to job-shop scheduling. . .
A genetic algorithm for passive management
The Performance of a Genetic Algorithm on a Chaotic Objective Function
Adaptive global optimization with local search
Mutation rates as adaptations
A natural real-world application of link prediction is to suggest links to a user
based on the text of a document. One might suggest citations for an abstract or
friends for a user in a social network.
65
As a complement to the quantitative evaluation of link prediction given in the
previous section, Table 4.2 illustrates suggested citations using RTM (
e
) and LDA +
Regression as predictive models. These suggestions were computed from a model t on
one of the folds of the Cora data using 10 topics. (Results are qualitatively similar for
models t using dierent numbers of topics; see Section 4.4 for strategies for choosing
the number of topics.) The top results illustrate suggested links for Markov chain
Monte Carlo convergence diagnostics: A comparative review, which occurs in this
folds training set. The bottom results illustrate suggested links for Competitive
environments evolve better solutions for complex tasks, which is in the test set.
RTM outperforms LDA + Regression in being able to identify more true connections.
For the rst document, RTM nds 3 of the connected documents versus 1 for LDA +
Regression. For the second document, RTM nds 3 while LDA + Regression does not
nd any. This qualitative behavior is borne out quantitatively over the entire corpus.
Considering the precision of the rst 20 documents retrieved by the models, RTM
improves precision over LDA + Regression by 80%. (Twenty is a reasonable number
of documents for a user to examine.)
While both models found several connections which were not observed in the data,
those found by the RTM are qualitatively dierent. In the rst document, both sets
of suggested links are about Markov chain Monte Carlo. However, the RTM nds
more documents relating specically to convergence and stationary behavior of Monte
Carlo methods. LDA + Regression nds connections to documents in the milieu
of MCMC, but many are only indirectly related to the input document. The RTM
is able to capture that the notion of convergence is an important predictor for
citations, and has adjusted the topic distribution and predictors correspondingly. For
the second document, the documents found by the RTM are also of a dierent nature
than those found by LDA + Regression. All of the documents suggested by RTM
relate to genetic algorithms. LDA + Regression, however, suggests some documents
66
which are about genomics. By relying only on words, LDA + Regression conates
two genetic topics which are similar in vocabulary but dierent in citation structure.
In contrast, the RTM partitions the latent space dierently, recognizing that papers
about DNA sequencing are unlikely to cite papers about genetic algorithms, and vice
versa. Better modeling the properties of the network jointly with the content of the
documents, the model is able to better tease apart the community structure.
4.3.3 Modeling spatial data
While explicitly linked structures like citation networks oer one sort of connectivity,
data with spatial or temporal information oer another sort of connectivity. In this
section, we show how RTMs can be used to model spatially connected data by applying
it to the LocalNews data set, a corpus of news headlines and summaries from each
state, with document linkage determined by spatial adjacency.
Figure 4.7 shows the per state topic distributions inferred by RTM (left) and LDA
(right). Both models were t with ve topics using the same initialization. (We restrict
the discussion here to ve topics for expositional convenience. See Section 4.4 for a
discussion on potential approaches for selecting the number of topics.) While topics
are strictly speaking exchangeable and therefore not comparable between models,
using the same initialization typically yields topics which are amenable to comparison.
Each row of Figure 4.7 shows a single component of each states topic proportion for
RTM and LDA. That is, if
s
is the latent topic proportions vector for state s, then
s1
governs the intensity of that states color in the rst row,
s2
the second, and so on.
While both RTM and LDA model the words in each states local news corpus,
LDA ignores geographical information. Hence, it nds topics which are distributed
over a wide swath of states which are often not contiguous. For example, LDAs topic
1 is strongly expressed by Maine and Illinois, along with Texas and other states in
the South and West. In contrast, RTM only assigns non-trivial mass to topic 1 in a
67
Topic 1
Topic 2
Topic 3
Topic 4
Topic 5
Topic 1
Topic 2
Topic 3
Topic 4
Topic 5
Figure 4.7: A comparison between RTM (left) and LDA (right) of topic distributions
on local news data. Each color/row depicts a single topic. Each states color intensity
indicates the magnitude of that topics component. The corresponding words associated
with each topic are given in Table 4.3. Whereas LDA nds geographically diuse
topics, RTM, by modeling spatial connectivity, nds coherent regions.
68
Southern states. Similarly, LDA nds that topic 5 is expressed by several states in
the Northeast and the West. The RTM, however, concentrates topic 4s mass on the
Northeastern states.
Table 4.3: The top eight words in each RTM (left) and LDA (right) topic shown in
Figure 4.7 ranked by score (dened below). RTM nds words which are predictive of
both a states geography and its local news.
comments dead
scores landfill
plane metro
courthouse evidence
Topic 1
crash yesterday
registration county
police children
quarter campaign
Topic 2
measure marriage
suspect
officer
guards protesters
appeals finger
Topic 3
bridge area
veterans winter
city snow
deer concert
Topic 4
manslaughter route
girls state
knife grounds
committee developer
Topic 5
election plane
landfill dead
police union
interests veterans
Topic 1
crash police
yesterday judge
fire leave
charges investors
Topic 2
comments marriage
register scores
schools comment
registration rights
Topic 3
snow city
veterans votes
winter bridge
recount lion
Topic 4
garage girls
video dealers
underage housing
mall union
Topic 5
The RTM does so by nding dierent topic assignments for each state, and
69
commensurately, dierent distributions over words for each topic. Table 4.3 shows the
top words in each RTM topic and each LDA topic. Words are ranked by the following
score,
score
k,w

k,w
(log
k,w

1
K

log
k

,w
).
The score nds words which are likely to appear in a topic, but also corrects for
frequent words. The score therefore puts greater weight on words which more easily
characterize a topic. Table 4.3 shows that RTM nds words more geographically
indicative. While LDA provides one way of analyzing this collection of documents,
the RTM enables a dierent approach which is geographically cognizant. For example,
LDAs topic 3 is an assortment of themes associated with California (e.g., marriage)
as well as others (scores, registration, schools). The RTM on the other hand,
discovers words thematically related to a single news item (measure, protesters,
appeals) local to California. The RTM typically nds groups of words associated
with specic news stories, since they are easily localized, while LDA nds words which
cut broadly across news stories in many states. Thus on topic 5, the RTM discovers
key words associated with news stories local to the Northeast such as manslaughter
and developer. On topic 5, the RTM also discovers a peculiarity of the Northeastern
dialect: that roads are given the appellation route more frequently than elsewhere in
the country.
By combining textual information along with geographical information, the RTM
provides a novel exploratory tool for identifying clusters of words that are driven by
both word co-occurrence and geographic proximity. Note that the RTM nds regions
in the United States which correspond to typical clusterings of states: the South, the
Northeast, the Midwest, etc. Further, the soft clusterings found by RTM conrm
many of our cultural intuitionswhile New York is denitively a Northeastern state,
Virginia occupies a liminal space between the Midatlantic and the South.
70
4.3.4 Modeling social networks
We now show how the RTM can be used to qualitatively understand the structure of
social networks. In this section we apply the RTM to four data sets people from
the Bible, people from New York Times articles, and two data sets crawled from the
online social networking site Twitter
7
.
Bible The data set contains 523 entities which appear in the Bible. For each entity
we extract all of the verses in which those entities appear; we take this collection
of verses to be the document for that entity. For links we take all entities which
co-occur in the same verse yielding 475 links. Figure 4.8 shows a visualization of
the results. Each node represents an individual; nodes are colored according to the
topic most associated with that individual. The node near the center (which although
colored brown is on the border of several other clusters) is associated with Jesus.
Another notable gure in that cluster is David who is connected to many others of
that line. A node with high connectivity in a dierent cluster is Israel. Because Israel
may refer to both the place and as an alternate name for Jacob, it is possible that
some of these edges are spurious and the result of improper disambiguation. However
the results are suggestive, with the RTM clustering Israel along with gures such as
Joseph and Benjamin. As an avenue of future work, the RTM might be used to help
disambiguate these entities.
New York Times The data set contains 944 entities tagged in New York Times
articles. We use the collection of articles (out of a set of approximately one million
articles) in which those entities appear as that entitys document. We consider two
entities connected if they are co-tagged in an article. Figure 4.9 shows the result of
tting the RTM to these data. The RTM nds distinct clusters corresponding to
distinct areas in which people are notable; these clusters also often have strong internal
7
http://www.twitter.com
71
tobiah
jahzeel achish
hodiah
jacob jonathan
jonathan
israel
israel
michal
nethaniah
omar
epher gershon
mahlah
geber
unni
on
ibhar
sihon
ish
damascus
milcah anah
anah
boaz
manasseh
elihu
joses
sodom
nebuchadnezzar
zibeon
ehud
adoni
carmi
eliphaz
abigail
abigail
timothy reuben
hanoch
hanoch
ezra james james
gilead
gilead
gilead
debir
nehemiah
solomon
gemariah
makkedah
caleb
edom
dorcas
evil
maaseiah
teresh maaseiah
gideon
adam
adam jehoiada
mishael
benaiah
almodad
amraphel nineveh
priscilla
caiaphas saul
jehoram
shishak
meshach
gad
gad
gad
seraiah
seraiah seraiah
beeliada
uzzah
amasa
pekah
debir
gezer
er eli
legion
haggith
elnathan
beriah
hushai
joram
joram kedar
canaan
canaan
shalmaneser
hadad
ziba joiada
manoah
jeremiah
megiddo
joseph
joseph
joseph
sapphira
ezekiel
ahithophel
jerusalem
david
david
eldad
sarah
herod
dedan jezebel
hannah
shechem
enos hophni
ben ben
adonijah
josiah
trophimus
zimri
bishlam
havilah nahor nahor
joash
mamre azariah azariah
azariah
azariah
korah libnah
bigthan
rehoboam
reuel
reuel
pallu
tiras annas
jehozadak
serug ithamar
levi shemaiah
malchus
zebulun
hazor hanan
magog
tubal
hezekiah
rehum
isaac
joel
ephraim
sheshai
meshech
lachish
uzziah
guni
zelophehad
kohath
rachel
huldah
kenaz
ethan peleg
bela
simeon
simeon
jamin
barzillai
shuah
abraham
enoch
gehazi
jehoahaz
hormah
samuel
shaul
sheleph
nahshon
shobal
hilkiah
hilkiah balaam
joseph
asher
asher
jonadab
shammah
shammah
noah
noah rapha
jairus
mark
andrew
sennacherib
zebul javan
ham ram
eglon
manasseh
jeroboam
jeroboam terah sceva
amaziah
arphaxad
sered
ohad
ahab
heman
elkanah
achbor
rechab
rechab
ebed
ashbel
zohar
og
athaliah
eliab
eliab eliab
eliab
goliath
shillem
ahaz
amminadab
jahleel
lot
simon
simon simon
simon
simon salome
haran
haran
elisha
vashti artaxerxes
abimelech
amram
joab
abijah
miriam
aphek
jeiel
immanuel
ahaziah
aaron
tiglath milcah
jaazaniah
abiathar
naaman
mephibosheth
shimei
taanach
bethuel tirzah tirzah
naboth
nathanael
tarshish
gaal
pelatiah
jotham
eliasaph
abner
zadok
ephraim
ephraim
shaphan
shaphan
adriel
hazarmaveth
chenaanah
chedorlaomer
elon
stephen
birsha
jehoash
kish
dan shemaiah
eve
baanah
shechem salma bartholomew
shemaiah
zur
dan
dan
sidon
toi
zipporah
jabin
zimran
onan
dathan
baasha
japheth cush
amariah
demetrius
luke
nathan malchiel
abishag
abishai
piram
mary
mary
mary
johanan
nicodemus
phinehas
barabbas
jehoiachin
lamech
lamech
ananias
ananias shobach laban
ahimaaz
tamar tamar
elishama
libni arioch
asahel
hiram
jether
gallio
ahimelech
abiram
geshem
shealtiel abel benjamin benjamin
benjamin
gomer
jonathan
aram
tidal
japhia
shimshai
gedaliah
sherebiah
adoni
eliada
judah
joktan
shelah
joel
jason
thomas
shemiramoth
isaiah
hamor ornan
seir
seir
israel
lazarus
zeruiah
sheba sheba
cain
nebaioth ishmael
ishmael
elijah
gera zalmunna
tabitha
moses
zerah
asenath
uriah
jemuel
nepheg
elah
jericho
john
john
zephaniah
ahijah
ahijah
joah
jeshua
ephah
naphtali
naphtali jarmuth mattithiah melchizedek
jephthah
jethro
jezer
jachin
reu
hebron
remaliah
matthias
jesus issachar
shaphat
hadadezer
absalom
eliakim eber martha hoglah medad
nadab
nadab
hazael rezin
thaddaeus
zobah
zedekiah
mizzah
zedekiah heth silas
araunah
asaph titus
elishua
hepher
madai
asa
jehu
jehu
adah
seba
mephibosheth
aharah
matthew samson
jehoiakim
seth
jehoshaphat
paul
hoshea
methuselah shem judah
eldaah
peter
leah
mordecai
eliphelet
daniel merab
delaiah
esau
pharaoh
pharaoh
pharaoh
pharaoh
raamah
philip
philip
philip
midian
eliashib
zebah
jeshua
jesse
bani
balak
malachi
joshua
joshua
hezron
hezron
Figure 4.8: The result of tting the RTM to a collection of entities from the Bible.
Nodes represent people and edges indicate that the people co-occur in the same verse.
Nodes are colored according to the topic most assoicated with that individual. Edges
between nodes with the same primary topic are colored black while edges between
nodes with dierent primary topics are colored grey.
72
ties. For example, in the top center the green cluster contains sports personalities.
Michael Jordan and Derek Jeter are a few prominent, highly-connected gures in
this cluster. The yellow cluster which also has strong internal connections represents
international leaders such as George W. Bush (lower left), Ronald Reagan (lower right),
and George H. W. Bush (upper right). Note that many of these are conservatives.
Beside this cluster is another orange cluster of politicians. This cluster leans more
liberal with gures such as Bill Clinton and Michael Dukakis. Notably, several
republicans are also in this cluster such as Michael Bloomberg. The remaining
clusters found by RTM capture other groups of related individuals, such as artists
and businesspeople.
Twitter Twitter is an online social network where users can regularly post statements
(known as tweets). Users can also choose to follow other users, that is, receive
their tweets. We take each users documents to be the accumulation of their tweets
and we use follower connections as edges between users. Here we present two data
sets.
The rst is a series of tweets collected over the period of approximately one week.
The users included in this data set were found by starting a breadth-rst crawl from a
distinguished node, leading to 180 users being included. Figure 4.10 shows a force-
directed layout of this data set after RTM has been t to it. The nodes represent
users; the colors of the nodes indicate the topic most associated with that user. Some
regions of the graph with similar topics have also been highlighted and annotated
with the most frequently occurring words in that topic. For example, one sector of the
graph has people talking about music topics. However these reside on the periphery.
Another sector uses words associated with blogs and social media; this area has a
hub-spoke structure. Finally, another region of the graph is distinguished by frequent
occurences of the phrase happy easter (the crawl period included Easter). This
73
Figure 4.9: The result of fitting the RTM to a collection of entities from the New York Times. Nodes represent people and edges indicate that the people co-occur in the same article. Nodes are colored according to the topic most associated with that individual. Edges between nodes with the same primary topic are colored black while edges between nodes with different primary topics are colored grey.
Figure 4.10: The result of fitting the RTM to a small collection of Twitter users. Nodes represent users and edges indicate follower/followee relationships. Nodes are colored according to the topic most associated with each user. Some regions dominated by a single topic have been highlighted and annotated with frequently appearing words for that topic.
This region is more of a clique, with many users sending individual greetings to one another.
The second Twitter data set we analyze comes from a larger-scale crawl over a longer period of time. There were 1425 users in this data set. Figure 4.11 shows a visualization of the RTM applied to this data set. Once again, nodes have been colored according to primary topic and several of the topical areas have been labeled with frequently occurring words. This subset of the graph is dominated by a large connected component in the center focused on online affairs (blog, post, online). At the periphery are several smaller communities. For example, there is a food-centric
[Figure 4.11 region annotations include: blog, post, money, online, business, butter, recipe, research, food, sausage, news, iphone, game, 2009, video, obama, tcot, swine, michigan, sotomayor, night, show, tonight, chicago.]
Figure 4.11: The result of fitting the RTM to a larger collection of Twitter users. Nodes represent users and edges indicate follower/followee relationships. Nodes are colored according to the topic most associated with each user. Some regions are annotated with frequently appearing words for that topic.
community in the lower left, and a politics community just above it (the frequently appearing term "tcot" is an acronym for Top Conservatives On Twitter). Because this is a larger data set, the RTM is able to discover broader, more thematically related communities than with the smaller data set.
4.4 Discussion
There are many avenues for future work on relational topic models. Applying the
RTM to diverse types of documents such as protein-interaction networks, whose
node attributes are governed by rich internal structure, is one direction. Even the
text documents which we have focused on in this chapter have internal structure such
as syntax (Boyd-Graber and Blei 2008) which we are discarding in the bag-of-words
model. Augmenting and specializing the RTM to these cases may yield better models
for many application domains.
As with any parametric mixed-membership model, the number of latent components
in the RTM must be chosen using either prior knowledge or model-selection techniques
such as cross-validation. Incorporating non-parametric Bayesian priors such as the
Dirichlet process into the model would allow it to flexibly adapt the number of topics
to the data (Ferguson 1973; Antoniak 1974; Kemp et al. 2004; Teh et al. 2007). This,
in turn, may give researchers new insights into the latent membership structure of
networks.
In sum, the RTM is a hierarchical model of networks and per-node attribute data. The RTM is used to analyze linked corpora such as citation networks, linked web pages, social networks with user profiles, and geographically tagged news. We have demonstrated qualitatively and quantitatively that the RTM provides an effective and useful mechanism for analyzing and using such data. It significantly improves on previous models, integrating both node-specific information and link structure to give better predictions.
Chapter 5
Discovering Link Information
In the previous chapters we have focused on modeling existing network data,
encoding collections of relationships between entities such as people, places, genes, or
corporations. However, the network data thus far have been unannotated, that is, edges
express connectivity but not the nature of the connection. And while many resources
for networks of interesting entities are emerging, most of these can only annotate
connections in a limited fashion. Although relationships between entities are rich, it is
impractical to manually devise complete characterizations of these relationships for
every pair of entities on large, real-world corpora.
Below we present a novel probabilistic topic model to analyze text corpora and
infer descriptions of its entities and of relationships between those entities. We
develop variational methods for performing approximate inference on our model and
demonstrate that our model can be practically deployed on large corpora such as
Wikipedia. We show qualitatively and quantitatively that our model can construct
and annotate graphs of relationships and make useful predictions.
Portions of this chapter appear in Chang et al. (2009).
5.1 Background
Network data, which express relationships between ensembles of entities, are becoming increasingly pervasive. People are connected to each other through a variety of kinship, social, and professional relationships; proteins bind to and interact with other proteins; corporations conduct business with other corporations. Understanding the nature of these relationships can provide useful mechanisms for suggesting new relationships between entities, characterizing new relationships, and quantifying global properties of naturally occurring network structures (Anagnostopoulos et al. 2008; Cai et al. 2005; Taskar et al. 2003; Wasserman and Pattison 1996; Zhou et al. 2008).
Many corpora of network data have emerged in recent years. Examples of such data include social networks, such as LinkedIn or Facebook, and citation networks, such as CiteSeer, Rexa, or JSTOR. Other networks can be constructed manually or automatically using texts with people such as the Bible, scientific abstracts with genes, or decisions in legal journals. Characterizing the networks of connections between these entities is of historical, scientific, and practical interest. However, describing every relationship for large, real-world corpora is infeasible. Thus most data sets label edges as merely on or off, or with a small set of fixed, predefined connection types. These labelings cannot capture the complexities underlying the relationships and limit the applicability of these data sets.
An example of this is shown in Figure 5.1. The figure depicts a social network where nodes represent entities and edges represent some relationship between the entities. Some social networks, such as Facebook (http://www.facebook.com), have self-reported information about each edge; for example, two users may be connected by the fact that they attended the same school (top panel). However, this self-reported information is limited and sparsely populated. By analyzing unstructured resources, we hope to increase the number of annotated edges, the number of nodes covered, and the kinds of annotations (bottom panel).
In this chapter we develop a method for augmenting such data sets by analyzing document collections to uncover the relationships encoded in their texts. Text corpora are replete with information about relationships, but this information is out of reach for traditional network analysis techniques. We develop Networks Uncovered By Bayesian Inference (Nubbi), a probabilistic topic model of text (Blei et al. 2003a; Hofmann 1999; Steyvers and Griffiths 2007) with hidden variables that represent the patterns of word use which describe the relationships in the text. Given a collection of documents, Nubbi reveals the hidden network of relationships that is encoded in the texts by associating rich descriptions with each entity and its connections. For example, Figure 5.2 illustrates a subset of the network uncovered from the texts of Wikipedia. Connections between people are depicted by edges, each of which is associated with words that describe the relationship.
First, we describe the intuitions and statistical assumptions behind Nubbi. Second, we derive efficient algorithms for using Nubbi to analyze large document collections. Finally, we apply Nubbi to the Bible, Wikipedia, and scientific abstracts. We demonstrate that Nubbi can discover sensible descriptions of the network and can make predictions competitive with those made by state-of-the-art models.
5.2 Model
The goal of Nubbi is to analyze a corpus to describe the relationships between pairs of entities. Nubbi takes as input very lightly annotated data, requiring only that entities within the input text be identified. Nubbi also takes as input the network of entities to be annotated. For some corpora this network is already explicitly encoded as a graph. For other text corpora this graph must be constructed. One simple way of constructing this graph is to use a fully-connected network of entities and then prune
(a) A social network with some extant data about how two entities are related (Jonathan Chang and Jordan Boyd-Graber, with the annotation "You and Jordan both went to Princeton.").
(b) The desiderata: a social network where relationships have been automatically annotated by analyzing free text (Ronald Reagan and Jane Wyman, with the annotation "You and Jane used to be married.").
Figure 5.1: An example motivating this work. The figures depict a social network; nodes represent individuals and edges represent relationships between the individuals. Many social networks have some detailed information about the relationships. It is this data we seek to automatically build and augment.
Figure 5.2: A small subgraph of the social network Nubbi learned taking only the raw
text of Wikipedia with tagged entities as input. The full model uses 25 relationship and
entity topics. An edge exists between two entities if their co-occurrence count is high.
For some of the edges, we show the top words from the most probable relationship
topic associated with that pair of entities. These are the words that best explain the
contexts where these two entities appear together. A complete browser for this data
is available at http://topics.cs.princeton.edu/nubbi.
the edges in this graph using statistics such as entity co-occurrence counts.
From the entities in this network, the text is divided into two different classes of bags of words. First, each entity is associated with an entity context, a bag of words co-located with the entity; we use the term "co-located" to refer to words and entities which appear near one another in a text, where the definition of "near" depends on the corpus (some practical choices are given in Section 5.4). Second, each pair of entities is associated with a pair context, a bag of words co-located with the pair. Figure 5.3 shows an example of the input to the algorithm turned into entity contexts and pair contexts.
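As a concrete illustration of this preprocessing step, the sketch below (not the code used for the experiments in this chapter) builds entity contexts, pair contexts, and the co-occurrence counts used to prune the pair graph from a corpus of atomic contexts such as verses. The function and container names, and the pruning threshold, are illustrative assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_contexts(verses, min_cooccurrence=2):
    """verses: iterable of (tokens, entities) pairs, where `entities`
    lists the tagged entities mentioned in that atomic context."""
    entity_context = defaultdict(list)   # entity -> bag of words
    pair_context = defaultdict(list)     # (e, e') -> bag of words
    cooccurrence = Counter()             # used to prune the pair graph

    for tokens, entities in verses:
        for e in set(entities):
            entity_context[e].extend(tokens)
        for e1, e2 in combinations(sorted(set(entities)), 2):
            pair_context[(e1, e2)].extend(tokens)
            cooccurrence[(e1, e2)] += 1

    # Keep only pairs whose co-occurrence count clears the threshold.
    pair_context = {p: words for p, words in pair_context.items()
                    if cooccurrence[p] >= min_cooccurrence}
    return entity_context, pair_context
```

For corpora without verse-like units, the same sketch applies with the tokens taken from a fixed-width window around each mention, as described for the biological and wikipedia data in Section 5.4.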
Nubbi learns two descriptions of how entities appear in the corpus: entity topics and relationship topics. Following Blei et al. (2003a), a topic is defined to be a distribution over words. To aid intuitions, we will for the moment assume that these topics are given and have descriptive names. We will describe how the topics and contexts interplay to reveal the network of relationships hidden in the texts. We emphasize, however, that the goal of Nubbi is to analyze the texts to learn both the topics and relationships between entities.
An entity topic is a distribution over words, and each entity is associated with a distribution over entity topics. For example, suppose there are three entity topics: politics, movies, and sports. Ronald Reagan would have a distribution that favors
Figure 5.3: A high-level overview of Nubbi's view of text data. A corpus with identified entities is turned into a collection of bags-of-words (in rectangles), each associated with individual entities (left) or pairs of entities (right). The procedure in the left panel is repeated for every entity in the text while the procedure in the right panel is repeated for every pair of entities.
politics and movies, athlete actors like Johnny Weissmuller and Geena Davis would
have distributions that favor movies and sports, and specialized athletes, like Pele,
would have distributions that favor sports more than other entity topics. Nubbi uses
entity topics to model entity contexts. Because the sports entity topic would contain
words like "cup," "win," and "goal," associating Pele exclusively with the sports
entity topic would be consistent with the words observed in his context.
Relationship topics are distributions over words associated with pairs of entities,
rather than individual entities, and each pair of entities is associated with a distribution
over relationship topics. Just as the entity topics cluster similar people together (e.g.,
Ronald Reagan, George Bush, and Bill Clinton all express the politics topic), the
relationship topics can cluster similar pairs of people. Thus, Romeo and Juliet,
Abelard and Heloise, Ruslan and Ludmilla, and Izanami and Izanagi might all share a
"lovers" relationship topic.
Relationship topics are used to explain pair contexts. Each word in a pair context
is assumed to express something about either one of the participating entities or
something particular to their relationship. For example, consider Jane Wyman and
Ronald Reagan. (Jane Wyman, an actress, was actor/president Ronald Reagan's first wife.) Individually, Wyman is associated with the movies entity topic and Reagan is associated with the movies and politics entity topics. In addition, this pair of entities is associated with relationship topics for divorce and costars.
Nubbi hypothesizes that each word describes either one of the entities or their relationship. Consider the pair context for Reagan and Wyman:
    In 1938, Wyman co-starred with Ronald Reagan. Reagan and actress Jane Wyman were engaged at the Chicago Theater and married in Glendale, California. Following arguments about Reagan's political ambitions, Wyman filed for divorce in 1948. Since Reagan is the only U.S. president to have been divorced, Wyman is the only ex-wife of an American President.
We have marked the words that are not associated with the relationship topic. Functional words are gray; words that come from a politics topic (associated with Ronald Reagan) are underlined; and words that come from a movies topic (associated with Jane Wyman) are italicized.
The remaining words, "1938," "co-starred," "engaged," "Glendale," "filed," "divorce," "1948," "divorced," and "ex-wife," describe the relationship between Reagan and Wyman. Indeed, it is by deducing which case each word falls into that Nubbi is able to capture the relationships between entities. Examining the relationship topics associated with each pair of entities provides a description of that relationship.
The above discussion gives an intuitive picture of how Nubbi explains the observed
entity and pair contexts using entity and relationship topics. In data analysis, however,
we do not observe the entity topics, pair topics, or the assignments of words to topics.
Our goal is to discover them.
To do this, we formalize these notions in a generative probabilistic model of the
texts that uses hidden random variables to encode the hidden structure described
above. In posterior inference, we reverse the process to discover the latent structure
that best explains the documents. (Posterior inference is described in the next section.)
More formally, Nubbi assumes the following statistical model.
1. For each entity topic $j$ and relationship topic $k$,
   (a) Draw topic multinomials $\beta_j \sim \mathrm{Dir}(\eta_\beta + 1)$, $\gamma_k \sim \mathrm{Dir}(\eta_\gamma + 1)$.
2. For each entity $e$,
   (a) Draw entity topic proportions $\theta_e \sim \mathrm{Dir}(\alpha_\theta)$;
   (b) For each word associated with this entity's context,
       i. Draw topic assignment $z_{e,n} \sim \mathrm{Mult}(\theta_e)$;
       ii. Draw word $w_{e,n} \sim \mathrm{Mult}(\beta_{z_{e,n}})$.
3. For each pair of entities $e, e'$,
   (a) Draw relationship topic proportions $\psi_{e,e'} \sim \mathrm{Dir}(\alpha_\psi)$;
   (b) Draw selector proportions $\pi_{e,e'} \sim \mathrm{Dir}(\alpha_\pi)$;
   (c) For each word associated with this entity pair's context,
       i. Draw selector $c_{e,e',n} \sim \mathrm{Mult}(\pi_{e,e'})$;
       ii. If $c_{e,e',n} = 1$,
           A. Draw topic assignment $z_{e,e',n} \sim \mathrm{Mult}(\theta_e)$;
           B. Draw word $w_{e,e',n} \sim \mathrm{Mult}(\beta_{z_{e,e',n}})$.
       iii. If $c_{e,e',n} = 2$,
           A. Draw topic assignment $z_{e,e',n} \sim \mathrm{Mult}(\theta_{e'})$;
           B. Draw word $w_{e,e',n} \sim \mathrm{Mult}(\beta_{z_{e,e',n}})$.
       iv. If $c_{e,e',n} = 3$,
           A. Draw topic assignment $z_{e,e',n} \sim \mathrm{Mult}(\psi_{e,e'})$;
           B. Draw word $w_{e,e',n} \sim \mathrm{Mult}(\gamma_{z_{e,e',n}})$.
This is depicted in a graphical model in Figure 5.4.
[Figure 5.4 plate labels: N entity contexts with $N_e$ words each; M pair contexts with $N_{e,e'}$ words each; $K_\beta$ entity topics; $K_\gamma$ relationship topics.]
Figure 5.4: A depiction of the Nubbi model using the graphical model formalism. Nodes are random variables; edges denote dependence; plates (i.e., rectangles) denote replication; shaded nodes are observed and unshaded nodes are hidden. The left half of the figure shows entity contexts, while the right half shows pair contexts. In its entirety, the model generates both the entity contexts and the pair contexts shown in Figure 5.3.
The hyperparameters of the Nubbi model are the Dirichlet parameters $\alpha_\theta$, $\alpha_\psi$, and $\alpha_\pi$, which govern the entity topic distributions, the relationship topic distributions, and the entity/pair mixing proportions. The Dirichlet parameters $\eta_\beta$ and $\eta_\gamma$ are priors for each topic's multinomial distribution over terms. There are $K_\beta$ per-topic term distributions for entity topics, $\beta_{1:K_\beta}$, and $K_\gamma$ per-topic term distributions, $\gamma_{1:K_\gamma}$, for relationship topics.
The words of each entity context are essentially drawn from an LDA model using the entity topics. The words of each pair context are drawn in a more sophisticated way. The topic assignments for the words in the pair context for entity $e$ and entity $e'$ are hypothesized to come from the entity topic proportions $\theta_e$, the entity topic proportions $\theta_{e'}$, or the relationship topic proportions $\psi_{e,e'}$. The switching variable $c_{e,e',n}$ selects which of these three assignments is used for each word. This selector $c_{e,e',n}$ is drawn from $\pi_{e,e'}$, which describes the tendency of words associated with this pair of entities to be ascribed to either of the entities or the pair.
It is $\psi_{e,e'}$ that describes what the relationship between entities $e$ and $e'$ is. By allowing some of each pair's context words to come from a relationship topic distribution, the model is able to characterize each pair's interaction in terms of the latent relationship topics.
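To make the generative process concrete, the following minimal simulation sketch draws one entity context and one pair context under the model as reconstructed above. The hyperparameter values, vocabulary size, and topic counts here are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K_beta, K_gamma = 1000, 5, 5           # vocabulary size and topic counts (assumed)
alpha_theta = alpha_psi = alpha_pi = 0.1  # Dirichlet hyperparameters (assumed)
eta_beta = eta_gamma = 0.01

# Topic matrices: each row is a distribution over the vocabulary.
beta = rng.dirichlet(np.full(V, eta_beta + 1.0), size=K_beta)
gamma = rng.dirichlet(np.full(V, eta_gamma + 1.0), size=K_gamma)

def draw_entity_context(theta_e, n_words):
    # z ~ Mult(theta_e), w ~ Mult(beta_z)
    z = rng.choice(K_beta, size=n_words, p=theta_e)
    return np.array([rng.choice(V, p=beta[k]) for k in z])

def draw_pair_context(theta_e, theta_e2, psi, pi, n_words):
    words = []
    for _ in range(n_words):
        c = rng.choice(3, p=pi)  # selector: entity e, entity e', or the pair
        if c == 0:
            words.append(rng.choice(V, p=beta[rng.choice(K_beta, p=theta_e)]))
        elif c == 1:
            words.append(rng.choice(V, p=beta[rng.choice(K_beta, p=theta_e2)]))
        else:
            words.append(rng.choice(V, p=gamma[rng.choice(K_gamma, p=psi)]))
    return np.array(words)

theta_e = rng.dirichlet(np.full(K_beta, alpha_theta))
theta_e2 = rng.dirichlet(np.full(K_beta, alpha_theta))
psi = rng.dirichlet(np.full(K_gamma, alpha_psi))
pi = rng.dirichlet(np.full(3, alpha_pi))

entity_words = draw_entity_context(theta_e, 50)
pair_words = draw_pair_context(theta_e, theta_e2, psi, pi, 30)
```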
5.3 Computation with NUBBI
With the model formally defined in terms of hidden and observed random variables,
we now turn to deriving the algorithms needed to analyze data. Data analysis involves
inferring the hidden structure from observed data and making predictions on future
data. In this section, we develop a variational inference procedure for approximating
the posterior. We then use this procedure to develop a variational expectation-
maximization (EM) algorithm for parameter estimation and for approximating the
various predictive distributions of interest.
5.3.1 Inference
In posterior inference, we approximate the posterior distribution of the latent variables
conditioned on the observations. As for LDA, exact posterior inference for Nubbi is
intractable (Blei et al. 2003a). We appeal to variational methods.
Variational methods posit a family of distributions over the latent variables indexed
by free variational parameters. Those parameters are then fit to be close to the true
posterior, where closeness is measured by relative entropy. See Jordan et al. (1999)
for a review. We use the factorized family
$$q(\Theta, Z, C, \Psi, \Pi \mid \lambda, \nu, \phi, \xi) = \prod_e \Big( q(\theta_e \mid \lambda_e) \prod_n q(z_{e,n} \mid \phi_{e,n}) \Big) \prod_{e,e'} \Big( q(\psi_{e,e'} \mid \lambda_{e,e'})\, q(\pi_{e,e'} \mid \nu_{e,e'}) \prod_n q(z_{e,e',n}, c_{e,e',n} \mid \phi_{e,e',n}, \xi_{e,e',n}) \Big),$$

where $\lambda_e$ is a set of Dirichlet parameters, one for each entity; $\lambda_{e,e'}$ and $\nu_{e,e'}$ are sets of Dirichlet parameters, one for each pair of entities; $\phi_{e,n}$ is a set of multinomial parameters, one for each word in each entity context; $\xi_{e,e',n}$ is a set of multinomial parameters, one for each word in each entity pair; and $\phi_{e,e',n}$ is a set of matrices, one for each word in each entity pair. Each $\phi_{e,e',n}$ contains three rows: one which defines a multinomial over topics given that the word comes from $\theta_e$, one which defines a multinomial given that the word comes from $\theta_{e'}$, and one which defines a multinomial given that the word comes from $\psi_{e,e'}$. Note that the variational family we use is not the fully-factorized family; this family fully captures the joint distribution of $z_{e,e',n}$ and $c_{e,e',n}$. We parameterize this pair by $\phi_{e,e',n}$ and $\xi_{e,e',n}$, which define a multinomial distribution over all $3K$ possible values of this pair of variables.
Minimizing the relative entropy is equivalent to maximizing Jensen's lower bound on the marginal probability of the observations, i.e., the evidence lower bound (ELBO),

$$\mathcal{L} = \sum_{e,e'} \mathcal{L}_{e,e'} + \sum_e \mathcal{L}_e + \mathrm{H}(q), \qquad (5.1)$$

where the sums over $e, e'$ iterate over all pairs of entities and

$$\begin{aligned}
\mathcal{L}_{e,e'} = {} & \sum_n \mathrm{E}_q\!\left[ \log p(w_{e,e',n} \mid \beta_{1:K_\beta}, \gamma_{1:K_\gamma}, z_{e,e',n}, c_{e,e',n}) \right] \\
& + \sum_n \mathrm{E}_q\!\left[ \log p(z_{e,e',n} \mid c_{e,e',n}, \theta_e, \theta_{e'}, \psi_{e,e'}) \right] + \sum_n \mathrm{E}_q\!\left[ \log p(c_{e,e',n} \mid \pi_{e,e'}) \right] \\
& + \mathrm{E}_q\!\left[ \log p(\psi_{e,e'} \mid \alpha_\psi) \right] + \mathrm{E}_q\!\left[ \log p(\pi_{e,e'} \mid \alpha_\pi) \right]
\end{aligned}$$

and

$$\mathcal{L}_e = \sum_n \mathrm{E}_q\!\left[ \log p(w_{e,n} \mid \beta_{1:K_\beta}, z_{e,n}) \right] + \mathrm{E}_q\!\left[ \log p(\theta_e \mid \alpha_\theta) \right] + \sum_n \mathrm{E}_q\!\left[ \log p(z_{e,n} \mid \theta_e) \right].$$

The $\mathcal{L}_{e,e'}$ term of the ELBO differentiates this model from previous models (Blei et al. 2003a). The connections between entities affect the objective in posterior inference (and, below, in parameter estimation).
Our aim now is to compute each term of the objective function given in Equation 5.1. After expanding this expression in terms of the variational parameters, we can derive a set of coordinate ascent updates to optimize the ELBO with respect to the variational parameters $\lambda$, $\nu$, $\phi$, and $\xi$. Refer to Appendix C for a full derivation of the following updates.
The updates for $\phi_{e,n}$ assign topic proportions to each word associated with an individual entity,

$$\phi_{e,n} \propto \exp\left\{ \log \beta_{\cdot,w_n} + \Psi(\lambda_e) \right\}, \qquad (5.2)$$

where $\log \beta_{\cdot,w_n}$ represents the logarithm of column $w_n$ of $\beta$ and $\Psi(\cdot)$ is the digamma function. (A digamma of a vector is the vector of digammas.) The topic assignments for each word associated with a pair of entities are similar,

$$\phi_{e,e',n,1} = \frac{\exp\left\{ \log \beta_{\cdot,w_n} + \Psi(\lambda_{e}) \right\}}{\mathbf{1}^{T} \exp\left\{ \log \beta_{\cdot,w_n} + \Psi(\lambda_{e}) \right\}}, \qquad (5.3)$$

$$\phi_{e,e',n,2} = \frac{\exp\left\{ \log \beta_{\cdot,w_n} + \Psi(\lambda_{e'}) \right\}}{\mathbf{1}^{T} \exp\left\{ \log \beta_{\cdot,w_n} + \Psi(\lambda_{e'}) \right\}}, \qquad (5.4)$$

$$\phi_{e,e',n,3} = \frac{\exp\left\{ \log \gamma_{\cdot,w_n} + \Psi(\lambda_{e,e'}) \right\}}{\mathbf{1}^{T} \exp\left\{ \log \gamma_{\cdot,w_n} + \Psi(\lambda_{e,e'}) \right\}}, \qquad (5.5)$$

where $\kappa_{e,e',n}$ is the vector of normalizing constants (the three denominators above). These normalizing constants are then used to estimate the probability that each word associated with a pair of entities is assigned to either an individual or the relationship,

$$\xi_{e,e',n} \propto \exp\left\{ \log \kappa_{e,e',n} + \Psi(\nu_{e,e'}) \right\}. \qquad (5.6)$$

The topic and entity assignments are then used to estimate the variational Dirichlet parameters which parameterize the latent topic and entity proportions,

$$\nu_{e,e'} = \alpha_\pi + \sum_n \xi_{e,e',n}, \qquad (5.7)$$

$$\lambda_{e,e'} = \alpha_\psi + \sum_n \xi_{e,e',n,3}\, \phi_{e,e',n,3}. \qquad (5.8)$$

Finally, the topic and entity assignments for each pair of entities, along with the topic assignments for each individual entity, are used to update the variational Dirichlet parameters which govern the latent topic assignments for each individual entity. These updates allow us to combine evidence associated with individual entities and evidence associated with entity pairs:

$$\lambda_e = \alpha_\theta + \sum_{e'} \sum_n \left( \xi_{e,e',n,1}\, \phi_{e,e',n,1} + \xi_{e',e,n,2}\, \phi_{e',e,n,2} \right) + \sum_n \phi_{e,n}. \qquad (5.9, 5.10)$$
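The per-word updates above can be read directly as code. The sketch below is an illustrative re-derivation, not the dissertation's implementation; it performs the updates of Equations 5.3-5.6 for one word of a pair context. It uses the full posterior expectations of the log proportions, so the digamma-of-the-sum terms (which cancel under per-row normalization) appear explicitly.

```python
import numpy as np
from scipy.special import digamma

def update_pair_word(w, beta, gamma_topics, lam_e, lam_e2, lam_pair, nu_pair):
    """One coordinate-ascent update for a single word w in the context of pair (e, e').

    beta:         K_beta x V entity topic matrix
    gamma_topics: K_gamma x V relationship topic matrix
    lam_e, lam_e2: variational Dirichlets for theta_e and theta_e'
    lam_pair:     variational Dirichlet for psi_{e,e'}
    nu_pair:      variational Dirichlet for pi_{e,e'} (length 3)
    Returns the row-normalized phi rows and the selector distribution xi.
    """
    rows, kappa = [], np.empty(3)
    for j, (topics, lam) in enumerate([(beta, lam_e), (beta, lam_e2),
                                       (gamma_topics, lam_pair)]):
        unnorm = np.exp(np.log(topics[:, w]) + digamma(lam) - digamma(lam.sum()))
        kappa[j] = unnorm.sum()          # normalizing constants (Eqs. 5.3-5.5)
        rows.append(unnorm / kappa[j])
    xi = np.exp(np.log(kappa) + digamma(nu_pair) - digamma(nu_pair.sum()))
    xi /= xi.sum()                       # selector probabilities (Eq. 5.6)
    return rows, xi
```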
5.3.2 Parameter estimation
We fit the model by finding maximum likelihood estimates for each of the parameters: $\pi_{e,e'}$, $\beta_{1:K_\beta}$, and $\gamma_{1:K_\gamma}$. Once again, this is intractable, so we turn to an approximation. We employ variational expectation-maximization, where we iterate between optimizing the ELBO of Equation 5.1 with respect to the variational distribution and with respect to the model parameters.
Optimizing with respect to the variational distribution is described in Section 5.3.1.
Optimizing with respect to the model parameters is equivalent to maximum likelihood estimation with expected sufficient statistics, where the expectation is taken with respect to the variational distribution. The sufficient statistics for the topic vectors $\beta$ and $\gamma$ consist of all topic-word pairs in the corpus, along with their entity or relationship assignments. Collecting these statistics leads to the following updates,

$$\beta_{\cdot,w} \propto \sum_e \sum_n 1(w_{e,n} = w)\,\phi_{e,n} + \sum_{e,e'} \sum_n 1(w_{e,e',n} = w)\,\xi_{e,e',n,1}\,\phi_{e,e',n,1} + \sum_{e,e'} \sum_n 1(w_{e',e,n} = w)\,\xi_{e',e,n,2}\,\phi_{e',e,n,2}, \qquad (5.11, 5.12, 5.13)$$

$$\gamma_{\cdot,w} \propto \sum_{e,e'} \sum_n 1(w_{e,e',n} = w)\,\xi_{e,e',n,3}\,\phi_{e,e',n,3}. \qquad (5.14)$$

The sufficient statistics for $\pi_{e,e'}$ are the number of words ascribed to the first entity, the second entity, and the relationship topic. This results in the update

$$\pi_{e,e'} \propto \exp\Big\{ \Psi\Big( \textstyle\sum_n \xi_{e,e',n} \Big) \Big\}.$$
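The M-step can be implemented by accumulating these expected counts. The following sketch uses assumed data structures (it is not the dissertation's code); it collects the sufficient statistics of Equations 5.11-5.14 and renormalizes them into updated topic-word multinomials.

```python
import numpy as np

def m_step_topics(entity_stats, pair_stats, K_beta, K_gamma, V):
    """Accumulate expected sufficient statistics for the topic-word matrices.

    entity_stats: iterable of (word_id, phi) from entity contexts, where phi is
                  a length-K_beta topic posterior for that word.
    pair_stats:   iterable of (word_id, xi, phi_rows) from pair contexts, where
                  xi is the length-3 selector posterior and phi_rows holds the
                  per-selector topic posteriors.
    """
    beta_ss = np.zeros((K_beta, V))
    gamma_ss = np.zeros((K_gamma, V))
    for w, phi in entity_stats:
        beta_ss[:, w] += phi
    for w, xi, phi_rows in pair_stats:
        beta_ss[:, w] += xi[0] * phi_rows[0] + xi[1] * phi_rows[1]
        gamma_ss[:, w] += xi[2] * phi_rows[2]
    # Normalize rows to obtain the updated topic multinomials.
    beta = beta_ss / beta_ss.sum(axis=1, keepdims=True)
    gamma = gamma_ss / gamma_ss.sum(axis=1, keepdims=True)
    return beta, gamma
```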
5.3.3 Prediction
With a fitted model, we can make judgments about how well the model describes the joint distribution of words associated with previously unseen data. In this section we describe two prediction tasks that we use to compare Nubbi to other models: word prediction and entity prediction.
In word prediction, the model predicts an unseen word associated with an entity pair given the other words associated with that pair, $p(w_{e,e',i} \mid \mathbf{w}_{e,e',-i})$. This quantity cannot be computed tractably. We instead turn to a variational approximation of this posterior,

$$p(w_{e,e',i} \mid \mathbf{w}_{e,e',-i}) \approx \mathrm{E}_q\left[ p(w_{e,e',i} \mid z_{e,e',i}) \right].$$

Here we have replaced the expectation over the true posterior probability $p(z_{e,e',i} \mid \mathbf{w}_{e,e',-i})$ with the variational distribution $q(z_{e,e',i})$ whose parameters are trained by maximizing the evidence bound given $\mathbf{w}_{e,e',-i}$.
In entity prediction, the model must predict which entity pair a set of words is most likely to appear in. By Bayes' rule, the posterior probability of an entity pair given a set of words is proportional to the probability of the set of words belonging to that entity pair,

$$p((e, e') \mid \mathbf{w}) \propto p(\mathbf{w} \mid \mathbf{w}_{e,e'}),$$

where the proportionality constant is chosen such that the sum of this probability over all entity pairs is equal to one.
After a qualitative examination of the topics learned from corpora, we use these two prediction methods to compare Nubbi against other models that offer probabilistic frameworks for associating entities with text in Section 5.4.2.
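Both prediction tasks reduce to scoring held-out words under the fitted per-pair posteriors. The sketch below is one illustrative way to do this, using posterior-mean plug-in estimates of the proportions rather than the full variational expectation; the parameter containers and function names are assumptions.

```python
import numpy as np

def pair_word_loglik(words, beta, gamma_topics, lam_e, lam_e2, lam_pair, nu_pair):
    """Approximate log p(words | pair): each word's probability is a mixture
    over the selector and the corresponding topic proportions."""
    theta_e = lam_e / lam_e.sum()
    theta_e2 = lam_e2 / lam_e2.sum()
    psi = lam_pair / lam_pair.sum()
    pi = nu_pair / nu_pair.sum()
    per_word = (pi[0] * theta_e @ beta[:, words]
                + pi[1] * theta_e2 @ beta[:, words]
                + pi[2] * psi @ gamma_topics[:, words])
    return np.log(per_word).sum()

def predict_entity_pair(words, pair_params):
    """Entity prediction by Bayes' rule: score every candidate pair and
    renormalize. `pair_params` maps (e, e') -> the fitted parameters above."""
    scores = {pair: pair_word_loglik(words, *params)
              for pair, params in pair_params.items()}
    log_norm = np.logaddexp.reduce(list(scores.values()))
    return {pair: np.exp(s - log_norm) for pair, s in scores.items()}
```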
5.4 Experiments
In this section, we describe a qualitative and quantitative study of Nubbi on three data sets: the bible (characters in the Bible), biological (genes, diseases, and proteins in scientific abstracts), and wikipedia. For these three corpora, the entities of interest are already annotated. Experts have marked all mentions of people in the Bible (Nave 2003) and biological entities in corpora of scientific abstracts (Ohta et al. 2002; Tanabe et al. 2005), and Wikipedia's link structure offers disambiguated mentions. Note that
Topic 1  Topic 2
Entities
Jesus, Mary  Abraham, Chedorlaomer
Terah, Abraham  Ahaz, Rezin
Top Terms
father  king
begat  city
james  smote
daughter  lord
mother  thousand
Table 5.1: Examples of relationship topics learned by a five-topic Nubbi model trained on the Bible. The upper part of the table shows some of the entity pairs highly associated with that topic. The lower part of the table shows the top terms in that topic's multinomial.
it is also possible to use named entity recognizers to preprocess data for which entities are not previously identified.
The first step in our analysis is to determine the entity and pair contexts. For bible, verses offer an atomic context; any term in a verse with an entity (pair) is associated with that entity (pair). For biological, we use tokens within a fixed distance from mentions of an entity (pair) to build the data used by our model. For wikipedia, we used the same approach as biological for associating words with entity pairs. We associated with individual entities, however, all the terms in his/her Wikipedia entry. For all corpora we removed tokens based on a stop list and stemmed all tokens using the Porter stemmer. Infrequent tokens, entities, and pairs were pruned from the corpora. (After preprocessing, the bible dataset contains a lexicon of size 2411, 523 entities, and 475 entity pairs; the biological dataset contains a lexicon of size 2425, 1566 entities, and 577 entity pairs; and the wikipedia dataset contains a lexicon of size 9144, 1918 entities, and 429 entity pairs.)
5.4.1 Learning networks
We first demonstrate that the Nubbi model produces interpretable entity topics that describe entity contexts and relationship topics that describe pair contexts. We also show that by combining Nubbi's model of language with a network automatically estimated through co-occurrence counts, we can construct rich social networks with labeled relationships.
Table 5.1 shows some of the relationship topics learned from the Bible data. (This model has five entity topics and five relationship topics; see the following section for more details on how the choice of number of topics affects performance.) Each column shows the words with the highest weight in that topic's multinomial parameter vector, and above each column are examples of entity pairs associated with that topic. In this example, relationship Topic 1 corresponds to blood relations, and relationship Topic 2 refers to antagonists. We emphasize that this structure is uncovered by analyzing the original texts. No prior knowledge of the relationships between characters is used in the analysis.
In a more diverse corpus, Nubbi learns broader topics. In a twenty-five-topic model trained on the Wikipedia data, the entity topics broadly apply to entities across many time periods and cultures. Artists, monarchs, world politicians, people from American history, and scientists each have a representative topic (see Table 5.2).
The relationship topics further restrict entities that are specific to an individual country or period (Table 5.3). In some cases, relationship topics narrow the focus of broader entity topics. For instance, relationship Topics 1, 5, 6, 9, and 10 in Table 5.3 help explain the specific historical context of pairs better than the very broad world leader entity Topic 7.
In some cases, these distinctions are very specific. For example, relationship Topic 9 contains pairs of post-Hanoverian monarchs of Great Britain and Northern Ireland, while relationship Topic 6 contains relationships with pre-Hanoverian monarchs of England, even though both share words like "queen" and "throne." Note also that these topics favor words like "father" and "daughter," which describe the relationships present in these pairs.
The model sometimes groups together pairs of people from radically different contexts. For example, relationship Topic 8 groups composers with religious scholars (both share terms like "mass" and "patron"), revealing a drawback of using a unigram-
Topic 1 Topic 2 Topic 3 Topic 4
Entities
George Westinghouse Charles Peirce Lindsay Davenport Lee Harvey Oswald
George Stephenson Francis Crick Martina Hingis Timothy McVeigh
Guglielmo Marconi Edmund Husserl Michael Schumacher Yuri Gagarin
James Watt Ibn al-Haytham Andre Agassi Bobby Seale
Robert Fulton Linus Pauling Alain Prost Patty Hearse
Top Terms
electricity work align state
engine universe bgcolor american
patent theory race year
company science win time
invent time grand president
Topic 5 Topic 6 Topic 7 Topic 8
Entities
Pierre-Joseph Proudhon Betty Davis Franklin D. Roosevelt Jack Kirby
Benjamin Tucker Humphrey Bogart Jimmy Carter Terry Pratchett
Murray Rothbard Kate Winslet Brian Mulroney Carl Barks
Karl Marx Martin Scorsese Neville Chamberlain Gregory Benford
Amartya Sen Audrey Hepburn Margaret Thatcher Steve Ditko
Top Terms
social film state story
work award party book
politics star election work
society role president fiction
economics play government publish
Topic 9 Topic 10
Entities
Babe Ruth Xenophon
Barry Bonds Caligula
Satchel Page Horus
Pedro Martinez Nebuchadrezzar II
Roger Clemens Nero
Top Terms
game greek
baseball rome
season history
league senate
run death
Table 5.2: Ten topics from a model trained on Wikipedia carve out fairly broad categories like monarchs, athletes, entertainers, and figures from myth and religion. An exception is the more focused Topic 9, which is mostly about baseball. Note that not all of the information is linguistic; Topic 3 shows we were unsuccessful in filtering out all of Wikipedia's markup, and the algorithm learned to associate score tables with a sports category.
Topic 1 Topic 2 Topic 3 Topic 4
Pairs
Reagan-Gorbachev Muhammad-Moses Grant-Lee Paul VI-John Paul II
Kennedy-Khrushchev Rabin-Arafat Muhammad-Abu Bakr Pius XII-Paul II
Alexandra-Alexander III E. Bronte-C. Bronte Sherman-Grant John XXIII-John Paul II
Najibullah-Kamal Solomon-Moses Jackson-Lee Pius IX-John Paul II
Nicholas I-Alexander III Arafat-Sharon Sherman-Lee Leo XIII-John Paul II
Terms
soviet israel union vatican
russian god corp cathol
government palestinian gen papal
union chile campaign council
nuclear book richmond time
Topic 5 Topic 6 Topic 7 Topic 8
Pairs
Philip V-Louis XIV Henry VIII-C. of Aragon Jefferson-Burr Mozart-Salieri
Louis XVI-Francis I Mary I (Eng)-Elizabeth I Jefferson-Madison Malory-Arthur
Maria Theresa-Charlemagne Henry VIII-Anne Boleyn Perot-Bush Mozart-Beethoven
Philip V-Louis XVI Mary I (Scot)-Elizabeth I Jefferson-Jay Bede-Augustine
Philip V-Maria Theresa Henry VIII-Elizabeth I J.Q. Adams-Clay Leo X-Julius II
Terms
french queen republican music
dauphin english state play
spanish daughter federalist film
death death vote piano
throne throne vice work
Topic 9 Topic 10
Pairs
George VI-Edward VII Trotsky-Stalin
George VI-Edward VIII Kamenev-Stalin
Victoria-Edward VII Khrushchev-Stalin
George V-Edward VII Kamenev-Trotsky
Victoria-George VI Zhou Enlai-Mao Zedong
Terms
royal soviet
queen communist
british central
throne union
father full
Table 5.3: In contrast to Table 5.2, the relationship topics shown here are more specific to time and place. For example, English monarch pairs (Topic 6) are distinct from British monarch pairs (Topic 9). While there is some noise (the Bronte sisters being lumped in with Mideast leaders, or Abu Bakr and Muhammad with Civil War generals), these relationship topics group similar pairs of entities well. A social network labeled with these relationships is shown in Figure 5.2.
based method. As another example, relationship Topic 3 links Civil War generals and early Muslim leaders.
5.4.2 Evaluating the predictive distribution
The qualitative results of the previous section illustrate that Nubbi is an effective model for exploring and understanding latent structure in data. In this section, we provide a quantitative evaluation of the predictive mechanisms that Nubbi provides.
As with any probabilistic model, Nubbi defines a probability distribution over unseen data. After fitting the latent variables of our model to data (as described
[Figure 5.5 panels: biological, bible, and wikipedia; top row y-axis: word prediction log likelihood; bottom row y-axis: entity prediction log likelihood; x-axis: number of topics; legend: Nubbi, AuthorTopic, LDA, Unigram, Mutual.]
Figure 5.5: Predictive log likelihood as a function of the number of Nubbi topics on two
tasks: entity prediction (given the context, predict what entities are being discussed)
and relation prediction (given the entities, predict what words occur). Higher is better.
in Section 5.3.1), we take unseen pair contexts and ask how well the model predicts those held-out words. Models that give higher probability to the held-out words better capture how the two entities participating in that context interact. In a complementary problem, we can ask the fitted model to predict entities given the words in the pair context. (The details of these metrics are defined more precisely in Section 5.3.3.)
We compare Nubbi to three alternative approaches: a unigram model, LDA (Blei
et al. 2003a), and the Author-Topic model (Rosen-Zvi et al. 2004). All of these
approaches are models of language which treat individual entities and pairs of entities
alike as bags of words. In the Author-Topic model (Rosen-Zvi et al. 2004), entities are
associated with individual contexts and pair contexts, but there are no distinguished
pair topics; all words are explained by the topics associated with individuals. In
addition, we also compare the model against two baselines: a unigram model (equivalent
to using no relationship topics and one entity topic) and a mutual information model
(equivalent to using one relationship topic and one entity topic).
We use the bootstrap method to create held-out data sets and compute predictive probability (Efron 1983). Figure 5.5 shows the average predictive log likelihood for the three approaches. The results for Nubbi are plotted as a function of the total number of topics $K = K_\beta + K_\gamma$. The results for LDA and Author-Topic were also computed with K topics. All models were trained with the same hyperparameters.
Nubbi outperforms both LDA and unigram on all corpora for all numbers of topics
K. For word prediction Nubbi performs comparably to Author-Topic on bible, worse
on biological, and better on wikipedia. We posit that because the wikipedia corpus
contains more tokens per entity and pair of entities, the Nubbi model is able to leverage
more data to make better word predictions. Conversely, for biological, individual
entities explain pair contexts better than relationship topics, giving the advantage
to Author-Topic. For wikipedia, this yields a 19% improvement in average word log
likelihood over the unigram model at K = 24.
In contrast, the LDA model is unable to make improved predictions over the
unigram model. There are two reasons for this. First, LDA cannot use information
about the participating entities to make predictions about the pair, because it treats
entity contexts and pair contexts as independent bags of words. Second, LDA does not
allocate topics to describe relationships alone, whereas Nubbi does learn topics which
express relationships. This allows Nubbi to make more accurate predictions about the
words used to describe relationships. When relationship words do find their way into LDA topics, LDA's performance improves, such as on the bible dataset. Here, LDA
obtains a 6% improvement over unigram while Nubbi achieves a 10% improvement.
With the exception of Author-Topic on biological, Nubbi outperforms the other
approaches on the entity prediction task. For example, on wikipedia, the Nubbi model
shows a 32% improvement over the unigram baseline, LDA shows a 7% improvement,
and Author-Topic actually performs worse than the unigram baseline. While LDA,
Author-Topic, and Nubbi improve monotonically with the number of topics on the
word task, they can peak and decrease for the entity prediction task. Recall that an
improved word likelihood need not imply an improved entity likelihood; if a model
assigns a higher word likelihood to other entity pairs in addition to the correct entity
pair, the predictive entity likelihood may still decrease. Thus, while each held-out
context is associated with a particular pair of entities, it does not follow that that
same context could not also be aptly associated with some other entity pair.
5.4.3 Application to New York Times
We can gain qualitative insight into the performance of the Nubbi model (and demonstrate its scalability) by investigating its performance on a larger data set from the New York Times. We treat each of the approximately 1 million articles in this corpus as a document. We filter the corpus down to 2500 vocabulary terms and 944 entities. We fit a Nubbi model using five entity topics and five relationship topics. (For the qualitative evaluation here we fix the number of topics; refer to the previous section for more details on how performance varies with the number of topics.)
Figure 5.6 shows a visualization of the results as a radial plot. Each entity appears
at an angle along the edge of the circle and lines are drawn between related entities.
The thickness of the lines represents the strength of the relationship inferred by the model, while the color of the lines represents the relationship topic which appears most frequently in the description of the relationship between the two entities.
Because the data set is large, a high-level overview such as Figure 5.6 is difficult to fully take in. Consequently we zoom in to a small portion of the graph in Figure 5.7,
which also annotates some of the relationship topics with the word with highest
probability mass in that topic. This view reveals some of the structure of relationships
that Nubbi is able to uncover on this data set. One topic we have labeled "trial"
appears infrequently in this sector of the graph; the only entity connected by this
relationship is Nicole Brown-Simpson. Although not depicted in this zoomed-in graph,
[4] For the qualitative evaluation here we fix the number of topics. Refer to the previous section for more details on how performance varies with the number of topics.
Figure 5.6: A visualization of the results of applying the Nubbi model to the New York Times. Entities appear along the edge of the circle and lines connect related entities. The thickness of the lines represents the strength of the relationship while the colors represent the relationship topic which appears most frequently in the description of the relationship.
[Figure 5.7 appears here. The relationship-topic annotations visible in the zoomed view are labeled "fight," "match," and "trial."]
Figure 5.7: A zoomed view into a small portion of Figure 5.6. The colors (i.e., relationship topics) have been annotated with the most frequently occurring term in that topic. Nubbi is able to discover a way of partitioning relationships into topics and assigning these relationship topics to individual pairs of entities.
Figure 5.8: A screen shot of an Amazon Mechanical Turk task asking users to label
the relationships between entities with textual descriptions. In this way we can get a
large-scale ground truth for the relationships in the New York Times data set.
the other end of this relationship is O. J. Simpson.
Another two topics seem closely related; we have labeled them here as "match" and "fight." The latter is focused on (sporting) contests, such as those involving George Foreman and Garry Kasparov. The former, however, seems to capture a more general notion of contention, with Donald Trump strongly related to several people according to this topic (and Rick Lazio to a lesser extent). The boxer, George Foreman, interestingly occupies both topics almost equally.
This sort of qualitative analysis is suggestive that Nubbi is able to capture aspects of relationships. However, this kind of analysis is difficult to scale up to large data sets such as this one. To aid in this, we perform a large-scale study using Amazon Mechanical Turk (AMT).[5] AMT is an online marketplace for tasks (known as HITs).
[5] https://www.mturk.com/mturk/welcome
A large pool of users selects and completes HITs for a small fee. In this way it is
possible to obtain a large number of human labelings of data sets.
We offered a series of tasks asking users to label relationships that appear in the
New York Times data set. We collected 600 labelings from 13 users. A screenshot of
our task is shown in Figure 5.8. In it we present each user with ten pairs of entities.
For each pair of entities we ask them to write a textual description of the relationship
between those entities (users may optionally check boxes indicating that they do not
know how they are related or that they are not related). To reduce noise each pair of
entities was presented to multiple users. After removing stop words and tokenizing we
are left with a bag of crowd-sourced labels for each of 200 relationships.
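A rough sketch of this aggregation step is given below (the tokenizer and the stop-word list shown are illustrative assumptions rather than the exact preprocessing used): the free-text responses for each entity pair are pooled into a single bag of words.

    import re
    from collections import Counter, defaultdict

    STOP_WORDS = {"the", "a", "an", "and", "of", "is", "are", "to", "they", "their"}  # illustrative subset

    def aggregate_labels(responses):
        # responses: iterable of (entity_pair, free_text_description) from AMT workers
        bags = defaultdict(Counter)
        for pair, text in responses:
            tokens = re.findall(r"[a-z']+", text.lower())
            bags[pair].update(t for t in tokens if t not in STOP_WORDS)
        return bags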
We now measure how well our models can predict these bags of labels. We first train
both the Nubbi model and the Author-Topic model using the parameters mentioned
earlier in this section. As mentioned above, each of these trained models can then
predict words describing the relationship between two entities. For each word in our
test set, that is, the labels we obtained from users on AMT, we compute the rank of
that word in the list of predicted words. We emphasize that for this predictive task
the relationship ground truth was completely hidden from both models. The result of
this experiment is shown in Figure 5.9.
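Concretely, this evaluation can be sketched as follows (ranking and tie-breaking conventions here are assumptions): for each entity pair, sort the vocabulary by the model's predicted word distribution for that pair and record where each crowd-sourced label falls in that ordering; Figures 5.9 and 5.10 plot these ranks on a log scale.

    import numpy as np

    def label_log_ranks(predicted_probs, vocab, labels):
        # predicted_probs: a model's word distribution for one entity pair, aligned with vocab
        # labels: the crowd-sourced words for that pair; returns log ranks (rank 1 = best)
        # (Assumed convention; ties and out-of-vocabulary labels are handled naively here.)
        order = np.argsort(-np.asarray(predicted_probs))        # most probable word first
        rank = {vocab[i]: r + 1 for r, i in enumerate(order)}   # 1-based rank per word
        return [np.log(rank[w]) for w in labels if w in rank]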
Each word in the figure represents an instance of a word in the test set. The position of the word is determined by the predicted rank according to the Author-Topic Model (x-axis) and the Nubbi model (y-axis). Lower is better along both axes. The words below the diagonal are instances where Nubbi's prediction was better than the Author-Topic model's, and vice versa for those above the diagonal. As with other visualizations of this data set, because of the large scale it is difficult to tease out individual differences. Therefore we create another version of this visualization by removing those terms close to the diagonal, that is, the labels for which Nubbi and the Author-Topic Model make similar predictions. This allows us to better understand
Figure 5.9: Predicted rank of ground truth labels using the Author-Topic Model (x-axis) versus predicted rank using Nubbi (y-axis). Lower is better along both axes. Words below the diagonal are instances where Nubbi's prediction was better than the Author-Topic model's.
Figure 5.10: The visualization in Figure 5.9 with the terms closest to the diagonal removed. This emphasizes the differences between the Author-Topic Model and Nubbi, revealing that the predictions Nubbi makes are qualitatively different from those made by the Author-Topic Model.
the differences between these two models.
This second visualization is given in Figure 5.10. This visualization reveals a qualitative difference between the predictions Nubbi is able to make well (below the dashed line) versus the predictions the Author-Topic Model is able to make well (above the dashed line). In particular, the words below the dashed line are generally relationship words such as "brother," "father," "husband," "married," and "opponents." In contrast, the words above the dashed line provide context, such as "africa," "baseball," "russia," or "olympic."
The descriptions of relationships provided by users often contain both contextual
and relationship words, for example, olympic opponents. The Nubbi model better
predicts the relationship-specic words such as opponent opting instead to explain
105
words like olympic by the entity itself. This, in fact, reveals some structure about
gold-standard relationship descriptions. In contrast, the Author-Topic Model does not
make this distinction between relationship and context words. One avenue of future
work would be to take this insight about how relationships are characterized by people
to build models specically designed to generate these sorts of descriptions.
5.5 Discussion and related work
We presented Nubbi, a novel machine learning approach for analyzing free text to extract descriptions of relationships between entities. We applied Nubbi to several corpora: the Bible, Wikipedia, scientific abstracts, and New York Times articles. We showed that Nubbi provides a state-of-the-art predictive model of entities and relationships and, moreover, is a useful exploratory tool for discovering and understanding network data hidden in plain text.

Analyzing networks of entities has a substantial history (Wasserman and Pattison 1996); recent work has focused in particular on clustering and community structure (Anagnostopoulos et al. 2008; Cai et al. 2005; Gibson et al. 1998; McGovern et al. 2003; Newman 2006b), deriving models for social networks (Leskovec et al. 2008a,b; Meeds et al. 2007; Taskar et al. 2003), and applying these analyses to predictive applications (Zhou et al. 2008). Latent variable approaches to modeling social networks with associated text have also been explored (McCallum et al. 2005; Mei et al. 2008; Nallapati et al. 2008; Wang et al. 2005). While the space of potential applications for these models is rich, it is tempered by the need for observed network data as input. Nubbi allows these techniques to augment their network data by leveraging the large body of relationship information encoded in collections of free text.

Previous work in this vein has used either pattern-based approaches or co-occurrence methods. The pattern-based approaches (Agichtein and Gravano 2003; Diehl et al. 2007; Mei et al. 2007; Sahay et al. 2008) and syntax-based approaches (Banko et al. 2007; Katrenko and Adriaans 2007) require patterns or parsers which are meticulously hand-crafted, often fragile, and typically need several examples of desired relationships, limiting the type of relationships that can be discovered. In contrast, Nubbi makes minimal assumptions about the input text, and is thus practical for languages and non-linguistic data where parsing is not available or applicable. Co-occurrence methods (Culotta et al. 2005; Davidov et al. 2007) also make minimal assumptions. However, because Nubbi draws on topic modeling (Blei et al. 2003a), it is able to uncover hidden and semantically meaningful groupings of relationships. Through the distinction between relationship topics and entity topics, it can better model the language used to describe relationships.

Finally, while other models have also leveraged the machinery of LDA to understand ensembles of entities and the words associated with them (Bhattacharya et al. 2008; Newman et al. 2006a; Rosen-Zvi et al. 2004), these models only learn hidden topics for individual entities. Nubbi models individual entities and pairs of entities distinctly. By controlling for features of individual entities and explicitly modeling relationships, Nubbi yields more powerful predictive models and can discover richer descriptions of relationships.
Chapter 6
Conclusion
In this thesis we have studied network data. These data may take the form of an online
social network, a social network of characters in a book, public figures in news articles, networks of webpages, networks of genes, etc. These data are already pervasive and will only increase in ubiquity as more users use online services which connect them with other users, or as biologists find ever more complicated interconnections between
proteins and genes of interest, or more literature and news becomes digitized and
scrutinized. Thus being able to learn from these data to gain insights and make
predictions is becoming ever more important. Predictive models can suggest new
friends for members of a social network or new citations for a paper, while descriptive
statistics can discover communities of friends or authors.
In this work we have introduced and explored several models of network data. The first and simplest of these models correlations between links most directly. Here, the central challenge is the speed at which observed data can be synthesized into a learned model. We develop techniques that drastically speed up this process, making these models more applicable to the large, real-world data that are becoming ubiquitous.
We then developed a model of network data that accounts for both links and
attributes. Given a corpus of documents with connections between them, the Relational
Topic Model can map those documents into a latent space leveraging the mechanisms
of topic modeling. With a trained model, we showed how one can predict links for a
node given only its attributes or attributes given only its links. Thus we can suggest
new citations for a document given only its content, or new interests for a user given
only their friends. We apply this model to several data sets including local news, Twitter, and scientific abstracts, and demonstrate the model's ability to make state-of-the-art predictions and find interesting perspectives on the data.
Finally, we turned our attention to cases where our understanding of the links is
incomplete or missing altogether. In particular, we focused on the problem of inferring
whether or not a link exists between two nodes, and if so, giving a latent-space
characterization of that relationship. It is important to know, for example, how two
people know each other in a social network or how two genes interact in a biological
network; linkage is not simply binary. While some resources for annotating edges exist,
they are limited and not scalable to the large and varied networks we have today. We
developed the Nubbi model to infer edges and their characterizations using only free
text. We showed qualitatively and quantitatively that our model can construct and
annotate graphs of relationships and make useful predictions.
In sum, this thesis has contributed a set of probabilistic models, along with
attendant inferential and predictive tools that make it possible to better uncover,
understand, and predict links.
Appendix A
Derivation of RTM Coordinate
Ascent Updates
Inference under the variational method amounts to finding values of the variational parameters $\gamma$ and $\Phi$ which optimize the evidence lower bound, $\mathcal{L}$, given in Equation 4.6. To do so, we first expand the expectations in these terms,
\[
\begin{aligned}
\mathcal{L} ={}& \sum_{(d_1,d_2)} \mathcal{L}_{d_1,d_2}
+ \sum_{d,n} \phi_{d,n}^T \log \beta_{\cdot,w_{d,n}}
+ \sum_{d,n} \phi_{d,n}^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right)
+ \sum_d (\alpha - 1)^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right) \\
&- \sum_{d,n} \phi_{d,n}^T \log \phi_{d,n}
- \sum_d (\gamma_d - 1)^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right)
+ \sum_d \left[\mathbf{1}^T \log\Gamma(\gamma_d) - \log\Gamma(\mathbf{1}^T\gamma_d)\right], \qquad \text{(A.1)}
\end{aligned}
\]
where $\mathcal{L}_{d_1,d_2}$ is defined as in Equation 4.7. Since $\mathcal{L}_{d_1,d_2}$ is independent of $\gamma$, we can collect all of the terms associated with $\gamma_d$ into
\[
\mathcal{L}_{\gamma_d} = \Big(\sum_n \phi_{d,n} + \alpha - \gamma_d\Big)^T \left(\Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right)
+ \mathbf{1}^T \log\Gamma(\gamma_d) - \log\Gamma(\mathbf{1}^T\gamma_d).
\]
Taking the derivatives and setting equal to zero leads to the following optimality condition,
\[
\Big(\sum_n \phi_{d,n} + \alpha - \gamma_d\Big)^T \left(\Psi'(\gamma_d) - \mathbf{1}\Psi'(\mathbf{1}^T\gamma_d)\right) = 0,
\]
which is satisfied by the update
\[
\gamma_d \leftarrow \alpha + \sum_n \phi_{d,n}. \qquad \text{(A.2)}
\]
In order to derive the update for $\phi_{d,n}$ we also collect its associated terms,
\[
\mathcal{L}_{\phi_{d,n}} = \phi_{d,n}^T\left(-\log\phi_{d,n} + \log\beta_{\cdot,w_{d,n}} + \Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d)\right) + \sum_{d' \neq d} \mathcal{L}_{d,d'}.
\]
Adding a Lagrange multiplier to ensure that $\phi_{d,n}$ normalizes and setting the derivative equal to zero leads to the following condition,
\[
\phi_{d,n} \propto \exp\Big\{ \log\beta_{\cdot,w_{d,n}} + \Psi(\gamma_d) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_d) + \sum_{d' \neq d} \partial_{\phi_{d,n}} \mathcal{L}_{d,d'} \Big\}. \qquad \text{(A.3)}
\]
The exact form of $\partial_{\phi_{d,n}} \mathcal{L}_{d,d'}$ will depend on the link probability function chosen. If the expected log link probability depends only on $\pi_{d_1,d_2} = \bar\phi_{d_1} \circ \bar\phi_{d_2}$, the gradients are given by Equation 4.10. When $\psi_N$ is chosen as the link probability function, we expand the expectation,
\[
\begin{aligned}
\mathbb{E}_q\left[\log \psi_N(\bar z_d, \bar z_{d'})\right]
&= -\eta^T \mathbb{E}_q\left[(\bar z_d - \bar z_{d'}) \circ (\bar z_d - \bar z_{d'})\right] + \nu \\
&= -\sum_i \eta_i \left(\mathbb{E}_q\big[\bar z_{d,i}^2\big] + \mathbb{E}_q\big[\bar z_{d',i}^2\big] - 2\,\bar\phi_{d,i}\,\bar\phi_{d',i}\right) + \nu. \qquad \text{(A.4)}
\end{aligned}
\]
Because each word is independent under the variational distribution, $\mathbb{E}_q\big[\bar z_{d,i}^2\big] = \mathrm{Var}(\bar z_{d,i}) + \bar\phi_{d,i}^2$, where $\mathrm{Var}(\bar z_{d,i}) = \frac{1}{N_d^2}\sum_n \phi_{d,n,i}(1 - \phi_{d,n,i})$. The gradient of this expression is given by Equation 4.11.
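To make these updates concrete, the following minimal Python sketch implements the per-document coordinate ascent of Equations A.2 and A.3. It is a sketch under assumptions rather than a reference implementation: the link-function contribution is taken as a precomputed per-document vector link_grad, and the array names and the use of NumPy/SciPy are illustrative.

import numpy as np
from scipy.special import digamma

def rtm_e_step(word_ids, log_beta, alpha, link_grad, n_iter=50):
    """Coordinate ascent for one document's variational parameters.

    word_ids  : length-N array of vocabulary indices for the document
    log_beta  : K x V array of log topic-word probabilities
    alpha     : length-K Dirichlet hyperparameter
    link_grad : length-K vector standing in for sum_{d'} dL_{d,d'}/dphi_{d,n}
    """
    K = log_beta.shape[0]
    N = len(word_ids)
    phi = np.full((N, K), 1.0 / K)        # q(z_{d,n}) initialized uniformly
    gamma = alpha + float(N) / K          # q(theta_d) initialized near the prior

    for _ in range(n_iter):
        # Equation A.3: phi_{d,n} propto exp{ log beta_{.,w_dn}
        #   + Psi(gamma_d) - Psi(1^T gamma_d) + link gradient }
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        log_phi = log_beta[:, word_ids].T + e_log_theta + link_grad
        log_phi -= log_phi.max(axis=1, keepdims=True)   # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)

        # Equation A.2: gamma_d <- alpha + sum_n phi_{d,n}
        gamma = alpha + phi.sum(axis=0)

    return gamma, phi

In a complete implementation, link_grad would be recomputed from the current variational means of the linked documents on every sweep.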
Appendix B
Derivation of RTM Parameter
Estimates
In order to estimate the parameters of our model, we find values of the topic multinomial parameters $\beta$ and link probability parameters $\eta, \nu$ which maximize the variational objective, $\mathcal{L}$, given in Equation 4.6.
To optimize $\beta$, it suffices to take the derivative of the expanded objective given in Equation A.1 along with a Lagrange multiplier to enforce normalization:
\[
\partial_{\beta_{k,w}} \mathcal{L} = \sum_{d,n} \phi_{d,n,k}\, \mathbf{1}(w = w_{d,n}) \frac{1}{\beta_{k,w_{d,n}}} + \lambda_k.
\]
Setting this quantity equal to zero and solving yields the update given in Equation 4.12.
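In code, the resulting M-step for the topic multinomials is a single accumulation pass over the expected assignments. The sketch below assumes the update normalizes expected counts as in Equation 4.12; the small smoothing constant is an implementation convenience rather than part of the derivation, and the array names are illustrative.

import numpy as np

def m_step_beta(docs, phis, K, V, smooth=1e-8):
    """Topic-word M-step: beta_{k,w} propto sum_{d,n} phi_{d,n,k} 1(w = w_{d,n}).

    docs : list of arrays of word ids
    phis : list of N_d x K arrays of variational assignments, aligned with docs
    """
    beta = np.full((K, V), smooth)        # small smoothing to avoid exact zeros
    for words, phi in zip(docs, phis):
        for w, phi_n in zip(words, phi):
            beta[:, w] += phi_n
    return beta / beta.sum(axis=1, keepdims=True)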
By taking the gradient of Equation A.1 with respect to $\eta$ and $\nu$, we can also derive updates for the link probability parameters. When the expectation of the logarithm of the link probability function depends only on $\eta^T\pi_{d,d'} + \nu$, as with all the link functions given in Equation 4.8, then these derivatives take a convenient form. For notational expedience, denote $\eta^+ = \langle \eta, \nu\rangle$ and $\pi^+_{d,d'} = \langle \pi_{d,d'}, 1\rangle$. Then the derivatives can be written as
\[
\begin{aligned}
\partial_{\eta^+} \mathcal{L}_{\psi_\sigma} &= \sum_{(d,d')} \left(1 - \sigma\big(\eta^{+T}\pi^+_{d,d'}\big)\right) \pi^+_{d,d'}, \\
\partial_{\eta^+} \mathcal{L}_{\psi_\Phi} &= \sum_{(d,d')} \frac{\Phi'\big(\eta^{+T}\pi^+_{d,d'}\big)}{\Phi\big(\eta^{+T}\pi^+_{d,d'}\big)}\, \pi^+_{d,d'}, \\
\partial_{\eta^+} \mathcal{L}_{\psi_e} &= \sum_{(d,d')} \pi^+_{d,d'}. \qquad \text{(B.1)}
\end{aligned}
\]
Note that all of these gradients are positive because we are faced with a one-class
estimation problem. Unchecked, the parameter estimates will diverge. While a variety
of techniques exist to address this problem, one set of strategies is to add regularization.
A common regularization for regression problems is the $\ell_2$ regularizer. This penalizes the objective $\mathcal{L}$ with the term $-\lambda\|\eta\|^2$, where $\lambda$ is a free parameter. This penalization has a Bayesian interpretation as a Gaussian prior on $\eta$.
In lieu of or in conjunction with $\ell_2$ regularization, one can also employ regularization which in effect injects some number of observations, $\rho$, for which the link variable $y = 0$. We associate with these observations a document similarity of
\[
\pi_\alpha = \frac{\alpha}{\mathbf{1}^T\alpha} \circ \frac{\alpha}{\mathbf{1}^T\alpha},
\]
the expected Hadamard product of any two documents given the Dirichlet prior of the model. Because both $\psi_\sigma$ and $\psi_\Phi$ are symmetric, the gradients of these regularization terms can be written as
\[
\partial_{\eta^+} R_\sigma = -\rho\, \sigma\big(\eta^{+T}\pi^+_\alpha\big)\, \pi^+_\alpha, \qquad
\partial_{\eta^+} R_\Phi = -\rho\, \frac{\Phi'\big(\eta^{+T}\pi^+_\alpha\big)}{1 - \Phi\big(\eta^{+T}\pi^+_\alpha\big)}\, \pi^+_\alpha.
\]
While this approach could also be applied to $\psi_e$, here we use a different approximation. We do this for two reasons. First, we cannot optimize the parameters of $\psi_e$ in an unconstrained fashion since this may lead to link functions which are not probabilities. Second, the approximation we propose will lead to explicit updates.
Because $\mathbb{E}_q[\log \psi_e(\bar z_d \circ \bar z_{d'})]$ is linear in $\pi_{d,d'}$ by Equation 4.8, this suggests a linear approximation of $\mathbb{E}_q[\log(1 - \psi_e(\bar z_d \circ \bar z_{d'}))]$. Namely, we let
\[
\mathbb{E}_q[\log(1 - \psi_e(\bar z_d \circ \bar z_{d'}))] \approx \eta'^T \pi_{d,d'} + \nu'.
\]
This leads to a penalty term of the form
\[
R_e = \rho\left(\eta'^T \pi_\alpha + \nu'\right).
\]
We fit the parameters of the approximation, $\eta'$ and $\nu'$, by making the approximation exact whenever $\pi_{d,d'} = 0$ or $\max \pi_{d,d'} = 1$. This yields the following $K + 1$ equations for the $K + 1$ parameters of the approximation:
\[
\nu' = \log(1 - \exp(\nu)), \qquad
\eta'_i = \log(1 - \exp(\eta_i + \nu)) - \nu'.
\]
Combining the gradient of the likelihood of the observations given in Equation B.1 with the gradient of the penalty $R_e$ and solving leads to the following updates:
\[
\begin{aligned}
\nu &\leftarrow \log\left(M - \mathbf{1}^T\bar\Pi\right) - \log\left(\rho(1 - \mathbf{1}^T\pi_\alpha) + M - \mathbf{1}^T\bar\Pi\right), \\
\eta &\leftarrow \log\left(\bar\Pi\right) - \log\left(\rho\,\pi_\alpha\right) - \nu\mathbf{1},
\end{aligned}
\]
where $M = \sum_{(d_1,d_2)} 1$ and $\bar\Pi = \sum_{(d_1,d_2)} \pi_{d_1,d_2}$. Note that because of the constraints on our approximation, these updates are guaranteed to yield parameters for which $0 \le \psi_e \le 1$.
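A minimal Python sketch of these closed-form updates, as reconstructed above, is given below; the array names are illustrative, and the expressions should be read as a sketch of the form of the updates rather than as a reference implementation.

import numpy as np

def update_psi_e(Pi_bar, pi_alpha, M, rho):
    """Closed-form updates for the exponential link function parameters.

    Pi_bar   : length-K vector, sum over observed links of pi_{d,d'}
    pi_alpha : length-K vector, expected Hadamard product under the Dirichlet prior
    M        : number of observed links
    rho      : number of injected regularization (non-link) observations
    """
    nu = (np.log(M - Pi_bar.sum())
          - np.log(rho * (1.0 - pi_alpha.sum()) + M - Pi_bar.sum()))
    eta = np.log(Pi_bar) - np.log(rho * pi_alpha) - nu
    return eta, nu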
Finally, in order to fit parameters for $\psi_N$, we begin by assuming the variance terms of Equation A.4 are small. Equation A.4 can then be written as
\[
\mathbb{E}_q\left[\log\psi_N(\bar z_d, \bar z_{d'})\right] = -\eta^T (\bar\phi_d - \bar\phi_{d'}) \circ (\bar\phi_d - \bar\phi_{d'}) + \nu,
\]
which is the log likelihood of a Gaussian distribution where $\bar\phi_d - \bar\phi_{d'}$ is random with mean 0 and diagonal variance $\frac{1}{2\eta}$. This suggests fitting $\eta$ using the empirically observed variance:
\[
\eta \leftarrow \frac{M}{2\sum_{d,d'} (\bar\phi_d - \bar\phi_{d'}) \circ (\bar\phi_d - \bar\phi_{d'})}.
\]
$\nu$ acts as a scaling factor for the Gaussian distribution; here we want only to ensure that the total probability mass respects the frequency of observed links to regularization observations. Equating the normalization constant of the distribution with the desired probability mass yields the update
\[
\nu \leftarrow \log\left(\frac{1}{2\pi}\right)^{K/2} + \log(\rho + M) - \log M - \frac{1}{2}\mathbf{1}^T\log\eta,
\]
guarding against values of $\nu$ which would make $\psi_N$ inadmissible as a probability.
Appendix C
Derivation of NUBBI
coordinate-ascent updates
For convenience, we break up the terms of the objective function into two classes: those that concern each pair of entities, $\mathcal{L}_{e,e'}$, and those that concern individual entities, $\mathcal{L}_e$. Equation 5.1 can then be rewritten as
\[
\mathcal{L} = \sum_{e,e'} \mathcal{L}_{e,e'} + \sum_e \mathcal{L}_e.
\]
We first expand $\mathcal{L}_e$ as
\[
\begin{aligned}
\mathcal{L}_e ={}& \sum_n \phi_{e,n}^T \log\beta_{\cdot,w_n}
+ \sum_n \phi_{e,n}^T\left(\Psi(\gamma_e) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_e)\right)
+ (\alpha_\theta - 1)^T\left(\Psi(\gamma_e) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_e)\right) \\
&- \sum_n \phi_{e,n}^T \log\phi_{e,n}
+ \mathbf{1}^T\log\Gamma(\gamma_e) - \log\Gamma(\mathbf{1}^T\gamma_e)
- (\gamma_e - 1)^T\left(\Psi(\gamma_e) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_e)\right).
\end{aligned}
\]
Next we expand $\mathcal{L}_{e,e'}$. In order to do so, we first define
\[
\xi_{e,e',n} \otimes \phi_{e,e',n} = \big\langle\, \xi_{e,e',n,1}\,\phi_{e,e',n,1},\;\; \xi_{e,e',n,2}\,\phi_{e,e',n,2},\;\; \xi_{e,e',n,3}\,\phi_{e,e',n,3} \,\big\rangle.
\]
Note that $\xi_{e,e',n} \otimes \phi_{e,e',n}$ defines a multinomial parameter vector of length $3K$ representing the multinomial probabilities for each $z_{e,e',n}, c_{e,e',n}$ assignment. In particular,
\[
q(z_{e,e',n} = z', c_{e,e',n} = c') = \xi_{e,e',n,c'}\,\phi_{e,e',n,c',z'}.
\]
Thus,
\[
\begin{aligned}
\mathcal{L}_{e,e'} ={}& \sum_n \left(\xi_{e,e',n} \otimes \phi_{e,e',n}\right)^T \log\big\langle \beta_{\cdot,w_n},\, \beta_{\cdot,w_n},\, \Omega_{\cdot,w_n} \big\rangle \\
&+ \sum_n \xi_{e,e',n,1}\,\phi_{e,e',n,1}^T\left(\Psi(\gamma_e) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_e)\right)
+ \sum_n \xi_{e,e',n,2}\,\phi_{e,e',n,2}^T\left(\Psi(\gamma_{e'}) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_{e'})\right) \\
&+ \sum_n \xi_{e,e',n,3}\,\phi_{e,e',n,3}^T\left(\Psi(\gamma_{e,e'}) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_{e,e'})\right)
+ \sum_n \xi_{e,e',n}^T\left(\Psi(\lambda_{e,e'}) - \mathbf{1}\Psi(\mathbf{1}^T\lambda_{e,e'})\right) \\
&- \sum_n \left(\xi_{e,e',n} \otimes \phi_{e,e',n}\right)^T \log\left(\xi_{e,e',n} \otimes \phi_{e,e',n}\right) \\
&+ \mathbf{1}^T\log\Gamma(\gamma_{e,e'}) - \log\Gamma(\mathbf{1}^T\gamma_{e,e'})
+ (\alpha_\psi - 1)^T\left(\Psi(\gamma_{e,e'}) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_{e,e'})\right)
- (\gamma_{e,e'} - 1)^T\left(\Psi(\gamma_{e,e'}) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_{e,e'})\right) \\
&+ \mathbf{1}^T\log\Gamma(\lambda_{e,e'}) - \log\Gamma(\mathbf{1}^T\lambda_{e,e'})
+ (\alpha_c - 1)^T\left(\Psi(\lambda_{e,e'}) - \mathbf{1}\Psi(\mathbf{1}^T\lambda_{e,e'})\right)
- (\lambda_{e,e'} - 1)^T\left(\Psi(\lambda_{e,e'}) - \mathbf{1}\Psi(\mathbf{1}^T\lambda_{e,e'})\right),
\end{aligned}
\]
Since $\phi_{e,n}$ only appears in $\mathcal{L}_e$, we can optimize this parameter by taking the gradient,
\[
\partial_{\phi_{e,n}} \mathcal{L}_e = \log\beta_{\cdot,w_n} + \Psi(\gamma_e) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_e) - \log\phi_{e,n} - \mathbf{1}.
\]
Setting this equal to zero yields the update equation for $\phi_{e,n}$ in Equation 5.2. To optimize $\phi_{e,e',n,1}$, it suffices to take the gradient of $\mathcal{L}_{e,e'}$,
\[
\partial_{\phi_{e,e',n,1}} \mathcal{L}_{e,e'} = \xi_{e,e',n,1}\left(\log\beta_{\cdot,w_n} + \Psi(\gamma_e) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_e) - \log\phi_{e,e',n,1} - \mathbf{1}\right).
\]
Setting this equal to zero yields the update in Equation 5.5. The updates for $\phi_{e,e',n,2}$ and $\phi_{e,e',n,3}$ are derived in exactly the same fashion.

Similarly, to derive the update for $\xi_{e,e',n,1}$, we take the partial derivative of $\mathcal{L}_{e,e'}$,
\[
\frac{\partial \mathcal{L}_{e,e'}}{\partial \xi_{e,e',n,1}}
= \Psi(\lambda_{e,e',1}) - \Psi(\mathbf{1}^T\lambda_{e,e'}) - \log\xi_{e,e',n,1} - 1
+ \phi_{e,e',n,1}^T\left(\log\beta_{\cdot,w_n} + \Psi(\gamma_e) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_e) - \log\phi_{e,e',n,1}\right).
\]
Replacing $\log\phi_{e,e',n,1}$ with the update equation given above, this expression reduces to
\[
\frac{\partial \mathcal{L}_{e,e'}}{\partial \xi_{e,e',n,1}}
= \Psi(\lambda_{e,e',1}) - \Psi(\mathbf{1}^T\lambda_{e,e'}) - \log\xi_{e,e',n,1} - 1
+ \log\sum_k \exp\left(\log\beta_{k,w_n} + \Psi(\gamma_{e,k}) - \Psi(\mathbf{1}^T\gamma_e)\right).
\]
Consequently the update for $\xi_{e,e',n}$ is Equation 5.6. In order to update $\gamma_{e,e'}$ we collect the terms which contain this parameter,
\[
\Big(\sum_n \xi_{e,e',n,3}\,\phi_{e,e',n,3} + \alpha_\psi - \gamma_{e,e'}\Big)^T\left(\Psi(\gamma_{e,e'}) - \mathbf{1}\Psi(\mathbf{1}^T\gamma_{e,e'})\right)
+ \mathbf{1}^T\log\Gamma(\gamma_{e,e'}) - \log\Gamma(\mathbf{1}^T\gamma_{e,e'}).
\]
The optimum for these terms is obtained when the condition in Equation 5.7 is satisfied. See Blei et al. (2003a) for details on this solution. Collecting terms associated with $\lambda_{e,e'}$ similarly leads to the update given in Equation 5.8.

We also collect terms to yield updates for $\gamma_e$. The terms associated with this variational parameter (and this variational parameter alone) span both $\mathcal{L}_{e,e'}$ and $\mathcal{L}_e$, and it is via these parameter updates in Equation 5.10 that evidence associated with individual entities and evidence associated with entity pairs is combined.

To find MAP estimates for $\beta$ and $\Omega$, note that both variables are multinomial and hence in the exponential family with topic-word assignment counts as sufficient statistics. Because the conjugate prior on these parameters is Dirichlet, the posterior is also a Dirichlet distribution with sufficient statistics defined by the observations plus the prior hyperparameter. These posterior sufficient statistics are precisely the right-hand sides of Equation 5.14. The MAP value of the parameters is achieved when the expected sufficient statistics equal the observed sufficient statistics, giving the updates in Equation 5.14.
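As an illustration of this last step, the following minimal Python sketch normalizes the posterior sufficient statistics for a topic-word multinomial. Whether a MAP correction of minus one is applied to the hyperparameter depends on the exact form of Equation 5.14, which is not reproduced here, so the function below should be read as a sketch; the array names are illustrative.

import numpy as np

def update_topic_word(expected_counts, eta):
    """Update for the topic-word parameters described above: normalize the
    posterior sufficient statistics (expected counts plus the prior hyperparameter).

    expected_counts : K x V array of expected topic-word assignment counts
    eta             : scalar or K x V Dirichlet hyperparameter
    """
    stats = expected_counts + eta
    return stats / stats.sum(axis=1, keepdims=True)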
Appendix D
Derivation of Gibbs sampling
equations
In this section we derive collapsed Gibbs sampling equations for the models presented in this thesis. Collapsed Gibbs sampling is an alternative to the variational approach: instead of approximating the posterior distribution by optimizing a variational lower bound, collapsed Gibbs sampling directly collects samples from the posterior distribution. In order to sample from the posterior, it suffices to compute the posterior (up to a constant) for a single assignment conditioned on all other assignments,
\[
p(z_{d,n} \mid z_{-(d,n)}, \alpha, \eta, w), \qquad \text{(D.1)}
\]
where $z_{-(d,n)}$ denotes the set of topic assignments to all words in all documents excluding $z_{d,n}$. For a review of Gibbs sampling and why this is the case, see ().

In contrast to variational inference, the equations we derive here are collapsed, that is, they integrate out variables such as the per-topic distribution over words, $\beta_k$, and the per-document distribution over topics, $\theta_d$. What remain are the topic assignments for each word, $z_{d,n}$.
D.1 Latent Dirichlet allocation (LDA)
First we compute the prior distribution over topic assignments.
\[
\int p(z_d \mid \theta_d)\, dp(\theta_d \mid \alpha)
= \int \prod_i \theta_{d,z_i} \frac{1}{B(\alpha)} \prod_k \theta_{d,k}^{\alpha_k - 1}\, d\theta_d
= \frac{1}{B(\alpha)} \int \prod_k \theta_{d,k}^{n_{d,k} + \alpha_k - 1}\, d\theta_d
= \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)}, \qquad \text{(D.2)}
\]
where $B(\alpha) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma\left(\sum_k \alpha_k\right)}$ is a normalizing constant, $n_{d,k} = \sum_i \mathbf{1}(z_{d,i} = k)$ counts the number of words in document $d$ assigned to topic $k$, and $n_{d,\cdot} = \langle n_{d,1}, n_{d,2}, \ldots, n_{d,K}\rangle$ is the vector of counts.
We then compute the likelihood of the word observations given a set of topic assignments,
\[
\int \prod_d p(w_d \mid z_d, \beta)\, dp(\beta \mid \eta)
= \int \prod_{d,i} \beta_{w_{d,i}, z_{d,i}} \prod_k \frac{1}{B(\eta)} \prod_w \beta_{w,k}^{\eta_w - 1}\, d\beta
= \prod_k \frac{1}{B(\eta)} \int \prod_w \beta_{w,k}^{\eta_w + n_{w,k} - 1}\, d\beta_k
= \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)}, \qquad \text{(D.3)}
\]
where $n_{w,k} = \sum_{d,i} \mathbf{1}(z_{d,i} = k \wedge w_{d,i} = w)$ counts the number of assignments of word $w$ to topic $k$ across all documents and $n_{\cdot,k} = \langle n_{1,k}, n_{2,k}, \ldots, n_{W,k}\rangle$ is the vector of these counts.
Combining Equation D.2 and Equation D.3, the posterior probability of a set of topic assignments can be written as
\[
p(z \mid \alpha, \eta, w) \propto p(w \mid z, \eta)\, p(z \mid \alpha)
= \int p(w \mid z, \beta)\, dp(\beta \mid \eta) \int p(z \mid \theta)\, dp(\theta \mid \alpha)
= \prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)}. \qquad \text{(D.4)}
\]
Conditioning on all other assignments, the posterior probability of a single assignment is then
\[
\begin{aligned}
p(z_{d,n} = k' \mid \alpha, \eta, w, z_{-(d,n)})
&\propto \prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)} \prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)} \\
&\propto B(n_{d,\cdot} + \alpha) \prod_k B(\eta + n_{\cdot,k}) \\
&= \frac{\prod_k \Gamma(n_{d,k} + \alpha_k)}{\Gamma\left(\sum_k n_{d,k} + \alpha_k\right)}
   \prod_k \frac{\prod_w \Gamma(\eta_w + n_{w,k})}{\Gamma\left(\sum_w n_{w,k} + \eta_w\right)} \\
&\propto \frac{1}{\Gamma\left(N_d + \sum_k \alpha_k\right)}
   \prod_k \frac{\Gamma(n_{d,k} + \alpha_k)\, \Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k})}{\Gamma\left(\sum_w n_{w,k} + \eta_w\right)} \\
&\propto \prod_k \frac{\Gamma(n_{d,k} + \alpha_k)\, \Gamma(\eta_{w_{d,n}} + n_{w_{d,n},k})}{\Gamma\left(\sum_w n_{w,k} + \eta_w\right)}, \qquad \text{(D.5)}
\end{aligned}
\]
where $N_d = \sum_k n_{d,k}$ denotes the number of words in document $d$. The second line follows because terms which are independent of the topic assignment $z_{d,n}$ are constants, the third line follows by definition of $B$, and the fourth line follows because the posterior cannot depend on counts over words other than $w_{d,n}$.
Finally, we make use of the identity
\[
\frac{\Gamma(x + b)}{\Gamma(x)} =
\begin{cases}
x & \text{if } b = 1 \\
1 & \text{if } b = 0
\end{cases}
\qquad \text{(D.6)}
\]
\[
= x^b, \qquad b \in \{0, 1\}, \qquad \text{(D.7)}
\]
which implies that
\[
\begin{aligned}
\Gamma(n_{d,k} + \alpha_k)
&= \Gamma\big(n_{d,k} - \mathbf{1}(k = k') + \alpha_k + \mathbf{1}(k = k')\big) \\
&= \Gamma\big(n_{d,k} - \mathbf{1}(k = k') + \alpha_k\big)\, \big(n_{d,k} - \mathbf{1}(k = k') + \alpha_k\big)^{\mathbf{1}(k = k')} \\
&= \Gamma\big(n_{d,k}^{-(d,n)} + \alpha_k\big)\, \big(n_{d,k}^{-(d,n)} + \alpha_k\big)^{\mathbf{1}(k = k')}, \qquad \text{(D.8)}
\end{aligned}
\]
where $n_{d,k'}^{-(d,n)} = \sum_{i \neq n} \mathbf{1}(z_{d,i} = k')$ denotes the number of words assigned to topic $k'$ in document $d$ excluding the current assignment, $z_{d,n}$. Because $n_{d,k'}^{-(d,n)}$ does not depend on the current assignment, $z_{d,n}$, it is a constant in the posterior computation; Equation D.8 then becomes
\[
\Gamma(n_{d,k} + \alpha_k) \propto \big(n_{d,k}^{-(d,n)} + \alpha_k\big)^{\mathbf{1}(k = k')}. \qquad \text{(D.9)}
\]
Applying the same identity to the other instances of the gamma function in Equation D.5 gives
\[
\Gamma\big(\eta_{w_{d,n}} + n_{w_{d,n},k}\big) \propto \big(\eta_{w_{d,n}} + n_{w_{d,n},k}^{-(d,n)}\big)^{\mathbf{1}(k = k')}, \qquad \text{(D.10)}
\]
\[
\Gamma\Big(\sum_w \eta_w + n_{w,k}\Big) \propto \Big(\sum_w \eta_w + n_{w,k}^{-(d,n)}\Big)^{\mathbf{1}(k = k')}, \qquad \text{(D.11)}
\]
where the exclusionary sum is similarly defined as $n_{w,k'}^{-(d,n)} = \sum_{d',i} \mathbf{1}\big(z_{d',i} = k' \wedge w_{d',i} = w \wedge (d,n) \neq (d',i)\big)$. Combining these identities with Equation D.5 yields
\[
\begin{aligned}
p(z_{d,n} = k' \mid \alpha, \eta, w, z_{-(d,n)})
&\propto \prod_k \left[\big(n_{d,k}^{-(d,n)} + \alpha_k\big)\,
   \frac{n_{w_{d,n},k}^{-(d,n)} + \eta_{w_{d,n}}}{\sum_w \eta_w + n_{w,k}^{-(d,n)}}\right]^{\mathbf{1}(k = k')} \\
&= \big(n_{d,k'}^{-(d,n)} + \alpha_{k'}\big)\,
   \frac{n_{w_{d,n},k'}^{-(d,n)} + \eta_{w_{d,n}}}{\sum_w \eta_w + n_{w,k'}^{-(d,n)}} \\
&= \big(n_{d,k'}^{-(d,n)} + \alpha_{k'}\big)\,
   \frac{n_{w_{d,n},k'}^{-(d,n)} + \eta_{w_{d,n}}}{N_{k'}^{-(d,n)} + W\eta}, \qquad \text{(D.12)}
\end{aligned}
\]
where for convenience we denote the total number of words assigned to topic $k'$ excluding the current assignment $z_{d,n}$ as $N_{k'}^{-(d,n)}$.
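Equation D.12 translates directly into a sampler. The following minimal Python sketch performs one collapsed Gibbs sweep for LDA with symmetric hyperparameters; the count-array names are illustrative.

import numpy as np

def gibbs_sweep(docs, z, n_dk, n_wk, n_k, alpha, eta):
    """One collapsed Gibbs sweep for LDA (Equation D.12).

    docs : list of arrays of word ids, one array per document
    z    : list of arrays of topic assignments, aligned with docs
    n_dk : D x K document-topic counts
    n_wk : V x K word-topic counts
    n_k  : length-K total words assigned to each topic
    alpha, eta : symmetric Dirichlet hyperparameters
    """
    V, K = n_wk.shape
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            k_old = z[d][n]
            # remove the current assignment from all counts
            n_dk[d, k_old] -= 1
            n_wk[w, k_old] -= 1
            n_k[k_old] -= 1

            # Equation D.12: (n_dk + alpha) * (n_wk + eta) / (n_k + W * eta)
            weights = (n_dk[d] + alpha) * (n_wk[w] + eta) / (n_k + V * eta)
            k_new = np.random.choice(K, p=weights / weights.sum())

            # add the new assignment back
            z[d][n] = k_new
            n_dk[d, k_new] += 1
            n_wk[w, k_new] += 1
            n_k[k_new] += 1
    return z, n_dk, n_wk, n_k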
D.2 Mixed-membership stochastic blockmodel (MMSB)
Because the observations in the mixed-membership stochastic blockmodel (MMSB) depend on pairs of topic assignments, the collapsed Gibbs sampling equations also depend on the pairwise posterior,
\[
p(z_{d,d',1} = k_1, z_{d,d',2} = k_2 \mid \alpha, \eta, y, z_{-(d,d')}), \qquad \text{(D.13)}
\]
where $z_{-(d,d')}$ denotes the set of topic assignments without the two assignments associated with the link between $d$ and $d'$, $z_{d,d',1}$ and $z_{d,d',2}$.
To compute this, as before we first compute the likelihood of the observations,
\[
\begin{aligned}
\int \prod_{d,d'} p(y_{d,d'} \mid z_{d,d',1}, z_{d,d',2}, \beta)\, dp(\beta \mid \eta)
&= \int \prod_{d,d'} \beta_{z_{d,d',1}, z_{d,d',2}}^{y_{d,d'}} \big(1 - \beta_{z_{d,d',1}, z_{d,d',2}}\big)^{1 - y_{d,d'}}
   \prod_{k,k'} \frac{1}{B(\eta)} \beta_{k,k'}^{\eta_1 - 1} (1 - \beta_{k,k'})^{\eta_0 - 1}\, d\beta \\
&= \prod_{k,k'} \frac{1}{B(\eta)} \int \beta_{k,k'}^{n_{k,k',1} + \eta_1 - 1} (1 - \beta_{k,k'})^{n_{k,k',0} + \eta_0 - 1}\, d\beta_{k,k'}
= \prod_{k,k'} \frac{B(\eta + n_{k,k'})}{B(\eta)}, \qquad \text{(D.14)}
\end{aligned}
\]
where $n_{k,k',i} = \sum_{d,d'} \mathbf{1}(z_{d,d',1} = k \wedge z_{d,d',2} = k' \wedge y_{d,d'} = i)$ counts the number of links of value $i$ for which the first node is assigned a topic of $k$ for that link and the second node is assigned a topic of $k'$ for that link. $n_{k,k'} = \langle n_{k,k',1}, n_{k,k',0}\rangle$ denotes the vector of these counts. Because the prior of the MMSB is the same as that of LDA, we can express the posterior (the analogue of Equation D.5) as
\[
\begin{aligned}
p(z_{d,d',1} = k_1, z_{d,d',2} = k_2 \mid \alpha, \eta, y, z_{-(d,d')})
&\propto \prod_{k,k'} (n_{d,k} + \alpha_k)(n_{d',k'} + \alpha_{k'})\, \frac{B(\eta + n_{k,k'})}{B(\eta)} \\
&\propto \prod_{k,k'} (n_{d,k} + \alpha_k)(n_{d',k'} + \alpha_{k'})\,
   \frac{\Gamma\big(\eta_{y_{d,d'}} + n_{k,k',y_{d,d'}}\big)}{\Gamma\big(\sum_i \eta_i + n_{k,k',i}\big)} \\
&\propto \big(n_{d,k_1}^{-(d,d')} + \alpha_{k_1}\big)\big(n_{d',k_2}^{-(d,d')} + \alpha_{k_2}\big)\,
   \frac{n_{k_1,k_2,y_{d,d'}}^{-(d,d')} + \eta_{y_{d,d'}}}{H + N_{k_1,k_2}^{-(d,d')}},
\end{aligned}
\]
where for convenience we denote the total number of links with $(k, k')$ as the participating topics, excluding the current link $(z_{d,d',1}, z_{d,d',2})$, as $N_{k,k'}^{-(d,d')}$. The first line follows by expanding the prior terms as in the derivation of Equation D.5. The second line follows by expanding $B$ and eliminating terms which are constant, and the last line follows using the identities used to derive Equation D.12.
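In code, the pairwise step amounts to building a K by K table of unnormalized weights and drawing a joint assignment from it. The sketch below assumes symmetric hyperparameters (so that H reduces to 2 eta) and count arrays matching those defined in the text; the names are illustrative.

import numpy as np

def sample_link_assignment(d1, d2, y, n_dk, n_kk, alpha, eta):
    """Sample (z_{d,d',1}, z_{d,d',2}) for one link with its counts removed.

    n_dk : D x K per-node topic counts (current link's assignments excluded)
    n_kk : K x K x 2 link counts n_{k,k',i} (current link excluded)
    y    : observed link value in {0, 1}
    """
    K = n_dk.shape[1]
    prior1 = n_dk[d1] + alpha                   # (n_{d,k} + alpha_k)
    prior2 = n_dk[d2] + alpha                   # (n_{d',k'} + alpha_k')
    lik = (n_kk[:, :, y] + eta) / (n_kk.sum(axis=2) + 2 * eta)
    weights = np.outer(prior1, prior2) * lik    # K x K unnormalized posterior
    flat = weights.ravel() / weights.sum()
    idx = np.random.choice(K * K, p=flat)
    return idx // K, idx % K                    # (k_1, k_2)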
D.3 Relational topic model (RTM)
The sampling equations for the relational topic model (RTM) are similar in spirit to the LDA sampling equations. For brevity, we restrict the derivation to the exponential response,¹
\[
p(y_{d,d'} = 1 \mid z_d, z_{d'}, b) \propto \exp\big(b^T (\bar z_d \circ \bar z_{d'})\big). \qquad \text{(D.15)}
\]
¹Here we depart from the notation used in previous chapters. We use $b$ for the regression coefficients instead of $\eta$. We also omit the regression intercept and absorb it into the normalization constant.
As with the MMSB, the prior distribution on $z$ is identical to that of LDA, so we omit its re-derivation. Thus the joint posterior can be written as
\[
p(z \mid \alpha, \eta, w, y, b) \propto
\prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)}
\prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)}
\prod_{d,d'} \exp\big(b^T (\bar z_d \circ \bar z_{d'})\big), \qquad \text{(D.16)}
\]
where the latter product is understood to range over $d, d'$ such that $y_{d,d'} = 1$. The posterior, following the derivation of Equation D.12, is
\[
\begin{aligned}
p(z_{d,n} = k' \mid \alpha, \eta, w, y, b, z_{-(d,n)})
&\propto \big(n_{d,k'}^{-(d,n)} + \alpha_{k'}\big)\,
   \frac{n_{w_{d,n},k'}^{-(d,n)} + \eta_{w_{d,n}}}{N_{k'}^{-(d,n)} + W\eta}
   \prod_{d'} \exp\big(b^T (\bar z_d \circ \bar z_{d'})\big) \\
&= \big(n_{d,k'}^{-(d,n)} + \alpha_{k'}\big)\,
   \frac{n_{w_{d,n},k'}^{-(d,n)} + \eta_{w_{d,n}}}{N_{k'}^{-(d,n)} + W\eta}
   \prod_{d'} \exp\big((b \circ \bar z_{d'})^T \bar z_d\big). \qquad \text{(D.17)}
\end{aligned}
\]
Notice that $\bar z_{d,k'} = \frac{1}{N_d}\sum_n \mathbf{1}(z_{d,n} = k') = \frac{1}{N_d}\sum_{n' \neq n} \mathbf{1}(z_{d,n'} = k') + \frac{1}{N_d}\mathbf{1}(z_{d,n} = k') = \bar z_{d,k'}^{-n} + \frac{1}{N_d}\mathbf{1}(z_{d,n} = k')$, where $\bar z_{d,k'}^{-n}$ is the mean topic assignment to topic $k'$ in document $d$ excluding that of the $n$th word, $z_{d,n}$. Because $\bar z_{d,k'}^{-n}$ does not depend on the topic assignment $z_{d,n}$, the last term of Equation D.17 can be efficiently computed as
\[
\begin{aligned}
\exp\big((b \circ \bar z_{d'})^T \bar z_d\big)
&= \exp\Big(\sum_k b_k\, \bar z_{d,k}\, \bar z_{d',k}\Big) \\
&= \exp\Big(\sum_k b_k \Big(\bar z_{d,k}^{-n} + \tfrac{1}{N_d}\mathbf{1}(k = k')\Big) \bar z_{d',k}\Big) \\
&= \exp\Big(\sum_k b_k\, \bar z_{d,k}^{-n}\, \bar z_{d',k} + \tfrac{b_{k'}}{N_d}\, \bar z_{d',k'}\Big) \\
&\propto \exp\Big(\tfrac{b_{k'}}{N_d}\, \bar z_{d',k'}\Big)
= \exp\Big(\tfrac{b_{k'}}{N_d}\, \tfrac{n_{d',k'}}{N_{d'}}\Big), \qquad \text{(D.18)}
\end{aligned}
\]
where the second line follows using our identity on $\bar z_{d,k'}$ and the last proportionality follows from the fact that the terms in the left sum do not depend on the current topic assignment, $z_{d,n}$. Finally, the last equality stems from the definitions of $\bar z_{d',k'}$ and $n_{d',k'}$. This expression is efficient because it is constant for all words in a document and thus need only be computed once per document. Combining Equation D.18 and Equation D.17 yields
\[
p(z_{d,n} = k' \mid \alpha, \eta, w, y, b, z_{-(d,n)})
\propto \big(n_{d,k'}^{-(d,n)} + \alpha_{k'}\big)\,
\frac{n_{w_{d,n},k'}^{-(d,n)} + \eta_{w_{d,n}}}{N_{k'}^{-(d,n)} + W\eta}\,
\exp\Big(\sum_{d'} \frac{b_{k'}}{N_d}\, \frac{n_{d',k'}}{N_{d'}}\Big). \qquad \text{(D.19)}
\]
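Relative to the LDA sampler, Equation D.19 only multiplies the per-topic weights by one additional document-level factor, which can be precomputed once per document. A minimal Python sketch of that factor, assuming the notation of Equation D.19 (the argument names are illustrative), follows.

import numpy as np

def rtm_topic_weights(lda_weights, b, N_d, linked_doc_props):
    """Scale LDA sampling weights by the RTM link term of Equation D.19.

    lda_weights      : length-K unnormalized weights from Equation D.12
    b                : length-K regression coefficients
    N_d              : number of words in document d
    linked_doc_props : list of length-K vectors n_{d',.}/N_{d'} for each d'
                       linked to d; constant within a document, so it can be
                       precomputed once per document
    """
    if len(linked_doc_props) > 0:
        link_sum = np.sum(linked_doc_props, axis=0)   # sum_{d'} n_{d',k} / N_{d'}
    else:
        link_sum = np.zeros_like(b)
    weights = lda_weights * np.exp(b * link_sum / N_d)
    return weights / weights.sum()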
D.4 Supervised latent Dirichlet allocation (sLDA)
We derive the sampling equations for supervised latent Dirichlet allocation. Here, we consider Gaussian errors, but the derivation can be easily extended to other models as well,
\[
\begin{aligned}
p(y_d \mid z_d, b, a)
&\propto \exp\big(-(y_d - b^T \bar z_d - a)^2\big) \\
&\propto \exp\big(-(y_d - a)^2 + 2 b^T \bar z_d (y_d - a) - (b^T \bar z_d)^2\big) \\
&\propto \exp\big(2 b^T \bar z_d (y_d - a) - (b^T \bar z_d)^2\big), \qquad \text{(D.20)}
\end{aligned}
\]
where the proportionality is with respect to $z_d$.
The prior distribution on $z$ is identical to that of LDA. Thus the joint posterior can be written as
\[
p(z \mid \alpha, \eta, w, y, b, a) \propto
\prod_d \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)}
\prod_k \frac{B(\eta + n_{\cdot,k})}{B(\eta)}
\prod_d \exp\big(-(y_d - b^T \bar z_d - a)^2\big). \qquad \text{(D.21)}
\]
The sampling equation, following the derivation of Equation D.12, is
\[
p(z_{d,n} = k' \mid \alpha, \eta, w, y, b, a, z_{-(d,n)})
\propto \big(n_{d,k'}^{-(d,n)} + \alpha_{k'}\big)\,
\frac{n_{w_{d,n},k'}^{-(d,n)} + \eta_{w_{d,n}}}{N_{k'}^{-(d,n)} + W\eta}\,
\exp\big(-(y_d - b^T \bar z_d - a)^2\big). \qquad \text{(D.22)}
\]
The right-most term can be expanded as
\[
\begin{aligned}
\exp\big(2 b^T \bar z_d (y_d - a) - (b^T \bar z_d)^2\big)
&= \exp\Big(2 \sum_k b_k\, \bar z_{d,k}^{-n} (y_d - a) + 2\, \tfrac{y_d - a}{N_d}\, b_{k'} - (b^T \bar z_d)^2\Big) \\
&\propto \exp\Big(2\, \tfrac{y_d - a}{N_d}\, b_{k'} - (b^T \bar z_d)^2\Big) \\
&\propto \exp\Big(2\, \tfrac{y_d - a}{N_d}\, b_{k'} - \Big(\sum_k b_k\, \bar z_{d,k}^{-n} + \tfrac{b_{k'}}{N_d}\Big)^2\Big) \\
&\propto \exp\Big(2\, \tfrac{y_d - a}{N_d}\, b_{k'} - 2\, \tfrac{b_{k'}}{N_d}\, b^T \bar z_d^{-n} - \big(\tfrac{b_{k'}}{N_d}\big)^2\Big) \\
&= \exp\Big(2\, \tfrac{b_{k'}}{N_d}\big(y_d - a - b^T \bar z_d^{-n}\big) - \big(\tfrac{b_{k'}}{N_d}\big)^2\Big), \qquad \text{(D.23)}
\end{aligned}
\]
yielding
\[
p(z_{d,n} = k' \mid \alpha, \eta, w, y, b, a, z_{-(d,n)})
\propto \big(n_{d,k'}^{-(d,n)} + \alpha_{k'}\big)\,
\frac{n_{w_{d,n},k'}^{-(d,n)} + \eta_{w_{d,n}}}{N_{k'}^{-(d,n)} + W\eta}\,
\exp\Big(2\, \frac{b_{k'}}{N_d}\big(y_d - a - b^T \bar z_d^{-n}\big) - \Big(\frac{b_{k'}}{N_d}\Big)^2\Big). \qquad \text{(D.24)}
\]
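The sLDA sampler likewise differs from LDA only in the response factor of Equation D.24, which is cheap to compute if $b^T \bar z_d^{-n}$ is maintained incrementally. A minimal Python sketch, assuming the notation above (argument names are illustrative), follows.

import numpy as np

def slda_topic_weights(lda_weights, b, a, y_d, N_d, zbar_minus_n):
    """Scale LDA sampling weights by the sLDA response term of Equation D.24.

    lda_weights  : length-K unnormalized weights from Equation D.12
    b, a         : regression coefficients and intercept
    y_d          : observed response for document d
    N_d          : number of words in document d
    zbar_minus_n : length-K mean topic assignments excluding the current word
    """
    resid = y_d - a - b @ zbar_minus_n
    log_factor = 2.0 * (b / N_d) * resid - (b / N_d) ** 2
    weights = lda_weights * np.exp(log_factor)
    return weights / weights.sum()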
D.5 Networks uncovered by Bayesian inference (NUBBI)
model
The networks uncovered by Bayesian inference (NUBBI) model is a switching model, wherein each word can be explained by one of three distributions: the distribution of the first entity, $\theta_e$, the distribution of the second entity, $\theta_{e'}$, or the distribution over their relationship, $\theta_{e,e'}$. Each of these generates topic assignments with the same structure as LDA, so their contributions to the posterior are the same, conditioned on the assignments from words to distributions, which also follows a Dirichlet-Multinomial distribution.
Hence, the joint posterior over topic assignments and source assignments is
\[
p(z, c \mid \alpha, \eta, w) \propto
\prod_e \frac{B\big(n^e_{e,\cdot} + \alpha_\theta\big)}{B(\alpha_\theta)}
\prod_k \frac{B\big(\eta_\beta + n^e_{\cdot,k}\big)}{B(\eta_\beta)}
\prod_\varepsilon \frac{B\big(n^\varepsilon_{\varepsilon,\cdot} + \alpha_\psi\big)}{B(\alpha_\psi)}
\prod_k \frac{B\big(\eta_\Omega + n^\varepsilon_{\cdot,k}\big)}{B(\eta_\Omega)}
\prod_\varepsilon \frac{B\big(n^c_{\varepsilon,\cdot} + \alpha_c\big)}{B(\alpha_c)}. \qquad \text{(D.25)}
\]
Here we have used the shorthand $\varepsilon = (e, e')$ to denote iteration over pairs of entities. We have also introduced new count variables for documents associated with individual entities, documents associated with pairs of entities, and source assignments:
\[
\begin{aligned}
n^e_{w,k} &= \sum_{e,i} \mathbf{1}(z_{e,i} = k \wedge w_{e,i} = w)
+ \sum_{\varepsilon,i} \mathbf{1}(z_{\varepsilon,i} = k \wedge w_{\varepsilon,i} = w \wedge c_{\varepsilon,i} = 1)
+ \sum_{\varepsilon,i} \mathbf{1}(z_{\varepsilon,i} = k \wedge w_{\varepsilon,i} = w \wedge c_{\varepsilon,i} = 2) \qquad \text{(D.26)} \\
n^\varepsilon_{w,k} &= \sum_{\varepsilon,i} \mathbf{1}(c_{\varepsilon,i} = 3 \wedge w_{\varepsilon,i} = w \wedge z_{\varepsilon,i} = k) \qquad \text{(D.27)} \\
n^e_{e,k} &= \sum_i \mathbf{1}(z_{e,i} = k)
+ \sum_{\varepsilon,i} \mathbf{1}(z_{\varepsilon,i} = k \wedge c_{\varepsilon,i} = 1 \wedge \varepsilon_1 = e)
+ \sum_{\varepsilon,i} \mathbf{1}(z_{\varepsilon,i} = k \wedge c_{\varepsilon,i} = 2 \wedge \varepsilon_2 = e) \qquad \text{(D.28)} \\
n^\varepsilon_{\varepsilon,k} &= \sum_i \mathbf{1}(c_{\varepsilon,i} = 3 \wedge z_{\varepsilon,i} = k) \qquad \text{(D.29)} \\
n^c_{\varepsilon,k} &= \sum_i \mathbf{1}(c_{\varepsilon,i} = k), \qquad \text{(D.30)}
\end{aligned}
\]
with marginals being defined as before. There are two sampling equations to be considered. First, when sampling the topic assignment for a word in an entity's document,
\[
p(z_{e,n} = k \mid \alpha, \eta, w, z_{-(e,n)}, c) \propto
\big(n^{e,-(e,n)}_{e,k} + \alpha_{\theta,k}\big)\,
\frac{n^{e,-(e,n)}_{w_{e,n},k} + \eta_{\beta,w_{e,n}}}{N^{e,-(e,n)}_k + W\eta_\beta}. \qquad \text{(D.31)}
\]
Second, when sampling the topic and source assignment for a word in an entity pair's document,
\[
\begin{aligned}
p(z_{\varepsilon,n} = k, c_{\varepsilon,n} = 1 \mid \alpha, \eta, w, z_{-(\varepsilon,n)}, c_{-(\varepsilon,n)})
&\propto \frac{n^{e,-(\varepsilon,n)}_{e,k} + \alpha_{\theta,k}}{\sum_{k'} n^{e,-(\varepsilon,n)}_{e,k'} + K\alpha_\theta}\,
\frac{n^{e,-(\varepsilon,n)}_{w_{\varepsilon,n},k} + \eta_{\beta,w_{\varepsilon,n}}}{N^{e,-(\varepsilon,n)}_k + W\eta_\beta}\,
\big(n^{c,-(\varepsilon,n)}_{\varepsilon,1} + \alpha_c\big) \qquad \text{(D.32)} \\
p(z_{\varepsilon,n} = k, c_{\varepsilon,n} = 2 \mid \alpha, \eta, w, z_{-(\varepsilon,n)}, c_{-(\varepsilon,n)})
&\propto \frac{n^{e,-(\varepsilon,n)}_{e',k} + \alpha_{\theta,k}}{\sum_{k'} n^{e,-(\varepsilon,n)}_{e',k'} + K\alpha_\theta}\,
\frac{n^{e,-(\varepsilon,n)}_{w_{\varepsilon,n},k} + \eta_{\beta,w_{\varepsilon,n}}}{N^{e,-(\varepsilon,n)}_k + W\eta_\beta}\,
\big(n^{c,-(\varepsilon,n)}_{\varepsilon,2} + \alpha_c\big) \qquad \text{(D.33)} \\
p(z_{\varepsilon,n} = k, c_{\varepsilon,n} = 3 \mid \alpha, \eta, w, z_{-(\varepsilon,n)}, c_{-(\varepsilon,n)})
&\propto \frac{n^{\varepsilon,-(\varepsilon,n)}_{\varepsilon,k} + \alpha_{\psi,k}}{\sum_{k'} n^{\varepsilon,-(\varepsilon,n)}_{\varepsilon,k'} + K\alpha_\psi}\,
\frac{n^{\varepsilon,-(\varepsilon,n)}_{w_{\varepsilon,n},k} + \eta_{\Omega,w_{\varepsilon,n}}}{N^{\varepsilon,-(\varepsilon,n)}_k + W\eta_\Omega}\,
\big(n^{c,-(\varepsilon,n)}_{\varepsilon,3} + \alpha_c\big). \qquad \text{(D.34)}
\end{aligned}
\]
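For a word in a pair's document, the sampler therefore draws a joint (topic, source) assignment from the 3K weights given by Equations D.32, D.33, and D.34. The following minimal Python sketch assumes symmetric hyperparameters and the count arrays defined above; the dictionary keys and array names are illustrative, and sources are indexed 0, 1, 2 rather than 1, 2, 3.

import numpy as np

def sample_pair_word(w, e1, e2, pair, counts, alpha_t, alpha_r, alpha_c, eta_b, eta_o):
    """Jointly sample (topic, source) for one word in an entity pair's document.

    counts is a dict of count arrays with the current assignment removed:
      'ent_dk' : E x K entity topic counts            (n^e_{e,k})
      'ent_wk' : V x K entity topic-word counts       (n^e_{w,k})
      'rel_dk' : P x K relationship topic counts      (n^eps_{eps,k})
      'rel_wk' : V x K relationship topic-word counts (n^eps_{w,k})
      'src'    : P x 3 source assignment counts       (n^c_{eps,c})
    """
    K = counts['ent_dk'].shape[1]
    V = counts['ent_wk'].shape[0]
    weights = np.zeros((3, K))
    cases = [
        (counts['ent_dk'][e1], counts['ent_wk'], alpha_t, eta_b),   # source 0: entity e1
        (counts['ent_dk'][e2], counts['ent_wk'], alpha_t, eta_b),   # source 1: entity e2
        (counts['rel_dk'][pair], counts['rel_wk'], alpha_r, eta_o), # source 2: relationship
    ]
    for c, (dk, wk, a_doc, a_word) in enumerate(cases):
        doc_term = (dk + a_doc) / (dk.sum() + K * a_doc)
        word_term = (wk[w] + a_word) / (wk.sum(axis=0) + V * a_word)
        weights[c] = doc_term * word_term * (counts['src'][pair, c] + alpha_c)
    flat = weights.ravel() / weights.sum()
    idx = np.random.choice(3 * K, p=flat)
    return idx % K, idx // K     # (topic, source)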
Bibliography
E. Agichtein and L. Gravano. Querying text databases for efficient information
extraction. Data Engineering, International Conference on, 0:113, 2003. ISSN
1063-6382. doi: http://doi.ieeecomputersociety.org/10.1109/ICDE.2003.1260786.
E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership stochastic block-
models. Journal of Machine Learning Research, pages 1981-2014, September 2008.
URL http://arxiv.org/pdf/0705.4485.
A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence and correlation in social
networks. KDD 2008, 2008.
G. Andrew and J. Gao. Scalable training of l1-regularized log-linear models. Proceed-
ings of the 24th international Conference on Machine Learning, Jan 2007. URL
http://portal.acm.org/citation.cfm?id=1273496.1273501.
C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonpara-
metric problems. The Annals of Statistics, 2(6):1152-1174, 1974.
M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni.
Open information extraction from the web. In IJCAI 2007, 2007. URL
http://www.ijcai.org/papers07/Papers/IJCAI07-429.pdf.
K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching
words and pictures. Journal of Machine Learning Research, 3:1107-1135, 2003.
J. Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179-195,
1975. ISSN 00390526. doi: http://dx.doi.org/10.2307/2987782. URL
http://dx.doi.org/10.2307/2987782.
J. Besag. On the statistical analysis of dirty pictures. Jour-
nal of the Royal Statistical Society, 48(3):259-302, 1986. URL
http://www.informaworld.com/index/739172868.pdf.
I. Bhattacharya, S. Godbole, and S. Joshi. Structured entity identification and
document categorization: Two tasks with one joint model. KDD 2008, 2008.
C. M. Bishop, D. Spiegelhalter, and J. Winn. Vibes: A variational in-
ference engine for bayesian networks. In NIPS 2002, 2002. URL
http://scholar.google.fi/url?sa=U&q=http://books.nips.cc/papers/files/nips15/AA37.pdf.
A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation
using an adaptive gmmrf model. pages Vol I: 428441, 2004.
D. Blei and M. Jordan. Modeling annotated data. Proceedings of the 26th annual
international ACM SIGIR Conference on Research and Development in Information
Retrieval, 2003. URL http://portal.acm.org/citation.cfm?id=860460.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3:993-1022, 2003a.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet alloca-
tion. Journal of Machine Learning Research, 2003b. URL
http://www.mitpressjournals.org/doi/abs/10.1162/jmlr.2003.3.4-5.993.
D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures.
Bayesian Analysis, 1(1):121144, Oct 2006.
D. M. Blei and J. D. McAuliffe. Supervised topic models. Neural Information
Processing Systems, Aug 2007.
J. Boyd-Graber and D. M. Blei. Syntactic topic models. In Neural Information
Processing Systems, Dec 2008.
M. Braun and J. McAuliffe. Variational inference for large-scale models
of discrete choice. Arxiv preprint arXiv:0712.2526, Jan 2007. URL
http://arxiv.org/pdf/0712.2526.
D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Mining hidden commu-
nity in heterogeneous social networks. LinkKDD 2005, Aug 2005. URL
http://portal.acm.org/citation.cfm?id=1134271.1134280.
M. Carreira-Perpinan and G. Hinton. On contrastive divergence
learning. Artificial Intelligence and Statistics, Jan 2005. URL
http://www.csri.utoronto.ca/ hinton/absps/cdmiguel.pdf.
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext classification
using hyperlinks. Proc. ACM SIGMOD, 1998. URL
http://citeseer.ist.psu.edu/article/chakrabarti98enhanced.html.
J. Chang and D. M. Blei. Relational topic models for document networks. 2009.
J. Chang and D. M. Blei. Hierarchical relational models for document networks.
Annals of Applied Statistics, 4(1), 2010.
J. Chang, J. Boyd-Graber, and D. M. Blei. Connections between the lines: Augmenting
social networks with text. 2009.
S. F. Chen and R. Rosenfeld. A survey of smoothing techniques for me models. IEEE
Transactions on Speech and Audio Processing, 8(1), Jun 2000.
D. Cohn and T. Hofmann. The missing link-a probabilistic model of document content
and hypertext connectivity. Advances in Neural Information Processing Systems 13,
2001.
M. Craven, D. DiPasquo, D. Freitag, and A. McCallum. Learning to ex-
tract symbolic knowledge from the world wide web. Proc. AAAI, 1998. URL
http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/1998/CMU-CS-98-122.pdf.
A. Culotta, R. Bekkerman, and A. McCallum. Extracting social networks and
contact information from email and the web. AAAI 2005, 2005. URL
http://www.cs.umass.edu/ ronb/papers/dex.pdf.
D. Davidov, A. Rappoport, and M. Koppel. Fully unsupervised discovery of concept-
specific relationships by web mining. In ACL, 2007.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:138, 1977.
C. Diehl, G. M. Namata, and L. Getoor. Relationship identication for social network
discovery. In AAAI 2007, July 2007.
L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences.
Proc. ICML, 2007. URL http://portal.acm.org/citation.cfm?id=1273526.
M. Dudík, S. Phillips, and R. Schapire. Maximum entropy density estimation
with generalized regularization and an application to species distribution mod-
eling. The Journal of Machine Learning Research, 8:12171260, Jan 2007. URL
http://portal.acm.org/citation.cfm?id=1314540.
B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-
validation. Journal of the American Statistical Association, 78(382), 1983.
E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific
publications. Proceedings of the National Academy of Sciences, 2004.
E. Erosheva, S. Fienberg, and C. Joutard. Describing disability through individual-
level mixture models for multivariate binary data. Annals of Applied Statistics,
2007.
L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene
categories. Computer Vision and Pattern Recognition, 2005.
T. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of
Statistics, 1:209230, 1973.
S. E. Fienberg, M. M. Meyer, and S. Wasserman. Statistical analysis of multiple
sociometric relations. Journal of the American Statistical Association, 80:51-67,
1985.
M. E. Fisher. On the dimer solution of planar ising models. Journal of
Mathematical Physics, 7(10):1776-1781, 1966. doi: 10.1063/1.1704825. URL
http://link.aip.org/link/?JMP/7/1776/1.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with
the graphical lasso. Biostatistics, 2007.
J. Gao, H. Suzuki, and B. Yu. Approximation lasso methods for language modeling.
Proceedings of the 21st International Conference on Computational Linguistics, Jan
2006. URL http://acl.ldc.upenn.edu/P/P06/P06-1029.pdf.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, (6):721741, 1984.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning prob-
abilistic models of relational structure. Proc. ICML, 2001. URL
http://ai.stanford.edu/users/nir/Papers/GFTK1.pdf.
D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web commu-
nities from link topology. HYPERTEXT 1998, May 1998. URL
http://portal.acm.org/citation.cfm?id=276627.276652.
A. Globerson and T. S. Jaakkola. Approximate inference using planar graph decom-
position. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural
Information Processing Systems 19, pages 473-480. MIT Press, Cambridge, MA,
2007.
A. Globerson, T. Koo, X. Carreras, and M. Collins. Exponentiated gra-
dient algorithms for log-linear structured prediction. Proceedings of the
24th international Conference on Machine Learning, Jan 2007. URL
http://portal.acm.org/citation.cfm?id=1273535.
J. Goodman. Exponential priors for maximum entropy models. Mar 2004.
A. Gruber, M. Rosen-Zvi, and Y. Weiss. Latent topic models for hypertext. Uncertainty
in Artificial Intelligence, May 2008.
P. Haffner, S. Phillips, and R. Schapire. Efficient multiclass implementations of
l1-regularized maximum entropy. May 2006.
P. Hoff, A. Raftery, and M. Handcock. Latent space approaches to social network
analysis. Journal of the American Statistical Association, 2002.
J. Hofman and C. Wiggins. A Bayesian approach to network modularity. eprint arXiv:
0709.3512, 2007. URL http://arxiv.org/pdf/0709.3512.
T. Hofmann. Probabilistic latent semantic indexing. SIGIR, 1999. URL
http://portal.acm.org/citation.cfm?id=312649.
E. Ising. Beitrag zur theorie des ferromagnetismus. Zeitschrift für Physik, 31:253-258,
1925.
T. S. Jaakkola and M. I. Jordan. Variational methods and the qmr-dt database. MIT
Computational Cognitive Science Technical Report 9701, page 23, Jan 1999.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to
variational methods for graphical models. Oct 1999.
D. Jurafsky and J. Martin. Speech and language processing. Prentice Hall, 2008.
S. Katrenko and P. Adriaans. Learning relations from biomedical corpora us-
ing dependency trees. Lecture Notes in Computer Science, 2007. URL
http://www.springerlink.com/index/n145566q7t1u4365.pdf.
J. Kazama and J. Tsujii. Evaluation and extension of maximum entropy models with
inequality constraints. Jun 2003.
C. Kemp, T. Griffiths, and J. Tenenbaum. Discovering latent
classes in relational data. MIT AI Memo 2004-019, 2004. URL
http://www-psych.stanford.edu/ gruffydd/papers/blockTR.pdf.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the
ACM (JACM), 1999. URL http://portal.acm.org/citation.cfm?id=324140.
M. Kolar and E. P. Xing. Improved estimation of high-dimensional ising models, 2008.
URL http://www.citebase.org/abstract?id=oai:arXiv.org:0811.1239.
V. Kolmogorov. Convergent tree-reweighted message passing for energy mini-
mization. IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1568-1583, October
2006. ISSN 0162-8828. doi: http://dx.doi.org/10.1109/TPAMI.2006.200. URL
http://dx.doi.org/10.1109/TPAMI.2006.200.
J. Lafferty and L. Wasserman. Rodeo: Sparse, greedy nonparametric regression. The
Annals of Statistics, Jan 2008.
J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins. Microscopic evolution of
social networks. KDD 2008, 2008a.
J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Statistical properties of
community structure in large social and information networks. WWW 2008, 2008b.
URL http://portal.acm.org/citation.cfm?id=1367591.
D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson.
The bottlenose dolphin community of doubtful sound features a large proportion
of long-lasting associations. can geographic isolation explain this unique trait?
Behavioral Ecology and Sociobiology, 54:396-405, 2003.
J. Majewski, H. Li, and J. Ott. The Ising model in physics and statistical genetics.
American Journal of Human Genetics, 69:853-862, 2001.
R. Malouf. A comparison of algorithms for maximum entropy parameter estimation.
Jun 2002.
A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction
of internet portals with machine learning. Information Retrieval, 2000. URL
http://www.springerlink.com/index/R1723134248214T0.pdf.
A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in
social networks. Proceedings of the Nineteenth International Joint Conference on
Artificial Intelligence, 2005. URL http://www.ijcai.org/papers/1623.pdf.
A. McGovern, L. Friedland, M. Hay, B. Gallagher, A. Fast, J. Neville, and
D. Jensen. Exploiting relational structure to understand publication patterns
in high-energy physics. ACM SIGKDD Explorations Newsletter, 5(2), Dec 2003.
URL http://portal.acm.org/citation.cfm?id=980972.980999.
E. Meeds, Z. Ghahramani, R. Neal, and S. Roweis. Modeling dyadic data with binary
latent factors. NIPS 2007, 2007.
Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent
patterns. KDD 2007, 1(3), 2007.
Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization.
WWW 08: Proceeding of the 17th international conference on World Wide Web,
Apr 2008. URL http://portal.acm.org/citation.cfm?id=1367497.1367512.
N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection
with the lasso. Annals of Statistics, Jan 2006.
T. Minka and Y. Qi. Tree-structured approximations by expectation propagation. In
Proc. Neural Information Processing Systems Conf. (NIPS), 2003.
K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate
inference: An empirical study. Proceedings of Uncertainty in AI, Jan 1999. URL
http://www.vision.ethz.ch/ks/slides/murphy99loopy.pdf.
R. Nallapati and W. Cohen. Link-pLSA-LDA: A new unsupervised model for topics
and influence of blogs. ICWSM, 2008.
R. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent topic models for
text and citations. Proceedings of the 14th ACM SIGKDD international conference
on Knowledge discovery and data mining, 2008.
O. J. Nave. Nave's Topical Bible. Thomas Nelson, 2003. ISBN 0785250581.
R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. CRG-
TR-93-1, May 1993.
D. Newman, C. Chemudugunta, and P. Smyth. Statistical entity-topic models. In
KDD 2006, pages 680-686, New York, NY, USA, 2006a. ACM. ISBN 1-59593-339-5.
doi: http://doi.acm.org/10.1145/1150402.1150487.
M. Newman. The structure and function of net-
works. Computer Physics Communications, 2002. URL
http://linkinghub.elsevier.com/retrieve/pii/S0010465502002011.
M. E. J. Newman. Finding community structure in networks using the eigenvectors of
matrices. Phys. Rev. E, 74(036104), 2006a.
M. E. J. Newman. Modularity and community structure in networks. Proceedings of
the National Academy of Sciences, 103(23), 2006b. doi: 10.1073/pnas.0601602103.
URL http://arxiv.org/abs/physics/0602124v1.
M. E. J. Newman, A.-L. Barabsi, and D. J. Watts. The Structure and Dynamics of
Networks. Princeton University Press, 2006b.
T. Ohta, Y. Tateisi, and J.-D. Kim. Genia corpus: an annotated research abstract
corpus in molecular biology domain. In HLT 2002, San Diego, USA, 2002. URL
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/paper/hlt2002GENIA.pdf.
J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference.
1988.
J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using
multilocus genotype data. Genetics, 155:945-959, June 2000.
S. Riezler and A. Vasserman. Incremental feature selection and l1 reg-
ularization for relaxed maximum-entropy modeling. Proceedings of the
2004 Conference on Empirical Methods in NLP, Jan 2004. URL
http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Riezler.pdf.
M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-
topic model for authors and documents. In AUAI 2004, pages 487-494, Arlington,
Virginia, United States, 2004. AUAI Press. ISBN 0-9749039-0-6.
S. Sahay, S. Mukherjea, E. Agichtein, E. Garcia, S. Navathe, and A. Ram. Discovering
semantic biomedical relations utilizing the web. KDD 2008, 2(1), Mar 2008. URL
http://portal.acm.org/citation.cfm?id=1342320.1342323.
S. Sampson. Crisis in a cloister. PhD thesis, Cornell University, 1969.
L. Saul and M. Jordan. Exploiting Tractable Substructures in Intractable Networks.
Advances in Neural Information Processing Systems, pages 486-492, 1996.
L. Saul and M. Jordan. A mean field learning algorithm for unsuper-
vised neural networks. Learning in Graphical Models, Jan 1999. URL
http://citeseer.comp.nus.edu.sg/cache/papers/cs/513/http:zSzzSzwww.ai.mit.eduzSzprojectszSzcbclzSzcourse9.641-F97zSzpmlp.ps.gz/a-mean-field-learning.ps.gz.
Y. Shi and T. Duke. Cooperative model of bacterial sensing. Phys. Rev. E, 58(5):
6399-6406, Nov 1998. doi: 10.1103/PhysRevE.58.6399.
J. Sinkkonen, J. Aukia, and S. Kaski. Component models for large networks. arXiv,
stat.ML, Mar 2008. URL http://arxiv.org/abs/0803.1628v1.
D. Sontag and T. Jaakkola. New Outer Bounds on the Marginal Polytope. Advances
in Neural Information Processing Systems, 21, 2007.
M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of Latent Semantic
Analysis, 2007.
R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agar-
wala, M. Tappen, and C. Rother. A comparative study of energy min-
imization methods for Markov random fields with smoothness-based priors.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(6):
1068-1080, 2008. doi: http://dx.doi.org/10.1109/TPAMI.2007.70844. URL
http://dx.doi.org/10.1109/TPAMI.2007.70844.
H. Takamura, T. Inui, and M. Okumura. Extracting semantic orientations
of words using spin model. In ACL 05: Proceedings of the 43rd Annual
Meeting on Association for Computational Linguistics, pages 133-140, Mor-
ristown, NJ, USA, 2005. Association for Computational Linguistics. doi:
http://dx.doi.org/10.3115/1219840.1219857.
L. Tanabe, N. Xie, L. H. Thom, W. Matten, and W. J. Wilbur. Genetag: a tagged
corpus for gene/protein named entity recognition. BMC Bioinformatics, 6 Suppl
1, 2005. ISSN 1471-2105. doi: http://dx.doi.org/10.1186/1471-2105-6-S1-S3. URL
http://dx.doi.org/10.1186/1471-2105-6-S1-S3.
B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data.
NIPS 2003, 2003.
B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. Ad-
vances in Neural Information Processing Systems, Jan 2004a. URL
http://web.engr.oregonstate.edu/ tgd/classes/539/slides/max-margin-markov-networks.pdf.
B. Taskar, M. Wong, P. Abbeel, and D. Koller. Link prediction in relational data.
NIPS, 2004b.
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of
the American Statistical Association, 101(476):1566-1581, 2007.
M. Wainwright and M. Jordan. A variational principle for graphical models. In New
Directions in Statistical Signal Processing, chapter 11. MIT Press, 2005a.
M. Wainwright and M. Jordan. Log-determinant relaxation for approximate inference
in discrete Markov random fields. Signal Processing, IEEE Transactions on, 54(6):
2099-2109, June 2006. ISSN 1053-587X. doi: 10.1109/TSP.2006.874409.
M. Wainwright, T. Jaakkola, and A. Willsky. Tree-reweighted belief propagation
algorithms and approximate ML estimation by pseudomoment matching. Artificial
Intelligence and Statistics, Jan 2003.
M. J. Wainwright and M. I. Jordan. Variational inference in graphical models: The
view from the marginal polytope. Allerton Conference on Control, Communication
and Computing, Apr 2003.
M. J. Wainwright and M. I. Jordan. A variational principle for graphical models. New
Directions in Statistical Signal Processing, 2005b.
M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and
variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305,
Dec 2008.
M. J. Wainwright, P. Ravikumar, and J. D. Lafferty. High-dimensional graphical model
selection using l1-regularized logistic regression. Neural Information Processing
Systems, Jan 2006.
X. Wang, N. Mohanty, and A. McCallum. Group and topic discovery from relations
and text. Proceedings of the 3rd international workshop on Link discovery, 2005.
URL http://portal.acm.org/citation.cfm?id=1134276.
S. Wasserman and P. Pattison. Logit models and logistic regressions for social
networks: I. an introduction to markov graphs and p*. Psychometrika, 1996. URL
http://www.springerlink.com/index/T2W46715636R2H11.pdf.
M. Welling and G. Hinton. A new learning algorithm for mean field Boltzmann
machines. Artificial Neural Networks - ICANN 2002, Jan 2002.
M. Welling and Y. W. Teh. Belief optimization for binary networks: a stable alternative
to loopy belief propagation. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, pages 554-561, 2001.
D. J. A. Welsh. The computational complexity of some classical problems from
statistical physics. In Disorder in Physical Systems, pages 307-321. Clarendon
Press, 1990.
Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In UAI,
2006.
Z. Xu, V. Tresp, S. Yu, and K. Yu. Nonparametric relational learning for social
network analysis. In 2nd ACM Workshop on Social Network Mining and Analysis
(SNA-KDD 2008), 2008.
J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and
its generalizations. Exploring artificial intelligence in the new millennium, pages
239-269, 2003.
W. Zachary. An information flow model for conflict and fission in small groups. Journal
of Anthropological Research, 33:452-473, 1977.
D. Zhou, S. Zhu, K. Yu, X. Song, B. Tseng, H. Zha, and C. Giles. Learning
multiple graphs for document recommendations. WWW 2008, Apr 2008. URL
http://portal.acm.org/citation.cfm?id=1367497.1367517.
H. Zou and T. Hastie. Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society Series B, Jan 2005. URL
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1467-9868.2005.00503.x.