Stewart Slides

Automated Topic Models in Political Science
Practical Skills for Document Clustering in R
Brandon M. Stewart
Harvard University
Talk at the Tools for Text Workshop, June 14, 2010
Acknowledgment: I am grateful to Justin Grimmer and David Blei for permission to use some graphics
and material from their previous presentations.
Brandon M. Stewart (Harvard University) Topic Models June 14, 2010 1 / 39
Introduction
Today We’ll Cover
1 Introduction
2 An Overview of Clustering Models
3 A Procedure for Performing Cluster Analysis in Social Sciences
4 Resources for Additional Information

Introduction
Basic Definitions
1 Clustering: grouping like units into partitions
2 Unsupervised Learning: learning without using labelled data
3 Topic Models: the application of clustering techniques to documents
for determining their subject matter
4 We are both creating the categories and categorizing the documents
at the same time.

Introduction
Why Cluster?
1 All kinds of applications in information retrieval and computer science
2 We can use it for exploratory analysis!
3 We can also use it to make data from large text corpora with minimal
assumptions!

Introduction
Some Motivating Examples
Figure: Public Land (x-axis: Avg Prop of Press Release, 0 to 0.1)
Graphic from Grimmer (2010)

Introduction
Some Motivating Examples
Figure: Farms Per Capita Predicts Attention to Agriculture (x-axis: No. Farms Per Person, 0.00 to 0.04; y-axis: Prop. Press Releases Agriculture, 0.00 to 0.15)
Graphic from Grimmer (2010)

Introduction
Some Motivating Examples
Figure: count by date (1Jan2005 to 1Jan2008), with annotations Immigration Reform Act (2006), Cloture 1, Cloture 2, and DREAM
Graphic from Grimmer (2010)

Introduction
Some Motivating Examples
Figure: McCain 2005, Proportion of Press Releases (0 to 0.18) for the stems iraq, honor, court, land, broadband, consum, energi, hurrican, immigr, transpar
Graphic from Grimmer (2010)

Introduction
A Preview of a Procedure for Performing Cluster Analysis in Social Sciences
1 Choose a Corpus of Documents (Think!)
2 Quantitatively Represent Your Text (Think!)
3 Choose a Model (Think!)
4 Label Clusters (Think!)
5 Validate, Validate, Validate (Think!)
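
To make step 2 concrete, here is a minimal Python sketch of one common way to represent text quantitatively: a document-term matrix of word counts. The toy corpus and the helper names `tokenize` and `document_term_matrix` are hypothetical, and the sketch skips stemming and stop-word removal, which a real analysis would usually include.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Toy corpus (hypothetical, for illustration only).
docs = [
    "Farm subsidies help farm families",
    "Immigration reform and border security",
    "Border security funding",
]
vocab, dtm = document_term_matrix(docs)

for doc, row in zip(docs, dtm):
    print(row, doc)
```

Each row is one document and each column one term, so every later step (choosing a model, labelling, validating) operates on this matrix rather than on raw text.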

An Overview of Clustering Models
There are a lot of different clustering models:
k-means, self-organizing maps, spectral clustering, affinity propagation,
grid-based, fuzzy k-modes, subspace, model-based, dynamic topic model,
latent Dirichlet allocation, expressed agenda model, k-medoids, correlated
topic model, hierarchical Dirichlet processes, mixture of von Mises-Fisher,
mixed membership stochastic blockmodels, self-organizing tree, ROCK, QT
clustering, PROXIMUS, neural networks, Bayesian mixture of multinomials
using mean-field approximation, maximum entropy, mixture of
normals... plus every combination of different distance metrics and
representations of text.

How do we choose the right one?

We have to think about our data!

Choosing a Model
1 We want one algorithm that has optimal performance for all our sets
of documents and on our subject of interest
2 Unsurprisingly, this is impossible.
3 Two important theorems:
    1 Ugly Duckling Theorem
    2 No Free Lunch Theorem
4 Thus to choose a method we need to think about substance
(Grimmer and King, 2009)

Different Types of Models
1 There are many potentially important distinctions
2 Algorithmic vs. Statistical
3 Hard vs. Soft
4 Flat vs. Hierarchical (Divisive vs. Agglomerative)

A Simple Example
1 We’ll start with the simple k-means algorithm.
2 Most common algorithm for clustering in CS.
3 It is fast and often very useful.
4 It uses two repeating steps:
    Position the center to minimize variance within the cluster.
    Reassign the documents to the cluster of the closest center.
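
The two repeating steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production implementation: the function names, the toy 2-D points standing in for document vectors, and the hand-picked starting centers are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean, which minimizes within-cluster variance."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centers, max_iters=100):
    """Alternate reassignment and re-centering until assignments stop changing."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iters):
        # Reassign every document to the cluster of the closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Position each center at the mean of the documents assigned to it.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = centroid(members)
    return centers, assignment

# Toy "documents" as 2-D vectors; starting centers are chosen by hand.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, assignment = kmeans(points, centers=[(0.0, 0.1), (5.0, 5.1)])
print(assignment)
print(centers)
```

In practice the starting centers are usually chosen at random and the algorithm is rerun several times, because different starts can end in different solutions.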

K-Means
Pause for a Visualization Break
http://siebn.de/other/yakmeans/

Considerations when using K-Means
1 Poor handling of outliers
2 Highly sensitive to distance metrics
3 Not guaranteed to find a global optimum
4 Key for Social Scientists: There is no Data-Generating Process!

K-Means as a Special Case of EM
1 We need a story about how these documents were generated.
2 Let’s say each word occurrence is generated by a normal distribution
whose mean is set by the topic.

K-Means as a Special Case of EM
1 Find the parameters with the highest likelihood.
2 The Expectation step “assigns responsibilities” for each data point.
3 The Maximization step alters the parameters based on the current
assignments.
4 The result is a “soft” or probabilistic assignment (e.g. .95 in Cluster 1
and .05 in Cluster 2).
5 K-Means is the limiting case of a Gaussian mixture model as σ² → 0
(Hastie, Tibshirani and Friedman, 2009).
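
A small sketch can show how the soft assignments harden as the variance shrinks. It assumes a one-dimensional Gaussian mixture with equal component weights and a shared, fixed variance `sigma2`; the function names and toy data are hypothetical.

```python
import math

def responsibilities(x, means, sigma2):
    """E-step: posterior probability of each component for one data point."""
    # Log densities up to a constant that is shared by all components.
    logs = [-((x - m) ** 2) / (2.0 * sigma2) for m in means]
    top = max(logs)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def m_step(xs, resp):
    """M-step: each mean becomes the responsibility-weighted average."""
    k = len(resp[0])
    return [
        sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
        for j in range(k)
    ]

xs = [0.0, 0.2, 0.1, 5.0, 5.2, 4.9]
start = [1.0, 4.0]

# Large variance: every point shares responsibility across both clusters.
soft = [responsibilities(x, start, sigma2=4.0) for x in xs]
# Tiny variance: responsibilities collapse to 0 or 1, as in k-means.
hard = [responsibilities(x, start, sigma2=1e-4) for x in xs]

soft_means = m_step(xs, soft)
hard_means = m_step(xs, hard)
print(soft[3], hard[3])
print(soft_means, hard_means)
```

With the tiny variance, the M-step reduces to averaging the points nearest each center, which is the k-means center update.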

Statistical Models of Text
1 Mixture modeling “supposes that the data is an i.i.d sample from some
population described by a probability density function. This density
function is characterized by a parameterized model taken to be a
mixture of component density functions; each component density
describes one of the clusters. This model is then fit to the data by
maximum likelihood or corresponding Bayesian approaches.” (Hastie,
Tibshirani and Friedman, 2009).
2 Now we can tell a story about our data. Poisson vs. Multinomial.
3 Nothing stops us from incorporating more substance into more
complex models.
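
As an illustration of the multinomial story, the sketch below computes the posterior probability that a document belongs to each cluster under a mixture of multinomials. The two-word vocabulary, the cluster word probabilities, and the mixing weights are made-up values, and `cluster_posterior` is a hypothetical helper name.

```python
import math

def cluster_posterior(counts, topics, weights):
    """Posterior over clusters for one document of word counts.

    The multinomial coefficient is the same for every cluster, so it
    cancels and is omitted from the log scores.
    """
    logs = [
        math.log(w) + sum(c * math.log(p) for c, p in zip(counts, topic))
        for topic, w in zip(topics, weights)
    ]
    top = max(logs)  # subtract the max for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Vocabulary: ["farm", "border"]. All numbers are illustrative assumptions.
topics = [[0.75, 0.25],   # cluster 0 mostly uses "farm"
          [0.25, 0.75]]   # cluster 1 mostly uses "border"
weights = [0.5, 0.5]

print(cluster_posterior([2, 0], topics, weights))  # "farm farm"
print(cluster_posterior([1, 1], topics, weights))  # "farm border"
```

Because the model states how counts are generated, cluster membership comes with a probability rather than a bare label.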

Hierarchical Models of Text
1 We can include problem-specific information
2 Latent Dirichlet Allocation
    1 Documents → Words
    2 Documents are a mixture of topics
    3 Blei, Ng and Jordan, 2003
3 Dynamic Topic Model
    1 Days → Speeches
    2 Quinn et al., 2010; Blei and Lafferty, 2006
4 Expressed Agenda Model
    1 Senators → Press Releases
    2 Grimmer, 2010

Latent Dirichlet Allocation
1 Documents arise from a data generating process with hidden
variables: the topics.
2 Each document is a random mixture of topics and each word is drawn
from a topic.
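
The generative story can be simulated directly. The sketch below draws one document: first a topic mixture for the document, then, for each word, a topic and a word from that topic. It assumes the topic-word distributions are already known (in the full model they are also random and must be inferred), and the vocabulary, topics, and `alpha` values are made up for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topics, vocab, alpha, n_words, rng):
    """Generate one document under the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)        # document's mixture of topics
    assignments, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)      # hidden topic for this word
        w = sample_categorical(topics[z], rng)  # word drawn from that topic
        assignments.append(z)
        words.append(vocab[w])
    return theta, assignments, words

# Illustrative values only.
vocab = ["farm", "crop", "border", "visa"]
topics = [[0.45, 0.45, 0.05, 0.05],   # an "agriculture" topic
          [0.05, 0.05, 0.45, 0.45]]   # an "immigration" topic
rng = random.Random(0)
theta, assignments, words = generate_document(topics, vocab, [0.5, 0.5], 10, rng)
print(theta)
print(words)
```

Fitting the model runs this story in reverse: given only the words, infer the hidden topic mixtures and topic assignments.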

Figure: Plate Notation of Latent Dirichlet Allocation
Graphic from David Blei’s Website: http://www.cs.princeton.edu/~blei/modeling-science.pdf

Figure: Dynamic Topic Model

Figure: Topic Evolution over Time

Figure: Word Use in Topics Over Time

Expressed Agenda Model
1 Assumes:
    1 Each document is assigned to one topic
    2 Each author allocates some hidden proportion of time to each topic
2 Grimmer’s project seeks to quantitatively represent the content of
senators’ press releases.
3 It is called the Expressed Agenda Model because it captures the way
senators communicate their agenda to constituents.

Figure: Expressed Agenda Model
Graphic from Grimmer 2010

Overview of Clustering Models
1 There are a lot of clustering models and no a priori way to know
which is best without substance.
2 Many algorithms are implemented in R and k-means is even included in
the base distribution (the stats package).
3 Bayesian Hierarchical Models are trickier but they allow us to
incorporate what we know about our documents.
4 Tomorrow we’ll cover: k-means, Latent Dirichlet Allocation and
Expressed Agenda Model.

A Procedure for Performing Cluster Analysis in Social Sciences

3 Choose a Model
4 Label Clusters
5 Validate, Validate, Validate

Labelling Clusters
1 Automatic: Identify stems that are the most informative of whether a
document belongs in a cluster.
    1 Difference in Means
    2 Highest Mutual Information
    3 Regression-Based Techniques (Monroe et al., 2008)
2 Manual: Read a sample of documents in the cluster including the
“centroid” document.
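
The first two automatic approaches can be sketched as follows, assuming a small document-term matrix of stem counts and a known cluster membership for each document. The data and function names are hypothetical, and mutual information is computed here between two binary indicators (stem present, document in cluster), which is one common formulation.

```python
import math

def difference_in_means(dtm, in_cluster):
    """Mean count of each stem inside the cluster minus its mean outside."""
    n_in = sum(in_cluster)
    n_out = len(in_cluster) - n_in
    scores = []
    for j in range(len(dtm[0])):
        mean_in = sum(row[j] for row, c in zip(dtm, in_cluster) if c) / n_in
        mean_out = sum(row[j] for row, c in zip(dtm, in_cluster) if not c) / n_out
        scores.append(mean_in - mean_out)
    return scores

def mutual_information(dtm, in_cluster, j):
    """Mutual information (bits) between 'stem j present' and 'in cluster'."""
    n = len(dtm)
    present = [row[j] > 0 for row in dtm]
    mi = 0.0
    for t in (True, False):
        for c in (True, False):
            joint = sum(1 for p, m in zip(present, in_cluster) if p == t and m == c) / n
            p_t = sum(1 for p in present if p == t) / n
            p_c = sum(1 for m in in_cluster if m == c) / n
            if joint > 0:
                mi += joint * math.log2(joint / (p_t * p_c))
    return mi

# Toy data: rows are documents, columns are stems.
vocab = ["border", "farm", "fund", "secur"]
dtm = [[0, 2, 1, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 2],
       [2, 0, 1, 1]]
in_cluster = [True, True, False, False]

scores = difference_in_means(dtm, in_cluster)
label = vocab[scores.index(max(scores))]
print(label, scores)
print([round(mutual_information(dtm, in_cluster, j), 3) for j in range(len(vocab))])
```

Note that mutual information rewards stems that signal absence from the cluster just as much as presence, so it is usually read alongside the sign of the difference in means.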

Validation
We want to validate our topics across five major categories
(from Quinn et al. 2010):
1 Semantic Validity: All categories are coherent and meaningful
2 Convergent Construct Validity: Topics concur with existing
measures in critical details
3 Discriminant Construct Validity: The topics differ from existing
measures in productive ways.
4 Predictive Validity: Data generated from the model corresponds to
external events in expected ways.
5 Hypothesis Validity: Data generated from the model can be used to
test substantive hypotheses.

Choosing the Number of Clusters

1 Two Points of View: Fit Statistics vs. Validations


2 Gap Statistic (Tibshirani, Walther and Hastie, 2001)

Gap Statistic
Figure: (Left panel): observed (green) and expected (blue) values of log W_K.
Both curves have been translated to equal zero at one cluster. (Right panel): Gap
curve, equal to the difference between the observed and expected values of
log W_K. The Gap estimate K* is the smallest K producing a gap within one
standard deviation of the gap at K + 1; here K* = 2.
Graphic from Hastie, Tibshirani and Friedman, 2009




3 Cluster Quality measure.

Cluster Quality
1 A new evaluation metric from Grimmer and King, 2009.
2 Human limitations:
    1 We can only consider a small number of documents and clusters at one
    time.
    2 We are pretty good at evaluating pairs of documents.

Cluster Quality for the Expressed Agenda Model (Grimmer 2010b)
1 Sample 100 document pairs.
2 Humans evaluate document pairs: (1) unrelated, (2) loosely related,
(3) closely related
Figure: (Expressed Agenda) − (Other Methods), axis from −0.2 to 0.2; rows labelled Rockefeller Press Releases, Lautenberg Press Releases, Dirichlet Process
Graphic from Grimmer, 2010
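
One way such pair ratings could be turned into a single score is sketched below. It assumes cluster quality is summarized as the average rating of pairs that a clustering places in the same cluster minus the average rating of pairs it places in different clusters; that definition, the function name, and the toy ratings are assumptions for illustration and should be checked against Grimmer and King before use.

```python
def cluster_quality(pairs, ratings, assignment):
    """Mean rating of same-cluster pairs minus mean rating of split pairs."""
    same = [r for (i, j), r in zip(pairs, ratings) if assignment[i] == assignment[j]]
    diff = [r for (i, j), r in zip(pairs, ratings) if assignment[i] != assignment[j]]
    if not same or not diff:
        raise ValueError("need at least one same-cluster and one split pair")
    return sum(same) / len(same) - sum(diff) / len(diff)

# Sampled document pairs and human ratings:
# 1 = unrelated, 2 = loosely related, 3 = closely related.
pairs = [(0, 1), (0, 2), (1, 2), (2, 3)]
ratings = [3, 1, 1, 3]

method_a = [0, 0, 1, 1]  # cluster label per document under one method
method_b = [0, 1, 0, 1]  # cluster label per document under another method

quality_a = cluster_quality(pairs, ratings, method_a)
quality_b = cluster_quality(pairs, ratings, method_b)
print(quality_a, quality_b, quality_a - quality_b)
```

A positive difference means the first method more often groups the pairs that human coders judged to be related.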




4 Height in Hierarchical Clustering

Dendrogram (Quinn et al., 2010)
Figure: The height at which two clusters are merged (the horizontal line
joining them) indicates how similar they are.
Graphic from Quinn et al., 2010
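
To illustrate how merge heights can suggest a number of clusters, here is a minimal sketch of agglomerative clustering that records the height of every merge. It assumes single linkage on one-dimensional toy data, with height measured as a distance (so lower merges join more similar clusters); these choices and the function name are assumptions for the example, not the method behind the figure.

```python
def single_linkage(points):
    """Agglomerative clustering; returns (height, left, right) for each merge."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

points = [0.0, 0.5, 4.0, 4.25, 10.0]
merges = single_linkage(points)
heights = [h for h, _, _ in merges]
for h, left, right in merges:
    print(h, left, right)
```

A large jump between consecutive merge heights is a natural place to cut the tree; in this toy data the jump after the second merge leaves three clusters.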




5 It is more an art than a science.

Resources for Additional Information
1 Topics not covered: hierarchical clustering, sub-string kernels,
non-cluster based unsupervised learning.
2 Implementation issues not discussed: Distance metrics
3 A literature review of key work in the social sciences
4 Key reference texts will be posted on website
