Automated Topic Models in Political Science

Practical Skills for Document Clustering in R

Brandon M. Stewart1

Harvard University

Talk at the Tools for Text Workshop, June 14, 2010

1 I am grateful to Justin Grimmer and David Blei for permission to use some graphics
and material from their previous presentations.
Brandon M. Stewart (Harvard University) Topic Models June 14, 2010 1 / 39
Introduction

Today We’ll Cover

1 Introduction

2 An Overview of Clustering Models

3 A Procedure for Performing Cluster Analysis in Social Sciences

4 Resources for Additional Information

Basic Definitions

1 Clustering: grouping like units into partitions


2 Unsupervised Learning: learning without using labelled data
3 Topic Models: the application of clustering techniques to documents
for determining their subject matter
4 We are both creating the categories and categorizing the documents
at the same time.

Why Cluster?

1 Many applications in information retrieval and computer science


2 We can use it for exploratory analysis!
3 We can also use it to make data from large text corpora with minimal
assumptions!

Some Motivating Examples

Figure: Public Land (x-axis: Avg Prop of Press Release)

Graphic from Grimmer (2010)

Figure: Farms Per Capita Predicts Attention to Agriculture (x-axis: No. Farms Per Person; y-axis: Prop. Press Releases Agriculture)

Graphic from Grimmer (2010)

Figure: Immigration Reform Act (2006) (x-axis: date, 1Jan2005 to 1Jan2008; y-axis: count; annotations: Cloture 1, Cloture 2, DREAM)

Graphic from Grimmer (2010)

Figure: McCain 2005 (x-axis: Proportion of Press Releases; stems shown: iraq, honor, court, land, broadband, consum, energi, hurrican, immigr, transpar)

Graphic from Grimmer (2010)

A Preview of a Procedure for Performing Cluster Analysis in Social Sciences

1 Choose a Corpus of Documents (Think!)
2 Quantitatively Represent Your Text (Think!)
3 Choose a Model (Think!)
4 Label Clusters (Think!)
5 Validate, Validate, Validate (Think!)

An Overview of Clustering Models

An Overview of Clustering Models

There are a lot of different clustering models:

k-means, self-organizing maps, spectral clustering, affinity propagation,
grid-based, fuzzy k-modes, subspace, model-based, dynamic topic model,
latent Dirichlet allocation, expressed agenda model, k-medoids, correlated
topic model, hierarchical Dirichlet processes, mixture of von Mises-Fisher,
mixed membership stochastic blockmodels, self-organizing tree, ROCK, QT
clustering, Proximus, neural networks, Bayesian mixture of multinomials
using mean-field approximation, maximum entropy, mixture of
normals... plus every combination of different distance metrics and
representations of text.

How do we choose the right one?

We have to think about our data!

Choosing a Model

1 We want one algorithm that has optimal performance for all our sets
of documents and on our subject of interest
2 Unsurprisingly, this is impossible.
3 Two important theorems:
    1 Ugly Duckling Theorem: without some prior weighting of features,
    any two objects are equally similar
    2 No Free Lunch Theorem: averaged over all possible problems, no
    algorithm outperforms every other
4 Thus to choose a method we need to think about substance
(Grimmer and King, 2009)

Different Types of Models

1 There are many potentially important distinctions


2 Algorithmic vs. Statistical
3 Hard vs. Soft
4 Flat vs. Hierarchical (Divisive vs. Agglomerative)

A Simple Example

1 We’ll start with the simple k-means algorithm.
2 One of the most common algorithms for clustering in CS.
3 It is fast and often very useful.
4 It uses two repeating steps:
    Position each center to minimize variance within its cluster.
    Reassign each document to the cluster of the closest center.
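
The two repeating steps can be sketched in a few lines. This is a minimal Python illustration, not the implementation used in the talk: the function names, the toy stem counts, and the choice to pass the starting centers explicitly are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean: the center that minimizes variance within a cluster."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, initial_centers, max_iter=100):
    """Alternate the two k-means steps until no document changes cluster."""
    centers = [list(c) for c in initial_centers]
    k = len(centers)
    assignment = None
    for _ in range(max_iter):
        # Reassign each document to the cluster of the closest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the procedure has converged
        assignment = new_assignment
        # Position each center at the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean_vector(members)
    return assignment, centers


# Hypothetical documents, each a count of two stems, e.g. (farm, iraq).
docs = [[9, 1], [8, 0], [10, 2], [1, 9], [0, 8], [2, 10]]
labels, centers = kmeans(docs, initial_centers=[docs[0], docs[3]])
print(labels)
print(centers)
```

The starting centers are passed in explicitly so the run is reproducible; in practice they are usually drawn at random, and different starts can give different answers.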

K-Means

Pause for a Visualization Break

http://siebn.de/other/yakmeans/

Considerations when using K-Means

1 Poor handling of outliers
2 Highly sensitive to distance metrics
3 Not guaranteed to find a global optimum (it can converge to a local one)
4 Key for Social Scientists: There is no Data-Generating Process!
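
To see why the distance metric matters so much for text, consider document length. The Python sketch below compares Euclidean and cosine distance on hypothetical stem counts (the three documents and their counts are invented for illustration). A long and a short press release with the same mix of stems are far apart under Euclidean distance but essentially identical under cosine distance.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical counts of the stems (farm, iraq) in three press releases.
short_farm = [3, 1]
long_farm = [30, 10]  # ten times longer, same mix of stems
short_iraq = [1, 3]

for name, dist in [("euclidean", euclidean), ("cosine", cosine_distance)]:
    same_mix = dist(short_farm, long_farm)
    different_mix = dist(short_farm, short_iraq)
    print(name, "same mix:", same_mix, "different mix:", different_mix)
```

Under Euclidean distance the short farm release looks closer to the Iraq release than to the long farm release; under cosine distance the ordering reverses. Which behavior is right depends on the substantive question.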

K-Means as a Special Case of EM

1 We need a story about how these documents were generated.
2 Let’s say each word occurrence is generated by a normal distribution
whose mean is set by the topic.

K-Means as a Special Case of EM

1 Find the parameters with the highest likelihood.
2 The Expectation step “assigns responsibilities” for each data point.
3 The Maximization step alters the parameters based on the current
assignments.
4 The result is a “soft” or probabilistic assignment (e.g. .95 in Cluster 1
and .05 in Cluster 2).
5 K-Means is the limiting case of a Gaussian mixture model with a shared
spherical variance σ² → 0, where the soft assignments become hard
(Hastie, Tibshirani and Friedman, 2009).
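
The link between soft and hard assignment can be checked numerically. The Python sketch below computes Expectation-step responsibilities for one data point under an equal-weight mixture of two one-dimensional normals with a shared variance; the means, the data point, and the variance values are assumptions chosen for illustration. As the shared variance shrinks, the responsibilities move toward 0 and 1, which is the k-means rule of assigning each point to its closest center.

```python
import math


def responsibilities(x, means, sigma2):
    """E-step for an equal-weight mixture of 1-D normals sharing variance sigma2.

    With equal weights and a shared variance the normalizing constants cancel,
    so only the squared distances to the means matter.
    """
    log_weights = [-((x - m) ** 2) / (2.0 * sigma2) for m in means]
    top = max(log_weights)  # subtract the maximum to avoid numerical underflow
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


means = [0.0, 4.0]  # hypothetical cluster means
x = 1.5             # a data point somewhat closer to the first mean

for sigma2 in [10.0, 1.0, 0.1, 0.001]:
    print(sigma2, responsibilities(x, means, sigma2))
```

With a large variance the point is shared almost evenly between the clusters; with a tiny variance it belongs entirely to the nearer one.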

Statistical Models of Text

1 Mixture modeling “supposes that the data is an i.i.d sample from some
population described by a probability density function. This density
function is characterized by a parameterized model taken to be a
mixture of component density functions; each component density
describes one of the clusters. This model is then fit to the data by
maximum likelihood or corresponding Bayesian approaches.” (Hastie,
Tibshirani and Friedman, 2009).
2 Now we can tell a story about our data, e.g. Poisson vs. Multinomial
word counts.
3 Nothing stops us from incorporating more substance into more
complex models.

Hierarchical Models of Text

1 We can include problem-specific information
2 Latent Dirichlet Allocation
    1 Documents → Words
    2 Documents are a mixture of topics
    3 Blei, Ng and Jordan 2003
3 Dynamic Topic Model
    1 Days → Speeches
    2 Quinn et al. 2010; Blei and Lafferty 2006
4 Expressed Agenda Model
    1 Senators → Press Releases
    2 Grimmer 2010

Latent Dirichlet Allocation

1 Documents arise from a data generating process with hidden
variables: the topics.
2 Each document is a random mixture of topics and each word is drawn
from a topic.
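
The generative story can be simulated directly. The following Python sketch draws one document the way the description above says: a random mixture of topics for the document, then a topic and a word for each position. The two topics, their word probabilities, and the Dirichlet parameter are hypothetical values chosen for illustration; fitting the model to real text (inference) is a separate and harder problem that this sketch does not attempt.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Simulate one document: topic proportions first, then a topic per word."""
    topic_names = list(topics)
    theta = sample_dirichlet(alpha, rng)  # this document's mixture of topics
    words = []
    for _ in range(n_words):
        # Pick the hidden topic for this word position.
        z = rng.choices(topic_names, weights=theta, k=1)[0]
        # Draw the word from that topic's distribution over the vocabulary.
        vocab = list(topics[z])
        probs = [topics[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words


# Hypothetical topics: each maps stems to probabilities that sum to one.
topics = {
    "defense": {"iraq": 0.6, "honor": 0.3, "land": 0.1},
    "environment": {"land": 0.5, "energi": 0.4, "honor": 0.1},
}

rng = random.Random(2010)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=12, rng=rng)
print(theta)
print(words)
```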

Figure: Plate Notation of Latent Dirichlet Allocation

Graphic from David Blei’s Website: http://www.cs.princeton.edu/~blei/modeling-science.pdf

Figure: Dynamic Topic Model

Graphic from David Blei’s Website: http://www.cs.princeton.edu/~blei/modeling-science.pdf


Figure: Topic Evolution over Time

Graphic from David Blei’s Website: http://www.cs.princeton.edu/~blei/modeling-science.pdf


Figure: Word Use in Topics Over Time

Graphic from David Blei’s Website: http://www.cs.princeton.edu/~blei/modeling-science.pdf


Expressed Agenda Model

1 Assumes:
    1 Each document is assigned to one topic
    2 Each author allocates some hidden proportion of time to each topic
2 Grimmer’s project seeks to quantitatively represent the content of
senators’ press releases.
3 It is called the Expressed Agenda Model because it captures the way
senators communicate their agenda to constituents.
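
The two assumptions can be made concrete with a small simulation. The Python sketch below is not Grimmer's model: it leaves out the words entirely and treats each press release's topic as directly observed, whereas the real model infers topics from text. The senator's attention proportions and the topic names are hypothetical. It only shows how a hidden allocation of attention across topics, combined with one topic per document, produces the kind of "proportion of press releases" summary shown in the motivating examples.

```python
import random
from collections import Counter


def simulate_press_releases(author_attention, n_releases, rng):
    """Assign each press release exactly one topic, drawn from the author's
    hidden proportions of attention."""
    topics = list(author_attention)
    weights = [author_attention[t] for t in topics]
    return rng.choices(topics, weights=weights, k=n_releases)


def observed_shares(release_topics):
    """Share of press releases falling in each topic."""
    counts = Counter(release_topics)
    n = len(release_topics)
    return {topic: counts[topic] / n for topic in counts}


# Hypothetical hidden attention for one senator (proportions sum to one).
attention = {"iraq": 0.5, "immigr": 0.3, "energi": 0.2}

rng = random.Random(14)
releases = simulate_press_releases(attention, n_releases=200, rng=rng)
print(observed_shares(releases))
```

With enough press releases the observed shares approach the hidden proportions, which is what makes the author-level quantities estimable.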

Figure: Expressed Agenda Model

Graphic from Grimmer 2010

Overview of Clustering Models

1 There are a lot of clustering models and no a priori way to know
which is best without substance.
2 Many algorithms are implemented in R, and k-means is even included
in the base distribution (the kmeans function in the stats package).
3 Bayesian Hierarchical Models are trickier but they allow us to
incorporate what we know about our documents.
4 Tomorrow we’ll cover: k-means, Latent Dirichlet Allocation and
Expressed Agenda Model.

A Procedure for Performing Cluster Analysis in Social Sciences

A Procedure for Performing Cluster Analysis in Social Sciences

1 Choose a Corpus of Documents
2 Quantitatively Represent Your Text
3 Choose a Model
4 Label Clusters
5 Validate, Validate, Validate
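
Step 2 usually means turning each document into a row of word counts, a document-term matrix. A minimal Python sketch follows; the function names and example texts are assumptions, and the tokenizer is deliberately crude (real pipelines typically also stem words and drop stop words).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(texts):
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical two-document corpus.
texts = [
    "Farm subsidy, farm bill.",
    "Iraq troop funding bill.",
]
vocabulary, matrix = document_term_matrix(texts)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row can then be handed to a clustering model; every choice made here (tokens, stemming, weighting) changes what "similar" means downstream.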

Labelling Clusters

1 Automatic: Identify stems that are the most informative of whether a
document belongs in a cluster.
    1 Difference in Means
    2 Highest Mutual Information
    3 Regression-Based Techniques (Monroe et al. 2008)
2 Manual: Read a sample of documents in the cluster including the
“centroid” document.
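
The first two automatic approaches are easy to compute once documents have cluster labels. The Python sketch below scores each stem by the difference in its mean count inside versus outside a cluster, and by the mutual information between "the stem appears" and "the document is in the cluster". The tiny count matrix, the vocabulary, and the labels are hypothetical, and the sketch assumes the cluster and its complement are both non-empty.

```python
import math


def difference_in_means(matrix, labels, cluster):
    """Mean count of each stem inside the cluster minus its mean outside."""
    inside = [row for row, lab in zip(matrix, labels) if lab == cluster]
    outside = [row for row, lab in zip(matrix, labels) if lab != cluster]
    n_terms = len(matrix[0])
    return [
        sum(r[j] for r in inside) / len(inside)
        - sum(r[j] for r in outside) / len(outside)
        for j in range(n_terms)
    ]


def mutual_information(matrix, labels, cluster, j):
    """Mutual information (in bits) between presence of stem j and
    membership in the cluster."""
    n = len(matrix)
    joint = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for row, lab in zip(matrix, labels):
        joint[(int(row[j] > 0), int(lab == cluster))] += 1
    mi = 0.0
    for (a, b), count in joint.items():
        if count == 0:
            continue  # empty cells contribute nothing
        p_ab = count / n
        p_a = (joint[(a, 0)] + joint[(a, 1)]) / n  # marginal for stem presence
        p_b = (joint[(0, b)] + joint[(1, b)]) / n  # marginal for membership
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical counts for the stems below in four documents.
vocabulary = ["bill", "farm", "iraq"]
matrix = [[1, 2, 0], [1, 3, 0], [1, 0, 2], [1, 0, 1]]
labels = [0, 0, 1, 1]

diffs = difference_in_means(matrix, labels, cluster=0)
for j, stem in enumerate(vocabulary):
    print(stem, diffs[j], mutual_information(matrix, labels, 0, j))
```

A stem such as "bill" that appears in every document carries no information about cluster membership, while "farm" separates the clusters perfectly in this toy example.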

Validation

We want to validate our topics along five dimensions (from Quinn et al. 2010):
1 Semantic Validity: All categories are coherent and meaningful.
2 Convergent Construct Validity: Topics concur with existing
measures in critical details.
3 Discriminant Construct Validity: The topics differ from existing
measures in productive ways.
4 Predictive Validity: Data generated from the model corresponds to
external events in expected ways.
5 Hypothesis Validity: Data generated from the model can be used to
test substantive hypotheses.

Choosing the Number of Clusters

1 Two Points of View: Fit Statistics vs. Validation
2 Gap Statistic (Tibshirani et al. 2001)

Gap Statistic

Figure: (Left panel): observed (green) and expected (blue) values of log W_K.
Both curves have been translated to equal zero at one cluster. (Right panel): Gap
curve, equal to the difference between the observed and expected values of
log W_K. The Gap estimate K* is the smallest K producing a gap within one
standard deviation of the gap at K + 1; here K* = 2.

Graphic from Hastie, Tibshirani and Friedman, 2009
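
The selection rule in the caption can be sketched in base R. This is a hand-rolled illustration under stated assumptions, not the authors' code: it uses k-means for the clustering, draws reference data uniformly over the range of each column, and runs on simulated two-group data. The names `log_wk` and `gap_statistic` are hypothetical. The `cluster` package also provides a packaged version, `clusGap()`.

```r
set.seed(1)

# Toy data: two well-separated groups in two dimensions (an assumption
# made only so the example has an obvious structure).
x <- rbind(matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2))

# log W_K: log of the pooled within-cluster sum of squares from k-means.
log_wk <- function(x, k) {
  log(kmeans(x, centers = k, nstart = 10)$tot.withinss)
}

gap_statistic <- function(x, k_max = 5, n_ref = 20) {
  # Observed curve.
  obs <- sapply(1:k_max, function(k) log_wk(x, k))
  # Expected curve: cluster n_ref reference data sets drawn uniformly
  # over the observed range of each column.
  ref <- replicate(n_ref, {
    u <- apply(x, 2, function(col) runif(length(col), min(col), max(col)))
    sapply(1:k_max, function(k) log_wk(u, k))
  })                                   # k_max x n_ref matrix
  expected <- rowMeans(ref)
  gap <- expected - obs
  # Standard deviation of the reference curve, inflated for simulation error.
  s <- sqrt(rowMeans((ref - expected)^2)) * sqrt(1 + 1 / n_ref)
  # Smallest K whose gap is within one s of the gap at K + 1.
  ok <- which(gap[-k_max] >= gap[-1] - s[-1])
  k_star <- if (length(ok) > 0) ok[1] else k_max
  list(gap = gap, s = s, k_star = k_star)
}

result <- gap_statistic(x)
print(round(result$gap, 2))
print(result$k_star)
```

With data like these, the gap curve should rise sharply between one and two clusters. On a real document-term matrix the reference distribution and the clustering method both deserve more thought than this sketch gives them.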


Cluster Quality

1 A new evaluation metric from Grimmer and King, 2009.
2 Human limitations:
   1 We can only consider a small number of documents and clusters at one
   time.
   2 We are pretty good at evaluating pairs of documents.

Cluster Quality for the Expressed Agenda Model (Grimmer 2010b)

1 Sample 100 document pairs.
2 Humans evaluate document pairs: (1) unrelated, (2) loosely related,
(3) closely related.

Figure: Cluster quality comparison. Horizontal axis: (Expressed Agenda) − (Other Methods), from −0.2 to 0.2. Plotted labels: Rockefeller Press Releases, Lautenberg Press Releases, Dirichlet Process.

Graphic from Grimmer, 2010
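
The two steps above can be sketched in base R. This is a rough illustration that assumes cluster quality is the mean human rating of same-cluster pairs minus the mean rating of different-cluster pairs. The ratings here are simulated stand-ins for human coding, and `cluster`, `cluster_alt`, and `cluster_quality` are hypothetical names.

```r
set.seed(2)

# Hypothetical cluster assignments for 40 documents.
n_docs  <- 40
cluster <- rep(1:4, each = 10)

# Step 1: sample 100 distinct document pairs.
all_pairs <- t(combn(n_docs, 2))
pairs     <- all_pairs[sample(nrow(all_pairs), 100), ]

# Step 2: stand-in for human coding, where 1 = unrelated,
# 2 = loosely related, 3 = closely related. The ratings are simulated so
# that same-cluster pairs tend to score higher; real ratings come from coders.
same_cluster <- cluster[pairs[, 1]] == cluster[pairs[, 2]]
rating <- ifelse(same_cluster,
                 sample(1:3, 100, replace = TRUE, prob = c(0.1, 0.3, 0.6)),
                 sample(1:3, 100, replace = TRUE, prob = c(0.6, 0.3, 0.1)))

# Cluster quality: mean rating within clusters minus mean rating between.
cluster_quality <- function(rating, same_cluster) {
  mean(rating[same_cluster]) - mean(rating[!same_cluster])
}
cq <- cluster_quality(rating, same_cluster)

# The same ratings can score a competing clustering (here a random shuffle),
# giving a difference like the one on the figure's horizontal axis.
cluster_alt <- sample(cluster)
same_alt    <- cluster_alt[pairs[, 1]] == cluster_alt[pairs[, 2]]
cq_alt      <- cluster_quality(rating, same_alt)
print(cq - cq_alt)
```

Because the human ratings are collected once and reused, any number of candidate clusterings can be compared without further coding effort.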


Dendrogram (Quinn et al., 2010)

Figure: The height of the horizontal line joining two clusters indicates how
dissimilar they are: clusters merged at a lower height are more similar.

Graphic from Quinn et al., 2010
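
One way to use merge heights to pick the number of clusters is sketched below in base R. It assumes simulated data with three well-separated groups and average linkage. Cutting at the largest jump in merge height is one heuristic among several, not a rule taken from the cited paper.

```r
set.seed(3)

# Toy data: three well-separated groups of 20 points each (made up).
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2))

# Agglomerative clustering on Euclidean distances.
hc <- hclust(dist(x), method = "average")

# Merge heights, ordered from the first (most similar) merge to the last.
heights <- hc$height

# A large jump between consecutive heights means the later merge joins
# clusters that are much more dissimilar than anything merged before it.
jumps  <- diff(heights)
cut_at <- mean(heights[which.max(jumps) + 0:1])

# Cut the tree halfway across the largest jump.
groups <- cutree(hc, h = cut_at)
print(table(groups))

# plot(hc) draws the dendrogram; abline(h = cut_at) marks the cut.
```

On real text data the jumps are rarely this clean, which is one reason to combine the height with the validation checks described earlier.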


Choosing the Number of Clusters

1 Two Points of View: Fit Statistics vs. Validation
2 Gap Statistic (Tibshirani et al. 2001)
3 Cluster Quality measure
4 Height in Hierarchical Clustering
5 It is more an art than a science.

Resources for Additional Information

1 Topics not covered: hierarchical clustering, sub-string kernels,
non-cluster-based unsupervised learning.
2 Implementation issues not discussed: distance metrics.
3 A literature review of key work in the social sciences.
4 Key reference texts will be posted on the website.
