
PLSC 506: Measurement, Estimation and Inference

(with Applications to Text Data)

Spring 2017

Prof. John A. Henderson
Location: PR77 (ISPS) B217, Wed. 1:30 – 3:30
Office: ISPS Room # D230
Hours: Wed. 3:30 – 5:30
Email: john.henderson@yale.edu

Course Description
This course covers a wide array of methodologies that aim to improve the quality of measurement,
estimation, and inference, particularly in light of challenges that emerge in the analysis of text
data. Though topics will be generally applicable to political science research, the course largely will
draw examples from text analytic problems and will focus somewhat closely on methods to study
text statistically. Topics will include measurement, reliability and error, text and web scraping,
supervised and unsupervised learning, Bayesian inference, cluster and topic modeling, ideal point
scaling, and some advanced topics in statistical inference. The aim of the course is to provide
students with a host of practical tools that can be used to evaluate and replicate other research, as
well as to help students address methodological issues arising in their own work. Prerequisites for
the course include PLSC 500a, 503b, and 504a or equivalent.

Course Requirements
Final grades will be based on a series of homework assignments (30% of final grade), presentations
(20% of final grade), a term paper (40% of final grade), and course participation (10% of final
grade). Collaboration on the final paper is encouraged, but students may not coauthor with more
than two other students.

Software and Course Books


The course will focus extensively on programming in R, which is available for download here:
http://www.r-project.org/. The course will also utilize software to approximate posterior
distributions using Gibbs samplers (JAGS/BUGS). Mac users should use JAGS, which is available
here: http://mcmc-jags.sourceforge.net. Windows users should use WinBUGS, which is
available here: http://www.mrc-bsu.cam.ac.uk/bugs/. We will also make extensive use of
the following packages in R: tm, RCurl, rjags, R2WinBUGS.

In addition to the weekly course readings, the following books are recommended for consultation
or purchase:

Programming in R:
• Krause, Andreas and Melvin Olson (2005), The Basics of S-PLUS. New York: Springer.
• Venables, W. N. and Brian D. Ripley (2003), Modern Applied Statistics with S. New York:
Springer-Verlag.
• Kruschke, John K. (2011), Doing Bayesian Data Analysis: A Tutorial with R and BUGS,
New York: Elsevier.
Bayesian Inference and Machine Learning:

• Bishop, Christopher (2006), Pattern Recognition and Machine Learning, New York: Springer.
• Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009), The Elements of Statistical
Learning: Data Mining, Inference, and Prediction 2nd edition, New York: Springer.
• Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin (2004), Bayesian Data
Analysis, Boca Raton, Florida: Chapman and Hall/CRC. [Third edition is also good]
• Jackman, Simon (2009), Bayesian Analysis for the Social Sciences, London: Wiley.

Modern Text Analysis and Data Mining:

• Aggarwal, Charu C. and ChengXiang Zhai (2012), Mining Text Data, New York: Springer.
• Manning, Christopher D., Prabhakar Raghavan and Hinrich Schütze (2008), Introduction to
Information Retrieval, Cambridge: Cambridge University Press, available online at http:
//nlp.stanford.edu/IR-book/information-retrieval-book.html

Academic Dishonesty
It is your responsibility to be familiar with and to follow the university policy on academic
dishonesty. (See
http://catalog.yale.edu/handbook-instructors-undergraduates-yale-college/teaching/academic-dishonesty/
and http://gsas.yale.edu/academic-professional-development/professional-ethics-regulations.)
Any student caught plagiarizing or engaging in other academic dishonesty will receive an F in
the course and will be reported to the Dean’s Office for further sanction.

A brief note from Dean Thomas Pollard:


“Academic integrity is a core institutional value at Yale. It means, among other things,
truth in presentation, diligence and precision in citing works and ideas we have used, and
acknowledging our collaborations with others. In view of our commitment to maintaining
the highest standards of academic integrity, the Graduate School Code of Conduct
specifically prohibits the following forms of behavior: cheating on examinations, problem
sets and all other forms of assessment; falsification and/or fabrication of data; plagiarism,
that is, the failure in a dissertation, essay or other written exercise to acknowledge
ideas, research, or language taken from others; and multiple submission of the same work
without obtaining explicit written permission from both instructors before the material
is submitted. Students found guilty of violations of academic integrity are subject to
one or more of the following penalties: written reprimand, probation, suspension (noted
on a student’s transcript) or dismissal (noted on a student’s transcript).”

Course Schedule

Week 1

January 18: Course Overview and Introduction to Measurement

- Cronbach and Meehl, “Construct Validity in Psychological Tests” at
http://mcps.umn.edu/assets/pdf/1_7_Cronbach.pdf
- Weber, Ch. 3 “Techniques of Content Analysis” in Basic Content Analysis at
https://drive.google.com/file/d/0B2vSiN5b-8RIRk94NjVLNHFIM28/edit?usp=sharing
- Grimmer and Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content
Analysis Methods for Political Texts” at http://www.stanford.edu/~jgrimmer/tad2.pdf

Week 2

January 25: (Getting and Managing) Text as Data

- Monroe and Schrodt, “Introduction to the Special Issue: The Statistical Analysis of
Political Text” at
http://pan.oxfordjournals.org/content/16/4/351.full.pdf#page=1&view=FitH
- Jackman, “Data from the Web Into R” at
http://www.nyu.edu/projects/spirling/documents/tpm.pdf

Suggested

- Additional Sources for Webscraping/API via R:
• http://tonybreyal.wordpress.com/2011/11/18/htmltotext-extracting-text-from-html-vi
• http://www.nyu.edu/projects/politicsdatalab/localdata/workshops/twitter.pdf
• How to Scrape Facebook Data at http://minimaxir.com/2015/07/facebook-scraper/
- R Packages for Scraping and Text Processing: tm, RCurl
• https://cran.r-project.org/web/packages/twitteR/twitteR.pdf
• http://cran.r-project.org/web/packages/tm/tm.pdf
• http://cran.r-project.org/web/packages/RCurl/RCurl.pdf

Week 3

February 1: Natural Language (Pre-)Processing and Regular Expressions

- Regular Expressions in R:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
- Eden, “Introduction to String Matching and Modification in R Using Regular
Expressions” at
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf

- Feinerer, “Introduction to the tm Package: Text Mining in R” at
http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
- Denny and Spirling, “Assessing the Consequences of Text Preprocessing Decisions” at
http://www.nyu.edu/projects/spirling/documents/preprocessing.pdf

Suggested

- R Packages and Functions for String Manipulation: stringr, gsub
• https://cran.r-project.org/web/packages/stringr/stringr.pdf
• http://www.endmemo.com/program/R/gsub.php
- Stemming and Preprocessing Approaches:
• Porter, “Snowball: A Language for Stemming Algorithms” at
http://snowball.tartarus.org/texts/introduction.html
• Kumar and Chandrasekhar, “Text Data Pre-processing and Dimensionality
Reduction Techniques for Document Clustering” at
https://drive.google.com/open?id=0B2vSiN5b-8RISThSQ05Va2Y1dms

Week 4

February 8: Text Geometry: Distances, Vector Spaces, tf-idf Weights

- Spirling, “Bargaining Power in Practice: US Treaty-Making with American Indians,
1784-1911” at
http://onlinelibrary.wiley.com/doi/10.1111/ajps.2012.56.issue-1/issuetoc
- Ramos, “Using TF-IDF to Determine Word Relevance in Document Queries” at
http://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf
- Manning, Raghavan and Schütze, Ch. 6 (especially 6.2 and 6.3) in Introduction to
Information Retrieval

Suggested

- Turney and Pantel, “From Frequency to Meaning: Vector Space Models of Semantics”
at http://www.jair.org/media/2934/live-2934-4846-jair.pdf
- Classic Multidimensional Scaling:
http://www.stat.pitt.edu/sungkyu/course/2221Fall13/lec8_mds_combined.pdf

Week 5

February 15: Supervised Learning I: Dictionaries, Naive Bayes, SVM, Prediction
Validation

- Dodds and Danforth, “Measuring the Happiness of Large-Scale Written Expression:
Songs, Blogs, and Presidents” at
http://www.uvm.edu/~cdanfort/research/dodds-danforth-johs-2009.pdf

- Monroe, Colaresi, and Quinn, “Fightin’ Words: Lexical Feature Selection and Evaluation
for Identifying the Content of Political Conflict” at
http://pan.oxfordjournals.org/content/16/4/372.full.pdf#page=1&view=FitH
- Manning, Raghavan and Schütze, Ch. 13 (especially 13.2 – 13.5) in Introduction to
Information Retrieval

Suggested

- Pang, Lee, and Vaithyanathan, “Thumbs Up? Sentiment Classification using Machine
Learning Techniques” at http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
- Mosteller and Wallace, “Inference in an Authorship Problem” at
http://www.stat.cmu.edu/~vlachos/courses/724/final/mosteller.pdf
- Albaugh, Sevenans, Soroka and Loewen, “The Automated Coding of Policy Agendas: A
Dictionary-Based Approach” at http://www.lexicoder.com/docs/CAP2013v2.pdf
- Hillard, Purpura and Wilkerson, “Computer Assisted Classification for Mixed Methods
Social Science Research” at http://faculty.washington.edu/jwilker/tft/Hillard.pdf
[SVM and good discussion about classification]
- Yano, Resnik, and Smith, “Shedding (a Thousand Points of) Light on Biased Language”
at http://www.cs.cmu.edu/~nasmith/papers/yano+resnik+smith.wamt10.pdf
- Yu, Kaufmann, and Diermeier, “Classifying Party Affiliation from Political Speech” at
http://www.tandfonline.com/doi/pdf/10.1080/19331680802149608

Week 6

February 22: Unsupervised Learning I: Text Ideal Point Models

- Slapin and Proksch, “A Scaling Model for Estimating Time-Series Party Positions from
Texts” at
http://www.wordfish.org/uploads/1/2/9/8/12985397/slapin_proksch_ajps_2008.pdf
- Beauchamp, “Using Text to Scale Legislatures with Uninformative Voting” at
http://nickbeauchamp.com/work/Beauchamp_scaling_current.pdf
- Lowe, “Understanding Wordscores” at
http://faculty.washington.edu/jwilker/tft/Lowe.pdf
- Bafumi, Gelman, Park and Kaplan, “Practical Issues in Implementing and
Understanding Bayesian Ideal Point Estimation” at
http://www.stat.columbia.edu/~gelman/research/published/171.pdf

Suggested

- Clinton, Jackman, and Rivers, “The Statistical Analysis of Roll Call Data” at
https://my.vanderbilt.edu/joshclinton/files/2011/10/CJR_APSR2004.pdf
- Martin and Quinn, “Dynamic Ideal Point Estimation via Markov Chain Monte Carlo for
the U.S. Supreme Court, 1953-1999” at http://mqscores.wustl.edu/media/pa02.pdf
- Jackman, “Bayesian Analysis for Political Research” at
http://www.annualreviews.org/doi/abs/10.1146/annurev.polisci.7.012003.104706

Week 7

March 1: Bayesian Statistics I: Basic Posterior Inference

- Gelman, Carlin, Stern, and Rubin, Chs. 1 – 3, 5 in Bayesian Data Analysis [pay closer
attention to Chs. 3 & 5]
- Jackman, “Introduction” starting on page xxvii, Chs. 1 and 2, in Bayesian Analysis
for the Social Sciences

Suggested

- Jackman, Chs. 3 – 5 in Bayesian Analysis for the Social Sciences [more comprehensive
treatment of posterior sampling approximations]
- Gelman, Carlin, Stern, and Rubin, Chs. 4, 6 & 8 in Bayesian Data Analysis
- Assorted Materials/Tutorials on the JAGS/BUGS Model Environment
• http://www.uvm.edu/~bbeckage/Teaching/DataAnalysis/Manuals/manual.jags.pdf
• http://people.math.aau.dk/~kkb/Undervisning/Bayes14/sorenh/docs/JAGS-intro-slides.pdf
• http://files.meetup.com/1406240/Probabalistic%20Programming%20-%20JMW.pdf
• http://blog.nus.edu.sg/alexcook/files/2010/06/w6_handout.pdf
- Implementing JAGS/BUGS in R
• http://www.johnmyleswhite.com/notebook/2010/08/20/using-jags-in-r-with-the-rjags-p
• http://cran.r-project.org/web/packages/rjags/rjags.pdf
• http://cran.r-project.org/web/packages/R2WinBUGS/R2WinBUGS.pdf
• http://voteview.com/bayes_beach/R2WinBUGS.pdf

Week 8

March 8: Unsupervised Learning II: Clustering and Word-Topic Models

- Tan, Steinbach, and Kumar, “Cluster Analysis: Basic Concepts and Algorithms” at
http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
- Blei, Ng, and Jordan, “Latent Dirichlet Allocation” at
http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
- Quinn, Monroe, Colaresi, Crespin, and Radev, “How to Analyze Political Attention
with Minimal Assumptions and Costs” at
http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/abstract
- Grimmer, “A Bayesian Hierarchical Topic Model for Political Texts: Measuring
Expressed Agendas in Senate Press Releases” at
http://www.stanford.edu/~jgrimmer/ExpAgendaFinal.pdf
- Blaydes and Grimmer, “Political Cultures: Exploring the Long-Run Determinants of
Values Transmission” at https://ncgg.princeton.edu/IPES/2013/papers/S930_rm3.pdf

Suggested

- Hastie, Tibshirani, and Friedman, Ch. 14.3: “Cluster Analysis” in The Elements of
Statistical Learning
- Blei and McAuliffe, “Supervised Topic Models” at
https://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf
- Implementing LDA in R
• http://cran.r-project.org/web/packages/lda/lda.pdf
• http://obphio.us/pdfs/lda_tutorial.pdf [nice tutorial on the LDA algorithm]
• https://github.com/cjrd/SimpleLDA-R [see here for R implementation of a slow
version of LDA]

Weeks 9 & 10

March 15 & 22: NO CLASS

Week 11

March 29: Bayesian Statistics II: Hierarchy, Non-Conjugacy, Advanced Estimation

- Gelman, Carlin, Stern, and Rubin, Remainder in Bayesian Data Analysis
- Jackman, Remainder in Bayesian Analysis for the Social Sciences

Suggested

- Fox, “Stochastic EM for Estimating the Parameters of a Multilevel IRT Model” at
http://users.edte.utwente.nl/fox/Publications/Papers/artsem.pdf
- Bishop, Selections in Pattern Recognition and Machine Learning
- Ormerod and Wand, “Explaining Variational Approximations” at
http://www.maths.usyd.edu.au/u/jormerod/JTOpapers/Ormerod10.pdf

Week 12

April 5: Unsupervised Learning III: Structural Ideal Points and Topic Models

- Kim, Londregan and Ratkovic, “Estimating Spatial Preferences from Votes and Text”
at http://web.mit.edu/insong/www/pdf/sfa_pa.pdf
- Gerrish and Blei, “The Ideal Point Topic Model” at
https://people.cs.umass.edu/~wallach/workshops/nips2010css/papers/gerrish.pdf
- Roberts et al., “Structural Topic Models for Open-Ended Surveys” at
http://scholar.harvard.edu/dtingley/files/topicmodelsopenendedexperiments.pdf

Suggested

- Gerrish and Blei, “The Issue-Adjusted Ideal Point Model” at
https://arxiv.org/pdf/1209.6004.pdf
- Gerrish, “Applications of Latent Variable Models in Modeling Influence and
Decision Making” at http://www.seangerrish.com/data/thesis.pdf
- Fox, “Multilevel IRT Modeling in Practice with the Package mlirt” at
https://www.jstatsoft.org/article/view/v020i05/v20i05.pdf
- STM in R: https://cran.r-project.org/web/packages/stm/stm.pdf

Week 13

April 12: Supervised Learning II: Lasso and Ridge Regression, Ensemble Learner,
Cross-Validation

- Tibshirani, “Regression Shrinkage and Selection Via the Lasso” at
http://statweb.stanford.edu/~tibs/lasso/lasso.pdf
- Tibshirani (Ryan), Notes on Ridge Regression at
http://www.stat.cmu.edu/~ryantibs/datamining/lectures/16-modr1.pdf
- Hastie, Tibshirani, and Friedman, Ch. 3 (especially 3.4 – 3.8) and Ch. 4 (especially 4.3
and 4.4) in The Elements of Statistical Learning: Data Mining, Inference, and Prediction
[linear classification methods]
- van der Laan, Polley and Hubbard, “Super Learner” at
http://biostats.bepress.com/cgi/viewcontent.cgi?article=1226&context=ucbbiostat

Suggested

- Manning, Raghavan and Schütze, Ch. 15 in Introduction to Information Retrieval [SVMs
for classification and prediction under supervised approach]
- Hastie, Tibshirani, and Friedman, Chs. 9 & 10 (especially 9.1, 9.2, 10.1 – 10.4), in The
Elements of Statistical Learning: Data Mining, Inference, and Prediction [additive
models, trees and boosting for supervised learning/classification]
- Gawalt et al., “Discovering Word Associations in News Media via Feature Selection and
Sparse Classification” at http://www.eecs.berkeley.edu/~elghaoui/Pubs/MIR2010.pdf

Week 14

April 19: Final Presentations I

Week 15

April 26: Final Presentations II
