Uploaded by Jonathan Stray

From the course Frontiers of Computational Journalism, Columbia University, Fall 2017 http://www.compjournalism.com/?p=206

Computational Journalism

Columbia Journalism School

Week 1: Introduction, Clustering

September 8, 2017

Computational Journalism: Definitions

“Broadly defined, it can involve changing how stories are discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking.”

- Cohen, Hamilton, Turner, Computational Journalism, 2011

Computational Journalism: Definitions

“Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents.”

- Cohen, Hamilton, Turner, Computational Journalism, 2011

Computational Journalism: Definitions

“The application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government.”

- Jonathan Stray, A Computational Journalism Reading List, 2011

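Cohen et al. model

[Diagram: Data, Reporting, User, with Computer Science applied to the pipeline.]

CS for presentation / interaction

[Diagram: Data, Reporting, User, with CS applied at two points.]

Filter stories for user

[Diagram: several Data and Reporting pipelines, each with CS, feeding a Filtering stage that delivers stories to the User.]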

Examples of filters

• Facebook news feed

• What an editor puts on the front page

• Google News

• Reddit’s comment system

• Twitter

• New York Times recommendation system

etc.

http://snap.stanford.edu/nifty

Kony 2012 early network, by Gilad Lotan

CS in Journalism

[Diagram: several Data and Reporting pipelines with CS, feeding Filtering, the User, and then Effects.]

Journalism as a cycle

[Diagram: a cycle through Data, Reporting, Filtering, User, and Effects, with CS applied around the cycle.]

Journalism with algorithms

vs.

Journalism about algorithms

Websites Vary Prices, Deals Based on Users' Information

Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012

Message Machine

Jeff Larson, Al Shaw, ProPublica, 2012

Computer Science and Journalism

Reporting on society, using computation

Reporting on computation in society

[Diagram of related fields: Information Retrieval, Visualization, Clustering, Natural Language Processing, Text Analysis, Sociology, Filter Design, Social Network Analysis, Graph Theory, Artificial Intelligence, Knowledge Representation, Drawing Conclusions, Cognitive Science, Statistics, Epistemology.]

Course Structure

Unit 1: Filters

Information retrieval, TF-IDF, topic modeling, search engines, social filtering, filtering system design.

Unit 2: Interpreting Data

Quantification, data quality, statistical basics, causation, prediction, narratives.

Unit 3: Methods

Visualization, knowledge representation, social network analysis, privacy and security, tracking flow and effects.

Administration

Assignments

Some assignments require programming, but your writing counts for more than your code!

Final project

Code, story, or research

Course blog

http://compjournalism.com

Grading

40% assignments

40% final project

20% class participation

This class

• Introduction

• Classification and clustering

• Text analysis in journalism

• The Document Vector Space model

Classification and Clustering

Classification and Clustering

Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general.

- Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques

Interpreting High Dimensional Data

UK House of Lords voting record, 2000-2012.

N = 1043 lords by M = 1630 votes

2 = aye, 4 = nay, -9 = didn't vote

Distance metric

d(x, y) ≥ 0

- distance is never negative

d(x, x) = 0

- “reflexivity”: zero distance to self

d(x, y) = d(y, x)

- “symmetry”: x to y same as y to x

d(x, z) ≤ d(x, y) + d(y, z)

- “triangle inequality”: going direct is never longer than a detour through y
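The four properties above can be checked numerically on a small set of points. The following is a minimal sketch, not a proof: the helper names (`euclidean`, `squared_euclidean`, `satisfies_metric_axioms`) and the sample points are illustrative assumptions. It shows that Euclidean distance passes all four checks, while squared Euclidean distance fails the triangle inequality.

```python
import itertools
import math


def euclidean(x, y):
    """Ordinary straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Squared distance: looks like a distance, but is not a metric."""
    return euclidean(x, y) ** 2


def satisfies_metric_axioms(d, points, tol=1e-12):
    """Check the four distance-metric properties on a finite set of points."""
    for x in points:
        if abs(d(x, x)) > tol:                      # zero distance to self
            return False
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                             # never negative
            return False
        if abs(d(x, y) - d(y, x)) > tol:            # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:       # triangle inequality
            return False
    return True


points_2d = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
points_1d = [(0.0,), (1.0,), (2.0,)]

euclidean_ok = satisfies_metric_axioms(euclidean, points_2d)
squared_ok = satisfies_metric_axioms(squared_euclidean, points_1d)
print(euclidean_ok, squared_ok)
```

For the squared distance, going from 0 to 2 directly costs 4, while going through 1 costs 1 + 1 = 2, so the direct route is longer and the triangle inequality is violated.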

Distance matrix

Data matrix for M objects of N dimensions

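\[
X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix}
= \begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\
x_{2,1} & x_{2,2} & & \vdots \\
\vdots & & \ddots & \\
x_{M,1} & \cdots & & x_{M,N}
\end{bmatrix}
\]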

Distance matrix

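\[
D_{ij} = D_{ji} = d(x_i, x_j), \qquad
D = \begin{bmatrix}
d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\
d_{2,1} & d_{2,2} & & \vdots \\
\vdots & & \ddots & \\
d_{M,1} & \cdots & & d_{M,M}
\end{bmatrix}
\]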

Different clustering algorithms

• Partitioning

o keep adjusting clusters until convergence

o e.g. K-means

o Also LDA and many Bayesian models, from a certain perspective

• Agglomerative hierarchical

o start with leaves, repeatedly merge clusters

o e.g. MIN and MAX approaches

• Divisive hierarchical

o start with root, repeatedly split clusters

o e.g. binary split
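As an illustration of the partitioning approach, here is a minimal K-means sketch in plain Python. It assumes squared Euclidean distance and random initial centroids drawn from the data; the function names, the `init` parameter, and the sample points are illustrative assumptions rather than anything specified in the lecture.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Keep adjusting clusters until the assignment stops changing."""
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:      # converged: nothing moved
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:               # keep the old centroid if a cluster empties
                centroids[c] = mean_point(members)
    return labels, centroids


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends on the initial centroids, so different seeds can give different cluster numbering and, on harder data, different clusters.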

K-means demo

http://www.paused21.net/off/kmeans/bin/

UK House of Lords voting clusters

UK House of Lords voting clusters

Algorithm instructed to separate the lords into five clusters. Output:

1 2 2 1 3 2 2 2 1 4

1 1 1 1 1 1 5 2 1 1

2 2 1 2 3 2 2 4 2 1

2 3 2 1 3 1 1 2 1 2

1 5 2 1 4 2 2 1 2 1

1 4 1 1 4 1 2 2 1 5

1 1 1 2 3 3 2 2 2 5

2 3 1 2 1 4 1 1 4 4

1 1 2 1 1 2 2 2 2 1

2 1 2 1 2 2 1 3 2 1

1 2 2 1 2 3 4 2 2 2

.

.

Voting clusters with parties

LDem XB Lab LDem XB Lab XB Lab Con XB

1 2 2 1 3 2 2 2 1 4

Con Con LDem Con Con Con LDem Lab Con LDem

1 1 1 1 1 1 5 2 1 1

Lab Lab Con Lab XB XB Lab XB Lab Con

2 2 1 2 3 2 2 4 2 1

Lab XB Lab Con XB XB LDem Lab XB Lab

2 3 2 1 3 1 1 2 1 2

Con Con Lab Con XB Lab Lab Con XB XB

1 5 2 1 4 2 2 1 2 1

Con XB Con Con XB Con Lab XB LDem Con

1 4 1 1 4 1 2 2 1 5

Con Con Con Lab Bp XB Lab Lab Lab LDem

1 1 1 2 3 3 2 2 2 5

Lab XB Con Lab Con XB Con Con XB XB

2 3 1 2 1 4 1 1 4 4

Con Con Lab Con Con XB Lab Lab Lab Con

1 1 2 1 1 2 2 2 2 1

Lab LDem Lab Con Lab Lab Con XB Lab Con

2 1 2 1 2 2 1 3 2 1

Con Lab XB Con XB XB XB Lab Lab Lab

1 2 2 1 2 3 4 2 2 2
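One way to read the grid above is to count how often each party lands in each cluster. The sketch below does this for the first two rows of the slide only (20 lords); the variable names are illustrative, and the counts for the full data set would differ.

```python
from collections import Counter

# First two rows of the slide: party labels and the cluster assigned to each lord.
parties = ("LDem XB Lab LDem XB Lab XB Lab Con XB "
           "Con Con LDem Con Con Con LDem Lab Con LDem").split()
clusters = [1, 2, 2, 1, 3, 2, 2, 2, 1, 4,
            1, 1, 1, 1, 1, 1, 5, 2, 1, 1]

# Count (party, cluster) pairs.
crosstab = Counter(zip(parties, clusters))

# Print one row per party, one column per cluster.
cluster_ids = sorted(set(clusters))
print("party " + " ".join(f"{c:>3}" for c in cluster_ids))
for party in sorted(set(parties)):
    row = " ".join(f"{crosstab[(party, c)]:>3}" for c in cluster_ids)
    print(f"{party:<5} {row}")
```

In this small sample, every Conservative falls in cluster 1 and every Labour lord in cluster 2, while the crossbenchers (XB) are spread over several clusters.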

Clustering Algorithm

Input: data points (feature vectors).

Output: a set of clusters, each of which is a set of points.

Visualization

Input: data points (feature vectors).

Output: a picture of the points.

Dimensionality reduction

Problem: vector space is high-dimensional. Up to thousands of dimensions. The screen is two-dimensional.

We have to go from

\( x \in \mathbb{R}^N \)

to much lower dimensional points

\( y \in \mathbb{R}^K, \quad K \ll N \)

Probably K=2 or K=3.

Projection

Projection from 3 to 2 dimensions

Which direction should we look from?

Principal components analysis: find a linear projection that preserves greatest variance

Take first K eigenvectors of covariance matrix corresponding to largest eigenvalues. This gives a K-dimensional sub-space for projection.
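The eigenvector recipe above can be sketched with NumPy as follows. This is a minimal illustration, assuming the data are mean-centered before projection (the usual convention, not stated on the slide); the function name `pca_project` and the sample matrix are illustrative.

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X onto the top-k principal components.

    Returns (Y, P): Y holds the k-dimensional coordinates and P is the
    k-by-N projection matrix, so each row satisfies y = P (x - mean).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)             # N-by-N covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1][:k]        # indices of the k largest
    P = eigenvectors[:, order].T                     # rows are eigenvectors
    return centered @ P.T, P


# Four points in 3 dimensions that vary mostly along the (1, 1, 0) direction.
X = np.array([[0.0, 0.0, 0.0],
              [1.0, 1.0, 0.1],
              [2.0, 2.0, -0.1],
              [3.0, 3.0, 0.0]])
Y, P = pca_project(X, k=2)
print(Y.shape, P.shape)
```

`np.linalg.eigh` is used because a covariance matrix is symmetric; it returns eigenvalues in ascending order, which is why the indices are reversed before taking the first K.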

Linear projections

Projects in a straight line to closest point on “screen.”

y = Px

where P is a K by N matrix.

Projection from 2 to 1 dimensions

Nonlinear projections

Still going from high-dimensional x to low-dimensional y, but now

y = f(x)

for some function f(), not linear. So, may not preserve relative distances, angles, etc.

Fish-eye projection from 3 to 2 dimensions

Multidimensional scaling

Idea: try to preserve distances between points “as much as possible.”

If we have the distances between all points in a distance matrix,

\( D_{ij} = \lVert x_i - x_j \rVert \) for all i, j

we can recover the original {x_i} coordinates exactly (up to rigid transformations). Like working out a country map if you know how far away each city is from every other.

Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)

Reducing dimension with MDS

Notice: dimension N is not encoded in the distance matrix D (it’s M by M where M is number of points).

MDS formula (theoretically) allows us to recover point coordinates {x} in any number of dimensions k.
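The slide's formula is not reproduced in this text, so the following sketch uses the double-centering form in which classical MDS is commonly presented: square the distances, center them, and take the top-k eigenvectors scaled by the square roots of their eigenvalues. The function names and the rectangle example are illustrative assumptions.

```python
import numpy as np


def distance_matrix(X):
    """M-by-M matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, k):
    """Recover k-dimensional coordinates from a distance matrix D."""
    D = np.asarray(D, dtype=float)
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # inner products of centered points
    eigenvalues, eigenvectors = np.linalg.eigh(B)
    order = np.argsort(eigenvalues)[::-1][:k]  # k largest eigenvalues
    top = np.clip(eigenvalues[order], 0.0, None)  # guard against tiny negatives
    return eigenvectors[:, order] * np.sqrt(top)


# Corners of a 3-by-4 rectangle: only the distances are handed to MDS.
X = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = distance_matrix(X)
Y = classical_mds(D, k=2)
D_recovered = distance_matrix(Y)
print(np.round(D_recovered, 6))
```

The recovered coordinates `Y` generally differ from `X` by a rotation, reflection, or shift, but the pairwise distances match, which is what "up to rigid transformations" means.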

MDS Stress minimization

The formula actually minimizes “stress”

\[
\mathrm{stress}(x) = \sum_{i,j} \left( \lVert x_i - x_j \rVert - d_{ij} \right)^2
\]

Think of “springs” between every pair of points. Spring between x_i, x_j has rest length d_ij.

Stress is zero if all high-dimensional distances matched exactly in low dimension.
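To make the stress sum concrete, here is a small sketch that evaluates it for two candidate layouts of three points whose target distances are all 1 (an equilateral triangle). The function name `stress` and the layouts are illustrative assumptions; the sum runs over all ordered pairs (i, j), as in the formula.

```python
import math


def stress(points, d):
    """Sum over all pairs (i, j) of (|x_i - x_j| - d_ij) squared."""
    total = 0.0
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            total += (math.dist(xi, xj) - d[i][j]) ** 2
    return total


# Target distances: every pair of distinct points is 1 apart.
d = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]

# A 2D layout that matches the targets: an equilateral triangle.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

# A 1D layout: the third point is squeezed between the other two.
flat = [(0.0,), (1.0,), (0.5,)]

stress_triangle = stress(triangle, d)
stress_flat = stress(flat, d)
print(stress_triangle, stress_flat)
```

In two dimensions every spring sits at its rest length, so the stress is essentially zero. On a line, two of the three springs are compressed to length 0.5, and no 1D layout can remove the stress entirely.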

Multi-dimensional Scaling

Like “flattening” a stretchy structure into 2D, so that distances between points are preserved (as much as possible).

House of Lords MDS plot

Robustness of results

Regarding these analyses of legislative voting, we could still ask:

• Are we modeling the right thing? (What about other legislative work, e.g. in committee?)

• Are our underlying assumptions correct? (Do representatives really have “ideal points” in a preference space?)

• What are we trying to argue? What will be the effect of pointing out this result?

Text Analysis in Journalism

USA Today/Twitter Political Issues Index

Politico analysis of GOP primary, 2012

CNN State of the Union Twitter analysis, 2010

The Post obtained draft versions of 12 audits by the inspector general’s office, covering projects from the Caribbean to Pakistan to the Republic of Georgia between 2011 and 2013. The drafts are confidential and rarely become public. The Post compared the drafts with the final reports published by the inspector general’s office and interviewed former and current employees. E-mails and other internal records also were reviewed.

The Post tracked changes in the language that auditors used to describe USAID and its mission offices. The analysis found that more than 400 negative references were removed from the audits between the draft and final versions.

Sentiment analysis used by Washington Post, 2014

LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years

Los Angeles Times

The Times analyzed Los Angeles Police Department violent crime data from 2005 to 2012. Our analysis found that the Los Angeles Police Department misclassified an estimated 14,000 serious assaults as minor offenses, artificially lowering the city’s crime levels. To conduct the analysis, The Times used an algorithm that combined two machine learning classifiers. Each classifier read in a brief description of the crime, which it used to determine if it was a minor or serious assault.

An example of a minor assault reads: “VICTS AND SUSPS BECAME INV IN VERBA ARGUMENT SUSP THEN BEGAN HITTING VICTS IN THE FACE.”

We used a machine-learning method known as latent Dirichlet allocation to identify the topics in all 14,400 petitions and to then categorize the briefs. This enabled us to identify which lawyers did which kind of work for which sorts of petitioners. For example, in cases where workers sue their employers, the lawyers most successful getting cases before the court were far more likely to represent the employers rather than the employees.

The Echo Chamber, Reuters

Document Vector Space Model

Documents, not words

We can use clustering and classification techniques if we can convert documents into vectors.

As before, we want to find numerical “features” that describe the document.

How do we capture the meaning of a document in numbers?

What is this document "about"?

Most commonly occurring words are a pretty good indicator.

30 the

23 to

19 and

19 a

18 animal

17 cruelty

15 of

15 crimes

14 in

14 for

11 that

8 crime

7 we

Features = words works fine

Encode each document as the list of words it contains.

Dimensions = vocabulary of document set.

Value on each dimension = # of times word appears in document

Example

D1 = “I like databases”

D2 = “I hate hate databases”

Each row = document vector

All rows = term-document matrix

Individual entry = tf(t,d) = “term frequency”

Aka “Bag of words” model

Throws out word order.

e.g. “soldiers shot civilians” and “civilians shot soldiers” encoded identically.
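The bag-of-words encoding can be sketched in a few lines of Python. The example below builds the term-document matrix for D1 and D2 from the slide; the function names are illustrative, words are lowercased, and columns are ordered alphabetically, which is an arbitrary choice.

```python
from collections import Counter


def bag_of_words(text):
    """Count how many times each lowercased word appears; order is discarded."""
    return Counter(text.lower().split())


def term_document_matrix(docs):
    """One row per document, one column per vocabulary term, entries are tf(t, d)."""
    counts = [bag_of_words(doc) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


docs = ["I like databases", "I hate hate databases"]
vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)

# Word order is thrown away, so these two sentences get the same encoding.
same = bag_of_words("soldiers shot civilians") == bag_of_words("civilians shot soldiers")
print(same)
```

Each row is a document vector, and "hate" gets the value 2 in the second row because it appears twice in D2.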

Tokenization

The documents come to us as long strings, not individual words. Tokenization is the process of converting the string into individual words, or “tokens.”

For this course, we will assume a very simple strategy:

o convert all letters to lowercase

o remove all punctuation characters

o separate words based on spaces

Note that this won't work at all for Chinese. It will fail in many ways even for English. How?
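The three-step strategy above can be written directly in Python. This is a minimal sketch: it treats `string.punctuation` (ASCII punctuation only) as "all punctuation characters" and splits on any whitespace rather than only spaces, both of which are simplifying assumptions.

```python
import string


def tokenize(text):
    """Lowercase, strip ASCII punctuation, then split on whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()


tokens = tokenize("My car is old. I want a new car, a shiny car")
print(tokens)

# One way the simple strategy fails for English: abbreviations and
# contractions lose their punctuation and turn into other strings.
awkward = tokenize("The U.S. didn't sign.")
print(awkward)
```

In the second call, "U.S." collapses into "us", which is indistinguishable from the pronoun, and "didn't" becomes "didnt".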

Distance metric for text

Useful for:

• clustering documents

• finding docs similar to example

• matching a search query

Basic idea: look for overlapping terms

Cosine similarity

Given document vectors a,b define

similarity(a, b) ≡ a • b

If each word occurs exactly once in each document, equivalent to counting overlapping words.

Note: not a distance function, as similarity increases when documents are… similar. (What part of the definition of a distance function is violated here?)

Problem: long documents always win

Let a = “This car runs fast.”

Let b = “My car is old. I want a new car, a shiny car”

Let query = “fast car”

   this  car  runs  fast  my  is  old  I  want  a  new  shiny
a     1    1     1     1   0   0    0  0     0  0    0      0
b     0    3     0     0   1   1    1  1     1  1    1      1
q     0    1     0     1   0   0    0  0     0  0    0      0

Problem: long documents always win

similarity(a,q) = 1*1 [car] + 1*1 [fast] = 2

similarity(b,q) = 3*1 [car] + 0*1 [fast] = 3

Longer document more “similar”, by virtue of repeating words.

Normalize document vectors

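\[
\mathrm{similarity}(a, b) \equiv \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \cos(\theta)
\]

returns result in [0,1] for vectors with non-negative entries, such as word counts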

Normalized query example

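   this  car  runs  fast  my  is  old  I  want  a  new  shiny
a     1    1     1     1   0   0    0  0     0  0    0      0
b     0    3     0     0   1   1    1  1     1  1    1      1
q     0    1     0     1   0   0    0  0     0  0    0      0

\[
\mathrm{similarity}(a, q) = \frac{2}{\sqrt{4}\,\sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707
\]

\[
\mathrm{similarity}(b, q) = \frac{3}{\sqrt{17}\,\sqrt{2}} \approx 0.514
\]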

Cosine similarity

\[
\cos\theta = \mathrm{similarity}(a, b) \equiv \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\]

Cosine distance (finally)

\[
\mathrm{dist}(a, b) \equiv 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\]
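The dot product, cosine similarity, and cosine distance can be compared on the query example. The sketch below copies the count vectors for a, b, and q from the slide's table; the function names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    return dot(a, b) / (norm(a) * norm(b))


def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)


# Columns: this car runs fast my is old I want a new shiny
a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
q = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(dot(a, q), dot(b, q))                      # unnormalized: b wins
print(round(cosine_similarity(a, q), 3),
      round(cosine_similarity(b, q), 3))         # normalized: a wins
print(round(cosine_distance(a, q), 3),
      round(cosine_distance(b, q), 3))           # smaller distance = closer
```

With the raw dot product the longer document b scores higher (3 versus 2), but after normalization a is the better match for the query.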

Problem: common words

We want to look at words that “discriminate” among documents.

Stopwords: if all documents contain “the,” are all documents similar?

Common words: if most documents contain “car” then car doesn’t tell us much about (contextual) similarity.

Context matters

[Figure: two document collections, General News and Car Reviews, with each document marked as containing or not containing “car”.]

Document Frequency

Idea: de-weight common words

Common = appears in many documents

\[
\mathrm{df}(t, D) = \frac{\lvert \{ d \in D : t \in d \} \rvert}{\lvert D \rvert}
\]

“document frequency” = fraction of docs containing term

Inverse Document Frequency

Invert (so more common = smaller weight) and take log

\[
\mathrm{idf}(t, D) = \log \left( \frac{\lvert D \rvert}{\lvert \{ d \in D : t \in d \} \rvert} \right)
\]

TF-IDF

Multiply term frequency by inverse document frequency

\[
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) = n(t, d) \cdot \log \left( \frac{\lvert D \rvert}{n(t, D)} \right)
\]

n(t,d) = number of times term t appears in doc d

n(t,D) = number of docs in D containing t
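The TF-IDF formula can be sketched directly from the two counts n(t,d) and n(t,D). The example below assumes the natural logarithm (the slide does not specify a base) and documents that are already tokenized; the function name `tf_idf` and the three-document corpus are illustrative.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(corpus)
    # n(t, D): number of documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)                     # n(t, d)
        vectors.append({
            term: n * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return vectors


corpus = [
    "the car runs fast".split(),
    "the car is old".split(),
    "the dog runs".split(),
]
vectors = tf_idf(corpus)
for term, score in sorted(vectors[0].items(), key=lambda item: -item[1]):
    print(f"{term:<5} {score:.3f}")
```

"the" appears in every document, so its weight is log(1) = 0, while "fast" appears in only one document and gets the highest score in the first document.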

TF-IDF depends on entire corpus

The TF-IDF vector for a document changes if we add another document to the corpus.

tfidf(t, d, D) = tf(t, d) ⋅ idf(t, D)

if we add a document, D changes!

TF-IDF is sensitive to context. The context is all other documents
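A small sketch of this corpus dependence: the IDF weight of "car" is computed before and after adding one more document. It assumes the natural logarithm and represents each document as a set of terms; the function name `idf` and the documents are illustrative.

```python
import math


def idf(term, corpus):
    """log(|D| / number of documents containing the term).

    Assumes the term appears in at least one document.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


corpus = [{"car", "fast"}, {"car", "old"}, {"dog"}]

idf_before = idf("car", corpus)                          # log(3/2)
idf_more_car = idf("car", corpus + [{"car", "shiny"}])   # log(4/3)
idf_less_car = idf("car", corpus + [{"dog", "barks"}])   # log(4/2)

print(round(idf_before, 3), round(idf_more_car, 3), round(idf_less_car, 3))
```

Adding another document about cars makes "car" more common and lowers its weight. Adding a document without "car" raises it, even though none of the original documents changed.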

What is this document "about"?

Each document is now a vector of TF-IDF scores for every word in the document. We can look at which words have the top scores.

crimes 0.0675591652263963

cruelty 0.0585772393867342

crime 0.0257614113616027

reporting 0.0208838148975406

animals 0.0179258756717422

michael 0.0156575858658684

category 0.0154564813388897

commit 0.0137447439653709

criminal 0.0134312894429112

societal 0.0124164973052386

trends 0.0119505837811614

conviction 0.0115699047136248

patterns 0.011248045148093

Salton’s description of tf-idf

- from Salton et al, A Vector Space Model for Automatic Indexing, 1975

[Figure: two panels, labeled TF and TF-IDF]

nj-sentator-menendez corpus, Overview sample files

color = human tags generated from TF-IDF clusters

Cluster Hypothesis

“documents in the same cluster behave similarly with respect to relevance to information needs”

- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement, but the crucial link between human semantics and mathematical properties.

Articulated as early as 1971, has been shown to hold at web scale, widely assumed.

Bag of words + TF-IDF hard to beat

Practical win: good precision-recall metrics in tests with human-tagged document sets.

Still the dominant text indexing scheme used today. (Lucene, FAST, Google…) Many variants and extensions.

Some, but not much, theory to explain why this works. (E.g. why that particular IDF formula? Why doesn’t indexing bigrams improve performance?)

Collectively: the vector space document model

No unique “right” clustering

Different distance metrics and clustering algorithms give different results.

Should we sort incident reports by location, time, actor, event type, author, cost, casualties…?

There is only context-specific categorization.

And the computer doesn’t understand your context.

Different libraries, different categories
