From the course Frontiers of Computational Journalism, Columbia University, Fall 2017: http://www.compjournalism.com/?p=206

Computational Journalism

Columbia Journalism School

Week 1: Introduction, Clustering

September 8, 2017

Computational Journalism: Definitions

"…discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking."

Computational Journalism: Definitions

"…forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents."

Computational Journalism: Definitions

"…public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government."

Cohen et al. model

[Diagrams: the journalism pipeline Data → Reporting → Filtering → User, drawn several times with "CS" boxes attached at different stages: computer science for presentation and interaction, for filtering stories for the user, and eventually for reporting on the data itself.]

Examples of filters

What an editor puts on the front page

Google News

Reddit's comment system

New York Times recommendation system

etc.

http://snap.stanford.edu/nifty

Kony 2012 early network, by Gilad Lotan

CS in Journalism

[Diagram: the same Data → Reporting → Filtering → User pipeline, with CS boxes at every stage.]

Journalism as a cycle

[Diagram: journalism as a cycle: Data → Reporting → Filtering → User → Effects, which feed back into new Data, with CS at each stage.]

Journalism with algorithms

vs.

Journalism about algorithms

Websites Vary Prices, Deals Based on Users' Information

Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012

Message Machine

Jeff Larson, Al Shaw, ProPublica, 2012

Computer Science and Journalism

Reporting on computation in society

[Diagram: the fields connecting the two: Information Retrieval, Visualization, Clustering, Natural Language Processing, Text Analysis, Sociology, Filter Design, Social Network Analysis, Graph Theory, Artificial Intelligence, Knowledge Representation, Drawing Conclusions, Cognitive Science, Statistics, Epistemology.]

Course Structure

Unit 1: Filters

Information retrieval, TF-IDF, topic modeling, search engines, social filtering, filtering system design.

Unit 2: Interpreting Data

Quantification, data quality, statistical basics, causation, prediction, narratives.

Unit 3: Methods

Visualization, knowledge representation, social network analysis, privacy and security, tracking flow and effects.

Administration

Assignments

Some assignments require programming, but your writing counts for more than your code!

Final project

Code, story, or research

Course blog

http://compjournalism.com

Grading

40% assignments

40% final project

20% class participation

This class

Introduction

Classification and clustering

Text analysis in journalism

The Document Vector Space model

Classification and Clustering

"Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general."

Classification Techniques

Interpreting High Dimensional Data

N = 1043 lords by M = 1630 votes

2 = aye, 4 = nay, -9 = didn't vote

Distance metric

d(x, y) ≥ 0

- distance is never negative

d(x, x) = 0

- reflexivity: zero distance to self

d(x, y) = d(y, x)

- symmetry: x to y same as y to x

d(x, z) ≤ d(x, y) + d(y, z)

- triangle inequality: going direct is shorter

Distance matrix

Data matrix for M objects of N dimensions:

$$X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & & \\ \vdots & & \ddots & \\ x_{M,1} & \cdots & & x_{M,N} \end{pmatrix}$$

Distance matrix:

$$D_{ij} = D_{ji} = d(x_i, x_j) = \begin{pmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\ d_{2,1} & d_{2,2} & & \\ \vdots & & \ddots & \\ d_{M,1} & \cdots & & d_{M,M} \end{pmatrix}$$

Different clustering algorithms

Partitioning

o keep adjusting clusters until convergence

o e.g. K-means

o Also LDA and many Bayesian models, from a certain perspective

Agglomerative hierarchical

o start with leaves, repeatedly merge clusters

o e.g. MIN and MAX approaches

Divisive hierarchical

o start with root, repeatedly split clusters

o e.g. binary split

K-means demo

http://www.paused21.net/off/kmeans/bin/
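As a sketch of the partitioning idea (mine, not the demo's source), here is a bare-bones k-means in Python showing the keep-adjusting-until-convergence loop; the toy data is invented.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        """Plain k-means: alternate assignment and centroid update."""
        rng = np.random.default_rng(seed)
        # Initialize centroids by picking k distinct data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the mean of its assigned points.
            new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centroids[c] for c in range(k)])
            if np.allclose(new, centroids):
                break  # converged: clusters stopped changing
            centroids = new
        return labels, centroids

    # Two obvious clusters.
    X = np.array([[0, 0], [0, 1], [1, 0],
                  [10, 10], [10, 11], [11, 10]], dtype=float)
    labels, _ = kmeans(X, k=2)
    print(labels)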

UK House of Lords voting clusters

Algorithm instructed to separate members into five clusters. Output:

1 2 2 1 3 2 2 2 1 4

1 1 1 1 1 1 5 2 1 1

2 2 1 2 3 2 2 4 2 1

2 3 2 1 3 1 1 2 1 2

1 5 2 1 4 2 2 1 2 1

1 4 1 1 4 1 2 2 1 5

1 1 1 2 3 3 2 2 2 5

2 3 1 2 1 4 1 1 4 4

1 1 2 1 1 2 2 2 2 1

2 1 2 1 2 2 1 3 2 1

1 2 2 1 2 3 4 2 2 2

…

Voting clusters with parties

LDem XB Lab LDem XB Lab XB Lab Con XB

1 2 2 1 3 2 2 2 1 4

Con Con LDem Con Con Con LDem Lab Con LDem

1 1 1 1 1 1 5 2 1 1

Lab Lab Con Lab XB XB Lab XB Lab Con

2 2 1 2 3 2 2 4 2 1

Lab XB Lab Con XB XB LDem Lab XB Lab

2 3 2 1 3 1 1 2 1 2

Con Con Lab Con XB Lab Lab Con XB XB

1 5 2 1 4 2 2 1 2 1

Con XB Con Con XB Con Lab XB LDem Con

1 4 1 1 4 1 2 2 1 5

Con Con Con Lab Bp XB Lab Lab Lab LDem

1 1 1 2 3 3 2 2 2 5

Lab XB Con Lab Con XB Con Con XB XB

2 3 1 2 1 4 1 1 4 4

Con Con Lab Con Con XB Lab Lab Lab Con

1 1 2 1 1 2 2 2 2 1

Lab LDem Lab Con Lab Lab Con XB Lab Con

2 1 2 1 2 2 1 3 2 1

Con Lab XB Con XB XB XB Lab Lab Lab

1 2 2 1 2 3 4 2 2 2

Clustering Algorithm

Output: a set of clusters, each of which is a set of points.

Visualization

Output: a picture of the points.

Dimensionality reduction

Problem: the vector space is high-dimensional, up to thousands of dimensions, while the screen is two-dimensional. We have to go from

x ∈ R^N

to much lower dimensional points

y ∈ R^K, K ≪ N

Projection

Which direction should we look from?

Principal components analysis: find a linear projection that preserves the greatest variance. Take the eigenvectors of the data covariance matrix corresponding to the K largest eigenvalues. This gives a K-dimensional sub-space for projection.
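That recipe fits in a few lines of numpy; a sketch under the assumption that we can afford a full eigendecomposition of the covariance matrix (large corpora would use a truncated SVD instead):

    import numpy as np

    def pca_project(X, k=2):
        """Project rows of X onto the top-k principal components."""
        Xc = X - X.mean(axis=0)           # center the data
        C = np.cov(Xc, rowvar=False)      # N x N covariance matrix
        evals, evecs = np.linalg.eigh(C)  # eigh sorts eigenvalues ascending
        P = evecs[:, -k:][:, ::-1]        # eigenvectors of the k largest eigenvalues
        return Xc @ P                     # K-dimensional coordinates

    X = np.random.default_rng(0).normal(size=(100, 5))
    print(pca_project(X, k=2).shape)  # (100, 2)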

Linear projections

Project each point to the closest point on the "screen":

y = Px

where P is a K by N matrix.

Nonlinear projections

Still going from high-dimensional x to low-dimensional y, but now

y = f(x)

where f need not be linear. So the projection may not preserve relative distances, angles, etc.

Fish-eye projection from 3 to 2 dimensions

Multidimensional scaling

Idea: try to preserve distances between points "as much as possible." (Distances are unchanged by rotations, translations, and reflections, so the answer is only determined up to such transformations.) Like working out a country map if you know how far away each city is from every other.

Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)

Reducing dimension with MDS

Notice: the dimension N is not encoded in the distance matrix D (it's M by M, where M is the number of points), so we can solve for coordinates {x} in any number of dimensions k.

MDS Stress minimization

MDS actually minimizes the "stress":

$$\mathrm{stress}(x) = \sum_{i,j} \left( \lVert x_i - x_j \rVert - d_{ij} \right)^2$$

Think of springs between every pair of points, where the spring between x_i and x_j has rest length d_ij. Stress is the energy left in the springs when the structure is squashed into the low dimension.

Multi-dimensional Scaling

Like "flattening" a

stretchy structure into

2D, so that distances

between points are

preserved (as much as

possible")

House of Lords MDS plot

Robustness of results

Regarding these analyses of legislative voting, we could still ask:

Are we modeling the right thing? (What about other legislative work, e.g. in committee?)

Are our underlying assumptions correct? (Do representatives really have ideal points in a preference space?)

What are we trying to argue? What will be the effect of pointing out this result?

Text Analysis in Journalism

USA Today/Twitter Political Issues Index

Politico analysis of GOP primary, 2012

CNN State of the Union Twitter analysis, 2010

The Post obtained draft versions of 12 audits by the inspector general's office, covering projects from the Caribbean to Pakistan to the Republic of Georgia between 2011 and 2013. The drafts are confidential and rarely become public. The Post compared the drafts with the final reports published by the inspector general's office and interviewed former and current employees. E-mails and other internal records also were reviewed.

The Post tracked changes in the language that auditors used to describe USAID and its mission offices. The analysis found that more than 400 negative references were removed from the audits between the draft and final versions.

LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years

Los Angeles Times

The Times analyzed Los Angeles Police Department violent crime data from 2005 to 2012. Our analysis found that the Los Angeles Police Department misclassified an estimated 14,000 serious assaults as minor offenses, artificially lowering the city's crime levels. To conduct the analysis, The Times used an algorithm that combined two machine learning classifiers. Each classifier read in a brief description of the crime, which it used to determine if it was a minor or serious assault.

Example crime description: "INV IN VERBA ARGUMENT SUSP THEN BEGAN HITTING VICTS IN THE FACE."
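The Times did not publish its code, so as a hedged illustration of the general technique, here is a minimal bag-of-words text classifier in scikit-learn; the training examples and labels below are invented placeholders, and a single TF-IDF plus logistic-regression pipeline stands in for the two combined classifiers the article describes.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Invented placeholders; the real model trained on thousands of
    # labeled crime descriptions.
    descriptions = [
        "SUSP THEN BEGAN HITTING VICTS IN THE FACE",
        "SUSP STABBED VICT WITH KNIFE",
        "SUSP TOOK WALLET FROM COUNTER",
        "SUSP SHOVED VICT DURING ARGUMENT",
    ]
    labels = ["serious", "serious", "minor", "minor"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(descriptions, labels)
    print(model.predict(["SUSP PUNCHED VICT REPEATEDLY"]))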

We used a machine-learning method known as latent Dirichlet allocation to identify the topics in all 14,400 petitions and to then categorize the briefs. This enabled us to identify which lawyers did which kind of work for which sorts of petitioners. For example, in cases where workers sue their employers, the lawyers most successful at getting cases before the court were far more likely to represent the employers rather than the employees.
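Reuters does not say which implementation it used; a minimal latent Dirichlet allocation sketch with scikit-learn (an assumed tooling choice, on invented toy documents) looks like this:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "worker sues employer over wages and overtime",
        "employer appeals ruling on employee benefits",
        "patent dispute over smartphone software",
        "software company sues rival over patent",
    ]

    # LDA works on raw term counts, not TF-IDF.
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(counts)

    # Show the top words in each discovered topic.
    terms = vec.get_feature_names_out()
    for t, weights in enumerate(lda.components_):
        top = weights.argsort()[::-1][:4]
        print(f"topic {t}:", [terms[i] for i in top])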

Document Vector Space Model

Documents, not words

We can use clustering and classification techniques if we can convert documents into vectors: lists of numbers that somehow describe the document. But which numbers?

What is this document "about"?

The most commonly occurring words are a pretty good indicator.

30 the

23 to

19 and

19 a

18 animal

17 cruelty

15 of

15 crimes

14 in

14 for

11 that

8 crime

7 we

Features = words works fine: represent each document by the counts of the words it contains.

Example

D1 = "I like databases"
D2 = "I hate hate databases"

            D1  D2
I            1   1
like         1   0
hate         0   2
databases    1   1

All rows together = the term-document matrix. An individual entry = tf(t, d) = term frequency.

Aka the "bag of words" model: word order is lost, so "databases I like" would be encoded identically.
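One quick way to build this matrix is scikit-learn's CountVectorizer (an assumed tooling choice; any word-counting code would do). The token_pattern override keeps one-letter tokens like "I":

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["I like databases", "I hate hate databases"]

    vec = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
    tdm = vec.fit_transform(docs)

    print(vec.get_feature_names_out())  # ['databases' 'hate' 'i' 'like']
    print(tdm.toarray())                # [[1 0 1 1]
                                        #  [1 2 1 0]]   tf(t, d) per document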

Tokenization

The documents come to us as long strings, not individual words. Tokenization is the process of converting the string into individual words, or "tokens." A simple recipe:

o convert all letters to lowercase

o remove all punctuation characters

o separate words based on spaces

Note that this won't work at all for Chinese. It will fail in many ways even for English. How?
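A naive tokenizer along the lines of those three steps (a sketch; real pipelines handle contractions, hyphens, and non-Latin scripts far more carefully):

    import string

    def tokenize(text):
        """Lowercase, strip punctuation, split on whitespace."""
        text = text.lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        return text.split()

    print(tokenize("I can't believe it's not butter!"))
    # ['i', 'cant', 'believe', 'its', 'not', 'butter']
    # Stripping apostrophes merges "can't" into "cant" and "it's" into
    # "its": one of the ways this fails even for English.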

Distance metric for text

Useful for:

clustering documents

finding docs similar to example

matching a search query

Cosine similarity

Given document vectors a, b define:

similarity(a, b) ≡ a · b

If each word occurs exactly once in each document, this is equivalent to counting overlapping words. Note the score gets bigger, not smaller, when documents are similar; it is a similarity, not a distance. (What part of the definition of a distance function is violated here?)

Problem: long documents always win

Let a = "this car runs fast"
Let b = "My car is old. I want a new car, a shiny car"
Let query q = "fast car"

        this car runs fast my is old I want a new shiny
a        1    1    1    1   0  0   0  0   0   0  0    0
b        0    3    0    0   1  1   1  1   1   1  1    1
q        0    1    0    1   0  0   0  0   0   0  0    0

similarity(a, q) = 1·1 [car] + 1·1 [fast] = 2
similarity(b, q) = 3·1 [car] + 0·1 [fast] = 3

The long document wins just by repeating words, even though it never mentions "fast".

Normalize document vectors

similarity(a, b) ≡ (a · b) / (‖a‖ ‖b‖) = cos(θ)

Normalized query example

        this car runs fast my is old I want a new shiny
a        1    1    1    1   0  0   0  0   0   0  0    0
b        0    3    0    0   1  1   1  1   1   1  1    1
q        0    1    0    1   0  0   0  0   0   0  0    0

similarity(a, q) = 2 / (√4 · √2) ≈ 0.707

similarity(b, q) = 3 / (√17 · √2) ≈ 0.514
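A few lines of numpy reproduce these numbers (my verification sketch, not slide code):

    import numpy as np

    def cos_sim(a, b):
        # Normalized dot product: a.b / (|a| |b|)
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Columns: this car runs fast my is old I want a new shiny
    a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
    b = np.array([0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
    q = np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

    print(cos_sim(a, q))  # 0.7071...: the short document now wins
    print(cos_sim(b, q))  # 0.5144...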

Cosine similarity

cos(θ) = similarity(a, b) ≡ (a · b) / (‖a‖ ‖b‖)

Cosine distance (finally)

dist(a, b) ≡ 1 - (a · b) / (‖a‖ ‖b‖)

Problem: common words

We want to look at words that discriminate among documents. If two documents both contain "the", does that make the documents similar? A word that appears in nearly every document doesn't tell us much about (contextual) similarity.

Context matters

[Figure: two corpora, General News and Car Reviews; filled dots = documents containing "car", open dots = documents that do not. "car" discriminates within general news but appears in essentially every car review.]

Document Frequency

Common = appears in many documents:

df(t, D) = |{d ∈ D : t ∈ d}| / |D|

= fraction of documents in the corpus containing the term.

Inverse Document Frequency

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

TF-IDF

Multiply term frequency by inverse document frequency

tfidf(t, d, D) = n(t, d) · log( |D| / n(t, D) )

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t
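Spelled out in a few lines of Python (a from-scratch sketch; library implementations add smoothing and vary the log base):

    import math

    def tfidf(term, doc, corpus):
        """tfidf(t, d, D) = n(t, d) * log(|D| / n(t, D))"""
        tf = doc.count(term)                               # n(t, d)
        docs_with_term = sum(term in d for d in corpus)    # n(t, D)
        return tf * math.log(len(corpus) / docs_with_term)

    corpus = [
        ["animal", "cruelty", "crimes"],
        ["crimes", "reporting"],
        ["animal", "shelter"],
    ]
    print(tfidf("cruelty", corpus[0], corpus))  # rare term: higher score
    print(tfidf("crimes", corpus[0], corpus))   # common term: lower score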

TF-IDF depends on entire corpus

The TF-IDF vector for a document changes if we add another document to the corpus: TF-IDF scores are only defined relative to a particular corpus of documents.

What is this document "about"?

Each document is now a vector of TF-IDF scores for every word in the document. We can look at which words have the top scores.

crimes 0.0675591652263963

cruelty 0.0585772393867342

crime 0.0257614113616027

reporting 0.0208838148975406

animals 0.0179258756717422

michael 0.0156575858658684

category 0.0154564813388897

commit 0.0137447439653709

criminal 0.0134312894429112

societal 0.0124164973052386

trends 0.0119505837811614

conviction 0.0115699047136248

patterns 0.011248045148093

Salton's description of tf-idf

- from Salton et al., "A Vector Space Model for Automatic Indexing", 1975

[Figure: excerpt from the paper comparing TF and TF-IDF weighting.]

Cluster Hypothesis

"Documents in the same cluster behave similarly with respect to relevance to information needs."

The cluster hypothesis is the bridge between human semantics and the mathematical properties of document vectors; it has never been verified at scale, but it is widely assumed.

Bag of words + TF-IDF hard to beat

Practical win: good precision-recall metrics in tests with human-tagged document sets. Used in deployed search systems (FAST, Google), with many variants and extensions.

Some, but not much, theory to explain why this works. (E.g. why that particular IDF formula? Why doesn't indexing bigrams improve performance?)

No unique right clustering

Different distance metrics and clustering algorithms give different results. Similar by what criterion: event type, author, cost, casualties? And the computer doesn't understand your context.

Different libraries, different categories
