
Frontiers of Computational Journalism
Columbia Journalism School
Week 1: Introduction
September 11, 2015

Lecture 1: Basics
Computer Science and Journalism
Course Structure
Interpreting High Dimensional Data

Computational Journalism: Definitions
“Broadly defined, it can involve changing how stories
are discovered, presented, aggregated, monetized,
and archived. Computation can advance journalism
by drawing on innovations in topic detection, video
analysis, personalization, aggregation, visualization,
and sensemaking.”
- Cohen, Hamilton, Turner, Computational Journalism, 2011

Computational Journalism: Definitions
“Stories will emerge from stacks of financial disclosure
forms, court records, legislative hearings, officials' calendars
or meeting notes, and regulators' email messages that no
one today has time or money to mine. With a suite of
reporting tools, a journalist will be able to scan, transcribe,
analyze, and visualize the patterns in these documents.”
- Cohen, Hamilton, Turner, Computational Journalism, 2011

Cohen et al. model
[Diagram: boxes labeled Data, Reporting, User, and Computer Science]

CS for presentation / interaction
[Diagram: Data, Reporting, and User, with CS labels]

Filter stories for user
[Diagram: Data, Reporting, Filtering, and User, with CS labels]

Examples of filters
• Facebook news feed
• What an editor puts on the front page
• Google News
• Reddit’s comment system
• Twitter
• Techmeme
• New York Times recommendation system

http://snap.stanford.edu/nifty

Kony 2012 early network, by Gilad Lotan

CS in Journalism
[Diagram: Data, Reporting, Filtering, User, and Effects, with multiple CS labels]

Journalism with algorithms
vs.
Journalism about algorithms

Websites Vary Prices, Deals Based on Users' Information
Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012

Message Machine
Jeff Larson, Al Shaw, ProPublica, 2012

Where does data come from?

Computer Science in Journalism
Reporting
Presentation
Filtering
Tracking
Algorithmic accountability

Quantification

Data

Journalism as a cycle
[Diagram: cycle of Data, Reporting, Filtering, User, and Effects, with multiple CS labels]

Computational Journalism: Definitions
“the application of computer science to the problems
of public information, knowledge, and belief, by
practitioners who see their mission as outside of both
commerce and government.”
- Jonathan Stray, A Computational Journalism Reading List, 2011

Course Structure
• Information retrieval: TF-IDF, search engines
• Text analysis: clustering and topic modeling
• Information filtering systems
• Social network analysis
• Knowledge representation
• Drawing conclusions from data
• Writing about data
• Information security
• Tracking flow and effects

[Diagram of course topics and related fields: Information Retrieval, Visualization, Clustering, Natural Language Processing, Text Analysis, Filter Design, Social Network Analysis, Artificial Intelligence, Sociology, Knowledge Representation, Graph Theory, Drawing Conclusions, Cognitive Science, Statistics, Epistemology]

Administration
Assignment after each class
Four assignments require programming, but
your writing counts for more than your code!

Course blog
http://compjournalism.com

Final project
for 6-pt students only

Grading
Dual degree students
Pass/Fail.
Final project: paper, story, or software.

Non-journalism students
80% assignments
20% class participation

Definition of data?

 

My definition of data
a collection of related pieces of recorded information

structured  data

unstructured  data

Quantification
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]

Other things that are tricky to quantify, but quantified anyway
• Intelligence
• Academic performance
• Gender
• Race, ethnicity, nationality
• Number of sexual harassment incidents
• Income
• Political ideology
• ...

Different types of “quantitative”
• Numeric
  o continuous
  o countable
  o bounded?
  o units of measurement?

• Categorical
  o finite, e.g. {on, off}
  o infinite, e.g. {red, yellow, blue, ... chartreuse, ...}
  o ordered?
  o equivalence classes or other structure?

Different types of scales
Temperature
Continuous scale, fixed zero point, physical units, comparative, uniform
Likert scale
Discrete scale, no fixed origin, abstract units, comparative, non-uniform

Likert scales are non-uniform

No averages on a non-uniform scale
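It’s not linear, so is 2·X1 twice as good as X1?
The steps between scale points are not equal in size, so a gap does not keep its meaning when shifted along the scale: (X1+c) - (X2+c) ≠ X1 - X2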
Lots of things don’t make much sense, such as
sum(X1 ... XN) / N = ?
Average is not well defined! (Nor std dev, etc.)
But rank order statistics are robust.
And all of this might not be a problem in practice.
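
A minimal sketch of why this matters, using made-up responses. It assumes two hypothetical codings of the same five ordered categories: the usual 1 to 5, and one where the top category is stretched to 10. Both keep the order intact. The names `LINEAR`, `STRETCHED`, `recode`, and `compare` are illustrative only.

```python
from statistics import mean, median

# Hypothetical Likert responses (1 = strongly disagree ... 5 = strongly agree).
group_a = [1, 1, 2, 5, 5]
group_b = [3, 3, 3, 3, 3]

# Two order-preserving codings of the same categories (both are assumptions).
LINEAR = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
STRETCHED = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}


def recode(responses, coding):
    """Map each response to its numeric code."""
    return [coding[r] for r in responses]


def compare(a, b, stat):
    """Return -1, 0, or 1 as stat(a) is below, equal to, or above stat(b)."""
    sa, sb = stat(a), stat(b)
    return (sa > sb) - (sa < sb)


for name, coding in [("linear", LINEAR), ("stretched", STRETCHED)]:
    a, b = recode(group_a, coding), recode(group_b, coding)
    print(name, "mean:", compare(a, b, mean), "median:", compare(a, b, median))
```

With these numbers the comparison of means flips between the two codings, while the comparison of medians does not. That is the sense in which rank order statistics are robust.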

Other issues with “quantitative”
• Where did the data come from?
  o physical measurement
  o computer logging
  o human recording

• What are the sources of error?
  o measurement error
  o missing data
  o ambiguity in human classification
  o process errors
  o intentional bias / deception

Vector representation of objects
Fundamental representation for many data mining, clustering,
machine learning, visualization, NLP, etc. algorithms.

\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]

Each x_i is a numerical or categorical feature
N = number of features or “dimension”
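
A minimal sketch of turning one object into such a vector. The record fields (`claws`, `latitude`, `color`) and the helper `to_vector` are hypothetical. One-hot encoding of the categorical feature is one common choice, not the only one.

```python
# Hypothetical category list for the categorical feature `color`.
COLORS = ["red", "yellow", "blue"]


def to_vector(record):
    """Encode one record as a list of numbers.

    Numeric features are copied as floats; the categorical feature
    `color` is one-hot encoded (one 0/1 slot per category).
    """
    numeric = [float(record["claws"]), float(record["latitude"])]
    one_hot = [1.0 if record["color"] == c else 0.0 for c in COLORS]
    return numeric + one_hot


animal = {"claws": 4, "latitude": 40.8, "color": "yellow"}
x = to_vector(animal)
print(x, "N =", len(x))
```

Note that one-hot encoding makes N larger than the number of described properties: three properties become five numbers here.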

Examples of features
• number of claws
• latitude
• color ∈ {red, yellow, blue}
• number of break-ins
• 1 for “bought X”, 0 for “did not buy X”
• time, duration, etc.
• number of times word Y appears in document
• votes cast

“Feature selection”
Technical meaning in machine learning etc.: which variables matter?
We’re journalists, so we’re interested in an earlier process: how to describe the world in numbers?

Choosing Features

Journalism
How do we represent the world numerically?
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\]

Machine learning
Which variables carry the most information?
\[
\begin{bmatrix} x_{f(1)} \\ x_{f(2)} \\ \vdots \\ x_{f(k)} \end{bmatrix}
\quad \text{where } k \le N
\]
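
A minimal sketch of the machine learning sense of feature selection: keep k of the N columns. It uses variance as a crude stand-in for "information", which is an assumption for illustration and not a method from this lecture. Variance depends on units, so real use would need scaling or a better criterion. The function names are hypothetical.

```python
from statistics import pvariance


def top_k_by_variance(rows, k):
    """Return the indices of the k highest-variance columns, in original order."""
    n_features = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def select(rows, indices):
    """Keep only the chosen columns of every row."""
    return [[row[j] for j in indices] for row in rows]


# Made-up data: column 0 never changes, so it tells us nothing.
rows = [
    [1.0, 5.0, 0.0, 10.0],
    [1.0, 6.0, 1.0, 20.0],
    [1.0, 7.0, 0.0, 30.0],
    [1.0, 8.0, 1.0, 40.0],
]
kept = top_k_by_variance(rows, k=2)
reduced = select(rows, kept)
print(kept, reduced)
```

Here `kept` plays the role of f(1) ... f(k): it lists which original positions survive.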

Examples of vector representations
Obvious
o movies watched / items purchased
o legislative voting history for a politician
o crime locations

Less obvious, but standard
o document vector space model
o psychological survey results

Tricky research problem: disparate field types
o corporate filing document
o Wikileaks SIGACT
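
A minimal sketch of the document vector space model from the list above, in its simplest word-count form. The tokenizer and the two example sentences are assumptions for illustration; weighting schemes such as TF-IDF are left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_vectors(docs):
    """Return (vocabulary, vectors): one word-count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors


docs = ["The mayor cut the budget.", "The budget vote failed."]
vocab, vectors = document_vectors(docs)
print(vocab)
for v in vectors:
    print(v)
```

Each dimension is one word in the vocabulary, so N grows with the number of distinct words across all documents.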

What can we do with vectors?
Predict one variable based on others
o  this is called “regression”
o  or maybe "classification"
o  supervised machine learning
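
A minimal sketch of predicting one variable from another with ordinary least squares on a single feature. The data and the names `fit_line` and `predict` are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


# Made-up training data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(5.0, slope, intercept))
```

This is supervised because the known `ys` guide the fit. Classification has the same shape, with a category in place of a number as the thing predicted.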

Group similar items together
o  This is clustering
o  or maybe "classification" with unknown categories
o  unsupervised machine learning
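
A minimal sketch of grouping similar items, using a bare-bones k-means. Choosing k-means, starting from the first k points, and running a fixed number of iterations are all assumptions made to keep the example short and deterministic. The points are made up.

```python
import math


def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (assignment, centroids)."""
    # Assumption: use the first k points as starting centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```

No labels are supplied here. The groups come only from distances between vectors, which is what makes this unsupervised.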