
Frontiers of Computational Journalism
Columbia Journalism School
Week 8: Visualization and Network Analysis

November 3, 2017
This class
Visualization as perception
Visualization design
Social network theory
Network analysis in journalism
Visualization as Perception
Topic links in Gödel, Escher, Bach
Visualization allows people to offload cognition to the
perceptual system, using carefully designed images as a
form of external memory.

The human visual system is a very high-bandwidth
channel to the brain, with a significant amount of
processing occurring in parallel and at the pre-conscious
level.

- Tamara Munzner
Pop-Out Effects
Visual Comparisons


size, color, number, shape, relative motion, and much more

Basic idea of visualization:

Turn something you want to find into something you can see
without thinking about it.
clusters outliers

extents correlations
Design Study Methodology: Reflections from the Trenches and the Stacks, Sedlmair et al, 2012
Visualization Design
Inward and Outward Grand Challenges for Visualization, Tamara Munzner
A multi-level typology of abstract visualization tasks, Brehmer & Munzner
Sequential Narrative

What's Really Warming The World?, Bloomberg

Visualization isn't objective, but that doesn't mean you can't
mislead. (Is this graph misleading?)
Social Network Theory
A set of people

and a set of connections between pairs of them

Types of connections

Social network analysis: only one type of connection

between individuals (e.g. "friend")

Link analysis: multiple types of connections

went to university with
sold a car to
owns 51% of

Link analysis is much more relevant to journalism,

because it allows representation of much more detail
and context.
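
One possible way to hold link-analysis data is a list of typed edges, sketched below. The relation types are the ones listed above; the entity names, the `Link` type, and both helper functions are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from typing import NamedTuple

class Link(NamedTuple):
    source: str    # entity the relationship starts from
    relation: str  # type of connection, e.g. "sold a car to"
    target: str    # entity the relationship points to

# Made-up entities; only the relation types come from the slide.
links = [
    Link("Alice", "went to university with", "Bob"),
    Link("Bob", "sold a car to", "Carol"),
    Link("Carol", "owns 51% of", "Acme Holdings"),
    Link("Alice", "owns 51% of", "Beta Ltd"),
]

def links_by_type(links):
    """Group (source, target) pairs by relation type."""
    grouped = defaultdict(list)
    for link in links:
        grouped[link.relation].append((link.source, link.target))
    return dict(grouped)

def single_relation_network(links, relation):
    """Keep one relation type only, as plain social network analysis would."""
    return {(l.source, l.target) for l in links if l.relation == relation}

grouped = links_by_type(links)
ownership = single_relation_network(links, "owns 51% of")
```

Keeping the relation on every edge is what preserves the detail and context; collapsing to one relation throws the rest away.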
People Act in Groups
Family and friendships: I am most closely connected to a small set of
people, who are usually closely connected to each other.

Business: I am much more likely to do business with people I already know.

Influence: I listen to people I know more than I listen to strangers.

Norms: what is right depends on what the people around me think.

People tend to marry, do business with, spend time with, etc. people from
similar backgrounds... and people who have social ties tend to be similar.
Two major analysis methods
once you have the network data (gathering it may be a very
manual process).

Look at a visualization
Apply algorithm

In both cases, the results are not interpretable without

A sociogram of a fraternity from Moreno's Who Shall Survive? (1934). Arrows show
one-way attraction and lines with a cross bar show mutual attraction.
Force-Directed Layout

Each edge is a "spring" with a fixed preferred length.

Plus global repulsive force that pushes all nodes apart.
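
A minimal sketch of the spring-plus-repulsion idea, under several assumptions not taken from the slides: linear springs, inverse-square repulsion, nodes starting evenly spaced on a circle, and a cap on how far a node moves per step. The constants are arbitrary, and real layout tools use more refined variants.

```python
import math

def force_directed_layout(nodes, edges, preferred_length=1.0, spring_k=0.1,
                          repulsion_k=0.05, steps=1000, max_step=0.1):
    """Return {node: (x, y)} after simulating springs plus global repulsion."""
    count = len(nodes)
    # Assumed starting layout: nodes evenly spaced on the unit circle.
    pos = {n: [math.cos(2 * math.pi * i / count), math.sin(2 * math.pi * i / count)]
           for i, n in enumerate(nodes)}

    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}

        # Global repulsion: every pair of nodes pushes apart (inverse-square).
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 1e-6)
                push = repulsion_k / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist

        # Springs: each edge moves its endpoints toward the preferred length.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = max(math.hypot(dx, dy), 1e-6)
            pull = spring_k * (dist - preferred_length)  # negative when too close
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist

        # Move each node along its net force, capping the step for stability.
        for n in nodes:
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            scale = min(1.0, max_step / size) if size > 0 else 0.0
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale

    return {n: (x, y) for n, (x, y) in pos.items()}

# A three-node chain a - b - c: the unconnected ends should drift apart.
layout = force_directed_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
```

Connected nodes settle near the preferred length while unconnected nodes are pushed apart, which is why clusters become visible in the picture.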
From The Effect of Graph Layout on Inference from Social Network
Data, Blythe et al.
We asked respondents three questions about the same
five focal nodes in each sociogram:

1) how many subgroups were in the sociogram

2) how prominent was each player in the sociogram
3) how important a bridging role did each player
occupy in the sociogram

- The Effect of Graph Layout on Inference from Social Network Data, Blythe et al.
Often identified with "influence" or "power." Often important in

We can visualize the graph and use our eyes, or we can compute
centrality values algorithmically.
Degree centrality: number of edges

Models: cases where the number of connections is important.

Example: which celebrity can reach the most people at
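
Degree centrality can be sketched as a simple edge count per node. The edge list below is made up for illustration, and the function assumes an undirected graph given as pairs.

```python
def degree_centrality(edges):
    """Count how many edges touch each node of an undirected graph."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree

# Hypothetical follower network.
edges = [("star", "fan1"), ("star", "fan2"), ("star", "fan3"), ("fan1", "fan2")]
degree = degree_centrality(edges)
most_connected = max(degree, key=degree.get)
```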
Closeness centrality: average distance to all other nodes

Models: cases where time taken to reach a node is important.

Example: who finds out about gossip first?
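
A sketch of the "average distance" definition using breadth-first search. It assumes an undirected, unweighted, connected graph with at least two nodes; the example chain is made up. Many libraries report the reciprocal of this average, so that larger means more central.

```python
from collections import deque

def bfs_distances(adj, start):
    """Hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def average_distance(adj):
    """Average hop count from each node to all other nodes (lower = closer)."""
    result = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        others = [d for n, d in dist.items() if n != node]
        result[node] = sum(others) / len(others)
    return result

# Hypothetical chain of acquaintances: a - b - c - d - e
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
avg = average_distance(adj)
hears_gossip_first = min(avg, key=avg.get)
```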
Betweenness centrality:
number of shortest paths that pass through node

Models: cases where control over transmission is important.

Example: who has the most power to make introductions?
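
One way to compute this is Brandes' algorithm, sketched here for an undirected, unweighted graph. When several shortest paths tie between a pair, each path counts fractionally, which is a common convention rather than something stated on the slide. The graph is hypothetical.

```python
from collections import deque

def betweenness_centrality(adj):
    """Shortest-path betweenness for an undirected, unweighted graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}

# Two tight pairs that can only reach each other through "broker".
adj = {
    "a1": ["a2", "broker"], "a2": ["a1", "broker"],
    "broker": ["a1", "a2", "b1", "b2"],
    "b1": ["broker", "b2"], "b2": ["broker", "b1"],
}
between = betweenness_centrality(adj)
```

Here every path between the two pairs runs through the broker, so only the broker gets a non-zero score.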
Eigenvector centrality:
how likely you are to end up at a node on a random walk
(same idea as PageRank)

Models: cases where importance of neighbors is important.

Example: the private adviser to the president
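
A sketch of the usual power-iteration formulation, in which a node's score is proportional to the sum of its neighbors' scores. The graph is made up, and the iteration count and tolerance are arbitrary assumptions. The adviser and the interns each have exactly one connection, so any difference in their scores comes only from who that connection is.

```python
def eigenvector_centrality(adj, iterations=1000, tol=1e-10):
    """Power iteration on an undirected graph given as {node: [neighbors]}."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # New score = own score + neighbors' scores. Adding the own score
        # avoids oscillation on bipartite graphs and leaves the ranking intact.
        new = {v: score[v] + sum(score[w] for w in adj[v]) for v in adj}
        norm = sum(x * x for x in new.values()) ** 0.5
        new = {v: x / norm for v, x in new.items()}
        change = sum(abs(new[v] - score[v]) for v in adj)
        score = new
        if change < tol:
            break
    return score

# Hypothetical office network.
adj = {
    "president": ["adviser", "minister1", "minister2", "minister3"],
    "adviser": ["president"],
    "minister1": ["president"],
    "minister2": ["president"],
    "minister3": ["president", "clerk"],
    "clerk": ["minister3", "intern1", "intern2"],
    "intern1": ["clerk"],
    "intern2": ["clerk"],
}
scores = eigenvector_centrality(adj)
```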
Journalism centrality:
how important is this person to this story?
Finding Communities
No one definition of "community." Could mean a town, or a
club, or an industry network.

But for our purposes, a community is "a group of people with

pre-existing patterns of association."

In social network analysis, that translates into clusters in the

Co-consumption: Network of political book sales,
Communications network: Exploring Enron, Jeffrey Heer
Web link structure: Map of Iranian Blogosphere, Berkman Center
Individual time/location trails: CitySense, Sense Networks
Mathematical definitions of "cluster"

You've already seen several. If you can compute distance

between any two items, you can cluster.

But in social networks, not everyone is connected to everyone


Are there more intra-group edges than we

would expect randomly?
n = number of vertices
$k_i$ = degree of vertex i
$A_{ij}$ = 1 if edge between i,j, 0 otherwise
$g_{ij}$ = 1 if i,j in same group, 0 otherwise

There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph.

If they go between random vertices then the expected number of
edges between i,j is $k_i k_j / 2m$
n = number of vertices
$k_i$ = degree of vertex i
$A_{ij}$ = 1 if edge between i,j, 0 otherwise
$g_{ij}$ = 1 if i,j in same group, 0 otherwise

Modularity $Q = \sum_{i,j} (A_{ij} - k_i k_j / 2m) \, g_{ij}$


If Q>0 then there are "excess" edges inside the

groups (and fewer edges between them.)
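
A small sketch that evaluates Q directly from these definitions: for every ordered pair i,j in the same group (including i = j), add the actual edge indicator minus the expected count k_i k_j / 2m. It assumes a simple undirected graph with no self-loops or duplicate edges. Many references also divide the sum by 2m, which rescales Q but does not change its sign. The example graph is made up.

```python
def modularity(edges, group):
    """Sum of (A_ij - k_i * k_j / 2m) over all ordered pairs in the same group."""
    degree = {v: 0 for v in group}
    adjacent = set()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        adjacent.add((a, b))
        adjacent.add((b, a))
    two_m = sum(degree.values())  # each edge adds 2 to the degree total
    q = 0.0
    for i in group:
        for j in group:
            if group[i] == group[j]:
                a_ij = 1 if (i, j) in adjacent else 0
                q += a_ij - degree[i] * degree[j] / two_m
    return q

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
by_triangle = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
one_group = {v: "all" for v in range(6)}

q_split = modularity(edges, by_triangle)
q_single = modularity(edges, one_group)
```

With m = 7, each triangle contributes 6 - 49/14 = 2.5, so the split scores 5.0, while putting everything in one group scores 0.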
Modularity algorithm

Look for a division of nodes into two groups that

maximizes Q
Can find this through eigenvector technique
Possible that no division has Q>0, in which case the graph
is a single community
If a division with Q>0 found, split
Recursively split sub-graphs
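
The steps above can be sketched roughly as follows. This is a simplified illustration, not a reference implementation: it finds the leading eigenvector of the matrix B_ij = A_ij - k_i k_j / 2m by shifted power iteration, splits nodes by the sign of their entries, and treats each sub-graph as a standalone graph when recursing (published versions of the method keep using degrees from the full graph). The iteration count and thresholds are arbitrary assumptions.

```python
def leading_split(nodes, edges):
    """Return (part1, part2) for a two-way split with Q > 0, else None."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        A[index[a]][index[b]] = A[index[b]][index[a]] = 1.0
    k = [sum(row) for row in A]
    two_m = sum(k)
    if two_m == 0:
        return None  # no edges, nothing to split
    B = [[A[i][j] - k[i] * k[j] / two_m for j in range(n)] for i in range(n)]
    # Shift by a bound on |eigenvalues| so power iteration converges to the
    # eigenvector with the most positive eigenvalue of B.
    shift = max(sum(abs(x) for x in row) for row in B)
    x = [float(i + 1) for i in range(n)]
    for _ in range(2000):
        y = [sum(B[i][j] * x[j] for j in range(n)) + shift * x[i] for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Split by sign, then check the split really has positive modularity.
    s = [1.0 if v > 0 else -1.0 for v in x]
    gain = sum(s[i] * B[i][j] * s[j] for i in range(n) for j in range(n))
    if gain <= 1e-9:
        return None
    part1 = [v for v in nodes if s[index[v]] > 0]
    part2 = [v for v in nodes if s[index[v]] < 0]
    return part1, part2

def communities(nodes, edges):
    """Recursively split until no sub-graph has a division with Q > 0."""
    split = leading_split(nodes, edges)
    if split is None:
        return [sorted(nodes)]
    result = []
    for part in split:
        members = set(part)
        inner = [(a, b) for a, b in edges if a in members and b in members]
        result.extend(communities(part, inner))
    return result

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
found = communities(list(range(6)), edges)
```

On this graph the first split separates the two triangles, and neither triangle can be divided further.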
Network Analysis in Journalism
Case Study: Seattle Art World

Network obtained from

dozens of in-person
interviews. Interactive
visualization in story.

In Seattle Art World, Women Run the Show, Seattle Times

Case Study: Hot Wheels

Network obtained from

juvenile arrest records
concerning stolen cars.
Unpublished visualization
and centrality measures
used to direct reporting to
most interesting people.

Hot Wheels, Tampa Bay Times

Coded 34 Stories for Sources and Uses

Story visualization: published story contains a visualization

Reporting visualization: used to guide reporters, unpublished.

Scraping: network extracted from source documents

Algorithm: centrality, community, etc. used

Graph DB: network loaded into graph database








[Bar chart: number of the 34 coded stories in each category: Total, Story Vis, Scraping, Reporting Vis, Algorithm, Graph DB]

Why not algorithms?
Heterogeneous networks. Multiple entity/relationship types.
Link analysis like criminal investigations.

Incomplete data. Building out the network is often an

interactive process of data gathering.

Contextual interpretation: What does it mean for someone to

be central? Depends on the nature of the network and
Correlation of different types of info
Suppose you have a record of phone numbers called, a database of
political campaign donations, and a list of government appointees. Put
them together, and you have this story:

WASHINGTON - Time and again, Texas Gov. Rick Perry picked up his office phone in
the months before he would announce his bid for the presidency. He dialed wealthy
friends who were his big fundraisers and state officials who owed him for their jobs.

Perry also met with a Texas executive who would later co-found an independent
political committee that has promised to raise millions to support Perry but is prohibited
from coordinating its activities with the governor.

- Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011
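
Putting the three kinds of records together can be sketched as a simple join. Everything below is made up for illustration: the names, numbers, and amounts are not from the AP story, and the `phone_book` lookup that ties a number to a person is an extra assumption, since in practice that matching is often the hard part.

```python
# All data here is hypothetical.
calls = [  # phone log: (date, number dialed)
    ("2011-05-02", "555-0101"),
    ("2011-05-03", "555-0199"),
    ("2011-05-09", "555-0142"),
]
phone_book = {"555-0101": "Pat Donor", "555-0199": "Sam Official"}  # assumed lookup
donations = {"Pat Donor": 250000, "Lee Other": 1000}  # campaign donations by name
appointees = {"Sam Official", "Kim Board"}            # government appointees

def annotate_calls(calls, phone_book, donations, appointees):
    """Attach donor and appointee information to each call record."""
    rows = []
    for date, number in calls:
        name = phone_book.get(number)  # None when the number is unidentified
        rows.append({
            "date": date,
            "number": number,
            "name": name,
            "donated": donations.get(name, 0),
            "appointee": name in appointees,
        })
    return rows

rows = annotate_calls(calls, phone_book, donations, appointees)
# Calls worth a closer look: the person called is a donor or an appointee.
leads = [r for r in rows if r["donated"] > 0 or r["appointee"]]
```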
The state of the art: Panama Papers
Graph Databases in Theory

Load everything into the database, then analyze using a

graph query language and interactive visualization.

Magic bullet for large, complex, cross border

Panama Papers networks derived from
structured data only
Entity recognition is not solved!
Entities found
out of 150

Incredibly dirty source data. Current methods have low recall (~70%)

Graph Databases in Practice
Incomplete data. Building a network often requires scraping from documents. Bulk data
often unavailable or impractical, and some records need to be purchased one at a
time. Instead, reporting involves interactive data enrichment.

Record linkage: With N databases, there could be N copies of each entity.

Graph queries are not that helpful. Cypher was available to PP investigators but no one
outside the core team learned it. Moreover, it's not clear how often reporting problems
can be expressed as a graph query. Even "find path between" did not produce any
(documented) leads on PP.
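
For reference, a "find path between" query amounts to a shortest-path search between two entities. The sketch below does this with breadth-first search in plain Python rather than in a graph query language. The entities and links are hypothetical and unrelated to the actual Panama Papers data.

```python
from collections import deque

def find_path(adj, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adj.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Made-up entity graph.
adj = {
    "Official": ["Shell Co A"],
    "Shell Co A": ["Official", "Law Firm"],
    "Law Firm": ["Shell Co A", "Shell Co B"],
    "Shell Co B": ["Law Firm", "Donor"],
    "Donor": ["Shell Co B"],
    "Unrelated Co": [],
}
path = find_path(adj, "Official", "Donor")
no_path = find_path(adj, "Official", "Unrelated Co")
```

The search itself is mechanical; whether a returned path means anything still depends on what each link represents.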

Networks need to be narratives. The most useful networks are hand-built, for a particular
line of reporting.
Maps, not data visualizations
Query results vs. hand-built graphs

Graph query results Search for node to add

Proposed System
