
Frontiers of Computational Journalism
Columbia Journalism School
Week 8: Visualization and Network Analysis

November 3, 2017
This class
• Visualization as perception
• Visualization design
• Social network theory
• Network analysis in journalism
Visualization as Perception
Topic links in Gödel, Escher, Bach
“Visualization allows people to offload cognition to the
perceptual system, using carefully designed images as a
form of external memory.

The human visual system is a very high-bandwidth
channel to the brain, with a significant amount of
processing occurring in parallel and at the pre-conscious
level.”

- Tamara Munzner
Pop-Out Effects
Visual Comparisons

length, orientation, size, color

...plus number, shape, relative motion, and much more
Basic idea of visualization:

Turn something you want to find
into something you can see
without thinking about it
clusters, outliers, extents, correlations
Design Study Methodology: Reflections from the Trenches and the Stacks, Sedlmair et al., 2012
Visualization Design
Inward and Outward Grand Challenges for Visualization, Tamara Munzner
A multi-level typology of abstract visualization tasks, Brehmer & Munzner
Sequential Narrative

What’s Really Warming The World?, Bloomberg
Visualization isn’t “objective,” but that doesn’t mean you can’t
mislead. (Is this graph misleading?)
Social Network Theory
Network
A set of people

and a set of connections between pairs of them
Types of connections

Social network analysis: only one type of connection
between individuals (e.g. "friend")

Link analysis: multiple types of connections
friend
brother
employer
went to university with
sold a car to
owns 51% of

Link analysis is much more relevant to journalism,
because it allows representation of much more detail
and context.
People Act in Groups
Family and friendships: I am most closely connected to a small set of
people, who are usually closely connected to each other.

Business: I am much more likely to do business with people I already know.

Influence: I listen to people I know more than I listen to strangers.

Norms: what is right depends on what the people around me think.

People tend to marry, do business with, spend time with, etc. people from
similar backgrounds... and people who have social ties tend to be similar.
Two major analysis methods
…after you have the network data (gathering it may be a very
manual process).

• Look at a visualization
• Apply algorithm

In both cases, the results are not interpretable without
context.
A “sociogram” of a fraternity from Moreno’s Who Shall Survive? (1934). Arrows show
one-way “attraction” and lines with a crossbar show “mutual attraction.”
Force-Directed Layout

Each edge is a "spring" with a fixed preferred length.
Plus global repulsive force that pushes all nodes apart.
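
The spring-plus-repulsion idea can be sketched in a few lines. This is a minimal illustration, not the algorithm used by any particular tool: the force constants, the inverse-square repulsion, the step cap, and the function name `force_layout` are all assumptions chosen to keep the example small and stable.

```python
import math
import random

def force_layout(nodes, edges, spring_length=1.0, spring_k=0.1,
                 repulsion=0.01, steps=500, max_step=0.1, seed=0):
    """Return {node: (x, y)} after a simple spring/repulsion simulation.

    All constants are illustrative assumptions, not tuned values.
    """
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Global repulsion: every pair of nodes pushes apart.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                f = repulsion / (dist * dist)
                force[a][0] += f * dx / dist
                force[a][1] += f * dy / dist
                force[b][0] -= f * dx / dist
                force[b][1] -= f * dy / dist
        # Springs: each edge pulls (or pushes) toward its preferred length.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            f = spring_k * (dist - spring_length)
            force[a][0] += f * dx / dist
            force[a][1] += f * dy / dist
            force[b][0] -= f * dx / dist
            force[b][1] -= f * dy / dist
        # Move each node along its net force, capping the step size.
        for n in nodes:
            fx, fy = force[n]
            mag = math.hypot(fx, fy)
            scale = min(1.0, max_step / mag) if mag > 0 else 0.0
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale
    return {n: tuple(p) for n, p in pos.items()}

def distance(layout, a, b):
    return math.dist(layout[a], layout[b])

# A three-node chain: a - b - c (placeholder node names).
layout = force_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
```

In the resulting layout, connected nodes sit near the preferred spring length, while the unconnected pair is pushed farther apart. Because the starting positions are random, different seeds give rotated or mirrored pictures of the same network.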
From The Effect of Graph Layout on Inference from Social Network
Data, Blythe et al.
We asked respondents three questions about the same
five focal nodes in each sociogram:

1) how many subgroups were in the sociogram
2) how “prominent” was each player in the sociogram
3) how important a “bridging” role did each player
occupy in the sociogram

The Effect of Graph Layout on Inference from Social Network
Data, Blythe et al.
Centrality
Often identified with "influence" or "power." Often important in
journalism.

We can visualize the graph and use our eyes, or we can compute
centrality values algorithmically.
Degree centrality: number of edges

Models: cases where the number of connections is important.
Example: which celebrity can reach the most people at
once?
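
Degree centrality is simple enough to compute directly. A minimal sketch, assuming the network is stored as a dictionary that maps each node to the set of its neighbors; the network and its node names are hypothetical placeholders.

```python
# Hypothetical friendship network; node names are placeholders.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}

def degree_centrality(adj):
    """Map each node to its number of edges."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

degrees = degree_centrality(network)
most_connected = max(degrees, key=degrees.get)
print(most_connected, degrees[most_connected])
```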
Closeness centrality: average distance to all other nodes (smaller average distance = more central)

Models: cases where time taken to reach a node is important.
Example: who finds out about gossip first?
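
One common way to turn average distance into a score is to invert it, so that nearer nodes score higher. The sketch below assumes an unweighted, connected network stored as a neighbor dictionary; the network is a hypothetical placeholder, and the function names are illustrative.

```python
from collections import deque

# Hypothetical network; node names are placeholders.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}

def bfs_distances(adj, source):
    """Number of hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def closeness_centrality(adj):
    """Inverse of the average distance to the other reachable nodes."""
    scores = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        total = sum(dist.values())
        scores[node] = (len(dist) - 1) / total if total else 0.0
    return scores

closeness = closeness_centrality(network)
```

Here "A" and "D" tie for the highest score because they sit in the middle of the network, while "F" at the end of the chain scores lowest.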
Betweenness centrality:
number of shortest paths that pass through node

Models: cases where control over transmission is important.
Example: who has the most power to make introductions?
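
A sketch of betweenness for an unweighted, undirected network, following the approach usually attributed to Brandes. One assumption to note: when several shortest paths tie between a pair of nodes, each path counts as a fraction, which refines the plain "number of shortest paths" wording. The example network (two triangles joined by a bridge) is hypothetical.

```python
from collections import deque

# Two triangles joined by the bridge C - D (placeholder names).
network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E", "F"},
    "E": {"D", "F"},
    "F": {"D", "E"},
}

def betweenness_centrality(adj):
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

betweenness = betweenness_centrality(network)
```

The two bridge nodes "C" and "D" carry every path between the triangles, so they are the only nodes with a nonzero score.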
Eigenvector centrality:
a node scores highly when its neighbors score highly
(PageRank is a close relative: roughly, how likely you are to end up at a node on a random walk)

Models: cases where importance of neighbors is important.
Example: the private adviser to the president
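
A sketch of eigenvector centrality by power iteration: each node's score is repeatedly replaced by the sum of its neighbors' scores. Adding the node's own score at each step is an assumption made here so the iteration settles on networks without odd cycles; it does not change the final ranking. The network is a hypothetical one in which "X" and "Y" each have a single connection, but "X" is connected to the hub "P".

```python
# Hypothetical network: "P" is a hub; "X" knows only "P"; "Y" knows only "D".
network = {
    "P": {"A", "B", "C", "D", "X"},
    "A": {"P"},
    "B": {"P"},
    "C": {"P"},
    "D": {"P", "Y"},
    "X": {"P"},
    "Y": {"D"},
}

def eigenvector_centrality(adj, iterations=200):
    score = {node: 1.0 for node in adj}
    for _ in range(iterations):
        # New score = own score + sum of neighbors' scores.
        new = {node: score[node] + sum(score[nb] for nb in adj[node])
               for node in adj}
        # Rescale so the numbers stay bounded.
        norm = sum(v * v for v in new.values()) ** 0.5
        score = {node: v / norm for node, v in new.items()}
    return score

scores = eigenvector_centrality(network)
```

"X" and "Y" have the same degree, yet "X" scores higher because its one neighbor is the best-connected node in the network.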
Journalism centrality:
how important is this person to this story?
Finding Communities
No one definition of "community." Could mean a town, or a
club, or an industry network.

But for our purposes, a community is "a group of people with
pre-existing patterns of association."

In social network analysis, that translates into clusters in the
graph.
Friends/followers
Co-consumption – Network of political book sales, Orgnet.com
Communications network – Exploring Enron, Jeffrey Heer
Web link structure – Map of Iranian Blogosphere, Berkman Center
Individual time/location trails – CitySense, Sense Networks
Mathematical definitions of "cluster"

You've already seen several. If you can compute distance
between any two items, you can cluster.

But in social networks, not everyone is connected to everyone
else...
Modularity

Are there more intra-group edges than we
would expect randomly?
Modularity
n = number of vertices
$k_i$ = degree of vertex $i$
$A_{ij}$ = 1 if edge between $i,j$, 0 otherwise
$g_{ij}$ = 1 if $i,j$ in same group, 0 otherwise

There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph.

If edges go between random vertices (keeping each vertex's degree), then the
expected number of edges between $i,j$ is $k_i k_j / 2m$
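
A short worked argument for the random-edge count, assuming edges are rewired at random while every vertex keeps its degree. Here $k_i$ is the degree of vertex $i$ and $m$ is the total number of edges; the "edge end" counting is an approximation that is good when $m$ is large.

```latex
% The m edges have 2m ends in total; vertex j owns k_j of them.
% An edge end leaving vertex i therefore lands on vertex j with probability
\[
  P(\text{end lands on } j) \approx \frac{k_j}{2m}.
\]
% Vertex i has k_i edge ends, so the expected number of edges between i and j is
\[
  E[\text{edges between } i, j] \approx k_i \cdot \frac{k_j}{2m} = \frac{k_i k_j}{2m}.
\]
% Sanity check: summing over all j recovers the degree of i,
\[
  \sum_j \frac{k_i k_j}{2m} = \frac{k_i}{2m} \sum_j k_j = \frac{k_i}{2m} \cdot 2m = k_i .
\]
```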
Modularity
n = number of vertices
$k_i$ = degree of vertex $i$
$A_{ij}$ = 1 if edge between $i,j$, 0 otherwise
$g_{ij}$ = 1 if $i,j$ in same group, 0 otherwise

Modularity $Q = \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) g_{ij}$

If Q>0 then there are "excess" edges inside the
groups (and fewer edges between them.)
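
The modularity sum can be computed directly from its definition. A minimal sketch, assuming a neighbor dictionary and a hypothetical two-triangle network; it uses the unnormalized form given here (many references also divide the result by 2m, which changes the scale but not the sign).

```python
# Two triangles joined by the bridge C - D (placeholder names).
network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E", "F"},
    "E": {"D", "F"},
    "F": {"D", "E"},
}

def modularity(adj, group_of):
    """Sum of (A_ij - k_i * k_j / 2m) over all pairs i, j in the same group."""
    degree = {v: len(adj[v]) for v in adj}
    two_m = sum(degree.values())  # 2m: every edge is counted from both ends
    q = 0.0
    for i in adj:
        for j in adj:
            if group_of[i] != group_of[j]:
                continue  # g_ij = 0, so the pair contributes nothing
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - degree[i] * degree[j] / two_m
    return q

by_triangle = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {v: 0 for v in network}

q_split = modularity(network, by_triangle)
q_single = modularity(network, one_group)
```

Grouping by triangle gives a positive Q (excess edges inside the groups), while putting every node in one group gives Q = 0, since a single group cannot have more internal edges than the random baseline predicts.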
Modularity algorithm

• Look for a division of nodes into two groups that
maximizes Q
• Can find this through eigenvector technique
• Possible that no division has Q>0, in which case the graph
is a single community
• If a division with Q>0 found, split
• Recursively split sub-graphs
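
A sketch of a single split step, assuming the eigenvector technique is the leading-eigenvector method usually attributed to Newman: build the matrix of $A_{ij} - k_i k_j / 2m$ values, find its leading eigenvector, and divide the nodes by the sign of their entries. The shifted power iteration, the tolerance, and the function name are assumptions for illustration. The recursive splitting of sub-graphs is left out, because doing it properly needs an adjusted matrix for each sub-graph. The network must have at least one edge.

```python
import random

def split_by_modularity(adj, iterations=500, seed=0):
    """Return two node sets, or None if no split has positive modularity."""
    nodes = sorted(adj)
    degree = {v: len(adj[v]) for v in nodes}
    two_m = sum(degree.values())
    # Modularity matrix: B[i][j] = A_ij - k_i * k_j / 2m
    B = [[(1.0 if w in adj[v] else 0.0) - degree[v] * degree[w] / two_m
          for w in nodes] for v in nodes]
    # Shift by a bound on the eigenvalues so that power iteration
    # converges to the most positive eigenvalue of B.
    shift = max(sum(abs(x) for x in row) for row in B)
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in nodes]
    for _ in range(iterations):
        y = [sum(b * xj for b, xj in zip(row, x)) + shift * xi
             for row, xi in zip(B, x)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Rayleigh quotient: the eigenvalue of B that goes with x.
    Bx = [sum(b * xj for b, xj in zip(row, x)) for row in B]
    eigenvalue = sum(a * b for a, b in zip(x, Bx))
    if eigenvalue <= 1e-9:
        return None  # no division improves on a single community
    group_a = {v for v, xi in zip(nodes, x) if xi > 0}
    group_b = set(nodes) - group_a
    if not group_a or not group_b:
        return None
    return group_a, group_b

# Hypothetical examples: two triangles joined by a bridge, and a 4-clique.
two_triangles = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
clique = {v: set("WXYZ") - {v} for v in "WXYZ"}

triangle_split = split_by_modularity(two_triangles)
clique_split = split_by_modularity(clique)
```

The two-triangle network divides at the bridge, while the fully connected clique is left as a single community.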
Network Analysis in Journalism
Case Study: Seattle Art World

Network obtained from
dozens of in-person
interviews. Interactive
visualization in story.

In Seattle Art World, Women Run the Show, Seattle Times
Case Study: Hot Wheels

Network obtained from
juvenile arrest records
concerning stolen cars.
Unpublished visualization
and centrality measures
used to direct reporting to
most interesting people.

Hot Wheels, Tampa Bay Times
Coded 34 Stories for Sources and Uses

Story visualization: published story contains a visualization

Reporting visualization: used to guide reporters, unpublished.

Scraping: network extracted from source documents

Algorithm: centrality, community, etc. used

Graph DB: network loaded into graph database
Results
[Bar chart: number of stories in each category (vertical axis 0 to 40). Categories: Total, Story Vis, Scraping, Reporting Vis, Algorithm, Graph DB]
Why not algorithms?
Heterogeneous networks. Multiple entity/relationship types.
“Link analysis” like criminal investigations.

Incomplete data. Building out the network is often an
interactive process of data gathering.

Contextual interpretation: What does it mean for someone to
be “central”? Depends on the nature of the network and
story.
Correlation of different types of info
Suppose you have a record of phone numbers called, a database of
political campaign donations, and a list of government appointees. Put
them together, and you have this story:

WASHINGTON—Time and again, Texas Gov. Rick Perry picked up his office phone in
the months before he would announce his bid for the presidency. He dialed wealthy
friends who were his big fundraisers and state officials who owed him for their jobs.

Perry also met with a Texas executive who would later co-found an independent
political committee that has promised to raise millions to support Perry but is prohibited
from coordinating its activities with the governor.

- Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011
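
Putting the three record sets together amounts to a join on a shared key. The sketch below assumes the phone number is that key and that each data set has already been cleaned into a simple lookup; every number and name is a made-up placeholder, not data from the story, and `match_calls` is a hypothetical helper.

```python
from collections import Counter

# All values below are made-up placeholders, not real records.
calls = ["555-0101", "555-0102", "555-0101", "555-0199"]
donors = {"555-0101": "Donor A"}            # phone number -> donor name
appointees = {"555-0102": "Appointee B"}    # phone number -> appointee name

def match_calls(calls, donors, appointees):
    """List called numbers that belong to a donor or an appointee."""
    matches = []
    for number, n_calls in Counter(calls).most_common():
        roles = []
        if number in donors:
            roles.append("donor: " + donors[number])
        if number in appointees:
            roles.append("appointee: " + appointees[number])
        if roles:
            matches.append((number, n_calls, roles))
    return matches

matches = match_calls(calls, donors, appointees)
```

In practice the hard part is usually getting to a reliable shared key at all, since names and numbers are formatted differently in each source.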
The state of the art: Panama Papers
Graph Databases in Theory

Load everything into the database, then analyze using a
graph query language and interactive visualization.

“Magic bullet” for large, complex, cross-border
investigations.
Panama Papers networks derived from
structured data only
Entity recognition is not solved!
[Chart: entities found out of 150]

Incredibly dirty source data. Current methods have low recall (~70%)
Unlinked records
“Soft” record linkage
Graph Databases in Practice
Incomplete data. Building a network often requires scraping from documents. Bulk data
often unavailable or impractical, and some records need to be purchased one at a
time. Instead, reporting involves interactive data enrichment.

Record linkage: With N databases, there could be N copies of each entity.
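
A sketch of "soft" matching between two databases, assuming entity names are the only field available. It uses Python's standard `difflib.SequenceMatcher` as the similarity measure; the normalization rules, the 0.85 threshold, and the company names are illustrative assumptions. Real record linkage usually compares several fields and needs human review of borderline matches.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(name.split())

def link_records(names_a, names_b, threshold=0.85):
    """Return (name_a, name_b, similarity) for pairs above the threshold."""
    links = []
    for a in names_a:
        for b in names_b:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold:
                links.append((a, b, score))
    return links

# Made-up company names standing in for two separate databases.
db_one = ["ACME Holdings Ltd.", "Zenith Shipping SA"]
db_two = ["Acme Holdings Ltd", "Northwind Trading"]

links = link_records(db_one, db_two)
```

Comparing every pair is fine for small lists but grows quickly with database size, which is one reason linkage across many sources is hard.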

Graph queries are not that helpful. Cypher was available to Panama Papers investigators, but no one
outside the core team learned it. Moreover, it’s not clear how often reporting problems
can be expressed as a graph query. Even “find path between” did not produce any
(documented) leads on the Panama Papers.
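
For reference, a “find path between” query boils down to a shortest-path search. A minimal sketch using breadth-first search over a neighbor dictionary; the entities are hypothetical placeholders, and a graph database would run the equivalent search through its own query language.

```python
from collections import deque

def find_path(adj, start, goal):
    """Return a shortest list of nodes from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the recorded steps to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adj.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Made-up entities: two people linked through a shared company.
entities = {
    "Person A": {"Company X"},
    "Company X": {"Person A", "Person B"},
    "Person B": {"Company X"},
    "Person C": set(),
}

path = find_path(entities, "Person A", "Person B")
no_path = find_path(entities, "Person A", "Person C")
```

The search only finds connections that are already in the data, so missing or unlinked records lead to missing paths.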

Networks need to be narratives. The most useful networks are hand-built, for a particular
line of reporting.
Maps, not data visualizations
Query results vs. hand-built graphs

Graph query results
Search for node to add
Proposed System