Frontiers of Computational Journalism
Columbia Journalism School
Week 10: Social Network Analysis
November 18, 2015

This class

Social network theory
Centrality Measures
Identifying Communities
SNA in practice

A set of people

and a set of connections between pairs of them

Types of connections
Social network analysis: only one type of connection
between individuals (e.g. "friend")
Link analysis: multiple types of connections
went to university with
sold a car to
owns 51% of

Link analysis is much more relevant to journalism,
because it allows representation of much more detail
and context.

People Act in Groups
Family and friendships: I am most closely connected to a small set of
people, who are usually closely connected to each other.
Business: I am much more likely to do business with people I already know.
Influence: I listen to people I know more than I listen to strangers.
Norms: what is right depends on what the people around me think.
People tend to marry, do business with, spend time with, etc. people from
similar backgrounds... and people who have social ties tend to be similar.

Homophily is the principle that contact between
similar people occurs at a higher rate than among
dissimilar people. The pervasive fact of homophily
means that cultural, behavioral, genetic, or material
information that flows through networks will tend to
be localized. Homophily implies that distance in terms
of social characteristics translates into network
distance, the number of relationships through which a
piece of information must travel to connect two individuals.
- McPherson, Smith-Lovin, Cook
Birds of a feather: homophily in social networks

Structure Relates to Behavior

In a 1951 experiment, researchers had five people work
together, only allowed to communicate according to one of the
patterns above. They were each given a card with several
symbols on it. The task was to determine which symbol was in
common between all of the cards. It was repeated many times.
How did the groups organize themselves? Which patterns were
From H. Leavitt, Some effects of certain communication patterns on group
performance, Journal of Abnormal Psychology 46(1)

Correlation of different types of info
Suppose you have a record of phone numbers called, a database of
political campaign donations, and a list of government appointees. Put
them together, and you have this story:
WASHINGTON—Time and again, Texas Gov. Rick Perry picked up his office phone in
the months before he would announce his bid for the presidency. He dialed wealthy
friends who were his big fundraisers and state officials who owed him for their jobs.
Perry also met with a Texas executive who would later co-found an independent
political committee that has promised to raise millions to support Perry but is prohibited
from coordinating its activities with the governor.
- Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011

Social Network Analysis in Journalism

Identify people or communities
Track money and criminal networks
Understand spread of information and behavior
Illustrate complex stories

Useful in all areas where CS intersects journalism! (Reporting,
communication, filtering, effect tracking)

Two major analysis methods
…after you have the network data, which may be a very
manual process.

Look at a visualization
Apply algorithm

In both cases, the results are not interpretable without

Force-Directed Layout

Each edge is a "spring" with a fixed preferred length.
Plus global repulsive force that pushes all nodes apart.
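
The two forces above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular layout tool: the parameter names (preferred_length, repulsion, step, max_move) and the inverse-square repulsion are assumptions made for the example.

```python
import math
import random

def force_directed_layout(nodes, edges, preferred_length=1.0, repulsion=0.5,
                          step=0.05, max_move=0.1, iterations=300, seed=0):
    """Return {node: (x, y)} after a simple spring-and-repulsion simulation.

    nodes is a list of node names, edges a list of (u, v) pairs.
    """
    rng = random.Random(seed)
    pos = {v: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for v in nodes}
    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}
        # Global repulsion: every pair of nodes pushes apart (inverse-square).
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulsion / dist ** 2
                force[u][0] += push * dx / dist
                force[u][1] += push * dy / dist
                force[v][0] -= push * dx / dist
                force[v][1] -= push * dy / dist
        # Springs: each edge moves its endpoints toward the preferred length.
        for u, v in edges:
            dx, dy = pos[v][0] - pos[u][0], pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist - preferred_length  # negative when the nodes are too close
            force[u][0] += pull * dx / dist
            force[u][1] += pull * dy / dist
            force[v][0] -= pull * dx / dist
            force[v][1] -= pull * dy / dist
        # Move each node a small, capped distance along its net force.
        for v in nodes:
            fx, fy = force[v]
            magnitude = math.hypot(fx, fy)
            scale = step if step * magnitude <= max_move else max_move / magnitude
            pos[v][0] += scale * fx
            pos[v][1] += scale * fy
    return {v: (p[0], p[1]) for v, p in pos.items()}

layout = force_directed_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
```

Connected nodes settle slightly farther apart than the preferred length, because the repulsion keeps acting on them as well.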

From The Effect of Graph Layout on Inference from Social Network
Data, Blythe et al.


We asked respondents three questions about the same
five focal nodes in each sociogram:
1) how many subgroups were in the sociogram
2) how “prominent” was each player in the sociogram
3) how important a “bridging” role did each player
occupy in the sociogram

From The Effect of Graph Layout on Inference from Social Network
Data, Blythe et al.

Centrality Measures

Often identified with "influence" or "power." Often important in
We can visualize the graph and use our eyes, or we can
compute centrality values algorithmically.

Degree centrality: number of connections

Models: cases where the number of connections is important.
Example: which celebrity can reach the most people at once?
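
A minimal sketch of degree centrality, assuming the network is stored as a dictionary mapping each person to the set of people they are connected to. The names in the example graph are invented.

```python
# Hypothetical five-person network: each key maps to that person's connections.
graph = {
    "Ann": {"Bob", "Cat", "Dan"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Ann", "Bob"},
    "Dan": {"Ann", "Eve"},
    "Eve": {"Dan"},
}

def degree_centrality(graph):
    """Degree centrality: the number of connections each node has."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

degrees = degree_centrality(graph)
best_connected = max(degrees, key=degrees.get)
```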

Closeness centrality: average distance to all other nodes

Models: cases where time taken to reach a node is important.
Example: who finds out about gossip first?
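
A sketch of the average-distance calculation using breadth-first search. It assumes an unweighted, connected network stored as a dictionary of neighbor sets, with invented names. With this definition a lower value means a more central node; some libraries report the reciprocal instead.

```python
from collections import deque

# Hypothetical five-person network: each key maps to that person's connections.
graph = {
    "Ann": {"Bob", "Cat", "Dan"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Ann", "Bob"},
    "Dan": {"Ann", "Eve"},
    "Eve": {"Dan"},
}

def average_distance(graph, source):
    """Average shortest-path distance from source to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:  # first visit is via a shortest path
                dist[w] = dist[v] + 1
                queue.append(w)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others)

closeness = {node: average_distance(graph, node) for node in graph}
```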

Betweenness centrality:
number of shortest paths that pass through node

Models: cases where control over transmission is important.
Example: who has the most power to make introductions?
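
A sketch of betweenness centrality for an unweighted, undirected network, using the accumulation scheme from Brandes' algorithm. It follows the usual convention, which is an assumption beyond the slide: when several shortest paths connect the same pair, each path counts as a fraction. The example graph and its names are invented.

```python
from collections import deque

# Hypothetical five-person network: each key maps to that person's connections.
graph = {
    "Ann": {"Bob", "Cat", "Dan"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Ann", "Bob"},
    "Dan": {"Ann", "Eve"},
    "Eve": {"Dan"},
}

def betweenness_centrality(graph):
    """Shortest paths through each node, summed over all pairs of other nodes."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths to each node.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        paths[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing credit to predecessors.
        credit = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                credit[v] += paths[v] / paths[w] * (1 + credit[w])
            if w != s:
                score[w] += credit[w]
    # Each undirected pair was counted from both ends, so halve the totals.
    return {v: total / 2 for v, total in score.items()}

betweenness = betweenness_centrality(graph)
```

In this toy network Ann and Dan are the only people who sit between others, so they are the only ones with a nonzero score.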

Eigenvector centrality:
how likely you are to end up at a node on a random walk
(same idea as PageRank)

Models: cases where importance of neighbors is important.
Example: the private adviser to the president
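
A sketch of eigenvector centrality by power iteration, in the standard formulation where a node's score is proportional to the sum of its neighbors' scores. Keeping each node's own score in the update is a common convergence aid and an assumption of this example. PageRank's random-walk version also normalizes by the number of links and adds a damping factor, which are not shown here. The graph and names are invented.

```python
import math

# Hypothetical five-person network: each key maps to that person's connections.
graph = {
    "Ann": {"Bob", "Cat", "Dan"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Ann", "Bob"},
    "Dan": {"Ann", "Eve"},
    "Eve": {"Dan"},
}

def eigenvector_centrality(graph, iterations=200):
    """Repeatedly replace each score with own score plus neighbors' scores."""
    score = {v: 1.0 for v in graph}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[w] for w in graph[v]) for v in graph}
        norm = math.sqrt(sum(x * x for x in new.values()))
        score = {v: x / norm for v, x in new.items()}  # keep the vector unit length
    return score

importance = eigenvector_centrality(graph)
```

Bob and Dan have the same number of connections, but Bob scores higher because his neighbors are better connected than Dan's.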

Journalism centrality:
how important is this person to this story?

Who is "important"?
What type of person do you want to identify in the
Often assumed we're after "influential." But sociology
says "power" is a complicated thing and difficult to
define and measure.
Network analysis has mostly ignored this problem. I
know of no successful use of centrality metrics in
journalism – maybe you'll be the first.

Identifying Communities

Finding Communities
No one definition of "community." Could mean a town, or a
club, or an industry network.
But for our purposes, a community is "a group of people with
pre-existing patterns of association."
In social network analysis, that translates into clusters in the


Co-consumption – Network of political book sales,

Communications network – Exploring Enron, Jeffrey Heer

Web link structure – Map of Iranian Blogosphere, Berkman Center

Individual time/location trails – CitySense, Sense Networks

Warning: no network is ever "complete."
Otherwise there would be 7 billion people in it

Mathematical definitions of "cluster"
You've already seen several. If you can compute distance
between any two items, you can cluster.
But in social networks, not everyone is connected to everyone


Are there more intra-group edges than we
would expect randomly?

n = number of vertices
$k_i$ = degree of vertex i
$A_{ij}$ = 1 if there is an edge between i and j, 0 otherwise
$g_{ij}$ = 1 if i and j are in the same group, 0 otherwise
There are $m = \frac{1}{2}\sum_i k_i$ total edges in the graph.
If edges connect random vertices, the expected number of
edges between i and j is $k_i k_j / 2m$


n = number of vertices
$k_i$ = degree of vertex i
$A_{ij}$ = 1 if there is an edge between i and j, 0 otherwise
$g_{ij}$ = 1 if i and j are in the same group, 0 otherwise
Modularity $Q = \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) g_{ij}$

If Q>0 then there are "excess" edges inside the
groups (and fewer edges between them.)
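
A sketch that computes Q exactly as written above, summing over all ordered pairs (i, j) including i = j. The example graph (two triangles joined by one edge) and the group labels are invented. Many references also divide Q by 2m, which rescales the value but does not change its sign.

```python
# Hypothetical network: two triangles (a,b,c) and (d,e,f) joined by edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

def modularity(graph, group_of):
    """Sum of (A_ij - k_i * k_j / 2m) over all pairs i, j in the same group."""
    degree = {v: len(neighbors) for v, neighbors in graph.items()}
    two_m = sum(degree.values())  # every edge is counted once from each end
    q = 0.0
    for i in graph:
        for j in graph:
            if group_of[i] != group_of[j]:
                continue  # g_ij = 0
            a_ij = 1 if j in graph[i] else 0
            q += a_ij - degree[i] * degree[j] / two_m
    return q

by_triangle = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {v: 0 for v in graph}
q_split = modularity(graph, by_triangle)
q_whole = modularity(graph, one_group)
```

Grouping by triangle gives a positive Q, while putting everyone in one group gives Q = 0, since the expected and actual edge counts then cancel.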

Modularity algorithm
• Look for a division of nodes into two groups that
maximizes Q
• Can find this through eigenvector technique
• Possible that no division has Q>0, in which case the graph
is a single community
• If a division with Q>0 found, split
• Recursively split sub-graphs
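
A sketch of a single split. The slide does not spell out the eigenvector technique; this example assumes the common approach of taking the leading eigenvector of the modularity matrix $B_{ij} = A_{ij} - k_i k_j / 2m$ and dividing nodes by the sign of their entry. The power iteration with a shift, the function name, and the example graph are all choices made for illustration, and the recursive step is not shown.

```python
import random

# Hypothetical network: two triangles (a,b,c) and (d,e,f) joined by edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

def split_in_two(graph, iterations=500, seed=0):
    """Return two groups of nodes, or None if no split has Q > 0."""
    nodes = sorted(graph)
    degree = {v: len(graph[v]) for v in nodes}
    two_m = sum(degree.values())
    # Modularity matrix: actual edges minus expected edges for each pair.
    B = [[(1 if j in graph[i] else 0) - degree[i] * degree[j] / two_m
          for j in nodes] for i in nodes]
    # Shifting by a bound on the eigenvalues makes power iteration converge
    # to the eigenvector with the largest (not largest-magnitude) eigenvalue.
    shift = max(sum(abs(x) for x in row) for row in B)
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in nodes]
    for _ in range(iterations):
        y = [sum(b * xj for b, xj in zip(row, x)) + shift * xi
             for row, xi in zip(B, x)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    left = [v for v, xi in zip(nodes, x) if xi > 0]
    right = [v for v, xi in zip(nodes, x) if xi <= 0]
    if not left or not right:
        return None
    # Q for this division: sum of B_ij over pairs in the same group.
    side = {v: (xi > 0) for v, xi in zip(nodes, x)}
    q = sum(B[a][b] for a, i in enumerate(nodes) for b, j in enumerate(nodes)
            if side[i] == side[j])
    return (left, right) if q > 1e-9 else None

groups = split_in_two(graph)
```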

The Hairball problem

Real social networks are big, with complex, overlapping communities in the central
component. Modularity and other community detection algorithms give poor results.

K-core Decomposition
Find the nodes at the "center" of a network.
for k = 1 to maximum node degree
    repeat
        remove all nodes with degree < k
    until all remaining nodes have degree >= k
    set "core number" of remaining nodes to k
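
The procedure above can be sketched in Python as follows, assuming the network is a dictionary of neighbor sets. The example graph is invented: a triangle with one extra person attached to it.

```python
# Hypothetical network: triangle a-b-c plus d, who only knows c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def core_numbers(graph):
    """Return {node: core number} by peeling away low-degree nodes."""
    remaining = {v: set(neighbors) for v, neighbors in graph.items()}
    core = {v: 0 for v in graph}
    k = 1
    while remaining:
        # Remove nodes with degree < k until none are left; each removal
        # can lower the degree of the removed node's neighbors.
        while True:
            low = [v for v, neighbors in remaining.items() if len(neighbors) < k]
            if not low:
                break
            for v in low:
                for w in remaining.pop(v):
                    if w in remaining:
                        remaining[w].discard(v)
        # Every node that survived this round belongs to the k-core.
        for v in remaining:
            core[v] = k
        k += 1
    return core

cores = core_numbers(graph)
```

Nodes with no connections keep a core number of 0, since they are removed in the first round.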

K-core Decomposition

Carmi et al., A model of Internet topology using k-shell decomposition

Protest Dynamics on Twitter

González-Bailon et al, The Dynamics of Protest Recruitment through
an Online Network

k-core number vs. maximum cascade size. Color = sent at least one
tweet which reached this fraction of users (orange = reached all

Key insight: triangles not edges

Simmel's theory of sociology (early 20th C.) says
relationship between two people cannot be
understood without context.

Idea: count shared triangles
1. For each node A and each of A's friends B,
count the number of triangles involving A and B
(= number of shared friends of A and B).
2. Rank A's friends (each B) by number of shared
friends (number of C's for A,B) to create a "top friends"
list for A.
3. Keep the edge between nodes A,D only if there is
some threshold percentage overlap in their "top friends" lists.
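
A simplified sketch of these three steps, not the published Simmelian backbone method in full. The parameters top_k (length of each "top friends" list) and min_overlap (the threshold), the alphabetical tie-breaking, and the example graph are assumptions made for illustration.

```python
# Hypothetical network: two triangles (a,b,c) and (d,e,f) joined by edge c-d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

def top_friends(graph, a, top_k):
    """Steps 1 and 2: rank A's friends by the number of friends shared with A."""
    ranked = sorted(graph[a], key=lambda b: (-len(graph[a] & graph[b]), b))
    return set(ranked[:top_k])

def backbone(graph, top_k=2, min_overlap=0.5):
    """Step 3: keep an edge only if the two top-friends lists overlap enough."""
    tops = {a: top_friends(graph, a, top_k) for a in graph}
    kept = set()
    for a in graph:
        for d in graph[a]:
            if a < d:  # visit each undirected edge once
                overlap = len(tops[a] & tops[d]) / top_k
                if overlap >= min_overlap:
                    kept.add((a, d))
    return kept

kept_edges = backbone(graph)
```

In this toy network the edges inside each triangle survive, while the bridge between c and d is dropped because those two share no friends.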

Simmelian Backbones

SNA in Practice

SNA in journalism

ICIJ Panama Papers
ICIJ Offshore Tax Haven leak
ICIJ human tissue investigation
Organized Crime and Corruption Reporting Project
WSJ Galleon's Web insider trading story
SCMP's Who Runs Hong Kong

The other challenge was the data itself. How to separate the extraordinary from the
routine and find the public interest inside a maze of more than 37,000 offshore
company holders? A first step was to build as many lists as possible of public figures:
Politburo members, military commanders, mayors of large cities, billionaires listed in
Forbes and Hurun’s rankings of the mega-wealthy and so-called princelings (relatives
of the current leadership or former Communist Party elders).
Through painstaking database work, a reporter in Spain cross-referenced the lists of
notable Chinese against the names of offshore clients listed within ICIJ’s Offshore
Leaks data. The added difficulty was that in most cases, names in the offshore files
were registered in Romanized form, not Chinese characters. This made making exact
matches extremely hard, because Romanized spellings from Chinese characters tend
to vary widely: Wang might be spelled Wong, Zhang could be Cheung, and Ye might
be spelled Yeh. Addresses and ID numbers helped confirm many identities but
many other names were dropped because the reporting team could not be 100
percent sure that the person was a correct match.
A picture slowly began to emerge: China’s elites were aggressively using offshore
havens to hold assets, list companies in the world’s stock exchanges, buy and sell
real estate and conduct their business away from Beijing’s red tape and capital
How We Did Offshore Leaks China, ICIJ

Analyzing the Data behind Skin and Bone, ICIJ

Who Runs HK? The Fight over Stanley Ho's Fortune
South China Morning Post, 2010

SNA that could be used in Journalism

The Network of Global Corporate Control paper
Network of campaign finance contributions (SuperPACs)
International financial system / banking counterparties
"Revolving door" / regulatory capture
Political elite in any country
Find audience for story, akin to targeted marketing

Vitali, Glattfelder, Battiston, The Network of Global Corporate Control

SNA in journalism
• Visualization widely used
• Link analysis successful in investigative reporting
• Most of the work required to do these types of stories is
traditional research, not algorithmically-guided.
• I am not aware of successful application of centrality
metrics or community detection algorithms.
• This may change as the graphs journalism examines get
• Would it be possible to use community detection to find
the "right" audience for a story?
