
Network Science:
Erdős-Rényi Model for Network Formation

Ozalp Babaoglu
Dipartimento di Informatica — Scienza e Ingegneria
Università di Bologna
www.cs.unibo.it/babaoglu/
© Babaoglu 2020

Why model?
■ Simpler representation of possibly very complex structures
■ Can gain insight into how networks form and how they grow
■ May allow mathematical derivation of certain properties
■ Can serve to “explain” certain properties observed in real networks
■ Can predict new properties or outcomes for networks that do not even exist
■ Can serve as benchmarks for evaluating real networks

Modeling approaches
■ Random models — choices independent of current network structure
  ■ Erdős-Rényi (ER)
  ■ Watts-Strogatz (clustered)
■ Strategic models — choices depend on current network structure
  ■ Barabási-Albert (preferential attachment)
■ Limited knowledge models — choices based on local information only
  ■ Newscast
  ■ Cyclone

Erdős-Rényi model
■ Network is undirected
■ Start with all isolated nodes (no edges) and add edges between pairs of nodes one at a time randomly
■ Perhaps the simplest (dumbest) possible model
■ Very unlikely that real networks actually form like this (certainly not social networks)
■ Yet, can predict a surprising number of interesting properties
■ Two possible choices for adding edges randomly:
  ■ Randomize edge presence or absence
  ■ Randomize node pairs


Erdős-Rényi model
Randomize edge presence/absence
■ Two parameters
  ■ Number of nodes: n
  ■ Probability that an edge is present: p
■ For each of the n(n−1)/2 possible edges in the network, flip a (biased) coin that comes up “heads” with probability p
  ■ If coin flip is “heads”, then add the edge to the network
  ■ If coin flip is “tails”, then don’t add the edge to the network
■ Also known as the “G(n, p) model” (graph on n nodes with probability p)

Erdős-Rényi model
Randomize edge presence/absence
■ Example: n=5, p=0.6
■ Number of possible edges: n(n−1)/2=5×4/2=10
■ Ten flips of a coin that comes up heads 60%, tails 40%
■ Add the edges corresponding to the “heads” outcomes
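The coin-flipping construction described above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the slides; the function name, seed parameter, and edge-list representation are assumptions.

```python
import random
from itertools import combinations

def er_gnp(n, p, seed=None):
    """Sample a G(n, p) graph: flip a biased coin for each of the
    n(n-1)/2 possible edges and keep the "heads" outcomes."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

edges = er_gnp(5, 0.6, seed=1)
print(len(list(combinations(range(5), 2))))  # 10 possible edges for n=5
print(edges)
```

Running it several times with different seeds gives different networks, but each of the 10 possible edges is present with probability 0.6.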

Erdős-Rényi model
Randomize edge presence/absence

[Figure: a sample outcome for n=5, p=0.6 — a network whose five nodes have degrees 3, 3, 2, 2, 0]

■ Average node degree: p(n−1)
■ Expected average node degree: p(n−1)=0.6×4=2.4
■ Actual average node degree: (3+3+2+2+0)/5=2.0
■ What about node degree distribution?

Erdős-Rényi model
Degree distribution

[Figure: frequency histogram of the node degrees 0–4 for the sample network above]


Erdős-Rényi model
Degree distribution
■ Need to quantify the probability that a node has degree k for all 0 ≤ k ≤ (n−1)
■ A node has degree zero if all coin flips are “tails”
■ A node has degree (n−1) if all coin flips are “heads”
■ For a node to have degree k, the (n−1) coin flips must have resulted in k “heads” and (n−1−k) “tails”
■ Since the probability of a “heads” is p, the probability of a “tails” is (1−p)

Erdős-Rényi model
Degree distribution
■ The outcome “k “heads” and (n−1−k) “tails”” occurs with probability p^k (1−p)^(n−1−k)
■ But there are “(n−1) choose k” ways in which this outcome can occur (the order of the flip results does not matter)
■ Thus, the probability that a given node has degree k is given by the Binomial distribution:
  P(k) = C(n−1, k) p^k (1−p)^(n−1−k)
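The Binomial degree probability can be checked numerically. A minimal sketch (the helper name is illustrative) for the slides' example n=5, p=0.6:

```python
from math import comb

def degree_prob(n, p, k):
    """Probability that a given node in G(n, p) has degree k:
    the Binomial distribution over its n-1 independent coin flips."""
    return comb(n - 1, k) * p**k * (1 - p)**(n - 1 - k)

n, p = 5, 0.6
dist = [degree_prob(n, p, k) for k in range(n)]
print(sum(dist))                                 # probabilities sum to 1
print(sum(k * pk for k, pk in enumerate(dist)))  # mean equals p(n-1) = 2.4
```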

Erdős-Rényi model
Binomial distribution

[Figure: Binomial degree distributions for n=8, p=0.5 and n=8, p=0.1]

■ Mean of the binomial distribution is µ=p(n−1) (which is also the average node degree we saw earlier)

Erdős-Rényi model
Binomial distribution — approximations
■ For p small, the Binomial distribution is well approximated by the Poisson distribution
■ For n large, it is well approximated by the Normal (Gaussian) distribution
Erdős-Rényi model
Binomial distribution
■ Random network with n=50, p=0.08

[Figure: a random network with n=50, p=0.08]

Erdős-Rényi model
Binomial distribution
■ Degree distribution of random network with n=50, p=0.08

[Figure: degree histogram of the actual data overlaid with its Poisson approximation]

Erdős-Rényi model
Binomial distribution

[Figure: Poisson distributions with different means µ, and Normal distributions with different means and standard deviations — both exhibit exponential decay in the tails]

Erdős-Rényi model
Randomize node pairs
■ Alternative method for adding edges randomly
■ Two parameters
  ■ Number of nodes: n
  ■ Number of edges: m
■ Pick a pair of nodes at random among the n nodes and add an edge between them if not already present
■ Repeat until exactly m edges have been added
■ Also known as the “G(n, m) model” (graph on n nodes with m edges)
■ For large n, the two versions of ER are equivalent
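The G(n, m) construction can be sketched as follows; the function name and set-of-edges representation are illustrative assumptions, not from the slides.

```python
import random

def er_gnm(n, m, seed=None):
    """Sample a G(n, m) graph: repeatedly pick a random node pair and
    add the edge if not already present, until exactly m edges exist."""
    assert m <= n * (n - 1) // 2, "m cannot exceed the number of possible edges"
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)       # two distinct nodes
        edges.add((min(u, v), max(u, v)))    # store undirected edge once
    return edges

print(len(er_gnm(5, 4, seed=42)))  # exactly 4 edges
```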


Erdős-Rényi model
Randomize node pairs
■ Example: n=5, m=4

[Figure: a sample network on 5 nodes with 4 randomly placed edges]

■ The two versions of the model are related through the equation for the number of edges: m=pn(n−1)/2
  ■ In the first case we pick p, and m is established by the model
  ■ In the second case we pick m, and p is established by the model
■ The above example corresponds to the second case where p=2m/n(n−1)=2×4/(5×4)=0.4

Erdős-Rényi model vs real networks
Degree distribution
■ The ER model is a poor predictor of degree distribution compared to real networks
■ The model results in Poisson degree distributions that have exponential decay
■ Whereas most real networks exhibit power-law degree distributions that decay much slower than exponential

Erdős-Rényi diameter
■ Recall that the diameter of a network is the longest shortest path between pairs of nodes
■ Closely related: the average distance between two randomly selected nodes
■ In a connected network with n nodes, the diameter is in the range 1 (completely connected) to n−1 (linear chain)
■ For a given n, as we vary the model parameter p from 0 to 1, at some critical value of p the diameter becomes finite (network becomes connected) and continues to decrease, becoming 1 when p=1
■ What is the relation between the diameter and p in the region where the network is connected?

Erdős-Rényi diameter
■ Suppose the model results in a tree-structured network of nodes with identical degrees, all equal to the average z=p(n−1)
■ Starting from a given node, how many nodes can we reach in 𝓁 steps?
  ■ At step 1, reach z nodes
  ■ then, reach z(z−1) new nodes
  ■ then, reach z(z−1)² new nodes
  ■ …
  ■ the number of new nodes reached grows exponentially with steps


Erdős-Rényi diameter
■ After 𝓁 steps, we have reached a total of
  z + z(z−1) + z(z−1)² + … + z(z−1)^(𝓁−1)
■ nodes, which is
  z((z−1)^𝓁 − 1) / (z − 2)
■ which is roughly (z−1)^𝓁
■ How many steps to reach (n−1) nodes?
  (z−1)^𝓁 = (n−1)
■ Roughly, 𝓁 has to be on the order of log(n)/log(z)

Erdős-Rényi diameter
■ The diameter will be roughly twice log(n)/log(z)
■ Confirms the empirical data we observed in real networks
■ Can be shown to hold for the general ER model without the strong assumptions
  ■ In reality, not all nodes have the same degree
  ■ In reality, not tree-structured (there could be backwards edges)
■ Proof based on a weaker set of conditions
  ■ n large
  ■ z ≥ (1 − ε)log(n) for some ε>0 (connected)
  ■ z/n → 0 (but not too connected)
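The log(n)/log(z) estimate can be sanity-checked against an exact diameter computation on a sampled network. The sketch below is illustrative (names and parameters are assumptions, not from the slides); it computes the diameter by breadth-first search from every node.

```python
import math
import random
from collections import deque
from itertools import combinations

def gnp_adj(n, p, seed=None):
    """Adjacency sets of one G(n, p) sample."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def diameter(adj):
    """Exact diameter via BFS from every node; math.inf if disconnected."""
    best = 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        if len(dist) < len(adj):
            return math.inf   # some node unreachable from s
        best = max(best, max(dist.values()))
    return best

n, p = 200, 0.05
z = p * (n - 1)
print(diameter(gnp_adj(n, p, seed=0)), "vs log(n)/log(z) =",
      round(math.log(n) / math.log(z), 2))
```

The measured diameter is typically a small constant factor (roughly two, as the slides argue) times log(n)/log(z).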

Erdős-Rényi model vs real networks
Diameter
■ The ER model is a good predictor of diameter and average path length compared to real networks
■ The model results in networks with small diameters, capturing very well the “small-world” property observed in many real networks

Erdős-Rényi clustering coefficient
■ Recall clustering coefficient of a node: probability that two randomly selected friends of it are friends themselves

[Figure: two friends of a node — is the edge between them present?]

■ In the ER model, an edge between any two nodes is present with probability p (independent of their context)
■ So, the clustering coefficient of the ER random network is equal to p
Erdős-Rényi clustering coefficient
■ Example: n=5, p=0.6

[Figure: a sample network whose five nodes have clustering coefficients 1, 1, 0, 2/3, 2/3]

■ CC=(0+1+1+2/3+2/3)/5=0.6667
■ Compare with p which is 0.6

Erdős-Rényi clustering coefficient
■ Recall edge density of a network: actual number of edges in proportion to the maximum possible number of edges
■ In the ER model, on average, pn(n−1)/2 edges are added, thus m=pn(n−1)/2
■ Edge density of ER network: m / (n(n−1)/2) = p
■ Since the edge density is exactly equal to the background probability of triangles being closed, the networks produced by the ER model cannot be considered highly clustered
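The claim that the clustering coefficient of an ER network equals p can be checked empirically. The sketch below is illustrative (names are assumptions; nodes of degree below 2 contribute 0, which is one common convention): it averages the local clustering coefficients over one G(n, p) sample.

```python
import random
from itertools import combinations

def avg_clustering(n, p, seed=None):
    """Average local clustering coefficient of one G(n, p) sample."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    total = 0.0
    for v in range(n):
        nbrs = list(adj[v])
        d = len(nbrs)
        if d < 2:
            continue  # clustering undefined; count as 0
        closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += closed / (d * (d - 1) / 2)
    return total / n

# For a reasonably large sample the average hovers near p
print(avg_clustering(300, 0.1, seed=7))
```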

Erdős-Rényi model vs real networks
Clustering coefficient
■ The ER model is a poor predictor of clustering compared to real networks
■ The model results in clustering coefficients that are too small and too close to the edge density
■ Whereas most real networks are often highly clustered with clustering coefficients that are much greater (sometimes several orders of magnitude) than their edge densities

Erdős-Rényi giant component
■ Suppose we add edges randomly with probability p
■ If p=0, no edges added, so edge density of the network is 0
■ As p tends towards 1, the edge density tends towards 1
■ In fact, for the ER model, edge density follows the edge probability exactly
■ What structural properties are likely at a given density 𝜌?
■ When do certain structures emerge as a function of 𝜌?
  ■ Many interesting properties occur at small densities
  ■ And they occur very suddenly (tipping points)


Erdős-Rényi giant component
Tipping point
■ Note that at edge density 𝜌, the expected node degree is 𝜌(n−1) ≈ 𝜌n for large n
■ Run the NetLogo Library/Networks/GiantComponent simulation
■ In the ER model, giant components start forming at very low values of edge density
■ For large n, we can show that
  ■ If 𝜌 < 1/n, the probability of a giant component tends to 0
  ■ If 𝜌 > 1/n, the probability of a giant component tends to 1 and all other components have size at most log(n)
■ At the tipping point 𝜌=1/n, the average node degree is 𝜌n=1
■ Network is very sparse but ER uses edges very efficiently

Erdős-Rényi giant component
Tipping point
■ Why is it very unlikely that two large components form?
■ Run the NetLogo ErdosRenyiTwoComponents simulation
■ Suppose two large components containing roughly half the nodes each do form in the ER model

[Figure: two components of ∼n/2 nodes each, with all cross-component edges missing]

Erdős-Rényi giant component
Tipping point
■ How many potential edges are missing?
■ The number of cross component edges is n/2×n/2=n²/4
■ Compare to the total number of possible edges: n(n−1)/2
■ In other words, more than half of all possible edges are missing
■ Selecting a new edge to add that is not one of the missing “cross edges” becomes increasingly more unlikely
■ Imagine enrolling 10,000 friends to Facebook asking them to keep their friendships strictly among themselves
  ■ Impossible to maintain since all it takes is just one of the 10,000 to make one external friendship

Erdős-Rényi giant component
Tipping point
■ In those rare cases where two giant components have co-existed for a long time, their merger is sudden and often dramatic
■ Imagine the arrival of the first Europeans in the Americas some 500 years ago
■ Until then, the global socio-economic-technological network likely consisted of two giant components — one for the Americas, another for Europe-Asia
■ In the two components, not only technology, but also human diseases developed independently
■ When they came in contact, the results were disastrous


Erdős-Rényi diameter
Tipping point
■ In the ER model, emergence of small diameter is also sudden and has a tipping point
■ For large n, we can show that
  ■ If 𝜌 < n^(−5/6), the probability of the network having diameter 6 or less tends to 0
  ■ If 𝜌 > n^(−5/6), the probability of the network having diameter 6 or less tends to 1
■ For the US, n=300M and the tipping point is 𝜌n ≈ 25.8
■ For the world, n=7B and the tipping point is 𝜌n ≈ 43.7

Erdős-Rényi
Other tipping points
■ In fact, we can prove a much more general result
■ In the ER model, any monotone property of networks has a tipping point
■ In networks, a property is monotone if it continues to hold as we add more edges to the network
■ Examples:
  ■ The network has a giant component
  ■ The diameter of the network is at most k
  ■ The network contains a cycle of length at most k
  ■ The network contains at most k isolated nodes
  ■ The network contains at least k triangles

Erdős-Rényi
Summary
■ The ER model is able to explain
  ■ Small diameter, path lengths
  ■ Giant components
■ The ER model is not able to explain
  ■ Degree distributions
  ■ Clustering
Lecture Notes: Social Networks: Models, Algorithms, and Applications
Lecture 1: Jan 26, 2012
Scribes: Geoffrey Fairchild and Jason Fries

1 Random Graph Models for Networks


1.1 Graph Modeling
A random graph is a graph that is obtained by randomly sampling from a collection of graphs.
This collection may be characterized by certain graph parameters having fixed values.

Definition 1 G(n, m) is the graph obtained by sampling uniformly from all graphs with n vertices
and m edges.

For example, given n = 4 and m = 2 with the vertex set V = {1, 2, 3, 4} we could obtain any one
of these graphs:

[Figure 1: Possible random graph instances G1, G2, …, G15 for n = 4, m = 2, resulting in a state space Ω of size 15]

The probability of selecting a particular graph requires determining the size of the set of all possible graph outcomes, computed by choosing m edge combinations from all possible pairs of the n nodes.

Definition 2 The total number of possible random graphs given n vertices and m edges is

$$|\Omega| = \binom{\binom{n}{2}}{m}$$

and, sampling uniformly, each graph is selected with probability 1/|Ω|.

1.2 Erdős-Rényi Model

The above approach constitutes the sampling view of generating a random graph. Alternatively, we can take a constructive view where we start with vertex set V = {1, 2, 3, ..., n} and select uniformly at random one edge from those edges not yet chosen, repeating this m times.

Definition 3 G(n, p) is the random graph obtained by starting with vertex set V = {1, 2, 3, ..., n}, letting 0 ≤ p ≤ 1, and connecting each pair of vertices {i, j} by an edge with probability p.

This model is typically referred to as the Erdős-Rényi (ER) Random Graph Model, outlined by Erdős and Rényi in two papers from 1959 and 1960 [2, 3]. While the model bears their names, their work initially examined the properties of the G(n, m) model, only later expanding to analyze the G(n, p) model. Both variants were independently proposed by Solomonoff and Rapoport in 1951 [5] and Gilbert in 1959 [4].

In analyses, the G(n, m) model is not as easy to deal with mathematically as the similar (though not identical) graph G(n, p), so in practice G(n, p) is more commonly used today. The equivalence of G(n, m) and G(n, p) can be noted by setting $\binom{n}{2} p = m$ and observing that as n → ∞, G(n, p) should behave similarly to G(n, m): by virtue of the law of large numbers, G(n, p) will contain approximately the same number of edges as G(n, m).

1.2.1 Probabilistic Characteristics of G(n, p)

Definition 4 The expected number of edges in G(n, p) is $\binom{n}{2} p$.

For example, if we wanted to generate a linear number of edges (a sparse graph), p should be on the order of 1/n.

Definition 5 For the distribution of the number of edges in G(n, p), let X be the random variable counting the number of edges:

$$P[X = x] = \binom{\binom{n}{2}}{x} p^x (1 - p)^{\binom{n}{2} - x}$$

This takes the form of a binomial distribution, and the implication of this definition is that the number of edges is concentrated around the mean with high probability.

Definition 6 The expected degree of a vertex in G(n, p) is (n − 1)p.

Definition 7 For the degree distribution of G(n, p), fix a vertex v and let Y be the number of edges incident on v:

$$P[Y = y] = \binom{n-1}{y} p^y (1 - p)^{n-1-y}$$

This is why Erdős-Rényi graphs are said to have a binomial degree distribution.

Definition 8 We can use a Poisson approximation to compute an expected degree distribution of G(n, p) as follows:

If we fix (n − 1)p to a constant c (the expected degree), then

$$\binom{n-1}{y} p^y (1-p)^{n-1-y} \longrightarrow \frac{c^y e^{-c}}{y!}$$

which is the Poisson distribution with parameter c.
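The convergence to the Poisson distribution can be checked numerically while holding c = (n − 1)p fixed. A brief sketch (helper names are illustrative):

```python
from math import comb, exp, factorial

def binom_pmf(n, p, y):
    """Exact binomial degree probability in G(n, p)."""
    return comb(n - 1, y) * p**y * (1 - p)**(n - 1 - y)

def poisson_pmf(c, y):
    """Poisson approximation with parameter c = (n - 1)p."""
    return c**y * exp(-c) / factorial(y)

c = 3.0
for n in (10, 100, 10000):
    p = c / (n - 1)   # keep the expected degree fixed at c
    print(n, binom_pmf(n, p, 2), poisson_pmf(c, 2))
```

As n grows, the exact binomial value approaches the Poisson value.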

Definition 9 The expectation of the local clustering coefficient in G(n, p) is p.

Recall the definition of the local clustering coefficient:

$$cc(v) = \frac{\text{pairs of neighbors of } v \text{ connected by edges}}{\text{total pairs of neighbors of } v}$$

The expected value of cc(v) is calculated as

$$E[cc(v)] = \sum_{d=0}^{n-1} E[\, cc(v) \mid \deg(v) = d \,]\; P[\deg(v) = d]$$

Observe that the conditional expectation of cc(v) given deg(v) = d reduces to p:

$$E[\, cc(v) \mid \deg(v) = d \,] = \frac{p \binom{d}{2}}{\binom{d}{2}} = p$$

leaving the equation

$$\sum_{d=0}^{n-1} p\; P[\deg(v) = d]$$

The sum of the probabilities of all possible outcomes is, of course, equal to 1, leaving our final equation as

$$p \cdot 1 = p$$

Given a formal definition of the clustering coefficient cc for a random graph G(n, p), we can revisit the 1998 paper in Nature by Watts and Strogatz [6] and now compute the cc of a corresponding random graph.

                  N       average degree   cc      cc of corresponding random graph
actors network    225226  61               0.79    0.00027
power grid        4941    2.67             0.080   0.005
C. elegans        282     14               0.28    0.05

Table 1: Comparing observed networks against “corresponding” random graphs.

For example, the corresponding random graph for the actors’ network would be

n = 225226
c = (n − 1)p = 61
p = 61/225225 ≈ 0.00027

1.2.2 “Small World” Property of G(n, p)

We will show that the diameter of G(n, p) is

$$\frac{\ln n}{\ln c}, \quad \text{where } c = p(n - 1)$$
For example, consider an acquaintance network of every human being on earth, currently estimated at 7 billion people. If every individual has, on average, 1000 acquaintances, our graph diameter is calculated as

$$\frac{\ln (7 \times 10^9)}{\ln 1000} \approx 3.28$$
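The arithmetic can be reproduced directly:

```python
import math

# Diameter estimate ln(n)/ln(c) for the acquaintance-network example:
# n = 7 billion people, average degree c = 1000 acquaintances.
n, c = 7e9, 1000
print(round(math.log(n) / math.log(c), 2))  # 3.28
```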
Definition 10 The diameter of graph G(V, E), where distance(u, v) is the length of the shortest path between u and v, is

$$\max_{u,v \in V} \text{distance}(u, v)$$

Remember, however, that we are acting on random graphs, meaning that the diameter is itself a random variable. The diameter referred to here is more correctly thought of as the expected diameter of graph G, formally stated as

$$P\left[\, \text{distance}(u, v) > \frac{\ln n}{\ln c} \,\right] \longrightarrow 0 \quad \text{as } n \to \infty$$

Note that, counter perhaps to our intuition, this expected diameter value does not lie in the middle of roughly an equal number of graphs with low diameter and graphs with high diameter. In reality, as n → ∞, there are a diminishing number of graphs with a diameter larger than ln n / ln c. A formal proof [1] of the expected diameter of a random graph is outside the scope of this text, but we can construct a heuristic argument that gives some intuition into the problem.

Fix a vertex v, and let c = (n − 1)p. Divide the graph into two sets of nodes, reached and remaining. At each level we continue adding edges to unreached nodes, so that the number of vertices reachable in s hops is c^s; setting c^s = n gives s = ln n / ln c.
[Figure 2: A “heuristic” argument for the expected graph diameter: successive levels of the reached set have sizes roughly c, c², c³, …; at each level, previously unreached nodes are added to the reached set.]

4
As a heuristic proof, there are of course problems with this . Eventually our reached set will
be larger than our remaining, for example. Next class will discuss some of these points and ways
to address them.

References
[1] B. Bollobás. Random Graphs, volume 73. Cambridge University Press, 2001.
[2] P. Erdős and A. Rényi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
[3] P. Erdős and A. Rényi. On the evolution of random graphs. Akad. Kiadó, 1960.
[4] E.N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.
[5] R. Solomonoff and A. Rapoport. Connectivity of random nets. Bulletin of Mathematical Biology, 13(2):107–117, 1951.
[6] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393:440–442, 1998.

Module5_SemanticWeb
Modelling and aggregating social network data

Reference: Peter Mika, Social Networks and the Semantic Web
Introduction
• The network data collected in various studies is stored and published either in data formats not primarily intended for network analysis (such as Excel sheets or SPSS tables) or in proprietary graph description languages of network analysis packages that ignore the semantics of data (e.g. the types of instances, the meaning of the attributes, etc.)
• Hence, it is difficult to verify results independently, to carry out secondary analysis (to reuse data), and to compare results across different studies.
Introduction
• Two fundamental reasons for developing semantic-based representations of social networks:
– Maintaining the semantics of social network data is crucial for aggregating social network information, especially in heterogeneous environments where the individual sources of data are under diverse control.
– Semantic representations can facilitate the exchange and reuse of case study data in the academic field of Social Network Analysis.
State-of-the-art in network data representation
• The most common kind of social network data can be modelled by a graph where the nodes represent individuals and the edges represent binary social relationships.
– Higher-arity relationships may be represented using hyper-edges, i.e. edges connecting multiple nodes.
• Social network studies build on attributes of nodes and edges, which can be formalized as functions operating on nodes or edges.
State-of-the-art in network data representation

• A number of different formats exist for


serializing such graphs and attribute data in
machine-processable electronic documents.
• The most commonly encountered formats are
those used by the popular network analysis
packages Pajek and UCINET.
• These are text-based formats which can be
easily edited using simple text editors.
A simple graph described in Pajek .NET and UCINET DL formats

[Figure: the same graph serialized in the Pajek .NET and UCINET DL (Data Language) formats]
Pajek, UCINET Formats
• The two formats are incompatible.
– UCINET has the ability to read and write the .net format of Pajek, but not vice versa.
• Social Sciences researchers often represent their data
initially using Microsoft Excel spreadsheets, which can
be exported in the simple CSV (Comma Separated
Values) format.
• As this format is not specific to graph structures,
additional constraints need to be put on the content
before such a file can be processed by graph packages.
• Visualization software packages also have their own
proprietary formats.
GraphML Format
• The GraphML format represents an advancement over
the previously mentioned formats in terms of
interoperability and extensibility.
• GraphML originates from the information visualization
community where a shared format greatly increases
the usability of new visualization methods.
• GraphML is based on XML with a schema defined in
XML Schema.
• Advantage: GraphML files can be edited, stored,
queried, transformed etc. using generic XML tools.
• Focuses on the graph structure, which is the primary
input to network analysis and visualization.
A simple graph described in GraphML format

<?xml version="1.0" encoding="UTF-8"?>


<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
<graph id="G" edgedefault="undirected">
<node id="a"/>
<node id="b"/>
<node id="c"/>
<node id="d"/>
<edge source="a" target="b"/>
<edge source="a" target="c"/>
<edge source="a" target="d"/>
<edge source="b" target="c"/>
</graph>
</graphml>
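Because GraphML is plain XML, the listing above can be processed with generic XML tools. As an illustrative sketch using Python's standard library (the constant name and the `g` namespace alias are assumptions):

```python
import xml.etree.ElementTree as ET

# The GraphML example from the text, embedded as a string.
GRAPHML = """<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="a"/><node id="b"/><node id="c"/><node id="d"/>
    <edge source="a" target="b"/><edge source="a" target="c"/>
    <edge source="a" target="d"/><edge source="b" target="c"/>
  </graph>
</graphml>"""

# Elements live in the GraphML namespace, so queries must qualify them.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(GRAPHML)
nodes = [n.get("id") for n in root.findall(".//g:node", NS)]
edges = [(e.get("source"), e.get("target")) for e in root.findall(".//g:edge", NS)]
print(nodes)   # ['a', 'b', 'c', 'd']
print(edges)
```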
Data Formats
• None of the formats discussed previously support the aggregation and reuse of electronic data, which is the primary concern.

Why data aggregation?
• Consider the typical scenario (as in the next slide) where we would like to implement a network study by reusing a number of data sources describing the same set of individuals and their relationships (for example, email archives and publication databases holding information about researchers).
• One of the common reasons to use multiple data sources is to perform triangulation, i.e. to use a variety of data sources and/or methods of analysis to verify the same conclusion.
Example
• Example of a case of identity reasoning. Based on a semantic representation, a
reasoner would be able to conclude, for example, that Peter Mika knows the
author of the book “A Semantic Web Primer”
Data Sources
• Often the data sources to be used contain complementary information.
– For example, a study by Besselaar et al.
• A multiplex network is studied using data sources that contain evidence about different kinds of relationships in a community
• Allows one to look at how these networks differ and how relationships of one type might affect the building of relationships of another type.
• In both cases we need to be able to recognize matching instances in the different data sources and merge the records before starting the analysis.
• The graph representation strips social network data of exactly those characteristics that one needs to consider when aggregating data or sharing it for reuse.
– These graph formats reduce social individuals and their relationships to nodes and edges, which is the only information required for the purposes of analysing networks.
Aggregation and Reuse
• Hence, to support aggregation and reuse, a representation is required that allows us to capture and compare the identity of instances and relationships.
– Instances are primarily persons, but it might be, for example, that we need to aggregate multiple databases containing patents or publications, which are then used to build a network of individuals or institutions.
• Maintaining the identity of individuals and relationships is also crucial for preserving our data sets in a way that best enables their reuse.
Key Problems in aggregating social network data
• The proposed solution (considering the difficulties of other formats) is a rich, semantic-based representation of the primary objects in social network data.
• A semantic-based representation will allow us to exploit the power of ontology languages and tools in aggregating data sets through domain-specific knowledge about identity, i.e. what it takes for two instances to be considered the same.
• A semantic-based format has the additional advantage that we can easily enrich our data set with specific domain knowledge (such as the relatively simple fact that if two people send emails to each other, they know each other, or that a certain kind of relationship implies (or refutes) the existence of another kind of relationship).
• Hence, the two key problems in aggregating social network data are
– the identification and disambiguation of social individuals
– the aggregation of information about social relationships
Metadata
• Metadata is data about data
– Data being used to identify, describe, or locate information resources
• Metadata helps manage and use large collections of information
– Library card catalogs are an example of metadata
Ontological representation of social
individuals
• The Friend-of-a-Friend (FOAF) ontology is an OWL based
format for representing personal information and an
individual’s social network.
• The Friend of a Friend (FOAF) project is about creating a
Web of machine readable homepages describing people,
the links between them and the things they create.
• FOAF is a machine-readable ontology describing persons,
their activities and their relations to other people and
objects.
• Anyone can use FOAF to describe themselves.
• FOAF allows groups of people to describe social networks
without the need for a centralised database.
FOAF
• FOAF is a descriptive vocabulary expressed using the
Resource Description Framework (RDF) and the Web
Ontology Language (OWL).
• Computer programs may use these FOAF profiles to find,
for example, all people living in Europe, or to list all people
both you and a friend of yours know.
• This is accomplished by defining relationships between
people.
• Each profile has a unique identifier (such as the person's e-
mail addresses, international telephone number, Facebook
account name, or a URI of the homepage or weblog of the
person), which is used when defining these relationships.
FOAF
• FOAF aims to create a linked information system
about people, groups, companies and other kinds
of thing.
• If people publish information in FOAF document
format, machines will be able to make use of that
information.
• If those files contain “see also” references to
other such documents in the Web, we will have a
machine-friendly version of today’s hypertext
web
• FOAF documents are usually represented in RDF.
FOAF
• FOAF greatly surpasses graph description languages in expressivity by using the powerful OWL vocabulary to characterize individuals.
• Using FOAF one can rely on the extensibility of the RDF/OWL representation framework for enhancing the basic ontology with domain-specific knowledge about identity.
FOAF ontology - Example
Friend-of-a-Friend (FOAF) ontology

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix example: <http://www.example.org/> .

example:Rembrandt rdf:type foaf:Person .
example:Saskia rdf:type foaf:Person .
example:Rembrandt foaf:name "Rembrandt" .
example:Rembrandt foaf:mbox <mailto:rembrandt@example.org> .
example:Rembrandt foaf:knows example:Saskia .
example:Saskia foaf:name "Saskia" .

A set of triples describing two persons, represented in the Turtle (Terse RDF Triple Language) language
A graph visualization of the RDF document

[Figure: the same set of triples rendered as a graph of nodes and labelled edges]
Classes and properties of the FOAF ontology
• FOAF has a vocabulary for describing personal attribute information typically found on
homepages such as name and email address of the individual, projects, interests, links to
work and school homepage etc.
FOAF Basics
• foaf:Agent
– An agent (e.g. person, group, software or physical artifact)
– Subclasses: foaf:Person, foaf:Organization, foaf:Group
• foaf:Document
– Subclass: foaf:Image
• foaf:Person
– A person
• foaf:Project
– A project
FOAF basic properties
• foaf:family_name
• foaf:firstName
• foaf:homepage
• foaf:knows
– A person known by this person
• foaf:mbox
• foaf:mbox_sha1sum
• foaf:title
– Personal title (Mr, Mrs, Ms, Dr, etc.)
FOAF
• The idea of FOAF was to provide a machine-processable format for representing personal information described in homepages of individuals.
• FOAF profiles can also contain a description of the individual’s friends, using the same vocabulary that is used to describe the individual himself.
• FOAF profiles can be linked together to form networks of web-based profiles.
• Studies noted that the majority of FOAF profiles on the Web are auto-generated by community sites such as LiveJournal and Opera Communities.
• As FOAF profiles are scattered across the Web, it is difficult to estimate their number.
• FOAF started as an experiment with Semantic Web technology.
FOAF
• FOAF became the center point of interest in 2003 with the spread of Social Networking Services such as Friendster, Orkut, LinkedIn etc.
• Despite their early popularity, a number of drawbacks were discovered.
– Firstly, the information is under the control of the database owner, who has an interest in keeping the information bound to the site and is willing to protect the data through technical and legal means.
• The profiles stored in these systems typically cannot be exported in machine-processable formats (or cannot be exported legally) and therefore the data cannot be transferred from one system to the next.
• As a result, the data needs to be maintained separately at different services.
– Secondly, centralized systems do not allow users to control the information they provide on their own terms.
• Although Friendster follow-ups offer several levels of sharing (e.g. public information vs. only for friends), users often still find out the hard way that their information was used in ways that were not intended.
Create your own FOAF
• http://www.ldodds.com/foaf/foaf-a-matic
– Fill in your details
– It will create your FOAF description in RDF
• Publish your FOAF description
– Save your FOAF RDF file somewhere on your website,
usually named “foaf.rdf”
FOAF Example in RDF/XML
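The RDF/XML listing for this slide did not survive conversion. A minimal FOAF profile of the kind foaf-a-matic generates might look like the sketch below; the names and addresses are placeholders, and only properties introduced earlier in these slides (foaf:firstName, foaf:family_name, foaf:homepage, foaf:mbox, foaf:knows) are used:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:firstName>Jane</foaf:firstName>
    <foaf:family_name>Doe</foaf:family_name>
    <foaf:homepage rdf:resource="http://example.org/~jane/"/>
    <foaf:mbox rdf:resource="mailto:jane@example.org"/>
    <foaf:knows>
      <foaf:Person>
        <foaf:firstName>John</foaf:firstName>
        <foaf:mbox rdf:resource="mailto:john@example.org"/>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
```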
FOAF Conclusions
• Vocabulary for machine-processable personal
homepages
• Currently some preliminary tools available
• Not yet as successful as social networks such
as friendster, which use proprietary central
data
• Advantage of foaf: decentralized, could serve
as exchange format between those existing
networks and exists on its own
References
• http://www.foaf-project.org/
• http://xmlns.com/foaf/0.1/
Module5_SemanticWeb
Introduction
Reference: Peter Mike, Social
Networks and the Semantic Web
Introduction
• Most of the web’s content is designed for humans to
read and not for computer programs to process
meaningfully.
• Computer programs can
- parse the web pages
- perform routine processing
• In general, they have no reliable method to understand
and process the semantics.
• The Semantic Web brings structure to the meaningful
content of the web pages, creating an environment
where software agents roaming from page to page
carry out sophisticated tasks for users.
Introduction (cont’d)
• The Semantic Web is a major research initiative
of the World Wide Web Consortium (W3C) to
create a metadata-rich Web of resources that
can describe themselves not only by how they
should be displayed (HTML) or syntactically (XML),
but also by the meaning of the metadata.
• “The Semantic Web is an extension of the current
web in which information is given well-defined
meaning, better enabling computers and people
to work in cooperation.”
– Tim Berners-Lee, James Hendler, Ora Lassila,
Introduction (cont’d)
• Difficulties to find, present, access, or
maintain available electronic information on
the web.
• Need for a data representation to enable
software products (agents) to provide
intelligent access to heterogeneous and
distributed information.
The Semantic Web: why?
• Difficulty in searching on the Web
– due to the way in which information is stored on the
Web
• Problem 1: Web documents do not distinguish
between information content and presentation
(“solved” by XML)
• Problem 2: Different web documents may
represent in different ways semantically related
pieces of information
• This leads to hard problems for “intelligent”
information search on the Web
Separating content and presentation
• Problem 1: web documents do not distinguish
between information content and
presentation
– problem due to the HTML language
– problem “solved” by technologies like
• stylesheets (HTML, XML)
• XML
– Stylesheets allow for separating formatting attributes from
the information presented
XML
• XML: eXtensible Mark-up Language
• XML documents are written through a user-defined set
of tags
• XML lets everyone create their own tags.
• These tags can be used by scripts in sophisticated
ways to perform various tasks, but the script writer has
to know what the page writer uses each tag for.
– XML allows arbitrary structure to be added to documents
but says nothing about what the structures mean.
• It has no built-in mechanism to convey the meaning of
the user’s new tags to other users.
XML: example
• HTML:
<H1>Seminar on Data Analytics </H1>
<UL>
<LI>Teacher: Max Plank
<LI>Room: 7
<LI>Prerequisites: none
</UL>
• XML:
<course>
<title> Seminar on Data Analytics </title>
<teacher> Max Plank </teacher>
<room> 7 </room>
<prereq> none</prereq>
</course>
Limitations of XML
• XML does not solve all the problems:
– different XML documents may express
information with the same meaning using
different tags
The need for a “Semantic” Web
• Problem 2: Different web documents may
represent in different ways semantically related
pieces of information
– different XML documents do not share the semantics
of information
• Idea: annotate (mark-up) pieces of information to
express the “meaning” of such a piece of
information
- the meaning of such tags is shared
⇒shared semantics
The Semantic Web initiative
• The Semantic Web provides a common
framework that allows data to be shared and
reused across application, enterprise and
community boundaries.
• Published using languages specifically
designed for data:
– Resource Description Framework (RDF)
– Web Ontology Language (OWL)
– Extensible Markup Language (XML)
Example
• An example of a tag that would be used in a
non-semantic web page:
<item> blog </item>
• Encoding similar information in a semantic web
page might look like this:
<item rdf:about="http://example.org/blog/"> blog </item>
The Semantic Web Tower
The Semantic Web Layers
• XML layer
• RDF + RDFS layer
• Ontology layer
• Proof-rule layer
• Trust layer
The XML layer
• XML (eXtensible Markup Language)
- user-definable and domain-specific
markup
• URI (Uniform Resource Identifier)
– universal naming for Web resources
– same URI => same resource
• URIs are the “ground terms” of the SW
Resource Description Framework (RDF)
• A scheme for defining information on the web.
• It provides the technology for expressing the meaning of
terms and concepts in a form that computers can readily
process.
• Framework for describing metadata (data describing the web
resources).
• RDF encodes this information on the XML page in sets of
triples.
• The triple is an information on the web about related things.
• Each triple is a combination of Subject, Verb and Object,
similar to an elementary sentence.
• Subjects, Verbs and Objects are each identified by a URI,
which enable anyone to define a new concept/new verb just
by defining a URI for it somewhere on the web.
Resource Description Framework
(RDF)
• RDF has the following important concepts:
– Resource: The resources being described by RDF are anything that
can be named via a URI.
– Property: A property is also a resource that has a name, for
instance Author or Title.
– Statement: A statement consists of the combination of a Resource,
a Property, and an associated value.
The RDF + RDFS layer
• RDF model = set of RDF triples
• triple = expression (statement)
(subject, predicate, object)
– subject = resource
– predicate = property (of the resource)
– object = value (of the property)
The RDF + RDFS layer
• Triples can be written using XML tags as shown
<contact rdf:about=“edumbill”>
<name>Edd Dumbill</name>
<role>Managing Director</role>
<organization>XML.com</organization>
</contact>

Subject | Verb | Object
doc.xml#edumbill | http://w3.org/1999/02/22-rdf-syntax-ns#type | http://example.org/contact
doc.xml#edumbill | http://example.org/name | “Edd Dumbill”
doc.xml#edumbill | http://example.org/role | “Managing Director”
doc.xml#edumbill | http://example.org/organization | “XML.com”
RDF Schema
• RDF Schema is an extension of Resource
Description Framework.
• RDF Schema provides a higher level of abstraction
than RDF.
– specific classes of resources
– specific properties
– relationships between these properties and other
resources can be described.
• RDFS allows specific resources to be described as
instances of more general classes
The RDF + RDFS layer
• RDFS = RDF Schema
• example:
Limitations of RDF/RDFS
• No standard for expressing primitive data
types such as integer, etc.
• All data types in RDF/RDFS are treated as
strings.
• No standard for expressing relations of
properties (unique, transitive etc.)
• No standard to express equivalence,
disjointness etc. among properties
The Ontology layer
• Ontologies are collections of statements written in a language such as RDF that
define relations between concepts and specify logical rules for reasoning about
them.
• Computers/agents/services will understand the meaning of semantic data on a
web page by following links to specified ontologies.
• Ontologies can express a large number of relationships among entities (objects) by
assigning properties to classes and allowing subclasses to inherit such properties.
• An ontology may express a rule such as:
if a city code is associated with a state code, and an address uses
that city code, then the address has that state code.
The Ontology layer
• ontology = shared conceptualization
(more expressive than RDF + RDFS)
• Expressed in a true knowledge representation
language
• OWL (Web Ontology Language) = standard
language for ontologies
OWL
• Web Ontology Language (OWL) is another effort developed by the OWL
working group of the W3C.
• OWL is divided into the following sublanguages:
• OWL Lite
• OWL DL (Description Logics)
• OWL Full
• OWL Lite is a subset of OWL DL, which in turn is a subset of OWL Full.
The proof/rule layer
Beyond OWL:
• Rule: informal notion
• Rules are used to perform inference over
ontologies
• Rules -> a tool for capturing further
knowledge
(not expressible in OWL ontologies )
The Trust layer
• SW top layer:
– where does the information come from?
– how is this information obtained?
– can I trust this information?
Ontologies: example
Evolution of Semantic Web
The Semantic Web Tower
Module5_Text Mining
Data Mining
• Data mining is the computing process of
discovering patterns in large data sets involving
methods at the intersection of machine learning,
statistics, and database systems.
• The goal of the data mining process is to extract
information from a data set and transform it into
an understandable structure for further use.
Data Mining
Data mining involves six common classes of tasks:
• Anomaly detection
• Association rule learning
• Clustering
• Classification
• Regression
• Summarization
Data Mining
• Anomaly detection (outlier/change/deviation detection) –
The identification of unusual data records, that might be
interesting or data errors that require further
investigation.
• Association rule learning (dependency modelling) –
Searches for relationships between variables.
– For example, a supermarket might gather data on customer purchasing
habits. Using association rule learning, the supermarket can determine
which products are frequently bought together and use this information
for marketing purposes (market basket analysis)
• Clustering – is the task of discovering groups and
structures in the data that are in some way or another
"similar", without using known structures in the data.
Data Mining
• Classification – is the task of generalizing known
structure to apply to new data.
– For example, an e-mail program might attempt to classify an e-
mail as "legitimate" or as "spam".
• Regression – attempts to find a function which models
the data with the least error (for estimating the
relationships among data or datasets).
• Summarization – providing a more compact
representation of the data set, including visualization and
report generation.
Definition
• A process of identifying novel information
from a collection of texts.
• Text mining is also known as Text Data
Mining (TDM) and Knowledge Discovery in
Textual Databases.
Text Mining
• Text mining is the process of deriving high-quality
information from a collection of texts.
• High-quality information is typically derived through the
devising of patterns and trends through means such as
statistical pattern learning.
• “High quality” in text mining usually refers to some
combination of relevance, novelty and interestingness.
• Text mining also is known as Text Data Mining (TDM)
and Knowledge Discovery in Textual Database
Text Mining
• Text mining usually involves the process of
– structuring the input text
– deriving patterns within the structured data
– evaluation and interpretation of the output
• Typical text mining tasks include
– text classification, text clustering, concept/entity extraction,
production of granular taxonomies, sentiment analysis,
document summarization, and entity relation modeling
Text Classification - Nearest
Neighbor Classifier
• k-nearest neighbor or kNN uses the k nearest instances,
called neighbors, to perform classification.
• The instance being classified is assigned the label
(class attribute value) that the majority of its k
neighbors are assigned.
• When k = 1, the closest neighbor’s label is used as the
predicted label for the instance being classified.
• To determine the neighbors of an instance, we need to
measure its distance to all other instances based on
some distance metric.
– Euclidean distance is employed commonly.
k - Nearest Neighbor Classifier
• The input consists of the k closest training examples
• The output is a class membership.
– An object is classified by a majority vote of its neighbors, with the object being assigned to
the class most common among its k nearest neighbors
• Example:
• The test sample (green circle) should be
classified either to the first class of blue
squares or to the second class of red
triangles.
• If k = 3 (solid line circle) it is assigned to
the second class because there are 2
triangles and only 1 square inside the
inner circle.
• If k = 5 (dashed line circle) it is assigned
to the first class (3 squares vs. 2 triangles
inside the outer circle).
k - Nearest Neighbor Classifier
K-NN Example - 1
• Given the data set below
• X1 X2 Y(Class Label)
7 7 Bad
7 4 Bad
3 4 Good
1 4 Good
• How would 3-NN classify the test data (3,7) ?
• 3-NN will classify the sample (3,7) as Good
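The 3-NN computation above can be checked with a short sketch (written in Python here, rather than the R used later in these notes):

```python
from math import dist

# Training data from the example: (X1, X2) -> class label
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]

def knn_classify(query, data, k):
    # Sort training points by Euclidean distance to the query point
    neighbors = sorted(data, key=lambda row: dist(row[0], query))[:k]
    labels = [label for _, label in neighbors]
    # Majority vote among the k nearest neighbors
    return max(set(labels), key=labels.count)

print(knn_classify((3, 7), train, k=3))  # -> Good
```

The three nearest neighbors of (3,7) are (3,4) "Good", (1,4) "Good" and (7,7) "Bad", so the vote is 2 to 1 for "Good", matching the slide.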
K-NN Example - 2
• Suppose we have height, weight and T-
shirt size of some customers and we need
to predict the T-shirt size of a new
customer given only height and weight
information. Data including height, weight
and T-shirt size (class attribute)
information is shown below.
K-NN Example - 2
• What T shirt size would 5-NN model using Euclidean distance return
for a new customer of height 161cm and weight 61kg?
Height (in cms)   Weight (in kgs)   T Shirt Size
158 59 M
158 63 M
160 59 M
160 60 M
163 60 M
163 61 M
160 64 L
163 64 L
165 61 L
165 62 L
165 65 L
168 62 L
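Under the same Euclidean-distance rule, the question can be answered with a quick sketch (Python; the slide’s table reproduced as a list):

```python
from math import dist

# (height in cm, weight in kg) -> T-shirt size, from the table above
train = [((158, 59), "M"), ((158, 63), "M"), ((160, 59), "M"), ((160, 60), "M"),
         ((163, 60), "M"), ((163, 61), "M"), ((160, 64), "L"), ((163, 64), "L"),
         ((165, 61), "L"), ((165, 62), "L"), ((165, 65), "L"), ((168, 62), "L")]

def knn_classify(query, data, k):
    # k nearest training rows by Euclidean distance, then majority vote
    neighbors = sorted(data, key=lambda row: dist(row[0], query))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

print(knn_classify((161, 61), train, k=5))  # -> M
```

Four of the five nearest customers wear M, so the model returns M.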
Bag-of-Words (BoW) model for document
classification
• The bag-of-words model(vector space model) is a way of
representing text data when modeling text with machine
learning algorithms.
– Simple to understand and implement problems in areas of language
modeling and document classification.
– A text (such as a sentence or a document) is represented as the
bag of its words, disregarding grammar and even word order but
keeping multiplicity.
– Commonly used in methods of document classification where the
(frequency of) occurrence of each word is used as a feature for training
a classifier.
• Uses
– A vocabulary of known words
– A measure of the presence of known words
Example of Bag-of-Words Model
• Data collected:
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness
• Design the Vocabulary:
“it”, “was”, “the”, “best”, “of”, “times”, “worst”, “age”, “wisdom”, “foolishness”
• Create Document Vectors (scores over the 10-word vocabulary):
• "it was the best of times"
– [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
• "it was the worst of times"
– [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
• "it was the age of wisdom"
– [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
• "it was the age of foolishness"
– [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
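These vectors can be generated mechanically. Since the scores shown on the slide are binary presence indicators, this sketch (Python) marks presence rather than counts:

```python
vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]

def bag_of_words(sentence, vocab):
    # Binary presence score for each vocabulary word, in vocabulary order
    words = sentence.lower().split()
    return [1 if word in words else 0 for word in vocab]

print(bag_of_words("it was the best of times", vocabulary))
# -> [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```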
K-NN Example - 3
• Email spam filtering models often use a bag-of-words representation for
emails. In a bag-of-words representation, the descriptive features that
describe a document each represent how many times a particular word
occurs in the document. One descriptive feature is included for each word in
a predefined dictionary. The dictionary is typically defined as the complete
set of words that occur in the training dataset. Consider the 5 emails (data
set) given below.
• “money, money, money”
• “free money for free gambling fun”
• “gambling for fun”
• “machine learning for fun, fun, fun”
• “free machine learning”
K-NN Example – 3 (cont’d)
• The table below lists the bag-of-words representation for the
following five emails and a target feature, SPAM, whether they are
spam emails or genuine emails:
• What target level would a nearest neighbor model using Euclidean
distance return for the following email: “machine learning for free”?
• What target level would a k-NN model with k = 3 and using
Euclidean distance return for the same query: “machine learning
for free”?
K-NN Example – 3 (cont’d)
K-NN Example – 3 (cont’d)
k-NN in R Programming language
knn(train, test, cl, k)
where
• train is a matrix or a data frame of training (classification) cases
• test is a matrix or a data frame of test case(s) (one or more rows)
• cl is a factor of classification labels for the training cases (one label
per row of train)
• k is an integer value of closest cases (the k in the k-Nearest Neighbor
Algorithm)
kNN - R program
# Class A training cases
A1=c(0,0)
A2=c(1,1)
A3=c(2,2)

# Class B training cases
B1=c(6,6)
B2=c(5.5,7)
B3=c(6.5,5)

# Build the classification matrix
train=rbind(A1,A2,A3,B1,B2,B3)

# Class labels vector (attached to each class instance)
cl=factor(c(rep("A",3),rep("B",3)))

# The object to be classified
test=c(4,4)

# Load the "class" package that holds the knn() function
library(class)

# Call knn() and get its summary
summary(knn(train, test, cl, k = 1))
K-Means Example
• Consider the following data set consisting of the
scores of two variables on each of seven
individuals:
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
• This data set is to be grouped into two clusters.
Example 2
k-Means Example: Given: { (1.0,1.0),(1.5,2.0), (3.0,4.0), (5.0,7.0), (3.5,
5.0 ), (4.5,5.0) , (3.5,4.5) }, k=2
• Randomly assign means: m1=(1,1) ,m2=(5,7)
Iteration 1
• K1={(1.0,1.0),(1.5,2.0), (3.0,4.0)}, K2={(5.0,7.0), (3.5, 5.0 ), (4.5,5.0) ,
(3.5,4.5) }
• Calculating mean of K1 and K2 results in m1=(1.83,2.3) ,m2=(4.1,5.4)
Iteration 2
• K1={(1.0,1.0),(1.5,2.0)}, K2={(3.0,4.0),(5.0,7.0), (3.5, 5.0 ), (4.5,5.0) ,
(3.5,4.5) }
• Calculating mean of K1 and K2 results in m1=(1.25,1.5) ,m2=(3.9, 5.1)
Iteration 3
• K1={(1.0,1.0),(1.5,2.0)}, K2={(3.0,4.0),(5.0,7.0), (3.5, 5.0 ), (4.5,5.0) ,
(3.5,4.5) }
• Calculating mean of K1 and K2 results in m1=(1.25,1.5), m2=(3.9, 5.1)
• Stop iterating as the clusters with these means stay the same
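The iterations above can be reproduced with a small Lloyd’s-algorithm sketch (Python; ties go to the first centroid, matching the assignment of (3.0, 4.0) to K1 in iteration 1):

```python
from math import dist

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

def kmeans(points, means, iterations=10):
    for _ in range(iterations):
        # Assign each point to its nearest mean (first mean wins ties)
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)), key=lambda j: dist(p, means[j]))
            clusters[i].append(p)
        # Recompute each mean as the centroid of its cluster
        means = [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in clusters]
    return means

m1, m2 = kmeans(points, [(1.0, 1.0), (5.0, 7.0)])
print(m1, m2)  # -> approximately (1.25, 1.5) and (3.9, 5.1)
```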
K Means using R
Example 1:
x <- c(2,4,10,12,3,20,30,11,25)
cl <- kmeans(x, 2)
Example 2:
library(graphics)
xy <- matrix(c(1.0, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 7.0, 3.5, 5.0, 4.5, 5.0, 3.5,
4.5), byrow = TRUE, ncol = 2)
colnames(xy) <- c("x", "y")
cl <- kmeans(xy, 2)
plot(xy, col = "blue")
points(cl$centers, col = 1:2, pch = 8, cex = 2)
K Means using R
#Example 3:
library(graphics)
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
K Means using R
#Example 4:
library(graphics)
# A 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 5)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex=2)
Regression
• Example: Predict the stock market value (class attribute) of a
company given information about the company (features).
– Regression must be used to predict it.
• The input to the regression method is a dataset where attributes are
represented using x1, x2, …, xm (also known as regressors) and the
class attribute is represented using Y (also known as the dependent
variable), where the class attribute is a real number.
• Aim: To find the relation between Y and the
vector X = (x1, x2, …, xm).
Linear Regression
• Investigate the potential relationship between a variable
of interest (the response variable) and a set of one or
more variables (known as the independent variables).
• In linear regression, the class attribute Y is assumed to
have a linear relation with the regressors (feature set) X,
up to a linear error term ε:
Y = XW + ε
where W represents the vector of regression coefficients.
• W is estimated using the training dataset and its labels Y
such that the fitting error is minimized.
Linear Regression - Example
• The sales of a company (in million dollars) for each year
are shown in the table below.
x (year) 2005 2006 2007 2008 2009
y (sales) 12 19 29 37 45
a) Find the least squares regression line y = ax + b.
b) Use the least squares regression line as a model
to estimate the sales of the company in 2012.
We know: a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)
b = (1/n)(Σy - aΣx)
Linear Regression - Example
• Let us change the variable x into t such
that t = x - 2005 and therefore t represents
the number of years after 2005. The table
of values becomes.
t (years after 2005) 0 1 2 3 4
y (sales) 12 19 29 37 45
Linear Regression - Example
• We now use the table to calculate a and b
included in the least squares regression line
formula.
t    y    ty    t²
0    12   0     0
1    19   19    1
2    29   58    4
3    37   111   9
4    45   180   16
Σt = 10   Σy = 142   Σty = 368   Σt² = 30
Linear Regression - Example
• Calculate a and b using the least squares
regression formulas for a and b.
a = (nΣty - ΣtΣy) / (nΣt² - (Σt)²) = (5×368 -
10×142) / (5×30 - 10²) = 8.4
b = (1/n)(Σy - aΣt) = (1/5)(142 - 8.4×10) = 11.6
b) In 2012, t = 2012 - 2005 = 7
The estimated sales in 2012 are: y = 8.4 × 7 +
11.6 = 70.4 million dollars.
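The arithmetic can be cross-checked with a short script (Python; the R lm() example on the next slide computes the same fit):

```python
# Least squares fit y = a*t + b using the closed-form formulas above
t = [0, 1, 2, 3, 4]
y = [12, 19, 29, 37, 45]
n = len(t)

sum_t, sum_y = sum(t), sum(y)
sum_ty = sum(ti * yi for ti, yi in zip(t, y))
sum_t2 = sum(ti**2 for ti in t)

a = (n * sum_ty - sum_t * sum_y) / (n * sum_t2 - sum_t**2)
b = (sum_y - a * sum_t) / n
print(round(a, 1), round(b, 1), round(a * 7 + b, 1))  # -> 8.4 11.6 70.4
```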
Linear Regression - R Example
#Create the predictor and response variable.
x <- c(0,1,2,3,4)
y <- c(12,19,29,37,45)
relation <- lm(y~x)
# Predict for x=7
a <- data.frame(x = 7)
result <- predict(relation,a)
print(result)
Linear Regression - R Example
# The predictor vector
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function
relation <- lm(y~x)

# Find weight of a person with height 170
a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)
Linear Regression – R Example
# Create the predictor and response variable.
setwd("E:\\")
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.
png(file = "linearregression.png")

# Plot the chart.
plot(y,x,col = "blue",main = "Height & Weight Regression",
abline(lm(x~y)),cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")

# Save the file.
dev.off()
Association Rule Mining
Web Technologies and
Applications, University of Alberta,
Dr. Osmar R Zaiane
Association Rule Learning
• A method for discovering interesting relations
between variables in large databases.
• Association rule mining searches for
relationships between items in a dataset.
– Find associations, correlations, causal structures among
sets of items or objects in transaction databases,
relational databases or other information repositories.
• A transaction is a set of items
• A set of items is an itemset.
• An itemset containing k items is called a k-itemset
Association Rule Mining
Definition: [by Agrawal, Imieliński, Swami ]
• Let I = { i1 , i2 , … , in } be a set of n binary attributes
called items.
• Let D = { t1 , t2 , … , tm } be a set of transactions called
the database.
• Each transaction in D has a unique transaction ID and
contains a subset of the items in I .
• A rule is defined as an implication of the form:
X ⇒ Y , where X , Y ⊆ I
where X is called antecedent or left-hand-side (LHS) and Y
consequent or right-hand-side (RHS).
Example
Example database with 5 transactions and 5 items
• The set of items is I = { milk, bread, butter, beer, diapers }
• In each entry, the value 1 means the presence of the
item in the corresponding transaction, and the value 0
represents the absence of an item in that transaction.
• An example rule for the supermarket could be
{ butter, bread } ⇒ { milk }
– meaning that if butter and bread are bought, customers also buy milk.
Association Rule Mining
How to mine association rules:
• Input:
– A database of transactions
– Each transaction is a list of items
• Find all rules that associate one set of
items with another set of items.
– Example: 95% of people who buy bread also
buy butter
Association Rule Mining
• Rule Measures: Support and Confidence
• In order to select interesting rules from the
set of all possible rules, constraints on
various measures of significance and
interest are used.
• The best-known constraints are
minimum thresholds on support and
confidence.
Support
• Let X be an itemset, X ⇒ Y an association rule
and T a set of transactions of a given database.
• Support is an indication of how frequently the
itemset appears in the dataset.
• The support of X with respect to T is defined as
the proportion of transactions t in the
dataset which contains the itemset X
Confidence
• Confidence is an indication of how often the rule
has been found to be true.
• The confidence value of a rule, X ⇒ Y , with
respect to a set of transactions T, is the
proportion of the transactions that contains
X which also contains Y
• Confidence is defined as:
conf ( X ⇒ Y ) = supp ( X ∪ Y ) / supp ( X )
Example
• Itemset X = {beer, diapers} has a support of 1 / 5 = 0.2,
since it occurs in 20% of all transactions (1 out of 5
transactions).
• The rule {butter, bread} ⇒ {milk} has a confidence
of 0.2 / 0.2 = 1.0 in the database, which means
that for 100% of the transactions containing
butter and bread the rule is correct (100% of the
times a customer buys butter and bread, milk is
bought as well).
Lift
• Lift is defined as the ratio of the observed
support to that expected if X and Y were
independent:
lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
• The rule {milk, bread} ⇒ {butter} has a lift
of 0.2 / (0.4 × 0.4) = 1.25
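These measures are easy to compute directly. The sketch below (Python) assumes the five transactions of the standard supermarket example these slides draw on; the transaction table itself is an image in the slides, so treat the list as an illustrative assumption (it reproduces the supports quoted above):

```python
# Assumed 5-transaction database (the table itself is an image in the slides)
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(itemset):
    # Fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(lhs, rhs):
    return supp(lhs | rhs) / supp(lhs)

def lift(lhs, rhs):
    return supp(lhs | rhs) / (supp(lhs) * supp(rhs))

print(supp({"beer", "diapers"}))                     # -> 0.2
print(conf({"butter", "bread"}, {"milk"}))           # -> 1.0
print(round(lift({"milk", "bread"}, {"butter"}), 2)) # -> 1.25
```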
Apriori Algorithm
• An algorithm for frequent item set mining and association rule
learning over transactional databases.
• Proposed by Agrawal and Srikant in 1994
• Find Frequent item sets: the set of all items that have minimum
support
• A subset of frequent itemset must also be a frequent itemset
– If {AB} is a frequent itemset then {A} and {B} should be frequent
itemsets
• Iteratively find frequent itemsets with cardinality from 1 to k (k-
itemsets)
• Frequent itemsets are used to generate the rules.
Apriori: A Candidate Generation & Test Approach –
Frequent Itemset Mining method
• Apriori pruning principle: If there is any itemset which is infrequent, its
superset should not be generated/tested! (Agrawal & Srikant
@VLDB’94, Mannila, et al. @ KDD’ 94)
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from
length k frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can
be generated
The Apriori Algorithm (Pseudo-Code)
All nonempty subsets of a frequent itemset must be
frequent (Apriori property).
Ck: Candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates
        in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
Apriori Algorithm
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3 = {abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
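The self-join and prune steps above can be sketched as follows (Python; itemsets are kept as sorted tuples):

```python
from itertools import combinations

def apriori_gen(Lk):
    # Generate C(k+1) from the frequent k-itemsets Lk: self-join, then prune
    Lk = sorted(Lk)
    k = len(Lk[0])
    candidates = []
    for a, b in combinations(Lk, 2):
        # Self-join: merge itemsets that share their first k-1 items
        if a[:k - 1] == b[:k - 1]:
            cand = a + b[-1:]
            # Prune: every k-subset of the candidate must itself be frequent
            if all(s in Lk for s in combinations(cand, k)):
                candidates.append(cand)
    return candidates

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(apriori_gen(L3))  # -> [('a', 'b', 'c', 'd')]
```

As on the slide, abcd survives because all its 3-subsets are in L3, while acde is pruned because ade is not.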
Example
• Consider the following database:
• alpha beta epsilon
• alpha beta theta
• alpha beta epsilon
• alpha beta theta
• The association rules that can be determined from
this database are the following:
– 100% of sets with alpha also contain beta
– 50% of sets with alpha, beta also have epsilon
– 50% of sets with alpha, beta also have theta
Example
#install.packages("arules")
library(arules)
trans<-list(c("Alpha","Beta","Epsilon"), c("Alpha","Beta","Theta"),
c("Alpha","Beta","Epsilon"), c("Alpha","Beta","Theta"))
names(trans)<-paste("Tr", c(1:4), sep="")
trans
rules<-apriori(trans,parameter=list(supp=.5, conf=.5,
target="rules"))
inspect(head(sort(rules,by="lift"),n=20))
Apriori Algorithm
• Algorithm:
Example
Association Rule Mining
Generating Association Rules from Frequent Itemsets
• Procedure:
• For each frequent itemset “l”, generate all nonempty subsets of l.
• For every nonempty subset s of l, output the rule “s → (l-s)” if
support_count(l) / support_count(s) >= min_conf where
min_conf is minimum confidence threshold.
Back To Example:
We have, L = { {1}, {2}, {3}, {5}, {1,3}, {2,3}, {2,5}, {3,5},
{2,3,5} }.
Let's take l = {2,3,5}.
Its nonempty subsets are {2,3},{2,5}, {3,5}, {2},{3},{5}
Possible Rules are: Let minimum confidence threshold be 70%.
{2,3} → {5} // 2/2 => 100% : Strong
{2,5} → {3} // 2/3 => 67%
{3,5} → {2} // 2/2 => 100% :Strong
{2} → {3,5} // 2/3=> 67%
{3} → {2,5} // 2/3 => 67%
{5} → {2,3} // 2/3 => 67%
Generating Association Rules from Frequent Itemsets
Let's take l = {2,3}
Its nonempty subsets are {2},{3}
Possible Rules are: Let minimum confidence threshold be 70%.
{2} → {3} // 2/3 => 67%
{3} → {2} // 2/3 => 67%
Let's take l = {2,5}
Its nonempty subsets are {2},{5}
Possible Rules are: Let minimum confidence threshold be 70%.
{2} → {5} // 3/3 => 100% : Strong
{5} → {2} // 3/3 => 100% : Strong
Let's take l = {3,5}
Its nonempty subsets are {3},{5}
Possible Rules are: Let minimum confidence threshold be 70%.
{3} → {5} // 2/3 => 67%
{5} → {3} // 2/3 => 67%
Generating Association Rules from Frequent Itemsets
Strong Association rules:
{2,3} → {5} // 2/2 => 100% : Strong
{3,5} → {2} // 2/2 => 100% : Strong
{2} → {5} // 3/3 => 100% : Strong
{5} → {2} // 3/3 => 100% : Strong
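The rule-generation procedure, applied to the transactions used in the second R example below ({1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}), can be sketched as (Python):

```python
from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support_count(itemset):
    # Number of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions)

def strong_rules(frequent_itemset, min_conf):
    # Emit s -> (l - s) for every nonempty proper subset s meeting min_conf
    l = frozenset(frequent_itemset)
    rules = []
    for r in range(1, len(l)):
        for s in combinations(sorted(l), r):
            s = frozenset(s)
            confidence = support_count(l) / support_count(s)
            if confidence >= min_conf:
                rules.append((set(s), set(l - s), confidence))
    return rules

# Check l = {2,3,5} with a 70% minimum confidence threshold
for lhs, rhs, confidence in strong_rules({2, 3, 5}, 0.7):
    print(lhs, "->", rhs, confidence)
```

For l = {2,3,5} this yields exactly the two strong rules found by hand above, {2,3} → {5} and {3,5} → {2}, each with confidence 1.0.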
R Example
library(arules)
trans<-list(c("A","B","C"), c("B","C"), c("A","B","D"), c("A","B","C","D"), c("A"),
c("B"))
names(trans)<-paste("Tr", c(1:6), sep="")
trans
rules<-apriori(trans,parameter=list(supp=.02, conf=.5, target="rules"))
inspect(head(rules,n=20))
R Example
library(arules)
trans<-list(c(1,3,4), c(2,3,5), c(1,2,3,5), c(2,5))
names(trans)<-paste("Tr", c(1:4), sep="")
trans
rules<-apriori(trans,parameter=list(supp=.02,
conf=.5,target="frequent itemsets"))
inspect(head(rules, n=20))
Apriori Algorithm in Social Networks
• Ways to increase the number of members
in a social network site
• Ways to advertise a social network site
Association Rule Mining
Another Example
Module5_Naïve Bayes
Classifier
Reference: Data Mining: Concepts and Techniques, (3rd Edn.), Jiawei
Han, Micheline Kamber, Morgan Kaufmann, 2015
Bayes Classifier
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership
probabilities
• Foundation: Based on Bayes’ Theorem: P(A|B) = P(B|A) P(A) / P(B)
• Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
• Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
• Performance: A simple Bayesian classifier, naïve Bayesian classifier, has performance
comparable with more sophisticated classifiers
• Incremental: Each training example can incrementally increase/decrease the probability
that a hypothesis is correct; prior knowledge can be combined with observed data
Bayes’ Theorem: Basics
• Bayes’ Theorem: P(H|X) = P(X|H) P(H) / P(X)
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (i.e., posteriori probability): the probability that the
hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
• prior probability of hypothesis H (i.e. the initial probability before we observe any data, reflects the background
knowledge)
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X’s age is 31..40 with medium income
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem

P(H|X) = P(X|H) P(H) / P(X)

• Informally, this can be viewed as


posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes

Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem
• This can be derived from Bayes’ theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)

• Since P(X) is constant for all classes, only
P(X|Ci) P(Ci)
needs to be maximized
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
P(X|Ci) = ∏(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
• This greatly reduces the computation cost: Only counts the class
distribution
• Once the probability P(X|Ci) is known, assign X to the class with
maximum P(X|Ci)*P(Ci)
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi)
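For a continuous attribute, the Gaussian density above can be evaluated directly in R. In the sketch below the class mean 38 and standard deviation 12 are made-up illustrative numbers; the hand-rolled formula is checked against R's built-in dnorm():

```r
# Gaussian class-conditional density g(x, mu, sigma) from the formula above
g <- function(x, mu, sigma) exp(-(x - mu)^2 / (2 * sigma^2)) / (sqrt(2 * pi) * sigma)

# Illustrative: P(age = 35 | Ci) for a class with mean age 38 and sd 12
g(35, 38, 12)   # same value as dnorm(35, mean = 38, sd = 12)
```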
Naïve Bayes Classifier - Example
• Class:
  • C1: buys_computer = ‘yes’
  • C2: buys_computer = ‘no’
• Instance to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)
• Dataset:
  age    income  student credit_rating buys_computer
  <=30   high    no      fair          no
  <=30   high    no      excellent     no
  31…40  high    no      fair          yes
  >40    medium  no      fair          yes
  >40    low     yes     fair          yes
  >40    low     yes     excellent     no
  31…40  low     yes     excellent     yes
  <=30   medium  no      fair          no
  <=30   low     yes     fair          yes
  >40    medium  yes     fair          yes
  <=30   medium  yes     excellent     yes
  31…40  medium  no      excellent     yes
  31…40  high    yes     fair          yes
  >40    medium  no      excellent     no
Naïve Bayes Classifier - Example (cont’d)
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
         P(buys_computer = “no”) = 5/14 = 0.357
• Compute P(X|Ci) for each class:
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|Ci): P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
           P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
  P(X|Ci)*P(Ci): P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
                 P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
• Therefore, X belongs to class (“buys_computer = yes”)
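The computation above is easy to reproduce in R. The sketch below uses the conditional probabilities read off the table, kept as exact fractions rather than the rounded decimals:

```r
# Priors and class-conditional likelihoods for X = (<=30, medium, yes, fair)
p_yes <- 9/14;  p_no <- 5/14
lik_yes <- (2/9) * (4/9) * (6/9) * (6/9)   # P(X | buys_computer = yes)
lik_no  <- (3/5) * (2/5) * (1/5) * (2/5)   # P(X | buys_computer = no)

score_yes <- lik_yes * p_yes   # ~0.028
score_no  <- lik_no  * p_no    # ~0.007
c(yes = score_yes, no = score_no)   # "yes" wins, so X is classified as a buyer
```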
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
P(X|Ci) = ∏(k=1..n) P(xk|Ci)
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
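The Laplacian correction on the income counts can be sketched as:

```r
# Counts from the example: 1000 tuples, income = low/medium/high
counts <- c(low = 0, medium = 990, high = 10)

# Add 1 to each case; the denominator grows by the number of categories
probs <- (counts + 1) / (sum(counts) + length(counts))
probs   # 1/1003, 991/1003, 11/1003 -- no zero probabilities left
```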
Naïve Bayes Classifier: Comments
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore loss of
accuracy
• Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks
Example
• Classify the text "What is the price of the book", using the dataset given below
Example
• Classify the text "A very close game" using the dataset given below.
Regression - Example
• Apply kNN and score for the text “This is a unique abstract” using the dataset given below.
K-Means Clustering Example
• K-Means Clustering-
• K-Means clustering is an unsupervised iterative clustering technique.
• It partitions the given data set into k predefined distinct clusters.
• A cluster is defined as a collection of data points exhibiting certain similarities.

It partitions the data set such that-
• Each data point belongs to the cluster with the nearest mean.
• Data points belonging to one cluster have a high degree of similarity.
• Data points belonging to different clusters have a high degree of dissimilarity.
K-Means Clustering Algorithm-

K-Means Clustering Algorithm involves the following steps-

Step-01:

•Choose the number of clusters K.

Step-02:

•Randomly select any K data points as cluster centers.


•Select cluster centers in such a way that they are as far as possible from each other.
Step-03:

•Calculate the distance between each data point and each cluster center.
•The distance may be calculated either by using the given distance function or by using the
Euclidean distance formula.

Step-04:

•Assign each data point to some cluster.


•A data point is assigned to that cluster whose center is nearest to that data point.

Step-05:

•Re-compute the center of newly formed clusters.


Advantages-

K-Means Clustering Algorithm offers the following advantages-

Point-01:

It is relatively efficient with time complexity O(nkt) where-


•n = number of instances
•k = number of clusters
•t = number of iterations

Disadvantages-

K-Means Clustering Algorithm has the following disadvantages-


•It requires specifying the number of clusters (k) in advance.
•It cannot handle noisy data and outliers.
•It is not suitable for identifying clusters with non-convex shapes.
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|

Use K-Means Algorithm to find the three cluster centers after the second iteration.
Iteration-01:

•We calculate the distance of each point from each of the center of the three
clusters.
•The distance is calculated by using the given distance function.

The following illustration shows the calculation of distance between point


A1(2, 10) and each of the center of the three clusters-
Calculating distance between A1(2, 10) and C1(2, 10):
Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0

Calculating distance between A1(2, 10) and C2(5, 8):
Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3 + 2 = 5

Calculating distance between A1(2, 10) and C3(1, 2):
Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1 + 8 = 9

In the same manner, we calculate the distance of the other points from each of the
centers of the three clusters.
From here, the new clusters are-
Cluster-01: contains point A1(2, 10)
Cluster-02: contains points A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
Cluster-03: contains points A2(2, 5), A7(1, 2)

Now, we re-compute the new cluster centers. The new cluster center is computed by
taking the mean of all the points contained in that cluster.

For Cluster-01: we have only one point A1(2, 10), so the cluster center remains the same.
For Cluster-02: center = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
For Cluster-03: center = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

This is the completion of Iteration-01.


Iteration-02:

•We calculate the distance of each point from each of the centers of the three clusters.
•The distance is calculated by using the given distance function.

Calculating distance between A1(2, 10) and C1(2, 10):
Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0

Calculating distance between A1(2, 10) and C2(6, 6):
Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |6 – 2| + |6 – 10| = 4 + 4 = 8

Calculating distance between A1(2, 10) and C3(1.5, 3.5):
Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1.5 – 2| + |3.5 – 10| = 0.5 + 6.5 = 7

In the same manner, we calculate the distance of the other points from each of the
centers of the three clusters.
From here, the new clusters are-
Cluster-01: contains points A1(2, 10), A8(4, 9)
Cluster-02: contains points A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4)
Cluster-03: contains points A2(2, 5), A7(1, 2)

Now, we re-compute the new cluster centers by taking the mean of all the points
contained in each cluster.

For Cluster-01: center = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
For Cluster-02: center = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
For Cluster-03: center = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

This is the completion of Iteration-02.

After the second iteration, the centers of the three clusters are-
•C1(3, 9.5)
•C2(6.5, 5.25)
•C3(1.5, 3.5)
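The two iterations above can be reproduced with a few lines of base R. This is a minimal sketch using the Manhattan distance the exercise specifies (base R's built-in kmeans() uses Euclidean distance, so the loop is coded by hand):

```r
pts <- rbind(c(2,10), c(2,5), c(8,4), c(5,8), c(7,5), c(6,4), c(1,2), c(4,9))
centers <- pts[c(1, 4, 7), ]   # initial centers A1, A4, A7

for (iter in 1:2) {
  # Manhattan distance from every point to every center
  d <- apply(centers, 1, function(ctr) rowSums(abs(sweep(pts, 2, ctr))))
  cl <- apply(d, 1, which.min)                 # nearest-center assignment
  centers <- t(sapply(1:3, function(k)         # new centers = cluster means
    colMeans(pts[cl == k, , drop = FALSE])))
}
centers   # rows: (3, 9.5), (6.5, 5.25), (1.5, 3.5) -- matching the worked answer
```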
Module5_TextMining

K-Modes Clustering
K-Modes Clustering – Example 1
• Consider the given dataset:
Tuple X1 X2 X3 X4 X5 X6
No.
1 AA BB AB AA AB AB
2 AB BB AB AA AB BB
3 AA AB AA AB AA AB
4 BB AA BB AB AA BB
5 AB AA AB BB BB BB
6 AA AB BB AA AB BB
7 BB BB AA AB AA AB
8 AB AB AA AB BB AB

• Let K be 2
• Let tuples 1 and 5 be the initial centroids (chosen randomly) of the two clusters respectively.
• Centroids:
Tuple No. X1 X2 X3 X4 X5 X6
1         AA BB AB AA AB AB
5         AB AA AB BB BB BB
Example 1 (cont’d)
• Let us compute the distance d(X,Y) from each tuple to the two cluster centroids:
Tuple No. X1 X2 X3 X4 X5 X6  Distance to Cluster1  Distance to Cluster2
1         AA BB AB AA AB AB  0                     5
2         AB BB AB AA AB BB  2                     3
3         AA AB AA AB AA AB  4                     6
4         BB AA BB AB AA BB  5                     4
5         AB AA AB BB BB BB  5                     0
6         AA AB BB AA AB BB  3                     5
7         BB BB AA AB AA AB  4                     6
8         AB AB AA AB BB AB  5                     4

• Each tuple is assigned to the cluster with the smaller distance. Hence
  • Tuples 1, 2, 3, 6 and 7 will fall in cluster 1.
  • Tuples 4, 5 and 8 will fall in cluster 2.
Example 1 (cont’d)
• Let the centroids (modes) of the two clusters be updated with
reference to the tuples currently assigned to the clusters.
• New Centroids:
Cluster No. X1 X2 X3 X4 X5 X6
1           AA BB AA AA AB AB
2           AB AA BB AB BB BB
• Let us compute the updated distance from each tuple to the
two cluster centroids:
Tuple No. X1 X2 X3 X4 X5 X6  Distance to Cluster1  Distance to Cluster2
1         AA BB AB AA AB AB  1                     4
2         AB BB AB AA AB BB  3                     4
3         AA AB AA AB AA AB  3                     5
4         BB AA BB AB AA BB  6                     2
5         AB AA AB BB BB BB  6                     2
6         AA AB BB AA AB BB  2                     4
7         BB BB AA AB AA AB  3                     5
8         AB AB AA AB BB AB  4                     3
Example 1 (cont’d)
• Following the updated distance
– Tuples 1,2,3,6 and 7 will fall in cluster 1.
– Tuples 4,5 and 8 will fall in cluster 2.
• Let the centroids of two clusters be updated with reference to
the tuples currently assigned to the clusters.
• New Centroids:
Cluster No. X1 X2 X3 X4 X5 X6
1           AA BB AA AA AB AB
2           AB AA BB AB BB BB

• The current state of the two centroids is the same as the previous
centroids. Hence the membership of the two clusters will not change anymore.
• Conclusion:
– Cluster 1 will comprise Tuples 1,2,3,6 and 7
– Cluster 2 will comprise Tuples 4,5 and 8
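The centroid (mode) update for Cluster 1 can be sketched in R: the mode of each attribute is its most frequent value among the cluster's members (ties here are broken by alphabetical table order, which happens to match the slide's choice):

```r
# Tuples 1, 2, 3, 6 and 7 assigned to Cluster 1
cluster1 <- rbind(c("AA","BB","AB","AA","AB","AB"),
                  c("AB","BB","AB","AA","AB","BB"),
                  c("AA","AB","AA","AB","AA","AB"),
                  c("AA","AB","BB","AA","AB","BB"),
                  c("BB","BB","AA","AB","AA","AB"))

# Column-wise mode: most frequent value per attribute
col_mode <- function(m) apply(m, 2, function(col) names(which.max(table(col))))
col_mode(cluster1)   # AA BB AA AA AB AB -- the new Cluster 1 centroid
```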
K-Modes clustering – Example 2 – Data points
Step-1: Randomly select k unique objects as the initial cluster centers (modes)

d(X,Y) where
X = BB AB AB AB AB AA AB BB AB BB
Y = AB BB BB AB BB BB AB AB BB AB (2nd data point)
d(X,Y) = 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 = 8
Step 2: Calculate the distances between each object and the
cluster mode; assign the object to the cluster whose center has
the shortest distance to the object; repeat this step until all
objects are assigned to clusters.
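The dissimilarity measure used in these steps is simply the number of mismatched attributes, which is a one-liner in R:

```r
# Matching distance between object X and the 2nd data point Y from Step 1
x <- c("BB","AB","AB","AB","AB","AA","AB","BB","AB","BB")
y <- c("AB","BB","BB","AB","BB","BB","AB","AB","BB","AB")
sum(x != y)   # 8, as computed above
```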
Example 2- Identify Data points belonging to
different clusters and update the new centroids
References
[1] Z. Huang, “A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets
in Data Mining”, Proceedings of Data Mining and Knowledge Discovery, pp. 1-6, 1997.
[2] https://openi.nlm.nih.gov/detailedresult.php?img=PMC1525209_1471-2105-7-204-17&req=4
Module6_R_Examples
Example
install.packages("network")
library(network)

src <- c("A", "B", "C", "D", "E", "B", "A", "F")
dst <- c("B", "E", "A", "B", "B", "A", "F", "A")

edges <- cbind(src, dst)


Net <- as.network(edges, matrix.type = "edgelist")

summary(Net)
plot(Net)
Example
library(igraph)
g <- make_ring(10)
degree(g)
plot(g)
degree_distribution(g)
Example

g <- sample_gnp(1000, 10/1000)


degree_distribution(g)
Example
g <- barabasi.game(100,m=2)
eb <- cluster_edge_betweenness(g)
g <- make_full_graph(10) %du%
make_full_graph(10)
g <- add_edges(g, c(1,11))
eb <- cluster_edge_betweenness(g)
eb
Example
g <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1,6, 1,11, 6,11))
cluster_walktrap(g)
Example
library( igraph)
plot(make_graph(c(1, 2, 2, 3, 3, 4, 5, 6), directed =
FALSE) )
plot(make_graph(c("A", "B", "B", "C", "C", "D"),
directed = FALSE) )
plot(make_graph("Tetrahedron"))
plot(make_graph("Cubical"))
plot(make_graph("Octahedron"))
plot(make_graph("Dodecahedron"))
plot(make_graph("Icosahedron"))
Example
karate <- make_graph("Zachary")
wc <- cluster_walktrap(karate)
modularity(wc)
membership(wc)
plot(wc, karate)
Example
karate <- make_graph("Zachary")
wc <- cluster_fast_greedy(karate)
modularity(wc)
membership(wc)
plot(wc, karate)
Example
library(igraph)
g1 <- barabasi.game(10)
plot(g1)
Example
library(igraph)
g2 <- barabasi.game(5)
plot(g2)
Example – Plot several graphs
library(igraph)
g1 <- barabasi.game(10)
g2 <- barabasi.game(5)
plot(g1)
plot(g2,add=TRUE, vertex.color="green" ,
edge.color="blue")
References
• http://sites.stat.psu.edu/~drh20/Rnetworks/
• http://igraph.org/r/doc/
• http://kateto.net/networks-r-igraph
• https://github.com/kolaczyk/sand/blob/master/s
and/inst/code/chapter4.R
• http://www.mayin.org/ajayshah/KB/R/tutorial.R
• https://www.datacamp.com/community/tutorial
s/r-data-import-tutorial
Modules – 6 - R – Examples
install.packages("network")
library(network)

src <- c("A", "B", "C", "D", "E", "B", "A", "F")
dst <- c("B", "E", "A", "B", "B", "A", "F", "A")

edges <- cbind(src, dst)


Net <- as.network(edges, matrix.type = "edgelist")

summary(Net)
plot(Net)
#.........................................

library( igraph)
plot(make_graph(c(1, 2, 2, 3, 3, 4, 5, 6), directed = FALSE) )

plot(make_graph(c("A", "B", "B", "C", "C", "D"), directed = FALSE) )

plot(make_graph("Tetrahedron"))

plot(make_graph("Cubical"))

plot(make_graph("Octahedron"))

plot(make_graph("Dodecahedron"))

plot(make_graph("Icosahedron"))
#....................................
library(igraph)

g <- make_ring(10)
plot(g)

degree(g)
closeness(g)
betweenness(g)

eigen_centrality(g)
mean_distance(g)
transitivity(g)

library(igraph)

g <- make_graph(c('c','d', 'c','f', 'd','f', 'd','e', 'e','f'), directed = FALSE)


plot( g )
eigen_centrality(g, directed = FALSE)

#....................................

g <- sample_gnp(10, 0.1)


plot(g)

g <- sample_gnm(5, 8)
plot(g)

g <- sample_pa(5)
plot(g)

g <- barabasi.game(10)
plot(g)

# sample_smallworld(dim, size, nei, p, loops = FALSE, multiple = FALSE)


# dim --> Integer constant, the dimension of the starting lattice.
# size --> Integer constant, the size of the lattice along each dimension.
# nei --> Integer constant, the neighborhood within which the vertices of the lattice will be
connected.
# p --> Real constant between zero and one, the rewiring probability.

g <- sample_smallworld(1, 10, 2, 0.05)


plot(g)

#....................................
karate <- make_graph("Zachary")
wc <- cluster_walktrap(karate)
modularity(wc)
membership(wc)
plot(wc, karate)

#....................................
karate <- make_graph("Zachary")
wc <- cluster_fast_greedy(karate)
modularity(wc)
membership(wc)
plot(wc, karate)

#....................................
library(igraph)
g1 <- barabasi.game(10)
plot(g1)

#....................................

library(igraph)
g2 <- barabasi.game(5)
plot(g2)

#....................................
library(igraph)
g1 <- barabasi.game(10)
g2 <- barabasi.game(5)
plot(g1)
plot(g2,add=TRUE, vertex.color="green" , edge.color="blue")

#....................................
# kNN Example
# Class A training
A1=c(0,0)
A2=c(1,1)
A3=c(2,2)

# Class B training cases


B1=c(6,6)
B2=c(5.5,7)
B3=c(6.5,5)

# Build the classification matrix


train=rbind(A1,A2,A3, B1,B2,B3)

# Class labels vector (attached to each class instance)


cl=factor(c(rep("A",3),rep("B",3)))
cl
# The object to be classified
test=c(4, 4)

# Load the "class" package that holds the knn() function


library(class)

# call knn() and get its summary


summary(knn(train, test, cl, k = 2))

#....................................
# kMeans Example 1:
x <- c(2,4,10,12,3,20,30,11,25)
cl <- kmeans(x, 2)
cl
# kMeans Example 2:
library(graphics)
xy <- matrix(c(1.0 ,1.0, 1.5, 2.0 , 3.0, 4.0, 5.0, 7.0, 3.5, 5.0, 4.5, 5.0, 3.5, 4.5 ), byrow =
TRUE, ncol = 2)
colnames(xy) <- c("x", "y")
cl <- kmeans(xy, 2)
cl
plot( xy)
points(cl$centers, col = 2:3, pch = 9, cex = 3)

#........................................
#Create the predictor and response variable.
x <- c(0,1,2,3,4)
y <- c(12,19,29,37,45)
relation <- lm(y~x)
# Predict for x=7
a <- data.frame(x = 7)
result <- predict(relation,a)
print(result)

#........................................

#Create the predictor and response variable.


#setwd("E:/")
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.


png(file = "E:/linearregression.png")
# Plot the chart.
plot(x,y,col = "blue",main = "Height & Weight Regression",
abline(lm(y~x)),cex = 1.3,pch = 16,ylab = "Weight in Kg",xlab = "Height in cm")
# Save the file.
dev.off()
#........................................


library(arules)
trans<-list(c("A","B","D"), c ("B","C","E"), c("A","B","C","E"), c("B","E"))
names(trans)<-paste("Tr", c(1:4), sep="")
trans
rules<-apriori(trans,parameter=list(supp=0.5, conf=.7, target="rules", minlen=1))
inspect(head(rules,n=20))
Network visualization with R
Sunbelt 2019 Workshop, Montreal, Canada
Katherine Ognyanova, Rutgers University
Web: www.kateto.net, Twitter: ognyanova

Contents
1 Introduction: network visualization
2 Colors in R plots
3 Data format, size, and preparation
  3.1 DATASET 1: edgelist
  3.2 Creating an igraph object
  3.3 DATASET 2: matrix
  3.4 Two-mode (bipartite) networks in igraph
4 Plotting networks with igraph
  4.1 Plotting parameters
  4.2 Network layouts
  4.3 Highlighting aspects of the network
  4.4 Highlighting specific nodes or links
  4.5 Interactive plotting with tkplot
  4.6 Plotting two-mode networks
  4.7 Plotting multiplex networks
5 Beyond igraph: Statnet, ggraph, and simple charts
  5.1 A network package example (for Statnet users)
  5.2 A ggraph package example (for ggplot2 users)
  5.3 Other ways to represent a network
6 Interactive network visualizations
  6.1 Simple plot animations in R
  6.2 Interactive JS visualization with visNetwork
  6.3 Interactive JS visualization with threejs
  6.4 Interactive JS visualization with networkD3
7 Dynamic network visualizations with ndtv-d3
  7.1 Interactive plots of static networks in ndtv
  7.2 Network evolution animations in ndtv
8 Overlaying networks on geographic maps

1 Introduction: network visualization

The main concern in designing a network visualization is the purpose it has to serve. What are the
structural properties that we want to highlight? What are the key concerns we want to address?

[Figure: network visualization goals — key actors and links, relationship strength,
structural properties, communities, diffusion patterns, network evolution; networks
as maps, as persuasion, and as art]

Network maps are far from the only visualization available for graphs - other network representation
formats, and even simple charts of key characteristics, may be more appropriate in some cases.

[Figure: some network visualization types — network maps, statistical charts, arc
diagrams, heat maps, hive plots, biofabric]
In network maps, as in other visualization formats, we have several key elements that control the
outcome. The major ones are color, size, shape, and position.

[Figure: network visualization controls — color, position, size, and shape; honorable
mention: arrows (direction) and labels (identification)]

Modern graph layouts are optimized for speed and aesthetics. In particular, they seek to minimize
overlaps and edge crossing, and ensure similar edge length across the graph.

[Figure: layout aesthetics — minimize edge crossing, uniform edge length, prevent
overlap, symmetry]

Note: You can download all workshop materials here, or visit kateto.net/sunbelt2019.
This tutorial uses several key packages that you will need to install in order to follow along. Other
packages will be mentioned along the way, but those are not critical and can be skipped.
The main packages we are going to use are igraph (maintained by Gabor Csardi and Tamas
Nepusz), sna & network (maintained by Carter Butts and the Statnet team), ggraph (maintained
by Thomas Lin Pedersen), visNetwork (maintained by Benoit Thieurmel), threejs (maintained by
Bryan W. Lewis), networkD3 (maintained by Christopher Gandrud), and ndtv (maintained by Skye
Bender-deMoll).

install.packages("igraph")
install.packages("network")
install.packages("sna")
install.packages("ggraph")
install.packages("visNetwork")
install.packages("threejs")
install.packages("networkD3")
install.packages("ndtv")

2 Colors in R plots

Colors are pretty, but more importantly, they help people differentiate between types of objects or
levels of an attribute. In most R functions, you can use named colors, hex, or RGB values.
In the simple base R plot chart below, x and y are the point coordinates, pch is the point symbol
shape, cex is the point size, and col is the color. To see the parameters for plotting in base R,
check out ?par.
plot(x=1:10, y=rep(5,10), pch=19, cex=3, col="dark red")
points(x=1:10, y=rep(6, 10), pch=19, cex=3, col="#557799")
points(x=1:10, y=rep(4, 10), pch=19, cex=3, col=rgb(.25, .5, .3))

You may notice that RGB here ranges from 0 to 1. While this is the R default, you can also set it
to the 0-255 range using something like rgb(10, 100, 100, maxColorValue=255).
We can set the opacity/transparency of an element using the parameter alpha (range 0-1):
plot(x=1:5, y=rep(5,5), pch=19, cex=12, col=rgb(.25, .5, .3, alpha=.5), xlim=c(0,6))

If we have a hex color representation, we can set the transparency alpha using adjustcolor from
package grDevices. For fun, let’s also set the plot background to gray using the par() function
for graphical parameters. We won’t do that below, but we could set the margins of the plot with
par(mar=c(bottom, left, top, right)), or tell R not to clear the previous plot before adding a
new one with par(new=TRUE).
par(bg="gray40")
col.tr <- grDevices::adjustcolor("#557799", alpha=0.7)
plot(x=1:5, y=rep(5,5), pch=19, cex=12, col=col.tr, xlim=c(0,6))

If you plan on using the built-in color names, here’s how to list all of them:
colors() # List all named colors
grep("blue", colors(), value=T) # Colors that have "blue" in the name

In many cases, we need a number of contrasting colors, or multiple shades of a color. R comes with
some predefined palette function that can generate those for us. For example:
pal1 <- heat.colors(5, alpha=1) # 5 colors from the heat palette, opaque
pal2 <- rainbow(5, alpha=.5) # 5 colors from the rainbow palette, transparent
plot(x=1:10, y=1:10, pch=19, cex=5, col=pal1)

plot(x=1:10, y=1:10, pch=19, cex=5, col=pal2)

We can also generate our own gradients using colorRampPalette. Note that colorRampPalette
returns a function that we can use to generate as many colors from that palette as we need.

palf <- colorRampPalette(c("gray80", "dark red"))
plot(x=10:1, y=1:10, pch=19, cex=5, col=palf(10))

To add transparency to colorRampPalette, you need to use a parameter alpha=TRUE:


palf <- colorRampPalette(c(rgb(1,1,1, .2),rgb(.8,0,0, .7)), alpha=TRUE)
plot(x=10:1, y=1:10, pch=19, cex=5, col=palf(10))

Finding good color combinations is a tough task - and the built-in R palettes are rather limited.
Thankfully there are other available packages for this:
# If you don't have R ColorBrewer already, you will need to install it:
install.packages('RColorBrewer')
library('RColorBrewer')
display.brewer.all()

This package has one main function, called brewer.pal. To use it, you just need to select the
desired palette and a number of colors. Let’s take a look at some of the RColorBrewer palettes:
display.brewer.pal(8, "Set3")

display.brewer.pal(8, "Spectral")

display.brewer.pal(8, "Blues")

Using RColorBrewer palettes in plots:


pal3 <- brewer.pal(10, "Set3")
plot(x=10:1, y=10:1, pch=19, cex=6, col=pal3)

plot(x=10:1, y=10:1, pch=19, cex=6, col=rev(pal3)) # backwards

3 Data format, size, and preparation

In this tutorial, we will work primarily with two small example data sets. Both contain data about
media organizations. One involves a network of hyperlinks and mentions among news sources. The
second is a network of links between media venues and consumers.
While the example data used here is small, many of the ideas behind the visualizations we will
generate apply to medium and large-scale networks. This is also the reason why we will rarely use
certain visual properties such as the shape of the node symbols: those are impossible to distinguish
in larger graph maps. In fact, when drawing very big networks we may even want to hide the
network edges, and focus on identifying and visualizing communities of nodes.
At this point, the size of the networks you can visualize in R is limited mainly by the RAM of your
machine. One thing to emphasize though is that in many cases, visualizing larger networks as giant
hairballs is less helpful than providing charts that show key characteristics of the graph.

3.1 DATASET 1: edgelist

The first data set we are going to work with consists of two files, “Dataset1-Media-Example-NODES.csv”
and “Dataset1-Media-Example-EDGES.csv” (download here).
nodes <- read.csv("Dataset1-Media-Example-NODES.csv", header=T, as.is=T)
links <- read.csv("Dataset1-Media-Example-EDGES.csv", header=T, as.is=T)

Examine the data:


head(nodes)
head(links)

3.2 Creating an igraph object

Next we will convert the raw data to an igraph network object. To do that, we will use the
graph_from_data_frame() function, which takes two data frames: d and vertices.
• d describes the edges of the network. Its first two columns are the IDs of the source and the
target node for each edge. The following columns are edge attributes (weight, type, label, or
anything else).
• vertices starts with a column of node IDs. Any following columns are interpreted as node
attributes.
library('igraph')
net <- graph_from_data_frame(d=links, vertices=nodes, directed=T)
net

## IGRAPH 3cbdde0 DNW- 17 49 --


## + attr: name (v/c), media (v/c), media.type (v/n), type.label
## | (v/c), audience.size (v/n), type (e/c), weight (e/n)
## + edges from 3cbdde0 (vertex names):

## [1] s01->s02 s01->s03 s01->s04 s01->s15 s02->s01 s02->s03 s02->s09
## [8] s02->s10 s03->s01 s03->s04 s03->s05 s03->s08 s03->s10 s03->s11
## [15] s03->s12 s04->s03 s04->s06 s04->s11 s04->s12 s04->s17 s05->s01
## [22] s05->s02 s05->s09 s05->s15 s06->s06 s06->s16 s06->s17 s07->s03
## [29] s07->s08 s07->s10 s07->s14 s08->s03 s08->s07 s08->s09 s09->s10
## [36] s10->s03 s12->s06 s12->s13 s12->s14 s13->s12 s13->s17 s14->s11
## [43] s14->s13 s15->s01 s15->s04 s15->s06 s16->s06 s16->s17 s17->s04

The description of an igraph object starts with four letters:


1. D or U, for a directed or undirected graph
2. N for a named graph (where nodes have a name attribute)
3. W for a weighted graph (where edges have a weight attribute)
4. B for a bipartite (two-mode) graph (where nodes have a type attribute)
The two numbers that follow (17 49) refer to the number of nodes and edges in the graph. The
description also lists node & edge attributes, for example:
• (g/c) - graph-level character attribute
• (v/c) - vertex-level character attribute
• (e/n) - edge-level numeric attribute
We also have easy access to nodes, edges, and their attributes with:
E(net) # The edges of the "net" object
V(net) # The vertices of the "net" object
E(net)$type # Edge attribute "type"
V(net)$media # Vertex attribute "media"

# Find nodes and edges by attribute:


# (that returns objects of type vertex sequence/edge sequence)
V(net)[media=="BBC"]
E(net)[type=="mention"]

# You can also examine the network matrix directly:


net[1,]
net[5,7]

It is also easy to extract an edge list or matrix back from the igraph network:
# Get an edge list or a matrix:
as_edgelist(net, names=T)
as_adjacency_matrix(net, attr="weight")

# Or data frames describing nodes and edges:


as_data_frame(net, what="edges")
as_data_frame(net, what="vertices")

Now that we have our igraph network object, let’s make a first attempt to plot it.
plot(net) # not a pretty picture!

[Figure: default plot of net, with node labels s01-s17]

That doesn’t look very good. Let’s start fixing things by removing the loops in the graph.
net <- simplify(net, remove.multiple = F, remove.loops = T)

We could also use simplify to combine multiple edges by summing their weights with a command
like simplify(net, edge.attr.comb=list(weight="sum","ignore")). Note, however, that this
would also combine multiple edge types (in our data: “hyperlinks” and “mentions”).
Let’s also reduce the arrow size and remove the labels (we do that by setting them to NA):
plot(net, edge.arrow.size=.4,vertex.label=NA)

3.3 DATASET 2: matrix

Our second dataset is a network of links between news outlets and consumers. It includes two files,
“Dataset2-Media-Example-NODES.csv” and “Dataset2-Media-Example-EDGES.csv” (download here).
nodes2 <- read.csv("Dataset2-Media-User-Example-NODES.csv", header=T, as.is=T)
links2 <- read.csv("Dataset2-Media-User-Example-EDGES.csv", header=T, row.names=1)

Examine the data:


head(nodes2)
head(links2)

3.4 Two-mode (bipartite) networks in igraph

We can see that links2 is an adjacency matrix for a two-mode network. Two-mode or bipartite
graphs have two different types of actors and links that go across, but not within each type. Our
second media example is a network of that kind, examining links between news sources and their
consumers.
links2 <- as.matrix(links2)
dim(links2)
dim(nodes2)

Next we will convert our second network into an igraph object.


As we have seen above, the edges of our second network are in a matrix format. We can read those
into a graph object using graph_from_incidence_matrix(). In igraph, bipartite networks have a
node attribute called type that is FALSE (or 0) for vertices in one mode and TRUE (or 1) for those
in the other mode.

head(nodes2)
head(links2)

net2 <- graph_from_incidence_matrix(links2)


table(V(net2)$type)

To transform a one-mode network matrix into an igraph object, use graph_from_adjacency_matrix().
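As a quick sketch of that one-mode case, we can round-trip the weighted matrix we extracted from net above (net.copy is just an illustrative name):

```r
# Rebuild an igraph object from a one-mode weighted adjacency matrix:
m <- as_adjacency_matrix(net, attr="weight", sparse=FALSE)
net.copy <- graph_from_adjacency_matrix(m, mode="directed", weighted=TRUE)
```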

4 Plotting networks with igraph

4.1 Plotting parameters

Plotting with igraph: the network plots have a wide set of parameters you can set. Those include
node options (starting with vertex.) and edge options (starting with edge.). A list of selected
options is included below, but you can also check out ?igraph.plotting for more information.

The igraph plotting parameters include (among others):

NODES
vertex.color Node color
vertex.frame.color Node border color
vertex.shape One of “none”, “circle”, “square”, “csquare”, “rectangle”
“crectangle”, “vrectangle”, “pie”, “raster”, or “sphere”
vertex.size Size of the node (default is 15)
vertex.size2 The second size of the node (e.g. for a rectangle)
vertex.label Character vector used to label the nodes
vertex.label.family Font family of the label (e.g.“Times”, “Helvetica”)
vertex.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
vertex.label.cex Font size (multiplication factor, device-dependent)
vertex.label.dist Distance between the label and the vertex
vertex.label.degree The position of the label in relation to the vertex, where
0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above
EDGES
edge.color Edge color
edge.width Edge width, defaults to 1
edge.arrow.size Arrow size, defaults to 1
edge.arrow.width Arrow width, defaults to 1
edge.lty Line type, could be 0 or “blank”, 1 or “solid”, 2 or “dashed”,
3 or “dotted”, 4 or “dotdash”, 5 or “longdash”, 6 or “twodash”
edge.label Character vector used to label edges
edge.label.family Font family of the label (e.g.“Times”, “Helvetica”)
edge.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
edge.label.cex Font size for edge labels
edge.curved Edge curvature, range 0-1 (FALSE sets it to 0, TRUE to 0.5)
arrow.mode Vector specifying whether edges should have arrows,
possible values: 0 no arrow, 1 back, 2 forward, 3 both
OTHER
margin Empty space margins around the plot, vector with length 4
frame if TRUE, the plot will be framed
main If set, adds a title to the plot
sub If set, adds a subtitle to the plot
asp Numeric, the aspect ratio of a plot (y/x).
palette A color palette to use for vertex color
rescale Whether to rescale coordinates to [-1,1]. Default is TRUE.

We can set the node & edge options in two ways - the first one is to specify them in the plot()
function, as we are doing below.
# Plot with curved edges (edge.curved=.1) and reduce arrow size:
# Note that using curved edges will allow you to see multiple links
# between two nodes (e.g. links going in either direction, or multiplex links)
plot(net, edge.arrow.size=.4, edge.curved=.1)

[Figure: the media network with curved edges and smaller arrows]

# Set edge color to light gray, the node & border color to orange
# Replace the vertex label with the node names stored in "media"
plot(net, edge.arrow.size=.2, edge.color="orange",
vertex.color="orange", vertex.frame.color="#ffffff",
vertex.label=V(net)$media, vertex.label.color="black")

[Figure: orange nodes with black media-name labels]

The second way to set attributes is to add them to the igraph object. Let’s say we want to color
our network nodes based on type of media, and size them based on degree centrality (more links ->
larger node). We will also change the width of the edges based on their weight.

# Generate colors based on media type:
colrs <- c("gray50", "tomato", "gold")
V(net)$color <- colrs[V(net)$media.type]

# Compute node degrees (#links) and use that to set node size:
deg <- degree(net, mode="all")
V(net)$size <- deg*3
# We could also use the audience size value:
V(net)$size <- V(net)$audience.size*0.6

# The labels are currently node IDs.


# Setting them to NA will render no labels:
V(net)$label <- NA

# Set edge width based on weight:


E(net)$width <- E(net)$weight/6

#change arrow size and edge color:


E(net)$arrow.size <- .2
E(net)$edge.color <- "gray80"

# We can even set the network layout:


graph_attr(net, "layout") <- layout_with_lgl
plot(net)

We can also override the attributes explicitly in the plot:


plot(net, edge.color="orange", vertex.color="gray50")

It helps to add a legend explaining the meaning of the colors we used:
plot(net)
legend(x=-1.5, y=-1.1, c("Newspaper","Television", "Online News"), pch=21,
col="#777777", pt.bg=colrs, pt.cex=2, cex=.8, bty="n", ncol=1)

[Figure: the network plot with a legend: Newspaper, Television, Online News]

Sometimes, especially with semantic networks, we may be interested in plotting only the labels of
the nodes:
plot(net, vertex.shape="none", vertex.label=V(net)$media,
vertex.label.font=2, vertex.label.color="gray40",
vertex.label.cex=.7, edge.color="gray85")

[Figure: label-only plot showing the media names]

Let’s color the edges of the graph based on their source node color. We can get the starting node for
each edge with the ends() igraph function. It returns the start and end vertex for edges listed in
the es parameter. The names parameter controls whether the function returns edge names or IDs.
edge.start <- ends(net, es=E(net), names=F)[,1]
edge.col <- V(net)$color[edge.start]

plot(net, edge.color=edge.col, edge.curved=.1)

4.2 Network layouts

Network layouts are simply algorithms that return coordinates for each node in a network.
For the purposes of exploring layouts, we will generate a slightly larger 100-node graph. We use
the sample_pa() function which generates a simple graph starting from one node and adding more
nodes and links based on a preset level of preferential attachment (Barabasi-Albert model).
net.bg <- sample_pa(100)
V(net.bg)$size <- 8
V(net.bg)$frame.color <- "white"
V(net.bg)$color <- "orange"
V(net.bg)$label <- ""
E(net.bg)$arrow.mode <- 0
plot(net.bg)

You can set the layout in the plot function:


plot(net.bg, layout=layout_randomly)

Or you can calculate the vertex coordinates in advance:

l <- layout_in_circle(net.bg)
plot(net.bg, layout=l)

l is simply a matrix of x, y coordinates (N x 2) for the N nodes in the graph. For 3D layouts, it has
x, y, and z coordinates (N x 3). You can easily generate your own:
l <- cbind(1:vcount(net.bg), c(1, vcount(net.bg):2))
plot(net.bg, layout=l)

This layout is just an example and not very helpful - thankfully igraph has a number of built-in
layouts, including:
# Randomly placed vertices
l <- layout_randomly(net.bg)
plot(net.bg, layout=l)

# Circle layout
l <- layout_in_circle(net.bg)
plot(net.bg, layout=l)

# 3D sphere layout
l <- layout_on_sphere(net.bg)
plot(net.bg, layout=l)

Fruchterman-Reingold is one of the most used force-directed layout algorithms out there.

Force-directed layouts try to get a nice-looking graph where edges are similar in length and cross
each other as little as possible. They simulate the graph as a physical system. Nodes are electrically
charged particles that repulse each other when they get too close. The edges act as springs that
attract connected nodes closer together. As a result, nodes are evenly distributed through the chart
area, and the layout is intuitive in that nodes which share more connections are closer to each other.
The disadvantage of these algorithms is that they are rather slow and therefore less often used in
graphs larger than ~1000 vertices.

l <- layout_with_fr(net.bg)
plot(net.bg, layout=l)

With force-directed layouts, you can use the niter parameter to control the number of iterations to
perform. The default is set at 500 iterations. You can lower that number for large graphs to get
results faster and check if they look reasonable.
l <- layout_with_fr(net.bg, niter=50)
plot(net.bg, layout=l)

The layout can also interpret edge weights. You can set the “weights” parameter which increases
the attraction forces among nodes connected by heavier edges.
ws <- c(1, rep(100, ecount(net.bg)-1))
lw <- layout_with_fr(net.bg, weights=ws)
plot(net.bg, layout=lw)

You will also notice that the Fruchterman-Reingold layout is not deterministic - different runs will
result in slightly different configurations. Saving the layout in l allows us to get the exact same
result multiple times, which can be helpful if you want to plot the time evolution of a graph, or
different relationships – and want nodes to stay in the same place in multiple plots.
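Another option, if you would rather not store the coordinates, is to fix the random seed before computing the layout; rerunning with the same seed then reproduces the same configuration:

```r
set.seed(42)                 # any fixed seed will do; 42 is arbitrary
l <- layout_with_fr(net.bg)
plot(net.bg, layout=l)
```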

par(mfrow=c(2,2), mar=c(0,0,0,0)) # plot four figures - 2 rows, 2 columns


plot(net.bg, layout=layout_with_fr)
plot(net.bg, layout=layout_with_fr)
plot(net.bg, layout=l)
plot(net.bg, layout=l)

dev.off()

By default, the coordinates of the plots are rescaled to the [-1,1] interval for both x and y. You can
change that with the parameter rescale=FALSE and rescale your plot manually by multiplying the
coordinates by a scalar. You can use norm_coords to normalize the plot with the boundaries you
want. This way you can create more compact or spread out layout versions.
l <- layout_with_fr(net.bg)
l <- norm_coords(l, ymin=-1, ymax=1, xmin=-1, xmax=1)

par(mfrow=c(2,2), mar=c(0,0,0,0))
plot(net.bg, rescale=F, layout=l*0.4)
plot(net.bg, rescale=F, layout=l*0.6)
plot(net.bg, rescale=F, layout=l*0.8)
plot(net.bg, rescale=F, layout=l*1.0)

dev.off()

Some layouts have 3D versions that you can use with parameter dim=3. As you might expect, a 3D
layout returns a matrix with 3 columns containing the X, Y, and Z coordinates of each node.
l <- layout_with_fr(net.bg, dim=3)
plot(net.bg, layout=l)

Another popular force-directed algorithm that produces nice results for connected graphs is Kamada
Kawai. Like Fruchterman Reingold, it attempts to minimize the energy in a spring system.
l <- layout_with_kk(net.bg)
plot(net.bg, layout=l)

Graphopt is a nice force-directed layout implemented in igraph that uses layering to help with
visualizations of large networks.
l <- layout_with_graphopt(net.bg)
plot(net.bg, layout=l)

The available graphopt parameters can be used to change the mass and electric charge of nodes, as
well as the optimal spring length and the spring constant for edges. The parameter names are charge
(defaults to 0.001), mass (defaults to 30), spring.length (defaults to 0), and spring.constant
(defaults to 1). Tweaking those can lead to considerably different graph layouts.
l1 <- layout_with_graphopt(net.bg, charge=0.02)
l2 <- layout_with_graphopt(net.bg, charge=0.00000001)

par(mfrow=c(1,2), mar=c(1,1,1,1))

plot(net.bg, layout=l1)
plot(net.bg, layout=l2)

dev.off()

The LGL algorithm is meant for large, connected graphs. Here you can also specify a root: a node
that will be placed in the middle of the layout.
plot(net.bg, layout=layout_with_lgl)

The MDS (multidimensional scaling) algorithm tries to place nodes based on some measure of
similarity or distance between them. More similar nodes are plotted closer to each other. By default,
the measure used is based on the shortest paths between nodes in the network. We can change that
by using our own distance matrix (however defined) with the parameter dist. MDS layouts are
nice because positions and distances have a clear interpretation. The problem with them is visual
clarity: nodes often overlap, or are placed on top of each other.

plot(net.bg, layout=layout_with_mds)
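As a sketch of supplying your own distance matrix, here we pass the unweighted shortest-path distances explicitly (which should give a result similar to the default):

```r
d <- distances(net.bg, weights=NA)    # any symmetric distance matrix would work
l <- layout_with_mds(net.bg, dist=d)
plot(net.bg, layout=l)
```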

Let’s take a look at all available layouts in igraph:
layouts <- grep("^layout_", ls("package:igraph"), value=TRUE)[-1]
# Remove layouts that do not apply to our graph.
layouts <- layouts[!grepl("bipartite|merge|norm|sugiyama|tree", layouts)]

par(mfrow=c(3,3), mar=c(1,1,1,1))
for (layout in layouts) {
print(layout)
l <- do.call(layout, list(net))
plot(net, edge.arrow.mode=0, layout=l, main=layout) }
[Figure grid: layout_as_star, layout_components, layout_in_circle, layout_nicely, layout_on_grid, layout_on_sphere, layout_randomly, layout_with_dh, layout_with_drl, layout_with_fr, layout_with_gem, layout_with_graphopt, layout_with_kk, layout_with_lgl, layout_with_mds]

4.3 Highlighting aspects of the network

Notice that our network plot is still not too helpful. We can identify the type and size of nodes,
but cannot see much about the structure since the links we’re examining are so dense. One way to
approach this is to see if we can sparsify the network, keeping only the most important ties and
discarding the rest.
hist(links$weight)
mean(links$weight)
sd(links$weight)

There are more sophisticated ways to extract the key edges, but for the purposes of this exercise
we’ll only keep ones that have weight higher than the mean for the network. In igraph, we can
delete edges using delete_edges(net, edges):

cut.off <- mean(links$weight)
net.sp <- delete_edges(net, E(net)[weight<cut.off])
plot(net.sp, layout=layout_with_kk)

Another way to think about this is to plot the two tie types (hyperlink & mention) separately. We
will do that in section 5 of this tutorial: Plotting multiplex networks.
We can also try to make the network map more useful by showing the communities within it:
par(mfrow=c(1,2))

# Community detection (by optimizing modularity over partitions):


clp <- cluster_optimal(net)
class(clp)

# Community detection returns an object of class "communities"


# which igraph knows how to plot:
plot(clp, net)

# We can also plot the communities without relying on their built-in plot:
V(net)$community <- clp$membership
colrs <- adjustcolor( c("gray50", "tomato", "gold", "yellowgreen"), alpha=.6)
plot(net, vertex.color=colrs[V(net)$community])

dev.off()

4.4 Highlighting specific nodes or links

Sometimes we want to focus the visualization on a particular node or a group of nodes. In our
example media network, we can examine the spread of information from focal actors. For instance,
let’s represent distance from the NYT.
The distances function returns a matrix of shortest paths from nodes listed in the v parameter to
ones included in the to parameter.
dist.from.NYT <- distances(net, v=V(net)[media=="NY Times"],
to=V(net), weights=NA)

# Set colors to plot the distances:


oranges <- colorRampPalette(c("dark red", "gold"))
col <- oranges(max(dist.from.NYT)+1)
col <- col[dist.from.NYT+1]

plot(net, vertex.color=col, vertex.label=dist.from.NYT, edge.arrow.size=.6,
     vertex.label.color="white")

[Figure: nodes colored and labeled by their distance (0-3) from the NY Times]

We can also highlight a path in the network:


news.path <- shortest_paths(net,
from = V(net)[media=="MSNBC"],
to = V(net)[media=="New York Post"],
output = "both") # both path nodes and edges

# Generate edge color variable to plot the path:


ecol <- rep("gray80", ecount(net))
ecol[unlist(news.path$epath)] <- "orange"
# Generate edge width variable to plot the path:
ew <- rep(2, ecount(net))

ew[unlist(news.path$epath)] <- 4
# Generate node color variable to plot the path:
vcol <- rep("gray40", vcount(net))
vcol[unlist(news.path$vpath)] <- "gold"

plot(net, vertex.color=vcol, edge.color=ecol,
     edge.width=ew, edge.arrow.mode=0)

We can highlight the edges going into or out of a vertex, for instance the WSJ. For a single node,
use incident(); for multiple nodes, use incident_edges():
inc.edges <- incident(net, V(net)[media=="Wall Street Journal"], mode="all")

# Set colors to plot the selected edges.


ecol <- rep("gray80", ecount(net))
ecol[inc.edges] <- "orange"
vcol <- rep("grey40", vcount(net))
vcol[V(net)$media=="Wall Street Journal"] <- "gold"
plot(net, vertex.color=vcol, edge.color=ecol)

We can also point to the immediate neighbors of a vertex, say the WSJ. The neighbors() function
finds all nodes one step out from the focal actor. To find the neighbors of multiple nodes, use
adjacent_vertices() instead of neighbors(). To find node neighborhoods going more than one
step out, use the ego() function with the order parameter set to the number of steps out to go
from the focal node(s).
neigh.nodes <- neighbors(net, V(net)[media=="Wall Street Journal"], mode="out")

# Set colors to plot the neighbors:


vcol[neigh.nodes] <- "#ff9d00"
plot(net, vertex.color=vcol)
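As a sketch of going one step further out with ego(), we can highlight everything within two steps of the WSJ (reusing the gray/gold color scheme from above; neigh2 and vcol2 are illustrative names):

```r
# All nodes within two steps of the Wall Street Journal:
neigh2 <- ego(net, order=2, nodes=V(net)[media=="Wall Street Journal"], mode="all")
vcol2 <- rep("gray40", vcount(net))
vcol2[unlist(neigh2)] <- "gold"
plot(net, vertex.color=vcol2)
```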

A way to draw attention to a group of nodes (we saw this before with communities) is to “mark”
them:
par(mfrow=c(1,2))
plot(net, mark.groups=c(1,4,5,8), mark.col="#C5E5E7", mark.border=NA)

# Mark multiple groups:


plot(net, mark.groups=list(c(1,4,5,8), c(15:17)),
mark.col=c("#C5E5E7","#ECD89A"), mark.border=NA)

dev.off()

4.5 Interactive plotting with tkplot

R and igraph allow for interactive plotting of networks. This might be a useful option for you if you
want to slightly tweak the layout of a small graph. After adjusting the layout manually, you can get
the coordinates of the nodes and use them for other plots.
tkid <- tkplot(net) #tkid is the id of the tkplot that will open
l <- tkplot.getcoords(tkid) # grab the coordinates from tkplot
plot(net, layout=l)

4.6 Plotting two-mode networks

As you might remember, our second media example is a two-mode network examining links between
news sources and their consumers.
head(nodes2)
head(links2)
plot(net2, vertex.label=NA)

As with one-mode networks, we can modify the network object to include the visual properties that
will be used by default when plotting the network. Notice that this time we will also change the
shape of the nodes - media outlets will be squares, and their users will be circles.
# Media outlets are blue squares, audience nodes are orange circles:
V(net2)$color <- c("steel blue", "orange")[V(net2)$type+1]
V(net2)$shape <- c("square", "circle")[V(net2)$type+1]

# Media outlets will have name labels, audience members will not:
V(net2)$label <- ""
V(net2)$label[V(net2)$type==F] <- nodes2$media[V(net2)$type==F]
V(net2)$label.cex=.6
V(net2)$label.font=2

plot(net2, vertex.label.color="white", vertex.size=(2-V(net2)$type)*8)

[Figure: two-mode plot; media outlets (NYT, BBC, CNN, WSJ, ...) as labeled squares, users as circles]

In igraph, there is also a special layout for bipartite networks (though it doesn’t always work great,
and you might be better off generating your own two-mode layout).

plot(net2, vertex.label=NA, vertex.size=7, layout=layout_as_bipartite)
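One simple way to roll your own two-mode layout is to start from the coordinates that layout_as_bipartite returns and rearrange them; for example, swapping its two columns turns the two horizontal rows into two vertical columns (just one possible tweak):

```r
l <- layout_as_bipartite(net2)
# Swap x and y so the two modes form two side-by-side columns:
plot(net2, vertex.label=NA, vertex.size=7, layout=l[,c(2,1)])
```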

Using text as nodes may be helpful at times:
plot(net2, vertex.shape="none", vertex.label=nodes2$media,
vertex.label.color=V(net2)$color, vertex.label.font=2,
vertex.label.cex=.6, edge.color="gray70", edge.width=2)

[Figure: two-mode network drawn with text labels only: media outlets and user names]

In this example, we will also experiment with the use of images as nodes. In order to do this, you
will need the png package (if missing, install it with install.packages('png')):
# install.packages('png')
library('png')

img.1 <- readPNG("./images/news.png")


img.2 <- readPNG("./images/user.png")

V(net2)$raster <- list(img.1, img.2)[V(net2)$type+1]

plot(net2, vertex.shape="raster", vertex.label=NA,
vertex.size=16, vertex.size2=16, edge.width=2)

By the way, we can also add any image we want to a plot. For example, many network graphs can
be largely improved by a photo of a puppy in a teacup.
plot(net2, vertex.shape="raster", vertex.label=NA,
vertex.size=16, vertex.size2=16, edge.width=2)

img.3 <- readPNG("./images/puppy.png")


rasterImage(img.3, xleft=-1.6, xright=-0.6, ybottom=-1.1, ytop=0.1)

# The numbers after the image are its coordinates
# The limits of your plotting area are given in par()$usr

We can also generate and plot bipartite projections for the two-mode network: co-memberships are
easy to calculate by multiplying the network matrix by its transposed matrix, or using igraph’s
bipartite.projection() function.
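The matrix-multiplication version mentioned above can be sketched directly from the incidence matrix links2 that we read in earlier (proj.media and proj.users are illustrative names):

```r
# Co-membership counts via matrix multiplication:
proj.media <- links2 %*% t(links2)   # media x media: shared audience members
proj.users <- t(links2) %*% links2   # user x user: shared news sources
```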
par(mfrow=c(1,2))

net2.bp <- bipartite.projection(net2)

plot(net2.bp$proj1, vertex.label.color="black", vertex.label.dist=1,
     vertex.label=nodes2$media[!is.na(nodes2$media.type)])

plot(net2.bp$proj2, vertex.label.color="black", vertex.label.dist=1,
     vertex.label=nodes2$media[ is.na(nodes2$media.type)])

[Figure: the two bipartite projections, media-to-media (left) and user-to-user (right)]

dev.off()

4.7 Plotting multiplex networks

In some cases, the networks we want to plot are multigraphs: they can have multiple edges connecting
the same two nodes. A related concept, multiplex networks, contain multiple types of ties. For
instance, we can represent friendship, romantic, and work relationships between individuals in a

single multiplex network.
In our example network, we also have two tie types: hyperlinks and mentions. One thing we can do
with them is plot each type of tie separately:
E(net)$width <- 1.5
plot(net, edge.color=c("dark red", "slategrey")[(E(net)$type=="hyperlink")+1],
vertex.color="gray40", layout=layout_in_circle, edge.curved=.3)

net.m <- net - E(net)[E(net)$type=="hyperlink"] # another way to delete edges:
net.h <- net - E(net)[E(net)$type=="mention"]   # using the minus operator

# Plot the two links separately:


par(mfrow=c(1,2))
plot(net.h, vertex.color="orange", layout=layout_with_fr, main="Tie: Hyperlink")
plot(net.m, vertex.color="lightsteelblue2", layout=layout_with_fr, main="Tie: Mention")

[Figure: panels "Tie: Hyperlink" and "Tie: Mention", with independent layouts]

# Make sure the nodes stay in place in both plots:


l <- layout_with_fr(net)
plot(net.h, vertex.color="orange", layout=l, main="Tie: Hyperlink")
plot(net.m, vertex.color="lightsteelblue2", layout=l, main="Tie: Mention")

[Figure: panels "Tie: Hyperlink" and "Tie: Mention", nodes in the same positions]

dev.off()

In our example network, it so happens that we do not have node dyads connected by multiple types
of connections. That is to say, we never have both a ‘hyperlink’ and a ‘mention’ tie between the
same two news outlets. However, this could easily happen in a multiplex network.
One challenge in visualizing multigraphs is that multiple edges between the same two nodes may
get plotted on top of each other in a way that makes it impossible to see them clearly. For example,
let us generate a very simple multiplex network with two nodes and three ties between them:
multigtr <- graph( edges=c(1,2, 1,2, 1,2), n=2 )
l <- layout_with_kk(multigtr)

# Let's just plot the graph:


plot(multigtr, vertex.color="lightsteelblue", vertex.frame.color="white",
vertex.size=40, vertex.shape="circle", vertex.label=NA,
edge.color=c("gold", "tomato", "yellowgreen"), edge.width=10,
edge.arrow.size=3, edge.curved=0.1, layout=l)

Because all edges in the graph have the same curvature, they are drawn over each other so that
we only see one of them. What we can do is assign each edge a different curvature. One useful
igraph function called curve_multiple() can help us here. For a graph G, curve_multiple(G)
will generate a curvature for each edge that maximizes visibility.

plot(multigtr, vertex.color="lightsteelblue", vertex.frame.color="white",
vertex.size=40, vertex.shape="circle", vertex.label=NA,
edge.color=c("gold", "tomato", "yellowgreen"), edge.width=10,
edge.arrow.size=3, edge.curved=curve_multiple(multigtr), layout=l)

It is a good practice to detach packages when we stop needing them. Try to remember this especially
with igraph and the statnet family of packages, as bad things tend to happen if you have them
loaded together.
detach('package:igraph')

5 Beyond igraph: Statnet, ggraph, and simple charts

The igraph package is only one of many available network visualization options in R. This section
provides a few quick examples illustrating other available approaches to static network visualization.

5.1 A network package example (for Statnet users)

Plotting with the network package is very similar to plotting with igraph - although the notation
is slightly different (a whole new set of parameter names!). This package also relies less on defaults
stored in the network object, and more on explicit parameters in the plotting function.
Here is a quick example using the (by now familiar) media network. We will begin by converting
the data into the network format used by the Statnet family of packages (including network, sna,
ergm, stergm, and others).
As in igraph, we can generate a ‘network’ object from an edge list, an adjacency matrix, or an
incidence matrix. You can get the specifics with ?edgeset.constructors. Here we will use the
edge list and the node attribute data frames to create the network object. One specific thing to
pay attention to here is the ignore.eval parameter. It is set to TRUE by default, and that setting
causes the network object to disregard edge weights.

library('network')

net3 <- network(links, vertex.attr=nodes, matrix.type="edgelist",
                loops=F, multiple=F, ignore.eval=F)

Here again we can easily access the edges, vertices, and the network matrix:
net3[,]
net3 %n% "net.name" <- "Media Network" # network attribute
net3 %v% "media" # Node attribute
net3 %e% "type" # Edge attribute

Let’s plot our media network once again:


net3 %v% "col" <- c("gray70", "tomato", "gold")[net3 %v% "media.type"]
plot(net3, vertex.cex=(net3 %v% "audience.size")/7, vertex.col="col")

Note that - as in igraph - the plot returns the node position coordinates. You can use them in other
plots using the coord parameter.
l <- plot(net3, vertex.cex=(net3 %v% "audience.size")/7, vertex.col="col")
plot(net3, vertex.cex=(net3 %v% "audience.size")/7, vertex.col="col", coord=l)

detach('package:network')

The network package also offers the option to edit a plot interactively, by setting the parameter
interactive=T:
plot(net3, vertex.cex=(net3 %v% "audience.size")/7, vertex.col="col", interactive=T)

For a full list of parameters that you can use in the network package, check out ?plot.network.

5.2 A ggraph package example (for ggplot2 users)

The ggplot2 package and its extensions are known for offering the most meaningfully structured
and advanced way to visualize data in R. In ggplot2, you can select from a variety of visual building
blocks and add them to your graphics one by one, a layer at a time.
The ggraph package takes this principle and extends it to network data. In this section, we’ll only
cover the basics without providing a detailed overview of the grammar of graphics approach. For a
deeper look, it would be best to get familiar with ggplot2 first, then learn the specifics of ggraph.
The good news is that we can use our igraph objects directly with the ggraph package. The
following code gets the data and adds separate layers for nodes and links.
library(ggraph)
library(igraph)

ggraph(net) +
geom_edge_link() + # add edges to the plot
geom_node_point() # add nodes to the plot

[Figure: basic ggraph plot with default layout, axes, and theme]

You will also recognize here some network layouts familiar from igraph plotting: ‘star’, ‘circle’,
‘grid’, ‘sphere’, ‘kk’, ‘fr’, ‘mds’, ‘lgl’, etc.
ggraph(net, layout="lgl") +
geom_edge_link() +
ggtitle("Look ma, no nodes!") # add title to the plot

[Figure: edge-only ggraph plot titled “Look ma, no nodes!”]

Here we can use geom_edge_link() for straight edges, geom_edge_arc() for curved ones, and
geom_edge_fan() when we want to make sure any overlapping multiplex edges will be fanned out.
As in other packages, we can set visual properties for the network plot by using key function
parameters. For instance, nodes have color, fill, shape, size, and stroke. Edges have color,
width, and linetype. Here too the alpha parameter controls transparency.

ggraph(net, layout="lgl") +
geom_edge_fan(color="gray50", width=0.8, alpha=0.5) +
geom_node_point(color=V(net)$color, size=8) +
theme_void()

As in ggplot2, we can add different themes to the plot. For a cleaner look, you can use a minimal
or empty theme with theme_minimal() or theme_void().
ggraph(net, layout = 'linear') +
geom_edge_arc(color = "orange", width=0.7) +
geom_node_point(size=5, color="gray50") +
theme_void()

The ggraph package also uses the traditional ggplot2 way of mapping aesthetics: that is to say, of
specifying which elements of the data should correspond to different visual properties of the graphic.
This is done using the aes() function that matches visual parameters with attribute names from
the data. In the code below, the edge attribute type and node attribute audience.size are taken
from our data as they are included in the igraph object.

ggraph(net, layout="lgl") +
geom_edge_link(aes(color = type)) + # colors by edge type
geom_node_point(aes(size = audience.size)) + # size by audience size
theme_void()

[Figure: ggraph plot with automatic legends for audience.size and edge type]

One great thing about ggplot2 and ggraph, as you can see above, is that they automatically
generate a legend, which makes plots easier to interpret.

We can add a layer with node labels using geom_node_text() or geom_node_label() which
correspond to similar functions in ggplot2.
ggraph(net, layout = 'lgl') +
geom_edge_arc(color="gray", curvature=0.3) +
geom_node_point(color="orange", aes(size = audience.size)) +
geom_node_text(aes(label = media), size=2, color="gray50", repel=T) +
theme_void()
[Figure: ggraph plot with repelled media-name labels and an audience.size legend]

detach("package:ggraph")

While those are not discussed here, note that ggraph offers a number of other interesting ways to
represent networks, including dendrograms, treemaps, hive plots, and circle plots.

5.3 Other ways to represent a network

At this point it might be useful to provide a quick reminder that there are many ways to represent
a network beyond the hairball plot.
For example, here is a quick heatmap of the network matrix:
netm <- get.adjacency(net, attr="weight", sparse=F)
colnames(netm) <- V(net)$media
rownames(netm) <- V(net)$media

palf <- colorRampPalette(c("gold", "dark orange"))


heatmap(netm[,17:1], Rowv = NA, Colv = NA, col = palf(100),
scale="none", margins=c(10,10) )

[Heatmap: network adjacency matrix with rows and columns labeled by media outlet, gold-to-orange color scale]

Depending on what properties of the network or its nodes and edges are most important to you,
simple graphs can often be more informative than network maps.

# Plot the degree distribution for our network:
deg.dist <- degree_distribution(net, cumulative=T, mode="all")
plot( x=0:max(degree(net)), y=1-deg.dist, pch=19, cex=1.2, col="orange",
xlab="Degree", ylab="Cumulative Frequency")

[Plot: cumulative frequency of node degrees, with degrees 0 to 12 on the x axis]

6 Interactive network visualizations

6.1 Simple plot animations in R

If you have already installed “ndtv”, you should also have a package used by it called “animation”.
If not, now is a good time to install it with install.packages('animation'). Note that this
package provides a simple technique to create various (not necessarily network-related) animations
in R. It works by generating multiple plots and combining them in an animated GIF.
The catch here is that in order for this to work, you need not only the R package, but also
additional software called ImageMagick (http://imagemagick.org). You probably don’t want to
install that during the workshop, but you can try it at home.
The good news is that once you figure this out, you can turn any series of R plots (network or not!)
into an animated GIF.
library('animation')
library('igraph')

ani.options("convert") # Check that the package knows where to find ImageMagick


# If it doesn't know where to find it, give it the correct path for your system.
ani.options(convert="C:/Program Files/ImageMagick-6.8.8-Q16/convert.exe")

We will now generate 4 network plots (the same way we did before), only this time we’ll do it
within the saveGIF command. The animation interval is set with interval, and the movie.name
parameter controls the name of the GIF file.
l <- layout_with_lgl(net)

saveGIF( {  col <- rep("grey40", vcount(net))
  plot(net, vertex.color=col, layout=l)

  step.1 <- V(net)[media=="Wall Street Journal"]
  col[step.1] <- "#ff5100"
  plot(net, vertex.color=col, layout=l)

  step.2 <- unlist(neighborhood(net, 1, step.1, mode="out"))
  col[setdiff(step.2, step.1)] <- "#ff9d00"
  plot(net, vertex.color=col, layout=l)

  step.3 <- unlist(neighborhood(net, 2, step.1, mode="out"))
  col[setdiff(step.3, step.2)] <- "#FFDD1F"
  plot(net, vertex.color=col, layout=l) },

  interval = .8, movie.name="network_animation.gif" )

detach('package:igraph')
detach('package:animation')

6.2 Interactive JS visualization with visNetwork

These days it is fairly easy to export R plots to HTML/JavaScript output. There are a number of
packages like rcharts and htmlwidgets that can help you create interactive web charts right from
R. One thing to keep in mind though is that the network visualizations created that way are most
helpful as a starting point for further work. If you know a little bit of javascript, you can use them
as a first step and tweak the results to get closer to what you want.
Here we will take a quick look at visNetwork which generates interactive network visualizations using
the vis.js javascript library. You can install the package with install.packages('visNetwork').
We can visualize our media network right away: visNetwork() will accept our node and link data
frames. As usual, the node data frame needs to have an id column, and the link data needs to have
from and to columns denoting the start and end of each tie.
library('visNetwork')
visNetwork(nodes, links)

If we want to set specific height and width for the interactive plot, we can do that with the height
and width parameters. As is often the case in R, the title of the plot is set with the main parameter.
The subtitle and footer can be set with submain and footer respectively.
visNetwork(nodes, links, height="600px", width="100%", background="#eeefff",
main="Network", submain="And what a great network it is!",
footer= "Hyperlinks and mentions among media sources")

Like the igraph package, visNetwork allows us to set graphic properties as node or edge attributes.
We can simply add them as columns in our data before we call the visNetwork() function. Check
out the available options with:
?visNodes
?visEdges

In the following code, we are changing some of the visual parameters for nodes. We start with
the node shape (the available options for it include ellipse, circle, database, box, text, image,
circularImage, diamond, dot, star, triangle, triangleDown, square, and icon). We are also
going to change the color of several node elements. In this package, background controls the node
color, border changes the frame color, highlight sets the color on mouse click, and hover sets the
color on mouseover.

# We'll start by adding new node and edge attributes to our dataframes.
vis.nodes <- nodes
vis.links <- links

vis.nodes$shape <- "dot"
vis.nodes$shadow <- TRUE # Nodes will drop shadow
vis.nodes$title <- vis.nodes$media # Text on click
vis.nodes$label <- vis.nodes$type.label # Node label
vis.nodes$size <- vis.nodes$audience.size # Node size
vis.nodes$borderWidth <- 2 # Node border width

vis.nodes$color.background <- c("slategrey", "tomato", "gold")[nodes$media.type]
vis.nodes$color.border <- "black"
vis.nodes$color.highlight.background <- "orange"
vis.nodes$color.highlight.border <- "darkred"

visNetwork(vis.nodes, vis.links)

Next we will change some of the visual properties of the edges.

vis.links$width <- 1+links$weight/8 # line width
vis.links$color <- "gray" # line color
vis.links$arrows <- "middle" # arrows: 'from', 'to', or 'middle'
vis.links$smooth <- FALSE # should the edges be curved?
vis.links$shadow <- FALSE # edge shadow

visnet <- visNetwork(vis.nodes, vis.links)
visnet

We can also set the visualization options directly with visNodes() and visEdges().
visnet2 <- visNetwork(nodes, links)
visnet2 <- visNodes(visnet2, shape = "square", shadow = TRUE,
color=list(background="gray", highlight="orange", border="black"))
visnet2 <- visEdges(visnet2, color=list(color="black", highlight = "orange"),
smooth = FALSE, width=2, dashes= TRUE, arrows = 'middle' )
visnet2

visNetwork offers a number of other options in the visOptions() function. For instance, we can
highlight all neighbors of the selected node (highlightNearest), or add a drop-down menu to
select a subset of nodes (selectedBy). The subsets are based on a column from our data - here we
use the type label.

visOptions(visnet, highlightNearest = TRUE, selectedBy = "type.label")

visNetwork can also work with predefined groups of nodes. The visual characteristics for nodes
belonging to each group can be set with visGroups(). We can add an automatically generated
group legend with visLegend().
nodes$group <- nodes$type.label
visnet3 <- visNetwork(nodes, links)
visnet3 <- visGroups(visnet3, groupname = "Newspaper", shape = "square",
color = list(background = "gray", border="black"))
visnet3 <- visGroups(visnet3, groupname = "TV", shape = "dot",
color = list(background = "tomato", border="black"))
visnet3 <- visGroups(visnet3, groupname = "Online", shape = "diamond",
color = list(background = "orange", border="black"))
visLegend(visnet3, main="Legend", position="right", ncol=1)

For more information, you can also check out:


?visOptions # available options
?visLayout # available layouts
?visGroups # using node groups
?visLegend # adding a legend

# Detach the package since we're done with it.
detach('package:visNetwork')

6.3 Interactive JS visualization with threejs

Another good package for exporting networks from R to javascript is threejs, which generates
interactive network visualizations using the three.js javascript library and the htmlwidgets R
package. One nice thing about threejs is that it can directly read igraph objects.
You can install the package with install.packages('threejs'). If you get errors or warnings
using this library with the latest version of R, try also installing the development version of the
htmlwidgets package which may have bug fixes that will help:
devtools::install_github('ramnathv/htmlwidgets')

The main network plotting function here, graphjs, will take an igraph object. We could use our
initial net object with a slight modification: we will delete its graph layout and let threejs generate
one on its own. We cheated a bit earlier by assigning a function to the layout attribute in the igraph
object rather than giving it a table of node coordinates. This is fine with igraph, but threejs will
not let us do it.
library(threejs)
library(htmlwidgets)
library(igraph)

net.js <- net
graph_attr(net.js, "layout") <- NULL

Note that RStudio for Windows may not render the threejs graphics properly. We will save the
output in an HTML file and open it in a browser. Some of the parameters that we can add include
main for the plot title; curvature for the edge curvature; bg for background color; showLabels
to set labels to visible (TRUE) or not (FALSE); attraction and repulsion to set how much nodes
attract and repulse each other in the layout; opacity for node transparency (range 0 to 1); stroke
to indicate whether nodes should be framed in a black circle (TRUE) or not (FALSE), etc.
For the full list of parameters, check out ?graphjs.
gjs <- graphjs(net.js, main="Network!", bg="gray10", showLabels=F, stroke=F,
curvature=0.1, attraction=0.9, repulsion=0.8, opacity=0.9)
print(gjs)
saveWidget(gjs, file="Media-Network-gjs.html")
browseURL("Media-Network-gjs.html")

Once we open the resulting visualization in a browser, we can use the mouse scrollwheel to zoom in
and out, the left mouse button to rotate the network, and the right mouse button to pan.
We can also create simple animations with threejs by using lists of layouts, vertex colors, and edge
colors that will switch at each step.
gjs.an <- graphjs(net.js, bg="gray10", showLabels=F, stroke=F,
layout=list(layout_randomly(net.js, dim=3),
layout_with_fr(net.js, dim=3),
layout_with_drl(net.js, dim=3),
layout_on_sphere(net.js)),
vertex.color=list(V(net.js)$color, "gray", "orange",
V(net.js)$color),
main=list("Random Layout", "Fruchterman-Reingold",
"DrL layout", "Sphere" ) )
print(gjs.an)
saveWidget(gjs.an, file="Media-Network-gjs-an.html")
browseURL("Media-Network-gjs-an.html")

As an additional example, we can take a look at the Les Miserables network included with the
package:
data(LeMis)
lemis.net <- graphjs(LeMis, main="Les Miserables", showLabels=T)
print(lemis.net)
saveWidget(lemis.net, file="LeMis-Network-gjs.html")
browseURL("LeMis-Network-gjs.html")

6.4 Interactive JS visualization with networkD3

We will also take a quick look at networkD3 which - as its name suggests - generates interactive
network visualizations using the D3 javascript library. If you do not have the networkD3 library,
install it with install.packages("networkD3").
The data that this library needs is in the standard edge list form, with a few little twists.
In order for things to work, the node IDs have to be numeric, and they also have to start from 0.
An easy way to get there is to transform our character IDs to a factor variable, transform that to
numeric, and make sure it starts from zero by subtracting 1.
library(networkD3)

links.d3 <- data.frame(from=as.numeric(factor(links$from))-1,
                       to=as.numeric(factor(links$to))-1 )

The nodes need to be in the same order as the “source” column in links:
nodes.d3 <- cbind(idn=factor(nodes$media, levels=nodes$media), nodes)

Now we can generate the interactive chart. The Group parameter in it is used to color the nodes.
Nodesize is not (as one might think) the size of the node, but the number of the column in the node
data that should be used for sizing. The charge parameter controls node repulsion (if negative) or
attraction (if positive).
forceNetwork(Links = links.d3, Nodes = nodes.d3, Source="from", Target="to",
NodeID = "idn", Group = "type.label",linkWidth = 1,
linkColour = "#afafaf", fontSize=12, zoom=T, legend=T,
Nodesize=6, opacity = 1, charge=-600,
width = 600, height = 600)

7 Dynamic network visualizations with ndtv-d3

7.1 Interactive plots of static networks in ndtv

Here we will create D3 visualizations using the ndtv package. You should not need additional
software to produce web animations with ndtv. If you want to save the animations as video files
(see ?saveVideo), you have to install a video converter called FFmpeg (http://ffmpeg.org). To find
out how to get the right installation for your OS, check out ?install.ffmpeg. To use all available
layouts, you would also need to have Java installed on your machine.
install.packages('ndtv', dependencies=T)

As ndtv is part of the Statnet family, it will accept objects from the network package such as the
one we created earlier (net3).
library('ndtv')
net3

Most of the parameters below are self-explanatory at this point (bg is the background color of the
plot). Two new parameters we haven’t used before are vertex.tooltip and edge.tooltip. Those
contain the information that we can see when moving the mouse cursor over network elements. Note
that the tooltip parameters accept HTML tags – for example we will use the line break tag <br>.

The parameter launchBrowser instructs R to open the resulting visualization file (filename) in
the browser.
render.d3movie(net3, usearrows = F, displaylabels = F, bg="#111111",
vertex.border="#ffffff", vertex.col = net3 %v% "col",
vertex.cex = (net3 %v% "audience.size")/8,
edge.lwd = (net3 %e% "weight")/3, edge.col = '#55555599',
vertex.tooltip = paste("<b>Name:</b>", (net3 %v% 'media') , "<br>",
"<b>Type:</b>", (net3 %v% 'type.label')),
edge.tooltip = paste("<b>Edge type:</b>", (net3 %e% 'type'), "<br>",
"<b>Edge weight:</b>", (net3 %e% "weight" ) ),
launchBrowser=F, filename="Media-Network.html" )

If you are going to embed the plot in a markdown document, use output.mode='inline' above.

7.2 Network evolution animations in ndtv

Animations are a good way to show the evolution of small to medium size networks over time. At
present, ndtv is the best R package for that – especially since it now has D3 capabilities and allows
easy export for the Web.
In order to work with the network animations in ndtv, we need to understand Statnet’s dynamic
network format, implemented in the networkDynamic package. The format can be used to represent
longitudinal structures, both discrete (if you have multiple snapshots of your network at different
time points) and continuous (if you have timestamps indicating when edges and/or nodes appear
and disappear from the network). The examples below will only scratch the surface of temporal
networks in Statnet - for a deeper dive, check out Skye Bender-deMoll’s Temporal network tools
tutorial and the networkDynamic package vignette.
Let’s look at one example dataset included in the package, containing simulation data based on a
network of business connections among Renaissance Florentine families:

data(short.stergm.sim)
short.stergm.sim
head(as.data.frame(short.stergm.sim))

##   onset terminus tail head onset.censored
## 1     0        1    3    5          FALSE
## 2    10       20    3    5          FALSE
## 3     0       25    3    6          FALSE
## 4     0        1    3    9          FALSE
## 5     2       25    3    9          FALSE
## 6     0        4    3   11          FALSE
##   terminus.censored duration edge.id
## 1             FALSE        1       1
## 2             FALSE       10       1
## 3             FALSE       25       2
## 4             FALSE        1       3
## 5             FALSE       23       3
## 6             FALSE        4       4

What we see here is a temporal edge list. An edge goes from a node with ID in the tail column to
a node with ID in the head column. Edges exist from time point onset to time point terminus.
As you can see in our example, there may be multiple periods (activity spells) where an edge is
present. Each of those periods is recorded on a separate row in the data frame above.
The idea of onset and terminus censoring refers to start and end points enforced by the beginning
and end of network observation rather than by actual tie formation/dissolution.
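To make the spell format concrete, here is a small base-R sketch with made-up data (these values are our own illustration, not taken from short.stergm.sim): one edge with two activity spells, and one whose spell is cut short by the end of observation.

```r
# Toy spell data (hypothetical): edge 1->2 is active twice; edge 1->3
# is still active when observation ends at time 25, so its terminus
# is censored.
spells <- data.frame(onset    = c(0, 10, 5),
                     terminus = c(4, 20, 25),
                     tail     = c(1, 1, 1),
                     head     = c(2, 2, 3),
                     terminus.censored = c(FALSE, FALSE, TRUE))

# Each activity spell's duration is simply terminus - onset:
spells$duration <- spells$terminus - spells$onset
spells$duration # 4 10 20
```

Note that the censored spell's computed duration (20) is only a lower bound: the tie may well have persisted past the end of observation.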
We can simply plot the network disregarding its time component (combining all nodes and edges
that were ever present):
plot(short.stergm.sim)

We can also use network.extract() to get a network that only contains elements active at a given
point, or during a given time interval. For instance, we can plot the network at time 1 (at=1):
plot( network.extract(short.stergm.sim, at=1) )

Plot nodes and edges that were active for the entire period (rule=all) from time 1 to time 5:
plot( network.extract(short.stergm.sim, onset=1, terminus=5, rule="all") )

Plot nodes and edges that were active at any point (rule=any) between time 1 and time 10:
plot( network.extract(short.stergm.sim, onset=1, terminus=10, rule="any") )

Let’s make a quick d3 animation from the example network:


render.d3movie(short.stergm.sim,displaylabels=TRUE)

Next, we will create and animate our own dynamic network. Dynamic network objects can be
generated in a number of ways: from a set of networks/matrices representing different time points;
from data frames/matrices with node lists and edge lists indicating when each is active, or when
they switch state. You can check out ?networkDynamic for more information.
We are going to add a time component to our media network example. The code below takes a
0-to-50 time interval and sets the nodes in the network as active throughout (time 0 to 50). The
edges of the network appear one by one, and each one is active from its first activation until time
point 50. We generate this longitudinal network using networkDynamic with our node times as
vertex.spells and edge times as edge.spells.
vs <- data.frame(onset=0, terminus=50, vertex.id=1:17)
es <- data.frame(onset=1:49, terminus=50,
head=as.matrix(net3, matrix.type="edgelist")[,1],
tail=as.matrix(net3, matrix.type="edgelist")[,2])

net3.dyn <- networkDynamic(base.net=net3, edge.spells=es, vertex.spells=vs)

If we try to just plot the networkDynamic network, what we get is a combined network for the
entire time period under observation – or as it happens, our original media example.
plot(net3.dyn, vertex.cex=(net3 %v% "audience.size")/7, vertex.col="col")

One way to show the network evolution is through static images from different time points. While
we can generate those one by one as we did above, ndtv offers an easier way. The command to do
that is filmstrip(). As in the par() function controlling base R plot parameters, here mfrow sets
the number of rows and columns in the multi-plot grid.
filmstrip(net3.dyn, displaylabels=F, mfrow=c(1, 5),
slice.par=list(start=0, end=49, interval=10,
aggregate.dur=10, rule='any'))

[Filmstrip output: five network snapshots for t=0-10, 10-20, 20-30, 30-40, and 40-50]

Next, let’s generate a network animation. We can pre-compute the coordinates for it (otherwise
they get calculated when we generate the animation). Here animation.mode is the layout algorithm
- one of “kamadakawai”, “MDSJ”, “Graphviz” and “useAttribute” (user-generated coordinates).
In filmstrip() above and in the animation computation below, slice.par is a list of parameters
controlling how the network visualization moves through time. The parameter interval is the
time step between layouts, aggregate.dur is the period shown in each layout, rule is the rule for
displaying elements (e.g. any: active at any point during that period, all: active during the entire
period, etc).
compute.animation(net3.dyn, animation.mode = "kamadakawai",
slice.par=list(start=0, end=50, interval=1,
aggregate.dur=1, rule='any'))

render.d3movie(net3.dyn, usearrows = F,
displaylabels = F, label=net3 %v% "media",
bg="#ffffff", vertex.border="#333333",
vertex.cex = degree(net3)/2,
vertex.col = net3.dyn %v% "col",
edge.lwd = (net3.dyn %e% "weight")/3,
edge.col = '#55555599',
vertex.tooltip = paste("<b>Name:</b>", (net3.dyn %v% "media") , "<br>",
"<b>Type:</b>", (net3.dyn %v% "type.label")),
edge.tooltip = paste("<b>Edge type:</b>", (net3.dyn %e% "type"), "<br>",
"<b>Edge weight:</b>", (net3.dyn %e% "weight" ) ),
launchBrowser=T, filename="Media-Network-Dynamic.html",
render.par=list(tween.frames = 30, show.time = F),
plot.par=list(mar=c(0,0,0,0)), output.mode='inline' )

To embed this animation, we add the parameter output.mode='inline'.
In addition to dynamic nodes and edges, ndtv also takes dynamic attributes. We could have added
those to the es and vs data frames above. However, the plotting function can also evaluate special
parameters and generate dynamic arguments on the fly. For example, function(slice) { do some
calculations with slice } will perform operations on the current time slice of the network, allowing
us to change parameters dynamically.
See the node size below:
render.d3movie(net3.dyn, usearrows = F,
displaylabels = F, label=net3 %v% "media",
bg="#000000", vertex.border="#dddddd",
vertex.cex = function(slice){ degree(slice)/2.5 },
vertex.col = net3.dyn %v% "col",
edge.lwd = (net3.dyn %e% "weight")/3,
edge.col = '#55555599',
vertex.tooltip = paste("<b>Name:</b>", (net3.dyn %v% "media") , "<br>",
"<b>Type:</b>", (net3.dyn %v% "type.label")),
edge.tooltip = paste("<b>Edge type:</b>", (net3.dyn %e% "type"), "<br>",
"<b>Edge weight:</b>", (net3.dyn %e% "weight" ) ),
launchBrowser=T, filename="Media-Network-even-more-Dynamic.html",
render.par=list(tween.frames = 15, show.time = F), output.mode='inline',
slice.par=list(start=0, end=50, interval=4, aggregate.dur=4, rule='any'))

8 Overlaying networks on geographic maps

The example presented in this section uses only base R and mapping packages. If you have experience
with ggplot2, that package does provide a more versatile way of approaching this task. The code
using ggplot() would be similar to what you will see below, but you would use borders() to plot
the map and geom_path() for the edges.
In order to plot on a map, we will need a few more packages. As you will see below, maps will
let us generate a geographic map to use as background, and geosphere will help us generate arcs
representing our network edges. If you do not already have them, install the two packages, then
load them.
install.packages('maps')
install.packages('geosphere')

library('maps')
library('geosphere')

Let us plot some example maps with the maps library. The parameters of map() include col for
the map fill, border for the border color, and bg for the background color.

par(mfrow = c(2,2), mar=c(0,0,0,0))

map("usa", col="tomato", border="gray10", fill=TRUE, bg="gray30")
map("state", col="orange", border="gray10", fill=TRUE, bg="gray30")
map("county", col="palegreen", border="gray10", fill=TRUE, bg="gray30")
map("world", col="skyblue", border="gray10", fill=TRUE, bg="gray30")

dev.off()

The data we will use here contains US airports and flights among them. The airport file includes
geographic coordinates - latitude and longitude. If you do not have those in your data, you can use
the geocode() function from package ggmap to grab the latitude and longitude for an address.
airports <- read.csv("Dataset3-Airlines-NODES.csv", header=TRUE)
flights <- read.csv("Dataset3-Airlines-EDGES.csv", header=TRUE, as.is=TRUE)

head(flights)

##   Source Target Freq
## 1      0    109   10
## 2      1     36   10
## 3      1     61   10
## 4      2    152   10
## 5      3    104   10
## 6      4    132   10

head(airports)

##   ID                     Label Code             City latitude  longitude
## 1  0       Adams Field Airport  LIT  Little Rock, AR 34.72944  -92.22444
## 2  1     Akron/canton Regional  CAK Akron/Canton, OH 40.91611  -81.44222
## 3  2      Albany International  ALB           Albany 42.73333  -73.80000
## 4  3                 Albemarle  CHO  Charlottesville 38.13333  -78.45000
## 5  4 Albuquerque International  ABQ      Albuquerque 35.04028 -106.60917
## 6  5  Alexandria International  AEX   Alexandria, LA 31.32750  -92.54861
##   ToFly Visits
## 1     0    105
## 2     0    123
## 3     0    129
## 4     1    114
## 5     0    105
## 6     0     93
# Select only large airports: ones with more than 10 connections in the data.
tab <- table(flights$Source)
big.id <- names(tab)[tab>10]
airports <- airports[airports$ID %in% big.id,]
flights <- flights[flights$Source %in% big.id &
flights$Target %in% big.id, ]

In order to generate our plot, we will first add a map of the United States. Then we will add a point
on the map for each airport:
# Plot a map of the United States:
map("state", col="grey20", fill=TRUE, bg="black", lwd=0.1)

# Add a point on the map for each airport:
points(x=airports$longitude, y=airports$latitude, pch=19,
       cex=airports$Visits/80, col="orange")

Next we will generate a color gradient to use for the edges in the network. Heavier edges will be
lighter in color.
col.1 <- adjustcolor("orange red", alpha=0.4)
col.2 <- adjustcolor("orange", alpha=0.4)
edge.pal <- colorRampPalette(c(col.1, col.2), alpha = TRUE)
edge.col <- edge.pal(100)

For each flight in our data, we will use gcIntermediate() to generate the coordinates of the shortest
arc that connects its start and end point (think distance on the surface of a sphere). After that, we
will plot each arc over the map using lines().
for(i in 1:nrow(flights)) {
  node1 <- airports[airports$ID == flights[i,]$Source,]
  node2 <- airports[airports$ID == flights[i,]$Target,]

  arc <- gcIntermediate( c(node1[1,]$longitude, node1[1,]$latitude),
                         c(node2[1,]$longitude, node2[1,]$latitude),
                         n=1000, addStartEnd=TRUE )
  edge.ind <- round(100*flights[i,]$Freq / max(flights$Freq))

  lines(arc, col=edge.col[edge.ind], lwd=edge.ind/30)
}

Note that if you are plotting the network on a full world map, there might be cases when the
shortest arc goes “behind” the map – e.g. exits it on the left side and enters back on the right (since
the left-most and right-most points on the map are actually next to each other). In order to avoid
that, we can use greatCircle() to generate the full great circle (circle going through those two
points and around the globe, with a center at the center of the earth). Then we can extract from it
the arc connecting our start and end points which does not cross “behind” the map, regardless of
whether it is the shorter or the longer of the two.
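The wrap-around case itself is easy to detect with plain longitude arithmetic. The helper below is our own base-R illustration (its name and logic are not part of geosphere): the short great-circle arc between two points exits one side of a world map exactly when the direct difference between their longitudes exceeds 180 degrees.

```r
# Hypothetical helper, base R only: does the SHORT great-circle arc
# between two longitudes cross the +/-180 meridian (the map's edge)?
crosses.map.edge <- function(lon1, lon2) {
  abs(lon1 - lon2) > 180
}

crosses.map.edge(-170, 170) # TRUE: the short arc wraps behind the map
crosses.map.edge(-92, -81)  # FALSE: an arc within the continental US
```

When this check returns TRUE, that is the case where extracting the appropriate piece of the full great circle, rather than the short arc from gcIntermediate(), keeps the line on the visible map.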

This is the end of our tutorial. If you have comments, questions, or want to report typos, please
e-mail netvis@ognyanova.net. Check for updated versions of the tutorial at kateto.net.

Network Analysis and Visualization with R and igraph
Katherine Ognyanova, www.kateto.net
NetSciX 2016 School of Code Workshop, Wroclaw, Poland

Contents

1. A quick reminder of R basics ... 3
   1.1 Assignment ... 3
   1.2 Value comparisons ... 4
   1.3 Special constants ... 4
   1.4 Vectors ... 4
   1.5 Factors ... 6
   1.6 Matrices & Arrays ... 7
   1.7 Lists ... 8
   1.8 Data Frames ... 9
   1.9 Flow Control and loops ... 10
   1.10 R plots and colors ... 11
   1.11 R troubleshooting ... 13

2. Networks in igraph ... 14
   2.1 Create networks ... 14
   2.2 Edge, vertex, and network attributes ... 18
   2.3 Specific graphs and graph models ... 21

3. Reading network data from files ... 27
   3.1 DATASET 1: edgelist ... 27
   3.2 DATASET 2: matrix ... 27

4. Turning networks into igraph objects ... 28
   4.1 Dataset 1 ... 28
   4.2 Dataset 2 ... 30

5. Plotting networks with igraph ... 32
   5.1 Plotting parameters ... 32
   5.2 Network layouts ... 36
   5.3 Improving network plots ... 44
   5.4 Interactive plotting with tkplot ... 46
   5.5 Other ways to represent a network ... 47
   5.6 Plotting two-mode networks with igraph ... 48

6. Network and node descriptives ... 50
   6.1 Density ... 50
   6.2 Reciprocity ... 50
   6.3 Transitivity ... 51
   6.4 Diameter ... 51
   6.5 Node degrees ... 53
   6.6 Degree distribution ... 54
   6.7 Centrality & centralization ... 54
   6.8 Hubs and authorities ... 55

7. Distances and paths ... 56

8. Subgroups and communities ... 59
   8.1 Cliques ... 60
   8.2 Community detection ... 60
   8.3 K-core decomposition ... 63

9. Assortativity and Homophily ... 64

Note: You can download all workshop materials here, or visit kateto.net/netscix2016.

This tutorial covers basics of network analysis and visualization with the R package igraph
(maintained by Gabor Csardi and Tamas Nepusz). The igraph library provides versatile options for
descriptive network analysis and visualization in R, Python, and C/C++. This workshop will focus
on the R implementation. You will need an R installation and RStudio. You should also install the
latest version of igraph for R:

install.packages("igraph")

1. A quick reminder of R basics

Before we start working with networks, we will go through a quick introduction/reminder of some
simple tasks and principles in R.

1.1 Assignment

You can assign a value to an object using assign(), <-, or =.

x <- 3 # Assignment
x # Evaluate the expression and print result

y <- 4 # Assignment
y + 5 # Evaluation, y remains 4

z <- x + 17*y # Assignment
z # Evaluation

rm(z) # Remove z: deletes the object.
z # Error!

1.2 Value comparisons

We can use the standard operators <, >, <=, >=, == (equality) and != (inequality). Comparisons
return Boolean values: TRUE or FALSE (often abbreviated to just T and F).

2==2 # Equality
2!=2 # Inequality
x <= y # less than or equal: "<", ">", and ">=" also work

1.3 Special constants

Special constants include:

• NA for missing or undefined data
• NULL for empty object (e.g. null/empty lists)
• Inf and -Inf for positive and negative infinity
• NaN for results that cannot be reasonably defined

# NA - missing or undefined data
5 + NA # When used in an expression, the result is generally NA
is.na(5+NA) # Check if missing

# NULL - an empty object, e.g. a null/empty list
10 + NULL # returns an empty object (length zero)
is.null(NULL) # check if NULL

Inf and -Inf represent positive and negative infinity. They can be returned by mathematical
operations like division of a number by zero:

5/0
is.finite(5/0) # Check if a number is finite (it is not).

NaN (Not a Number) - the result of an operation that cannot be reasonably defined, such as dividing
zero by zero.

0/0
is.nan(0/0)

1.4 Vectors

Vectors can be constructed by combining their elements with the important R function c().

v1 <- c(1, 5, 11, 33) # Numeric vector, length 4
v2 <- c("hello","world") # Character vector, length 2 (a vector of strings)
v3 <- c(TRUE, TRUE, FALSE) # Logical vector, same as c(T, T, F)

Combining different types of elements in one vector will coerce the elements to the least restrictive
type:

v4 <- c(v1,v2,v3,"boo") # All elements turn into strings
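The coercion order can be checked with class(); a small base-R sketch (the values here are made up for illustration):

```r
# Coercion hierarchy: logical -> integer -> numeric -> character.
# The "least restrictive" type that can hold every element wins.
class(c(TRUE, FALSE))  # "logical"
class(c(TRUE, 1L))     # "integer"   - logicals become 0/1
class(c(1L, 2.5))      # "numeric"   - integers widen to doubles
class(c(1, "a"))       # "character" - everything becomes a string
```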

Other ways to create vectors include:

v <- 1:7 # same as c(1,2,3,4,5,6,7)


v <- rep(0, 77) # repeat zero 77 times: v is a vector of 77 zeroes
v <- rep(1:3, times=2) # Repeat 1,2,3 twice
v <- rep(1:10, each=2) # Repeat each element twice
v <- seq(10,20,2) # sequence: numbers between 10 and 20, in jumps of 2

v1 <- 1:5 # 1,2,3,4,5


v2 <- rep(1,5) # 1,1,1,1,1

Check the length of a vector:

length(v1)
length(v2)

Element-wise operations:

v1 + v2 # Element-wise addition
v1 + 1 # Add 1 to each element
v1 * 2 # Multiply each element by 2
v1 + c(1,7) # Warning: c(1,7) has a length that is not a multiple of length(v1)

Mathematical operations:

sum(v1) # The sum of all elements


mean(v1) # The average of all elements
sd(v1) # The standard deviation
cor(v1,v1*5) # Correlation between v1 and v1*5

Logical operations:

v1 > 2 # Each element is compared to 2, returns logical vector


v1==v2 # Are corresponding elements equivalent? Returns a logical vector.
v1!=v2 # Are corresponding elements *not* equivalent? Same as !(v1==v2)
(v1>2) | (v2>0) # | is the boolean OR, returns a vector.
(v1>2) & (v2>0) # & is the boolean AND, returns a vector.
(v1>2) || (v2>0) # || is the boolean OR, returns a single value
(v1>2) && (v2>0) # && is the boolean AND, ditto

Vector elements:

v1[3] # third element of v1


v1[2:4] # elements 2, 3, 4 of v1
v1[c(1,3)] # elements 1 and 3 - note that your indexes are a vector
v1[c(T,T,F,F,F)] # elements 1 and 2 - only the ones that are TRUE
v1[v1>3] # v1>3 is a logical vector TRUE for elements >3

Note that the indexing in R starts from 1, a fact known to confuse and upset people used to
languages that index from 0.
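Notably, index 0 is not an error in R - it just returns an empty vector, and negative indices drop elements. A quick base-R illustration (the vector is arbitrary):

```r
v <- c(10, 20, 30)
v[1]   # 10 - the first element (not the second, as in 0-indexed languages)
v[0]   # numeric(0): an empty vector, returned silently
v[-1]  # negative indices *remove* elements: 20 30
```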
To add more elements to a vector, simply assign them values.

v1[6:10] <- 6:10

We can also directly assign the vector a length:

length(v1) <- 15 # the last 5 elements are added as missing data: NA

1.5 Factors

Factors are used to store categorical data.

eye.col.v <- c("brown", "green", "brown", "blue", "blue", "blue") #vector


eye.col.f <- factor(c("brown", "green", "brown", "blue", "blue", "blue")) #factor
eye.col.v

## [1] "brown" "green" "brown" "blue" "blue" "blue"

eye.col.f

## [1] brown green brown blue blue blue


## Levels: blue brown green

R will identify the different levels of the factor - i.e. all distinct values. The data is stored internally
as integers, each number corresponding to a factor level.

levels(eye.col.f) # The levels (distinct values) of the factor (categorical var)

## [1] "blue" "brown" "green"

as.numeric(eye.col.f) # As numeric values: 1 is blue, 2 is brown, 3 is green

## [1] 2 3 2 1 1 1

as.numeric(eye.col.v) # The character vector cannot be coerced to numeric

## Warning: NAs introduced by coercion

## [1] NA NA NA NA NA NA

as.character(eye.col.f)

## [1] "brown" "green" "brown" "blue" "blue" "blue"

as.character(eye.col.v)

## [1] "brown" "green" "brown" "blue" "blue" "blue"

1.6 Matrices & Arrays

A matrix is a vector with dimensions:

m <- rep(1, 20) # A vector of 20 elements, all 1


dim(m) <- c(5,4) # Dimensions set to 5 & 4, so m is now a 5x4 matrix

Creating a matrix using matrix():

m <- matrix(data=1, nrow=5, ncol=4) # same matrix as above, 5x4, full of 1s


m <- matrix(1,5,4) # same matrix as above
dim(m) # What are the dimensions of m?

## [1] 5 4

Creating a matrix by combining vectors:

m <- cbind(1:5, 5:1, 5:9) # Bind 3 vectors as columns, 5x3 matrix


m <- rbind(1:5, 5:1, 5:9) # Bind 3 vectors as rows, 3x5 matrix

Selecting matrix elements:

m <- matrix(1:10,10,10)

m[2,3] # Matrix m, row 2, column 3 - a single cell


m[2,] # The whole second row of m as a vector
m[,2] # The whole second column of m as a vector
m[1:2,4:6] # submatrix: rows 1 and 2, columns 4, 5 and 6
m[-1,] # all rows *except* the first one

Other operations with matrices:

# Are elements in row 1 equivalent to corresponding elements from column 1:
m[1,]==m[,1]
# A logical matrix: TRUE for m elements >3, FALSE otherwise:
m>3
# Selects only TRUE elements - that is ones greater than 3:
m[m>3]

t(m) # Transpose m
m <- t(m) # Assign m the transposed m
m %*% t(m) # %*% does matrix multiplication
m * m # * does element-wise multiplication

Arrays are used when we have more than 2 dimensions. We can create them using the array()
function:

a <- array(data=1:18,dim=c(3,3,2)) # 3d with dimensions 3x3x2


a <- array(1:18,c(3,3,2)) # the same array

1.7 Lists

Lists are collections of objects. A single list can contain all kinds of elements - character strings,
numeric vectors, matrices, other lists, and so on. The elements of lists are often named for easier
access.

l1 <- list(boo=v1,foo=v2,moo=v3,zoo="Animals!") # A list with four components


l2 <- list(v1,v2,v3,"Animals!")

Create an empty list:

l3 <- list()
l4 <- NULL

Accessing list elements:

l1["boo"] # Access boo with single brackets: this returns a list.


l1[["boo"]] # Access boo with double brackets: this returns the numeric vector
l1[[1]] # Returns the first component of the list, equivalent to above.
l1$boo # Named elements can be accessed with the $ operator, as with [[]]

Adding more elements to a list:

l3[[1]] <- 11 # add an element to the empty list l3


l4[[3]] <- c(22, 23) # add a vector as element 3 in the empty list l4.

Since we added element 3 to the list l4 above, elements 1 and 2 will be generated as empty (NULL).

l1[[5]] <- "More elements!" # The list l1 had 4 elements, we're adding a 5th here.
l1[[8]] <- 1:11

We added an 8th element, but not a 6th or 7th, to the list l1 above. Elements 6 and 7 will
be created empty (NULL).

l1$Something <- "A thing" # Adds a ninth element - "A thing", named "Something"

1.8 Data Frames

The data frame is a special kind of list used for storing dataset tables. Think of rows as cases,
columns as variables. Each column is a vector or factor.
Creating a dataframe:

dfr1 <- data.frame( ID=1:4,


FirstName=c("John","Jim","Jane","Jill"),
Female=c(F,F,T,T),
Age=c(22,33,44,55) )

dfr1$FirstName # Access the second column of dfr1.

## [1] John Jim Jane Jill


## Levels: Jane Jill Jim John

Notice that R thinks that dfr1$FirstName is a categorical variable and so it’s treating it like a
factor, not a character vector. Let’s get rid of the factor by telling R to treat ‘FirstName’ as a
vector:

dfr1$FirstName <- as.vector(dfr1$FirstName)

Alternatively, you can tell R you don’t like factors from the start using stringsAsFactors=FALSE

dfr2 <- data.frame(FirstName=c("John","Jim","Jane","Jill"), stringsAsFactors=F)


dfr2$FirstName # Success: not a factor.

## [1] "John" "Jim" "Jane" "Jill"

Access elements of the data frame:

dfr1[1,] # First row, all columns


dfr1[,1] # First column, all rows
dfr1$Age # Age column, all rows
dfr1[1:2,3:4] # Rows 1 and 2, columns 3 and 4 - the gender and age of John & Jim
dfr1[c(1,3),] # Rows 1 and 3, all columns

Find the names of everyone over the age of 30 in the data:

dfr1[dfr1$Age>30,2]

## [1] "Jim" "Jane" "Jill"

Find the average age of all females in the data:

mean ( dfr1[dfr1$Female==TRUE,4] )

## [1] 49.5

1.9 Flow Control and loops

The controls and loops in R are fairly straightforward (see below). They determine if a block of
code will be executed, and how many times. Blocks of code in R are enclosed in curly brackets {}.

# if (condition) expr1 else expr2


x <- 5; y <- 10
if (x==0) y <- 0 else y <- y/x #
y

## [1] 2

# for (variable in sequence) expr


ASum <- 0; AProd <- 1
for (i in 1:x)
{
ASum <- ASum + i
AProd <- AProd * i
}
ASum # equivalent to sum(1:x)

## [1] 15

AProd # equivalent to prod(1:x)

## [1] 120

# while (condition) expr


while (x > 0) {print(x); x <- x-1;}

# repeat expr, use break to exit the loop


repeat { print(x); x <- x+1; if (x>10) break}

1.10 R plots and colors

In most R functions, you can use named colors, hex, or RGB values. In the simple base R plot chart
below, x and y are the point coordinates, pch is the point symbol shape, cex is the point size, and
col is the color. To see the parameters for plotting in base R, check out ?par

plot(x=1:10, y=rep(5,10), pch=19, cex=3, col="dark red")


points(x=1:10, y=rep(6, 10), pch=19, cex=3, col="#557799")
points(x=1:10, y=rep(4, 10), pch=19, cex=3, col=rgb(.25, .5, .3))

You may notice that RGB here ranges from 0 to 1. While this is the R default, you can also set it
to the 0-255 range using something like rgb(10, 100, 100, maxColorValue=255).
We can set the opacity/transparency of an element using the parameter alpha (range 0-1):

plot(x=1:5, y=rep(5,5), pch=19, cex=12, col=rgb(.25, .5, .3, alpha=.5), xlim=c(0,6))

If we have a hex color representation, we can set the transparency alpha using adjustcolor from
package grDevices. For fun, let’s also set the plot background to gray using the par() function for
graphical parameters.

par(bg="gray40")
col.tr <- grDevices::adjustcolor("#557799", alpha=0.7)
plot(x=1:5, y=rep(5,5), pch=19, cex=12, col=col.tr, xlim=c(0,6))

If you plan on using the built-in color names, here’s how to list all of them:

colors() # List all named colors


grep("blue", colors(), value=T) # Colors that have "blue" in the name

In many cases, we need a number of contrasting colors, or multiple shades of a color. R comes with
some predefined palette functions that can generate those for us. For example:

pal1 <- heat.colors(5, alpha=1) # 5 colors from the heat palette, opaque
pal2 <- rainbow(5, alpha=.5) # 5 colors from the rainbow palette, transparent
plot(x=1:10, y=1:10, pch=19, cex=5, col=pal1)

plot(x=1:10, y=1:10, pch=19, cex=5, col=pal2)

We can also generate our own gradients using colorRampPalette. Note that colorRampPalette
returns a function that we can use to generate as many colors from that palette as we need.

palf <- colorRampPalette(c("gray80", "dark red"))
plot(x=10:1, y=1:10, pch=19, cex=5, col=palf(10))

To add transparency to colorRampPalette, you need to use a parameter alpha=TRUE:

palf <- colorRampPalette(c(rgb(1,1,1, .2),rgb(.8,0,0, .7)), alpha=TRUE)


plot(x=10:1, y=1:10, pch=19, cex=5, col=palf(10))

1.11 R troubleshooting

While I generate many (and often very creative) errors in R, there are three simple things that will
most often go wrong for me. Those include:

1) Capitalization. R is case sensitive - a graph vertex named “Jack” is not the same as one
named “jack”. The function rowSums won’t work if spelled as rowsums or RowSums.

2) Object class. While many functions are willing to take anything you throw at them, some will
still surprisingly require a character vector or a factor instead of a numeric vector, or a matrix
instead of a data frame. Functions will also occasionally return results in unexpected
formats.

3) Package namespaces. Occasionally problems will arise when different packages contain
functions with the same name. R may warn you about this by saying something like “The
following object(s) are masked from ‘package:igraph’” as you load a package. One way to deal
with this is to call functions from a package explicitly using ::. For instance, if function
blah() is present in packages A and B, you can call A::blah and B::blah. In other cases
the problem is more complicated, and you may have to load packages in certain order, or not
use them together at all. For example (and pertinent to this workshop), the igraph and Statnet
packages cause some problems when loaded at the same time. It is best to detach one before
loading the other.

library(igraph) # load a package


detach(package:igraph) # detach a package

For more advanced troubleshooting, check out try(), tryCatch(), and debug().
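A minimal sketch of the first two (base R; the failing call is just an arbitrary error-producing example):

```r
# try() returns an object of class "try-error" instead of stopping the script
res <- try(log("not a number"), silent=TRUE)
class(res)

# tryCatch() lets you supply a handler or a fallback value
safe <- tryCatch(log("not a number"),
                 error = function(e) NA) # fall back to NA on error
safe
```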

2. Networks in igraph

rm(list = ls()) # Remove all the objects we created so far.


library(igraph) # Load the igraph package

2.1 Create networks

The code below generates an undirected graph with three edges. The numbers are interpreted as
vertex IDs, so the edges are 1--2, 2--3, 3--1.

g1 <- graph( edges=c(1,2, 2,3, 3, 1), n=3, directed=F )


plot(g1) # A simple plot of the network - we'll talk more about plots later

class(g1)

## [1] "igraph"

g1

## IGRAPH U--- 3 3 --
## + edges:
## [1] 1--2 2--3 1--3

# Now with 10 vertices, and directed by default:


g2 <- graph( edges=c(1,2, 2,3, 3, 1), n=10 )
plot(g2)


g2

## IGRAPH D--- 10 3 --
## + edges:
## [1] 1->2 2->3 3->1

g3 <- graph( c("John", "Jim", "Jim", "Jill", "Jill", "John")) # named vertices
# When the edge list has vertex names, the number of nodes is not needed
plot(g3)


g3

## IGRAPH DN-- 3 3 --
## + attr: name (v/c)
## + edges (vertex names):
## [1] John->Jim Jim ->Jill Jill->John

g4 <- graph( c("John", "Jim", "Jim", "Jack", "Jim", "Jack", "John", "John"),
isolates=c("Jesse", "Janis", "Jennifer", "Justin") )
# In named graphs we can specify isolates by providing a list of their names.

plot(g4, edge.arrow.size=.5, vertex.color="gold", vertex.size=15,


vertex.frame.color="gray", vertex.label.color="black",
vertex.label.cex=0.8, vertex.label.dist=2, edge.curved=0.2)


Small graphs can also be generated with a description of this kind: -- for an undirected tie, +- or -+
for directed ties pointing left & right, ++ for a symmetric tie, and “:” for sets of vertices.

plot(graph_from_literal(a---b, b---c)) # the number of dashes doesn't matter

plot(graph_from_literal(a--+b, b+--c))


plot(graph_from_literal(a+-+b, b+-+c))

plot(graph_from_literal(a:b:c---c:d:e))


gl <- graph_from_literal(a-b-c-d-e-f, a-g-h-b, h-e:f:i, j)


plot(gl)


2.2 Edge, vertex, and network attributes

Access vertices and edges:

E(g4) # The edges of the object

## + 4/4 edges (vertex names):


## [1] John->Jim Jim ->Jack Jim ->Jack John->John

V(g4) # The vertices of the object

## + 7/7 vertices, named:


## [1] John Jim Jack Jesse Janis Jennifer Justin

You can also examine the network matrix directly:

g4[]

## 7 x 7 sparse Matrix of class "dgCMatrix"


## John Jim Jack Jesse Janis Jennifer Justin
## John 1 1 . . . . .
## Jim . . 2 . . . .
## Jack . . . . . . .
## Jesse . . . . . . .
## Janis . . . . . . .
## Jennifer . . . . . . .
## Justin . . . . . . .

g4[1,]

## John Jim Jack Jesse Janis Jennifer Justin


## 1 1 0 0 0 0 0

Add attributes to the network, vertices, or edges:

V(g4)$name # automatically generated when we created the network.

## [1] "John" "Jim" "Jack" "Jesse" "Janis" "Jennifer"


## [7] "Justin"

V(g4)$gender <- c("male", "male", "male", "male", "female", "female", "male")


E(g4)$type <- "email" # Edge attribute, assign "email" to all edges
E(g4)$weight <- 10 # Edge weight, setting all existing edges to 10

Examine attributes:

edge_attr(g4)

## $type
## [1] "email" "email" "email" "email"
##
## $weight
## [1] 10 10 10 10

vertex_attr(g4)

## $name
## [1] "John" "Jim" "Jack" "Jesse" "Janis" "Jennifer"
## [7] "Justin"
##
## $gender
## [1] "male" "male" "male" "male" "female" "female" "male"

graph_attr(g4)

## named list()

Another way to set attributes (you can similarly use set_edge_attr(), set_vertex_attr(), etc.):

g4 <- set_graph_attr(g4, "name", "Email Network")


g4 <- set_graph_attr(g4, "something", "A thing")

graph_attr_names(g4)

## [1] "name" "something"

graph_attr(g4, "name")

## [1] "Email Network"

graph_attr(g4)

## $name
## [1] "Email Network"
##
## $something
## [1] "A thing"

g4 <- delete_graph_attr(g4, "something")


graph_attr(g4)

## $name
## [1] "Email Network"

plot(g4, edge.arrow.size=.5, vertex.label.color="black", vertex.label.dist=1.5,


vertex.color=c( "pink", "skyblue")[1+(V(g4)$gender=="male")] )

Jennifer Jesse

Justin

Jack

Jim
Janis
John

The graph g4 has two edges going from Jim to Jack, and a loop from John to himself. We can
simplify our graph to remove loops & multiple edges between the same nodes. Use edge.attr.comb
to indicate how edge attributes are to be combined - possible options include sum, mean, prod
(product), min, max, first/last (selects the first/last edge’s attribute). Option “ignore” says the
attribute should be disregarded and dropped.

g4s <- simplify( g4, remove.multiple = T, remove.loops = F,


edge.attr.comb=c(weight="sum", type="ignore") )
plot(g4s, vertex.label.dist=1.5)


g4s

## IGRAPH DNW- 7 3 -- Email Network


## + attr: name (g/c), name (v/c), gender (v/c), weight (e/n)
## + edges (vertex names):
## [1] John->John John->Jim Jim ->Jack

The description of an igraph object starts with up to four letters:

1. D or U, for a directed or undirected graph


2. N for a named graph (where nodes have a name attribute)
3. W for a weighted graph (where edges have a weight attribute)
4. B for a bipartite (two-mode) graph (where nodes have a type attribute)

The two numbers that follow (7 3) refer to the number of nodes and edges in the graph. The
description also lists node & edge attributes, for example:

• (g/c) - graph-level character attribute


• (v/c) - vertex-level character attribute
• (e/n) - edge-level numeric attribute
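A quick sketch of how these flags react (the vertex names and weights below are made up for illustration):

```r
library(igraph)

g <- graph( c("a","b", "b","c"), directed=FALSE ) # undirected & named: "UN--"
E(g)$weight <- c(1, 2) # adding a weight attribute turns on the "W" flag
g # the summary line now starts with "UNW-"

is_directed(g) # FALSE -> the "U"
is_weighted(g) # TRUE  -> the "W"
```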

2.3 Specific graphs and graph models

Empty graph

eg <- make_empty_graph(40)
plot(eg, vertex.size=10, vertex.label=NA)

Full graph

fg <- make_full_graph(40)
plot(fg, vertex.size=10, vertex.label=NA)

Simple star graph

st <- make_star(40)
plot(st, vertex.size=10, vertex.label=NA)

Tree graph

tr <- make_tree(40, children = 3, mode = "undirected")
plot(tr, vertex.size=10, vertex.label=NA)

Ring graph

rn <- make_ring(40)
plot(rn, vertex.size=10, vertex.label=NA)

Erdos-Renyi random graph model


(‘n’ is number of nodes, ‘m’ is the number of edges).

er <- sample_gnm(n=100, m=40)


plot(er, vertex.size=6, vertex.label=NA)

Watts-Strogatz small-world model
Creates a lattice (with dim dimensions and size nodes across each dimension) and rewires edges randomly
with probability p. The neighborhood in which edges are connected is nei. You can allow loops
and multiple edges.

sw <- sample_smallworld(dim=2, size=10, nei=1, p=0.1)


plot(sw, vertex.size=6, vertex.label=NA, layout=layout_in_circle)

Barabasi-Albert preferential attachment model for scale-free graphs


(n is number of nodes, power is the power of attachment (1 is linear); m is the number of edges
added on each time step)

ba <- sample_pa(n=100, power=1, m=1, directed=F)


plot(ba, vertex.size=6, vertex.label=NA)

igraph can also give you some notable historical graphs. For instance:

zach <- graph("Zachary") # the Zachary karate club


plot(zach, vertex.size=10, vertex.label=NA)

Rewiring a graph
each_edge() is a rewiring method that changes the edge endpoints uniformly randomly with a
probability prob.

rn.rewired <- rewire(rn, each_edge(prob=0.1))


plot(rn.rewired, vertex.size=10, vertex.label=NA)

Rewire to connect vertices to other vertices at a certain distance.

rn.neigh = connect.neighborhood(rn, 5)
plot(rn.neigh, vertex.size=8, vertex.label=NA)

Combine graphs (disjoint union, assuming separate vertex sets): %du%

plot(rn, vertex.size=10, vertex.label=NA)

plot(tr, vertex.size=10, vertex.label=NA)

plot(rn %du% tr, vertex.size=10, vertex.label=NA)

3. Reading network data from files

In the following sections of the tutorial, we will work primarily with two small example data sets.
Both contain data about media organizations. One involves a network of hyperlinks and mentions
among news sources. The second is a network of links between media venues and consumers. While
the example data used here is small, many of the ideas behind the analyses and visualizations we
will generate apply to medium and large-scale networks.

3.1 DATASET 1: edgelist

The first data set we are going to work with consists of two files, “Media-Example-NODES.csv” and
“Media-Example-EDGES.csv” (download here).

nodes <- read.csv("Dataset1-Media-Example-NODES.csv", header=T, as.is=T)


links <- read.csv("Dataset1-Media-Example-EDGES.csv", header=T, as.is=T)

Examine the data:

head(nodes)
head(links)
nrow(nodes); length(unique(nodes$id))
nrow(links); nrow(unique(links[,c("from", "to")]))

Notice that there are more links than unique from-to combinations. That means we have cases
in the data where there are multiple links between the same two nodes. We will collapse all links
of the same type between the same two nodes by summing their weights, using aggregate() by
“from”, “to”, & “type”. We don’t use simplify() here so as not to collapse different link types.

links <- aggregate(links[,3], links[,-3], sum)


links <- links[order(links$from, links$to),]
colnames(links)[4] <- "weight"
rownames(links) <- NULL

3.2 DATASET 2: matrix

Two-mode or bipartite graphs have two different types of actors and links that go across, but not
within each type. Our second media example is a network of that kind, examining links between
news sources and their consumers.

nodes2 <- read.csv("Dataset2-Media-User-Example-NODES.csv", header=T, as.is=T)


links2 <- read.csv("Dataset2-Media-User-Example-EDGES.csv", header=T, row.names=1)

Examine the data:

head(nodes2)
head(links2)

We can see that links2 is an adjacency matrix for a two-mode network:

links2 <- as.matrix(links2)


dim(links2)
dim(nodes2)


4. Turning networks into igraph objects

We start by converting the raw data to an igraph network object. Here we use igraph’s
graph_from_data_frame function, which takes two data frames: d and vertices.

• d describes the edges of the network. Its first two columns are the IDs of the source and the
target node for each edge. The following columns are edge attributes (weight, type, label, or
anything else).
• vertices starts with a column of node IDs. Any following columns are interpreted as node
attributes.

4.1 Dataset 1

library(igraph)

net <- graph_from_data_frame(d=links, vertices=nodes, directed=T)


class(net)

## [1] "igraph"

net

## IGRAPH DNW- 17 49 --
## + attr: name (v/c), media (v/c), media.type (v/n), type.label
## | (v/c), audience.size (v/n), type (e/c), weight (e/n)
## + edges (vertex names):
## [1] s01->s02 s01->s03 s01->s04 s01->s15 s02->s01 s02->s03 s02->s09
## [8] s02->s10 s03->s01 s03->s04 s03->s05 s03->s08 s03->s10 s03->s11
## [15] s03->s12 s04->s03 s04->s06 s04->s11 s04->s12 s04->s17 s05->s01
## [22] s05->s02 s05->s09 s05->s15 s06->s06 s06->s16 s06->s17 s07->s03
## [29] s07->s08 s07->s10 s07->s14 s08->s03 s08->s07 s08->s09 s09->s10
## [36] s10->s03 s12->s06 s12->s13 s12->s14 s13->s12 s13->s17 s14->s11
## [43] s14->s13 s15->s01 s15->s04 s15->s06 s16->s06 s16->s17 s17->s04

We also have easy access to nodes, edges, and their attributes with:

E(net) # The edges of the "net" object


V(net) # The vertices of the "net" object
E(net)$type # Edge attribute "type"
V(net)$media # Vertex attribute "media"

Now that we have our igraph network object, let’s make a first attempt to plot it.

plot(net, edge.arrow.size=.4,vertex.label=NA)

That doesn’t look very good. Let’s start fixing things by removing the loops in the graph.

net <- simplify(net, remove.multiple = F, remove.loops = T)

You might notice that we could have used simplify to combine multiple edges by summing their
weights with a command like simplify(net, edge.attr.comb=list(weight="sum","ignore")).
The problem is that this would also combine multiple edge types (in our data: “hyperlinks” and
“mentions”).
If you need them, you can extract an edge list or a matrix from igraph networks.

as_edgelist(net, names=T)
as_adjacency_matrix(net, attr="weight")

Or data frames describing nodes and edges:

as_data_frame(net, what="edges")
as_data_frame(net, what="vertices")

4.2 Dataset 2

As we have seen above, this time the edges of the network are in a matrix format. We can read
those into a graph object using graph_from_incidence_matrix(). In igraph, bipartite networks
have a node attribute called type that is FALSE (or 0) for vertices in one mode and TRUE (or 1)
for those in the other mode.

head(nodes2)

## id media media.type media.name audience.size


## 1 s01 NYT 1 Newspaper 20
## 2 s02 WaPo 1 Newspaper 25
## 3 s03 WSJ 1 Newspaper 30
## 4 s04 USAT 1 Newspaper 32
## 5 s05 LATimes 1 Newspaper 20
## 6 s06 CNN 2 TV 56

head(links2)

## U01 U02 U03 U04 U05 U06 U07 U08 U09 U10 U11 U12 U13 U14 U15 U16 U17
## s01 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## s02 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## s03 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
## s04 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
## s05 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
## s06 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1
## U18 U19 U20
## s01 0 0 0
## s02 0 0 1
## s03 0 0 0
## s04 0 0 0
## s05 0 0 0
## s06 0 0 0

net2 <- graph_from_incidence_matrix(links2)


table(V(net2)$type)

##
## FALSE TRUE
## 10 20

To transform a one-mode network matrix into an igraph object, use instead graph_from_adjacency_matrix().
We can also easily generate bipartite projections for the two-mode network (co-memberships are
easy to calculate by multiplying the network matrix by its transpose, or by using igraph’s
bipartite.projection() function).

net2.bp <- bipartite.projection(net2)

We can calculate the projections manually as well:

as_incidence_matrix(net2) %*% t(as_incidence_matrix(net2))


t(as_incidence_matrix(net2)) %*% as_incidence_matrix(net2)

plot(net2.bp$proj1, vertex.label.color="black", vertex.label.dist=1,


vertex.size=7, vertex.label=nodes2$media[!is.na(nodes2$media.type)])


plot(net2.bp$proj2, vertex.label.color="black", vertex.label.dist=1,


vertex.size=7, vertex.label=nodes2$media[ is.na(nodes2$media.type)])

5. Plotting networks with igraph

Plotting with igraph: the network plots have a wide set of parameters you can set. Those include
node options (starting with vertex.) and edge options (starting with edge.). A list of selected
options is included below, but you can also check out ?igraph.plotting for more information.

The igraph plotting parameters include (among others):

5.1 Plotting parameters

NODES
vertex.color Node color
vertex.frame.color Node border color
vertex.shape One of “none”, “circle”, “square”, “csquare”, “rectangle”
“crectangle”, “vrectangle”, “pie”, “raster”, or “sphere”
vertex.size Size of the node (default is 15)
vertex.size2 The second size of the node (e.g. for a rectangle)
vertex.label Character vector used to label the nodes
vertex.label.family Font family of the label (e.g.“Times”, “Helvetica”)
vertex.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
vertex.label.cex Font size (multiplication factor, device-dependent)
vertex.label.dist Distance between the label and the vertex
vertex.label.degree The position of the label in relation to the vertex,
where 0 is right, pi is left, pi/2 is below, and -pi/2 is above
EDGES
edge.color Edge color
edge.width Edge width, defaults to 1
edge.arrow.size Arrow size, defaults to 1
edge.arrow.width Arrow width, defaults to 1
edge.lty Line type, could be 0 or “blank”, 1 or “solid”, 2 or “dashed”,
3 or “dotted”, 4 or “dotdash”, 5 or “longdash”, 6 or “twodash”
edge.label Character vector used to label edges
edge.label.family Font family of the label (e.g.“Times”, “Helvetica”)
edge.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
edge.label.cex Font size for edge labels
edge.curved Edge curvature, range 0-1 (FALSE sets it to 0, TRUE to 0.5)
arrow.mode Vector specifying whether edges should have arrows,
possible values: 0 no arrow, 1 back, 2 forward, 3 both
OTHER
margin Empty space margins around the plot, vector with length 4
frame if TRUE, the plot will be framed
main If set, adds a title to the plot
sub If set, adds a subtitle to the plot

We can set the node & edge options in two ways - the first one is to specify them in the plot()
function, as we are doing below.

# Plot with curved edges (edge.curved=.1) and reduce arrow size:


plot(net, edge.arrow.size=.4, edge.curved=.1)


# Set edge color to gray, and the node color to orange.


# Replace the vertex label with the node names stored in "media"
plot(net, edge.arrow.size=.2, edge.curved=0,
vertex.color="orange", vertex.frame.color="#555555",
vertex.label=V(net)$media, vertex.label.color="black",
vertex.label.cex=.7)


The second way to set attributes is to add them to the igraph object. Let’s say we want to color
our network nodes based on type of media, and size them based on audience size (larger audience
-> larger node). We will also change the width of the edges based on their weight.

# Generate colors based on media type:
colrs <- c("gray50", "tomato", "gold")
V(net)$color <- colrs[V(net)$media.type]

# Set node size based on audience size:


V(net)$size <- V(net)$audience.size*0.7

# The labels are currently node IDs.


# Setting them to NA will render no labels:
V(net)$label.color <- "black"
V(net)$label <- NA

# Set edge width based on weight:


E(net)$width <- E(net)$weight/6

#change arrow size and edge color:


E(net)$arrow.size <- .2
E(net)$edge.color <- "gray80"

E(net)$width <- 1+E(net)$weight/12

We can also override the attributes explicitly in the plot:

plot(net, edge.color="orange", vertex.color="gray50")

It helps to add a legend explaining the meaning of the colors we used:

plot(net)
legend(x=-1.5, y=-1.1, c("Newspaper","Television", "Online News"), pch=21,
col="#777777", pt.bg=colrs, pt.cex=2, cex=.8, bty="n", ncol=1)


Sometimes, especially with semantic networks, we may be interested in plotting only the labels of
the nodes:

plot(net, vertex.shape="none", vertex.label=V(net)$media,


vertex.label.font=2, vertex.label.color="gray40",
vertex.label.cex=.7, edge.color="gray85")


Let’s color the edges of the graph based on their source node color. We can get the starting node
for each edge with the ends() igraph function.

edge.start <- ends(net, es=E(net), names=F)[,1]


edge.col <- V(net)$color[edge.start]

plot(net, edge.color=edge.col, edge.curved=.1)

5.2 Network layouts

Network layouts are simply algorithms that return coordinates for each node in a network.
For the purposes of exploring layouts, we will generate a slightly larger 80-node graph. We use the
sample_pa() function which generates a simple graph starting from one node and adding more
nodes and links based on a preset level of preferential attachment (Barabasi-Albert model).

net.bg <- sample_pa(80)
V(net.bg)$size <- 8
V(net.bg)$frame.color <- "white"
V(net.bg)$color <- "orange"
V(net.bg)$label <- ""
E(net.bg)$arrow.mode <- 0
plot(net.bg)

You can set the layout in the plot function:

plot(net.bg, layout=layout_randomly)

Or you can calculate the vertex coordinates in advance:

l <- layout_in_circle(net.bg)
plot(net.bg, layout=l)

l is simply a matrix of x, y coordinates (N x 2) for the N nodes in the graph. You can easily
generate your own:

l <- cbind(1:vcount(net.bg), c(1, vcount(net.bg):2))


plot(net.bg, layout=l)

This layout is just an example and not very helpful - thankfully igraph has a number of built-in
layouts, including:

# Randomly placed vertices
l <- layout_randomly(net.bg)
plot(net.bg, layout=l)

# Circle layout
l <- layout_in_circle(net.bg)
plot(net.bg, layout=l)

# 3D sphere layout
l <- layout_on_sphere(net.bg)
plot(net.bg, layout=l)

Fruchterman-Reingold is one of the most used force-directed layout algorithms out there.
Force-directed layouts try to get a nice-looking graph where edges are similar in length and cross
each other as little as possible. They simulate the graph as a physical system. Nodes are electrically
charged particles that repulse each other when they get too close. The edges act as springs that
attract connected nodes closer together. As a result, nodes are evenly distributed through the chart
area, and the layout is intuitive in that nodes which share more connections are closer to each
other. The disadvantage of these algorithms is that they are rather slow and therefore less often
used in graphs larger than ~1000 vertices. You can set the “weight” parameter which increases the
attraction forces among nodes connected by heavier edges.

l <- layout_with_fr(net.bg)
plot(net.bg, layout=l)

You will notice that the layout is not deterministic - different runs will result in slightly different
configurations. Saving the layout in l allows us to get the exact same result multiple times, which
can be helpful if you want to plot the time evolution of a graph, or different relationships – and
want nodes to stay in the same place in multiple plots.

par(mfrow=c(2,2), mar=c(0,0,0,0)) # plot four figures - 2 rows, 2 columns


plot(net.bg, layout=layout_with_fr)
plot(net.bg, layout=layout_with_fr)
plot(net.bg, layout=l)
plot(net.bg, layout=l)

dev.off()

By default, the coordinates of the plots are rescaled to the [-1,1] interval for both x and y. You can
change that with the parameter rescale=FALSE and rescale your plot manually by multiplying the
coordinates by a scalar. You can use norm_coords to normalize the plot with the boundaries you
want.

l <- layout_with_fr(net.bg)
l <- norm_coords(l, ymin=-1, ymax=1, xmin=-1, xmax=1)

par(mfrow=c(2,2), mar=c(0,0,0,0))
plot(net.bg, rescale=F, layout=l*0.4)
plot(net.bg, rescale=F, layout=l*0.6)
plot(net.bg, rescale=F, layout=l*0.8)
plot(net.bg, rescale=F, layout=l*1.0)

dev.off()

Another popular force-directed algorithm that produces nice results for connected graphs is Kamada
Kawai. Like Fruchterman Reingold, it attempts to minimize the energy in a spring system.

l <- layout_with_kk(net.bg)
plot(net.bg, layout=l)

The LGL algorithm is meant for large, connected graphs. Here you can also specify a root: a node
that will be placed in the middle of the layout.

plot(net.bg, layout=layout_with_lgl)

Let’s take a look at all available layouts in igraph:

layouts <- grep("^layout_", ls("package:igraph"), value=TRUE)[-1]
# Remove layouts that do not apply to our graph.
layouts <- layouts[!grepl("bipartite|merge|norm|sugiyama|tree", layouts)]

par(mfrow=c(3,3), mar=c(1,1,1,1))
for (layout in layouts) {
print(layout)
l <- do.call(layout, list(net))
plot(net, edge.arrow.mode=0, layout=l, main=layout) }

(Panel titles in the resulting grids: layout_as_star, layout_components, layout_in_circle, layout_nicely, layout_on_grid, layout_on_sphere, layout_randomly, layout_with_dh, layout_with_drl, layout_with_fr, layout_with_gem, layout_with_graphopt, layout_with_kk, layout_with_lgl, layout_with_mds.)

5.3 Improving network plots

Notice that our network plot is still not too helpful. We can identify the type and size of nodes,
but cannot see much about the structure since the links we’re examining are so dense. One way to
approach this is to see if we can sparsify the network, keeping only the most important ties and
discarding the rest.

hist(links$weight)
mean(links$weight)
sd(links$weight)

There are more sophisticated ways to extract the key edges, but for the purposes of this exercise
we’ll only keep ones that have weight higher than the mean for the network. In igraph, we can
delete edges using delete_edges(net, edges):

cut.off <- mean(links$weight)
net.sp <- delete_edges(net, E(net)[weight<cut.off])
plot(net.sp)

Another way to think about this is to plot the two tie types (hyperlink & mention) separately.

E(net)$width <- 1.5


plot(net, edge.color=c("dark red", "slategrey")[(E(net)$type=="hyperlink")+1],
vertex.color="gray40", layout=layout.circle)

net.m <- net - E(net)[E(net)$type=="hyperlink"] # another way to delete edges


net.h <- net - E(net)[E(net)$type=="mention"]

# Plot the two links separately:


par(mfrow=c(1,2))
plot(net.h, vertex.color="orange", main="Tie: Hyperlink")
plot(net.m, vertex.color="lightsteelblue2", main="Tie: Mention")


# Make sure the nodes stay in place in both plots:


l <- layout_with_fr(net)
plot(net.h, vertex.color="orange", layout=l, main="Tie: Hyperlink")
plot(net.m, vertex.color="lightsteelblue2", layout=l, main="Tie: Mention")


dev.off()

5.4 Interactive plotting with tkplot

R and igraph allow for interactive plotting of networks. This might be a useful option for you if you
want to tweak slightly the layout of a small graph. After adjusting the layout manually, you can get
the coordinates of the nodes and use them for other plots.

tkid <- tkplot(net) #tkid is the id of the tkplot that will open
l <- tkplot.getcoords(tkid) # grab the coordinates from tkplot
tk_close(tkid, window.close = T)
plot(net, layout=l)

5.5 Other ways to represent a network

At this point it might be useful to provide a quick reminder that there are many ways to represent
a network not limited to a hairball plot.
For example, here is a quick heatmap of the network matrix:

netm <- get.adjacency(net, attr="weight", sparse=F)


colnames(netm) <- V(net)$media
rownames(netm) <- V(net)$media

palf <- colorRampPalette(c("gold", "dark orange"))


heatmap(netm[,17:1], Rowv = NA, Colv = NA, col = palf(100),
scale="none", margins=c(10,10) )


5.6 Plotting two-mode networks with igraph

As with one-mode networks, we can modify the network object to include the visual properties that
will be used by default when plotting the network. Notice that this time we will also change the
shape of the nodes - media outlets will be squares, and their users will be circles.

V(net2)$color <- c("steel blue", "orange")[V(net2)$type+1]


V(net2)$shape <- c("square", "circle")[V(net2)$type+1]
V(net2)$label <- ""
V(net2)$label[V(net2)$type==F] <- nodes2$media[V(net2)$type==F]
V(net2)$label.cex=.4
V(net2)$label.font=2

plot(net2, vertex.label.color="white", vertex.size=(2-V(net2)$type)*8)


Igraph also has a special layout for bipartite networks (though it doesn’t always work great, and
you might be better off generating your own two-mode layout).

plot(net2, vertex.label=NA, vertex.size=7, layout=layout_as_bipartite)

Using text as nodes may be helpful at times:

plot(net2, vertex.shape="none", vertex.label=nodes2$media,
     vertex.label.color=V(net2)$color, vertex.label.font=2.5,
     vertex.label.cex=.6, edge.color="gray70", edge.width=2)


6. Network and node descriptives

6.1 Density

The proportion of present edges from all possible edges in the network.

edge_density(net, loops=F)

## [1] 0.1764706

ecount(net)/(vcount(net)*(vcount(net)-1)) #for a directed network

## [1] 0.1764706

6.2 Reciprocity

The proportion of reciprocated ties (for a directed network).

reciprocity(net)
dyad_census(net) # Mutual, asymmetric, and null node pairs
2*dyad_census(net)$mut/ecount(net) # Calculating reciprocity

6.3 Transitivity

• global - ratio of triangles (direction disregarded) to connected triples.


• local - ratio of triangles to connected triples each vertex is part of.

transitivity(net, type="global") # net is treated as an undirected network


transitivity(as.undirected(net, mode="collapse")) # same as above
transitivity(net, type="local")
triad_census(net) # for directed networks

Triad types (per Davis & Leinhardt):

• 003 A, B, C, empty triad.


• 012 A->B, C
• 102 A<->B, C
• 021D A<-B->C
• 021U A->B<-C
• 021C A->B->C
• 111D A<->B<-C
• 111U A<->B->C
• 030T A->B<-C, A->C
• 030C A<-B<-C, A->C.
• 201 A<->B<->C.
• 120D A<-B->C, A<->C.
• 120U A->B<-C, A<->C.
• 120C A->B->C, A<->C.
• 210 A->B<->C, A<->C.
• 300 A<->B<->C, A<->C, completely connected.

6.4 Diameter

A network diameter is the longest geodesic distance (length of the shortest path between two nodes)
in the network. In igraph, diameter() returns the distance, while get_diameter() returns the
nodes along the first found path of that distance.
Note that edge weights are used by default, unless set to NA.

diameter(net, directed=F, weights=NA)

## [1] 4

diameter(net, directed=F)

## [1] 28

diam <- get_diameter(net, directed=T)
diam

## + 7/17 vertices, named:


## [1] s12 s06 s17 s04 s03 s08 s07

Note that get_diameter() returns a vertex sequence. When asked to behave as
a vector, a vertex sequence will produce the numeric indexes of the nodes in it. The same applies
to edge sequences.

class(diam)

## [1] "igraph.vs"

as.vector(diam)

## [1] 12 6 17 4 3 8 7

Color nodes along the diameter:

vcol <- rep("gray40", vcount(net))
vcol[diam] <- "gold"
ecol <- rep("gray80", ecount(net))
ecol[E(net, path=diam)] <- "orange"
# E(net, path=diam) finds edges along a path, here 'diam'
plot(net, vertex.color=vcol, edge.color=ecol, edge.arrow.mode=0)

6.5 Node degrees

The function degree() has a mode of in for in-degree, out for out-degree, and all or total for
total degree.

deg <- degree(net, mode="all")


plot(net, vertex.size=deg*3)

hist(deg, breaks=1:vcount(net)-1, main="Histogram of node degree")

6.6 Degree distribution

deg.dist <- degree_distribution(net, cumulative=T, mode="all")


plot( x=0:max(deg), y=1-deg.dist, pch=19, cex=1.2, col="orange",
xlab="Degree", ylab="Cumulative Frequency")

6.7 Centrality & centralization

Centrality functions (vertex level) and centralization functions (graph level). The centralization
functions return res - vertex centrality, centralization, and theoretical_max - maximum
centralization score for a graph of that size. The centrality function can run on a subset of nodes
(set with the vids parameter). This is helpful for large graphs where calculating all centralities may
be a resource-intensive and time-consuming task.
Degree (number of ties)

degree(net, mode="in")
centr_degree(net, mode="in", normalized=T)
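As a quick sketch of the list components returned by a centralization function (a small random graph stands in for net here, so the snippet runs on its own):

```r
library(igraph)

# A small random directed graph standing in for 'net'
g <- sample_gnp(17, 0.2, directed=TRUE)

cd <- centr_degree(g, mode="in", normalized=TRUE)
cd$res              # vertex-level centrality scores
cd$centralization   # graph-level centralization index
cd$theoretical_max  # maximum centralization for a graph of this size
```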

Closeness (centrality based on distance to others in the graph)


Inverse of the node’s average geodesic distance to others in the network.

closeness(net, mode="all", weights=NA)


centr_clo(net, mode="all", normalized=T)

Eigenvector (centrality proportional to the sum of connection centralities)
Values of the first eigenvector of the graph matrix.

eigen_centrality(net, directed=T, weights=NA)


centr_eigen(net, directed=T, normalized=T)

Betweenness (centrality based on a broker position connecting others)


Number of geodesics that pass through the node or the edge.

betweenness(net, directed=T, weights=NA)


edge_betweenness(net, directed=T, weights=NA)
centr_betw(net, directed=T, normalized=T)

6.8 Hubs and authorities

The hubs and authorities algorithm developed by Jon Kleinberg was initially used to examine
web pages. Hubs were expected to contain catalogs with a large number of outgoing links; while
authorities would get many incoming links from hubs, presumably because of their high-quality
relevant information.

hs <- hub_score(net, weights=NA)$vector


as <- authority_score(net, weights=NA)$vector

par(mfrow=c(1,2))
plot(net, vertex.size=hs*50, main="Hubs")
plot(net, vertex.size=as*30, main="Authorities")


dev.off()

7. Distances and paths

Average path length: the mean of the shortest distance between each pair of nodes in the network
(in both directions for directed graphs).

mean_distance(net, directed=F)

## [1] 2.058824

mean_distance(net, directed=T)

## [1] 2.742188

We can also find the length of all shortest paths in the graph:

distances(net) # with edge weights


distances(net, weights=NA) # ignore weights

We can extract the distances to a node or set of nodes we are interested in. Here we will get the
distance of every media from the New York Times.

dist.from.NYT <- distances(net, v=V(net)[media=="NY Times"], to=V(net), weights=NA)

# Set colors to plot the distances:


oranges <- colorRampPalette(c("dark red", "gold"))
col <- oranges(max(dist.from.NYT)+1)
col <- col[dist.from.NYT+1]

plot(net, vertex.color=col, vertex.label=dist.from.NYT, edge.arrow.size=.6,
     vertex.label.color="white")

We can also find the shortest path between specific nodes. Say here between MSNBC and the New
York Post:

news.path <- shortest_paths(net,
                            from = V(net)[media=="MSNBC"],
                            to = V(net)[media=="New York Post"],
                            output = "both") # both path nodes and edges

# Generate edge color variable to plot the path:


ecol <- rep("gray80", ecount(net))
ecol[unlist(news.path$epath)] <- "orange"
# Generate edge width variable to plot the path:
ew <- rep(2, ecount(net))
ew[unlist(news.path$epath)] <- 4
# Generate node color variable to plot the path:
vcol <- rep("gray40", vcount(net))
vcol[unlist(news.path$vpath)] <- "gold"

plot(net, vertex.color=vcol, edge.color=ecol,
     edge.width=ew, edge.arrow.mode=0)

Identify the edges going into or out of a vertex, for instance the WSJ. For a single node, use
incident(), for multiple nodes use incident_edges()

inc.edges <- incident(net, V(net)[media=="Wall Street Journal"], mode="all")

# Set colors to plot the selected edges.


ecol <- rep("gray80", ecount(net))
ecol[inc.edges] <- "orange"
vcol <- rep("grey40", vcount(net))
vcol[V(net)$media=="Wall Street Journal"] <- "gold"
plot(net, vertex.color=vcol, edge.color=ecol)

We can also easily identify the immediate neighbors of a vertex, say WSJ. The neighbors function
finds all nodes one step out from the focal actor. To find the neighbors for multiple nodes, use
adjacent_vertices() instead of neighbors(). To find node neighborhoods going more than one
step out, use function ego() with parameter order set to the number of steps out to go from the
focal node(s).

neigh.nodes <- neighbors(net, V(net)[media=="Wall Street Journal"], mode="out")

# Set colors to plot the neighbors:


vcol[neigh.nodes] <- "#ff9d00"
plot(net, vertex.color=vcol)
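As a sketch of the multi-step version mentioned above, ego() returns a list with one vertex sequence per focal node (a toy ring graph stands in for net so the snippet runs on its own):

```r
library(igraph)

# Toy graph standing in for 'net'; node 1 plays the role of the focal vertex
g <- make_ring(10)

# All vertices within two steps of node 1 (the node itself is included)
neigh2 <- ego(g, order=2, nodes=1, mode="all")
neigh2[[1]]  # vertices 1, 2, 3 and 9, 10
```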

Special operators for the indexing of edge sequences: %--%, %->%, %<-%
E(network)[X %--% Y] selects edges between vertex sets X and Y, ignoring direction
E(network)[X %->% Y] selects edges from vertex set X to vertex set Y
E(network)[X %<-% Y] selects edges from vertex set Y to vertex set X
For example, select edges from newspapers to online sources:

E(net)[ V(net)[type.label=="Newspaper"] %->% V(net)[type.label=="Online"] ]

## + 7/48 edges (vertex names):


## [1] s01->s15 s03->s12 s04->s12 s04->s17 s05->s15 s06->s16 s06->s17

Co-citation (for a couple of nodes, how many shared nominations they have):

cocitation(net)

## s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17
## s01 0 1 1 2 1 1 0 1 2 2 1 1 0 0 1 0 0
## s02 1 0 1 1 0 0 0 0 1 0 0 0 0 0 2 0 0
## s03 1 1 0 1 0 1 1 1 2 2 1 1 0 1 1 0 1
## s04 2 1 1 0 1 1 0 1 0 1 1 1 0 0 1 0 0
## s05 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0
## s06 1 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 2
## s07 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## s08 1 0 1 1 1 0 0 0 0 2 1 1 0 1 0 0 0
## s09 2 1 2 0 0 0 1 0 0 1 0 0 0 0 1 0 0
## s10 2 0 2 1 1 0 0 2 1 0 1 1 0 1 0 0 0
## s11 1 0 1 1 1 1 0 1 0 1 0 2 1 0 0 0 1
## s12 1 0 1 1 1 1 0 1 0 1 2 0 0 0 0 0 2
## s13 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0
## s14 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0
## s15 1 2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0
## s16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## s17 0 0 1 0 0 2 0 0 0 0 1 2 0 0 0 1 0

8. Subgroups and communities

Before we start, we will make our network undirected. There are several ways to do that conversion:

• We can create an undirected link between any pair of connected nodes (mode="collapse" )
• Create undirected link for each directed one in the network, potentially ending up with a
multiplex graph (mode="each" )
• Create undirected link for each symmetric link in the graph (mode="mutual" ).

In cases when we have both A -> B and B -> A ties collapsed into a single undirected link, we
need to specify what to do with their edge attributes using the parameter edge.attr.comb, as we
did earlier with simplify(). Here we have said that the 'weight' of the links should be summed,
and all other edge attributes ignored and dropped.

net.sym <- as.undirected(net, mode= "collapse",
edge.attr.comb=list(weight="sum", "ignore"))

8.1 Cliques

Find cliques (complete subgraphs of an undirected graph)

cliques(net.sym) # list of cliques


sapply(cliques(net.sym), length) # clique sizes
largest_cliques(net.sym) # cliques with max number of nodes

vcol <- rep("grey80", vcount(net.sym))


vcol[unlist(largest_cliques(net.sym))] <- "gold"
plot(as.undirected(net.sym), vertex.label=V(net.sym)$name, vertex.color=vcol)


8.2 Community detection

A number of algorithms aim to detect groups that consist of densely connected nodes with fewer
connections across groups.
Community detection based on edge betweenness (Newman-Girvan)
High-betweenness edges are removed sequentially (recalculating at each step) and the best parti-
tioning of the network is selected.

ceb <- cluster_edge_betweenness(net)


dendPlot(ceb, mode="hclust")

plot(ceb, net)

Let’s examine the community detection igraph object:

class(ceb)

## [1] "communities"

length(ceb) # number of communities

## [1] 5

membership(ceb) # community membership for each node

## s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17
## 1 2 3 4 1 4 3 3 5 5 4 4 4 4 1 4 4

modularity(ceb) # how modular the graph partitioning is

## [1] 0.292476

crossing(ceb, net) # boolean vector: TRUE for edges across communities

High modularity for a partitioning reflects dense connections within communities and sparse
connections across communities.
Community detection based on propagating labels
Assigns node labels, randomizes, then replaces each vertex's label with the label that appears most
frequently among its neighbors. Those steps are repeated until each vertex has the most common label
of its neighbors.

clp <- cluster_label_prop(net)


plot(clp, net)

Community detection based on greedy optimization of modularity

cfg <- cluster_fast_greedy(as.undirected(net))


plot(cfg, as.undirected(net))

We can also plot the communities without relying on their built-in plot:

V(net)$community <- cfg$membership


colrs <- adjustcolor( c("gray50", "tomato", "gold", "yellowgreen"), alpha=.6)
plot(net, vertex.color=colrs[V(net)$community])

8.3 K-core decomposition

The k-core is the maximal subgraph in which every node has degree of at least k. This also means
that the (k+1)-core will be a subgraph of the k-core.
The result here gives the coreness of each vertex in the network. A node has coreness D if it belongs
to a D-core but not to the (D+1)-core.

kc <- coreness(net, mode="all")


plot(net, vertex.size=kc*6, vertex.label=kc, vertex.color=colrs[kc])


9. Assortativity and Homophily

Homophily: the tendency of nodes to connect to others who are similar on some variable.

• assortativity_nominal() is for categorical variables (labels)


• assortativity() is for ordinal and above variables
• assortativity_degree() checks assortativity in node degrees

assortativity_nominal(net, V(net)$media.type, directed=F)

## [1] 0.1715568

# Matching of attributes across connected nodes more than expected by chance

assortativity(net, V(net)$audience.size, directed=F)

## [1] -0.1102857

# Correlation of attributes across connected nodes

assortativity_degree(net, directed=F)

## [1] -0.009551146

# As above, with the focal attribute being the node degree

Module7_Security

Reference: Handbook of Social Network


Technologies and Applications, Springer New
York Dordrecht Heidelberg London, Editor:
Borko Furht, Florida Atlantic University
Managing Trust in online social network
• Users of online social networks may share their
experiences and opinions within the network about
an item, which may be a product or a service.
• A user faces the problem of evaluating trust in a service
or service provider before making a choice.
• Recommendations may be received through a chain of
friends in the network.
• An opinion or recommendation has a great influence on
whether other users of the community choose/use the
item.
Managing Trust in online social network

• Problem : Evaluate various types of trust


opinions and recommendations.
• Collaborative filtering is the most popular
method in recommender systems.
• The task in collaborative filtering is to predict the
utility of items to a particular user based on a
database of user ratings from a sample or
population of other users.
• People rate differently according to their
subjective taste.
Managing Trust in online social network
• A recommender system recommends items that one
participant likes to other persons in the same cluster.
• Collaborative filtering performs poorly when
there is insufficient previous common rating available
between users; this is commonly known as the cold
start problem.
– Solution : Trust based approach to recommendation
– Assumes a trust network among users and makes
recommendations based on the ratings of the users that
are directly or indirectly trusted by the target user.
– Here, trust is used for neighborhood formation.
– Trust could be used as supplementary or replacement
method of widely used collaborative filtering system.
Managing Trust in online social network

• Trust and reputation systems can be used in


order to assist users in predicting and
selecting the best quality services.
• Online trust and reputation systems are
important decision support tools for selecting
online services and for assessing the risk of
accessing them.
Reputation Systems
Bayesian reputation systems:
• Binomial Bayesian reputation systems take
ratings expressed in a discrete binary form as
either positive (e.g. good) or negative (e.g.
bad).
• Multinomial Bayesian reputation systems
allow the possibility of providing ratings with
discrete graded levels such as e.g. mediocre –
bad – average – good – excellent .
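As an illustration of the binomial case, a minimal sketch of the usual beta reputation score (the function name and the uniform-prior assumption are ours, not from the handbook): with r positive and s negative ratings, the expected probability of a good outcome is the mean of the posterior Beta(r+1, s+1) distribution.

```r
# Beta reputation sketch: mean of Beta(r+1, s+1) given r positive, s negative ratings
beta_reputation <- function(r, s) (r + 1) / (r + s + 2)

beta_reputation(0, 0)   # no evidence yet: neutral score 0.5
beta_reputation(8, 2)   # mostly positive ratings: 0.75
```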
Trust in online networks
• Trust is defined as the subjective probability
by which an individual, A, expects that
another individual, B, performs a given action
on which its welfare depends.
• Decision trust is defined as the extent to
which one party is willing to depend on
something or somebody in a given situation
with a feeling of relative security, even though
negative consequences are possible.
Trust Models Based on Subjective
Logic
• Subjective logic is a type of probabilistic logic that
explicitly takes uncertainty and belief ownership into
account.
• Arguments in subjective logic are subjective opinions
about states in a state space.
• A binomial opinion applies to a single proposition, and
can be represented as a Beta distribution.
• A multinomial opinion applies to a collection of
propositions, and can be represented as a Dirichlet
distribution (a multivariate generalization of Beta
distribution) .
• Trust models based on subjective logic are directly compatible with Bayesian reputation systems,
because a bijective mapping exists between their respective trust and reputation representations.
Trust Models Based on Subjective Logic
• Subjective logic defines a trust metric called opinion, denoted by ω^A_X = (b, u, a)
– which expresses the relying party A's belief over a state space X
– b represents belief masses over the states of X, in the range [0,1]
– u represents the uncertainty mass, in the range [0,1]
– a represents the base rates over X, used for computing
the probability expectation value of a state x as E(x) = b(x) + a(x)·u

• Binomial opinions are expressed as ω^A_x = (b, d, u, a)
– d denotes disbelief in x.
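For the binomial case the probability expectation value is easy to sketch in R (the representation of an opinion as a named vector is our own convention, and the numbers are hypothetical):

```r
# Probability expectation of a binomial opinion (b, d, u, a): E = b + a*u
expectation <- function(op) op[["b"]] + op[["a"]] * op[["u"]]

w <- c(b=0.4, d=0.1, u=0.5, a=0.6)   # hypothetical opinion with b + d + u = 1
expectation(w)                        # 0.4 + 0.6*0.5 = 0.7
```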
Trust Models Based on Subjective Logic
• Let the statement x, for example, be "David is honest
and reliable"; then the opinion can be interpreted as
reliability trust in David.
• Example: Let us assume that Alice needs to get her car
serviced, and that she asks Bob to recommend a good car
mechanic.
• When Bob recommends David, Alice would like to get a second
opinion, so she asks Claire for her opinion about David.
Trust Models Based on Subjective Logic

• When trust and referrals are expressed as


subjective opinions,
– Each transitive trust path,
Alice → Bob → David and
Alice → Claire → David

can be computed with the transitivity operator, where
the idea is that the referrals from Bob and Claire are
discounted as a function of Alice's trust in Bob and Claire
respectively.
Trust Models Based on Subjective Logic

• Two paths can be combined using the cumulative or averaging fusion


operator.
• A trust relationship between A and B is denoted as [A,B].
• The transitivity of two arcs is denoted as ":"
• The fusion of two parallel paths is denoted as "◊"
• Then the trust network of the "Alice, Bob, Claire, David" example can be
expressed as:
[A,D] = ([A,B] : [B,D]) ◊ ([A,C] : [C,D])

• The corresponding transitivity operator for opinions is denoted as ⊗
• The corresponding fusion operator is ⊕
• The mathematical expression for combining the opinions about the trust
relationships is
ω^A_D = (ω^A_B ⊗ ω^B_D) ⊕ (ω^A_C ⊗ ω^C_D)
Trust Network Analysis
• Trust networks consist of transitive trust relationships
between people, organisations and software agents
connected through a medium for communication and
interaction.
• Trust relationships are formalised as reputation scores or as
subjective trust measures
– Trust between parties within a domain can be derived by analysing the trust
paths linking the parties together.
– Trust Network Analysis using Subjective Logic (TNA-SL) is the proposed
method [by Jøsang et al]
• TNA-SL takes directed trust edges between pairs as input, and
can be used to derive a level of trust between arbitrary
parties that are interconnected through the network.
Trust Network Analysis
• Even in cases where no explicit trust path between two parties
exists,
– subjective logic allows a level of trust to be derived through default
opinions.
• TNA-SL therefore has a general applicability and is suitable for
many types of trust networks.
• Limitation of TNA-SL : Complex trust networks must be
simplified to series-parallel networks in order for TNA-SL to
produce consistent results.
– The simplification consisted of gradually removing the least certain trust paths
until the whole network can be represented in a series-parallel form.
– As this process removes information it is intuitively sub-optimal.
Trust Network Analysis
• Assuming that the path ([A,B] : [B,C] : [C,D]) is the weakest
path in the graph, network simplification of the dependent
graph would be to remove the edge [B,C] from the graph.
Operators for Deriving Trust
• Subjective logic is a belief calculus specifically developed for modeling trust
relationships.
• In subjective logic, beliefs are represented on binary state spaces, where
each of the two possible states can consist of sub-states.
• Belief functions on binary state spaces are called subjective opinions and
are formally expressed in the form of an ordered tuple ω_x = (b, d, u, a)
– where b, d, and u represent belief, disbelief and uncertainty respectively,
with b, d, u ∈ [0, 1] and
– b + d + u = 1
• The base rate parameter a (in the range [0, 1]) represents the base rate
probability in the absence of evidence, and is used for computing an
opinion's probability expectation value E(ω_x) = b + a·u
– a determines how uncertainty shall contribute to E(ω_x)


Operators for Deriving Trust
• A subjective opinion is interpreted as an
agent A's belief in the truth of statement x.
• Ownership of an opinion is represented as a
superscript, so that for example A's opinion
about x is denoted as ω^A_x.
Operators for Deriving Trust
Trust Path Dependency
• Transitive trust networks can involve many principals
• A single trust relationship is expressed as a directed
edge between two nodes that represent the trust
source and the trust target of that edge.
• For example the edge [A, B] means that A trusts B.
• The symbol “:” is used to denote the transitive
connection of two consecutive trust edges to form a
transitive trust path.
• The trust relationships between four principals A, B, C
and D connected serially can be expressed as
([A,D]) = ([A,B] : [B,C] : [C,D])
Trust Path Dependency
• A’s combination of the two parallel trust paths from
her to D is expressed as
([A,D]) = ([A,B] :[B,D] ) ([A,C]: [C,D])


Trust Transitivity Analysis
• Assume two agents A and B where A trusts B, and B believes that
proposition x is true.
• Then by transitivity, agent A will also believe that proposition x is
true.
• This assumes that B recommends x to A.
• Trust and belief are formally expressed as opinions.
• The transitive linking of these two opinions consists of discounting
B’s opinion about x by A’s opinion about B, in order to derive A’s
opinion about x.
• The solid arrows represent initial direct trust, and the dotted arrow represents derived
indirect trust.
Trust Transitivity Analysis
• Trust transitivity is a human mental
phenomenon, so there is no such thing as
objective transitivity, and trust transitivity
therefore lends itself to different
interpretations.
• Two main difficulties:
1. Effect of A disbelieving that B will give a good
advice.
2. Effect of base rate trust in a transitive path.
Uncertainty Favoring Trust Transitivity

• Interpretation 1 for difficulty 1: A’s disbelief in


the recommending agent B means that A
thinks that B ignores the truth value of x.
– As a result, A also ignores the truth value of x.
• Definition 22.1(next slide) supports the above
interpretation.
Uncertainty Favoring Trust Transitivity
Example 1
• Example of applying the discounting operator for independent opinions,
Example 2
• Given
– Compute

– Check if the operator is commutative


Example 2 (cont’d)
• Given

– Compute

– Check if the operator is commutative


= ( 0.21, 0.03, 0.76 )
= ( 0.21, 0.42, 0.37 )

Hence the discounting operator ⊗ is not commutative
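The non-commutativity can be checked numerically with the uncertainty favoring discounting operator, which combines (b1, d1, u1) and (b2, d2, u2) into (b1*b2, b1*d2, d1 + u1 + b1*u2). The input opinions below are hypothetical values chosen to reproduce the two results shown above:

```r
# Uncertainty favoring discounting of binomial opinions (Definition 22.1)
discount <- function(w1, w2) c(
  b = w1[["b"]] * w2[["b"]],
  d = w1[["b"]] * w2[["d"]],
  u = w1[["d"]] + w1[["u"]] + w1[["b"]] * w2[["u"]])

w1 <- c(b=0.3, d=0.6, u=0.1)   # hypothetical trust opinion in the recommender
w2 <- c(b=0.7, d=0.1, u=0.2)   # hypothetical recommended opinion about x

discount(w1, w2)   # (0.21, 0.03, 0.76)
discount(w2, w1)   # (0.21, 0.42, 0.37): the order of the arguments matters
```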


Opposite Belief Favoring
• Interpretation 2 for difficulty 1: A’s disbelief in the
recommending agent B means that A thinks that
B consistently recommends the opposite of his
real opinion about the truth value of x.
– Hence, A not only disbelieves in x to the degree that B
recommends belief, but she also believes in x to the
degree that B recommends disbelief in x, because the
combination of two disbeliefs results in belief in this
case.
• Definition 22.2(next slide) supports the above
interpretation.
Opposite Belief Favoring

• This operator models the principle that “your enemy’s enemy is your friend”.
• This operator should only be applied when the situation makes it plausible.
• It is doubtful whether the enemy of your enemy’s enemy necessarily is your enemy too (if there are more
than 2 arcs)
Base Rate Sensitive Transitivity
Effect of base rate trust in a transitive path –
difficulty 2:
• The transitivity operators (the transitivity
trust operator and the transitivity fusion
operator) have no influence on the ‘a’ (base
rate) parameter.
Base Rate Sensitive Transitivity
Example: Imagine a stranger coming to a town which is known for its
citizens being honest.
– The stranger is looking for a car mechanic, and asks the first person he
meets to direct him to a good car mechanic.
– The stranger receives the reply that there are two car mechanics in town,
David and Eric, where David is cheap but does not always do quality work,
and Eric might be a bit more expensive, but he always does a perfect job.
• According to subjective logic, the stranger has no other info about
the person he asks than the base rate that the citizens in the town
are honest.
• The stranger is thus ignorant, but the expectation value of a good
advice is still very high.
• Without taking the base rate parameter ‘a’ into account, the result would
be that the stranger is completely ignorant about which of the
mechanics is the best.
• Definition 22.3 (next slide) defines Base Rate Sensitive
Transitivity.
Base Rate Sensitive Transitivity
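The definition can be sketched as follows, assuming Jøsang's formulation in which opinions carry the base rate ‘a’ as a fourth component and B's advice is weighted by the probability expectation E = b + a·u of A's trust in B; the numbers illustrating the stranger scenario are made up:

```python
def expectation(w):
    """Probability expectation of an opinion (b, d, u, a): E = b + a*u."""
    b, d, u, a = w
    return b + a * u

def discount_base_rate(w_ab, w_bx):
    """Base rate sensitive discounting: B's advice is weighted by the
    probability expectation of A's trust in B, so a favourable base
    rate counts even when A is totally ignorant about the advisor."""
    e = expectation(w_ab)
    b2, d2, u2, a2 = w_bx
    return (e * b2, e * d2, 1 - e * (b2 + d2), a2)

# The stranger knows nothing about the advisor (b = d = 0, u = 1),
# but the town's base rate of honesty is high (a = 0.9):
trust_in_advisor = (0.0, 0.0, 1.0, 0.9)
advice_about_eric = (0.8, 0.1, 0.1, 0.5)  # made-up advice opinion
print(tuple(round(v, 2) for v in
            discount_base_rate(trust_in_advisor, advice_about_eric)))
```

Most of the advice survives (belief 0.72), whereas the uncertainty favoring operator would discard it entirely, since A's belief in the advisor is zero.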
Mass Hysteria
• Mass hysteria can be caused
by people not being aware
of dependency between
opinions.
• Example: Person A
recommends an opinion
about a particular
statement x to a group of
other persons. Without
being aware of the fact that
the opinion came from the
same origin, these persons
can recommend their
opinions to each other.
Mass Hysteria
• The arrows represent trust so that
for example B → A can be
interpreted as saying that B trusts A
to recommend an opinion about
statement x.
• The actual recommendation goes, of
course, in the opposite direction to
the arrows.
• Here, A recommends an opinion
about x to 6 other agents, and that
G receives six recommendations in
all. If G assumes the recommended
opinions to be independent and
takes the consensus between them,
his opinion can become abnormally
strong and in fact even stronger
than A’s opinion.
Mass Hysteria
• Analyzing the whole graph of dependent paths as if
they were independent will then produce an
abnormally strong derived opinion for G.
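The inflation effect can be reproduced with the cumulative fusion (consensus) operator. A minimal sketch, assuming Jøsang's two-opinion consensus formula; the recommended opinion (0.6, 0.1, 0.3) is a made-up value:

```python
from functools import reduce

def fuse(w1, w2):
    """Cumulative fusion (consensus) of two *independent* opinions."""
    b1, d1, u1 = w1
    b2, d2, u2 = w2
    k = u1 + u2 - u1 * u2  # normaliser; assumes u1, u2 not both zero
    return ((b1 * u2 + b2 * u1) / k,
            (d1 * u2 + d2 * u1) / k,
            (u1 * u2) / k)

w = (0.6, 0.1, 0.3)  # assumed: the single opinion A recommended to everyone
# G wrongly fuses six dependent copies as if they were independent:
fused = reduce(fuse, [w] * 6)
print(tuple(round(v, 3) for v in fused))  # (0.8, 0.133, 0.067)
```

The fused belief (0.8) is higher and the uncertainty (0.067) far lower than in A's original opinion, which is exactly the abnormal strengthening the slide describes.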
Dirichlet Reputation System
• Reputation systems collect ratings about users or service
providers from members in a community.
• The reputation centre computes and publishes reputation
scores about those users and services.
– (Dotted arrows indicate ratings and the solid arrows indicate reputation scores
about the users).
• Multinomial Bayesian systems are based on computing
reputation scores by statistical updating of Dirichlet
Probability Density Functions (PDF), which is why they
are known as Dirichlet reputation systems.
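A sketch of the score computation, assuming the common convention of a Dirichlet prior with weight C = 2 and a uniform base rate over the rating levels; the rating counts are made up:

```python
def dirichlet_scores(ratings, prior_weight=2.0):
    """Multinomial reputation scores from accumulated rating counts.

    ratings: dict mapping rating level -> count of ratings received.
    Returns the expected probability of each level under a Dirichlet
    posterior with the given prior weight and a uniform base rate 1/k.
    """
    k = len(ratings)
    total = sum(ratings.values())
    return {level: (count + prior_weight / k) / (prior_weight + total)
            for level, count in ratings.items()}

# 10 made-up ratings of one service over 3 levels:
scores = dirichlet_scores({"bad": 1, "average": 3, "good": 6})
print({k: round(v, 3) for k, v in scores.items()})
# {'bad': 0.139, 'average': 0.306, 'good': 0.556}
```

The scores always sum to 1, and with no ratings at all each level falls back to its base rate 1/k, so a newcomer starts from the prior rather than from zero.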
Module7_SecurityAndPrivacy
Reference: Handbook of Social Network
Technologies and Applications, Springer New
York Dordrecht Heidelberg London, Editor:
Borko Furht, Florida Atlantic University
Introduction
• Online Social Network (OSN)
• Social Network Services (SNS)
• Personally Identifiable Information (PII)
Introduction
• OSN customers and their relationships to PII and SNS
Functional Overview of Online Social Networks
• Access control functions: OSN users are usually
allowed to define their own privacy settings
through some control functions. An OSN user
may have control over the
– Visibility of the online presence within the OSN
– Visibility of contacts from the user’s contact lists
– Visibility and access to his or her own profile
information
– Access to his or her own uploaded content and posted
communications
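The four control areas above can be sketched as a per-user settings object; all names here are illustrative, not any real OSN's API:

```python
from dataclasses import dataclass, field

@dataclass
class PrivacySettings:
    """Per-user visibility toggles mirroring the four control areas
    listed above. Names are illustrative, not any real OSN's API."""
    presence_visible: bool = True       # online presence within the OSN
    contact_list_visible: bool = True   # contacts from the user's lists
    profile_visible: bool = True        # own profile information
    content_visible_to: set = field(default_factory=lambda: {"contacts"})

    def may_view_content(self, viewer_group: str) -> bool:
        """Check access to uploaded content and posted communications."""
        return viewer_group in self.content_visible_to

settings = PrivacySettings(contact_list_visible=False)
print(settings.may_view_content("contacts"))  # True
print(settings.may_view_content("public"))    # False
```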
Social Network Services (SNS)
• Social Network Services (SNS) are structured
along the following three layers with different
responsibilities
– A Social Network (SN) level, building the digital
representation of members and their relationships
– An Application Service (AS) level, constituting the
application infrastructure managed by the SNS
provider
– A Communication and Transport (CT) level
representing the communication and transport
services as provided by the network
Social Network Services (SNS)
• Architectural layers of SNS
Security Objectives
• Privacy
• Integrity
• Availability
Security Attacks
• Plain impersonation
• Profile cloning
• Profile hijacking
• Profile porting
• Id theft
• Profiling
• Secondary data collection
Security Attacks
• Fake requests
• Crawling and harvesting
• Image retrieval and analysis
• Communication tracking
• Fake profiles and sybil attacks
• Group metamorphosis
• Ballot stuffing and defamation
• Censorship
• Collusion attacks
Security Requirements for Social
Networks in Web 2.0
Reference:
Furht, B. (Ed.). (2010). Handbook of social
network technologies and applications. Springer
Science & Business Media.
Web 2.0 + Semantic Web =Web 3.0?
• Web 2.0 is often contrasted to the Semantic Web, which is a more
conscious and carefully orchestrated effort on the side of the W3C to
trigger a new stage of developments using semantic technologies.
• Web 2.0 mostly affects how users interact with the Web, while the
Semantic Web opens new technological opportunities for web developers
in combining data and services from different sources.
• Some of the opportunities arise by the combination of ideas from these
two developments.
• Basic lesson of Web 2.0 is users are willing to provide content as well as
metadata.
• Articles and facts organized in tables and categories in Wikipedia, photos
organized in sets and according to tags in Flickr or structured information
embedded into homepages and blog postings using microformats.
Web 2.0 + Semantic Web =Web 3.0?
• Mini-vocabularies for encoding metadata of all kinds in HTML
pages, for example information about the author or a blog
item.
• Addresses a primary concern of the Semantic Web
community, namely whether users would be willing to provide
metadata to bootstrap the Semantic Web.
• Semantic Web was expected to be filled by users annotating
Web resources, describing their home pages and multimedia
content.
• Early implementations of embedding RDF into HTML have
been abandoned as it was not conceived realistic to expect
everyday users to master the intricacies of encoding metadata
in RDF/XML.
Web 2.0 + Semantic Web =Web 3.0?
• Although it is still dubious whether everyday users could master
Semantic Web languages such as RDF and OWL, the lessons of
Web 2.0 point in a promising direction.
• Many users are willing to provide structured information, provided
that they can do so in a task oriented way and through a user-
friendly interface that hides the complexity of the underlying
representation.
• Micro formats proved to be more popular due to the easier
authoring using existing HTML attributes.
• Web pages created automatically from a database can encode
metadata in micro formats without the user necessarily being
aware of it.
• Micro formats retain all the advantages of RDF in terms of machine
accessibility.
Web 2.0 + Semantic Web =Web 3.0?
• Blog search engines are able to provide search on the
properties of the author or the news item.
• The idea of providing ways to encode RDF into HTML pages
has resurfaced.
• There are also works under way to extend the Media Wiki
software behind Wikipedia to allow users to encode facts in
the text of articles while writing the text.
• Machine processable markup of facts would enable to easily
extract, query and aggregate the knowledge of Wikipedia.
Web 2.0 + Semantic Web =Web 3.0?
• New Wiki systems combine free-text authoring with the
collaborative editing of structured information.
• Due to the extensive collaborations online many applications
have access to significantly more metadata about the users.
• Information about the choices, preferences, tastes and social
networks of users means that the new breed of applications
are able to build on a much richer user profiles.
• Semantic technology can help in matching users with similar
interests as well as matching users with available content.
• Golbeck shows that there are components of trust that are
beyond what can be inferred based on profile similarity
alone.
Web 2.0 + Semantic Web =Web 3.0?
• Therefore social-semantic systems that can provide recommendations
based on both the social network of users and their personal profiles are
likely to outperform traditional recommender systems as well as purely
network-based trust mechanisms.
• What the Semantic Web can offer the Web 2.0 community is a standard
infrastructure for building creative combinations of data and services.
• Standard formats for exchanging data and schema information, support
for data integration, along with standard query languages and protocols
for querying remote data sources provide a platform for the easy
development of mashups.
• San Francisco based start-up MetaWeb is developing FreeBase, a kind of
“data commons” that allows users to share, interlink, and jointly edit
ontologies and structured data through a Web-based interface.
• A public API allows developers to build applications using the combined
knowledge of FreeBase.
Context, Threats, and Incidents
• Social networks are complex systems that were not designed with security
as a basic objective.
• Recently, there has been pressure from their users to make them more
secure but as all security experts know, security is not an add-on, it must
be considered from the beginning.
• This implies that new platforms and applications are needed.
• The original platforms and applications are easy to attack as shown by
several incidents, some of which are described below.
• Usability of the security mechanisms is another fundamental design flaw.
• It doesn’t matter if the system has good security controls if they are too
complex for the average users.
• Most privacy settings are confusing, cumbersome, and change frequently,
so users lose track of their privacy restrictions.
• For example, Facebook’s Privacy Policy has almost 6,000 words; its policies
include 50 settings with over 170 options.
Context, Threats, and Incidents
• Many companies are encouraging their employees to use
social networks as a way to reach potential clients or provide
better service.
• Usage may expose them to risks of illegal access to their
corporate data or to their reputation.
• Often these companies don’t have any policies about the use
of social networks by their employees.
• Users themselves are a source of many security problems;
many of them don’t know or don’t care about privacy.
• In surveys many respondents were willing to let anyone see
their full names, addresses, gender, and had few aspects that
they wanted to hide.
Context, Threats, and Incidents
• A user in a survey said “I have nothing to hide”, and we have
heard this before from others.
• People don’t realize the information can be misinterpreted,
propagated erroneously, and misused, e.g. for identity theft.
• User even exposes his DNA record.
• Somebody could see in it a potential for some disease and he
might not get a job, lose his girlfriend, or be denied insurance.
• List of his purchases may identify him as liking books that
some people might consider perverted or evil, with possible
bad consequences.
• This position has started to reverse as many people are
realizing that showing all this information is creating problems
for them to get jobs.
Context, Threats, and Incidents
• Main source of privacy threats comes from the platform
providers.
• Social networks are commercial enterprises trying to make
money.
• They don’t charge their users and provide nice functions to
entice them to join.
• The providers sell this information or the access to the users
to external parties to generate profits.
• They encourage users to provide as much information as
possible and to share it with as many people as possible.
• Sometimes they use deceptive policies that confuse users and
make them provide more information, not to protect their
information, and even to share it with external entities.
Context, Threats, and Incidents
• Social engineering attacks, frequent in the Internet, become
much faster and more effective using social networks, since the
attacker can reach many more people at one time.
• Spam can propagate farther and faster.
• Not only privacy and security are at risk in poorly controlled
social networks, reputation is another aspect that can suffer.
• People can post things about others, which can affect their
lives.
• A company using the network can get a bad reputation if
some users express the disapproval of their products or
services.
• Standard Internet threats: Malware: viruses, worms, Trojan
horses, spyware, identity theft, phishing, account hijacking.
Context, Threats, and Incidents
• Availability attacks can be very annoying to people used to
keeping track of their friends in real time.
• If you always let others know your location you are inviting
burglars to your house.
• Web sites hosting social networks may keep your information
indefinitely.
• Effect of the platform and associated software is very
important.
• An example is Elgg, an open source platform; Elgg runs on a
combination of the Apache web server, the MySQL database
system and the PHP interpreted scripting language.
• These are open source products but they are not secure.
• Use of mashups can bring data leakage.
Context, Threats, and Incidents
• Mobile access to social networks and their applications will
bring new problems.
• Many social networks are using cloud based platforms that
may not have proper defences.
• Several incidents have shown the fragility of the current
networks:
– A recent breach in Facebook allowed users to see private
information from other users.
– An attacker got into the master directory of Twitter’s
addresses and tampered with its DNS to redirect users to
the site of the “Iranian Cyber Army”.
Two Patterns
• Security requirements for social networks can be defined by building
functional patterns that include appropriate security patterns to restrict
the users’ actions needed to protect their privacy.
• Two methods
– Participation Collaboration: Participation allows members of the public
to contribute ideas and expertise so that their government can make
policies with the benefit of information that is widely dispersed in
society.
– Collaborative Tagging: Collaboration improves the effectiveness of
government by encouraging partnerships and cooperation within the
federal government, across levels of government, and between the
government and private institutions.
• Intent
– Describe the functionality of the collaboration between users
participating in social networks, together with access and rights
restrictions.
Participation-Collaboration
• Example
– A small company wants to create a manual covering the
use of one of its products.
– Traditional approach is to gather a small set of experts to
write it, hopefully reducing the potential for costly errors.
– Manuals face a market of readers with different skill levels,
and the company’s writers may not always get everything
right.
– Customers often know what they need better than the
company does, but the flow of information has
traditionally gone from the publisher to the customer,
rather than the other way around.
– It is hard then to produce good-quality manuals
Participation-Collaboration
• Context
– This pattern can be useful when a group of people has a
common interest in sharing and communicating
information about a specific subject.
– They all have access to the facilities of an environment
such as the one provided by Web 2.0.
• Problem
– There are tasks where we need the collaboration of a large
variety of people, who can provide unique points of view
or expertise.
– How do we share information among people in different
places and with different areas of expertise so they can
work together?
Participation-Collaboration
• Problem
– Solution for this problem is affected by the following
forces:
• There are issues that can be solved better when many people
collaborate, we should provide a convenient way for them to
interact.
• Consistent participation may provide a platform for some users to
be recognized as experts. We want to know who are the users who
have a high level of expertise in some area.
• A person needs to be designated to accept or reject the changes in
the content made by the users; otherwise the collaboration may
be overwhelmed by some users or become corrupted with
spurious content.
• We should control who can propose new content to avoid
spammers and similar input providers.
Participation-Collaboration
• Solution
– An open process may provide better results than having
only a few people provide their knowledge and we want to
let as many people as possible to participate under
controlled conditions.
– Each contribution should be reviewed before acceptance.
– Only registered users should be able to add content.
– User is authenticated by an Authentication system.
– User has specific Rights with respect to the Content.
– Reviewer approves the content provided by the user.
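The solution can be sketched in a few lines; the class and method names are illustrative, following the roles (registered user, reviewer, content) named above:

```python
class Content:
    def __init__(self, author, text):
        self.author, self.text = author, text
        self.status = "pending"  # pending -> published / rejected

class Platform:
    """Minimal sketch of the Participation-Collaboration pattern:
    only registered users may submit content, and nothing becomes
    visible until a reviewer approves it."""
    def __init__(self):
        self.registered = set()
        self.pending = []
        self.published = []

    def register(self, user):
        self.registered.add(user)  # stands in for the Authenticator

    def submit(self, user, text):
        if user not in self.registered:
            raise PermissionError("only registered users may add content")
        content = Content(user, text)
        self.pending.append(content)  # not published until reviewed
        return content

    def review(self, content, approve):
        self.pending.remove(content)
        content.status = "published" if approve else "rejected"
        if approve:
            self.published.append(content)

platform = Platform()
platform.register("alice")
draft = platform.submit("alice", "How to reset the device")
platform.review(draft, approve=True)
print([c.text for c in platform.published])  # ['How to reset the device']
```

Keeping the reviewer step separate from submission is what lets the pattern admit wide participation while still filtering out spurious content.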
Participation-Collaboration

Fig. 26.1 Class Diagram for Participation and Collaboration pattern


Participation-Collaboration

Fig. 26.2 Use Case Diagram for Participation and Collaboration pattern
Participation-Collaboration
• Dynamics
– The class diagram supports the use cases shown in
Fig. 26.2.
– The sequence diagram of Fig. 26.3 shows the use
case modify content: First the user logs in the
platform, then she can make changes but those
changes are not published until the reviewer
approves them.
Participation-Collaboration

Fig. 26.3 Sequence Diagram for the use case Modify Content
Participation-Collaboration
• Implementation
– Using the facilities of Web 2.0, implement a collaborative platform
through which users or application actors can contribute knowledge,
code, facts, and other material relevant to a discussion.
– The material can be made available, possibly in more than one format
or language, and participants can then modify or append to that core
content.
– The collaborative platform allows the user to modify an article, upload
images, videos and audio.
– To do that, the user must have an account to be identified to the
reviewer; it is important to use an appropriate authentication system.
Participation-Collaboration
• Known Uses
– Platforms of collaboration such as Wikipedia, a free web encyclopedia.
– Facebook Wiki, a technical reference for developers interested in the
Facebook Platform.
– Facebook Platform is a standards-based Web service with methods for
accessing and contributing data.
• Related Patterns
– The Authenticator pattern, is used to authenticate the users of the
system.
– The Role-Based Access Control (RBAC) pattern is used to define the
rights of the users with respect to the contents.
Participation-Collaboration
• Consequences:
• This pattern has the following advantages:
– Allows users in any place to modify content; they can share text,
videos, images and can discuss any topic trying to collaborate and give
different ideas.
– Experts demonstrate their knowledge or talent about certain topic and
they can be recognized in their field of expertise.
– We can keep out spammers and other undesirable users.
• Possible disadvantages include:
– Sometimes the reviewers don’t know about a specific topic and they
can eliminate important or useful content.
– They can also be biased.
– This means the reviewer should be carefully selected.
Collaborative Tagging
• Intent
– Collaborative Tagging pattern makes content more meaningful and
useful by using keywords to tag bookmarks, photographs, and other
content.
• Example
– Consider a person tagging a photograph of broccoli.
– One person might label it “cruciform,” “vegetable,” or “nutritious,”
while another might tag it “gross,” “bitter,” or worse. We need some
way to attach information to this item as a guide to possibly interest
users about this vegetable.
• Context
– People in the internet need to search different kinds of content such
as pictures, text, content, audio files, bookmarks, news, items,
websites, products, blog posts, comments, and other items available
online. They may want an item for a variety of reasons.
Collaborative Tagging
• Problem
– Often, we need to use a search system to find resources on the
Internet.
– The resources must match our needs, and to find relevant information,
we need to enter search terms.
– The search system compares those terms with a metadata index that
shows which resources might be relevant to our search.
– The primary problem with such a system is that the metadata index is
often built by a small group of people who determine the relevancy of
those resources for specific terms.
– The smaller the group that does this, the greater the chance that the
group will apply inaccurate tags to some resources or omit certain
relationships between the search terms and the resources’ semantic
relevancy.
– How do we let users guide the search for people with related
interests?
Collaborative Tagging
• The solution is affected by the following forces:
– The number of ways to classify an item is undefined and
the choices can be as different as the users and all of these
are valid in some sense.
– A specific item can belong to an unlimited number of
categories.
– We want to have a variety of ways to find items.
Collaborative Tagging
• Solution
– Let the users add tags to items to indicate categorizations of interest
to the members of the group.
– Figure(Next Slide) shows the class diagram for this pattern.
– User belongs to a Domain and applies Tags from this domain to
Resources.
– User is any human, application, process, or other entity that is capable
of interacting with a resource.
– Domain is the total set of objects and actions that the language
provides.
– Resource denotes any digital asset that can have an identifier.
Examples of resources include online content, audio files, digital
photos, bookmarks, news items, websites, products, blog posts,
comments, and other items available online.
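The solution can be sketched as a small tag store; names are illustrative, loosely mirroring the User/Tag/Resource roles of the pattern:

```python
from collections import defaultdict

class TagStore:
    """Sketch of the Collaborative Tagging pattern: any user can
    attach free-form keywords to any resource, and searches then go
    through the accumulated tags."""

    def __init__(self):
        self.by_tag = defaultdict(set)       # tag -> set of resources
        self.by_resource = defaultdict(set)  # resource -> set of tags

    def tag(self, user, resource, keyword):
        # any user may tag; conflicting vocabularies simply coexist
        self.by_tag[keyword].add(resource)
        self.by_resource[resource].add(keyword)

    def search(self, keyword):
        return sorted(self.by_tag[keyword])

store = TagStore()
store.tag("u1", "photo42", "vegetable")
store.tag("u2", "photo42", "gross")      # same item, opposite opinion
store.tag("u1", "photo99", "vegetable")
print(store.search("vegetable"))         # ['photo42', 'photo99']
```

Because tags are uncontrolled, the same item can carry contradictory labels; that is exactly the liability about ambiguous metadata noted below.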
Collaborative Tagging

Fig. 26.4 Class Diagram for the Collaborative Tagging Pattern


Collaborative Tagging

Fig. 26.5 Use Case Diagram for the Collaborative Tagging Pattern
Collaborative Tagging

Fig. 26.6 Sequence diagram for the Assign a Tag use case
Collaborative Tagging
• Figure 26.5 shows the use cases related to this pattern.
• Figure 26.6 shows a sequence diagram for the use case Tag a resource:
First the user selects a resource, he then makes a semantic classification
of the resource by adding a tag.
• The user has freedom to assign any word to a resource.
• After the user tags a resource, the domain changes its content and people
can start to use the tag for searches.
• Known Uses
– Flickr has implemented the capability to let individuals provide tags for
digital photographs.
– Technorati lets individuals use tags for blog articles. It has also
published a microformat for tags in blog entries and indexes blogs
using those tags (and category data).
– Slashdot’s beta tagging system lets any member place specific tags on
any post.
Collaborative Tagging
• Consequences
• This pattern has the following advantages:
– Users can classify their collections of items in any way that they find
useful.
– When users try to find an item, the results have more variety and
meanings.
– Tags provide a way to measure aspects such as directionality and
centrality.
• Possible liabilities are:
– When users can freely choose tags, the resulting metadata can include
the same tags used with different meanings and multiple tags for the
same concept, which may lead to inappropriate connections between
items and inefficient searches for information about a subject.
– There is no explicit information about the meaning of each tag.
– The personalized variety of terms can present challenges when
Module7_Security Requirements for
Social Networks in Web 2.0
Reference: Handbook of Social Network
Technologies and Applications, Springer New
York Dordrecht Heidelberg London, Editor:
Borko Furht, Florida Atlantic University
Web 2.0
• Web 2.0 refers to World Wide Web websites
that emphasize user-generated content,
usability (ease of use, even by non-experts),
and interoperability for end users.
• The term was coined by Darcy DiNucci in 1999
• A Web 2.0 website may allow users to interact
and collaborate with each other as creators of
user-generated content in a virtual
community.
Web 1.0 → Web 2.0
• Banner ads on websites → automatic text, image, video, and interactive media
advertisements that are targeted to website content and audience
• Ofoto, an online digital photography website on which users could store,
share, view and print digital photos → Flickr, an image hosting and video
hosting website and web services suite
• Content delivery networks (CDN) → BitTorrent and eMule, communications
protocols of peer-to-peer (P2P) file sharing used to distribute data and
electronic files over the Internet
• mp3.com, a website providing information about digital music and artists,
songs, services, community, and technologies, and a legal, free music-sharing
service → Napster, a pioneering peer-to-peer (P2P) file sharing Internet
service that emphasized sharing digital audio files, typically songs, encoded
in MP3 format
• Britannica Online, written by professionals and experts → Wikipedia, which
can be written and edited by any person, even amateurs and non-experts
• Personal websites → blogging
• Evite → upcoming.org and EVDB
• Domain name speculation → search engine optimization (SEO)
• Page views → cost per click
• “Screen scraping” → web services
• Publishing of online documents, once approved by gatekeepers and editorial
staff → mass user participation, without approval of content by gatekeepers
or editorial staff
• Content management systems → wikis that allow almost any users to contribute
• Directories (taxonomy) → “tagging” of websites, images and videos (folksonomy)
• “Stickiness” → syndication
Useful Web 2.0 Tools
• Weblogs
• Wikis
• Really Simple Syndication (RSS)
• Aggregators
• Social Bookmarking and Networking
• Online Photo Galleries
• Audio/video-casting
Weblogs (Blogs)
• Easily created/updated sites.
• Publish instantly from any internet connection.
• Interactive - comment, question, link, converse.
Wikis
• Collaborative page.
• Anyone can add or edit content.
• Wikipedia - friend or foe?
• Wikispaces
Really Simple Syndication
• Subscribe to “feeds” from your favorite sites.
• New information is sent to you.
• Reduces research time.
Aggregators
• Collects and organizes the content you
subscribe to with an RSS feed.
Social Bookmarking
• Save web addresses of useful content.
• Share and search bookmarks.
• Generate specific resource lists.
Social Networking
• Facebook, Twitter
• LinkedIn.com
• Collaborate, bookmark, share, etc.
Online Photo Galleries
• Publish photos online.
• Comment, and share ideas.
• Create photo stories and presentations.
Audio/video-casting
• Produce audio and video recordings.
• Publish them easily on the web.
• Creates a world-wide audience.
Two Patterns
• Security requirements for social networks are
fulfilled by building functional patterns that
include appropriate security patterns to
restrict the users’ actions needed to protect
their privacy.
Participation-Collaboration
• Describe the functionality of the collaboration
between users participating in social
networks, together with access and rights
restrictions.
• This pattern can be useful when a group of
people has a common interest in sharing and
communicating information about a specific
subject.
Collaborative Tagging
• The Collaborative Tagging pattern makes content
more meaningful and useful by using keywords
to tag bookmarks, photographs, and other
content.
• People in the internet need to search different
kinds of content such as pictures, text, content,
audio files, bookmarks, news, items, websites,
products, blog posts, comments, and other items
available online.
• They may want an item for a variety of reasons.
TUCAN: Twitter User Centric ANalyzer

Luigi Grimaudo (Politecnico di Torino, Italy), Han Song (Narus Inc.), Mario Baldi (Narus Inc.),
Marco Mellia (Politecnico di Torino, Italy), Maurizio Munafò (Politecnico di Torino, Italy)
luigi.grimaudo@polito.it, hsong@narus.com, mbaldi@narus.com, mellia@polito.it, munafo@polito.it

Abstract—Twitter has attracted millions of users that generate a humongous flow of information at constant pace. The research community has thus started proposing tools to extract meaningful information from tweets. In this paper, we take a different angle from the mainstream of previous works: we explicitly target the analysis of the timeline of tweets from “single users”. We define a framework - named TUCAN - to compare information offered by the target users over time, and to pinpoint recurrent topics or topics of interest. First, tweets belonging to the same time window are aggregated into “bird songs”. Several filtering procedures can be selected to remove stop-words and reduce noise. Then, each pair of bird songs is compared using a similarity score to automatically highlight the most common terms, thus highlighting recurrent or persistent topics. TUCAN can be naturally applied to compare bird song pairs generated from timelines of different users. By showing actual results for both public profiles and anonymous users, we show how TUCAN is useful to highlight meaningful information from a target user’s Twitter timeline.

I. INTRODUCTION AND MOTIVATION

Twitter is nowadays part of everyone’s life, with hundreds of millions of people using it on a regular basis. Originally born as a microblogging service, Twitter is now being used to chat, to discuss, to run polls, to collect feedback, etc. It is not surprising then that the interest of the research community has been attracted to study the “social aspects” of Twitter. User and usage characterization [1], [2], topic analysis [3]–[5], and community-level social interest identification [1] have recently emerged as hot research topics. Most of the previous works focus on the analysis of “a community of twitters”, whose tweets are analysed using text and data mining techniques to identify the topics, moods, or interests.

In this paper we take a different angle: first, we focus on the analysis of a Twitter target user. We consider the set of tweets that appear on his Twitter public page, i.e., the target user’s timeline, and define a methodology to explore exposed content and extract possible valuable information. Which are the tweets that carry the most valuable information? Which are the topics he/she is interested in? How do these topics change over time? Our second goal is to compare the Twitter activity of two (or more) target users. Do they share some common traits? Is there any shared interest? How important is for one user a topic of interest for the other user? What is the most common interest of these two users, regardless of the time they are interested in it?

We propose a graphical framework which we term TUCAN - Twitter User Centric ANalyzer. TUCAN highlights correlations among tweets using intuitive visualization, allowing exploration of the information exposed in them, thus enabling the extraction of valuable information from the user’s timeline. From a methodology stand-point, we build upon text mining techniques, adapting them to cope with the specific Twitter characteristics.

As input, we group the target user’s tweets based on a window of time (e.g., a day, or a week) so to form bird songs, one for each time window. At the next step, filtering is applied to each bird song using either simple stop-word removal, stemming, lemmatization, or more complicated transformations based on lexical databases. Next, terms in bird songs are scored using classic Term Frequency-Inverse Document Frequency (TF-IDF) [6] to pinpoint those terms that are particularly important for the target user. Each pair of bird songs is finally compared by computing a similarity score, so as to unveil those bird songs that contain overlapping, and thus persistent, topics. The output is then represented using a coloured matrix, in which cell colour represents the similarity score. As a result, TUCAN offers a simple and natural visual representation of extracted information that easily unveils the most interesting bird songs and the persistent topics the target user is interested in during a given time period. Moreover, comparisons among bird songs give intuitions on the transition of user interests as well as the significance of topics to the user.

The framework is naturally extended to find and extract similarities among tweets of two or more target users. TUCAN computes and graphically shows the similarity among bird songs generated from the timelines of the pairs of target users, revealing similarities and common interests that are possibly present during different time periods.

II. FRAMEWORK

A. Bird song generation and cleaning process

Let TW(u) be the set of tweets of a single user u that are retrieved from Twitter, time stamped with their generation time, stored and organized in a repository in binary format, to be easily accessed and further analyzed when necessary. Bird songs are created by aggregating tweets from TW(u) generated within a time period T, to then be analyzed. We define the i-th bird song for the user u, BS(u, i), as the subset of tweets in TW(u) that appear in the i-th time period of duration T, i.e., the set of tweets that are generated in the [(i − 1)T, iT), i > 0 window of time.

A “plain cleaning” pre-processing is applied to bird songs to discard stopwords, HTML tag entities, and links. Plain cleaning can possibly be substituted by more advanced text cleaning mechanisms; the following are also considered in this work: (i) removal of Twitter ‘mentions’, (ii) stemming, (iii) lemmatization, and (iv) ontology-based lexicon generalization. TUCAN allows the analyst to select the most appropriate cleaning method to take advantage of their different effects in different contexts.

Rank  single Tweet   T = 1 day       T = 7 days           T = 14 days
1     photo          lead            #immigrationreform   #immigrationreform
2     day            international   immigration          gun
3     bo             @cfpb           gun                  immigration
4 snow cordray violence violence
B. Cross-correlation computation 5 mary comprehensive comprehensive
6 snow @whlive @whlive
Each pre-processed bird song is tailored in a Bag-Of-Words 7 nominates broken broken
(BoW) model, a common representation used in information 8 sec @vp reform
retrieval and natural language processing. Each word is then 9 richard representative representative
10 white reform @vp
scored according to a weighting scheme. In this work, the
Term Frequency-Inverse Document Frequency (TF-IDF) score TABLE I. T OP - WORDS RANKED BY TF-IDF, BARACK O BAMA .
is adopted, as past literature has shown it to produce good results [5]. Hence, words that are frequent in a bird song but rare in the collection are assigned higher weights.

Bird songs are then transformed into a vector space model VS(u, i), in which each word is given a fixed position. In this space, each word in the bird song BS(u, i) is characterized by its TF-IDF score. Words that do not appear in BS(u, i) are characterized by a null score.

To evaluate the similarity VS(u, i) ⊗ VS(v, j) between a pair of bird song vectors, the Cosine similarity measure is deployed.

C. Dashboard visualizer

In order to pinpoint similarities among bird songs, independently of the time the user posted them, TUCAN computes the similarity score for all possible pairs of bird songs. In total, N^2 similarity scores are computed and stored in matrix form, where each cell represents VS(u, i) ⊗ VS(u, j), i, j ∈ [1, N]. To help identify correlations, the matrix is presented to the analyst in a graphical format through a web interface. Each cell is represented by a square whose color reflects the similarity score between the i-th and j-th bird songs. A demo of the dashboard is available online at http://dbdmg.polito.it/asonam2013/index.html.

Analysis on timeline of a single user. Figure 1 shows correlation matrices representing similarities between pairs of bird songs of a single user. Figure 1(a) shows the matrix for the bird songs of Barack Obama. It highlights three blocks of highly correlated periods of tweets. The larger block [A] at the upper left corner represents Obama's tweets during the 2012 US presidential election. With a maximum Cosine similarity score of 0.33, it is clear that he had been tweeting a lot on a few correlated topics (voting, Romney, convention, and health being among the most recurrent top terms). Block [B] refers to periods when Obama was interested in the fiscal cliff. Finally, block [C] relates to the shooting at the Newtown elementary school, during which Obama's major topic terms were gun, violence, and weapon.

The correlation matrix in Figure 1(b) shows an interesting behavior of a generic "user X" (as opposed to a public figure or news media). Analyzing user X's bird songs, the plot highlights two blocks, [A] and [B]. The similarity of bird songs is dominated by the use of mentions of particular followers/followees of his. Investigating key terms in the time period of block [A], user X was exchanging messages with one of his followers. After one week of pause, in block [B], user X then mentions another follower of his (and never refers again to the follower in [A]). We suppose that user X's sudden change in his mentions indicates a change in his social relationships, e.g., a change of his dating partner.
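As a concrete illustration of the pipeline described in Section II (bird-song aggregation over windows of duration T, plain cleaning, TF-IDF weighting, and pairwise Cosine similarity), the following self-contained Python sketch reproduces the computation on toy data. The function names, the stopword list, and the tweet format are illustrative assumptions, not TUCAN's actual code.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "to", "and", "of", "in"}  # toy stopword list

def plain_clean(text):
    """'Plain cleaning': drop links, HTML tag entities, and stopwords."""
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"&\w+;", " ", text)          # HTML tag entities
    words = re.findall(r"[#@]?\w+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def bird_songs(tweets, T):
    """Group (timestamp, text) pairs into bird songs of duration T.
    Bird song i holds the tweets with t in [(i-1)T, iT), i > 0."""
    songs = {}
    for t, text in tweets:
        i = int(t // T) + 1
        songs.setdefault(i, []).extend(plain_clean(text))
    return [songs[i] for i in sorted(songs)]

def tfidf_vectors(songs):
    """TF-IDF: words frequent in one bird song but rare in the whole
    collection get higher weights; absent words get a null score."""
    n = len(songs)
    df = Counter(w for s in songs for w in set(s))  # document frequency
    vecs = []
    for s in songs:
        tf = Counter(s)
        vecs.append({w: (tf[w] / len(s)) * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse TF-IDF vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(vecs):
    """N^2 scores: cell (i, j) holds VS(u,i) (x) VS(u,j)."""
    return [[cosine(a, b) for b in vecs] for a in vecs]
```

Rendering each cell of the resulting matrix as a colour-coded square yields exactly the kind of dashboard described above; a real deployment would also swap the toy stopword list for a full one and optionally plug in stemming, lemmatization, or mention removal as alternative cleaning steps.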
[Fig. 1. Similarity among bird songs for different types of users. T = 7 days, plain cleaning, Cosine similarity. (a) Barack Obama, max similarity = 33%; (b) User X, max similarity = 31%. Both axes span weeks 1–20.]

III. EXPERIMENTS

A. Dataset description

To perform user centric analysis through TUCAN, we monitored 712 randomly selected Twitter users for two or more months starting from the Summer of 2012. Additionally, we monitored 28 well-known public figures, selected among politicians, news media, tech blogs, etc.

From a total of 810,655 tweets, it emerges that about 300 users (40%) tweeted more than twice each week. Among them, 20 users posted more than 400 tweets per week (i.e., more than 57 tweets/day). This suggests that the window size parameter T has to be tailored to each user's tweeting habits when forming bird songs. For example, Table I shows up to ten top-words extracted from Barack Obama's bird songs, for different values of T.

B. User centric analysis

To demonstrate the effectiveness of TUCAN on user analysis, we present the results of case studies. Unless mentioned otherwise, we use the following settings by default: (i) window size of 7 days, (ii) pre-processing with plain cleaning, and (iii) similarity scoring using the Cosine similarity measure.

Analysis across different users. Besides the per-user analysis, TUCAN can infer semantic relationships across multiple users when applied to a group of target users. We select ten public figures and media blogs and report the cross-similarity matrix in Figure 2. The latest six bird songs with T = 14 days are considered, referring to a common period of time. Each bird song is checked against each other. Results are represented as a colored matrix, using different color scales (and normalization) for blocks outside the main diagonal and in the main diagonal (where the same user's bird songs are compared). Focusing on the former, two pairs of users emerge as mostly correlated: {Barack Obama, White House} and {idownloadblog, iMore}.

[Fig. 2. Similarity among users over different bird songs. Plain cleaning and Cosine similarity. (a) Famous users vs. famous users, T = 14 days; (b) Barack Obama vs. White House, T = 7 days, weeks 1–25 on both axes.]

Zooming in and increasing the resolution by selecting T = 7 days, Figure 2(b) compares {Barack Obama, White House} in detail over 25 weeks of tweeting. First, notice that during Barack Obama's campaign (see Figure 1(a)) the correlation with White House is marginal. After the elections, four periods of high correlation are pinpointed, highlighting the periods in which Barack Obama and White House publicize similar topics. Block [A] indicates the period of educational cost cuts; [B] indicates the massacre at Newtown; [C] refers to the fiscal cliff; and [D] to the reformation of US immigration laws. The discovery
of both well-correlated and non-correlated periods allows us to quantify the periods of time in which the President spoke for himself (and his political party) and those in which he spoke for the government of the US.

IV. CONCLUSION

In this paper we presented TUCAN, a framework to graphically represent semantic correlations within individual Twitter users' timelines. Building on text mining techniques, TUCAN analyses "bird songs", i.e., groups of tweets belonging to the same time period, and compares their similarity. The analyst is offered a GUI to investigate the impact of different pre-processing steps and similarity definitions. Experiments conducted on actual Twitter users show the ability to pinpoint recurrent topics and correlations among users.

REFERENCES

[1] A. Java, X. Song, T. Finin, and B. Tseng, "Why We Twitter: Understanding Microblogging Usage and Communities," in Workshop on Web Mining and Social Network Analysis, 2007, pp. 56–65.
[2] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a Social Network or a News Media?" in WWW, 2010, pp. 591–600.
[3] F. Alvanaki, S. Michel, K. Ramamritham, and G. Weikum, "See What's enBlogue - Real-time Emergent Topic Identification in Social Media," in EDBT. Berlin, Germany: ACM, 2012.
[4] M. Mathioudakis and N. Koudas, "TwitterMonitor: Trend Detection over the Twitter Stream," in SIGMOD '10. New York, NY, USA: ACM, 2010, pp. 1155–1158.
[5] L. Hong and B. D. Davison, "Empirical Study of Topic Modeling in Twitter," in Workshop on Social Media Analytics. New York, NY, USA: ACM, 2010, pp. 80–88.
[6] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New York, NY, USA: McGraw-Hill, Inc., 1986.
