Network Science:
Erdős-Rényi Model for
Network Formation
Ozalp Babaoglu
Dipartimento di Informatica — Scienza e Ingegneria
Università di Bologna
www.cs.unibo.it/babaoglu/
© Babaoglu 2020
■ Simpler representation of possibly very complex structures
■ Can gain insight into how networks form and how they grow
■ May allow mathematical derivation of certain properties
■ Can serve to “explain” certain properties observed in real networks
■ Can predict new properties or outcomes for networks that do not even exist
■ Can serve as benchmarks for evaluating real networks
■ Two parameters
  ■ Number of nodes: n
  ■ Probability that an edge is present: p
■ For each of the n(n−1)/2 possible edges in the network, flip a (biased) coin that comes up “heads” with probability p
  ■ If coin flip is “heads”, then add the edge to the network
  ■ If coin flip is “tails”, then don’t add the edge to the network
■ Also known as the “G(n, p) model” (graph on n nodes with probability p)
Example with n = 5 and p = 0.6:
■ Number of possible edges: n(n−1)/2=5×4/2=10
■ Ten flips of a coin that comes up heads 60%, tails 40%
■ Add the edges corresponding to the “heads” outcomes
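The coin-flipping procedure can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function name sample_gnp_edges and the seed are made up for this example and are not part of the slides.

```r
# Sketch of G(n, p) sampling: one biased coin flip per possible edge.
# sample_gnp_edges is a hypothetical helper name used only for this example.
sample_gnp_edges <- function(n, p) {
  pairs <- t(combn(n, 2))              # all n(n-1)/2 unordered node pairs
  heads <- runif(nrow(pairs)) < p      # TRUE ("heads") with probability p
  pairs[heads, , drop = FALSE]         # keep only the pairs that came up heads
}

set.seed(1)                            # arbitrary seed, for reproducibility only
possible <- nrow(t(combn(5, 2)))       # number of possible edges for n = 5
edges <- sample_gnp_edges(n = 5, p = 0.6)
possible                               # 10
nrow(edges)                            # number of edges actually added
```

Each run with a different seed gives a different graph; only the expected number of edges, p × n(n−1)/2, is fixed.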
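[Figure: degree distribution histogram of the example network; x-axis: Degree (0 to 4), y-axis: Frequency]
■ Degree distribution: Binomial; for p small and n large, approximately Poisson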
■ CC=(0+1+1+2/3+2/3)/5=0.6667
■ Compare with p which is 0.6
■ Since the edge density is exactly equal to the background probability of triangles being closed, the networks produced by the ER model cannot be considered highly clustered
Erdős-Rényi
Summary
Lecture Notes: Social Networks: Models, Algorithms, and Applications
Lecture 1: Jan 26, 2012
Scribes: Geoffrey Fairchild and Jason Fries
Definition 1 G(n, m) is the graph obtained by sampling uniformly from all graphs with n vertices
and m edges.
For example, given n = 4 and m = 2 with the vertex set V = {1, 2, 3, 4} we could obtain any one
of these graphs:
Figure 1: Possible random graph instances (G1, G2, ..., G15) for n = 4, m = 2, resulting in a state space Ω of size 15
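Computing the probability of selecting a particular graph G_N requires the size of the set of all possible
graph outcomes, that is, the number of ways to choose m edges from the $\binom{n}{2}$ possible pairs of nodes.
Definition 2 The total number of possible random graphs given n vertices and m edges is
$$|\Omega| = \binom{\binom{n}{2}}{m}$$
Because the sampling is uniform, each graph is selected with probability $1/|\Omega|$. For n = 4 and m = 2 this gives $|\Omega| = \binom{6}{2} = 15$.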
the G(n, p) model. Both variants were independently proposed by Solomonoff and Rapoport in
1951 [5] and Gilbert in 1959 [4].
In analyses, the G(n, m) model is not as easy to deal with mathematically as the similar (though
not identical) model G(n, p), so in practice G(n, p) is more commonly used today. The correspondence between
G(n, m) and G(n, p) can be seen by setting $\binom{n}{2} p = m$ and observing that, as n → ∞, G(n, p)
behaves similarly to G(n, m): by the law of large numbers, G(n, p) will contain
approximately the same number of edges as G(n, m).
For example, if we want to generate sparse graphs with a linear number of edges, p should be on
the order of $1/n$.
Definition 5 For the distribution of the number of edges in G(n, p), let X be the random variable
counting the edges in G(n, p). Then
$$\mathrm{Prob}[X = x] = \binom{\binom{n}{2}}{x}\, p^{x} (1 - p)^{\binom{n}{2} - x}$$
This takes the form of a binomial distribution, and the implication of this definition is that the number of edges
is concentrated around its mean, $\binom{n}{2} p$, with high probability.
Definition 7 For the degree distribution of G(n, p), fix a vertex v and let Y be the number of edges
incident on v. Then
$$\mathrm{Prob}[Y = y] = \binom{n-1}{y}\, p^{y} (1 - p)^{n-1-y}$$
This is why Erdős-Rényi graphs are said to have a binomial degree distribution.
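$$\binom{n-1}{y}\, p^{y} (1 - p)^{n-1-y} \;\xrightarrow{\;n \to \infty\;}\; \underbrace{\frac{c^{y} e^{-c}}{y!}}_{\text{Poisson distribution with parameter } c}$$
where c = (n − 1)p is held fixed as n grows.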
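A quick numerical check of the Poisson limit, as a sketch in base R. The values n = 1000 and mean degree 5 are arbitrary choices for illustration, not taken from the notes.

```r
# Compare the binomial degree distribution of G(n, p) with its Poisson limit.
n <- 1000                  # illustrative number of vertices (assumption)
c_mean <- 5                # illustrative mean degree c = (n - 1) p (assumption)
p <- c_mean / (n - 1)

y <- 0:15
binom <- dbinom(y, size = n - 1, prob = p)   # Prob[Y = y] under the binomial
pois  <- dpois(y, lambda = c_mean)           # c^y e^(-c) / y!

max_gap <- max(abs(binom - pois))            # largest pointwise difference
max_gap
```

The gap shrinks further as n grows with c held fixed, which is the content of the limit above.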
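Recall the definition of the local clustering coefficient as:
$$cc(v) = \frac{\text{pairs of neighbors of } v \text{ connected by edges}}{\text{total pairs of neighbors of } v}$$
In G(n, p) each of the $\binom{d}{2}$ pairs of neighbors of a degree-d vertex is connected independently with probability p, so
$$E[\, cc(v) \mid \deg(v) = d \,] = \frac{p \binom{d}{2}}{\binom{d}{2}} = p$$
Averaging over the degree distribution gives $E[cc(v)] = \sum_d p \cdot \mathrm{Prob}[\deg(v) = d]$.
The sum of the probabilities of all possible outcomes is, of course, equal to 1, leaving our final equation as
$$p \cdot 1 = p$$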
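The result E[cc(v)] = p can be checked empirically. The following base R sketch samples one G(n, p) adjacency matrix and averages the local clustering coefficients; n = 200, p = 0.1 and the seed are arbitrary illustrative choices.

```r
# Empirical check that the average local clustering coefficient is close to p.
set.seed(42)                                   # arbitrary seed
n <- 200; p <- 0.1                             # illustrative parameters

A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, 1, p)   # one coin flip per node pair
A <- A + t(A)                                      # symmetric adjacency matrix

deg <- rowSums(A)
# diag(A^3) counts closed walks of length 3 from v, i.e. twice the number
# of connected pairs of neighbors of v
closed_pairs <- diag(A %*% A %*% A) / 2
cc <- ifelse(deg >= 2, closed_pairs / choose(deg, 2), NA)

avg_cc <- mean(cc, na.rm = TRUE)
avg_cc                                         # should be close to p
```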
Given a formal definition of the clustering coefficient cc for a random graph G(n, p) we can revisit
the 1998 paper in Nature by Watts and Strogatz[6] and now compute the cc of a corresponding
random graph.
For example, the corresponding random graph for the actors’ network would be
n = 225226
c = (n − 1)p = 61
p = 61/225225 = 0.00027
$$\frac{\ln n}{\ln c} = \log_c n, \quad \text{where } c = p(n - 1)$$
For example, consider an acquaintance network of every human being on earth, currently estimated
at 7 billion people. If every individual has, on average, 1000 acquaintances, our graph diameter is
calculated as
$$\frac{\ln (7 \times 10^{9})}{\ln 1000} \approx 3.28$$
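The estimate is a one-line computation. A small sketch in base R, where er_diameter_estimate is a hypothetical helper name introduced only for this example:

```r
# Heuristic diameter estimate ln(n) / ln(c) for a random graph with
# n vertices and mean degree c. The function name is illustrative only.
er_diameter_estimate <- function(n, c_mean) log(n) / log(c_mean)

est <- er_diameter_estimate(n = 7e9, c_mean = 1000)
round(est, 2)    # 3.28
```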
Definition 10 The diameter of a graph G(V, E), where distance(u, v) is the length of the shortest path between u and v, is
$$\max_{u,v \in V} \mathrm{distance}(u, v)$$
Remember, however, that we are acting on random graphs, meaning that the diameter is itself a random
variable. The diameter referred to here is more correctly thought of as the expected diameter of graph
G, formally stated as
$$\mathrm{Prob}\left[\, \mathrm{distance}(u, v) > \frac{\ln n}{\ln c} \,\right] \longrightarrow 0 \quad \text{as } n \to \infty$$
Note that, counter perhaps to our intuition, this expected diameter value does not lie in the middle
of roughly equal numbers of graphs with low diameters and graphs with high diameters. In
reality, as n → ∞, a diminishing fraction of graphs have a diameter larger than $\frac{\ln n}{\ln c}$. A
formal proof [1] of the expected diameter of a random graph is outside the scope of this text, but
we can construct a heuristic argument that gives some intuition into the problem.
Fix a vertex v, and let c = (n − 1)p. Divide the nodes of our graph into two sets, reached and
remaining. At each level we continue adding edges to unreached nodes, so that the number of
vertices reachable in s hops is $c^s$. Setting $c^s = n$ gives $s = \frac{\ln n}{\ln c}$.
Figure 2: A “heuristic” argument for the expected graph diameter. As our graph grows we add
unreached nodes to the reached set, in layers of size c, c^2, c^3, and so on.
As a heuristic proof, there are of course problems with this argument. Eventually our reached set will
be larger than our remaining set, for example. Next class will discuss some of these points and ways
to address them.
References
[1] B. Bollobás. Random graphs, volume 73. Cambridge Univ Pr, 2001.
[2] P. Erdős and A. Rényi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
[3] P. Erdős and A. Rényi. On the evolution of random graphs. Akad. Kiadó, 1960.
[4] E.N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.
[5] R. Solomonoff and A. Rapoport. Connectivity of random nets. Bulletin of Mathematical Biology,
13(2):107–117, 1951.
[6] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ’small-world’ networks. Nature,
393:440–442, 1998.
Module5_SemanticWeb
Modelling and aggregating social
network data
A set of triples describing two persons represented in the Turtle (Terse RDF Triple
Language) language
A graph visualization of the RDF
document
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix example: <http://www.example.org/> .

example:Rembrandt rdf:type foaf:Person .
example:Saskia rdf:type foaf:Person .
example:Rembrandt foaf:name "Rembrandt" .
example:Rembrandt foaf:mbox <mailto:rembrandt@example.org> .
example:Rembrandt foaf:knows example:Saskia .
example:Saskia foaf:name "Saskia" .
Classes and properties of the FOAF ontology
• FOAF has a vocabulary for describing personal attribute information typically found on
homepages such as name and email address of the individual, projects, interests, links to
work and school homepage etc.
FOAF Basics
• foaf:Agent
– An agent (e.g., person, group, software or physical
artifact)
– Subclasses: foaf:Person, foaf:Organization, foaf:Group
• foaf:Document
– Subclass: foaf:Image
• foaf:Person
– A person
• foaf:Project
– A project
FOAF basic properties
• foaf:family_name
• foaf:firstName
• foaf:homepage
• foaf:knows
– A person known by this person
• foaf:mbox
• foaf:mbox_sha1sum
• foaf:title
– Personal title (Mr, Mrs, Ms, Dr, etc.)
FOAF
• The idea of FOAF was to provide a machine processable
format for representing personal information described in
homepages of individuals.
• FOAF profiles can also contain a description of the individual’s
friends using the same vocabulary that is used to describe the
individual themselves.
• FOAF profiles can be linked together to form networks of web-based
profiles.
• Studies noted that the majority of FOAF profiles on the Web
are auto-generated by community sites such as LiveJournal and
Opera Communities.
• As FOAF profiles are scattered across the Web it is difficult to
estimate their number.
• FOAF started as an experimentation with Semantic Web technology.
FOAF
• FOAF became the center of interest in 2003 with the spread of Social
Networking Services such as Friendster, Orkut, LinkedIn etc.
• Despite their early popularity, a number of drawbacks were discovered.
– Firstly, the information is under the control of the database owner
who has an interest in keeping the information bound to the site and
is willing to protect the data through technical and legal means.
• The profiles stored in these systems typically cannot be exported in machine
processable formats (or cannot be exported legally) and therefore the data cannot
be transferred from one system to the next.
• As a result, the data needs to be maintained separately at different services.
– Secondly, centralized systems do not allow users to control the
information they provide on their own terms.
• Although Friendster follow-ups offer several levels of sharing (e.g. public
information vs. only for friends), users often still find out the hard way that their
information was used in ways that were not intended.
Create your own FOAF
• http://www.ldodds.com/foaf/foaf-a-matic
– Fill in your details
– It will create FOAF in RDF
• Publish your FOAF description
– Save your FOAF RDF file somewhere on your website, usually named “foaf.rdf”
FOAF Example in RDF/XML
FOAF Conclusions
• Vocabulary for machine-processable personal
homepages
• Currently some preliminary tools available
• Not yet as successful as social networks such
as Friendster, which use proprietary central
data
• Advantage of FOAF: decentralized, could serve
as an exchange format between those existing
networks, and exists on its own
References
• http://www.foaf-project.org/
• http://xmlns.com/foaf/0.1/
Module5_SemanticWeb
Introduction
Reference: Peter Mika, Social
Networks and the Semantic Web
Introduction
• Most of the web’s content is designed for humans to
read and not for computer programs to process
meaningfully.
• Computer programs can
- parse the web pages
- perform routine processing
• In general, they have no reliable method to understand
and process the semantics.
• The Semantic Web brings structure to the meaningful
content of the web pages, creating an environment
where software agents roaming from page to page
carry out sophisticated tasks for users.
Introduction (cont’d)
• The Semantic Web is a major research initiative
of the World Wide Web Consortium (W3C) to
create a metadata-rich Web of resources that
can describe themselves not only by how they
should be displayed (HTML) or syntactically (XML),
but also by the meaning of the metadata.
• “The Semantic Web is an extension of the current
web in which information is given well-defined
meaning, better enabling computers and people
to work in cooperation.”
– Tim Berners-Lee, James Hendler, Ora Lassila,
Introduction (cont’d)
• Difficulty finding, presenting, accessing, or
maintaining available electronic information on
the web.
• Need for a data representation to enable
software products (agents) to provide
intelligent access to heterogeneous and
distributed information.
The Semantic Web: why?
• Difficulty in searching on the Web
– due to the way in which information is stored on the
Web
• Problem 1: Web documents do not distinguish
between information content and presentation
(“solved” by XML)
• Problem 2: Different web documents may
represent in different ways semantically related
pieces of information
• This leads to hard problems for “intelligent”
information search on the Web
Separating content and presentation
• Problem 1: web documents do not distinguish
between information content and
presentation
– problem due to the HTML language
– problem “solved” by technologies like
• stylesheets (HTML, XML)
• XML
– Stylesheets allow for separating formatting attributes from
the information presented
XML
• XML: eXtensible Mark-up Language
• XML documents are written through a user defined set
of tags
• XML lets everyone create their own tags.
• These tags can be used by scripts in sophisticated
ways to perform various tasks, but the script writer has
to know what the page writer uses each tag for.
– XML allows users to add arbitrary structure to documents
but says nothing about what the structures mean.
• It has no built-in mechanism to convey the meaning of
the user’s new tags to other users.
XML: example
• HTML:
<H1>Seminar on Data Analytics </H1>
<UL>
<LI>Teacher: Max Plank
<LI>Room: 7
<LI>Prerequisites: none
</UL>
• XML:
<course>
<title> Seminar on Data Analytics </title>
<teacher> Max Plank </teacher>
<room> 7 </room>
<prereq> none</prereq>
</course>
Limitations of XML
• XML does not solve all the problems:
– different XML documents may express
information with the same meaning using
different tags
The need for a “Semantic” Web
• Problem 2: Different web documents may
represent in different ways semantically related
pieces of information
– different XML documents do not share the semantics
of information
• Idea: annotate (mark-up) pieces of information to
express the “meaning” of such a piece of
information
- the meaning of such tags is shared
⇒shared semantics
The Semantic Web initiative
• The Semantic Web provides a common
framework that allows data to be shared and
reused across application, enterprise and
community boundaries.
• Published using languages specifically
designed for data:
– Resource Description Framework (RDF)
– Web Ontology Language (OWL)
– Extensible Markup Language (XML)
Example
• An example of a tag that would be used in a
non-semantic web page:
<item> blog </item>
• Encoding similar information in a semantic web
page might look like this:
• OWL Lite is a subset of OWL DL, which in turn is a subset of OWL Full.
The proof/rule layer
Beyond OWL:
• Rule: informal notion
• Rules are used to perform inference over
ontologies
• Rules -> a tool for capturing further
knowledge
(not expressible in OWL ontologies )
The Trust layer
• SW top layer:
– where does the information come from?
– how is this information obtained?
– can I trust this information?
Ontologies: example
Evolution of Semantic Web
The Semantic Web Tower
Module5_Text Mining
Data Mining
• Data mining is the computing process of
discovering patterns in large data sets involving
methods at the intersection of machine learning,
statistics, and database systems.
y (sales) 12 19 29 37 45
b = (1/n)(Σy - a Σx)
Linear Regression - Example
• Let us change the variable x into t such
that t = x - 2005 and therefore t represents
the number of years after 2005. The table
of values becomes.
t (years after 2005) 0 1 2 3 4
y (sales) 12 19 29 37 45
Linear Regression - Example
• We now use the table to calculate a and b
included in the least squares regression line
formula.
t y ty t2
0 12 0 0
1 19 19 1
2 29 58 4
3 37 111 9
4 45 180 16
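The column sums of the table give a and b directly. The sketch below assumes the standard least squares slope formula a = (nΣty − ΣtΣy) / (nΣt² − (Σt)²), which is not shown in the surviving slides, together with the intercept formula b = (1/n)(Σy − aΣt) given earlier, and cross-checks the result against lm().

```r
# Least squares line y = a*t + b from the table sums (a: slope, b: intercept).
years <- c(0, 1, 2, 3, 4)          # t, years after 2005
sales <- c(12, 19, 29, 37, 45)     # y
n <- length(years)

# Slope: standard least squares formula (assumed, not shown in the slides)
a <- (n * sum(years * sales) - sum(years) * sum(sales)) /
     (n * sum(years^2) - sum(years)^2)
# Intercept: b = (1/n)(sum(y) - a * sum(t))
b <- (sum(sales) - a * sum(years)) / n
c(a = a, b = b)                    # a = 8.4, b = 11.6

fit <- lm(sales ~ years)           # same fit with R's built-in linear model
coef(fit)
```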
Example
Lift
• Lift is defined as the ratio of the observed
support to that expected if X and Y were independent:
lift(X → Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
The Apriori Algorithm (Pseudo-Code)
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates
        in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
Apriori Algorithm
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
Example
• Consider the following database:
• alpha beta epsilon
• alpha beta theta
• alpha beta epsilon
• alpha beta theta
• Algorithm:
Example
Association Rule Mining
Generating Association Rules from Frequent Itemsets
• Procedure:
• For each frequent itemset “l”, generate all nonempty subsets of l.
• For every nonempty subset s of l, output the rule “s → (l-s)” if
support_count(l) / support_count(s) >= min_conf where
min_conf is minimum confidence threshold.
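The rule-generation procedure can be sketched in base R as follows. The four transactions are the ones used in the second R example of this module; the helper names support_count and rules_from_itemset, and the threshold 0.7, are illustrative assumptions. Only proper subsets s are enumerated, since s = l would give a rule with an empty right-hand side.

```r
# Generate rules s -> (l - s) from one frequent itemset l.
transactions <- list(c(1, 3, 4), c(2, 3, 5), c(1, 2, 3, 5), c(2, 5))

# Number of transactions that contain every item of `itemset`
support_count <- function(itemset, db) {
  sum(vapply(db, function(tr) all(itemset %in% tr), logical(1)))
}

rules_from_itemset <- function(l, db, min_conf) {
  out <- list()
  for (k in seq_len(length(l) - 1)) {            # sizes of proper nonempty subsets
    for (s in combn(l, k, simplify = FALSE)) {
      conf <- support_count(l, db) / support_count(s, db)
      if (conf >= min_conf) {
        out[[length(out) + 1]] <-
          list(lhs = s, rhs = setdiff(l, s), confidence = conf)
      }
    }
  }
  out
}

rules <- rules_from_itemset(c(2, 3, 5), transactions, min_conf = 0.7)
for (r in rules) {
  cat("{", r$lhs, "} -> {", r$rhs, "}  confidence =", r$confidence, "\n")
}
```

With min_conf = 0.7 only {2, 3} → {5} and {3, 5} → {2} survive (confidence 1); the other four candidate rules have confidence 2/3.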
R Example
library(arules)
trans<-list(c("A","B","C"), c ("B","C"), c("A","B","D"), c("A","B","C","D"), c("A"),
c("B"))
names(trans)<-paste("Tr", c(1:6), sep="")
trans
rules<-apriori(trans,parameter=list(supp=.02, conf=.5, target="rules"))
inspect(head(rules,n=20))
R Example
library(arules)
trans<-list(c(1,3,4), c(2,3,5), c(1,2,3,5), c(2,5))
names(trans)<-paste("Tr", c(1:4), sep="")
trans
# target="frequent itemsets" mines itemsets instead of rules
rules<-apriori(trans,parameter=list(supp=.02,
conf=.5,target="frequent itemsets"))
inspect(head(rules,n=20))
Apriori Algorithm in Social Networks
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem
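$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$
and, for a continuous-valued attribute, $P(x_k \mid C_i)$ is
$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}), \qquad g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$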
Naïve Bayes Classifier - Example
• Class:
  • C1:buys_computer = ‘yes’
  • C2:buys_computer = ‘no’
• Instance to be classified:
  X = (age <=30, Income = medium, Student = yes, Credit_rating = fair)
• Dataset
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayes Classifier - Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
         P(buys_computer = “no”) = 5/14 = 0.357
• Compute P(X|Ci) for each class
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
  P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
            P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
  P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
                  P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
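The hand computation can be reproduced in base R. This is a minimal sketch for categorical attributes only; the data frame d encodes the 14 tuples of the example, and the helper name score is an assumption made for illustration.

```r
# Naive Bayes by hand on the buys_computer example (categorical attributes).
d <- data.frame(
  age = c("<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
          "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"),
  income = c("high", "high", "high", "medium", "low", "low", "low",
             "medium", "low", "medium", "medium", "medium", "high", "medium"),
  student = c("no", "no", "no", "no", "yes", "yes", "yes",
              "no", "yes", "yes", "yes", "no", "yes", "no"),
  credit_rating = c("fair", "excellent", "fair", "fair", "fair", "excellent",
                    "excellent", "fair", "fair", "fair", "excellent",
                    "excellent", "fair", "excellent"),
  buys_computer = c("no", "no", "yes", "yes", "yes", "no", "yes",
                    "no", "yes", "yes", "yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Instance to classify
x <- c(age = "<=30", income = "medium", student = "yes", credit_rating = "fair")

# P(X | Ci) * P(Ci), with P(X | Ci) as the product of per-attribute frequencies
score <- function(cls) {
  rows <- d[d$buys_computer == cls, ]
  prior <- nrow(rows) / nrow(d)
  likelihood <- prod(vapply(names(x),
                            function(a) mean(rows[[a]] == x[[a]]),
                            numeric(1)))
  prior * likelihood
}

scores <- vapply(c("yes", "no"), score, numeric(1))
round(scores, 3)                  # yes: 0.028, no: 0.007
names(which.max(scores))          # "yes"
```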
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be
non-zero. Otherwise, the predicted prob. will be zero
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
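The correction in the income example amounts to adding 1 to each count and the number of distinct values to the total. A short base R sketch using the counts from the example:

```r
# Laplacian correction: add 1 to every count before normalizing.
counts <- c(low = 0, medium = 990, high = 10)

uncorrected <- counts / sum(counts)
corrected   <- (counts + 1) / (sum(counts) + length(counts))

rbind(uncorrected, corrected)     # corrected = 1/1003, 991/1003, 11/1003
```

No corrected estimate is zero, so the product over attributes can no longer collapse to zero because of a single unseen value.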
Naïve Bayes Classifier: Comments
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore loss of
accuracy
• Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer,
diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks
Example
• Classify "What is the price of the book", using the dataset given below
Example
• Classify the text "A very close game" using the dataset given below.
Regression - Example
• Apply kNN and score for the text “This is a unique abstract” using the dataset given below.
K-Means Clustering Example
• K-Means Clustering-
Step-01:
Step-02:
•Calculate the distance between each data point and each cluster center.
•The distance may be calculated either by using a given distance function or by using the Euclidean
distance formula.
Step-04:
Step-05:
Point-01:
Disadvantages-
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Iteration-01:
•We calculate the distance of each point from the center of each of the three
clusters.
•The distance is calculated by using the given distance function.
In a similar manner, we calculate the distance of the other points from the
center of each of the three clusters.
From here, New clusters are-
Cluster-01:
First cluster contains points-
•A1(2, 10)
•A8(4, 9)
Cluster-02:
Second cluster contains points-
•A3(8, 4)
•A4(5, 8)
•A5(7, 5)
•A6(6, 4)
Cluster-03:
Third cluster contains points-
•A2(2, 5)
•A7(1, 2)
Now,
•We re-compute the new cluster centers.
•The new cluster center is computed by taking mean of all the points contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This is completion of Iteration-02.
After second iteration, the center of the three clusters are-
•C1(3, 9.5)
•C2(6.5, 5.25)
•C3(1.5, 3.5)
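The two iterations can be replayed in base R. This is a sketch, not R's built-in kmeans(): it uses the distance function ρ(a, b) = |x2 − x1| + |y2 − y1| from the problem statement, the eight points named in the clusters above, and the stated initial centers. The helper name manhattan is an assumption.

```r
# Two iterations of k-means with the distance |x2 - x1| + |y2 - y1|.
pts <- rbind(A1 = c(2, 10), A2 = c(2, 5), A3 = c(8, 4), A4 = c(5, 8),
             A5 = c(7, 5),  A6 = c(6, 4), A7 = c(1, 2), A8 = c(4, 9))
centers <- pts[c("A1", "A4", "A7"), ]          # initial cluster centers
rownames(centers) <- c("C1", "C2", "C3")

manhattan <- function(a, b) sum(abs(a - b))

for (iter in 1:2) {
  # Assignment step: index of the nearest center for every point
  cluster <- apply(pts, 1, function(p) {
    which.min(apply(centers, 1, function(ctr) manhattan(p, ctr)))
  })
  # Update step: each center becomes the mean of the points assigned to it
  for (k in 1:3) {
    centers[k, ] <- colMeans(pts[cluster == k, , drop = FALSE])
  }
}

centers    # C1 (3, 9.5), C2 (6.5, 5.25), C3 (1.5, 3.5)
```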
Module5_TextMining
K Mode Clustering
K Modes Clustering
K-Modes Clustering – Example 1
• Consider the given dataset:
Tuple No. X1 X2 X3 X4 X5 X6
1 AA BB AB AA AB AB
2 AB BB AB AA AB BB
3 AA AB AA AB AA AB
4 BB AA BB AB AA BB
5 AB AA AB BB BB BB
6 AA AB BB AA AB BB
7 BB BB AA AB AA AB
8 AB AB AA AB BB AB
• Let K be 2
• Let tuples 1 and 5 be the initial centroids (chosen randomly)
of the two clusters respectively.
• Centroids:
Tuple No.  X1  X2  X3  X4  X5  X6
1          AA  BB  AB  AA  AB  AB
5          AB  AA  AB  BB  BB  BB
Example 1 (cont’d)
• Let us compute the distance from each tuple to the two
cluster centroids as given below by d(X,Y):
Tuple No.  X1  X2  X3  X4  X5  X6  Distance to Cluster1  Distance to Cluster2
1          AA  BB  AB  AA  AB  AB  0                     5
2          AB  BB  AB  AA  AB  BB  2                     3
3          AA  AB  AA  AB  AA  AB  4                     6
4          BB  AA  BB  AB  AA  BB  6                     4
5          AB  AA  AB  BB  BB  BB  5                     0
6          AA  AB  BB  AA  AB  BB  3                     5
7          BB  BB  AA  AB  AA  AB  4                     6
8          AB  AB  AA  AB  BB  AB  5                     4
Tuples 1, 2, 3, 6 and 7 are closest to cluster 1; tuples 4, 5 and 8 are closest to cluster 2.
Updated centroids (the mode of each attribute over the tuples assigned to each cluster):
Cluster No.  X1  X2  X3  X4  X5  X6
1            AA  BB  AA  AA  AB  AB
2            AB  AA  BB  AB  BB  BB
• Let us compute the updated distance from each tuple to the
two cluster centroids as given below:
Tuple No.  X1  X2  X3  X4  X5  X6  Distance to Cluster1  Distance to Cluster2
1          AA  BB  AB  AA  AB  AB  1                     6
2          AB  BB  AB  AA  AB  BB  3                     4
3          AA  AB  AA  AB  AA  AB  3                     5
4          BB  AA  BB  AB  AA  BB  6                     2
5          AB  AA  AB  BB  BB  BB  6                     2
6          AA  AB  BB  AA  AB  BB  3                     4
7          BB  BB  AA  AB  AA  AB  3                     5
8          AB  AB  AA  AB  BB  AB  4                     3
Example 1 (cont’d)
• Following the updated distance
– Tuples 1,2,3,6 and 7 will fall in cluster 1.
– Tuples 4,5 and 8 will fall in cluster 2.
• Let the centroids of two clusters be updated with reference to
the tuples currently assigned to the clusters.
• New Centroids:
Cluster No.  X1  X2  X3  X4  X5  X6
1            AA  BB  AA  AA  AB  AB
2            AB  AA  BB  AB  BB  BB
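The distance and assignment steps of this example can be sketched in base R, assuming d(X, Y) is the number of attributes on which two tuples differ (simple matching). The helper name mismatches is illustrative; the mode update is left out because ties between equally frequent values need a tie-breaking rule that the slides do not state.

```r
# K-modes assignment step: count mismatching attributes to each centroid.
tuples <- rbind(
  c("AA", "BB", "AB", "AA", "AB", "AB"),
  c("AB", "BB", "AB", "AA", "AB", "BB"),
  c("AA", "AB", "AA", "AB", "AA", "AB"),
  c("BB", "AA", "BB", "AB", "AA", "BB"),
  c("AB", "AA", "AB", "BB", "BB", "BB"),
  c("AA", "AB", "BB", "AA", "AB", "BB"),
  c("BB", "BB", "AA", "AB", "AA", "AB"),
  c("AB", "AB", "AA", "AB", "BB", "AB"))
colnames(tuples) <- paste0("X", 1:6)

centroids <- tuples[c(1, 5), ]               # tuples 1 and 5 as initial centroids

mismatches <- function(x, y) sum(x != y)     # simple matching distance

# One row per tuple, one column per cluster
dist_table <- t(apply(tuples, 1, function(row) {
  apply(centroids, 1, function(ctr) mismatches(row, ctr))
}))
colnames(dist_table) <- c("Cluster1", "Cluster2")

assignment <- apply(dist_table, 1, which.min)
dist_table
assignment                                   # 1 1 1 2 2 1 1 2
```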
d(X,Y) where
X = BB AB AB AB AB AA AB BB AB BB
Y = AB BB BB AB BB BB AB AB BB AB (2nd data point)
d(X,Y)= 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 = 8
Step 2: Calculate the distances between each object and the
cluster mode; assign the object to the cluster whose center has
the shortest distance to the object; repeat this step until all
objects are assigned to clusters.
Example 2- Identify Data points belonging to
different clusters and update the new centroids
References
[1] Z. Huang, “A Fast Clustering Algorithm to
Cluster Very Large Categorical Data Sets in Data
Mining”, Proceedings of Data Mining and
Knowledge Discovery, pp. 1-6, 1997.
[2] https://openi.nlm.nih.gov/detailedresult.php?img=PMC1525209_1471-2105-7-204-17&req=4
Module6_R_Examples
Example
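install.packages("network")
library(network)
src <- c("A", "B", "C", "D", "E", "B", "A", "F")
dst <- c("B", "E", "A", "B", "B", "A", "F", "A")
# The line that builds Net is missing from the original; one plausible
# reconstruction is a directed network built from the src/dst edge list
Net <- network(cbind(src, dst), matrix.type = "edgelist", directed = TRUE)
summary(Net)
plot(Net)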
Example
library(igraph)
g <- make_ring(10)
degree(g)
plot(g)
degree_distribution(g)
#.........................................
library( igraph)
plot(make_graph(c(1, 2, 2, 3, 3, 4, 5, 6), directed = FALSE) )
plot(make_graph("Tetrahedron"))
plot(make_graph("Cubical"))
plot(make_graph("Octahedron"))
plot(make_graph("Dodecahedron"))
plot(make_graph("Icosahedron"))
#....................................
library(igraph)
g <- make_ring(10)
plot(g)
degree(g)
closeness(g)
betweenness(g)
eigen_centrality(g)
mean_distance(g)
transitivity(g)
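#....................................
library(igraph)
g <- sample_gnm(5, 8)   # G(n, m): 5 nodes, 8 edges chosen uniformly at random
plot(g)
g <- sample_pa(5)       # preferential attachment (Barabási-Albert), 5 nodes
plot(g)
g <- sample_pa(10)      # barabasi.game() is the older name for sample_pa()
plot(g)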
#....................................
karate <- make_graph("Zachary")
wc <- cluster_walktrap(karate)
modularity(wc)
membership(wc)
plot(wc, karate)
#....................................
karate <- make_graph("Zachary")
wc <- cluster_fast_greedy(karate)
modularity(wc)
membership(wc)
plot(wc, karate)
#....................................
library(igraph)
g1 <- barabasi.game(10)
plot(g1)
#....................................
library(igraph)
g2 <- barabasi.game(5)
plot(g2)
#....................................
library(igraph)
g1 <- barabasi.game(10)
g2 <- barabasi.game(5)
plot(g1)
plot(g2,add=TRUE, vertex.color="green" , edge.color="blue")
#....................................
# kNN Example
# Class A training
A1=c(0,0)
A2=c(1,1)
A3=c(2,2)
#....................................
# kMeans Example 1:
x <- c(2,4,10,12,3,20,30,11,25)
cl <- kmeans(x, 2)
cl
# kMeans Example 2:
library(graphics)
xy <- matrix(c(1.0 ,1.0, 1.5, 2.0 , 3.0, 4.0, 5.0, 7.0, 3.5, 5.0, 4.5, 5.0, 3.5, 4.5 ), byrow =
TRUE, ncol = 2)
colnames(xy) <- c("x", "y")
cl <- kmeans(xy, 2)
cl
plot( xy)
points(cl$centers, col = 2:3, pch = 9, cex = 3)
#........................................
#Create the predictor and response variable.
x <- c(0,1,2,3,4)
y <- c(12,19,29,37,45)
relation <- lm(y~x)
# Predict for x=7
a <- data.frame(x = 7)
result <- predict(relation,a)
print(result)
#........................................
library(arules)
trans<-list(c("A","B","D"), c ("B","C","E"), c("A","B","C","E"), c("B","E"))
names(trans)<-paste("Tr", c(1:4), sep="")
trans
rules<-apriori(trans,parameter=list(supp=0.5, conf=.7, target="rules", minlen=1))
inspect(head(rules,n=20))
Network visualization with R
Sunbelt 2019 Workshop, Montreal, Canada
Katherine Ognyanova, Rutgers University
Web: www.kateto.net, Twitter: ognyanova
Contents
1 Introduction: network visualization
2 Colors in R plots
1 Introduction: network visualization
The main concern in designing a network visualization is the purpose it has to serve. What are the
structural properties that we want to highlight? What are the key concerns we want to address?
Network maps are far from the only visualization available for graphs - other network representation
formats, and even simple charts of key characteristics, may be more appropriate in some cases.
In network maps, as in other visualization formats, we have several key elements that control the
outcome. The major ones are color, size, shape, and position.
[Figure panels: Color, Position, Size, Shape]
Modern graph layouts are optimized for speed and aesthetics. In particular, they seek to minimize
overlaps and edge crossing, and ensure similar edge length across the graph.
Layout aesthetics
Note: You can download all workshop materials here, or visit kateto.net/sunbelt2019.
This tutorial uses several key packages that you will need to install in order to follow along. Other
packages will be mentioned along the way, but those are not critical and can be skipped.
The main packages we are going to use are igraph (maintained by Gabor Csardi and Tamas
Nepusz), sna & network (maintained by Carter Butts and the Statnet team), ggraph (maintained
by Thomas Lin Pedersen), visNetwork (maintained by Benoit Thieurmel), threejs (maintained by
Bryan W. Lewis), networkD3 (maintained by Christopher Gandrud), and ndtv (maintained by Skye
Bender-deMoll).
install.packages("igraph")
install.packages("network")
install.packages("sna")
install.packages("ggraph")
install.packages("visNetwork")
install.packages("threejs")
install.packages("networkD3")
install.packages("ndtv")
2 Colors in R plots
Colors are pretty, but more importantly, they help people differentiate between types of objects or
levels of an attribute. In most R functions, you can use named colors, hex, or RGB values.
In the simple base R plot chart below, x and y are the point coordinates, pch is the point symbol
shape, cex is the point size, and col is the color. To see the parameters for plotting in base R,
check out ?par.
plot(x=1:10, y=rep(5,10), pch=19, cex=3, col="dark red")
points(x=1:10, y=rep(6, 10), pch=19, cex=3, col="#557799")
points(x=1:10, y=rep(4, 10), pch=19, cex=3, col=rgb(.25, .5, .3))
You may notice that RGB here ranges from 0 to 1. While this is the R default, you can also set it
to the 0-255 range using something like rgb(10, 100, 100, maxColorValue=255).
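As a small sketch of what rgb() actually returns in the two ranges: it produces a hex color string, which is why hex and RGB specifications are interchangeable in col. The variable names below are arbitrary.

```r
# rgb() returns a hex string; maxColorValue switches the input scale.
hex_255 <- rgb(10, 100, 100, maxColorValue = 255)   # 0-255 scale
hex_255                                             # "#0A6464"

hex_alpha <- rgb(.25, .5, .3, alpha = .5)           # 0-1 scale with transparency
nchar(hex_alpha)                                    # 9: "#" plus RRGGBBAA
```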
We can set the opacity/transparency of an element using the parameter alpha (range 0-1):
plot(x=1:5, y=rep(5,5), pch=19, cex=12, col=rgb(.25, .5, .3, alpha=.5), xlim=c(0,6))
If we have a hex color representation, we can set the transparency alpha using adjustcolor from
package grDevices. For fun, let’s also set the plot background to gray using the par() function
for graphical parameters. We won’t do that below, but we could set the margins of the plot with
par(mar=c(bottom, left, top, right)), or tell R not to clear the previous plot before adding a
new one with par(new=TRUE).
par(bg="gray40")
col.tr <- grDevices::adjustcolor("#557799", alpha=0.7)
plot(x=1:5, y=rep(5,5), pch=19, cex=12, col=col.tr, xlim=c(0,6))
If you plan on using the built-in color names, here’s how to list all of them:
colors() # List all named colors
grep("blue", colors(), value=T) # Colors that have "blue" in the name
In many cases, we need a number of contrasting colors, or multiple shades of a color. R comes with
some predefined palette function that can generate those for us. For example:
pal1 <- heat.colors(5, alpha=1) # 5 colors from the heat palette, opaque
pal2 <- rainbow(5, alpha=.5) # 5 colors from the rainbow palette, transparent
plot(x=1:10, y=1:10, pch=19, cex=5, col=pal1)
We can also generate our own gradients using colorRampPalette. Note that colorRampPalette
returns a function that we can use to generate as many colors from that palette as we need.
palf <- colorRampPalette(c("gray80", "dark red"))
plot(x=10:1, y=1:10, pch=19, cex=5, col=palf(10))
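To make the point about colorRampPalette returning a function more concrete, here is a minimal sketch in base R. The variable names cols3, cols20, and cols3.tr are made up for this example.

```r
# colorRampPalette() returns a function, not a vector of colors
palf <- colorRampPalette(c("gray80", "dark red"))
is.function(palf)

# Call the returned function with the number of colors you need
cols3  <- palf(3)
cols20 <- palf(20)

# The generated hex colors can be made transparent with adjustcolor()
cols3.tr <- adjustcolor(cols3, alpha.f=0.5)

plot(x=1:20, y=rep(1,20), pch=15, cex=4, col=cols20, ylim=c(0,3))
points(x=c(5,10,15), y=rep(2,3), pch=19, cex=6, col=cols3.tr)
```

The same palette function can be reused for any number of colors, which is handy when the number of categories or bins is only known at run time.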
Finding good color combinations is a tough task - and the built-in R palettes are rather limited.
Thankfully there are other available packages for this:
# If you don't have RColorBrewer already, you will need to install it:
install.packages('RColorBrewer')
library('RColorBrewer')
display.brewer.all()
This package has one main function, called brewer.pal. To use it, you just need to select the
desired palette and a number of colors. Let’s take a look at some of the RColorBrewer palettes:
display.brewer.pal(8, "Set3")
display.brewer.pal(8, "Spectral")
display.brewer.pal(8, "Blues")
3 Data format, size, and preparation
In this tutorial, we will work primarily with two small example data sets. Both contain data about
media organizations. One involves a network of hyperlinks and mentions among news sources. The
second is a network of links between media venues and consumers.
While the example data used here is small, many of the ideas behind the visualizations we will
generate apply to medium and large-scale networks. This is also the reason why we will rarely use
certain visual properties such as the shape of the node symbols: those are impossible to distinguish
in larger graph maps. In fact, when drawing very big networks we may even want to hide the
network edges, and focus on identifying and visualizing communities of nodes.
At this point, the size of the networks you can visualize in R is limited mainly by the RAM of your
machine. One thing to emphasize though is that in many cases, visualizing larger networks as giant
hairballs is less helpful than providing charts that show key characteristics of the graph.
The first data set we are going to work with consists of two files, “Dataset1-Media-Example-
NODES.csv” and “Dataset1-Media-Example-EDGES.csv” (download here).
nodes <- read.csv("Dataset1-Media-Example-NODES.csv", header=T, as.is=T)
links <- read.csv("Dataset1-Media-Example-EDGES.csv", header=T, as.is=T)
Next we will convert the raw data to an igraph network object. To do that, we will use the
graph_from_data_frame() function, which takes two data frames: d and vertices.
• d describes the edges of the network. Its first two columns are the IDs of the source and the
target node for each edge. The following columns are edge attributes (weight, type, label, or
anything else).
• vertices starts with a column of node IDs. Any following columns are interpreted as node
attributes.
library('igraph')
net <- graph_from_data_frame(d=links, vertices=nodes, directed=T)
net
## [1] s01->s02 s01->s03 s01->s04 s01->s15 s02->s01 s02->s03 s02->s09
## [8] s02->s10 s03->s01 s03->s04 s03->s05 s03->s08 s03->s10 s03->s11
## [15] s03->s12 s04->s03 s04->s06 s04->s11 s04->s12 s04->s17 s05->s01
## [22] s05->s02 s05->s09 s05->s15 s06->s06 s06->s16 s06->s17 s07->s03
## [29] s07->s08 s07->s10 s07->s14 s08->s03 s08->s07 s08->s09 s09->s10
## [36] s10->s03 s12->s06 s12->s13 s12->s14 s13->s12 s13->s17 s14->s11
## [43] s14->s13 s15->s01 s15->s04 s15->s06 s16->s06 s16->s17 s17->s04
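The roles of d and vertices may be easier to see on a tiny example. The sketch below uses made-up data frames (toy.nodes and toy.links are hypothetical, not part of the tutorial data): the first two columns of the edge data frame become the edge endpoints, and the remaining columns become attributes.

```r
library(igraph)

# Hypothetical toy data: three nodes with one attribute, three weighted edges
toy.nodes <- data.frame(id=c("a","b","c"), kind=c("paper","tv","web"),
                        stringsAsFactors=FALSE)
toy.links <- data.frame(from=c("a","a","b"), to=c("b","c","c"),
                        weight=c(3,1,2), stringsAsFactors=FALSE)

toy.net <- graph_from_data_frame(d=toy.links, vertices=toy.nodes, directed=TRUE)

V(toy.net)$name     # node IDs taken from the first column of 'vertices'
V(toy.net)$kind     # remaining node columns become vertex attributes
E(toy.net)$weight   # remaining edge columns become edge attributes
```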
It is also easy to extract an edge list or matrix back from the igraph network:
# Get an edge list or a matrix:
as_edgelist(net, names=T)
as_adjacency_matrix(net, attr="weight")
Now that we have our igraph network object, let’s make a first attempt to plot it.
plot(net) # not a pretty picture!
That doesn’t look very good. Let’s start fixing things by removing the loops in the graph.
net <- simplify(net, remove.multiple = F, remove.loops = T)
We could also use simplify to combine multiple edges by summing their weights with a command
like simplify(net, edge.attr.comb=list(weight="sum","ignore")). Note, however, that this
would also combine multiple edge types (in our data: “hyperlinks” and “mentions”).
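Here is a minimal sketch of what combining edges does, using a made-up three-edge graph (g and g2 are hypothetical names): two parallel edges are merged and their weights summed, the loop is dropped, and all other edge attributes are ignored.

```r
library(igraph)

# Toy graph: two parallel a->b edges and one b->b loop
toy <- data.frame(from=c("a","a","b"), to=c("b","b","b"),
                  weight=c(1,2,5), stringsAsFactors=FALSE)
g <- graph_from_data_frame(toy, directed=TRUE)

g2 <- simplify(g, remove.multiple=TRUE, remove.loops=TRUE,
               edge.attr.comb=list(weight="sum", "ignore"))

ecount(g2)     # 1
E(g2)$weight   # 3
```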
Let’s also reduce the arrow size and remove the labels (we do that by setting them to NA):
plot(net, edge.arrow.size=.4,vertex.label=NA)
3.3 DATASET 2: matrix
Our second dataset is a network of links between news outlets and consumers. It includes two
files, “Dataset2-Media-User-Example-NODES.csv” and “Dataset2-Media-User-Example-EDGES.csv”
(download here).
nodes2 <- read.csv("Dataset2-Media-User-Example-NODES.csv", header=T, as.is=T)
links2 <- read.csv("Dataset2-Media-User-Example-EDGES.csv", header=T, row.names=1)
We can see that links2 is an adjacency matrix for a two-mode network. Two-mode or bipartite
graphs have two different types of actors and links that go across, but not within each type. Our
second media example is a network of that kind, examining links between news sources and their
consumers.
links2 <- as.matrix(links2)
dim(links2)
dim(nodes2)
head(nodes2)
head(links2)
4 Plotting networks with igraph
Plotting with igraph: the network plots have a wide set of parameters you can set. Those include
node options (starting with vertex.) and edge options (starting with edge.). A list of selected
options is included below, but you can also check out ?igraph.plotting for more information.
NODES
vertex.color Node color
vertex.frame.color Node border color
vertex.shape One of “none”, “circle”, “square”, “csquare”, “rectangle”,
“crectangle”, “vrectangle”, “pie”, “raster”, or “sphere”
vertex.size Size of the node (default is 15)
vertex.size2 The second size of the node (e.g. for a rectangle)
vertex.label Character vector used to label the nodes
vertex.label.family Font family of the label (e.g. “Times”, “Helvetica”)
vertex.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
vertex.label.cex Font size (multiplication factor, device-dependent)
vertex.label.dist Distance between the label and the vertex
vertex.label.degree The position of the label in relation to the vertex, where
0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above
EDGES
edge.color Edge color
edge.width Edge width, defaults to 1
edge.arrow.size Arrow size, defaults to 1
edge.arrow.width Arrow width, defaults to 1
edge.lty Line type, could be 0 or “blank”, 1 or “solid”, 2 or “dashed”,
3 or “dotted”, 4 or “dotdash”, 5 or “longdash”, 6 or “twodash”
edge.label Character vector used to label edges
edge.label.family Font family of the label (e.g. “Times”, “Helvetica”)
edge.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
edge.label.cex Font size for edge labels
edge.curved Edge curvature, range 0-1 (FALSE sets it to 0, TRUE to 0.5)
edge.arrow.mode Vector specifying whether edges should have arrows,
possible values: 0 no arrow, 1 back, 2 forward, 3 both
OTHER
margin Empty space margins around the plot, vector with length 4
frame if TRUE, the plot will be framed
main If set, adds a title to the plot
sub If set, adds a subtitle to the plot
asp Numeric, the aspect ratio of a plot (y/x).
palette A color palette to use for vertex color
rescale Whether to rescale coordinates to [-1,1]. Default is TRUE.
We can set the node & edge options in two ways - the first one is to specify them in the plot()
function, as we are doing below.
# Plot with curved edges (edge.curved=.1) and reduce arrow size:
# Note that using curved edges will allow you to see multiple links
# between two nodes (e.g. links going in either direction, or multiplex links)
plot(net, edge.arrow.size=.4, edge.curved=.1)
# Set the edge and node color to orange, and the node border color to white
# Replace the vertex label with the node names stored in "media"
plot(net, edge.arrow.size=.2, edge.color="orange",
vertex.color="orange", vertex.frame.color="#ffffff",
vertex.label=V(net)$media, vertex.label.color="black")
The second way to set attributes is to add them to the igraph object. Let’s say we want to color
our network nodes based on type of media, and size them based on degree centrality (more links ->
larger node). We will also change the width of the edges based on their weight.
# Generate colors based on media type:
colrs <- c("gray50", "tomato", "gold")
V(net)$color <- colrs[V(net)$media.type]
# Compute node degrees (#links) and use that to set node size:
deg <- degree(net, mode="all")
V(net)$size <- deg*3
# We could also use the audience size value:
V(net)$size <- V(net)$audience.size*0.6
It helps to add a legend explaining the meaning of the colors we used:
plot(net)
legend(x=-1.5, y=-1.1, c("Newspaper","Television", "Online News"), pch=21,
col="#777777", pt.bg=colrs, pt.cex=2, cex=.8, bty="n", ncol=1)
Sometimes, especially with semantic networks, we may be interested in plotting only the labels of
the nodes:
plot(net, vertex.shape="none", vertex.label=V(net)$media,
vertex.label.font=2, vertex.label.color="gray40",
vertex.label.cex=.7, edge.color="gray85")
Let’s color the edges of the graph based on their source node color. We can get the starting node for
each edge with the ends() igraph function. It returns the start and end vertex for edges listed in
the es parameter. The names parameter controls whether the function returns vertex names or IDs.
edge.start <- ends(net, es=E(net), names=F)[,1]
edge.col <- V(net)$color[edge.start]
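The same idea can be tried on a toy graph. The sketch below (a made-up three-node directed cycle, with the edge colors then passed to plot() via edge.color) shows how the source-node lookup lines up edge by edge.

```r
library(igraph)

# Toy directed cycle 1->2->3->1 with one color per node
g <- make_graph(c(1,2, 2,3, 3,1), directed=TRUE)
V(g)$color <- c("tomato", "gold", "gray50")

# First column of ends() is the source vertex of each edge
edge.start <- ends(g, es=E(g), names=FALSE)[,1]
edge.col <- V(g)$color[edge.start]
edge.col

# Pass the per-edge colors to the plot
plot(g, edge.color=edge.col, edge.curved=.1)
```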
4.2 Network layouts
Network layouts are simply algorithms that return coordinates for each node in a network.
For the purposes of exploring layouts, we will generate a slightly larger 100-node graph. We use
the sample_pa() function which generates a simple graph starting from one node and adding more
nodes and links based on a preset level of preferential attachment (Barabasi-Albert model).
net.bg <- sample_pa(100)
V(net.bg)$size <- 8
V(net.bg)$frame.color <- "white"
V(net.bg)$color <- "orange"
V(net.bg)$label <- ""
E(net.bg)$arrow.mode <- 0
plot(net.bg)
l <- layout_in_circle(net.bg)
plot(net.bg, layout=l)
l is simply a matrix of x, y coordinates (N x 2) for the N nodes in the graph. For 3D layouts, it has
x, y, and z coordinates (N x 3). You can easily generate your own:
l <- cbind(1:vcount(net.bg), c(1, vcount(net.bg):2))
plot(net.bg, layout=l)
This layout is just an example and not very helpful - thankfully igraph has a number of built-in
layouts, including:
# Randomly placed vertices
l <- layout_randomly(net.bg)
plot(net.bg, layout=l)
# Circle layout
l <- layout_in_circle(net.bg)
plot(net.bg, layout=l)
# 3D sphere layout
l <- layout_on_sphere(net.bg)
plot(net.bg, layout=l)
Fruchterman-Reingold is one of the most used force-directed layout algorithms out there.
Force-directed layouts try to get a nice-looking graph where edges are similar in length and cross
each other as little as possible. They simulate the graph as a physical system. Nodes are electrically
charged particles that repulse each other when they get too close. The edges act as springs that
attract connected nodes closer together. As a result, nodes are evenly distributed through the chart
area, and the layout is intuitive in that nodes which share more connections are closer to each other.
The disadvantage of these algorithms is that they are rather slow and therefore less often used in
graphs larger than ~1000 vertices.
l <- layout_with_fr(net.bg)
plot(net.bg, layout=l)
With force-directed layouts, you can use the niter parameter to control the number of iterations to
perform. The default is set at 500 iterations. You can lower that number for large graphs to get
results faster and check if they look reasonable.
l <- layout_with_fr(net.bg, niter=50)
plot(net.bg, layout=l)
The layout can also interpret edge weights. You can set the “weights” parameter which increases
the attraction forces among nodes connected by heavier edges.
ws <- c(1, rep(100, ecount(net.bg)-1))
lw <- layout_with_fr(net.bg, weights=ws)
plot(net.bg, layout=lw)
You will also notice that the Fruchterman-Reingold layout is not deterministic - different runs will
result in slightly different configurations. Saving the layout in l allows us to get the exact same
result multiple times, which can be helpful if you want to plot the time evolution of a graph, or
different relationships – and want nodes to stay in the same place in multiple plots.
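A minimal sketch of reusing a saved layout, assuming a small random graph generated just for this illustration (g and g.less are hypothetical names): the layout is computed once and passed to both plots, so nodes keep their positions even after some edges are removed.

```r
library(igraph)

set.seed(123)                       # make the random graph and layout repeatable
g <- sample_pa(30, directed=FALSE)
l <- layout_with_fr(g)              # compute the coordinates once

# A second version of the graph with the first 10 edges removed
g.less <- delete_edges(g, E(g)[1:10])

# Both plots use the same coordinates, so nodes stay in place
op <- par(mfrow=c(1,2), mar=c(1,1,1,1))
plot(g, layout=l, vertex.label=NA, vertex.size=8)
plot(g.less, layout=l, vertex.label=NA, vertex.size=8)
par(op)
```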
dev.off()
By default, the coordinates of the plots are rescaled to the [-1,1] interval for both x and y. You can
change that with the parameter rescale=FALSE and rescale your plot manually by multiplying the
coordinates by a scalar. You can use norm_coords to normalize the plot with the boundaries you
want. This way you can create more compact or spread out layout versions.
l <- layout_with_fr(net.bg)
l <- norm_coords(l, ymin=-1, ymax=1, xmin=-1, xmax=1)
par(mfrow=c(2,2), mar=c(0,0,0,0))
plot(net.bg, rescale=F, layout=l*0.4)
plot(net.bg, rescale=F, layout=l*0.6)
plot(net.bg, rescale=F, layout=l*0.8)
plot(net.bg, rescale=F, layout=l*1.0)
dev.off()
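To see what norm_coords does to the numbers themselves, here is a small sketch on a hand-made coordinate matrix (raw and scaled are made-up names): each column is rescaled independently to the requested range, and multiplying the result by a scalar then shrinks the layout around the origin.

```r
library(igraph)

# Three points with very different x and y ranges
raw <- cbind(c(0, 5, 10), c(0, 50, 100))

scaled <- norm_coords(raw, xmin=-1, xmax=1, ymin=-1, ymax=1)
scaled          # both columns become -1, 0, 1

scaled * 0.5    # a more compact version of the same layout
```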
Some layouts have 3D versions that you can use with parameter dim=3. As you might expect, a 3D
layout returns a matrix with 3 columns containing the X, Y, and Z coordinates of each node.
l <- layout_with_fr(net.bg, dim=3)
plot(net.bg, layout=l)
Another popular force-directed algorithm that produces nice results for connected graphs is Kamada
Kawai. Like Fruchterman Reingold, it attempts to minimize the energy in a spring system.
l <- layout_with_kk(net.bg)
plot(net.bg, layout=l)
Graphopt is a nice force-directed layout implemented in igraph that uses layering to help with
visualizations of large networks.
l <- layout_with_graphopt(net.bg)
plot(net.bg, layout=l)
The available graphopt parameters can be used to change the mass and electric charge of nodes, as
well as the optimal spring length and the spring constant for edges. The parameter names are charge
(defaults to 0.001), mass (defaults to 30), spring.length (defaults to 0), and spring.constant
(defaults to 1). Tweaking those can lead to considerably different graph layouts.
l1 <- layout_with_graphopt(net.bg, charge=0.02)
l2 <- layout_with_graphopt(net.bg, charge=0.00000001)
par(mfrow=c(1,2), mar=c(1,1,1,1))
plot(net.bg, layout=l1)
plot(net.bg, layout=l2)
dev.off()
The LGL algorithm is meant for large, connected graphs. Here you can also specify a root: a node
that will be placed in the middle of the layout.
plot(net.bg, layout=layout_with_lgl)
The MDS (multidimensional scaling) algorithm tries to place nodes based on some measure of
similarity or distance between them. More similar nodes are plotted closer to each other. By default,
the measure used is based on the shortest paths between nodes in the network. We can change that
by using our own distance matrix (however defined) with the parameter dist. MDS layouts are
nice because positions and distances have a clear interpretation. The problem with them is visual
clarity: nodes often overlap, or are placed on top of each other.
plot(net.bg, layout=layout_with_mds)
Let’s take a look at all available layouts in igraph:
layouts <- grep("^layout_", ls("package:igraph"), value=TRUE)[-1]
# Remove layouts that do not apply to our graph.
layouts <- layouts[!grepl("bipartite|merge|norm|sugiyama|tree", layouts)]
par(mfrow=c(3,3), mar=c(1,1,1,1))
for (layout in layouts) {
print(layout)
l <- do.call(layout, list(net))
plot(net, edge.arrow.mode=0, layout=l, main=layout) }
[Figure: the media network drawn with each available layout; panel titles include layout_as_star, layout_components, layout_in_circle, layout_with_fr, layout_with_gem, and layout_with_graphopt.]
Notice that our network plot is still not too helpful. We can identify the type and size of nodes,
but cannot see much about the structure since the links we’re examining are so dense. One way to
approach this is to see if we can sparsify the network, keeping only the most important ties and
discarding the rest.
hist(links$weight)
mean(links$weight)
sd(links$weight)
There are more sophisticated ways to extract the key edges, but for the purposes of this exercise
we’ll only keep ones that have weight higher than the mean for the network. In igraph, we can
delete edges using delete_edges(net, edges):
cut.off <- mean(links$weight)
net.sp <- delete_edges(net, E(net)[weight<cut.off])
plot(net.sp, layout=layout_with_kk)
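The thresholding step is easy to check on a toy graph. In this sketch (a made-up five-edge ring with arbitrary weights), the mean weight is 4, so the three lighter edges are dropped while all nodes are kept.

```r
library(igraph)

# Toy ring with five weighted edges
g <- make_ring(5)
E(g)$weight <- c(1, 2, 3, 4, 10)

cut.off <- mean(E(g)$weight)                      # 4
g.sp <- delete_edges(g, E(g)[weight < cut.off])   # drop edges below the mean

ecount(g.sp)   # 2
vcount(g.sp)   # 5: deleting edges never removes nodes

plot(g.sp, edge.width=E(g.sp)$weight)
```

Note that nodes left without edges stay in the graph as isolates; whether to keep or drop them is a separate decision.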
Another way to think about this is to plot the two tie types (hyperlink & mention) separately. We
will do that in section 5 of this tutorial: Plotting multiplex networks.
We can also try to make the network map more useful by showing the communities within it:
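par(mfrow=c(1,2))
# Community detection. The choice of cluster_optimal() is an assumption here;
# any igraph clustering result with a membership vector would work the same way.
clp <- cluster_optimal(net)
# Community detection results can be plotted directly:
plot(clp, net)
# We can also plot the communities without relying on their built-in plot:
V(net)$community <- clp$membership
colrs <- adjustcolor( c("gray50", "tomato", "gold", "yellowgreen"), alpha=.6)
plot(net, vertex.color=colrs[V(net)$community])
dev.off()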
Sometimes we want to focus the visualization on a particular node or a group of nodes. In our
example media network, we can examine the spread of information from focal actors. For instance,
let’s represent distance from the NYT.
The distances function returns a matrix of shortest paths from nodes listed in the v parameter to
ones included in the to parameter.
dist.from.NYT <- distances(net, v=V(net)[media=="NY Times"],
to=V(net), weights=NA)
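As a sketch of how the distances could then be turned into colors (using a made-up five-node ring rather than the media network, and a hypothetical palette named pal), each node is colored by its number of steps from the focal node:

```r
library(igraph)

# Toy ring of five nodes; node 1 is the focal node
g <- make_ring(5)
d <- distances(g, v=V(g)[1], to=V(g), weights=NA)
d <- as.vector(d)     # 0 1 2 2 1

# One color per distance value, from dark red (close) to gold (far)
pal <- colorRampPalette(c("dark red", "gold"))
col <- pal(max(d)+1)[d+1]

plot(g, vertex.color=col, vertex.label=d, vertex.label.color="white")
```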
ew[unlist(news.path$epath)] <- 4
# Generate node color variable to plot the path:
vcol <- rep("gray40", vcount(net))
vcol[unlist(news.path$vpath)] <- "gold"
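For a self-contained version of the path-highlighting idea, here is a sketch on a toy six-node ring. The graph and the endpoints (nodes 1 and 3) are made up for the example; shortest_paths() with output="both" returns both the vertices (vpath) and the edges (epath) on the path.

```r
library(igraph)

# Toy ring; find the shortest path from node 1 to node 3
g <- make_ring(6)
sp <- shortest_paths(g, from=1, to=3, output="both")

# Edge color and width: highlight edges on the path
ecol <- rep("gray80", ecount(g))
ecol[unlist(sp$epath)] <- "orange"
ew <- rep(2, ecount(g))
ew[unlist(sp$epath)] <- 4

# Node color: highlight nodes on the path
vcol <- rep("gray40", vcount(g))
vcol[unlist(sp$vpath)] <- "gold"

plot(g, vertex.color=vcol, edge.color=ecol, edge.width=ew)
```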
We can highlight the edges going into or out of a vertex, for instance the WSJ. For a single node,
use incident(); for multiple nodes, use incident_edges().
inc.edges <- incident(net, V(net)[media=="Wall Street Journal"], mode="all")
We can also point to the immediate neighbors of a vertex, say WSJ. The neighbors function
finds all nodes one step out from the focal actor. To find the neighbors for multiple nodes, use
adjacent_vertices() instead of neighbors(). To find node neighborhoods going more than one
step out, use function ego() with parameter order set to the number of steps out to go from the
focal node(s).
neigh.nodes <- neighbors(net, V(net)[media=="Wall Street Journal"], mode="out")
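The differences between these functions are easiest to see side by side. The sketch below uses a made-up four-node directed graph; the object names (nb, adj, eg, inc) are hypothetical.

```r
library(igraph)

# Toy directed graph: 1->2, 1->3, 3->4
g <- make_graph(c(1,2, 1,3, 3,4), directed=TRUE)

nb  <- neighbors(g, 1, mode="out")                 # one node: vertices 2 and 3
adj <- adjacent_vertices(g, c(1,3), mode="out")    # several nodes: a list
eg  <- ego(g, order=2, nodes=1, mode="out")        # up to two steps out
inc <- incident(g, 3, mode="all")                  # edges touching node 3

nb
adj
eg[[1]]    # includes the focal node itself
inc
```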
A way to draw attention to a group of nodes (we saw this before with communities) is to “mark”
them:
par(mfrow=c(1,2))
plot(net, mark.groups=c(1,4,5,8), mark.col="#C5E5E7", mark.border=NA)
dev.off()
4.5 Interactive plotting with tkplot
R and igraph allow for interactive plotting of networks. This might be a useful option for you if you
want to tweak slightly the layout of a small graph. After adjusting the layout manually, you can get
the coordinates of the nodes and use them for other plots.
tkid <- tkplot(net) #tkid is the id of the tkplot that will open
l <- tkplot.getcoords(tkid) # grab the coordinates from tkplot
plot(net, layout=l)
As you might remember, our second media example is a two-mode network examining links between
news sources and their consumers.
head(nodes2)
head(links2)
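# Build the two-mode igraph object from the matrix
# (newer igraph versions call this function graph_from_biadjacency_matrix()):
net2 <- graph_from_incidence_matrix(links2)
plot(net2, vertex.label=NA)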
As with one-mode networks, we can modify the network object to include the visual properties that
will be used by default when plotting the network. Notice that this time we will also change the
shape of the nodes - media outlets will be squares, and their users will be circles.
# Media outlets are blue squares, audience nodes are orange circles:
V(net2)$color <- c("steel blue", "orange")[V(net2)$type+1]
V(net2)$shape <- c("square", "circle")[V(net2)$type+1]
# Media outlets will have name labels, audience members will not:
V(net2)$label <- ""
V(net2)$label[V(net2)$type==F] <- nodes2$media[V(net2)$type==F]
V(net2)$label.cex=.6
V(net2)$label.font=2
In igraph, there is also a special layout for bipartite networks (though it doesn’t always work great,
and you might be better off generating your own two-mode layout).
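A minimal sketch of both options, assuming a toy bipartite graph built just for this example (g2, l.bip, and l.own are made-up names): layout_as_bipartite() places the two node types on two rows, and a hand-made two-row layout is only a couple of lines of base R.

```r
library(igraph)

# Toy two-mode graph: 2 nodes of one type, 3 of the other
g2 <- make_bipartite_graph(types=c(FALSE, FALSE, TRUE, TRUE, TRUE),
                           edges=c(1,3, 1,4, 2,4, 2,5))

# Built-in bipartite layout
l.bip <- layout_as_bipartite(g2)

# Hand-made alternative: x is the position within each type, y is the row
l.own <- cbind(x=ave(seq_len(vcount(g2)), V(g2)$type, FUN=seq_along),
               y=ifelse(V(g2)$type, 0, 1))

op <- par(mfrow=c(1,2), mar=c(1,1,1,1))
plot(g2, layout=l.bip, vertex.shape=c("square", "circle")[V(g2)$type+1])
plot(g2, layout=l.own, vertex.shape=c("square", "circle")[V(g2)$type+1])
par(op)
```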
Using text as nodes may be helpful at times:
plot(net2, vertex.shape="none", vertex.label=nodes2$media,
vertex.label.color=V(net2)$color, vertex.label.font=2,
vertex.label.cex=.6, edge.color="gray70", edge.width=2)
In this example, we will also experiment with the use of images as nodes. In order to do this, you
will need the png package (if missing, install with install.packages('png')).
# install.packages('png')
library('png')
plot(net2, vertex.shape="raster", vertex.label=NA,
vertex.size=16, vertex.size2=16, edge.width=2)
By the way, we can also add any image we want to a plot. For example, many network graphs can
be largely improved by a photo of a puppy in a teacup.
plot(net2, vertex.shape="raster", vertex.label=NA,
vertex.size=16, vertex.size2=16, edge.width=2)
# The numbers after the image are its coordinates
# The limits of your plotting area are given in par()$usr
We can also generate and plot bipartite projections for the two-mode network: co-memberships are
easy to calculate by multiplying the network matrix by its transposed matrix, or using igraph’s
bipartite_projection() function.
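The matrix route can be checked by hand on a tiny example. In the sketch below, the 2 x 3 matrix m is made up for illustration (rows are outlets, columns are users); multiplying it by its transpose gives the number of shared users for each pair of outlets, and the reverse product gives the number of shared outlets for each pair of users.

```r
# Toy two-mode matrix: rows are media outlets, columns are users
m <- matrix(c(1, 1, 0,
              0, 1, 1), nrow=2, byrow=TRUE,
            dimnames=list(c("NYT", "BBC"), c("Anna", "Ed", "Jo")))

co.media <- m %*% t(m)   # outlet x outlet: number of shared users
co.users <- t(m) %*% m   # user x user: number of shared outlets

co.media   # off-diagonal entry is 1: NYT and BBC share one user (Ed)
co.users
```

The diagonal of each product holds the row (or column) totals, so it is usually set to zero before treating the result as an adjacency matrix.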
par(mfrow=c(1,2))
dev.off()
In some cases, the networks we want to plot are multigraphs: they can have multiple edges connecting
the same two nodes. A related concept, the multiplex network, contains multiple types of ties. For
instance, we can represent friendship, romantic, and work relationships between individuals in a
single multiplex network.
In our example network, we also have two tie types: hyperlinks and mentions. One thing we can do
with them is plot each type of tie separately:
E(net)$width <- 1.5
plot(net, edge.color=c("dark red", "slategrey")[(E(net)$type=="hyperlink")+1],
vertex.color="gray40", layout=layout_in_circle, edge.curved=.3)
[Figure: two panels titled “Tie: Hyperlink” and “Tie: Mention”.]
dev.off()
In our example network, it so happens that we do not have node dyads connected by multiple types
of connections. That is to say, we never have both a ‘hyperlink’ and a ‘mention’ tie between the
same two news outlets. However, this could easily happen in a multiplex network.
One challenge in visualizing multigraphs is that multiple edges between the same two nodes may
get plotted on top of each other in a way that makes it impossible to see them clearly. For example,
let us generate a very simple multigraph with two nodes and three ties between them:
multigtr <- graph( edges=c(1,2, 1,2, 1,2), n=2 )
l <- layout_with_kk(multigtr)
Because all edges in the graph have the same curvature, they are drawn over each other so that
we only see one of them. What we can do is assign each edge a different curvature. One useful
function in igraph called curve_multiple can help us here. For a graph G, curve_multiple(G)
will generate a curvature for each edge that maximizes visibility.
plot(multigtr, vertex.color="lightsteelblue", vertex.frame.color="white",
vertex.size=40, vertex.shape="circle", vertex.label=NA,
edge.color=c("gold", "tomato", "yellowgreen"), edge.width=10,
edge.arrow.size=3, edge.curved=curve_multiple(multigtr), layout=l)
It is a good practice to detach packages when we stop needing them. Try to remember that especially
with igraph and the statnet family packages, as bad things tend to happen if you have them
loaded together.
detach('package:igraph')
The igraph package is only one of many available network visualization options in R. This section
provides a few quick examples illustrating other available approaches to static network visualization.
Plotting with the network package is very similar to that with igraph - although the notation is
slightly different (a whole new set of parameter names!). This package also uses fewer default controls
obtained by modifying the network object, and more explicit parameters in the plotting function.
Here is a quick example using the (by now familiar) media network. We will begin by converting
the data into the network format used by the Statnet family of packages (including network, sna,
ergm, stergm, and others).
As in igraph, we can generate a ‘network’ object from an edge list, an adjacency matrix, or an
incidence matrix. You can get the specifics with ?edgeset.constructors. Here we will use the
edge list and the node attribute data frames to create the network object. One specific thing to
pay attention to here is the ignore.eval parameter. It is set to TRUE by default, and that setting
causes the network object to disregard edge weights.
library('network')
Here again we can easily access the edges, vertices, and the network matrix:
net3[,]
net3 %n% "net.name" <- "Media Network" # network attribute
net3 %v% "media" # Node attribute
net3 %e% "type" # Edge attribute
Note that - as in igraph - the plot returns the node position coordinates. You can use them in other
plots using the coord parameter.
l <- plot(net3, vertex.cex=(net3 %v% "audience.size")/7, vertex.col="col")
plot(net3, vertex.cex=(net3 %v% "audience.size")/7, vertex.col="col", coord=l)
detach('package:network')
The network package also offers the option to edit a plot interactively, by setting the parameter
interactive=T:
plot(net3, vertex.cex=(net3 %v% "audience.size")/7, vertex.col="col", interactive=T)
For a full list of parameters that you can use in the network package, check out ?plot.network.
The ggplot2 package and its extensions are known for offering the most meaningfully structured
and advanced way to visualize data in R. In ggplot2, you can select from a variety of visual building
blocks and add them to your graphics one by one, a layer at a time.
The ggraph package takes this principle and extends it to network data. In this section, we’ll only
cover the basics without providing a detailed overview of the grammar of graphics approach. For a
deeper look, it would be best to get familiar with ggplot2 first, then learn the specifics of ggraph.
The good news is that we can use our igraph objects directly with the ggraph package. The
following code gets the data and adds separate layers for nodes and links.
library(ggraph)
library(igraph)
ggraph(net) +
geom_edge_link() + # add edges to the plot
geom_node_point() # add nodes to the plot
You will also recognize here some network layouts familiar from igraph plotting: ‘star’, ‘circle’,
‘grid’, ‘sphere’, ‘kk’, ‘fr’, ‘mds’, ‘lgl’, etc.
ggraph(net, layout="lgl") +
geom_edge_link() +
ggtitle("Look ma, no nodes!") # add title to the plot
Here we can use geom_edge_link() for straight edges, geom_edge_arc() for curved ones, and
geom_edge_fan() when we want to make sure any overlapping multiplex edges will be fanned out.
As in other packages, we can set visual properties for the network plot by using key function
parameters. For instance, nodes have color, fill, shape, size, and stroke. Edges have color,
width, and linetype. Here too the alpha parameter controls transparency.
ggraph(net, layout="lgl") +
geom_edge_fan(color="gray50", width=0.8, alpha=0.5) +
geom_node_point(color=V(net)$color, size=8) +
theme_void()
As in ggplot2, we can add different themes to the plot. For a cleaner look, you can use a minimal
or empty theme with theme_minimal() or theme_void().
ggraph(net, layout = 'linear') +
geom_edge_arc(color = "orange", width=0.7) +
geom_node_point(size=5, color="gray50") +
theme_void()
The ggraph package also uses the traditional ggplot2 way of mapping aesthetics: that is to say, of
specifying which elements of the data should correspond to different visual properties of the graphic.
This is done using the aes() function that matches visual parameters with attribute names from
the data. In the code below, the edge attribute type and node attribute audience.size are taken
from our data as they are included in the igraph object.
ggraph(net, layout="lgl") +
geom_edge_link(aes(color = type)) + # colors by edge type
geom_node_point(aes(size = audience.size)) + # size by audience size
theme_void()
One great thing about ggplot2 and ggraph you can see above is that they automatically generate
a legend which makes plots easier to interpret.
We can add a layer with node labels using geom_node_text() or geom_node_label() which
correspond to similar functions in ggplot2.
ggraph(net, layout = 'lgl') +
geom_edge_arc(color="gray", curvature=0.3) +
geom_node_point(color="orange", aes(size = audience.size)) +
geom_node_text(aes(label = media), size=2, color="gray50", repel=T) +
theme_void()
detach("package:ggraph")
While those are not discussed here, note that ggraph offers a number of other interesting ways to
represent networks, including dendrograms, treemaps, hive plots, and circle plots.
At this point it might be useful to provide a quick reminder that there are many ways to represent
a network not limited to a hairball plot.
For example, here is a quick heatmap of the network matrix:
netm <- as_adjacency_matrix(net, attr="weight", sparse=F)
colnames(netm) <- V(net)$media
rownames(netm) <- V(net)$media
Depending on what properties of the network or its nodes and edges are most important to you,
simple graphs can often be more informative than network maps.
# Plot the degree distribution for our network:
deg.dist <- degree_distribution(net, cumulative=T, mode="all")
plot( x=0:max(degree(net)), y=1-deg.dist, pch=19, cex=1.2, col="orange",
xlab="Degree", ylab="Cumulative Frequency")
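To make the output of degree_distribution() easier to read, here is a sketch on a toy star graph where the answer can be worked out by hand (g, dd, and dd.cum are made-up names). The function returns one value per degree, starting at degree 0.

```r
library(igraph)

# Star with 5 nodes: the center has degree 4, the four leaves have degree 1
g <- make_star(5, mode="undirected")
degree(g)

# Share of nodes with degree exactly k, for k = 0, 1, ..., 4
dd <- degree_distribution(g)
dd        # 0.0 0.8 0.0 0.0 0.2

# With cumulative=TRUE: share of nodes with degree k or higher
dd.cum <- degree_distribution(g, cumulative=TRUE)
dd.cum    # 1.0 1.0 0.2 0.2 0.2

plot(x=0:max(degree(g)), y=dd, pch=19, cex=1.2, col="orange",
     xlab="Degree", ylab="Frequency")
```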
If you have already installed “ndtv”, you should also have a package used by it called “animation”.
If not, now is a good time to install it with install.packages('animation'). Note that this
package provides a simple technique to create various (not necessarily network-related) animations
in R. It works by generating multiple plots and combining them in an animated GIF.
The catch here is that in order for this to work, you need not only the R package, but also an
additional program called ImageMagick (http://imagemagick.org). You probably don’t want to
install that during the workshop, but you can try it at home.
The good news is that once you figure this out, you can turn any series of R plots (network or not!)
into an animated GIF.
library('animation')
library('igraph')
We will now generate 4 network plots (the same way we did before), only this time we’ll do it
within the saveGIF command. The animation interval is set with interval, and the movie.name
parameter controls the name of the GIF.
l <- layout_with_lgl(net)
detach('package:igraph')
detach('package:animation')
These days it is fairly easy to export R plots to HTML/JavaScript output. There are a number of
packages like rCharts and htmlwidgets that can help you create interactive web charts right from
R. One thing to keep in mind though is that the network visualizations created that way are most
helpful as a starting point for further work. If you know a little bit of JavaScript, you can use them
as a first step and tweak the results to get closer to what you want.
Here we will take a quick look at visNetwork which generates interactive network visualizations using
the vis.js JavaScript library. You can install the package with install.packages('visNetwork').
We can visualize our media network right away: visNetwork() will accept our node and link data
frames. As usual, the node data frame needs to have an id column, and the link data needs to have
from and to columns denoting the start and end of each tie.
library('visNetwork')
visNetwork(nodes, links)
If we want to set specific height and width for the interactive plot, we can do that with the height
and width parameters. As is often the case in R, the title of the plot is set with the main parameter.
The subtitle and footer can be set with submain and footer respectively.
visNetwork(nodes, links, height="600px", width="100%", background="#eeefff",
main="Network", submain="And what a great network it is!",
footer= "Hyperlinks and mentions among media sources")
Like the igraph package, visNetwork allows us to set graphic properties as node or edge attributes.
We can simply add them as columns in our data before we call the visNetwork() function. Check
out the available options with:
?visNodes
?visEdges
In the following code, we are changing some of the visual parameters for nodes. We start with
the node shape (the available options for it include ellipse, circle, database, box, text, image,
circularImage, diamond, dot, star, triangle, triangleDown, square, and icon). We are also
going to change the color of several node elements. In this package, background controls the node
color, border changes the frame color; highlight sets the color on mouse click, and hover sets the
color on mouseover.
# We'll start by adding new node and edge attributes to our dataframes.
vis.nodes <- nodes
vis.links <- links
visNetwork(vis.nodes, vis.links)
vis.links$width <- 1+links$weight/8 # line width
vis.links$color <- "gray" # line color
vis.links$arrows <- "middle" # arrows: 'from', 'to', or 'middle'
vis.links$smooth <- FALSE # should the edges be curved?
vis.links$shadow <- FALSE # edge shadow
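The node settings described above can be written in the same column-by-column way as the edge settings. The snippet below is only a sketch: the toy.nodes data frame and its values are made up for illustration, while the column names (shape, shadow, borderWidth, color.background, color.border, color.highlight.background, color.highlight.border) follow the visNetwork convention of reading graphic properties from columns of the node data.

```r
# Toy node table standing in for the tutorial's `nodes` data frame (values are assumptions).
toy.nodes <- data.frame(id = c("s01", "s02", "s03"),
                        media.type = c(1, 2, 3),
                        stringsAsFactors = FALSE)

vis.toy <- toy.nodes
vis.toy$shape <- "dot"                          # node shape
vis.toy$shadow <- TRUE                          # nodes drop a shadow
vis.toy$borderWidth <- 2                        # node border width
vis.toy$color.background <- c("slategrey", "tomato", "gold")[vis.toy$media.type]
vis.toy$color.border <- "black"                 # frame color
vis.toy$color.highlight.background <- "orange"  # fill color on mouse click
vis.toy$color.highlight.border <- "darkred"     # frame color on mouse click
vis.toy
```

A data frame prepared this way could then be passed to visNetwork() in place of the plain node table.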
We can also set the visualization options directly with visNodes() and visEdges().
visnet2 <- visNetwork(nodes, links)
visnet2 <- visNodes(visnet2, shape = "square", shadow = TRUE,
color=list(background="gray", highlight="orange", border="black"))
visnet2 <- visEdges(visnet2, color=list(color="black", highlight = "orange"),
smooth = FALSE, width=2, dashes= TRUE, arrows = 'middle' )
visnet2
visNetwork offers a number of other options in the visOptions() function. For instance, we can
highlight all neighbors of the selected node (highlightNearest), or add a drop-down menu to
select subset of nodes (selectedBy). The subsets are based on a column from our data - here we
use the type label.
visOptions(visnet, highlightNearest = TRUE, selectedBy = "type.label")
visNetwork can also work with predefined groups of nodes. The visual characteristics for nodes
belonging in each group can be set with visGroups(). We can add an automatically generated
group legend with visLegend().
nodes$group <- nodes$type.label
visnet3 <- visNetwork(nodes, links)
visnet3 <- visGroups(visnet3, groupname = "Newspaper", shape = "square",
color = list(background = "gray", border="black"))
visnet3 <- visGroups(visnet3, groupname = "TV", shape = "dot",
color = list(background = "tomato", border="black"))
visnet3 <- visGroups(visnet3, groupname = "Online", shape = "diamond",
color = list(background = "orange", border="black"))
visLegend(visnet3, main="Legend", position="right", ncol=1)
detach('package:visNetwork')
Another good package exporting networks from R to javascript is threejs, which generates
interactive network visualizations using the three.js javascript library and the htmlwidgets R
package. One nice thing about threejs is that it can directly read igraph objects.
You can install the package with install.packages('threejs'). If you get errors or warnings
using this library with the latest version of R, try also installing the development version of the
htmlwidgets package which may have bug fixes that will help:
devtools::install_github('ramnathv/htmlwidgets')
The main network plotting function here, graphjs, will take an igraph object. We could use our
initial net object with a slight modification: we will delete its graph layout and let threejs generate
one on its own. We cheated a bit earlier by assigning a function to the layout attribute in the igraph
object rather than giving it a table of node coordinates. This is fine by igraph, but threejs will
not let us do it.
library(threejs)
library(htmlwidgets)
library(igraph)
Note that RStudio for Windows may not render the threejs graphics properly. We will save the
output in an HTML file and open it in a browser. Some of the parameters that we can add include
main for the plot title; curvature for the edge curvature; bg for background color; showLabels
to set labels to visible (TRUE) or not (FALSE); attraction and repulsion to set how much nodes
attract and repulse each other in the layout; opacity for node transparency (range 0 to 1); stroke
to indicate whether nodes should be framed in a black circle (TRUE) or not (FALSE), etc.
For the full list of parameters, check out ?graphjs.
gjs <- graphjs(net.js, main="Network!", bg="gray10", showLabels=F, stroke=F,
curvature=0.1, attraction=0.9, repulsion=0.8, opacity=0.9)
print(gjs)
saveWidget(gjs, file="Media-Network-gjs.html")
browseURL("Media-Network-gjs.html")
Once we open the resulting visualization in a browser, we can use the mouse scrollwheel to zoom in
and out, the left mouse button to rotate the network, and the right mouse button to pan.
We can also create simple animations with threejs by using lists of layouts, vertex colors, and edge
colors that will switch at each step.
gjs.an <- graphjs(net.js, bg="gray10", showLabels=F, stroke=F,
layout=list(layout_randomly(net.js, dim=3),
layout_with_fr(net.js, dim=3),
layout_with_drl(net.js, dim=3),
layout_on_sphere(net.js)),
vertex.color=list(V(net.js)$color, "gray", "orange",
V(net.js)$color),
main=list("Random Layout", "Fruchterman-Reingold",
"DrL layout", "Sphere" ) )
print(gjs.an)
saveWidget(gjs.an, file="Media-Network-gjs-an.html")
browseURL("Media-Network-gjs-an.html")
As an additional example, we can take a look at the Les Miserables network included with the
package:
data(LeMis)
lemis.net <- graphjs(LeMis, main="Les Miserables", showLabels=T)
print(lemis.net)
saveWidget(lemis.net, file="LeMis-Network-gjs.html")
browseURL("LeMis-Network-gjs.html")
We will also take a quick look at networkD3 which - as its name suggests - generates interactive
network visualizations using the D3 javascript library. If you do not have the networkD3 library,
install it with install.packages("networkD3").
The data that this library needs from us is in the standard edge list form, with a few little twists.
In order for things to work, the node IDs have to be numeric, and they also have to start from 0.
An easy way to get there is to transform our character IDs to a factor variable, transform that to
numeric, and make sure it starts from zero by subtracting 1.
library(networkD3)
The nodes need to be in the same order as the “source” column in links:
nodes.d3 <- cbind(idn=factor(nodes$media, levels=nodes$media), nodes)
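The ID conversion described above (character IDs to factor, factor to numeric, minus 1) can be sketched with base R alone. The toy.links data frame and the ids vector are hypothetical names used only for this illustration; they are not the tutorial's data.

```r
# Hypothetical edge list with character IDs (not the tutorial's data).
toy.links <- data.frame(from = c("s01", "s02", "s03", "s01"),
                        to   = c("s02", "s03", "s01", "s03"),
                        stringsAsFactors = FALSE)

# Use one shared set of levels so that the same ID maps to the same number in both columns.
ids <- sort(unique(c(toy.links$from, toy.links$to)))

toy.links.d3 <- data.frame(
  from = as.numeric(factor(toy.links$from, levels = ids)) - 1,
  to   = as.numeric(factor(toy.links$to,   levels = ids)) - 1)
toy.links.d3
```

Fixing the factor levels explicitly is a design choice here: it keeps the numbering consistent even when some IDs appear only as a source or only as a target.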
Now we can generate the interactive chart. The Group parameter in it is used to color the nodes.
Nodesize is not (as one might think) the size of the node, but the number of the column in the node
data that should be used for sizing. The charge parameter controls node repulsion (if negative) or
attraction (if positive).
forceNetwork(Links = links.d3, Nodes = nodes.d3, Source="from", Target="to",
NodeID = "idn", Group = "type.label",linkWidth = 1,
linkColour = "#afafaf", fontSize=12, zoom=T, legend=T,
Nodesize=6, opacity = 1, charge=-600,
width = 600, height = 600)
Here we will create D3 visualizations using the ndtv package. You should not need additional
software to produce web animations with ndtv. If you want to save the animations as video files
(see ?saveVideo), you have to install a video converter called FFmpeg (http://ffmpeg.org). To find
out how to get the right installation for your OS, check out ?install.ffmpeg. To use all available
layouts, you would also need to have Java installed on your machine.
install.packages('ndtv', dependencies=T)
As ndtv is part of the Statnet family, it will accept objects from the network package such as the
one we created earlier (net3).
library('ndtv')
net3
Most of the parameters below are self-explanatory at this point (bg is the background color of the
plot). Two new parameters we haven’t used before are vertex.tooltip and edge.tooltip. Those
contain the information that we can see when moving the mouse cursor over network elements. Note
that the tooltip parameters accept HTML tags – for example we will use the line break tag <br>.
The parameter launchBrowser instructs R to open the resulting visualization file (filename) in
the browser.
render.d3movie(net3, usearrows = F, displaylabels = F, bg="#111111",
vertex.border="#ffffff", vertex.col = net3 %v% "col",
vertex.cex = (net3 %v% "audience.size")/8,
edge.lwd = (net3 %e% "weight")/3, edge.col = '#55555599',
vertex.tooltip = paste("<b>Name:</b>", (net3 %v% 'media') , "<br>",
"<b>Type:</b>", (net3 %v% 'type.label')),
edge.tooltip = paste("<b>Edge type:</b>", (net3 %e% 'type'), "<br>",
"<b>Edge weight:</b>", (net3 %e% "weight" ) ),
launchBrowser=F, filename="Media-Network.html" )
If you are going to embed the plot in a markdown document, use output.mode='inline' above.
Animations are a good way to show the evolution of small to medium size networks over time. At
present, ndtv is the best R package for that – especially since it now has D3 capabilities and allows
easy export for the Web.
In order to work with the network animations in ndtv, we need to understand Statnet’s dynamic
network format, implemented in the networkDynamic package. The format can be used to represent
longitudinal structures, both discrete (if you have multiple snapshots of your network at different
time points) and continuous (if you have timestamps indicating when edges and/or nodes appear
and disappear from the network). The examples below will only scratch the surface of temporal
networks in Statnet - for a deeper dive, check out Skye Bender-deMoll’s Temporal network tools
tutorial and the networkDynamic package vignette.
Let’s look at one example dataset included in the package, containing simulation data based on a
network of business connections among Renaissance Florentine families:
data(short.stergm.sim)
short.stergm.sim
head(as.data.frame(short.stergm.sim))
What we see here is a temporal edge list. An edge goes from a node with ID in the tail column to
a node with ID in the head column. Edges exist from time point onset to time point terminus.
As you can see in our example, there may be multiple periods (activity spells) where an edge is
present. Each of those periods is recorded on a separate row in the data frame above.
The idea of onset and terminus censoring refers to start and end points enforced by the beginning
and end of network observation rather than by actual tie formation/dissolution.
We can simply plot the network disregarding its time component (combining all nodes and edges
that were ever present):
plot(short.stergm.sim)
We can also use network.extract() to get a network that only contains elements active at a given
point, or during a given time interval. For instance, we can plot the network at time 1 (at=1):
plot( network.extract(short.stergm.sim, at=1) )
Plot nodes and edges that were active for the entire period (rule=all) from time 1 to time 5:
plot( network.extract(short.stergm.sim, onset=1, terminus=5, rule="all") )
Plot nodes and edges that were active at any point (rule=any) between time 1 and time 10:
plot( network.extract(short.stergm.sim, onset=1, terminus=10, rule="any") )
Next, we will create and animate our own dynamic network. Dynamic network objects can be
generated in a number of ways: from a set of networks/matrices representing different time points;
from data frames/matrices with node lists and edge lists indicating when each is active, or when
they switch state. You can check out ?networkDynamic for more information.
We are going to add a time component to our media network example. The code below takes a
0-to-50 time interval and sets the nodes in the network as active throughout (time 0 to 50). The
edges of the network appear one by one, and each one is active from its first activation until time
point 50. We generate this longitudinal network using networkDynamic with our node times as
vertex.spells and edge times as edge.spells.
vs <- data.frame(onset=0, terminus=50, vertex.id=1:17)
es <- data.frame(onset=1:49, terminus=50,
head=as.matrix(net3, matrix.type="edgelist")[,1],
tail=as.matrix(net3, matrix.type="edgelist")[,2])
If we try to just plot the networkDynamic network, what we get is a combined network for the
entire time period under observation – or as it happens, our original media example.
plot(net3.dyn, vertex.cex=(net3 %v% "audience.size")/7, vertex.col="col")
One way to show the network evolution is through static images from different time points. While
we can generate those one by one as we did above, ndtv offers an easier way. The command to do
that is filmstrip(). As in the par() function controlling base R plot parameters, here mfrow sets
the number of rows and columns in the multi-plot grid.
filmstrip(net3.dyn, displaylabels=F, mfrow=c(1, 5),
slice.par=list(start=0, end=49, interval=10,
aggregate.dur=10, rule='any'))
Next, let’s generate a network animation. We can pre-compute the coordinates for it (otherwise
they get calculated when we generate the animation). Here animation.mode is the layout algorithm
- one of “kamadakawai”, “MDSJ”, “Graphviz” and “useAttribute” (user-generated coordinates).
In filmstrip() above and in the animation computation below, slice.par is a list of parameters
controlling how the network visualization moves through time. The parameter interval is the
time step between layouts, aggregate.dur is the period shown in each layout, rule is the rule for
displaying elements (e.g. any: active at any point during that period, all: active during the entire
period, etc).
compute.animation(net3.dyn, animation.mode = "kamadakawai",
slice.par=list(start=0, end=50, interval=1,
aggregate.dur=1, rule='any'))
render.d3movie(net3.dyn, usearrows = F,
displaylabels = F, label=net3 %v% "media",
bg="#ffffff", vertex.border="#333333",
vertex.cex = degree(net3)/2,
vertex.col = net3.dyn %v% "col",
edge.lwd = (net3.dyn %e% "weight")/3,
edge.col = '#55555599',
vertex.tooltip = paste("<b>Name:</b>", (net3.dyn %v% "media") , "<br>",
"<b>Type:</b>", (net3.dyn %v% "type.label")),
edge.tooltip = paste("<b>Edge type:</b>", (net3.dyn %e% "type"), "<br>",
"<b>Edge weight:</b>", (net3.dyn %e% "weight" ) ),
launchBrowser=T, filename="Media-Network-Dynamic.html",
render.par=list(tween.frames = 30, show.time = F),
plot.par=list(mar=c(0,0,0,0)), output.mode='inline' )
To embed this animation, we add the parameter output.mode='inline'.
In addition to dynamic nodes and edges, ndtv also takes dynamic attributes. We could have added
those to the es and vs data frames above. However, the plotting function can also evaluate special
parameters and generate dynamic arguments on the fly. For example, function(slice) { do some
calculations with slice } will perform operations on the current time slice of the network, allowing
us to change parameters dynamically.
See the node size below:
render.d3movie(net3.dyn, usearrows = F,
displaylabels = F, label=net3 %v% "media",
bg="#000000", vertex.border="#dddddd",
vertex.cex = function(slice){ degree(slice)/2.5 },
vertex.col = net3.dyn %v% "col",
edge.lwd = (net3.dyn %e% "weight")/3,
edge.col = '#55555599',
vertex.tooltip = paste("<b>Name:</b>", (net3.dyn %v% "media") , "<br>",
"<b>Type:</b>", (net3.dyn %v% "type.label")),
edge.tooltip = paste("<b>Edge type:</b>", (net3.dyn %e% "type"), "<br>",
"<b>Edge weight:</b>", (net3.dyn %e% "weight" ) ),
launchBrowser=T, filename="Media-Network-even-more-Dynamic.html",
render.par=list(tween.frames = 15, show.time = F), output.mode='inline',
slice.par=list(start=0, end=50, interval=4, aggregate.dur=4, rule='any'))
8 Overlaying networks on geographic maps
The example presented in this section uses only base R and mapping packages. If you have experience
with ggplot2, that package does provide a more versatile way of approaching this task. The code
using ggplot() would be similar to what you will see below, but you would use ‘borders()’ to plot
the map and ‘geom_path()’ for the edges.
In order to plot on a map, we will need a few more packages. As you will see below, maps will
let us generate a geographic map to use as background, and geosphere will help us generate arcs
representing our network edges. If you do not already have them, install the two packages, then
load them.
install.packages('maps')
install.packages('geosphere')
library('maps')
library('geosphere')
Let us plot some example maps with the maps library. The parameters of map() include col for
the map fill, border for the border color, and bg for the background color.
par(mfrow = c(2,2), mar=c(0,0,0,0))
dev.off()
The data we will use here contains US airports and flights among them. The airport file includes
geographic coordinates - latitude and longitude. If you do not have those in your data, you can use the
geocode() function from package ggmap to grab the latitude and longitude for an address.
airports <- read.csv("Dataset3-Airlines-NODES.csv", header=TRUE)
flights <- read.csv("Dataset3-Airlines-EDGES.csv", header=TRUE, as.is=TRUE)
head(flights)
head(airports)
In order to generate our plot, we will first add a map of the United States. Then we will add a point
on the map for each airport:
# Plot a map of the united states:
map("state", col="grey20", fill=TRUE, bg="black", lwd=0.1)
Next we will generate a color gradient to use for the edges in the network. Heavier edges will be
lighter in color.
col.1 <- adjustcolor("orange red", alpha=0.4)
col.2 <- adjustcolor("orange", alpha=0.4)
edge.pal <- colorRampPalette(c(col.1, col.2), alpha = TRUE)
edge.col <- edge.pal(100)
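One way to connect edge weights to this 100-step palette is to rescale each weight to an integer between 1 and 100 and use it as an index. The sketch below assumes a made-up weight vector w; in the tutorial the weights would come from the flights data, whose column names are not shown here.

```r
# Assumed edge weights, for illustration only.
w <- c(5, 20, 60, 100)

col.1 <- adjustcolor("orangered", alpha = 0.4)
col.2 <- adjustcolor("orange", alpha = 0.4)
edge.col <- colorRampPalette(c(col.1, col.2), alpha = TRUE)(100)

# Rescale weights to 1..100 so that each one can be used as a palette index.
# pmax() guards against very small weights rounding down to index 0.
edge.ind <- pmax(1, round(100 * w / max(w)))
edge.col[edge.ind]
```

Heavier edges get higher indices and therefore colors closer to the second (lighter) end of the gradient.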
For each flight in our data, we will use gcIntermediate() to generate the coordinates of the shortest
arc that connects its start and end point (think distance on the surface of a sphere). After that, we
will plot each arc over the map using lines().
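for(i in 1:nrow(flights)) {
node1 <- airports[airports$ID == flights[i,]$Source,] # airport where the flight starts
node2 <- airports[airports$ID == flights[i,]$Target,] # airport where the flight ends
# (the gcIntermediate() and lines() calls described above go here)
}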
Note that if you are plotting the network on a full world map, there might be cases when the
shortest arc goes “behind” the map – e.g. exits it on the left side and enters back on the right (since
the left-most and right-most points on the map are actually next to each other). In order to avoid
that, we can use greatCircle() to generate the full great circle (circle going through those two
points and around the globe, with a center at the center of the earth). Then we can extract from it
the arc connecting our start and end points which does not cross “behind” the map, regardless of
whether it is the shorter or the longer of the two.
This is the end of our tutorial. If you have comments, questions, or want to report typos, please
e-mail netvis@ognyanova.net. Check for updated versions of the tutorial at kateto.net.
Network Analysis and Visualization with R and igraph
Katherine Ognyanova, www.kateto.net
NetSciX 2016 School of Code Workshop, Wroclaw, Poland
Contents
2. Networks in igraph
2.1 Create networks
2.2 Edge, vertex, and network attributes
2.3 Specific graphs and graph models
5. Plotting networks with igraph
5.1 Plotting parameters
5.2 Network layouts
5.3 Improving network plots
5.4 Interactive plotting with tkplot
5.5 Other ways to represent a network
5.6 Plotting two-mode networks with igraph
Note: You can download all workshop materials here, or visit kateto.net/netscix2016.
This tutorial covers basics of network analysis and visualization with the R package igraph
(maintained by Gabor Csardi and Tamas Nepusz). The igraph library provides versatile options for
descriptive network analysis and visualization in R, Python, and C/C++. This workshop will focus
on the R implementation. You will need an R installation, and RStudio. You should also install the
latest version of igraph for R:
install.packages("igraph")
Before we start working with networks, we will go through a quick introduction/reminder of some
simple tasks and principles in R.
1.1 Assignment
x <- 3 # Assignment
x # Evaluate the expression and print result
y <- 4 # Assignment
y + 5 # Evaluation, y remains 4
rm(z) # Remove z: deletes the object.
z # Error!
We can use the standard operators <, >, <=, >=, == (equality) and != (inequality). Comparisons
return Boolean values: TRUE or FALSE (often abbreviated to just T and F).
2==2 # Equality
2!=2 # Inequality
x <= y # less than or equal: "<", ">", and ">=" also work
Inf and -Inf represent positive and negative infinity. They can be returned by mathematical
operations like division of a number by zero:
5/0
is.finite(5/0) # Check if a number is finite (it is not).
NaN (Not a Number) - the result of an operation that cannot be reasonably defined, such as dividing
zero by zero.
0/0
is.nan(0/0)
1.4 Vectors
Vectors can be constructed by combining their elements with the important R function c().
v1 <- c(1, 5, 11, 33) # Numeric vector, length 4
v2 <- c("hello","world") # Character vector, length 2 (a vector of strings)
v3 <- c(TRUE, TRUE, FALSE) # Logical vector, same as c(T, T, F)
Combining different types of elements in one vector will coerce the elements to the least restrictive
type:
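A minimal sketch of that coercion, using made-up vectors (v.mixed and v.num are names chosen only for this example):

```r
v.mixed <- c(1, "a", TRUE)   # numeric + character + logical: everything becomes character
v.mixed
class(v.mixed)

v.num <- c(1, TRUE, FALSE)   # numeric + logical: logicals become 1 and 0
v.num
class(v.num)
```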
length(v1)
length(v2)
Element-wise operations:
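v1 + rep(1, 4) # Element-wise addition of two numeric vectors of equal length
v1 + 1 # Add 1 to each element
v1 * 2 # Multiply each element by 2
v1 + c(1,7) # Lengths differ: c(1,7) is recycled to (1,7,1,7); R warns if the longer length is not a multiple of the shorter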
Mathematical operations:
Logical operations:
Vector elements:
Note that the indexing in R starts from 1, a fact known to confuse and upset people used to
languages that index from 0.
To add more elements to a vector, simply assign them values.
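The indexing and extension rules above can be tried on a small made-up vector (v is a name used only for this sketch):

```r
v <- c(10, 20, 30, 40, 50)
v[1]          # first element: indexing starts at 1, not 0
v[2:4]        # elements 2 through 4
v[c(1, 5)]    # elements 1 and 5
v[v > 25]     # elements selected by a logical condition
v[7] <- 70    # assigning past the end extends the vector; the skipped element 6 becomes NA
v
```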
1.5 Factors
eye.col.f
R will identify the different levels of the factor - i.e. all distinct values. The data is stored internally
as integers - each number corresponding to a factor level.
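As a sketch of how a character vector and a factor differ, assume six eye-color values; these particular values are an assumption, chosen so that the integer codes agree with the output shown in this section.

```r
eye.col.v <- c("brown", "green", "brown", "blue", "blue", "blue")   # character vector
eye.col.f <- factor(eye.col.v)                                     # factor built from it
levels(eye.col.f)       # the distinct values, sorted
as.numeric(eye.col.f)   # integer codes: 1 = blue, 2 = brown, 3 = green
```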
## [1] 2 3 2 1 1 1
as.numeric(eye.col.v) # The character vector can not be coerced to numeric
## [1] NA NA NA NA NA NA
as.character(eye.col.f)
as.character(eye.col.v)
## [1] 5 4
m <- matrix(1:10,10,10)
# Are elements in row 1 equivalent to corresponding elements from column 1:
m[1,]==m[,1]
# A logical matrix: TRUE for m elements >3, FALSE otherwise:
m>3
# Selects only TRUE elements - that is ones greater than 3:
m[m>3]
t(m) # Transpose m
m <- t(m) # Assign m the transposed m
m %*% t(m) # %*% does matrix multiplication
m * m # * does element-wise multiplication
Arrays are used when we have more than 2 dimensions. We can create them using the array()
function:
1.7 Lists
Lists are collections of objects. A single list can contain all kinds of elements - character strings,
numeric vectors, matrices, other lists, and so on. The elements of lists are often named for easier
access.
l3 <- list()
l4 <- NULL
Since we added element 3 to the list l4 above, elements 1 and 2 will be generated and empty (NULL).
l1[[5]] <- "More elements!" # The list l1 had 4 elements, we're adding a 5th here.
l1[[8]] <- 1:11
We added an 8th element, but not 6th and 7th to the list l1 above. Elements number 6 and 7 will
be created empty (NULL).
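The list behavior described here can be reproduced with a short sketch. The lists l.demo and l.gap and their contents are hypothetical, used only to illustrate named access and the NULL placeholders that appear when positions are skipped.

```r
l.demo <- list(boo = 1:3, foo = "a string", moo = TRUE)   # made-up contents
l.demo[["boo"]]   # access an element by name with [[ ]]
l.demo$foo        # ... or with $
l.demo[[1]]       # ... or by position

l.gap <- list()
l.gap[[3]] <- c(22, 23)   # positions 1 and 2 are created and left as NULL
length(l.gap)
l.gap
```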
l1$Something <- "A thing" # Adds a ninth element - "A thing", named "Something"
The data frame is a special kind of list used for storing dataset tables. Think of rows as cases,
columns as variables. Each column is a vector or factor.
Creating a dataframe:
In versions of R before 4.0.0, data.frame() converts character columns to factors by default, so R
treats dfr1$FirstName as a categorical variable (a factor), not a character vector. In that case, let’s
get rid of the factor by telling R to treat ‘FirstName’ as a character vector:
Alternatively, you can tell R you don’t like factors from the start using stringsAsFactors=FALSE
(which is the default since R 4.0.0).
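A minimal sketch of creating such a data frame follows. The name dfr.demo and the values in it are assumptions made for illustration; the ages were picked so that the mean age of the female rows is 49.5, matching the result quoted in this section.

```r
dfr.demo <- data.frame(ID = 1:4,
                       FirstName = c("John", "Jim", "Jane", "Jill"),
                       Female = c(FALSE, FALSE, TRUE, TRUE),
                       Age = c(22, 33, 44, 55),
                       stringsAsFactors = FALSE)   # keep FirstName as character
class(dfr.demo$FirstName)
str(dfr.demo)
```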
Find the names of everyone over the age of 30 in the data:
dfr1[dfr1$Age>30,2]
mean ( dfr1[dfr1$Female==TRUE,4] )
## [1] 49.5
The controls and loops in R are fairly straightforward (see below). They determine if a block of
code will be executed, and how many times. Blocks of code in R are enclosed in curly brackets {}.
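As an illustration, the following sketch uses an if/else statement and a for loop. The starting values x = 5 and y = 10 are assumptions, chosen so that the three results are 2, 15 and 120.

```r
x <- 5; y <- 10
if (x == 0) y <- 0 else y <- y / x   # if (condition) expr1 else expr2
y

ASum <- 0; AProd <- 1
for (i in 1:x) {                     # for (variable in sequence) { block }
  ASum <- ASum + i
  AProd <- AProd * i
}
ASum    # same as sum(1:x)
AProd   # same as prod(1:x)
```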
## [1] 2
## [1] 15
## [1] 120
1.10 R plots and colors
In most R functions, you can use named colors, hex, or RGB values. In the simple base R plot chart
below, x and y are the point coordinates, pch is the point symbol shape, cex is the point size, and
col is the color. To see the parameters for plotting in base R, check out ?par
You may notice that RGB here ranges from 0 to 1. While this is the R default, you can also set it
to the 0-255 range using something like rgb(10, 100, 100, maxColorValue=255).
We can set the opacity/transparency of an element using the parameter alpha (range 0-1):
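A short sketch of both points, with arbitrary color values (col.255 and col.half are names used only here): rgb() returns a hex string, and adding alpha appends two more hex digits for the opacity.

```r
col.255 <- rgb(10, 100, 100, maxColorValue = 255)   # components given on the 0-255 scale
col.255

col.half <- rgb(0.25, 0.5, 0.3, alpha = 0.5)        # semi-transparent color on the 0-1 scale
col.half
nchar(col.half)   # "#" + 6 color digits + 2 alpha digits
```

The resulting strings can be passed to the col argument of plot() like any other color.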
If we have a hex color representation, we can set the transparency alpha using adjustcolor from
package grDevices. For fun, let’s also set the plot background to gray using the par() function for
graphical parameters.
par(bg="gray40")
col.tr <- grDevices::adjustcolor("#557799", alpha=0.7) # hex colors need the leading "#"
plot(x=1:5, y=rep(5,5), pch=19, cex=12, col=col.tr, xlim=c(0,6))
If you plan on using the built-in color names, here’s how to list all of them:
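For instance, the base R function colors() returns every built-in name, and grep() can narrow the list down. The variable blues is just a name for this sketch.

```r
head(colors())      # the first few built-in color names
length(colors())    # how many names there are
blues <- grep("blue", colors(), value = TRUE)   # all names containing "blue"
head(blues)
```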
In many cases, we need a number of contrasting colors, or multiple shades of a color. R comes with
some predefined palette functions that can generate those for us. For example:
pal1 <- heat.colors(5, alpha=1) # 5 colors from the heat palette, opaque
pal2 <- rainbow(5, alpha=.5) # 5 colors from the rainbow palette, transparent
plot(x=1:10, y=1:10, pch=19, cex=5, col=pal1)
We can also generate our own gradients using colorRampPalette. Note that colorRampPalette
returns a function that we can use to generate as many colors from that palette as we need.
palf <- colorRampPalette(c("gray80", "dark red"))
plot(x=10:1, y=1:10, pch=19, cex=5, col=palf(10))
1.11 R troubleshooting
While I generate many (and often very creative) errors in R, there are three simple things that will
most often go wrong for me. Those include:
1) Capitalization. R is case sensitive - a graph vertex named “Jack” is not the same as one
named “jack”. The function rowSums won’t work if spelled as rowsums or RowSums.
2) Object class. While many functions are willing to take anything you throw at them, some will
still surprisingly require a character vector or a factor instead of a numeric vector, or a matrix
instead of a data frame. Functions will also occasionally return results in an unexpected
format.
3) Package namespaces. Occasionally problems will arise when different packages contain
functions with the same name. R may warn you about this by saying something like “The
following object(s) are masked from ‘package:igraph’” as you load a package. One way to deal
with this is to call functions from a package explicitly using ::. For instance, if function
blah() is present in packages A and B, you can call A::blah and B::blah. In other cases
the problem is more complicated, and you may have to load packages in certain order, or not
use them together at all. For example (and pertinent to this workshop), igraph and Statnet
packages cause some problems when loaded at the same time. It is best to detach one before
loading the other.
For more advanced troubleshooting, check out try(), tryCatch(), and debug().
2. Networks in igraph
The code below generates an undirected graph with three edges. The numbers are interpreted as
vertex IDs, so the edges are 1–>2, 2–>3, 3–>1.
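A graph like the one described could be built as in the sketch below. This is an assumption about how g1 was created, written with make_graph() from current igraph releases; the edge vector is read in pairs of vertex IDs.

```r
library(igraph)

# Vertex IDs are read in pairs: (1,2), (2,3), (3,1).
g1 <- make_graph(edges = c(1, 2, 2, 3, 3, 1), n = 3, directed = FALSE)
vcount(g1)        # number of vertices
ecount(g1)        # number of edges
is_directed(g1)   # FALSE for an undirected graph
```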
class(g1)
## [1] "igraph"
g1
## IGRAPH U--- 3 3 --
## + edges:
## [1] 1--2 2--3 1--3
g2
## IGRAPH D--- 10 3 --
## + edges:
## [1] 1->2 2->3 3->1
g3 <- graph( c("John", "Jim", "Jim", "Jill", "Jill", "John")) # named vertices
# When the edge list has vertex names, the number of nodes is not needed
plot(g3)
g3
## IGRAPH DN-- 3 3 --
## + attr: name (v/c)
## + edges (vertex names):
## [1] John->Jim Jim ->Jill Jill->John
g4 <- graph( c("John", "Jim", "Jim", "Jack", "Jim", "Jack", "John", "John"),
isolates=c("Jesse", "Janis", "Jennifer", "Justin") )
# In named graphs we can specify isolates by providing a list of their names.
Small graphs can also be generated with a description of this kind: - for undirected tie, +- or -+
for directed ties pointing left & right, ++ for a symmetric tie, and “:” for sets of vertices.
plot(graph_from_literal(a--+b, b+--c))
plot(graph_from_literal(a+-+b, b+-+c))
plot(graph_from_literal(a:b:c---c:d:e))
g4[]
g4[1,]
Add attributes to the network, vertices, or edges:
Examine attributes:
edge_attr(g4)
## $type
## [1] "email" "email" "email" "email"
##
## $weight
## [1] 10 10 10 10
vertex_attr(g4)
## $name
## [1] "John" "Jim" "Jack" "Jesse" "Janis" "Jennifer"
## [7] "Justin"
##
## $gender
## [1] "male" "male" "male" "male" "female" "female" "male"
graph_attr(g4)
## named list()
Another way to set attributes (you can similarly use set_edge_attr(), set_vertex_attr(), etc.):
graph_attr_names(g4)
graph_attr(g4, "name")
graph_attr(g4)
## $name
## [1] "Email Network"
##
## $something
## [1] "A thing"
## $name
## [1] "Email Network"
The graph g4 has two edges going from Jim to Jack, and a loop from John to himself. We can
simplify our graph to remove loops & multiple edges between the same nodes. Use edge.attr.comb
to indicate how edge attributes are to be combined - possible options include sum, mean, prod
(product), min, max, first/last (selects the first/last edge’s attribute). Option “ignore” says the
attribute should be disregarded and dropped.
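The effect of edge.attr.comb can be sketched on a small stand-in for g4. The graph g.multi below and its attribute values are assumptions made for illustration: it has a duplicated Jim-to-Jack edge and a John self-loop, and every edge carries a weight of 10 and a type of "email".

```r
library(igraph)

# Small stand-in graph: one duplicated edge and one self-loop.
g.multi <- make_graph(c("John", "Jim", "Jim", "Jack", "Jim", "Jack", "John", "John"))
E(g.multi)$weight <- 10
E(g.multi)$type <- "email"

# Merge duplicate edges, drop loops, sum the weights and discard the type attribute.
g.simple <- simplify(g.multi, remove.multiple = TRUE, remove.loops = TRUE,
                     edge.attr.comb = list(weight = "sum", type = "ignore"))
ecount(g.simple)
E(g.simple)$weight
edge_attr_names(g.simple)
```

Setting remove.loops = FALSE instead would keep the self-loop while still merging the duplicated edge.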
g4s
The two numbers that follow (7 5) refer to the number of nodes and edges in the graph. The
description also lists node & edge attributes, for example:
Empty graph
eg <- make_empty_graph(40)
plot(eg, vertex.size=10, vertex.label=NA)
Full graph
fg <- make_full_graph(40)
plot(fg, vertex.size=10, vertex.label=NA)
Star graph
st <- make_star(40)
plot(st, vertex.size=10, vertex.label=NA)
Tree graph
tr <- make_tree(40, children = 3, mode = "undirected")
plot(tr, vertex.size=10, vertex.label=NA)
Ring graph
rn <- make_ring(40)
plot(rn, vertex.size=10, vertex.label=NA)
Watts-Strogatz small-world model
Creates a lattice (with dim dimensions and size nodes across dimension) and rewires edges randomly
with probability p. The neighborhood in which edges are connected is nei. You can allow loops
and multiple edges.
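A small-world graph with these parameters might be generated as in the following sketch. The parameter values and the seed are arbitrary choices for illustration; sample_smallworld() is the igraph function that takes dim, size, nei and p.

```r
library(igraph)

set.seed(42)   # arbitrary seed, only so that the sketch is repeatable
sw <- sample_smallworld(dim = 2, size = 10, nei = 1, p = 0.1)
vcount(sw)     # size^dim nodes
ecount(sw)
```

The result can be plotted like the other graphs in this section, for example with plot(sw, vertex.size=6, vertex.label=NA).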
igraph can also give you some notable historical graphs. For instance:
Rewiring a graph
each_edge() is a rewiring method that changes the edge endpoints uniformly randomly with a
probability prob.
rn.neigh = connect.neighborhood(rn, 5)
plot(rn.neigh, vertex.size=8, vertex.label=NA)
Combine graphs (disjoint union, assuming separate vertex sets): %du%
3. Reading network data from files
In the following sections of the tutorial, we will work primarily with two small example data sets.
Both contain data about media organizations. One involves a network of hyperlinks and mentions
among news sources. The second is a network of links between media venues and consumers. While
the example data used here is small, many of the ideas behind the analyses and visualizations we
will generate apply to medium and large-scale networks.
The first data set we are going to work with consists of two files, “Media-Example-NODES.csv” and
“Media-Example-EDGES.csv”.
head(nodes)
head(links)
nrow(nodes); length(unique(nodes$id))
nrow(links); nrow(unique(links[,c("from", "to")]))
Notice that there are more links than unique from-to combinations. That means we have cases
in the data where there are multiple links between the same two nodes. We will collapse all links
of the same type between the same two nodes by summing their weights, using aggregate() by
“from”, “to”, & “type”. We don’t use simplify() here so as not to collapse different link types.
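The collapsing step can be sketched as follows. The three-row links data frame here is a made-up stand-in for the real edge file, and the formula interface of aggregate() is one of several equivalent ways to write it:

```r
# Hypothetical stand-in for the edge list: the first two rows duplicate
# the same from/to/type combination.
links <- data.frame(from   = c("s01", "s01", "s02"),
                    to     = c("s02", "s02", "s03"),
                    type   = c("hyperlink", "hyperlink", "mention"),
                    weight = c(10, 12, 5))

# Sum the weights of rows that share the same from, to, and type.
links <- aggregate(weight ~ from + to + type, data = links, FUN = sum)
links
```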
Two-mode or bipartite graphs have two different types of actors and links that go across, but not
within each type. Our second media example is a network of that kind, examining links between
news sources and their consumers.
head(nodes2)
head(links2)
We start by converting the raw data to an igraph network object. Here we use igraph’s
graph.data.frame function, which takes two data frames: d and vertices.
• d describes the edges of the network. Its first two columns are the IDs of the source and the
target node for each edge. The following columns are edge attributes (weight, type, label, or
anything else).
• vertices starts with a column of node IDs. Any following columns are interpreted as node
attributes.
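A minimal sketch of the conversion, using two tiny hypothetical data frames in place of the real files. It calls graph_from_data_frame(), the current name for graph.data.frame in recent igraph releases:

```r
library(igraph)

# Hypothetical node and edge tables shaped like the ones described above.
nodes <- data.frame(id    = c("s01", "s02", "s03"),
                    media = c("NY Times", "Washington Post", "Wall Street Journal"))
links <- data.frame(from   = c("s01", "s02"),
                    to     = c("s02", "s03"),
                    weight = c(10, 5))

# d: edges (first two columns are source and target); vertices: node IDs first.
net.small <- graph_from_data_frame(d = links, vertices = nodes, directed = TRUE)
V(net.small)$media
E(net.small)$weight
```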
4.1 Dataset 1
library(igraph)
## [1] "igraph"
net
## IGRAPH DNW- 17 49 --
## + attr: name (v/c), media (v/c), media.type (v/n), type.label
## | (v/c), audience.size (v/n), type (e/c), weight (e/n)
## + edges (vertex names):
## [1] s01->s02 s01->s03 s01->s04 s01->s15 s02->s01 s02->s03 s02->s09
## [8] s02->s10 s03->s01 s03->s04 s03->s05 s03->s08 s03->s10 s03->s11
## [15] s03->s12 s04->s03 s04->s06 s04->s11 s04->s12 s04->s17 s05->s01
## [22] s05->s02 s05->s09 s05->s15 s06->s06 s06->s16 s06->s17 s07->s03
## [29] s07->s08 s07->s10 s07->s14 s08->s03 s08->s07 s08->s09 s09->s10
## [36] s10->s03 s12->s06 s12->s13 s12->s14 s13->s12 s13->s17 s14->s11
## [43] s14->s13 s15->s01 s15->s04 s15->s06 s16->s06 s16->s17 s17->s04
We also have easy access to nodes, edges, and their attributes with:
Now that we have our igraph network object, let’s make a first attempt to plot it.
plot(net, edge.arrow.size=.4,vertex.label=NA)
That doesn’t look very good. Let’s start fixing things by removing the loops in the graph.
You might notice that we could have used simplify to combine multiple edges by summing their
weights with a command like simplify(net, edge.attr.comb=list(weight="sum","ignore")).
The problem is that this would also combine multiple edge types (in our data: “hyperlinks” and
“mentions”).
If you need them, you can extract an edge list or a matrix from igraph networks.
as_edgelist(net, names=T)
as_adjacency_matrix(net, attr="weight")
as_data_frame(net, what="edges")
as_data_frame(net, what="vertices")
4.2 Dataset 2
As we have seen above, this time the edges of the network are in a matrix format. We can read
those into a graph object using graph_from_incidence_matrix(). In igraph, bipartite networks
have a node attribute called type that is FALSE (or 0) for vertices in one mode and TRUE (or 1)
for those in the other mode.
head(nodes2)
head(links2)
## U01 U02 U03 U04 U05 U06 U07 U08 U09 U10 U11 U12 U13 U14 U15 U16 U17
## s01 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## s02 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## s03 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
## s04 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
## s05 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
## s06 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1
## U18 U19 U20
## s01 0 0 0
## s02 0 0 1
## s03 0 0 0
## s04 0 0 0
## s05 0 0 0
## s06 0 0 0
##
## FALSE TRUE
## 10 20
To transform a one-mode network matrix into an igraph object, use instead graph_from_adjacency_matrix().
We can also easily generate bipartite projections for the two-mode network. Co-memberships are
easy to calculate by multiplying the network matrix by its transposed matrix, or by using igraph’s
bipartite.projection() function.
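The matrix route can be sketched in base R. The 2 x 3 matrix below is a made-up miniature of the media-by-user matrix, not the actual data:

```r
# Hypothetical two-mode matrix: rows are media, columns are users.
m <- matrix(c(1, 1, 0,
              0, 1, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("s01", "s02"), c("U01", "U02", "U03")))

# Media-by-media: entry [i, j] counts the users shared by media i and j.
co.media <- m %*% t(m)
# User-by-user: entry [i, j] counts the media shared by users i and j.
co.users <- t(m) %*% m
co.media
```

The diagonal holds each node's own number of ties, so it is usually ignored or zeroed before building a one-mode graph.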
net2.bp <- bipartite.projection(net2)
[Figure: plots of the bipartite projections in net2.bp]
5. Plotting networks with igraph
Plotting with igraph: the network plots have a wide set of parameters you can set. Those include
node options (starting with vertex.) and edge options (starting with edge.). A list of selected
options is included below, but you can also check out ?igraph.plotting for more information.
NODES
vertex.color Node color
vertex.frame.color Node border color
vertex.shape One of “none”, “circle”, “square”, “csquare”, “rectangle”,
“crectangle”, “vrectangle”, “pie”, “raster”, or “sphere”
vertex.size Size of the node (default is 15)
vertex.size2 The second size of the node (e.g. for a rectangle)
vertex.label Character vector used to label the nodes
vertex.label.family Font family of the label (e.g. “Times”, “Helvetica”)
vertex.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
vertex.label.cex Font size (multiplication factor, device-dependent)
vertex.label.dist Distance between the label and the vertex
vertex.label.degree The position of the label in relation to the vertex,
where 0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above
EDGES
edge.color Edge color
edge.width Edge width, defaults to 1
edge.arrow.size Arrow size, defaults to 1
edge.arrow.width Arrow width, defaults to 1
edge.lty Line type, could be 0 or “blank”, 1 or “solid”, 2 or “dashed”,
3 or “dotted”, 4 or “dotdash”, 5 or “longdash”, 6 or “twodash”
edge.label Character vector used to label edges
edge.label.family Font family of the label (e.g. “Times”, “Helvetica”)
edge.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
edge.label.cex Font size for edge labels
edge.curved Edge curvature, range 0-1 (FALSE sets it to 0, TRUE to 0.5)
arrow.mode Vector specifying whether edges should have arrows,
possible values: 0 no arrow, 1 back, 2 forward, 3 both
OTHER
margin Empty space margins around the plot, vector with length 4
frame if TRUE, the plot will be framed
main If set, adds a title to the plot
sub If set, adds a subtitle to the plot
We can set the node & edge options in two ways - the first one is to specify them in the plot()
function, as we are doing below.
[Figure: two plots of net, one labeled with node IDs and one labeled with media names]
The second way to set attributes is to add them to the igraph object. Let’s say we want to color
our network nodes based on type of media, and size them based on audience size (larger audience
-> larger node). We will also change the width of the edges based on their weight.
# Generate colors based on media type:
colrs <- c("gray50", "tomato", "gold")
V(net)$color <- colrs[V(net)$media.type]
It helps to add a legend explaining the meaning of the colors we used:
plot(net)
legend(x=-1.5, y=-1.1, c("Newspaper","Television", "Online News"), pch=21,
col="#777777", pt.bg=colrs, pt.cex=2, cex=.8, bty="n", ncol=1)
[Figure: plot of net with a color legend for Newspaper, Television, and Online News]
Sometimes, especially with semantic networks, we may be interested in plotting only the labels of
the nodes:
[Figure: plot of net showing only the node labels]
Let’s color the edges of the graph based on their source node color. We can get the starting node
for each edge with the ends() igraph function.
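A sketch of that idea on a three-node toy graph (the graph and its colors are illustrative assumptions):

```r
library(igraph)

# Toy directed graph with edges 1->2, 2->3, 1->3.
g <- make_graph(c(1, 2,  2, 3,  1, 3), directed = TRUE)
V(g)$color <- c("tomato", "gold", "gray50")

# ends() returns a two-column matrix; column 1 is the source of each edge.
edge.start <- ends(g, es = E(g), names = FALSE)[, 1]
edge.col   <- V(g)$color[edge.start]
edge.col
```

The resulting vector can then be passed to plot() through the edge.color parameter.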
Network layouts are simply algorithms that return coordinates for each node in a network.
For the purposes of exploring layouts, we will generate a slightly larger 80-node graph. We use the
sample_pa() function which generates a simple graph starting from one node and adding more
nodes and links based on a preset level of preferential attachment (Barabasi-Albert model).
net.bg <- sample_pa(80)
V(net.bg)$size <- 8
V(net.bg)$frame.color <- "white"
V(net.bg)$color <- "orange"
V(net.bg)$label <- ""
E(net.bg)$arrow.mode <- 0
plot(net.bg)
plot(net.bg, layout=layout_randomly)
l <- layout_in_circle(net.bg)
plot(net.bg, layout=l)
l is simply a matrix of x, y coordinates (N x 2) for the N nodes in the graph. You can easily
generate your own:
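One way to build such a matrix by hand is sketched below; the particular coordinates are an arbitrary choice made only to show the expected shape:

```r
library(igraph)

g <- make_ring(10)
n <- vcount(g)

# Column 1 holds x coordinates, column 2 holds y coordinates, one row per node.
l <- cbind(1:n, c(1, n:2))
dim(l)
# plot(g, layout = l)  # pass the matrix to plot() through the layout parameter
```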
This layout is just an example and not very helpful - thankfully igraph has a number of built-in
layouts, including:
# Circle layout
l <- layout_in_circle(net.bg)
plot(net.bg, layout=l)
# 3D sphere layout
l <- layout_on_sphere(net.bg)
plot(net.bg, layout=l)
Fruchterman-Reingold is one of the most used force-directed layout algorithms out there.
Force-directed layouts try to get a nice-looking graph where edges are similar in length and cross
each other as little as possible. They simulate the graph as a physical system. Nodes are electrically
charged particles that repulse each other when they get too close. The edges act as springs that
attract connected nodes closer together. As a result, nodes are evenly distributed through the chart
area, and the layout is intuitive in that nodes which share more connections are closer to each
other. The disadvantage of these algorithms is that they are rather slow and therefore less often
used in graphs larger than ~1000 vertices. You can set the “weight” parameter which increases the
attraction forces among nodes connected by heavier edges.
l <- layout_with_fr(net.bg)
plot(net.bg, layout=l)
You will notice that the layout is not deterministic - different runs will result in slightly different
configurations. Saving the layout in l allows us to get the exact same result multiple times, which
can be helpful if you want to plot the time evolution of a graph, or different relationships – and
want nodes to stay in the same place in multiple plots.
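Besides saving the layout matrix, fixing the random seed before computing the layout should also reproduce it. This sketch assumes that igraph draws its random numbers from R's generator, which is worth checking on your own installation:

```r
library(igraph)

g <- make_ring(20)

set.seed(123)
l1 <- layout_with_fr(g)
set.seed(123)
l2 <- layout_with_fr(g)   # same seed, so expected to match l1

isTRUE(all.equal(l1, l2))
```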
dev.off()
By default, the coordinates of the plots are rescaled to the [-1,1] interval for both x and y. You can
change that with the parameter rescale=FALSE and rescale your plot manually by multiplying the
coordinates by a scalar. You can use norm_coords to normalize the plot with the boundaries you
want.
l <- layout_with_fr(net.bg)
l <- norm_coords(l, ymin=-1, ymax=1, xmin=-1, xmax=1)
par(mfrow=c(2,2), mar=c(0,0,0,0))
plot(net.bg, rescale=F, layout=l*0.4)
plot(net.bg, rescale=F, layout=l*0.6)
plot(net.bg, rescale=F, layout=l*0.8)
plot(net.bg, rescale=F, layout=l*1.0)
dev.off()
Another popular force-directed algorithm that produces nice results for connected graphs is
Kamada-Kawai. Like Fruchterman-Reingold, it attempts to minimize the energy in a spring system.
l <- layout_with_kk(net.bg)
plot(net.bg, layout=l)
The LGL algorithm is meant for large, connected graphs. Here you can also specify a root: a node
that will be placed in the middle of the layout.
plot(net.bg, layout=layout_with_lgl)
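# `layouts` is assumed to be a character vector of layout function names
# (e.g. "layout_randomly") defined beforehand; its definition is not shown here.
par(mfrow=c(3,3), mar=c(1,1,1,1))
for (layout in layouts) {
  print(layout)
  l <- do.call(layout, list(net))  # call the layout function by name
  plot(net, edge.arrow.mode=0, layout=l, main=layout)
}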
[Figure: grid of plots of net under different layouts, including layout_randomly, layout_with_dh, and layout_with_drl]
Notice that our network plot is still not too helpful. We can identify the type and size of nodes,
but cannot see much about the structure since the links we’re examining are so dense. One way to
approach this is to see if we can sparsify the network, keeping only the most important ties and
discarding the rest.
hist(links$weight)
mean(links$weight)
sd(links$weight)
There are more sophisticated ways to extract the key edges, but for the purposes of this exercise
we’ll only keep ones that have weight higher than the mean for the network. In igraph, we can
delete edges using delete_edges(net, edges):
cut.off <- mean(links$weight)
net.sp <- delete_edges(net, E(net)[weight<cut.off])
plot(net.sp)
Another way to think about this is to plot the two tie types (hyperlink & mention) separately.
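A sketch of the split on a toy graph with a hypothetical type edge attribute; each subgraph is obtained by deleting the edges of the other type:

```r
library(igraph)

g <- make_graph(c(1, 2,  2, 3,  3, 1,  1, 3), directed = TRUE)
E(g)$type <- c("hyperlink", "mention", "hyperlink", "mention")

# Keep only mentions, then only hyperlinks; the vertex set is unchanged.
net.m <- delete_edges(g, E(g)[type == "hyperlink"])
net.h <- delete_edges(g, E(g)[type == "mention"])
E(net.m)$type
```

Plotting both with the same layout matrix keeps nodes in the same positions across the two panels.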
[Figure: two panels, “Tie: Hyperlink” and “Tie: Mention”]
dev.off()
R and igraph allow for interactive plotting of networks. This might be a useful option for you if you
want to tweak slightly the layout of a small graph. After adjusting the layout manually, you can get
the coordinates of the nodes and use them for other plots.
tkid <- tkplot(net) #tkid is the id of the tkplot that will open
l <- tkplot.getcoords(tkid) # grab the coordinates from tkplot
tk_close(tkid, window.close = T)
plot(net, layout=l)
5.5 Other ways to represent a network
At this point it might be useful to provide a quick reminder that there are many ways to represent
a network not limited to a hairball plot.
For example, here is a quick heatmap of the network matrix:
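A minimal sketch of such a heatmap. The three-node graph, its names, and its weights are illustrative stand-ins, and the palette is an arbitrary choice:

```r
library(igraph)

g <- make_graph(c(1, 2,  2, 3,  3, 1), directed = TRUE)
V(g)$name   <- c("NY Times", "BBC", "CNN")   # hypothetical labels
E(g)$weight <- c(10, 20, 30)

# Dense weighted adjacency matrix; rows are sources, columns are targets.
netm <- as_adjacency_matrix(g, attr = "weight", sparse = FALSE)
palf <- colorRampPalette(c("gold", "dark orange"))

if (interactive()) {
  heatmap(netm, Rowv = NA, Colv = NA, col = palf(100),
          scale = "none", margins = c(10, 10))
}
```

Setting Rowv and Colv to NA turns off the dendrogram reordering so rows and columns stay in node order.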
[Figure: heatmap of the network matrix, with media names on both axes]
As with one-mode networks, we can modify the network object to include the visual properties that
will be used by default when plotting the network. Notice that this time we will also change the
shape of the nodes - media outlets will be squares, and their users will be circles.
[Figure: plot of the two-mode network with media outlets labeled]
Igraph also has a special layout for bipartite networks (though it doesn’t always work great, and
you might be better off generating your own two-mode layout).
[Figure: plot of the two-mode network with media outlets and users labeled]
6.1 Density
The proportion of present edges from all possible edges in the network.
edge_density(net, loops=F)
## [1] 0.1764706
## [1] 0.1764706
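For a directed network without loops, density is the edge count divided by n(n-1). The value above is consistent with 48 edges among 17 nodes, since 48/(17*16) = 0.1764706. A sketch on a random graph of that size (not the media network itself):

```r
library(igraph)

set.seed(7)
g <- sample_gnm(17, 48, directed = TRUE)  # 17 nodes, 48 edges, no loops

d.builtin <- edge_density(g, loops = FALSE)
d.manual  <- ecount(g) / (vcount(g) * (vcount(g) - 1))
c(d.builtin, d.manual)
```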
6.2 Reciprocity
reciprocity(net)
dyad_census(net) # Mutual, asymmetric, and null node pairs
2*dyad_census(net)$mut/ecount(net) # Calculating reciprocity
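The manual calculation can be checked on a toy graph with one mutual pair and one one-way tie (an illustrative example, not the media network):

```r
library(igraph)

# Edges 1->2 and 2->1 form a mutual dyad; 2->3 is asymmetric.
g <- make_graph(c(1, 2,  2, 1,  2, 3), directed = TRUE)

r.builtin <- reciprocity(g)
r.manual  <- 2 * dyad_census(g)$mut / ecount(g)  # 2 * 1 / 3
c(r.builtin, r.manual)
```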
6.3 Transitivity
6.4 Diameter
A network diameter is the longest geodesic distance (length of the shortest path between two nodes)
in the network. In igraph, diameter() returns the distance, while get_diameter() returns the
nodes along the first found path of that distance.
Note that edge weights are used by default, unless set to NA.
## [1] 4
diameter(net, directed=F)
## [1] 28
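The gap between the two results comes from the edge weights. A sketch on a three-node path with made-up weights shows the same effect:

```r
library(igraph)

# Undirected path 1 - 2 - 3 with hypothetical weights 5 and 7.
g <- make_graph(c(1, 2,  2, 3), directed = FALSE)
E(g)$weight <- c(5, 7)

d.weighted   <- diameter(g, directed = FALSE)                # uses weights: 5 + 7
d.unweighted <- diameter(g, directed = FALSE, weights = NA)  # counts steps: 2
c(d.weighted, d.unweighted)
```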
diam <- get_diameter(net, directed=T)
diam
Note that get_diameter() returns a vertex sequence. When asked to behave as
a vector, a vertex sequence will produce the numeric indexes of the nodes in it. The same applies
for edge sequences.
class(diam)
## [1] "igraph.vs"
as.vector(diam)
## [1] 12 6 17 4 3 8 7
6.5 Node degrees
The function degree() has a mode of in for in-degree, out for out-degree, and all or total for
total degree.
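The three modes can be compared on a small directed toy graph (an illustrative example):

```r
library(igraph)

# Edges 1->2, 1->3, 2->3.
g <- make_graph(c(1, 2,  1, 3,  2, 3), directed = TRUE)

deg.in  <- degree(g, mode = "in")   # incoming ties per node
deg.out <- degree(g, mode = "out")  # outgoing ties per node
deg.all <- degree(g, mode = "all")  # in plus out
rbind(deg.in, deg.out, deg.all)
```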
[Figure: histogram of node degree (deg)]
6.6 Degree distribution
[Figure: degree distribution; x-axis: Degree]
igraph provides centrality functions (vertex level) and centralization functions (graph level). The
centralization functions return res (the vertex centrality scores), centralization, and
theoretical_max (the maximum centralization score for a graph of that size). The centrality
functions can run on a subset of nodes (set with the vids parameter), which is helpful for large
graphs where calculating all centralities may be a resource-intensive and time-consuming task.
Degree (number of ties)
degree(net, mode="in")
centr_degree(net, mode="in", normalized=T)
Eigenvector (centrality proportional to the sum of connection centralities)
Values of the first eigenvector of the graph matrix.
The hubs and authorities algorithm developed by Jon Kleinberg was initially used to examine
web pages. Hubs were expected to contain catalogs with a large number of outgoing links; while
authorities would get many incoming links from hubs, presumably because of their high-quality
relevant information.
par(mfrow=c(1,2))
plot(net, vertex.size=hs*50, main="Hubs")
plot(net, vertex.size=as*30, main="Authorities")
[Figure: two panels, “Hubs” and “Authorities”]
dev.off()
7. Distances and paths
Average path length: the mean of the shortest distance between each pair of nodes in the network
(in both directions for directed graphs).
mean_distance(net, directed=F)
## [1] 2.058824
mean_distance(net, directed=T)
## [1] 2.742188
We can also find the length of all shortest paths in the graph:
We can extract the distances to a node or set of nodes we are interested in. Here we will get the
distance of every media from the New York Times.
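A sketch of extracting distances from one focal node, using a six-node ring in place of the media network:

```r
library(igraph)

g <- make_ring(6)

# Distances from node 1 to every node; weights = NA counts steps only.
dist.from.1 <- distances(g, v = 1, to = V(g), weights = NA)
dist.from.1
```

The result is a matrix with one row per source node, so a single source gives a 1 x N matrix.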
[Figure: plot of net with each node labeled by its distance from the New York Times]
We can also find the shortest path between specific nodes. Say here between MSNBC and the New
York Post:
Identify the edges going into or out of a vertex, for instance the WSJ. For a single node, use
incident(), for multiple nodes use incident_edges()
We can also easily identify the immediate neighbors of a vertex, say WSJ. The neighbors() function
finds all nodes one step out from the focal actor. To find the neighbors for multiple nodes, use
adjacent_vertices() instead of neighbors(). To find node neighborhoods going more than one
step out, use the function ego() with the parameter order set to the number of steps out to go
from the focal node(s).
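The three functions can be compared on a five-node path (an illustrative toy graph):

```r
library(igraph)

# Undirected path 1 - 2 - 3 - 4 - 5.
g <- make_graph(c(1, 2,  2, 3,  3, 4,  4, 5), directed = FALSE)

nb  <- neighbors(g, 3)                 # one step out from node 3
adj <- adjacent_vertices(g, c(1, 3))   # a list, one entry per focal node
eg  <- ego(g, order = 2, nodes = 3)    # up to two steps out, focal node included
sort(as.vector(nb))
```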
Special operators for the indexing of edge sequences: %--%, %->%, %<-%
E(network)[X %--% Y] selects edges between vertex sets X and Y, ignoring direction
E(network)[X %->% Y] selects edges from vertex set X to vertex set Y
E(network)[X %<-% Y] selects edges from vertex set Y to vertex set X
For example, select edges from newspapers to online sources:
E(net)[ V(net)[type.label=="Newspaper"] %->% V(net)[type.label=="Online"] ]
Co-citation (for a couple of nodes, how many shared nominations they have):
cocitation(net)
## s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17
## s01 0 1 1 2 1 1 0 1 2 2 1 1 0 0 1 0 0
## s02 1 0 1 1 0 0 0 0 1 0 0 0 0 0 2 0 0
## s03 1 1 0 1 0 1 1 1 2 2 1 1 0 1 1 0 1
## s04 2 1 1 0 1 1 0 1 0 1 1 1 0 0 1 0 0
## s05 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0
## s06 1 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 2
## s07 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## s08 1 0 1 1 1 0 0 0 0 2 1 1 0 1 0 0 0
## s09 2 1 2 0 0 0 1 0 0 1 0 0 0 0 1 0 0
## s10 2 0 2 1 1 0 0 2 1 0 1 1 0 1 0 0 0
## s11 1 0 1 1 1 1 0 1 0 1 0 2 1 0 0 0 1
## s12 1 0 1 1 1 1 0 1 0 1 2 0 0 0 0 0 2
## s13 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0
## s14 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0
## s15 1 2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0
## s16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## s17 0 0 1 0 0 2 0 0 0 0 1 2 0 0 0 1 0
Before we start, we will make our network undirected. There are several ways to do that conversion:
• We can create an undirected link between any pair of connected nodes (mode="collapse" )
• Create undirected link for each directed one in the network, potentially ending up with a
multiplex graph (mode="each" )
• Create undirected link for each symmetric link in the graph (mode="mutual" ).
In cases where ties A -> B and B -> A are collapsed into a single undirected link, we
need to specify what to do with their edge attributes using the parameter ‘edge.attr.comb’, as we
did earlier with simplify(). Here we have said that the ‘weight’ of the links should be summed,
and all other edge attributes ignored and dropped.
net.sym <- as.undirected(net, mode= "collapse",
edge.attr.comb=list(weight="sum", "ignore"))
8.1 Cliques
[Figure: plot of cliques in net.sym]
A number of algorithms aim to detect groups that consist of densely connected nodes with fewer
connections across groups.
Community detection based on edge betweenness (Newman-Girvan)
High-betweenness edges are removed sequentially (recalculating at each step) and the best
partitioning of the network is selected.
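A sketch of this method on a toy graph made of two triangles joined by a single bridge edge, where the bridge carries the highest betweenness and is removed first:

```r
library(igraph)

# Triangles {1,2,3} and {4,5,6} connected by the bridge 3 - 4.
g <- make_graph(c(1, 2,  2, 3,  1, 3,
                  4, 5,  5, 6,  4, 6,
                  3, 4), directed = FALSE)

ceb.toy <- cluster_edge_betweenness(g)
length(ceb.toy)       # number of communities
membership(ceb.toy)   # community id for each node
modularity(ceb.toy)
```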
[Figure: dendrogram of the edge-betweenness communities]
plot(ceb, net)
class(ceb)
## [1] "communities"
## [1] 5
## s01 s02 s03 s04 s05 s06 s07 s08 s09 s10 s11 s12 s13 s14 s15 s16 s17
## 1 2 3 4 1 4 3 3 5 5 4 4 4 4 1 4 4
modularity(ceb) # how modular the graph partitioning is
## [1] 0.292476
High modularity for a partitioning reflects dense connections within communities and sparse
connections across communities.
Community detection based on propagating labels
Assigns node labels, randomizes, then replaces each vertex’s label with the label that appears most
frequently among neighbors. Those steps are repeated until each vertex has the most common label
of its neighbors.
We can also plot the communities without relying on their built-in plot:
The k-core is the maximal subgraph in which every node has degree of at least k. This also means
that the (k+1)-core will be a subgraph of the k-core.
The result here gives the coreness of each vertex in the network. A node has coreness D if it belongs
to a D-core but not to (D+1)-core.
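A sketch of coreness on a toy graph: a complete graph on four nodes plus one pendant node attached to it:

```r
library(igraph)

# K4 on nodes 1..4, then node 5 attached only to node 4.
g <- make_full_graph(4)
g <- add_vertices(g, 1)
g <- add_edges(g, c(4, 5))

kc <- coreness(g, mode = "all")
kc
```

Node 4 has degree 4 but still has coreness 3, because its fourth neighbor drops out of the 2-core.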
[Figure: plot of the network with each node labeled by its coreness]
Homophily: the tendency of nodes to connect to others who are similar on some variable.
## [1] 0.1715568
## [1] -0.1102857
assortativity_degree(net, directed=F)
## [1] -0.009551146
# As above, with the focal attribute being the node degree D-1
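A sketch of the assortativity measures on a toy graph with a made-up categorical attribute: two triangles, each of a single type, joined by one edge between types:

```r
library(igraph)

g <- make_graph(c(1, 2,  2, 3,  1, 3,
                  4, 5,  5, 6,  4, 6,
                  3, 4), directed = FALSE)
node.type <- c(1, 1, 1, 2, 2, 2)  # hypothetical category codes, starting at 1

# Positive values mean ties mostly connect nodes of the same type.
r.nominal <- assortativity_nominal(g, node.type, directed = FALSE)
r.degree  <- assortativity_degree(g, directed = FALSE)
c(r.nominal, r.degree)
```

Six of the seven edges stay within a type here, so the nominal coefficient comes out clearly positive.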
Module7_Security
– Compute
• This operator models the principle that “your enemy’s enemy is your friend”.
• This operator should only be applied when the situation makes it plausible.
• It is doubtful whether the enemy of your enemy’s enemy necessarily is your enemy too (if there are more
than 2 arcs)
Base Rate Sensitive Transitivity
Effect of base rate trust in a transitive path –
difficulty 2:
• The transitivity operators (transitivity trust
operator and transitivity fusion operator)
had no influence on the ‘a’ (base rate)
parameter.
Base Rate Sensitive Transitivity
Example: Imagine a stranger coming to a town which is known for its
citizens being honest.
– The stranger is looking for a car mechanic, and asks the first person he
meets to direct him to a good car mechanic.
– The stranger receives the reply that there are two car mechanics in town,
David and Eric, where David is cheap but does not always do quality work,
and Eric might be a bit more expensive, but he always does a perfect job.
• According to subjective logic, the stranger has no other info about
the person he asks than the base rate that the citizens in the town
are honest.
• The stranger is thus ignorant, but the expectation value of good
advice is still very high.
• Without taking parameter ‘a’, base rate, into account, the result would
be that the stranger is completely ignorant about which of the
mechanics is the best.
• Definition 22.3 (next slide) supplements “Base Rate Sensitive
Transitivity”
Mass Hysteria
• Mass hysteria can be caused
by people not being aware
of dependency between
opinions.
• Example: Person A
recommends an opinion
about a particular
statement x to a group of
other persons. Without
being aware of the fact that
the opinion came from the
same origin, these persons
can recommend their
opinions to each other.
Mass Hysteria
• The arrows represent trust so that
for example B → A can be
interpreted as saying that B trusts A
to recommend an opinion about
statement x.
• The actual recommendation goes, of
course, in the opposite direction to
the arrows.
• Here, A recommends an opinion
about x to 6 other agents, and
G receives six recommendations in
all. If G assumes the recommended
opinions to be independent and
takes the consensus between them,
his opinion can become abnormally
strong and in fact even stronger
than A’s opinion.
Fig. 26.2 Use Case Diagram for Participation and Collaboration pattern
Participation-Collaboration
• Dynamics
– The class diagram supports the use cases shown in
Fig. 26.2.
– The sequence diagram of Fig. 26.3 shows the use
case modify content: First the user logs in the
platform, then she can make changes but those
changes are not published until the reviewer
approves them.
Participation-Collaboration
Fig. 26.3 Sequence Diagram for the use case Modify Content
Participation-Collaboration
• Implementation
– Using the facilities of Web 2.0, implement a collaborative platform
through which users or application actors can contribute knowledge,
code, facts, and other material relevant to a discussion.
– The material can be made available, possibly in more than one format
or language, and participants can then modify or append to that core
content.
– The collaborative platform allows the user to modify an article, upload
images, videos and audio.
– To do that, the user must have an account to be identified to the
reviewer; it is important to use an appropriate authentication system.
Participation-Collaboration
• Known Uses
– Platforms of collaboration such as Wikipedia, a free web encyclopedia.
– Facebook Wiki, a technical reference for developers interested in the
Facebook Platform.
– Facebook Platform is a standards-based Web service with methods for
accessing and contributing data.
• Related Patterns
– The Authenticator pattern, is used to authenticate the users of the
system.
– The Role-Based Access Control (RBAC) pattern is used to define the
rights of the users with respect to the contents.
Participation-Collaboration
• Consequences:
• This pattern has the following advantages:
– Allows users in any place to modify content; they can share text,
videos, images and can discuss any topic trying to collaborate and give
different ideas.
– Experts demonstrate their knowledge or talent about a certain topic and
they can be recognized in their field of expertise.
– We can keep out spammers and other undesirable users.
• Possible disadvantages include:
– Sometimes the reviewers don’t know about a specific topic and they
can eliminate important or useful content.
– They can also be biased.
– This means the reviewer should be carefully selected.
Collaborative Tagging
• Intent
– Collaborative Tagging pattern makes content more meaningful and
useful by using keywords to tag bookmarks, photographs, and other
content.
• Example
– Consider a person tagging a photograph of broccoli.
– One person might label it “cruciform,” “vegetable,” or “nutritious,”
while another might tag it “gross,” “bitter,” or worse. We need some
way to attach information to this item as a guide to possibly interest
users about this vegetable.
• Context
– People on the Internet need to search different kinds of content such
as pictures, text, content, audio files, bookmarks, news, items,
websites, products, blog posts, comments, and other items available
online. They may want an item for a variety of reasons.
Collaborative Tagging
• Problem
– Often, we need to use a search system to find resources on the
Internet.
– The resources must match our needs, and to find relevant information,
we need to enter search terms.
– The search system compares those terms with a metadata index that
shows which resources might be relevant to our search.
– The primary problem with such a system is that the metadata index is
often built by a small group of people who determine the relevancy of
those resources for specific terms.
– The smaller the group that does this, the greater the chance that the
group will apply inaccurate tags to some resources or omit certain
relationships between the search terms and the resources’ semantic
relevancy.
– How do we let users guide the search for people with related
interests?
Collaborative Tagging
• The solution is affected by the following forces:
– The number of ways to classify an item is undefined and
the choices can be as different as the users and all of these
are valid in some sense.
– A specific item can belong to an unlimited number of
categories.
– We want to have a variety of ways to find items.
Collaborative Tagging
• Solution
– Let the users add tags to items to indicate categorizations of interest
to the members of the group.
– Figure(Next Slide) shows the class diagram for this pattern.
– User belongs to a Domain and applies Tags from this domain to
Resources.
– User is any human, application, process, or other entity that is capable
of interacting with a resource.
– Domain is the total set of objects and actions that the language
provides.
– Resource denotes any digital asset that can have an identifier.
Examples of resources include online content, audio files, digital
photos, bookmarks, news items, websites, products, blog posts,
comments, and other items available online.
Collaborative Tagging
Fig. 26.5 Use Case Diagram for the Collaborative Tagging Pattern
Collaborative Tagging
– Ofoto, an online digital photography website, on which users could store, share, view and
print digital photos
– Flickr, an image hosting and video hosting website and web services suite
– mp3.com, a website providing information about digital music and artists, songs, services,
community, and technologies and a legal, free music-sharing service
– Napster, a pioneering peer-to-peer (P2P) file sharing Internet service that emphasized sharing
digital audio files, typically songs, encoded in MP3 format
• Wikispaces
Really Simple Syndication
• Subscribe to “feeds” from your favorite sites.
• Facebook, Twitter
• LinkedIn.com
Luigi Grimaudo, Politecnico di Torino, Italy, luigi.grimaudo@polito.it
Han Song, Narus Inc., hsong@narus.com
Mario Baldi, Narus Inc., mbaldi@narus.com
Marco Mellia, Politecnico di Torino, Italy, mellia@polito.it
Maurizio Munafò, Politecnico di Torino, Italy, munafo@polito.it
Abstract—Twitter has attracted millions of users that generate a humongous flow of information at
constant pace. The research community has thus started proposing tools to extract meaningful
information from tweets. In this paper, we take a different angle from the mainstream of previous
works: we explicitly target the analysis of the timeline of tweets from “single users”. We define
a framework - named TUCAN - to compare information offered by the target users over time, and to
pinpoint recurrent topics or topics of interest. First, tweets belonging to the same time window
are aggregated into “bird songs”. Several filtering procedures can be selected to remove
stop-words and reduce noise. Then, each pair of bird songs is compared using a similarity score to
automatically highlight the most common terms, thus highlighting recurrent or persistent topics.
TUCAN can be naturally applied to compare bird song pairs generated from timelines of different
users.
By showing actual results for both public profiles and anonymous users, we show how TUCAN is
useful to highlight meaningful information from a target user’s Twitter timeline.
I. INTRODUCTION AND MOTIVATION
Twitter is nowadays part of everyone’s life, with hundreds of millions of people using it on a
regular basis. Originally born as a microblogging service, Twitter is now being used to chat, to
discuss, to run polls, to collect feedback, etc. It is not surprising then that the interest of
the research community has been attracted to study the “social aspects” of Twitter. User and usage
characterization [1], [2], topic analysis [3]–[5], and community-level social interest
identification [1] have recently emerged as hot research topics. Most previous works focus on the
analysis of “a community of twitters”, whose tweets are analysed using text and data mining
techniques to identify the topics, moods, or interests.
In this paper we take a different angle: first, we focus on the analysis of a Twitter target user.
We consider the set of tweets that appear on his Twitter public page, i.e., the target user’s
timeline, and define a methodology to explore exposed content and extract possible valuable
information. Which are the tweets that carry the most valuable information? Which are the topics
he/she is interested in? How do these topics change over time? Our second goal is to compare the
Twitter activity of two (or more) target users. Do they share some common traits? Is there any
shared interest? How important is for one user a topic of
enabling the extraction of valuable information from the user’s timeline. From a methodology
stand-point, we build upon text mining techniques, adapting them to cope with the specific Twitter
characteristics.
As input, we group the target user’s tweets based on a window of time (e.g., a day, or a week) so
as to form bird songs, one for each time window. At the next step, filtering is applied to each
bird song using either simple stop-word removal, stemming, lemmatization, or more complicated
transformations based on lexical databases. Next, terms in bird songs are scored using classic
Term Frequency-Inverse Document Frequency (TF-IDF) [6] to pinpoint those terms that are
particularly important for the target user. Each pair of bird songs is finally compared by
computing a similarity score, so as to unveil those bird songs that contain overlapping, and thus
persistent, topics. The output is then represented using a coloured matrix, in which cell colour
represents the similarity score. As a result, TUCAN offers a simple and natural visual
representation of extracted information that easily unveils the most interesting bird songs and
the persistent topics the target user is interested in during a given time period. Moreover,
comparisons among bird songs give intuitions on the transition of user interests as well as the
significance of topics to the user.
The framework is naturally extended to find and extract similarities among tweets of two or more
target users. TUCAN computes and graphically shows the similarity among bird songs generated from
the timelines of the pairs of target users, revealing similarities and common interests that are
present possibly during different time periods.
II. FRAMEWORK
A. Bird song generation and cleaning process
Let TW(u) be the set of tweets of a single user u that are retrieved from Twitter, time stamped
with their generation time, stored and organized in a repository in binary format, to be easily
accessed and further analyzed when necessary. Bird songs are created by aggregating tweets from
TW(u) generated within a time period T, to then be analyzed. We define the i-th bird song for the
user u, BS(u, i), as the subset of tweets in TW(u) that appear in the i-th time period of
duration T, i.e., the set of tweets that are generated in the [(i − 1)T, iT), i > 0 window of time.
interest for the other user? What is the most common interest
of these two users, regardless of the time they are interested A “plain cleaning” pre-processing is applied to bird songs
in it? to discard stopwords, HTML tag entities, and links. Plain
cleaning can be possibly substituted by more advanced text
We propose a graphical framework which we term as cleaning mechanisms; the following are also considered in
TUCAN - Twitter User Centric ANalyzer. TUCAN highlights this work: (i) removal of Twitter ‘mentions’, (ii) stemming,
correlations among tweets using intuitive visualization, al- (iii) lemmatization, and (iv) ontology-based lexicon generaliza-
lowing exploration of the information exposed in them, thus tion. TUCAN allows the analyst to select the most appropriate
cleaning method to take advantage of their different effects in different contexts.

B. Cross-correlation computation

Each pre-processed bird song is represented using a Bag-Of-Words (BoW) model, a common representation used in information retrieval and natural language processing. Each word is then scored according to a weighting scheme. In this work, the Term Frequency-Inverse Document Frequency (TF-IDF) score is adopted, as past literature has shown it to produce good results [5]. Hence, words that are frequent in a bird song but rare in the collection are assigned higher weights.

Bird songs are then transformed into a vector space model VS(u, i), in which each word is given a fixed position. In this space, each word in the bird song BS(u, i) is characterized by its TF-IDF score. Words that do not appear in BS(u, i) are characterized by a null score.

To evaluate the similarity VS(u, i) ⊗ VS(v, j) between a pair of bird song vectors, the Cosine similarity measure is used.

C. Dashboard visualizer

In order to pinpoint similarities among bird songs, independently of the time the user posted them, TUCAN computes the similarity score for all possible pairs of bird songs. In total, N² similarity scores are computed and stored in matrix form, where each cell represents VS(u, i) ⊗ VS(u, j), i, j ∈ [1, N]. To help identify correlations, the matrix is presented to the analyst in a graphical format using a web interface. Each cell is represented by a square whose color reflects the similarity score between the i-th and j-th bird songs. A demo of the dashboard is available online at “http://dbdmg.polito.it/asonam2013/index.html”.

III. EXPERIMENTS

A. Dataset description

To perform user centric analysis through TUCAN, we monitor 712 randomly selected Twitter users for two or more months starting from the summer of 2012. Additionally, we monitor 28 well-known public figures, selected among politicians, news media, tech blogs, etc.

From a total of 810,655 tweets, it emerges that about 300 users (40%) tweeted more than twice in each week. Out of them, 20 users posted more than 400 tweets per week (i.e., more than 57 tweets/day). This suggests that the window size parameter T has to be tailored to each user’s tweeting habits when forming bird songs. For example, Table I shows up to ten top-words extracted from Barack Obama’s bird songs, for different values of T.

B. User centric analysis

To demonstrate the effectiveness of TUCAN on user analysis, we present results of case studies. Unless mentioned otherwise, we use the following settings by default: (i) window size of 7 days, (ii) pre-processing with plain cleaning, and (iii) similarity scoring using the Cosine similarity measure.

Rank  single Tweet  T = 1 day      T = 7 days          T = 14 days
1     photo         lead           #immigrationreform  #immigrationreform
2     day           international  immigration         gun
3     bo            @cfpb          gun                 immigration
4     snow          cordray        violence            violence
5     -             mary           comprehensive       comprehensive
6     -             snow           @whlive             @whlive
7     -             nominates      broken              broken
8     -             sec            @vp                 reform
9     -             richard        representative      representative
10    -             white          reform              @vp

TABLE I. TOP-WORDS RANKED BY TF-IDF, BARACK OBAMA.

Analysis on timeline of a single user. Figure 1 shows correlation matrices representing similarities between pairs of bird songs of a single user. Figure 1(a) shows the matrix for the bird songs of Barack Obama. It highlights three blocks of highly correlated periods of tweets. The larger block [A] at the upper left corner represents Obama’s tweets during the US presidential election in 2012. With a maximum Cosine similarity score of 0.33, it is clear that he has been tweeting a lot on a few correlated topics (voting, Romney, convention, health, etc. being among the most recurrent top terms). Block [B] refers to periods when Obama was interested in the fiscal cliff. Finally, block [C] relates to the shooting at the Newtown elementary school, during which Obama’s major topic terms were gun, violence, and weapon.

The correlation matrix in Figure 1(b) shows an interesting behavior of a generic “user X” (as opposed to a public figure or news media). Analyzing user X’s bird songs, the plot highlights two blocks, [A] and [B]. The similarity of bird songs is dominated by the use of mentions of particular followers/followees of his. Investigating key terms in the time period of block [A], user X was exchanging messages with one of his followers. After one week of pause, in block [B], user X then mentions another follower of his (and never refers to the follower in [A]). We suppose that user X’s sudden change in his mentions indicates a change in his social relationship, e.g., change of his dating partner.

Analysis across different users. Besides the per-user analysis, TUCAN can infer semantic relationships across multiple users when applied to a group of target users. We select ten public figures and media blogs and report the cross-similarity matrix in Figure 2. The latest six bird songs with T = 14 days are considered, referring to a common period of time. Each bird song is checked against every other. Results are represented as a colored matrix, using different color scales (and normalization) for blocks outside the main diagonal and in the main diagonal (where the same user’s bird songs are compared). Focusing on the former, two pairs of users emerge as the most correlated: {Barack Obama, White House} and {idownloadblog, iMore}.

Zooming in and increasing the resolution by selecting T = 7 days, Figure 2(b) compares {Barack Obama, White House} in detail over 25 weeks of tweeting. First, notice that during Barack Obama’s campaign (ref. Figure 1(a)) the correlation with White House is marginal. After the elections, four periods of high correlation are pinpointed, highlighting the periods in which Barack Obama and White House publicize similar topics. Block [A] indicates the period of educational cost cut. [B] indicates the massacre at Newtown. [C] refers to the fiscal cliff, and [D] to the reform of US immigration laws. The discovery
[Figure 1: two similarity matrices, with both axes running from week 1 to week 20. (a) Barack Obama - Max similarity = 33%. (b) User X (axes labelled “User D”) - Max similarity = 31%.]
Fig. 1. Similarity among bird songs for different types of users. T = 7 days, plain cleaning, Cosine similarity.
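The settings named in the caption of Fig. 1 (fixed window T, plain cleaning, TF-IDF weighting, Cosine similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the tokenizer, the function names, the sample tweets, and the TF-IDF variant (tf × log(N/df)) are all assumptions made for illustration, since the paper does not specify them.

```python
import math
import re
from collections import Counter
from datetime import datetime, timedelta

# Tiny illustrative stop-word list (assumption: the paper does not list its own).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "is", "for", "we"}


def plain_clean(text):
    """Drop links and HTML entities, lowercase, tokenize, remove stop-words."""
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"&#?\w+;", " ", text)       # HTML entities such as &amp;
    tokens = re.findall(r"[#@]?\w+", text.lower())  # keep #hashtags and @mentions
    return [t for t in tokens if t not in STOPWORDS]


def bird_songs(tweets, start, period):
    """Group (timestamp, text) pairs into windows [(i-1)T, iT), i = 1, 2, ..."""
    songs = {}
    for ts, text in tweets:
        i = (ts - start) // period + 1  # index of the window containing ts
        songs.setdefault(i, Counter()).update(plain_clean(text))
    return songs


def tfidf_vectors(songs):
    """Weight each term by tf * log(N / df), one common TF-IDF variant (assumed)."""
    n = len(songs)
    df = Counter(term for bag in songs.values() for term in bag)
    return {
        i: {t: tf * math.log(n / df[t]) for t, tf in bag.items()}
        for i, bag in songs.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """All-pairs similarity scores, rows and columns ordered by window index."""
    keys = sorted(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in keys] for i in keys]


# Made-up tweets, one per week, with T = 7 days.
start = datetime(2013, 1, 1)
tweets = [
    (start + timedelta(days=0), "Gun violence must end"),
    (start + timedelta(days=8), "Reform on gun violence now http://example.com"),
    (start + timedelta(days=15), "Snow day photo"),
]
songs = bird_songs(tweets, start, timedelta(days=7))
matrix = similarity_matrix(tfidf_vectors(songs))
```

In this toy input the first two bird songs share the terms "gun" and "violence" and therefore receive a non-zero score, while the third shares nothing with the others and scores zero against both.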
[Figure 2: two similarity matrices. (a) Famous users vs famous users. T = 14 days. Recoverable axis labels: Barack Obama, idownloadblog, iMore, White House. (b) Barack Obama vs White House. T = 7 days. Both axes run from week 1 to week 20.]
Fig. 2. Similarity among users over different bird songs. Plain cleaning and Cosine similarity.
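For a multi-user matrix like the one in Fig. 2, same-user cells (on the main diagonal blocks) typically score much higher than cross-user cells, so a single color scale would wash out the cross-user structure. The sketch below shows one possible way to scale the two groups independently. Dividing each group by its own maximum is an assumption, as the paper does not state which normalization it uses, and the function name, user labels, and scores are hypothetical.

```python
def normalize_blocks(matrix, owners):
    """Rescale same-user and cross-user cells independently to [0, 1].

    matrix: square list of lists of similarity scores.
    owners: owners[k] names the user whose bird song sits on row/column k.
    Max-scaling per group is an illustrative assumption, not the paper's method.
    """
    n = len(matrix)
    same = [matrix[i][j] for i in range(n) for j in range(n) if owners[i] == owners[j]]
    cross = [matrix[i][j] for i in range(n) for j in range(n) if owners[i] != owners[j]]
    # Fall back to 1.0 so an empty or all-zero group is left unchanged.
    max_same = max(same, default=0.0) or 1.0
    max_cross = max(cross, default=0.0) or 1.0
    return [
        [
            matrix[i][j] / (max_same if owners[i] == owners[j] else max_cross)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical scores for two users with two bird songs each.
owners = ["user_a", "user_a", "user_b", "user_b"]
scores = [
    [1.0, 0.3, 0.1, 0.0],
    [0.3, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.4],
    [0.0, 0.1, 0.4, 1.0],
]
scaled = normalize_blocks(scores, owners)
```

After scaling, the strongest cross-user cell maps to the top of its own color scale even though its raw score is far below the same-user scores.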