
Networks
L2: community detection

Michele Tumminello,
Department of Economics, Business and Statistics
University of Palermo

Data Science and Big Data Analytics

Master Data Science 22-23 - Machine Learning - Alessandro Randazzo - 09/03/2023

Learning outcomes

• To learn about the concepts behind two methods of community detection
• To understand why different methods of community detection can give different results
• To understand that communities are (usually) hierarchically organized, as different levels of aggregation appear in complex networks

13/09/20 MT 2

Levels of description of networks

Macroscopic level: degree distribution, total clustering coefficient, etc.
Microscopic level: degree, centrality of individual nodes, etc.
Mesoscopic level: communities

WARNING

A community is NOT necessarily identified by a single connected component of the network

The scientific question: how can we reveal a community?

The definition of “community” might depend on
• the properties of the network (weighted, directed, etc.)
• the “processes” running on it (spread of information, etc.)

Step 1: Detecting communities:
Zachary’s Karate Club

[Figure: Zachary’s Karate Club network, nodes colored by community]
Community 1: Red
Community 2: Green
Community 3: Black

M. Rosvall and C.T. Bergstrom, PNAS 105 (4), 1118-1123 (2008)

Step 2: Characterizing communities:
The FDR weighted network of organisms

66 organisms:
• Bacteria (circle)
• Archaea (square)
• Eukaryotes (triangle)

4873 clusters of orthologous groups (COGs)

M. Tumminello et al., PLOS ONE 6(3), e17994 (2011)

Today we will focus on Step 1:
Detecting communities

Community

From an operational point of view, a community is defined as a set of nodes

Community 1 (Red): {5,11,17,6,7}
Community 2 (Green): {1,2,3,4,8,10,12,13,14,18,20,22}
Community 3 (Black): {9,15,16,19,21,23,24,25,26,27,28,29,30,31,32,33,34}

Zachary’s Karate Club

Partition and Cover

Partition of a network: each node is assigned to one and only one community

Cover of a network: each node is assigned to at least one community

Partition

Partition of Zachary’s Karate Club network into 2 communities

Partition: each node is assigned to one and only one community

[Figure: the network split into Community 1 and Community 2]

Cover

Cover of Zachary’s Karate Club network with 2 communities

Cover: each node is assigned to at least one community

Nodes 3 and 10 are assigned to both communities

[Figure: the network covered by Community 1 and Community 2]

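The difference between a partition and a cover can be checked mechanically once communities are stored as sets of nodes. The sketch below is only an illustration: the helper names `is_cover` and `is_partition` are made up for this example, the three sets are the ones listed for Zachary’s Karate Club, and the overlapping variant is a hypothetical one (the full membership of the cover shown in the figure is not given in the text).

```python
from itertools import combinations

def is_cover(communities, nodes):
    """True if every node belongs to at least one community."""
    covered = set().union(*communities)
    return covered == set(nodes)

def is_partition(communities, nodes):
    """True if every node belongs to exactly one community."""
    disjoint = all(a.isdisjoint(b) for a, b in combinations(communities, 2))
    return is_cover(communities, nodes) and disjoint

# The three karate-club communities listed in the slides (nodes labeled 1..34).
red = {5, 11, 17, 6, 7}
green = {1, 2, 3, 4, 8, 10, 12, 13, 14, 18, 20, 22}
black = {9, 15, 16, 19, 21} | set(range(23, 35))
nodes = range(1, 35)

print(is_partition([red, green, black], nodes))  # True

# Hypothetical overlap: node 3 is placed in two communities at once.
overlapping = [red | {3}, green, black]
print(is_cover(overlapping, nodes))      # True
print(is_partition(overlapping, nodes))  # False
```

Every partition is also a cover, but not the other way round, which is why the overlapping example passes only the first check.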
Partition

[Figure not recovered]

An intuitive definition of community

In many real networks, nodes appear to group into sub-networks in which the density of internal links is larger than the density of connections with the rest of the vertices in the network.

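One possible way to make this intuition concrete is to compare two densities for a candidate community: links inside it divided by the number of possible internal pairs, and links leaving it divided by the number of possible pairs across its border. This is a sketch under that assumption, not a definition given in the slides; the function name `link_densities` and the toy network (two triangles joined by one link) are invented for the example.

```python
def link_densities(edges, community, n_nodes):
    """Return (internal density, external density) for one community.

    internal = links inside the community / possible pairs inside it
    external = links leaving the community / possible pairs across its border
    Assumes the community has at least 2 nodes and is not the whole network.
    """
    community = set(community)
    n_c = len(community)
    inside = sum(1 for u, v in edges if u in community and v in community)
    leaving = sum(1 for u, v in edges if (u in community) != (v in community))
    internal = inside / (n_c * (n_c - 1) / 2)
    external = leaving / (n_c * (n_nodes - n_c))
    return internal, external

# Hypothetical toy network: two triangles joined by the single link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
internal, external = link_densities(edges, {0, 1, 2}, n_nodes=6)
print(internal > external)  # True
```

Here the triangle {0, 1, 2} has all 3 of its possible internal links but only 1 of the 9 possible links to the other nodes.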
Modularity

The modularity is the sum over all the communities of the difference between the fraction of links connecting vertices within each community and the expected value of this fraction in a random rewiring of the network.

$$\text{Modularity} = \frac{1}{L} \sum_{\text{communities } r} \left(\text{num of links within } r - \text{expected num of links within } r \text{ after random rewiring}\right)$$

The optimal partition of nodes is the one which maximizes the Modularity

M.E.J. Newman and M. Girvan, PRE 69, 026113 (2004)

Rewiring

Consider this network
[Figure: example network with vertices A, B, C, D, E, F, G, before and after one rewiring step]

Randomly select 2 edges
Rewire
Iterate the procedure

The rewiring procedure destroys the community structure of the network
The rewiring procedure does not modify the degree of vertices

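The rewiring step can be sketched as a "double edge swap": two links (a, b) and (c, d) are replaced by (a, d) and (c, b). The code below is a minimal illustration, assuming an undirected simple graph stored as a list of node pairs; the function names, the rule for skipping invalid swaps and the toy network are choices made for this example, not details taken from the slides.

```python
import random
from collections import Counter

def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Repeatedly pick two links (a, b), (c, d) and replace them with (a, d), (c, b).

    A swap is skipped when it would create a self-loop or a duplicate link,
    so the result is still a simple graph with the same degree sequence.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:  # consider both orientations of the second link
            c, d = d, c
        if len({a, b, c, d}) < 4:  # the two links share a node: skip
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:  # would duplicate a link: skip
            continue
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def degrees(edges):
    """Degree of every node, counted from the edge list."""
    return Counter(node for edge in edges for node in edge)

# Hypothetical toy network: two triangles joined by the link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
rewired = degree_preserving_rewire(edges, n_swaps=20)
print(degrees(rewired) == degrees(edges))  # True
```

Each swap removes one link end from a, b, c and d and gives one back, so every degree stays the same while the pattern of who is linked to whom is shuffled.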
Why not use the rewiring?

Because it is computationally VERY DEMANDING for large networks

Expected number of links within a community after rewiring

An approximate (simple) result

MATH AHEAD

Consider two vertices A and B with degree k_A and k_B, respectively.

Conceptual insight
The probability that a link appears between A and B after random rewiring is proportional to the product of the degrees k_A and k_B:

$$P_{AB} = \frac{k_A k_B}{N}$$

N is a constant, which is introduced to guarantee that the (expected) degree of each vertex is preserved after the rewiring:

$$\sum_{B} P_{AB} = k_A$$

This is what we know:

$$\sum_{B} \frac{k_A k_B}{N} = k_A$$

We should evaluate the constant N:

$$N = \sum_{B} k_B = 2L$$

where L is the total number of links in the network.

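The value of the constant can be checked numerically. The sketch below assumes the link probability has the form k_A k_B / N and uses a hypothetical toy network (two triangles joined by one link); it verifies that the degrees sum to 2L and that, with N = 2L, the expected degree of every vertex equals its actual degree. The sum over B includes B = A, which is part of the approximation.

```python
from collections import Counter

# Hypothetical toy network: two triangles joined by the link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
L = len(edges)
k = Counter(node for edge in edges for node in edge)  # degree of each vertex

# Every link contributes to the degree of two vertices, so the degrees sum to 2L.
N = sum(k.values())
print(N == 2 * L)  # True

# With N = 2L, summing k_A * k_B / N over all B returns k_A for every vertex A.
expected_degree = {A: sum(k[A] * k[B] / N for B in k) for A in k}
print(all(abs(expected_degree[A] - k[A]) < 1e-12 for A in k))  # True
```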
An example

[Figure: example network with vertices A, B, C, D, E, F, G]

L = number of links in the network = 9
k_A = degree(A) = 4
k_B = degree(B) = 4

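With these numbers the arithmetic is short. The snippet assumes the approximation k_A k_B / (2L) for the probability of a link between A and B after rewiring, which is the form consistent with the definition of a_r used for the modularity.

```python
from fractions import Fraction

L = 9      # number of links in the example network
k_A = 4    # degree of A
k_B = 4    # degree of B

# Approximate probability of a link between A and B after random rewiring.
p_AB = Fraction(k_A * k_B, 2 * L)
print(p_AB, round(float(p_AB), 3))  # 8/9 0.889
```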
Consider two vertices A and B with degree k_A and k_B, respectively.

The probability that a link appears between A and B after random rewiring is:

$$P_{AB} = \frac{k_A k_B}{2L}$$

$$\text{Modularity} = \frac{1}{L} \sum_{\text{communities } r} \left(\text{num of links within } r - \text{expected num of links within } r \text{ after random rewiring}\right)$$

$$\frac{1}{L} \left(\text{expected num of links within } r \text{ after random rewiring}\right) = a_r^2$$

where

$$a_r = \frac{1}{2L} \sum_{i \in r} k_i$$

Modularity

$$Q = \sum_{r=1}^{C} \left(e_r - a_r^2\right)$$

where
e_r = fraction of links observed within community r
$$a_r = \frac{1}{2L} \sum_{i \in r} k_i$$
L = number of links in the network
k_i = degree of vertex i
C = number of communities

M.E.J. Newman and M. Girvan, PRE 69, 026113 (2004)

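The formula translates almost line by line into code. The following is a minimal sketch for an undirected, unweighted network given as a list of links; the function name `modularity` and the toy network (two triangles joined by one link) are assumptions made for this example.

```python
from collections import Counter

def modularity(edges, communities):
    """Q = sum over communities r of (e_r - a_r**2), undirected and unweighted."""
    L = len(edges)
    k = Counter(node for edge in edges for node in edge)  # degrees k_i
    Q = 0.0
    for r in communities:
        # e_r: fraction of links with both ends inside community r
        e_r = sum(1 for u, v in edges if u in r and v in r) / L
        # a_r: sum of the degrees in r, divided by 2L
        a_r = sum(k[i] for i in r) / (2 * L)
        Q += e_r - a_r ** 2
    return Q

# Hypothetical toy network: two triangles joined by the link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
Q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
Q_whole = modularity(edges, [{0, 1, 2, 3, 4, 5}])
print(round(Q_split, 4))  # 0.3571
print(Q_whole)            # 0.0
```

For the split into the two triangles, each community has e_r = 3/7 and a_r = 1/2, so Q = 2 (3/7 - 1/4) = 5/14. Putting all nodes in one community gives Q = 0.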
Why is this expression of modularity just an approximation?

We said that the probability that a link appears between A and B after random rewiring is:

$$P_{AB} = \frac{k_A k_B}{2L}$$

Is this correct?

No! However, it’s a good approximation for small values of the degrees k_A and k_B

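One way to see where the approximation fails is to evaluate k_A k_B / (2L) for large degrees: the result can exceed 1, which no probability can do. The function name and the numbers below are hypothetical and chosen only to illustrate the point.

```python
def approx_link_probability(k_A, k_B, L):
    """Approximate probability of a link between A and B after random rewiring."""
    return k_A * k_B / (2 * L)

# Low degrees in a large network: the value behaves like a probability.
small = approx_link_probability(3, 4, L=1000)
print(small)  # 0.006

# Two hypothetical hubs in a small network: the "probability" exceeds 1.
large = approx_link_probability(10, 10, L=19)
print(large > 1)  # True
```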
Modularity

$$Q = \sum_{r=1}^{C} \left(e_r - a_r^2\right)$$

with e_r, a_r, L, k_i and C defined as before.

Modularity should be maximized!

M.E.J. Newman and M. Girvan, PRE 69, 026113 (2004)

Modularity and karate club

[Figure: karate club network]
The modularity of the whole network is Q=0

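The value Q = 0 follows directly from the definition when all nodes are placed in a single community (C = 1). As a short worked check, using e_r and a_r as defined for the modularity: all L links are internal, and the degrees of all vertices sum to 2L.

```latex
\[
e_1 = \frac{L}{L} = 1, \qquad
a_1 = \frac{1}{2L}\sum_{i} k_i = \frac{2L}{2L} = 1, \qquad
Q = e_1 - a_1^2 = 1 - 1 = 0 .
\]
```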
Modularity and karate club

[Figure: karate club network cut into 2 communities]
The maximum modularity of the network after the cut into 2 communities is Q=0.3718

Modularity and karate club

[Figure: karate club network cut into 3 communities]
The maximum modularity of the network after the cut into 3 communities is Q=0.4020

Modularity and karate club

[Figure: karate club network cut into 4 communities]
The maximum modularity of the network after the cut into 4 communities is Q=0.4188

J. Duch and A. Arenas, PRE 72, 027104 (2005)

A different approach to community detection

Cities I’ve visited

The basic idea is to compress information depending on the time that a “random particle” spends in each “area” of the network

Increasing level of aggregation:
CMU campus
Pittsburgh
Pennsylvania
United States
North America

Living in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch

It is a large village and community on the island of Anglesey in Wales, situated on the Menai Strait next to the Britannia Bridge and across the strait from Bangor.

Imagine how it would be if every city had such a long name…

Compressing Information

Forbes Avenue
5000 Pittsburgh
Fifth Avenue
Philadelphia PA
New York NY
?

Information flow on networks:
another definition of communities

The Infomap method

M. Rosvall and C.T. Bergstrom, PNAS 105 (4), 1118-1123 (2008)

Modularity VS Information Flow

Information Flow: Minimize the map equation L (Infomap method)

Modularity: Maximize Q

The modularity Q can be generalized to deal with weighted directed networks

M. Rosvall and C.T. Bergstrom, PNAS 105 (4), 1118-1123 (2008)

Infomap applied to the karate club network

[Figure: communities found by Infomap in the karate club network]
Map equation L=4.31179

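To give a feeling for what the map equation measures, here is a minimal sketch of the two-level map equation of Rosvall and Bergstrom for an undirected, unweighted network. It assumes a random walker that visits node i with frequency k_i / (2L) and crosses each link in each direction with probability 1 / (2L) per step. The function names and the toy network are invented for this example; this is not the Infomap software, and it is not expected to reproduce the value quoted for the karate club.

```python
from collections import Counter
from math import log2

def plogp(p):
    """p * log2(p), with the convention 0 * log2(0) = 0."""
    return p * log2(p) if p > 0 else 0.0

def map_equation(edges, modules):
    """Two-level map equation (in bits) for an undirected, unweighted network."""
    two_L = 2 * len(edges)
    k = Counter(node for edge in edges for node in edge)
    module_of = {node: m for m, nodes in enumerate(modules) for node in nodes}

    # q[m]: probability per step that the random walker exits module m
    q = [0.0] * len(modules)
    for u, v in edges:
        if module_of[u] != module_of[v]:
            q[module_of[u]] += 1 / two_L
            q[module_of[v]] += 1 / two_L
    q_total = sum(q)

    # Index codebook: names the module the walker moves into.
    length = 0.0
    if q_total > 0:
        length += q_total * -sum(plogp(q_m / q_total) for q_m in q)

    # Module codebooks: name the nodes inside a module, plus one exit code.
    for m, nodes in enumerate(modules):
        p_circle = q[m] + sum(k[i] / two_L for i in nodes)
        entropy = -plogp(q[m] / p_circle) - sum(
            plogp(k[i] / two_L / p_circle) for i in nodes
        )
        length += p_circle * entropy
    return length

# Hypothetical toy network: two triangles joined by the link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_module = map_equation(edges, [{0, 1, 2, 3, 4, 5}])
two_modules = map_equation(edges, [{0, 1, 2}, {3, 4, 5}])
print(two_modules < one_module)  # True
```

With a single module the walker never exits, so the description length reduces to the entropy of the node visit frequencies. Splitting the toy network into its two triangles gives a shorter description, which is the sense in which Infomap prefers that partition.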
Concepts to take home

Modularity Maximization is based on the presence of an excess of links within a community of nodes
Infomap is based on information flow and information compression
Communities are usually hierarchically organized
Communities may be overlapping
Different methods can detect different communities

Network data and Software

Network Repository:
http://networkrepository.com

Software packages (that include community detection algorithms):
http://deim.urv.cat/~sergio.gomez/radatools.php#download
https://www.mapequation.org
Package igraph (in R: https://www.r-project.org)
Package NetworkX (in Python: https://networkx.github.io/documentation/stable/index.html)

(Mainly for) Network visualization: Cytoscape (https://cytoscape.org); Pajek (http://mrvar.fdv.uni-lj.si/pajek/)

How is the modularity maximized?

[Figure: karate club network cut into 2 communities]
The maximum modularity of the network after the cut into 2 communities is Q=0.3718

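As an illustration only, the maximization can be done by brute force on a very small network: try every split of the nodes into two non-empty communities and keep the one with the largest Q. The number of splits, 2^(n-1) - 1 for n nodes, grows exponentially, so practical methods rely on heuristics instead. The function names and the toy network below are assumptions made for this sketch and are not the procedure used to obtain the karate-club values.

```python
from collections import Counter
from itertools import combinations

def modularity(edges, communities):
    """Q = sum over communities r of (e_r - a_r**2), undirected and unweighted."""
    L = len(edges)
    k = Counter(node for edge in edges for node in edge)
    return sum(
        sum(1 for u, v in edges if u in r and v in r) / L
        - (sum(k[i] for i in r) / (2 * L)) ** 2
        for r in communities
    )

def best_bipartition(edges):
    """Try every split into two non-empty communities and keep the best Q."""
    nodes = sorted({node for edge in edges for node in edge})
    first, rest = nodes[0], nodes[1:]
    best_Q, best_split = float("-inf"), None
    # Keeping the first node in community 1 avoids counting each split twice.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            c1 = {first, *extra}
            c2 = set(nodes) - c1
            Q = modularity(edges, [c1, c2])
            if Q > best_Q:
                best_Q, best_split = Q, (c1, c2)
    return best_Q, best_split

# Hypothetical toy network: two triangles joined by the link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best_Q, (c1, c2) = best_bipartition(edges)
print(round(best_Q, 4), sorted(c1), sorted(c2))  # 0.3571 [0, 1, 2] [3, 4, 5]
```

The best split cuts only the link joining the two triangles, which is the partition one would pick by eye.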
Next class
