
Lecture 8:

Generalized large-scale structure

Aaron Clauset
@aaronclauset

Assistant Professor of Computer Science
University of Colorado Boulder
External Faculty, Santa Fe Institute

© 2017 Aaron Clauset


hierarchical communities

most communities are not random graphs


• groups within groups / groups of groups
• finding communities at one "level" of a hierarchy
can obscure structure above or below that level
hierarchical communities

[figure: a food web hierarchy — plant → herbivore → parasite]

hierarchical communities

[figure: modules]

hierarchical communities

[figure: nested modules]

can we automatically extract such hierarchies?

step 1: network data · step 2: ? · step 3: hierarchy
hierarchical communities

hierarchical random graph model

• parameters: a dendrogram D and a probability p_r for each internal node r of D
• assortative modules correspond to dendrogram subtrees with high p_r

the model is an "inhomogeneous" random graph; in an instance,

Pr(i, j connected) = p_r
                   = p(lowest common ancestor of i, j)

Clauset, Moore, Newman, Nature 453, 98-101 (2008)
Clauset, Moore, Newman, ICML (2006)
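This generative rule is straightforward to simulate. A minimal sketch (my own representation, not the authors' code): a dendrogram as nested tuples (p_r, left, right) with integer leaves, connecting each cross-subtree pair of nodes with the probability at its lowest common ancestor:

```python
import random

def leaves(t):
    """Collect the leaf (node) labels under a dendrogram subtree."""
    if isinstance(t, int):
        return [t]
    _, left, right = t
    return leaves(left) + leaves(right)

def sample_hrg(dendrogram, rng=random):
    """Draw one graph from the HRG: each pair (i, j) is connected
    independently with probability p_r of its lowest common ancestor."""
    edges = set()
    def walk(t):
        if isinstance(t, int):
            return
        p, left, right = t
        # every cross-subtree pair has this internal node as its LCA
        for u in leaves(left):
            for v in leaves(right):
                if rng.random() < p:
                    edges.add((min(u, v), max(u, v)))
        walk(left)
        walk(right)
    walk(dendrogram)
    return edges

# with all p_r = 1 every pair is connected: a clique on the leaves
D = (1.0, (1.0, 0, 1), (1.0, 2, 3))
print(len(sample_hrg(D)))  # 6 pairs among 4 nodes
```

Note each pair of nodes is considered exactly once, at its LCA, so the edge probabilities match the model's definition.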
hierarchical communities

hierarchical random graph model

Pr(A | D, {p_r}) = ∏_r p_r^{E_r} (1 − p_r)^{L_r R_r − E_r}

where, for each internal node r of the dendrogram D:
  L_r = number of leaves in r's left subtree
  R_r = number of leaves in r's right subtree
  E_r = number of edges in A with r as lowest common ancestor

Clauset, Moore, Newman, Nature 453, 98-101 (2008)
Clauset, Moore, Newman, ICML (2006)
worked example: two dendrograms for the same small graph

L(D, {p_r}) = ∏_r p_r^{E_r} (1 − p_r)^{L_r R_r − E_r}

dendrogram 1 (internal probabilities 1/3 and 1/4, all others 1):
  L = (1/3)^1 (2/3)^2 · (1/4)^2 (3/4)^6 ≈ 0.0016

dendrogram 2 (a single internal probability 1/9, all others 1):
  L = (1/9)^1 (8/9)^8 ≈ 0.0433

the second dendrogram is the more likely explanation of the graph
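These example likelihoods can be checked directly from the formula. A small sketch (mine, not the lecture's code) evaluating L(D, {p_r}) from per-internal-node counts (p_r, E_r, L_r, R_r); the particular subtree sizes below are assumptions chosen to match the exponents shown:

```python
def hrg_likelihood(internal_nodes):
    """L(D, {p_r}) = prod_r p_r^E_r * (1 - p_r)^(L_r R_r - E_r),
    one factor per internal node r, given as (p_r, E_r, L_r, R_r).
    Nodes with p_r = 1 and E_r = L_r * R_r contribute a factor of 1
    and can be omitted."""
    L = 1.0
    for p, E, Lr, Rr in internal_nodes:
        L *= p ** E * (1 - p) ** (Lr * Rr - E)
    return L

# dendrogram 1: probabilities 1/3 (1 edge of 1*3 slots) and 1/4 (2 of 2*4)
L1 = hrg_likelihood([(1/3, 1, 1, 3), (1/4, 2, 2, 4)])
# dendrogram 2: a single probability 1/9 (1 edge of 3*3 slots)
L2 = hrg_likelihood([(1/9, 1, 3, 3)])
print(round(L1, 4), round(L2, 4))  # 0.0016 0.0433
```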
hierarchical communities

generalizing from a single example

• given graph A, estimate model parameters D, {p_r}
• sample new graphs from posterior distribution Pr(G | D, {p_r})

checking the models

compare resampled graphs with the original data:
1. degree distribution
2. clustering coefficient
3. geodesic path lengths
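These three checks can be computed in a few lines; a self-contained sketch in plain Python (graphs as adjacency dicts; a library such as networkx would do the same):

```python
from collections import deque

def degree_hist(adj):
    """Histogram of node degrees: {degree: count}."""
    hist = {}
    for u in adj:
        d = len(adj[u])
        hist[d] = hist.get(d, 0) + 1
    return hist

def mean_clustering(adj):
    """Mean local clustering coefficient over all nodes."""
    total, n = 0.0, 0
    for u in adj:
        nbrs = list(adj[u])
        k = len(nbrs)
        n += 1
        if k < 2:
            continue  # nodes with degree < 2 contribute 0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / n

def mean_geodesic(adj):
    """Mean shortest-path length over reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(degree_hist(triangle), mean_clustering(triangle), mean_geodesic(triangle))
# {2: 3} 1.0 1.0
```

Computing each statistic on the original graph and on many resampled graphs, then comparing the distributions, is the check the slides describe.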
hierarchical communities


degree distribution

[figure: fraction of vertices with degree k vs. degree k (log-log), original vs. resampled]
hierarchical communities

density of triangles

[figure: distribution of the clustering coefficient c over resampled graphs, with the original value marked]
hierarchical communities

geodesic distances

[figure: fraction of vertex pairs at distance d vs. distance d (semi-log), original vs. resampled]
hierarchical communities

inspecting the dendrograms

[figure: dendrograms for the NCAA Schedule 2000 network and Zachary's Karate Club]
hierarchical communities

[figure: MAP dendrogram for Zachary's Karate Club]
hierarchical communities

[figure: MAP dendrogram for the NCAA Schedule 2000 network, with team names at the leaves]
hierarchical communities

link prediction in networks

• many networks are sampled
  • social nets, food webs, protein interactions, etc.
• generative models provide an estimate of Pr(A_ij | θ)
  for either A_ij = 0 (missing links) or A_ij = 1 (spurious links)
• like cross-validation: hold out some adjacencies {A_ij} and
  measure the accuracy of the algorithm on these

now many approaches to link prediction:


• Liben-Nowell & Kleinberg (2003)
• Goldberg & Roth (2003)
• Szilágyi et al. (2005)
• Guimera & Sales-Pardo (2009)
• and many others
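The simple baseline predictors that appear in the plots below, and the AUC score itself, are a few lines each. A sketch (graph as adjacency sets; these are the standard score definitions, not code from the cited papers):

```python
def common_neighbors(adj, u, v):
    """Number of shared neighbors of u and v."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared neighbors as a fraction of all neighbors."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def degree_product(adj, u, v):
    """Preferential-attachment score d_u * d_v."""
    return len(adj[u]) * len(adj[v])

def auc(pos_scores, neg_scores):
    """AUC = Pr(a true missing link outscores a true non-link),
    counting ties as 1/2; pure chance gives 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# toy graph with the edge (0, 2) held out: it shares two neighbors,
# so common neighbors ranks it above the true non-edges
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2, 4}, 4: {3}}
pos = [common_neighbors(adj, 0, 2)]
neg = [common_neighbors(adj, 1, 4), common_neighbors(adj, 0, 4)]
print(auc(pos, neg))  # 1.0
```

In the cross-validation setup from the previous slide, `pos_scores` come from held-out edges and `neg_scores` from true non-edges.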
hierarchical communities

Grassland species network

[figure: AUC (area under the ROC curve) vs. fraction of edges observed, k/m; predictors: pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, hierarchical structure; pure chance gives AUC = 0.5]
hierarchical communities

a. Terrorist association network
b. T. pallidum metabolic network

[figure: AUC vs. fraction of edges observed for the same six predictors, on each network]
hierarchical communities

other approaches

PHYSICAL REVIEW X 4, 011047 (2014)

Hierarchical Block Structures and High-Resolution Model Selection in Large Networks

Tiago P. Peixoto*
Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany
(Received 5 November 2013; published 24 March 2014)

Discovering and characterizing the large-scale topological features in empirical networks are crucial steps in understanding how complex systems function. However, most existing methods used to obtain the modular structure of networks suffer from serious problems, such as being oblivious to the statistical evidence supporting the discovered patterns, which results in the inability to separate actual structure from noise. In addition to this, one also observes a resolution limit on the size of communities, where smaller but well-defined clusters are not detectable when the network becomes large. This phenomenon occurs for the very popular approach of modularity optimization, which lacks built-in statistical validation, but also for more principled methods based on statistical inference and model selection, which do incorporate statistical validation in a formally correct way. Here, we construct a nested generative model that, through a complete description of the entire network hierarchy at multiple scales, is capable of avoiding this limitation and enables the detection of modular structure at levels far beyond those possible with current approaches. Even with this increased resolution, the method is based on the principle of parsimony, and is capable of separating signal from noise, and thus will not lead to the identification of spurious modules even on sparse networks. Furthermore, it fully generalizes other approaches in that it is not restricted to purely assortative mixing patterns, directed or undirected graphs, and ad hoc hierarchical structures such as binary trees. Despite its general character, the approach is tractable and can be combined with advanced techniques of community detection to yield an efficient algorithm that scales well for very large networks.

DOI: 10.1103/PhysRevX.4.011047   Subject Areas: Complex Systems, Interdisciplinary Physics, Statistical Physics

[figure: the nested model — the observed network (N nodes, E edges, level l = 0) is summarized by the edge counts among its B_1 blocks (level l = 1); another SBM is fit to these, and the process repeats at levels l = 2, 3, …]

Peixoto, Phys. Rev. X 4, 011047 (2014)
hierarchical communities

other approaches (hierarchical SBM)

[figure: hierarchical SBM fit to the political blogs (2004) network]

Peixoto, Phys. Rev. X 4, 011047 (2014)

limits of statistical inference

community structure in networks


• dozens of algorithms for finding it
• generative models among the most powerful
• how methods fail is as important as how they succeed
• even if communities exist in a network, they may not be detectable
limits of statistical inference

planted partition problem

• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• parameterized strength of communities: ε = c_out / c_in

edge probabilities: c_in / n within groups, c_out / n between groups
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
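The planted partition model above is a two-parameter SBM and easy to generate. A minimal sketch (names are mine):

```python
import random

def planted_partition(n, c_in, c_out, rng=random):
    """Two equal groups; within-group pairs connect with probability
    c_in/n, between-group pairs with probability c_out/n, so the mean
    degree is c = (c_in + c_out)/2. Returns (edges, group labels)."""
    labels = [0] * (n // 2) + [1] * (n - n // 2)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = c_in / n if labels[u] == labels[v] else c_out / n
            if rng.random() < p:
                edges.append((u, v))
    return edges, labels

# epsilon = c_out / c_in controls community strength: epsilon = 1 gives
# an Erdos-Renyi graph, epsilon = 0 gives two disconnected communities
edges, labels = planted_partition(100, 6, 2)
```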
limits of statistical inference

planted partition problem

• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• ε = c_out / c_in

[figure: overlap (accuracy) vs. ε = c_out/c_in, for q = 2, c = 3 and for q = 4, c = 16 (the benchmark of Newman and Girvan); BP and MCMC results for N = 128 up to N = 500k; strong communities are easy to detect, then hard to detect, and beyond a critical ε the partition is undetectable, with overlap falling to the random-graph level]
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
limits of statistical inference

planted partition problem

• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• ε = c_out / c_in
• 2nd-order phase transition in detectability
• overlap goes to 0 for ε > (c − √c) / (c + √c (k − 1))

[figure: the same overlap vs. ε curves, with the detectability transition marked]
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
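The critical point is easy to evaluate numerically; a quick sketch (the formula is the threshold from the slide, with k the number of groups):

```python
import math

def epsilon_c(c, k=2):
    """Detectability threshold eps_c = (c - sqrt(c)) / (c + sqrt(c)(k - 1)):
    for eps = c_out/c_in above this value the planted partition is
    undetectable (for 2 groups, by any algorithm)."""
    return (c - math.sqrt(c)) / (c + math.sqrt(c) * (k - 1))

# the two benchmark cases: q = 2, c = 3 and q = 4, c = 16
print(round(epsilon_c(3, 2), 3), round(epsilon_c(16, 4), 4))  # 0.268 0.4286
```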
limits of statistical inference

planted partition problem


• for 2 groups, the phase transition is information-theoretic:
  no algorithm can exist that detects these communities (better than chance)
• when communities are strong, most algorithms succeed
• when networks & communities are very sparse = trouble

• recently generalized to dynamic networks (Ghasemian et al. 2015)


• hierarchical block models (Peixoto 2014) and node metadata (Newman &
Clauset 2016) both improve detectability

Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
Ghasemian et al., arxiv:1506.0679 (2015)
Peixoto, Phys. Rev. X 4, 011047 (2014)
Newman & Clauset, Nature Communications, to appear (2016)
the trouble with community detection

many networks include metadata on their nodes:


social networks age, sex, ethnicity or race, etc.
food webs feeding mode, species body mass, etc.
Internet data capacity, physical location, etc.
protein interactions molecular weight, association with cancer, etc.

metadata x is often used to evaluate the accuracy of community detection algorithms:

if community detection method A finds a partition P that correlates with x,
then we say that A is good
the trouble with community detection

[figure: Zachary karate club and political blogs networks]
the trouble with community detection

[figure: political books (2004) and NCAA 2000 Schedule networks]
the trouble with community detection


often, groups found by community detection are meaningful


• allegiances or personal interests in social networks [1]
• biological function in metabolic networks [2]

but some recent studies claim these are the exception


• real networks either do not contain structural communities or
communities exist but they do not correlate with metadata groups [3]

[1] see Fortunato (2010), and Adamic & Glance (2005)


[2] see Holme, Huss & Jeong (2003), and Guimera & Amaral (2005)
[3] see Leskovec et al. (2009), and Yang & Leskovec (2012), and Hric, Darst & Fortunato (2014)
the trouble with community detection

Hric, Darst & Fortunato (2014)


• 115 networks with metadata & 12 community detection methods
• compare extracted P with observed x for each A
Name No. Nodes No. Edges No. Groups Description of group nature
lfr 1000 9839 40 artificial network (lfr, 1000S, µ = 0.5)
karate 34 78 2 membership after the split
football 115 615 12 team scheduling groups
polbooks 105 441 2 political alignment
polblogs 1222 16782 3 political alignment
dpd 35029 161313 580 software package categories
as-caida 46676 262953 225 countries
fb100 762–41536 16651–1465654 2–2597 common students’ traits
pgp 81036 190143 17824 email domains
anobii 136547 892377 25992 declared group membership
dblp 317080 1049866 13472 publication venues
amazon 366997 1231439 14–29432 product categories
flickr 1715255 22613981 101192 declared group membership
orkut 3072441 117185083 8730807 declared group membership
lj-backstrom 4843953 43362750 292222 declared group membership
lj-mislove 5189809 49151786 2183754 declared group membership
[1] fb100 is 100 networks
the trouble with community detection

Hric, Darst & Fortunato (2014)


• evaluate by normalized mutual information NMI(P, x)

[figure: NMI(P, x) across networks and methods; the "classic" data sets are marked]

[1] maximum NMI between any partition layer of the metadata partitions and any layer returned by the community detection method
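NMI itself is short to implement from the definitions. A sketch, normalizing the mutual information by the smaller of the two entropies (one common convention; dividing by the mean or the maximum entropy are alternatives):

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a labeling (natural log)."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def mutual_info(a, b):
    """Mutual information between two labelings of the same nodes."""
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    return sum(c / n * log(c * n / (pa[x] * pb[y]))
               for (x, y), c in joint.items())

def nmi(partition, metadata):
    """NMI = I(P; x) / min[H(P), H(x)]: 1 when either labeling perfectly
    predicts the other, 0 when they are independent."""
    h = min(entropy(partition), entropy(metadata))
    return mutual_info(partition, metadata) / h if h > 0 else 0.0

print(nmi([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```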
but wait!

[1] image copyright BostonGazette or maybe 20th Century Fox? gah


a solution

idea:
use metadata x to help select a partition P* ∈ {P} that correlates with x,
from among the exponential number of plausible partitions


use a generative model to guide the selection:


• define a parametric probability distribution over networks Pr(G | θ)
• generation: given θ, draw G from this distribution
• inference: given G, choose θ that makes G likely

generation: model Pr(G | θ) → data G = (V, E)
inference: data G = (V, E) → model Pr(G | θ)
a metadata-aware stochastic block model

generation
given metadata x = {x_u} and degree d = {d_u} for each node u
• each node u is assigned a community s with probability γ_{sx}
• thus, the prior on community assignments is P(s | Γ, x) = ∏_i γ_{s_i, x_i}
• given assignments, place edges independently, each with probability
  p_uv = d_u d_v θ_{s_u, s_v}
• where the θ_st are the stochastic block matrix parameters

this is a degree-corrected stochastic block model (DC-SBM)
with a metadata-based prior on community labels

[1] Γ is the k × K matrix of parameters γ_{sx}
[2] Karrer & Newman (2011)
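The two generation steps are one line each. A sketch with the slide's symbols spelled out (`gamma` for Γ, `theta` for Θ; the toy numbers are made up):

```python
def assignment_prior(s, x, gamma):
    """P(s | Gamma, x) = prod_i gamma[s_i][x_i]: probability of a full
    community assignment s given metadata x and the k x K matrix gamma."""
    p = 1.0
    for si, xi in zip(s, x):
        p *= gamma[si][xi]
    return p

def edge_prob(u, v, d, s, theta):
    """Degree-corrected edge probability p_uv = d_u * d_v * theta[s_u][s_v]."""
    return d[u] * d[v] * theta[s[u]][s[v]]

# two communities (k = 2), two metadata values (K = 2); each metadata
# column of gamma sums to 1 over the communities
gamma = [[0.9, 0.2],
         [0.1, 0.8]]
print(round(assignment_prior([0, 1], [0, 1], gamma), 2))  # 0.72
```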
a metadata-aware stochastic block model

inference
given an observed network A (adjacency matrix)
• the model likelihood is

  P(A | Θ, Γ, x) = Σ_s P(A | Θ, s) P(s | Γ, x)
                 = Σ_s ∏_{u<v} p_uv^{A_uv} (1 − p_uv)^{1 − A_uv} ∏_u γ_{s_u, x_u}

  (the first factor depends on the network, the second on the metadata)
• where Θ is a k × k matrix of community interaction parameters θ_st,
  and the sum is over all possible assignments s
• we fit this model to data using expectation-maximization (EM) to
  maximize P(A | Θ, Γ, x) w.r.t. Θ and Γ

[1] technical details in Newman & Clauset (2015) arxiv:1507.04001


networks with planted structure

does this method recover known structure in synthetic data?

• use the SBM to generate planted partition networks, with k = 2
  equal-sized groups and mean degree c = (c_in + c_out)/2
• assign metadata with variable correlation ρ ∈ [0.5, 0.9] to the
  true group labels
• vary the strength of the partition, c_in − c_out
• when c_in − c_out ≤ √(2 (c_in + c_out)), no structure-only algorithm can
  recover the planted communities better than chance
  (the detectability threshold, which is a phase transition)

[1] Decelle, Krzakala, Moore & Zdeborova (2011)
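A quick numeric check of this detectability condition (a sketch; the function name is mine):

```python
import math

def structure_only_detectable(c_in, c_out):
    """Planted communities (k = 2) are detectable by structure alone only
    when c_in - c_out exceeds sqrt(2 * (c_in + c_out))."""
    return (c_in - c_out) > math.sqrt(2 * (c_in + c_out))

# at mean degree c = 8 (so c_in + c_out = 16), the threshold sits at
# c_in - c_out = sqrt(32), about 5.66
print(structure_only_detectable(12, 4), structure_only_detectable(10, 6))
# True False
```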


networks with planted structure

let mean degree c = 8

• when ρ = 0.5, metadata isn't useful and we recover regular SBM behavior

[figure: fraction of correctly assigned nodes vs. c_in − c_out, for metadata correlations ρ = 0.5, 0.6, 0.7, 0.8, 0.9; the region below the detectability threshold is marked "undetectable"]

[1] n = 10 000
networks with planted structure

let mean degree c = 8

• when ρ = 0.5, metadata isn't useful and we recover regular SBM behavior
• when metadata correlates with the true groups, ρ > 0.5,
  accuracy is better than either metadata or SBM alone

metadata + SBM performs better than either
• any algorithm without metadata, or
• metadata alone.

[figure: fraction of correctly assigned nodes vs. c_in − c_out, for ρ = 0.5 through 0.9]
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
2. marine food web: predator-prey interactions among 488
species in Weddell Sea in Antarctica
3. Malaria gene recombinations: recombination events among
297 var genes
4. Facebook friendships: online friendships among 15,126
Harvard students and alumni
5. Internet graph: peering relations among 46,676 Autonomous
Systems
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}

[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}

• method finds a good partition


between high-school and
middle-school
• NMI = 0.881
• without metadata:
  NMI ∈ [0.105, 0.384]

[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}

• method finds a good partition


between blacks and whites
(with others scattered among)
  NMI = 0.820
• without metadata:
  NMI ∈ [0.120, 0.239]

[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}

• method finds no good


partition between males/
females.
instead, chooses a mixture of
grade/ethnicity partitions
  NMI = 0.003
• without metadata:
  NMI ∈ [0.000, 0.010]

[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks

2. marine food web: predator-prey interactions among 488 species
   in the Weddell Sea in Antarctica
• x = {species body mass, feeding mode, oceanic zone}
• the partition recovers the known correlation between body mass,
  trophic level, and ecosystem role (primary producer, herbivore,
  omnivore, carnivore, detritivore)

[figure: learned priors — probability of community membership as a function of mean body mass (g) — for the three-community division of the Weddell Sea network]

[1] here, we're using a continuous-metadata model
[2] Brose et al. (2005)
real-world networks

3. Malaria gene recombinations: recombination events among


297 var genes
• x = {Cys-PoLV labels for HVR6 region}
• with metadata, partition discovers correlation with Cys labels
(which are associated with severe disease)

HVR6

[figure: network partitions without metadata (NMI ∈ [0.077, 0.675]) and with metadata (NMI = 0.596)]
[1] Larremore, Clauset & Buckee (2013)
real-world networks

3. Malaria gene recombinations: recombination events among


297 var genes
• x = {Cys-PoLV labels for HVR6 region}
• on adjacent region of gene, we find Cys-PoLV labels correlate
with recombinant structure here, too
HVR5

[figure: network partitions without and with metadata]

[1] Larremore, Clauset & Buckee (2013)


the ground truth about metadata

what is the goal of community detection?

network G + method f → communities C = f(G)   vs.   metadata M


the ground truth about metadata

what is the goal of community detection?

network G + method f → communities C = f(G)   vs.   metadata M

C ≈ M

"this method works!"


the ground truth about metadata

what is the goal of community detection?

network G + method f → communities C = f(G)   vs.   metadata M

C ≈ M                          C ≠ M

"this method works!" "this method stinks!"


the ground truth about metadata

what is the goal of community detection?

there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:

1. metadata M are unrelated to network structure G


the ground truth about metadata

what is the goal of community detection?

there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:

1. metadata M are unrelated to network structure G


2. metadata M and communities C capture different aspects of structure

social groups leaders and followers


the ground truth about metadata

what is the goal of community detection?

there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:

1. metadata M are unrelated to network structure G


2. metadata M and communities C capture different aspects of structure
3. network G has no community structure
the ground truth about metadata

what is the goal of community detection?

there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:

1. metadata M are unrelated to network structure G


2. metadata M and communities C capture different aspects of structure
3. network G has no community structure
4. algorithm f is bad

"this method stinks!"


theorems for community detection

DON’T TRY TO FIND THE GROUND TRUTH

INSTEAD . . . TRY TO REALIZE THERE IS NO GROUND TRUTH


theorems for community detection

1. Theorem: no bijection between ground truth and communities


g(T) → G ← g′(T′): 2 different processes, on 2 different ground truths,
can create the same observed network

2. Theorem: No Free Lunch in community detection


no algorithm f has better performance than
any other algorithm f 0 , when averaged over
all possible inputs {G}

→ good performance comes from matching
  algorithm f to its preferred subclass of
  networks {G′} ⊂ {G}
[1] performance defined as adjusted mutual information (AMI), which is like the normalized mutual information, but adjusted for expected values
[2] original NFL theorem: Wolpert, Neural Computation (1996)
[3] proofs of these theorems are in Peel, Larremore, Clauset (2016)
fin
real-world networks

4. Facebook friendships: online friendships among 15,126
   Harvard students and alumni (in Sept. 2005)
• x = {graduation year, dormitory}
• method finds a good partition between alumni, recent
  graduates, upperclassmen, sophomores, and freshmen
• NMI = 0.668
• without metadata:
  NMI ∈ [0.573, 0.641]

[figure: prior probability of membership in each community vs. graduation year (None, 2000-2009)]

[1] Traud, Mucha & Porter (2012)

real-world networks

4. Facebook friendships: online friendships among 15,126
   Harvard students and alumni (in Sept. 2005)
• x = {graduation year, dormitory}
• method finds a good partition among the dorms
• NMI = 0.255
• without metadata:
  NMI ∈ [0.074, 0.224]

[figure: prior probability of membership in each community vs. dormitory]

[1] Traud, Mucha & Porter (2012)
