
Lecture 8:

Generalized large-scale structure

Aaron Clauset
@aaronclauset

Assistant Professor of Computer Science
University of Colorado Boulder
External Faculty, Santa Fe Institute

© 2017 Aaron Clauset


hierarchical communities

most communities are not random graphs


• groups within groups / groups of groups
• finding communities at one "level" of a hierarchy
can obscure structure above or below that level
hierarchical communities

[figure: a food web hierarchy — plant → herbivore → parasite]

hierarchical communities

[figure: modules]

hierarchical communities

[figure: nested modules]

can we automatically extract such hierarchies?

step 1: network data · step 2: ? · step 3: hierarchy
hierarchical communities

hierarchical random graph model

• parameters: a dendrogram D and a probability p_r for each internal node r of D
• assortative modules correspond to dendrogram subtrees with high p_r

the model is an "inhomogeneous" random graph; in an instance,

Pr(i, j connected) = p_r
                   = p(lowest common ancestor of i, j)

Clauset, Moore, Newman, Nature 453, 98-101 (2008)
Clauset, Moore, Newman, ICML (2006)
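This generative rule is straightforward to simulate. A minimal sketch (my own representation, not the authors' code): a dendrogram as nested tuples (p_r, left, right) with integer leaves, connecting each cross-subtree pair of nodes with the probability at its lowest common ancestor:

```python
import random

def leaves(t):
    """Collect the leaf (node) labels under a dendrogram subtree."""
    if isinstance(t, int):
        return [t]
    _, left, right = t
    return leaves(left) + leaves(right)

def sample_hrg(dendrogram, rng=random):
    """Draw one graph from the HRG: each pair (i, j) is connected
    independently with probability p_r of its lowest common ancestor."""
    edges = set()
    def walk(t):
        if isinstance(t, int):
            return
        p, left, right = t
        # every cross-subtree pair has this internal node as its LCA
        for u in leaves(left):
            for v in leaves(right):
                if rng.random() < p:
                    edges.add((min(u, v), max(u, v)))
        walk(left)
        walk(right)
    walk(dendrogram)
    return edges

# with all p_r = 1 every pair is connected: a clique on the leaves
D = (1.0, (1.0, 0, 1), (1.0, 2, 3))
print(len(sample_hrg(D)))  # 6 pairs among 4 nodes
```

Note each pair of nodes is considered exactly once, at its LCA, so the edge probabilities match the model's definition.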
hierarchical communities

hierarchical random graph model

Pr(A | D, {p_r}) = ∏_r p_r^{E_r} (1 − p_r)^{L_r R_r − E_r}

where, for each internal node r of the dendrogram D:
  L_r = number of leaves in r's left subtree
  R_r = number of leaves in r's right subtree
  E_r = number of edges in A with r as lowest common ancestor

Clauset, Moore, Newman, Nature 453, 98-101 (2008)
Clauset, Moore, Newman, ICML (2006)
worked example: two dendrograms for the same small graph

L(D, {p_r}) = ∏_r p_r^{E_r} (1 − p_r)^{L_r R_r − E_r}

dendrogram 1 (internal probabilities 1/3 and 1/4, all others 1):
  L = (1/3)^1 (2/3)^2 · (1/4)^2 (3/4)^6 ≈ 0.0016

dendrogram 2 (a single internal probability 1/9, all others 1):
  L = (1/9)^1 (8/9)^8 ≈ 0.0433

the second dendrogram is the more likely explanation of the graph
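These example likelihoods can be checked directly from the formula. A small sketch (mine, not the lecture's code) evaluating L(D, {p_r}) from per-internal-node counts (p_r, E_r, L_r, R_r); the particular subtree sizes below are assumptions chosen to match the exponents shown:

```python
def hrg_likelihood(internal_nodes):
    """L(D, {p_r}) = prod_r p_r^E_r * (1 - p_r)^(L_r R_r - E_r),
    one factor per internal node r, given as (p_r, E_r, L_r, R_r).
    Nodes with p_r = 1 and E_r = L_r * R_r contribute a factor of 1
    and can be omitted."""
    L = 1.0
    for p, E, Lr, Rr in internal_nodes:
        L *= p ** E * (1 - p) ** (Lr * Rr - E)
    return L

# dendrogram 1: probabilities 1/3 (1 edge of 1*3 slots) and 1/4 (2 of 2*4)
L1 = hrg_likelihood([(1/3, 1, 1, 3), (1/4, 2, 2, 4)])
# dendrogram 2: a single probability 1/9 (1 edge of 3*3 slots)
L2 = hrg_likelihood([(1/9, 1, 3, 3)])
print(round(L1, 4), round(L2, 4))  # 0.0016 0.0433
```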
hierarchical communities

generalizing from a single example

• given graph A, estimate model parameters D, {p_r}
• sample new graphs from posterior distribution Pr(G | D, {p_r})

checking the models

compare resampled graphs with the original data:
1. degree distribution
2. clustering coefficient
3. geodesic path lengths
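These three checks can be computed in a few lines; a self-contained sketch in plain Python (graphs as adjacency dicts; a library such as networkx would do the same):

```python
from collections import deque

def degree_hist(adj):
    """Histogram of node degrees: {degree: count}."""
    hist = {}
    for u in adj:
        d = len(adj[u])
        hist[d] = hist.get(d, 0) + 1
    return hist

def mean_clustering(adj):
    """Mean local clustering coefficient over all nodes."""
    total, n = 0.0, 0
    for u in adj:
        nbrs = list(adj[u])
        k = len(nbrs)
        n += 1
        if k < 2:
            continue  # nodes with degree < 2 contribute 0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / n

def mean_geodesic(adj):
    """Mean shortest-path length over reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(degree_hist(triangle), mean_clustering(triangle), mean_geodesic(triangle))
# {2: 3} 1.0 1.0
```

Computing each statistic on the original graph and on many resampled graphs, then comparing the distributions, is the check the slides describe.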
hierarchical communities


degree distribution

[figure: fraction of vertices with degree k vs. degree k (log-log), original vs. resampled]
hierarchical communities

density of triangles

[figure: distribution of the clustering coefficient c over resampled graphs, with the original value marked]
hierarchical communities

geodesic distances

[figure: fraction of vertex pairs at distance d vs. distance d (semi-log), original vs. resampled]
hierarchical communities

inspecting the dendrograms

[figure: dendrograms for the NCAA Schedule 2000 network and Zachary's Karate Club]
hierarchical communities

[figure: MAP dendrogram for Zachary's Karate Club]
hierarchical communities

[figure: MAP dendrogram for the NCAA Schedule 2000 network, with team names at the leaves]
hierarchical communities

link prediction in networks

• many networks are sampled
  • social nets, food webs, protein interactions, etc.
• generative models provide an estimate of Pr(A_ij | θ)
  for either A_ij = 0 (missing links) or A_ij = 1 (spurious links)
• like cross-validation: hold out some adjacencies {A_ij} and
  measure the accuracy of the algorithm on these

now many approaches to link prediction:


• Liben-Nowell & Kleinberg (2003)
• Goldberg & Roth (2003)
• Szilágyi et al. (2005)
• Guimera & Sales-Pardo (2009)
• and many others
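The simple baseline predictors that appear in the plots below, and the AUC score itself, are a few lines each. A sketch (graph as adjacency sets; these are the standard score definitions, not code from the cited papers):

```python
def common_neighbors(adj, u, v):
    """Number of shared neighbors of u and v."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared neighbors as a fraction of all neighbors."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def degree_product(adj, u, v):
    """Preferential-attachment score d_u * d_v."""
    return len(adj[u]) * len(adj[v])

def auc(pos_scores, neg_scores):
    """AUC = Pr(a true missing link outscores a true non-link),
    counting ties as 1/2; pure chance gives 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# toy graph with the edge (0, 2) held out: it shares two neighbors,
# so common neighbors ranks it above the true non-edges
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2, 4}, 4: {3}}
pos = [common_neighbors(adj, 0, 2)]
neg = [common_neighbors(adj, 1, 4), common_neighbors(adj, 0, 4)]
print(auc(pos, neg))  # 1.0
```

In the cross-validation setup from the previous slide, `pos_scores` come from held-out edges and `neg_scores` from true non-edges.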
hierarchical communities

Grassland species network

[figure: AUC (area under the ROC curve) vs. fraction of edges observed, k/m; predictors: pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, hierarchical structure; pure chance gives AUC = 0.5]
hierarchical communities

a. Terrorist association network
b. T. pallidum metabolic network

[figure: AUC vs. fraction of edges observed for the same six predictors, on each network]
hierarchical communities

other approaches

PHYSICAL REVIEW X 4, 011047 (2014)

Hierarchical Block Structures and High-Resolution Model Selection in Large Networks

Tiago P. Peixoto*
Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany
(Received 5 November 2013; published 24 March 2014)

Discovering and characterizing the large-scale topological features in empirical networks are crucial steps in understanding how complex systems function. However, most existing methods used to obtain the modular structure of networks suffer from serious problems, such as being oblivious to the statistical evidence supporting the discovered patterns, which results in the inability to separate actual structure from noise. In addition to this, one also observes a resolution limit on the size of communities, where smaller but well-defined clusters are not detectable when the network becomes large. This phenomenon occurs for the very popular approach of modularity optimization, which lacks built-in statistical validation, but also for more principled methods based on statistical inference and model selection, which do incorporate statistical validation in a formally correct way. Here, we construct a nested generative model that, through a complete description of the entire network hierarchy at multiple scales, is capable of avoiding this limitation and enables the detection of modular structure at levels far beyond those possible with current approaches. Even with this increased resolution, the method is based on the principle of parsimony, and is capable of separating signal from noise, and thus will not lead to the identification of spurious modules even on sparse networks. Furthermore, it fully generalizes other approaches in that it is not restricted to purely assortative mixing patterns, directed or undirected graphs, and ad hoc hierarchical structures such as binary trees. Despite its general character, the approach is tractable and can be combined with advanced techniques of community detection to yield an efficient algorithm that scales well for very large networks.

DOI: 10.1103/PhysRevX.4.011047   Subject Areas: Complex Systems, Interdisciplinary Physics, Statistical Physics

[figure: the nested model — the observed network (N nodes, E edges, level l = 0) is summarized by the edge counts among its B_1 blocks (level l = 1); another SBM is fit to these, and the process repeats at levels l = 2, 3, …]

Peixoto, Phys. Rev. X 4, 011047 (2014)
hierarchical communities

other approaches (hierarchical SBM)

[figure: hierarchical SBM fit to the political blogs (2004) network]

Peixoto, Phys. Rev. X 4, 011047 (2014)

limits of statistical inference

community structure in networks


• dozens of algorithms for finding it
• generative models among the most powerful
• how methods fail is as important as how they succeed
• even if communities exist in a network, they may not be detectable
limits of statistical inference

planted partition problem

• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• parameterized strength of communities: ε = c_out / c_in

edge probabilities: c_in / n within groups, c_out / n between groups
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
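The planted partition model above is a two-parameter SBM and easy to generate. A minimal sketch (names are mine):

```python
import random

def planted_partition(n, c_in, c_out, rng=random):
    """Two equal groups; within-group pairs connect with probability
    c_in/n, between-group pairs with probability c_out/n, so the mean
    degree is c = (c_in + c_out)/2. Returns (edges, group labels)."""
    labels = [0] * (n // 2) + [1] * (n - n // 2)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = c_in / n if labels[u] == labels[v] else c_out / n
            if rng.random() < p:
                edges.append((u, v))
    return edges, labels

# epsilon = c_out / c_in controls community strength: epsilon = 1 gives
# an Erdos-Renyi graph, epsilon = 0 gives two disconnected communities
edges, labels = planted_partition(100, 6, 2)
```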
limits of statistical inference

planted partition problem

• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• ε = c_out / c_in

[figure: overlap (accuracy) vs. ε = c_out/c_in, for q = 2, c = 3 and for q = 4, c = 16 (the benchmark of Newman and Girvan); BP and MCMC results for N = 128 up to N = 500k; strong communities are easy to detect, then hard to detect, and beyond a critical ε the partition is undetectable, with overlap falling to the random-graph level]
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
limits of statistical inference

planted partition problem

• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• ε = c_out / c_in
• 2nd-order phase transition in detectability
• overlap goes to 0 for ε > (c − √c) / (c + √c (k − 1))

[figure: the same overlap vs. ε curves, with the detectability transition marked]
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
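The critical point is easy to evaluate numerically; a quick sketch (the formula is the threshold from the slide, with k the number of groups):

```python
import math

def epsilon_c(c, k=2):
    """Detectability threshold eps_c = (c - sqrt(c)) / (c + sqrt(c)(k - 1)):
    for eps = c_out/c_in above this value the planted partition is
    undetectable (for 2 groups, by any algorithm)."""
    return (c - math.sqrt(c)) / (c + math.sqrt(c) * (k - 1))

# the two benchmark cases: q = 2, c = 3 and q = 4, c = 16
print(round(epsilon_c(3, 2), 3), round(epsilon_c(16, 4), 4))  # 0.268 0.4286
```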
limits of statistical inference

planted partition problem


• for 2 groups, the phase transition is information-theoretic:
  no algorithm can exist that detects these communities (better than chance)
• when communities are strong, most algorithms succeed
• when networks & communities are very sparse = trouble

• recently generalized to dynamic networks (Ghasemian et al. 2015)


• hierarchical block models (Peixoto 2014) and node metadata (Newman &
Clauset 2016) both improve detectability

Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
Ghasemian et al., arxiv:1506.0679 (2015)
Peixoto, Phys. Rev. X 4, 011047 (2014)
Newman & Clauset, Nature Communications, to appear (2016)
the trouble with community detection

many networks include metadata on their nodes:


social networks age, sex, ethnicity or race, etc.
food webs feeding mode, species body mass, etc.
Internet data capacity, physical location, etc.
protein interactions molecular weight, association with cancer, etc.

metadata x is often used to evaluate the accuracy of community detection algorithms:

if community detection method A finds a partition P that correlates with x,
then we say that A is good
the trouble with community detection

[figure: Zachary karate club and political blogs networks]
the trouble with community detection

[figure: political books (2004) and NCAA 2000 Schedule networks]
the trouble with community detection


often, groups found by community detection are meaningful


• allegiances or personal interests in social networks [1]
• biological function in metabolic networks [2]

but some recent studies claim these are the exception


• real networks either do not contain structural communities or
communities exist but they do not correlate with metadata groups [3]

[1] see Fortunato (2010), and Adamic & Glance (2005)


[2] see Holme, Huss & Jeong (2003), and Guimera & Amaral (2005)
[3] see Leskovec et al. (2009), and Yang & Leskovec (2012), and Hric, Darst & Fortunato (2014)
the trouble with community detection

Hric, Darst & Fortunato (2014)


• 115 networks with metadata & 12 community detection methods
• compare extracted P with observed x for each A
Name No. Nodes No. Edges No. Groups Description of group nature
lfr 1000 9839 40 artificial network (lfr, 1000S, µ = 0.5)
karate 34 78 2 membership after the split
football 115 615 12 team scheduling groups
polbooks 105 441 2 political alignment
polblogs 1222 16782 3 political alignment
dpd 35029 161313 580 software package categories
as-caida 46676 262953 225 countries
fb100 762–41536 16651–1465654 2–2597 common students’ traits
pgp 81036 190143 17824 email domains
anobii 136547 892377 25992 declared group membership
dblp 317080 1049866 13472 publication venues
amazon 366997 1231439 14–29432 product categories
flickr 1715255 22613981 101192 declared group membership
orkut 3072441 117185083 8730807 declared group membership
lj-backstrom 4843953 43362750 292222 declared group membership
lj-mislove 5189809 49151786 2183754 declared group membership
[1] fb100 is 100 networks
the trouble with community detection

Hric, Darst & Fortunato (2014)


• evaluate by normalized mutual information NMI(P, x)

[figure: NMI(P, x) across networks and methods; the "classic" data sets are marked]

[1] maximum NMI between any partition layer of the metadata partitions and any layer returned by the community detection method
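NMI itself is short to implement from the definitions. A sketch, normalizing the mutual information by the smaller of the two entropies (one common convention; dividing by the mean or the maximum entropy are alternatives):

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a labeling (natural log)."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def mutual_info(a, b):
    """Mutual information between two labelings of the same nodes."""
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    return sum(c / n * log(c * n / (pa[x] * pb[y]))
               for (x, y), c in joint.items())

def nmi(partition, metadata):
    """NMI = I(P; x) / min[H(P), H(x)]: 1 when either labeling perfectly
    predicts the other, 0 when they are independent."""
    h = min(entropy(partition), entropy(metadata))
    return mutual_info(partition, metadata) / h if h > 0 else 0.0

print(nmi([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```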
but wait!

[1] image copyright BostonGazette or maybe 20th Century Fox? gah


a solution

idea:
use metadata x to help select a partition P* ∈ {P} that correlates with x,
from among the exponential number of plausible partitions


use a generative model to guide the selection:


• define a parametric probability distribution over networks Pr(G | θ)
• generation: given θ, draw G from this distribution
• inference: given G, choose θ that makes G likely

generation: model Pr(G | θ) → data G = (V, E)
inference: data G = (V, E) → model Pr(G | θ)
a metadata-aware stochastic block model

generation
given metadata x = {x_u} and degree d = {d_u} for each node u
• each node u is assigned a community s with probability γ_{sx}
• thus, the prior on community assignments is P(s | Γ, x) = ∏_i γ_{s_i, x_i}
• given assignments, place edges independently, each with probability
  p_uv = d_u d_v θ_{s_u, s_v}
• where the θ_st are the stochastic block matrix parameters

this is a degree-corrected stochastic block model (DC-SBM)
with a metadata-based prior on community labels

[1] Γ is the k × K matrix of parameters γ_{sx}
[2] Karrer & Newman (2011)
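The two generation steps are one line each. A sketch with the slide's symbols spelled out (`gamma` for Γ, `theta` for Θ; the toy numbers are made up):

```python
def assignment_prior(s, x, gamma):
    """P(s | Gamma, x) = prod_i gamma[s_i][x_i]: probability of a full
    community assignment s given metadata x and the k x K matrix gamma."""
    p = 1.0
    for si, xi in zip(s, x):
        p *= gamma[si][xi]
    return p

def edge_prob(u, v, d, s, theta):
    """Degree-corrected edge probability p_uv = d_u * d_v * theta[s_u][s_v]."""
    return d[u] * d[v] * theta[s[u]][s[v]]

# two communities (k = 2), two metadata values (K = 2); each metadata
# column of gamma sums to 1 over the communities
gamma = [[0.9, 0.2],
         [0.1, 0.8]]
print(round(assignment_prior([0, 1], [0, 1], gamma), 2))  # 0.72
```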
a metadata-aware stochastic block model

inference
given an observed network A (adjacency matrix)
• the model likelihood is

  P(A | Θ, Γ, x) = Σ_s P(A | Θ, s) P(s | Γ, x)
                 = Σ_s ∏_{u<v} p_uv^{A_uv} (1 − p_uv)^{1 − A_uv} ∏_u γ_{s_u, x_u}

  (the first factor depends on the network, the second on the metadata)
• where Θ is a k × k matrix of community interaction parameters θ_st,
  and the sum is over all possible assignments s
• we fit this model to data using expectation-maximization (EM) to
  maximize P(A | Θ, Γ, x) w.r.t. Θ and Γ

[1] technical details in Newman & Clauset (2015) arxiv:1507.04001


networks with planted structure

does this method recover known structure in synthetic data?

• use the SBM to generate planted partition networks, with k = 2
  equal-sized groups and mean degree c = (c_in + c_out)/2
• assign metadata with variable correlation ρ ∈ [0.5, 0.9] to the
  true group labels
• vary the strength of the partition, c_in − c_out
• when c_in − c_out ≤ √(2 (c_in + c_out)), no structure-only algorithm can
  recover the planted communities better than chance
  (the detectability threshold, which is a phase transition)

[1] Decelle, Krzakala, Moore & Zdeborova (2011)
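A quick numeric check of this detectability condition (a sketch; the function name is mine):

```python
import math

def structure_only_detectable(c_in, c_out):
    """Planted communities (k = 2) are detectable by structure alone only
    when c_in - c_out exceeds sqrt(2 * (c_in + c_out))."""
    return (c_in - c_out) > math.sqrt(2 * (c_in + c_out))

# at mean degree c = 8 (so c_in + c_out = 16), the threshold sits at
# c_in - c_out = sqrt(32), about 5.66
print(structure_only_detectable(12, 4), structure_only_detectable(10, 6))
# True False
```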


networks with planted structure

let mean degree c = 8

• when ρ = 0.5, metadata isn't useful and we recover regular SBM behavior

[figure: fraction of correctly assigned nodes vs. c_in − c_out, for metadata correlations ρ = 0.5, 0.6, 0.7, 0.8, 0.9; the region below the detectability threshold is marked "undetectable"]

[1] n = 10 000
networks with planted structure

let mean degree c = 8

• when ρ = 0.5, metadata isn't useful and we recover regular SBM behavior
• when metadata correlates with the true groups, ρ > 0.5,
  accuracy is better than either metadata or SBM alone

metadata + SBM performs better than either
• any algorithm without metadata, or
• metadata alone.

[figure: fraction of correctly assigned nodes vs. c_in − c_out, for ρ = 0.5 through 0.9]
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
2. marine food web: predator-prey interactions among 488
species in Weddell Sea in Antarctica
3. Malaria gene recombinations: recombination events among
297 var genes
4. Facebook friendships: online friendships among 15,126
Harvard students and alumni
5. Internet graph: peering relations among 46,676 Autonomous
Systems
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}

[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}

• method finds a good partition


between high-school and
middle-school
• NMI = 0.881
• without metadata:
  NMI ∈ [0.105, 0.384]

[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}

• method finds a good partition


between blacks and whites
(with others scattered among)
  NMI = 0.820
• without metadata:
  NMI ∈ [0.120, 0.239]

[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks

1. high school social network: 795 students in a medium-sized


American high school and its feeder middle school
• x = {grade 7-12, ethnicity, gender}

• method finds no good


partition between males/
females.
instead, chooses a mixture of
grade/ethnicity partitions
  NMI = 0.003
• without metadata:
  NMI ∈ [0.000, 0.010]

[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks

2. marine food web: predator-prey interactions among 488 species
   in the Weddell Sea in Antarctica
• x = {species body mass, feeding mode, oceanic zone}
• the partition recovers the known correlation between body mass,
  trophic level, and ecosystem role (primary producer, herbivore,
  omnivore, carnivore, detritivore)

[figure: learned priors — probability of community membership as a function of mean body mass (g) — for the three-community division of the Weddell Sea network]

[1] here, we're using a continuous-metadata model
[2] Brose et al. (2005)
real-world networks

3. Malaria gene recombinations: recombination events among


297 var genes
• x = {Cys-PoLV labels for HVR6 region}
• with metadata, partition discovers correlation with Cys labels
(which are associated with severe disease)

HVR6

[figure: network partitions without metadata (NMI ∈ [0.077, 0.675]) and with metadata (NMI = 0.596)]
[1] Larremore, Clauset & Buckee (2013)
real-world networks

3. Malaria gene recombinations: recombination events among


297 var genes
• x = {Cys-PoLV labels for HVR6 region}
• on adjacent region of gene, we find Cys-PoLV labels correlate
with recombinant structure here, too
HVR5

[figure: network partitions without and with metadata]

[1] Larremore, Clauset & Buckee (2013)


the ground truth about metadata

what is the goal of community detection?

network G + method f → communities C = f(G)   vs.   metadata M


the ground truth about metadata

what is the goal of community detection?

network G + method f → communities C = f(G)   vs.   metadata M

C ≈ M

"this method works!"


the ground truth about metadata

what is the goal of community detection?

network G + method f → communities C = f(G)   vs.   metadata M

C ≈ M                          C ≠ M

"this method works!" "this method stinks!"


the ground truth about metadata

what is the goal of community detection?

there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:

1. metadata M are unrelated to network structure G


the ground truth about metadata

what is the goal of community detection?

there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:

1. metadata M are unrelated to network structure G


2. metadata M and communities C capture different aspects of structure

social groups leaders and followers


the ground truth about metadata

what is the goal of community detection?

there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:

1. metadata M are unrelated to network structure G


2. metadata M and communities C capture different aspects of structure
3. network G has no community structure
the ground truth about metadata

what is the goal of community detection?

there are 4 indistinguishable reasons why we might find f(G) = C ≠ M:

1. metadata M are unrelated to network structure G


2. metadata M and communities C capture different aspects of structure
3. network G has no community structure
4. algorithm f is bad

"this method stinks!"


theorems for community detection

DON’T TRY TO FIND THE GROUND TRUTH

INSTEAD . . . TRY TO REALIZE THERE IS NO GROUND TRUTH


theorems for community detection

1. Theorem: no bijection between ground truth and communities


g(T) → G ← g′(T′): 2 different processes, on 2 different ground truths,
can create the same observed network

2. Theorem: No Free Lunch in community detection


no algorithm f has better performance than
any other algorithm f 0 , when averaged over
all possible inputs {G}

→ good performance comes from matching
  algorithm f to its preferred subclass of
  networks {G′} ⊂ {G}
[1] performance defined as adjusted mutual information (AMI), which is like the normalized mutual information, but adjusted for expected values
[2] original NFL theorem: Wolpert, Neural Computation (1996)
[3] proofs of these theorems are in Peel, Larremore, Clauset (2016)
fin
real-world networks

4. Facebook friendships: online friendships among 15,126
   Harvard students and alumni (in Sept. 2005)
• x = {graduation year, dormitory}
• method finds a good partition between alumni, recent
  graduates, upperclassmen, sophomores, and freshmen
• NMI = 0.668
• without metadata:
  NMI ∈ [0.573, 0.641]

[figure: prior probability of membership in each community vs. graduation year (None, 2000-2009)]

[1] Traud, Mucha & Porter (2012)

real-world networks

4. Facebook friendships: online friendships among 15,126
   Harvard students and alumni (in Sept. 2005)
• x = {graduation year, dormitory}
• method finds a good partition among the dorms
• NMI = 0.255
• without metadata:
  NMI ∈ [0.074, 0.224]

[figure: prior probability of membership in each community vs. dormitory]

[1] Traud, Mucha & Porter (2012)
