Professional Documents
Culture Documents
net/publication/323276320
CITATIONS READS
2 135
6 authors, including:
Jiawei Han
University of Illinois, Urbana-Champaign
950 PUBLICATIONS 102,356 CITATIONS
SEE PROFILE
All content following this page was uploaded by Timothy Hanratty on 22 June 2018.
ABSTRACT
Graph pattern mining methods can extract informative and useful
patterns from large-scale graphs and capture underlying principles
through the overwhelmed information. Contrast analysis serves as
arXiv:1802.06189v1 [cs.SI] 17 Feb 2018
Beijing, China
10:30 PM — 11:00 PM
Given Seeds
Figure 2: Spatio-Temporal Event Detection. Heat maps visualize the adjacency matrices within the corresponding regions. The
darker, the more traffic. Contrasting two road networks of the real-time traffic and historical averages, the identified coherent
core demonstrates the usual, busy traffic on the East 2nd Ring in Beijing, while the extracted contrast subgraph reveals an
unusually large volume of traffic, which indicates the ending of a concert event at the Beijing Workers’ Sports Complex.
Besides extracting contrast subgraphs, defining the contrast itself Our experimental results based on real-world datasets demon-
remains an open problem. Moreover, trying to induce informative strate the identified coherent core and the extracted contrast sub-
subgraphs from the contrast makes the task even more challenging. graph are quite insightful and encouraging in a variety of tasks.
In real-world graphs, there usually exist noise edges. For the algo- For example, based on the traffic data and road network in Beijing,
rithms directly optimizing the contrast, it is likely to end up with a contrast subgraphs between the real-time traffic and the historical
subset of nodes which have heavy noise in one of the graphs and traffic indicate the events happening in the city. Figure 2 shows
no edge in the other graph. To avoid such numerous but meaning- the coherent core and the contrast subgraph discovered by our
less subgraphs, we propose to induce contrast subgraphs from the proposed method through a comparison between the traffic data on
coherent cores, i.e., a subset of nodes with similar edge structures Nov 24, 2012, and the historical averages. The usual, busy traffic on
in G A and G B . We visualize the workflow using two hypothetical the major roads (i.e., the East 2nd Ring in Beijing) form the coher-
graphs in Figure 1. Given the seed node 5, we first identify the co- ent core. The contrast subgraph reveals the unusual traffic around
herent core of nodes {3, 4, 5, 6}, and then add nodes 1, 7, and 8 into the Beijing Workers’ Sports Complex between 10:30 – 11:00 PM.
the contrast graph. More experiments in Section 5.2 demonstrate We find that the contrast subgraph indicates the ending of Dang
that inducing from the coherent cores is necessary and it acts as an Ding’s concert event, which further reflects the precious value of
anchor to guarantee the informativeness of the induced contrast the proposed contrast subgraph mining.
subgraphs. To our best knowledge, this is the first work that aims to mine
Beyond the problem formulation, it is also challenging to develop informative contrast subgraphs between two graphs. Our contribu-
an efficient and scalable algorithm for mining contrast subgraphs tions are highlighted as follows.
from large-scale graphs. A straightforward solution is to apply co- • We formulate the contrast subgraph mining problem and avoid
dense graph mining algorithms on (G A , G B ) and then on (G A , G B ). meaningless contrast subgraphs by inducing from coherent cores.
However, the dense complementary graph (i.e., G A or G B ) makes The problem is formulated for multiple applications using a gen-
most of, if not all, pruning techniques initially designed for sparse eral family of metrics.
graphs ineffective. In this paper, we derive a polynomial-time • We derive a polynomial-time algorithm to efficiently extract
algorithm to efficiently identify coherent subgraph cores and then contrast subgraphs from large-scale graphs.
extract contrast subgraphs. More specifically, we apply a binary • Extensive experiments verify the necessity of introducing co-
search on the coherent/contrast score and construct a network herent cores as well as the effectiveness and efficiency of our
such that whether the current score is achievable is equivalent algorithm. Real-world applications demonstrate the tremendous
to whether the min S − T cut in the network is above a certain potentials of contrast subgraph mining.
threshold. Thanks to the duality between the min cut and the max Reproducibility: We release our code at the GitHub1 .
flow, we can solve the reduced problem in a polynomial time [22].
1 https://github.com/shangjingbo1226/ContrastSubgraphMining
Contrast Subgraph Mining from Coherent Cores KDD’18, August 2018, London, UK
The remainder of this paper is organized as follows. We first Table 1: Notation Table.
formulate the contrast subgraph mining problem in Section 2. The GA , G B Two undirected, weighted input graphs.
related work is discussed in Section 3. Section 4 covers the technical GA , G B Complementary graphs considering whether the edge exists.
details of our derived polynomial-time algorithm. In Section 5, we V The node set for both G A and G B .
use a real-world task to verify the necessity of inducing contrast E A (u, v), E B (u, v) Edge weights between nodes u and v in G A and G B .
subgraph from coherent cores and evaluate the efficiency of our al- seeds A set of seed nodes. It could be empty.
gorithm. Section 6 uses two real-world applications to demonstrate c Coherent core. A subset of V . seeds ⊂ c.
g Contrast subgraph. A subset of V . c ⊂ g
the importance of contrast subgraph mining. Section 7 concludes
Nr (s) Neighbor node set of the node set s. r is a parameter.
this paper and outlines future directions. coherence(c) Coherence metric for a subgraph.
coherence(u, v) Coherence metric for an edge between u and v.
2 CONTRAST SUBGRAPH MINING contrast(g) Contrast metric for a subgraph.
In this section, we first formulate the problem and propose a general contrast(u, v) Contrast metric for an edge between u and v.
penalty(u) Node penalty metric.
family of metrics. And then, we discuss our choices of specific
metrics used in our experiments.
2.2 A General Family of Metrics
2.1 Problem Formulation With a specific coherence metric, i.e., coherence(c), and a specific
contrast metric, i.e., contrast(g), the contrast subgraph mining prob-
The input of the contrast subgraph mining problem consists of two
lem becomes the following two maximization problems.
weighted, undirected graphs G A = (V , E A ) and G B = (V , E B ). The
node set V is the same in both G A and G B . But the edge weights in (1) Identify the coherent core around the seed nodes:
E A and E B are different. As summarized in Table 1, we use E A (u, v) ĉ = argmax coherence(c).
and E B (u, v) to denote the edge weights of the edge between nodes seeds⊂c⊂N r (seeds)
u and v in these two graphs, respectively. Because G A and G B (2) Induce the contrast subgraph from the identified coherent core:
are undirected graphs, we have ∀u, v, E A (u, v) = E A (v, u) and
E B (u, v) = E B (v, u). ĝ = argmax contrast(g).
ĉ⊂g⊂N r (ĉ)
In this paper, without loss of generality, we make two assump-
tions for the input graphs. First, we assume E A (u, v), E B (u, v) ≥ 0, It is natural to start from a single edge, since an edge is also a sub-
where the edge weight 0 means there is no edge. In most of the graph of two nodes. We denote the edge coherence as coherence(u, v)
applications, the edge weights are mainly about the connection and the edge contrast as contrast(u, v). Inspired by the formula-
strengths between nodes. And we only care about the weight differ- tion of the maximum dense subgraphs [8], we design the coher-
ence between the corresponding edges. Therefore, it is reasonable ence/contrast metric based on the edge coherence/contrast as fol-
to assume the weights are non-negative. Note that this assumption lows:
includes boolean edges as a special case. Second, we assume the u,v ∈c∧u <v coherence(u, v)
Í
preprocessing has been conducted on E A and E B , thus the ranges coherence(c) = , (2)
u ∈c penalty(u)
Í
or distributions of E A (u, v) and E B (u, v) would be similar.
u,v ∈д∧u <v contrast(u, v)
Í
The user can provide seed nodes, if necessary, to better facili- contrast(g) = , (3)
u ∈g penalty(u)
Í
tate her interests. To make sure the nodes are really around the
given seeds or the coherent core, we define the neighbor nodes by a where penalty(u) is a function to control the subgraph size. The
parameter r , numerators in Equation 2 and Equation 3 become bigger when the
subgraph grows. The idea of introducing node penalty is to penalize
Nr (s) = {u|d A (s, u) ≤ r ∨ d B (s, u) ≤ r } (1)
these large subgraphs. In fact, the form of these two definitions is
where s is a subset of nodes, d A (s, u) and d B (s, u) are the minimum a generalization of the density function in [8].
number of non-zero edges to traverse from the node u to any node in Remark. (1) When there is no user-provided seed (i.e., seeds = ∅),
s. In practice, the value of r can be chosen based on the diameter of we will identify the most coherent subgraph among all non-empty
the two input graphs. For example, the collaborator graph induced subgraphs, which is the most “similar” part between G A and G B .
from the DBLP publication dataset has a small diameter, so r = 1 (2) We can find a series of non-overlapping contrast subgraphs by
and r = 2 are good choices. The road network has a relatively large iteratively removing the newly added nodes in the most contrasting
diameter, so we choose r from 10 to 20. Reasonable values of r can subgraph, i.e., ĝ \ ĉ, until ĝ \ ĉ = ∅. Here ĝ is the most contrasting
always lead to meaningful results. subgraph in the current iteration, while ĉ is the same coherent core
In summary, the inputs are two undirected, weighted graphs G A in all iterations.
and G B , a parameter r , and seed nodes seeds (could be empty). As
we discussed, to avoid meaningless cases, we propose to induce 2.3 Specific Metrics
contrast subgraphs from the coherent cores, where a coherent core Our algorithm works with any (1) non-negative edge coherence
is a subset of nodes with similar edges in the two graphs. The goal coherence(u, v); (2) non-negative edge contrast contrast(u, v); and
is to first identify the coherent core c around the seed nodes seeds (3) positive node penalty penalty(u). Therefore, there are countless
(seeds ⊂ c ⊂ Nr (seeds)), and then induce the contrast subgraph g ways to define these metrics. We discuss the specific metrics utilized
from this coherent core (c ⊂ g ⊂ Nr (c)). in both Section 5 and Section 6 here.
KDD’18, August 2018, London, UK J. Shang et al.
Edge Coherence. Considering the real-world graphs are usually patterns instead of contrast patterns. Moreover, it usually requires
sparse, the non-zero edges are more telling than the zero-weighted more than two, usually tens of graphs as input.
edges. For example, in co-authorship graphs, the observed co- Discriminative Subgraph. In discriminative subgraph mining,
authorship node pairs are more important compared to any two given multiple graphs with their class labels, the goal is to figure
unrelated nodes. Additionally, as discussed before, the edge weights out the subgraphs that discriminate between classes [13, 28, 29]. Dis-
are mainly about the connection strengths, and their distributions criminative subgraph is sometimes named as contrast subgraph [29],
are similar. So in our experiments, for a given edge, we adopt however, it’s completely different from our contrast subgraph min-
its smaller weight in two graphs to describe its edge coherence. ing, due to its supervised learning nature.
Formally, Co-Dense Subgraph. Density is an important measurement for
graphs. Among various models and definitions of dense subgraphs,
coherence(u, v) = min{E A (u, v), E B (u, v)}. (4)
quasi-clique is one of the most prominent ones [17]. As a natural
Edge Contrast. A good contrast metric must meet the following extension, researchers have explored how to efficiently find cross-
requirements. graph quasi-cliques [24] and coherent subgraphs [3, 4]. Jiang et al.
(1) Symmetric. The order of G A and G B should have no effect on further extend cross-graph quasi-cliques to frequent cross-graph
the contrast metric. Formally, given two nodes, u and v, if we quasi-cliques [12]. As an application, coherent dense subgraph
swap E A and E B , contrast(u, v) should be the same. mining has shown its usefulness in biological networks [11]. We
(2) Zero. It is natural to require contrast(u, v) = 0 when the edge can also generate some contrast subgraph in terms of density by
weights in two graphs are the same. That is, when E A (u, v) = straightforwardly applying co-dense subgraph mining methods
E B (u, v), contrast(u, v) must be 0. on G A , G B and then on G A , G B . However, as shown in our ex-
(3) Monotonicity. Suppose E A (u, v) ≤ E B (u, v). If we increase periments, these algorithms become extremely slow because the
E B (u, v) and keep the other same, the score contrast(u, v) should complementary graphs are so dense that pruning techniques are
increase. If we decrease E A (u, v) and keep the other same, the no longer effective.
score contrast(u, v) should also increase.
Starting from these three requirements, we find a neat definition
3.3 Informative Phrases from the Contrast
as follows. Given two corpora (i.e., the main corpus and the background cor-
pus), informative phrases are the phrases that occur more frequently
contrast(u, v) = |E A (u, v) − E B (u, v)| (5) in the main corpus than the background corpus [2, 6, 18, 19, 21, 27].
It meets all the requirements and is therefore adopted in our exper- With the concept of contrast, informative phrases are more salient
iments. than routine phrases like those consisted of stopwords, although
Node Penalty. If there is no prior knowledge about the two graphs, their frequencies in the main corpus are not super high. The def-
a uniform node penalty is always a safe choice. Therefore, we use inition of informative phrases is based on individual phrase fre-
penalty(u) = 1 in all our experiments. quencies without considering a set of phrases as a whole. Such
definition is not applicable for subgraphs.
3 RELATED WORK
In this section, we introduce the related work on dense graph 4 METHODOLOGY
patterns within a single graph, cross-graph patterns, and interesting In this section, we derive the polynomial-time algorithm for the
phrase patterns of contrast in the text mining field. contrast subgraph mining. Since the forms of coherence and con-
trast metrics are similar, we focus on extracting contrast subgraphs
3.1 Dense Subgraphs within a Single Graph in the derivation. For identifying coherent cores, the same deriva-
Our coherence and contrast metrics are inspired by the dense sub- tion applies if one replaces the edge contrast contrast(u, v) with
graph problem, i.e., finding a subset of nodes that maximizes the the edge coherence coherence(u, v).
ratio of the number of edges between these nodes over the number
of selected nodes. A polynomial time algorithm using a network 4.1 Binary Search
flow model is first proposed to solve this problem [8]. [1] tries to First of all, we reduce the original maximization problem to a veri-
solve it approximately using a greedy algorithm. Beyond traditional fication problem by applying the binary search technique on the
dense subgraphs, it has been proved that if one wants to specify contrast score. In order to conduct the binary search, we need to
the subgraph size, the problem becomes NP-hard [16]. Gibson et figure out the lower/upper bounds of the subgraph contrast as well
al. [7] design an efficient algorithm specifically for giant dense sub- as the minimum contrast difference between two subgraphs.
graphs. Rossi et al. [26] develop a fast, parallel maximum clique Lower Bound. In the worst case, there is only one different edge
algorithm for sparse graphs; and Leeuwen et al. [30] explore the between G A and G B (assuming they are not exactly same). Let
most interesting dense subgraphs based on the user’s prior belief. ϵc > 0 be the minimum non-zero edge contrast. We have
ϵc
3.2 Cross-Graph Patterns contrast(g) ≥ Í
u ∈V penalty(u)
Frequent Subgraph. Given a large collection of graphs, frequent
subgraph mining aims to find frequent structures across different Upper Bound. As an ideal case, all edges are self-loops of the same
graphs [33, 34]. It is an unsupervised task but focuses on shared node in one graph, while that node has no self-loops in the other
Contrast Subgraph Mining from Coherent Cores KDD’18, August 2018, London, UK
…
node T as shown in Figure 3, then show a one-to-one mapping
a between its S − T finite-value cuts and the contrast subgraphs g
Never cut here!
meeting the constraints. Finally, we prove that the min cut gives
contrast(a, b) us the most contrasting subgraph.
Constructed Network. The network is constructed as follows.
U U + 2 ⋅ % ⋅ penalty(.) − 1(.)
S b T • Creates a source node S and a sink node T .
• ∀u ∈ c, adds an edge between S and u of a weight +∞.
contrast(b, c) • Chooses a large enough constant U to make sure that all follow-
ing edge weights are positive. In our implementation, U is set as
u,v contrast(u, v), which is no smaller than any d(u).
Í
an F − G 4HI c
Conclusion: ℎK L = M;N K (L) – P ⋅ |RS (M)| • ∀u ∈ Nr (c) \ c, adds an edge between S and u of a weight U .
Minimize ℎK L ó Minimize cut K (L)
…
Algorithm 1: Contrast Subgraph Mining from Coherent Core regarding both effectiveness and efficiency. We implement both
Require: G A , G B , the coherent core c, and the parameter r . our algorithm and the compared method in C++. The following
Return: The most contrasting subgraph ĝ (c ⊂ ĝ ⊂ Nr (c)). execution time experiments were conducted on a machine with
Compute Nr (c) using a breadth-first search (BFS). Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz using a single thread.
ϵc ← minu,v ∈V ∧u <v contrast(u, v)
ϵp ← minu,v ∈V ∧u <v penalty(u, v) 5.1 Collaboration Change Detection
ϵc
l← Í The Two Graphs. Two co-authorship graphs are constructed by
penalty(u)
u ∈V splitting the DBLP publication network dataset2 into two time
u,v ∈V ∧u <v contrast(u, v)
Í
h← periods (2001–2008 and 2009–2016). The nodes are authors who
ϵp published at least one paper in KDD, ICDM, WWW, ICML, and
ϵc
ϵ← 2 NIPS. The co-authorship from 2001 to 2008 forms the edges in G A ,
u ∈V penalty(u) while G B consists of co-authorships from 2009 to 2016. It ends up
Í
ĝ ← c with 6, 999 nodes and 17, 806 edges in total. The edge weight is the
while |h − l | > ϵ do log value (i.e., log(x) + 1) of the actual collaboration time x. We
l +h choose r = 1.
m←
2 Visualization. We use two aligned heat maps (a black vertical
Construct the network as described in Section 4.2.
line in between) to visualize adjacency matrices of the subgraphs.
flow ← the max flow of the constructed network
The darker color, the larger weight. The coherent core is always
if flow - |Nr (c)| · U ≤ 0 then
positioned in the bottom-right corner and enclosed by red squares,
l ←m
while the nodes are ranked by their total edge weights of E A within
Update ĝ by the current S − T cut.
else this subgraph. Moreover, we append a table of the node names in
h ←m the top-down order of the rows in the heat map.
return ĝ. Expectations. In this task, coherent cores are expected to be some
long-term collaborators of the seed author spanned over 2001 –
About the constraints c ⊂ g ⊂ Nr (c), it is clear that the min-cut 2016, and contrast subgraphs will be likely the researchers who
solution will only involve nodes in Nr (c) because they are the only have more collaborations with the seed author in only one of the
nodes in the constructed network. Moreover, because the edge two periods.
weights of all edges from S to ∀u ∈ c are set to +∞, they will Results. Since the time periods start from 2001, we choose a senior
never be included in the min cut. Therefore, the g induced from the professor, Prof. Jiawei Han from the University of Illinois at Urbana-
min-cut will never miss any nodes in the coherent core c. Champaign as the seed. As shown in Figure 4, we find that all
authors in the coherent core are long-term collaborators of Prof.
4.3 Time Complexity Analysis Jiawei Han. This contrast subgraph achieves a contrast score of
We analyze the time complexity step by step. Given the parameter 8.99. Note that Jing Gao was Prof. Jiawei Han’s Ph.D. student
r and seed nodes s, we can obtain Nr (s) in an O(|V | + |E A | + |E B |) from 2006 to 2011, and 2008/2009 happens to be her midpoint.
time through a breadth-first-search. Computing ϵc and ϵp costs Moreover, they keep collaborating after Jing Gao’s graduation. The
first 11 researchers in g \ c have much more collaborations with
O(|V | + |E A | + |E B |) time. The binary search will have log2 h−l ϵ the researchers in the coherent core in 2001 – 2008 than those in
iterations before it stops. The bottleneck inside the binary search
2009 – 2016. The later 15 researchers have more collaborations
is the max flow part. The state-of-the-art max flow algorithm [22]
from 2009 to 2016. In summary, the semantic meanings of these
is O(nm) by combining James B Orlin’s algorithm (for the sparse
results match our observation and reality. Therefore, the contrast
network) and the KRT algorithm (for the dense network), where
subgraph mining is helpful to the collaboration change detection
n and m are the numbers of nodes and edges in the constructed
task, and our algorithm is effective.
network. Therefore, the time complexity of the max flow part is
O |Nr (c)|(|Nr (c)| + |E A | + |E B |) . So the overall time complexity
5.2 Coherent Core and Neighbor Constraints
becomes O |V |+|E A |+|E B |+|Nr (c)|(|Nr (c)|+|E A |+|E B |) log2 h−l ϵ .
In the worst case, when r is big enough, |Nr (c)| becomes |V |. So If NOT inducing from coherent cores. If we directly extract
the worst case time complexity is O |V |(|V | + |E A | + |E B |) log2 h−l
the most contrasting subgraph without the coherent core step and
ϵ
Moreover, as both log2 (h − l) and log2 ϵ are no more than the input the neighborhood step, we will find, of course, a subgraph with a
length, this time complexity is polynomial. potentially higher contrast score. We run the collaboration change
detection task using Prof. Jiawei Han as the seed again but skip
5 EXPERIMENTS these two steps. As shown in Figure 5, we obtain a subgraph of a
contrast score 11.08, which is higher than the previous score (i.e.,
In this section, we utilize the collaboration change detection task
8.99). Despite the high contrast, as highlighted by the light blue box
as a sanity check. We first present the useful results from contrast
in the figure, the seed node Prof. Jiawei Han never collaborated with
subgraph mining. Then, we discuss the importance of the coher-
anyone in the contrast subgraph. Therefore, although this subgraph
ent core and neighborhood constraints. In the end, we compare
our algorithm with an adapted co-dense subgraph mining method, 2 https://aminer.org/citation
Contrast Subgraph Mining from Coherent Cores KDD’18, August 2018, London, UK
Figure 7: Comparison between the adapted co-dense algo- seeds Games & Strategy Guides
rithm and our contrast subgraph mining algorithm. Computers & Internet, Games & Strategy Guides, Strategy Guide,
c
Programming, Video Games
if not all, pruning techniques are no longer effective. Our algorithm g\c Tomb Raider− , Half-Life− , Doom+
contrast(g) 1.44
demonstrates a polynomial growth as same as our previous time
complexity analysis. Figure 8: Trend Detection on E-commerce Platform. The
We also compare the contrast scores achieved by the two al- darker, the more popular. In g \ c, − means more popular
gorithms, as shown in Figure 7(b). In this figure, each data point in 2002, while + means more popular in 2004.
corresponds to two runs of the same seed. For a given seed, we
record the contrast score of the adapted co-dense algorithm as X a blue circle, and a black dashed polygon to highlight the seed, the
and the contrast score of our contrast subgraph mining algorithm coherent core, and the contrast subgraph, respectively.
as Y. Therefore, the data points above y = x indicates better perfor- Expectations. In this task, coherent cores are expected to contain
mance than the adapted co-dense algorithm, vice versa. We plot those road segments that are busy in both the real-time graph and
the y = x line for reference. It is not surprising that the contrast the historical graph, and contrast subgraphs will be likely those
subgraph mining algorithm will always have a higher score than road segments that are only busy in either the real-time graph of
the adapted co-dense algorithm because they have slightly differ- the historical graph.
ent objectives. The “contrast” obtained from the adapted co-dense Results. We choose some seeds on the East 2nd Ring, a major road
algorithm emphasizes more on the density vs. the sparsity. This in Beijing. As shown in Figure 2, our algorithm first identifies a
is from a macro view, while our contrast(g) is defined in a micro coherent core about the usual, busy traffic on the East 2nd Ring,
view by considering every edge. and later extracts a contrast subgraph including many nodes and
In summary, our contrast subgraph mining algorithm signifi- road segments around the Beijing Workers’ Sports Complex. The
cantly outperforms the adapted co-dense subgraph mining algo- contrast score is 23.87. This contrast subgraph demonstrates a
rithm regarding both efficiency and effectiveness. significantly busier real-time traffic than usual, therefore, it may
reflect some unusual event. After some search on the Internet,
6 REAL-WORLD APPLICATIONS interestingly, on November 24, 2012, there was a concert hosted at
In this section, we present two real-world applications of contrast the Beijing Workers’ Sports Complex, whose ending time was just
subgraph mining: (1) spatio-temporal event detection based on around 10:30 PM. This finding further consolidates the usefulness
a taxi trajectory dataset and (2) trend detection on e-commerce of contrast subgraph mining and the effectiveness of our algorithm
platforms based on an Amazon rating dataset. in real world applications.
Expectations. In this task, we expect the nodes in the coherent [10] L. Hong, Y. Zheng, D. Yung, J. Shang, and L. Zou. Detecting urban black holes
core represent popular products types in both years, while the based on human mobility data. In Proceedings of the 23rd SIGSPATIAL International
Conference on Advances in Geographic Information Systems, page 35. ACM, 2015.
nodes in the contrast subgraph demonstrate the specific trends in [11] H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining coherent dense subgraphs
one of the two years. across massive biological networks for functional discovery. Bioinformatics,
21(suppl 1):i213–i221, 2005.
Results. We choose “Subjects→Computers & Internet→Games & [12] D. Jiang and J. Pei. Mining frequent cross-graph quasi-cliques. ACM Transactions
Strategy Guides” as the seed. As shown in Figure 8, because the on Knowledge Discovery from Data (TKDD), 2(4):16, 2009.
[13] N. Jin and W. Wang. Lts: Discriminative subgraph mining by learning from search
popularity of each sub-category under “Games & Strategy Guides” is history. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on,
similar in both years, they form the coherent core. Moreover, “Tomb pages 207–218. IEEE, 2011.
Raider” and “Half-Life” received many reviews in 2002 but not in [14] N. Jindal and B. Liu. Opinion spam and analysis. In Proceedings of the 2008
International Conference on Web Search and Data Mining, pages 219–230. ACM,
2004, while “Doom” became popular in term of the review amount 2008.
in 2004. Therefore, these three nodes become the contrast subgraph [15] A. Kavanaugh, J. M. Carroll, M. B. Rosson, T. T. Zin, and D. D. Reese. Community
networks: Where offline communities meet online. Journal of Computer-Mediated
of a contrast score 1.44. After some brief research in Wikipedia, Communication, 10(4):JCMC10417, 2005.
we found out that “Tomb Raider” released “The Prophecy” episode [16] S. Khot. Ruling out ptas for graph min-bisection, dense k-subgraph, and bipartite
in 2002 but nothing in 2004, while “Doom” released “Doom 3” in clique. SIAM Journal on Computing, 36(4):1025–1071, 2006.
[17] G. Liu and L. Wong. Effective pruning techniques for mining quasi-cliques.
2004 and its predecessor (i.e., “Doom 64”) was long time ago in Machine Learning and Knowledge Discovery in Databases, pages 33–49, 2008.
1997. These results are constructive for the sales trend analysis on [18] J. Liu, J. Shang, C. Wang, X. Ren, and J. Han. Mining quality phrases from massive
e-commerce platforms and further prove the effectiveness of our text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference
on Management of Data, pages 1729–1744. ACM, 2015.
algorithm on graphs of abstract nodes like hierarchical trees. [19] D. P. A. D. D. Majumdar. Fast mining of interesting phrases from subsets of text
corpora. 2014.
[20] A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in
7 CONCLUSION & FUTURE WORK social networks. In Ijcai, volume 5, pages 786–791. Citeseer, 2005.
[21] B. L. Monroe, M. P. Colaresi, and K. M. Quinn. Fightin’words: Lexical feature
In this paper, we formulated the important contrast subgraph min- selection and evaluation for identifying the content of political conflict. Political
ing problem. To avoid meaningless contrast subgraphs, we pro- Analysis, 16(4):372–403, 2008.
posed to first identify coherent cores as cornerstones. Our frame- [22] J. B. Orlin. Max flows in o (nm) time, or better. In Proceedings of the forty-fifth
annual ACM symposium on Theory of computing, pages 765–774. ACM, 2013.
work can admit a general family of coherence, contrast, and node [23] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and
penalty metrics. After rigorous derivations, we developed an el- complexity. Courier Corporation, 1998.
[24] J. Pei, D. Jiang, and A. Zhang. On mining cross-graph quasi-cliques. In Proceedings
egant polynomial-time algorithm to find the global optimum for of the eleventh ACM SIGKDD international conference on Knowledge discovery in
this problem. Extensive experiments verified the necessity of intro- data mining, pages 228–238. ACM, 2005.
ducing coherent cores as well as the efficiency and effectiveness of [25] R. A. Rossi and N. K. Ahmed. Role discovery in networks. IEEE Transactions on
Knowledge and Data Engineering, 27(4):1112–1131, 2015.
our proposed algorithm. Real-world applications demonstrated the [26] R. A. Rossi, D. F. Gleich, A. H. Gebremedhin, and M. M. A. Patwary. Fast maximum
tremendous potentials of the contrast subgraph mining. clique algorithms for large graphs. In Proceedings of the 23rd International
In future, we will (1) extend the problem setting to more than Conference on World Wide Web, pages 365–366. ACM, 2014.
[27] J. Shang, J. Liu, M. Jiang, X. Ren, C. R. Voss, and J. Han. Automated phrase
two graphs; (2) find the top-k contrast subgraphs with overlaps; and mining from massive text corpora. arXiv preprint arXiv:1702.04457, 2017.
(3) apply our contrast subgraph mining algorithm to other tasks, [28] M. Thoma, H. Cheng, A. Gretton, J. Han, H.-P. Kriegel, A. Smola, L. Song, P. S.
Yu, X. Yan, and K. M. Borgwardt. Discriminative frequent subgraph mining
like the abnormality detection in social networks. with optimality guarantees. Statistical Analysis and Data Mining: The ASA Data
Science Journal, 3(5):302–318, 2010.
[29] R. M. H. Ting and J. Bailey. Mining minimal contrast subgraph patterns. In
REFERENCES Proceedings of the 2006 SIAM International Conference on Data Mining, pages
[1] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama. Greedily finding a dense 639–643. SIAM, 2006.
subgraph. Journal of Algorithms, 34(2):203–221, 2000. [30] M. van Leeuwen, T. De Bie, E. Spyropoulou, and C. Mesnage. Subjective inter-
[2] S. Bedathur, K. Berberich, J. Dittrich, N. Mamoulis, and G. Weikum. Interesting- estingness of subgraph patterns. Machine Learning, 105(1):41–75, 2016.
phrase mining for ad-hoc text analytics. Proceedings of the VLDB Endowment, [31] C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu, and J. Guo. Mining advisor-
3(1-2):1348–1357, 2010. advisee relationships from research publication networks. In Proceedings of the
[3] B. Boden, S. Günnemann, H. Hoffmann, and T. Seidl. Mining coherent subgraphs 16th ACM SIGKDD international conference on Knowledge discovery and data
in multi-layer graphs with edge labels. In Proceedings of the 18th ACM SIGKDD mining, pages 203–212. ACM, 2010.
international conference on Knowledge discovery and data mining, pages 1258– [32] B. Wellman, J. Boase, and W. Chen. The networked nature of community: Online
1266. ACM, 2012. and offline. It & Society, 1(1):151–165, 2002.
[4] B. Boden, S. Günnemann, H. Hoffmann, and T. Seidl. Mimag: mining coherent [33] X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In Data
subgraphs in multi-layer graphs with edge labels. Knowledge and Information Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on,
Systems, 50(2):417–446, 2017. pages 721–724. IEEE, 2002.
[5] V. M. Buote, E. Wood, and M. Pratt. Exploring similarities and differences between [34] X. Yan and J. Han. Closegraph: mining closed frequent graph patterns. In
online and offline friendships: The role of attachment style. Computers in Human Proceedings of the ninth ACM SIGKDD international conference on Knowledge
Behavior, 25(2):560–567, 2009. discovery and data mining, pages 286–295. ACM, 2003.
[6] C. Gao and S. Michel. Top-k interesting phrase mining in ad-hoc collections using [35] J. Yuan, Y. Zheng, C. Zhang, X. Xie, and G.-Z. Sun. An interactive-voting based
sequence pattern indexing. In Proceedings of the 15th International Conference on map matching algorithm. In Mobile Data Management (MDM), 2010 Eleventh
Extending Database Technology, pages 264–275. ACM, 2012. International Conference on, pages 43–52. IEEE, 2010.
[7] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in [36] C. Zhang, G. Zhou, Q. Yuan, H. Zhuang, Y. Zheng, L. Kaplan, S. Wang, and
massive graphs. In Proceedings of the 31st international conference on Very large J. Han. Geoburst: Real-time local event detection in geo-tagged tweet streams.
data bases, pages 721–732. VLDB Endowment, 2005. In Proceedings of the 39th International ACM SIGIR conference on Research and
[8] A. V. Goldberg. Finding a maximum density subgraph. University of California Development in Information Retrieval, pages 513–522. ACM, 2016.
Berkeley, CA, 1984. [37] S. Zhang, H. Jiang, and J. M. Carroll. Integrating online and offline commu-
[9] K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, nity through facebook. In Collaboration Technologies and Systems (CTS), 2011
C. Faloutsos, and L. Li. Rolx: structural role extraction & mining in large graphs. International Conference on, pages 569–578. IEEE, 2011.
In Proceedings of the 18th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 1231–1239. ACM, 2012.