ANF: A Fast and Scalable Tool for Data Mining in Massive Graphs

Christopher R. Palmer
Computer Science Dept Carnegie Mellon University Pittsburgh, PA

Phillip B. Gibbons
Intel Research Pittsburgh Pittsburgh, PA

Christos Faloutsos
Computer Science Dept Carnegie Mellon University Pittsburgh, PA

ABSTRACT

Graphs are an increasingly important data source, with such important graphs as the Internet and the Web. Other familiar graphs include CAD circuits, phone records, gene sequences, city streets, social networks and academic citations. Any kind of relationship, such as actors appearing in movies, can be represented as a graph. This work presents a data mining tool, called ANF, that can quickly answer a number of interesting questions on graph-represented data, such as the following. How robust is the Internet to failures? What are the most influential database papers? Are there gender differences in movie appearance patterns? At its core, ANF is based on a fast and memory-efficient approach for approximating the complete “neighbourhood function” for a graph. For the Internet graph (268K nodes), ANF’s highly accurate approximation is more than 700 times faster than the exact computation. This reduces the running time from nearly a day to a matter of a minute or two, allowing users to perform ad hoc drill-down tasks and to repeatedly answer questions about changing data sources. To enable this drill-down, ANF employs new techniques for approximating neighbourhood-type functions for graphs with distinguished nodes and/or edges. When compared to the best existing approximation, ANF’s approach is both faster and more accurate, given the same resources. Additionally, unlike previous approaches, ANF scales gracefully to handle disk resident graphs. Finally, we present some of our results from mining large graphs using ANF.

1. INTRODUCTION

Graph-based data is becoming more prevalent and interesting to the data mining community, for example in understanding the Internet and the WWW. These entities are modeled as graphs where each node is a computer, administrative domain of the Internet, or a web page, and each edge is a connection (network or hyperlink) between the resources. Google is interested in finding the most “important” nodes in such a graph [2]. Broder et al. studied the connectivity information of nodes in the Internet [13, 3]. The networking community has used different measures of node “importance” to build a hierarchy of the Internet [20]. Another source of graph data that has been studied is citation graphs [18]. Here, each node is a publication and each edge is a citation from one publication to another, and we wish to know the most important papers.

There are many more examples of graphs which contain interesting information for data mining purposes. For example, the telephone calling records from a long distance carrier can be viewed as a graph, and by mining the graph we may help identify fraudulent behaviour or marketing opportunities. DNA sequencing can also be viewed as a graph, and identifying common subsequences is a form of mining that could help scientists. Circuit design, for example from a CAD system, forms a graph and data mining could be used to find commonly used components, points of failure, etc. In fact, any binary relational table is a graph. For example, in this paper we use a graph derived from the Internet Movie Database [10] where we let each actor and each movie be a node and add an undirected edge between an actor, a, and a movie, m, to indicate that a appeared in m. It is also common to define graphs for board positions in a game. We will consider the simple game of tic-tac-toe.

From all of these data sources, we find some prototypical questions which have motivated this work:

1. How robust is the Internet to failures?
2. Is the Canadian Internet similar to the French?
3. Does the Internet have a hierarchical structure?
4. Are phone call patterns (caller-callee) in Asia similar to those in the U.S.?
5. Does a new circuit design appear similar to a previously patented design?
6. What are the most influential database papers?
7. Which set of street closures would least affect traffic?
8. What is the best opening move in tic-tac-toe?
9. Are there gender differences in movie appearances?
10. Cluster movie genres.

It is possible to answer all of these questions by computing three graph properties pertaining to the connectivity or neighbourhood structure of the graph:

Graph Similarity: Given two graphs, G1 and G2, do the graphs have similar connectivity / neighbourhood structure (independent of their sizes)? Such a similarity measure is useful for answering several of the questions above, including question 1.

Subgraph Similarity: Given two subsets of the vertices of the graph, V1 and V2, compare how these two induced subgraphs are connected within the graph. Such a similarity measure is useful for answering several of the questions above, including question 2.

Vertex Importance: Assign an importance to each node in the graph based on the connectivity. This importance measure is useful for answering several of the questions above, including question 1.

We answer questions 1, 7 and 10 in this paper, one from each of the three types. The remaining questions can be answered in a similar fashion, using various forms of the Neighbourhood Function. The basic neighbourhood function, N(h), for a graph, also called the hop plot [8], is the number of pairs of nodes within a specified distance h, for all distances h. In section 2 we will define this more formally and present a more general form of the neighbourhood function that can be used to compute all three graph properties.

The main contribution of this paper is a tool that allows us to compute these three graph properties, thereby enabling us to answer interesting questions like those we suggested. Beyond simply answering the questions, we want our tool to be fast enough to allow drill-down tasks. That is, we want it to be possible to interactively answer users' requests. For example, to determine the best roads to close for a parade, the city planner would want to interactively consider various sets of street closures and compare the effect on traffic. Almost in contrast to the need to be able to run interactively on graphs, we also want a tool that scales to very large graphs. In [3, 13], measuring properties about the web required hardware resources that are beyond the means of most researchers. Instead, we produce a data mining tool that is useful given any amount of RAM. These two goals give rise to the following list of properties that our tool must satisfy when analyzing a graph with n nodes and m edges:

Error guarantees: estimates must be accurate at all distances (not just in the limit).
Fast: scale linearly with # of nodes and # edges (n, m).
Low storage requirements: use only O(n) additional storage.
Adapts to the available memory: when the node set does not fit in memory, make effective use of the available memory.
Parallelizable: for massive graphs, must be able to distribute the work over multiple processors and/or multiple workstations, with low overheads.
Sequential scans of the edge file: avoid random accesses to the graph. Random accesses exhibit horrible paging performance for the common case that the graph is larger than the available memory.
Estimates per node: must be able to estimate the neighbourhood function from each node, not just for the graph as a whole.

This paper presents such a tool, which we call ANF for Approximate Neighbourhood Function. In the literature, we have found two existing approaches that could prove useful for computing the neighbourhood function. We show that neither meets our requirements, primarily because neither scales well to very large graphs. This can be seen in Figure 1, which plots the running time versus the graph size for some randomly-generated graphs. The two existing approaches (the RI approximation scheme [4] and a random sampling approach) scale very poorly while our ANF schemes scale much more gracefully and make it possible to investigate much larger graphs.

[Figure 1: ANF algorithms scale but not the others. Wall clock running time (minutes) versus millions of edges, for RI Approx. (32 trials), 0.15% Exact-on-sample, ANF-0 (32 trials), ANF (32 trials) and ANF-C (32 trials).]

In section 2 we provide background material, definitions and a survey of the related work. In section 3 we describe our ANF algorithms. In section 4, we present experimental results demonstrating the scalability of our approach. In addition, we show that, given the same resources, (1) ANF is much more accurate and faster than the RI approximation scheme, and (2) ANF is more than 700 times faster than the exact computation for a snapshot of the Internet graph (268K nodes). In section 5, we use ANF to answer some of the prototypical questions posed in this introduction.

2. BACKGROUND AND RELATED WORK

2.1 Definitions

Let G = (V, E) be a graph with n vertices, V, and m directed edges, E. (Table 1 summarizes the symbols used in this paper.) Let dist(u, v) be the number of edges on the shortest path from u to v. To answer our prototypical questions, we need a characterization of a node’s connectivity and the connectivity within a graph. Accordingly, we define the following forms of the neighbourhood function:

Def. 1. The individual neighbourhood function for u at h is the number of nodes at distance h or less from u.

$IN(u, h) = |\{v : v \in V, \mathrm{dist}(u, v) \le h\}|$

Def. 2. The neighbourhood function at h is the number of pairs of nodes within distance h.

$N(h) = |\{(u, v) : u \in V, v \in V, \mathrm{dist}(u, v) \le h\}|$, or $N(h) = \sum_{u \in V} IN(u, h)$.

To deal with subgraphs, we generalize these two definitions slightly. Let S be a set of starting nodes and C be a set of concluding nodes. We are interested in the number of pairs starting from a node in S to a node in C within distance h:

Def. 3. The generalized individual neighbourhood function for u at h given C is the number of nodes in C within distance h.

$IN^+(u, h, C) = |\{v : v \in C, \mathrm{dist}(u, v) \le h\}|$

Note that $IN(u, h) = IN^+(u, h, V)$.

Def. 4. The generalized neighbourhood function at h is the number of pairs of a node from S and a node from C that are within distance h or less.

$N^+(h, S, C) = |\{(u, v) : u \in S, v \in C, \mathrm{dist}(u, v) \le h\}|$, or $N^+(h, S, C) = \sum_{u \in S} IN^+(u, h, C)$.

Note that $N(h) = N^+(h, V, V)$.

Table 1: Commonly used symbols
Name   Description
n      Number of vertices
m      Number of edges
V      Vertex set {0, 1, ..., n − 1}
E      Edge set {(u, v) : u, v ∈ V}
d      Diameter of the graph
S      Starting set for the neighbourhood
C      Concluding set for the neighbourhood
r      Number of extra approximation bits
k      Number of parallel approximations

In section 5 we will use the neighbourhood function to characterize graphs. We will compare $N_{G_1}(h)$ to $N_{G_2}(h)$ to measure the similarity in connectivity/neighbourhood structure of graphs G1 and G2. For example, if we want to know the structural similarity of the Web from 1999 to today’s Web, we can compute their neighbourhood functions and compare them at all points. Comparing subgraphs induced by vertex sets V1 and V2 can be done by comparing $N^+(h, V_1, V_1)$ to $N^+(h, V_2, V_2)$. E.g., let V1 be the routers in the Canadian domain and V2 be the routers in the French domain. Finally, we will use the individual neighbourhood function for a node to characterize its importance, with respect to the connectivity. E.g., the most important router is the one that in some way reaches the most routers the fastest. Thus, if we can compute all these variants of the neighbourhood function efficiently, then we can answer the graph questions that we posed in the introduction.

2.2 Related Work

It is trivial to compute N(0) and N(1), which are |V| and |V| + |E| respectively. N(2) is reminiscent of the size of the (self-)join of the edge relation: each edge is viewed as a tuple with attributes “first” and “second” and N(2) − N(1) is the size of the result of the query

    select distinct E1.first, E2.second
    from edge-rel E1, edge-rel E2
    where E1.second = E2.first

Writing N(2) − N(1) in this way illustrates the difficulty in efficiently computing N(h) for any h ≥ 2. The distinct means that we must know which of the $n^2$ possible pairs of nodes have already been counted and we must take care not to over count in the presence of multiple paths between the same pair of nodes.

One approach to computing N(h) is to repeatedly multiply the graph’s adjacency matrix. Asymptotically, this could be done in $O(n^{2.38})$ time. However, we would also require $O(n^2)$ memory, which is prohibitive. Instead, it is common to use breadth first searches in the graph. A breadth-first search beginning from u can easily compute IN(u, h) for all h. We can compute N(h) by running a breadth-first search from each node u and summing over all u. This takes only O(n + m) storage but the worst-case running time is O(nm). This appears to be the state of the art solution for exact computation of N(h). Unfortunately, it does not scale to graphs larger than the RAM. For large, disk resident graphs, a breadth-first search results in an expensive random-like access to the disk blocks.

The transitive closure is N(∞) or, equivalently, N(d), where d is the diameter of the graph. Lipton and Naughton [14] presented an $O(n\sqrt{m})$ algorithm for estimating the transitive closure that uses an adaptive sampling approach for selecting starting nodes of breadth-first traversals. Motivated by this work, in section 4 we will experimentally evaluate a similar sampling strategy to discover that it scales poorly to large graphs, due to the random-like access to the edge file. Most importantly, we will find that the quality of this approximation can be quite poor (we show an example where even a sample as large as 15% does not provide a useful approximation!). Lipton and Naughton’s work was improved by Cohen, who gave an O(m) time algorithm using only O(n + m) memory [4]. Cohen also presented an O(m log n) time algorithm for estimating the individual neighbourhood functions, IN(u, h). This appears to be the only previous work which attempts to approximate the neighbourhood function. More details of this algorithm, which we refer to as the RI approximation, appear in section 4.1.1 when we experimentally compare it to our new approximation. Our experiments demonstrate that the RI approximation is ill-suited for large graphs; this is due to its extensive use of random-like access (for breadth first search, heap data structures, etc.).

The problem of random access to a disk resident edge file has been addressed in [15]. They find that it is possible to define good storage layouts for undirected graphs but that the storage blowup can be very large. Given that we are interested only in very large graphs and graphs with directed edges, this does not solve the problems related to large edge files. Instead, we will need to find a new computation strategy which avoids random access to disk.

Other recent work in data mining for graphs has focused on mining frequent substructures. Essentially, the problem of finding frequent itemsets is generalized to frequent subgraphs. Systems for this include SUBDUE [5] and AGM [11]. Graphs have been used to improve marketing strategies [7]. A survey of work on citation analysis appears in [18]. State-of-the-art approaches to understanding/characterizing the Internet and the Web very often make use of neighbourhood information [3, 13, 20].

3. PROPOSED APPROXIMATION TO THE NEIGHBOURHOOD FUNCTION

We are given a graph G = (V, E). We assume that V = {0, 1, ..., n − 1} and that E contains m directed edges. Undirected graphs can be represented using pairs of directed edges. We wish to approximate the function $N^+(h, S, C)$ and $IN^+(x, h, C)$ for a node x. The approximation must be done accurately and in such a way that we will be able to handle disk resident graphs. In this section, we construct such an approximation gradually. First, we approximate N(h) and/or IN(x, h) assuming memory-resident data structures. Next, we extend this algorithm to approximate $N^+(h, S, C)$ and $IN^+(x, h, C)$, still requiring sufficient RAM for processing. Then, we move the data structure to disk to create an algorithm that meets all of our requirements. Finally, we will extend this algorithm with bit compression to further increase its speed.
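As a reference point for the approximation developed in this section, the exact quantities of Defs. 1 to 4 can be computed with the breadth-first approach described in section 2.2. The following is a minimal in-memory sketch, not the implementation used in the paper: the function names and the adjacency-dictionary representation (`adj` maps every node to a list of its successors) are assumptions made for illustration.

```python
from collections import deque


def individual_neighbourhood(adj, u, concluding=None):
    """Return a list whose entry h is IN+(u, h, C), computed by one BFS from u.

    adj: dict mapping every node to a list of successor nodes (assumed layout).
    concluding: the set C; defaults to all nodes, which gives IN(u, h).
    """
    if concluding is None:
        concluding = set(adj)
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    counts = [0] * (max(dist.values()) + 1)
    for v, d in dist.items():
        if v in concluding:
            counts[d] += 1
    # Turn "exactly at distance h" into "at distance h or less".
    for h in range(1, len(counts)):
        counts[h] += counts[h - 1]
    return counts


def neighbourhood_function(adj, starting=None, concluding=None):
    """Return a list whose entry h is N+(h, S, C), summing IN+ over u in S."""
    if starting is None:
        starting = set(adj)
    per_node = [individual_neighbourhood(adj, u, concluding) for u in starting]
    if not per_node:
        return []
    depth = max(len(c) for c in per_node)
    # A node's count stays at its final value once its BFS is exhausted.
    return [sum(c[min(h, len(c) - 1)] for c in per_node) for h in range(depth)]
```

Each call to `individual_neighbourhood` touches the adjacency lists in traversal order, which is the random-like access pattern that the rest of this section sets out to avoid.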

3.1 Basic ANF Algorithm (ANF-0)

A graph traversal effectively accesses the edge file in random order. Thus, if our algorithm is going to be efficient on a graph that does not fit in memory, we cannot perform any graph traversals. Instead, we are going to build up the nodes reachable from x within h steps by first finding out which nodes its neighbours can reach in h − 1 steps. Slightly more formally, let $\mathcal{M}(x, h)$ be the set of nodes within distance h of x. Clearly, $\mathcal{M}(x, 0) = \{x\}$, since the only node within distance 0 of x is x itself. To compute $\mathcal{M}(x, h)$ we note that x can still reach in h or fewer steps the nodes it could reach in h − 1 or fewer steps. But, x can also reach the nodes in $\mathcal{M}(y, h − 1)$ if there is an edge from x to y. That is:

    $\mathcal{M}(x, 0)$ = {x} for all x ∈ V
    FOR each distance h DO
        $\mathcal{M}(x, h)$ = $\mathcal{M}(x, h − 1)$ for all x ∈ V
        FOR each edge (x, y) DO
            $\mathcal{M}(x, h)$ = $\mathcal{M}(x, h)$ ∪ $\mathcal{M}(y, h − 1)$

This iterates over the edge set instead of performing a traversal. The trick will be to efficiently compute the number of distinct elements in $\mathcal{M}(x, h)$. One possibility is to use a dictionary data structure (e.g., a B-tree) to represent the sets $\mathcal{M}(x, h)$. Unfortunately, this approach needs $O(n^2 \log n)$ time and space, which is prohibitive. An approach that people, particularly C hackers, often employ is to use bits to mark membership. That is, each node is given one of n bits and a set is a bit string of length n. To add a node to the set, we mark its bit. The union of two sets is the bitwise-OR of the bitmasks. However, this approach still uses $O(n^2)$ memory, which will be prohibitive for large graphs.

Instead, we’re going to use a clever probabilistic counting algorithm [9] to approximate the sizes of the sets using shorter bit strings (log n + r bits, for some small constant r). We refer to the bit string that approximates $\mathcal{M}(x, h)$ as M(x, h). Instead of giving each node its own bit, we are going to give about half the nodes bit 0, a quarter of them bit 1, and so on (give a node bit i with probability $1/2^{i+1}$). We still mark a node by setting its bit and use bitwise-OR for the set-union. Estimating the size of the set from the small bit string is done based on the following intuition. If we expect 25% of the nodes to be assigned to bit 1 and we haven’t seen any of them (bit 1 is not set), then we probably saw about 4 or fewer nodes. So, the approximation of the size of the set $\mathcal{M}(x, h)$ is proportional to $2^b$, where b is the least bit number in M(x, h) that has not been set. We refer the reader to [9] for a derivation of the constant of proportionality and a proof that this estimate has good error bounds. A single approximation is obviously not very robust. We do k parallel approximations by treating M(x, h) as a bitstring of length k(log n + r) bits. Figure 2 shows the complete algorithm implementing the edge-scan based ANF.

    // Set M(x, 0) = {x}
    FOR each node x DO
        M(x, 0) = concatenation of k bitmasks each with 1 bit set (P(bit i) = $.5^{i+1}$)
    FOR each distance h starting with 1 DO
        FOR each node x DO
            M(x, h) = M(x, h − 1)
        // Update M(x, h) by adding one step
        FOR each edge (x, y) DO
            M(x, h) = (M(x, h) BITWISE-OR M(y, h − 1))
        // Compute the estimates for this h
        FOR each node x DO
            Individual estimate $\hat{IN}(x, h) = (2^b)/.77351$
                where b is the average position of the least zero bits in the k bitmasks
        The estimate is: $\hat{N}(h) = \sum_{\text{all } x} \hat{IN}(x, h)$

Figure 2: Introduction to the basic ANF algorithm

Example. Figure 3 shows the bitmasks and approximations for a simple example of our most basic ANF algorithm. The purpose is to clarify the concatenation of the bitmasks and to illustrate the computation. The input is a 5 node undirected cycle and we used parameters k = 3 and r = 0. The first FOR loop of the algorithm generates the table of random bitmasks M(x, 0). That is, using an exponential distribution, we randomly set one bit in each of the three concatenated bitmasks. (In the figure, bit 0 is the leftmost bit in each 3-bit mask.) Then, each iteration uses the OR operation to combine the nodes that it could reach in h − 1 steps plus the ones that its neighbours could reach in h − 1 steps. For example, M(2, 1) is just M(1, 0) OR M(2, 0) OR M(3, 0), because nodes 1 and 3 are the neighbors of node 2. The estimates, for example $\hat{IN}(2, 1)$, are computed from the average of the least zero bit positions (2, 1, 1, which average to 4/3): $2^{4/3}/.77351 = 3.25$.

x   M(x, 0)       M(x, 1)       $\hat{IN}(x, 1)$   M(x, 2)       $\hat{IN}(x, 2)$
0   100 100 001   110 110 101   4.1                110 111 101   5.2
1   010 100 100   110 101 101   3.25               110 111 101   5.2
2   100 001 100   110 101 100   3.25               110 111 101   5.2
3   100 100 100   100 111 100   4.1                110 111 101   5.2
4   100 010 100   100 110 101   3.25               110 111 101   5.2

Figure 3: Simple example of basic ANF

The algorithm in Figure 2 uses an excessive amount of memory and does not estimate the more general forms of the neighbourhood function. Figure 4 depicts the same algorithm, with the following improvements:

• M(x, h) uses M(y, h − 1) but never M(y, h − 2). Thus we use M cur(x) to hold M(x, h) and M last(y) to hold M(y, h − 1) during iteration h.
• The starting nodes, S, just change the estimate by summing over x ∈ S instead of x ∈ V. In terms of implementation, this can be done by extending M cur to hold a marked bit indicating membership in S.
• The concluding nodes change the h = 0 case. Now M(x, 0) is {} if x ∉ C since it can reach no nodes in C in zero steps. Thus nodes not in C are initially assigned a bitmask of all 0s.

    FOR each node x DO
        IF x ∈ C THEN
            M cur(x) = concatenation of k bitmasks each with 1 bit set (P(bit i) = $.5^{i+1}$)
    FOR each distance h starting with 1 DO
        FOR each node x DO
            M last(x) = M cur(x)
        FOR each edge (x, y) DO
            M cur(x) = (M cur(x) BITWISE-OR M last(y))
        FOR each node x DO
            $\hat{IN}^+(x, h, C) = (2^b)/.77351$,
                where b is the average position of the least zero bit in the k bitmasks
        $\hat{N}^+(h, S, C) = \sum_{x \in S} \hat{IN}^+(x, h, C)$

Figure 4: ANF-0: In-core ANF

The ANF-0 algorithm meets all but one of the requirements set out in the introduction:

Error guarantees: each $IN^+(x, h, C)$ is provably estimated with low error with high confidence.
Fast: running time is O((n + m)d) which we expect to be fast since d is typically quite small (verified in section 4).
Low storage requirements: only additional memory for M cur and M last.
Easily parallelizable: Partition the nodes among the processors and then each processor may independently compute M cur for each x in its set. Synchronization is only needed after each iteration.
Sequential scans of the edge file: Yes.
Estimates IN(x, h): Yes, with provable accuracy.
Adapts to the available memory? No! We will address this issue in the next section.
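The edge-scan update and the estimate of Figure 2 can be sketched in a few lines of Python. This is an illustrative in-memory version under stated assumptions, not the authors' code: the helper names (`random_mask`, `parse_masks`, `least_zero_bit`, `estimate`, `anf0`) are hypothetical, each of the k bitmasks is stored as a separate integer rather than as one concatenated bit string, and the last bit of a mask absorbs the leftover probability mass. The initial masks are passed in explicitly so that the worked example of Figure 3 can be replayed.

```python
import random


def random_mask(num_bits, rng):
    """Return a mask with exactly one bit set; bit i has probability 0.5**(i+1).

    The final bit absorbs the remaining probability (an assumption of this sketch).
    """
    i = 0
    while i < num_bits - 1 and rng.random() >= 0.5:
        i += 1
    return 1 << i


def parse_masks(text):
    """Parse a Figure 3 style row such as "100 100 001" (bit 0 is leftmost)."""
    return [sum(1 << i for i, c in enumerate(group) if c == "1")
            for group in text.split()]


def least_zero_bit(mask):
    """Position of the lowest-numbered bit that is not set."""
    b = 0
    while mask & (1 << b):
        b += 1
    return b


def estimate(masks):
    """Size estimate 2**b / .77351, with b averaged over the k masks."""
    avg_b = sum(least_zero_bit(m) for m in masks) / len(masks)
    return 2 ** avg_b / 0.77351


def anf0(n, edges, max_h, init_masks):
    """Run max_h edge scans; return est where est[h - 1][x] estimates IN(x, h).

    edges: iterable of directed pairs (x, y); init_masks[x]: list of k ints.
    """
    cur = [list(m) for m in init_masks]
    estimates = []
    for _ in range(max_h):
        last = [list(m) for m in cur]           # M(., h - 1)
        for x, y in edges:                      # sequential scan of the edges
            cur[x] = [a | b for a, b in zip(cur[x], last[y])]
        estimates.append([estimate(cur[x]) for x in range(n)])
    return estimates
```

Feeding the M(x, 0) column of Figure 3 through `parse_masks`, with both directions of each cycle edge, gives estimates of roughly 4.10 and 3.26 at h = 1 and 5.17 at h = 2, matching the figure's 4.1, 3.25 and 5.2 up to rounding. Summing a row of the result gives the corresponding estimate of N(h).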

3.2 ANF Algorithm

The ANF-0 algorithm no longer accesses the edges in random order, but we now access M cur and M last in an effectively random order. When we see the edge (x, y) we read and write M cur(x) and read M last(y). If these tables are larger than the available memory, swapping will kill performance. We propose a small amount of preprocessing, to make these accesses predictable. Our idea is to break the large bitmasks M cur and M last into b1 and b2, resp., equal-sized pieces. We partition the edges into b1 × b2 buckets. In most cases, a one pass bucket sort can be used to partition the edges. Given that we have partitioned the edges, we would like to run the following algorithm to update M cur:

    FOR each bucket i of M cur DO
        Load bucket i of M cur
        FOR each bucket j of M last DO
            Load bucket j of M last
            FOR each edge (x, y) in bucket (i, j) DO
                M cur(x) = M cur(x) OR M last(y)
        Write bucket i of M cur

The cost of this algorithm is exactly the same cost as running ANF-0 plus the cost of the I/O (we have simply reordered the original computation). If M cur and M last are N bytes long, then the cost of the I/O required to update the bitmasks is: 2N to load and store each bucket of M cur and b1 N to read M last once for each bucket of M cur. That is, the cost of this I/O is (b1 + 2)N. Thus, we select b1 and b2 such that b1 is minimal, given that we have enough file descriptors to efficiently perform the bucket sort in one pass.

Note that by reordering the computation to bucketize the edges, we now have very predictable I/O. Thus, we will insert prefetching operations which allow the computation and the I/O to be performed in parallel. The complete algorithm with prefetching appears in Figure 5. This algorithm now meets all of our requirements.

    Select the number of buckets b1 and b2
    Partition the edges into the buckets (sorted by bucket)
    FOR each node x DO
        IF x ∈ C THEN
            M cur(x) = concatenation of k bitmasks each with 1 bit set (P(bit i) = $.5^{i+1}$)
        IF x ∈ S THEN
            mark(M cur(x))
        IF current buffer is full THEN
            switch buffers and perform I/O
    Flush any buffers that need to be written
    FOR each distance h DO
        Fetch the data for the first bucket of M cur and M last
        Prefetch next buckets of M cur and M last
        FOR each edge (x, y) DO
            IF M cur(x) is not in memory THEN
                We have been flushing and prefetching it
                Wait for it if necessary
                Asynchronously flush modified buffer
                Begin prefetching next buffer
            IF M last(y) is not in memory THEN
                We have been prefetching it
                Wait for it if necessary
                Begin prefetching next buffer
            M cur(x) = (M cur(x) OR M last(y))
        // Copy M cur(u) to M last(u) as we stream through M cur(u)
        // computing the estimate
        est = 0
        Fetch the data for the first bucket of M cur
        FOR each node x DO
            IF M cur(x) is not in memory THEN
                We have been prefetching it
                Wait for it to be available
                Start prefetching the next buffer
            M last(x) = M cur(x)
            IF x is the last element in its bucket of M last THEN
                Asynchronously flush the buffer
                Continue processing in the double buffer
            IF marked(M cur(x)) THEN
                $\hat{IN}^+(x, h, C) = (2^b)/.77351$
                    where b is the average position of the least zero bits in the k bitmasks
                est += $\hat{IN}^+(x, h, C)$
        output $\hat{N}^+(h, S, C)$ = est

Figure 5: ANF: Disk based processing

3.3 Leading Ones Compression (ANF-C)

ANF is an algorithm that will be dominated by the I/O costs for large data sets and by the cost of the bit operations for smaller data sets. In both cases, we can further improve ANF by reducing the number of bits of data that must be manipulated. First, observe that, as ANF runs, most of the bitmasks will gradually accumulate a relatively lengthy set of leading 1s. That is, the bitmasks are of the form:

    1111111110xxxxxx

It is wasteful to apply the bit operations and to write these leading 1s to disk. Instead, we will compress them. The ANF-C algorithm uses a counter of the leading ones to reduce the amount of I/O and the number of bit operations. Like the mark bit, this counter can be prepended to the bitmask. Second, we can achieve even better compression by bit shuffling. We have k parallel approximations, each of which has many leading ones. Instead of compressing each mask individually, we interleave the bitmasks by taking the first bit of each mask, followed by the second bit of each, etc. For example, with 2 masks:

    11010, 11100 ⇒ 1111011000

which gives rise to a larger number of leading ones. In our experiments, we will find that leading ones compression provides a significant speed-up, up to 23% in Figure 1.
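The bit shuffling and leading-ones counter described above can be illustrated with a small sketch. It works on Python strings of '0' and '1' purely for readability; a real implementation would operate on packed machine words, and the function names (`shuffle_bits`, `leading_ones`, `compress`, `decompress`) are hypothetical rather than taken from the paper.

```python
def shuffle_bits(masks):
    """Interleave equal-length bit strings: first bit of each, then second, ..."""
    return "".join("".join(column) for column in zip(*masks))


def leading_ones(bits):
    """Number of consecutive '1' characters at the start of the string."""
    return len(bits) - len(bits.lstrip("1"))


def compress(masks):
    """Return (count, rest): the leading-ones counter and the remaining bits."""
    shuffled = shuffle_bits(masks)
    count = leading_ones(shuffled)
    return count, shuffled[count:]


def decompress(count, rest, k):
    """Invert compress() for k masks of equal length."""
    shuffled = "1" * count + rest
    return [shuffled[i::k] for i in range(k)]
```

For the two masks used in the text, `shuffle_bits(["11010", "11100"])` yields `"1111011000"`, so the counter stores 4 leading ones and only the remaining six bits need to be read, combined and written. Compressing each mask on its own would find just 2 and 3 leading ones, which is why the interleaving helps.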

4. EXPERIMENTAL EVALUATION

In this section we present an experimental validation of our ANF approximation. Two alternative approaches will be introduced and then we will describe our data sets. Next, we propose a metric for comparing two neighbourhood functions (functions over a potentially large domain). We conduct a sensitivity analysis of the parameter r. Then, we pick a value of r and we compare ANF to the approximation presented in [4] for various settings of the parameter k. We then show that sampling can provide very poor estimates and, finally, we examine the scalability of all approaches. The key results from this section answer these questions:

1. Is ANF sensitive to r, the number of extra bits?
2. Is ANF faster than existing approaches?
3. Is ANF more accurate than existing approaches?
4. Does ANF really scale to very large graphs? Do the others?

4.1 Framework

4.1.1 RI approximation scheme

The RI approximation algorithm is based on the approximate counting scheme proposed in [4]. To estimate the number of distinct elements in a multi-set, assign each a random value in [0, 1] and record the least of these values added to the set. The estimated size is the reciprocal of the least value seen, minus 1. This approximate counting scheme was used to estimate the individual neighbourhood functions with the following algorithm. We need to know, for each node u, the minimum value $v_h^u$ of a node reachable from u in h hops. Then, the estimate for IN(u, h) is $1/v_h^u - 1$. An equivalent, but more efficient algorithm was presented which uses breadth-first searches. It was shown that this improved procedure takes only O(m log n) time (with high probability). To reduce the variance in the estimates, the entire algorithm is repeated, averaging over the estimates.

4.1.2 Sampling

We can sample by selecting random edges, random nodes or random starting nodes for the breadth-first search. Randomly selecting a set of nodes (and all edges for which both end-points are in this set) and randomly selecting a set of edges is unlikely to produce a useful sample. For example, imagine sampling a cycle: anything but a very large sample will leave disconnected arcs which have very different properties. For completeness we verified that these approaches produced useless estimates. The last approach is much more compelling. It is akin to the sampling done in [14]. Recall that the neighbourhood function is:

$N(h) = \sum_{u \in V} IN(u, h)$

Rather than summing over all nodes, we could sum over only a sample of the nodes while using breadth-first searches to compute the exact IN(u, h). We call this method exact-on-sample and it has the potential to provide great estimates: a single sample of a cycle will provide an exact solution. However, experimentally we find that this approach also has the potential to provide very poor estimates. Additionally, we find that it does not scale to large graphs because of the random-like access to the edge file due to its use of breadth-first search.

4.1.3 Experimental Data Sets

We have collected three real data sets and generated four synthetic data sets. Some summary information is provided in Table 2. These data sets have a variety of properties and cover many of the potential applications of the neighbourhood function. We use three real data sets:

Router: Undirected Internet routers data from ISI [19], including scans done by Lucent Bell Laboratories [12].
Cornell: A crawl of the Cornell web site by Mark Craven.
Cora: The CORA project at JustResearch found research papers on the web and provided a citation graph [6].

and four synthetic data sets:

Cycle: A single simple undirected cycle (circle).
Grid: A 2D planar grid (undirected).
Uniform: Graph with random (undirected) edges.
80-20: Very skewed graph generated in an Internet like fashion with undirected edges using the method in [17].

Table 2: Data set characteristics
Name      #Nodes (n)   #Edges (m)   Max. Degree   Avg. Degree   Prac. Diam.   Orient.
Cornell   844          1,647        131           1.95          9             Dir
Cycle     1,000        1,000        2             2.00          500           Undir
Grid      10,000       19,800       4             1.98          100           Undir
Uniform   65,378       199,996      20            3.06          8             Undir
Cora      127,083      330,198      457           2.60          35            Dir
80-20     166,946      449,832      723           2.69          10            Undir
Router    284,805      430,342      1,978         1.51          13            Undir

“Prac. Diam.” is the Practical Diameter which we use informally to mean the distance which includes most of the pairs of points.

4.1.4 Evaluation Metric

We are approximating functions defined over d points. Let N be the true neighbourhood function and $\hat{N}$ be the candidate approximation. To measure the error of $\hat{N}(h)$, we use the standard relative error metric. To measure the overall error of $\hat{N}$ we use the Root Mean Square (RMS) of pointwise relative errors. Thus, the error function, e, is:

$\mathrm{rel}(N(h), \hat{N}(h)) = \frac{|N(h) - \hat{N}(h)|}{N(h)}$

$e(N, \hat{N}) = \sqrt{\frac{\sum_{h=2}^{d} \mathrm{rel}(N(h), \hat{N}(h))^2}{d - 1}}$

Note that the RMS is computed beginning with h = 2. Since N(0) = |V| and N(1) = |E| we do not require approximations for these points.

4.2 Results

4.2.1 Parameter Sensitivity

ANF has two parameters: the number of parallel approximations, k, and the number of additional bits, r. The number of approximations, k, is a typical trade-off between time and accuracy. We consider this in section 4.2.2 and fix k = 64 for the time being. Additional experiments were run with other values of k which produced similar results. To measure the sensitivity we averaged the RMS error over 10 trials for different values of r and the different data sets. These results appear in Figure 6 and we see that the accuracy is not very sensitive to the value of r. (The lines between the points are a visual aid only.)
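The evaluation metric of section 4.1.4 is simple to compute once both neighbourhood functions are available as arrays. The sketch below is one possible reading of the formula, under the assumption that `true_n[h]` and `est_n[h]` hold N(h) and its approximation for h = 0, ..., d; the function names are hypothetical.

```python
import math


def relative_error(true_value, estimate):
    """Pointwise relative error |N(h) - N^(h)| / N(h)."""
    return abs(true_value - estimate) / true_value


def rms_relative_error(true_n, est_n):
    """RMS of the pointwise relative errors over h = 2..d, divided by d - 1.

    true_n and est_n are assumed to be indexed by h, starting at h = 0.
    """
    if len(true_n) != len(est_n):
        raise ValueError("both functions must cover the same distances")
    d = len(true_n) - 1
    if d < 2:
        raise ValueError("need at least the points h = 0, 1, 2")
    total = sum(relative_error(true_n[h], est_n[h]) ** 2
                for h in range(2, d + 1))
    return math.sqrt(total / (d - 1))
```

The points h = 0 and h = 1 are skipped because, as noted above, they are known exactly and need no approximation.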

We find r = 5 or r = 7 provide consistent results.

Figure 6: Results are not sensitive to r. [Plot of RMS (relative error) for each data set (cornell, cycle, grid, uniform, 80-20, cora, router) with 1, 3, 5, 7 and 9 extra bits.]

4.2.2 Accuracy

We now examine the accuracy of the ANF approximation. To do so, we compare our accuracy with a highly tuned implementation of the RI approximation (the only existing approach). Now we fix r = 7 and consider three values of k: 32, 64 and 128. We average the error over 10 trials of each approximation scheme. The first row of Figure 7 shows the accuracy of each of the k values for each data set for each algorithm, while the second row shows the corresponding running times. We see that:

• ANF’s error is independent of the data sets.
• RI approximation’s error varies significantly between data sets.
• ANF achieves less than 10%, 7% and 5% errors for k = 32, k = 64 and k = 128, respectively.
• RI has errors of 27%, 14% and 12% for k = 32, k = 64 and k = 128, respectively.
• ANF is much faster than RI, particularly on the larger graphs, with up to 3 times savings.
• Using much less time, ANF is much more accurate than RI.

Thus, we have a significant improvement, even for the case of graphs that may be stored in memory.

4.2.3 Sampling

There are three problems with the described exact-on-sample approach. First, it has heavy memory requirements because fast breadth-first search requires that the edge file fit in memory. Second, the quality is dependent on the graph because there are no bounds on the error. Third, it is not possible to compute the individual neighbourhood functions.

We now provide an example which demonstrates the first two problems. First, create a chain of d − 2 nodes that start from a node r and end at a node x. Add N nodes to the center of the graph, each of which has a directed edge to r and a directed edge from x. Finally, define a set of s source nodes that have an edge to each of the N center nodes and a set of t terminal nodes that have an edge from each of the N center nodes. Figure 8(a) helps illustrate our example graph. This graph has diameter d and a neighbourhood function that is $O(N)$ for each distance less than d and $O(N^2)$ for distance d. If N ≫ s and N ≫ t, then the majority of the sampled nodes will be from the N center nodes and very few will be from the s source nodes. This will result in an error that is a factor of around s/p for exact-on-sample using a p% sample.

We measure the error and the running time over 20 trials for a variety of sample sizes ranging from .1% to 15% on a graph generated with N = 25,000, s = t = 5 and d = 6. Figure 8(b) shows the large errors, more than 20%, even for very large samples.

To illustrate the scalability issues for exact-on-sample, we constructed a graph with N = 250,000, s = 100, t = 100 and d = 6. We then increase s and t proportionately to generate larger graphs. Figure 8(c) shows that as the graph grows larger exact-on-sample scales about as well as ANF but as soon as the edge file no longer fits in memory (approximately 27 million edges) we see approximately a two order of magnitude increase in the running time of the exact-on-sample approach. Thus, we conclude that exact-on-sample scales very poorly to graphs that are larger than the available memory.

4.2.4 Speed and Scalability

Table 3 reports wall-clock running times on an otherwise unloaded Pentium II-450 machine for both the exact computation (Breadth-First search) and ANF with k = 64 parallel approximations. We chose k = 64 since it provides much less than a 10% error, which should be acceptable for most situations. Overall, we find that ANF is up to 700 times faster than the exact computation on our data sets. The approximations are quite fast and, for the Router data set, we have reduced the running time from approximately a day down to less than 2 minutes. This makes it possible to run drill down tasks on much larger data sets than before.

Table 3: Wall clock running time (minutes)

Data Set    BF (Exact)    ANF     Speed-up
Uniform     92            0.34    270x
Cora        6             1.4     4x
80-20       680           0.9     756x
Router      1,200         1.7     705x

ANF also scales to much larger graphs than the alternatives. We generated random graphs placing edges randomly between nodes. We increased the number of nodes and edges while preserving an edge:node ratio of 8:1 (based on the average degree found in two large crawls of the Web [3]). Figure 1 (in the introduction) shows the running times for the ANF variants, the RI approximation and exact-on-sample. Parameters for each alternative were chosen such that they all had approximately the same running time for the first data point. These values are k = 32 for the ANF variants, k = 8 for RI and p = 0.0015 for exact-on-sample. We find that:

1. Exact-on-sample scales much worse than linearly. For a fixed sampling rate, we expect it to scale quadratically when we increase the number of nodes and edges.
2. RI very quickly exhausts its resources due to its data structures. Because RI was not designed to avoid the random accesses, it has horrible paging behaviour and, after about 2 million edges, we had to stop its timing experiment.
3. ANF-0 suffers from similar swapping issues when it exhausts the memory at around 8 million edges, and it too had to be stopped.

4. ANF/ANF-C scale the best, growing piece-wise linearly with the size of the graph. This is as expected. The break points are: all data fits in memory (about 8 million edges), $M_{cur}$ fits in memory (about 16 million edges) and neither fits in memory (the rest). ANF-C offers up to a 23% speed-up over ANF.

Thus, ANF is the only algorithm that scales to large graphs and does so with a linear increase in running time. Approximate counting methods [9, 4] are not enough for disk resident graphs.

Figure 7: Our ANF algorithm provides more accurate and faster results than the RI approximation. [Panels: (a) Accuracy, k = 32; (b) Accuracy, k = 64; (c) Accuracy, k = 128; (d) Time, k = 32; (e) Time, k = 64; (f) Time, k = 128. Accuracy panels plot RMS (relative error) and time panels plot wall clock running time (minutes), per data set, for the RI approximation and ANF with 32, 64 and 128 trials.]

Figure 8: Sampled breadth-first search can provide huge errors and does not scale to very large edge files. [Panels: (a) Graph exhibiting huge errors; (b) Error vs. Running time, RMS (relative error) against running time (seconds) for exact-on-sample with .1% to 15% samples and ANF using 2 to 256 trials; (c) Running time vs. # of edges, time (seconds) in log scale against millions of edges for exact-on-sample and ANF (32 trials).]

5. DATA MINING WITH OUR ANF TOOL

With our highly-accurate and efficient approximation tool, ANF, it is now possible to answer some of the prototypical graph mining questions that we posed in the introduction. Due to the limits of page constraints and data availability, we will report on answers to only a representative sample of those questions. However, all 10 questions can be answered by the same approaches that we will now demonstrate.

The approach is to compute various neighbourhood functions and then to compare them. Our tool allows for a detailed comparison of these functions. However, comparing neighbourhood functions requires that we compare two functions over potentially large domains (the domain is $\{1, \ldots, d\}$). Instead, in this paper we will focus on a summarized statistic derived from the neighbourhood function, called the hop exponent. Many real graphs [8] have a neighbourhood function that follows a power law $N(h) \propto h^{H}$. The exponent, H, has been defined as the hop exponent (similarly, $H_x$ is the individual hop exponent for a node x).

There are three interesting observations about the hop exponent that make it an appealing metric. First, if the power law holds, the neighbourhood function will have a linear section with slope H when viewed in log-log scale. Second, the hop exponent is, informally, the intrinsic dimensionality of the graph.
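Because a power law is a straight line in log-log scale, the hop exponent can be read off as a fitted slope. The following is a minimal sketch of that idea, not the ANF tool itself. It assumes the neighbourhood function is given as a list whose first entry is N(1), truncates it at an effective diameter taken as the least h covering 90% of the pairs, and fits a least-squares line; the function names and the two idealised inputs are hypothetical.

```python
import math


def effective_diameter(nf, fraction=0.9):
    """Least h (1-based) whose N(h) covers `fraction` of the pairs counted by N(d)."""
    target = fraction * nf[-1]
    for h, pairs in enumerate(nf, start=1):
        if pairs >= target:
            return h
    return len(nf)


def hop_exponent(nf, fraction=0.9):
    """Slope of the least-squares line through (log h, log N(h)) for h <= d_eff."""
    d_eff = effective_diameter(nf, fraction)
    if d_eff < 2:
        raise ValueError("need at least two points to fit a line")
    xs = [math.log(h) for h in range(1, d_eff + 1)]
    ys = [math.log(pairs) for pairs in nf[:d_eff]]
    mean_x = sum(xs) / d_eff
    mean_y = sum(ys) / d_eff
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Idealised inputs (not measured data): exact power laws N(h) = c * h**H.
linear_growth = [100.0 * h for h in range(1, 51)]
quadratic_growth = [3.0 * h ** 2 for h in range(1, 51)]

print(round(hop_exponent(linear_growth), 6))
print(round(hop_exponent(quadratic_growth), 6))
```

For these exact power laws the fitted slopes are 1 and 2 up to floating-point error, whatever the constant c. Real neighbourhood functions only approximate a power law, so the fitted slope is a summary rather than an exact parameter.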

A cycle has a hop exponent of 1 while a grid has a hop exponent of 2, which corresponds with some idea of their dimensionality. Third, if two graphs have different hop exponents, there is no way that they could be similar. While not all neighbourhood functions will actually follow a power law, we have found that using the hop exponent still fairly reasonably captures the growth of the neighbourhood function.

To compute the hop exponent, we first truncate the neighbourhood function at $d_{eff}$, the effective diameter. We define $d_{eff}$ to be the least h such that it includes 90% of the pairs of nodes. Then we compute the least-squares line of best fit in log-log space to extract the slope of this line. The slope is the hop exponent of the graph and we use it as our measure of the growth of a neighbourhood function.

5.1 Tic-Tac-Toe

Tic-tac-toe is a simple game in which two players, one using X and the other using O, alternately place their mark on an unoccupied square in a 3x3 grid. The winner is the first player to connect 3 of their symbols in a row, column or diagonal. The best opening move for X is the center square, the next best is any of the 4 corners and the worst moves are the 4 remaining squares. As an experiment, we will use our ANF tool to discover this same rule.

Construct a graph where each node is a valid board and add an edge from board x to board y to indicate that this is a possible move. Let C, the concluding set, be the set of all boards in which X wins. Compute the individual neighbourhood functions for each of the 9 possible first moves by X and, from them, the hop exponent of each move, which is their importance (speed at which they attain winning positions from each of these moves). Figure 9 shows these importances along with the difference between each and the best move. ANF determined the correct importance of each opening move. Using Figure 9(b), we see that the center is only slightly better than a corner square which is, in turn, much better than the remaining 4 squares. This shows both the correct ordering of the starting moves and the relative importance of each.

Figure 9: ANF finds the best starting move for X. [Panels: (a) ANF importance, 2.57 for the center, 2.50 to 2.51 for the corners and 2.36 to 2.37 for the remaining squares; (b) Delta importance, -.06 to -.07 for the corners and -.20 to -.21 for the remaining squares.]

5.2 Clustering movie genres

The Internet Movie Data Base (IMDB) [10] is a collection of relations about movies. We will map a subset of the relations into graphs to illustrate questions that we can now answer thanks to our ANF tool. First, construct the actor-movie graph by creating a node for each actor and a node for each movie. An undirected edge is placed between an actor, a, and a movie, m, to indicate an appearance by a in m. Let S be the nodes we created for the actors.

We now employ another relation of the IMDB. Each movie has been identified as being in one or more genres (such as documentaries, dramas, comedies, etc). For each genre, we take the set of movies in that genre and the set of actors which appear in one or more of those movies. We then cluster these graphs by computing the hop exponents and forming clusters that have similar hop exponents (less than 0.1 difference). This clustering appears in Figure 10. One interesting cluster is mystery, musical, western, war which actually corresponds to movies that are typically older. Finally, other fringe genres such as Adult turn out to be well separated from the others.

Figure 10: Movie genre clusters sorted in increasing hop exponent value. [Genres in order: Film-Noir, Animation, Short, Adult, Fantasy, Documentary, Family, Mystery, Musical, Western, War, Sci-Fi, Romance, Horror, Adventure, Crime, Thriller, Action, Comedy, Drama.]

5.3 Internet Router Data

In the networking community, a study used an early version of our ANF tool (ANF-0) to look at the inherent robustness of the Internet. That is, the robustness that we observe from the topology itself. The goal is to determine how robust the Internet is to router failures. This type of study was infeasible before our ANF tool as an exact computation would have required over a year of computation time. Here we reproduce some of the results of this study, from [16], including Figure 11.

Each router was a node and edges were used to indicate communication links. We use the individual hop exponent, $H_x$, as a measure of a node’s importance with respect to the connectivity of a graph. To verify that our notion of importance has some potential use, we delete some number of routers and then measure the total connectivity (number of pairs of routers that are still able to communicate) and the hop exponent of the graph. The three lines differ in how the deleted routers are selected. First, randomly selected nodes are deleted. Second, nodes are deleted in decreasing order of their importance. Third, routers are deleted in decreasing order of their degree. We can answer some of the proposed questions. Here we see some very interesting results:

1. Random failures do not disrupt the Internet.
2. Targeted failures of routers can very quickly and very dramatically disrupt the Internet.
3. It may be possible to take a random sample of the Internet by deleting random routers and adjacent edges. This appears possible because we found that the connectivity information (hop exponent) is not significantly changed under random deletions.

6. CONCLUSIONS

In this paper we presented 10 interesting data mining questions on graph data, proposed an efficient and accurate approximation algorithm that gives us the tool, ANF, we needed to answer these questions, and presented results for three of these questions on real-world data. We have found ANF to be quite useful for these and other questions that can be addressed by studying the neighbourhood structure of the underlying graphs (e.g., we have used ANF to study the most important movie actors).

Figure 11: Effect of router failures on the Internet. [Panels: (a) Number of pairs of nodes that can communicate vs. number of deleted nodes; (b) Hop exponent of the Internet vs. number of deleted nodes. Both panels plot against the number of nodes deleted, 10K to 90K (graph had approx. 285K nodes), for three selection methods: (Uniform) random selection, Decreasing individual hop exponent, Decreasing node degree.]

We experimentally verified that ANF provides the following advantages:

Highly-accurate estimates: Provable bounds which we also verified experimentally, finding less than a 7% error when using k = 64 parallel approximations (for all our synthetic and real-world data sets).
Is orders of magnitude faster: On the seven data sets used in this paper, our algorithm is up to 700 times faster than the exact computation. It is also up to 3 times faster than the RI approximation scheme.
Has low storage requirements: Given the edge file, our algorithm uses only O(n) additional storage.
Adapts to the available memory: We presented a disk-based version of our algorithm and experimentally verified that it scales with the graph size.
Employs sequential scans: Unlike prior approximations of the neighbourhood function, our algorithm avoids random access of the edge file and performs one sequential scan of the edge file per hop.
Can be parallelized: Our ANF algorithm may be parallelized with very few synchronization points.
Individual neighbourhood functions for free: ANF computed approximations of the individual neighbourhood functions as a byproduct of the computation. These approximations proved to be very useful in identifying the “important” nodes in a graph.

Even for the case that graphs (and data structures) fit into memory, ANF represents a significant improvement in speed and accuracy. When graphs get too large to be processed effectively in main memory, ANF makes it possible to answer questions that would have been at least infeasible, if not impossible, to answer before. In addition to its speed, we found the neighbourhood measures to be useful for discovering the following answers to our prototypical questions:

1. We found the best opening moves to tic-tac-toe.
2. We clustered movie genres.
3. We found that the Internet is resilient to random failures while targeted failures can quickly create disconnected components.
4. We found that sampling the Internet actually preserves some connectivity patterns while targeted failures truly distort it.

7. REFERENCES

[1] L. A. Adamic. The small world Web. In Proceedings of the European Conf. on Digital Libraries, 1999.
[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
[3] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web. In Proceedings of the 9th International World Wide Web Conference, pages 247–256, 2000.
[4] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55(3):441–453, December 1997.
[5] Cook and Holder. Graph-based data mining. ISTA: Intelligent Systems & their applications, 15, 2000.
[6] CORA search engine. http://www.cora.whizbang.
[7] P. Domingos and M. Richardson. Mining the network value of customers. In KDD-2001, pages 57–66, 2001.
[8] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, 1999.
[9] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31:182–209, 1985.
[10] IMDB. http://www.
[11] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In PKDD, pages 13–23, 2000.
[12] http://cs.bell-labs.
[13] S. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. The Web as a graph. In ACM SIGMOD–SIGACT–SIGART Symposium on Principles of Database Systems, pages 1–10, 2000.
[14] R. Lipton and J. Naughton. Estimating the size of generalized transitive closures. In Proceedings of 15th International Conference on Very Large Data Bases, pages 315–326, 1989.
[15] M. Nodine, M. Goodrich, and J. Vitter. Blocking for external graph searching. In Proc. ACM PODS Conference (PODS-93), pages 222–232, 1993.
[16] C. R. Palmer, G. Siganos, M. Faloutsos, C. Faloutsos, and P. Gibbons. The connectivity and fault-tolerance of the Internet topology. In Workshop on Network-Related Data Management (NRDM-2001), 2001.
[17] C. R. Palmer and J. G. Steffan. Generating network topologies that obey power laws. In IEEE Globecom 2000, 2000.
[18] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill.
[19] http://www.isi.
[20] S. L. Tauro, C. Palmer, G. Siganos, and M. Faloutsos. A simple conceptual model for the Internet topology. In IEEE Globecom 2001, 2001.