
# Speeding up Algorithms on Compressed Web Graphs

Research Practicum (Practica de Cercetare), June 16, 2011

Student: Neamţu Elena Ramona, MOC1

ABSTRACT

This paper shows that several important classes of web graph algorithms can be extended to run directly on virtual-node-compressed graphs, so that their running times depend on the size of the compressed graph rather than that of the original. These include algorithms for link analysis, for estimating the size of vertex neighborhoods, and a variety of algorithms based on matrix-vector products and random walks.

1. INTRODUCTION

One approach to implementing algorithms on compressed graphs is to decompress the graph on the fly, so that a client algorithm does not need to know how the underlying graph is compressed. Another approach is to design specialized algorithms that can directly use the compressed representation.

In virtual node compression, a succinct representation of the graph is constructed by replacing dense subgraphs with sparse ones. In particular, a directed bipartite clique on a vertex set K is replaced by a star centered at a new 'virtual' node, with the nodes of K as its leaves. Applying this transformation repeatedly leads to a compressed graph with significantly fewer edges and a relatively small number of additional nodes. Feder and Motwani showed that several classical graph algorithms can be sped up by using a similar type of virtual node compression, in which an undirected clique is transformed into a star. Recently, Buehrer and Chellapilla demonstrated that virtual node compression can achieve high compression ratios for web graphs.

A large class of web graph algorithms can be extended to run on virtual-node-compressed graphs, with running-time speed-ups proportional to the compression ratio. As a fundamental tool, we first show that multiplication by the adjacency matrix of the graph can be performed in time proportional to the size of the compressed graph. Using this matrix multiplication routine as a black box, we obtain significant speed-ups for numerous popular web graph algorithms, including PageRank, HITS and SALSA, and various algorithms based on random walks. The multiplication routine can be implemented in the sequential file access model, and on a distributed graph using a small number of global synchronizations.


2. BACKGROUND

2.1 Graph Compression Using Virtual Nodes

A directed bipartite clique (or biclique) (S,T) is a pair of disjoint vertex sets S and T such that for each u ∈ S and v ∈ T, there is a directed edge from u to v in G. Given a biclique (S,T), we form a new compressed graph G'(V',E') by adding a new vertex w to the graph, removing all the edges in (S,T), and adding a new edge uw ∈ E' for each u ∈ S and a new edge wv ∈ E' for each v ∈ T. This transformation is depicted in Figure 1.

Figure 1: The bipartite clique-star transformation

We call the node w a virtual node, as opposed to the real nodes already present in G. The biclique-star transformation may be performed again on G'; this process is then repeated many times, and the graph obtained is called a compression of G. Generally, given two digraphs G(V,E) and G'(V',E'), we say G' is a compression of G if it can be obtained by applying a series of bipartite clique-star transformations to G. This relation is denoted G' ⪯ G. We refer to the quantity |E|/|E'| as the compression ratio. We allow virtual nodes to be reused: a biclique (S,T) found in G' may itself contain virtual nodes, in which case the virtual edge path between u and v in the resulting graph G'' is extended to u → w → w' → v.

Note that the biclique-star transformation essentially replaces an edge uv in G with a unique path u → w → v in G' that acts as a placeholder for the original edge. We call such a path a virtual edge. Any compression G' ⪯ G satisfies the following properties:

• If uv ∈ E is such that uv ∉ E', then there exists a unique path from u to v in G' consisting internally entirely of virtual nodes.
• There is a one-to-one correspondence between the edges in G and the set of edges and virtual edges in G'.
• Let Q be the set of real nodes in G' that are reachable from a real node u by a path consisting internally entirely of virtual nodes. Then Q is exactly the set of out-neighbors of u in G.
• The graph induced by the edges incident to and from virtual nodes in G' is a set of disjoint directed trees. We refer to this as the acyclic property of compressed graphs.

We want to bound the maximum length of any virtual edge, which we call the depth of the compression.
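As an illustration of the clique-star transformation, here is a minimal Python sketch; the edge-set representation and the helper name are our own, not from the paper:

```python
def biclique_star(edges, S, T, w):
    """Replace the biclique (S, T) by a star centered at the new
    virtual node w: drop every edge u -> v with u in S, v in T,
    then add u -> w for each u in S and w -> v for each v in T."""
    edges = {(u, v) for (u, v) in edges if not (u in S and v in T)}
    edges |= {(u, w) for u in S}
    edges |= {(w, v) for v in T}
    return edges

# A 3x3 biclique has 9 edges; the star needs only 3 + 3 = 6.
g = {(u, v) for u in (1, 2, 3) for v in (4, 5, 6)}
g2 = biclique_star(g, {1, 2, 3}, {4, 5, 6}, w="w1")
```

Each original edge uv now corresponds to the unique virtual edge u → w → v, matching the one-to-one correspondence property above.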

2.2 Finding Virtual Nodes Using Frequent Item-set Mining

Finding the optimal compression is NP-hard, but a good approximation algorithm exists for the restricted problem of finding the best compression obtainable from a collection of vertex-disjoint bicliques. Buehrer and Chellapilla introduced an algorithm that produces compressions of web graphs with high compression ratio and small depth. Their algorithm finds collections of bicliques using techniques from frequent item-set mining, and runs in time O(|E| log |V|). They report that the resulting compressed graphs contain five to ten times fewer edges than the originals, for a variety of page-level web graphs. Obtaining this compression typically requires 4-5 phases of the algorithm, leading to compressions whose depth is a small constant.

2.3 Notation

We consider directed graphs G(V,E) with no loops or parallel edges. We denote the sets of in-neighbors and out-neighbors of a node v by δ_in(v) and δ_out(v) respectively. We overload the symbol E to denote the adjacency matrix of the graph:

E[u,v] = 1 if the edge uv ∈ E, and 0 otherwise.

By W we denote the random walk matrix obtained by normalizing each row of E, and by Pr(u,v) the probability of transition from u to v. So, if p0 is the starting probability distribution, then p1 = W^T p0 is the distribution resulting from a single step of the uniform random walk on the graph.

3. Speeding up Matrix-Vector Multiplication

The multiplication of a vector by the adjacency matrix of a graph can be carried out in time proportional to the size of the graph's compressed representation. This matrix multiplication can then be used as a black box to obtain efficient compressed implementations of several algorithms.

3.1 Adjacency Matrix Multiplication

Proposition 1. Let G be a graph with adjacency matrix E, and let G' ⪯ G be a compression of G. Then for any vector x ∈ R^|V|, the matrix-vector product E^T x can be computed in time O(|E'| + |V'|). This computation needs only sequential access to the adjacency list of G' and does not require the original graph G.
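Before turning to the proof, the random-walk notation of Section 2.3 can be made concrete with a small NumPy sketch (a dense toy example of ours; real web graphs would use sparse structures):

```python
import numpy as np

def walk_matrix(E):
    """Row-normalize a 0/1 adjacency matrix into the uniform
    random-walk matrix W; rows without out-edges stay all-zero."""
    deg = E.sum(axis=1, keepdims=True)
    return np.divide(E, deg, out=np.zeros_like(E, dtype=float),
                     where=deg > 0)

E = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [1., 0., 0.]])
W = walk_matrix(E)
p0 = np.array([1.0, 0.0, 0.0])   # walk starts at node 0
p1 = W.T @ p0                    # one uniform random-walk step
```

Node 0 has two out-links, so after one step the walk sits on nodes 1 and 2 with probability 1/2 each.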

Proof. First, consider what the computation y = E^T x looks like when the uncompressed graph G is accessible. Algorithm 1 simply encodes the definition of y:

y[v] = Σ_{uv∈E} x[u]   (1)

Algorithm 1: Multiply(E, x)
forall Nodes v ∈ V do y[v] = 0
forall Nodes u ∈ V do
  forall Edges uv ∈ E do y[v] = y[v] + x[u]

Algorithm 1 performs a series of what are popularly called 'push' operations: the value stored at node u in x is 'pushed' along the edge uv.

The input vector x is not defined on virtual nodes. For a virtual node v, we expand x[v] as:

x[v] = Σ_{uv∈E'} x[u]   (2)

The corresponding equation computes y using the compressed graph G':

y[v] = Σ_{uv∈E'} x[u]   (3)

Definitions (1) and (3) of y are equivalent: using the recursive definition (2), we can expand the terms corresponding to virtual nodes on the right-hand side of equation (3) to obtain exactly equation (1). We illustrate this in Figure 2.

Figure 2: Push operation on a compressed graph

y[v] = x[u1] + x[u2] + x[u3] + x[u4] + x[u5] = x[u1] + x[u2] + x[w]

Due to the recursive definition (2), the values at virtual nodes have dependencies. We formalize this by assigning a rank R(v) to each virtual node v using the following recursive definition, which is well-founded by the acyclic property:

• If u is real for all edges uv ∈ E', then R(v) = 0.
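Algorithm 1 translates directly into Python (an edge-list representation of our own choosing):

```python
def multiply(edges, x, nodes):
    """Algorithm 1: compute y = E^T x on the uncompressed graph by
    'pushing' x[u] along every edge u -> v."""
    y = {v: 0.0 for v in nodes}
    for u, v in edges:
        y[v] += x[u]
    return y

# y[v] sums x over the in-neighbors of v, as in equation (1).
y = multiply({(1, 3), (2, 3), (1, 2)}, {1: 1.0, 2: 2.0, 3: 0.0}, {1, 2, 3})
```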

• Else, R(v) = 1 + max_{u ∈ δ_in(v), u virtual} R(u).

We now reorder the rows of the adjacency-list representation of G' in the following manner:

1. Adjacency lists of real nodes appear before those of virtual nodes.
2. For two virtual nodes u and v, if R(u) ≤ R(v) then the adjacency list of u appears before that of v.

Algorithm 2 computes y using the reordered representation of G':

Algorithm 2: Compressed-Multiply(E', x)
forall Real nodes v do y[v] = 0
forall Virtual nodes v do x[v] = 0
forall Nodes u ∈ V' do
  forall Edges uv ∈ E' do
    if v is real then y[v] = y[v] + x[u]
    else x[v] = x[v] + x[u]

The time per multiplication is O(|V'|) + O(|E'| + |V'|) = O(|E'| + |V'|). Note that the reordering can be performed during preprocessing by computing the ranking function R, using a simple algorithm that requires O(|E'| + |V'|) time.

3.2 Applications of Compressed Multiplication

Here we describe a few examples of algorithms that can be written in terms of adjacency matrix multiplication:

• Random walk distribution: Starting from an initial distribution p0, we run T iterations by computing p_{t+1} = E^T D^{-1} p_t, where D is the diagonal matrix of out-degrees. The time per iteration is O(|E'| + |V'|).

• Eigenvectors and spectral methods: The time required per iteration of the power method is O(|E'| + |V'|).

• Top singular vectors: The compressed in-link graph and out-link graph have the same values of |E'| and |V'|, so the time per iteration is again O(|E'| + |V'|).
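A sketch of Algorithm 2 in Python, assuming `order` lists real nodes first and then virtual nodes by increasing rank R (the dict-based representation is ours):

```python
def compressed_multiply(order, out_edges, virtual, x):
    """Algorithm 2: compute y = E'^T x on the compressed graph.
    Because virtual nodes appear in `order` after all of their
    predecessors, x[v] is complete before v pushes it onward."""
    x = dict(x)                      # do not mutate the caller's vector
    for v in virtual:
        x[v] = 0.0
    y = {v: 0.0 for v in order if v not in virtual}
    for u in order:
        for v in out_edges.get(u, ()):
            if v in virtual:
                x[v] += x[u]         # accumulate at the virtual node
            else:
                y[v] += x[u]         # push into the output vector
    return y

# Star from Figure 1: real nodes 1,2,3 -> virtual 'w' -> real 4,5,6.
out_edges = {1: ['w'], 2: ['w'], 3: ['w'], 'w': [4, 5, 6]}
y = compressed_multiply([1, 2, 3, 4, 5, 6, 'w'], out_edges, {'w'},
                        {1: 1.0, 2: 2.0, 3: 3.0, 4: 0.0, 5: 0.0, 6: 0.0})
```

The sum x[1] + x[2] + x[3] = 6 passes through the single virtual node instead of along nine original edges, yet each of the nodes 4, 5, 6 receives the same value as in the uncompressed computation.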

• PageRank: Given a graph G with adjacency matrix E, PageRank can be computed by the following power-method step:

x_{i+1} = (1 − α) E^T (D^{-1} x_i) + α j

where α is the jump probability and j is the jump vector. Each iteration requires Θ(|E| + |V|) operations on the uncompressed graph G, and O(|E'| + |V'|) operations on the compressed graph.

• HITS and SALSA: The HITS algorithm assigns a separate hub score and authority score to each web page in a query-dependent graph, equal to the top eigenvectors of EE^T and E^T E. SALSA can be viewed as a normalized version of HITS, in which the authority vector a and hub vector h are the top eigenvectors of W_r^T W_c and W_c W_r^T, where W_r and W_c are the row- and column-normalized versions of E. These algorithms essentially perform several iterations of the power method, and each iteration can be viewed as a multiplication by the adjacency matrix.

• Estimating the size of neighborhoods: Becchetti et al. introduced an algorithm, based on probabilistic counting, for estimating the number of nodes within r steps of each node in a graph. Its iteration can also be viewed as a multiplication by the adjacency matrix, with the sum replaced by bitwise OR.

4. Stochastic Algorithms on Compressed Graphs

The approaches described above can be used to speed up the canonical link-analysis algorithms PageRank, HITS, and SALSA. In this section, we show that the stationary vector of the random walk on the original graph can be computed by computing the stationary vector of a Markov chain running on the compressed graph, then projecting and rescaling.

4.1 PageRank on Compressed Graphs

PageRank models a uniform random walk on the web graph performed by a random surfer. To ensure ergodicity, we assume that the surfer clicks on a random link on a page only with probability 1 − α. With probability α, 0 ≤ α ≤ 1, he jumps to an arbitrary page in the graph, which he chooses from a probability distribution j, the jump vector. The equation governing the steady state is:

p = ((1 − α) W^T + α J) p = L^T p

where J is the jump matrix containing a copy of j in each column. The matrix L represents the underlying Markov chain: it models the jump-adjusted uniform random walk in G.

Our goal is to run a similar algorithm on a compression G' ⪯ G such that, restricted to the nodes in V, the result matches that on G. The random walk on G' is not uniform (unlike the one on G), and transitions made from virtual nodes have zero jump probability.
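The uncompressed power-method step x_{i+1} = (1 − α) E^T (D^{-1} x_i) + α j can be sketched as follows (dense NumPy, a uniform jump vector, and we assume every node has at least one out-link):

```python
import numpy as np

def pagerank(E, alpha=0.15, iters=100):
    """Power iteration for PageRank on a dense adjacency matrix E.
    Assumes no dangling nodes (every row of E has a nonzero sum)."""
    n = E.shape[0]
    deg = E.sum(axis=1)              # diagonal of D
    j = np.full(n, 1.0 / n)          # uniform jump vector
    x = j.copy()
    for _ in range(iters):
        x = (1 - alpha) * (E.T @ (x / deg)) + alpha * j
    return x

# A symmetric 2-cycle: by symmetry both pages get equal PageRank.
E = np.array([[0., 1.],
              [1., 0.]])
p = pagerank(E)
```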

For a graph G (compressed or otherwise), we define ∆_G(u) as follows:

∆_G(u) = 1 if u is real; ∆_G(u) = Σ_{w ∈ δ_out(u)} ∆_G(w) if u is virtual.

For a virtual node u, ∆_G(u) is the number of real nodes reachable from u by virtual edges. Given the function ∆_G, we define the real out-degree of u in G:

Γ_G(u) = Σ_{w ∈ δ_out(u)} ∆_G(w)

Figure 3: Illustration of the ∆ function

For a real node u in a compression G' ⪯ G, Γ_G'(u) is the number of real nodes in G' reachable from u using one edge or virtual edge; hence it equals the out-degree of u in G.

Given graphs G' ⪯ G, jump probability α, and jump vector j, we define the random walk on G' as follows:

• Let X be the matrix of dimension |V'| × |V'| such that X[u,v] = ∆_G'(v) / Γ_G'(u) if uv ∈ E', and 0 otherwise.
• We obtain Y from X by making adjustments for the jump probability: Y[u,v] = (1 − α) X[u,v] if u is real, and Y[u,v] = X[u,v] if u is virtual.
• Pad the jump vector j with zeros to obtain a jump vector j' for G'. This ensures that the jump vector has zeros in the entries corresponding to virtual nodes, so transitions made from virtual nodes have zero jump probability. Let J' be the jump matrix containing copies of j' in each column.
• The desired Markov chain is given by the transition matrix MC(G') = Z = (Y + α J'^T)^T = Y^T + α J'.

Algorithm 3 takes as input the graph G, its compressed representation G', the jump probability α, and the jump vector j, and computes PageRank on the vertices of G using strictly the graph G'.
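The functions ∆ and Γ can be computed by a direct recursion over the virtual trees; this Python sketch relies on the acyclic property (the representation and names are ours):

```python
def make_delta(out_edges, virtual):
    """Delta(u) = 1 for a real node; for a virtual node, the sum of
    Delta over its out-neighbors, i.e. the number of real nodes it
    leads to. Terminates because virtual edges form disjoint trees."""
    def delta(u):
        if u not in virtual:
            return 1
        return sum(delta(w) for w in out_edges.get(u, ()))
    return delta

def gamma(u, out_edges, delta):
    """Gamma(u): the real out-degree of u, i.e. u's out-degree in the
    original graph G, counted on the compressed graph G'."""
    return sum(delta(w) for w in out_edges.get(u, ()))

# Star example: node 1's single compressed edge hides 3 real edges.
out_edges = {1: ['w'], 2: ['w'], 3: ['w'], 'w': [4, 5, 6]}
delta = make_delta(out_edges, {'w'})
```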

Algorithm 3: ComputePageRank(G', α, j)
1. Compute Z = MC(G').
2. Compute the steady state p' of the Markov chain represented by Z.
3. Project p' onto the set of real nodes to obtain p''; discard the values for virtual nodes.
4. Scale p'' up to unit L1 norm to obtain p, the desired vector of PageRank values on G.

This algorithm can be implemented to run in time Θ(r(|E'| + |V'|)), where r is the desired number of power iterations.

Theorem 1. The vector p computed by Algorithm 3 satisfies p = ((1 − α) W^T + α J) p. That is, p is the steady state of the jump-adjusted uniform random walk MC(G).

Let β = ||p''||_1 / ||p'||_1 be the scaling factor between p'' and p' in Algorithm 3. If the value of the constant β is very small, the computed values of p' will contain very few bits of accuracy, and the subsequent scaling up will only maintain this precision. Theorem 2 proves a lower bound on β. We use a sequence of graphs G_i, as in the proof of Theorem 1, such that G_i is the graph after i phases of edge-disjoint clique-star transformations.

Theorem 2. Let G' = G_k ⪯ G_{k−1} ⪯ ... ⪯ G_1 ⪯ G_0 = G be any such sequence of graphs. Then β ≥ 2^{−k}.

Claim 1. For all 0 ≤ i ≤ k and u ∈ V_i, p_{i+1}[u] = β_i p_i[u], where β_i is a constant depending only upon i.

Since only 4-5 phases are required in practice to obtain nearly the best possible compression, the theorem implies that we lose only 4-5 bits of floating-point accuracy when using Algorithm 3.

4.2 SALSA on Compressed Graphs

SALSA is a link-analysis algorithm similar to HITS that assigns each webpage a separate authority score and hub score. Let G(V,E) be the query-specific graph under consideration, with W_r and W_c the row- and column-normalized versions of E respectively. The authority vector a and hub vector h are the top eigenvectors of W_r^T W_c and W_c W_r^T respectively, satisfying the following recursive definition:

a = W_r^T h
h = W_c a
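Steps 2-4 of Algorithm 3 can be sketched as below; we use plain power iteration as a generic stand-in for the steady-state computation, since the text does not prescribe a specific solver:

```python
import numpy as np

def steady_state(Z, iters=200):
    """Power iteration for the stationary vector of the
    column-stochastic transition matrix Z = MC(G')."""
    p = np.full(Z.shape[0], 1.0 / Z.shape[0])
    for _ in range(iters):
        p = Z @ p
    return p

def project_and_rescale(p, real):
    """Steps 3-4: discard virtual-node entries, then scale the
    remaining vector back up to unit L1 norm."""
    p_real = p[real]
    return p_real / np.abs(p_real).sum()

# Toy 2-state chain where the second state plays the 'virtual' role.
Z = np.array([[0.5, 0.5],
              [0.5, 0.5]])
p = project_and_rescale(steady_state(Z), np.array([True, False]))
```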

The solutions a and h of the above system are unique and have non-negative entries. As with PageRank, the power method can be employed to compute them; one iteration on the uncompressed graph is:

a_{i+1}[u] = Σ_{v ∈ δ_in(u)} h_i[v] / |δ_out(v)|
h_{i+1}[u] = Σ_{v ∈ δ_out(u)} a_i[v] / |δ_in(v)|

Figure 4: SALSA on the uncompressed graph

On a compressed graph, the situation differs from PageRank. In the case of PageRank, only one commodity, the PageRank score, flows through the network; the virtual nodes in compressed graphs merely delay the flow of PageRank between real nodes. In the case of SALSA, the hub score of node u is pushed along a forward edge (uv ∈ E) into the authority score bucket of node v, whereas the authority score of node v is pushed along the reverse edge into the hub score of node u.

Consider an edge uv ∈ E in the graph G and the corresponding virtual edge u → w → v in the compressed graph G'. If we attempt to run the SALSA power iterations unchanged on G', the hub score h'[u] would contribute to a'[w] but never to a'[v]. Instead, for a virtual edge u → w → v, we must push the hub score h'[u] into the hub score h'[w], which subsequently contributes to a'[v] as desired.

We define a function Λ_G, in a manner analogous to ∆_G but on reversed edges:

Λ_G(u) = 1 if u is real; Λ_G(u) = Σ_{w ∈ δ_in(u)} Λ_G(w) if u is virtual.

Similarly, we define the in-degree analogue of Γ_G:

Φ_G(u) = Σ_{w ∈ δ_in(u)} Λ_G(w)

The resulting iterations on the compressed graph are:

a_{i+1}[u] = ∆_G'(u) Σ_{v ∈ δ_in(u)} h_i[v] / Γ_G'(v),  if u is real
a_{i+1}[u] = ∆_G'(u) Σ_{v ∈ δ_out(u)} a_i[v] / Φ_G'(v), if u is virtual

h_{i+1}[u] = ∆_G'(u) Σ_{v ∈ δ_out(u)} a_i[v] / Φ_G'(v), if u is real
h_{i+1}[u] = ∆_G'(u) Σ_{v ∈ δ_in(u)} h_i[v] / Γ_G'(v),  if u is virtual
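For reference, the uncompressed SALSA iterations of Figure 4 look as follows in NumPy (a dense sketch of ours; we renormalize h each round to keep the power method stable):

```python
import numpy as np

def salsa(E, iters=100):
    """Power iterations a = Wr^T h, h = Wc a on the query graph E,
    where Wr and Wc are the row- and column-normalized versions of E."""
    out_deg = E.sum(axis=1, keepdims=True)
    in_deg = E.sum(axis=0, keepdims=True)
    Wr = np.divide(E, out_deg, out=np.zeros_like(E), where=out_deg > 0)
    Wc = np.divide(E, in_deg, out=np.zeros_like(E), where=in_deg > 0)
    h = np.full(E.shape[0], 1.0 / E.shape[0])
    for _ in range(iters):
        a = Wr.T @ h                 # hub scores push forward
        h = Wc @ a                   # authority scores push backward
        h /= h.sum()                 # keep the iterate normalized
    return a / a.sum(), h

# Symmetric 2-cycle: both pages are equally good hubs and authorities.
E = np.array([[0., 1.],
              [1., 0.]])
a, h = salsa(E)
```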

Figure 5: SALSA on a compressed graph, with appropriately modified weights

The following theorem proves the correctness of our solution.

Theorem 3. Let [a', h'] and [a, h] be the top eigenvectors of M' and M respectively, where M' and M represent the SALSA iterations on G' and G. Then a'[u] = β a[u] and h'[u] = β h[u] for all u ∈ V(G). If k is the length of the longest virtual edge in G', then β ≥ 2^{−k}.

4.3 Comparison of the Two Approaches

We now summarize the advantages and disadvantages of computing PageRank and SALSA with the black-box multiplication algorithms versus the Markov chain algorithms.

• With the Markov chain method, an existing implementation of PageRank can be run directly on the compressed graph to compute PageRank in the original uncompressed graph.
• For the Markov chain method, the irreducibility and aperiodicity of the Markov chain are not immediately obvious. Aperiodicity can be obtained by introducing a non-zero probability α of non-transition on real nodes, apart from slowing the algorithm down to a small extent.
• Although the Markov chain algorithms converge to eigenvectors that are similar to the corresponding eigenvectors on the uncompressed graph, the number of iterations required may change; it may increase by at most a factor of the length of the longest virtual edge. The black-box methods simply speed up each individual iteration, so their number of iterations is identical to the uncompressed case.
• The Markov chain methods are not directly applicable to HITS, because the scaling step performed after every iteration destroys correctness.
• The black-box method on SALSA needs the lists of in-links of virtual nodes and separate orderings on virtual-node in-links and out-links; this adds to the storage required for the compressed graph. Black-box multiplication also requires that certain sets of virtual nodes be pushed before others.
• Both methods can be efficiently parallelized: any parallel algorithm for computing PageRank or SALSA can be used, requiring a small number of global synchronizations in each iteration.

The number of iterations required by the Markov chain algorithms, and the overall comparison of speed-up ratios, are examined experimentally.

5. Experiments

The methods discussed above were implemented and compared against standard versions of PageRank and SALSA running on uncompressed graphs. Both the black-box and Markov chain methods show an improvement in the time per iteration over the uncompressed versions of the algorithms, achieving significant speed-ups. The overall speed-up ratios do not exactly match the reduction in the number of edges, because both algorithms perform some book-keeping operations on the compressed graphs due to the increased number of nodes, bringing down the net performance boost. In the case of SALSA, the Markov chain method performs better than the black-box method; it also requires slightly less storage on disk, since it only needs to store one ordering of the virtual nodes. However, the Markov chain method requires more iterations to converge to the same accuracy.

References

[1] R. Andersen and K. Lang. Communities from seed sets. In WWW, 2006, 223-232.
[2] P. Boldi and S. Vigna. The WebGraph framework II: Codes for the world-wide web. In Data Compression Conference, 2004, 258.
[3] F. Chung. Spectral Graph Theory.
[4] T. Feder and R. Motwani. Clique partitions, graph compression and speeding-up algorithms. 1995, 261-272.
[5] F. McSherry. A uniform approach to accelerated PageRank computation. In WWW, 2005, 575-582.