27 views

Uploaded by api-25884893

save

- ('Christos Papadimitriou', 'Midterm 1', '(Solution)') Fall 2009
- lec13-dp2.pdf
- A COMPARATIVE STUDY ON MULTICAST ROUTING USING DIJKSTRA’S, PRIMS AND ANT COLONY SYSTEMS
- Dijk Stra Algorithm Description
- QoS routing
- Management of Communication Failures in Formation Flight
- Billey_Discrete_Mathematical_Modeling(math381_lecture_notes_Winter_2011_U_Washington).pdf
- Graphs Implementation Tips
- Graph Kernels.pdf
- Asa 2011 Graphs
- Bayesian Reasoning and Machine Learning
- null
- null
- null
- Http://Github.com/Stuarthalloway/Clojure Presentations
- null
- null
- Clojure’s Four Elevators
- null
- null
- null
- null
- null
- null
- null
- For Rubyists Http://Github.com/Stuarthalloway/Clojure-presentations
- The $25,000,000,000∗ Eigenvector the Linear Algebra Behind
- Data-Intensive Information Processing Applications ― Session #4
- Gi From the Bottom Up
- Programming Paradigms for Dummies: What Every Programmer
- Data-Intensive Text Processing With MapReduce
- Http://Github.com/Stuarthalloway/Clojure Presentations
- The Anatomy of a Large-Scale Social Search
- Quick Reference
- Purely Functional Data Structures
- The Python & the Elephant
- Tuesday: Lambda Calculus & Combinatory Logic
- Select Dim1, Dim2, Sum(Measure1) as Msum, Count(*)
- Tutorial: Metacompilers Part 1
- High Performance Scalable Data Stores Rick Cattell
- HSE
- POLITICAS_CEAMEG
- Memorias de Un Abogado
- San 14109
- 88p8968
- 19th English
- NANOGEOSCIENCE: An Introduction
- Avaliação 7º Ano A
- Libro Control Digital
- Jane Eyre and Psychology
- Normas Em Controle de Infecções Hospitalares 1
- UeberWahrheitundLüge_Konstruktivismus.pdf
- Stella. Stimulation and Modelling
- math rubric for mr
- Research Methodology 3.pdf
- Los 5 Principios de BACH
- Sullivan
- Le Chirk, C'est la plus immense injustice
- En Word
- Owners Manual Rich
- Guía 2-módulo 1-220144
- A Different Shape for Curriculum; C Dwyer 2012
- $R5JHZ1A
- S04 VALUES 10082016
- Columnas de La Tierra 1
- Empresa é responsável por ato de preposto que emite duplicata fraudulenta - Jus Navigandi
- Bread Science
- The Bank Intermediary Function and ed Loans Phenomenon
- 1.- La Ética Del Trabajo
- Analisis de Sensor

You are on page 1of 48

Graph Algorithms

Jimmy Lin

University of Maryland

Tuesday, March 2, 2010

**This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
**

See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Source: Wikipedia (Japanese rock garden)

Today’s Agenda

| Graph problems and representations

| Parallel

aa e b breadth-first

eadt st sea

search

c

| PageRank

What’s a graph?

| G = (V,E), where

z V represents the set of vertices (nodes)

z E represents the set of edges (links)

z Both vertices and edges may contain additional information

| Different types of graphs:

z Directed vs. undirected edges

z Presence or absence of cycles

| Graphs are everywhere:

z Hyperlink structure of the Web

z Physical structure of computers on the Internet

z Interstate highway system

z Social networks

Source: Wikipedia (Königsberg)

Some Graph Problems

| Finding shortest paths

z Routing Internet traffic and UPS trucks

| Finding minimum spanning trees

z Telco laying down fiber

| Finding Max Flow

z Airline scheduling

| Identify “special” nodes and communities

z Breaking up terrorist cells, spread of avian flu

| Bi tit matching

Bipartite t hi

z Monster.com, Match.com

| And of course

course... PageRank

Graphs and MapReduce

| Graph algorithms typically involve:

z Performing computations at each node: based on node features,

edge features, and local link structure

z Propagating computations: “traversing” the graph

| Key questions:

z How do you represent graph data in MapReduce?

z How do you traverse a graph in MapReduce?

Representing Graphs

| G = (V, E)

| Two

o co

common

o representations

ep ese tat o s

z Adjacency matrix

z Adjacency list

Adjacency Matrices

Represent a graph as an n x n square matrix M

z n = |V|

z Mij = 1 means a link from node i to j

2

1 2 3 4

1 0 1 0 1 1

3

2 1 0 1 1

3 1 0 0 0

4 1 0 1 0 4

Adjacency Matrices: Critique

| Advantages:

z Amenable to mathematical manipulation

z Iteration over rows and columns corresponds to computations on

outlinks and inlinks

| Disadvantages:

z Lots of zeros for sparse matrices

z Lots of wasted space

Adjacency Lists

Take adjacency matrices… and throw away all the zeros

1 2 3 4

1 0 1 0 1 1: 2,, 4

2 1 0 1 1 2: 1, 3, 4

3 1 0 0 0 3: 1

4: 1, 3

4 1 0 1 0

Adjacency Lists: Critique

| Advantages:

z Much more compact representation

z Easy to compute over outlinks

| Disadvantages:

z Much more difficult to compute over inlinks

Single Source Shortest Path

| Problem: find shortest path from a source node to one or

more target nodes

z Shortest might also mean lowest weight or cost

| First, a refresher: Dijkstra’s Algorithm

Dijkstra’s Algorithm Example

1

∞ ∞

10

0 2 3 9 4 6

5 7

∞ ∞

2

**Example from CLR
**

Dijkstra’s Algorithm Example

1

10 ∞

10

0 2 3 9 4 6

5 7

5 ∞

2

**Example from CLR
**

Dijkstra’s Algorithm Example

1

8 14

10

0 2 3 9 4 6

5 7

5 7

2

**Example from CLR
**

Dijkstra’s Algorithm Example

1

8 13

10

0 2 3 9 4 6

5 7

5 7

2

**Example from CLR
**

Dijkstra’s Algorithm Example

1

1

8 9

10

0 2 3 9 4 6

5 7

5 7

2

**Example from CLR
**

Dijkstra’s Algorithm Example

1

8 9

10

0 2 3 9 4 6

5 7

5 7

2

**Example from CLR
**

Single Source Shortest Path

| Problem: find shortest path from a source node to one or

more target nodes

z Shortest might also mean lowest weight or cost

| Single processor machine: Dijkstra’s Algorithm

| MapReduce: parallel Breadth-First Search (BFS)

Finding the Shortest Path

| Consider simple case of equal edge weights

| Solution

So ut o to tthe

epproblem

ob e ca

can be de

defined

ed inductively

duct e y

| Here’s the intuition:

z Define: b is reachable from a if b is on adjacency list of a

z DISTANCETO(s) = 0

z For all nodes p reachable from s,

DISTANCETO(p) = 1

z For all nodes n reachable from some other set of nodes M,

DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M)

d1 m1

…

d2

s … n

m2

… d3

m3

Source: Wikipedia (Wave)

Visualizing Parallel BFS

n7

n0 n1

n3 n2

n6

n5

n4

n8

n9

From Intuition to Algorithm

| Data representation:

z Key: node n

z Value: d (distance from start), adjacency list (list of nodes

reachable from n)

z Initialization: for all nodes except for start node, d = ∞

| Mapper:

z ∀m ∈ adjacency list: emit (m, d + 1)

| Sort/Shuffle

z Groups distances by reachable nodes

| Reducer:

z Selects minimum distance path for each reachable node

z Additional bookkeeping needed to keep track of actual path

Multiple Iterations Needed

| Each MapReduce iteration advances the “known frontier”

by one hop

z Subsequent iterations include more and more reachable nodes as

frontier expands

z Multiple iterations are needed to explore entire graph

| Preserving graph structure:

z Problem: Where did the adjacency list go?

z Solution: mapper emits (n, adjacency list) as well

BFS Pseudo-Code

Stopping Criterion

| How many iterations are needed in parallel BFS (equal

edge weight case)?

| Convince yourself: when a node is first “discovered”,

we’ve found the shortest path

| Now answer the question...

z Six degrees of separation?

| Practicalities of implementation in MapReduce

Comparison to Dijkstra

| Dijkstra’s algorithm is more efficient

z At any step it only pursues edges from the minimum-cost path

inside the frontier

| MapReduce explores all paths in parallel

z Lots of “waste”

z Useful work is only done at the “frontier”

| Whyy can’t we do better using

g MapReduce?

p

Weighted Edges

| Now add positive weights to the edges

z Why can’t edge weights be negative?

| Simple change: adjacency list now includes a weight w for

each edge

z In mapper, emit (m, d + wp) instead of (m, d + 1) for each node m

| That’s it?

Stopping Criterion

| How many iterations are needed in parallel BFS (positive

edge weight case)?

| Convince yourself: when a node is first “discovered”,

we’ve found the shortest path

Additional Complexities

1

search frontier 1

n6 n7 1

n8

r 10

1 n9

n5

n1

s 1 1

q

p 1 n4

n2 1

n3

Stopping Criterion

| How many iterations are needed in parallel BFS (positive

edge weight case)?

| Practicalities of implementation in MapReduce

Graphs and MapReduce

| Graph algorithms typically involve:

z Performing computations at each node: based on node features,

edge features, and local link structure

z Propagating computations: “traversing” the graph

| Generic recipe:

z Represent graphs as adjacency lists

z Perform local computations in mapper

z Pass along partial results via outlinks, keyed by destination node

z Perform aggregation in reducer on inlinks to a node

z Iterate

te ate until

u t convergence:

co e ge ce controlled

co t o ed by e external

te a “driver”

d e

z Don’t forget to pass the graph structure between iterations

Random Walks Over the Web

| Random surfer model:

z User starts at a random Web page

z User randomly clicks on links, surfing from page to page

| PageRank

z Characterizes the amount of time spent on any given page

z Mathematically, a probability distribution over pages

| PageRank captures notions of page importance

z Correspondence to human intuition?

z One of thousands of features used in web search

z Note: query-independent

PageRank: Defined

Given page x with inlinks t1…tn, where

z C(t) is the out-degree of t

z α is probability of random jump

z N is the total number of nodes in the graph

⎛1⎞ n

PR(ti )

PR( x) = α ⎜ ⎟ + (1 − α )∑

⎝N⎠ i =1 C (ti )

t1

X

t2

…

tn

Computing PageRank

| Properties of PageRank

z Can be computed iteratively

z Effects at each iteration are local

| Sketch of algorithm:

z Start with seed PRi values

z Each page distributes PRi “credit” to all pages it links to

z Each target page adds up “credit” from multiple in-bound links to

compute PRi+1

z Iterate until values converge

Simplified PageRank

| First, tackle the simple case:

z No random jump factor

z No dangling links

| Then, factor in these complexities…

z Why do we need the random jump?

z Where do dangling links come from?

Sample PageRank Iteration (1)

Iteration 1 n2 (0.2) n2 (0.166)

01

0.1

n1 (0.2) 0.1 0.1 n1 (0.066)

0.1

0.066

0.066 0.066

n5 (0.2) n5 (0.3)

n3 (0.2) n3 (0.166)

0.2 0.2

n4 (0.2) n4 (0.3)

Sample PageRank Iteration (2)

Iteration 2 n2 (0.166) n2 (0.133)

n1 (0.066) 0.033

0 033 0 083

0.083

0.083 n1 (0.1)

0.033

0.1

0.1 0.1

n5 (0.3) n5 (0.383)

n3 (0.166) n3 (0.183)

0.3 0.166

n4 (0.3) n4 (0.2)

PageRank in MapReduce

n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5] n5 [n1, n2, n3]

Map

n2 n4 n3 n5 n4 n5 n1 n2 n3

n1 n2 n2 n3 n3 n4 n4 n5 n5

Reduce

n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5] n5 [n1, n2, n3]

PageRank Pseudo-Code

Complete PageRank

| Two additional complexities

z What is the proper treatment of dangling nodes?

z How do we factor in the random jump factor?

| Solution:

z Second pass to redistribute “missing PageRank mass” and

account for random jumps

⎛ 1 ⎞ ⎛m ⎞

p' = α ⎜⎜ ⎟⎟ + (1 − α )⎜⎜ + p ⎟⎟

⎝G⎠ ⎝G ⎠

z p is

s PageRank

age a value

a ue from

o be

before,

o e, p

p' is

s updated PageRank

age a value

a ue

z |G| is the number of nodes in the graph

z m is the missing PageRank mass

PageRank Convergence

| Alternative convergence criteria

z Iterate until PageRank values don’t change

z Iterate until PageRank rankings don’t change

z Fixed number of iterations

| Convergence for web graphs?

Beyond PageRank

| Link structure is important for web search

z PageRank is one of many link-based features: HITS, SALSA, etc.

z One of many thousands of features used in ranking…

| Adversarial nature of web search

z Link spamming

z Spider traps

z Keyword stuffing

z …

Efficient Graph Algorithms

| Sparse vs. dense graphs

| Graph

G ap topo

topologies

og es

Figure from: Newman, M. E. J. (2005) “Power laws, Pareto

distributions and Zipf's law.” Contemporary Physics 46:323–351.

Local Aggregation

| Use combiners!

z In-mapper combining design pattern also applicable

| Maximize opportunities for local aggregation

z Simple tricks: sorting the dataset in specific ways

Questions?

Source: Wikipedia (Japanese rock garden)

- ('Christos Papadimitriou', 'Midterm 1', '(Solution)') Fall 2009Uploaded byJohn Smith
- lec13-dp2.pdfUploaded byAhmed Said
- A COMPARATIVE STUDY ON MULTICAST ROUTING USING DIJKSTRA’S, PRIMS AND ANT COLONY SYSTEMSUploaded byIAEME Publication
- Dijk Stra Algorithm DescriptionUploaded bySatya Raghav
- QoS routingUploaded byLamija Nukic
- Management of Communication Failures in Formation FlightUploaded byYoga Yulasmana
- Billey_Discrete_Mathematical_Modeling(math381_lecture_notes_Winter_2011_U_Washington).pdfUploaded byRicardo Marquez Hdez
- Graphs Implementation TipsUploaded byAnkush Babbar
- Graph Kernels.pdfUploaded byMaz Har Ul
- Asa 2011 GraphsUploaded byjhdmss
- Bayesian Reasoning and Machine LearningUploaded byWaresAnn

- nullUploaded byapi-25884893
- nullUploaded byapi-25884893
- nullUploaded byapi-25884893
- Http://Github.com/Stuarthalloway/Clojure PresentationsUploaded byapi-25884893
- nullUploaded byapi-25884893
- nullUploaded byapi-25884893
- Clojure’s Four ElevatorsUploaded byapi-25884893
- nullUploaded byapi-25884893
- nullUploaded byapi-25884893
- nullUploaded byapi-25884893
- nullUploaded byapi-25884893
- nullUploaded byapi-25884893
- nullUploaded byapi-25884893
- nullUploaded byapi-25884893
- For Rubyists Http://Github.com/Stuarthalloway/Clojure-presentationsUploaded byapi-25884893
- The $25,000,000,000∗ Eigenvector the Linear Algebra BehindUploaded byapi-25884893
- Data-Intensive Information Processing Applications ― Session #4Uploaded byapi-25884893
- Gi From the Bottom UpUploaded byapi-25884893
- Programming Paradigms for Dummies: What Every ProgrammerUploaded byapi-25884893
- Data-Intensive Text Processing With MapReduceUploaded byapi-25884893
- Http://Github.com/Stuarthalloway/Clojure PresentationsUploaded byapi-25884893
- The Anatomy of a Large-Scale Social SearchUploaded byapi-25884893
- Quick ReferenceUploaded byapi-25884893
- Purely Functional Data StructuresUploaded byapi-25884893
- The Python & the ElephantUploaded byapi-25884893
- Tuesday: Lambda Calculus & Combinatory LogicUploaded byapi-25884893
- Select Dim1, Dim2, Sum(Measure1) as Msum, Count(*)Uploaded byapi-25884893
- Tutorial: Metacompilers Part 1Uploaded byapi-25884893
- High Performance Scalable Data Stores Rick CattellUploaded byapi-25884893

- HSEUploaded byMohamad Zackuan
- POLITICAS_CEAMEGUploaded byLuis Arturo Andrade Coronado
- Memorias de Un AbogadoUploaded byFernandaSosa
- San 14109Uploaded byCesar Leon
- 88p8968Uploaded byNorbert Török
- 19th EnglishUploaded byShashi Prakash
- NANOGEOSCIENCE: An IntroductionUploaded byDr.Thrivikramji.K.P.
- Avaliação 7º Ano AUploaded byJosiane Campos
- Libro Control DigitalUploaded byCamilo Andres Restrepo
- Jane Eyre and PsychologyUploaded byChristine Lee
- Normas Em Controle de Infecções Hospitalares 1Uploaded byErosGuerra
- UeberWahrheitundLüge_Konstruktivismus.pdfUploaded byLisa
- Stella. Stimulation and ModellingUploaded bysyafira razab
- math rubric for mrUploaded byapi-432567134
- Research Methodology 3.pdfUploaded bydemmahom68
- Los 5 Principios de BACHUploaded byconi
- SullivanUploaded bypsicosur
- Le Chirk, C'est la plus immense injusticeUploaded byabouno3man
- En WordUploaded byVíctor Andrés Parra Gómez
- Owners Manual RichUploaded bydchosgo2639
- Guía 2-módulo 1-220144Uploaded byClaudio Sierpe
- A Different Shape for Curriculum; C Dwyer 2012Uploaded byC Dwyer
- $R5JHZ1AUploaded byscribdraschid2016
- S04 VALUES 10082016Uploaded byBi Bin
- Columnas de La Tierra 1Uploaded byalberto
- Empresa é responsável por ato de preposto que emite duplicata fraudulenta - Jus NavigandiUploaded byOtto Brasil
- Bread ScienceUploaded byRadwin Nurlatif
- The Bank Intermediary Function and ed Loans PhenomenonUploaded bynjprastowo
- 1.- La Ética Del TrabajoUploaded byvalentinarosario
- Analisis de SensorUploaded byJhofre Sanchez