
How Spark Works on a Cluster

M.Sc Minh Hoang Vu


Thang Long University
Big Data
https://tinyurl.com/W4BigDataHoangVu

Portions of this lecture have been contributed to the OpenDS4All project,


piloted by Penn, IBM, and the Linux Foundation
From SQL to a Spark Query Plan
yelp_business_sdf = spark.read.format("csv").option("header",
    "true").load("yelp_business.csv")
# Register the dataframe as a SQL view so spark.sql can refer to yelp_business
yelp_business_sdf.createOrReplaceTempView('yelp_business')

avg_reviews_by_city_sdf = spark.sql(\
    'select city, avg(stars) as avg_rating '\
    'from yelp_business yb '\
    'group by city')
avg_reviews_by_city_sdf.explain()
[Diagram: File Scan (yelp_business.csv) → Hash Pre-Aggregate (city) → Exchange → Hash Avg Aggregate (city)]

*(2) HashAggregate(keys=[city#21], functions=[avg(cast(stars#25 as double))])
+- Exchange hashpartitioning(city#21, 200), true, [id=#519]
   +- *(1) HashAggregate(keys=[city#21], functions=[partial_avg(cast(stars#25 as double))])
      +- FileScan csv [city#21,stars#25] Batched: false, DataFilters: [], Format: CSV,
         Location: InMemoryFileIndex[file:/content/yelp_business.csv],
         PartitionFilters: [], PushedFilters: [], ReadSchema: struct<city:string,stars:string>

2
Colab Spark Execution
[Diagram: the Colab notebook (Python + Livy) submits the query to the Spark manager (coordinator), which runs the plan – File Scan of yelp_business.csv → Hash Pre-Aggregate (city) → Exchange → Hash Avg Aggregate (city) – across several workers, each holding its own partitions and results.]
3
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')

yelp_business (sharded by ID)
id       name                    city        server
FYNWN1   Dental by Design        Ahwatukee   0
BADF     My Wine Cellar          Ahwatukee   1
KQPW8    Western Motor Vehicles  Phoenix     0
8DShNS   Sports Authority        Tempe       1

4
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')

yelp_business (sharded by ID)
id       name                    city        server
FYNWN1   Dental by Design        Ahwatukee   0
BADF     My Wine Cellar          Ahwatukee   1
KQPW8    Western Motor Vehicles  Phoenix     0
8DShNS   Sports Authority        Tempe       1

To join on city, we create two copies of the table, each sharded by city.

5
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')

yelp_business (sharded by ID)
id       name                    city        server
FYNWN1   Dental by Design        Ahwatukee   0
BADF     My Wine Cellar          Ahwatukee   1
KQPW8    Western Motor Vehicles  Phoenix     0
8DShNS   Sports Authority        Tempe       1

        |  Exchange / repartition / shuffle
        v

yelp_business (sharded by city)
id       name                    city        server
FYNWN1   Dental by Design        Ahwatukee   0
BADF     My Wine Cellar          Ahwatukee   0
KQPW8    Western Motor Vehicles  Phoenix     1
8DShNS   Sports Authority        Tempe       1
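To see this shuffle in an actual plan, we can ask Spark to explain the join (assuming same_city_sdf was built as above, with yelp_business registered as a view):

same_city_sdf.explain()
# Look for Exchange hashpartitioning(city, ...) under the join inputs –
# that is the repartition/shuffle step illustrated above.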

6
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')

yelp_business (b1), sharded by city
id       name                    city        server
FYNWN1   Dental by Design        Ahwatukee   0
BADF     My Wine Cellar          Ahwatukee   0
KQPW8    Western Motor Vehicles  Phoenix     1
8DShNS   Sports Authority        Tempe       1

yelp_business (b2), sharded by city – an identical copy of b1

Join output (each server joins its local partitions):
name               name
My Wine Cellar     Dental by Design
Dental by Design   My Wine Cellar

7
Variation: (Left) Outer Join
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 left join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')

yelp_business (b1), sharded by city
id       name                    city        server
FYNWN1   Dental by Design        Ahwatukee   0
BADF     My Wine Cellar          Ahwatukee   0
KQPW8    Western Motor Vehicles  Phoenix     1
8DShNS   Sports Authority        Tempe       1

yelp_business (b2), sharded by city – an identical copy of b1

Left outer join output (unmatched b1 rows are kept, padded with NULL):
name                     name
My Wine Cellar           Dental by Design
Dental by Design         My Wine Cellar
Western Motor Vehicles   NULL
Sports Authority         NULL

8
Minimizing Shuffle/Exchange Steps

• Every time we do a join or a group-by, we need the data


to be sharded on the key
• If it isn’t, we need to do an exchange or repartition!
• A good strategy: amortize the repartitions across
multiple operations if possible!
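As a sketch of that strategy (using the Yelp dataframe from earlier; whether the planner actually skips the extra exchanges depends on partition counts and the Spark version, so it is worth checking with explain()):

from pyspark.sql import functions as F

# Repartition once on the key that several operations share, and persist it,
# so the group-by and the self-join below can reuse the same partitioning
# rather than each triggering its own Exchange.
by_city_sdf = yelp_business_sdf.repartition("city").persist()

avg_by_city_sdf = by_city_sdf.groupBy("city").agg(
    F.avg(F.col("stars").cast("double")).alias("avg_rating"))

pairs_sdf = by_city_sdf.alias("b1").join(by_city_sdf.alias("b2"), on="city") \
    .filter(F.col("b1.name") != F.col("b2.name"))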

9
Catalyst: Spark’s Query Optimizer
Generates the Plans
• Spark’s query optimizer seeks to:
• Estimate how big the input sources are
• Estimate how many results will be produced in each
filter, join, groupby – compare different orderings of
operations

• Find the strategy that minimizes the overall cost,


including repartitions and join costs
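One way to peek at what Catalyst decided (assuming Spark 3.x, where explain() accepts a mode argument):

# "formatted" lists the chosen physical operators and their details;
# "cost" adds the optimizer's size and row-count estimates where available.
avg_reviews_by_city_sdf.explain("formatted")
avg_reviews_by_city_sdf.explain("cost")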

10
Spark Handles Failures!

What happens if one of our worker nodes dies?

Spark re-reads its input data using the other nodes,


and re-executes the missing part of the query!

11
Brief Review

When Spark runs on a cluster, it creates and executes a Spark query plan when
a. we execute a cell with a Pandas operation
b. we execute a cell that invokes an action like show() or save()
c. we execute a cell with a dataframe operation like a join
d. we execute a cell with an SQL query
Given two dataframes students(id,name) and enrolled(course_id,student_id), if we
execute a query to join on the student IDs, Spark must:
a. ensure students is sharded by ID and enrolled is sharded by student_id, or add exchange
operators as needed
b. perform a hash join within each of the worker nodes, without adding any exchange operators
c. ensure students is sharded by ID and enrolled is sharded by course_id, or add exchange
operators as needed
d. sort the enrolled dataframe by course_id

12
Recap

Apache Spark queries are lazy to maximize what can be optimized


Upon an action like show(), the queries are combined
and a plan is generated – which minimizes cost

Group-by and join require the data to be sharded on the key – may
need to exchange or reshuffle or repartition data
If a worker fails in execution, its work is re-executed

Spark’s Catalyst query optimizer seeks to find the minimum-cost


plan, but occasionally you may need to manually override it

13
Storing Data on the Cloud

M.Sc Minh Hoang Vu


Thang Long University
Big Data

Portions of this lecture have been contributed to the OpenDS4All project,


piloted by Penn, IBM, and the Linux Foundation
Where Do We Put Our Big Data?

• A cloud file system?


• A cloud NoSQL system?
• A cloud relational DBMS?

15
Key questions

How complex and large is the data and its content?
  e.g., videos, images; JSON; large CSVs
How will I query my data?
  e.g., by pathname, by properties, by features
Do I need transactions?

16
S3 (or GCS) for Storing Large Objects

• Amazon S3 supports “buckets” – virtual disk volumes


• Can use “s3a://bucketname/filename” to specify an S3 file
• For dataframes: df.write.parquet(), sqlContext.read.parquet()
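For example (a sketch with a hypothetical bucket name; assumes the cluster has the S3 connector and AWS credentials configured):

# Write the dataframe out as Parquet on S3, then read it back.
yelp_business_sdf.write.parquet("s3a://my-bucket/yelp_business.parquet")
reloaded_sdf = spark.read.parquet("s3a://my-bucket/yelp_business.parquet")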

17
DynamoDB (or BigTable) for Small Object Lookup

• If our objects form a map from keys to hierarchical values,
  DynamoDB is a good choice
• Values may be JSON data, dictionaries (max 4KB / field)
• Queries largely limited to lookups by key
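A minimal sketch of key-based access with the boto3 client (the table name "businesses", its key "id", and the region are hypothetical):

import boto3

# Connect to a DynamoDB table and do key-based put/get operations.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("businesses")  # hypothetical table keyed on "id"

table.put_item(Item={"id": "FYNWN1", "name": "Dental by Design",
                     "city": "Ahwatukee"})
item = table.get_item(Key={"id": "FYNWN1"}).get("Item")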
18
RDBMSs for Queriable Objects

• Relational DBMSs are best if we want:


• Complex queries that return subsets of data to Spark
• Atomic updates across tables, in transactions
• Interoperability with the most tools

• Amazon RDS lets us launch PostgreSQL, Oracle,


MariaDB, …
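For instance, reading a table from an RDS PostgreSQL instance into Spark over JDBC (the endpoint, database, and credentials below are placeholders; the PostgreSQL JDBC driver must be on the Spark classpath):

# Load a relational table as a Spark dataframe via JDBC.
businesses_sdf = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://<rds-endpoint>:5432/yelp") \
    .option("dbtable", "yelp_business") \
    .option("user", "analyst") \
    .option("password", "<password>") \
    .option("driver", "org.postgresql.Driver") \
    .load()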

19
Brief Review

If we have tabular data that we are retrieving solely by an ID, our best choice(s) for
storage are likely to be:
a. DynamoDB or RDS
b. neither DynamoDB nor RDS
c. DynamoDB only
d. RDS only
If we have satellite photos, we are likely to want to store these on:
a. RDS
b. our laptop
c. DynamoDB
d. S3

20
Recap

• Our focus in this class: processing big data


• But there are multiple places we can save it:
• “Large object stores” like S3 – videos, images,
large CSVs, large parquet files
• NoSQL stores like DynamoDB – JSON, simple objects
• RDBMSs like RDS – tabular data that we’ll query

21
Materialization of Query Results

M.Sc Minh Hoang Vu


Thang Long University
Big Data
Portions of this lecture have been contributed to the OpenDS4All project,
piloted by Penn, IBM, and the Linux Foundation
When We Have Big Data,
We May Need to Make Storage Decisions

We’ve seen data with embedded hierarchy


• LinkedIn people included lists of education or job
experiences
• Key idea: split these into subtables, explode the lists
• There’s a goal of storing data without redundancy
But: Sometimes portions of data overlap, e.g., both parent
and subclasses have some common instances

23
An Example of Instances and Subclasses
[E-R diagram: Person, with subclasses Student (attribute school) and Worker (attribute employer) related by ISA]

Person                 Student              Worker
id    name             id    school         id    employer
123   Ai               456   Penn           789   Lutron
456   Jay              789   MIT
789   Kaye

24
Materialization

Ideally, our original data is stored without redundancy –


this makes it easier to maintain

But as we generate analysis results, we may want to


strategically store redundant info! “View materialization”!

Let’s apply to people, students, and workers…

25
Student and Worker are Naturally Views

CREATE VIEW WorkerPerson(id, name, employer) AS
  SELECT * FROM Person NATURAL JOIN Worker

WorkerPerson
id    name    employer
789   Kaye    Lutron

CREATE VIEW StudentPerson(id, name, school) AS
  SELECT * FROM Person NATURAL JOIN Student

StudentPerson
id    name    school
456   Jay     Penn
789   Kaye    MIT

but views are simply named queries treated as tables…

26
An Example of Instances and Subclasses
with Redundancy!

[Same E-R diagram as before, now with the StudentPerson and WorkerPerson views materialized as tables]

Person                 Student              Worker
id    name             id    school         id    employer
123   Ai               456   Penn           789   Lutron
456   Jay              789   MIT
789   Kaye

StudentPerson                    WorkerPerson
id    name    school             id    name    employer
456   Jay     Penn               789   Kaye    Lutron
789   Kaye    MIT

27
More Generally…

• In Spark, we can take any Dataframe and persist it…


same_city_sdf = spark.sql('select b1.name, b2.name as name2 '\
    'from yelp_business b1 join yelp_business b2 '\
    ' on b1.city = b2.city and b1.name <> b2.name')
same_city_sdf.persist()

• Now any time we reference same_city_sdf it will use the


stored version!

28
Other Uses for Materialization

• Commonly used subqueries

• Generated reports or hierarchical data

• Recursive computations (we’ll see these over graphs)

29
Brief Review

If we use inheritance in an E-R diagram, the tables are naturally partitioned such
that
a. we only store instances in the child tables
b. instances show up in parent and child tables, but columns other than ID are split
c. the same columns show up in parent and child tables, but instances are split
d. we repeat both instances and all columns in parent and child tables
View materialization is accomplished by
a. calling materialize() on a dataframe
b. creating a view in SparkSQL
c. saving the input CSV
d. calling persist() on a dataframe

30
Recap

• View materialization sacrifices storage (and cost of


updating) for query performance
• Very commonly used in big data scenarios

• Can be done by saving a result directly, or by


DataFrame.persist()

31
Module Wrap-up

• As we scale to bigger and more complex data, need to


harness compute clusters
• Spark runs across multiple workers, shuffles data as
necessary for joins and grouping
• Query optimizer seeks to minimize these costs
• We have a series of options for storing our data
• Sometimes it’s useful to trade off space for query
performance

32
More Complex Relationships

• Most of our discussion has been about “direct” relationships


• Student TAKES a class
• a student ISA person

• In the real world, lots of transitive relationships!


• Real and digital social networks, the Internet, road networks,
supplier networks, …
• Leads to Part 3: graphs!

33
Graphs and Big Data

M.Sc Minh Hoang Vu


Thang Long University
Big Data
Portions of this lecture have been contributed to the OpenDS4All project,
piloted by Penn, IBM, and the Linux Foundation
Networks (Graphs)
are Everywhere!
• Transportation
• Economics
• Society / Friendships and
Interest groups
• Information sources
• Biology
• Computing
• ...

[Figure by Bruce Hoppe, Creative Commons Licensed]

May be implicit (we compute links) or explicit (we can observe links)
For our running example: we’ll look at the LinkedIn connection network
Refresher: Graph Theory Basics

[Figure: two nodes n1 and n2 connected by a labeled edge]

Graph G = (V, E)
  V is a set of vertices or nodes, possibly with properties
  E is a set of tuples of vertices, called edges, and may have labels or other data

V(node, label, prop1)      e.g., (n1, “bob”, 20)
E(source, label, target)   e.g., (n1, ‘friend_of’, n2)

36
Some Terminology
u,v are adjacent if there’s an edge between u and v
degree (u) = # adjacent vertices
• indegree or outdegree

path = sequence of adjacent vertices


u,v are connected if path between u and v
• Connected component: Set of vertices connected to each
other, that is not part of a larger connected set.
• Triangle: 3 vertices that are pairwise adjacent.
• Clique: Any set of vertices that are all pairwise adjacent.
Encoding Graphs as Data Structures
Graph G: nodes a, b, c, d with (undirected) edges a–c, b–c, b–d, c–d

Adjacency list L(G)           Adjacency matrix A(G)
a → c                             a  b  c  d
b → c, d                      a   0  0  1  0
c → b, a, d                   b   0  0  1  1
d → c, b                      c   1  1  0  1
                              d   0  1  1  0

Edge list – we’ll focus on this representation initially, as an edge dataframe:
(a,c) (b,c) (b,d) (c,a) (c,b) (c,d) (d,b) (d,c)

We’ll see the adjacency matrix subsequently.
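As a quick sketch, the same graph G could be loaded into Spark as an edge dataframe (the column names from/to are chosen to match the later examples):

# The edge list of G, with each undirected edge stored in both directions.
edges = [('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'a'),
         ('c', 'b'), ('c', 'd'), ('d', 'b'), ('d', 'c')]
edges_df = spark.createDataFrame(edges, ['from', 'to'])
edges_df.show()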
Brief Review

The most natural representation of a graph (ignoring unconnected nodes) in


dataframes is as
a. an edge list
b. an adjacency list
c. an adjacency matrix
d. a hierarchical structure
The notion of a clique generalizes which concept?
a. indegree
b. graph
c. triangle
d. connected component

39
Road Map

• Simple graph analysis, centrality, and “betweenness”


• Graph exploration
• BFS in Spark
• Applications of BFS

40
A Brief Intro to Graph Analysis,
aka Network Science
M.Sc Minh Hoang Vu
Thang Long University
Big Data

Portions of this lecture have been contributed to the OpenDS4All project,


piloted by Penn, IBM, and the Linux Foundation
A Simple Social (or Informational)
Network as a Graph

Who is most influential?

What are the natural


groupings?

Where should we suggest


new links?

42
“Network Centrality”

The general question:


How do we measure the important nodes in a graph?
aka the “central” nodes

Several methods proposed in network science literature:


• Degree centrality: for a node, how many other nodes is it connected to
• Betweenness centrality: for a node, how many shortest paths go through the
node
• Eigenvector centrality: very similar to PageRank, which we’ll discuss shortly

43
Degree Centrality

• For each node, compute its degree, i.e.,


the number of edges it connects to

• In a directed graph suppose we want


outdegree centrality, i.e., the number of
edges coming out from each node n?

• How to write in Spark, given a relation edges(from, to)?

44
LinkedIn Example
Given a list edges that looks like: [[0,5], [5,10], …]
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("from", IntegerType(), True),
    StructField("to", IntegerType(), True)
])

# Load the edge list (a list of [from, to] pairs) into a dataframe
edges_df = sqlContext.createDataFrame(edges, schema)

edges_df.createOrReplaceTempView('edges')
# Backticks quote the column names, since from and to are SQL keywords
sqlContext.sql('select `from` as id, count(`to`) as degree '
               'from edges group by `from`')

45
Beyond Degree Centrality

• Degree centrality is moderately useful


• citation counts in academia
• number of followers in Twitter
• number of commits in GitHub

• But we may want to look at relationships to more distant


nodes

46
Betweenness Centrality

• Some people (or entities) are important “connectors” – they bridge
  natural clusters
• Another measure: how many shortest paths go through a given node?
• To compute, for node n:

      # shortest paths between every pair (A,B) that go through n
      ------------------------------------------------------------
           # shortest paths between every pair (A,B)

  • Find all shortest paths
  • Count how many include node n
• Often excludes n as an endpoint on the path

47
Brief Review

Which type of centrality is reliant on computing shortest paths?


a. betweenness centrality
b. PageRank centrality
c. degree centrality
d. eigenvector centrality
When someone is proud of their number of retweets in Twitter,
this is an instance of
a. degree centrality
b. like centrality
c. eigenvector centrality
d. betweenness centrality

48
Recap

Network centrality seeks to identify the influence


(“centrality”) of a node
• Simplest measures are based on direct connectivity
• Most measures take into account the broader graph and its
paths!

So how do we explore paths?

49
Graph Exploration

M.Sc Minh Hoang Vu


Thang Long University
Big Data

Portions of this lecture have been contributed to the OpenDS4All project,


piloted by Penn, IBM, and the Linux Foundation
Exploring a Graph

• Commonly, we will want to start at some node and look at how it relates to
other nodes in the graph
• How far away is X from Y?
• How many nodes are within distance k?
• What are the odds I can start at X and end up at Y?

• (Some of these are the basis of ranking + recommendations)

• So how can we do this? Let’s start with a single machine…

51
Computing Distance in a Graph
How far apart are two nodes?
Distance between two nodes =
number of edges on the shortest path between them.

Breadth-First Search: Algorithm “pattern” for exploring


at successively greater distances

Needs to remember two things:


• What you have already visited (don’t want to backtrack)
• What places you’ve learned about but haven’t visited
Breadth-First Search (BFS)
for Undirected or Directed Graphs

[Diagram: visited vertices, frontier vertices (held in a queue), and unexplored vertices]

BFS - Centralized
  Initialize a frontier queue with the origin node
  While the frontier queue has a vertex in it
    Pick a vertex v from the front of the queue
    Put each unexplored neighbor of v in the queue
Note: closer edges are always considered before more distant edges.

Efficiency: each edge is examined once (in each direction, for an undirected
graph) if the graph is given as an adjacency list. Only a small amount of work
is needed per edge, so the running time is proportional to the number of edges.

Let’s see it in Python…


[Small example graph: nodes 1–5]
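Here is a minimal sketch of the centralized algorithm in Python (the companion notebook’s version may differ; the small graph below is just for illustration):

from collections import deque

def bfs(adjacency, origin):
    """Return the hop distance from origin to every reachable node."""
    distance = {origin: 0}       # already-visited nodes and their distances
    frontier = deque([origin])   # queue of frontier vertices
    while frontier:
        v = frontier.popleft()
        for neighbor in adjacency.get(v, []):
            if neighbor not in distance:   # unexplored neighbor
                distance[neighbor] = distance[v] + 1
                frontier.append(neighbor)
    return distance

# Example adjacency list (illustrative only)
graph = {1: [2, 4], 2: [3], 4: [5]}
print(bfs(graph, 1))   # {1: 0, 2: 1, 4: 1, 3: 2, 5: 2}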

55
Brief Review

What data structures does breadth-first search employ?


a. queue of visited nodes only
b. queue of frontier nodes, set of visited nodes
c. queue of visited nodes, set of frontier nodes
d. set of frontier nodes only
How many times do we revisit a node in BFS?
a. we only visit each node once
b. we visit each node twice
c. we visit each node n times
d. we visit each node once for each edge

56
Breadth-First Search on One Computer

• Simple idea: the queue enforces exploration from the fewest hops outward,
  to successively greater distances
• We can focus on the frontier which has new nodes
• We can prune paths that revisit nodes we’ve already seen
• Requires access to a global queue, and is inherently
sequential – so we need a different approach…

57
Distributed Breadth First Search

M.Sc Minh Hoang Vu


Thang Long University
Big Data

Portions of this lecture have been contributed to the OpenDS4All project,


piloted by Penn, IBM, and the Linux Foundation
How Do We Distribute BFS?
See companion notebook linked on web

• Don’t want to traverse one node at a time


• Can’t order directly using a global queue…
• … And need to be careful about when we check for
“visited” status

59
Suppose Our Graph is in an Edge Relation

edges_df
+----+-------+
|from|     to|
+----+-------+
|   0|2152448|
|   0|1656491|
|   0| 399364|
|   0|  18448|
|   0|  72025|
|   1|  77832|
|   1| 542731|
|   1| 317452|
+----+-------+

Can we traverse from a subset of these nodes, via BFS, to more distant nodes?

Transitive closure query – what nodes can we reach transitively?

(Later: we could count path lengths)

60
A Sketch of the Approach
Start with our origin nodes: all nodes with ID < 3, obtained via edges_df.filter().
From edges_df we can directly get the one-hop neighbors: join(edges_df, l=to, r=from).
Then we can traverse from those nodes to the next level, and the next, by repeating the join.

[Diagram: three snapshots of a small example graph, showing the frontier expanding from the origin nodes to their one-hop and then two-hop neighbors]

61
In Code, We Join in Each Traversal
and Rename!
from pyspark.sql.functions import col

# Start with a subset of nodes
start_nodes_df = edges_df[['from']].filter(edges_df['from'] < 1000).\
    select(col('from').alias('id')).drop_duplicates()

start_nodes_df (sample):
+---+
| id|
+---+
|148|
|463|
|471|
|496|
|833|
+---+

neighbor_nodes_df = start_nodes_df.\
    join(edges_df, start_nodes_df.id == edges_df['from']).\
    select(col('to').alias('id')).drop_duplicates()

neighbor_nodes_df (sample):
+-------+
|     id|
+-------+
|1510404|
|    523|
| 993804|
| 469009|
| 232979|
+-------+
62
What Happens Under the Covers?
Suppose that edges_df is sharded by from and that we put
even numbers on worker 0 and odd numbers on worker 1
edges_df (sharded by from: even from on server 0, odd from on server 1)
+----+-------+
|from|     to|
+----+-------+
|   0|2152448|
|   0|1656491|
|   0| 399364|
|   0|  18448|
|   0|  72025|
|   1|  77832|
|   1| 542731|
|   1| 317452|
+----+-------+

start_nodes_df (sharded by from: even from on server 0, odd from on server 1)
+----+
|from|
+----+
|   0|
|   1|
|   2|
|   3|
|   4|
|   5|
|   6|
+----+

neighbor_nodes_df (sharded based on the from node – which isn’t even part of this table!)
+-------+
|     id|
+-------+
|2152448|
|1656491|
| 399364|
|  18448|
|  77832|
| 542731|
+-------+

63
Neighbor’s Neighbor?
neighbor_neighbor_nodes_df = neighbor_nodes_df.\
    join(edges_df, neighbor_nodes_df.id == edges_df['from']).\
    select(col('to').alias('id')).drop_duplicates()

edges_df (sharded: even from on server 0, odd from on server 1)
+----+-------+
|from|     to|
+----+-------+
|   0|2152448|
|   0|1656491|
|   0| 399364|
|   0|  18448|
|   0|  72025|
|   1|  77832|
|   1| 542731|
|   1| 317452|
+----+-------+

neighbor_nodes_df (sharded based on the from node, which isn’t part of the table!)
+-------+
|     id|
+-------+
|2152448|
|1656491|
| 399364|
|  18448|
|  72025|
|  77832|
| 542731|
| 317452|
+-------+

neighbor_nodes_df repartitioned by id (even id on server 0, odd id on server 1) –
now directly joinable with edges_df on id == edges_df.from
+-------+
|     id|
+-------+
|2152448|
|1656491|
| 399364|
|  18448|
|  72025|
|  77832|
| 542731|
| 317452|
+-------+

64
Can We Do an Iterative Join
and Track Hops?
Can we generalize from start 🡪 neighbor 🡪 neighbor’s
neighbor, in a loop?

• Base case: start with the direct edges, set distance to 1


• Iterative case:
• start with the existing set of nodes, add an edge to get to new
destinations
• project start and end (and increment distance) – use same
schema as base case
• remove duplicates!

65
Iterative Join
def iterate(df, depth):
    df.createOrReplaceTempView('iter')

    # Base case: direct connection (backticks quote from/to, which are SQL keywords)
    result = sqlContext.sql('select `from`, `to`, 1 as depth from iter')

    for i in range(1, depth):
        result.createOrReplaceTempView('result')
        result = sqlContext.sql(
            'select distinct r1.`from` as `from`, r2.`to` as `to`, '
            'r1.depth+1 as depth '
            'from result r1 join iter r2 '
            'on r1.`to` = r2.`from`')
    return result

iterate(edges_df.filter(edges_df['from'] < 1000), 1).orderBy('from','to').show()
iterate(edges_df.filter(edges_df['from'] < 1000), 2).orderBy('from','to').show()

With depth=1 (sample):        With depth=2 (sample):
+----+----+-----+             +----+----+-----+
|from|  to|depth|             |from|  to|depth|
+----+----+-----+             +----+----+-----+
|   0|  38|    1|             |   0|  59|    2|
|   0| 101|    1|             |   0|  66|    2|
|   0| 121|    1|             |   0| 101|    2|
|   0| 161|    1|             |   0| 121|    2|
|   0| 337|    1|             |   0| 121|    2|
|   0| 487|    1|             |   0| 161|    2|
|   0| 504|    1|             |   0| 236|    2|
|   0| 802|    1|             |   0| 236|    2|
…                             …

66
What We Can Do Better

• In the loop we remove duplicate paths in each iteration


• But given paths of different lengths from s to t, we
should remove the non-minimal ones!

• (Left as an exercise to the student!)
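One possible starting point for that exercise (just a sketch; all_paths_df is a hypothetical dataframe that accumulates, e.g. via union, the rows produced at every iteration):

# Keep only the minimum depth recorded for each (from, to) pair.
all_paths_df.createOrReplaceTempView('all_paths')
minimal_paths = sqlContext.sql(
    'select `from`, `to`, min(depth) as depth '
    'from all_paths group by `from`, `to`')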

67
Brief Review

In a Spark-based iterative approach to BFS, we traverse edges


a. directly via a join with edges_df
b. one hop at a time via a join with edges_df
c. by grouping
d. by choosing the minimum path
Every time we do a join in a distributed BFS, there is a good chance we need to
a. group the results
b. select a subset of the matches
c. do another join
d. repartition one of the dataframes

68
Recap

• To do breadth-first traversals in Spark, we can iterate in


“waves” from the origin(s)
• A join at each stage
• We may want to keep information about the distance, the path,
etc.
• We may want to prune all non-minimal paths

• Next: let’s see some places BFS is used!

69
Applications of
Breadth-First Search
M.Sc Minh Hoang Vu
Thang Long University
Big Data

Portions of this lecture have been contributed to the OpenDS4All project,


piloted by Penn, IBM, and the Linux Foundation
A Common Question
• How far away is V from A?
  • “Shortest path”

• Let’s assume that
  1. the graph is directed and may have cycles
  2. all edges have equal (“unit”) cost

Can BFS help?

[Figure: an example graph whose nodes are labeled A through W]

71
Adding Connections in
Social Networks
One’s friends tend to become each other’s friends.
This is called Triadic Closure.

Node A violates the Triadic Closure Property if it has two friends B and C who
are not each other’s friends.

We often look to complete triangles to recommend friends – prioritize by the
number of incomplete triangles.

How can we use BFS/Shortest Path here?

[Figure: a small example graph over nodes A, B, C, D]
A Sketch

• Run BFS from “us” to find friends (nodes at depth 1) and


friends of our friends (nodes with min depth 2)

• Run BFS from FoF to depth 1


• For each FoF n, count how many of our friends are in
common

• Rank each FoF n by how many friends we have in common
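A hedged sketch of this ranking in Spark SQL, assuming the edges view from earlier (treated as a friendship list) and a hypothetical starting user id me:

me = 0  # hypothetical starting user

# For each friend-of-friend, count the friends in common, excluding people who
# are already direct friends, and rank by that count.
fof_ranked = sqlContext.sql(f'''
    select e2.`to` as fof, count(distinct e1.`to`) as common_friends
    from edges e1 join edges e2 on e1.`to` = e2.`from`
    where e1.`from` = {me}
      and e2.`to` <> {me}
      and e2.`to` not in (select `to` from edges where `from` = {me})
    group by e2.`to`
    order by common_friends desc''')
fof_ranked.show()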


73
Other Common
Graph Algorithms

• Algorithms for min-cost trees, traversals (minimum


spanning tree, Steiner tree)
• Betweenness centrality

• As we’ll see shortly, other recursive definitions of


centrality – eigenvector centrality, PageRank, label
propagation, variations thereof…
74
Brief Review

An important use case of BFS is


a. bipartite matching
b. k-means clustering
c. shortest-path computation
d. graph coloring
Triadic closure involves...
a. selecting people who resemble each other
b. adding edges to complete the most triangles
c. creating k-clusters
d. ranking friends-of-friends by strength of friendship

75
Summary

• Joins are a way of starting with a set of nodes and


performing path traversals

• Multi-step joins achieve multi-step paths

• We can easily implement distributed BFS and use this to


solve other problems

76
