BigData - W4 - Big Data & Graph Data - HoangVu (Cont)
avg_reviews_by_city_sdf = spark.sql(\
'select city, avg(stars) as avg_rating '\
'from yelp_business yb '\
'group by city')
avg_reviews_by_city_sdf.explain()
*(2) HashAggregate(keys=[city#21], functions=[avg(cast(stars#25 as double))])
+- Exchange hashpartitioning(city#21, 200), true, [id=#519]
   +- *(1) HashAggregate(keys=[city#21], functions=[partial_avg(cast(stars#25 as double))])
      +- FileScan csv [city#21,stars#25] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/content/yelp_business.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<city:string,stars:string>

[Figure: the same plan as a pipeline: File Scan of yelp_business.csv, then Pre-Aggregate (partial), then Hash Exchange, then the final Hash Aggregate computing the average]
2
Colab Spark Execution
[Figure: a Manager node (the Coordinator, running Livy + Python) dispatches work to several Workers; each Worker holds its shard of yelp_business.csv and its partitions of the intermediate dataframes (ybs, ypa, res)]
NETS212 | 3
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
yelp_business (sharded by ID)
id       name                     city        stored on
FYNWN1   Dental by Design         Ahwatukee   Server 0
BADF     My Wine Cellar           Ahwatukee   Server 1
KQPW8    Western Motor Vehicles   Phoenix     Server 0
8DShNS   Sports Authority         Tempe       Server 1
4
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
yelp_business (sharded by ID)
id       name                     city        stored on
FYNWN1   Dental by Design         Ahwatukee   Server 0
BADF     My Wine Cellar           Ahwatukee   Server 1
KQPW8    Western Motor Vehicles   Phoenix     Server 0
8DShNS   Sports Authority         Tempe       Server 1

To join on city, create two copies of the table, each re-sharded by city.
5
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
yelp_business, before (sharded by ID):
id       name                     city        stored on
FYNWN1   Dental by Design         Ahwatukee   Server 0
BADF     My Wine Cellar           Ahwatukee   Server 1
KQPW8    Western Motor Vehicles   Phoenix     Server 0
8DShNS   Sports Authority         Tempe       Server 1

yelp_business, after re-sharding by city:
id       name                     city        stored on
FYNWN1   Dental by Design         Ahwatukee   Server 0
BADF     My Wine Cellar           Ahwatukee   Server 0
KQPW8    Western Motor Vehicles   Phoenix     Server 1
8DShNS   Sports Authority         Tempe       Server 1
6
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
name name
My Wine Cellar Dental by Design
Dental by Design My Wine Cellar
7
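The reshard-then-join strategy on the preceding slides can be simulated in a few lines of plain Python (an illustrative sketch with hypothetical helper names, not Spark itself): rows are routed to a "server" by hashing the city, and the self-join then runs locally within each server.

```python
# Plain-Python sketch of the distributed self-join above (not Spark).
# Each row is (id, name, city); rows start out sharded by id.
rows = [
    ("FYNWN1", "Dental by Design", "Ahwatukee"),
    ("BADF",   "My Wine Cellar",   "Ahwatukee"),
    ("KQPW8",  "Western Motor Vehicles", "Phoenix"),
    ("8DShNS", "Sports Authority", "Tempe"),
]

NUM_SERVERS = 2

def reshard_by_city(rows):
    """Exchange step: route every row to server hash(city) % NUM_SERVERS."""
    shards = [[] for _ in range(NUM_SERVERS)]
    for row in rows:
        shards[hash(row[2]) % NUM_SERVERS].append(row)
    return shards

def local_self_join(shard):
    """Within one server, pair up distinct businesses in the same city."""
    return [(a[1], b[1]) for a in shard for b in shard
            if a[2] == b[2] and a[1] != b[1]]

shards = reshard_by_city(rows)
result = [pair for shard in shards for pair in local_self_join(shard)]
# result contains ('My Wine Cellar', 'Dental by Design') and the reverse pair
```

Because all rows with the same city hash to the same server, the local joins together produce exactly the same answer as a global join.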
Variation: (Left) Outerjoin
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 left join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
With the left outer join, b1 rows that have no same-city partner still appear, padded with NULL:

name               name
My Wine Cellar     Dental by Design
Dental by Design   My Wine Cellar
Western Motor…     NULL
Sports Authority   NULL
8
Minimizing Shuffle/Exchange Steps
9
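One way to see why minimizing exchanges matters: if the data is already sharded on the grouping or join key, the exchange step has nothing to move. The sketch below (plain Python with hypothetical helpers, not Spark's actual machinery) counts how many rows would have to travel to a different worker.

```python
# Illustrative sketch: a shuffle moves every row whose key hashes to a
# different worker. Data already partitioned on the key moves nothing.
NUM_WORKERS = 2

def partition(rows, key_fn):
    """Assign each row to worker hash(key) % NUM_WORKERS."""
    shards = [[] for _ in range(NUM_WORKERS)]
    for row in rows:
        shards[hash(key_fn(row)) % NUM_WORKERS].append(row)
    return shards

def rows_moved_by_exchange(shards, key_fn):
    """How many rows must travel to a different worker to group by key?"""
    moved = 0
    for worker, shard in enumerate(shards):
        for row in shard:
            if hash(key_fn(row)) % NUM_WORKERS != worker:
                moved += 1
    return moved

reviews = [("Phoenix", 4.0), ("Tempe", 3.5), ("Phoenix", 5.0), ("Ahwatukee", 4.5)]

# Already sharded on the group key: the exchange is a no-op.
by_city = partition(reviews, key_fn=lambda r: r[0])
assert rows_moved_by_exchange(by_city, lambda r: r[0]) == 0
```

This is the intuition behind keeping intermediate results partitioned on the key they will next be grouped or joined on.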
Catalyst: Spark’s Query Optimizer
Generates the Plans
• Spark’s query optimizer seeks to:
  • Estimate how big the input sources are
  • Estimate how many results will be produced in each filter, join, and group-by
  • Compare different orderings of operations
10
Spark Handles Failures!
11
Brief Review
When Spark runs on a cluster, it creates and executes a Spark query plan when
a. we execute a cell with a Pandas operation
b. we execute a cell that invokes an action like show() or save()
c. we execute a cell with a dataframe operation like a join
d. we execute a cell with an SQL query
Given two dataframes students(id,name) and enrolled(course_id,student_id), if we
execute a query to join on the student IDs, Spark must:
a. ensure students is sharded by ID and enrolled is sharded by student_id, or add exchange
operators as needed
b. perform a hash join within each of the worker nodes, without adding any exchange operators
c. ensure students is sharded by ID and enrolled is sharded by course_id, or add exchange
operators as needed
d. sort the enrolled dataframe by course_id
12
Recap
Group-by and join require the data to be sharded on the key – we may need to exchange (reshuffle/repartition) the data
If a worker fails in execution, its work is re-executed
13
Storing Data on the Cloud
15
Key questions
16
S3 (or GCS) for Storing Large Objects
17
DynamoDB (or BigTable) for Small Object Lookup
19
Brief Review
If we have tabular data that we are retrieving solely by an ID, our best choice(s) for
storage are likely to be:
a. DynamoDB or RDS
b. neither DynamoDB nor RDS
c. DynamoDB only
d. RDS only
If we have satellite photos, we are likely to want to store these on:
a. RDS
b. our laptop
c. DynamoDB
d. S3
20
Recap
21
Materialization of Query Results
23
An Example of Instances and Subclasses
id Person
name id name
Person 123 Ai
456 Jay
IS IS 789 Kaye
A A
Student Worker
id school id employer
Student Worker
456 Penn 789 Lutron
789 MIT
school employer
24
Materialization
25
Student and Worker are Naturally Views
CREATE VIEW StudentPerson(id, name, school) AS
SELECT Person.id, name, school
FROM Person JOIN Student ON Person.id = Student.id

StudentPerson
id    name    school
456   Jay     Penn
26
An Example of Instances and Subclasses
with Redundancy!

[Same ER diagram: Person with IS-A subclasses Student and Worker]

Person              Student            Worker
id    name          id    school       id    employer
123   Ai            456   Penn         789   Lutron
456   Jay           789   MIT
789   Kaye

Materialized views (stored redundantly alongside the base tables):

StudentPerson             WorkerPerson
id    name    school      id    name    employer
456   Jay     Penn        789   Kaye    Lutron
27
More Generally…
28
Other Uses for Materialization
29
Brief Review
If we use inheritance in an E-R diagram, the tables are naturally partitioned such
that
a. we only store instances in the child tables
b. instances show up in parent and child tables, but columns other than ID are split
c. the same columns show up in parent and child tables, but instances are split
d. we repeat both instances and all columns in parent and child tables
View materialization is accomplished by
a. calling materialize() on a dataframe
b. creating a view in SparkSQL
c. saving the input CSV
d. calling persist() on a dataframe
30
Recap
31
Module Wrap-up
32
More Complex Relationships
33
Graphs and Big Data
May be implicit (we compute links) or explicit (we can observe links)
For our running example: we’ll look at the LinkedIn connection network
Refresher: Graph Theory Basics
[Figure: two nodes n1, n2 joined by an edge with a label]

Graph G = (V, E)
V is a set of vertices or nodes, possibly with properties
E is a set of tuples of vertices, called edges, which may have labels or other data
36
Some Terminology
u,v are adjacent if there’s an edge between u and v
degree(u) = # adjacent vertices
• for directed graphs: indegree or outdegree
39
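These definitions can be illustrated with a few lines of Python over a directed edge list (an illustrative sketch; the edge data is made up):

```python
# Degree, indegree, and outdegree from a directed edge list of (u, v) pairs.
from collections import Counter

edges = [(0, 5), (5, 10), (0, 10), (10, 5)]

outdegree = Counter(u for u, v in edges)   # edges leaving each node
indegree  = Counter(v for u, v in edges)   # edges arriving at each node
degree    = outdegree + indegree           # total edges touching each node

# Node 5 has one outgoing edge (5 -> 10) and two incoming (0 -> 5, 10 -> 5),
# so degree[5] == 3
```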
Road Map
40
A Brief Intro to Graph Analysis,
aka Network Science
M.Sc Minh Hoang Vu
Thang Long University
Big Data
42
“Network Centrality”
43
Degree Centrality
44
LinkedIn Example
Given a list edges that looks like: [[0,5], [5,10], …]
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("from", IntegerType(), True),
    StructField("to", IntegerType(), True)
])

# Load the edge list as a dataframe
edges_df = sqlContext.createDataFrame(edges, schema)
edges_df.createOrReplaceTempView('edges')
sqlContext.sql('select `from` as id, count(to) as degree ' +
               'from edges group by `from`')
45
Beyond Degree Centrality
46
Betweenness Centrality
47
Brief Review
48
Recap
49
Graph Exploration
• Commonly, we will want to start at some node and look at how it relates to
other nodes in the graph
• How far away is X from Y?
• How many nodes are within distance k?
• What are the odds I can start at X and end up at Y?
51
Computing Distance in a Graph
How far apart are two nodes?
Distance between two nodes =
number of edges on the shortest path between them.
[Figure: vertices divided into visited vertices, frontier vertices (held in a queue), and unexplored vertices]
BFS - Centralized
Initialize a frontier queue with the origin node
While the frontier queue has a vertex in it:
    Pick a vertex v from the front of the queue
    Put each unexplored neighbor of v at the back of the queue
Note: closer vertices are always considered before more distant ones.
55
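The centralized BFS above can be written as a short runnable Python function (a sketch assuming the graph is given as an adjacency dict), returning the distance in edges from the origin to each reachable node:

```python
from collections import deque

def bfs_distances(graph, origin):
    """BFS from origin; graph maps each node to a list of neighbors."""
    dist = {origin: 0}            # visited vertices and their distances
    frontier = deque([origin])    # queue of frontier vertices
    while frontier:
        v = frontier.popleft()            # pick a vertex from the front
        for w in graph.get(v, []):        # each unexplored neighbor of v
            if w not in dist:
                dist[w] = dist[v] + 1     # one hop farther than v
                frontier.append(w)
    return dist

graph = {1: [2, 3], 2: [4], 3: [5], 4: [5], 5: []}
# bfs_distances(graph, 1) -> {1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
```

Because the queue is FIFO, all distance-k vertices are dequeued before any distance-(k+1) vertex, which is what makes the recorded distances shortest-path distances.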
Brief Review
56
Breadth-First Search on One Computer
57
Distributed Breadth First Search
59
Suppose Our Graph is in an Edge Relation
edges_df
+----+-------+
|from|     to|
+----+-------+
|   0|2152448|
|   0|1656491|
|   0| 399364|
|   0|  18448|
|   0|  72025|
|   1|  77832|
|   1| 542731|
|   1| 317452|
+----+-------+

Can we traverse from a subset of these nodes, via BFS, to more distant nodes?
This is a transitive closure query – what nodes can we reach transitively?
(Later: we could count path lengths)
60
A Sketch of the Approach
Start with our origin nodes (here, all nodes with ID < 3), which we can get with edges_df.filter(). From edges_df we can directly get their one-hop neighbors via join(edges_df, l=to, r=from). Then we can traverse from those nodes to the next level, and the next, by repeating the join.

[Figure: three snapshots of a small graph with edges such as 0→8, 1→4, 2→7, 2→0, 2→15, 8→7, 8→15, 0→16, 15→16, showing the reached set growing by one hop per join]
61
In Code, We Join in Each Traversal
and Rename!
from pyspark.sql.functions import col

# Start with a subset of nodes
start_nodes_df = edges_df[['from']].filter(edges_df['from'] < 1000).\
    select(col('from').alias('id')).drop_duplicates()

# start_nodes_df:
# +---+
# | id|
# +---+
# |148|
# |463|
# |471|
# |496|
# |833|
# +---+

neighbor_nodes_df = start_nodes_df.\
    join(edges_df, start_nodes_df.id == edges_df['from']).\
    select(col('to').alias('id')).drop_duplicates()

# neighbor_nodes_df:
# +-------+
# |     id|
# +-------+
# |1510404|
# |    523|
# | 993804|
# | 469009|
# | 232979|
# +-------+
62
What Happens Under the Covers?
Suppose that edges_df is sharded by from and that we put
even numbers on worker 0 and odd numbers on worker 1
edges_df (sharding: even from on server 0, odd from on server 1)
+----+-------+
|from|     to|
+----+-------+
|   0|2152448|
|   0|1656491|
|   0| 399364|
|   0|  18448|
|   0|  72025|
|   1|  77832|
|   1| 542731|
|   1| 317452|
+----+-------+

start_nodes_df (sharding: even from on server 0, odd from on server 1)
+----+
|from|
+----+
|   0|
|   1|
|   2|
|   3|
|   4|
|   5|
|   6|
+----+

neighbor_nodes_df (sharding: based on the from node, which isn’t part of the table!)
+-------+
|     id|
+-------+
|2152448|
|1656491|
| 399364|
|  18448|
|  77832|
| 542731|
+-------+
63
Neighbor’s Neighbor?
neighbor_neighbor_nodes_df = neighbor_nodes_df.\
join(edges_df, neighbor_nodes_df.id == edges_df['from']).\
select(col('to').alias('id')).drop_duplicates()
edges_df (sharding: even from on server 0, odd from on server 1)
+----+-------+
|from|     to|
+----+-------+
|   0|2152448|
|   0|1656491|
|   0| 399364|
|   0|  18448|
|   0|  72025|
|   1|  77832|
|   1| 542731|
|   1| 317452|
+----+-------+

neighbor_nodes_df (sharding: based on the from node, which isn’t part of the table!)
+-------+
|     id|
+-------+
|2152448|
|1656491|
| 399364|
|  18448|
|  72025|
|  77832|
| 542731|
| 317452|
+-------+

neighbor_nodes_df repartitioned by id (sharding: even id on server 0, odd id on server 1)
+-------+
|     id|
+-------+
|2152448|
|1656491|
| 399364|
|  18448|
|  72025|
|  77832|
| 542731|
| 317452|
+-------+

Now directly joinable with edges_df on id == edges_df.from
64
Can We Do an Iterative Join
and Track Hops?
Can we generalize from start → neighbor → neighbor’s neighbor, in a loop?
65
Iterative Join
def iterate(df, depth):
df.createOrReplaceTempView('iter')
66
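The iterate() stub above is left incomplete on the slide. As a plain-Python sketch of the idea (illustrative helper names, not the Spark dataframe version), each round "joins" the current frontier against the edge list, drops already-seen nodes (the drop_duplicates analog), and records the hop count:

```python
def bfs_hops(edges, start_nodes, max_depth):
    """Iteratively expand from start_nodes along (u, v) edges, tracking hops."""
    hops = {n: 0 for n in start_nodes}
    frontier = set(start_nodes)
    for depth in range(1, max_depth + 1):
        # the "join": follow every edge whose source is in the frontier
        reached = {v for u, v in edges if u in frontier}
        # keep only newly discovered nodes (analogous to drop_duplicates)
        frontier = {v for v in reached if v not in hops}
        for v in frontier:
            hops[v] = depth
        if not frontier:
            break                 # nothing new reached: we can stop early
    return hops

edges = [(0, 8), (1, 4), (2, 7), (2, 0), (2, 15), (8, 7), (8, 15), (0, 16), (15, 16)]
start = [0, 1, 2]                 # "all nodes with ID < 3"
# here every other node (4, 7, 8, 15, 16) is reached at hop 1
```

In Spark, each loop iteration would add one join (and typically one exchange) to the plan, which is why tracking when the frontier stops growing matters.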
What We Can Do Better
67
Brief Review
68
Recap
69
Applications of
Breadth-First Search
M.Sc Minh Hoang Vu
Thang Long University
Big Data
71
Adding Connections in
Social Networks
One’s friends tend to become each other’s friends.
This is called Triadic Closure.
[Figure: a triad involving nodes A and B, with the edge between them closing the triangle]
75
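A small plain-Python sketch of how triadic closure can drive link suggestions (illustrative data and helper names, not from the slides): for every pair of people not yet connected, count their common friends; the pairs with shared friends are candidate connections.

```python
# Suggest new links via triadic closure: people who share a friend.
from collections import defaultdict
from itertools import combinations

friendships = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]

# Build an undirected adjacency map
friends = defaultdict(set)
for u, v in friendships:
    friends[u].add(v)
    friends[v].add(u)

suggestions = {}
for u, v in combinations(sorted(friends), 2):
    if v not in friends[u]:                   # not already connected
        common = friends[u] & friends[v]      # shared friends
        if common:
            suggestions[(u, v)] = len(common) # strength of the suggestion

# B and D share friend A, as do C and D, so both pairs are suggested
```

At LinkedIn scale this same computation is a self-join of the connection edge list on the shared endpoint, which is exactly the distributed-join machinery covered earlier.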
Summary
76