BigData - W4 - Big Data & Graph Data - HoangVu (Cont)
avg_reviews_by_city_sdf = spark.sql(\
'select city, avg(stars) as avg_rating '\
'from yelp_business yb '\
'group by city')
avg_reviews_by_city_sdf.explain()
*(2) HashAggregate(keys=[city#21], functions=[avg(cast(stars#25 as double))])
+- Exchange hashpartitioning(city#21, 200), true, [id=#519]
   +- *(1) HashAggregate(keys=[city#21], functions=[partial_avg(cast(stars#25 as double))])
      +- FileScan csv [city#21,stars#25] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/content/yelp_business.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<city:string,stars:string>

[Figure: the same plan as a pipeline: File Scan of yelp_business.csv, then Pre-Aggregate (partial), then Hash Exchange, then the final Hash Aggregate computing the average]
2
Colab Spark Execution
[Figure: a Manager node (the Coordinator, running Livy + Python) dispatches work to several Workers; each Worker holds its shard of yelp_business.csv and its partitions of the intermediate dataframes (ybs, ypa, res)]
NETS212 | 3
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
yelp_business (sharded by ID)
id       name                     city        stored on
FYNWN1   Dental by Design         Ahwatukee   Server 0
BADF     My Wine Cellar           Ahwatukee   Server 1
KQPW8    Western Motor Vehicles   Phoenix     Server 0
8DShNS   Sports Authority         Tempe       Server 1
4
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
yelp_business (sharded by ID)
id       name                     city        stored on
FYNWN1   Dental by Design         Ahwatukee   Server 0
BADF     My Wine Cellar           Ahwatukee   Server 1
KQPW8    Western Motor Vehicles   Phoenix     Server 0
8DShNS   Sports Authority         Tempe       Server 1

To join on city, create two copies of the table, each re-sharded by city.
5
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
yelp_business, before (sharded by ID):
id       name                     city        stored on
FYNWN1   Dental by Design         Ahwatukee   Server 0
BADF     My Wine Cellar           Ahwatukee   Server 1
KQPW8    Western Motor Vehicles   Phoenix     Server 0
8DShNS   Sports Authority         Tempe       Server 1

yelp_business, after re-sharding by city:
id       name                     city        stored on
FYNWN1   Dental by Design         Ahwatukee   Server 0
BADF     My Wine Cellar           Ahwatukee   Server 0
KQPW8    Western Motor Vehicles   Phoenix     Server 1
8DShNS   Sports Authority         Tempe       Server 1
6
Distributed Joins
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
name name
My Wine Cellar Dental by Design
Dental by Design My Wine Cellar
7
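The reshard-then-join strategy on the preceding slides can be simulated in a few lines of plain Python (an illustrative sketch with hypothetical helper names, not Spark itself): rows are routed to a "server" by hashing the city, and the self-join then runs locally within each server.

```python
# Plain-Python sketch of the distributed self-join above (not Spark).
# Each row is (id, name, city); rows start out sharded by id.
rows = [
    ("FYNWN1", "Dental by Design", "Ahwatukee"),
    ("BADF",   "My Wine Cellar",   "Ahwatukee"),
    ("KQPW8",  "Western Motor Vehicles", "Phoenix"),
    ("8DShNS", "Sports Authority", "Tempe"),
]

NUM_SERVERS = 2

def reshard_by_city(rows):
    """Exchange step: route every row to server hash(city) % NUM_SERVERS."""
    shards = [[] for _ in range(NUM_SERVERS)]
    for row in rows:
        shards[hash(row[2]) % NUM_SERVERS].append(row)
    return shards

def local_self_join(shard):
    """Within one server, pair up distinct businesses in the same city."""
    return [(a[1], b[1]) for a in shard for b in shard
            if a[2] == b[2] and a[1] != b[1]]

shards = reshard_by_city(rows)
result = [pair for shard in shards for pair in local_self_join(shard)]
# result contains ('My Wine Cellar', 'Dental by Design') and the reverse pair
```

Because all rows with the same city hash to the same server, the local joins together produce exactly the same answer as a global join.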
Variation: (Left) Outerjoin
same_city_sdf = spark.sql(
'select b1.name, b2.name from yelp_business b1 left join yelp_business b2 '\
' on b1.city = b2.city and b1.name <> b2.name')
With the left outer join, b1 rows that have no same-city partner still appear, padded with NULL:

name               name
My Wine Cellar     Dental by Design
Dental by Design   My Wine Cellar
Western Motor…     NULL
Sports Authority   NULL
8
Minimizing Shuffle/Exchange Steps
9
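One way to see why minimizing exchanges matters: if the data is already sharded on the grouping or join key, the exchange step has nothing to move. The sketch below (plain Python with hypothetical helpers, not Spark's actual machinery) counts how many rows would have to travel to a different worker.

```python
# Illustrative sketch: a shuffle moves every row whose key hashes to a
# different worker. Data already partitioned on the key moves nothing.
NUM_WORKERS = 2

def partition(rows, key_fn):
    """Assign each row to worker hash(key) % NUM_WORKERS."""
    shards = [[] for _ in range(NUM_WORKERS)]
    for row in rows:
        shards[hash(key_fn(row)) % NUM_WORKERS].append(row)
    return shards

def rows_moved_by_exchange(shards, key_fn):
    """How many rows must travel to a different worker to group by key?"""
    moved = 0
    for worker, shard in enumerate(shards):
        for row in shard:
            if hash(key_fn(row)) % NUM_WORKERS != worker:
                moved += 1
    return moved

reviews = [("Phoenix", 4.0), ("Tempe", 3.5), ("Phoenix", 5.0), ("Ahwatukee", 4.5)]

# Already sharded on the group key: the exchange is a no-op.
by_city = partition(reviews, key_fn=lambda r: r[0])
assert rows_moved_by_exchange(by_city, lambda r: r[0]) == 0
```

This is the intuition behind keeping intermediate results partitioned on the key they will next be grouped or joined on.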
Catalyst: Spark’s Query Optimizer
Generates the Plans
• Spark’s query optimizer seeks to:
  • Estimate how big the input sources are
  • Estimate how many results will be produced in each filter, join, and group-by
  • Compare different orderings of operations
10
Spark Handles Failures!
11
Brief Review
When Spark runs on a cluster, it creates and executes a Spark query plan when
a. we execute a cell with a Pandas operation
b. we execute a cell that invokes an action like show() or save()
c. we execute a cell with a dataframe operation like a join
d. we execute a cell with an SQL query
Given two dataframes students(id,name) and enrolled(course_id,student_id), if we
execute a query to join on the student IDs, Spark must:
a. ensure students is sharded by ID and enrolled is sharded by student_id, or add exchange
operators as needed
b. perform a hash join within each of the worker nodes, without adding any exchange operators
c. ensure students is sharded by ID and enrolled is sharded by course_id, or add exchange
operators as needed
d. sort the enrolled dataframe by course_id
12
Recap
Group-by and join require the data to be sharded on the key – we may need to exchange (reshuffle/repartition) the data
If a worker fails in execution, its work is re-executed
13
Storing Data on the Cloud
15
Key questions
16
S3 (or GCS) for Storing Large Objects
17
DynamoDB (or BigTable) for Small Object Lookup
19
Brief Review
If we have tabular data that we are retrieving solely by an ID, our best choice(s) for
storage are likely to be:
a. DynamoDB or RDS
b. neither DynamoDB nor RDS
c. DynamoDB only
d. RDS only
If we have satellite photos, we are likely to want to store these on:
a. RDS
b. our laptop
c. DynamoDB
d. S3
20
Recap
21
Materialization of Query Results
23
An Example of Instances and Subclasses
id Person
name id name
Person 123 Ai
456 Jay
IS IS 789 Kaye
A A
Student Worker
id school id employer
Student Worker
456 Penn 789 Lutron
789 MIT
school employer
24
Materialization
25
Student and Worker are Naturally Views
CREATE VIEW StudentPerson(id, name, school) AS
SELECT Person.id, name, school
FROM Person JOIN Student ON Person.id = Student.id

StudentPerson
id    name    school
456   Jay     Penn
26
An Example of Instances and Subclasses
with Redundancy!

[Same ER diagram: Person with IS-A subclasses Student and Worker]

Person              Student            Worker
id    name          id    school       id    employer
123   Ai            456   Penn         789   Lutron
456   Jay           789   MIT
789   Kaye

Materialized views (stored redundantly alongside the base tables):

StudentPerson             WorkerPerson
id    name    school      id    name    employer
456   Jay     Penn        789   Kaye    Lutron
27
More Generally…
28
Other Uses for Materialization
29
Brief Review
If we use inheritance in an E-R diagram, the tables are naturally partitioned such
that
a. we only store instances in the child tables
b. instances show up in parent and child tables, but columns other than ID are split
c. the same columns show up in parent and child tables, but instances are split
d. we repeat both instances and all columns in parent and child tables
View materialization is accomplished by
a. calling materialize() on a dataframe
b. creating a view in SparkSQL
c. saving the input CSV
d. calling persist() on a dataframe
30
Recap
31
Module Wrap-up
32
More Complex Relationships
33
Graphs and Big Data
May be implicit (we compute links) or explicit (we can observe links)
For our running example: we’ll look at the LinkedIn connection network
Refresher: Graph Theory Basics
[Figure: two nodes n1, n2 joined by an edge with a label]

Graph G = (V, E)
V is a set of vertices or nodes, possibly with properties
E is a set of tuples of vertices, called edges, which may have labels or other data
36
Some Terminology
u,v are adjacent if there’s an edge between u and v
degree(u) = # adjacent vertices
• for directed graphs: indegree or outdegree
39
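These definitions can be illustrated with a few lines of Python over a directed edge list (an illustrative sketch; the edge data is made up):

```python
# Degree, indegree, and outdegree from a directed edge list of (u, v) pairs.
from collections import Counter

edges = [(0, 5), (5, 10), (0, 10), (10, 5)]

outdegree = Counter(u for u, v in edges)   # edges leaving each node
indegree  = Counter(v for u, v in edges)   # edges arriving at each node
degree    = outdegree + indegree           # total edges touching each node

# Node 5 has one outgoing edge (5 -> 10) and two incoming (0 -> 5, 10 -> 5),
# so degree[5] == 3
```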
Road Map
40
A Brief Intro to Graph Analysis,
aka Network Science
M.Sc Minh Hoang Vu
Thang Long University
Big Data
42
“Network Centrality”
43
Degree Centrality
44
LinkedIn Example
Given a list edges that looks like: [[0,5], [5,10], …]
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("from", IntegerType(), True),
    StructField("to", IntegerType(), True)
])

# Load the edge list as a dataframe
edges_df = sqlContext.createDataFrame(edges, schema)
edges_df.createOrReplaceTempView('edges')
sqlContext.sql('select `from` as id, count(to) as degree ' +
               'from edges group by `from`')
45
Beyond Degree Centrality
46
Betweenness Centrality
47
Brief Review
48
Recap
49
Graph Exploration
• Commonly, we will want to start at some node and look at how it relates to
other nodes in the graph
• How far away is X from Y?
• How many nodes are within distance k?
• What are the odds I can start at X and end up at Y?
51
Computing Distance in a Graph
How far apart are two nodes?
Distance between two nodes =
number of edges on the shortest path between them.
[Figure: vertices divided into visited vertices, frontier vertices (held in a queue), and unexplored vertices]
BFS - Centralized
Initialize a frontier queue with the origin node
While the frontier queue has a vertex in it:
    Pick a vertex v from the front of the queue
    Put each unexplored neighbor of v at the back of the queue
Note: closer vertices are always considered before more distant ones.
55
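The centralized BFS above can be written as a short runnable Python function (a sketch assuming the graph is given as an adjacency dict), returning the distance in edges from the origin to each reachable node:

```python
from collections import deque

def bfs_distances(graph, origin):
    """BFS from origin; graph maps each node to a list of neighbors."""
    dist = {origin: 0}            # visited vertices and their distances
    frontier = deque([origin])    # queue of frontier vertices
    while frontier:
        v = frontier.popleft()            # pick a vertex from the front
        for w in graph.get(v, []):        # each unexplored neighbor of v
            if w not in dist:
                dist[w] = dist[v] + 1     # one hop farther than v
                frontier.append(w)
    return dist

graph = {1: [2, 3], 2: [4], 3: [5], 4: [5], 5: []}
# bfs_distances(graph, 1) -> {1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
```

Because the queue is FIFO, all distance-k vertices are dequeued before any distance-(k+1) vertex, which is what makes the recorded distances shortest-path distances.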
Brief Review
56
Breadth-First Search on One Computer
57
Distributed Breadth First Search
59
Suppose Our Graph is in an Edge Relation
edges_df
+----+-------+
|from|     to|
+----+-------+
|   0|2152448|
|   0|1656491|
|   0| 399364|
|   0|  18448|
|   0|  72025|
|   1|  77832|
|   1| 542731|
|   1| 317452|
+----+-------+

Can we traverse from a subset of these nodes, via BFS, to more distant nodes?
This is a transitive closure query – what nodes can we reach transitively?
(Later: we could count path lengths)
60
A Sketch of the Approach
Start with our origin nodes (here, all nodes with ID < 3), which we can get with edges_df.filter(). From edges_df we can directly get their one-hop neighbors via join(edges_df, l=to, r=from). Then we can traverse from those nodes to the next level, and the next, by repeating the join.

[Figure: three snapshots of a small graph with edges such as 0→8, 1→4, 2→7, 2→0, 2→15, 8→7, 8→15, 0→16, 15→16, showing the reached set growing by one hop per join]
61
In Code, We Join in Each Traversal
and Rename!
from pyspark.sql.functions import col

# Start with a subset of nodes
start_nodes_df = edges_df[['from']].filter(edges_df['from'] < 1000).\
    select(col('from').alias('id')).drop_duplicates()

# start_nodes_df:
# +---+
# | id|
# +---+
# |148|
# |463|
# |471|
# |496|
# |833|
# +---+

neighbor_nodes_df = start_nodes_df.\
    join(edges_df, start_nodes_df.id == edges_df['from']).\
    select(col('to').alias('id')).drop_duplicates()

# neighbor_nodes_df:
# +-------+
# |     id|
# +-------+
# |1510404|
# |    523|
# | 993804|
# | 469009|
# | 232979|
# +-------+
62
What Happens Under the Covers?
Suppose that edges_df is sharded by from and that we put
even numbers on worker 0 and odd numbers on worker 1
edges_df (sharding: even from on server 0, odd from on server 1)
+----+-------+
|from|     to|
+----+-------+
|   0|2152448|
|   0|1656491|
|   0| 399364|
|   0|  18448|
|   0|  72025|
|   1|  77832|
|   1| 542731|
|   1| 317452|
+----+-------+

start_nodes_df (sharding: even from on server 0, odd from on server 1)
+----+
|from|
+----+
|   0|
|   1|
|   2|
|   3|
|   4|
|   5|
|   6|
+----+

neighbor_nodes_df (sharding: based on the from node, which isn’t part of the table!)
+-------+
|     id|
+-------+
|2152448|
|1656491|
| 399364|
|  18448|
|  77832|
| 542731|
+-------+
63
Neighbor’s Neighbor?
neighbor_neighbor_nodes_df = neighbor_nodes_df.\
join(edges_df, neighbor_nodes_df.id == edges_df['from']).\
select(col('to').alias('id')).drop_duplicates()
edges_df (sharding: even from on server 0, odd from on server 1)
+----+-------+
|from|     to|
+----+-------+
|   0|2152448|
|   0|1656491|
|   0| 399364|
|   0|  18448|
|   0|  72025|
|   1|  77832|
|   1| 542731|
|   1| 317452|
+----+-------+

neighbor_nodes_df (sharding: based on the from node, which isn’t part of the table!)
+-------+
|     id|
+-------+
|2152448|
|1656491|
| 399364|
|  18448|
|  72025|
|  77832|
| 542731|
| 317452|
+-------+

neighbor_nodes_df repartitioned by id (sharding: even id on server 0, odd id on server 1)
+-------+
|     id|
+-------+
|2152448|
|1656491|
| 399364|
|  18448|
|  72025|
|  77832|
| 542731|
| 317452|
+-------+

Now directly joinable with edges_df on id == edges_df.from
64
Can We Do an Iterative Join
and Track Hops?
Can we generalize from start → neighbor → neighbor’s neighbor, in a loop?
65
Iterative Join
def iterate(df, depth):
df.createOrReplaceTempView('iter')
66
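The iterate() stub above is left incomplete on the slide. As a plain-Python sketch of the idea (illustrative helper names, not the Spark dataframe version), each round "joins" the current frontier against the edge list, drops already-seen nodes (the drop_duplicates analog), and records the hop count:

```python
def bfs_hops(edges, start_nodes, max_depth):
    """Iteratively expand from start_nodes along (u, v) edges, tracking hops."""
    hops = {n: 0 for n in start_nodes}
    frontier = set(start_nodes)
    for depth in range(1, max_depth + 1):
        # the "join": follow every edge whose source is in the frontier
        reached = {v for u, v in edges if u in frontier}
        # keep only newly discovered nodes (analogous to drop_duplicates)
        frontier = {v for v in reached if v not in hops}
        for v in frontier:
            hops[v] = depth
        if not frontier:
            break                 # nothing new reached: we can stop early
    return hops

edges = [(0, 8), (1, 4), (2, 7), (2, 0), (2, 15), (8, 7), (8, 15), (0, 16), (15, 16)]
start = [0, 1, 2]                 # "all nodes with ID < 3"
# here every other node (4, 7, 8, 15, 16) is reached at hop 1
```

In Spark, each loop iteration would add one join (and typically one exchange) to the plan, which is why tracking when the frontier stops growing matters.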
What We Can Do Better
67
Brief Review
68
Recap
69
Applications of
Breadth-First Search
M.Sc Minh Hoang Vu
Thang Long University
Big Data
71
Adding Connections in
Social Networks
One’s friends tend to become each other’s friends.
This is called Triadic Closure.
[Figure: a triad involving nodes A and B, with the edge between them closing the triangle]
75
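A small plain-Python sketch of how triadic closure can drive link suggestions (illustrative data and helper names, not from the slides): for every pair of people not yet connected, count their common friends; the pairs with shared friends are candidate connections.

```python
# Suggest new links via triadic closure: people who share a friend.
from collections import defaultdict
from itertools import combinations

friendships = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]

# Build an undirected adjacency map
friends = defaultdict(set)
for u, v in friendships:
    friends[u].add(v)
    friends[v].add(u)

suggestions = {}
for u, v in combinations(sorted(friends), 2):
    if v not in friends[u]:                   # not already connected
        common = friends[u] & friends[v]      # shared friends
        if common:
            suggestions[(u, v)] = len(common) # strength of the suggestion

# B and D share friend A, as do C and D, so both pairs are suggested
```

At LinkedIn scale this same computation is a self-join of the connection edge list on the shared endpoint, which is exactly the distributed-join machinery covered earlier.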
Summary
76