
Optimizing Apache Spark:

Designing Clusters
Designing Clusters
How this is going to work...
We have four scenarios to choose from
● A Data Scientist training the first iteration of a model
● A SQL Analyst developing a report to be run once a month
● A team of 10 Data Analysts executing ad-hoc queries
● A Data Engineer processing a weekly job to ingest customer records

For each scenario, we will need to specify the following:

● The set of features, including Cluster Mode, Pooling, Autoscaling & Auto Termination
● The cluster setup, including Cluster Category, VM Level and Compute Level
Designing Clusters
Cluster Categories & VM Levels
Memory Optimized
Type   Memory   Cores   $/Hour
M-1    32 GB    4       $0.252
M-2    64 GB    8       $0.504
M-3    128 GB   16      $1.008
M-4    256 GB   32      $2.016

Compute Optimized
Type   Memory   Cores   $/Hour
C-1    16 GB    8       $0.340
C-2    32 GB    16      $0.680
C-3    64 GB    32      $1.360
C-4    128 GB   64      $2.720

Storage Optimized (w/Delta Cache)
Type   Memory   Cores   $/Hour
S-1    30 GB    4       $0.312
S-2    60 GB    8       $0.624
S-3    120 GB   16      $1.248
S-4    240 GB   32      $2.496

General Purpose
Type   Memory   Cores   $/Hour
G-1    16 GB    4       $0.192
G-2    32 GB    8       $0.384
G-3    64 GB    16      $0.768
G-4    128 GB   32      $1.536
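
To compare the categories on a normalized basis, a quick script over the level-1 rates above computes cost per core-hour and per GB-hour (the figures are copied from the tables; the relative ranking is the point, not the exact dollars):

    # Cost-efficiency comparison across the four VM categories,
    # using the level-1 VM from each table above.
    vms = {
        "M-1 (Memory Optimized)":  {"memory_gb": 32, "cores": 4, "price": 0.252},
        "C-1 (Compute Optimized)": {"memory_gb": 16, "cores": 8, "price": 0.340},
        "S-1 (Storage Optimized)": {"memory_gb": 30, "cores": 4, "price": 0.312},
        "G-1 (General Purpose)":   {"memory_gb": 16, "cores": 4, "price": 0.192},
    }

    for name, vm in vms.items():
        per_core = vm["price"] / vm["cores"]      # $ per core-hour
        per_gb = vm["price"] / vm["memory_gb"]    # $ per GB-hour
        print(f"{name}: ${per_core:.4f}/core-hour, ${per_gb:.4f}/GB-hour")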
Designing Clusters
Features
Cluster Mode
● Standard - Recommended for single-user clusters
● High Concurrency - Optimized to run concurrent SQL, Python, and R workloads
(not available w/Scala)

Cluster Pools
● Yes / No - Terminated VMs are not released, enabling quick reuse
● Max Nodes - The maximum number of nodes to be shared by all users
● Idle Minutes - The amount of idle time after which the VM will be released

Autoscaling
● Yes / No - The number of VMs can increase or decrease based on load
● Min Workers - The minimum number of VMs that a cluster can scale to
● Max Workers - The maximum number of VMs that a cluster can scale to

Auto Terminate
● Yes / No - Terminate the cluster when idle
● Idle Minutes - The amount of idle time after which the cluster will be terminated

Runtime - Spark 3.0, Scala 2.12 and Python 3
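
To show how these features might come together in a cluster definition, here is a minimal sketch of a cluster spec. The field names follow the Databricks Clusters API style and are an assumption; your platform's exact keys, node type names, and runtime labels may differ.

    # Hypothetical cluster spec illustrating the features above.
    # Field names are Databricks-style and illustrative, not a definitive schema.
    cluster_spec = {
        "cluster_name": "adhoc-analytics",          # illustrative name
        "spark_version": "7.3.x-scala2.12",         # Spark 3.0 / Scala 2.12 / Python 3 runtime
        "node_type_id": "G-2",                      # General Purpose, 32 GB / 8 cores
        "autoscale": {                              # Autoscaling: Yes
            "min_workers": 2,
            "max_workers": 8,
        },
        "autotermination_minutes": 30,              # Auto Terminate after 30 idle minutes
        "instance_pool_id": "shared-analyst-pool",  # Cluster Pool: Yes (illustrative id)
        # Cluster Mode (Standard vs High Concurrency) is configured separately
        # and is omitted here.
    }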
Designing Clusters
Compute Level
● Understand your SLA - Jobs that require low latency may justify higher-priced
VMs, while cheaper, shared VMs can serve higher-latency scenarios
● Category - First select the category of VMs for this job, noting the
differences in memory, cores and price
● VM Level - From your selected category, select the VM Level, which dictates
the base amount of memory, compute and price per VM
● Compute Level - Predict how many tasks your job will require - assume that the
data on disk inflates by 2x in Spark and that each Spark partition will be the
default 128 MB (see the sizing sketch after this list)
● Max Workers (VMs) - With your SLA in mind, and the compute & memory
level of each VM, determine how many VMs will be required
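
To make the task estimate concrete, here is a minimal sizing sketch; the 50 GB input size, 8-core worker, and 4-wave target are illustrative assumptions, not values from the scenarios:

    import math

    # Rough sizing: tasks from data volume, then workers from cores and SLA "waves".
    data_on_disk_gb = 50         # illustrative input size
    inflation_factor = 2         # data on disk inflates ~2x in Spark (per the rule above)
    partition_mb = 128           # default Spark partition size

    tasks = math.ceil(data_on_disk_gb * 1024 * inflation_factor / partition_mb)

    cores_per_vm = 8             # e.g. a G-2 (General Purpose) worker
    waves = 4                    # how many rounds of tasks the SLA can tolerate
    workers = math.ceil(tasks / (cores_per_vm * waves))

    print(f"~{tasks} tasks -> ~{workers} workers of {cores_per_vm} cores ({waves} waves)")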
Designing Clusters
Cluster Design Worksheet
Category: Storage / Memory / Compute / General

VM Level/Type: Level-1 / Level-2 / Level-3 / Level-4

Min Workers (VMs): ___

Max Workers (VMs): ___ (if autoscaling)

Cluster Mode: Standard / High Concurrency

Cluster Pool: Yes / No Max Nodes: ___ Idle Min: ___

Auto Terminate: Yes / No Idle Min: ___


Designing Clusters
Scenario #1
● Who: Data Scientist
● What: Training the first iteration of an ML model
● Dataset (silver)
■ 10 GB table of transactions for the previous year
■ 5 MB table of product codes
■ 20 MB table of US zip codes, filtered to one state, Michigan
● SLA: not applicable
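
As a hedged first sizing pass (not the intended worksheet answer), only the transactions table produces a meaningful number of partitions; the 5 MB and 20 MB lookup tables are small enough to broadcast:

    import math

    # Scenario #1 rough sizing: 10 GB on disk, 2x inflation, 128 MB partitions.
    tasks = math.ceil(10 * 1024 * 2 / 128)
    print(f"~{tasks} tasks for the transactions table")   # ~160 tasks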
Designing Clusters
Scenario #2
● Who: SQL Analyst
● What: Developing a monthly report to quantify the number of sales aggregated
by sales associate
● Dataset (gold)
■ 3.7 TB table of transactions spanning 20 years, partitioned by year and month
● SLA: not applicable
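
As a hedged first sizing pass, partition pruning matters here: a monthly report against a table partitioned by year and month should only scan roughly one month of the 20-year history (assuming the volume is spread evenly, which is an assumption):

    import math

    # Scenario #2 rough sizing: 3.7 TB spread over 20 years x 12 months.
    monthly_gb = 3.7 * 1024 / (20 * 12)              # ~15.8 GB per month on disk
    tasks = math.ceil(monthly_gb * 1024 * 2 / 128)   # 2x inflation, 128 MB partitions
    print(f"~{monthly_gb:.1f} GB per month, ~{tasks} tasks")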
Designing Clusters
Scenario #3
● Who: Team of 10 Data Analysts
● What: Ad hoc analysis as requested by other departments. Average time on the
cluster is 4 hours per day, used very sporadically.
● Dataset (silver)
■ 30 different tables
■ 50% of all tables are < 1 GB
■ One table is ~512 GB with 18 years of data, partitioned by year and month
■ One table is ~200 GB partitioned by US state
■ All other tables average 10 GB
● SLA: not applicable
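
As a hedged first sizing pass, the worst case is an unpruned scan of the largest table; a query pruned to a single year is far smaller:

    import math

    # Scenario #3 rough sizing for the ~512 GB table (2x inflation, 128 MB partitions).
    full_scan_tasks = math.ceil(512 * 1024 * 2 / 128)       # no partition pruning
    one_year_tasks = math.ceil(512 / 18 * 1024 * 2 / 128)   # pruned to one of 18 years
    print(full_scan_tasks, one_year_tasks)                  # 8192 vs ~456 tasks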
Designing Clusters
Scenario #4
● Who: Data Engineer
● What: Batch processing customer records
■ Every Sunday night, customers upload CSV files to an FTP site; the files are
immediately moved to blob storage.
■ Every Monday morning, each file must be validated, file-specific duplicates removed,
and the result merged into an existing Parquet dataset.
■ Average file is 512 MB (assume that in Spark it will be ½ this size)
■ Each customer must be processed as a single Spark job
■ The company currently processes 137 customers a week
● SLA: The data must be ready for the next stage within 12 hours
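
As a hedged first sizing pass against the 12-hour SLA, each customer file is small once loaded, so per-job parallelism is low; the real constraint is how many of the 137 jobs can run side by side. The 10-job concurrency and 15-minute job duration below are assumptions, not given in the scenario:

    import math

    # Scenario #4 rough sizing: a 512 MB CSV shrinks to ~256 MB in Spark (per the note above),
    # i.e. about 2 partitions / tasks per customer file.
    tasks_per_file = math.ceil(512 * 0.5 / 128)
    customers = 137

    # Assume 10 customer jobs run concurrently and each finishes in ~15 minutes.
    batches = math.ceil(customers / 10)
    hours = batches * 15 / 60
    print(f"{tasks_per_file} tasks per file, ~{hours:.1f} hours at 10 concurrent 15-minute jobs")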
