
Optimizing Apache Spark:

Designing Clusters
Designing Clusters
How this is going to work...
We have four scenarios to choose from
● A Data Scientist training the first iteration of a model
● A SQL Analyst developing a report to be run once a month
● A team of 10 Data Analysts executing ad-hoc queries
● A Data Engineer processing a weekly job to ingest customer records

For each scenario, we will need to specify the following:

● The set of features, including Cluster Mode, Pooling, Autoscaling & Auto Termination
● The cluster setup, including Cluster Category, VM Level and Compute Level
Designing Clusters
Cluster Categories & VM Levels
Memory Optimized
Type   Memory   Cores   $/Hour
M-1    32 GB    4       $0.252
M-2    64 GB    8       $0.504
M-3    128 GB   16      $1.008
M-4    256 GB   32      $2.016

Compute Optimized
Type   Memory   Cores   $/Hour
C-1    16 GB    8       $0.340
C-2    32 GB    16      $0.680
C-3    64 GB    32      $1.360
C-4    128 GB   64      $2.720

Storage Optimized (w/Delta Cache)
Type   Memory   Cores   $/Hour
S-1    30 GB    4       $0.312
S-2    60 GB    8       $0.624
S-3    120 GB   16      $1.248
S-4    240 GB   32      $2.496

General Purpose
Type   Memory   Cores   $/Hour
G-1    16 GB    4       $0.192
G-2    32 GB    8       $0.384
G-3    64 GB    16      $0.768
G-4    128 GB   32      $1.536
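
To compare the categories on a normalized basis, a quick script over the level-1 rates above computes cost per core-hour and per GB-hour (the figures are copied from the tables; the relative ranking is the point, not the exact dollars):

    # Cost-efficiency comparison across the four VM categories,
    # using the level-1 VM from each table above.
    vms = {
        "M-1 (Memory Optimized)":  {"memory_gb": 32, "cores": 4, "price": 0.252},
        "C-1 (Compute Optimized)": {"memory_gb": 16, "cores": 8, "price": 0.340},
        "S-1 (Storage Optimized)": {"memory_gb": 30, "cores": 4, "price": 0.312},
        "G-1 (General Purpose)":   {"memory_gb": 16, "cores": 4, "price": 0.192},
    }

    for name, vm in vms.items():
        per_core = vm["price"] / vm["cores"]      # $ per core-hour
        per_gb = vm["price"] / vm["memory_gb"]    # $ per GB-hour
        print(f"{name}: ${per_core:.4f}/core-hour, ${per_gb:.4f}/GB-hour")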
Designing Clusters
Features
Cluster Mode
● Standard - Recommended for single-user clusters
● High Concurrency - Optimized to run concurrent SQL, Python, and R workloads
(not available w/Scala)

Cluster Pools
● Yes / No - Terminated VMs are not released, enabling quick reuse
● Max Nodes - The maximum number of nodes to be shared by all users
● Idle Minutes - The amount of idle time after which the VM will be released

Autoscaling
● Yes / No - The number of VMs can increase or decrease based on load
● Min Workers - The minimum number of VMs that a cluster can scale to
● Max Workers - The maximum number of VMs that a cluster can scale to

Auto Terminate
● Yes / No - Terminate the cluster when idle
● Idle Minutes - The amount of idle time after which the cluster will be terminated

Runtime - Spark 3.0, Scala 2.12 and Python 3
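
To show how these features might come together in a cluster definition, here is a minimal sketch of a cluster spec. The field names follow the Databricks Clusters API style and are an assumption; your platform's exact keys, node type names, and runtime labels may differ.

    # Hypothetical cluster spec illustrating the features above.
    # Field names are Databricks-style and illustrative, not a definitive schema.
    cluster_spec = {
        "cluster_name": "adhoc-analytics",          # illustrative name
        "spark_version": "7.3.x-scala2.12",         # Spark 3.0 / Scala 2.12 / Python 3 runtime
        "node_type_id": "G-2",                      # General Purpose, 32 GB / 8 cores
        "autoscale": {                              # Autoscaling: Yes
            "min_workers": 2,
            "max_workers": 8,
        },
        "autotermination_minutes": 30,              # Auto Terminate after 30 idle minutes
        "instance_pool_id": "shared-analyst-pool",  # Cluster Pool: Yes (illustrative id)
        # Cluster Mode (Standard vs High Concurrency) is configured separately
        # and is omitted here.
    }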
Designing Clusters
Compute Level
● Understand your SLA - Jobs that require low latency may justify higher-priced
VMs, while cheaper, shared VMs can serve higher-latency scenarios
● Category - First select the category of VMs for this job, noting the
differences in memory, cores and price
● VM Level - From your selected category, select the VM Level, which dictates
the base amount of memory, compute and price per VM
● Compute Level - Predict how many tasks your job will require - assume that the
data on disk inflates by 2x in Spark and that each Spark partition will be the
default 128 MB (see the sizing sketch after this list)
● Max Workers (VMs) - With your SLA in mind, and the compute & memory
level of each VM, determine how many VMs will be required
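
To make the task estimate concrete, here is a minimal sizing sketch; the 50 GB input size, 8-core worker, and 4-wave target are illustrative assumptions, not values from the scenarios:

    import math

    # Rough sizing: tasks from data volume, then workers from cores and SLA "waves".
    data_on_disk_gb = 50         # illustrative input size
    inflation_factor = 2         # data on disk inflates ~2x in Spark (per the rule above)
    partition_mb = 128           # default Spark partition size

    tasks = math.ceil(data_on_disk_gb * 1024 * inflation_factor / partition_mb)

    cores_per_vm = 8             # e.g. a G-2 (General Purpose) worker
    waves = 4                    # how many rounds of tasks the SLA can tolerate
    workers = math.ceil(tasks / (cores_per_vm * waves))

    print(f"~{tasks} tasks -> ~{workers} workers of {cores_per_vm} cores ({waves} waves)")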
Designing Clusters
Cluster Design Worksheet
Category: Storage / Memory / Compute / General

VM Level/Type: Level-1 / Level-2 / Level-3 / Level-4

Min Workers (VMs): ___

Max Workers (VMs): ___ (if autoscaling)

Cluster Mode: Standard / High Concurrency

Cluster Pool: Yes / No Max Nodes: ___ Idle Min: ___

Auto Terminate: Yes / No Idle Min: ___


Designing Clusters
Scenario #1
● Who: Data Scientist
● What: Training the first iteration of an ML model
● Dataset (silver)
■ 10 GB table of transactions for the previous year
■ 5 MB table of product codes
■ 20 MB table of US zip codes, filtered to one state, Michigan
● SLA: not applicable
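
As a hedged first sizing pass (not the intended worksheet answer), only the transactions table produces a meaningful number of partitions; the 5 MB and 20 MB lookup tables are small enough to broadcast:

    import math

    # Scenario #1 rough sizing: 10 GB on disk, 2x inflation, 128 MB partitions.
    tasks = math.ceil(10 * 1024 * 2 / 128)
    print(f"~{tasks} tasks for the transactions table")   # ~160 tasks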
Designing Clusters
Scenario #2
● Who: SQL Analyst
● What: Developing a monthly report to quantify the number of sales aggregated
by sales associate
● Dataset (gold)
■ 3.7 TB table of transactions spanning 20 years, partitioned by year and month
● SLA: not applicable
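
As a hedged first sizing pass, partition pruning matters here: a monthly report against a table partitioned by year and month should only scan roughly one month of the 20-year history (assuming the volume is spread evenly, which is an assumption):

    import math

    # Scenario #2 rough sizing: 3.7 TB spread over 20 years x 12 months.
    monthly_gb = 3.7 * 1024 / (20 * 12)              # ~15.8 GB per month on disk
    tasks = math.ceil(monthly_gb * 1024 * 2 / 128)   # 2x inflation, 128 MB partitions
    print(f"~{monthly_gb:.1f} GB per month, ~{tasks} tasks")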
Designing Clusters
Scenario #3
● Who: Team of 10 Data Analysts
● What: Ad hoc analysis as requested by other departments. Average time on the
cluster is 4 hours per day, used very sporadically.
● Dataset (silver)
■ 30 different tables
■ 50% of all tables are < 1 GB
■ One table is ~512 GB with 18 years of data, partitioned by year and month
■ One table is ~200 GB partitioned by US state
■ All other tables average 10 GB
● SLA: not applicable
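
As a hedged first sizing pass, the worst case is an unpruned scan of the largest table; a query pruned to a single year is far smaller:

    import math

    # Scenario #3 rough sizing for the ~512 GB table (2x inflation, 128 MB partitions).
    full_scan_tasks = math.ceil(512 * 1024 * 2 / 128)       # no partition pruning
    one_year_tasks = math.ceil(512 / 18 * 1024 * 2 / 128)   # pruned to one of 18 years
    print(full_scan_tasks, one_year_tasks)                  # 8192 vs ~456 tasks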
Designing Clusters
Scenario #4
● Who: Data Engineer
● What: Batch processing customer records
■ Every Sunday night, customers upload CSV files to an FTP site; the files are
immediately moved to blob storage.
■ Every Monday morning, each file must be validated, file-specific duplicates removed,
and the result merged into an existing Parquet dataset.
■ Average file is 512 MB (assume that in Spark it will be ½ this size)
■ Each customer must be processed as a single Spark job
■ The company currently processes 137 customers a week
● SLA: The data must be ready for the next stage within 12 hours
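
As a hedged first sizing pass against the 12-hour SLA, each customer file is small once loaded, so per-job parallelism is low; the real constraint is how many of the 137 jobs can run side by side. The 10-job concurrency and 15-minute job duration below are assumptions, not given in the scenario:

    import math

    # Scenario #4 rough sizing: a 512 MB CSV shrinks to ~256 MB in Spark (per the note above),
    # i.e. about 2 partitions / tasks per customer file.
    tasks_per_file = math.ceil(512 * 0.5 / 128)
    customers = 137

    # Assume 10 customer jobs run concurrently and each finishes in ~15 minutes.
    batches = math.ceil(customers / 10)
    hours = batches * 15 / 60
    print(f"{tasks_per_file} tasks per file, ~{hours:.1f} hours at 10 concurrent 15-minute jobs")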
