DISHA MUKHERJEE Apache Spark Cheat Sheet
2023
Credits/Source: https://www.zuar.com/blog/apache-spark-cheat-sheet/
Contents
• Benefits of Using Apache Spark
• Commands for Initializing Apache Spark
• Apache Spark Set Operations
• Apache Spark Action & Transformation Commands
• Common Apache Spark Actions
• Common Apache Spark Transformations
• RDD Persistence Commands
• Spark SQL & Dataframe Commands
• Beyond the Cheat Sheet
Benefits of Using Apache Spark
Apache Spark is an open-source, Hadoop-compatible, cluster-computing platform that processes
'big data' with built-in modules for SQL, machine learning, streaming, and graph processing. The
main benefits of using Apache Spark with your preferred API are that it:
• Computes data at blazing speeds by loading it across the distributed memory of a group of
machines.
• Stays accessible by offering standard APIs in Java, Python, Scala, and SQL.
• Allows enterprises to leverage their existing infrastructure by remaining compatible with
Hadoop v1 and 2.x.
• Is easy to install and provides a convenient shell for learning the APIs.
• Improves productivity by letting you focus on the content of the computation.
Commands for Initializing Apache Spark
These are the most common commands for initializing the Apache Spark shell in either Scala or
Python, and the essential commands you need when setting up the platform:
Initializing Spark Shell Using Scala
$ ./bin/spark-shell --master local[4]
Initializing SparkContext Using Scala
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
Initializing Spark Shell Using Python
$ ./bin/pyspark --master local[4]
Initializing SparkContext Using Python
from pyspark import SparkContext
sc = SparkContext(master='local[2]')
Configuring Spark Using Python
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
    .setMaster("local")
    .setAppName("My app")
    .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
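As a quick sanity check (a minimal sketch, assuming the SparkContext sc created above), you can
parallelize a small collection and run an action on it:
rdd = sc.parallelize([1, 2, 3, 4, 5])  # distribute a local collection as an RDD
print(rdd.count())  # 5
print(rdd.sum())    # 15
If this prints 5 and 15, the context is wired up correctly.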
Apache Spark Set Operations
Here is a list of the most common set operations to generate a new Resilient Distributed Dataset
(RDD). An RDD is a fault-tolerant collection of data elements that can be operated on in parallel.
You can create an RDD by referencing a dataset in an external storage system, or by parallelizing a
collection in your driver program.
Joining All Elements From the Argument and Source
union()
Intersecting All Elements From the Argument and Source
intersection()
Creating a Cross Product From the Argument and Source
cartesian()
Removing Data Elements Found in the Argument From the Source
subtract()
Joining Data Elements to Create a New RDD
join(RDD, [numTasks])
Grouping Data From Both RDDs by Key Into Iterables
cogroup(RDD,[numTasks])
Piping Each Partition of an RDD Through a Shell Command
pipe(command, [envVars])
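To show how these set operations combine, here is a minimal PySpark sketch. It assumes an
existing SparkContext sc; the RDD contents are made-up sample data, and collect() may return
elements in a different order.
left = sc.parallelize([1, 2, 3, 4])
right = sc.parallelize([3, 4, 5, 6])
left.union(right).collect()         # elements from both RDDs, duplicates kept
left.intersection(right).collect()  # [3, 4]
left.subtract(right).collect()      # [1, 2]
left.cartesian(right).count()       # 16 pairs
# join() and cogroup() operate on key/value pair RDDs
a = sc.parallelize([("a", 1), ("b", 2)])
b = sc.parallelize([("a", 3), ("c", 4)])
a.join(b).collect()                 # [('a', (1, 3))]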
Apache Spark Action & Transformation Commands
Most RDD operations are either:
• Transformations: creating a new dataset from an existing dataset
• Actions: returning a value to the driver program from computing on the dataset
We’ll cover the most common action and transformation commands below. Note that the syntax
can vary depending on the API you are using, such as Python, Scala, or Java.
Common Apache Spark Actions
Here are the bread-and-butter actions for retrieving specific data elements or values from an RDD;
a short example follows the list.
Counting Number of Data Elements in the RDD
count()
Collecting an Array of All the Data Elements
collect()
Aggregating the Data Elements
reduce(func)
Getting the First N Data Elements
take(n)
Executing the Function for Each Data Element
foreach(func)
Retrieving the First Data Element
first()
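Here is a minimal sketch of these actions in PySpark, assuming an existing SparkContext sc and a
small sample RDD:
nums = sc.parallelize([5, 3, 1, 4, 2])
nums.count()                     # 5
nums.collect()                   # [5, 3, 1, 4, 2] -- pulls all elements to the driver
nums.reduce(lambda x, y: x + y)  # 15
nums.take(3)                     # [5, 3, 1]
nums.first()                     # 5
nums.foreach(print)              # side effect runs on the executors, not the driver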
Common Apache Spark Transformations
Here are the main operations for creating a new RDD by applying a transformation function to its
data elements; a short example follows the list.
Selecting Data Elements Based on a Function
filter(func)
Applying a Function to Each Data Element
map(func)
Applying a Function to Each Data Element to Return a
Sequence
flatMap(func)
Applying a Function Separately to Each Partition of the RDD
mapPartitions(func)
Applying a Function to Each Partition Together With Its Index
mapPartitionsWithIndex(func)
Sampling a Fraction of the Data
sample(withReplacement, fraction, seed)
Aggregating the Values of Each Key With a Function
reduceByKey(func,[numTasks])
Eliminating Duplicates
distinct([numTasks])
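The following minimal PySpark sketch strings several of these transformations together, assuming
an existing SparkContext sc; the input strings are made-up sample data.
lines = sc.parallelize(["spark is fast", "spark is easy"])
# classic word count: flatMap -> map -> reduceByKey
lines.flatMap(lambda line: line.split(" ")) \
     .map(lambda word: (word, 1)) \
     .reduceByKey(lambda a, b: a + b) \
     .collect()                                    # e.g. [('spark', 2), ('is', 2), ...]
nums = sc.parallelize([1, 2, 2, 3, 3, 3])
nums.filter(lambda x: x > 1).distinct().collect()  # [2, 3] (order may vary)
nums.mapPartitions(lambda it: [sum(it)]).collect() # one partial sum per partition
nums.sample(withReplacement=False, fraction=0.5, seed=42).collect()  # random subset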
RDD Persistence Commands
One of the best features of Apache Spark is its ability to cache an RDD in cluster memory, which
speeds up iterative computation. Here are the most commonly used storage levels for RDD
persistence; a short example follows the list.
Storing RDD in Cluster Memory as Deserialized Java Objects (the Default Level)
MEMORY_ONLY
Storing RDD in Cluster Memory or on the Disk as
Deserialized Java Objects
MEMORY_AND_DISK
Storing RDD as Serialized Java Objects (One Byte Array per Partition)
MEMORY_ONLY_SER
Storing RDD Only on the Disk
DISK_ONLY
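Here is a minimal persistence sketch in PySpark, assuming an existing SparkContext sc; the storage
level is passed to persist() as a StorageLevel constant:
from pyspark import StorageLevel
rdd = sc.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # mark the RDD to be kept after the first action
rdd.count()      # first action computes and caches the RDD
rdd.count()      # later actions reuse the cached data
rdd.unpersist()  # release the storage when you are done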
Spark SQL & Dataframe Commands
These are the common Spark SQL commands for working with structured data through Apache
Spark:
Integrating SQL Queries With Apache Spark
results = spark.sql("SELECT * FROM tbl_name")
data_name = results.rdd.map(lambda p: p.col_name)
Connecting to Any Data Source
spark.read.json("s3n://…").registerTempTable("json")
results = spark.sql("""SELECT * FROM tbl_name JOIN json …""")
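These snippets assume a SparkSession named spark. Here is a minimal end-to-end sketch; the file,
view, and column names are made up for illustration, and createOrReplaceTempView() is the
modern replacement for the deprecated registerTempTable():
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("My app").getOrCreate()
df = spark.read.json("people.json")   # hypothetical input file
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL
results = spark.sql("SELECT name FROM people WHERE age > 21")
results.rdd.map(lambda row: row.name).collect()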