PySpark SQL Basics Cheat Sheet

PySpark and Spark SQL allow working with structured data in Apache Spark. A SparkSession can be used to create DataFrames, register them as tables, execute SQL queries on tables, and more. Common SQL operations include selecting columns, filtering rows, aggregating, sorting, handling null values, and repartitioning DataFrames. DataFrames can be registered as views to run SQL queries programmatically against them.

Uploaded by

Ketan Bhalerao


Python For Data Science Cheat Sheet
PySpark - SQL Basics

PySpark & Spark SQL

Spark SQL is Apache Spark's module for working with structured data.

Initializing SparkSession

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("[Link]", "some-value") \
        .getOrCreate()

Creating DataFrames

From RDDs

>>> from pyspark.sql import Row
>>> from pyspark.sql.types import *

Infer Schema

>>> sc = spark.sparkContext
>>> lines = sc.textFile("[Link]")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0],age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema

>>> # Both fields are declared as StringType below, so keep age as a string
>>> people = parts.map(lambda p: (p[0], p[1].strip()))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True)
              for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()
+--------+---+
|    name|age|
+--------+---+
|    Mine| 28|
|   Filip| 29|
|Jonathan| 30|
+--------+---+

From Spark Data Sources

JSON

>>> df = spark.read.json("[Link]")
>>> df.show()
+--------------------+---+---------+--------+--------------------+
|             address|age|firstName|lastName|         phoneNumber|
+--------------------+---+---------+--------+--------------------+
|[New York,10021,N...| 25|     John|   Smith|[[212 555-1234,ho...|
|[New York,10021,N...| 21|     Jane|     Doe|[[322 888-1234,ho...|
+--------------------+---+---------+--------+--------------------+
>>> df2 = spark.read.load("[Link]", format="json")

Parquet files

>>> df3 = spark.read.load("[Link]")

TXT files

>>> df4 = spark.read.text("[Link]")

Inspect Data

>>> df.dtypes                  # Return df column names and data types
>>> df.show()                  # Display the content of df
>>> df.head()                  # Return the first row (head(n) returns the first n rows)
>>> df.first()                 # Return first row
>>> df.take(2)                 # Return the first n rows
>>> df.schema                  # Return the schema of df
>>> df.describe().show()       # Compute summary statistics
>>> df.columns                 # Return the columns of df
>>> df.count()                 # Count the number of rows in df
>>> df.distinct().count()      # Count the number of distinct rows in df
>>> df.printSchema()           # Print the schema of df
>>> df.explain()               # Print the (logical and physical) plans

Duplicate Values

>>> df = df.dropDuplicates()

Queries

>>> from pyspark.sql import functions as F

Select

>>> df.select("firstName").show()            # Show all entries in firstName column
>>> df.select("firstName","lastName") \
      .show()
>>> # Show all entries in firstName, age and type
>>> df.select("firstName",
              "age",
              F.explode("phoneNumber") \
               .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> # Show all entries in firstName and age, add 1 to the entries of age
>>> df.select(df["firstName"],df["age"]+ 1) \
      .show()
>>> df.select(df['age'] > 24).show()         # Show a boolean column: TRUE where age > 24

When

>>> # Show firstName and 0 or 1 depending on age > 30
>>> df.select("firstName",
              F.when(df.age > 30, 1) \
               .otherwise(0)) \
      .show()
>>> # Show the rows whose firstName is in the given options
>>> df[df.firstName.isin("Jane","Boris")] \
      .collect()

Like

>>> # Show firstName, and TRUE if lastName is like Smith
>>> df.select("firstName",
              df.lastName.like("Smith")) \
      .show()

Startswith - Endswith

>>> # Show firstName, and TRUE if lastName starts with Sm
>>> df.select("firstName",
              df.lastName \
                .startswith("Sm")) \
      .show()
>>> # Show TRUE for last names ending in th
>>> df.select(df.lastName.endswith("th")) \
      .show()

Substring

>>> # Return substrings of firstName
>>> df.select(df.firstName.substr(1, 3) \
                .alias("name")) \
      .collect()

Between

>>> # Show age: values are TRUE if between 22 and 24
>>> df.select(df.age.between(22, 24)) \
      .show()

Add, Update & Remove Columns

Adding Columns

>>> df = df.withColumn('city',df.address.city) \
           .withColumn('postalCode',df.address.postalCode) \
           .withColumn('state',df.address.state) \
           .withColumn('streetAddress',df.address.streetAddress) \
           .withColumn('telePhoneNumber',
                       F.explode([Link])) \
           .withColumn('telePhoneType',
                       F.explode([Link]))

Updating Columns

>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns

>>> df = df.drop("address", "phoneNumber")
>>> df = df.drop(df.address).drop(df.phoneNumber)

GroupBy

>>> # Group by age, count the members in the groups
>>> df.groupBy("age") \
      .count() \
      .show()

Filter

>>> # Filter entries of age, only keep those records of which the values are > 24
>>> df.filter(df["age"]>24).show()

Sort

>>> df.sort(df.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age","city"],ascending=[0,1]) \
      .collect()

Missing & Replacing Values

>>> df.na.fill(50).show()      # Replace null values
>>> df.na.drop().show()        # Return new df omitting rows with null values
>>> # Return new df replacing one value with another
>>> df.na \
      .replace(10, 20) \
      .show()

Repartitioning

>>> # df with 10 partitions
>>> df.repartition(10) \
      .rdd \
      .getNumPartitions()
>>> # df with 1 partition
>>> df.coalesce(1).rdd.getNumPartitions()

Running SQL Queries Programmatically

Registering DataFrames as Views

>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views

>>> df5 = spark.sql("SELECT * FROM customer")
>>> df5.show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")
>>> peopledf2.show()

Output

Data Structures

>>> rdd1 = df.rdd              # Convert df into an RDD
>>> df.toJSON().first()        # Convert df into an RDD of strings
>>> df.toPandas()              # Return the contents of df as Pandas DataFrame

Write & Save to Files

>>> df.select("firstName", "city") \
      .write \
      .save("[Link]")
>>> df.select("firstName", "age") \
      .write \
      .save("[Link]",format="json")

Stopping SparkSession

>>> spark.stop()
