
PySpark SQL CHEAT SHEET

Initializing Spark Session
• >>> from pyspark.sql import SparkSession
• >>> spark = SparkSession.builder \
        .appName("PySpark SQL") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

From Spark Data Sources
• JSON
>>> df = spark.read.json("table.json")
>>> df.show()
>>> df2 = spark.read.load("table2.json", format="json")
• Parquet Files
>>> df3 = spark.read.load("newFile.parquet")


SQL Queries
• Select
>>> from pyspark.sql import functions as f
>>> df.select("col1").show()
>>> df.select("col2", "col3").show()
• When
>>> df.select("col1", f.when(df.col2 > 30, 1).otherwise(0)).show()
>>> df[df.col1.isin("A", "B")].collect()
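As a self-contained illustration of when/otherwise, the sketch below builds a small in-memory DataFrame; the sample values and the flag alias are assumptions for this example.

# Hypothetical sample data for illustration only
>>> demo = spark.createDataFrame([("A", 25), ("B", 45), ("C", 10)], ["col1", "col2"])
# 1 where col2 > 30, otherwise 0
>>> demo.select("col1", f.when(demo.col2 > 30, 1).otherwise(0).alias("flag")).show()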
Inspect Data
• >>> df.dtypes -- Return df column names and data types
• >>> df.show() -- Display the content of df
• >>> df.head(n) -- Return the first n rows
• >>> df.first() -- Return the first row
• >>> df.schema -- Return the schema of df
• >>> df.describe().show() -- Compute summary statistics
• >>> df.columns -- Return the columns of df
• >>> df.count() -- Count the number of rows in df
• >>> df.distinct().count() -- Count the number of distinct rows in df
• >>> df.printSchema() -- Print the schema of df
• >>> df.explain() -- Print the (logical and physical) plans
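A hedged example combining a few of these inspection calls; df is assumed to be any DataFrame loaded as in the sections above.

# Column names and types as a list of (name, type) pairs
>>> df.dtypes
# Summary statistics for a single column
>>> df.describe("col2").show()
# extended=True prints the logical plans as well as the physical plan
>>> df.explain(extended=True)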
Running SQL Queries Programmatically
• Registering Data Frames as Views:
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")
• Query Views
>>> df_one = spark.sql("SELECT * FROM customer").show()
>>> df_new = spark.sql("SELECT * FROM global_temp.people").show()
Creating Data Frames

# import pyspark class Row from module sql
>>> from pyspark.sql import Row
• Infer Schema:
>>> sc = spark.sparkContext
>>> lines = sc.textFile("Filename.txt")
>>> parts = lines.map(lambda x: x.split(","))
>>> C = parts.map(lambda a: Row(col1=a[0], col2=int(a[1])))
>>> C_df = spark.createDataFrame(C)
• Specify Schema:
>>> from pyspark.sql.types import StructField, StringType, StructType
>>> C = parts.map(lambda a: Row(col1=a[0], col2=a[1].strip()))
>>> schemaString = "col1 col2"
>>> D = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> E = StructType(D)
>>> spark.createDataFrame(C, E).show()

col1  col2
row1  3
row2  4
row3  5
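A self-contained variant of the explicit-schema approach, using in-memory tuples instead of a text file; the sample rows mirror the table above, and typing col2 as an integer is an assumption.

>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# col1 as string, col2 as integer
>>> schema = StructType([StructField("col1", StringType(), True), StructField("col2", IntegerType(), True)])
# Sample rows matching the schema
>>> spark.createDataFrame([("row1", 3), ("row2", 4), ("row3", 5)], schema).show()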
Column Operations
• Add
>>> from pyspark.sql.functions import explode
>>> df = df.withColumn('col1', df.table.col1) \
        .withColumn('col2', df.table.col2) \
        .withColumn('col3', df.table.col3) \
        .withColumn('col4', df.table.col4) \
        .withColumn('col5', explode(df.table.col5))
• Update
>>> df = df.withColumnRenamed('col1', 'column1')
• Remove
>>> df = df.drop("col3", "col4")
>>> df = df.drop(df.col3).drop(df.col4)
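A hedged sketch of adding a derived column with withColumn, assuming df has a numeric col2 column as in the sample table above; the new column names are made up for this example.

# Add a computed column, then rename it
>>> df2 = df.withColumn("col2_doubled", df.col2 * 2) \
        .withColumnRenamed("col2_doubled", "col2_x2")
>>> df2.printSchema()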

Actions
• Group By:
>>> df.groupBy("col1").count().show()
• Filter:
>>> df.filter(df["col2"] > 4).show()
• Sort:
>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("col1", ascending=False).collect()
>>> df.orderBy(["col1", "col3"], ascending=[0, 1]).collect()
• Missing & Replacing Values:
>>> df.na.fill(20).show()
>>> df.na.drop().show()
>>> df.na.replace(10, 20).show()
• Repartitioning:
>>> df.repartition(10).rdd.getNumPartitions() -- df with 10 partitions
>>> df.coalesce(1).rdd.getNumPartitions()
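Building on Group By above, a hedged sketch of grouping with explicit aggregations, using the functions alias f imported in the SQL Queries section; the output column names are assumptions.

# Count rows and average col2 per group, ordered by the group key
>>> df.groupBy("col1") \
        .agg(f.count("*").alias("n"), f.avg("col2").alias("avg_col2")) \
        .orderBy("col1") \
        .show()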
Output Operations
• Data Structures:
>>> rdd_1 = df.rdd
>>> df.toJSON().first()
>>> df.toPandas()
• Write & Save to Files:
>>> df.select("col1", "col2").write.save("newFile.parquet")
>>> df.select("col3", "col5").write.save("table_new.json", format="json")

Stopping SparkSession
>>> spark.stop()
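Putting the output and shutdown steps together, a minimal end-to-end sketch; the paths, column names, and sample rows are assumptions for illustration.

# Hypothetical stand-in for df from the examples above
>>> out = spark.createDataFrame([("row1", 3), ("row2", 4)], ["col1", "col2"])
# mode("overwrite") replaces any existing output at the target path
>>> out.write.mode("overwrite").save("/tmp/newFile.parquet")
>>> out.write.mode("overwrite").save("/tmp/table_new.json", format="json")
# Collect to local structures, then shut the session down
>>> out.toPandas()
>>> spark.stop()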
