PySpark SQL Functions | array method
PySpark SQL Functions' array(~) method combines multiple columns into a single column of
arrays.
NOTE
If you want to combine multiple columns of array-type, then use the concat(~) method instead.
Parameters
1. *cols | string or Column
The columns to combine.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([['A', 'a', '1'], ['B', 'b', '2'], ['C', 'c', '3']], ['col1', 'col2', 'col3'])
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| a| 1|
| B| b| 2|
| C| c| 3|
+----+----+----+
To combine the columns col1, col2 and col3 into a single column of arrays, use
the array(~) method:
from pyspark.sql import functions as F
# Assign label to PySpark column returned by array(~) using alias(~)
df.select(F.array('col1','col2','col3').alias('combined_col')).show()
+------------+
|combined_col|
+------------+
| [A, a, 1]|
| [B, b, 2]|
| [C, c, 3]|
+------------+
Instead of passing column labels, we could also supply Column objects:
df.select(F.array(F.col('col1'),df['col2'],'col3').alias('combined_col')).show()
+------------+
|combined_col|
+------------+
| [A, a, 1]|
| [B, b, 2]|
| [C, c, 3]|
+------------+
PySpark SQL Functions | collect_set method
schedule AUG 12, 2023
PySpark SQL Functions' collect_set(~) method returns a unique set of values in a column.
Null values are ignored.
NOTE
Use collect_list(~) instead to obtain a list of values that allows for duplicates.
Parameters
1. col | string or Column object
The column label or a Column object.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
WARNING
Assume that the order of the returned set is random, since the order is affected
by shuffle operations.
Examples
Consider the following PySpark DataFrame:
data = [("Alex", "A"), ("Alex", "B"), ("Bob", "A"), ("Cathy", "C"), ("Dave", None)]
df = spark.createDataFrame(data, ["name", "group"])
df.show()
+-----+-----+
| name|group|
+-----+-----+
| Alex| A|
| Alex| B|
| Bob| A|
|Cathy| C|
| Dave| null|
+-----+-----+
Getting a set of column values in PySpark
To get the unique set of values in the group column:
import pyspark.sql.functions as F
df.select(F.collect_set("group")).show()
+------------------+
|collect_set(group)|
+------------------+
| [C, B, A]|
+------------------+
Equivalently, you can pass in a Column object to collect_set(~) as well:
import pyspark.sql.functions as F
df.select(F.collect_set(df.group)).show()
+------------------+
|collect_set(group)|
+------------------+
| [C, B, A]|
+------------------+
Notice how the null value does not appear in the resulting set.
Getting the set as a standard list
To get the set as a standard list:
list_rows = df.select(F.collect_set(df.group)).collect()
list_rows[0][0]
['C', 'B', 'A']
Here, the PySpark DataFrame's collect() method returns a list of Row objects. This list is
guaranteed to be length one due to the nature of collect_set(~). The Row object contains
the list so we need to include another [0].
Getting a set of column values of each group in PySpark
The method collect_set(~) is often used in the context of aggregation. Consider the same
PySpark DataFrame as before:
df.show()
+-----+-----+
| name|group|
+-----+-----+
| Alex| A|
| Alex| B|
| Bob| A|
|Cathy| C|
| Dave| null|
+-----+-----+
To flatten the group column into a single set for each name:
import pyspark.sql.functions as F
df.groupby("name").agg(F.collect_set("group")).show()
+-----+------------------+
| name|collect_set(group)|
+-----+------------------+
| Alex| [B, A]|
| Bob| [A]|
|Cathy| [C]|
+-----+------------------+
PySpark SQL Functions | concat method
PySpark SQL Functions' concat(~) method concatenates string and array columns.
Parameters
1. *cols | string or Column
The columns to concatenate.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Concatenating string-based columns in PySpark
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "Wong"], ["Bob", "Marley"]], ["fname", "lname"])
df.show()
+-----+------+
|fname| lname|
+-----+------+
| Alex| Wong|
| Bob|Marley|
+-----+------+
To concatenate fname and lname:
import pyspark.sql.functions as F
df.select(F.concat("fname", "lname")).show()
+--------------------+
|concat(fname, lname)|
+--------------------+
| AlexWong|
| BobMarley|
+--------------------+
If you wanted to include a space between the two columns, you can use F.lit(" ") like so:
import pyspark.sql.functions as F
df.select(F.concat("fname", F.lit(" "), "lname")).show()
+-----------------------+
|concat(fname, , lname)|
+-----------------------+
| Alex Wong|
| Bob Marley|
+-----------------------+
F.lit(" ") is a Column object whose values are filled with " ".
You could also add an alias to the returned column like so:
df.select(F.concat("fname", "lname").alias("COMBINED NAME")).show()
+-------------+
|COMBINED NAME|
+-------------+
| AlexWong|
| BobMarley|
+-------------+
You could also pass in Column objects instead of column labels:
df.select(F.concat(df.fname, F.col("lname"))).show()
+--------------------+
|concat(fname, lname)|
+--------------------+
| AlexWong|
| BobMarley|
+--------------------+
Concatenating array-based columns in PySpark
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[ [4,5], [6]], [ [7], [8,9] ]], ["A", "B"])
df.show()
+------+------+
| A| B|
+------+------+
|[4, 5]| [6]|
| [7]|[8, 9]|
+------+------+
To concatenate the arrays of each column:
import pyspark.sql.functions as F
df.select(F.concat("A", "B")).show()
+------------+
|concat(A, B)|
+------------+
| [4, 5, 6]|
| [7, 8, 9]|
+------------+
PySpark SQL Functions | count_distinct method
PySpark SQL Functions' count_distinct(~) method counts the number of distinct values in the
specified columns.
Parameters
1. *cols | string or Column
The columns in which to count the number of distinct values.
Return Value
A PySpark Column holding an integer.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A"], ["Bob", "A"], ["Cathy", "B"]], ["name", "class"])
df.show()
+-----+-----+
| name|class|
+-----+-----+
| Alex| A|
| Bob| A|
|Cathy| B|
+-----+-----+
Counting the number of distinct values in a single column in PySpark
To count the number of distinct values in the class column:
from pyspark.sql import functions as F
df.select(F.count_distinct("class").alias("c")).show()
+---+
| c|
+---+
| 2|
+---+
Here, we are giving the name "c" to the Column returned by count_distinct(~) via alias(~).
Note that we could also supply a Column object to count_distinct(~) instead:
df.select(F.count_distinct(df["class"]).alias("c")).show()
+---+
| c|
+---+
| 2|
+---+
Obtaining an integer count
By default, count_distinct(~) returns a PySpark Column. To get an integer count instead:
df.select(F.count_distinct(df["class"])).collect()[0][0]
2
Here, we use the select(~) method to convert the Column into a PySpark DataFrame. We
then use the collect(~) method to convert the DataFrame into a list of Row objects. Since
there is only one Row in this list, as well as only one value in the Row, we use [0][0] to
access the integer count.
Counting the number of distinct values in a set of columns in PySpark
To count the number of distinct values for the columns name and class:
df.select(F.count_distinct("name", "class").alias("c")).show()
+---+
| c|
+---+
| 3|
+---+
PySpark SQL Functions | countDistinct method
PySpark SQL Functions' countDistinct(~) method returns the number of distinct rows for the
specified columns.
Parameters
1. col | string or Column
The column to consider when counting distinct rows.
2. *cols | string or Column | optional
The additional columns to consider when counting distinct rows.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Alex", 25], ["Alex", 50]], ["name",
"age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
|Alex| 25|
|Alex| 50|
+----+---+
Counting the number of distinct values in a single PySpark column
To count the number of distinct rows in the column name:
import pyspark.sql.functions as F
df.select(F.countDistinct("name")).show()
+--------------------+
|count(DISTINCT name)|
+--------------------+
| 2|
+--------------------+
Note that instead of passing in the column label ("name"), you can pass in a Column object
like so:
# df.select(F.countDistinct(df.name)).show()
df.select(F.countDistinct(F.col("name"))).show()
+--------------------+
|count(DISTINCT name)|
+--------------------+
| 2|
+--------------------+
Counting the number of distinct values in multiple PySpark columns
To consider the columns name and age when counting distinct rows:
df.select(F.countDistinct("name", "age")).show()
+-------------------------+
|count(DISTINCT name, age)|
+-------------------------+
| 3|
+-------------------------+
Counting the number of distinct rows in PySpark DataFrame
To consider all columns when counting distinct rows, pass in "*":
df.select(F.countDistinct("*")).show()
+-------------------------+
|count(DISTINCT name, age)|
+-------------------------+
| 3|
+-------------------------+
PySpark SQL Functions | date_add method
PySpark's date_add(~) method adds the specified number of days to a date column.
Parameters
1. start | string or Column
The column of starting dates.
2. days | int
The number of days to add.
Return Value
A pyspark.sql.column.Column object.
Examples
Basic usage
Consider the following DataFrame:
df = spark.createDataFrame([["2023-04-20"], ["2023-04-22"]], ["my_date"])
df.show()
+----------+
| my_date|
+----------+
|2023-04-20|
|2023-04-22|
+----------+
To add 5 days to our column:
from pyspark.sql import functions as F
df.select(F.date_add("my_date", 5)).show()
+--------------------+
|date_add(my_date, 5)|
+--------------------+
| 2023-04-25|
| 2023-04-27|
+--------------------+
Adding a column of days to a column of dates
Unfortunately, the date_add(~) method only accepts a constant for the second parameter.
To add a column of days to a column of dates, we must take another approach.
To demonstrate, consider the following PySpark DataFrame:
df = spark.createDataFrame([["2023-04-20", 3], ["2023-04-22", 5]], ["my_date", "my_days"])
df.show()
+----------+-------+
| my_date|my_days|
+----------+-------+
|2023-04-20| 3|
|2023-04-22| 5|
+----------+-------+
To add my_days to my_date, supply a SQL expression to the F.expr(~) method like so:
# Cast to INT first - by default, integers have type BIGINT (F.expr(~) will raise an error)
df = df.withColumn("my_days", df["my_days"].cast("int"))
df_new = df.withColumn("new_date", F.expr("date_add(my_date, my_days)"))
df_new.show()
+----------+-------+----------+
| my_date|my_days| new_date|
+----------+-------+----------+
|2023-04-20| 3|2023-04-23|
|2023-04-22| 5|2023-04-27|
+----------+-------+----------+
The resulting data type of the columns is as follows:
df_new.printSchema()
root
|-- my_date: string (nullable = true)
|-- my_days: integer (nullable = true)
|-- new_date: date (nullable = true)
Notice how even though my_date is of type string, the resulting new_date is of type date.
PySpark SQL Functions | date_format method
PySpark SQL Functions' date_format(~) method converts a date, timestamp or string into a
date string with the specified format.
Parameters
1. date | Column or string
The date column - this could be of type date, timestamp or string.
2. format | string
The format of the resulting date string.
Return Value
A Column object of date strings.
Examples
Formatting date strings in PySpark DataFrame
Consider the following PySpark DataFrame with some date strings:
df = spark.createDataFrame([["Alex", "1995-12-16"], ["Bob", "1998-05-06"]], ["name",
"birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1998-05-06|
+----+----------+
To convert the date strings in the column birthday:
from pyspark.sql import functions as F
df.select(F.date_format("birthday", "dd/MM/yyyy").alias("birthday_new")).show()
+------------+
|birthday_new|
+------------+
| 16/12/1995|
| 06/05/1998|
+------------+
Here:
"dd/MM/yyyy" indicates a date string starting with the day, then month, then year.
alias(~) is used to give a name to the Column object returned by date_format(~).
Formatting datetime values in PySpark DataFrame
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995,12,16)], ["Bob",
datetime.date(1995,5,9)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
To convert the datetime values in column birthday:
df.select(F.date_format("birthday", "dd-MM-yyyy").alias("birthday_new")).show()
+------------+
|birthday_new|
+------------+
| 16-12-1995|
|  09-05-1995|
+------------+
Here, we are using the date format "dd-MM-yyyy", which means day first, and then month
followed by year. We also assign the column name "birthday_new" to the Column returned
by date_format().
PySpark SQL Functions | dayofmonth method
PySpark SQL Functions' dayofmonth(~) method extracts the day component of each column
value, which can be of type date or a date string.
Parameters
1. col | string or Column
The date column from which to extract the day of the month.
Return Value
A Column object of integers.
Examples
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995,12,16)], ["Bob",
datetime.date(1995,5,9)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
Extracting the day component of datetime values in PySpark DataFrame
To extract the day component of date values:
from pyspark.sql import functions as F
df.select(F.dayofmonth("birthday").alias("day")).show()
+---+
|day|
+---+
| 16|
| 9|
+---+
Here, we are assigning a label to the Column returned by dayofmonth(~) using alias(~).
Extracting the day component of date strings in PySpark DataFrame
To extract the day component of date strings:
df = spark.createDataFrame([["Alex", "1995-12-16"], ["Bob", "1990-05-06"]], ["name",
"birthday"])
df.select(F.dayofmonth("birthday").alias("day")).show()
+---+
|day|
+---+
| 16|
|  6|
+---+
PySpark SQL Functions | dayofweek method
PySpark SQL Functions' dayofweek(~) method extracts the day of the week of each datetime
value or date string of a PySpark column.
Parameters
1. col | string or Column
The date column from which to extract the day of the week.
Return Value
A Column of integers.
Examples
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995,12,16)], ["Bob",
datetime.date(1995,5,9)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
Getting the day of the week from datetime values in PySpark DataFrame
To get the day of the week in PySpark DataFrame:
import pyspark.sql.functions as F
df.select(F.dayofweek('birthday').alias('day')).show()
+---+
|day|
+---+
| 7|
|  3|
+---+
Here:
1 represents Sunday while 7 represents Saturday.
we are using alias(~) to give a label to the Column object returned by dayofweek(~).
Getting the day of week from date strings in PySpark DataFrame
Note that the method still works even if the date is in string form:
df = spark.createDataFrame([["Alex", "1995-12-16"], ["Bob", "1990-05-06"]], ["name",
"birthday"])
df.select(F.dayofweek("birthday").alias("day")).show()
+---+
|day|
+---+
| 7|
| 1|
+---+
PySpark SQL Functions | dayofyear method
PySpark SQL Functions' dayofyear(~) method extracts the day of the year of each datetime
string or datetime value in a PySpark column.
Parameters
1. col | string or Column
The date column from which to extract the day of the year.
Return Value
A Column object of integers.
Examples
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995,12,16)], ["Bob",
datetime.date(1995,5,6)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-06|
+----+----------+
Getting the day of the year of date values in PySpark DataFrame
To get the day of the year of date values:
import pyspark.sql.functions as F
df.select(F.dayofyear("birthday").alias("day")).show()
+---+
|day|
+---+
|350|
|126|
+---+
Here, we are assigning the label "day" to the Column returned by dayofyear(~).
Getting the day of the year of date strings in PySpark DataFrame
Note that the method still works even if the date is in string form:
df = spark.createDataFrame([["Alex", "1995-12-16"], ["Bob", "1990-05-06"]], ["name",
"birthday"])
df.select(F.dayofyear("birthday").alias("day")).show()
+---+
|day|
+---+
|350|
|126|
+---+
PySpark SQL Functions | element_at method
PySpark SQL Functions' element_at(~) method is used to extract values from lists or maps in
a PySpark Column.
Parameters
1. col | string or Column
The column of lists or maps from which to extract values.
2. extraction | int
The position of the value that you wish to extract. Negative positioning is supported
- extraction=-1 will extract the last element from each list.
WARNING
The position is not index-based. This means that extraction=1 will extract the first value in
the lists or maps.
Return Value
A new PySpark Column.
Examples
Extracting n-th value from arrays in PySpark Column
Consider the following PySpark DataFrame that contains some lists:
rows = [[[5,6]], [[7,8]]]
df = spark.createDataFrame(rows, ['vals'])
df.show()
+------+
| vals|
+------+
|[5, 6]|
|[7, 8]|
+------+
To extract the second value from each list in vals, we can use element_at(~) like so:
import pyspark.sql.functions as F
df_res = df.select(F.element_at('vals',2).alias('2nd value'))
df_res.show()
+---------+
|2nd value|
+---------+
| 6|
| 8|
+---------+
Here, note the following:
the position 2 is not index-based.
we are using the alias(~) method to assign a label to the column returned by element_at(~).
Note that extracting values that are out of bounds will return null:
df_res = df.select(F.element_at('vals',3))
df_res.show()
+-------------------+
|element_at(vals, 3)|
+-------------------+
| null|
| null|
+-------------------+
We can also extract the last element by supplying a negative value for extraction:
df_res = df.select(F.element_at('vals',-1).alias('last value'))
df_res.show()
+----------+
|last value|
+----------+
| 6|
| 8|
+----------+
Extracting values from maps in PySpark Column
Consider the following PySpark DataFrame containing some dict values:
rows = [[{'A':4}], [{'A':5, 'B':6}]]
df = spark.createDataFrame(rows, ['vals'])
df.show()
+----------------+
| vals|
+----------------+
| {A -> 4}|
|{A -> 5, B -> 6}|
+----------------+
To extract the values that have the key 'A' in the vals column:
df_res = df.select(F.element_at('vals', F.lit('A')))
df_res.show()
+-------------------+
|element_at(vals, A)|
+-------------------+
| 4|
| 5|
+-------------------+
Note that extracting values using keys that do not exist will return null:
df_res = df.select(F.element_at('vals', F.lit('B')))
df_res.show()
+-------------------+
|element_at(vals, B)|
+-------------------+
| null|
| 6|
+-------------------+
Here, the key 'B' does not exist in the map {'A':4} so a null was returned for that row.
PySpark SQL Functions | explode method
PySpark SQL Functions' explode(~) method flattens the specified column values of
type list or dictionary.
Parameters
1. col | string or Column
The column containing lists or dictionaries to flatten.
Return Value
A new PySpark Column.
Examples
Flattening lists
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[['a','b']],[['d']]], ['vals'])
df.show()
+------+
| vals|
+------+
|[a, b]|
| [d]|
+------+
Here, the column vals contains lists.
To flatten the lists in the column vals, use the explode(~) method:
import pyspark.sql.functions as F
df.select(F.explode('vals').alias('exploded')).show()
+--------+
|exploded|
+--------+
| a|
| b|
| d|
+--------+
Here, we are using the alias(~) method to assign a label to the column returned
by explode(~).
Flattening dictionaries
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[{'a':'b'}],[{'c':'d','e':'f'}]], ['vals'])
df.show()
+----------------+
| vals|
+----------------+
| {a -> b}|
|{e -> f, c -> d}|
+----------------+
Here, the column vals contains dictionaries.
To flatten each dictionary in column vals, use the explode(~) method:
df.select(F.explode('vals').alias('exploded_key', 'exploded_val')).show()
+------------+------------+
|exploded_key|exploded_val|
+------------+------------+
| a| b|
| e| f|
| c| d|
+------------+------------+
In the case of dictionaries, the explode(~) method returns two columns - the first column
contains all the keys while the second column contains all the values.
PySpark SQL Functions | expr method
PySpark SQL Functions' expr(~) method parses the given SQL expression.
Parameters
1. str | string
The SQL expression to parse.
Return Value
A PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([['Alex',30],['Bob',50]], ['name','age'])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 30|
| Bob| 50|
+----+---+
Using the expr method to convert column values to uppercase
The expr(~) method takes in as argument a SQL expression, so we can use SQL functions
such as upper(~):
import pyspark.sql.functions as F
df.select(F.expr('upper(name)')).show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
+-----------+
NOTE
The expr(~) method can often be more succinctly written using PySpark
DataFrame's selectExpr(~) method. For instance, the above case can be rewritten as:
df.selectExpr('upper(name)').show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
+-----------+
I recommend that you use selectExpr(~) whenever possible because:
you won't have to import the SQL functions library (pyspark.sql.functions).
the syntax is shorter.
Parsing complex SQL expressions using expr method
Here's a more complex SQL expression using clauses like AND and LIKE:
df.select(F.expr('age > 40 AND name LIKE "B%"').alias('result')).show()
+------+
|result|
+------+
| false|
| true|
+------+
Note the following:
we are checking for rows where age is larger than 40 and name starts with B.
we are assigning the label 'result' to the Column returned by expr(~) using
the alias(~) method.
Practical applications of boolean masks returned by expr method
As we can see in the above example, the expr(~) method can return a boolean mask
depending on the SQL expression you supply:
df.select(F.expr('age > 40 AND name LIKE "B%"').alias('result')).show()
+------+
|result|
+------+
| false|
| true|
+------+
This allows us to check for the existence of rows that satisfy a given condition using any(~):
df.select(F.expr('any(age > 40 AND name LIKE "B%")').alias('exists?')).show()
+-------+
|exists?|
+-------+
| true|
+-------+
Here, we get True because there exists at least one True value in the boolean mask.
Mapping column values using expr method
We can map column values using CASE WHEN in the expr(~) method like so:
col = F.expr('CASE WHEN age < 40 THEN "JUNIOR" ELSE "SENIOR" END').alias('result')
df.withColumn('status', col).show()
+----+---+------+
|name|age|status|
+----+---+------+
|Alex| 30|JUNIOR|
| Bob| 50|SENIOR|
+----+---+------+
Here, note the following:
we are using the DataFrame's withColumn(~) method to obtain a new PySpark DataFrame
that includes the column returned by expr(~).
PySpark SQL Functions | first method
PySpark's SQL function first(~) method returns the first value of the specified column of a
PySpark DataFrame.
Parameters
1. col | string or Column object
The column label or Column object of interest.
2. ignorenulls | boolean | optional
Whether or not to ignore null values. By default, ignorenulls=False.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
columns = ["name", "age"]
data = [("Alex", 15), ("Bob", 20), ("Cathy", 25)]
df = spark.createDataFrame(data, columns)
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
Getting the first value of a column in PySpark DataFrame
To get the first value of the name column:
import pyspark.sql.functions as F
df.select(F.first(df.name)).show()
+-----------+
|first(name)|
+-----------+
| Alex|
+-----------+
Getting the first non-null value of a column in PySpark DataFrame
Consider the following PySpark DataFrame with null values:
columns = ["name", "age"]
data = [("Alex", None), ("Bob", 20), ("Cathy", 25)]
df = spark.createDataFrame(data, columns)
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex|null|
| Bob| 20|
|Cathy| 25|
+-----+----+
By default, ignorenulls=False, which means that the first value is returned regardless of
whether it is null or not:
df.select(F.first(df.age)).show()
+----------+
|first(age)|
+----------+
| null|
+----------+
To return the first non-null value instead:
df.select(F.first(df.age, ignorenulls=True)).show()
+----------+
|first(age)|
+----------+
| 20|
+----------+
Getting the first value of each group in PySpark
The first(~) method is also useful in aggregations. Consider the following PySpark
DataFrame:
columns = ["name", "class"]
data = [("Alex", "A"), ("Alex", "B"), ("Bob", None), ("Bob", "A"), ("Cathy", "C")]
df = spark.createDataFrame(data, columns)
df.show()
+-----+-----+
| name|class|
+-----+-----+
| Alex| A|
| Alex| B|
| Bob| null|
| Bob| A|
|Cathy| C|
+-----+-----+
To get the first value of each aggregate:
df.groupby("name").agg(F.first("class")).show()
+-----+------------+
| name|first(class)|
+-----+------------+
| Alex| A|
| Bob| null|
|Cathy| C|
+-----+------------+
Here, we are grouping by name, and then for each of these groups, we are obtaining the first
value that occurred in the class column.
PySpark SQL Functions | greatest method
PySpark SQL Functions' greatest(~) method returns the maximum value of each row in the
specified columns. Note that you must specify two or more columns.
Parameters
1. *cols | string or Column
The columns from which to compute the maximum values.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 100, 200], ["Bob", 150, 50]], ["name", "salary",
"bonus"])
df.show()
+----+------+-----+
|name|salary|bonus|
+----+------+-----+
|Alex| 100| 200|
| Bob| 150| 50|
+----+------+-----+
Getting the largest value of each row in PySpark DataFrame
To get the largest value of each row in the columns salary and bonus:
import pyspark.sql.functions as F
df.select(F.greatest("salary", "bonus")).show()
+-----------------------+
|greatest(salary, bonus)|
+-----------------------+
| 200|
| 150|
+-----------------------+
To append this column to the existing PySpark DataFrame, use the withColumn(~) method:
import pyspark.sql.functions as F
df.withColumn("my_max", F.greatest("salary", "bonus")).show()
+----+------+-----+------+
|name|salary|bonus|my_max|
+----+------+-----+------+
|Alex| 100| 200| 200|
| Bob| 150| 50| 150|
+----+------+-----+------+
PySpark SQL Functions | instr method
PySpark SQL Functions' instr(~) method returns a new PySpark Column holding the position
of the first occurrence of the specified substring in each value of the specified column.
WARNING
The position is not index-based, and starts from 1 instead of 0.
Parameters
1. str | string or Column
The column to perform the operation on.
2. substr | string
The substring of which to check the position.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("ABA",), ("BBB",), ("CCC",), (None,)], ["x",])
df.show()
+----+
| x|
+----+
| ABA|
| BBB|
| CCC|
|null|
+----+
Getting the position of the first occurrence of a substring in PySpark Column
To get the position of the first occurrence of the substring "B" in column x, use
the instr(~) method:
import pyspark.sql.functions as F
df.select(F.instr("x", "B")).show()
+-----------+
|instr(x, B)|
+-----------+
| 2|
| 1|
| 0|
| null|
+-----------+
Here, note the following:
we see 2 returned for the column value "ABA" because the substring "B" occurs in the 2nd
position - remember, this method counts position from 1 instead of 0.
if the substring does not exist in the string, then a value of 0 is returned. This is the case
for "CCC" because this string does not include "B".
if the string is null, then the result will also be null.
PySpark SQL Functions | isnan method
PySpark SQL Functions' isnan(~) method returns True where the column value is NaN (not-a-
number).
NOTE
PySpark treats null and NaN as separate entities. Please refer to our isNull(~) method for
more details.
Parameters
1. col | string or Column object
The column label or Column object to target.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
import numpy as np
df = spark.createDataFrame([["Alex", 25.0], ["Bob", np.nan], ["Cathy", float("nan")]],
["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex|25.0|
| Bob| NaN|
|Cathy| NaN|
+-----+----+
To get all rows where age is NaN:
from pyspark.sql import functions as F
df.where(F.isnan("age")).show()
+-----+---+
| name|age|
+-----+---+
| Bob|NaN|
|Cathy|NaN|
+-----+---+
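The null-versus-NaN distinction that isnan(~) relies on can be seen in plain Python - a sketch of the semantics, not Spark internals:

```python
import math

# Same rows as the example DataFrame
rows = [("Alex", 25.0), ("Bob", float("nan")), ("Cathy", float("nan"))]

def is_nan(x):
    """isnan(~) is True only for NaN values; None (null) is a separate entity."""
    return x is not None and math.isnan(x)

nan_rows = [name for name, age in rows if is_nan(age)]
print(nan_rows)  # ['Bob', 'Cathy']
```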
PySpark SQL Functions | last method
PySpark SQL Functions' last(~) method returns the last value of the specified column.
Parameters
1. col | string or Column object
The column label or Column object of interest.
2. ignorenulls | boolean | optional
Whether or not to ignore null values. By default, ignorenulls=False.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 15), ("Bob", 20), ("Cathy", 25)], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
Getting the last value of a PySpark column
To get the last value of the name column:
df.select(F.last("name")).show()
+----------+
|last(name)|
+----------+
| Cathy|
+----------+
Note we can also pass a Column object instead:
import pyspark.sql.functions as F
# df.select(F.last(F.col("name"))).show()
df.select(F.last(df.name)).show()
+----------+
|last(name)|
+----------+
| Cathy|
+----------+
Getting the last non-null value in PySpark column
Consider the following PySpark DataFrame with null values:
df = spark.createDataFrame([("Alex", 15), ("Bob", 20), ("Cathy", None)], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 15|
| Bob| 20|
|Cathy|null|
+-----+----+
By default, ignorenulls=False, which means that the last value is returned regardless of
whether it is null or not:
df.select(F.last(df.age)).show()
+---------+
|last(age)|
+---------+
| null|
+---------+
To return the last non-null value instead, set ignorenulls=True:
df.select(F.last(df.age, ignorenulls=True)).show()
+---------+
|last(age)|
+---------+
| 20|
+---------+
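The two behaviours of last(~) can be modelled in plain Python - a sketch of the semantics on a single, already-ordered sequence (in Spark, the order can additionally be affected by shuffles):

```python
# Same values as the age column: [15, 20, None]
ages = [15, 20, None]

# ignorenulls=False (the default): take the last value as-is
last_any = ages[-1]

# ignorenulls=True: take the last non-null value
non_null = [v for v in ages if v is not None]
last_non_null = non_null[-1] if non_null else None

print(last_any, last_non_null)  # None 20
```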
Getting the last value of each group in PySpark
The last(~) method is also useful in aggregations. Consider the following PySpark DataFrame:
data = [("Alex", "A"), ("Alex", "B"), ("Bob", None), ("Bob", "A"), ("Cathy", "C")]
df = spark.createDataFrame(data, ["name", "class"])
df.show()
+-----+-----+
| name|class|
+-----+-----+
| Alex| A|
| Alex| B|
| Bob| null|
| Bob| A|
|Cathy| C|
+-----+-----+
To get the last value of each aggregate:
df.groupby("name").agg(F.last("class")).show()
+-----+-----------+
| name|last(class)|
+-----+-----------+
| Alex| B|
| Bob| A|
|Cathy| C|
+-----+-----------+
Here, we are grouping by name, and then for each of these groups, we obtain the last
value that occurred in the class column.
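The group-wise behaviour above can be sketched in plain Python, assuming the rows arrive in the order shown - an illustrative model only, since Spark's actual ordering depends on partitioning:

```python
data = [("Alex", "A"), ("Alex", "B"), ("Bob", None), ("Bob", "A"), ("Cathy", "C")]

# Keep the most recently seen class for each name
last_per_group = {}
for name, cls in data:
    last_per_group[name] = cls

print(last_per_group)  # {'Alex': 'B', 'Bob': 'A', 'Cathy': 'C'}
```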
PySpark SQL Functions | least method
PySpark SQL Functions' least(~) takes as input multiple columns, and returns a PySpark
Column holding the least value for every row of the input columns.
Parameters
1. *cols | string or Column
The input columns whose values will be compared row by row.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[20,10], [30,50], [40,90]], ["A", "B"])
df.show()
+---+---+
| A| B|
+---+---+
| 20| 10|
| 30| 50|
| 40| 90|
+---+---+
Getting the least value for every row of specified columns in PySpark
To get the least value for every row of columns A and B:
import pyspark.sql.functions as F
df.select(F.least("A","B")).show()
+-----------+
|least(A, B)|
+-----------+
| 10|
| 30|
| 40|
+-----------+
We can also pass Column objects instead of column labels:
df.select(F.least(df.A,df.B)).show()
+-----------+
|least(A, B)|
+-----------+
| 10|
| 30|
| 40|
+-----------+
Note that we can append the Column returned by least(~) by using withColumn(~):
df.withColumn("smallest", F.least("A","B")).show()
+---+---+--------+
| A| B|smallest|
+---+---+--------+
| 20| 10| 10|
| 30| 50| 30|
| 40| 90| 40|
+---+---+--------+
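The row-wise comparison performed by least(~) can be sketched in plain Python - note that the comparison runs across each row, not down a column:

```python
# Same rows as the example DataFrame with columns A and B
rows = [(20, 10), (30, 50), (40, 90)]

# least(A, B) takes the minimum within each row
smallest = [min(a, b) for a, b in rows]
print(smallest)  # [10, 30, 40]
```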
PySpark SQL Functions | length method
PySpark SQL Functions' length(~) method returns a new PySpark Column holding the lengths
of the string values in the specified column.
Parameters
1. col | string or Column
The column whose string values' length will be computed.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Computing the length of column strings in PySpark
To compute the length of each value of the name column, use the length(~) method:
import pyspark.sql.functions as F
df.select(F.length("name")).show()
+------------+
|length(name)|
+------------+
| 4|
| 3|
| 5|
+------------+
We could also pass in a Column object instead of a column label like so:
# df.select(F.length(df.name)).show()
df.select(F.length(F.col("name"))).show()
+------------+
|length(name)|
+------------+
| 4|
| 3|
| 5|
+------------+
Note that we can append a new column containing the length of the strings
using withColumn(~):
df_new = df.withColumn("name_length", F.length("name"))
df_new.show()
+-----+---+-----------+
| name|age|name_length|
+-----+---+-----------+
| Alex| 20| 4|
| Bob| 30| 3|
|Cathy| 40| 5|
+-----+---+-----------+
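For string columns, length(~) behaves like Python's built-in len applied per value - a plain-Python sketch of the semantics:

```python
# Same values as the name column
names = ["Alex", "Bob", "Cathy"]

# length(~) computes the character length of each string value
lengths = [len(n) for n in names]
print(lengths)  # [4, 3, 5]
```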
PySpark SQL Functions | lit method
PySpark SQL Functions' lit(~) method creates a Column object with the specified value.
Parameters
1. col | value
A value to fill the column.
Return Value
A Column object.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Creating a column of constants in PySpark DataFrame
To create a new PySpark DataFrame with the name column of df and a new column
called is_single made up of True values:
import pyspark.sql.functions as F
df2 = df.select(F.col("name"), F.lit(True).alias("is_single"))
df2.show()
+----+---------+
|name|is_single|
+----+---------+
|Alex| true|
| Bob| true|
+----+---------+
Here, F.lit(True) returns a Column object, which has a method called alias(~) that assigns a
label.
Note that you could append a new column of constants using the withColumn(~) method:
import pyspark.sql.functions as F
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df_new = df.withColumn("is_single", F.lit(True))
df_new.show()
+----+---+---------+
|name|age|is_single|
+----+---+---------+
|Alex| 20| true|
| Bob| 30| true|
+----+---+---------+
Creating a column whose values are based on a condition in PySpark
We can also use lit(~) to create a column whose values depend on some condition:
import pyspark.sql.functions as F
col = F.when(F.col("age") <= 20, F.lit("junior")).otherwise(F.lit("senior"))
df3 = df.withColumn("status", col)
df3.show()
+----+---+------+
|name|age|status|
+----+---+------+
|Alex| 20|junior|
| Bob| 30|senior|
+----+---+------+
Note the following:
we are using the when(~) and otherwise(~) pattern to fill the values of the column
conditionally.
we are using the withColumn(~) method to append a new column named status.
the F.lit("junior") can actually be replaced by the plain string "junior" - we use lit(~) here
simply to demonstrate one usage of the method.
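The when/otherwise labelling above can be modelled in plain Python - an illustrative sketch of the conditional logic, not Spark's implementation:

```python
rows = [("Alex", 20), ("Bob", 30)]

def status(age):
    """Model of when(age <= 20, 'junior').otherwise('senior')."""
    return "junior" if age <= 20 else "senior"

labeled = [(name, age, status(age)) for name, age in rows]
print(labeled)  # [('Alex', 20, 'junior'), ('Bob', 30, 'senior')]
```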
PySpark SQL Functions | lower method
PySpark SQL Functions' lower(~) method returns a new PySpark Column with the specified
column lower-cased.
Parameters
1. col | string or Column
The column to perform the lowercase operation on.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["ALEX", 25], ["BoB", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|ALEX| 25|
| BoB| 30|
+----+---+
Lowercasing strings in PySpark DataFrame
To lower-case the strings in the name column:
import pyspark.sql.functions as F
df.select(F.lower(df.name)).show()
+-----------+
|lower(name)|
+-----------+
| alex|
| bob|
+-----------+
Note that passing in a column label as a string also works:
import pyspark.sql.functions as F
df.select(F.lower("name")).show()
+-----------+
|lower(name)|
+-----------+
| alex|
| bob|
+-----------+
Replacing column with lowercased column in PySpark
To replace the name column with the lower-cased version, use the withColumn(~):
import pyspark.sql.functions as F
df.withColumn("name", F.lower(df.name)).show()
+----+---+
|name|age|
+----+---+
|alex| 25|
| bob| 30|
+----+---+
PySpark SQL Functions | max method
PySpark SQL Functions' max(~) method returns the maximum value in the specified column.
Parameters
1. col | string or Column
The column in which to obtain the maximum value.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Getting the maximum value of a PySpark column
To obtain the maximum age:
import pyspark.sql.functions as F
df.select(F.max("age")).show()
+--------+
|max(age)|
+--------+
| 30|
+--------+
To obtain the maximum age as an integer:
list_rows = df.select(F.max("age")).collect()
list_rows[0][0]
30
Here, the collect() method returns a list of Row objects, which in this case is length one
because the PySpark DataFrame returned by select(~) only has one row. The content of
the Row object can be accessed via [0].
Getting the maximum value of each group in PySpark
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20, "A"],
                            ["Bob", 30, "B"],
                            ["Cathy", 50, "A"]],
                           ["name", "age", "class"])
df.show()
+-----+---+-----+
| name|age|class|
+-----+---+-----+
| Alex| 20| A|
| Bob| 30| B|
|Cathy| 50| A|
+-----+---+-----+
To get the maximum age of each class:
df.groupby("class").agg(F.max("age").alias("MAX AGE")).show()
+-----+-------+
|class|MAX AGE|
+-----+-------+
| A| 50|
| B| 30|
+-----+-------+
Here, we are using the alias(~) method to assign a label to the aggregated age column.
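The group-wise maximum above can be sketched in plain Python - an illustrative model of groupby + max, not Spark's distributed implementation:

```python
rows = [("Alex", 20, "A"), ("Bob", 30, "B"), ("Cathy", 50, "A")]

# Maximum age per class
max_age = {}
for _, age, cls in rows:
    max_age[cls] = max(age, max_age.get(cls, age))

print(max_age)  # {'A': 50, 'B': 30}
```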
Getting Started with PySpark
What is PySpark?
PySpark is an API that allows you to write Python code to interact with Apache Spark, an
open-source distributed computing framework for handling big data. As the size of data
grows year over year, Spark has become a popular framework in the industry to efficiently
process large datasets for streaming, data engineering, real-time analytics, exploratory
data analysis and machine learning.
Why use PySpark?
The core value proposition behind PySpark is that:
Spark partitions the dataset into smaller chunks and stores them in multiple machines. By
doing so, Spark can efficiently process massive volumes of data in parallel. This is extremely
useful when you are dealing with large datasets that cannot fit into the memory of a single
machine.
PySpark can handle a wide array of data formats including Hadoop Distributed File System
(HDFS), Cassandra and Amazon S3.
Anatomy of Spark
The following diagram shows the main components of a Spark application:
Here, we are applying the map(~) transformation, which applies a function over each element to
yield a new RDD, and then we perform the filter(~) transformation to obtain a subset of the
data. RDDs are immutable, meaning RDD cannot be modified once created. When you
perform a transformation on a RDD, a new RDD is returned while the original is kept intact.
NOTE
Each newly created RDD holds a reference to the original RDD prior to the transformation.
This allows Spark to keep track of the sequence of transformations, which is referred to
as RDD lineage.
Actions
An action triggers a computation, and returns a value back to the Driver program, or writes
to a stable external storage system:
This should make sense because the data held by the RDD even after applying some
transformation is still partitioned into multiple nodes, and so we would need to aggregate
the outputs into a single place - the driver node in this case.
Examples of actions include show(), reduce() and collect().
WARNING
Since all the data from each node is sent over to the driver with an action, make sure that
the driver node has enough RAM to hold all the incoming data - otherwise, an out-of-
memory error will occur.
Lazy transformations
When you execute a transformation, Spark will not immediately perform the
transformation. Instead, the RDD will wait until an action is required, and only then will the
transformation fire. We call this behaviour lazy-execution, and it has the following
benefits:
Scheduling - better utilisation of the cluster
Some transformations can be grouped together to avoid network traffic
Spark jobs, stages and tasks
When you invoke an action (e.g. count(), take(), collect()) on an RDD, a job is created. Spark
will then internally decompose a job into a single or multiple stages. Next, Spark splits each
stage into tasks, which are units of work that the Spark driver’s scheduler ships
to executors on the worker nodes to handle. Each task processes one unit of partitioned
dataset in its memory.
Executors with one core
As an example, consider the following setup:
Here, our RDD is composed of 6 partitions, with 2 partitions on each worker node.
The executor threads are equipped with one CPU core, which means that only one task can
be performed by each executor at any given time. The total number of tasks is equal to the
number of partitions, which means that there are 6 tasks.
Executors with multiple cores
Multiple tasks can run in parallel on the same executor if you allocate more than one core to
each executor. Consider the following case:
Here, each executor is equipped with 2 cores. The total number of tasks here is 6, which is
the same as the previous case since there are still 6 partitions. With 2 cores,
each executor can handle 2 tasks in parallel. As you can tell from this example, the more
cores you allocate to each executor, the more tasks you can perform in parallel.
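The relationship between partitions, cores and parallelism can be sketched with a small helper - an illustrative model only, since real scheduling also depends on data locality and skew:

```python
import math

def task_waves(num_partitions, num_executors, cores_per_executor):
    """Number of scheduling 'waves' needed when each core runs one task at a time."""
    parallel_slots = num_executors * cores_per_executor
    return math.ceil(num_partitions / parallel_slots)

# 6 partitions, 3 executors with 1 core each -> tasks run in 2 waves
print(task_waves(6, 3, 1))  # 2
# 6 partitions, 3 executors with 2 cores each -> all tasks run in 1 wave
print(task_waves(6, 3, 2))  # 1
```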
Number of partitions
In Spark, we can choose the number of partitions by which to divide our dataset. For
instance, should we divide up our data into just a few partitions, or into hundreds of
partitions? We should choose carefully because the number of partitions has an immense
impact on the cluster's performance. As examples, let's explore the case of over-partitioning
and under-partitioning.
Under-partitioning
Consider the following case:
Here, each of our executors is equipped with 10 cores, but only 2 partitions reside at each
node. This means that each executor can tackle the two tasks assigned to it in parallel using
just 2 cores - the other 8 cores remain unused here. In other words, we are not making use
of the available cores here since the number of partitions is too small, that is, we are
underutilising our resources. A better configuration would be to have 10 partitions on each
worker node so that each executor can parse all 10 partitions on their node in parallel.
Over-partitioning
Consider the following case:
Here, we have 6 partitions residing in each worker node, which is equipped with only one
CPU core. The driver would need to create and schedule the same number of tasks as there
are partitions (18 in this case). There is considerable overhead in having to manage and
coordinate many small tasks. Therefore, having a large number of partitions is also not
desirable.
Recommended number of partitions
The official PySpark documentation recommends that there should be 2 to 4
partitions for each core in the executor. An example of this is as follows:
Here, we have 2 partitions per worker node, which holds an executor with one CPU core.
Note that the recommendation offered by the official documentation is only a rule of thumb -
you might want to experiment with different numbers of partitions. For instance, you might
find that assigning two cores to each executor here would boost performance since the 2
partitions can then be handled in parallel by each executor.
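The rule of thumb above can be expressed as a small helper - the factor of 3 below is an assumed middle value of the documented 2-to-4 range, not a hard rule:

```python
def recommended_partitions(total_cores, factor=3):
    """Rule of thumb: 2-4 partitions per core; factor=3 is an assumed midpoint."""
    return total_cores * factor

# e.g. 3 worker nodes with 1 core each, using the low end of the range
print(recommended_partitions(3, factor=2))  # 6
```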
Next steps
This introductory guide only covered the basics of PySpark. For your next step, we
recommend that you follow our Getting Started with PySpark on Databricks guide to get
some hands-on experience with PySpark programming on Databricks for free. After that,
you can read our Comprehensive guide to RDDs to learn much more about RDDs!
File system in Databricks
Folder               Description
databricks-datasets  Open-source datasets for exploration.
databricks-results   Stores results downloaded via the display(~) method.
Behind the scenes, Databricks will upload this file into the driver node's file system instead
of DBFS.
After uploading our hello_world.txt file, we should have the following files in our account
workspace:
Here, we will use our Demo notebook to write some Python code that reads
the hello_world.txt file.
Accessing workspace files programmatically
In our Demo notebook, we can check for the existence of our new file by:
%sh ls /Workspace/Users/support@skytowner.com
Demo
hello_world.txt
Here, we are using %sh instead of %fs because the workspace files are located in the driver
node's file system.
We can also programmatically access the text file like so:
with open("/Workspace/Users/support@skytowner.com/hello_world.txt", "r") as file:
    content = file.read()
    print(content)
Hello world
Mounting object storage to DBFS
Mounting Azure blob storage
Let's mount our Azure blob storage to DBFS such that we can access our storage directly
through DBFS. To do so, use the dbutils.fs.mount() method:
storage_account = "demostorageskytowner"
container = "democontainer"
access_key = "*****"
dbutils.fs.mount(
    source = f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point = "/mnt/my_demo_storage",
    extra_configs = {
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net": access_key
    }
)
True
Here, the access_key can be obtained in the Azure portal:
We can now access all the files in our blob storage using DBFS like so:
%fs ls /mnt/
path name size modificationTime
1 dbfs:/mnt/my_demo_storage/ my_demo_storage/ 0 0
Let's have a look at what's inside our blob storage:
%fs ls /mnt/my_demo_storage/
path name size modificationTime
1 dbfs:/mnt/my_demo_storage/hello_world.txt hello_world.txt 12 1689250809000
Reading files from mounted storage
Recall that the DBFS is mounted on the driver node as /dbfs, which means we can directly
access the files in our blob storage like so:
with open("/dbfs/mnt/my_demo_storage/hello_world.txt", "r") as file:
    content = file.read()
    print(content)
Hello world
Writing files to mounted storage
Let's now write a text file to the mounted storage:
with open("/dbfs/mnt/my_demo_storage/bye_world.txt", "w") as file:
    file.write("Bye world!")
Let's check that the new file has been written into the mounted storage:
%fs ls /mnt/my_demo_storage/
path name size modificationTime
1 dbfs:/mnt/my_demo_storage/bye_world.txt bye_world.txt 10 1689253301000
2 dbfs:/mnt/my_demo_storage/hello_world.txt hello_world.txt 12 1689250809000
We can also check the Azure blob UI dashboard to see that this text file is present:
Unmounting storage
To unmount the storage:
dbutils.fs.unmount("/mnt/my_demo_storage")
/mnt/my_demo_storage has been unmounted.
Comprehensive guide on caching in PySpark
Prerequisites
To follow along with this guide, you should know what RDDs, transformations and actions
are. Please visit our comprehensive guide on RDD if you feel rusty!
What is caching in Spark?
The core data structure used in Spark is the resilient distributed dataset (RDD). There are
two types of operations one can perform on a RDD: a transformation and an action. Most
operations such as mapping and filtering are transformations. Whenever a transformation is
applied to a RDD, a new RDD is made instead of mutating the original RDD directly:
Here, applying the map transformation on the original RDD creates RDD', and then applying
the filter transformation creates RDD''.
Now here is where caching comes into play. Suppose we wanted to apply a transformation
on RDD'' multiple times. Without caching, RDD'' must be computed from scratch
using RDD each time. This means that if we apply a transformation on RDD'' 10 times,
then RDD'' must be generated 10 times from RDD. If we cache RDD'', then we no longer
have to recompute RDD'', but instead reuse the RDD'' that exists in cache. In this way,
caching can greatly speed up your computations and is therefore critical for optimizing your
PySpark code.
How to perform caching in PySpark?
Caching a RDD or a DataFrame can be done by calling the RDD's or
DataFrame's cache() method. The catch is that the cache() method is a transformation (lazy-
execution) instead of an action. This means that even if you call cache() on a RDD or a
DataFrame, Spark will not immediately cache the data. Spark will only cache the RDD once an
action such as count() is performed:
# Cache will be created because count() is an action
df.cache().count()
Here, df.cache() returns the cached PySpark DataFrame.
We could also perform caching via the persist() method. The difference
between cache() and persist() is that cache() always stores the cache using the
setting MEMORY_AND_DISK, whereas persist() allows you to specify storage levels other
than MEMORY_AND_DISK. MEMORY_AND_DISK means that the cache will be stored in
memory if possible, and otherwise on disk. Other storage levels
include MEMORY_ONLY and DISK_ONLY.
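The payoff of caching can be illustrated with plain-Python memoization - an analogy only, not how Spark's cache manager actually works:

```python
compute_count = 0

def expensive_transform(data):
    """Stand-in for recomputing an RDD from scratch."""
    global compute_count
    compute_count += 1  # track how many times we recompute
    return [x * 2 for x in data]

cache = {}

def cached_transform(key, data):
    if key not in cache:  # compute only on the first request
        cache[key] = expensive_transform(data)
    return cache[key]

data = [1, 2, 3]
for _ in range(10):
    cached_transform("rdd", data)

print(compute_count)  # 1 -- computed once, reused 9 times
```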
Basic example of caching in PySpark
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Let's print out the execution plan of the filter(~) operation that fetches rows where
the age is not 20:
df.filter('age != 20').explain()
== Physical Plan ==
*(1) Filter (isnotnull(age#6003L) AND NOT (age#6003L = 20))
+- *(1) Scan ExistingRDD[name#6002,age#6003L]
Here note the following:
the explain() method prints out the physical plan, which you can interpret as the actual
execution plan
the executed plan is often read from bottom to top
we see that PySpark first scans the DataFrame (Scan ExistingRDD). RDD is shown here
instead of DataFrame because, remember, DataFrames are implemented as RDDs under the
hood.
while scanning, the filtering (isnotnull(age) AND NOT (age=20)) is applied.
Let us now cache the PySpark DataFrame returned by the filter(~) method using cache():
# Call count(), which is an action, to trigger caching
df.filter('age != 20').cache().count()
2
Here, the count() method is an action, which means that the PySpark DataFrame returned
by filter(~) will be cached.
Let's call filter(~) again and print the physical plan:
df.filter('age != 20').explain()
== Physical Plan ==
InMemoryTableScan [name#6002, age#6003L]
+- InMemoryRelation [name#6002, age#6003L], StorageLevel(disk, memory, deserialized,
1 replicas)
+- *(1) Filter (isnotnull(age#6003L) AND NOT (age#6003L = 20))
+- *(1) Scan ExistingRDD[name#6002,age#6003L]
The physical plan is now different from when we called filter(~) before caching. We see two
new operations: InMemoryTableScan and InMemoryRelation. Behind the scenes, the cache
manager checks whether a DataFrame resulting from the same computation exists in cache.
In this case, we have cached the resulting DataFrame from filter('age!=20') previously
via cache() followed by an action (count()), so the cache manager uses this cached
DataFrame instead of recomputing filter('age!=20').
The InMemoryTableScan and InMemoryRelation we see in the physical plan indicate that we
are working with the cached version of the DataFrame.
Using the cached object explicitly
The methods cache() and persist() return a cached version of the RDD or DataFrame. As we
have seen in the above example, we can cache RDDs or DataFrames without explicitly using
the returned cached object:
df.filter('age != 20').cache().count()
# Cached DataFrame will be used
df.filter('age != 20').show()
It is better practice to use the cached object returned by cache() like so:
df_cached = df.filter('age != 20').cache()
print(df_cached.count())
df_cached.show()
The advantage of this is that calling methods like df_cached.count() clearly indicates that we
are using a cached DataFrame.
Confirming cache via Spark UI
We can also confirm the caching behaviour via the Spark UI by clicking on the Stages tab:
Click on the link provided in the Description column. This should open up a graph that shows
the operations performed under the hood:
You should see a green box in the middle, which means that this specific operation was not
computed thanks to the presence of a cache.
Note that if you are using Databricks, then click on View in the output cell:
This should open up the Spark UI and show you the same graph as above.
We could also see the stored caches on the Storage tab:
We can see that all the partitions of the RDD (8 in this case) resulting from the
operation filter (age != 20) are stored in memory cache as opposed to disk cache. This is
because the storage level of the cache() method is set to MEMORY_AND_DISK by default,
which means the cache is stored on disk only if it does not fit in memory.
Clearing existing cache
To clear (evict) all the cache, call the following:
spark.catalog.clearCache()
To clear the cache of a specific RDD or DataFrame, call the unpersist() method:
df_cached = df.filter('age != 20').cache()
# Trigger an action to persist cache
df_cached.count()
# Delete the cache
df_cached.unpersist()
NOTE
It is good practice to clear the cache because if space starts running out, Spark will begin
removing cache entries using the LRU (least recently used) policy. It is generally better not to rely
on automatic deletion because it may remove a cache that is vital for your PySpark application.
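The LRU policy mentioned above can be sketched in plain Python - a toy model of least-recently-used eviction, not Spark's actual cache-management code:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)  # mark as recently used
            return self.store[key]
        return None

cache = LRUCache(2)
cache.put("df1", "partitions-1")
cache.put("df2", "partitions-2")
cache.get("df1")                   # touch df1, so df2 is now least recently used
cache.put("df3", "partitions-3")   # capacity exceeded: df2 is evicted
print(list(cache.store))           # ['df1', 'df3']
```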
Things to consider when caching
Cache computed data that is used frequently
Caching is recommended when you use the same computed RDD or DataFrame multiple
times. Do remember that computing RDDs is generally very fast, so you may consider
caching only when your PySpark program is too slow for your needs.
Cache minimally
We should cache frugally because caching consumes memory, and memory is needed for
the worker nodes to perform their tasks. If we do decide to cache, make sure that we only
cache the part of the data that we will reuse multiple times. For instance, if we are going
to frequently perform some computation on column A only, then it makes sense to cache
column A instead of the entire DataFrame. As another example, if one query involves
columns A and B, and another involves columns B and C, then it may be better to cache
columns A, B and C together instead of caching columns (A and B) and columns (B and C),
which would store column B in cache redundantly.
Comprehensive Guide to RDD in PySpark
schedule AUG 12, 2023
local_offer
PySpark
mode_heat
Master the mathematics behind data science with 100+ top-tier guides
Start your free 7-days trial now!
Prerequisites
You should already be familiar with the basics of PySpark. For a refresher, check out
our Getting Started with PySpark guide.
What is a RDD?
PySpark operates on big data by partitioning the data into smaller subsets spread across
multiple machines. This allows for parallelisation, and this is precisely why PySpark can
handle computations on big data efficiently. Under the hood, PySpark uses a unique data
structure called RDD, which stands for resilient distributed dataset. In essence, RDD is an
immutable data structure in which the data is partitioned across a number of worker nodes
to facilitate parallel operations:
In the diagram above, a single RDD has 4 partitions that are distributed across 3 worker nodes,
with the second worker node holding 2 partitions. By definition, a single partition cannot
span across multiple worker nodes. This means, for instance, that partition 2 can never
partially reside in both worker node 1 and 2 - the partition can only reside in either of
worker node 1 or 2. The Driver node serves to coordinate the task execution between these
worker nodes.
Transformations and actions
There are two operations we can perform on a RDD:
Transformations
Actions
Transformations
Transformations are basically functions applied on RDDs, which result in the creation of new
RDDs. RDDs are immutable, which means that even after applying a transformation, the
original RDD is kept intact. Examples of transformations include map(~) and filter(~).
For instance, consider the following RDD transformation:
Here, our RDD has 4 partitions that are distributed across 3 worker nodes. Partition 1 holds
the string a, partition 2 holds the values [d,B] and so on. Suppose we now apply a map
transformation that converts the string into uppercase. After running the map
transformation, we end up with RDD' shown on the right. What's important here is that
each worker node performs the map transformation on the data it possesses - this is what
makes distributed computing so efficient!
Since transformations return a new RDD, we can keep on applying transformations. The
following example shows the creation of two new RDDs after applying two separate
transformations:
Here, we apply the map(~) transformation to a RDD, which applies a function to each element
in RDD to yield RDD'. Next, we apply the filter(~) transformation to select a subset of the
data in RDD' to finally obtain RDD''.
Spark keeps track of the series of transformations applied to RDD using graphs called RDD
lineage or RDD dependency graphs. In the above diagram, RDD is considered to be a parent
of RDD'. Every child RDD has a reference to its parent (e.g. RDD' will always have a reference
to RDD).
Actions
Actions are operations that either:
send all the data held by multiple nodes to the driver node, for instance to print some
result on the driver node (e.g. show(~)).
or save some data on an external storage system such as HDFS or Amazon S3
(e.g. saveAsTextFile(~)).
Typically, actions are followed by a series of transformations like so:
After applying transformations, the actual data of the output RDD still reside in different
nodes. Actions are used to gather these scattered results in a single place - either the driver
node or an external data storage.
NOTE
Transformations are lazy, which means that even if you call the map(~) function, Spark will
not actually do anything behind the scenes. All transformations are only executed once an
action, such as collect(~), is triggered. This allows Spark to optimise the transformations by:
allocating resources more efficiently
grouping transformations together to avoid network traffic
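Lazy evaluation is easiest to feel in plain Python, where map and filter objects are also lazy iterators. This is only an analogy for Spark's behaviour, not Spark code:

```python
# Plain-Python analogy (not Spark): map/filter objects are lazy iterators,
# so no work happens until we "collect" the results with list().
log = []

def to_upper(s):
    log.append(s)      # record when the work actually happens
    return s.upper()

lazy = filter(lambda s: s == "ALEX", map(to_upper, ["Alex", "Bob", "Cathy"]))
assert log == []       # nothing computed yet -- the chain is lazy

result = list(lazy)    # the "action" triggers the whole chain at once
assert result == ["ALEX"]
assert log == ["Alex", "Bob", "Cathy"]
```

Just as here, calling map(~) or filter(~) on an RDD only records the transformation; the work happens when an action such as collect(~) runs.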
Example using PySpark
Consider the same set of transformations and action from earlier:
Here, we are first converting each string into uppercase using the transformation map(~),
and then performing a filter(~) transformation to obtain a subset of the data. Finally, we
send the individual results held in different partitions to the driver node to print the final
result on the screen using an action.
Consider the following RDD with 3 partitions:
rdd = sc.parallelize(["Alex","Bob","Cathy"], numSlices=3)
rdd.collect()
['Alex', 'Bob', 'Cathy']
Here:
sc, which stands for SparkContext, is a global variable pre-defined in Databricks notebooks.
we are using the parallelize(~) method of SparkContext to create a RDD.
the number of partitions is specified using the numSlices argument.
the collect(~) method is used to gather all the data from each partition to the driver node
and print the results on the screen.
Next, we use the map(~) transformation to convert each string (which resides in different
partitions) to uppercase. We then use the filter(~) transformation to obtain strings that
equal "ALEX":
rdd2 = rdd.map(lambda x: x.upper())
rdd3 = rdd2.filter(lambda name: name == "ALEX")
rdd3.collect()
['ALEX']
NOTE
To run this example, visit our guide Getting Started with PySpark on Databricks.
Narrow and wide transformations
There are two types of transformations:
Narrow - no shuffling is needed, which means that data residing in different nodes do not
have to be transferred to other nodes
Wide - shuffling is required, and so wide transformations are costly
The difference is illustrated below:
For narrow transformations, the partition remains in the same node after the
transformation, that is, the computation is local. In contrast, wide transformations involve
shuffling, which is slow and expensive because of network latency and bandwidth.
Some examples of narrow transformations include map(~) and filter(~). Consider a simple
map operation where we increment an integer value by one. It's clear that each
worker node can perform this on its own since there is no dependency on the
partitions living on other worker nodes.
Some examples of wide transformations include groupBy(~) and sort(~). Suppose we
wanted to perform a groupBy(~) operation on some column, say a categorical variable
consisting of 3 classes: A, B and C. The following diagram illustrates how Spark will execute
this operation:
Notice how groupBy(~) cannot be computed locally because the operation requires
dependency between partitions lying in different nodes.
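The narrow/wide distinction can be sketched with a toy model in plain Python, where each inner list stands for a partition living on a different worker node (this is an illustration only, not the Spark API):

```python
from collections import defaultdict

# Toy model (not Spark): each inner list is a "partition" on a different node.
partitions = [["A", "B"], ["C", "A"], ["B", "A"]]

# Narrow transformation (map): each partition is processed locally,
# with no data movement between nodes.
mapped = [[c.lower() for c in part] for part in partitions]

# Wide transformation (groupBy): rows sharing a key must end up together,
# so every partition ships its rows across the network -- the shuffle.
groups = defaultdict(list)
for part in partitions:
    for c in part:
        groups[c].append(c)
```

The map step never looks outside its own partition, whereas the groupBy step must pull matching values from every partition - that cross-partition movement is what makes wide transformations expensive.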
Fault tolerance property
The R in RDD stands for resilient, meaning that even if a worker node fails, the missing
partition can still be recomputed to recover the RDD with the help of RDD lineage. For
instance, consider the following example:
Suppose RDD'' is "damaged" because of a node failure. Since Spark knows that RDD' is the
parent of RDD'', Spark will be able to re-compute RDD'' from RDD'.
Viewing the underlying partitions of a RDD in PySpark
Let's create a RDD in PySpark by using the parallelize(~) method once again:
rdd = sc.parallelize(["a","B","c","D"])
rdd.collect()
['a', 'B', 'c', 'D']
To see the underlying partition of the RDD, use the glom() method like so:
rdd.glom().collect()
[[], ['a'], [], ['B'], [], ['c'], [], ['D']]
Here, we see that the RDD has 8 partitions by default. This default number of partitions is
governed by the spark.default.parallelism configuration property, which typically defaults to
the number of cores available. Because our RDD only contains 4 values, half of the
partitions are empty.
We can specify that we want to break down our data into say 3 partitions by supplying
the numSlices parameter:
rdd = sc.parallelize(["a","B","c","D"], numSlices=3)
rdd.glom().collect()
[['a'], ['B'], ['c', 'D']]
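The way parallelize(~) spreads elements over partitions can be approximated with a short plain-Python sketch. This mimics Spark's internal slicing rule for in-memory collections (an approximation for intuition, not the Spark API):

```python
def split_into_slices(data, num_slices):
    # Partition i receives the slice [i*n//num_slices, (i+1)*n//num_slices).
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# Reproduces the partition layouts shown above:
assert split_into_slices(["a", "B", "c", "D"], 8) == \
    [[], ["a"], [], ["B"], [], ["c"], [], ["D"]]
assert split_into_slices(["a", "B", "c", "D"], 3) == [["a"], ["B"], ["c", "D"]]
```

Notice how 4 elements over 8 slices naturally leaves every other slice empty, matching the glom() output above.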
Difference between RDD and DataFrames
When working with PySpark, we usually use DataFrames instead of RDDs. Similar
to RDDs, DataFrames are also an immutable collection of data, but the key difference is
that DataFrames can be thought of as a spreadsheet-like table where the data is organised
into columns. This does limit the use-case of DataFrames to only structured or tabular data,
but the added benefit is that we can work with our data at a much higher level of
abstraction. If you've ever used a Pandas DataFrame, you'll understand just how easy it is to
interact with your data.
DataFrames are actually built on top of RDDs, but there are still cases when you would
rather work at a lower level and tinker directly with RDDs. For instance, if you are dealing
with unstructured data (e.g. audio and streams of data), you would use RDDs rather
than DataFrames.
NOTE
If you are dealing with structured data, we highly recommend that you
use DataFrames instead of RDDs. This is because Spark will optimize the series of operations
you perform on DataFrames under the hood, but will not do so in the case of RDDs.
Seeing the partitions of a DataFrame
Since DataFrames are built on top of RDDs, we can easily see the underlying RDD
representation of a DataFrame. Let's start by creating a simple DataFrame:
columns = ["Name", "Age"]
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 30]], columns)
df.show()
+-----+---+
| Name|Age|
+-----+---+
| Alex| 15|
|  Bob| 20|
|Cathy| 30|
+-----+---+
To see how this DataFrame is partitioned by its underlying RDD:
df.rdd.glom().collect()
[[],
[],
[Row(Name='Alex', Age=15)],
[],
[],
[Row(Name='Bob', Age=20)],
[],
[Row(Name='Cathy', Age=30)]]
We see that our DataFrame is partitioned in terms of Row, which is a native object in
PySpark.
Getting Started with PySpark on Databricks
Setting up PySpark on Databricks
Databricks, founded by the original creators of Spark, describes itself as an "open and unified
data analytics platform for data engineering, data science, machine learning and analytics."
The company adds a layer on top of cloud providers (Microsoft Azure, AWS, Google Cloud)
and manages the Spark cluster on your behalf.
Databricks offers a free tier (community edition) to spin up a node and run some PySpark,
so this is the best way to gain hands-on experience with PySpark without having to
install a Linux OS, the environment in which Spark typically runs.
Registering to Databricks
Firstly, head over to the Databricks webpageopen_in_new, and fill out the sign up form to
register for the community edition. After receiving a confirmation email from Databricks,
click on the "Get started with Community Edition" link at the bottom instead of choosing a
cloud provider:
WARNING
If you click on a cloud provider, Databricks will create a free-trial account instead of
a community-edition account. A free-trial account is very different from a community-
edition one as you will have to:
set up your own cloud storage on your provider (e.g. Google Cloud Storage)
pay for the resources you consume on your provider
For this reason, we highly recommend making a community-edition account for
learning PySpark.
Environment of community edition
The community edition provides you with:
a single cluster with 15GB of storage
a single driver node equipped with 2 CPUs without any worker nodes
notebooks to write some PySpark code
Creating a cluster
We first need to create a cluster to run PySpark. Head over to the Databricks dashboard,
and click on "Compute" on the left side bar:
Now, click on the "Create Cluster" button, and enter the desired name for your cluster:
Click on the "Create Cluster" button on the top, and this will spin up a free 15GB cluster
consisting of a single driver node without any worker nodes.
WARNING
The cluster in the community edition will automatically terminate after an idle period of two
hours. Terminated clusters cannot be restarted, and so you would have to spin up a new
cluster. In order to set up a new cluster with the same configuration as the terminated one,
click on the terminated cluster and click the "clone" button on top.
We now need to wait 5 or so minutes until the cluster is set up. When the pending
symbol turns into a green circle, the cluster is set up and ready to go!
Creating a notebook
Databricks uses notebooks (similar to JupyterLab) to run PySpark code. To create a new
notebook, click on the following in the left side bar:
Type in the desired name of the notebook, and select the cluster that we created in the
previous step:
The code that we write in this notebook will be in Python, and will be run on the cluster
we created earlier.
Running our first PySpark code
Now that we have our cluster and notebook set up, we can finally run some PySpark code.
To create a PySpark DataFrame:
columns = ["name", "age"]
data = [("Alex", 15), ("Bob", 20), ("Cathy", 25)]
df = spark.createDataFrame(data, columns)
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
Applying a custom function on PySpark Columns with user-defined functions
from pyspark.sql.functions import udf
# Register our custom function my_func as a UDF
my_udf = udf(my_func)
# Pass in two columns to our my_udf
df_result = df.select(my_udf('name', 'age'))
df_result.show()
+--------------------+
| my_func(name, age)|
+--------------------+
|Alex is 10 years old|
| Bob is 20 years old|
|Cathy is 30 years...|
+--------------------+
Here, note the following:
our custom function my_func(~) now takes in two column values
when calling my_udf(~), we now pass in two columns
Specifying the resulting column type
By default, the column returned will always be of type string regardless of the actual return
type of your custom function. For instance, consider the following custom function:
def my_double(int_age):
    return 2 * int_age
# Register the function
udf_double = udf(my_double)
df_result = df.select(udf_double('age'))
df_result.show()
+--------------+
|my_double(age)|
+--------------+
| 20|
| 40|
| 60|
+--------------+
Here, the return type of our function my_double(~) is obviously an integer, but the resulting
column type is actually set to a string:
df_result.printSchema()
root
|-- my_double(age): string (nullable = true)
We can specify the resulting column type using the second argument in udf(~):
udf_double = udf(my_double, 'int')
df_result = df.select(udf_double('age'))
df_result.printSchema()
root
|-- my_double(age): integer (nullable = true)
Here, we have indicated that the resulting column type should be integer.
Equivalently, we could also import an explicit PySpark type like so:
from pyspark.sql.types import IntegerType
udf_double = udf(my_double, IntegerType())
df_result = df.select(udf_double('age'))
df_result.printSchema()
root
|-- my_double(age): integer (nullable = true)
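Declaring the correct return type matters beyond cosmetics. As a quick plain-Python illustration (not PySpark code) of why a string-typed column is a poor substitute for an integer-typed one, strings compare lexicographically:

```python
# The numbers display identically in show(), but strings sort and compare
# character by character rather than numerically:
assert sorted([20, 40, 60, 100]) == [20, 40, 60, 100]
assert sorted(["20", "40", "60", "100"]) == ["100", "20", "40", "60"]
```

The same surprise would appear in PySpark if you sorted or compared a UDF column left with the default string type.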
Calling user-defined functions in SQL expressions
To use user-defined functions in SQL expressions, register the custom function
using spark.udf.register(~):
def to_upper(some_string):
    return some_string.upper()
spark.udf.register('udf_upper', to_upper)
df.selectExpr('udf_upper(name)').show()
+---------------+
|udf_upper(name)|
+---------------+
| ALEX|
| BOB|
| CATHY|
+---------------+
Here, the selectExpr(~) method takes in as argument a SQL expression.
We could also register the DataFrame as a SQL table so that we can run full SQL expressions
like so:
# Register PySpark DataFrame as a SQL table
df.createOrReplaceTempView('my_table')
spark.sql('SELECT udf_upper(name) FROM my_table').show()
+---------------+
|udf_upper(name)|
+---------------+
| ALEX|
| BOB|
| CATHY|
+---------------+
Specifying the return type
Again, the type of the resulting column is string regardless of what your custom function
returns. Just like we did earlier when registering with udf(~), we can specify the type of the
returned column like so:
def my_double(int_age):
    return 2 * int_age
spark.udf.register('udf_double', my_double, 'int')
window = Window.partitionBy("group")
df.withColumn("MAX", F.max(F.col("age")).over(window)).show()
+-----+-----+---+---+
| name|group|age|MAX|
+-----+-----+---+---+
| Alex| A| 20| 30|
| Bob| A| 30| 30|
|Cathy| B| 40| 40|
| Dave| B| 40| 40|
+-----+-----+---+---+
Here, note the following:
the original rows are kept intact.
we computed some statistic (max(~)) about how each row relates to the other rows within
its group.
we can also use other aggregate functions such as min(~), avg(~), sum(~).
NOTE
We could also partitionBy(~) on multiple columns by passing in a list of column labels.
Assigning row numbers within groups
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 30], ["Bob", "A", 20], ["Cathy", "B", 40], ["Dave", "B", 40]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 30|
| Bob| A| 20|
|Cathy| B| 40|
| Dave| B| 40|
+-----+-----+---+
We can sort the rows of each group by using the orderBy(~) function:
window = Window.partitionBy("group").orderBy("age") # ascending order by default
To create a new column called ROW NUMBER that holds the row number of every row
within each group:
df.withColumn("ROW NUMBER", F.row_number().over(window)).show()
+-----+-----+---+----------+
| name|group|age|ROW NUMBER|
+-----+-----+---+----------+
| Bob| A| 20| 1|
| Alex| A| 30| 2|
|Cathy| B| 40| 1|
| Dave| B| 40| 2|
+-----+-----+---+----------+
Here, Bob is assigned a ROW NUMBER of 1 because we order the grouped rows by
the age column first before assigning the row number.
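What row_number() computes can be sketched in plain Python: sort by the partition key then the order key, and restart the numbering for each group. This is an intuition-building toy, not the Spark API:

```python
from itertools import groupby

# Sketch of row_number() over Window.partitionBy("group").orderBy("age")
rows = [("Alex", "A", 30), ("Bob", "A", 20), ("Cathy", "B", 40), ("Dave", "B", 40)]
rows.sort(key=lambda r: (r[1], r[2]))           # partition key, then order key

numbered = []
for _, part in groupby(rows, key=lambda r: r[1]):
    for i, row in enumerate(part, start=1):     # numbering restarts per group
        numbered.append(row + (i,))
```

This reproduces the table above: Bob gets 1 in group A because he is youngest there, and the counter resets to 1 for group B.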
Ordering by multiple columns
To order by multiple columns, say by "age" first and "name" second:
window = Window.partitionBy("group").orderBy("age", "name")
df.withColumn("RANK", F.rank().over(window)).show()
+-----+-----+---+----+
| name|group|age|RANK|
+-----+-----+---+----+
| Bob| A| 20| 1|
| Alex| A| 30| 2|
|Cathy| B| 40| 1|
| Dave| B| 40| 2|
+-----+-----+---+----+
Ordering by descending
By default, the ordering is applied in ascending order. We can perform ordering in
descending order like so:
window = Window.partitionBy("group").orderBy(F.desc("age"), F.asc("name"))
df.withColumn("RANK", F.rank().over(window)).show()
+-----+-----+---+----+
| name|group|age|RANK|
+-----+-----+---+----+
| Alex| A| 30| 1|
| Bob| A| 20| 2|
|Cathy| B| 40| 1|
| Dave| B| 40| 2|
+-----+-----+---+----+
Here, we are ordering by age in descending order and then ordering by name in ascending
order.
Assigning ranks within groups
Consider the same PySpark DataFrame as before:
df = spark.createDataFrame([["Alex", "A", 30], ["Bob", "A", 20], ["Cathy", "B", 40], ["Dave", "B", 40]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 30|
| Bob| A| 20|
|Cathy| B| 40|
| Dave| B| 40|
+-----+-----+---+
Instead of row numbers, let's compute the ranking within each group:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("RANK", F.rank().over(window)).show()
+-----+-----+---+----+
| name|group|age|RANK|
+-----+-----+---+----+
| Bob| A| 20| 1|
| Alex| A| 30| 2|
|Cathy| B| 40| 1|
| Dave| B| 40| 1|
+-----+-----+---+----+
Here, Cathy and Dave both receive a rank of 1 because they have the same age.
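The tie-handling of rank(~) can be sketched on one already-ordered group in plain Python (a toy sketch, not the Spark API): ties share a rank, and the rank after a tie is skipped.

```python
def rank_within_group(ordered_values):
    # rank(~) semantics on one ordered group: ties reuse the previous rank,
    # and because ranks are position-based, the next distinct value "skips".
    ranks = []
    for i, v in enumerate(ordered_values):
        if i > 0 and v == ordered_values[i - 1]:
            ranks.append(ranks[-1])      # tie: same rank as the previous row
        else:
            ranks.append(i + 1)          # rank = 1-based position
    return ranks

assert rank_within_group([20, 30]) == [1, 2]          # group A
assert rank_within_group([40, 40]) == [1, 1]          # group B: Cathy and Dave tie
assert rank_within_group([40, 40, 50]) == [1, 1, 3]   # rank 2 is skipped
```

The last assertion shows the difference from dense_rank(~), which would assign 2 instead of 3 after the tie.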
Computing lag, lead and cumulative distributions
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 20], ["Bob", "A", 30], ["Cathy", "B", 40], ["Dave", "B", 50], ["Eric", "B", 60]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 20|
| Bob| A| 30|
|Cathy| B| 40|
| Dave| B| 50|
| Eric| B| 60|
+-----+-----+---+
Lag function
Let's create a new column where the values of name are shifted down by one for
every group:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("LAG", F.lag(F.col("name")).over(window)).show()
+-----+-----+---+-----+
| name|group|age| LAG|
+-----+-----+---+-----+
| Alex| A| 20| null|
| Bob| A| 30| Alex|
|Cathy| B| 40| null|
| Dave| B| 50|Cathy|
| Eric| B| 60| Dave|
+-----+-----+---+-----+
Here, Bob has a LAG value of Alex because Alex belongs to the same group and is above Bob
when ordered by age.
We can also shift down column values by 2 like so:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("LAG", F.lag(F.col("name"), 2).over(window)).show()
+-----+-----+---+-----+
| name|group|age| LAG|
+-----+-----+---+-----+
| Alex| A| 20| null|
| Bob| A| 30| null|
|Cathy| B| 40| null|
| Dave| B| 50| null|
| Eric| B| 60|Cathy|
+-----+-----+---+-----+
Here, Eric has a LAG value of Cathy because Cathy has been shifted down by 2.
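On a single ordered group, lag(~) is just a downward shift padded with None at the top. A plain-Python sketch (not the Spark API):

```python
def lag_column(ordered_values, shift=1):
    # Shift values down by `shift` positions, padding the top with None.
    return ([None] * shift + ordered_values)[:len(ordered_values)]

# Group B ordered by age: Cathy, Dave, Eric
assert lag_column(["Cathy", "Dave", "Eric"]) == [None, "Cathy", "Dave"]
assert lag_column(["Cathy", "Dave", "Eric"], shift=2) == [None, None, "Cathy"]
```

Spark applies this shift independently within each window partition, which is why the nulls reappear at the top of every group.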
Lead function
The lead(~) function is the opposite of the lag(~) function - instead of shifting down values,
we shift up instead. Here's our DataFrame once again for your reference:
df = spark.createDataFrame([["Alex", "A", 20], ["Bob", "A", 30], ["Cathy", "B", 40], ["Dave", "B", 50], ["Eric", "B", 60]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 20|
| Bob| A| 30|
|Cathy| B| 40|
| Dave| B| 50|
| Eric| B| 60|
+-----+-----+---+
Let's create a new column called LEAD where the name value is shifted up by one for
every group:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("LEAD", F.lead(F.col("name")).over(window)).show()
+-----+-----+---+----+
| name|group|age|LEAD|
+-----+-----+---+----+
| Alex| A| 20| Bob|
| Bob| A| 30|null|
|Cathy| B| 40|Dave|
| Dave| B| 50|Eric|
| Eric| B| 60|null|
+-----+-----+---+----+
Just as we could do for the lag(~) function, we can add a shift unit like so:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("LEAD", F.lead(F.col("name"), 2).over(window)).show()
+-----+-----+---+----+
| name|group|age|LEAD|
+-----+-----+---+----+
| Alex| A| 20|null|
| Bob| A| 30|null|
|Cathy| B| 40|Eric|
| Dave| B| 50|null|
| Eric| B| 60|null|
+-----+-----+---+----+
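Mirroring the lag sketch, lead(~) on one ordered group is an upward shift padded with None at the bottom (plain Python, not the Spark API):

```python
def lead_column(ordered_values, shift=1):
    # Shift values up by `shift` positions, padding the bottom with None.
    return ordered_values[shift:] + [None] * shift

# Group B ordered by age: Cathy, Dave, Eric
assert lead_column(["Cathy", "Dave", "Eric"]) == ["Dave", "Eric", None]
assert lead_column(["Cathy", "Dave", "Eric"], shift=2) == ["Eric", None, None]
```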
Cumulative distribution function
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 20], ["Bob", "B", 30], ["Cathy", "B", 40], ["Dave", "B", 40], ["Eric", "B", 60]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 20|
| Bob| B| 30|
|Cathy| B| 40|
| Dave| B| 40|
| Eric| B| 60|
+-----+-----+---+
To get the cumulative distribution of age of each group:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("CUMULATIVE DIS", F.cume_dist().over(window)).show()
+-----+-----+---+--------------+
| name|group|age|CUMULATIVE DIS|
+-----+-----+---+--------------+
| Alex| A| 20| 1.0|
| Bob| B| 30| 0.25|
|Cathy| B| 40| 0.75|
| Dave| B| 40| 0.75|
| Eric| B| 60| 1.0|
+-----+-----+---+--------------+
Here, Cathy and Dave have a CUMULATIVE DIS value of 0.75 because their age value is equal
to or greater than 75% of the age values within that group.
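The cume_dist(~) value for each row is the fraction of rows in its group whose value is less than or equal to that row's value. A plain-Python sketch of this definition (not the Spark API):

```python
def cume_dist_within_group(values):
    # For each value v, the fraction of group members u with u <= v.
    n = len(values)
    return [sum(1 for u in values if u <= v) / n for v in values]

# Group B ages: Bob 30, Cathy 40, Dave 40, Eric 60
assert cume_dist_within_group([30, 40, 40, 60]) == [0.25, 0.75, 0.75, 1.0]
```

This reproduces the column above: 3 of the 4 ages in group B are at most 40, so Cathy and Dave both get 3/4 = 0.75.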
Specifying range using rangeBetween
We can use the rangeBetween(~) method to only consider rows whose specified column
value is within a given range. For example, consider the following DataFrame:
df = spark.createDataFrame([["Alex", "A", 15], ["Bob", "A", 20], ["Cathy", "A", 30], ["Dave", "A", 30], ["Eric", "B", 30]], ["Name", "Group", "Age"])
df.show()
+-----+-----+---+
| Name|Group|Age|
+-----+-----+---+
| Alex| A| 15|
| Bob| A| 20|
|Cathy| A| 30|
| Dave| A| 30|
| Eric| B| 30|
+-----+-----+---+
To compute a moving average of Age with rows whose Age value satisfies some range
condition:
window = Window.partitionBy("Group").orderBy("Age").rangeBetween(start=-5, end=10)
df.withColumn("AVG", F.avg(F.col("Age")).over(window)).show()
+-----+-----+---+-----+
| Name|Group|Age| AVG|
+-----+-----+---+-----+
| Alex| A| 15| 17.5|
| Bob| A| 20|23.75|
|Cathy| A| 30| 30.0|
| Dave| A| 30| 30.0|
| Eric| B| 30| 30.0|
+-----+-----+---+-----+
In the beginning, the first row with Age=15 is selected and we scan for rows where
the Age value is between 15-5=10 and 15+10=25. Since Bob's row satisfies this condition,
the aggregate function (averaging in this case) takes in as input Alex's row (the current row)
and Bob's row:
Here:
the blue row indicates the current row.
the red row represents a row that satisfies the range condition.
Next, the second row with Age=20 is selected. Similarly, we scan for rows where the Age is
between 20-5=15 and 20+10=30 and compute the aggregate function based on the satisfied
rows:
Here, 23.75 is the average of 15, 20, 30 and 30. Note that Eric's row is not included in the
calculation even though his Age is 30 because he belongs to a different group.
As one last example, here's what would happen for the next row:
Once we repeat this process for the rest of the rows and all other groups, we end up with:
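The whole rangeBetween procedure can be condensed into a plain-Python sketch over one group's ordered ages (an intuition aid, not the Spark API): for each row, average every age lying within [age+start, age+end].

```python
def range_between_avg(ordered_ages, start, end):
    # Sketch of avg(~) over rangeBetween(start, end) within a single group.
    out = []
    for age in ordered_ages:
        in_range = [a for a in ordered_ages if age + start <= a <= age + end]
        out.append(sum(in_range) / len(in_range))
    return out

# Group A ages with start=-5, end=10 reproduce the AVG column above:
assert range_between_avg([15, 20, 30, 30], -5, 10) == [17.5, 23.75, 30.0, 30.0]
```

Eric never enters these averages because the window is evaluated per group, matching the note about him belonging to group B.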
Specifying rows using rowBetween
We can use the rowsBetween(~) method to specify how many preceding and subsequent
rows we wish to consider when computing our aggregate function. For example, consider
the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 10], ["Bob", "A", 20], ["Cathy", "A", 30], ["Dave", "A", 40], ["Eric", "B", 50]], ["Name", "Group", "Age"])
df.show()
+-----+-----+---+
| Name|Group|Age|
+-----+-----+---+
| Alex| A| 10|
| Bob| A| 20|
|Cathy| A| 30|
| Dave| A| 40|
| Eric| B| 50|
+-----+-----+---+
To use 1 preceding row and 2 subsequent rows in the calculation of our aggregate function:
window = Window.partitionBy("Group").orderBy("Age").rowsBetween(start=-1, end=2)
df.withColumn("AVG", F.avg(F.col("Age")).over(window)).show()
+-----+-----+---+----+
| Name|Group|Age| AVG|
+-----+-----+---+----+
| Alex| A| 10|20.0|
| Bob| A| 20|25.0|
|Cathy| A| 30|30.0|
| Dave| A| 40|35.0|
| Eric| B| 50|50.0|
+-----+-----+---+----+
Here, note the following:
Alex's row has no preceding row but has 2 subsequent rows (Bob and Cathy's row). This
means that Alex's AVG value is 20 because (10+20+30)/3=20.
Bob's row has one preceding row and 2 subsequent rows. This means that Bob's AVG value
is 25 because (10+20+30+40)/4=25.
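Unlike rangeBetween, rowsBetween counts positions rather than values. A plain-Python sketch over one group's ordered ages (not the Spark API):

```python
def rows_between_avg(ordered_ages, start, end):
    # Sketch of avg(~) over rowsBetween(start, end) within a single group:
    # for each row, average the slice from `start` rows before it to `end`
    # rows after it (clamped at the group boundaries).
    out = []
    for i in range(len(ordered_ages)):
        window = ordered_ages[max(0, i + start): i + end + 1]
        out.append(sum(window) / len(window))
    return out

# Group A ages with 1 preceding and 2 subsequent rows reproduce AVG above:
assert rows_between_avg([10, 20, 30, 40], -1, 2) == [20.0, 25.0, 30.0, 35.0]
```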
Using window functions to preserve ordering with collect_list
Window functions can also be used to preserve ordering when performing
a collect_list(~) operation. The conventional way of calling collect_list(~) is with groupBy(~).
For example, consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 2], ["Bob", "A", 1], ["Cathy", "B", 1], ["Doge", "A", 3]], ["name", "my_group", "rank"])
df.show()
+-----+--------+----+
| name|my_group|rank|
+-----+--------+----+
| Alex| A| 2|
| Bob| A| 1|
|Cathy| B| 1|
| Doge| A| 3|
+-----+--------+----+
To collect all the names for each group in my_group as a list:
df_result = df.groupBy("my_group").agg(F.collect_list("name").alias("name"))
df_result.show()
+--------+-----------------+
|my_group| name|
+--------+-----------------+
| A|[Alex, Bob, Doge]|
| B| [Cathy]|
+--------+-----------------+
This solution is acceptable only in the case when the ordering of the elements in the
collected list does not matter. In this particular case, we get the order [Alex, Bob, Doge] but
there is no guarantee that this will always be the output every time. This is because
the groupBy(~) operation shuffles the data across the worker nodes, and then Spark
appends values to the list in a non-deterministic order.
In the case when the ordering of the elements in the list matters, we can
use collect_list(~) over a window partition like so:
w = Window.partitionBy("my_group").orderBy("rank")
df_result = df.withColumn("result", F.collect_list("name").over(w))
df_final_result = df_result.groupBy("my_group").agg(F.max("result").alias("result"))
df_final_result.show()
+--------+-----------------+
|my_group| result|
+--------+-----------------+
| A|[Bob, Alex, Doge]|
| B| [Cathy]|
+--------+-----------------+
Here, we've first defined a window partition based on my_group, which is ordered by rank.
We then directly use the collect_list(~) over this window partition to generate the following
intermediate result:
df_result.show()
+-----+--------+----+-----------------+
| name|my_group|rank| result|
+-----+--------+----+-----------------+
| Bob| A| 1| [Bob]|
| Alex| A| 2| [Bob, Alex]|
| Doge| A| 3|[Bob, Alex, Doge]|
|Cathy| B| 1| [Cathy]|
+-----+--------+----+-----------------+
Remember, window partitions do not aggregate values; that is, the number of rows in the
resulting DataFrame remains the same.
Finally, we group by my_group and fetch the row with the longest list for each group
using F.max(~) to obtain the desired output.
Note that we could also add a filtering condition for collect_list(~) like so:
w = Window.partitionBy("my_group").orderBy("rank")
df_result = df.withColumn("result", F.collect_list(F.when(F.col("name") != "Alex", F.col("name"))).over(w))
df_final_result = df_result.groupBy("my_group").agg(F.max("result").alias("result"))
df_final_result = df_result.groupBy("my_group").agg(F.max("result").alias("result"))
df_final_result.show()
+--------+-----------+
|my_group| result|
+--------+-----------+
| A|[Bob, Doge]|
| B| [Cathy]|
+--------+-----------+
Here, we are collecting names as a list for each group while filtering out the name Alex.
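The deterministic result produced by the window approach can be mimicked in plain Python: sort the rows by rank first, then append names group by group (a toy sketch, not the Spark API):

```python
from collections import defaultdict

# Sorting by rank before grouping makes the collected order deterministic.
rows = [("Alex", "A", 2), ("Bob", "A", 1), ("Cathy", "B", 1), ("Doge", "A", 3)]

collected = defaultdict(list)
for name, group, rank in sorted(rows, key=lambda r: r[2]):  # order by rank
    collected[group].append(name)
```

This yields [Bob, Alex, Doge] for group A every time, whereas a shuffled groupBy(~) gives no such guarantee.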
PySpark SQL Functions | mean method
PySpark SQL Functions' mean(~) method returns the mean value in the specified column.
Parameters
1. col | string or Column
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To compute the mean of the column age:
import pyspark.sql.functions as F
df.select(F.mean("age")).show()
+--------+
|avg(age)|
+--------+
| 27.5|
+--------+
list_rows = df.select(F.mean("age")).collect()
list_rows[0][0]
27.5
Here, we are converting the PySpark DataFrame returned by select(~) into a list of Row objects
using the collect() method. This list is guaranteed to be of size one because mean(~) reduces
the column values to a single number. To access the content of the Row object, we use another [0].
df.show()
+-----+---+-----+
| name|age|class|
+-----+---+-----+
| Alex| 20| A|
| Bob| 30| B|
|Cathy| 50| A|
+-----+---+-----+
To compute the mean age of each class:
df.groupBy("class").agg(F.mean("age").alias("MEAN AGE")).show()
+-----+--------+
|class|MEAN AGE|
+-----+--------+
| A| 35.0|
| B| 30.0|
+-----+--------+
Here, we are using alias("MEAN AGE") to assign a label to the aggregated age column.
Using SQL against a PySpark DataFrame
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To register this DataFrame as a SQL table called users:
df.createOrReplaceTempView("users")
Here, we have registered the DataFrame as a SQL table called users. The temporary table will be
dropped whenever the Spark session ends. On the other hand, a view created
with createGlobalTempView(~) is shared across Spark sessions, and is only dropped when the
Spark application ends.
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
WARNING
Only read-only SQL statements are allowed - data manipulation language (DML) statements such
as UPDATE and DELETE are not supported since PySpark has no notion of transactions.
query = "SELECT * FROM users"
df_res = spark.sql(query)
df_res.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
PySpark SQL Functions' min(~) method returns the minimum value in the specified column.
Parameters
1. col | string or Column
The column to compute the minimum of.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
import pyspark.sql.functions as F
df.select(F.min("age")).show()
+--------+
|min(age)|
+--------+
| 25|
+--------+
list_rows = df.select(F.min("age")).collect()
list_rows[0][0]
25
Here, the collect() method converts the PySpark DataFrame returned by select(~) into a list
of Row objects. This list will always be of length one because min(~) reduces the column
values to a single number.
df.show()
+-----+---+-----+
| name|age|class|
+-----+---+-----+
| Alex| 20| A|
| Bob| 30| B|
|Cathy| 50| A|
+-----+---+-----+
df.groupby("class").agg(F.min("age").alias("MIN AGE")).show()
+-----+-------+
|class|MIN AGE|
+-----+-------+
| A| 20|
| B| 30|
+-----+-------+
Here, the alias(~) method is used to assign a label to the aggregated age column.
PySpark SQL Functions' month(~) method extracts the month component of each column value,
which can be of type string or date.
Parameters
1. col | string or Column
The date column from which to extract the month.
Return Value
A Column object of integers.
Examples
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995, 12, 16)], ["Bob", datetime.date(1995, 5, 9)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
import pyspark.sql.functions as F
df.select(F.month("birthday").alias("month")).show()
+-----+
|month|
+-----+
| 12|
| 5|
+-----+
Here, we are assigning the name "month" to the Column object returned by month(~) .
df.select(F.month("birthday").alias("day")).show()
+---+
|day|
+---+
| 12|
|  5|
+---+
PySpark SQL Functions' regexp_extract(~) method extracts a substring using regular expression.
Parameters
1. str | string or Column
The column whose values will be matched.
2. pattern | string
The regular expression pattern.
3. idx | int
The group from which to extract values. Consult the examples below for clarification.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["id_20_30", 10], ["id_40_50", 30]], ["id", "age"])
df.show()
+--------+---+
| id|age|
+--------+---+
|id_20_30| 10|
|id_40_50| 30|
+--------+---+
To extract the first number in each id value, use regexp_extract(~) like so:
import pyspark.sql.functions as F
df.select(F.regexp_extract("id", r"(\d+)", 1)).show()
+----------------------------+
|regexp_extract(id, (\d+), 1)|
+----------------------------+
| 20|
| 40|
+----------------------------+
Here, the regular expression (\d+) matches one or more digits ( 20 and 40 in this case). We set the
third argument value as 1 to indicate that we are interested in extracting the first matched group -
this argument is useful when we capture multiple groups.
We can use multiple capture groups with regexp_extract(~) like so:
df.select(F.regexp_extract("id", r"(\d+)_(\d+)", 2)).show()
+----------------------------------+
|regexp_extract(id, (\d+)_(\d+), 2)|
+----------------------------------+
| 30|
| 50|
+----------------------------------+
Here, we set the third argument value to 2 to indicate that we are interested in extracting the
values captured by the second group.
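The group-index semantics carry over directly to Python's re module — a pure-Python sketch of regexp_extract, where group(idx) picks the idx-th capture group (Spark returns an empty string when there is no match, which the hypothetical helper below mimics):

```python
import re

def regexp_extract(value, pattern, idx):
    # Find the pattern and return the requested capture group;
    # mirror Spark's behavior of returning "" when nothing matches.
    m = re.search(pattern, value)
    return m.group(idx) if m else ""

print(regexp_extract("id_20_30", r"(\d+)_(\d+)", 1))  # 20
print(regexp_extract("id_40_50", r"(\d+)_(\d+)", 2))  # 50
```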
RELATED
PySpark SQL Functions' regexp_replace(~) method replaces the matched regular expression with the
specified string.
PySpark SQL Functions | regexp_replace method
PySpark SQL Functions' regexp_replace(~) method replaces the matched regular expression with
the specified string.
Parameters
1. str | string or Column
The column whose values will be replaced.
2. pattern | string
The regular expression to match.
3. replacement | string
The string with which to replace the matched patterns.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 10|
|Mile| 30|
+----+---+
To replace the substring 'le' with 'LE':
import pyspark.sql.functions as F
df.select(F.regexp_replace("name", "le", "LE").alias("new_name")).show()
+--------+
|new_name|
+--------+
| ALEx|
| MiLE|
+--------+
NOTE
The second argument is a regular expression, so characters such as $ and [ will carry special
meaning. In order to treat these special characters as literal characters, escape them using
the \ character (e.g. \$).
Passing in a Column object
Instead of referring to the column by its name, we can also pass in a Column object:
df.select(F.regexp_replace(F.col("name"), "le", "LE").alias("new_name")).show()
+--------+
|new_name|
+--------+
| ALEx|
| MiLE|
+--------+
We can use the PySpark DataFrame's withColumn(~) method to obtain a new PySpark DataFrame
with the updated column like so:
df.withColumn("name", F.regexp_replace("name", "le", "LE")).show()
+----+---+
|name|age|
+----+---+
|ALEx| 10|
|MiLE| 30|
+----+---+
To replace the substring 'le' with 'LE' only when it occurs at the end, use regexp_replace(~):
df.select(F.regexp_replace("name", "le$", "LE").alias("new_name")).show()
+--------+
|new_name|
+--------+
| Alex|
| MiLE|
+--------+
Here, we are using the special regular expression character '$' that only matches patterns
occurring at the end of the string. This is the reason no replacement was done for
the 'le' in Alex.
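The same anchoring behavior can be reproduced with Python's re.sub — a pure-Python analogue of regexp_replace, shown here only to illustrate the $ end-of-string semantics:

```python
import re

names = ["Alex", "Mile"]

# 'le$' matches 'le' only at the end of the string, so 'Alex'
# (where 'le' sits in the middle) is left untouched.
replaced = [re.sub(r"le$", "LE", n) for n in names]
print(replaced)  # ['Alex', 'MiLE']
```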
RELATED
PySpark SQL Functions' regexp_extract(~) method extracts a substring using regular expression.
PySpark SQL Functions' repeat(~) method returns a new column with each string value repeated n times.
Parameters
1. col | string or Column
The column to repeat.
2. n | int
The number of times to repeat each value.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
import pyspark.sql.functions as F
df.select(F.repeat("name", 2)).show()
+---------------+
|repeat(name, 2)|
+---------------+
| AlexAlex|
| BobBob|
+---------------+
Note that we could also supply a Column object to repeat(~) like so:
import pyspark.sql.functions as F
df.select(F.repeat(df.name, 2)).show()
+---------------+
|repeat(name, 2)|
+---------------+
| AlexAlex|
| BobBob|
+---------------+
PySpark SQL Functions | round method
PySpark SQL Functions' round(~) method rounds the values of the specified column.
Parameters
1. col | string or Column
The column to round.
2. scale | int | optional
If scale is positive, such as scale=2, then values are rounded to 2 decimal places. If scale is
negative, such as scale=-1, then values are rounded to the nearest ten. By default, scale=0, that is,
values are rounded to the nearest integer.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+------+
| name|salary|
+-----+------+
| Alex| 90.4|
| Bob| 100.5|
|Cathy|100.63|
+-----+------+
import pyspark.sql.functions as F
df.select(F.round("salary")).show()
+----------------+
|round(salary, 0)|
+----------------+
| 90.0|
| 101.0|
| 101.0|
+----------------+
df.select(F.round("salary", 1)).show()
+----------------+
|round(salary, 1)|
+----------------+
| 90.4|
| 100.5|
| 100.6|
+----------------+
df.select(F.round("salary", -1)).show()
+-----------------+
|round(salary, -1)|
+-----------------+
| 90.0|
| 100.0|
| 100.0|
+-----------------+
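The scale semantics can be sketched in plain Python. Note one assumption worth flagging: Spark's round(~) uses HALF_UP rounding, while Python's built-in round() uses banker's rounding, so the decimal module is the closer analogue here:

```python
from decimal import Decimal, ROUND_HALF_UP

def spark_round(x, scale=0):
    # scale=1 -> exponent -1 (one decimal place);
    # scale=-1 -> exponent +1 (nearest ten), like Spark's scale.
    exp = Decimal(1).scaleb(-scale)
    return float(Decimal(str(x)).quantize(exp, rounding=ROUND_HALF_UP))

print(spark_round(100.5))       # 101.0  (HALF_UP, unlike Python's round())
print(spark_round(100.63, 1))   # 100.6
print(spark_round(100.63, -1))  # 100.0
```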
PySpark SQL Functions' split(~) method returns a new PySpark column of arrays containing
the split tokens based on the specified delimiter.
Parameters
1. str | string or Column
The column of strings to split.
2. pattern | string
The delimiter (a regular expression) by which to split.
3. limit | int | optional
If limit > 0, then the resulting array of split tokens will contain at most limit tokens.
By default, limit=-1.
Return Value
A new PySpark column.
Examples
Consider the following PySpark DataFrame:
df.show()
+-------+
| x|
+-------+
| A#A|
| B##B|
|#C#C#C#|
| null|
+-------+
To split each string by the delimiter #:
import pyspark.sql.functions as F
df.select(F.split("x", "#")).show()
+---------------+
|split(x, #, -1)|
+---------------+
| [A, A]|
| [B, , B]|
| [, C, C, C, ]|
| null|
+---------------+
We can also specify the maximum number of splits to perform using the optional parameter limit :
df.select(F.split("x", "#", 2)).show()
+--------------+
|split(x, #, 2)|
+--------------+
| [A, A]|
| [B, #B]|
| [, C#C#C#]|
| null|
+--------------+
Here, the array containing the split tokens can be of length at most 2. This is the reason why we
still see our delimiter substring "#" in there.
df.show()
+----+
| x|
+----+
| A#A|
| B@B|
|C#@C|
+----+
To split by either the characters # or @ , we can use a regular expression as the delimiter:
df.select(F.split("x", "[#@]")).show()
+------------------+
|split(x, [#@], -1)|
+------------------+
| [A, A]|
| [B, B]|
| [C, , C]|
+------------------+
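Python's re.split offers a close analogue — a sketch of the split semantics, with one mapping to keep in mind: Spark's limit counts tokens, while re.split's maxsplit counts splits, so maxsplit = limit - 1 when limit > 0:

```python
import re

def split(value, pattern, limit=-1):
    # maxsplit=0 means "no limit" in re.split
    maxsplit = limit - 1 if limit > 0 else 0
    return re.split(pattern, value, maxsplit=maxsplit)

print(split("#C#C#C#", "#"))   # ['', 'C', 'C', 'C', '']
print(split("B##B", "#", 2))   # ['B', '#B']  (at most 2 tokens)
print(split("C#@C", "[#@]"))   # ['C', '', 'C']  (regex delimiter)
```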
PySpark SQL Functions' to_date() method converts date strings to date types.
Parameters
1. col | Column
The column of date strings to convert.
2. format | string | optional
The format of the date strings.
Return Value
A PySpark Column.
Examples
Consider the following PySpark DataFrame with some date strings:
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1998-05-06|
+----+----------+
To convert date strings in the birthday column to actual date type, use to_date(~) and specify the
pattern of the date string:
import pyspark.sql.functions as F
df_new = df.withColumn("birthday", F.to_date(df["birthday"], "yyyy-MM-dd"))
df_new.printSchema()
root
 |-- name: string (nullable = true)
 |-- birthday: date (nullable = true)
Here, the withColumn(~) method is used to update the birthday column using the new column
returned by to_date(~) .
As another example, here's a PySpark DataFrame with slightly more complicated date strings:
filter_none
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1998-05-06|
+----+----------+
Here, our date strings also contain hours, minutes and seconds.
df_new.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1998-05-06|
+----+----------+
PySpark SQL Functions' translate(~) method replaces the specified characters by the desired
characters.
Parameters
1. srcCol | string or Column
The column to perform the operation on.
2. matching | string
The characters to replace.
3. replace | string
The replacement characters.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To perform the following character replacements:
A -> #
e -> @
o -> %
import pyspark.sql.functions as F
df.select(F.translate("name", "Aeo", "#@%")).show()
+-------------------------+
|translate(name, Aeo, #@%)|
+-------------------------+
| #l@x|
| B%b|
| Cathy|
+-------------------------+
Note that we can obtain a new PySpark DataFrame with the translated column using
the withColumn(~) method:
df_new = df.withColumn("name", F.translate("name", "Aeo", "#@%"))
df_new.show()
+-----+---+
| name|age|
+-----+---+
| #l@x| 20|
| B%b| 30|
|Cathy| 40|
+-----+---+
Finally, note that specifying fewer characters for the replace parameter results in the removal of
the corresponding characters in matching:
df.select(F.translate("name", "Aeo", "#")).show()
+-----------------------+
|translate(name, Aeo, #)|
+-----------------------+
| #lx|
| Bb|
| Cathy|
+-----------------------+
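The same character-mapping behavior, including the deletion of characters with no replacement, can be sketched with str.translate — a pure-Python analogue, not Spark's implementation:

```python
def translate(value, matching, replace):
    # Map each matching character to its replacement; characters
    # beyond the length of `replace` map to None, i.e. are deleted.
    table = {ord(m): (replace[i] if i < len(replace) else None)
             for i, m in enumerate(matching)}
    return value.translate(table)

print(translate("Alex", "Aeo", "#@%"))  # #l@x
print(translate("Bob", "Aeo", "#"))     # Bb  ('o' is deleted)
```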
PySpark SQL Functions' trim(~) method returns a new PySpark column with the string values
trimmed, that is, with the leading and trailing spaces removed.
Parameters
1. col | string
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[" Alex ", 20], [" Bob", 30], ["Cathy ", 40]], ["name", "age"])
df.show()
+---------+---+
| name|age|
+---------+---+
| Alex | 20|
| Bob| 30|
|Cathy | 40|
+---------+---+
Here, the values in the name column have leading and trailing spaces.
Trimming columns in PySpark
To trim the name column, that is, to remove the leading and trailing spaces:
import pyspark.sql.functions as F
df.select(F.trim("name").alias("trimmed_name")).show()
+------------+
|trimmed_name|
+------------+
| Alex|
| Bob|
| Cathy|
+------------+
Here, the alias(~) method is used to assign a label to the Column returned by trim(~) .
To get the original PySpark DataFrame but with the name column updated with the trimmed
version, use the withColumn(~) method:
df.withColumn("name", F.trim("name")).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
PySpark SQL Functions' upper(~) method returns a new PySpark Column with the specified column
upper-cased.
Parameters
1. col | string or Column
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|alex| 25|
| bOb| 30|
+----+---+
import pyspark.sql.functions as F
df.select(F.upper(df.name)).show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
+-----------+
import pyspark.sql.functions as F
df.select(F.upper("name")).show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
+-----------+
To replace the name column with the upper-cased version, use the withColumn(~) method:
import pyspark.sql.functions as F
df.withColumn("name", F.upper("name")).show()
+----+---+
|name|age|
+----+---+
|ALEX| 25|
| BOB| 30|
+----+---+
PySpark SQL Functions' when(~) method is used to update values of a PySpark DataFrame column
to other values based on the given conditions.
NOTE
The when(~) method is often used in conjunction with the otherwise(~) method to implement an if-
else logic. See examples below for clarification.
Parameters
1. condition | Column
The boolean condition.
2. value
The value to assign if the condition is met.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
import pyspark.sql.functions as F
df.select(F.when(df.name == "Alex", "Doge").otherwise("Eric")).show()
+-----------------------------------------------+
|CASE WHEN (name = Alex) THEN Doge ELSE Eric END|
+-----------------------------------------------+
| Doge|
| Eric|
| Eric|
+-----------------------------------------------+
Notice how we used the method otherwise(~) to set values for cases when the conditions are not
met.
Note that if you do not include the otherwise(~) method, then any value that does not fulfil the if
condition will be assigned null :
df.select(F.when(df.name == "Alex", "Doge")).show()
+-------------------------------------+
|CASE WHEN (name = Alex) THEN Doge END|
+-------------------------------------+
| Doge|
| null|
| null|
+-------------------------------------+
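The if-else semantics of when(~)/otherwise(~) reduce to a per-row conditional expression — a pure-Python sketch, where None plays the role of Spark's null:

```python
names = ["Alex", "Bob", "Cathy"]

# with otherwise(~): unmatched rows get the default value
with_otherwise = ["Doge" if n == "Alex" else "Eric" for n in names]

# without otherwise(~): unmatched rows become None (Spark's null)
without_otherwise = ["Doge" if n == "Alex" else None for n in names]

print(with_otherwise)     # ['Doge', 'Eric', 'Eric']
print(without_otherwise)  # ['Doge', None, None]
```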
Specifying multiple conditions
Using the pipe and ampersand operators
We can combine conditions using & (and) and | (or) like so:
+----+---+
|name|age|
+----+---+
|Doge| 20|
|Eric| 24|
|Eric| 22|
+----+---+
df.select(F.when(df.name == "Alex", "Doge")
          .when(df.name == "Bob", "Zebra")
          .otherwise("Eric")).show()
+----------------------------------------------------------------------------+
|CASE WHEN (name = Alex) THEN Doge WHEN (name = Bob) THEN Zebra ELSE Eric END|
+----------------------------------------------------------------------------+
| Doge|
| Zebra|
| Eric|
+----------------------------------------------------------------------------+
import pyspark.sql.functions as F
df.select(F.when(df.age > 15, df.age + 30)).show()
+----------------------------------------+
|CASE WHEN (age > 15) THEN (age + 30) END|
+----------------------------------------+
| 50|
| 54|
| 52|
+----------------------------------------+
Using an alias
import pyspark.sql.functions as F
df.select(F.when(df.name == "Alex", "Doge").otherwise("Eric")).show()
+-----------------------------------------------+
|CASE WHEN (name = Alex) THEN Doge ELSE Eric END|
+-----------------------------------------------+
| Doge|
| Eric|
| Eric|
+-----------------------------------------------+
import pyspark.sql.functions as F
df.select(F.when(df.name == "Alex", "Doge").otherwise("Eric").alias("new_name")).show()
+--------+
|new_name|
+--------+
| Doge|
| Eric|
| Eric|
+--------+
PySpark DataFrame's rdd property returns the RDD representation of the DataFrame. Keep in
mind that PySpark DataFrames are internally represented as RDDs.
Return Value
RDD containing Row objects.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To convert our PySpark DataFrame into a RDD, use the rdd property:
rdd = df.rdd
rdd.collect()
[Row(name='Alex', age=25), Row(name='Bob', age=30)]
Here, we are using the collect() method to see the content of our RDD, which is a list
of Row objects.
PySpark DataFrame | alias method
PySpark DataFrame's alias(~) method gives an alias to the DataFrame that you can then refer to in
string statements.
Parameters
This method does not take any parameters.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Let's give an alias to our DataFrame, and then refer to the DataFrame using the alias:
df = df.alias("my_df")
df.select("my_df.name").show()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
PySpark DataFrame's coalesce(~) method reduces the number of partitions of the PySpark
DataFrame without shuffling.
Parameters
1. num_partitions | int
The number of partitions to reduce to.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
The default number of partitions is governed by your PySpark configuration. In my case, the
default number of partitions is:
df.rdd.getNumPartitions()
8
We can see the actual content of each partition of the PySpark DataFrame by using the underlying
RDD's glom() method:
df.rdd.glom().collect()
[[],
[],
[Row(name='Alex', age=20)],
[],
[],
[Row(name='Bob', age=30)],
[],
[Row(name='Cathy', age=40)]]
We can see that we indeed have 8 partitions, 3 of which contain a Row .
To reduce the number of partitions of the DataFrame without shuffling, use coalesce(~):
df_new = df.coalesce(2)
df_new.rdd.glom().collect()
[[Row(name='Alex', age=20)],
 [Row(name='Bob', age=30), Row(name='Cathy', age=40)]]
NOTE
Both the methods repartition(~) and coalesce(~) are used to change the number of partitions,
but here is a notable difference: repartition(~) generally results in a shuffling operation
while coalesce(~) does not. This means that coalesce(~) is less costly than repartition(~)
because the data does not have to travel across the worker nodes as much.
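The no-shuffle property can be sketched in plain Python: each new partition is just a concatenation of whole existing partitions, so no individual row is redistributed. This is only an illustrative sketch — Spark's actual partition-grouping strategy differs:

```python
def coalesce(partitions, num_partitions):
    # Assign whole existing partitions to the new, smaller set of
    # partitions; rows never leave the partition they started in.
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)
    return merged

# The 8 partitions from the example above (tuples stand in for Rows)
eight = [[], [], [("Alex", 20)], [], [], [("Bob", 30)], [], [("Cathy", 40)]]
print(coalesce(eight, 2))
```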
PySpark DataFrame's collect() method returns all the records of the DataFrame as a list
of Row objects.
Return Value
A list of Row objects.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df.collect()
[Row(name='Alex', age=25), Row(name='Bob', age=30)]
WARNING
Under the hood, the collect() method sends all the data scattered across the worker nodes to
the driver node. This means that if the size of the data is large, the driver program may
run out of memory and throw an error.
PySpark DataFrame's colRegex(~) method returns the Column objects whose labels match the specified
regular expression. This method also allows multiple columns to be selected.
Parameters
1. colName | string
Return Value
A PySpark Column.
Examples
Selecting columns using regular expression in PySpark
df.show()
+-----+----+
| col1|col2|
+-----+----+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+----+
df.select(df.colRegex("`col[123]`")).show()
+-----+----+
| col1|col2|
+-----+----+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+----+
Here, the regular expression col[123] matches columns with label col1, col2 or col3, and
the select(~) method is used to convert the Column objects into a PySpark DataFrame.
Getting column labels that match regular expression as list of strings in PySpark
df.select(df.colRegex("`col[123]`")).columns
['col1', 'col2']
Here, we are using the columns property of the PySpark DataFrame returned by select(~) .
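The label-matching step can be sketched with Python's re module — a pure-Python analogue of selecting columns whose names fully match a pattern:

```python
import re

columns = ["col1", "col2", "name"]

# Keep only labels that fully match the regular expression
matched = [c for c in columns if re.fullmatch(r"col[123]", c)]
print(matched)  # ['col1', 'col2']
```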
PySpark DataFrame's corr(~) method returns the correlation of the specified numeric columns
as a float.
Parameters
1. col1 | string
The first column.
2. col2 | string
The second column.
3. method | string | optional
The type of correlation to compute. The only correlation type supported currently is
the Pearson Correlation Coefficient.
Return Value
A float.
Examples
df = spark.createDataFrame([("Alex", 180, 80), ("Bob", 170, 70), ("Cathy", 160, 70)], ["name", "height", "weight"])
df.show()
+-----+------+------+
| name|height|weight|
+-----+------+------+
| Alex|   180|    80|
|  Bob|   170|    70|
|Cathy|   160|    70|
+-----+------+------+
df.corr("height","weight")
0.8660254037844387
Here, we see that the height and weight are positively correlated with a Pearson correlation
coefficient of around 0.87.
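The returned value can be reproduced by computing the Pearson correlation coefficient by hand — a pure-Python check of what corr(~) computes on the example data:

```python
import math

heights = [180, 170, 160]
weights = [80, 70, 70]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Pearson r = sum of deviation products / (product of deviation norms)
num = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
den = (math.sqrt(sum((h - mean_h) ** 2 for h in heights))
       * math.sqrt(sum((w - mean_w) ** 2 for w in weights)))
r = num / den
print(round(r, 10))  # 0.8660254038
```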
RELATED
PySpark DataFrame's cov(~) method returns the covariance of two specified numeric columns as a
float.
PySpark DataFrame's count(~) method returns the number of rows of the DataFrame.
Parameters
This method does not take in any parameters.
Return Value
An integer.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df.count()
3
PySpark DataFrame's cov(~) method returns the covariance of two specified numeric columns as a
float.
Parameters
1. col1 | string
2. col2 | string
Return Value
A float .
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 180, 80), ("Bob", 170, 70), ("Cathy", 160, 70)], ["name", "height", "weight"])
df.show()
+-----+------+------+
| name|height|weight|
+-----+------+------+
| Alex|   180|    80|
|  Bob|   170|    70|
|Cathy|   160|    70|
+-----+------+------+
Computing the covariance of two numeric PySpark columns
df.cov("height","weight")
50.0
Here, we see that the covariance between height and weight is 50, indicating a positive relationship.
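The returned value matches the sample covariance (dividing by n - 1) — a pure-Python check on the example data:

```python
heights = [180, 170, 160]
weights = [80, 70, 70]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Sample covariance: sum of deviation products divided by n - 1
cov = sum((h - mean_h) * (w - mean_w)
          for h, w in zip(heights, weights)) / (n - 1)
print(cov)  # 50.0
```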
RELATED
PySpark DataFrame's corr(~) method returns the correlation of the specified numeric columns as a
float.
PySpark DataFrame | describe method
PySpark DataFrame's describe(~) method returns a new PySpark DataFrame holding summary
statistics of the specified columns.
Parameters
1. *cols | string | optional
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 25|
| Bob| 30|
+----+---+
df.describe("name","age").show()
+-------+----+----+
|summary|name| age|
+-------+----+----+
| count| 3| 3|
| mean|null|25.0|
| stddev|null| 5.0|
| min|Alex| 20|
| max| Bob| 30|
+-------+----+----+
Getting summary statistics of all numeric and string columns in PySpark DataFrame
df.describe().show()
+-------+----+----+
|summary|name| age|
+-------+----+----+
| count| 3| 3|
| mean|null|25.0|
| stddev|null| 5.0|
| min|Alex| 20|
| max| Bob| 30|
+-------+----+----+
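The mean and stddev rows correspond to the sample mean and sample standard deviation of the numeric column — a quick pure-Python check using the standard library:

```python
import statistics

ages = [20, 25, 30]

print(statistics.mean(ages))   # 25
print(statistics.stdev(ages))  # 5.0  (sample stddev, dividing by n - 1)
```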
RELATED
PySpark DataFrame's summary(~) method returns a PySpark DataFrame containing basic summary
statistics of numeric columns.
PySpark DataFrame's distinct() method returns a new DataFrame containing distinct rows.
Parameters
This method does not take in any parameters.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Alex", 25], ["Alex", 50]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
|Alex| 25|
|Alex| 50|
+----+---+
To get all distinct rows of a PySpark DataFrame, use the distinct() method:
filter_none
df.distinct().show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
|Alex| 50|
+----+---+
To count the number of distinct rows, chain the count() method:
df.distinct().count()
3
PySpark DataFrame's drop(~) method returns a new DataFrame with the specified columns
dropped.
NOTE
Trying to drop a column that does not exist will not raise an error - the original DataFrame will
be returned instead.
Parameters
1. *cols | string or Column
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25, True], ["Bob", 30, False]], ["name", "age", "is_married"])
df.show()
+----+---+----------+
|name|age|is_married|
+----+---+----------+
|Alex| 25|      true|
| Bob| 30|     false|
+----+---+----------+
df.drop("name").show()
+---+----------+
|age|is_married|
+---+----------+
| 25| true|
| 30| false|
+---+----------+
import pyspark.sql.functions as F
df.drop(F.col("name")).show()
+---+----------+
|age|is_married|
+---+----------+
| 25| true|
| 30| false|
+---+----------+
df.drop("name", "age").show()
+----------+
|is_married|
+----------+
| true|
| false|
+----------+
WARNING
Passing in multiple Column objects, like df.drop(F.col("name"), F.col("age")), will throw an error.
To drop multiple columns, pass in the column labels instead:
cols = ["name", "age"]
df.drop(*cols).show()
+----------+
|is_married|
+----------+
| true|
| false|
+----------+
PySpark DataFrame's dropDuplicates(~) returns a new DataFrame with duplicate rows removed. We
can optionally specify columns to check for duplicates.
NOTE
dropDuplicates(~) is an alias for drop_duplicates(~) .
Parameters
1. subset | string or list of string | optional
The columns by which to check for duplicates. By default, all columns will be checked.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Bob", 30], ["Cathy", 25]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
| Bob| 30|
|Cathy| 25|
+-----+---+
df.dropDuplicates().show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
|Cathy| 25|
+-----+---+
Here, only the first occurrence is kept while subsequent occurrences are removed.
To remove rows with duplicate values in the age column only:
df.dropDuplicates(["age"]).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Again, only the first occurrence is kept while the latter duplicate rows are discarded.
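The keep-first-occurrence behavior can be sketched in plain Python — a minimal analogue of dropDuplicates(["age"]) using a seen-set:

```python
rows = [("Alex", 25), ("Bob", 30), ("Bob", 30), ("Cathy", 25)]

seen = set()
deduped = []
for name, age in rows:
    if age not in seen:        # key = the "age" column only
        seen.add(age)
        deduped.append((name, age))

print(deduped)  # [('Alex', 25), ('Bob', 30)]
```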
PySpark DataFrame's dropna(~) method returns a new DataFrame with rows containing null values removed.
Parameters
1. how | string | optional
If 'any', drop rows that contain at least one null value; if 'all', drop rows whose values are all null.
By default, how='any'.
2. thresh | int | optional
Drop rows that have fewer non-null values than thresh. Note that this overrides the how parameter.
3. subset | string or list of string | optional
The columns to check for null values. By default, all columns will be checked.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
| null|null|
|Cathy|null|
+-----+----+
df.dropna().show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
n_non_missing_vals = 2
df.dropna(thresh=n_non_missing_vals).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
To keep rows with fewer than n_missing_vals missing values:
n_missing_vals = 2
df.dropna(thresh=len(df.columns) - n_missing_vals + 1).show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
|Cathy|null|
+-----+----+
df.dropna(how='all').show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
|Cathy|null|
+-----+----+
df.dropna(subset='age').show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows where certain values are missing (either) in PySpark DataFrame
To drop rows where either the name or age column value is missing:
df.dropna(subset=['name','age'], how='any').show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows where certain values are missing (all) in PySpark DataFrame
To drop rows where the name and age column values are both missing:
df.dropna(subset=['name','age'], how='all').show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
|Cathy|null|
+-----+----+
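The how/subset combinations reduce to any/all checks over the selected values — a pure-Python sketch of the dropna semantics, where None stands in for Spark's null:

```python
rows = [{"name": "Alex", "age": 20},
        {"name": None, "age": None},
        {"name": "Cathy", "age": None}]

def dropna(rows, how="any", subset=None):
    cols = subset or ["name", "age"]
    if how == "any":
        # Drop a row if ANY checked value is null
        return [r for r in rows if all(r[c] is not None for c in cols)]
    # how == "all": drop a row only if ALL checked values are null
    return [r for r in rows if any(r[c] is not None for c in cols)]

print(len(dropna(rows)))                   # 1  (only Alex survives)
print(len(dropna(rows, how="all")))        # 2  (Alex and Cathy)
print(len(dropna(rows, subset=["name"])))  # 2  (Alex and Cathy)
```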
PySpark DataFrame's exceptAll(~) method returns a new DataFrame containing the rows that exist
in this DataFrame but not in the other DataFrame, while preserving duplicates.
Parameters
1. other | PySpark DataFrame
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
df_other = spark.createDataFrame([["Alex", 20], ["Bob", 35], ["Cathy", 40]], ["name", "age"])
df_other.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 35|
|Cathy| 40|
+-----+---+
Getting all rows of PySpark DataFrame that do not exist in another PySpark DataFrame
df.exceptAll(df_other).show()
+----+---+
|name|age|
+----+---+
| Bob| 30|
+----+---+
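Because duplicates are preserved, exceptAll behaves like a multiset difference rather than a plain set difference — a pure-Python sketch using collections.Counter:

```python
from collections import Counter

rows = [("Alex", 20), ("Bob", 30), ("Cathy", 40)]
other = [("Alex", 20), ("Bob", 35), ("Cathy", 40)]

# Counter subtraction keeps per-row counts, so a row appearing twice
# here and once in `other` would survive once.
diff = list((Counter(rows) - Counter(other)).elements())
print(diff)  # [('Bob', 30)]
```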
PySpark DataFrame's filter(~) method returns the rows of the DataFrame that satisfies the given
condition.
NOTE
The where(~) method is an alias for filter(~).
Parameters
1. condition | Column or string
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
PySpark DataFrame's foreach(~) method loops over each row of the DataFrame as a Row object and
applies the given function to the row.
WARNING
the foreach(~) method in Spark is invoked in the worker nodes instead of the Driver
program. This means that if we perform a print(~) inside our function, we will not be able
to see the printed results in our session or notebook because the results are printed in the
worker node instead.
rows are read-only and so you cannot update values of the rows.
Given these limitations, the foreach(~) method is mainly used for logging some information about
each row to the local machine or to an external database.
Parameters
1. f | function
Return Value
Nothing is returned.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
def f(row):
    print(row.name)

df.foreach(f)
Here, the row.name is printed in the worker nodes so you would not see any output in the driver
program.
PySpark DataFrame | groupBy method
PySpark DataFrame's groupBy(~) method aggregates rows based on the specified columns. We can
then compute statistics such as the mean for each of these groups.
Parameters
1. cols | list or string or Column | optional
The columns to group by. By default, all rows will be grouped together.
Return Value
The GroupedData object ( pyspark.sql.group.GroupedData ).
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+----------+---+------+
| name|department|age|salary|
+-----+----------+---+------+
+-----+----------+---+------+
Basic usage
By default, groupBy() without any arguments will group all rows together, and will compute
statistics for each numeric column:
df.groupby().max().show()
+--------+-----------+
|max(age)|max(salary)|
+--------+-----------+
| 24| 600|
+--------+-----------+
Grouping by a single column and computing statistic of all columns of each group
df.groupBy("department").max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
+----------+--------+-----------+
Instead of referring to the column by its label ( string ), we can also pass a Column object using pyspark.sql.functions.col(~) :
from pyspark.sql import functions as F
df.groupby(F.col("department")).max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
+----------+--------+-----------+
Grouping by a single column and computing statistic of specific columns of each group
df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+
Equivalently, we can use the agg(~) method with one of pyspark.sql.functions ' aggregate functions:
df.groupby("department").agg(F.max("age")).show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+
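As a plain-Python sketch of what groupBy("department").max("age") computes (the rows here are hypothetical, chosen only to be consistent with the IT/HR outputs shown above):

```python
# Plain-Python sketch of groupBy("department").max("age").
# The rows are hypothetical, chosen to be consistent with the
# IT -> 24 and HR -> 22 outputs shown in this section.
rows = [("Alex", "IT", 24), ("Bob", "IT", 21), ("Cathy", "HR", 22)]

max_age = {}
for name, department, age in rows:
    max_age[department] = max(max_age.get(department, age), age)

print(max_age)  # {'IT': 24, 'HR': 22}
```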
NOTE
The following aggregate functions are supported in PySpark:
agg, avg, count, max, mean, min, pivot, sum
By default, computing the max age of each group will result in the column label max(age) :
df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+
To assign the label max_age instead, use the alias(~) method:
import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max_age")).show()
+----------+-------+
|department|max_age|
+----------+-------+
| IT| 24|
| HR| 22|
+----------+-------+
Computing multiple statistics of each group
import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max"), F.min("age"), F.avg("salary")).show()
+----------+--------+--------+-----------------+
|department| max|min(age)| avg(salary)|
+----------+--------+--------+-----------------+
+----------+--------+--------+-----------------+
Grouping by multiple columns
Consider the following PySpark DataFrame:
df.show()
+-----+--------+----------+---+------+
| name|position|department|age|salary|
+-----+--------+----------+---+------+
+-----+--------+----------+---+------+
To group by position and department , and then computing the max age of each of these groups:
df.groupby(["position", "department"]).max("age").show()
+--------+----------+--------+
|position|department|max(age)|
+--------+----------+--------+
| junior| IT| 24|
+--------+----------+--------+
PySpark DataFrame | head method
PySpark DataFrame's head(~) method returns the first n number of rows as Row objects.
Parameters
1. n | int | optional
The number of rows to return.
Return Value
If n is specified, then a list of Row objects is returned. Otherwise, a single Row object is returned.
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 25]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
df.head()
Row(name='Alex', age=15)
df.head(n=2)
[Row(name='Alex', age=15), Row(name='Bob', age=20)]
RELATED
PySpark DataFrame's take(~) method returns the first num number of rows as a list of Row objects.
PySpark DataFrame | intersect method
PySpark DataFrame's intersect(~) method returns a new PySpark DataFrame with rows that exist in
another PySpark DataFrame. Note that unlike intersectAll(~) , intersect(~) only includes duplicate
rows once.
Parameters
1. other | DataFrame
The other PySpark DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
df_other = spark.createDataFrame([("Alex", 20), ("Doge", 30), ("eric", 40)], ["name", "age"])
df_other.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
|Doge| 30|
|eric| 40|
+----+---+
To get rows of a PySpark DataFrame that exist in another PySpark DataFrame, use
the intersect(~) method like so:
df_intersect = df.intersect(df_other)
df_intersect.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, we get the row for Alex because it exists in both PySpark DataFrames.
PySpark DataFrame | intersectAll method
PySpark DataFrame's intersectAll(~) method returns a new PySpark DataFrame with rows that also
exist in the other PySpark DataFrame. Unlike intersect(~) , the intersectAll(~) method preserves
duplicates.
Parameters
1. other | DataFrame
The other PySpark DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 20), ("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
df_other = spark.createDataFrame([("Alex", 20), ("Alex", 20), ("David", 80), ("Eric", 80)], ["name", "age"])
df_other.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Alex| 20|
|David| 80|
| Eric| 80|
+-----+---+
To get rows that also exist in other PySpark DataFrame while preserving duplicates:
df_res = df.intersectAll(df_other)
df_res.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
|Alex| 20|
+----+---+
Alex's row is duplicated because it appears twice in both df and df_other .
If Alex's row appeared only once in one DataFrame but multiple times in the other, then it
would be included just once in the resulting DataFrame.
NOTE
If you want duplicate rows to be included only once, use the intersect(~) method instead.
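The two behaviours can be sketched in plain Python: intersect(~) acts like a set intersection, while intersectAll(~) acts like a multiset intersection that keeps each common row min(count in df, count in df_other) times:

```python
from collections import Counter

# Plain-Python sketch of intersect vs intersectAll semantics,
# using the same rows as the example above.
df_rows = [("Alex", 20), ("Alex", 20), ("Bob", 30), ("Cathy", 40)]
other_rows = [("Alex", 20), ("Alex", 20), ("David", 80), ("Eric", 80)]

# intersect: distinct rows present in both DataFrames
intersect = set(df_rows) & set(other_rows)

# intersectAll: each common row kept min(count_df, count_other) times
counts = Counter(df_rows) & Counter(other_rows)  # multiset intersection
intersect_all = list(counts.elements())

print(intersect)      # {('Alex', 20)}
print(intersect_all)  # [('Alex', 20), ('Alex', 20)]
```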
PySpark DataFrame | join method
schedule AUG 12, 2023
PySpark DataFrame's join(~) method joins two DataFrames using the given join method.
Parameters
1. other | DataFrame
The other PySpark DataFrame to join with.
2. on | string or list or Column | optional
The columns or join condition on which to join.
3. how | string | optional
By default, how="inner" . See examples below for the type of joins implemented.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Performing inner, left and right joins
df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df2 = spark.createDataFrame([["Alex", 250], ["Bob", 200], ["Doge", 100]], ["name", "salary"])
df2.show()
+----+------+
|name|salary|
+----+------+
|Alex| 250|
| Bob| 200|
|Doge| 100|
+----+------+
Inner join
For inner join, all rows that have matching values in both the source and right DataFrame will be
present in the resulting DataFrame:
df1.join(df2, on="name", how="inner").show()
+----+---+------+
|name|age|salary|
+----+---+------+
|Alex| 20|   250|
| Bob| 24|   200|
+----+---+------+
Left join
For left join (or left-outer join), all rows in the left DataFrame and matching rows in the right
DataFrame will be present in the resulting DataFrame:
df1.join(df2, on="name", how="left").show()  # how="left_outer" also works
+-----+---+------+
| name|age|salary|
+-----+---+------+
| Alex| 20|   250|
|  Bob| 24|   200|
|Cathy| 22|  null|
+-----+---+------+
Right join
For right join (or right-outer join), all rows in the right DataFrame and matching rows in the left
DataFrame will be present in the resulting DataFrame:
df1.join(df2, on="name", how="right").show()  # how="right_outer" also works
+----+----+------+
|name| age|salary|
+----+----+------+
|Alex|  20|   250|
| Bob|  24|   200|
|Doge|null|   100|
+----+----+------+
Outer join
For outer join, all rows of both the left and right DataFrames will be present:
df1.join(df2, on="name", how="outer").show()  # how="full_outer" also works
+-----+----+------+
| name| age|salary|
+-----+----+------+
| Alex|  20|   250|
|  Bob|  24|   200|
|Cathy|  22|  null|
| Doge|null|   100|
+-----+----+------+
Left anti-join
For left anti-join, all rows in the left DataFrame that are not present in the right DataFrame will
be in the resulting DataFrame:
df1.join(df2, on="name", how="left_anti").show()
+-----+---+
| name|age|
+-----+---+
|Cathy| 22|
+-----+---+
Left semi-join
Left semi-join is the opposite of left-anti join, that is, all rows in the left DataFrame that are
present in the right DataFrame will be in the resulting DataFrame:
df1.join(df2, on="name", how="left_semi").show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 24|
+----+---+
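The join types above can be sketched in plain Python, keyed on name, with None standing in for null:

```python
# Plain-Python sketch of the join semantics above, keyed on name.
df1 = [("Alex", 20), ("Bob", 24), ("Cathy", 22)]
df2 = [("Alex", 250), ("Bob", 200), ("Doge", 100)]
salaries = dict(df2)

inner = [(n, a, salaries[n]) for n, a in df1 if n in salaries]
left = [(n, a, salaries.get(n)) for n, a in df1]          # None plays the role of null
left_anti = [(n, a) for n, a in df1 if n not in salaries]
left_semi = [(n, a) for n, a in df1 if n in salaries]

print(inner)      # [('Alex', 20, 250), ('Bob', 24, 200)]
print(left_anti)  # [('Cathy', 22)]
```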
Up to now, we have specified the join key using the on parameter. Let's now consider the case
when the join keys have different labels. Suppose one DataFrame is as follows:
df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df2 = spark.createDataFrame([["Alex", 250], ["Bob", 200], ["Doge", 100]], ["NAME", "salary"])
df2.show()
+----+------+
|NAME|salary|
+----+------+
|Alex| 250|
| Bob| 200|
|Doge| 100|
+----+------+
We can join using name of df1 and NAME of df2 like so:
df1.join(df2, df1["name"] == df2["NAME"]).show()
+----+---+----+------+
|name|age|NAME|salary|
+----+---+----+------+
|Alex| 20|Alex|   250|
| Bob| 24| Bob|   200|
+----+---+----+------+
PySpark DataFrame | limit method
PySpark DataFrame's limit(~) method returns a new DataFrame with the number of rows specified.
Parameters
1. num | number
The number of rows to return.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 25]], ["name", "age"])
df.show()
+-----+-----+
| name| age|
+-----+-----+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+-----+
df.limit(2).show()
+----+---+
|name|age|
+----+---+
|Alex| 15|
| Bob| 20|
+----+---+
Note that the show(~) method has a parameter that limits the number of rows printed:
df.show(n=2)
+----+---+
|name|age|
+----+---+
|Alex| 15|
| Bob| 20|
+----+---+
PySpark DataFrame | orderBy method
PySpark DataFrame's orderBy(~) method returns a new DataFrame that is sorted based on the
specified columns.
Parameters
1. cols | string or list or Column | optional
The columns by which to sort.
2. ascending | boolean or list of boolean | optional
If a list of booleans is passed, then sort will respect this order. For example,
if [True,False] is passed and cols=["colA","colB"] , then the DataFrame will first be sorted in
ascending order of colA , and then in descending order of colB . Note that the second sort
will be relevant only when there are duplicate values in colA .
By default, ascending=True .
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 22, 200], ["Bob", 24, 300], ["Cathy", 22, 100]], ["name", "age", "salary"])
df.show()
+-----+---+------+
| name|age|salary|
+-----+---+------+
| Alex| 22|   200|
|  Bob| 24|   300|
|Cathy| 22|   100|
+-----+---+------+
Sorting PySpark DataFrame by single column in ascending order
df.orderBy("age").show()
+-----+---+------+
| name|age|salary|
+-----+---+------+
| Alex| 22|   200|
|Cathy| 22|   100|
|  Bob| 24|   300|
+-----+---+------+
Sorting PySpark DataFrame by multiple columns in ascending order
df.orderBy(["age","salary"]).show()
+-----+---+------+
| name|age|salary|
+-----+---+------+
|Cathy| 22|   100|
| Alex| 22|   200|
|  Bob| 24|   300|
+-----+---+------+
Sorting PySpark DataFrame by single column in descending order
df.orderBy("age", ascending=False).show()
+-----+---+------+
| name|age|salary|
+-----+---+------+
|  Bob| 24|   300|
| Alex| 22|   200|
|Cathy| 22|   100|
+-----+---+------+
PySpark DataFrame | printSchema method
PySpark DataFrame's printSchema(~) method prints the schema, that is, the columns' name and type
of the DataFrame.
Parameters
This method does not take in any parameters
Return Value
None .
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Printing the name and type of each column (schema) in PySpark DataFrame
To obtain the schema, or the name and type of each column of our DataFrame:
df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
PySpark DataFrame | randomSplit method
PySpark DataFrame's randomSplit(~) method randomly splits the PySpark DataFrame into a list of
smaller DataFrames using Bernoulli sampling.
Parameters of randomSplit
1. weights | list of numbers
The list of weights that specify the distribution of the split. For instance, setting [0.8,0.2] will split
the PySpark DataFrame into 2 smaller DataFrames using the following logic:
- a random number is generated between 0 and 1 for each row of the original DataFrame.
- if the random number is between 0 and 0.8, then the row will be placed in the first sub-DataFrame.
- if the random number is between 0.8 and 1.0, then the row will be placed in the second sub-DataFrame.
Under the hood, randomSplit(~) works as follows. Suppose the PySpark DataFrame has two partitions:
- the rows are first locally sorted based on some column value in each partition. This sorting
guarantees that as long as the same rows are in each partition (regardless of their ordering),
we would always end up with the same deterministic ordering.
- the acceptance range of the first split is 0 to 0.8. Any row whose generated random number
is between 0 and 0.8 will be placed in the first split.
- the acceptance range of the second split is 0.8 to 1.0. Any row whose generated random
number is between 0.8 and 1.0 will be placed in the second split.
What's important here is that there is never a guarantee that the first DataFrame will have 80% of
the rows, and the second will have 20%. For instance, suppose the random number generated for
each row falls between 0 and 0.8 - this means that none of the rows will end up in the second
DataFrame split.
On average, we should expect that the first DataFrame will have 80% of the rows while the
second DataFrame with 20% of the rows, but the actual split may be very different.
2. seed | int | optional
The seed for reproducibility. Calling the method with the same seed will always generate the same
splits. There is a caveat to this, as we shall see later.
Return Value
A list of PySpark DataFrames.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40], ["Dave", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
| Dave| 40|
+-----+---+
To randomly split this PySpark DataFrame into 2 sub-DataFrames with a 75-25 row split:
for _df in df.randomSplit([0.75, 0.25]):
    _df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|Cathy| 40|
+-----+---+
+----+---+
|name|age|
+----+---+
| Bob| 30|
|Dave| 40|
+----+---+
Even though we expect the first DataFrame to contain 3 rows while the second DataFrame to
contain 1 row, we see that split was a 50-50. This is because, as discussed above, randomSplit(~) is
based on Bernoulli sampling.
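A plain-Python sketch of this Bernoulli mechanism, with random.Random standing in for Spark's per-row random draw:

```python
import random

# Plain-Python sketch of randomSplit([0.75, 0.25]): each row draws a
# uniform number in [0, 1) and lands in the split whose acceptance range
# contains it, so the split sizes themselves are random.
rows = ["Alex", "Bob", "Cathy", "Dave"]
rng = random.Random(42)  # a seed makes the draws reproducible

first, second = [], []
for row in rows:
    (first if rng.random() < 0.75 else second).append(row)

print(len(first) + len(second))  # 4 - every row lands in exactly one split
```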
The seed parameter is used for reproducibility. For instance, consider the following PySpark
DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40], ["Dave", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
| Dave| 40|
+-----+---+
Running the randomSplit(~) method with the same seed will guarantee the same splits given that the
PySpark DataFrame is partitioned in the exact same way:
for _df in df.randomSplit([0.75, 0.25], seed=24):  # the seed value here is illustrative
    _df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|Cathy| 40|
+-----+---+
+----+---+
|name|age|
+----+---+
| Bob| 30|
|Dave| 40|
+----+---+
Running the above multiple times will always yield the same splits since the partitioning of the
PySpark DataFrame is the same.
We can see how the rows of a PySpark DataFrame are partitioned by converting the DataFrame
into a RDD, and then using the glom() method:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40], ["Dave", 40]], ["name", "age"])
df.rdd.glom().collect()
[[],
[Row(name='Alex', age=20)],
[],
[Row(name='Bob', age=30)],
[],
[Row(name='Cathy', age=40)],
[],
[Row(name='Dave', age=40)]]
Here, we see that our PySpark DataFrame is split into 8 partitions but half of them are empty.
df = df.repartition(2)
df.rdd.glom().collect()
[[Row(name='Alex', age=20),
Row(name='Bob', age=30),
Row(name='Cathy', age=40),
Row(name='Dave', age=40)],
[]]
Even though the content of the DataFrame is the same, we now only have 2 partitions instead of 8
partitions.
Running randomSplit(~) with the same seed now yields a different split:
for _df in df.randomSplit([0.75, 0.25], seed=24):  # the seed value here is illustrative
    _df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
| Dave| 40|
+-----+---+
+----+---+
|name|age|
+----+---+
+----+---+
Notice how even though we used the same seed, we ended up with a different split. This confirms
that the seed parameter only guarantees consistent splits only if the underlying partition is the
same. You should be cautious of this behaviour because partitions can change after a shuffle
operation (e.g. join(~) and groupBy(~) ).
PySpark DataFrame | repartition method
PySpark DataFrame's repartition(~) method returns a new PySpark DataFrame with the data split
into the specified number of partitions. This method also allows to partition by column values.
Parameters
1. numPartitions | int
The number of partitions to break the DataFrame into.
2. cols | string or Column | optional
The columns by which to partition.
Return Value
A new PySpark DataFrame.
Examples
Partitioning a PySpark DataFrame
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
By default, the number of partitions depends on the parallelism level of your PySpark
configuration:
df.rdd.getNumPartitions()
8
We can see how the rows of our DataFrame are partitioned using the glom() method of the
underlying RDD:
df.rdd.glom().collect()
[[],
[],
[Row(name='Alex', age=20)],
[],
[],
[Row(name='Bob', age=30)],
[],
[Row(name='Cathy', age=40)]]
Here, we can see that we have indeed 8 partitions, but only 3 of the partitions have a Row in them.
Now, let's repartition our DataFrame such that the Rows are divided into only 2 partitions:
df_new = df.repartition(2)
df_new.rdd.getNumPartitions()
2
df_new.rdd.glom().collect()
[[Row(name='Alex', age=20),
Row(name='Bob', age=30),
Row(name='Cathy', age=40)],
[]]
As demonstrated here, there is no guarantee that the rows will be evenly distributed in the
partitions.
Partitioning a PySpark DataFrame by column values
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 20), ("Bob", 30), ("Cathy", 40), ("Alex", 50)], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
| Alex| 50|
+-----+---+
df_new = df.repartition(2, "name")
df_new.rdd.glom().collect()
[[Row(name='Alex', age=20),
Row(name='Cathy', age=40),
Row(name='Alex', age=50)],
[Row(name='Bob', age=30)]]
Here, notice how the rows with the same value for name ( 'Alex' in this case) end up in the same
partition.
df_new = df.repartition(4, "name", "age")
df_new.rdd.glom().collect()
[[Row(name='Alex', age=20)],
[Row(name='Bob', age=30)],
[Row(name='Alex', age=50)],
[Row(name='Cathy', age=40)]]
Here, we are repartitioning by the name and age columns into 4 partitions.
We can also use the default number of partitions by specifying column labels only:
df_new = df.repartition("name")
df_new.rdd.getNumPartitions()
1
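A plain-Python sketch of partitioning by a column value; the modulo-hash scheme here illustrates the idea and is not Spark's exact hash function:

```python
# Plain-Python sketch of repartitioning by a column: each row goes to
# partition hash(key) % numPartitions, so rows with equal keys always
# share a partition. (Illustrative scheme, not Spark's exact hash.)
rows = [("Alex", 20), ("Bob", 30), ("Cathy", 40), ("Alex", 50)]
num_partitions = 2

partitions = [[] for _ in range(num_partitions)]
for row in rows:
    partitions[hash(row[0]) % num_partitions].append(row)

# both Alex rows necessarily land in the same partition
alex_parts = {i for i, part in enumerate(partitions) for r in part if r[0] == "Alex"}
print(len(alex_parts))  # 1
```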
PySpark DataFrame | replace method
PySpark DataFrame's replace(~) method returns a new DataFrame with certain values replaced. We
can also specify which columns to perform replacement in.
Parameters
1. to_replace | boolean , number , string , list or dict
The values to be replaced.
2. value | boolean , number , string or None | optional
The value to replace to_replace with.
3. subset | list | optional
The columns to focus on. By default, all columns will be checked for replacement.
Return Value
PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
|Cathy| 40|
+-----+---+
To replace the value "Alex" with "ALEX" :
df.replace("Alex", "ALEX").show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
| Bob| 30|
|Cathy| 40|
+-----+---+
Note that a new PySpark DataFrame is returned, and the original DataFrame is kept intact.
To replace the value "Alex" with "ALEX" and "Bob" with "BOB" in the name column:
df.replace(["Alex", "Bob"], ["ALEX", "BOB"], subset=["name"]).show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
| BOB| 30|
|Cathy| 40|
+-----+---+
Replacing multiple values with a single value
To replace the values "Alex" and "Bob" with "SkyTowner" in the name column:
df.replace(["Alex", "Bob"], "SkyTowner", subset=["name"]).show()
+---------+---+
| name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
| Cathy| 40|
+---------+---+
To replace the values "Alex" and "Bob" with "SkyTowner" in the entire DataFrame:
df.replace(["Alex","Bob"], "SkyTowner").show ()
+---------+---+
| name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
| Cathy| 40|
+---------+---+
To replace "Alex" with "ALEX" and "Bob" with "BOB" in the name column using a dictionary:
df.replace({
    "Alex": "ALEX",
    "Bob": "BOB",
}, subset=["name"]).show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  BOB| 30|
|Cathy| 40|
+-----+---+
WARNING
Mixed-type replacements are not allowed. For instance, the following is not allowed:
df.replace({
    "Alex": "ALEX",
    30: 99,
}, subset=["name", "age"]).show()
Here, we are performing one string replacement and one integer replacement. Since this is a
mixed-type replacement, PySpark throws an error. To avoid the error, perform the two replacements
individually.
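A plain-Python sketch of performing the two replacements individually, one pass per value type:

```python
# Plain-Python sketch of why mixed-type replacement needs two passes:
# apply the string mapping and the integer mapping separately, each
# only to values of the matching type.
rows = [{"name": "Alex", "age": 25}, {"name": "Bob", "age": 30}]
str_map = {"Alex": "ALEX"}
int_map = {30: 99}

# pass 1: string replacements only
rows = [{k: str_map.get(v, v) if isinstance(v, str) else v
         for k, v in row.items()} for row in rows]
# pass 2: integer replacements only
rows = [{k: int_map.get(v, v) if isinstance(v, int) else v
         for k, v in row.items()} for row in rows]

print(rows)  # [{'name': 'ALEX', 'age': 25}, {'name': 'Bob', 'age': 99}]
```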
Replacing multiple values in multiple columns
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["aa", "AA"], ["bb", "BB"]], ["col1", "col2"])
df.show()
+----+----+
|col1|col2|
+----+----+
| aa| AA|
| bb| BB|
+----+----+
df.replace({
    "AA": "@@@",
    "bb": "###",
}, subset=["col1", "col2"]).show()
+----+----+
|col1|col2|
+----+----+
| aa| @@@|
| ###| BB|
+----+----+
PySpark DataFrame | sample method
PySpark DataFrame's sample(~) method returns a random subset of rows of the DataFrame.
Parameters
1. withReplacement | boolean | optional
If True , then sample with replacement, that is, allow for duplicate rows.
If False , then sample without replacement, that is, do not allow for duplicate rows.
By default, withReplacement=False .
2. fraction | float
A number between 0 and 1 , which represents the probability that a value will be included in the
sample. For instance, if fraction=0.5 , then each element will be included in the sample with a
probability of 0.5 .
WARNING
The sample size of the subset will be random since the sampling is performed using Bernoulli
sampling (when withReplacement=False ). This means that even setting fraction=0.5 may result in a sample
without any rows! On average, though, the supplied fraction value will reflect the proportion of rows
returned.
3. seed | int | optional
The seed for reproducibility. By default, no seed will be set, which means that the derived samples
will be random each time.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22], ["Doge", 22]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Doge| 22|
+-----+---+
To get a random sample in which the probability that an element is included in the sample is 0.5 :
df.sample(fraction=0.5).show ()
+----+---+
|name|age|
+----+---+
|Doge| 22|
+----+---+
Running the code once again may yield a sample of different size:
df.sample(fraction=0.5).show ()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|Cathy| 22|
+-----+---+
This is because the sampling is based on Bernoulli sampling as explained in the beginning.
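A plain-Python sketch of this Bernoulli sampling: each row is kept independently with probability fraction, so the sample size varies from run to run:

```python
import random

# Plain-Python sketch of sample(fraction=0.5) without replacement:
# each row is kept independently with probability `fraction`
# (Bernoulli sampling), so the sample size itself is random.
rows = ["Alex", "Bob", "Cathy", "Doge"]
rng = random.Random(0)

sample = [row for row in rows if rng.random() < 0.5]
print(sample)  # a subset of rows whose size depends on the random draws
```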
df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22], ["Doge", 22]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Doge| 22|
+-----+---+
To allow for duplicate rows in our sample, set withReplacement=True :
df.sample(fraction=0.5, withReplacement=True).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
| Bob| 24|
| Bob| 24|
|Cathy| 22|
+-----+---+
Notice how the sample size can exceed the original dataset size.
PySpark DataFrame | sampleBy method
PySpark DataFrame's sampleBy(~) method performs stratified sampling based on a column. Consult
examples below for clarification.
Parameters
1. col | Column or string
The column by which to perform sampling.
2. fractions | dict
The probability with which to include the value. Consult examples below for clarification.
3. seed | int | optional
Using the same value for seed produces the exact same results every time. By default, no seed will
be set, which means that the outcome will be different every time you run the method.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
vals = ['a','a','a','a','a','a','b','b','b','b']
df = spark.createDataFrame([[v] for v in vals], ["value"])
df.show(3)
+-----+
|value|
+-----+
| a|
| a|
| a|
+-----+
To sample rows based on the value column:
df.sampleBy('value', fractions={'a': 0.5, 'b': 0.25}).show()
+-----+
|value|
+-----+
| a|
| a|
| a|
| b|
| b|
+-----+
Here, rows with value 'a' will be included in our sample with a probability of 0.5 , while rows with
value 'b' will be included with a probability of 0.25 .
WARNING
The number of samples that will be included will be different each time. For instance,
specifying {'a':0.5} does not mean that half the rows with the value 'a' will be included - instead it
means that each row will be included with a probability of 0.5 . This means that there may be cases
when all rows with value 'a' will end up in the final sample.
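A plain-Python sketch of this stratified sampling: each row is kept with the probability assigned to its own value:

```python
import random

# Plain-Python sketch of sampleBy('value', fractions={'a': 0.5, 'b': 0.25}):
# each row is kept with the probability assigned to its own value.
vals = ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
fractions = {'a': 0.5, 'b': 0.25}
rng = random.Random(1)

sample = [v for v in vals if rng.random() < fractions[v]]
print(sample)  # a mix of 'a's and 'b's; the exact size is random
```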
RELATED
PySpark DataFrame's sample(~) method returns a random subset of rows of the DataFrame.
PySpark DataFrame | select method
The select(~) method of PySpark DataFrame returns a new DataFrame with the specified columns.
Parameters
1. *cols | string , Column or list
The columns to include in the returned DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df.select("name").show ()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
df.select(df["name"]).show ()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, df["name"] is of type Column . Here, you can think of the role of select(~) as converting
a Column object into a PySpark DataFrame.
import pyspark.sql.functions as F
df.select(F.col ("name")).show ()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
df.select("name","age").show ()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
import pyspark.sql.functions as F
df.select(F.col("name"), F.col("age")).show ()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df.select("*").show ()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting columns given a list of column labels
cols = ["name", "age"]
df.select(*cols).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, the * operator is used to convert the list into positional arguments.
cols = [col for col in df.columns if col.startswith("na")]
df.select(*cols).show()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, we are using Python's list comprehension to get a list of column labels that begin with the
substring "na" :
cols
['name']
PySpark DataFrame | selectExpr method
schedule AUG 12, 2023
PySpark DataFrame's selectExpr(~) method returns a new DataFrame based on the specified SQL
expression.
Parameters
1. *expr | string
The SQL expressions to compute.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Selecting data using SQL expressions in PySpark DataFrame
To get a new DataFrame where the values for the name column is uppercased:
df.selectExpr("upper(name) AS upper_name", "age * 2").show()
+----------+---------+
|upper_name|(age * 2)|
+----------+---------+
| ALEX| 40|
| BOB| 60|
| CATHY| 80|
+----------+---------+
We should use selectExpr(~) rather than select(~) to extract columns while performing some simple
transformations on them - just as we have done here.
NOTE
There exists a similar method expr(~) in the pyspark.sql.functions library. expr(~) also takes in as
argument a SQL expression, but the difference is that the return type is a PySpark Column . The
following usage of selectExpr(~) and expr(~) are equivalent:
df.selectExpr("upper(name)").show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
| CATHY|
+-----------+
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 60]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 60|
+----+---+
We can use classic SQL clauses like AND and LIKE to formulate more complicated expressions:
df.selectExpr("age < 30 AND name LIKE 'A%' AS result").show()
+------+
|result|
+------+
| true|
| false|
+------+
Here, we are checking for rows where age is less than 30 and the name starts with the letter A .
The equivalent using expr(~) :
import pyspark.sql.functions as F
col = F.expr("age < 30 AND name LIKE 'A%' AS result")
df.select(col).show()
+------+
|result|
+------+
| true|
| false|
+------+
I personally prefer using selectExpr(~) because the syntax is cleaner and the meaning is intuitive for
those who are familiar with SQL.
Another application of selectExpr(~) is to check for the existence of values in a PySpark column.
Please check out the recipe here .
PySpark DataFrame | show method
PySpark DataFrame's show(~) method prints the rows of the DataFrame on the console.
Parameters
1. n | int | optional
The number of rows to show. By default, n=20 .
2. truncate | boolean or int | optional
If True , then strings that are longer than 20 characters will be truncated.
If int , then strings that are longer than truncate will be truncated.
If truncation occurs, then the left part of the string is preserved. By default, truncate=True .
3. vertical | boolean | optional
If True , then the rows are printed with one line for each column value. By default, vertical=False .
Return Value
None .
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 25]], ["name", "age"])
df.show()  # n=20 by default
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
df.show (n=2)
+----+---+
|name|age|
+----+---+
|Alex| 15|
| Bob| 20|
+----+---+
only showing top 2 rows
df.show(truncate=2)
+----+---+
|name|age|
+----+---+
| Al| 15|
| Bo| 20|
| Ca| 25|
+----+---+
df.show(truncate=False)
+-----+---+
|name |age|
+-----+---+
|Alex |15 |
|Bob |20 |
|Cathy|25 |
+-----+---+
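The truncation rule can be sketched in plain Python. The exact ellipsis behaviour here is an assumption modelled on the outputs above: very small widths get a hard cut, larger widths get a trailing "...":

```python
# Sketch of show(~)'s cell truncation: the left part of the string is
# kept. The exact ellipsis rule is an assumption: widths below 4 get a
# hard cut, larger widths get "..." appended after the cut.
def truncate_cell(s: str, truncate: int = 20) -> str:
    if len(s) <= truncate:
        return s
    return s[:truncate] if truncate < 4 else s[:truncate - 3] + "..."

print(truncate_cell("Cathy", truncate=2))  # 'Ca' - matches the output above
print(truncate_cell("a" * 30))             # 17 chars followed by '...'
```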
df.show(vertical=True)
-RECORD 0-----
name | Alex
age | 15
-RECORD 1-----
name | Bob
age | 20
-RECORD 2-----
name | Cathy
age | 25
PySpark DataFrame | sort method
PySpark DataFrame's sort(~) method returns a new DataFrame with the rows sorted based on the
specified columns.
Parameters
1. cols | string or list or Column
The columns by which to sort.
2. ascending | boolean or list of boolean | optional
Whether to sort in ascending order. By default, ascending=True .
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 30], ["Bob", 20], ["Cathy", 20]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 30|
| Bob| 20|
|Cathy| 20|
+-----+---+
df.sort("age").show () # ascending=True
+-----+---+
| name|age|
+-----+---+
|Cathy| 20|
| Bob| 20|
| Alex| 30|
+-----+---+
import pyspark.sql.functions as F
df.sort(F.col("age")).show ()
+-----+---+
| name|age|
+-----+---+
|Cathy| 20|
| Bob| 20|
| Alex| 30|
+-----+---+
df.sort("age", ascending=False).show ()
+-----+---+
| name|age|
+-----+---+
| Alex| 30|
| Bob| 20|
|Cathy| 20|
+-----+---+
To sort a PySpark DataFrame by the age column first, and then by the name column both in
ascending order:
df.sort(["age", "name"]).show ()
+-----+---+
| name|age|
+-----+---+
| Bob| 20|
|Cathy| 20|
| Alex| 30|
+-----+---+
Here, Bob and Cathy appear before Alex because their age ( 20 ) is smaller. Bob then comes
before Cathy because B comes before C .
We can also pass a list of booleans to specify the desired ordering of each column:
df.sort(["age", "name"], ascending=[True, False]).show()
+-----+---+
| name|age|
+-----+---+
|Cathy| 20|
| Bob| 20|
| Alex| 30|
+-----+---+
Here, we are first sorting by age in ascending order, and then by name in descending order.
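A plain-Python sketch of this mixed-order sort, using the classic trick of applying a stable sort from the last key to the first:

```python
# Plain-Python sketch of sorting by age ascending, then name descending,
# mirroring sort(["age", "name"], ascending=[True, False]).
rows = [("Alex", 30), ("Bob", 20), ("Cathy", 20)]

# Sort by the secondary key first (name, descending); a stable sort by
# the primary key (age, ascending) then preserves that order within ties.
rows.sort(key=lambda r: r[0], reverse=True)
rows.sort(key=lambda r: r[1])

print(rows)  # [('Cathy', 20), ('Bob', 20), ('Alex', 30)]
```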
PySpark DataFrame | summary method
PySpark DataFrame's summary(~) method returns a PySpark DataFrame containing basic summary
statistics of numeric columns.
Parameters
1. *statistics | string | optional
The statistics to compute, which can be:
count
mean
stddev
min
max
arbitrary approximate percentiles (e.g. "60%" )
By default, all the above as well as the 25%, 50%, and 75% percentiles are computed.
Return Value
PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22], ["Doge", 30]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Doge| 30|
+-----+---+
To get the summary statistics of the DataFrame:
df.summary().show()
+-------+----+-----------------+
|summary|name| age|
+-------+----+-----------------+
| count| 4| 4|
| mean|null| 24.0|
| stddev|null|4.320493798938574|
| min|Alex| 20|
| 25%|null| 20|
| 50%|null| 22|
| 75%|null| 24|
| max|Doge| 30|
+-------+----+-----------------+
To compute only certain statistics, pass them in as arguments:
df.summary("max", "min").show()
+-------+----+---+
|summary|name|age|
+-------+----+---+
| max|Doge| 30|
| min|Alex| 20|
+-------+----+---+
Arbitrary percentiles can also be computed:
df.summary("60%").show()
+-------+----+---+
|summary|name|age|
+-------+----+---+
| 60%|null| 24|
+-------+----+---+
To summarise certain columns instead, use the select(~) method first to select the columns that you
want to summarize:
df.select("age").summary("max", "min").show()
+-------+---+
|summary|age|
+-------+---+
| max| 30|
| min| 20|
+-------+---+
RELATED
PySpark DataFrame's describe(~) method returns a new PySpark DataFrame holding summary statistics
of the specified columns.
PySpark DataFrame's take(~) method returns the first num number of rows as a list of Row objects.
Parameters
1. num | integer
The number of rows to return.
Return Value
A list of Row objects.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
|Cathy| 40|
+-----+---+
To get the first n rows of a PySpark DataFrame as a list of Row objects:
df.take(2)
[Row(name='Alex', age=25), Row(name='Bob', age=30)]
The difference between take(~) and head(~) is that take(~) always returns a list of Row objects,
whereas head(~) returns a single Row object when called with no argument (head()).
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
df.take(1)
[Row(name='Alex', age=20)]
df.head(1)
[Row(name='Alex', age=20)]
When n is explicitly specified, the methods take(~) and head(~) yield the same output.
PySpark DataFrame's toDF(~) method returns a new DataFrame with the columns relabeled using the
names you specify, assigned positionally.
WARNING
This method only relabels the columns - the data itself is not reordered, and you must supply
exactly as many labels as there are columns.
Parameters
1. *cols | str
The new column labels.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
To relabel the columns as age and name:
df.toDF("age", "name").show()
+----+----+
| age|name|
+----+----+
|Alex| 20|
| Bob| 30|
+----+----+
Note that if the number of labels does not match the number of columns, then an
error will be thrown:
df.toDF("age").show()
Here, an IllegalArgumentException is raised because two labels are required but only one was supplied.
PySpark DataFrame's toJSON(~) method converts the DataFrame into a string-typed RDD. When
the RDD data is extracted, each row of the DataFrame will be converted into a JSON string.
Consult the examples below for clarification.
Parameters
1. use_unicode | boolean | optional
Whether to use unicode during the conversion. By default, use_unicode=True.
Return Value
A MapPartitionsRDD object.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["André", 20], ["Bob", 30], ["Cathy", 30]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
|André| 20|
| Bob| 30|
|Cathy| 30|
+-----+---+
df.toJSON().first()
'{"name":"André","age":20}'
To parse a JSON string into a dictionary, use the json module:
import json
json.loads(df.toJSON().first())
{'name': 'André', 'age': 20}
df.toJSON().collect()
['{"name":"André","age":20}',
'{"name":"Bob","age":30}',
'{"name":"Cathy","age":30}']
df.toJSON().map(lambda json_str: json.loads(json_str)).collect()
[{'name': 'André', 'age': 20}, {'name': 'Bob', 'age': 30}, {'name': 'Cathy', 'age': 30}]
Here, we are using the RDD.map(~) method to apply a custom function (json.loads) on each element of the
RDD.
df.toJSON().first()  # use_unicode=True
'{"name":"André","age":20}'
df.toJSON(use_unicode=False).first()
b'{"name":"Andr\xc3\xa9","age":20}'
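Independently of Spark, the strings that toJSON() produces are ordinary JSON, and the use_unicode=False variant is simply their UTF-8 encoding. A pure-Python sketch of that relationship (the sample string mirrors the first row above):

```python
import json

s = '{"name":"André","age":20}'   # shape of df.toJSON().first()
b = s.encode("utf-8")             # shape of the use_unicode=False output

# The bytes are the UTF-8 encoding of the same JSON text.
assert b == b'{"name":"Andr\xc3\xa9","age":20}'

# Both forms parse to the same dictionary.
row = json.loads(s)
assert json.loads(b) == row
print(row)  # {'name': 'André', 'age': 20}
```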
PySpark DataFrame's toPandas(~) method converts a PySpark DataFrame into a Pandas DataFrame.
WARNING
All the data from the worker nodes are transferred to the Driver, and so make sure that
your Driver has sufficient memory.
Return Value
A Pandas DataFrame.
Examples
Consider the following DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
To convert our PySpark DataFrame into a Pandas DataFrame, use the toPandas() method:
df.toPandas()
    name  age
0   Alex   20
1    Bob   24
2  Cathy   22
PySpark DataFrame's transform(~) method applies the supplied function to the DataFrame and
returns the resulting new PySpark DataFrame.
Parameters
1. func | function
The function to apply. It must take in a PySpark DataFrame and return a PySpark DataFrame.
Return Value
PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To get a new PySpark DataFrame where the columns are sorted in ascending order:
def sort_columns(df_input):
    return df_input.select(*sorted(df_input.columns))

df.transform(sort_columns).show()
+---+----+
|age|name|
+---+----+
| 25|Alex|
| 30| Bob|
+---+----+
Here, the * converts the list of column labels into positional arguments of the select(~) method.
PySpark DataFrame's union(~) method concatenates two DataFrames vertically based on column
positions.
WARNING
The DataFrames will be vertically concatenated based on the column position rather than
the labels. See examples below for clarification.
Parameters
1. other | PySpark DataFrame
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Concatenating PySpark DataFrames vertically based on column position
df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df2 = spark.createDataFrame([["Alex", 25], ["Doge", 30], ["Eric", 50]], ["name", "age"])
df2.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
|Doge| 30|
|Eric| 50|
+----+---+
df1.union(df2).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Alex| 25|
| Doge| 30|
| Eric| 50|
+-----+---+
Next, consider two DataFrames whose column labels differ:
df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df2 = spark.createDataFrame([["Alex", 250], ["Doge", 200], ["Eric", 100]], ["name", "salary"])
df2.show()
+----+------+
|name|salary|
+----+------+
|Alex| 250|
|Doge| 200|
|Eric| 100|
+----+------+
df1.union(df2).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Alex|250|
| Doge|200|
| Eric|100|
+-----+---+
Notice how even though the two DataFrames had separate column labels, the method still
concatenated them. This is because the concatenation is based on the column positions and so the
labels play no role here. You should be wary of this behaviour because the union(~) method may
yield incorrect DataFrames like the one above without throwing an error!
PySpark DataFrame's unionByName(~) method concatenates two DataFrames vertically by aligning
their columns based on column labels.
Parameters
1. other | PySpark DataFrame
The other DataFrame to concatenate.
2. allowMissingColumns | boolean | optional
If True, then no error will be thrown if the column labels of the two DataFrames do not
align. In case of misalignment, null values will be filled in.
If False, then an error will be thrown if the column labels of the two DataFrames do not
align.
By default, allowMissingColumns=False.
Return Value
A new PySpark DataFrame .
Examples
Concatenating PySpark DataFrames vertically by aligning columns
df1 = spark.createDataFrame([[1, 2, 3]], ["A", "B", "C"])
df1.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
+---+---+---+
df2 = spark.createDataFrame([[4, 5, 6], [7, 8, 9]], ["A", "B", "C"])
df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
To concatenate these two DataFrames vertically by aligning the columns:
filter_none
df1.unionByName(df2).show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
By default, allowMissingColumns=False , which means that if the two DataFrames do not have exactly
matching column labels, then an error will be thrown.
Consider the following PySpark DataFrame:
df1 = spark.createDataFrame([[1, 2, 3]], ["A", "B", "C"])
df1.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
+---+---+---+
Here's the other PySpark DataFrame that has slightly different column labels:
df2 = spark.createDataFrame([[4, 5, 6], [7, 8, 9]], ["B", "C", "D"])
df2.show()
+---+---+---+
| B| C| D|
+---+---+---+
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
Since the column labels do not match, calling unionByName(~) will result in an error:
df1.unionByName(df2).show()  # allowMissingColumns=False
To allow for missing columns, set allowMissingColumns=True:
df1.unionByName(df2, allowMissingColumns=True).show()
+----+---+---+----+
| A| B| C| D|
+----+---+---+----+
| 1| 2| 3|null|
|null| 4| 5| 6|
|null| 7| 8| 9|
+----+---+---+----+
PySpark DataFrame's where(~) method returns the rows of the DataFrame that satisfy the given
condition.
NOTE
The where(~) method is an alias for the filter(~) method.
Parameters
1. condition | Column or string
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Basic usage
To get rows where age is greater than 25:
df.where("age > 25").show()
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can pass a Column object that represents a boolean mask:
df.where(df.age > 25).show()
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can use the col(~) function of sql.functions to refer to the column:
import pyspark.sql.functions as F
df.where(F.col("age") > 25).show()
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
Compound queries
The where(~) method supports the AND and OR statements like so:
df.where((F.col("age") > 25) & (F.col("name") == "Bob")).show()
+----+---+
|name|age|
+----+---+
| Bob| 30|
+----+---+
Dealing with null values
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], [None, None], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
| null|null|
|Cathy|null|
+-----+----+
df.where("age != 10").show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Notice how only Alex's row is returned even though the other two rows technically have age!=10.
This happens because PySpark's where(~) method filters out null values by default.
To prevent rows with null values from getting filtered out, we can perform the query like so:
df.where((F.col("age") != 10) | (F.col("age").isNull())).show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
| null|null|
|Cathy|null|
+-----+----+
Note that PySpark's treatment of null values is different compared to Pandas because Pandas will
retain rows with missing values, as demonstrated below:
import pandas as pd
df = pd.DataFrame({
    "col": ["a", "b", None]
})
df[df["col"] != "a"]
col
1 b
2 None
PySpark DataFrame's withColumn(~) method returns a new DataFrame with a new or updated column.
Parameters
1. colName | string
The label of the new column. If colName already exists, then the supplied col will update the existing
column. If colName does not exist, then col will be added as a new column.
2. col | Column
The Column object holding the values of the new or updated column.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", 50]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
|Cathy| 50|
+-----+---+
To update an existing column, supply its column label as the first argument:
filter_none
df.withColumn("age", 2 * df.age).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 50|
| Bob| 60|
|Cathy|100|
+-----+---+
Note that you must pass in a Column object as the second argument, and so you cannot simply use
a list as the new column values.
To add a new column of constant values, use the lit(~) method:
import pyspark.sql.functions as F
df.withColumn("AGEE", F.lit(0)).show()
+-----+---+----+
| name|age|AGEE|
+-----+---+----+
| Alex| 25| 0|
| Bob| 30| 0|
|Cathy| 50| 0|
+-----+---+----+
Here, F.lit(0) returns a Column object holding 0s. Note that since column labels are case-insensitive,
if you pass in "AGE" as the first argument, you would end up overwriting the age column.
PySpark DataFrame's withColumnRenamed(~) method is used to replace column labels. If the column
label that you want to replace does not exist, no error will be thrown.
Parameters
1. existing | string
The label of the column to rename.
2. new | string
The new column label.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To rename the age column to AGE:
df.withColumnRenamed("age", "AGE").show()
+----+---+
|name|AGE|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Note that no error will be thrown if the column label you want to replace does not exist:
df.withColumnRenamed("ageeee", "AGE").show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To replace multiple column labels at once, we can chain the withColumnRenamed(~) method like so:
df.withColumnRenamed("name", "NAME").withColumnRenamed("age", "AGE").show()
+----+---+
|NAME|AGE|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
PySpark DataFrame's columns property returns the column labels as a list.
Return Value
A standard list of strings.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To get the column labels, use the columns property:
df.columns
['name', 'age']
PySpark DataFrame's dtypes property returns the column labels and types as a list of tuples.
Return Value
List of tuples.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 25]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
To obtain the column labels and types, use the dtypes property:
df.dtypes
[('name', 'string'), ('age', 'bigint')]
PySpark DataFrame's rdd property returns the RDD representation of the DataFrame. Keep in
mind that PySpark DataFrames are internally represented as RDDs.
Return Value
RDD containing Row objects.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To convert our PySpark DataFrame into an RDD, use the rdd property:
rdd = df.rdd
rdd.collect()
[Row(name='Alex', age=25), Row(name='Bob', age=30)]
Here, we are using the collect() method to see the content of our RDD, which is a list
of Row objects.
PySpark's SQL Row asDict(~) method converts a Row object into a dictionary.
Parameters
1. recursive | boolean | optional
If True , then nested Row objects will be converted into dictionary as well.
By default, recursive=False .
Return Value
A dictionary.
Examples
Converting a PySpark Row object into a dictionary
from pyspark.sql import Row
row = Row(name='alex', age=25)
row
Row(name='alex', age=25)
row.asDict()
{'name': 'alex', 'age': 25}
By default, recursive=False , which means that nested rows will not be converted into dictionaries:
row = Row(name='Alex', age=25, friends=Row(name='Bob', age=30))
row.asDict()  # recursive=False
{'name': 'Alex', 'age': 25, 'friends': Row(name='Bob', age=30)}
To convert nested Row objects into dictionaries as well, set recursive=True like so:
row.asDict(True)
{'name': 'Alex', 'age': 25, 'friends': {'name': 'Bob', 'age': 30}}
PySpark Column's alias(~) method assigns a label to a Column object and returns the relabeled Column.
Parameters
1. *alias | string
The label(s) to assign.
2. metadata | dict | optional
A dictionary holding additional meta-information to store in the StructField of the returned Column.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["ALEX", 20], ["BOB", 30], ["CATHY", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 20|
| BOB| 30|
|CATHY| 40|
+-----+---+
Most methods in the PySpark SQL Functions library return Column objects whose label is governed by
the method that we use. For instance, consider the lower(~) method:
import pyspark.sql.functions as F
df.select(F.lower(df.name)).show()
+-----------+
|lower(name)|
+-----------+
| alex|
| bob|
| cathy|
+-----------+
Here, the PySpark Column returned by lower(~) has the label lower(name) by default.
To assign our own label instead, use the alias(~) method:
df.select(F.lower(df.name).alias("lower_name")).show()
+----------+
|lower_name|
+----------+
| alex|
| bob|
| cathy|
+----------+
Here, we have assigned the label "lower_name" to the column returned by lower(~) .
To store some meta-data in a PySpark Column, we can add the metadata option in alias(~):
df_new = df.select(F.lower(df.name).alias("lower_name", metadata={"some_data": 10}))
df_new.show()
+----------+
|lower_name|
+----------+
| alex|
| bob|
| cathy|
+----------+
To access the metadata, we can use the PySpark DataFrame's schema property:
df_new.schema["lower_name"].metadata["some_data"]
10
PySpark Column's cast(~) method returns a new Column of the specified type.
Parameters
1. dataType | Type or string
The type to convert the column into.
Return Value
A new Column object.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To convert the type of the DataFrame's age column from numeric to string :
df_new = df.withColumn("age", df["age"].cast("string"))
df_new.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can pass in a type object from pyspark.sql.types:
from pyspark.sql.types import StringType
df_new = df.withColumn("age", df["age"].cast(StringType()))
df_new.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To confirm that the column type has been converted to string, use the printSchema() method:
df_new.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
To convert the PySpark column type to date, use the to_date(~) method instead of cast(~) .
PySpark Column | contains method
schedule AUG 12, 2023
PySpark Column's contains(~) method returns a Column object of booleans where True corresponds
to column values that contain the specified substring.
Parameters
1. other | string or Column
Return Value
A Column object of booleans.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Getting rows that contain a substring in PySpark DataFrame
To get rows where name contains the substring "le":
import pyspark.sql.functions as F
df.filter(F.col("name").contains("le")).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, F.col("name").contains("le") returns a Column object holding booleans where True corresponds to
strings that contain the substring "le" :
df.select(F.col("name").contains("le")).show()
+------------------+
|contains(name, le)|
+------------------+
| true|
| false|
| false|
+------------------+
In our solution, we use the filter(~) method to extract rows that correspond to True .
PySpark Column's dropFields(~) method returns a new PySpark Column object with the specified
nested fields removed.
Parameters
1. *fieldNames | string
Return Value
A PySpark Column.
Examples
Consider the following PySpark DataFrame with some nested Rows:
from pyspark.sql import Row
# The friend's field values below are illustrative - the original values were not preserved.
data = [
    Row(name="Cathy", age=40, friend=Row(name="Doge", age=30, height=180))
]
df = spark.createDataFrame(data)
df.show()
+-----+---+---------------+
| name|age|         friend|
+-----+---+---------------+
|Cathy| 40|{Doge, 30, 180}|
+-----+---+---------------+
df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- height: long (nullable = true)
To remove the age and height fields under friend , use the dropFields(~) method:
df_new = df.withColumn("friend", df["friend"].dropFields("age", "height"))
df_new.show()
+-----+---+------+
| name|age|friend|
+-----+---+------+
|Cathy| 40|{Doge}|
+-----+---+------+
Here, we are using the withColumn(~) method to update the friend column with the new column
returned by dropFields(~).
df_new.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
NOTE
Even if the nested field you wish to delete does not exist, no error will be thrown:
updated_col = df["friend"].dropFields("ZZZZZZZZZ")
df_new = df.withColumn("friend", updated_col)
df_new.show()
+-----+---+---------------+
| name|age|         friend|
+-----+---+---------------+
|Cathy| 40|{Doge, 30, 180}|
+-----+---+---------------+
Here, the nested field "ZZZZZZZZZ" obviously does not exist but no error was thrown.
PySpark Column's endswith(~) method returns a column of booleans where True is given to strings
that end with the specified substring.
Parameters
1. other | string or Column
Return Value
A Column object holding booleans.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To get rows where name ends with the substring "x":
import pyspark.sql.functions as F
df.filter(F.col("name").endswith("x")).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, F.col("name").endswith("x") returns a Column object holding booleans:
df.select(F.col("name").endswith("x")).show()
+-----------------+
|endswith(name, x)|
+-----------------+
| true|
| false|
| false|
+-----------------+
We then use the PySpark DataFrame's filter(~) method to fetch rows that correspond to True .
PySpark Column's getItem(~) method extracts a value from the lists or dictionaries in a PySpark
Column.
Parameters
1. key | any
for lists, key should be an integer index indicating the position of the value that you wish
to extract.
for dictionaries, key should be the key of the values you wish to extract.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[[5, 6]], [[7, 8]]], ["vals"])
df.show()
+------+
| vals|
+------+
|[5, 6]|
|[7, 8]|
+------+
To extract the second value from each list in the vals column:
df_result = df.select(df.vals.getItem(1).alias("2nd val"))
df_result.show()
+-------+
|2nd val|
+-------+
| 6|
| 8|
+-------+
Equivalently, we can use the [~] syntax instead of getItem(~):
df_result = df.select(df.vals[1].alias("2nd val"))
df_result.show()
+-------+
|2nd val|
+-------+
| 6|
| 8|
+-------+
Specifying an index position that is out of bounds for the list will return a null value:
df_result = df.select(df.vals.getItem(2).alias("2nd val"))
df_result.show()
+-------+
|2nd val|
+-------+
| null|
| null|
+-------+
Extracting values from maps in PySpark Column
Consider the following PySpark DataFrame containing maps:
df = spark.createDataFrame([[{"A": 4}], [{"A": 5, "B": 6}]], ["vals"])
df.show()
+----------------+
|            vals|
+----------------+
|        {A -> 4}|
|{A -> 5, B -> 6}|
+----------------+
To extract the values that have the key 'A':
df_result = df.select(df.vals.getItem("A"))
df_result.show()
+-------+
|vals[A]|
+-------+
| 4|
| 5|
+-------+
Note that referring to keys that do not exist will return null :
df_result = df.select(df.vals.getItem("C"))
df_result.show()
+-------+
|vals[C]|
+-------+
| null|
| null|
+-------+
RELATED
PySpark SQL Functions' element_at(~) method is used to extract values from lists or maps in a PySpark
Column.
PySpark SQL Functions' element_at(~) method is used to extract values from lists or maps in a
PySpark Column.
Parameters
1. col | string or Column
2. extraction | int
The position of the value that you wish to extract. Negative positioning is supported -
extraction=-1 will extract the last element from each list.
WARNING
The position is not zero-based, unlike ordinary Python indexing. This means that extraction=1 will
extract the first value in the lists or maps.
Return Value
A new PySpark Column.
Examples
Extracting n-th value from arrays in PySpark Column
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[[5, 6]], [[7, 8]]], ["vals"])
df.show()
+------+
| vals|
+------+
|[5, 6]|
|[7, 8]|
+------+
To extract the second value from each list in vals, we can use element_at(~) like so:
import pyspark.sql.functions as F
df_res = df.select(F.element_at("vals", 2).alias("2nd value"))
df_res.show()
+---------+
|2nd value|
+---------+
| 6|
| 8|
+---------+
Here, we are using the alias(~) method to assign a label to the column returned by element_at(~).
Note that extracting values that are out of bounds will return null :
df_res = df.select(F.element_at("vals", 3))
df_res.show()
+-------------------+
|element_at(vals, 3)|
+-------------------+
| null|
| null|
+-------------------+
We can also extract the last element by supplying a negative value for extraction :
df_res = df.select(F.element_at("vals", -1).alias("last value"))
df_res.show()
+----------+
|last value|
+----------+
| 6|
| 8|
+----------+
Extracting values from maps in PySpark Column
Consider the following PySpark DataFrame containing maps:
df = spark.createDataFrame([[{"A": 4}], [{"A": 5, "B": 6}]], ["vals"])
df.show()
+----------------+
|            vals|
+----------------+
|        {A -> 4}|
|{A -> 5, B -> 6}|
+----------------+
To extract the values that has the key 'A' in the vals column:
df_res = df.select(F.element_at(df["vals"], "A"))
df_res.show()
+-------------------+
|element_at(vals, A)|
+-------------------+
| 4|
| 5|
+-------------------+
Note that extracting values using keys that do not exist will return null :
df_res = df.select(F.element_at(df["vals"], "B"))
df_res.show()
+-------------------+
|element_at(vals, B)|
+-------------------+
| null|
| 6|
+-------------------+
Here, the key 'B' does not exist in the map {'A':4} so a null was returned for that row.
RELATED
PySpark Column's getItem(~) method extracts a value from the lists or dictionaries in a PySpark
Column.
PySpark Column's isNotNull() method identifies rows where the value is not null.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 25|
| Bob| 30|
|Cathy|null|
+-----+----+
To get a boolean Column indicating which rows have a non-null age:
df.select(df.age.isNotNull()).show()
+-----------------+
|(age IS NOT NULL)|
+-----------------+
|             true|
|             true|
|            false|
+-----------------+
To fetch the rows where age is not null:
df.filter(df.age.isNotNull()).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, the filter(~) extracts rows that correspond to True in the boolean column returned
by isNotNull() method.
PySpark Column | isin method
schedule AUG 12, 2023
PySpark Column's isin(~) method returns a Column object of booleans where True corresponds to
column values that are included in the specified list of values.
Parameters
1. *cols | any type
Return Value
A Column object of booleans.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Getting rows where values are contained in a list of values in PySpark DataFrame
To get rows where the value of the name column is either "Cathy" or "Alex":
df.filter(df.name.isin("Cathy", "Alex")).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, df.name.isin("Cathy", "Alex") returns a Column object holding booleans:
df.select(df.name.isin("Cathy", "Alex")).show()
+-----------------------+
|(name IN (Cathy, Alex))|
+-----------------------+
|                   true|
|                  false|
+-----------------------+
Note that if you have a list of values instead, use the * operator to convert the list into positional
arguments:
df.filter(df.name.isin(*["Cathy", "Alex"])).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
PySpark Column's isNull() method identifies rows where the value is null.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 25|
| Bob| 30|
|Cathy|null|
+-----+----+
To get a boolean Column indicating which rows have a null age:
df.select(df.age.isNull()).show()
+-------------+
|(age IS NULL)|
+-------------+
|        false|
|        false|
|         true|
+-------------+
df.where(df.age.isNull()).show()
+-----+----+
| name| age|
+-----+----+
|Cathy|null|
+-----+----+
Here, the where(~) method fetches rows that correspond to True in the boolean column returned by
the isNull() method.
One common mistake is to use equality to compare null values. For example, consider the
following DataFrame:
df = spark.createDataFrame([["Alex", 25.0], ["Bob", 30.0], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex|25.0|
| Bob|30.0|
|Cathy|null|
+-----+----+
df.where(df.age == None).show()
+----+---+
|name|age|
+----+---+
+----+---+
Notice how Cathy's row where the age is null is not picked up. When comparing null values, we
should always use isNull() instead.
Consider the following PySpark DataFrame that contains both NaN and null values:
import numpy as np
df = spark.createDataFrame([["Alex", 25.0], ["Bob", np.nan], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex|25.0|
| Bob| NaN|
|Cathy|null|
+-----+----+
Here, the age column contains both NaN and null . In PySpark, NaN and null are treated as different
entities as demonstrated below:
df.where(F.col("age").isNull()).show()
+-----+----+
| name| age|
+-----+----+
|Cathy|null|
+-----+----+
Here, notice how Bob's row whose age is NaN is not picked up. To get rows with NaN, use
the isnan(~) method like so:
df.where(F.isnan("age")).show()
+----+---+
|name|age|
+----+---+
| Bob|NaN|
+----+---+
PySpark Column's otherwise(~) method is used after a when(~) method to implement an if-else logic.
Click here for our documentation on when(~) method.
Parameters
1. value
The value to assign if the conditions set by when(~) are not satisfied.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Basic usage
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
To replace the name Alex with Doge, and others with Eric:
import pyspark.sql.functions as F
+-----------------------------------------------+
+-----------------------------------------------+
| Doge|
| Eric|
| Eric|
+-----------------------------------------------+
Note that we can replace our existing column with the new column like so:
df.show()
+----+---+
|name|age|
+----+---+
|Doge| 20|
|Eric| 24|
|Eric| 22|
+----+---+
PySpark Column's rlike(~) method returns a Column of booleans where True corresponds to string
column values that match the specified regular expression.
Return Value
A Column object of booleans.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Getting rows where values match some regular expression in PySpark DataFrame
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, the regular expression "^A" matches strings that begin with "A".
Also, F.col("name").rlike("^A") returns a Column object of booleans:
+---------------+
|RLIKE(name, ^A)|
+---------------+
| true|
| false|
+---------------+
In our solution, we use the filter(~) method to fetch only the rows that correspond to True.
Published by Isshin Inada
PySpark Column | startswith method
AUG 12, 2023
PySpark Column's startswith(~) method returns a column of booleans where True is given to strings
that begin with the specified substring.
Parameters
1. other | string or Column
The substring (or Column) that values should begin with.
Return Value
A Column object holding booleans.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
+-------------------+
|startswith(name, A)|
+-------------------+
| true|
| false|
| false|
+-------------------+
We then use the PySpark DataFrame's filter(~) method to fetch rows that correspond to True.
PySpark Column's substr(~) method returns a Column of substrings extracted from string column
values.
Parameters
1. startPos | int or Column
The starting position. This position is inclusive and 1-based, meaning the first character is at
position 1. A negative position is allowed here as well - please consult the example below for
clarification.
Return Value
A Column object.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
+----------+
|short_name|
+----------+
| lex|
| ob|
| ath|
+----------+
Here, F.col("name").substr(2,3) means that we are extracting a substring starting from the 2nd
character and up to a length of 3. Even if the string is too short (e.g. "Bob"), no error will be
thrown.
Note that you could also specify a negative starting position like so:
+----------+
|short_name|
+----------+
| le|
| Bo|
| th|
+----------+
Here, we are starting from the third character from the end (inclusive).
PySpark Column's withField(~) method is used to either add or update a nested field value.
Parameters
1. fieldName | string
The name of the nested field to add or update.
2. col | Column
The new value of the nested field.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame with nested rows:
df = spark.createDataFrame(data)
df.show()
+-----+---+----------+
| name|age| friend|
+-----+---+----------+
+-----+---+----------+
Here, the friend column contains nested Row , which can be confirmed by printing out the schema:
df.printSchema()
root
import pyspark.sql.functions as F
+-----+---+---------+
| name|age| friend|
+-----+---+---------+
+-----+---+---------+
Note the following:
we are updating the name field of the friend column with the constant string "BOB".
F.lit("BOB") returns a Column object whose values are filled with the string "BOB".
the withColumn(~) method replaces the friend column of our DataFrame with the updated
column returned by withField(~).
Updating nested rows using original values in PySpark
To update nested rows using original values, use the withField(~) method like so:
+-----+---+----------+
| name|age| friend|
+-----+---+----------+
+-----+---+----------+
Here, we are uppercasing the name field of the friend column using F.upper("friend.name") , which
returns a Column object.
The withField(~) method can also be used to add new field values in nested rows:
df_new.show()
+-----+---+----------------+
| name|age| friend|
+-----+---+----------------+
df_new.printSchema()
root
We can see the new nested field upper_name has been added!
PySpark RDD's coalesce(~) method returns a new RDD with the number of partitions reduced.
Parameters
1. numPartitions | int
The number of partitions to reduce to.
2. shuffle | boolean | optional
Whether or not to shuffle the data such that they end up in different partitions. By
default, shuffle=False.
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Consider the following RDD with 3 partitions:
rdd.glom().collect()
new_rdd = rdd.coalesce(numPartitions=2)
new_rdd.glom().collect()
We can see that the 2nd partition merged with the 3rd partition.
Instead of merging partitions to reduce the number of partitions, we can also shuffle the data:
new_rdd = rdd.coalesce(numPartitions=2, shuffle=True)
new_rdd.glom().collect()
As you can see, this results in a partitioning that is more balanced. The downside to shuffling,
however, is that this is a costly process when your data size is large since data must be transferred
from one worker node to another.
PySpark RDD's collect(~) method returns a list containing all the items in the RDD.
Parameters
This method does not take in any parameters.
Return Value
A Python standard list.
Examples
Converting a PySpark RDD into a list of values
rdd
rdd.getNumPartitions()
Depending on your configuration, these 8 partitions can reside across multiple machines (worker
nodes). The collect(~) method sends all the data of the RDD to the driver node and packs them into a
single list:
rdd.collect()
[4, 2, 5, 7]
WARNING
All the data from the worker nodes will be sent to the driver node, so make sure that you have
enough memory for the driver node - otherwise you'll end up with an OutOfMemory error!
PySpark RDD | collectAsMap method
AUG 12, 2023
PySpark RDD's collectAsMap(~) method collects all the elements of a pair RDD in the driver
node link and converts the RDD into a dictionary.
Return Value
A dictionary.
Examples
Consider the following PySpark pair RDD:
rdd.collect()
To convert a pair RDD into a dictionary in PySpark, use the collectAsMap() method:
rdd.collectAsMap()
WARNING
Since all the underlying data in the RDD is sent to the driver node, you may encounter
an OutOfMemoryError if the data is too large.
In case of duplicate keys
When we have duplicate keys, the latter key-value pair will overwrite the former ones:
rdd.collectAsMap()
{'a': 6, 'b': 2}
PySpark RDD's count(~) method returns the number of values in the RDD as an integer.
Parameters
This method does not take in any parameters.
Return Value
An integer (int).
Examples
Consider the following PySpark RDD:
rdd.collect()
To get the number of elements in the RDD, use the count() method:
rdd.count()
PySpark RDD's countByKey(~) method groups by the key of the elements in a pair RDD, and counts
each group.
Parameters
This method does not take in any parameters.
Return Value
A DefaultDict[key, int].
Examples
Consider the following PySpark pair RDD:
rdd.collect()
Here, the returned value is DefaultDict , which is basically a dictionary in which accessing values
that do not exist in the dictionary will return a 0 instead of throwing an error.
You can access the count of a key just as you would for an ordinary dictionary:
counts = rdd.countByKey()
counts["a"]
counts = rdd.countByKey()
counts["z"]
PySpark RDD's filter(~) method extracts a subset of the data based on the given function.
Parameters
1. f | function
A function that takes in as input an item of the RDD's data and returns a boolean where
True means the item is kept, and False means the item is dropped.
Examples
Consider the following RDD:
rdd
To obtain a new RDD where the values are all strictly larger than 3:
new_rdd.collect()
[4, 5, 7]
Here, the collect() method is used to retrieve the content of the RDD as a single list.
PySpark RDD's first(~) method returns the first element of the RDD.
Parameters
This method does not take in any parameters.
Return Value
The type will be that of the first element of the RDD.
Examples
We create a RDD using the parallelize(~) method:
rdd
To fetch the first element in the RDD, use the first() method:
rdd.first()
Return Value
An int.
Examples
Getting the number of partitions of RDD
rdd.getNumPartitions()
3
PySpark RDD | glom method
AUG 12, 2023
PySpark RDD's glom() method returns a RDD holding the content of each partition.
Parameters
This method does not take in any parameters.
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
Examples
Consider the following RDD:
rdd.collect()
rdd.glom().collect()
PySpark RDD's keys(~) method returns the keys of a pair RDD that contains tuples of length two.
Parameters
This method does not take in any parameters.
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
Examples
Consider the following PySpark pair RDD:
rdd.collect()
rdd.keys().collect()
Note that if the RDD is not a pair RDD, then the values are returned:
PySpark RDD's map(~) method applies a function on each element of the RDD.
Parameters
1. f | function
The function to apply to each element of the RDD.
2. preservesPartitioning | boolean | optional
Whether or not to let Spark assume that the partitioning is still valid. This is only relevant to pair RDDs.
Consult the examples below for clarification. By default, preservesPartitioning=False.
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
Examples
Applying a function to each element of RDD
new_rdd.collect()
The preservesPartitioning parameter only comes into play when the RDD contains a list of tuples
(pair RDD).
When a RDD is re-partitioned via partitionBy(~) (using a hash partitioner), we guarantee that the
tuples with the same key end up in the same partition:
new_rdd.glom().collect()
[[('C', 1)], [('A', 1), ('B', 1), ('A', 1), ('D', 1)]]
Indeed, we see that the tuple ('A',1) and ('A',1) lie in the same partition.
Let us now perform a map(~) operation with preservesPartitioning set to False (default):
mapped_rdd.glom().collect()
[[('C', 4)], [('A', 4), ('B', 4), ('A', 4), ('D', 4)]]
Here, we are applying a map(~) that returns a tuple with the same key, but with a different value.
We can see that the partitioning has not changed. Behind the scenes, however, Spark internally
keeps a flag that indicates whether or not the partitioning has been destroyed, and this flag has now
been set to True (i.e. partitioning has been destroyed) because preservesPartitioning is False by
default. This is naive of Spark, since the tuple keys have not been changed, and so the
partitioning should still be valid.
We can confirm that Spark is now naively unaware that the data is partitioned by the tuple key by
performing a shuffling operation like reduceByKey(~) :
print(mapped_rdd_reduced.toDebugString().decode("utf-8"))
You can see that a shuffling has indeed occurred. However, this is completely unnecessary
because we know that the tuples with the same key reside in the same partition (machine), and so
this operation can be done locally.
print(mapped_rdd_preserved_reduced.toDebugString().decode("utf-8"))
We can see that no shuffling has occurred. This is because we tell Spark that we have only
changed the value of the tuple, and not the key, and so Spark should assume that the original
partitioning is kept intact.
PySpark RDD's partitionBy(~) method re-partitions a pair RDD into the desired number of
partitions.
Parameters
1. numPartitions | int
The number of partitions to use.
2. partitionFunc | function | optional
The partitioning function - the input is the key and the return value must be the hashed value. By
default, a hash partitioner will be used.
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Repartitioning a pair RDD
rdd.collect()
rdd.glom().collect()
new_rdd.glom().collect()
Notice how the tuples with the key A have ended up in the same partition. This is guaranteed to
happen because the hash partitioner performs bucketing based on the tuple key.
PySpark RDD's reduceByKey(~) method aggregates the RDD data by key, and performs a reduction
operation. A reduction operation is simply one where multiple values become reduced to a single
value (e.g. summation, multiplication).
Parameters
1. func | function
The reduction function to apply.
2. numPartitions | int | optional
By default, the number of partitions will be equal to the number of partitions of the parent RDD.
If the parent RDD does not have the partition count set, then the parallelism level in the PySpark
configuration will be used.
3. partitionFunc | function | optional
The partitioner to use - the input is a key and the return value must be the hashed value. By default, a
hash partitioner will be used.
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
Examples
Consider the following Pair RDD:
rdd.collect()
To group by key and perform a summation of the values of each grouped key:
rdd.reduceByKey(lambda a, b: a+b).collect()
By default, the number of partitions of the resulting RDD will be equal to the number of
partitions of the parent RDD:
new_rdd.getNumPartitions()
We can set the number of partitions of the resulting RDD by setting the numPartitions parameter:
new_rdd.getNumPartitions()
local_offer
PySpark
mode_heat
Master the mathematics behind data science with 100+ top-tier guides
Start your free 7-days trial now!
PySpark RDD's repartition(~) method splits the RDD into the specified number of partitions.
NOTE
When we first create RDDs, they will already be partitioned under the hood, which means that all
RDDs are already partitioned. This method is called repartition(~) (emphasis on the re) because we
are changing the existing partitioning.
Parameters
1. numPartitions | int
The number of partitions to split the RDD into.
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Re-partitioning a RDD with certain number of partitions
rdd.collect()
Here, we are using the parallelize(~) method to create a RDD with 3 partitions.
We can use the glom() method to see the actual content of the partitions:
rdd.glom().collect()
new_rdd = rdd.repartition(2)
new_rdd.glom().collect()
Note that the same values do not necessarily end up in the same partition ('A' can be found in both
partitions), and the number of elements in each partition may not be balanced - here we have 4
elements in the first partition, but only 2 elements in the second partition.
WARNING
The repartition(~) method involves shuffling, even when reducing the number of partitions. To
avoid shuffling when reducing the number of partitions, use RDD's coalesce(~) method instead.
PySpark RDD | zip method
AUG 12, 2023
PySpark RDD's zip(~) method combines the elements of two RDDs into a single RDD of tuples.
Parameters
1. other | RDD
The other RDD to combine with.
Return Value
A new PySpark RDD.
Examples
Combining two PySpark RDDs into a single RDD of tuples
Here, we are using the parallelize(~) method to create two RDDs, each having 3 partitions.
We can see the actual values in each partition using the glom(~) method:
x.glom().collect()
We see that RDD x indeed has 3 partitions, and we have 2 elements in each partition. The same
can be said for RDD y :
y.glom().collect()
[[10, 11], [12, 13], [14, 15]]
We can combine the two RDDs x and y into a single RDD of tuples using the zip(~) method:
zipped_rdd = x.zip(y)
zipped_rdd.collect()
[(0, 10), (1, 11), (2, 12), (3, 13), (4, 14), (5, 15)]
WARNING
In order to use the zip(~) method, the two RDDs must have the exact same number of partitions as
well as the exact same number of elements in each partition.
PySpark RDD's zipWithIndex(~) method returns a RDD of tuples where the first element of the tuple
is the value and the second element is the index. The first value of the first partition will be given
an index of 0.
Parameters
This method does not take in any parameters.
Return Value
A new PySpark RDD.
Examples
Consider the following PySpark RDD with 2 partitions:
rdd = sc.parallelize(['A','B','C'], 2)
rdd.collect()
We can see the content of each partition using the glom() method:
rdd.glom().collect()
We see that we indeed have 2 partitions with the first partition containing the value 'A' , and the
second containing the values 'B' and 'C' .
We can create a new RDD of tuples containing positional index information using zipWithIndex(~) :
new_rdd = rdd.zipWithIndex()
new_rdd.collect()
We see that the index position is assigned based on the partitioning position - the first element of
the first partition will be assigned the 0th index.
PySpark SparkContext's parallelize(~) method creates a RDD (resilient distributed dataset) from the
given dataset.
Parameters
1. c | any
The data you want to convert into a RDD. Typically, you would pass a list of values.
2. numSlices | int | optional
The number of partitions to use. By default, the parallelism level set in the Spark configuration
will be used for the number of partitions.
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Creating a RDD with a list of values
rdd = sc.parallelize(["A","B","C","A"])
rdd.collect()
rdd.getNumPartitions()
rdd.collect()
Here, Spark is partitioning our list into 3 sub-datasets. We can see the content of each partition
using the glom() method:
rdd.glom().collect()
rdd.collect()
Note that parallelize will not perform partitioning based on the key, as shown here:
rdd.glom().collect()
We can see that just like the previous case, the partitioning is done using the ordering of the list.
NOTE
The returned object is a PySpark RDD (pyspark.rdd.RDD).
What makes pair RDDs special is that, we can perform additional methods such as reduceByKey(~) ,
which performs a groupby on the key and perform a custom reduction function:
new_rdd.collect()
import pandas as pd
df_pandas = pd.DataFrame({"A":[3,4],"B":[5,6]})
df_pandas
A B
0 3 5
1 4 6
rdd = df_spark.rdd
rdd.collect()
Notice how only the values of the DataFrame are kept - column labels are not included in the
RDD.
WARNING
Even though parallelize(~) can accept a Pandas DataFrame directly, this does not give us the desired
RDD:
import pandas as pd
rdd = sc.parallelize(df_pandas)
rdd.collect()
['A', 'B']
As you can see, the rdd only contains the column labels but not the data itself.
PySpark's createDataFrame(~) method creates a new DataFrame from the given list, Pandas
DataFrame or RDD.
Parameters
1. data | list-like or Pandas DataFrame or RDD
The data from which to create the PySpark DataFrame.
2. schema | DataType, string or list | optional
The column names and column types of the DataFrame.
3. samplingRatio | float | optional
If the data type is not provided via schema, then samplingRatio indicates the proportion of rows to
sample from when making inferences about the column type. By default, only the first row will be
used for type inference.
4. verifySchema | boolean | optional
Whether or not to check the data against the given schema. If the data type does not align, then an
error will be thrown. By default, verifySchema=True.
Return Value
A PySpark DataFrame.
Examples
Creating a PySpark DataFrame from a list of lists
df = spark.createDataFrame(rows)
df.show()
+----+---+
| _1| _2|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To create a PySpark DataFrame from a list of lists with the column names specified:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
vals = [3,4,5]
spark.createDataFrame(vals, IntegerType()).show()
+-----+
|value|
+-----+
| 3|
| 4|
| 5|
+-----+
Here, the IntegerType() indicates that the column is of type integer - this is needed in this case,
otherwise PySpark will throw an error.
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df = spark.createDataFrame(data)
df.show()
+---+----+
|age|name|
+---+----+
| 20|Alex|
| 30| Bob|
+---+----+
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
import pandas as pd
df = pd.DataFrame({"A":[3,4],"B":[5,6]})
df
A B
0 3 5
1 4 6
pyspark_df = spark.createDataFrame(df)
pyspark_df.show()
+---+---+
| A| B|
+---+---+
| 3| 5|
| 4| 6|
+---+---+
To create PySpark DataFrame while specifying the column names and types:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
StructField("name", StringType()),
StructField("age", IntegerType())])
df = spark.createDataFrame(rows, schema)
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To create a PySpark DataFrame with date columns, use the datetime library:
import datetime
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
Specifying verifySchema
By default, verifySchema=True , which means that an error is thrown if there is a mismatch in the
type indicated by the schema and the type inferred from data :
schema = StructType([
StructField("name", IntegerType()),
StructField("age", IntegerType())])
df.show()
org.apache.spark.api.python.PythonException:
'TypeError: field name: IntegerType can not accept object 'Alex' in type <class 'str'>'
Here, an error is thrown because the inferred type of column name is string , but we have specified
the column type to be integer in our schema .
By setting verifySchema=False , PySpark will fill the column with nulls instead of throwing an error:
schema = StructType([
StructField("name", IntegerType()),
StructField("age", IntegerType())])
df.show()
+----+---+
|name|age|
+----+---+
|null| 25|
|null| 30|
+----+---+
PySpark SparkSession's range(~) method creates a new PySpark DataFrame using a series of values
- this method is similar to Python's standard range(~) method.
Parameters
1. start | int
Return Value
A PySpark DataFrame.
Examples
Creating a PySpark DataFrame using range (series of values)
To create a PySpark DataFrame that holds a series of values, use the range(~) method:
df = spark.range(1,4)
df.show()
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
Notice how the starting value is included while the ending value is not.
Note that if only one argument is supplied, then the range will start from 0 (inclusive) and the
argument will represent the end-value (exclusive):
df = spark.range(3)
df.show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+
Instead of the default incremental value of step=1 , we can choose a specific incremental value
using the third argument:
df = spark.range(1,6,2)
df.show()
+---+
| id|
+---+
| 1|
| 3|
| 5|
+---+
We can also use a negative step to create a decreasing series:
df = spark.range(4,1,-1)
df.show()
+---+
| id|
+---+
| 4|
| 3|
| 2|
+---+
By default, the number of partitions in which the resulting PySpark DataFrame will be split is
governed by our PySpark configuration. In my case, the default number of partitions is 8:
df = spark.range(1,4)
df.rdd.getNumPartitions()
df = spark.range(1,4, numPartitions=2)
df.rdd.getNumPartitions()